Oslo Multilingual Corpus - background and use

The Oslo Multilingual Corpus (OMC) is a collection of text corpora comprising original texts and translations from several languages. For example, the same sentence appears in Norwegian, English, French, and German as follows:

En akademisk løpebane, jo takk som byr!

An academic career? No thanks!

Une carrière universitaire? Allons donc.

Eine akademische Laufbahn, du meine Güte.

The sub-corpora differ in the number and combination of languages they contain.

The OMC provides unique research material for use in contrastive studies and translation studies, as well as in theoretical and applied linguistics.

Collaboration

The Oslo Multilingual Corpus is a product of the interdisciplinary research project Languages in Contrast (SPRIK), which is a collaboration between researchers at the Faculty of Humanities, University of Oslo. See further the Languages in Contrast homepage.

Sub-corpora

The OMC contains many sub-corpora that differ in the languages and the number of texts they include. Norwegian, English, French, and German are the main languages represented, but some sub-corpora also include Dutch and Portuguese texts. In addition, related parallel corpora for English-Swedish and English-Finnish, compiled in Sweden and Finland, are accessible from the same site.

The sub-corpus French-Norwegian Parallel Corpus (FNPC/fiction) was compiled at the University of Bergen (UiB), and completed and made ready for inclusion in the OMC at the University of Oslo (UiO). FNPC/non-fiction contains texts that were collected both at UiO and UiB. The French-Norwegian anchor word list and the rules for French word splitting were developed at the University of Bergen.

Many of the texts are found in more than one sub-corpus: because the sub-corpora differ in the languages they include, the same texts can be re-used across them. This is particularly the case for sub-corpora that include Norwegian, English, or German originals. An overview of the different sub-corpora is given here.

Access to the OMC

The OMC material is password-protected and can only be used for research purposes. The right to use it is first and foremost reserved for MA students, PhD students, and researchers at the University of Oslo and the University of Bergen. A list of publications connected to the OMC and the SPRIK project can be found here.

The OMC is an extension of the English-Norwegian Parallel Corpus (ENPC), which was compiled and completed at the Department of British and American Studies in 1996 (for more information, see the ENPC's homepage).

Technical Matters

The Oslo Multilingual Corpus was compiled according to the same principles as the English-Norwegian Parallel Corpus, and the coding and mark-up of the texts follow the same guidelines as the ENPC, i.e. TEI's Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen and Burnard, 1994). For further information on coding and mark-up of the texts in the OMC, refer to the ENPC manual (see link to the right).

As regards the structure of the OMC, see SPRIK report No. 1 (see link to the right).

How to cite the OMC

The Oslo Multilingual Corpus (1999-2008), the Faculty of Humanities, University of Oslo. The Oslo Multilingual Corpus is a product of the interdisciplinary research project Languages in Contrast (SPRIK), directed by Stig Johansson and Cathrine Fabricius-Hansen, and compiled by the OMC corpus team. http://www.hf.uio.no/ilos/english/services/omc/
 

Published July 6, 2010 10:39 AM - Last modified Dec. 8, 2014 11:20 AM