Oslo Multilingual Corpus

We are currently developing the Oslo Multilingual Corpus (OMC), which is an extension of the English-Norwegian Parallel Corpus (ENPC). The ENPC has the following structure:

  • The corpus consists of text excerpts of approximately 10,000 to 15,000 words from fictional and non-fictional Norwegian and English original texts and their translations, amounting to a total of 200 texts, or 2.6 million words. German, Dutch and Portuguese translations were added for some of the texts.
  • The texts are SGML-encoded and aligned at sentence level. For this purpose, we have developed a program for automatic alignment (Knut Hofland: The Translation Corpus Aligner). Tools for searching in the parallel texts have also been developed (Jarle Ebeling/Lars Wilhelmsen: PerlTCE and TaggedTCE (TCE = Translation Corpus Explorer)).
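To illustrate what sentence-level alignment makes possible, here is a minimal sketch of a search over aligned sentence pairs. It is not the Translation Corpus Aligner or the Translation Corpus Explorer, and it does not read the corpus's SGML encoding: the `AlignedPair` structure, the `search_pairs` function, and the example sentences are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One sentence-level alignment unit (hypothetical structure, not the ENPC format)."""
    source: str  # sentence in the original text
    target: str  # aligned sentence in the translation


def search_pairs(pairs, word):
    """Return the pairs whose source sentence contains `word` as a whole token."""
    needle = word.lower()
    hits = []
    for pair in pairs:
        # Naive tokenisation: split on whitespace and strip common punctuation.
        tokens = [t.strip(".,;:!?\"'").lower() for t in pair.source.split()]
        if needle in tokens:
            hits.append(pair)
    return hits


# Invented example sentences, for illustration only.
pairs = [
    AlignedPair("She opened the door.", "Hun åpnet døren."),
    AlignedPair("The house was empty.", "Huset var tomt."),
    AlignedPair("He closed the door quietly.", "Han lukket døren stille."),
]

# Show each matching original sentence next to its aligned translation.
for hit in search_pairs(pairs, "door"):
    print(hit.source, "|", hit.target)
```

Because each original sentence is stored together with its translation, a hit in one language immediately yields the corresponding sentence in the other, which is the basic idea behind searching in parallel texts.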

A bi-directional corpus of this type can be used for studies of different kinds: a cross-linguistic comparison of original texts, a cross-linguistic comparison of original and translated texts, a comparison of original and translated texts in the same language, and a cross-linguistic comparison of translated texts.

The corpus is now being extended on the German side in particular, to ensure equal representation of texts in English, German, and Norwegian, to the extent that this is possible. Recently, the project has been extended to French. Eventually, the corpus will contain original texts in four languages (English, German, French, Norwegian) and their translations into as many of the other three languages as possible. Currently (November 2005), the English-German-Norwegian part of the corpus consists of 32 English, 37 German, and 27 Norwegian original texts with translations into the other two languages, whereas the French-Norwegian part comprises excerpts from 10 Norwegian and 10 French non-fictional texts with their respective translations.

Due to copyright restrictions, the corpus is only available to researchers and graduate students at the universities in Oslo and Bergen. However, some texts from the European Union (EU) and the World Health Organization (WHO) are generally available and show how searching in parallel texts works. The search tool is WebTCE, an earlier version of PerlTCE (see above).

Lists of the OMC texts that are currently available can be obtained by accessing the corpus.

  • Access the corpus (user name and password required). Note: the links are broken; search for OMC on the new UiO web site.
  • Apply for access to the corpus (restricted to researchers and students at the Universities of Oslo and Bergen)