Oslo Multilingual Corpus

Sub-corpora

The different sub-corpora of the OMC can be divided into two main types of multilingual corpora: parallel corpora and translation corpora.

A parallel corpus is here understood as a collection of texts containing both original texts and translations from two or more languages. As far as possible, each of the two (or three) languages is represented by the same number of original texts.

A translation corpus is understood as a collection of texts containing original texts from one language with translations into one or more other languages, i.e. only one language is represented with original texts.
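The two definitions above can be expressed as a simple rule on the number of languages that contribute original texts. The following is a minimal sketch only: the `SubCorpus` class, its field names, and the `corpus_type` function are hypothetical and are not part of any OMC software. The two sample entries use the language pairs given for the ENPC and En-Du further down this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubCorpus:
    """Hypothetical description of a sub-corpus (illustration only)."""
    name: str
    original_languages: frozenset  # languages represented with original texts
    target_languages: frozenset    # languages represented with translations


def corpus_type(sub: SubCorpus) -> str:
    """Classify a sub-corpus according to the two definitions above."""
    if not sub.original_languages:
        raise ValueError(f"{sub.name}: no language with original texts")
    if len(sub.original_languages) == 1:
        # Only one language is represented with original texts.
        return "translation corpus"
    # Originals (and translations) from two or more languages.
    return "parallel corpus"


enpc = SubCorpus(
    "ENPC",
    frozenset({"English", "Norwegian"}),
    frozenset({"English", "Norwegian"}),
)
en_du = SubCorpus("En-Du", frozenset({"English"}), frozenset({"Dutch"}))

print(enpc.name, "->", corpus_type(enpc))    # parallel corpus
print(en_du.name, "->", corpus_type(en_du))  # translation corpus
```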

The corpus texts have been divided into fiction and non-fiction. For an overview of the classification of these, click here.

(Please note that you need to have access to the OMC to see the overview of texts in the different sub-corpora.)


ENPC/Fiction
ENPC/Non-fiction
 

The English-Norwegian Parallel Corpus is the mother corpus of the OMC and is composed of one fiction part and one non-fiction part. The corpus contains 50 original texts from each language (30 fiction and 20 non-fiction) and their translations (English-Norwegian and Norwegian-English). Each text is an extract of 10,000-15,000 words, amounting to some 2.6 million words in all.

--------------------------------------------------------------------------------
ENPC/Fiction

English original text: approx. 402,500 words
Norwegian translated text: approx. 398,000 words
Norwegian original text: approx. 403,500 words
English translated text: approx. 423,000 words
--------------------------------------------------------------------------------
ENPC/Non-fiction

English original text: approx. 252,000 words
Norwegian translated text: approx. 244,000 words
Norwegian original text: approx. 220,100 words
English translated text: approx. 252,700 words
--------------------------------------------------------------------------------
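As a quick check, the approximate figures listed above can be added up to confirm the overall size of some 2.6 million words. This is only a sketch: the dictionary layout and the variable names are assumptions made for illustration, and the numbers are copied from the two lists above.

```python
# Approximate word counts copied from the ENPC lists above.
ENPC_WORDS = {
    "Fiction": {
        "English original": 402_500,
        "Norwegian translated": 398_000,
        "Norwegian original": 403_500,
        "English translated": 423_000,
    },
    "Non-fiction": {
        "English original": 252_000,
        "Norwegian translated": 244_000,
        "Norwegian original": 220_100,
        "English translated": 252_700,
    },
}

# Sum each part, then the corpus as a whole.
part_totals = {part: sum(counts.values()) for part, counts in ENPC_WORDS.items()}
enpc_total = sum(part_totals.values())

for part, total in part_totals.items():
    print(f"ENPC/{part}: approx. {total:,} words")
print(f"ENPC in all: approx. {enpc_total:,} words")
```

With these figures the fiction part comes to approx. 1,627,000 words and the non-fiction part to approx. 968,800 words, i.e. approx. 2,595,800 words in all.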


FNPC/Fiction
FNPC/Non-fiction
 

The French-Norwegian Parallel Corpus contains original texts and their translations (French-Norwegian, Norwegian-French). The corpus is composed of a fictional and a non-fictional part. FNPC/Fiction includes 6 French original texts with translations into Norwegian and 5 Norwegian original texts with translations into French, while FNPC/Non-fiction includes 10 original text extracts from each language (with translations).

The text extracts are chunks of 6,000-41,000 words. In total, the corpus contains approx. 864,600 running words, distributed as follows:

---------------------------------------------------------------------------------
FNPC/Fiction

Norwegian original text: approx. 55,800
French translated text: approx. 63,300
French original text: approx. 111,200
Norwegian translated text: approx. 109,300
---------------------------------------------------------------------------------
FNPC/Non-fiction

Norwegian original text: approx. 117,500
French translated text: approx. 134,000
French original text: approx. 136,500
Norwegian translated text: approx. 137,000
--------------------------------------------------------------------------------


GNPC/Fiction
GNPC/Non-fiction
 

The German-Norwegian Parallel Corpus contains original texts and their translations (German-Norwegian, Norwegian-German). The corpus is composed of a fictional and a non-fictional part. GNPC/Fiction includes 18 original text extracts from each language, while GNPC/Non-fiction includes 5 original text extracts from each language.

In total, the corpus contains approx. 1,275,000 running words, distributed as follows:

---------------------------------------------------------------------------------
GNPC/Fiction

Norwegian original text: approx. 240,600
German translated text: approx. 238,800
German original text: approx. 269,500
Norwegian translated text: approx. 256,800
---------------------------------------------------------------------------------
GNPC/Non-fiction

Norwegian original text: approx. 63,200
German translated text: approx. 66,900
German original text: approx. 67,600
Norwegian translated text: approx. 71,900
--------------------------------------------------------------------------------


En-Ge-En
 

An English-German parallel corpus containing original texts and their translations (English-German, German-English). The corpus is composed of both fiction and non-fiction texts.

En-Ge-En is composed of 33 English and 21 German original texts, and, on average, each text extract contains 10,000-15,000 words. In total, the corpus contains approx. 1,500,000 running words, distributed as follows:

English original text: approx. 432,500
German translated text: approx. 442,200
German original text: approx. 303,500
English translated text: approx. 320,900



Ge-No-Ge
 

A German-Norwegian parallel corpus containing original texts and their translations (German-Norwegian, Norwegian-German). The corpus is composed of both fiction and non-fiction texts.

Ge-No-Ge is composed of 37 German and 28 Norwegian original texts, and, on average, each text extract contains 10,000-15,000 words. In total, the corpus contains approx. 1,793,500 running words, distributed as follows:

German original text: approx. 517,800
Norwegian translated text: approx. 515,100
Norwegian original text: approx. 378,000
German translated text: approx. 382,600



No-En-Ge
En-Ge-No
Ge-En-No
 

A Norwegian-English-German parallel corpus containing original texts and translations from three languages (Norwegian-English-German, English-German-Norwegian, and German-English-Norwegian). The corpus is split into three separate databases, one for each language of the original texts.

No-En-Ge contains Norwegian original texts and their English and German translations; En-Ge-No contains English originals and their translations into German and Norwegian; Ge-En-No contains German originals and their translations into English and Norwegian.

The Norwegian-English-German parallel corpus contains a different number of original texts in each of the three languages. The aim is to reach 25-30 original texts from each language. As of January 2006, the corpus contains 22 Norwegian, 33 English, and 21 German originals. Most of these are fictional texts. The number of running words in each of the sub-corpora is as follows:

No-En-Ge:
Norwegian original text: approx. 289,230
English translated text (from Norwegian): approx. 306,050
German translated text (from Norwegian): approx. 289,860

 

En-Ge-No:
English original text: approx. 432,500
German translated text (from English): approx. 442,200
Norwegian translated text (from English): approx. 430,300

 

Ge-En-No:
German original text: approx. 287,400
English translated text (from German): approx. 305,800
Norwegian translated text (from German): approx. 280,300
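The figures above also make it possible to compare the length of each translation with its original. The sketch below computes a simple translation-to-original word-count ratio for each sub-corpus. The dictionary layout and the function name `length_ratios` are hypothetical, and the ratio is only a rough indicator based on the approximate counts listed above.

```python
# Approximate running-word counts copied from the lists above.
SUB_CORPORA = {
    "No-En-Ge": {
        "original": 289_230,
        "translations": {"English": 306_050, "German": 289_860},
    },
    "En-Ge-No": {
        "original": 432_500,
        "translations": {"German": 442_200, "Norwegian": 430_300},
    },
    "Ge-En-No": {
        "original": 287_400,
        "translations": {"English": 305_800, "Norwegian": 280_300},
    },
}


def length_ratios(sub_corpora):
    """Return translation-to-original word-count ratios per sub-corpus."""
    return {
        name: {
            lang: count / data["original"]
            for lang, count in data["translations"].items()
        }
        for name, data in sub_corpora.items()
    }


ratios = length_ratios(SUB_CORPORA)
for name, by_lang in ratios.items():
    for lang, ratio in by_lang.items():
        print(f"{name}: {lang} translation / original = {ratio:.3f}")
```

A ratio above 1 means the translation has more running words than the original, and a ratio below 1 means it has fewer.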

 



En-Du
 

An English-Dutch translation corpus containing 12 English original texts and their translations into Dutch. The corpus includes fictional texts only and overlaps with the English texts in the ENPC.

Each text extract amounts to 10,000-15,000 running words. In total, the corpus contains approx. 326,300 running words, distributed as follows:

English original text: approx. 158,000
Dutch translated text: approx. 168,300



En-No-Po
 

An English-Norwegian-Portuguese translation corpus containing 15 English original texts and their translations into Norwegian and Portuguese. The corpus includes fictional texts only, all of which overlap with the English originals in the ENPC. One of the texts includes translations into both European and Brazilian Portuguese.

Each text extract amounts to 10,000-15,000 words. In total, the corpus contains approx. 606,000 running words, distributed as follows:

English original text: approx. 197,000
Norwegian translated text: approx. 197,000
Portuguese translated text: approx. 212,000



No-Fr-Ge
 

A Norwegian-French-German translation corpus containing Norwegian fictional texts and their translations into French and German. The corpus contains 7 fictional texts.

The text extracts amount to about 80% of each book. In total, the corpus contains approx. 1,525,398 running words, distributed as follows:

Norwegian original text: approx. 498,724
French translated text: approx. 540,887
German translated text: approx. 485,787



No-En-Fr-Ge
 

A Norwegian-English-French-German translation corpus containing Norwegian fictional texts and their translations into English, French, and German. The corpus contains 5 texts, all of which are also part of No-Fr-Ge. The difference is that this corpus includes English translations in addition to French and German ones.

The text extracts amount to about 80% of each book. In total, the corpus contains approx. 1,666,964 running words, distributed as follows:

Norwegian original text: approx. 408,558
English translated text: approx. 425,949
French translated text: approx. 439,687
German translated text: approx. 392,770



ESPC/Fiction
ESPC/Non-fiction
 

The English-Swedish Parallel Corpus is the Swedish sister corpus of the ENPC. Like the ENPC it is composed of one fiction and one non-fiction part. The corpus contains original texts and their translations (English-Swedish and Swedish-English), amounting to approx. 2.8 million words in all.

Since the two corpora were developed within a larger Nordic network, many of the English original texts are the same in the two corpora.

An overview of the texts in ESPC/Fiction can be found here.
An overview of the texts in ESPC/Non-fiction can be found here.



En-Fi
 

An English-Finnish translation corpus containing English original texts and their translations into Finnish. This corpus is also a product of the Nordic network mentioned in connection with the ESPC. The corpus includes fictional texts only and was originally part of the English-Finnish Parallel Corpus, which also includes Finnish original texts. Many of the English original texts in En-Fi are the same as in the ENPC/ESPC.

En-Fi contains 21 texts, and each text extract amounts to 10,000-15,000 words, i.e. approx. 295,000 running words in the originals.

The Finnish-English Contrastive Corpus Studies (FECCS) Project is responsible for the En-Fi.

Published Sep. 5, 2008 7:53 AM - Last modified Aug. 18, 2010 3:53 PM