
Sub-corpora

The various sub-corpora differ in that they contain a different number of languages or a different combination of languages.

Languages in the sub-corpora

The Oslo Multilingual Corpus contains many sub-corpora that differ in composition with regard to languages and the number of texts included.

It is mainly the languages Norwegian, English, French, and German that are represented in the sub-corpora, but some of the corpora include Dutch and Portuguese texts.

In addition, there are related parallel corpora for English-Swedish and English-Finnish, compiled in Sweden and Finland, which are accessible from the same site.

Many of the texts are found in more than one sub-corpus: because the sub-corpora combine different languages, the same texts can be re-used across them. This is particularly the case with sub-corpora including Norwegian, English, or German originals.

Parallel and translation corpora

The different sub-corpora of the OMC can be divided into two main types of multilingual corpora: parallel corpora and translation corpora.

A parallel corpus is here understood as a collection of texts containing both original texts and translations from two or more languages. As far as possible, the same number of original texts is found in the two (or three) languages.

A translation corpus is here understood as a collection of texts containing original texts from one language with translations into one or more languages, i.e. only one language is represented with original texts.
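The distinction between the two types comes down to how many languages contribute original texts. A minimal Python sketch of that rule follows. The function name `corpus_type` and the small `SUBCORPORA` table are illustrative assumptions, not part of any OMC tool; the language lists are taken from the sub-corpus descriptions on this page.

```python
def corpus_type(original_languages):
    """Classify a sub-corpus by the languages that contribute original texts.

    Two or more languages with originals -> "parallel";
    exactly one language with originals -> "translation".
    """
    distinct = set(original_languages)
    if not distinct:
        raise ValueError("a sub-corpus needs at least one language with original texts")
    return "parallel" if len(distinct) >= 2 else "translation"


# Hypothetical lookup table: sub-corpus name -> languages with original texts.
SUBCORPORA = {
    "ENPC": ["English", "Norwegian"],
    "En-Du": ["English"],
    "No-Fr-Ge": ["Norwegian"],
}

for name, originals in SUBCORPORA.items():
    print(f"{name}: {corpus_type(originals)}")
# ENPC: parallel
# En-Du: translation
# No-Fr-Ge: translation
```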

The corpus texts have been divided into fiction texts and non-fiction texts.

Overview of the sub-corpora

English-Norwegian Parallel Corpus (ENPC)

This corpus is the mother corpus of the Oslo Multilingual Corpus and is composed of one fiction part and one non-fiction part.

The corpus contains 50 original texts from each language and their translations (English-Norwegian and Norwegian-English), 30 of which are fiction and 20 of which are non-fiction. Each text is an extract of 10,000-15,000 words, amounting to some 2.6 million words in all. See further the ENPC manual (PDF).

ENPC/Fiction

(Counts, using AntConc v. 3.5.9 (Anthony 2020))

English original text: approx. 422,000 words
Norwegian translated text: approx. 411,000 words
Norwegian original text: approx. 402,000 words
English translated text: approx. 443,000 words

ENPC/Non-fiction

English original text: approx. 252,000 words
Norwegian translated text: approx. 244,000 words
Norwegian original text: approx. 220,100 words
English translated text: approx. 252,700 words
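The approximate ENPC counts listed above can be added up and compared with the overall figure of some 2.6 million words. The following Python sketch does this. The dictionary layout and the name `total_words` are assumptions made for illustration, and because the inputs are rounded, the sum is approximate as well.

```python
# Approximate word counts for the ENPC, copied from the lists above.
ENPC_COUNTS = {
    "fiction": {
        "English original": 422_000,
        "Norwegian translated": 411_000,
        "Norwegian original": 402_000,
        "English translated": 443_000,
    },
    "non-fiction": {
        "English original": 252_000,
        "Norwegian translated": 244_000,
        "Norwegian original": 220_100,
        "English translated": 252_700,
    },
}


def total_words(counts):
    """Sum the word counts over all parts and all text categories."""
    return sum(sum(part.values()) for part in counts.values())


enpc_total = total_words(ENPC_COUNTS)
print(f"{enpc_total:,}")  # 2,646,800
```

The result, roughly 2.65 million, is consistent with the "some 2.6 million words" stated for the corpus as a whole.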

French-Norwegian Parallel Corpus (FNPC)

The sub-corpus French-Norwegian Parallel Corpus (FNPC) was compiled at the University of Bergen (UiB) and completed and made ready for inclusion in the OMC at the University of Oslo (UiO).

FNPC/non-fiction contains texts that were collected both at UiO and UiB. The French-Norwegian anchor word list and the rules for French word splitting were developed at the University of Bergen.

The French-Norwegian Parallel Corpus contains original texts and their translations (French-Norwegian, Norwegian-French).

The corpus is composed of a fictional and a non-fictional part. FNPC/Fiction includes six original French texts with translations into Norwegian and five Norwegian original texts with translations into French, while FNPC/Non-fiction includes 10 original text extracts from each language (with translations).

The text extracts are chunks of 6,000–41,000 words. In total, the corpus contains approx. 864,600 running words.

FNPC/Fiction

Norwegian original text: approx. 55,800
French translated text: approx. 63,300
French original text: approx. 111,200
Norwegian translated text: approx. 109,300

FNPC/Non-fiction

Norwegian original text: approx. 117,500
French translated text: approx. 134,000
French original text: approx. 136,500
Norwegian translated text: approx. 137,000

German-Norwegian Parallel Corpus (GNPC)

The corpus contains original texts and their translations (German-Norwegian, Norwegian-German).

The corpus is composed of a fictional and a non-fictional part. GNPC/Fiction includes 18 original text extracts from German and 20 original text extracts from Norwegian, while GNPC/Non-fiction includes 6 original text extracts from each language.

In total, the corpus contains approx. 1,432,400 running words.

GNPC/Fiction

Norwegian original text: approx. 276,900
German translated text: approx. 276,400
German original text: approx. 269,300
Norwegian translated text: approx. 272,100

GNPC/Non-fiction

Norwegian original text: approx. 87,500
German translated text: approx. 85,900
German original text: approx. 80,200
Norwegian translated text: approx. 84,100

English-German parallel corpus (En-Ge-En)

An English-German parallel corpus containing original texts and their translations (English-German, German-English). The corpus is composed of both fiction and non-fiction texts.

En-Ge-En is composed of 33 English and 21 German original texts, and, on average, each text extract contains 10,000-15,000 words. In total, the corpus contains approx. 1,500,000 running words, distributed as follows:

English original text: approx. 432,500
German translated text: approx. 442,200
German original text: approx. 303,500
English translated text: approx. 320,900

German-Norwegian parallel corpus (Ge-No-Ge) (not balanced)

A German-Norwegian parallel corpus containing original texts and their translations (German-Norwegian, Norwegian-German). The corpus is composed of both fiction and non-fiction texts.

Ge-No-Ge is composed of 37 German and 28 Norwegian original texts, and, on average, each text extract contains 10,000–15,000 words.

In total, the corpus contains approx. 1,793,500 running words, distributed as follows:

German original text: approx. 517,800
Norwegian translated text: approx. 515,100
Norwegian original text: approx. 378,000
German translated text: approx. 382,600

Norwegian-English-German parallel corpus (No-En-Ge, En-Ge-No, Ge-En-No)

A Norwegian-English-German parallel corpus containing original texts and translations from three languages (Norwegian-English-German, English-German-Norwegian, and German-English-Norwegian).

The corpus is split into three different databases. Together these three sub-corpora make up a Norwegian-English-German parallel corpus.

No-En-Ge contains Norwegian original texts and their English and German translations.
En-Ge-No contains English originals and their translations into German and Norwegian.
Ge-En-No contains German originals and their translations into English and Norwegian.

The Norwegian-English-German parallel corpus comprises a different number of original texts in the three languages.

The aim is to get 25-30 original texts from each language. In January 2006, the status was 22 Norwegian, 33 English, and 21 German originals. Most of these are fictional texts.

Number of running words in each of the sub-corpora

No-En-Ge

Norwegian original text: approx. 289,230
English translated text (from Norwegian): approx. 306,050
German translated text (from Norwegian): approx. 289,860

En-Ge-No

English original text: approx. 432,500
German translated text (from English): approx. 442,200
Norwegian translated text (from English): approx. 430,300

Ge-En-No

German original text: approx. 287,400
English translated text (from German): approx. 305,800
Norwegian translated text (from German): approx. 280,300

English-Dutch translation corpus (En-Du)

An English-Dutch translation corpus containing 12 English original texts and their translations into Dutch. The corpus includes fictional texts only and overlaps with the English texts in the ENPC.

Each text extract amounts to 10,000-15,000 running words. In total, the corpus contains approx. 326,300 running words, distributed as follows:

English original text: approx. 158,000
Dutch translated text: approx. 168,300

English-Norwegian-Portuguese translation corpus (En-No-Po)

The corpus contains 15 English original texts and their translations into Norwegian and Portuguese.

The corpus includes fictional texts only, all of which overlap with the English originals in the ENPC. One of the texts includes translations into both European and Brazilian Portuguese.

Each text extract amounts to 10,000–15,000 words. In total, the corpus contains approx. 606,000 running words, distributed as follows:

English original text: approx. 197,000
Norwegian translated text: approx. 197,000
Portuguese translated text: approx. 212,000

Norwegian-French-German translation corpus (No-Fr-Ge)

A Norwegian-French-German translation corpus containing Norwegian fictional texts and their translations into French and German. The corpus contains 7 fictional texts.

The text extracts amount to about 80% of each book. In total, the corpus contains approx. 1,525,398 running words, distributed as follows:

Norwegian original text: approx. 498,724
French translated text: approx. 540,887
German translated text: approx. 485,787

Norwegian-English-French-German translation corpus (No-En-Fr-Ge)

The corpus contains Norwegian fictional texts and their translations into English, French, and German.

The corpus contains five texts, all of which are also part of No-Fr-Ge. The difference is that this corpus includes English translations in addition to French and German ones.

The text extracts amount to about 80% of each book. In total, the corpus contains approx. 1,666,964 running words, distributed as follows:

Norwegian original text: approx. 408,558
English translated text: approx. 425,949
French translated text: approx. 439,687
German translated text: approx. 392,770
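As a rough illustration of how the running words are distributed, the Python sketch below recomputes the No-En-Fr-Ge total from the four figures above and derives each category's share of the corpus. The variable names are assumptions made for this example only.

```python
# Word counts for No-En-Fr-Ge, copied from the list above.
NO_EN_FR_GE = {
    "Norwegian original": 408_558,
    "English translated": 425_949,
    "French translated": 439_687,
    "German translated": 392_770,
}

total = sum(NO_EN_FR_GE.values())
print(f"Total: {total:,}")  # Total: 1,666,964

# Share of the whole corpus contributed by each text category.
shares = {name: count / total for name, count in NO_EN_FR_GE.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The recomputed total matches the stated figure of 1,666,964 running words.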

Sister corpora (not part of OMC)

English-Swedish Parallel Corpus (ESPC)

This corpus is the Swedish sister corpus of the ENPC. Like the ENPC, it is composed of one fiction and one non-fiction part.

The corpus contains original texts and their translations (English-Swedish and Swedish-English), amounting to approx. 2.8 million words in all. See further the ESPC manual.

Since the two corpora were developed within a larger Nordic network, many of the English original texts are the same in the two corpora.

English-Finnish translation corpus (En-Fi)

The corpus contains English original texts and their translations into Finnish. This corpus is also a product of the Nordic network mentioned in connection with the ESPC.

The corpus includes fictional texts only and is originally part of the English-Finnish Parallel Corpus, which also includes Finnish original texts. Many of the English original texts in the En-Fi are the same as in the ENPC/ESPC.

En-Fi contains 21 texts, and each text extract amounts to 10,000–15,000 words, giving approx. 295,000 running words in the originals.

The Finnish-English Contrastive Corpus Studies (FECCS) Project is responsible for the En-Fi.

Published July 6, 2010 10:39 AM - Last modified Mar. 9, 2023 1:32 PM