The RuN Corpus (and its extension: The RuN-Euro Corpus)

The RuN corpus is a parallel corpus consisting (mostly) of Norwegian and Russian texts, some of which are also available in English translations. Parallel texts in other languages are currently being added in the extended RuN-Euro Corpus.

For an updated overview of the texts contained in the corpus, have a look at our database, an interface developed by research assistant Vladyslav Dorokhin. If you are interested in using the corpus for research, please contact Atle Grønn (head of the RuN project).

The texts are aligned at the sentence level and have been tagged for grammatical information at the word level. As of November 2010, the corpus contains approximately 2 million words in Norwegian, 2 million words in Russian and 900 000 words in English. We are now extending the corpus to also include texts in Bulgarian, BCS (Bosnian-Croatian-Serbian), Polish, Italian and French.

A number of English translations are included in the RuN corpus, but in some cases, access to translations in English (and French or German) can only be obtained through the Oslo Multilingual Corpus (OMC), developed at the University of Oslo.

Both the RuN corpus and OMC use the Glossa interface developed by Lars Nygaard and maintained by the Text Laboratory at UiO. In the RuN project we are grateful to our colleagues at the SPRIK project (responsible for the OMC) and the Text Laboratory for all their help in building the RuN corpus. We especially thank Signe Oksefjell Ebeling (English department; OMC) and Anders Nøklestad (Text Laboratory) for their valuable help. We also wish to thank Jarle Ebeling (USIT, UiO) and Knut Hofland (Bergen), who in the late 1990s developed some of the tools (including the corpus aligner) we have been using in the RuN corpus.

The following people have been directly involved in the corpus part of the RuN project (2008-2010), under the leadership of Atle Grønn: Research assistant Vladyslav Dorokhin has been responsible for a large part of the texts currently in the corpus. He has also been in charge of training the assistants working on the project. The first texts to appear in the RuN corpus were prepared and aligned by Maria Filiouchkina Krave (Oslo). Research assistants Tim Roos (Oslo) and Olga Dolzhykova (Kiev) have also been engaged in text preparation and alignment.

In 2010, a research group from ILOS (Atle Grønn (Russian), Kjetil Rå Hauge (Bulgarian), Elizaveta Khachatourian (Italian) and Liljana Saric (BCS)) received funding from the Faculty of Humanities aimed at extending the RuN corpus to include new languages and more texts. This extended corpus is called the RuN-Euro corpus. In this connection, Boris Orekhov (Ufa, Russia) was invited to the project in autumn 2010 as a visiting researcher. Orekhov is working on the extension of the corpus with texts in various Slavic languages. Other assistants engaged by the project at this stage include Marina Mozharovskaja (Oslo), Evgenij Shaulskij (Moscow) and Anne Østhus Halvorsen (Oslo).

The RuN project has developed a web interface with some useful tools for text preparation. For an overview of the texts included in the corpus, we refer to our online database.

Published May 20, 2010 2:59 PM - Last modified Mar. 30, 2011 2:18 PM