
Services and tools from the Text Laboratory

Written corpora

  • Bosnian Corpus
    1.5 million words, from novels, stories, law texts, newspapers, religious texts.
    Read more and search the corpus.
  • The Corpus for Bokmål Lexicography LBK
    LBK is a representative, weighted corpus made for lexicographic purposes. The corpus is tagged with the Oslo-Bergen tagger and is marked with information about gender, age and geographic affinity of the authors, in addition to genre, topic and other common source information.
    Read about the corpus (in Norwegian)
    Search the corpus
  • The ELENOR Corpus
    ELENOR (Spanish as a Foreign Language in Norway) is a database containing texts in Spanish written as course assignments by university students in Norway.
    Read about the corpus
    Search the corpus
  • The French Newspaper Corpus
    115 million words from French newspapers (from the LDC).
    Read more and search the corpus
  • KAL
    3300 texts written by pupils for the final exam in Norwegian in 1998, 1999, 2000 and 2001. The database also includes the associated grades and other background material.
    Read about the KAL project
    Search the corpus (in Norwegian)
  • Macedonian Corpus
    Search the corpus
  • Two Corpora with music reviews
    Two corpora of music reviews. One of them also contains transcriptions of music therapy sessions.
    Search "Korpus med musikkanmeldelser"
    Search the corpus "Music, Motion and Emotion"
  • NoWaC
    This corpus is the first version of a large web-based corpus of Bokmål Norwegian currently containing about 700 million tokens. The corpus has been built by crawling, downloading and processing web documents in the .no top-level internet domain.
    Read more and search the corpus
  • The Oslo Corpus of Tagged Norwegian Texts, bokmål and nynorsk
    Bokmål: 18.5 million words, taken from newspapers, magazines, novels and public documents. Tagged with the Oslo-Bergen tagger.
    Nynorsk: 3.8 million words, taken from newspapers, magazines, novels and public documents. Tagged with the Oslo-Bergen tagger.
    Read more and search the corpus
  • Sidaama Corpus
    150,000 words, mostly from Kjell Magne Yri's translation of the New Testament.
  • SKRIV Corpus
    Texts written by students in upper secondary education programs. The corpus is especially suitable for the analysis of texts written by students with Norwegian as their second language.
    Search the corpus

  • Usenet Corpus
    140 million words, taken from the no.* hierarchy of Usenet from 1998 to 2002.
    Search the corpus

Speech corpora

  • The BigBrother Corpus
    Transcripts of TVNorge's BigBrother broadcasts from 2001. The transcriptions are linked to audio and video recordings.
    Read more and search the corpus
  • Corpus of American Norwegian Speech (CANS)
    Speech corpus where Americans of Norwegian heritage speak Norwegian.
    Read more and search the corpus
  • Corpus of Doctor-Patient Conversations from Ahus
    Transcripts of conversations between doctors and patients in different types of consultations at Akershus University Hospital (Ahus). The audio files are not available in the corpus because of the sensitivity of the conversations.
    Read more about the corpus (in Norwegian)
    Search the corpus
  • MAID
    The Mandarin Audio Idiolect Dictionary (MAID) is a comprehensive audio dictionary, approximately 2000 hours long, of the language of a Manchu speaker of the Peking dialect of Chinese.
    Read more
  • Nordic Dialect Corpus
    Nordic Dialect Corpus is a corpus of Norwegian, Swedish, Danish, Faroese and Övdalian spoken language. It consists of spontaneous speech data from dialects of the North Germanic languages across all of the Nordic countries.
    Read more and search the corpus

  • NoTa-Oslo
    Speech corpus with recordings from 2004 - 2006, with about 900 000 transcribed words linked to audio and video. The informants were born and raised in the Oslo area. The corpus is a representative selection of 144 informants.
    Read more and search the corpus
  • The Ruija Corpus
    The Ruija Corpus is a speech corpus from areas where Kven and Finnish are spoken. The recordings were made between 1960 and 2009. It is the first online corpus of the Kven language and contains 428 971 words and 76 hours and 18 minutes of speech.
    Read more
    Search the corpus
  • Talko
    Finland-Swedish speech corpus with recordings and transcriptions from The Society of Swedish Literature in Finland.
    Read more (in Swedish)
    Search the corpus
  • TAUS
    Speech corpus from Oslo with interviews from 1971 - 1973. The transcriptions are linked to the original sound recordings.
    Read more and search the corpus

Databases


  • Nordic Syntax Database
    The database consists of judgments from 924 Nordic dialect speakers in 207 locations on a list of sentences that illustrate various syntactic phenomena.
    Read more
    Search the database

  • KELLY (Keywords for Language Learning for Young and adults alike)
    Searchable database of language pairs from 9 languages: Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian and Swedish.
    Read more about the EU project KELLY
    Search the multilingual database from Kelly.

  • Ordforrådet
    A searchable lexical database of 1650 Norwegian nouns, verbs and adjectives.
    Read more
  • Repertory of Conjectures on Horace
    Repertory of Conjectures on Horace is a searchable database that allows scholars to find information about ca. 7500 conjectures proposed in printed works from around 1500 up to the present.
    Read more
    Search the database


Multilingual corpora

  • OMC
    The Oslo Multilingual Corpus (OMC) is a collection of multilingual corpora that consists of original works and translations. OMC is a unique research resource for contrastive studies, translation studies and linguistics generally.
    Read more
    Search the corpus
  • RuN
    The RuN corpus is a parallel corpus consisting of Norwegian, Russian and English texts. The texts are aligned at the sentence level and have been tagged for grammatical information at the word level. As of August 2009, the corpus contains approximately 1.2 million words in Norwegian, 1.2 million words in Russian and 500 000 words in English.
    Read more
    Search the corpus
  • The Sofie Treebank
    The Sofie Treebank contains syntactically analysed sentences from seven Northern European languages: Danish, Estonian, Faroese, Icelandic, Norwegian, Swedish and German. The sentences are taken from the first chapters of Jostein Gaarder's novel Sophie's World.
    Read more

Language technology tools

  • Glossa - a search and post-processing tool
    Glossa is a tool for researchers who want to search linguistically annotated corpora.
    Read more
  • The Oslo-Bergen Tagger
    The Oslo-Bergen tagger is a robust morphological and syntactic constraint grammar tagger.
    Read more
  • The Oslo Transliterator
    The Oslo Transliterator is a semi-automatic tool developed to assist in creating a second, alternative transcription from an original transcription.
    Read more

Grammar games


Published Nov. 1, 2010 3:50 PM - Last modified Feb. 3, 2017 5:11 PM