Written corpora
- The Bokselskap Corpus
The Bokselskap Corpus contains mostly older fictional texts from the website bokselskap.no.
Read more about bokselskap.no
Search the corpus
- The Corpus for Bokmål Lexicography LBK
LBK is a representative, weighted corpus made for lexicographic purposes, with texts from 1985 to 2013. The corpus is tagged with the Oslo-Bergen tagger and annotated with the gender, age and geographic affiliation of the authors, in addition to genre, topic and other common source information.
Read about the corpus (in Norwegian)
Search the corpus
- The ELENOR Corpus
ELENOR (Spanish as a Foreign Language in Norway) is a database containing texts in Spanish written as course assignments by university students in Norway.
Read about the corpus
Search the corpus
- The French Newspaper Corpus
115 million words from French newspapers (from the LDC).
Read more and search the corpus
- HaBiT Norwegian Web Corpora 2015
Web corpora for Norwegian Bokmål and Norwegian Nynorsk, with 1.18 billion words in Bokmål and 55 million words in Nynorsk.
Read more and search the corpus
- NORINT Text
NORINT Text consists of 53,247 words from 116 exam papers written by adult L2 learners taking advanced Norwegian courses at the University of Oslo during the summers of 2014 and 2015.
Read more and search the corpus
- The Norm Corpus
The Norm Corpus consists of over 5,000 texts written by students aged 8-13 in Norwegian schools. The texts were collected by The Norm Project. The corpus contains more than 1.1 million words.
Read more and search the corpus
- NoWaC
This corpus is the first version of a large web-based corpus of Bokmål Norwegian currently containing about 700 million tokens. The corpus has been built by crawling, downloading and processing web documents in the .no top-level internet domain.
Read more and search the corpus
- SKRIV Corpus
Texts written by students in upper secondary vocational education programs. The corpus is especially suitable for the analysis of texts written by students with Norwegian as their second language.
Search the corpus
- Five Ethiopian web corpora
The HaBiT project, in cooperation with the project Linguistic Capacity Building – Tools for the inclusive development of Ethiopia, has developed web corpora for Amharic, Oromo, Somali and Tigrinya:
- Corpus Amharic WaC [2013 + 2015 + 2016]
Amharic web corpus. 20,287,250 tokens / 17,320,000 words.
- Amharic WIC
Amharic WIC is the tagged corpus described in Argaw and Asker (2005), Gambäck and Asker (2010) and Gambäck (2012), made searchable in SketchEngine.
- Corpus Oromo WaC [2016]
Oromo web corpus. 5,091,696 tokens / 4,249,953 words.
- Corpus Somali WaC [2016]
Somali web corpus. 79,741,231 tokens / 71,871,585 words.
- Corpus Tigrinya WaC [2016]
Tigrinya web corpus. 2,531,443 tokens / 2,087,613 words.
Speech corpora
- The BigBrother Corpus
Transcripts of TVNorge's BigBrother broadcasts from 2001. The transcriptions are linked to audio and video recordings.
Read more and search the corpus
- CANS - Corpus of American Nordic Speech
Speech corpus where Americans of Norwegian and Swedish heritage speak their heritage language.
Read more and search the corpus
- Corpus of Doctor-Patient Conversations from Ahus
Transcripts of conversations in Norwegian between doctors and patients in different types of consultations at Akershus University Hospital (Ahus). The audio files are not available in the corpus due to the sensitivity of the conversations.
Read more about the corpus
Search the corpus
- The LIA Treebank
Treebank with speech segments from LIA Norwegian - Corpus of historical dialect recordings. The LIA Treebank includes 7536 speech segments and 77 701 tokens with morphological and syntactic annotation.
Read more and search/download the treebank
- LIA Norwegian - Corpus of historical dialect recordings
Speech corpus with older recordings of Norwegian dialects. Approx. 3.5 million words, 1382 speakers from 227 places in Norway. The corpus is an output of the infrastructure project LIA.
Read about the project
Search the Corpus
- LIA sápmi - Sámegiela hállangiellakorpus
Speech corpus with older Saami dialects morphologically tagged by Giellatekno.
Read more about the LIA-project
Search the corpus
- MAID
The Mandarin Audio Idiolect Dictionary (MAID) is a comprehensive audio dictionary, about 2,000 hours long, of the language of a Manchu speaker of the Peking Chinese dialect.
Read more
- The NDC Treebank
Treebank with speech segments from the Norwegian part of Nordic Dialect Corpus. The NDC Treebank includes 4637 speech segments and 66 042 words/tokens with morphological and syntactic annotation.
Read more and search/download the treebank
- Nordic Dialect Corpus
Nordic Dialect Corpus is a corpus of Norwegian, Swedish, Danish, Faroese and Övdalian spoken language. It consists of spontaneous speech data from dialects of the North Germanic languages across all of the Nordic countries.
Read more and search the corpus
- NORINT Speech and NORINT Recited
NORINT Speech consists of interviews and conversations, 140,000 words in total, spoken by adult L2 learners taking advanced Norwegian courses at the University of Oslo during the summers of 2014 and 2015. In NORINT Recited the same students recite a short story, as well as 60 non-contextualized sentences.
Read more and search the corpus
- NoTa-Oslo
Speech corpus with recordings from 2004 - 2006, with about 900 000 transcribed words associated with audio and video. Informants are born and raised in the Oslo area. Representative selection with 144 informants.
Read more and search the corpus
- The Oslo Corpus of Pskov Dialects
Speech corpus with recordings and transcriptions from 1992-1994 from the Russian Northwestern Region of Pskov. So far, only a small demo corpus is searchable but all sound files are available.
Read more and search the corpus
- The Ruija Corpus
The Ruija Corpus is a speech corpus from areas where Kven and Finnish are spoken. The recordings were made between 1960 and 2009. It is the first online corpus of the Kven language and contains 428 971 words and 76 hours and 18 minutes of speech.
Read more
Search the corpus
- SILaNa
The corpus Spoken Italian – Interviews about Language and Nation (SILaNa) contains almost 240 000 tokens from 32 interviews: 22 with Italian native speakers living in Norway and 10 with non-native speakers who have been living in Italy for many years. The corpus represents spontaneous discourse that can be used for both sociolinguistic and linguistic analysis.
Read more
Search the corpus
- Talko
Finland-Swedish speech corpus with recordings and transcriptions from The Society of Swedish Literature in Finland.
Read more (in Swedish)
Search the corpus
- TAUS
Speech corpus from Oslo with interviews from 1971 - 1973. The transcriptions are linked to the original sound recordings.
Read more and search the corpus
- Eight Ethiopian speech corpora
The NORHED project Linguistic Capacity Building – Tools for the inclusive development of Ethiopia has so far made eight small speech corpora.
- Amharic Speech Corpus 154 000 tokens, 82 speakers.
- Gumer Speech Corpus 37 250 tokens, 22 speakers.
- Hadiyya Speech Corpus 13 000 tokens, 39 speakers.
- Hamar Speech Corpus 16 900 tokens, 2 speakers.
- Kambata Speech Corpus 139 600 tokens, 69 speakers.
- Muher Speech Corpus 40 500 tokens, 8 speakers.
- Oromo Speech Corpus 266 500 tokens, 88 speakers.
- Tigrinya Speech Corpus 138 600 tokens, 45 speakers.
Databases
- KELLY (Keywords for Language Learning for Young and adults alike)
Searchable database of language pairs from 9 languages: Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian and Swedish.
Read more about the EU-project Kelly
Search the multilingual database from Kelly.
- LIA file depot
Searchable file depot for all dialect recordings from the LIA project, more than 3000 files.
Search the file depot
- Nordic Syntax Database
The database consists of judgments by 924 Nordic dialect speakers from 207 places to a list of sentences that illustrate various syntactic phenomena.
Read more
Search the database
- NWD - Nordic Word Order Database
NWD - Nordic Word Order Database - is an online database hosted by the Text Laboratory at the University of Oslo. The database contains elicited production data from speakers of all of the Scandinavian languages, including several different dialects.
- Ordforrådet
A searchable lexical database of 1650 Norwegian nouns, verbs and adjectives.
Read more
- Repertory of Conjectures on Horace
Repertory of Conjectures on Horace is a searchable database that allows scholars to find information about ca. 7500 conjectures proposed in printed works from around 1500 up to the present.
Read more
Search the database
- Database of Norwegian Tags
A searchable database of sentence-final tags in Norwegian dialects. The database shows where the tags are used, how they are pronounced, and how their use is distributed among women and men and among younger and older speakers. The database is the result of an investigation within the project "The meaning and function of Norwegian Tags" at NTNU.
Read more
Search the database
Multilingual corpora
- OMC
The Oslo Multilingual Corpus (OMC) is a collection of multilingual corpora that consists of original works and translations. OMC is a unique research resource for contrastive studies, translation studies and linguistics generally.
Read more
- RuN
The RuN corpus is a parallel corpus consisting of texts in 10 languages, among them Norwegian, Russian and English texts. The texts are aligned at the sentence level and have been tagged for grammatical information at the word level. Contact Atle Grønn for more information.
Read more
Search the corpus
Language technology tools
- Glossa - a search and post-processing tool
Glossa is a tool for researchers who want to search linguistically annotated corpora
Read more
- The Oslo-Bergen Tagger
The Oslo-Bergen tagger is a robust morphological and syntactic constraint grammar tagger.
Read more
- The Oslo Transliterator
The Oslo Transliterator is a semi-automatic tool developed to assist in creating a second, alternative transcription, from an original transcription.
Read more
Older corpora with old search interface
- Bosnian Corpus
1.5 million words, from novels, stories, law texts, newspapers, religious texts.
Read more and search the corpus.
- KAL
3300 texts written by pupils for the final exam in Norwegian in 1998, 1999, 2000 and 2001. The database also includes associated grades and other background material.
Search the corpus (In Norwegian)
- Two Corpora with music reviews
Two corpora with music reviews. One corpus also contains transcriptions of music therapy sessions.
Search "Korpus med musikkanmeldelser"
Search the corpus "Music, Motion and Emotion"
- The Oslo Corpus of Tagged Norwegian Texts, bokmål and nynorsk
Bokmål: 18.5 million words, taken from newspapers, magazines, novels and public documents. Tagged with the Oslo-Bergen tagger.
Nynorsk: 3.8 million words, taken from newspapers, magazines, novels and public documents. Tagged with the Oslo-Bergen tagger.
Read more and search the Corpus
- The Sofie Treebank
The Sofie Treebank has syntactically analysed sentences from seven North European languages: Danish, Estonian, Faroese, Icelandic, Norwegian, Swedish and German. The sentences are taken from the first chapters of Jostein Gaarder's novel Sophie's World.
Read more
Grammar games
- GREI grammar games
The GREI portal provides links to grammar games and analyses for Bokmål, Nynorsk and 24 other languages from the VISL web site. (Some of the activities on the website are unfortunately outdated.)
Read more about GREI (in Norwegian)