The NORINT Corpus

The NORINT Corpus consists of spoken and written data elicited from adult learners of Norwegian (international students) with an intermediate command of the target language, i.e., level B1 or higher according to the Common European Framework of Reference for Languages (CEFR).

The NORINT Corpus consists of three sub-parts:

  • NORINT Speech consists of interviews with and conversations between 48 informants, 104 000 tokens (words and punctuation characters) altogether. In the interviews, a teacher asks the L2 learners general questions about their background, studies, work, and future plans. The number of topics is limited so that the interviews share a similar setting. In addition, the same L2 learners converse in pairs about topics such as culture, leisure, travel, or life in Norway. The L2 learners chose freely what to talk about from a list given to them right before the recordings were made. NORINT Speech contains 30–40 minutes of speech from each informant, and the interviews and the conversations are of almost the same duration. There are both audio and video recordings of the interviews and conversations.

    The audio and video recordings are transcribed with ELAN, a tool for annotating sound and video files. NORINT Speech is transcribed in standard orthography, together with information describing verbal communication, e.g., “sigh”, “clears his/her/their throat”, “whispering”, etc. Additionally, the NoTa-tagger, an automatic spoken language tagger, has been used to tag the corpus with grammatical information such as word class and various morphological features.
  • NORINT Recited contains data from 57 informants, 48 of whom are the same as in NORINT Speech. The L2 learners read out a short story as well as 60 non-contextualized sentences. The same story and sentences were first used in Språkmøterprosjektet at the Norwegian University of Science and Technology. NORINT Recited comprises audio recordings only, and is not grammatically tagged.
  • NORINT Text is a written language corpus comprising 116 exam papers. There is a partial overlap between the informants in NORINT Text and NORINT Speech (and NORINT Recited), but due to privacy protection regulations, there is no detailed information in the NORINT Corpus about its informants. Nevertheless, if you need spoken and written data from one and the same L2 learner, you can contact the Text Laboratory for more information.

    The texts are available in three formats: a) the original handwritten exam paper in full, in PDF format, comprising the answers to the listening practice test, the reading comprehension test with its answers, the grammar tests with their answers, and the written assignment (reflective writing); b) a digital copy of the original version of the written assignment; and c) a version of the written assignment in which all orthographic (as well as some morphological and syntactic) errors are corrected. The digital copies of the written assignments and their corrected versions are linked together in the corpus. Additionally, they are automatically annotated with the Oslo-Bergen-tagger, an automatic grammatical tagger for written language.
