Frequency lists from NoWaC

The frequency lists from NoWaC contain frequencies of word forms and lemmas.

Homonyms are counted separately according to the tags assigned by the Oslo-Bergen grammatical tagger. For example, the verb "arbeid" and the noun "arbeid" are counted separately, as are the past tense and the past participle of verbs like "hoppe".

All words are converted to lowercase so that, for example, "The" and "the" are counted together. Proper names are an exception and retain their original form.
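The counting rules above can be illustrated with a minimal Python sketch. This is not the procedure actually used to build the lists. The tag labels ("noun", "verb", "prop"), the function names, and the toy input are all assumptions made for illustration, and the real tagset of the Oslo-Bergen tagger may look different.

```python
from collections import Counter

# Hypothetical label for proper names; the real tagger's tagset may differ.
PROPER_NAME_TAG = "prop"


def normalize(form: str, tag: str) -> str:
    """Lowercase a word form unless it is tagged as a proper name."""
    return form if tag == PROPER_NAME_TAG else form.lower()


def count_forms(tagged_tokens):
    """Count (word form, tag) pairs so that homonyms with different tags stay separate."""
    return Counter((normalize(form, tag), tag) for form, tag in tagged_tokens)


# Toy input invented for illustration: (word form, tag) pairs.
tokens = [
    ("Arbeid", "noun"),  # sentence-initial noun, lowercased before counting
    ("arbeid", "noun"),
    ("arbeid", "verb"),  # same spelling, different tag: counted separately
    ("Oslo", "prop"),    # proper name keeps its original form
]

counts = count_forms(tokens)
for (form, tag), freq in sorted(counts.items()):
    print(form, tag, freq)
```

Keying the counter on the pair of word form and tag, rather than on the word form alone, is what keeps the noun and verb readings of "arbeid" apart in this sketch.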

Note that parts of the corpus contain text in formats that are difficult for the grammatical tagger to recognize (e.g. various newspaper bylines or question-answer formats on chat sites). As a result, many words have been analysed as proper names when they are in fact sentence-initial common nouns, pronouns, etc.

 

Download

The frequency lists are distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license.

 

 

Published Jan. 18, 2012 14:19 - Last modified Apr. 24, 2017 13:19