Two new, large web corpora for Nynorsk and Bokmål
The Text Laboratory now offers two large web corpora for Norwegian, completed in 2017.
• HaBiT Norwegian Web Corpus 2015 (Bokmål) with 1.18 billion words (3.4 million documents).
• HaBiT Norwegian Web Corpus 2015 (Nynorsk) with more than 55 million words (214 000 documents).
The Nynorsk corpus is the first web corpus collected for this language.
The two corpora contain many blog posts and other texts that are less normative and closer to speech than the texts found in corpora based solely on edited material, such as newspapers, reports and professionally published fiction.
Both corpora were collected in February 2015 using SpiderLing. The texts are tagged with the Oslo-Bergen Tagger. The work was carried out at Masarykova Univerzita in Brno, the Czech Republic, in cooperation with the Text Laboratory, University of Oslo, and NTNU, within the framework of the HaBiT project, financed by the Czech-Norwegian Research Programme (EEA and Norway Grants).
The corpora can be searched in SketchEngine: