The corpus was built by crawling, downloading, and processing web documents from the .no top-level internet domain between November 2009 and January 2010. The computational procedure used to collect the NoWaC corpus is largely based on the techniques used to build the corpora published by the WaCky initiative.
In brief, a list of URLs pointing to documents in the target language was first collected by sending queries to commercial search engines (Google and Yahoo). The obtained URLs (6,900 in total) were then used to seed a crawling job with the Heritrix web crawler, restricted to the .no domain. The crawl was configured to behave as "politely" as possible: it avoided flooding external web servers with too many simultaneous requests and strictly followed the Robots Exclusion Protocol ("robots.txt" rules).
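The robots.txt check described above can be sketched with Python's standard-library `urllib.robotparser` (the actual crawl used Heritrix's own configuration; the agent name and the robots.txt content here are purely illustrative):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; a real crawler
# fetches http://<host>/robots.txt before requesting any page on that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def allowed(url: str, agent: str = "example-crawler") -> bool:
    """Return True if the robots.txt rules permit fetching `url` for `agent`."""
    rp = RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())
    return rp.can_fetch(agent, url)

print(allowed("http://example.no/page.html"))  # True: path is not disallowed
print(allowed("http://example.no/private/x"))  # False: under Disallow: /private/
```

A polite crawler would additionally honour the `Crawl-delay` directive by spacing out consecutive requests to the same host.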
The downloaded documents were then processed in several steps to build a linguistic corpus: filtering documents by size, identifying the language of each document, detecting and removing duplicate and near-duplicate documents, and applying tokenization, lemmatisation and POS-tagging.
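Near-duplicate detection in WaCky-style pipelines is typically based on the overlap of word n-grams ("shingles") between documents. A minimal sketch of the idea, with an illustrative shingle size and similarity threshold (not the exact parameters used for NoWaC):

```python
def shingles(text: str, n: int = 5) -> set:
    """Return the set of lower-cased word n-grams (shingles) of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(doc_a: str, doc_b: str, n: int = 5) -> float:
    """Jaccard similarity of the two documents' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(doc_a, n), shingles(doc_b, n)
    return len(sa & sb) / len(sa | sb)

def near_duplicates(doc_a: str, doc_b: str, threshold: float = 0.5) -> bool:
    """Flag two documents as near-duplicates above an illustrative threshold."""
    return jaccard(doc_a, doc_b) >= threshold
```

Identical documents score 1.0 and unrelated ones score close to 0.0; at web scale this pairwise comparison is usually approximated with fingerprinting techniques such as MinHash rather than computed exactly.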
This project has been possible thanks to NOTUR advanced user support and assistance from the Research Computing Services group (Vitenskapelig Databehandling) at USIT, University of Oslo.
NoWaC has been built with permission from the Norwegian Ministry of Culture (Kulturdepartementet).
Read more about NoWaC:
Guevara, Emiliano Raul (2010). NoWaC: a large web-based corpus for Norwegian. In Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop, Association for Computational Linguistics, pages 1–7.
Download
The corpus and frequency lists are distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license.
Download frequency lists from the corpus
For other kinds of use or questions:
Please contact Emiliano Guevara (emiguevara at gmail.com) or The Text Laboratory: tekstlab-post at iln.uio.no