New corpora collaboration with Czech university

For the next three years, Prof. Janne Bondi Johannessen and the Text Laboratory will take part in the project HaBiT (Harvesting big text data for under-resourced languages) together with project partners Masarykova Univerzita in Brno and NTNU in Trondheim.

The project is financed by the Czech-Norwegian Research Programme (EEA and Norway Grants).

The goals for the HaBiT project are:

  1. Build large annotated corpora for Norwegian (tentatively at least 1 billion tokens, with the aim of 5 billion tokens). For Czech, a corpus larger than 5 billion tokens will be compiled. For Amharic, Tigrinya, Oromo, and Somali, corpora of at least a few million tokens will be built (aiming at 20 million, at least for Amharic).
  2. Develop a parallel Czech-Norwegian corpus (up to 10 million tokens).
  3. Develop software modules such as taggers, parsers, and Sketch Grammars for the participating languages (Norwegian, and at least Amharic among the Ethiopian languages), and improve the results of the already developed Czech modules.
  4. Give presentations at international conferences and workshops, with corresponding papers in the relevant journals.
  5. Organize a workshop on the under-resourced languages (e.g., within the TSD – Text, Speech and Dialogue – conference framework).

HaBiT meeting in Oslo, September 5-6, 2015.
Feda Negesse, Pavel Rychlý, Björn Gambäck, Anders Nøklestad, Aleš Horák, Derib Ado, Vít Suchomel, Kristin Hagen, Janne Bondi Johannessen, Lars Bungum, Joel Priestley.







