New corpora collaboration with Czech university
For the next three years, Prof. Janne Bondi Johannessen and the Text Laboratory will take part in the project HaBiT (Harvesting big text data for under-resourced languages) together with project partners Masarykova Univerzita in Brno and NTNU in Trondheim.
The goals of the HaBiT project are:
- Build large annotated corpora for Norwegian (tentatively at least 1 billion tokens, with an aim of 5 billion tokens). For Czech, a corpus larger than 5 billion tokens will be compiled. For Amharic, Tigrinya, Oromo, and Somali, corpora of at least a few million tokens will be built (aiming at 20 million, at least for Amharic).
- Develop a parallel Czech-Norwegian corpus (up to 10 million tokens).
- Develop software modules such as taggers, parsers, and Sketch Grammars for the participating languages (Norwegian, and at least Amharic among the Ethiopian languages), and improve results for the already developed Czech modules.
- Give presentations at international conferences and workshops, with corresponding papers in the relevant journals.
- Organize a workshop on under-resourced languages (e.g., within the TSD – Text, Speech and Dialogue – conference framework).