The goals for the HaBiT project are:
- Build large annotated corpora. For Norwegian, the corpus should tentatively contain at least 1 billion tokens, with a target of 5 billion tokens. For Czech, a corpus larger than 5 billion tokens will be compiled. For Amharic, Tigrinya, Oromo, and Somali, corpora of at least a few million tokens each will be built (with a target of 20 million tokens, at least for Amharic).
- Develop a parallel Czech-Norwegian corpus of up to 10 million tokens.
- Develop software modules such as taggers, parsers, and Sketch Grammars for the participating languages (Norwegian and, among the Ethiopian languages, at least Amharic), and improve the results of the already developed Czech modules.
- Give presentations at international conferences and workshops, and publish corresponding papers in relevant journals.
- Organize a workshop on under-resourced languages (e.g., within the framework of the TSD – Text, Speech and Dialogue – conference).