A brief introduction to
The English-Norwegian Parallel Corpus
A Research Project
The comparison of languages is of great interest from both a theoretical and an applied perspective. It reveals what is general and what is language-specific, and is therefore important both for the understanding of language in general and for the study of the individual languages compared. The analysis has applications within lexicography, language teaching, and translation studies.
Recently there has been a revival of interest in contrastive studies, partly due to the increasing internationalization of society and the growing need for advanced bilingual and multilingual competence. At the same time, linguistics has become increasingly concerned with the study of language in context, with the emergence of fields like text linguistics, discourse analysis, and pragmatics. The time is ripe for text-based contrastive studies.
Text-based contrastive studies can benefit from the progress in computer processing of texts, which has been a major area of research at the Department of British and American Studies, University of Oslo, and the Norwegian Computing Centre for the Humanities, University of Bergen. The present project extends this work to computer processing of parallel texts.
The aim of the project is (1) to compile a parallel corpus of English and Norwegian texts for computer processing; (2) to develop tools for analysing parallel texts; and (3) to carry out studies of the structure and communicative use of the two languages on the basis of the corpus. Areas to be studied include:
- presentative constructions in English and Norwegian (Jarle Ebeling)
- word order and information structure in English and Norwegian (Hilde Hasselgård)
- lexical comparison of English and Norwegian (Kay Wikberg)
Examples of more general questions to be addressed are: To what extent are there parallel differences in text genres across languages? In what respects do translated texts differ from comparable original texts in the same language? Are there any features in common among translated texts in different languages (and, if so, what are these features)?
The aim of studying translated texts is not to reveal translation mistakes, but rather to use the work of translators as a resource for contrastive analysis and the study of translation problems.
The parallel corpus is planned as an open text bank and will be expanded as allowed by the resources available. It is intended as a general research tool, available beyond the present project for applied and theoretical linguistic research.
The process of compiling the corpus has taken four years. A lot of work has gone into the development of software and into the preparation of the texts. The coding system used to mark up the ENPC follows the suggestions made by the Text Encoding Initiative (TEI) as presented in Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen & Burnard, 1994). The texts are marked up with start-tags (<..>) and end-tags (</..>). The most important tags mark paragraphs (<p>...</p>) and sentence boundaries (<s>...</s>):
<p><s>These are the myths of beginnings.</s> <s>These are stories and moods deep in those who are seeded in rich lands, who still believe in mysteries.</s></p>
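As an illustration of how such mark-up can be processed, the following is a minimal sketch in Python that reads the paragraph and sentence tags from the sample above. It is not part of the ENPC software; the function name `extract_sentences` and the wrapping `<text>` element are assumptions made for this example, and real corpus files may contain further tags, attributes, or entities that this sketch does not handle.

```python
import xml.etree.ElementTree as ET


def extract_sentences(marked_up: str) -> list[list[str]]:
    """Return one list of sentence strings per <p> element.

    The input is wrapped in a hypothetical <text> root element so that
    a fragment containing several paragraphs still parses as one document.
    """
    root = ET.fromstring(f"<text>{marked_up}</text>")
    paragraphs = []
    for p in root.iter("p"):
        # itertext() collects the text of a sentence even if it has nested tags.
        sentences = ["".join(s.itertext()).strip() for s in p.iter("s")]
        paragraphs.append(sentences)
    return paragraphs


sample = (
    "<p><s>These are the myths of beginnings.</s> "
    "<s>These are stories and moods deep in those who are seeded in rich "
    "lands, who still believe in mysteries.</s></p>"
)

paragraphs = extract_sentences(sample)
for number, sentences in enumerate(paragraphs, start=1):
    for sentence in sentences:
        print(number, sentence)
```

Here the sample yields one paragraph containing two sentences, which is the unit the later sentence-level processing works on.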
After the texts have been scanned, coded, and proofread, they are aligned, i.e. the original text extract is linked to the translated text extract at the sentence level. The alignment is done automatically by a program developed by Knut Hofland, followed by a manual proofreading stage. The texts are stored in a database and made searchable in the Translation Corpus Explorer, a browser developed by Jarle Ebeling.
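To make the idea of sentence-level links concrete, here is a small sketch in Python of how aligned sentences might be represented and searched. It does not reproduce the alignment program or the Translation Corpus Explorer; the names `AlignedPair` and `search` are hypothetical, and the English and Norwegian sentences are invented for illustration and are not taken from the corpus. The sketch assumes that a link may join one or more original sentences to one or more translated sentences.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One alignment link: original sentence(s) and their translation(s)."""
    original: tuple[str, ...]
    translation: tuple[str, ...]


def search(pairs: list[AlignedPair], word: str) -> list[AlignedPair]:
    """Return every link whose original side contains `word` as a whole word."""
    needle = word.lower()
    hits = []
    for pair in pairs:
        tokens = re.findall(r"\w+", " ".join(pair.original).lower())
        if needle in tokens:
            hits.append(pair)
    return hits


# Invented example sentences (assumption: not actual corpus data).
pairs = [
    AlignedPair(("The house is old.",), ("Huset er gammelt.",)),
    # Two original sentences linked to a single translated sentence.
    AlignedPair(("She left.", "Nobody noticed."),
                ("Hun dro uten at noen merket det.",)),
]

hits = search(pairs, "house")
for pair in hits:
    print(" ".join(pair.original), "=>", " ".join(pair.translation))
```

Storing each link as a pair of sentence groups, rather than assuming one sentence per side, keeps the structure usable when a translator merges or splits sentences.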