The VESPA corpus

The Varieties of English for Specific Purposes dAtabase (VESPA) learner corpus consists of academic essays written by learners of English from a variety of first language backgrounds. The project is co-ordinated by Dr. Magali Paquot at the Catholic University of Louvain.

The Norwegian subcorpus of VESPA (VESPA-NO) has been compiled by Signe Oksefjell Ebeling and Hilde Hasselgård at the Department of Literature, Area Studies and European Languages at the University of Oslo. The contributors to the corpus may be described as advanced learners of English. The corpus currently comprises texts from the following disciplines:

Linguistics
Literature
Business / international communication

VESPA-NO consists of texts written by students whose first language is Norwegian. There is also a separate component of texts written by students with other mother-tongue backgrounds. Texts are typically produced as part of a taught course, i.e. as obligatory assignments or term papers.

The corpus has been enriched with functional annotation using a set of macros and Perl scripts. These are based on the macros first developed for the British Academic Written English Corpus (BAWE) (cf. Ebeling & Heuboeck 2007) and were adjusted for VESPA by Alois Heuboeck (University of Reading, UK).

The corpus is suitable for use with WordSmith Tools. It is available to students and researchers at the University of Oslo and to researchers developing other subcorpora of VESPA. Since 2022, the corpus has also been available (password-protected) via an online interface, along with VESPA corpora from other L1 backgrounds (French, Dutch, Spanish and Swedish).

Current status of the corpus: The texts in VESPA-NO were collected between 2009 and 2018. The linguistics component contains close to 330,000 words and the literature component c. 150,000 words, while the business component is much smaller, at 50,000 words. We hope to add more texts and disciplines in the future.

We are grateful to the Department of Literature, Area Studies and European Languages for funding at various stages of the development of macros and the compilation and annotation of the corpus.

Published Feb. 17, 2014 4:43 PM - Last modified May 6, 2024 10:44 AM