ENPC: P-O-S tagging

The English-Norwegian Parallel Corpus

Extensions of the project

Part-of-speech tagging

 

The Norwegian part of the corpus has recently been tagged (October 2001), and we are in the process of post-editing the tagged texts.

The original texts of the English part of the ENPC have been tagged for part-of-speech (P-O-S). The tagging was done automatically using the English Constraint Grammar parser developed by Atro Voutilainen, Juha Heikkilä, Arto Anttila and Pasi Tapanainen according to the Constraint Grammar framework originally proposed by Fred Karlsson. We are grateful to Atro Voutilainen, Helsinki, for doing the actual tagging.

Before the tagger could be applied, its lexicon had to be updated: the texts in the corpus were checked for words not already in the lexicon, and these were manually given P-O-S tags. In addition, all SGML/TEI tags and entities had to be removed from the texts. After the tagger had been run on the texts, the next step was to merge the original text, i.e. the text with alignment information, with the P-O-S tagged text. The example below shows the tagger output before merging and the result after merging:

Before merging:

"<JANUARY>"
"January" <ADV-N> <Proper> N NOM SG
"<$>"
"<$>"
"<The>"
"the" DET SG/PL
"<year>"
"year" <ADV-N> N NOM SG
"<began>"
"begin" V PAST
"<with>"
"with" PREP
"<lunch>"
"lunch" N NOM SG
"<$.>"
"<$<s>>"

After merging:

<div1 type=part id=PM1.1>
<head id=PM1.1.h1 corresp=PM1T.1.h1>
JANUARY <w lemma="January" pos="N NOM SG" feature="ADV-N Proper">
</head>
<pb n=1>
<p id=PM1.1.p1>
<s id=PM1.1.s1 corresp=PM1T.1.s1>
The <w lemma="the" pos="DET SG/PL">
year <w lemma="year" pos="N NOM SG" feature="ADV-N">
began <w lemma="begin" pos="V PAST">
with <w lemma="with" pos="PREP">
lunch <w lemma="lunch" pos="N NOM SG">
.
</s></p>
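
The merging step can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used at the Text Laboratory. The function names (parse_cohorts, reading_to_w_tag, merge) are hypothetical, and the sketch assumes that the tagger output is fully disambiguated (one reading per word), that word forms beginning with "$" are placeholders for punctuation or removed markup, and that the aligned text has one word, tag or punctuation mark per line.

```python
import re

# A reading line looks like: "lemma" <FEATURE> ... TAG TAG ...
READING_RE = re.compile(r'^"(?P<lemma>[^"]*)"\s*(?P<tags>.*)$')


def parse_cohorts(cg_output):
    """Return (wordform, reading) pairs from Constraint Grammar style output.

    Assumption: word forms starting with "$" are placeholders and are skipped;
    only the first reading after each word form is kept.
    """
    cohorts = []
    wordform = None
    for line in cg_output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('"<') and line.endswith('>"'):
            form = line[2:-2]
            wordform = None if form.startswith("$") else form
        elif wordform is not None:
            cohorts.append((wordform, line))
            wordform = None
    return cohorts


def reading_to_w_tag(reading):
    """Turn one reading into a <w lemma=... pos=... feature=...> annotation."""
    match = READING_RE.match(reading.strip())
    if match is None:
        raise ValueError("not a reading line: %r" % reading)
    features, pos = [], []
    for token in match.group("tags").split():
        if token.startswith("<") and token.endswith(">"):
            features.append(token[1:-1])  # angle-bracketed tags become features
        else:
            pos.append(token)
    attrs = ['lemma="%s"' % match.group("lemma"), 'pos="%s"' % " ".join(pos)]
    if features:
        attrs.append('feature="%s"' % " ".join(features))
    return "<w " + " ".join(attrs) + ">"


def merge(aligned_text, cg_output):
    """Attach a <w ...> annotation to every word line of the aligned text."""
    cohorts = iter(parse_cohorts(cg_output))
    merged = []
    for line in aligned_text.splitlines():
        token = line.strip()
        if not token:
            continue
        is_word = not token.startswith("<") and any(c.isalnum() for c in token)
        if not is_word:
            merged.append(token)  # markup and punctuation pass through unchanged
            continue
        cohort = next(cohorts, None)
        if cohort is None or cohort[0] != token:
            raise ValueError("texts out of step at %r" % token)
        merged.append(token + " " + reading_to_w_tag(cohort[1]))
    return "\n".join(merged)


CG_SAMPLE = '''"<JANUARY>"
"January" <ADV-N> <Proper> N NOM SG
"<$>"
"<$>"
"<The>"
"the" DET SG/PL
"<year>"
"year" <ADV-N> N NOM SG
"<began>"
"begin" V PAST
"<with>"
"with" PREP
"<lunch>"
"lunch" N NOM SG
"<$.>"
"<$<s>>"'''

ALIGNED_SAMPLE = '''<head id=PM1.1.h1 corresp=PM1T.1.h1>
JANUARY
</head>
<s id=PM1.1.s1 corresp=PM1T.1.s1>
The
year
began
with
lunch
.
</s></p>'''
```

Calling merge(ALIGNED_SAMPLE, CG_SAMPLE) walks both texts in step and raises an error as soon as a word in the aligned text does not match the next word form in the tagger output, which is a cheap way to detect the two files drifting apart.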

 

The last step was to convert the P-O-S tags in the merged file to the format used in the extended version of the tagger, EngCG-2. This compact tag set consists of some 35 simple tags. The final output format, which is TEI-compliant, looks like this:

<div1 type=part id=PM1.1>
<head id=PM1.1.h1 corresp=PM1T.1.h1><w p="Nadv">JANUARY</w></head><pb n=1>
<p id=PM1.1.p1>
<s id=PM1.1.s1 corresp=PM1T.1.s1><w p="DET">The</w> <w p="Nadv">year</w> <w l="begin" p="Vpast">began</w> <w p="PREP">with</w> <w p="N">lunch</w>.</s></p>
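
The conversion to the compact format can be sketched as a table lookup. The mapping below is hypothetical and covers only the tag combinations that occur in the example sentence; it is not the actual EngCG-2 tag set of some 35 tags. The rule that the lemma attribute (l) is written only when the lemma differs from the word form, ignoring case, is likewise an assumption inferred from the example output, and the function name to_compact is invented for illustration.

```python
# Hypothetical mapping from (verbose pos, feature) to a compact tag.
# Only the combinations seen in the example sentence are listed.
COMPACT_TAGS = {
    ("DET SG/PL", ""): "DET",
    ("N NOM SG", ""): "N",
    ("N NOM SG", "ADV-N"): "Nadv",
    ("N NOM SG", "ADV-N Proper"): "Nadv",
    ("V PAST", ""): "Vpast",
    ("PREP", ""): "PREP",
}


def to_compact(word, lemma, pos, feature=""):
    """Wrap a word in a <w> element carrying the compact tag.

    Assumption: the lemma is recorded only when it differs from the
    word form (compared case-insensitively), as in the example output.
    """
    attrs = 'p="%s"' % COMPACT_TAGS[(pos, feature)]
    if lemma.lower() != word.lower():
        attrs = 'l="%s" ' % lemma + attrs
    return "<w %s>%s</w>" % (attrs, word)


SENTENCE = [
    ("The", "the", "DET SG/PL", ""),
    ("year", "year", "N NOM SG", "ADV-N"),
    ("began", "begin", "V PAST", ""),
    ("with", "with", "PREP", ""),
    ("lunch", "lunch", "N NOM SG", ""),
]

compact_sentence = " ".join(to_compact(*item) for item in SENTENCE) + "."
```

A tag combination missing from the table raises a KeyError instead of being passed through silently, so gaps in the mapping surface during conversion.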

The merging of the original and P-O-S tagged texts was carried out by Diana Santos and Helge Hauglin at the Text Laboratory, University of Oslo. The Norwegian texts were tagged using the Oslo-Bergen tagger, which is a Norwegian version of the English Constraint Grammar parser. This work was done by Anders Nøklestad at the Text Laboratory.

Published July 6, 2010 10:39 AM - Last modified July 12, 2010 3:40 PM