Searching the OMC/ENPC using PerlTCE

Searching the Oslo Multilingual Corpus / English-Norwegian Parallel Corpus with Perl Translation Corpus Explorer

By Hilde Hasselgård



Getting started

To get to the Oslo Multilingual Corpus (OMC) or the English-Norwegian Parallel Corpus (ENPC), you open the browser page at http://www.tekstlab.uio.no/cgi-bin/omc/PerlTCE.cgi, and type in your username and password.

Note: This link, and some of the others in this document, will only work if you have access to the OMC/ENPC. You will find an application form here. Note that permission will normally be given only to staff and students in the arts faculties at the universities of Oslo and Bergen who use the corpus for research or for courses where corpus work is specified as part of the syllabus.

With the default settings, the browser will search in English original texts from the fiction component of the corpus (the structure of the corpus is explained at http://www.hf.uio.no/ilos/english/services/omc/enpc/index.html).

Alternatively you can go directly to the corpus you want by using the quick link in the top right-hand corner of the OMC homepage. Here you will also need to type in your username and password.


Performing a simple search

To perform a search, type a word (e.g. however) in the 'Enter search' box and click the 'Submit search' button. The next screen will show you the sentences in which however occurs in the corpus, with the Norwegian translations immediately following the English sentences.

  • The three boxes in the third row of the search form allow you to specify your search. The default settings are 'Fiction', 'English', and 'Original'.
    • The first box allows you to choose among different databases. 'ENPC/Fiction' and 'ENPC/Non-fiction' belong to the English-Norwegian Parallel Corpus, while the other databases are part of the OMC and contain more languages. E.g. En-Ge-No contains English originals with German and Norwegian translations.
    • The second box is for specifying which language you want to search in.
    • The third box gives you the choice between original and translated text.
  • When the box for 'hide tags' is ticked (the default), you will not see the long identification tag of each sentence, and special characters come out on the screen. Remove the tick and run a search to see the difference.
  • The box for 'direct speech' allows you to search in only the dialogue part of the fiction text (when the box is ticked). NB: This applies only to the ENPC, original texts.
  • The box for 'position' can be used if you are looking for a word in a certain position in a sentence. If you write 1 in the box, the browser will find only the examples where the word you are looking for is the first word in a sentence. Writing -1 finds the examples where it is the last word in a sentence.
  • The box for 'context' allows you to specify the number of sentences (max. 25) to be shown before and after the sentence containing your search word.
  • ‘Number of hits to display per page’ can be set to 50, 100, or 200. The default is 100: the first page shows the first 100 hits, and the rest of the results appear on a second page, which you reach by clicking on a link. Example: If you search for good in ENPC fiction, you should get the following message on the results page:

Total before filters: 447. Displaying first 100 matches.

good : 447
Results: 101 - 447. (after filters)

  • If you want to search for a word with alternative forms or spellings, you can write the alternative forms together, separated by |. (The | means or.) Example: bein|ben
  • By ticking the box ‘sort output by matched word’ you get an alphabetically ordered concordance (word list) if you have searched for alternative forms or used a wildcard in your search (see below).
  • The box ‘List texts in corpus’ (below the search form) gives you a list of the texts included in the database shown in the box.
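The 'position', 'context' and 'hits per page' settings can be pictured with a small Python sketch. It is only an illustration of the behaviour described above, not the browser's own code: the function names are made up for this example, sentences are assumed to be lists of lower-case words with punctuation left out, and the hit numbers are simply the integers 1 to 447 from the example message.

```python
def position_matches(tokens, word, position):
    """True if `word` is at the given position: 1 = first word, -1 = last word."""
    if position == 0 or abs(position) > len(tokens):
        return False
    # Positive positions count from the start (1-based), negative from the end.
    index = position - 1 if position > 0 else position
    return tokens[index] == word


def context_window(sentences, hit_index, context):
    """Return the hit sentence with up to `context` sentences before and after it."""
    context = min(context, 25)  # the search form allows at most 25
    start = max(0, hit_index - context)
    return sentences[start:hit_index + context + 1]


def split_pages(hits, per_page=100):
    """First page holds `per_page` hits; the second page holds the rest."""
    return hits[:per_page], hits[per_page:]


if __name__ == "__main__":
    sentence = ["however", "it", "was", "late"]  # invented toy sentence
    print(position_matches(sentence, "however", 1))
    print(position_matches(sentence, "late", -1))
    first, rest = split_pages(list(range(1, 448)))
    print(len(first), rest[0], rest[-1])
```

With 447 hits and the default of 100 per page, the second page in this sketch runs from hit 101 to hit 447, which corresponds to the "Results: 101 - 447" line in the message above.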

Note:

  • Don’t use capital letters in your search, not even for proper nouns.
  • The ‘Enter search’ box can only contain one word. If you want to search for a string of words, you need to use the filters (see below).
  • It is possible to search for punctuation marks (e.g. ? to find all the questions in the corpus).
  • The code in brackets that appears at the end of each example is a reference to a corpus text. “T” at the end of a code shows that the sentence comes from a translated text.



Searching with filters

The various filters allow you to make a more refined search, e.g.:

  • search for a word at a fixed point in a sentence (e.g. the first word)
  • search for word combinations, using the and/not +/- <filter> box. Entering red in the search box and AND +3 blue in the and/not +/- filter box will give you all examples where red is followed by blue within a span of 3 words.
  • specify the relationship between original text and translation by using and/not <filter>. For example, the search string however combined with the filter AND imidlertid will give you all examples where the English sentence has however and the Norwegian sentence has imidlertid. A filter with NOT, e.g. however combined with NOT imidlertid, will give you all examples where however does not correspond to imidlertid.
  • The filters can be combined with each other. It is also possible to specify two filters in each category. (E.g. red in the search box, AND +3 blue in the first and/not +/- filter box and NOT +5 white in the second and/not +/- filter box will give you combinations of red and blue, but not red, white and blue.)
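The logic of the filters can be sketched in a few lines of Python. This is a rough model under stated assumptions, not a description of how PerlTCE is implemented: the function names are hypothetical, each sentence is a list of lower-case words, a filter is taken to apply to the sentence as a whole, and the example sentences are invented rather than taken from the corpus.

```python
def followed_within(tokens, word, other, span):
    """True if `other` occurs among the `span` words that follow `word`."""
    for i, token in enumerate(tokens):
        if token == word and other in tokens[i + 1:i + 1 + span]:
            return True
    return False


def red_blue_not_white(tokens):
    """Model of: red, AND +3 blue, NOT +5 white."""
    return (followed_within(tokens, "red", "blue", 3)
            and not followed_within(tokens, "red", "white", 5))


def translation_filter(pairs, search, target, negate=False):
    """Keep aligned sentence pairs where the source has `search` and the
    translation has `target` (AND), or lacks it (NOT, with negate=True)."""
    kept = []
    for source, translation in pairs:
        if search not in source:
            continue
        if (target in translation) != negate:
            kept.append((source, translation))
    return kept


if __name__ == "__main__":
    # Invented toy data, not corpus examples.
    pairs = [
        ("however he left".split(), "imidlertid dro han".split()),
        ("however she stayed".split(), "men hun ble".split()),
    ]
    print(translation_filter(pairs, "however", "imidlertid"))
    print(translation_filter(pairs, "however", "imidlertid", negate=True))
    print(red_blue_not_white("a red and blue flag".split()))
    print(red_blue_not_white("red white and blue".split()))
```

The sketch only covers positive spans such as +3 and +5, since those are the cases illustrated above; consult the Help menu for the exact behaviour of the filters.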

 

Read the Help menu for further details about searching in the corpus.


Wildcard (*)

A wildcard is a character that represents one or more unspecified characters. The wildcard used in the OMC is *. Note that the question mark (?) is not used as a wildcard in the OMC/ENPC. (On the contrary, a search for “?” will find all the question marks in the corpus.)

Wildcards are useful if you are unsure of the spelling of a word, or if a word has alternative spellings.

Examples:

  • If you want to look up all forms of the word mind (i.e. mind - minds - minded - minding), you can use the * wildcard to represent any set of characters. A search for mind* in the ENPC finds minds, minded, and minding, as well as mind's and mindful. Note that it does not find mind itself, only words where mind is followed by one or more characters. To find all forms including mind, type mind|mind* in the “Enter search” field.
  • Many English words can be spelt with the endings -ize or -ise. To make sure you get all uses of the word realize/realise you can type reali*, so that you get both spelling variants in the same search. (Alternatively, you can search for realize|realise.)

 

Wildcards can, in principle, be used to represent the beginning or the end of a word. Note that a search for a word with a wildcard at the beginning (e.g. *ly, to find all words ending in -ly) will usually take rather a long time, because the browser will have to check all the words in the corpus from beginning to end.
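If you are used to regular expressions, the wildcard and the | can be thought of as follows. The Python sketch below is an approximation based only on the behaviour described in this section (the * stands for one or more characters, | separates alternatives, and ? is an ordinary character); the function names and the word list are made up for illustration, and the actual matching in the browser may differ in detail.

```python
import re


def pattern_to_regex(search):
    """Translate a search string such as 'mind|mind*' into a regular expression."""
    alternatives = []
    for alternative in search.split("|"):
        # Escape everything except *, so that ? is treated as a literal character.
        parts = [re.escape(part) for part in alternative.split("*")]
        # Each * must stand for at least one character, hence ".+" and not ".*".
        alternatives.append(".+".join(parts))
    return re.compile("(?:" + "|".join(alternatives) + ")")


def search_words(words, search):
    """Return the words that match the search string as a whole."""
    regex = pattern_to_regex(search)
    return [word for word in words if regex.fullmatch(word)]


if __name__ == "__main__":
    words = ["mind", "minds", "minded", "mindful", "remind",
             "realize", "realise", "reality", "slowly", "?"]
    for query in ["mind*", "mind|mind*", "reali*", "*ly", "?"]:
        print(query, "->", search_words(words, query))
```

In this model, mind* leaves out mind itself, while mind|mind* includes it. Note also that reali* matches every word beginning with reali-, so a word like reality would be among the hits as well; realize|realise is the narrower search.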



Saving your results

You can save your search results by using the 'Save' or 'Save as' option in your web browser ('Lagre' / 'Lagre som'). You can choose between saving the results as an html file or as a text (txt) file. A text file can be imported into a word processor and edited. If you only need to save a few of your search results, the easiest way might be to copy the examples you want and paste them into a Word file.

For a large corpus investigation it is usually practical to store the results in a database, where they may be annotated, sorted and retrieved in various ways.
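As one possible way of doing this, the sketch below stores a few hits in an SQLite database using Python's standard sqlite3 module, adds an annotation, and retrieves the rows in sorted order. The table layout, the column names, the text codes and the example sentences are all invented for the illustration; any database program you are comfortable with will serve the same purpose.

```python
import sqlite3


def store_hits(conn, hits):
    """Create the table if needed and insert (text_code, source, translation) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "id INTEGER PRIMARY KEY, text_code TEXT, source TEXT, "
        "translation TEXT, note TEXT DEFAULT '')"
    )
    conn.executemany(
        "INSERT INTO hits (text_code, source, translation) VALUES (?, ?, ?)",
        hits,
    )
    conn.commit()


def annotate(conn, translation_pattern, note):
    """Attach a note to every hit whose translation matches the LIKE pattern."""
    conn.execute(
        "UPDATE hits SET note = ? WHERE translation LIKE ?",
        (note, translation_pattern),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file name to keep the data
    # Invented examples with placeholder text codes, not real corpus output.
    store_hits(conn, [
        ("TEXT1", "However, he left.", "Imidlertid dro han."),
        ("TEXT2", "However, she stayed.", "Men hun ble."),
    ])
    annotate(conn, "%imidlertid%", "imidlertid")
    for row in conn.execute("SELECT text_code, note FROM hits ORDER BY text_code"):
        print(row)
```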



Using the tagged ENPC

The original texts in the ENPC have been tagged and lemmatized (that is, each word has a word class tag, and all grammatical forms of a word are grouped together under one lexeme).

  • Log on by clicking on the link “ENPC (tagged)” on the PerlTCE browser page. You will see that the interface looks slightly different from that of the untagged OMC/ENPC.
  • The box marked "L" means "lemma". (A lemma is a group of grammatically related word forms.) Tick this box and write take as your search string. Press "search". You will then see all occurrences of the lemma "take" (take, takes, took, taken) in the corpus. If the lemma box is not ticked, you will only get the word form take.
  • If you try the same kind of search with like, the search will produce not only all forms of the verb like, but also the preposition like plus the nouns likes (as in likes and dislikes) and liking and the adjective liked. In order to exclude the preposition, for example, tick the "not" box and choose "PREP" in the box to the right. Press "search" again. You will still have nouns and adjectives among your hits, so a better idea is to remove the tick in the “not” box and instead select all the word class codes preceded by “V” (“V” on its own won’t find anything in the English material, unfortunately) plus “ING”.
  • The next step might be to see how often the verb like is followed by an infinitive or by a present participle. Write "AND +1 to" in the box after the ‘L box’ in the “Original” row to get LIKE TO. Write “AND +1” in the same box and select the tag ING to get all examples of the verb like followed immediately by a present participle.
  • There is further information on how to search in the tagged ENPC just below the search form (or click here).
  • A list of all the word class tags is found in the manual to the ENPC.
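The difference between searching for a word form and searching for a lemma, and the effect of restricting the search by word class, can be illustrated with a toy example in Python. This is a sketch only: the function names are hypothetical, the handful of tokens is invented, and apart from PREP and ING the tag names (VPRES, VPAST, VINF, INFMARK, PRON) are placeholders rather than the actual ENPC tags, which are listed in the manual.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # the word as it appears in the text
    lemma: str  # the lexeme it belongs to
    tag: str    # word class tag (placeholder names, see above)


def find(tokens, search, by_lemma=False, tags=None, exclude_tags=None):
    """Return the positions of tokens matching `search` as a form or a lemma."""
    hits = []
    for i, token in enumerate(tokens):
        key = token.lemma if by_lemma else token.form
        if key != search:
            continue
        if tags is not None and token.tag not in tags:
            continue
        if exclude_tags is not None and token.tag in exclude_tags:
            continue
        hits.append(i)
    return hits


def next_token_is(tokens, i, form=None, tag=None):
    """Model of an 'AND +1' filter: check the token right after position i."""
    if i + 1 >= len(tokens):
        return False
    following = tokens[i + 1]
    return ((form is None or following.form == form)
            and (tag is None or following.tag == tag))


# Invented toy text, stored as one flat list that ignores sentence boundaries.
TOKENS = [
    Token("she", "she", "PRON"), Token("likes", "like", "VPRES"),
    Token("to", "to", "INFMARK"), Token("swim", "swim", "VINF"),
    Token("he", "he", "PRON"), Token("looks", "look", "VPRES"),
    Token("like", "like", "PREP"), Token("her", "she", "PRON"),
    Token("they", "they", "PRON"), Token("liked", "like", "VPAST"),
    Token("swimming", "swim", "ING"),
]

if __name__ == "__main__":
    print(find(TOKENS, "like"))                  # word form only
    print(find(TOKENS, "like", by_lemma=True))   # all forms of the lemma
    verbs = find(TOKENS, "like", by_lemma=True, tags={"VPRES", "VPAST"})
    print([i for i in verbs if next_token_is(TOKENS, i, form="to")])
    print([i for i in verbs if next_token_is(TOKENS, i, tag="ING")])
```

Selecting the verb tags directly, as in the last two searches, corresponds to the advice above: it is usually more reliable than excluding one unwanted word class with "not".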

A word of warning: The tagging has been performed automatically, and although the analysis is fairly reliable and has been partly checked, there are still some errors. If you are using the material for research, always check that your results are correct.



Hands-On

  • Look up words ending in -ish in the ENPC, and see how this ending (in words like 'reddish') is translated into Norwegian.
  • Use an English-Norwegian dictionary and check the translations of please, pardon, mister, and lady. Then look these words up in the ENPC. Which translations do you find? To what extent do the corpus findings agree with the dictionary?
  • Use the tagged corpus to look for the verb and noun show. How many do you find of each? What is the most common Norwegian correspondence of the verb? Of the noun?


© Hilde Hasselgård and the Department of Literature, Area Studies and European Languages, University of Oslo

 

Published July 6, 2010 10:39 AM - Last modified June 7, 2013 11:53 AM