
The Oslo Transliterator

The Oslo Transliterator is a semi-automatic tool that helps create a second, alternative transcription from an original transcription. It can be trained on any dialect or language.

The Oslo Transliterator web interface.

The Oslo Transliterator transforms one transcription into another, usually a phonetic transcription into an orthographic one. The output is then manually corrected, and the resulting pair of transcriptions is used to train the transliterator for that particular language variety (dialect), which improves its performance on that dialect in subsequent iterations.

The Oslo Transliterator has been used in the development of the Nordic Dialect Corpus (Johannessen et al. 2009) and the Finland-Swedish Talko corpus, and is presently being used for the Norwegian LIA corpus, both for converting transcriptions to orthography and for converting from one orthographic standard to another.

Contact tekstlab-post@iln.uio.no if you want more information about the Oslo Transliterator.

How the transliterator works

The transliterator has a web interface (see the picture above) that can be accessed using any modern web browser. The application is implemented in the Ruby on Rails web framework, using the Ext JS JavaScript framework and a MySQL database. The top buttons in the left panel let the user register a new dialect or start work on a new transcription in an existing dialect.

Once a transcription has been uploaded, it is automatically split into a number of file parts, which are listed separately in the right panel. The four rightmost buttons on each line in the right panel allow the user to 1) have the part automatically transliterated, 2) download it for manual correction, 3) upload it after correcting it, and 4) re-train the system on all the parts that have been corrected so far. Manually corrected parts are marked with a green check mark. When all parts of a transcription have been transliterated and manually corrected, the entire transliteration can be downloaded using the appropriate button in the left panel.

The first time the system is applied to a new dialect, it has no knowledge of how words in that dialect should be transliterated. However, if the dialect is similar to other dialects that the system has already been trained on, it can be marked as such in the database. The system's initial guesses for the new dialect then stem from weighted combinations of the transliterations in those other dialects. If a word has no existing transliteration in any dialect marked as similar, the system simply suggests a transliteration that is identical to the original, phonetically transcribed form.
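The cold-start behaviour described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation (which is a Ruby on Rails application backed by MySQL); the function name initial_guess, the per-dialect weights, the normalisation of counts into relative frequencies, and the example words are all assumptions made for illustration.

```python
from collections import Counter

def initial_guess(word, similar, counts):
    """Guess a transliteration for `word` in a dialect with no training data.

    similar: hypothetical mapping of similar dialect -> weight
    counts:  hypothetical mapping of dialect -> word -> Counter of transliterations
    """
    scores = Counter()
    for dialect, weight in similar.items():
        seen = counts.get(dialect, {}).get(word, {})
        total = sum(seen.values())
        for target, n in seen.items():
            # Assumption: each dialect contributes its relative frequency,
            # scaled by how similar it is to the new dialect.
            scores[target] += weight * n / total
    if not scores:
        # No similar dialect knows this word: suggest the original form unchanged.
        return word
    return max(scores, key=scores.get)

# Invented example data for two already-trained dialects.
counts = {
    "north": {"ikkje": Counter({"ikke": 3})},
    "west": {"ikkje": Counter({"ikkje": 1, "ikke": 1})},
}
similar = {"north": 0.7, "west": 0.3}

print(initial_guess("ikkje", similar, counts))  # ikke
print(initial_guess("heime", similar, counts))  # heime (unseen, returned as-is)
```

How the real system combines and weights the similar dialects is not specified here, so the scoring rule above should be read as one plausible choice among several.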

When the system is trained on a set of transliterated parts, it takes each word in the original transcription, looks it up in the database and updates the number of times the word has been given this particular transliteration. As a result, the next time the same word is encountered in a new transcription, the system is more likely to provide a correct transliteration.
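The training step described above amounts to keeping a count per (word, transliteration) pair. The following Python sketch shows the idea under stated assumptions: an in-memory dictionary stands in for the database, the original and corrected parts are assumed to align word by word, the suggestion is assumed to be the most frequent transliteration, and the names train and suggest, like the example words, are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory stand-in for the database:
# counts[dialect][source_word][transliteration] -> number of times seen
counts = defaultdict(lambda: defaultdict(Counter))

def train(dialect, original_words, corrected_words):
    """Update the counts from one manually corrected part."""
    if len(original_words) != len(corrected_words):
        # Assumption: the two transcriptions align one word to one word.
        raise ValueError("original and corrected parts must align word by word")
    for source, target in zip(original_words, corrected_words):
        counts[dialect][source][target] += 1

def suggest(dialect, word):
    """Return the most frequent transliteration, or the word itself if unseen."""
    seen = counts[dialect].get(word)
    if not seen:
        return word
    return seen.most_common(1)[0][0]

# Invented example: two corrected parts from the same dialect.
train("dialect_a", ["e", "veit", "ikkje"], ["jeg", "vet", "ikke"])
train("dialect_a", ["e", "e"], ["jeg", "er"])

print(suggest("dialect_a", "e"))      # jeg (seen twice, against once for "er")
print(suggest("dialect_a", "heime"))  # heime (never seen, returned unchanged)
```

Each further round of correction and re-training adds to the same counts, which is why suggestions for frequent words tend to stabilise quickly.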

Besides its main output, a quick rendering of the phonetic transcription into an orthographic one, the transliterator also yields a useful by-product: a full multidialectal word list in which the dialect forms are linked through their orthographic standard forms.
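One way to picture such a word list is as an index keyed by orthographic form, with the attested forms grouped per dialect. The Python sketch below is only an illustration: the function name build_word_list, the triple-based input format, and the example words are assumptions, not the tool's actual data model.

```python
from collections import defaultdict

def build_word_list(corrected_pairs):
    """Group dialect forms under their shared orthographic standard form.

    corrected_pairs: iterable of (dialect, dialect_form, orthographic_form)
    Returns: orthographic_form -> dialect -> set of dialect forms
    """
    word_list = defaultdict(lambda: defaultdict(set))
    for dialect, form, standard in corrected_pairs:
        word_list[standard][dialect].add(form)
    return word_list

# Invented example pairs from two dialects.
pairs = [
    ("dialect_a", "ikkje", "ikke"),
    ("dialect_b", "itte", "ikke"),
    ("dialect_a", "e", "jeg"),
    ("dialect_b", "je", "jeg"),
]

word_list = build_word_list(pairs)
for dialect, forms in sorted(word_list["ikke"].items()):
    print(dialect, sorted(forms))
# dialect_a ['ikkje']
# dialect_b ['itte']
```

Looking up a standard form then returns every dialect variant recorded for it, which is the sense in which the forms are linked through the orthography.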

Published Jan. 27, 2017 2:32 PM - Last modified July 5, 2023 12:59 PM