Modelling spelling variation from electronic diplomatic transcripts
Jacob Thaisen, ILOS
The presentation demonstrates that N-gram model perplexity, a standard metric in natural language processing, is an adequate and objective similarity metric for Middle English spelling data, despite the lexical differences between texts. N-gram models have rarely been constructed for the variable spelling systems characteristic of Middle English, most likely because a successful model presupposes a sizable body of training data. Instead, researchers have traditionally assessed similarity by visual, predominantly qualitative, comparison of the spelling forms of selected words collected from samples of texts. Diplomatic transcripts of longer medieval English texts are increasingly becoming available in electronic form. Their arrival promises full models, optimised through smoothing and interpolation, as a basis for quantification and rigorous testing. My examples of the adequacy of the perplexity metric are relevant to textual studies. For instance, a scribe’s spelling is always biased in the direction of his exemplars. When other factors such as authorship and poetic form are held constant, this bias opens a window on the number of scribes behind the exemplars for a text executed in a single hand.
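As a rough illustration of the metric, the sketch below trains a character-level trigram model on one transcript and reports the perplexity it assigns to another; a lower value would indicate a closer spelling system. It is a minimal sketch under stated assumptions, not the model used in the presentation: it substitutes simple add-k smoothing for the smoothing and interpolation mentioned above, the function names `train_char_ngram` and `perplexity` are hypothetical, and the sample strings in the usage note are invented for illustration rather than taken from any transcript.

```python
import math
from collections import Counter


def train_char_ngram(text, n=3):
    """Count character n-grams and their (n-1)-character contexts.

    Assumption: "^" is a padding symbol that does not occur in the text.
    """
    padded = "^" * (n - 1) + text
    positions = range(len(padded) - n + 1)
    ngrams = Counter(padded[i:i + n] for i in positions)
    contexts = Counter(padded[i:i + n - 1] for i in positions)
    return ngrams, contexts, set(padded)


def perplexity(model, text, n=3, k=0.1):
    """Perplexity of `text` under an add-k smoothed character n-gram model.

    Add-k smoothing is a stand-in here; `n` must match the order used in
    training, and the value of `k` is an arbitrary illustrative choice.
    """
    if not text:
        raise ValueError("text must be non-empty")
    ngrams, contexts, alphabet = model
    # Characters unseen in training still receive some probability mass.
    vocab_size = len(alphabet | set(text))
    padded = "^" * (n - 1) + text
    log_prob = 0.0
    count = 0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        p = (ngrams[gram] + k) / (contexts[gram[:-1]] + k * vocab_size)
        log_prob += math.log2(p)
        count += 1
    # Perplexity is 2 raised to the average negative log2 probability.
    return 2 ** (-log_prob / count)
```

For example, with `model = train_char_ngram("whan that aprill with his shoures soote")`, one would expect `perplexity(model, "whan that aprille with hise shoures sote")` to be lower than `perplexity(model, "quhen yat aperill wyth hys schowris swete")`, because the first string shares far more letter sequences with the training text. Strings this short only show the mechanics; meaningful comparison presupposes the sizable transcripts discussed above.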