Corpus-based computational dialectology: Data, methods and results

The CorCoDial (corpus-based computational dialectology) project aims to infer dialect classifications from variation-rich corpora, with a particular focus on the dialect-to-standard normalization task, which makes different texts comparable. In this talk, I will present several case studies that draw mainly on the Finnish, Norwegian and Swiss German dialect landscapes. In the first study, we investigate to what extent topic models can find dialectological rather than semantic topics. In the second study, we formulate the dialect-to-standard normalization task as a neural machine translation problem and investigate what the embeddings of speaker labels tell us about the origin of the speakers. If time permits, I will also talk about the use of automatic character alignment for the induction of phonological and morphological variation patterns.
 
Published Nov. 10, 2023 9:43 AM - Last modified Dec. 11, 2023 9:20 PM