Christian Biemann from the University of Leipzig

Tekstlaboratoriet (the Text Laboratory) invites everyone interested to guest lectures by Christian Biemann from the University of Leipzig.

LECTURE 1: The Wortschatz Project: Language independent methods for enriching corpora
TIME: Tuesday 11 October, 12:15 - 14:00
VENUE: HW 536

ABSTRACT: The goals of the Wortschatz Project (University of Leipzig) are to process and provide large, annotated corpora for a variety of languages. The focus is on language-independent methods that enrich these plain-text corpora with structure, without using manually developed resources or language-dependent preprocessing.

Building mainly on an efficient implementation of co-occurrence statistics, the approaches for acquiring knowledge from text range from word sense discrimination through trend mining and time series analysis to thesaurus expansion and bilingual dictionary acquisition. Finally, the framework is applied to web graph analysis.
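As a rough illustration of what sentence-level co-occurrence statistics involve, the following minimal sketch counts how often two words appear in the same sentence and scores each pair. It is not the Wortschatz implementation: the function names, the toy sentences, and the choice of pointwise mutual information as the significance measure are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_counts(sentences):
    """Count word frequencies and sentence-level co-occurrences.

    Each word is counted at most once per sentence; whitespace
    tokenization is a simplifying assumption.
    """
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        word_freq.update(words)
        # Pairs are stored in alphabetical order, e.g. ("interest", "rates").
        pair_freq.update(combinations(words, 2))
    return word_freq, pair_freq


def pmi(a, b, word_freq, pair_freq, n_sentences):
    """Pointwise mutual information of two words (an assumed measure)."""
    a, b = sorted((a, b))
    joint = pair_freq[(a, b)]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n_sentences / (word_freq[a] * word_freq[b]))


# Hypothetical toy corpus, one sentence per string.
sentences = [
    "the bank raised interest rates",
    "the river bank was flooded",
    "interest rates fell again",
]
word_freq, pair_freq = cooccurrence_counts(sentences)
score = pmi("rates", "interest", word_freq, pair_freq, len(sentences))
```

Pairs that score well above zero co-occur more often than chance would suggest; a real system would use a far larger corpus and a more robust significance measure.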


LECTURE 2: Finding homogenous word sets: Towards a dissertation in NLP
TIME: Wednesday 12 October, 10:15 - 12:00
VENUE: HW 536

ABSTRACT: Methods are introduced that use corpus analysis to find sets of words that have something in common. With the objective of largely automating the task and putting the knowledge into algorithms instead of training sets, two kinds of methods can be distinguished: completely unsupervised methods (clustering) and weakly supervised methods (bootstrapping).
Two unsupervised variants of standard preprocessing steps will be discussed, namely language identification and part-of-speech tagging. In both, a novel, efficient graph clustering algorithm is employed.
After a general introduction to bootstrapping, which needs only a minimal training set, three bootstrapping experiments will be described: gazetteer construction for Named Entity Recognition, extension of a semantic lexicon, and expansion of a lexical-semantic word net.
Follow-ups on the latter two can give rise to automatic ontology creation and extension.



Published 18 Oct. 2005 10:13 - Last modified 18 June 2010 14:49