PhD Course: Corpora of text and speech & databases in research

Program

Task for PhD presentations: Each participant is expected to give a short presentation on Friday in which they answer the statistics task and describe how they would like to use corpora and statistics in their PhD thesis. This work can be done during the times allocated in the program.

Monday

09.00 – 09.30:    Welcome, coffee, registration

09.30 – 10.30: Janne Bondi Johannessen:
Introduction to corpora 
- History
- Corpora and the web
- Metadata
- Fields of linguistics where corpora can be useful (morphology, syntax, dialectology, sociolinguistics...)
- The corpora at the Text Laboratory, UiO

10.30 – 10.45: Short break 

10.45 – 11.30: Bård Uri Jensen: 
Introduction to statistics. 
Learning to use statistical methods for calculating significance in data retrieved from corpora:
- Fundamental issues, presentation of data, simple tests
- Presentation of data using different types of diagrams
- Presentation of data using key values
- Fundamentals of hypothesis testing
- Significance, Type I and Type II errors
- Test assumptions
- Comparing samples (e.g. t-test)
- Correlations (e.g. Pearson’s)
- Count data 
- Why Fisher’s Exact test is often not a good idea 
 

11.30 – 11.45: Short break

11.45 – 12.30: Bård Uri Jensen:  
Introduction to statistics (cont.)

12.30 – 13.30: Lunch

13.30 – 14.30: Short walk

14.30 – 15.15: Bård Uri Jensen:  
Introduction to statistics (cont.)

15.15 – 15.45: Coffee and cake break

15.45 – 18.30: Bård Uri Jensen:
Hands-on statistics. Work on presentation. Wrapping-up discussion.

19.30: Dinner
 

Tuesday

09.00 – 09.45: Atle Grønn: 
The RUN-Euro corpus (Russian, Norwegian, English, Swedish, Bosnian, Croatian, Serbian, Bulgarian, German, Italian, Polish)

09.45 – 10.00: Short break

10.00 – 10.45: Atle Grønn: The RUN-Euro corpus (cont.)

10.45 – 11.00: Short break

11.00 – 12.30: Atle Grønn, Kristin Hagen and Joel Priestley: Hands-on practice with linguistic tasks. Work on presentation. Wrapping-up discussion.

12.30 – 13.30: Lunch

13.30 – 14.30: Short walk

14.30 – 15.15: Janne Bondi Johannessen: 
Spoken language corpora
- Nordic Dialect Corpus (with Norwegian, Swedish, Danish, Icelandic and Faroese)
- NoTa (Corpus of Oslo Speech)
- TAUS (Corpus of older Oslo Speech)
- Big Brother Corpus
- Doctor-Patient Corpus
- Norwegian in America
- Ruija-Corpus (Finnish and Kven)

15.15 – 15.45: Coffee and cake break

15.45 – 16.30: Janne Bondi Johannessen and Kristin Hagen:
Spoken language corpora (cont.). 
- Corpus annotation: tagging and transcription

16.45 – 18.30: Janne Bondi Johannessen, Kristin Hagen and Joel Priestley:
Hands-on practice with linguistic tasks. Work on presentation. Wrapping-up discussion.

19.30: Dinner
 

Wednesday

09.00 – 09.45: Dag Haug: PROIEL (old Indo-European languages: Latin, Gothic, Armenian and Old Church Slavonic)

09.45 – 10.00: Short break

10.00 – 10.45: Dag Haug: PROIEL (cont.)

10.45 – 11.00: Short break

11.00 – 12.30: Dag Haug and Anders Nøklestad:
Hands-on practice with linguistic tasks. Work on presentation. Wrapping-up discussion.

12.30 – 13.30: Lunch

13.30 – 14.30: Short walk

14.30 – 15.15: Janne Bondi Johannessen: Introduction to databases
- Nordic Syntax Database
- Repertory of Conjectures on Horace
- Kelly (Keywords for Language Learning for Young and adults alike): word pairs from 9 languages (Arabic, English, Greek, Italian, Chinese, Norwegian, Polish, Russian, Swedish)
- Maid Chinese spoken dictionary

15.15 – 15.45: Coffee and cake break

15.45 – 16.30: Anders Nøklestad: Monolingual corpora
- French newspaper corpus
- Amharic corpus
- Norwegian Web as a Corpus
- Lexicographic Bokmål Corpus

16.45 – 18.30: Janne Bondi Johannessen and Anders Nøklestad:
Hands-on practice with linguistic tasks. Work on presentation. Wrapping-up discussion.

19.30: Dinner
 

Thursday

09.00 – 09.45: Hilde Hasselgård: 
Introduction to Oslo Multilingual Corpus (English, Norwegian, German, French, Portuguese) and British National Corpus

09.45 – 10.00: Short break

10.00 – 10.45: Hilde Hasselgård: Introduction to Oslo Multilingual Corpus (cont.)

10.45 – 11.00: Short break

11.00 – 12.30: Hilde Hasselgård, Anders Nøklestad and Joel Priestley:
Hands-on practice with linguistic tasks. Work on presentation. Wrapping-up discussion.

12.30 – 13.30: Lunch

13.30 – 14.30: Short walk

14.30 – 15.15: Bård Uri Jensen: 
Statistics: More advanced issues, multivariate analyses, sparse data
- Analysis of variance (ANOVA)
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Sparse or skewed data, combining methods

15.15 – 15.45: Coffee and cake break

15.45 – 18.30: Bård Uri Jensen:
Hands-on statistics. Work on presentation. Wrapping-up discussion.

19.30: Dinner

 

Friday

09.00 – 09.45: Bård Uri Jensen: Summary, statistics

09.45 – 10.00: Short break

10.00 – 11.00: Student Presentations with feedback

11.00 – 11.15: Short break

11.15 – 12.30: Student Presentations with feedback

12.30 – 13.30: Lunch

13.30 – 14.30: Student Presentations with feedback

14.30 – 14.45: Coffee and cake break 

14.45 – 15.45: Student Presentations with feedback 

15.45 – 16.00: Closing
 

 

Published 4 May 2015 10:09 - Last modified 19 Aug. 2022 13:39