DanTok: Domain Beats Language for Danish Social Media POS Tagging
Kia Kirstein Hansen | Maria Barrett | Max Müller-Eberstein | Cathrine Damgaard | Trine Eriksen | Rob Goot
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Language from social media remains challenging to process automatically, especially for non-English languages. In this work, we introduce the first NLP dataset for TikTok comments and the first Danish social media dataset with part-of-speech annotation. We further supply annotations for normalization, code-switching, and annotator uncertainty. As transferring models to such a highly specialized domain is non-trivial, we conduct an extensive study into which source data and modeling decisions most impact the performance. Surprisingly, transferring from in-domain data, even from a different language, outperforms in-language, out-of-domain training. These benefits nonetheless rely on the underlying language models having been at least partially pre-trained on data from the target language. Using our additional annotation layers, we further analyze how normalization, code-switching, and human uncertainty affect the tagging accuracy.
“I’ll be there for you”: The One with Understanding Indirect Answers
Cathrine Damgaard | Paulina Toborek | Trine Eriksen | Barbara Plank
Proceedings of the 2nd Workshop on Computational Approaches to Discourse
Indirect answers are replies to polar questions without the direct use of word cues such as ‘yes’ and ‘no’. Humans are very good at understanding indirect answers, such as ‘I gotta go home sometime’, when asked ‘You wanna crash on the couch?’. Understanding indirect answers is a challenging problem for dialogue systems. In this paper, we introduce a new English corpus to study the problem of understanding indirect answers. Instead of crowd-sourcing both polar questions and answers, we collect questions and indirect answers from transcripts of a prominent TV series and manually annotate them for answer type. The resulting dataset contains 5,930 question-answer pairs. We release both aggregated and raw human annotations. We present a set of experiments in which we evaluate Convolutional Neural Networks (CNNs) for this task, including a cross-dataset evaluation and experiments with learning from disagreements in annotation. Our results show that the task of interpreting indirect answers remains challenging, yet we obtain encouraging improvements when explicitly modeling human disagreement.
- Trine Eriksen 2
- Paulina Toborek 1
- Barbara Plank 1
- Kia Kirstein Hansen 1
- Maria Barrett 1
- show all...