2024
What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects
Verena Blaschke | Christoph Purschke | Hinrich Schuetze | Barbara Plank
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Natural language processing (NLP) has largely focused on modelling standardized languages. More recently, attention has increasingly shifted to local, non-standardized languages and dialects. However, the relevant speaker populations’ needs and wishes with respect to NLP tools are largely unknown. In this paper, we focus on dialects and regional languages related to German – a group of varieties that is heterogeneous in terms of prestige and standardization. We survey speakers of these varieties (N=327) and present their opinions on hypothetical language technologies for their dialects. Although attitudes vary among subgroups of our respondents, we find that respondents are especially in favour of potential NLP tools that work with dialectal input (especially audio input) such as virtual assistants, and less so for applications that produce dialectal output such as machine translation or spellcheckers.
Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language
Alistair Plum | Tharindu Ranasinghe | Christoph Purschke
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Relation extraction is essential for extracting and understanding biographical information in the context of digital humanities and related subjects. There is a growing interest in the community to build datasets capable of training machine learning models to extract relationships. However, annotating such datasets can be expensive and time-consuming, in addition to being limited to English. This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German. Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset. We also create a manually annotated dataset with 2000 instances to evaluate the models and release it together with the dataset compiled using guided distant supervision. We train several state-of-the-art machine learning models on the automatically created dataset and release them as well. Furthermore, we experiment with multilingual and cross-lingual zero-shot experiments that could benefit many low-resource languages.
2023
Comparing Pre-Training Schemes for Luxembourgish BERT Models
Cedric Lothritz | Saad Ezzini | Christoph Purschke | Tegawendé Bissyandé | Jacques Klein | Isabella Olariu | Andrey Boytsov | Clément LeFebvre | Anne Goujon
Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)
Publish or Hold? Automatic Comment Moderation in Luxembourgish News Articles
Tharindu Ranasinghe | Alistair Plum | Christoph Purschke | Marcos Zampieri
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Recently, the internet has emerged as the primary platform for accessing news. In the majority of these news platforms, the users now have the ability to post comments on news articles and engage in discussions on various social media. While these features promote healthy conversations among users, they also serve as a breeding ground for spreading fake news, toxic discussions and hate speech. Moderating or removing such content is paramount to avoid unwanted consequences for the readers. However, apart from a few notable exceptions, most research on automatic moderation of news article comments has dealt with English and other high-resource languages. This leaves under-represented or low-resource languages at a loss. Addressing this gap, we perform the first large-scale qualitative analysis of more than one million Luxembourgish comments posted over the course of 14 years. We evaluate the performance of state-of-the-art transformer models in Luxembourgish news article comment moderation. Furthermore, we analyse how the language of Luxembourgish news article comments has changed over time. We observe that machine learning models trained on old comments do not perform well on recent data. The findings in this work will be beneficial in building news comment moderation systems for many low-resource languages.
2021
Findings of the VarDial Evaluation Campaign 2021
Bharathi Raja Chakravarthi | Gaman Mihaela | Radu Tudor Ionescu | Heidi Jauhiainen | Tommi Jauhiainen | Krister Lindén | Nikola Ljubešić | Niko Partanen | Ruba Priyadharshini | Christoph Purschke | Eswari Rajagopal | Yves Scherrer | Marcos Zampieri
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021. The campaign was part of the eighth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2021. Four separate shared tasks were included this year: Dravidian Language Identification (DLI), Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). DLI was organized for the first time and the other three continued a series of tasks from previous evaluation campaigns.
2020
A Report on the VarDial Evaluation Campaign 2020
Mihaela Gaman | Dirk Hovy | Radu Tudor Ionescu | Heidi Jauhiainen | Tommi Jauhiainen | Krister Lindén | Nikola Ljubešić | Niko Partanen | Christoph Purschke | Yves Scherrer | Marcos Zampieri
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020. The campaign included three shared tasks each focusing on a different challenge of language and dialect identification: Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). The campaign attracted 30 teams who enrolled to participate in one or multiple shared tasks and 14 of them submitted runs across the three shared tasks. Finally, 11 papers describing participating systems are published in the VarDial proceedings and referred to in this report.
2018
Capturing Regional Variation with Distributed Place Representations and Geographic Retrofitting
Dirk Hovy | Christoph Purschke
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Dialects are one of the main drivers of language variation, a major challenge for natural language processing tools. In most languages, dialects exist along a continuum, and are commonly discretized by combining the extent of several preselected linguistic variables. However, the selection of these variables is theory-driven and itself insensitive to change. We use Doc2Vec on a corpus of 16.8M anonymous online posts in the German-speaking area to learn continuous document representations of cities. These representations capture continuous regional linguistic distinctions, and can serve as input to downstream NLP tasks sensitive to regional variation. By incorporating geographic information via retrofitting and agglomerative clustering with structure, we recover dialect areas at various levels of granularity. Evaluating these clusters against an existing dialect map, we achieve a match of up to 0.77 V-score (harmonic mean of cluster completeness and homogeneity). Our results show that representation learning with retrofitting offers a robust general method to automatically expose dialectal differences and regional variation at a finer granularity than was previously possible.