Rob Van Der Goot - ACL Anthology

Rob Van Der Goot

Also published as: Rob Van Der Goot, Rob van der Goot

2025

Findings of the VarDial Evaluation Campaign 2025: The NorSID Shared Task on Norwegian Slot, Intent and Dialect Identification
Yves Scherrer | Rob van der Goot | Petter Mæhlum
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects

The VarDial Evaluation Campaign 2025 was organized as part of the twelfth workshop on Natural Language Processing for Similar Languages, Varieties and Dialects (VarDial), colocated with COLING 2025. It consisted of one shared task with three subtasks: intent detection, slot filling and dialect identification for Norwegian dialects. This report presents the results of this shared task. Four participating teams have submitted systems with very high performance (> 97% accuracy) for intent detection, whereas slot detection and dialect identification proved to be much more challenging, with span-F1 scores up to 89% and weighted dialect F1 scores of 84%, respectively.

DistaLs: a Comprehensive Collection of Language Distance Measures
Rob Van Der Goot | Esther Ploeger | Verena Blaschke | Tanja Samardzic
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Languages vary along a wide variety of dimensions. In Natural Language Processing (NLP), it is useful to know how “distant” languages are from each other, so that we can inform NLP models about these differences or predict good transfer languages. Furthermore, it can inform us about how diverse language samples are. However, there are many different perspectives on how distances across languages could be measured, and previous work has predominantly focused on either intuition or a single type of distance, like genealogical or typological distance. Therefore, we propose DistaLs, a toolkit that is designed to provide users with easy access to a wide variety of language distance measures. We also propose a filtered subset, which contains less redundant and more reliable features. DistaLs is designed to be accessible for a variety of use cases, and offers a Python, CLI, and web interface. It is easily updateable, and available as a pip package. Finally, we provide a case-study in which we use DistaLs to measure correlations of distance measures with performance on four different morphosyntactic tasks.

DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers
Max Müller-Eberstein | Mike Zhang | Elisa Bassignana | Peter Brunsgaard Trolle | Rob Van Der Goot
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)

Large Language Models (LLMs) have seen widespread societal adoption. However, while they are able to interact with users in languages beyond English, they have been shown to lack cultural awareness, providing anglocentric or inappropriate responses for underrepresented language communities. To investigate this gap and disentangle linguistic versus cultural proficiency, we conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation: Particularly, how currently employed automatically translated data are insufficient to train or measure cultural adaptation, and how training on native-speaker data can more than double response acceptance rates. We release our study data as DaKultur - the first native Danish cultural awareness dataset.

Crossing Domains without Labels: Distant Supervision for Term Extraction
Elena Senger | Yuri Campbell | Rob Van Der Goot | Barbara Plank
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Automatic Term Extraction (ATE) is a critical component in downstream NLP tasks such as document tagging, ontology construction and patent analysis. Current state-of-the-art methods require expensive human annotation and struggle with domain transfer, limiting their practical deployment. This highlights the need for more robust, scalable solutions and realistic evaluation settings. To address this, we introduce a comprehensive benchmark spanning seven diverse domains, enabling performance evaluation at both the document- and corpus-levels. Furthermore, we propose a robust LLM-based model that outperforms both supervised cross-domain encoder models and few-shot learning baselines and performs competitively with its GPT-4o teacher on this benchmark. The first step of our approach is generating pseudo-labels with this black-box LLM on general and scientific domains to ensure generalizability. Building on this data, we fine-tune the first LLMs for ATE. To further enhance document-level consistency, oftentimes needed for downstream tasks, we introduce lightweight post-hoc heuristics. Our approach exceeds previous approaches on 5/7 domains with an average improvement of 10 percentage points. We release our dataset and fine-tuned models to support future research in this area.

Do Syntactic Categories Help in Developmentally Motivated Curriculum Learning for Language Models?
Arzu Burcu Güven | Anna Rogers | Rob Van Der Goot
Proceedings of the First BabyLM Workshop

We examine the syntactic properties of the BabyLM corpus, and age-groups within CHILDES. While we find that CHILDES does not exhibit strong syntactic differentiation by age, we show that the syntactic knowledge about the training data can be helpful in interpreting model performance on linguistic tasks. For curriculum learning, we explore developmental and several alternative cognitively inspired curriculum approaches. We find that some curricula help with reading tasks, but the main performance improvements come from using the subset of syntactically categorizable data, rather than the full noisy corpus.

MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling
Rob van der Goot | Anette Jensen | Emil Allerslev Schledermann | Mikkel Wildner Kildeberg | Nicolaj Larsen | Mike Zhang | Elisa Bassignana
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Current language models (LMs) mostly exploit subwords as input units based on statistical co-occurrences of characters. Adjacently, previous work has shown that modeling morphemes can aid performance for Natural Language Processing (NLP) models. However, morphemes are challenging to obtain as there is no annotated data in most languages. In this work, we release a wide-coverage Danish morphological segmentation evaluation set. We evaluate a range of unsupervised token segmenters and evaluate the downstream effect of using morphemes as input units for transformer-based LMs. Our results show that popular subword algorithms perform poorly on this task, scoring at most an F1 of 57.6 compared to 68.0 for an unsupervised morphological segmenter (Morfessor). Furthermore, we evaluate a range of segmenters on the task of language modeling.

DECAF: A Dynamically Extensible Corpus Analysis Framework
Max Müller-Eberstein | Rob Van Der Goot | Anna Rogers
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

The study of generalization in Language Models (LMs) requires controlled experiments that can precisely measure complex linguistic variations between training and testing datasets. We introduce DECAF, a framework that enables the analysis and filtering of linguistically-annotated datasets down to the character level. Rather than creating new resources for each experiment, DECAF starts from datasets with existing linguistic annotations, and leverages them to analyze, filter, and generate highly controlled and reproducible experimental settings targeting specific research questions. We demonstrate DECAF’s functionality by adding 28 morphosyntactic annotation layers to the 115M-word BabyLM corpus and indexing the resulting 1.1B annotations to analyze its internal domain variance, and to create a controlled training data curriculum for a small-scale gender bias study. We release DECAF as an open-source Python library, along with the parsed and indexed version of BabyLM, as resources for future generalization research.

SnakModel: Lessons Learned from Training an Open Danish Large Language Model
Mike Zhang | Max Müller-Eberstein | Elisa Bassignana | Rob van der Goot
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

We present SnakModel, a Danish large language model (LLM) based on Llama2-7B, which we continuously pre-train on 13.6B Danish words, and further tune on 3.7M Danish instructions. As best practices for creating LLMs for smaller language communities have yet to be established, we examine the effects of early modeling and training decisions on downstream performance throughout the entire training pipeline, including (1) the creation of a strictly curated corpus of Danish text from diverse sources; (2) the language modeling and instruction-tuning training process itself, including the analysis of intermediate training dynamics, and ablations across different hyperparameters; (3) an evaluation on eight language and culturally-specific tasks. Across these experiments SnakModel achieves the highest overall performance, outperforming multiple contemporary Llama2-7B-based models. By making SnakModel, the majority of our pre-training corpus, and the associated code available under open licenses, we hope to foster further research and development in Danish Natural Language Processing, and establish training guidelines for languages with similar resource constraints.

How to age BERT Well: Continuous Training for Historical Language Adaptation
Anika Harju | Rob van der Goot
Proceedings of the First Workshop on Language Models for Low-Resource Languages

As computational tools are increasingly applied to digitalize historical archives, automatic annotation challenges persist due to distinct linguistic and morphological features of historical languages like Old English (OE). Existing tools struggle with the historical language varieties due to insufficient training. Previous research has focused on adapting pre-trained language models to new languages or domains but has rarely explored the modeling of language variety across time. Hence, we investigate the effectiveness of continuous language model training for adapting language models to OE on domain-specific data. We compare the continuous training of an English model (EN) and a multilingual model (ML), and use POS tagging for downstream evaluation. Results show that continuous pre-training substantially improves performance. We retrain a modern English (EN) model and a Multi-lingual (ML) BERT model for OE. We confirmed the effectiveness of continuous pre-training for language adaptation and downstream evaluation utilizing part-of-speech (POS) tagging, advancing the potential to understand the unique grammatical structures of historical OE archives. More concretely, EN BERT initially outperformed ML BERT with an accuracy of 83% during the language modeling phase. However, on the POS tagging task, ML BERT surpassed EN BERT, achieving an accuracy of 94%, which suggests effective performance on the historical language varieties.

KARRIEREWEGE: A large scale Career Path Prediction Dataset
Elena Senger | Yuri Campbell | Rob van der Goot | Barbara Plank
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Accurate career path prediction can support many stakeholders, like job seekers, recruiters, HR, and project managers. However, publicly available data and tools for career path prediction are scarce. In this work, we introduce Karrierewege, a comprehensive, publicly available dataset containing over 500k career paths, significantly surpassing the size of previously available datasets. We link the dataset to the ESCO taxonomy to offer a valuable resource for predicting career trajectories. To tackle the problem of free-text inputs typically found in resumes, we enhance it by synthesizing job titles and descriptions resulting in Karrierewege+. This allows for accurate predictions from unstructured data, closely aligning with practical application challenges. We benchmark existing state-of-the-art (SOTA) models on our dataset and a previous benchmark and see increased performance and robustness by synthesizing the data for the free-text use cases.

data2lang2vec: Data Driven Typological Features Completion
Hamidreza Amirzadeh | Sadegh Jafari | Anika Harju | Rob van der Goot
Proceedings of the 31st International Conference on Computational Linguistics

Language typology databases enhance multi-lingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely-used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9%. Previous work on automatically increasing coverage predicts missing values based on features from other languages or focuses on single features; we propose to use textual data for better-informed feature prediction. To this end, we introduce a multi-lingual Part-of-Speech (POS) tagger, achieving over 70% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup, focusing on typology features that are likely to be missing, and show that our approach outperforms previous work in both setups.

Bias in Danish Medical Notes: Infection Classification of Long Texts Using Transformer and LSTM Architectures Coupled with BERT
Mehdi Parviz | Rudi Agius | Carsten Niemann | Rob Van Der Goot
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)

Medical notes contain a wealth of information related to diagnosis, prognosis, and overall patient care that can be used to help physicians make informed decisions. However, like any other data sets consisting of data from diverse demographics, they may be biased toward certain subgroups or subpopulations. Consequently, any bias in the data will be reflected in the output of the machine learning models trained on them. In this paper, we investigate the existence of such biases in Danish medical notes related to three types of blood cancer, with the goal of classifying whether the medical notes indicate severe infection. By employing a hierarchical architecture that combines a sequence model (Transformer and LSTM) with a BERT model to classify long notes, we uncover biases related to demographics and cancer types. Furthermore, we observe performance differences between hospitals. These findings underscore the importance of investigating bias in critical settings such as healthcare and the urgency of monitoring and mitigating it when developing AI-based systems.

Proceedings of the Tenth Workshop on Noisy and User-generated Text
JinYeong Bak | Rob van der Goot | Hyeju Jang | Weerayut Buaphet | Alan Ramponi | Wei Xu | Alan Ritter
Proceedings of the Tenth Workshop on Noisy and User-generated Text

Iterative Structured Knowledge Distillation: Optimizing Language Models Through Layer-by-Layer Distillation
Malthe Have Musaeus | Rob van der Goot
Proceedings of the 31st International Conference on Computational Linguistics

Traditional language model compression techniques, like knowledge distillation, require a fixed architecture, limiting flexibility, while structured pruning methods often fail to preserve performance. This paper introduces Iterative Structured Knowledge Distillation (ISKD), which integrates knowledge distillation and structured pruning by progressively replacing transformer blocks with smaller, efficient versions during training. This study validates ISKD on two transformer-based language models: GPT-2 and Phi-1. ISKD outperforms L1 pruning and achieves similar performance to knowledge distillation while offering greater flexibility. ISKD reduces model parameters - 30.68% for GPT-2 and 30.16% for Phi-1 - while maintaining at least four-fifths of performance on both language modeling and commonsense reasoning tasks. These findings suggest that this method offers a promising balance between model efficiency and accuracy.

Identifying Open Challenges in Language Identification
Rob Van Der Goot
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automatic language identification is a core problem of many Natural Language Processing (NLP) pipelines. A wide variety of architectures and benchmarks have been proposed with often near-perfect performance. Although previous studies have focused on certain challenging setups (i.e. cross-domain, short inputs), a systematic comparison is missing. We propose a benchmark that allows us to test for the effect of input size, training data size, domain, number of languages, scripts, and language families on performance. We evaluate five popular models on this benchmark and identify which open challenges remain for this task as well as which architectures achieve robust performance. We find that cross-domain setups are the most challenging (although arguably most relevant), and that number of languages, variety in scripts, and variety in language families have only a small impact on performance. We also contribute practical takeaways: training with 1,000 instances per language and a maximum input length of 100 characters is enough for robust language identification. Based on our findings, we train an accurate (94.41%) multi-domain language identification model on 2,034 languages, for which we also provide an analysis of the remaining errors.

2024

EEVEE: An Easy Annotation Tool for Natural Language Processing
Axel Sorensen | Siyao Peng | Barbara Plank | Rob Van Der Goot
Proceedings of the 18th Linguistic Annotation Workshop (LAW-XVIII)

Annotation tools are the starting point for creating Natural Language Processing (NLP) datasets. There is a wide variety of tools available; setting up these tools is however a hindrance. We propose Eevee, an annotation tool focused on simplicity, efficiency, and ease of use. It can run directly in the browser (no setup required) and uses tab-separated files (as opposed to character offsets or task-specific formats) for annotation. It allows for annotation of multiple tasks on a single dataset and supports four task-types: sequence labeling, span labeling, text classification and seq2seq.

Can Humans Identify Domains?
Maria Barrett | Max Müller-Eberstein | Elisa Bassignana | Amalie Brogaard Pauli | Mike Zhang | Rob van der Goot
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Textual domain is a crucial property within the Natural Language Processing (NLP) community due to its effects on downstream model performance. The concept itself is, however, loosely defined and, in practice, refers to any non-typological property, such as genre, topic, medium or style of a document. We investigate the core notion of domains via human proficiency in identifying related intrinsic textual properties, specifically the concepts of genre (communicative purpose) and topic (subject matter). We publish our annotations in TGeGUM: A collection of 9.1k sentences from the GUM dataset (Zeldes, 2017) with single sentence and larger context (i.e., prose) annotations for one of 11 genres (source type), and its topic/subtopic as per the Dewey Decimal library classification system (Dewey, 1979), consisting of 10/100 hierarchical topics of increased granularity. Each instance is annotated by three annotators, for a total of 32.7k annotations, allowing us to examine the level of human disagreement and the relative difficulty of each annotation task. With a Fleiss’ kappa of at most 0.53 on the sentence level and 0.66 at the prose level, it is evident that despite the ubiquity of domains in NLP, there is little human consensus on how to define them. By training classifiers to perform the same task, we find that this uncertainty also extends to NLP models.

How to Encode Domain Information in Relation Classification
Elisa Bassignana | Viggo Unmack Gascou | Frida Nøhr Laustsen | Gustav Kristensen | Marie Haahr Petersen | Rob van der Goot | Barbara Plank
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Current language models require a lot of training data to obtain high performance. For Relation Classification (RC), many datasets are domain-specific, so combining datasets to obtain better performance is non-trivial. We explore a multi-domain training setup for RC, and attempt to improve performance by encoding domain information. Our proposed models improve > 2 Macro-F1 against the baseline setup, and our analysis reveals that not all the labels benefit the same: The classes which occupy a similar space across domains (i.e., their interpretation is close across them, for example “physical”) benefit the least, while domain-dependent relations (e.g., “part-of”) improve the most when encoding domain information.

NNOSE: Nearest Neighbor Occupational Skill Extraction
Mike Zhang | Rob van der Goot | Min-Yen Kan | Barbara Plank
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The labor market is changing rapidly, prompting increased interest in the automatic extraction of occupational skills from text. With the advent of English benchmark job description datasets, there is a need for systems that handle their diversity well. We tackle the complexity in occupational skill datasets tasks—combining and leveraging multiple datasets for skill extraction, to identify rarely observed skills within a dataset, and overcoming the scarcity of skills across datasets. In particular, we investigate the retrieval-augmentation of language models, employing an external datastore for retrieving similar skills in a dataset-unifying manner. Our proposed method, Nearest Neighbor Occupational Skill Extraction (NNOSE) effectively leverages multiple datasets by retrieving neighboring skills from other datasets in the datastore. This improves skill extraction without additional fine-tuning. Crucially, we observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.

Big City Bias: Evaluating the Impact of Metropolitan Size on Computational Job Market Abilities of Language Models
Charlie Campanella | Rob van der Goot
Proceedings of the First Workshop on Natural Language Processing for Human Resources (NLP4HR 2024)

Large language models have emerged as a useful technology for job matching, for both candidates and employers. Job matching is often based on a particular geographic location, such as a city or region. However, LMs have known biases, commonly derived from their training data. In this work, we aim to quantify the metropolitan size bias encoded within large language models, evaluating zero-shot salary, employer presence, and commute duration predictions in 384 of the United States’ metropolitan regions. Across all benchmarks, we observe correlations between metropolitan population and the accuracy of predictions, with the smallest 10 metropolitan regions showing upwards of 300% worse benchmark performance than the largest 10.

Where are we Still Split on Tokenization?
Rob van der Goot
Findings of the Association for Computational Linguistics: EACL 2024

Many Natural Language Processing (NLP) tasks are labeled on the token level; for these tasks, the first step is to identify the tokens (tokenization). Because this step is often considered to be a solved problem, gold tokenization is commonly assumed. In this paper, we propose an efficient method for tokenization with subword-based language models, and reflect on the status of performance on the tokenization task by evaluating on 122 languages in 20 different scripts. We show that our proposed model performs on par with the state-of-the-art, and that tokenization performance is mainly dependent on the amount and consistency of annotated data. We conclude that besides inconsistencies in the data and exceptional cases the task can be considered solved for Latin languages for in-dataset settings (>99.5 F1). However, performance is 0.75 F1 point lower on average for datasets in other scripts and performance deteriorates in cross-dataset setups.

Entity Linking in the Job Market Domain
Mike Zhang | Rob van der Goot | Barbara Plank
Findings of the Association for Computational Linguistics: EACL 2024

In Natural Language Processing, entity linking (EL) has centered around Wikipedia, yet remains underexplored for the job market domain. Disambiguating skill mentions can help us get insight into the current labor market demands. In this work, we are the first to explore EL in this domain, specifically targeting the linkage of occupational skills to the ESCO taxonomy (le Vrang et al., 2014). Previous efforts linked coarse-grained (full) sentences to a corresponding ESCO skill. In this work, we link more fine-grained span-level mentions of skills. We tune two high-performing neural EL models, a bi-encoder (Wu et al., 2020) and an autoregressive model (Cao et al., 2021), on a synthetically generated mention–skill pair dataset and evaluate them on a human-annotated skill-linking benchmark. Our findings reveal that both models are capable of linking implicit mentions of skills to their correct taxonomy counterparts. Empirically, BLINK outperforms GENRE in strict evaluation, but GENRE performs better in loose evaluation (accuracy@k).

Slot and Intent Detection Resources for Bavarian and Lithuanian: Assessing Translations vs Natural Queries to Digital Assistants
Miriam Winkler | Virginija Juozapaityte | Rob van der Goot | Barbara Plank
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Digital assistants perform well in high-resource languages like English, where tasks like slot and intent detection (SID) are well-supported. Many recent SID datasets start including multiple language varieties. However, it is unclear how realistic these translated datasets are. Therefore, we extend one such dataset, namely xSID-0.4, to include two underrepresented languages: Bavarian, a German dialect, and Lithuanian, a Baltic language. Both language variants have limited speaker populations and are often not included in multilingual projects. In addition to translations we provide “natural” queries to digital assistants generated by native speakers. We further include utterances from another dataset for Bavarian to build the richest SID dataset available today for a low-resource dialect without standard orthography. We then set out to evaluate models trained on English in a zero-shot scenario on our target language variants. Our evaluation reveals that translated data can produce overly optimistic scores. However, the error patterns in translated and natural datasets are highly similar. Cross-dataset experiments demonstrate that data collection methods influence performance, with scores lower than those achieved with single-dataset translations. This work contributes to enhancing SID datasets for underrepresented languages, yielding NaLiBaSID, a new evaluation dataset for Bavarian and Lithuanian.

Enough Is Enough! a Case Study on the Effect of Data Size for Evaluation Using Universal Dependencies
Rob van der Goot | Zoey Liu | Max Müller-Eberstein
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

When creating a new dataset for evaluation, one of the first considerations is the size of the dataset. If our evaluation data is too small, we risk making unsupported claims based on the results on such data. If, on the other hand, the data is too large, we waste valuable annotation time and costs that could have been used to widen the scope of our evaluation (i.e. annotate for more domains/languages). Hence, we investigate the effect of the size and a variety of sampling strategies of evaluation data to optimize annotation efforts, using dependency parsing as a test case. We show that for in-language in-domain datasets, 5,000 tokens is enough to obtain a reliable ranking of different parsers; especially if the data is distant enough from the training split (otherwise, we recommend 10,000). In cross-domain setups, the same amounts are required, but in cross-lingual setups much less (2,000 tokens) is enough.

Deep Learning-based Computational Job Market Analysis: A Survey on Skill Extraction and Classification from Job Postings
Elena Senger | Mike Zhang | Rob van der Goot | Barbara Plank
Proceedings of the First Workshop on Natural Language Processing for Human Resources (NLP4HR 2024)

Recent years have brought significant advances to Natural Language Processing (NLP), which enabled fast progress in the field of computational job market analysis. Core tasks in this application domain are skill extraction and classification from job postings. Because of its quick growth and its interdisciplinary nature, there is no exhaustive assessment of this field. This survey aims to fill this gap by providing a comprehensive overview of deep learning methodologies, datasets, and terminologies specific to NLP-driven skill extraction. Our comprehensive cataloging of publicly available datasets addresses the lack of consolidated information on dataset creation and characteristics. Finally, the focus on terminology addresses the current lack of consistent definitions for important concepts, such as hard and soft skills, and terms relating to skill extraction and classification.

Proceedings of the Ninth Workshop on Noisy and User-generated Text (W-NUT 2024)
Rob van der Goot | JinYeong Bak | Max Müller-Eberstein | Wei Xu | Alan Ritter | Tim Baldwin
Proceedings of the Ninth Workshop on Noisy and User-generated Text (W-NUT 2024)

What’s wrong with your model? A Quantitative Analysis of Relation Classification
Elisa Bassignana | Rob van der Goot | Barbara Plank
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

With the aim of improving the state-of-the-art (SOTA) on a target task, a standard strategy in Natural Language Processing (NLP) research is to design a new model, or modify the existing SOTA, and then benchmark its performance on the target task. We argue in favor of enriching this chain of actions by a preliminary error-guided analysis: First, explore weaknesses by analyzing the hard cases where the existing model fails, and then target the improvement based on those. Interpretable evaluation has received little attention for structured prediction tasks. Therefore we propose the first in-depth analysis suite for Relation Classification (RC), and show its effectiveness through a case study. We propose a set of potentially influential attributes to focus on (e.g., entity distance, sentence length). Then, we bucket our datasets based on these attributes, and weight the importance of them through correlations. This allows us to identify highly challenging scenarios for the RC model. By exploiting the findings of our analysis, with a carefully targeted adjustment to our architecture, we effectively improve the performance over the baseline by >3 Micro-F1.

2023

Native Language Prediction from Gaze: a Reproducibility Study
Lina Skerath | Paulina Toborek | Anita Zielińska | Maria Barrett | Rob Van Der Goot
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Numerous studies found that the linguistic properties of a person’s native language affect the cognitive processing of other languages. However, only one study has shown that it was possible to identify the native language based on eye-tracking records of natural L2 reading using machine learning. A new corpus allows us to replicate these results on a more interrelated and larger set of native languages. Our results show that comparable classification performance is maintained despite using less data. However, analysis shows that the correlation between L2 eye movements and native language similarity may be more complex than the original study found.

DanTok: Domain Beats Language for Danish Social Media POS Tagging
Kia Kirstein Hansen | Maria Barrett | Max Müller-Eberstein | Cathrine Damgaard | Trine Eriksen | Rob van der Goot
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Language from social media remains challenging to process automatically, especially for non-English languages. In this work, we introduce the first NLP dataset for TikTok comments and the first Danish social media dataset with part-of-speech annotation. We further supply annotations for normalization, code-switching, and annotator uncertainty. As transferring models to such a highly specialized domain is non-trivial, we conduct an extensive study into which source data and modeling decisions most impact the performance. Surprisingly, transferring from in-domain data, even from a different language, outperforms in-language, out-of-domain training. These benefits nonetheless rely on the underlying language models having been at least partially pre-trained on data from the target language. Using our additional annotation layers, we further analyze how normalization, code-switching, and human uncertainty affect the tagging accuracy.

ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain
Mike Zhang | Rob van der Goot | Barbara Plank
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The increasing number of benchmarks for Natural Language Processing (NLP) tasks in the computational job market domain highlights the demand for methods that can handle job-related tasks such as skill extraction, skill classification, job title classification, and de-identification. While some approaches have been developed that are specific to the job market domain, there is a lack of generalized, multilingual models and benchmarks for these tasks. In this study, we introduce a language model called ESCOXLM-R, based on XLM-R-large, which uses domain-adaptive pre-training on the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy, covering 27 languages. The pre-training objectives for ESCOXLM-R include dynamic masked language modeling and a novel additional objective for inducing multilingual taxonomical ESCO relations. We comprehensively evaluate the performance of ESCOXLM-R on 6 sequence labeling and 3 classification tasks in 4 languages and find that it achieves state-of-the-art results on 6 out of 9 datasets. Our analysis reveals that ESCOXLM-R performs better on short spans and outperforms XLM-R-large on entity-level and surface-level span-F1, likely due to ESCO containing short skill and occupation titles, and encoding information on the entity-level.

MaChAmp at SemEval-2023 tasks 2, 3, 4, 5, 7, 8, 9, 10, 11, and 12: On the Effectiveness of Intermediate Training on an Uncurated Collection of Datasets.
Rob van der Goot
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

To improve the ability of language models to handle Natural Language Processing (NLP) tasks, an intermediate step of pre-training has recently been introduced. In this setup, one takes a pre-trained language model, trains it on a (set of) NLP dataset(s), and then finetunes it for a target task. It is known that the selection of relevant transfer tasks is important, but recently some work has shown substantial performance gains by doing intermediate training on a very large set of datasets. Most previous work uses generative language models or only focuses on one or a couple of tasks and uses a carefully curated setup. We compare intermediate training with one or many tasks in a setup where the choice of datasets is more arbitrary; we use all SemEval 2023 text-based tasks. We reach performance improvements for most tasks when using intermediate training. Gains are higher when doing intermediate training on single tasks than all tasks if the right transfer task is identified. Dataset smoothing and heterogeneous batching did not lead to robust gains in our setup.

Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training
Max Müller-Eberstein | Rob van der Goot | Barbara Plank | Ivan Titov
Findings of the Association for Computational Linguistics: EMNLP 2023

Representational spaces learned via language modeling are fundamental to Natural Language Processing (NLP), however there has been limited understanding regarding how and when during training various types of linguistic information emerge and interact. Leveraging a novel information theoretic probing suite, which enables direct comparisons of not just task performance, but their representational subspaces, we analyze nine tasks covering syntax, semantics and reasoning, across 2M pre-training steps and five seeds. We identify critical learning phases across tasks and time, during which subspaces emerge, share information, and later disentangle to specialize. Across these phases, syntactic knowledge is acquired rapidly after 0.5% of full training. Continued performance improvements primarily stem from the acquisition of open-domain knowledge, while semantics and reasoning tasks benefit from later boosts to long-range contextualization and higher specialization. Measuring cross-task similarity further reveals that linguistically related tasks share information throughout training, and do so more during the critical phase of learning than before or after. Our findings have implications for model interpretability, multi-task learning, and learning from limited data.

Findings of the VarDial Evaluation Campaign 2023
Noëmi Aepli | Çağrı Çöltekin | Rob Van Der Goot | Tommi Jauhiainen | Mourhaf Kazzaz | Nikola Ljubešić | Kai North | Barbara Plank | Yves Scherrer | Marcos Zampieri
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR), Discriminating Between Similar Languages – True Labels (DSL-TL), and Discriminating Between Similar Languages – Speech (DSL-S). All three tasks were organized for the first time this year.

Silver Syntax Pre-training for Cross-Domain Relation Extraction
Elisa Bassignana | Filip Ginter | Sampo Pyysalo | Rob van der Goot | Barbara Plank
Findings of the Association for Computational Linguistics: ACL 2023

Relation Extraction (RE) remains a challenging task, especially when considering realistic out-of-domain evaluations. One of the main reasons for this is the limited training size of current RE datasets: obtaining high-quality (manually annotated) data is extremely expensive and cannot realistically be repeated for each new domain. An intermediate training step on data from related tasks has shown to be beneficial across many NLP tasks. However, this setup still requires supplementary annotated data, which is often not available. In this paper, we investigate intermediate pre-training specifically for RE. We exploit the affinity between syntactic structure and semantic RE, and identify the syntactic relations which are closely related to RE by being on the shortest dependency path between two entities. We then take advantage of the high accuracy of current syntactic parsers in order to automatically obtain large amounts of low-cost pre-training data. By pre-training our RE model on the relevant syntactic relations, we are able to outperform the baseline in five out of six cross-domain setups, without any additional annotated data.

Multi-CrossRE A Multi-Lingual Multi-Domain Dataset for Relation Extraction
Elisa Bassignana | Filip Ginter | Sampo Pyysalo | Rob van der Goot | Barbara Plank
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Most research in Relation Extraction (RE) involves the English language, mainly due to the lack of multi-lingual resources. We propose Multi-CrossRE, the broadest multi-lingual dataset for RE, including 26 languages in addition to English, and covering six text domains. Multi-CrossRE is a machine translated version of CrossRE (Bassignana and Plank, 2022), with a sub-portion including more than 200 sentences in seven diverse languages checked by native speakers. We run a baseline model over the 26 new datasets and, as a sanity check, over the 26 back-translations to English. Results on the back-translated data are consistent with the ones on the original English CrossRE, indicating high quality of the translation and the resulting dataset.

Establishing Trustworthiness: Rethinking Tasks and Model Evaluation
Robert Litschko | Max Müller-Eberstein | Rob van der Goot | Leon Weber-Genzel | Barbara Plank
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Language understanding is a multi-faceted cognitive capability, which the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs) the community has witnessed a dramatic shift towards general purpose, task-agnostic approaches powered by generative models. As a consequence, the traditional compartmentalized notion of language tasks is breaking down, followed by an increasing challenge for evaluation and analysis. At the same time, LLMs are being deployed in more real-world scenarios, including previously unforeseen zero-shot setups, increasing the need for trustworthy and reliable systems. Therefore, we argue that it is time to rethink what constitutes tasks and model evaluation in NLP, and pursue a more holistic view on language, placing trustworthiness at the center. Towards this goal, we review existing compartmentalized approaches for understanding the origins of a model’s functional capacity, and provide recommendations for more multi-faceted evaluation protocols.

2022

MaChAmp at SemEval-2022 Tasks 2, 3, 4, 6, 10, 11, and 12: Multi-task Multi-lingual Learning for a Pre-selected Set of Semantic Datasets
Rob van der Goot
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Previous work on multi-task learning in Natural Language Processing (NLP) often incorporated carefully selected tasks as well as careful tuning of architectures to share information across tasks. Recently, it has been shown that for autoregressive language models, a multi-task second pre-training step on a wide variety of NLP tasks leads to a set of parameters that more easily adapt for other NLP tasks. In this paper, we examine whether a similar setup can be used in autoencoder language models using a restricted set of semantically oriented NLP tasks, namely all SemEval 2022 tasks that are annotated at the word, sentence or paragraph level. We first evaluate a multi-task model trained on all SemEval 2022 tasks that contain annotation on the word, sentence or paragraph level (7 tasks, 11 sub-tasks), and then evaluate whether re-finetuning the resulting model for each task specifically leads to further improvements. Our results show that our mono-task baseline, our multi-task model and our re-finetuned multi-task model each outperform the other models for a subset of the tasks. Overall, huge gains can be observed by doing multi-task learning: for three tasks we observe an error reduction of more than 40%.

Spectral Probing
Max Müller-Eberstein | Rob van der Goot | Barbara Plank
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Linguistic information is encoded at varying timescales (subwords, phrases, etc.) and communicative levels, such as syntax and semantics. Contextualized embeddings have analogously been found to capture these phenomena at distinctive layers and frequencies. Leveraging these findings, we develop a fully learnable frequency filter to identify spectral profiles for any given task. It enables vastly more granular analyses than prior handcrafted filters, and improves on efficiency. After demonstrating the informativeness of spectral probing over manual filters in a monolingual setting, we investigate its multilingual characteristics across seven diverse NLP tasks in six languages. Our analyses identify distinctive spectral profiles which quantify cross-task similarity in a linguistically intuitive manner, while remaining consistent across languages—highlighting their potential as robust, lightweight task descriptors.

On Language Spaces, Scales and Cross-Lingual Transfer of UD Parsers
Tanja Samardžić | Ximena Gutierrez-Vasques | Rob van der Goot | Max Müller-Eberstein | Olga Pelloni | Barbara Plank
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

Cross-lingual transfer of parsing models has been shown to work well for several closely-related languages, but predicting the success in other cases remains hard. Our study is a comprehensive analysis of the impact of linguistic distance on the transfer of UD parsers. As an alternative to syntactic typological distances extracted from URIEL, we propose three text-based feature spaces and show that they can be more precise predictors, especially on a more local scale, when only shorter distances are taken into account. Our analyses also reveal that the good coverage in typological databases is not among the factors that explain good transfer.

Probing for Labeled Dependency Trees
Max Müller-Eberstein | Rob van der Goot | Barbara Plank
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Probing has become an important tool for analyzing representations in Natural Language Processing (NLP). For graphical NLP tasks such as dependency parsing, linear probes are currently limited to extracting undirected or unlabeled parse trees which do not capture the full task. This work introduces DepProbe, a linear probe which can extract labeled and directed dependency parse trees from embeddings while using fewer parameters and compute than prior methods. Leveraging its full task coverage and lightweight parametrization, we investigate its predictive power for selecting the best transfer language for training a full biaffine attention parser. Across 13 languages, our proposed method identifies the best source treebank 94% of the time, outperforming competitive baselines and prior work. Finally, we analyze the informativeness of task-specific subspaces in contextual embeddings as well as which benefits a full parser’s non-linear parametrization provides.

Frustratingly Easy Performance Improvements for Low-resource Setups: A Tale on BERT and Segment Embeddings
Rob van der Goot | Max Müller-Eberstein | Barbara Plank
Proceedings of the Thirteenth Language Resources and Evaluation Conference

As input representation for each sub-word, the original BERT architecture proposes the sum of the sub-word embedding, position embedding and a segment embedding. Sub-word and position embeddings are well-known and studied, and encode lexical information and word position, respectively. In contrast, segment embeddings are less known and have so far received no attention, despite being ubiquitous in large pre-trained language models. The key idea of segment embeddings is to encode to which of the two sentences (segments) a word belongs to — the intuition is to inform the model about the separation of sentences for the next sentence prediction pre-training task. However, little is known on whether the choice of segment impacts performance. In this work, we try to fill this gap and empirically study the impact of the segment embedding during inference time for a variety of pre-trained embeddings and target tasks. We hypothesize that for single-sentence prediction tasks performance is not affected — neither in mono- nor multilingual setups — while it matters when swapping segment IDs in paired-sentence tasks. To our surprise, this is not the case. Although for classification tasks and monolingual BERT models no large differences are observed, particularly word-level multilingual prediction tasks are heavily impacted. For low-resource syntactic tasks, we observe impacts of segment embedding and multilingual BERT choice. We find that the default setting for the most used multilingual BERT model underperforms heavily, and a simple swap of the segment embeddings yields an average improvement of 2.5 points absolute LAS score for dependency parsing over 9 different treebanks.

Experimental Standards for Deep Learning in Natural Language Processing Research
Dennis Ulmer | Elisa Bassignana | Max Müller-Eberstein | Daniel Varab | Mike Zhang | Rob van der Goot | Christian Hardmeier | Barbara Plank
Findings of the Association for Computational Linguistics: EMNLP 2022

The field of Deep Learning (DL) has undergone explosive growth during the last decade, with a substantial impact on Natural Language Processing (NLP) as well. Yet, compared to more established disciplines, a lack of common experimental standards remains an open challenge to the field at large. Starting from fundamental scientific principles, we distill ongoing discussions on experimental standards in NLP into a single, widely-applicable methodology. Following these best practices is crucial to strengthen experimental evidence, improve reproducibility and enable scientific progress. These standards are further collected in a public repository to help them transparently adapt to future needs.

Increasing Robustness for Cross-domain Dialogue Act Classification on Social Media Data
Marcus Vielsted | Nikolaj Wallenius | Rob van der Goot
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

Automatically detecting the intent of an utterance is important for various downstream natural language processing tasks. This task is also called Dialogue Act Classification (DAC) and was primarily researched on spoken one-to-one conversations. The rise of social media has made this an interesting data source to explore within DAC, although it comes with some difficulties: non-standard form, variety of language types (across and within platforms), and quickly evolving norms. We therefore investigate the robustness of DAC on social media data in this paper. More concretely, we provide a benchmark that includes cross-domain data splits, as well as a variety of improvements on our transformer-based baseline. Our experiments show that lexical normalization is not beneficial in this setup, balancing the labels through resampling is beneficial in some cases, and incorporating context is crucial for this task and leads to the highest performance improvements (7 F1 percentage points in-domain and 20 cross-domain).

Tafsir Dataset: A Novel Multi-Task Benchmark for Named Entity Recognition and Topic Modeling in Classical Arabic Literature
Sajawel Ahmed | Rob van der Goot | Misbahur Rehman | Carl Kruse | Ömer Özsoy | Alexander Mehler | Gemma Roig
Proceedings of the 29th International Conference on Computational Linguistics

Various historical languages, which used to be lingua franca of science and arts, deserve the attention of current NLP research. In this work, we take the first data-driven steps towards this research line for Classical Arabic (CA) by addressing named entity recognition (NER) and topic modeling (TM) on the example of CA literature. We manually annotate the encyclopedic work of Tafsir Al-Tabari with span-based NEs, sentence-based topics, and span-based subtopics, thus creating the Tafsir Dataset with over 51,000 sentences, the first large-scale multi-task benchmark for CA. Next, we analyze our newly generated dataset, which we make open-source available, with current language models (lightweight BiLSTM, transformer-based MaChAmP) along a novel script compression method, thereby achieving state-of-the-art performance for our target task CA-NER. We also show that CA-TM from the perspective of historical topic models, which are central to Arabic studies, is very challenging. With this interdisciplinary work, we lay the foundations for future research on automatic analysis of CA literature.

Sort by Structure: Language Model Ranking as Dependency Probing
Max Müller-Eberstein | Rob van der Goot | Barbara Plank
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Making an informed choice of pre-trained language model (LM) is critical for performance, yet environmentally costly, and as such widely underexplored. The field of Computer Vision has begun to tackle encoder ranking, with promising forays into Natural Language Processing, however they lack coverage of linguistic tasks such as structured prediction. We propose probing to rank LMs, specifically for parsing dependencies in a given language, by measuring the degree to which labeled trees are recoverable from an LM’s contextualized embeddings. Across 46 typologically and architecturally diverse LM-language pairs, our probing approach predicts the best LM choice 79% of the time using orders of magnitude less compute than training a full parser. Within this study, we identify and analyze one recently proposed decoupled LM—RemBERT—and find it strikingly contains less inherent dependency information, but often yields the best parser after full fine-tuning. Without this outlier our approach identifies the best LM in 89% of cases.

2021

MultiLexNorm: A Shared Task on Multilingual Lexical Normalization
Rob van der Goot | Alan Ramponi | Arkaitz Zubiaga | Barbara Plank | Benjamin Muller | Iñaki San Vicente Roncal | Nikola Ljubešić | Özlem Çetinoğlu | Rahmad Mahendra | Talha Çolakoğlu | Timothy Baldwin | Tommaso Caselli | Wladimir Sidorenko
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical for social media on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and multiple other languages. However, there exists a lack of a common benchmark for comparison of systems across languages with a homogeneous data and evaluation setup. The MultiLexNorm shared task sets out to fill this gap. We provide the largest publicly available multilingual lexical normalization benchmark including 13 language variants. We propose a homogenized evaluation setup with both intrinsic and extrinsic evaluation. As extrinsic evaluation, we use dependency parsing and part-of-speech tagging with adapted evaluation metrics (a-LAS, a-UAS, and a-POS) to account for alignment discrepancies. The shared task hosted at W-NUT 2021 attracted 9 participants and 18 submissions. The results show that neural normalization systems outperform the previous state-of-the-art system by a large margin. Downstream parsing and part-of-speech tagging performance is positively affected but to varying degrees, with improvements of up to 1.72 a-LAS, 0.85 a-UAS, and 1.54 a-POS for the winning system.

From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding
Rob van der Goot | Ibrahim Sharaf | Aizhan Imankulova | Ahmet Üstün | Marija Stepanović | Alan Ramponi | Siti Oryza Khairunnisa | Mamoru Komachi | Barbara Plank
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The lack of publicly available evaluation data for low-resource languages limits progress in Spoken Language Understanding (SLU). As key tasks like intent classification and slot filling require abundant training data, it is desirable to reuse existing data in high-resource languages to develop models for low-resource scenarios. We introduce xSID, a new benchmark for cross-lingual (x) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. To tackle the challenge, we propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer. We study two setups which differ by type and language coverage of the pre-trained embeddings. Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.

CL-MoNoise: Cross-lingual Lexical Normalization
Rob van der Goot
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Social media is notoriously difficult to process for existing natural language processing tools, because of spelling errors, non-standard words, shortenings, non-standard capitalization and punctuation. One method to circumvent these issues is to normalize input data before processing. Most previous work has focused on only one language, which is mostly English. In this paper, we are the first to propose a model for cross-lingual normalization, with which we participate in the WNUT 2021 shared task. To this end, we use MoNoise as a starting point, and make a simple adaptation for cross-lingual application. Our proposed model outperforms the leave-as-is baseline provided by the organizers which copies the input. Furthermore, we explore a completely different model which converts the task to a sequence labeling task. Performance of this second system is low, as it does not take capitalization into account in our implementation.

Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP
Rob van der Goot | Ahmet Üstün | Alan Ramponi | Ibrahim Sharaf | Barbara Plank
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Transfer learning, particularly approaches that combine multi-task learning with pre-trained contextualized embeddings and fine-tuning, have advanced the field of Natural Language Processing tremendously in recent years. In this paper we present MaChAmp, a toolkit for easy fine-tuning of contextualized embeddings in multi-task settings. The benefits of MaChAmp are its flexible configuration options, and the support of a variety of natural language processing tasks in a uniform toolkit, from text classification and sequence labeling to dependency parsing, masked language modeling, and text generation.

Much Gracias: Semi-supervised Code-switch Detection for Spanish-English: How far can we get?
Dana-Maria Iliescu | Rasmus Grand | Sara Qirko | Rob van der Goot
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Because of globalization, it is becoming more and more common to use multiple languages in a single utterance, also called code-switching. This results in special linguistic structures and, therefore, poses many challenges for Natural Language Processing. Existing models for language identification in code-switched data are all supervised, requiring annotated training data which is only available for a limited number of language pairs. In this paper, we explore semi-supervised approaches, that exploit out-of-domain mono-lingual training data. We experiment with word uni-grams, word n-grams, character n-grams, Viterbi Decoding, Latent Dirichlet Allocation, Support Vector Machine and Logistic Regression. The Viterbi model was the best semi-supervised model, scoring a weighted F1 score of 92.23%, whereas a fully supervised state-of-the-art BERT-based model scored 98.43%.

Genre as Weak Supervision for Cross-lingual Dependency Parsing
Max Müller-Eberstein | Rob van der Goot | Barbara Plank
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Recent work has shown that monolingual masked language models learn to represent data-driven notions of language variation which can be used for domain-targeted training data selection. Dataset genre labels are already frequently available, yet remain largely unexplored in cross-lingual setups. We harness this genre metadata as a weak supervision signal for targeted data selection in zero-shot dependency parsing. Specifically, we project treebank-level genre information to the finer-grained sentence level, with the goal to amplify information implicitly stored in unsupervised contextualized representations. We demonstrate that genre is recoverable from multilingual contextual embeddings and that it provides an effective signal for training data selection in cross-lingual, zero-shot scenarios. For 12 low-resource language treebanks, six of which are test-only, our genre-specific methods significantly outperform competitive baselines as well as recent embedding-based methods for data selection. Moreover, genre-based data selection provides new state-of-the-art results for three of these target languages.

Parsing with Pretrained Language Models, Multiple Datasets, and Dataset Embeddings
Rob van der Goot | Miryam de Lhoneux
Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021)

On the Effectiveness of Dataset Embeddings in Mono-lingual, Multi-lingual and Zero-shot Conditions
Rob van der Goot | Ahmet Üstün | Barbara Plank
Proceedings of the Second Workshop on Domain Adaptation for NLP

Recent complementary strands of research have shown that leveraging information on the data source through encoding their properties into embeddings can lead to performance increase when training a single model on heterogeneous data sources. However, it remains unclear in which situations these dataset embeddings are most effective, because they are used in a large variety of settings, languages and tasks. Furthermore, it is usually assumed that gold information on the data source is available, and that the test data is from a distribution seen during training. In this work, we compare the effect of dataset embeddings in mono-lingual settings, multi-lingual settings, and with predicted data source label in a zero-shot setting. We evaluate on three morphosyntactic tasks: morphological tagging, lemmatization, and dependency parsing, and use 104 datasets, 66 languages, and two different dataset grouping strategies. Performance increases are highest when the datasets are of the same language, and we know from which distribution the test-instance is drawn. In contrast, for setups where the data is from an unseen distribution, performance increase vanishes.

Challenges in Annotating and Parsing Spoken, Code-switched, Frisian-Dutch Data
Anouck Braggaar | Rob van der Goot
Proceedings of the Second Workshop on Domain Adaptation for NLP

While high performance has been obtained for high-resource languages, performance on low-resource languages lags behind. In this paper we focus on the parsing of the low-resource language Frisian. We use a sample of code-switched, spontaneously spoken data, which proves to be a challenging setup. We propose to train a parser specifically tailored towards the target domain, by selecting instances from multiple treebanks. Specifically, we use Latent Dirichlet Allocation (LDA), with word and character N-grams. We use a deep biaffine parser initialized with mBERT. The best single source treebank (nl_alpino) resulted in an LAS of 54.7 whereas our data selection outperformed the single best transfer treebank and led to 55.6 LAS on the test data. Additional experiments consisted of removing diacritics from our Frisian data, creating more similar training data by cropping sentences and running our best model using XLM-R. These experiments did not lead to a better performance.

Lexical Normalization for Code-switched Data and its Effect on POS Tagging
Rob van der Goot | Özlem Çetinoğlu
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet, using multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common use in social media. In this paper, we propose three normalization models specifically designed to handle code-switched data which we evaluate for two language pairs: Indonesian-English and Turkish-German. For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models significantly outperform monolingual ones, and lead to 5.4% relative performance increase for POS tagging as compared to unnormalized input.

How Universal is Genre in Universal Dependencies?
Max Müller-Eberstein | Rob van der Goot | Barbara Plank
Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021)

We Need to Talk About train-dev-test Splits
Rob van der Goot
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Standard train-dev-test splits used to benchmark multiple models against each other are ubiquitously used in Natural Language Processing (NLP). In this setup, the train data is used for training the model, the development set for evaluating different versions of the proposed model(s) during development, and the test set to confirm the answers to the main research question(s). However, the introduction of neural networks in NLP has led to a different use of these standard splits; the development set is now often used for model selection during the training procedure. Because of this, comparing multiple versions of the same model during development leads to overestimation on the development data. As an effect, people have started to compare an increasing amount of models on the test data, leading to faster overfitting and “expiration” of our test sets. We propose to use a tune-set when developing neural network methods, which can be used for model picking so that comparing the different versions of a new model can safely be done on the development data.

2020

NLP North at WNUT-2020 Task 2: Pre-training versus Ensembling for Detection of Informative COVID-19 English Tweets
Anders Giovanni Møller | Rob van der Goot | Barbara Plank
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

With the COVID-19 pandemic raging world-wide since the beginning of the 2020 decade, the need for monitoring systems to track relevant information on social media is vitally important. This paper describes our submission to the WNUT-2020 Task 2: Identification of informative COVID-19 English Tweets. We investigate the effectiveness of a variety of classification models, and find that domain-specific pre-trained BERT models lead to the best performance. On top of this, we attempt a variety of ensembling strategies, but these attempts did not lead to further improvements. Our final best model, the standalone CT-BERT model, proved to be highly competitive, leading to a shared first place in the shared task. Our results emphasize the importance of domain and task-related pre-training.

Biomedical Event Extraction as Sequence Labeling
Alan Ramponi | Rob van der Goot | Rosario Lombardo | Barbara Plank
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We introduce Biomedical Event Extraction as Sequence Labeling (BeeSL), a joint end-to-end neural information extraction model. BeeSL recasts the task as sequence labeling, taking advantage of a multi-label aware encoding strategy and jointly modeling the intermediate tasks via multi-task learning. BeeSL is fast, accurate, end-to-end, and unlike current methods does not require any external knowledge base or preprocessing tools. BeeSL outperforms the current best system (Li et al., 2019) on the Genia 2011 benchmark by 1.57% absolute F1 score reaching 60.22% F1, establishing a new state of the art for the task. Importantly, we also provide first results on biomedical event extraction without gold entity information. Empirical results show that BeeSL’s speed and accuracy make it a viable approach for large-scale real-world scenarios.

Fair Is Better than Sensational: Man Is to Doctor as Woman Is to Doctor
Malvina Nissim | Rik van Noord | Rob van der Goot
Computational Linguistics, Volume 46, Issue 2 - June 2020

Analogies such as man is to king as woman is to X are often used to illustrate the amazing power of word embeddings. Concurrently, they have also been used to expose how strongly human biases are encoded in vector spaces trained on natural language, with examples like man is to computer programmer as woman is to homemaker. Recent work has shown that analogies are in fact not an accurate diagnostic for bias, but this does not mean that they are not used anymore, or that their legacy is fading. Instead of focusing on the intrinsic problems of the analogy task as a bias detection tool, we discuss a series of issues involving implementation as well as subjective choices that might have yielded a distorted picture of bias in word embeddings. We stand by the truth that human biases are present in word embeddings, and, of course, the need to address them. But analogies are not an accurate tool to do so, and the way they have been most often used has exacerbated some possibly non-existing biases and perhaps hidden others. Because they are still widely popular, and some of them have become classics within and outside the NLP community, we deem it important to provide a series of clarifications that should put well-known, and potentially new analogies, into the right perspective.

DaN+: Danish Nested Named Entities and Lexical Normalization
Barbara Plank | Kristian Nørgaard Jensen | Rob van der Goot
Proceedings of the 28th International Conference on Computational Linguistics

This paper introduces DAN+, a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language. We empirically assess three strategies to model the two-layer Named Entity Recognition (NER) task. We compare transfer capabilities from German versus in-language annotation from scratch. We examine language-specific versus multilingual BERT, and study the effect of lexical normalization on NER. Our results show that 1) the most robust strategy is multi-task learning which is rivaled by multi-label decoding, 2) BERT-based NER models are sensitive to domain shifts, and 3) in-language BERT and lexical normalization are the most beneficial on the least canonical data. Our results also show that an out-of-domain setup remains challenging, while performance on news plateaus quickly. This highlights the importance of cross-domain evaluation of cross-lingual transfer.

Norm It! Lexical Normalization for Italian and Its Downstream Effects for Dependency Parsing
Rob van der Goot | Alan Ramponi | Tommaso Caselli | Michele Cafagna | Lorenzo De Mattei
Proceedings of the Twelfth Language Resources and Evaluation Conference

Lexical normalization is the task of translating non-standard social media data to a standard form. Previous work has shown that this is beneficial for many downstream tasks in multiple languages. However, for Italian, there is no benchmark available for lexical normalization, despite the presence of many benchmarks for other tasks involving social media data. In this paper, we discuss the creation of a lexical normalization dataset for Italian. After two rounds of annotation, a Cohen’s kappa score of 78.64 is obtained. During this process, we also analyze the inter-annotator agreement for this task, which is only rarely done on datasets for lexical normalization, and when it is reported, the analysis usually remains shallow. Furthermore, we utilize this dataset to train a lexical normalization model and show that it can be used to improve dependency parsing of social media data. All annotated data and the code to reproduce the results are available at: http://bitbucket.org/robvanderg/normit.

Synthetic Data for English Lexical Normalization: How Close Can We Get to Manually Annotated Data?
Kelly Dekker | Rob van der Goot
Proceedings of the Twelfth Language Resources and Evaluation Conference

Social media is a valuable data resource for various natural language processing (NLP) tasks. However, standard NLP tools were often designed with standard texts in mind, and their performance decreases heavily when applied to social media data. One solution to this problem is to adapt the input text to a more standard form, a task also referred to as normalization. Automatic approaches to normalization have shown that they can be used to improve performance on a variety of NLP tasks. However, all of these systems are supervised, thereby being heavily dependent on the availability of training data for the correct language and domain. In this work, we attempt to overcome this dependence by automatically generating training data for lexical normalization. Starting with raw tweets, we attempt two directions: to insert non-standardness (noise) and to automatically normalize in an unsupervised setting. Our best results are achieved by automatically inserting noise. We evaluate our approaches by using an existing lexical normalization system; our best scores are achieved by a custom error generation system, which makes use of some manually created datasets. With this system, we score 94.29 accuracy on the test data, compared to 95.22 when it is trained on human-annotated data. Our best system which does not depend on any type of annotation is based on word embeddings and scores 92.04 accuracy. Finally, we perform an experiment in which we asked humans to predict whether a sentence was written by a human or generated by our best model. This experiment showed that in most cases it is hard for a human to detect automatically generated sentences.

2019

sthruggle at SemEval-2019 Task 5: An Ensemble Approach to Hate Speech Detection
Aria Nourbakhsh | Frida Vermeer | Gijs Wiltvank | Rob van der Goot
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we present our approach to detection of hate speech against women and immigrants in tweets for our participation in the SemEval-2019 Task 5. We trained an SVM and an RF classifier using character bi- and trigram features and a BiLSTM pre-initialized with external word embeddings. We combined the predictions of the SVM, RF and BiLSTM in two different ensemble models. The first was a majority vote of the binary values, and the second used the average of the confidence scores. For development, we got the highest accuracy (75%) by the final ensemble model with majority voting. For testing, all models scored substantially lower and the scores between the classifiers varied more. We believe that these large differences between the higher accuracies in the development phase and the lower accuracies we obtained in the testing phase have partly to do with differences between the training, development and testing data.

An In-depth Analysis of the Effect of Lexical Normalization on the Dependency Parsing of Social Media
Rob van der Goot
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Existing natural language processing systems have often been designed with standard texts in mind. However, when these tools are used on the substantially different texts from social media, their performance drops dramatically. One solution is to translate social media data to standard language before processing; this is also called normalization. It is well-known that this improves performance for many natural language processing tasks on social media data. However, little is known about which types of normalization replacements have the most effect. Furthermore, it is unknown what the weaknesses of existing lexical normalization systems are in an extrinsic setting. In this paper, we analyze the effect of manual as well as automatic lexical normalization for dependency parsing. After our analysis, we conclude that for most categories, automatic normalization scores close to manually annotated normalization and that small annotation differences are important to take into consideration when exploiting normalization in a pipeline setup.

MoNoise: A Multi-lingual and Easy-to-use Lexical Normalization Tool
Rob van der Goot
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

In this paper, we introduce and demonstrate the online demo as well as the command line interface of a lexical normalization system (MoNoise) for a variety of languages. We further improve this model by using features from the original word for every normalization candidate. For comparison with future work, we propose the bundling of seven datasets in six languages to form a new benchmark, together with a novel evaluation metric which is particularly suitable for cross-dataset comparisons. MoNoise reaches a new state-of-the-art performance for six out of seven of these datasets. Furthermore, we allow the user to tune the ‘aggressiveness’ of the normalization, and show how the model can be made more efficient with only a small loss in performance. The online demo can be found on: http://www.robvandergoot.com/monoise and the corresponding code on: https://bitbucket.org/robvanderg/monoise/

Multi-Team: A Multi-attention, Multi-decoder Approach to Morphological Analysis.
Ahmet Üstün | Rob van der Goot | Gosse Bouma | Gertjan van Noord
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper describes our submission to SIGMORPHON 2019 Task 2: Morphological analysis and lemmatization in context. Our model is a multi-task sequence to sequence neural network, which jointly learns morphological tagging and lemmatization. On the encoding side, we exploit character-level as well as contextual information. We introduce a multi-attention decoder to selectively focus on different parts of character and word sequences. To further improve the model, we train on multiple datasets simultaneously and use external embeddings for initialization. Our final model reaches an average morphological tagging F1 score of 94.54 and a lemma accuracy of 93.91 on the test data, ranking respectively 3rd and 6th out of 13 teams in the SIGMORPHON 2019 shared task.

2018

A Taxonomy for In-depth Evaluation of Normalization for User Generated Content
Rob van der Goot | Rik van Noord | Gertjan van Noord
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Bleaching Text: Abstract Features for Cross-lingual Gender Prediction
Rob van der Goot | Nikola Ljubešić | Ian Matroos | Malvina Nissim | Barbara Plank
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Gender prediction has typically focused on lexical and social network features, yielding good performance, but making systems highly language-, topic-, and platform dependent. Cross-lingual embeddings circumvent some of these limitations, but capture gender-specific style less. We propose an alternative: bleaching text, i.e., transforming lexical strings into more abstract features. This study provides evidence that such features allow for better transfer across languages. Moreover, we present a first study on the ability of humans to perform cross-lingual gender prediction. We find that human predictive power proves similar to that of our bleached models, and both perform better than lexical models.

Modeling Input Uncertainty in Neural Network Dependency Parsing
Rob van der Goot | Gertjan van Noord
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recently introduced neural network parsers allow for new approaches to circumvent data sparsity issues by modeling character level information and by exploiting raw data in a semi-supervised setting. Data sparsity is especially prevailing when transferring to non-standard domains. In this setting, lexical normalization has often been used in the past to circumvent data sparsity. In this paper, we investigate whether these new neural approaches provide similar functionality as lexical normalization, or whether they are complementary. We provide experimental results which show that a separate normalization component improves performance of a neural network parser even if it has access to character level information as well as external word embeddings. Further improvements are obtained by a straightforward but novel approach in which the top-N best candidates provided by the normalization component are available to the parser.

2017

Parser Adaptation for Social Media by Integrating Normalization
Rob van der Goot | Gertjan van Noord
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This work explores different approaches of using normalization for parser adaptation. Traditionally, normalization is used as a separate pre-processing step. We show that integrating the normalization model into the parsing algorithm is more beneficial. This way, multiple normalization candidates can be leveraged, which improves parsing performance on social media. We test this hypothesis by modifying the Berkeley parser; out-of-the-box it achieves an F1 score of 66.52. Our integrated approach reaches a significant improvement with an F1 score of 67.36, while using the best normalization sequence results in an F1 score of only 66.94.

Last Words: Sharing Is Caring: The Future of Shared Tasks
Malvina Nissim | Lasha Abzianidze | Kilian Evang | Rob van der Goot | Hessel Haagsma | Barbara Plank | Martijn Wieling
Computational Linguistics, Volume 43, Issue 4 - December 2017

To normalize, or not to normalize: The impact of normalization on Part-of-Speech tagging
Rob van der Goot | Barbara Plank | Malvina Nissim
Proceedings of the 3rd Workshop on Noisy User-generated Text

Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, non-canonical data? To the best of our knowledge, little is known on the actual impact of normalization in a real-world scenario, where gold error detection is not available. We investigate the effect of automatic normalization on POS tagging of tweets. We also compare normalization to strategies that leverage large amounts of unlabeled data kept in its raw form. Our results show that normalization helps, but does not add consistently beyond just word embedding layer initialization. The latter approach yields a tagging model that is competitive with a Twitter state-of-the-art tagger.

2016

The Denoised Web Treebank: Evaluating Dependency Parsing under Noisy Input Conditions
Joachim Daiber | Rob van der Goot
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce the Denoised Web Treebank: a treebank including a normalization layer and a corresponding evaluation metric for dependency parsing of noisy text, such as Tweets. This benchmark enables the evaluation of parser robustness as well as text normalization methods, including normalization as machine translation and unsupervised lexical normalization, directly on syntactic trees. Experiments show that text normalization together with a combination of domain-specific and generic part-of-speech taggers can lead to a significant improvement in parsing accuracy on this test set.

2015

ROB: Using Semantic Meaning to Recognize Paraphrases
Rob van der Goot | Gertjan van Noord
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

The Meaning Factory: Formal Semantics for Recognizing Textual Entailment and Determining Semantic Similarity
Johannes Bjerva | Johan Bos | Rob van der Goot | Malvina Nissim
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Co-authors

Malvina Nissim 5

Gertjan van Noord 5

Ahmet Üstün 4

Maria Barrett 3

Nikola Ljubešić 3

Timothy Baldwin 2

Yuri Campbell 2

Tommaso Caselli 2

Sampo Pyysalo 2

Tanja Samardzic 2

Yves Scherrer 2

Ibrahim Sharaf 2

Rik Van Noord 2

Özlem Çetinoğlu 2

Lasha Abzianidze 1

Sajawel Ahmed 1

Hamidreza Amirzadeh 1

Johannes Bjerva 1

Verena Blaschke 1

Anouck Braggaar 1

Weerayut Buaphet 1

Michele Cafagna 1

Charlie Campanella 1

Cagri Coltekin 1

Joachim Daiber 1

Cathrine Damgaard 1

Trine Eriksen 1

Viggo Unmack Gascou 1

Ximena Gutierrez-Vasques 1

Arzu Burcu Güven 1

Hessel Haagsma 1

Christian Hardmeier 1

Dana-Maria Iliescu 1

Aizhan Imankulova 1

Sadegh Jafari 1

Tommi Jauhiainen 1

Anette Jensen 1

Kristian Nørgaard Jensen 1

Virginija Juozapaityte 1

Mourhaf Kazzaz 1

Siti Oryza Khairunnisa 1

Mikkel Wildner Kildeberg 1

Kia Kirstein Hansen 1

Mamoru Komachi 1

Gustav Kristensen 1

Nicolaj Larsen 1

Frida Nøhr Laustsen 1

Robert Litschko 1

Rosario Lombardo 1

Rahmad Mahendra 1

Lorenzo De Mattei 1

Alexander Mehler 1

Benjamin Muller 1

Malthe Have Musaeus 1

Petter Mæhlum 1

Anders Giovanni Møller 1

Carsten Niemann 1

Aria Nourbakhsh 1

Amalie Brogaard Pauli 1

Marie Haahr Petersen 1

Esther Ploeger 1

Misbahur Rehman 1

Iñaki San Vicente Roncal 1

Emil Allerslev Schledermann 1

Wladimir Sidorenko 1

Axel Sorensen 1

Marija Stepanović 1

Paulina Toborek 1

Peter Brunsgaard Trolle 1

Frida Vermeer 1

Marcus Vielsted 1

Nikolaj Wallenius 1

Leon Weber-Genzel 1

Martijn Wieling 1

Gijs Wiltvank 1

Miriam Winkler 1

Marcos Zampieri 1

Anita Zielińska 1

Arkaitz Zubiaga 1

Miryam de Lhoneux 1

Talha Çolakoğlu 1

Venues