Walter Daelemans

Also published as: W. Daelemans

2025

pdf bib abs
BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language
Ehsan Lotfi | Nikolay Banar | Walter Daelemans
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)

Zero-shot evaluation of information retrieval (IR) models is often performed using BEIR; a large and heterogeneous benchmark composed of multiple datasets, covering different retrieval tasks across various domains. Although BEIR has become a standard benchmark for the zero-shot setup, its exclusively English content reduces its utility for underrepresented languages in IR, including Dutch. To address this limitation and encourage the development of Dutch IR models, we introduce BEIR-NL by automatically translating the publicly accessible BEIR datasets into Dutch. Using BEIR-NL, we evaluated a wide range of multilingual dense ranking and reranking models, as well as the lexical BM25 method. Our experiments show that BM25 remains a competitive baseline, and is only outperformed by the larger dense models trained for retrieval. When combined with reranking models, BM25 achieves performance on par with the best dense ranking models. In addition, we explored the impact of translation on the data by back-translating a selection of datasets to English, and observed a performance drop for both dense and lexical methods, indicating the limitations of translation for creating benchmarks. BEIR-NL is publicly available on the Hugging Face hub.

pdf bib abs
Jump To Hyperspace: Comparing Euclidean and Hyperbolic Loss Functions for Hierarchical Multi-Label Text Classification
Jens Van Nooten | Walter Daelemans
Proceedings of the 31st International Conference on Computational Linguistics

Hierarchical Multi-Label Text Classification (HMTC) is a challenging machine learning task where multiple labels from a hierarchically organized label set are assigned to a single text. In this study, we examine the effectiveness of Euclidean and hyperbolic loss functions to improve the performance of BERT models on HMTC, which very few previous studies have adopted. We critically evaluate label-aware losses as well as contrastive losses in the Euclidean and hyperbolic space, demonstrating that hyperbolic loss functions perform comparably with non-hyperbolic loss functions on four commonly used HMTC datasets in most scenarios. While hyperbolic label-aware losses perform the best on low-level labels, the overall consistency and micro-averaged performance is compromised. Additionally, we find that our contrastive losses are less effective for HMTC when deployed in the hyperbolic space than non-hyperbolic counterparts. Our research highlights that with the right metrics and training objectives, hyperbolic space does not provide any additional benefits compared to Euclidean space for HMTC, thereby prompting a reevaluation of how different geometric spaces are used in other AI applications.

pdf bib abs
Bilingual BSARD: Extending Statutory Article Retrieval to Dutch
Ehsan Lotfi | Nikolay Banar | Nerses Yuzbashyan | Walter Daelemans
Proceedings of the 1st Regulatory NLP Workshop (RegNLP 2025)

Statutory article retrieval plays a crucial role in making legal information more accessible to both laypeople and legal professionals. Multilingual countries like Belgium present unique challenges for retrieval models due to the need for handling legal issues in multiple languages. Building on the Belgian Statutory Article Retrieval Dataset (BSARD) in French, we introduce the bilingual version of this dataset, bBSARD. The dataset contains parallel Belgian statutory articles in both French and Dutch, along with legal questions from BSARD and their Dutch translation. Using bBSARD, we conduct extensive benchmarking of retrieval models available for Dutch and French. Our benchmarking setup includes lexical models, zero-shot dense models, and fine-tuned small foundation models. Our experiments show that BM25 remains a competitive baseline compared to many zero-shot dense models in both languages. We also observe that while proprietary models outperform open alternatives in the zero-shot setting, they can be matched or surpassed by fine-tuning small language-specific models. Our dataset and evaluation code are publicly available.

2023

pdf bib abs
Follow the Knowledge: Structural Biases and Artefacts in Knowledge Grounded Dialog Datasets
Ehsan Lotfi | Maxime De Bruyn | Jeska.buhmann@uantwerpen.be Jeska.buhmann@uantwerpen.be | Walter Daelemans
Proceedings of the Third DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering

Crowd-sourcing has been one of the primary ways to curate conversational data, specially for certain scenarios like grounding in knowledge. In this setting, using online platforms like AMT, non-expert participants are hired to converse with each other, following instructions which try to guide the outcome towards the desired format. The resulting data then is used for different parts of dialog modelling like knowledge selection and response selection/generation. In this work, we take a closer look into two of the most popular knowledge grounded dialog (KGD) datasets. Investigating potential biases and artefacts in knowledge selection labels, we observe that in many cases the ‘knowledge selection flow’ simply follows the order of presented knowledge pieces. In Wizard of Wikipedia (the most popular KGD dataset) we use simple content-agnostic models based on this bias to get significant knowledge selection performance. In Topical-Chat we see a similar correlation between the knowledge selection sequence and the order of entities and their segments, as provided to crowd-source workers. We believe that the observed results, question the significance and origin of the presumed dialog-level attributes like ‘knowledge flow’ in these crowd-sourced datasets.

pdf bib abs
PersonalityChat: Conversation Distillation for Personalized Dialog Modeling with Facts and Traits
Ehsan Lotfi | Maxime De Bruyn | Jeska Buhmann | Walter Daelemans
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

The new wave of Large Language Models (LLM) has offered an efficient tool to curate sizeable conversational datasets. So far studies have mainly focused on task-oriented or generic open-domain dialogs, and have not fully explored the ability of LLMs in following complicated prompts. In this work, we focus on personalization, and employ LLMs to curate a dataset which is difficult and costly to crowd-source: PersonalityChat is a synthetic conversational dataset based upon the popular PersonaChat dataset, but conditioned on both personas and (Big-5) personality traits. Evaluating models fine-tuned on this dataset, we show that the personality trait labels can be used for trait-based personalization of generative dialogue models. We also perform a head-to-head comparison between PersonalityChat and PersonaChat, and show that training on the distilled dataset results in more fluent and coherent dialog agents in the small-model regime.

pdf bib abs
An Exploration of Zero-Shot Natural Language Inference-Based Hate Speech Detection
Nerses Yuzbashyan | Nikolay Banar | Ilia Markov | Walter Daelemans
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

Conventional techniques for detecting online hate speech rely on the availability of a sufficient number of annotated instances, which can be costly and time consuming. For this reason, zero-shot or few-shot detection can offer an attractive alternative. In this paper, we explore a zero-shot detection approach based on natural language inference (NLI) models. Since the performance of the models in this approach depends heavily on the choice of a hypothesis, our goal is to determine which factors affect the quality of detection. We conducted a set of experiments with three NLI models and four hate speech datasets. We demonstrate that a zero-shot NLI-based approach is competitive with approaches that require supervised learning, yet they are highly sensitive to the choice of hypothesis. In addition, our experiments indicate that the results for a set of hypotheses on different model-data pairs are positively correlated, and that the correlation is higher for different datasets when using the same model than it is for different models when using the same dataset. These results suggest that if we find a hypothesis that works well for a specific model and domain or for a specific type of hate speech, we can use that hypothesis with the same model also within a different domain. While, another model might require different suitable hypotheses in order to demonstrate high performance.

pdf bib abs
Advancing Topical Text Classification: A Novel Distance-Based Method with Contextual Embeddings
Andriy Kosar | Guy De Pauw | Walter Daelemans
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

This study introduces a new method for distance-based unsupervised topical text classification using contextual embeddings. The method applies and tailors sentence embeddings for distance-based topical text classification. This is achieved by leveraging the semantic similarity between topic labels and text content, and reinforcing the relationship between them in a shared semantic space. The proposed method outperforms a wide range of existing sentence embeddings on average by 35%. Presenting an alternative to the commonly used transformer-based zero-shot general-purpose classifiers for multiclass text classification, the method demonstrates significant advantages in terms of computational efficiency and flexibility, while maintaining comparable or improved classification results.

pdf bib abs
Combining Active Learning and Task Adaptation with BERT for Cost-Effective Annotation of Social Media Datasets
Jens Lemmens | Walter Daelemans
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Social media provide a rich source of data that can be mined and used for a wide variety of research purposes. However, annotating this data can be expensive, yet necessary for state-of-the-art pre-trained language models to achieve high prediction performance. Therefore, we combine pool-based active learning based on prediction uncertainty (an established method for reducing annotation costs) with unsupervised task adaptation through Masked Language Modeling (MLM). The results on three different datasets (two social media corpora, one benchmark dataset) show that task adaptation significantly improves results and that with only a fraction of the available training data, this approach reaches similar F1-scores as those achieved by an upper-bound baseline model fine-tuned on all training data. We hereby contribute to the scarce corpus of research on active learning with pre-trained language models and propose a cost-efficient annotation sampling and fine-tuning approach that can be applied to a wide variety of tasks and datasets.

pdf bib abs
Improving Dutch Vaccine Hesitancy Monitoring via Multi-Label Data Augmentation with GPT-3.5
Jens Van Nooten | Walter Daelemans
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

In this paper, we leverage the GPT-3.5 language model both using the Chat-GPT API interface and the GPT-3.5 API interface to generate realistic examples of anti-vaccination tweets in Dutch with the aim of augmenting an imbalanced multi-label vaccine hesitancy argumentation classification dataset. In line with previous research, we devise a prompt that, on the one hand, instructs the model to generate realistic examples based on the gold standard dataset and, on the other hand, to assign multiple pseudo-labels (or a single pseudo-label) to the generated instances. We then augment our gold standard data with the generated examples and evaluate the impact thereof in a cross-validation setting with several state-of-the-art Dutch large language models. This augmentation technique predominantly shows improvements in F1 for classifying underrepresented classes while increasing the overall recall, paired with a slight decrease in precision for more common classes. Furthermore, we examine how well the synthetic data generalises to human data in the classification task. To our knowledge, we are the first to utilise Chat-GPT and GPT-3.5 for augmenting a Dutch multi-label dataset classification task.

2022

pdf bib abs
Is It Smaller Than a Tennis Ball? Language Models Play the Game of Twenty Questions
Maxime De Bruyn | Ehsan Lotfi | Jeska Buhmann | Walter Daelemans
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Researchers often use games to analyze the abilities of Artificial Intelligence models. In this work, we use the game of Twenty Questions to study the world knowledge of language models. Despite its simplicity for humans, this game requires a broad knowledge of the world to answer yes/no questions. We evaluate several language models on this task and find that only the largest model has enough world knowledge to play it well, although it still has difficulties with the shape and size of objects. We also present a new method to improve the knowledge of smaller models by leveraging external information from the web. Finally, we release our dataset and Twentle, a website to interactively test the knowledge of language models by playing Twenty Questions.

pdf bib abs
Open-Domain Dialog Evaluation Using Follow-Ups Likelihood
Maxime De Bruyn | Ehsan Lotfi | Jeska Buhmann | Walter Daelemans
Proceedings of the 29th International Conference on Computational Linguistics

Automatic evaluation of open-domain dialogs remains an unsolved problem. Existing methods do not correlate strongly with human annotations. In this paper, we present a new automated evaluation method based on the use of follow-ups. We measure the probability that a language model will continue the conversation with a fixed set of follow-ups (e.g. not really relevant here, what are you trying to say?). When compared against twelve existing methods, our new evaluation achieves the highest correlation with human evaluations.

pdf bib abs
Domain- and Task-Adaptation for VaccinChatNL, a Dutch COVID-19 FAQ Answering Corpus and Classification Model
Jeska Buhmann | Maxime De Bruyn | Ehsan Lotfi | Walter Daelemans
Proceedings of the 29th International Conference on Computational Linguistics

FAQs are important resources to find information. However, especially if a FAQ concerns many question-answer pairs, it can be a difficult and time-consuming job to find the answer you are looking for. A FAQ chatbot can ease this process by automatically retrieving the relevant answer to a user’s question. We present VaccinChatNL, a Dutch FAQ corpus on the topic of COVID-19 vaccination. Starting with 50 question-answer pairs we built VaccinChat, a FAQ chatbot, which we used to gather more user questions that were also annotated with the appropriate or new answer classes. This iterative process of gathering user questions, annotating them, and retraining the model with the increased data set led to a corpus that now contains 12,883 user questions divided over 181 answers. We provide the first publicly available Dutch FAQ answering data set of this size with large groups of semantically equivalent human-paraphrased questions. Furthermore, our study shows that before fine-tuning a classifier, continued pre-training of Dutch language models with task- and/or domain-specific data improves classification results. In addition, we show that large groups of semantically similar questions are important for obtaining well-performing intent classification models.

pdf bib abs
CoNTACT: A Dutch COVID-19 Adapted BERT for Vaccine Hesitancy and Argumentation Detection
Jens Lemmens | Jens Van Nooten | Tim Kreutz | Walter Daelemans
Proceedings of the 29th International Conference on Computational Linguistics

We present CoNTACT: a Dutch language model adapted to the domain of COVID-19 tweets. The model was developed by continuing the pre-training phase of RobBERT (Delobelle et al., 2020) by using 2.8M Dutch COVID-19 related tweets posted in 2021. In order to test the performance of the model and compare it to RobBERT, the two models were tested on two tasks: (1) binary vaccine hesitancy detection and (2) detection of arguments for vaccine hesitancy. For both tasks, not only Twitter but also Facebook data was used to show cross-genre performance. In our experiments, CoNTACT showed statistically significant gains over RobBERT in all experiments for task 1. For task 2, we observed substantial improvements in virtually all classes in all experiments. An error analysis indicated that the domain adaptation yielded better representations of domain-specific terminology, causing CoNTACT to make more accurate classification decisions. For task 2, we observed substantial improvements in virtually all classes in all experiments. An error analysis indicated that the domain adaptation yielded better representations of domain-specific terminology, causing CoNTACT to make more accurate classification decisions.

pdf bib abs
20Q: Overlap-Free World Knowledge Benchmark for Language Models
Maxime De Bruyn | Ehsan Lotfi | Jeska Buhmann | Walter Daelemans
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

What do language models know about our world? This question is hard to answer but important to get right. To this end, we introduce 20Q, a novel benchmark using the Twenty Questions game to evaluate world knowledge and common sense of language models. Thanks to our overlap-free benchmark, language models learn the game of Twenty Questions without learning relevant knowledge for the test set. We uncover two intuitive factors influencing the world knowledge of language models: the size of the model and the topic frequency in the pre-training data. Moreover, we show that in-context learning is inefficient for evaluating language models’ world knowledge — fine-tuning is necessary to show their true capabilities. Lastly, our results show room for improvement to enhance the world knowledge and common sense of large language models. A potential solution would be to up-sample unfrequent topics in the pre-training of language models.

pdf bib abs
What Was Your Name Again? Interrogating Generative Conversational Models For Factual Consistency Evaluation
Ehsan Lotfi | Maxime De Bruyn | Jeska Buhmann | Walter Daelemans
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Generative conversational agents are known to suffer from problems like inconsistency and hallucination, and a big challenge in studying these issues remains evaluation: they are not properly reflected in common text generation metrics like perplexity or BLEU, and alternative implicit methods like semantic similarity or NLI labels can be misguided when few specific tokens are decisive. In this work we propose ConsisTest; a factual consistency benchmark including both WH and Y/N questions based on PersonaChat, along with a hybrid evaluation pipeline which aims to get the best of symbolic and sub-symbolic methods. Using these and focusing on pretrained generative models like BART, we provide detailed statistics and analysis on how the model’s consistency is affected by variations in question and context.

pdf bib abs
Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations
Chris Emmery | Ákos Kádár | Grzegorz Chrupała | Walter Daelemans
Proceedings of the Thirteenth Language Resources and Evaluation Conference

A limited amount of studies investigates the role of model-agnostic adversarial behavior in toxic content classification. As toxicity classifiers predominantly rely on lexical cues, (deliberately) creative and evolving language-use can be detrimental to the utility of current corpora and state-of-the-art models when they are deployed for content moderation. The less training data is available, the more vulnerable models might become. This study is, to our knowledge, the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection. We demonstrate that model-agnostic lexical substitutions significantly hurt classifier performance. Moreover, when these perturbed samples are used for augmentation, we show models become robust against word-level perturbations at a slight trade-off in overall task performance. Augmentations proposed in prior work on toxicity prove to be less effective. Our results underline the need for such evaluations in online harm areas with small corpora.

pdf bib abs
Machine Translation for Multilingual Intent Detection and Slots Filling
Maxime De bruyn | Ehsan Lotfi | Jeska Buhmann | Walter Daelemans
Proceedings of the Massively Multilingual Natural Language Understanding Workshop (MMNLU-22)

We expect to interact with home assistants irrespective of our language. However, scaling the Natural Language Understanding pipeline to multiple languages while keeping the same level of accuracy remains a challenge. In this work, we leverage the inherent multilingual aspect of translation models for the task of multilingual intent classification and slot filling. Our experiments reveal that they work equally well with general-purpose multilingual text-to-text models. Furthermore, their accuracy can be further improved by artificially increasing the size of the training set. Unfortunately, increasing the training set also increases the overlap with the test set, leading to overestimating their true capabilities. As a result, we propose two new evaluation methods capable of accounting for an overlap between the training and test set.

pdf bib abs
The Role of Context in Detecting the Target of Hate Speech
Ilia Markov | Walter Daelemans
Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022)

Online hate speech detection is an inherently challenging task that has recently received much attention from the natural language processing community. Despite a substantial increase in performance, considerable challenges remain and include encoding contextual information into automated hate speech detection systems. In this paper, we focus on detecting the target of hate speech in Dutch social media: whether a hateful Facebook comment is directed against migrants or not (i.e., against someone else). We manually annotate the relevant conversational context and investigate the effect of different aspects of context on performance when adding it to a Dutch transformer-based pre-trained language model, BERTje. We show that performance of the model can be significantly improved by integrating relevant contextual information.

2021

pdf bib abs
Scalable Few-Shot Learning of Robust Biomedical Name Representations
Pieter Fivez | Simon Suster | Walter Daelemans
Proceedings of the 20th Workshop on Biomedical Language Processing

Recent research on robust representations of biomedical names has focused on modeling large amounts of fine-grained conceptual distinctions using complex neural encoders. In this paper, we explore the opposite paradigm: training a simple encoder architecture using only small sets of names sampled from high-level biomedical concepts. Our encoder post-processes pretrained representations of biomedical names, and is effective for various types of input representations, both domain-specific or unsupervised. We validate our proposed few-shot learning approach on multiple biomedical relatedness benchmarks, and show that it allows for continual learning, where we accumulate information from various conceptual hierarchies to consistently improve encoder performance. Given these findings, we propose our approach as a low-cost alternative for exploring the impact of conceptual distinctions on robust biomedical name representations.

pdf bib abs
Are we there yet? Exploring clinical domain knowledge of BERT models
Madhumita Sushil | Simon Suster | Walter Daelemans
Proceedings of the 20th Workshop on Biomedical Language Processing

We explore whether state-of-the-art BERT models encode sufficient domain knowledge to correctly perform domain-specific inference. Although BERT implementations such as BioBERT are better at domain-based reasoning than those trained on general-domain corpora, there is still a wide margin compared to human performance on these tasks. To bridge this gap, we explore whether supplementing textual domain knowledge in the medical NLI task: a) by further language model pretraining on the medical domain corpora, b) by means of lexical match algorithms such as the BM25 algorithm, c) by supplementing lexical retrieval with dependency relations, or d) by using a trained retriever module, can push this performance closer to that of humans. We do not find any significant difference between knowledge supplemented classification as opposed to the baseline BERT models, however. This is contrary to the results for evidence retrieval on other tasks such as open domain question answering (QA). By examining the retrieval output, we show that the methods fail due to unreliable knowledge retrieval for complex domain-specific reasoning. We conclude that the task of unsupervised text retrieval to bridge the gap in existing information to facilitate inference is more complex than what the state-of-the-art methods can solve, and warrants extensive research in the future.

pdf bib abs
Contextual explanation rules for neural clinical classifiers
Madhumita Sushil | Simon Suster | Walter Daelemans
Proceedings of the 20th Workshop on Biomedical Language Processing

Several previous studies on explanation for recurrent neural networks focus on approaches that find the most important input segments for a network as its explanations. In that case, the manner in which these input segments combine with each other to form an explanatory pattern remains unknown. To overcome this, some previous work tries to find patterns (called rules) in the data that explain neural outputs. However, their explanations are often insensitive to model parameters, which limits the scalability of text explanations. To overcome these limitations, we propose a pipeline to explain RNNs by means of decision lists (also called rules) over skipgrams. For evaluation of explanations, we create a synthetic sepsis-identification dataset, as well as apply our technique on additional clinical and sentiment analysis datasets. We find that our technique persistently achieves high explanation fidelity and qualitatively interpretable rules.

pdf bib abs
Conceptual Grounding Constraints for Truly Robust Biomedical Name Representations
Pieter Fivez | Simon Suster | Walter Daelemans
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Effective representation of biomedical names for downstream NLP tasks requires the encoding of both lexical as well as domain-specific semantic information. Ideally, the synonymy and semantic relatedness of names should be consistently reflected by their closeness in an embedding space. To achieve such robustness, prior research has considered multi-task objectives when training neural encoders. In this paper, we take a next step towards truly robust representations, which capture more domain-specific semantics while remaining universally applicable across different biomedical corpora and domains. To this end, we use conceptual grounding constraints which more effectively align encoded names to pretrained embeddings of their concept identifiers. These constraints are effective even when using a Deep Averaging Network, a simple feedforward encoding architecture that allows for scaling to large corpora while remaining sufficiently expressive. We empirically validate our approach using multiple tasks and benchmarks, which assess both literal synonymy as well as more general semantic relatedness.

While solving math word problems automatically has received considerable attention in the NLP community, few works have addressed probability word problems specifically. In this paper, we employ and analyse various neural models for answering such word problems. In a two-step approach, the problem text is first mapped to a formal representation in a declarative language using a sequence-to-sequence model, and then the resulting representation is executed using a probabilistic programming system to provide the answer. Our best performing model incorporates general-domain contextualised word representations that were finetuned using transfer learning on another in-domain dataset. We also apply end-to-end models to this task, which bring out the importance of the two-step approach in obtaining correct solutions to probability problems.

pdf bib abs
Integrating Higher-Level Semantics into Robust Biomedical Name Representations
Pieter Fivez | Simon Suster | Walter Daelemans
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis

Neural encoders of biomedical names are typically considered robust if representations can be effectively exploited for various downstream NLP tasks. To achieve this, encoders need to model domain-specific biomedical semantics while rivaling the universal applicability of pretrained self-supervised representations. Previous work on robust representations has focused on learning low-level distinctions between names of fine-grained biomedical concepts. These fine-grained concepts can also be clustered together to reflect higher-level, more general semantic distinctions, such as grouping the names nettle sting and tick-borne fever together under the description puncture wound of skin. It has not yet been empirically confirmed that training biomedical name encoders on fine-grained distinctions automatically leads to bottom-up encoding of such higher-level semantics. In this paper, we show that this bottom-up effect exists, but that it is still relatively limited. As a solution, we propose a scalable multi-task training regime for biomedical name encoders which can also learn robust representations using only higher-level semantic classes. These representations can generalise both bottom-up as well as top-down among various semantic hierarchies. Moreover, we show how they can be used out-of-the-box for improved unsupervised detection of hypernyms, while retaining robust performance on various semantic relatedness benchmarks.

pdf bib abs
MFAQ: a Multilingual FAQ Dataset
Maxime De Bruyn | Ehsan Lotfi | Jeska Buhmann | Walter Daelemans
Proceedings of the 3rd Workshop on Machine Reading for Question Answering

In this paper, we present the first multilingual FAQ dataset publicly available. We collected around 6M FAQ pairs from the web, in 21 different languages. Although this is significantly larger than existing FAQ retrieval datasets, it comes with its own challenges: duplication of content and uneven distribution of topics. We adopt a similar setup as Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset. Our experiments reveal that a multilingual model based on XLM-RoBERTa achieves the best results, except for English. Lower resources languages seem to learn from one another as a multilingual model achieves a higher MRR than language-specific ones. Our qualitative analysis reveals the brittleness of the model on simple word changes. We publicly release our dataset, model, and training script.

pdf bib abs
Teach Me What to Say and I Will Learn What to Pick: Unsupervised Knowledge Selection Through Response Generation with Pretrained Generative Models
Ehsan Lotfi | Maxime De Bruyn | Jeska Buhmann | Walter Daelemans
Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI

Knowledge Grounded Conversation Models are usually based on a selection/retrieval module and a generation module, trained separately or simultaneously, with or without having access to a ‘gold’ knowledge option. With the introduction of large pre-trained generative models, the selection and generation part have become more and more entangled, shifting the focus towards enhancing knowledge incorporation (from multiple sources) instead of trying to pick the best knowledge option. These approaches however depend on knowledge labels and/or a separate dense retriever for their best performance. In this work we study the unsupervised selection abilities of pre-trained generative models (e.g. BART) and show that by adding a score-and-aggregate module between encoder and decoder, they are capable of learning to pick the proper knowledge through minimising the language modelling loss (i.e. without having access to knowledge labels). Trained as such, our model - K-Mine - shows competitive selection and generation performance against models that benefit from knowledge labels and/or separate dense retriever.

pdf bib abs
Improving Hate Speech Type and Target Detection with Hateful Metaphor Features
Jens Lemmens | Ilia Markov | Walter Daelemans
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

We study the usefulness of hateful metaphorsas features for the identification of the type and target of hate speech in Dutch Facebook comments. For this purpose, all hateful metaphors in the Dutch LiLaH corpus were annotated and interpreted in line with Conceptual Metaphor Theory and Critical Metaphor Analysis. We provide SVM and BERT/RoBERTa results, and investigate the effect of different metaphor information encoding methods on hate speech type and target detection accuracy. The results of the conducted experiments show that hateful metaphor features improve model performance for the both tasks. To our knowledge, it is the first time that the effectiveness of hateful metaphors as an information source for hatespeech classification is investigated.

pdf bib abs
Improving Cross-Domain Hate Speech Detection by Reducing the False Positive Rate
Ilia Markov | Walter Daelemans
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

Hate speech detection is an actively growing field of research with a variety of recently proposed approaches that allowed to push the state-of-the-art results. One of the challenges of such automated approaches – namely recent deep learning models – is a risk of false positives (i.e., false accusations), which may lead to over-blocking or removal of harmless social media content in applications with little moderator intervention. We evaluate deep learning models both under in-domain and cross-domain hate speech detection conditions, and introduce an SVM approach that allows to significantly improve the state-of-the-art results when combined with the deep learning models through a simple majority-voting ensemble. The improvement is mainly due to a reduction of the false positive rate.

pdf bib abs
Exploring Stylometric and Emotion-Based Features for Multilingual Cross-Domain Hate Speech Detection
Ilia Markov | Nikola Ljubešić | Darja Fišer | Walter Daelemans
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In this paper, we describe experiments designed to evaluate the impact of stylometric and emotion-based features on hate speech detection: the task of classifying textual content into hate or non-hate speech classes. Our experiments are conducted for three languages – English, Slovene, and Dutch – both in in-domain and cross-domain setups, and aim to investigate hate speech using features that model two linguistic phenomena: the writing style of hateful social media content operationalized as function word usage on the one hand, and emotion expression in hateful messages on the other hand. The results of experiments with features that model different combinations of these phenomena support our hypothesis that stylometric and emotion-based features are robust indicators of hate speech. Their contribution remains persistent with respect to domain and language variation. We show that the combination of features that model the targeted phenomena outperforms words and character n-gram features under cross-domain conditions, and provides a significant boost to deep learning models, which currently obtain the best results, when combined with them in an ensemble.

2020

pdf bib abs
A Deep Generative Approach to Native Language Identification
Ehsan Lotfi | Ilia Markov | Walter Daelemans
Proceedings of the 28th International Conference on Computational Linguistics

Native language identification (NLI) – identifying the native language (L1) of a person based on his/her writing in the second language (L2) – is useful for a variety of purposes, including marketing, security, and educational applications. From a traditional machine learning perspective,NLI is usually framed as a multi-class classification task, where numerous designed features are combined in order to achieve the state-of-the-art results. We introduce a deep generative language modelling (LM) approach to NLI, which consists in fine-tuning a GPT-2 model separately on texts written by the authors with the same L1, and assigning a label to an unseen text based on the minimum LM loss with respect to one of these fine-tuned GPT-2 models. Our method outperforms traditional machine learning approaches and currently achieves the best results on the benchmark NLI datasets.

pdf bib abs
Sarcasm Detection Using an Ensemble Approach
Jens Lemmens | Ben Burtenshaw | Ehsan Lotfi | Ilia Markov | Walter Daelemans
Proceedings of the Second Workshop on Figurative Language Processing

We present an ensemble approach for the detection of sarcasm in Reddit and Twitter responses in the context of The Second Workshop on Figurative Language Processing held in conjunction with ACL 2020. The ensemble is trained on the predicted sarcasm probabilities of four component models and on additional features, such as the sentiment of the comment, its length, and source (Reddit or Twitter) in order to learn which of the component models is the most reliable for which input. The component models consist of an LSTM with hashtag and emoji representations; a CNN-LSTM with casing, stop word, punctuation, and sentiment representations; an MLP based on Infersent embeddings; and an SVM trained on stylometric and emotion-based features. All component models use the two conversational turns preceding the response as context, except for the SVM, which only uses features extracted from the response. The ensemble itself consists of an adaboost classifier with the decision tree algorithm as base estimator and yields F1-scores of 67% and 74% on the Reddit and Twitter test data, respectively.

pdf bib abs
Neural Machine Translation of Artwork Titles Using Iconclass Codes
Nikolay Banar | Walter Daelemans | Mike Kestemont
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We investigate the use of Iconclass in the context of neural machine translation for NL<->EN artwork titles. Iconclass is a widely used iconographic classification system used in the cultural heritage domain to describe and retrieve subjects represented in the visual arts. The resource contains keywords and definitions to encode the presence of objects, people, events and ideas depicted in artworks, such as paintings. We propose a simple concatenation approach that improves the quality of automatically generated title translations for artworks, by leveraging textual information extracted from Iconclass. Our results demonstrate that a neural machine translation system is able to exploit this metadata to boost the translation performance of artwork titles. This technology enables interesting applications of machine learning in resource-scarce domains in the cultural sector.

pdf bib abs
Orthographic Codes and the Neighborhood Effect: Lessons from Information Theory
Stéphan Tulkens | Dominiek Sandra | Walter Daelemans
Proceedings of the Twelfth Language Resources and Evaluation Conference

We consider the orthographic neighborhood effect: the effect that words with more orthographic similarity to other words are read faster. The neighborhood effect serves as an important control variable in psycholinguistic studies of word reading, and explains variance in addition to word length and word frequency. Following previous work, we model the neighborhood effect as the average distance to neighbors in feature space for three feature sets: slots, character ngrams and skipgrams. We optimize each of these feature sets and find evidence for language-independent optima, across five megastudy corpora from five alphabetic languages. Additionally, we show that weighting features using the inverse of mutual information (MI) improves the neighborhood effect significantly for all languages. We analyze the inverse feature weighting, and show that, across languages, grammatical morphemes get the lowest weights. Finally, we perform the same experiments on Korean Hangul, a non-alphabetic writing system, where we find the opposite results: slower responses as a function of denser neighborhoods, and a negative effect of inverse feature weighting. This raises the question of whether this is a cognitive effect, or an effect of the way we represent Hangul orthography, and indicates more research is needed.

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

pdf bib abs
The LiLaH Emotion Lexicon of Croatian, Dutch and Slovene
Nikola Ljubešić | Ilia Markov | Darja Fišer | Walter Daelemans
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media

In this paper, we present emotion lexicons of Croatian, Dutch and Slovene, based on manually corrected automatic translations of the English NRC Emotion lexicon. We evaluate the impact of the translation changes by measuring the change in supervised classification results of socially unacceptable utterances when lexicon information is used for feature construction. We further showcase the usage of the lexicons by calculating the difference in emotion distributions in texts containing and not containing socially unacceptable discourse, comparing them across four languages (English, Croatian, Dutch, Slovene) and two topics (migrants and LGBT). We show significant and consistent improvements in automatic classification across all languages and topics, as well as consistent (and expected) emotion distributions across all languages and topics, proving for the manually corrected lexicons to be a useful addition to the severely lacking area of emotion lexicons, the crucial resource for emotive analysis of text.

pdf bib abs
Streaming Language-Specific Twitter Data with Optimal Keywords
Tim Kreutz | Walter Daelemans
Proceedings of the 12th Web as Corpus Workshop

The Twitter Streaming API has been used to create language-specific corpora with varying degrees of success. Selecting a filter of frequent yet distinct keywords for German resulted in a near-complete collection of German tweets. This method is promising as it keeps within Twitter endpoint limitations and could be applied to other languages besides German. But so far no research has compared methods for selecting optimal keywords for this task. This paper proposes a method for finding optimal key phrases based on a greedy solution to the maximum coverage problem. We generate candidate key phrases for the 50 most frequent languages on Twitter. Candidates are then iteratively selected based on a variety of scoring functions applied to their coverage of target tweets. Selecting candidates based on the scoring function that exponentiates the precision of a key phrase and weighs it by recall achieved the best results overall. Some target languages yield lower results than what could be expected from their prevalence on Twitter. Upon analyzing the errors, we find that these are languages that are very close to more prevalent languages. In these cases, key phrases that limit finding the competitive language are selected, and overall recall on the target language also decreases. We publish the resulting optimized lists for each language as a resource. The code to generate lists for other research objectives is also supplied.

2018

pdf bib abs
Enhancing General Sentiment Lexicons for Domain-Specific Use
Tim Kreutz | Walter Daelemans
Proceedings of the 27th International Conference on Computational Linguistics

Lexicon based methods for sentiment analysis rely on high quality polarity lexicons. In recent years, automatic methods for inducing lexicons have increased the viability of lexicon based methods for polarity classification. SentProp is a framework for inducing domain-specific polarities from word embeddings. We elaborate on SentProp by evaluating its use for enhancing DuOMan, a general-purpose lexicon, for use in the political domain. By adding only top sentiment bearing words from the vocabulary and applying small polarity shifts in the general-purpose lexicon, we increase accuracy in an in-domain classification task. The enhanced lexicon performs worse than the original lexicon in an out-domain task, showing that the words we added and the polarity shifts we applied are domain-specific and do not translate well to an out-domain setting.

pdf bib abs
From Strings to Other Things: Linking the Neighborhood and Transposition Effects in Word Reading
Stéphan Tulkens | Dominiek Sandra | Walter Daelemans
Proceedings of the 22nd Conference on Computational Natural Language Learning

We investigate the relation between the transposition and deletion effects in word reading, i.e., the finding that readers can successfully read “SLAT” as “SALT”, or “WRK” as “WORK”, and the neighborhood effect. In particular, we investigate whether lexical orthographic neighborhoods take into account transposition and deletion in determining neighbors. If this is the case, it is more likely that the neighborhood effect takes place early during processing, and does not solely rely on similarity of internal representations. We introduce a new neighborhood measure, rd20, which can be used to quantify neighborhood effects over arbitrary feature spaces. We calculate the rd20 over large sets of words in three languages using various feature sets and show that feature sets that do not allow for transposition or deletion explain more variance in Reaction Time (RT) measurements. We also show that the rd20 can be calculated using the hidden state representations of an Multi-Layer Perceptron, and show that these explain less variance than the raw features. We conclude that the neighborhood effect is unlikely to have a perceptual basis, but is more likely to be the result of items co-activating after recognition. All code is available at: www.github.com/clips/conll2018

pdf bib
WordKit: a Python Package for Orthographic and Phonological Featurization
Stéphan Tulkens | Dominiek Sandra | Walter Daelemans
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs
CliCR: a Dataset of Clinical Case Reports for Machine Reading Comprehension
Simon Šuster | Walter Daelemans
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We present a new dataset for machine comprehension in the medical domain. Our dataset uses clinical case reports with around 100,000 gap-filling queries about these cases. We apply several baselines and state-of-the-art neural readers to the dataset, and observe a considerable gap in performance (20% F1) between the best human and machine readers. We analyze the skills required for successful answering and show how reader performance varies depending on the applicable skills. We find that inferences using domain knowledge and object tracking are the most frequently required skills, and that recognizing omitted information and spatio-temporal reasoning are the most difficult for the machines.

pdf bib abs
Exploring Classifier Combinations for Language Variety Identification
Tim Kreutz | Walter Daelemans
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

This paper describes CLiPS’s submissions for the Discriminating between Dutch and Flemish in Subtitles (DFS) shared task at VarDial 2018. We explore different ways to combine classifiers trained on different feature groups. Our best system uses two Linear SVM classifiers; one trained on lexical features (word n-grams) and one trained on syntactic features (PoS n-grams). The final prediction for a document to be in Flemish Dutch or Netherlandic Dutch is made by the classifier that outputs the highest probability for one of the two labels. This confidence vote approach outperforms a meta-classifier on the development data and on the test data.

pdf bib abs
Rule induction for global explanation of trained models
Madhumita Sushil | Simon Šuster | Walter Daelemans
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Understanding the behavior of a trained network and finding explanations for its outputs is important for improving the network’s performance and generalization ability, and for ensuring trust in automated systems. Several approaches have previously been proposed to identify and visualize the most important features by analyzing a trained network. However, the relations between different features and classes are lost in most cases. We propose a technique to induce sets of if-then-else rules that capture these relations to globally explain the predictions of a network. We first calculate the importance of the features in the trained network. We then weigh the original inputs with these feature importance scores, simplify the transformed input space, and finally fit a rule induction model to explain the model predictions. We find that the output rule-sets can explain the predictions of a neural network trained for 4-class text classification from the 20 newsgroups dataset to a macro-averaged F-score of 0.80. We make the code available at https://github.com/clips/interpret_with_rules.

pdf bib abs
Revisiting neural relation classification in clinical notes with external information
Simon Šuster | Madhumita Sushil | Walter Daelemans
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

Recently, segment convolutional neural networks have been proposed for end-to-end relation extraction in the clinical domain, achieving results comparable to or outperforming the approaches with heavy manual feature engineering. In this paper, we analyze the errors made by the neural classifier based on confusion matrices, and then investigate three simple extensions to overcome its limitations. We find that including ontological association between drugs and problems, and data-induced association between medical concepts does not reliably improve the performance, but that large gains are obtained by the incorporation of semantic classes to capture relation triggers.

pdf bib abs
Predicting Adolescents’ Educational Track from Chat Messages on Dutch Social Media
Lisa Hilte | Walter Daelemans | Reinhild Vandekerckhove
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

We aim to predict Flemish adolescents’ educational track based on their Dutch social media writing. We distinguish between the three main types of Belgian secondary education: General (theory-oriented), Vocational (practice-oriented), and Technical Secondary Education (hybrid). The best results are obtained with a Naive Bayes model, i.e. an F-score of 0.68 (std. dev. 0.05) in 10-fold cross-validation experiments on the training data and an F-score of 0.60 on unseen data. Many of the most informative features are character n-grams containing specific occurrences of chatspeak phenomena such as emoticons. While the detection of the most theory- and practice-oriented educational tracks seems to be a relatively easy task, the hybrid Technical level appears to be much harder to capture based on online writing style, as expected.

2017

pdf bib abs
Towards the Improvement of Automatic Emotion Pre-annotation with Polarity and Subjective Information
Lea Canales | Walter Daelemans | Ester Boldrini | Patricio Martínez-Barco
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Emotion detection has a high potential positive impact on the benefit of business, society, politics or education. Given this, the main objective of our research is to contribute to the resolution of one of the most important challenges in textual emotion detection: emotional corpora annotation. This will be tackled by proposing a semi-automatic methodology. It consists in two main phases: (1) an automatic process to pre-annotate the unlabelled sentences with a reduced number of emotional categories; and (2) a manual process of refinement where human annotators will determine which is the dominant emotion between the pre-defined set. Our objective in this paper is to show the pre-annotation process, as well as to evaluate the usability of subjective and polarity information in this process. The evaluation performed confirms clearly the benefits of employing the polarity and subjective information on emotion detection and thus endorses the relevance of our approach.

pdf bib abs
A Short Review of Ethical Challenges in Clinical Natural Language Processing
Simon Šuster | Stéphan Tulkens | Walter Daelemans
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing

Clinical NLP has an immense potential in contributing to how clinical practice will be revolutionized by the advent of large scale processing of clinical records. However, this potential has remained largely untapped due to slow progress primarily caused by strict data access policies for researchers. In this paper, we discuss the concern for privacy and the measures it entails. We also suggest sources of less sensitive data. Finally, we draw attention to biases that can compromise the validity of empirical research and lead to socially harmful applications.

pdf bib abs
Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-Gram Embeddings
Pieter Fivez | Simon Šuster | Walter Daelemans
BioNLP 2017

We present an unsupervised context-sensitive spelling correction method for clinical free-text that uses word and character n-gram embeddings. Our method generates misspelling replacement candidates and ranks them according to their semantic fit, by calculating a weighted cosine similarity between the vectorized representation of a candidate and the misspelling context. We greatly outperform two baseline off-the-shelf spelling correction tools on a manually annotated MIMIC-III test set, and counter the frequency bias of an optimized noisy channel model, showing that neural embeddings can be successfully exploited to include context-awareness in a spelling correction model.

pdf bib abs
Simple Queries as Distant Labels for Predicting Gender on Twitter
Chris Emmery | Grzegorz Chrupała | Walter Daelemans
Proceedings of the 3rd Workshop on Noisy User-generated Text

The majority of research on extracting missing user attributes from social media profiles use costly hand-annotated labels for supervised learning. Distantly supervised methods exist, although these generally rely on knowledge gathered using external sources. This paper demonstrates the effectiveness of gathering distant labels for self-reported gender on Twitter using simple queries. We confirm the reliability of this query heuristic by comparing with manual annotation. Moreover, using these labels for distant supervision, we demonstrate competitive model performance on the same data as models trained on manual annotations. As such, we offer a cheap, extensible, and fast alternative that can be employed beyond the task of gender classification.

pdf bib abs
Assessing the Stylistic Properties of Neurally Generated Text in Authorship Attribution
Enrique Manjavacas | Jeroen De Gussem | Walter Daelemans | Mike Kestemont
Proceedings of the Workshop on Stylistic Variation

Recent applications of neural language models have led to an increased interest in the automatic generation of natural language. However impressive, the evaluation of neurally generated text has so far remained rather informal and anecdotal. Here, we present an attempt at the systematic assessment of one aspect of the quality of neurally generated text. We focus on a specific aspect of neural language generation: its ability to reproduce authorial writing styles. Using established models for authorship attribution, we empirically assess the stylistic qualities of neurally generated text. In comparison to conventional language models, neural models generate fuzzier text, that is relatively harder to attribute correctly. Nevertheless, our results also suggest that neurally generated text offers more valuable perspectives for the augmentation of training data.

2016

pdf bib abs
TwiSty: A Multilingual Twitter Stylometry Corpus for Gender and Personality Profiling
Ben Verhoeven | Walter Daelemans | Barbara Plank
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Personality profiling is the task of detecting personality traits of authors based on writing style. Several personality typologies exist, however, the Briggs-Myer Type Indicator (MBTI) is particularly popular in the non-scientific community, and many people use it to analyse their own personality and talk about the results online. Therefore, large amounts of self-assessed data on MBTI are readily available on social-media platforms such as Twitter. We present a novel corpus of tweets annotated with the MBTI personality type and gender of their author for six Western European languages (Dutch, German, French, Italian, Portuguese and Spanish). We outline the corpus creation and annotation, show statistics of the obtained data distributions and present first baselines on Myers-Briggs personality profiling and gender prediction for all six languages.

pdf bib abs
Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource
Stéphan Tulkens | Chris Emmery | Walter Daelemans
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Word embeddings have recently seen a strong increase in interest as a result of strong performance gains on a variety of tasks. However, most of this research also underlined the importance of benchmark datasets, and the difficulty of constructing these for a variety of language-specific tasks. Still, many of the datasets used in these tasks could prove to be fruitful linguistic resources, allowing for unique observations into language use and variability. In this paper we demonstrate the performance of multiple types of embeddings, created with both count and prediction-based architectures on a variety of corpora, in two language-specific tasks: relation evaluation, and dialect identification. For the latter, we compare unsupervised methods with a traditional, hand-crafted dictionary. With this research, we provide the embeddings themselves, the relation evaluation task benchmark for use in further research, and demonstrate how the benchmarked embeddings prove a useful unsupervised linguistic resource, effectively used in a downstream task.

pdf bib
Using Distributed Representations to Disambiguate Biomedical and Clinical Concepts
Stéphan Tulkens | Simon Suster | Walter Daelemans
Proceedings of the 15th Workshop on Biomedical Natural Language Processing

2015

pdf bib
Towards a Model of Prediction-based Syntactic Category Acquisition: First Steps with Word Embeddings
Robert Grimm | Giovanni Cassani | Walter Daelemans | Steven Gillis
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

pdf bib
Which distributional cues help the most? Unsupervised contexts selection for lexical category acquisition
Giovanni Cassani | Robert Grimm | Walter Daelemans | Steven Gillis
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

2014

pdf bib
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Alessandro Moschitti | Bo Pang | Walter Daelemans
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib abs
CLiPS Stylometry Investigation (CSI) corpus: A Dutch corpus for the detection of age, gender, personality, sentiment and deception in text
Ben Verhoeven | Walter Daelemans
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present the CLiPS Stylometry Investigation (CSI) corpus, a new Dutch corpus containing reviews and essays written by university students. It is designed to serve multiple purposes: detection of age, gender, authorship, personality, sentiment, deception, topic and genre. Another major advantage is its planned yearly expansion with each year’s new students. The corpus currently contains about 305,000 tokens spread over 749 documents. The average review length is 128 tokens; the average essay length is 1126 tokens. The corpus will be made available on the CLiPS website (www.clips.uantwerpen.be/datasets) and can freely be used for academic research purposes. An initial deception detection experiment was performed on this data. Deception detection is the task of automatically classifying a text as being either truthful or deceptive, in our case by examining the writing style of the author. This task has never been investigated for Dutch before. We performed a supervised machine learning experiment using the SVM algorithm in a 10-fold cross-validation setup. The only features were the token unigrams present in the training data. Using this simple method, we reached a state-of-the-art F-score of 72.2%.

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiatives work throughout Europe in order to boost progress and innovation in our field.

pdf bib
Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)
Ben Verhoeven | Walter Daelemans | Menno van Zaanen | Gerhard van Huyssteen
Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)

pdf bib
Automatic Compound Processing: Compound Splitting and Semantic Analysis for Afrikaans and Dutch
Ben Verhoeven | Menno van Zaanen | Walter Daelemans | Gerhard van Huyssteen
Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)

2013

2012

pdf bib
Improving Topic Classification for Highly Inflective Languages
Jurgita Kapociute-Dzikiene | Frederik Vaassen | Walter Daelemans | Algis Krupavičius
Proceedings of COLING 2012

pdf bib
A Statistical Relational Learning Approach to Identifying Evidence Based Medicine Categories
Mathias Verbeke | Vincent Van Asch | Roser Morante | Paolo Frasconi | Walter Daelemans | Luc De Raedt
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Walter Daelemans
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib abs
ConanDoyle-neg: Annotation of negation cues and their scope in Conan Doyle stories
Roser Morante | Walter Daelemans
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we present ConanDoyle-neg, a corpus of stories by Conan Doyle annotated with negation information. The negation cues and their scope, as well as the event or property that is negated have been annotated by two annotators. The inter-annotator agreement is measured in terms of F-scores at scope level. It is higher for cues (94.88 and 92.77), less high for scopes (85.04 and 77.31), and lower for the negated event (79.23 and 80.67). The corpus is publicly available.

pdf bib abs
“Vreselijk mooi!” (terribly beautiful): A Subjectivity Lexicon for Dutch Adjectives.
Tom De Smedt | Walter Daelemans
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a new open source subjectivity lexicon for Dutch adjectives. The lexicon is a dictionary of 1,100 adjectives that occur frequently in online product reviews, manually annotated with polarity strength, subjectivity and intensity, for each word sense. We discuss two machine learning methods (using distributional extraction and synset relations) to automatically expand the lexicon to 5,500 words. We evaluate the lexicon by comparing it to the user-given star rating of online product reviews. We show promising results in both in-domain and cross-domain evaluation. The lexicon is publicly available as part of the PATTERN software package (http://www.clips.ua.ac.be/pages/pattern).

Although in recent years numerous forms of Internet communication ― such as e-mail, blogs, chat rooms and social network environments ― have emerged, balanced corpora of Internet speech with trustworthy meta-information (e.g. age and gender) or linguistic annotations are still limited. In this paper we present a large corpus of Flemish Dutch chat posts that were collected from the Belgian online social network Netlog. For all of these posts we also acquired the users' profile information, making this corpus a unique resource for computational and sociolinguistic research. However, for analyzing such a corpus on a large scale, NLP tools are required for e.g. automatic POS tagging or lemmatization. Because many NLP tools fail to correctly analyze the surface forms of chat language usage, we propose to normalize this anomalous' input into a format suitable for existing NLP solutions for standard Dutch. Additionally, we have annotated a substantial part of the corpus (i.e. the Chatty subset) to provide a gold standard for the evaluation of future approaches to automatic (Flemish) chat language normalization.

pdf bib
Towards a Self-Learning Assistive Vocal Interface: Vocabulary and Grammar Learning
Janneke van de Loo | Jort F. Gemmeke | Guy De Pauw | Joris Driesen | Hugo Van hamme | Walter Daelemans
Proceedings of the 1st Workshop on Speech and Multimodal Interaction in Assistive Environments

We present a new corpus for computational stylometry, more specifically authorship attribution and the prediction of author personality from text. Because of the large number of authors (145), the corpus will allow previously impossible studies of variation in features considered predictive for writing style. The innovative meta-information (personality profiles of the authors) associated with these texts allows the study of personality prediction, a not yet very well researched aspect of style. In this paper, we describe the contents of the corpus and show its use in both authorship attribution and personality prediction. We focus on features that have been proven useful in the field of author recognition. Syntactic features like part-of-speech n-grams are generally accepted as not being under the authors conscious control and therefore providing good clues for predicting gender or authorship. We want to test whether these features are helpful for personality prediction and authorship attribution on a large set of authors. Both tasks are approached as text categorization tasks. First a document representation is constructed based on feature selection from the linguistically analyzed corpus (using the Memory-Based Shallow Parser (MBSP)). These are associated with each of the 145 authors or each of the four components of the Myers-Briggs Type Indicator (Introverted-Extraverted, Sensing-iNtuitive, Thinking-Feeling, Judging-Perceiving). Authorship attribution on 145 authors achieves results around 50%-accuracy. Preliminary results indicate that the first two personality dimensions can be predicted fairly accurately.

We present the main outcomes of the COREA project: a corpus annotated with coreferential relations and a coreference resolution system for Dutch. In the project we developed annotation guidelines for coreference resolution for Dutch and annotated a corpus of 135K tokens. We discuss these guidelines, the annotation tool, and the inter-annotator agreement. We also show a visualization of the annotated relations. The standard approach to evaluate a coreference resolution system is to compare the predictions of the system to a hand-annotated gold standard test set (cross-validation). A more practically oriented evaluation is to test the usefulness of coreference relation information in an NLP application. We run experiments with an Information Extraction module for the medical domain, and measure the performance of this module with and without the coreference relation information. We present the results of both this application-oriented evaluation of our system and of a standard cross-validation evaluation. In a separate experiment we also evaluate the effect of coreference information produced by a simple rule-based coreference module in a Question Answering application.

pdf bib
CNTS: Memory-Based Learning of Generating Repeated References
Iris Hendrickx | Walter Daelemans | Kim Luyckx | Roser Morante | Vincent Van Asch
Proceedings of the Fifth International Natural Language Generation Conference

pdf bib
A Combined Memory-Based Semantic Role Labeler of English
Roser Morante | Walter Daelemans | Vincent Van Asch
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

2007

pdf bib
Letter to the Editor
Walter Daelemans | Antal van den Bosch
Computational Linguistics, Volume 33, Number 1, March 2007

pdf bib
Invited talk: Text Analysis and Machine Learning for Stylometrics and Stylogenetics
Walter Daelemans
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

2006

pdf bib abs
A mixed word / morphological approach for extending CELEX for high coverage on contemporary large corpora
Joris Vaneyghen | Guy De Pauw | Dirk Van Compernolle | Walter Daelemans
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes an alternative approach to morphological language modeling, which incorporates constraints on the morphological production of new words. This is done by applying the constraints as a preprocessing step in which only one morphological production rule can be applied to an extended lexicon of knownmorphemes, lemmas and word forms. This approach is used to extend the CELEX Dutch morphological database, so that a higher coverage can be reached on a largecorpus of Dutch newspaper articles. We present experimental results on the coverage of this extended database and use the extension to further evaluate our morphologicalsystem, as well as the impact of the constraints on the coverage of out-of-vocabulary words.

pdf bib
Constraint Satisfaction Inference: Non-probabilistic Global Inference for Sequence Labelling
Sander Canisius | Antal van den Bosch | Walter Daelemans
Proceedings of the Workshop on Learning Structured Information in Natural Language Applications

pdf bib
A Mission for Computational Natural Language Learning
Walter Daelemans
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

pdf bib
Investigating Lexical Substitution Scoring for Subtitle Generation
Oren Glickman | Ido Dagan | Walter Daelemans | Mikaela Keller | Samy Bengio
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

2005

pdf bib
Improving Sequence Segmentation Learning by Predicting Trigrams
Antal van den Bosch | Walter Daelemans
Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005)

2004

pdf bib
Unsupervised Text Mining for Ontology Extraction: An Evaluation of Statistical Measures
Marie-Laure Reinberger | Walter Daelemans
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib abs
Evaluation and Adaptation of the Celex Dutch Morphological Database
Tom Laureys | Guy De Pauw | Hugo Van hamme | Walter Daelemans | Dirk Van Compernolle
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper describes some important modifications to the Celex morphological database in the context of the FLaVoR project. FLaVoR aims to develop a novel modular framework for speech recognition, enabling the integration of complex linguistic knowledge sources, such as a morphological model. Morphology is a fairly unexploited linguistic information source speech recognizers could benefit from. This is especially true for languages which allow for a rich set of morphological operations, such as our target language Dutch. In this paper we focus on the exploitation of the Celex Dutch morphological database as the information source underlying two different morphological analyzers being developed within the project. Although the Celex database provides a valuable source of morphological information for Dutch, many modifications were necessary before it could be practically applied. We identify major problems, discuss the implemented solutions and finally experimentally evaluate the effect of our modifications to the database.

pdf bib
Automatic Sentence Simplification for Subtitling in Dutch and English
Walter Daelemans | Anja Höthker | Erik Tjong Kim Sang
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
A Comparison of Two Different Approaches to Morphological Analysis of Dutch
Guy De Pauw | Tom Laureys | Walter Daelemans | Hugo Van hamme
Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology

pdf bib
GAMBL, genetic algorithm optimization of memory-based WSD
Bart Decadt | Véronique Hoste | Walter Daelemans | Antal van den Bosch
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

pdf bib
Memory-based semantic role labeling: Optimizing features, algorithm, and output
Antal van den Bosch | Sander Canisius | Walter Daelemans | Iris Hendrickx | Erik Tjong Kim Sang
Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004

2003

pdf bib
Learning to Predict Pitch Accents and Prosodic Boundaries in Dutch
Erwin Marsi | Martin Reynaert | Antal van den Bosch | Walter Daelemans | Véronique Hoste
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

pdf bib
Memory-Based Named Entity Recognition using Unannotated Data
Fien De Meulder | Walter Daelemans
Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003

2002

pdf bib
Evaluation of Machine Learning Methods for Natural Language Processing Tasks
Walter Daelemans | Véronique Hoste
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
A Field Survey for Establishing Priorities in the Development of HLT Resources for Dutch
D. Binnenpoorte | F. De Vriend | J. Sturm | W. Daelemans | H. Strik | C. Cucchiarini
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Dutch Word Sense Disambiguation: Optimizing the Localness of Context
Antal van den Bosch | Iris Hendrickx | Veronique Hoste | Walter Daelemans
Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions

pdf bib
Evaluating the results of a memory-based word-expert approach to unrestricted word sense disambiguation
Veronique Hoste | Walter Daelemans | Iris Hendrickx | Antal van den Bosch
Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions

2001

pdf bib
Improving Accuracy in word class tagging through the Combination of Machine Learning Systems
Hans Van Halteren | Jakub Zavrel | Walter Daelemans
Computational Linguistics, Volume 27, Number 2, June 2001

pdf bib
Book Reviews: Learnability in Optimality Theory
Walter Daelemans
Computational Linguistics, Volume 27, Number 2, June 2001

pdf bib
Classifier Optimization and Combination in the English All Words Task
Véronique Hoste | Anne Kool | Walter Daelemans
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

2000

pdf bib
A Rule Induction Approach to Modeling Regional Pronunciation Variation
Veronique Hoste | Steven Gillis | Walter Daelemans
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf bib
Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers
Jakub Zavrel | Walter Daelemans
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

pdf bib
Part of Speech Tagging and Lemmatisation for the Spoken Dutch Corpus
Frank Van Eynde | Jakub Zavrel | Walter Daelemans
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

pdf bib
The Role of Algorithm Bias vs Information Source in Learning Algorithms for Morphosyntactic Disambiguation
Guy De Pauw | Walter Daelemans
Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop

pdf bib
Genetic Algorithms for Feature Relevance Assignment in Memory-Based Language Processing
Anne Kool | Walter Daelemans | Jakub Zavrel
Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop

1999

pdf bib
Memory-Based Morphological Analysis
Antal van den Bosch | Walter Daelemans
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

pdf bib
Cascaded Grammatical Relation Assignment
Sabine Buchholz | Jorn Veenstra | Walter Daelemans
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

pdf bib
Memory-Based Shallow Parsing
Walter Daelemans | Sabine Buchholz | Jorn Veenstra
EACL 1999: CoNLL-99 Computational Natural Language Learning

1998

pdf bib
Improving Data Driven Wordclass Tagging by System Combination
Hans van Halteren | Jakub Zavrel | Walter Daelemans
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

pdf bib
Improving Data Driven Wordclass Tagging by System Combination
Hans van Halteren | Jakub Zavrel | Walter Daelemans
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

pdf bib
Abstraction Is Harmful in Language Learning
Walter Daelemans
New Methods in Language Processing and Computational Natural Language Learning

pdf bib
Modularity in Inductively-Learned Word Pronunciation Systems
Antal van den Bosch | Ton Weijters | Walter Daelemans
New Methods in Language Processing and Computational Natural Language Learning

pdf bib
Do Not Forget: Full Memory in Memory-Based Learning of Word Pronunciation
Antal van den Bosch | Walter Daelemans
New Methods in Language Processing and Computational Natural Language Learning

1997

pdf bib
Memory-Based Learning: Using Similarity for Smoothing
Jakub Zavrel | Walter Daelemans
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Resolving PP attachment Ambiguities with Memory-Based Learning
Jakub Zavrel | Walter Daelemans | Jorn Veenstra
CoNLL97: Computational Natural Language Learning

1996

pdf bib
Unsupervised Discovery of Phonological Categories through Supervised Learning of Morphological Rules
Walter Daelemans | Peter Berck | Steven Gillis
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics

pdf bib
MBT: A Memory-Based Part of Speech Tagger-Generator
Walter Daelemans | Jakub Zavrel | Peter Berck | Steven Gillis
Fourth Workshop on Very Large Corpora

1994

pdf bib
The Acquisition of Stress: A Data-Oriented Approach
Walter Daelemans | Steven Gillis | Gert Durieux
Computational Linguistics, Volume 20, Number 3, September 1994

pdf bib
Book Reviews: Inheritance, Defaults, and the Lexicon
Walter Daelemans
Computational Linguistics, Volume 20, Number 4, December 1994

1993

bib abs
Memory-based lexical acquisition and processing
Walter Daelemans
Third International EAMT Workshop: Machine Translation and the Lexicon

Current approaches to computational lexicology in language technology are knowledge-based (competence-oriented) and try to abstract away from specific formalisms, domains, and applications. This results in severe complexity, acquisition and reusability bottlenecks. As an alternative, we propose a particular performance-oriented approach to Natural Language Processing based on automatic memory-based learning of linguistic (lexical) tasks. The consequences of the approach for computational lexicology are discussed, and the application of the approach on a number of lexical acquisition and disambiguation tasks in phonology, morphology and syntax is described.

pdf bib
Data-Oriented Methods for Grapheme-to-Phoneme Conversion
Antal van den Bosch | Walter Daelemans
Sixth Conference of the European Chapter of the Association for Computational Linguistics