Antske Fokkens - ACL Anthology

2025

Engagement-driven Persona Prompting for Rewriting News Tweets
Reshmi Gopalakrishna Pillai | Antske Fokkens | Wouter van Atteveldt
Proceedings of the 31st International Conference on Computational Linguistics

Text style transfer is a challenging research task which modifies the linguistic style of a given text to meet pre-set objectives such as making the text simpler or more accessible. Though large language models have been found to give promising results, text rewriting to improve audience engagement of social media content is largely unexplored. Our research investigates the performance of various prompting strategies in the task of rewriting Dutch news tweets in specific linguistic styles (formal, casual and factual). Apart from zero-shot and few-shot prompting variants, with and without personas, we also explore prompting with feedback on predicted engagement. We perform an extensive analysis of 18 different combinations of Large Language Models (GPT-3.5, GPT-4, Mistral-7B) and prompting strategies on three different metrics: ROUGE-L, semantic similarity and predicted engagement. We find that GPT-4 with feedback and persona prompting performs the best in terms of predicted engagement for all three language styles. Our results motivate further application of prompting techniques to rewrite news headlines on Twitter to align with specific style guidelines.
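
For readers unfamiliar with ROUGE-L, one of the three metrics named above, the following is a minimal sketch of the usual longest-common-subsequence formulation. It assumes whitespace tokenisation and a balanced F1 score; the abstract does not state which ROUGE-L variant or implementation the authors used, and the function names are illustrative only.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L as a balanced F1 over LCS precision and recall.

    Assumption: whitespace tokenisation and beta = 1; published
    implementations may differ in both choices.
    """
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A rewrite that keeps most of the original wording in the same order scores close to 1, so in a rewriting task a high ROUGE-L indicates that the original tweet's wording is largely preserved.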

DefVerify: Do Hate Speech Models Reflect Their Dataset’s Definition?
Urja Khurana | Eric Nalisnick | Antske Fokkens
Proceedings of the 31st International Conference on Computational Linguistics

When building a predictive model, it is often difficult to ensure that application-specific requirements are encoded by the model that will eventually be deployed. Consider researchers working on hate speech detection. They will have an idea of what is considered hate speech, but building a model that reflects their view accurately requires preserving those ideals throughout the workflow of data set construction and model training. Complications such as sampling bias, annotation bias, and model misspecification almost always arise, possibly resulting in a gap between the application specification and the model’s actual behavior upon deployment. To address this issue for hate speech detection, we propose DefVerify: a 3-step procedure that (i) encodes a user-specified definition of hate speech, (ii) quantifies to what extent the model reflects the intended definition, and (iii) tries to identify the point of failure in the workflow. We use DefVerify to find gaps between definition and model behavior when applied to six popular hate speech benchmark datasets.

Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification
Leon Eshuijs | Shihan Wang | Antske Fokkens
Proceedings of the 29th Conference on Computational Natural Language Learning

Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model’s decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.

Improving Causal Interventions in Amnesic Probing with Mean Projection or LEACE
Alicja Dobrzeniecka | Antske Fokkens | Pia Sommerauer
Findings of the Association for Computational Linguistics: ACL 2025

Amnesic probing is a technique used to examine the influence of specific linguistic information on the behaviour of a model. This involves identifying and removing the relevant information and then assessing whether the model’s performance on the main task changes. If the removed information is relevant, the model’s performance should decline. The difficulty with this approach lies in removing only the target information while leaving other information unchanged. It has been shown that Iterative Nullspace Projection (INLP), a widely used removal technique, introduces random modifications to representations when eliminating target information. We demonstrate that Mean Projection (MP) and LEACE, two proposed alternatives, remove information in a more targeted manner, thereby enhancing the potential for obtaining behavioural explanations through Amnesic Probing.

Asking a Language Model for Diverse Responses
Sergey Troshin | Irina Saparina | Antske Fokkens | Vlad Niculae
Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)

Large language models increasingly rely on explicit reasoning chains and can produce multiple plausible responses for a given context. We study the candidate sampler that produces the set of plausible responses, contrasting ancestral (parallel) sampling against two alternatives: enumeration, which asks the model to produce n candidates in one pass, and iterative sampling, which proposes candidates sequentially while conditioning on the currently generated response set. Under matched budgets, we compare these samplers on quality, lexical and computation flow diversity, and efficiency. Our empirical results demonstrate that enumeration and iterative strategies result in higher diversity at comparable quality. Our findings highlight the potential of simple non-independent sampling strategies to improve response diversity without sacrificing generation quality.
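
The three samplers compared in this abstract can be sketched as follows. This is only an illustration of the control flow: `generate` is a hypothetical callable standing in for any language model call, and the prompt wording and the one-response-per-line output format are assumptions, not the authors' setup.

```python
from typing import Callable, List

# Hypothetical model interface: takes a prompt, returns generated text.
Generate = Callable[[str], str]


def parallel_sampling(generate: Generate, context: str, n: int) -> List[str]:
    """Draw n responses independently from the same context."""
    return [generate(context).strip() for _ in range(n)]


def enumeration(generate: Generate, context: str, n: int) -> List[str]:
    """Ask for n candidates in a single pass (assumed: one per line)."""
    prompt = f"{context}\nGive {n} different responses, one per line."
    lines = [line.strip() for line in generate(prompt).splitlines()]
    return [line for line in lines if line][:n]


def iterative_sampling(generate: Generate, context: str, n: int) -> List[str]:
    """Propose candidates one at a time, conditioning on those so far."""
    responses: List[str] = []
    for _ in range(n):
        shown = "\n".join(f"- {r}" for r in responses)
        prompt = (
            f"{context}\nResponses so far:\n{shown}\n"
            "Give one new, different response."
        )
        responses.append(generate(prompt).strip())
    return responses
```

The difference lies in what each call can see: parallel sampling never shows the model its earlier outputs, whereas enumeration and iterative sampling both make later candidates depend on earlier ones.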

Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies
Jonathan Kamp | Lisa Beinborn | Antske Fokkens
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.

2024

The Role of Syntactic Span Preferences in Post-Hoc Explanation Disagreement
Jonathan Kamp | Lisa Beinborn | Antske Fokkens
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Post-hoc explanation methods are an important tool for increasing model transparency for users. Unfortunately, the currently used methods for attributing token importance often yield diverging patterns. In this work, we study potential sources of disagreement across methods from a linguistic perspective. We find that different methods systematically select different classes of words and that methods that agree most with other methods and with humans display similar linguistic preferences. Token-level differences between methods are smoothed out if we compare them on the syntactic span level. We also find higher agreement across methods by estimating the most important spans dynamically instead of relying on a fixed subset of size k. We systematically investigate the interaction between k and spans and propose an improved configuration for selecting important tokens.

Tracking Perspectives on Event Participants: a Structural Analysis of the Framing of Real-World Events in Co-Referential Corpora
Levi Remijnse | Pia Sommerauer | Antske Fokkens | Piek T.J.M. Vossen
Proceedings of the First Workshop on Reference, Framing, and Perspective @ LREC-COLING 2024

In this paper, we present the outcome of a structural linguistic analysis performed on a referentially grounded FrameNet dataset. In this dataset, multiple Dutch events are referenced by multiple co-referential Dutch news texts. Mentions in those documents are annotated with respect to their referential grounding (i.e., links to structured Wikidata), and their conceptual representation (i.e., frames). Provided with each document’s temporal reporting distance, we selected documents for two events - the Utrecht shooting and MH17 - and performed an analysis in which we tracked the events’ participants over time in both their focalization (number of mentions) and their framing (distribution of frame element labels). This way, we use the carefully collected and annotated data to schematize shifts in focalization and perspectivization of the participants as a result of the constantly developing narrative surrounding the events. This novel type of linguistic research involves reference to the real-world referents and takes into account storytelling in news streams.

Topic-specific social science theory in stance detection: a proposal and interdisciplinary pilot study on sustainability initiatives
Myrthe Reuver | Alessandra Polimeno | Antske Fokkens | Ana Isabel Lopes
Proceedings of the 4th Workshop on Computational Linguistics for the Political and Social Sciences: Long and short papers

Topic-specificity is often seen as a limitation of stance detection models and datasets, especially for analyzing political and societal debates. However, stances contain topic-specific aspects that are crucial for an in-depth understanding of these debates. Our interdisciplinary approach identifies social science theories on specific debate topics as an opportunity for further defining stance detection research and analyzing online debate. This paper explores sustainability as a debate topic, and connects stance to the sustainability-related Value-Belief-Norm (VBN) theory. VBN theory states that arguments in favor of or against sustainability initiatives contain the dimensions of feeling power to change the issue with the initiative, and thinking whether or not the initiative tackles an urgent threat to the environment. In a pilot study with our Reddit European Sustainability Initiatives corpus, we develop an annotation procedure for these complex concepts. We then compare crowd-workers with Natural Language Processing experts’ annotation proficiency. Both crowd-workers and NLP experts find the tasks difficult, but experts reach more agreement on some difficult examples. This pilot study shows that complex theories about debate topics are feasible and worthwhile as annotation tasks for stance detection.

Investigating the Robustness of Modelling Decisions for Few-Shot Cross-Topic Stance Detection: A Preregistered Study
Myrthe Reuver | Suzan Verberne | Antske Fokkens
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

For a viewpoint-diverse news recommender, identifying whether two news articles express the same viewpoint is essential. One way to determine “same or different” viewpoint is stance detection. In this paper, we investigate the robustness of operationalization choices for few-shot stance detection, with special attention to modelling stance across different topics. Our experiments test pre-registered hypotheses on stance detection. Specifically, we compare two stance task definitions (Pro/Con versus Same Side Stance), two LLM architectures (bi-encoding versus cross-encoding), and adding Natural Language Inference knowledge, with pre-trained RoBERTa models trained with shots of 100 examples from 7 different stance detection datasets. Some of our hypotheses and claims from earlier work can be confirmed, while others give more inconsistent results. The effect of the Same Side Stance definition on performance differs per dataset and is influenced by other modelling choices. We found no relationship between the number of training topics in the training shots and performance. In general, cross-encoding out-performs bi-encoding, and adding NLI training to our models gives considerable improvement, but these results are not consistent across all datasets. Our results indicate that it is essential to include multiple datasets and systematic modelling experiments when aiming to find robust modelling choices for the concept ‘stance’.

2023

Dynamic Top-k Estimation Consolidates Disagreement between Feature Attribution Methods
Jonathan Kamp | Lisa Beinborn | Antske Fokkens
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Feature attribution scores are used for explaining the prediction of a text classifier to users by highlighting k tokens. In this work, we propose a way to determine the optimal number k of tokens that should be displayed from sequential properties of the attribution scores. Our approach is dynamic across sentences, method-agnostic, and deals with sentence length bias. We compare agreement between multiple methods and humans on an NLI task, using fixed k and dynamic k. We find that perturbation-based methods and Vanilla Gradient exhibit highest agreement on most method–method and method–human agreement metrics with a static k. Their advantage over other methods disappears with dynamic k, which mainly improves Integrated Gradient and GradientXInput. To our knowledge, this is the first evidence that sequential properties of attribution scores are informative for consolidating attribution signals for human interpretation.

Methodological Insights in Detecting Subtle Semantic Shifts with Contextualized and Static Language Models
Sanne Hoeken | Özge Alacam | Antske Fokkens | Pia Sommerauer
Findings of the Association for Computational Linguistics: EMNLP 2023

In this paper, we investigate automatic detection of subtle semantic shifts between social communities of different political convictions in Dutch and English. We perform a methodological study comparing methods using static and contextualized language models. We investigate the impact of specializing contextualized models through fine-tuning on target corpora, word sense disambiguation and sentiment. We furthermore propose a new approach using masked token prediction, that relies on behavioral information, specifically the most probable substitutions, instead of geometrical comparison of representations. Our results show that methods using static models and our masked token prediction method can detect differences in connotation of politically loaded terms, whereas methods that rely on measuring the distance between contextualized representations are not providing clear signals, even in synthetic scenarios of extreme shifts.

2022

Hate Speech Criteria: A Modular Approach to Task-Specific Hate Speech Definitions
Urja Khurana | Ivar Vermeulen | Eric Nalisnick | Marloes Van Noorloos | Antske Fokkens
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

The subjectivity of automatic hate speech detection makes it a complex task, reflected in different and incomplete definitions in NLP. We present hate speech criteria, developed with insights from a law and social science expert, that help researchers create more explicit definitions and annotation guidelines on five aspects: (1) target groups and (2) dominance, (3) perpetrator characteristics, (4) explicit presence of negative interactions, and the (5) type of consequences/effects. Definitions can be structured so that they cover a more broad or more narrow phenomenon and conscious choices can be made on specifying criteria or leaving them open. We argue that the goal and exact task developers have in mind should determine how the scope of hate speech is defined. We provide an overview of the properties of datasets from hatespeechdata.com that may help select the most suitable dataset for a specific scenario.

Introducing Frege to Fillmore: A FrameNet Dataset that Captures both Sense and Reference
Levi Remijnse | Piek Vossen | Antske Fokkens | Sam Titarsolej
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This article presents the first output of the Dutch FrameNet annotation tool, which facilitates both referential- and frame annotations of language-independent corpora. On the referential level, the tool links in-text mentions to structured data, grounding the text in the real world. On the frame level, those same mentions are annotated with respect to their semantic sense. This way of annotating not only generates a rich linguistic dataset that is grounded in real-world event instances, but also guides the annotators in frame identification, resulting in high inter-annotator-agreement and consistent annotations across documents and at discourse level, exceeding traditional sentence level annotations of frame elements. Moreover, the annotation tool features a dynamic lexical lookup that increases the development of a cross-domain FrameNet lexicon.

Better Hit the Nail on the Head than Beat around the Bush: Removing Protected Attributes with a Single Projection
Pantea Haghighatkhah | Antske Fokkens | Pia Sommerauer | Bettina Speckmann | Kevin Verbeek
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Bias elimination and recent probing studies attempt to remove specific information from embedding spaces. Here it is important to remove as much of the target information as possible, while preserving any other information present. INLP is a popular recent method which removes specific information through iterative nullspace projections. Multiple iterations, however, increase the risk that information other than the target is negatively affected. We introduce two methods that find a single targeted projection: Mean Projection (MP, more efficient) and Tukey Median Projection (TMP, with theoretical guarantees). Our comparison between MP and INLP shows that (1) one MP projection removes linear separability based on the target and (2) MP has less impact on the overall space. Further analysis shows that applying random projections after MP leads to the same overall effects on the embedding space as the multiple projections of INLP. Applying one targeted (MP) projection hence is methodologically cleaner than applying multiple (INLP) projections that introduce random effects.
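
The idea of a single targeted projection can be illustrated with a small sketch. It assumes that Mean Projection removes the direction connecting the two class means, which is one reading of the name and is not spelled out in the abstract; the function names are illustrative, and plain Python lists stand in for embedding vectors.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of a non-empty collection of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_projection(
    class_a: Sequence[Vector], class_b: Sequence[Vector]
) -> Callable[[Vector], Vector]:
    """Return a function that projects out the mean-difference direction.

    Assumption: the removed direction is mean(class_a) - mean(class_b).
    """
    mu_a, mu_b = mean(class_a), mean(class_b)
    direction = [a - b for a, b in zip(mu_a, mu_b)]
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("class means coincide; nothing to project out")

    def project(v: Vector) -> Vector:
        # Subtract the component of v that lies along `direction`.
        scale = sum(x * d for x, d in zip(v, direction)) / norm_sq
        return [x - scale * d for x, d in zip(v, direction)]

    return project
```

After this one projection the two class means coincide, whereas INLP, as described above, repeats a projection step over several iterations.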

Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)
Antske Fokkens | Vivek Srikumar
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

Dealing with Abbreviations in the Slovenian Biographical Lexicon
Angel Daza | Antske Fokkens | Tomaž Erjavec
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors. They can also make the text less readable, especially in reference printed books, where they are extensively used. Abbreviations are especially problematic in low-resource settings, where systems are less robust to begin with. In this paper, we propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text. We apply this method to the case of a Slovenian biographical lexicon and evaluate it on a newly developed gold-standard dataset of 51 Slovenian biographies. Our abbreviation identification method performs significantly better than commonly used ad-hoc solutions, especially at identifying unseen abbreviations. We also propose and present the results of a method for expanding the identified abbreviations in context.

Perturbations and Subpopulations for Testing Robustness in Token-Based Argument Unit Recognition
Jonathan Kamp | Lisa Beinborn | Antske Fokkens
Proceedings of the 9th Workshop on Argument Mining

Argument Unit Recognition and Classification aims at identifying argument units from text and classifying them as pro or against. One of the design choices that need to be made when developing systems for this task is what the unit of classification should be: segments of tokens or full sentences. Previous research suggests that fine-tuning language models on the token-level yields more robust results for classifying sentences compared to training on sentences directly. We reproduce the study that originally made this claim and further investigate what exactly token-based systems learned better compared to sentence-based ones. We develop systematic tests for analysing the behavioural differences between the token-based and the sentence-based system. Our results show that token-based models are generally more robust than sentence-based models both on manually perturbed examples and on specific subpopulations of the data.

Story Trees: Representing Documents using Topological Persistence
Pantea Haghighatkhah | Antske Fokkens | Pia Sommerauer | Bettina Speckmann | Kevin Verbeek
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Topological Data Analysis (TDA) focuses on the inherent shape of (spatial) data. As such, it may provide useful methods to explore spatial representations of linguistic data (embeddings) which have become central in NLP. In this paper we aim to introduce TDA to researchers in language technology. We use TDA to represent document structure as so-called story trees. Story trees are hierarchical representations created from semantic vector representations of sentences via persistent homology. They can be used to identify and clearly visualize prominent components of a story line. We showcase their potential by using story trees to create extractive summaries for news stories.

2021

Challenging distributional models with a conceptual network of philosophical terms
Yvette Oortwijn | Jelke Bloem | Pia Sommerauer | Francois Meyer | Wei Zhou | Antske Fokkens
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Computational linguistic research on language change through distributional semantic (DS) models has inspired researchers from fields such as philosophy and literary studies, who use these methods for the exploration and comparison of comparatively small datasets traditionally analyzed by close reading. Research on methods for small data is still in early stages and it is not clear which methods achieve the best results. We investigate the possibilities and limitations of using distributional semantic models for analyzing philosophical data by means of a realistic use-case. We provide a ground truth for evaluation created by philosophy experts and a blueprint for using DS models in a sound methodological setup. We compare three methods for creating specialized models from small datasets. Though the models do not perform well enough to directly support philosophers yet, we find that models designed for small data yield promising directions for future work.

How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task
Urja Khurana | Eric Nalisnick | Antske Fokkens
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

Despite their success, modern language models are fragile. Even small changes in their training pipeline can lead to unexpected results. We study this phenomenon by examining the robustness of ALBERT (Lan et al., 2020) in combination with Stochastic Weight Averaging (SWA)—a cheap way of ensembling—on a sentiment analysis task (SST-2). In particular, we analyze SWA’s stability via CheckList criteria (Ribeiro et al., 2020), examining the agreement on errors made by models differing only in their random seed. We hypothesize that SWA is more stable because it ensembles model snapshots taken along the gradient descent trajectory. We quantify stability by comparing the models’ mistakes with Fleiss’ Kappa (Fleiss, 1971) and overlap ratio scores. We find that SWA reduces error rates in general; yet the models still suffer from their own distinct biases (according to CheckList).
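
Fleiss' Kappa, used here to compare the mistakes of models that differ only in random seed, can be computed as in the sketch below. The standard formula is used; the input layout (one row per test example, one count per category, the same number of models rating every example) is an assumption about how such a comparison might be set up, not the authors' code.

```python
from typing import List, Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' Kappa for a table of per-item category counts.

    counts[i][j] is how many raters assigned item i to category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: mean over items of the pairwise agreement rate.
    per_item: List[float] = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall category proportions.
    total = n_items * n_raters
    proportions = [sum(col) / total for col in zip(*counts)]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_observed - p_expected) / (1.0 - p_expected)
```

For instance, with three seeds and the hypothetical categories "correct" and "error", the row `[3, 0]` means all three models classified that example correctly; higher kappa means the seeds tend to make their errors on the same examples.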

Are we human, or are we users? The role of natural language processing in human-centric news recommenders that nudge users to diverse content
Myrthe Reuver | Nicolas Mattis | Marijn Sax | Suzan Verberne | Nava Tintarev | Natali Helberger | Judith Moeller | Sanne Vrijenhoek | Antske Fokkens | Wouter van Atteveldt
Proceedings of the 1st Workshop on NLP for Positive Impact

In this position paper, we present a research agenda and ideas for facilitating exposure to diverse viewpoints in news recommendation. Recommending news from diverse viewpoints is important to prevent potential filter bubble effects in news consumption, and stimulate a healthy democratic debate. To account for the complexity that is inherent to humans as citizens in a democracy, we anticipate (among others) individual-level differences in acceptance of diversity. We connect this idea to techniques in Natural Language Processing, where distributional language models would allow us to place different users and news articles in a multidimensional space based on semantic content, where diversity is operationalized as distance and variance. In this way, we can model individual “latitudes of diversity” for different users, and thus personalize viewpoint diversity in support of a healthy public debate. In addition, we identify technical, ethical and conceptual issues related to our presented ideas. Our investigation describes how NLP can play a central role in diversifying news recommendations.

No NLP Task Should be an Island: Multi-disciplinarity for Diversity in News Recommender Systems
Myrthe Reuver | Antske Fokkens | Suzan Verberne
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

Natural Language Processing (NLP) is defined by specific, separate tasks, each with its own literature, benchmark datasets, and definitions. In this position paper, we argue that for a complex problem such as the threat to democracy by non-diverse news recommender systems, it is important to take into account a higher-order, normative goal and its implications. Experts in ethics, political science and media studies have suggested that news recommendation systems could be used to support a deliberative democracy. We reflect on the role of NLP in recommendation systems with this specific goal in mind and show that this theory of democracy helps to identify which NLP tasks and techniques can support this goal, and what work still needs to be done. This leads to recommendations for NLP researchers working on this specific problem as well as researchers working on other complex multidisciplinary problems.

Is Stance Detection Topic-Independent and Cross-topic Generalizable? - A Reproduction Study
Myrthe Reuver | Suzan Verberne | Roser Morante | Antske Fokkens
Proceedings of the 8th Workshop on Argument Mining

Cross-topic stance detection is the task of automatically detecting stances (pro, against, or neutral) on unseen topics. We successfully reproduce state-of-the-art cross-topic stance detection work (Reimers et al., 2019), and systematically analyze its reproducibility. Our attention then turns to the cross-topic aspect of this work, and the specificity of topics in terms of vocabulary and socio-cultural context. We ask: To what extent is stance detection topic-independent and generalizable across topics? We compare the model’s performance on various unseen topics, and find topic (e.g. abortion, cloning), class (e.g. pro, con), and their interaction affecting the model’s performance. We conclude that investigating performance on different topics, and addressing topic-specific vocabulary and context, is a future avenue for cross-topic stance detection.
References: Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. Classification and Clustering of Arguments with Contextualized Word Embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 567–578, Florence, Italy. Association for Computational Linguistics.

2020

Combining Conceptual and Referential Annotation to Study Variation in Framing
Marten Postma | Levi Remijnse | Filip Ilievski | Antske Fokkens | Sam Titarsolej | Piek Vossen
Proceedings of the International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet

We introduce an annotation tool whose purpose is to gain insights into variation of framing by combining FrameNet annotation with referential annotation. English FrameNet enables researchers to study variation in framing at the conceptual level as well as through its packaging in language. We enrich FrameNet annotations in two ways. First, we introduce the referential aspect. Secondly, we annotate on complete texts to encode connections between mentions. As a result, we can analyze the variation of framing for one particular event across multiple mentions and (cross-lingual) documents. We can examine how an event is framed over time and how core frame elements are expressed throughout a complete text. The data model starts with a representation of an event type. Each event type has many incidents linked to it, and each incident has several reference texts describing it as well as structured data about the incident. The user can apply two types of annotations: 1) mappings from expressions to frames and frame elements, 2) reference relations from mentions to events and participants of the structured data.

Large-scale Cross-lingual Language Resources for Referencing and Framing
Piek Vossen | Filip Ilievski | Marten Postma | Antske Fokkens | Gosse Minnema | Levi Remijnse
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this article, we lay out the basic ideas and principles of the project Framing Situations in the Dutch Language. We provide our first results of data acquisition, together with the first data release. We introduce the notion of cross-lingual referential corpora. These corpora consist of texts that make reference to exactly the same incidents. The referential grounding allows us to analyze the framing of these incidents in different languages and across different texts. During the project, we will use the automatically generated data to study linguistic framing as a phenomenon and to build framing resources such as lexicons and corpora. We expect to capture larger variation in framing compared to traditional approaches for building such resources. Our first data release, which contains structured data about a large number of incidents and reference texts, can be found at http://dutchframenet.nl/data-releases/.

Would you describe a leopard as yellow? Evaluating crowd-annotations with justified and informative disagreement
Pia Sommerauer | Antske Fokkens | Piek Vossen
Proceedings of the 28th International Conference on Computational Linguistics

Semantic annotation tasks contain ambiguity and vagueness and require varying degrees of world knowledge. Disagreement is an important indication of these phenomena. Most traditional evaluation methods, however, critically hinge upon the notion of inter-annotator agreement. While alternative frameworks have been proposed, they do not move beyond agreement as the most important indicator of quality. Critically, evaluations usually do not distinguish between instances in which agreement is expected and instances in which disagreement is not only valid but desired because it captures the linguistic and cognitive phenomena in the data. We attempt to overcome these limitations using the example of a dataset that provides semantic representations for diagnostic experiments on language models. Ambiguity, vagueness, and difficulty are not only highly relevant for this use-case, but also play an important role in other types of semantic annotation tasks. We establish an additional, agreement-independent quality metric based on answer-coherence and evaluate it in comparison to existing metrics. We compare against a gold standard and evaluate on expected disagreement. Despite generally low agreement, annotations follow expected behavior and have high accuracy when selected based on coherence. We show that combining different quality metrics enables a more comprehensive evaluation than relying exclusively on agreement.

2019

A larger-scale evaluation resource of terms and their shift direction for diachronic lexical semantics
Astrid van Aggelen | Antske Fokkens | Laura Hollink | Jacco van Ossenbruggen
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Determining how words have changed their meaning is an important topic in Natural Language Processing. However, evaluations of methods to characterise such change have been limited to small, handcrafted resources. We introduce an English evaluation set which is larger, more varied, and more realistic than seen to date, with terms derived from a historical thesaurus. Moreover, the dataset is unique in that it represents change as a shift from the term of interest to a WordNet synset. Using the synset lemmas, we can use this set to evaluate (standard) methods that detect change between word pairs, as well as (adapted) methods that detect the change between a term and a sense overall. We show that performance on the new data set is much lower than earlier reported findings, setting a new standard.

Towards interpretable, data-derived distributional meaning representations for reasoning: A dataset of properties and concepts
Pia Sommerauer | Antske Fokkens | Piek Vossen
Proceedings of the 10th Global Wordnet Conference

This paper proposes a framework for investigating which types of semantic properties are represented by distributional data. The core of our framework consists of relations between concepts and properties. We provide hypotheses on which properties are reflected in distributional data or not based on the type of relation. We outline strategies for creating a dataset of positive and negative examples for various semantic properties, which cannot easily be separated on the basis of general similarity (e.g. fly: seagull, penguin). This way, a distributional model can only distinguish between positive and negative examples through evidence for a target property. Once completed, this dataset can be used to test our hypotheses and work towards data-derived interpretable representations.

Conceptual Change and Distributional Semantic Models: an Exploratory Study on Pitfalls and Possibilities
Pia Sommerauer | Antske Fokkens
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

Studying conceptual change using embedding models has become increasingly popular in the Digital Humanities community while critical observations about them have received less attention. This paper investigates what the impact of known pitfalls can be on the conclusions drawn in a digital humanities study through the use case of “Racism”. In addition, we suggest an approach for modeling a complex concept in terms of words and relations representative of the conceptual system. Our results show that different models created from the same data yield different results, but also indicate that using different model architectures, comparing different corpora and comparing to control words and relations can help to identify which results are solid and which may be due to artefacts. We propose guidelines to conduct similar studies, but also note that more work is needed to fully understand how we can distinguish artefacts from actual conceptual changes.

Evaluating the Consistency of Word Embeddings from Small Data
Jelke Bloem | Antske Fokkens | Aurélie Herbelot
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this work, we address the evaluation of distributional semantic models trained on smaller, domain-specific texts, in particular philosophical text. Specifically, we inspect the behaviour of models using a pre-trained background space in learning. We propose a measure of consistency which can be used as an evaluation metric when no in-domain gold-standard data is available. This measure simply computes the ability of a model to learn similar embeddings from different parts of some homogeneous data. We show that in spite of being a simple evaluation, consistency actually depends on various combinations of factors, including the nature of the data itself, the model used to train the semantic space, and the frequency of the learnt terms, both in the background space and in the in-domain data of interest.

2018

Neural Models of Selectional Preferences for Implicit Semantic Role Labeling
Minh Le | Antske Fokkens
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Studying Muslim Stereotyping through Microportrait Extraction
Antske Fokkens | Nel Ruigrok | Camiel Beukeboom | Gagestein Sarah | Wouter van Atteveldt
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Meaning_space at SemEval-2018 Task 10: Combining explicitly encoded knowledge with information extracted from word embeddings
Pia Sommerauer | Antske Fokkens | Piek Vossen
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper presents the two systems submitted by the meaning space team in Task 10 of the SemEval competition 2018 entitled Capturing discriminative attributes. The systems consist of combinations of approaches exploiting explicitly encoded knowledge about concepts in WordNet and information encoded in distributional semantic vectors. Rather than aiming for high performance, we explore which kind of semantic knowledge is best captured by different methods. The results indicate that WordNet glosses on different levels of the hierarchy capture many attributes relevant for this task. In combination with exploiting word embedding similarities, this source of information yielded our best results. Our best performing system ranked 5th out of 13 final ranks. Our analysis yields insights into the different kinds of attributes represented by different sources of knowledge.

Firearms and Tigers are Dangerous, Kitchen Knives and Zebras are Not: Testing whether Word Embeddings Can Tell
Pia Sommerauer | Antske Fokkens
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

This paper presents an approach for investigating the nature of semantic information captured by word embeddings. We propose a method that extends an existing human-elicited semantic property dataset with gold negative examples using crowd judgments. Our experimental approach tests the ability of supervised classifiers to identify semantic features in word embedding vectors and compares this to a feature-identification method based on full vector cosine similarity. The idea behind this method is that properties identified by classifiers, but not through full vector comparison, are captured by embeddings. Properties that cannot be identified by either method are not. Our results provide an initial indication that semantic properties relevant for the way entities interact (e.g. dangerous) are captured, while perceptual information (e.g. colors) is not represented. We conclude that, though preliminary, these results show that our method is suitable for identifying which properties are captured by embeddings.

2017

GRaSP: Grounded Representation and Source Perspective
Antske Fokkens | Piek Vossen | Marco Rospocher | Rinke Hoekstra | Willem Robert van Hage
Proceedings of the Workshop Knowledge Resources for the Socio-Economic Sciences and Humanities associated with RANLP 2017

When people or organizations provide information, they make choices regarding what information they include and how they present it. The combination of these two aspects (the content and stance provided by the source) represents a perspective. Investigating differences in perspective can provide various useful insights into the reliability of information, the way perspectives change over time, shared beliefs among groups of a similar social or political background and contrasts between other groups, etc. This paper introduces GRaSP, a generic framework for modeling perspectives and their sources.

Storyteller: Visual Analytics of Perspectives on Rich Text Interpretations
Maarten van Meersbergen | Piek Vossen | Janneke van der Zwaan | Antske Fokkens | Willem van Hage | Inger Leemans | Isa Maks
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

Complexity of event data in texts makes it difficult to assess its content, especially when considering larger collections in which different sources report on the same or similar situations. We present a system that makes it possible to visually analyze complex event and emotion data extracted from texts. We show that we can abstract from different data models for events and emotions to a single data model that can show the complex relations in four dimensions. The visualization has been applied to analyze 1) dynamic developments in how people both conceive and express emotions in theater plays and 2) how stories are told from the perspective of their sources based on rich event data extracted from news or biographies.

Tackling Error Propagation through Reinforcement Learning: A Case of Greedy Dependency Parsing
Minh Lê | Antske Fokkens
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Error propagation is a common problem in NLP. Reinforcement learning explores erroneous states during training and can therefore be more robust when mistakes are made early in a process. In this paper, we apply reinforcement learning to greedy dependency parsing which is known to suffer from error propagation. Reinforcement learning improves accuracy of both labeled and unlabeled dependencies of the Stanford Neural Dependency Parser, a high performance greedy parser, while maintaining its efficiency. We investigate the portion of errors which are the result of error propagation and confirm that reinforcement learning reduces the occurrence of error propagation.

2016

Two Architectures for Parallel Processing of Huge Amounts of Text
Mathijs Kattenberg | Zuhaitz Beloki | Aitor Soroa | Xabier Artola | Antske Fokkens | Paul Huygen | Kees Verstoep
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents two alternative NLP architectures to analyze massive amounts of documents, using parallel processing. The two architectures focus on different processing scenarios, namely batch-processing and streaming processing. The batch-processing scenario aims at optimizing the overall throughput of the system, i.e., minimizing the overall time spent on processing all documents. The streaming architecture aims to minimize the time to process real-time incoming documents and is therefore especially suitable for live feeds. The paper presents experiments with both architectures, and reports the overall gain when they are used for batch as well as for streaming processing. All the software described in the paper is publicly available under free licenses.

GRaSP: A Multilayered Annotation Scheme for Perspectives
Chantal van Son | Tommaso Caselli | Antske Fokkens | Isa Maks | Roser Morante | Lora Aroyo | Piek Vossen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a framework and methodology for the annotation of perspectives in text. In the last decade, different aspects of linguistic encoding of perspectives have been targeted as separate phenomena through different annotation initiatives. We propose an annotation scheme that integrates these different phenomena. We use a multilayered annotation approach, splitting the annotation of different aspects of perspectives into small subsequent subtasks in order to reduce the complexity of the task and to better monitor interactions between layers. Currently, we have included four layers of perspective annotation: events, attribution, factuality and opinion. The annotations are integrated in a formal model called GRaSP, which provides the means to represent instances (e.g. events, entities) and propositions in the (real or assumed) world in relation to their mentions in text. Then, the relation between the source and target of a perspective is characterized by means of perspective annotations. This enables us to place alternative perspectives on the same entity, event or proposition next to each other.

Unshared Task at the 3rd Workshop on Argument Mining: Perspective Based Local Agreement and Disagreement in Online Debate
Chantal van Son | Tommaso Caselli | Antske Fokkens | Isa Maks | Roser Morante | Lora Aroyo | Piek Vossen
Proceedings of the Third Workshop on Argument Mining (ArgMining2016)

2015

SPINOZA_VU: An NLP Pipeline for Cross Document TimeLines
Tommaso Caselli | Antske Fokkens | Roser Morante | Piek Vossen
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Taxonomy Beats Corpus in Similarity Identification, but Does It Matter?
Minh Le | Antske Fokkens
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

BiographyNet: Methodological Issues when NLP supports historical research
Antske Fokkens | Serge ter Braake | Niels Ockeloen | Piek Vossen | Susan Legêne | Guus Schreiber
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

When NLP is used to support research in the humanities, new methodological issues come into play. NLP methods may introduce a bias in their analysis that can influence the results of the hypothesis a humanities scholar is testing. This paper addresses this issue in the context of BiographyNet, a multi-disciplinary project involving NLP, Linked Data and history. We introduce the project to the NLP community. We argue that it is essential for historians to get insight into the provenance of information, including how information was extracted from text by NLP tools.

Hope and Fear: How Opinions Influence Factuality
Chantal van Son | Marieke van Erp | Antske Fokkens | Piek Vossen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Both sentiment and event factuality are fundamental information levels for our understanding of events mentioned in news texts. Most research so far has focused on either modeling opinions or factuality. In this paper, we propose a model that combines the two for the extraction and interpretation of perspectives on events. By doing so, we can explain the way people perceive changes in (their belief of) the world as a function of their fears of changes to the bad or their hopes of changes to the good. This study seeks to examine the effectiveness of this approach by applying factuality annotations, based on FactBank, on top of the MPQA Corpus, a corpus containing news texts annotated for sentiments and other private states. Our findings suggest that this approach can be valuable for the understanding of perspectives, but that there is still some work to do on the refinement of the integration.

2013

GAF: A Grounded Annotation Framework for Events
Antske Fokkens | Marieke van Erp | Piek Vossen | Sara Tonelli | Willem Robert van Hage | Luciano Serafini | Rachele Sprugnoli | Jesper Hoeksema
Workshop on Events: Definition, Detection, Coreference, and Representation

Offspring from Reproduction Problems: What Replication Failure Teaches Us
Antske Fokkens | Marieke van Erp | Marten Postma | Ted Pedersen | Piek Vossen | Nuno Freire
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

CLIMB grammars: three projects using metagrammar engineering
Antske Fokkens | Tania Avgustinova | Yi Zhang
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper introduces the CLIMB (Comparative Libraries of Implementations with Matrix Basis) methodology and grammars. The basic idea behind CLIMB is to use code generation as a general methodology for grammar development in order to create a more systematic approach to grammar development. The particular method used in this paper is closely related to the LinGO Grammar Matrix. Like the Grammar Matrix, resulting grammars are HPSG grammars that can map bidirectionally between strings and MRS representations. The main purpose of this paper is to provide insight into the process of using CLIMB for grammar development. In addition, we describe three projects that make use of this methodology or have concrete plans to adapt CLIMB in the future: CLIMB for Germanic languages, CLIMB for Slavic languages and CLIMB to combine two grammars of Mandarin Chinese. We present the first results that indicate feasibility and development time improvements for creating a medium to large coverage precision grammar.

2011

Spring Cleaning and Grammar Compression: Two Techniques for Detection of Redundancy in HPSG Grammars
Antske Fokkens | Yi Zhang | Emily M. Bender
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

Metagrammar engineering: Towards systematic exploration of implemented grammars
Antske Fokkens
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System
Emily M. Bender | Scott Drellishak | Antske Fokkens | Michael Wayne Goodman | Daniel P. Mills | Laurie Poulson | Safiyyah Saleem
Proceedings of the ACL 2010 System Demonstrations

Co-authors

Roser Morante 4

Levi Remijnse 4

Suzan Verberne 4

Tommaso Caselli 3

Eric Nalisnick 3

Marten Postma 3

Wouter van Atteveldt 3

Marieke van Erp 3

Willem Robert van Hage 3

Chantal van Son 3

Emily M. Bender 2

Pantea Haghighatkhah 2

Filip Ilievski 2

Bettina Speckmann 2

Sam Titarsolej 2

Kevin Verbeek 2

Xabier Artola 1

Tania Avgustinova 1

Zuhaitz Beloki 1

Camiel Beukeboom 1

Alicja Dobrzeniecka 1

Scott Drellishak 1

Tomaž Erjavec 1

Michael Wayne Goodman 1

Reshmi Gopalakrishna Pillai 1

Natali Helberger 1

Aurélie Herbelot 1

Jesper Hoeksema 1

Rinke Hoekstra 1

Laura Hollink 1

Mathijs Kattenberg 1

Inger Leemans 1

Susan Legêne 1

Ana Isabel Lopes 1

Nicolas Mattis 1

Francois Meyer 1

Daniel P. Mills 1

Gosse Minnema 1

Judith Moeller 1

Niels Ockeloen 1

Yvette Oortwijn 1

Alessandra Polimeno 1

Laurie Poulson 1

Marco Rospocher 1

Safiyyah Saleem 1

Irina Saparina 1

Gagestein Sarah 1

Guus Schreiber 1

Luciano Serafini 1

Rachele Sprugnoli 1

Vivek Srikumar 1

Nava Tintarev 1

Sergey Troshin 1

Marloes Van Noorloos 1

Ivar Vermeulen 1

Kees Verstoep 1

Piek T.J.M. Vossen 1

Sanne Vrijenhoek 1

Serge ter Braake 1

Astrid van Aggelen 1

Maarten van Meersbergen 1

Jacco van Ossenbruggen 1

Janneke van der Zwaan 1

Venues