Suzan Verberne - ACL Anthology

Suzan Verberne

2026

The Correlation Between Emotion in Text and Speech Segments is Limited: A Cross-Modal Study
David Lindevelt | Suzan Verberne | Joost Broekens
Findings of the Association for Computational Linguistics: EACL 2026

Although expressive TTS systems aim to capture human-like emotion, little is known about how well emotional signals in text correspond to those in speech. In this short paper, we investigate how emotion (Valence, Arousal, Dominance) in text relates to emotion in speech. We use 8 large language models for identifying emotion in text and two audio models for emotion in speech, across three genres: Podcasts, Audiobooks and TED talks. Findings show that while language models perform well on emotion recognition from situational text, and the audio models perform well on speech, they show a strong correlation for Valence only. Further, the genre of the content significantly impacts the correlation: audiobooks exhibit higher text-audio correlation than TED talks. Finally, we show that more context for LLMs fails to improve this correlation between text and speech emotion prediction. Our results highlight that emotional signals in text do not correspond well to those in speech: emotion prediction from text alone is insufficient for emotional TTS.
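As a rough illustration of what a text-speech correlation of this kind measures, the sketch below computes a Pearson correlation between per-segment valence scores from a text model and from an audio model. The abstract does not state which correlation coefficient the authors use, so Pearson is an assumption here, and the scores are invented for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-segment valence scores (not taken from the paper).
text_valence = [0.1, 0.4, 0.5, 0.9]    # predicted from the transcript
speech_valence = [0.2, 0.3, 0.6, 0.8]  # predicted from the audio
r = pearson(text_valence, speech_valence)
```

The same function could be applied separately to Arousal and Dominance scores, or per genre, to compare how the correlation varies.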

2025

Dataset Creation for Visual Entailment using Generative AI
Rob Reijtenbach | Suzan Verberne | Gijs Wijnholds
Proceedings of the 5th Workshop on Natural Logic Meets Machine Learning (NALOMA)

In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment. Manually creating datasets is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment. We take the premise text from SNLI as input prompts in a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our dataset both intrinsically and extrinsically. For extrinsic evaluation, we evaluate the validity of the generated images by using them as training data for a visual entailment classifier based on CLIP feature vectors. We find that synthetic training data leads to only a slight drop in quality on SNLI-VE, with an F-score of 0.686 compared to 0.703 when trained on real data. We also compare the quality of our generated training data to original training data on another dataset: SICK-VTE. Again, there is only a slight drop in F-score: from 0.400 to 0.384. These results indicate that in settings with data sparsity, synthetic data can be a promising solution for training visual entailment models.

QUIDS: Query Intent Description for Exploratory Search via Dual Space Modeling
Yumeng Wang | Xiuying Chen | Suzan Verberne
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

In exploratory search, users often submit vague queries to investigate unfamiliar topics, but receive limited feedback about how the search engine understood their input. This leads to a self-reinforcing cycle of mismatched results and trial-and-error reformulation. To address this, we study the task of generating user-facing natural language query intent descriptions that surface what the system likely inferred the query to mean, based on post-retrieval evidence. We propose QUIDS, a method that leverages dual-space contrastive learning to isolate intent-relevant information while suppressing irrelevant content. QUIDS combines a dual-encoder representation space with a disentangling decoder that work together to produce concise and accurate intent descriptions. Enhanced by intent-driven hard negative sampling, the model significantly outperforms state-of-the-art baselines across ROUGE, BERTScore, and human/LLM evaluations. Our qualitative analysis confirms QUIDS’ effectiveness in generating accurate intent descriptions for exploratory search. Our work contributes to improving the interaction between users and search engines by providing feedback to the user in exploratory search settings.

Evaluation of Attribution Bias in Generator-Aware Retrieval-Augmented Large Language Models
Amin Abolghasemi | Leif Azzopardi | Seyyed Hadi Hashemi | Maarten de Rijke | Suzan Verberne
Findings of the Association for Computational Linguistics: ACL 2025

Attributing answers to source documents is an approach used to enhance the verifiability of a model’s output in retrieval-augmented generation (RAG). Prior work has mainly focused on improving and evaluating the attribution quality of large language models (LLMs) in RAG, but this may come at the expense of inducing biases in the attribution of answers. We define and examine two aspects in the evaluation of LLMs in RAG pipelines, namely attribution sensitivity and bias with respect to authorship information. We explicitly inform an LLM about the authors of source documents, instruct it to attribute its answers, and analyze (i) how sensitive the LLM’s output is to the author of source documents, and (ii) whether the LLM exhibits a bias towards human-written or AI-generated source documents. We design an experimental setup in which we use counterfactual evaluation to study three LLMs in terms of their attribution sensitivity and bias in RAG pipelines. Our results show that adding authorship information to source documents can significantly change the attribution quality of LLMs by 3 to 18%. We show that LLMs can have an attribution bias towards explicit human authorship, which can serve as a competing hypothesis for findings of prior work showing that LLM-generated content may be preferred over human-written content. Our findings indicate that metadata of source documents can influence LLMs’ trust, and how they attribute their answers. Furthermore, our research highlights attribution bias and sensitivity as a novel aspect of the vulnerability of LLMs.

SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models
I-Fan Lin | Faegheh Hasibi | Suzan Verberne
Findings of the Association for Computational Linguistics: ACL 2025

In this paper, we propose Selection and Pooling with Large Language Models (SPILL), an intuitive, domain-adaptive method for intent clustering without fine-tuning. Existing embeddings-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them less generalizable to multiple datasets. Our goal is to make these existing embedders more generalizable to new domain datasets without further fine-tuning. Inspired by our theoretical derivation and simulation results on the effectiveness of sampling and pooling techniques, we view the clustering task as a small-scale selection problem. A good solution to this problem is associated with better clustering performance. Accordingly, we propose a two-stage approach: First, for each utterance (referred to as the seed), we derive its embedding using an existing embedder. Then, we apply a distance metric to select a pool of candidates close to the seed. Because the embedder is not optimized for new datasets, in the second stage, we use an LLM to further select utterances from these candidates that share the same intent as the seed. Finally, we pool these selected candidates with the seed to derive a refined embedding for the seed. We found that our method generally outperforms directly using an embedder, and it achieves comparable results to other state-of-the-art studies, even those that use much larger models and require fine-tuning, showing its strength and efficiency. Our results indicate that our method enables existing embedders to be further improved without additional fine-tuning, making them more adaptable to new domain datasets. Additionally, viewing the clustering task as a small-scale selection problem gives the potential of using LLMs to customize clustering tasks according to the user’s goals.
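The two-stage procedure described in this abstract can be sketched roughly as follows. This is a minimal sketch and not the authors' implementation: cosine similarity as the distance metric, mean pooling, the candidate pool size `k`, and all function names are assumptions, and the `same_intent` callable is a hypothetical stand-in for the LLM selection step.

```python
from math import sqrt
from typing import Callable, List, Sequence

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_candidates(seed: int, embeddings: Sequence[Vector], k: int) -> List[int]:
    """Stage 1: indices of the k utterances most similar to the seed."""
    others = [i for i in range(len(embeddings)) if i != seed]
    others.sort(
        key=lambda i: cosine_similarity(embeddings[seed], embeddings[i]),
        reverse=True,
    )
    return others[:k]


def refine_embedding(
    seed: int,
    utterances: Sequence[str],
    embeddings: Sequence[Vector],
    same_intent: Callable[[str, str], bool],
    k: int = 3,
) -> Vector:
    """Stage 2 plus pooling: filter candidates, then mean-pool with the seed."""
    candidates = nearest_candidates(seed, embeddings, k)
    # Keep only candidates judged to share the seed's intent (the LLM's role).
    selected = [i for i in candidates if same_intent(utterances[seed], utterances[i])]
    members = [seed] + selected
    dim = len(embeddings[seed])
    return [sum(embeddings[i][d] for i in members) / len(members) for d in range(dim)]


# Toy data, invented for illustration; real embeddings would come from an embedder.
utterances = ["check my balance", "show account balance", "book a flight", "what is my balance"]
embeddings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.9, 0.1]]


def toy_judge(seed_text: str, candidate_text: str) -> bool:
    """Hypothetical keyword rule standing in for an LLM judgement."""
    return "balance" in seed_text and "balance" in candidate_text


refined = refine_embedding(0, utterances, embeddings, toy_judge, k=2)
```

The refined embeddings would then be passed to an ordinary clustering algorithm in place of the raw ones.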

Memorization is Language-Sensitive: Analyzing Memorization and Inference Risks of LLMs in a Multilingual Setting
Ali Satvaty | Anna Visman | Daniel Seidel | Suzan Verberne | Fatih Turkmen
Proceedings of the First Workshop on Large Language Model Memorization (L2M2)

Large Language Models (LLMs) are known to memorize and reproduce parts of their training data during inference, raising significant privacy and safety concerns. While this phenomenon has been extensively studied to explain its contributing factors and countermeasures, its implications in multilingual contexts remain largely unexplored. In this work, we investigate cross-lingual differences in memorization behaviors of multilingual LLMs. Specifically, we examine both discoverable memorization and susceptibility to perplexity ratio attacks using Pythia models of varying sizes, evaluated on two parallel multilingual datasets. Our results reveal that lower-resource languages consistently exhibit higher vulnerability to perplexity ratio attacks, indicating greater privacy risks. In contrast, patterns of discoverable memorization appear to be influenced more strongly by the model’s pretraining or fine-tuning phases than by language resource level alone. These findings highlight the nuanced interplay between language resource availability and memorization in multilingual LLMs, providing insights toward developing safer and more privacy-preserving language models across diverse linguistic settings.

Controlled Retrieval-augmented Context Evaluation for Long-form RAG
Jia-Huei Ju | Suzan Verberne | Maarten de Rijke | Andrew Yates
Findings of the Association for Computational Linguistics: EMNLP 2025

Retrieval-augmented generation (RAG) enhances large language models by incorporating context retrieved from external knowledge sources. While the effectiveness of the retrieval module is typically evaluated with relevance-based ranking metrics, such metrics may be insufficient to reflect the retrieval’s impact on the final RAG result, especially in long-form generation scenarios. We argue that providing a comprehensive retrieval-augmented context is important for long-form RAG tasks like report generation and propose metrics for assessing the context independent of generation. We introduce CRUX, a Controlled Retrieval-aUgmented conteXt evaluation framework designed to directly assess retrieval-augmented contexts. This framework uses human-written summaries to control the information scope of knowledge, enabling us to measure how well the context covers information essential for long-form generation. CRUX uses question-based evaluation to assess RAG’s retrieval in a fine-grained manner. Empirical results show that CRUX offers more reflective and diagnostic evaluation. Our findings also reveal substantial room for improvement in current retrieval methods, pointing to promising directions for advancing RAG’s retrieval. Our data and code are publicly available to support and advance future research on retrieval for RAG. Github: https://github.com/DylanJoo/crux

SOLID: Self-seeding and Multi-intent Self-instructing LLMs for Generating Intent-aware Information-Seeking Dialogs
Arian Askari | Roxana Petcu | Chuan Meng | Mohammad Aliannejadi | Amin Abolghasemi | Evangelos Kanoulas | Suzan Verberne
Findings of the Association for Computational Linguistics: NAACL 2025

Intent prediction in information-seeking dialogs is challenging and requires a substantial amount of data with human-labeled intents for effective model training. While Large Language Models (LLMs) have demonstrated effectiveness in generating synthetic data, existing methods typically rely on human feedback and are tailored to structured, task-oriented intents. In this paper, we leverage LLMs for zero-shot generation of large-scale, open-domain, intent-aware information-seeking dialogs to serve as training data for intent prediction models. We introduce SOLID, a method that generates dialogs turn by turn using novel self-seeding and multi-intent self-instructing strategies. Additionally, we propose SOLID-RL, a finetuned version that generates an entire dialog in one step using data created with SOLID. SOLID and SOLID-RL are each used to generate over 300k intent-aware dialogs, significantly surpassing the size of existing datasets. Experiments show that intent prediction models trained on sampled dialogs generated by SOLID and SOLID-RL outperform those trained solely on human-generated dialogs. Our findings demonstrate the potential of LLMs to expand training datasets, as they provide valuable resources for conversational agents across multiple tasks. Our self-seeding and self-instructing approaches are adaptable to various conversational data types and languages with minimal modifications.

2024

Biomedical Entity Linking for Dutch: Fine-tuning a Self-alignment BERT Model on an Automatically Generated Wikipedia Corpus
Fons Hartendorp | Tom Seinen | Erik van Mulligen | Suzan Verberne
Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024

Biomedical entity linking, a main component in automatic information extraction from health-related texts, plays a pivotal role in connecting textual entities (such as diseases, drugs and body parts mentioned by patients) to their corresponding concepts in a structured biomedical knowledge base. The task remains challenging despite recent developments in natural language processing. This report presents the first evaluated biomedical entity linking model for the Dutch language. We use MedRoBERTa.nl as the base model and perform second-phase pretraining through self-alignment on a Dutch biomedical ontology extracted from the UMLS and Dutch SNOMED. We derive a corpus from Wikipedia of ontology-linked Dutch biomedical entities in context and fine-tune our model on this dataset. We evaluate our model on the Dutch portion of the Mantra GSC-corpus and achieve 54.7% classification accuracy and 69.8% 1-distance accuracy. We then perform a case study on a collection of unlabeled, patient-support forum data and show that our model is hampered by the limited quality of the preceding entity recognition step. Manual evaluation of a small sample indicates that of the correctly extracted entities, around 65% are linked to the correct concept in the ontology. Our results indicate that biomedical entity linking in a language other than English remains challenging, but our Dutch model can be used for high-level analysis of patient-generated text.

Generate then Refine: Data Augmentation for Zero-shot Intent Detection
I-Fan Lin | Faegheh Hasibi | Suzan Verberne
Findings of the Association for Computational Linguistics: EMNLP 2024

In this short paper we propose a data augmentation method for intent detection in zero-resource domains. Existing data augmentation methods rely on a few labelled examples for each intent category, which can be expensive in settings with many possible intents. We use a two-stage approach: First, we generate utterances for intent labels using an open-source large language model in a zero-shot setting. Second, we develop a smaller sequence-to-sequence model (the Refiner), to improve the generated utterances. The Refiner is fine-tuned on seen domains and then applied to unseen domains. We evaluate our method by training an intent classifier on the generated data, and evaluating it on real (human) data. We find that the Refiner significantly improves the data utility and diversity over the zero-shot LLM baseline for unseen domains and over common baseline approaches. Our results indicate that a two-step approach of a generative LLM in a zero-shot setting and a smaller sequence-to-sequence model can provide high-quality data for intent detection.

Tree Transformer’s Disambiguation Ability of Prepositional Phrase Attachment and Garden Path Effects
Lingling Zhou | Suzan Verberne | Gijs Wijnholds
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This work studies two types of ambiguity in natural language: prepositional phrase (PP) attachment ambiguity, and garden path constructions. Due to the different nature of these ambiguities – one being structural, the other incremental in nature – we pretrain and evaluate the Tree Transformer of Wang et al. (2019), an unsupervised Transformer model that induces tree representations internally. To assess PP attachment ambiguity we inspect the model’s induced parse trees against a newly prepared dataset derived from the PP attachment corpus (Ratnaparkhi et al., 1994). Measuring garden path effects is done by considering surprisal rates of the underlying language model on a number of dedicated test suites, following Futrell et al. (2019). For comparison we evaluate a pretrained supervised BiLSTM-based model trained on constituency parsing as sequence labelling (Gómez-Rodríguez and Vilares, 2018). Results show that the unsupervised Tree Transformer does exhibit garden path effects, but its parsing ability is far inferior to the supervised BiLSTM, and it is not as sensitive to lexical cues as other large LSTM models, suggesting that supervised parsers based on a pre-Transformer architecture may be the better choice in the presence of ambiguity.

Learning to Use Tools via Cooperative and Interactive Agents
Zhengliang Shi | Shen Gao | Xiuyi Chen | Yue Feng | Lingyong Yan | Haibo Shi | Dawei Yin | Pengjie Ren | Suzan Verberne | Zhaochun Ren
Findings of the Association for Computational Linguistics: EMNLP 2024

Tool learning empowers large language models (LLMs) as agents to use external tools and extend their utility. Existing methods employ one single LLM-based agent to iteratively select and execute tools, thereafter incorporating execution results into the next action prediction. Despite their progress, these methods suffer from performance degradation when addressing practical tasks due to: (1) the pre-defined pipeline with restricted flexibility to calibrate incorrect actions, and (2) the struggle to adapt a general LLM-based agent to perform a variety of specialized actions. To mitigate these problems, we propose ConAgents, a Cooperative and interactive Agents framework, which coordinates three specialized agents for tool selection, tool execution, and action calibration separately. ConAgents introduces two communication protocols to enable the flexible cooperation of agents. To effectively generalize the ConAgents into open-source models, we also propose specialized action distillation, enhancing their ability to perform specialized actions in our framework. Our extensive experiments on three datasets show that the LLMs, when equipped with the ConAgents, outperform baselines with substantial improvement (i.e., up to 14% higher success rate).

Investigating the Robustness of Modelling Decisions for Few-Shot Cross-Topic Stance Detection: A Preregistered Study
Myrthe Reuver | Suzan Verberne | Antske Fokkens
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

For a viewpoint-diverse news recommender, identifying whether two news articles express the same viewpoint is essential. One way to determine “same or different” viewpoint is stance detection. In this paper, we investigate the robustness of operationalization choices for few-shot stance detection, with special attention to modelling stance across different topics. Our experiments test pre-registered hypotheses on stance detection. Specifically, we compare two stance task definitions (Pro/Con versus Same Side Stance), two LLM architectures (bi-encoding versus cross-encoding), and adding Natural Language Inference knowledge, with pre-trained RoBERTa models trained with shots of 100 examples from 7 different stance detection datasets. Some of our hypotheses and claims from earlier work can be confirmed, while others give more inconsistent results. The effect of the Same Side Stance definition on performance differs per dataset and is influenced by other modelling choices. We found no relationship between the number of training topics in the training shots and performance. In general, cross-encoding outperforms bi-encoding, and adding NLI training to our models gives considerable improvement, but these results are not consistent across all datasets. Our results indicate that it is essential to include multiple datasets and systematic modelling experiments when aiming to find robust modelling choices for the concept ‘stance’.

Attributed Question Answering for Preconditions in the Dutch Law
Felicia Redelaar | Romy Van Drie | Suzan Verberne | Maaike De Boer
Proceedings of the Natural Legal Language Processing Workshop 2024

In this paper, we address the problem of answering questions about preconditions in the law, e.g. “When can the court terminate the guardianship of a natural person?”. When answering legal questions, it is important to attribute the relevant part of the law; we therefore not only generate answers but also references to law articles. We implement a retrieval augmented generation (RAG) pipeline for long-form answers based on the Dutch law, using several state-of-the-art retrievers and generators. For evaluating our pipeline, we create a dataset containing legal QA pairs with attributions. Our experiments show promising results on our extended version for the automatic evaluation metrics from the Automatic LLMs’ Citation Evaluation (ALCE) Framework and the G-EVAL Framework. Our findings indicate that RAG has significant potential in complex, citation-heavy domains like law, as it helps laymen understand legal preconditions and rights by generating high-quality answers with accurate attributions.

CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems
Amin Abolghasemi | Zhaochun Ren | Arian Askari | Mohammad Aliannejadi | Maarten de Rijke | Suzan Verberne
Findings of the Association for Computational Linguistics: ACL 2024

An important unexplored aspect in previous work on user satisfaction estimation for Task-Oriented Dialogue (TOD) systems is their evaluation in terms of robustness for the identification of user dissatisfaction: current benchmarks for user satisfaction estimation in TOD systems are highly skewed towards dialogues for which the user is satisfied. The effect of having a more balanced set of satisfaction labels on performance is unknown. However, balancing the data with more dissatisfactory dialogue samples requires further data collection and human annotation, which is costly and time-consuming. In this work, we leverage large language models (LLMs) and unlock their ability to generate satisfaction-aware counterfactual dialogues to augment the set of original dialogues of a test collection. We gather human annotations to ensure the reliability of the generated samples. We evaluate two open-source LLMs as user satisfaction estimators on our augmented collection against state-of-the-art fine-tuned models. Our experiments show that when used as few-shot user satisfaction estimators, open-source LLMs show higher robustness to the increase in the number of dissatisfaction labels in the test collection than the fine-tuned state-of-the-art models. Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems. We release our aligned counterfactual dialogues, which are curated by human annotation, to facilitate further research on this topic.

2023

Expand, Highlight, Generate: RL-driven Document Generation for Passage Reranking
Arian Askari | Mohammad Aliannejadi | Chuan Meng | Evangelos Kanoulas | Suzan Verberne
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Generating synthetic training data based on large language models (LLMs) for ranking models has gained attention recently. Prior studies use LLMs to build pseudo query-document pairs by generating synthetic queries from documents in a corpus. In this paper, we propose a new perspective of data augmentation: generating synthetic documents from queries. To achieve this, we propose DocGen, which consists of a three-step pipeline that utilizes the few-shot capabilities of LLMs. The DocGen pipeline performs synthetic document generation by (i) expanding, (ii) highlighting the original query, and then (iii) generating a synthetic document that is likely to be relevant to the query. To further improve the relevance between generated synthetic documents and their corresponding queries, we propose DocGen-RL, which regards the estimated relevance of the document as a reward and leverages reinforcement learning (RL) to optimize the DocGen pipeline. Extensive experiments demonstrate that the DocGen pipeline and DocGen-RL significantly outperform existing state-of-the-art data augmentation methods, such as InPars, indicating that our new perspective of generating documents leverages the capacity of LLMs in generating synthetic data more effectively. We release the code, generated data, and model checkpoints to foster research in this area.

ChiSCor: A Corpus of Freely-Told Fantasy Stories by Dutch Children for Computational Linguistics and Cognitive Science
Bram van Dijk | Max van Duijn | Suzan Verberne | Marco Spruit
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

In this resource paper we release ChiSCor, a new corpus containing 619 fantasy stories, told freely by 442 Dutch children aged 4-12. ChiSCor was compiled for studying how children render character perspectives, and unravelling language and cognition in development, with computational tools. Unlike existing resources, ChiSCor’s stories were produced in natural contexts, in line with recent calls for more ecologically valid datasets. ChiSCor hosts text, audio, and annotations for character complexity and linguistic complexity. Additional metadata (e.g. education of caregivers) is available for one third of the Dutch children. ChiSCor also includes a small set of 62 English stories. This paper details how ChiSCor was compiled and shows its potential for future work with three brief case studies: i) we show that the syntactic complexity of stories is strikingly stable across children’s ages; ii) we extend work on Zipfian distributions in free speech and show that ChiSCor obeys Zipf’s law closely, reflecting its social context; iii) we show that even though ChiSCor is relatively small, the corpus is rich enough to train informative lemma vectors that allow us to analyse children’s language use. We end with a reflection on the value of narrative datasets in computational linguistics.

Pretrained Transformers for Text Ranking: BERT and Beyond
Suzan Verberne
Computational Linguistics, Volume 49, Issue 1 - March 2023

2022

Small Data Problems in Political Research: A Critical Replication Study
Hugo de Vos | Suzan Verberne
Journal for Language Technology and Computational Linguistics, Vol. 35 No. 2

2021

FuzzyBIO: A Proposal for Fuzzy Representation of Discontinuous Entities
Anne Dirkson | Suzan Verberne | Wessel Kraaij
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis

Discontinuous entities pose a challenge to named entity recognition (NER). These phenomena occur commonly in the biomedical domain. As a solution, expansions of the BIO representation scheme that can handle these entity types are commonly used (i.e. BIOHD). However, the extra tag types make the NER task more difficult to learn. In this paper we propose an alternative: a fuzzy continuous BIO scheme (FuzzyBIO). We focus on the task of Adverse Drug Response extraction and normalization to compare FuzzyBIO to BIOHD. We find that FuzzyBIO improves recall of NER for two of three data sets and results in a higher percentage of correctly identified disjoint and composite entities for all data sets. Using FuzzyBIO also improves end-to-end performance for continuous and composite entities in two of three data sets. Since FuzzyBIO improves performance for some data sets and the conversion from BIOHD to FuzzyBIO is straightforward, we recommend investigating which is more effective for any data set containing discontinuous entities.

No NLP Task Should be an Island: Multi-disciplinarity for Diversity in News Recommender Systems
Myrthe Reuver | Antske Fokkens | Suzan Verberne
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

Natural Language Processing (NLP) is defined by specific, separate tasks, each with its own literature, benchmark datasets, and definitions. In this position paper, we argue that for a complex problem such as the threat to democracy by non-diverse news recommender systems, it is important to take into account a higher-order, normative goal and its implications. Experts in ethics, political science and media studies have suggested that news recommendation systems could be used to support a deliberative democracy. We reflect on the role of NLP in recommendation systems with this specific goal in mind and show that this theory of democracy helps to identify which NLP tasks and techniques can support this goal, and what work still needs to be done. This leads to recommendations for NLP researchers working on this specific problem as well as researchers working on other complex multidisciplinary problems.

Are we human, or are we users? The role of natural language processing in human-centric news recommenders that nudge users to diverse content
Myrthe Reuver | Nicolas Mattis | Marijn Sax | Suzan Verberne | Nava Tintarev | Natali Helberger | Judith Moeller | Sanne Vrijenhoek | Antske Fokkens | Wouter van Atteveldt
Proceedings of the 1st Workshop on NLP for Positive Impact

In this position paper, we present a research agenda and ideas for facilitating exposure to diverse viewpoints in news recommendation. Recommending news from diverse viewpoints is important to prevent potential filter bubble effects in news consumption, and stimulate a healthy democratic debate. To account for the complexity that is inherent to humans as citizens in a democracy, we anticipate (among others) individual-level differences in acceptance of diversity. We connect this idea to techniques in Natural Language Processing, where distributional language models would allow us to place different users and news articles in a multidimensional space based on semantic content, where diversity is operationalized as distance and variance. In this way, we can model individual “latitudes of diversity” for different users, and thus personalize viewpoint diversity in support of a healthy public debate. In addition, we identify technical, ethical and conceptual issues related to our presented ideas. Our investigation describes how NLP can play a central role in diversifying news recommendations.

Is Stance Detection Topic-Independent and Cross-topic Generalizable? - A Reproduction Study
Myrthe Reuver | Suzan Verberne | Roser Morante | Antske Fokkens
Proceedings of the 8th Workshop on Argument Mining

Cross-topic stance detection is the task of automatically detecting stances (pro, against, or neutral) on unseen topics. We successfully reproduce state-of-the-art cross-topic stance detection work (Reimers et al., 2019), and systematically analyze its reproducibility. Our attention then turns to the cross-topic aspect of this work, and the specificity of topics in terms of vocabulary and socio-cultural context. We ask: To what extent is stance detection topic-independent and generalizable across topics? We compare the model’s performance on various unseen topics, and find topic (e.g. abortion, cloning), class (e.g. pro, con), and their interaction affecting the model’s performance. We conclude that investigating performance on different topics, and addressing topic-specific vocabulary and context, is a future avenue for cross-topic stance detection.

References

Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. Classification and Clustering of Arguments with Contextualized Word Embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 567–578, Florence, Italy. Association for Computational Linguistics.

2020

Named Entity Recognition for Chinese biomedical patents
Yuting Hu | Suzan Verberne
Proceedings of the 28th International Conference on Computational Linguistics

There is a large body of work on Biomedical Entity Recognition (Bio-NER) for English but there have only been a few attempts addressing NER for Chinese biomedical texts. Because of the growing number of Chinese biomedical discoveries being patented, and the lack of NER models for patent data, we train and evaluate NER models for the analysis of Chinese biomedical patent data, based on BERT. By doing so, we show the value and potential of this domain-specific NER task. For the evaluation of our methods we built our own Chinese biomedical patents NER dataset, and our optimized model achieved an F1 score of 0.54±0.15. Further biomedical analysis indicates that our solution can help detect meaningful biomedical entities and novel gene-gene interactions, with limited labeled data, training time and computing power.

Challenges of Applying Automatic Speech Recognition for Transcribing EU Parliament Committee Meetings: A Pilot Study
Hugo de Vos | Suzan Verberne
Proceedings of the Second ParlaCLARIN Workshop

We tested the feasibility of automatically transcribing committee meetings of the European Union parliament with the use of Automatic Speech Recognition techniques. These committee meetings contain more valuable information for political science scholars than the plenary meetings since these meetings showcase actual debates, as opposed to the more formal plenary meetings. However, since there are no transcriptions of those meetings, they are a lot less accessible for research than the plenary meetings, of which multiple corpora exist. We explored a freely available ASR application and analysed the output in order to identify the weaknesses of an out-of-the-box system. We followed up on those weaknesses by proposing directions for optimizing the ASR for our goals. We found that, despite showcasing acceptable results in terms of Word Error Rate, the model did not yet suffice for the purpose of generating a data set for use in Political Science. The application was unable to successfully recognize domain-specific terms and names. To overcome this issue, future research will be directed at using domain-specific language models in combination with off-the-shelf acoustic models.

Conversation-Aware Filtering of Online Patient Forum Messages
Anne Dirkson | Suzan Verberne | Wessel Kraaij
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

Previous approaches to NLP tasks on online patient forums have been limited to single posts as units, thereby neglecting the overarching conversational structure. In this paper we explore the benefit of exploiting conversational context for filtering posts relevant to a specific medical topic. We experiment with two approaches to add conversational context to a BERT model: a sequential CRF layer and manually engineered features. Although neither approach can outperform the F1 score of the BERT baseline, we find that adding a sequential layer improves precision for all target classes whereas adding a non-sequential layer with manually engineered features leads to a higher recall for two out of three target classes. Thus, depending on the end goal, conversation-aware modelling may be beneficial for identifying relevant messages. We hope our findings encourage other researchers in this domain to move beyond studying messages in isolation towards more discourse-based data collection and classification. We release our code for the purpose of follow-up research.

Creating a Dataset for Named Entity Recognition in the Archaeology Domain
Alex Brandsen | Suzan Verberne | Milco Wansleeben | Karsten Lambers
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present the development of a training dataset for Dutch Named Entity Recognition (NER) in the archaeology domain. This dataset was created as there is a dire need for semantic search within archaeology, in order to allow archaeologists to find structured information in collections of Dutch excavation reports, currently totalling around 60,000 (658 million words) and growing rapidly. To guide this search task, NER is needed. We created rigorous annotation guidelines in an iterative process, then instructed five archaeology students to annotate a number of documents. The resulting dataset contains ~31k annotations across six entity types (artefact, time period, place, context, species & material). The inter-annotator agreement is 0.95, and when we used this data for machine learning, we observed an increase in F1 score from 0.51 to 0.70 in comparison to a machine learning model trained on a dataset created in prior work. This indicates that the data is of high quality, and can confidently be used to train NER classifiers.

2019

Lexical Normalization of User-Generated Medical Text
Anne Dirkson | Suzan Verberne | Wessel Kraaij
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task

In the medical domain, user-generated social media text is increasingly used as a valuable complementary knowledge source to scientific medical literature. The extraction of this knowledge is complicated by colloquial language use and misspellings. Yet, lexical normalization of such data has not been addressed properly. This paper presents an unsupervised, data-driven spelling correction module for medical social media. Our method outperforms state-of-the-art spelling correction and can detect mistakes with an F0.5 of 0.888. Additionally, we present a novel corpus for spelling mistake detection and correction on a medical patient forum.

Transfer Learning for Health-related Twitter Data
Anne Dirkson | Suzan Verberne
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task

Transfer learning is promising for many NLP applications, especially in tasks with limited labeled data. This paper describes the methods developed by team TMRLeiden for the 2019 Social Media Mining for Health Applications (SMM4H) Shared Task. Our methods use state-of-the-art transfer learning methods to classify, extract and normalise adverse drug reactions (ADRs) and to classify personal health mentions from health-related tweets. The code and fine-tuned models are publicly available.

2016

Abstractive Compression of Captions with Attentive Recurrent Neural Networks
Sander Wubben | Emiel Krahmer | Antal van den Bosch | Suzan Verberne
Proceedings of the 9th International Natural Language Generation conference

2013

Text Representations for Patent Classification
Eva D’hondt | Suzan Verberne | Cornelis Koster | Lou Boves
Computational Linguistics, Volume 39, Issue 3 - September 2013

2012

The effect of domain and text type on text prediction quality
Suzan Verberne | Antal van den Bosch | Helmer Strik | Lou Boves
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2010

What Is Not in the Bag of Words for Why-QA?
Suzan Verberne | Lou Boves | Nelleke Oostdijk | Peter-Arno Coppen
Computational Linguistics, Volume 36, Number 2, June 2010

Constructing a Broad-coverage Lexicon for Text Mining in the Patent Domain
Nelleke Oostdijk | Suzan Verberne | Cornelis Koster
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

For mining intellectual property texts (patents), a broad-coverage lexicon that covers general English words together with terminology from the patent domain is indispensable. The patent domain is very diffuse as it comprises a variety of technical domains (e.g. Human Necessities, Chemistry & Metallurgy and Physics in the International Patent Classification). As a result, collecting a lexicon that covers the language used in patent texts is not a straightforward task. In this paper we describe the approach that we have developed for the semi-automatic construction of a broad-coverage lexicon for classification and information retrieval in the patent domain, which combines information from multiple sources. Our contribution is twofold. First, we provide insight into the difficulties of developing lexical resources for information retrieval and text mining in the patent domain, a research and development field that is expanding quickly. Second, we create a broad-coverage lexicon annotated with rich lexical information and containing both general English word forms and domain terminology for various technical domains.

2008

Using Syntactic Information for Improving Why-Question Answering
Suzan Verberne | Lou Boves | Nelleke Oostdijk | Peter-Arno Coppen
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Passage Retrieval for Question Answering using Sliding Windows
Mahboob Khalid | Suzan Verberne
Coling 2008: Proceedings of the 2nd workshop on Information Retrieval for Question Answering

2006

Data for question answering: The case of why
Suzan Verberne | Lou Boves | Nelleke Oostdijk | Peter-Arno Coppen
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

For research and development of an approach for automatically answering why-questions (why-QA), a data collection was created. The data set was obtained by way of elicitation and comprises a total of 395 why-questions. For each question, the data set includes the source document and one or two user-formulated answers. In addition, for a subset of the questions, user-formulated paraphrases are available. All question-answer pairs have been annotated with information on topic and semantic answer type. The resulting data set is of importance not only for our research; we also expect it to contribute to and stimulate other research in the field of why-QA.

Developing an Approach for Why-Question Answering
Suzan Verberne
Student Research Workshop

Discourse-based answering of why-questions
Suzan Verberne | Lou Boves | Peter-Arno Coppen | Nelleke Oostdijk
Traitement Automatique des Langues, Volume 47, Numéro 2 : Discours et document : traitements automatiques [Computational Approaches to Discourse and Document Processing]

Co-authors

Myrthe Reuver 4

Amin Abolghasemi 3

Mohammad Aliannejadi 3

Wessel Kraaij 3

Maarten de Rijke 3

Faegheh Hasibi 2

Evangelos Kanoulas 2

Cornelis Koster 2

Gijs Wijnholds 2

Antal van den Bosch 2

Leif Azzopardi 1

Alex Brandsen 1

Joost Broekens 1

Maaike De Boer 1

Eva D’hondt 1

Fons Hartendorp 1

Seyyed Hadi Hashemi 1

Natali Helberger 1

Mahboob Khalid 1

Emiel Krahmer 1

Karsten Lambers 1

David Lindevelt 1

Nicolas Mattis 1

Judith Moeller 1

Roser Morante 1

Felicia Redelaar 1

Rob Reijtenbach 1

Daniel Seidel 1

Zhengliang Shi 1

Nava Tintarev 1

Fatih Turkmen 1

Bram Van Dijk 1

Romy Van Drie 1

Sanne Vrijenhoek 1

Milco Wansleeben 1

Sander Wubben 1

Lingling Zhou 1

Wouter van Atteveldt 1

Max van Duijn 1

Erik van Mulligen 1

Venues