Martha Larson

2025

Evaluating Large Language Models for Confidence-based Check Set Selection
Jane Arleth dela Cruz | Iris Hendrickx | Martha Larson
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) have shown promise in automating high-labor data tasks, but the adoption of LLMs in high-stake scenarios faces two key challenges: their tendency to answer despite uncertainty and their difficulty handling long input contexts robustly.We investigate commonly used off-the-shelf LLMs’ ability to identify low-confidence outputs for human review through “check set selection”–a process where LLMs prioritize information needing human judgment.Using a case study on social media monitoring for disaster risk management,we define the “check set” as a list of tweets escalated to the disaster manager when the LLM has the least confidence, enabling human oversight within budgeted effort.We test two strategies for LLM check set selection: *individual confidence elicitation* – LLMs assesses confidence for each tweet classification individually, requiring more prompts with shorter contexts, and *direct set confidence elicitation* – LLM evaluates confidence for a list of tweet classifications at once, using less prompts but longer contexts.Our results reveal that set selection via individual probabilities is more reliable but that direct set confidence merits further investigation.Direct set selection challenges include inconsistent outputs, incorrect check set size, and low inter-annotator agreement. Despite these challenges, our approach improves collaborative disaster tweet classification by outperforming random-sample check set selection, demonstrating the potential of human-LLM collaboration.

pdf bib abs

Improving Large Language Model Confidence Estimates using Extractive Rationales for Classification
Jane Arleth dela Cruz | Iris Hendrickx | Martha Larson
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

The adoption of large language models (LLMs) in high-stake scenarios continues to be a challenge due to lack of effective confidence calibration. Although LLMs are capable of providing convincing self-explanations and verbalizing confidence in NLP tasks, they tend to exhibit overconfidence when using generative or free-text rationales (e.g. Chain-of-Thought), where reasoning steps tend to lack verifiable grounding.In this paper, we investigate whether adding explanations in the form of extractive rationales –snippets of the input text that directly support the predictions, can improve the confidence calibration of LLMs in classification tasks.We examine two approaches for integrating these rationales: (1) a one-stage rationale-generation with prediction and (2) a two-stage rationale-guided confidence calibration.We evaluate these approaches on a disaster tweet classification task using four different off-the-shelf LLMs. Our results show that extracting rationales both before and after prediction can improve the confidence estimates of the LLMs. Furthermore, we find that replacing valid extractive rationales with irrelevant ones significantly lowers model confidence, highlighting the importance of rationale quality.This simple yet effective method improves LLM verbalized confidence and reduces overconfidence in possible hallucination.

pdf bib abs

Typographic Attacks in a Multi-Image Setting
Xiaomeng Wang | Zhengyu Zhao | Martha Larson
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Vision-Language Models (LVLMs) are susceptible to typographic attacks, which are misclassifications caused by an attack text that is added to an image. In this paper, we introduce a multi-image setting for studying typographic attacks, broadening the current emphasis of the literature on attacking individual images. Specifically, our focus is on attacking image sets without repeating the attack query. Such non-repeating attacks are stealthier, as they are more likely to evade a gatekeeper than attacks that repeat the same attack text. We introduce two attack strategies for the multi-image setting, leveraging the difficulty of the target image, the strength of the attack text, and text-image similarity. Our text-image similarity approach improves attack success rates by 21% over random, non-specific methods on the CLIP model using ImageNet while maintaining stealth in a multi-image scenario. An additional experiment demonstrates transferability, i.e., text-image similarity calculated using CLIP transfers when attacking InstructBLIP.

2023

pdf bib abs

Analyzing the Potential of Linguistic Features for Sign Spotting: A Look at Approximative Features
Natalie Hollain | Martha Larson | Floris Roelofsen
Proceedings of the Second International Workshop on Automatic Translation for Signed and Spoken Languages

Sign language processing is the field of research that aims to recognize, retrieve, and spot signs in videos. Various approaches have been developed, varying in whether they use linguistic features and whether they use landmark detection tools or not. Incorporating linguistics holds promise for improving sign language processing in terms of performance, generalizability, and explainability. This paper focuses on the task of sign spotting and aims to expand on the approximative linguistic features that have been used in previous work, and to understand when linguistic features deliver an improvement over landmark features. We detect landmarks with Mediapipe and extract linguistically relevant features from them, including handshape, orientation, location, and movement. We compare a sign spotting model using linguistic features with a model operating on landmarks directly, finding that the approximate linguistic features tested in this paper capture some aspects of signs better than the landmark features, while they are worse for others.

pdf bib abs

Anchors in Embedding Space: A Simple Concept Tracking Approach to Support Conceptual History Research
Jetske Adams | Martha Larson | Jaap Verheul | Michael Boyden
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

We introduce a simple concept tracking approach to support conceptual history research. Building on the existing practices of conceptual historians, we use dictionaries to identify “anchors”, which represent primary dimensions of meaning of a concept. Then, we create a plot showing how a key concept has evolved over time in a historical corpus in relation to these dimensions. We demonstrate the approach by plotting the change of several key concepts in the COHA corpus.

2022

pdf bib abs

Regex in a Time of Deep Learning: The Role of an Old Technology in Age Discrimination Detection in Job Advertisements
Anna Pillar | Kyrill Poelmans | Martha Larson
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Deep learning holds great promise for detecting discriminatory language in the public sphere. However, for the detection of illegal age discrimination in job advertisements, regex approaches are still strong performers. In this paper, we investigate job advertisements in the Netherlands. We present a qualitative analysis of the benefits of the ‘old’ approach based on regexes and investigate how neural embeddings could address its limitations.

pdf bib abs

Doing not Being: Concrete Language as a Bridge from Language Technology to Ethnically Inclusive Job Ads
Jetske Adams | Kyrill Poelmans | Iris Hendrickx | Martha Larson
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

This paper makes the case for studying concreteness in language as a bridge that will allow language technology to support the understanding and improvement of ethnic inclusivity in job advertisements. We propose an annotation scheme that guides the assignment of sentences in job ads to classes that reflect concrete actions, i.e., what the employer needs people to do, and abstract dispositions, i.e., who the employer expects people to be. Using an annotated dataset of Dutch-language job ads, we demonstrate that machine learning technology is effectively able to distinguish these classes.

2021

pdf bib abs

Exploring Inspiration Sets in a Data Programming Pipeline for Product Moderation
Justine Winkler | Simon Brugman | Bas van Berkel | Martha Larson
Proceedings of the 4th Workshop on e-Commerce and NLP

We carry out a case study on the use of data programming to create data to train classifiers used for product moderation on a large e-commerce platform. Data programming is a recently-introduced technique that uses human-defined rules to generate training data sets without tedious item-by-item hand labeling. Our study investigates methods for allowing product moderators to quickly modify the rules given their knowledge of the domain and, especially, of textual item descriptions. Our results show promise that moderators can use this approach to steer the training data, making possible fast and close control of classifiers that detect policy violations.

2020

pdf bib abs

Truth or Error? Towards systematic analysis of factual errors in abstractive summaries
Klaus-Michael Lux | Maya Sappelli | Martha Larson
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

This paper presents a typology of errors produced by automatic summarization systems. The typology was created by manually analyzing the output of four recent neural summarization systems. Our work is motivated by the growing awareness of the need for better summary evaluation methods that go beyond conventional overlap-based metrics. Our typology is structured into two dimensions. First, the Mapping Dimension describes surface-level errors and provides insight into word-sequence transformation issues. Second, the Meaning Dimension describes issues related to interpretation and provides insight into breakdowns in truth, i.e., factual faithfulness to the original text. Comparative analysis revealed that two neural summarization systems leveraging pre-trained models have an advantage in decreasing grammaticality errors, but not necessarily factual errors. We also discuss the importance of ensuring that summary length and abstractiveness do not interfere with evaluating summary quality.

pdf bib abs

The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis
Nelleke Oostdijk | Hans van Halteren | Erkan Bașar | Martha Larson
Proceedings of the Twelfth Language Resources and Evaluation Conference

We report on a case study of text and images that reveals the inadequacy of simplistic assumptions about their connection and interplay. The context of our work is a larger effort to create automatic systems that can extract event information from online news articles about flooding disasters. We carry out a manual analysis of 1000 articles containing a keyword related to flooding. The analysis reveals that the articles in our data set cluster into seven categories related to different topical aspects of flooding, and that the images accompanying the articles cluster into five categories related to the content they depict. The results demonstrate that flood-related news articles do not consistently report on a single, currently unfolding flooding event and we should also not assume that a flood-related image will directly relate to a flooding-event described in the corresponding article. In particular, spatiotemporal distance is important. We validate the manual analysis with an automatic classifier demonstrating the technical feasibility of multimedia analysis approaches that admit more realistic relationships between text and images. In sum, our case study confirms that closer attention to the connection between text and images has the potential to improve the collection of multimodal information from news articles.

2012

pdf bib abs

Creating a Data Collection for Evaluating Rich Speech Retrieval
Maria Eskevich | Gareth J.F. Jones | Martha Larson | Roeland Ordelman
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the development of a test collection for the investigation of speech retrieval beyond identification of relevant content. This collection focuses on satisfying user information needs for queries associated with specific types of speech acts. The collection is based on an archive of the Internet video from Internet video sharing platform (blip.tv), and was provided by the MediaEval benchmarking initiative. A crowdsourcing approach was used to identify segments in the video data which contain speech acts, to create a description of the video containing the act and to generate search queries designed to refind this speech act. We describe and reflect on our experiences with crowdsourcing this test collection using the Amazon Mechanical Turk platform. We highlight the challenges of constructing this dataset, including the selection of the data source, design of the crowdsouring task and the specification of queries and relevant items.