Anna Wegmann

2025

Tokenization is Sensitive to Language Variation
Anna Wegmann | Dong Nguyen | David Jurgens
Findings of the Association for Computational Linguistics: ACL 2025

Variation in language is ubiquitous and often systematically linked to regional, social, and contextual factors. Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks: Tasks where the model should be robust to language variation (e.g., for semantic tasks like NLI, labels do not depend on whether a text uses British or American spelling) and tasks where the model should be sensitive to language variation (e.g., for form-based tasks like authorship verification, labels depend on whether a text uses British or American spelling). We pre-train BERT base models with the popular Byte-Pair Encoding algorithm to investigate how key tokenization design choices impact the performance of downstream models: the corpus used to train the tokenizer, the pre-tokenizer and the vocabulary size. We find that the best tokenizer varies on the two task types and that the pre-tokenizer has the biggest overall impact on performance. Further, we introduce a new approach to estimate tokenizer impact on downstream LLM performance, showing substantial improvement over metrics like Rényi efficiency. We encourage more work on language variation and its relation to tokenizers and thus LLM performance.

2024

pdf bib abs

What’s Mine becomes Yours: Defining, Annotating and Detecting Context-Dependent Paraphrases in News Interview Dialogs
Anna Wegmann | Tijs A. Van Den Broek | Dong Nguyen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Best practices for high conflict conversations like counseling or customer support almost always include recommendations to paraphrase the previous speaker. Although paraphrase classification has received widespread attention in NLP, paraphrases are usually considered independent from context, and common models and datasets are not applicable to dialog settings. In this work, we investigate paraphrases across turns in dialog (e.g., Speaker 1: “That book is mine.” becomes Speaker 2: “That book is yours.”). We provide an operationalization of context-dependent paraphrases, and develop a training for crowd-workers to classify paraphrases in dialog. We introduce ContextDeP, a dataset with utterance pairs from NPR and CNN news interviews annotated for context-dependent paraphrases. To enable analyses on label variation, the dataset contains 5,581 annotations on 600 utterance pairs. We present promising results with in-context learning and with token classification models for automatic paraphrase detection in dialog.

pdf bib abs

Detecting Perspective-Getting in Wikipedia Discussions
Evgeny Vasilets | Tijs Van den Broek | Anna Wegmann | David Abadi | Dong Nguyen
Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024)

Perspective-getting (i.e., the effort to obtain information about the other person’s perspective) can lead to more accurate interpersonal understanding. In this paper, we develop an approach to measure perspective-getting and apply it to English Wikipedia discussions. First, we develop a codebook based on perspective-getting theory to operationalize perspective-getting into two categories: asking questions about and attending the other’s perspective. Second, we use the codebook to annotate perspective-getting in Wikipedia discussion pages. Third, we fine-tune a RoBERTa model that achieves an average F-1 score of 0.76 on the two perspective-getting categories. Last, we test whether perspective-getting is associated with discussion outcomes. Perspective-getting was not higher in non-escalated discussions. However, discussions starting with a post attending the other’s perspective are followed by responses that are more likely to also attend the other’s perspective. Future research may use our model to study the influence of perspective-getting on the dynamics and outcomes of online discussions.

2022

pdf bib abs

Same Author or Just Same Topic? Towards Content-Independent Style Representations
Anna Wegmann | Marijn Schraagen | Dong Nguyen
Proceedings of the 7th Workshop on Representation Learning for NLP

Linguistic style is an integral component of language. Recent advances in the development of style representations have increasingly used training objectives from authorship verification (AV)”:” Do two texts have the same author? The assumption underlying the AV training task (same author approximates same writing style) enables self-supervised and, thus, extensive training. However, a good performance on the AV task does not ensure good “general-purpose” style representations. For example, as the same author might typically write about certain topics, representations trained on AV might also encode content information instead of style alone. We introduce a variation of the AV training task that controls for content using conversation or domain labels. We evaluate whether known style dimensions are represented and preferred over content information through an original variation to the recently proposed STEL framework. We find that representations trained by controlling for conversation are better than representations trained with domain or no content control at representing style independent from content.

2021

pdf bib abs

Does It Capture STEL? A Modular, Similarity-based Linguistic Style Evaluation Framework
Anna Wegmann | Dong Nguyen
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Style is an integral part of natural language. However, evaluation methods for style measures are rare, often task-specific and usually do not control for content. We propose the modular, fine-grained and content-controlled similarity-based STyle EvaLuation framework (STEL) to test the performance of any model that can compare two sentences on style. We illustrate STEL with two general dimensions of style (formal/informal and simple/complex) as well as two specific characteristics of style (contrac’tion and numb3r substitution). We find that BERT-based methods outperform simple versions of commonly used style measures like 3-grams, punctuation frequency and LIWC-based approaches. We invite the addition of further tasks and task instances to STEL and hope to facilitate the improvement of style-sensitive measures.

Co-authors

Tijs Van den Broek 1

Evgeny Vasilets 1

Venues

Fix author