David Strohmaier

2025

Knowing which words language learners struggle with is crucial for developing personalised education technologies. In this paper, we advocate for the novel task of “dictionary look-up prediction” as a means for evaluating the complexity of words in reading tasks. We release the Dictionary Look-Up development dataset (DLU-dev) and the Dialogue Dictionary Look-Up dataset (D-DLU), which is based on chatbot dialogues. We demonstrate that dictionary look-up is a challenging task for LLMs (results are presented for LLaMA, Gemma, and Longformer models). We explore finetuning with the ROC* loss function as a more appropriate loss for this task than the commonly used Binary Cross Entropy (BCE). We show that a feature-based model outperforms the LLMs. Finally, we investigate the transfer between DLU and the related tasks of Complex Word Identification (CWI) and Semantic Error Prediction (SEP), establishing new state-of-the-art results for SEP.
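The contrast between BCE and a ranking-oriented loss can be illustrated with a minimal sketch. The abstract does not give the form of the ROC* loss, so the pairwise margin loss below is only a generic stand-in for AUC-style objectives, not the loss used in the paper. The function names, the margin value, and the toy scores are all assumptions made for illustration.

```python
import math

def bce(labels, probs, eps=1e-12):
    """Mean binary cross entropy; scores each prediction independently."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
    return -total / len(labels)

def split_by_label(labels, scores):
    """Separate the scores of positive (looked-up) and negative words."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return pos, neg

def roc_auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos, neg = split_by_label(labels, scores)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pairwise_margin_loss(labels, scores, margin=0.3):
    """Hypothetical AUC surrogate: penalise positive/negative pairs whose
    score gap is smaller than the margin. NOT the ROC* loss itself."""
    pos, neg = split_by_label(labels, scores)
    penalty = sum(max(0.0, margin - (p - n)) ** 2 for p in pos for n in neg)
    return penalty / (len(pos) * len(neg))

# Made-up example: 1 = the learner looked the word up, 0 = they did not.
labels = [1, 0, 0, 1]
scores = [0.6, 0.4, 0.5, 0.9]
print(bce(labels, scores), roc_auc(labels, scores), pairwise_margin_loss(labels, scores))
```

BCE rewards calibrated per-word probabilities, whereas the pairwise loss depends only on how looked-up words are ranked relative to the others, which is closer to what ROC-based evaluation measures.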

2024

Semantic Error Prediction: Estimating Word Production Complexity
David Strohmaier | Paula Buttery
Proceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning

2022

A Category Theory Framework for Sense Systems
David Strohmaier | Gladys Tyen
Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference

Sense repositories are a key component of many NLP applications that require the identification of word senses. Many sense repositories exist: a large proportion is based on lexicographic resources such as WordNet and various dictionaries, but there are others which are the product of clustering algorithms and other automatic techniques. Over the years these repositories have been mapped to each other. However, there have been no attempts (until now) to provide any theoretical grounding for such mappings, causing inconsistencies and unintuitive results. The present paper draws on category theory to formalise assumptions about mapped repositories that are often left implicit, providing formal grounding for this type of language resource. The paper first gives an overview of the word sense disambiguation literature and four types of sense representations: dictionary definitions, clusters of senses, domain labels, and embedding vectors. These different sense representations make different assumptions about the relations and mappings between word senses. We then introduce notation to represent the mappings and repositories as a category, which we call a “sense system”. We represent a sense system as a small category S, where the object set of S, denoted by Ob(S), is a set of sense repositories; and the homomorphism set or hom-set of S, denoted by Hom(S), is a set of mappings between these repositories.
On the basis of the sense system description, we propose, formalise, and motivate four basic and two guiding criteria for such sense systems. The four basic criteria are:
1) Correctness preservation: Mappings should preserve the correctness of sense labels in all contexts. Intuitively, if the correct sense for a word token is mapped to another sense, this sense should also be correct for that token. This criterion is endorsed by virtually all existing mappings, but the formalism presented in the paper makes this assumption explicit and allows us to distinguish it from other criteria.
2) Candidacy preservation: Mappings should preserve what we call “the lexical candidacy” of sense labels. Assume that a sense s is mapped to another sense s’ in a different repository. Candidacy preservation then requires that if s is a sense associated with word type w, then so is s’. This criterion is trivially fulfilled by clustering-based approaches, but is not typically explicitly stated for repositories, and we demonstrate how a violation might occur. Our formalisation allows us to specify how this criterion differs from correctness preservation. As we argue, candidacy preservation allows us to straightforwardly and consistently compare granularity levels by counting the number of senses for each word type.
3) Uniqueness criterion: There should be at most one mapping from one repository to another. This criterion is also fulfilled by clustering-based approaches, but is often violated by repositories that use domain labels. We argue that adhering to the uniqueness criterion provides several benefits, including: a) being able to consistently convert between sets of labels and evaluation metrics, allowing researchers to work with data and models that use different sets of labels; b) ensuring that sense repositories would form a partial preorder, which would roughly correspond to the notion of granularity; and c) ensuring transitivity of mapped senses.
4) Connectivity: A sense system should be a connected category. The connectivity criterion on its own is not very informative, but it enables other criteria by extending their benefits to the rest of the sense system, such as allowing cross-checking between multiple repositories, allowing comparison of grain level, and label conversion. As we argue, connectivity should be considered a formal requirement helping to describe sense repositories and how they relate.
We also offer two guiding criteria, which we consider aspirational rather than requirements that have to be strictly fulfilled for all purposes:
1) Non-contradiction: Mappings cannot exist between senses that semantically contradict each other. The non-contradiction criterion forbids mappings between senses whose (strict) implications contradict each other. We demonstrate how such a contradiction might occur, but acknowledge the difficulty in identifying such contradictions. As we argue, the reason to consider this a guiding rather than a strict criterion is that many sense repositories lack the semantic specificity that would allow researchers to identify these contradictions.
2) Inter-annotator agreement: Mappings should correspond to a partial preorder of inter-annotator agreement levels. It has been observed that, when annotating corpora with senses from a given sense repository, inter-annotator agreement tends to drop when the repository is more fine-grained. Therefore, if one repository is coarser-grained than another, one can expect agreement levels to be higher when annotating corpora with senses from the first repository. While this criterion will necessarily be subject to empirical variability (and does not apply to sense repositories using non-interpretable representations such as embeddings), we argue that strong violations suggest that the sense distinctions of the coarse-grained sense repository are unnatural, i.e. not in accordance with human linguistic intuitions.
Our list is by no means exhaustive, as there are other properties that may be desirable depending on the downstream application. Our category-theory based formalism will serve as the basis for describing any such further properties. However, we also envision that the criteria we have proposed will serve as guidelines for future sense repositories and mappings, in order to avoid the inconsistencies and counterintuitive results derived from existing mappings.
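Three of the basic criteria (candidacy preservation, uniqueness, and connectivity) lend themselves to a small executable illustration. The sketch below is not taken from the paper: the toy repositories, sense labels, and function names are hypothetical, and the uniqueness check only inspects the explicitly listed mappings rather than their compositions.

```python
from collections import defaultdict

# Hypothetical toy data: each repository maps a word type to its candidate senses.
repositories = {
    "fine": {"bank": {"bank.river", "bank.money", "bank.building"}},
    "coarse": {"bank": {"bank.nature", "bank.finance"}},
}

# Each mapping is (source repository, target repository, sense-to-sense dict).
mappings = [
    ("fine", "coarse", {
        "bank.river": "bank.nature",
        "bank.money": "bank.finance",
        "bank.building": "bank.finance",
    }),
]

def satisfies_uniqueness(mappings):
    """At most one listed mapping per ordered pair of repositories."""
    pairs = [(src, tgt) for src, tgt, _ in mappings]
    return len(pairs) == len(set(pairs))

def satisfies_connectivity(repositories, mappings):
    """All repositories are linked when mapping direction is ignored."""
    neighbours = defaultdict(set)
    for src, tgt, _ in mappings:
        neighbours[src].add(tgt)
        neighbours[tgt].add(src)
    names = list(repositories)
    if not names:
        return True
    seen, stack = set(), [names[0]]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(neighbours[node] - seen)
    return seen == set(names)

def preserves_candidacy(repositories, mapping):
    """If s is a sense of word type w, its image s' must be a sense of w too."""
    src, tgt, sense_map = mapping
    for word, senses in repositories[src].items():
        for sense in senses:
            target = sense_map.get(sense)
            if target is not None and target not in repositories[tgt].get(word, set()):
                return False
    return True

print(satisfies_uniqueness(mappings))
print(satisfies_connectivity(repositories, mappings))
print(all(preserves_candidacy(repositories, m) for m in mappings))
```

Correctness preservation is left out because checking it would require sense-annotated tokens in context, which this toy data does not include.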

2021

Cambridge at SemEval-2021 Task 1: An Ensemble of Feature-Based and Neural Models for Lexical Complexity Prediction
Zheng Yuan | Gladys Tyen | David Strohmaier
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes our submission to the SemEval-2021 shared task on Lexical Complexity Prediction. We approached it as a regression problem and present an ensemble combining four systems, one feature-based and three neural with fine-tuning, frequency pre-training and multi-task learning, achieving Pearson scores of 0.8264 and 0.7556 on the trial and test sets respectively (sub-task 1). We further present our analysis of the results and discuss our findings.
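As a rough illustration of ensembling regression systems and scoring them with Pearson correlation, the following minimal sketch may help. The abstract does not say how the four systems were combined, so the unweighted mean used here is an assumption, and the predictions and gold scores are made-up toy values rather than shared-task data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def ensemble_mean(system_predictions):
    """Average the per-item predictions of several systems (assumed scheme)."""
    return [sum(column) / len(column) for column in zip(*system_predictions)]

# Toy complexity scores in [0, 1]; one inner list per hypothetical system.
gold = [0.10, 0.35, 0.60, 0.80]
system_predictions = [
    [0.20, 0.30, 0.55, 0.70],  # e.g. a feature-based system
    [0.05, 0.40, 0.50, 0.90],  # e.g. a neural system
]

combined = ensemble_mean(system_predictions)
print(pearson(combined, gold))
```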

Cambridge at SemEval-2021 Task 2: Neural WiC-Model with Data Augmentation and Exploration of Representation
Zheng Yuan | David Strohmaier
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes the system of the Cambridge team submitted to the SemEval-2021 shared task on Multilingual and Cross-lingual Word-in-Context Disambiguation. Building on top of a pre-trained masked language model, our system is first pre-trained on out-of-domain data, and then fine-tuned on in-domain data. We demonstrate the effectiveness of the proposed two-step training strategy and the benefits of data augmentation from both existing examples and new resources. We further investigate different representations and show that the addition of distance-based features is helpful in the word-in-context disambiguation task. Our system yields highly competitive results in the cross-lingual track without training on any cross-lingual data; and achieves state-of-the-art results in the multilingual track, ranking first in two languages (Arabic and Russian) and second in French out of 171 submitted systems.

2020

SeCoDa: Sense Complexity Dataset
David Strohmaier | Sian Gooding | Shiva Taslimipoor | Ekaterina Kochmar
Proceedings of the Twelfth Language Resources and Evaluation Conference

The Sense Complexity Dataset (SeCoDa) provides a corpus that is annotated jointly for complexity and word senses. It thus provides a valuable resource for both word sense disambiguation and the task of complex word identification. The intention is that this dataset will be used to identify complexity at the level of word senses rather than word tokens. For word sense annotation SeCoDa uses a hierarchical scheme that is based on information available in the Cambridge Advanced Learner’s Dictionary. This way we can offer more coarse-grained senses than directly available in WordNet.

Co-authors

Ekaterina Kochmar 1

Diane Nicholls 1

Shiva Taslimipoor 1

Venues

nlp4call 1
