Gladys Tyen

2025

Current benchmarks for large language model (LLM) reasoning predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench, and its harder version BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various general-purpose and reasoning-specialized models on BBEH and observe an accuracy of 23.9% for the best general-purpose model and 54.2% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.

Knowing which words language learners struggle with is crucial for developing personalised education technologies. In this paper, we advocate for the novel task of “dictionary look-up prediction” as a means for evaluating the complexity of words in reading tasks. We release the Dictionary Look-Up development dataset (DLU-dev) and the Dialogue Dictionary Look-Up dataset (D-DLU), which is based on chatbot dialogues. We demonstrate that dictionary look-up is a challenging task for LLMs (results are presented for LLaMA, Gemma, and Longformer models). We explore finetuning with the ROC* loss function as a more appropriate loss for this task than the commonly used Binary Cross Entropy (BCE). We show that a feature-based model outperforms the LLMs. Finally, we investigate the transfer between DLU and the related tasks of Complex Word Identification (CWI) and Semantic Error Prediction (SEP), establishing new state-of-the-art results for SEP.

Although preference optimization methods have improved reasoning performance in Large Language Models (LLMs), they often lack transparency regarding why one reasoning outcome is preferred over another. This limitation is especially critical in Automated Student Answer Scoring (ASAS), where explainability is essential to justify assessment outcomes. Verbal reinforcement learning offers the potential to generate explicit reflection, but it tends to produce superficial critiques that can harm assessment performance. Existing LLMs also struggle to reliably detect subtle reasoning errors in ASAS tasks. Moreover, manually identifying intermediate reasoning errors is expensive and difficult to scale. To address these challenges, we introduce a **contrastive reflection synthesis pipeline** that generates precise verbal feedback by identifying discrepancies in structured reasoning graph paths. Leveraging these synthetic reflection data, we propose *DARS*, a Dual-model Reflective Scoring framework featuring a dedicated Critic model trained for effective reflection. *DARS* achieves strong performance and consistently outperforms existing ASAS baselines across all evaluation metrics. Extensive experiments further provide novel insights into the value of reflection data, framework design, and the scaling behavior of *DARS*. We release the DARS code at https://github.com/lijiazheng99/DARS.

2024

LLMs cannot find reasoning errors, but can correct them given the error location
Gladys Tyen | Hassan Mansoor | Victor Carbune | Peter Chen | Tony Mak
Findings of the Association for Computational Linguistics: ACL 2024

While self-correction has shown promise in improving LLM outputs in terms of style and quality (e.g. Chen et al., 2023b; Madaan et al., 2023), recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023). In this paper, we show that poor self-correction performance stems from LLMs’ inability to find logical mistakes, rather than their ability to correct a known mistake. Firstly, we benchmark several state-of-the-art LLMs on their mistake-finding ability and demonstrate that they generally struggle with the task, even in highly objective, unambiguous cases. Secondly, we test the correction abilities of LLMs – separately from mistake finding – using a backtracking setup that feeds ground truth mistake location information to the model. We show that this boosts downstream task performance across our 5 reasoning tasks, indicating that LLMs’ correction abilities are robust. Finally, we show that it is possible to obtain mistake location information without ground truth labels or in-domain training data. We train a small classifier with out-of-domain data, which exhibits stronger mistake-finding performance than prompting a large model. We release our dataset of LLM-generated logical mistakes, BIG-Bench Mistake, to enable further research into locating LLM reasoning mistakes.

LLM chatbots as a language practice tool: a user study
Gladys Tyen | Andrew Caines | Paula Buttery
Proceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning

2022

Towards an open-domain chatbot for language practice
Gladys Tyen | Mark Brenchley | Andrew Caines | Paula Buttery
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

State-of-the-art chatbots for English are now able to hold conversations on virtually any topic (e.g. Adiwardana et al., 2020; Roller et al., 2021). However, existing dialogue systems in the language learning domain still use hand-crafted rules and pattern matching, and are much more limited in scope. In this paper, we make an initial foray into adapting open-domain dialogue generation for second language learning. We propose and implement decoding strategies that can adjust the difficulty level of the chatbot according to the learner’s needs, without requiring further training of the chatbot. These strategies are then evaluated using judgements from human examiners trained in language education. Our results show that re-ranking candidate outputs is a particularly effective strategy, and performance can be further improved by adding sub-token penalties and filtering.

A Category Theory Framework for Sense Systems
David Strohmaier | Gladys Tyen
Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference

Sense repositories are a key component of many NLP applications that require the identification of word senses. Many sense repositories exist: a large proportion is based on lexicographic resources such as WordNet and various dictionaries, but there are others which are the product of clustering algorithms and other automatic techniques. Over the years these repositories have been mapped to each other. However, there have been no attempts (until now) to provide any theoretical grounding for such mappings, causing inconsistencies and unintuitive results. The present paper draws on category theory to formalise assumptions about mapped repositories that are often left implicit, providing formal grounding for this type of language resource. The paper first gives an overview of the word sense disambiguation literature and four types of sense representations: dictionary definitions, clusters of senses, domain labels, and embedding vectors. These different sense representations make different assumptions about the relations and mappings between word senses. We then introduce notation to represent the mappings and repositories as a category, which we call a “sense system”. We represent a sense system as a small category S, where the object set of S, denoted by Ob(S), is a set of sense repositories; and the homomorphism set or hom-set of S, denoted by Hom(S), is a set of mappings between these repositories. On the basis of the sense system description, we propose, formalise, and motivate four basic and two guiding criteria for such sense systems. The four basic criteria are: 1) Correctness preservation: Mappings should preserve the correctness of sense labels in all contexts. Intuitively, if the correct sense for a word token is mapped to another sense, this sense should also be correct for that token. 
This criterion is endorsed by virtually all existing mappings, but the formalism presented in the paper makes this assumption explicit and allows us to distinguish it from other criteria. 2) Candidacy preservation: Mappings should preserve what we call “the lexical candidacy” of sense labels. Assume that a sense s is mapped to another sense s’ in a different repository. Candidacy preservation then requires that if s is a sense associated with word type w, then so is s’. This criterion is trivially fulfilled by clustering-based approaches, but is not typically explicitly stated for repositories, and we demonstrate how a violation might occur. Our formalisation allows us to specify the difference of this criterion to correctness preservation. As we argue, candidacy preservation allows us to straightforwardly and consistently compare granularity levels by counting the number of senses for each word type. 3) Uniqueness criterion: There should be at most one mapping from one repository to another. This criterion is also fulfilled by clustering-based approaches, but is often violated by repositories that use domain labels. We argue that adhering to the uniqueness criterion provides several benefits, including: a) being able to consistently convert between sets of labels and evaluation metrics, allowing researchers to work with data and models that use different sets of labels; b) ensuring that sense repositories would form a partial preorder, which would roughly correspond to the notion of granularity; and c) ensuring transitivity of mapped senses. 4) Connectivity: A sense system should be a connected category. The connectivity criterion on its own is not very informative, but it enables other criteria by extending their benefits to the rest of the sense system, such as allowing cross-checking between multiple repositories, allowing comparison of grain level, and label conversion. 
As we argue, connectivity should be considered a formal requirement helping to describe sense repositories and how they relate. We also offer two guiding criteria, which we consider aspirational rather than requirements that have to be strictly fulfilled for all purposes: 1) Non-contradiction: Mappings cannot exist between senses that semantically contradict each other. The non-contradiction criterion forbids mappings between senses whose (strict) implications contradict each other. We demonstrate how such a contradiction might occur, but acknowledge the difficulty in identifying such contradictions. As we argue, the reason to consider this a guiding rather than a strict criterion is that many sense repositories lack the semantic specificity that would allow researchers to identify these contradictions. 2) Inter-annotator agreement: Mappings should correspond to a partial preorder of inter-annotator agreement levels. It has been observed that, when annotating corpora with senses from a given sense repository, inter-annotator agreement tends to drop when the repository is more fine-grained. Therefore, if one repository is coarser-grained than another, one can expect agreement levels to be higher when annotating corpora with senses from the first repository. While this criterion will necessarily be subject to empirical variability (and does apply to sense repositories using non-interpretable representations such as embeddings), we argue that strong violations suggest that the sense distinctions of the coarse-grained sense repository are unnatural, i.e. not in accordance with human linguistic intuitions. Our list is by no means exhaustive, as there are other properties that may be desirable depending on the downstream application. Our category-theory based formalism will serve as the basis for describing any such further properties. 
However, we also envision that the criteria we have proposed will serve as guidelines for future sense repositories and mappings, in order to avoid the inconsistencies and counterintuitive results derived from existing mappings.
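
Two of the basic criteria in the abstract above, uniqueness and connectivity, can be illustrated with a minimal sketch. This is not the authors' implementation: the triple encoding of mappings, the function names, and the repository names are all hypothetical, chosen only to make the two checks concrete.

```python
from collections import defaultdict


def violates_uniqueness(mappings):
    """Return the (source, target) repository pairs linked by more than one mapping.

    `mappings` is a list of (source_repo, target_repo, mapping_name) triples;
    this encoding is an assumption made for illustration.
    """
    counts = defaultdict(int)
    for source, target, _name in mappings:
        counts[(source, target)] += 1
    return sorted(pair for pair, n in counts.items() if n > 1)


def is_connected(repositories, mappings):
    """Check that every repository is reachable from every other one
    when the direction of mappings is ignored."""
    repositories = set(repositories)
    if not repositories:
        return False  # convention chosen here: an empty system is not connected
    neighbours = defaultdict(set)
    for source, target, _name in mappings:
        neighbours[source].add(target)
        neighbours[target].add(source)
    seen, stack = set(), [next(iter(repositories))]
    while stack:
        repo = stack.pop()
        if repo in seen:
            continue
        seen.add(repo)
        stack.extend(neighbours[repo] - seen)
    return repositories <= seen


# Hypothetical sense system with three repositories.
repositories = ["fine", "coarse", "domains"]
mappings = [
    ("fine", "coarse", "m1"),
    ("fine", "coarse", "m2"),  # second mapping between the same pair
    ("coarse", "domains", "m3"),
]

print(violates_uniqueness(mappings))
print(is_connected(repositories, mappings))
```

In this toy system the pair ("fine", "coarse") breaks the uniqueness criterion because two distinct mappings link the same repositories, while the system as a whole is still connected.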

2021

Cambridge at SemEval-2021 Task 1: An Ensemble of Feature-Based and Neural Models for Lexical Complexity Prediction
Zheng Yuan | Gladys Tyen | David Strohmaier
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes our submission to the SemEval-2021 shared task on Lexical Complexity Prediction. We approached it as a regression problem and present an ensemble combining four systems, one feature-based and three neural with fine-tuning, frequency pre-training and multi-task learning, achieving Pearson scores of 0.8264 and 0.7556 on the trial and test sets respectively (sub-task 1). We further present our analysis of the results and discuss our findings.
