While self-correction has shown promise in improving LLM outputs in terms of style and quality (e.g. Chen et al., 2023b; Madaan et al.,2023), recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023). In this paper, we show that poor self-correction performance stems from LLMs’ inability tofind logical mistakes, rather than their ability to correct a known mistake. Firstly, we benchmark several state-of-the-art LLMs ontheir mistake-finding ability and demonstrate that they generally struggle with the task, even in highly objective, unambiguous cases. Secondly, we test the correction abilities of LLMs – separately from mistake finding – using a backtracking setup that feeds ground truth mistake location information to the model. We show that this boosts downstream task performance across our 5 reasoning tasks, indicating that LLMs’ correction abilities are robust. Finally, we show that it is possible to obtain mistake location information without ground truth labels or in-domain training data. We train a small classifier with out-of-domain data, which exhibits stronger mistake-finding performance than prompting a large model. We release our dataset of LLM-generated logical mistakes, BIG-Bench Mistake, to enable further research into locating LLM reasoning mistakes.
State-of-the-art chatbots for English are now able to hold conversations on virtually any topic (e.g. Adiwardana et al., 2020; Roller et al., 2021). However, existing dialogue systems in the language learning domain still use hand-crafted rules and pattern matching, and are much more limited in scope. In this paper, we make an initial foray into adapting open-domain dialogue generation for second language learning. We propose and implement decoding strategies that can adjust the difficulty level of the chatbot according to the learner’s needs, without requiring further training of the chatbot. These strategies are then evaluated using judgements from human examiners trained in language education. Our results show that re-ranking candidate outputs is a particularly effective strategy, and performance can be further improved by adding sub-token penalties and filtering.
Sense repositories are a key component of many NLP applications that require the identification of word senses. Many sense repositories exist: a large proportion is based on lexicographic resources such as WordNet and various dictionaries, but there are others which are the product of clustering algorithms and other automatic techniques. Over the years these repositories have been mapped to each other. However, there have been no attempts (until now) to provide any theoretical grounding for such mappings, causing inconsistencies and unintuitive results. The present paper draws on category theory to formalise assumptions about mapped repositories that are often left implicit, providing formal grounding for this type of language resource. The paper first gives an overview of the word sense disambiguation literature and four types of sense representations: dictionary definitions, clusters of senses, domain labels, and embedding vectors. These different sense representations make different assumptions about the relations and mappings between word senses. We then introduce notation to represent the mappings and repositories as a category, which we call a “sense system”. We represent a sense system as a small category S, where the object set of S, denoted by Ob(S), is a set of sense repositories; and the homomorphism set or hom-set of S, denoted by Hom(S), is a set of mappings between these repositories. On the basis of the sense system description, we propose, formalise, and motivate four basic and two guiding criteria for such sense systems. The four basic criteria are: 1) Correctness preservation: Mappings should preserve the correctness of sense labels in all contexts. Intuitively, if the correct sense for a word token is mapped to another sense, this sense should also be correct for that token. This criterion is endorsed by virtually all existing mappings, but the formalism presented in the paper makes this assumption explicit and allows us to distinguish it from other criteria. 2) Candidacy preservation: Mappings should preserve what we call “the lexical candidacy” of sense labels. Assume that a sense s is mapped to another sense s’ in a different repository. Candidacy preservation then requires that if s is a sense associated with word type w, then so is s’. This criterion is trivially fulfilled by clustering-based approaches, but is not typically explicitly stated for repositories, and we demonstrate how a violation might occur. Our formalisation allows us to specify the difference of this criterion to correctness preservation. As we argue, candidacy preservation allows us to straightforwardly and consistently compare granularity levels by counting the number of senses for each word type. 3) Uniqueness criterion: There should be at most one mapping from one repository to another. This criterion is also fulfilled by clustering-based approaches, but is often violated by repositories that use domain labels. We argue that adhering to the uniqueness criterion provides several benefits, including: a) being able to consistently convert between sets of labels and evaluation metrics, allowing researchers to work with data and models that use different sets of labels; b) ensuring that sense repositories would form a partial preorder, which would roughly correspond to the notion of granularity; and c) ensuring transitivity of mapped senses. 4) Connectivity: A sense system should be a connected category. The connectivity criterion on its own is not very informative, but it enables other criteria by extending their benefits to the rest of the sense system, such as allowing cross-checking between multiple repositories, allowing comparison of grain level, and label conversion. As we argue, connectivity should be considered a formal requirement helping to describe sense repositories and how they relate. We also offer two guiding criteria, which we consider aspirational rather than requirements that have to be strictly fulfilled for all purposes: 1) Non-contradiction: Mappings cannot exist between senses that semantically contradict each other. The non-contradiction criterion forbids mappings between senses whose (strict) implications contradict each other. We demonstrate how such a contradiction might occur, but acknowledge the difficulty in identifying such contradictions. As we argue, the reason to consider this a guiding rather than a strict criterion is that many sense repositories lack the semantic specificity that would allow researchers to identify these contradictions. 2) Inter-annotator agreement: Mappings should correspond to a partial preorder of inter-annotator agreement levels. It has been observed that, when annotating corpora with senses from a given sense repository, inter-annotator agreement tends to drop when the repository is more fine-grained. Therefore, if one repository is coarser-grained than another, one can expect agreement levels to be higher when annotating corpora with senses from the first repository. While this criterion will necessarily be subject to empirical variability (and does apply to sense repositories using non-interpretable representations such as embeddings), we argue that strong violations suggest that the sense distinctions of the coarse-grained sense repository are unnatural, i.e. not in accordance with human linguistic intuitions. Our list is by no means exhaustive, as there are other properties that may be desirable depending on the downstream application. Our category-theory based formalism will serve as the basis for describing any such further properties. However, we also envision that the criteria we have proposed will serve as guidelines for future sense repositories and mappings, in order to avoid the inconsistencies and counterintuitive results derived from existing mappings.
This paper describes our submission to the SemEval-2021 shared task on Lexical Complexity Prediction. We approached it as a regression problem and present an ensemble combining four systems, one feature-based and three neural with fine-tuning, frequency pre-training and multi-task learning, achieving Pearson scores of 0.8264 and 0.7556 on the trial and test sets respectively (sub-task 1). We further present our analysis of the results and discuss our findings.