This paper presents multi-reference benchmarks for the Grammatical Error Correction (GEC) of Russian, based on two existing single-reference datasets, for a total of 7,444 learner sentences from a variety of first language backgrounds. Each sentence is corrected independently by two new raters, and their corrections are reviewed by a senior annotator, resulting in a total of three references per sentence. Analysis of the annotations reveals that the new raters tend to make more changes, compared to the original raters, especially at the lexical level. We conduct experiments with two popular GEC approaches and show competitive performance on the original datasets and the new benchmarks. We also compare system scores as evaluated against individual annotators and discuss the effect of using multiple references overall and on specific error types. We find that using the union of the references increases system scores by more than 10 points and decreases the gap between system and human performance, thereby providing a more realistic evaluation of GEC system performance, although the effect is not the same across the error types. The annotations are available for research.
We introduce a pilot annotation of Russian learner data with syntactic dependency relations. The annotation is performed on a subset of sentences from RULEC-GEC and RU-Lang8, two error-corrected Russian learner datasets. We provide manually labeled Universal Dependency (UD) trees for 500 sentence pairs, annotating both the original (source) and the corrected (target) version of each sentence. Further, we outline guidelines for annotating learner Russian data containing non-standard erroneous text and analyze the effect that the individual errors have on the resulting dependency trees. This study should contribute to a wide range of computational and theoretical research directions in second language learning and grammatical error correction.
We propose a novel approach to automatically generate distractors for cloze exercises for English language learners, using round-trip neural machine translation. A carrier sentence is translated from English into another (pivot) language and back, and distractors are produced by aligning the original sentence with its round-trip translation. We make use of 16 linguistically-diverse pivots and generate hundreds of translation hypotheses in each direction. We show that using hundreds of translations allows us to generate a rich set of challenging distractors. Moreover, we find that typologically unrelated language pivots contribute more diverse candidate distractors, compared to language pivots that are closely related. We further evaluate the use of machine translation systems of varying quality and find that better quality MT systems produce more challenging distractors. Finally, we conduct a study with language learners, demonstrating that the automatically generated distractors are of the same difficulty as the gold distractors produced by human experts.
We present our system that participated in the shared task on the grammatical error correction of Ukrainian. We have implemented two approaches that make use of large pre-trained language models and synthetic data, that have been used for error correction of English as well as low-resource languages. The first approach is based on fine-tuning a large multilingual language model (mT5) in two stages: first, on synthetic data, and then on gold data. The second approach trains a (smaller) seq2seq Transformer model pre-trained on synthetic data and fine-tuned on gold data. Our mT5-based model scored first in “GEC only” track, and a very close second in the “GEC+Fluency” track. Our two key innovations are (1) finetuning in stages, first on synthetic, and then on gold data; and (2) a high-quality corruption method based on roundtrip machine translation to complement existing noisification approaches.
In a fill-in-the-blank exercise, a student is presented with a carrier sentence with one word hidden, and a multiple-choice list that includes the correct answer and several inappropriate options, called distractors. We propose to automatically generate distractors using round-trip neural machine translation: the carrier sentence is translated from English into another (pivot) language and back, and distractors are produced by aligning the original sentence and its round-trip translation. We show that using hundreds of translations for a given sentence allows us to generate a rich set of challenging distractors. Further, using multiple pivot languages produces a diverse set of candidates. The distractors are evaluated against a real corpus of cloze exercises and checked manually for validity. We demonstrate that the proposed method significantly outperforms two strong baselines.
Grammatical Error Correction systems are typically evaluated overall, without taking into consideration performance on individual error types because system output is not annotated with respect to error type. We introduce a tool that automatically classifies errors in Russian learner texts. The tool takes an edit pair consisting of the original token(s) and the corresponding replacement and provides a grammatical error category. Manual evaluation of the output reveals that in more than 93% of cases the error categories are judged as correct or acceptable. We apply the tool to carry out a fine-grained evaluation on the performance of two error correction systems for Russian.
We develop a minimally-supervised model for spelling correction and evaluate its performance on three datasets annotated for spelling errors in Russian. The first corpus is a dataset of Russian social media data that was recently used in a shared task on Russian spelling correction. The other two corpora contain texts produced by learners of Russian as a foreign language. Evaluating on three diverse datasets allows for a cross-corpus comparison. We compare the performance of the minimally-supervised model to two baseline models that do not use context for candidate re-ranking, as well as to a character-level statistical machine translation system with context-based re-ranking. We show that the minimally-supervised model outperforms all of the other models. We also present an analysis of the spelling errors and discuss the difficulty of the task compared to the spelling correction problem in English.
Standard evaluations of Grammatical Error Correction (GEC) systems make use of a fixed reference text generated relative to the original text; they show, even when using multiple references, that we have a long way to go. This analysis paper studies the performance of GEC systems relative to closest-gold – a gold reference text created relative to the output of a system. Surprisingly, we show that the real performance is 20-40 points better than standard evaluations show. Moreover, the performance remains high even when considering any of the top-10 hypotheses produced by a system. Importantly, the type of mistakes corrected by lower-ranked hypotheses differs in interesting ways from the top one, providing an opportunity to focus on a range of errors – local spelling and grammar edits vs. more complex lexical improvements. Our study shows these results in English and Russian, and thus provides a preliminary proposal for a more realistic evaluation of GEC systems.
Grammatical Error Correction (GEC) is concerned with correcting grammatical errors in written text. Current GEC systems, namely those leveraging statistical and neural machine translation, require large quantities of annotated training data, which can be expensive or impractical to obtain. This research compares techniques for generating synthetic data utilized by the two highest scoring submissions to the restricted and low-resource tracks in the BEA-2019 Shared Task on Grammatical Error Correction.
Until now, most of the research in grammar error correction focused on English, and the problem has hardly been explored for other languages. We address the task of correcting writing mistakes in morphologically rich languages, with a focus on Russian. We present a corrected and error-tagged corpus of Russian learner writing and develop models that make use of existing state-of-the-art methods that have been well studied for English. Although impressive results have recently been achieved for grammar error correction of non-native English writing, these results are limited to domains where plentiful training data are available. Because annotation is extremely costly, these approaches are not suitable for the majority of domains and languages. We thus focus on methods that use “minimal supervision”; that is, those that do not rely on large amounts of annotated training data, and show how existing minimal-supervision approaches extend to a highly inflectional language such as Russian. The results demonstrate that these methods are particularly useful for correcting mistakes in grammatical phenomena that involve rich morphology.
Spelling correction has attracted a lot of attention in the NLP community. However, models have been usually evaluated on artificiallycreated or proprietary corpora. A publiclyavailable corpus of authentic misspellings, annotated in context, is still lacking. To address this, we present and release an annotated data set of 6,121 spelling errors in context, based on a corpus of essays written by English language learners. We also develop a minimallysupervised context-aware approach to spelling correction. It achieves strong results on our data: 88.12% accuracy. This approach can also train with a minimal amount of annotated data (performance reduced by less than 1%). Furthermore, this approach allows easy portability to new domains. We evaluate our model on data from a medical domain and demonstrate that it rivals the performance of a model trained and tuned on in-domain data.
Overcrowding in emergency rooms is a major challenge faced by hospitals across the United States. Overcrowding can result in longer wait times, which, in turn, has been shown to adversely affect patient satisfaction, clinical outcomes, and procedure reimbursements. This paper presents research that aims to automatically predict discharge disposition of patients who received medical treatment in an emergency department. We make use of a corpus that consists of notes containing patient complaints, diagnosis information, and disposition, entered by health care providers. We use this corpus to develop a model that uses the complaint and diagnosis information to predict patient disposition. We show that the proposed model substantially outperforms the baseline of predicting the most common disposition type. The long-term goal of this research is to build a model that can be implemented as a real-time service in an application to predict disposition as patients arrive.
This article considers the problem of correcting errors made by English as a Second Language writers from a machine learning perspective, and addresses an important issue of developing an appropriate training paradigm for the task, one that accounts for error patterns of non-native writers using minimal supervision. Existing training approaches present a trade-off between large amounts of cheap data offered by the native-trained models and additional knowledge of learner error patterns provided by the more expensive method of training on annotated learner data. We propose a novel training approach that draws on the strengths offered by the two standard training paradigms—of training either on native or on annotated learner data—and that outperforms both of these standard methods. Using the key observation that parameters relating to error regularities exhibited by non-native writers are relatively simple, we develop models that can incorporate knowledge about error regularities based on a small annotated sample but that are otherwise trained on native English data. The key contribution of this article is the introduction and analysis of two methods for adapting the learned models to error patterns of non-native writers; one method that applies to generative classifiers and a second that applies to discriminative classifiers. Both methods demonstrated state-of-the-art performance in several text correction competitions. In particular, the Illinois system that implements these methods ranked at the top in two recent CoNLL shared tasks on error correction.1 We conduct further evaluation of the proposed approaches studying the effect of using error data from speakers of the same native language, languages that are closely related linguistically, and unrelated languages.
We present annotation guidelines and a web-based annotation framework developed as part of an effort to create a manually annotated Arabic corpus of errors and corrections for various text types. Such a corpus will be invaluable for developing Arabic error correction tools, both for training models and as a gold standard for evaluating error correction algorithms. We summarize the guidelines we created. We also describe issues encountered during the training of the annotators, as well as problems that are specific to the Arabic language that arose during the annotation process. Finally, we present the annotation tool that was developed as part of this project, the annotation pipeline, and the quality of the resulting annotations.
This paper identifies and examines the key principles underlying building a state-of-the-art grammatical error correction system. We do this by analyzing the Illinois system that placed first among seventeen teams in the recent CoNLL-2013 shared task on grammatical error correction. The system focuses on five different types of errors common among non-native English writers. We describe four design principles that are relevant for correcting all of these errors, analyze the system along these dimensions, and show how each of these dimensions contributes to the performance.
Processing of Colloquial Arabic is a relatively new area of research, and a number of interesting challenges pertaining to spoken Arabic dialects arise. On the one hand, a whole continuum of Arabic dialects exists, with linguistic differences on phonological, morphological, syntactic, and lexical levels. On the other hand, there are inter-dialectal similarities that need be explored. Furthermore, due to scarcity of dialect-specific linguistic resources and availability of a wide range of resources for Modern Standard Arabic (MSA), it is desirable to explore the possibility of exploiting MSA tools when working on dialects. This paper describes challenges in processing of Colloquial Arabic in the context of language modeling for Automatic Speech Recognition. Using data from Egyptian Colloquial Arabic and MSA, we investigate the question of improving language modeling of Egyptian Arabic with MSA data and resources. As part of the project, we address the problem of linguistic variation between Egyptian Arabic and MSA. To account for differences between MSA and Colloquial Arabic, we experiment with the following techniques of data transformation: morphological simplification (stemming), lexical transductions, and syntactic transformations. While the best performing model remains the one built using only dialectal data, these techniques allow us to obtain an improvement over the baseline MSA model. More specifically, while the effect on perplexity of syntactic transformations is not very significant, stemming of the training and testing data improves the baseline perplexity of the MSA model trained on words by 51%, and lexical transductions yield an 82% perplexity reduction. Although the focus of the present work is on language modeling, we believe the findings of the study will be useful for researchers involved in other areas of processing Arabic dialects, such as parsing and machine translation.