Mariano Felice

2025

pdf bib abs
Transformer Architectures for Vocabulary Test Item Difficulty Prediction
Lucy Skidmore | Mariano Felice | Karen Dunn
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Establishing the difficulty of test items is an essential part of the language assessment development process. However, traditional item calibration methods are often time-consuming and difficult to scale. To address this, recent research has explored natural language processing (NLP) approaches for automatically predicting item difficulty from text. This paper investigates the use of transformer models to predict the difficulty of second language (L2) English vocabulary test items that have multilingual prompts. We introduce an extended version of the British Council’s Knowledge-based Vocabulary Lists (KVL) dataset, containing 6,768 English words paired with difficulty scores and question prompts written in Spanish, German, and Mandarin Chinese. Using this new dataset for fine-tuning, we explore various transformer-based architectures. Our findings show that a multilingual model jointly trained on all L1 subsets of the KVL achieve the best results, with analysis suggesting that the model is able to learn global patterns of cross-linguistic influence on target word difficulty. This study establishes a foundation for NLP-based item difficulty estimation using the KVL dataset, providing actionable insights for developing multilingual test items.

2024

pdf bib abs
The British Council submission to the BEA 2024 shared task
Mariano Felice | Zeynep Duran Karaoz
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

This paper describes our submission to the item difficulty prediction track of the BEA 2024 shared task. Our submission included the output of three systems: 1) a feature-based linear regression model, 2) a RoBERTa-based model and 3) a linear regression ensemble built on the predictions of the two previous models. Our systems ranked 7th, 8th and 5th respectively, demonstrating that simple models can achieve optimal results. A closer look at the results shows that predictions are more accurate for items in the middle of the difficulty range, with no other obvious relationships between difficulty and the accuracy of predictions.

pdf bib abs
Distractor Generation Using Generative and Discriminative Capabilities of Transformer-based Models
Shiva Taslimipoor | Luca Benedetto | Mariano Felice | Paula Buttery
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Multiple Choice Questions (MCQs) are very common in both high-stakes and low-stakes examinations, and their effectiveness in assessing students relies on the quality and diversity of distractors, which are the incorrect answer options provided alongside the correct answer. Motivated by the progress in generative language models, we propose a two-step automatic distractor generation approach which is based on text to text transfer transformer models. Unlike most of previous methods for distractor generation, our approach does not rely on the correct answer options. Instead, it first generates both correct and incorrect answer options, and then discriminates potential correct options from distractors. Identified distractors are finally categorised based on semantic similarity scores into separate clusters, and the cluster heads are selected as our final distinct distractors. Experiments on two publicly available datasets show that our approach outperforms previous models both in the case of single-word answer options and longer-sequence reading comprehension questions.

pdf bib abs
Language Variety Identification with True Labels
Marcos Zampieri | Kai North | Tommi Jauhiainen | Mariano Felice | Neha Kumari | Nishant Nair | Yash Mahesh Bangera
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Language identification is an important first step in many NLP applications. Most publicly available language identification datasets, however, are compiled under the assumption that the gold label of each instance is determined by where texts are retrieved from. Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where texts may contain no distinctive marker of the particular language or variety. To overcome this important limitation, this paper presents DSL True Labels (DSL-TL), the first human-annotated multilingual dataset for language variety identification. DSL-TL contains a total of 12,900 instances in Portuguese, split between European Portuguese and Brazilian Portuguese; Spanish, split between Argentine Spanish and Castilian Spanish; and English, split between American English and British English. We trained multiple models to discriminate between these language varieties, and we present the results in detail. The data and models presented in this paper provide a reliable benchmark toward the development of robust and fairer language variety identification systems. We make DSL-TL freely available to the research community.

2022

pdf bib abs
Constructing Open Cloze Tests Using Generation and Discrimination Capabilities of Transformers
Mariano Felice | Shiva Taslimipoor | Paula Buttery
Findings of the Association for Computational Linguistics: ACL 2022

This paper presents the first multi-objective transformer model for generating open cloze tests that exploits generation and discrimination capabilities to improve performance. Our model is further enhanced by tweaking its loss function and applying a post-processing re-ranking algorithm that improves overall test structure. Experiments using automatic and human evaluation show that our approach can achieve up to 82% accuracy according to experts, outperforming previous work and baselines. We also release a collection of high-quality open cloze tests along with sample system output and human annotations that can serve as a future benchmark.

pdf bib abs
CEPOC: The Cambridge Exams Publishing Open Cloze dataset
Mariano Felice | Shiva Taslimipoor | Øistein E. Andersen | Paula Buttery
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Open cloze tests are a standard type of exercise where examinees must complete a text by filling in the gaps without any given options to choose from. This paper presents the Cambridge Exams Publishing Open Cloze (CEPOC) dataset, a collection of open cloze tests from world-renowned English language proficiency examinations. The tests in CEPOC have been expertly designed and validated using standard principles in language research and assessment. They are prepared for language learners at different proficiency levels and hence classified into different CEFR levels (A2, B1, B2, C1, C2). This resource can be a valuable testbed for various NLP tasks. We perform a complete set of experiments on three tasks: gap filling, gap prediction, and CEFR text classification. We implement transformer-based systems based on pre-trained language models to model each task and use our dataset as a test set, providing promising benchmark results.

2020

pdf bib abs
A Crash Course in Automatic Grammatical Error Correction
Roman Grundkiewicz | Christopher Bryant | Mariano Felice
Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts

Grammatical Error Correction (GEC) is the task of automatically detecting and correcting all types of errors in written text. Although most research has focused on correcting errors in the context of English as a Second Language (ESL), GEC can also be applied to other languages and native text. The main application of a GEC system is thus to assist humans with their writing. Academic and commercial interest in GEC has grown significantly since the Helping Our Own (HOO) and Conference on Natural Language Learning (CoNLL) shared tasks in 2011-14, and a record-breaking 24 teams took part in the recent Building Educational Applications (BEA) shared task. Given this interest, and the recent shift towards neural approaches, we believe the time is right to offer a tutorial on GEC for researchers who may be new to the field or who are interested in the current state of the art and future challenges. With this in mind, the main goal of this tutorial is not only to bring attendees up to speed with GEC in general, but also examine the development of neural-based GEC systems.

2019

pdf bib abs
Entropy as a Proxy for Gap Complexity in Open Cloze Tests
Mariano Felice | Paula Buttery
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper presents a pilot study of entropy as a measure of gap complexity in open cloze tests aimed at learners of English. Entropy is used to quantify the information content in each gap, which can be used to estimate complexity. Our study shows that average gap entropy correlates positively with proficiency levels while individual gap entropy can capture contextual complexity. To the best of our knowledge, this is the first unsupervised information-theoretical approach to evaluating the quality of cloze tests.

pdf bib abs
The BEA-2019 Shared Task on Grammatical Error Correction
Christopher Bryant | Mariano Felice | Øistein E. Andersen | Ted Briscoe
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

This paper reports on the BEA-2019 Shared Task on Grammatical Error Correction (GEC). As with the CoNLL-2014 shared task, participants are required to correct all types of errors in test data. One of the main contributions of the BEA-2019 shared task is the introduction of a new dataset, the Write&Improve+LOCNESS corpus, which represents a wider range of native and learner English levels and abilities. Another contribution is the introduction of tracks, which control the amount of annotated data available to participants. Systems are evaluated in terms of ERRANT F_0.5, which allows us to report a much wider range of performance statistics. The competition was hosted on Codalab and remains open for further submissions on the blind test set.

2017

pdf bib abs
Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction
Christopher Bryant | Mariano Felice | Ted Briscoe
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Until now, error type performance for Grammatical Error Correction (GEC) systems could only be measured in terms of recall because system output is not annotated. To overcome this problem, we introduce ERRANT, a grammatical ERRor ANnotation Toolkit designed to automatically extract edits from parallel original and corrected sentences and classify them according to a new, dataset-agnostic, rule-based framework. This not only facilitates error type evaluation at different levels of granularity, but can also be used to reduce annotator workload and standardise existing GEC datasets. Human experts rated the automatic edits as “Good” or “Acceptable” in at least 95% of cases, so we applied ERRANT to the system output of the CoNLL-2014 shared task to carry out a detailed error type analysis for the first time.

pdf bib abs
Artificial Error Generation with Machine Translation and Syntactic Patterns
Marek Rei | Mariano Felice | Zheng Yuan | Ted Briscoe
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Shortage of available training data is holding back progress in the area of automated error detection. This paper investigates two alternative methods for artificially generating writing errors, in order to create additional resources. We propose treating error generation as a machine translation task, where grammatically correct text is translated to contain errors. In addition, we explore a system for extracting textual patterns from an annotated corpus, which can then be used to insert errors into grammatically correct sentences. Our experiments show that the inclusion of artificially generated errors significantly improves error detection accuracy on both FCE and CoNLL 2014 datasets.

2016

pdf bib abs
Automatic Extraction of Learner Errors in ESL Sentences Using Linguistically Enhanced Alignments
Mariano Felice | Christopher Bryant | Ted Briscoe
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We propose a new method of automatically extracting learner errors from parallel English as a Second Language (ESL) sentences in an effort to regularise annotation formats and reduce inconsistencies. Specifically, given an original and corrected sentence, our method first uses a linguistically enhanced alignment algorithm to determine the most likely mappings between tokens, and secondly employs a rule-based function to decide which alignments should be merged. Our method beats all previous approaches on the tested datasets, achieving state-of-the-art results for automatic error extraction.

pdf bib
Candidate re-ranking for SMT-based grammatical error correction
Zheng Yuan | Ted Briscoe | Mariano Felice
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications