2024
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
Ekaterina Kochmar | Marie Bexte | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Anaïs Tack | Victoria Yaneva | Zheng Yuan
Predicting Initial Essay Quality Scores to Increase the Efficiency of Comparative Judgment Assessments
Michiel De Vrindt | Anaïs Tack | Renske Bouwer | Wim Van Den Noortgate | Marije Lesterhuis
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
Comparative judgment (CJ) is a method that can be used to assess the writing quality of student essays based on repeated pairwise comparisons by multiple assessors. Although the assessment method is known to have high validity and reliability, it can be particularly inefficient, as assessors must make many judgments before the scores become reliable. Prior research has investigated methods to improve the efficiency of CJ, yet these methods introduce additional challenges, notably stemming from the initial lack of information at the start of the assessment, which is known as a cold-start problem. This paper reports on a study in which we predict the initial quality scores of essays to establish a warm start for CJ. To achieve this, we construct informative prior distributions for the quality scores based on the predicted initial quality scores. Through simulation studies, we demonstrate that our approach increases the efficiency of CJ: On average, assessors need to make 30% fewer judgments for each essay to reach an overall reliability level of 0.70.
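The warm-start idea described in the abstract can be sketched as a Bradley-Terry model whose quality scores receive an informative Gaussian prior centered on the predicted initial scores. This is a minimal illustration under assumed names and made-up data, not the authors' implementation:

```python
import math

def fit_bradley_terry(judgments, prior_means, prior_sd=1.0, lr=0.05, epochs=200):
    """MAP estimate of Bradley-Terry quality scores from pairwise judgments.

    judgments: (winner_index, loser_index) pairs from assessors.
    prior_means: predicted initial quality scores used as an informative
    Gaussian prior -- the "warm start" (a cold start would use all zeros).
    """
    theta = list(prior_means)  # start the scores at the prior means
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for w, l in judgments:
            # P(winner beats loser) is logistic in the score difference
            p = 1.0 / (1.0 + math.exp(theta[l] - theta[w]))
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        for i in range(len(theta)):
            # the Gaussian prior pulls each score toward its predicted value
            grad[i] -= (theta[i] - prior_means[i]) / prior_sd ** 2
            theta[i] += lr * grad[i]
    return theta

# two essays, ten judgments all favoring essay 0
scores = fit_bradley_terry([(0, 1)] * 10, [0.0, 0.0])
```

With informative prior means, an essay's estimate starts near its predicted quality, so fewer judgments are needed before the estimates stabilize.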
ITEC at BEA 2024 Shared Task: Predicting Difficulty and Response Time of Medical Exam Questions with Statistical, Machine Learning, and Language Models
Anaïs Tack | Siem Buseyne | Changsheng Chen | Robbe D’hondt | Michiel De Vrindt | Alireza Gharahighehi | Sameh Metwaly | Felipe Kenji Nakano | Ann-Sophie Noreillie
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
This paper presents the results of our participation in the BEA 2024 shared task on the automated prediction of item difficulty and item response time (APIDIRT), hosted by the NBME (National Board of Medical Examiners). During this task, practice multiple-choice questions from the United States Medical Licensing Examination® (USMLE®) were shared, and research teams were tasked with devising systems capable of predicting the difficulty and average response time for new exam questions. Our team, part of the interdisciplinary itec research group, participated in the task. We extracted linguistic features and clinical embeddings from question items and tested various modeling techniques, including statistical regression, machine learning, language models, and ensemble methods. Surprisingly, simpler models such as Lasso and random forest regression, using principal component features derived from the linguistic and clinical embeddings, outperformed more complex models. In the competition, our random forest model ranked 4th out of 43 submissions for difficulty prediction, while the Lasso model secured 2nd position out of 34 submissions for response time prediction. Further analysis suggests that had we submitted the Lasso model for difficulty prediction as well, we would have achieved an even higher ranking. We also observed that response time is easier to predict than difficulty, with features such as item length, type, exam step, and analytical thinking influencing response time prediction more strongly.
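The pipeline of principal-component features followed by a regression model can be sketched in a dependency-light way. The data below are synthetic, and ordinary least squares stands in for Lasso so the sketch needs only NumPy; none of it reflects the actual shared-task features or models:

```python
import numpy as np

def pca_features(X, k):
    """Project standardized features onto the top-k principal components."""
    Xc = (X - X.mean(0)) / X.std(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
# stand-in for linguistic + embedding features of 200 exam items
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))
y = latent[:, 0] + 0.05 * rng.normal(size=200)  # synthetic response-time target

Z = pca_features(X, 10)                         # compressed feature matrix
design = np.c_[np.ones(len(Z)), Z]              # add an intercept column
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
pred = design @ coef
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

Because the 50 raw features are driven by a handful of latent factors, a few principal components carry nearly all of the predictive signal, which is the intuition behind why such simple models can compete with more complex ones.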
ITEC at MLSP 2024: Transferring Predictions of Lexical Difficulty from Non-Native Readers
Anaïs Tack
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
This paper presents the results of our team’s participation in the BEA 2024 shared task on the multilingual lexical simplification pipeline (MLSP; Shardlow et al., 2024). During the task, organizers supplied data that combined two components of the simplification pipeline: lexical complexity prediction and lexical substitution. This dataset encompassed ten languages, including French. Given the absence of dedicated training data, teams were challenged to employ systems trained on pre-existing resources and to evaluate their performance on unexplored test data. Our team contributed to the task using previously developed models for predicting lexical difficulty in French (Tack, 2021). These models, built on deep learning architectures, extended our participation in the CWI 2018 shared task (De Hertog and Tack, 2018). The training dataset comprised 262,054 binary decision annotations, capturing perceived lexical difficulty, collected from a sample of 56 non-native French readers. Two pre-trained neural logistic models were used: (1) a model predicting difficulty for words within their sentence context, and (2) a model predicting difficulty for isolated words. The findings revealed that despite being trained for a distinct prediction task (as indicated by a negative R² fit), transferring the logistic predictions of lexical difficulty to continuous scores of lexical complexity yielded a positive correlation. Specifically, isolated predictions exhibited a higher Pearson correlation (r = .36) than contextualized predictions (r = .33), and a markedly higher Spearman rank correlation (ρ = .50 vs. ρ = .35). These results align with earlier observations by Tack (2021), suggesting that the ground truth primarily captures lexical access difficulties rather than word-to-context integration problems.
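The two evaluation statistics used above, Pearson's r and Spearman's ρ, can be reproduced in form (not in value) with a few lines of NumPy; the predicted probabilities and gold complexity scores below are invented for illustration:

```python
import numpy as np

def pearson(a, b):
    """Product-moment correlation between two equal-length sequences."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(((a - a.mean()) * (b - b.mean())).mean() / (a.std() * b.std()))

def spearman(a, b):
    """Rank correlation: Pearson's r computed on ranks (ties ignored here)."""
    rank = lambda x: np.argsort(np.argsort(x)).astype(float)
    return pearson(rank(a), rank(b))

# hypothetical: classifier probabilities of "difficult" vs. gold complexity
p_difficult = [0.10, 0.40, 0.35, 0.80, 0.90]
complexity = [0.05, 0.30, 0.40, 0.60, 0.95]
r = pearson(p_difficult, complexity)
rho = spearman(p_difficult, complexity)
```

Spearman's ρ only asks whether the model orders words by difficulty correctly, which is why a model trained on a different task (binary decisions rather than continuous complexity) can still score well on it.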
2023
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
Ekaterina Kochmar | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Nitin Madnani | Anaïs Tack | Victoria Yaneva | Zheng Yuan | Torsten Zesch
The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues
Anaïs Tack | Ekaterina Kochmar | Zheng Yuan | Serge Bibauw | Chris Piech
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
This paper describes the results of the first shared task on the generation of teacher responses in educational dialogues. The goal of the task was to benchmark the ability of generative language models to act as AI teachers, replying to a student in a teacher-student dialogue. Eight teams participated in the competition hosted on CodaLab and experimented with a wide variety of state-of-the-art models, including Alpaca, Bloom, DialoGPT, DistilGPT-2, Flan-T5, GPT-2, GPT-3, GPT-4, LLaMA, OPT-2.7B, and T5-base. Their submissions were automatically scored using BERTScore and DialogRPT metrics, and the top three were further manually evaluated for pedagogical ability based on Tack and Piech (2022). The NAISTeacher system, which ranked first in both automated and human evaluation, generated responses with GPT-3.5 Turbo using an ensemble of prompts and DialogRPT-based ranking of responses for given dialogue contexts. Despite the promising achievements of the participating teams, the results also highlight the need for evaluation metrics better suited to educational contexts.
2022
FABRA: French Aggregator-Based Readability Assessment toolkit
Rodrigo Wilkens | David Alfter | Xiaoou Wang | Alice Pintard | Anaïs Tack | Kevin P. Yancey | Thomas François
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this paper, we present FABRA, a readability toolkit based on the aggregation of a large number of readability predictor variables. The toolkit is implemented as a service-oriented architecture, which obviates the need for installation and simplifies its integration into other projects. We also perform a set of experiments to show which features are most predictive on two different corpora, and how the use of aggregators improves performance over standard feature-based readability prediction. Our experiments show that, for the explored corpora, the most important predictors for native texts are measures of lexical diversity, dependency counts, and text coherence, while the most important predictors for texts aimed at foreign learners are syntactic variables illustrating language development, as well as features linked to lexical sophistication. FABRA has the potential to support new research on readability assessment for French.
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)
Ekaterina Kochmar | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Nitin Madnani | Anaïs Tack | Victoria Yaneva | Zheng Yuan | Torsten Zesch
2020
Alector: A Parallel Corpus of Simplified French Texts with Alignments of Misreadings by Poor and Dyslexic Readers
Núria Gala | Anaïs Tack | Ludivine Javourey-Drevet | Thomas François | Johannes C. Ziegler
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we present a new parallel corpus addressed to researchers, teachers, and speech therapists interested in text simplification as a means of alleviating difficulties in children learning to read. The corpus is composed of excerpts drawn from 79 authentic literary (tales, stories) and scientific (documentary) texts commonly used in French schools for children aged 7 to 9. The excerpts were manually simplified at the lexical, morpho-syntactic, and discourse levels in order to propose a parallel corpus for reading tests and for the development of automatic text simplification tools. A sample of 21 poor-reading and dyslexic children with an average reading delay of 2.5 years read a portion of the corpus. Transcripts of their reading errors were integrated into the corpus with the goal of identifying lexical difficulty in the target population. By means of statistical testing, we provide evidence that the manual simplifications significantly reduced reading errors, highlighting that the words targeted for simplification were not only well chosen but also substituted with substantially easier alternatives. The entire corpus is available for consultation through a web interface and available on demand for research purposes.
2018
A Report on the Complex Word Identification Shared Task 2018
Seid Muhie Yimam | Chris Biemann | Shervin Malmasi | Gustavo Paetzold | Lucia Specia | Sanja Štajner | Anaïs Tack | Marcos Zampieri
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications
We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT’2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification and probabilistic classification. A total of 12 teams submitted their results in different task/track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings.
NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet
Anaïs Tack | Thomas François | Piet Desmet | Cédrick Fairon
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications
In this paper, we introduce NT2Lex, a novel lexical resource for Dutch as a foreign language (NT2) which includes frequency distributions of 17,743 words and expressions attested in expert-written textbook texts and readers graded along the scale of the Common European Framework of Reference (CEFR). In essence, the lexicon informs us about what kind of vocabulary should be understood when reading Dutch as a non-native reader at a particular proficiency level. The main novelty of the resource with respect to the previously developed CEFR-graded lexicons concerns the introduction of corpus-based evidence for L2 word sense complexity through the linkage to Open Dutch WordNet (Postma et al., 2016). The resource thus contains, on top of the lemmatised and part-of-speech tagged lexical entries, a total of 11,999 unique word senses and 8,934 distinct synsets.
Deep Learning Architecture for Complex Word Identification
Dirk De Hertog | Anaïs Tack
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications
We describe a system for the CWI task that includes information on five aspects of the (complex) lexical item, namely distributional information about the item itself, morphological structure, psychological measures, corpus counts, and topical information. We constructed a deep learning architecture that combines these features and applied it to the probabilistic and binary classification tasks for all English sets and Spanish. We achieved reasonable performance on all sets, with the best performance seen on the probabilistic task, particularly on the English news set (MAE 0.054 and F1-score of 0.872). An analysis of the results shows that reasonable performance can be achieved with a single architecture without any domain-specific tweaking of the parameter settings, and that distributional features capture almost all of the information also found in hand-crafted features.
2017
Human and Automated CEFR-based Grading of Short Answers
Anaïs Tack | Thomas François | Sophie Roekhaut | Cédrick Fairon
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
This paper is concerned with the task of automatically assessing the written proficiency level of non-native (L2) learners of English. Drawing on previous research on automated L2 writing assessment following the Common European Framework of Reference for Languages (CEFR), we investigate the possibilities and difficulties of deriving the CEFR level from short answers to open-ended questions, a setting that has received little attention to date. The objective of our study is twofold: to examine the intricacies of both human and automated CEFR-based grading of short answers. On the one hand, we describe the compilation of a learner corpus of short answers graded with CEFR levels by three certified Cambridge examiners. We mainly observe that, although the shortness of the answers is reported as undermining a clear-cut evaluation, the length of an answer does not necessarily correlate with inter-examiner disagreement. On the other hand, we explore the development of a soft-voting system for the automated CEFR-based grading of short answers and draw tentative conclusions about its use in a computer-assisted testing (CAT) setting.
2016
Modèles adaptatifs pour prédire automatiquement la compétence lexicale d’un apprenant de français langue étrangère (Adaptive models for automatically predicting the lexical competence of French as a foreign language learners)
Anaïs Tack | Thomas François | Anne-Laure Ligozat | Cédrick Fairon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)
This study examines the use of supervised incremental learning methods to predict the lexical competence of learners of French as a foreign language (FFL). The target learners are Dutch speakers with an A2/B1 level according to the Common European Framework of Reference for Languages (CEFR). Following recent work on predicting lexical proficiency with complexity features, we develop two types of models that adapt on the basis of experience feedback revealing the learner's knowledge. In particular, we define (i) a model that predicts the lexical competence of all learners at the same proficiency level and (ii) a model that predicts the lexical competence of an individual learner. The resulting models are then evaluated against a baseline model that determines lexical competence from a specialized FFL lexicon, and prove to gain significantly in accuracy (9%-17%).
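The kind of feedback-driven adaptation described here can be sketched as a single stochastic-gradient update of a logistic model of word knowledge. This is purely illustrative; the features and feedback data are invented and do not come from the paper:

```python
import math

def sgd_logistic_update(w, x, y, lr=0.1):
    """One incremental update of a logistic model predicting whether a
    learner knows a word, from a single feedback signal y (1 = known)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))  # current predicted probability of "known"
    return [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]

# hypothetical features: [bias, word complexity]; the model adapts as
# feedback arrives, learning that more complex words are less likely known
w = [0.0, 0.0]
for x, y in [([1.0, 0.2], 1), ([1.0, 0.9], 0), ([1.0, 0.1], 1)]:
    w = sgd_logistic_update(w, x, y)
```

Updating one observation at a time is what lets such a model specialize either to a proficiency level (pooling feedback across learners) or to one individual learner (using only that learner's feedback).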
SVALex: a CEFR-graded Lexical Resource for Swedish Foreign and Second Language Learners
Thomas François | Elena Volodina | Ildikó Pilán | Anaïs Tack
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The paper introduces SVALex, a lexical resource primarily aimed at learners and teachers of Swedish as a foreign and second language that describes the distribution of 15,681 words and expressions across the levels of the Common European Framework of Reference (CEFR). The resource is based on a corpus of coursebook texts, and thus describes the receptive vocabulary learners are exposed to during reading activities, as opposed to the productive vocabulary they use when speaking or writing. The paper describes the methodology applied to create the list and to estimate the frequency distribution. It also discusses some characteristics of the resulting resource and compares it to other lexical resources for Swedish. An interesting feature of this resource is the possibility to separate the core vocabulary at each level, i.e. vocabulary shared by several coursebook writers, from peripheral vocabulary used by only a minority of coursebook writers.
Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource
Anaïs Tack | Thomas François | Anne-Laure Ligozat | Cédrick Fairon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This study examines two possibilities of using the FLELex graded lexicon for the automated assessment of text complexity in the context of French as a foreign language learning. From the lexical frequency distributions described in FLELex, we derive a single level of difficulty for each word in a parallel corpus of original and simplified texts. We then use this data to automatically assess the lexical complexity of texts in two ways. On the one hand, we evaluate the degree of lexical simplification in manually simplified texts with respect to their original version. Our results show a significant simplification effect, both in the case of French narratives simplified for non-native readers and in the case of simplified Wikipedia texts. On the other hand, we define a predictive model which identifies the number of words in a text that are expected to be known at a particular learning level. We assess the accuracy with which these predictions capture actual word knowledge as reported by Dutch-speaking learners of French. Our study shows that although the predictions seem relatively accurate overall (87.4% to 92.3%), they do not yet seem to capture the learners’ lack of knowledge very well.
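The predictive model described above, counting the words a learner at a given level is expected to know, can be sketched with a toy graded lexicon. The lexicon entries below are invented stand-ins for FLELex data:

```python
# CEFR scale, ordered from beginner to mastery
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# hypothetical graded lexicon: first CEFR level at which each word is attested
lexicon = {"maison": "A1", "chien": "A1", "voyage": "A2", "orgueil": "C1"}

def expected_known(words, learner_level, lexicon):
    """Count the words a learner at `learner_level` is expected to know:
    those first attested at or below that level in the graded lexicon."""
    cutoff = LEVELS.index(learner_level)
    return sum(
        1
        for w in words
        if w in lexicon and LEVELS.index(lexicon[w]) <= cutoff
    )
```

For example, an A2 learner reading the words "maison", "voyage", and "orgueil" would be expected to know the first two but not the third, since "orgueil" is first attested only at C1.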