Nabil Oulbaz


2022

pdf bib
Open corpora and toolkit for assessing text readability in French
Nicolas Hernandez | Nabil Oulbaz | Tristan Faine
Proceedings of the 2nd Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI) within the 13th Language Resources and Evaluation Conference

Measuring the linguistic complexity or assessing the readability of spoken or written productions has been the concern of several researchers in pedagogy and (foreign) language teaching for decades. Researchers study for example the children’s language development or the second language (L2) learning with tasks such as age or reader’s level recommendation, or text simplification. Despite the interest for the topic, open datasets and toolkits for processing French are scarce. Our contributions are: (1) three open corpora for supporting research on readability assessment in French, (2) a dataset analysis with traditional formulas and an unsupervised measure, (3) a toolkit dedicated for French processing which includes the implementation of statistical formulas, a pseudo-perplexity measure, and state-of-the-art classifiers based on SVM and fine-tuned BERT for predicting readability levels, and (4) an evaluation of the toolkit on the three data sets.