Patrick Watrin

2024

Exploring hybrid approaches to readability: experiments on the complementarity between linguistic features and transformers
Rodrigo Wilkens | Patrick Watrin | Rémi Cardon | Alice Pintard | Isabelle Gribomont | Thomas François
Findings of the Association for Computational Linguistics: EACL 2024

Linguistic features have a strong contribution in the context of the automatic assessment of text readability (ARA). They have been one of the anchors between the computational and theoretical models. With the development in the ARA field, the research moved to Deep Learning (DL). In an attempt to reconcile the mixed results reported in this context, we present a systematic comparison of 6 hybrid approaches along with standard Machine Learning and DL approaches, on 4 corpora (different languages and target audiences). The various experiments clearly highlighted two rather simple hybridization methods (soft label and simple concatenation). They also appear to be the most robust on smaller datasets and across various tasks and languages. This study stands out as the first to systematically compare different architectures and approaches to feature hybridization in DL, as well as comparing performance in terms of two languages and two target audiences of the text, which leads to a clearer pattern of results.

pdf bib abs

LLM-Generated Contexts to Practice Specialised Vocabulary: Corpus Presentation and Comparison
Iglika Nikolova-Stoupak | Serge Bibauw | Amandine Dumont | Françoise Stas | Patrick Watrin | Thomas François
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

This project evaluates the potential of LLM and dynamic corpora to generate contexts ai- med at the practice and acquisition of specialised English vocabulary. We compared reference contexts—handpicked by expert teachers—for a specialised vocabulary list to contexts generated by three recent large language models (LLM) of different sizes (Mistral-7B-Instruct, Vicuna-13B, and Gemini 1.0 Pro) and to contexts extracted from articles web-crawled from specialised websites. The comparison uses a representative set of length-based, morphosyntactic, semantic, and discourse- related textual characteristics. We conclude that the LLM-based corpora can be combined effectively with a web-crawled one to form an academic corpus characterised by appropriate complexity and textual variety.

pdf bib abs

Exploration d’approches hybrides pour la lisibilité : expériences sur la complémentarité entre les traits linguistiques et les transformers
Rodrigo Wilkens | Patrick Watrin | Rémi Cardon | Alice Pintard | Isabelle Gribomont | Thomas François
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 2 : traductions d'articles publiès

Les architectures d’apprentissage automatique reposant sur la définition de traits linguistiques ont connu un succès important dans le domaine de l’évaluation automatique de la lisibilité des textes (ARA) et ont permis de faire se rencontrer informatique et théorie psycholinguistique. Toutefois, les récents développements se sont tournés vers l’apprentissage profond et les réseaux de neurones. Dans cet article, nous cherchons à réconcilier les deux approches. Nous présentons une comparaison systématique de 6 architectures hybrides (appliquées à plusieurs langues et publics) que nous comparons à ces deux approches concurrentes. Les diverses expériences réalisées ont clairement mis en évidence deux méthodes d’hybridation : Soft-Labeling et concaténation simple. Ces deux architectures sont également plus efficaces lorsque les données d’entraînement sont réduites. Cette étude est la première à comparer systématiquement différentes architectures hybrides et à étudier leurs performances dans plusieurs tâches de lisibilité.

pdf bib

Generating Contexts for ESP Vocabulary Exercises with LLMs
Iglika Nikolova-Stoupak | Serge Bibauw | Amandine Dumont | Françoise Stas | Patrick Watrin | Thomas François
Proceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning

pdf bib abs

Paying attention to the words: explaining readability prediction for French as a foreign language
Rodrigo Wilkens | Patrick Watrin | Thomas François
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024

Automatic text Readability Assessment (ARA) has been seen as a way of helping people with reading difficulties. Recent advancements in Natural Language Processing have shifted ARA from linguistic-based models to more precise black-box models. However, this shift has weakened the alignment between ARA models and the reading literature, potentially leading to inaccurate predictions based on unintended factors. In this paper, we investigate the explainability of ARA models, inspecting the relationship between attention mechanism scores, ARA features, and CEFR level predictions made by the model. We propose a method for identifying features associated with the predictions made by a model through the use of the attention mechanism. Exploring three feature families (i.e., psycho-linguistic, work frequency and graded lexicon), we associated features with the model’s attention heads. Finally, while not fully explanatory of the model’s performance, the correlations of these associations surpass those between features and text readability levels.

2023

pdf bib abs

Annotation Linguistique pour l’Évaluation de la Simplification Automatique de Textes
Rémi Cardon | Adrien Bibal | Rodrigo Wilkens | David Alfter | Magali Norré | Adeline Müller | Patrick Watrin | Thomas François
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 4 : articles déjà soumis ou acceptés en conférence internationale

L’évaluation des systèmes de simplification automatique de textes (SAT) est une tâche difficile, accomplie à l’aide de métriques automatiques et du jugement humain. Cependant, d’un point de vue linguistique, savoir ce qui est concrètement évalué n’est pas clair. Nous proposons d’annoter un des corpus de référence pour la SAT, ASSET, que nous utilisons pour éclaircir cette question. En plus de la contribution que constitue la ressource annotée, nous montrons comment elle peut être utilisée pour analyser le comportement de SARI, la mesure d’évaluation la plus populaire en SAT. Nous présentons nos conclusions comme une étape pour améliorer les protocoles d’évaluation en SAT à l’avenir.

2022

pdf bib abs

The performance of deep learning models in NLP and other fields of machine learning has led to a rise in their popularity, and so the need for explanations of these models becomes paramount. Attention has been seen as a solution to increase performance, while providing some explanations. However, a debate has started to cast doubt on the explanatory power of attention in neural networks. Although the debate has created a vast literature thanks to contributions from various areas, the lack of communication is becoming more and more tangible. In this paper, we provide a clear overview of the insights on the debate by critically confronting works from these different areas. This holistic vision can be of great interest for future works in all the communities concerned by this debate. We sum up the main challenges spotted in these areas, and we conclude by discussing the most promising future avenues on attention as an explanation.

pdf bib abs

L’Attention est-elle de l’Explication ? Une Introduction au Débat (Is Attention Explanation ? An Introduction to the Debate )
Adrien Bibal | Remi Cardon | David Alfter | Rodrigo Wilkens | Xiaoou Wang | Thomas François | Patrick Watrin
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Nous présentons un résumé en français et un résumé en anglais de l’article Is Attention Explanation ? An Introduction to the Debate (Bibal et al., 2022), publié dans les actes de la conférence 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022).

2021

pdf bib abs

FrenLyS: A Tool for the Automatic Simplification of French General Language Texts
Eva Rolin | Quentin Langlois | Patrick Watrin | Thomas François
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Lexical simplification (LS) aims at replacing words considered complex in a sentence by simpler equivalents. In this paper, we present the first automatic LS service for French, FrenLys, which offers different techniques to generate, select and rank substitutes. The paper describes the different methods proposed by our tool, which includes both classical approaches (e.g. generation of candidates from lexical resources, frequency filter, etc.) and more innovative approaches such as the exploitation of CamemBERT, a model for French based on the RoBERTa architecture. To evaluate the different methods, a new evaluation dataset for French is introduced.

2016

pdf bib

CENTAL at SemEval-2016 Task 12: a linguistically fed CRF model for medical and temporal information extraction
Charlotte Hansart | Damien De Meyere | Patrick Watrin | André Bittar | Cédrick Fairon
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2014

pdf bib abs

FLELex: a graded lexical resource for French foreign learners
Thomas François | Nùria Gala | Patrick Watrin | Cédrick Fairon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present FLELex, the first graded lexicon for French as a foreign language (FFL) that reports word frequencies by difficulty level (according to the CEFR scale). It has been obtained from a tagged corpus of 777,000 words from available textbooks and simplified readers intended for FFL learners. Our goal is to freely provide this resource to the community to be used for a variety of purposes going from the assessment of the lexical difficulty of a text, to the selection of simpler words within text simplification systems, and also as a dictionary in assistive tools for writing.

2013

pdf bib

Stratégies discriminantes pour intégrer la reconnaissance des mots composés dans un analyseur syntaxique en constituants [Discriminative strategies for integrating multiword expression recognition in a constituent parser]
Matthieu Constant | Anthony Sigogne | Patrick Watrin
Traitement Automatique des Langues, Volume 54, Numéro 1 : Varia [Varia]

2012

pdf bib

La reconnaissance des mots composés à l’épreuve de l’analyse syntaxique et vice-versa : évaluation de deux stratégies discriminantes (Recognition of Compound Words Tested against Parsing and Vice-versa : Evaluation of Two Discriminative Approaches) [in French]
Matthieu Constant | Anthony Sigogne | Patrick Watrin
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib abs

Extraction of unmarked quotations in Newspapers
Stéphanie Weiser | Patrick Watrin
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents work in progress to automatically extract quotation sentences from newspaper articles. The focus is the extraction and annotation of unmarked quotation sentences. A linguistic study shows that unmarked quotation sentences can be formalised into 16 patterns that can be used to develop an extraction grammar. The question of unmarked quotation boundaries identification is also raised as they are often ambiguous. An annotation scheme allowing to describe all the elements that can take place in a quotation sentence is defined. This paper presents the creation of two resources necessary to our system. A dictionary of verbs introducing quotations has been automatically built using a grammar of marked quotations sentences to identify the verbs able to introduce quotations. A grammar formalising the patterns of unmarked quotation sentences ― using the tool Unitex, based on finite state machines ― has been developed. A short experiment has been performed on two patterns and shows some promising results.

pdf bib

Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
Matthieu Constant | Anthony Sigogne | Patrick Watrin
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2011

pdf bib

On the Contribution of MWE-based Features to a Readability Formula for French as a Foreign Language
Thomas François | Patrick Watrin
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

pdf bib

An N-gram Frequency Database Reference to Handle MWE Extraction in NLP Applications
Patrick Watrin | Thomas François
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

pdf bib

Temporal Expressions Extraction in SMS messages
Stéphanie Weiser | Louis-Amélie Cougnon | Patrick Watrin
Proceedings of the RANLP 2011 Workshop on Information Extraction and Knowledge Acquisition

pdf bib abs

Quel apport des unités polylexicales dans une formule de lisibilité pour le français langue étrangère (What is the contribution of multi-word expressions in a readability formula for the French as a foreign language)
Thomas François | Patrick Watrin
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Cette étude envisage l’emploi des unités polylexicales (UPs) comme prédicteurs dans une formule de lisibilité pour le français langue étrangère. À l’aide d’un extracteur d’UPs combinant une approche statistique à un filtre linguistique, nous définissons six variables qui prennent en compte la densité et la probabilité des UPs nominales, mais aussi leur structure interne. Nos expérimentations concluent à un faible pouvoir prédictif de ces six variables et révèlent qu’une simple approche basée sur la probabilité moyenne des n-grammes des textes est plus efficace.

2010

pdf bib abs

Partial Parsing of Spontaneous Spoken French
Olivier Blanc | Matthieu Constant | Anne Dister | Patrick Watrin
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes the process and the resources used to automatically annotate a French corpus of spontaneous speech transcriptions in super-chunks. Super-chunks are enhanced chunks that can contain lexical multiword units. This partial parsing is based on a preprocessing stage of the spoken data that consists in reformatting and tagging utterances that break the syntactic structure of the text, such as disfluencies. Spoken specificities were formalized thanks to a systematic linguistic study of a 40-hour-long speech transcription corpus. The chunker uses large-coverage and fine-grained language resources for general written language that have been augmented with resources specific to spoken French. It consists in iteratively applying finite-state lexical and syntactic resources and outputing a finite automaton representing all possible chunk analyses. The best path is then selected thanks to a hybrid disambiguation stage. We show that our system reaches scores that are comparable with state-of-the-art results in the field.

2007

pdf bib abs

Segmentation en super-chunks
Olivier Blanc | Matthieu Constant | Patrick Watrin
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

Depuis l’analyseur développé par Harris à la fin des années 50, les unités polylexicales ont peu à peu été intégrées aux analyseurs syntaxiques. Cependant, pour la plupart, elles sont encore restreintes aux mots composés qui sont plus stables et moins nombreux. Toutefois, la langue est remplie d’expressions semi-figées qui forment également des unités sémantiques : les expressions adverbiales et les collocations. De même que pour les mots composés traditionnels, l’identification de ces structures limite la complexité combinatoire induite par l’ambiguïté lexicale. Dans cet article, nous détaillons une expérience qui intègre ces notions dans un processus de segmentation en super-chunks, préalable à l’analyse syntaxique. Nous montrons que notre chunker, développé pour le français, atteint une précision et un rappel de 92,9 % et 98,7 %, respectivement. Par ailleurs, les unités polylexicales réalisent 36,6 % des attachements internes aux constituants nominaux et prépositionnels.