Amalia Todirascu - ACL Anthology

Amalia Todirascu

Also published as: Amalia Todiraşcu

2025

GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns
Enzo Doyen | Amalia Todirascu
Findings of the Association for Computational Linguistics: ACL 2025

A significant portion of the textual data used in the field of Natural Language Processing (NLP) exhibits gender biases, particularly due to the use of masculine generics (masculine words that are supposed to refer to mixed groups of men and women), which can perpetuate and amplify stereotypes. Gender rewriting, an NLP task that involves automatically detecting and replacing gendered forms with neutral or opposite forms (e.g., from masculine to feminine), can be employed to mitigate these biases. While such systems have been developed in a number of languages (English, Arabic, Portuguese, German, French), automatic use of gender neutralization techniques (as opposed to inclusive or gender-switching techniques) has only been studied for English. This paper presents GeNRe, the very first French gender-neutral rewriting system using collective nouns, which are gender-fixed in French. We introduce a rule-based system (RBS) tailored for the French language alongside two fine-tuned language models trained on data generated by our RBS. We also explore the use of instruct-based models to enhance the performance of our other systems and find that Claude 3 Opus combined with our dictionary achieves results close to our RBS. Through this contribution, we hope to promote the advancement of gender bias mitigation techniques in NLP for French.

Chain-of-MetaWriting: Linguistic and Textual Analysis of How Small Language Models Write Young Students Texts
Ioana Buhnila | Georgeta Cislaru | Amalia Todirascu
Proceedings of the First Workshop on Writing Aids at the Crossroads of AI, Cognitive Science and NLP (WRAICOGS 2025)

Large Language Models (LLMs) have been used to generate texts in response to different writing tasks: reports, essays, storytelling. However, language models do not have a metarepresentation of the text writing process, nor inherent communication learning needs, comparable to those of young human students. This paper introduces a fine-grained linguistic and textual analysis of multilingual Small Language Models’ (SLMs) writing. With our method, Chain-of-MetaWriting, SLMs can imitate some steps of the human writing process, such as planning and evaluation. We mainly focused on short story and essay writing tasks in French for schoolchildren and undergraduate students respectively. Our results show that SLMs encounter difficulties in assisting young students on sensitive topics such as violence in the schoolyard, and they sometimes use words too complex for the target audience. In particular, the output is quite different from human-produced texts in terms of text cohesion and coherence, specifically regarding temporal connectors, topic progression, and reference.

GeNRe : un système de neutralisation automatique du genre exploitant les noms collectifs (GeNRe: An Automatic Gender Neutralization System Using Collective Nouns)
Enzo Doyen | Amalia Todirascu
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Natural language processing (NLP) tools tend to introduce gender biases, notably through an overuse of the masculine generic. The gender rewriting task in NLP, which aims to replace gendered forms with neutral, inclusive, or opposite forms, can help reduce these biases. Although work on automatic gender neutralization has been carried out for English, no similar project exists for French. This article presents GeNRe, the very first automatic gender neutralization system for French, which exploits collective nouns. We present a rule-based system (RBS) and fine-tune two language models on the data it generates. We also examine instruction-tuned models, previously unused for this task, in particular Claude 3 Opus. We obtain similar results for the RBS and for Claude 3 Opus when it is used together with our dictionary.

2024

Annotating Emotions in Acquired Brain Injury Patients’ Narratives
Salomé Klein | Amalia Todirascu | Hélène Vassiliadou | Marie Kuppelin | Joffrey Becart | Thalassio Briand | Clara Coridon | Francine Gerhard-Krait | Joé Laroche | Jean Ulrich | Agata Krasny-Pacini
Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024

In this article, we aim to measure the patients’ progress in recognizing and naming emotions by capturing a variety of phenomena that express emotion in discourse. To do so, we introduce an emotion annotation scheme adapted for Acquired Brain Injury (ABI) patients’ narratives. We draw on recent research outcomes in line with linguistic and psychological theories of emotion in the development of French resources for Natural Language Processing (NLP). From this perspective, and following the guidelines of Battistelli et al. (2022), our protocol considers several means of expressing emotions, including prototypical expressions as well as implicit means. Its originality lies in the methodology adopted for its creation, as we combined, adapted, and tested several previous annotation schemes to create a tool tailored to our spoken clinical French corpus and its unique characteristics and challenges.

LARGEMED: A Resource for Identifying and Generating Paraphrases for French Medical Terms
Ioana Buhnila | Amalia Todirascu
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024

This article presents a method extending an existing French corpus of paraphrases of medical terms ANONYMOUS with new data from Web archives created during the Covid-19 pandemic. Our method semi-automatically detects new terms and the paraphrase markers that introduce paraphrases in these Web archives, followed by a manual annotation step to identify paraphrases and their lexical and semantic properties. The extended large corpus LARGEMED could be used for automatic medical text simplification for patients and their families. To automatise data collection, we propose two experiments. The first experiment uses the new LARGEMED dataset to train a binary classifier that aims to detect new sentences containing possible paraphrases. The second experiment uses correct paraphrases to train a model for paraphrase generation, by adapting the T5 language model to the paraphrase generation task using an adversarial algorithm.

Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024
Rodrigo Wilkens | Rémi Cardon | Amalia Todirascu | Núria Gala
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024

2023

Évaluation d’un générateur automatique de reformulations médicales (Evaluation of an Automatic Generator of Medical Reformulations)
Ioana Buhnila | Amalia Todirascu
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs

Medical texts are difficult for the general public to understand because of their specialized terms. These medical notions need to be reformulated using everyday words. Reformulation is the rewriting process whose role is to explain or simplify a sentence or phrase. We present the methodology for building an original dataset (terms and reformulations) that enables the detection and generation of new medical reformulations. To extend this corpus, we run experiments on the automatic generation of sub-sentential medical reformulations with the APT tool (Nighojkar & Licato, 2021), which relies on deep learning techniques. We adapt the Transformer-based T5 language model (Raffel et al., 2020) using medical terms and their reformulations, manually annotated in French and in Romanian, a Romance language with few NLP resources. We present a detailed analysis of the results of automatic paraphrase generation.

2022

HECTOR: A Hybrid TExt SimplifiCation TOol for Raw Texts in French
Amalia Todirascu | Rodrigo Wilkens | Eva Rolin | Thomas François | Delphine Bernhard | Núria Gala
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Reducing the complexity of texts by applying an Automatic Text Simplification (ATS) system has been sparking interest in the area of Natural Language Processing (NLP) for several years and a number of methods and evaluation campaigns have emerged targeting lexical and syntactic transformations. In recent years, several studies exploit deep learning techniques based on very large comparable corpora. Yet the lack of large amounts of corpora (original-simplified) for French has been hindering the development of an ATS tool for this language. In this paper, we present our system, which is based on a combination of methods relying on word embeddings for lexical simplification and rule-based strategies for syntax and discourse adaptations. We present an evaluation of the lexical, syntactic and discourse-level simplifications according to automatic and human evaluations. We discuss the performances of our system at the lexical, syntactic, and discourse levels.

2020

Coreference-Based Text Simplification
Rodrigo Wilkens | Bruno Oberle | Amalia Todirascu
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Text simplification aims at adapting documents to make them easier to read by a given audience. Usually, simplification systems consider only lexical and syntactic levels, and, moreover, are often evaluated at the sentence level. Thus, studies on the impact of simplification on text cohesion are lacking. Some works add coreference resolution to their pipeline to address this issue. In this paper, we move forward in this direction and present a rule-based system for automatic text simplification, aiming at adapting French texts for dyslexic children. The architecture of our system takes into account not only lexical and syntactic but also discourse information, based on coreference chains. Our system has been manually evaluated in terms of grammaticality and cohesion. We have also built and used an evaluation corpus containing multiple simplification references for each sentence. It has been annotated by experts following a set of simplification guidelines, and can be used to run automatic evaluation of other simplification systems. Both the system and the evaluation corpus are freely available.

French Coreference for Spoken and Written Language
Rodrigo Wilkens | Bruno Oberle | Frédéric Landragin | Amalia Todirascu
Proceedings of the Twelfth Language Resources and Evaluation Conference

Coreference resolution aims at identifying and grouping all mentions referring to the same entity. In French, most systems run different setups, making their comparison difficult. In this paper, we present an extensive comparison of several coreference resolution systems for French. The systems have been trained on two corpora (ANCOR for spoken language and Democrat for written language) annotated with coreference chains, and augmented with syntactic and semantic information. The models are compared with different configurations (e.g. with and without singletons). In addition, we evaluate mention detection and coreference resolution separately. We present a full-stack model that outperforms other approaches. This model allows us to study the impact of mention detection errors on coreference resolution. Our analysis shows that mention detection can be improved by focusing on boundary identification, while advances in pronoun-noun relation detection can help the coreference task. Another contribution of this work is the first end-to-end neural French coreference resolution model trained on Democrat (written texts), which compares to the state-of-the-art systems for oral French.

Un corpus d’évaluation pour un système de simplification discursive (An Evaluation Corpus for Automatic Discourse Simplification)
Rodrigo Wilkens | Amalia Todirascu
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

We present a new simplified corpus, available in French, for evaluating a discourse simplification system. This system uses reference chains to simplify texts and to preserve textual cohesion after simplification. We present the corpus collection methodology (via a form that gathers manual simplifications made by expert participants), the rules presented in the guide, an analysis of the types of simplification, and an evaluation of our corpus by comparison with the output of the automatic simplification system.

Simplifying Coreference Chains for Dyslexic Children
Rodrigo Wilkens | Amalia Todirascu
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present work that aims to generate adapted content in French for dyslexic children, in the context of the ALECTOR project. To this end, we developed a system that transforms texts at the discourse level. The system uses rules to modify coreference chains, which are markers of text cohesion. These rules were designed following a careful study of coreference chains in both original texts and their simplified versions. Moreover, in order to define reliable transformation rules, we analysed several coreference properties as well as the concurrent simplification operations in the aligned texts. This information is encoded, together with a coreference resolution system and a text rewriting tool, in the proposed system, which comprises a coreference module specialised in written text and seven text transformation operations. The evaluation of the system first focused on checking the simplifications through manual validation by three judges. The errors they found were grouped into five classes that together explain 93% of the errors. The second evaluation step consisted of measuring how 23 judges perceived the simplification, which allows us to measure the simplification impact of the proposed rules.

2019

PolylexFLE : une base de données d’expressions polylexicales pour le FLE (PolylexFLE : a database of multiword expressions for French L2 language learning)
Amalia Todirascu | Marion Cargill | Thomas Francois
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume I : Articles longs

We present the PolylexFLE database, which contains 4295 multiword expressions. It is integrated into SimpleApprenant, a platform for learning French as a foreign language (FLE), designed for learning verbal multiword expressions (idiomatic expressions, collocations, or fixed expressions). In order to offer exercises suited to the levels of the Common European Framework of Reference for Languages (CEFR), we used a mixed (manual and automatic) procedure to annotate 1098 expressions with CEFR proficiency levels. The article focuses on the automatic procedure, which first identifies the PolylexFLE expressions in a corpus using a system based on regular expressions. Their distribution within the corpus, which is annotated according to the CEFR scale, is then estimated and converted into a single CEFR level.

2017

Survey: Multiword Expression Processing: A Survey
Mathieu Constant | Gülşen Eryiǧit | Johanna Monti | Lonneke van der Plas | Carlos Ramisch | Michael Rosner | Amalia Todirascu
Computational Linguistics, Volume 43, Issue 4 - December 2017

Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial for NLP applications, where it raises a number of challenges. The emergence of solutions in the absence of guiding principles motivates this survey, whose aim is not only to provide a focused review of MWE processing, but also to clarify the nature of interactions between MWE processing and downstream applications. We propose a conceptual framework within which challenges and research contributions can be positioned. It offers a shared understanding of what is meant by “MWE processing,” distinguishing the subtasks of MWE discovery and identification. It also elucidates the interactions between MWE processing and two use cases: Parsing and machine translation. Many of the approaches in the literature can be differentiated according to how MWE processing is timed with respect to underlying use cases. We discuss how such orchestration choices affect the scope of MWE-aware systems. For each of the two MWE processing subtasks and for each of the two use cases, we conclude on open issues and research perspectives.

2016

Are Cohesive Features Relevant for Text Readability Evaluation?
Amalia Todirascu | Thomas François | Delphine Bernhard | Núria Gala | Anne-Laure Ligozat
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper investigates the effectiveness of 65 cohesion-based variables that are commonly used in the literature as predictive features to assess text readability. We evaluate the efficiency of these variables across narrative and informative texts intended for an audience of L2 French learners. In our experiments, we use a French corpus that has been both manually and automatically annotated with regard to co-reference and anaphoric chains. The efficiency of the 65 variables for readability is analyzed through a correlational analysis and some modelling experiments.

2015

Caractériser les discours académiques et de vulgarisation : quelles propriétés ? (Characterizing Academic and Popular Science Discourse: Which Properties?)
Amalia Todirascu | Beatriz Sanchez Cardenas
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

The article presents a study of the linguistic properties (lexical, morpho-syntactic, syntactic) that enable the automatic classification of documents by genre (scientific articles and popular science articles) in two different domains (medicine and computer science). Our analysis, carried out on French corpora comparable in genre and topic, validates certain properties identified in the literature as characteristic of academic or popular science discourse. The first classification experiments evaluate the influence of these properties on automatic genre identification in the specific case of scientific or popular science texts.

2012

French and German Corpora for Audience-based Text Type Classification
Amalia Todirascu | Sebastian Padó | Jennifer Krisch | Max Kisselew | Ulrich Heid
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents some of the results of the CLASSYN project, which investigated the classification of text according to audience-related text types. We describe the design principles and the properties of the French and German linguistically annotated corpora that we have created. We report on tools used to collect the data and on the quality of the syntactic annotation. The CLASSYN corpora comprise two text collections for investigating general text-type differences between scientific and popular science texts in two domains, medicine and computer science.

2011

Using Cognates in a French-Romanian Lexical Alignment System: A Comparative Study
Mirabela Navlea | Amalia Todiraşcu
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Identification de cognats à partir de corpus parallèles français-roumain (Identification of cognates from French-Romanian parallel corpora)
Mirabela Navlea | Amalia Todiraşcu
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This article presents a hybrid method for identifying French-Romanian cognates. The method exploits parallel corpora that are aligned at the clause level, lemmatized, and tagged (with morphosyntactic properties). Our method combines statistical techniques and linguistic information to improve the results obtained. We evaluate the cognate identification module and compare it with purely statistical methods in order to study the impact of the linguistic information used on the quality of the results. We show that using linguistic information significantly increases the performance of the method.

RefGen, outil d’identification automatique des chaînes de référence en français (RefGen, an automatic identification tool of reference chains in French)
Laurence Longo | Amalia Todirascu
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Cognate Identification for a French - Romanian Lexical Alignment System: Empirical Study
Mirabela Navlea | Amalia Todiraşcu
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

2010

RefGen : un module d’identification des chaînes de référence dépendant du genre textuel (RefGen: A Genre-Dependent Module for Identifying Reference Chains)
Laurence Longo | Amalia Todiraşcu
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this article, we present RefGen, a module for identifying reference chains in French. RefGen automatically annotates referring expressions, then identifies the coreference relations between these expressions to form reference chains. The computation of reference uses genre-dependent properties of reference chains, the accessibility scale of Ariel (1990), and a series of lexical, morphosyntactic, and semantic filters. We evaluate the first results of RefGen on a corpus of public reports.

2008

A Hybrid Approach to Extracting and Classifying Verb+Noun Constructions
Amalia Todiraşcu | Dan Tufiş | Ulrich Heid | Christopher Gledhill | Dan Ştefanescu | Marion Weller | François Rousselot
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present the main findings and preliminary results of an ongoing project aimed at developing a system for collocation extraction based on contextual morpho-syntactic properties. We explored two hybrid extraction methods: the first applies language-independent statistical techniques followed by linguistic filtering, while the second, available only for German, uses a set of lexico-syntactic patterns to extract collocation candidates. To define extraction and filtering patterns, we studied a specific collocation category, Verb-Noun constructions, using a model inspired by systemic functional grammar that proposes a three-level analysis based on lexical, functional and semantic criteria. From a tagged and lemmatized corpus, we identify contextual morpho-syntactic properties that help to filter the output of the statistical methods and to extract potentially interesting VN constructions (complex predicates vs complex predicators). The extracted candidates are validated and classified manually.

2004

Experiments on Building Language Resources for Multi-Modal Dialogue Systems
Laurent Romary | Amalia Todirascu | David Langlois
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Introduction to ROMAND 2004
Vincenzo Pallotta | Amalia Todirascu
Proceedings of the 3rd workshop on RObust Methods in Analysis of Natural Language Data (ROMAND 2004)

2002

Towards Reusable NLP Components
Amalia Todirascu | Eric Kow | Laurent Romary
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2001

Ontologies for Information Retrieval
Amalia Todiraşcu | François Rousselot
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The paper presents a system for querying, in natural language, a set of text documents from a limited domain. The domain knowledge, represented in description logics (DL), is used to filter the documents returned as answers, and it is extended dynamically (when new concepts are identified in the texts) as a result of DL inference mechanisms. The conceptual hierarchy is built semi-automatically from the texts. Concept instances are identified using shallow natural language parsing techniques.
