Damien Nouvel

2025

Systèmes d’écriture et qualité des données : l’affinage de modèles de translittération dans un contexte de faibles ressources
Emmett Strickland | Ilaine Wang | Damien Nouvel | Bénédicte Diot-Parvaz Ahmad
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

This article presents an experiment aimed at building fine-tuned romanization models for eleven languages, including some so-called low-resource languages. We show that an effective romanization model can be created by fine-tuning a base model trained on a large corpus of one or more other languages. The writing system appears to play a role in the effectiveness of some fine-tuned models. We also present methods for assessing the quality of training and evaluation data, and compare our best-performing Arabic model with a reference model.

LLM-Based Product Recommendation with Prospect Theoretic Self Alignment Strategy
Manying Zhang | Zehua Cheng | Damien Nouvel
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Accurate and personalized product recommendation is central to user satisfaction in e-commerce. However, a persistent language gap often exists between user queries and product titles or descriptions. While traditional user behavior-based recommenders and LLM-based Retrieval-Augmented Generation systems typically optimize for maximum likelihood objectives, they may struggle to bridge this gap or capture users’ true intent. In this paper, we propose a strategy based on Prospect Theoretic Self-Alignment that reframes LLM-based recommendation as a utility-driven process. Given a user query and a set of candidate products, our model acts as a seller who anticipates latent user needs and generates product descriptions tailored to the user’s perspective. Simultaneously, it simulates user decision-making utility to assess whether the generated content would lead to a purchase. This self-alignment is achieved through a training strategy grounded in Kahneman & Tversky’s prospect theory, ensuring that recommendations are optimized for perceived user value rather than likelihood alone. Experiments on real-world product data demonstrate substantial improvements in intent alignment and recommendation quality, validating the effectiveness of our approach in producing personalized and decision-aware recommendations.

Apprentissage Actif à l’ère des Grands Modèles de Langue (LLMs)
Shami Thirion Sen | Rime Abrougui | Guillaume Lechien | Damien Nouvel
Actes de la session industrielle de CORIA-TALN 2025

In NLP, model performance depends heavily on the quality and quantity of annotated data. When these resources are limited, active learning offers an effective solution by selecting the most relevant samples to annotate. Traditionally, this task is carried out by human annotators, but here we explore the potential of the large language model Mixtral-8x7B to generate these annotations automatically. We analyze the influence of data augmentation in an active learning process for named entity recognition, as well as the impact of the prompt and hyperparameters on annotation quality. Evaluations conducted on the WiNER corpus show that, despite the absence of manual annotations, this approach achieves performance comparable to our baseline while reducing the amount of data by 80%.

2024

Bi-dialectal ASR of Armenian from Naturalistic and Read Speech
Malajyan Arthur | Victoria Khurshudyan | Karen Avetisyan | Hossep Dolatian | Damien Nouvel
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

The paper explores the development of Automatic Speech Recognition (ASR) models for Armenian, by using data from two standard dialects (Eastern Armenian and Western Armenian). The goal is to develop a joint bi-variational model. We achieve state-of-the-art results. Results from our ASR experiments demonstrate the impact of dataset selection and data volume on model performance. The study reveals limited transferability between dialects, although integrating datasets from both dialects enhances overall performance. The paper underscores the importance of dataset diversity and volume in ASR model training for under-resourced languages like Armenian.

2023

Ertim at SemEval-2023 Task 2: Fine-tuning of Transformer Language Models and External Knowledge Leveraging for NER in Farsi, English, French and Chinese
Kevin Deturck | Pierre Magistry | Bénédicte Diot-Parvaz Ahmad | Ilaine Wang | Damien Nouvel | Hugo Lafayette
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Transformer language models are now a solid baseline for Named Entity Recognition and can be significantly improved by leveraging complementary resources, either by integrating external knowledge or by annotating additional data. In a preliminary step, this work presents experiments on fine-tuning transformer models. Then, a set of experiments has been conducted with a Wikipedia-based reclassification system. Additionally, we conducted a small annotation campaign on the Farsi language to evaluate the impact of additional data. These two methods with complementary resources showed improvements compared to fine-tuning only.

2022

Proceedings of the Workshop on Processing Language Variation: Digital Armenian (DigitAm) within the 13th Language Resources and Evaluation Conference
Victoria Khurshudyan | Nadi Tomeh | Damien Nouvel | Anaid Donabedian | Chahan Vidal-Gorene
Proceedings of the Workshop on Processing Language Variation: Digital Armenian (DigitAm) within the 13th Language Resources and Evaluation Conference

Annotation of Messages from Social Media for Influencer Detection
Kevin Deturck | Damien Nouvel | Namrata Patel | Frédérique Segond
Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022

To develop an influencer detection system, we designed an influence model based on the analysis of conversations in the “Change My View” debate forum. This led us to identify enunciative features (argumentation, emotion expression, view change, ...) related to influence between participants. In this paper, we present the annotation campaign we conducted to build a reference corpus on these enunciative features. The annotation task was to identify in social media posts the text segments that corresponded to each enunciative feature. The posts to be annotated were extracted from two social media platforms: the “Change My View” debate forum, with discussions on various topics, and Twitter, with posts from users identified as supporters of ISIS (Islamic State of Iraq and Syria). Over a thousand posts were double or triple annotated across five annotation sessions involving a total of 27 annotators. Some of the sessions involved the same annotators, which allowed us to analyse the evolution of their annotation work. Most of the sessions resulted in a reconciliation phase between the annotators, allowing for discussion and iterative improvement of the guidelines. We measured and analysed inter-annotator agreement over the course of the sessions, which allowed us to validate our iterative approach.

Détection des influenceurs dans des médias sociaux par une approche hybride (Influencer detection in social media, a hybrid approach)
Kevin Deturck | Damien Nouvel | Namrata Patel | Frederique Segond
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Social influence is an important phenomenon in various fields, such as economics and politics, that has gained resonance with the popularity of social media, notably social networks and forums. Most work on this topic proposes approaches based on theories from the humanities (sociology, linguistics), and on network analysis techniques (propagation and centrality measures) or NLP. In this article, we present an influence model inspired by work in social psychology, on which we build a system combining an NLP module that detects messages reflecting influence processes with a centrality analysis of the transmission of these messages. Our experiments on the Change My View debate forum show that the hybrid approach, compared with centrality alone, helps to better detect influencers.

2021

Toward Creation of Ancash Lexical Resources from OCR
Johanna Cordova | Damien Nouvel
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

The Quechua linguistic family has a limited number of NLP resources, most of them dedicated to Southern Quechua, whereas the varieties of Central Quechua have, to the best of our knowledge, no specific resources (software, lexicon or corpus). Our work addresses this issue by producing two resources for Ancash Quechua: a full digital version of a dictionary, and an OCR model adapted to this variety. In this paper, we describe the steps towards this goal: we first measure the performance of existing models on the task of digitising a Quechua dictionary, then adapt a model for the Ancash variety, and finally create a reliable resource for NLP in XML-TEI format. We hope that this work will be a basis for initiating NLP projects for Central Quechua, and that it will encourage digitisation initiatives for under-resourced languages.

2020

NLU-Co at SemEval-2020 Task 5: NLU/SVM Based Model Apply to Characterise and Extract Counterfactual Items on Raw Data
Elvis Mboning Tchiaze | Damien Nouvel
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this article, we address the classification of counterfactual statements and the extraction of antecedents and consequences in raw data, using on the one hand Support Vector Machines (SVMs) and on the other hand Natural Language Understanding (NLU) infrastructures available on the market for conversational agents. Our experiments allowed us to test different pipelines of two known platforms (Snips NLU and Rasa NLU). The results show that a Rasa NLU pipeline, built with a well-preprocessed dataset and tuned algorithms, can accurately model the structure of a counterfactual event, which facilitates the identification and extraction of its components.

2018

Apprentissage déséquilibré pour la détection des signaux de l’implication durable dans les conversations en parfumerie (Automatic detection of positive enduring involvement signals in fragrance products reviews)
Yizhe Wang | Damien Nouvel | Gaël Patin | Marguerite Leenhardt
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Simply detecting positive or negative opinions no longer satisfies researchers and companies. The business world is looking for “business insight”. Many methods can be used to address the problem. However, their performance may degrade when classes are imbalanced. Our work focuses on studying techniques for handling imbalanced data in the fragrance domain. Five methods were compared: SMOTE, ADASYN, Tomek links, SMOTE-TL and class weight modification. The chosen learning algorithm is the SVM, and evaluation is carried out by computing precision, recall and F-measure scores. According to the experimental results, adjusting the weights on misclassification costs with an SVM yields our best F-measure.

2017

A Bambara Tonalization System for Word Sense Disambiguation Using Differential Coding, Segmentation and Edit Operation Filtering
Luigi Yu-Cheng Liu | Damien Nouvel
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In many languages such as Bambara or Arabic, tone markers (diacritics) may be written but are in practice often omitted. NLP applications are then confronted with ambiguities and subsequent difficulties when processing texts. To circumvent this problem, tonalization may be used as a word sense disambiguation task, relying on context to add diacritics that partially disambiguate words as well as senses. In this paper, we describe our implementation of a Bambara tonalizer that adds tone markers using machine learning (CRFs). To make our tool efficient, we used differential coding, word segmentation and edit operation filtering. We describe our approach, which allows tractable machine learning and improves accuracy: our model may be learned within minutes on a 358K-word corpus and reaches 92.3% accuracy.

Une approche linguistique pour la détection des dialectes arabes (A linguistic approach for the detection of Arabic dialects)
Houda Saâdane | Damien Nouvel | Hosni Seffih | Christian Fluhr
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

In this article, we present a process for automatically identifying the dialectal origin of Arabic texts written in Arabic script or in Latin script (Arabizi). We describe the annotation process for the resources built and the transliteration system adopted. Two language identification approaches are compared: the first is linguistic and relies on dictionaries, the second is statistical and relies on traditional machine learning methods (n-grams). The evaluation of these approaches shows that the linguistic method gives satisfactory results without depending on training corpora.

2016

Named Entity Resources - Overview and Outlook
Maud Ehrmann | Damien Nouvel | Sophie Rosset
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Recognition of real-world entities is crucial for most NLP applications. Since its introduction some twenty years ago, named entity processing has undergone a significant evolution with, among others, the definition of new tasks (e.g. entity linking) and the emergence of new types of data (e.g. speech transcriptions, micro-blogging). These certainly pose new challenges which affect not only methods and algorithms but especially linguistic resources. Where do we stand with respect to named entity resources? This paper aims at providing a systematic overview of named entity resources, accounting for qualities such as multilingualism, dynamicity and interoperability, and at identifying shortfalls in order to guide future developments.

ReadME generation from an OWL ontology describing NLP tools
Driss Sadoun | Satenik Mkhitaryan | Damien Nouvel | Mathieu Valette
Proceedings of the 2nd International Workshop on Natural Language Generation and the Semantic Web (WebNLG 2016)

The MultiTal NLP tool infrastructure
Driss Sadoun | Satenik Mkhitaryan | Damien Nouvel | Mathieu Valette
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

This paper gives an overview of the MultiTal project, which aims to create a research infrastructure that ensures long-term distribution of NLP tool descriptions. The goal is to make NLP tools more accessible and usable to end-users from different disciplines. The infrastructure is built on a metadata scheme that models and standardises multilingual NLP tool documentation. The model is conceptualised using an OWL ontology. The formal representation of the ontology allows us to automatically generate organised and structured documentation in different languages for each represented tool.

2015

Proposition méthodologique pour la détection automatique de Community Manager. Étude multilingue sur un corpus relatif à la Junk Food
Johan Ferguth | Aurélie Jouannet | Asma Zamiti | Yunhe Wu | Jia Li | Antonina Bondarenko | Damien Nouvel | Mathieu Valette
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this article, we present a methodology for identifying messages suspected of being produced by Community Managers for disguised commercial purposes in Web 2.0 documents. The application domain is junk food and the corpus is multilingual (English, Chinese, French). We first set out our strategy for building and annotating our corpora, detailing in particular our annotation guide, and then develop the method adopted, which combines a textometric analysis with supervised learning.

2013

Dynamic extension of a French morphological lexicon based on a text stream (Extension dynamique de lexiques morphologiques pour le français à partir d’un flux textuel) [in French]
Benoît Sagot | Damien Nouvel | Virginie Mouilleron | Marion Baranes
Proceedings of TALN 2013 (Volume 1: Long Papers)

Fouille de règles d’annotation pour la reconnaissance d’entités nommées [Annotation rule mining for named entity recognition]
Damien Nouvel | Jean-Yves Antoine | Nathalie Friburger | Arnaud Soulet
Traitement Automatique des Langues, Volume 54, Numéro 2 : Entités Nommées [Named Entities]

Supervised learning on encyclopaedic resources for the extension of a lexicon of proper names dedicated to the recognition of named entities (Apprentissage supervisé sur ressources encyclopédiques pour l’enrichissement d’un lexique de noms propres destiné à la reconnaissance des entités nommées) [in French]
Nadia Okinina | Damien Nouvel | Nathalie Friburger | Jean-Yves Antoine
Proceedings of TALN 2013 (Volume 2: Short Papers)

Mining Partial Annotation Rules for Named Entity Recognition (Fouille de règles d’annotation partielles pour la reconnaissance des entités nommées) [in French]
Damien Nouvel | Jean-Yves Antoine | Nathalie Friburger | Arnaud Soulet
Proceedings of TALN 2013 (Volume 1: Long Papers)

2012

Coupling Knowledge-Based and Data-Driven Systems for Named Entity Recognition
Damien Nouvel | Jean-Yves Antoine | Nathalie Friburger | Arnaud Soulet
Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data

2011

Cascades de transducteurs autour de la reconnaissance des entités nommées [CasEN: a transducer cascade to recognize French Named Entities]
Denis Maurel | Nathalie Friburger | Jean-Yves Antoine | Iris Eshkol-Taravella | Damien Nouvel
Traitement Automatique des Langues, Volume 52, Numéro 1 : Varia [Varia]

2010

Reconnaissance d’entités nommées : enrichissement d’un système à base de connaissances à partir de techniques de fouille de textes
Damien Nouvel | Arnaud Soulet | Jean-Yves Antoine | Nathalie Friburger | Denis Maurel
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this article, we present and analyze the results of the CasEN named entity recognition system during its participation in the Ester2 evaluation campaign. We identify the main difficulties for our system: out-of-vocabulary words, metonymy, and named entity boundaries. We then propose an approach to improve the performance of knowledge-based systems, using exhaustive sequential data mining techniques to extract patterns that represent the linguistic structures at play in named entity recognition. Finally, we describe the experiment conducted to this end, give the results obtained so far, and provide a first analysis.

An Analysis of the Performances of the CasEN Named Entities Recognition System in the Ester2 Evaluation Campaign
Damien Nouvel | Jean-Yves Antoine | Nathalie Friburger | Denis Maurel
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a detailed and critical analysis of the behaviour of the CasEN named entity recognition system during the French Ester2 evaluation campaign. In this project, CasEN was confronted with the task of detecting and categorizing named entities in manual and automatic transcriptions of radio broadcasts. First, we give a general presentation of the Ester2 campaign. Then, we describe our system, which is based on transducers. Next, we describe how systems were evaluated during this campaign and report the main official results. Afterwards, we investigate in detail the influence of some annotation biases which significantly affected the estimation of system performance. Finally, we conduct an in-depth analysis of the actual errors of the CasEN system, which gives useful indications about the phenomena that gave rise to errors (e.g. metonymy, encapsulation, detection of right boundaries) and that remain challenges for named entity recognition systems.