Benoit Favre - ACL Anthology

Benoit Favre

Also published as: Benoît Favre

2026

Do audio and visual tokenizers capture backchannels?
Benoit Favre | Auriane Boudin
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology

Audio and video tokenizers are autoencoders trained to represent the content of recordings as a sequence of vectors. They are prevalently used to interface large language models with non-textual modalities. While they allow advanced applications such as video generation, the envelope of their limitations is not known in the context of multimodal conversation. This work focuses on backchannels, which listeners use to signal to the speaker that they are listening. This feedback is essential to maintain the conversation flow. We evaluate whether a representative set of audio and video tokenizers encode backchannels using linear probing. Results show that although audio tokenizers capture the phenomenon relatively well, backchannels are not linearly separated by video tokenizers. However, joint representations resulting from concatenating representations in both modalities improve accuracy significantly over audio-only representations, suggesting to train multimodal tokenizers.

Mixed-Initiative Dialogue Management for Human-Virtual Agents Interaction in Forum Theatre Inspired Training
Samuel Otofa | Yacine Zerenini | Frederic Bechet | Benoit Favre | Jean-Marie Pergandi | Magalie Ochs
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology

This work presents a virtual reality (VR) training tool designed to raise awareness of social discrimination (ethnic and gender-based) and to train individuals to respond effectively when witnessing such situations. Inspired by Augusto Boal’s forum theatre, the system recreates interactive scenarios of discrimination using autonomous virtual agents. The user first observes a discriminatory scene, then analyzes it through an interaction with a virtual conversational agent, and finally replays the scene by embodying the discriminated character to explore alternative reactions. From a dialogue system perspective, the project introduces a hybrid dialogue management architecture combining state-based control with Large Language Model (LLM)-driven open dialogue. This mixed-initiative approach allows the system to manage structured training sequences while supporting flexible, context-aware interactions on sensitive topics. The demonstrator illustrates this approach through a case of ordinary sexism in a professional setting, highlighting the potential of spoken dialogue systems in VR for experiential learning and social behavior training.

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
Ikram Belmadani | Oumaima El Khettari | Pacôme Constant dit Beaufils | Richard Dufour | Benoit Favre
Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026)

Automatic evaluation of open-ended question answering in specialized domains remains challenging mainly because it relies on manual annotations from domain experts. In this work, we assess the ability of several large language models (LLMs), including closed-access (GPT-5.1, Gemini-2.5-Pro), open-source general-purpose (Qwen-80B), and biomedical domain-adapted models (MedGemma-27B, Phi-3.5-mini variants), to act as automatic evaluators of semantic equivalence in French medical open-ended QA. Our analysis reveals that LLM-based judgments are sensitive to the source of answer generation: judgement correlation varies substantially across different generator models. Among the judges, MedGemma-27B and Qwen-80B achieve the highest agreement with expert annotations in terms of F1 score and Pearson correlation. We further explore lightweight adaptation strategies on Phi-3.5-mini using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). Even with 184 training instances, these adaptations significantly improve Phi-3.5’s results and reduce variability across answer generators, achieving performance comparable to larger domain-adapted models. Our results highlight the importance of generator-aware evaluation, the limitations of general-purpose LLMs in domain-specific settings, and the effectiveness of lightweight adaptation for compact models in low-resource scenarios.

2025

Actes de l'atelier Évaluation des modèles génératifs (LLM) et challenge 2025 (EvalLLM)
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes de l'atelier Évaluation des modèles génératifs (LLM) et challenge 2025 (EvalLLM)

Actes de l'atelier Avancement de l’AMR et de l’Analyse Sémantique 2025 (4AS)
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes de l'atelier Avancement de l’AMR et de l’Analyse Sémantique 2025 (4AS)

Actes de l'atelier Intelligence Artificielle générative et ÉDUcation : Enjeux, Défis et Perspectives de Recherche 2025 (IA-ÉDU)
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes de l'atelier Intelligence Artificielle générative et ÉDUcation : Enjeux, Défis et Perspectives de Recherche 2025 (IA-ÉDU)

Adaptation des connaissances médicales pour les grands modèles de langue : Stratégies et analyse comparative
Ikram Belmadani | Benoit Favre | Richard Dufour | Frédéric Béchet | Carlos Ramisch
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

Cet article présente une étude sur l’adaptation des grands modèles de langue (LLMs) à des domaines spécialisés disposant de données limitées. Bien que certaines recherches remettent en question le pré-entraînement adaptatif (DAPT) dans le contexte médical en anglais, nous montrons que l’adaptation au domaine peut être efficace sous certaines conditions. En prenant comme exemple l’adaptation au domaine médical en français, nous comparons de manière systématique le pré-entraînement continu (CPT), l’affinage supervisé (SFT) et une approche combinée (CPT suivi de SFT). Nos résultats indiquent que l’adaptation d’un modèle généraliste à de nouvelles données dans le domaine médical offre des améliorations notables (taux de réussite de 87%), tandis que l’adaptation supplémentaire de modèles déjà familiarisés avec ce domaine procure des bénéfices limités. Bien que CPT+SFT offre les meilleures performances globales, SFT-seul présente des résultats solides et requiert moins de ressources matérielles.

Actes des 18e Rencontres Jeunes Chercheurs en RI (RJCRI) et 27ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL)
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes des 18e Rencontres Jeunes Chercheurs en RI (RJCRI) et 27ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL)

Actes de l'atelier Accès à l’information basé sur le dialogue et grands modèles de langage 2025 (DIAG-LLM)
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes de l'atelier Accès à l’information basé sur le dialogue et grands modèles de langage 2025 (DIAG-LLM)

Actes de la session industrielle de CORIA-TALN 2025
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes de la session industrielle de CORIA-TALN 2025

Actes de l'atelier Traitement de données langagières dynamiques par les outils et méthodes du TAL 2025 (DYN-TAL)
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes de l'atelier Traitement de données langagières dynamiques par les outils et méthodes du TAL 2025 (DYN-TAL)

Étude comparative de réponses humaines et de grands modèles de langue à des QCM en pharmacie
Ricardo Rodriguez | Stéphane Huet | Benoît Favre | Mickael Rouvier
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

Cet article propose d’étudier les réponses générées par plusieurs Grands Modèles de Langue à un ensemble de Questions à Choix Multiple en pharmacie. Ces réponses sont comparées aux réponses données par des étudiants, afin de comprendre quelles sont les questions difficiles pour les modèles par rapport aux humains et pour quelles raisons. Nous utilisons les logits internes des modèles pour construire des distributions de probabilité et analyser les caractéristiques principales qui déterminent la difficulté des questions via une approche statistique. Nous apportons aussi une extension du jeu de données FRENCH MEDMCQA avec des paires question-réponses en pharmacie, enrichies avec les réponses des étudiants, la ponctuation assignée aux réponses, les thématiques cliniques correspondantes et des annotations manuelles sur la structure et certains traits sémantiques des questions.

Actes de l'atelier Science Participative pour les Données et Corpus Linguistiques 2025 (ParCol)
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes de l'atelier Science Participative pour les Données et Corpus Linguistiques 2025 (ParCol)

Comparative Analysis of Human and Large Language Model Performance in Pharmacology Multiple-Choice Questions
Ricardo Rodriguez | Stéphane Huet | Benoît Favre | Mickael Rouvier
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

In this article, we study the answers generated by a selection of Large Language Models to a set of Multiple Choice Questions in Pharmacology, and compare them to the answers provided by students, to understand which questions in this clinical domain are difficult for the models when compared to humans and why. We extract the internal logits to infer probability distributions and analyse the main features that determine the difficulty of questions using statistical methods. We also provide an extension to the FrenchMedMCQA dataset, with pairs of question-answers in pharmacology, enriched with student response rate, answer scoring, clinical topics, and annotations on question structure and semantics.

Actes de l'atelier Ethic and Alignment of (Large) Language Models 2025 (EALM)
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes de l'atelier Ethic and Alignment of (Large) Language Models 2025 (EALM)

Implicit Hate Target Span Detection in Zero- and Few-Shot Settings with Selective Sub-Billion Parameter Models
Hossam Boudraa | Benoit Favre | Raquel Urena
Proceedings of the The 9th Workshop on Online Abuse and Harms (WOAH)

This work investigates the effectiveness of masked language models (MLMs) and autoregressive language models (LLMs) with fewer than one billion parameters in the detection of implicit hate speech through fine-grained span identification. The evaluation spans zero-shot, few-shot, and full supervision settings across two core benchmarks—SBIC and IHC—and an auxiliary testbed, OffensiveLang.RoBERTa-Large-355M emerges as the strongest zero-shot model, achieving the highest F1 scores of 75.8 (SBIC) and 72.5 (IHC), outperforming larger models like LLaMA 3.2-1B. ModernBERT-125M closely matches this performance with scores of 75.1 and 72.2, demonstrating the advantage of architectural efficiency. Among instruction-tuned models, SmolLM2-135M Instruct and LLaMA 3.2 1B Instruct consistently outperform their non-instructed counterparts, with up to +2.3 F1 gain on SBIC and +1.7 on IHC. Interestingly, the larger SmolLM2-360M Instruct does not outperform the 135M variant, highlighting that model scale does not always correlate with performance in implicit hate detection tasks.Few-shot fine-tuning with SmolLM2-135M Instruct achieves F1 scores of 68.2 (SBIC) and 64.0 (IHC), trailing full-data fine-tuning by only 1.6 and 2.0 points, respectively, with accuracy drops under 0.5 points. This illustrates the promise of compact, instruction-aligned models in data-scarce settings, particularly when optimized with Low-Rank Adaptation (LoRA).Topic-guided error analysis using Latent Dirichlet Allocation (LDA) reveals recurring model failures in ideologically charged or euphemistic discourse. Misclassifications often involve neutral references to identity, politics, or advocacy language, underscoring current limitations in discourse-level inference and sociopragmatic understanding.

Traitement Automatique des Langues, Volume 66, Numéro 1
Maxime Amblard | Marie Candito | Cécile Fabre | Benoît Favre | Aurélie Névéol | Sophie Rosset
Traitement Automatique des Langues, Volume 66, Numéro 1

Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Actes de l'atelier Traitement du langage médical à l’époque des LLMs 2025 (MLP-LLM)
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes de l'atelier Traitement du langage médical à l’époque des LLMs 2025 (MLP-LLM)

Estimation de l’inclusion entre tâches par projection spectrale de vecteurs de tâches
Loïc Fosse | Benoît Favre | Frédéric Béchet | Géraldine Damnati | Gwénolé Lecorvé
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

L’affinage des modèles a permis la plupart des avancées significatives récentes dans les tâches de TALN. Des études ont exploré les raisons de ces succès en étudiant le mécanisme d’attention, la manière dont les connaissances linguistiques et factuelles sont encodées, etc... . Il est cependant difficile d’interpréter les changements causés par l’affinage dans les poids des modèles. Pour mieux comprendre cela, nous proposons une méthode fondée théoriquement pour projeter et comparer les changements de poids (i.e. vecteurs de tâches) dans un espace à faible dimension. Cette approche permet de mieux comprendre les connaissances encodées dans un vecteur de tâches, relativement à un autre vecteur de tâche. Nous validons notre méthode en montrant qu’un modèle affiné sur une tâche de résumé encode des informations sur la reconnaissance d’entités nommées.

Statistical Deficiency for Task Inclusion Estimation
Loïc Fosse | Frederic Bechet | Benoit Favre | Géraldine Damnati | Gwénolé Lecorvé | Maxime Darrin | Philippe Formont | Pablo Piantanida
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Tasks are central in machine learning, as they are the most natural objects to assess the capabilities of current models. The trend is to build general models able to address any task. Even though transfer learning and multitask learning try to leverage the underlying task space, no well-founded tools are available to study its structure. This study proposes a theoretically grounded setup to define the notion of task and to compute the inclusion between two tasks from a statistical deficiency point of view. We propose a tractable proxy as information sufficiency to estimate the degree of inclusion between tasks, show its soundness on synthetic data, and use it to reconstruct empirically the classic NLP pipeline.

Actes de la 20e Conférence en Recherche d’Information et Applications (CORIA)
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes de la 20e Conférence en Recherche d’Information et Applications (CORIA)

Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux
Frédéric Bechet | Adrian-Gabriel Chifu | Karen Pinel-sauvagnat | Benoit Favre | Eliot Maes | Diana Nurbakova
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

2024

Actes du Défi Fouille de Textes@TALN 2024
Richard Dufour | Benoit Favre | Mickael Rouvier | Adrien Bazoge | Yanis Labrak
Actes du Défi Fouille de Textes@TALN 2024

Tâches et systèmes de sélection automatique de réponses à des QCM dans le domaine médical : Présentation de la campagne DEFT 2024
Adrien Bazoge | Yanis Labrak | Richard Dufour | Benoit Favre | Mickael Rouvier
Actes du Défi Fouille de Textes@TALN 2024

L’édition 2024 du DÉfi Fouille de Textes (DEFT) met l’accent sur le développement de méthodes pour la sélection automatique de réponses pour des questions à choix multiples (QCM) en français. Les méthodes sont évaluées sur un nouveau sous-ensemble du corpus FrenchMedMCQA, comprenant 3 105 questions fermées avec cinq options chacune, provenant des archives d’examens de pharmacie. Dans la première tâche, les participants doivent se concentrer sur des petits modèles de langue (PML) avec moins de 3 milliards de paramètres et peuvent également utiliser les corpus spécifiques au domaine médical NACHOS et Wikipedia s’ils souhaitent appliquer des approches du type Retrieval-Augmented Generation (RAG). La second tâche lève la restriction sur la taille des modèles de langue. Les résultats, mesurés par l’Exact Match Ratio (EMR), varient de 1,68 à 11,74 tandis que les performances selon le score de Hamming vont de 28,75 à 49,15 pour la première tâche. Parmi les approches proposées par les cinq équipes participantes, le meilleur système utilise une chaîne combinant un classifieur CamemBERT-bio pour identifier le type de question et un système RAG fondé sur Apollo 2B, affiné avec la méthode d’adaptation LoRA sur les données de l’année précédente.

Approche multitâche pour l’amélioration de la fiabilité des systèmes de résumé automatique de conversation
Eunice Akani | Benoit Favre | Frederic Bechet | Romain Gemignani
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Le résumé de dialogue consiste à générer un résumé bref et cohérent d’une conversation ou d’un dialogue entre deux ou plusieurs locuteurs. Même si les modèles de langue les plus récents ont permis des progrès remarquables dans ce domaine, générer un résumé fidèle au dialogue de départ reste un défi car cela nécessite de prendre en compte l’interaction entre les locuteurs pour conserver les informations les plus pertinentes du dialogue. Nous nous plaçons dans le cadre des dialogues humain-humain avec but. Ce cadre nous permet d’intégrer des informations relatives à la tâche dans le cadre du résumé de dialogue afin d’aider le système à générer des résumés plus fidèles sémantiquement. Nous évaluons dans cette étude des approches multitâches permettant de lier la tâche de résumé à des tâches de compréhension du langage comme la détection de motifs d’appels. Les informations liées à la tâche nous permettent également de proposer des nouvelles méthodes de sélection de résumés basées sur l’analyse sémantique du dialogue ainsi que des métriques d’évaluation basées également sur cette même analyse. Nous avons testé ces méthodes sur DECODA, un corpus français de dialogue collecté dans le centre d’appel de la RATP entre des usagers et des téléconseillers. Nous montrons que l’ajout d’informations liées à la tâche augmente la fiabilité des résumés générés.

CHICA: A Developmental Corpus of Child-Caregiver’s Face-to-face vs. Video Call Conversations in Middle Childhood
Dhia Elhak Goumri | Abhishek Agrawal | Mitja Nikolaus | Hong Duc Thang Vu | Kübra Bodur | Elias Emmar | Cassandre Armand | Chiara Mazzocconi | Shreejata Gupta | Laurent Prévot | Benoit Favre | Leonor Becerra-Bonache | Abdellah Fourtassi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Existing studies of naturally occurring language-in-interaction have largely focused on the two ends of the developmental spectrum, i.e., early childhood and adulthood, leaving a gap in our knowledge about how development unfolds, especially across middle childhood. The current work contributes to filling this gap by introducing CHICA (for Child Interpersonal Communication Analysis), a developmental corpus of child-caregiver conversations at home, involving groups of French-speaking children aged 7, 9, and 11 years old. Each dyad was recorded twice: once in a face-to-face setting and once using computer-mediated video calls. For the face-to-face settings, we capitalized on recent advances in mobile, lightweight eye-tracking and head motion detection technology to optimize the naturalness of the recordings, allowing us to obtain both precise and ecologically valid data. Further, we mitigated the challenges of manual annotation by relying – to the extent possible – on automatic tools in speech processing and computer vision. Finally, to demonstrate the richness of this corpus for the study of child communicative development, we provide preliminary analyses comparing several measures of child-caregiver conversational dynamics across developmental age, modality, and communicative medium. We hope the current corpus will allow new discoveries into the properties and mechanisms of multimodal communicative development across middle childhood.

ChiCA: un corpus de conversations face-à-face vs. Zoom entre enfants et parents
Dhia Elhak Goumri | Abhishek Agrawal | Mitja Nikolaus | Hong Duc Thang Vu | Kübra Bodur | Elias Semmar | Cassandre Armand | Chiara Mazzocconi | Shreejata Gupta | Laurent Prévot | Benoit Favre | Leonor Becerra-Bonache | Abdellah Fourtassi
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 2 : traductions d'articles publiès

Les études existantes sur la parole en interaction naturelle se sont principalement concentrées sur les deux extrémités du spectre développemental, c’est-à-dire la petite enfance et l’âge adulte, laissant un vide dans nos connaissances sur la manière dont se déroule le développement, en particulier pendant l’age scolaire (6 à 11 ans). Le travail actuel contribue à combler cette lacune en introduisant un corpus développemental de conversations entre enfants et parents à domicile, impliquant des groupes d’enfants âgés de 7, 9 et 11 ans dont la langue maternelle est le français. Chaque dyade a été enregistrée deux fois: une fois en face-à-face et une fois en utilisant des appels vidéo par ordinateur. Pour les paramètres en face-à-face, nous avons capitalisé sur les progrès récents en matière de technologie de suivi oculaire mobile et de détection des mouvements de la tête pour optimiser le caractère naturel des enregistrements, nous permettant d’obtenir à la fois des données précises et écologiquement valides. De plus, nous avons contourné les difficultés de l’annotation manuelle en nous appuyant, dans la mesure du possible, sur des outils automatiques de traitement de la parole et de vision par ordinateur. Enfin, pour démontrer la richesse de ce corpus pour l’étude du développement communicatif de l’enfant, nous fournissons des analyses préliminaires comparant plusieurs mesures de la dynamique conversationnelle entre l’enfant et le parent selon l’âge, la modalité et le support communicatif. Nous espérons que le travail actuel ouvrira la voie à de futures découvertes sur les propriétés et les mécanismes du développement communicatif multimodal pendant l’age scolaire de l’enfant.

Automatic Coding of Contingency in Child-Caregiver Conversations
Abhishek Agrawal | Mitja Nikolaus | Benoit Favre | Abdellah Fourtassi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

One of the most important communicative skills children have to learn is to engage in meaningful conversations with people around them. At the heart of this learning lies the mastery of contingency, i.e., the ability to contribute to an ongoing exchange in a relevant fashion (e.g., by staying on topic). Current research on this question relies on the manual annotation of a small sample of children, which limits our ability to draw general conclusions about development. Here, we propose to mitigate the limitations of manual labor by relying on automatic tools for contingency judgment in children’s early natural interactions with caregivers. Drawing inspiration from the field of dialogue systems evaluation, we built and compared several automatic classifiers. We found that a Transformer-based pre-trained language model – when fine-tuned on a relatively small set of data we annotated manually (around 3,500 turns) – provided the best predictions. We used this model to automatically annotate, new and large-scale data, almost two orders of magnitude larger than our fine-tuning set. It was able to replicate existing results and generate new data-driven hypotheses. The broad impact of the work is to provide resources that can help the language development community study communicative development at scale, leading to more robust theories.

2023

LIS@DEFT’23 : les LLMs peuvent-ils répondre à des QCM ? (a) oui; (b) non; (c) je ne sais pas.
Benoit Favre
Actes de CORIA-TALN 2023. Actes du Défi Fouille de Textes@TALN2023

Cet article présente un ensemble d’expériences sur la tâche de réponse à des questions à choix multiple de DEFT 2023. Des grands modèles de langage sont amorcés avec les questions afin de collecter les réponses générées. Les résultats montrent que les modèles ouverts sans affinage obtiennent des performances similaires à celles d’un système supervisé fondé sur BERT, et que l’affinage sur les données de la tâche apporte des améliorations.

Reducing named entity hallucination risk to ensure faithful summary generation
Eunice Akani | Benoit Favre | Frederic Bechet | Romain Gemignani
Proceedings of the 16th International Natural Language Generation Conference

The faithfulness of abstractive text summarization at the named entities level is the focus of this study. We propose to add a new criterion to the summary selection method based on the “risk” of generating entities that do not belong to the source document. This method is based on the assumption that Out-Of-Document entities are more likely to be hallucinations. This assumption was verified by a manual annotation of the entities occurring in a set of generated summaries on the CNN/DM corpus. This study showed that only 29% of the entities outside the source document were inferrable by the annotators, leading to 71% of hallucinations among OOD entities. We test our selection method on the CNN/DM corpus and show that it significantly reduces the hallucination risk on named entities while maintaining competitive results with respect to automatic evaluation metrics like ROUGE.

OLISIA: a Cascade System for Spoken Dialogue State Tracking
Léo Jacqmin | Lucas Druart | Yannick Estève | Benoît Favre | Lina M Rojas | Valentin Vielzeuf
Proceedings of the Eleventh Dialog System Technology Challenge

Though Dialogue State Tracking (DST) is a core component of spoken dialogue systems, recent work on this task mostly deals with chat corpora, disregarding the discrepancies between spoken and written language. In this paper, we propose OLISIA, a cascade system which integrates an Automatic Speech Recognition (ASR) model and a DST model. We introduce several adaptations in the ASR and DST modules to improve integration and robustness to spoken conversations. With these adaptations, our system ranked first in DSTC11 Track 3, a benchmark to evaluate spoken DST. We conduct an in-depth analysis of the results and find that normalizing the ASR outputs and adapting the DST inputs through data augmentation, along with increasing the pre-trained models size all play an important role in reducing the performance discrepancy between written and spoken conversations.

2022

Abstraction ou hallucination ? État des lieux et évaluation du risque pour les modèles de génération de résumés automatiques de type séquence-à-séquence (Abstraction or Hallucination ? Status and Risk assessment for sequence-to-sequence Automatic)
Eunice Akani | Benoit Favre | Frederic Bechet
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

La génération de texte a récemment connu un très fort intérêt au vu des avancées notables dans le domaine des modèles de langage neuronaux. Malgré ces avancées, cette tâche reste difficile quand il s’agit d’un résumé automatique de texte par abstraction. Certains systèmes de résumés génèrent des textes qui ne sont pas forcément fidèles au document source. C’est sur cette thématique que porte notre étude. Nous présentons une typologie d’erreurs pour les résumés automatique et ainsi qu’une caractérisation du phénomène de l’abstraction pour les résumés de référence afin de mieux comprendre l’ampleur de ces différents phénomènes sur les entités nommées. Nous proposons également une mesure d’évaluation du risque d’erreur lorsqu’un système tente de faire des abstractions sur les entités nommées d’un document.

Using ASR-Generated Text for Spoken Language Modeling
Nicolas Hervé | Valentin Pelloin | Benoit Favre | Franck Dary | Antoine Laurent | Sylvain Meignier | Laurent Besacier
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models

This papers aims at improving spoken language modeling (LM) using very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR on 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or through training a LM from scratch. The new models (FlauBERT-Oral) will be shared with the community and are evaluated not only in terms of word prediction accuracy but also for two downstream tasks : classification of TV shows and syntactic parsing of speech. Experimental results show that FlauBERT-Oral is better than its initial FlauBERT version demonstrating that, despite its inherent noisy nature, ASR-Generated text can be useful to improve spoken language modeling.

“Do you follow me?”: A Survey of Recent Approaches in Dialogue State Tracking
Léo Jacqmin | Lina M. Rojas Barahona | Benoit Favre
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

While communicating with a user, a task-oriented dialogue system has to track the user’s needs at each turn according to the conversation history. This process called dialogue state tracking (DST) is crucial because it directly informs the downstream dialogue policy. DST has received a lot of interest in recent years with the text-to-text paradigm emerging as the favored approach. In this review paper, we first present the task and its associated datasets. Then, considering a large number of recent publications, we identify highlights and advances of research in 2021-2022. Although neural approaches have enabled significant progress, we argue that some critical aspects of dialogue systems such as generalizability are still underexplored. To motivate future studies, we propose several research avenues.

Simulation d’erreurs d’OCR dans les systèmes de TAL pour le traitement de données anachroniques (Simulation of OCR errors in NLP systems for processing anachronistic data)
Baptiste Blouin | Benoit Favre | Jeremy Auguste
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Atelier TAL et Humanités Numériques (TAL-HN)

L’extraction d’information offre de nouvelles perspectives au sein des recherches historiques. Cependant, la majorité des recherches liées à ce domaine s’effectue sur des données contemporaines. Malgré l’évolution constante des systèmes d’OCR, les textes historiques résultant de ce procédé contiennent toujours de multiples erreurs. Du fait d’un manque de ressources historiques dédiées au TAL, le traitement de ce domaine reste dépendant de l’utilisation de ressources contemporaines. De nombreuses études ont démontré l’impact négatif que pouvaient avoir les erreurs d’OCR sur les systèmes prêts à l’emploi contemporains. Mais l’évaluation des nouvelles architectures, proposant des résultats prometteurs sur des données récentes, face à ce problème reste encore très minime. Dans cette étude, nous quantifions l’impact des erreurs d’OCR sur trois tâches d’extraction d’information en utilisant plusieurs architectures de type Transformers. Au vu de ces résultats, nous proposons une approche permettant de réduire de plus de 50% cet impact sans avoir recours à des ressources historiques spécialisées.

Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Mitja Nikolaus | Emmanuelle Salin | Stephane Ayache | Abdellah Fourtassi | Benoit Favre
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks.Yet, the exact capabilities of these black-box models are still poorly understood. While much of previous work has focused on studying their ability to learn meaning at the word-level, their ability to track syntactic dependencies between words has received less attention.We take a first step in closing this gap by creating a new multimodal task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup.We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably, with some models performing relatively well and others at chance level. In an effort to explain this variability, our analyses indicate that the quality (and not only sheer quantity) of pretraining data is essential. Additionally, the best performing models leverage fine-grained multimodal pretraining objectives in addition to the standard image-text matching objectives.This study highlights that targeted and controlled evaluations are a crucial step for a precise and rigorous test of the multimodal knowledge of vision-and-language models.

Zero-Shot Aspect-Based Scientific Document Summarization using Self-Supervised Pre-training
Amir Soleimani | Vassilina Nikoulina | Benoit Favre | Salah Ait Mokhtar
Proceedings of the 21st Workshop on Biomedical Language Processing

We study the zero-shot setting for the aspect-based scientific document summarization task. Summarizing scientific documents with respect to an aspect can remarkably improve document assistance systems and readers experience. However, existing large-scale datasets contain a limited variety of aspects, causing summarization models to over-fit to a small set of aspects and a specific domain. We establish baseline results in zero-shot performance (over unseen aspects and the presence of domain shift), paraphrasing, leave-one-out, and limited supervised samples experimental setups. We propose a self-supervised pre-training approach to enhance the zero-shot performance. We leverage the PubMed structured abstracts to create a biomedical aspect-based summarization dataset. Experimental results on the PubMed and FacetSum aspect-based datasets show promising performance when the model is pre-trained using unlabelled in-domain data.

2021

Transferring Modern Named Entity Recognition to the Historical Domain: How to Take the Step?
Baptiste Blouin | Benoit Favre | Jeremy Auguste | Christian Henriot
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

Named entity recognition is of high interest to digital humanities, in particular when mining historical documents. Although the task is mature in the field of NLP, results of contemporary models are not satisfactory on challenging documents corresponding to out-of-domain genres, noisy OCR output, or old-variants of the target language. In this paper we study how model transfer methods, in the context of the aforementioned challenges, can improve historical named entity recognition according to how much effort is allocated to describing the target data, manually annotating small amounts of texts, or matching pre-training resources. In particular, we explore the situation where the class labels, as well as the quality of the documents to be processed, are different in the source and target domains. We perform extensive experiments with the transformer architecture on the LitBank and HIPE historical datasets, with different annotation schemes and character-level noise. They show that annotating 250 sentences can recover 93% of the full-data performance when models are pre-trained, that the choice of self-supervised and target-task pre-training data is crucial in the zero-shot setting, and that OCR errors can be handled by simulating noise on pre-training data and resorting to recent character-aware transformers.

2020

Analyse sémantique robuste par apprentissage antagoniste pour la généralisation de domaine (Robust Semantic Parsing with Adversarial Learning for Domain Generalization )
Gabriel Marzinotto | Géraldine Damnati | Frédéric Béchet | Benoît Favre
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 4 : Démonstrations et résumés d'articles internationaux

Nous présentons des résumés en français et en anglais de l’article (Marzinotto et al., 2019) présenté à la conférence North American Chapter of the Association for Computational Linguistics : Human Language Technologies en 2019.

Filtering conversations through dialogue acts labels for improving corpus-based convergence studies
Simone Fuscone | Benoit Favre | Laurent Prévot
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Cognitive models of conversation and research on user-adaptation in dialogue systems involves a better understanding of speakers convergence in conversation. Convergence effects have been established on controlled data sets, for various acoustic and linguistic variables. Tracking interpersonal dynamics on generic corpora has provided positive but more contrasted outcomes. We propose here to enrich large conversational corpora with dialogue act (DA) information. We use DA-labels as filters in order to create data sub sets featuring homogeneous conversational activity. Those data sets allow a more precise comparison between speakers’ speech variables. Our experiences consist of comparing convergence on low level variables (Energy, Pitch, Speech Rate) measured on raw data sets, with human and automatically DA-labelled data sets. We found that such filtering does help in observing convergence suggesting that studies on interpersonal dynamics should consider such high level dialogue activity types and their related NLP topics as important ingredients of their toolboxes.

Development of Multi-level Linguistic Alignment in Child-adult Conversations
Thomas Misiek | Benoit Favre | Abdellah Fourtassi
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Interactive alignment is a major mechanism of linguistic coordination. Here we study the way this mechanism emerges in development across the lexical, syntactic, and conceptual levels. We leverage NLP tools to analyze a large-scale corpus of child-adult conversations between 2 and 5 years old. We found that, across development, children align consistently to adults above chance and that adults align consistently more to children than vice versa (even controlling for language production abilities). Besides these consistencies, we found a diversity of developmental trajectories across linguistic levels. These corpus-based findings provide strong support for an early onset of multi-level linguistic alignment in children and invites new experimental work.

2019

Robust Semantic Parsing with Adversarial Learning for Domain Generalization
Gabriel Marzinotto | Géraldine Damnati | Frédéric Béchet | Benoît Favre
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)

This paper addresses the issue of generalization for Semantic Parsing in an adversarial framework. Building models that are more robust to inter-document variability is crucial for the integration of Semantic Parsing technologies in real applications. The underlying question throughout this study is whether adversarial learning can be used to train models on a higher level of abstraction in order to increase their robustness to lexical and stylistic variations. We propose to perform Semantic Parsing with a domain classification adversarial task, covering various use-cases with or without explicit knowledge of the domain. The strategy is first evaluated on a French corpus of encyclopedic documents, annotated with FrameNet, in an information retrieval perspective. This corpus constitutes a new public benchmark, gathering documents from various thematic domains and various sources. We show that adversarial learning yields improved results when using explicit domain classification as the adversarial task. We also propose an unsupervised domain discovery approach that yields equivalent improvements. The latter is also evaluated on a PropBank Semantic Role Labeling task on the CoNLL-2005 benchmark and is shown to increase the model’s generalization capabilities on out-of-domain data.

Typological Features for Multilingual Delexicalised Dependency Parsing
Manon Scholivet | Franck Dary | Alexis Nasr | Benoit Favre | Carlos Ramisch
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The existence of universal models to describe the syntax of languages has been debated for decades. The availability of resources such as the Universal Dependencies treebanks and the World Atlas of Language Structures make it possible to study the plausibility of universal grammar from the perspective of dependency parsing. Our work investigates the use of high-level language descriptions in the form of typological features for multilingual dependency parsing. Our experiments on multilingual parsing for 40 languages show that typological information can indeed guide parsers to share information between similar languages beyond simple language identification.

2018

Veyn at PARSEME Shared Task 2018: Recurrent Neural Networks for VMWE Identification
Nicolas Zampieri | Manon Scholivet | Carlos Ramisch | Benoit Favre
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper describes the Veyn system, submitted to the closed track of the PARSEME Shared Task 2018 on automatic identification of verbal multiword expressions (VMWEs). Veyn is based on a sequence tagger using recurrent neural networks. We represent VMWEs using a variant of the begin-inside-outside encoding scheme combined with the VMWE category tag. In addition to the system description, we present development experiments to determine the best tagging scheme. Veyn is freely available, covers 19 languages, and was ranked ninth (MWE-based) and eight (Token-based) among 13 submissions, considering macro-averaged F1 across languages.

Correction automatique d’attachements prépositionnels par utilisation de traits visuels (PP-attachement resolution using visual features)
Sébastien Delecraz | Leonor Becerra-Bonache | Benoît Favre | Alexis Nasr | Frédéric Bechet
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

La désambiguïsation des rattachements prépositionnels est une tâche syntaxique qui demande des connaissances sémantiques, pouvant être extraites d’une image associée au texte traité. Nous présentons et analysons les difficultés de cette tâche pour laquelle nous construisons un système complet entraîné sur une version étendue des annotations du corpus Flickr30k Entities. Lorsque la sémantique lexicale n’est pas disponible, l’information visuelle apporte 3 % d’amélioration.

Adding Syntactic Annotations to Flickr30k Entities Corpus for Multimodal Ambiguous Prepositional-Phrase Attachment Resolution
Sebastien Delecraz | Alexis Nasr | Frederic Bechet | Benoit Favre
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Détection d’erreurs dans des transcriptions OCR de documents historiques par réseaux de neurones récurrents multi-niveau (Combining character level and word level RNNs for post-OCR error detection)
Thibault Magallon | Frederic Bechet | Benoit Favre
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Le traitement à posteriori de transcriptions OCR cherche à détecter les erreurs dans les sorties d’OCR pour tenter de les corriger, deux tâches évaluées par la compétition ICDAR-2017 Post-OCR Text Correction. Nous présenterons dans ce papier un système de détection d’erreurs basé sur un modèle à réseaux récurrents combinant une analyse du texte au niveau des mots et des caractères en deux temps. Ce système a été classé second dans trois catégories évaluées parmi 11 candidats lors de la compétition.

Evaluation automatique de la satisfaction client à partir de conversations de type “chat” par réseaux de neurones récurrents avec mécanisme d’attention (Customer satisfaction prediction with attention-based RNNs from a chat contact center corpus)
Jeremy Auguste | Delphine Charlet | Géraldine Damnati | Benoit Favre | Frederic Bechet
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Cet article présente des méthodes permettant l’évaluation de la satisfaction client à partir de très vastes corpus de conversation de type “chat” entre des clients et des opérateurs. Extraire des connaissances dans ce contexte demeure un défi pour les méthodes de traitement automatique des langues de par la dimension interactive et les propriétés de ce nouveau type de langage à l’intersection du langage écrit et parlé. Nous présentons une étude utilisant des réponses à des sondages utilisateurs comme supervision faible permettant de prédire la satisfaction des usagers d’un service en ligne d’assistance technique et commerciale.

2017

Evaluation of word embeddings against cognitive processes: primed reaction times in lexical decision and naming tasks
Jeremy Auguste | Arnaud Rey | Benoit Favre
Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP

This work presents a framework for word similarity evaluation grounded on cognitive sciences experimental data. Word pair similarities are compared to reaction times of subjects in large scale lexical decision and naming tasks under semantic priming. Results show that GloVe embeddings lead to significantly higher correlation with experimental measurements than other controlled and off-the-shelf embeddings, and that the choice of a training corpus is less important than that of the algorithm. Comparison of rankings with other datasets shows that the cognitive phenomenon covers more aspects than simply word relatedness or similarity.

Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres
George Giannakopoulos | Elena Lloret | John M. Conroy | Josef Steinberger | Marina Litvak | Peter Rankel | Benoit Favre
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres

Détection de coréférences de bout en bout en français (End-to-end coreference resolution for French)
Elisabeth Godbert | Benoit Favre
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Notre objectif est l’élaboration d’un système de détection automatique de relations de coréférence le plus général possible, pour le traitement des anaphores pronominales et les coréférences directes. Nous décrivons dans cet article les différentes étapes de traitement des textes dans le système que nous avons développé : (i) l’annotation en traits lexicaux et syntaxiques par le système Macaon ; (ii) le repérage des mentions par un modèle obtenu par apprentissage sur le corpus ANCOR ; (iii) l’annotation sémantique des mentions à partir de deux ressources : le DEM et le LVF ; (iv) l’annotation en coréférences par un système à base de règles. Le système est évalué sur le corpus ANCOR.

MultiLing 2017 Overview
George Giannakopoulos | John Conroy | Jeff Kubina | Peter A. Rankel | Elena Lloret | Josef Steinberger | Marina Litvak | Benoit Favre
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres

In this brief report we present an overview of the MultiLing 2017 effort and workshop, as implemented within EACL 2017. MultiLing is a community-driven initiative that pushes the state-of-the-art in Automatic Summarization by providing data sets and fostering further research and development of summarization systems. This year the scope of the workshop was widened, bringing together researchers that work on summarization across sources, languages and genres. We summarize the main tasks planned and implemented this year, the contributions received, and we also provide insights on next steps.

Correcting prepositional phrase attachments using multimodal corpora
Sebastien Delecraz | Alexis Nasr | Frederic Bechet | Benoit Favre
Proceedings of the 15th International Conference on Parsing Technologies

PP-attachments are an important source of errors in parsing natural language. We propose in this article to use data coming from a multimodal corpus, combining textual, visual and conceptual information, as well as a correction strategy, to propose alternative attachments in the output of a parser.

Apprentissage d’agents conversationnels pour la gestion de relations clients (Training chatbots for customer relation management)
Benoit Favre | Frederic Bechet | Géraldine Damnati | Delphine Charlet
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 - Démonstrations

Ce travail démontre la faisabilité d’entraîner des chatbots sur des traces de conversations dans le domaine de la relation client. Des systèmes à base de modèles de langage, de recherche d’information et de traduction sont comparés pour la tâche.

2016

A Document Repository for Social Media and Speech Conversations
Adam Funk | Robert Gaizauskas | Benoit Favre
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a successfully implemented document repository REST service for flexible SCRUD (search, crate, read, update, delete) storage of social media conversations, using a GATE/TIPSTER-like document object model and providing a query language for document features. This software is currently being used in the SENSEI research project and will be published as open-source software before the project ends. It is, to the best of our knowledge, the first freely available, general purpose data repository to support large-scale multimodal (i.e., speech or text) conversation analytics.

Détection de concepts pertinents pour le résumé automatique de conversations par recombinaison de patrons (Relevant concepts detection for the automatic summary of conversations using patterns recombination )
Jérémy Trione | Benoit Favre | Frederic Bechet
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

automatique de conversations par recombinaison de patrons Jérémy Trione Benoit Favre Frédéric Béchet Aix-Marseille Université, CNRS, LIF UMR 7279, 13000, Marseille, France prénom.nom@lif.univ-mrs.fr R ÉSUMÉ Ce papier décrit une approche pour créer des résumés de conversations parlées par remplissage de patrons. Les patrons sont générés automatiquement à partir de fragments généralisés depuis un corpus de résumés d’apprentissage. Les informations nécessaires pour remplir les patrons sont détectées dans les transcriptions des conversations et utilisées pour sélectionner les fragments candidats. L’approche obtient un score ROUGE-2 de 0.116 sur le corpus RATP-DECODA. Les résultats obtenus montrent que cette approche abstractive est plus performante que les approches extractives utilisées habituellement dans le domaine du résumé automatique.

Fusion d’espaces de représentations multimodaux pour la reconnaissance du rôle du locuteur dans des documents télévisuels (Multimodal embedding fusion for robust speaker role recognition in video broadcast )
Sebastien Delecraz | Frederic Bechet | Benoit Favre | Mickael Rouvier
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP

L’identification du rôle d’un locuteur dans des émissions de télévision est un problème de classification de personne selon une liste de rôles comme présentateur, journaliste, invité, etc. À cause de la nonsynchronie entre les modalités, ainsi que par le manque de corpus de vidéos annotées dans toutes les modalités, seulement une des modalités est souvent utilisée. Nous présentons dans cet article une fusion multimodale des espaces de représentations de l’audio, du texte et de l’image pour la reconnaissance du rôle du locuteur pour des données asynchrones. Les espaces de représentations monomodaux sont entraînés sur des corpus de données exogènes puis ajustés en utilisant des réseaux de neurones profonds sur un corpus d’émissions françaises pour notre tâche de classification. Les expériences réalisées sur le corpus de données REPERE ont mis en évidence les gains d’une fusion au niveau des espaces de représentations par rapport aux méthodes de fusion tardive standard.

Summarizing Behaviours: An Experiment on the Annotation of Call-Centre Conversations
Morena Danieli | Balamurali A R | Evgeny Stepanov | Benoit Favre | Frederic Bechet | Giuseppe Riccardi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Annotating and predicting behavioural aspects in conversations is becoming critical in the conversational analytics industry. In this paper we look into inter-annotator agreement of agent behaviour dimensions on two call center corpora. We find that the task can be annotated consistently over time, but that subjectivity issues impacts the quality of the annotation. The reformulation of some of the annotated dimensions is suggested in order to improve agreement.

Word Embedding Evaluation and Combination
Sahar Ghannay | Benoit Favre | Yannick Estève | Nathalie Camelin
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Word embeddings have been successfully used in several natural language processing tasks (NLP) and speech processing. Different approaches have been introduced to calculate word embeddings through neural networks. In the literature, many studies focused on word embedding evaluation, but for our knowledge, there are still some gaps. This paper presents a study focusing on a rigorous comparison of the performances of different kinds of word embeddings. These performances are evaluated on different NLP and linguistic tasks, while all the word embeddings are estimated on the same training data using the same vocabulary, the same number of dimensions, and other similar characteristics. The evaluation results reported in this paper match those in the literature, since they point out that the improvements achieved by a word embedding in one task are not consistently observed across all tasks. For that reason, this paper investigates and evaluates approaches to combine word embeddings in order to take advantage of their complementarity, and to look for the effective word embeddings that can achieve good performances on all tasks. As a conclusion, this paper provides new perceptions of intrinsic qualities of the famous word embedding families, which can be different from the ones provided by works previously published in the scientific literature.

SENSEI-LIF at SemEval-2016 Task 4: Polarity embedding fusion for robust sentiment analysis
Mickael Rouvier | Benoit Favre
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

MultiLing 2015: Multilingual Summarization of Single and Multi-Documents, On-line Fora, and Call-center Conversations
George Giannakopoulos | Jeff Kubina | John Conroy | Josef Steinberger | Benoit Favre | Mijail Kabadjov | Udo Kruschwitz | Massimo Poesio
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Concept-based Summarization using Integer Linear Programming: From Concept Pruning to Multiple Optimal Solutions
Florian Boudin | Hugo Mougard | Benoit Favre
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Call Centre Conversation Summarization: A Pilot Task at Multiling 2015
Benoit Favre | Evgeny Stepanov | Jérémy Trione | Frédéric Béchet | Giuseppe Riccardi
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Rapid FrameNet annotation of spoken conversation transcripts
Jeremy Trione | Frederic Bechet | Benoit Favre | Alexis Nasr
Proceedings of the 11th Joint ACL-ISO Workshop on Interoperable Semantic Annotation (ISA-11)

2014

Automatically enriching spoken corpora with syntactic information for linguistic studies
Alexis Nasr | Frederic Bechet | Benoit Favre | Thierry Bazillon | Jose Deulofeu | Andre Valli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Syntactic parsing of speech transcriptions faces the problem of the presence of disfluencies that break the syntactic structure of the utterances. We propose in this paper two solutions to this problem. The first one relies on a disfluencies predictor that detects disfluencies and removes them prior to parsing. The second one integrates the disfluencies in the syntactic structure of the utterances and train a disfluencies aware parser.

A Repository of State of the Art and Competitive Baseline Summaries for Generic News Summarization
Kai Hong | John Conroy | Benoit Favre | Alex Kulesza | Hui Lin | Ani Nenkova
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In the period since 2004, many novel sophisticated approaches for generic multi-document summarization have been developed. Intuitive simple approaches have also been shown to perform unexpectedly well for the task. Yet it is practically impossible to compare the existing approaches directly, because systems have been evaluated on different datasets, with different evaluation measures, against different sets of comparison systems. Here we present a corpus of summaries produced by several state-of-the-art extractive summarization systems or by popular baseline systems. The inputs come from the 2004 DUC evaluation, the latest year in which generic summarization was addressed in a shared task. We use the same settings for ROUGE automatic evaluation to compare the systems directly and analyze the statistical significance of the differences in performance. We show that in terms of average scores the state-of-the-art systems appear similar but that in fact they produce very different summaries. Our corpus will facilitate future research on generic summarization and motivates the need for development of more sensitive evaluation measures and for approaches to system combination in summarization.

2012

Generative Constituent Parsing and Discriminative Dependency Reranking: Experiments on English and French
Joseph Le Roux | Benoît Favre | Alexis Nasr | Seyed Abolghasem Mirroshandel
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

Robustesse et portabilités multilingue et multi-domaines des systèmes de compréhension de la parole : les corpus du projet PortMedia (Robustness and portability of spoken language understanding systems among languages and domains : the PORTMEDIA project) [in French]
Fabrice Lefèvre | Djamel Mostefa | Laurent Besacier | Yannick Estève | Matthieu Quignard | Nathalie Camelin | Benoit Favre | Bassam Jabaian | Lina Rojas-Barahona
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP

Syntactic annotation of spontaneous speech: application to call-center conversation data
Thierry Bazillon | Melanie Deplano | Frederic Bechet | Alexis Nasr | Benoit Favre
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the syntactic annotation process of the DECODA corpus. This corpus contains manual transcriptions of spoken conversations recorded in the French call-center of the Paris Public Transport Authority (RATP). Three levels of syntactic annotation have been performed with a semi-supervised approach: POS tags, Syntactic Chunks and Dependency parses. The main idea is to use off-the-shelf NLP tools and models, originaly developped and trained on written text, to perform a first automatic annotation on the manually transcribed corpus. At the same time a fully manual annotation process is performed on a subset of the original corpus, called the GOLD corpus. An iterative process is then applied, consisting in manually correcting errors found in the automatic annotations, retraining the linguistic models of the NLP tools on this corrected corpus, then checking the quality of the adapted models on the fully manual annotations of the GOLD corpus. This process iterates until a certain error rate is reached. This paper describes this process, the main issues raising when adapting NLP tools to process speech transcriptions, and presents the first evaluations performed with these new adapted tools.

Percol0 - un système multimodal de détection de personnes dans des documents vidéo (Percol0 - A multimodal person detection system in video documents) [in French]
Frederic Bechet | Remi Auguste | Stephane Ayache | Delphine Charlet | Geraldine Damnati | Benoit Favre | Corinne Fredouille | Christophe Levy | Georges Linares | Jean Martinet
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP

Leveraging study of robustness and portability of spoken language understanding systems across languages and domains: the PORTMEDIA corpora
Fabrice Lefèvre | Djamel Mostefa | Laurent Besacier | Yannick Estève | Matthieu Quignard | Nathalie Camelin | Benoit Favre | Bassam Jabaian | Lina M. Rojas-Barahona
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The PORTMEDIA project is intended to develop new corpora for the evaluation of spoken language understanding systems. The newly collected data are in the field of human-machine dialogue systems for tourist information in French in line with the MEDIA corpus. Transcriptions and semantic annotations, obtained by low-cost procedures, are provided to allow a thorough evaluation of the systems' capabilities in terms of robustness and portability across languages and domains. A new test set with some adaptation data is prepared for each case: in Italian as an example of a new language, for ticket reservation as an example of a new domain. Finally the work is complemented by the proposition of a new high level semantic annotation scheme well-suited to dialogue data.

2011

<StuMaBa>: From Deep Representation to Surface
Bernd Bohnet | Simon Mille | Benoît Favre | Leo Wanner
Proceedings of the 13th European Workshop on Natural Language Generation

MACAON An NLP Tool Suite for Processing Word Lattices
Alexis Nasr | Frédéric Béchet | Jean-François Rey | Benoît Favre | Joseph Le Roux
Proceedings of the ACL-HLT 2011 System Demonstrations

Modèles génératif et discriminant en analyse syntaxique : expériences sur le corpus arboré de Paris 7 (Generative and discriminative models in parsing: experiments on the Paris 7 Treebank)
Joseph Le Roux | Benoît Favre | Seyed Abolghasem Mirroshandel | Alexis Nasr
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Nous présentons une architecture pour l’analyse syntaxique en deux étapes. Dans un premier temps un analyseur syntagmatique construit, pour chaque phrase, une liste d’analyses qui sont converties en arbres de dépendances. Ces arbres sont ensuite réévalués par un réordonnanceur discriminant. Cette méthode permet de prendre en compte des informations auxquelles l’analyseur n’a pas accès, en particulier des annotations fonctionnelles. Nous validons notre approche par une évaluation sur le corpus arboré de Paris 7. La seconde étape permet d’améliorer significativement la qualité des analyses retournées, quelle que soit la métrique utilisée.

2010

The UMUS System for Named Entity Generation at GREC 2010
Benoit Favre | Bernd Bohnet
Proceedings of the 6th International Natural Language Generation Conference

2009

ICSI-CRF: The Generation of References to the Main Subject and Named Entities Using Conditional Random Fields
Benoit Favre | Bernd Bohnet
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

A Scalable Global Model for Summarization
Dan Gillick | Benoit Favre
Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing

2005

Robust Named Entity Extraction from Large Spoken Archives
Benoît Favre | Frédéric Béchet | Pascal Nocéra
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

Co-authors

Géraldine Damnati 7

Mickael Rouvier 6

Abdellah Fourtassi 5

Jérémy Auguste 4

Sébastien Delecraz 4

Richard Dufour 4

Yannick Estève 4

Mitja Nikolaus 4

Abhishek Agrawal 3

Leonor Becerra-Bonache 3

Laurent Besacier 3

Nathalie Camelin 3

Delphine Charlet 3

George Giannakopoulos 3

Joseph Le Roux 3

Laurent Prévot 3

Carlos Ramisch 3

Lina M. Rojas Barahona 3

Josef Steinberger 3

Jérémy Trione 3

Cassandre Armand 2

Stephane Ayache 2

Thierry Bazillon 2

Adrien Bazoge 2

Ikram Belmadani 2

Baptiste Blouin 2

Romain Gemignani 2

Dhia Elhak Goumri 2

Shreejata Gupta 2

Stéphane Huet 2

Bassam Jabaian 2

Gwénolé Lecorvé 2

Fabrice Lefèvre 2

Marina Litvak 2

Gabriel Marzinotto 2

Chiara Mazzocconi 2

Seyed Abolghasem Mirroshandel 2

Djamel Mostefa 2

Matthieu Quignard 2

Peter A. Rankel 2

Giuseppe Riccardi 2

Ricardo Rodriguez 2

Manon Scholivet 2

Evgeny Stepanov 2

Hong Duc Thang Vu 2

Balamurali AR 1

Salah Ait-Mokhtar 1

Maxime Amblard 1

Auriane Boudin 1

Florian Boudin 1

Hossam Boudraa 1

Marie Candito 1

Pacôme Constant dit Beaufils 1

Morena Danieli 1

Maxime Darrin 1

Melanie Deplano 1

José Deulofeu 1

Oumaima El Khettari 1

Cécile Fabre 1

Philippe Formont 1

Corinne Fredouille 1

Simone Fuscone 1

Robert Gaizauskas 1

Sahar Ghannay 1

Élisabeth Godbert 1

Christian Henriot 1

Nicolas Hervé 1

Mijail Kabadjov 1

Udo Kruschwitz 1

Antoine Laurent 1

Christophe Levy 1

Georges Linarès 1

Thibault Magallon 1

Jean Martinet 1

Sylvain Meignier 1

Thomas Misiek 1

Aurelie Neveol 1

Vassilina Nikoulina 1

Pascal Nocéra 1

Valentin Pelloin 1

Jean-Marie Pergandi 1

Pablo Piantanida 1

Massimo Poesio 1

Jean-François Rey 1

Sophie Rosset 1

Emmanuelle Salin 1

Amir Soleimani 1

Valentin Vielzeuf 1

Nicolas Zampieri 1

Yacine Zerenini 1

Venues