Aurelie Neveol

Also published as: Aurélie Névéol

2025

pdf bib abs
“Women do not have heart attacks!” Gender Biases in Automatically Generated Clinical Cases in French
Fanny Ducel | Nicolas Hiebel | Olivier Ferret | Karën Fort | Aurélie Névéol
Findings of the Association for Computational Linguistics: NAACL 2025

Healthcare professionals are increasingly including Language Models (LMs) in clinical practice. However, LMs have been shown to exhibit and amplify stereotypical biases that can cause life-threatening harm in a medical context. This study aims to evaluate gender biases in automatically generated clinical cases in French, on ten disorders. Using seven LMs fine-tuned for clinical case generation and an automatic linguistic gender detection tool, we measure the associations between disorders and gender. We unveil that LMs over-generate cases describing male patients, creating synthetic corpora that are not consistent with documented prevalence for these disorders. For instance, when prompts do not specify a gender, LMs generate eight times more clinical cases describing male (vs. female patients) for heart attack. We discuss the ideal synthetic clinical case corpus and establish that explicitly mentioning demographic information in generation instructions appears to be the fairest strategy. In conclusion, we argue that the presence of gender biases in synthetic text raises concerns about LM-induced harm, especially for women and transgender people.

pdf bib abs
« Les femmes ne font pas de crise cardiaque ! » Étude des biais de genre dans les cas cliniques synthétiques en français
Fanny Ducel | Nicolas Hiebel | Olivier Ferret | Karën Fort | Aurélie Névéol
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

De plus en plus de professionnels de santé utilisent des modèles de langue. Cependant, ces modèles présentent et amplifient des biais stéréotypés qui peuvent mettre en danger des vies. Cette étude vise à évaluer les biais de genre dans des cas cliniques générés automatiquement en français concernant dix pathologies. Nous utilisons sept modèles de langue affinés et un outil de détection automatique du genre pour mesurer les associations entre pathologie et genre. Nous montrons que les modèles sur-génèrent des cas décrivant des patients masculins, allant à l’encontre des prévalences réelles. Par exemple, lorsque les invites ne spécifient pas de genre, les modèles génèrent huit fois plus de cas cliniques décrivant des patients (plutôt que des patientes) pour les crises cardiaques. Nous discutons des possibles dommages induits par les modèles de langue, en particulier pour les femmes et les personnes transgenres, de la définition d’un modèle de langue « idéal » et des moyens d’y parvenir.

pdf bib abs
Évaluation de la confidentialité des textes cliniques synthétiques générés par des modèles de langue
Foucauld Estignard | Sahar Ghannay | Julien Girard-Satabin | Nicolas Hiebel | Aurélie Névéol
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Les grands modèles de langue (LLM) peuvent être utilisés pour produire des documents synthétiques similaires à des documents réels dont la disponibilité est limitée pour des raisons de confidentialité ou de droits d’auteur. Dans cet article, nous étudions les risques en lien avec la confidentialité dans les documents générés automatiquement. Nous utilisons des textes synthétiques générés à partir d’un modèle pré-entraîné et affiné sur des cas cliniques en français afin d’évaluer ces risques selon trois critères : (1) la similarité entre un corpus d’entraînement réel et le corpus synthétique (2) les corrélations entre les caractéristiques cliniques dans le corpus d’entraînement et le corpus synthétique et (3) une attaque par inférence d’appartenance (MIA, en anglais) utilisant un modèle affiné sur le corpus synthétique. Nous identifions des associations de caractéristiques cliniques qui suggèrent que le filtrage du corpus d’entraînement pourrait contribuer à la préservation de la confidentialité. Les attaques par inférence d’appartenance n’ont pas été concluantes.

pdf bib abs
MascuLead : le premier tableau de bord de biais de genre
Fanny Ducel | Jeffrey André | Aurélie Névéol | Karën Fort
Actes de l'atelier Ethic and Alignment of (Large) Language Models 2025 (EALM)

Nous présentons MascuLead, le premier tableau de bord centré sur une tâche de détection de biais de genre pour les langues fleuries. Nous appliquons un cadre existant sur plusieurs Grands Modèles de Langue et comparons leurs performances. Nous soulignons également l’importance d’inclure des références de biais dans les tableaux de bord, et nous questionnons la notion même de tableaux de bord.

pdf bib abs
Comment évaluer un grand modèle de langue dans le domaine médical en français ?
Christophe Servan | Cyril Grouin | Aurélie Névéol | Pierre Zweigenbaum
Actes de l'atelier Évaluation des modèles génératifs (LLM) et challenge 2025 (EvalLLM)

Les récentes avancées en Traitement Automatique des Langues liées aux grands modèles de langue (LLM) auto-régressifs investissent également les domaines spécialisés dont celui de la santé. Cette étude examine les questions qui se posent dans l’évaluation de LLM appliqués au domaine de la santé en se focalisant sur le français. Après un bref tour d’horizon des tâches et des données d’évaluation disponibles pour ce domaine de spécialité, l’article examine le mode d’évaluation des LLM dans des tâches de nature discriminante (détection d’entités nommées, classification de textes) et génératives (résumé de comptes rendus, génération de cas cliniques). L’article n’a pas vocation à rapporter une évaluation concrète, mais à discuter et préparer la méthodologie pour le faire.

Large Language Models (LLMs) reproduce and exacerbate the social biases present in their training data, and resources to quantify this issue are limited. While research has attempted to identify and mitigate such biases, most efforts have been concentrated around English, lagging the rapid advancement of LLMs in multilingual settings. In this paper, we introduce a new multilingual parallel dataset SHADES to help address this issue, designed for examining culturally-specific stereotypes that may be learned by LLMs. The dataset includes stereotypes from 20 regions around the world and 16 languages, spanning multiple identity categories subject to discrimination worldwide. We demonstrate its utility in a series of exploratory evaluations for both “base” and “instruction-tuned” language models. Our results suggest that stereotypes are consistently reflected across models and languages, with some languages and models indicating much stronger stereotype biases than others.

2024

pdf bib
La recherche sur les biais dans les modèles de langue est biaisée : état de l’art en abyme [Bias Research for Language Models is Biased: a Survey for Deconstructing Bias in Large Language Models]
Fanny Ducel | Aurélie Névéol | Karën Fort
Traitement Automatique des Langues, Volume 64, Numéro 3 : Explicabilité des modèles de TAL [Explainability of NLP models]

pdf bib abs
Using Structured Health Information for Controlled Generation of Clinical Cases in French
Hugo Boulanger | Nicolas Hiebel | Olivier Ferret | Karën Fort | Aurélie Névéol
Proceedings of the 6th Clinical Natural Language Processing Workshop

Text generation opens up new prospects for overcoming the lack of open corpora in fields such as healthcare, where data sharing is bound by confidentiality. In this study, we compare the performance of encoder-decoder and decoder-only language models for the controlled generation of clinical cases in French. To do so, we fine-tuned several pre-trained models on French clinical cases for each architecture and generate clinical cases conditioned by patient demographic information (gender and age) and clinical features.Our results suggest that encoder-decoder models are easier to control than decoder-only models, but more costly to train.

pdf bib abs
Few-shot clinical entity recognition in English, French and Spanish: masked language models outperform generative model prompting
Marco Naguib | Xavier Tannier | Aurélie Névéol
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) have become the preferred solution for many natural language processing tasks. In low-resource environments such as specialized domains, their few-shot capabilities are expected to deliver high performance. Named Entity Recognition (NER) is a critical task in information extraction that is not covered in recent LLM benchmarks. There is a need for better understanding the performance of LLMs for NER in a variety of settings including languages other than English. This study aims to evaluate generative LLMs, employed through prompt engineering, for few-shot clinical NER. We compare 13 auto-regressive models using prompting and 16 masked models using fine-tuning on 14 NER datasets covering English, French and Spanish. While prompt-based auto-regressive models achieve competitive F1 for general NER, they are outperformed within the clinical domain by lighter biLSTM-CRF taggers based on masked models. Additionally, masked models exhibit lower environmental impact compared to auto-regressive models. Findings are consistent across the three languages studied, which suggests that LLM prompting is not yet suited for NER production in the clinical domain.

pdf bib abs
Hostomytho: A GWAP for Synthetic Clinical Texts Evaluation and Annotation
Nicolas Hiebel | Bertrand Remy | Bruno Guillaume | Olivier Ferret | Aurélie Névéol | Karen Fort
Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024

This paper presents the creation of Hostomytho, a game with a purpose intended for evaluating the quality of synthetic biomedical texts through multiple mini-games. Hostomytho was developed entirely using open source technologies both for internet browser and mobile platforms (IOS & Android). The code and the annotations created for synthetic clinical cases in French will be made freely available.

pdf bib abs
Évaluation automatique des biais de genre dans des modèles de langue auto-régressifs
Fanny Ducel | Aurélie Névéol | Karën Fort
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Nous proposons un outil pour mesurer automatiquement les biais de genre dans des textes générés par des grands modèles de langue dans des langues flexionnelles. Nous évaluons sept modèles à l’aide de 52 000 textes en français et 2 500 textes en italien, pour la rédaction de lettres de motivation. Notre outil s’appuie sur la détection de marqueurs morpho-syntaxiques de genre pour mettre au jour des biais. Ainsi, les modèles favorisent largement la génération de masculin : le genre masculin est deux fois plus présent que le féminin en français, et huit fois plus en italien. Les modèles étudiés exacerbent également des stéréotypes attestés en sociologie en associant les professions stéréotypiquement féminines aux textes au féminin, et les professions stéréotypiquement masculines aux textes au masculin.

pdf bib abs
Reconnaissance d’entités cliniques en few-shot en trois langues
Marco Naguib | Aurélie Névéol | Xavier Tannier
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Les grands modèles de langage deviennent la solution de choix pour de nombreuses tâches de traitement du langage naturel, y compris dans des domaines spécialisés où leurs capacités few-shot devraient permettre d’obtenir des performances élevées dans des environnements à faibles ressources. Cependant, notre évaluation de 10 modèles auto-régressifs et 16 modèles masqués montre que, bien que les modèles auto-régressifs utilisant des prompts puissent rivaliser en termes de reconnaissance d’entités nommées (REN) en dehors du domaine clinique, ils sont dépassés dans le domaine clinique par des taggers biLSTM-CRF plus légers reposant sur des modèles masqués. De plus, les modèles masqués ont un bien moindre impact environnemental que les modèles auto-régressifs. Ces résultats, cohérents dans les trois langues étudiées, suggèrent que les modèles à apprentissage few-shot ne sont pas encore adaptés à la production de REN dans le domaine clinique, mais pourraient être utilisés pour accélérer la création de données annotées de qualité.

pdf bib abs
Extraction d’entités nommées décrivant des chaînes de traitement bioinformatiques dans des articles scientifiques en anglais
Clémence Sebe | Sarah Cohen-Boulakia | Olivier Ferret | Aurélie Névéol
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Les chaînes de traitement d’analyses de données biologiques utilisées en bioinformatique sont une solution pour la portabilité et la reproductibilité des analyses. Ces chaînes figurent à la fois sous forme descriptive dans des articles scientifiques et/ou sous forme de codes dans des dépôts. L’identification de publications scientifiques décrivant de nouvelles chaînes de traitement et l’extraction de leurs informations sont des enjeux importants pour la communauté bioinformatique. Nous proposons ici d’étendre le corpus BioToFlow ayant trait aux articles décrivant des chaînes de traitement bioinformatiques et de l’utiliser pour entraîner et évaluer des modèles de reconnaissance d’entités nommées bioinformatiques. Ce travail est accompagné d’une discussion critique portant à la fois sur le processus d’annotation du corpus et sur les résultats de l’extraction d’entités.

pdf bib abs
Génération contrôlée de cas cliniques en français à partir de données médicales structurées
Hugo Boulanger | Nicolas Hiebel | Olivier Ferret | Karën Fort | Aurélie Névéol
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

La génération de texte ouvre des perspectives pour pallier l’absence de corpus librement partageables dans des domaines contraints par la confidentialité, comme le domaine médical. Dans cette étude, nous comparons les performances de modèles encodeurs-décodeurs et décodeurs seuls pour la génération conditionnée de cas cliniques en français. Nous affinons plusieurs modèles pré-entraînés pour chaque architecture sur des cas cliniques en français conditionnés par les informations démographiques des patient·es (sexe et âge) et des éléments cliniques.Nous observons que les modèles encodeur-décodeurs sont plus facilement contrôlables que les modèles décodeurs seuls, mais plus coûteux à entraîner.

La génération de textes neuronaux fait l’objet d’une grande attention avec la publication de nouveaux outils tels que ChatGPT. La principale raison en est que la qualité du texte généré automatiquement peut être attribuée à un$cdot$e rédacteurice humain$cdot$e même quand l’évaluation est faite par un humain. Dans cet article, nous proposons un nouveau corpus en français et en anglais pour la tâche d’identification de textes générés automatiquement et nous menons une étude sur la façon dont les humains perçoivent ce texte. Nos résultats montrent, comme les travaux antérieurs à l’ère de ChatGPT, que les textes générés par des outils tels que ChatGPT partagent certaines caractéristiques communes mais qu’ils ne sont pas clairement identifiables, ce qui génère des perceptions différentes de ces textes par l’humain.

pdf bib abs
A Benchmark Evaluation of Clinical Named Entity Recognition in French
Nesrine Bannour | Christophe Servan | Aurélie Névéol | Xavier Tannier
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Background: Transformer-based language models have shown strong performance on many Natural Language Processing (NLP) tasks. Masked Language Models (MLMs) attract sustained interest because they can be adapted to different languages and sub-domains through training or fine-tuning on specific corpora while remaining lighter than modern Large Language Models (MLMs). Recently, several MLMs have been released for the biomedical domain in French, and experiments suggest that they outperform standard French counterparts. However, no systematic evaluation comparing all models on the same corpora is available. Objective: This paper presents an evaluation of masked language models for biomedical French on the task of clinical named entity recognition. Material and methods: We evaluate biomedical models CamemBERT-bio and DrBERT and compare them to standard French models CamemBERT, FlauBERT and FrAlBERT as well as multilingual mBERT using three publically available corpora for clinical named entity recognition in French. The evaluation set-up relies on gold-standard corpora as released by the corpus developers. Results: Results suggest that CamemBERT-bio outperforms DrBERT consistently while FlauBERT offers competitive performance and FrAlBERT achieves the lowest carbon footprint. Conclusion: This is the first benchmark evaluation of biomedical masked language models for French clinical entity recognition that compares model performance consistently on nested entity recognition using metrics covering performance and environmental impact.

User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.

Neural text generation is receiving broad attention with the publication of new tools such as ChatGPT. The main reason for that is that the achieved quality of the generated text may be attributed to a human writer by the naked eye of a human evaluator. In this paper, we propose a new corpus in French and English for the task of recognising automatically generated texts and we conduct a study of how humans perceive the text. Our results show, as previous work before the ChatGPT era, that the generated texts by tools such as ChatGPT share some common characteristics but they are not clearly identifiable which generates different perceptions of these texts.

Warning: This paper contains explicit statements of offensive stereotypes which may be upsetting The study of bias, fairness and social impact in Natural Language Processing (NLP) lacks resources in languages other than English. Our objective is to support the evaluation of bias in language models in a multilingual setting. We use stereotypes across nine types of biases to build a corpus containing contrasting sentence pairs, one sentence that presents a stereotype concerning an underadvantaged group and another minimally changed sentence, concerning a matching advantaged group. We build on the French CrowS-Pairs corpus and guidelines to provide translations of the existing material into seven additional languages. In total, we produce 11,139 new sentence pairs that cover stereotypes dealing with nine types of biases in seven cultural contexts. We use the final resource for the evaluation of relevant monolingual and multilingual masked language models. We find that language models in all languages favor sentences that express stereotypes in most bias categories. The process of creating a resource that covers a wide range of language types and cultural settings highlights the difficulty of bias evaluation, in particular comparability across languages and contexts.

We present the results of the ninth edition of the Biomedical Translation Task at WMT’24. We released test sets for six language pairs, namely, French, German, Italian, Portuguese, Russian, and Spanish, from and into English. Eachtest set consists of 50 abstracts from PubMed. Differently from previous years, we did not split abstracts into sentences. We received submissions from five teams, and for almost all language directions. We used a baseline/comparison system based on Llama 3.1 and share the source code at https://github.com/cgrozea/wmt24biomed-ref.

2023

pdf bib abs
The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research
Mohamed Abdalla | Jan Philip Wahle | Terry Ruas | Aurélie Névéol | Fanny Ducel | Saif Mohammad | Karen Fort
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advances in deep learning methods for natural language processing (NLP) have created new business opportunities and made NLP research critical for industry development. As one of the big players in the field of NLP, together with governments and universities, it is important to track the influence of industry on research. In this study, we seek to quantify and characterize industry presence in the NLP community over time. Using a corpus with comprehensive metadata of 78,187 NLP publications and 701 resumes of NLP publication authors, we explore the industry presence in the field since the early 90s. We find that industry presence among NLP authors has been steady before a steep increase over the past five years (180% growth from 2017 to 2022). A few companies account for most of the publications and provide funding to academic researchers through grants and internships. Our study shows that the presence and impact of the industry on natural language processing research are significant and fast-growing. This work calls for increased transparency of industry influence in the field.

pdf bib abs
Event-independent temporal positioning: application to French clinical text
Nesrine Bannour | Bastien Rance | Xavier Tannier | Aurelie Neveol
Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

Extracting temporal relations usually entails identifying and classifying the relation between two mentions. However, the definition of temporal mentions strongly depends on the text type and the application domain. Clinical text in particular is complex. It may describe events that occurred at different times, contain redundant information and a variety of domain-specific temporal expressions. In this paper, we propose a novel event-independent representation of temporal relations that is task-independent and, therefore, domain-independent. We are interested in identifying homogeneous text portions from a temporal standpoint and classifying the relation between each text portion and the document creation time. Temporal relation extraction is cast as a sequence labeling task and evaluated on oncology notes. We further evaluate our temporal representation by the temporal positioning of toxicity events of chemotherapy administrated to colon and lung cancer patients described in French clinical reports. An overall macro F-measure of 0.86 is obtained for temporal relation extraction by a neural token classification model trained on clinical texts written in French. Our results suggest that the toxicity event extraction task can be performed successfully by automatically identifying toxicity events and placing them within the patient timeline (F-measure .62). The proposed system has the potential to assist clinicians in the preparation of tumor board meetings.

pdf bib abs
Can Synthetic Text Help Clinical Named Entity Recognition? A Study of Electronic Health Records in French
Nicolas Hiebel | Olivier Ferret | Karen Fort | Aurélie Névéol
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

In sensitive domains, the sharing of corpora is restricted due to confidentiality, copyrights or trade secrets. Automatic text generation can help alleviate these issues by producing synthetic texts that mimic the linguistic properties of real documents while preserving confidentiality. In this study, we assess the usability of synthetic corpus as a substitute training corpus for clinical information extraction. Our goal is to automatically produce a clinical case corpus annotated with clinical entities and to evaluate it for a named entity recognition (NER) task. We use two auto-regressive neural models partially or fully trained on generic French texts and fine-tuned on clinical cases to produce a corpus of synthetic clinical cases. We study variants of the generation process: (i) fine-tuning on annotated vs. plain text (in that case, annotations are obtained a posteriori) and (ii) selection of generated texts based on models parameters and filtering criteria. We then train NER models with the resulting synthetic text and evaluate them on a gold standard clinical corpus. Our experiments suggest that synthetic text is useful for clinical NER.

pdf bib abs
Stratégies d’apprentissage actif pour la reconnaissance d’entités nommées en français
Marco Naguib | Aurélie Névéol | Xavier Tannier
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs

L’annotation manuelle de corpus est un processus coûteux et lent, notamment pour la tâche de re-connaissance d’entités nommées. L’apprentissage actif vise à rendre ce processus plus efficace, ensélectionnant les portions les plus pertinentes à annoter. Certaines stratégies visent à sélectionner lesportions les plus représentatives du corpus, d’autres, les plus informatives au modèle de langage.Malgré un intérêt grandissant pour l’apprentissage actif, rares sont les études qui comparent cesdifférentes stratégies dans un contexte de reconnaissance d’entités nommées médicales. Nous pro-posons une comparaison de ces stratégies en fonction des performances de chacune sur 3 corpus dedocuments cliniques en langue française : MERLOT, QuaeroFrenchMed et E3C. Nous comparonsles stratégies de sélection mais aussi les différentes façons de les évaluer. Enfin, nous identifions lesstratégies qui semblent les plus efficaces et mesurons l’amélioration qu’elles présentent, à différentesphases de l’apprentissage.

pdf bib abs
Positionnement temporel indépendant des évènements : application à des textes cliniques en français
Nesrine Bannour | Xavier Tannier | Bastien Rance | Aurélie Névéol
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : travaux de recherche originaux -- articles courts

L’extraction de relations temporelles consiste à identifier et classifier la relation entre deux mentions. Néanmoins, la définition des mentions temporelles dépend largement du type du texte et du domained’application. En particulier, le texte clinique est complexe car il décrit des évènements se produisant à des moments différents et contient des informations redondantes et diverses expressions temporellesspécifiques au domaine. Dans cet article, nous proposons une nouvelle représentation des relations temporelles, qui est indépendante du domaine et de l’objectif de la tâche d’extraction. Nous nousintéressons à extraire la relation entre chaque portion du texte et la date de création du document. Nous formulons l’extraction de relations temporelles comme une tâche d’étiquetage de séquences.Une macro F-mesure de 0,8 est obtenue par un modèle neuronal entraîné sur des textes cliniques, écrits en français. Nous évaluons notre représentation temporelle par le positionnement temporel desévènements de toxicité des chimiothérapies.

pdf bib abs
Les textes cliniques français générés sont-ils dangereusement similaires à leur source ? Analyse par plongements de phrases
Nicolas Hiebel | Olivier Ferret | Karën Fort | Aurélie Névéol
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : travaux de recherche originaux -- articles courts

Les ressources textuelles disponibles dans le domaine biomédical sont rares pour des raisons de confidentialité. Des données existent mais ne sont pas partageables, c’est pourquoi il est intéressant de s’inspirer de ces données pour en générer de nouvelles sans contrainte de partage. Une difficulté majeure de la génération de données médicales est que les données générées doivent ressembler aux données originales sans compromettre leur confidentialité. L’évaluation de cette tâche est donc difficile. Dans cette étude, nous étendons l’évaluation de corpus cliniques générés en français en y ajoutant une dimension sémantique à l’aide de plongements de phrases. Nous recherchons des phrases proches à l’aide de similarité cosinus entre plongements, et analysons les scores de similarité. Nous observons que les phrases synthétiques sont thématiquement proches du corpus original, mais suffisamment éloignées pour ne pas être de simples reformulations qui compromettraient la confidentialité.

Le projet ANR Gender Equality Monitor (GEM) est coordonné par l’Institut National de l’Audiovisuel(INA) et vise à étudier la place des femmes dans les médias (radio et télévision). Dans cette soumission,nous présentons le travail réalisé au LISN : (i) étude diachronique des caractéristiques acoustiquesde la voix en fonction du genre et de l’âge, (ii) comparaison acoustique de la voix des femmeset hommes politiques montrant une incohérence entre performance vocale et commentaires sur lavoix, (iii) réalisation d’un système automatique d’estimation de la féminité perçue à partir descaractéristiques vocales, (iv) comparaison de systèmes de segmentation thématique de transcriptionsautomatiques de données audiovisuelles, (v) mesure des biais sociétaux dans les modèles de languedans un contexte multilingue et multi-culturel, et (vi) premiers essais d’identification de la publicitéen fonction du genre du locuteur.

We present an overview of the Biomedical Translation Task that was part of the Eighth Conference on Machine Translation (WMT23). The aim of the task was the automatic translation of biomedical abstracts from the PubMed database. It included twelve language directions, namely, French, Spanish, Portuguese, Italian, German, and Russian, from and into English. We received submissions from 18 systems and for all the test sets that we released. Our comparison system was based on ChatGPT 3.5 and performed very well in comparison to many of the submissions.

2022

pdf bib abs
French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English
Aurélie Névéol | Yoann Dupont | Julien Bezançon | Karën Fort
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Warning: This paper contains explicit statements of offensive stereotypes which may be upsetting. Much work on biases in natural language processing has addressed biases linked to the social and cultural experience of English speaking individuals in the United States. We seek to widen the scope of bias studies by creating material to measure social bias in language models (LMs) against specific demographic groups in France. We build on the US-centered CrowS-pairs dataset to create a multilingual stereotypes dataset that allows for comparability across languages while also characterizing biases that are specific to each country and language. We introduce 1,679 sentence pairs in French that cover stereotypes in ten types of bias like gender and age. 1,467 sentence pairs are translated from CrowS-pairs and 212 are newly crowdsourced. The sentence pairs contrast stereotypes concerning underadvantaged groups with the same sentence concerning advantaged groups. We find that four widely used language models (three French, one multilingual) favor sentences that express stereotypes in most bias categories. We report on the translation process from English into French, which led to a characterization of stereotypes in CrowS-pairs including the identification of US-centric cultural traits. We offer guidelines to further extend the dataset to other languages and cultural environments.

Evaluating bias, fairness, and social impact in monolingual language models is a difficult task. This challenge is further compounded when language modeling occurs in a multilingual context. Considering the implication of evaluation biases for large multilingual language models, we situate the discussion of bias evaluation within a wider context of social scientific research with computational work. We highlight three dimensions of developing multilingual bias evaluation frameworks: (1) increasing transparency through documentation, (2) expanding targets of bias beyond gender, and (3) addressing cultural differences that exist between languages. We further discuss the power dynamics and consequences of training large language models and recommend that researchers remain cognizant of the ramifications of developing such technologies.

pdf bib abs
CLISTER : Un corpus pour la similarité sémantique textuelle dans des cas cliniques en français (CLISTER : A Corpus for Semantic Textual Similarity in French Clinical Narratives)
Nicolas Hiebel | Karën Fort | Aurélie Névéol | Olivier Ferret
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Le TAL repose sur la disponibilité de corpus annotés pour l’entraînement et l’évaluation de modèles. Il existe très peu de ressources pour la similarité sémantique dans le domaine clinique en français. Dans cette étude, nous proposons une définition de la similarité guidée par l’analyse clinique et l’appliquons au développement d’un nouveau corpus partagé de 1 000 paires de phrases annotées manuellement en scores de similarité. Nous évaluons ensuite le corpus par des expériences de mesure automatique de similarité. Nous montrons ainsi qu’un modèle de plongements de phrases peut capturer la similarité avec des performances à l’état de l’art sur le corpus DEFT STS (Spearman=0,8343). Nous montrons également que le contenu du corpus CLISTER est complémentaire de celui de DEFT STS.

pdf bib abs
French CrowS-Pairs: Extension à une langue autre que l’anglais d’un corpus de mesure des biais sociétaux dans les modèles de langue masqués (French CrowS-Pairs : Extending a challenge dataset for measuring social bias in masked language models to a language other than English)
Aurélie Névéol | Yoann Dupont | Julien Bezançon | Karën Fort
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Afin de permettre l’étude des biais en traitement automatique de la langue au delà de l’anglais américain, nous enrichissons le corpus américain CrowS-pairs de 1 677 paires de phrases en français représentant des stéréotypes portant sur dix catégories telles que le genre. 1 467 paires de phrases sont traduites à partir de CrowS-pairs et 210 sont nouvellement recueillies puis traduites en anglais. Selon le principe des paires minimales, les phrases du corpus contrastent un énoncé stéréotypé concernant un groupe défavorisé et son équivalent pour un groupe favorisé. Nous montrons que quatre modèles de langue favorisent les énoncés qui expriment des stéréotypes dans la plupart des catégories. Nous décrivons le processus de traduction et formulons des recommandations pour étendre le corpus à d’autres langues. Attention : Cet article contient des énoncés de stéréotypes qui peuvent être choquants.

pdf bib abs
CLISTER : A Corpus for Semantic Textual Similarity in French Clinical Narratives
Nicolas Hiebel | Olivier Ferret | Karën Fort | Aurélie Névéol
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Modern Natural Language Processing relies on the availability of annotated corpora for training and evaluating models. Such resources are scarce, especially for specialized domains in languages other than English. In particular, there are very few resources for semantic similarity in the clinical domain in French. This can be useful for many biomedical natural language processing applications, including text generation. We introduce a definition of similarity that is guided by clinical facts and apply it to the development of a new French corpus of 1,000 sentence pairs manually annotated according to similarity scores. This new sentence similarity corpus is made freely available to the community. We further evaluate the corpus through experiments of automatic similarity measurement. We show that a model of sentence embeddings can capture similarity with state-of-the-art performance on the DEFT STS shared task evaluation data set (Spearman=0.8343). We also show that the corpus is complementary to DEFT STS.

pdf bib abs
Use of a Citizen Science Platform for the Creation of a Language Resource to Study Bias in Language Models for French: A Case Study
Karën Fort | Aurélie Névéol | Yoann Dupont | Julien Bezançon
Proceedings of the 2nd Workshop on Novel Incentives in Data Collection from People: models, implementations, challenges and results within LREC 2022

There is a growing interest in the evaluation of bias, fairness and social impact of Natural Language Processing models and tools. However, little resources are available for this task in languages other than English. Translation of resources originally developed for English is a promising research direction. However, there is also a need for complementing translated resources by newly sourced resources in the original languages and social contexts studied. In order to collect a language resource for the study of biases in Language Models for French, we decided to resort to citizen science. We created three tasks on the LanguageARC citizen science platform to assist with the translation of an existing resource from English into French as well as the collection of complementary resources in native French. We successfully collected data for all three tasks from a total of 102 volunteer participants. Participants from different parts of the world contributed and we noted that although calls sent to mailing lists had a positive impact on participation, some participants pointed barriers to contributions due to the collection platform.

In the seventh edition of the WMT Biomedical Task, we addressed a total of seven languagepairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian. This year’s test sets covered three types of biomedical text genre. In addition to scientific abstracts and terminology items used in previous editions, we released test sets of clinical cases. The evaluation of clinical cases translations were given special attention by involving clinicians in the preparation of reference translations and manual evaluation. For the main MEDLINE test sets, we received a total of 609 submissions from 37 teams. For the ClinSpEn sub-task, we had the participation of five teams.

2021

pdf bib abs
Reviewing Natural Language Processing Research
Kevin Cohen | Karën Fort | Margot Mieskes | Aurélie Névéol | Anna Rogers
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

The reviewing procedure has been identified as one of the major issues in the current situation of the NLP field. While it is implicitly assumed that junior researcher learn reviewing during their PhD project, this might not always be the case. Additionally, with the growing NLP community and the efforts in the context of widening the NLP community, researchers joining the field might not have the opportunity to practise reviewing. This tutorial fills in this gap by providing an opportunity to learn the basics of reviewing. Also more experienced researchers might find this tutorial interesting to revise their reviewing procedure.

pdf bib abs
Evaluating the carbon footprint of NLP methods: a survey and analysis of existing tools
Nesrine Bannour | Sahar Ghannay | Aurélie Névéol | Anne-Laure Ligozat
Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing

Modern Natural Language Processing (NLP) makes intensive use of deep learning methods because of the accuracy they offer for a variety of applications. Due to the significant environmental impact of deep learning, cost-benefit analysis including carbon footprint as well as accuracy measures has been suggested to better document the use of NLP methods for research or deployment. In this paper, we review the tools that are available to measure energy use and CO2 emissions of NLP methods. We describe the scope of the measures provided and compare the use of six tools (carbon tracker, experiment impact tracker, green algorithms, ML CO2 impact, energy usage and cumulator) on named entity recognition experiments performed on different computational set-ups (local server vs. computing facility). Based on these findings, we propose actionable recommendations to accurately measure the environmental impact of NLP experiments.

In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian, and English/Basque. Further, our tests were composed of three types of textual test sets. New to this year, we released a test set of summaries of animal experiments, in addition to the test sets of scientific abstracts and terminologies. We received a total of 107 submissions from 15 teams from 6 countries.

2020

pdf bib abs
Reviewing Natural Language Processing Research
Kevin Cohen | Karën Fort | Margot Mieskes | Aurélie Névéol
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

This tutorial will cover the theory and practice of reviewing research in natural language processing. Heavy reviewing burdens on natural language processing researchers have made it clear that our community needs to increase the size of our pool of potential reviewers. Simultaneously, notable “false negatives”—rejection by our conferences of work that was later shown to be tremendously important after acceptance by other conferences—have raised awareness of the fact that our reviewing practices leave something to be desired. We do not often talk about “false positives” with respect to conference papers, but leaders in the field have noted that we seem to have a publication bias towards papers that report high performance, with perhaps not much else of interest in them. It need not be this way. Reviewing is a learnable skill, and you will learn it here via lectures and a considerable amount of hands-on practice.

pdf bib abs
Modèle neuronal pour la résolution de la coréférence dans les dossiers médicaux électroniques (Neural approach for coreference resolution in electronic health records )
Julien Tourille | Olivier Ferret | Aurélie Névéol | Xavier Tannier
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

La résolution de la coréférence est un élément essentiel pour la constitution automatique de chronologies médicales à partir des dossiers médicaux électroniques. Dans ce travail, nous présentons une approche neuronale pour la résolution de la coréférence dans des textes médicaux écrits en anglais pour les entités générales et cliniques en nous évaluant dans le cadre de référence pour cette tâche que constitue la tâche 1C de la campagne i2b2 2011.

pdf bib abs
MEDLINE as a Parallel Corpus: a Survey to Gain Insight on French-, Spanish- and Portuguese-speaking Authors’ Abstract Writing Practice
Aurélie Névéol | Antonio Jimeno Yepes | Mariana Neves
Proceedings of the Twelfth Language Resources and Evaluation Conference

Background: Parallel corpora are used to train and evaluate machine translation systems. To alleviate the cost of producing parallel resources for evaluation campaigns, existing corpora are leveraged. However, little information may be available about the methods used for producing the corpus, including translation direction. Objective: To gain insight on MEDLINE parallel corpus used in the biomedical task at the Workshop on Machine Translation in 2019 (WMT 2019). Material and Methods: Contact information for the authors of MEDLINE articles included in the English/Spanish (EN/ES), English/French (EN/FR), and English/Portuguese (EN/PT) WMT 2019 test sets was obtained from PubMed and publisher websites. The authors were asked about their abstract writing practices in a survey. Results: The response rate was above 20%. Authors reported that they are mainly native speakers of languages other than English. Although manual translation, sometimes via professional translation services, was commonly used for abstract translation, authors of articles in the EN/ES and EN/PT sets also relied on post-edited machine translation. Discussion: This study provides a characterization of MEDLINE authors’ language skills and abstract writing practices. Conclusion: The information collected in this study will be used to inform test set design for the next WMT biomedical task.

pdf bib
Traitement Automatique des Langues, Volume 61, Numéro 2 : TAL et Santé [NLP and Health]
Aurélie Névéol | Berry de Bruijn | Corinne Fredouille
Traitement Automatique des Langues, Volume 61, Numéro 2 : TAL et Santé [NLP and Health]

pdf bib
TAL et Santé [NLP and Health]
Aurélie Névéol | Berry de Bruijn | Corinne Fredouille
Traitement Automatique des Langues, Volume 61, Numéro 2 : TAL et Santé [NLP and Health]

Machine translation of scientific abstracts and terminologies has the potential to support health professionals and biomedical researchers in some of their activities. In the fifth edition of the WMT Biomedical Task, we addressed a total of eight language pairs. Five language pairs were previously addressed in past editions of the shared task, namely, English/German, English/French, English/Spanish, English/Portuguese, and English/Chinese. Three additional languages pairs were also introduced this year: English/Russian, English/Italian, and English/Basque. The task addressed the evaluation of both scientific abstracts (all language pairs) and terminologies (English/Basque only). We received submissions from a total of 20 teams. For recurring language pairs, we observed an improvement in the translations in terms of automatic scores and qualitative evaluations, compared to previous years.

2019

pdf bib abs
Community Perspective on Replicability in Natural Language Processing
Margot Mieskes | Karën Fort | Aurélie Névéol | Cyril Grouin | Kevin Cohen
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

With recent efforts in drawing attention to the task of replicating and/or reproducing results, for example in the context of COLING 2018 and various LREC workshops, the question arises how the NLP community views the topic of replicability in general. Using a survey, in which we involve members of the NLP community, we investigate how our community perceives this topic, its relevance and options for improvement. Based on over two hundred participants, the survey results confirm earlier observations, that successful reproducibility requires more than having access to code and data. Additionally, the results show that the topic has to be tackled from the authors’, reviewers’ and community’s side.

pdf bib abs
A distantly supervised dataset for automated data extraction from diagnostic studies
Christopher Norman | Mariska Leeflang | René Spijker | Evangelos Kanoulas | Aurélie Névéol
Proceedings of the 18th BioNLP Workshop and Shared Task

Systematic reviews are important in evidence based medicine, but are expensive to produce. Automating or semi-automating the data extraction of index test, target condition, and reference standard from articles has the potential to decrease the cost of conducting systematic reviews of diagnostic test accuracy, but relevant training data is not available. We create a distantly supervised dataset of approximately 90,000 sentences, and let two experts manually annotate a small subset of around 1,000 sentences for evaluation. We evaluate the performance of BioBERT and logistic regression for ranking the sentences, and compare the performance for distant and direct supervision. Our results suggest that distant supervision can work as well as, or better than direct supervision on this problem, and that distantly trained models can perform as well as, or better than human annotators.

In the fourth edition of the WMT Biomedical Translation task, we considered a total of six languages, namely Chinese (zh), English (en), French (fr), German (de), Portuguese (pt), and Spanish (es). We performed an evaluation of automatic translations for a total of 10 language directions, namely, zh/en, en/zh, fr/en, en/fr, de/en, en/de, pt/en, en/pt, es/en, and en/es. We provided training data based on MEDLINE abstracts for eight of the 10 language pairs and test sets for all of them. In addition to that, we offered a new sub-task for the translation of terms in biomedical terminologies for the en/es language direction. Higher BLEU scores (close to 0.5) were obtained for the es/en, en/es and en/pt test sets, as well as for the terminology sub-task. After manual validation of the primary runs, some submissions were judged to be better than the reference translations, for instance, for de/en, en/es and es/en.

2018

pdf bib abs
Détection automatique de phrases en domaine de spécialité en français (Sentence boundary detection for specialized domains in French )
Arthur Boyer | Aurélie Névéol
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

La détection de frontières de phrase est généralement considéré comme un problème résolu. Cependant, les outils performant sur des textes en domaine général, ne le sont pas forcement sur des domaines spécialisés, ce qui peut engendrer des dégradations de performance des outils intervenant en aval dans une chaîne de traitement automatique s’appuyant sur des textes découpés en phrases. Dans cet article, nous évaluons 5 outils de segmentation en phrase sur 3 corpus issus de différent domaines. Nous ré-entrainerons l’un de ces outils sur un corpus de spécialité pour étudier l’adaptation en domaine. Notamment, nous utilisons un nouveau corpus biomédical annoté spécifiquement pour cette tâche. La detection de frontières de phrase à l’aide d’un modèle OpenNLP entraîné sur un corpus clinique offre une F-mesure de .73, contre .66 pour la version standard de l’outil.

pdf bib
Parallel Corpora for the Biomedical Domain
Aurélie Névéol | Antonio Jimeno Yepes | Mariana Neves | Karin Verspoor
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Automating Document Discovery in the Systematic Review Process: How to Use Chaff to Extract Wheat
Christopher Norman | Mariska Leeflang | Pierre Zweigenbaum | Aurélie Névéol
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs
Evaluation of a Sequence Tagging Tool for Biomedical Texts
Julien Tourille | Matthieu Doutreligne | Olivier Ferret | Aurélie Névéol | Nicolas Paris | Xavier Tannier
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

Many applications in biomedical natural language processing rely on sequence tagging as an initial step to perform more complex analysis. To support text analysis in the biomedical domain, we introduce Yet Another SEquence Tagger (YASET), an open-source multi purpose sequence tagger that implements state-of-the-art deep learning algorithms for sequence tagging. Herein, we evaluate YASET on part-of-speech tagging and named entity recognition in a variety of text genres including articles from the biomedical literature in English and clinical narratives in French. To further characterize performance, we report distributions over 30 runs and different sizes of training datasets. YASET provides state-of-the-art performance on the CoNLL 2003 NER dataset (F1=0.87), MEDPOST corpus (F1=0.97), MERLoT corpus (F1=0.99) and NCBI disease corpus (F1=0.81). We believe that YASET is a versatile and efficient tool that can be used for sequence tagging in biomedical and clinical texts.

pdf bib abs
Findings of the WMT 2018 Biomedical Translation Shared Task: Evaluation on Medline test sets
Mariana Neves | Antonio Jimeno Yepes | Aurélie Névéol | Cristian Grozea | Amy Siu | Madeleine Kittner | Karin Verspoor
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

Machine translation enables the automatic translation of textual documents between languages and can facilitate access to information only available in a given language for non-speakers of this language, e.g. research results presented in scientific publications. In this paper, we provide an overview of the Biomedical Translation shared task in the Workshop on Machine Translation (WMT) 2018, which specifically examined the performance of machine translation systems for biomedical texts. This year, we provided test sets of scientific publications from two sources (EDP and Medline) and for six language pairs (English with each of Chinese, French, German, Portuguese, Romanian and Spanish). We describe the development of the various test sets, the submissions that we received and the evaluations that we carried out. We obtained a total of 39 runs from six teams and some of this year’s BLEU scores were somewhat higher that last year’s, especially for teams that made use of biomedical resources or state-of-the-art MT algorithms (e.g. Transformer). Finally, our manual evaluation scored automatic translations higher than the reference translations for German and Spanish.

2017

pdf bib abs
Tri Automatique de la Littérature pour les Revues Systématiques (Automatically Ranking the Literature in Support of Systematic Reviews)
Christopher Norman | Mariska Leeflang | Pierre Zweigenbaum | Aurélie Névéol
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Les revues systématiques de la littérature dans le domaine biomédical reposent essentiellement sur le travail bibliographique manuel d’experts. Nous évaluons les performances de la classification supervisée pour la découverte automatique d’articles à l’aide de plusieurs définitions des critères d’inclusion. Nous appliquons un modèle de regression logistique sur deux corpus issus de revues systématiques conduites dans le domaine du traitement automatique de la langue et de l’efficacité des médicaments. La classification offre une aire sous la courbe moyenne (AUC) de 0.769 si le classifieur est contruit à partir des jugements experts portés sur les titres et résumés des articles, et de 0.835 si on utilise les jugements portés sur le texte intégral. Ces résultats indiquent l’importance des jugements portés dès le début du processus de sélection pour développer un classifieur efficace pour accélérer l’élaboration des revues systématiques à l’aide d’un algorithme de classification standard.

pdf bib abs
Traitement automatique de la langue biomédicale au LIMSI (Biomedical language processing at LIMSI)
Christopher Norman | Cyril Grouin | Thomas Lavergne | Aurélie Névéol | Pierre Zweigenbaum
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 - Démonstrations

Nous proposons des démonstrations de trois outils développés par le LIMSI en traitement automatique des langues appliqué au domaine biomédical : la détection de concepts médicaux dans des textes courts, la catégorisation d’articles scientifiques pour l’assistance à l’écriture de revues systématiques, et l’anonymisation de textes cliniques.

pdf bib abs
Temporal information extraction from clinical text
Julien Tourille | Olivier Ferret | Xavier Tannier | Aurélie Névéol
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In this paper, we present a method for temporal relation extraction from clinical narratives in French and in English. We experiment on two comparable corpora, the MERLOT corpus and the THYME corpus, and show that a common approach can be used for both languages.

pdf bib abs
Neural Architecture for Temporal Relation Extraction: A Bi-LSTM Approach for Detecting Narrative Containers
Julien Tourille | Olivier Ferret | Aurélie Névéol | Xavier Tannier
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present a neural architecture for containment relation identification between medical events and/or temporal expressions. We experiment on a corpus of de-identified clinical notes in English from the Mayo Clinic, namely the THYME corpus. Our model achieves an F-measure of 0.613 and outperforms the best result reported on this corpus to date.

pdf bib abs
LIMSI-COT at SemEval-2017 Task 12: Neural Architecture for Temporal Information Extraction from Clinical Narratives
Julien Tourille | Olivier Ferret | Xavier Tannier | Aurélie Névéol
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper we present our participation to SemEval 2017 Task 12. We used a neural network based approach for entity and temporal relation extraction, and experimented with two domain adaptation strategies. We achieved competitive performance for both tasks.

2016

pdf bib abs
Extraction de relations temporelles dans des dossiers électroniques patient (Extracting Temporal Relations from Electronic Health Records)
Julien Tourille | Olivier Ferret | Aurélie Névéol | Xavier Tannier
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

L’analyse temporelle des documents cliniques permet d’obtenir des représentations riches des informations contenues dans les dossiers électroniques patient. Cette analyse repose sur l’extraction d’événements, d’expressions temporelles et des relations entre eux. Dans ce travail, nous considérons que nous disposons des événements et des expressions temporelles pertinents et nous nous intéressons aux relations temporelles entre deux événements ou entre un événement et une expression temporelle. Nous présentons des modèles de classification supervisée pour l’extraction de des relations en français et en anglais. Les performances obtenues sont comparables dans les deux langues, suggérant ainsi que différents domaines cliniques et différentes langues pourraient être abordés de manière similaire.

pdf bib abs
The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine
Mariana Neves | Antonio Jimeno Yepes | Aurélie Névéol
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The biomedical scientific literature is a rich source of information not only in the English language, for which it is more abundant, but also in other languages, such as Portuguese, Spanish and French. We present the first freely available parallel corpus of scientific publications for the biomedical domain. Documents from the ”Biological Sciences” and ”Health Sciences” categories were retrieved from the Scielo database and parallel titles and abstracts are available for the following language pairs: Portuguese/English (about 86,000 documents in total), Spanish/English (about 95,000 documents) and French/English (about 2,000 documents). Additionally, monolingual data was also collected for all four languages. Sentences in the parallel corpus were automatically aligned and a manual analysis of 200 documents by native experts found that a minimum of 79% of sentences were correctly aligned in all language pairs. We demonstrate the utility of the corpus by running baseline machine translation experiments. We show that for all language pairs, a statistical machine translation system trained on the parallel corpora achieves performance that rivals or exceeds the state of the art in the biomedical domain. Furthermore, the corpora are currently being used in the biomedical task in the First Conference on Machine Translation (WMT’16).

pdf bib
LIMSI-COT at SemEval-2016 Task 12: Temporal relation identification using a pipeline of classifiers
Julien Tourille | Olivier Ferret | Aurélie Névéol | Xavier Tannier
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib abs
A Dataset for ICD-10 Coding of Death Certificates: Creation and Usage
Thomas Lavergne | Aurélie Névéol | Aude Robert | Cyril Grouin | Grégoire Rey | Pierre Zweigenbaum
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

Very few datasets have been released for the evaluation of diagnosis coding with the International Classification of Diseases, and only one so far in a language other than English. This paper describes a large-scale dataset prepared from French death certificates, and the problems which needed to be solved to turn it into a dataset suitable for the application of machine learning and natural language processing methods of ICD-10 coding. The dataset includes the free-text statements written by medical doctors, the associated meta-data, the human coder-assigned codes for each statement, as well as the statement segments which supported the coder’s decision for each code. The dataset comprises 93,694 death certificates totalling 276,103 statements and 377,677 ICD-10 code assignments (3,457 unique codes). It was made available for an international automated coding shared task, which attracted five participating teams. An extended version of the dataset will be used in a new edition of the shared task.

pdf bib abs
Detection of Text Reuse in French Medical Corpora
Eva D’hondt | Cyril Grouin | Aurélie Névéol | Efstathios Stamatatos | Pierre Zweigenbaum
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

Electronic Health Records (EHRs) are increasingly available in modern health care institutions either through the direct creation of electronic documents in hospitals’ health information systems, or through the digitization of historical paper records. Each EHR creation method yields the need for sophisticated text reuse detection tools in order to prepare the EHR collections for efficient secondary use relying on Natural Language Processing methods. Herein, we address the detection of two types of text reuse in French EHRs: 1) the detection of updated versions of the same document and 2) the detection of document duplicates that still bear surface differences due to OCR or de-identification processing. We present a robust text reuse detection method to automatically identify redundant document pairs in two French EHR corpora that achieves an overall macro F-measure of 0.68 and 0.60, respectively and correctly identifies all redundant document pairs of interest.

pdf bib
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis
Cyril Grouin | Thierry Hamon | Aurélie Névéol | Pierre Zweigenbaum
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

pdf bib
Replicability of Research in Biomedical Natural Language Processing: a pilot evaluation for a coding task
Aurélie Névéol | Kevin Cohen | Cyril Grouin | Aude Robert
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

2015

pdf bib abs
Analyse d’expressions temporelles dans les dossiers électroniques patients
Mike Donald Tapi Nzali | Aurélie Névéol | Xavier Tannier
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Les références à des phénomènes du monde réel et à leur caractérisation temporelle se retrouvent dans beaucoup de types de discours en langue naturelle. Ainsi, l’analyse temporelle apparaît comme un élément important en traitement automatique de la langue. Cet article présente une analyse de textes en domaine de spécialité du point de vue temporel. En s’appuyant sur un corpus de documents issus de plusieurs dossiers électroniques patient désidentifiés, nous décrivons la construction d’une ressource annotée en expressions temporelles selon la norme TimeML. Par suite, nous utilisons cette ressource pour évaluer plusieurs méthodes d’extraction automatique d’expressions temporelles adaptées au domaine médical. Notre meilleur système statistique offre une performance de 0,91 de F-mesure, surpassant pour l’identification le système état de l’art HeidelTime. La comparaison de notre corpus de travail avec le corpus journalistique FR-Timebank permet également de caractériser les différences d’utilisation des expressions temporelles dans deux domaines de spécialité.

pdf bib abs
Etiquetage morpho-syntaxique en domaine de spécialité: le domaine médical
Christelle Rabary | Thomas Lavergne | Aurélie Névéol
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

L’étiquetage morpho-syntaxique est une tâche fondamentale du Traitement Automatique de la Langue, sur laquelle reposent souvent des traitements plus complexes tels que l’extraction d’information ou la traduction automatique. L’étiquetage en domaine de spécialité est limité par la disponibilité d’outils et de corpus annotés spécifiques au domaine. Dans cet article, nous présentons le développement d’un corpus clinique du français annoté morpho-syntaxiquement à l’aide d’un jeu d’étiquettes issus des guides d’annotation French Treebank et Multitag. L’analyse de ce corpus nous permet de caractériser le domaine clinique et de dégager les points clés pour l’adaptation d’outils d’analyse morpho-syntaxique à ce domaine. Nous montrons également les limites d’un outil entraîné sur un corpus journalistique appliqué au domaine clinique. En perspective de ce travail, nous envisageons une application du corpus clinique annoté pour améliorer l’étiquetage morpho-syntaxique des documents cliniques en français.

pdf bib
Automatic Extraction of Time Expressions Accross Domains in French Narratives
Mike Donald Tapi Nzali | Xavier Tannier | Aurélie Névéol
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis
Cyril Grouin | Thierry Hamon | Aurélie Névéol | Pierre Zweigenbaum
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

pdf bib
Redundancy in French Electronic Health Records: A preliminary study
Eva D’hondt | Xavier Tannier | Aurélie Névéol
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

pdf bib
Is it possible to recover personal health information from an automatically de-identified corpus of French EHRs?
Cyril Grouin | Nicolas Griffon | Aurélie Névéol
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

2014

pdf bib
Automatic identification of document sections for designing a French clinical corpus (Identification automatique de zones dans des documents pour la constitution d’un corpus médical en français) [in French]
Louise Deléger | Aurélie Névéol
Proceedings of TALN 2014 (Volume 2: Short Papers)

pdf bib abs
Annotation of specialized corpora using a comprehensive entity and relation scheme
Louise Deléger | Anne-Laure Ligozat | Cyril Grouin | Pierre Zweigenbaum | Aurélie Névéol
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Annotated corpora are essential resources for many applications in Natural Language Processing. They provide insight on the linguistic and semantic characteristics of the genre and domain covered, and can be used for the training and evaluation of automatic tools. In the biomedical domain, annotated corpora of English texts have become available for several genres and subfields. However, very few similar resources are available for languages other than English. In this paper we present an effort to produce a high-quality corpus of clinical documents in French, annotated with a comprehensive scheme of entities and relations. We present the annotation scheme as well as the results of a pilot annotation study covering 35 clinical documents in a variety of subfields and genres. We show that high inter-annotator agreement can be achieved using a complex annotation scheme.

pdf bib abs
Language Resources for French in the Biomedical Domain
Aurélie Névéol | Julien Grosjean | Stéfan Darmoni | Pierre Zweigenbaum
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The biomedical domain offers a wealth of linguistic resources for Natural Language Processing, including terminologies and corpora. While many of these resources are prominently available for English, other languages including French benefit from substantial coverage thanks to the contribution of an active community over the past decades. However, access to terminological resources in languages other than English may not be as straight-forward as access to their English counterparts. Herein, we review the extent of resource coverage for French and give pointers to access French-language resources. We also discuss the sources and methods for making additional material available for French.

pdf bib
Optimizing annotation efforts to build reliable annotated corpora for training statistical models
Cyril Grouin | Thomas Lavergne | Aurélie Névéol
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

L’indexation est une composante importante de tout système de recherche d’information. Dans MEDLINE, la base documentaire de référence pour la littérature du domaine biomédical, le contenu des articles référencés est indexé à l’aide de descripteurs issus du thésaurus MeSH. Avec l’augmentation constante de publications à indexer pour maintenir la base à jour, le besoin d’outils automatiques se fait pressant pour les indexeurs. Dans cet article, nous décrivons l’utilisation et l’adaptation de la Programmation Logique Inductive (PLI) pour découvrir des règles d’indexation permettant de générer automatiquement des recommandations d’indexation pour MEDLINE. Les résultats obtenus par cette approche originale sont très satisfaisants comparés à ceux obtenus à l’aide de règles manuelles lorsque celles-ci existent. Ainsi, les jeux de règles obtenus par PLI devraient être prochainement intégrés au système produisant les recommandations d’indexation automatique pour MEDLINE.

pdf bib
Automatic inference of indexing rules for MEDLINE
Aurélie Névéol | Sonya Shooshan | Vincent Claveau
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

2007

pdf bib
Automatic Indexing of Specialized Documents: Using Generic vs. Domain-Specific Document Representations
Aurélie Névéol | James G. Mork | Alan R. Aronson
Biological, translational, and clinical language processing

2005

pdf bib abs
Indexation automatique de ressources de santé à l’aide de paires de descripteurs MeSH
Aurélie Névéol | Alexandrina Rogozan | Stéfan Darmoni
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Depuis quelques années, médecins et documentalistes doivent faire face à une demande croissante dans le domaine du codage médico-économique et de l’indexation des diverses sources d’information disponibles dans le domaine de la santé. Il est donc nécessaire de développer des outils d’indexation automatique qui réduisent les délais d’indexation et facilitent l’accès aux ressources médicales. Nous proposons deux méthodes d’indexation automatique de ressources de santé à l’aide de paires de descripteurs MeSH. La combinaison de ces deux méthodes permet d’optimiser les résulats en exploitant la complémentarité des approches. Les performances obtenues sont équivalentes à celles des outils de la littérature pour une indexation à l’aide de descripteurs seuls.

2004

pdf bib abs
Indexation automatique de ressources de santé à l’aide d’un vocabulaire contrôlé
Aurélie Névéol
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

Nous présentons ici le système d’indexation automatique actuellement en cours de développement dans l’équipe CISMeF afin d’aider les documentalistes lors de l’indexation de ressources de santé. Nous détaillons l’architecture du système pour l’extraction de mots clés MeSH, et présentons les résultats d’une première évaluation. La stratégie d’indexation choisie atteint une précision comparable à celle des systèmes existants. De plus, elle permet d’extraire des paires mot clé/qualificatif, et non des termes isolés, ce qui constitue une indexation beaucoup plus fine. Les travaux en cours s’attachent à étendre la couverture des dictionnaires, et des tests à plus grande échelle sont envisagés afin de valider le système et d’évaluer sa valeur ajoutée dans le travail quotidien des documentalistes.