Pierre Zweigenbaum - ACL Anthology

Pierre Zweigenbaum

Also published as: P. Zweigenbaum

2026

KAD: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral
Ayoub Hammal | Pierre Zweigenbaum | Caio Corro
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Several previous works concluded that most of the generation capabilities of large language models (LLMs) are learned (early) during pre-training. However, LLMs still require further alignment to adhere to downstream task requirements and stylistic preferences, among other desired properties. As LLMs continue to scale in size, the computational cost of alignment procedures increases prohibitively. In this work, we propose a novel approach to circumvent these costs via proxy-based test-time alignment, i.e. using guidance from a small aligned model. Our approach can be described as a token-specific cascading method, where the token-specific deferral rule is reduced to a 0-1 knapsack problem. In this setting, we derive primal and dual approximations of the optimal deferral decision. We experimentally show the benefits of our method both in task performance and speculative decoding speed.

2025

Leveraging External Knowledge Bases: Analyzing Presentation Methods and Their Impact on Model Performance
Hui-Syuan Yeh | Thomas Lavergne | Pierre Zweigenbaum
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)

Integrating external knowledge into large language models has demonstrated potential for performance improvement across a wide range of tasks. This approach is particularly appealing in domain-specific applications, such as in the biomedical field. However, the strategies for effectively presenting external knowledge to these models remain underexplored. This study investigates the impact of different knowledge presentation methods and their influence on model performance. Our results show that inserting knowledge between demonstrations helps the models perform better and enables smaller LLMs (7B) to perform on par with larger LLMs (175B). Our further investigation indicates, however, that the performance improvement comes more from the effect of additional tokens and positioning than from the relevance of the knowledge.

Comment évaluer un grand modèle de langue dans le domaine médical en français ?
Christophe Servan | Cyril Grouin | Aurélie Névéol | Pierre Zweigenbaum
Actes de l'atelier Évaluation des modèles génératifs (LLM) et challenge 2025 (EvalLLM)

Recent advances in Natural Language Processing driven by auto-regressive large language models (LLMs) are also reaching specialized domains, including healthcare. This study examines the questions raised by the evaluation of LLMs applied to the health domain, with a focus on French. After a brief overview of the tasks and evaluation data available for this specialized domain, the article examines how LLMs are evaluated on discriminative tasks (named entity detection, text classification) and generative tasks (summarization of clinical reports, generation of clinical cases). The article does not aim to report a concrete evaluation, but to discuss and prepare the methodology for carrying one out.

Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)
Serge Sharoff | Ayla Rigouts Terryn | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)

Inférence en langue naturelle appliquée au recrutement de patients pour les essais cliniques : le point de vue du patient
Mathilde Aguiar | Pierre Zweigenbaum | Nona Naderi
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Recruiting patients for clinical trials is long and complex. Usually, the recruitment process is initiated by a healthcare professional who proposes that a patient take part in the clinical trial. Promoting trials directly to patients via online platforms could help reach more of them. In this study, we focus on the case where the patient initiates the process and wants to know whether they are eligible for a clinical trial, all while using their own patient language. To determine whether the use of such language still allows a language model to determine the patient's eligibility for the clinical trial, we build the task Natural Language Inference for Patient Recruitment (NLI4PR). To do so, we adapt the TREC 2022 Clinical Trial Track dataset by manually rewriting the medical profiles in patient language. We also extract the clinical trials in which patients were labeled "eligible" or "excluded". We prompt several large language models and obtain F1 scores between 56.6 and 71.8 with patient language, against 64.7 to 73.1 with medical language. We observe that using patient language leads to only a relatively small drop in performance for our best model. This suggests that having the patient as the starting point of recruitment could be feasible. Our scripts and datasets are available on GitHub and HuggingFace (Aguiar et al., 2025).

Am I eligible? Natural Language Inference for Clinical Trial Patient Recruitment: the Patient’s Point of View
Mathilde Aguiar | Pierre Zweigenbaum | Nona Naderi
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)

Recruiting patients to participate in clinical trials can be challenging and time-consuming. Usually, participation in a clinical trial is initiated by a healthcare professional and proposed to the patient. Promoting clinical trials directly to patients via online recruitment might help to reach them more efficiently. In this study, we address the case where a patient is initiating their own recruitment process and wants to determine whether they are eligible for a given clinical trial, using their own language to describe their medical profile. To study whether this creates difficulties in the patient-trial matching process, we design a new dataset and task, Natural Language Inference for Patient Recruitment (NLI4PR), in which patient-language profiles must be matched to clinical trials. We create it by adapting the TREC 2022 Clinical Trial Track dataset, which provides patients’ medical profiles, and rephrasing them manually using patient language. We also use the associated clinical trial reports where the patients are either eligible or excluded. We prompt several open-source Large Language Models on our task and achieve F1 scores from 56.5 to 71.8 using patient language, against 64.7 to 73.1 for the same task using medical language. When using patient language, we observe only a small loss in performance for the best model, suggesting that having the patient as a starting point could be adopted to help recruit patients for clinical trials. The corpus and code bases are all freely available on our GitHub and HuggingFace repositories.

2024

For the past nine years, the Social Media Mining for Health Applications (#SMM4H) shared tasks have promoted community-driven development and evaluation of advanced natural language processing systems to detect, extract, and normalize health-related information in publicly available user-generated content. This year, #SMM4H included seven shared tasks in English, Japanese, German, French, and Spanish from Twitter, Reddit, and health forums. A total of 84 teams from 22 countries registered for #SMM4H, and 45 teams participated in at least one task. This represents a growth of 180% and 160% in registration and participation, respectively, compared to the last iteration. This paper provides an overview of the tasks and participating systems. The data sets remain available upon request, and new systems can be evaluated through the post-evaluation phase on CodaLab.

Enriching a Time-Domain Astrophysics Corpus with Named Entity, Coreference and Astrophysical Relationship Annotations
Atilla Kaan Alkan | Felix Grezes | Cyril Grouin | Fabian Schussler | Pierre Zweigenbaum
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Interest in Astrophysical Natural Language Processing (NLP) has increased recently, fueled by the development of specialized language models for information extraction. However, the scarcity of annotated resources for this domain is still a significant challenge. Most existing corpora are limited to Named Entity Recognition (NER) tasks, leaving a gap in resource diversity. To address this gap and facilitate a broader spectrum of NLP research in astrophysics, we introduce astroECR, an extension of our previously built Time-Domain Astrophysics Corpus (TDAC). Our contributions involve expanding it to cover named entities, coreferences, annotations related to astrophysical relationships, and normalizing celestial object names. We showcase practical utility through baseline models for four NLP tasks and provide the research community access to our corpus, code, and models.

SEME at SemEval-2024 Task 2: Comparing Masked and Generative Language Models on Natural Language Inference for Clinical Trials
Mathilde Aguiar | Pierre Zweigenbaum | Nona Naderi
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper describes our submission to Task 2 of SemEval-2024: Safe Biomedical Natural Language Inference for Clinical Trials. The Multi-evidence Natural Language Inference for Clinical Trial Data (NLI4CT) consists of a Textual Entailment (TE) task focused on the evaluation of the consistency and faithfulness of Natural Language Inference (NLI) models applied to Clinical Trial Reports (CTR). We test 2 distinct approaches, one based on finetuning and ensembling Masked Language Models and the other based on prompting Large Language Models using templates, in particular, using Chain-Of-Thought and Contrastive Chain-Of-Thought. Prompting Flan-T5-large in a 2-shot setting leads to our best system that achieves 0.57 F1 score, 0.64 Faithfulness, and 0.56 Consistency.

A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages
Lisa Raithel | Hui-Syuan Yeh | Shuntaro Yada | Cyril Grouin | Thomas Lavergne | Aurélie Névéol | Patrick Paroubek | Philippe Thomas | Tomohiro Nishiyama | Sebastian Möller | Eiji Aramaki | Yuji Matsumoto | Roland Roller | Pierre Zweigenbaum
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.

Overview of #SMM4H 2024 – Task 2: Cross-Lingual Few-Shot Relation Extraction for Pharmacovigilance in French, German, and Japanese
Lisa Raithel | Philippe Thomas | Bhuvanesh Verma | Roland Roller | Hui-Syuan Yeh | Shuntaro Yada | Cyril Grouin | Shoko Wakamiya | Eiji Aramaki | Sebastian Möller | Pierre Zweigenbaum
Proceedings of the 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks

This paper provides an overview of Task 2 from the Social Media Mining for Health 2024 shared task (#SMM4H 2024), which focused on Named Entity Recognition (NER, Subtask 2a) and the joint task of NER and Relation Extraction (RE, Subtask 2b) for detecting adverse drug reactions (ADRs) in German, Japanese, and French texts written by patients. Participants were challenged with a few-shot learning scenario, necessitating models that can effectively generalize from limited annotated examples. Despite the diverse strategies employed by the participants, the overall performance across submissions from three teams highlighted significant challenges. The results underscored the complexity of extracting entities and relations in multi-lingual contexts, especially from the noisy and informal nature of user-generated content. Further research is required to develop robust systems capable of accurately identifying and associating ADR-related information in low-resource and multilingual settings.

Assessing Authenticity and Anonymity of Synthetic User-generated Content in the Medical Domain
Tomohiro Nishiyama | Lisa Raithel | Roland Roller | Pierre Zweigenbaum | Eiji Aramaki
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)

Since medical text cannot be shared easily due to privacy concerns, synthetic data bears much potential for natural language processing applications. In the context of social media and user-generated messages about drug intake and adverse drug effects, this work presents different methods to examine the authenticity of synthetic text. We conclude that the generated tweets are untraceable and show enough authenticity from the medical point of view to be used as a replacement for a real Twitter corpus. However, original data might still be the preferred choice as they contain much more diversity.

Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024
Pierre Zweigenbaum | Reinhard Rapp | Serge Sharoff
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024

astroECR : enrichissement d’un corpus astrophysique en entités nommées, coréférences et relations sémantiques
Atilla Kaan Alkan | Felix Grezes | Cyril Grouin | Fabian Schüssler | Pierre Zweigenbaum
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

The lack of annotated resources is a major challenge for natural language processing in astrophysics. To fill this gap, we present astroECR, an extension of the TDAC corpus (Time-Domain Astrophysics Corpus). Our corpus, made up of 300 observation reports in English, extends TDAC's initial annotation scheme by introducing five additional named entity classes specific to astrophysics. We enriched the annotations by including coreferences, semantic relations between celestial objects and their physical properties, and by normalizing celestial object names via astronomical databases. We demonstrate the usefulness of our corpus by providing baseline scores on four tasks: named entity recognition, coreference resolution, relation detection, and celestial object name normalization. We release the corpus together with its annotation guide, the source code, and the associated models.

2023

La pré-annotation automatique de textes cliniques comme support au dialogue avec les experts du domaine lors de la mise au point d’un schéma d’annotation
Virgile Barthet | Marie-José Aroulanda | Laura Monceaux-Cachard | Christine Jacquin | Cyril Grouin | Johann Gutton | Guillaume Hocquet | Pascal De Groote | Michel Komajda | Emmanuel Morin | Pierre Zweigenbaum
Actes de CORIA-TALN 2023. Actes de l'atelier "Analyse et Recherche de Textes Scientifiques" (ARTS)@TALN 2023

Automatic pre-annotation of texts is an essential task that can facilitate the annotation of a text corpus. In cardiology, annotation is a complex task that requires in-depth domain knowledge and practical professional experience. Pre-annotating texts aims to reduce the time demanded of experts, helping them concentrate on the more critical aspects of annotation. We report here on an experiment in pre-annotating clinical texts in cardiology: we present how it was carried out and the observations we draw from it on the interaction with domain experts and the development of the annotation scheme.

Proceedings of the Workshop on Computational Terminology in NLP and Translation Studies (ConTeNTS) Incorporating the 16th Workshop on Building and Using Comparable Corpora (BUCC)
Amal Haddad Haddad | Ayla Rigouts Terryn | Ruslan Mitkov | Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the Workshop on Computational Terminology in NLP and Translation Studies (ConTeNTS) Incorporating the 16th Workshop on Building and Using Comparable Corpora (BUCC)

Étude de méthodes d’augmentation de données pour la reconnaissance d’entités nommées en astrophysique
Atilla Kaan Alkan | Cyril Grouin | Pierre Zweigenbaum
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs

In this article we study the value of data augmentation for named entity recognition in a specialized domain: astrophysics. To do so, we compare three augmentation methods using two recent annotated corpora from the domain, DEAL and TDAC, both in English. We generated the artificial data using rule-based and language-model-based methods. The data were then added iteratively to fine-tune an entity detection system. The results reveal a threshold effect: adding artificial data beyond a certain quantity brings no further benefit and can degrade the F-measure. On both corpora, the threshold varies with the method employed and with the language model used. This study also shows that data augmentation is more effective on small corpora, which is consistent with earlier studies. Indeed, our experiments show that the F-measure can be improved by 1 point on the DEAL corpus, and by up to 2 points on the TDAC corpus.

Exploitation de plongements de graphes pour l’extraction de relations biomédicales
Anfu Tang | Robert Bossy | Louise Deléger | Claire Nédellec | Pierre Zweigenbaum
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs

Integrating external knowledge into neural models is widely studied as a way to improve the performance of pre-trained language models, notably in the biomedical domain. In this article, we explore the contribution of knowledge base embeddings to a relation extraction task. For two candidate entity mentions in a text, we hypothesize that knowing the relations between them, drawn from an external knowledge base (KB), helps predict the existence of a relation in the text, including when the KB relations differ from those of the text. Our approach consists in computing embeddings of the KB graph and estimating, for each entity pair in the text, the possibility that it is linked by a KB relation. Experiments on three biomedical relation extraction tasks show that our method outperforms the base PubMedBERT model and performs comparably to state-of-the-art methods.

2022

Proceedings of the BUCC Workshop within LREC 2022
Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the BUCC Workshop within LREC 2022

Specializing Static and Contextual Embeddings in the Medical Domain Using Knowledge Graphs: Let’s Keep It Simple
Hicham El Boukkouri | Olivier Ferret | Thomas Lavergne | Pierre Zweigenbaum
Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)

Domain adaptation of word embeddings has mainly been explored in the context of retraining general models on large specialized corpora. While this usually yields good results, we argue that knowledge graphs, which are used less frequently, could also be utilized to enhance existing representations with specialized knowledge. In this work, we aim to shed some light on whether such knowledge injection could be achieved using a basic set of tools: graph-level embeddings and concatenation. To that end, we adopt an incremental approach where we first demonstrate that static embeddings can indeed be improved through concatenation with in-domain node2vec representations. Then, we validate this approach on contextual models and generalize it further by proposing a variant of BERT that incorporates knowledge embeddings within its hidden states through the same process of concatenation. We show that this variant outperforms plain retraining on several specialized tasks, then discuss how this simple approach could be improved further. Both our code and pre-trained models are open-sourced for future research. In this work, we conduct experiments that target the medical domain and the English language.

Decorate the Examples: A Simple Method of Prompt Design for Biomedical Relation Extraction
Hui-Syuan Yeh | Thomas Lavergne | Pierre Zweigenbaum
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Relation extraction is a core problem for natural language processing in the biomedical domain. Recent research on relation extraction showed that prompt-based learning improves performance in both fine-tuning on the full training set and few-shot training. However, less effort has been made on domain-specific tasks where good prompt design can be even harder. In this paper, we investigate prompting for biomedical relation extraction, with experiments on the ChemProt dataset. We present a simple yet effective method to systematically generate comprehensive prompts that reformulate the relation extraction task as a cloze-test task under a simple prompt formulation. In particular, we experiment with different ranking scores for prompt selection. With BioMed-RoBERTa-base, our results show that prompting-based fine-tuning obtains gains of 14.21 F1 over its regular fine-tuning baseline, and 1.14 F1 over SciFive-Large, the current state-of-the-art on ChemProt. Besides, we find prompt-based learning requires fewer training examples to make reasonable predictions. The results demonstrate the potential of our methods in such a domain-specific relation extraction task.

A Majority Voting Strategy of a SciBERT-based Ensemble Models for Detecting Entities in the Astrophysics Literature (Shared Task)
Atilla Kaan Alkan | Cyril Grouin | Fabian Schussler | Pierre Zweigenbaum
Proceedings of the First Workshop on Information Extraction from Scientific Publications

Detecting Entities in the Astrophysics Literature (DEAL) is a proposed shared task in the scope of the first Workshop on Information Extraction from Scientific Publications (WIESP) at AACL-IJCNLP 2022. It aims to propose systems identifying astrophysical named entities. This article presents our system based on a majority voting strategy of an ensemble composed of multiple SciBERT models. The system we propose is ranked second and outperforms the baseline provided by the organisers by achieving an F1 score of 0.7993 and a Matthews Correlation Coefficient (MCC) score of 0.8978 in the testing phase.
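
As a rough illustration of the majority voting idea mentioned in this abstract, the sketch below combines token-level label sequences from several models by keeping the most frequent label at each position. It is a minimal sketch under stated assumptions, not the authors' system: the function name `majority_vote`, the tie-breaking rule (the label from the earliest model in the list wins), and the example labels are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token label sequences from several models.

    predictions: list of label sequences, one per model, all the same length.
    Ties are broken in favour of the earliest model in the list
    (an assumption made for this sketch only).
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    voted = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels)
        best = max(counts.values())
        # Pick the first label (in model order) that reaches the top count.
        winner = next(label for label in token_labels if counts[label] == best)
        voted.append(winner)
    return voted


# Hypothetical labels from three models over three tokens.
model_outputs = [
    ["O", "B-Star", "O"],
    ["O", "B-Star", "B-Star"],
    ["B-Star", "O", "O"],
]
print(majority_vote(model_outputs))
```

With an odd number of models and two candidate labels per token, ties cannot occur; the tie-breaking rule only matters with more labels or an even number of models.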

Re-train or Train from Scratch? Comparing Pre-training Strategies of BERT in the Medical Domain
Hicham El Boukkouri | Olivier Ferret | Thomas Lavergne | Pierre Zweigenbaum
Proceedings of the Thirteenth Language Resources and Evaluation Conference

BERT models used in specialized domains all seem to be the result of a simple strategy: initializing with the original BERT and then resuming pre-training on a specialized corpus. This method yields rather good performance (e.g. BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019), BlueBERT (Peng et al., 2019)). However, it seems reasonable to think that training directly on a specialized corpus, using a specialized vocabulary, could result in more tailored embeddings and thus help performance. To test this hypothesis, we train BERT models from scratch using many configurations involving general and medical corpora. Based on evaluations using four different tasks, we find that the initial corpus only has a weak influence on the performance of BERT models when these are further pre-trained on a medical corpus.

MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents
Victoria Arranz | Khalid Choukri | Montse Cuadros | Aitor García Pablos | Lucie Gianola | Cyril Grouin | Manuel Herranz | Patrick Paroubek | Pierre Zweigenbaum
Proceedings of the Workshop on Ethical and Legal Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Data In Language Resources within the 13th Language Resources and Evaluation Conference

This paper presents the outcomes of the MAPA project, a set of annotated corpora for 24 languages of the European Union and an open-source customisable toolkit able to detect and substitute sensitive information in text documents from any domain, using state-of-the-art, deep learning-based named entity recognition techniques. In the context of the project, the toolkit has been developed and tested on administrative, legal and medical documents, obtaining state-of-the-art results. As a result of the project, 24 dataset packages have been released and the de-identification toolkit is available as open source.

TDAC, The First Corpus in Time-Domain Astrophysics: Analysis and First Experiments on Named Entity Recognition
Atilla Kaan Alkan | Cyril Grouin | Fabian Schussler | Pierre Zweigenbaum
Proceedings of the First Workshop on Information Extraction from Scientific Publications

The increased interest in time-domain astronomy over the last decades has resulted in a substantial increase in the publication of observation reports, saturating astrophysicists' capacity to read, analyze and classify information. Due to the short life span of the detected astronomical events, the information related to the characterization of new phenomena has to be communicated and analyzed very rapidly to allow other observatories to react and conduct their follow-up observations. This paper introduces TDAC: the first Corpus in Time-Domain Astrophysics, based on observation reports. We also present the NLP experiments we made for named entity recognition based on annotations we made and annotations from the WIESP NLP Challenge.

Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient’s Perspective
Lisa Raithel | Philippe Thomas | Roland Roller | Oliver Sapina | Sebastian Möller | Pierre Zweigenbaum
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this work, we present the first corpus for German Adverse Drug Reaction (ADR) detection in patient-generated content. The data consists of 4,169 binary annotated documents from a German patient forum, where users talk about health issues and get advice from medical doctors. As is common in social media data in this domain, the class labels of the corpus are very imbalanced. This and a high topic imbalance make it a very challenging dataset, since often, the same symptom can have several causes and is not always related to a medication intake. We aim to encourage further multi-lingual efforts in the domain of ADR detection and provide preliminary experiments for binary classification using different methods of zero- and few-shot learning based on a multi-lingual model. When fine-tuning XLM-RoBERTa first on English patient forum data and then on the new German data, we achieve an F1-score of 37.52 for the positive class. We make the dataset and models publicly available for the community.

Building Comparable Corpora for Assessing Multi-Word Term Alignment
Omar Adjali | Emmanuel Morin | Pierre Zweigenbaum
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Recent work has demonstrated the importance of dealing with Multi-Word Terms (MWTs) in several Natural Language Processing applications. In particular, MWTs pose serious challenges for alignment and machine translation systems because of their syntactic and semantic properties. Thus, developing algorithms that handle MWTs is becoming essential for many NLP tasks. However, the availability of bilingual and more generally multi-lingual resources is limited, especially for low-resourced languages and in specialized domains. In this paper, we propose an approach for building comparable corpora and bilingual term dictionaries that help evaluate bilingual term alignment in comparable corpora. To that aim, we exploit parallel corpora to perform automatic bilingual MWT extraction and comparable corpus construction. Parallel information helps to align bilingual MWTs and makes it easier to build comparable specialized sub-corpora. Experimental validation on an existing dataset and on manually annotated data shows the interest of the proposed methodology.

2021

Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
Reinhard Rapp | Serge Sharoff | Pierre Zweigenbaum
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

Differential Evaluation: a Qualitative Analysis of Natural Language Processing System Behavior Based Upon Data Resistance to Processing
Lucie Gianola | Hicham El Boukkouri | Cyril Grouin | Thomas Lavergne | Patrick Paroubek | Pierre Zweigenbaum
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

Most of the time, when dealing with a particular Natural Language Processing task, systems are compared on the basis of global statistics such as recall, precision, F1-score, etc. While such scores provide a general idea of the behavior of these systems, they ignore a key piece of information that can be useful for assessing progress and discerning remaining challenges: the relative difficulty of test instances. To address this shortcoming, we introduce the notion of differential evaluation which effectively defines a pragmatic partition of instances into gradually more difficult bins by leveraging the predictions made by a set of systems. Comparing systems along these difficulty bins enables us to produce a finer-grained analysis of their relative merits, which we illustrate on two use-cases: a comparison of systems participating in a multi-label text classification task (CLEF eHealth 2018 ICD-10 coding), and a comparison of neural models trained for biomedical entity detection (BioCreative V chemical-disease relations dataset).
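
One simple way to picture the partition described in this abstract is to group test instances by how many systems fail on them, then score each system per group. The sketch below does exactly that; it is an assumption-laden illustration, not the paper's definition of differential evaluation, whose exact binning may differ. The names `difficulty_bins` and `accuracy_per_bin` and the toy data are hypothetical.

```python
from collections import defaultdict


def difficulty_bins(gold, system_predictions):
    """Group instance indices by the number of systems that get them wrong.

    gold: list of gold labels.
    system_predictions: dict mapping a system name to its list of predictions.
    Returns a dict {number_of_failing_systems: [instance indices]}.
    """
    bins = defaultdict(list)
    for i, gold_label in enumerate(gold):
        failures = sum(
            1 for preds in system_predictions.values() if preds[i] != gold_label
        )
        bins[failures].append(i)
    return dict(bins)


def accuracy_per_bin(gold, preds, bins):
    """Accuracy of one system inside each difficulty bin."""
    return {
        k: sum(preds[i] == gold[i] for i in idx) / len(idx)
        for k, idx in sorted(bins.items())
    }


# Toy data: four instances, two hypothetical systems.
gold = ["a", "b", "c", "d"]
systems = {
    "s1": ["a", "b", "x", "x"],
    "s2": ["a", "x", "c", "x"],
}
bins = difficulty_bins(gold, systems)
print(bins)
print(accuracy_per_bin(gold, systems["s1"], bins))
```

Bins with a higher key contain instances that more systems miss, so comparing per-bin scores shows where systems differ beyond a single global figure.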

2020

Overview of the Fourth BUCC Shared Task: Bilingual Dictionary Induction from Comparable Corpora
Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

The shared task of the 13th Workshop on Building and Using Comparable Corpora was devoted to the induction of bilingual dictionaries from comparable rather than parallel corpora. In this task, for a number of language pairs involving Chinese, English, French, German, Russian and Spanish, the participants were supposed to determine automatically the target language translations of several thousand source language test words of three frequency ranges. We describe here some background, the task definition, the training and test data sets and the evaluation used for ranking the participating systems. We also summarize the approaches used and present the results of the evaluation. In conclusion, the competition yielded results from a number of systems that provide surprisingly good solutions to this ambitious problem.

Proceedings of the 13th Workshop on Building and Using Comparable Corpora
Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

TL-Explorer: A Digital Humanities Tool for Mapping and Analyzing Translated Literature
Alex Zhai | Zheng Zhang | Amel Fraisse | Ronald Jenn | Shelley Fisher Fishkin | Pierre Zweigenbaum
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

TL-Explorer is a digital humanities tool for mapping and analyzing translated literature, encompassing the World Map and the Translation Dashboard. The World Map displays collected literature of different languages, locations, and cultures and establishes the foundation for further analysis. It comprises three global maps for spatial and temporal interpretation. A further investigation into an individual point on the map leads to the Translation Dashboard. Each point represents one edition or translation. Collected translations are processed in order to build multilingual parallel corpora for a large number of under-resourced languages as well as to highlight the transnational circulation of knowledge. Our first rendition of TL-Explorer was built on the well-traveled American novel Adventures of Huckleberry Finn by Mark Twain. The maps currently chronicle nearly 400 translations of this novel, and the dashboard supports over 30 collected translations. However, TL-Explorer can easily be extended to other works of literature and is not limited to a particular type of text; it could also cover academic manuscripts or constitutional documents, to name a few.

CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Hicham El Boukkouri | Olivier Ferret | Thomas Lavergne | Hiroshi Noji | Pierre Zweigenbaum | Jun’ichi Tsujii
Proceedings of the 28th International Conference on Computational Linguistics

Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level, and open-vocabulary representations.

We describe the MAPA project, funded under the Connecting Europe Facility programme, whose goal is the development of an open-source de-identification toolkit for all official European Union languages. The project runs from January 2020 to December 2021.

Handling Entity Normalization with no Annotated Corpus: Weakly Supervised Methods Based on Distributional Representation and Ontological Information
Arnaud Ferré | Robert Bossy | Mouhamadou Ba | Louise Deléger | Thomas Lavergne | Pierre Zweigenbaum | Claire Nédellec
Proceedings of the Twelfth Language Resources and Evaluation Conference

Entity normalization (or entity linking) is an important subtask of information extraction that links entity mentions in text to categories or concepts in a reference vocabulary. Machine learning-based normalization methods adapt well as long as they have enough training data of sufficient quality per reference. Distributional representations are commonly used because of their capacity to handle different expressions with similar meanings. However, in specific technical and scientific domains, the small amount of training data and the relatively small size of specialized corpora remain major challenges. Recently, the machine learning-based CONTES method has addressed these challenges for reference vocabularies that are ontologies, as is often the case in life sciences and biomedical domains. Yet its performance depends on a manually annotated corpus. Furthermore, as with other machine learning-based methods, parametrization remains tricky. We propose a new approach to address the scarcity of training data that extends the CONTES method with corpus selection, pre-processing and weak supervision strategies, which can yield high-performance results without any manually annotated examples. We also study which hyperparameters are most influential, with sometimes different patterns compared to previous work. The results show that our approach significantly improves accuracy and outperforms previous state-of-the-art algorithms.

2019

Embedding Strategies for Specialized Domains: Application to Clinical Entity Recognition
Hicham El Boukkouri | Olivier Ferret | Thomas Lavergne | Pierre Zweigenbaum
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Using pre-trained word embeddings in conjunction with Deep Learning models has become the “de facto” approach in Natural Language Processing (NLP). While this usually yields satisfactory results, off-the-shelf word embeddings tend to perform poorly on texts from specialized domains such as clinical reports. Moreover, training specialized word representations from scratch is often either impossible or ineffective due to the lack of large enough in-domain data. In this work, we focus on the clinical domain for which we study embedding strategies that rely on general-domain resources only. We show that by combining off-the-shelf contextual embeddings (ELMo) with static word2vec embeddings trained on a small in-domain corpus built from the task data, we manage to reach and sometimes outperform representations learned from a large corpus in the medical domain.

Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Défi Fouille de Textes (atelier TALN-RECITAL)
Emmanuel Morin | Sophie Rosset | Pierre Zweigenbaum
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Défi Fouille de Textes (atelier TALN-RECITAL)

Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume IV : Démonstrations
Emmanuel Morin | Sophie Rosset | Pierre Zweigenbaum
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume IV : Démonstrations

Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume III : RECITAL
Emmanuel Morin | Sophie Rosset | Pierre Zweigenbaum
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume III : RECITAL

Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume I : Articles longs
Emmanuel Morin | Sophie Rosset | Pierre Zweigenbaum
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume I : Articles longs

Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Terminologie et Intelligence Artificielle (atelier TALN-RECITAL & IC)
Emmanuel Morin | Sophie Rosset | Pierre Zweigenbaum
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Terminologie et Intelligence Artificielle (atelier TALN-RECITAL & IC)

Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts
Emmanuel Morin | Sophie Rosset | Pierre Zweigenbaum
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

2018

Automating Document Discovery in the Systematic Review Process: How to Use Chaff to Extract Wheat
Christopher Norman | Mariska Leeflang | Pierre Zweigenbaum | Aurélie Névéol
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Three Dimensions of Reproducibility in Natural Language Processing
K. Bretonnel Cohen | Jingbo Xia | Pierre Zweigenbaum | Tiffany Callahan | Orin Hargraves | Foster Goss | Nancy Ide | Aurélie Névéol | Cyril Grouin | Lawrence E. Hunter
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

GNEG: Graph-Based Negative Sampling for word2vec
Zheng Zhang | Pierre Zweigenbaum
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Negative sampling is an important component in word2vec for distributed word representation learning. We hypothesize that taking into account global, corpus-level information and generating a different noise distribution for each target word better satisfies the requirements of negative examples for each training word than the original frequency-based distribution. To this end, we pre-compute word co-occurrence statistics from the corpus and apply network algorithms such as random walk to them. We test this hypothesis through a set of experiments whose results show that our approach improves performance on the word analogy task by about 5% and on word similarity tasks by about 1% compared to the skip-gram negative sampling baseline.
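
One possible reading of the idea above is sketched here: a word co-occurrence matrix is row-normalized into a transition matrix, and a short random walk over it gives each target word its own noise distribution. The matrix values, the function names, and the choice of a plain t-step walk are assumptions for illustration only, not the authors' implementation.

```python
def row_normalize(matrix):
    """Turn raw co-occurrence counts into row-stochastic transition probabilities."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total:
            normalized.append([x / total for x in row])
        else:
            # Words with no co-occurrences fall back to a uniform distribution.
            normalized.append([1.0 / len(row)] * len(row))
    return normalized


def matmul(a, b):
    """Plain-Python matrix product, sufficient for a toy vocabulary."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def random_walk_noise(cooc, steps=2):
    """Row i is a noise distribution for target word i after `steps` (>= 1) walk steps."""
    transition = row_normalize(cooc)
    result = transition
    for _ in range(steps - 1):
        result = matmul(result, transition)
    return result


# Hypothetical symmetric co-occurrence counts for a three-word vocabulary.
cooc = [
    [0, 4, 1],
    [4, 0, 2],
    [1, 2, 0],
]
noise = random_walk_noise(cooc, steps=2)
```

Negative examples for word i could then be drawn with `random.choices(vocab, weights=noise[i])` in place of a single frequency-based distribution shared by all words.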

Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph
Zheng Zhang | Pierre Zweigenbaum | Ruiqing Yin
Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12)

Corpus2graph is an open-source NLP-application-oriented tool that generates a word co-occurrence network from a large corpus. It not only contains different built-in methods to preprocess words, analyze sentences, extract word pairs and define edge weights, but also supports user-customized functions. By using parallelization techniques, it can generate a large word co-occurrence network of the whole English Wikipedia data within hours. And thanks to its nodes-edges-weight three-level progressive calculation design, rebuilding networks with different configurations is even faster as it does not need to start all over again. This tool also works with other graph libraries such as igraph, NetworkX and graph-tool as a front end providing data to boost network generation speed.

A Multilingual Dataset for Evaluating Parallel Sentence Extraction from Comparable Corpora
Pierre Zweigenbaum | Serge Sharoff | Reinhard Rapp
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Détection des couples de termes translittérés à partir d’un corpus parallèle anglais-arabe
Wafa Neifar | Thierry Hamon | Pierre Zweigenbaum | Mariem Ellouze | Lamia-Hadrich Belguith
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Combining rule-based and embedding-based approaches to normalize textual entities with an ontology
Arnaud Ferré | Louise Deléger | Pierre Zweigenbaum | Claire Nédellec
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora
Pierre Zweigenbaum | Serge Sharoff | Reinhard Rapp
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

This paper presents the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. It recalls the design of the datasets, presents their final construction and statistics and the methods used to evaluate system results. 13 runs were submitted to the shared task by 4 teams, covering three of the four proposed language pairs: French-English (7 runs), German-English (3 runs), and Chinese-English (3 runs). The best F-scores as measured against the gold standard were 0.84 (German-English), 0.80 (French-English), and 0.43 (Chinese-English). Because of the design of the dataset, in which not all gold parallel sentence pairs are known, these are only minimum values. We examined manually a small sample of the false negative sentence pairs for the most precise French-English runs and estimated the number of parallel sentence pairs not yet in the provided gold standard. Adding them to the gold standard leads to revised estimates for the French-English F-scores of at most +1.5pt. This suggests that the BUCC 2017 datasets provide a reasonable approximate evaluation of the parallel sentence spotting task.

Automatic classification of doctor-patient questions for a virtual patient record query task
Leonardo Campillos Llanos | Sophie Rosset | Pierre Zweigenbaum
Proceedings of the 16th BioNLP Workshop

We present work in progress on automating the classification of doctor-patient questions in the context of a simulated consultation with a virtual patient. We classify questions according to the computational strategy (rule-based or other) needed for looking up data in the clinical record. We compare ‘traditional’ machine learning methods (Gaussian and Multinomial Naive Bayes, and Support Vector Machines) and a neural network classifier (FastText). We obtained the best results with the SVM using semantic annotations, whereas the neural classifier achieved promising results without them.

zNLP: Identifying Parallel Sentences in Chinese-English Comparable Corpora
Zheng Zhang | Pierre Zweigenbaum
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

This paper describes the zNLP system for the BUCC 2017 shared task. Our system identifies parallel sentence pairs in Chinese-English comparable corpora by translating Chinese sentences word by word into English, using the search engine Solr to select near-parallel sentences, and then using an SVM classifier to identify true parallel sentences among these candidates. It obtains an F1-score of 45% (resp. 32%) on the test (resp. training) set.

Tri Automatique de la Littérature pour les Revues Systématiques (Automatically Ranking the Literature in Support of Systematic Reviews)
Christopher Norman | Mariska Leeflang | Pierre Zweigenbaum | Aurélie Névéol
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Systematic reviews of the biomedical literature rely essentially on manual bibliographic work by experts. We evaluate the performance of supervised classification for the automatic discovery of articles using several definitions of the inclusion criteria. We apply a logistic regression model to two corpora drawn from systematic reviews conducted in the fields of natural language processing and drug efficacy. The classification yields a mean area under the curve (AUC) of 0.769 when the classifier is built from expert judgments made on article titles and abstracts, and of 0.835 when judgments made on the full text are used. These results indicate the importance of the judgments made from the very beginning of the selection process for developing an effective classifier that speeds up the preparation of systematic reviews with a standard classification algorithm.

Détection de concepts et granularité de l’annotation (Concept detection and annotation granularity)
Pierre Zweigenbaum | Thomas Lavergne
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

We address here a task of concept detection in texts that does not require going through an entity detection phase with entity boundaries. It is therefore a multi-label text categorization task, with datasets annotated at the level of whole texts. We hypothesize that annotation at a finer level of granularity, typically the statement level, should improve the performance of an automatic detector trained on these data. We examine this hypothesis on a particular kind of short text: death certificates in which diagnoses are to be recognized, with datasets initially annotated at the level of the whole certificate. We find that annotation at the “line” level does improve the results, but also that simply applying a classifier trained at the text level to individual lines is already a source of improvement.

Traitement automatique de la langue biomédicale au LIMSI (Biomedical language processing at LIMSI)
Christopher Norman | Cyril Grouin | Thomas Lavergne | Aurélie Névéol | Pierre Zweigenbaum
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 - Démonstrations

We offer demonstrations of three tools developed at LIMSI in natural language processing applied to the biomedical domain: detection of medical concepts in short texts, categorization of scientific articles to assist in writing systematic reviews, and de-identification of clinical texts.

Proceedings of the 10th Workshop on Building and Using Comparable Corpora
Serge Sharoff | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

Representation of complex terms in a vector space structured by an ontology for a normalization task
Arnaud Ferré | Pierre Zweigenbaum | Claire Nédellec
Proceedings of the 16th BioNLP Workshop

We propose in this paper a semi-supervised method for labeling terms of texts with concepts of a domain ontology. The method generates continuous vector representations of complex terms in a semantic space structured by the ontology. The proposed method relies on a distributional semantics approach, which generates initial vectors for each of the extracted terms. Then these vectors are embedded in the vector space constructed from the structure of the ontology. This embedding is carried out by training a linear model. Finally, we apply a distance calculation to determine the proximity between vectors of terms and vectors of concepts and thus to assign ontology labels to terms. We have evaluated the quality of these representations for a normalization task by using the concepts of an ontology as semantic labels. Normalization of terms is an important step in extracting part of the information contained in texts, but the generated vector space might find other applications. The performance of this method is comparable to that of the state of the art for this normalization task, opening up encouraging prospects.
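
The last two steps described above (linear embedding into the ontology space, then a proximity calculation) can be sketched as follows. The abstract does not say which distance is used, so cosine similarity is an assumption here, as are the hand-set projection matrix `W`, the concept identifiers, and all vectors; in the paper the linear model is trained rather than fixed.

```python
import math


def project(W, v):
    """Apply a linear map W (list of rows) to a term vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_concept(term_vec, W, concept_vecs):
    """Label a term with the concept whose vector is closest to its projection."""
    projected = project(W, term_vec)
    return max(concept_vecs, key=lambda c: cosine(projected, concept_vecs[c]))


# Hypothetical 3-d distributional space mapped to a 2-d ontology space.
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
]
concept_vecs = {"CONCEPT_1": [1.0, 0.0], "CONCEPT_2": [0.0, 1.0]}

label_a = nearest_concept([0.9, 0.1, 0.0], W, concept_vecs)
label_b = nearest_concept([0.2, 0.5, 0.4], W, concept_vecs)
print(label_a, label_b)
```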

2016

Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis
Cyril Grouin | Thierry Hamon | Aurélie Névéol | Pierre Zweigenbaum
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

Transfer-Based Learning-to-Rank Assessment of Medical Term Technicality
Dhouha Bouamor | Leonardo Campillos Llanos | Anne-Laure Ligozat | Sophie Rosset | Pierre Zweigenbaum
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

While measuring the readability of texts has been a long-standing research topic, assessing the technicality of terms has only been addressed more recently and mostly for the English language. In this paper, we train a learning-to-rank model to determine a specialization degree for each term found in a given list. Since no training data for this task exist for French, we train our system with non-lexical features on English data, namely, the Consumer Health Vocabulary, then apply it to French. The features include the likelihood ratio of the term based on specialized and lay language models, and tests for containing morphologically complex words. The evaluation of this approach is conducted on 134 terms from the UMLS Metathesaurus and 868 terms from the Eugloss thesaurus. The Normalized Discounted Cumulative Gain obtained by our system is over 0.8 on both test sets. Besides, thanks to the learning-to-rank approach, adding morphological features to the language model features improves the results on the Eugloss thesaurus.
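
For readers unfamiliar with the metric reported above, here is a minimal sketch of Normalized Discounted Cumulative Gain. It uses the linear-gain variant with a log2 rank discount; which variant the paper used is not stated in the abstract, and the graded technicality scores below are invented for the example.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: items at lower ranks contribute less."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical gold technicality grades, listed in the order a system ranked the terms.
perfect_ranking = [3, 2, 1]
reversed_ranking = [1, 2, 3]

print(ndcg(perfect_ranking))
print(round(ndcg(reversed_ranking), 2))
```

A score of 1.0 means the system's ordering matches the ideal one; any misordering near the top of the list lowers the score most.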

Overview of the Regulatory Network of Plant Seed Development (SeeDev) Task at the BioNLP Shared Task 2016.
Estelle Chaix | Bertrand Dubreucq | Abdelhak Fatihi | Dialekti Valsamou | Robert Bossy | Mouhamadou Ba | Louise Deléger | Pierre Zweigenbaum | Philippe Bessières | Loic Lepiniec | Claire Nédellec
Proceedings of the 4th BioNLP Shared Task Workshop

Impact de l’agglutination dans l’extraction de termes en arabe standard moderne (Adaptation of a term extractor to the Modern Standard Arabic language)
Wafa Neifar | Thierry Hamon | Pierre Zweigenbaum | Mariem Ellouze | Lamia Hadrich Belguith
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

In this article, we present an adaptation to Modern Standard Arabic of a term extractor designed for French and English. The adaptation first consisted in describing the term extraction process in a way similar to that defined for English and French, taking into account certain morpho-syntactic particularities of the Arabic language. We then considered the phenomenon of agglutination in Arabic. The evaluation was carried out on a corpus of medical texts. The results show that among 400 maximal candidate terms analyzed, 288 are judged correct with respect to the domain (72.1%). Extraction errors are due to part-of-speech tagging and to the absence of vowel marks in the texts, but also to agglutination phenomena.

Une catégorisation de fins de lignes non-supervisée (End-of-line classification with no supervision)
Pierre Zweigenbaum | Cyril Grouin | Thomas Lavergne
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

In some plain texts, end-of-line marks may or may not mark the boundary of a text unit (typically a paragraph). This problem is likely to affect subsequent processing, but is rarely addressed in the literature. We propose a fully unsupervised method to determine whether an end of line should be seen as a simple space or as a true text unit boundary, and test it on a corpus of medical reports. This method obtains an F-measure of 0.926 on a sample of 24 texts containing folded lines. Applied to a larger sample of texts that may or may not contain folded lines, our most conservative method obtains an F-measure of 0.898, a high value for a fully unsupervised method.

A Dataset for ICD-10 Coding of Death Certificates: Creation and Usage
Thomas Lavergne | Aurélie Névéol | Aude Robert | Cyril Grouin | Grégoire Rey | Pierre Zweigenbaum
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

Very few datasets have been released for the evaluation of diagnosis coding with the International Classification of Diseases, and only one so far in a language other than English. This paper describes a large-scale dataset prepared from French death certificates, and the problems which needed to be solved to turn it into a dataset suitable for the application of machine learning and natural language processing methods of ICD-10 coding. The dataset includes the free-text statements written by medical doctors, the associated meta-data, the human coder-assigned codes for each statement, as well as the statement segments which supported the coder’s decision for each code. The dataset comprises 93,694 death certificates totalling 276,103 statements and 377,677 ICD-10 code assignments (3,457 unique codes). It was made available for an international automated coding shared task, which attracted five participating teams. An extended version of the dataset will be used in a new edition of the shared task.

Managing Linguistic and Terminological Variation in a Medical Dialogue System
Leonardo Campillos Llanos | Dhouha Bouamor | Pierre Zweigenbaum | Sophie Rosset
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce a dialogue task between a virtual patient and a doctor where the dialogue system, playing the patient part in a simulated consultation, must reconcile a specialized level, to understand what the doctor says, and a lay level, to output realistic patient-language utterances. This increases the challenges in the analysis and generation phases of the dialogue. This paper proposes methods to manage linguistic and terminological variation in that situation and illustrates how they help produce realistic dialogues. Our system makes use of lexical resources for processing synonyms, inflectional and derivational variants, or pronoun/verb agreement. In addition, specialized knowledge is used for processing medical roots and affixes, ontological relations and concept mapping, and for generating lay variants of terms according to the patient’s non-expert discourse. We also report the results of a first evaluation carried out by 11 users interacting with the system. We evaluated the non-contextual analysis module, which supports the Spoken Language Understanding step. The annotation of task domain entities obtained 91.8% precision, 82.5% recall, 86.9% F-measure, a 19.0% Slot Error Rate, and a 32.9% Sentence Error Rate.

Detection of Text Reuse in French Medical Corpora
Eva D’hondt | Cyril Grouin | Aurélie Névéol | Efstathios Stamatatos | Pierre Zweigenbaum
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

Electronic Health Records (EHRs) are increasingly available in modern health care institutions either through the direct creation of electronic documents in hospitals’ health information systems, or through the digitization of historical paper records. Each EHR creation method creates a need for sophisticated text reuse detection tools in order to prepare the EHR collections for efficient secondary use relying on Natural Language Processing methods. Herein, we address the detection of two types of text reuse in French EHRs: 1) the detection of updated versions of the same document and 2) the detection of document duplicates that still bear surface differences due to OCR or de-identification processing. We present a robust text reuse detection method to automatically identify redundant document pairs in two French EHR corpora that achieves an overall macro F-measure of 0.68 and 0.60, respectively, and correctly identifies all redundant document pairs of interest.

Hybrid methods for ICD-10 coding of death certificates
Pierre Zweigenbaum | Thomas Lavergne
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

Supervised classification of end-of-lines in clinical text with no manual annotation
Pierre Zweigenbaum | Cyril Grouin | Thomas Lavergne
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

In some plain text documents, end-of-line marks may or may not mark the boundary of a text unit (e.g., of a paragraph). This vexing problem is likely to impact subsequent natural language processing components, but is seldom addressed in the literature. We propose a method which uses no manual annotation to classify whether end-of-lines must actually be seen as simple spaces (soft line breaks) or as true text unit boundaries. This method, which includes self-training and co-training steps based on token and line length features, achieves 0.943 F-measure on a corpus of short e-books with controlled format, F=0.904 on a random sample of 24 clinical texts with soft line breaks, and F=0.898 on a larger set of mixed clinical texts which may or may not contain soft line breaks, a fairly high value for a method with no manual annotation.

Identification of Drug-Related Medical Conditions in Social Media
François Morlane-Hondère | Cyril Grouin | Pierre Zweigenbaum
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Monitoring social media has been shown to be an interesting approach for the early detection of drug adverse effects. In this paper, we describe a system which extracts medical entities in French drug reviews written by users. We focus on the identification of medical conditions, which is based on the concept of post-coordination: we first extract minimal medical-related entities (pain, stomach) then we combine them to identify complex ones (It was the worst [pain I ever felt in my stomach]). These two steps are respectively performed by two classifiers, the first being based on Conditional Random Fields and the second one on Support Vector Machines. The overall results of the minimal entity classifier are the following: P=0.926; R=0.849; F1=0.886. A thorough analysis of the feature set shows that, when combined with word lemmas, clusters generated by word2vec are the most valuable features. When trained on the output of the first classifier, the second classifier’s results are the following: P=0.683; R=0.956; F1=0.797. The addition of post-processing rules did not yield any significant global improvement but was found to modify the precision/recall ratio.
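
The F1 values quoted above follow from the reported precision and recall through the usual harmonic-mean formula, F1 = 2PR / (P + R). The short check below recomputes both; the function name is just a label chosen for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as reported in the abstract.
minimal_entities = f1_score(0.926, 0.849)
complex_entities = f1_score(0.683, 0.956)

print(round(minimal_entities, 3), round(complex_entities, 3))  # 0.886 0.797
```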

2015

Identification de facteurs de risque pour des patients diabétiques à partir de comptes-rendus cliniques par des approches hybrides
Cyril Grouin | Véronique Moriceau | Sophie Rosset | Pierre Zweigenbaum
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we present the methods we developed to analyze hospital reports written in English. The aim of this study is to identify risk factors for death in diabetic patients and to position the medical events described relative to the creation date of each document. Our approach relies on (i) HeidelTime to identify temporal expressions, (ii) CRFs complemented by post-processing rules to identify treatments, diseases and risk factors, and (iii) rules to position each medical event in time. On a corpus of 514 documents, we obtain an overall F-measure of 0.8451. We observe that identifying information directly mentioned in the documents performs better than inferring information from laboratory results.

Proceedings of the Eighth Workshop on Building and Using Comparable Corpora
Pierre Zweigenbaum | Serge Sharoff | Reinhard Rapp
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

BUCC Shared Task: Cross-Language Document Similarity
Serge Sharoff | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

Un patient virtuel dialogant
Leonardo Campillos | Dhouha Bouamor | Éric Bilinski | Anne-Laure Ligozat | Pierre Zweigenbaum | Sophie Rosset
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

The demonstrator we describe here is a prototype dialogue system whose aim is to simulate a patient. We describe its general operation, with an emphasis on language-related aspects and especially on the relationship between specialized medical language and general language.

Étude des verbes introducteurs de noms de médicaments dans les forums de santé
François Morlane-Hondère | Cyril Grouin | Pierre Zweigenbaum
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this article, we combine manual and automatic annotation to identify the verbs used to introduce a drug in messages on health forums. This information is notably useful for identifying the relation between a drug and a side effect. The mention of a drug in a message does not guarantee that the user has taken the treatment, only that the user is reporting on it. We then show that these verbs can be used to automatically extract variants of drug names. We believe that analyzing these variants could make it possible to model the errors forum users make when writing drug names, and to improve information retrieval systems accordingly.

Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis
Cyril Grouin | Thierry Hamon | Aurélie Névéol | Pierre Zweigenbaum
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

Médicaments qui soignent, médicaments qui rendent malades : étude des relations causales pour identifier les effets secondaires
François Morlane-Hondère | Cyril Grouin | Véronique Moriceau | Pierre Zweigenbaum
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this article, we look at how the links between a medical treatment and a side effect are expressed. Because patients turn first to the internet, we base this study on an annotated corpus of messages from French-language health forums. The aim of this work is to highlight linguistic elements (logical connectives and temporal expressions) that could be useful for automatic systems that detect side effects. We observe that the way people write on forums does not allow temporal expressions to be relied upon. Logical connectives, on the other hand, appear useful for identifying side effects.

Description of the PatientGenesys Dialogue System
Leonardo Campillos Llanos | Dhouha Bouamor | Éric Bilinski | Anne-Laure Ligozat | Pierre Zweigenbaum | Sophie Rosset
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2014

Use of unsupervised word classes for entity recognition: Application to the detection of disorders in clinical reports
Maria Evangelia Chatzimina | Cyril Grouin | Pierre Zweigenbaum
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Unsupervised word classes induced from unannotated text corpora are increasingly used to help tasks addressed by supervised classification, such as standard named entity detection. This paper studies the contribution of unsupervised word classes to a medical entity detection task with two specific objectives: How do unsupervised word classes compare to available knowledge-based semantic classes? Does syntactic information help produce unsupervised word classes with better properties? We design and test two syntax-based methods to produce word classes: one applies the Brown clustering algorithm to syntactic dependencies, the other collects latent categories created by a PCFG-LA parser. When added to non-semantic features, knowledge-based semantic classes gain 7.28 points of F-measure. In the same context, basic unsupervised word classes gain 4.16pt, reaching 60% of the contribution of knowledge-based semantic classes and outperforming Wikipedia, and adding PCFG-LA unsupervised word classes gains one more point, at 5.11pt, reaching 70%. Unsupervised word classes could therefore provide a useful semantic back-off in domains where no knowledge-based semantic classes are available. The combination of both knowledge-based and basic unsupervised classes gains 8.33pt. Therefore, unsupervised classes are still useful even when rich knowledge-based classes exist.

Automatic Analysis of Scientific and Literary Texts. Presentation and Results of the DEFT2014 Text Mining Challenge (Analyse automatique de textes littéraires et scientifiques : présentation et résultats du défi fouille de texte DEFT2014) [in French]
Thierry Hamon | Quentin Pleplé | Patrick Paroubek | Pierre Zweigenbaum | Cyril Grouin
TALN-RECITAL 2014 Workshop DEFT 2014 : DÉfi Fouille de Textes (DEFT 2014 Workshop: Text Mining Challenge)

Language Resources for French in the Biomedical Domain
Aurélie Névéol | Julien Grosjean | Stéfan Darmoni | Pierre Zweigenbaum
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The biomedical domain offers a wealth of linguistic resources for Natural Language Processing, including terminologies and corpora. While many of these resources are prominently available for English, other languages including French benefit from substantial coverage thanks to the contribution of an active community over the past decades. However, access to terminological resources in languages other than English may not be as straightforward as access to their English counterparts. Herein, we review the extent of resource coverage for French and give pointers to access French-language resources. We also discuss the sources and methods for making additional material available for French.

MEANS : une approche sémantique pour la recherche de réponses aux questions médicales [MEANS: a semantic approach to medical question answering]
Asma Ben Abacha | Pierre Zweigenbaum
Traitement Automatique des Langues, Volume 55, Numéro 1 : Varia [Varia]

Annotation of specialized corpora using a comprehensive entity and relation scheme
Louise Deléger | Anne-Laure Ligozat | Cyril Grouin | Pierre Zweigenbaum | Aurélie Névéol
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Annotated corpora are essential resources for many applications in Natural Language Processing. They provide insight on the linguistic and semantic characteristics of the genre and domain covered, and can be used for the training and evaluation of automatic tools. In the biomedical domain, annotated corpora of English texts have become available for several genres and subfields. However, very few similar resources are available for languages other than English. In this paper we present an effort to produce a high-quality corpus of clinical documents in French, annotated with a comprehensive scheme of entities and relations. We present the annotation scheme as well as the results of a pilot annotation study covering 35 clinical documents in a variety of subfields and genres. We show that high inter-annotator agreement can be achieved using a complex annotation scheme.

2013

Building Specialized Bilingual Lexicons Using Word Sense Disambiguation
Dhouha Bouamor | Nasredine Semmar | Pierre Zweigenbaum
Proceedings of the Sixth International Joint Conference on Natural Language Processing

(Utilisation de la similarité sémantique pour l’extraction de lexiques bilingues à partir de corpus comparables) [in French]
Dhouha Bouamor | Nasredine Semmar | Pierre Zweigenbaum
Proceedings of TALN 2013 (Volume 1: Long Papers)

Extraction of temporal relations between clinical events in clinical documents (Extraction des relations temporelles entre événements médicaux dans des comptes rendus hospitaliers) [in French]
Pierre Zweigenbaum | Xavier Tannier
Proceedings of TALN 2013 (Volume 2: Short Papers)

Proceedings of the BioNLP Shared Task 2013 Workshop
Claire Nédellec | Robert Bossy | Jin-Dong Kim | Jung-jae Kim | Tomoko Ohta | Sampo Pyysalo | Pierre Zweigenbaum
Proceedings of the BioNLP Shared Task 2013 Workshop

Overview of BioNLP Shared Task 2013
Claire Nédellec | Robert Bossy | Jin-Dong Kim | Jung-jae Kim | Tomoko Ohta | Sampo Pyysalo | Pierre Zweigenbaum
Proceedings of the BioNLP Shared Task 2013 Workshop

Using WordNet and Semantic Similarity for Bilingual Terminology Mining from Comparable Corpora
Dhouha Bouamor | Nasredine Semmar | Pierre Zweigenbaum
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

Automatic Named Entity Pre-annotation for Out-of-domain Human Annotation
Sophie Rosset | Cyril Grouin | Thomas Lavergne | Mohamed Ben Jannet | Jérémy Leixa | Olivier Galibert | Pierre Zweigenbaum
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

Towards a Generic Approach for Bilingual Lexicon Extraction from Comparable Corpora
Dhouha Bouamor | Nasredine Semmar | Pierre Zweigenbaum
Proceedings of Machine Translation Summit XIV: Papers

Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge
Dhouha Bouamor | Adrian Popescu | Nasredine Semmar | Pierre Zweigenbaum
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
Dhouha Bouamor | Nasredine Semmar | Pierre Zweigenbaum
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Proceedings of the Sixth Workshop on Building and Using Comparable Corpora
Serge Sharoff | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

2012

Une étude comparative empirique sur la reconnaissance des entités médicales [An empirical comparative study of medical entity recognition]
Asma Ben Abacha | Pierre Zweigenbaum
Traitement Automatique des Langues, Volume 53, Numéro 1 : Varia [Varia]

Indexation libre et contrôlée d’articles scientifiques. Présentation et résultats du défi fouille de textes DEFT2012 (Controlled and free indexing of scientific papers. Presentation and results of the DEFT2012 text-mining challenge) [in French]
Patrick Paroubek | Pierre Zweigenbaum | Dominic Forest | Cyril Grouin
JEP-TALN-RECITAL 2012, Workshop DEFT 2012: DÉfi Fouille de Textes (DEFT 2012 Workshop: Text Mining Challenge)

Extraction d’information automatique en domaine médical par projection inter-langue : vers un passage à l’échelle (Automatic Information Extraction in the Medical Domain by Cross-Lingual Projection) [in French]
Asma Ben Abacha | Pierre Zweigenbaum | Aurélien Max
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

Manual Corpus Annotation: Giving Meaning to the Evaluation Metrics
Yann Mathet | Antoine Widlöcher | Karën Fort | Claire François | Olivier Galibert | Cyril Grouin | Juliette Kahn | Sophie Rosset | Pierre Zweigenbaum
Proceedings of COLING 2012: Posters

Structured Named Entities in two distinct press corpora: Contemporary Broadcast News and Old Newspapers
Sophie Rosset | Cyril Grouin | Karën Fort | Olivier Galibert | Juliette Kahn | Pierre Zweigenbaum
Proceedings of the Sixth Linguistic Annotation Workshop

Extended Named Entities Annotation on OCRed Documents: From Corpus Constitution to Evaluation Campaign
Olivier Galibert | Sophie Rosset | Cyril Grouin | Pierre Zweigenbaum | Ludovic Quintard
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Within the framework of the Quaero project, we proposed a new definition of named entities, based upon an extension of the coverage of named entities as well as the structure of those named entities. In this new definition, the extended named entities we proposed are both hierarchical and compositional. In this paper, we focused on the annotation of a corpus composed of press archives, OCRed from French newspapers of December 1890. We present the methodology we used to produce the corpus and the characteristics of the corpus in terms of named entities annotation. This annotated corpus has been used in an evaluation campaign. We present this evaluation, the metrics we used and the results obtained by the participants.

Automatic Construction of a MultiWord Expressions Bilingual Lexicon: A Statistical Machine Translation Evaluation Perspective
Dhouha Bouamor | Nasredine Semmar | Pierre Zweigenbaum
Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon

Identifying bilingual Multi-Word Expressions for Statistical Machine Translation
Dhouha Bouamor | Nasredine Semmar | Pierre Zweigenbaum
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

MultiWord Expressions (MWEs) represent a key issue for numerous applications in Natural Language Processing (NLP), especially for Machine Translation (MT). In this paper, we describe a strategy for detecting translation pairs of MWEs in a French-English parallel corpus. In addition, we introduce three methods aiming to integrate extracted bilingual MWEs in Moses, a phrase-based Statistical Machine Translation (SMT) system. We experimentally show that these textual units can improve translation quality.

2011

Medical Entity Recognition: A Comparaison of Semantic and Statistical Methods
Asma Ben Abacha | Pierre Zweigenbaum
Proceedings of BioNLP 2011 Workshop

Accès au contenu sémantique en langue de spécialité : extraction des prescriptions et concepts médicaux (Accessing the semantic content in a specialized language: extracting prescriptions and medical concepts)
Cyril Grouin | Louise Deléger | Bruno Cartoni | Sophie Rosset | Pierre Zweigenbaum
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Although essential for gaining a rapid, overall view of patients' health status, access by computer tools to medical information about drug prescriptions and medical concepts proves particularly difficult. This information is generally written as free text in hospital reports and requires the development of dedicated techniques. This article presents the strategies implemented to extract medical prescriptions and medical concepts from hospital reports written in English. Our systems, based on rule-based and machine-learning approaches, obtain an overall F1-measure of 0.773 in extracting medical prescriptions and in detecting and typing medical concepts.

Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
Pierre Zweigenbaum | Reinhard Rapp | Serge Sharoff
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

Proposal for an Extension of Traditional Named Entities: From Guidelines to Evaluation, an Overview
Cyril Grouin | Sophie Rosset | Pierre Zweigenbaum | Karën Fort | Olivier Galibert | Ludovic Quintard
Proceedings of the 5th Linguistic Annotation Workshop

Extraction d’informations médicales au LIMSI (Medical information extraction at LIMSI)
Cyril Grouin | Louise Deléger | Anne-Lyse Minard | Anne-Laure Ligozat | Asma Ben Abacha | Delphine Bernhard | Bruno Cartoni | Brigitte Grau | Sophie Rosset | Pierre Zweigenbaum
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Structured and Extended Named Entity Evaluation in Automatic Speech Transcriptions
Olivier Galibert | Sophie Rosset | Cyril Grouin | Pierre Zweigenbaum | Ludovic Quintard
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

MeTAE : Plate-forme d’annotation automatique et d’exploration sémantiques pour le domaine médical
Asma Ben Abacha | Pierre Zweigenbaum
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

We present a platform for semantic annotation and exploration of medical texts, called "MeTAE". The automatic annotation process comprises a first step that recognizes the medical entities present in the texts, followed by a step that identifies the semantic relations linking them. This identification relies on linguistic patterns built manually for each relation type. MeTAE generates RDF annotations from the extracted information and offers an interface for exploring the annotated texts with form-based queries. The platform can be used to analyze medical texts semantically or to query the available annotation base for one or more answers to a given query (e.g. "?X prevents Alzheimer's disease", equivalent to the question "how can Alzheimer's disease be prevented?"). This application can serve as the basis of a question-answering system for the medical domain.

Identifying Paraphrases between Technical and Lay Corpora
Louise Deléger | Pierre Zweigenbaum
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In previous work, we presented a preliminary study to identify paraphrases between technical and lay discourse types from medical corpora dedicated to the French language. In this paper, we test the hypothesis that the same kinds of paraphrases as for French can be detected between English technical and lay discourse types and report the adaptation of our method from French to English. Starting from the constitution of monolingual comparable corpora, we extract two kinds of paraphrases: paraphrases between nominalizations and verbal constructions and paraphrases between neo-classical compounds and modern-language phrases. We do this relying on morphological resources and a set of extraction rules we adapt from the original approach for French. Results show that paraphrases could be identified with a rather good precision, and that these types of paraphrase are relevant in the context of the opposition between technical and lay discourse types. These observations are consistent with the results obtained for French, which demonstrates the portability of the approach as well as the similarity of the two languages as regards the use of those kinds of expressions in technical and lay discourse types.

Semi-Automated Extension of a Specialized Medical Lexicon for French
Bruno Cartoni | Pierre Zweigenbaum
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes the development of a specialized lexical resource for a specialized domain, namely medicine. First, in order to assess the linguistic phenomena that need to be addressed, we based our observation on a large collection of more than 300,000 terms, organised around conceptual identifiers. Based on these observations, we highlight the specificities that such a lexicon should take into account, namely in terms of inflectional and derivational knowledge. In a first experiment, we show that general resources lack a large part of the words needed to process specialized language. Secondly, we describe an experiment to feed semi-automatically a medical lexicon and populate it with inflectional information. This experiment is based on a semi-automatic method that tries to acquire inflectional knowledge from frequent endings of words recorded in existing lexicons. Thanks to this, we increased the coverage of the target vocabulary from 14.1% to 25.7%.

Named and Specific Entity Detection in Varied Data: The Quæro Named Entity Baseline Evaluation
Olivier Galibert | Ludovic Quintard | Sophie Rosset | Pierre Zweigenbaum | Claire Nédellec | Sophie Aubin | Laurent Gillard | Jean-Pierre Raysz | Delphine Pois | Xavier Tannier | Louise Deléger | Dominique Laurent
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Quæro program promotes research and industrial innovation on technologies for automatic analysis and classification of multimedia and multilingual documents. Within its context a set of evaluations of Named Entity recognition systems was held in 2009. Four tasks were defined. The first two concerned traditional named entities in French broadcast news for one (a rerun of ESTER 2) and of OCR-ed old newspapers for the other. The third was a gene and protein name extraction in medical abstracts. The last one was the detection of references in patents. Four different partners participated, giving a total of 16 systems. We provide a synthetic description of all of them, classifying them by the main approaches chosen (resource-based, rules-based or statistical), without forgetting the fact that any modern system is at some point hybrid. The metric (the relatively standard Slot Error Rate) and the results are also presented and discussed. Finally, a process is ongoing with preliminary acceptance of the partners to ensure the availability for the community of all the corpora used with the exception of the non-Quæro produced ESTER 2 one.
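
For readers unfamiliar with the Slot Error Rate named in the abstract above, the following is a minimal sketch of its textbook form: substitutions, deletions and insertions divided by the number of reference slots. The abstract gives no formula, so the function name, the unweighted error counts and the example figures are all assumptions for illustration; the evaluation itself may have weighted error types differently.

```python
def slot_error_rate(substitutions: int, deletions: int,
                    insertions: int, reference_slots: int) -> float:
    """Textbook Slot Error Rate: (S + D + I) / N.

    N is the number of slots (entities) in the reference annotation.
    Assumption: all three error types carry the same weight.
    """
    if reference_slots <= 0:
        raise ValueError("reference_slots must be positive")
    if min(substitutions, deletions, insertions) < 0:
        raise ValueError("error counts must be non-negative")
    return (substitutions + deletions + insertions) / reference_slots


# Hypothetical counts, not taken from the evaluation.
ser = slot_error_rate(substitutions=5, deletions=3, insertions=2,
                      reference_slots=50)
print(f"SER = {ser:.3f}")
```

Unlike precision or recall, this measure can exceed 1 when a system inserts many spurious entities, because insertions are counted in the numerator but not in the denominator.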

2009

Knowledge and Reasoning for Medical Question-Answering
Pierre Zweigenbaum
Proceedings of the 2009 Workshop on Knowledge and Reasoning for Answering Questions (KRAQ 2009)

Extracting Lay Paraphrases of Specialized Expressions from Monolingual Comparable Medical Corpora
Louise Deléger | Pierre Zweigenbaum
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

Improvements in Analogical Learning: Application to Translating Multi-Terms of the Medical Domain
Philippe Langlais | François Yvon | Pierre Zweigenbaum
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)
Pascale Fung | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

2007

Analyse morphosémantique des composés savants : transposition du français à l’anglais
Louise Deléger | Fiammetta Namer | Pierre Zweigenbaum
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Most specialized vocabularies include a large share of morphologically complex lexemes built from Greek and Latin roots, known as "neoclassical compounds". Morphosemantic analysis makes it possible to decompose these lexemes and give them definitions, and appears to be applicable in a similar way to the compounds of several languages. This article presents the adaptation of a morphosemantic analyzer initially dedicated to French (DériF) to the analysis of English medical neoclassical compounds, thereby illustrating the structural similarity of these compounds across closely related European languages. We set out the principles of this transposition and its performance. The analyzer was tested on a set of 1299 lexemes extracted from the WHO-ART medical terminology: 859 could be decomposed and defined, 675 of them successfully. Beyond a simple transposition from one language to another, the method shows the potential of a multilingual system.

2006

Productivité quantitative des suffixations par -ité et -Able dans un corpus journalistique moderne
Natalia Grabar | Delphine Tribout | Georgette Dal | Bernard Fradin | Nabil Hathout | Stéphanie Lignon | Fiammetta Namer | Clément Plancq | François Yvon | Pierre Zweigenbaum
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this work, we carry out a corpus study of the quantitative productivity of French suffixation in -Able and in -ité, first independently of each other, then when they are chained derivationally (suffixation in -ité applies to bases in -Able in about 15% of cases). We estimate the productivity of these suffixations using statistical measures whose evolution we track against corpus size. Both suffixations are productive in modern French: they form new lexemes throughout the corpora studied with no observed saturation, and their productivity indices evolve stably, although they depend on the computations applied to them. We note, however, that in general, of the two suffixations, -ité is the more frequent in journalistic corpora, except precisely when -ité applies to an adjective in -Able. Given that an adjective in -Able and the corresponding noun in -ité express the same property, this result indicates that the complexity of the base is a parameter to be taken into account in the formation of the possible lexicon.

2005

Recherche en corpus de réponses à des questions définitoires
Véronique Malaisé | Thierry Delbecque | Pierre Zweigenbaum
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Question-answering systems, mainly focused on factual questions in the open domain, are also tested on other tasks, such as working in a restricted domain or finding definitions. Here we are interested in finding answers to "definitional" questions about the medical domain. The search for definitional answers is generally carried out with two types of methods: those relying mainly on the content of the target corpus, and those calling on external knowledge. We chose to restrict ourselves to the first of these two types. We present an experiment in which we reuse patterns for detecting definitional statements, designed for another task, to locate potential answers to the questions asked. We integrated these patterns into a processing pipeline that we evaluate on the definitional questions and the medical corpus of the EQueR project on the evaluation of question-answering systems. This evaluation shows that, while recall still needs improvement, the "precision" of the answers obtained (measured by the mean reciprocal rank) is respectable. We discuss these results and suggest directions for improvement.
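
The abstract above measures answer "precision" by the mean of reciprocal ranks. As a minimal sketch of how such a score is usually computed (the function name, the input convention and the example ranks are assumptions for illustration, not the authors' evaluation code), each question contributes 1/rank of its first correct answer, or 0 when no correct answer is returned:

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_correct_ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all questions.

    Each item is the 1-based rank of the first correct answer returned
    for one question, or None when no correct answer was returned
    (that question then contributes 0 to the sum).
    """
    if not first_correct_ranks:
        raise ValueError("at least one question is required")
    total = 0.0
    for rank in first_correct_ranks:
        if rank is None:
            continue
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(first_correct_ranks)


# Hypothetical ranks for four questions; the third has no correct answer.
print(mean_reciprocal_rank([1, 3, None, 2]))
```

Because unanswered questions still count in the denominator, the score reflects both how high correct answers are ranked and how often one is found at all.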

Utilisation de corpus de spécialité pour le filtrage de synonymes de la langue générale
Natalia Grabar | Pierre Zweigenbaum
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The most readily available linguistic resources in NLP generally belong to the general register of a language. When they are to be used on specialized texts, it can be useful to adapt them to those texts. This article is devoted to adapting general synonym resources to medical language. The adaptation is obtained through a series of filtering steps on a corpus from the domain. The original synonyms and the filtered synonyms are then used as one of the resources for normalizing term variants in a terminology structuring task. Their respective contributions are evaluated against the reference terminological structure. This evaluation shows that the results after filtering are encouraging overall for a task such as terminology structuring: precision improves at the cost of a slight decrease in recall.

Traduction de termes biomédicaux par inférence de transducteurs
Vincent Claveau | Pierre Zweigenbaum
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article proposes and evaluates a method for automatically translating simple biomedical terms from French into English and from English into French. It relies on a supervised machine learning technique that infers transducers from examples of bilingual term pairs; no other resource or knowledge is required. These transducers, which capture the main translation regularities found in the biomedical domain, are then used to translate new French terms into English and vice versa. The evaluations carried out show that the rate of correct translations with our technique lies between 52 and 67%. By examining the most common errors, we identify some limitations inherent to our approach and suggest some ways to overcome them. Finally, we consider several extensions to this work.

2004

Repérage et exploitation d’énoncés définitoires en corpus pour l’aide à la construction d’ontologie
Véronique Malaisé | Pierre Zweigenbaum | Bruno Bachimont
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

To build an ontology, a modeler needs semantic information about the main terms of the domain under study. Corpus exploration tools can help locate these kinds of information, and the identification of hypernym pairs has been the subject of several studies. We propose to exploit definitional statements to extract from a corpus information concerning the three axes of the ontological backbone: the vertical axis, tied to hypernymy; the horizontal axis, tied to co-hyponymy; and the transversal axis, tied to domain relations. After reviewing existing work on detecting definitional statements in NLP, we describe the method we set up, then present its evaluation and the first results obtained. Detection of definitional statements reaches 10% to 69% precision depending on the pattern; detection of lexical units ranges from 31% to 56%, depending on the reference used.

Detecting Semantic Relations between Terms in Definitions
Véronique Malaisé | Pierre Zweigenbaum | Bruno Bachimont
Proceedings of CompuTerm 2004: 3rd International Workshop on Computational Terminology

2003

Apprentissage de relations morphologiques en corpus
Pierre Zweigenbaum | Fadila Hadouche | Natalia Grabar
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We propose a method for learning derivational morphological relations from a corpus. It relies on the co-occurrence in the corpus of formally similar words and on additional filtering based on the form of the derived words. It is implemented and tested on a medical corpus. The relations obtained before filtering have an average precision of 75.6% at rank 5000 (150-word window). A detailed examination of the adjectival derivatives of a sample of 633 nouns from the field of anatomy shows good precision of 85–91% and moderate recall of 32–34%. We discuss these results and suggest ways to complement them.

2002

Accenting unknown words in a specialized language
Pierre Zweigenbaum | Natalia Grabar
Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain

Looking for Candidate Translational Equivalents in Specialized, Comparable Corpora
Yun-Chuang Chiao | Pierre Zweigenbaum
COLING 2002: The 17th International Conference on Computational Linguistics: Project Notes

Accentuation de mots inconnus : application au thesaurus biomédical MeSH
Pierre Zweigenbaum | Natalia Grabar
Actes de la 9ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Some textual or terminological resources are written without diacritics, which hinders their use in natural language processing. In a specialized domain such as medicine, the words encountered are frequently absent from the available electronic lexicons. This raises the question of accenting unknown words, which is the subject of this work. We propose two methods for accenting unknown words, based on learning from the contexts in which the letters to be accented occur in a set of training words: one adapted from part-of-speech tagging, the other adapted from a method for learning morphological rules. We present experimental results for the letter e on a French biomedical thesaurus, the MeSH. These methods obtain a precision of 86 to 96% (±4%) for a recall ranging from 72 to 86%.

Lexically-Based Terminology Structuring: Some Inherent Limits
Natalia Grabar | Pierre Zweigenbaum
COLING-02: COMPUTERM 2002: Second International Workshop on Computational Terminology

2001

L’apport de connaissances morphologiques pour la projection de requêtes sur une terminologie normalisée
Pierre Zweigenbaum | Natalia Grabar | Stefan Darmoni
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

The contribution of linguistic knowledge to information retrieval remains a subject of debate. Here we examine the influence of morphological knowledge (inflection, derivation) on the results of a specific information retrieval task in a specialized domain. This influence is studied using a list of real queries collected on an operational server that has no linguistic knowledge. We observe that, for this task, inflection and derivation bring a moderate but real gain.

1996

Processing Metonymy: a Domain-Model Heuristic Graph Traversal Approach
Jacques Bouaud | Bruno Bachimont | Pierre Zweigenbaum
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics

1992

First Results of a French Linguistic Development Environment
L. Bouchard | L. Emirkanian | D. Estival | C. Fay-Varnier | C. Fouquere | G. Prigent | P. Zweigenbaum
COLING 1992 Volume 4: The 14th International Conference on Computational Linguistics

1990

Deep Sentence Understanding in a Restricted Domain
Pierre Zweigenbaum | Marc Cavazza
COLING 1990 Volume 1: Papers presented to the 13th International Conference on Computational Linguistics

Co-authors

Dhouha Bouamor 12

Aurelie Neveol 12

Louise Deléger 11

Emmanuel Morin 8

Claire Nédellec 8

Nasredine Semmar 8

Olivier Galibert 7

Natalia Grabar 7

Asma Ben Abacha 6

Patrick Paroubek 6

Atilla Kaan Alkan 5

Hicham El Boukkouri 5

Thierry Hamon 5

Anne-Laure Ligozat 5

Roland Roller 5

Olivier Ferret 4

Leonardo Campillos Llanos 4

Ludovic Quintard 4

Fabian Schüssler 4

Philippe Thomas 4

Hui-Syuan Yeh 4

Mathilde Aguiar 3

Bruno Bachimont 3

Bruno Cartoni 3

Arnaud Ferré 3

Véronique Malaisé 3

François Morlane-Hondère 3

Sebastian Möller 3

Christopher Norman 3

Shuntaro Yada 3

Victoria Arranz 2

Mouhamadou Ba 2

Eric Bilinski 2

Khalid Choukri 2

Montse Cuadros 2

Stéfan Darmoni 2

Mariem Ellouze Khemekhem 2

Aitor García-Pablos 2

Lucie Gianola 2

Lamia Hadrich Belguith 2

Manuel Herranz 2

Juliette Kahn 2

Mariska Leeflang 2

Véronique Moriceau 2

Fiammetta Namer 2

Tomohiro Nishiyama 2

Sampo Pyysalo 2

Xavier Tannier 2

Ayla Rigouts Terryn 2

Shoko Wakamiya 2

François Yvon 2

Ēriks Ajausks 1

Marie-José Aroulanda 1

Virgile Barthet 1

Mohamed Ben Jannet 1

Delphine Bernhard 1

Philippe Bessières 1

Jacques Bouaud 1

Tiffany Callahan 1

Leonardo Campillos 1

Aleix Cerdà-i-Cucó 1

Estelle Chaix 1

Maria Evangelia Chatzimina 1

Yun-Chuang Chiao 1

Vincent Claveau 1

K. Bretonnel Cohen 1

Georgette Dal 1

Pascal De Groote 1

Hans Degroote 1

Thierry Delbecque 1

Bertrand Dubreucq 1

Eva D’Hondt 1

Louisette Emirkanian 1

Amando Estela 1

Dominique Estival 1

Thierry Etchegoyhen 1

Abdelhak Fatihi 1

C. Fay-Varnier 1

Shelley Fisher Fishkin 1

Dominic Forest 1

Christophe Fouqueré 1

Bernard Fradin 1

Claire François 1

Guillermo Garcia 1

Mercedes García-Martínez 1

Laurent Gillard 1

Graciela Gonzalez 1

Brigitte Grau 1

Julien Grosjean 1

Johann Gutton 1

Amal Haddad Haddad 1

Fadila Hadouche 1

Orin Hargraves 1

Nabil Hathout 1

Sophia Hernandez 1

Guillaume Hocquet 1

Lawrence Hunter 1

Christine Jacquin 1

Alejandro Kohan 1

Michel Komajda 1

Philippe Langlais 1

Dominique Laurent 1

Loic Lepiniec 1

Stéphanie Lignon 1

Yuji Matsumoto 1

Aurélien Max 1

Anne-Lyse Minard 1

Ruslan Mitkov 1

Laura Monceaux-Cachard 1

Karen O’Connor 1

Clément Plancq 1

Quentin Pleplé 1

Delphine Pois 1

Adrian Popescu 1

Jean-Pierre Raysz 1

Grégoire Rey 1

Raul Rodriguez-Esteban 1

Michael Rosner 1

Roberts Rozis 1

Oliver Sapina 1

Christophe Servan 1

Vishakha Sharma 1

Efstathios Stamatatos 1

Delphine Tribout 1

Jun’ichi Tsujii 1

Dialekti Valsamou 1

Artūrs Vasiļevskis 1

Bhuvanesh Verma 1

Davy Weissenbacher 1

Antoine Widlöcher 1

Venues