Asma Ben Abacha

Also published as: Asma Ben Abacha


pdf bib
To Err Is Human, How about Medical Large Language Models? Comparing Pre-trained Language Models for Medical Assessment Errors and Reliability
Wen-wai Yim | Yujuan Fu | Asma Ben Abacha | Meliha Yetisgen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Unpredictability, especially unpredictability with unknown error characteristics, is a highly undesirable trait, particularly in medical patient care applications. Although large pre-trained language models (LLM) have been applied to a variety of unseen tasks with highly competitive and successful results, their sensitivity to language inputs and resulting performance variability is not well-studied. In this work, we test state-of-the-art pre-trained language models from a variety of families to characterize their error generation and reliability in medical assessment ability. Particularly, we experiment with general medical assessment multiple choice tests, as well as their open-ended and true-false alternatives. We also profile model consistency, error agreements with each other and to humans; and finally, quantify their ability to recover and explain errors. The findings in this work can be used to give further information about medical models so that modelers can make better-informed decisions rather than relying on standalone performance metrics alone.


pdf bib
An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters
Asma Ben Abacha | Wen-wai Yim | Yadan Fan | Thomas Lin
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Medical doctors spend on average 52 to 102 minutes per day writing clinical notes from their patient encounters (Hripcsak et al., 2011). Reducing this workload calls for relevant and efficient summarization methods. In this paper, we introduce new resources and empirical investigations for the automatic summarization of doctor-patient conversations in a clinical setting. In particular, we introduce the MTS-Dialog dataset; a new collection of 1,700 doctor-patient dialogues and corresponding clinical notes. We use this new dataset to investigate the feasibility of this task and the relevance of existing language models, data augmentation, and guided summarization techniques. We compare standard evaluation metrics based on n-gram matching, contextual embeddings, and Fact Extraction to assess the accuracy and the factual consistency of the generated summaries. To ground these results, we perform an expert-based evaluation using relevant natural language generation criteria and task-specific criteria such as critical omissions, and study the correlation between the automatic metrics and expert judgments. To the best of our knowledge, this study is the first attempt to introduce an open dataset of doctor-patient conversations and clinical notes, with detailed automated and manual evaluations of clinical note generation.

pdf bib
An Investigation of Evaluation Methods in Automatic Medical Note Generation
Asma Ben Abacha | Wen-wai Yim | George Michalopoulos | Thomas Lin
Findings of the Association for Computational Linguistics: ACL 2023

Recent studies on automatic note generation have shown that doctors can save significant amounts of time when using automatic clinical note generation (Knoll et al., 2022). Summarization models have been used for this task to generate clinical notes as summaries of doctor-patient conversations (Krishna et al., 2021; Cai et al., 2022). However, assessing which model would best serve clinicians in their daily practice is still a challenging task due to the large set of possible correct summaries, and the potential limitations of automatic evaluation metrics. In this paper we study evaluation methods and metrics for the automatic generation of clinical notes from medical conversation. In particular, we propose new task-specific metrics and we compare them to SOTA evaluation metrics in text summarization and generation, including: (i) knowledge-graph embedding-based metrics, (ii) customized model-based metrics with domain-specific weights, (iii) domain-adapted/fine-tuned metrics, and (iv) ensemble metrics. To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts and computing the factual correctness, and the hallucination and omission rates for critical medical facts. This study relied on seven datasets manually annotated by domain experts. Our experiments show that automatic evaluation metrics can have substantially different behaviors on different types of clinical notes datasets. However, the results highlight one stable subset of metrics as the most correlated with human judgments with a relevant aggregation of different evaluation criteria.

pdf bib
Proceedings of the 5th Clinical Natural Language Processing Workshop
Tristan Naumann | Asma Ben Abacha | Steven Bethard | Kirk Roberts | Anna Rumshisky
Proceedings of the 5th Clinical Natural Language Processing Workshop

pdf bib
Overview of the MEDIQA-Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient Conversations
Asma Ben Abacha | Wen-wai Yim | Griffin Adams | Neal Snider | Meliha Yetisgen
Proceedings of the 5th Clinical Natural Language Processing Workshop

Automatic generation of clinical notes from doctor-patient conversations can play a key role in reducing daily doctors’ workload and improving their interactions with the patients. MEDIQA-Chat 2023 aims to advance and promote research on effective solutions through shared tasks on the automatic summarization of doctor-patient conversations and on the generation of synthetic dialogues from clinical notes for data augmentation. Seventeen teams participated in the challenge and experimented with a broad range of approaches and models. In this paper, we describe the three MEDIQA-Chat 2023 tasks, the datasets, and the participants’ results and methods. We hope that these shared tasks will lead to additional research efforts and insights on the automatic generation and evaluation of clinical notes.


pdf bib
Reinforcement Learning for Abstractive Question Summarization with Question-aware Semantic Rewards
Shweta Yadav | Deepak Gupta | Asma Ben Abacha | Dina Demner-Fushman
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The growth of online consumer health questions has led to the necessity for reliable and accurate question answering systems. A recent study showed that manual summarization of consumer health questions brings significant improvement in retrieving relevant answers. However, the automatic summarization of long questions is a challenging task due to the lack of training data and the complexity of the related subtasks, such as the question focus and type recognition. In this paper, we introduce a reinforcement learning-based framework for abstractive question summarization. We propose two novel rewards obtained from the downstream tasks of (i) question-type identification and (ii) question-focus recognition to regularize the question generation model. These rewards ensure the generation of semantically valid questions and encourage the inclusion of key medical entities/foci in the question summary. We evaluated our proposed method on two benchmark datasets and achieved higher performance over state-of-the-art models. The manual evaluation of the summaries reveals that the generated questions are more diverse and have fewer factual inconsistencies than the baseline summaries. The source code is available here:

pdf bib
Overview of the MEDIQA 2021 Shared Task on Summarization in the Medical Domain
Asma Ben Abacha | Yassine Mrabet | Yuhao Zhang | Chaitanya Shivade | Curtis Langlotz | Dina Demner-Fushman
Proceedings of the 20th Workshop on Biomedical Language Processing

The MEDIQA 2021 shared tasks at the BioNLP 2021 workshop addressed three tasks on summarization for medical text: (i) a question summarization task aimed at exploring new approaches to understanding complex real-world consumer health queries, (ii) a multi-answer summarization task that targeted aggregation of multiple relevant answers to a biomedical question into one concise and relevant answer, and (iii) a radiology report summarization task addressing the development of clinically relevant impressions from radiology report findings. Thirty-five teams participated in these shared tasks with sixteen working notes submitted (fifteen accepted) describing a wide variety of models developed and tested on the shared and external datasets. In this paper, we describe the tasks, the datasets, the models and techniques developed by various teams, the results of the evaluation, and a study of correlations among various summarization evaluation measures. We hope that these shared tasks will bring new research and insights in biomedical text summarization and evaluation.

pdf bib
Evidence-based Fact-Checking of Health-related Claims
Mourad Sarrouti | Asma Ben Abacha | Yassine Mrabet | Dina Demner-Fushman
Findings of the Association for Computational Linguistics: EMNLP 2021

The task of verifying the truthfulness of claims in textual documents, or fact-checking, has received significant attention in recent years. Many existing evidence-based factchecking datasets contain synthetic claims and the models trained on these data might not be able to verify real-world claims. Particularly few studies addressed evidence-based fact-checking of health-related claims that require medical expertise or evidence from the scientific literature. In this paper, we introduce HEALTHVER, a new dataset for evidence-based fact-checking of health-related claims that allows to study the validity of real-world claims by evaluating their truthfulness against scientific articles. Using a three-step data creation method, we first retrieved real-world claims from snippets returned by a search engine for questions about COVID-19. Then we automatically retrieved and re-ranked relevant scientific papers using a T5 relevance-based model. Finally, the relations between each evidence statement and the associated claim were manually annotated as SUPPORT, REFUTE and NEUTRAL. To validate the created dataset of 14,330 evidence-claim pairs, we developed baseline models based on pretrained language models. Our experiments showed that training deep learning models on real-world medical claims greatly improves performance compared to models trained on synthetic and open-domain claims. Our results and manual analysis suggest that HEALTHVER provides a realistic and challenging dataset for future efforts on evidence-based fact-checking of health-related claims. The dataset, source code, and a leaderboard are available at


pdf bib
Visual Question Generation from Radiology Images
Mourad Sarrouti | Asma Ben Abacha | Dina Demner-Fushman
Proceedings of the First Workshop on Advances in Language and Vision Research

Visual Question Generation (VQG), the task of generating a question based on image contents, is an increasingly important area that combines natural language processing and computer vision. Although there are some recent works that have attempted to generate questions from images in the open domain, the task of VQG in the medical domain has not been explored so far. In this paper, we introduce an approach to generation of visual questions about radiology images called VQGR, i.e. an algorithm that is able to ask a question when shown an image. VQGR first generates new training data from the existing examples, based on contextual word embeddings and image augmentation techniques. It then uses the variational auto-encoders model to encode images into a latent space and decode natural language questions. Experimental automatic evaluations performed on the VQA-RAD dataset of clinical visual questions show that VQGR achieves good performances compared with the baseline system. The source code is available at


pdf bib
Overview of the MEDIQA 2019 Shared Task on Textual Inference, Question Entailment and Question Answering
Asma Ben Abacha | Chaitanya Shivade | Dina Demner-Fushman
Proceedings of the 18th BioNLP Workshop and Shared Task

This paper presents the MEDIQA 2019 shared task organized at the ACL-BioNLP workshop. The shared task is motivated by a need to develop relevant methods, techniques and gold standards for inference and entailment in the medical domain, and their application to improve domain specific information retrieval and question answering systems. MEDIQA 2019 includes three tasks: Natural Language Inference (NLI), Recognizing Question Entailment (RQE), and Question Answering (QA) in the medical domain. 72 teams participated in the challenge, achieving an accuracy of 98% in the NLI task, 74.9% in the RQE task, and 78.3% in the QA task. In this paper, we describe the tasks, the datasets, and the participants’ approaches and results. We hope that this shared task will attract further research efforts in textual inference, question entailment, and question answering in the medical domain.

pdf bib
On the Summarization of Consumer Health Questions
Asma Ben Abacha | Dina Demner-Fushman
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Question understanding is one of the main challenges in question answering. In real world applications, users often submit natural language questions that are longer than needed and include peripheral information that increases the complexity of the question, leading to substantially more false positives in answer retrieval. In this paper, we study neural abstractive models for medical question summarization. We introduce the MeQSum corpus of 1,000 summarized consumer health questions. We explore data augmentation methods and evaluate state-of-the-art neural abstractive models on this new task. In particular, we show that semantic augmentation from question datasets improves the overall performance, and that pointer-generator networks outperform sequence-to-sequence attentional models on this task, with a ROUGE-1 score of 44.16%. We also present a detailed error analysis and discuss directions for improvement that are specific to question summarization.


pdf bib
NLM_NIH at SemEval-2017 Task 3: from Question Entailment to Question Similarity for Community Question Answering
Asma Ben Abacha | Dina Demner-Fushman
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our participation in SemEval-2017 Task 3 on Community Question Answering (cQA). The Question Similarity subtask (B) aims to rank a set of related questions retrieved by a search engine according to their similarity to the original question. We adapted our feature-based system for Recognizing Question Entailment (RQE) to the question similarity task. Tested on cQA-B-2016 test data, our RQE system outperformed the best system of the 2016 challenge in all measures with 77.47 MAP and 80.57 Accuracy. On cQA-B-2017 test data, performances of all systems dropped by around 30 points. Our primary system obtained 44.62 MAP, 67.27 Accuracy and 47.25 F1 score. The cQA-B-2017 best system achieved 47.22 MAP and 42.37 F1 score. Our system is ranked sixth in terms of MAP and third in terms of F1 out of 13 participating teams.


pdf bib
Annotating Named Entities in Consumer Health Questions
Halil Kilicoglu | Asma Ben Abacha | Yassine Mrabet | Kirk Roberts | Laritza Rodriguez | Sonya Shooshan | Dina Demner-Fushman
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We describe a corpus of consumer health questions annotated with named entities. The corpus consists of 1548 de-identified questions about diseases and drugs, written in English. We defined 15 broad categories of biomedical named entities for annotation. A pilot annotation phase in which a small portion of the corpus was double-annotated by four annotators was followed by a main phase in which double annotation was carried out by six annotators, and a reconciliation phase in which all annotations were reconciled by an expert. We conducted the annotation in two modes, manual and assisted, to assess the effect of automatic pre-annotation and calculated inter-annotator agreement. We obtained moderate inter-annotator agreement; assisted annotation yielded slightly better agreement and fewer missed annotations than manual annotation. Due to complex nature of biomedical entities, we paid particular attention to nested entities for which we obtained slightly lower inter-annotator agreement, confirming that annotating nested entities is somewhat more challenging. To our knowledge, the corpus is the first of its kind for consumer health text and is publicly available.

pdf bib
A Hybrid Approach to Generation of Missing Abstracts in Biomedical Literature
Suchet Chachra | Asma Ben Abacha | Sonya Shooshan | Laritza Rodriguez | Dina Demner-Fushman
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Readers usually rely on abstracts to identify relevant medical information from scientific articles. Abstracts are also essential to advanced information retrieval methods. More than 50 thousand scientific publications in PubMed lack author-generated abstracts, and the relevancy judgements for these papers have to be based on their titles alone. In this paper, we propose a hybrid summarization technique that aims to select the most pertinent sentences from articles to generate an extractive summary in lieu of a missing abstract. We combine i) health outcome detection, ii) keyphrase extraction, and iii) textual entailment recognition between sentences. We evaluate our hybrid approach and analyze the improvements of multi-factor summarization over techniques that rely on a single method, using a collection of 295 manually generated reference summaries. The obtained results show that the hybrid approach outperforms the baseline techniques with an improvement of 13% in recall and 4% in F1 score.


pdf bib
LIST-LUX: Disorder Identification from Clinical Texts
Asma Ben Abacha | Aikaterini Karanasiou | Yassine Mrabet | Julio Cesar Dos Reis
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)


pdf bib
MEANS : une approche sémantique pour la recherche de réponses aux questions médicales [MEANS: a semantic approach to medical question answering]
Asma Ben Abacha | Pierre Zweigenbaum
Traitement Automatique des Langues, Volume 55, Numéro 1 : Varia [Varia]


pdf bib
Extraction d’information automatique en domaine médical par projection inter-langue : vers un passage à l’échelle (Automatic Information Extraction in the Medical Domain by Cross-Lingual Projection) [in French]
Asma Ben Abacha | Pierre Zweigenbaum | Aurélien Max
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Une étude comparative empirique sur la reconnaissance des entités médicales [An empirical comparative study of medical entity recognition]
Asma Ben Abacha | Pierre Zweigenbaum
Traitement Automatique des Langues, Volume 53, Numéro 1 : Varia [Varia]


pdf bib
Medical Entity Recognition: A Comparaison of Semantic and Statistical Methods
Asma Ben Abacha | Pierre Zweigenbaum
Proceedings of BioNLP 2011 Workshop

pdf bib
Extraction d’informations médicales au LIMSI (Medical information extraction at LIMSI)
Cyril Grouin | Louise Deléger | Anne-Lyse Minard | Anne-Laure Ligozat | Asma Ben Abacha | Delphine Bernhard | Bruno Cartoni | Brigitte Grau | Sophie Rosset | Pierre Zweigenbaum
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations


pdf bib
MeTAE : Plate-forme d’annotation automatique et d’exploration sémantiques pour le domaine médical
Asma Ben Abacha | Pierre Zweigenbaum
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Nous présentons une plate-forme d’annotation sémantique et d’exploration de textes médicaux, appelée « MeTAE ». Le processus d’annotation automatique comporte une première étape de reconnaissance des entités médicales présentes dans les textes suivie d’une étape d’identification des relations sémantiques qui les relient. Cette identification se fonde sur des patrons linguistiques construits manuellement pour chaque type de relation. MeTAE génère des annotations RDF à partir des informations extraites et offre une interface d’exploration des textes annotés avec des requêtes sous forme de formulaire. La plate-forme peut être utilisée pour analyser sémantiquement les textes médicaux ou interroger la base d’annotation disponible pour avoir une/des réponses à une requête donnée (e.g. « ?X prévient maladie d’Alzheimer », équivalent à la question « comment prévenir la maladie d’Alzheimer ? »). Cette application peut être la base d’un système de questions-réponses pour le domaine médical.