Solen Quiniou

2026

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health
Trung Hieu Ngo | Adrien Bazoge | Solen Quiniou | Pierre-Antoine Gourraud | Emmanuel Morin
Findings of the Association for Computational Linguistics: EACL 2026

Large Language Models (LLMs) excel in Natural Language Processing (NLP) tasks, but they often propagate biases embedded in their training data, which is potentially impactful in sensitive domains like healthcare. While existing benchmarks evaluate biases related to individual social determinants of health (SDoH) such as gender or ethnicity, they often overlook interactions between these factors and lack context-specific assessments. This study investigates bias in LLMs by probing the relationships between gender and other SDoH in French patient records. Through a series of experiments, we found that embedded stereotypes can be probed using SDoH input and that LLMs rely on embedded stereotypes to make gendered decisions, suggesting that evaluating interactions among SDoH factors could usefully complement existing approaches to assessing LLM performance and bias.

2025

pdf bib abs

AdminSet and AdminBERT: a Dataset and a Pre-trained Language Model to Explore the Unstructured Maze of French Administrative Documents
Thomas Sebbag | Solen Quiniou | Nicolas Stucky | Emmanuel Morin
Proceedings of the 31st International Conference on Computational Linguistics

In recent years, Pre-trained Language Models(PLMs) have been widely used to analyze various documents, playing a crucial role in Natural Language Processing (NLP). However, administrative texts have rarely been used in information extraction tasks, even though this resource is available as open data in many countries. Most of these texts contain many specific domain terms. Moreover, especially in France, they are unstructured because many administrations produce them without a standardized framework. Due to this fact, current language models do not process these documents correctly. In this paper, we propose AdminBERT, the first French pre-trained language models for the administrative domain. Since interesting information in such texts corresponds to named entities and the relations between them, we compare this PLM with general domain language models, fine-tuned on the Named Entity Recognition (NER) task applied to administrative texts, as well as to a Large Language Model (LLM) and to a language model with an architecture different from the BERT one. We show that taking advantage of a PLM for French administrative data increases the performance in the administrative and general domains, on these texts. We also release AdminBERT as well as AdminSet, the pre-training corpus of administrative texts in French and the subset AdminSet-NER, the first NER dataset consisting exclusively of administrative texts in French.

pdf bib abs

AdminSet and AdminBERT : un jeu de données et un modèle de langue pré-entraîné pour explorer le dédale non structuré des données administratives françaises
Thomas Sebbag | Solen Quiniou | Nicolas Stucky | Emmanuel Morin
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Les modèles de langue pré-entraînés (PLM) sont largement utilisés en traitement automatique du langage naturel (TALN), mais peu adaptés aux textes administratifs, souvent non standardisés et spécialisés. En France, l’absence de réglementation uniforme et l’hétérogénéité des sources compliquent le traitement des documents administratifs. Pour pallier ce problème, nous proposons AdminBERT, le premier modèle de langue pré-entraîné en français dédié aux documents administratifs. Nous évaluons AdminBERT sur la tâche de reconnaissance des entités nommées (REN), en le comparant à des modèles génériques, un grand modèle de langue (LLM) et une variante du modèle BERT. Nos résultats montrent qu’un pré-entraînement sur des textes administratifs améliore significativement la reconnaissance des entités nommées. Nous mettons à disposition AdminBERT, AdminSet (un corpus de pré-entraînement) et AdminSet-NER, le premier jeu de données annoté pour la REN sur des textes administratifs français.

pdf bib

LLM as a Guide: an Approach for Unsupervised Economic Relation Discovery in Administrative Documents
Thomas Sebbag | Solen Quiniou | Emmanuel Morin
Proceedings of The 10th Workshop on Financial Technology and Natural Language Processing

pdf bib abs

Summarization for Generative Relation Extraction in the Microbiome Domain
Oumaima El Khettari | Solen Quiniou | Samuel Chaffron
Actes de l'atelier Traitement du langage médical à l’époque des LLMs 2025 (MLP-LLM)

We explore a generative relation extraction (RE) pipeline tailored to the study of interactions in the intestinal microbiome, a complex and low-resource biomedical domain. Our method leverages summarization with large language models (LLMs) to refine context before extracting relations via instruction-tuned generation. Preliminary results on a dedicated corpus show that summarization improves generative RE performance by reducing noise and guiding the model. However, BERT-based RE approaches still outperform generative models. This ongoing work demonstrates the potential of generative methods to support the study of specialized domains in low-resources setting.

2024

pdf bib abs

Biomedical information extraction is crucial for advancing research, enhancing healthcare, and discovering treatments by efficiently analyzing extensive data. Given the extensive amount of biomedical data available, automated information extraction methods are necessary due to manual extraction’s labor-intensive, expertise-dependent, and costly nature. In this paper, we propose a novel two-stage system for information extraction where we annotate biomedical articles based on a specific ontology (HOIP). The major challenge is annotating relation between biomedical processes often not explicitly mentioned in text articles. Here, we first predict the candidate processes and then determine the relationships between these processes. The experimental results show promising outcomes in mention-agnostic process identification using Large Language Models (LLMs). In relation classification, BERT-based supervised models still outperform LLMs significantly. The end-to-end evaluation results suggest the difficulty of this task and room for improvement in both process identification and relation classification.

pdf bib abs

Improving Text Readability through Segmentation into Rheses
Antoine Jamelot | Solen Quiniou | Sophie Hamon
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Enhancing text readability is crucial for readers with challenges like dyslexia. This paper delves into the segmentation of sentences into rheses, i.e. rhythmic and semantic units. Their aim is to clarify sentence structures for improved comprehension, through a harmonious balance between syntactic accuracy, the natural rhythm of reading aloud, and the delineation of meaningful units. This study relates and compares our various attempts to improve a pre-existing rhesis segmentation tool, which is based on the selection of candidate segmentations. We also release TeRheSe (Texts with Rhesis Segmentation), a bilingual dataset, segmented into rheses, comprising 12 books from classic literature in French and English. We evaluated our approaches on this dataset, showing the efficiency of a novel approach based on token classification, reaching a F1-score of 90.0% in English (previously 85.3%) and 91.3% in French (previously 88.0%). We also study the potential of leveraging prosodic elements, though its definitive impact remains inconclusive.

pdf bib abs

The biomedical domain has sparked a significant interest in the field of Natural Language Processing (NLP), which has seen substantial advancements with pre-trained language models (PLMs). However, comparing these models has proven challenging due to variations in evaluation protocols across different models. A fair solution is to aggregate diverse downstream tasks into a benchmark, allowing for the assessment of intrinsic PLMs qualities from various perspectives. Although still limited to few languages, this initiative has been undertaken in the biomedical field, notably English and Chinese. This limitation hampers the evaluation of the latest French biomedical models, as they are either assessed on a minimal number of tasks with non-standardized protocols or evaluated using general downstream tasks. To bridge this research gap and account for the unique sensitivities of French, we present the first-ever publicly available French biomedical language understanding benchmark called DrBenchmark. It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, or classification. We evaluate 8 state-of-the-art pre-trained masked language models (MLMs) on general and biomedical-specific data, as well as English specific MLMs to assess their cross-lingual capabilities. Our experiments reveal that no single model excels across all tasks, while generalist models are sometimes still competitive.

2023

pdf bib abs

Annotation d’interactions hôte-microbiote dans des articles scientifiques par similarité sémantique avec une ontologie
Oumaima El Khettari | Solen Quiniou | Samuel Chaffron
Actes de CORIA-TALN 2023. Actes de l'atelier "Analyse et Recherche de Textes Scientifiques" (ARTS)@TALN 2023

Nous nous intéressons à l’extraction de relations, dans des articles scientifiques, portant sur le microbiome humain. Afin de construire un corpus annoté, nous avons évalué l’utilisation de l’ontologie OHMI pour détecter les relations présentes dans les phrases des articles scientifiques, en calculant la similarité sémantique entre les relations définies dans l’ontologie et les phrases des articles. Le modèle BERT et trois variantes biomédicales sont utilisés pour obtenir les représentations des relations et des phrases. Ces modèles sont comparés sur un corpus construit à partir d’articles scientifiques complets issus de la plateforme ISTEX, dont une sous-partie a été annotée manuellement.

pdf bib abs

Building a Corpus for Biomedical Relation Extraction of Species Mentions
Oumaima El Khettari | Solen Quiniou | Samuel Chaffron
Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

We present a manually annotated new corpus, Species-Species Interaction (SSI), for extracting meaningful binary relations between species, in biomedical texts, at sentence level, with a focus on the gut microbiota. The corpus leverages PubTator to annotate species in full-text articles after evaluating different NER species taggers. Our first results are promising for extracting relations between species using BERT and its biomedical variants.

2020

pdf bib abs

This corpus is part of the PASTEL (Performing Automated Speech Transcription for Enhancing Learning) project aiming to explore the potential of synchronous speech transcription and application in specific teaching situations. It includes 10 hours of different lectures, manually transcribed and segmented. The main interest of this corpus lies in its multimodal aspect: in addition to speech, the courses were filmed and the written presentation supports (slides) are made available. The dataset may then serve researches in multiple fields, from speech and language to image and video processing. The dataset will be freely available to the research community. In this paper, we first describe in details the annotation protocol, including a detailed analysis of the manually labeled data. Then, we propose some possible use cases of the corpus with baseline results. The use cases concern scientific fields from both speech and text processing, with language model adaptation, thematic segmentation and transcription to slide alignment.

2019

pdf bib abs

Apport de l’adaptation automatique des modèles de langage pour la reconnaissance de la parole: évaluation qualitative extrinsèque dans un contexte de traitement de cours magistraux (Contribution of automatic adaptation of language models for speech recognition : extrinsic qualitative evaluation in a context of educational courses)
Salima Mdhaffar | Yannick Estève | Nicolas Hernandez | Antoine Laurent | Solen Quiniou
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

Malgré les faiblesses connues de cette métrique, les performances de différents systèmes de reconnaissance automatique de la parole sont généralement comparées à l’aide du taux d’erreur sur les mots. Les transcriptions automatiques de ces systèmes sont de plus en plus exploitables et utilisées dans des systèmes complexes de traitement automatique du langage naturel, par exemple pour la traduction automatique, l’indexation, la recherche documentaire... Des études récentes ont proposé des métriques permettant de comparer la qualité des transcriptions automatiques de différents systèmes en fonction de la tâche visée. Dans cette étude nous souhaitons mesurer, qualitativement, l’apport de l’adaptation automatique des modèles de langage au domaine visé par un cours magistral. Les transcriptions du discours de l’enseignant peuvent servir de support à la navigation dans le document vidéo du cours magistral ou permettre l’enrichissement de son contenu pédagogique. C’est à-travers le prisme de ces deux tâches que nous évaluons l’apport de l’adaptation du modèle de langage. Les expériences ont été menées sur un corpus de cours magistraux et montrent combien le taux d’erreur sur les mots est une métrique insuffisante qui masque les apports effectifs de l’adaptation des modèles de langage.

2018

pdf bib

pdf bib abs

Transfer Learning for a Letter-Ngrams to Word Decoder in the Context of Historical Handwriting Recognition with Scarce Resources
Adeline Granet | Emmanuel Morin | Harold Mouchère | Solen Quiniou | Christian Viard-Gaudin
Proceedings of the 27th International Conference on Computational Linguistics

Lack of data can be an issue when beginning a new study on historical handwritten documents. In order to deal with this, we present the character-based decoder part of a multilingual approach based on transductive transfer learning for a historical handwriting recognition task on Italian Comedy Registers. The decoder must build a sequence of characters that corresponds to a word from a vector of letter-ngrams. As learning data, we created a new dataset from untapped resources that covers the same domain and period of our Italian Comedy data, as well as resources from common domains, periods, or languages. We obtain a 97.42% Character Recognition Rate and a 86.57% Word Recognition Rate on our Italian Comedy data, despite a lexical coverage of 67% between the Italian Comedy data and the training data. These results show that an efficient system can be obtained by a carefully selecting the datasets used for the transfer learning.

pdf bib abs

Décodeur neuronal pour la transcription de documents manuscrits anciens (Neural decoder for the transcription of historical handwritten documents)
Adeline Granet | Emmanuel Morin | Harold Mouchère | Solen Quiniou | Christian Viard-Gaudin
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

L’absence de données annotées peut être une difficulté majeure lorsque l’on s’intéresse à l’analyse de documents manuscrits anciens. Pour contourner cette difficulté, nous proposons de diviser le problème en deux, afin de pouvoir s’appuyer sur des données plus facilement accessibles. Dans cet article nous présentons la partie décodeur d’un encodeur-décodeur multimodal utilisant l’apprentissage par transfert de connaissances pour la transcription des titres de pièces de la Comédie Italienne. Le décodeur transforme un vecteur de n-grammes au niveau caractères en une séquence de caractères correspondant à un mot. L’apprentissage par transfert de connaissances est réalisé principalement à partir d’une nouvelle ressource inexploitée contemporaine à la Comédie-Italienne et thématiquement proche ; ainsi que d’autres ressources couvrant d’autres domaines, des langages différents et même des périodes différentes. Nous obtenons 97,27% de caractères bien reconnus sur les données de la Comédie-Italienne, ainsi que 86,57% de mots correctement générés malgré une couverture de 67,58% uniquement entre la Comédie-Italienne et l’ensemble d’apprentissage. Les expériences montrent qu’un tel système peut être une approche efficace dans le cadre d’apprentissage par transfert.

pdf bib

Towards a Diagnosis of Textual Difficulties for Children with Dyslexia
Solen Quiniou | Béatrice Daille
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

pdf bib abs

Segmentation automatique d’un texte en rhèses (Automatic segmentation of a text into rhesis)
Victor Pineau | Constance Nin | Solen Quiniou | Béatrice Daille
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

La segmentation d’un texte en rhèses, unités-membres signifiantes de la phrase, permet de fournir des adaptations de celui-ci pour faciliter la lecture aux personnes dyslexiques. Dans cet article, nous proposons une méthode d’identification automatique des rhèses basée sur un apprentissage supervisé à partir d’un corpus que nous avons annoté. Nous comparons celle-ci à l’identification manuelle ainsi qu’à l’utilisation d’outils et de concepts proches, tels que la segmentation d’un texte en chunks.