Béatrice Daille

Also published as: Beatrice Daille

2025

ACL-rlg: A Dataset for Reading List Generation
Julien Aubert-Béduchaud | Florian Boudin | Béatrice Daille | Richard Dufour
Proceedings of the 31st International Conference on Computational Linguistics

Familiarizing oneself with a new scientific field and its existing literature can be daunting because of the large number of available articles. Curated lists of academic references, or reading lists, compiled by experts, offer a structured way to gain a comprehensive overview of a domain or a specific scientific challenge. In this work, we introduce ACL-rlg, the largest open expert-annotated reading list dataset. We also provide multiple baselines for evaluating reading list generation and formally define it as a retrieval task. Our qualitative study highlights that traditional scholarly search engines and indexing methods perform poorly on this task, and that GPT-4o, despite showing better results, exhibits signs of potential data contamination.

ACL-rlg : Un dataset pour la génération de listes de lecture
Julien Aubert-Béduchaud | Florian Boudin | Béatrice Daille | Richard Dufour
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Becoming familiar with a new scientific field and its associated literature can be difficult because of the considerable number of available articles. Lists of academic references compiled by experts, also called reading lists, offer a structured and efficient way to gain an in-depth overview of a scientific field. In this article, we present ACL-rlg, the largest open dataset of expert-annotated reading lists. We also propose several baselines for evaluating reading list generation, which we formalize as an information retrieval task. Our qualitative study highlights the limited performance of traditional academic search engines and indexing methods in this setting, while GPT-4o, although producing better results, shows potential signs of data contamination.

2024

The biomedical domain has sparked significant interest in the field of Natural Language Processing (NLP), which has seen substantial advancements with pre-trained language models (PLMs). However, comparing these models has proven challenging due to variations in evaluation protocols across different models. A fair solution is to aggregate diverse downstream tasks into a benchmark, allowing the intrinsic qualities of PLMs to be assessed from various perspectives. Although still limited to few languages, this initiative has been undertaken in the biomedical field, notably for English and Chinese. This limitation hampers the evaluation of the latest French biomedical models, as they are either assessed on a minimal number of tasks with non-standardized protocols or evaluated using general downstream tasks. To bridge this research gap and account for the unique sensitivities of French, we present the first publicly available French biomedical language understanding benchmark, called DrBenchmark. It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification. We evaluate 8 state-of-the-art pre-trained masked language models (MLMs) trained on general and biomedical-specific data, as well as English-specific MLMs to assess their cross-lingual capabilities. Our experiments reveal that no single model excels across all tasks, while generalist models are sometimes still competitive.

How Important Is Tokenization in French Medical Masked Language Models?
Yanis Labrak | Adrien Bazoge | Béatrice Daille | Mickael Rouvier | Richard Dufour
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread use of pre-trained language models. This shift began with Byte-Pair Encoding (BPE) and was later followed by the adoption of SentencePiece and WordPiece. While subword tokenization consistently outperforms character- and word-level tokenization, the precise factors contributing to its success remain unclear. Key aspects such as the optimal segmentation granularity for diverse tasks and languages, the influence of data sources on tokenizers, and the role of morphological information in Indo-European languages remain insufficiently explored. This is particularly pertinent for biomedical terminology, characterized by specific rules governing morpheme combinations. Despite the agglutinative nature of biomedical terminology, existing language models do not explicitly incorporate this knowledge, leading to inconsistent tokenization strategies for common terms. In this paper, we examine the complexities of subword tokenization in the French biomedical domain across a variety of NLP tasks and pinpoint areas where further enhancements can be made. We analyze classical tokenization algorithms, including BPE and SentencePiece, and introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.

2023

DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
Yanis Labrak | Adrien Bazoge | Richard Dufour | Mickael Rouvier | Emmanuel Morin | Béatrice Daille | Pierre-Antoine Gourraud
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general-domain data, specialized ones have emerged to handle specific domains more effectively. In this paper, we propose an original study of PLMs in the medical domain in French. We compare, for the first time, the performance of PLMs trained on both public data from the web and private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks. In particular, we show that we can take advantage of existing biomedical PLMs in a foreign language by further pre-training them on our targeted data. Finally, we release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under a free license on which these models are trained.

DrBERT: Un modèle robuste pré-entraîné en français pour les domaines biomédical et clinique
Yanis Labrak | Adrien Bazoge | Richard Dufour | Mickael Rouvier | Emmanuel Morin | Béatrice Daille | Pierre-Antoine Gourraud
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 4 : articles déjà soumis ou acceptés en conférence internationale

In recent years, pre-trained language models have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general-domain data, specialized models have emerged to handle specific domains more effectively. In this article, we propose an original study of language models in the medical domain in French. We compare, for the first time, the performance of models trained on public data from the web and on private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks. Finally, we release the first specialized models for the biomedical domain in French, called DrBERT, as well as the largest corpus of medical data under a free license on which these models are trained.

Actes de CORIA-TALN 2023. Actes de l'atelier "Analyse et Recherche de Textes Scientifiques" (ARTS)@TALN 2023
Florian Boudin | Béatrice Daille | Richard Dufour | Oumaima El | Maël Houbre | Léane Jourdan | Nihel Kooli
Actes de CORIA-TALN 2023. Actes de l'atelier "Analyse et Recherche de Textes Scientifiques" (ARTS)@TALN 2023

Projet NaviTerm : navigation terminologique pour une montée en compétence rapide et personnalisée sur un domaine de recherche
Florian Boudin | Richard Dufour | Béatrice Daille
Actes de CORIA-TALN 2023. Actes de l'atelier "Analyse et Recherche de Textes Scientifiques" (ARTS)@TALN 2023

This article presents the NaviTerm project, whose goal is to help researchers get up to speed faster on a research field by automatically creating synthetic, navigable terminological representations of scientific knowledge.

Classification de relation pour la génération de mots-clés absents
Maël Houbre | Florian Boudin | Béatrice Daille
Actes de CORIA-TALN 2023. Actes de l'atelier "Analyse et Recherche de Textes Scientifiques" (ARTS)@TALN 2023

Encoder-decoder models are the state of the art in keyphrase generation. However, despite numerous adaptations of this architecture, generating keyphrases that are absent from the document text remains a difficult task. This study shows that first training a model on a relation classification task between a document and a keyphrase improves the generation of absent keyphrases.

Tâches et systèmes de détection automatique des réponses correctes dans des QCMs liés au domaine médical : Présentation de la campagne DEFT 2023
Yanis Labrak | Adrien Bazoge | Béatrice Daille | Richard Dufour | Emmanuel Morin | Mickael Rouvier
Actes de CORIA-TALN 2023. Actes du Défi Fouille de Textes@TALN2023

The 2023 edition of the DÉfi Fouille de Textes (DEFT) focused on developing methods for automatically selecting answers to multiple-choice questions (MCQs) in French. The approaches were evaluated on the FrenchMedMCQA corpus, a set of MCQs drawn from past pharmacy exams, each with five candidate answers. Two tasks were proposed. The first consisted in automatically identifying the set of correct answers to a question. The results, evaluated with the Exact Match Ratio (EMR) metric, ranged from 9.97% to 33.76%, while performance in terms of Hamming distance ranged from 24.93 to 52.94. The second task aimed to automatically identify the exact number of correct answers. Its results were evaluated with the Macro-F1 metric, ranging from 13.26% to 42.42%, and with Accuracy, ranging from 47.43% to 68.65%. Among the varied approaches proposed by the six participating teams, the best system relied on a LLaMa-type large language model fine-tuned with the LoRA adaptation method.
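
As a rough illustration of the two measures reported for the first DEFT 2023 task, the sketch below scores multiple-answer questions in Python. It assumes that the Exact Match Ratio is the share of questions whose predicted answer set equals the reference set, and that the Hamming measure is the per-question overlap (intersection over union) between the two sets, which is one common multi-label definition. The campaign's exact formulas are not given in this listing, and the function names and toy data are hypothetical.

```python
from typing import List, Set


def exact_match_ratio(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Share of questions whose predicted answer set equals the gold set."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def hamming_score(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Mean intersection-over-union of answer sets (assumed definition)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    total = 0.0
    for g, p in zip(gold, pred):
        union = g | p
        # Two empty sets agree perfectly; avoid dividing by zero.
        total += len(g & p) / len(union) if union else 1.0
    return total / len(gold)


# Toy data (hypothetical): three questions with answer options "a" to "e".
gold = [{"a", "c"}, {"b"}, {"a", "d", "e"}]
pred = [{"a", "c"}, {"b", "c"}, {"a", "d"}]

emr = exact_match_ratio(gold, pred)
ham = hamming_score(gold, pred)
print(f"EMR = {emr:.4f}, Hamming score = {ham:.4f}")
```

Only the first question is answered exactly, so the exact-match measure gives no credit for the two partially correct answers, whereas the overlap-based measure does. This is consistent with the Hamming figures being higher than the EMR figures in the reported ranges.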

2022

FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain
Yanis Labrak | Adrien Bazoge | Richard Dufour | Beatrice Daille | Pierre-Antoine Gourraud | Emmanuel Morin | Mickael Rouvier
Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)

This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for the medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual correction(s). We also propose first baseline models to automatically process this MCQA task in order to report on current performance and to highlight the difficulty of the task. A detailed analysis of the results showed that it is necessary to have representations adapted to the medical domain or to the MCQA task: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. Corpus, models and tools are available online.

A Large-Scale Dataset for Biomedical Keyphrase Generation
Maël Houbre | Florian Boudin | Beatrice Daille
Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)

Keyphrase generation is the task of generating a set of words or phrases that highlight the main topics of a document. There are few datasets for keyphrase generation in the biomedical domain, and they are too small for training generative models. In this paper, we introduce kp-biomed, the first large-scale biomedical keyphrase generation dataset collected from PubMed abstracts. We train and release several generative models and conduct a series of experiments showing that using large-scale datasets significantly improves performance for present and absent keyphrase generation. The dataset and models are available online.

Cross-lingual and Cross-domain Transfer Learning for Automatic Term Extraction from Low Resource Data
Amir Hazem | Merieme Bouhandi | Florian Boudin | Beatrice Daille
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Automatic Term Extraction (ATE) is a key component for domain knowledge understanding and an important basis for further natural language processing applications. Even with persistent improvements, ATE still exhibits weak results, exacerbated by the small training data inherent to specialized domain corpora. Recently, transformer-based deep neural models, such as BERT, have proven to be effective in many downstream NLP tasks. However, no systematic evaluation of ATE has been conducted so far. In this paper, we run an extensive study on fine-tuning pre-trained BERT models for ATE. We propose strategies that empirically show BERT’s effectiveness using cross-lingual and cross-domain transfer learning to extract single and multi-word terms. Experiments have been conducted on four specialized domains in three languages. The obtained results suggest that BERT can capture cross-domain and cross-lingual terminologically-marked contexts shared by terms, opening a new design pattern for ATE.

2021

Caractérisation des relations sémantiques entre termes multi-mots fondée sur l’analogie (Semantic relations recognition between multi-word terms by means of analogy)
Yizhe Wang | Béatrice Daille | Nabil Hathout
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

The terminology of a domain reflects the structure of the domain through the relations between its terms. In this article, we focus on characterizing the terminological relations that hold between multi-word terms (MWTs) in distributional vector spaces. We built a dataset of French MWTs from the environment domain, linked by lexical semantic relations. We present an experiment in which these semantic relations between MWTs are characterized by means of analogy. The results make it possible to envisage an automatic process to help structure terminologies.

2020

Hierarchical Text Segmentation for Medieval Manuscripts
Amir Hazem | Beatrice Daille | Dominique Stutzmann | Christopher Kermorvant | Louis Chevalier
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we address the segmentation of books of hours, Latin devotional manuscripts of the late Middle Ages, that exhibit challenging issues: a complex hierarchical entangled structure, variable content, noisy transcriptions with no sentence markers, and strong correlations between sections for which topical information is no longer sufficient to draw segmentation boundaries. We show that the main state-of-the-art segmentation methods are either inefficient or inapplicable for books of hours and propose a bottom-up greedy approach that considerably enhances the segmentation results. We stress the importance of such hierarchical segmentation of books of hours for historians who wish to explore their overarching differences and the underlying conceptions of the Church.

Proceedings of the 6th International Workshop on Computational Terminology
Béatrice Daille | Kyo Kageura | Ayla Rigouts Terryn
Proceedings of the 6th International Workshop on Computational Terminology

A study of semantic projection from single word terms to multi-word terms in the environment domain
Yizhe Wang | Beatrice Daille | Nabil Hathout
Proceedings of the 6th International Workshop on Computational Terminology

The semantic projection method is often used in terminology structuring to infer semantic relations between terms. Semantic projection relies upon the assumption of semantic compositionality: the relation that links simple term pairs remains valid in pairs of complex terms built from these simple terms. This paper proposes to investigate whether this assumption commonly adopted in natural language processing is actually valid. First, we describe the process of constructing a list of semantically linked multi-word terms (MWTs) related to the environmental field through the extraction of semantic variants. Second, we present our analysis of the results from the semantic projection. We find that contexts play an essential role in defining the relations between MWTs.

Towards Automatic Thesaurus Construction and Enrichment.
Amir Hazem | Beatrice Daille | Claudia Lanza
Proceedings of the 6th International Workshop on Computational Terminology

Thesaurus construction with minimal human effort often relies on automatic methods to discover terms and their relations. Hence, the quality of a thesaurus heavily depends on the chosen methodologies for: (i) building its content (terminology extraction task) and (ii) designing its structure (semantic similarity task). Automatically constructed thesauri are still less accurate than handcrafted ones, so it is important to highlight the drawbacks of existing methods so that new strategies can build more accurate thesaurus models. In this paper, we provide a systematic analysis of existing methods for both tasks and discuss their feasibility based on an Italian Cybersecurity corpus. In particular, we provide a detailed analysis of how the semantic relationship network of a thesaurus can be built automatically, and investigate ways to enrich the terminological scope of a thesaurus by taking into account the information contained in external domain-oriented semantic sets.

TermEval 2020: TALN-LS2N System for Automatic Term Extraction
Amir Hazem | Mérieme Bouhandi | Florian Boudin | Beatrice Daille
Proceedings of the 6th International Workshop on Computational Terminology

Automatic terminology extraction is a notoriously difficult task that aims to ease the effort of manually identifying terms in domain-specific corpora by automatically providing a ranked list of candidate terms. Approaches to this task fall into four main categories: (i) rule-based approaches, (ii) feature-based approaches, (iii) context-based approaches, and (iv) hybrid approaches. For this first TermEval shared task, we explore a feature-based approach and a deep neural network multitask approach, BERT, that we fine-tune for term extraction. We show that BERT models (RoBERTa for English and CamemBERT for French) outperform other systems for French and English.

The Book of Hours was the bestseller of the late Middle Ages and Renaissance. It is an invaluable historical treasure, documenting the devotional practices of Christians in the late Middle Ages. Up to now, its textual content has been scarcely studied because of its manuscript nature, its length and its complex content. At first glance, it looks too standardized. However, the study of books of hours raises important challenges: (i) in image analysis, their often lavish ornamentation (illegible painted initials, line-fillers, etc.), abbreviated words, and multilingualism are difficult to address in Handwritten Text Recognition (HTR); (ii) their hierarchical entangled structure offers a new field of investigation for text segmentation; (iii) in digital humanities, their textual content gives opportunities for historical analysis. In this paper, we provide the first corpus of books of hours, which consists of Latin transcriptions of 300 books of hours generated by HTR, which is similar to Optical Character Recognition (OCR) but applies to handwritten rather than printed texts. We designed a structural scheme of the book of hours and manually annotated two books of hours according to this scheme. Lastly, we performed a systematic evaluation of the main state-of-the-art text segmentation approaches.

2019

Réutilisation de Textes dans les Manuscrits Anciens (Text Reuse in Ancient Manuscripts)
Amir Hazem | Béatrice Daille | Dominique Stutzmann | Jacob Currie | Christine Jacquin
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

In this article, we address the problem of text reuse in the liturgical books of the Middle Ages. More specifically, we study the textual variations of the Obsecro Te prayer, often found in books of hours. Manual observation of 772 copies of the Obsecro Te revealed more than 21,000 textual variants. In order to extract and categorize them automatically, we first propose a lexico-semantic classification at the word n-gram level, and then report the performance of several state-of-the-art approaches for automatically matching textual variants of the Obsecro Te.

Terminology systematization for Cybersecurity domain in Italian Language
Claudia Lanza | Béatrice Daille
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Terminologie et Intelligence Artificielle (atelier TALN-RECITAL & IC)

This paper presents the first steps taken to improve the quality of the first draft of an Italian thesaurus for Cybersecurity terminology, which was produced for a specific project activity in collaboration with the CybersecurityLab at the Informatics and Telematics Institute (IIT) of the National Council of Research (CNR) in Italy. The paper first focuses on the terminological knowledge base built to retrieve the most representative candidate terms of the Cybersecurity domain in Italian, giving examples of the main gold standard repositories used to build this semantic tool. It then turns to the methodology and software employed to configure a system of NLP rules that yields the desired semantic results and refines the selection of candidate terms to be inserted in the controlled vocabulary.

Transcription automatique et segmentation thématique de livres d’heures manuscrits [Automatic transcription and thematic segmentation of Books of Hours]
Béatrice Daille | Amir Hazem | Christopher Kermorvant | Martin Maarand | Marie-Laurence Bonhomme | Dominique Stutzmann | Jacob Currie | Christine Jacquin
Traitement Automatique des Langues, Volume 60, Numéro 3 : TAL et humanités numériques [NLP and Digital Humanities]

Towards Automatic Variant Analysis of Ancient Devotional Texts
Amir Hazem | Béatrice Daille | Dominique Stutzmann | Jacob Currie | Christine Jacquin
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

In this paper, we address the issue of text reuse in liturgical manuscripts of the Middle Ages. More specifically, we study variant readings of the Obsecro Te prayer, part of the devotional Books of Hours often used by Christians as guidance for their daily prayers. We aim to automatically extract and categorise pairs of words and expressions that exhibit variant relations. For this purpose, we adopt a linguistic classification that characterizes the variants better than edit operations do. Then, we study the evolution of Obsecro Te texts along temporal and geographical axes. Finally, we contrast several unsupervised state-of-the-art approaches for the automatic extraction of Obsecro Te variants. Based on the manual observation of 772 Obsecro Te copies, which show more than 21,000 variants, we show that the proposed methodology is helpful for an automatic study of variants and may serve as a basis for analyzing and extracting useful information from devotional texts.

KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents
Ygor Gallina | Florian Boudin | Beatrice Daille
Proceedings of the 12th International Conference on Natural Language Generation

Keyphrase generation is the task of predicting a set of lexical units that conveys the main content of a source text. Existing datasets for keyphrase generation are only readily available for the scholarly domain and include non-expert annotations. In this paper we present KPTimes, a large-scale dataset of news texts paired with editor-curated keyphrases. Exploring the dataset, we show how editors tag documents, and how their annotations differ from those found in existing datasets. We also train and evaluate state-of-the-art neural keyphrase generation models on KPTimes to gain insights on how well they perform on the news domain. The dataset is available online at https://github.com/ygorg/KPTimes.

2018

Word Embedding Approach for Synonym Extraction of Multi-Word Terms
Amir Hazem | Béatrice Daille
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Towards a Diagnosis of Textual Difficulties for Children with Dyslexia
Solen Quiniou | Béatrice Daille
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Extraction de lexiques bilingues à partir de corpus comparables spécialisés à travers une langue pivot (Bilingual lexicon extraction from specialized comparable corpora using a pivot language)
Alexis Linard | Emmanuel Morin | Béatrice Daille
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Bilingual lexicon extraction from comparable corpora traditionally relies on two languages. Previous work on bilingual lexicon extraction from parallel corpora has shown that using more than two languages can help improve the quality of the extracted alignments. Our work shows that the same strategy can be used for comparable corpora. We defined two original methods involving pivot languages and evaluated them on four languages and two pivot languages in particular. Our experiments showed that when the alignment between the source language and the pivot language is of good quality, the extraction of the target-language lexicon improves.

Modélisation unifiée du document et de son domaine pour une indexation par termes-clés libre et contrôlée (Unified document and domain-specific model for keyphrase extraction and assignment)
Adrien Bougouin | Florian Boudin | Beatrice Daille
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

In this article, we focus on indexing documents from specialized domains by means of their keyphrases. More specifically, we focus on indexing as performed by the indexers of digital libraries. After analyzing the methodology of these professional indexers, we propose a graph-based method that combines the information present in the document with domain knowledge to perform (hybrid) free and controlled indexing. Our method can propose keyphrases that do not necessarily occur in the document. Our experiments also show that our method significantly outperforms the state-of-the-art graph-based approach.

Extraction d’expressions-cibles de l’opinion : de l’anglais au français (Opinion Target Expression extraction : from English to French)
Grégoire Jadi | Laura Monceaux | Vincent Claveau | Béatrice Daille
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

In this article, we present the development of an opinion target expression extraction system for English and its transposition to French. In addition, we studied the effectiveness of features in English and French; the study tends to show that a domain-independent opinion target expression extraction system is feasible. Finally, we provide a comparative analysis of the errors made by our English and French systems and consider different solutions to these problems.

Segmentation automatique d’un texte en rhèses (Automatic segmentation of a text into rhesis)
Victor Pineau | Constance Nin | Solen Quiniou | Béatrice Daille
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

Segmenting a text into rheses, the meaningful member units of a sentence, makes it possible to provide adaptations of the text that make reading easier for people with dyslexia. In this article, we propose a method for automatically identifying rheses based on supervised learning from a corpus that we annotated. We compare it to manual identification as well as to the use of related tools and concepts, such as segmenting a text into chunks.

Keyphrase Annotation with Graph Co-Ranking
Adrien Bougouin | Florian Boudin | Béatrice Daille
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Keyphrase annotation is the task of identifying textual units that represent the main content of a document. Keyphrase annotation is carried out either by extracting the most important phrases from a document (keyphrase extraction) or by assigning entries from a controlled domain-specific vocabulary (keyphrase assignment). Assignment methods are generally more reliable. They provide better-formed keyphrases, as well as keyphrases that do not occur in the document. However, they are often silent, unlike extraction methods, which do not depend on manually built resources. This paper proposes a new method to perform both keyphrase extraction and keyphrase assignment in an integrated and mutually reinforcing manner. Experiments have been carried out on datasets covering different domains of humanities and social sciences. They show statistically significant improvements compared to both keyphrase extraction and keyphrase assignment state-of-the-art methods.

Evaluating Lexical Similarity to build Sentiment Similarity
Grégoire Jadi | Vincent Claveau | Béatrice Daille | Laura Monceaux
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this article, we propose to evaluate the lexical similarity information provided by word representations against several opinion resources using traditional Information Retrieval tools. Word representations have been used to build and to extend opinion resources such as lexicons and ontologies, and their performance has been evaluated on sentiment analysis tasks. We question this method by measuring the correlation between the sentiment proximity provided by opinion resources and the semantic similarity provided by word representations using different correlation coefficients. We also compare the neighbors found in word representations with lists of similar opinion words. Our results show that the proximity of words in state-of-the-art word representations is not very effective for building sentiment similarity.

pdf bib abs

TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation
Adrien Bougouin | Sabine Barreaux | Laurent Romary | Florian Boudin | Béatrice Daille
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Keyphrase extraction is the task of finding phrases that represent the important content of a document. The main aim of keyphrase extraction is to propose textual units that represent the most important topics developed in a document. The output keyphrases of automatic keyphrase extraction methods for test documents are typically evaluated by comparing them to manually assigned reference keyphrases. Each output keyphrase is considered correct if it matches one of the reference keyphrases. However, the choice of the appropriate textual unit (keyphrase) for a topic is sometimes subjective, and evaluating by exact matching underestimates performance. This paper presents a dataset of evaluation scores assigned to automatically extracted keyphrases by human evaluators. Along with the reference keyphrases, the manual evaluations can be used to validate new evaluation measures. Indeed, an evaluation measure that is highly correlated with the manual evaluation is appropriate for the evaluation of automatic keyphrase extraction methods.

pdf bib abs

Bilingual Lexicon Extraction at the Morpheme Level Using Distributional Analysis
Amir Hazem | Béatrice Daille
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Bilingual lexicon extraction from comparable corpora is usually based on distributional methods when dealing with single word terms (SWT). These methods often treat SWT as single tokens without considering their compositional property. However, many SWT are compositional (composed of roots and affixes), and this information, if taken into account, can be very useful for matching translation pairs, especially for infrequent terms where distributional methods often fail. For instance, the English compound xenograft, which is composed of the root xeno and the lexeme graft, can be translated into French compositionally by aligning each of its elements (xeno with xéno and graft with greffe), resulting in the translation xénogreffe. In this paper, we experiment with several distributional models at the morpheme level, which we apply to perform compositional translation on a subset of French and English compounds. We show promising results using distributional analysis at the root and affix levels. We also show that the adapted approach significantly improves bilingual lexicon extraction from comparable corpora compared to the approach at the word level.

pdf bib abs

Ambiguity Diagnosis for Terms in Digital Humanities
Béatrice Daille | Evelyne Jacquey | Gaël Lejeune | Luis Felipe Melo | Yannick Toussaint
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Among all the research dedicated to terminology and word sense disambiguation, little attention has been devoted to the ambiguity of term occurrences. Even if a lexical unit is indeed a term of the domain, it is not true, even in a specialised corpus, that all its occurrences are terminological. Some occurrences are terminological and others are not. Thus, a global decision at the corpus level about the terminological status of all occurrences of a lexical unit would be erroneous. In this paper, we propose three original methods to characterise the ambiguity of term occurrences in the domain of social sciences for French. These methods model the context of the term occurrences differently: the first relies on text mining, the second is based on textometry, and the third focuses on text genre properties. The experimental results show the potential of the proposed approaches and provide an opportunity to discuss their hybridisation.

pdf bib

Terminology Extraction with Term Variant Detection
Damien Cram | Béatrice Daille
Proceedings of ACL-2016 System Demonstrations

2015

pdf bib abs

Extraction de Contextes Riches en Connaissances en corpus spécialisés
Firas Hmida | Emmanuel Morin | Béatrice Daille
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Terminology banks and dictionaries are valuable resources that facilitate access to knowledge in specialized domains. These resources are often rather sparse and do not always offer, for a given term, examples that convey its meaning and usage. In this context, we propose to apply the notion of Knowledge-Rich Contexts (Contextes Riches en Connaissances, CRC) to extract, directly from specialized corpora, example contexts that illustrate the term's usage. We define a unified framework that exploits both knowledge patterns and collocations, with a quality acceptable for human revision.

pdf bib abs

Vers un diagnostic d’ambiguïté des termes candidats d’un texte
Gaël Lejeune | Béatrice Daille
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Research on word sense disambiguation addresses the question of which sense to assign to the different occurrences of a word or, more broadly, of a lexical unit. In this article, we focus on the ambiguity of a term in a specialized domain. We lay the groundwork for our research on a related question that we call ambiguity diagnosis. This task consists in deciding whether or not an occurrence of a term is ambiguous. We implement a supervised learning approach that exploits a corpus of humanities articles written in French in which ambiguous terms have been detected by experts. The diagnosis relies on two types of features: syntactic and positional. We show the value of text structure for establishing the ambiguity diagnosis.

pdf bib

Méthode semi-compositionnelle pour l’extraction de synonymes des termes complexes [Semi-compositional method for synonym extraction of complex terms]
Amir Hazem | Béatrice Daille
Traitement Automatique des Langues, Volume 56, Numéro 2 : Sémantique distributionnelle [Distributional semantics]

pdf bib

Attempting to Bypass Alignment from Comparable Corpora via Pivot Language
Alexis Linard | Béatrice Daille | Emmanuel Morin
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

2014

pdf bib

The impact of domains for Keyphrase extraction (Influence des domaines de spécialité dans l’extraction de termes-clés) [in French]
Adrien Bougouin | Florian Boudin | Béatrice Daille
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib abs

Semi-compositional Method for Synonym Extraction of Multi-Word Terms
Béatrice Daille | Amir Hazem
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Automatic extraction of synonyms and semantically related words is a challenging task, useful in many NLP applications such as question answering, search query expansion, and text summarization. While different studies have addressed the task of word synonym extraction, only a few investigations have tackled the problem of acquiring synonyms of multi-word terms (MWT) from specialized corpora. To extract pairs of synonyms of multi-word terms, we propose in this paper an unsupervised semi-compositional method that makes use of distributional semantics and exploits the compositional property shared by most MWT. We show that our method significantly outperforms the state of the art.

pdf bib

Splitting of Compound Terms in non-Prototypical Compounding Languages
Elizaveta Clouet | Béatrice Daille
Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)

2013

pdf bib

Identification, Alignment, and Translation of Relational Adjectives from Comparable Corpora (Identification, alignement, et traductions des adjectifs relationnels en corpus comparables) [in French]
Rima Harastani | Beatrice Daille | Emmanuel Morin
Proceedings of TALN 2013 (Volume 1: Long Papers)

pdf bib

Multilingual Compound Splitting (Segmentation Multilingue des Mots Composés) [in French]
Elizaveta Loginova-Clouet | Béatrice Daille
Proceedings of TALN 2013 (Volume 2: Short Papers)

pdf bib

Apopsis Demonstrator for Tweet Analysis (Démonstrateur Apopsis pour l’analyse des tweets) [in French]
Sebastián Peña Saldarriaga | Damien Vintache | Béatrice Daille
Proceedings of TALN 2013 (Volume 3: System Demonstrations)

pdf bib

TTC TermSuite - Terminological Alignment from Comparable Corpora (TTC TermSuite alignement terminologique à partir de corpus comparables) [in French]
Béatrice Daille | Rima Harastani
Proceedings of TALN 2013 (Volume 3: System Demonstrations)

pdf bib

Ranking Translation Candidates Acquired from Comparable Corpora
Rima Harastani | Béatrice Daille | Emmanuel Morin
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib

TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction
Adrien Bougouin | Florian Boudin | Béatrice Daille
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

pdf bib abs

Identification of Fertile Translations in Comparable Corpora: A Morpho-Compositional Approach
Estelle Delpech | Béatrice Daille | Emmanuel Morin | Claire Lemaire
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

This paper defines a method for lexicon extraction in the biomedical domain from comparable corpora. The method is based on compositional translation and exploits morpheme-level translation equivalences. It can generate translations for a large variety of morphologically constructed words and can also generate 'fertile' translations. We show that fertile translations increase the overall quality of the extracted lexicon for English to French translation.

pdf bib

Extraction of Domain-Specific Bilingual Lexicon from Comparable Corpora: Compositional Translation and Ranking
Estelle Delpech | Béatrice Daille | Emmanuel Morin | Claire Lemaire
Proceedings of COLING 2012

pdf bib

Revising the Compositional Method for Terminology Acquisition from Comparable Corpora
Emmanuel Morin | Béatrice Daille
Proceedings of COLING 2012

pdf bib

Compositionnalité et contextes issus de corpus comparables pour la traduction terminologique (Compositionality and Context for Bilingual Lexicon Extraction from Comparable Corpora) [in French]
Emmanuel Morin | Béatrice Daille
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

2011

pdf bib abs

Identifier la cible d’un passage d’opinion dans un corpus multithématique (Identifying the target of an opinion transition in a thematic corpus)
Matthieu Vernier | Laura Monceaux | Béatrice Daille
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Identifying the target of an opinion has recently attracted attention in opinion mining. Existing methods have been tested on single-topic corpora in English. They mainly handle cases where the target is located in the same sentence as the opinion. In this article, we address this problem for French in a multi-topic corpus and present a new method for identifying the target of an opinion that appears outside the sentence context. The evaluation of the method shows improved results over existing approaches.

pdf bib

TTC TermSuite : une chaîne de traitement pour la fouille terminologique multilingue (TTC TermSuite: a processing chain for multilingual terminology mining)
Béatrice Daille | Christine Jacquin | Laura Monceaux | Emmanuel Morin | Jérome Rocheteau
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

bib

Traitement Automatique des Langues, Volume 52, Numéro 1 : Varia [Varia]
Éric Villemonte de La Clergerie | Béatrice Daille | Yves Lepage | François Yvon
Traitement Automatique des Langues, Volume 52, Numéro 1 : Varia [Varia]

pdf bib

Reduction of Search Space to Annotate Monolingual Corpora
Prajol Shrestha | Christine Jacquin | Beatrice Daille
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib

TTC TermSuite - A UIMA Application for Multilingual Terminology Extraction from Comparable Corpora
Jérôme Rocheteau | Béatrice Daille
Proceedings of the IJCNLP 2011 System Demonstrations

2010

pdf bib abs

Learning Subjectivity Phrases missing from Resources through a Large Set of Semantic Tests
Matthieu Vernier | Laura Monceaux | Béatrice Daille
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In recent years, blogs and social networks have particularly boosted interest in opinion mining research. In order to satisfy real-scale application needs, a main task is to create or enhance lexical and semantic resources on evaluative language. Classical resources in the area are mostly built for English; they contain simple opinion word markers and are far from covering the lexical richness of this linguistic phenomenon. In particular, infrequent subjective words, idiomatic expressions, and cultural stereotypes are missing from resources. We propose a new method, applied to French, to automatically enhance an opinion word lexicon. This learning method relies on the linguistic usage of internet users and on semantic tests to infer the degree of subjectivity of many new adjectives, nouns, verbs, noun phrases, and verbal phrases that are usually overlooked by other resources. The final appraisal lexicon contains 3,456 entries. We evaluate the lexicon enhancement with and without textual context.

pdf bib

UNPMC: Naive Approach to Extract Keyphrases from Scientific Articles
Jungyeul Park | Jong Gun Lee | Béatrice Daille
Proceedings of the 5th International Workshop on Semantic Evaluation

2009

pdf bib abs

Catégorisation sémantico-discursive des évaluations exprimées dans la blogosphère
Matthieu Vernier | Laura Monceaux | Béatrice Daille | Estelle Dubreil
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Blogs are an ideal source of observations for applications related to opinion mining. However, they raise new issues and new challenges for the traditional methods of the field. We therefore propose an automatic method for detecting and categorizing evaluations expressed locally in a multi-domain blog corpus. The method accounts for the specific features of evaluative language described in two linguistic theories. The tool, developed within the UIMA platform, aims on the one hand to automatically build a grammar of evaluative language, and on the other hand to use this grammar to detect and categorize the evaluative passages of a text. The categorization addresses in particular the axiological aspect of the evaluation, its enunciative configuration, and its modality in discourse.

pdf bib

Analyse conjointe du signal sonore et de sa transcription pour l’identification nommée de locuteurs [Joint signal and transcription analysis for named speaker identification]
Vincent Jousse | Sylvain Meignier | Christine Jacquin | Simon Petitrenaud | Yannick Estève | Béatrice Daille
Traitement Automatique des Langues, Volume 50, Numéro 1 : Varia [Varia]

pdf bib

Compilation of Specialized Comparable Corpora in French and Japanese
Lorraine Goeuriot | Emmanuel Morin | Béatrice Daille
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

2008

pdf bib

An Effective Compositional Model for Lexical Alignment
Béatrice Daille | Emmanuel Morin
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib abs

A Multi-Word Term Extraction Program for Arabic Language
Siham Boulaknadel | Beatrice Daille | Driss Aboutajdine
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Terminology extraction commonly includes two steps: identification of term-like units in the texts, mostly multi-word phrases, and ranking of the extracted term-like units according to their domain representativity. In this paper, we design a multi-word term extraction program for the Arabic language. The linguistic filtering performs a morphosyntactic analysis and takes into account several types of variations. The domain representativity is measured using statistical scores. We evaluate several association measures and show that the results we obtained are consistent with those obtained for Romance languages.

pdf bib abs

Characterization of Scientific and Popular Science Discourse in French, Japanese and Russian
Lorraine Goeuriot | Natalia Grabar | Béatrice Daille
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We aim to characterize the comparability of corpora and address this issue in a trilingual context through the distinction between expert and non-expert documents. We work separately with corpora composed of documents from the medical domain in three languages (French, Japanese and Russian) that are linguistically distant from one another. In our approach, documents are characterized in each language by their topic and by a discursive typology positioned at three levels of document analysis: structural, modal and lexical. The document typology is implemented with two learning algorithms (SVMlight and C4.5). Evaluation of the results shows that the proposed discursive typology can be transposed from one language to another, as it indeed makes it possible to distinguish the two targeted discourses (science and popular science). However, we observe that performance varies considerably across languages, algorithms and types of discursive characteristics.

2007

pdf bib abs

Caractérisation des discours scientifiques et vulgarisés en français, japonais et russe
Lorraine Goeuriot | Natalia Grabar | Béatrice Daille
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

The main objective of our work is to study the notion of corpus comparability, and we address this question in a monolingual context by seeking to distinguish scientific documents from popular science documents. We work separately on corpora composed of documents from the medical domain in three linguistically distant languages (French, Japanese and Russian). In our approach, documents are characterized in each language by their topic and by a discursive typology situated at three levels of document analysis: structural, modal and lexical. Document typing is implemented with two learning algorithms (SVMlight and C4.5). The evaluation of the results shows that the proposed discursive typology is portable from one language to another, since it does make it possible to distinguish the two discourses. We nevertheless observe widely varying performance depending on the languages, the algorithms and the types of discursive features.

pdf bib

Bilingual Terminology Mining - Using Brain, not brawn comparable corpora
Emmanuel Morin | Béatrice Daille | Koichi Takeuchi | Kyo Kageura
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

pdf bib

Comparabilité de corpus et fouille terminologique multilingue [Corpus comparability and multilingual terminology Mining]
Emmanuel Morin | Béatrice Daille
Traitement Automatique des Langues, Volume 47, Numéro 1 : Varia [Varia]

pdf bib

Une architecture de services pour mieux spécialiser les processus d’acquisition terminologique [A service architecture for better specialization of terminology acquisition processes]
Farid Cerbah | Béatrice Daille
Traitement Automatique des Langues, Volume 47, Numéro 3 : Varia [Varia]

2005

pdf bib

French-English Terminology Extraction from Comparable Corpora
Béatrice Daille | Emmanuel Morin
Second International Joint Conference on Natural Language Processing: Full Papers

2004

pdf bib abs

Extraction de terminologies bilingues à partir de corpus comparables
Emmanuel Morin | Samuel Dufour-Kowalski | Béatrice Daille
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents a method for extracting, from comparable corpora in a specialized domain, a bilingual lexicon containing single-word and multi-word terms. The method first extracts multi-word terms in each language, then aligns them using statistical methods that exploit the context of the terms. After recalling the difficulties raised by aligning multi-word terms and describing our approach, we present the bilingual terminology extraction process adopted and the resources used in our experiments. Finally, we evaluate our approach and demonstrate its value, in particular for aligning non-compositional multi-word terms.

pdf bib

French-English Multi-word Term Alignment Based on Lexical Context Analysis
Béatrice Daille | Samuel Dufour-Kowalski | Emmanuel Morin
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib

Construction of Grammar Based Term Extraction Model for Japanese
Koichi Takeuchi | Kyo Kageura | Béatrice Daille | Laurent Romary
Proceedings of CompuTerm 2004: 3rd International Workshop on Computational Terminology

Collocations are of interest in many NLP applications such as text analysis or generation, as well as monolingual or bilingual lexicography. The first attempts at automatically extracting collocations from texts or dictionaries appeared in the 1970s. They were mainly methods based on lexical statistics. Today, automatic identification methods still rely on statistics, but combine them with linguistic analyses. We examine several methods for identifying collocations in corpora, highlighting for each method the linguistic properties of collocations that have been taken into account.