Thomas Lavergne


2021

pdf bib
Differential Evaluation: a Qualitative Analysis of Natural Language Processing System Behavior Based Upon Data Resistance to Processing
Lucie Gianola | Hicham El Boukkouri | Cyril Grouin | Thomas Lavergne | Patrick Paroubek | Pierre Zweigenbaum
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

2020

pdf bib
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Hicham El Boukkouri | Olivier Ferret | Thomas Lavergne | Hiroshi Noji | Pierre Zweigenbaum | Jun’ichi Tsujii
Proceedings of the 28th International Conference on Computational Linguistics

Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level, and open-vocabulary representations.

pdf bib
Handling Entity Normalization with no Annotated Corpus: Weakly Supervised Methods Based on Distributional Representation and Ontological Information
Arnaud Ferré | Robert Bossy | Mouhamadou Ba | Louise Deléger | Thomas Lavergne | Pierre Zweigenbaum | Claire Nédellec
Proceedings of the 12th Language Resources and Evaluation Conference

Entity normalization (or entity linking) is an important subtask of information extraction that links entity mentions in text to categories or concepts in a reference vocabulary. Machine learning based normalization methods have good adaptability as long as they have enough training data per reference with a sufficient quality. Distributional representations are commonly used because of their capacity to handle different expressions with similar meanings. However, in specific technical and scientific domains, the small amount of training data and the relatively small size of specialized corpora remain major challenges. Recently, the machine learning-based CONTES method has addressed these challenges for reference vocabularies that are ontologies, as is often the case in life sciences and biomedical domains. And yet, its performance is dependent on manually annotated corpus. Furthermore, like other machine learning based methods, parametrization remains tricky. We propose a new approach to address the scarcity of training data that extends the CONTES method by corpus selection, pre-processing and weak supervision strategies, which can yield high-performance results without any manually annotated examples. We also study which hyperparameters are most influential, with sometimes different patterns compared to previous work. The results show that our approach significantly improves accuracy and outperforms previous state-of-the-art algorithms.

2019

pdf bib
Embedding Strategies for Specialized Domains: Application to Clinical Entity Recognition
Hicham El Boukkouri | Olivier Ferret | Thomas Lavergne | Pierre Zweigenbaum
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Using pre-trained word embeddings in conjunction with Deep Learning models has become the “de facto” approach in Natural Language Processing (NLP). While this usually yields satisfactory results, off-the-shelf word embeddings tend to perform poorly on texts from specialized domains such as clinical reports. Moreover, training specialized word representations from scratch is often either impossible or ineffective due to the lack of large enough in-domain data. In this work, we focus on the clinical domain for which we study embedding strategies that rely on general-domain resources only. We show that by combining off-the-shelf contextual embeddings (ELMo) with static word2vec embeddings trained on a small in-domain corpus built from the task data, we manage to reach and sometimes outperform representations learned from a large corpus in the medical domain.

2018

pdf bib
Detecting context-dependent sentences in parallel corpora
Rachel Bawden | Thomas Lavergne | Sophie Rosset
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

In this article, we provide several approaches to the automatic identification of parallel sentences that require sentence-external linguistic context to be correctly translated. Our long-term goal is to automatically construct a test set of context-dependent sentences in order to evaluate machine translation models designed to improve the translation of contextual, discursive phenomena. We provide a discussion and critique that show that current approaches do not allow us to achieve our goal, and suggest that for now evaluating individual phenomena is likely the best solution.

pdf bib
Corpora with Part-of-Speech Annotations for Three Regional Languages of France: Alsatian, Occitan and Picard
Delphine Bernhard | Anne-Laure Ligozat | Fanny Martin | Myriam Bras | Pierre Magistry | Marianne Vergez-Couret | Lucie Steiblé | Pascale Erhart | Nabil Hathout | Dominique Huck | Christophe Rey | Philippe Reynés | Sophie Rosset | Jean Sibille | Thomas Lavergne
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Learning the Structure of Variable-Order CRFs: a finite-state perspective
Thomas Lavergne | François Yvon
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The computational complexity of linear-chain Conditional Random Fields (CRFs) makes it difficult to deal with very large label sets and long range dependencies. Such situations are not rare and arise when dealing with morphologically rich languages or joint labelling tasks. We extend here recent proposals to consider variable order CRFs. Using an effective finite-state representation of variable-length dependencies, we propose new ways to perform feature selection at large scale and report experimental results where we outperform strong baselines on a tagging task.

pdf bib
Détection de concepts et granularité de l’annotation (Concept detection and annotation granularity )
Pierre Zweigenbaum | Thomas Lavergne
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Nous nous intéressons ici à une tâche de détection de concepts dans des textes sans exigence particulière de passage par une phase de détection d’entités avec leurs frontières. Il s’agit donc d’une tâche de catégorisation de textes multiétiquette, avec des jeux de données annotés au niveau des textes entiers. Nous faisons l’hypothèse qu’une annotation à un niveau de granularité plus fin, typiquement au niveau de l’énoncé, devrait améliorer la performance d’un détecteur automatique entraîné sur ces données. Nous examinons cette hypothèse dans le cas de textes courts particuliers : des certificats de décès où l’on cherche à reconnaître des diagnostics, avec des jeux de données initialement annotés au niveau du certificat entier. Nous constatons qu’une annotation au niveau de la « ligne » améliore effectivement les résultats, mais aussi que le simple fait d’appliquer au niveau de la ligne un classifieur entraîné au niveau du texte est déjà une source d’amélioration.

pdf bib
Traitement automatique de la langue biomédicale au LIMSI (Biomedical language processing at LIMSI)
Christopher Norman | Cyril Grouin | Thomas Lavergne | Aurélie Névéol | Pierre Zweigenbaum
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 - Démonstrations

Nous proposons des démonstrations de trois outils développés par le LIMSI en traitement automatique des langues appliqué au domaine biomédical : la détection de concepts médicaux dans des textes courts, la catégorisation d’articles scientifiques pour l’assistance à l’écriture de revues systématiques, et l’anonymisation de textes cliniques.

2016

pdf bib
LIMSI@WMT’16: Machine Translation of News
Alexandre Allauzen | Lauriane Aufrant | Franck Burlot | Ophélie Lacroix | Elena Knyazeva | Thomas Lavergne | Guillaume Wisniewski | François Yvon
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
The QT21/HimL Combined Machine Translation System
Jan-Thorsten Peter | Tamer Alkhouli | Hermann Ney | Matthias Huck | Fabienne Braune | Alexander Fraser | Aleš Tamchyna | Ondřej Bojar | Barry Haddow | Rico Sennrich | Frédéric Blain | Lucia Specia | Jan Niehues | Alex Waibel | Alexandre Allauzen | Lauriane Aufrant | Franck Burlot | Elena Knyazeva | Thomas Lavergne | François Yvon | Mārcis Pinnis | Stella Frank
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
A Dataset for ICD-10 Coding of Death Certificates: Creation and Usage
Thomas Lavergne | Aurélie Névéol | Aude Robert | Cyril Grouin | Grégoire Rey | Pierre Zweigenbaum
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

Very few datasets have been released for the evaluation of diagnosis coding with the International Classification of Diseases, and only one so far in a language other than English. This paper describes a large-scale dataset prepared from French death certificates, and the problems which needed to be solved to turn it into a dataset suitable for the application of machine learning and natural language processing methods of ICD-10 coding. The dataset includes the free-text statements written by medical doctors, the associated meta-data, the human coder-assigned codes for each statement, as well as the statement segments which supported the coder’s decision for each code. The dataset comprises 93,694 death certificates totalling 276,103 statements and 377,677 ICD-10 code assignments (3,457 unique codes). It was made available for an international automated coding shared task, which attracted five participating teams. An extended version of the dataset will be used in a new edition of the shared task.

pdf bib
Supervised classification of end-of-lines in clinical text with no manual annotation
Pierre Zweigenbaum | Cyril Grouin | Thomas Lavergne
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

In some plain text documents, end-of-line marks may or may not mark the boundary of a text unit (e.g., of a paragraph). This vexing problem is likely to impact subsequent natural language processing components, but is seldom addressed in the literature. We propose a method which uses no manual annotation to classify whether end-of-lines must actually be seen as simple spaces (soft line breaks) or as true text unit boundaries. This method, which includes self-training and co-training steps based on token and line length features, achieves 0.943 F-measure on a corpus of short e-books with controlled format, F=0.904 on a random sample of 24 clinical texts with soft line breaks, and F=0.898 on a larger set of mixed clinical texts which may or may not contain soft line breaks, a fairly high value for a method with no manual annotation.

pdf bib
Hybrid methods for ICD-10 coding of death certificates
Pierre Zweigenbaum | Thomas Lavergne
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

pdf bib
Une catégorisation de fins de lignes non-supervisée (End-of-line classification with no supervision)
Pierre Zweigenbaum | Cyril Grouin | Thomas Lavergne
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

Dans certains textes bruts, les marques de fin de ligne peuvent marquer ou pas la frontière d’une unité textuelle (typiquement un paragraphe). Ce problème risque d’influencer les traitements subséquents, mais est rarement traité dans la littérature. Nous proposons une méthode entièrement non-supervisée pour déterminer si une fin de ligne doit être vue comme un simple espace ou comme une véritable frontière d’unité textuelle, et la testons sur un corpus de comptes rendus médicaux. Cette méthode obtient une F-mesure de 0,926 sur un échantillon de 24 textes contenant des lignes repliées. Appliquée sur un échantillon plus grand de textes contenant ou pas des lignes repliées, notre méthode la plus prudente obtient une F-mesure de 0,898, valeur élevée pour une méthode entièrement non-supervisée.

2015

pdf bib
Oublier ce qu’on sait, pour mieux apprendre ce qu’on ne sait pas : une étude sur les contraintes de type dans les modèles CRF
Nicolas Pécheux | Alexandre Allauzen | Thomas Lavergne | Guillaume Wisniewski | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Quand on dispose de connaissances a priori sur les sorties possibles d’un problème d’étiquetage, il semble souhaitable d’inclure cette information lors de l’apprentissage pour simplifier la tâche de modélisation et accélérer les traitements. Pourtant, même lorsque ces contraintes sont correctes et utiles au décodage, leur utilisation lors de l’apprentissage peut dégrader sévèrement les performances. Dans cet article, nous étudions ce paradoxe et montrons que le manque de contraste induit par les connaissances entraîne une forme de sous-apprentissage qu’il est cependant possible de limiter.

pdf bib
Etiquetage morpho-syntaxique en domaine de spécialité: le domaine médical
Christelle Rabary | Thomas Lavergne | Aurélie Névéol
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

L’étiquetage morpho-syntaxique est une tâche fondamentale du Traitement Automatique de la Langue, sur laquelle reposent souvent des traitements plus complexes tels que l’extraction d’information ou la traduction automatique. L’étiquetage en domaine de spécialité est limité par la disponibilité d’outils et de corpus annotés spécifiques au domaine. Dans cet article, nous présentons le développement d’un corpus clinique du français annoté morpho-syntaxiquement à l’aide d’un jeu d’étiquettes issus des guides d’annotation French Treebank et Multitag. L’analyse de ce corpus nous permet de caractériser le domaine clinique et de dégager les points clés pour l’adaptation d’outils d’analyse morpho-syntaxique à ce domaine. Nous montrons également les limites d’un outil entraîné sur un corpus journalistique appliqué au domaine clinique. En perspective de ce travail, nous envisageons une application du corpus clinique annoté pour améliorer l’étiquetage morpho-syntaxique des documents cliniques en français.

pdf bib
LIMSI@WMT’15 : Translation Task
Benjamin Marie | Alexandre Allauzen | Franck Burlot | Quoc-Khanh Do | Julia Ive | Elena Knyazeva | Matthieu Labeau | Thomas Lavergne | Kevin Löser | Nicolas Pécheux | François Yvon
Proceedings of the Tenth Workshop on Statistical Machine Translation

2014

pdf bib
Automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on Luxembourgish
Thomas Lavergne | Gilles Adda | Martine Adda-Decker | Lori Lamel
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Luxembourgish, embedded in a multilingual context on the divide between Romance and Germanic cultures, remains one of Europe’s under-described languages. This is due to the fact that the written production remains relatively low, and linguistic knowledge and resources, such as lexica and pronunciation dictionaries, are sparse. The speakers or writers will frequently switch between Luxembourgish, German, and French, on a per-sentence basis, as well as on a sub-sentence level. In order to build resources like lexicons, and especially pronunciation lexicons, or language models needed for natural language processing tasks such as automatic speech recognition, language used in text corpora should be identified. In this paper, we present the design of a manually annotated corpus of mixed language sentences as well as the tools used to select these sentences. This corpus of difficult sentences was used to test a word-based language identification system. This language identification system was used to select textual data extracted from the web, in order to build a lexicon and language models. This lexicon and language model were used in an Automatic Speech Recognition system for the Luxembourgish language which obtain a 25\% WER on the Quaero development data.

pdf bib
LIMSI @ WMT’14 Medical Translation Task
Nicolas Pécheux | Li Gong | Quoc Khanh Do | Benjamin Marie | Yulia Ivanishcheva | Alexander Allauzen | Thomas Lavergne | Jan Niehues | Aurélien Max | François Yvon
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Optimizing annotation efforts to build reliable annotated corpora for training statistical models
Cyril Grouin | Thomas Lavergne | Aurélie Névéol
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

2013

pdf bib
LIMSI’s participation to the 2013 shared task on Native Language Identification
Thomas Lavergne | Gabriel Illouz | Aurélien Max | Ryo Nagata
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
LIMSI @ WMT13
Alexander Allauzen | Nicolas Pécheux | Quoc Khanh Do | Marco Dinarelli | Thomas Lavergne | Aurélien Max | Hai-Son Le | François Yvon
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf bib
Automatic Named Entity Pre-annotation for Out-of-domain Human Annotation
Sophie Rosset | Cyril Grouin | Thomas Lavergne | Mohamed Ben Jannet | Jérémy Leixa | Olivier Galibert | Pierre Zweigenbaum
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

pdf bib
A fully discriminative training framework for Statistical Machine Translation (Un cadre d’apprentissage intégralement discriminant pour la traduction statistique) [in French]
Thomas Lavergne | Alexandre Allauzen | François Yvon
Proceedings of TALN 2013 (Volume 1: Long Papers)

2012

pdf bib
Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier
Souhir Gahbiche-Braham | Hélène Bonneau-Maynard | Thomas Lavergne | François Yvon
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Arabic is a morphologically rich language, and Arabic texts abound of complex word forms built by concatenation of multiple subparts, corresponding for instance to prepositions, articles, roots prefixes, or suffixes. The development of Arabic Natural Language Processing applications, such as Machine Translation (MT) tools, thus requires some kind of morphological analysis. In this paper, we compare various strategies for performing such preprocessing, using generic machine learning techniques. The resulting tool is compared with two open domain alternatives in the context of a statistical MT task and is shown to be faster than its competitors, with no significant difference in MT quality.

pdf bib
Joint WMT 2012 Submission of the QUAERO Project
Markus Freitag | Stephan Peitz | Matthias Huck | Hermann Ney | Jan Niehues | Teresa Herrmann | Alex Waibel | Hai-son Le | Thomas Lavergne | Alexandre Allauzen | Bianka Buschbeck | Josep Maria Crego | Jean Senellart
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
LIMSI @ WMT12
Hai-Son Le | Thomas Lavergne | Alexandre Allauzen | Marianna Apidianaki | Li Gong | Aurélien Max | Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
Repérage des entités nommées pour l’arabe : adaptation non-supervisée et combinaison de systèmes (Named Entity Recognition for Arabic : Unsupervised adaptation and Systems combination) [in French]
Souhir Gahbiche-Braham | Hélène Bonneau-Maynard | Thomas Lavergne | François Yvon
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

2011

pdf bib
LIMSI’s experiments in domain adaptation for IWSLT11
Thomas Lavergne | Alexandre Allauzen | Hai-Son Le | François Yvon
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

LIMSI took part in the IWSLT 2011 TED task in the MT track for English to French using the in-house n-code system, which implements the n-gram based approach to Machine Translation. This framework not only allows to achieve state-of-the-art results for this language pair, but is also appealing due to its conceptual simplicity and its use of well understood statistical language models. Using this approach, we compare several ways to adapt our existing systems and resources to the TED task with mixture of language models and try to provide an analysis of the modest gains obtained by training a log linear combination of inand out-of-domain models.

pdf bib
Advances on spoken language translation in the Quaero program
Karim Boudahmane | Bianka Buschbeck | Eunah Cho | Josep Maria Crego | Markus Freitag | Thomas Lavergne | Hermann Ney | Jan Niehues | Stephan Peitz | Jean Senellart | Artem Sokolov | Alex Waibel | Tonio Wandmacher | Joern Wuebker | François Yvon
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

The Quaero program is an international project promoting research and industrial innovation on technologies for automatic analysis and classification of multimedia and multilingual documents. Within the program framework, research organizations and industrial partners collaborate to develop prototypes of innovating applications and services for access and usage of multimedia data. One of the topics addressed is the translation of spoken language. Each year, a project-internal evaluation is conducted by DGA to monitor the technological advances. This work describes the design and results of the 2011 evaluation campaign. The participating partners were RWTH, KIT, LIMSI and SYSTRAN. Their approaches are compared on both ASR output and reference transcripts of speech data for the translation between French and German. The results show that the developed techniques further the state of the art and improve translation quality.

pdf bib
LIMSI @ WMT11
Alexandre Allauzen | Hélène Bonneau-Maynard | Hai-Son Le | Aurélien Max | Guillaume Wisniewski | François Yvon | Gilles Adda | Josep Maria Crego | Adrien Lardilleux | Thomas Lavergne | Artem Sokolov
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
From n-gram-based to CRF-based Translation Models
Thomas Lavergne | Alexandre Allauzen | Josep Maria Crego | François Yvon
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

pdf bib
Practical Very Large Scale CRFs
Thomas Lavergne | Olivier Cappé | François Yvon
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

2009

pdf bib
Introduction of a new paraphrase generation tool based on Monte-Carlo sampling
Jonathan Chevelu | Thomas Lavergne | Yves Lepage | Thierry Moudenc
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Search
Co-authors