Antoine Doucet


2022

IJS at TextGraphs-16 Natural Language Premise Selection Task: Will Contextual Information Improve Natural Language Premise Selection?
Thi Hong Hanh Tran | Matej Martinc | Antoine Doucet | Senja Pollak
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

Natural Language Premise Selection (NLPS) is a mathematical Natural Language Processing (NLP) task that retrieves a set of relevant premises to help the end user find the proof of a particular statement. In this research, we evaluate the impact of Transformer-based contextual information and different fundamental similarity scores on NLPS. The results demonstrate that the contextual representation, despite not being pretrained in the mathematical domain, captures meaningful information better than the statistical approach (e.g., TF-IDF), with a boost of around 3.00% MAP@500.

Fine-tuning de modèles de langues pour la veille épidémiologique multilingue avec peu de ressources (Fine-tuning Language Models for Low-resource Multilingual Epidemic Surveillance)
Stephen Mutuvi | Emanuela Boros | Antoine Doucet | Adam Jatowt | Gaël Lejeune | Moses Odeo
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Pre-trained language models have been highly successful in NLP, particularly when sufficient training data is available. However, it remains difficult to obtain similar results in multilingual settings with little training data, especially in specialized domains such as epidemic surveillance. In this paper, we explore several hypotheses about the factors that could influence the performance of an epidemiological event extraction system in a low-resource multilingual scenario: the type of pre-trained model, the quality of the tokenizer, and the characteristics of the entities to be extracted. We provide an exhaustive analysis of these factors and observe a strong, though variable, correlation between these characteristics and the performance observed on a low-resource multilingual epidemic surveillance task. We also propose adapting the language models to this task by extending the vocabulary of the pre-trained tokenizer with continued entities, that is, entities that have been split into several subwords. Following this adaptation, we observe a notable improvement in performance for most of the models and languages evaluated.

L’importance des entités pour la tâche de détection d’événements en tant que système de question-réponse (Exploring Entities in Event Detection as Question Answering)
Emanuela Boros | Jose Moreno | Antoine Doucet
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

In this paper, we address a recent and little-studied paradigm for event detection by casting it as a question-answering problem that allows multiple answers and is supported by entities. The event trigger extraction task is thus transformed into the task of identifying answer spans in a context, while also focusing on the surrounding entities. The architecture is based on a fine-tuned pre-trained language model, in which the input context is augmented with entities marked at different levels, their positions, their types and, finally, their argument roles. Our experiments on the ACE 2005 corpus demonstrate that the proposed model correctly exploits entity information for event detection and that it is a viable solution for this task. Moreover, we demonstrate that our method, with different entity markers, is particularly capable of extracting unseen event types in few-shot learning settings.

L3i at SemEval-2022 Task 11: Straightforward Additional Context for Multilingual Named Entity Recognition
Emanuela Boros | Carlos-Emiliano González-Gallardo | Jose Moreno | Antoine Doucet
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper summarizes the participation of the L3i laboratory of the University of La Rochelle in the SemEval-2022 Task 11, Multilingual Complex Named Entity Recognition (MultiCoNER). The task focuses on detecting semantically ambiguous and complex entities in short and low-context monolingual and multilingual settings. We argue that using a language-specific and a multilingual language model could improve the performance of multilingual and mixed NER. Also, we consider that using additional contexts from the training set could improve the performance of a NER on short texts. Thus, we propose a straightforward technique for generating additional contexts with and without the presence of entities. Our findings suggest that, in our internal experimental setup, this approach is promising. However, we ranked above average for the high-resource languages and lower than average for low-resource and multilingual models.

Archive TimeLine Summarization (ATLS): Conceptual Framework for Timeline Generation over Historical Document Collections
Nicolas Gutehrlé | Antoine Doucet | Adam Jatowt
Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Archive collections are nowadays mostly available through search engine interfaces, which allow a user to retrieve documents by issuing queries. The study of these collections may be, however, impaired by some aspects of search engines, such as the overwhelming number of documents returned or the lack of contextual knowledge provided. New methods that could work independently or in combination with search engines are then required to access these collections. In this position paper, we propose to extend TimeLine Summarization (TLS) methods to archive collections to assist in their study. We provide an overview of existing TLS methods and we describe a conceptual framework for an Archive TimeLine Summarization (ATLS) system, which aims to generate informative, readable and interpretable timelines.

2021

Multi-TimeLine Summarization (MTLS): Improving Timeline Summarization by Generating Multiple Summaries
Yi Yu | Adam Jatowt | Antoine Doucet | Kazunari Sugiyama | Masatoshi Yoshikawa
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this paper, we address a novel task, Multiple TimeLine Summarization (MTLS), which extends the flexibility and versatility of Timeline Summarization (TLS). Given any collection of time-stamped news articles, MTLS automatically discovers important yet different stories and generates a corresponding timeline for each story. To achieve this, we propose a novel unsupervised summarization framework based on two-stage affinity propagation. We also introduce a quantitative evaluation measure for MTLS based on previous TLS evaluation methods. Experimental results show that our MTLS framework demonstrates high effectiveness and that the MTLS task can give better results than TLS.

Relation Classification via Relation Validation
José G. Moreno | Antoine Doucet | Brigitte Grau
Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6)

Using a Frustratingly Easy Domain and Tagset Adaptation for Creating Slavic Named Entity Recognition Systems
Luis Adrián Cabrera-Diego | Jose G. Moreno | Antoine Doucet
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

We present a collection of Named Entity Recognition (NER) systems for six Slavic languages: Bulgarian, Czech, Polish, Slovenian, Russian and Ukrainian. These NER systems have been trained using different BERT models and a Frustratingly Easy Domain Adaptation (FEDA). FEDA allows us to create NER systems using multiple datasets without having to worry about whether the tagsets (e.g. Location, Event, Miscellaneous, Time) in the source and target domains match, while increasing the amount of data available for training. Moreover, we boosted the prediction of named entities by marking uppercase words and predicting masked words. Participating in the 3rd Shared Task on SlavNER, our NER systems reached a strict match micro F-score of up to 0.908. The results demonstrate good generalization, even for named entities with weak regularity, such as book titles, or entities that were never seen during training.

EMBEDDIA Tools, Datasets and Challenges: Resources and Hackathon Contributions
Senja Pollak | Marko Robnik-Šikonja | Matthew Purver | Michele Boggia | Ravi Shekhar | Marko Pranjić | Salla Salmela | Ivar Krustok | Tarmo Paju | Carl-Gustav Linden | Leo Leppänen | Elaine Zosa | Matej Ulčar | Linda Freienthal | Silver Traat | Luis Adrián Cabrera-Diego | Matej Martinc | Nada Lavrač | Blaž Škrlj | Martin Žnidaršič | Andraž Pelicon | Boshko Koloski | Vid Podpečan | Janez Kranjc | Shane Sheehan | Emanuela Boros | Jose G. Moreno | Antoine Doucet | Hannu Toivonen
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

This paper presents tools and data sources collected and released by the EMBEDDIA project, supported by the European Union’s Horizon 2020 research and innovation program. The collected resources were offered to participants of a hackathon organized as part of the EACL Hackashop on News Media Content Analysis and Automated Report Generation in February 2021. The hackathon had six participating teams who addressed different challenges, either from the list of proposed challenges or their own news-industry-related tasks. This paper goes beyond the scope of the hackathon, as it brings together in a coherent and compact form most of the resources developed, collected and released by the EMBEDDIA project. Moreover, it constitutes a handy source for the news media industry and for researchers in the fields of Natural Language Processing and Social Science.

2020

Alleviating Digitization Errors in Named Entity Recognition for Historical Documents
Emanuela Boros | Ahmed Hamdi | Elvys Linhares Pontes | Luis Adrián Cabrera-Diego | Jose G. Moreno | Nicolas Sidere | Antoine Doucet
Proceedings of the 24th Conference on Computational Natural Language Learning

This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.

Dataset for Temporal Analysis of English-French Cognates
Esteban Frossard | Mickael Coustaty | Antoine Doucet | Adam Jatowt | Simon Hengchen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Languages change over time and, thanks to the abundance of digital corpora, their evolutionary analysis using computational techniques has recently gained much research attention. In this paper, we focus on creating a dataset to support investigating the similarity in evolution between different languages. We look in particular into the similarities and differences between the use of corresponding words across time in English and French, two languages from different linguistic families yet with shared syntax and close contact. For this we select a set of cognates in both languages and study their frequency changes and correlations over time. We propose a new dataset for computational approaches to the synchronized diachronic investigation of language pairs, and subsequently show novel findings stemming from the cognate-focused diachronic comparison of the two chosen languages. To the best of our knowledge, the present study is the first in the literature to use computational approaches and large data to make a cross-language diachronic analysis.

A Dataset for Multi-lingual Epidemiological Event Extraction
Stephen Mutuvi | Antoine Doucet | Gaël Lejeune | Moses Odeo
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper proposes a corpus for the development and evaluation of tools and techniques for identifying emerging infectious disease threats in online news text. The corpus can not only be used for information extraction, but also for other natural language processing (NLP) tasks such as text classification. We make use of articles published on the Program for Monitoring Emerging Diseases (ProMED) platform, which provides current information about outbreaks of infectious diseases globally. Among the key pieces of information present in the articles is the uniform resource locator (URL) to the online news sources where the outbreaks were originally reported. We detail the procedure followed to build the dataset, which includes leveraging the source URLs to retrieve the news reports and subsequently pre-processing the retrieved documents. We also report on experimental results of event extraction on the dataset using the Data Analysis for Information Extraction in any Language (DAnIEL) system. DAnIEL is a multilingual news surveillance system that leverages unique attributes associated with news reporting to extract events: repetition and saliency. The system has wide geographical and language coverage, including low-resource languages. In addition, we compare different classification approaches in terms of their ability to differentiate between epidemic-related and unrelated news articles that constitute the corpus.

Multilingual Epidemiological Text Classification: A Comparative Study
Stephen Mutuvi | Emanuela Boros | Antoine Doucet | Adam Jatowt | Gaël Lejeune | Moses Odeo
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we approach the multilingual text classification task in the context of the epidemiological field. Multilingual text classification models tend to perform differently across different languages (low- or high-resourced), more particularly when the dataset is highly imbalanced, which is the case for epidemiological datasets. We conduct a comparative study of different machine and deep learning text classification models using a dataset comprising news articles related to epidemic outbreaks from six languages, four low-resourced and two high-resourced, in order to analyze the influence of the nature of the language, the structure of the document, and the size of the data. Our findings indicate that the performance of the models based on fine-tuned language models exceeds by more than 50% the chosen baseline models that include a specialized epidemiological news surveillance system and several machine learning models. Also, low-resource languages are highly influenced not only by the typology of the languages on which the models have been pre-trained and/or fine-tuned but also by their size. Furthermore, we discover that the beginning and the end of documents provide the most salient features for this task and, as expected, the performance of the models was proportionate to the training data size.

2019

TLR at BSNLP2019: A Multilingual Named Entity Recognition System
Jose G. Moreno | Elvys Linhares Pontes | Mickael Coustaty | Antoine Doucet
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

This paper presents our participation in the shared task on multilingual named entity recognition at BSNLP2019. Our strategy is based on a standard neural architecture for sequence labeling. In particular, we use a mixed model which combines multilingual contextual and language-specific embeddings. Our only submitted run is based on a voting schema using multiple models, one for each of the four languages of the task (Bulgarian, Czech, Polish, and Russian) and another for English. Results for named entity recognition are encouraging for all languages, varying from 60% to 83% in terms of Strict and Relaxed metrics, respectively.

2017

The PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions
Agata Savary | Carlos Ramisch | Silvio Cordeiro | Federico Sangati | Veronika Vincze | Behrang QasemiZadeh | Marie Candito | Fabienne Cap | Voula Giouli | Ivelina Stoyanova | Antoine Doucet
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

Multiword expressions (MWEs) are known as a “pain in the neck” for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one’s heart or to turn off, have been rarely modelled. This is notably due to their syntactic variability, which hinders treating them as “words with spaces”. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.

Neural Networks for Multi-Word Expression Detection
Natalia Klyueva | Antoine Doucet | Milan Straka
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

In this paper we describe the MUMULS system that participated in the 2017 shared task on automatic identification of verbal multiword expressions (VMWEs). The MUMULS system was implemented using a supervised approach based on recurrent neural networks using the open source library TensorFlow. The model was trained on a data set containing annotated VMWEs as well as morphological and syntactic information. The MUMULS system performed the identification of VMWEs in 15 languages; it was one of the few systems that could categorize VMWE types in nearly all languages.

2013

DAnIEL, parsimonious yet high-coverage multilingual epidemic surveillance (DAnIEL : Veille épidémiologique multilingue parcimonieuse) [in French]
Gaël Lejeune | Romain Brixtel | Charlotte Lecluze | Antoine Doucet | Nadine Lucas
Proceedings of TALN 2013 (Volume 3: System Demonstrations)

“Let Everything Turn Well in Your Wife”: Generation of Adult Humor Using Lexical Constraints
Alessandro Valitutti | Hannu Toivonen | Antoine Doucet | Jukka M. Toivanen
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2010

Filtering news for epidemic surveillance: towards processing more languages with fewer resources
Gaël Lejeune | Antoine Doucet | Roman Yangarber | Nadine Lucas
Proceedings of the 4th Workshop on Cross Lingual Information Access

2004

Non-Contiguous Word Sequences for Information Retrieval
Antoine Doucet | Helena Ahonen-Myka
Proceedings of the Workshop on Multiword Expressions: Integrating Processing