Vincent Labatut


2023

Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset
Arthur Amalvy | Vincent Labatut | Richard Dufour
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

While recent pre-trained transformer-based models can perform named entity recognition (NER) with great accuracy, their limited range remains an issue when applied to long documents such as whole novels. To alleviate this issue, a solution is to retrieve relevant context at the document level. Unfortunately, the lack of supervision for such a task means one has to settle for unsupervised approaches. Instead, we propose to generate a synthetic context retrieval training dataset using Alpaca, an instruction-tuned large language model (LLM). Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER. We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.

The Role of Global and Local Context in Named Entity Recognition
Arthur Amalvy | Vincent Labatut | Richard Dufour
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Pre-trained transformer-based models have recently shown great performance when applied to Named Entity Recognition (NER). As the complexity of their self-attention mechanism prevents them from processing long documents at once, these models are usually applied in a sequential fashion. Such an approach unfortunately only incorporates local context and prevents leveraging global document context in long documents such as novels, which might hinder performance. In this article, we explore the impact of global document context and its relationship with local context. We find that correctly retrieving global document context has a greater impact on performance than only leveraging local context, prompting further research on how to better retrieve that context.

2022

Remplacement de mentions pour l’adaptation d’un corpus de reconnaissance d’entités nommées à un domaine cible (Mention replacement for adapting a named entity recognition dataset to a target domain)
Arthur Amalvy | Vincent Labatut | Richard Dufour
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Named entity recognition is a well-studied natural language processing task that is useful in many applications. Recently, neural models have made it possible to solve it with very good performance. However, the datasets used to train and evaluate these models focus on a small number of domains and document types (news articles, the web). Yet a model trained on one domain generally performs worse on another, so less-covered domains are penalized. To help remedy this problem, this article proposes using a data augmentation technique that adapts a corpus annotated with named entities from a source domain to a target domain where the types of names encountered may differ. We apply it to fantasy literature, where we show that it can bring performance gains.

2021

Approche multimodale par plongement de texte et de graphes pour la détection de messages abusifs [Multimodal approach using text and graph embeddings for abusive message detection]
Noé Cécillon | Richard Dufour | Vincent Labatut
Traitement Automatique des Langues, Volume 62, Numéro 2 : Nouvelles applications du TAL [New applications in NLP]

2020

WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection
Noé Cécillon | Vincent Labatut | Richard Dufour | Georges Linarès
Proceedings of the Twelfth Language Resources and Evaluation Conference

With the spread of online social networks, it is more and more difficult to monitor all the user-generated content. Automating the moderation of inappropriate content exchanged on the Internet has thus become a priority task. Methods have been proposed for this purpose, but it can be challenging to find a suitable dataset to train and develop them. This issue is especially acute for approaches based on information derived from the structure and the dynamics of the conversation. In this work, we propose an original framework, based on the Wikipedia Comment corpus, with comment-level abuse annotations of different types. The major contribution concerns the reconstruction of conversations, in contrast to existing corpora, which focus only on isolated messages (i.e. taken out of their conversational context). This large corpus of more than 380k annotated messages opens perspectives for online abuse detection and especially for context-based approaches. We also propose, in addition to this corpus, a complete benchmarking platform to stimulate and fairly compare scientific works around the problem of content abuse detection, trying to avoid the recurring problem of result replication. Finally, we apply two classification methods to our dataset to demonstrate its potential.

Serial Speakers: a Dataset of TV Series
Xavier Bost | Vincent Labatut | Georges Linares
Proceedings of the Twelfth Language Resources and Evaluation Conference

For over a decade, TV series have been drawing increasing interest, both from the audience and from various academic fields. But while most viewers are hooked on the continuous plots of TV serials, the few annotated datasets available to researchers focus on standalone episodes of classical TV series. We aim at filling this gap by providing the multimedia/speech processing communities with “Serial Speakers”, an annotated dataset of 155 episodes from three popular American TV serials: “Breaking Bad”, “Game of Thrones” and “House of Cards”. “Serial Speakers” is suitable both for investigating multimedia retrieval in realistic use case scenarios, and for addressing lower level speech related tasks in especially challenging conditions. We publicly release annotations for every speech turn (boundaries, speaker) and scene boundary, along with annotations for shot boundaries, recurring shots, and interacting speakers in a subset of episodes. Because of copyright restrictions, the textual content of the speech turns is encrypted in the public version of the dataset, but we provide the users with a simple online tool to recover the plain text from their own subtitle files.