Guillaume Gadek


2023

pdf bib
K-pop and fake facts: from texts to smart alerting for maritime security
Maxime Prieur | Souhir Gahbiche | Guillaume Gadek | Sylvain Gatepaille | Kilian Vasnier | Valerian Justine
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)

Maritime security requires full-time monitoring of the situation, mainly based on technical data (radar, AIS) but also from OSINT-like inputs (e.g., newspapers). Some threats to the operational reliability of this maritime surveillance, such as malicious actors, introduce discrepancies between hard and soft data (sensors and texts), either by tweaking their AIS emitters or by emitting false information on pseudo-newspapers. Many techniques exist to identify these pieces of false information, including using knowledge base population techniques to build a structured view of the information. This paper presents a use case for suspect data identification in a maritime setting. The proposed system UMBAR ingests data from sensors and texts, processing them through an information extraction step, in order to feed a Knowledge Base and finally perform coherence checks between the extracted facts.

pdf bib
DWIE-FR : Un nouveau jeu de données en français annoté en entités nommées
Sylvain Verdy | Maxime Prieur | Guillaume Gadek | Cédric Lopez
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : travaux de recherche originaux -- articles courts

Ces dernières années, les contributions majeures qui ont eu lieu en apprentissage automatique supervisé ont mis en evidence la nécessité de disposer de grands jeux de données annotés de haute qualité. Les recherches menées sur la tâche de reconnaissance d’entités nommées dans des textes en français font face à l’absence de jeux de données annotés “à grande échelle” et avec de nombreuses classes d’entités hiérarchisées. Dans cet article, nous proposons une approche pour obtenir un tel jeu de données qui s’appuie sur des étapes de traduction puis d’annotation des données textuelles en anglais vers une langue cible (ici au français). Nous évaluons la qualité de l’approche proposée et mesurons les performances de quelques modèles d’apprentissage automatique sur ces données.

2020

pdf bib
Arabizi Language Models for Sentiment Analysis
Gaétan Baert | Souhir Gahbiche | Guillaume Gadek | Alexandre Pauchet
Proceedings of the 28th International Conference on Computational Linguistics

Arabizi is a written form of spoken Arabic, relying on Latin characters and digits. It is informal and does not follow any conventional rules, raising many NLP challenges. In particular, Arabizi has recently emerged as the Arabic language in online social networks, becoming of great interest for opinion mining and sentiment analysis. Unfortunately, only few Arabizi resources exist and state-of-the-art language models such as BERT do not consider Arabizi. In this work, we construct and release two datasets: (i) LAD, a corpus of 7.7M tweets written in Arabizi and (ii) SALAD, a subset of LAD, manually annotated for sentiment analysis. Then, a BERT architecture is pre-trained on LAD, in order to create and distribute an Arabizi language model called BAERT. We show that a language model (BAERT) pre-trained on a large corpus (LAD) in the same language (Arabizi) as that of the fine-tuning dataset (SALAD), outperforms a state-of-the-art multi-lingual pretrained model (multilingual BERT) on a sentiment analysis task.