Antoine Doucet


2024

pdf bib
DeBERTa Beats Behemoths: A Comparative Analysis of Fine-Tuning, Prompting, and PEFT Approaches on LegalLensNER
Hanh Thi Hong Tran | Nishan Chatterjee | Senja Pollak | Antoine Doucet
Proceedings of the Natural Legal Language Processing Workshop 2024

This paper summarizes the participation of our team (Flawless Lawgic) in the legal named entity recognition (L-NER) task at LegalLens 2024: Detecting Legal Violations. Given possible unstructured texts (e.g., online media texts), we aim to identify legal violations by extracting legal entities such as “violation”, “violation by”, “violation on”, and “law”. This system-description paper discusses our approaches to address the task, empirically highlighting the performances of fine-tuning models from the Transformers family (e.g., RoBERTa and DeBERTa) against open-sourced LLMs (e.g., Llama, Mistral) with different tuning settings (e.g., LoRA, Supervised Fine-Tuning (SFT) and prompting strategies). Our best results, with a weighted F1 of 0.705 on the test set, show a 30 percentage points increase in F1 compared to the baseline and rank 2 on the leaderboard, leaving a marginal gap of only 0.4 percentage points lower than the top solution. Our solutions are available at github.com/honghanhh/lner.

pdf bib
L3iTC at the FinLLM Challenge Task: Quantization for Financial Text Classification & Summarization
Elvys Linhares Pontes | Carlos-Emiliano González-Gallardo | Mohamed Benjannet | Caryn Qu | Antoine Doucet
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning

pdf bib
L3i++ at SemEval-2024 Task 8: Can Fine-tuned Large Language Model Detect Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text?
Hanh Thi Hong Tran | Tien Nam Nguyen | Antoine Doucet | Senja Pollak
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper summarizes our participation in SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. In this task, we aim to solve two over three Subtasks: (1) Monolingual and Multilingual Binary Human-Written vs. Machine-Generated Text Classification; and (2) Multi-Way Machine-Generated Text Classification. We conducted a comprehensive comparative study across three methodological groups: Five metric-based models (Log-Likelihood, Rank, Log-Rank, Entropy, and MFDMetric), two fine-tuned sequence-labeling language models (RoBERTA and XLM-R); and a fine-tuned large-scale language model (LS-LLaMA). Our findings suggest that our LLM outperformed both traditional sequence-labeling LM benchmarks and metric-based approaches. Furthermore, our fine-tuned classifier excelled in detecting machine-generated multilingual texts and accurately classifying machine-generated texts within a specific category, (e.g., ChatGPT, bloomz, dolly). However, they do exhibit challenges in detecting them in other categories (e.g., cohere, and davinci). This is due to potential overlap in the distribution of the metric among various LLMs. Overall, we achieved a 6th rank in both Multilingual Binary Human-Written vs. Machine-Generated Text Classification and Multi-Way Machine-Generated Text Classification on the leaderboard.

2023

pdf bib
L3I++ at SemEval-2023 Task 2: Prompting for Multilingual Complex Named Entity Recognition
Carlos-Emiliano Gonzalez-Gallardo | Thi Hong Hanh Tran | Nancy Girdhar | Emanuela Boros | Jose G. Moreno | Antoine Doucet
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper summarizes the participation of the L3i laboratory of the University of La Rochelle in the SemEval-2023 Task 2, Multilingual Complex Named Entity Recognition (MultiCoNER II). Similar to MultiCoNER I, the task seeks to develop methods to detect semantic ambiguous and complex entities in short and low-context settings. However, MultiCoNER II adds a fine-grained entity taxonomy with over 30 entity types and corrupted data on the test partitions. We approach these complications following prompt-based learning as (1) a ranking problem using a seq2seq framework, and (2) an extractive question-answering task. Our findings show that even if prompting techniques have a similar recall to fine-tuned hierarchical language model-based encoder methods, precision tends to be more affected.

pdf bib
Injection de connaissances temporelles dans la reconnaissance d’entités nommées historiques
Carlos-Emiliano González-Gallardo | Emanuela Boros | Edward Giamphy | Ahmed Hamdi | Jose Moreno | Antoine Doucet
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 4 : articles déjà soumis ou acceptés en conférence internationale

Dans cet article, nous abordons la reconnaissance d’entités nommées dans des documents historiques multilingues. Cette tâche présente des multiples défis tels que les erreurs générées suite à la numérisa- tion et la reconnaissance optique des caractères de ces documents. En outre, les documents historiques posent un autre défi puisque leurs collections sont distribuées sur une période de temps assez longue et suivent éventuellement plusieurs conventions orthographiques qui évoluent au fil du temps. Nous explorons, dans ce travail, l’idée d’injecter des connaissance temporelles à l’aide de graphes pour une reconnaissance d’entités nommées plus performante. Plus précisément, nous récupérons des contextes supplémentaires, sémantiquement pertinents, en explorant les informations temporelles fournies par les collections historiques et nous les incluons en tant que représentations mises en commun dans un modèle NER basé sur un transformeur. Nous expérimentons avec deux collections récentes en anglais, français et allemand, composées de journaux historiques (19C-20C) et de commentaires classiques (19C). Les résultats montrent l’efficacité de l’injection de connaissances temporelles dans des ensembles de données, des langues et des types d’entités différents.

pdf bib
Oui mais... ChatGPT peut-il identifier des entités dans des documents historiques ?
Carlos-Emiliano González-Gallardo | Emanuela Boros | Nancy Girdhar | Ahmed Hamdi | Jose Moreno | Antoine Doucet
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 4 : articles déjà soumis ou acceptés en conférence internationale

Les modèles de langage de grande taille (LLM) sont exploités depuis plusieurs années maintenant, obtenant des performances de pointe dans la reconnaissance d’entités à partir de documents modernes. Depuis quelques mois, l’agent conversationnel ChatGPT a suscité beaucoup d’intérêt auprès de la communauté scientifique et du grand public en raison de sa capacité à générer des réponses plausibles. Dans cet article, nous explorons cette compétence à travers la tâche de reconnaissance et de classification d’entités nommées (NERC) dans des sources primaires (des journaux historiques et des commentaires classiques) d’une manière zero-shot et en la comparant avec les systèmes de pointe basés sur des modèles de langage. Nos résultats indiquent plusieurs lacunes dans l’identification des entités dans le texte historique, qui concernant la cohérence des guidelines d’annotation des entités, la complexité des entités et du changement de code et la spécificité du prompt. De plus, comme prévu, l’inaccessibilité des archives historiques a également un impact sur les performances de ChatGPT.

pdf bib
Détection de faux tickets de caisse à l’aide d’entités et de relations basées sur une ontologie de domaine
Beatriz Martínez Tornés | Emanuela Boros | Petra Gomez-Krämer | Antoine Doucet | Jean-Marc Ogier
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 4 : articles déjà soumis ou acceptés en conférence internationale

Dans cet article, nous nous attaquons à la tâche de détection de fraude documentaire. Nous considérons que cette tâche peut être abordée avec des techniques de traitement automatique du langage naturel (TALN). Nous utilisons une approche basée sur la régression, en tirant parti d’un modèle de langage pré-entraîné afin de représenter le contenu textuel, et en enrichissant la représentation avec des entités et des relations basées sur une ontologie spécifique au domaine. Nous émulons une approche basée sur les entités en comparant différents types d’entrée : texte brut, entités extraites et une reformulation du contenu du document basée sur des triplets. Pour notre configuration expérimentale, nous utilisons le seul ensemble de données librement disponible de faux tickets de caisse, et nous fournissons une analyse approfondie de nos résultats. Ils montrent des corrélations intéressantes entre les types de relations ontologiques, les types d’entités (produit, entreprise, etc.) et la performance d’un modèle de langage basé sur la régression qui pourrait aider à étudier le transfert d’apprentissage à partir de méthodes de traitement du langage naturel (TALN) pour améliorer la performance des systèmes de détection de fraude existants.

pdf bib
Jeu de données de tickets de caisse pour la détection de fraude documentaire
Beatriz Martínez Tornés | Théo Taburet | Emanuela Boros | Kais Rouis | Petra Gomez-Krämer | Nicolas Sidere | Antoine Doucet | Vincent Poulain D’andecy
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 4 : articles déjà soumis ou acceptés en conférence internationale

L’utilisation généralisée de documents numériques non sécurisés par les entreprises et les administrations comme pièces justificatives les rend vulnérables à la falsification. En outre, les logiciels de retouche d’images et les possibilités qu’ils offrent compliquent les tâches de la détection de fraude d’images numériques. Néanmoins, la recherche dans ce domaine se heurte au manque de données réalistes accessibles au public. Dans cet article, nous proposons un nouveau jeu de données pour la détection des faux tickets contenant 988 images numérisées de tickets et leurs transcriptions, provenant du jeu de données SROIE (scanned receipts OCR and information extraction). 163 images et leurs transcriptions ont subi des modifications frauduleuses réalistes et ont été annotées. Nous décrivons en détail le jeu de données, les falsifications et leurs annotations et fournissons deux baselines (basées sur l’image et le texte) sur la tâche de détection de la fraude.

2022

pdf bib
Contextualizing Emerging Trends in Financial News Articles
Nhu Khoa Nguyen | Thierry Delahaut | Emanuela Boros | Antoine Doucet | Gaël Lejeune
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

Identifying and exploring emerging trends in news is becoming more essential than ever with many changes occurring around the world due to the global health crises. However, most of the recent research has focused mainly on detecting trends in social media, thus, benefiting from social features (e.g. likes and retweets on Twitter) which helped the task as they can be used to measure the engagement and diffusion rate of content. Yet, formal text data, unlike short social media posts, comes with a longer, less restricted writing format, and thus, more challenging. In this paper, we focus our study on emerging trends detection in financial news articles about Microsoft, collected before and during the start of the COVID-19 pandemic (July 2019 to July 2020). We make the dataset freely available and we also propose a strong baseline (Contextual Leap2Trend) for exploring the dynamics of similarities between pairs of keywords based on topic modeling and term frequency. Finally, we evaluate against a gold standard (Google Trends) and present noteworthy real-world scenarios regarding the influence of the pandemic on Microsoft.

pdf bib
Using Contextual Sentence Analysis Models to Recognize ESG Concepts
Elvys Linhares Pontes | Mohamed Ben Jannet | Jose G. Moreno | Antoine Doucet
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

This paper summarizes the joint participation of the Trading Central Labs and the L3i laboratory of the University of La Rochelle on both sub-tasks of the Shared Task FinSim-4 evaluation campaign. The first sub-task aims to enrich the ‘Fortia ESG taxonomy’ with new lexicon entries while the second one aims to classify sentences to either ‘sustainable’ or ‘unsustainable’ with respect to ESG (Environment, Social and Governance) related factors. For the first sub-task, we proposed a model based on pre-trained Sentence-BERT models to project sentences and concepts in a common space in order to better represent ESG concepts. The official task results show that our system yields a significant performance improvement compared to the baseline and outperforms all other submissions on the first sub-task. For the second sub-task, we combine the RoBERTa model with a feed-forward multi-layer perceptron in order to extract the context of sentences and classify them. Our model achieved high accuracy scores (over 92%) and was ranked among the top 5 systems.

pdf bib
Fine-tuning de modèles de langues pour la veille épidémiologique multilingue avec peu de ressources (Fine-tuning Language Models for Low-resource Multilingual Epidemic Surveillance)
Stephen Mutuvi | Emanuela Boros | Antoine Doucet | Adam Jatowt | Gaël Lejeune | Moses Odeo
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Les modèles de langues pré-entraînés connaissent un très grand succès en TAL, en particulier dans les situations où l’on dispose de suffisamment de données d’entraînement. Cependant, il reste difficile d’obtenir des résultats similaires dans des environnements multilingues avec peu de données d’entraînement, en particulier dans des domaines spécialisés tels que la surveillance des épidémies. Dans cet article, nous explorons plusieurs hypothèses concernant les facteurs qui pourraient avoir une influence sur les performances d’un système d’extraction d’événements épidémiologiques dans un scénario multilingue à faibles ressources : le type de modèle pré-entraîné, la qualité du tokenizer ainsi que les caractéristiques des entités à extraire. Nous proposons une analyse exhaustive de ces facteurs et observons une corrélation importante, quoique variable ; entre ces caractéristiques et les performances observées sur la base d’une tâche de veille épidémiologique multilingue à faibles ressources. Nous proposons aussi d’adapter les modèles de langues à cette tâche en étendant le vocabulaire du tokenizer pré-entraîné avec les entités continues, qui sont des entités qui ont été divisées en plusieurs sous-mots. Suite à cette adaptation, nous observons une amélioration notable des performances pour la plupart des modèles et des langues évalués.

pdf bib
L’importance des entités pour la tâche de détection d’événements en tant que système de question-réponse (Exploring Entities in Event Detection as Question Answering)
Emanuela Boros | Jose Moreno | Antoine Doucet
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Dans cet article, nous abordons un paradigme récent et peu étudié pour la tâche de détection d’événements en la présentant comme un problème de question-réponse avec possibilité de réponses multiples et le support d’entités. La tâche d’extraction des déclencheurs d’événements est ainsi transformée en une tâche d’identification des intervalles de réponse à partir d’un contexte, tout en se concentrant également sur les entités environnantes. L’architecture est basée sur un modèle de langage pré-entraîné et finement ajusté, où le contexte d’entrée est augmenté d’entités marquées à différents niveaux, de leurs positions, de leurs types et, enfin, de leurs rôles d’arguments. Nos expériences sur le corpus ACE 2005 démontrent que le modèle proposé exploite correctement les informations sur les entités dans le cadre de la détection des événements et qu’il constitue une solution viable pour cette tâche. De plus, nous démontrons que notre méthode, avec différents marqueurs d’entités, est particulièrement capable d’extraire des types d’événements non vus dans des contextes d’apprentissage en peu de coups.

pdf bib
L3i at SemEval-2022 Task 11: Straightforward Additional Context for Multilingual Named Entity Recognition
Emanuela Boros | Carlos-Emiliano González-Gallardo | Jose Moreno | Antoine Doucet
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper summarizes the participation of the L3i laboratory of the University of La Rochelle in the SemEval-2022 Task 11, Multilingual Complex Named Entity Recognition (MultiCoNER). The task focuses on detecting semantically ambiguous and complex entities in short and low-context monolingual and multilingual settings. We argue that using a language-specific and a multilingual language model could improve the performance of multilingual and mixed NER. Also, we consider that using additional contexts from the training set could improve the performance of a NER on short texts. Thus, we propose a straightforward technique for generating additional contexts with and without the presence of entities. Our findings suggest that, in our internal experimental setup, this approach is promising. However, we ranked above average for the high-resource languages and lower than average for low-resource and multilingual models.

pdf bib
Archive TimeLine Summarization (ATLS): Conceptual Framework for Timeline Generation over Historical Document Collections
Nicolas Gutehrlé | Antoine Doucet | Adam Jatowt
Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Archive collections are nowadays mostly available through search engines interfaces, which allow a user to retrieve documents by issuing queries. The study of these collections may be, however, impaired by some aspects of search engines, such as the overwhelming number of documents returned or the lack of contextual knowledge provided. New methods that could work independently or in combination with search engines are then required to access these collections. In this position paper, we propose to extend TimeLine Summarization (TLS) methods on archive collections to assist in their studies. We provide an overview of existing TLS methods and we describe a conceptual framework for an Archive TimeLine Summarization (ATLS) system, which aims to generate informative, readable and interpretable timelines.

pdf bib
IJS at TextGraphs-16 Natural Language Premise Selection Task: Will Contextual Information Improve Natural Language Premise Selection?
Thi Hong Hanh Tran | Matej Martinc | Antoine Doucet | Senja Pollak
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

Natural Language Premise Selection (NLPS) is a mathematical Natural Language Processing (NLP) task that retrieves a set of applicable relevant premises to support the end-user finding the proof for a particular statement. In this research, we evaluate the impact of Transformer-based contextual information and different fundamental similarity scores toward NLPS. The results demonstrate that the contextual representation is better at capturing meaningful information despite not being pretrained in the mathematical background compared to the statistical approach (e.g., the TF-IDF) with a boost of around 3.00% MAP@500.

2021

pdf bib
Multi-TimeLine Summarization (MTLS): Improving Timeline Summarization by Generating Multiple Summaries
Yi Yu | Adam Jatowt | Antoine Doucet | Kazunari Sugiyama | Masatoshi Yoshikawa
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this paper, we address a novel task, Multiple TimeLine Summarization (MTLS), which extends the flexibility and versatility of Time-Line Summarization (TLS). Given any collection of time-stamped news articles, MTLS automatically discovers important yet different stories and generates a corresponding time-line for each story. To achieve this, we propose a novel unsupervised summarization framework based on two-stage affinity propagation. We also introduce a quantitative evaluation measure for MTLS based on previousTLS evaluation methods. Experimental results show that our MTLS framework demonstrates high effectiveness and MTLS task can give bet-ter results than TLS.

pdf bib
EMBEDDIA Tools, Datasets and Challenges: Resources and Hackathon Contributions
Senja Pollak | Marko Robnik-Šikonja | Matthew Purver | Michele Boggia | Ravi Shekhar | Marko Pranjić | Salla Salmela | Ivar Krustok | Tarmo Paju | Carl-Gustav Linden | Leo Leppänen | Elaine Zosa | Matej Ulčar | Linda Freienthal | Silver Traat | Luis Adrián Cabrera-Diego | Matej Martinc | Nada Lavrač | Blaž Škrlj | Martin Žnidaršič | Andraž Pelicon | Boshko Koloski | Vid Podpečan | Janez Kranjc | Shane Sheehan | Emanuela Boros | Jose G. Moreno | Antoine Doucet | Hannu Toivonen
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

This paper presents tools and data sources collected and released by the EMBEDDIA project, supported by the European Union’s Horizon 2020 research and innovation program. The collected resources were offered to participants of a hackathon organized as part of the EACL Hackashop on News Media Content Analysis and Automated Report Generation in February 2021. The hackathon had six participating teams who addressed different challenges, either from the list of proposed challenges or their own news-industry-related tasks. This paper goes beyond the scope of the hackathon, as it brings together in a coherent and compact form most of the resources developed, collected and released by the EMBEDDIA project. Moreover, it constitutes a handy source for news media industry and researchers in the fields of Natural Language Processing and Social Science.

pdf bib
Using a Frustratingly Easy Domain and Tagset Adaptation for Creating Slavic Named Entity Recognition Systems
Luis Adrián Cabrera-Diego | Jose G. Moreno | Antoine Doucet
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

We present a collection of Named Entity Recognition (NER) systems for six Slavic languages: Bulgarian, Czech, Polish, Slovenian, Russian and Ukrainian. These NER systems have been trained using different BERT models and a Frustratingly Easy Domain Adaptation (FEDA). FEDA allow us creating NER systems using multiple datasets without having to worry about whether the tagset (e.g. Location, Event, Miscellaneous, Time) in the source and target domains match, while increasing the amount of data available for training. Moreover, we boosted the prediction on named entities by marking uppercase words and predicting masked words. Participating in the 3rd Shared Task on SlavNER, our NER systems reached a strict match micro F-score of up to 0.908. The results demonstrate good generalization, even in named entities with weak regularity, such as book titles, or entities that were never seen during the training.

pdf bib
Relation Classification via Relation Validation
José G. Moreno | Antoine Doucet | Brigitte Grau
Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6)

2020

pdf bib
Multilingual Epidemiological Text Classification: A Comparative Study
Stephen Mutuvi | Emanuela Boros | Antoine Doucet | Adam Jatowt | Gaël Lejeune | Moses Odeo
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we approach the multilingual text classification task in the context of the epidemiological field. Multilingual text classification models tend to perform differently across different languages (low- or high-resourced), more particularly when the dataset is highly imbalanced, which is the case for epidemiological datasets. We conduct a comparative study of different machine and deep learning text classification models using a dataset comprising news articles related to epidemic outbreaks from six languages, four low-resourced and two high-resourced, in order to analyze the influence of the nature of the language, the structure of the document, and the size of the data. Our findings indicate that the performance of the models based on fine-tuned language models exceeds by more than 50% the chosen baseline models that include a specialized epidemiological news surveillance system and several machine learning models. Also, low-resource languages are highly influenced not only by the typology of the languages on which the models have been pre-trained or/and fine-tuned but also by their size. Furthermore, we discover that the beginning and the end of documents provide the most salient features for this task and, as expected, the performance of the models was proportionate to the training data size.

pdf bib
Dataset for Temporal Analysis of English-French Cognates
Esteban Frossard | Mickael Coustaty | Antoine Doucet | Adam Jatowt | Simon Hengchen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Languages change over time and, thanks to the abundance of digital corpora, their evolutionary analysis using computational techniques has recently gained much research attention. In this paper, we focus on creating a dataset to support investigating the similarity in evolution between different languages. We look in particular into the similarities and differences between the use of corresponding words across time in English and French, two languages from different linguistic families yet with shared syntax and close contact. For this we select a set of cognates in both languages and study their frequency changes and correlations over time. We propose a new dataset for computational approaches of synchronized diachronic investigation of language pairs, and subsequently show novel findings stemming from the cognate-focused diachronic comparison of the two chosen languages. To the best of our knowledge, the present study is the first in the literature to use computational approaches and large data to make a cross-language diachronic analysis.

pdf bib
A Dataset for Multi-lingual Epidemiological Event Extraction
Stephen Mutuvi | Antoine Doucet | Gaël Lejeune | Moses Odeo
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper proposes a corpus for the development and evaluation of tools and techniques for identifying emerging infectious disease threats in online news text. The corpus can not only be used for information extraction, but also for other natural language processing (NLP) tasks such as text classification. We make use of articles published on the Program for Monitoring Emerging Diseases (ProMED) platform, which provides current information about outbreaks of infectious diseases globally. Among the key pieces of information present in the articles is the uniform resource locator (URL) to the online news sources where the outbreaks were originally reported. We detail the procedure followed to build the dataset, which includes leveraging the source URLs to retrieve the news reports and subsequently pre-processing the retrieved documents. We also report on experimental results of event extraction on the dataset using the Data Analysis for Information Extraction in any Language(DAnIEL) system. DAnIEL is a multilingual news surveillance system that leverages unique attributes associated with news reporting to extract events: repetition and saliency. The system has wide geographical and language coverage, including low-resource languages. In addition, we compare different classification approaches in terms of their ability to differentiate between epidemic-related and unrelated news articles that constitute the corpus.

pdf bib
Alleviating Digitization Errors in Named Entity Recognition for Historical Documents
Emanuela Boros | Ahmed Hamdi | Elvys Linhares Pontes | Luis Adrián Cabrera-Diego | Jose G. Moreno | Nicolas Sidere | Antoine Doucet
Proceedings of the 24th Conference on Computational Natural Language Learning

This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.

2019

pdf bib
TLR at BSNLP2019: A Multilingual Named Entity Recognition System
Jose G. Moreno | Elvys Linhares Pontes | Mickael Coustaty | Antoine Doucet
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

This paper presents our participation at the shared task on multilingual named entity recognition at BSNLP2019. Our strategy is based on a standard neural architecture for sequence labeling. In particular, we use a mixed model which combines multilingualcontextual and language-specific embeddings. Our only submitted run is based on a voting schema using multiple models, one for each of the four languages of the task (Bulgarian, Czech, Polish, and Russian) and another for English. Results for named entity recognition are encouraging for all languages, varying from 60% to 83% in terms of Strict and Relaxed metrics, respectively.

2017

pdf bib
The PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions
Agata Savary | Carlos Ramisch | Silvio Cordeiro | Federico Sangati | Veronika Vincze | Behrang QasemiZadeh | Marie Candito | Fabienne Cap | Voula Giouli | Ivelina Stoyanova | Antoine Doucet
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

Multiword expressions (MWEs) are known as a “pain in the neck” for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one’s heart or to turn off, have been rarely modelled. This is notably due to their syntactic variability, which hinders treating them as “words with spaces”. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.

pdf bib
Neural Networks for Multi-Word Expression Detection
Natalia Klyueva | Antoine Doucet | Milan Straka
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

In this paper we describe the MUMULS system that participated to the 2017 shared task on automatic identification of verbal multiword expressions (VMWEs). The MUMULS system was implemented using a supervised approach based on recurrent neural networks using the open source library TensorFlow. The model was trained on a data set containing annotated VMWEs as well as morphological and syntactic information. The MUMULS system performed the identification of VMWEs in 15 languages, it was one of few systems that could categorize VMWEs type in nearly all languages.

2013

pdf bib
“Let Everything Turn Well in Your Wife”: Generation of Adult Humor Using Lexical Constraints
Alessandro Valitutti | Hannu Toivonen | Antoine Doucet | Jukka M. Toivanen
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
DAnIEL, parsimonious yet high-coverage multilingual epidemic surveillance (DAnIEL : Veille épidémiologique multilingue parcimonieuse) [in French]
Gaël Lejeune | Romain Brixtel | Charlotte Lecluze | Antoine Doucet | Nadine Lucas
Proceedings of TALN 2013 (Volume 3: System Demonstrations)

2010

pdf bib
Filtering news for epidemic surveillance: towards processing more languages with fewer resources
Gaël Lejeune | Antoine Doucet | Roman Yangarber | Nadine Lucas
Proceedings of the 4th Workshop on Cross Lingual Information Access

2004

pdf bib
Non-Contiguous Word Sequences for Information Retrieval
Antoine Doucet | Helana Ahonen-Myka
Proceedings of the Workshop on Multiword Expressions: Integrating Processing

Search