2022
A Methodology for Building a Diachronic Dataset of Semantic Shifts and its Application to QC-FR-Diac-V1.0, a Free Reference for French
David Kletz | Philippe Langlais | François Lareau | Patrick Drouin
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Different algorithms have been proposed to detect semantic shifts (changes in word meaning over time) in a diachronic corpus. Yet, somewhat surprisingly, no reference corpus has been designed so far to evaluate them, leaving researchers to fall back on troublesome evaluation strategies. In this work, we introduce a methodology for constructing a reference dataset for the evaluation of semantic shift detection, that is, a list of words for which we know with certainty whether their meaning changed over a period of interest. We leverage a state-of-the-art word-sense disambiguation model to associate a date of first appearance with all the senses of a word. Significant changes in sense distributions, as well as clear stability, are detected, and the resulting words are inspected by experts using a dedicated interface before populating a reference dataset. As a proof of concept, we apply this methodology to a corpus of newspapers from Quebec covering the whole 20th century. We manually verified a subset of candidates, leading to QC-FR-Diac-V1.0, a corpus of 151 words that allows one to evaluate the identification of semantic shifts in French between 1910 and 1990.
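The core of the candidate-selection step is comparing a word's sense distribution across two time slices. A minimal sketch of that comparison, using total variation distance between normalized sense-count distributions (the paper relies on a WSD model to produce the counts; the measure, sense labels, and counts below are illustrative assumptions, not the authors' exact procedure):

```python
def sense_distribution_shift(counts_t1, counts_t2):
    """Total variation distance between two sense distributions.

    counts_t1 / counts_t2: dicts mapping sense labels to occurrence
    counts in two time slices (e.g. the 1910s vs the 1980s).
    Returns a value in [0, 1]; 0 means identical distributions.
    """
    senses = set(counts_t1) | set(counts_t2)
    n1 = sum(counts_t1.values()) or 1
    n2 = sum(counts_t2.values()) or 1
    return 0.5 * sum(
        abs(counts_t1.get(s, 0) / n1 - counts_t2.get(s, 0) / n2)
        for s in senses
    )

# A word whose dominant sense flips between periods scores high;
# a word with a stable sense distribution scores near 0.
shifted = sense_distribution_shift({"s1": 90, "s2": 10}, {"s1": 15, "s2": 85})
stable = sense_distribution_shift({"s1": 50, "s2": 50}, {"s1": 48, "s2": 52})
```

Words scoring above a high threshold would become shift candidates, and words near zero stability candidates, both then routed to expert inspection.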
2020
Automatic Term Extraction from Newspaper Corpora: Making the Most of Specificity and Common Features
Patrick Drouin | Jean-Benoît Morel | Marie-Claude L'Homme
Proceedings of the 6th International Workshop on Computational Terminology
The first step of any terminological work is to set up a reliable, specialized corpus composed of documents written by specialists, and then to apply automatic term extraction (ATE) methods to this corpus in order to retrieve a first list of potential terms. The experiment we describe in this paper differs quite drastically from this usual process, since we apply ATE to unspecialized corpora. The corpus used for this study was built from newspaper articles retrieved from the Web using a short list of keywords. The general intuition on which this research is based is that corpus comparison techniques for ATE can be used to capture both similarities and dissimilarities between corpora. The former are exploited through word embeddings and the latter through a termhood measure. Our initial results were validated manually and show that combining a traditional ATE method that focuses on dissimilarities between corpora with newer methods that exploit similarities (more specifically, distributional features of candidates) leads to promising results.
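The "dissimilarity" side of corpus comparison boils down to a termhood score: how over-represented a candidate is in the analysis corpus relative to a reference corpus. A simplified log-odds specificity score in that spirit (this is a sketch; the frequencies are hypothetical and the formula is not the exact measure used by the authors):

```python
import math


def specificity(freq, size, ref_freq, ref_size, smoothing=0.5):
    """Simplified log-odds specificity of a candidate term.

    freq / size: term frequency and token count in the analysis corpus.
    ref_freq / ref_size: same in the reference corpus.
    Positive scores mean the term is specific to the analysis corpus;
    scores near 0 mean it is equally common in both.
    """
    p_target = (freq + smoothing) / (size + smoothing)
    p_reference = (ref_freq + smoothing) / (ref_size + smoothing)
    return math.log2(p_target / p_reference)


# Hypothetical counts: a domain-bound word is frequent in the analysis
# corpus but rare in the reference; a general word has similar rates.
domain_specific = specificity(50, 100_000, 2, 1_000_000)
general_word = specificity(500, 100_000, 5_000, 1_000_000)
```

Ranking candidates by such a score yields the first term list, which the paper then complements with distributional (embedding-based) features.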
TermEval 2020: Shared Task on Automatic Term Extraction Using the Annotated Corpora for Term Extraction Research (ACTER) Dataset
Ayla Rigouts Terryn | Veronique Hoste | Patrick Drouin | Els Lefever
Proceedings of the 6th International Workshop on Computational Terminology
The TermEval 2020 shared task provided a platform for researchers to work on automatic term extraction (ATE) with the same dataset: the Annotated Corpora for Term Extraction Research (ACTER). The dataset covers three languages (English, French, and Dutch) and four domains, of which the domain of heart failure was kept as a held-out test set on which final F1-scores were calculated. The aim was to provide a large, transparent, qualitatively annotated, and diverse dataset to the ATE research community, with the goal of promoting comparative research and thus identifying strengths and weaknesses of various state-of-the-art methodologies. The results reveal considerable variation between systems and illustrate how some methodologies reach higher precision or recall, how different systems extract different types of terms, and how some are exceptionally good at finding rare terms or are less affected by term length. The current contribution offers an overview of the shared task with a comparative evaluation, which complements the individual papers by all participants.
2019
pdf
bib
abs
Analysing the Impact of Supervised Machine Learning on Automatic Term Extraction: HAMLET vs TermoStat
Ayla Rigouts Terryn | Patrick Drouin | Veronique Hoste | Els Lefever
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Traditional approaches to automatic term extraction do not rely on machine learning (ML); based on a limited number of linguistic and statistical clues, they select the top n ranked candidate terms, or the candidate terms above a predefined cut-off point. However, supervised ML approaches are gaining interest. Relatively little is known about the impact of these supervised methodologies; evaluations are often limited to precision, and sometimes recall and F1-scores, without information about the nature of the extracted candidate terms. Therefore, the current paper presents a detailed and elaborate analysis and comparison of a traditional state-of-the-art system (TermoStat) and a new supervised ML approach (HAMLET), using the results obtained for the same manually annotated Dutch corpus about dressage.
2018
Lexical Profiling of Environmental Corpora
Patrick Drouin | Marie-Claude L'Homme | Benoît Robichaud
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Évaluation des modèles sémantiques distributionnels : le cas de la dérivation syntaxique (Evaluation of distributional semantic models : The case of syntactic derivation )
Gabriel Bernier-Colborne | Patrick Drouin
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)
We evaluate two distributional semantic models using a dataset representing four types of lexical relations and analyse the influence of the parameters of both models. The results indicate that which model performs best depends on the targeted relations, and that the influence of the models' parameters varies considerably depending on this factor. They also show that these models capture syntactic derivation as well as they capture synonymy, but that the configurations that best capture these two types of relations are very different.
Combiner des modèles sémantiques distributionnels pour mieux détecter les termes évoquant le même cadre sémantique (Combining distributional semantic models to improve the identification of terms that evoke the same semantic frame)
Gabriel Bernier-Colborne | Patrick Drouin
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)
We use distributional semantic models to detect terms that evoke the same semantic frame. In this paper, we verify whether a combination of different models achieves higher precision than a single model. We test several simple methods for combining the similarity measures computed from each model. The results indicate that combining different models systematically increases precision over the best single model.
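One of the simplest combination schemes the abstract alludes to is averaging each candidate's similarity to the query term across models and re-ranking. A minimal sketch, assuming two DSMs that each return cosine similarities for a set of candidates (the words and scores below are hypothetical):

```python
def combine_models(sim_by_model):
    """Rank candidates by the mean of their per-model similarities.

    sim_by_model: list of dicts {candidate: similarity to the query
    term}, one dict per distributional semantic model. Candidates
    missing from a model are scored 0.0 for that model.
    Returns candidates sorted from most to least similar.
    """
    candidates = set().union(*sim_by_model)
    mean_sim = {
        c: sum(m.get(c, 0.0) for m in sim_by_model) / len(sim_by_model)
        for c in candidates
    }
    return sorted(mean_sim, key=mean_sim.get, reverse=True)


# Hypothetical similarities from two different models for one query term:
ranked = combine_models([
    {"acheter": 0.71, "vendre": 0.64, "magasin": 0.58},
    {"acheter": 0.66, "vendre": 0.69, "magasin": 0.40},
])
```

Rank-based fusion (averaging ranks instead of scores) is a common alternative when the models' similarity scales are not directly comparable.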
Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016)
Patrick Drouin | Natalia Grabar | Thierry Hamon | Kyo Kageura | Koichi Takeuchi
Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016)
Evaluation of distributional semantic models: a holistic approach
Gabriel Bernier-Colborne | Patrick Drouin
Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016)
We investigate how both model-related factors and application-related factors affect the accuracy of distributional semantic models (DSMs) in the context of specialized lexicography, and how these factors interact. This holistic approach to the evaluation of DSMs provides valuable guidelines for the use of these models and insight into the kind of semantic information they capture.
2015
La séparation des composantes lexicale et flexionnelle des vecteurs de mots (Separating the lexical and inflectional components of word vectors)
François Lareau | Gabriel Bernier-Colborne | Patrick Drouin
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts
In distributional semantics, word meaning is modelled by vectors that represent the words' distribution in corpora. Since the models are often computed on corpora without extensive linguistic pre-processing, they do not properly account for the morphological compositionality of word forms. We propose a method to decompose word vectors into lexical and inflectional vectors.
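One simple way to approximate such a decomposition is to estimate an inflectional component as the mean vector offset between inflected and base forms over many pairs, then subtract it from an inflected form's vector. A toy sketch under that assumption (the 2-d vectors and word pairs are fabricated for illustration; this is not necessarily the authors' exact method):

```python
def inflection_vector(pairs, vectors):
    """Mean offset between inflected and base forms over (base, inflected)
    pairs; approximates the inflectional component (here, the plural)."""
    dim = len(next(iter(vectors.values())))
    offset = [0.0] * dim
    for base, inflected in pairs:
        for i in range(dim):
            offset[i] += vectors[inflected][i] - vectors[base][i]
    return [x / len(pairs) for x in offset]


def lexical_vector(word_form, inflection, vectors):
    """Subtract the inflectional component from an inflected form's vector
    to approximate its lexical component."""
    return [a - b for a, b in zip(vectors[word_form], inflection)]


# Toy vectors in which pluralization adds roughly (0, 1):
vecs = {
    "chat": [1.0, 0.0], "chats": [1.0, 1.1],
    "chien": [2.0, 0.2], "chiens": [2.1, 1.1],
}
plural = inflection_vector([("chat", "chats"), ("chien", "chiens")], vecs)
lexical_chats = lexical_vector("chats", plural, vecs)  # close to vecs["chat"]
```

With real embeddings, the estimated lexical vector of an inflected form should land near the vector of its lemma, which gives a direct way to evaluate the decomposition.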
2014
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)
Patrick Drouin | Natalia Grabar | Thierry Hamon | Kyo Kageura
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)
2012
Texto4Science: a Quebec French Database of Annotated Short Text Messages
Philippe Langlais | Patrick Drouin | Amélie Paulus | Eugénie Rompré Brodeur | Florent Cottin
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
In October 2009, the Quebec French part of the international sms4science project, called texto4science, was launched. Over a period of 10 months, we collected slightly more than 7,000 SMSs, which we carefully annotated. This database is now ready to be used by the community. The purpose of this article is to relate the efforts put into designing this database and to provide some analysis of the main linguistic phenomena that we annotated. We also report on a socio-linguistic survey we conducted within the project.
2006
Applying Lexical Constraints on Morpho-Syntactic Patterns for the Identification of Conceptual-Relational Content in Specialized Texts
Jean-François Couturier | Sylvain Neuvel | Patrick Drouin
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
In this paper, we describe a formal constraint mechanism, which we label Conceptual Constraint Variables (CCVs), introduced to restrict surface patterns during automated text analysis with the objective of increasing precision in the representation of informational contents. We briefly present, and exemplify, the various types of CCVs applicable to the English texts of our corpora, and show how these constraints allow us to resolve some of the problems inherent to surface pattern recognition, more specifically, those related to the resolution of conceptual or syntactic ambiguities introduced by the most frequent English prepositions.
2004
Detection of Domain Specific Terminology Using Corpora Comparison
Patrick Drouin
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)