Lydia-Mai Ho-Dac

Also published as: Mai Ho-dac, Lydia-Mai Ho-dac, Lydia Mai Ho-Dac

2025

Embeddings, topic models, LLM : un air de famille
Ludovic Tanguy | Cécile Fabre | Nabil Hathout | Lydia-Mai Ho-Dac
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

Word embeddings, topic models, LLMs: a family affair This article presents a study on terms denoting family relationships (brother, aunt, etc.) in French using three approaches: word embeddings, topic modeling, and pre-trained language models. The first two types of representations are built from the French version of Wikipedia, while the third is derived through direct interaction with ChatGPT. The aim is to compare how these three methods represent such terms, in two main ways: by evaluating them against a structural definition of family relations (in terms of features such as gender, lineage, etc.), and by comparing the topics associated with each term. These methods reveal different modes of structuring family-related vocabulary, while also underscoring the continued necessity of corpus-based and controlled analyses to obtain reliable results.

2024

pdf bib

Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 2 : traductions d'articles publiès
Mathieu Balaguer | Nihed Bendahman | Lydia-Mai Ho-dac | Julie Mauclair | Jose G Moreno | Julien Pinquier
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 2 : traductions d'articles publiès

pdf bib

Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position
Mathieu Balaguer | Nihed Bendahman | Lydia-Mai Ho-dac | Julie Mauclair | Jose G Moreno | Julien Pinquier
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

pdf bib

Actes de la 26ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues
Mathieu Balaguer | Nihed Bendahman | Lydia-Mai Ho-dac | Julie Mauclair | Jose G Moreno | Julien Pinquier
Actes de la 26ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues

pdf bib

2021

pdf bib abs

Coreference Chains Categorization by Sequence Clustering
Silvia Federzoni | Lydia-Mai Ho-Dac | Cécile Fabre
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

The diversity of coreference chains is usually tackled by means of global features (length, types and number of referring expressions, distance between them, etc.). In this paper, we propose a novel approach that provides a description of their composition in terms of sequences of expressions. To this end, we apply sequence analysis techniques to bring out the various strategies for introducing a referent and keeping it active throughout discourse. We discuss a first application of this method to a French written corpus annotated with coreference chains. We obtain clusters that are linguistically coherent and interpretable in terms of reference strategies and we demonstrate the influence of text genre and semantic type of the referent on chain composition.

2020

pdf bib abs

This paper describes our participation to the SMM4H shared task 2. We designed a rule-based classifier that estimates whether a tweet mentions an adverse effect associated to a medication. Our system addresses English and French, and is based on a number of specific word lists and features. These cues were mostly obtained through an extensive corpus analysis of the provided training data. Different weighting schemes were tested (manually tuned or based on a logistic regression), the best one achieving a F1 score of 0.31 for English and 0.15 for French.

pdf bib abs

E:Calm Resource: a Resource for Studying Texts Produced by French Pupils and Students
Lydia-Mai Ho-Dac | Serge Fleury | Claude Ponton
Proceedings of the Twelfth Language Resources and Evaluation Conference

The E:Calm resource is constructed from French student texts produced in a variety of usual contexts of teaching. The distinction of the E:Calm resource is to provide an ecological data set that gives a broad overview of texts written at elementary school, high school and university. This paper describes the whole data processing: encoding of the main graphical aspects of the handwritten primary sources according to the TEI-P5 norm; spelling standardizing; POS tagging and syntactic parsing evaluation.

2019

pdf bib abs

This paper presents the first results of a multidisciplinary project, the “Evolex” project, gathering researchers in Psycholinguistics, Neuropsychology, Computer Science, Natural Language Processing and Linguistics. The Evolex project aims at proposing a new data-based inductive method for automatically characterising the relation between pairs of french words collected in psycholinguistics experiments on lexical access. This method takes advantage of several complementary computational measures of semantic similarity. We show that some measures are more correlated than others with the frequency of lexical associations, and that they also differ in the way they capture different semantic relations. This allows us to consider building a multidimensional lexical similarity to automate the classification of lexical associations.

2017

pdf bib abs

A corpus-driven approach to discourse organisation: from cues to complex markers
Marie-Paule Péry-Woodley | Lydia-Mai Ho-Dac | Josette Rebeyrolle | Ludovic Tanguy | Cécile Fabre
Dialogue & Discourse Volume 8

This paper reports on an experiment implementing a data-intensive approach to discourse organisation. Its focus is on enumerative structures envisaged as a type of textual pattern in a sequentiality-oriented approach to discourse. On the basis of a large-scale annotation exercise calling upon automatic feature mark-up alongside manual annotation, we explore a method to identify complex discourse markers seen as configurations of cues. The presentation of the background to what is termed "multi-level annotation" is organised around four issues: linearity, complexity of discourse markers, top-down processing, granularity and the multi-level nature of discourse structures. In this context, enumerative structures seem to deserve scrutiny for a number of reasons: they are frequent structures appearing at different granularity levels, they are signalled by a variety of devices appearing to work together in complex ways, and they combine a textual role (discourse organisation) with an ideational role (categorisation). We describe the annotation procedure and experimental framework which resulted in nearly 1,000 enumerative structures being annotated in a diversified corpus of over 600,000 words. The results of two approaches to the rich data produced are then presented: firstly, a descriptive survey highlights considerable variation in length and composition, while showing enumerative structure to be a basic strategy resorted to in all three sub-corpora, and leads to a granularity-based typology of the annotated structures; secondly, recurrent cue configurations—our "complex markers"—are identified by the application of data mining methods. The paper ends with perspectives for further exploitation of the data, in particular with respect to the semantic characterisation of enumerative structures.

2016

pdf bib abs

L’anti-correcteur : outil d’évaluation positive de l’orthographe et de la grammaire (The ”anticorrecteur”: a positive evaluation module for spell and grammar checking)
Lydia-Mai Ho-Dac | Sophie Muller | Valentine Delbar
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

L’objectif de cette étude est d’expérimenter l’intégration d’une nouvelle forme d’évaluation dans un correcteur orthographique et grammatical. L’« anticorrecteur » a pour objet de mesurer le taux de réussites orthographiques et grammaticales d’un texte sur certains points jugés difficiles selon la littérature et une observation d’erreurs en corpus. L’évaluation du niveau d’écriture ne se base plus uniquement sur les erreurs commises, mais également sur les réussites réalisées. Une version bêta de ce nouveau mode d’évaluation positive a été intégré dans le correcteur Cordial. Cet article a pour but de discuter de l’intérêt de ce nouveau rapport à l’orthographe et de présenter quelques premiers éléments d’analyse résultant de l’application de l’anticorrecteur sur un corpus de productions variées en matière de niveau d’écriture et genre discursif. Ici, un résumé en français (max. 150 mots). Times, 10pt.

2014

pdf bib

Presentation of the SemDis 2014 workshop: distributional semantics for two tasks - lexical substitution and exploration of specialized corpora (Présentation de l’atelier SemDis 2014 : sémantique distributionnelle pour la substitution lexicale et l’exploration de corpus spécialisés) [in French]
Cécile Fabre | Nabil Hathout | Lydia-Mai Ho-Dac | François Morlane-Hondère | Philippe Muller | Franck Sajous | Ludovic Tanguy | Tim Van de Cruys
TALN-RECITAL 2014 Workshop SemDis 2014 : Enjeux actuels de la sémantique distributionnelle (SemDis 2014: Current Challenges in Distributional Semantics)

pdf bib

TALN-RECITAL 2014 Workshop SemDis 2014 : Enjeux actuels de la sémantique distributionnelle (SemDis 2014: Current Challenges in Distributional Semantics)
Cécile Fabre | Nabil Hathout | Lydia-Mai Ho-Dac | François Morlane-Hondère | Philippe Muller | Franck Sajous | Ludovic Tanguy | Tim Van de Cruys
TALN-RECITAL 2014 Workshop SemDis 2014 : Enjeux actuels de la sémantique distributionnelle (SemDis 2014: Current Challenges in Distributional Semantics)

2012

bib abs

This paper describes the ANNODIS resource, a discourse-level annotated corpus for French. The corpus combines two perspectives on discourse: a bottom-up approach and a top-down approach. The bottom-up view incrementally builds a structure from elementary discourse units, while the top-down view focuses on the selective annotation of multi-level discourse structures. The corpus is composed of texts that are diversified with respect to genre, length and type of discursive organisation. The methodology followed here involves an iterative design of annotation guidelines in order to reach satisfactory inter-annotator agreement levels. This allows us to raise a few issues relevant for the comparison of such complex objects as discourse structures. The corpus also serves as a source of empirical evidence for discourse theories. We present here two first analyses taking advantage of this new annotated corpus --one that tested hypotheses on constraints governing discourse structure, and another that studied the variations in composition and signalling of multi-level discourse structures.

2011

pdf bib

Le corpus ANNODIS, un corpus enrichi d’annotations discursives [The ANNODIS corpus, a corpus enriched with discourse annotations]
Marie-Paule Péry-Woodley | Stergos D. Afantenos | Lydia-Mai Ho-Dac | Nicholas Asher
Traitement Automatique des Langues, Volume 52, Numéro 3 : Ressources linguistiques libres [Free Language Resources]

2010

pdf bib abs

Anatomie des structures énumératives
Lydia-Mai Ho-Dac | Marie-Paule Péry-Woodley | Ludovic Tanguy
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Cet article présente les premiers résultats d’une campagne d’annotation de corpus à grande échelle réalisée dans le cadre du projet ANNODIS. Ces résultats concernent la partie descendante du dispositif d’annotation, et plus spécifiquement les structures énumératives. Nous nous intéressons à la structuration énumérative en tant que stratégie de base de mise en texte, apparaissant à différents niveaux de granularité, associée à différentes fonctions discursives, et signalée par des indices divers. Avant l’annotation manuelle, une étape de pré-traitement a permis d’obtenir le marquage systématique de traits associés à la signalisation de l’organisation du discours. Nous décrivons cette étape de marquage automatique, ainsi que la procédure d’annotation. Nous proposons ensuite une première typologie des structures énumératives basée sur la description quantitative des données annotées manuellement, prenant en compte la couverture textuelle, la composition et les types d’indices.

2009

Le projet ANNODIS vise la construction d’un corpus de textes annotés au niveau discursif ainsi que le développement d’outils pour l’annotation et l’exploitation de corpus. Les annotations adoptent deux points de vue complémentaires : une perspective ascendante part d’unités de discours minimales pour construire des structures complexes via un jeu de relations de discours ; une perspective descendante aborde le texte dans son entier et se base sur des indices pré-identifiés pour détecter des structures discursives de haut niveau. La construction du corpus est associée à la création de deux interfaces : la première assiste l’annotation manuelle des relations et structures discursives en permettant une visualisation du marquage issu des prétraitements ; une seconde sera destinée à l’exploitation des annotations. Nous présentons les modèles et protocoles d’annotation élaborés pour mettre en oeuvre, au travers de l’interface dédiée, la campagne d’annotation.

2003

pdf bib abs

Cet article concerne la structuration automatique de documents par des méthodes linguistiques. De telles procédures sont rendues nécessaires par les nouvelles tâches de recherche d’information intradocumentaires (systèmes de questions-réponses, navigation sélective dans des documents...). Nous développons une méthode exploitant la théorie de l’encadrement du discours de Charolles, avec une application visée en recherche d’information dans les documents géographiques - d’où l’intérêt tout particulier porté aux cadres spatiaux et temporels. Nous décrivons une implémentation de la méthode de délimitation de ces cadres et son exploitation pour une tâche d’indexation intratextuelle croisant les critères spatiaux et temporels avec des critères thématiques.