2024
Réductions temporelles en français parlé : Où peut-on trouver les zones de réduction ?
Yaru Wu | Kim Gerdes | Martine Adda-Decker
Actes des 35èmes Journées d'Études sur la Parole
This article examines reduction in continuous French speech and the factors that contribute to it, such as speech style, speech rate, word category, the position of the phone within the word, and the position of the word within syntactic groups. The study uses three corpora of continuous French speech, covering formal, less formal, and casual speech. The method combines forced alignment with automatic labeling of reduction zones. The results suggest that speech reduction is present in all speech styles, though less frequent in formal speech, and that reduction is more likely to be observed in utterances with a high speech rate. The medial position of words or syntactic groups tends to favor reduction.
PatentEval: Understanding Errors in Patent Generation
You Zuo | Kim Gerdes | Éric Clergerie | Benoît Sagot
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We have also developed a benchmark, PatentEval, for systematically assessing language models in this context. Our study includes a comparative analysis, annotated by humans, of various models. These range from those specifically adapted during training for tasks within the patent domain to the latest general-purpose large language models (LLMs). Furthermore, we explored and evaluated some metrics to approximate human judgments in patent text evaluation, analyzing the extent to which these metrics align with expert assessments. These approaches provide valuable insights into the capabilities and limitations of current language models in the specialized field of patent text generation.
Joint Annotation of Morphology and Syntax in Dependency Treebanks
Bruno Guillaume | Kim Gerdes | Kirian Guiller | Sylvain Kahane | Yixuan Li
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In this paper, we compare different ways to annotate both syntactic and morphological relations in a dependency treebank and we propose new formats we call mSUD and mUD, compatible with the Universal Dependencies (UD) schema for syntactic treebanks. We emphasize mSUD rather than mUD, the former being based on distributional criteria for the choice of the head of any combination, which allow us to clearly encode the internal structure of a word, that is, the derivational path. We investigate different problems posed by a morph-based annotation, concerning tokenization, choice of the head of a morph combination, relations between morphs, additional features needed, such as the token type differentiating roots and derivational and inflectional affixes. We show how our annotation schema can be applied to different languages from polysynthetic languages such as Yupik to isolating languages such as Chinese.
2023
Exploring Data-Centric Strategies for French Patent Classification: A Baseline and Comparisons
You Zuo | Benoît Sagot | Kim Gerdes | Houda Mouzoun | Samir Ghamri Doudane
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs
This paper proposes a novel approach to French patent classification leveraging data-centric strategies. We compare different approaches for the two deepest levels of the IPC hierarchy: the IPC group and subgroups. Our experiments show that while simple ensemble strategies work for shallower levels, deeper levels require more sophisticated techniques such as data augmentation, clustering, and negative sampling. Our research highlights the importance of language-specific features and data-centric strategies for accurate and reliable French patent classification. It provides valuable insights and solutions for researchers and practitioners in the field of patent classification, advancing research in French patent classification.
Autogramm : développement simultané de treebanks et de grammaires à partir de corpus
Sylvain Kahane | Santiago Herrera | Bruno Guillaume | Kim Gerdes
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 6 : projets
This research project aims to create new dependency treebanks for under-resourced languages, unifying their development as far as possible with that of quantitative descriptive grammars. We will present our treebank processing and development pipeline and discuss the type of grammar we want to extract. Finally, we will examine the use of these resources in quantitative typology.
Annotating Discursive Roles of Sentences in Patent Descriptions
Lufei Liu | Xu Sun | François Veltz | Kim Gerdes
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)
Patent descriptions are a crucial component of patent applications, as they are key to understanding the invention and play a significant role in securing patent grants. While discursive analyses have been undertaken for scientific articles, they have not been as thoroughly explored for patent descriptions, despite the increasing importance of Intellectual Property and the constant rise in the number of patent applications. In this study, we propose an annotation scheme containing 16 classes that allows categorizing each sentence in patent descriptions according to its discursive role. We publish an experimental human-annotated corpus of 16 patent descriptions and analyze challenges that may be encountered in such work. This work can serve as a basis for automated annotation and thus contribute to enriching linguistic resources in the patent domain.
Word order flexibility: a typometric study
Sylvain Kahane | Ziqian Peng | Kim Gerdes
Proceedings of the Seventh International Conference on Dependency Linguistics (Depling, GURT/SyntaxFest 2023)
This paper introduces a typometric measure of flexibility, which quantifies the variability of head-dependent word order on the whole set of treebanks of a language or on specific constructions. The measure is based on the notion of head-initiality and we show that it can be computed for all languages of the Universal Dependencies treebank set, that it does not require ad-hoc thresholds to categorize languages or constructions, and that it can be applied with any granularity of constructions and languages. We compare our results with Bakker’s (1998) categorical flexibility index. Typometric flexibility is shown to be a good measure for characterizing the language distribution with respect to word order for a given construction, and for estimating whether a construction predicts the global word order behavior of a language.
2021
Starting a new treebank? Go SUD!
Kim Gerdes | Bruno Guillaume | Sylvain Kahane | Guy Perrier
Proceedings of the Sixth International Conference on Dependency Linguistics (Depling, SyntaxFest 2021)
Annotation guidelines of UD and SUD treebanks for spoken corpora: A proposal
Sylvain Kahane | Bernard Caron | Emmett Strickland | Kim Gerdes
Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021)
2020
When Collaborative Treebank Curation Meets Graph Grammars
Gaël Guibon | Marine Courtin | Kim Gerdes | Bruno Guillaume
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper we present Arborator-Grew, a collaborative annotation tool for treebank development. Arborator-Grew combines the features of two preexisting tools: Arborator and Grew. Arborator is a widely used collaborative graphical online dependency treebank annotation tool. Grew is a tool for graph querying and rewriting specialized in structures needed in NLP, i.e. syntactic and semantic dependency trees and graphs. Grew also has an online version, Grew-match, where all Universal Dependencies treebanks in their classical, deep and surface-syntactic flavors can be queried. Arborator-Grew is a complete redevelopment and modernization of Arborator, replacing its own internal database storage by a new Grew API, which adds a powerful query tool to Arborator’s existing treebank creation and correction features. This includes complex access control for parallel expert and crowd-sourced annotation, tree comparison visualization, and various exercise modes for teaching and training of annotators. Arborator-Grew opens up new paths of collectively creating, updating, maintaining, and curating syntactic treebanks and semantic graph banks.
2019
Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019)
Kim Gerdes | Sylvain Kahane
Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019)
A Surface-Syntactic UD Treebank for Naija
Bernard Caron | Marine Courtin | Kim Gerdes | Sylvain Kahane
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)
Improving Surface-syntactic Universal Dependencies (SUD): MWEs and deep syntactic features
Kim Gerdes | Bruno Guillaume | Sylvain Kahane | Guy Perrier
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)
The relation between dependency distance and frequency
Xinying Chen | Kim Gerdes
Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019)
Rediscovering Greenberg’s Word Order Universals in UD
Kim Gerdes | Sylvain Kahane | Xinying Chen
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)
2018
SUD or Surface-Syntactic Universal Dependencies: An annotation scheme near-isomorphic to UD
Kim Gerdes | Bruno Guillaume | Sylvain Kahane | Guy Perrier
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)
This article proposes a surface-syntactic annotation scheme called SUD that is near-isomorphic to the Universal Dependencies (UD) annotation scheme while following distributional criteria for defining the dependency tree structure and the naming of the syntactic functions. Rule-based graph transformation grammars allow for a bi-directional transformation of UD into SUD. The back-and-forth transformation can serve as an error-mining tool to assure the intra-language and inter-language coherence of the UD treebanks.
2017
Classifying Languages by Dependency Structure. Typologies of Delexicalized Universal Dependency Treebanks
Xinying Chen | Kim Gerdes
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)
Quantitative Comparative Syntax on the Cantonese-Mandarin Parallel Dependency Treebank
Tak-sum Wong | Kim Gerdes | Herman Leung | John Lee
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)
Multi-word annotation in syntactic treebanks - Propositions for Universal Dependencies
Sylvain Kahane | Marine Courtin | Kim Gerdes
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories
2016
Dependency Annotation Choices: Assessing Theoretical and Practical Issues of Universal Dependencies
Kim Gerdes | Sylvain Kahane
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)
Developing Universal Dependencies for Mandarin Chinese
Herman Leung | Rafaël Poiret | Tak-sum Wong | Xinying Chen | Kim Gerdes | John Lee
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)
This article proposes a Universal Dependency Annotation Scheme for Mandarin Chinese, including POS tags and dependency analysis. We identify cases of idiosyncrasy of Mandarin Chinese that are difficult to fit into the current schema which has mainly been based on the descriptions of various Indo-European languages. We discuss differences between our scheme and those of the Stanford Chinese Dependencies and the Chinese Dependency Treebank.
2015
Analyse syntaxique de l’ancien français : quelles propriétés de la langue influent le plus sur la qualité de l’apprentissage ?
Gaël Guibon | Isabelle Tellier | Sophie Prévost | Matthieu Constant | Kim Gerdes
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
This article presents the results of machine learning experiments on part-of-speech tagging and dependency parsing of Old French. The experiments are intended to support corpus exploration, with the SRCMF treebank serving as reference data. Because the language used in it is weakly standardized, the training data are heterogeneous and limited in quantity. We therefore explore various strategies, based on different criteria (lexical variability, verse/prose form of the texts, dates of the texts), for building training corpora that lead to the best possible results.
Classifying Syntactic Categories in the Chinese Dependency Network
Xinying Chen | Haitao Liu | Kim Gerdes
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)
Non-constituent coordination and other coordinative constructions as Dependency Graphs
Kim Gerdes | Sylvain Kahane
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)
2014
Rhapsodie: a Prosodic-Syntactic Treebank for Spoken French
Anne Lacheret | Sylvain Kahane | Julie Beliao | Anne Dister | Kim Gerdes | Jean-Philippe Goldman | Nicolas Obin | Paola Pietrandrea | Atanas Tchobanov
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The main objective of the Rhapsodie project (ANR Rhapsodie 07 Corp-030-01) was to define rich, explicit, and reproducible schemes for the annotation of prosody and syntax in different genres (± spontaneous, ± planned, face-to-face interviews vs. broadcast, etc.), in order to study the prosody/syntax/discourse interface in spoken French, and their roles in the segmentation of speech into discourse units (Lacheret, Kahane, & Pietrandrea forthcoming). We here describe the deliverable, a syntactic and prosodic treebank of spoken French, composed of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and 33000 words), orthographically and phonetically transcribed. The transcriptions and the annotations are all aligned on the speech signal: phonemes, syllables, words, speakers, overlaps. This resource is freely available at www.projet-rhapsodie.fr. The sound samples (wav/mp3), the acoustic analysis (original F0 curve manually corrected and automatic stylized F0, pitch format), the orthographic transcriptions (txt), the microsyntactic annotations (tabular format), the macrosyntactic annotations (txt, tabular format), the prosodic annotations (xml, textgrid, tabular format), and the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons licence Attribution - Noncommercial - Share Alike 3.0 France. The metadata are encoded in the IMDI-CMFI format and can be parsed on line.
Correcting and Validating Syntactic Dependency in the Spoken French Treebank Rhapsodie
Rachel Bawden | Marie-Amélie Botalla | Kim Gerdes | Sylvain Kahane
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This article presents the methods, results, and precision of the syntactic annotation process of the Rhapsodie Treebank of spoken French. The Rhapsodie Treebank is a 33,000-word corpus annotated for prosody and syntax, licensed in its entirety under Creative Commons. The syntactic annotation contains two levels: a macro-syntactic level, containing a segmentation into illocutionary units (including discourse markers, parentheses, …), and a micro-syntactic level including dependency relations and various paradigmatic structures, called pile constructions, the latter being particularly frequent and diverse in spoken language. The micro-syntactic annotation process, presented in this paper, includes a semi-automatic preparation of the transcription, the application of a syntactic dependency parser, transcoding of the parsing results to the Rhapsodie annotation scheme, manual correction by multiple annotators followed by a validation process, and finally the application of coherence rules that check common errors. The good inter-annotator agreement scores are presented and analyzed in greater detail. The article also includes the list of functions used in the dependency annotation and for the distinction of various pile constructions and presents the ideas underlying these choices.
2013
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)
Eva Hajičová | Kim Gerdes | Leo Wanner
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)
Collaborative Dependency Annotation
Kim Gerdes
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)
2012
Intonosyntactic Data Structures: The Rhapsodie Treebank of Spoken French
Kim Gerdes | Sylvain Kahane | Anne Lacheret | Paola Pietrandrea | Arthur Truong
Proceedings of the Sixth Linguistic Annotation Workshop
2010
Depends on What the French Say - Spoken Corpus Annotation with and beyond Syntactic Functions
José Deulofeu | Lucie Duffort | Kim Gerdes | Sylvain Kahane | Paola Pietrandrea
Proceedings of the Fourth Linguistic Annotation Workshop
2009
Grammaires d’erreur – correction grammaticale avec analyse profonde et proposition de corrections minimales
Lionel Clément | Kim Gerdes | Renaud Marlet
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts
We present an open grammar correction system based on deep syntactic analyses. The grammatical specification is a context-free grammar equipped with flat feature structures. After a shared-forest parse in which feature agreement constraints are relaxed, error detection globally minimizes the corrections to be made, and correct alternative sentences are automatically proposed.
2006
A Polynomial Parsing Algorithm for the Topological Model: Synchronizing Constituent and Dependency Grammars, Illustrated by German Word Order Phenomena
Kim Gerdes | Sylvain Kahane
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics
2003
La topologie comme interface entre syntaxe et prosodie : un système de génération appliqué au grec moderne
Kim Gerdes | Hi-Yon Yoo
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
In this article, we develop the syntactic and topological modules of the Meaning-Text model and show the usefulness of topology as an intermediate representation between the syntactic and phonological representations. The model is implemented in a generator, and we present the grammar of Modern Greek in this approach.
2002
DTAG?
Kim Gerdes
Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+6)
2001
Word Order in German: A Formal Dependency Grammar Using a Topological Hierarchy
Kim Gerdes | Sylvain Kahane
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics