Bruno Guillaume

2025

pdf bib abs
A morpheme-based treebank for Gbaya, an Ubanguian language of Central Africa
Paulette Roulon-Doko | Sylvain Kahane | Bruno Guillaume
Proceedings of the Eighth International Conference on Dependency Linguistics (Depling, SyntaxFest 2025)

In this paper, we present the first treebank for Gbaya, a language from the under-resourced Niger-Congo family. The language has a rich system of tonal morphemes and virtually no affixes. The dependency analysis is based on a morpheme-based tokenisation and the treebank is also distributed in word-based Universal Dependencies version. Several constructions are discussed in the paper: genitive construction, clause coordination, sentence particles, adverbial and relative clauses, serial verb constructions, reported speech, topicalization, and focalization.

pdf bib abs
Evaluation Framework for Layered Meaning Representation
Rémi de Vergnette | Maxime Amblard | Bruno Guillaume
Proceedings of the Sixth International Workshop on Designing Meaning Representations

We propose different modular evaluation metrics for Layered Meaning Representation, defined as YARN, a semantic formalism encoded using rich structures that generalize AMR graphs. While existing metrics like SMATCH evaluate graph-based semantic representations such as AMR, they cannot directly handle YARN’s more complex structures. We make full use of the modular nature of YARN to propose two families of metrics, depending on the linguistic features and type of semantic phenomenon targeted. The first one, SMATCHY, extends the AMR SMATCH metric. We also propose YARNBLEU, based on the SEMBLEU metric for AMR. We evaluate both families on a small dataset of human annotated YARN structures, adding random modifications simulating annotation mistakes and show that SMATCHY provides a more consistent and reliable approach with respect to the type of modifications considered.

pdf bib abs
Creating a multi-layer Treebank for Tundra Nenets
Nikolett Mus | Bruno Guillaume | Sylvain Kahane | Daniel Zeman
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages

This paper presents the development of the Tundra Nenets Universal Dependencies (UD) Treebank, the first syntactically annotated resource for the Samoyedic branch of the Uralic family. The treebank integrates spoken-language data and adopts the morphologically enhanced Surface-Syntactic UD (mSUD) framework to capture inflectional morphology and morphology-based syntactic relations. It further incorporates Information Structure annotation. The methodological workflow includes data selection, transcription conventions, sentence and lexeme segmentation, annotation of spoken-language features, lemmatization, treatment of morpheme status, part-of-speech and morphological tagging, and syntactic annotation based on the functional and distributional properties of syntactic elements. We also outline the principles guiding multi-level annotation and justify the theoretical choices underlying the integration of prosodic, morphological, and syntactic information.

pdf bib abs
Extraction of Contrastive Rules from Syntactic Treebanks: A Case Study in Romance Languages
Santiago Herrera | Ioana-Madalina Silai | Caio Corro | Bruno Guillaume | Sylvain Kahane
Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025)

In this paper, we develop a data-driven contrastive framework to extract common and distinctive linguistic descriptions from syntactic treebanks. The extracted contrastive rules are defined by a statistically significant difference in precision and classified as common and distinctive rules across the set of treebanks. We illustrate our method by working on object word order using Universal Dependencies (UD) treebanks in 6 Romance languages: Brazilian Portuguese, Catalan, French, Italian, Romanian and Spanish. We discuss the limitations faced due to inconsistent annotation and the feasibility of conducting contrasting studies using the UD collection.

pdf bib abs
An intonosyntactic treebank for spoken French: What is new with Rhapsodie?
Maria Paz Botero-Garcia | Emmett Strickland | Bruno Guillaume | Sylvain Kahane | Anne Lacheret-Dujour
Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)

This paper presents a new format of the Rhapsodie Treebank, which contains both syntactic and prosodic annotations, offering a comprehensive dataset for the study of spoken French.This integrated format allow us for complex multilevel queries and open the way for the extraction of intonosyntactic studies.

pdf bib abs
Status of morphosyntactic features Illustration with written and spoken French UD treebanks
Sylvain Kahane | Bruno Guillaume | Léna Brun | Simeng Song
Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)

Morphosyntactic features used in UD treebanks have different status. If most of them correspond to values of inflectional morphemes, some describe lexical subclasses or are just conventional names of polysemic morphemes. Syncretism is also a challenge, because exact values are only deductible from contextual information. We propose an attempt at clarification and an implementation in the treebanks of written and spoken French.

2024

pdf bib abs
YARN is All You Knit: Encoding Multiple Semantic Phenomena with Layers
Siyana Pavlova | Maxime Amblard | Bruno Guillaume
Proceedings of the Fifth International Workshop on Designing Meaning Representations @ LREC-COLING 2024

In this paper, we present the first version of YARN, a new semantic representation formalism. We propose this new formalism to unify the advantages of logic-based formalisms while retaining direct interpretation, making it widely usable. YARN is rooted in the encoding of different semantic phenomena as separate layers. We begin by presenting a formal definition of the mathematical structure that constitutes YARN. We then illustrate with concrete examples how this structure can be used in the context of semantic representation for encoding multiple phenomena (such as modality, negation and quantification) as layers built on top of a central predicate-argument structure. The benefit of YARN is that it allows for the independent annotation and analysis of different phenomena as they are easy to “switch off”. Furthermore, we have explored YARN’s ability to encode simple interactions between phenomena. We wrap up the work presented by a discussion of some of the interesting observations made during the development of YARN so far and outline our extensive future plans for this formalism.

pdf bib abs
Hostomytho: A GWAP for Synthetic Clinical Texts Evaluation and Annotation
Nicolas Hiebel | Bertrand Remy | Bruno Guillaume | Olivier Ferret | Aurélie Névéol | Karen Fort
Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024

This paper presents the creation of Hostomytho, a game with a purpose intended for evaluating the quality of synthetic biomedical texts through multiple mini-games. Hostomytho was developed entirely using open source technologies both for internet browser and mobile platforms (IOS & Android). The code and the annotations created for synthetic clinical cases in French will be made freely available.

pdf bib abs
Au-delà de la performance des modèles : la prédiction de liens peut-elle enrichir des graphes lexico-sémantiques du français ?
Hee-Soo Choi | Priyansh Trivedi | Mathieu Constant | Karën Fort | Bruno Guillaume
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Cet article présente une étude sur l’utilisation de modèles de prédiction de liens pour l’enrichissement de graphes lexico-sémantiques du français. Celle-ci porte sur deux graphes, RezoJDM16k et RL-fr et sept modèles de prédiction de liens. Nous avons étudié les prédictions du modèle le plus performant, afin d’extraire de potentiels nouveaux triplets en utilisant un score de confiance que nous avons évalué avec des annotations manuelles. Nos résultats mettent en évidence des avantages différentspour le graphe dense RezoJDM16k par rapport à RL-fr, plus clairsemé. Si l’ajout de nouveaux triplets à RezoJDM16k offre des avantages limités, RL-fr peut bénéficier substantiellement de notre approche.

pdf bib abs
De nouvelles méthodes pour l’exploration de l’interface syntaxe-prosodie : un treebank intonosyntaxique et un système de synthèse pour le pidgin nigérian
Emmett Strickland | Anne Lacheret-Dujour | Marc Evrard | Sylvain Kahane | Dana Aubakirova | Dorin Doncenco | Diego Torres | Perrine Quennehen | Bruno Guillaume
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Cet article présente deux ressources récemment développées pour explorer l’interface prosodie-syntaxe en pidgin nigérian, une langue à faibles ressources d’Afrique de l’Ouest. La première est un treebank intonosyntaxique dans laquelle chaque token est associé à une série de caractéristiques prosodiques au niveau de la syllabe, ce qui permet d’analyser diverses structures syntaxiques et prosodiques en utilisant une même interface. La seconde est un système de synthèse de la parole entraîné sur le même ensemble de données, conçu pour permettre un contrôle direct sur les contours intonatifs de la parole générée. Cet outil a été développé pour nous permettre de tester les hypothèses formulées à partir de l’exploration du treebank. Cet article est largement une adaptation de deux publications récentes présentant chaque outil, avec un accent sur leur interconnexion dans notre recherche en cours.

pdf bib abs
Beyond Model Performance: Can Link Prediction Enrich French Lexical Graphs?
Hee-Soo Choi | Priyansh Trivedi | Mathieu Constant | Karen Fort | Bruno Guillaume
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper presents a resource-centric study of link prediction approaches over French lexical-semantic graphs. Our study incorporates two graphs, RezoJDM16k and RL-fr, and we evaluated seven link prediction models, with CompGCN-ConvE emerging as the best performer. We also conducted a qualitative analysis of the predictions using manual annotations. Based on this, we found that predictions with higher confidence scores were more valid for inclusion. Our findings highlight different benefits for the dense graph compared to the sparser graph RL-fr. While the addition of new triples to RezoJDM16k offers limited advantages, RL-fr can benefit substantially from our approach.

pdf bib abs
Joint Annotation of Morphology and Syntax in Dependency Treebanks
Bruno Guillaume | Kim Gerdes | Kirian Guiller | Sylvain Kahane | Yixuan Li
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we compare different ways to annotate both syntactic and morphological relations in a dependency treebank and we propose new formats we call mSUD and mUD, compatible with the Universal Dependencies (UD) schema for syntactic treebanks. We emphasize mSUD rather than mUD, the former being based on distributional criteria for the choice of the head of any combination, which allow us to clearly encode the internal structure of a word, that is, the derivational path. We investigate different problems posed by a morph-based annotation, concerning tokenization, choice of the head of a morph combination, relations between morphs, additional features needed, such as the token type differentiating roots and derivational and inflectional affixes. We show how our annotation schema can be applied to different languages from polysynthetic languages such as Yupik to isolating languages such as Chinese.

pdf bib abs
New Methods for Exploring Intonosyntax: Introducing an Intonosyntactic Treebank for Nigerian Pidgin
Emmett Strickland | Anne Lacheret-Dujour | Sylvain Kahane | Marc Evrard | Perrine Quennehen | Bernard Caron | Francis Egbokhare | Bruno Guillaume
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper presents a new phonetic resource for Nigerian Pidgin, a low-resource language of West Africa. Aiming to provide a new tool for research on intonosyntax, we have augmented an existing syntactic treebank of Nigerian Pidgin, associating each orthographically transcribed token with a series of syllable-level alignments and phonetizations. Syllables are further described using a set of continuous and discrete prosodic features. This new approach provides a simple tool for researchers to explore the prosodic characteristics of various syntactic phenomena. In this paper, we present the format of the corpus, the various features added, and several explorations that can be performed using an online interface. We also present a prosodically specified lexicon extracted using this resource. In it, each orthographic form is accompanied by the frequency of its phoneme-level variants, as well as the suprasegmental features that most frequently accompany each syllable. Finally, we present several additional case studies on how this corpus can used in the study of the language’s prosody.

This paper presents the objectives, organization and activities of the UniDive COST Action, a scientific network dedicated to universality, diversity and idiosyncrasy in language technology. We describe the objectives and organization of this initiative, the people involved, the working groups and the ongoing tasks and activities. This paper is also an pen call for participation towards new members and countries.

2023

pdf bib abs
Structural and Global Features for Comparing Semantic Representation Formalisms
Siyana Pavlova | Maxime Amblard | Bruno Guillaume
Proceedings of the Fourth International Workshop on Designing Meaning Representations

The area of designing semantic/meaning representations is a dynamic one with new formalisms and extensions being proposed continuously. It may be challenging for users of semantic representations to select the relevant formalism for their purpose or for newcomers to the field to select the features they want to represent in a new formalism. In this paper, we propose a set of structural and global features to consider when designing formalisms, and against which formalisms can be compared. We also propose a sample comparison of a number of existing formalisms across the selected features, complemented by a more entailment-oriented comparison on the phenomena of the FraCaS corpus.

pdf bib abs
Bridging Semantic Frameworks: mapping DRS onto AMR
Siyana Pavlova | Maxime Amblard | Bruno Guillaume
Proceedings of the 15th International Conference on Computational Semantics

A number of graph-based semantic representation frameworks have emerged in recent years, but there are few parallel annotated corpora across them. We want to explore the viability of transforming graphs from one framework into another to construct parallel datasets. In this work, we consider graph rewriting from Discourse Representation Structures (Parallel Meaning Bank (PMB) variant) to Abstract Meaning Representation (AMR). We first build a gold AMR corpus of 102 sentences from the PMB. We then construct a rule base, aided by a further 95 sentences. No benchmark for this task exists, so we compare our system’s output to that of state-of-the-art AMR parsers, and explore the more challenging cases. Finally, we discuss where the two frameworks diverge in encoding semantic phenomena.

pdf bib abs
Des ressources lexicales du français et de leur utilisation en TAL : étude des actes de TALN
Hee-Soo Choi | Karën Fort | Bruno Guillaume | Mathieu Constant
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : travaux de recherche originaux -- articles courts

Au début du XXIe siècle, le français faisait encore partie des langues peu dotées. Grâce aux efforts de la communauté française du traitement automatique des langues (TAL), de nombreuses ressources librement disponibles ont été produites, dont des lexiques du français. À travers cet article, nous nous intéressons à leur devenir dans la communauté par le prisme des actes de la conférence TALN sur une période de 20 ans.

pdf bib abs
Autogramm : développement simultané de treebanks et de grammaires à partir de corpus
Sylvain Kahane | Santiago Herrera | Bruno Guillaume | Kim Gerdes
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 6 : projets

Ce projet de recherche vise à créer de nouveaux treebanks en dépendance pour des langues sous-dotées, en unifiant autant que possible leur développement avec celui de grammaires descriptives quantitatives. Nous présenterons notre chaîne de traitement et de développement de treebanks et nous discuterons du type de grammaire que nous voulons extraire. Enfin, nous examinerons l’utilisation de ces ressources en typologie quantitative.

We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus represents 26 languages now. All monolingual corpora therein use Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts impact on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

pdf bib abs
Graph-based multi-layer querying in Parseme Corpora
Bruno Guillaume
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)

We present a graph-based tool which can be used to explore Verbal Multi-Word Expression (VMWE) annotated in the Parseme project. The tool can be used for linguistic exploration on the data, for helping the manual annotation process and to search for errors or inconsistencies in the annotations.

2022

pdf bib abs
Graph Querying for Semantic Annotations
Maxime Amblard | Bruno Guillaume | Siyana Pavlova | Guy Perrier
Proceedings of the 18th Joint ACL - ISO Workshop on Interoperable Semantic Annotation within LREC2022

This paper presents how the online tool Grew-match can be used to make queries and visualise data from existing semantically annotated corpora. A dedicated syntax is available to construct simple to complex queries and execute them against a corpus. Such queries give transverse views of the annotated data, this views can help for checking the consistency of annotations in one corpus or across several corpora. Grew-match can then be seen as an error mining tool: when inconsistencies are detected, it helps finding the sentences which should be fixed. Finally, Grew-match can also be used as a side tool to assist annotation task helping to find annotations examples in existing corpora to be compare to the data to be annotated.

pdf bib abs
How much of UCCA can be predicted from AMR?
Siyana Pavlova | Maxime Amblard | Bruno Guillaume
Proceedings of the 18th Joint ACL - ISO Workshop on Interoperable Semantic Annotation within LREC2022

In this paper, we consider two of the currently popular semantic frameworks: Abstract Meaning Representation (AMR) - a more abstract framework, and Universal Conceptual Cognitive Annotation (UCCA) - an anchored framework. We use a corpus-based approach to build two graph rewriting systems, a deterministic and a non-deterministic one, from the former to the latter framework. We present their evaluation and a number of ambiguities that we discovered while building our rules. Finally, we provide a discussion and some future work directions in relation to comparing semantic frameworks of different flavors.

This paper describes the continuation of a project that aims at establishing an interoperable annotation schema for quantification phenomena as part of the ISO suite of standards for semantic annotation, known as the Semantic Annotation Framework. After a break, caused by the Covid-19 pandemic, the project was relaunched in early 2022 with a second working draft of an annotation scheme, which is discussed in this paper. Keywords: semantic annotation, quantification, interoperability, annotation schema, ISO standard

2021

pdf bib
Starting a new treebank? Go SUD!
Kim Gerdes | Bruno Guillaume | Sylvain Kahane | Guy Perrier
Proceedings of the Sixth International Conference on Dependency Linguistics (Depling, SyntaxFest 2021)

pdf bib abs
Graph Matching and Graph Rewriting: GREW tools for corpus exploration, maintenance and conversion
Bruno Guillaume
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

This article presents a set of tools built around the Graph Rewriting computational framework which can be used to compute complex rule-based transformations on linguistic structures. Application of the graph matching mechanism for corpus exploration, error mining or quantitative typology are also given.

pdf bib abs
Graph Rewriting for Enhanced Universal Dependencies
Bruno Guillaume | Guy Perrier
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

This paper describes a system proposed for the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (EUD). We propose a Graph Rewriting based system for computing Enhanced Universal Dependencies, given the Basic Universal Dependencies (UD).

pdf bib
Corpus-based language universals analysis using Universal Dependencies
Hee-Soo Choi | Bruno Guillaume | Karën Fort
Proceedings of the Second Workshop on Quantitative Syntax (Quasy, SyntaxFest 2021)

pdf bib abs
Investigating Dominant Word Order on Universal Dependencies with Graph Rewriting
Hee-Soo Choi | Bruno Guillaume | Karën Fort | Guy Perrier
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

This paper details experiments we performed on the Universal Dependencies 2.7 corpora in order to investigate the dominant word order in the available languages. For this purpose, we used a graph rewriting tool, GREW, which allowed us to go beyond the surface annotations and identify the implicit subjects. We first measured the distribution of the six different word orders (SVO, SOV, VSO, VOS, OVS, OSV) in the corpora and investigated when there was a significant difference in the corpora within a given language. Then, we compared the obtained results with information provided in the WALS database (Dryer and Haspelmath, 2013) and in ( ̈Ostling, 2015). Finally, we examined the impact of using a graph rewriting tool for this task. The tools and resources used for this research are all freely available.

pdf bib
A morph-based and a word-based treebank for Beja
Sylvain Kahane | Martine Vanhove | Rayan Ziane | Bruno Guillaume
Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021)

2020

pdf bib abs
Rigor Mortis: Annotating MWEs with a Gamified Platform
Karën Fort | Bruno Guillaume | Yann-Alan Pilatte | Mathieu Constant | Nicolas Lefèbvre
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present here Rigor Mortis, a gamified crowdsourcing platform designed to evaluate the intuition of the speakers, then train them to annotate multi-word expressions (MWEs) in French corpora. We previously showed that the speakers’ intuition is reasonably good (65% in recall on non-fixed MWE). We detail here the annotation results, after a training phase using some of the tests developed in the PARSEME-FR project.

pdf bib abs
When Collaborative Treebank Curation Meets Graph Grammars
Gaël Guibon | Marine Courtin | Kim Gerdes | Bruno Guillaume
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper we present Arborator-Grew, a collaborative annotation tool for treebank development. Arborator-Grew combines the features of two preexisting tools: Arborator and Grew. Arborator is a widely used collaborative graphical online dependency treebank annotation tool. Grew is a tool for graph querying and rewriting specialized in structures needed in NLP, i.e. syntactic and semantic dependency trees and graphs. Grew also has an online version, Grew-match, where all Universal Dependencies treebanks in their classical, deep and surface-syntactic flavors can be queried. Arborator-Grew is a complete redevelopment and modernization of Arborator, replacing its own internal database storage by a new Grew API, which adds a powerful query tool to Arborator’s existing treebank creation and correction features. This includes complex access control for parallel expert and crowd-sourced annotation, tree comparison visualization, and various exercise modes for teaching and training of annotators. Arborator-Grew opens up new paths of collectively creating, updating, maintaining, and curating syntactic treebanks and semantic graph banks.

pdf bib abs
A French Version of the FraCaS Test Suite
Maxime Amblard | Clément Beysson | Philippe de Groote | Bruno Guillaume | Sylvain Pogodalla
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper presents a French version of the FraCaS test suite. This test suite, originally written in English, contains problems illustrating semantic inference in natural language. We describe linguistic choices we had to make when translating the FraCaS test suite in French, and discuss some of the issues that were raised by the translation. We also report an experiment we ran in order to test both the translation and the logical semantics underlying the problems of the test suite. This provides a way of checking formal semanticists’ hypotheses against actual semantic capacity of speakers (in the present case, French speakers), and allow us to compare the results we obtained with the ones of similar experiments that have been conducted for other languages.

We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.

2019

pdf bib
Conversion et améliorations de corpus du français annotés en Universal Dependencies [Conversion and Improvement of Universal Dependencies French corpora]
Bruno Guillaume | Marie-Catherine de Marneffe | Guy Perrier
Traitement Automatique des Langues, Volume 60, Numéro 2 : Corpus annotés [Annotated corpora]

pdf bib
Improving Surface-syntactic Universal Dependencies (SUD): MWEs and deep syntactic features
Kim Gerdes | Bruno Guillaume | Sylvain Kahane | Guy Perrier
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

2018

pdf bib abs
“Fingers in the Nose”: Evaluating Speakers’ Identification of Multi-Word Expressions Using a Slightly Gamified Crowdsourcing Platform
Karën Fort | Bruno Guillaume | Matthieu Constant | Nicolas Lefèbvre | Yann-Alan Pilatte
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This article presents the results we obtained in crowdsourcing French speakers’ intuition concerning multi-work expressions (MWEs). We developed a slightly gamified crowdsourcing platform, part of which is designed to test users’ ability to identify MWEs with no prior training. The participants perform relatively well at the task, with a recall reaching 65% for MWEs that do not behave as function words.

pdf bib abs
SUD or Surface-Syntactic Universal Dependencies: An annotation scheme near-isomorphic to UD
Kim Gerdes | Bruno Guillaume | Sylvain Kahane | Guy Perrier
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

This article proposes a surface-syntactic annotation scheme called SUD that is near-isomorphic to the Universal Dependencies (UD) annotation scheme while following distributional criteria for defining the dependency tree structure and the naming of the syntactic functions. Rule-based graph transformation grammars allow for a bi-directional transformation of UD into SUD. The back-and-forth transformation can serve as an error-mining tool to assure the intra-language and inter-language coherence of the UD treebanks.

2017

pdf bib abs
Vers l’annotation par le jeu de corpus (plus) complexes : le cas de la langue de spécialité (Towards (more) complex corpora annotation using a game with a purpose : the case of scientific language)
Karën Fort | Bruno Guillaume | Nicolas Lefebvre | Laura Ramírez | Mathilde Regnault | Mary Collins | Oksana Gavrilova | Tanti Kristanti
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Nous avons précédemment montré qu’il est possible de faire produire des annotations syntaxiques de qualité par des participants à un jeu ayant un but. Nous présentons ici les résultats d’une expérience visant à évaluer leur production sur un corpus plus complexe, en langue de spécialité, en l’occurrence un corpus de textes scientifiques sur l’ADN. Nous déterminons précisément la complexité de ce corpus, puis nous évaluons les annotations en syntaxe de dépendances produites par les joueurs par rapport à une référence mise au point par des experts du domaine.

pdf bib
Enhanced UD Dependencies with Neutralized Diathesis Alternation
Marie Candito | Bruno Guillaume | Guy Perrier | Djamé Seddah
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

2016

pdf bib abs
Crowdsourcing Complex Language Resources: Playing to Annotate Dependency Syntax
Bruno Guillaume | Karën Fort | Nicolas Lefebvre
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This article presents the results we obtained on a complex annotation task (that of dependency syntax) using a specifically designed Game with a Purpose, ZombiLingo. We show that with suitable mechanisms (decomposition of the task, training of the players and regular control of the annotation quality during the game), it is possible to obtain annotations whose quality is significantly higher than that obtainable with a parser, provided that enough players participate. The source code of the game and the resulting annotated corpora (for French) are freely available.

2015

pdf bib abs
Recherche de motifs de graphe en ligne
Bruno Guillaume
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Nous présentons un outil en ligne de recherche de graphes dans des corpus annotés en syntaxe.

pdf bib
Dependency Parsing with Graph Rewriting
Bruno Guillaume | Guy Perrier
Proceedings of the 14th International Conference on Parsing Technologies

2014

pdf bib
Annotation scheme for deep dependency syntax of French (Un schéma d’annotation en dépendances syntaxiques profondes pour le français) [in French]
Guy Perrier | Marie Candito | Bruno Guillaume | Corentin Ribeyre | Karën Fort | Djamé Seddah
Proceedings of TALN 2014 (Volume 2: Short Papers)

pdf bib
ZOMBILINGO: eating heads to perform dependency syntax annotation (ZOMBILINGO : manger des têtes pour annoter en syntaxe de dépendances) [in French]
Karën Fort | Bruno Guillaume | Valentin Stern
Proceedings of TALN 2014 (Volume 3: System Demonstrations)

We define a deep syntactic representation scheme for French, which abstracts away from surface syntactic variation and diathesis alternations, and describe the annotation of deep syntactic representations on top of the surface dependency trees of the Sequoia corpus. The resulting deep-annotated corpus, named deep-sequoia, is freely available, and hopefully useful for corpus linguistics studies and for training deep analyzers to prepare semantic analysis.

pdf bib abs
Mapping the Lexique des Verbes du Français (Lexicon of French Verbs) to a NLP lexicon using examples
Bruno Guillaume | Karën Fort | Guy Perrier | Paul Bédaride
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article presents experiments aiming at mapping the Lexique des Verbes du Français (Lexicon of French Verbs) to FRILEX, a Natural Language Processing (NLP) lexicon based on D ICOVALENCE. The two resources (Lexicon of French Verbs and D ICOVALENCE) were built by linguists, based on very different theories, which makes a direct mapping nearly impossible. We chose to use the examples provided in one of the resource to find implicit links between the two and make them explicit.

Nous montrons comment enrichir une annotation en dépendances syntaxiques au format du French Treebank de Paris 7 en utilisant la réécriture de graphes, en vue du calcul de sa représentation sémantique. Le système de réécriture est composé de règles grammaticales et lexicales structurées en modules. Les règles lexicales utilisent une information de contrôle extraite du lexique des verbes français Dicovalence.

pdf bib
Modular Graph Rewriting to Compute Semantics
Guillaume Bonfante | Bruno Guillaume | Mathieu Morey | Guy Perrier
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)

2010

pdf bib abs
Motifs de graphe pour le calcul de dépendances syntaxiques complètes
Jonathan Marchand | Bruno Guillaume | Guy Perrier
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Cet article propose une méthode pour calculer les dépendances syntaxiques d’un énoncé à partir du processus d’analyse en constituants. L’objectif est d’obtenir des dépendances complètes c’est-à-dire contenant toutes les informations nécessaires à la construction de la sémantique. Pour l’analyse en constituants, on utilise le formalisme des grammaires d’interaction : celui-ci place au cœur de la composition syntaxique un mécanisme de saturation de polarités qui peut s’interpréter comme la réalisation d’une relation de dépendance. Formellement, on utilise la notion de motifs de graphes au sens de la réécriture de graphes pour décrire les conditions nécessaires à la création d’une dépendance.

pdf bib abs
Réécriture de graphes de dépendances pour l’interface syntaxe-sémantique
Guillaume Bonfante | Bruno Guillaume | Mathieu Morey | Guy Perrier
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Nous définissons le beta-calcul, un calcul de réécriture de graphes, que nous proposons d’utiliser pour étudier les liens entre différentes représentations linguistiques. Nous montrons comment transformer une analyse syntaxique en une représentation sémantique par la composition de deux jeux de règles de beta-calcul. Le premier souligne l’importance de certaines informations syntaxiques pour le calcul de la sémantique et explicite le lien entre syntaxe et sémantique sous-spécifiée. Le second décompose la recherche de modèles pour les représentations sémantiques sous-spécifiées.

pdf bib abs
LEOPAR, un analyseur syntaxique pour les grammaires d’interaction
Bruno Guillaume | Guy Perrier
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Nous présentons ici l’analyseur syntaxique LEOPAR basé sur les grammaires d’interaction ainsi que d’autres outils utiles pour notre chaîne de traitement syntaxique.

2009

pdf bib abs
Analyse en dépendances à l’aide des grammaires d’interaction
Jonathan Marchand | Bruno Guillaume | Guy Perrier
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Cet article propose une méthode pour extraire une analyse en dépendances d’un énoncé à partir de son analyse en constituants avec les grammaires d’interaction. Les grammaires d’interaction sont un formalisme grammatical qui exprime l’interaction entre les mots à l’aide d’un système de polarités. Le mécanisme de composition syntaxique est régi par la saturation des polarités. Les interactions s’effectuent entre les constituants, mais les grammaires étant lexicalisées, ces interactions peuvent se traduire sur les mots. La saturation des polarités lors de l’analyse syntaxique d’un énoncé permet d’extraire des relations de dépendances entre les mots, chaque dépendance étant réalisée par une saturation. Les structures de dépendances ainsi obtenues peuvent être vues comme un raffinement de l’analyse habituellement effectuée sous forme d’arbre de dépendance. Plus généralement, ce travail apporte un éclairage nouveau sur les liens entre analyse en constituants et analyse en dépendances.

pdf bib
Interaction Grammar for the Persian Language: Noun and Adjectival Phrases
Masood Ghayoomi | Bruno Guillaume
Proceedings of the 7th Workshop on Asian Language Resources (ALR7)

pdf bib
Dependency Constraints for Lexical Disambiguation
Guillaume Bonfante | Bruno Guillaume | Mathieu Morey
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

2008

pdf bib abs
Sylva : plate-forme de validation multi-niveaux de lexiques
Karën Fort | Bruno Guillaume
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

La production de lexiques est une activité indispensable mais complexe, qui nécessite, quelle que soit la méthode de création utilisée (acquisition automatique ou manuelle), une validation humaine. Nous proposons dans ce but une plate-forme Web librement disponible, appelée Sylva (Systematic lexicon validator). Cette plate-forme a pour caractéristiques principales de permettre une validation multi-niveaux (par des validateurs, puis un expert) et une traçabilité de la ressource. La tâche de l’expert(e) linguiste en est allégée puisqu’il ne lui reste à considérer que les données sur lesquelles il n’y a pas d’accord inter-validateurs.

2007

pdf bib abs
PrepLex : un lexique des prépositions du français pour l’analyse syntaxique
Karën Fort | Bruno Guillaume
Actes de la 14ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

PrepLex est un lexique des prépositions du français. Il contient les informations utiles à des systèmes d’analyse syntaxique. Il a été construit en comparant puis fusionnant différentes sources d’informations lexicales disponibles. Ce lexique met également en évidence les prépositions ou classes de prépositions qui apparaissent dans la définition des cadres de sous-catégorisation des ressources lexicales qui décrivent la valence des verbes.

pdf bib
PrepLex: A Lexicon of French Prepositions for Parsing
Karën Fort | Bruno Guillaume
Proceedings of the Fourth ACL-SIGSEM Workshop on Prepositions

2006

pdf bib abs
Extraction d’information de sous-catégorisation à partir des tables du LADL
Claire Gardent | Bruno Guillaume | Guy Perrier | Ingrid Falk
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Les tables du LADL (Laboratoire d’Automatique Documentaire et Linguistique) contiennent des données électroniques extensives sur les propriétés morphosyntaxiques et syntaxiques des foncteurs syntaxiques du français (verbes, noms, adjectifs). Ces données, dont on sait qu’elles sont nécessaires pour le bon fonctionnement des systèmes de traitement automatique des langues, ne sont cependant que peu utilisées par les systèmes actuels. Dans cet article, nous identifions les raisons de cette lacune et nous proposons une méthode de conversion des tables vers un format mieux approprié au traitement automatique des langues.