Benoit Crabbé

Also published as: Benoît Crabbé

2024

pdf bib abs
NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data
Sergei Bogdanov | Alexandre Constantin | Timothée Bernard | Benoit Crabbé | Etienne P Bernard
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have shown impressive abilities in data annotation, opening the way for new approaches to solve classic NLP problems. In this paper, we show how to use LLMs to create NuNER, a compact language representation model specialized in the Named Entity Recognition (NER) task. NuNER can be fine-tuned to solve downstream NER problems in a data-efficient way, outperforming similar-sized foundation models in the few-shot regime and competing with much larger LLMs. We find that the size and entity-type diversity of the pre-training dataset are key to achieving good performance. We view NuNER as a member of the broader family of task-specific foundation models, recently unlocked by LLMs. NuNER and NuNER’s dataset are open-sourced with MIT License.

pdf bib abs
CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow
Nathanaël Beau | Benoit Crabbé
Findings of the Association for Computational Linguistics: ACL 2024

We introduce a novel dataset tailored for code generation, aimed at aiding developers in common tasks. Our dataset provides examples that include a clarified intent, code snippets associated, and an average of three related unit tests. It encompasses a range of libraries such as Pandas, Numpy, and Regex, along with more than 70 standard libraries in Python code derived from Stack Overflow. Comprising 3,402 crafted examples by Python experts, our dataset is designed for both model finetuning and standalone evaluation. To complete unit tests evaluation, we categorize examples in order to get more fine grained analysis, enhancing the understanding of models’ strengths and weaknesses in specific coding tasks. The examples have been refined to reduce data contamination, a process confirmed by the performance of three leading models: Mistral 7B, CodeLLAMA 13B, and Starcoder 15B. We further investigate data-contamination testing GPT-4 performance on a part of our dataset. The benchmark can be accessed at anonymized address.

pdf bib abs
Improving Word Sense Induction through Adversarial Forgetting of Morphosyntactic Information
Deniz Ekin Yavas | Timothée Bernard | Laura Kallmeyer | Benoît Crabbé
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

This paper addresses the problem of word sense induction (WSI) via clustering of word embeddings. It starts from the hypothesis that contextualized word representations obtained from pre-trained language models (LMs), while being a valuable source for WSI, encode more information than what is necessary for the identification of word senses and some of this information affect the performance negatively in unsupervised settings. We investigate whether using contextualized representations that are invariant to these ‘nuisance features’ can increase WSI performance. For this purpose, we propose an adaptation of the adversarial training framework proposed by Jaiswal et al. (2020) to erase specific information from the representations of LMs, thereby creating feature-invariant representations. We experiment with erasing (i) morphological and (ii) syntactic features. The results of subsequent clustering for WSI show that these features indeed act like noise: Using feature-invariant representations, compared to using the original representations, increases clustering-based WSI performance. Furthermore, we provide an in-depth analysis of how the information about the syntactic and morphological features of words relate to and affect WSI performance.

2023

pdf bib abs
Assessing the Capacity of Transformer to Abstract Syntactic Representations: A Contrastive Analysis Based on Long-distance Agreement
Bingzhi Li | Guillaume Wisniewski | Benoît Crabbé
Transactions of the Association for Computational Linguistics, Volume 11

Many studies have shown that transformers are able to predict subject-verb agreement, demonstrating their ability to uncover an abstract representation of the sentence in an unsupervised way. Recently, Li et al. (2021) found that transformers were also able to predict the object-past participle agreement in French, the modeling of which in formal grammar is fundamentally different from that of subject-verb agreement and relies on a movement and an anaphora resolution. To better understand transformers’ internal working, we propose to contrast how they handle these two kinds of agreement. Using probing and counterfactual analysis methods, our experiments on French agreements show that (i) the agreement task suffers from several confounders that partially question the conclusions drawn so far and (ii) transformers handle subject-verb and object-past participle agreements in a way that is consistent with their modeling in theoretical linguistics.

2022

pdf bib abs
How Distributed are Distributed Representations? An Observation on the Locality of Syntactic Information in Verb Agreement Tasks
Bingzhi Li | Guillaume Wisniewski | Benoit Crabbé
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This work addresses the question of the localization of syntactic information encoded in the transformers representations. We tackle this question from two perspectives, considering the object-past participle agreement in French, by identifying, first, in which part of the sentence and, second, in which part of the representation the syntactic information is encoded. The results of our experiments, using probing, causal analysis and feature selection method, show that syntactic information is encoded locally in a way consistent with the French grammar.

pdf bib abs
The impact of lexical and grammatical processing on generating code from natural language
Nathanaël Beau | Benoit Crabbé
Findings of the Association for Computational Linguistics: ACL 2022

Considering the seq2seq architecture of Yin and Neubig (2018) for natural language to code translation, we identify four key components of importance: grammatical constraints, lexical preprocessing, input representations, and copy mechanisms. To study the impact of these components, we use a state-of-the-art architecture that relies on BERT encoder and a grammar-based decoder for which a formalization is provided. The paper highlights the importance of the lexical substitution component in the current natural language to code systems.

pdf bib abs
Les représentations distribuées sont-elles vraiment distribuées ? Observations sur la localisation de l’information syntaxique dans les tâches d’accord du verbe en français (How Distributed are Distributed Representations ? An Observation on the Locality of Syntactic)
Bingzhi Li | Guillaume Wisniewski | Benoît Crabbé
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Ce travail aborde la question de la localisation de l’information syntaxique qui est encodée dans les représentations de transformers. En considérant la tâche d’accord objet-participe passé en français, les résultats de nos sondes linguistiques montrent que les informations nécessaires pour accomplir la tâche sont encodées d’une manière locale dans les représentations de mots entre l’antécédent du pronom relatif objet et le participe passé cible. En plus, notre analyse causale montre que le modèle s’appuie essentiellement sur les éléments linguistiquement motivés (i.e. antécédent et pronom relatif) pour prédire le nombre du participe passé.

The successes of contextual word embeddings learned by training large-scale language models, while remarkable, have mostly occurred for languages where significant amounts of raw texts are available and where annotated data in downstream tasks have a relatively regular spelling. Conversely, it is not yet completely clear if these models are also well suited for lesser-resourced and more irregular languages. We study the case of Old French, which is in the interesting position of having relatively limited amount of available raw text, but enough annotated resources to assess the relevance of contextual word embedding models for downstream NLP tasks. In particular, we use POS-tagging and dependency parsing to evaluate the quality of such models in a large array of configurations, including models trained from scratch from small amounts of raw text and models pre-trained on other languages but fine-tuned on Medieval French data.

pdf bib abs
Unifying Parsing and Tree-Structured Models for Generating Sentence Semantic Representations
Antoine Simoulin | Benoit Crabbé
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

We introduce a novel tree-based model that learns its composition function together with its structure. The architecture produces sentence embeddings by composing words according to an induced syntactic tree. The parsing and the composition functions are explicitly connected and, therefore, learned jointly. As a result, the sentence embedding is computed according to an interpretable linguistic pattern and may be used on any downstream task. We evaluate our encoder on downstream tasks, and we observe that it outperforms tree-based models relying on external parsers. In some configurations, it is even competitive with Bert base model. Our model is capable of supporting multiple parser architectures. We exploit this property to conduct an ablation study by comparing different parser initializations. We explore to which extent the trees produced by our model compare with linguistic structures and how this initialization impacts downstream performances. We empirically observe that downstream supervision troubles producing stable parses and preserving linguistically relevant structures.

2021

pdf bib abs
How Many Layers and Why? An Analysis of the Model Depth in Transformers
Antoine Simoulin | Benoit Crabbé
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

In this study, we investigate the role of the multiple layers in deep transformer models. We design a variant of Albert that dynamically adapts the number of layers for each token of the input. The key specificity of Albert is that weights are tied across layers. Therefore, the stack of encoder layers iteratively repeats the application of the same transformation function on the input. We interpret the repetition of this application as an iterative process where the token contextualized representations are progressively refined. We analyze this process at the token level during pre-training, fine-tuning, and inference. We show that tokens do not require the same amount of iterations and that difficult or crucial tokens for the task are subject to more iterations.

pdf bib abs
Contrasting distinct structured views to learn sentence embeddings
Antoine Simoulin | Benoit Crabbé
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

We propose a self-supervised method that builds sentence embeddings from the combination of diverse explicit syntactic structures of a sentence. We assume structure is crucial to building consistent representations as we expect sentence meaning to be a function of both syntax and semantic aspects. In this perspective, we hypothesize that some linguistic representations might be better adapted given the considered task or sentence. We, therefore, propose to learn individual representation functions for different syntactic frameworks jointly. Again, by hypothesis, all such functions should encode similar semantic information differently and consequently, be complementary for building better sentential semantic embeddings. To assess such hypothesis, we propose an original contrastive multi-view framework that induces an explicit interaction between models during the training phase. We make experiments combining various structures such as dependency, constituency, or sequential schemes. Our results outperform comparable methods on several tasks from standard sentence embedding benchmarks.

pdf bib abs
Are Transformers a Modern Version of ELIZA? Observations on French Object Verb Agreement
Bingzhi Li | Guillaume Wisniewski | Benoit Crabbé
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Many recent works have demonstrated that unsupervised sentence representations of neural networks encode syntactic information by observing that neural language models are able to predict the agreement between a verb and its subject. We take a critical look at this line of research by showing that it is possible to achieve high accuracy on this agreement task with simple surface heuristics, indicating a possible flaw in our assessment of neural networks’ syntactic ability. Our fine-grained analyses of results on the long-range French object-verb agreement show that contrary to LSTMs, Transformers are able to capture a non-trivial amount of grammatical structure.

pdf bib abs
Analyse en dépendances du français avec des plongements contextualisés (French dependency parsing with contextualized embeddings)
Loïc Grobol | Benoit Crabbé
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Cet article présente un analyseur syntaxique en dépendances pour le français qui se compare favorablement à l’état de l’art sur la plupart des corpus de référence. L’analyseur s’appuie sur de riches représentations lexicales issues notamment de BERT et de FASTTEXT. On remarque que les représentations lexicales produites par FLAUBERT ont un caractère auto-suffisant pour réaliser la tâche d’analyse syntaxique de manière optimale.

pdf bib abs
Un modèle Transformer Génératif Pré-entrainé pour le______ français (Generative Pre-trained Transformer in______ (French) We introduce a French adaptation from the well-known GPT model)
Antoine Simoulin | Benoit Crabbé
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Nous proposons une adaptation en français du fameux modèle Generative Pre-trained Transformer (GPT). Ce dernier appartient à la catégorie des architectures transformers qui ont significativement transformé les méthodes de traitement automatique du langage. Ces architectures sont en particulier pré-entraînées sur des tâches auto-supervisées et sont ainsi spécifiques pour une langue donnée. Si certaines sont disponibles en français, la plupart se déclinent avant tout en anglais. GPT est particulièrement efficace pour les tâches de génération de texte. Par ailleurs, il est possible de l’appliquer à de nombreux cas d’usages. Ses propriétés génératives singulières permettent de l’utiliser dans des conditions originales comme l’apprentissage sans exemple qui ne suppose aucune mise à jour des poids du modèle, ou modification de l’architecture.

pdf bib
Is Old French tougher to parse?
Loïc Grobol | Sophie Prévost | Benoît Crabbé
Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021)

2020

pdf bib abs
FlauBERT : des modèles de langue contextualisés pré-entraînés pour le français (FlauBERT : Unsupervised Language Model Pre-training for French)
Hang Le | Loïc Vial | Jibril Frej | Vincent Segonne | Maximin Coavoux | Benjamin Lecouteux | Alexandre Allauzen | Benoît Crabbé | Laurent Besacier | Didier Schwab
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Les modèles de langue pré-entraînés sont désormais indispensables pour obtenir des résultats à l’état-de-l’art dans de nombreuses tâches du TALN. Tirant avantage de l’énorme quantité de textes bruts disponibles, ils permettent d’extraire des représentations continues des mots, contextualisées au niveau de la phrase. L’efficacité de ces représentations pour résoudre plusieurs tâches de TALN a été démontrée récemment pour l’anglais. Dans cet article, nous présentons et partageons FlauBERT, un ensemble de modèles appris sur un corpus français hétérogène et de taille importante. Des modèles de complexité différente sont entraînés à l’aide du nouveau supercalculateur Jean Zay du CNRS. Nous évaluons nos modèles de langue sur diverses tâches en français (classification de textes, paraphrase, inférence en langage naturel, analyse syntaxique, désambiguïsation automatique) et montrons qu’ils surpassent souvent les autres approches sur le référentiel d’évaluation FLUE également présenté ici.

Language models have become a key step to achieve state-of-the art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research community for further reproducible experiments in French NLP.

French, as many languages, lacks semantically annotated corpus data. Our aim is to provide the linguistic and NLP research communities with a gold standard sense-annotated corpus of French, using WordNet Unique Beginners as semantic tags, thus allowing for interoperability. In this paper, we report on the first phase of the project, which focused on the annotation of common nouns. The resulting dataset consists of more than 12,000 French noun occurrences which were annotated in double blind and adjudicated according to a carefully redefined set of supersenses. The resource is released online under a Creative Commons Licence.

2019

pdf bib abs
Variable beam search for generative neural parsing and its relevance for the analysis of neuro-imaging signal
Benoit Crabbé | Murielle Fabre | Christophe Pallier
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

This paper describes a method of variable beam size inference for Recurrent Neural Network Grammar (rnng) by drawing inspiration from sequential Monte-Carlo methods such as particle filtering. The paper studies the relevance of such methods for speeding up the computations of direct generative parsing for rnng. But it also studies the potential cognitive interpretation of the underlying representations built by the search method (beam activity) through analysis of neuro-imaging signal.

pdf bib abs
Unlexicalized Transition-based Discontinuous Constituency Parsing
Maximin Coavoux | Benoît Crabbé | Shay B. Cohen
Transactions of the Association for Computational Linguistics, Volume 7

Lexicalized parsing models are based on the assumptions that (i) constituents are organized around a lexical head and (ii) bilexical statistics are crucial to solve ambiguities. In this paper, we introduce an unlexicalized transition-based parser for discontinuous constituency structures, based on a structure-label transition system and a bi-LSTM scoring system. We compare it with lexicalized parsing models in order to address the question of lexicalization in the context of discontinuous constituency parsing. Our experiments show that unlexicalized models systematically achieve higher results than lexicalized models, and provide additional empirical evidence that lexicalization is not necessary to achieve strong parsing results. Our best unlexicalized model sets a new state of the art on English and German discontinuous constituency treebanks. We further provide a per-phenomenon analysis of its errors on discontinuous constituents.

pdf bib abs
Using Wiktionary as a resource for WSD : the case of French verbs
Vincent Segonne | Marie Candito | Benoît Crabbé
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

As opposed to word sense induction, word sense disambiguation (WSD) has the advantage of us-ing interpretable senses, but requires annotated data, which are quite rare for most languages except English (Miller et al. 1993; Fellbaum, 1998). In this paper, we investigate which strategy to adopt to achieve WSD for languages lacking data that was annotated specifically for the task, focusing on the particular case of verb disambiguation in French. We first study the usability of Eurosense (Bovi et al. 2017) , a multilingual corpus extracted from Europarl (Kohen, 2005) and automatically annotated with BabelNet (Navigli and Ponzetto, 2010) senses. Such a resource opened up the way to supervised and semi-supervised WSD for resourceless languages like French. While this perspective looked promising, our evaluation on French verbs was inconclusive and showed the annotated senses’ quality was not sufficient for supervised WSD on French verbs. Instead, we propose to use Wiktionary, a collaboratively edited, multilingual online dictionary, as a resource for WSD. Wiktionary provides both sense inventory and manually sense tagged examples which can be used to train supervised and semi-supervised WSD systems. Yet, because senses’ distribution differ in lexicographic examples found in Wiktionary with respect to natural text, we then focus on studying the impact on WSD of the training data size and senses’ distribution. Using state-of-the art semi-supervised systems, we report experiments of Wiktionary-based WSD for French verbs, evaluated on FrenchSemEval (FSE), a new dataset of French verbs manually annotated with wiktionary senses.

2017

pdf bib abs
Représentation et analyse automatique des discontinuités syntaxiques dans les corpus arborés en constituants du français (Representation and parsing of syntactic discontinuities in French constituent treebanks)
Maximin Coavoux | Benoît Crabbé
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs

Nous présentons de nouvelles instanciations de trois corpus arborés en constituants du français, où certains phénomènes syntaxiques à l’origine de dépendances à longue distance sont représentés directement à l’aide de constituants discontinus. Les arbres obtenus relèvent de formalismes grammaticaux légèrement sensibles au contexte (LCFRS). Nous montrons ensuite qu’il est possible d’analyser automatiquement de telles structures de manière efficace à condition de s’appuyer sur une méthode d’inférence approximative. Pour cela, nous présentons un analyseur syntaxique par transitions, qui réalise également l’analyse morphologique et l’étiquetage fonctionnel des mots de la phrase. Enfin, nos expériences montrent que la rareté des phénomènes concernés dans les données françaises pose des difficultés pour l’apprentissage et l’évaluation des structures discontinues.

pdf bib abs
Incremental Discontinuous Phrase Structure Parsing with the GAP Transition
Maximin Coavoux | Benoît Crabbé
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This article introduces a novel transition system for discontinuous lexicalized constituent parsing called SR-GAP. It is an extension of the shift-reduce algorithm with an additional gap transition. Evaluation on two German treebanks shows that SR-GAP outperforms the previous best transition-based discontinuous parser (Maier, 2015) by a large margin (it is notably twice as accurate on the prediction of discontinuous constituents), and is competitive with the state of the art (Fernández-González and Martins, 2015). As a side contribution, we adapt span features (Hall et al., 2014) to discontinuous parsing.

pdf bib abs
Multilingual Lexicalized Constituency Parsing with Word-Level Auxiliary Tasks
Maximin Coavoux | Benoît Crabbé
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We introduce a constituency parser based on a bi-LSTM encoder adapted from recent work (Cross and Huang, 2016b; Kiperwasser and Goldberg, 2016), which can incorporate a lower level character biLSTM (Ballesteros et al., 2015; Plank et al., 2016). We model two important interfaces of constituency parsing with auxiliary tasks supervised at the word level: (i) part-of-speech (POS) and morphological tagging, (ii) functional label prediction. On the SPMRL dataset, our parser obtains above state-of-the-art results on constituency parsing without requiring either predicted POS or morphological tags, and outputs labelled dependency trees.

2016

pdf bib
Prédiction structurée pour l’analyse syntaxique en constituants par transitions : modèles denses et modèles creux [Structured Prediction for Transition-based Constituent Parsing: Dense and Sparse Models]
Maximin Coavoux | Benoît Crabbé
Traitement Automatique des Langues, Volume 57, Numéro 1 : Varia [Varia]

pdf bib abs
Boosting for Efficient Model Selection for Syntactic Parsing
Rachel Bawden | Benoît Crabbé
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We present an efficient model selection method using boosting for transition-based constituency parsing. It is designed for exploring a high-dimensional search space, defined by a large set of feature templates, as for example is typically the case when parsing morphologically rich languages. Our method removes the need to manually define heuristic constraints, which are often imposed in current state-of-the-art selection methods. Our experiments for French show that the method is more efficient and is also capable of producing compact, state-of-the-art models.

pdf bib
Neural Greedy Constituent Parsing with Dynamic Oracles
Maximin Coavoux | Benoît Crabbé
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

pdf bib abs
Comparaison d’architectures neuronales pour l’analyse syntaxique en constituants
Maximin Coavoux | Benoît Crabbé
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

L’article traite de l’analyse syntaxique lexicalisée pour les grammaires de constituants. On se place dans le cadre de l’analyse par transitions. Les modèles statistiques généralement utilisés pour cette tâche s’appuient sur une représentation non structurée du lexique. Les mots du vocabulaire sont représentés par des symboles discrets sans liens entre eux. À la place, nous proposons d’utiliser des représentations denses du type plongements (embeddings) qui permettent de modéliser la similarité entre symboles, c’est-à-dire entre mots, entre parties du discours et entre catégories syntagmatiques. Nous proposons d’adapter le modèle statistique sous-jacent à ces nouvelles représentations. L’article propose une étude de 3 architectures neuronales de complexité croissante et montre que l’utilisation d’une couche cachée non-linéaire permet de tirer parti des informations données par les plongements.

pdf bib
Multilingual discriminative lexicalized phrase structure parsing
Benoit Crabbé
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Dependency length minimisation effects in short spans: a large-scale analysis of adjective placement in complex noun phrases
Kristina Gulordava | Paola Merlo | Benoit Crabbé
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

In this paper, we introduce a set of resources that we have derived from the EST RÉPUBLICAIN CORPUS, a large, freely-available collection of regional newspaper articles in French, totaling 150 million words. Our resources are the result of a full NLP treatment of the EST RÉPUBLICAIN CORPUS: handling of multi-word expressions, lemmatization, part-of-speech tagging, and syntactic parsing. Processing of the corpus is carried out using statistical machine-learning approaches - joint model of data driven lemmatization and part- of-speech tagging, PCFG-LA and dependency based models for parsing - that have been shown to achieve state-of-the-art performance when evaluated on the French Treebank. Our derived resources are made freely available, and released according to the original Creative Common license for the EST RÉPUBLICAIN CORPUS. We additionally provide an overview of the use of these resources in various applications, in particular the use of generated word clusters from the corpus to alleviate lexical data sparseness for statistical parsing.

2011

pdf bib
Testing the Robustness of Online Word Segmentation: Effects of Linguistic Diversity and Phonetic Variation
Luc Boruta | Sharon Peperkamp | Benoît Crabbé | Emmanuel Dupoux
Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics

2010

pdf bib abs
Approche quantitative en syntaxe : l’exemple de l’alternance de position de l’adjectif épithète en français
Juliette Thuilier | Gwendoline Fox | Benoît Crabbé
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Cet article présente une analyse statistique sur des données de syntaxe qui a pour but d’aider à mieux cerner le phénomène d’alternance de position de l’adjectif épithète par rapport au nom en français. Nous montrons comment nous avons utilisé les corpus dont nous disposons (French Treebank et le corpus de l’Est-Républicain) ainsi que les ressources issues du traitement automatique des langues, pour mener à bien notre étude. La modélisation à partir de 13 variables relevant principalement des propriétés du syntagme adjectival, de celles de l’item adjectival, ainsi que de contraintes basées sur la fréquence, permet de prédire à plus de 93% la position de l’adjectif. Nous insistons sur l’importance de contraintes relevant de l’usage pour le choix de la position de l’adjectif, notamment à travers la fréquence d’occurrence de l’adjectif, et la fréquence de contextes dans lesquels il apparaît.

pdf bib abs
Statistical French Dependency Parsing: Treebank Conversion and First Results
Marie Candito | Benoît Crabbé | Pascal Denis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We first describe the automatic conversion of the French Treebank (Abeillé and Barrier, 2004), a constituency treebank, into typed projective dependency trees. In order to evaluate the overall quality of the resulting dependency treebank, and to quantify the cases where the projectivity constraint leads to wrong dependencies, we compare a subset of the converted treebank to manually validated dependency trees. We then compare the performance of two treebank-trained parsers that output typed dependency parses. The first parser is the MST parser (Mcdonald et al., 2006), which we directly train on dependency trees. The second parser is a combination of the Berkeley parser (Petrov et al., 2006) and a functional role labeler: trained on the original constituency treebank, the Berkeley parser first outputs constituency trees, which are then labeled with functional roles, and then converted into dependency trees. We found that used in combination with a high-accuracy French POS tagger, the MST parser performs a little better for unlabeled dependencies (UAS=90.3% versus 89.6%), and better for labeled dependencies (LAS=87.6% versus 85.6%).

2009

pdf bib abs
Analyse syntaxique du français : des constituants aux dépendances
Marie Candito | Benoît Crabbé | Pascal Denis | François Guérin
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Cet article présente une technique d’analyse syntaxique statistique à la fois en constituants et en dépendances. L’analyse procède en ajoutant des étiquettes fonctionnelles aux sorties d’un analyseur en constituants, entraîné sur le French Treebank, pour permettre l’extraction de dépendances typées. D’une part, nous spécifions d’un point de vue formel et linguistique les structures de dépendances à produire, ainsi que la procédure de conversion du corpus en constituants (le French Treebank) vers un corpus cible annoté en dépendances, et partiellement validé. D’autre part, nous décrivons l’approche algorithmique qui permet de réaliser automatiquement le typage des dépendances. En particulier, nous nous focalisons sur les méthodes d’apprentissage discriminantes d’étiquetage en fonctions grammaticales.

pdf bib abs
Adaptation de parsers statistiques lexicalisés pour le français : Une évaluation complète sur corpus arborés
Djamé Seddah | Marie Candito | Benoît Crabbé
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Cet article présente les résultats d’une évaluation exhaustive des principaux analyseurs syntaxiques probabilistes dit “lexicalisés” initialement conçus pour l’anglais, adaptés pour le français et évalués sur le CORPUS ARBORÉ DU FRANÇAIS (Abeillé et al., 2003) et le MODIFIED FRENCH TREEBANK (Schluter & van Genabith, 2007). Confirmant les résultats de (Crabbé & Candito, 2008), nous montrons que les modèles lexicalisés, à travers les modèles de Charniak (Charniak, 2000), ceux de Collins (Collins, 1999) et le modèle des TIG Stochastiques (Chiang, 2000), présentent des performances moindres face à un analyseur PCFG à Annotation Latente (Petrov et al., 2006). De plus, nous montrons que le choix d’un jeu d’annotations issus de tel ou tel treebank oriente fortement les résultats d’évaluations tant en constituance qu’en dépendance non typée. Comparés à (Schluter & van Genabith, 2008; Arun & Keller, 2005), tous nos résultats sont state-of-the-art et infirment l’hypothèse d’une difficulté particulière qu’aurait le français en terme d’analyse syntaxique probabiliste et de sources de données.

bib
On Statistical Parsing of French with Supervised and Semi-Supervised Strategies
Marie Candito | Benoit Crabbé | Djamé Seddah
Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference

pdf bib
Improving generative statistical parsing with semi-supervised word clustering
Marie Candito | Benoît Crabbé
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

pdf bib
Cross parser evaluation : a French Treebanks study
Djamé Seddah | Marie Candito | Benoît Crabbé
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

2008

pdf bib abs
Expériences d’analyse syntaxique statistique du français
Benoît Crabbé | Marie Candito
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Nous montrons qu’il est possible d’obtenir une analyse syntaxique statistique satisfaisante pour le français sur du corpus journalistique, à partir des données issues du French Treebank du laboratoire LLF, à l’aide d’un algorithme d’analyse non lexicalisé.

2006

pdf bib
XMG - An Expressive Formalism for Describing Tree-Based Grammars
Yannick Parmentier | Joseph Le Roux | Benoît Crabbé
Demonstrations

pdf bib
A Constraint Driven Metagrammar
Joseph Le Roux | Benoît Crabbé | Yannick Parmentier
Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms

pdf bib
Increasing the coverage of a domain independent dialogue lexicon with VERBNET
Benoit Crabbé | Myroslava O. Dzikovska | William de Beaumont | Mary Swift
Proceedings of the Third Workshop on Scalable Natural Language Understanding

2005

pdf bib abs
Projection et monotonie dans un langage de représentation lexico-grammatical
Benoît Crabbé
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Cet article apporte une méthode de développement grammatical pour la réalisation de grammaires d’arbres adjoints (TAG) de taille importante augmentées d’une dimension sémantique. La méthode que nous présentons s’exprime dans un langage informatique de représentation grammatical qui est déclaratif et monotone. Pour arriver au résultat, nous montrons comment tirer parti de la théorie de la projection dans le langage de représentation que nous utilisons. Par conséquent cet article justifie l’utilisation d’un langage monotone pour la représentation lexico-grammaticale.

2003

pdf bib abs
Une plate-forme de conception et d’exploitation d’une grammaire d’arbres adjoints lexicalisés
Benoît Crabbé | Bertrand Gaiffe | Azim Roussanaly
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Dans cet article, nous présentons un ensemble d’outils de conception et d’exploitation pour des grammaires d’arbres adjoints lexicalisés. Ces outils s’appuient sur une représentation XML des ressources (lexique et grammaire). Dans notre représentation, à chaque arbre de la grammaire est associé un hypertag décrivant les phénomènes linguistiques qu’il recouvre. De ce fait, la liaison avec le lexique se trouve plus compactée et devient plus aisée à maintenir. Enfin, un analyseur permet de valider les grammaires et les lexiques ainsi conçus aussi bien de façon interactive que différée sur des corpus.