Marie Candito

Also published as: Marie-Helene Candito, Marie-Hélène Candito

2025

pdf bib abs
SELEXINI – a large and diverse automatically parsed corpus of French
Manon Scholivet | Agata Savary | Louis Estève | Marie Candito | Carlos Ramisch
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)

The annotation of large text corpora is essential for many tasks. We present here a large automatically annotated corpus for French. This corpus is separated into two parts: the first from BigScience, and the second from HPLT. The annotated documents from HPLT were selected in order to optimise the lexical diversity of the final corpus SELEXINI. An analysis of the impact of this selection was carried out on syntactic diversity, as well as on the quality of the new words resulting from the HPLT part of SELEXINI. We have shown that despite the introduction of interesting new words, the texts extracted from HPLT are very noisy. Furthermore, increasing lexical diversity did not increase syntactic diversity.

pdf bib abs
Annotating the French Wiktionary with supersenses for large scale lexical analysis: a use case to assess form-meaning relationships within the nominal lexicon
Nicolas Angleraud | Lucie Barque | Marie Candito
Proceedings of the 31st International Conference on Computational Linguistics

Many languages lack broad-coverage, semantically annotated lexical resources, which limits empirical research on lexical semantics for these languages. In this paper, we report on how we automatically enriched the French Wiktionnary with general semantic classes, known as supersenses, using a limited amount of manually annotated data. We trained a classifier combining sense definition classification and sense exemplars classification. The resulting resource, with an evaluated supersense accuracy of nearly 85% (92% for hypersenses), is used in a case study illustrating how such an semantically enriched resource can be leveraged to empirically test linguistic hypotheses about the lexicon, on a large scale.

pdf bib abs
Polarity inversion operators in PLM
David Kletz | Pascal Amsili | Marie Candito
Proceedings of the 29th Conference on Computational Natural Language Learning

From a linguistic perspective, negation is a unique and inherently compositional operator. In this study, we investigate whether the bert-large-cased Pretrained Language Model (PLM) properly encodes this compositional aspect of negation when embedding a token that falls within the scope of negation.To explore this, we train two external Multi-Layer Perceptrons to modify contextual embeddings in a controlled manner. The goal is to reverse the polarity information encoded in the embedding while preserving all other token-related information. The first MLP, called the Negator, transforms a negative polarity into a positive one, while the second, the Affirmator, performs the reverse transformation.We then conduct a series of evaluations to assess the effectiveness of these operators. Our results indicate that while the Negator/Affirmator is functional, it only partially simulates the negation operator. Specifically, applying it recursively does not allow us to recover the original polarity, suggesting an incomplete representation of negation within the PLM’s embeddings.In addition, a downstream evaluation on the Negated LAMA dataset reveals that the modifications introduced by the Negator/Affirmator lead to a slight improvement in the model’s ability to account for negation in its predictions. However, applying the Negator/Affirmator recursively results in degraded representations, further reinforcing the idea that negation is not fully compositional within PLM embeddings.

pdf bib abs
In the LLM era, Word Sense Induction remains unsolved
Anna Mosolova | Marie Candito | Carlos Ramisch
Findings of the Association for Computational Linguistics: ACL 2025

In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints, number of clusters per lemma.We find that no unsupervised method (whether ours or previous) surpasses the strong “one cluster per lemma” heuristic (1cpl). We also show that (i) results and best systems may vary across POS, (ii) LLMs have troubles performing this task, (iii) data augmentation is beneficial and (iv) capitalizing on Wiktionary does help. It surpasses previous SOTA system on our test set by 3.3%. WSI is not solved, and calls for a better articulation of lexicons and LLMs’ lexical semantics capabilities.

pdf bib abs
FAMWA: A new taxonomy for classifying word associations (which humans improve at but LLMs still struggle with)
Maria A. Rodriguez | Marie Candito | Richard Huyghe
Proceedings of the 16th International Conference on Computational Semantics

Word associations have a longstanding tradition of being instrumental for investigating the organization of the mental lexicon. Despite their wide application in psychology and psycholinguistics, analyzing word associations remains challenging due to their inherent heterogeneity and variability, shaped by linguistic and extralinguistic factors. Existing word-association taxonomies often suffer limitations due to a lack of comprehensive frameworks that capture their complexity.To address these limitations, we introduce a linguistically motivated taxonomy consisting of co-existing meaning-related and form-related relations, while accounting for the directionality of word associations.We applied the taxonomy to a dataset of 1,300 word associations (FAMWA) and assessed it using various LLMs, analyzing their ability to classify word associations.The results show an improved inter-annotator agreement for our taxonomies compared to previous studies (𝜅 = .60 for meaning and 𝜅 = .58 for form). However, models such as GPT-4o perform only modestly in relation labeling (with accuracies of 46.2% for meaning and 78.3% for form), which calls into question their ability to fully grasp the underlying principles of human word associations.

pdf bib abs
Raffinage des représentations des tokens dans les modèles de langue pré-entraînés avec l’apprentissage contrastif : une étude entre modèles et entre langues
Anna Mosolova | Marie Candito | Carlos Ramisch
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

Les modèles de langue pré-entraînés ont apporté des avancées significatives dans les représentations contextuelles des phrases et des mots. Cependant, les tâches lexicales restent un défi pour ces représentations en raison des problèmes tels que la faible similarité des representations d’un même mot dans des contextes similaires. Mosolova et al. (2024) ont montré que l’apprentissage contrastif supervisé au niveau des tokens permettait d’améliorer les performances sur les tâches lexicales. Dans cet article, nous étudions la généralisabilité de leurs résultats obtenus en anglais au français, à d’autres modèles de langue et à plusieurs parties du discours. Nous démontrons que cette méthode d’apprentissage contrastif améliore systématiquement la performance sur les tâches de Word-in-Context et surpasse celle des modèles de langage pré-entraînés standards. L’analyse de l’espace des plongements lexicaux montre que l’affinage des modèles rapproche les exemples ayant le même sens et éloigne ceux avec des sens différents, ce qui indique une meilleure discrimination des sens dans l’espace vectoriel final.

pdf bib abs
SELEXINI – un grand corpus français, divers et parsé automatiquement
Manon Scholivet | Agata Savary | Louis Estève | Marie Candito | Carlos Ramisch
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

L’annotation de grands corpus de texte est essentielle pour de nombreuses tâches de Traitement Automatique des Langues. Dans cet article, nous présentons SELEXINI, un grand corpus français annoté automatiquement en syntaxe. Ce corpus est composé de deux parties : la partie BigScience, et la partie HPLT. Les documents de la partie HPLT ont été sélectionnés dans le but de maximiser la diversité lexicale du corpus total SELEXINI. Une analyse de l’impact de cette sélection sur la diversité syntaxique a été réalisée, ainsi qu’une étude de la qualité des nouveaux mots issus de la partie HPLT du corpus SELEXINI. Nous avons pu montrer que malgré l’introduction de nouveaux mots considérés comme intéressants (formes de conjugaison rares, néologismes, mots rares,...), les textes issus de HPLT sont extrêmement bruités. De plus, l’augmentation de la diversité lexicale n’a pas permis d’augmenter la diversité syntaxique.

2024

pdf bib abs
Injecting Wiktionary to improve token-level contextual representations using contrastive learning
Anna Mosolova | Marie Candito | Carlos Ramisch
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

While static word embeddings are blind to context, for lexical semantics tasks context is rather too present in contextual word embeddings, vectors of same-meaning occurrences being too different (Ethayarajh, 2019). Fine-tuning pre-trained language models (PLMs) using contrastive learning was proposed, leveraging automatically self-augmented examples (Liu et al., 2021b). In this paper, we investigate how to inject a lexicon as an alternative source of supervision, using the English Wiktionary. We also test how dimensionality reduction impacts the resulting contextual word embeddings. We evaluate our approach on the Word-In-Context (WiC) task, in the unsupervised setting (not using the training set). We achieve new SoTA result on the original WiC test set. We also propose two new WiC test sets for which we show that our fine-tuning method achieves substantial improvements. We also observe improvements, although modest, for the semantic frame induction task. Although we experimented on English to allow comparison with related work, our method is adaptable to the many languages for which large Wiktionaries exist.

2023

pdf bib abs
The Self-Contained Negation Test Set
David Kletz | Pascal Amsili | Marie Candito
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Several methodologies have recently been proposed to evaluate the ability of Pretrained Language Models (PLMs) to interpret negation. In this article, we build on Gubelmann and Handschuh (2022), which studies the modification of PLMs’ predictions as a function of the polarity of inputs, in English. Crucially, this test uses “self-contained” inputs ending with a masked position: depending on the polarity of a verb in the input, a particular token is either semantically ruled out or allowed at the masked position. By replicating Gubelmann and Handschuh (2022) experiments, we have uncovered flaws that weaken the conclusions that can be drawn from this test. We thus propose an improved version, the Self-Contained Neg Test, which is more controlled, more systematic, and entirely based on examples forming minimal pairs varying only in the presence or absence of verbal negation in English. When applying our test to the roberta and bert base and large models, we show that only roberta-large shows trends that match the expectations, while bert-base is mostly insensitive to negation. For all the tested models though, in a significant number of test instances the top-1 prediction remains the token that is semantically forbidden by the context, which shows how much room for improvement remains for a proper treatment of the negation phenomenon.

pdf bib
Actes de CORIA-TALN 2023. Actes des 16e Rencontres Jeunes Chercheurs en RI (RJCRI) et 25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL)
Marie Candito | Thomas Gerald | José G Moreno
Actes de CORIA-TALN 2023. Actes des 16e Rencontres Jeunes Chercheurs en RI (RJCRI) et 25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL)

We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus represents 26 languages now. All monolingual corpora therein use Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts impact on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

pdf bib abs
Probing structural constraints of negation in Pretrained Language Models
David Kletz | Marie Candito | Pascal Amsili
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Contradictory results about the encoding of the semantic impact of negation in pretrained language models (PLMs) have been drawn recently (e.g. Kassner and Schütze (2020); Gubelmann and Handschuh (2022)).In this paper we focus rather on the way PLMs encode negation and its formal impact, through the phenomenon of the Negative Polarity Item (NPI) licensing in English.More precisely, we use probes to identify which contextual representations best encode 1) the presence of negation in a sentence, and 2) the polarity of a neighboring masked polarity item. We find that contextual representations of tokens inside the negation scope do allow for (i) a better prediction of the presence of “not” compared to those outside the scope and (ii) a better prediction of the right polarity of a masked polarity item licensed by “not”, although the magnitude of the difference varies from PLM to PLM. Importantly, in both cases the trend holds even when controlling for distance to “not”.This tends to indicate that the embeddings of these models do reflect the notion of negation scope, and do encode the impact of negation on NPI licensing. Yet, further control experiments reveal that the presence of other lexical items is also better captured when using the contextual representation of a token within the same syntactic clause than outside from it, suggesting that PLMs simply capture the more general notion of syntactic clause.

2022

pdf bib abs
Auxiliary tasks to boost Biaffine Semantic Dependency Parsing
Marie Candito
Findings of the Association for Computational Linguistics: ACL 2022

The biaffine parser of (CITATION) was successfully extended to semantic dependency parsing (SDP) (CITATION). Its performance on graphs is surprisingly high given that, without the constraint of producing a tree, all arcs for a given sentence are predicted independently from each other (modulo a shared representation of tokens).To circumvent such an independence of decision, while retaining the O(n²) complexity and highly parallelizable architecture, we propose to use simple auxiliary tasks that introduce some form of interdependence between arcs. Experiments on the three English acyclic datasets of SemEval-2015 task 18 (CITATION), and on French deep syntactic cyclic graphs (CITATION) show modest but systematic performance gains on a near-state-of-the-art baseline using transformer-based contextualized representations. This provides a simple and robust method to boost SDP performance.

pdf bib abs
Tâches auxiliaires pour l’analyse biaffine en graphes de dépendances (Auxiliary tasks to boost Biaffine Semantic Dependency Parsing)
Marie Candito
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

L’analyseur biaffine de Dozat & Manning (2017), qui produit des arbres de dépendances syntaxiques, a été étendu avec succès aux graphes de dépendances syntaxico-sémantiques (Dozat & Manning, 2018). Ses performances sur les graphes sont étonnamment hautes étant donné que, sans la contrainte de devoir produire un arbre, les arcs pour une phrase donnée sont prédits indépendamment les uns des autres. Pour y remédier partiellement, tout en conservant la complexité O(n2 ) et l’architecture hautement parallélisable, nous proposons d’utiliser des tâches auxiliaires qui introduisent une forme d’interdépendance entre les arcs. Les expérimentations sur les trois jeux de données anglaises de la tâche 18 SemEval-2015 (Oepen et al., 2015), et sur des graphes syntaxiques profonds en français (Ribeyre et al., 2014) montrent une amélioration modeste mais systématique, par rapport à un système de base performant, utilisant un modèle de langue pré-entraîné. Notre méthode s’avère ainsi un moyen simple et robuste d’améliorer l’analyse vers graphes de dépendances.

2020

French, as many languages, lacks semantically annotated corpus data. Our aim is to provide the linguistic and NLP research communities with a gold standard sense-annotated corpus of French, using WordNet Unique Beginners as semantic tags, thus allowing for interoperability. In this paper, we report on the first phase of the project, which focused on the annotation of common nouns. The resulting dataset consists of more than 12,000 French noun occurrences which were annotated in double blind and adjudicated according to a carefully redefined set of supersenses. The resource is released online under a Creative Commons Licence.

We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.

2019

pdf bib
Traitement Automatique des Langues, Volume 60, Numéro 2 : Corpus annotés [Annotated corpora]
Marie Candito | Mark Liberman
Traitement Automatique des Langues, Volume 60, Numéro 2 : Corpus annotés [Annotated corpora]

pdf bib
Introduction to the special issue on annotated corpora
Marie Candito | Mark Liberman
Traitement Automatique des Langues, Volume 60, Numéro 2 : Corpus annotés [Annotated corpora]

pdf bib abs
SemEval-2019 Task 2: Unsupervised Lexical Frame Induction
Behrang QasemiZadeh | Miriam R. L. Petruck | Regina Stodden | Laura Kallmeyer | Marie Candito
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper presents Unsupervised Lexical Frame Induction, Task 2 of the International Workshop on Semantic Evaluation in 2019. Given a set of prespecified syntactic forms in context, the task requires that verbs and their arguments be clustered to resemble semantic frame structures. Results are useful in identifying polysemous words, i.e., those whose frame structures are not easily distinguished, as well as discerning semantic relations of the arguments. Evaluation of unsupervised frame induction methods fell into two tracks: Task A) Verb Clustering based on FrameNet 1.7; and B) Argument Clustering, with B.1) based on FrameNet’s core frame elements, and B.2) on VerbNet 3.2 semantic roles. The shared task attracted nine teams, of whom three reported promising results. This paper describes the task and its data, reports on methods and resources that these systems used, and offers a comparison to human annotation.

pdf bib abs
Using Wiktionary as a resource for WSD : the case of French verbs
Vincent Segonne | Marie Candito | Benoît Crabbé
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

As opposed to word sense induction, word sense disambiguation (WSD) has the advantage of us-ing interpretable senses, but requires annotated data, which are quite rare for most languages except English (Miller et al. 1993; Fellbaum, 1998). In this paper, we investigate which strategy to adopt to achieve WSD for languages lacking data that was annotated specifically for the task, focusing on the particular case of verb disambiguation in French. We first study the usability of Eurosense (Bovi et al. 2017) , a multilingual corpus extracted from Europarl (Kohen, 2005) and automatically annotated with BabelNet (Navigli and Ponzetto, 2010) senses. Such a resource opened up the way to supervised and semi-supervised WSD for resourceless languages like French. While this perspective looked promising, our evaluation on French verbs was inconclusive and showed the annotated senses’ quality was not sufficient for supervised WSD on French verbs. Instead, we propose to use Wiktionary, a collaboratively edited, multilingual online dictionary, as a resource for WSD. Wiktionary provides both sense inventory and manually sense tagged examples which can be used to train supervised and semi-supervised WSD systems. Yet, because senses’ distribution differ in lexicographic examples found in Wiktionary with respect to natural text, we then focus on studying the impact on WSD of the training data size and senses’ distribution. Using state-of-the art semi-supervised systems, we report experiments of Wiktionary-based WSD for French verbs, evaluated on FrenchSemEval (FSE), a new dataset of French verbs manually annotated with wiktionary senses.

pdf bib abs
Comparing linear and neural models for competitive MWE identification
Hazem Al Saied | Marie Candito | Mathieu Constant
Proceedings of the 22nd Nordic Conference on Computational Linguistics

In this paper, we compare the use of linear versus neural classifiers in a greedy transition system for MWE identification. Both our linear and neural models achieve a new state-of-the-art on the PARSEME 1.1 shared task data sets, comprising 20 languages. Surprisingly, our best model is a simple feed-forward network with one hidden layer, although more sophisticated (recurrent) architectures were tested. The feedback from this study is that tuning a SVM is rather straightforward, whereas tuning our neural system revealed more challenging. Given the number of languages and the variety of linguistic phenomena to handle for the MWE identification task, we have designed an accurate tuning procedure, and we show that hyperparameters are better selected by using a majority-vote within random search configurations rather than a simple best configuration selection. Although the performance is rather good (better than both the best shared task system and the average of the best per-language results), further work is needed to improve the generalization power, especially on unseen MWEs.

pdf bib abs
Syntax-based identification of light-verb constructions
Silvio Ricardo Cordeiro | Marie Candito
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This paper analyzes results on light-verb construction identification from the PARSEME shared-task, distinguishing between simple cases that could be directly learned from training data from more complex cases that require an extra level of semantic processing. We propose a simple baseline that beats the state of the art for the simple cases, and couple it with another simple baseline to handle the complex cases. We additionally present two other classifiers based on a richer set of features, with results surpassing the state of the art by 8 percentage points.

pdf bib
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)
Marie Candito | Kilian Evang | Stephan Oepen | Djamé Seddah
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

2018

pdf bib
Cheating a Parser to Death: Data-driven Cross-Treebank Annotation Transfer
Djamé Seddah | Eric de la Clergerie | Benoît Sagot | Héctor Martínez Alonso | Marie Candito
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.

2017

pdf bib abs
Annotation d’expressions polylexicales verbales en français (Annotation of verbal multiword expressions in French)
Marie Candito | Mathieu Constant | Carlos Ramisch | Agata Savary | Yannick Parmentier | Caroline Pasquer | Jean-Yves Antoine
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Nous décrivons la partie française des données produites dans le cadre de la campagne multilingue PARSEME sur l’identification d’expressions polylexicales verbales (Savary et al., 2017). Les expressions couvertes pour le français sont les expressions verbales idiomatiques, les verbes intrinsèquement pronominaux et une généralisation des constructions à verbe support. Ces phénomènes ont été annotés sur le corpus French-UD (Nivre et al., 2016) et le corpus Sequoia (Candito & Seddah, 2012), soit un corpus de 22 645 phrases, pour un total de 4 962 expressions annotées. On obtient un ratio d’une expression annotée tous les 100 tokens environ, avec un fort taux d’expressions discontinues (40%).

Multiword expressions (MWEs) are known as a “pain in the neck” for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one’s heart or to turn off, have been rarely modelled. This is notably due to their syntactic variability, which hinders treating them as “words with spaces”. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.

pdf bib abs
The ATILF-LLF System for Parseme Shared Task: a Transition-based Verbal Multiword Expression Tagger
Hazem Al Saied | Matthieu Constant | Marie Candito
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

We describe the ATILF-LLF system built for the MWE 2017 Shared Task on automatic identification of verbal multiword expressions. We participated in the closed track only, for all the 18 available languages. Our system is a robust greedy transition-based system, in which MWE are identified through a MERGE transition. The system was meant to accommodate the variety of linguistic resources provided for each language, in terms of accompanying morphological and syntactic information. Using per-MWE Fscore, the system was ranked first for all but two languages (Hungarian and Romanian).

pdf bib
Enhanced UD Dependencies with Neutralized Diathesis Alternation
Marie Candito | Bruno Guillaume | Guy Perrier | Djamé Seddah
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

pdf bib
Annotating and parsing to semantic frames: feedback from the French FrameNet project
Marie Candito
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

2016

pdf bib abs
Deeper syntax for better semantic parsing
Olivier Michalon | Corentin Ribeyre | Marie Candito | Alexis Nasr
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Syntax plays an important role in the task of predicting the semantic structure of a sentence. But syntactic phenomena such as alternations, control and raising tend to obfuscate the relation between syntax and semantics. In this paper we predict the semantic structure of a sentence using a deeper syntax than what is usually done. This deep syntactic representation abstracts away from purely syntactic phenomena and proposes a structural organization of the sentence that is closer to the semantic representation. Experiments conducted on a French corpus annotated with semantic frames showed that a semantic parser reaches better performances with such a deep syntactic input.

pdf bib abs
Hard Time Parsing Questions: Building a QuestionBank for French
Djamé Seddah | Marie Candito
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present the French Question Bank, a treebank of 2600 questions. We show that classical parsing model performance drop while the inclusion of this data set is highly beneficial without harming the parsing of non-question data. when facing out-of- domain data with strong structural diver- gences. Two thirds being aligned with the QB (Judge et al., 2006) and being freely available, this treebank will prove useful to build robust NLP systems.

pdf bib abs
Corpus Annotation within the French FrameNet: a Domain-by-domain Methodology
Marianne Djemaa | Marie Candito | Philippe Muller | Laure Vieu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper reports on the development of a French FrameNet, within the ASFALDA project. While the first phase of the project focused on the development of a French set of frames and corresponding lexicon (Candito et al., 2014), this paper concentrates on the subsequent corpus annotation phase, which focused on four notional domains (commercial transactions, cognitive stances, causality and verbal communication). Given full coverage is not reachable for a relatively “new” FrameNet project, we advocate that focusing on specific notional domains allowed us to obtain full lexical coverage for the frames of these domains, while partially reflecting word sense ambiguities. Furthermore, as frames and roles were annotated on two French Treebanks (the French Treebank (Abeillé and Barrier, 2004) and the Sequoia Treebank (Candito and Seddah, 2012), we were able to extract a syntactico-semantic lexicon from the annotated frames. In the resource’s current status, there are 98 frames, 662 frame evoking words, 872 senses, and about 13000 annotated frames, with their semantic roles assigned to portions of text. The French FrameNet is freely available at alpage.inria.fr/asfalda.

pdf bib abs
A General Framework for the Annotation of Causality Based on FrameNet
Laure Vieu | Philippe Muller | Marie Candito | Marianne Djemaa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present here a general set of semantic frames to annotate causal expressions, with a rich lexicon in French and an annotated corpus of about 5000 instances of causal lexical items with their corresponding semantic frames. The aim of our project is to have both the largest possible coverage of causal phenomena in French, across all parts of speech, and have it linked to a general semantic framework such as FN, to benefit in particular from the relations between other semantic frames, e.g., temporal ones or intentional ones, and the underlying upper lexical ontology that enable some forms of reasoning. This is part of the larger ASFALDA French FrameNet project, which focuses on a few different notional domains which are interesting in their own right (Djemma et al., 2016), including cognitive positions and communication frames. In the process of building the French lexicon and preparing the annotation of the corpus, we had to remodel some of the frames proposed in FN based on English data, with hopefully more precise frame definitions to facilitate human annotation. This includes semantic clarifications of frames and frame elements, redundancy elimination, and added coverage. The result is arguably a significant improvement of the treatment of causality in FN itself.

2014

pdf bib
Annotation scheme for deep dependency syntax of French (Un schéma d’annotation en dépendances syntaxiques profondes pour le français) [in French]
Guy Perrier | Marie Candito | Bruno Guillaume | Corentin Ribeyre | Karën Fort | Djamé Seddah
Proceedings of TALN 2014 (Volume 2: Short Papers)

We define a deep syntactic representation scheme for French, which abstracts away from surface syntactic variation and diathesis alternations, and describe the annotation of deep syntactic representations on top of the surface dependency trees of the Sequoia corpus. The resulting deep-annotated corpus, named deep-sequoia, is freely available, and hopefully useful for corpus linguistics studies and for training deep analyzers to prepare semantic analysis.

The Asfalda project aims to develop a French corpus with frame-based semantic annotations and automatic tools for shallow semantic analysis. We present the first part of the project: focusing on a set of notional domains, we delimited a subset of English frames, adapted them to French data when necessary, and developed the corresponding French lexicon. We believe that working domain by domain helped us to enforce the coherence of the resulting resource, and also has the advantage that, though the number of frames is limited (around a hundred), we obtain full coverage within a given domain.

pdf bib
Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing
Marie Candito | Matthieu Constant
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

pdf bib
The LIGM-Alpage architecture for the SPMRL 2013 Shared Task: Multiword Expression Analysis and Dependency Parsing
Matthieu Constant | Marie Candito | Djamé Seddah
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

2012

pdf bib
The French Social Media Bank: a Treebank of Noisy User Generated Content
Djamé Seddah | Benoit Sagot | Marie Candito | Virginie Mouilleron | Vanessa Combet
Proceedings of COLING 2012

pdf bib
Le corpus Sequoia : annotation syntaxique et exploitation pour l’adaptation d’analyseur par pont lexical (The Sequoia Corpus : Syntactic Annotation and Use for a Parser Lexical Domain Adaptation Method) [in French]
Marie Candito | Djamé Seddah
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib abs
Ubiquitous Usage of a Broad Coverage French Corpus: Processing the Est Republicain corpus
Djamé Seddah | Marie Candito | Benoit Crabbé | Enrique Henestroza Anguiano
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we introduce a set of resources that we have derived from the EST RÉPUBLICAIN CORPUS, a large, freely-available collection of regional newspaper articles in French, totaling 150 million words. Our resources are the result of a full NLP treatment of the EST RÉPUBLICAIN CORPUS: handling of multi-word expressions, lemmatization, part-of-speech tagging, and syntactic parsing. Processing of the corpus is carried out using statistical machine-learning approaches - joint model of data driven lemmatization and part- of-speech tagging, PCFG-LA and dependency based models for parsing - that have been shown to achieve state-of-the-art performance when evaluated on the French Treebank. Our derived resources are made freely available, and released according to the original Creative Common license for the EST RÉPUBLICAIN CORPUS. We additionally provide an overview of the use of these resources in various applications, in particular the use of generated word clusters from the corpus to alleviate lexical data sparseness for statistical parsing.

pdf bib
Probabilistic Lexical Generalization for French Dependency Parsing
Enrique Henestroza Anguiano | Marie Candito
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

2011

pdf bib
Parse Correction with Specialized Models for Difficult Attachment Types
Enrique Henestroza Anguiano | Marie Candito
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
A Word Clustering Approach to Domain Adaptation: Effective Parsing of Biomedical Texts
Marie Candito | Enrique Henestroza Anguiano | Djamé Seddah
Proceedings of the 12th International Conference on Parsing Technologies

2010

pdf bib
Benchmarking of Statistical Dependency Parsers for French
Marie Candito | Joakim Nivre | Pascal Denis | Enrique Henestroza Anguiano
Coling 2010: Posters

pdf bib abs
Statistical French Dependency Parsing: Treebank Conversion and First Results
Marie Candito | Benoît Crabbé | Pascal Denis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We first describe the automatic conversion of the French Treebank (Abeillé and Barrier, 2004), a constituency treebank, into typed projective dependency trees. In order to evaluate the overall quality of the resulting dependency treebank, and to quantify the cases where the projectivity constraint leads to wrong dependencies, we compare a subset of the converted treebank to manually validated dependency trees. We then compare the performance of two treebank-trained parsers that output typed dependency parses. The first parser is the MST parser (Mcdonald et al., 2006), which we directly train on dependency trees. The second parser is a combination of the Berkeley parser (Petrov et al., 2006) and a functional role labeler: trained on the original constituency treebank, the Berkeley parser first outputs constituency trees, which are then labeled with functional roles, and then converted into dependency trees. We found that used in combination with a high-accuracy French POS tagger, the MST parser performs a little better for unlabeled dependencies (UAS=90.3% versus 89.6%), and better for labeled dependencies (LAS=87.6% versus 85.6%).

pdf bib
Parsing Word Clusters
Marie Candito | Djamé Seddah
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf bib
Lemmatization and Lexicalized Statistical Parsing of Morphologically-Rich Languages: the Case of French
Djamé Seddah | Grzegorz Chrupała | Özlem Çetinoğlu | Josef van Genabith | Marie Candito
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

2009

pdf bib abs
Analyse syntaxique du français : des constituants aux dépendances
Marie Candito | Benoît Crabbé | Pascal Denis | François Guérin
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Cet article présente une technique d’analyse syntaxique statistique à la fois en constituants et en dépendances. L’analyse procède en ajoutant des étiquettes fonctionnelles aux sorties d’un analyseur en constituants, entraîné sur le French Treebank, pour permettre l’extraction de dépendances typées. D’une part, nous spécifions d’un point de vue formel et linguistique les structures de dépendances à produire, ainsi que la procédure de conversion du corpus en constituants (le French Treebank) vers un corpus cible annoté en dépendances, et partiellement validé. D’autre part, nous décrivons l’approche algorithmique qui permet de réaliser automatiquement le typage des dépendances. En particulier, nous nous focalisons sur les méthodes d’apprentissage discriminantes d’étiquetage en fonctions grammaticales.

pdf bib abs
Adaptation de parsers statistiques lexicalisés pour le français : Une évaluation complète sur corpus arborés
Djamé Seddah | Marie Candito | Benoît Crabbé
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Cet article présente les résultats d’une évaluation exhaustive des principaux analyseurs syntaxiques probabilistes dit “lexicalisés” initialement conçus pour l’anglais, adaptés pour le français et évalués sur le CORPUS ARBORÉ DU FRANÇAIS (Abeillé et al., 2003) et le MODIFIED FRENCH TREEBANK (Schluter & van Genabith, 2007). Confirmant les résultats de (Crabbé & Candito, 2008), nous montrons que les modèles lexicalisés, à travers les modèles de Charniak (Charniak, 2000), ceux de Collins (Collins, 1999) et le modèle des TIG Stochastiques (Chiang, 2000), présentent des performances moindres face à un analyseur PCFG à Annotation Latente (Petrov et al., 2006). De plus, nous montrons que le choix d’un jeu d’annotations issus de tel ou tel treebank oriente fortement les résultats d’évaluations tant en constituance qu’en dépendance non typée. Comparés à (Schluter & van Genabith, 2008; Arun & Keller, 2005), tous nos résultats sont state-of-the-art et infirment l’hypothèse d’une difficulté particulière qu’aurait le français en terme d’analyse syntaxique probabiliste et de sources de données.

bib
On Statistical Parsing of French with Supervised and Semi-Supervised Strategies
Marie Candito | Benoit Crabbé | Djamé Seddah
Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference

pdf bib
Improving generative statistical parsing with semi-supervised word clustering
Marie Candito | Benoît Crabbé
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

pdf bib
Cross parser evaluation : a French Treebanks study
Djamé Seddah | Marie Candito | Benoît Crabbé
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

2008

pdf bib abs
Expériences d’analyse syntaxique statistique du français
Benoît Crabbé | Marie Candito
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Nous montrons qu’il est possible d’obtenir une analyse syntaxique statistique satisfaisante pour le français sur du corpus journalistique, à partir des données issues du French Treebank du laboratoire LLF, à l’aide d’un algorithme d’analyse non lexicalisé.