Pascal Denis - ACL Anthology

Pascal Denis

2025

Vers les Sens et Au-delà : Induire des Concepts Sémantiques Avec des Modèles de Langue Contextuels [To Word Senses and Beyond: Inducing Concepts with Contextualized Language Models]
Bastien Liétard | Pascal Denis | Mikaela Keller
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Polysemy and synonymy are two crucial, interdependent facets of lexical-semantic ambiguity, but they are often considered independently in practical NLP problems. In this paper, we introduce concept induction, an unsupervised task that consists of learning a soft clustering of words that defines a set of concepts directly from data. This task generalizes word sense induction (through a word’s membership in multiple clusters). We propose a two-level approach to concept induction, with a lemma-centric view and a global view of the lexicon. We evaluate the resulting clustering on SemCor’s annotated data and obtain good performance (BCubed F1 above 0.60). We find that the two levels are mutually beneficial for inducing concepts and senses. Finally, we create so-called “static” embeddings representing our induced concepts and obtain performance competitive with the state of the art on Word-in-Context.

2024

MMAR: Multilingual and Multimodal Anaphora Resolution in Instructional Videos
Cennet Oguz | Pascal Denis | Simon Ostermann | Emmanuel Vincent | Natalia Skachkova | Josef Van Genabith
Findings of the Association for Computational Linguistics: EMNLP 2024

Multilingual anaphora resolution identifies referring expressions and implicit arguments in texts and links them to antecedents, across several languages. In the most challenging setting, cross-lingual anaphora resolution, training data and test data are in different languages. As knowledge needs to be transferred across languages, this task is challenging in both the multilingual and the cross-lingual setting. We hypothesize that one way to alleviate some of the difficulty of the task is to include multimodal information in the form of images (i.e. frames extracted from instructional videos). Such visual inputs are by nature language-agnostic; therefore, cross- and multilingual anaphora resolution should benefit from visual information. In this paper, we provide the first multilingual and multimodal dataset annotated with anaphoric relations and present experimental results for end-to-end multimodal and multilingual anaphora resolution. Given gold mentions, multimodal features improve anaphora resolution results by ~10% for unseen languages.

To Word Senses and Beyond: Inducing Concepts with Contextualized Language Models
Bastien Liétard | Pascal Denis | Mikaela Keller
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Polysemy and synonymy are two crucial interrelated facets of lexical ambiguity. While both phenomena are widely documented in lexical resources and have been studied extensively in NLP, leading to dedicated systems, they are often considered independently in practical problems. While many tasks dealing with polysemy (e.g. Word Sense Disambiguation or Induction) highlight the role of words’ senses, the study of synonymy is rooted in the study of concepts, i.e. meanings shared across the lexicon. In this paper, we introduce Concept Induction, the unsupervised task of learning a soft clustering among words that defines a set of concepts directly from data. This task generalizes Word Sense Induction. We propose a bi-level approach to Concept Induction that leverages both a local lemma-centric view and a global cross-lexicon view to induce concepts. We evaluate the obtained clustering on SemCor’s annotated data and obtain good performance (BCubed F₁ above 0.60). We find that the local and the global levels are mutually beneficial to induce concepts and also senses in our setting. Finally, we create static embeddings representing our induced concepts and use them on the Word-in-Context task, obtaining performance competitive with the state of the art.

Towards an Onomasiological Study of Lexical Semantic Change Through the Induction of Concepts
Bastien Liétard | Mikaela Keller | Pascal Denis
Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change

2023

Find-2-Find: Multitask Learning for Anaphora Resolution and Object Localization
Cennet Oguz | Pascal Denis | Emmanuel Vincent | Simon Ostermann | Josef van Genabith
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In multimodal understanding tasks, visual and linguistic ambiguities can arise. Visual ambiguity can occur when visual objects require a model to ground a referring expression in a video without strong supervision, while linguistic ambiguity can occur from changes in entities in action flows. As an example from the cooking domain, “oil” mixed with “salt” and “pepper” could later be referred to as a “mixture”. Without a clear visual-linguistic alignment, we cannot know which among several objects shown is referred to by the language expression “mixture”, and without resolved antecedents, we cannot pinpoint what the mixture is. We define this chicken-and-egg problem as Visual-linguistic Ambiguity. In this paper, we present Find2Find, a joint anaphora resolution and object localization dataset targeting the problem of visual-linguistic ambiguity, consisting of 500 anaphora-annotated recipes with corresponding videos. We present experimental results of a novel end-to-end joint multitask learning framework for Find2Find that fuses visual and textual information and shows improvements both for anaphora resolution and object localization with one joint model in multitask learning, as compared to a strong single-task baseline.

WordNet Is All You Need: A Surprisingly Effective Unsupervised Method for Graded Lexical Entailment
Joseph Renner | Pascal Denis | Rémi Gilleron
Findings of the Association for Computational Linguistics: EMNLP 2023

We propose a simple unsupervised approach which exclusively relies on WordNet (Miller, 1995) for predicting graded lexical entailment (GLE) in English. Inspired by the seminal work of Resnik (1995), our method models GLE as the sum of two information-theoretic scores: a symmetric semantic similarity score and an asymmetric specificity loss score, both exploiting the hierarchical synset structure of WordNet. Our approach also includes a simple disambiguation mechanism to handle polysemy in a given word pair. Despite its simplicity, our method achieves performance above the state of the art (Spearman 𝜌 = 0.75) on HyperLex (Vulic et al., 2017), the largest GLE dataset, outperforming all previous methods, including specialized word embedding approaches that use WordNet as weak supervision.

Exploring Category Structure with Contextual Language Models and Lexical Semantic Networks
Joseph Renner | Pascal Denis | Rémi Gilleron | Angèle Brunellière
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

The psychological plausibility of word embeddings has been studied through different tasks such as word similarity, semantic priming, and lexical entailment. Recent work on predicting category structure with word embeddings reports low correlations with human ratings. Heyman and Heyman (2019) showed that static word embeddings fail at predicting typicality using cosine similarity between category and exemplar words, while Misra et al. (2021) obtain equally modest results for various contextual language models (CLMs) using a Cloze task formulation over hand-crafted taxonomic sentences. In this work, we test a wider array of methods for probing CLMs for predicting typicality scores. First, our experiments, using BERT (Devlin et al., 2018), show the importance of using the right type of CLM probes, as our best BERT-based typicality prediction methods improve on previous works. Second, our results highlight the importance of polysemy in this task, as our best results are obtained when contextualization is paired with a disambiguation mechanism as in Chronis and Erk (2020). Finally, additional experiments and analyses reveal that Information Content-based WordNet (Miller, 1995) similarities with disambiguation match the performance of the best BERT-based method, and in fact capture complementary information, and when combined with BERT allow for enhanced typicality predictions.

Fair Without Leveling Down: A New Intersectional Fairness Definition
Gaurav Maheshwari | Aurélien Bellet | Pascal Denis | Mikaela Keller
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In this work, we consider the problem of intersectional group fairness in the classification setting, where the objective is to learn discrimination-free models in the presence of several intersecting sensitive groups. First, we illustrate various shortcomings of existing fairness measures commonly used to capture intersectional fairness. Then, we propose a new definition called the 𝛼-Intersectional Fairness, which combines the absolute and the relative performance across sensitive groups and can be seen as a generalization of the notion of differential fairness. We highlight several desirable properties of the proposed definition and analyze its relation to other fairness measures. Finally, we benchmark multiple popular in-processing fair machine learning approaches using our new fairness definition and show that they do not achieve any improvement over a simple baseline. Our results reveal that the increase in fairness measured by previous definitions hides a “leveling down” effect, i.e., degrading the best performance over groups rather than improving the worst one.

A Tale of Two Laws of Semantic Change: Predicting Synonym Changes with Distributional Semantic Models
Bastien Liétard | Mikaela Keller | Pascal Denis
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

Lexical Semantic Change is the study of how the meaning of words evolves through time. Another related question is whether and how lexical relations over pairs of words, such as synonymy, change over time. There are currently two competing, apparently opposite hypotheses in the historical linguistic literature regarding how synonymous words evolve: the Law of Differentiation (LD) argues that synonyms tend to take on different meanings over time, whereas the Law of Parallel Change (LPC) claims that synonyms tend to undergo the same semantic change and therefore remain synonyms. So far, there has been little research using distributional models to assess to what extent these laws apply to historical corpora. In this work, we take a first step toward detecting whether LD or LPC operates for given word pairs. After recasting the problem into a more tractable task, we combine two linguistic resources to propose the first complete evaluation framework on this problem and provide empirical evidence in favor of a dominance of LD. We then propose various computational approaches to the problem, using Distributional Semantic Models and grounded in recent literature on Lexical Semantic Change detection. Our best approaches achieve a balanced accuracy above 0.6 on our dataset. We discuss challenges still faced by these approaches, such as polysemy or the potential confusion between synonymy and hypernymy.

2022

Fair NLP Models with Differentially Private Text Encoders
Gaurav Maheshwari | Pascal Denis | Mikaela Keller | Aurélien Bellet
Findings of the Association for Computational Linguistics: EMNLP 2022

Encoded text representations often capture sensitive attributes about individuals (e.g., race or gender), which raises privacy concerns and can make downstream models unfair to certain groups. In this work, we propose FEDERATE, an approach that combines ideas from differential privacy and adversarial training to learn private text representations which also induce fairer models. We empirically evaluate the trade-off between the privacy of the representations and the fairness and accuracy of the downstream model on four NLP datasets. Our results show that FEDERATE consistently improves upon previous methods, and thus suggest that privacy and fairness can positively reinforce each other.

Chop and Change: Anaphora Resolution in Instructional Cooking Videos
Cennet Oguz | Ivana Kruijff-Korbayova | Emmanuel Vincent | Pascal Denis | Josef van Genabith
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Linguistic ambiguities arising from changes in entities in action flows are a key challenge in instructional cooking videos. In particular, temporally evolving entities present rich and to date understudied challenges for anaphora resolution. For example, “oil” mixed with “salt” is later referred to as a “mixture”. In this paper, we propose novel annotation guidelines to annotate recipes for the anaphora resolution task, reflecting change in entities. Moreover, we present experimental results for end-to-end multimodal anaphora resolution with the new annotation scheme and propose the use of temporal features for performance improvement.

2021

Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale
Pascal Denis | Natalia Grabar | Amel Fraisse | Rémi Cardon | Bernard Jacquemin | Eric Kergosien | Antonio Balvet
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 : 23e REncontres jeunes Chercheurs en Informatique pour le TAL (RECITAL)
Pascal Denis | Natalia Grabar | Amel Fraisse | Rémi Cardon | Bernard Jacquemin | Eric Kergosien | Antonio Balvet
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 : 23e REncontres jeunes Chercheurs en Informatique pour le TAL (RECITAL)

Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 : Démonstrations
Pascal Denis | Natalia Grabar | Amel Fraisse | Rémi Cardon | Bernard Jacquemin | Eric Kergosien | Antonio Balvet
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 : Démonstrations

An End-to-End Approach for Full Bridging Resolution
Joseph Renner | Priyansh Trivedi | Gaurav Maheshwari | Rémi Gilleron | Pascal Denis
Proceedings of the CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

In this article, we describe our submission to the CODI-CRAC 2021 Shared Task on Anaphora Resolution in Dialogues – Track BR (Gold). We demonstrate the performance of an end-to-end transformer-based higher-order coreference model finetuned for the task of full bridging. We find that while our approach is not effective at modeling the complexities of the task, it performs well on bridging resolution, suggesting a need for investigations into a robust anaphor identification model for future improvements.

2020

Integrating knowledge graph embeddings to improve mention representation for bridging anaphora resolution
Onkar Pandit | Pascal Denis | Liva Ralaivola
Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference

Lexical semantics and world knowledge are crucial for interpreting bridging anaphora. Yet, existing computational methods for acquiring and injecting this type of information into bridging resolution systems suffer from important limitations. Based on explicit querying of external knowledge bases, earlier approaches are computationally expensive (hence, hardly scalable) and they map the data to be processed into high-dimensional spaces (which calls for careful handling of the curse of dimensionality and overfitting). In this work, we take a different and principled approach which naturally addresses these issues. Specifically, we convert the external knowledge source (in this case, WordNet) into a graph, and learn low-dimensional embeddings of the graph nodes that capture the crucial features of the graph topology and, at the same time, rich semantic information. Once properly identified from the mention text spans, these low-dimensional graph node embeddings are combined with distributional text-based embeddings to provide enhanced mention representations. We illustrate the effectiveness of our approach by evaluating it on commonly used datasets, namely ISNotes and BASHI. Our enhanced mention representations yield significant accuracy improvements on both datasets when compared to different standalone text-based mention representations.

Joint Learning of the Graph and the Data Representation for Graph-Based Semi-Supervised Learning
Mariana Vargas-Vieyra | Aurélien Bellet | Pascal Denis
Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)

Graph-based semi-supervised learning is appealing when labels are scarce but large amounts of unlabeled data are available. These methods typically use a heuristic strategy to construct the graph based on some fixed data representation, independently of the available labels. In this paper, we propose to jointly learn a data representation and a graph from both labeled and unlabeled data such that (i) the learned representation indirectly encodes the label information injected into the graph, and (ii) the graph provides a smooth topology with respect to the transformed data. Plugging the resulting graph and representation into existing graph-based semi-supervised learning algorithms like label spreading and graph convolutional networks, we show that our approach outperforms standard graph construction methods on both synthetic data and real datasets.

2019

Phylogenic Multi-Lingual Dependency Parsing
Mathieu Dehouck | Pascal Denis
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Languages evolve and diverge over time. Their evolutionary history is often depicted in the shape of a phylogenetic tree. Assuming parsing models are representations of their languages’ grammars, their evolution should follow a structure similar to that of the phylogenetic tree. In this paper, drawing inspiration from multi-task learning, we make use of the phylogenetic tree to guide the learning of multi-lingual dependency parsers leveraging languages’ structural similarities. Experiments on data from the Universal Dependencies project show that phylogenetic training is beneficial to low-resourced languages and to well-furnished language families. As a side product of phylogenetic training, our model is able to perform zero-shot parsing of previously unseen languages.

2018

A Framework for Understanding the Role of Morphology in Universal Dependency Parsing
Mathieu Dehouck | Pascal Denis
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper presents a simple framework for characterizing morphological complexity and how it encodes syntactic information. In particular, we propose a new measure of morpho-syntactic complexity in terms of governor-dependent preferential attachment that explains parsing performance. Through experiments on dependency parsing with data from Universal Dependencies (UD), we show that representations derived from morphological attributes deliver important parsing performance improvements over standard word form embeddings when trained on the same datasets. We also show that the new morpho-syntactic complexity measure is predictive of the gains provided by using morphological attributes over plain forms on parsing scores, making it a tool to distinguish languages using morphology as a syntactic marker from others.

A Probabilistic Model for Joint Learning of Word Embeddings from Texts and Images
Melissa Ailem | Bowen Zhang | Aurélien Bellet | Pascal Denis | Fei Sha
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Several recent studies have shown the benefits of combining language and perception to infer word embeddings. These multimodal approaches either simply combine pre-trained textual and visual representations (e.g. features extracted from convolutional neural networks), or use the latter to bias the learning of textual word embeddings. In this work, we propose a novel probabilistic model to formalize how linguistic and perceptual inputs can work in concert to explain the observed word-context pairs in a text corpus. Our approach learns textual and visual representations jointly: latent visual factors couple together a skip-gram model for co-occurrence in linguistic data and a generative latent variable model for visual data. Extensive experimental studies validate the proposed model. Concretely, on the tasks of assessing pairwise word similarity and image/caption retrieval, our approach attains equally competitive or stronger results when compared to other state-of-the-art multimodal models.

2017

Online Learning of Task-specific Word Representations with a Joint Biconvex Passive-Aggressive Algorithm
Pascal Denis | Liva Ralaivola
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This paper presents a new, efficient method for learning task-specific word vectors using a variant of the Passive-Aggressive algorithm. Specifically, this algorithm learns a word embedding matrix in tandem with the classifier parameters in an online fashion, solving a bi-convex constrained optimization at each iteration. We provide a theoretical analysis of this new algorithm in terms of regret bounds, and evaluate it on both synthetic data and NLP classification problems, including text classification and sentiment analysis. In the latter case, we compare various pre-trained word vectors to initialize our word embedding matrix, and show that the matrix learned by our algorithm vastly outperforms the initial matrix, with performance results comparable to or above the state of the art on these tasks.

Delexicalized Word Embeddings for Cross-lingual Dependency Parsing
Mathieu Dehouck | Pascal Denis
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This paper presents a new approach to the problem of cross-lingual dependency parsing, aiming at leveraging training data from different source languages to learn a parser in a target language. Specifically, this approach first constructs word vector representations that exploit structural (i.e., dependency-based) contexts but only consider the morpho-syntactic information associated with each word and its contexts. These delexicalized word embeddings, which can be trained on any set of languages and capture features shared across languages, are then used in combination with standard language-specific features to train a lexicalized parser in the target language. We evaluate our approach through experiments on a set of eight different languages that are part of the Universal Dependencies Project. Our main results show that using such delexicalized embeddings, trained in either a monolingual or a multilingual fashion, achieves significant improvements over monolingual baselines.

2016

Learning Connective-based Word Representations for Implicit Discourse Relation Identification
Chloé Braud | Pascal Denis
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

Comparing Word Representations for Implicit Discourse Relation Classification
Chloé Braud | Pascal Denis
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Combining Natural and Artificial Examples to Improve Implicit Discourse Relation Identification
Chloé Braud | Pascal Denis
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Identifier les relations discursives implicites en combinant données naturelles et données artificielles [Identifying implicit discourse relations by combining natural and artificial data]
Chloé Braud | Pascal Denis
Traitement Automatique des Langues, Volume 55, Numéro 1 : Varia [Varia]

2013

Automatically identifying implicit discourse relations using annotated data and raw corpora (Identification automatique des relations discursives « implicites » à partir de données annotées et de corpus bruts) [in French]
Chloé Braud | Pascal Denis
Proceedings of TALN 2013 (Volume 1: Long Papers)

Learning a hierarchy of specialized pairwise models for coreference resolution (Apprentissage d’une hiérarchie de modèles à paires spécialisés pour la résolution de la coréférence) [in French]
Emmanuel Lassalle | Pascal Denis
Proceedings of TALN 2013 (Volume 1: Long Papers)

Improving pairwise coreference models through feature space hierarchy learning
Emmanuel Lassalle | Pascal Denis
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Expressivity and comparison of models of discourse structure
Antoine Venant | Nicholas Asher | Philippe Muller | Pascal Denis | Stergos Afantenos
Proceedings of the SIGDIAL 2013 Conference

2012

Constrained Decoding for Text-Level Discourse Parsing
Philippe Muller | Stergos Afantenos | Pascal Denis | Nicholas Asher
Proceedings of COLING 2012

2011

French TimeBank : un corpus de référence sur la temporalité en français (French TimeBank: a reference corpus on temporality in French)
André Bittar | Pascal Amsili | Pascal Denis
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article has a twofold aim: on the one hand, to present to the community a recently released corpus, the French Time Bank (FTiB), a collection of journalistic texts annotated for times and events according to the ISO-TimeML standard; on the other hand, to share the results and methodological reflections that we drew from building this reference corpus, in the hope that our experience will prove useful beyond the community interested in the processing of temporality.

FreDist : Construction automatique d’un thésaurus distributionnel pour le Français (FreDist : Automatic construction of distributional thesauri for French)
Enrique Henestroza Anguiano | Pascal Denis
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this article, we present FreDist, a free software package for the automatic construction of distributional thesauri from text corpora, along with an evaluation of the various resources thus produced. Following the work of Lin (1998) and Curran (2004), we use a large newspaper corpus and implement different options for the type of lexical context relation, the weight function, and the similarity measure function. Taking the French EuroWordNet and WOLF as references, our evaluation reveals, as a novel finding, that the approach combining linear contexts (here, bigrams) and syntactic contexts appears to yield the best thesaurus. Finally, we hope that our software, distributed with our best thesauri for French, will be useful to the NLP community.

French TimeBank: An ISO-TimeML Annotated Reference Corpus
André Bittar | Pascal Amsili | Pascal Denis | Laurence Danlos
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Comparison of different algebras for inducing the temporal structure of texts
Pascal Denis | Philippe Muller
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Learning Recursive Segments for Discourse Parsing
Stergos Afantenos | Pascal Denis | Philippe Muller | Laurence Danlos
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Automatically detecting discourse segments is an important preliminary step towards full discourse parsing. Previous research on discourse segmentation has relied on the assumption that elementary discourse units (EDUs) in a document always form a linear sequence (i.e., they can never be nested). Unfortunately, this assumption turns out to be too strong, for some theories of discourse, like the “Segmented Discourse Representation Theory” or SDRT, allow for nested discourse units. In this paper, we present a simple approach to discourse segmentation that is able to produce nested EDUs. Our approach builds on standard multi-class classification techniques making use of a regularized maximum entropy model, combined with a simple repairing heuristic that enforces global coherence. Our system was developed and evaluated on the first round of annotations provided by the French Annodis project (an ongoing effort to create a discourse bank for French). Cross-validated on only 47 documents (1,445 EDUs), our system achieves encouraging performance results with an F-score of 73% for finding EDUs.

Benchmarking of Statistical Dependency Parsers for French
Marie Candito | Joakim Nivre | Pascal Denis | Enrique Henestroza Anguiano
Coling 2010: Posters

Statistical French Dependency Parsing: Treebank Conversion and First Results
Marie Candito | Benoît Crabbé | Pascal Denis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We first describe the automatic conversion of the French Treebank (Abeillé and Barrier, 2004), a constituency treebank, into typed projective dependency trees. In order to evaluate the overall quality of the resulting dependency treebank, and to quantify the cases where the projectivity constraint leads to wrong dependencies, we compare a subset of the converted treebank to manually validated dependency trees. We then compare the performance of two treebank-trained parsers that output typed dependency parses. The first parser is the MST parser (McDonald et al., 2006), which we directly train on dependency trees. The second parser is a combination of the Berkeley parser (Petrov et al., 2006) and a functional role labeler: trained on the original constituency treebank, the Berkeley parser first outputs constituency trees, which are then labeled with functional roles, and then converted into dependency trees. We found that, used in combination with a high-accuracy French POS tagger, the MST parser performs a little better for unlabeled dependencies (UAS=90.3% versus 89.6%), and better for labeled dependencies (LAS=87.6% versus 85.6%).

Exploitation d’une ressource lexicale pour la construction d’un étiqueteur morpho-syntaxique état-de-l’art du français [Exploiting a lexical resource to build a state-of-the-art part-of-speech tagger for French]
Pascal Denis | Benoît Sagot
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents MEltfr, an automatic part-of-speech tagger for French. It relies on a sequential probabilistic model that benefits from information drawn from an external lexicon, namely the Lefff. Evaluated on the FTB, MEltfr reaches an accuracy of 97.75% (91.36% on unknown words) on a set of 29 tags. This corresponds to an 18% reduction in error rate (36.1% on unknown words) compared with the same model without coupling to the Lefff. We study the contribution of this resource in more detail through two series of experiments. These show in particular that the features derived from the Lefff provide better coverage as well as finer modeling of the right context of words.

2009

Coupling an Annotated Corpus and a Morphosyntactic Lexicon for State-of-the-Art POS Tagging with Less Human Effort
Pascal Denis | Benoît Sagot
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

Analyse syntaxique du français : des constituants aux dépendances [Syntactic parsing of French: from constituents to dependencies]
Marie Candito | Benoît Crabbé | Pascal Denis | François Guérin
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents a statistical parsing technique that produces both constituency and dependency analyses. Parsing proceeds by adding functional labels to the output of a constituency parser trained on the French Treebank, so that typed dependencies can be extracted. On the one hand, we specify from a formal and linguistic standpoint the dependency structures to be produced, as well as the procedure for converting the constituency corpus (the French Treebank) into a target corpus annotated with dependencies and partially validated. On the other hand, we describe the algorithmic approach that performs the dependency typing automatically. In particular, we focus on discriminative learning methods for grammatical function labeling.

2008

Specialized Models and Ranking for Coreference Resolution
Pascal Denis | Jason Baldridge
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

Joint Determination of Anaphoricity and Coreference Resolution using Integer Programming
Pascal Denis | Jason Baldridge
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
