Thierry Poibeau


2020

Multi-SimLex: A Large-Scale Evaluation of Multilingual and Crosslingual Lexical Semantic Similarity
Ivan Vulić | Simon Baker | Edoardo Maria Ponti | Ulla Petti | Ira Leviant | Kelly Wing | Olga Majewska | Eden Bar | Matt Malone | Thierry Poibeau | Roi Reichart | Anna Korhonen
Computational Linguistics, Volume 46, Issue 4 - December 2020

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
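
The figure of 66 crosslingual data sets is consistent with one data set per unordered pair of the 12 languages. The short check below is only a sketch of that arithmetic; the language labels are hypothetical placeholders, not the resource's actual language codes.

```python
from itertools import combinations
from math import comb

# Hypothetical placeholder labels standing in for the 12 languages.
languages = [f"lang{i:02d}" for i in range(1, 13)]

# One crosslingual data set per unordered language pair (an assumption
# that matches the count reported in the abstract).
pairs = list(combinations(languages, 2))

n_pairs = len(pairs)
assert n_pairs == comb(12, 2)
print(n_pairs)
```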

Sonnet Combinatorics with OuPoCo
Thierry Poibeau | Mylène Maignant | Frédérique Mélanie-Becquet | Clément Plancq | Matthieu Raffard | Mathilde Roussel
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In this paper, we describe OuPoCo, a system that produces new sonnets by recombining verses from existing sonnets, following an idea that Queneau described in his book “Cent Mille Milliards de poèmes” (Gallimard, 1961). We propose to demonstrate different outputs of our implementation (a Web site, a Twitter bot and a specifically developed device, called ‘La Boîte à poésie’) based on a corpus of 19th-century French poetry. Our goal is to make people interested in poetry again, by giving access to automatically produced sonnets through original and entertaining channels and devices.
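
The recombination idea can be illustrated with a minimal sketch. It assumes a toy corpus of placeholder lines and picks each line position from a randomly chosen source sonnet; the function name `recombine` is hypothetical, and constraints such as rhyme or metre, which a real system would need to handle, are ignored here.

```python
import random

def recombine(sonnets, rng):
    """Build a new sonnet by taking line i from a randomly chosen source sonnet."""
    n_lines = len(sonnets[0])
    # All source sonnets must have the same number of lines.
    assert all(len(s) == n_lines for s in sonnets)
    return [rng.choice(sonnets)[i] for i in range(n_lines)]

# Toy stand-in corpus: 10 sonnets of 14 placeholder lines each,
# the same shape as Queneau's book.
sonnets = [[f"sonnet {s}, line {i}" for i in range(1, 15)] for s in range(1, 11)]

new_sonnet = recombine(sonnets, random.Random(0))

# With 10 choices for each of 14 positions there are 10**14 possible poems,
# i.e. "cent mille milliards" (a hundred thousand billion).
n_possible = len(sonnets) ** len(sonnets[0])
```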

2019

Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Edoardo Maria Ponti | Helen O’Horan | Yevgeni Berzak | Ivan Vulić | Roi Reichart | Thierry Poibeau | Ekaterina Shutova | Anna Korhonen
Computational Linguistics, Volume 45, Issue 3 - September 2019

Linguistic typology aims to capture structural and semantic variation across the world’s languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from a lack of human-labeled resources. We present an extensive literature survey on the use of typological information in the development of NLP techniques. Our survey demonstrates that to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance. We show that this is due to both intrinsic limitations of databases (in terms of coverage and feature granularity) and under-utilization of the typological features included in them. We advocate for a new approach that adapts the broad and discrete nature of typological categories to the contextual and continuous nature of machine learning algorithms used in contemporary NLP. In particular, we suggest that such an approach could be facilitated by recent developments in data-driven induction of typological knowledge.

2018

SEx BiST: A Multi-Source Trainable Parser with Deep Contextualized Lexical Representations
KyungTae Lim | Cheoneum Park | Changki Lee | Thierry Poibeau
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We describe the SEx BiST parser (Semantically EXtended Bi-LSTM parser) developed at Lattice for the CoNLL 2018 Shared Task (Multilingual Parsing from Raw Text to Universal Dependencies). The main characteristic of our work is the encoding of three different modes of contextual information for parsing: (i) Treebank feature representations, (ii) Multilingual word representations, (iii) ELMo representations obtained via unsupervised learning from external resources. Our parser performed well in the official end-to-end evaluation (73.02 LAS – 4th/26 teams, and 78.72 UAS – 2nd/26); remarkably, we achieved the best UAS scores on all the English corpora by applying the three suggested feature representations. Finally, we were also ranked 1st in the optional event extraction task, part of the 2018 Extrinsic Parser Evaluation campaign.

Multilingual Dependency Parsing for Low-Resource Languages: Case Studies on North Saami and Komi-Zyrian
KyungTae Lim | Niko Partanen | Thierry Poibeau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Dependency Parsing of Code-Switching Data with Cross-Lingual Feature Representations
Niko Partanen | KyungTae Lim | Michael Rießler | Thierry Poibeau
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing
Marco Idiart | Alessandro Lenci | Thierry Poibeau | Aline Villavicencio
Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing

The First Komi-Zyrian Universal Dependencies Treebanks
Niko Partanen | Rogier Blokland | KyungTae Lim | Thierry Poibeau | Michael Rießler
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Two Komi-Zyrian treebanks were included in the Universal Dependencies 2.2 release. This article contextualizes the treebanks, discusses the process through which they were created, and outlines the future plans and timeline for the next improvements. Special attention is paid to the possibilities of using UD in the documentation and description of endangered languages.

2017

Preliminary Experiments concerning Verbal Predicative Structure Extraction from a Large Finnish Corpus
Guersande Chaminade | Thierry Poibeau
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages

Enjambment Detection in a Large Diachronic Corpus of Spanish Sonnets
Pablo Ruiz | Clara Martínez Cantón | Thierry Poibeau | Elena González-Blanco
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Enjambment takes place when a syntactic unit is broken up across two lines of poetry, giving rise to different stylistic effects. In Spanish literary studies, it remains unclear which types of stylistic effects can arise, and under which linguistic conditions. To gather evidence about this systematically, we developed a system to automatically identify enjambment (and its type) in Spanish. For evaluation, we manually annotated a reference corpus covering different periods. To apply the tool to a scholarly corpus, we created a diachronic corpus from public HTML sources covering four centuries of sonnets (3750 poems), and we analyzed the occurrence of enjambment across stanzaic boundaries in different periods. In addition, we found examples that highlight limitations in current definitions of enjambment.

UDLex: Towards Cross-language Subcategorization Lexicons
Giulia Rambelli | Alessandro Lenci | Thierry Poibeau
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

A System for Multilingual Dependency Parsing based on Bidirectional LSTM Feature Representations
KyungTae Lim | Thierry Poibeau
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

In this paper, we present our multilingual dependency parser developed for the CoNLL 2017 UD Shared Task dealing with “Multilingual Parsing from Raw Text to Universal Dependencies”. Our parser extends the monolingual BIST-parser as a multi-source multilingual trainable parser. Thanks to multilingual word embeddings and one-hot encodings for languages, our system can use both monolingual and multi-source training. We trained 69 monolingual language models and 13 multilingual models for the shared task. Our multilingual approach, which makes use of different resources, yields better results than the monolingual approach for 11 languages. Our system ranked 5th and achieved an overall LAS score of 70.93 over the 81 test corpora (macro-averaged LAS F1 score).

2016

More than Word Cooccurrence: Exploring Support and Opposition in International Climate Negotiations with Semantic Parsing
Pablo Ruiz | Clément Plancq | Thierry Poibeau
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Text analysis methods widely used in digital humanities often involve word co-occurrence, e.g. concept co-occurrence networks. These methods provide a useful corpus overview, but cannot determine the predicates that relate co-occurring concepts. Our goal was to identify propositions expressing the points supported or opposed by participants in international climate negotiations. Word co-occurrence methods were not sufficient, and an analysis based on open relation extraction had limited coverage for nominal predicates. We present a pipeline which identifies the points that different actors support and oppose, via a domain model with support/opposition predicates, and analysis rules that exploit the output of semantic role labelling, syntactic dependencies and anaphora resolution. Entity linking and keyphrase extraction are also performed on the propositions related to each actor. A user interface allows examining the main concepts in points supported or opposed by each participant, which participants agree or disagree with each other, and about which issues. The system is an example of the tools that digital humanities scholars are asking for, to render rich textual information (beyond word co-occurrence) more amenable to quantitative treatment. An evaluation of the tool gave satisfactory results.

Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning
Anna Korhonen | Alessandro Lenci | Brian Murphy | Thierry Poibeau | Aline Villavicencio
Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning

The Role of Intrinsic Motivation in Artificial Language Emergence: a Case Study on Colour
Miquel Cornudella | Thierry Poibeau | Remi van Trijp
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Human languages have multiple strategies that allow us to discriminate objects in a vast variety of contexts. Colours have been extensively studied from this point of view. In particular, previous research in artificial language evolution has shown how artificial languages may emerge based on specific strategies to distinguish colours. Still, it has not been shown how several strategies of diverse complexity can be autonomously managed by artificial agents. We propose an intrinsic motivation system that allows agents in a population to create a shared artificial language and progressively increase its expressive power. Our results show that with such a system agents successfully regulate their language development, which indicates a relation between population size and consistency in the emergent communicative systems.

Exploring a Continuous and Flexible Representation of the Lexicon
Pierre Marchal | Thierry Poibeau
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We aim to show that lexical descriptions based on multifactorial and continuous models can be used by linguists and lexicographers (and not only by machines) so long as they are provided with a way to efficiently navigate data collections. We propose to demonstrate such a system.

2015

Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning
Robert Berwick | Anna Korhonen | Alessandro Lenci | Thierry Poibeau | Aline Villavicencio
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

Language Emergence in a Population of Artificial Agents Equipped with the Autotelic Principle
Miquel Cornudella | Thierry Poibeau
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

Combining Open Source Annotators for Entity Linking through Weighted Voting
Pablo Ruiz | Thierry Poibeau
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

EL92: Entity Linking Combining Open Source Annotators via Weighted Voting
Pablo Ruiz | Thierry Poibeau
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators
Pablo Ruiz | Thierry Poibeau | Frédérique Mélanie
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

2014

Proceedings of the 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL)
Alessandro Lenci | Muntsa Padró | Thierry Poibeau | Aline Villavicencio
Proceedings of the 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL)

Social and Semantic Diversity: Socio-semantic Representation of a Scientific Corpus
Thierry Poibeau | Elisa Omodei | Jean-Philippe Cointet | Yufan Guo
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

Processing Mutations in Breton with Finite-State Transducers
Thierry Poibeau
Proceedings of the First Celtic Language Technology Workshop

Reconstructing the Semantic Landscape of Natural Language Processing
Elisa Omodei | Jean-Philippe Cointet | Thierry Poibeau
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper investigates the evolution of the computational linguistics domain through a quantitative analysis of the ACL Anthology (containing around 12,000 papers published between 1985 and 2008). Our approach combines complex system methods with natural language processing techniques. We reconstruct the socio-semantic landscape of the domain by inferring a co-authorship and a semantic network from the analysis of the corpus. First, keywords are extracted using a hybrid approach mixing linguistic patterns with statistical information. Then, the semantic network is built using a co-occurrence analysis of these keywords within the corpus. Combining temporal and network analysis techniques, we are able to examine the main evolutions of the field and the more active subfields over time. Lastly, we propose a model to explore the mutual influence of the social and the semantic network over time, leading to a socio-semantic co-evolutionary system.

Argumentative analysis of the ACL Anthology (Analyse argumentative du corpus de l’ACL (ACL Anthology)) [in French]
Elisa Omodei | Yufan Guo | Jean-Philippe Cointet | Thierry Poibeau
Proceedings of TALN 2014 (Volume 2: Short Papers)

2013

A Tensor-based Factorization Model of Semantic Compositionality
Tim Van de Cruys | Thierry Poibeau | Anna Korhonen
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Proceedings of the Workshop on Computational Models of Language Acquisition and Loss
Robert Berwick | Anna Korhonen | Thierry Poibeau | Aline Villavicencio
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

ANALEC: a New Tool for the Dynamic Annotation of Textual Data
Frédéric Landragin | Thierry Poibeau | Bernard Victorri
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We introduce ANALEC, a tool whose aim is to bring together corpus annotation, visualization and query management. Our main idea is to provide a unified and dynamic way of annotating textual data. ANALEC allows researchers to dynamically build their own annotation scheme and use the possibilities of scheme revision, data querying and graphical visualization during the annotation process. Each query result can be visualized using a graphical representation that puts forward a set of annotations that can be directly corrected or completed. Text annotation is thus treated as a cyclic process. We show that statistics like frequencies and correlations make it possible to verify annotated data on the fly during the annotation. In this paper we introduce the annotation functionalities of ANALEC, some of the annotated data visualization functionalities, and three statistical modules: frequency, correlation and geometrical representations. Some examples dealing with reference and coreference annotation illustrate the main contributions of ANALEC.

Multi-way Tensor Factorization for Unsupervised Lexical Acquisition
Tim Van de Cruys | Laura Rimell | Thierry Poibeau | Anna Korhonen
Proceedings of COLING 2012

2011

A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents
Yufan Guo | Anna Korhonen | Thierry Poibeau
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Latent Vector Weighting for Word Meaning in Context
Tim Van de Cruys | Thierry Poibeau | Anna Korhonen
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

A New Scheme for Annotating Semantic Relations between Named Entities in Corpora
Mani Ezzat | Thierry Poibeau
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2010

Investigating the cross-linguistic potential of VerbNet-style classification
Lin Sun | Thierry Poibeau | Anna Korhonen | Cédric Messiant
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2009

CBSEAS, a Summarization System – Integration of Opinion Mining Techniques to Summarize Blogs
Aurélien Bossard | Michel Généreux | Thierry Poibeau
Proceedings of the Demonstrations Session at EACL 2009

Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
Adeline Nazarenko | Thierry Poibeau
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Annotation fonctionnelle de corpus arborés avec des Champs Aléatoires Conditionnels
Erwan Moreau | Isabelle Tellier | Antonio Balvet | Grégoire Laurence | Antoine Rozenknop | Thierry Poibeau
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

The aim of this article is to assess to what extent the “syntactic functions” that appear in part of the Paris 7 treebank can be learned from examples. The machine learning technique used for this relies on Conditional Random Fields (CRFs), in a variant adapted to tree annotation. The experiments carried out are described in detail and analyzed. With suitable parameter settings, they reach an F1-measure of more than 80%.

Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Prise de position
Adeline Nazarenko | Thierry Poibeau
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Prise de position

Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts
Adeline Nazarenko | Thierry Poibeau
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations
Adeline Nazarenko | Thierry Poibeau
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Proceedings of the EACL 2009 Workshop on Cognitive Aspects of Computational Language Acquisition
Afra Alishahi | Thierry Poibeau | Aline Villavicencio
Proceedings of the EACL 2009 Workshop on Cognitive Aspects of Computational Language Acquisition

Integrating Document Structure into a Multi-Document Summarizer
Aurélien Bossard | Thierry Poibeau
Proceedings of the International Conference RANLP-2009

2008

Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization
Sivaji Bandyopadhyay | Thierry Poibeau | Horacio Saggion | Roman Yangarber
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization

Do we Still Need Gold Standards for Evaluation?
Thierry Poibeau | Cédric Messiant
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The availability of a huge mass of textual data in electronic format has increased the need for fast and accurate techniques for textual data processing. Machine learning and statistical approaches have been increasingly used in NLP for a decade, mainly because they are quick, versatile and efficient. However, despite this evolution of the field, evaluation still relies (most of the time) on a comparison between the output of a probabilistic or statistical system on the one hand, and a non-statistical, usually hand-crafted, gold standard on the other hand. In this paper, we take the acquisition of subcategorization frames from corpora as a practical example. Our study is motivated by the fact that, even if a gold standard is an invaluable resource for evaluation, a gold standard is always partial and does not really show how accurate and useful results are.

LexSchem: a Large Subcategorization Lexicon for French Verbs
Cédric Messiant | Thierry Poibeau | Anna Korhonen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents LexSchem - the first large, fully automatically acquired subcategorization lexicon for French verbs. The lexicon includes subcategorization frame and frequency information for 3297 French verbs. When evaluated on a set of 20 test verbs against a gold standard dictionary, it shows 0.79 precision, 0.55 recall and 0.65 F-measure. We have made this resource freely available to the research community on the web.
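
The reported scores can be cross-checked with a short calculation. This is only a sketch, and it assumes that the F-measure in the abstract is the balanced F1 (harmonic mean of precision and recall); the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
f = f_measure(0.79, 0.55)

# Rounded to two decimals for comparison with the reported F-measure.
f_rounded = round(f, 2)
print(f_rounded)
```

Under that assumption, the rounded value agrees with the 0.65 F-measure stated above.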

Regroupement automatique de documents en classes événementielles
Aurélien Bossard | Thierry Poibeau
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This article deals with the automatic clustering of documents on an event basis. After clarifying the notion of event, we look at how to represent the documents of a corpus of news dispatches, then at an unsupervised learning approach based on k-means to perform the clustering. Finally, we evaluate the document clustering system on a small corpus and discuss the quantitative evaluation of this type of task.

2007

Automatically Restructuring Practice Guidelines using the GEM DTD
Amanda Bouffier | Thierry Poibeau
Biological, translational, and clinical language processing

UP13: Knowledge-poor Methods (Sometimes) Perform Poorly
Thierry Poibeau
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2005

Sur le statut référentiel des entités nommées
Thierry Poibeau
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We show in this article that a single entity can be designated in multiple ways and that the names designating these entities are polysemous by nature. The analysis therefore cannot be limited to an attempt at reference resolution but must bring out the naming possibilities, which rest essentially on two operations of a linguistic nature: synecdoche and metonymy. Finally, we present a model that makes the different designations in discourse explicit by unifying the representation of linguistic knowledge and world knowledge.

2004

Event-Based Information Extraction for the Biomedical Domain: the Caderige Project
Erick Alphonse | Sophie Aubin | Philippe Bessières | Gilles Bisson | Thierry Hamon | Sandrine Lagarrigue | Adeline Nazarenko | Alain-Pierre Manine | Claire Nédellec | Mohamed Ould Abdel Vetah | Thierry Poibeau | Davy Weissenbacher
Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP)

Semi-automatic Acquisition of Command Grammar
Thierry Poibeau | Bénédicte Goujon
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Automatic extraction of paraphrastic phrases from medium-size corpora
Thierry Poibeau
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

The Multilingual Named Entity Recognition Framework
Thierry Poibeau | A. Acoulon | C. Avaux | L. Beroff-Bénéat | A. Cadeau | M. Calberg | A. Delale | L. De Temmerman | A.-L. Guenet | D. Huis | M. Jamalpour | A. Krul | A. Marcus | F. Picoli | C. Plancq
10th Conference of the European Chapter of the Association for Computational Linguistics

2002

Evaluating resource acquisition tools for Information Extraction
Thierry Poibeau | Dominique Dutoit | Sophie Bizouard
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Generating Extraction Patterns from a Large Semantic Network and an Untagged Corpus
Thierry Poibeau | Dominique Dutoit
COLING-02: SEMANET: Building and Using Semantic Networks

Évaluer l’acquisition semi-automatique de classes sémantiques
Thierry Poibeau | Dominique Dutoit | Sophie Bizouard
Actes de la 9ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article aims to evaluate two different approaches to building semantic classes. An endogenous approach (acquisition from a corpus) is contrasted with an exogenous approach (through a rich semantic network). The article presents a fine-grained evaluation of these two techniques.

Inferring Knowledge from a Large Semantic Network
Dominique Dutoit | Thierry Poibeau
COLING 2002: The 19th International Conference on Computational Linguistics

2001

Extraction d’information dans les bases de données textuelles en génomique au moyen de transducteurs à nombre fini d’états
Thierry Poibeau
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article describes a system for extracting information about gene interactions from large textual databases. The system is based on an analysis using finite-state transducers. The article shows how part of the resources (interaction verbs) can be acquired semi-automatically. A detailed evaluation of the system is provided.

Extraction de noms propres à partir de textes variés: problématique et enjeux
Leila Kosseim | Thierry Poibeau
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

This article deals with the identification of proper names in written texts. Rule-based strategies developed for journalistic texts generally prove insufficient for corpora made up of texts that do not follow strict editorial criteria. After a brief review of work carried out on corpora of journalistic texts, we present the issues involved in analyzing varied texts, drawing on two corpora composed of e-mails and manual transcriptions of telephone conversations. Once the sources of error have been presented, we describe the approach used to adapt a proper name extraction system developed for journalistic texts to the analysis of e-mail messages.

Intex et ses applications informatiques
Max Silberztein | Thierry Poibeau | Antonio Balvet
Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles. Tutoriels

Intex is a development environment used to quickly build, test and accumulate morpho-syntactic patterns that appear in texts written in natural language. An overview of the system is presented in [Silberztein, 1999], and the instruction manual is available [Silberztein 2000]. Each elementary description is represented by a local grammar, which is usually entered using Intex’s graph editor. An important feature of Intex is that each local grammar can easily be reused in other local grammars. Typically, developers build elementary graphs that are equivalent to finite-state transducers, and reuse these graphs in other, increasingly complex graphs. A second feature of Intex is that the objects it processes (grammars, dictionaries and texts) are represented internally by finite-state transducers. As a result, all of the system’s functionalities come down to a limited number of operations on transducers. For example, applying a grammar to a text amounts to building the union of the elementary transducers, determinizing it, then computing the intersection of the result with the text transducer. This architecture makes it possible to use efficient algorithms (e.g. when applying a deterministic transducer to a previously indexed text), and gives Intex the power of a Turing machine (thanks to the possibility of applying transducers in cascade). In this tutorial, we will show how to use a linguistic tool such as Intex in software environments. We will draw on information filtering and extraction applications, developed in particular at the Thales research center. The following applications will be detailed, from both a linguistic and a computational standpoint: information filtering from an AFP feed [Meunier et al. 1999], and extraction of gene interaction tables from textual genomics databases [Poibeau 2001]. The tutorial will show how Intex can be used as a filtering engine for an AFP-type news feed in an industrial setting. It will also detail the text transformation (transduction) functionalities that make it possible to move quickly from varied linguistic structures to normalized forms suitable for filling a database. On the computational side, we will detail calls to Intex routines, the possible parameter settings (sentence splitting, choice of dictionaries...), and give an overview of the new integration possibilities (Intex API).