Yannick Versley


2022

LINGUIST: Language Model Instruction Tuning to Generate Annotated Utterances for Intent Classification and Slot Tagging
Andy Rosenbaum | Saleh Soltan | Wael Hamza | Yannick Versley | Markus Boese
Proceedings of the 29th International Conference on Computational Linguistics

We present LINGUIST, a method for generating annotated data for Intent Classification and Slot Tagging (IC+ST), via fine-tuning AlexaTM 5B, a 5-billion-parameter multilingual sequence-to-sequence (seq2seq) model, on a flexible instruction prompt. In a 10-shot novel intent setting for the SNIPS dataset, LINGUIST surpasses state-of-the-art approaches (Back-Translation and Example Extrapolation) by a wide margin, showing absolute improvements for the target intents of +1.9 points on IC Recall and +2.5 points on ST F1 Score. In the zero-shot cross-lingual setting of the mATIS++ dataset, LINGUIST outperforms a strong baseline of Machine Translation with Slot Alignment by +4.14 points absolute on ST F1 Score across 6 languages, while matching performance on IC. Finally, we verify our results on an internal large-scale multilingual dataset for conversational agent IC+ST and show significant improvements over a baseline which uses Back-Translation, Paraphrasing and Slot Catalog Resampling. To our knowledge, we are the first to demonstrate instruction fine-tuning of a large-scale seq2seq model to control the outputs of multilingual intent- and slot-labeled data generation.

2021

Continuous Model Improvement for Language Understanding with Machine Translation
Abdalghani Abujabal | Claudio Delli Bovi | Sungho Ryu | Turan Gojayev | Fabian Triefenbach | Yannick Versley
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

Scaling conversational personal assistants to a multitude of languages puts high demands on collecting and labelling data, a setting in which cross-lingual learning techniques can help to reconcile the need for well-performing Natural Language Understanding (NLU) with a desideratum to support many languages without incurring unacceptable cost. In this work, we show that automatically annotating unlabeled utterances using Machine Translation in an offline fashion and adding them to the training data can improve performance for existing NLU features for low-resource languages, where a straightforward translate-test approach as considered in existing literature would fail the latency requirements of a live environment. We demonstrate the effectiveness of our method with intrinsic and extrinsic evaluation using a real-world commercial dialog system in German. Beyond an intrinsic evaluation, where 56% of the resulting automatically labeled utterances had a perfect match with ground-truth labels, we see significant performance improvements in an extrinsic evaluation setting when manually labeled data is available in small quantities.

2017

Findings of the 2017 DiscoMT Shared Task on Cross-lingual Pronoun Prediction
Sharid Loáiciga | Sara Stymne | Preslav Nakov | Christian Hardmeier | Jörg Tiedemann | Mauro Cettolo | Yannick Versley
Proceedings of the Third Workshop on Discourse in Machine Translation

We describe the design, the setup, and the evaluation results of the DiscoMT 2017 shared task on cross-lingual pronoun prediction. The task asked participants to predict a target-language pronoun given a source-language pronoun in the context of a sentence. We further provided a lemmatized target-language human-authored translation of the source sentence, and automatic word alignments between the source sentence words and the target-language lemmata. The aim of the task was to predict, for each target-language pronoun placeholder, the word that should replace it from a small, closed set of classes, using any type of information that can be extracted from the entire document. We offered four subtasks, each for a different language pair and translation direction: English-to-French, English-to-German, German-to-English, and Spanish-to-English. Five teams participated in the shared task, making submissions for all language pairs. The evaluation results show that most participating teams outperformed two strong baseline systems based on n-gram language models by a sizable margin.

2016

Detecting Annotation Scheme Variation in Out-of-Domain Treebanks
Yannick Versley | Julius Steen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

To ensure portability of NLP systems across multiple domains, existing treebanks are often extended by adding trees from interesting domains that were not part of the initial annotation effort. In this paper, we will argue that it is both useful from an application viewpoint and enlightening from a linguistic viewpoint to detect and reduce divergence in annotation schemes between extant and new parts in a set of treebanks that is to be used in evaluation experiments. The results of our correction and harmonization efforts will be made available to the public as a test suite for the evaluation of constituent parsing.

ICL-HD at SemEval-2016 Task 10: Improving the Detection of Minimal Semantic Units and their Meanings with an Ontology and Word Embeddings
Angelika Kirilin | Felix Krauss | Yannick Versley
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

ICL-HD at SemEval-2016 Task 8: Meaning Representation Parsing - Augmenting AMR Parsing with a Preposition Semantic Role Labeling Neural Network
Lauritz Brandt | David Grimm | Mengfei Zhou | Yannick Versley
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Implicit Semantic Roles in a Multilingual Setting
Jennifer Sikos | Yannick Versley | Anette Frank
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

Discontinuity (Re)²-visited: A Minimalist Approach to Pseudoprojective Constituent Parsing
Yannick Versley
Proceedings of the Workshop on Discontinuous Structures in Natural Language Processing

Findings of the 2016 WMT Shared Task on Cross-lingual Pronoun Prediction
Liane Guillou | Christian Hardmeier | Preslav Nakov | Sara Stymne | Jörg Tiedemann | Yannick Versley | Mauro Cettolo | Bonnie Webber | Andrei Popescu-Belis
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2015

Subsentential Sentiment on a Shoestring: A Crosslingual Analysis of Compositional Classification
Michael Haas | Yannick Versley
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Pronoun-Focused MT and Cross-Lingual Pronoun Prediction: Findings of the 2015 DiscoMT Shared Task on Pronoun Translation
Christian Hardmeier | Preslav Nakov | Sara Stymne | Jörg Tiedemann | Yannick Versley | Mauro Cettolo
Proceedings of the Second Workshop on Discourse in Machine Translation

2014

Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages
Yoav Goldberg | Yuval Marton | Ines Rehbein | Yannick Versley | Özlem Çetinoğlu | Joel Tetreault
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages

Experiments with Easy-first nonprojective constituent parsing
Yannick Versley
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages

2013

SFS-TUE: Compound Paraphrasing with a Language Model and Discriminative Reranking
Yannick Versley
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

Subgraph-based Classification of Explicit and Implicit Discourse Relations
Yannick Versley
Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers

Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages
Yoav Goldberg | Yuval Marton | Ines Rehbein | Yannick Versley
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

Overview of the SPMRL 2013 Shared Task: A Cross-Framework Evaluation of Parsing Morphologically Rich Languages
Djamé Seddah | Reut Tsarfaty | Sandra Kübler | Marie Candito | Jinho D. Choi | Richárd Farkas | Jennifer Foster | Iakes Goenaga | Koldo Gojenola Galletebeitia | Yoav Goldberg | Spence Green | Nizar Habash | Marco Kuhlmann | Wolfgang Maier | Joakim Nivre | Adam Przepiórkowski | Ryan Roth | Wolfgang Seeker | Yannick Versley | Veronika Vincze | Marcin Woliński | Alina Wróblewska | Eric Villemonte de la Clergerie
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

2012

Supervised Learning of German Qualia Relations
Yannick Versley
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

Using Synthetic Compounds for Word Sense Discrimination
Yannick Versley | Verena Henrich
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

2011

Multilabel Tagging of Discourse Relations in Ambiguous Temporal Connectives
Yannick Versley
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2010

Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus
Kepa Joseba Rodríguez | Francesca Delogu | Yannick Versley | Egon W. Stemle | Massimo Poesio
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Live Memories corpus is an Italian corpus annotated for anaphoric relations. This annotation effort aims to contribute to two significant issues for CL research: the lack of annotated anaphoric resources for Italian and the increasing interest in the social Web. The Live Memories Corpus contains texts from the Italian Wikipedia about the region Trentino/Süd Tirol and from blog sites with users' comments. It is planned to add a set of articles from local newspapers. The corpus includes manually annotated information about morphosyntactic agreement, anaphoricity, and semantic class of the NPs. The anaphoric annotation includes discourse deixis and bridging relations, and marks cases of ambiguity with the annotation of alternative interpretations. For the annotation of the anaphoric links, the corpus takes into account specific phenomena of the Italian language like incorporated clitics and phonetically non-realized pronouns. Reliability studies for the annotation of the mentioned phenomena and for annotation of anaphoric links in general offer satisfactory results. The Wikipedia and blogs dataset will be distributed under a Creative Commons Attribution licence.

Extending BART to Provide a Coreference Resolution System for German
Samuel Broscheit | Simone Paolo Ponzetto | Yannick Versley | Massimo Poesio
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a flexible toolkit-based approach to automatic coreference resolution on German text. We start with our previous work aimed at reimplementing the system from Soon et al. (2001) for English, and extend it to duplicate a version of the state-of-the-art proposal from Klenner and Ailloud (2009). Evaluation performed on a benchmarking dataset, namely the TueBa-D/Z corpus (Hinrichs et al., 2005b), shows that machine learning based coreference resolution can be robustly performed in a language other than English.

Creating a Coreference Resolution System for Italian
Massimo Poesio | Olga Uryupina | Yannick Versley
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper summarizes our work on creating a full-scale coreference resolution (CR) system for Italian, using BART, an open-source modular CR toolkit initially developed for English corpora. We discuss our experiments on language-specific issues of the task. As our evaluation experiments show, a language-agnostic system (designed primarily for English) can achieve a performance level in the high forties (MUC F-score) when re-trained and tested on a new language, at least on gold mention boundaries. Compared to this level, we can improve our F-score by around 10% by introducing a small number of language-specific changes. This shows that, with a modular coreference resolution platform, such as BART, one can straightforwardly develop a family of robust and reliable systems for various languages. We hope that our experiments will encourage researchers working on coreference in other languages to create their own full-scale coreference resolution systems; as we have mentioned above, at the moment such modules exist only for very few languages other than English.

SemEval-2010 Task 1: Coreference Resolution in Multiple Languages
Marta Recasens | Lluís Màrquez | Emili Sapena | M. Antònia Martí | Mariona Taulé | Véronique Hoste | Massimo Poesio | Yannick Versley
Proceedings of the 5th International Workshop on Semantic Evaluation

BART: A Multilingual Anaphora Resolution System
Samuel Broscheit | Massimo Poesio | Simone Paolo Ponzetto | Kepa Joseba Rodriguez | Lorenza Romano | Olga Uryupina | Yannick Versley | Roberto Zanoli
Proceedings of the 5th International Workshop on Semantic Evaluation

Statistical Parsing of Morphologically Rich Languages (SPMRL) What, How and Whither
Reut Tsarfaty | Djamé Seddah | Yoav Goldberg | Sandra Kuebler | Yannick Versley | Marie Candito | Jennifer Foster | Ines Rehbein | Lamia Tounsi
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

2009

Scalable Discriminative Parsing for German
Yannick Versley | Ines Rehbein
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

2008

Coreference Systems Based on Kernels Methods
Yannick Versley | Alessandro Moschitti | Massimo Poesio | Xiaofeng Yang
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

How to Compare Treebanks
Sandra Kübler | Wolfgang Maier | Ines Rehbein | Yannick Versley
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Recent years have seen an increasing interest in developing standards for linguistic annotation, with a focus on the interoperability of the resources. This effort, however, requires a profound knowledge of the advantages and disadvantages of linguistic annotation schemes in order to avoid importing the flaws and weaknesses of existing encoding schemes into the new standards. This paper addresses the question of how to compare syntactically annotated corpora and gain insights into the usefulness of specific design decisions. We present an exhaustive evaluation of two German treebanks with crucially different encoding schemes. We evaluate three different parsers trained on the two treebanks and compare results using EvalB, the Leaf-Ancestor metric, and a dependency-based evaluation. Furthermore, we present TePaCoC, a new test suite for the evaluation of parsers on complex German grammatical constructions. The test suite provides a well thought-out error classification, which enables us to compare parser output for parsers trained on treebanks with different encoding schemes and provides interesting insights into the impact of treebank annotation schemes on specific constructions like PP attachment or non-constituent coordination.

BART: A modular toolkit for coreference resolution
Yannick Versley | Simone Ponzetto | Massimo Poesio | Vladimir Eidelman | Alan Jern | Jason Smith | Xiaofeng Yang | Alessandro Moschitti
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Developing a full coreference system able to run all the way from raw text to semantic interpretation is a considerable engineering effort. Accordingly, there is very limited availability of off-the-shelf tools for researchers whose interests are not primarily in coreference or others who want to concentrate on a specific aspect of the problem. We present BART, a highly modular toolkit for developing coreference applications. In the Johns Hopkins workshop on using lexical and encyclopedic knowledge for entity disambiguation, the toolkit was used to extend a reimplementation of Soon et al.’s proposal with a variety of additional syntactic and knowledge-based features, and to experiment with alternative resolution processes, preprocessing tools, and classifiers. BART has been released as open source software and is available from http://www.sfs.uni-tuebingen.de/~versley/BART

BART: A Modular Toolkit for Coreference Resolution
Yannick Versley | Simone Paolo Ponzetto | Massimo Poesio | Vladimir Eidelman | Alan Jern | Jason Smith | Xiaofeng Yang | Alessandro Moschitti
Proceedings of the ACL-08: HLT Demo Session

2007

Antecedent Selection Techniques for High-Recall Coreference Resolution
Yannick Versley
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)