Carlos Ramisch

2025

SELEXINI – a large and diverse automatically parsed corpus of French
Manon Scholivet | Agata Savary | Louis Estève | Marie Candito | Carlos Ramisch
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)

The annotation of large text corpora is essential for many tasks. We present here a large automatically annotated corpus for French. This corpus is separated into two parts: the first from BigScience, and the second from HPLT. The annotated documents from HPLT were selected in order to optimise the lexical diversity of the final corpus SELEXINI. An analysis of the impact of this selection was carried out on syntactic diversity, as well as on the quality of the new words resulting from the HPLT part of SELEXINI. We have shown that despite the introduction of interesting new words, the texts extracted from HPLT are very noisy. Furthermore, increasing lexical diversity did not increase syntactic diversity.
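
The abstract above does not say how documents were selected, so the following is only a minimal sketch of one possible strategy: a greedy loop that repeatedly picks the document contributing the most unseen word types. The function name `select_diverse`, the `budget` parameter and the toy documents are all hypothetical and are not taken from the SELEXINI paper.

```python
def select_diverse(documents, budget):
    """Greedily pick up to `budget` documents, each adding the most unseen word types.

    Assumption: lexical diversity is approximated by the number of distinct
    word types, which may differ from the measure used for SELEXINI.
    """
    seen = set()
    remaining = dict(enumerate(documents))
    chosen = []
    while remaining and len(chosen) < budget:
        # Document whose tokens add the largest number of new types.
        best = max(remaining, key=lambda i: len(set(remaining[i]) - seen))
        gain = set(remaining.pop(best)) - seen
        if not gain:
            break  # nothing new can be added any more
        seen |= gain
        chosen.append(best)
    return chosen, seen


# Toy tokenised documents, invented for illustration.
toy_docs = [
    ["le", "chat", "dort"],
    ["le", "chat", "mange"],
    ["un", "néologisme", "rare", "apparaît"],
]
chosen, vocabulary = select_diverse(toy_docs, budget=2)
```

With these toy documents the third document is chosen first because it adds four new types, which illustrates why such a criterion can favour unusual, and possibly noisy, text.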

Evaluating Pretrained Causal Language Models for Synonymy
Ioana Ivan | Carlos Ramisch | Alexis Nasr
Findings of the Association for Computational Linguistics: ACL 2025

The scaling of causal language models in size and training data enabled them to tackle increasingly complex tasks. Despite the development of sophisticated tests to reveal their new capabilities, the underlying basis of these complex skills remains unclear. We argue that complex skills might be explained using simpler ones, represented by linguistic concepts. As an initial step in exploring this hypothesis, we focus on the lexical-semantic concept of synonymy, laying the groundwork for research into its relationship with more complex skills. We develop a comprehensive test suite to assess various aspects of synonymy under different conditions, and evaluate causal open-source models ranging up to 10 billion parameters. We find that these models effectively recognize synonymy but struggle to generate synonyms when prompted with relevant context.

In the LLM era, Word Sense Induction remains unsolved
Anna Mosolova | Marie Candito | Carlos Ramisch
Findings of the Association for Computational Linguistics: ACL 2025

In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints and number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong “one cluster per lemma” heuristic (1cpl). We also show that (i) results and best systems may vary across POS, (ii) LLMs have trouble performing this task, (iii) data augmentation is beneficial and (iv) capitalizing on Wiktionary does help. It surpasses the previous SOTA system on our test set by 3.3%. WSI is not solved, and calls for a better articulation of lexicons and LLMs’ lexical semantics capabilities.

Adaptation des connaissances médicales pour les grands modèles de langue : Stratégies et analyse comparative
Ikram Belmadani | Benoit Favre | Richard Dufour | Frédéric Béchet | Carlos Ramisch
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

This paper presents a study on adapting large language models (LLMs) to specialised domains with limited data. Although some research questions domain-adaptive pre-training (DAPT) in the English medical context, we show that domain adaptation can be effective under certain conditions. Taking adaptation to the French medical domain as an example, we systematically compare continued pre-training (CPT), supervised fine-tuning (SFT), and a combined approach (CPT followed by SFT). Our results indicate that adapting a general-purpose model to new data in the medical domain yields notable improvements (87% success rate), whereas further adapting models already familiar with this domain brings limited benefits. Although CPT+SFT gives the best overall performance, SFT alone achieves solid results and requires fewer hardware resources.

Raffinage des représentations des tokens dans les modèles de langue pré-entraînés avec l’apprentissage contrastif : une étude entre modèles et entre langues
Anna Mosolova | Marie Candito | Carlos Ramisch
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

Pre-trained language models have brought significant advances in contextual representations of sentences and words. However, lexical tasks remain a challenge for these representations because of problems such as the low similarity between representations of the same word in similar contexts. Mosolova et al. (2024) showed that supervised contrastive learning at the token level improves performance on lexical tasks. In this paper, we study how well their results, obtained on English, generalise to French, to other language models and to several parts of speech. We demonstrate that this contrastive learning method consistently improves performance on Word-in-Context tasks and outperforms standard pre-trained language models. Analysis of the embedding space shows that fine-tuning the models brings examples with the same sense closer together and pushes apart those with different senses, which indicates better sense discrimination in the final vector space.

SELEXINI – un grand corpus français, divers et parsé automatiquement
Manon Scholivet | Agata Savary | Louis Estève | Marie Candito | Carlos Ramisch
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

The annotation of large text corpora is essential for many Natural Language Processing tasks. In this paper, we present SELEXINI, a large French corpus automatically annotated for syntax. The corpus consists of two parts: the BigScience part and the HPLT part. The documents of the HPLT part were selected so as to maximise the lexical diversity of the whole SELEXINI corpus. We analysed the impact of this selection on syntactic diversity, and studied the quality of the new words contributed by the HPLT part of SELEXINI. We were able to show that, despite the introduction of new words considered interesting (rare conjugated forms, neologisms, rare words, etc.), the texts from HPLT are extremely noisy. Moreover, increasing lexical diversity did not increase syntactic diversity.

2024

Injecting Wiktionary to improve token-level contextual representations using contrastive learning
Anna Mosolova | Marie Candito | Carlos Ramisch
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

While static word embeddings are blind to context, for lexical semantics tasks context is rather too present in contextual word embeddings, vectors of same-meaning occurrences being too different (Ethayarajh, 2019). Fine-tuning pre-trained language models (PLMs) using contrastive learning was proposed, leveraging automatically self-augmented examples (Liu et al., 2021b). In this paper, we investigate how to inject a lexicon as an alternative source of supervision, using the English Wiktionary. We also test how dimensionality reduction impacts the resulting contextual word embeddings. We evaluate our approach on the Word-In-Context (WiC) task, in the unsupervised setting (not using the training set). We achieve a new SoTA result on the original WiC test set. We also propose two new WiC test sets for which we show that our fine-tuning method achieves substantial improvements. We also observe improvements, although modest, for the semantic frame induction task. Although we experimented on English to allow comparison with related work, our method is adaptable to the many languages for which large Wiktionaries exist.

This paper presents the objectives, organization and activities of the UniDive COST Action, a scientific network dedicated to universality, diversity and idiosyncrasy in language technology. We describe the objectives and organization of this initiative, the people involved, the working groups and the ongoing tasks and activities. This paper is also an open call for participation addressed to new members and countries.

2023

We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus now represents 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split following the PARSEME v.1.2 standard, which places emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

A Survey of MWE Identification Experiments: The Devil is in the Details
Carlos Ramisch | Abigail Walsh | Thomas Blanchard | Shiva Taslimipoor
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)

Multiword expression (MWE) identification has been the focus of numerous research papers, especially in the context of the DiMSUM and PARSEME Shared Tasks (STs). This survey analyses 40 MWE identification papers with experiments on data from these STs. We look at corpus selection, pre- and post-processing, MWE encoding, evaluation metrics, statistical significance, and error analyses. We find that these aspects are usually considered minor and/or omitted in the literature. However, they may considerably impact the results and the conclusions drawn from them. Therefore, we advocate for more systematic descriptions of experimental conditions to reduce the risk of misleading conclusions drawn from poorly designed experimental setups.

PARSEME Meets Universal Dependencies: Getting on the Same Page in Representing Multiword Expressions
Agata Savary | Sara Stymne | Verginica Barbu Mititelu | Nathan Schneider | Carlos Ramisch | Joakim Nivre
Northern European Journal of Language Technology, Volume 9

Multiword expressions (MWEs) are challenging and pervasive phenomena whose idiosyncratic properties show notably at the levels of lexicon, morphology, and syntax. Thus, they should best be annotated jointly with morphosyntax. We discuss two multilingual initiatives, Universal Dependencies and PARSEME, addressing these annotation layers in cross-lingually unified ways. We compare the annotation principles of these initiatives with respect to MWEs, and we put forward a roadmap towards their gradual unification. The expected outcomes are more consistent treebanking and higher universality in modeling idiosyncrasy.

2022

Identification des Expressions Polylexicales dans les Tweets (Identification of Multiword Expressions in Tweets)
Nicolas Zampieri | Carlos Ramisch | Irina Illina | Dominique Fohr
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Identifying multiword expressions (MWEs) in tweets is a difficult task because of the complex linguistic nature of MWEs combined with the use of non-standard language. In this paper, we present this identification task on English Twitter data. We compare the performance of two systems: one based on a lexicon and the other on neural networks. We experimentally evaluate seven configurations of a state-of-the-art system based on recurrent neural networks using contextual embeddings generated by BERT. The neural system outperforms the lexicon-based approach, whose lexicon is collected automatically from MWEs in corpora, thanks to its superior generalisation power.

Identification of Multiword Expressions in Tweets for Hate Speech Detection
Nicolas Zampieri | Carlos Ramisch | Irina Illina | Dominique Fohr
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Multiword expression (MWE) identification in tweets is a complex task due to the complex linguistic nature of MWEs combined with the non-standard language use in social networks. MWE features were shown to be helpful for hate speech detection (HSD). In this article, we present joint experiments on these two related tasks on English Twitter data: first we focus on the MWE identification task, and then we observe the influence of MWE-based features on the HSD task. For MWE identification, we compare the performance of two systems: lexicon-based and deep neural networks-based (DNN). We experimentally evaluate seven configurations of a state-of-the-art DNN system based on recurrent networks using pre-trained contextual embeddings from BERT. The DNN-based system outperforms the lexicon-based one thanks to its superior generalisation power, yielding much better recall. For the HSD task, we propose a new DNN architecture for incorporating MWE features. We confirm that MWE features are helpful for the HSD task. Moreover, the proposed DNN architecture beats previous MWE-based HSD systems by 0.4 to 1.1 F-measure points on average on four Twitter HSD corpora.

Proceedings of the 18th Workshop on Multiword Expressions @LREC2022
Archna Bhatia | Paul Cook | Shiva Taslimipoor | Marcos Garcia | Carlos Ramisch
Proceedings of the 18th Workshop on Multiword Expressions @LREC2022

mwetoolkit-lib: Adaptation of the mwetoolkit as a Python Library and an Application to MWE-based Document Clustering
Fernando Zagatti | Paulo Augusto de Lima Medeiros | Esther da Cunha Soares | Lucas Nildaimon dos Santos Silva | Carlos Ramisch | Livy Real
Proceedings of the 18th Workshop on Multiword Expressions @LREC2022

This paper introduces the mwetoolkit-lib, an adaptation of the mwetoolkit as a Python library. The original toolkit performs the extraction and identification of multiword expressions (MWEs) in large text bases through the command line. One of the contributions of our work is the adaptation of the MWE extraction pipeline from the mwetoolkit, allowing its usage in Python development environments and integration in larger pipelines. The other contribution is the execution of a pilot experiment aiming to show the impact of MWE discovery in data professionals’ work. This experiment found that the addition of MWE knowledge to the Term Frequency-Inverse Document Frequency (TF-IDF) vectorization altered the word relevance order, improving the linguistic quality of the clusters returned by the k-means method.
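
To make the TF-IDF observation concrete, here is a small sketch in plain Python. It does not use the mwetoolkit-lib API, which the abstract does not describe; `merge_mwes`, `tf_idf`, the MWE list and the three toy documents are hypothetical names and data chosen for illustration, and the k-means step is left out.

```python
import math
from collections import Counter


def merge_mwes(tokens, mwes):
    """Join each known MWE (a non-empty tuple of tokens) into one token."""
    merged, i = [], 0
    while i < len(tokens):
        for mwe in mwes:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append("_".join(mwe))
                i += len(mwe)
                break
        else:
            # No MWE starts at position i: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


def tf_idf(docs):
    """Return one {term: weight} dict per document (relative tf times log idf)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Toy corpus and MWE list, invented for illustration.
docs = [
    "hot dog stand sells hot dog".split(),
    "the dog is hot today".split(),
    "a stand by the road".split(),
]
mwes = [("hot", "dog")]

plain = tf_idf(docs)
merged = tf_idf([merge_mwes(doc, mwes) for doc in docs])

# Highest-weighted term of the first document, without and with MWE merging.
plain_top = max(plain[0], key=plain[0].get)
merged_top = max(merged[0], key=merged[0].get)
```

In this toy corpus the top-weighted term of the first document changes once "hot dog" is treated as a single unit, which is the kind of reordering of word relevance the abstract reports.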

2021

AMU-EURANOVA at CASE 2021 Task 1: Assessing the stability of multilingual BERT
Léo Bouscarrat | Antoine Bonnefoy | Cécile Capponi | Carlos Ramisch
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)

This paper explains our participation in task 1 of the CASE 2021 shared task. This task is about multilingual event extraction from news. We focused on sub-task 4, event information extraction. This sub-task has a small training dataset and we fine-tuned a multilingual BERT to solve this sub-task. We studied the instability problem on the dataset and tried to mitigate it.

2020

Verbal Multiword Expression Identification: Do We Need a Sledgehammer to Crack a Nut?
Caroline Pasquer | Agata Savary | Carlos Ramisch | Jean-Yves Antoine
Proceedings of the 28th International Conference on Computational Linguistics

Automatic identification of multiword expressions (MWEs), like ‘to cut corners’ (to do an incomplete job), is a pre-requisite for semantically-oriented downstream applications. This task is challenging because MWEs, especially verbal ones (VMWEs), exhibit surface variability. This paper deals with a subproblem of VMWE identification: the identification of occurrences of previously seen VMWEs. A simple language-independent system based on a combination of filters competes with the best systems from a recent shared task: it obtains the best averaged F-score over 11 languages (0.6653) and even the best score for both seen and unseen VMWEs due to the high proportion of seen VMWEs in texts. This highlights the fact that focusing on the identification of seen VMWEs could be a strategy to improve VMWE identification in general.

SLICE: Supersense-based Lightweight Interpretable Contextual Embeddings
Cindy Aloui | Carlos Ramisch | Alexis Nasr | Lucie Barque
Proceedings of the 28th International Conference on Computational Linguistics

Contextualised embeddings such as BERT have become de facto state-of-the-art references in many NLP applications, thanks to their impressive performance. However, their opaqueness makes it hard to interpret their behaviour. SLICE is a hybrid model that combines supersense labels with contextual embeddings. We introduce a weakly supervised method to learn interpretable embeddings from raw corpora and small lists of seed words. Our model is able to represent both a word and its context as embeddings into the same compact space, whose dimensions correspond to interpretable supersenses. We assess the model in a task of supersense tagging for French nouns. The small amount of supervision required makes it particularly well suited for low-resourced scenarios. Thanks to its interpretability, we perform linguistic analyses about the predicted supersenses in terms of input word and context representations.

Multilingual enrichment of disease biomedical ontologies
Léo Bouscarrat | Antoine Bonnefoy | Cécile Capponi | Carlos Ramisch
Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)

Translating biomedical ontologies is an important challenge, but doing it manually requires much time and money. We study the possibility of using open-source knowledge bases to translate biomedical ontologies. We focus on two aspects: coverage and quality. We look at the coverage, with respect to Wikidata, of two biomedical ontologies focusing on diseases. Both are examined for 9 European languages (Czech, Dutch, English, French, German, Italian, Polish, Portuguese and Spanish), and the second also for Arabic, Chinese and Russian. We first use direct links between Wikidata and the studied ontologies and then use second-order links by going through other intermediate ontologies. We then compare the quality of the translations obtained thanks to Wikidata with a commercial machine translation tool, here Google Cloud Translation.

We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.

Seen2Unseen at PARSEME Shared Task 2020: All Roads do not Lead to Unseen Verb-Noun VMWEs
Caroline Pasquer | Agata Savary | Carlos Ramisch | Jean-Yves Antoine
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

We describe the Seen2Unseen system that participated in edition 1.2 of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs). The identification of VMWEs that do not appear in the provided training corpora (called unseen VMWEs) – with a focus here on verb-noun VMWEs – is based on mutual information and lexical substitution or translation of seen VMWEs. We present the architecture of the system, report results for 14 languages, and propose an error analysis.

2019

Unsupervised Compositionality Prediction of Nominal Compounds
Silvio Cordeiro | Aline Villavicencio | Marco Idiart | Carlos Ramisch
Computational Linguistics, Volume 45, Issue 1 - March 2019

Nominal compounds such as red wine and nut case display a continuum of compositionality, with varying contributions from the components of the compound to its semantics. This article proposes a framework for compound compositionality prediction using distributional semantic models, evaluating to what extent they capture idiomaticity compared to human judgments. For evaluation, we introduce data sets containing human judgments in three languages: English, French, and Portuguese. The results obtained reveal a high agreement between the models and human predictions, suggesting that they are able to incorporate information about idiomaticity. We also present an in-depth evaluation of various factors that can affect prediction, such as model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size in the ability of the model to predict compositionality, and of a uniform combination of the components for best results.
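
As a rough illustration of the idea, the sketch below scores a compound by the cosine similarity between its own vector and a uniform combination (sum of unit-normalised vectors) of its components. This is one common formulation and an assumption here, since the abstract does not give the exact operations; the function names and the three-dimensional toy vectors are invented, whereas real distributional models would be trained on corpora.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def compositionality(compound_vec, head_vec, modifier_vec):
    """Compare the compound vector with a uniform sum of its component vectors."""
    combined = [h + m for h, m in zip(normalise(head_vec), normalise(modifier_vec))]
    return cosine(compound_vec, combined)


# Toy vectors, invented for illustration only.
vectors = {
    "red": [1.0, 0.0, 0.0],
    "wine": [0.0, 1.0, 0.0],
    "red_wine": [0.6, 0.8, 0.0],   # close to its components
    "nut": [1.0, 0.0, 0.0],
    "case": [0.0, 1.0, 0.0],
    "nut_case": [0.0, 0.1, 1.0],   # far from its components
}

score_red_wine = compositionality(vectors["red_wine"], vectors["wine"], vectors["red"])
score_nut_case = compositionality(vectors["nut_case"], vectors["case"], vectors["nut"])
```

A higher score suggests a more compositional compound; in this toy setting the score for "red wine" is much higher than the one for "nut case", matching the intuition in the abstract.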

Typological Features for Multilingual Delexicalised Dependency Parsing
Manon Scholivet | Franck Dary | Alexis Nasr | Benoit Favre | Carlos Ramisch
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The existence of universal models to describe the syntax of languages has been debated for decades. The availability of resources such as the Universal Dependencies treebanks and the World Atlas of Language Structures makes it possible to study the plausibility of universal grammar from the perspective of dependency parsing. Our work investigates the use of high-level language descriptions in the form of typological features for multilingual dependency parsing. Our experiments on multilingual parsing for 40 languages show that typological information can indeed guide parsers to share information between similar languages beyond simple language identification.

Without lexicons, multiword expression identification will never fly: A position statement
Agata Savary | Silvio Cordeiro | Carlos Ramisch
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

Because most multiword expressions (MWEs), especially verbal ones, are semantically non-compositional, their automatic identification in running text is a prerequisite for semantically-oriented downstream applications. However, recent developments, driven notably by the PARSEME shared task on automatic identification of verbal MWEs, show that this task is harder than related tasks, despite recent contributions both in multilingual corpus annotation and in computational models. In this paper, we analyse possible reasons for this state of affairs. They lie in the nature of the MWE phenomenon, as well as in its distributional properties. We also offer a comparative analysis of the state-of-the-art systems, which exhibit particularly strong sensitivity to unseen data. On this basis, we claim that, in order to make strong headway in MWE identification, the community should bend its mind into coupling identification of MWEs with their discovery, via syntactic MWE lexicons. Such lexicons need not necessarily achieve a linguistically complete modelling of MWEs’ behavior, but they should provide minimal morphosyntactic information to cover some potential uses, so as to complement existing MWE-annotated corpora. We define requirements for such minimal NLP-oriented lexicon, and we propose a roadmap for the MWE community driven by these requirements.

The Impact of Word Representations on Sequential Neural MWE Identification
Nicolas Zampieri | Carlos Ramisch | Geraldine Damnati
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

Recent initiatives such as the PARSEME shared task allowed the rapid development of MWE identification systems. Many of those are based on recent NLP advances, using neural sequence models that take continuous word representations as input. We study two related questions in neural MWE identification: (a) the use of lemmas and/or surface forms as input features, and (b) the use of word-based or character-based embeddings to represent them. Our experiments on Basque, French, and Polish show that character-based representations yield systematically better results than word-based ones. In some cases, character-based representations of surface forms can be used as a proxy for lemmas, depending on the morphological complexity of the language.

2018

Advances in Multiword Expression Identification for the Italian language: The PARSEME Shared Task Edition 1.1
Johanna Monti | Silvio Ricardo Cordeiro | Carlos Ramisch | Federico Sangati | Agata Savary | Veronika Vincze
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

If you’ve seen some, you’ve seen them all: Identifying variants of multiword expressions
Caroline Pasquer | Agata Savary | Carlos Ramisch | Jean-Yves Antoine
Proceedings of the 27th International Conference on Computational Linguistics

Multiword expressions, especially verbal ones (VMWEs), show idiosyncratic variability, which is challenging for NLP applications, hence the need for VMWE identification. We focus on the task of variant identification, i.e. identifying variants of previously seen VMWEs, whatever their surface form. We model the problem as a classification task. Syntactic subtrees with previously seen combinations of lemmas are first extracted, and then classified on the basis of features relevant to morpho-syntactic variation of VMWEs. Feature values are both absolute, i.e. hold for a particular VMWE candidate, and relative, i.e. based on comparing a candidate with previously seen VMWEs. This approach outperforms a baseline by 4 percentage points of F-measure on a French corpus.

Towards a Variability Measure for Multiword Expressions
Caroline Pasquer | Agata Savary | Jean-Yves Antoine | Carlos Ramisch
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

One of the most outstanding properties of multiword expressions (MWEs), especially verbal ones (VMWEs), important both in theoretical models and applications, is their idiosyncratic variability. Some MWEs are always continuous, while some others admit certain types of insertions. Components of some MWEs are rarely or never modified, while some others admit either specific or unrestricted modification. This unpredictable variability profile of MWEs hinders modeling and processing them as “words-with-spaces” on the one hand, and as regular syntactic structures on the other hand. Since variability of MWEs is a matter of scale rather than a binary property, we propose a 2-dimensional language-independent measure of variability dedicated to verbal MWEs based on syntactic and discontinuity-related clues. We assess its relevance with respect to a linguistic benchmark and its utility for the tasks of VMWE classification and variant identification on a French corpus.

Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)
Agata Savary | Carlos Ramisch | Jena D. Hwang | Nathan Schneider | Melanie Andresen | Sameer Pradhan | Miriam R. L. Petruck
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.

VarIDE at PARSEME Shared Task 2018: Are Variants Really as Alike as Two Peas in a Pod?
Caroline Pasquer | Carlos Ramisch | Agata Savary | Jean-Yves Antoine
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

We describe the VarIDE system (standing for Variant IDEntification), which participated in edition 1.1 of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs). Our system focuses on the task of VMWE variant identification: a naive Bayes classifier uses morphosyntactic information from the training data to predict whether candidates extracted from the test corpus could be idiomatic. We report results for 19 languages.

Veyn at PARSEME Shared Task 2018: Recurrent Neural Networks for VMWE Identification
Nicolas Zampieri | Manon Scholivet | Carlos Ramisch | Benoit Favre
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper describes the Veyn system, submitted to the closed track of the PARSEME Shared Task 2018 on automatic identification of verbal multiword expressions (VMWEs). Veyn is based on a sequence tagger using recurrent neural networks. We represent VMWEs using a variant of the begin-inside-outside encoding scheme combined with the VMWE category tag. In addition to the system description, we present development experiments to determine the best tagging scheme. Veyn is freely available, covers 19 languages, and was ranked ninth (MWE-based) and eighth (Token-based) among 13 submissions, considering macro-averaged F1 across languages.
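
The abstract mentions a variant of the begin-inside-outside (BIO) scheme without detailing it, so the sketch below shows only plain BIO tags suffixed with a category, as an assumption about the general idea and not as Veyn's actual scheme. The function `encode_bio`, its input format and the example sentence are hypothetical; tokens inside the gap of a discontinuous expression simply receive "O", and overlapping expressions are not handled.

```python
def encode_bio(n_tokens, vmwes):
    """Turn VMWE annotations into one BIO+category tag per token.

    `vmwes` is a list of (category, token_indices) pairs with 0-based indices.
    The first token of an expression gets B-<category>, the others I-<category>.
    """
    tags = ["O"] * n_tokens
    for category, indices in vmwes:
        for position, index in enumerate(sorted(indices)):
            prefix = "B" if position == 0 else "I"
            tags[index] = f"{prefix}-{category}"
    return tags


# Toy sentence with one discontinuous light-verb construction.
tokens = ["She", "took", "a", "quick", "decision"]
annotations = [("LVC.full", [1, 4])]
tags = encode_bio(len(tokens), annotations)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence tagger can then be trained to predict one such tag per token, and the predicted tags are decoded back into expressions.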

2017

Annotation d’expressions polylexicales verbales en français (Annotation of verbal multiword expressions in French)
Marie Candito | Mathieu Constant | Carlos Ramisch | Agata Savary | Yannick Parmentier | Caroline Pasquer | Jean-Yves Antoine
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

We describe the French part of the data produced for the PARSEME multilingual shared task on the identification of verbal multiword expressions (Savary et al., 2017). The expressions covered for French are verbal idioms, inherently pronominal verbs and a generalisation of light-verb constructions. These phenomena were annotated on the French-UD corpus (Nivre et al., 2016) and the Sequoia corpus (Candito & Seddah, 2012), giving a corpus of 22,645 sentences with a total of 4,962 annotated expressions. This amounts to roughly one annotated expression every 100 tokens, with a high proportion of discontinuous expressions (40%).

Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial for NLP applications, where it raises a number of challenges. The emergence of solutions in the absence of guiding principles motivates this survey, whose aim is not only to provide a focused review of MWE processing, but also to clarify the nature of interactions between MWE processing and downstream applications. We propose a conceptual framework within which challenges and research contributions can be positioned. It offers a shared understanding of what is meant by “MWE processing,” distinguishing the subtasks of MWE discovery and identification. It also elucidates the interactions between MWE processing and two use cases: Parsing and machine translation. Many of the approaches in the literature can be differentiated according to how MWE processing is timed with respect to underlying use cases. We discuss how such orchestration choices affect the scope of MWE-aware systems. For each of the two MWE processing subtasks and for each of the two use cases, we conclude on open issues and research perspectives.

pdf bib
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
Stella Markantonatou | Carlos Ramisch | Agata Savary | Veronika Vincze
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

Multiword expressions (MWEs) are known as a “pain in the neck” for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one’s heart or to turn off, have been rarely modelled. This is notably due to their syntactic variability, which hinders treating them as “words with spaces”. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.

pdf bib abs
Discovering Light Verb Constructions and their Translations from Parallel Corpora without Word Alignment
Natalie Vargas | Carlos Ramisch | Helena Caseli
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

We propose a method for joint unsupervised discovery of multiword expressions (MWEs) and their translations from parallel corpora. First, we apply independent monolingual MWE extraction in source and target languages simultaneously. Then, we calculate translation probability, association score and distributional similarity of co-occurring pairs. Finally, we rank all translations of a given MWE using a linear combination of these features. Preliminary experiments on light verb constructions show promising results.
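
The final ranking step described in this abstract can be sketched as follows. This is a minimal illustration, assuming each candidate translation already carries the three feature values named above (translation probability, association score, distributional similarity); the function name `rank_translations`, the equal weights, and the toy numbers are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

# Feature triple: (translation probability, association score,
# distributional similarity). All values below are made up.
Features = Tuple[float, float, float]


def rank_translations(
    candidates: Dict[str, Features],
    weights: Features = (1.0, 1.0, 1.0),
) -> List[Tuple[str, float]]:
    """Score each candidate by a weighted sum of its features and
    return the candidates sorted from best to worst."""
    scored = [
        (target, sum(w * f for w, f in zip(weights, feats)))
        for target, feats in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates for one source expression.
candidates = {
    "candidate_a": (0.6, 0.8, 0.7),
    "candidate_b": (0.3, 0.2, 0.4),
    "candidate_c": (0.4, 0.3, 0.5),
}
ranked = rank_translations(candidates)
for target, score in ranked:
    print(f"{target}\t{score:.2f}")
```

In practice the weights would have to be tuned, and the three features would need to be on comparable scales before being combined.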

pdf bib abs
Identification of Ambiguous Multiword Expressions Using Sequence Models and Lexical Resources
Manon Scholivet | Carlos Ramisch
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

We present a simple and efficient tagger capable of identifying highly ambiguous multiword expressions (MWEs) in French texts. It is based on conditional random fields (CRF), using local context information as features. We show that this approach can obtain results that, in some cases, approach more sophisticated parser-based MWE identification methods without requiring syntactic trees from a treebank. Moreover, we study how well the CRF can take into account external information coming from a lexicon.

2016

pdf bib abs
mwetoolkit+sem: Integrating Word Embeddings in the mwetoolkit for Semantic MWE Processing
Silvio Cordeiro | Carlos Ramisch | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents mwetoolkit+sem: an extension of the mwetoolkit that estimates semantic compositionality scores for multiword expressions (MWEs) based on word embeddings. First, we describe our implementation of vector-space operations working on distributional vectors. The compositionality score is based on the cosine distance between the MWE vector and the composition of the vectors of its member words. Our generic system can handle several types of word embeddings and MWE lists, and may combine individual word representations using several composition techniques. We evaluate our implementation on a dataset of 1042 English noun compounds, comparing different configurations of the underlying word embeddings and word-composition models. We show that our vector-based scores model non-compositionality better than standard association measures such as log-likelihood.
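
The score described in this abstract can be sketched in a few lines. The sketch assumes additive composition (one of several possible composition techniques) and computes cosine similarity, from which a cosine distance follows as 1 minus the similarity; the function names and the toy vectors are illustrative and do not reflect the mwetoolkit's actual interface.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compositionality(
    mwe_vec: Sequence[float], member_vecs: List[Sequence[float]]
) -> float:
    """Compare the vector of the whole expression with the sum of the
    vectors of its member words (additive composition, an assumption)."""
    composed = [sum(components) for components in zip(*member_vecs)]
    return cosine(mwe_vec, composed)


# Toy 2-dimensional vectors, made up for illustration.
members = [[1.0, 0.0], [1.0, 0.0]]
compositional = compositionality([1.0, 0.0], members)   # same direction
idiomatic = compositionality([0.0, 1.0], members)       # orthogonal
print(compositional, idiomatic)
```

A score close to 1 suggests that the expression's distribution resembles that of its parts, while a score near 0 suggests an idiomatic reading.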

pdf bib abs
DeQue: A Lexicon of Complex Prepositions and Conjunctions in French
Carlos Ramisch | Alexis Nasr | André Valli | José Deulofeu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce DeQue, a lexicon covering French complex prepositions (CPRE) like “à partir de” (from) and complex conjunctions (CCONJ) like “bien que” (although). The lexicon includes fine-grained linguistic description based on empirical evidence. We describe the general characteristics of CPRE and CCONJ in French, with special focus on syntactic ambiguity. Then, we list the selection criteria used to build the lexicon and the corpus-based methodology employed to collect entries. Finally, we quantify the ambiguity of each construction by annotating around 100 sentences randomly taken from the FRWaC. In addition to its theoretical value, the resource has many potential practical applications. We intend to employ DeQue for treebank annotation and to train a dependency parser that can take complex constructions into account.

pdf bib
Predicting the Compositionality of Nominal Compounds: Giving Word Embeddings a Hard Time
Silvio Cordeiro | Carlos Ramisch | Marco Idiart | Aline Villavicencio
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
How Naked is the Naked Truth? A Multilingual Lexicon of Nominal Compound Compositionality
Carlos Ramisch | Silvio Cordeiro | Leonardo Zilio | Marco Idiart | Aline Villavicencio
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
UFRGS&LIF at SemEval-2016 Task 10: Rule-Based MWE Identification and Predominant-Supersense Tagging
Silvio Cordeiro | Carlos Ramisch | Aline Villavicencio
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
Filtering and Measuring the Intrinsic Quality of Human Compositionality Judgments
Carlos Ramisch | Silvio Cordeiro | Aline Villavicencio
Proceedings of the 12th Workshop on Multiword Expressions

2015

pdf bib
Joint Dependency Parsing and Multiword Expression Tokenization
Alexis Nasr | Carlos Ramisch | José Deulofeu | André Valli
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Never-Ending Multiword Expressions Learning
Alexandre Rondon | Helena Caseli | Carlos Ramisch
Proceedings of the 11th Workshop on Multiword Expressions

2014

pdf bib
Nothing like Good Old Frequency: Studying Context Filters for Distributional Thesauri
Muntsa Padró | Marco Idiart | Aline Villavicencio | Carlos Ramisch
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib abs
Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them
Bruno Laranjeira | Viviane Moreira | Aline Villavicencio | Carlos Ramisch | Maria José Finatto
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Comparable corpora have been used as an alternative to parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms by using them to collect comparable corpora on a specific domain. Then, we compare the evaluation of the focused crawling algorithms to the performance of linguistic processes executed after training with the corresponding generated corpora. Also, we propose a novel approach for focused crawling, exploiting the expressive power of multiword expressions.

pdf bib abs
Comparing Similarity Measures for Distributional Thesauri
Muntsa Padró | Marco Idiart | Aline Villavicencio | Carlos Ramisch
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Distributional thesauri have been applied to a variety of tasks involving semantic relatedness. In this paper, we investigate the impact of three parameters: similarity measures, frequency thresholds and association scores. We focus on the robustness and stability of the resulting thesauri, measuring inter-thesaurus agreement when testing different parameter values. The results obtained show that low-frequency thresholds affect thesaurus quality more than similarity measures, with more agreement found for increasing thresholds. These results indicate the sensitivity of distributional thesauri to frequency. Nonetheless, the observed differences do not carry over to extrinsic evaluation using TOEFL-like questions. While this may be specific to the task, we argue that a careful examination of the stability of distributional resources prior to application is needed.

This paper presents the Multiword Expression Toolkit (mwetoolkit), an environment for type and language-independent MWE identification from corpora. The mwetoolkit provides a targeted list of MWE candidates, extracted and filtered according to a number of user-defined criteria and a set of standard statistical association measures. For generating corpus counts, the toolkit provides both a corpus indexation facility and a tool for integration with web search engines, while for evaluation, it provides validation and annotation facilities. The mwetoolkit also allows easy integration with a machine learning tool for the creation and application of supervised MWE extraction models if annotated data is available. In our experiment, the mwetoolkit was tested and evaluated in the context of MWE extraction in the biomedical domain. Our preliminary results show that the toolkit performs better than other approaches, especially concerning recall. Moreover, this first version can also be extended in several ways in order to improve the quality of the results.

pdf bib
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications
Éric Laporte | Preslav Nakov | Carlos Ramisch | Aline Villavicencio
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications