Manon Scholivet

2026

Multiword expressions (MWEs) have been a major challenge in NLP for decades, and research on MWEs has been driven notably by shared tasks, including those organized by the PARSEME community. We report on the organisation and results of edition 2.0 of the PARSEME shared task. For the first time, all syntactic categories are covered: verbal, nominal, adjectival, adverbial and functional. We rely on edition 2.0 of the PARSEME corpus, annotated for all these categories in 17 languages. We create a new dataset with paraphrases of sentences containing idioms in 14 languages, and define a new subtask dedicated to MWE paraphrasing. We extend our evaluation protocol by measuring both the performance and the diversity of systems, and by including manual evaluation in paraphrasing. Ten systems, including the baseline, participated in the MWE identification subtask and five in the paraphrasing subtask. Results are promising, but known MWE identification challenges remain unsolved. Performance correlates positively with diversity in MWE identification, and negatively in MWE paraphrasing.

Diversity patterns run deep: Impact of diversity intake on multiword expression identification
Mathilde Deletombe | Manon Scholivet | Louis Estève | Thomas Lavergne | Agata Savary
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)

Multiword expressions (MWEs) are a good example of a phenomenon where identification systems struggle with generalisation: MWEs present in the test set but absent from the training set are rarely identified. This raises the question of the diversity of the test set, relative to that of the training set, and how this impacts performance. We set out to measure how much the diversity of a training corpus increases when individual MWEs from the test corpus are added, and how this increase impacts MWE identification performance. We measure diversity within a three-dimensional framework and find mostly consistent negative correlations with performance across 14 languages and 8 systems.

2025

SELEXINI – un grand corpus français, divers et parsé automatiquement
Manon Scholivet | Agata Savary | Louis Estève | Marie Candito | Carlos Ramisch
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Annotating large text corpora is essential for many Natural Language Processing tasks. In this article, we present SELEXINI, a large French corpus automatically annotated for syntax. The corpus consists of two parts: the BigScience part and the HPLT part. The documents in the HPLT part were selected to maximise the lexical diversity of the full SELEXINI corpus. We analysed the impact of this selection on syntactic diversity and studied the quality of the new words contributed by the HPLT part of SELEXINI. We show that, despite the introduction of new words considered interesting (rare conjugated forms, neologisms, rare words, ...), the texts from HPLT are extremely noisy. Moreover, the increase in lexical diversity did not lead to an increase in syntactic diversity.

SELEXINI – a large and diverse automatically parsed corpus of French
Manon Scholivet | Agata Savary | Louis Estève | Marie Candito | Carlos Ramisch
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)

The annotation of large text corpora is essential for many tasks. We present here a large automatically annotated corpus for French. This corpus is separated into two parts: the first from BigScience, and the second from HPLT. The annotated documents from HPLT were selected in order to optimise the lexical diversity of the final corpus SELEXINI. An analysis of the impact of this selection was carried out on syntactic diversity, as well as on the quality of the new words resulting from the HPLT part of SELEXINI. We have shown that despite the introduction of interesting new words, the texts extracted from HPLT are very noisy. Furthermore, increasing lexical diversity did not increase syntactic diversity.

2019

Méthodes de représentation de la langue pour l’analyse syntaxique multilingue (Language representation methods for multilingual syntactic parsing)
Manon Scholivet
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume III : RECITAL

The existence of universal models for describing the syntax of languages has long been debated. The emergence of resources such as the World Atlas of Language Structures and the Universal Dependencies corpora makes it possible to study a universal grammar for dependency parsing. Our work focuses on different ways of representing languages in multilingual systems trained on treebanks of 37 languages. Our parsing experiments show that representing the language each word comes from yields better results than training on a simple concatenation of languages. However, using a vector to represent the language does not clearly improve results for a language that has no training data at all.

Typological Features for Multilingual Delexicalised Dependency Parsing
Manon Scholivet | Franck Dary | Alexis Nasr | Benoit Favre | Carlos Ramisch
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The existence of universal models to describe the syntax of languages has been debated for decades. The availability of resources such as the Universal Dependencies treebanks and the World Atlas of Language Structures makes it possible to study the plausibility of universal grammar from the perspective of dependency parsing. Our work investigates the use of high-level language descriptions in the form of typological features for multilingual dependency parsing. Our experiments on multilingual parsing for 40 languages show that typological information can indeed guide parsers to share information between similar languages beyond simple language identification.

2018

Veyn at PARSEME Shared Task 2018: Recurrent Neural Networks for VMWE Identification
Nicolas Zampieri | Manon Scholivet | Carlos Ramisch | Benoit Favre
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper describes the Veyn system, submitted to the closed track of the PARSEME Shared Task 2018 on automatic identification of verbal multiword expressions (VMWEs). Veyn is based on a sequence tagger using recurrent neural networks. We represent VMWEs using a variant of the begin-inside-outside encoding scheme combined with the VMWE category tag. In addition to the system description, we present development experiments to determine the best tagging scheme. Veyn is freely available, covers 19 languages, and was ranked ninth (MWE-based) and eighth (Token-based) among 13 submissions, considering macro-averaged F1 across languages.
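
As an illustration of the tagging idea mentioned in this abstract, here is a minimal sketch of plain begin-inside-outside (BIO) encoding with a category suffix. It is not Veyn's actual scheme: the abstract only says a variant was used, and the function name `bio_encode`, the input format and the way gap tokens are tagged are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def bio_encode(n_tokens: int,
               mwes: Sequence[Tuple[str, Sequence[int]]]) -> List[str]:
    """Return one tag per token: B-<cat>, I-<cat> or O.

    Hypothetical input format: each expression is a pair
    (category, token indices). Tokens in a gap between the
    components of a discontinuous expression stay "O" here;
    a real system may treat gaps differently.
    """
    tags = ["O"] * n_tokens
    for category, indices in mwes:
        # The leftmost component opens the expression, the rest continue it.
        for position, token_index in enumerate(sorted(indices)):
            prefix = "B" if position == 0 else "I"
            tags[token_index] = f"{prefix}-{category}"
    return tags


tokens = ["She", "took", "a", "quick", "shower"]
# "took ... shower" annotated as a light-verb construction (illustrative).
tags = bio_encode(len(tokens), [("LVC.full", [1, 4])])
print(list(zip(tokens, tags)))
```

A sequence tagger can then be trained to predict one such tag per token, which turns VMWE identification into a standard labelling problem.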

2017

Identification of Ambiguous Multiword Expressions Using Sequence Models and Lexical Resources
Manon Scholivet | Carlos Ramisch
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

We present a simple and efficient tagger capable of identifying highly ambiguous multiword expressions (MWEs) in French texts. It is based on conditional random fields (CRF), using local context information as features. We show that this approach can obtain results that, in some cases, approach more sophisticated parser-based MWE identification methods without requiring syntactic trees from a treebank. Moreover, we study how well the CRF can take into account external information coming from a lexicon.
