MultiVitaminBooster at PARSEME Shared Task 2020: Combining Window- and Dependency-Based Features with Multilingual Contextualised Word Embeddings for VMWE Detection

Sebastian Gombert, Sabine Bartsch


Abstract
In this paper, we present MultiVitaminBooster, a system implemented for edition 1.2 of the PARSEME shared task on semi-supervised identification of verbal multiword expressions. We treat the detection of verbal multiword expressions as a token classification task: for each token, the system decides whether it is part of a verbal multiword expression. For this purpose, we train gradient boosting-based models. We encode tokens as feature vectors that combine multilingual contextualized word embeddings from the XLM-RoBERTa language model with a more traditional linguistic feature set based on context windows and dependency relations. Our system ranked 7th in the official open track ranking of the shared task evaluations, but an encoding-related bug distorted these results. For this reason, we carry out further unofficial evaluations. Unofficial versions of our systems would have achieved higher ranks.
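The feature encoding described in the abstract can be illustrated with a minimal sketch. The abstract does not list the exact features, so everything concrete here is an assumption made for illustration: the `Token` class, the helper names, the choice of POS tags and dependency labels as features, the window size of 1, and the two-dimensional toy vectors standing in for XLM-RoBERTa embeddings.

```python
from dataclasses import dataclass
from typing import List, Sequence

PAD = "<PAD>"  # hypothetical placeholder for positions outside the sentence


@dataclass
class Token:
    form: str
    upos: str    # universal POS tag
    head: int    # 1-based index of the syntactic head, 0 for the root
    deprel: str  # dependency relation to the head


def one_hot(value: str, vocab: Sequence[str]) -> List[float]:
    """Encode a categorical value as a one-hot vector over vocab."""
    return [1.0 if value == v else 0.0 for v in vocab]


def window_features(sent, i, size, upos_vocab):
    """POS tags of the neighbours within +/- size positions of token i."""
    feats = []
    for offset in range(-size, size + 1):
        if offset == 0:
            continue
        j = i + offset
        tag = sent[j].upos if 0 <= j < len(sent) else PAD
        feats.extend(one_hot(tag, upos_vocab))
    return feats


def dependency_features(sent, i, upos_vocab, deprel_vocab):
    """Dependency label of token i plus the POS tag of its head."""
    tok = sent[i]
    head_tag = sent[tok.head - 1].upos if tok.head > 0 else PAD
    return one_hot(tok.deprel, deprel_vocab) + one_hot(head_tag, upos_vocab)


def encode_token(sent, i, embedding, upos_vocab, deprel_vocab, window=1):
    """Concatenate the embedding with window and dependency features."""
    return (
        list(embedding)
        + one_hot(sent[i].upos, upos_vocab)
        + window_features(sent, i, window, upos_vocab)
        + dependency_features(sent, i, upos_vocab, deprel_vocab)
    )


# Toy sentence "She gave up", where "gave up" is a verb-particle construction.
SENT = [
    Token("She", "PRON", 2, "nsubj"),
    Token("gave", "VERB", 0, "root"),
    Token("up", "ADP", 2, "compound:prt"),
]
UPOS_VOCAB = ["PRON", "VERB", "ADP", PAD]
DEPREL_VOCAB = ["nsubj", "root", "compound:prt"]

# Stand-ins for contextualised embeddings (real ones have hundreds of dimensions).
EMBEDDINGS = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# One feature vector per token, with a binary label: 1 = part of a VMWE.
X = [
    encode_token(SENT, i, EMBEDDINGS[i], UPOS_VOCAB, DEPREL_VOCAB)
    for i in range(len(SENT))
]
y = [0, 1, 1]

vec = X[2]  # feature vector for "up"
```

Under these assumptions, each row of `X` together with its label in `y` is one training instance for a gradient boosting classifier. The sketch stops before the training step because the abstract does not say which boosting implementation the system uses.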
Anthology ID:
2020.mwe-1.20
Volume:
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons
Month:
December
Year:
2020
Address:
online
Venues:
COLING | MWE
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
149–155
URL:
https://aclanthology.org/2020.mwe-1.20
PDF:
https://aclanthology.org/2020.mwe-1.20.pdf