Cross-type French Multiword Expression Identification with Pre-trained Masked Language Models

Van-Tuan Bui, Agata Savary


Abstract
Multiword expressions (MWEs) pose difficulties for natural language processing (NLP) due to their linguistic features, such as syntactic and semantic properties, which distinguish them from regular word groupings. This paper describes a combination of two systems: one that learns verbal multiword expressions (VMWEs) and another that learns non-verbal MWEs (nVMWEs). Together, these systems leverage training data from both types of MWEs to enhance performance on a cross-type dataset containing both VMWEs and nVMWEs. Such scenarios emerge when datasets are developed using differing annotation schemes. We explore the fine-tuning of several state-of-the-art neural transformers for each MWE type. Our experiments demonstrate the advantages of the combined system over multi-task approaches or single-task models, addressing the challenges posed by diverse tagsets within the training data. Specifically, we evaluated the combined system on a French treebank named Sequoia, which features an annotation layer encompassing all syntactic types of French MWEs. With this combined approach, we improved the F1-score by approximately 3% on the Sequoia dataset.
Anthology ID:
2024.lrec-main.374
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
4198–4204
Language:
URL:
https://aclanthology.org/2024.lrec-main.374
DOI:
Bibkey:
Cite (ACL):
Van-Tuan Bui and Agata Savary. 2024. Cross-type French Multiword Expression Identification with Pre-trained Masked Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4198–4204, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Cross-type French Multiword Expression Identification with Pre-trained Masked Language Models (Bui & Savary, LREC-COLING 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.lrec-main.374.pdf