CoAM: Corpus of All-Type Multiword Expressions

Yusuke Ide; Joshua Tanner; Adam Nohejl; Jacob Hoffman; Justin Vasselli; Hidetaka Kamigaito; Taro Watanabe

doi:10.18653/v1/2025.acl-long.1311

Abstract

Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process of human annotation, human review, and automated consistency checking designed to enhance data quality. Additionally, for the first time in an MWE identification dataset, CoAM’s MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form. Through experiments using CoAM, we find that a fine-tuned large language model outperforms MWEasWSD, which achieved state-of-the-art performance on the DiMSUM dataset. Furthermore, analysis using our MWE-type-tagged data reveals that Verb MWEs are easier than Noun MWEs to identify across approaches.

Anthology ID: 2025.acl-long.1311
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 27004–27021
URL: https://aclanthology.org/2025.acl-long.1311/
DOI: 10.18653/v1/2025.acl-long.1311
Cite (ACL): Yusuke Ide, Joshua Tanner, Adam Nohejl, Jacob Hoffman, Justin Vasselli, Hidetaka Kamigaito, and Taro Watanabe. 2025. CoAM: Corpus of All-Type Multiword Expressions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27004–27021, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): CoAM: Corpus of All-Type Multiword Expressions (Ide et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1311.pdf
