MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction

Amir Pouran Ben Veyseh, Nicole Meister, Seunghyun Yoon, Rajiv Jain, Franck Dernoncourt, Thien Huu Nguyen


Abstract
Acronym extraction (AE) is the task of identifying acronyms and their expanded forms in text, and it is necessary for various NLP applications. Despite major progress on this task in recent years, existing AE research is limited to the English language and certain domains (i.e., scientific and biomedical). The challenges of AE in other languages and domains remain largely unexplored, and the lack of annotated datasets in multiple languages and domains has been a major obstacle to research in this direction. To address this limitation, we propose a new dataset for multilingual and multi-domain AE. Specifically, 27,200 sentences in 6 different languages and 2 new domains, i.e., legal and scientific, are manually annotated for AE. Our experiments on the dataset show that AE in different languages and learning settings has unique challenges, emphasizing the necessity of further research on multilingual and multi-domain AE.
Anthology ID:
2022.coling-1.292
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3309–3314
URL:
https://aclanthology.org/2022.coling-1.292
Cite (ACL):
Amir Pouran Ben Veyseh, Nicole Meister, Seunghyun Yoon, Rajiv Jain, Franck Dernoncourt, and Thien Huu Nguyen. 2022. MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3309–3314, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction (Veyseh et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.292.pdf