ScriptMix: Mixing Scripts for Low-resource Language Parsing

Jaeseong Lee; Dohyeon Lee; Seung-won Hwang

doi:10.18653/v1/2024.naacl-long.357

Abstract

Despite the success of multilingual pretrained language models (mPLMs) on tasks such as dependency parsing (DEP) and part-of-speech (POS) tagging, their coverage of hundreds of languages is still limited, as most of the 6500+ languages remain “unseen”. To adapt mPLMs to such unseen languages, existing work has considered transliteration and vocabulary augmentation. Meanwhile, combining the two has received surprisingly little attention. To understand why, we identify both the complementary strengths of the two approaches and the hurdle to realizing them together. Based on this observation, we propose ScriptMix, which combines the two strengths while overcoming the hurdle. Specifically, ScriptMix a) is trained on a dual-script corpus to combine the strengths, but b) uses separate modules to avoid gradient conflict. In combining the modules properly, we also point out a limitation of the conventional method, AdapterFusion, and propose AdapterFusion+ to overcome it. We empirically show that ScriptMix is effective: it improves POS accuracy by up to 14% and the DEP LAS score by up to 5.6%. Our code is publicly available.

Anthology ID: 2024.naacl-long.357
Volume: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 6430–6444
URL: https://aclanthology.org/2024.naacl-long.357/
DOI: 10.18653/v1/2024.naacl-long.357
Cite (ACL): Jaeseong Lee, Dohyeon Lee, and Seung-won Hwang. 2024. ScriptMix: Mixing Scripts for Low-resource Language Parsing. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6430–6444, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): ScriptMix: Mixing Scripts for Low-resource Language Parsing (Lee et al., NAACL 2024)
PDF: https://aclanthology.org/2024.naacl-long.357.pdf
Video: https://aclanthology.org/2024.naacl-long.357.mp4
