Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment

Zhongtao Miao; Qiyu Wu; Kaiyan Zhao; Zilong Wu; Yoshimasa Tsuruoka

doi:10.18653/v1/2024.findings-naacl.204

Abstract

The field of cross-lingual sentence embeddings has recently experienced significant advancements, but research concerning low-resource languages has lagged due to the scarcity of parallel corpora. This paper shows that cross-lingual word representation in low-resource languages is notably under-aligned with that in high-resource languages in current models. To address this, we introduce a novel framework that explicitly aligns words between English and eight low-resource languages, utilizing off-the-shelf word alignment models. This framework incorporates three primary training objectives: aligned word prediction and word translation ranking, along with the widely used translation ranking. We evaluate our approach through experiments on the bitext retrieval task, which demonstrate substantial improvements on sentence embeddings in low-resource languages. In addition, the competitive performance of the proposed model across a broader range of tasks in high-resource languages underscores its practicality.
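As an illustration of the "widely used translation ranking" objective mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes the common in-batch formulation, in which each source sentence must rank its own translation above every other target sentence in the batch, using a softmax over cosine similarities. The function names, the `temperature` value, and the toy two-dimensional embeddings are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translation_ranking_loss(src_embs, tgt_embs, temperature=0.05):
    """Mean in-batch softmax cross-entropy.

    src_embs[i] and tgt_embs[i] are assumed to be embeddings of a
    translation pair; all other targets in the batch act as negatives.
    The temperature is an illustrative value, not one from the paper.
    """
    total = 0.0
    for i, src in enumerate(src_embs):
        logits = [cosine(src, tgt) / temperature for tgt in tgt_embs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_norm = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        total += log_norm - logits[i]
    return total / len(src_embs)


# Toy embeddings (hypothetical): two English sentences and their translations.
english = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # translations in matching order
shuffled = [[0.1, 0.9], [0.9, 0.1]]  # translations paired with the wrong source

loss_aligned = translation_ranking_loss(english, aligned)
loss_shuffled = translation_ranking_loss(english, shuffled)
```

In this toy setup the loss is lower when each translation sits next to its own source sentence than when the pairs are shuffled. The word-level objectives named in the abstract (aligned word prediction and word translation ranking) are not sketched here, since the abstract does not specify their form.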

Anthology ID: 2024.findings-naacl.204
Volume: Findings of the Association for Computational Linguistics: NAACL 2024
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3225–3236
URL: https://aclanthology.org/2024.findings-naacl.204/
DOI: 10.18653/v1/2024.findings-naacl.204
Cite (ACL): Zhongtao Miao, Qiyu Wu, Kaiyan Zhao, Zilong Wu, and Yoshimasa Tsuruoka. 2024. Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3225–3236, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment (Miao et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-naacl.204.pdf
Video: https://aclanthology.org/2024.findings-naacl.204.mp4
