English-Malay Cross-Lingual Embedding Alignment using Bilingual Lexicon Augmentation

Ying Hao Lim, Jasy Suet Yan Liew


Abstract
As high-quality Malay language resources are still scarce, cross-lingual word embeddings make it possible to leverage richer English resources for downstream Malay text classification tasks. This paper focuses on creating English-Malay cross-lingual word embeddings through embedding alignment, exploiting existing language resources. We augmented the training bilingual lexicons using machine translation with the goal of improving the alignment precision of our cross-lingual word embeddings. We investigated the quality of the current state-of-the-art English-Malay bilingual lexicon and worked on improving it using Google Translate. We also examined the effect of Malay word coverage on the quality of cross-lingual word embeddings. Experimental results, with a precision of up to 28.17%, show that the alignment precision of the cross-lingual word embeddings inevitably degrades after 1-NN, but a better seed lexicon and cleaner nearest neighbours can reduce the number of word pairs required to achieve satisfactory performance. As the English and Malay monolingual embeddings are pre-trained on informal language corpora, our proposed English-Malay embedding alignment approach is also able to map non-standard Malay translations in the English nearest neighbours.
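To make the idea of lexicon-supervised embedding alignment concrete, the sketch below shows one common way such a mapping can be learned and scored. It is an illustration only: the abstract does not say which alignment method the authors used, so orthogonal Procrustes is an assumption here, as are the function names `procrustes_align` and `precision_at_k` and the synthetic toy vectors. The precision@k shown is the usual bilingual lexicon induction measure and may differ from the paper's evaluation protocol.

```python
import numpy as np


def procrustes_align(src_seed, tgt_seed):
    """Orthogonal W minimising ||src_seed @ W - tgt_seed||_F.

    Rows of src_seed and tgt_seed are embeddings of seed-lexicon word pairs.
    Assumed method for illustration; not confirmed by the source.
    """
    u, _, vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return u @ vt


def precision_at_k(src_mapped, tgt_vocab, gold_idx, k=1):
    """Fraction of source words whose gold translation is among the k
    nearest target words by cosine similarity."""
    a = src_mapped / np.linalg.norm(src_mapped, axis=1, keepdims=True)
    b = tgt_vocab / np.linalg.norm(tgt_vocab, axis=1, keepdims=True)
    sims = a @ b.T                           # (n_src, n_tgt) cosine scores
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best targets
    return float((topk == gold_idx[:, None]).any(axis=1).mean())


# Synthetic toy data, not taken from the paper: the "Malay" vectors are an
# exact rotation of the "English" vectors, so alignment is noise-free.
rng = np.random.default_rng(0)
en = rng.normal(size=(20, 5))
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
ms = en @ rotation
gold = np.arange(20)  # word i in `en` translates to word i in `ms`

W = procrustes_align(en, ms)
p1 = precision_at_k(en @ W, ms, gold, k=1)
print(f"precision@1 = {p1:.2f}")
```

In practice the seed pairs would come from a bilingual lexicon and the vectors from pre-trained monolingual embeddings; a noisier or smaller seed lexicon would be expected to give a worse mapping, which is the kind of effect the abstract describes.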
Anthology ID:
2022.acl-srw.16
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Samuel Louvan, Andrea Madotto, Brielen Madureira
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
229–238
URL:
https://aclanthology.org/2022.acl-srw.16
DOI:
10.18653/v1/2022.acl-srw.16
Cite (ACL):
Ying Hao Lim and Jasy Suet Yan Liew. 2022. English-Malay Cross-Lingual Embedding Alignment using Bilingual Lexicon Augmentation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 229–238, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
English-Malay Cross-Lingual Embedding Alignment using Bilingual Lexicon Augmentation (Lim & Liew, ACL 2022)
PDF:
https://aclanthology.org/2022.acl-srw.16.pdf