@inproceedings{bear-cook-2022-leveraging,
    title = "Leveraging a Bilingual Dictionary to Learn Wolastoqey Word Representations",
    author = "Bear, Diego and
      Cook, Paul",
    editor = "Calzolari, Nicoletta and
      B{\'e}chet, Fr{\'e}d{\'e}ric and
      Blache, Philippe and
      Choukri, Khalid and
      Cieri, Christopher and
      Declerck, Thierry and
      Goggi, Sara and
      Isahara, Hitoshi and
      Maegaard, Bente and
      Mariani, Joseph and
      Mazo, H{\'e}l{\`e}ne and
      Odijk, Jan and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.124",
    pages = "1159--1166",
    abstract = "Word embeddings (Mikolov et al., 2013; Pennington et al., 2014) have been used to bolster the performance of natural language processing systems in a wide variety of tasks, including information retrieval (Roy et al., 2018) and machine translation (Qi et al., 2018). However, approaches to learning word embeddings typically require large corpora of running text to learn high quality representations. For many languages, such resources are unavailable. This is the case for Wolastoqey, also known as Passamaquoddy-Maliseet, an endangered low-resource Indigenous language. As there exist no large corpora of running text for Wolastoqey, in this paper, we leverage a bilingual dictionary to learn Wolastoqey word embeddings by encoding their corresponding English definitions into vector representations using pretrained English word and sequence representation models. Specifically, we consider representations based on pretrained word2vec (Mikolov et al., 2013), RoBERTa (Liu et al., 2019) and sentence-BERT (Reimers and Gurevych, 2019) models. We evaluate these embeddings in word prediction tasks focused on part-of-speech, animacy, and transitivity; semantic clustering; and reverse dictionary search. In all evaluations we demonstrate that approaches using these embeddings outperform task-specific baselines, without requiring any language-specific training or fine-tuning.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="bear-cook-2022-leveraging">
    <titleInfo>
      <title>Leveraging a Bilingual Dictionary to Learn Wolastoqey Word Representations</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Diego</namePart>
      <namePart type="family">Bear</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Paul</namePart>
      <namePart type="family">Cook</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2022-06</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the Thirteenth Language Resources and Evaluation Conference</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Nicoletta</namePart>
        <namePart type="family">Calzolari</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Frédéric</namePart>
        <namePart type="family">Béchet</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Philippe</namePart>
        <namePart type="family">Blache</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Khalid</namePart>
        <namePart type="family">Choukri</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Christopher</namePart>
        <namePart type="family">Cieri</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Thierry</namePart>
        <namePart type="family">Declerck</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Sara</namePart>
        <namePart type="family">Goggi</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Hitoshi</namePart>
        <namePart type="family">Isahara</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Bente</namePart>
        <namePart type="family">Maegaard</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Joseph</namePart>
        <namePart type="family">Mariani</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Hélène</namePart>
        <namePart type="family">Mazo</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Jan</namePart>
        <namePart type="family">Odijk</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Stelios</namePart>
        <namePart type="family">Piperidis</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>European Language Resources Association</publisher>
        <place>
          <placeTerm type="text">Marseille, France</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Word embeddings (Mikolov et al., 2013; Pennington et al., 2014) have been used to bolster the performance of natural language processing systems in a wide variety of tasks, including information retrieval (Roy et al., 2018) and machine translation (Qi et al., 2018). However, approaches to learning word embeddings typically require large corpora of running text to learn high quality representations. For many languages, such resources are unavailable. This is the case for Wolastoqey, also known as Passamaquoddy-Maliseet, an endangered low-resource Indigenous language. As there exist no large corpora of running text for Wolastoqey, in this paper, we leverage a bilingual dictionary to learn Wolastoqey word embeddings by encoding their corresponding English definitions into vector representations using pretrained English word and sequence representation models. Specifically, we consider representations based on pretrained word2vec (Mikolov et al., 2013), RoBERTa (Liu et al., 2019) and sentence-BERT (Reimers and Gurevych, 2019) models. We evaluate these embeddings in word prediction tasks focused on part-of-speech, animacy, and transitivity; semantic clustering; and reverse dictionary search. In all evaluations we demonstrate that approaches using these embeddings outperform task-specific baselines, without requiring any language-specific training or fine-tuning.</abstract>
    <identifier type="citekey">bear-cook-2022-leveraging</identifier>
    <location>
      <url>https://aclanthology.org/2022.lrec-1.124</url>
    </location>
    <part>
      <date>2022-06</date>
      <extent unit="page">
        <start>1159</start>
        <end>1166</end>
      </extent>
    </part>
  </mods>
</modsCollection>
%0 Conference Proceedings
%T Leveraging a Bilingual Dictionary to Learn Wolastoqey Word Representations
%A Bear, Diego
%A Cook, Paul
%Y Calzolari, Nicoletta
%Y Béchet, Frédéric
%Y Blache, Philippe
%Y Choukri, Khalid
%Y Cieri, Christopher
%Y Declerck, Thierry
%Y Goggi, Sara
%Y Isahara, Hitoshi
%Y Maegaard, Bente
%Y Mariani, Joseph
%Y Mazo, Hélène
%Y Odijk, Jan
%Y Piperidis, Stelios
%S Proceedings of the Thirteenth Language Resources and Evaluation Conference
%D 2022
%8 June
%I European Language Resources Association
%C Marseille, France
%F bear-cook-2022-leveraging
%X Word embeddings (Mikolov et al., 2013; Pennington et al., 2014) have been used to bolster the performance of natural language processing systems in a wide variety of tasks, including information retrieval (Roy et al., 2018) and machine translation (Qi et al., 2018). However, approaches to learning word embeddings typically require large corpora of running text to learn high quality representations. For many languages, such resources are unavailable. This is the case for Wolastoqey, also known as Passamaquoddy-Maliseet, an endangered low-resource Indigenous language. As there exist no large corpora of running text for Wolastoqey, in this paper, we leverage a bilingual dictionary to learn Wolastoqey word embeddings by encoding their corresponding English definitions into vector representations using pretrained English word and sequence representation models. Specifically, we consider representations based on pretrained word2vec (Mikolov et al., 2013), RoBERTa (Liu et al., 2019) and sentence-BERT (Reimers and Gurevych, 2019) models. We evaluate these embeddings in word prediction tasks focused on part-of-speech, animacy, and transitivity; semantic clustering; and reverse dictionary search. In all evaluations we demonstrate that approaches using these embeddings outperform task-specific baselines, without requiring any language-specific training or fine-tuning.
%U https://aclanthology.org/2022.lrec-1.124
%P 1159-1166
Markdown (Informal)
[Leveraging a Bilingual Dictionary to Learn Wolastoqey Word Representations](https://aclanthology.org/2022.lrec-1.124) (Bear & Cook, LREC 2022)
ACL
Diego Bear and Paul Cook. 2022. Leveraging a Bilingual Dictionary to Learn Wolastoqey Word Representations. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1159–1166, Marseille, France. European Language Resources Association.
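
As a rough illustration of the approach the abstract describes (representing a Wolastoqey word by encoding its English dictionary definition with a pretrained English model), here is a minimal sketch assuming the sentence-transformers library (Reimers and Gurevych, 2019). The model name, the two dictionary entries, and the query are invented placeholders, not the authors' actual setup or data:

```python
# Illustrative sketch only (not the paper's code): embed Wolastoqey
# headwords via their English definitions using a pretrained
# Sentence-BERT model, then do a reverse-dictionary-style lookup.
from sentence_transformers import SentenceTransformer
import numpy as np

# Hypothetical bilingual-dictionary entries: headword -> English gloss.
dictionary = {
    "headword-1": "a small domesticated feline animal",
    "headword-2": "a four-wheeled vehicle powered by an engine",
}

# Any pretrained SBERT checkpoint; this model name is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

headwords = list(dictionary)
# Each headword's embedding is the encoding of its English definition.
embeddings = model.encode([dictionary[w] for w in headwords])

# Reverse dictionary search: rank headwords by cosine similarity
# between their definition embeddings and a query description.
query = model.encode("a pet that meows")
scores = embeddings @ query / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
)
for word, score in sorted(zip(headwords, scores), key=lambda x: -x[1]):
    print(f"{word}\t{score:.3f}")
```

The same definition-derived vectors, per the abstract, also feed word-level classifiers (part-of-speech, animacy, transitivity) and semantic clustering, with no Wolastoqey-specific training or fine-tuning.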