2024
pdf
bib
abs
Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity
Eric Khiu
|
Hasti Toossi
|
David Anugraha
|
Jinyu Liu
|
Jiaxu Li
|
Juan Flores
|
Leandro Roman
|
A. Seza Doğruöz
|
En-Shiun Lee
Findings of the Association for Computational Linguistics: EACL 2024
Fine-tuning and testing a multilingual large language model is a challenge for low-resource languages (LRLs) since it is an expensive process. While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors (the size of the fine-tuning corpus, domain similarity between fine-tuning and testing corpora, and language similarity between source and target languages), which can potentially impact the model performance by using classical regression models. Our results indicate that domain similarity has the most important impact on predicting the performance of Machine Translation models.
pdf
bib
abs
Unlocking Parameter-Efficient Fine-Tuning for Low-Resource Language Translation
Tong Su
|
Xin Peng
|
Sarubi Thillainathan
|
David Guzmán
|
Surangika Ranathunga
|
En-Shiun Lee
Findings of the Association for Computational Linguistics: NAACL 2024
Parameter-efficient fine-tuning (PEFT) methods are increasingly vital in adapting large-scale pre-trained language models for diverse tasks, offering a balance between adaptability and computational efficiency. They are important in Low-Resource Language (LRL) Neural Machine Translation (NMT) to enhance translation accuracy with minimal resources. However, their practical effectiveness varies significantly across different languages. We conducted comprehensive empirical experiments with varying LRL domains and sizes to evaluate the performance of 8 PEFT methods with in total of 15 architectures using the SacreBLEU score. We showed that 6 PEFT architectures outperform the baseline for both in-domain and out-domain tests and the Houlsby+Inversion adapter has the best performance overall, proving the effectiveness of PEFT methods.
pdf
bib
abs
A Reproducibility Study on Quantifying Language Similarity: The Impact of Missing Values in the URIEL Knowledge Base
Hasti Toossi
|
Guo Huai
|
Jinyu Liu
|
Eric Khiu
|
A. Seza Doğruöz
|
En-Shiun Lee
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
In the pursuit of supporting more languages around the world, tools that characterize properties of languages play a key role in expanding the existing multilingual NLP research. In this study, we focus on a widely used typological knowledge base, URIEL, which aggregates linguistic information into numeric vectors. Specifically, we delve into the soundness and reproducibility of the approach taken by URIEL in quantifying language similarity. Our analysis reveals URIEL’s ambiguity in calculating language distances and in handling missing values. Moreover, we find that URIEL does not provide any information about typological features for 31% of the languages it represents, undermining the reliabilility of the database, particularly on low-resource languages. Our literature review suggests URIEL and lang2vec are used in papers on diverse NLP tasks, which motivates us to rigorously verify the database as the effectiveness of these works depends on the reliability of the information the tool provides.
pdf
bib
abs
Empowering the Future with Multilinguality and Language Diversity
En-Shiun Lee
|
Kosei Uemura
|
Syed Wasti
|
Mason Shipton
Proceedings of the Sixth Workshop on Teaching NLP
The rapid advancements and the widespread transformation of Large Language Models, have made it necessary to incorporate these cutting-edge techniques into the educational curricula of Natural Language Processing (NLP) with limited computing resources. This paper presents an applied NLP course designed for upper-year computer science undergraduate students on state-of-the-art techniques with an emphasis on multilinguality and language diversity. We hope to empower learners to advance their language community while preparing for industry.