2025
URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base
Aditya Khan | Mason Shipton | David Anugraha | Kaiyao Duan | Phuong H. Hoang | Eric Khiu | A. Seza Doğruöz | En-Shiun Annie Lee
Proceedings of the 31st International Conference on Computational Linguistics
URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec that addresses these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves the user experience with robust, customizable distance calculations to better suit the needs of users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
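The distances URIEL+ exposes through lang2vec are, at heart, comparisons between typological feature vectors. As a minimal sketch of one common choice, cosine distance, using made-up toy feature values rather than actual URIEL+ data:

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy binary typological vectors (hypothetical features, not real URIEL+ entries)
lang_a = [1, 0, 1, 1, 0]
lang_b = [1, 0, 1, 0, 0]
print(round(cosine_distance(lang_a, lang_b), 3))
```

The customizable distance calculations mentioned in the abstract generalize this idea: users can choose which feature subsets and which metric enter the comparison.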
ProxyLM: Predicting Language Model Performance on Multilingual Tasks via Proxy Models
David Anugraha | Genta Indra Winata | Chenyue Li | Patrick Amadeus Irawan | En-Shiun Annie Lee
Findings of the Association for Computational Linguistics: NAACL 2025
Performance prediction is a method to estimate the performance of Language Models (LMs) on various Natural Language Processing (NLP) tasks, mitigating the computational costs of fine-tuning associated with model capacity and data. Our paper presents ProxyLM, a scalable task- and language-agnostic framework designed to predict the performance of LMs using proxy models. These proxy models act as surrogates, approximating the performance of the LM of interest. By leveraging these proxy models, ProxyLM significantly reduces computational overhead in task evaluations, achieving up to a 37.08x speedup over traditional methods, even with our smallest proxy models. Our results across multiple multilingual NLP tasks and various robustness tests demonstrate that ProxyLM not only adapts well to previously unseen languages in pre-trained LMs, but also generalizes effectively across different datasets, outperforming the state-of-the-art by at least 1.78x in terms of root-mean-square error (RMSE).
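The headline metric above, RMSE between predicted and observed task scores, can be sketched in a few lines (the numbers are hypothetical, not taken from the paper):

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and observed scores."""
    assert len(predicted) == len(actual) and predicted
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Hypothetical proxy-based predictions vs. measured LM scores (e.g., chrF)
predicted = [42.0, 55.5, 30.2]
actual    = [40.0, 57.0, 31.0]
print(round(rmse(predicted, actual), 3))
```

A lower RMSE means the proxy-based predictor tracks the true model's scores more closely, which is the sense in which ProxyLM "outperforms by 1.78x".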
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
David Ifeoluwa Adelani | Jessica Ojo | Israel Abebe Azime | Jian Yun Zhuang | Jesujoba Oluwadara Alabi | Xuanli He | Millicent Ochieng | Sara Hooker | Andiswa Bukula | En-Shiun Annie Lee | Chiamaka Ijeoma Chukwuneke | Happy Buzaaba | Blessing Kudzaishe Sibanda | Godson Koffi Kalipe | Jonathan Mukiibi | Salomon Kabongo Kabenamualu | Foutse Yuehgoh | Mmasibidi Setaka | Lolwethu Ndolela | Nkiruka Odu | Rooweither Mabuya | Salomey Osei | Shamsuddeen Hassan Muhammad | Sokhar Samb | Tadesse Kebede Guge | Tombekai Vangoni Sherman | Pontus Stenetorp
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Despite the widespread adoption of Large Language Models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench, a human-translated benchmark dataset for 17 typologically diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based QA (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We also observe a significant gap between open and proprietary models: the best-performing open model, Gemma 2 27B, reaches only 63% of the performance of the best-performing proprietary model, GPT-4o. Machine-translating the test set into English before evaluation helped to close the gap for larger English-centric models such as Gemma 2 27B and LLaMa 3.1 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Genta Indra Winata | Frederikus Hudi | Patrick Amadeus Irawan | David Anugraha | Rifki Afina Putri | Wang Yutong | Adam Nohejl | Ubaidillah Ariq Prathama | Nedjma Ousidhoum | Afifa Amriani | Anar Rzayev | Anirban Das | Ashmari Pramodya | Aulia Adila | Bryan Wilie | Candy Olivia Mawalim | Cheng Ching Lam | Daud Abolade | Emmanuele Chersoni | Enrico Santus | Fariz Ikhwantri | Garry Kuwanto | Hanyang Zhao | Haryo Akbarianto Wibowo | Holy Lovenia | Jan Christian Blaise Cruz | Jan Wira Gotama Putra | Junho Myung | Lucky Susanto | Maria Angelica Riera Machin | Marina Zhukova | Michael Anugraha | Muhammad Farid Adilazuarda | Natasha Christabelle Santosa | Peerat Limkonchotiwat | Raj Dabre | Rio Alexander Audino | Samuel Cahyawijaya | Shi-Xiong Zhang | Stephanie Yulia Salim | Yi Zhou | Yinxuan Gui | David Ifeoluwa Adelani | En-Shiun Annie Lee | Shogo Okada | Ayu Purwarianti | Alham Fikri Aji | Taro Watanabe | Derry Tanti Wijaya | Alice Oh | Chong-Wah Ngo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages
Steve Bakos | David Guzmán | Riddhi More | Kelly Chutong Li | Félix Gaschi | En-Shiun Annie Lee
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Realignment techniques are often employed to enhance cross-lingual transfer in multilingual language models; however, they can sometimes degrade performance in languages that differ significantly from the fine-tuning source language. This paper introduces AlignFreeze, a method that freezes either the lower or the upper half of a model's layers during realignment. Through controlled experiments on 4 tasks, 3 models, and 35 languages, we find that realignment affects all the layers but can be most detrimental to the lower ones, and that freezing the lower layers can prevent performance degradation. In particular, AlignFreeze improves Part-of-Speech (PoS) tagging performance in languages where full realignment fails: with XLM-R, it provides improvements of more than one standard deviation in accuracy in seven more languages than full realignment.
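The half-freezing scheme described above can be illustrated with a small sketch (a hypothetical layer-index partition, not the authors' code): given a stack of N transformer layers, realignment updates only one half while the other stays fixed.

```python
def split_for_realignment(num_layers, freeze="lower"):
    """Partition layer indices into (frozen, trainable) halves.

    AlignFreeze-style realignment keeps one half of the layers fixed
    while the other half is updated. Layer indexing is illustrative.
    """
    half = num_layers // 2
    lower = list(range(half))
    upper = list(range(half, num_layers))
    return (lower, upper) if freeze == "lower" else (upper, lower)

# e.g. a 12-layer encoder such as XLM-R base
frozen, trainable = split_for_realignment(12, freeze="lower")
print(frozen)     # indices kept fixed during realignment
print(trainable)  # indices updated during realignment
```

In a real setup, "freezing" would mean disabling gradient updates for the parameters of the frozen indices; the sketch only shows which half each layer falls into.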
2024
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
David Ifeoluwa Adelani | Hannah Liu | Xiaoyu Shen | Nikita Vassilyev | Jesujoba O. Alabi | Yanke Mao | Haonan Gao | En-Shiun Annie Lee
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the progress in building multilingual language models, evaluation is often limited to a few languages with available datasets, which excludes a large number of low-resource languages. In this paper, we create SIB-200, a large-scale open-sourced benchmark dataset for topic classification in 205 languages and dialects, to address the lack of evaluation datasets for Natural Language Understanding (NLU). For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for NLU. The dataset is based on the Flores-200 machine translation corpus. We annotated the English portion of the dataset and extended the sentence-level annotation to the remaining 204 languages covered in the corpus. Despite the simplicity of this task, our evaluation in fully supervised, cross-lingual transfer, and large language model prompting settings shows that there is still a large gap between the performance of high-resource and low-resource languages when multilingual evaluation is scaled to numerous world languages. We found that languages unseen during the pre-training of multilingual language models, languages from under-represented families (like Nilotic and Atlantic-Congo), and languages from the regions of Africa, the Americas, Oceania, and South East Asia often have the lowest performance on our topic classification dataset. We hope our dataset encourages a more inclusive evaluation of multilingual language models on a more diverse set of languages.
AfriInstruct: Instruction Tuning of African Languages for Diverse Tasks
Kosei Uemura | Mahe Chen | Alex Pejovic | Chika Maduabuchi | Yifei Sun | En-Shiun Annie Lee
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) perform worse on African languages than on high-resource languages. To address this issue, we introduce AfriInstruct, which specializes in instruction tuning for multiple African languages covering various tasks. We trained LLaMa-2-7B using continual pretraining and instruction fine-tuning, demonstrating superior performance across multiple tasks. Our mixed-task evaluation shows that our model outperforms GPT-3.5-Turbo and other baseline models of similar size. Our contributions help close a critical gap in LLM performance between high-resource and African languages.
2022
Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?
En-Shiun Annie Lee | Sarubi Thillainathan | Shravan Nayak | Surangika Ranathunga | David Ifeoluwa Adelani | Ruisi Su | Arya D. McCarthy
Findings of the Association for Computational Linguistics: ACL 2022
What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) language typology. In addition to yielding several heuristics, the experiments form a framework for evaluating the data sensitivities of machine translation systems. While mBART is robust to domain differences, its translations for unseen and typologically distant languages remain below 3.0 BLEU. In answer to our title’s question, mBART is not a low-resource panacea; we therefore encourage shifting the emphasis from new models to new data.