@inproceedings{kunchukuttan-etal-2025-data,
title = "Data and Model Centric Approaches for Expansion of Large Language Models to New languages",
author = "Kunchukuttan, Anoop and
Dabre, Raj and
Murthy, Rudra and
Khan, Mohammed Safi Ur Rahman and
Jayakumar, Thanmay",
editor = "Pyatkin, Valentina and
Vlachos, Andreas",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-tutorials.5/",
pages = "12--13",
ISBN = "979-8-89176-336-4",
abstract = "Despite the increasing pace of Large Language Model (LLM) research, a vast majority of existing LLMs mainly support English alongside a handful of high resource languages, leaving a major gap for most low-resource languages. In this tutorial, we focus on approaches to expand the language coverage of LLMs. This provides an efficient and viable path to bring LLM technologies to low-resource languages, instead of training from scratch. We look at approaches at various stages of the LLM training pipeline, like tokenizer training, pre-training, instruction tuning, alignment, evaluation, etc., where adaptations are made to support new languages. We look at data-oriented approaches as well as model-oriented approaches. We hope that our tutorial enables researchers and practitioners to work on incorporating additional languages and tasks into existing LLMs to enhance inclusivity and coverage."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="kunchukuttan-etal-2025-data">
    <titleInfo>
      <title>Data and Model Centric Approaches for Expansion of Large Language Models to New Languages</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Anoop</namePart>
      <namePart type="family">Kunchukuttan</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Raj</namePart>
      <namePart type="family">Dabre</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Rudra</namePart>
      <namePart type="family">Murthy</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Mohammed</namePart>
      <namePart type="given">Safi</namePart>
      <namePart type="given">Ur</namePart>
      <namePart type="given">Rahman</namePart>
      <namePart type="family">Khan</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Thanmay</namePart>
      <namePart type="family">Jayakumar</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Valentina</namePart>
        <namePart type="family">Pyatkin</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Andreas</namePart>
        <namePart type="family">Vlachos</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Suzhou, China</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-336-4</identifier>
    </relatedItem>
    <abstract>Despite the increasing pace of Large Language Model (LLM) research, the vast majority of existing LLMs mainly support English alongside a handful of high-resource languages, leaving a major gap for most low-resource languages. In this tutorial, we focus on approaches to expand the language coverage of LLMs, which offers an efficient and viable path to bring LLM technologies to low-resource languages without training from scratch. We examine approaches at various stages of the LLM training pipeline, such as tokenizer training, pre-training, instruction tuning, alignment, and evaluation, where adaptations are made to support new languages. We cover both data-oriented and model-oriented approaches. We hope that our tutorial enables researchers and practitioners to incorporate additional languages and tasks into existing LLMs, enhancing inclusivity and coverage.</abstract>
    <identifier type="citekey">kunchukuttan-etal-2025-data</identifier>
    <location>
      <url>https://aclanthology.org/2025.emnlp-tutorials.5/</url>
    </location>
    <part>
      <date>2025-11</date>
      <extent unit="page">
        <start>12</start>
        <end>13</end>
      </extent>
    </part>
  </mods>
</modsCollection>

%0 Conference Proceedings
%T Data and Model Centric Approaches for Expansion of Large Language Models to New Languages
%A Kunchukuttan, Anoop
%A Dabre, Raj
%A Murthy, Rudra
%A Khan, Mohammed Safi Ur Rahman
%A Jayakumar, Thanmay
%Y Pyatkin, Valentina
%Y Vlachos, Andreas
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-336-4
%F kunchukuttan-etal-2025-data
%X Despite the increasing pace of Large Language Model (LLM) research, the vast majority of existing LLMs mainly support English alongside a handful of high-resource languages, leaving a major gap for most low-resource languages. In this tutorial, we focus on approaches to expand the language coverage of LLMs, which offers an efficient and viable path to bring LLM technologies to low-resource languages without training from scratch. We examine approaches at various stages of the LLM training pipeline, such as tokenizer training, pre-training, instruction tuning, alignment, and evaluation, where adaptations are made to support new languages. We cover both data-oriented and model-oriented approaches. We hope that our tutorial enables researchers and practitioners to incorporate additional languages and tasks into existing LLMs, enhancing inclusivity and coverage.
%U https://aclanthology.org/2025.emnlp-tutorials.5/
%P 12-13

Markdown (Informal)
[Data and Model Centric Approaches for Expansion of Large Language Models to New Languages](https://aclanthology.org/2025.emnlp-tutorials.5/) (Kunchukuttan et al., EMNLP 2025)

ACL
Anoop Kunchukuttan, Raj Dabre, Rudra Murthy, Mohammed Safi Ur Rahman Khan, and Thanmay Jayakumar. 2025. Data and Model Centric Approaches for Expansion of Large Language Models to New Languages. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, pages 12–13, Suzhou, China. Association for Computational Linguistics.