Large Language Models as a Normalizer for Transliteration and Dialectal Translation

Md Mahfuz Ibn Alam, Antonios Anastasopoulos


Abstract
NLP models trained on standardized language data often struggle with non-standard variation such as dialects and transliterated text. We assess various Large Language Models (LLMs) for transliteration and dialectal normalization. Tuning open-source LLMs with as few as 10,000 parallel examples using LoRA can achieve results comparable to or better than closed-source LLMs. We perform dialectal normalization experiments on twelve South Asian languages and dialectal translation experiments on six language continua worldwide. Dialectal normalization can also serve as a preliminary step for the downstream dialectal translation task: of the six languages used in dialectal translation, our approach enables Italian and Swiss German to surpass the baseline model by 21.5 and 25.8 BLEU points, respectively.
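A minimal sketch of the kind of LoRA tuning the abstract describes, using the Hugging Face transformers and peft libraries. The base model, LoRA hyperparameters, and prompt format below are illustrative assumptions, not the authors' exact setup; only the general recipe (low-rank adapters trained on a small set of parallel normalization pairs) is taken from the abstract.

```python
# Sketch: LoRA fine-tuning an open-source LLM on parallel normalization pairs.
# Assumptions: model name, LoRA hyperparameters, and prompt template are
# hypothetical stand-ins, not the paper's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = "meta-llama/Llama-2-7b-hf"  # assumed choice of open-source LLM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small low-rank update matrices instead of all model weights,
# which is why ~10,000 parallel examples can be enough to adapt the model.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank of the low-rank update (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction is trainable

# Each training example pairs a dialectal or transliterated input with its
# normalized standard form; prompts like this would then be tokenized and
# trained with a standard causal-LM objective (e.g., transformers.Trainer).
example = {
    "input": "dialectal or romanized sentence",
    "target": "normalized standard-language sentence",
}
prompt = f"Normalize: {example['input']}\nStandard: {example['target']}"
```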
Anthology ID:
2025.vardial-1.5
Volume:
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venues:
VarDial | WS
Publisher:
Association for Computational Linguistics
Pages:
39–67
URL:
https://aclanthology.org/2025.vardial-1.5/
Cite (ACL):
Md Mahfuz Ibn Alam and Antonios Anastasopoulos. 2025. Large Language Models as a Normalizer for Transliteration and Dialectal Translation. In Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 39–67, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Large Language Models as a Normalizer for Transliteration and Dialectal Translation (Alam & Anastasopoulos, VarDial 2025)
PDF:
https://aclanthology.org/2025.vardial-1.5.pdf