BanglaTLit: A Benchmark Dataset for Back-Transliteration of Romanized Bangla

Md Fahim; Fariha Shifat; Fabiha Haider; Deeparghya Barua; Md Sourove; Md Ishmam; Md Bhuiyan

Abstract

Low-resource languages like Bangla are severely limited by the lack of datasets. Romanized Bangla texts are ubiquitous on the internet, offering a rich source of data for Bangla NLP tasks and extending the available data sources. However, because romanized text is informal, it often lacks the structure and consistency needed to provide insights. We address these challenges by proposing: (1) BanglaTLit, a large-scale Bangla transliteration dataset consisting of 42.7k samples, (2) BanglaTLit-PT, a pre-training corpus of romanized Bangla with 245.7k samples, (3) encoders further pre-trained on BanglaTLit-PT that achieve state-of-the-art performance on several romanized Bangla classification tasks, and (4) multiple back-transliteration baseline methods, including a novel encoder-decoder architecture using the further pre-trained encoders. Our results show the potential of automated Bangla back-transliteration in utilizing the untapped sources of romanized Bangla to enrich this language. The code and datasets are publicly available: https://github.com/farhanishmam/BanglaTLit.

Anthology ID: 2024.findings-emnlp.859
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 14656–14672
URL: https://aclanthology.org/2024.findings-emnlp.859
Cite (ACL): Md Fahim, Fariha Shifat, Fabiha Haider, Deeparghya Barua, Md Sourove, Md Ishmam, and Md Bhuiyan. 2024. BanglaTLit: A Benchmark Dataset for Back-Transliteration of Romanized Bangla. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14656–14672, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): BanglaTLit: A Benchmark Dataset for Back-Transliteration of Romanized Bangla (Fahim et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.859.pdf
Data: 2024.findings-emnlp.859.data.zip
