IruMozhi: Automatically classifying diglossia in Tamil

Kabilan Prasanna, Aryaman Arora


Abstract
Tamil, a Dravidian language of South Asia, is a highly diglossic language with two very different registers in everyday use: Literary Tamil (preferred in writing and formal communication) and Spoken Tamil (confined to speech and informal media). Spoken Tamil is under-studied in modern NLP systems compared to Literary Tamil written in the Tamil script, as evidenced by a lack of datasets explicitly targeting the Spoken variety. In this paper, we release IruMozhi, a human-translated dataset of parallel text in Literary and Spoken Tamil. Using IruMozhi, we train classifiers on the task of identifying which Tamil variety a text belongs to. We use these models to gauge the availability of pretraining data in Spoken Tamil, to audit the composition of existing labelled datasets for Tamil, and to encourage future work on the variety.
Anthology ID:
2024.findings-naacl.195
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3096–3103
URL:
https://aclanthology.org/2024.findings-naacl.195
Cite (ACL):
Kabilan Prasanna and Aryaman Arora. 2024. IruMozhi: Automatically classifying diglossia in Tamil. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3096–3103, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
IruMozhi: Automatically classifying diglossia in Tamil (Prasanna & Arora, Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.195.pdf
Copyright:
 2024.findings-naacl.195.copyright.pdf