CoPara: The First Dravidian Paragraph-level n-way Aligned Corpus

Nikhil E; Mukund Choudhary; Radhika Mamidi

Abstract

We present CoPara, the first publicly available paragraph-level, n-way aligned multilingual parallel corpus for Dravidian languages. The collection contains 2856 paragraph/passage pairs between English and four Dravidian languages. We source the parallel paragraphs from the New India Samachar magazine and align them using English as a pivot language. We carry out human and automatic evaluations to validate the alignment quality and the richness of the parallel paragraphs across a range of lengths. To show one of the many ways this dataset can be used, we finetune IndicBART, a seq2seq NMT model, on all XX-En language pairs in CoPara; the resulting models perform better than existing sentence-level models on standard metrics (like BLEU), both on sentence-level translation and on longer text. We show how this dataset can enrich a model trained for such a task with more contextual cues and beyond-sentence understanding, even in low-resource settings like that of Dravidian languages. Finally, the dataset and models are made publicly available at CoPara to help advance research in Dravidian NLP, parallel multilingual corpora, and beyond-sentence-level tasks such as NMT.
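
As a rough illustration of what aligning with English as a pivot can mean, the sketch below joins per-language (English, target) paragraph pairs on their shared English paragraph to produce n-way aligned rows. This is not the authors' pipeline: the function name `nway_align`, the input layout, the exact-match join on the English text, and the language codes are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def nway_align(pairs_by_lang: Dict[str, List[Tuple[str, str]]]) -> List[Dict[str, str]]:
    """Join (English, target) paragraph pairs from several languages on the English side.

    Hypothetical helper: assumes each language already has paragraph pairs with
    English, and that identical English text marks the same paragraph.
    """
    langs = list(pairs_by_lang)
    if not langs:
        return []

    # For each language, map the English paragraph to its translation.
    lookup = {lang: dict(pairs) for lang, pairs in pairs_by_lang.items()}

    # Only English paragraphs that have a translation in every language survive.
    common = set(lookup[langs[0]])
    for lang in langs[1:]:
        common &= set(lookup[lang])

    # Emit rows in the order of the first language's pairs, once per paragraph.
    rows: List[Dict[str, str]] = []
    for english, _ in pairs_by_lang[langs[0]]:
        if english in common:
            row = {"en": english}
            for lang in langs:
                row[lang] = lookup[lang][english]
            rows.append(row)
            common.discard(english)
    return rows


if __name__ == "__main__":
    # Toy placeholder strings, not real corpus data.
    toy = {
        "te": [("First paragraph.", "te-1"), ("Second paragraph.", "te-2")],
        "ta": [("Second paragraph.", "ta-2"), ("First paragraph.", "ta-1")],
    }
    for row in nway_align(toy):
        print(row)
```

A real pipeline would likely need fuzzy or embedding-based matching rather than exact string equality, since the English paragraphs paired with each language may differ slightly.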

Anthology ID: 2023.dravidianlangtech-1.12
Volume: Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages
Month: September
Year: 2023
Address: Varna, Bulgaria
Editors: Bharathi R. Chakravarthi, Ruba Priyadharshini, Anand Kumar M, Sajeetha Thavareesan, Elizabeth Sherly
Venues: DravidianLangTech | WS
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Pages: 88–96
URL: https://aclanthology.org/2023.dravidianlangtech-1.12/
Cite (ACL): Nikhil E, Mukund Choudhary, and Radhika Mamidi. 2023. CoPara: The First Dravidian Paragraph-level n-way Aligned Corpus. In Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages, pages 88–96, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal): CoPara: The First Dravidian Paragraph-level n-way Aligned Corpus (E et al., DravidianLangTech 2023)
PDF: https://aclanthology.org/2023.dravidianlangtech-1.12.pdf
