IndicBART: A Pre-trained Model for Indic Natural Language Generation

Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh Khapra, Pratyush Kumar


Abstract
In this paper, we study pre-trained sequence-to-sequence models for a group of related languages, with a focus on Indic languages. We present IndicBART, a multilingual, sequence-to-sequence pre-trained model covering 11 Indic languages and English. IndicBART exploits the orthographic similarity between Indic scripts to improve transfer learning between related Indic languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments show that a model specific to related languages, such as IndicBART, is competitive with large pre-trained models like mBART50 despite being significantly smaller. It also performs well in very low-resource translation scenarios where the languages involved are not included in pre-training or fine-tuning. Script sharing, multilingual training, and better utilization of limited model capacity all contribute to the strong performance of the compact IndicBART model.
Anthology ID:
2022.findings-acl.145
Volume:
Findings of the Association for Computational Linguistics: ACL 2022
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1849–1863
URL:
https://aclanthology.org/2022.findings-acl.145
DOI:
10.18653/v1/2022.findings-acl.145
Cite (ACL):
Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh Khapra, and Pratyush Kumar. 2022. IndicBART: A Pre-trained Model for Indic Natural Language Generation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1849–1863, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
IndicBART: A Pre-trained Model for Indic Natural Language Generation (Dabre et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-acl.145.pdf
Video:
https://aclanthology.org/2022.findings-acl.145.mp4
Code:
AI4Bharat/indic-bart
Data:
FLoRes
IndicCorp
Samanantar