Unified NMT models for the Indian subcontinent, transcending script-barriers

Gokul N.c.


Abstract
Highly accurate machine translation systems are very important in societies and countries where multilinguality is common and where English often does not suffice. The Indian subcontinent (or South Asia) is such a region, with all the Indic languages currently under-represented in the NLP ecosystem. It is essential to thoroughly explore techniques for improving the performance of such low-resource languages, at least using openly available data, which itself has not been explored much in the Indic ecosystem. In our work, we perform a study focused on improving the performance of very-low-resource South Asian languages, especially those of countries other than India. Specifically, we propose how unified models can be built that exploit data from comparatively resource-rich languages of the same region. We propose strategies to unify different types of unexplored scripts, especially Perso–Arabic scripts and Indic scripts, to build multilingual models for all the South Asian languages despite the script barrier. We also study how augmentation techniques like back-translation can be used to build unified models using only openly available raw data, to understand what levels of improvement can be expected for these Indic languages.
Anthology ID:
2022.deeplo-1.23
Volume:
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing
Month:
July
Year:
2022
Address:
Hybrid
Editors:
Colin Cherry, Angela Fan, George Foster, Gholamreza (Reza) Haffari, Shahram Khadivi, Nanyun (Violet) Peng, Xiang Ren, Ehsan Shareghi, Swabha Swayamdipta
Venue:
DeepLo
Publisher:
Association for Computational Linguistics
Pages:
227–236
URL:
https://aclanthology.org/2022.deeplo-1.23
DOI:
10.18653/v1/2022.deeplo-1.23
Cite (ACL):
Gokul N.c. 2022. Unified NMT models for the Indian subcontinent, transcending script-barriers. In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 227–236, Hybrid. Association for Computational Linguistics.
Cite (Informal):
Unified NMT models for the Indian subcontinent, transcending script-barriers (N.c., DeepLo 2022)
PDF:
https://aclanthology.org/2022.deeplo-1.23.pdf
Video:
https://aclanthology.org/2022.deeplo-1.23.mp4
Data
FLoRes-101, IndicCorp, Samanantar