Gokul N.C.
Also published as: Gokul N.c.
2022
Unified NMT models for the Indian subcontinent, transcending script-barriers
Gokul N.c.
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing
Highly accurate machine translation systems are very important in societies and countries where multilinguality is very common, and where English often does not suffice. The Indian subcontinent (or South Asia) is such a region, with all the Indic languages currently being under-represented in the NLP ecosystem. It is essential to thoroughly explore various techniques to improve the performance of such low-resource languages, at least using the data available in open source, which itself has not been much explored in the Indic ecosystem. In our work, we perform a study with a focus on improving the performance of very-low-resource South Asian languages, especially those of countries other than India. Specifically, we propose how unified models can be built that exploit the data from comparatively resource-rich languages of the same region. We propose strategies to unify different types of unexplored scripts, especially Perso–Arabic scripts and Indic scripts, to build multilingual models for all the South Asian languages despite the script barrier. We also study how augmentation techniques like back-translation can be used to build unified models using only openly available raw data, to understand what levels of improvement can be expected for these Indic languages.
2020
IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages
Divyanshu Kakwani | Anoop Kunchukuttan | Satish Golla | Gokul N.C. | Avik Bhattacharyya | Mitesh M. Khapra | Pratyush Kumar
Findings of the Association for Computational Linguistics: EMNLP 2020
In this paper, we introduce NLP resources for 11 major Indian languages from two major language families. These resources include: (a) large-scale sentence-level monolingual corpora, (b) pre-trained word embeddings, (c) pre-trained language models, and (d) multiple NLU evaluation datasets (IndicGLUE benchmark). The monolingual corpora contain a total of 8.8 billion tokens across all 11 languages and Indian English, primarily sourced from news crawls. The word embeddings are based on FastText, hence suitable for handling the morphological complexity of Indian languages. The pre-trained language models are based on the compact ALBERT model. Lastly, we compile the IndicGLUE benchmark for Indian language NLU. To this end, we create datasets for the following tasks: Article Genre Classification, Headline Prediction, Wikipedia Section-Title Prediction, Cloze-style Multiple-choice QA, Winograd NLI and COPA. We also include publicly available datasets for some Indic languages for tasks like Named Entity Recognition, Cross-lingual Sentence Retrieval, Paraphrase Detection, etc. Our embeddings are competitive with or better than existing pre-trained embeddings on multiple tasks. We hope that the availability of the dataset will accelerate Indic NLP research, which has the potential to impact more than a billion people. It can also help the community in evaluating advances in NLP over a more diverse pool of languages. The data and models are available at https://indicnlp.ai4bharat.org.