C.V. Jawahar

Also published as: C V Jawahar, C.v. Jawahar


2021

More Parameters? No Thanks!
Zeeshan Khan | Kartheek Akella | Vinay Namboodiri | C V Jawahar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

A Multilingual Parallel Corpora Collection Effort for Indian Languages
Shashank Siripragada | Jerin Philip | Vinay P. Namboodiri | C V Jawahar
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present sentence-aligned parallel corpora across 10 Indian languages (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi) and English, many of which are categorized as low resource. The corpora are compiled from online sources which have content shared across languages. The corpora presented significantly extend present resources that are either not large enough or are restricted to a specific domain (such as health). We also provide a separate test corpus, compiled from an independent online source, that can be used independently for validating performance in the 10 Indian languages. Alongside, we report on the methods of constructing such corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval using deep neural network based methods.

IndicSpeech: Text-to-Speech Corpus for Indian Languages
Nimisha Srivastava | Rudrabha Mukhopadhyay | Prajwal K R | C V Jawahar
Proceedings of the Twelfth Language Resources and Evaluation Conference

India is a country where several tens of languages are spoken by a population of over a billion. Text-to-speech systems for such languages will thus be extremely beneficial for widespread content creation and accessibility. Despite this, the current TTS systems for even the most popular Indian languages fall short of the contemporary state-of-the-art systems for English, Chinese, etc. We believe that one of the major reasons for this is the lack of large, publicly available text-to-speech corpora in these languages that are suitable for training neural text-to-speech systems. To mitigate this, we release a 24-hour text-to-speech corpus for 3 major Indian languages, namely Hindi, Malayalam and Bengali. In this work, we also train a state-of-the-art TTS system for each of these languages and report their performance. The collected corpus, code, and trained models are made publicly available.

Exploring Pair-Wise NMT for Indian Languages
Kartheek Akella | Sai Himal Allu | Sridhar Suresh Ragupathi | Aman Singhal | Zeeshan Khan | C.v. Jawahar | Vinay P. Namboodiri
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

In this paper, we address the task of improving pair-wise machine translation for specific low-resource Indian languages. Multilingual NMT models have demonstrated a reasonable amount of effectiveness on resource-poor languages. In this work, we show that the performance of these models can be significantly improved by using a filtered back-translation process and subsequent fine-tuning on the limited pair-wise language corpora. The analysis in this paper suggests that this method can significantly improve multilingual models’ performance over their baselines, yielding state-of-the-art results for various Indian languages.

PhraseOut: A Code Mixed Data Augmentation Method for Multilingual Neural Machine Translation
Binu Jasim | Vinay Namboodiri | C V Jawahar
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Data Augmentation methods for Neural Machine Translation (NMT) such as back-translation (BT) and self-training (ST) are quite popular. In a multilingual NMT system, simply copying monolingual source sentences to the target (Copying) is an effective data augmentation method. Back-translation augments parallel data by translating monolingual sentences in the target side to source language. In this work we propose to use a partial back-translation method in a multilingual setting. Instead of translating the entire monolingual target sentence back into the source language, we replace selected high confidence phrases only and keep the rest of the words in the target language itself. (We call this method PhraseOut). Our experiments on low resource multilingual translation models show that PhraseOut gives reasonable improvements over the existing data augmentation methods.
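
The partial back-translation idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the phrase table, the confidence threshold, the longest-match rule, and the function name `partial_back_translate` are all assumptions introduced here for illustration, since the abstract does not say how phrases are selected or scored.

```python
from typing import Dict, List, Tuple

# Hypothetical phrase table: target-language phrase -> (source-language phrase, confidence).
PhraseTable = Dict[Tuple[str, ...], Tuple[str, float]]


def partial_back_translate(
    target_tokens: List[str],
    phrase_table: PhraseTable,
    threshold: float = 0.8,   # assumed cut-off for "high confidence"
    max_phrase_len: int = 3,  # assumed maximum phrase length in tokens
) -> List[str]:
    """Replace only high-confidence phrases; keep all other target words unchanged."""
    mixed: List[str] = []
    i = 0
    while i < len(target_tokens):
        replaced = False
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_phrase_len, len(target_tokens) - i), 0, -1):
            phrase = tuple(target_tokens[i:i + n])
            entry = phrase_table.get(phrase)
            if entry is not None and entry[1] >= threshold:
                mixed.extend(entry[0].split())
                i += n
                replaced = True
                break
        if not replaced:
            # No confident replacement: keep the target-language word as is.
            mixed.append(target_tokens[i])
            i += 1
    return mixed


# Toy example with romanized Hindi as the target side and English as the source side.
toy_table: PhraseTable = {
    ("ghar",): ("home", 0.9),
    ("ja", "raha"): ("going", 0.5),  # below the threshold, so it is left untouched
}
toy_target = ["main", "ghar", "ja", "raha", "hoon"]
code_mixed_source = partial_back_translate(toy_target, toy_table)
```

In this sketch the code-mixed output would be paired with the original target sentence as a synthetic training example; how such pairs are combined with real parallel data is left open here.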

2019

CVIT’s submissions to WAT-2019
Jerin Philip | Shashank Siripragada | Upendra Kumar | Vinay Namboodiri | C V Jawahar
Proceedings of the 6th Workshop on Asian Translation

This paper describes the Neural Machine Translation systems used by IIIT Hyderabad (CVIT-MT) for the translation tasks that are part of WAT-2019. We participated in tasks pertaining to Indian languages and submitted results for English-Hindi, Hindi-English, English-Tamil and Tamil-English language pairs. We employ the Transformer architecture, experimenting with multilingual models and methods for low-resource languages.

2018

CVIT-MT Systems for WAT-2018
Jerin Philip | Vinay P. Namboodiri | C.V. Jawahar
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation

2016

Align Me: A framework to generate Parallel Corpus Using OCRs and Bilingual Dictionaries
Priyam Bakliwal | Devadath V V | C V Jawahar
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

Multilingual language processing tasks like statistical machine translation and cross-language information retrieval rely mainly on the availability of accurate parallel corpora. Manual construction of such corpora can be extremely expensive and time consuming. In this paper we present a simple yet efficient method to generate a large, reasonably accurate parallel corpus with minimal user effort. We utilize the availability of a large number of English books and their corresponding translations in other languages to build the parallel corpus. Optical character recognition systems are used to digitize such books. We propose a robust dictionary-based parallel corpus generation system for alignment of multilingual text at different levels of granularity (sentences, paragraphs, etc.). We show the performance of our proposed method on a manually aligned dataset of 300 Hindi-English sentences and 100 English-Malayalam sentences.
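
To make the notion of dictionary-based alignment concrete, here is a rough sketch of one way sentence pairs could be scored with a bilingual dictionary. It is an illustration under stated assumptions and not the Align Me system itself: the overlap score, the greedy best-match rule, the `min_score` threshold, and the names `overlap_score` and `align` are all hypothetical, and the abstract does not describe how OCR noise is handled.

```python
from typing import Dict, List, Set, Tuple


def overlap_score(src: str, tgt: str, bilingual_dict: Dict[str, Set[str]]) -> float:
    """Fraction of source words with at least one dictionary translation in the target."""
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if bilingual_dict.get(w, set()) & tgt_words)
    return hits / len(src_words)


def align(
    src_sents: List[str],
    tgt_sents: List[str],
    bilingual_dict: Dict[str, Set[str]],
    min_score: float = 0.5,  # assumed acceptance threshold
) -> List[Tuple[str, str]]:
    """Greedily pair each source sentence with its best-scoring target sentence."""
    pairs: List[Tuple[str, str]] = []
    for s in src_sents:
        best = max(
            tgt_sents,
            key=lambda t: overlap_score(s, t, bilingual_dict),
            default=None,
        )
        if best is not None and overlap_score(s, best, bilingual_dict) >= min_score:
            pairs.append((s, best))
    return pairs


# Toy English to romanized-Hindi dictionary and sentences (illustrative only).
toy_dict = {"house": {"ghar"}, "water": {"pani"}}
toy_pairs = align(["the house", "cold water"], ["thanda pani", "ghar"], toy_dict)
```

The same scoring could in principle be applied to paragraphs instead of sentences by passing longer text units, which is one simple reading of "different levels of granularity".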