C.V. Jawahar

Also published as: C.v. Jawahar, C V Jawahar

2025

Continuous Fingerspelling Dataset for Indian Sign Language
Kirandevraj R | Vinod K. Kurmi | Vinay P. Namboodiri | C.v. Jawahar
Proceedings of the Workshop on Sign Language Processing (WSLP)

Fingerspelling enables signers to represent proper nouns and technical terms letter-by-letter using manual alphabets, yet remains severely under-resourced for Indian Sign Language (ISL). We present the first continuous fingerspelling dataset for ISL, extracted from the ISH News YouTube channel, in which fingerspelling is accompanied by synchronized on-screen text cues. The dataset comprises 1,308 segments from 499 videos, totaling 70.85 minutes and 14,814 characters, with aligned video-text pairs capturing authentic coarticulation patterns. We validated the dataset quality through annotation using a proficient ISL interpreter, achieving a 90.67% exact match rate for 150 samples. We further established baseline recognition benchmarks using a ByT5-small encoder-decoder model, which attains 82.91% Character Error Rate after fine-tuning. This resource supports multiple downstream tasks, including fingerspelling transcription, temporal localization, and sign generation. The dataset is available at the following link: https://kirandevraj.github.io/ISL-Fingerspelling/.

2021

pdf bib

More Parameters? No Thanks!
Zeeshan Khan | Kartheek Akella | Vinay Namboodiri | C V Jawahar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib abs

A Multilingual Parallel Corpora Collection Effort for Indian Languages
Shashank Siripragada | Jerin Philip | Vinay P. Namboodiri | C V Jawahar
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present sentence aligned parallel corpora across 10 Indian Languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English - many of which are categorized as low resource. The corpora are compiled from online sources which have content shared across languages. The corpora presented significantly extends present resources that are either not large enough or are restricted to a specific domain (such as health). We also provide a separate test corpus compiled from an independent online source that can be independently used for validating the performance in 10 Indian languages. Alongside, we report on the methods of constructing such corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval using deep neural network based methods.

pdf bib abs

In this paper, we address the task of improving pair-wise machine translation for specific low resource Indian languages. Multilingual NMT models have demonstrated a reasonable amount of effectiveness on resource-poor languages. In this work, we show that the performance of these models can be significantly improved upon by using back-translation through a filtered back-translation process and subsequent fine-tuning on the limited pair-wise language corpora. The analysis in this paper suggests that this method can significantly improve multilingual models’ performance over its baseline, yielding state-of-the-art results for various Indian languages.

pdf bib abs

PhraseOut: A Code Mixed Data Augmentation Method for MultilingualNeural Machine Tranlsation
Binu Jasim | Vinay Namboodiri | C V Jawahar
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Data Augmentation methods for Neural Machine Translation (NMT) such as back- translation (BT) and self-training (ST) are quite popular. In a multilingual NMT system, simply copying monolingual source sentences to the target (Copying) is an effective data augmentation method. Back-translation aug- ments parallel data by translating monolingual sentences in the target side to source language. In this work we propose to use a partial back- translation method in a multilingual setting. Instead of translating the entire monolingual target sentence back into the source language, we replace selected high confidence phrases only and keep the rest of the words in the target language itself. (We call this method PhraseOut). Our experiments on low resource multilingual translation models show that PhraseOut gives reasonable improvements over the existing data augmentation methods.

pdf bib abs

IndicSpeech: Text-to-Speech Corpus for Indian Languages
Nimisha Srivastava | Rudrabha Mukhopadhyay | Prajwal K R | C V Jawahar
Proceedings of the Twelfth Language Resources and Evaluation Conference

India is a country where several tens of languages are spoken by over a billion strong population. Text-to-speech systems for such languages will thus be extremely beneficial for wide-spread content creation and accessibility. Despite this, the current TTS systems for even the most popular Indian languages fall short of the contemporary state-of-the-art systems for English, Chinese, etc. We believe that one of the major reasons for this is the lack of large, publicly available text-to-speech corpora in these languages that are suitable for training neural text-to-speech systems. To mitigate this, we release a 24 hour text-to-speech corpus for 3 major Indian languages namely Hindi, Malayalam and Bengali. In this work, we also train a state-of-the-art TTS system for each of these languages and report their performances. The collected corpus, code, and trained models are made publicly available.

2019

pdf bib abs

CVIT’s submissions to WAT-2019
Jerin Philip | Shashank Siripragada | Upendra Kumar | Vinay Namboodiri | C V Jawahar
Proceedings of the 6th Workshop on Asian Translation

This paper describes the Neural Machine Translation systems used by IIIT Hyderabad (CVIT-MT) for the translation tasks part of WAT-2019. We participated in tasks pertaining to Indian languages and submitted results for English-Hindi, Hindi-English, English-Tamil and Tamil-English language pairs. We employ Transformer architecture experimenting with multilingual models and methods for low-resource languages.

2018

pdf bib

CVIT-MT Systems for WAT-2018
Jerin Philip | Vinay P. Namboodiri | C.V. Jawahar
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation: 5th Workshop on Asian Translation

2016

pdf bib abs

Align Me: A framework to generate Parallel Corpus Using OCRs and Bilingual Dictionaries
Priyam Bakliwal | Devadath V V | C V Jawahar
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

Multilingual language processing tasks like statistical machine translation and cross language information retrieval rely mainly on availability of accurate parallel corpora. Manual construction of such corpus can be extremely expensive and time consuming. In this paper we present a simple yet efficient method to generate huge amount of reasonably accurate parallel corpus with minimal user efforts. We utilize the availability of large number of English books and their corresponding translations in other languages to build parallel corpus. Optical Character Recognizing systems are used to digitize such books. We propose a robust dictionary based parallel corpus generation system for alignment of multilingual text at different levels of granularity (sentence, paragraphs, etc). We show the performance of our proposed method on a manually aligned dataset of 300 Hindi-English sentences and 100 English-Malayalam sentences.