Vinay P. Namboodiri
Also published as: Vinay P. Namboodiri
2025
Continuous Fingerspelling Dataset for Indian Sign Language
Kirandevraj R | Vinod K. Kurmi | Vinay P. Namboodiri | C.v. Jawahar
Proceedings of the Workshop on Sign Language Processing (WSLP)
Fingerspelling enables signers to represent proper nouns and technical terms letter-by-letter using manual alphabets, yet remains severely under-resourced for Indian Sign Language (ISL). We present the first continuous fingerspelling dataset for ISL, extracted from the ISH News YouTube channel, in which fingerspelling is accompanied by synchronized on-screen text cues. The dataset comprises 1,308 segments from 499 videos, totaling 70.85 minutes and 14,814 characters, with aligned video-text pairs capturing authentic coarticulation patterns. We validate the dataset quality by having a proficient ISL interpreter annotate 150 samples, achieving a 90.67% exact match rate. We further establish baseline recognition benchmarks using a ByT5-small encoder-decoder model, which attains 82.91% Character Error Rate after fine-tuning. This resource supports multiple downstream tasks, including fingerspelling transcription, temporal localization, and sign generation. The dataset is available at https://kirandevraj.github.io/ISL-Fingerspelling/.
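The two metrics named in the abstract above, exact match rate and Character Error Rate (CER), can be illustrated with a minimal sketch. It assumes the standard definition of CER as total character-level edit distance divided by total reference length; the function names and sample strings are hypothetical and are not taken from the paper or its released code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses) -> float:
    """Corpus-level CER: summed edit distance over summed reference length."""
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


def exact_match_rate(references, hypotheses) -> float:
    """Fraction of samples whose hypothesis equals the reference exactly."""
    if not references:
        raise ValueError("no samples given")
    matches = sum(r == h for r, h in zip(references, hypotheses))
    return matches / len(references)


if __name__ == "__main__":
    refs = ["DELHI", "MUMBAI"]  # hypothetical fingerspelled words
    hyps = ["DELI", "MUMBAI"]   # hypothetical model outputs
    print(character_error_rate(refs, hyps))
    print(exact_match_rate(refs, hyps))
```

Because CER is normalized by reference length, it can exceed 100% when a hypothesis contains many insertions.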
2020
A Multilingual Parallel Corpora Collection Effort for Indian Languages
Shashank Siripragada | Jerin Philip | Vinay P. Namboodiri | C V Jawahar
Proceedings of the Twelfth Language Resources and Evaluation Conference
We present sentence-aligned parallel corpora across 10 Indian languages (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, and Punjabi) and English, many of which are categorized as low-resource. The corpora are compiled from online sources whose content is shared across languages. The corpora presented significantly extend existing resources, which are either not large enough or are restricted to a specific domain (such as health). We also provide a separate test corpus, compiled from an independent online source, that can be used to validate performance in the 10 Indian languages. Alongside, we report on the methods of constructing such corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval based on deep neural networks.
Exploring Pair-Wise NMT for Indian Languages
Kartheek Akella | Sai Himal Allu | Sridhar Suresh Ragupathi | Aman Singhal | Zeeshan Khan | C.v. Jawahar | Vinay P. Namboodiri
Proceedings of the 17th International Conference on Natural Language Processing (ICON)
In this paper, we address the task of improving pair-wise machine translation for specific low-resource Indian languages. Multilingual NMT models have demonstrated a reasonable amount of effectiveness on resource-poor languages. In this work, we show that the performance of these models can be significantly improved by a filtered back-translation process followed by fine-tuning on the limited pair-wise language corpora. The analysis in this paper suggests that this method can significantly improve multilingual models’ performance over their baselines, yielding state-of-the-art results for various Indian languages.
2018
Learning Semantic Sentence Embeddings using Pair-wise Discriminator
Badri N. Patro | Vinod K. Kurmi | Sandeep Kumar | Vinay P. Namboodiri
Proceedings of the 27th International Conference on Computational Linguistics
While the problem of obtaining word-level embeddings is very well studied, we propose a novel method for obtaining sentence-level embeddings. The embeddings are obtained by a simple method in the context of solving the paraphrase generation task. If we use a sequential encoder-decoder model for generating paraphrases, we would like the generated paraphrase to be semantically close to the original sentence. One way to ensure this is to add constraints that keep true paraphrase embeddings close and unrelated paraphrase candidate sentence embeddings far apart. This is ensured by using a sequential pair-wise discriminator that shares weights with the encoder and is trained with a suitable loss function. Our loss function penalizes paraphrase sentence embedding distances that are too large. This loss is used in combination with a sequential encoder-decoder network. We also validate our method by evaluating the obtained embeddings on a sentiment analysis task. The proposed method yields semantic embeddings and outperforms the state of the art on the paraphrase generation and sentiment analysis tasks on standard datasets. These results are also shown to be statistically significant.
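The constraint described in the abstract above (true paraphrase embeddings close, unrelated candidates far) can be illustrated with a generic max-margin pairwise loss. This is a sketch under stated assumptions and not the paper's exact loss: the Euclidean distance, the margin value, the use of other sentences in the batch as unrelated candidates, and all function names are illustrative choices.

```python
import math


def l2_distance(u, v) -> float:
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_margin_loss(sentence_embs, paraphrase_embs, margin: float = 1.0) -> float:
    """Hinge loss over a batch (assumed formulation, not the paper's).

    sentence_embs[i] and paraphrase_embs[i] form a true paraphrase pair;
    paraphrase_embs[j] with j != i is treated as an unrelated candidate.
    The loss is zero once each true pair is closer than every unrelated
    pair by at least `margin`.
    """
    n = len(sentence_embs)
    total, count = 0.0, 0
    for i in range(n):
        pos = l2_distance(sentence_embs[i], paraphrase_embs[i])
        for j in range(n):
            if j == i:
                continue
            neg = l2_distance(sentence_embs[i], paraphrase_embs[j])
            total += max(0.0, margin + pos - neg)
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    sentences = [[0.0, 0.0], [10.0, 0.0]]    # hypothetical 2-d embeddings
    paraphrases = [[0.0, 0.0], [10.0, 0.0]]  # each matches its own sentence
    print(pairwise_margin_loss(sentences, paraphrases))
```

In the setup the abstract describes, a term of this kind would be combined with the encoder-decoder generation loss, with the discriminator sharing weights with the encoder.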