Sebastin Santy


Use of Formal Ethical Reviews in NLP Literature: Historical Trends and Current Practices
Sebastin Santy | Anku Rani | Monojit Choudhury
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

BERTologiCoMix: How does Code-Mixing interact with Multilingual BERT?
Sebastin Santy | Anirudh Srinivasan | Monojit Choudhury
Proceedings of the Second Workshop on Domain Adaptation for NLP

Models such as mBERT and XLMR have shown success in solving Code-Mixed NLP tasks even though they were not exposed to such text during pretraining. Code-Mixed NLP models have relied on using synthetically generated data along with naturally occurring data to improve their performance. Finetuning mBERT on such data improves its code-mixed performance, but the benefits of using the different types of Code-Mixed data aren’t clear. In this paper, we study the impact of finetuning with different types of code-mixed data and outline the changes that occur to the model during such finetuning. Our findings suggest that using naturally occurring code-mixed data brings in the best performance improvement after finetuning and that finetuning with any type of code-mixed text improves the responsivity of its attention heads to code-mixed text inputs.


Learnings from Technological Interventions in a Low Resource Language: A Case-Study on Gondi
Devansh Mehta | Sebastin Santy | Ramaravind Kommiya Mothilal | Brij Mohan Lal Srivastava | Alok Sharma | Anurag Shukla | Vishnu Prasad | Venkanna U | Amit Sharma | Kalika Bali
Proceedings of the 12th Language Resources and Evaluation Conference

The primary obstacle to developing technologies for low-resource languages is the lack of usable data. In this paper, we report the adaptation and deployment of 4 technology-driven methods of data collection for Gondi, a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India. In the process of data collection, we also help in its revival by expanding access to information in Gondi through the creation of linguistic resources that can be used by the community, such as a dictionary, children’s stories, an app with Gondi content from multiple sources and an Interactive Voice Response (IVR) based mass awareness platform. At the end of these interventions, we collected a little less than 12,000 translated words and/or sentences and identified more than 650 community members whose help can be solicited for future translation efforts. The larger goal of the project is collecting enough data in Gondi to build and deploy viable language technologies like machine translation and speech-to-text systems that can help take the language onto the internet.

The State and Fate of Linguistic Diversity and Inclusion in the NLP World
Pratik Joshi | Sebastin Santy | Amar Budhiraja | Kalika Bali | Monojit Choudhury
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Language technologies contribute to promoting multilingualism and linguistic diversity around the world. However, only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications. In this paper we look at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time. Our quantitative investigation underlines the disparity between languages, especially in terms of their resources, and calls into question the “language agnostic” status of current models and systems. Through this paper, we attempt to convince the ACL community to prioritise the resolution of the predicaments highlighted here, so that no language is left behind.


Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities
Pratik Joshi | Christain Barnes | Sebastin Santy | Simran Khanuja | Sanket Shah | Anirudh Srinivasan | Satwik Bhattamishra | Sunayana Sitaram | Monojit Choudhury | Kalika Bali
Proceedings of the 16th International Conference on Natural Language Processing

In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities. In doing so, we bring to light the successes and failures of past work in this area, the challenges it has faced, and what it has achieved. Throughout this paper, we take a problem-facing approach and describe essential factors on which the success of such technologies hinges. We present the various aspects in a manner that clarifies and lays out the different tasks involved, which can aid organizations looking to make an impact in this area. We take the example of Gondi, an extremely low-resource Indian language, to reinforce and complement our discussion.

INMT: Interactive Neural Machine Translation Prediction
Sebastin Santy | Sandipan Dandapat | Monojit Choudhury | Kalika Bali
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

In this paper, we demonstrate an Interactive Machine Translation interface that assists human translators with on-the-fly hints and suggestions. This makes the end-to-end translation process faster and more efficient, and produces high-quality translations. We augment the OpenNMT backend with a mechanism to accept the user input and generate conditioned translations.

CoSSAT: Code-Switched Speech Annotation Tool
Sanket Shah | Pratik Joshi | Sebastin Santy | Sunayana Sitaram
Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP

Code-switching refers to the alternation of two or more languages in a conversation or utterance and is common in multilingual communities across the world. Building code-switched speech and natural language processing systems is challenging due to the lack of annotated speech and text data. We present a speech annotation interface, CoSSAT, which helps annotators transcribe code-switched speech faster, more easily and more accurately than a traditional interface, by displaying candidate words from monolingual speech recognizers. We conduct a user study on the transcription of Hindi-English code-switched speech with 10 annotators and describe quantitative and qualitative results.