2024
Textless Speech-to-Speech Translation With Limited Parallel Data
Anuj Diwan | Anirudh Srinivasan | David Harwath | Eunsol Choi
Findings of the Association for Computational Linguistics: EMNLP 2024
Existing speech-to-speech translation (S2ST) models fall into two camps: they either leverage text as an intermediate step or require hundreds of hours of parallel speech data. Both approaches are incompatible with textless languages or language pairs with limited parallel data. We present PFB, a framework for training textless S2ST models that require just dozens of hours of parallel speech data. We first pretrain a model on large-scale monolingual speech data, finetune it with a small amount of parallel speech data (20-60 hours), and lastly train with an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech. Evaluated using the ASR-BLEU metric, our models achieve reasonable performance on all three domains, with some being within 1-2 points of our higher-resourced topline.
2023
Counterfactually Probing Language Identity in Multilingual Models
Anirudh Srinivasan | Venkata Subrahmanyan Govindarajan | Kyle Mahowald
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)
2022
TyDiP: A Dataset for Politeness Classification in Nine Typologically Diverse Languages
Anirudh Srinivasan | Eunsol Choi
Findings of the Association for Computational Linguistics: EMNLP 2022
We study politeness phenomena in nine typologically diverse languages. Politeness is an important facet of communication and is sometimes argued to be culture-specific, yet existing computational linguistic study is limited to English. We create TyDiP, a dataset containing three-way politeness annotations for 500 examples in each language, totaling 4.5K examples. We evaluate how well multilingual models can identify politeness levels – they show a fairly robust zero-shot transfer ability, yet fall significantly short of estimated human accuracy. We further study mapping the English politeness strategy lexicon into nine languages via automatic translation and lexicon induction, analyzing whether each strategy’s impact stays consistent across languages. Lastly, we empirically study the complicated relationship between formality and politeness through transfer experiments. We hope our dataset will support various research questions and applications, from evaluating multilingual models to constructing polite multilingual agents.
2021
BERTologiCoMix: How does Code-Mixing interact with Multilingual BERT?
Sebastin Santy | Anirudh Srinivasan | Monojit Choudhury
Proceedings of the Second Workshop on Domain Adaptation for NLP
Models such as mBERT and XLMR have shown success in solving Code-Mixed NLP tasks even though they were not exposed to such text during pretraining. Code-Mixed NLP models have relied on using synthetically generated data along with naturally occurring data to improve their performance. Finetuning mBERT on such data improves its code-mixed performance, but the benefits of using the different types of Code-Mixed data aren’t clear. In this paper, we study the impact of finetuning with different types of code-mixed data and outline the changes that occur to the model during such finetuning. Our findings suggest that using naturally occurring code-mixed data brings in the best performance improvement after finetuning and that finetuning with any type of code-mixed text improves the responsivity of its attention heads to code-mixed text inputs.
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Thamar Solorio | Shuguang Chen | Alan W. Black | Mona Diab | Sunayana Sitaram | Victor Soto | Emre Yilmaz | Anirudh Srinivasan
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
GCM: A Toolkit for Generating Synthetic Code-mixed Text
Mohd Sanad Zaki Rizvi | Anirudh Srinivasan | Tanuja Ganu | Monojit Choudhury | Sunayana Sitaram
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Code-mixing is common in multilingual communities around the world, and processing it is challenging due to the lack of labeled and unlabeled data. We describe a tool that can automatically generate code-mixed data given parallel data in two languages. We implement two linguistic theories of code-mixing, the Equivalence Constraint theory and the Matrix Language theory, to generate all possible code-mixed sentences in the language pair, and then sample from the generated data to obtain natural code-mixed sentences. The toolkit provides three modes: a batch mode, an interactive library mode and a web interface to address the needs of researchers, linguists and language experts. The toolkit can be used to generate unlabeled text data for pre-trained models, as well as to visualize linguistic theories of code-mixing. We plan to release the toolkit as open source and extend it by adding more implementations of linguistic theories, visualization techniques and better sampling techniques. We expect that the release of this toolkit will help facilitate more research in code-mixing in diverse language pairs.
2020
GLUECoS: An Evaluation Benchmark for Code-Switched NLP
Simran Khanuja | Sandipan Dandapat | Anirudh Srinivasan | Sunayana Sitaram | Monojit Choudhury
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present an evaluation benchmark, GLUECoS, for code-switched languages, that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our evaluation benchmark includes Language Identification from text, POS tagging, Named Entity Recognition, Sentiment Analysis, Question Answering and a new task for code-switching, Natural Language Inference. We present results on all these tasks using cross-lingual word embedding models and multilingual models. In addition, we fine-tune multilingual models on artificially generated code-switched data. Although multilingual models perform significantly better than cross-lingual models, our results show that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, showing that multilingual models can be further optimized for code-switching tasks.
Code-mixed parse trees and how to find them
Anirudh Srinivasan | Sandipan Dandapat | Monojit Choudhury
Proceedings of the 4th Workshop on Computational Approaches to Code Switching
In this paper, we explore methods of obtaining parse trees of code-mixed sentences and analyse the obtained trees. Existing work has shown that linguistic theories can be used to generate code-mixed sentences from a set of parallel sentences. We build upon this work, using one of these theories, the Equivalence-Constraint theory, to obtain the parse trees of synthetically generated code-mixed sentences and evaluate them with a neural constituency parser. We highlight the lack of a dataset of non-synthetic code-mixed constituency parse trees and how it makes our evaluation difficult. To complete our evaluation, we convert a code-mixed dependency parse tree set into “pseudo constituency trees” and find that a parser trained on synthetically generated trees is able to parse these reasonably well too.
MSR India at SemEval-2020 Task 9: Multilingual Models Can Do Code-Mixing Too
Anirudh Srinivasan
Proceedings of the Fourteenth Workshop on Semantic Evaluation
In this paper, we present our system for the SemEval 2020 task on code-mixed sentiment analysis. Our system makes use of large transformer-based multilingual embeddings like mBERT. Recent work has shown that these models possess the ability to solve code-mixed tasks in addition to their originally demonstrated cross-lingual abilities. We evaluate the stock versions of these models for the sentiment analysis task and also show that their performance can be improved by using unlabelled code-mixed data. Our submission (username Genius1237) achieved the second rank on the English-Hindi subtask with an F1 score of 0.726.
2019
Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities
Pratik Joshi | Christain Barnes | Sebastin Santy | Simran Khanuja | Sanket Shah | Anirudh Srinivasan | Satwik Bhattamishra | Sunayana Sitaram | Monojit Choudhury | Kalika Bali
Proceedings of the 16th International Conference on Natural Language Processing
In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities. In doing so, we bring to light the successes and failures of past work in this area, the challenges it faced, and what it has achieved. Throughout this paper, we take a problem-facing approach and describe the essential factors on which the success of such technologies hinges. We present the various aspects in a manner that clarifies and lays out the different tasks involved, which can aid organizations looking to make an impact in this area. We take the example of Gondi, an extremely low-resource Indian language, to reinforce and complement our discussion.
Processing and Understanding Mixed Language Data
Monojit Choudhury | Anirudh Srinivasan | Sandipan Dandapat
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts
Multilingual communities exhibit code-mixing, that is, mixing of two or more socially stable languages in a single conversation, sometimes even in a single utterance. This phenomenon has been widely studied by linguists and interaction scientists in the spoken language of such communities. However, with the prevalence of social media and other informal interactive platforms, code-switching is now also ubiquitously observed in user-generated text. As multilingual communities are more the norm from a global perspective, it becomes essential that code-switched text and speech are adequately handled by language technologies and NUIs. Code-mixing is extremely prevalent in all multilingual societies. Current studies have shown that as much as 20% of user-generated content from some geographies, like South Asia, parts of Europe, and Singapore, is code-mixed. Thus, it is very important to handle code-mixed content as a part of NLP systems and applications for these geographies. In the past 5 years, there has been an active interest in computational models for code-mixing with a substantive research outcome in terms of publications, datasets and systems. However, it is not easy to find a single point of access for a complete and coherent overview of the research. This tutorial aims to fill this gap and provide new researchers in the area with a foundation in both linguistic and computational aspects of code-mixing. We hope that this then becomes a starting point for those who wish to pursue research, design, development and deployment of code-mixed systems in multilingual societies.