2024
PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts | Varun Gumma | Aditya Yadavalli | Vivek Seshadri | Manohar Swaminathan | Sunayana Sitaram
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors: the lack of benchmarks with sufficient linguistic diversity, the contamination of LLM pre-training data with popular benchmarks, and the lack of local, cultural nuance in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations, and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings, pairwise comparison and direct assessment, and analyse the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting, but that agreement drops for direct assessment, especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.
MunTTS: A Text-to-Speech System for Mundari
Varun Gumma | Rishav Hada | Aditya Yadavalli | Pamir Gogoi | Ishani Mondal | Vivek Seshadri | Kalika Bali
Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages
We present MunTTS, an end-to-end text-to-speech (TTS) system specifically for Mundari, a low-resource Indian language of the Austro-Asiatic family. Our work addresses the gap in linguistic technology for underrepresented languages by collecting and processing data to build a speech synthesis system. We begin by gathering a substantial dataset of Mundari text and speech and training end-to-end speech models. We also detail the methods used for training our models, ensuring they are efficient and effective despite the data constraints. We evaluate our system with native speakers and objective metrics, demonstrating its potential as a tool for preserving and promoting the Mundari language in the digital age.
INMT-Lite: Accelerating Low-Resource Language Data Collection via Offline Interactive Neural Machine Translation
Harshita Diddee | Anurag Shukla | Tanuja Ganu | Vivek Seshadri | Sandipan Dandapat | Monojit Choudhury | Kalika Bali
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
A steady increase in the performance of Massively Multilingual Models (MMLMs) has contributed to their rapidly increasing use in data collection pipelines. Interactive Neural Machine Translation (INMT) systems are one class of tools that can utilize MMLMs to promote such data collection in several under-resourced languages. However, these tools are often not adapted to the deployment constraints under which native language speakers operate, as they are driven by bloated, online inference-oriented MMLMs trained for data-rich languages. INMT-Lite addresses these challenges through its support of (1) three different modes of Internet-independent deployment and (2) a suite of four assistive interfaces suitable for (3) data-sparse languages. We perform an extensive user study of INMT-Lite with an under-resourced language community, Gondi, and find that INMT-Lite improves the data generation experience of community members along multiple axes, such as cognitive load, task productivity, and interface interaction time and effort, without compromising the quality of the generated translations. INMT-Lite's code is open-sourced to further research in this domain.
2023
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents
Mehrad Moradshahi | Tianhao Shen | Kalika Bali | Monojit Choudhury | Gael de Chalendar | Anmol Goel | Sungkyun Kim | Prashant Kodali | Ponnurangam Kumaraguru | Nasredine Semmar | Sina Semnani | Jiwon Seo | Vivek Seshadri | Manish Shrivastava | Michael Sun | Aditya Yadavalli | Chaobin You | Deyi Xiong | Monica Lam
Findings of the Association for Computational Linguistics: ACL 2023
Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high cost of creating a dataset for a new language. To reduce this cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to four languages: English, French, Hindi, and Korean, as well as a code-mixed English-Hindi variety. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language and, unlike most prior multilingual work, is an end-to-end dataset for building fully functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset that accelerates the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural and dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
2020
Crowdsourcing Speech Data for Low-Resource Languages from Low-Income Workers
Basil Abraham | Danish Goel | Divya Siddarth | Kalika Bali | Manu Chopra | Monojit Choudhury | Pratik Joshi | Preethi Jyoti | Sunayana Sitaram | Vivek Seshadri
Proceedings of the Twelfth Language Resources and Evaluation Conference
Voice-based technologies are essential to cater to the hundreds of millions of new smartphone users. However, most of the languages spoken by these new users have little to no labelled speech data. Unfortunately, collecting labelled speech data in any language is an expensive and resource-intensive task. Moreover, existing platforms typically collect speech data only from urban speakers familiar with digital technology, whose dialects are often very different from those of low-income users. In this paper, we explore the possibility of collecting labelled speech data directly from low-income workers. In addition to adding diversity to the speech dataset, we believe this approach can also provide valuable supplemental earning opportunities to these communities. To this end, we conducted a study in which we collected labelled speech data in the Marathi language from three user groups: low-income rural users, low-income urban users, and university students. Overall, we collected 109 hours of data from 36 participants. Our results show that the data collected from low-income participants is of comparable quality to the data collected from university students (who are typically employed to do this work), and that crowdsourcing speech data from low-income rural and urban workers is a viable method of gathering speech data.