2024
pdf
bib
abs
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
Sanchit Ahuja
|
Divyanshu Aggarwal
|
Varun Gumma
|
Ishaan Watts
|
Ashutosh Sathe
|
Millicent Ochieng
|
Rishav Hada
|
Prachi Jain
|
Mohamed Ahmed
|
Kalika Bali
|
Sunayana Sitaram
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
There has been a surge in LLM evaluation research to understand LLM capabilities and limitations. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. This study aims to perform a thorough evaluation of the non-English capabilities of SoTA LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Gemini-Pro, Mistral, Llama2, and Gemma) by comparing them on the same set of multilingual datasets. Our benchmark comprises 22 datasets covering 83 languages, including low-resource African languages. We also include two multimodal datasets in the benchmark and compare the performance of LLaVA models, GPT-4-Vision and Gemini-Pro-Vision. Our experiments show that larger models such as GPT-4, Gemini-Pro and PaLM2 outperform smaller models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 and Gemini-Pro on more datasets. We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks, necessitating approaches to detect and handle contamination while assessing the multilingual performance of LLMs.
pdf
bib
abs
MAPLE: Multilingual Evaluation of Parameter Efficient Finetuning of Large Language Models
Divyanshu Aggarwal
|
Ashutosh Sathe
|
Ishaan Watts
|
Sunayana Sitaram
Findings of the Association for Computational Linguistics ACL 2024
Parameter efficient finetuning has emerged as a viable solution for improving the performance of Large Language Models without requiring massive resources and compute. Prior work on multilingual evaluation has shown that there is a large gap between the performance of LLMs on English and other languages. Further, there is also a large gap between the performance of smaller open-source models and larger LLMs. Finetuning can be an effective way to bridge this gap and make language models more equitable. In this work, we finetune the Llama-2 and Mistral models on two synthetic multilingual instruction tuning datasets to determine its effect on model performance on six downstream tasks covering forty one languages in all. Additionally, we experiment with various parameters, such as rank for low-rank adaptation and values of quantisation to determine their effects on downstream performance and find that higher rank and higher quantisation values benefit low-resource languages. We find that parameter efficient finetuning of smaller open-source models sometimes bridges the gap between the performance of these models and the larger ones, however, English performance can take a hit. We also find that finetuning sometimes improves performance on low-resource languages, while degrading performance on high-resource languages.
2023
pdf
bib
abs
Evaluating Inter-Bilingual Semantic Parsing for Indian Languages
Divyanshu Aggarwal
|
Vivek Gupta
|
Anoop Kunchukuttan
Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023)
Despite significant progress in Natural Language Generation for Indian languages (IndicNLP), there is a lack of datasets around complex structured tasks such as semantic parsing. One reason for this imminent gap is the complexity of the logical form, which makes English to multilingual translation difficult. The process involves alignment of logical forms, intents and slots with translated unstructured utterance. To address this, we propose an Inter-bilingual Seq2seq Semantic parsing dataset IE-SemParse Suite for 11 distinct Indian languages. We highlight the proposed task’s practicality, and evaluate existing multilingual seq2seq models across several train-test strategies. Our experiment reveals a high correlation across performance of original multilingual semantic parsing datasets (such as mTOP, multilingual TOP and multiATIS++) and our proposed IE-SemParse suite.
2022
pdf
bib
abs
IndicXNLI: Evaluating Multilingual Inference for Indian Languages
Divyanshu Aggarwal
|
Vivek Gupta
|
Anoop Kunchukuttan
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
While Indic NLP has made rapid advances recently in terms of the availability of corpora and pre-trained models, benchmark datasets on standard NLU tasks are limited. To this end, we introduce INDICXNLI, an NLI dataset for 11 Indic languages. It has been created by high-quality machine translation of the original English XNLI dataset and our analysis attests to the quality of INDICXNLI. By finetuning different pre-trained LMs on this INDICXNLI, we analyze various cross-lingual transfer techniques with respect to the impact of the choice of language models, languages, multi-linguality, mix-language input, etc. These experiments provide us with useful insights into the behaviour of pre-trained models for a diverse set of languages.
pdf
bib
abs
XInfoTabS: Evaluating Multilingual Tabular Natural Language Inference
Bhavnick Minhas
|
Anant Shankhdhar
|
Vivek Gupta
|
Divyanshu Aggarwal
|
Shuo Zhang
Proceedings of the Fifth Fact Extraction and VERification Workshop (FEVER)
The ability to reason about tabular or semi-structured knowledge is a fundamental problem for today’s Natural Language Processing (NLP) systems. While significant progress has been achieved in the direction of tabular reasoning, these advances are limited to English due to the absence of multilingual benchmark datasets for semi-structured data. In this paper, we use machine translation methods to construct a multilingual tabular NLI dataset, namely XINFOTABS, which expands the English tabular NLI dataset of INFOTABS to ten diverse languages. We also present several baselines for multilingual tabular reasoning, e.g., machine translation-based methods and cross-lingual. We discover that the XINFOTABS evaluation suite is both practical and challenging. As a result, this dataset will contribute to increased linguistic inclusion in tabular reasoning research and applications.
2021
pdf
bib
abs
Fine-tuning Distributional Semantic Models for Closely-Related Languages
Kushagra Bhatia
|
Divyanshu Aggarwal
|
Ashwini Vaidya
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
In this paper we compare the performance of three models: SGNS (skip-gram negative sampling) and augmented versions of SVD (singular value decomposition) and PPMI (Positive Pointwise Mutual Information) on a word similarity task. We particularly focus on the role of hyperparameter tuning for Hindi based on recommendations made in previous work (on English). Our results show that there are language specific preferences for these hyperparameters. We extend the best settings for Hindi to a set of related languages: Punjabi, Gujarati and Marathi with favourable results. We also find that a suitably tuned SVD model outperforms SGNS for most of our languages and is also more robust in a low-resource setting.