Manohar Swaminathan
2024
PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts | Varun Gumma | Aditya Yadavalli | Vivek Seshadri | Manohar Swaminathan | Sunayana Sitaram
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors: the lack of benchmarks with sufficient linguistic diversity, contamination of LLM pre-training data with popular benchmarks, and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations, and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings, pairwise comparison and direct assessment, and analyse the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting, but the agreement drops for direct assessment evaluation, especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.
2022
“#DisabledOnIndianTwitter” : A Dataset towards Understanding the Expression of People with Disabilities on Indian Twitter
Ishani Mondal | Sukhnidh Kaur | Kalika Bali | Aditya Vashistha | Manohar Swaminathan
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022
Twitter serves as a powerful tool for self-expression among disabled people. To understand how disabled people in India use Twitter, we introduce a manually annotated corpus, #DisabledOnIndianTwitter, comprising 2,384 tweets posted by 27 female and 15 male users. These users practice diverse professions and engage in varied online discourses on disability in India. To examine patterns in their Twitter use, we propose a novel hierarchical annotation taxonomy to classify the tweets into various themes, including discrimination, advocacy, and self-identification. Using these annotations, we benchmark the corpus with state-of-the-art classifiers. Finally, through a mixed-methods analysis of our annotated corpus, we reveal stark differences in self-expression between male and female disabled users on Indian Twitter.
Co-authors
- Kalika Bali 1
- Varun Gumma 1
- Sukhnidh Kaur 1
- Ishani Mondal 1
- Vivek Seshadri 1