Mohammad Aflah Khan
2025
QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
Mohammad Aflah Khan | Neemesh Yadav | Sarah Masud | Md. Shad Akhtar
Proceedings of the 31st International Conference on Computational Linguistics
The rise of large language models (LLMs) has created a need for advanced benchmarking systems beyond traditional setups. To this end, we introduce QUENCH, a novel text-based English Quizzing Benchmark manually curated and transcribed from YouTube quiz videos. QUENCH contains masked entities and rationales that the LLMs must predict via generation. Sitting at the intersection of world knowledge, geographical context, and commonsense reasoning, QUENCH assesses the world knowledge and deduction capabilities of LLMs in a zero-shot, open-domain quizzing setup. We perform an extensive evaluation across 7 LLMs and 4 metrics, investigating the influence of model size, prompting style, geographical context, and gold-labeled rationale generation. The benchmarking concludes with an error analysis of the various types of generative errors to which the LLMs are prone.
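As a rough sketch only, the snippet below illustrates the kind of zero-shot, masked-entity quizzing loop the abstract describes: a quiz item with a masked entity and a gold rationale is turned into a prompt, the model's free-form generation is collected, and the prediction is scored. The QuizItem fields, the query_llm stub, and the containment-based scoring are illustrative assumptions, not the paper's actual evaluation harness or its four metrics.

```python
# Illustrative sketch of a zero-shot, masked-entity quizzing evaluation.
# The item structure, query_llm stub, and scoring are assumptions for
# illustration, not the QUENCH evaluation pipeline itself.

from dataclasses import dataclass


@dataclass
class QuizItem:
    prompt: str          # quiz text with the entity masked out, e.g. "... [MASK] ..."
    gold_entity: str     # the masked entity the model must recover
    gold_rationale: str  # human-written rationale connecting the clues


def query_llm(prompt: str) -> str:
    """Hypothetical model call; replace with any chat/completion API."""
    return "Taj Mahal. Rationale: the clues describe a Mughal-era marble mausoleum in Agra."


def evaluate(items: list[QuizItem]) -> float:
    """Score masked-entity predictions with a simple containment heuristic."""
    hits = 0
    for item in items:
        instruction = (
            "Fill in the masked entity and explain your reasoning.\n" + item.prompt
        )
        answer = query_llm(instruction)
        # Credit the model if the gold entity appears in its free-form generation.
        if item.gold_entity.lower() in answer.lower():
            hits += 1
    return hits / len(items)


if __name__ == "__main__":
    demo = [QuizItem(
        prompt="This white-marble mausoleum in Agra, commissioned by Shah Jahan, is [MASK].",
        gold_entity="Taj Mahal",
        gold_rationale="A white-marble mausoleum in Agra built for Mumtaz Mahal is the Taj Mahal.",
    )]
    print(f"Accuracy: {evaluate(demo):.2f}")
```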
2024
Probing Critical Learning Dynamics of PLMs for Hate Speech Detection
Sarah Masud | Mohammad Aflah Khan | Vikram Goyal | Md Shad Akhtar | Tanmoy Chakraborty
Findings of the Association for Computational Linguistics: EACL 2024
Despite their widespread adoption, there is a lack of research into how various critical aspects of pretrained language models (PLMs) affect their performance in hate speech detection. Through five research questions, our findings and recommendations lay the groundwork for empirically investigating different aspects of PLMs’ use in hate speech detection. We take a deep dive into comparing different pretrained models, evaluating their seed robustness, finetuning settings, and the impact of pretraining data collection time. Our analysis reveals early performance peaks for downstream tasks during pretraining, the limited benefit of employing a more recent pretraining corpus, and the significance of specific layers during finetuning. We further call into question the use of domain-specific models and highlight the need for dynamic datasets for benchmarking hate speech detection.
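As a rough illustration of the layer-significance analysis mentioned in the abstract, the sketch below freezes every parameter of a BERT-style encoder except one chosen layer and the classification head before finetuning, then sweeps over layers. The choice of bert-base-uncased, the parameter-name matching, and the sweep loop are assumptions for illustration, not the paper's exact experimental protocol.

```python
# Illustrative sketch: probing which encoder layers matter during finetuning
# for hate speech classification. Model choice, freezing scheme, and loop
# structure are assumptions, not the paper's protocol.

from transformers import AutoModelForSequenceClassification


def freeze_all_but_layer(model, layer_idx: int):
    """Freeze every parameter except those in encoder layer `layer_idx` and the
    classification head, so finetuning isolates that layer's contribution."""
    for name, param in model.named_parameters():
        trainable = f"encoder.layer.{layer_idx}." in name or "classifier" in name
        param.requires_grad = trainable
    return model


if __name__ == "__main__":
    for layer_idx in range(12):  # bert-base has 12 encoder layers
        model = AutoModelForSequenceClassification.from_pretrained(
            "bert-base-uncased", num_labels=2  # binary hate / non-hate labels
        )
        freeze_all_but_layer(model, layer_idx)
        # ... finetune on a hate speech dataset and record validation F1 here;
        # comparing scores across layer_idx indicates which layers carry the signal.
```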