Khyati Mahajan

2025

Given the inherent subjectivity of similarity in text, fully unsupervised text clustering is unlikely to produce groupings that work across a variety of use cases. Traditional techniques to guide clustering rely on costly, time-consuming human feedback and/or pre-existing labels. Leveraging recent advancements in LLMs and decoder-only embedding models, we present techniques to effectively control text embeddings with minimal human input: prefix instructions and LLM preprocessing. We evaluate clustering performance for datasets with multiple independent ground-truth labels, or perspectives, and find that these techniques can be used to improve clustering for one perspective or use case, at the cost of performance for another.

M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Rishabh Maheshwary | Vikas Yadav | Hoang H Nguyen | Khyati Mahajan | Sathwik Tejaswi Madhusudhan
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Collecting instruction fine-tuning (IFT) data is a resource- and time-intensive task, especially in multilingual settings where finding proficient native speakers is challenging. Moreover, traditional data collection is prone to privacy risks and toxicity, and it lacks scalability. While fully synthetic datasets are a promising alternative, research on their use in the multilingual domain is limited, as existing approaches still rely on machine translation to improve multilingual performance. To bridge this gap, we introduce M2Lingual, the first fully synthetic, multi-turn multilingual dataset, with 175K conversations across 70 languages and a balanced mix of high-, mid-, and low-resource languages. M2Lingual is constructed using a cost-efficient and scalable method that uses our novel two-step Evol prompt taxonomy to transform a small set of human-written instructions into complex and challenging conversations. Results across three model families, six baseline datasets, and an evaluation spanning 31 languages demonstrate the effectiveness of M2Lingual over other datasets.

Prompting with Phonemes: Enhancing LLMs’ Multilinguality for Non-Latin Script Languages
Hoang H Nguyen | Khyati Mahajan | Vikas Yadav | Julian Salazar | Philip S. Yu | Masoud Hashemi | Rishabh Maheshwary
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Multilingual LLMs have achieved remarkable benchmark performance, but we find they continue to underperform on non-Latin script languages across contemporary LLM families. This discrepancy arises from the fact that LLMs are pretrained with orthographic scripts, which are dominated by Latin characters that obscure their shared phonology with non-Latin scripts. We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations. Our study demonstrates that integrating phonemic signals improves performance across both non-Latin and Latin script languages, with a particularly significant impact on closing the performance gap between the two. Through detailed experiments, we show that phonemic and orthographic scripts retrieve distinct examples for in-context learning (ICL). This motivates our proposed Mixed-ICL retrieval strategy, where further aggregation leads to significant performance improvements for both Latin script languages (up to 12.6%) and non-Latin script languages (up to 15.1%) compared to randomized ICL retrieval.

2024

Persona-aware Multi-party Conversation Response Generation
Khyati Mahajan | Samira Shaikh
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Modeling interlocutor information is essential towards modeling multi-party conversations to account for the presence of multiple participants. We investigate the role of including the persona attributes of both the speaker and addressee relevant to each utterance, collected via 3 distinct mock social media experiments. The participants were recruited via MTurk, and were unaware of the persona attributes of the other users they interacted with on the platform. Our main contributions include 1) a multi-party conversation dataset with rich associated metadata (including persona), and 2) a persona-aware heterogeneous graph transformer response generation model. We find that PersonaHeterMPC provides a good baseline towards persona-aware generation for multi-party conversation modeling, generating responses which are relevant and consistent with the interlocutor personas relevant to the conversation.

2023

Socratic Questioning of Novice Debuggers: A Benchmark Dataset and Preliminary Evaluations
Erfan Al-Hossami | Razvan Bunescu | Ryan Teehan | Laurel Powell | Khyati Mahajan | Mohsen Dorodchi
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

Socratic questioning is a teaching strategy where the student is guided towards solving a problem on their own, instead of being given the solution directly. In this paper, we introduce a dataset of Socratic conversations where an instructor helps a novice programmer fix buggy solutions to simple computational problems. The dataset is then used for benchmarking the Socratic debugging abilities of GPT-based language models. While GPT-4 is observed to perform much better than GPT-3.5, its precision and recall still fall short of human expert abilities, motivating further work in this area.

2022

Towards Evaluation of Multi-party Dialogue Systems
Khyati Mahajan | Sashank Santhanam | Samira Shaikh
Proceedings of the 15th International Conference on Natural Language Generation

Improving Dialogue Act Recognition with Augmented Data
Khyati Mahajan | Soham Parikh | Quaizar Vohra | Mitul Tiwari | Samira Shaikh
Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

We present our work on augmenting dialogue act recognition capabilities using synthetically generated data. Our work is motivated by the limitations of current dialogue act datasets, the need to adapt to new domains, and the ambiguity in utterances written by humans. We list our observations and findings on how synthetically generated data can contribute meaningfully to more robust dialogue act recognition models that extend to new domains. Our major finding shows that linguistically varied synthetic data can be very useful towards this goal, increasing performance from (0.39, 0.16) to (0.85, 0.88) for the AFFIRM and NEGATE dialogue acts respectively.

2021

We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for the 2021 shared task at the associated GEM Workshop.

A Case Study of Analysis of Construals in Language on Social Media Surrounding a Crisis Event
Lolo Aboufoul | Khyati Mahajan | Tiffany Gallicano | Sara Levens | Samira Shaikh
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

The events that took place at the Unite the Right rally held in Charlottesville, Virginia on August 11-12, 2017 caused an intense reaction on social media from users across the political spectrum. We present a novel application of psycholinguistics, specifically construal level theory, to analyze the language on social media around this event of social import through topic models. We find that including psycholinguistic measures of concreteness as covariates in topic models can lead to informed analysis of the language surrounding an event of political import.

On the Need for Thoughtful Data Collection for Multi-Party Dialogue: A Survey of Available Corpora and Collection Methods
Khyati Mahajan | Samira Shaikh
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

We present a comprehensive survey of available corpora for multi-party dialogue. We survey over 300 publications related to multi-party dialogue and catalogue all available corpora in a novel taxonomy. We analyze methods of data collection for multi-party dialogue corpora and identify several lacunae in existing data collection approaches used to collect such dialogue. We present this survey, the first survey to focus exclusively on multi-party dialogue corpora, to motivate research in this area. Through our discussion of existing data collection methods, we identify desiderata and guiding principles for multi-party data collection to contribute further towards advancing this area of dialogue research.

TeamUNCC@LT-EDI-EACL2021: Hope Speech Detection using Transfer Learning with Transformers
Khyati Mahajan | Erfan Al-Hossami | Samira Shaikh
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

In this paper, we describe our approach towards utilizing pre-trained models for the task of hope speech detection. We participated in Task 2: Hope Speech Detection for Equality, Diversity and Inclusion at LT-EDI-2021 @ EACL2021. The goal of this task is to predict the presence of hope speech, along with the presence of samples that do not belong to the same language in the dataset. We describe our approach to fine-tuning RoBERTa for Hope Speech detection in English and our approach to fine-tuning XLM-RoBERTa for Hope Speech detection in Tamil and Malayalam, two low resource Indic languages. We demonstrate the performance of our approach on classifying text into hope-speech, non-hope and not-language. Our approach ranked 1st in English (F1 = 0.93), 1st in Tamil (F1 = 0.61) and 3rd in Malayalam (F1 = 0.83).

2020

Studying The Effect of Emotional and Moral Language on Information Contagion during the Charlottesville Event
Khyati Mahajan | Samira Shaikh
Proceedings of the Fourth Widening Natural Language Processing Workshop

We highlight the contribution of emotional and moral language towards information contagion online. We find that retweet count on Twitter is significantly predicted by the use of negative emotions with negative moral language. We find that a tweet is less likely to be retweeted (hence less engagement and less potential for contagion) when it has emotional language expressed as anger along with a specific type of moral language, known as authority-vice. Conversely, when sadness is expressed with authority-vice, the tweet is more likely to be retweeted. Our findings indicate how emotional and moral language can interact in predicting information contagion.

2019

Emoji Usage Across Platforms: A Case Study for the Charlottesville Event
Khyati Mahajan | Samira Shaikh
Proceedings of the 2019 Workshop on Widening NLP

We study emoji usage patterns across two social media platforms: Gab, which is considered a fringe community, and Twitter. We find that Gab tends to use comparatively more emotionally charged emoji, but also seems more apathetic towards the violence during the event, while Twitter takes a more empathetic approach to the event.