Kranti Chalamalasetti

Also published as: Chalamalasetti Kranti

2025

From Templates to Natural Language: Generalization Challenges in Instruction-Tuned LLMs for Spatial Reasoning
Chalamalasetti Kranti | Sherzod Hakimov | David Schlangen
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Instruction-tuned large language models (LLMs) have shown strong performance on a variety of tasks; however, generalizing from synthetic to human-authored instructions in grounded environments remains a challenge for them. In this work, we study generalization challenges in spatial grounding tasks where models interpret and translate instructions for building object arrangements on a 2.5D grid. We fine-tune LLMs using only synthetic instructions and evaluate their performance on a benchmark dataset containing both synthetic and human-authored instructions. Our results reveal that while models generalize well on simple tasks, their performance degrades significantly on more complex tasks. We present a detailed error analysis of the gaps in instruction generalization.

Test Set Quality in Multilingual LLM Evaluation
Chalamalasetti Kranti | Gabriel Bernier-Colborne | Yvan Gauthier | Sowmya Vajjala
Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems

Several multilingual benchmark datasets have been developed in a semi-automatic manner in the recent past to measure progress and understand the state of the art in the multilingual capabilities of Large Language Models (LLMs). However, little attention has been paid to the quality of the datasets themselves, despite the existence of previous work identifying errors in even fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages, French and Telugu, identifying several errors in the datasets during the process. We compare the performance of several LLMs on the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and potentially versioned. We end with some recommendations for both dataset creators and consumers on addressing dataset quality issues.

Conversational Collaborative Robots
Chalamalasetti Kranti
Proceedings of the 21st Workshop of Young Researchers' Roundtable on Spoken Dialogue Systems

Spoken dialogue systems (SDSs) aim to enable natural, interactive and collaborative conversations. My research interest lies in leveraging these situated collaborative conversations to teach new concepts (skills) to collaborative robots (cobots). These cobots, when operating in manufacturing environments such as assembly lines, are envisioned to converse with humans, reach common ground, and learn new skills in one shot without the need for multiple demonstrations. Unlike SDSs in consumer domains, these cobot-based systems must handle conversations in noisy, time-sensitive industrial settings. Motivated by these challenges, my research focuses on building collaborative dialogue systems capable of integrating conversational programming to translate situated dialogue into modular programs, knowing when to ask for clarifications, and adapting the program based on corrections.

clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti | Sherzod Hakimov | David Schlangen
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation, either focusing on a single user simulator or a specific system design, limiting the generalisability of insights across architectures and configurations. In this work, we propose clem:todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem:todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. To the best of our knowledge, clem:todd is the first evaluation framework for task-oriented dialogue systems that supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem:todd’s flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.

2024

Retrieval-Augmented Code Generation for Situated Action Generation: A Case Study on Minecraft
Chalamalasetti Kranti | Sherzod Hakimov | David Schlangen
Findings of the Association for Computational Linguistics: EMNLP 2024

In the Minecraft Collaborative Building Task, two players collaborate: an Architect (A) provides instructions to a Builder (B) to assemble a specified structure using 3D blocks. In this work, we investigate the use of large language models (LLMs) to predict the sequence of actions taken by the Builder. Leveraging LLMs’ in-context learning abilities, we use few-shot prompting techniques that significantly improve performance over baseline methods. Additionally, we present a detailed analysis of the gaps in performance for future work.

2023

clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents
Kranti Chalamalasetti | Jana Götze | Sherzod Hakimov | Brielen Madureira | Philipp Sadler | David Schlangen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Recent work has proposed a methodology for the systematic evaluation of “Situated Language Understanding Agents” (agents that operate in rich linguistic and non-linguistic contexts) through testing them in carefully constructed interactive settings. Other recent work has argued that Large Language Models (LLMs), if suitably set up, can be understood as (simulators of) such agents. A connection suggests itself, which this paper explores: Can LLMs be evaluated meaningfully by exposing them to constrained game-like settings that are built to challenge specific capabilities? As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions. Both this capability and the quality of the game play, measured by how well the objectives of the different games are met, follow the development cycle, with newer models generally performing better. The metrics even for the comparatively simple example games are far from being saturated, suggesting that the proposed instrument will continue to have diagnostic value.

2020

Word emphasis in textual content aims at conveying the desired intention by changing the size, color, typeface, style (bold, italic, etc.), and other typographical features. Emphasized words are extremely helpful in drawing readers’ attention to the specific information that the authors wish to highlight. However, applying such emphasis using a soft keyboard for social media interactions is time-consuming and has an associated learning curve. In this paper, we propose a novel approach to automate emphasis word detection in short written texts. To the best of our knowledge, this work presents the first lightweight deep learning approach for smartphone deployment of emphasis selection. Experimental results show that our approach achieves comparable accuracy at a much smaller model size than existing models. Our best lightweight model has a memory footprint of 2.82 MB with a matching score of 0.716 on the SemEval-2020 public benchmark dataset.
