Ratish Puduppully - ACL Anthology

Ratish Puduppully

2026

RiddleBench: A New Generative Reasoning Benchmark for LLMs
Deepon Halder | Alan Saji | Thanmay Jayakumar | Anoop Kunchukuttan | Ratish Puduppully | Raj Dabre
Findings of the Association for Computational Linguistics: EACL 2026

While Large Language Models (LLMs) show remarkable capabilities, their complex reasoning skills require deeper investigation. We introduce **RiddleBench**, a new benchmark of 1,737 challenging puzzles designed to test reasoning beyond simple pattern matching. Our evaluation of state-of-the-art models reveals significant limitations, including hallucination cascades (uncritically accepting flawed peer reasoning) and poor self-correction due to strong self-confirmation bias. We also find that model performance is fragile, degrading when faced with reordered constraints or irrelevant information. RiddleBench serves as a resource for diagnosing these issues and guiding the development of more robust LLMs.

The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI
Alan Saji | Raj Dabre | Anoop Kunchukuttan | Ratish Puduppully
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM’s reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these cognitive behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap increasing as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode: getting “Lost in Translation,” where translation steps lead to errors that would have been avoided by reasoning in the language of the question.

2025

RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs
Alan Saji | Jaavid Aktar Husain | Thanmay Jayakumar | Raj Dabre | Anoop Kunchukuttan | Ratish Puduppully
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) exhibit strong multilingual performance despite being predominantly trained on English-centric corpora. This raises a fundamental question: How do LLMs achieve such multilingual capabilities? Focusing on languages written in non-Roman scripts, we investigate the role of Romanization—the representation of non-Roman scripts using Roman characters—as a potential bridge in multilingual processing. Using mechanistic interpretability techniques, we analyze next-token generation and find that intermediate layers frequently represent target words in Romanized form before transitioning to native script, a phenomenon we term Latent Romanization. Further, through activation patching experiments, we demonstrate that LLMs encode semantic concepts similarly across native and Romanized scripts, suggesting a shared underlying representation. Additionally, for translation into non-Roman script languages, our findings reveal that when the target language is in Romanized form, its representations emerge earlier in the model’s layers compared to native script. These insights contribute to a deeper understanding of multilingual representation in LLMs and highlight the implicit role of Romanization in facilitating language transfer.

2024

How Good is Zero-Shot MT Evaluation for Low Resource Indian Languages?
Anushka Singh | Ananya Sai | Raj Dabre | Ratish Puduppully | Anoop Kunchukuttan | Mitesh Khapra
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

While machine translation evaluation has been studied primarily for high-resource languages, there has been recent interest in evaluation for low-resource languages due to the increasing availability of data and models. In this paper, we consider a zero-shot evaluation setting for low-resource Indian languages, namely Assamese, Kannada, Maithili, and Punjabi. We collect sufficient Multi-Dimensional Quality Metrics (MQM) and Direct Assessment (DA) annotations to create test sets and meta-evaluate a plethora of automatic evaluation metrics. We observe that even for learned metrics, which are known to exhibit zero-shot performance, the Kendall Tau and Pearson correlations with human annotations are only as high as 0.32 and 0.45. Synthetic data approaches show mixed results and overall do not help close the gap by much for these languages. This indicates that there is still a long way to go for low-resource evaluation.

RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization
Jaavid J | Raj Dabre | Aswanth M | Jay Gala | Thanmay Jayakumar | Ratish Puduppully | Anoop Kunchukuttan
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages, specifically those using non-Roman scripts. We propose an approach that utilizes the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. Our approach involves the continual pretraining of an English LLM like Llama 2 on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized data. The results indicate that romanized text not only reduces token fertility by 2x-4x but also matches, if not outperforms, native script representation across various NLU, NLG and MT tasks. Moreover, the embeddings computed on romanized text exhibit closer alignment with their English translations than those from the native script. Our approach presents a promising direction for leveraging the power of English LLMs in languages traditionally underrepresented in NLP research.

An Empirical Comparison of Vocabulary Expansion and Initialization Approaches For Language Models
Nandini Mundra | Aditya Nanda Kishore Khandavally | Raj Dabre | Ratish Puduppully | Anoop Kunchukuttan | Mitesh M Khapra
Proceedings of the 28th Conference on Computational Natural Language Learning

Language Models (LMs) excel in natural language processing tasks for English but show reduced performance in most other languages. This problem is commonly tackled by continually pre-training and fine-tuning these models for said languages. A significant issue in this process is the limited vocabulary coverage in the original model’s tokenizer, leading to inadequate representation of new languages and necessitating an expansion of the tokenizer. The initialization of the embeddings corresponding to new vocabulary items presents a further challenge. Current strategies require cross-lingual embeddings and lack a solid theoretical foundation as well as comparisons with strong baselines. In this paper, we first establish theoretically that initializing within the convex hull of existing embeddings is a good initialization, followed by a novel but simple approach, Constrained Word2Vec (CW2V), which does not require cross-lingual embeddings. Our study evaluates different initialization methods for expanding RoBERTa and LLaMA 2 across four languages and five tasks. The results show that CW2V performs equally well or even better than more advanced techniques. Additionally, simpler approaches like multivariate initialization perform on par with these advanced methods, indicating that efficient large-scale multilingual continued pretraining can be achieved even with simpler initialization methods.

2023

In-context Learning of Large Language Models for Controlled Dialogue Summarization: A Holistic Benchmark and Empirical Analysis
Yuting Tang | Ratish Puduppully | Zhengyuan Liu | Nancy Chen
Proceedings of the 4th New Frontiers in Summarization Workshop

Large Language Models (LLMs) have shown significant performance in numerous NLP tasks, including summarization and controlled text generation. A notable capability of LLMs is in-context learning (ICL), where the model learns new tasks using input-output pairs in the prompt without any parameter update. However, the performance of LLMs in the context of few-shot abstractive dialogue summarization remains underexplored. This study evaluates various state-of-the-art LLMs on the SAMSum dataset within a few-shot framework. We assess these models in both controlled (entity control, length control, and person-focused planning) and uncontrolled settings, establishing a comprehensive benchmark in few-shot dialogue summarization. Our findings provide insights into summary quality and model controllability, offering a crucial reference for future research in dialogue summarization.

DecoMT: Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models
Ratish Puduppully | Anoop Kunchukuttan | Raj Dabre | Ai Ti Aw | Nancy Chen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

This study investigates machine translation between related languages, i.e., languages within the same family that share linguistic characteristics such as word order and lexical similarity. Machine translation through few-shot prompting leverages a small set of translation pair examples to generate translations for test sentences. This procedure requires the model to learn how to generate translations while simultaneously ensuring that token ordering is maintained to produce a fluent and accurate translation. We propose that for related languages, the task of machine translation can be simplified by leveraging the monotonic alignment characteristic of such languages. We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations. Through automatic and human evaluation conducted on multiple related language pairs across various language families, we demonstrate that our proposed approach of decomposed prompting surpasses multiple established few-shot baseline approaches. For example, DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ scores across the examined languages.

CTQScorer: Combining Multiple Features for In-context Example Selection for Machine Translation
Aswanth Kumar | Ratish Puduppully | Raj Dabre | Anoop Kunchukuttan
Findings of the Association for Computational Linguistics: EMNLP 2023

Large language models have demonstrated the capability to perform machine translation when the input is prompted with a few examples (in-context learning). Translation quality depends on various features of the selected examples, such as their quality and relevance, but previous work has predominantly focused on individual features in isolation. In this paper, we propose a general framework for combining different features influencing example selection. We learn a regression model, CTQ Scorer (Contextual Translation Quality), that selects examples based on multiple features in order to maximize the translation quality. On multiple language pairs and language models, we show that CTQ Scorer helps significantly outperform random selection as well as strong single-factor baselines reported in the literature. We also see an improvement of over 2.5 COMET points on average with respect to a strong BM25 retrieval-based baseline.

2022

Data-to-text Generation with Variational Sequential Planning
Ratish Puduppully | Yao Fu | Mirella Lapata
Transactions of the Association for Computational Linguistics, Volume 10

We consider the task of data-to-text generation, which aims to create textual output from non-linguistic input. We focus on generating long-form text, that is, documents with multiple paragraphs, and propose a neural model enhanced with a planning component responsible for organizing high-level information in a coherent and meaningful way. We infer latent plans sequentially with a structured variational model, while interleaving the steps of planning and generation. Text is generated by conditioning on previous variational decisions and previously generated text. Experiments on two data-to-text benchmarks (RotoWire and MLB) show that our model outperforms strong baselines and is sample-efficient in the face of limited training data (e.g., a few hundred instances).

IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages
Aman Kumar | Himani Shrotriya | Prachi Sahu | Amogh Mishra | Raj Dabre | Ratish Puduppully | Anoop Kunchukuttan | Mitesh M. Khapra | Pratyush Kumar
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages. We present the IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic languages. We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, paraphrase generation, and question generation. We describe the created datasets and use them to benchmark the performance of several monolingual and multilingual baselines that leverage pre-trained sequence-to-sequence models. Our results exhibit the strong performance of multilingual language-specific pre-trained models, and the utility of models trained on our dataset for other related NLG tasks. Our dataset creation methods can be easily applied to modest-resource languages as they involve simple steps such as scraping news articles and Wikipedia infoboxes, light cleaning, and pivoting through machine translation data. To the best of our knowledge, the IndicNLG Benchmark is the first NLG benchmark for Indic languages and the most diverse multilingual NLG dataset, with approximately 8M examples across 5 tasks and 11 languages. The datasets and models will be publicly available.

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann | Abhik Bhattacharjee | Abinaya Mahendiran | Alex Wang | Alexandros Papangelis | Aman Madaan | Angelina McMillan-Major | Anna Shvets | Ashish Upadhyay | Bernd Bohnet | Bingsheng Yao | Bryan Wilie | Chandra Bhagavatula | Chaobin You | Craig Thomson | Cristina Garbacea | Dakuo Wang | Daniel Deutsch | Deyi Xiong | Di Jin | Dimitra Gkatzia | Dragomir Radev | Elizabeth Clark | Esin Durmus | Faisal Ladhak | Filip Ginter | Genta Indra Winata | Hendrik Strobelt | Hiroaki Hayashi | Jekaterina Novikova | Jenna Kanerva | Jenny Chim | Jiawei Zhou | Jordan Clive | Joshua Maynez | João Sedoc | Juraj Juraska | Kaustubh Dhole | Khyathi Raghavi Chandu | Laura Perez-Beltrachini | Leonardo F. R. Ribeiro | Lewis Tunstall | Li Zhang | Mahim Pushkarna | Mathias Creutz | Michael White | Mihir Sanjay Kale | Moussa Kamal Eddine | Nico Daheim | Nishant Subramani | Ondřej Dušek | Paul Pu Liang | Pawan Sasanka Ammanamanchi | Qi Zhu | Ratish Puduppully | Reno Kriz | Rifat Shahriyar | Ronald Cardenas | Saad Mahamood | Salomey Osei | Samuel Cahyawijaya | Sanja Štajner | Sebastien Montella | Shailza Jolly | Simon Mille | Tahmid Hasan | Tianhao Shen | Tosin Adewumi | Vikas Raunak | Vipul Raheja | Vitaly Nikolaev | Vivian Tsai | Yacine Jernite | Ying Xu | Yisi Sang | Yixin Liu | Yufang Hou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Evaluations in machine learning rarely use the latest metrics, datasets, or human evaluation in favor of remaining compatible with prior work. The compatibility, often facilitated through leaderboards, thus leads to outdated but standardized evaluation practices. We pose that the standardization is taking place in the wrong spot. Evaluation infrastructure should enable researchers to use the latest methods and what should be standardized instead is how to incorporate these new evaluation advances. We introduce GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark which uses a modular infrastructure for dataset, model, and metric developers to benefit from each other’s work. GEMv2 supports 40 documented datasets in 51 languages, ongoing online evaluation for all datasets, and our interactive tools make it easier to add new datasets to the living benchmark.

IndicBART: A Pre-trained Model for Indic Natural Language Generation
Raj Dabre | Himani Shrotriya | Anoop Kunchukuttan | Ratish Puduppully | Mitesh Khapra | Pratyush Kumar
Findings of the Association for Computational Linguistics: ACL 2022

In this paper, we study pre-trained sequence-to-sequence models for a group of related languages, with a focus on Indic languages. We present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments on NMT and extreme summarization show that a model specific to related languages like IndicBART is competitive with large pre-trained models like mBART50 despite being significantly smaller. It also performs well on very low-resource translation scenarios where languages are not included in pre-training or fine-tuning. Script sharing, multilingual training, and better utilization of limited model capacity contribute to the good performance of the compact IndicBART model.

2021

Data-to-text Generation with Macro Planning
Ratish Puduppully | Mirella Lapata
Transactions of the Association for Computational Linguistics, Volume 9

Recent approaches to data-to-text generation have adopted the very successful encoder-decoder architecture or variants thereof. These models generate text that is fluent (but often imprecise) and perform quite poorly at selecting appropriate content and ordering it coherently. To overcome some of these issues, we propose a neural model with a macro planning stage followed by a generation stage reminiscent of traditional methods which embrace separate modules for planning and surface realization. Macro plans represent high level organization of important content such as entities, events, and their interactions; they are learned from data and given as input to the generator. Extensive experiments on two data-to-text benchmarks (RotoWire and MLB) show that our approach outperforms competitive baselines in terms of automatic and human evaluation.

2019

University of Edinburgh’s submission to the Document-level Generation and Translation Shared Task
Ratish Puduppully | Jonathan Mallinson | Mirella Lapata
Proceedings of the 3rd Workshop on Neural Generation and Translation

The University of Edinburgh participated in all six tracks: NLG, MT, and MT+NLG with both English and German as targeted languages. For the NLG track, we submitted a multilingual system based on the Content Selection and Planning model of Puduppully et al. (2019). For the MT track, we submitted Transformer-based Neural Machine Translation models, where out-of-domain parallel data was augmented with in-domain data extracted from monolingual corpora. Our MT+NLG systems disregard the structured input data and instead rely exclusively on the source summaries.

Data-to-text Generation with Entity Modeling
Ratish Puduppully | Li Dong | Mirella Lapata
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Recent approaches to data-to-text generation have shown great promise thanks to the use of large-scale datasets and the application of neural network architectures which are trained end-to-end. These models rely on representation learning to select content appropriately, structure it coherently, and verbalize it grammatically, treating entities as nothing more than vocabulary tokens. In this work we propose an entity-centric neural architecture for data-to-text generation. Our model creates entity-specific representations which are dynamically updated. Text is generated conditioned on the data input and entity memory representations using hierarchical attention at each time step. We present experiments on the RotoWire benchmark and a (five times larger) new dataset on the baseball domain which we create. Our results show that the proposed model outperforms competitive baselines in automatic and human evaluation.

2017

Transition-Based Deep Input Linearization
Ratish Puduppully | Yue Zhang | Manish Shrivastava
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Traditional methods for deep NLG adopt pipeline approaches comprising stages such as constructing syntactic input, predicting function words, linearizing the syntactic input and generating the surface forms. Though easier to visualize, pipeline approaches suffer from error propagation. In addition, information available across modules cannot be leveraged by all modules. We construct a transition-based model to jointly perform linearization, function word prediction and morphological generation, which considerably improves upon the accuracy compared to a pipelined baseline system. On a standard deep input linearization shared task, our system achieves the best results reported so far.

2016

Transition-Based Syntactic Linearization with Lookahead Features
Ratish Puduppully | Yue Zhang | Manish Shrivastava
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

Brahmi-Net: A transliteration and script conversion system for languages of the Indian subcontinent
Anoop Kunchukuttan | Ratish Puduppully | Pushpak Bhattacharyya
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

2014

Merging Verb Senses of Hindi WordNet using Word Embeddings
Sudha Bhingardive | Ratish Puduppully | Dhirendra Singh | Pushpak Bhattacharyya
Proceedings of the 11th International Conference on Natural Language Processing

Co-authors

Pushpak Bhattacharyya 2

Pratyush Kumar 2

Manish Shrivastava 2

Himani Shrotriya 2

Tosin Adewumi 1

Pawan Sasanka Ammanamanchi 1

Chandra Bhagavatula 1

Abhik Bhattacharjee 1

Sudha Bhingardive 1

Samuel Cahyawijaya 1

Ronald Cardenas 1

Khyathi Raghavi Chandu 1

Elizabeth Clark 1

Mathias Creutz 1

Daniel Deutsch 1

Kaustubh Dhole 1

Ondřej Dušek 1

Moussa Kamal Eddine 1

Cristina Garbacea 1

Sebastian Gehrmann 1

Dimitra Gkatzia 1

Deepon Halder 1

Hiroaki Hayashi 1

Jaavid Aktar Husain 1

Yacine Jernite 1

Shailza Jolly 1

Juraj Juraska 1

Mihir Sanjay Kale 1

Jenna Kanerva 1

Aditya Nanda Kishore Khandavally 1

Aswanth Kumar 1

Faisal Ladhak 1

Paul Pu Liang 1

Zhengyuan Liu 1

Saad Mahamood 1

Abinaya Mahendiran 1

Jonathan Mallinson 1

Joshua Maynez 1

Angelina McMillan-Major 1

Sebastien Montella 1

Nandini Mundra 1

Vitaly Nikolaev 1

Jekaterina Novikova 1

Alexandros Papangelis 1

Laura Perez-Beltrachini 1

Mahim Pushkarna 1

Dragomir Radev 1

Leonardo F. R. Ribeiro 1

Rifat Shahriyar 1

Anushka Singh 1

Dhirendra Singh 1

Hendrik Strobelt 1

Nishant Subramani 1

Craig Thomson 1

Lewis Tunstall 1

Ashish Upadhyay 1

Michael White 1

Genta Indra Winata 1

Deyi Xiong (德意熊) 1

Bingsheng Yao 1

Sanja Štajner 1

Venues