Souvik Kundu

2025

State space models (SSMs) achieve efficient sub-quadratic compute complexity but often exhibit significant performance drops as context length increases. Recent work attributes this deterioration to an exponential decay in hidden-state memory. While token filtering has emerged as a promising remedy, its underlying rationale and limitations remain poorly understood. In this paper, we first investigate the attention patterns of Mamba to shed light on why token filtering alleviates long-context degradation. Motivated by these findings, we propose LAMB, a training-free, attention-guided token filtering strategy designed to preserve critical tokens during inference. LAMB can boost long-context performance for both pure SSMs and hybrid models, achieving up to an average improvement of 30.35% over state-of-the-art techniques on standard long-context understanding benchmarks. Our analysis and experiments reveal new insights into the interplay between attention, token selection, and memory retention, and are thus expected to inspire broader applications of token filtering in long-sequence modeling.
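
As a rough illustration of what score-guided token filtering means, the sketch below keeps only the highest-scoring tokens of a context while preserving their order. It is not the LAMB procedure itself: the function name `filter_tokens`, the `keep_ratio` parameter, and the assumption that per-token importance scores are already available are all hypothetical.

```python
def filter_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order.

    `scores` stands in for per-token importance (e.g. derived from
    attention); how such scores are computed is outside this sketch.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Always keep at least one token when the input is non-empty.
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank positions by descending score; ties go to the earlier position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    kept_positions = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept_positions]


if __name__ == "__main__":
    context = ["the", "key", "is", "42", "ok", "."]
    importance = [0.05, 0.30, 0.05, 0.40, 0.10, 0.10]
    print(filter_tokens(context, importance, keep_ratio=0.5))
```

Sorting the kept positions before returning keeps the surviving tokens in reading order, which matters for a sequential model that consumes them left to right.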

Despite recent efforts to understand the impact of compression on Large Language Models (LLMs) in terms of downstream task performance and trustworthiness on relatively simple uni-modal benchmarks (e.g., question answering, common-sense reasoning), a detailed study of its impact on multi-modal Large Vision-Language Models (LVLMs) has yet to be presented. To mitigate this gap, we present LVLM-Compress-Bench, a framework for a first thorough study of the broad impact of compression on the generative performance of LVLMs on tasks driven by multi-modal input. Specifically, we consider two major classes of compression for autoregressive models, namely KV cache and weight compression, which target the dynamically growing intermediate cache and the static weights, respectively. We use four LVLM variants of the popular LLaVA framework to present our analysis, integrating various state-of-the-art KV and weight compression methods, including uniform, outlier-reduced, and group quantization. With this framework, we demonstrate results on ten different multi-modal datasets covering varied capabilities, including recognition, knowledge, language generation, spatial awareness, visual reasoning, hallucination and visual illusion identification, toxicity, stereotypes, and bias. In particular, our framework demonstrates the impact of compression on both general and ethically critical metrics, leveraging a combination of real-world and synthetic datasets to encompass diverse societal intersectional attributes. Extensive experimental evaluations yield diverse and intriguing observations on the behavior of LVLMs at different quantization budgets for the KV cache and weights, both maintaining and losing performance compared to the baseline model in the FP16 data format. We believe LVLM-Compress-Bench will help the community gain deeper insight into the impact of compression and the societal impact that compressed models may pose. Code will be released soon.
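
To make the difference between uniform and group quantization concrete, here is a minimal sketch in plain Python. It is not code from LVLM-Compress-Bench: the function names, the asymmetric min/max scheme, and the toy values are assumptions chosen only to show why giving each group its own scale can reduce error when magnitudes vary widely.

```python
def quantize_uniform(values, bits):
    """Asymmetric uniform quantization to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    # Guard against a constant input, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [c * scale + lo for c in codes]


def fake_quantize(values, bits, group_size=None):
    """Quantize then dequantize, with one scale per group of `group_size`.

    With group_size=None the whole list shares a single scale.
    """
    size = group_size or len(values)
    out = []
    for start in range(0, len(values), size):
        codes, scale, lo = quantize_uniform(values[start:start + size], bits)
        out.extend(dequantize(codes, scale, lo))
    return out


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    weights = [0.01, 0.02, 0.03, 0.04, 10.0, 20.0, 30.0, 40.0]
    whole = fake_quantize(weights, bits=2)
    grouped = fake_quantize(weights, bits=2, group_size=4)
    print("single scale error:", max_abs_error(weights, whole))
    print("per-group error:   ", max_abs_error(weights, grouped))
```

In this toy input the small and large values fall into separate groups, so each group's scale fits its own range; the price in practice is storing one scale and offset per group.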

2024

AFLoRA: Adaptive Freezing of Low Rank Adaptation in Parameter Efficient Fine-Tuning of Large Models
Zeyu Liu | Souvik Kundu | Anni Li | Junrui Wan | Lianghao Jiang | Peter Beerel
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present a novel Parameter-Efficient Fine-Tuning (PEFT) method, dubbed Adaptive Freezing of Low-Rank Adaptation (AFLoRA). Specifically, for each pre-trained frozen weight tensor, we add a parallel path of trainable low-rank matrices, namely a down-projection and an up-projection matrix, each of which is followed by a feature transformation vector. Based on a novel freezing score, we incrementally freeze these projection matrices during fine-tuning to reduce the computation and alleviate over-fitting. Our experimental results demonstrate that we can achieve state-of-the-art performance with an average improvement of up to 0.85% as evaluated on the GLUE benchmark while yielding up to 9.5× fewer average trainable parameters. In terms of runtime, AFLoRA can yield up to 1.86× improvement compared to similar PEFT alternatives. Besides the practical utility of our approach, we provide insights on the trainability requirements of LoRA paths at different modules and the freezing schedule for the different projection matrices.
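
The parallel path described above can be sketched with small list-based matrices. This is an illustrative reading of the abstract, not the authors' implementation: the names `aflora_style_forward` and `trainable_count` are hypothetical, the feature transformation vectors are modelled as element-wise scales, and the sketch assumes those vectors stay trainable after the projection matrices are frozen. The freezing score itself is not modelled.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def aflora_style_forward(W, A, B, s_down, s_up, x):
    """Frozen base path W x plus a low-rank path with per-feature scaling.

    A is the down-projection (r x d_in), B the up-projection (d_out x r);
    s_down and s_up are the vectors applied after each projection.
    """
    base = matvec(W, x)
    low = [s * h for s, h in zip(s_down, matvec(A, x))]
    up = [s * h for s, h in zip(s_up, matvec(B, low))]
    return [b + u for b, u in zip(base, up)]


def trainable_count(A, B, s_down, s_up, projections_frozen):
    """Count trainable scalars; assumes only the vectors train after freezing."""
    vectors = len(s_down) + len(s_up)
    if projections_frozen:
        return vectors
    return vectors + sum(len(r) for r in A) + sum(len(r) for r in B)


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]      # frozen pre-trained weight
    A = [[1.0, 1.0]]                  # rank-1 down-projection
    B = [[1.0], [2.0]]                # rank-1 up-projection
    s_down, s_up = [2.0], [1.0, 0.5]
    print(aflora_style_forward(W, A, B, s_down, s_up, [1.0, 2.0]))
    print(trainable_count(A, B, s_down, s_up, projections_frozen=False))
    print(trainable_count(A, B, s_down, s_up, projections_frozen=True))
```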

LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation
Seyedarmin Azizi | Souvik Kundu | Massoud Pedram
Findings of the Association for Computational Linguistics: EMNLP 2024

Low-rank adaptation (LoRA) has become the default approach to fine-tune large language models (LLMs) due to its significant reduction in trainable parameters. However, trainable parameter demand for LoRA increases with increasing model embedding dimensions, leading to high compute costs. Additionally, its backward updates require storing high-dimensional intermediate activations and optimizer states, demanding high peak GPU memory. In this paper, we introduce LaMDA, a novel approach to fine-tuning large language models, which leverages low-dimensional adaptation to achieve significant reductions in trainable parameters and peak GPU memory footprint. LaMDA freezes a first projection matrix (PMA) in the adaptation path while introducing a low-dimensional trainable square matrix, resulting in substantial reductions in trainable parameters and peak GPU memory usage. LaMDA gradually freezes a second projection matrix (PMB) during the early fine-tuning stages, reducing the compute cost associated with weight updates to enhance parameter efficiency further. We also present an enhancement, LaMDA++, incorporating a “lite-weight” adaptive rank allocation for the LoRA path via normalized spectrum analysis of pre-trained model weights. We evaluate LaMDA/LaMDA++ across various tasks, including natural language understanding with the GLUE benchmark, text summarization, natural language generation, and complex reasoning on different LLMs. Results show that LaMDA matches or surpasses the performance of existing alternatives while requiring up to 17.7× fewer parameter updates and up to 1.32× lower peak GPU memory usage during fine-tuning. Code will be publicly available at https://github.com/ArminAzizi98/LaMDA.
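
The parameter argument can be checked with simple arithmetic. The sketch below compares the number of trainable scalars in a standard LoRA path with a path that, as the abstract describes, keeps PMA frozen and trains a small square matrix, with PMB either still trainable or already frozen. The function names and the example dimensions are hypothetical, and the gradual freezing of PMB is simplified to an on/off flag.

```python
def lora_trainable(d_in, d_out, rank):
    """Standard LoRA trains both projections: (rank x d_in) and (d_out x rank)."""
    return rank * (d_in + d_out)


def lamda_style_trainable(d_out, rank, pmb_frozen):
    """PMA is frozen; a rank x rank square matrix is always trainable.

    PMB (d_out x rank) counts as trainable only until it is frozen.
    """
    square = rank * rank
    return square if pmb_frozen else square + rank * d_out


if __name__ == "__main__":
    d, r = 4096, 32  # illustrative sizes, not taken from the paper
    print("LoRA:                  ", lora_trainable(d, d, r))
    print("square + PMB trainable:", lamda_style_trainable(d, r, pmb_frozen=False))
    print("square only:           ", lamda_style_trainable(d, r, pmb_frozen=True))
```

Once PMB is frozen, the trainable count in this sketch no longer depends on the embedding dimension, which is the scaling concern the abstract raises about LoRA.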

2022

Due to the success of pre-trained language models, versions of languages other than English have been released in recent years. This fact implies the need for resources to evaluate these models. In the case of Spanish, there are few ways to systematically assess the models’ quality. In this paper, we narrow the gap by building two evaluation benchmarks. Inspired by previous work (Conneau and Kiela, 2018; Chen et al., 2019), we introduce Spanish SentEval and Spanish DiscoEval, aiming to assess the capabilities of stand-alone and discourse-aware sentence representations, respectively. Our benchmarks include considerable pre-existing and newly constructed datasets that address different tasks from various domains. In addition, we evaluate and analyze the most recent pre-trained Spanish language models to exhibit their capabilities and limitations. As an example, we discover that for the case of discourse evaluation tasks, mBERT, a language model trained on multiple languages, usually provides a richer latent representation than models trained only with documents in Spanish. We hope our contribution will motivate a fairer, more comparable, and less cumbersome way to evaluate future Spanish language models.

2020

Learning to Identify Follow-Up Questions in Conversational Question Answering
Souvik Kundu | Qian Lin | Hwee Tou Ng
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Despite recent progress in conversational question answering, most prior work does not focus on follow-up questions. Practical conversational question answering systems often receive follow-up questions in an ongoing conversation, and it is crucial for a system to be able to determine whether a question is a follow-up question of the current conversation, for more effective answer finding subsequently. In this paper, we introduce a new follow-up question identification task. We propose a three-way attentive pooling network that determines the suitability of a follow-up question by capturing pair-wise interactions between the associated passage, the conversation history, and a candidate follow-up question. It enables the model to capture topic continuity and topic shift while scoring a particular candidate follow-up question. Experiments show that our proposed three-way attentive pooling network outperforms all baseline systems by significant margins.

A Co-Attentive Cross-Lingual Neural Model for Dialogue Breakdown Detection
Qian Lin | Souvik Kundu | Hwee Tou Ng
Proceedings of the 28th International Conference on Computational Linguistics

Ensuring smooth communication is essential in a chat-oriented dialogue system, so that a user can obtain meaningful responses through interactions with the system. Most prior work on dialogue research does not focus on preventing dialogue breakdown. One of the major challenges is that a dialogue system may generate an undesired utterance leading to a dialogue breakdown, which degrades the overall interaction quality. Hence, it is crucial for a machine to detect dialogue breakdowns in an ongoing conversation. In this paper, we propose a novel dialogue breakdown detection model that jointly incorporates a pretrained cross-lingual language model and a co-attention network. Our proposed model leverages effective word embeddings trained on one hundred different languages to generate contextualized representations. Co-attention aims to capture the interaction between the latest utterance and the conversation history, and thereby determines whether the latest utterance causes a dialogue breakdown. Experimental results show that our proposed model outperforms all previous approaches on all evaluation metrics in both the Japanese and English tracks in Dialogue Breakdown Detection Challenge 4 (DBDC4 at IWSDS2019).

2019

Exploiting Explicit Paths for Multi-hop Reading Comprehension
Souvik Kundu | Tushar Khot | Ashish Sabharwal | Peter Clark
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We propose a novel, path-based reasoning approach for the multi-hop reading comprehension task where a system needs to combine facts from multiple passages to answer a question. Although inspired by multi-hop reasoning over knowledge graphs, our proposed approach operates directly over unstructured text. It generates potential paths through passages and scores them without any direct path supervision. The proposed model, named PathNet, attempts to extract implicit relations from text through entity pair representations, and compose them to encode each path. To capture additional context, PathNet also composes the passage representations along each path to compute a passage-based representation. Unlike previous approaches, our model is then able to explain its reasoning via these explicit paths through the passages. We show that our approach outperforms prior models on the multi-hop Wikihop dataset, and also can be generalized to apply to the OpenBookQA dataset, matching state-of-the-art performance.

2018

A Nil-Aware Answer Extraction Framework for Question Answering
Souvik Kundu | Hwee Tou Ng
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recently, there has been a surge of interest in reading comprehension-based (RC) question answering (QA). However, current approaches suffer from an impractical assumption that every question has a valid answer in the associated passage. A practical QA system must possess the ability to determine whether a valid answer exists in a given text passage. In this paper, we focus on developing QA systems that can extract an answer for a question if and only if the associated passage contains an answer. If the associated passage does not contain any valid answer, the QA system will correctly return Nil. We propose a novel nil-aware answer span extraction framework that is capable of returning Nil or a text span from the associated passage as an answer in a single step. We show that our proposed framework can be easily integrated with several recently proposed QA models developed for reading comprehension and can be trained in an end-to-end fashion. Our proposed nil-aware answer extraction neural network decomposes pieces of evidence into relevant and irrelevant parts and then combines them to infer the existence of any answer. Experiments on the NewsQA dataset show that the integration of our proposed framework significantly outperforms several strong baseline systems that use pipeline or threshold-based approaches.
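
One simple way to picture single-step nil-aware extraction is to let a Nil option compete with every candidate span in the same normalization, instead of applying a separate threshold afterwards. The sketch below does only that: it is not the evidence-decomposition network from the paper, and `extract_answer`, the span score dictionary, and the example scores are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def extract_answer(tokens, span_scores, nil_score):
    """Return the best span text, or None when Nil wins.

    `span_scores` maps (start, end) token indices, end inclusive, to a score.
    Nil is placed at index 0 and normalized jointly with all spans.
    """
    spans = list(span_scores)
    probs = softmax([nil_score] + [span_scores[s] for s in spans])
    best = max(range(len(probs)), key=probs.__getitem__)
    if best == 0:
        return None
    start, end = spans[best - 1]
    return " ".join(tokens[start:end + 1])


if __name__ == "__main__":
    passage = "Paris is the capital of France".split()
    scores = {(0, 0): 3.0, (5, 5): 1.0}
    print(extract_answer(passage, scores, nil_score=0.5))
    print(extract_answer(passage, scores, nil_score=5.0))
```

Because Nil shares the normalization with the spans, a single argmax decides both whether to answer and what to answer, in contrast to the pipeline and threshold-based baselines mentioned in the abstract.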
