Murali Annavaram

2025

Estimating Privacy Leakage of Augmented Contextual Knowledge in Language Models
James Flemings | Bo Jiang | Wanrong Zhang | Zafar Takhirov | Murali Annavaram
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language models (LMs) rely on their parametric knowledge augmented with relevant contextual knowledge for certain tasks, such as question answering. However, the contextual knowledge can contain private information that may be leaked when answering queries, and estimating this privacy leakage is not well understood. A straightforward approach of directly comparing an LM’s output to the contexts can overestimate the privacy risk, since the LM’s parametric knowledge might already contain the augmented contextual knowledge. To this end, we introduce context influence, a metric that builds on differential privacy, a widely-adopted privacy notion, to estimate the privacy leakage of contextual knowledge during decoding. Our approach effectively measures how each subset of the context influences an LM’s response while separating the specific parametric knowledge of the LM. Using our context influence metric, we demonstrate that context privacy leakage occurs when contextual knowledge is out of distribution with respect to parametric knowledge. Moreover, we experimentally demonstrate how context influence properly attributes the privacy leakage to augmented contexts, and we evaluate how factors– such as model size, context size, generation position, etc.– affect context privacy leakage. The practical implications of our results will inform practitioners of the privacy risk associated with augmented contextual knowledge.

pdf bib abs

On Using Arabic Language Dialects in Recommendation Systems
Abdulla Alshabanah | Murali Annavaram
Findings of the Association for Computational Linguistics: NAACL 2025

While natural language processing (NLP) techniques have been applied to user reviews in recommendation systems, the potential of leveraging Arabic dialects in this context remains unexplored. Arabic is spoken by over 420 million people, with significant dialectal variation across regions. These dialects, often classified as low-resource languages, present both challenges and opportunities for machine learning applications. This paper represents the first attempt to incorporate Arabic dialects as a signal in recommendation systems. We explore both explicit and implicit approaches for integrating Arabic dialect information from user reviews, demonstrating its impact on improving recommendation performance. Our findings highlight the potential for leveraging dialectal diversity in Arabic to enhance recommendation systems and encourage further research at the intersection of NLP and recommendation systems within the Arab multicultural world.

pdf bib abs

KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
Chaoyi Jiang | Lei Gao | Hossein Entezari Zarch | Murali Annavaram
Findings of the Association for Computational Linguistics: ACL 2025

Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) cache is used to store intermediate activations, which significantly lowers the computational overhead for token generation. However, the memory required for the KV cache grows rapidly, often exceeding the capacity of GPU memory. A cost-effective alternative is to offload KV cache to CPU memory, which alleviates GPU memory pressure, but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution, but they are hindered by excessive data movement and dependence on CPU capabilities. Fully overlapping PCIe communication latency gets challenging as the size of the KV cache grows and/or the GPU compute capabilities increase. In this paper, we introduce KVPR, an efficient I/O-aware LLM inference method where the CPU first transfers a partial set of activations, from which the GPU can start recomputing the KV cache values. While the GPU recomputes the partial KV cache, the remaining portion of the KV cache is transferred concurrently from the CPU. This approach overlaps GPU recomputation with KV cache transfer to minimize idle GPU time and maximize inference performance. KVPR is fully automated by integrating a profiler module that utilizes input characteristics and system hardware information, a scheduler module to optimize the distribution of computation and communication workloads, and a runtime module to efficiently execute the derived execution plan. Experimental results show that KVPR achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches. The code is available at https://github.com/chaoyij/KVPR.

pdf bib abs

MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines
Lei Gao | Amir Ziashahabi | Yue Niu | Salman Avestimehr | Murali Annavaram
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Given the sensitive nature of such private data, it is desirable to fine-tune these models on edge devices to improve user trust. However, fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands, as well as limited infrastructure support. We observe that inference engines (e.g., ExecuTorch) can be repurposed for fine-tuning by leveraging zeroth-order (ZO) optimization, which uses multiple forward passes to approximate gradients. While promising, direct application of ZO methods on edge devices is inefficient due to the high computational cost of multiple forward passes required for accurate gradient estimation, and their deployment has been largely unexplored in practice. We introduce MobiZO, a resource-efficient fine-tuning framework for LLMs specifically designed for edge devices. MobiZO combines three key innovations: (1) a parallelized randomized gradient estimator that employs both outer-loop and inner-loop parallelism to eliminate sequential forward passes, (2) a specialized Multi-Perturbed LoRA (MP-LoRA) module that enables efficient realization of both inner and outer loop parallelism, and (3) a seamless integration with ExecuTorch for on-device training, requiring no modifications to the runtime. Experiments demonstrate that MobiZO achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy, paving the way for practical deployment of LLMs in real-time, on-device applications. Code available at: https://github.com/leigao97/MobiZO.

pdf bib abs

Mind the Dialect: NLP Advancements Uncover Fairness Disparities for Arabic Users in Recommendation Systems
Abdulla Alshabanah | Murali Annavaram
Findings of the Association for Computational Linguistics: EMNLP 2025

Recommendation systems play a critical role in shaping user experiences and access to digital content. However, these systems can exhibit unfair behavior when their performance varies across user groups, especially in linguistically diverse populations. Recent advances in NLP have enabled the identification of user dialects, allowing for more granular analysis of such disparities. In this work, we investigate fairness disparities in recommendation quality among Arabic-speaking users, a population whose dialectal diversity is underrepresented in recommendation system research. By uncovering performance gaps across dialectal variation, we highlight the intersection of NLP and recommendation system and underscore the broader social impact of NLP. Our findings emphasize the importance of interdisciplinary approaches in building fair recommendation systems, particularly for global and local platforms serving diverse Arabic-speaking communities. The source code is available at https://github.com/alshabae/FairArRecSys.

2024

pdf bib abs

Differentially Private Next-Token Prediction of Large Language Models
James Flemings | Meisam Razaviyayn | Murali Annavaram
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Ensuring the privacy of Large Language Models (LLMs) is becoming increasingly important. The most widely adopted technique to accomplish this is DP-SGD, which trains a model to guarantee Differential Privacy (DP). However, DP-SGD overestimates an adversary’s capabilities in having white box access to the model and, as a result, causes longer training times and larger memory usage than SGD. On the other hand, commercial LLM deployments are predominantly cloud-based; hence, adversarial access to LLMs is black-box. Motivated by these observations, we present Private Mixing of Ensemble Distributions (PMixED): a private prediction protocol for next-token prediction that utilizes the inherent stochasticity of next-token sampling and a public model to achieve Differential Privacy. We formalize this by introducing RD-mollifers which project each of the model’s output distribution from an ensemble of fine-tuned LLMs onto a set around a public LLM’s output distribution, then average the projected distributions and sample from it. Unlike DP-SGD which needs to consider the model architecture during training, PMixED is model agnostic, which makes PMixED a very appealing solution for current deployments. Our results show that PMixED achieves a stronger privacy guarantee than sample-level privacy and outperforms DP-SGD for privacy 𝜖 = 8 on large-scale datasets. Thus, PMixED offers a practical alternative to DP training methods for achieving strong generative utility without compromising privacy.

pdf bib abs

Ethos: Rectifying Language Models in Orthogonal Parameter Space
Lei Gao | Yue Niu | Tingting Tang | Salman Avestimehr | Murali Annavaram
Findings of the Association for Computational Linguistics: NAACL 2024

Language models (LMs) have greatly propelled the research on natural language processing. However, LMs also raise concerns regarding the generation of biased or toxic content and the potential disclosure of private information from the training dataset. In this work, we present a new efficient approach, Ethos, that rectifies LMs to mitigate toxicity and bias in outputs and avoid privacy leakage. Ethos is built on task arithmetic. However, unlike current task arithmetic algorithms, Ethos distinguishes general beneficial and undesired knowledge when reconstructing task vectors. Specifically, Ethos first obtains a set of principal components from the pre-trained models using singular value decomposition. Then, by projecting the task vector onto principal components, Ethos separates the principal components that encode general from those associated with undesired knowledge. Ethos performs forgetting or unlearning by only negating the task vector with undesired knowledge, thereby minimizing collateral damage on general model utility. We demonstrate the efficacy of our approach on three different tasks: bias, toxicity, and memorization unlearning. Evaluations show Ethos is more effective in removing undesired knowledge while maintaining the overall model performance compared to current task arithmetic methods.

pdf bib abs

Differentially Private Knowledge Distillation via Synthetic Text Generation
James Flemings | Murali Annavaram
Findings of the Association for Computational Linguistics: ACL 2024

Large Language models (LLMs) are achieving state-of-the-art performance in many different downstream tasks. However, the increasing urgency of data privacy puts pressure on practitioners to train LLMs with Differential Privacy (DP) on private data. Concurrently, the exponential growth in parameter size of LLMs necessitates model compression before deployment of LLMs on resource-constrained devices or latency-sensitive applications. Differential privacy and model compression generally must trade off utility loss to achieve their objectives. Moreover, simultaneously applying both schemes can compound the utility degradation. To this end, we propose DistilDP: a novel differentially private knowledge distillation algorithm that exploits synthetic data generated by a differentially private teacher LLM. The knowledge of a teacher LLM is transferred onto the student in two ways: one way from the synthetic data itself– the hard labels, and the other way by the output distribution of the teacher evaluated on the synthetic data– the soft labels. Furthermore, if the teacher and student share a similar architectural structure, we can further distill knowledge by aligning the hidden representations between both. Our experimental results demonstrate that DistilDP can substantially improve the utility over existing baselines, at least 9.0 PPL on the Big Patent dataset, with strong privacy parameters, 𝜖=2. These promising results progress privacy-preserving compression of autoregressive LLMs. Our code can be accessed here: https://github.com/james-flemings/dp_compress.

2022

pdf bib abs

Knowledge graphs (KGs) often represent knowledge bases that are incomplete. Machine learning models can alleviate this by helping automate graph completion. Recently, there has been growing interest in completing knowledge bases that are dynamic, where previously unseen entities may be added to the KG with many missing links. In this paper, we present StATIK–Structure And Text for Inductive Knowledge Completion. StATIK uses Language Models to extract the semantic information from text descriptions, while using Message Passing Neural Networks to capture the structural information. StATIK achieves state of the art results on three challenging inductive baselines. We further analyze our hybrid model through detailed ablation studies.