Swaroop Ramaswamy


2026

Large language model context lengths have grown rapidly in recent years, from 512 tokens in GPT to 2M tokens in Gemini 1.5 Pro. Larger context windows enable models to condition on significantly more input tokens, leading to higher-quality responses for some user prompts. However, longer contexts also pose challenges to system instruction adherence. In this work, we formalize verifiable instructions to evaluate model *compliance* based on clear, measurable criteria. Based on these criteria, we present **VerIFY**, a **Ver**ifiable **I**nstruction **F**ollowing **Y**ardstick dataset designed to benchmark the compliance and accuracy of LLMs in adhering to various types of instructions across multi-turn, long-context conversations. Through experiments with open-source models, we reveal insights into instruction-following failures in long contexts, helping to improve the reliability, safety, and precision of these models. Furthermore, we implement and evaluate six mitigation strategies to enhance instruction compliance in extended contexts, achieving improvements of up to 79%. This is the first work to consider instruction following for multi-turn, long-context conversations.
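The key property of a verifiable instruction is that compliance can be checked programmatically rather than judged subjectively. A minimal sketch of what such checks might look like (the specific instructions and function names here are illustrative, not the dataset's actual checkers):

```python
import re

def check_word_limit(response: str, max_words: int) -> bool:
    """Verify an instruction like 'respond in at most N words'."""
    return len(response.split()) <= max_words

def check_forbidden_word(response: str, word: str) -> bool:
    """Verify an instruction like 'never use the word X'."""
    return re.search(rf"\b{re.escape(word)}\b", response, re.IGNORECASE) is None

def compliance_rate(results: list[bool]) -> float:
    """Fraction of verifiable instructions the model satisfied."""
    return sum(results) / len(results)

response = "The capital of France is Paris."
checks = [
    check_word_limit(response, max_words=10),
    check_forbidden_word(response, word="maybe"),
]
print(compliance_rate(checks))  # → 1.0
```

Because each check returns an unambiguous boolean, compliance can be aggregated across turns of a long conversation without a human or LLM judge in the loop.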

2021

Recent works have shown that language models (LMs), e.g., for next word prediction (NWP), have a tendency to memorize rare or unique sequences in the training data. Since useful LMs are often trained on sensitive data, it is critical to identify and mitigate such unintended memorization. Federated Learning (FL) has emerged as a novel framework for large-scale distributed learning tasks. It differs in many aspects from the well-studied central learning setting, where all the data is stored at a central server and minibatch stochastic gradient descent is used to conduct training. This work is motivated by our observation that NWP models trained under FL exhibit a remarkably lower propensity for such memorization than those trained in the central learning setting. Thus, we initiate a formal study to understand the effect of different components of FL on unintended memorization in trained NWP models. Our results show that several components that distinguish FL play an important role in reducing unintended memorization. First, we discover that the clustering of data according to users—which happens by design in FL—has the most significant effect in reducing such memorization. Using the Federated Averaging optimizer with larger effective minibatch sizes for training causes a further reduction. We also demonstrate that training in FL with a user-level differential privacy guarantee results in models that can provide high utility while being resilient to memorizing out-of-distribution phrases with thousands of insertions across over a hundred users in the training set.
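The Federated Averaging optimizer mentioned above performs local SGD on each user's data and then averages the resulting models, weighted by dataset size. A minimal sketch of one FedAvg round on a toy least-squares task (this is an illustrative reimplementation, not the paper's training code; hyperparameters are arbitrary):

```python
import numpy as np

def fedavg_round(global_w, client_datasets, lr=0.1, local_epochs=1):
    """One round of Federated Averaging: local SGD per client, then a
    size-weighted average of the client models."""
    updates, sizes = [], []
    for X, y in client_datasets:
        w = global_w.copy()
        for _ in range(local_epochs):
            grad = X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
            w -= lr * grad
        updates.append(w)
        sizes.append(len(y))
    # Clients with more data contribute more to the new global model.
    return np.average(np.stack(updates), axis=0, weights=np.array(sizes, dtype=float))

# Toy setup: 5 "users", each holding a private shard of linear-regression data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(5):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.01 * rng.normal(size=50)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(200):
    w = fedavg_round(w, clients)
print(w)  # converges close to [2.0, -1.0]
```

Note that each client's raw examples never leave its shard; only model parameters are averaged, and averaging over many users' updates is one of the FL components the abstract identifies as reducing memorization of any single user's rare phrases.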