Prajjwal Bhargava


2024

Effective Long-Context Scaling of Foundation Models
Wenhan Xiong | Jingyu Liu | Igor Molybog | Hejia Zhang | Prajjwal Bhargava | Rui Hou | Louis Martin | Rashi Rungta | Karthik Abinav Sankararaman | Barlas Oguz | Madian Khabsa | Han Fang | Yashar Mehdad | Sharan Narang | Kshitiz Malik | Angela Fan | Shruti Bhosale | Sergey Edunov | Mike Lewis | Sinong Wang | Hao Ma
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We present an effective recipe to train strong long-context LLMs that are capable of utilizing massive context windows of up to 32,000 tokens. Our models are built through continual pretraining from Llama 2 checkpoints with longer text sequences and on a dataset where long texts are upsampled. We perform extensive evaluation using language modeling, synthetic context probing tasks, and a wide range of downstream benchmarks. Across all evaluations, our models achieve consistent improvements on most regular-context tasks and significant improvements on long-context tasks over Llama 2. Moreover, with a cost-effective instruction tuning procedure that is free of expensive annotation, the presented models can already surpass gpt-3.5-turbo-16k’s overall performance on long-context benchmarks. Alongside these results, we provide an in-depth analysis on each individual component of our method. We delve into Llama’s position encodings and discuss its key limitation in modeling long data. We examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths – ablation results suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.

2022

DiscoSense: Commonsense Reasoning with Discourse Connectives
Prajjwal Bhargava | Vincent Ng
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We present DiscoSense, a benchmark for commonsense reasoning via understanding a wide variety of discourse connectives. We generate compelling distractors in DiscoSense using Conditional Adversarial Filtering, an extension of Adversarial Filtering that employs conditional generation. We show that state-of-the-art pre-trained language models struggle to perform well on DiscoSense, which makes this dataset ideal for evaluating next-generation commonsense reasoning systems.

2021

Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics
Prajjwal Bhargava | Aleksandr Drozd | Anna Rogers
Proceedings of the Second Workshop on Insights from Negative Results in NLP

Much of recent progress in NLU was shown to be due to models’ learning dataset-specific heuristics. We conduct a case study of generalization in NLI (from MNLI to the adversarially constructed HANS dataset) in a range of BERT-based architectures (adapters, Siamese Transformers, HEX debiasing), as well as with subsampling the data and increasing the model size. We report 2 successful and 3 unsuccessful strategies, all providing insights into how Transformer-based models learn to generalize.

2020

Adaptive Transformers for Learning Multimodal Representations
Prajjwal Bhargava
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

The usage of transformers has grown from learning about language semantics to forming meaningful visiolinguistic representations. These architectures are often over-parametrized, requiring large amounts of computation. In this work, we extend adaptive approaches to learn more about model interpretability and computational efficiency. Specifically, we study attention spans, sparse, and structured dropout methods to help understand how their attention mechanism extends for vision and language tasks. We further show that these approaches can help us learn more about how the network perceives the complexity of input sequences, sparsity preferences for different modalities, and other related phenomena.