Santiago Ontanon


2024

MEMORY-VQ: Compression for Tractable Internet-Scale Memory
Yury Zemlyanskiy | Michiel de Jong | Luke Vilnis | Santiago Ontanon | William Cohen | Sumit Sanghai | Joshua Ainslie
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Retrieval augmentation is a powerful but expensive method to make language models more knowledgeable about the world. Memory-based methods like LUMEN (de Jong et al., 2023a) pre-compute token representations for retrieved passages to drastically speed up inference. However, memory also leads to much greater storage requirements from storing pre-computed representations. We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance. Our method uses a vector quantization variational autoencoder (VQ-VAE) to compress token representations. We apply MEMORY-VQ to the LUMEN model to obtain LUMEN-VQ, a memory model that achieves a 16x compression rate with comparable performance on the KILT benchmark. LUMEN-VQ enables practical retrieval augmentation even for extremely large retrieval corpora.

2023

mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences
David Uthus | Santiago Ontanon | Joshua Ainslie | Mandy Guo
Findings of the Association for Computational Linguistics: EMNLP 2023

We present our work on developing a multilingual, efficient text-to-text transformer that is suitable for handling long inputs. This model, called mLongT5, builds upon the architecture of LongT5, while leveraging the multilingual datasets used for pretraining mT5 and the pretraining tasks of UL2. We evaluate this model on a variety of multilingual summarization and question-answering tasks, and the results show stronger performance for mLongT5 when compared to existing multilingual models such as mBART or M-BERT.

CoLT5: Faster Long-Range Transformers with Conditional Computation
Joshua Ainslie | Tao Lei | Michiel de Jong | Santiago Ontanon | Siddhartha Brahma | Yury Zemlyanskiy | David Uthus | Mandy Guo | James Lee-Thorp | Yi Tay | Yun-Hsuan Sung | Sumit Sanghai
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Many natural language processing tasks benefit from long inputs, but processing long documents with Transformers is expensive – not only due to quadratic attention complexity but also from applying feedforward and projection layers to every token. However, not all tokens are equally important, especially for longer documents. We propose CoLT5, a long-input Transformer model that builds on this intuition by employing conditional computation, devoting more resources to important tokens in both feedforward and attention layers. We show that CoLT5 achieves stronger performance than LongT5 with much faster training and inference, achieving SOTA on the long-input SCROLLS benchmark. Moreover, CoLT5 can effectively and tractably make use of extremely long inputs, showing strong gains up to 64k input length.

2022

FNet: Mixing Tokens with Fourier Transforms
James Lee-Thorp | Joshua Ainslie | Ilya Eckstein | Santiago Ontanon
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that “mix” input tokens. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to the “efficient Transformers” on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
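The token-mixing idea in this abstract can be illustrated with a small sketch. It assumes the mixing sublayer applies a discrete Fourier transform along the hidden dimension, then along the sequence dimension, and keeps only the real part; the function names `dft` and `fourier_mix` are hypothetical, and the naive O(n^2) transform is used only to keep the example dependency-free, not because it reflects the authors' implementation.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a 1-D sequence (O(n^2))."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def fourier_mix(x):
    """Mix a (seq_len x hidden) matrix with a 2-D DFT and keep the real part.

    Assumption: this stands in for the self-attention sublayer; it has no
    learned parameters.
    """
    # Transform each token's feature vector (hidden dimension).
    rows = [dft(row) for row in x]
    # Transform each feature column across tokens (sequence dimension).
    cols = [dft(list(col)) for col in zip(*rows)]
    # cols is indexed [hidden][seq]; transpose back and drop the imaginary part.
    return [[cols[h][s].real for h in range(len(cols))] for s in range(len(x))]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0]]  # 2 tokens, hidden size 2
    print(fourier_mix(tokens))
```

Every output entry depends on every input token, which is the "mixing" role that self-attention otherwise plays. A practical implementation would use a fast Fourier transform rather than this quadratic loop.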

Making Transformers Solve Compositional Tasks
Santiago Ontanon | Joshua Ainslie | Zachary Fisher | Vaclav Cvicek
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Several studies have reported the inability of Transformer models to generalize compositionally, a key type of generalization in many NLP tasks such as semantic parsing. In this paper, we explore the design space of Transformer models, showing that the inductive biases given to the model by several design decisions significantly impact compositional generalization. We identify Transformer configurations that generalize compositionally significantly better than previously reported in the literature in many compositional tasks. We achieve state-of-the-art results in a semantic parsing compositional generalization benchmark (COGS), and a string edit operation composition benchmark (PCFG).

LongT5: Efficient Text-To-Text Transformer for Long Sequences
Mandy Guo | Joshua Ainslie | David Uthus | Santiago Ontanon | Jianmo Ni | Yun-Hsuan Sung | Yinfei Yang
Findings of the Association for Computational Linguistics: NAACL 2022

Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present LongT5, a new model that explores the effects of scaling both the input length and model size at the same time. Specifically, we integrate attention ideas from long-input transformers (ETC), and adopt pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC’s local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization and question answering tasks, as well as outperform the original T5 models on these tasks. We have open sourced our architecture and training code, as well as our pre-trained model checkpoints.

2021

Improving Compositional Generalization in Classification Tasks via Structure Annotations
Juyong Kim | Pradeep Ravikumar | Joshua Ainslie | Santiago Ontanon
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Compositional generalization is the ability to generalize systematically to a new data distribution by combining known components. Although humans seem to have a great ability to generalize compositionally, state-of-the-art neural models struggle to do so. In this work, we study compositional generalization in classification tasks and present two main contributions. First, we study ways to convert a natural language sequence-to-sequence dataset to a classification dataset that also requires compositional generalization. Second, we show that providing structural hints (specifically, providing parse trees and entity links as attention masks for a Transformer model) helps compositional generalization.

2020

ETC: Encoding Long and Structured Inputs in Transformers
Joshua Ainslie | Santiago Ontanon | Chris Alberti | Vaclav Cvicek | Zachary Fisher | Philip Pham | Anirudh Ravula | Sumit Sanghai | Qifan Wang | Li Yang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Transformer models have advanced the state of the art in many Natural Language Processing (NLP) tasks. In this paper, we present a new Transformer architecture, “Extended Transformer Construction” (ETC), that addresses two key challenges of standard Transformer architectures, namely scaling input length and encoding structured inputs. To scale attention to longer inputs, we introduce a novel global-local attention mechanism between global tokens and regular input tokens. We also show that combining global-local attention with relative position encodings and a “Contrastive Predictive Coding” (CPC) pre-training objective allows ETC to encode structured inputs. We achieve state-of-the-art results on four natural language datasets requiring long and/or structured inputs.