Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

We provide the first exploration of sentence embeddings from text-to-text transformers (T5), including the effects of scaling sentence encoders up to 11B parameters. Sentence embeddings are broadly useful for language processing tasks. While T5 achieves impressive performance on language tasks, it is unclear how to produce sentence embeddings from encoder-decoder models. We investigate three methods to construct Sentence-T5 (ST5) models: two that use only the T5 encoder and one that uses the full T5 encoder-decoder. We establish a new sentence representation transfer benchmark, SentGLUE, which extends the SentEval toolkit to nine tasks from the GLUE benchmark. Our encoder-only models outperform the previous best models on both SentEval and SentGLUE transfer tasks, including semantic textual similarity (STS). Scaling ST5 up from millions to billions of parameters is shown to consistently improve performance. Finally, our encoder-decoder method achieves a new state of the art on STS when using sentence embeddings.


Introduction
Sentence embeddings provide compact meaning representations that are broadly useful for a variety of language processing tasks, including classification, question answering, semantic retrieval, bitext mining, and semantic similarity. Sentence embedding models have been trained using a variety of methods, including: supervised tasks such as natural language inference (Conneau et al., 2017; Gao et al., 2021); semi-structured data such as question-answer pairs; translation pairs (Yang et al., 2020a; Feng et al., 2020); and adjacent sentence pairs (Kiros et al., 2015; Logeswaran and Lee, 2018). Recent work has shown that scaling up model parameters and leveraging pre-trained models (Devlin et al., 2019; Liu et al., 2019) are two effective approaches to improve performance (Reimers and Gurevych, 2019, 2020; Yang et al., 2020b; Gao et al., 2021).

We explore sentence embeddings from a new family of pre-trained models: the Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020). Unlike encoder-only models, which use a transformer encoder to predict random masked tokens, T5 uses an encoder-decoder architecture and a generative span corruption pre-training task. T5 models can be scaled up to hundreds of billions of parameters (Fedus et al., 2021) and have achieved state-of-the-art performance on a broad range of NLP tasks including GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). However, it is difficult to efficiently apply T5 to some tasks such as retrieval or clustering. To score retrieval candidates, T5 would need to perform full inference with cross-attention on each query-candidate pair. In contrast, sentence embeddings allow for efficient retrieval and clustering (Gillick et al., 2018; Reimers and Gurevych, 2019; Yang et al., 2020a).

As shown in fig. 2, we explore three ways of turning a pre-trained T5 encoder-decoder model into a sentence embedding model: (i) using the first token representation of the encoder; (ii) averaging all token representations from the encoder; (iii) using the first token representation from the decoder. We evaluate the quality of the resulting sentence embeddings on sentence transfer tasks using SentEval (Conneau and Kiela, 2018) and on semantic textual similarity (Agirre et al., 2012, 2013, 2014, 2015, 2016; Cer et al., 2017). We contrast raw representations from pre-trained T5 models with those learned through fine-tuning on natural language inference (NLI) and Retrieval Question-Answering (ReQA) (Ahmad et al., 2019) using dual encoders and contrastive learning (Conneau et al., 2017; Gao et al., 2021). We introduce a multi-stage contrastive learning recipe involving fine-tuning first on ReQA and then on NLI. Finally, we investigate scaling our T5 sentence embedding model up to 11B parameters. As shown in fig. 1, larger models consistently yield better sentence embeddings. To our knowledge, we are the first to study using large-scale pre-trained text-to-text models for sentence representation learning and to scale sentence embedding models up to 11 billion parameters.
We summarize our contributions as follows: (i) even without fine-tuning, encoder-only ST5 models perform well on sentence transfer tasks, outperforming state-of-the-art fine-tuned models such as SimBERT and SimRoBERTa (Gao et al., 2021); (ii) encoder-decoder sentence embedding models achieve strong performance on STS, establishing a new state of the art for sentence embedding-based STS; (iii) contrastive learning is effective for fine-tuning sentence encoders from T5-style pre-trained models, particularly with our proposed two-stage contrastive learning approach; (iv) training ST5 longer and with more data using a contrastive loss leads to consistent improvements on both sentence transfer and STS tasks. We name our model Sentence T5 (ST5).

Text-to-Text Transfer Transformers (T5)
The Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020) is gaining popularity due to its competitive performance and the ease with which it solves a variety of tasks as simple text-to-text mapping problems. As shown in fig. 2a, T5 consists of an encoder-decoder transformer model (Vaswani et al., 2017) pre-trained on an unsupervised span corruption task. Though T5 has been successfully applied to numerous NLP tasks, how to extract high-quality text representations from it remains unexplored.

Model Architecture
In this work we explore three strategies to extract sentence representations from T5, as shown in figs. 2b to 2d:
• Encoder-only first (ST5-Enc first): The encoder output of the first token is taken as the sentence embedding.
• Encoder-only mean (ST5-Enc mean): The sentence embedding is defined as the average of the encoder outputs across all input tokens.
• Encoder-Decoder first (ST5-EncDec first): The first decoder output is taken as the sentence embedding. To obtain the decoder output, the input text is fed into the encoder, and the standard "start" symbol is fed as the first decoder input.
The first two are pooling strategies widely used in encoder-only pre-trained models such as BERT. Unlike BERT, however, T5 does not place a CLS token at the beginning of each sentence. For the T5 encoder-decoder model, we assume the decoder is aware of the semantics of the entire input sentence when generating its first token prediction; if so, the first decoder output embedding (i.e., the input to the softmax layer) might naturally capture the sentence semantics.
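To make the three strategies concrete, the sketch below shows how each pooled representation could be computed from encoder (or first-step decoder) outputs. The `decode_step` helper and the exact tensor layout are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the three sentence-representation strategies (section 3.1).
# `decode_step` is a hypothetical stand-in for a single T5 decoder step; only
# the tensor shapes matter for this illustration.
import jax.numpy as jnp

def first_token_pooling(enc_out):
    # enc_out: [batch, seq_len, d_model] -> [batch, d_model]
    return enc_out[:, 0, :]

def mean_pooling(enc_out, mask):
    # mask: [batch, seq_len], 1 for real tokens and 0 for padding.
    summed = jnp.einsum("bsd,bs->bd", enc_out, mask)
    counts = jnp.maximum(mask.sum(axis=-1, keepdims=True), 1.0)
    return summed / counts

def encdec_first_token(enc_out, mask, decode_step, start_id=0):
    # Feed the encoder outputs plus the standard "start" symbol to the decoder
    # and take the first decoder output embedding (the input to the softmax layer).
    start_tokens = jnp.full((enc_out.shape[0], 1), start_id)
    dec_out = decode_step(enc_out, mask, start_tokens)
    return dec_out[:, 0, :]
```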
For sentence encoder training, we adopt a dual-encoder architecture (Gillick et al., 2018; Reimers and Gurevych, 2019). As shown in fig. 3, this architecture consists of two shared-weight transformer modules that encode the inputs. The transformer module can be either an encoder-only or an encoder-decoder architecture. In our experiments, we initialize the transformer modules from the pre-trained T5 models. After each module computes a fixed-length representation for its input sentence, a projection layer and L2 normalization are applied to the resulting embeddings. The projection layer transforms the output to a configurable fixed dimensionality (i.e., the sentence embedding size). The embeddings from paired encoding towers can be scored for similarity tasks using a dot product (because of the L2 normalization, this is equivalent to their cosine similarity) or provided as input to additional layers for pairwise classification tasks (e.g., NLI).
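A minimal sketch of this scoring head is shown below, assuming pooled tower outputs of shape [batch, d_model] and a hypothetical learned projection matrix; it only illustrates the projection, L2 normalization, and dot-product scoring described above.

```python
# Sketch of the dual-encoder head: shared projection to the sentence embedding
# size followed by L2 normalization, so a dot product between towers equals
# cosine similarity. `proj` is a hypothetical learned [d_model, d_emb] matrix.
import jax.numpy as jnp

def embed(pooled, proj):
    e = pooled @ proj                                   # [batch, d_emb]
    return e / jnp.linalg.norm(e, axis=-1, keepdims=True)

def similarity(pooled_a, pooled_b, proj):
    a, b = embed(pooled_a, proj), embed(pooled_b, proj)
    return jnp.sum(a * b, axis=-1)                      # cosine similarity per pair
```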

Contrastive Learning
Applying contrastive learning to sentence embeddings improves the uniformity of the embedding space, leading to better performance on downstream tasks such as STS (Gao et al., 2021). We apply contrastive learning to fine-tune the T5 sentence representations.

Contrastive Loss
Using a contrastive loss to train a sentence encoder requires paired examples $\mathcal{D} = \{(v_i, v_i^+)\}$, where $v_i$ is an input sentence and $v_i^+$ is a related sentence (e.g., one that is semantically close). During training, $v_i^+$ is considered a positive example for $v_i$ and all other examples in the batch are considered negatives. The model should learn to pull the positive example closer to the input example while pushing away the negatives. We operationalize our contrastive loss using an in-batch sampled softmax (Henderson et al., 2017):

$$\mathcal{L} = -\log \frac{e^{\mathrm{sim}(v_i, v_i^+)/\tau}}{\sum_{j \in \mathcal{B}} e^{\mathrm{sim}(v_i, v_j^+)/\tau}},$$

where $\mathrm{sim}$ is the similarity scoring function, $\mathcal{B}$ is a mini-batch of examples, and $\tau$ is the softmax temperature. When additional negatives $v_j^-$ are provided for the input examples, the loss can be computed as:

$$\mathcal{L} = -\log \frac{e^{\mathrm{sim}(v_i, v_i^+)/\tau}}{\sum_{j \in \mathcal{B}} \left( e^{\mathrm{sim}(v_i, v_j^+)/\tau} + e^{\mathrm{sim}(v_i, v_j^-)/\tau} \right)}.$$
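The following is a minimal JAX sketch of the in-batch sampled softmax above, assuming the embeddings have already been L2-normalized so that dot products equal cosine similarities; the variable names and the optional hard-negative handling are illustrative, not the exact training code.

```python
# In-batch sampled softmax contrastive loss. `emb_v` and `emb_pos` hold the
# embeddings of v_i and v_i^+ for one mini-batch; `emb_neg` optionally holds
# additional negatives v_j^-. `tau` is the softmax temperature from the loss.
import jax.numpy as jnp
from jax.nn import log_softmax

def contrastive_loss(emb_v, emb_pos, emb_neg=None, tau=1.0):
    batch = emb_v.shape[0]
    # Cosine similarities of every v_i against every v_j^+ in the batch.
    logits = emb_v @ emb_pos.T / tau                          # [B, B]
    if emb_neg is not None:
        # Append similarities against the extra negatives to the denominator.
        logits = jnp.concatenate([logits, emb_v @ emb_neg.T / tau], axis=1)
    # The positive for example i sits in column i of the first block.
    log_probs = log_softmax(logits, axis=-1)
    return -jnp.mean(jnp.diagonal(log_probs[:, :batch]))
```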

Two-stage Training
To investigate the effect of additional training data, we explore two-stage training: (i) first training on question-answer pairs mined from Community QA sites; (ii) then fine-tuning the model on sentence pairs with human-annotated NLI labels.

Evaluation
We evaluate using SentEval (Conneau and Kiela, 2018), which includes 7 transfer tasks and 7 STS tasks.

Configurations
Our models are implemented using JAX and trained on Cloud TPU-v8. We initialize the dual encoder modules from public T5 checkpoints. During training, we use Adafactor (Shazeer and Stern, 2018) as the optimizer and set the learning rate to 0.001. Linear decay is applied after 10% of the total number of training steps, reducing the learning rate to 0 by the end of training. To fine-tune on NLI we use a batch size of 512, while for the Community QA dataset the batch size is 2048. We use a softmax temperature τ of 100.
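As an illustration of the schedule described above (not the actual training configuration code), a constant-then-linear-decay learning rate could look like the following sketch; `warm_fraction` and the function name are assumptions for exposition.

```python
# Constant learning rate for the first 10% of steps, then linear decay to 0.
def learning_rate(step, total_steps, base_lr=0.001, warm_fraction=0.1):
    flat_steps = int(total_steps * warm_fraction)
    if step <= flat_steps:
        return base_lr
    # Linearly anneal from base_lr to 0 over the remaining steps.
    remaining = max(1, total_steps - flat_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)
```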

Experimental Goals
Our experiments aim to answer the following questions:
• Q1: What is the best way to extract sentence representations from T5?
• Q2: How well do raw T5 sentence embeddings perform on downstream tasks?
• Q3: How much does contrastive fine-tuning, including our two-stage recipe, improve T5 sentence embeddings?
• Q4: Can we benefit from scaling up model capacity for better sentence representations?
With these goals, we study transfer and STS performance of T5 sentence embeddings using a variety of model and training configurations, comparing ST5 to state-of-the-art methods including SBERT/SRoBERTa (Reimers and Gurevych, 2019) and SimCSE (Gao et al., 2021).

Results
Tables 2 and 3 provide performance on transfer and STS tasks, respectively. We compare ST5 models with two types of baselines: (i) a model that extracts sentence embeddings from a pre-trained BERT model, listed in rows 1-2 of each table; (ii) the current state-of-the-art sentence embedding models fine-tuned from BERT or RoBERTa, listed in rows 6-8 of each table.

Results for Raw T5 Sentence Embeddings
We first evaluate the T5 sentence embeddings without fine-tuning. We evaluate all three strategies from section 3.1: (i) Encoder-only first token, (ii) Encoder-only mean, and (iii) Encoder-decoder start token. For all experiments, we use the encoder or decoder outputs from the T5 transformer directly, without doing any projection. This enables us to fully leverage the embedding capacity from the pre-trained models.
Transfer tasks Results for ST5 models using raw embeddings on transfer tasks are shown in rows 3-5 of table 2. Unlike BERT, T5 does not reserve the first token (for either the encoder or the decoder) as a special placeholder (i.e., CLS), and there is no pre-training task that uses the first token's embedding. Therefore, it is unlikely that, without additional fine-tuning, the first token's representation would capture the semantics of the whole sentence. Indeed, our experiments show that the first token's representations from the encoder or decoder are much worse on all SentEval tasks than the mean pooling of the encoder-only model. When mean pooling is applied to T5's encoder outputs, it greatly outperforms the average embeddings of BERT. Notably, even without fine-tuning, the averaged embeddings from T5's encoder outperform SimCSE-RoBERTa, which is fine-tuned on the NLI dataset. This may be because T5 is trained on more data. The original T5 models also included downstream tasks (e.g., GLUE, SuperGLUE) during pre-training, and this multi-task setting may improve transfer performance. However, we note that only two SentEval tasks (SST and MRPC) are included in GLUE while the other five are not. As shown in table 2, we observe significant improvements on the five tasks that are not included.

STS tasks
In contrast, we observe weak results on STS tasks using raw T5 sentence embeddings, as shown in rows 3-5 of table 3. The mean pooling of T5 embeddings achieves an average STS score of 55.97, slightly better than BERT mean pooling but still worse than models fine-tuned on supervised tasks. This is similar to findings about the anisotropy phenomenon of contextual embeddings from other pre-trained language models such as BERT and RoBERTa (Ethayarajh, 2019; Gao et al., 2021). Embedding collapse prevents the model from performing well on distance-related evaluations.
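One rough way to see this collapse, shown below as an illustrative diagnostic rather than a metric used in the paper, is to measure the average cosine similarity between embeddings of unrelated sentences: anisotropic embeddings yield values close to 1, leaving little dynamic range for distance-based comparisons such as STS.

```python
# Rough anisotropy diagnostic (not the paper's metric): mean pairwise cosine
# similarity between embeddings of unrelated sentences.
import jax.numpy as jnp

def mean_pairwise_cosine(emb):
    # emb: [n, d] L2-normalized sentence embeddings of unrelated sentences.
    sims = emb @ emb.T                            # [n, n] cosine similarities
    n = emb.shape[0]
    off_diag = sims.sum() - jnp.trace(sims)       # exclude self-similarity
    return off_diag / (n * (n - 1))               # values near 1.0 suggest collapse
```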

Results for Fine-Tuning T5 Sentence Embeddings
We next evaluate ST5 models that are fine-tuned on NLI tasks using our contrastive loss, starting from pre-trained T5 models. Given that mean pooling performs much better than the first token when using the encoder only, we drop the first-token model when fine-tuning ST5 models. The last three rows of table 2 show that the transfer performance of ST5 models is very consistent across different embedding extraction strategies after fine-tuning. The best fine-tuned model is 0.57 points better than the best raw T5 sentence embeddings.
In table 3, we see that fine-tuning on the NLI dataset significantly improves the STS performance of ST5 compared to the models without fine-tuning. This supports the claim that contrastive learning is effective at mitigating embedding collapse for T5-style models.
To investigate the impact of additional training data on contrastive learning, we experiment with the ST5 models first trained on Community QA and then fine-tuned on NLI. As shown in tables 2 and 3, fine-tuning on an additional dataset brings a large performance boost for both transfer and STS tasks. This suggests that we may be able to improve sentence embedding quality further through the mining of additional semi-structured data for continued contrastive learning.

Encoder-only vs. Encoder-decoder
In this section, we compare the performance of two architectures: encoder-only and encoder-decoder.
Better generalizability for T5's encoder In table 2, we saw that the encoder-only Base model performs on par with the encoder-decoder model on transfer tasks. When we scale the ST5 model up from Base to Large, 3B, and 11B, the encoder-only models consistently outperform the encoder-decoder models on transfer tasks, as shown in table 4. This shows that building ST5 on top of T5's encoder gives strong transfer performance.
Recently, Chung et al. (2021) have shown that larger output embeddings (i.e., a larger embedding size) effectively prevent the encoder from over-specializing to the pre-training task, making the encoder's representations more general and more transferable. We hypothesize that the decoder in the encoder-decoder architecture can improve the generalizability of the encoder's representations in a similar fashion, as the decoder, rather than the encoder, absorbs the specialization required by the pre-training task.
Effectiveness of the decoder In the last two rows of table 3, we observe that the encoder-decoder architecture outperforms encoder-only models for all STS tasks. As we scale up the ST5 model, we also observe improvement on STS tasks. As shown in table 4, the ST5 encoder-decoder Large model outperforms the state-of-the-art model SimCSE-RoBERTa Large, improving the Spearman's correlation score from 83.76 to 84.11.
One explanation is that the additional parameters from the decoder help to improve textual similarity tasks. Another possibility is that the decoder architecture itself helps to improve sentence embedding quality. As shown in fig. 2d, the decoder can be considered an additional attention pooling layer on top of the encoder outputs. As the decoder's weights are lifted from the pre-trained T5 model, the decoder might learn a better way to pool over the encoder outputs than uniform mean pooling.
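The sketch below illustrates this intuition with a single-head, single-query simplification: with only the start token as decoder input, the first cross-attention step behaves like a learned attention pooling over the encoder outputs. The projection matrices are hypothetical stand-ins, not T5's actual multi-head parameters.

```python
# Single-query attention pooling over encoder outputs, a simplified view of
# what the first decoder cross-attention step computes when only the start
# token is fed into the decoder.
import jax.numpy as jnp
from jax.nn import softmax

def attention_pooling(enc_out, query, w_k, w_v):
    # enc_out: [seq, d]; query: [d] (derived from the start-token state);
    # w_k, w_v: [d, d] learned key/value projections (illustrative).
    keys, values = enc_out @ w_k, enc_out @ w_v
    weights = softmax(keys @ query / jnp.sqrt(query.shape[-1]))   # [seq]
    return weights @ values                                       # [d] pooled vector
```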

Scaling up Sentence T5
We leverage the existing checkpoints from large T5 models to study the effect of scaling sentence encoders. The parameters of the T5 models are listed in table 5. Note however that ST5-EncDec doesn't fully leverage the model parameters; the decoder's learned self-attention is effectively ignored as only the start token is fed into the decoder.

Effect on Directly Using T5 Embeddings
As shown in table 4, the performance on the transfer tasks of directly using T5 embeddings consistently improves as T5 scales up. This corroborates that leveraging large pre-trained models can improve transfer performance of sentence embeddings.
On the other hand, increasing the model capacity alone is not enough to mitigate the embedding collapse. Even the embeddings from the T5 11B model still do worse on STS tasks than fine-tuned models. One reason is that the pre-training task of T5 (span corruption as generation) does not require the model to avoid anisotropy (e.g., by using a contrastive loss or regularization). This highlights the importance of choosing fine-tuning tasks that are aligned to the goal of similarity/retrieval performance.

Figure 4: Alignment and uniformity losses for different model sizes (EncDec-Base, EncDec-Large, EncDec-3B). We consider the test split of the STS-B dataset. L_align is calculated considering all pairs with a score greater than 4. L_uniform is computed using all sentences. The colormap denotes the models' Spearman's correlation score.

Improving the ST5 Fine-tuning
As shown in table 4, the encoder-only model achieves an average score of 91.08 on the transfer tasks, better than the 90.45 of the ST5 Large model, while the encoder-decoder model pushes the average STS score to 84.94, which also outperforms the ST5 Large model. This inspires us to explore even larger model sizes to achieve better sentence embedding quality.
For STS tasks, we observe that the gain from increasing model size from 3B to 11B is smaller than that from Large to 3B. This might be due to the fact that we are fixing the embedding size for all model sizes in our experiments. One potential exploration is to increase the sentence embedding size for larger models to fully leverage the model capacity.
We further compute the alignment loss and uniformity loss as defined in Wang and Isola (2020) to measure the quality of the sentence embeddings:

$$\mathcal{L}_{\mathrm{align}} = \mathop{\mathbb{E}}_{(x, x^+) \sim p_{\mathrm{pos}}} \left\lVert f(x) - f(x^+) \right\rVert^2,$$

$$\mathcal{L}_{\mathrm{uniform}} = \log \mathop{\mathbb{E}}_{x, y \sim p_{\mathrm{data}}} e^{-2 \lVert f(x) - f(y) \rVert^2},$$

where $p_{\mathrm{pos}}$ is the distribution of positive pairs and $p_{\mathrm{data}}$ is the data distribution. $\mathcal{L}_{\mathrm{align}}$ denotes the expected distance between embeddings of positive pairs, while $\mathcal{L}_{\mathrm{uniform}}$ indicates how uniformly the embeddings are distributed. For both losses, lower numbers indicate better performance. As shown in fig. 4, as the models scale up, both the encoder-only and encoder-decoder models decrease the uniformity loss with only a slight increase in the alignment loss.
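A direct transcription of the two losses, assuming L2-normalized embeddings and a finite evaluation set standing in for the expectations, might look like this sketch.

```python
# Alignment and uniformity losses (Wang and Isola, 2020). `emb_a`/`emb_b` hold
# positive pairs; `emb_all` holds all sentence embeddings from the evaluation set.
import jax.numpy as jnp

def alignment_loss(emb_a, emb_b):
    # Expected squared distance between embeddings of positive pairs.
    return jnp.mean(jnp.sum((emb_a - emb_b) ** 2, axis=-1))

def uniformity_loss(emb_all):
    # Log of the mean Gaussian potential over all distinct sentence pairs.
    diffs = emb_all[:, None, :] - emb_all[None, :, :]
    sq_dists = jnp.sum(diffs ** 2, axis=-1)            # [n, n]
    w = jnp.exp(-2.0 * sq_dists)
    n = emb_all.shape[0]
    total = w.sum() - jnp.trace(w)                      # drop self-pairs (distance 0)
    return jnp.log(total / (n * (n - 1)))
```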

Training with More Data
We seek to investigate whether the effects of larger model size and more training data are additive for better sentence embeddings. As shown in the last two rows of table 4, when scaling up to Large and 3B parameters, ST5 further improves on downstream tasks by training on the Community QA dataset in addition to NLI.

Model Inference
We run the ST5 encoder-only models on different platforms to investigate the computational cost of inference. Figure 5 summarizes the inference speed for different model sizes, sequence lengths, batch sizes, and platforms. ST5 achieves the fastest inference speed on Cloud TPU-v8. As we increase the batch size, the inference speed improves further. For the 11B model, we achieve a speed of 274 examples per second at sequence length 128 and batch size 1024. This shows the feasibility of deploying such large models on TPU hardware.
We also report the speed on Nvidia Tesla V100 GPU and CPU. The ST5 11B model is able to run on 4 V100 GPUs with sequence length 128 and batch size 1024, achieving an inference speed of 27 examples per second. For CPU, with batch size 512, ST5 11B achieves 0.5 examples per second.
Although the speeds on GPU and CPU are considerably slower than on TPU, the sentence embedding models are much faster than cross-attention based models, whose computation time increases quadratically with the number of examples (e.g., clustering 1,000 sentences requires inference over roughly 1 million sentence pairs).
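The back-of-the-envelope comparison behind this example, under the simplifying assumption of one forward pass per sentence versus one per ordered sentence pair, is shown below.

```python
# Embedding-based clustering needs one encoder pass per sentence; a
# cross-attention scorer needs one pass per ordered sentence pair.
n = 1_000
embedding_passes = n                      # 1,000 forward passes
cross_attention_passes = n * (n - 1)      # 999,000 ~ 1 million sentence pairs
print(embedding_passes, cross_attention_passes)
```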

Conclusion
In this paper, we study effective methods for building T5 sentence encoders (ST5) from pre-trained models. We propose three architectures and a two-stage contrastive learning method to fine-tune ST5. We compare encoder-only and encoder-decoder architectures as sentence encoders and analyze their performance on downstream tasks. Through extensive experiments on the SentEval benchmark, we show that encoder-only models have strong transfer performance while encoder-decoder models perform better on textual similarity tasks. We also demonstrate the effectiveness of scaling up the model size, which greatly improves sentence embedding quality. These findings suggest that future improvements in the scale and quality of pre-trained text-to-text models may translate into further gains for sentence encoder models.