Few-Shot Learning with Siamese Networks and Label Tuning

We study the problem of building text classifiers with little or no training data, commonly known as zero-shot and few-shot text classification. In recent years, an approach based on neural textual entailment models has been found to give strong results on a diverse range of tasks. In this work, we show that with proper pre-training, Siamese Networks that embed texts and labels offer a competitive alternative. These models allow for a large reduction in inference cost: constant in the number of labels rather than linear. Furthermore, we introduce label tuning, a simple and computationally efficient approach that adapts the models in a few-shot setup by changing only the label embeddings. While giving lower performance than model fine-tuning, this approach has the architectural advantage that a single encoder can be shared by many different tasks.


Introduction
Few-shot learning is the problem of learning classifiers with only a few training examples. Zero-shot learning (Larochelle et al., 2008), also known as dataless classification (Chang et al., 2008), is the extreme case, in which no labeled data is used. For text data, this is usually accomplished by representing the labels of the task in a textual form, which can either be the name of the label or a concise textual description.
In recent years, there has been a surge in zero-shot and few-shot approaches to text classification. One approach (Yin et al., 2019, 2020; Halder et al., 2020; Wang et al., 2021) makes use of entailment models. Textual entailment (Dagan et al., 2006), also known as natural language inference (NLI) (Bowman et al., 2015), is the problem of predicting whether a textual premise implies a textual hypothesis in a logical sense. For example, Emma loves apples implies that Emma likes apples.
The entailment approach for text classification sets the input text as the premise and the text representing the label as the hypothesis. An NLI model is applied to each input pair and the entailment probability is used to identify the best matching label.
In this paper, we investigate an alternative based on Siamese Networks (SN) (Bromley et al., 1993), also known as dual encoders. These models embed both input and label texts into a common vector space. The similarity of the two items can then be computed using a similarity function such as the dot product. The advantage is that input and label text are encoded independently, which means that the label embeddings can be pre-computed. Therefore, at inference time, only a single call to the model per input is needed. In contrast, the models typically applied in the entailment approach are Cross Attention (CA) models which need to be executed for every combination of text and label. On the other hand, they allow for interaction between the tokens of label and input, so that in theory they should be superior in classification accuracy. However, in this work we show that in practice, the difference in quality is small.
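To make the inference-cost argument concrete, the scoring step with pre-computed label embeddings can be sketched in a few lines. The `toy_encode` bag-of-words encoder below is a stand-in purely for illustration; in the paper the encoder is a transformer-based sentence embedding model.

```python
import numpy as np

def precompute_label_embeddings(encode, label_texts):
    """Embed the label texts once per task; at inference time the
    labels never need to be re-encoded."""
    return np.stack([encode(t) for t in label_texts])

def classify(encode, texts, label_embs):
    """One encoder call per input text; scoring all K labels is a
    single matrix product, so the encoder cost per input is constant
    in the number of labels."""
    X = np.stack([encode(t) for t in texts])   # (N, d)
    scores = X @ label_embs.T                  # (N, K) dot-product similarity
    return scores.argmax(axis=1)

# Toy bag-of-words "encoder", only to make the sketch runnable.
VOCAB = ["good", "great", "bad", "awful"]

def toy_encode(text):
    words = text.lower().split()
    return np.array([float(words.count(w)) for w in VOCAB])
```

A cross attention model, in contrast, would have to run once per (text, label) pair inside `classify`.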
Both CA models and SNs support the few-shot learning setup by fine-tuning on a small number of labeled examples. This is usually done by updating all parameters of the model, which in turn makes it impossible to share the models between different tasks. In this work, we show that when using a SN, one can decide to fine-tune only the label embeddings. We call this Label Tuning (LT). With LT the encoder can be shared between different tasks, which greatly eases the deployment of this approach in a production setup. LT comes with a certain drop in quality, but this drop can be compensated for by using a variant of knowledge distillation (Hinton et al., 2014).
Our contributions are as follows: We perform a large study on a diverse set of tasks showing that CA models and SNs yield similar performance for both zero-shot and few-shot text classification.

Figure 1: Overview of training and inference with Label Tuning (LT). At training time, input and label texts (hypotheses) are processed by the encoder. LT then tunes the labels using a cross entropy (CE) loss. At inference time, the input text is passed through the same encoder. The tuned label embeddings and a similarity function are then used to score each label. The encoder remains unchanged and can be shared between multiple tasks.
In contrast to most prior work, we show that these results can also be achieved for languages other than English. We compare the hypothesis patterns commonly used in the literature with using the plain label name (identity hypothesis) and find that on average there is no significant difference in performance. Finally, we present LT as an alternative to full fine-tuning that allows using the same model for many tasks and thus greatly increases the scalability of the method. We will release the code 1 and trained models used in our experiments.

Methodology
Figure 1 explains the overall system. We follow Reimers and Gurevych (2019) and apply symmetric Siamese Networks that embed both input and label texts using a single shared encoder. The encoder consists of a transformer (Vaswani et al., 2017) that produces contextual token embeddings and a mean pooler that combines the token embeddings into a single text embedding. We use the dot product as the similarity function. We experimented with cosine similarity but did not find it to yield significantly better results.
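The mean pooler described above can be sketched as follows; the exact handling of padding is an assumption on our part (here, via an attention mask, as is common for sentence transformer models):

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Average contextual token embeddings into one text embedding,
    ignoring padding positions.

    token_embs:     (seq_len, d) contextual token embeddings
    attention_mask: (seq_len,) 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)   # (seq_len, 1)
    summed = (token_embs * mask).sum(axis=0)       # (d,)
    count = mask.sum()
    return summed / np.maximum(count, 1.0)         # guard against empty input
```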
As discussed, we can directly apply this model to zero-shot text classification by embedding the input text and a textual representation of the label. For the label representation we experiment with a plain verbalization of the label, or identity hypothesis, as well as the hypotheses or prompts used in the related work.

1 https://tinyurl.com/label-tuning

Fine-Tuning
In the case of few-shot learning, we need to adapt the model based on a small set of examples. In gradient-based few-shot learning we attempt to improve the similarity scores for a small set of labeled examples. Conceptually, we want to increase the similarity between every text and its correct label and decrease the similarity for every other label. As the objective we use the so-called batch softmax (Henderson et al., 2017):

L = −1/B · Σ_{i=1..B} [ S(x_i, y_i) − log Σ_{j=1..B} exp S(x_i, y_j) ]

where B is the batch size and S(x, y) = f(x) · f(y) the similarity between input x and label text y under the current model f. All other elements of the batch are used as in-batch negatives. To this end, we construct the batches so that every batch contains exactly one example of each label. Note that this is similar to a typical softmax classification objective. The only difference is that f(y_i) is computed during the forward pass and not as a simple parameter look-up.
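The batch softmax objective can be sketched directly from this definition. The function below (an illustrative name of ours) takes the B×B matrix of pairwise similarities for one batch, where the diagonal holds the correct (text, label) pairs and the off-diagonal entries act as in-batch negatives:

```python
import numpy as np

def batch_softmax_loss(S):
    """Batch softmax loss (Henderson et al., 2017).

    S: (B, B) matrix with S[i, j] the similarity between input i and
    the label text of example j. Correct pairs are on the diagonal.
    """
    # Row-wise log-normalizer, computed stably.
    row_max = S.max(axis=1, keepdims=True)
    log_norm = row_max.squeeze(1) + np.log(np.exp(S - row_max).sum(axis=1))
    # Negative mean log-probability of the correct (diagonal) pairs.
    return -(np.diag(S) - log_norm).mean()
```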

Label Tuning
Regular fine-tuning has the drawback of requiring updates to the weights of the complete network. This results in slow training and large memory requirements for every new task, which in turn makes it challenging to deploy new models at scale. As an alternative, we introduce label tuning, which does not change the weights of the encoder. The main idea is to first pre-compute label embeddings for each class and later tune them using a small set of labeled examples. Formally, we have a training set containing N pairs of an input text x_i and its reference label index z_i. We pre-compute a matrix of the embedded input texts and embedded labels, X ∈ R^{N×d} and Y ∈ R^{K×d}, respectively, where d is the embedding dimension and K the size of the label set. We now define the score for every input and label combination as S = X × Y^T (S ∈ R^{N×K}) and tune it using cross entropy:

L = −1/N · Σ_{i=1..N} log ( exp(S_{i,z_i}) / Σ_{k=1..K} exp(S_{i,k}) )

To avoid overfitting, we add a regularizer that penalizes moving too far from the initial label embeddings Y_0, namely ‖Y_0 − Y‖_F, where ‖·‖_F is the Frobenius norm.2 Additionally, we implement a version of dropout by masking some of the entries in the label embedding matrix at each gradient step. To this end, we sample a random vector r of dimension d whose components are 0 with probability p_dropout and 1 otherwise. We then multiply this vector component-wise with each row of the label embedding matrix Y. The dropout rate and the strength of the regularizer are two hyper-parameters of the method; the others are the learning rate of the stochastic gradient descent and the number of steps. Following Logan IV et al. (2021), we tune them using 4-fold cross-validation on the few-shot training set. Note that the only information to be stored for each tuned model are the d-dimensional label embeddings.
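A compact NumPy sketch of label tuning follows, with hand-derived gradients. The function name, the hyper-parameter defaults, and the use of the squared Frobenius distance (which has a simple gradient) are illustrative choices of ours, not the paper's exact settings:

```python
import numpy as np

def label_tune(X, Y0, z, lr=0.1, steps=200, reg=0.01, dropout=0.0, seed=0):
    """Tune only the label embeddings; the encoder is untouched.

    X:  (N, d) pre-computed input embeddings
    Y0: (K, d) initial (pre-computed) label embeddings
    z:  (N,) gold label indices
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    K = Y0.shape[0]
    Y = Y0.copy()
    T = np.eye(K)[z]                           # (N, K) one-hot targets
    for _ in range(steps):
        # Dropout: zero out some embedding dimensions of every label.
        mask = (rng.random(Y.shape[1]) >= dropout).astype(float)
        Ym = Y * mask
        # Softmax cross entropy over S = X Ym^T.
        S = X @ Ym.T
        S -= S.max(axis=1, keepdims=True)      # numerical stability
        P = np.exp(S)
        P /= P.sum(axis=1, keepdims=True)
        # Gradient of the cross entropy w.r.t. Y, plus the regularizer
        # pulling Y back toward Y0.
        grad = ((P - T).T @ X / N) * mask[None, :] + 2 * reg * (Y - Y0)
        Y -= lr * grad
    return Y
```

At inference time, the tuned matrix simply replaces the pre-computed label embeddings; nothing about the encoder changes.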

Knowledge Distillation
As mentioned, label tuning produces less accurate models than real fine-tuning. We find that this can be compensated for by a form of knowledge distillation (Hinton et al., 2014). We first train a normal fine-tuned model and use it to produce label distributions for a set of unlabeled examples. This silver set is then used to train the new label embeddings for the untuned model. This increases the training cost of the approach and adds the requirement of unlabeled data, but keeps the advantage that at inference time we can share one model across multiple tasks.
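The distillation step amounts to replacing the one-hot targets of label tuning with the teacher's predicted distributions. A minimal sketch (function names and the temperature parameter are illustrative assumptions of ours):

```python
import numpy as np

def distill_targets(teacher_scores, temperature=1.0):
    """Soft targets from the fine-tuned teacher on unlabeled texts.

    teacher_scores: (N, K) label scores from the fine-tuned model.
    Returns row-wise probability distributions (the silver labels).
    """
    S = teacher_scores / temperature
    S = S - S.max(axis=1, keepdims=True)       # numerical stability
    P = np.exp(S)
    return P / P.sum(axis=1, keepdims=True)

def soft_cross_entropy(student_scores, targets):
    """Cross entropy between student scores (N, K) and the teacher's
    soft targets (N, K); this replaces the one-hot loss of label tuning."""
    S = student_scores - student_scores.max(axis=1, keepdims=True)
    logP = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return -(targets * logP).sum(axis=1).mean()
```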

Related Work
Pre-trained Language Models (LMs) have been shown to encode knowledge that, with task-specific guidance, can solve natural language understanding tasks (Petroni et al., 2019). Leveraging that, Le Scao and Rush (2021) quantified that prompting reduces the need for labeled data by hundreds of instances with respect to traditional fine-tuning approaches (Devlin et al., 2019; Liu et al., 2019). This has led to quality improvements in zero-shot and few-shot learning.
Semantic Similarity methods go back to Gabrilovich and Markovitch (2007). The closest approach to ours is the label refinement method of Chu et al. (2021); we evaluate it against more extensive and diverse benchmarks. In addition, we show that pre-training few-shot learners on their proposed textual similarity task, NatCat, underperforms pre-training on NLI datasets.
Prompt-based methods GPT-3 (Brown et al., 2020), a 175-billion-parameter LM, has been shown to give good quality on few-shot learning tasks. Pattern-Exploiting Training (PET) (Schick and Schütze, 2021) is a more computation- and memory-efficient alternative. It is based on ensembles of smaller masked language models (MLMs) and was found to give few-shot results similar to GPT-3. Logan IV et al. (2021) reduced the complexity of finding optimal templates in PET by using null prompts and achieved competitive performance. They incorporated BitFit (Ben-Zaken et al., 2021) and thus reached comparable accuracy while fine-tuning only 0.1% of the parameters of the LMs. Hambardzumyan et al. (2021) present a contemporary approach with an idea similar to label tuning. As in our work, they use label embeddings initialized with the verbalization of the label names. These task-specific embeddings, along with additional ones that are inserted into the input sequence, are the only learnable parameters during model training. They optimize a cross entropy loss between the label embeddings and the output head of an MLM. The major difference is that they employ a prompt-based approach while our method relies on embedding models.
Entailment methods The entailment approach (Yin et al., 2019; Halder et al., 2020) uses the label description to reformulate text classification as textual entailment. The model predicts the entailment probability of every label description. Wang et al. (2021) report results outperforming LM-BFF (Gao et al., 2021), an approach similar to PET.
True Few-Shot Learning Setting Perez et al. (2021) argue that for true few-shot learning, one should not tune parameters on large validation sets or use parameters or prompts that might have been tuned by others. We follow their recommendation and rely on default parameters as well as the hyper-parameters and prompts recommended by Wang et al. (2021), which, according to the authors, were not tuned on the few-shot datasets. For label tuning, we follow Logan IV et al. (2021) and tune parameters with cross-validation on the few-shot training set.

Experimental Setup
In this section we introduce the baselines and datasets used throughout the experiments.

Models
Random The theoretical performance of a random model that uniformly samples labels from the label set.
Word embeddings For the English experiments, we use Word2Vec (Mikolov et al., 2013) embeddings.3 For the multi-lingual experiments, we use FastText (Grave et al., 2018). In all cases we pre-process using the NLTK tokenizer (Bird et al., 2009) and stop-word list, and by filtering out non-alphabetic tokens. Sentence embeddings are computed by averaging the token embeddings.
Char-SVM For the few-shot experiments we implemented a Support Vector Machine (SVM) (Hearst et al., 1998) baseline based on character n-grams (bigrams to five-grams). The model was implemented using the text vectorizer of scikit-learn (Pedregosa et al., 2011).

Cross Attention
The cross attention model was trained using the same code and parameters as above. The model has approx. 280M parameters. We give more details on the NLI datasets in Appendix G.

Siamese Network We also use models based on MPNET for the experiments with the Siamese Networks. paraphrase-mpnet-base-v2 4 is a sentence transformer model (Reimers and Gurevych, 2019) trained on a variety of paraphrasing datasets as well as SNLI and MNLI using a batch softmax loss (Henderson et al., 2017). nli-mpnet-base-v2 5 is identical to the previous model but trained exclusively on MNLI and SNLI and thus comparable to the cross attention model. For the multilingual experiments, we trained a model using the code of the sentence transformers with the same batch softmax objective used for fine-tuning the few-shot models and on the same data we used for training the cross attention model.

Table 1: Dataset statistics (name, task, language, train size, test size, number of labels, token length):

GNAD (Block, 2019) | topic | de | 9,245 | 1,028 | 9 | 279
AG News (Gulli, 2005) | topic | en | 120,000 | 7,600 | 4 | 37
HeadQA (Vilares and Gómez-Rodríguez, 2019) | topic | es | 4,023 | 2,742 | 6 | 15
Yahoo (Zhang et al., 2015) | topic | en | 1,360,000 | 100,000 | 10 | 71
Amazon Reviews (Keung et al., 2020) | reviews | de, en, es | 205,000 | 5,000 | 5 | 25-29
IMDB (Maas et al., 2011) | reviews | en | 25,000 | 25,000 | 2 | 173
Yelp full (Zhang et al., 2015) | reviews | en | 650,000 | 50,000 | 5 | 99
Yelp polarity (Zhang et al., 2015) | reviews | en

Roberta-NatCat
For comparison with the related work, we also trained a model based on RoBERTa (Liu et al., 2019) and fine-tuned on the NatCat dataset as discussed in Chu et al. (2021) using the code 6 and parameters of the authors.

Datasets
We use a number of English text classification datasets used in the zero-shot and few-shot literature (Yin et al., 2019; Gao et al., 2021; Wang et al., 2021). In addition, we use several German and Spanish datasets for the multilingual experiments. Table 1 provides more details. These datasets cover a number of common text classification tasks such as topic classification, sentiment and emotion detection, and review rating. However, we also included some less well-known tasks such as acceptability, whether an English sentence is deemed acceptable by a native speaker, and subjectivity, whether a statement is subjective or objective. As some datasets do not have a standard split we split them randomly using a 9/1 ratio.

Hypotheses
We use the same hypotheses for the cross attention model and for the Siamese network. For Yahoo and Unified we use the hypotheses from Yin et al. (2019). For SUBJ, COLA, TREC, Yelp, AG News and IMDB we use the same hypotheses as Wang et al. (2021). For the remaining datasets we designed our own hypotheses. These were written in an attempt to mirror what has been done for other datasets and they have not been tuned in any way. Appendix B shows the patterns used. We also explored using an identity hypothesis, that is, the raw label names as the label representation, and found this to give similar results.

6 https://github.com/ZeweiChu/ULR

Fine-Tuning
Inspired by Wang et al. (2021), we investigate fine-tuning the models with 8, 64 and 512 examples per label. For fine-tuning the cross attention models we follow the literature (Wang et al., 2021) and create examples of every possible combination of input text and label. The example corresponding to the correct label is labeled as entailed while all other examples are labeled as refuted. We then fine-tune the model using stochastic gradient descent and a cross-entropy loss. We use a learning rate of 1e-5, a batch size of 8 and run the training for 10 epochs. As discussed in Section 2.1, for the Siamese Networks every batch contains exactly one example of every label and therefore the batch size equals the number of labels of the task. We use a learning rate of 2e-5, and of 2e-4 for the BitFit experiments. Appendix D contains additional information on the hyper-parameters used.
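The one-example-per-label batch construction for the Siamese Networks can be sketched as follows (function name and shuffling details are illustrative assumptions of ours):

```python
import random
from collections import defaultdict
from itertools import cycle

def label_balanced_batches(examples, num_batches, seed=0):
    """Yield batches containing exactly one example per label, so that
    every other in-batch example acts as a negative.

    examples: list of (text, label) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)
    # Shuffle once per label, then cycle so small few-shot sets can
    # fill any number of batches.
    iters = {}
    for label, texts in by_label.items():
        rng.shuffle(texts)
        iters[label] = cycle(texts)
    for _ in range(num_batches):
        yield [(next(iters[label]), label) for label in sorted(by_label)]
```

The batch size therefore always equals the number of labels of the task, as described above.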
We use macro F1-score as the evaluation metric. We run all experiments with 5 different training sets and report the mean and standard deviation. For the zero-shot experiments, we estimate the standard deviation using bootstrapping (Koehn, 2004). In all cases, we use Welch's t-test 7 with a p-value of 0.05 to establish significance (following Logan IV et al., 2021).
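The bootstrap estimate of the standard deviation can be sketched in a few lines; the function name, resample count, and metric interface are illustrative choices of ours:

```python
import numpy as np

def bootstrap_std(metric, y_true, y_pred, n_resamples=1000, seed=0):
    """Estimate the standard deviation of an evaluation metric by
    resampling the test set with replacement (Koehn, 2004)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)       # resample with replacement
        scores.append(metric(y_true[idx], y_pred[idx]))
    return float(np.std(scores))
```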

Results
Here we present the results of our experiments. The two main questions we want to answer are whether Siamese Networks (SN) give comparable results to Cross Attention models (CA) and how well Label Tuning (LT) compares to regular fine-tuning. Table 2 shows results comparing SN with CA and various baselines. As discussed above, the SN and CA models are based on the MPNET architecture and trained on SNLI and MNLI. For the zero-shot setup (n=0) we see that all models out-perform the random baseline on average. The word embedding baselines and RoBERTa-NatCat perform significantly worse than random on several of the datasets. In contrast, the SN and CA models only perform worse than random on COLA. The SN outperforms the CA on average, but the results for the individual datasets are mixed. The SN is significantly better for 4, significantly worse for 4 and on par for the remaining 3 datasets. Regarding the use of a hypothesis pattern from the literature versus just an identity hypothesis (IH), we find that, while there are significant differences on individual datasets, the IH setup shows higher but still comparable (within 1 point) average performance.

7 https://en.wikipedia.org/wiki/Welch%27s_t-test

Siamese Network and Cross Attention
For the few-shot setup (n={8, 64, 512}), we find that all models out-perform a Char-SVM trained with the same number of instances by a large margin. Comparing SN and CA, we see that CA outperforms the SN on average, but with a difference within the confidence interval. For n=8 and n=64, CA significantly outperforms SN on 3 datasets and performs comparably on the remaining 8. For n=512, we see an even more mixed picture. CA is on par with SN on 6 datasets, outperforms it on 3 and is out-performed on 2. We can conclude that for the English datasets, SN is more accurate for zero-shot while CA is more accurate for few-shot. The average difference is small in both setups and we do not see a significant difference for most datasets. Table 3 shows the multi-lingual experiments. The RoBERTa XLM models were pre-trained on data from more than 100 languages and fine-tuned on NLI data covering 15 languages, which, for the languages other than English, explains why quality is lower than for the English-only experiments. For the zero-shot scenario, all models out-perform the random baseline on average, but with a smaller margin than for the English-only models. The FastText baseline performs comparably to CA on average (26.0 vs 27.2), while SN is ahead by a large margin (27.2 vs 32.4). The differences between models with hypotheses and the identity hypothesis (IH) are smaller than for the English experiments.
Looking at the few-shot scenarios, we see that both models out-perform the Char-SVM by a large margin. In general, the results are closer than for the English experiments, also in the number of datasets with significant differences (only 2-4 datasets). As for English, we can conclude that at the multilingual level, SN is more accurate in the zero-shot scenario whereas CA performs better in the few-shot one. However, for few-shot we see only small average differences (less than 1 point except for n=64). Table 4 shows a comparison of different fine-tuning approaches on the English datasets. Appendix H contains the multi-lingual results and gives a similar picture. We first compare with Label Refinement (LR) as discussed in Chu et al. (2021) (see Section 3). Recall that this approach makes use of unlabeled data. We find that in the zero-shot scenario LR gives an average improvement of more than 2 points, significantly out-performing the baseline (mpnet) for 7 of the 11 datasets. When combining LR with labeled data as discussed in Chu et al. (2021), we find this to give only modest improvements over the zero-shot model (e.g., 54.0 (zero-shot) vs 55.8 (n=8)). Note that we apply LR to the untuned model, while Chu et al. (2021) proposed to apply it to a tuned model. However, we find that this gives only small improvements over an already tuned model (mpnet (FT) vs. mpnet (FT+LR)). Also, in this work we are interested in approaches that do not change the initial model, so that it can be shared between tasks to improve scalability. Label Tuning (LT) improves results as n grows and out-performs LR and the Char-SVM baseline from Table 2.

Label Tuning
Comparing regular Fine-Tuning (FT) and BitFit, we find them to perform quite similarly both on average and on individual datasets, with only a few exceptions, such as the performance difference on TREC for the n=8 setup. Compared with FT and BitFit, LT is significantly out-performed on most datasets. The average difference in performance is around 5 points, which is comparable to using 8 times less training data.
Using the knowledge distillation approach discussed before (LT-DIST), we find that for 8 and 64 examples, most of the difference in performance can be recovered while still keeping the high scalability. For n=8, we only find a significant differ-

Analysis
We analyze the performance of the Cross Attention (CA) and Siamese Network-based (SN) models. Unless otherwise noted, the analysis was run over all datasets and languages. The inference time of the SN is independent of the number of labels. This shows that Siamese Networks have a huge advantage at inference time, especially for tasks with many labels. Table 6 shows the average F1 scores for different token lengths. To this end the data was grouped into bins of roughly equal size. SN has an advantage for shorter sequences (≤ 44 tokens), while CA performs better for longer texts (> 160 tokens). Table 7 shows an analysis based on whether the text does or does not contain negation markers. We used an in-house list of 23 phrases for German and Spanish and 126 for English. For emotion detection and review tasks, both models perform better on the subset without negations. However, while SN outperforms CA on the data without negations, CA performs better on the data with negations. The same trend does not hold for the sentiment datasets. These are based on Twitter and thus contain shorter and simpler sentences. For these sentiment datasets we also found that both models struggle to predict the neutral class. CA classifies almost every neutral tweet as positive or negative. SN predicts the neutral class regularly but still with a relatively high error rate. Appendix E contains further analysis showing that label set size, language and task do not have a visible effect on the difference in accuracy of the two models.

Conclusion
We have shown that Cross Attention (CA) and Siamese Networks (SN) give comparable results for zero-shot and few-shot text classification across a diverse set of tasks and multiple languages. The inference cost of SNs is low, as label embeddings can be pre-computed and, in contrast to CA, inference does not scale with the number of labels. We also showed that tuning only these label embeddings (Label Tuning (LT)) is an interesting alternative to regular Fine-Tuning (FT). LT gets close to FT performance when combined with knowledge distillation and when the number of training samples is low, i.e., for realistic few-shot learning. This is relevant for production scenarios, as it allows sharing the same model among tasks. BitFit, in contrast, requires roughly 60 times more memory to add a new task: for a 418 MB mpnet-base model, BitFit affects 470 kB of the parameters, while LT applied to a task with 10 labels and an embedding dimension of 768 requires 7.5 kB. The main disadvantage of BitFit, however, is that the weight sharing it requires is much harder to implement, especially in highly optimized environments such as NVIDIA Triton. Therefore we think that LT is an interesting alternative for fast and scalable few-shot learning.