SSMix: Saliency-Based Span Mixup for Text Classification

Data augmentation with mixup has been shown to be effective on various computer vision tasks. Despite its great success, applying mixup to NLP tasks has been hindered by the fact that text consists of discrete tokens with variable length. In this work, we propose SSMix, a novel mixup method whose operation is performed on the input text rather than on hidden vectors as in previous approaches. SSMix synthesizes a sentence that preserves the locality of the two original texts through span-based mixing, while keeping more of the tokens relevant to the prediction by relying on saliency information. With extensive experiments, we empirically validate that our method outperforms hidden-level mixup methods on a wide range of text classification benchmarks, including textual entailment, sentiment classification, and question-type classification. Our code is available at https://github.com/clovaai/ssmix.


Introduction
Data augmentation has gained popularity in natural language processing (NLP) (Feng et al., 2021) due to the expensive cost of data collection. Some approaches are based on simple rules (Wei and Zou, 2019) or models (Edunov et al., 2018; Ng et al., 2020) that generate similar text. Augmented samples are trained jointly with the original samples, either in the standard way or with advanced training methods (Zhu et al., 2019; Park et al., 2021). On the other hand, mixup (Zhang et al., 2018) interpolates input texts and labels for augmentation.
Training with mixup and its variants has become a popular regularization method in computer vision to improve the generalization of neural networks. Mixup approaches are categorized into input-level mixup (Yun et al., 2019; Kim et al., 2020; Walawalkar et al., 2020; Uddin et al., 2021) and hidden-level mixup (Verma et al., 2019) depending on the location of the mix operation. Input-level mixup is a more prevalent approach than hidden-level mixup because of its simplicity and its ability to capture locality, leading to better accuracy.

Figure 1: Illustration of SSMix. Two data samples x_A and x_B are labeled negative and positive, respectively, for a sentiment classification task. For each token, saliency maps are visualized, where a darker color means a higher contribution to the corresponding label. We select the least salient span from x_A and replace it with the most salient span from x_B. The output is x̃ = mixup(x_A, x_B). We also assign ỹ by the mixup ratio λ. In this example, λ is set to 0.2 as the span length is 2 out of 10.
Applying mixup in NLP is more challenging than in computer vision because of the discrete nature of text data and variable sequence lengths. Therefore, most previous attempts at mixup for text (Guo et al., 2019; Chen et al., 2020) apply mixup to hidden vectors such as embeddings or intermediate representations. However, input-level mixup may have an advantage over hidden-level mixup, following a similar intuition from computer vision. This motivates us to examine input-level mixup approaches for text data.
In this work, we propose SSMix (Fig. 1), a novel input-level span-wise mixup method that considers the saliency of spans. First, we conduct mixup by replacing a span of contiguous tokens with a span from another text, inspired by CutMix (Yun et al., 2019), to preserve the locality of the two source texts in the mixed text. Second, we select the span to be replaced and the span to replace it based on saliency information, so that the mixed text contains tokens more related to the output prediction, which may be semantically important. Our input-level method differs from hidden-level mixup methods in that, while those methods linearly interpolate the original hidden vectors, ours mixes tokens at the input level, resulting in a nonlinear output. In addition, we utilize saliency values to select a span from each sentence and discretely define the span length and mixup ratio, which is not possible at the hidden level.
SSMix has empirically proven effective through extensive experiments on a wide range of text classification benchmarks. In particular, we show that input-level mixup methods generally outperform hidden-level methods. We also show, via an ablation study, the importance of using saliency information and of restricting token selection to the span level when conducting our method.

SSMix
We propose SSMix to synthesize a new text x̃ by replacing a span x_A^S of one text x_A with a span x_B^S from another text x_B, based on saliency information. We also set a new label ỹ for x̃ using y_A and y_B, the one-hot labels corresponding to x_A and x_B, respectively. Consequently, we can additionally use this generated virtual sample (x̃, ỹ) for training.
Saliency Saliency measures how much each portion of the data (in our case, each token) affects the final prediction. Gradient-based methods (Simonyan et al., 2013; Li et al., 2016) are widely used for saliency computation. We compute the gradient of the classification loss L with respect to the input embedding e and use its magnitude as the saliency, i.e., s = ||∂L/∂e||_2. We apply the L2 norm to obtain the magnitude of the gradient vector, which becomes the saliency of each token, similar to PuzzleMix (Kim et al., 2020).
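The per-token saliency computation can be sketched as follows; the gradient vectors here are toy values standing in for a real backward pass through the classifier, and the function name is ours, not from the paper's code:

```python
import math

def token_saliency(token_grads):
    """Return the saliency of each token as the L2 norm of the gradient of
    the classification loss w.r.t. that token's input embedding."""
    return [math.sqrt(sum(g * g for g in grad)) for grad in token_grads]

# Toy per-token gradients for a 3-token input with embedding dimension 2.
grads = [[0.3, 0.4], [0.0, 0.1], [1.2, 0.5]]
saliency = token_saliency(grads)
print(saliency)  # ≈ [0.5, 0.1, 1.3]
```

In practice the gradients come from one extra forward/backward pass of the model, which is where SSMix's additional computation cost (Section on training details) originates.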
Mixing text Text data x_A and x_B are discrete token sequences. Using the saliency scores explained above, we find the least salient span in x_A with length l_A, denoted x_A^S, and the most salient span in x_B with length l_B, denoted x_B^S. We set l_A = l_B = max(min([λ_0|x_A|], |x_B|), 1) given a prior mixup ratio λ_0. Then, the final x̃ becomes the concatenation of x_A^L, x_B^S, and x_A^R, where x_A^L and x_A^R are the tokens located to the left and the right of x_A^S, respectively, in the original text x_A.
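A minimal sketch of this span selection and replacement on token lists, under the assumption that [·] denotes rounding; the function and variable names are ours, and the actual implementation operates on subword tokens in batches:

```python
def best_span(saliency, length, lowest):
    """Start index of the contiguous span of `length` tokens whose total
    saliency is the lowest (lowest=True) or highest (lowest=False)."""
    totals = [sum(saliency[i:i + length])
              for i in range(len(saliency) - length + 1)]
    pick = min if lowest else max
    return pick(range(len(totals)), key=totals.__getitem__)

def ssmix(x_a, sal_a, x_b, sal_b, lam0=0.1):
    """Replace the least salient span of x_a with the most salient span of
    x_b, using the same span length l_A = l_B for both texts."""
    l = max(min(round(lam0 * len(x_a)), len(x_b)), 1)
    i = best_span(sal_a, l, lowest=True)    # span to discard in x_A
    j = best_span(sal_b, l, lowest=False)   # span to inject from x_B
    return x_a[:i] + x_b[j:j + l] + x_a[i + l:]

x_a = ["the", "movie", "was", "dull", "and", "boring"]
sal_a = [0.1, 0.5, 0.2, 0.9, 0.3, 0.8]
x_b = ["a", "truly", "great", "film"]
sal_b = [0.1, 0.4, 0.9, 0.6]
mixed = ssmix(x_a, sal_a, x_b, sal_b)
print(mixed)  # ['great', 'movie', 'was', 'dull', 'and', 'boring']
```

Here the least salient token of x_A ("the") is swapped for the most salient token of x_B ("great"), yielding x̃ = x_A^L + x_B^S + x_A^R.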
Same span length We set the lengths of the original (l_A) and replacing (l_B) spans to be the same, since allowing spans of different lengths would result in redundant and ambiguous mixup variations. In addition, calculating the mixup ratio between spans of different lengths would be overly complex. This same-size replacement strategy is also adopted by Yun et al. (2019) and Uddin et al. (2021). Given equal span lengths, our method maximizes the effect of saliency: since SSMix does not restrict the positions of the spans, we can pick the most salient span from one text and use it to replace the least salient span of the other.
Mixing label We set the mixup ratio λ for the label as λ = |x_B^S|/|x̃|. Since λ is recalculated by counting the number of tokens in the span, it may differ from λ_0. We set the label of x̃ to ỹ = (1 − λ)y_A + λy_B. Algorithm 1 shows how we utilize the original sample pairs to compute the mixup loss for augmented samples. We calculate the cross-entropy loss of the augmented output logits with respect to the original target label of each sample and combine the two losses by a weighted sum, similar to the original implementation of Zhang et al. (2018).1 Applying SSMix is therefore independent of the total number of labels in the classification dataset: on any dataset, the output label is a linear combination of the two original labels.
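The weighted-sum loss can be sketched as follows for a single example; this is a simplified pure-Python stand-in for the model's logits, and the helper names are ours:

```python
import math

def softmax_xent(logits, target):
    """Cross-entropy of one logit vector against an integer class target."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def mixup_loss(logits, y_a, y_b, lam):
    """Weighted sum of CE against both original targets, equivalent to CE
    against the soft label (1 - lam) * y_A + lam * y_B."""
    return ((1 - lam) * softmax_xent(logits, y_a)
            + lam * softmax_xent(logits, y_b))

# Span of 1 token replaced in a 6-token mixed text -> lam = 1/6.
lam = 1 / 6
loss = mixup_loss([2.0, 0.0], y_a=0, y_b=1, lam=lam)
```

Because the loss is expressed against the two integer targets rather than a dense label vector, it works unchanged for any number of classes, matching the claim above.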
Paired sentence tasks For tasks that take a pair of texts as input, such as textual entailment and similarity classification, we conduct mixup in a pairwise manner and calculate the mixup ratio by aggregating the token counts of each mixup result. Given paired inputs x = (p, q), we define the mixup of paired sentence data as x̃ = (mixup(p_A, p_B), mixup(q_A, q_B)).
Here, we set the mixup ratio on paired sentence tasks as λ = (|p_S| + |q_S|)/(|p| + |q|), where p_S and q_S are the replacing spans of the two independent mixup operations. An illustration is available in Appendix B.3.
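As a worked sketch, the paired ratio is plain token counting (the helper name is ours; the counts below match the example in Appendix B.3):

```python
def paired_ratio(p_span_len, q_span_len, p_len, q_len):
    """Mixup ratio for a sentence pair: replaced tokens over total tokens
    across both sentences of the mixed pair."""
    return (p_span_len + q_span_len) / (p_len + q_len)

# One replaced token in each sentence of a 5-token and a 6-token pair.
lam = paired_ratio(1, 1, 5, 6)
print(round(lam, 2))  # 0.18
```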
Experimental Setup

Dataset
As listed in Table 1, to evaluate the effectiveness of SSMix, we perform experiments on various text classification benchmarks: six datasets from the GLUE benchmark (Wang et al., 2018), TREC (Li and Roth, 2002; Hovy et al., 2001), and ANLI (Nie et al., 2020). Two of them are single-sentence classification tasks, and six are sentence-pair classification tasks. All datasets are obtained from the HuggingFace datasets library.2 For GLUE, we use SST-2 (Socher et al., 2013), MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), RTE (Bentivogli et al., 2009), MRPC (Dolan and Brockett, 2005), and QQP.3 Among the GLUE tasks, we leave out those not evaluated by accuracy, along with WNLI, whose size is too small to show any general trend of effectiveness.
TREC is a commonly used dataset for evaluating mixup methods in sentence classification (Guo et al., 2019; Thulasidasan et al., 2019). We use two versions of TREC (coarse and fine) with different numbers of labels to test how the effectiveness of mixup depends on the number of class labels. In addition, we use ANLI to see how mixup can help improve model robustness. For training on ANLI, we concatenate the training data from all rounds and use them to train the model.

Baseline
We compare SSMix with three baselines: (1) standard training without mixup, (2) EmbedMix, and (3) TMix. EmbedMix applies mixup on the embedding layer, similar to wordMixup in Guo et al. (2019), except that their experiments were performed with LSTM or CNN architectures. TMix, borrowed from Chen et al. (2020), interpolates the hidden states of two different inputs at a particular encoder layer and forwards the combined hidden states through the remaining layers. For EmbedMix and TMix, we follow the best settings stated in the original papers: the mixup ratio is set by λ′ ∼ Beta(α, α), λ = max(λ′, 1 − λ′) with α = 0.2. During training with TMix, we randomly sample the mixup layer from {7, 9, 12}.
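The baselines' ratio sampling and hidden-level interpolation can be sketched as follows; this is a pure-Python stand-in for the tensor operations, with names of our choosing:

```python
import random

def sample_ratio(alpha=0.2, rng=random):
    """lambda' ~ Beta(alpha, alpha), then lambda = max(lambda', 1 - lambda'),
    per the baseline settings above."""
    lam = rng.betavariate(alpha, alpha)
    return max(lam, 1 - lam)

def interpolate(h_a, h_b, lam):
    """Hidden-level mixup: elementwise linear interpolation of two vectors
    (embeddings for EmbedMix, one encoder layer's states for TMix)."""
    return [lam * a + (1 - lam) * b for a, b in zip(h_a, h_b)]

lam = sample_ratio()
mixed = interpolate([1.0, 0.0], [0.0, 1.0], 0.75)
print(mixed)  # [0.75, 0.25]
```

Note that, unlike SSMix, the synthetic sample here always lies on the line segment between the two hidden vectors, which is the contrast drawn in Figure 2.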

Ablation study
To investigate how much (1) considering saliency and (2) restricting the mixup operation to the span level individually benefit our proposed method, we conduct an ablation study. We implement SSMix without saliency information (SSMix − saliency), where the spans are randomly selected, and additionally without the span-level restriction (SSMix − saliency − span). For SSMix − saliency − span, we randomly sample tokens from x_B, which need not form a contiguous span, and conduct the replacement on a per-token basis. We then replace tokens with the position of each token preserved, meaning that the second token of x_A is replaced with the second token of x_B, and so on. For all ablation studies, the lambda values were set to 0.1 to compare the methods under the same setting as SSMix. Detailed implementations and illustrations of the ablation methods, along with a comparison with simple word dropout methods, are described in Appendix B.
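A sketch of the SSMix − saliency − span variant (position-preserving random token replacement); the names are ours, and the span-length formula is reused from the main method under the same rounding assumption:

```python
import random

def token_level_mix(x_a, x_b, lam0=0.1, rng=random):
    """Replace randomly chosen positions of x_A with the tokens at the SAME
    positions of x_B -- no contiguity constraint, no saliency."""
    l = max(min(round(lam0 * len(x_a)), len(x_b)), 1)
    positions = rng.sample(range(min(len(x_a), len(x_b))), l)
    mixed = list(x_a)
    for i in positions:
        mixed[i] = x_b[i]
    return mixed, l / len(mixed)  # mixed tokens and recomputed label ratio

rng = random.Random(0)
mixed, lam = token_level_mix(list("abcdef"), list("uvwxyz"), lam0=0.3, rng=rng)
```

With lam0 = 0.3 on a 6-token input, two positions are swapped and the label ratio is 2/6, regardless of which positions the RNG picks.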

Training Details
Across all experiments, we use the sequence classification task with the pre-trained BERT-base model (110M parameters) from the HuggingFace Transformers library.4 We perform all experiments with five different seeds (0 to 4) on a single NVIDIA P40 GPU and report the average score. We use a maximum sequence length of 128 and a batch size of 32, with the AdamW optimizer (eps of 1e-8, weight decay of 1e-4). We use a linear scheduler with warmup for 10% of the total training steps. We update the best checkpoint by measuring validation accuracy every 500 steps. For datasets with fewer than 500 steps per epoch, we update and validate every epoch.
Considering our objective of enhancing performance through mixup, we conduct training in two steps. We first train without mixup with a learning rate of 5e-5 for three epochs, and then train with mixup starting from the previous training's best checkpoint, with a learning rate of 1e-5 for five epochs. This two-step training, which is also utilized by Zhang et al. (2018), speeds up model convergence. We report the best accuracy between training with and without mixup. For the ANLI task, we select the best checkpoint for training without mixup separately for each round, then conduct training with mixup and report the best accuracy on each round's evaluation dataset.
For each iteration, we split the batch into two smaller batches of the same size, A and B. Since the mixup operation in SSMix is not symmetric, we conduct mixup in both directions so that mixup performance is evaluated regardless of the position of the data in the batch. To prevent the training data distribution from drifting too far from the original data distribution, we train with and without mixup together, as in He et al. (2019). As a result, at each step we forward with the average loss over A, B, mixup(A, B), and mixup(B, A).
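The per-step loss described above can be sketched as follows; the loss and mix functions are placeholders for the model's forward pass and SSMix, and the dummy stand-ins at the bottom exist only to make the sketch executable:

```python
def step_loss(loss_fn, mix_fn, batch_a, batch_b):
    """Average the losses over both halves of the batch and both mixup
    directions, since SSMix's mix operation is not symmetric."""
    losses = [
        loss_fn(batch_a),
        loss_fn(batch_b),
        loss_fn(mix_fn(batch_a, batch_b)),
        loss_fn(mix_fn(batch_b, batch_a)),
    ]
    return sum(losses) / len(losses)

# Dummy stand-ins: "loss" is the batch length, "mix" keeps the first batch.
loss = step_loss(lambda b: float(len(b)), lambda a, b: a, [1, 2, 3], [4, 5])
print(loss)  # 2.5
```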
We leave out tokens specific to the transformer architecture (e.g., [CLS], [SEP]) when conducting mixup, to preserve these special symbols. As stated by Zhang et al. (2018), too high a mixup ratio may lead to underfitting, while a λ close to 0 has the same effect as feeding the non-augmented original data. From our experiments, we found that augmentation with the prior ratio λ_0 = 0.1 is optimal.
In terms of computation time, SSMix takes about twice the training time of the other mixup methods, since it needs an additional forward and backward pass to compute the saliency of tokens.

4 https://github.com/huggingface/transformers

Figure 2: Visualization of original data and data synthesized by hidden-level mixup (EmbedMix or TMix) and by SSMix in the hidden space. Black dots indicate the original data, x_A and x_B. For hidden-level mixup, synthetic data (x̃) are created only along the line (blue) connecting the two points, since they are linear combinations within the hidden space. SSMix, however, explores a larger synthetic sample space for x̃, since it consists of discrete combinations within the input space. Synthetic data for SSMix are illustrated as pink dots.
Among the hidden-level mixup methods, TMix takes slightly longer to train than EmbedMix.

Results and Discussion

Table 2 shows our results. We investigate the effectiveness of SSMix compared with hidden-level mixup methods with respect to dataset size, number of class labels, and paired sentence tasks.

Dataset size Compared with hidden-level mixup methods, SSMix fully demonstrates its effectiveness on datasets with a sufficient amount of data. Since SSMix is a discrete rather than linear combination of two data samples, it creates samples in a larger synthetic space than hidden-level mixup (Fig. 2). We hypothesize that a large amount of data helps build a better representation of this synthetic space.
The number of class labels SSMix is especially effective on datasets with many class labels (TREC, ANLI, MNLI, QNLI). Accordingly, the accuracy gain of SSMix over training without mixup is much higher on TREC-fine (47 labels) than on TREC-coarse (6 labels): +3.56 versus +0.52, respectively. We hypothesize that this result originates from the mixup characteristic that cross-label mixup is more beneficial than mixup within the same label, as stated in Zhang et al. (2018).5 Since datasets with more class labels increase the probability of selecting a cross-label pair when randomly sampling mixup sources, we assert that mixup performance increases on such datasets.
Paired sentence tasks SSMix has a competitive advantage on paired sentence tasks, such as textual entailment and similarity classification. We suspect this accuracy gain originates from the consideration of individual tokens. Existing (hidden-level) mixup methods apply mixup on a hidden layer without regard for special tokens, i.e., [SEP] and [CLS]. These methods may lose information about the start of the sentence or the proper separation of the sentence pair. In contrast, SSMix can take individual token properties into account when applying mixup. In particular, our mixup strategy on paired data (Section 2) preserves the property of [SEP], which is not guaranteed by hidden-level mixup.

Ablation Study
The results of SSMix and its variants demonstrate that performance improves as we add the span constraint and saliency information. Adding the span constraint to the mixup operation benefits from better localizability, while using saliency means that the most salient spans are more related to their corresponding labels, and the discarded least salient spans are more likely to be semantically unimportant with respect to the original labels. Of the two, introducing saliency information contributes relatively more to accuracy than the span constraint.

Conclusion
We present SSMix, a novel and simple input-level mixup method for text data that improves regularization ability, leading to better performance in text classification. SSMix preserves the locality of the mixed texts by replacing at the span level and keeps the most discriminative tokens in the mixed text using saliency scores. Throughout our experiments, we show that our method improves performance on various types of text classification tasks. For future work, we plan to apply SSMix to a broader range of tasks, including generation, and to different scenarios such as semi-supervised learning.
A Accuracy Variance

We also report the accuracy variance among the five seeds for each experiment (Table A.1).

B.1 Implementation of ablation methods

Fig. 3 and Fig. 4 illustrate the different variants of SSMix and random [UNK] replacement with λ = 0.2. Fig. 5 illustrates how the augmented output and the corresponding lambda are computed by SSMix for paired sentence tasks. The saliency maps are visualized such that a darker color means a higher contribution to the corresponding label. Here, we describe in detail how we implement SSMix without saliency (Fig. 3 (b)) and SSMix without saliency and span restriction (Fig. 3 (c)). In normal training, only the two real data samples (x_A and x_B) are used to train the model. For Fig. 3 (b), we randomly select a span from each of x_A and x_B. Then, we replace the span of x_A with the span of x_B to make a new datum x̃. For Fig. 3 (c), input-level mixup is conducted on a per-token basis. After calculating l given the prior mixup ratio, we randomly sample tokens from x_A. The tokens need not form a contiguous span. Then, we replace tokens with the position of each token preserved, meaning that the second token of x_A is replaced with the second token of x_B, the sixth token of x_A with the sixth token of x_B (in the illustrated example), and so on.

We also compare SSMix with simple word dropout methods, which may seem similar in that they also create noisy sentences. The difference is whether label mixup is performed. An illustration of the implementation of random [UNK] replacement is available in Fig. 4. Random [UNK] replacement is similar to word dropout. We do not use x_B when making synthetic samples (l = 0). Instead, we randomly sample a set of tokens from x_A and replace each token in that set with [UNK]. The process is similar to Fig. 3 (c), except that the selected tokens in x_A are replaced with [UNK]. Another difference is that the output label (ỹ) completely follows the origin (y_A) and no label mixup is performed.

B.2 Comparison with other simple augmentation methods
We evaluate the random [UNK] replacement method on all datasets alongside SSMix and its variants in the ablation study. The experimental results in Table B.1 show that input-level mixup methods generally outperform simple regularization methods. This means that the datasets synthesized by SSMix, together with the corresponding target vectors, provide more gain in generalization ability than word dropout.

Figure 5: Illustration of applying SSMix to make x̃ for paired sentences, in particular NLI tasks, which classify whether the relation of a sentence pair is entailment, neutral, or contradiction. Mixup is conducted individually, sentence by sentence.

B.3 Illustration of SSMix on paired sentence tasks
Fig. 5 shows an illustrated example for a paired sentence. Here, "Fun for only children." and "Fun for adults and children." correspond to p_A and q_A; "Problems in data synthesis." and "Issues in data synthesis." correspond to p_B and q_B; and "Problems for only children." and "Fun for issues and children." correspond to p̃ and q̃, respectively. λ is calculated as λ = (|p_S| + |q_S|)/(|p| + |q|) = (1 + 1)/(5 + 6) = 2/11 ≈ 0.18.