Adversarial Mixing Policy for Relaxing Locally Linear Constraints in Mixup

Mixup is a recent regularizer for deep classification networks. By training a neural network on convex combinations of pairs of examples and their labels, it imposes locally linear constraints on the model's input space. However, such strict linear constraints often lead to under-fitting, which degrades the effect of the regularization. Notably, this issue becomes more serious when labeled resources are extremely limited. To address these issues, we propose the Adversarial Mixing Policy (AMP), organized in a min-max-rand formulation, to relax the locally linear constraints in Mixup. Specifically, AMP adds a small adversarial perturbation to the mixing coefficients rather than to the examples. Thus, slight non-linearity is injected between the synthetic examples and synthetic labels. By training on these data, the deep networks are further regularized and achieve a lower predictive error rate. Experiments on five text classification benchmarks and five backbone models show empirically that our method reduces the error rate over Mixup variants by a significant margin (up to 31.3%), especially in low-resource conditions (up to 17.5%).


Introduction
Deep classification models have achieved impressive results in both image (He et al., 2016; Dosovitskiy et al., 2020) and language processing (Devlin et al., 2019; Kim, 2014; Wang et al., 2016). One of the most significant challenges in training a deep model is the great effort and cost required to collect large-scale labels. Without sufficient labels, deep networks tend to generalize poorly, leading to unsatisfactory performance. Thus, regularization techniques under the augmentation schema, which generate labeled data to regularize models (Hernández-García and König, 2018), are widely explored (Wei and Zou, 2019; Liu et al., 2021).
Mixup (Zhang et al., 2018) is an effective regularizer under the augmentation schema. In recent years, topics related to Mixup have attracted serious attention (Lee et al., 2020; Archambault et al., 2019; Berthelot et al., 2019a,b). The core idea of Mixup is to generate synthetic training data via a mixing policy, which convexly combines a pair of examples and their labels. By training on these data, the classification networks are regularized to reach higher performance. Unlike conventional regularizers (Srivastava et al., 2014; Hanson and Pratt, 1988; Ioffe and Szegedy, 2015), Mixup imposes a kind of locally linear constraint (Zhang et al., 2018; Guo et al., 2019b) on the model's input space.
However, vanilla Mixup often suffers from under-fitting due to the ambiguous data (Guo et al., 2019b; Guo, 2020; Mai et al., 2021) generated under the strict locally linear constraints. To alleviate the under-fitting, Guo (2020) uses extra parameters to project the inputs and labels into a high-dimensional space where the data can be properly separated. Guo et al. (2019b) and Mai et al. (2021) use auxiliary networks to learn the mixing policy in a data-driven way and thereby avoid generating ambiguous data. Although these works effectively reduce under-fitting, they have limitations in properly regularizing networks: with the extra parameters, current networks become prone to over-fitting, which eventually degrades the effect of regularization. The conflict between over-fitting and under-fitting becomes more serious when labeled resources are rare or hard to obtain. Besides, the methods relying on auxiliary networks are usually difficult to integrate with other Mixup variants. More importantly, Mixup works well in most cases (Guo et al., 2019b); adding too much non-linearity into Mixup would sacrifice the majority of synthetic data that regularize the networks under locally linear constraints. Therefore, the locally linear constraints in Mixup only need to be slightly relaxed.
In this paper, we propose the Adversarial Mixing Policy (AMP) to overcome these limitations. We adapt adversarial training (Goodfellow et al., 2015), which relaxes the linear nature of a network without any extra parameters or auxiliary networks, to relax the locally linear constraints in Mixup. Inspired by the "min-max" formulation of adversarial training, we formulate our method as a form of "min-max-rand" regularization. Specifically, the "rand" operation randomly samples a mixing coefficient, as in vanilla Mixup, to generate a synthetic example and label. Then, the "max" operation calculates a perturbation of the mixing coefficient and applies it. Note that the updated mixing coefficient is only used to re-synthesize the example, keeping the synthetic label unchanged. Thus, slight non-linearity is injected between the synthetic example and label. Finally, the "min" operation minimizes the training loss over the non-linearly generated example-label pairs. In summary, we highlight the following contributions:
• We propose an Adversarial Mixing Policy (AMP) to relax the Locally Linear Constraints (LLC) in Mixup without any auxiliary networks. It can be seamlessly integrated into other Mixup variants thanks to its simplicity.
• To the best of our knowledge, this is the first exploration of the application of adversarial perturbation to the mixing coefficient in Mixup.
• We analyze our proposed method with extensive experiments and show that AMP improves the performance of two Mixup variants under various settings and outperforms non-linear Mixup in terms of error rate.

Linear nature of the networks
Let (x; y) be a sample in the training data, where x denotes the input and y the corresponding label. Deep networks learn a mapping function from x to y:

f(x) = ŷ → y .    (1)

Here, ŷ is the output of the network and → represents the learning process. The linear nature of networks can be interpreted as follows: a small change in the input leads to a corresponding change in the model output,

f(x + ∇x) = ŷ + ∇ŷ .    (2)
Here, ∇x is a small perturbation of x, and ∇ŷ is the change of the output caused by injecting ∇x. This linearity makes the networks vulnerable to adversarial attacks (Goodfellow et al., 2015).

Relax the linear nature
To relax the linear nature of the networks, adversarial training (Goodfellow et al., 2015) forces the networks to learn the following mapping function:

f(x + ∇x) = ŷ → y ,    (3)

where ∇x is a small adversarial perturbation. This kind of training effectively relaxes the linearity of the networks and improves their robustness. However, there exists a trade-off between model robustness (Eq. 3) and generalization (Eq. 1).
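To make this concrete, the following is a minimal PyTorch sketch of one FGSM-style adversarial training step; `model`, `x`, `y`, and `epsilon` are generic placeholders rather than identifiers from any released implementation.

```python
# Minimal sketch of one FGSM-style adversarial training step (Goodfellow et al., 2015).
import torch
import torch.nn.functional as F

def fgsm_training_step(model, x, y, optimizer, epsilon=0.01):
    # Clean forward pass; keep the gradient with respect to the input.
    x = x.clone().detach().requires_grad_(True)
    loss_clean = F.cross_entropy(model(x), y)
    grad_x = torch.autograd.grad(loss_clean, x)[0]

    # Craft the adversarial example: f(x + grad) should still map to y (Eq. 3).
    x_adv = (x + epsilon * grad_x.sign()).detach()

    # Minimize the loss on the adversarial example.
    optimizer.zero_grad()
    loss_adv = F.cross_entropy(model(x_adv), y)
    loss_adv.backward()
    optimizer.step()
    return loss_adv.item()
```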

Locally linear constraints in Mixup
Mixup can be formulated as follows:

m_x(λ) = λ x_1 + (1 − λ) x_2 ,    (4)
m_y(λ) = λ y_1 + (1 − λ) y_2 ,    (5)
f(m_x(λ)) = ŷ → m_y(λ) ,    (6)

where λ ∈ [0, 1] is the mixing coefficient, m is the mixing policy, and (x_1; y_1) and (x_2; y_2) are a pair of examples from the original training data. By training on the synthetic data m_x(λ) and m_y(λ), Mixup (Zhang et al., 2018) imposes the Locally Linear Constraints on the input space of the networks. Different from Eq. 2, this linearity can be formulated as follows:

f(m_x(λ + ∇λ)) = ŷ + ∇ŷ → m_y(λ + ∇λ) .    (7)

Here, ∇λ is a small change in λ, and the output of the networks changes accordingly, which is similar in form to the linear nature of the networks. Under these constraints, a small change in λ often leads to an undesirable change of the output. Eventually, these strict linear constraints lead to under-fitting that degrades the regularization effects (Guo et al., 2019b; Guo, 2020).

Why relax the locally linear constraints
Relaxing the strict linear constraints in Mixup can alleviate under-fitting and therefore improve the regularization effects (Guo, 2020). Under-fitting happens when the synthetic data is corrupted or ambiguous for the network. So, if we can make the networks compatible with such data, in the spirit of the soft margin (Suykens and Vandewalle, 1999), the under-fitting will be eased. Furthermore, such relaxation is best realized without extra parameters. Inspired by adversarial training (Eq. 3), we hypothesize that injecting slight non-linearity into Mixup can relax its constraints without extra parameters:

f(m_x(λ + ∇λ)) = ŷ → m_y(λ) ,    (8)

where ∇λ is an adversarial perturbation injected into the original mixing coefficient λ.

Methodology
As shown in Figure 1, the Adversarial Mixing Policy (AMP) consists of three operations: Rand, Max, and Min. The Rand Operation (RandOp) generates synthetic data by interpolating pairs of training examples and their labels with a random mixing coefficient λ. The Max Operation (MaxOp) injects a small adversarial perturbation into λ to re-synthesize the example while keeping the synthetic label unchanged; this operation injects slight non-linearity into the synthetic data. The Min Operation (MinOp) minimizes the loss on these data. Additionally, we use a simple comparison to eliminate the influence caused by the scaling of the gradients.

Method formulation
Given a training set D = {x_i, y_i} of texts, each sample consists of a sequence of words x_i and a label y_i. A classification model encodes the text into a hidden state and predicts the category of the text. Mixup's objective is to generate an interpolated sample ĝ_k and label ŷ by random linear interpolation with ratio λ applied to a data pair (x_i; y_i) and (x_j; y_j). Our method aims to inject a perturbation ∇λ into λ so as to maximize the loss on the interpolated data, and then to minimize this maximized loss. Inspired by adversarial training, we formulate this problem as a min-max-rand optimization problem:

min_θ  max_{|∇λ| ≤ ε}  E_{(ĝ_ki, ŷ_i) ∈ D̂}  ℓ_mix( f_rand(λ + ∇λ, i, j); θ ) .    (9)

Here, D̂ = {ĝ_ki, ŷ_i} is the synthetic data set generated by f_rand(λ, i, j), ∇λ is the adversarial perturbation of λ, ε is the maximum step size, ℓ_mix(·) is the Mixup loss function, f_rand(·) represents the random interpolation of data and labels, λ is the random mixing coefficient sampled from a Beta distribution with parameter α, i and j are randomly sampled data indexes in D, and k is the mixed layer.
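For illustration, the sketch below shows how one min-max-rand training step could be realized in PyTorch at a mixed layer k, assuming an `encoder` (g_k) and a `classifier` (f_k); all identifiers are illustrative, and the per-example mask of the Min operation is simplified here to a batch-level comparison.

```python
# Illustrative sketch of one AMP ("min-max-rand") training step; not the authors' code.
import numpy as np
import torch
import torch.nn.functional as F

def amp_step(encoder, classifier, x, y_onehot, optimizer, alpha=1.0, epsilon=0.002):
    # ----- Rand: sample lambda and an in-batch pairing (Eqs. 10-13) -----
    lam = torch.tensor(float(np.random.beta(alpha, alpha)),
                       device=x.device, requires_grad=True)
    perm = torch.randperm(x.size(0), device=x.device)
    h = encoder(x)                                   # hidden states g_k(x)
    h1, h2 = h, h[perm]
    y1, y2 = y_onehot, y_onehot[perm]

    def mixup_loss(lam_x):
        # lam_x mixes the hidden states; the labels always use the original lam.
        h_mix = lam_x * h1 + (1.0 - lam_x) * h2
        logp = F.log_softmax(classifier(h_mix), dim=-1)
        ce1 = -(y1 * logp).sum(dim=-1).mean()
        ce2 = -(y2 * logp).sum(dim=-1).mean()
        return lam * ce1 + (1.0 - lam) * ce2

    # ----- Max: perturb lambda in the gradient-ascent direction (Eqs. 14-16) -----
    loss = mixup_loss(lam)
    grad_lam = torch.autograd.grad(loss, lam, retain_graph=True)[0].clamp(-1.0, 1.0)
    lam_adv = (lam + epsilon * grad_lam).detach()

    # ----- Min: keep the larger of the two losses (Eqs. 17-20), batch-level here -----
    loss_adv = mixup_loss(lam_adv)
    final_loss = torch.maximum(loss, loss_adv)
    optimizer.zero_grad()
    final_loss.backward()
    optimizer.step()
    return final_loss.item()
```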

Rand operation
Rand Operation (RandOp) is identical to Mixup (Zhang et al., 2018). It aims to generate random interpolated data between two categories. Specifically, it generates synthetic labeled data by linearly interpolating pairs of training examples as well as their corresponding labels. For a data pair (x_i; y_i) and (x_j; y_j), x denotes the examples and y the one-hot encoding of the corresponding labels.
Consider a model f(x) = f_k(g_k(x)), where g_k denotes the part of the model mapping the input data to the hidden state at layer k, and f_k denotes the part mapping this hidden state to the output f(x). The synthetic data is generated as follows:

λ ~ Beta(α, α) ,    (10)
ĝ_k = λ g_k(x_i) + (1 − λ) g_k(x_j) ,    (11)
ŷ = λ y_i + (1 − λ) y_j ,    (12)

where λ is the mixing coefficient for the data pair, α is the hyper-parameter of the Beta distribution, and ĝ_k is the synthetic hidden state. For efficient computation, the mixing is performed by randomly picking one sample and pairing it with another sample drawn from the same mini-batch (Zhang et al., 2018). To simplify notation, we reformulate the random interpolation f_rand(·) as follows:

f_rand(λ, i, j) = ( f_k(ĝ_k), ŷ ) .    (13)

Here, f_rand(·) takes the results of Equations 10-12 as input and outputs the model prediction f_k(ĝ_k) and the label ŷ. A model trained on the generated data tends to reduce the volatility of its predictions on these data, and thus generalizes better to unseen data.
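A minimal sketch of RandOp for one mini-batch is given below, assuming `hidden` holds g_k(x) and `y_onehot` the one-hot labels; the names are illustrative.

```python
# Sketch of RandOp: in-batch pairing and linear interpolation (Eqs. 10-13).
import numpy as np
import torch

def rand_op(hidden, y_onehot, alpha=1.0):
    lam = float(np.random.beta(alpha, alpha))               # lambda ~ Beta(alpha, alpha)
    perm = torch.randperm(hidden.size(0))                   # pair each sample with another in the batch
    h_mix = lam * hidden + (1.0 - lam) * hidden[perm]       # synthetic hidden state
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]   # synthetic label
    return h_mix, y_mix, lam, perm
```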

Max operation
Max operation (MaxOp) injects a small adversarial perturbation into λ to introduce slight non-linearity between the synthetic example and the synthetic label. This means that the generated synthetic data will no longer strictly follow the Locally Linear Constraints of Mixup. To achieve this, we propose an algorithm, similar to the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015), that injects an adversarial perturbation into λ. It updates λ in the gradient ascent direction:

λ' = λ + ε · ∇λ ,    (14)

where λ' is the slightly perturbed version of the mixing coefficient, ε is the step size, and ∇λ is the clipped (≤ 1) gradient of λ in the gradient ascent direction; the perturbation is therefore a step in the adversarial direction. Different from FGSM (Goodfellow et al., 2015), we add the small perturbation to λ instead of to the input. Besides, since λ is a scalar, we can obtain the adversarial direction and strength directly, so there is no need to normalize ∇λ. We calculate the gradient of λ as follows:

∇λ = clip( ∂L / ∂λ , −1, 1 ) .    (15)

Here, the Mixup loss L is calculated by interpolating the losses on the pair of labels (Zhang et al., 2018):

L = ℓ_mix( f_rand(λ, i, j); θ ) = λ · ℓ_ce( f_k(ĝ_k), y_i ) + (1 − λ) · ℓ_ce( f_k(ĝ_k), y_j ) .    (16)

Here, L represents the loss on the synthetic data generated under the mixing coefficient λ, θ denotes the parameters of the model, ℓ_mix(·) is the Mixup loss, and ℓ_ce(·) is the cross-entropy function. Note that the step size ε may occasionally lead to an undesirable update that decreases the loss instead of increasing it, so we need to eliminate the influence caused by ε.
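The sketch below illustrates MaxOp in PyTorch, assuming a callable `mixup_loss(lam)` that returns the loss of Eq. 16 for a given λ tensor; because λ is a scalar, its raw (clipped) gradient is used directly, without the sign or normalization step of FGSM.

```python
# Sketch of MaxOp: a one-step, FGSM-like update of the scalar mixing coefficient.
import torch

def max_op(mixup_loss, lam_value, epsilon=0.002):
    lam = torch.tensor(float(lam_value), requires_grad=True)  # lambda as a differentiable scalar
    loss = mixup_loss(lam)                                     # Mixup loss of Eq. 16
    grad_lam = torch.autograd.grad(loss, lam)[0]               # dL / d(lambda), ascent direction
    grad_lam = grad_lam.clamp(-1.0, 1.0)                       # clip the gradient (<= 1)
    lam_adv = (lam + epsilon * grad_lam).detach()              # lambda' = lambda + eps * grad (Eq. 14)
    return lam_adv, loss.detach()
```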

Min operation
Min operation (MinOp) minimizes the loss on the constraint-relaxed synthetic data:

L_final = min_θ max( L, L' ) ,    (17)

where L_final is the final loss. That is, MinOp prefers to minimize the larger of the two losses from the previous steps, which eliminates the influence of the step size ε; this preference also helps the model learn from the example with the larger loss and reduces the risk of under-fitting. We realize this operation with a mask-based mechanism:

L_final = mask · L' + (1 − mask) · L ,    (18)

where the mask is used as a selector of losses. The comparison is carried out between the losses before and after updating λ in the synthetic example. The latter loss L' is calculated as follows:

L' = λ · ℓ_ce( f_k(ĝ'_k), y_i ) + (1 − λ) · ℓ_ce( f_k(ĝ'_k), y_j ) ,    (19)

where λ' is the mixing coefficient after injecting the perturbation (we only inject the perturbation into the mixing coefficient of the input, as in Eq. 8), and ĝ'_k = λ' g_k(x_i) + (1 − λ') g_k(x_j) is the synthetic hidden state re-generated under λ'. Note that the λ used for the synthetic label remains unchanged. The mask is calculated as follows:

mask = 1 if δ_L > 0, and 0 otherwise, with δ_L = L' − L ,    (20)

where the mask is a batch-sized vector and δ_L is the direct comparison L' − L. With this mechanism, the proposed method achieves steady improvements under different settings of the step size.
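A minimal sketch of the mask-based selection is shown below, assuming `loss` and `loss_adv` are per-example loss vectors (i.e., computed with reduction='none') under λ and λ', respectively.

```python
# Sketch of MinOp: a per-example mask keeps whichever of the two losses is larger.
import torch

def min_op(loss, loss_adv):
    delta = loss_adv - loss                       # delta_L = L' - L (Eq. 20)
    mask = (delta > 0).float()                    # 1 where the perturbed loss is larger
    final = mask * loss_adv + (1.0 - mask) * loss # Eq. 18
    return final.mean()                           # L_final, minimized by the optimizer
```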

Data
We evaluate the proposed AMP on five sentence classification benchmark datasets, as used in (Guo et al., 2019a). TREC is a question dataset that aims to categorize a question into six types (Li and Roth, 2002). MR is a movie review dataset for classifying positive/negative reviews (Pang and Lee, 2005). SST-1 is the Stanford Sentiment Treebank dataset with five sentiment categories: very positive, positive, neutral, negative, and very negative (Socher et al., 2013). SST-2 is a binary-label version of SST-1. SUBJ is a dataset for judging whether a sentence is subjective or objective (Pang and Lee, 2004). Table 1 summarizes the statistics of the five datasets after preprocessing.

Baselines and Settings
Our AMP is evaluated by integrating it into two recently proposed Mixup variants. We choose five popular sentence classification models as backbones to test the performance of all Mixup variants on the five benchmark datasets. Classification backbones. We test the Mixup variants on five classification backbones. LSTM_rand and LSTM_glove (Wang et al., 2016) are two versions of bi-directional Long Short-Term Memory (LSTM) with attention, where the former uses randomly initialized word embeddings and the latter uses GloVe-initialized (Pennington et al., 2014) word embeddings. CNN_rand and CNN_glove (Kim, 2014) are two versions of convolutional neural networks, fed with randomly initialized and GloVe-initialized word embeddings, respectively. The above four methods are popular sentence classification models without pre-training techniques. We employ BERT_base (Devlin et al., 2019) as the pre-trained classification backbone.
Mixup. We choose three popular Mixup variants for sentence classification as baselines. WordMixup (Guo et al., 2019a) is the straightforward application of Mixup to NLP tasks, where linear interpolation is applied at the word embedding level (the first layer). SentMixup (Sun et al., 2020) applies Mixup to NLP tasks by conducting linear interpolation in the last layer of hidden states. Non-linear Mixup (Guo, 2020) is the non-linear version of SentMixup.
AMP. WordAMP is applied at the word embedding level, the same as WordMixup. SentAMP is applied at the last layer of hidden states, the same as SentMixup.
We obtained the source code of the backbone models from publicly available implementations¹. In our experiments, we follow the implementations and settings of (Kim, 2014; Wang et al., 2016; Devlin et al., 2019; Guo et al., 2019a). Specifically, for the CNN baselines we use filter sizes of 3, 4, and 5, each with 100 feature maps, a dropout rate of 0.5, and L2 regularization of 1e-8. For the LSTM baselines we use a single layer with hidden size 1024, a dropout rate of 0.5, and L2 regularization of 1e-8. For datasets without a standard development set, we randomly select 10% of the training data as the development set. Training is done with Adam (Kingma and Ba, 2015) over mini-batches of size 50 (CNN, LSTM) and 24 (BERT_base), respectively. The learning rate is 2e-4 for CNN and LSTM, and 1e-5 for BERT_base. The word embeddings have 300 dimensions for CNN and LSTM. The step size is ε = 0.002 for all experiments. The α for all Mixup variants is set to one. For each dataset, we train each model 10 times with different random seeds, each run with 8k steps, and report the mean error rates and standard deviations.
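For reference, the hyper-parameters stated above can be summarized as follows; the key names are our own shorthand and do not correspond to the authors' configuration files.

```python
# Hyper-parameter summary of the experimental settings described above.
CONFIG = {
    "cnn":  {"filter_sizes": [3, 4, 5], "feature_maps": 100, "dropout": 0.5,
             "l2": 1e-8, "lr": 2e-4, "batch_size": 50, "embedding_dim": 300},
    "lstm": {"hidden_size": 1024, "layers": 1, "dropout": 0.5,
             "l2": 1e-8, "lr": 2e-4, "batch_size": 50, "embedding_dim": 300},
    "bert_base": {"lr": 1e-5, "batch_size": 24},
    "amp": {"epsilon": 0.002, "alpha": 1.0},
    "training": {"optimizer": "Adam", "runs": 10, "steps": 8000, "dev_split": 0.1},
}
```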

Main results
To evaluate the predictive performance of AMP, we conduct five sets of experiments. For each setting, we compare the performance without Mixup (w/o), with WordMixup (Word), SentMixup (Sent), and non-linear Mixup (non-linear)². As presented in Table 2, AMP outperforms the Mixup comparison baselines.

Table 2: The results of our AMP method compared with two recent Mixup methods on five different datasets under five different classification models. For a fair comparison, we re-implement the Mixup baselines on the backbone models; the results may therefore differ from those reported in (Guo et al., 2019a; Sun et al., 2020). RP indicates the relative improvement. † indicates results cited from (Guo, 2020).

We use different initial embeddings to evaluate the effectiveness of augmentation, as in (Guo et al., 2019a). From the embedding perspective, we have three kinds of embeddings: randomly initialized embeddings (LSTM_rand and CNN_rand), pre-trained fixed embeddings (LSTM_glove and CNN_glove), and pre-trained context-aware embeddings (BERT_base). For each kind of embedding, AMP outperforms the Mixup baselines. For instance, when compared with Sent under randomly initialized embeddings, the proposed method Sent(our) obtains a lower predictive error rate in eight out of ten experiments, while Word(our) outperforms Word in nine out of ten experiments. Similar results can be observed in the pre-trained embedding settings. Even under the context-aware embedding setting (BERT_base), our AMP can further improve over Mixup with this advanced backbone model. Notably, on SST-1, our method helps BERT_base outperform the SOTA model (BERT_large, 44.5) (Munikar et al., 2019), which is twice as large as BERT_base.
The results show the effectiveness of our method.

Low-resource conditions
With low resources, the under-fitting caused by the strict LLC has a serious impact on model generalization. To evaluate the performance of AMP with different amounts of data, particularly in low-resource settings, we scale the size of the dataset by a certain ratio for each category. If a scaled category would contain less than one sample, we retain at least one sample. We randomly generate ten different datasets for each scale ratio and then run the experiment on each dataset, reporting the mean error rate and standard deviation. As shown in Table 3, our method reduces the mean error rate against Mixup by a significant margin. For instance, Sent(our) reduces the error rate over Sent by 17.5% and 14.1% with 3% and 4% of the training data, respectively. As expected, AMP works well in low-resource conditions thanks to its effectiveness in relaxing the LLC in Mixup.
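The per-category sub-sampling described above can be sketched as follows; the function and variable names are illustrative, not taken from the released code.

```python
# Sketch of low-resource sub-sampling: keep a fixed ratio of each category,
# retaining at least one example per category.
import random
from collections import defaultdict

def subsample_per_class(examples, labels, ratio, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ex, lab in zip(examples, labels):
        by_class[lab].append(ex)

    sampled = []
    for lab, items in by_class.items():
        k = max(1, int(len(items) * ratio))   # at least one sample per class
        for ex in rng.sample(items, k):
            sampled.append((ex, lab))
    rng.shuffle(sampled)
    return sampled
```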

Ablation study
To further understand the effects of the Max Operation (MaxOp) and the Min Operation (MinOp) in AMP, we construct several variations of our model. The variations are tested with CNN_glove and BERT_base on TREC. As presented in Table 4, the model trained without augmentation is denoted as Baseline, +RandOp is identical to the model trained with Mixup, and +MaxOp indicates Mixup with the adversarial perturbation of the mixing coefficient added on top of RandOp.

Mix ratio distribution
To analyze the effects of different shapes of the mixing coefficient distribution, we compare Word(our) with Word on BERT_base under four α settings (from 0.2 to 1.5) and three datasets: TREC, SST-2, and MR. The α is the parameter of the Beta distribution and controls the shape of the distribution from which the mixing coefficient λ is drawn. As presented in Table 5, our method achieves lower mean error rates than Word under all α settings. For instance, Word(our) achieves an 8.9% lower mean error rate than Word on SST-2 with α = 0.5. The improvements come mainly from training the models on the slightly non-linear data generated by AMP.

Visualization
To intuitively demonstrate the effects of relaxing the LLC, we visualize the losses of networks trained with our AMP and with Mixup. The synthetic data is generated strictly following the LLC, based on the testing data. A network trained with the relaxed LLC having a smaller loss on such data indicates the effectiveness of our method in alleviating under-fitting. As shown in Figures 2(a), 2(b), and 2(c), we draw the losses on synthetic data generated with mixing coefficients λ ∈ [0, 1]. Figures 2(a) and 2(b) each use one random pair of examples from the testing set. For two random pairs (x_1, y_1), (x_4, y_4) and (x_2, y_2), (x_3, y_3), we calculate the Mixup loss of each pair under different λ to obtain Figures 2(a) and 2(b). One can observe that AMP has a smaller loss than Mixup, which indicates the effectiveness of training on the slightly non-linear synthetic data in the micro view. Figure 2(c) uses the full-size testing set: it shows the average loss over all synthetic data generated from the testing set. We freeze the random seeds so that the data pairs are fixed. Let the testing dataset be X = [(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)]. The synthetic data is generated by λX + (1 − λ)X', where X' = [(x_4, y_4), (x_3, y_3), (x_2, y_2), (x_1, y_1)] is the shuffled X. So, the losses at λ = 0 and λ = 1 are identical, and we obtain a symmetric picture as in Figure 2(c). One can observe that our method achieves a significantly smaller average loss than Mixup in the macro view. These visualizations verify our assumption that relaxing the LLC can further regularize the models.
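The loss-versus-λ curves can be reproduced with a sketch like the one below, assuming the inputs are already embedded so that linear interpolation is meaningful and that `model` maps a mixed input to class logits; all names are placeholders.

```python
# Sketch: for a fixed test pair, sweep lambda over [0, 1] and record the Mixup loss
# of a trained model (one curve of Figure 2(a)/(b)).
import numpy as np
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_curve(model, x1, y1, x2, y2, steps=21):
    # x1, x2: embedded inputs; y1, y2: integer class labels (0-dim tensors).
    lams, losses = np.linspace(0.0, 1.0, steps), []
    for lam in lams:
        lam = float(lam)
        x_mix = lam * x1 + (1.0 - lam) * x2            # synthetic example under the LLC
        logits = model(x_mix.unsqueeze(0))
        loss = lam * F.cross_entropy(logits, y1.view(1)) + \
               (1.0 - lam) * F.cross_entropy(logits, y2.view(1))
        losses.append(loss.item())
    return lams, losses
```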

Related work
Mixup on text classification. Text classification has achieved remarkable improvements underlying several effective paradigms, e.g., CNNs (Kim, 2014), attention-based LSTMs (Wang et al., 2016), GloVe (Pennington et al., 2014), and BERT (Devlin et al., 2019). However, models with large-scale parameters tend to generalize poorly in low-resource conditions. To overcome this limitation, Mixup (Zhang et al., 2018) was proposed as a data augmentation based regularizer. A few studies have explored Mixup (Guo et al., 2019b; Guo, 2020) on NLP tasks. For classification, Guo et al. (2019a) suggest applying Mixup at a particular level of the network, i.e., the word or sentence level. Although these works make promising progress, the mechanism of Mixup still needs to be explored.
Adversarial Training. The min-max formulation of adversarial training has been theoretically and empirically verified (Pang et al., 2020; Archambault et al., 2019; Lee et al., 2020; Miyato et al., 2015, 2017, 2018). Such a training procedure first generates adversarial examples that maximize the training loss and then minimizes the training loss after adding these adversarial examples to the training set. The Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) is an efficient one-step method. Inspired by the min-max formulation of adversarial learning, we organize our method into a min-max-rand formulation.

Conclusion
To relax the Locally Linear Constraints (LLC) in Mixup and thereby alleviate under-fitting, this paper proposes an Adversarial Mixing Policy (AMP). Inspired by adversarial training, we organize our method into a min-max-rand formulation. The proposed method injects slight non-linearity between the synthetic examples and synthetic labels without extra parameters. By training on these data, the networks become compatible with some ambiguous data, which reduces under-fitting; the networks are thus further regularized and reach better performance. We evaluate our method with five popular classification models on five publicly available text datasets. Extensive experimental results show that our AMP achieves a significantly lower error rate than vanilla Mixup (up to 31.3%), especially in low-resource conditions (up to 17.5%).