Unsupervised training data reweighting for natural language understanding with local distribution approximation

One of the major challenges of training Natural Language Understanding (NLU) production models lies in the discrepancy between the distributions of the offline training data and of the online live data, due to, e.g., biased sampling schemes, cyclic seasonality shifts, annotated training data coming from a variety of different sources, and a changing pool of users. Consequently, the model trained on the offline data is biased. We often observe this problem especially in task-oriented conversational systems, where topics of interest and the characteristics of users using the system change over time. In this paper we propose an unsupervised approach to mitigate the offline training data sampling bias in multiple NLU tasks. We show that a local distribution approximation in the pre-trained embedding space enables the estimation of importance weights for training samples, which guide resampling for effective bias mitigation. We illustrate our novel approach using multiple NLU datasets and show improvements obtained without additional annotation, making this a general approach for mitigating effects of sampling bias.


Introduction
Production Natural Language Understanding (NLU) models are typically trained on offline annotated data. The models learn from this offline data to perform classification on the online live data in production once they are deployed.
At their core, voice-controlled assistants such as Google Home, Amazon Alexa, or Siri apply NLU models to perform both intent classification and slot labelling (Weld et al., 2021). For example, the input utterance "set alarm at 9 am" would be classified with the intent "SetAlarmIntent", and the slots "9" and "am" would be labelled as Time.
In deployed NLU systems, a distribution mismatch between training and live data is common. Factors contributing to such a mismatch include changes of the live data distribution over time (due to, for example, new users or seasonal changes) and the use of data from other, more or less unrelated tasks to enrich the training data, so-called out-of-domain data.
The issue with this mismatch in distribution between training and inference time is that models learn a bias towards specific classifications that does not exist at inference time. Even if the label distributions are matched, the model may still show biased performance, since demographic and speech differences need not perfectly correlate with the label distribution, resulting in degraded accuracy and possibly unequal performance across populations (Subramanian et al., 2021). Thus, mitigating this distribution mismatch is an important step in the development of models.
While a common approach to dealing with this kind of bias is manual upsampling of classes in the training data (Estabrooks et al., 2004), this approach is not always optimal, due to the complexity and variation of natural language. Data of the same class in a classification task often come from very different forms of language, for example slang vs. formal language. A simple upsampling based on the classes does not mitigate differences in the usage of slang at inference time compared with the training data.
Another difficulty in class-based resampling is obtaining the correct label distribution of the live data when ground truth is missing. For example, when a model is deployed, its training data will ideally match the distribution of the current live data, since this is the data the model will be applied to. This current live data distribution will be more similar to recent live data than to historical data. However, manually annotating data to obtain the ground-truth labels takes time. Thus, during deployment, the training data should match the unannotated live data, and in this case it is only feasible to use bias reduction methods which do not rely on manual annotation.
In this work we build on importance weighting, an approach that has gained traction in other machine learning fields but has so far received little attention in natural language understanding. We propose a method to assign weights to every individual utterance in a training corpus based on observed live usage of the system, by using each utterance's neighbourhood in the embedding space. We find the neighbourhood of utterances with KNN and KMeans (in the case of KMeans, each cluster is considered a neighbourhood). We choose these methods because they are easy to interpret: for example, in the KMeans case, one can observe a cluster of utterances and its frequency online and offline, and easily understand why this specific pattern has a high or low weight.
The two unsupervised reweighting methods based on KNN and KMeans are compared with two baselines: keeping the training data as it is (with the distribution mismatch) and a semi-supervised intent-based approach, in effect class up/down-sampling. We evaluate our methods on both public data and a deployed commercial NLU system. In the public datasets, we simulate a distribution mismatch both by introducing a label mismatch and by combining different sources of data with different distributions.
We show that the unsupervised approaches can better mitigate certain kinds of sampling bias compared to the intent-based approach, while also having the advantage that the training data can be reweighted without any annotation; thus our method is suitable for test data with a fast-changing distribution. Since they need no labeled data, our unsupervised approaches are generic enough to be applicable to multiple different natural language processing tasks.

Related Work
The problem of dealing with training data sampling bias in machine learning is well studied. The idea of adjusting the training data distribution to meet the distribution at inference time is discussed in (Zadrozny, 2004), (Shimodaira, 2000) and (Dudík et al., 2005). These methods, however, require estimation of biased densities or selection probabilities, which poses a challenge in the real world.
Similarly, in (Grover et al., 2019), to deal with bias in generative models, a classifier is learned to distinguish the data distribution from the generative model. This allows the generation of additional data to be guided so that it better mimics the existing data.
In this work we extend the work above towards natural language understanding, and focus on the real-world problem in which the training data is biased with respect to the unannotated real-world application data (live data).
In (Huang et al., 2006), unsupervised model-agnostic importance weights are computed for every training sample. Our unsupervised approach differs from theirs in that we calculate the weights based on the neighbourhood, which makes interpretation of the individual weights easier in the case of natural language data. A closer investigation of importance weighting can also be found in (Cortes et al., 2010), which provides theoretical bounds, as well as in the recent work of (Fang et al., 2020), which looks specifically at the application of importance weighting and weight estimation for deep learning tasks. An important difference to these approaches is that they focus on including importance weighting directly in the learning of the models. In our work, however, we focus solely on the underlying data distribution of utterances, while keeping the estimation model the same.
In contrast to importance weighting, another common approach in real-world applications is pure upsampling of training utterances for certain classes, based on automatic labelling of the live data. In (Estabrooks et al., 2004), the effect of upsampling certain underrepresented classes is investigated, showing its effectiveness. On the other hand, looking at the class distribution alone will not reduce the kinds of data bias described in Sec 1, which makes the need for an automatic way of handling different kinds of distribution mismatch more pronounced.

Utterance Weight Estimation
In this section, we describe how we estimate the weight of each individual utterance in the offline training data based on a random sample from the online live data.
Let $X$ represent the random variable of an utterance from the online live data, where $X$ follows some distribution $P_X$, denoted as $X \sim P_X$. Let $Y$ be the random variable of the true label of $X$, where $Y \sim P_Y$. Also, let $X'$ and $Y'$ be the corresponding random variables of $X$ and $Y$ in the offline data, where $X' \sim P_{X'}$ and $Y' \sim P_{Y'}$. The issue we aim to resolve is that typically $P_X \neq P_{X'}$ and $P_Y \neq P_{Y'}$.
Analysing the difference between the utterance distributions $P_X$ and $P_{X'}$ is particularly challenging in NLP because different surface forms of utterances do not necessarily imply a semantic difference in classification tasks. However, due to the advance of natural language embeddings, we are now able to efficiently approximate the local distributions over the semantic meanings of text, which allows the estimation of $P_X$ and $P_{X'}$. Specifically, we propose to approximate the difference of local distributions in offline and online utterances, summarized as follows:
1. Map all utterances of the offline training data and the online live data into the embedding space.
2. For every offline training utterance $x_i$, estimate the local approximations of $P_X$ and $P_{X'}$, denoted as $\hat{P}_X$ and $\hat{P}_{X'}$, and compute the weight $w_i$ using its neighbourhood utterances.
3. Resample the utterance $x_i$ in the offline training data according to the weight $w_i$.

Mapping utterances into embedding space
The sentence-level representations of pre-trained BERT-based models do not guarantee that semantically similar utterances will be close in the embedding space.
Thus, for the mapping of the text into the embedding space, we use Sentence-BERT (Reimers and Gurevych, 2019a), which modifies the original BERT architecture via siamese and triplet network structures to compute semantically meaningful embeddings that can be compared using functions such as cosine similarity or Euclidean distance.
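To make the two comparison functions mentioned above concrete, the following minimal sketch computes cosine similarity and Euclidean distance on hand-written toy vectors. The three vectors are hypothetical placeholders, not outputs of Sentence-BERT; real sentence embeddings have far more dimensions.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional stand-ins for sentence embeddings.
emb_alarm_a = [0.9, 0.1, 0.0]   # e.g. "set alarm at 9 am"
emb_alarm_b = [0.8, 0.2, 0.1]   # e.g. "wake me up at nine"
emb_weather = [0.0, 0.2, 0.9]   # e.g. "is it snowing in California"

print(cosine_similarity(emb_alarm_a, emb_alarm_b))
print(cosine_similarity(emb_alarm_a, emb_weather))
```

In this toy setting the two alarm vectors are closer to each other than to the weather vector under both functions, which is the property the neighbourhood estimation relies on.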

Local Distribution Approximation
To mitigate the distribution mismatch we aim to debias the local distribution of each training utterance to match the live data distribution. Having embedded the utterances in the first step, it is now possible to estimate the local neighbourhood of an utterance by using the distances in the embedding space. We can then approximate the local distributions $P_X$ and $P_{X'}$ at a given utterance by counting the samples in this neighbourhood that belong to the live data ($X$) and to the training data ($X'$), respectively. In the following we propose three different reweighting methods, which differ in how the neighbourhood of each utterance is defined: Reweighting K-Nearest-Neighbour (R-KNN), Reweighting KMeans (R-KMeans) and, as an additional method, an intent-based approach (B-intent; effectively class up/down-sampling).

R-KNN (Reweighting via KNN):
The first local approximation we discuss is based on k-nearest-neighbours. We follow the standard procedure of using $K = \sqrt{N}$, with $N$ being the total number of utterances in the training data and the live sample combined.
We aim to determine the weight of an individual training utterance $x_i$ by using a sample of embedded training utterances $T$ and of embedded live utterances $L$. Let $\mathrm{KNN}(x)$ be the set of $K$ nearest neighbours of a point $x$ in the embedding space. We determine

$$D^{(i)}_{train} = \mathrm{KNN}(x_i) \cap T,$$

the set of all utterances that are part of both the neighbourhood of a training utterance $x_i$ and the training data. In a similar way we determine the set of all utterances from the live traffic sample $L$ that fall into the neighbourhood of $x_i$:

$$D^{(i)}_{live} = \mathrm{KNN}(x_i) \cap L.$$

With these two sets, we approximate the probability of having a training sample in this region of the embedding space as

$$\hat{P}_{X'}(x_i) = \frac{|D^{(i)}_{train}|}{|T|}.$$

Similarly, we approximate the probability of a live utterance being seen in this region of the embedding space as

$$\hat{P}_X(x_i) = \frac{|D^{(i)}_{live}|}{|L|}.$$

The ratio of these two probability approximations is the weight we assign to the utterance $x_i$:

$$w_i = \frac{\hat{P}_X(x_i)}{\hat{P}_{X'}(x_i)} = \frac{|D^{(i)}_{live}| \, |T|}{|D^{(i)}_{train}| \, |L|}.$$

The weight $w_i$ therefore indicates how much more likely it is that an utterance in a certain region is part of the live traffic than part of the training data.
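The R-KNN weight computation can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `rknn_weights` and the 2-dimensional toy points are hypothetical, neighbours are found by brute force with Euclidean distance, and each training utterance is counted as part of its own neighbourhood.

```python
import math

def rknn_weights(train, live):
    """Return one importance weight per training embedding (R-KNN ratio)."""
    # Pool both samples and remember where each point came from.
    pooled = [(p, "train") for p in train] + [(p, "live") for p in live]
    k = max(1, round(math.sqrt(len(pooled))))  # K = sqrt(N), N = |T| + |L|

    weights = []
    for x in train:
        # K nearest pooled points. x itself is in the pool at distance 0
        # (an assumption of this sketch), so n_train is always at least 1.
        nearest = sorted(pooled, key=lambda item: math.dist(item[0], x))[:k]
        n_train = sum(1 for _, origin in nearest if origin == "train")
        n_live = len(nearest) - n_train
        p_train = n_train / len(train)  # local estimate for the training data
        p_live = n_live / len(live)     # local estimate for the live data
        weights.append(p_live / p_train)
    return weights

# Hypothetical toy embeddings: training data is dense near (0, 0),
# live data is dense near (5, 5).
train_emb = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (5.0, 5.0)]
live_emb = [(0.1, 0.1), (5.1, 5.0), (5.0, 5.1), (4.9, 5.0), (5.0, 4.9)]
weights = rknn_weights(train_emb, live_emb)
print(weights)
```

In this toy example the four training points in the over-represented region receive a weight below 1, while the single training point in the region that dominates the live data receives a weight above 1. For realistic corpus sizes a brute-force search is too slow, and an approximate or indexed nearest-neighbour search would be the natural replacement.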
R-KMeans (Reweighting via KMeans):
Another way of approximating neighbourhoods is unsupervised clustering. In this case the training and live data are combined and clusters are then computed in the embedding space. The neighbourhood of an utterance consists of all utterances within the same cluster, so all utterances within a cluster obtain the same weight. After having found the neighbourhoods $D^{(i)}_{train}$ and $D^{(i)}_{live}$ through clustering, we follow exactly the same equations as above to compute the weights. For simplicity we chose K-Means clustering (MacQueen et al., 1967) with $K = \sqrt{N}$. If the live data and the training data came from the same distribution, we would expect to find that, in each cluster $i$,

$$\frac{|D^{(i)}_{train}|}{|T|} = \frac{|D^{(i)}_{live}|}{|L|}.$$

After reweighting each utterance with a weight calculated with R-KMeans, the above equality holds in every cluster.
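The per-cluster weights, and the claim that reweighting restores the equality in every cluster, can be checked with a short sketch. It assumes the clustering has already been run and takes one cluster id per utterance as input; the function name `cluster_weights` and the toy assignments are hypothetical. The equality holds exactly only when every cluster that contains live utterances also contains at least one training utterance.

```python
from collections import Counter, defaultdict

def cluster_weights(train_clusters, live_clusters):
    """Map cluster id -> weight, given one cluster id per utterance."""
    n_train, n_live = len(train_clusters), len(live_clusters)
    train_counts = Counter(train_clusters)
    live_counts = Counter(live_clusters)
    # Only clusters containing training data need a weight.
    return {
        c: (live_counts[c] / n_live) / (train_counts[c] / n_train)
        for c in train_counts
    }

# Hypothetical cluster assignments for 10 training and 10 live utterances.
train_clusters = [0] * 6 + [1] * 3 + [2] * 1
live_clusters = [0] * 2 + [1] * 3 + [2] * 5
weights = cluster_weights(train_clusters, live_clusters)

# Share of the weighted training data that falls into each cluster.
mass = defaultdict(float)
for c in train_clusters:
    mass[c] += weights[c]
total = sum(mass.values())
weighted_share = {c: m / total for c, m in mass.items()}

live_counts = Counter(live_clusters)
live_share = {c: live_counts[c] / len(live_clusters) for c in live_counts}
print(weighted_share, live_share)
```

Because a weight is shared by all utterances in a cluster, inspecting a handful of clusters with extreme weights is enough to see which utterance patterns are over- or under-represented.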

B-intent (Baseline via intent)
As a baseline, we reweight the data based on the label distribution. The problem with this approach is that it cannot address latent distribution mismatches not directly related to the labels, such as formal versus informal language (see Sec 1). We train a classifier on the biased training data to infer $\hat{P}_Y$, an approximation of the live label distribution $P_Y$, and we use the training label distribution $P_{Y'}$, which is known from the annotated training data. We give each intent a weight

$$w_{intent} = \frac{\hat{P}_Y(intent)}{P_{Y'}(intent)},$$

which is in line with the description above for R-KNN, considering the neighbourhood of an utterance to consist of all utterances with the same label. As a result, after reweighting the utterances of every intent with the weight of their intent $w_{intent}$, the labels of the resampled data will follow $\hat{P}_Y(intent)$.

Resampling the Training Data
With the computed weights for every training example, we are now able to resample the training data according to the live data distribution.
A weight below 1.0 means that the training utterance is over-represented relative to the live distribution, while a weight above 1.0 marks utterances that are under-represented and therefore more important for matching the live distribution.
While there are different ways in the literature of using this reweighting information, such as (Fang et al., 2020) and (Huang et al., 2006), which use it directly as part of the optimisation when learning the model, we chose the most straightforward option of up- and down-sampling the utterances directly in the training data. A toy example of R-KNN resampling can be seen in Fig. 1.
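Since weights are generally fractional while utterances can only be copied a whole number of times, the up- and down-sampling needs a rounding rule. The sketch below assumes the stochastic rounding described in the Experimental Setup, where an utterance with weight w is copied floor(w) times plus one extra time with probability w - floor(w); the function names and the fixed seed are hypothetical choices for illustration.

```python
import math
import random

def resample_count(weight, rng):
    """Number of copies for one utterance: floor(w), plus 1 with prob. frac(w)."""
    base = math.floor(weight)
    frac = weight - base
    return base + (1 if rng.random() < frac else 0)

def resample(utterances, weights, seed=0):
    """Return the resampled training set as a list with repeated utterances."""
    rng = random.Random(seed)  # fixed seed only to keep the sketch reproducible
    resampled = []
    for utterance, weight in zip(utterances, weights):
        resampled.extend([utterance] * resample_count(weight, rng))
    return resampled

# Integer weights are deterministic: 2.0 gives two copies, 0.0 drops the utterance.
print(resample(["play some jazz", "set alarm at 9 am"], [2.0, 0.0]))
```

With this rule the expected number of copies of an utterance equals its weight, so the resampled corpus matches the target weighting on average even though each individual count is an integer.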

Experiments
In our experiments we evaluated our methods on multiple different NLU datasets to verify the feasibility of the approach.

Datasets
We tested our methods on a large commercial voice assistant dataset, as well as on two public ones: SLURP (Bastianelli et al., 2020) and SNIPS (Coucke et al., 2018). In all these datasets, the NLU task is intent classification and slot labelling. The commercial dataset is de-identified.
The training and test data are manually annotated, whereas the live data is not. In the commercial voice assistant scenario, we take a sample of last month's unannotated live data as representative of current usage of the system. The size of the sample is the same as that of the offline training data. The annotated live data (test data) is not available during model deployment, but can be obtained afterwards to estimate the performance of the method.

Bias simulation strategies
Most available natural language understanding datasets are very well curated, with the test sets closely resembling the distribution of the training data. Thus, in the public datasets we simulate bias that could occur in real-world applications via two different strategies applied to the training data:
Intent-based sampling bias: We introduce bias in the label distribution in the following way: each intent is assigned to either a low-sampling bucket (with probability 20%) or a high-sampling bucket (with probability 80%). The two intents that SNIPS and SLURP have in common (related to weather and to music) are both assigned to the low-sampling bucket. Finally, intents in the low-sampling bucket are down-sized to 20% of their original size by randomly removing 80% of the utterances annotated with this intent. The high-sampling intents are left as is.
Add OOD data: To introduce bias not directly related to the labels, and to mimic the real-life scenario in which the training set is composed of different data sources with different amounts of noise, we also add, to each task, the training data of the other task. That is, we add the SNIPS data to the SLURP training set, and we add the SLURP data to the SNIPS training set. Prior to adding the data, we first produce machine-annotated labels for the SNIPS utterances in the SLURP label space, as well as for the SLURP utterances in the SNIPS label space.
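The intent-based sampling bias described above can be sketched as follows. This is an illustration under assumptions rather than the script used for the experiments: the function name, its parameters (`forced_low`, `p_low`, `keep_fraction`, `seed`) and the toy dataset are hypothetical, and the dataset is represented simply as a list of (utterance, intent) pairs.

```python
import random

def simulate_intent_bias(dataset, forced_low=(), p_low=0.2, keep_fraction=0.2, seed=0):
    """Down-size randomly chosen intents; returns (biased_dataset, low_intents)."""
    rng = random.Random(seed)
    intents = sorted({intent for _, intent in dataset})
    # Each intent goes to the low-sampling bucket with probability p_low;
    # intents listed in forced_low (e.g. the shared ones) always do.
    low_intents = {i for i in intents if i in forced_low or rng.random() < p_low}

    biased = []
    for intent in intents:
        examples = [ex for ex in dataset if ex[1] == intent]
        if intent in low_intents:
            # Keep a random 20% of the utterances, i.e. remove 80% of them.
            examples = rng.sample(examples, round(keep_fraction * len(examples)))
        biased.extend(examples)
    return biased, low_intents

# Hypothetical toy dataset with 10 utterances per intent.
toy = [(f"utt {i}", intent) for intent in ("weather", "music", "alarm") for i in range(10)]
biased, low_intents = simulate_intent_bias(toy, forced_low=("weather", "music"))
print(sorted(low_intents), len(biased))
```

Fixing the seed keeps the bucket assignment and the removed utterances identical across the repeated runs of an experiment, so that differences between methods are not caused by different simulated biases.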

Experimental Setup
The embeddings were generated with the paraphrase-MiniLM-L6-v2 model, which is part of the sentence transformer model family (Reimers and Gurevych, 2019a). This model is fine-tuned so that semantically similar sentences are close in the embedding space with respect to distance functions, including Euclidean distance (Reimers and Gurevych, 2019b).
To not leak information of the unseen test data into the reweighting, we used the development data for the distribution approximation.
For the resampling, we upsampled an utterance with weight $w_i$ to the frequency $n_i = \lfloor w_i \rfloor + \theta$, where $\theta$ is a random variable that is 1 with probability $p = w_i - \lfloor w_i \rfloor$, and 0 otherwise. The expected value is $E[n_i] = \lfloor w_i \rfloor + (w_i - \lfloor w_i \rfloor) = w_i$.
We train a BERT model (Chen et al., 2019). For hyperparameter tuning, we follow (Chen et al., 2019) and use the Adam optimizer (Kingma and Ba, 2014) over 4 epochs, with a learning rate of 5e-5 and a batch size of 32. We use the implementation from (Wolf et al., 2019), with the bert-base-uncased pretrained model. We report f1-score on the test data.
We compare our unsupervised approaches (R-KNN and R-KMeans from Sec 3.2) with two baselines: B-bias, a baseline model trained on the biased data, and B-intent, a baseline model in which the biased data is up/down-sampled so that the label distribution matches the live data (see Sec 3.2). The BERT model described above is used to obtain the hypothesised intent on the live data.

Results
Public datasets: The results of our experiments can be seen in Table 1. We report the intent classification error rate, as well as the utterance error rate. We define the utterance error rate as the fraction of utterances in which there is an error in either the slot labelling or the intent classification task.
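The two metrics can be written down directly from their definitions. The sketch below is a minimal illustration; the dictionary keys (`gold_intent`, `pred_intent`, `gold_slots`, `pred_slots`) and the toy predictions are hypothetical, and a slot error is assumed to mean any difference between the predicted and the gold slot sequence.

```python
def error_rates(examples):
    """Return (intent error rate, utterance error rate) for a list of examples."""
    n = len(examples)
    intent_errors = sum(ex["pred_intent"] != ex["gold_intent"] for ex in examples)
    # An utterance counts as wrong if the intent OR any slot label is wrong.
    utterance_errors = sum(
        ex["pred_intent"] != ex["gold_intent"] or ex["pred_slots"] != ex["gold_slots"]
        for ex in examples
    )
    return intent_errors / n, utterance_errors / n

# Hypothetical predictions: one intent error, one slot error, two correct.
examples = [
    {"gold_intent": "SetAlarmIntent", "pred_intent": "SetAlarmIntent",
     "gold_slots": ["O", "O", "O", "Time", "Time"], "pred_slots": ["O", "O", "O", "Time", "Time"]},
    {"gold_intent": "SetAlarmIntent", "pred_intent": "PlayMusicIntent",
     "gold_slots": ["O", "O"], "pred_slots": ["O", "O"]},
    {"gold_intent": "GetWeatherIntent", "pred_intent": "GetWeatherIntent",
     "gold_slots": ["O", "City"], "pred_slots": ["O", "O"]},
    {"gold_intent": "PlayMusicIntent", "pred_intent": "PlayMusicIntent",
     "gold_slots": ["O", "Artist"], "pred_slots": ["O", "Artist"]},
]
intent_er, utterance_er = error_rates(examples)
print(intent_er, utterance_er)
```

By construction the utterance error rate is never lower than the intent error rate, since every intent error is also an utterance error.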
Each experiment is run ten times, and the average error rate is reported. The difference between R-KMeans and each of the two baselines (B-intent and B-bias) passes a two-sided paired t-test for statistical significance at the 95% confidence level.
The difference between the R-KMeans and R-KNN approaches is, however, not statistically significant. R-KMeans has the advantage over R-KNN of easier interpretation of the weights: one weight is produced per cluster, instead of per utterance. The clusters can be inspected manually and, by comparing the in-cluster frequency of the live and training data, one can understand why a cluster got a high or low weight.
For example, we observe in our SNIPS run two distinct clusters related to weather queries that get different weights. The first one relates to questions about specific weather events such as snow or rain and includes, for example, the utterance "is it snowing in California". Using R-KMeans reweighting, this cluster receives a weight of 1.04, which can be interpreted roughly as: this pattern of utterances is about equally frequent in the live data (the SNIPS development data in this case) as in the training data. Thus, it does not need to be upsampled or downsampled.
However, a different cluster of weather queries, containing more general questions such as "what is the weather forecast for Akers New Hampshire", receives a weight of 12: this cluster is 12 times less frequent in the training data than in the live data. Thus, this cluster is upsampled by a factor of 12.
Overall, in our experiments the two unsupervised methods perform better than both intent-based resampling and the baseline. A limitation of our work, however, is that it requires annotated live data to use as test data, in order to estimate the performance after model deployment. Obtaining this data can be a challenge in real-life applications.
Commercial dataset: On the commercial dataset, we show that, when the training data has a different distribution from the live and test data, applying reweighting techniques with local distribution approximation can improve performance. We compare the results of applying reweighting to the training data against using no reweighting strategy and report the relative differences. We use R-KMeans reweighting, due to the easier manual inspection of the assigned weights (see Sec 3.2). We report the relative difference in both intent error rate and utterance error rate. As shown in Table 2, we see improvements in both the utterance and the intent error metric, with the biggest coming from the Home Automation domain (13.77%), and an overall improvement of 4.63% across all domains. The results with respect to the baseline pass a two-sided paired t-test for statistical significance at the 95% confidence level.

Conclusion and future work
In this work, we showed how the reweighting of training data using local distribution approximation helps in mitigating sampling bias in natural language understanding production models. We simulated the bias in public training datasets to mimic real-world application scenarios in which different data sources, each coming from a different distribution, are used. We reweighted utterances based on the approximation of the local distribution to minimise the mismatch between the training data and the online live traffic data. The simplicity of our approach, and the fact that it does not require manual or machine annotation, means that it can be used to quickly adapt the training data to the ever-changing live data in deployed models. Experiments on both a commercial dataset and two public datasets have shown that our approach can mitigate the mismatch and bias in training data without additional manual tuning. In the future, we want to experiment with the combined impact of our method and different data augmentation techniques, study the impact on fairness across populations, and investigate bias detection methods to trigger the reweighting model.

Ethical considerations
In this work we apply a reweighting method before model deployment to mitigate the problem of bias in the training data compared to the live data. We target overall accuracy as the metric we aim to improve, and we achieve this by tailoring the model to the latest live data at model deployment. However, the impact of reweighting on per-population accuracy has not been studied. There is a risk that, due to the focus on current live data, populations which are not extensively using the model at the time of deployment are not well served by the reweighting, even though overall accuracy improves.

Figure 1: Example of training (blue, left) and live data (orange, left) with different distributions, as well as the output of R-KNN resampling the training data (blue, right). Darker points indicate higher weight.

Figure 2: Pipeline for utterance reweighting. We combine many different sources of training data, and then assign a high or low weight to each utterance depending on the recent, unannotated live data, which follows the distribution most similar to that of the data the model will be applied to (compared to, for example, historical annotated live data).

Table 1: Intent ("int") and utterance ("utt") error rates of the different methods on the SLURP/SNIPS datasets. Best result in bold. Each experiment is run ten times, and the average is reported. Both the absolute value and the relative change with respect to the first baseline are reported.

Table 2: Relative reduction in error rates (both intent and utterance) in the commercial dataset.