Data Augmentation for Voice-Assistant NLU using BERT-based Interchangeable Rephrase

We introduce a data augmentation technique based on byte pair encoding and a BERT-like self-attention model to boost performance on spoken language understanding tasks. We compare and evaluate this method against a range of augmentation techniques encompassing generative models such as VAEs and performance-boosting techniques such as synonym replacement and back-translation. We show our method performs strongly on domain and intent classification tasks for a voice assistant and in a user study focused on utterance naturalness and semantic similarity.


Introduction
With conversational assistants becoming more and more pervasive in everyday life, task-oriented dialogue systems are rapidly evolving. These systems typically consist of Spoken Language Understanding (SLU) models, tasked with determining the domain category and the intent of user utterances at each turn of the conversation.
The ability to quickly train such models to meet changing and evolving user needs is necessary. However, developers often find themselves with access to very little labeled training data. This is especially true when new functions are deployed and a large user base has not yet had a chance to use them, limiting the number of available utterances that can be labeled for training. Furthermore, labeling large amounts of data can be time-consuming and expensive. More recently, these challenges have been compounded by privacy concerns and legislation that may prevent the use of user utterances for training.
Much of the recent research addressing data paucity has focused on pre-training using self-supervision and vast amounts of unlabeled data (Devlin et al., 2018; Radford et al., 2018, 2019). Pre-trained models can later be fine-tuned with a much smaller amount of labeled data for specific tasks. In this work, instead of pre-training, we explore methods that enhance and expand the task-specific training set by using data augmentation. While models such as BERT prove to be both useful and relevant, we show that data augmentation during the fine-tuning stage can boost performance even on these large pre-trained models. We implement and compare several pre-existing techniques for data augmentation on Natural Language Understanding (NLU) tasks such as domain classification (DC) and intent classification (IC) for a voice assistant. We also introduce a new method of data augmentation called Interchangeable Rephrase (IR) with the goal of "rephrasing" an existing utterance using new language while maintaining the original intent or goal (see Table 1).

Related Work
Recurrent neural network (RNN)-based VAE generative models (Kingma and Welling, 2013; Bowman et al., 2015) explicitly model properties of utterances such as topic, style, and other higher-level syntactic features. The variational component helps in generating diverse text; thus, we use VAEs as a candidate for transforming and augmenting text data in our experiments.
Building upon unconditioned text generation, the Conditional VAE (CVAE) (Sohn et al., 2015; Hu et al., 2017) generates more relevant and diverse text conditioned on certain control attributes, e.g., tense and sentiment (Hu et al., 2017) or style (Ficler and Goldberg, 2017). In this work, our goal is to maintain semantic similarity; therefore, we generate text by conditioning on the original intent or goal.
Guu et al. generate novel sentences from prototypes by exploiting analogical relationships between sentences, and show that the generated sentences vary in style. We extend this idea by using prototype utterances and editing them into new utterances or rephrases, using both the VAE and CVAE architectures, to generate augmented rephrase data. We call these models VAE-edit and CVAE-edit.
Back translation (BT) is the process of translating an utterance in one language into another language and then translating it back to the original language. Certain question answering (QA) models (Yu et al., 2018) have observed that back-translation generates diverse paraphrases while preserving the semantics of the original sentences. In our case, we use back-translations as rephrases to augment data.
Another simple augmentation technique is EDA: easy data augmentation (Wei and Zou, 2019), which consists of four operations: synonym replacement, random insertion, random swap, and random deletion. The authors show a boost on text classification tasks with these operations on smaller datasets. Since the number of labeled spoken utterances is limited, we compare against this approach for our NLU tasks.
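The four EDA operations are simple enough to sketch directly. The snippet below is an illustrative implementation over whitespace-tokenized utterances; the small synonym table is a hypothetical stand-in for the WordNet lookups EDA actually uses.

```python
import random

# Toy synonym table standing in for WordNet lookups (illustrative entries).
SYNONYMS = {"play": ["start"], "song": ["track"], "lights": ["lamps"]}

def synonym_replacement(tokens, n=1):
    """Replace up to n tokens that have a known synonym."""
    out = tokens[:]
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_insertion(tokens, n=1):
    """Insert a synonym of a random token at a random position, n times."""
    out = tokens[:]
    for _ in range(n):
        candidates = [t for t in out if t in SYNONYMS]
        if not candidates:
            break
        syn = random.choice(SYNONYMS[random.choice(candidates)])
        out.insert(random.randrange(len(out) + 1), syn)
    return out

def random_swap(tokens, n=1):
    """Swap two random positions, n times."""
    out = tokens[:]
    for _ in range(n):
        if len(out) < 2:
            break
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]
```

As the paper notes, such replacements can produce ungrammatical utterances, since no operation checks syntax.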
The automatic speech recognition (ASR) module, which converts audio input into text, introduces errors before its output is fed into downstream NLU tasks. Without modifying the ASR or NLU components, an utterance-correction module can be used to help denoise the data (Freitag and Roy, 2018). The reconstructed utterances can then be used as augmentation data.
PPDB is a well-known paraphrase database consisting of automatically generated paraphrases (Ganitkevitch et al., 2013). We use this database to rephrase utterances by identifying short phrases within an utterance and replacing them with a related phrase according to the database and POS (e.g., "there is a lot of" → "there are plenty of").
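A minimal sketch of this phrase-replacement scheme, using a toy paraphrase table with illustrative entries in place of the real PPDB rules (and omitting the POS check):

```python
# Toy paraphrase table with illustrative entries (not actual PPDB rules).
PARAPHRASES = {
    "there is a lot of": ["there are plenty of"],
    "how do i make": ["how to make", "teach me to make"],
}

def rephrase(utterance, table=PARAPHRASES):
    """Replace the first matching phrase with one of its paraphrases,
    preferring longer phrases so the most specific rule wins."""
    lowered = utterance.lower()
    for phrase in sorted(table, key=len, reverse=True):
        if phrase in lowered:
            return lowered.replace(phrase, table[phrase][0], 1)
    return lowered
```

In the 1-1 setting used later, one would sample a single paraphrase per utterance rather than always taking the first entry.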

Interchangeable Rephrase
BERT is pre-trained using two tasks: Masked LM (MLM) and Next Sentence Prediction (NSP). In our rephrase task, the end objective is almost identical to MLM training, and only this procedure is used to train our self-attention model. MLM training allows the model to predict appropriate word(s) to replace the masked token depending on the context of the rest of the phrase. Each input token corresponds to a final hidden vector that is fed into an output softmax over the vocabulary. Thus, to rephrase an utterance with more or fewer tokens than the original, the desired number of tokens for the rephrase must be known a priori.

Figure 1: Overview of BERT-based interchangeable rephrase (BERT-IR). BPE is used to encode n-grams into single tokens, so the model's vocabulary comprises tokens representing both single words and sequences of words. The model then computes a softmax over the vocabulary for the vector of the masked input token, allowing a final output that may differ in word length from the original input.
To allow rephrases of unknown length, we use byte pair encoding (BPE) to group word n-grams into single tokens (Sennrich et al., 2015). With BPE, an individual token may represent a sequence of several words, but the model can still be trained to predict only a single token. We perform BPE on a set of training data relevant to the end tasks to obtain the most frequent n-gram sequences and include these in the model's vocabulary. As with PPDB, we assume that many of these n-grams are synonymous and interchangeable. For example, in the context of a virtual assistant skill that enables finding and reciting recipes, n-gram short phrases such as "how to make", "tell me how to cook" and "teach me to make" can all be used in place of the n-gram "how do I make" in the utterance "how do I make a margherita pizza." This interchangeable property is the foundation of our rephrase and data augmentation system.
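As a rough illustration of this idea, the sketch below collects frequent word n-grams from a corpus and tokenizes new utterances by greedy longest match; the frequency-threshold shortcut stands in for the iterative merge procedure of actual BPE.

```python
from collections import Counter

def build_ngram_vocab(corpus, max_n=4, min_count=2):
    """Collect word n-grams frequent enough to become single vocabulary
    tokens -- a frequency-threshold stand-in for iterative BPE merges."""
    counts = Counter()
    for utt in corpus:
        words = utt.split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return {gram for gram, c in counts.items() if c >= min_count}

def tokenize(utterance, vocab, max_n=4):
    """Greedy longest-match tokenization: a frequent n-gram becomes one token."""
    words, i, tokens = utterance.split(), 0, []
    while i < len(words):
        for n in range(max_n, 1, -1):
            if i + n <= len(words) and " ".join(words[i:i + n]) in vocab:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens
```

Once a phrase like "how do I make" occupies a single vocabulary slot, masking it and predicting a replacement token yields a rephrase of a different word length.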
The BERT-like self-attention model allows predictions to be made based on the context of the input utterance, and BPE tokenization allows variable-length outputs while still requiring only a single token to be predicted. Though the new BPE-based vocabulary requires re-training the model (pre-trained BERT cannot be used), the model remains structurally the same as the original BERT. We refer to our BERT-based interchangeable rephrase model as BERT-IR. Figure 1 provides an overview of the rephrase model.
In order to maintain intent, such that the bigram "turn on" in the utterance "turn on the lights" is not inadvertently replaced with "turn off", negative examples and an intent feature are included in a fine-tuning step. The fine-tuning process minimizes a loss based on the cosine similarity

cos(X, Y) = (X · Y) / (||X|| ||Y||),

where X and Y are two equal-length vectors. Negative examples are incorporated through a softmax: Q is a one-hot ground-truth token label derived from the input, R represents the output vector for Q, and D is a set of three vectors comprising R and two vectors derived from two negative examples in the training set. The network then minimizes the following differentiable loss function using gradient descent:

L = −log( exp(cos(Q, R)) / Σ_{R′ ∈ D} exp(cos(Q, R′)) ).

We emphasize that it is not necessary to use a BERT model for this rephrase method; it is technically possible to use a model architecture identical to that of BERT. However, given that this is not a fine-tuning task and the model must be fully re-trained to support the new n-gram tokens, a less resource-intensive and data-hungry model is preferred. In our experiments we use a BERT-like model that leverages self-attention but has only a fraction of the total parameters of BERT.
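A NumPy sketch of this softmax-over-cosines loss, assuming the vectors are already computed; the function and argument names are ours, not the paper's:

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def rephrase_loss(q, r, negatives):
    """-log softmax over cosine similarities: the ground-truth vector r
    competes against negative-example vectors (two in the paper's setup)."""
    sims = np.array([cosine(q, v) for v in [r] + list(negatives)])
    exp = np.exp(sims)
    return float(-np.log(exp[0] / exp.sum()))
```

The loss is small when r is close to q and the negatives are far away, which is exactly the pressure that keeps "turn on" from drifting to "turn off".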

Experiments
In the following experiments, we examine and compare our proposed method with various data augmentation techniques in the context of utterance generation (rephrase) for voice assistants.
First, we study the properties of the utterances the systems are capable of generating. An ideal data augmentation method should create data that either expands upon or fills in missing gaps of the original training distribution, while still being inherently natural and meaningful. In this context, given an input utterance (from the original distribution), the goal is to generate rephrases that are semantically similar, yet, different enough to positively alter the original training distribution. This is measured in our next two experiments in which we compare the performance of augmented datasets by training an utterance domain classifier (DC) and an intent classifier (IC).
Finally, we perform a user study to examine the quality of utterance generation in terms of naturalness and semantic similarity.
Data. We use an original dataset comprised of utterances for a set of 63 skills (domains) for a voice assistant. The skills range from playing a song on Spotify to turning on/off in-home appliances to providing the weather. The developer of each skill provides a set of training utterances that users can say as an entry point to the function. Each utterance is annotated with an intent, and on average there are 530 utterances per skill (maximum 2000 and minimum 9).
For the DC task, of the 43540 utterances in the dataset, 6590 are held out for validation and the remaining utterances are used for training and processed for augmentation. Additionally, we use a separate test set that was collected and labeled through user trials and crowd sourcing. Each domain in this test set has roughly 900 test utterances.
For the IC task, we consider 5 different skills (domains) from the above dataset: Weather, SystemApp, SmartThings, TvControl and TvSettings. We allocate 30% of the data in each skill for testing and use the remainder for training, including augmentation (see Appendix A.3 for more details).

Experimental Settings
Our BERT-based interchangeable rephrase model uses an architecture that is similar in size and structure to DistilBERT (Sanh et al., 2019). It is possible to use alternative types of models, but we are motivated by the power of self-attention mechanisms for sequential tasks. Our model is pre-trained with roughly 500k utterances from user data (in the USA) for the voice assistant. The vocabulary is established using a combination of the developer training data and the usage data. After the byte-pair-encoding process, we prune the resulting pairs so that only pairs occurring more than 100 times are used in the final vocabulary. All unigrams appearing two or more times are also included in the vocabulary.

Figure 2: Evaluation of automatic metrics, averaged across all generated rephrases for each model: (a) Jaccard similarity (lower is better), (b) copied n-gram fraction (lower is better), (c) semantic similarity (higher is better).
Baselines. Our first baseline is a classifier that is fine-tuned solely on the original training data with no augmentation. We compare with existing work on data augmentation via VAE/CVAE, VAE/CVAE-edit, back-translation, denoising autoencoder (DAE), easy data augmentation (EDA) and PPDB. We refer the reader to Appendix A.1 for experimental details of these augmentation techniques.

Analysis and Comparison of Methods
We apply several automated linguistic metrics to examine differences in the quality of generated rephrases, evaluating them by how related they are to the original utterances. A model that minimizes word-level overlap (i.e., more variation) while increasing semantic similarity the most is presumably ideal.
Jaccard similarity (Jaccard, 1912; Roemmele et al., 2017) is used to measure the proportion of overlapping words between the rephrase and the original utterance. Additionally, for n = 1, 2, 3, we measure the proportion of generated n-grams that also appear in the original utterance, i.e., the amount copied from the original utterance (See et al., 2019).
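Both overlap metrics are straightforward to compute over whitespace-tokenized text; a minimal sketch:

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two utterances."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def copied_ngram_fraction(rephrase, original, n=2):
    """Fraction of the rephrase's n-grams that also occur in the original."""
    def ngrams(s):
        w = s.split()
        return [tuple(w[i:i + n]) for i in range(len(w) - n + 1)]
    generated = ngrams(rephrase)
    if not generated:
        return 0.0
    source = set(ngrams(original))
    return sum(g in source for g in generated) / len(generated)
```

Lower values on both metrics indicate a rephrase that reuses less of the original surface form.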
We measure semantic similarity at the word level and sentence level. We compute the mean cosine similarity of the word2vec vectors of all pairs of words between a rephrase and the original utterance. We also measure the cosine similarity of the sentence encodings of the rephrase and the original utterance, generated by the skip-thought model (Kiros et al., 2015) (see Figure 2c). The skip-thought model maps sentences sharing semantic and syntactic properties to similar vector representations. We average all these metrics across all generated rephrases for each model (see Figure 2).
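The word-level score can be sketched as the mean pairwise cosine over word vectors; the tiny hand-written vectors below are a stand-in for pre-trained word2vec embeddings:

```python
import numpy as np

# Tiny hand-written vectors standing in for pre-trained word2vec embeddings.
VEC = {
    "lights": np.array([1.0, 0.1]),
    "lamps": np.array([0.9, 0.2]),
    "pizza": np.array([0.0, 1.0]),
}

def word_level_similarity(rephrase, original, vec=VEC):
    """Mean cosine similarity over all word pairs with known vectors."""
    sims = []
    for a in rephrase.split():
        for b in original.split():
            if a in vec and b in vec:
                va, vb = vec[a], vec[b]
                sims.append(float(va @ vb /
                                  (np.linalg.norm(va) * np.linalg.norm(vb))))
    return sum(sims) / len(sims) if sims else 0.0
```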

Domain and Intent Classification
The data generated by each augmentation technique is used to train a domain classifier and an intent classifier. Ten classifiers are trained for each task, using the same distribution of training and held-out data described previously. Though several of the augmentation techniques are capable of generating more than one utterance per input, here we generate augmented data in a 1-1 fashion, i.e., for each original utterance in the training set, a single rephrase is generated (see Figure 3). The results on the DC task are shown in Table 2. In Figure 4, we show the relative error reduction each augmentation technique achieves on IC, averaged across all 5 skills (see Appendix A.4 for complete details).
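The 1-1 augmentation setup itself reduces to a short loop; `rephrase_fn` stands in for any of the generators compared here:

```python
def augment_one_to_one(training_set, rephrase_fn):
    """For each (utterance, label) pair, add exactly one rephrase carrying
    the same label, doubling the training set."""
    augmented = list(training_set)
    for utterance, label in training_set:
        augmented.append((rephrase_fn(utterance), label))
    return augmented
```

Keeping the label fixed is what makes the augmentation useful for DC and IC: the rephrase must preserve the original intent or it injects label noise.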

User study
We performed an online user study in which participants completed two tasks. In the first task, participants answered a three-item Likert survey regarding the naturalness of the utterance (see Appendix A.5). We compared results for utterances that came from the original developer training data (i.e., human-generated), our proposed BERT-IR, and a baseline rephrase using PPDB. In the second task, users were given a reference utterance and four candidate utterances. Of the four candidates, one had an identical intent to the reference but different wording, and the other three were random utterances with different intents. Participants were asked to select the candidate with the same intent as the reference. The "correct" candidate was generated from one of three sources: human, our rephrase, or the PPDB rephrase. There were 88 participants, and each participant completed each task 30 times. Results for both tasks are shown in Table 3. Using a 3-way ANOVA, we found significant differences between all three methods at the p < .05 level.

Discussion
In our experiments, we show that VAE models are capable of generating diverse rephrases. However, these rephrases do not preserve the original meaning, which likely contributed to poorer performance on DC and IC. The discriminator of CVAE models, trained to condition on a domain (DC) or intent (IC), helps improve semantic similarity, resulting in slightly better performance on the NLU tasks. The VAE-edit and CVAE-edit models perform quite poorly on all comparisons, as they do not necessarily preserve the meaning of an utterance when transforming it into an altered style.
Back-translation yields relatively diverse rephrases; however, it poorly preserves the original meaning. DAE tends merely to copy a higher fraction of n-grams without changing the meaning of the utterance. EDA boosts DC and IC performance compared to the other methods; however, it changes only a small percentage of words in each utterance through replacement operations, which are not always grammatically sound.
Our BERT-IR yields the best performance on the NLU tasks, with a relative error reduction of 46.26% on DC and 43.4% on IC compared to no additional augmentation. Similar to PPDB, our approach generates rephrases with lower word overlap and significantly higher semantic similarity to the original utterance. The user study reveals that BERT-IR is an improvement over the PPDB baseline but does not perform at the level of human-generated utterances in terms of naturalness and intelligibility.
For examples of generated rephrases see Table 8 in the Appendix.

Conclusion
We introduced BERT-IR, a simple augmentation strategy based on byte pair encoding and a BERT-like self-attention model to generate diverse, natural and meaningful rephrases in the context of utterance generation for voice assistants. We demonstrated that BERT-IR performs strongly on spoken language understanding tasks such as domain classification and intent classification, and in a user study focused on evaluating the quality of rephrases based on naturalness and interpretability (intent preservation).

A.1 Experimental Details of Augmentation Techniques

Denoising Auto-encoder: The decoder and encoder of the denoising auto-encoder are single-layer GRU RNNs with input/hidden dimensions of 300/512 and Luong attention. We follow an utterance-corruption procedure similar to Freitag and Roy: words whose frequency is greater than 100 (i.e., non-content words like "the") are dropped randomly, followed by shuffling of bigrams, without splitting bigrams that also exist in the original utterance. We train a DAE on this corrupted data and use the corrected utterances/rephrases it generates as augmentation data.
We train the above models until convergence on 2 NVIDIA Tesla V100 GPUs.

Back-translation: We use the open-sourced back-translation system of Xie et al. (2019). Specifically, we use pre-trained WMT'14 English-French translation models (in both directions) to perform back-translation on each utterance.
EDA: We followed three strategies from Wei and Zou: random insertion, random replacement, and random swap.
PPDB: The PPDB database² consists of one paraphrase rule per line. A standard line has the format:

LHS ||| PHRASE ||| PARAPHRASE ||| (FEATURE=VALUE)* ||| ALIGNMENT ||| ENTAILMENT

If a PHRASE exists in the utterance, we replace it with the PARAPHRASE. Since this is 1-1 data augmentation, we sample a rephrase from the list of all possible generated rephrases. We use the English-Phrasal PPDB database.

² http://paraphrase.org/#/download
Refer to Table 4

A.2 Domain Classification and Intent Classification
For domain classification and intent classification, we train a classifier on top of the base uncased DistilBERT model. We use a maximum sequence length of 128, a dropout rate of 0.1, and a learning rate of 2e-5. We train the classifier for 15 epochs with a batch size of 32 on a single NVIDIA Tesla V100.

A.3 Training Data Statistics
Domain Classification: Table 5 shows the counts for each training, development (held-out), and test set. The partition used for training is also the partition that is augmented (and subsequently also used for training in domain classification; see Figure 3). Each utterance has an average of 6 words. There are 63 different domains in our dataset, such as music, calendar, calculator, weather, etc. We evaluate the automatic metrics from Section 4.2 on the same data used for training the domain classifier.

Domain        #Intents  Train  Test
WEATHER       22        732    313
SYSTEMAPP     29        665    286
SMARTTHINGS   61        1190   511
TVCONTROL     19        392    168
TVSETTINGS    40        834    357

Table 6: Intent Classification Data Statistics

Intent Classification: Table 6 shows the training split counts for each of the 5 domains. As with domain classification, the partition used for training is also the partition that is augmented (and subsequently also used for training in intent classification; see Figure 3).

A.4 Intent Classification Results
As shown in Table 6, the amount of training data available to train an intent classifier for each domain is quite limited. This can explain the poor performance of the VAE models and their variants on IC, since they require much more training data to perform well. This is also reflected in the domain classification results, where these VAE-based models achieve a slight improvement because more training data is available. Please refer to Table 7 for the fine-grained performance of the augmentation techniques on the 5 domains.

A.5 User Study Details
Participants responded to three items on a Likert scale of one to five: