Boosting Naturalness of Language in Task-oriented Dialogues via Adversarial Training

The natural language generation (NLG) module in a task-oriented dialogue system produces user-facing utterances conveying required information. Thus, it is critical for the generated response to be natural and fluent. We propose to integrate adversarial training to produce more human-like responses. The model uses the Straight-Through Gumbel-Softmax estimator for gradient computation. We also propose a two-stage training scheme to boost performance. Empirical results show that adversarial training can effectively improve the quality of language generation in both automatic and human evaluations. For example, on the RNN-LG Restaurant dataset, our model AdvNLG outperforms the previous state-of-the-art result by 3.6% in BLEU.


Introduction
In task-oriented dialogues, the computer system communicates with the user in the form of a conversation and accomplishes various tasks such as hotel booking, flight reservation and retailing. In this process, the system needs to accurately convert the desired information, a.k.a. the meaning representation, into a natural utterance and convey it to the user (Table 1). The quality of the response directly impacts the user's impression of the system. Thus, there are numerous previous studies in the area of natural language generation (NLG) for task-oriented dialogues, ranging from template-based models (Cheyer and Guzzoni, 2014; Langkilde and Knight, 1998) to corpus-based methods (Dušek and Jurčíček, 2016; Tran and Nguyen, 2017; Wen et al., 2015).
However, one issue yet to be solved is that the system responses often lack the fluency and naturalness of human dialogues. In many cases, the system responses are not natural, violating inherent human language usage patterns. For instance, in the last row of Table 1, two pieces of location information for the same restaurant entity should not be stated in two separate sentences ("It is located in the riverside. It is near Raja Indian Cuisine."). In another example in Table 4, the positive review child friendly and the negative review low rating should not appear in the same sentence connected by the conjunction and. These nuances in language usage do impact the user's impression of the dialogue system, making the system response rigid and less natural.
To solve this problem, several methods use reinforcement learning (RL) to boost the naturalness of generated responses (Ranzato et al., 2015; Li et al., 2016). However, the Monte-Carlo sampling process in RL is known to have high variance, which can make the training process unstable. Li et al. (2015) propose to use maximum mutual information (MMI) to boost the diversity of language, but this criterion makes exact decoding intractable.
On the other hand, adversarial training for natural language generation has been shown to be promising, as the system needs to produce responses indiscernible from human utterances (Rajeswar et al., 2017; Wu et al., 2017; Nie et al., 2018). Apart from the generator, there is a discriminator network which aims to distinguish system responses from human utterances. The generator is trained to fool the discriminator, resulting in a min-max game between the two components which boosts the quality of generated utterances (Goodfellow et al., 2014). Due to the discreteness of language, most previous work on adversarial training in NLG applies reinforcement learning, suffering from the high-variance problem (Yu et al., 2017; Li et al., 2017; Ke et al., 2019).
In this work, we apply adversarial training to utterance generation in task-oriented dialogues and propose the model AdvNLG. Instead of using RL, we follow Yang et al. (2018) in leveraging the Straight-Through Gumbel-Softmax estimator (Jang et al., 2016) for gradient computation. In the forward pass, the generator uses the argmax operation on the vocabulary distribution to select an utterance and sends it to the discriminator. But during backpropagation, the Gumbel-Softmax distribution is used to let gradients flow back to the generator. We also find that pretraining the generator for a warm start is very helpful for improving performance.
To evaluate our model, we conduct experiments on the public datasets E2E-NLG (Novikova et al., 2017) and RNN-LG (Wen et al., 2016). Our model achieves strong performance and obtains new state-of-the-art results on four datasets. For example, on the Restaurant dataset, it improves the best result by 3.6% in BLEU. Human evaluation corroborates the effectiveness of our model, showing that adversarial training against human responses can make the generated language more accurate and natural.

Problem Formulation
The goal of the natural language generation module in task-oriented dialogues is to produce system utterances directly issued to the end users (Young, 2000). The generated utterances need to carry necessary information determined by upstream dialogue modules, including the dialogue act (DA) and meaning representation (MR).
The dialogue act specifies the type of system response (e.g. inform, request and confirm), while the meaning representation contains rich information that the system needs to convey to or request from the user in the form of slot-value pairs. Each slot indicates the information category and each value represents the information content. Therefore, the training data for the supervised NLG task is a set of tuples {(d_i, r_i, y_i)}_{i=1}^{N}, where d_i is the dialogue act, r_i is the set of MR slot-value pairs, and y_i is the human-labeled response.
NLG models typically use delexicalization during training and inference, replacing slot values in the utterance with special tokens such as SLOT_NAME. In this way, the system does not need to generate proper nouns. Finally, the model substitutes these special tokens with the corresponding values when delivering the response to users.
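As a concrete illustration, a naive delexicalization round-trip might look like the sketch below. The SLOT_NAME token format and exact string matching are simplifying assumptions for exposition, not the paper's implementation (real systems must handle overlapping values and tokenization):

```python
def delexicalize(utterance, slot_values):
    """Replace slot values in an utterance with placeholder tokens.
    `slot_values` maps slot names to surface values,
    e.g. {"name": "Wildwood", "food": "Indian"}.
    Naive exact-string matching; illustrative only."""
    for slot, value in slot_values.items():
        utterance = utterance.replace(value, f"SLOT_{slot.upper()}")
    return utterance

def relexicalize(utterance, slot_values):
    """Substitute placeholder tokens back with real values before
    delivering the response to the user."""
    for slot, value in slot_values.items():
        utterance = utterance.replace(f"SLOT_{slot.upper()}", value)
    return utterance
```

For example, "Wildwood serves Indian food." is delexicalized to "SLOT_NAME serves SLOT_FOOD food.", so the generator only has to learn the sentence pattern, not the proper nouns.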

Generator Model
We use the sequence-to-sequence encoder-decoder architecture (Sutskever et al., 2014) for the response generator G. The input to the encoder is a single sequence x of length m, formed by concatenating the dialogue act d with the slots and values in the meaning representation r. The target utterance y has n tokens, y_1, ..., y_n. Following previous work, we delexicalize both sequences and surround each sequence with BOS and EOS tokens.
Both the encoder and decoder use a GRU as the contextual encoder, and they share the embedding matrix E to map each token to a fixed-length vector. The final hidden state of the encoder RNN is used as the initial state of the decoder RNN. Moreover, the decoder employs a dot-product attention mechanism over the encoder states to get a context vector c at each decoding step.
This context vector c is concatenated with the embedding of the current token and fed into the GRU to predict the next token. The result p t = p(y t |y 1 , ..., y t−1 ; x) is the probability distribution of the next token over all tokens in dictionary V .
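The attention step described above can be sketched as follows (pure Python with toy dimensions; the actual model computes this over learned GRU hidden states as batched tensors):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: score each encoder hidden state against
    the current decoder state, normalize with softmax, and return the
    weighted-sum context vector c plus the attention weights."""
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights
```

The resulting context vector is then concatenated with the current token embedding and fed into the GRU to predict the next-token distribution p_t.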
We use cross entropy as the generator's loss function. Suppose the one-hot ground-truth token vector at the t-th step is y_t; then the loss is

L_{CE} = -\sum_{t=1}^{n} y_t^{\top} \log p_t. \quad (1)

Adversarial Training
The goal of the adversarial training is to use a discriminator to differentiate between the generated utterance \hat{y} from the generator and the ground-truth utterance y. We leverage an improved version of the generative adversarial network (GAN), Wasserstein-GAN (WGAN) (Arjovsky et al., 2017), in our framework. WGAN defines a min-max game between the generator G and the discriminator D:

\min_G \max_D \; \mathbb{E}_{y}[D(y)] - \mathbb{E}_{\hat{y} \sim G(x)}[D(\hat{y})], \quad (2)

where G(x) denotes the probability distribution computed by the generator G given input x. The discriminator function D is a scoring function on utterances.
The goal of the generator is to make \hat{y} as similar as possible to y to fool the discriminator D (the outer-loop min), while D learns to distinguish the generated output \hat{y} from the ground truth y (the inner-loop max) via the scoring function.
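Concretely, the two sides of the min-max objective translate into the following per-batch losses (a minimal pure-Python sketch over scalar critic scores; in practice these would be tensor means with gradients attached):

```python
def wgan_critic_loss(real_scores, fake_scores):
    """The critic maximizes E[D(y)] - E[D(y_hat)]; written as a loss
    to be minimized, we negate that gap."""
    mean = lambda xs: sum(xs) / len(xs)
    return -(mean(real_scores) - mean(fake_scores))

def wgan_generator_loss(fake_scores):
    """The generator minimizes -E[D(y_hat)], i.e. it pushes the
    critic's scores for generated utterances up."""
    mean = lambda xs: sum(xs) / len(xs)
    return -mean(fake_scores)
```

Note that, unlike the original GAN, WGAN scores are unbounded reals rather than probabilities, which is why the discriminator here ends in a plain linear layer.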

Discriminator Model
For the discriminator, we reuse the embedding matrix E as the embedder, followed by a bidirectional GRU layer. The last GRU hidden state h is passed through a batch normalization layer and a linear layer to obtain the final score:

D(y) = W_3 \cdot \mathrm{BN}(h) + b_3,

where W_3 and b_3 are trainable parameters.

Training
Gradient computation. One problem with adversarial training in language generation is that the token sequence \hat{y} sampled from G is discrete, making it impossible to back-propagate gradients from the min-max objective to the generator. Several previous methods leverage reinforcement learning for gradient computation (Yu et al., 2017; Li et al., 2017). However, the related sampling process can introduce high variance during training. Therefore, we employ the Straight-Through Gumbel-Softmax estimator (Jang et al., 2016; Baziotis et al., 2019). In detail, during the forward pass, at the t-th step, the argmax of the generated word distribution p_t is taken, i.e. greedy sampling. But for gradient computation, the Gumbel-Softmax distribution is used as a differentiable alternative to the argmax operation:

\tilde{p}_t(i) = \frac{\exp\left((\log p_t(i) + g_i)/\tau\right)}{\sum_{j=1}^{|V|} \exp\left((\log p_t(j) + g_j)/\tau\right)},

where g_1, ..., g_{|V|} are i.i.d. samples drawn from the Gumbel(0, 1) distribution and τ is the softmax temperature. Jang et al. (2016) show that the Gumbel-Softmax distribution converges to the one-hot distribution as τ → 0 and to the uniform distribution as τ → ∞. We set τ = 0.1 in all experiments.

Two-stage Training. We find that adversarial training does not work well if we optimize both the cross entropy (Eq. 1) and the min-max objective (Eq. 2) from the beginning. However, if we first warm up the generator with only the cross entropy loss for several epochs, and then train with the discriminator under both the cross entropy and adversarial objectives, the performance is consistently boosted. We argue that during early stages the generator cannot produce meaningful output, making the discriminator easy to overfit; it is then hard for the generator to learn to fool the adversary. We summarize our model AdvNLG and the gradient computation process in Fig. 1.
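The Straight-Through sampling step can be sketched as follows (pure Python, illustrative only; a real implementation operates on tensors and routes gradients through the soft sample while the hard one-hot vector is what the discriminator actually sees):

```python
import math, random

def gumbel_softmax(probs, tau=0.1):
    """Soft, differentiable surrogate for argmax sampling:
    softmax((log p_i + g_i) / tau) with g_i ~ Gumbel(0, 1),
    sampled via inverse transform: g = -log(-log(u)), u ~ U(0, 1)."""
    g = [-math.log(-math.log(random.random())) for _ in probs]
    logits = [(math.log(p) + gi) / tau for p, gi in zip(probs, g)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def straight_through_sample(probs, tau=0.1):
    """Forward pass: hard one-hot from the argmax of p_t (greedy
    sampling, as described above) to feed the discriminator.
    Backward pass would substitute the soft sample's gradients for
    the non-differentiable argmax."""
    soft = gumbel_softmax(probs, tau)
    hard = [0.0] * len(probs)
    hard[max(range(len(probs)), key=lambda i: probs[i])] = 1.0
    return hard, soft
```

With a small τ such as 0.1, the soft sample is already close to one-hot, so the mismatch between the forward (hard) and backward (soft) paths stays small.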

Experiments
We conduct empirical tests on a number of benchmarks for task-oriented dialogues over a variety of domains such as restaurant booking, hotel booking and retail. The datasets include the E2E-NLG task (Novikova et al., 2017) with 51.4K samples, and the TV, Laptop, Hotel and Restaurant datasets from RNN-LG (Wen et al., 2016), with 14.1K, 26.5K, 8.7K and 8.5K samples respectively. We use BLEU-4 (Papineni et al., 2002) as the automatic metric, computed by the official evaluation scripts from E2E-NLG and RNN-LG.

Training Details
In all experiments, the learning rate is 1e-3, the batch size is 20 and the beam width in inference is 10. Following WGAN, the discriminator's parameters are clipped to the range [-0.1, 0.1]. We use RMSprop (Ruder, 2016) as the optimizer. Teacher forcing is used for training the generator, which means that the decoder is exposed to the previous ground-truth token. In the warm-up phase, we train the generator for 2 epochs. On the E2E-NLG dataset, the generator is updated 5 times before the discriminator is updated once, which is typical in GAN training (Wu et al., 2017). The hyper-parameters above are chosen based on performance on the dev set. Other hyper-parameters such as the dropout rate, dictionary dimension and RNN hidden size are kept the same as in the NLG-LM baseline.
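Putting the schedule together, the two-stage training and the 5:1 generator/discriminator update ratio can be sketched as follows (a counter-only skeleton; the warm-up length and update ratio come from the text above, while the epoch and batch counts are illustrative placeholders):

```python
def train(num_epochs=10, warmup_epochs=2, batches_per_epoch=100, g_per_d=5):
    """Two-stage schedule: warm up the generator with cross entropy
    only, then alternate 5 generator updates (cross entropy plus
    adversarial loss) per discriminator update, WGAN-style."""
    log = {"g_mle": 0, "g_adv": 0, "d": 0}
    step = 0
    for epoch in range(num_epochs):
        for _ in range(batches_per_epoch):
            if epoch < warmup_epochs:
                log["g_mle"] += 1   # stage 1: cross-entropy-only update
                continue
            step += 1
            log["g_adv"] += 1       # stage 2: CE + adversarial update
            if step % g_per_d == 0:
                log["d"] += 1       # one critic update per 5 G updates;
                                    # clip critic weights to [-0.1, 0.1]
    return log
```

With the defaults above, the generator receives 200 warm-up updates, then 800 adversarial updates interleaved with 160 discriminator updates.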
For baseline models, we implemented NLG-LM and reproduced its results. We obtain the prediction results of Slug (Juraska et al., 2018) from its open-source website. Qualitatively, our model tends to group information of the same category together, while placing positive and negative aspects (e.g. family-friendly and expensive) in different sentences.

As shown in Table 2, AdvNLG achieves strong performance and obtains new state-of-the-art results on four datasets; for example, it improves the previous best result on the Restaurant dataset by 3.6% in BLEU.
Ablation Study. The bottom section of Table 2 shows that adversarial training boosts performance by 0.7% to 7.8%. Our proposed two-stage training is also very beneficial: if the generator and discriminator are both trained adversarially from scratch, the result drops significantly. RL-based adversarial training achieves mixed results; on the TV dataset, it even hurts performance. We attribute this to the high variance and instability of RL training.

Table 3: Average human evaluation ratings (1-3, 3 is best) for naturalness and accuracy of output generated by different models. Standard deviations are shown in parentheses. *: the p-value is smaller than 0.01.

Human Evaluation
We randomly sample 100 data-text pairs from the test set of E2E-NLG. We then ask 3 labelers to judge the accuracy and naturalness of the utterances generated by Slug, NLG-LM, and AdvNLG with and without adversarial training. Accuracy measures how precisely the utterance expresses the dialogue act and meaning representation; naturalness measures how likely the labeler thinks the utterance was spoken by a real human. In addition to the model output, each labeler is also given the meaning representation and the ground truth, and gives an integer rating from 1 to 3 (3 being the best) for each criterion. Table 3 shows that our AdvNLG model has a clear lead in both naturalness and accuracy, and a paired t-test shows that the result is statistically significant with a p-value smaller than 0.01. Our ablation model (-Adv.) achieves the lowest scores, showing that adversarial training boosts both naturalness and accuracy.

Conclusion
In this paper, we propose adversarial training using the Straight-Through Gumbel-Softmax estimator in NLG for task-oriented dialogues. We also propose a two-stage training scheme to further boost the gain in performance. Experimental results show that our model, AdvNLG, consistently outperforms state-of-the-art models in both automatic and human evaluations.
In the future, we plan to apply this method to other conditional generation tasks, e.g. producing a natural utterance containing a given list of keywords.