Jointly Improving Language Understanding and Generation with Quality-Weighted Weak Supervision of Automatic Labeling

Neural natural language generation (NLG) and understanding (NLU) models are data-hungry and require massive amounts of annotated data to be competitive. Recent frameworks address this bottleneck with generative models that synthesize weak labels at scale: a small amount of training labels is expert-curated and the rest of the data is automatically annotated. We follow this approach by automatically constructing a large-scale weakly-labeled dataset with a fine-tuned GPT-2, and employ a semi-supervised framework to jointly train the NLG and NLU models. The proposed framework adapts the parameter updates to the models according to the estimated label quality. On both the E2E and Weather benchmarks, we show that this weakly supervised training paradigm is effective under low-resource scenarios with as little as 10 data instances, and that it outperforms benchmark systems on both datasets when 100% of the training data is used.


Introduction
Natural language generation (NLG) is the task of transforming meaning representations (MR) into natural language descriptions (Reiter and Dale, 2000; Barzilay and Lapata, 2005), while natural language understanding (NLU) is the opposite process, in which text is converted into MR (Zhang and Wang, 2016). These two processes can thus constrain each other: recent exploration of the duality of NLG and NLU has led to successful semi-supervised learning techniques where both labeled and unlabeled data can be used for training (Tseng et al., 2020; Schmitt and Schütze, 2019; Qader et al., 2019).
Standard supervised learning for NLG and NLU depends on access to labeled training data, a major bottleneck in developing new applications. In particular, neural methods require a large annotated dataset for each specific task. The collection process is often prohibitively expensive, especially when specialized domain expertise is required. On the other hand, learning with weak supervision from noisy labels offers a potential solution, as it automatically builds imperfect training sets from low-cost labeling rules or pretrained models (Zhou, 2018; Fries et al., 2020). Further, labeled data and large unlabeled data can be combined in semi-supervised learning (Lample et al., 2017; Tseng et al., 2020) as a way to jointly improve both NLU and NLG models.
To this end, we target a weak supervision scenario (shown in Figure 1) consisting of small, high-quality expert-labeled data and a large set of unlabeled MR instances. We propose to expand the labeled data by automatically annotating the MR samples with noisy text labels. These noisy text labels are generated by a weak annotator, which builds on recent work that directly fine-tunes GPT-2 (Radford et al., 2019) on joint MR and text (Mager et al., 2020; Harkous et al., 2020). Then, we jointly train the NLG and NLU models in a two-step process with semi-supervised learning objectives (Tseng et al., 2020). First, we use pretrained models to estimate quality scores for each sample. Then, we down-weight the loss updates in the back-propagation phase using the estimated quality scores. This way, the models are guided to avoid mistakes of the weak annotator.
On two benchmarks, E2E (Novikova et al., 2017b) and Weather (Balakrishnan et al., 2019), we utilize varying amounts of labeled data and show that the framework successfully learns from the synthetic data generated by the weak annotator, thereby allowing jointly-trained NLG and NLU models to outperform other baseline systems.
This work makes the following contributions: 1. We propose an automatic method to overcome the lack of text labels by using a fine-tuned language model as a weak annotator to construct text labels for the vast amount of MR samples, resulting in a much larger labeled dataset.
2. We propose an effective two-step weak supervision scheme using the dual mutual information (DMI) measure, which modulates parameter updates on the weakly labeled data by providing quality estimates.
3. We show that the approach can even be used to improve upon baselines with 100% data to establish new state-of-the-art performance.

Related Work
Learning with Weak Supervision. Learning with weak supervision is a well-studied area, popularized by the rise of data-driven neural approaches (Safranchik et al., 2020; Wu et al., 2018; Dehghani et al., 2018; Jiang et al., 2018; Chang et al., 2020a; de Souza et al., 2018). Our approach follows a similar line of work: we produce noisy labels (text) with a fine-tuned LM that incorporates prior knowledge from general-domain text and data-text pairs (Budzianowski and Vulić, 2019; Peng et al., 2020; Mager et al., 2020; Harkous et al., 2020; Shen et al., 2020; Chang et al., 2020b, 2021b), and use it as the weak annotator, similar in functionality to fidelity-weighted learning (Dehghani et al., 2017) or the data creation tool Snorkel.
Learning with Semi-Supervision. Work on semi-supervised learning considers settings with some labeled data and a much larger set of unlabeled data, and leverages both, as in machine translation (Artetxe et al., 2017; Lample et al., 2017), data-to-text generation (Schmitt and Schütze, 2019; Qader et al., 2019), or, most relevantly, joint learning frameworks for training NLU and NLG (Tseng et al., 2020). Nonetheless, these approaches all assume that a large collection of text is available, which is an unrealistic assumption for the task due to the need for expert curation. In our work, we show that both NLU and NLG models can benefit from (1) automatically labeling MR with text, and (2) learning from these samples in a semi-supervised fashion while accounting for their quality.

Approach
We represent the set of meaning representation (MR) as X and the text samples as Y. There are no restrictions on the format of the MR: each x ∈ X can be a set of slot-value pairs, or can take the form of tree-structured semantic definitions as in Balakrishnan et al. (2019). Each text y ∈ Y consists of a sequence of words.
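To make the setting concrete, here is a minimal sketch of a slot-value MR and its flattening into the sequential form fed to an encoder. The slot names follow the E2E convention, but the specific pairing with a reference text is our own illustration:

```python
# A meaning representation (MR) as a set of slot-value pairs,
# in the style of the E2E dataset, with one reference text.
mr = {
    "name": "The Eagle",
    "eatType": "coffee shop",
    "food": "French",
    "priceRange": "moderate",
}
text = "The Eagle is a moderately priced French coffee shop."

def linearize(mr):
    """Flatten an MR x into a token sequence x_1 ... x_M."""
    return ", ".join(f"{slot}[{value}]" for slot, value in mr.items())

print(linearize(mr))
# name[The Eagle], eatType[coffee shop], food[French], priceRange[moderate]
```

Tree-structured MRs, as in the Weather dataset, would instead be serialized with bracketed nonterminals, but the same sequence-in/sequence-out interface applies.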
In our setting, we have (1) k labeled pairs and (2) a large set of unlabeled MRs X_U, where |X_U| ≫ k > 0. (We force k > 0, as we believe a reasonable generation system needs at least a few demonstrations of the annotation.) This is a realistic setting for novel application domains, as unlabeled MRs are usually abundant and can also be easily constructed from predefined schemata. Notably, we assume no access to outside resources containing in-domain text: the k annotations are all we know about in-domain text.
The core of our approach consists of first labeling MR samples with text, and then training on the expanded dataset. We start with describing the process of creating weakly labeled data ( §4). Next, we delve into the semi-supervised training objectives for the NLU and NLG models, which allow the models to learn from labeled and unlabeled data ( §5). Lastly, we explain the training process where NLG and NLU models are jointly optimized in two steps: In step 1, we pretrain the models on the weakly-labeled corpus, then continue updating the models on the combined data consisting of the weak and real data in step 2. Importantly, to account for the noise that comes with the automatic weak annotation, step 2 trains the model with quality-weighted updates ( §6). We depict this process in Figure 2.

Figure 2: Depiction of the proposed framework (construct noisy text labels; step 1: train on the weak data; step 2: train on the combined data). In joint learning, gradients are back-propagated through solid lines.

Creating Weakly Labeled Data
We construct synthetic data in two ways: (1) creating more MR samples (see §4.1), and (2) by creating a larger parallel set of MRs with texts (see §4.2).

Generating Synthetic MR Samples
We consider a simple form of MR augmentation via value swapping. This creates more unlabeled MRs to be annotated by the weak annotator, and also provides a substantial augmentation that benefits the autoencoding objective on MR samples (see Equation 3) by exposing it to a larger set of MRs. ...
Since each slot in the MR samples corresponds to multiple possible values, we pair each slot with a randomly sampled value collected from the set of all MR samples, obtaining new combinations of slot-value pairs. This way, we create a large synthetic MR set.
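The swap can be sketched as follows; the paper does not give pseudocode, so the details (drawing each value uniformly from the slot's observed value set, reusing an existing MR as the slot template) are our assumptions:

```python
import random

# Toy pool of observed MRs (slot -> value); the contents are illustrative.
mr_pool = [
    {"name": "The Eagle", "food": "French", "priceRange": "moderate"},
    {"name": "Blue Spice", "food": "Chinese", "priceRange": "cheap"},
    {"name": "The Mill", "food": "English", "priceRange": "high"},
]

def augment_by_value_swapping(mr_pool, n_samples, seed=0):
    """Create synthetic MRs by re-pairing each slot with a value drawn
    at random from the values that slot takes anywhere in the pool."""
    rng = random.Random(seed)
    # Collect the set of observed values per slot.
    values = {}
    for mr in mr_pool:
        for slot, value in mr.items():
            values.setdefault(slot, set()).add(value)
    synthetic = []
    for _ in range(n_samples):
        template = rng.choice(mr_pool)      # which slots appear
        synthetic.append({slot: rng.choice(sorted(values[slot]))
                          for slot in template})
    return synthetic

new_mrs = augment_by_value_swapping(mr_pool, n_samples=5)
```

Because values are recombined rather than invented, every synthetic MR stays within the schema's vocabulary, which is what makes the augmentation safe to feed to the weak annotator.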

Creation of Parallel MR-to-Text Set
GPT-2 (Radford et al., 2019) is a powerful language model pretrained on the large WebText corpus. Recent work on conditional data-to-text generation (Harkous et al., 2020;Mager et al., 2020) demonstrated that fine-tuning GPT-2 on the joint distribution of MR and text for text-only generation yields impressive performance.
The fine-tuned model generates in-domain text by conditioning on samples from the augmented MR set (X_U). Rather than deploying GPT-2 directly at test time, we use its outputs in a process analogous to knowledge distillation (Tan et al., 2018; Tang et al., 2019; Baziotis et al., 2020), where the fine-tuned GPT-2 provides supervisory signals instead of being used directly for generation.
We now describe the GPT-2 fine-tuning process. Given the sequential MR representation x_1 · · · x_M and a sentence y_1 · · · y_N in the labeled dataset (X_L, Y_L), we maximize the joint probability p(x_1 · · · x_M, y_1 · · · y_N). Following Mager et al. (2020), we freeze the input embeddings during fine-tuning, which had a positive impact on performance. At test time, we provide the MR samples as context, as in conventional conditional text generation: the fine-tuned LM conditions on the augmented MR sample set X_U to generate the in-domain text 1 , forming the weakly labeled dataset D_W = (X_U, Ỹ_L) with noisy labels ỹ_i ∈ Ỹ_L. In practice, the fine-tuned LM produces malformed synthetic text that does not fully match the MR it was conditioned on, as it might hallucinate additional values inconsistent with its MR counterpart. Thus, it is necessary to check for factual consistency (Moryossef et al., 2019). We address this point next.
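A minimal sketch of how MR and text can be packed into a single sequence for this joint fine-tuning. The separator tokens and helper names below are our own illustration; the paper follows Mager et al. (2020) and Harkous et al. (2020) but does not fix an exact format here:

```python
# Illustrative special tokens delimiting the MR and text segments.
BOS, SEP, EOS = "<|mr|>", "<|text|>", "<|endoftext|>"

def to_training_string(mr_seq, text):
    """Concatenate a linearized MR x_1..x_M with its text y_1..y_N so the
    LM is fine-tuned on the joint sequence p(x_1..x_M, y_1..y_N)."""
    return f"{BOS} {mr_seq} {SEP} {text} {EOS}"

def to_generation_prefix(mr_seq):
    """At test time, condition on the prefix up to SEP and let the LM
    complete the text segment, yielding a weak label for this MR."""
    return f"{BOS} {mr_seq} {SEP}"

example = to_training_string("name[The Eagle], food[French]",
                             "The Eagle serves French food.")
```

Every string produced by `to_training_string` would then be tokenized and fed to a standard causal language-modeling objective.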
Past findings (e.g., Wang, 2019) showed that removing utterances with "hallucinated" facts (MR values) leads to considerable performance gains, since inconsistent MR-text correspondences might misguide systems into generating incorrect facts and deteriorate NLG outputs. We filter out poor-quality synthetic MR-text pairs by training a separate NLU model on the original labeled data to predict MRs from the generated text labels. These predicted MRs can then be checked against the paired MRs in D_W via pattern matching, as inspired by Cai and Knight (2013) and Wiseman et al. (2017). Specifically, we use a measure of semantic similarity: an f-score over matched slots between the two MRs. We keep all MR-text pairs with f-scores above 0.7, as we found empirically that this criterion retains a sufficiently large amount of high-quality data. The removed text sentences are still used for the unsupervised training objectives in Eq. 1-3. Using this method, we create a collection of parallel MR-text samples (~500k) an order of magnitude larger than even the full training sets (~40k for E2E and ~25k for Weather).
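The filtering step can be sketched as follows; treating the f-score as exact matching over slot-value pairs is our reading of the pattern-matching criterion, not a detail the paper spells out:

```python
def slot_f_score(predicted_mr, paired_mr):
    """F-score between the slot-value pairs of the NLU-predicted MR and
    the MR the weak label was conditioned on. Exact pair matching is an
    assumption; fuzzier value matching would slot in here."""
    pred = set(predicted_mr.items())
    gold = set(paired_mr.items())
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def filter_weak_pairs(pairs, threshold=0.7):
    """Keep (mr, text) pairs whose NLU-predicted MR matches the paired MR;
    rejected texts are recycled for the unsupervised objectives."""
    kept, removed_text = [], []
    for mr, text, predicted_mr in pairs:
        if slot_f_score(predicted_mr, mr) >= threshold:
            kept.append((mr, text))
        else:
            removed_text.append(text)
    return kept, removed_text
```

The 0.7 threshold mirrors the value reported in the paper; in practice it would be tuned against the size/quality trade-off of the retained set.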

Joint learning of NLG and NLU
For both the NLU and NLG models, we adopt the same architecture as Tseng et al. (2020), which uses two Bi-LSTM-based (Hochreiter and Schmidhuber, 1997) encoders, one for each model. For slot-value structured data (e.g., E2E; Mrkšić et al., 2017), the NLU decoder consists of a 1-layer feedforward classifier per slot, while for the tree-structured meaning representations of Balakrishnan et al. (2019), the decoder is LSTM-based. In this framework, both NLU and NLG models are trained to repeatedly infer the shared latent variable, starting from either MR or text, in order to encourage semantic consistency. Each model can be improved via gradient passing between them using REINFORCE (Williams, 1992). This way, the models benefit from each other's training in a process known as dual learning, which consists of both unsupervised and supervised learning objectives, described next.
Unsupervised Learning. Starting from either a MR sample or a text sample, the models project the sample from one space into the other, then map it back to the original space (either MR or text sample, respectively), and compute the reconstruction loss after the two operations. This repetition will result in aligned pairs between the MR samples and corresponding text (He et al., 2016). Specifically, let p θ (y|x) be the probability distribution to map x to its corresponding y (NLG), and p φ (x|y) be the probability distribution to map y back to x (NLU).
Starting from x ∈ X, the objective is:

L_x = E_{y ∼ p_θ(y|x)} [− log p_φ(x|y)],    (1)

which ensures semantic consistency by performing NLG followed by NLU in the direction x → y → x. Note that only p_φ is updated in this direction; p_θ serves only as an auxiliary function providing pseudo samples y from x. Similarly, starting from y ∈ Y, the objective ensures semantic consistency in the direction where the NLU step is followed by NLG, y → x → y 2 :

L_y = E_{x ∼ p_φ(x|y)} [− log p_θ(y|x)].    (2)

We further add two autoencoding objectives on the MR and text samples:

L_auto = − log p(x|x) − log p(y|y),    (3)

where each sample is encoded into the shared latent variable and decoded back into its own space. Unlabeled text samples can thus be used, as they benefit the text space Y by introducing new signals into the learning directions y → x → y and ỹ → y; we therefore use all in-domain text data, whether or not it has a corresponding MR. Following Tseng et al. (2020), we also adopt the variational optimization objective on the latent variable z, which was shown to pull the inferred posteriors q(z|x) and q(z|y) closer to each other. In this case, the parameters of both the NLG and NLU models are updated.
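To make the two reconstruction directions concrete, here is a toy numerical sketch over enumerable MR/text spaces. The probability tables are invented for illustration; in the real system p_θ and p_φ are the Bi-LSTM-parameterized NLG and NLU models:

```python
import math

# Toy discrete spaces: two MRs and two texts, with hand-set model tables.
# p_nlg[x][y] plays the role of p_θ(y|x); p_nlu[y][x] of p_φ(x|y).
p_nlg = {"x1": {"y1": 0.9, "y2": 0.1}, "x2": {"y1": 0.2, "y2": 0.8}}
p_nlu = {"y1": {"x1": 0.7, "x2": 0.3}, "y2": {"x1": 0.4, "x2": 0.6}}

def cycle_loss_from_mr(x):
    """Eq. 1: L_x = E_{y ~ p_θ(y|x)}[-log p_φ(x|y)], computed exactly by
    enumerating the tiny text space (the x -> y -> x direction)."""
    return sum(p_y * -math.log(p_nlu[y][x]) for y, p_y in p_nlg[x].items())

def cycle_loss_from_text(y):
    """Eq. 2: L_y = E_{x ~ p_φ(x|y)}[-log p_θ(y|x)] (y -> x -> y)."""
    return sum(p_x * -math.log(p_nlg[x][y]) for x, p_x in p_nlu[y].items())
```

With intractable spaces, the expectation is replaced by sampled pseudo labels, and REINFORCE carries the gradient through the sampling step.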
Supervised Learning. Apart from the above unsupervised objectives, we impose a supervised objective on the k labeled pairs:

L_sup = − log p_θ(y|x) − log p_φ(x|y).    (4)

Each MR is flattened into a sequence and fed into the NLG encoder, giving the NLG and NLU models an inductive bias to project similar MR/text into nearby regions of the latent space (Chisholm et al., 2017). As we observed anecdotally 3 , the information flow enabled by REINFORCE allows the models to utilize unlabeled MR and text, boosting performance in our scenarios.

Learning with Weak Supervision
The primary challenge arising from the synthetic data is the noise introduced during generation. Noisy, poor-quality labels tend to bring little to no improvement (Elman, 1993; Frénay and Verleysen, 2013). To better train on the large and noisy corpus described in §4 (size ~500k), we employ a two-step training process motivated by fidelity-weighted learning (Dehghani et al., 2018). The two steps are (1) pretraining and (2) quality-weighted fine-tuning that accounts for the heterogeneous data quality.
Step 1: Pre-train two sets of models on weak and clean data, respectively. We train the first set of models (teacher) consisting of NLU, NLG, and autoencoder (AUTO) models on the clean data. The second set of models (i.e. NLU and NLG) is the student that pretrains on the weak data.
Step 2: Fine-tune the student model parameters on the combined clean and weak datasets. We use each teacher model to determine the step size for each iteration of stochastic gradient descent (SGD), down-weighting the training step of the corresponding student model by the sample quality given by the teacher. Data points with true labels have high quality and are thus given a larger step size when updating the parameters; conversely, we down-weight the training steps of the student for data points where the teacher is not confident. Concretely, we update the parameters w of the student (i.e., the NLG and NLU models) at time t with SGD:

w_{t+1} = w_t − η_t c(x_i, ỹ_i) ∇L(ŷ, ỹ_i),

where L(·) is the loss of predicting ŷ for an input x_i when the label is ỹ_i, so the weighted step is c(x_i, ỹ_i)∇L(ŷ, ỹ_i), with c(·) a scoring function learned by the teacher that takes as input the MR x_i and its noisy text label ỹ_i. In essence, we control the degree of parameter updates to the student based on how reliable its labels are according to the teacher.
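A minimal sketch of the quality-weighted update (plain-list parameters stand in for model weights; the numbers are illustrative):

```python
def quality_weighted_sgd_step(params, grads, lr, quality):
    """One SGD update where the step for a weak sample (x_i, y~_i) is
    scaled by the teacher's confidence c(x_i, y~_i) in [0, 1]: clean
    samples keep (close to) the full step while dubious weak labels
    barely move the student."""
    return [p - lr * quality * g for p, g in zip(params, grads)]

# A clean sample (quality ~1.0) moves the parameters much more than a
# noisy weak sample (quality ~0.1) with the same gradient.
params = [1.0, -2.0]
grads = [0.5, 0.5]
clean = quality_weighted_sgd_step(params, grads, lr=0.1, quality=1.0)
noisy = quality_weighted_sgd_step(params, grads, lr=0.1, quality=0.1)
```

The same scaling carries over unchanged to momentum or Adam-style optimizers by multiplying the per-sample gradient before it enters the optimizer state.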
We define c(·) as a function of label quality based on the dual mutual information (DMI), the absolute difference between the mutual information (MI) 4 in the inference directions x → y and y → x. Bugliarello et al. (2020) show that MI_{x→y} correlates with the difficulty of predicting y from x, and vice versa. We thus expect the difference between MI_{x→y} and MI_{y→x} to be relatively small for a clean sample (x, y) compared to noisy samples, since the level of difficulty is largely proportional between NLU and NLG on the same sample: difficulty in inferring x from y implies harder prediction of y from x. Based on this intuition, the DMI score of a sample (x, y) is defined as:

DMI(x, y) = |MI_{x→y}(x, y) − MI_{y→x}(x, y)|,

where the two MI terms are estimated with the two respective models q(·). The DMI for a clean MR-text pair should be relatively small, as the two sides carry proportional semantic information 5 ; poor-quality samples tend to have higher DMI scores and lower c(·), as they are less semantically aligned. Thus, c(·) expresses the confidence (quality) the teacher assigns to the current MR-text sample, and we use it to scale η_t . Note that η_t depends only on t, whereas c(·) depends on each data point. We define c(x_t, y_t) as:

c(x_t, y_t) = 1 − N(DMI(x_t, y_t)),

where N(·) normalizes DMI over all samples in both clean and weak data to lie in [0, 1].

Table 1: Performance for NLG (BLEU-4) and NLU (joint accuracy (%)) on the E2E and Weather datasets with increasing amounts of labeled data, from 10 and 50 labeled instances to 1%, 5%, and 100% of the labeled data (DL). Models that have access to unlabeled ground-truth text labels are marked with *. We provide results for the NLG and NLU models trained separately using supervised objectives alone (decoupled), our semi-supervised joint-learning model (joint), joint with all unlabeled data (joint+aug), and weakly-supervised models (step 1). Step 1+2 denotes the full proposed approach. Sources of data include labeled data (DL), unlabeled MR (XU), weakly labeled data (DW), 100% real text (YSL), and weak text labels (YWL).

the seq2seq model. All models were trained on 1 Nvidia V100 GPU (32GB, CUDA 10.2) for 10k steps. The average training time was approximately 1 hour for the seq2seq model, and roughly 2 hours for the proposed semi-supervised training with 100% data. The total number of updates is set to 10k steps for all training, and patience is set to 100 updates. At decoding time, sentences are generated using greedy decoding.
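The DMI-based scoring can be sketched as follows. The min-max form of the normalization N(·) and the helper names are assumptions: the paper only states that N(·) maps DMI over all clean and weak samples into [0, 1]:

```python
def dmi(mi_x_to_y, mi_y_to_x):
    """Dual mutual information: absolute gap between the two inference
    directions, small for clean pairs whose two sides are equally easy."""
    return abs(mi_x_to_y - mi_y_to_x)

def quality_scores(mi_pairs):
    """c(x, y) = 1 - N(DMI(x, y)), with N a min-max normalization over
    all (clean + weak) samples so that scores fall in [0, 1]."""
    dmis = [dmi(a, b) for a, b in mi_pairs]
    lo, hi = min(dmis), max(dmis)
    span = (hi - lo) or 1.0          # guard against a degenerate batch
    return [1.0 - (d - lo) / span for d in dmis]

# Toy MI estimates (nats): a pair with balanced directions gets the
# highest confidence, a pair with a large directional gap the lowest.
scores = quality_scores([(2.0, 2.1), (2.0, 0.4), (1.5, 1.0)])
```

Each score would then multiply the learning rate of the corresponding student update, as described in step 2 above.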

Results
We first compare our model with other baselines on both datasets, then perform a set of ablation studies on the E2E dataset to see the effects of each component. Finally, we analyze the strength of the weak annotator, and the effect of the qualityweighted weak supervision, before concluding with the analysis of dual mutual information.

E2E NLG                                     BLEU-4
TGEN (Dušek and Jurcicek, 2016)             0.6593
SLUG (Juraska et al., 2018)                 0.6619
Dual supervised learning (Su et al., 2019)  0.5716
JUG (Tseng et al., 2020)                    0.6855
GPT2-FT                                     0.6562
WA (Harkous et al., 2020)                   0.6445
Ours (step 1+2)                             0.7025

Weather NLG                                 BLEU-4
S2S-CONSTR (Balakrishnan et al., 2019)      0.7660
JUG (Tseng et al., 2020)                    0.7768
Ours (step 1+2)                             0.7986

In particular, we experiment with various low-resource conditions of the training set (10 instances, 50 instances, 1% of all data, 5% of all data). To show that our proposed approach is consistently better, we also include scenarios with 0-100% of the data at 10% intervals, showing that performance does not deteriorate as more training samples are added (Figure 4). Table 2 summarizes the training data used for all models in Table 1. We compare our model with (1) a fine-tuned GPT2 model (GPT2-FT) that uses a switch mechanism to select between the input and GPT2 knowledge, (2) the weak annotator (WA) that predicts text from MR or MR from text, depending on the input format during fine-tuning (Harkous et al., 2020) 7 , and (3) the semi-supervised model 8 (JUG) from Tseng et al. (2020). Note that the specialized encoder in GPT2-FT cannot easily be adapted to the tree-structured input in Weather, so we do not provide its score on the Weather dataset.

Figure 4: Model performance (BLEU-4) on 5% E2E data with varying percentages of strong and weak data, with and without DMI-based quality weighting. The left plot begins with models trained on labeled data, while the right plot starts with the weak synthesized dataset instead.
In Table 1, we show that our proposed approach (step 1+2) generally performs better than the baselines for both tasks (NLG and NLU) for most selected labeled data sizes. We show that even with only 10 labeled instances, our approach (step 1+2) is able to yield decent results compared to the baselines. The difference between models tends to be larger for settings with few training instances, and the advantage of the method diminishes as the amount of labeled data available for JUG increases, to the point where JUG is able to outperform the proposed approach. Overall, the benefit of the noisy supervisory signal from the weak data is able to boost performance, especially at lower resource conditions.
We observe that training with weakly labeled data alone (step 1) is not sufficient; strong data is required to provide the necessary supervisory signals (step 2). Further, the fact that joint+aug displays noticeable improvements over joint suggests that simply having augmented text helps improve the latent space encoded by both the NLU and NLG encoders. This also shows an alternative way to introduce additional in-domain information to both models, even though the NLU model does not benefit directly from additional text. Importantly, our approach shows that the weak annotator is able to bridge the gap defined by access to ground-truth text labels in JUG, outperforming it significantly in low-resource conditions (10, 50, 1%, 5%), with the difference in NLG being as large as 48.7 BLEU points with 10 instances. The proposed model also performs well in the high-resource (100% labeled data) condition, as shown in Table 3. Moreover, with 100% labeled data, our model still produces superior performance over some of the baselines, which shows that weak annotation does capture additional useful patterns that benefit the NLG process.

7 No released source code, so we re-implemented it based on the paper.
8 https://github.com/andy194673/Joint-NLU-NLG

Analysis
Error Analysis. Since word-level overlap scores usually correlate rather poorly with human judgements of fluency and information accuracy (Reiter and Belz, 2009; Novikova et al., 2017a), we perform a human evaluation on 100 sampled generation outputs from the E2E corpus. For each MR-text pair, the annotator is instructed to evaluate fluency (score 1-5, with 5 being most fluent), miss (count of MR slots that were missed), and wrong (count of included slots not in the MR). Results are presented in Table 4, with fluency scores averaged over 50 crowdworkers. We show that with 1% data, both NLU and NLG models yield significantly fewer errors in terms of missed and wrong facts, while producing more fluent outputs. However, the model generates more redundant slot-value pairs, which we attribute to the noisy augmentation that "misguided" the NLU model.
How Strong is the Weak Annotator? To assess the strength of the weak annotator (WA) itself, we also computed its NLG scores with varying amounts of labeled data (see Table 1). We observe that the WA suffers a performance drop in lower-resource conditions (e.g., 0.195 BLEU with 10 labeled instances), when the given training samples are not sufficient for the pretrained model to converge upon a region of in-domain generation. However, it yields quality data when conditioned on a large number of possible MRs (e.g., 50% data), forming a useful in-domain text set (see Table 6).
Analysis of Weak Supervision. In Table 5, we present an ablation study on weak supervision (see §6). The effect of data fidelity is stronger on NLU than on NLG, due to the nature of the filtering process, which removes faulty text labels and thereby influences both the x → y and y → y training directions. Next, though weak supervision boosts the model by giving direct supervision in the training directions x → y and y → x, the noisy nature of the augmentation limits its effectiveness. The model is further improved by the proposed quality-weighted update, which takes the sample quality into account and alleviates the influence of poor-quality samples. Refer to Table 7 for an output comparison.
Analysis of the Two-Step Training Process. As inspired by Dehghani et al. (2018), we justify the two-step training process by performing two types of experiments with 5% data (see Figure 4): In the first experiment, we use all the available strong data but consider different ratios of the entire weak dataset -as used in our 2-step approach. In the second, we fix the amount of weak data and provide the model with varying amounts of strong data. The results show that the student models are generally better off by having the teacher's supervision. Further, pretraining on weak data prior to fine-tuning on strong data appears to be the better approach and this motivates the reasoning behind our two-step approach.
Analysis of the Dual Mutual Information. Figure 5 depicts DMI with MI_{x→y} on the x-axis and MI_{y→x} on the y-axis, plotting 100 randomly sampled noisy and ground-truth samples for both datasets. The diagonal reference represents the scenario in which NLG and NLU inference are equally difficult, and we see that annotated data cluster more tightly around it. This means that the DMI scores of expert-labeled samples tend to be smaller: NLU and NLG inference for these samples carry similar levels of difficulty. Importantly, since DMI scores are normalized over both clean and noisy samples, the proximity of data to the diagonal can be used to estimate sample quality: clean data lie closer than noisy samples. Clean data will thus have smaller normalized scores, higher c(·), and a larger update step. This further supports the use of the proposed sample-quality-based updates on the parameters.

Conclusion and Future Work
In this paper, we show the efficacy of a framework in which data is automatically labeled and both NLU and NLG models learn with quality-weighted weak supervision so as to account for individual data quality. Most importantly, we show that the two-step training process is not only useful for improving the model but also yields text of decent quality. This work serves as a starting point for weakly-supervised learning in natural language generation, especially for topics related to instance-based weighting approaches.
For future work, we hope to extend on the framework and propose ways with which it can be incorporated into existing text annotation systems.