Building Adaptive Acceptability Classifiers for Neural NLG

We propose a novel framework to train models that classify the acceptability of responses generated by natural language generation (NLG) models, improving upon existing sentence transformation and model-based approaches. An NLG response is considered acceptable if it is both semantically correct and grammatical. We do not make use of any human references, making the classifiers suitable for runtime deployment. Training data for the classifiers is obtained using a two-stage approach: we first generate synthetic data using a combination of existing and new model-based approaches, then apply a novel validation framework to filter and sort the synthetic data into acceptable and unacceptable classes. Our two-stage approach adapts to a wide range of data representations and does not require additional data beyond what the NLG models are trained on. It is also independent of the underlying NLG model architecture, and is able to generate more realistic samples close to the distribution of the NLG model-generated responses. We present results on 5 datasets (WebNLG, Cleaned E2E, ViGGO, Alarm, and Weather) with varying data representations. We compare our framework with existing techniques that generate synthetic data using simple sentence transformations and/or model-based techniques, and show that building acceptability classifiers using data that resembles the generation model outputs, followed by a validation framework, outperforms the existing techniques, achieving state-of-the-art results. We also show that our techniques can be used in few-shot settings using self-training.


Introduction
A key component of these models is a synthetic error generation step that applies various sentence transformations to some seed data. However, these simple transformations may not always be able to generate realistic error samples with respect to the NLG models. In this paper, we take an adaptive approach to synthetic data generation that employs a variety of model-based sentence transformations, some of which are additionally adaptive to the NLG models or dataset, in order to generate samples that better resemble the output of these models. We then pass these synthetic samples through a novel validation framework that filters and sorts them into acceptable and unacceptable classes, further improving the quality of the overall synthetic dataset. We show that an acceptability classifier built on top of the data generated by our approach improves upon existing techniques, and that we achieve state-of-the-art results by combining our adaptive data generation approaches with Harkous et al.'s (2020) non-adaptive ones.

Related Work
Work on automated evaluation metrics in the tradition of BLEU (Papineni et al., 2002) shares similar goals as our work, except that such metrics make use of reference sentences and thus are not designed for use at inference time. Moreover, such methods have not been found to correlate well with human evaluation of individual texts outside of the machine translation paradigm (Reiter, 2018). Çelikyilmaz et al. (2020) present a comprehensive literature survey of the three broad categories of evaluation of text generation models (human, automated, and machine-learned), along with providing strong motivation for doing NLG evaluation. Our approach is inspired by work in the third category of machine-learned evaluation.
As noted, Harkous et al. (2020) improve upon earlier heuristic-based filtering by generating synthetic error data for training a semantic fidelity classifier. To do so, they use simple sentence transformations to create artificial omission, repetition, hallucination and value errors. However, since such transformations are not adaptive to the NLG models the classifier is used with, they may not always produce the kind of unacceptable samples the corresponding NLG model would. Also related is Sellam et al.'s (2020) work on building a machine-learned scorer, BLEURT, to replace automated metrics such as BLEU. They use mask filling with a pretrained language model to create synthetic unacceptable examples. In this paper, we introduce several new techniques for synthetic data generation, and comprehensively evaluate them in comparison to Harkous et al.'s (2020) methods, as well as to BLEU and BLEURT. In addition, we introduce a validation framework to sort the samples into the two classes. Our validation framework uses a pretrained entailment model, similarly to how Dušek and Kasner (2020) use one for semantic evaluation; here, we go beyond their approach by using it to develop an adaptive acceptability classifier that is better suited to runtime use.
As an alternative to using acceptability classifiers, one can make use of reconstruction models (Shen et al., 2019; Yee et al., 2019) to determine how well the NLG model's output predicts its input. These models are capable of detecting content errors but are not designed to capture grammatical mistakes. Additionally, since such approaches employ a second autoregressive decoding step, they are less well-suited to runtime inference in systems with tight latency budgets.
Regarding our self-training experiments, we note that self-training has been previously investigated for NLG by Kedzie and McKeown (2019), Qader et al. (2019) and Stevens-Guille et al. (2020), though they do not explore using pre-trained models with self-training. Also related are earlier approaches that use cycle consistency between parsing and generation models for automatic data cleaning (Nie et al., 2019; Chisholm et al., 2017). More recently, Chang et al. (2021) have developed a method for randomly generating new text samples with GPT-2 and then automatically pairing them with data samples. By comparison, we take a much more direct and traditional approach to generating new text samples from unpaired inputs in self-training (He et al., 2020), using pre-trained models fine-tuned on the few-shot data for both generation and reconstruction filtering.

Datasets
We conducted experiments on 5 datasets: WebNLG (Gardent et al., 2017), Cleaned E2E (Dušek et al., 2019), ViGGO (Juraska et al., 2019), a Conversational Weather dataset created following the method described in Arun et al. (2020), and an Alarm dataset released by Arun et al. (2020). Further, in order to examine the effectiveness of self-training for building the acceptability classifier in the few-shot setting, we delexicalized the WebNLG dataset (Arun et al., 2020). For test sets, we used (1) system outputs from the public human evaluation set for WebNLG 2017, converting the labels to the Acceptable class if both grammar and semantics ratings were greater than 2 (out of 3); (2) the Data Sabotaging strategy described in Section 5.1 to create model responses for Cleaned E2E, ViGGO, and delexicalized WebNLG; and (3) responses generated by the Weather and Alarm NLG models for those datasets. We used these methods as a practical way to create a variety of errors in sufficient quantities to effectively test the acceptability classifiers. Additionally, the Weather and Alarm test sets are representative of current SOTA models built for these domains. Human evaluations were done for all test sets except WebNLG to determine acceptability, using two annotators and a tie-breaker round in case of disagreement. The number of samples in all human-annotated test sets can be found in Table 17.

Framework Design
In Figure 1, we show the overall design of our proposed framework. The framework takes as input the training data of the text generation model as well as the trained generation model. The next step is synthetic data generation, which makes use of these inputs and can generate as many samples as needed. The synthetic samples are then passed through a validation framework that either sorts them into acceptable and unacceptable classes or rejects them altogether.

Synthetic Data Generation
Our synthetic data generation methods use the training data of the generation models (seed data) and the trained generation model. In Sections 9 and 10, we observe that a classifier built on data using our model-based and adaptive approaches improves upon the average F1 scores of standalone non-adaptive approaches by 1.1% to 18%. Following are the 4 strategies we introduce. Table 2 shows sample responses generated by each of these methods.

Data Sabotaging (SBTG)
We intentionally sabotage low-capacity LSTM models by training them on only 25% of the seed data and use them to generate synthetic responses. These responses are more likely to be unacceptable with respect to the generation model responses, since the full training data may contain considerably different inputs than the sabotaged subset. We carry out this process 4 times with a different 25% sample of the training data and make predictions on the remaining 75% of the training data.
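To make the rotation concrete, the following is a minimal sketch of the sabotaging loop, assuming hypothetical `train_low_capacity_lstm` and `generate` helpers standing in for whatever seq2seq training and decoding code is available; it is not the paper's actual implementation.

```python
import random

def sabotage_rotations(seed_pairs, n_folds=4, seed=0):
    """For each fold, train a deliberately low-capacity model on that ~25%
    of the seed data and generate responses for the remaining ~75%.
    `seed_pairs` is a list of (meaning_representation, reference_text) tuples;
    `train_low_capacity_lstm` and `generate` are hypothetical placeholders."""
    data = list(seed_pairs)
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    synthetic = []
    for i, fold in enumerate(folds):
        rest = [pair for j, f in enumerate(folds) if j != i for pair in f]
        model = train_low_capacity_lstm(fold)             # train on ~25%
        synthetic += [(mr, generate(model, mr)) for mr, _ in rest]
    return synthetic
```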

Noisy Beam Search (NBM)
We add random noise to the beam scores at each inference step of the generation model. With this technique, the generated unacceptable responses tend to have grammatical errors, while the acceptable responses tend to be paraphrases with a different sentence structure from the seed responses.
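As an illustrative sketch (not the paper's implementation), noise can be injected into the token log-probabilities at each decoding step via a custom Hugging Face LogitsProcessor, which approximates perturbing the beam scores; the checkpoint, noise scale, and example input below are assumptions, and in practice the fine-tuned NLG model would be loaded instead.

```python
import torch
from transformers import (BartForConditionalGeneration, BartTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class GaussianNoiseLogitsProcessor(LogitsProcessor):
    """Adds zero-mean Gaussian noise to the scores at every decoding step."""
    def __init__(self, std: float = 1.0):
        self.std = std

    def __call__(self, input_ids, scores):
        return scores + torch.randn_like(scores) * self.std

# Placeholder checkpoint; the trained NLG model would be used here.
tok = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tok("name[Blue Spice] area[riverside]", return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=5,
    max_length=64,
    logits_processor=LogitsProcessorList([GaussianNoiseLogitsProcessor(std=2.0)]),
)
print(tok.decode(out[0], skip_special_tokens=True))
```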

Mask-Filling with vanilla BART (BART)
We insert 3 to 7 random masks into the seed data and use the vanilla BART (Lewis et al., 2020) model to fill in the masks. A small number of masks tends to produce acceptable data, whereas a large number of masks tends to produce unacceptable data that is semantically incorrect but grammatical. This approach can, however, generate out-of-domain (OOD) data.
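A minimal sketch of this procedure is shown below, assuming the facebook/bart-large checkpoint and random whole-word mask positions; the exact masking and decoding settings are not specified in the paper.

```python
import random
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def mask_and_fill(text: str, rng: random.Random) -> str:
    """Insert 3-7 <mask> tokens at random word positions, then let vanilla
    BART reconstruct the sequence, filling each mask with zero or more tokens."""
    words = text.split()
    for _ in range(rng.randint(3, 7)):
        words.insert(rng.randint(0, len(words)), tok.mask_token)
    masked = " ".join(words)
    ids = tok(masked, return_tensors="pt").input_ids
    out = model.generate(ids, num_beams=5, max_length=ids.shape[1] + 32)
    return tok.decode(out[0], skip_special_tokens=True)

rng = random.Random(0)
print(mask_and_fill("The Blue Spice pub near Burger King has an average rating.", rng))
```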

Mask-Filling with fine-tuned BART (FTB)
We address the OOD limitation of vanilla BART by fine-tuning BART on noised sequences from the seed data to reconstruct the original sequences. This denoising objective helps the model capture the patterns of the seed data, so masked words in a response are replaced by tokens similar to those in the seed data. We obtained the best results by noising the seed data with an insert mask ratio of 0.3 and a random mask ratio of 0.5, where we mask whole words. We use the same masking parameters to generate synthetic responses by mask-filling.
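The following sketch illustrates one plausible noising function using the ratios above (insert mask ratio 0.3, whole-word mask ratio 0.5); implementation details beyond those ratios, and the example seed response, are assumptions.

```python
import random

MASK = "<mask>"

def noise_sequence(text: str, insert_ratio: float = 0.3,
                   mask_ratio: float = 0.5, seed: int = 0) -> str:
    """Replace a fraction of whole words with <mask> and insert additional
    <mask> tokens at random positions; BART is then fine-tuned to map the
    noised sequence back to the original `text`."""
    rng = random.Random(seed)
    words = [MASK if rng.random() < mask_ratio else w for w in text.split()]
    for _ in range(int(insert_ratio * len(words))):
        words.insert(rng.randint(0, len(words)), MASK)
    return " ".join(words)

# Fine-tuning pairs: noised response -> original response.
seed_responses = ["Blue Spice is a pub near Burger King with an average rating."]
pairs = [(noise_sequence(r, seed=i), r) for i, r in enumerate(seed_responses)]
```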

Validation Framework
The validation framework takes in the synthetic samples generated by the methods described above and filters and sorts them into acceptable and unacceptable classes. Our experiments in Section 9 show that using a validation framework improves Macro F1 scores across all models by 1.4% to 5%. The following are the techniques we introduce.

Reconstruction Model Validator (REC-VAL)
We use this technique solely for the data sabotaging synthetic data generation method. In this approach, we use all the seed data to fine-tune BART (Lewis et al., 2020) as a reverse model, with the model response as input and the model input as output. We then feed the synthetic responses to this reconstruction model to obtain the predicted model inputs. Finally, we partition samples into acceptable and unacceptable classes based on whether or not they have an exact reconstruction match.
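A minimal sketch of the sorting step is shown below, assuming a `reconstruct` callable that wraps the fine-tuned reverse BART model and a simple normalization before the exact-match comparison (the paper does not specify the normalization).

```python
def normalize(mr: str) -> str:
    # Simple whitespace/case normalization before the exact-match check;
    # the normalization actually used in the paper is not specified.
    return " ".join(mr.lower().split())

def rec_val(samples, reconstruct):
    """Sort (input_mr, synthetic_response) pairs by exact reconstruction match.
    `reconstruct(response)` is the reverse (response -> MR) model, assumed to
    be BART fine-tuned on the full seed data."""
    acceptable, unacceptable = [], []
    for mr, response in samples:
        predicted_mr = reconstruct(response)
        if normalize(predicted_mr) == normalize(mr):
            acceptable.append((mr, response))
        else:
            unacceptable.append((mr, response))
    return acceptable, unacceptable
```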

Entailment Model Validator (ENT-VAL)
For each seed response, we create a pair of the seed response and the generated synthetic sample. Next, we pass this pair to a RoBERTa-based entailment model (Liu et al., 2019) twice, once in each direction, to obtain {entailment, neutral, contradiction} labels in both directions. The synthetic sample is sorted as acceptable if there is two-way entailment within specified confidence thresholds (set heuristically for all domains through initial experimentation). Otherwise, the sample is sorted as unacceptable if the confidence score is within specified thresholds for the neutral or contradiction class in either direction. If none of the conditions are met, the sample is rejected.
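The following sketch illustrates the two-way check with an off-the-shelf roberta-large-mnli model; the thresholds shown are illustrative placeholders rather than the heuristically tuned per-domain values.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def nli_probs(premise: str, hypothesis: str):
    ids = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # roberta-large-mnli label order: [contradiction, neutral, entailment]
        return torch.softmax(nli(**ids).logits, dim=-1)[0]

def ent_val(seed_response: str, synthetic: str, ent_thresh=0.9, neg_thresh=0.9):
    fwd = nli_probs(seed_response, synthetic)
    bwd = nli_probs(synthetic, seed_response)
    if fwd[2] > ent_thresh and bwd[2] > ent_thresh:       # two-way entailment
        return "acceptable"
    if max(fwd[0], fwd[1], bwd[0], bwd[1]) > neg_thresh:  # neutral/contradiction
        return "unacceptable"
    return "rejected"
```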

BLEU Score Validator (BLEU-VAL)
We calculate BLEU for a synthetic sample with respect to the original seed text. The sample is then sorted as acceptable or unacceptable depending on whether it lies within a specified range of BLEU scores.
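A minimal sketch using sacrebleu, with illustrative placeholder cutoffs in place of the tuned ranges:

```python
from sacrebleu import sentence_bleu

def bleu_val(seed_response: str, synthetic: str,
             acceptable_min=60.0, unacceptable_max=40.0):
    """Sort a synthetic sample by sentence-level BLEU against the seed response.
    The cutoffs here are illustrative; the paper tunes the ranges per dataset."""
    score = sentence_bleu(synthetic, [seed_response]).score  # 0-100 scale
    if score >= acceptable_min:
        return "acceptable"
    if score <= unacceptable_max:
        return "unacceptable"
    return "rejected"
```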

Classifier Model Architecture
We formulate the task as binary classification with labels {Acceptable, Unacceptable} and learn a discriminator model using both the training data of the generator models and the synthetic samples generated and refined using our adaptive approach. The training data consists of the generation model input concatenated with the response (synthetic or original text) using a separator token. For the underlying model architecture, we use RoBERTa-Base (Liu et al., 2019).
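A minimal sketch of the input construction and classifier is given below, assuming a Hugging Face RoBERTa-Base sequence classifier and passing the MR and response as a sentence pair (so the tokenizer inserts the separator tokens); the sequence length and label order are assumptions.

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def encode(mr: str, response: str):
    # Passing the MR and response as a sentence pair makes the tokenizer insert
    # RoBERTa's separator tokens between them.
    return tok(mr, response, truncation=True, max_length=256, return_tensors="pt")

batch = encode("name[Blue Spice] area[riverside]",
               "Blue Spice is located on the riverside.")
with torch.no_grad():
    logits = model(**batch).logits        # assumed order: [Unacceptable, Acceptable]
probs = torch.softmax(logits, dim=-1)
```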

Few-Shot Setting
Recently, there have been several efforts to train NLG models in a few-shot setting (Chen et al., 2020; Peng et al., 2020; Arun et al., 2020; Heidari et al., 2021). We adopt the self-training strategy introduced by Heidari et al. (2021) to generate training data for the generative models, which is also used as the seed data needed for acceptability modeling. Self-training consists of several cycles of generation and reconstruction. For generation, we fine-tune BART (Lewis et al., 2020) using only 500 annotated examples to generate NLG responses given the input meaning representations (MRs). The same generation data is used to fine-tune a reconstruction BART model to obtain the input MR given the response. After each generation step, we use the reconstruction model to select samples with an exact reconstruction match. At the end of the self-training cycles, we use all the selected samples as seed data for acceptability modeling. Following Heidari et al. (2021), we delexicalize the meaning representations of the WebNLG dataset and pair them with the existing delexicalized responses.
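A high-level sketch of the self-training loop is shown below; the fine-tuning and decoding functions are hypothetical placeholders for the BART training code, and only the control flow follows the description above.

```python
def self_train(labeled_pairs, unlabeled_mrs, n_cycles=2):
    """Few-shot self-training sketch: fine-tune a generator and a reconstruction
    model on ~500 labeled (MR, response) pairs, label unlabeled MRs with the
    generator, and keep only responses whose reconstructed MR exactly matches
    the input. `finetune_bart_generator` and `finetune_bart_reconstructor` are
    hypothetical placeholders returning callables."""
    selected = list(labeled_pairs)
    for _ in range(n_cycles):
        generator = finetune_bart_generator(selected)          # MR -> response
        reconstructor = finetune_bart_reconstructor(selected)  # response -> MR
        for mr in unlabeled_mrs:
            response = generator(mr)
            if reconstructor(response) == mr:                  # exact-match filter
                selected.append((mr, response))
    return selected  # used as seed data for acceptability modeling
```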

Results
In the following subsections, we compare the overall performance of classifiers. We report precision (P) and recall (R) of both acceptable (A) and unacceptable (U) classes. We also report Macro-F1 (F1) as the main metric we use to compare performance of the generation techniques. We use 10-fold cross validation (CV) to adjust the classification thresholds used for making predictions on the test sets.
To calculate the standard deviation of the F1 values, we use bootstrapping with 1,000 rounds. Finally, we report the means over 3 runs for each technique. Further, we perform McNemar's statistical significance test comparing models trained with the best combination of methods against those trained with sentence transformations.

We use the following abbreviations to refer to the different synthetic data generation techniques: sentence transformation (SNT), data sabotaging (SBTG), noisy beam search (NBM), mask filling with vanilla BART (BART), and mask filling with fine-tuned BART (FTB). Table 3 shows examples from 3 of the datasets where the acceptability classifiers capture unacceptable responses that pass the sentence transformation (SNT) baseline models.

Our experiments show that fine-tuned BART (FTB) is often the best single method, so we include it in all the results along with its combination with sentence transformation (SNT+FTB). The results indicate that in the absence of a representative validation set, SNT+FTB should be used, as it performs competitively across all datasets; otherwise, the validation set can be used to pick the best combination of techniques. We present comprehensive ablation experiments across all datasets in the appendix.
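For concreteness, the bootstrap and significance-testing procedure described above can be sketched as follows; the resampling details beyond the 1,000 rounds are assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.contingency_tables import mcnemar

def bootstrap_f1_std(y_true, y_pred, rounds=1000, seed=0):
    """Bootstrap estimate of the standard deviation of macro-F1."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(rounds):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    return float(np.std(scores))

def mcnemar_pvalue(y_true, pred_a, pred_b):
    """McNemar's test comparing two classifiers on the same test set."""
    y_true = np.asarray(y_true)
    a_ok, b_ok = np.asarray(pred_a) == y_true, np.asarray(pred_b) == y_true
    table = [[int(np.sum(a_ok & b_ok)), int(np.sum(a_ok & ~b_ok))],
             [int(np.sum(~a_ok & b_ok)), int(np.sum(~a_ok & ~b_ok))]]
    return mcnemar(table, exact=True).pvalue
```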
We performed a nearest neighbor analysis (using BLEU) between test responses and the synthetic unacceptable responses generated during training by our adaptive methods and by sentence transformation. We found that across datasets, 52.27% to 98.48% of unacceptable test responses have a closest match to a sample generated by an adaptive method, suggesting that adaptive techniques produce more realistic samples than sentence transformation. In Table 4, we show sample unacceptable responses from 3 datasets with their closest match to a sample generated by each technique.

Comparing Synthetic Data Generation Techniques
Tables 5-9 compare the performance of models trained with data generated from the mentioned techniques using a RoBERTa-based architecture.
We also compare against the techniques described in Harkous et al. (2020), which we call sentence transformation (SNT). When data from different techniques including SNT are combined, we mix them in equal proportions. Additionally, we use all of the seed data as acceptable samples as well as ensure a 50:50 split between overall acceptable and unacceptable samples for training. We use different validation methods across these techniques: for BART and FTB we use a combination of BLEU and entailment models, for NBM we use entailment models, and for SBTG we use reconstruction models.
We observed that mask filling with fine-tuned BART performed consistently well across all datasets. We think this is because the pre-trained language model property of BART tends to generate grammatical responses, and by fine-tuning BART on the generation model training data, mask filling tends to generate words from the training data distribution, resulting in consistently realistic samples. Noisy Beam Search tends to generate unacceptable samples that are ungrammatical (thus complementing the fine-tuned BART technique); hence, we see it included in the best combination for datasets whose test sets contain ungrammatical samples, such as WebNLG. That said, these are initial observations, and a deeper investigation of the exact conditions under which different data generation techniques perform better is left for future research.
As can be seen in Figure 2, the mean macro-F1 score improves over the base sentence transformation in all 5 datasets. Note that macro-F1 for a majority class baseline is at best 50%, since F1 for the minority class is always zero, and ranges from 33.3 to 48.2 for our test sets. Likewise, we built and tested baseline supervised classifiers using the available test data and 5-fold cross validation, and observed that our acceptability classifiers improve on the macro-F1 scores of these baseline classifiers by 5% to 27.3%.

We also performed McNemar's significance test comparing the best combination of techniques to the SNT version. The performance increase is statistically significant for the Weather, Alarm, and WebNLG datasets, with p-values of 0.023, 0.004, and 0.006, respectively. Note that the Weather and Alarm datasets are representative of current SOTA models. The significance test for Cleaned E2E was inconclusive, with a p-value of 0.055, suggesting more samples are needed to make a determination. The improvement on the ViGGO dataset is not significant.

Table 8: Performance on Weather dataset. The best adaptive combination (BADP) is BART+FTB and the best adaptive methods combined with sentence transformations (SNT+ADP) are also BART+FTB.

We conjecture that this may be in part due to the proportion of acceptable vs. unacceptable samples in the Cleaned E2E and ViGGO test sets, along with differences in the datasets themselves. Since the other test sets have a relatively higher proportion of acceptable samples, they are likely to be more challenging for the classifiers, since the classifier needs to ensure complete parity between the meaning representation and the generated output for the acceptable samples (which may be more difficult than spotting an error). Conversely, if the Cleaned E2E and ViGGO test sets are easier, then that could explain why it would be more difficult for a model to significantly improve on the sentence transformation baseline performance. Further analysis of this possibility is left for future work.

Table 9: Performance on Alarm dataset. The best adaptive combination (BADP) is SBTG+FTB and the best adaptive methods combined with sentence transformations (SNT+ADP) are NBM+BART+FTB.

Effect of Validation Framework

Table 10 compares the performance of 4 validation strategies across the 5 datasets. We choose the best performing synthetic data generation technique combination for each dataset from Tables 5-9 and apply the validation strategies to it for this comparison. The strategies ENT-VAL, BLEU-VAL and ENT+BLEU-VAL are applied only to the NBM, BART and FTB synthetic data generation methods; when SBTG is used, REC-VAL is used. Table 10 reports all validation strategies except REC-VAL (since it is always used with the SBTG data generation method) across the 5 datasets. As can be seen, not using any validation framework performs the worst across all 5 datasets, with a 1.4% to 5% decrease in average F1 scores. Moreover, the best performing validation strategy includes the entailment model validator for 4 of the datasets.

Next, we compare the effect of adding synthetic acceptable data to the acceptability classifier's training data for the best performing synthetic data generation technique combination for each dataset. For all experiments, REC-VAL is applied for SBTG, ENT-VAL is applied for NBM, and ENT+BLEU-VAL is applied for both BART and FTB. As can be seen in Table 11, the average macro-F1 scores improve by 0.4% to 4.3% across all 5 datasets when synthetically generated acceptable data is added.

Few-Shot Setting
We delexicalized the meaning representations of the WebNLG dataset and used the delexicalized version to build NLG models. We used 500 samples in the few-shot setting and auto-annotated 8,000 more through 2 cycles of self-training. We compared the performance of our few-shot acceptability classifiers with the full-data ones, which were trained using more than 20,000 samples. As can be seen in Figure 3, there is no significant drop in the performance of the classifiers in the few-shot setting. Moreover, our adaptive methods work much better than sentence transformations on the delexicalized WebNLG dataset, including in the few-shot setting.
Comparison with Automated Metrics

Sellam et al. (2020) show that BLEURT-base achieves state-of-the-art consistency with human judgements on the WMT Metrics Shared Task (Bojar et al., 2017), and can be further fine-tuned on the WebNLG 2017 human ratings to improve agreement. Since fine-tuned BLEURT checkpoints are not publicly available, we fine-tuned our own BLEURT models on the WebNLG human judgments of semantic adequacy (SEM) and grammatical correctness, as well as on the E2E human evaluations (E2E). Specifically, we used all 5,363 items, sampling 1,000 of them as validation data, following the conclusions of the BLEURT paper. Following Sellam et al. (2020), we stopped fine-tuning at 40,000 steps.

We obtained confidence thresholds for BLEU and all BLEURT-based models by optimizing 10-fold cross validation Macro F1 scores, as described in Section 9, over a grid between the minimum and maximum BLEU/BLEURT scores (step size of 0.02 for BLEURT and 2 for BLEU). The threshold was then used to determine the predicted class. We show results in Table 12. As expected, BLEURT variants generally outperform BLEU. BLEURT fine-tuned on WebNLG outperforms BLEURT fine-tuned on E2E on WebNLG, and vice-versa for E2E, with an outlier outperformance of BLEU on ViGGO. Remarkably, our best acceptability classifier outperforms all BLEURT variants across all 3 datasets, despite BLEURT using reference sentences. This could be because BLEURT doesn't take the input into consideration, or because BLEURT is fine-tuned with a regression loss instead of a classification loss.
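A simplified sketch of the threshold search for a scalar metric such as BLEURT is given below; the 10-fold cross validation outer loop is omitted, and binary 0/1 labels (1 = Acceptable) are assumed.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(scores, labels, step=0.02):
    """Grid-search a cutoff for a scalar metric (e.g., BLEURT) that maximizes
    macro-F1 when samples scoring at or above the cutoff are predicted
    Acceptable (label 1)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    grid = np.arange(scores.min(), scores.max(), step)
    f1s = [f1_score(labels, (scores >= t).astype(int), average="macro") for t in grid]
    return grid[int(np.argmax(f1s))]
```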

Conclusion
In this paper, we introduced and analyzed several model-based and model-adaptive techniques, along with a validation framework, to create synthetic acceptable and unacceptable responses for training acceptability classifiers to filter the outputs of neural NLG models. In addition, we compared and contrasted combinations of these techniques with using only the simple sentence transformation methods recently introduced by Harkous et al. (2020). We carried out a comprehensive study using 5 NLG datasets with varying levels of complexity and demonstrated that a combination of our methods and sentence transformations delivers state-of-the-art performance on all of them. Additionally, we demonstrated that using self-training, our models can be trained in few-shot settings without any significant drop in performance. This is especially important in light of recent efforts to develop few-shot NLG models, where avoiding semantic errors remains a central challenge. Finally, we recommend the strategy of using fine-tuned BART with the entailment model validator for building an acceptability classifier in the absence of a representative validation set. When such a set is available, we recommend performing ablation experiments across all combinations of the different techniques using our framework from Section 4. Further analysis of the conditions under which different synthetic data generation and validation strategies work best, with respect to the nature of the underlying data, is left for future work.

Ethical Considerations
The human annotators involved in the data evaluation for this paper are full-time contracted employees. Before the data and evaluation guidelines are sent out to the annotators, the project goes through an approval process. The process starts by submitting a request containing the human review workflow and guidelines, according to the project scope, in layman's terms. Upon receiving the request, the trained team identifies risks based on the information contained in the request and assigns relevant reviewers. Subsequently, all potential risks are identified, documented and addressed before the start of the annotation process. This process ensures that the data and guidelines are designed to mitigate potential bias and risk. All of the guidelines and data used by this paper and sent to human annotators underwent this review process.
The classifiers laid out in this paper should only reduce the harms associated with models outputting semantically incorrect information, therefore reducing the risk of deploying such models. However, we would like to call out potential biases that may arise from training correctness models on a specific grammar. The grammatical evaluation done on the data uses prescriptive grammar of informal Standard American English. These prescriptive notions of grammaticality potentially serve to perpetuate systemic power imbalances as they're conveyed by language. The use of this grammar to train a correctness model may not be appropriate depending on the potential use case.

A Appendix
A.1 Reproducibility

All the data and annotations for the experiments conducted in this paper are publicly released. All experiments were conducted on 32GB Quadro GV100 GPUs. The acceptability classifiers were trained by optimizing the ROC-AUC metric on the validation set. The average latency of the classifiers is 150ms.

Parameter                   Value
tokenizer                   BPE tokenizer
max length                  1024
encoder output dropout      0.1
encoder embedding dim       768
#encoder layers             12
#encoder attention heads    12
decoder dropout             0
decoder activation          relu
Number of model params      124055810

A.4 Ablation Results Across Techniques And Datasets
The ablation results below are from a single run. The top 10 winning strategies were selected and run 3 times to obtain the final results shown in the main paper.