Knowledge Transfer from Answer Ranking to Answer Generation

Recent studies show that Question Answering (QA) systems based on Answer Sentence Selection (AS2) can be improved by generating an answer from the top-k ranked answer sentences (termed GenQA). This allows for synthesizing the information from multiple candidates into a concise, natural-sounding answer. However, creating large-scale supervised training data for GenQA models is very challenging. In this paper, we propose to overcome this issue by training a GenQA model with knowledge transferred from a trained AS2 model. First, we use the AS2 model to produce a ranking over answer candidates for a set of questions. Then, we use the top ranked candidate as the generation target, and the next k top ranked candidates as context, for training a GenQA model. We also propose to use the AS2 model prediction scores for loss weighting and for score-conditioned input/output shaping, to aid the knowledge transfer. Our evaluation on three public datasets and one large industrial dataset demonstrates the superiority of our approach over the AS2 baseline and over GenQA trained on supervised data.


Introduction
In recent times, extractive QA research can be categorized into two broad directions for producing the final answer to a question: (i) Answer Sentence Selection (AS2), which, given a question and a set of answer-sentence candidates (e.g., retrieved by a search engine), selects the sentences that correctly answer the question; and (ii) Machine Reading (MR), e.g., (Chen et al., 2017), which, given a question and a reference text, finds an exact text span that answers the question. AS2 models can operate more efficiently over large text databases (they originated from the TREC-QA track (Voorhees, 1999)), and there has been renewed research interest in these models for applications to personal assistants, e.g., Alexa (Garg et al., 2020; Matsubara et al., 2020a; Garg and Moschitti, 2021).
Both approaches (AS2 and MR), while effective when applied to QA over unstructured web text, have certain drawbacks. Arbitrary web sentences may not contain all the information needed to answer a question, or may contain distracting extraneous information. Moreover, they may have a particular sentiment or style that is not suited to QA, or rely too heavily on longer discourse context to serve as standalone answers. In light of this, researchers have been exploring text generation systems for writing 'better' answers. For example, in MR, RAG (Lewis et al., 2020b) generates an answer from a set of documents selected by dense passage retrieval models.
For AS2 systems, research has focused on learning to summarize answers from relevant paragraphs (Lewis et al., 2020a), or to synthesize information from the top ranked candidates of an AS2 system (Hsu et al., 2021). The latter approach, termed GenQA, has shown improvements in terms of both answer accuracy and style suitability. A distinctive characteristic of GenQA over a generation-based approach for MR is the length of the answer: the former uses an entire sentence as the target, while the latter in practice uses a short text (primarily targeting entity names). In this work, we focus on GenQA, as we are interested in generating complete answer sentences from the precise information selected by AS2 models.
A challenge for training effective GenQA models is the difficulty of obtaining large-scale, high-quality training data. Producing such data for GenQA typically requires human annotators to read questions and paragraphs of relevant background information, and then author a self-contained, natural answer (typically a sentence). This fairly involved procedure greatly slows the pace of annotation. Existing datasets either offer limited coverage of the domains where GenQA can be applied (Bajaj et al., 2018), or are too small to be used as supervised training data (Muller et al., 2021). In general, collecting a human-authored answer to a question, given a context, is significantly more expensive than annotating the correctness of an extracted web sentence as an answer to the same question. Consequently, a large number of annotated datasets (Wang et al., 2007; Yang et al., 2015; Garg et al., 2020) are available for the latter type of annotation, aimed at training answer sentence selection (AS2) systems.
In this work, we propose a training paradigm for transferring the knowledge learned by a discriminative AS2 ranking model to an answer generation QA system. Toward this, we learn a GenQA model using weak supervision provided by a trained AS2 model on an unlabeled dataset comprising questions and answer candidates. Specifically, for each question, the AS2 model is used to rank a set of answer candidates without any labels of correctness/incorrectness for answering the question. The top ranked answer is used as the generation target for the GenQA model, while the question, along with the next k top-ranked answers, is used as the input.
We supplement the ranking order of answer candidates with the prediction confidence scores provided by the AS2 model for each answer candidate. This is done by modifying our knowledge transfer strategy in two ways. First, we weight the loss of each training instance (question + context, comprising k answer candidates) using the AS2 model score of the top ranked answer, which is used as the GenQA target. This allows the GenQA model to selectively learn more from 'good' quality target answers in the weakly supervised training data (AS2 models are calibrated to produce higher confidence scores for correct answers). However, this loss weighting only considers the score of the output target, and does not exploit the scores of the input candidates. To overcome this limitation, we discretize the AS2 scores into l confidence buckets, add the corresponding bucket labels to the GenQA vocabulary, and prepend the appropriate label to each answer candidate in the input and/or the output. This confidence bucket label provides the GenQA model with an additional signal about the answer quality of each candidate, as assessed by the AS2 model.
We show that both these techniques improve the QA accuracy, and can be combined to provide additional improvements.
We empirically evaluate our proposed knowledge transfer technique from AS2 to GenQA on three popular public datasets: MS-MARCO NLG (Bajaj et al., 2018), WikiQA (Yang et al., 2015), and TREC-QA (Wang et al., 2007); and on one large-scale industrial QA dataset. Our results show that a GenQA model trained using our paradigm of weak supervision from an AS2 model can surprisingly outperform both the AS2 model used for knowledge transfer (the teacher) and a GenQA model trained on fully supervised data. On small datasets such as WikiQA and TREC-QA, we show that AS2 models trained even on small amounts of labeled data can effectively weakly supervise a GenQA model, which can then outperform its teacher in QA accuracy. Additionally, on MS-MARCO NLG, where fully supervised GenQA training data is available, we show that an initial round of training with our weakly supervised methods yields additional performance improvements over standard supervised training of GenQA. Qualitatively, the answers generated by our model are often more directly related to the question being asked, and are stylistically more natural-sounding and suitable as responses than answers from AS2 models, despite the model being trained only on sentences extracted from the web.

Related Work
Our work builds upon recent research in AS2, answer generation for QA, and transfer learning.
Answer Sentence Selection Early approaches for AS2 use CNNs (Severyn and Moschitti, 2015) or alignment networks (Shen et al., 2017; Tran et al., 2018; Tay et al., 2018) to learn and score question and answer representations. Compare-and-aggregate architectures have also been extensively studied (Wang and Jiang, 2017; Bian et al., 2017; Yoon et al., 2019) for AS2. Tayyar Madabushi et al. (2018) exploited fine-grained question classification to further improve answer selection. Garg et al. (2020) achieved state-of-the-art results by fine-tuning transformer-based models on a large-scale QA dataset first, and then adapting them to smaller AS2 datasets. Matsubara et al. (2020b) combine multiple heterogeneous systems for AS2 to improve a QA pipeline, similar in spirit to GenQA. Several follow-up works have further improved the performance of AS2 using transformer models, multiple answer candidates (Zhang et al., 2021), and document-aware pre-training strategies (Di Liello et al., 2022a,b).
Answer Generation for QA Answer generation for MR has been studied by Izacard and Grave (2021) and Lewis et al. (2020b). All of these approaches focus on identifying short answer spans for answering questions. Research on generating complete sentences as answers (similar to the answer sentences produced by extractive AS2 systems) is rarer, but includes Hsu et al. (2021), who propose a QA pipeline for GenQA (see Fig. 1). This pipeline starts with an AS2 model that selects 'good' answer candidates, which are then used for generating the answer. Hsu et al. learn to generate natural responses to questions using the top ranked candidates from the AS2 model as input context to the GenQA model. GenQA has also been explored for multilingual QA (Muller et al., 2021) by extending the answer generation approach to a multilingual setting, where the answer candidates (used as input to the GenQA model) can come from a mix of different languages.
In all these works, a major challenge is finding training data for effectively training GenQA models, which requires annotator-authored natural responses. In this work, we alleviate this problem by showing that it is possible to use AS2-ranked candidates to create the input context and output target for training GenQA, achieving state-of-the-art results.
Transfer Learning Transfer learning is well studied in NLP, including pre-training (Devlin et al., 2019; Liu et al., 2019), multi-task learning (Luong et al., 2016), cross-lingual transfer (Schuster et al., 2019), and domain adaptation (Gururangan et al., 2020). Our work is squarely located in this space: our underlying language models are based on pre-training for text generation (Radford et al., 2019; Raffel et al., 2020); our main contribution is to show that knowledge can be transferred sequentially from a (discriminative) ranking task to a generation task. Recently, Wang et al. (2021) proposed a new domain adaptation method leveraging large unlabeled datasets and a query generator model. Izacard and Grave (2021) used retrieved text passages containing evidence to train a generative model for open-domain QA.
Knowledge Transfer: AS2 → GenQA

Previous works on GenQA require labeled data for effectively training the GenQA model. To reduce the need for expensive large-scale training data, we propose a training paradigm that uses unlabeled data while being weakly supervised by a discriminative AS2 model (as shown in Fig. 2).

Answer Sentence Selection (AS2)
AS2 is a popular task in QA, defined as follows: given a question q and a set of answer candidates C = {c_1, ..., c_n} (retrieved using a web index, a KB, etc.), find the answer candidate c_q ∈ C that best answers q. This is typically modeled as a binary classifier M over QA pairs labeled as correct or incorrect. At inference time, the scores assigned by M can be used to produce a ranking over C, with c_q = argmax_i M(q, c_i).
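As a concrete illustration, here is a minimal sketch of how such a classifier can score and rank candidates with the Hugging Face transformers API. The checkpoint name is a generic placeholder (in our experiments the ranker is a RoBERTa-Large or ELECTRA-Base model fine-tuned with TANDA), and the pair-encoding format is an assumption of this sketch.

```python
# Minimal AS2 ranking sketch: score each (question, candidate) pair with a
# binary classifier and sort candidates by the probability of the "correct"
# class. The base checkpoint here is only a placeholder, not a trained ranker.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)
model.eval()

def rank_candidates(question: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Return (candidate, score) pairs sorted by decreasing AS2 score."""
    enc = tokenizer([question] * len(candidates), candidates,
                    truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    scores = torch.softmax(logits, dim=-1)[:, 1]  # P(correct) for each candidate
    order = torch.argsort(scores, descending=True)
    return [(candidates[i], scores[i].item()) for i in order]
```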

Generative QA (GenQA)
Generative QA refers to using a text generation model to generate an answer to a question. More specifically, when provided with a question q and a context c, the GenQA model M_G should generate a natural-sounding answer c_q = M_G(q, c) that correctly answers q. Following Hsu et al. (2021), we consider a set of k answer candidates as the context c provided to M_G.
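As a sketch of the inference step, the question and the k candidates can be concatenated into a single source string for a seq2seq model; the plain-space separator and the generation hyperparameters below are illustrative assumptions, not the exact format used in our experiments.

```python
# GenQA inference sketch: concatenate q with k candidate sentences and decode.
from transformers import T5ForConditionalGeneration, T5Tokenizer

gen_tokenizer = T5Tokenizer.from_pretrained("t5-large")
gen_model = T5ForConditionalGeneration.from_pretrained("t5-large")

def generate_answer(question: str, candidates: list[str], k: int = 5) -> str:
    source = " ".join([question] + candidates[:k])
    inputs = gen_tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = gen_model.generate(**inputs, max_length=64, num_beams=4)
    return gen_tokenizer.decode(output_ids[0], skip_special_tokens=True)
```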

Training GenQA using an AS2 model
We aim to train a GenQA model M_G using a trained AS2 model M, which predicts the correctness/incorrectness of answer candidates for a given question. Specifically, we use an unlabeled dataset U comprising a set of questions along with their retrieved answer candidates, i.e., pairs (q, C = {c_1, ..., c_n}). Note that there are no human annotations of correctness/incorrectness for the answer candidates in C for the question q.
For each question q ∈ U, we denote the ranking of answer candidates by M, in decreasing order of prediction score, by C_M = {c_{M_1}, c_{M_2}, ..., c_{M_n}}. We create weakly supervised examples for training the GenQA model by using q together with c = {c_{M_2}, c_{M_3}, ..., c_{M_{k+1}}} as the input, and setting the generation target to be the top ranked answer candidate from M, i.e., c_{M_1}. For seq2seq transformer-based text generation models such as T5 (Raffel et al., 2020) and BART (Lewis et al., 2020a), we concatenate the question and the k answer candidates, "q c_{M_2} ... c_{M_{k+1}}", as the input to M_G, and use the negative log probability of predicting each token of the target c_{M_1}, given the previous tokens, as the training loss.
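Constructing one weakly supervised example is pure data wrangling over the AS2 ranking, as in the following sketch; the field names in the returned record are our own.

```python
# Build one weakly supervised GenQA example from AS2-ranked candidates:
# the top-ranked sentence becomes the target, the next k form the input context.
def make_ws_example(question: str, ranked: list[tuple[str, float]], k: int = 5) -> dict:
    """ranked: (sentence, as2_score) pairs sorted by decreasing score."""
    target, target_score = ranked[0]             # c_{M_1}: generation target
    context = [c for c, _ in ranked[1:k + 1]]    # c_{M_2} ... c_{M_{k+1}}
    source = " ".join([question] + context)      # "q c_{M_2} ... c_{M_{k+1}}"
    return {"source": source, "target": target, "as2_score": target_score}
```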
The resulting GenQA model M_G is thus trained on the unlabeled dataset using only the weak supervision from the discriminative AS2 model M.
For the rest of the paper, we denote this training paradigm for GenQA by WS. This approach is related to knowledge distillation (KD) (Hinton et al., 2015), wherein the predictions of a teacher model guide the learning of a student model. The novelty of our approach over standard distillation techniques stems from the fact that the teacher (AS2) and the student (GenQA) belong to different training paradigms: the former is a discriminative classifier, while the latter is a generative model. Furthermore, standard KD techniques (Hinton et al., 2015; Sanh et al., 2019) use a combination of supervision from the teacher (KL divergence) and supervision from the labeled data (cross-entropy) for teaching the student, while in our case, we only use the supervision signal from the teacher, without any access to labeled data.

Weighting GenQA Loss with AS2 scores
The binary cross-entropy loss used for training discriminative AS2 models typically calibrates their predictions w.r.t. answer correctness (Kamath et al., 2020; Garg and Moschitti, 2021). This means that a top ranked answer from M that receives a high prediction probability is more likely to be correct than the answer to another question that receives a lower prediction probability. We exploit this, in addition to the ranking order generated by M, to improve the learning of the GenQA model M_G. Intuitively, we want the GenQA model to learn more from 'good' quality target answers (those with higher prediction scores) than from lower quality ones.
To this end, we propose to modify the WS cross-entropy loss by incorporating the scores provided by the AS2 model M when performing the knowledge transfer. Specifically, we use the prediction score M(q, c_{M_1}) (normalized in [0, 1]) of M on the top ranked answer candidate c_{M_1} to weight the loss term of M_G for that instance (question q). Formally, the loss for each generated word y_r of the output y is:

    L_LW(y_r) = (M(q, c_{M_1}) / Z) · L_G(y_r),    (1)

where Z is the normalizing constant for the AS2 scores computed on the training dataset, L_G is the standard loss for generating y_r, and c_{M_1} is assumed to be the gold-standard output. L_G is defined as:

    L_G(y_r) = − Σ_{v ∈ V} c_{M_1}(r, v) · log( e^{y_r(v)} / Σ_{h ∈ V} e^{y_r(h)} ),

where V is the vocabulary, y_r(v) is the score of generating word v at position r, and c_{M_1}(r, v) is 1 if v is the token of c_{M_1} at position r and 0 otherwise. We refer to the model trained with Eq. 1 as LW.
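A minimal sketch of the LW objective, assuming per-instance logits and target token ids are already available and that Z has been precomputed on the training set:

```python
# LW loss sketch: standard token-level cross-entropy over the target sequence,
# scaled per instance by the normalized AS2 score of the target answer.
import torch
import torch.nn.functional as F

def lw_loss(logits: torch.Tensor, target_ids: torch.Tensor,
            as2_score: float, Z: float, pad_id: int = 0) -> torch.Tensor:
    """logits: (seq_len, vocab_size); target_ids: (seq_len,)."""
    nll = F.cross_entropy(logits, target_ids, ignore_index=pad_id)  # L_G, averaged over tokens
    return (as2_score / Z) * nll                                    # Eq. 1
```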

AS2 Score Conditioned I/O Shaping
In the previous section, we described how to use M(q, c_{M_1}), the AS2 prediction score of c_{M_1}, to weight the training loss for question q, since this candidate is used as the target for q in M_G. However, LW ignores the AS2 scores of the other answer candidates c_{M_2}, ..., c_{M_{k+1}}, and does not explicitly provide these scores as context to the GenQA model. To overcome this, we propose a method for labeling each candidate in the input of M_G with a representation of its AS2 score. This method can also be applied to the model output, which results in improved performance (as shown in Section 5).
We define a bucketing function F over the normalized interval [0, 1] that operates on the AS2 prediction score M(q, a). For a QA pair (q, a), F(q, a) assigns a confidence bucket label b_i, i ∈ {1, ..., l}, according to the sub-interval that the score falls into. For our experiments, we set l=5. We add the bucket labels b_i as special tokens to the vocabulary of M_G. We use F to modify the input and output of the GenQA model as follows:

• AS2 Score Conditioned Input (SCI): We prepend the bucket label b_j = F(q, c_{M_j}) to each of the j ∈ {2, ..., k+1} answer candidates provided as input to M_G, so that the new input is formatted as: "q b_2 c_{M_2} ... b_{k+1} c_{M_{k+1}}".

• AS2 Score Conditioned Output (SCO): We prepend the bucket label b_1 = F(q, c_{M_1}) of the target answer to the generation target, so that the model first predicts a confidence token and then the answer.

SCI and SCO can be used independently as well as jointly for training the GenQA model M_G using M. For simplicity, we use the acronym SC when the two techniques are used together. A code sketch of this shaping follows below.
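A minimal sketch of the bucketing and shaping steps. The bucket token strings follow the labels reported in Appendix E; splitting [0, 1] into l equal sub-intervals is an assumption of this sketch, and the bucket tokens are assumed to have been added to the tokenizer as special tokens.

```python
# Score-conditioned I/O shaping sketch with l=5 confidence buckets.
BUCKETS = ["[_NO_]", "[_DOUBT_]", "[_MAYBE_]", "[_PROBABLY_]", "[_YES_]"]

def bucket_label(score: float) -> str:
    """Map a normalized AS2 score in [0, 1] to one of the l=5 bucket tokens."""
    return BUCKETS[min(int(score * len(BUCKETS)), len(BUCKETS) - 1)]

def shape_example(question: str, ranked: list[tuple[str, float]],
                  k: int = 5, sci: bool = True, sco: bool = True) -> tuple[str, str]:
    """ranked: (sentence, as2_score) pairs, highest score first.
    Returns (source, target) with bucket tokens prepended (SCI / SCO)."""
    target, target_score = ranked[0]
    parts = [question]
    for cand, score in ranked[1:k + 1]:
        parts.append((bucket_label(score) + " " if sci else "") + cand)
    source = " ".join(parts)
    if sco:
        target = bucket_label(target_score) + " " + target
    return source, target
```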
We propose SCI to make the knowledge transfer more effective. Intuitively, labeling each input candidate with a special token correlated with its AS2 score helps the GenQA model: during training, the model can focus more on the answer candidates associated with higher quality (more correct answers), thereby improving model performance.
While SCO is related to LW (presented in the previous subsection), it differs in that SCO allows the model to "know" the score of the target when building internal representations of the text in its input and output. We hypothesize that this knowledge allows the model to organize its internal representations differently in the presence of bad targets, rather than merely being less influenced by them, as in LW. Another advantage of SCO is that, at inference time, the generated bucket label token can be used as a confidence score for the GenQA model's answer. Calibrating confidence scores for text generation models, e.g., using sequence likelihood, is challenging, especially when decoding is constrained, as in real-world applications. Finally, we can force decoding to start from any one of the SCO bucket tokens in order to exploit its influence on the model's output. We explore this empirically in Appendix E.

Datasets and Models
For training and evaluating our knowledge transfer techniques (WS, LW, SC) described above, we categorize the data used in each experiment into the following four sources/types:

• AS2: Labeled (q, a) pairs with correctness/incorrectness annotations, for training M
• Transfer: Unlabeled (q, a) pairs that are ranked by M and used for knowledge transfer to M_G
• Fine-tuning: Labeled data (human-written answers / answers with correctness labels) for fine-tuning M_G, whenever available
• Evaluation: Evaluation data for M and M_G

In Section 5, we vary the sources of the different data types described above to demonstrate the robustness and generality of our knowledge transfer method. Below, we provide details about the data sources we use, along with a summary of the underlying models.
Unlabeled Data

MS-MARCO QA A popular MR dataset released by Bajaj et al. (2018). We use the training split, which contains ∼800k unique user queries from the Bing search engine, along with ∼10 passages retrieved for each question. We split the original dataset into individual sentences using the Bling-Fire tokenizer, to be used as the Transfer data. Note that this dataset is used as unlabeled data in our experiments.

AQAD-U A large-scale internal industrial QA dataset containing non-representative, de-identified user questions from the Alexa virtual assistant. This unlabeled Alexa QA Dataset (AQAD-U) contains ∼50 million questions, with ∼400 answer candidates retrieved for each question using a large-scale web index covering over 100M web documents. We use this dataset as Transfer data for the experiments in the industrial setting.

Labeled Data
ASNQ A large-scale AS2 corpus (Garg et al., 2020) derived from the Google Natural Questions (NQ) dataset (Kwiatkowski et al., 2019). It consists of ∼60K questions with labeled answer sentences. We use this as AS2 training data.

MS-MARCO NLG A split of MS-MARCO (Bajaj et al., 2018) that contains manually written answers along with retrieved passages for ∼150k user queries, which we use for Fine-tuning. We sub-sample 1k questions from the development set, along with their answer candidates extracted from the associated passages, to be used as Evaluation data in our experiments. (We do not use the entire development set of ∼100k questions for evaluation due to the expensive cost of human annotation.)

TREC-QA A popular QA benchmark (Wang et al., 2007) used to evaluate AS2 models. For our experiments, we use the filtering and splits proposed in (Zhang et al., 2022), where all questions have at least one positive and one negative candidate, and the test split is larger. The resulting dataset contains 816, 204 and 340 unique questions for the training, dev. and test sets, respectively.

WikiQA A popular AS2 dataset (Yang et al., 2015) containing questions from Bing search logs and answer candidates from Wikipedia. We use a 'clean' setting for training by retaining questions with at least one positive answer candidate in the train and validation splits. This results in train/dev./test sets of 2118/296/236 questions, respectively.

AQAD-L The labeled counterpart of the industrial dataset AQAD-U described above, where the answer candidates additionally have human annotations of correctness/incorrectness. We use AQAD-L, comprising ∼5k questions, as Evaluation data for the experiments in the industrial setting. Results on AQAD-L are presented relative to the baseline AS2 model, as the data is internal.
For data statistics, please refer to Appendix A.2.

Modeling Details
We use T5 (Raffel et al., 2020) as the GenQA model M_G. For the AS2 models, we use RoBERTa-Large (Liu et al., 2019) or ELECTRA-Base (Clark et al., 2020) trained using the TANDA approach (Garg et al., 2020), depending on the experimental setting. For our experiments, we set k=5, i.e., the number of answer candidates provided as input to the GenQA model. We train our models with fp16 precision, using Adam (Kingma and Ba, 2015) as the optimizer, a learning rate of 1e-4, and a batch size of 256. We trained each model for 25 epochs on both versions of MS-MARCO (QA and NLG), and for 50 epochs on WikiQA and TREC-QA. We select the best model by maximizing the average AS2 score on the development set of each dataset, instead of minimizing the validation loss (see details in Appendix B).
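For reference, the configuration above maps roughly onto Hugging Face training arguments as follows; the gradient-accumulation split used to reach the effective batch size of 256 is our assumption, and HF's default AdamW stands in for Adam.

```python
# Sketch of the training setup (fp16, lr=1e-4, effective batch size 256;
# 25 epochs shown, as used for the MS-MARCO runs).
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="genqa_ws",
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,  # 32 * 8 = effective batch size of 256
    num_train_epochs=25,
    fp16=True,
    save_strategy="epoch",          # checkpoints later re-scored with the AS2 model
)
```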

Evaluation and Metrics
We perform a human evaluation of our generated answers: for each question/answer pair, we collect annotations from five annotators (judging whether the answer is correct or incorrect) using Amazon MTurk (see Appendix C for details). We use accuracy as the primary metric for all our experiments and models. Given a set of questions, accuracy is computed as the number of answers judged correct divided by the total number of questions. Note that: (i) each QA pair receives an average score from the five annotators, and (ii) for the AS2 model, accuracy is the same as Precision@1, i.e., the precision of the top ranked answer candidate.
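A small sketch of this accuracy computation; treating an average annotator score above 0.5 as "correct" is our assumption.

```python
# Accuracy sketch: average the five binary judgments per question and count
# an answer as correct when the majority of annotators judged it correct.
def accuracy(annotations: list[list[int]]) -> float:
    """annotations: one list of five 0/1 judgments per evaluated question."""
    per_question = [sum(votes) / len(votes) for votes in annotations]
    return sum(avg > 0.5 for avg in per_question) / len(per_question)
```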

Experiments and Results
We perform experiments in three data settings to evaluate different aspects of our method. On the MS-MARCO datasets, we show that weak supervision can augment strong models trained on in-domain data. On WikiQA and TREC-QA, we show that weak supervision on large data improves over direct supervision on small data for this QA task.
We also present an experiment on a very large industrial dataset to measure the contribution of each of our proposed techniques when using unlabeled training data at scale.

In the MS-MARCO experiments, the teacher model does not have any knowledge of the original labels of MS-MARCO QA, as we consider this dataset to be unlabeled. We compare our approach against a supervised GenQA model baseline (Hsu et al., 2021), which is trained on the train split of MS-MARCO NLG.

Results: We present the results on the MS-MARCO NLG test set in Table 1. The baseline zero-shot accuracy of the AS2 model on this data is 79.3, and the baseline accuracy of the fully supervised GenQA model (Hsu et al., 2021) is 82.6. Ablating each of the approaches (LW, SCI, SCO) individually, in addition to WS, we observe consistent improvements (+1.6%, +2.1% and +2.6%, respectively) over the performance of WS, indicating that the AS2 scores help the knowledge transfer. Additionally, combining all the approaches with WS significantly improves performance, and surprisingly can even outperform the supervised GenQA baseline (by 1.1% = 83.7 - 82.6). This shows that the knowledge transferred by our approach from ASNQ exceeds what can be learned from MS-MARCO NLG (MS-MARCO NLG is larger than ASNQ, but the latter is a much higher-quality dataset in terms of diversity and complexity of questions and answer annotations).

Finally, when we combine our weakly supervised training techniques with supervised training in a two-stage pipeline, we observe very significant performance gains, e.g., 2.7% over the supervised approach. This shows that (i) the information in MS-MARCO NLG is complementary to the knowledge transferred from ASNQ, and (ii) our approach is effective in transferring knowledge from a discriminative ranker to a downstream GenQA model. For brevity, we present a qualitative analysis of generated examples in Appendix E.

Scarce Data Setting
In this experiment, we measure the quality of our weak supervision approaches by evaluating their performance on two popular AS2 benchmark datasets: WikiQA and TREC-QA. We train the AS2 teacher model on these datasets, and still use the unlabeled data from MS-MARCO QA for performing the knowledge transfer. This way, we test whether our approach is applicable in real scenarios where data can be scarce and no large labeled dataset (such as ASNQ or MS-MARCO NLG) is available. Additionally, we verify whether our approach works for other domains, and whether fine-tuning GenQA on the target domain data can help knowledge transfer in that domain, even in the case of data scarcity.
We compare our weakly supervised approaches with an AS2 baseline and with a GenQA model trained on the target datasets using their ground-truth labels. We use the original test splits of the datasets. Note that for these experiments, (i) we use the best performing strategy for our transfer learning, i.e., WS along with LW and SC, and (ii) the AS2 baseline is the same model that we use to transfer knowledge on MS-MARCO QA.

Results: From our results in Table 2, we make the following observations. (i) AS2 accuracy evaluated with our human annotations is around 10% lower than the results from previous works, e.g., (Zhang et al., 2021). As we use the same model, the difference is due to the fact that we use the 'raw' test setting, which includes questions with no correct answer candidates. (ii) Our transfer learning techniques perform better than both the AS2 model and the supervised GenQA baselines. For WikiQA, our knowledge transfer approach, which has only seen unlabeled MS-MARCO data and no labeled training data from WikiQA, achieves higher accuracy than both the AS2 baseline (+0.4%) and the fully supervised GenQA baseline (+5.8%), which uses the ground-truth labels of the target datasets. (iii) We observe the same trend for TREC-QA: our weakly supervised models improve over both the AS2 (+4.6%) and supervised GenQA (+9.8%) baselines. (iv) In contrast to our observations from Table 1, the supervised GenQA baseline for WikiQA and TREC-QA is less accurate than the AS2 baseline. We explain this with two reasons: (a) the small size of these datasets (only a few thousand training questions) may be insufficient to train a large T5 model for GenQA, and (b) using extracted answers as generation targets, instead of human-written, natural-sounding answers, hurts the quality of answer generation. (v) Finally, supervised fine-tuning applied after our transfer learning only improves performance on WikiQA. The WikiQA dataset has many questions with no correct answers (∼40%). Fine-tuning on the supervised dataset reinforces the training of the generator on questions having actual positive labels, thereby helping to reduce noise and improving the final accuracy on the entire test set.

Industrial Setting
In this experiment, we aim to show that our experimental findings extend to very large-scale, real-world data, i.e., non-representative, de-identified customer questions from the Alexa virtual assistant. We use the 50M-question AQAD-U corpus as the unlabeled QA corpus for training the GenQA model, transferring the knowledge of an AS2 teacher model (no human-authored answers are used to train fully supervised GenQA models on this data). We compare our weak supervision methods against the AS2 teacher model on the labeled test split, AQAD-L.
Results: We present the results in Table 3 relative to the AS2 baseline (due to the data being internal), which is used as the 'teacher' for transferring knowledge to train the GenQA model. For these experiments we use T5-Base as the GenQA model (due to the large size of AQAD-U). Our results show that knowledge transfer from the AS2 model using the unlabeled data surprisingly improves accuracy over the baseline by 1.34%. This indicates that the weak supervision provided by the AS2 model is able to train a GenQA model that performs better than the AS2 teacher itself. Furthermore, using loss weighting (LW) and input/output shaping (SC) significantly improves our weak supervision approach. The T5-Base model trained using a combination of LW and SC on the unlabeled AQAD-U corpus achieves an impressive 7.35% gain in accuracy over the baseline AS2 model (which was trained on labeled data with annotations of answer correctness).

Analysis and Ablation Studies
Automatic Evaluation: We consider whether automatic evaluation metrics correlate with human evaluation for our task. In Table 4, we compare BERT-Score (Zhang* et al., 2020) and BLEURT (Sellam et al., 2020) with the human evaluation of various models on the MS MARCO-NLG test data. We find that, despite their good performance on other NLG tasks, neither metric has a particularly strong Pearson correlation with the human evaluation results for the task of answer sentence generation (GenQA): BLEURT has a correlation of 0.622; BERT-Score has a correlation of 0.447. Neither automatic metric is able to correctly identify the system ranking induced by human evaluation as presented in Table 1. Additional analysis using the BLEU score is presented in Appendix D.

Generated Answer vs. Input Candidates: We compare the generated answers (for each SCO bucket token) with the input answer candidates to understand how the model copies from the input candidates, and whether there is a correlation with the SCO bucket tokens. In Fig. 3, we present the similarity between the generated answer and the top 4 ranked input candidates using the BLEU score. This analysis shows that generated answers starting with a high-confidence SCO bucket token (e.g., '[_YES_]' or '[_PROBABLY_]') are more similar to the first (highest-ranked) candidate, while answers starting with lower confidence SCO bucket tokens (e.g., '[_NO_]') are on average equally distant from all the input candidates.
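The copy analysis can be sketched as sentence-level BLEU between the generated answer and each input candidate; the use of sacrebleu here is our tooling choice, not necessarily the one used to produce Fig. 3.

```python
# Similarity sketch: sentence-level BLEU of the generated answer against each
# of the top-ranked input candidates.
import sacrebleu

def candidate_similarity(generated: str, candidates: list[str]) -> list[float]:
    return [sacrebleu.sentence_bleu(generated, [cand]).score for cand in candidates]
```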

Conclusion
In this paper, we have presented a novel approach for transferring knowledge from a discriminative AS2 model to an answer generation model using only unlabeled data. We use the ranking produced by the AS2 model to train a GenQA model, with the top answer as the target output and the next k top ranked answers, along with the question, as the input. We also propose input/output shaping and loss weighting techniques to improve performance during knowledge transfer. Our experimental results on three public datasets and one large industrial dataset show that GenQA models trained with knowledge transfer from AS2 models achieve higher answering accuracy than both the AS2 teacher and supervised GenQA models trained with in-domain data. We are releasing our code and trained models to support future research.

Limitations
Our approach to training GenQA models requires access to large GPU resources for training large pre-trained language models such as T5-Large.
For the experiments in this paper, we only consider English-language datasets; however, we conjecture that our techniques should work similarly for languages with similar morphology. The evaluations for all experiments in this paper are done using human annotations on MTurk, which is time-consuming and expensive. Currently, automatic evaluation of correctness and style suitability for question answering is extremely challenging, and we hope that research advances in this area will further encourage broader research on answer generation systems.

B Checkpoint Selection using AS2 Model
We observed that minimizing the loss on the development split does not strictly correlate with better answer generation from a human annotation perspective. Thus, we experimented with an alternative performance measure for model checkpoint selection. Specifically, we use each checkpoint to generate outputs for a number of validation examples and score them with an AS2 model. We use the average score produced by the AS2 model as the metric for deciding the best model checkpoint. The differences between using the loss on the development split and the average AS2 score can be seen in Figure 4. We conducted a manual evaluation of outputs from different checkpoints and determined that the average AS2 score correlates better with our manual judgments than the development set loss. We plan to explore this technique further in future work.

Figure 4: These plots show the differences between using the dev. loss and the average AS2 model score as the measure for model checkpoint selection. Note that the vertical dashed red line identifies the best checkpoint (the one with the lowest loss in the first plot, and the one with the highest average AS2 score in the second).
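A minimal sketch of this checkpoint-selection loop; generate_fn and as2_score_fn are hypothetical callables standing in for the GenQA decoding and AS2 scoring steps described above.

```python
# Checkpoint selection sketch: keep the checkpoint whose generated answers on
# a validation sample receive the highest mean AS2 score.
def select_checkpoint(checkpoints, val_questions, generate_fn, as2_score_fn):
    best_ckpt, best_score = None, float("-inf")
    for ckpt in checkpoints:
        answers = [generate_fn(ckpt, q) for q in val_questions]
        mean_score = sum(as2_score_fn(q, a)
                         for q, a in zip(val_questions, answers)) / len(answers)
        if mean_score > best_score:
            best_ckpt, best_score = ckpt, mean_score
    return best_ckpt
```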

C Details of Human Evaluation
To evaluate our models, we used the Amazon Mechanical Turk framework. We designed an annotation task in which we showed a pool of several high-quality annotators (Turkers) a question, a target answer generated by our models, and, where possible (e.g., MS MARCO-NLG), a well-formed reference answer, asking whether the target answer was correct or not. For each QA pair (HIT) we paid $0.10 and assigned 5 Turkers. Specifically, we selected only Turkers with Masters status, an approval rate greater than 95%, and at least 500 approved HITs.

D Human Annotations vs. BLEU
Comparing human evaluation with BLEU scores for GenQA model outputs, we find that BLEU is not a reliable performance metric for this task and setting. For BLEU, we use two references for the generated target answer: (i) the manually written answer from MS-MARCO NLG, and (ii) the top ranked answer from the AS2 model being used as the teacher for knowledge transfer. The results are shown in Table 7 and appear to be quite random.
Neither of the rankings induced by the BLEU metric corresponds to the ranking induced by human evaluation (see Table 1).

E Qualitative Evaluation
In this section, we present some qualitative examples of answers generated by our models. In particular, we show (i) the differences between the generated answers and the answer candidates used as input for GenQA, and (ii) how we can manipulate the generated answers by forcing decoding to start from different SCO bucket label tokens.

E.1 Generated Answer vs. Input Candidates
From qualitative analysis, we observe that the answers generated by our knowledge transfer techniques are generally longer than the answers generated by a GenQA model trained in a fully supervised manner. We present three examples in Table 9. In the first example, both GenQA answers are correct, and we observe that the top answer selected by AS2 contains the correct answer string.
In the second example, both GenQA answers are incorrect; however, as in the previous example, both generated answers are shorter, syntactically improved versions of the AS2-selected answer. In the third example, the answer generated by our weakly supervised GenQA model is correct (the AS2-selected top answer is also correct), while the one generated by the supervised GenQA model is incorrect. Overall, we notice that our weakly supervised models tend to copy and summarize the answer from the input candidates with the highest AS2 scores, while the supervised GenQA model generates more concise and shorter answers.

E.2 Forced Decoding using SCO
In this section, we analyze how the SCO bucket tokens can be used to modify the quality of the generated answers from GenQA models. Specifically, we tested our GenQA models trained with SCO by forcing the decoder to generate answers starting from each of the different SCO bucket tokens ('[_YES_]', '[_PROBABLY_]', '[_MAYBE_]', '[_DOUBT_]', '[_NO_]'). We present an anecdotal example in Table 8. We observe that the syntactic quality of the generated answers correlates with the SCO bucket token selected as the first token of the generated answer. Furthermore, higher confidence SCO bucket tokens (e.g., '[_YES_]', '[_PROBABLY_]') tend to produce shorter and more concise answers, while lower confidence SCO bucket tokens like '[_DOUBT_]' and '[_NO_]' produce longer sequences that are syntactically inferior.
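A sketch of how decoding can be forced to start from a chosen bucket token with the Hugging Face generate() API, assuming the SCO tokens were added to the tokenizer during training:

```python
# Forced-decoding sketch: seed the decoder with a chosen SCO bucket token so
# generation continues from that prefix.
import torch

def generate_with_bucket(model, tokenizer, source: str, bucket: str = "[_YES_]") -> str:
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    bucket_id = tokenizer.convert_tokens_to_ids(bucket)
    decoder_prefix = torch.tensor([[model.config.decoder_start_token_id, bucket_id]])
    output_ids = model.generate(**inputs, decoder_input_ids=decoder_prefix, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```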

Figure 2:
Our pipeline for creating weakly supervised training examples for training GenQA models. The AS2 model assigns a confidence score to each answer candidate sentence. These scores are used to select the inputs and the target sequences for the GenQA model.

Table 1:
Results on the test split of MS MARCO-NLG for different training paradigms of GenQA models. The weak supervision is provided by a RoBERTa-Large AS2 model trained on ASNQ. We compare with a fully supervised GenQA baseline (Hsu et al., 2021) trained on the train split of MS MARCO-NLG.

Table 2 :
Results on the test split of WikiQA and TREC-QA. The weak supervision is provided by RoBERTa-Large AS2 models trained respectively on WikiQA and TREC-QA. We compare with a fully supervised GenQA baseline (Hsu et al., 2021) trained respectively on the train splits of WikiQA and TREC-QA, using the ground-truth correct answers as generation targets. We use T5-Large for all GenQA models.

Table 3 :
Results on AQAD-L for different training paradigms of GenQA models. All results are reported as absolute % changes w.r.t. the AS2 baseline.

Table 4 :
Results on the test set of MS MARCO-NLG comparing automatic evaluation metrics (BERT-Score and BLEURT) with human evaluation.

Table 5 :
Accuracy on generated answers, clustered according to the starting SCO bucket token generated by the GenQA model, on the WikiQA test set.

Table 7 :
Results using BLEU as the metric for measuring the performance of answer generation. We train the GenQA models on MS-MARCO NLG for the supervised setting, and on MS-MARCO QA for the knowledge transfer setting.