Learning Answer Generation using Supervision from Automatic Question Answering Evaluators

Recent studies show that sentence-level extractive QA, i.e., based on Answer Sentence Selection (AS2), is outperformed by Generation-based QA (GenQA) models, which generate answers using the top-k answer sentences ranked by AS2 models (a la retrieval-augmented generation style). In this paper, we propose a novel training paradigm for GenQA using supervision from automatic QA evaluation models (GAVA). Specifically, we propose three strategies to transfer knowledge from these QA evaluation models to a GenQA model: (i) augmenting training data with answers generated by the GenQA model and labelled by GAVA (either statically, before training, or (ii) dynamically, at every training epoch); and (iii) using the GAVA score for weighting the generator loss during the learning of the GenQA model. We evaluate our proposed methods on two academic and one industrial dataset, obtaining a significant improvement in answering accuracy over the previous state of the art.


Introduction
Recent research on retrieval-based Question Answering (QA) systems has focused on two main tasks: (i) Answer Sentence Selection (AS2), e.g., (Garg et al., 2020), which, given a question and a list of answer candidates, chooses the most relevant candidate that correctly answers the question; and (ii) Machine Reading (MR), e.g., (Chen et al., 2017), which, given a question and a reference text, involves finding a text span that directly answers the question. While effective, both strategies (AS2 and MR) have limitations: (i) the text might not include all the information necessary to answer a question, (ii) the text might include unnecessary, distracting information, or (iii) the text might express the answer in a convoluted (indirect) form. Additionally, the text's style and sentiment may be inappropriate for answering, or the text may depend too heavily on longer discourse context to be usable as a stand-alone answer.
These drawbacks have motivated researchers to explore text generation systems for writing 'better' answers in the open-domain abstractive QA setting. For example, in the MR domain, RAG (Lewis et al., 2020b) generates an answer from a set of documents which are selected by dense passage retrieval models. For the domain of AS2, previous research has focused on summarizing answers from relevant paragraphs/evidences (Lewis et al., 2020a), or synthesizing information from the top-ranked answer candidates of an AS2 system (Hsu et al., 2021; Muller et al., 2022; Gabburo et al., 2022).
The latter, termed GenQA, has shown improvements in answer generation from the perspective of both answering accuracy and style suitability. The main distinguishing feature of GenQA from a generation-based MR approach is the length of the answer: the former uses an entire sentence as the target answer, while the latter in practice uses a short text (primarily targeting entity names). Therefore, GenQA offers a more general and challenging research setting for answer generation.
Training effective GenQA models is made challenging by the cost and difficulty of obtaining large-scale, high-quality training data. This typically requires human annotators to read the question and the relevant top-k retrieved paragraphs/sentences, and then re-write them into a self-contained, concise natural answer (sentence/paragraph).
Recent research (Vu and Moschitti, 2021; Bulian et al., 2022) has proposed effective automatic QA evaluation models based on transformer encoders for sentence-form answers. Training these QA evaluators only requires access to question-answer pairs with annotations of correctness/incorrectness of the answers. This style of data annotation is much cheaper to perform than writing high-quality answers for training GenQA models. In this work, we explore the novel idea of using automatic QA evaluators for training GenQA models, which enables a faster and cheaper design and implementation.
In this paper, we reduce the amount of data needed for training a GenQA model using supervision from automatic QA evaluators. Our first contribution is GAVA: an automatic QA evaluation approach that extends AVA (Vu and Moschitti, 2021) by (i) exploiting multiple reference answers and (ii) evaluating LM-generated answers instead of extracted answers. This way, we obtain a more robust and accurate QA evaluator that can effectively supervise the training of GenQA models. We propose three novel methods to use GAVA for refining the training of GenQA.
The first consists of (i) generating multiple possible answers using a baseline GenQA model for questions belonging to the GenQA training dataset, and (ii) refining the set of generated answers by retaining only those with the highest GAVA scores (corresponding to "correct" or "high-quality" answers). These generated answers are used as alternate gold-standard answers (in addition to the annotators' written answers) to create additional training examples for GenQA. We term this approach GAVA-SDA (Static Data Augmentation).
The second approach extends GAVA-SDA by performing data augmentation dynamically at every epoch instead of offline before training. This is intuitively more effective, as the GenQA model continuously improves during training. Specifically, at every epoch, we use GAVA to score the list of generated answers along with the k input answer candidates. We then use the top-scoring answer as the GenQA target and the next top-k scoring answers as inputs for GenQA. We term this approach GAVA-DDA (Dynamic Data Augmentation).
The third approach uses GAVA as a scoring function for loss weighting during the training of GenQA. Specifically, we generate an answer using a GenQA model for a training sample, and weight the GenQA model's loss on this instance using the GAVA score of the generated answer. Intuitively, this makes the GenQA model learn more from instances associated with higher GAVA-scoring answers (which correspond to "correct" or "high-quality" answers). We term this approach GAVA-LW (Loss Weighting).
We perform an empirical evaluation on two academic and one industrial QA dataset (de-identified customer questions from the Alexa personal assistant), and show that our three proposed techniques for training a GenQA model with GAVA produce significant improvements in answering accuracy over a baseline GenQA approach. We also show that the answers generated by these improved GenQA models consistently achieve higher GAVA scores on average than the baseline. We will release the code along with the trained GenQA and GAVA models at https://github.com/amazon-science/wqa-genqa-gava to enable easy replication of our experimental results.

Related Work
Answer Generation: Several research works (Izacard and Grave, 2021; Lewis et al., 2020b) have studied the problem of generating short answer spans (typically entity-level) for MR systems. The most relevant of these works for GenQA is that of Asai et al. (2022), which trains an answer generation model using the evidentiality of retrieved passages. Xu et al. (2021) use decoder cross-attention patterns to generate extractive answer spans. Fajcik et al. (2021) generate answer spans by using a combination of a generative and an extractive reader (aggregating information over multiple passages). An independent but related line of research is question-based summarization (Iida et al., 2019; Deng et al., 2020). Hsu et al. (2021) proposed the first formulation for generating complete answer sentences using evidence retrieved by an answer sentence selection (AS2) model. This model was termed GenQA; it uses the top-k most relevant answer sentence candidates for a question as input context to an encoder-decoder model to generate a natural-sounding, complete answer sentence for the question. Muller et al. (2022) extend GenQA to multiple languages by using answer sentence candidates from multiple languages as input context for the GenQA model. Recently, Gabburo et al. (2022) propose training GenQA models on unlabeled data by leveraging weak supervision from trained AS2 ranking models. This approach was shown to combine well with the supervised GenQA approach (Hsu et al., 2021) to improve answering accuracy. Note that all of these approaches differ from the ones described in the previous paragraph, as they aim to generate complete answer sentences, not just short answer spans.

Evaluation of QA Systems: Token-level similarity metrics like BLEU (Papineni et al., 2001), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), etc., have been shown (Reiter, 2018) not to extend to sentence-form QA evaluation. For MR tasks, Yang et al. (2018) adapt BLEU and ROUGE metrics, but limit their evaluation to only yes-no and entity questions. Si et al. (2021) use multiple gold answers (extracted from Knowledge Bases) as references for evaluating answer span extraction.
There have been several learnable automatic metrics, e.g., BERTScore (Zhang et al., 2020), COMET (Rei et al., 2020), and BLEURT (Sellam et al., 2020), proposed for NLP tasks such as MT and summarization. These are based on transformer encoder models. Chen et al. (2019) proposed extending BERTScore for MR tasks using the question and paragraph context in addition to the answer. In a similar line of work, Vu and Moschitti (2021) propose AVA, an automatic QA evaluation metric that trains a transformer to encode the question, a reference gold answer, and the target answer to be evaluated. Very recently, Bulian et al. (2022) present similar findings to AVA, proposing BEM, which can be used for evaluating sentence-level extractive QA (AS2). AVA and BEM have not previously been evaluated for GenQA systems. Hsu et al. (2021) and Gabburo et al. (2022) show that automatic metrics like BLEU, BLEURT, and BERTScore do not correlate well with human judgements when evaluating the accuracy of GenQA systems. We extend AVA as the automatic QA evaluation system for our experiments.
3 Automatic QA Evaluation using Multiple References (GAVA)

Vu and Moschitti (2021) propose AVA: an automatic evaluation model for QA based on a transformer encoder. It is applied to a question and a complete answer sentence to determine the correctness or incorrectness of the answer. Formally, we denote the AVA model by E, which takes as input a question q, a target answer a, and a reference r, i.e., a gold standard (GS) answer, and outputs a correctness probability score s ∈ [0, 1]. AVA is trained on the same labeled data as AS2, i.e., question-answer pairs, where each question has multiple annotated answer candidates available. Though the AVA approach was empirically shown to be accurate for evaluating AS2 systems (Vu and Moschitti, 2021), it has some limitations: (i) several questions may have diverse correct answers, e.g., "Tell me a winner of the US Open?", and (ii) the same answer may be expressed in very different forms, e.g., for "How old is Joe Biden?", "Biden is 80 years old" vs. "The president has just entered his life's eighth decade". Furthermore, AVA does not use negative references when evaluating the correctness of answers, while incorrect answers can also help refine the prediction of correctness/incorrectness. Note that most AS2 datasets have multiple annotated answers (a combination of correct and incorrect labels), so it is straightforward to use them to build data for training AVA with multiple positive and negative references. Intuitively, a GenQA system synthesizes an answer using different pieces of information spread across many relevant candidates (while suppressing any irrelevant information), aligning well with the idea of using multiple references for QA evaluation.
We term this approach GAVA (AVA for generation-based models); it uses multiple references (combining positive and negative references) {r_1, r_2, . . ., r_n}. Fig. 1 shows the GAVA architecture, which uses a transformer encoder taking as input a question q, a target answer a, and n references. The nature (positive/negative) of each reference is encoded by prepending the reference with a prompt indicating its type.
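For concreteness, the serialization of GAVA's input might look as follows. The paper does not specify the exact prompt strings, so the "positive reference:"/"negative reference:" prefixes and the overall layout below are assumptions; a real implementation would feed the resulting string to the transformer's tokenizer.

```python
def build_gava_input(question, target_answer, references):
    """Serialize a question, a target answer, and n typed references into a
    single input string for the GAVA encoder. Each reference is prepended
    with a prompt marking it as a positive or negative reference.
    `references` is a list of (text, is_positive) pairs.
    Prompt strings are illustrative assumptions, not the paper's exact format."""
    parts = [f"question: {question}", f"answer: {target_answer}"]
    for text, is_positive in references:
        prompt = "positive reference:" if is_positive else "negative reference:"
        parts.append(f"{prompt} {text}")
    return " ".join(parts)

# Example with n=3 references (2 positive, 1 negative)
s = build_gava_input(
    "How old is Joe Biden?",
    "Biden is 80 years old.",
    [("The president is 80.", True),
     ("Biden turned 80 in November 2022.", True),
     ("Biden was born in Pennsylvania.", False)],
)
```

The encoder then produces a single correctness score s ∈ [0, 1] from this combined input, exactly as in AVA but conditioned on all n references at once.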

Comparison between GAVA and AVA
In subsequent sections, we use the QA evaluator as a teacher to transfer knowledge for training GenQA models. We hypothesize that this knowledge transfer improves the answer generation capability of GenQA models by enabling the model to discriminate between good and poor supporting answer candidates. Thus, a stronger QA evaluator teacher will help train more effective GenQA models. Here we perform an empirical comparison between GAVA and the baseline AVA model, to show that the former achieves a higher correlation with human annotations.
We consider two Answer Sentence Selection (AS2) datasets: WikiQA (Yang et al., 2015) and TREC-QA (Wang et al., 2007). We use a DebertaV3-Large (He et al., 2021) pre-trained model for both AVA and GAVA, and set n=5 reference answers per question for the latter. We measure the Pearson correlation between the human annotations and the two QA evaluators for each dataset under two configurations in Table 1: (i) Extractive QA (AS2), using the answer candidates available in the datasets, and (ii) Generative QA (GenQA), using answers written by a T5-Large GenQA model (Hsu et al., 2021). The results indicate the empirical superiority of GAVA over AVA as an automatic QA evaluation metric, which stems from the usage of multiple references, including negative ones.

A GenQA training example is a triple ⟨q, {a_1, . . ., a_k}, t⟩, where q is the question, {a_1, . . ., a_k} are the k answer candidates used as input context to M, and t is the target output answer (a GS human-authored answer). Gabburo et al. (2022) extended this line of work by proposing a novel approach to train GenQA models on unlabeled data by transferring knowledge from an AS2 model (which is used to produce silver labels). Specifically, for each question q, the AS2 model is used to rank a set of answer candidates without any labels of correctness/incorrectness. The top-ranked answer is used as the generation target for the GenQA model, while the question along with the next k top-ranked answers is used as the input for the GenQA model.

GAVA for Training GenQA
In this section, we propose three approaches for training GenQA models using GAVA. For every question q ∈ D, along with its answer candidates {a_1, . . ., a_k} as input context, we apply M_0 to generate multiple possible answers {g_1, g_2, . . ., g_l} using a probabilistic decoding approach (Wiher et al., 2022). Then, we apply the GAVA model to each generated answer g_i to obtain a GAVA score of correctness, s_i = E(q, g_i, {a_1, . . ., a_k}). Using a pre-defined threshold θ, we retain only the answers G = {g_i : s_i ≥ θ}. We use this set of generated and filtered answers as alternate generation targets to produce new examples ⟨q, {a_1, . . ., a_k}, g⟩ for training GenQA, where ⟨q, {a_1, . . ., a_k}, t⟩ ∈ D and g ∈ G. Fig. 2 illustrates this approach.
It should be noted that: (i) θ is a parameter that can be tuned to increase the probability of correctness of the generated answers g_i; however, a very high θ will filter out a majority of the generated answers, leading to a very small augmented set (a trade-off between size and quality). For our experiments, we used a value of θ that produced a large set of good-quality, diverse answers, as indicated by the GAVA score. (ii) Training a GenQA model on the augmented data can refine its predictions, biasing the generation towards "good-quality" answers. Overall, this improves the quality and accuracy of the generated answers.
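The static augmentation loop described above can be sketched as follows. The generator and GAVA scorer are stubbed out here (in the paper they are a T5-based GenQA model and the trained GAVA encoder, respectively), so this is a structural sketch rather than the actual implementation:

```python
def static_augmentation(examples, generate, gava_score, theta=0.9):
    """GAVA-SDA sketch: for each training example (q, candidates, t),
    generate candidate answers {g_1, ..., g_l} with the base GenQA model,
    keep only those whose GAVA score is >= theta, and emit one new training
    example per surviving answer (used as an alternate gold target)."""
    augmented = []
    for q, candidates, t in examples:
        for g in generate(q, candidates):              # {g_1, ..., g_l}
            if gava_score(q, g, candidates) >= theta:  # s_i >= theta filter
                augmented.append((q, candidates, g))
    return examples + augmented                        # original D plus new examples

# Demo with stub components (the real generator/scorer are neural models)
_examples = [("q1", ["a1", "a2"], "t1")]
_generate = lambda q, cands: ["good answer", "bad answer"]
_score = lambda q, g, cands: 0.95 if g == "good answer" else 0.30
augmented = static_augmentation(_examples, _generate, _score, theta=0.9)
```

Raising `theta` shrinks the augmented set but raises its expected quality, which is exactly the trade-off discussed above.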

Dynamic Data Augmentation (GAVA-DDA)
We can improve on the GAVA-SDA approach by producing new examples at regular intervals during training, e.g., at the beginning of every epoch. This makes the data augmentation more adaptive, improving the learning of the GenQA model M. As M improves during training, it generates higher-quality answers, which can then be selected by GAVA to augment the data for subsequent iterations. These higher-quality answers can, in turn, improve the GenQA model's generation ability. In other words, instead of using a static base GenQA model M_0 to generate the augmented data, we use the latest GenQA model M, trained on the latest augmented data in the training routine.
Additionally, we refine the input context to the GenQA model during training. After obtaining the filtered set of answers generated and selected by GAVA, G, we combine them with the answers from D, i.e., A = {a_1, . . ., a_k, t} ∪ G. We then use E to score A, pick the top-ranked answer as the target for GenQA, and use the next k top-ranked answers as the input context (following the same idea as Gabburo et al. (2022)). Intuitively, this can improve the quality of both the input context that the GenQA model uses for generation and the output target answer.
We combine the above two modifications into a single approach and call this Dynamic Data Augmentation (GAVA-DDA).
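One step of this dynamic re-ranking, for a single example at the start of an epoch, might be sketched as below. The GAVA scorer and the current model's generator are stubs, and the θ-filtering of generations (applied before pooling, as in GAVA-SDA) is omitted for brevity:

```python
def dynamic_augmentation_step(q, candidates, t, generate, gava_score, k=5):
    """GAVA-DDA sketch for one example: pool the current model's generations
    with the k input candidates and the gold target (A = {a_1,...,a_k, t} u G),
    score everything with GAVA, then use the top-scoring answer as the new
    target and the next k answers as the new input context."""
    pool = list(candidates) + [t] + list(generate(q, candidates))
    ranked = sorted(pool, key=lambda a: gava_score(q, a, candidates), reverse=True)
    new_target, new_context = ranked[0], ranked[1:k + 1]
    return q, new_context, new_target

# Demo with stub components and deterministic scores
_scores = {"g": 0.9, "t": 0.8, "a": 0.5, "b": 0.2}
_score = lambda q, ans, cands: _scores[ans]
_generate = lambda q, cands: ["g"]
q, ctx, tgt = dynamic_augmentation_step("q", ["a", "b"], "t", _generate, _score, k=2)
```

Because `generate` is the *current* model M rather than the frozen M_0, the pool improves as training progresses, which is the key difference from GAVA-SDA.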

Loss Weighting (GAVA-LW)
GAVA-SDA and GAVA-DDA transfer the knowledge of the GAVA evaluation model for training GenQA by augmenting the training data; neither modifies the GenQA training procedure itself. In contrast, GAVA-LW uses the GAVA score to modify the GenQA training loss.
More formally, for training M with an example ⟨q, {a_1, . . ., a_k}, t⟩ ∈ D, we apply three steps: (i) compute the standard cross-entropy loss L_G(t) of the GenQA model M on input ⟨q, {a_1, . . ., a_k}⟩ with target t; (ii) run the inference procedure on ⟨q, {a_1, . . ., a_k}⟩ to obtain the model generation g; and (iii) compute the GAVA score s = E(q, g, {a_1, . . ., a_k}). We then use the GAVA score to weight the GenQA training loss as follows:

L_GAVA-LW = (1 − s) · L_G(t)

The GAVA-LW approach is illustrated in Fig. 3. Intuitively, we want the model to learn to improve its predictions for examples where the answer quality given by GAVA is low; for these examples, we weight the training loss using the GAVA score. The L_GAVA-LW formulation (i) helps the model emphasize harder training samples (on which the model is currently not performing well) during learning, and (ii) trains a stronger, more generalized system.
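The three steps above can be sketched as one training-step helper. The cross-entropy, generation, and GAVA-scoring functions are stubs, and the (1 − s) weight is an assumption consistent with the stated intuition that low-scoring (harder) examples should contribute more to the loss:

```python
def gava_lw_step(q, candidates, t, cross_entropy, generate, gava_score):
    """One GAVA-LW training step (sketch):
    (i)   standard cross-entropy loss L_G(t) on the gold target t,
    (ii)  generate an answer g with the current model,
    (iii) score g with GAVA and weight the loss by (1 - s).
    The (1 - s) form is an assumption consistent with the intuition that
    examples with low-quality generations should be emphasized."""
    loss = cross_entropy(q, candidates, t)  # step (i)
    g = generate(q, candidates)             # step (ii)
    s = gava_score(q, g, candidates)        # step (iii)
    return (1.0 - s) * loss

# Demo with stub components (real versions come from the GenQA/GAVA models)
_ce = lambda q, cands, t: 2.0               # stub cross-entropy loss
_gen = lambda q, cands: "generated answer"  # stub inference
_score = lambda q, g, cands: 0.75           # stub GAVA score
weighted = gava_lw_step("q", ["a1"], "t", _ce, _gen, _score)
```

With s = 0.75 the example's loss is scaled to a quarter of its original value, so well-answered examples contribute little gradient compared to poorly answered ones.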

Datasets
For training and evaluating our models, we consider two academic datasets and one industrial dataset representing real-world customer questions.

WQA: Web Question Answers (WQA) is a public dataset defined in (Zhang et al., 2021). The dataset contains 149,513 questions, each associated with about 15 answer candidates. Both questions and answers are retrieved from a large-scale web index. Each QA pair is manually annotated for answer correctness by professional annotators.

MS-MARCO: Bajaj et al. (2018) proposed MS-MARCO, originally for MR tasks, comprising ∼800k queries retrieved from the Bing search engine, each with ∼10 labeled answer passages. Following Gabburo et al. (2022), we transform MS-MARCO to obtain a large dataset of QA pairs, where the answers are sentences rather than passages/paragraphs. Using a SOTA AS2 model (DeBERTav3-xl (He et al., 2021) trained on the ASNQ (Garg et al., 2020) dataset), we pick the top-2 ranked answer sentences from a positively labeled passage as positive answer candidates for the question. Similar to Gabburo et al. (2022), we randomly sub-sample 1k questions from the dev. set for evaluation (we use human annotations for our experiments, and annotating the entire 100k dev. set would be extremely expensive).

IQAD: The Industrial QA Dataset (Garg and Moschitti, 2021; Di Liello et al., 2022b) is a large-scale internal industrial QA dataset containing non-representative, de-identified user questions from the Alexa personal assistant. IQAD contains ∼10k questions, each having ∼200 answer candidates retrieved using a large-scale web index (over 100M documents). Each question has ∼15 answer candidates with human annotations of correctness/incorrectness. Results on IQAD are presented relative to a baseline due to the data being internal.

Models and Evaluation
For our experiments, we consider two types of models: (i) GAVA evaluation models, as described in Section 3, and (ii) GenQA models, trained using the techniques described in Section 5. For GAVA E, we use a DebertaV3-Large (He et al., 2021) pre-trained model with up to n=5 reference answers per question. We train two GAVA models, one on WQA and one on IQAD, using the former for both public datasets (WQA contains human annotations of answer correctness, which can be used as references for training a strong GAVA model; the answer sentence version of MS-MARCO does not contain human annotations of answer correctness). For GenQA, we consider a baseline model from Gabburo et al. (2022): a T5-Large (Raffel et al., 2020) encoder-decoder transformer trained using weak supervision on MS-MARCO. We consider this the baseline GenQA model M_0, and apply each of our techniques (GAVA-SDA, GAVA-DDA, GAVA-LW) starting from it. Unless otherwise stated, we use θ=0.9 for GAVA-SDA and GAVA-DDA. For the GenQA models, we take k=5 answer candidates as inputs, and select the best checkpoint, corresponding to the highest AVA-Score on the development set. We present complete experimental details in the Appendix.
We perform human evaluation of the generated answers using Amazon MTurk (5 annotations per QA pair, averaging these scores). We selected a pool of turkers having an approval rate higher than 95% with more than 500 approved HITs. We designed our annotation task by providing the annotator with (i) the question, (ii) the generated answer, and (iii) a correct reference answer. For each HIT (question + generated answer pair), we pay the turker $0.10 and obtain 5 independent annotations. Using these annotations, we compute the answering accuracy over the entire evaluation set: the number of correct answers divided by the total number of generated answers. We also evaluate models using the automatic GAVA metric.
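The aggregation of the MTurk judgements into an accuracy figure can be sketched as follows. The 0.5 majority threshold is an assumption (the paper only states that the 5 scores are averaged):

```python
def answering_accuracy(annotations, threshold=0.5):
    """Aggregate MTurk judgements into answering accuracy: each generated
    answer has 5 binary annotations; average them, count the answer as
    correct if the average exceeds the threshold, and report
    correct / total. The 0.5 majority threshold is an assumption."""
    correct = sum(1 for votes in annotations
                  if sum(votes) / len(votes) > threshold)
    return correct / len(annotations)

# Demo: three generated answers with 5 binary judgements each
acc = answering_accuracy([[1, 1, 1, 0, 0],   # avg 0.6 -> correct
                          [0, 0, 0, 1, 0],   # avg 0.2 -> incorrect
                          [1, 1, 1, 1, 1]])  # avg 1.0 -> correct
```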

Main Results
We evaluate GenQA models trained with our three proposed techniques in Table 2, using human evaluation of accuracy and GAVA-Score (automatic evaluation). We observe that across all datasets, our approaches outperform the baseline and are able to improve GenQA training, as evidenced by both human and automatic evaluation.
Specifically, for WQA, we observe that the GAVA-SDA approach achieves the highest answering accuracy (an impressive improvement of +21.3% points over the baseline). The experiments on WQA represent the ideal case, where a GAVA model can be trained on the same dataset (due to the availability of some correctness annotations). We also observe an improvement in the GAVA score for our approaches (which is expected, since we are using this model to supervise the training of GenQA). Interestingly, we do not see a perfect correlation between the human-induced and GAVA-induced relative orderings of the four techniques.
On MS-MARCO, we again observe that GAVA-SDA achieves the highest answering accuracy (+10.2% points over the baseline), and here there is a perfect correlation between the human evaluation and GAVA. This evaluation on MS-MARCO demonstrates the transferability of using GAVA for teaching GenQA across data distributions (the GAVA model used here is trained on WQA, as the sentence version of MS-MARCO does not have human annotations).
On the industrial dataset, IQAD, we observe that the GAVA-LW loss weighting approach achieves the highest accuracy (+9.85% relative improvement over the baseline). The results on IQAD lend support to our approaches extending to a real-world scenario with actual customer questions.

Analysis and Ablations
Variation of GAVA Score over Training: To understand how the GAVA score of our proposed techniques improves over the baseline, we plot its variation over the training epochs. We pick the MS-MARCO dataset and GAVA-LW as the approach to compare, and present results in Fig. 4. From the figure, we observe that GAVA-LW achieves a higher GAVA score than the GenQA baseline throughout the training regime. This demonstrates the knowledge transfer from the GAVA model for training GenQA, as the GenQA model achieves an increasingly higher GAVA score over the training epochs.

Variation of Threshold θ for GAVA-SDA: As discussed in Section 5, θ is a tunable parameter that decides the quantity vs. quality trade-off for data augmentation. We aim to study how the choice of θ affects the training of the GenQA model. We consider the GAVA-SDA approach and the MS-MARCO dataset, and use three different values of θ = {0.5, 0.7, 0.9}. We follow the same experimental setting described in Section 6.2, and present results in Table 4. The results suggest a trend of achieving a higher final GenQA accuracy using a higher value of θ. This highlights that the quality of the generated answers (for augmenting) is more important for downstream answer generation than the quantity (a higher θ will pick "better-quality" answers, but increase the number of answers filtered out). Additionally, we plot the GAVA-Score on the development set across training with these different values of θ in Fig. 5. We observe that the GAVA-Score for the model trained with θ=0.9 is always higher than the scores of the other models across the entire training.

Correlation with other Automatic Metrics
We perform a study to analyze how well other automatic evaluation metrics can perform for the task of evaluating answer generation. Specifically, we consider BLEU, BLEURT, and BERTScore. We use the MS-MARCO dataset and evaluate several different GenQA models trained using our approaches. We evaluate performance using each of these automatic metrics and GAVA, and present the Pearson and Spearman's rank correlations between these metrics and the manual evaluation in Table 3.

From the table, we observe that GAVA achieves the strongest correlation with human evaluation, highlighting its efficacy as an automatic QA evaluation metric. The other text-similarity metrics achieve poor correlation with the human annotation of answer correctness.

Qualitative results
We perform a qualitative analysis highlighting anecdotal examples to study the success and failure cases of our answer-generation approach. Specifically, we pick the MS-MARCO dataset and the GAVA-DDA approach, and present both success and failure cases of answer generation to gain insights into the strengths and limitations of our approach.
Table 5 shows instances where GAVA-DDA successfully generates accurate answers. These examples highlight various sub-tasks that the model implicitly performs. First, the model demonstrates its ability to synthesize information from multiple answer candidates. For example, for the question "How do I get from DC to Alexandria VA?", the model correctly synthesized information from each of the reference answer candidates into the generated answer about the Metrorail service connecting the two locations. Second, the model exhibits reasoning capabilities, identifying the correct reference answer candidate and improving its style suitability for answering the input question. This is observed in the second example, with the question "How long should a central air conditioner last?", where the model identified the first reference, "10 to 20 years - sometimes longer", as containing the most relevant information for answering the question. At times, the model acts as an answer sentence selection (AS2) model that simply re-ranks (without any modification) and outputs one of the reference answer candidates.
Table 6 presents some examples where the model hallucinates and produces incorrect answers. This is highlighted in the question about Albany, Minnesota's population, where the model hallucinates an incorrect year in the generated answer, even though it is not present in any of the input reference candidates. Additionally, at times, the model may be unable to synthesize a good answer due to lacking evidence in the retrieved reference candidates. This is highlighted in the question about the earth's magnetic field. This limitation emphasizes the importance of reliable and accurate answer candidates for grounding the model's answer generation.

Conclusion
In this paper, we propose a novel training paradigm for learning answer generation systems (GenQA) using supervision from automatic QA evaluation metrics based on transformer encoders. We propose three strategies: augmenting the training corpora with high GAVA-scoring generated answers for training the GenQA model (either statically, once before training, or dynamically, at every training epoch); and using the GAVA score for weighting the loss during the learning of the GenQA model. We perform empirical evaluation on two academic and one industrial dataset and show that our approaches outperform the baseline under both human annotation and automatic QA evaluation metrics (GAVA score). For future work, we plan to investigate how preferences based on automatic QA evaluators align with human-annotated preferences for training larger LMs via reinforcement learning (Lambert et al., 2022). This would involve using GAVA as the RLHF reward model.

Table 5: Examples of correctly generated answers using the GAVA-DDA approach on the MS-MARCO dataset. Example (1) highlights that the model is able to correctly synthesize an answer using the reference answer candidates for the question. Example (2) highlights a case where the GenQA model uses information from a single reference answer, but reformulates its style using the question to present it as an answer. Example (3) highlights a case where the GenQA model effectively functions as an answer ranker, directly copying the best answer candidate among the references to produce the generated answer.

(2) Question: Which layer is responsible for the earth's magnetic field?
Answer #1: Best Answer: The Earth's magnetic field is produced by convective currents in the outer core.
Answer #2: The (presumably) molten iron core.
Answer #3: The outer core is liquid iron.
GenQA Answer: The outer core is liquid iron.

Table 6: Examples of incorrectly generated answers using the GAVA-DDA approach on the MS-MARCO dataset. Example (1) highlights a case of hallucination during generation, where the model introduces an incorrect year in the generated answer, even though it is not present in any of the input reference candidates. Example (2) highlights a failure case of GenQA where the model is unable to synthesize a good answer due to lacking evidence in the retrieved reference candidates.

Limitations
The main limitation of our methodology is that training Generative Question Answering models requires large GPU resources, which may not be easily available to all researchers. Regarding the performance of the model and the quality of the generated answers, our approach can be affected by possible bias induced by the evaluation system we use. For the experiments in this paper, we only consider datasets in the English language; however, we conjecture that our techniques should work similarly for languages with a similar morphology. Automatic QA evaluation systems do not achieve perfect correlation with human annotations, which indicates a gap with respect to human evaluation. For safety-critical applications, human evaluation of generated answers remains the best means of evaluation.

Figure 1 :
Figure 1: Multi-reference GAVA that uses multiple positive and negative reference answers to evaluate the correctness/incorrectness of a target answer for a particular question and produces a score s ∈ [0, 1].

Static Data Augmentation (GAVA-SDA)

We create new training examples starting from D, using a GAVA model, E, and a base GenQA model M_0 (trained only on D). The new examples are added to D to create an improved training corpus for learning an improved GenQA model, M.

Figure 4 :
Figure 4: Comparison of the baseline GenQA-WS and GAVA-LW on WQA in terms of GAVA-Score (on the validation split) varying across training epochs (GAVA-LW achieves a higher GAVA-Score throughout training).

Figure 5 :
Figure 5: Comparison of three different GAVA-SDA models trained on MS-MARCO using different thresholds θ, in terms of GAVA-Score (on the validation split) varying across training epochs. The model trained with the largest value of θ achieves the highest GAVA-Score.

Table 1 :
Comparison between AVA and GAVA on WikiQA and TREC-QA. The models are compared in terms of the Pearson correlation between the evaluation system's prediction and the human evaluation. The best results for each dataset are highlighted in bold.

Table 2 :
Answering accuracy (manual evaluation) and AVA-Score on the WQA, MS-MARCO, and IQAD datasets. Results on IQAD are presented relative to the baseline (due to the data being internal). For WQA and MS-MARCO, we use an AVA model trained on WQA, and for IQAD we use an AVA model trained on IQAD. The best results for each dataset are highlighted in bold.

Table 3 :
Evaluation of different GenQA models using automatic evaluation metrics (BLEU, BLEURT, BERTScore) in addition to GAVA-Score, on the MS-MARCO dataset. We present the correlation each metric achieves with human annotation. GAVA achieves the best correlation with human evaluation of answer accuracy.

Table 4 :
Variation of GenQA accuracy when changing θ for the GAVA-SDA approach, on the MS-MARCO dataset. We present human and automatic (GAVA-Score) evaluation. |Augmented Set| indicates the number of data augmentation examples created using a particular value of θ (lower θ → more augmentation examples).
Answer #1: Albany, Minnesota, as per 2017 US Census estimate, has a community population of 2,662 people.
Answer #2: Albany is located in Stearns County, 20 miles west of St. Cloud and 80 miles northwest of Minneapolis/St. Paul on Interstate 94 (I-94). Albany has direct access to State Highway 238, which originates in Albany.
GenQA Answer: The population was 2,662 at the 2010 census.