Grounded Keys-to-Text Generation: Towards Factual Open-Ended Generation

Large pre-trained language models have recently enabled open-ended generation frameworks (e.g., prompt-to-text NLG) to tackle a variety of tasks going beyond traditional data-to-text generation. While this framework is more general, it is under-specified and often leads to a lack of controllability, which restricts its real-world usage. We propose a new grounded keys-to-text generation task: the task is to generate a factual description about an entity given a set of guiding keys and grounding passages. To address this task, we introduce a new dataset, called EntDeGen. Inspired by recent QA-based evaluation measures, we propose an automatic metric, MAFE, for factual correctness of generated descriptions. Our EntDescriptor model is equipped with strong rankers to fetch helpful passages and generate entity descriptions. Experimental results show a good correlation (60.14) between our proposed metric and human judgments of factuality. Our rankers significantly improve the factual correctness of generated descriptions (15.95% and 34.51% relative gains in recall and precision). Finally, our ablation study highlights the benefit of combining keys and groundings.


Introduction
Converting information to text (McKeown, 1985) has been a cornerstone of NLG research, with the goal of improving the accessibility of knowledge to general users. It has found many applications, such as generating sports commentaries (Wiseman et al., 2017), weather forecasts (Konstas and Lapata, 2012), biographical text (Lebret et al., 2016), and dialogue responses (Wen et al., 2015, 2016). The problem has traditionally been formulated as data-to-text generation, i.e., generating an output given structured input such as graphs, tables, or key-value pairs. However, this formulation is overspecified and does not cover other open-ended scenarios in the real world. Recent advances in large pre-trained language models (PLMs), as well as the general knowledge represented in them, have made it possible to formulate the problem as prompt-to-text or outline-to-text (Rashkin et al., 2020) generation. This offers the prospect of making NLG more broadly applicable, as such models allow input to be more parsimonious or ill-defined. However, issues such as lack of controllability and hallucination have lessened the practical applicability of this setting in real-world scenarios.

* Work done while first author was interning at MSR.

Figure 1: An example from the ENTDEGEN dataset. Given a set of topical and factual keys, along with multiple grounding passages, the task is to generate an entity description. Corresponding knowledge is underlined.
Barack Obama __ Family & Personal Life
Factual Keys: birth date, birth place, parents, spouse
Topical Keys: hospital, city, territorial entity, law firm
Obama was born on August 4, 1961, at Kapiolani Medical Center for Women and Children in Honolulu, Hawaii. He was born to an American mother and a Kenyan father. His mother, Ann Dunham (1942-1995), was born in Wichita, Kansas; […] In June 1989, Obama met Michelle Robinson when he was employed as a summer associate at the Chicago law firm of Sidley Austin. […] They began dating later that summer, became engaged in 1991, and were married on October 3, 1992.
Grounding Passages
To overcome these issues, we propose a new task, grounded keys-to-text generation, where given a wishlist of keys (without the values) about an entity and a set of short grounding passages as source knowledge, the goal is to generate a factual description. An example is shown in Fig. 1, where the task is to generate a paragraph about "Barack Obama", in particular about his family and personal life. Potential factual keys in this example are "birth date, birthplace, parents, spouse, children", etc. The task also enables finer-grained control over the types of entities to be included in the output via topical keys such as "hospital, city, law firm" for the example in Fig. 1. Finally, pertinent information about the entities needs to be fetched from a set of candidate grounding passages. These passages can be obtained via internet search. Our task differs from similar existing tasks, such as data-to-text generation (Koncel-Kedziorski et al., 2019), in that we presume keys but not values are given.
This covers more open-ended real-world scenarios where knowledge about entities is not available in a detailed structured format or is constantly changing, and so has to be fetched on the fly. Moreover, this formulation offers the user control over the generated text.
To facilitate research on the grounded keys-to-text generation task, we introduce a large-scale and challenging dataset, called ENTDEGEN, with about 375K instances. The grounded, factual, and long-form nature of the task brings a new challenge, i.e., generating paragraph-level text that is faithful to one or more grounding passages based on the provided guiding keys. To address this challenge, we propose ENTDESCRIPTOR, equipped with a strong ranker to help the model focus on passages that are both relevant to the keys and complementary. We propose two rankers. Our contrastive dense ranker is based on embedding-based retrieval systems trained in a contrastive framework. Our autoregressive ranker generates a sequence of passage indices autoregressively by modeling the probability of each passage conditioned on previously generated passages. This ranker is shown to achieve the strongest performance by modeling the joint probability of passages.
The factual aspect of generation also calls for a new evaluation metric. Inspired by recent fact-based evaluation for summarization, we propose an automatic metric, called MAFE, to evaluate different aspects of grounded text quality, including relevance and consistency. Our contributions are:
• We introduce a new controllable entity description generation task which requires aggregating knowledge from multiple grounding passages efficiently.
• To address this task, we also present a new dataset, called ENTDEGEN.
• We propose two ranking methods, contrastive dense and autoregressive, to select a sequence of useful passages for the model to ground in.
• We propose an evaluation metric to evaluate factual consistency in our proposed task, which highly correlates with human judgments of factuality.
Task: Grounded Keys-to-Text Generation

Given an entity e, a title t, a set of factual keys $K^f = \{k^f_1, \ldots, k^f_n\}$, a set of topical keys $K^t = \{k^t_1, \ldots, k^t_m\}$, and grounding passages $P = \{p_1, p_2, \ldots, p_N\}$, the goal is to generate a text (description) with respect to the provided keys.
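To make the task inputs concrete, the following is a minimal sketch of how one instance could be represented. The class name `KeysToTextInstance`, its field names, and the `" | "` separator are illustrative assumptions and are not taken from the ENTDEGEN release.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KeysToTextInstance:
    """One grounded keys-to-text example (field names are hypothetical)."""
    entity: str
    title: str
    factual_keys: List[str]
    topical_keys: List[str]
    passages: List[str] = field(default_factory=list)
    reference: str = ""  # gold description, used only for training/evaluation

    def query(self, sep: str = " | ") -> str:
        """Concatenate entity, title and keys into one flat query string."""
        parts = [self.entity, self.title, *self.factual_keys, *self.topical_keys]
        return sep.join(parts)


example = KeysToTextInstance(
    entity="Barack Obama",
    title="Family & Personal Life",
    factual_keys=["birth date", "birth place", "parents", "spouse"],
    topical_keys=["hospital", "city", "territorial entity", "law firm"],
    passages=["Obama was born on August 4, 1961, ..."],
)
print(example.query())
```

Note that the keys carry no values: the values have to be recovered from the grounding passages at generation time.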

Dataset: ENTDEGEN
Our dataset collection strategy is based on Wikipedia and motivated by the WIKITABLET dataset. Each Wikipedia article A w is composed of multiple sections S = {s 1 , s 2 , ..., s n }. The title of the Wikipedia article is the entity e whose description is to be generated, and the text in each section forms a reference (gold) description, r. For example, an article about a football player may contain sections about "Introduction", "Early Life", "Club Career", and "International Career", each forming a separate instance in our dataset. We perform the following steps for each section s i in an article to obtain factual keys, topical keys, and grounding passages.
Factual Keys. Factual keys seek specific knowledge about an entity of interest. To obtain factual keys, we align key-value pairs in the infobox and Wikidata with each section s i . We take a distant-supervision approach and estimate the alignment score of each key-value pair with the section using semantic similarity and lexical precision. For semantic similarity, we compute the precision component of BERTScore (Zhang et al., 2020b) between the section text and the concatenation of the key-value pair (key + value). A high value indicates that the key-value pair is semantically relevant to that section. We also measure the ROUGE-L precision score (Lin, 2004) of the section text with respect to the concatenation of the key-value pair. For each instance in our dataset, we select keys whose key-value BERTScore is greater than 0.82 and whose ROUGE-L score is greater than 0.25.
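The two-threshold selection rule above can be sketched as follows. This is an illustration under stated assumptions: BERTScore precision values are assumed to be precomputed and passed in as a dictionary, ROUGE-L precision is approximated as the longest-common-subsequence length divided by the number of key-value tokens, and the function names are hypothetical.

```python
import re
from typing import Dict, List


def tokens(text: str) -> List[str]:
    """Lowercased word tokens (a simplification of ROUGE tokenization)."""
    return re.findall(r"\w+", text.lower())


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_precision(section: str, key_value: str) -> float:
    """Fraction of key-value tokens covered by the LCS with the section."""
    kv = tokens(key_value)
    return lcs_length(tokens(section), kv) / len(kv) if kv else 0.0


def select_factual_keys(section: str, pairs: Dict[str, str],
                        bert_precision: Dict[str, float],
                        bert_thr: float = 0.82,
                        rouge_thr: float = 0.25) -> List[str]:
    """Keep keys whose key-value pair passes both thresholds."""
    selected = []
    for key, value in pairs.items():
        lexical = rouge_l_precision(section, f"{key} {value}")
        if bert_precision[key] > bert_thr and lexical > rouge_thr:
            selected.append(key)
    return selected


section = "Obama was born on August 4, 1961, in Honolulu, Hawaii."
pairs = {"birth place": "Honolulu Hawaii", "occupation": "politician"}
# Placeholder semantic scores; in practice these come from a BERTScore run.
scores = {"birth place": 0.90, "occupation": 0.90}
print(select_factual_keys(section, pairs, scores))
```

Requiring both signals means a pair that is semantically close but lexically absent from the section, or the reverse, is not kept as a key.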

Topical Keys. Topical keys are not tied to specific aspects of the entity of interest, but give hints on the type of other entities to be included in the output. To obtain topical keys, we first find all hyperlinked articles A h appearing in the section. We then use the value of the "instance of" or "subclass of" tuple in the Wikidata  (Liu et al., 2018). The documents are citations in the Wikipedia article obtained from Common Crawl or web pages returned by Google Search. Each instance in our data has 40 grounding passages. Note that our dataset is distantly supervised, and these passages may not always contain all the facts regarding the keys. To enhance the quality of our dataset, we filter out entities for which the average BERTScore recall of key-value pairs against the grounding passages is lower than 0.82.

Basic statistics of ENTDEGEN are provided in Table 1.

MAFE: Multi-Aspect Factuality Evaluation
Our proposed task is to generate a factual description. Hence, it is crucial to evaluate the factuality of the generated texts. Inspired by recent fact-based evaluation in abstractive summarization (Scialom et al., 2019; Durmus et al., 2020), we propose to assess the factuality of generation through question answering (QA). We evaluate the factuality of a generated description h with respect to (i) factual triples (e, k, v), which are constructed from the entity e, each factual key k, and its value v, and (ii) the reference (gold) description r.
In our QA-based Multi-Aspect Factuality Evaluation (MAFE), questions are generated from spans in the reference and factual triples (recall), or from the generated output (precision), and are automatically answered using the output, or the reference and factual triples, respectively. The similarity between the predicted answer and the gold answer is then used to compute recall and precision. Our evaluation framework, illustrated in Fig. 2, accounts for both relevance (recall) and consistency (precision).
Recall (h → r) evaluates the generated output h on recalling information from the factual triples (e, k, v) AND the reference r. For this, we generate questions that have gold answers in the factual triples and the reference using a Question Generation (QG) module, and obtain answers to these questions from the generated output h using a Question Answering (QA) module. We define recall as the average score of these answers when compared to the gold answers (computed by an Answer Matching (AM) module).
Precision (r → h) measures the amount of information contained in the generated output h that is consistent with the factual triples (e, k, v) OR the reference r. For this, we generate questions from the output and obtain answers from the factual triples and the reference. We define precision as the maximum score between answers predicted from the factual triples and from the reference. Next, we describe the three modules of the evaluation framework.
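The recall and precision definitions above can be sketched with the QA and AM modules passed in as plain callables. The function names, the toy stand-in modules, and the choice to average the per-question maximum in precision are assumptions made for illustration; the real modules are the fine-tuned models described in the following sections.

```python
from statistics import mean
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]               # (question, gold answer)
Answerer = Callable[[str, str], str]   # (question, context) -> predicted answer
Matcher = Callable[[str, str, str], float]  # (question, gold, predicted) -> score


def mafe_recall(gold_qas: List[QAPair], output: str,
                answer: Answerer, match: Matcher) -> float:
    """Answer reference/triple questions from the generated output."""
    scores = [match(q, gold, answer(q, output)) for q, gold in gold_qas]
    return mean(scores) if scores else 0.0


def mafe_precision(output_qas: List[QAPair], reference: str, triples_text: str,
                   answer: Answerer, match: Matcher) -> float:
    """Answer output questions from the reference OR the linearized triples."""
    scores = []
    for q, gold in output_qas:
        from_reference = match(q, gold, answer(q, reference))
        from_triples = match(q, gold, answer(q, triples_text))
        scores.append(max(from_reference, from_triples))
    return mean(scores) if scores else 0.0


def toy_answer(question: str, context: str) -> str:
    """Stand-in QA model: look up a known answer string in the context."""
    candidates = {"Where": ["Honolulu", "Chicago"], "When": ["1961", "1992"]}
    for cand in candidates.get(question.split()[0], []):
        if cand in context:
            return cand
    return ""  # treated as unanswerable


def toy_match(question: str, gold: str, predicted: str) -> float:
    """Stand-in answer matcher: exact string match."""
    return 1.0 if gold == predicted else 0.0


gold_qas = [("Where was Obama born?", "Honolulu"), ("When was Obama born?", "1961")]
output = "Obama was born in Honolulu."
print(mafe_recall(gold_qas, output, toy_answer, toy_match))

output_qas = [("Where was Obama born?", "Honolulu")]
print(mafe_precision(output_qas, "Obama was born in 1961.",
                     "Barack Obama place of birth Honolulu", toy_answer, toy_match))
```

In the toy run, the output misses the birth year, which lowers recall, while its single claim is supported by the triples even though the reference does not mention it, so precision is unaffected.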

Question Generation
Given a sentence s containing an answer span a (marked by special tokens), we train a QG model to generate a question q (which is answerable by a), modeling P qg (q|s, a). For evaluating a generated output, we gather a set of answer spans a by extracting all named entities and noun phrases from each sentence s (of the reference or output) using spaCy.
For generating questions from factual triples, we linearize them by concatenating their constituent elements and consider the value v as the answer a. For example, we form "Barack Obama place of birth hawaii" from (Barack Obama, place of birth, Hawaii). Following Durmus et al. (2020), our QG model is a BART model fine-tuned on (s, a, q) triples annotated by Demszky et al. (2018).
Although the QG model is trained on natural language sentences, we found that it transfers reasonably well to relational triple data because of their simple format.
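The input preparation for question generation can be sketched as below. The marker tokens `<ans>` and `</ans>` and both function names are hypothetical; the source only states that the answer span is marked by special tokens and that triples are linearized by concatenation.

```python
def linearize_triple(entity: str, key: str, value: str) -> str:
    """Concatenate the elements of a factual triple into one string."""
    return f"{entity} {key} {value}"


def mark_answer(sentence: str, answer: str,
                open_tok: str = "<ans>", close_tok: str = "</ans>") -> str:
    """Wrap the first occurrence of the answer span in marker tokens."""
    if answer not in sentence:
        raise ValueError("answer span not found in sentence")
    return sentence.replace(answer, f"{open_tok} {answer} {close_tok}", 1)


triple_text = linearize_triple("Barack Obama", "place of birth", "Hawaii")
print(triple_text)
# For a triple, the value plays the role of the answer span.
print(mark_answer(triple_text, "Hawaii"))
```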

Question Answering
Given a question q and a context c, the QA model gives the probability of an answer a, modeling P qa (a|q, c). For evaluating a generated output, given a question q generated by the QG model from the reference and factual triples, or from the output, the QA model answers it using the output, or the reference and factual triples (as context c), respectively. For answering questions using factual triples as context, we concatenate all the linearized triples into a single text. Our QA model is an ALBERT-XL model (Lan et al., 2020) fine-tuned on SQuAD2.0 (Rajpurkar et al., 2018), with an F1 score of 87.9% on SQuAD2.0. SQuAD2.0 supports identifying unanswerable questions, which is crucial as not all answers are found in a given context.

Answer Matching
The common approach to assessing the answers given by a QA model (compared to gold answers) is to use the F1 score, which is based on exact matching of n-grams. We argue that this is problematic in our case when correct answers are lexically different. For example, Sport: "professional wrestling" can be realized as "She is a wrestler [...]". The F1 score does not capture these lexically varied but correct answers. Therefore, we propose using an NLI model to compare the similarity of two answers. Given the generated question q from the reference, to compare the reference (gold) answer and the predicted answer, we concatenate each answer with the question separately to form the premise and hypothesis for the NLI model. For example, for the question "What sport did Mr. Kenny Jay play?", we pass the following to the NLI model:
Premise: What sport did Mr. Kenny Jay play? professional wrestling
Hypothesis: What sport did Mr. Kenny Jay play? wrestler
We give the predicted answer a score of 1 if the NLI model predicts entailment, and a score of 0 if it predicts contradiction. For neutral, we compute the BERTScore (Zhang et al., 2020b) comparing the contextualized representations of the two answers. For the NLI model, we use RoBERTa (Liu et al., 2019) fine-tuned on MNLI (Williams et al., 2018), with an accuracy of 90% on MNLI. We include examples comparing NLI and the F1 score in Table 9.
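The scoring rule above can be sketched as follows, with the NLI model and the similarity fallback passed in as callables. The function names and the toy stand-ins are assumptions; `token_f1` is included only to show why exact token overlap gives no credit in the wrestling example.

```python
from collections import Counter
from typing import Callable


def token_f1(gold: str, predicted: str) -> float:
    """Token-overlap F1 between two answers, for comparison only."""
    g, p = gold.lower().split(), predicted.lower().split()
    common = sum((Counter(g) & Counter(p)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def answer_match_score(question: str, gold: str, predicted: str,
                       nli: Callable[[str, str], str],
                       similarity: Callable[[str, str], float]) -> float:
    """1 for entailment, 0 for contradiction, similarity score otherwise."""
    premise = f"{question} {gold}"
    hypothesis = f"{question} {predicted}"
    label = nli(premise, hypothesis)
    if label == "entailment":
        return 1.0
    if label == "contradiction":
        return 0.0
    # Neutral (or any other label) falls back to a soft similarity score.
    return similarity(gold, predicted)


question = "What sport did Mr. Kenny Jay play?"
# Toy stand-ins: a real setup would call an NLI model and BERTScore here.
always_entails = lambda premise, hypothesis: "entailment"
always_neutral = lambda premise, hypothesis: "neutral"
fixed_similarity = lambda gold, predicted: 0.4

print(token_f1("professional wrestling", "wrestler"))
print(answer_match_score(question, "professional wrestling", "wrestler",
                         always_entails, fixed_similarity))
```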

ENTDESCRIPTOR
The ENTDESCRIPTOR model needs to fetch relevant passages on the fly to generate a factual description. For this, we equip our ENTDESCRIPTOR model with a Passage Ranker (§5.1). Given the entity, keys, and a set of ranked passages, the Descriptor Generator (§5.2) then generates an entity description.

Passage Ranker
Each instance in our dataset is accompanied by a set of candidate grounding passages. However, not all passages contain useful knowledge about certain aspects of an entity, i.e., the provided factual and topical keys. We therefore introduce a ranking stage where we rank the grounding passages P given the entity, title, and set of keys as query q. The ranker outputs the top-k passages {p 1 , ..., p k } ⊂ P, which are then used to ground the Descriptor Generator. Below, we describe two baseline rankers, namely the ROUGE-2 and tf-idf rankers, and two proposed rankers, namely the contrastive dense and autoregressive rankers.
ROUGE-2 (oracle). This ranker ranks passages according to their ROUGE-2 recall against the reference. This is akin to oracle ranking, as we use information in the reference to do the ranking.
Tf-idf. This ranker ranks passages using their tf-idf score following Liu et al. (2018).
Contrastive Dense. This ranker learns and then compares dense representations of queries and passages using contrastive training. We train a dense ranker (shown in Fig. 3(a)) inspired by recent embedding-based retrieval systems such as REALM (Guu et al., 2020) and DPR (Karpukhin et al., 2020). We follow a distant supervision approach (Jernite, 2020) by using the reference descriptions r instead of gold passages as the supervision signal. For each instance, we form a query q i by concatenating the entity, title, and set of keys. We thus construct a dataset of (q i , r i ) pairs and use a bi-encoder architecture to project queries and references to a 128-d embedding space. We use a contrastive framework with in-batch negatives, where the idea is to push the encoded vector of a query closer to its corresponding reference vector, but away from the other reference vectors in the batch. Formally, we optimize the following cross-entropy loss with in-batch negatives:
$$\mathcal{L} = -\sum_{(q_i, r_i) \in mB} \log \frac{\exp(q_i \cdot r_i)}{\sum_{r_j \in mB} \exp(q_i \cdot r_j)}$$
where $q_i$ and $r_i$ are the encoded query and reference vectors, and mB denotes the mini-batch.
We use mini-batches of 1024, and initialize the encoders with distilled-BERT (Turc et al., 2019; Devlin et al., 2019). Two projection layers are then learned for queries and references. Once the ranker is trained, we use the reference encoder to encode each grounding passage p i and score the passages by their dot-product similarity with the vector representation of the query q i . We then use the top-k passages as input to the Descriptor Generator.
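The training objective and the inference-time scoring of the contrastive dense ranker can be illustrated on plain Python vectors. This is a sketch of a standard cross-entropy loss with in-batch negatives and dot-product scoring, not the authors' implementation; the function names are hypothetical, the vectors are toy values, and the loss is summed over the batch as an assumption.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(queries: List[Vector],
                              references: List[Vector]) -> float:
    """Cross-entropy where reference i is the positive for query i and all
    other references in the batch act as negatives."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, r) for r in references]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        loss += log_z - logits[i]
    return loss


def top_k_passages(query_vec: Vector, passage_vecs: List[Vector], k: int) -> List[int]:
    """Indices of the k passages with the highest dot product to the query."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[5.0, 0.0], [0.0, 5.0]]   # each reference matches its own query
swapped = [[0.0, 5.0], [5.0, 0.0]]   # each reference matches the other query
print(in_batch_contrastive_loss(queries, aligned))
print(in_batch_contrastive_loss(queries, swapped))
print(top_k_passages([1.0, 0.0], [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]], k=2))
```

Because each passage is scored against the query independently, two near-duplicate passages can both land in the top-k.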
Autoregressive. In the previous ranker, passages are scored independently according to their relevance to the input query q. However, an ideal ranker should select relevant yet diverse passages. To achieve this goal, we develop an autoregressive ranker with an encoder-decoder architecture (shown in Fig. 3(b)), where the encoder processes the entire set of passages P and the decoder generates a sequence of k passage indices. The autoregressive nature enables modeling the joint probability of passages P(p 1 , ..., p N | q). A similar text-to-index framework showed promising results for sentence ordering (Basu Roy Chowdhury et al., 2021) and multi-answer retrieval (Min et al., 2021). To enable encoding the entire set of passages (in our case 40), we use the Fusion-in-Decoder (FiD) architecture following Izacard and Grave (2021). The FiD architecture takes the input query (the concatenation of entity, title, and set of keys) as well as each individual passage independently as inputs to its encoders. The query is concatenated with each passage and its positional index using special tokens: Generator. All encoders and the decoder are initialized with T5 (Raffel et al., 2020). This ranker is trained using the silver sequence of passage indices obtained by the ROUGE-2 (oracle) ranker.
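The silver supervision for the autoregressive ranker can be sketched as a ranking of passages by ROUGE-2 recall against the reference. The version below is a simplified bigram-recall computation (lowercased word tokens, no stemming), so it only approximates an official ROUGE implementation; the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List


def bigrams(text: str) -> Counter:
    """Multiset of adjacent word pairs in the lowercased text."""
    toks = re.findall(r"\w+", text.lower())
    return Counter(zip(toks, toks[1:]))


def rouge2_recall(passage: str, reference: str) -> float:
    """Share of reference bigrams that also occur in the passage."""
    ref = bigrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((ref & bigrams(passage)).values())
    return overlap / total


def silver_sequence(passages: List[str], reference: str, k: int) -> List[int]:
    """Passage indices ordered by decreasing ROUGE-2 recall, cut to length k."""
    scores = [rouge2_recall(p, reference) for p in passages]
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return order[:k]


reference = "Obama was born in Honolulu Hawaii"
passages = [
    "The weather is nice today",
    "Obama was born in Honolulu Hawaii in 1961",
    "Obama was born in Kenya according to a false rumor",
]
print(silver_sequence(passages, reference, k=2))
```

The resulting index sequence is what the decoder would be trained to emit, one index per step.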

Descriptor Generator
Extractive. We build an extractive baseline using QG and QA models. For this, we convert an entity name and each factual key in our input into a natural language question using a seq2seq model. We then use a strong extractive QA model, namely ALBERT-XL fine-tuned on SQuAD2.0, to answer these questions using all grounding passages as the context (each grounding passage is passed separately). Finally, we concatenate all sentences from the groundings that contain the most confident answers as our final output.
Abstractive. We build strong abstractive baselines by fine-tuning several transformer-based PLMs. These include encoder-decoder models, namely BART-Large P k and generate the entity description. Note that when training our generation models, we use the top-10 grounding passages obtained by the oracle ROUGE-2 ranker. 10 See Appendix A for details.

Automatic Evaluation
Besides our proposed MAFE metric, we use several widely used automatic metrics such as BLEU (Papineni et al., 2002) and ROUGE-L (Lin, 2004). However, recent work (Dhingra et al., 2019) has raised concerns about the use of these metrics for automatically constructed data-to-text datasets, as they fail to consider divergent reference texts. We therefore also use PARENT (Dhingra et al., 2019), 11 which considers the similarity of the generation to both the data (in our case, factual triples) and the reference. Lastly, we use BERTScore (Zhang et al., 2020b), which computes an alignment between BERT representations of the reference and the generated output.
10 Using the passages obtained from other rankers during training degrades the performance.
11 We use the co-occurrence version, which is recommended when paraphrasing is involved between data and text.

Evaluation of MAFE
We propose a metric for evaluating factuality, MAFE. To evaluate MAFE as a metric, we compute its correlation with human judgments of factuality. We take a random set of 297 BART-L generated outputs using different rankers. The instances include a diverse set of entity domains (see Fig. 5 in the Appendix). We collect human judgments of factuality on this subset using Amazon Mechanical Turk (AMT). Three annotators judged the recall-oriented and precision-oriented factuality of each generated paragraph. For evaluating recall-oriented factuality, we present each sentence of the reference one at a time and ask annotators how well the sentence is supported by the content in the generated paragraph. The annotators choose from a Likert scale of 1-5 (1 being very badly supported, 5 being very well supported). We also present each factual triple one at a time and ask annotators whether it is supported by the content in the generated paragraph. For evaluating precision-oriented factuality, we switch references and factual triples with generated paragraphs, i.e., we show the generated paragraphs one sentence at a time and ask how well the sentence is supported by the reference and all factual triples. We then average scores across all sentences. See Appendix C.2 for details and screenshots of the annotation layout. To account for both recall- and precision-oriented values, we measure correlations of the human judgment F1, $\frac{2 \cdot \text{rec} \cdot \text{prec}}{\text{rec} + \text{prec}}$, with MAFE-F1 and other automatic metrics. According to Table 2

Results
Performance of Different Baselines. Table 3 reports the performance of different baselines for the task of entity description generation. According to the results, Extractive performs poorly compared to the abstractive baselines. This is mainly because it lacks the narrative flow required for a coherent output. Comparing all abstractive baselines when they are given oracle groundings (defined in §5.1) shows that BART generally outperforms T5 and PEGASUS on all n-gram overlap-based, PARENT, and BERTScore metrics. Uni/bi-gram overlap scores (R-1, R-2) are reported in Table 8. When comparing baselines with respect to factuality using our MAFE metric, we see that BART in general generates paragraphs that are significantly more consistent (precise) with respect to the factual triples and reference. T5, in contrast, is slightly better at content selection (measured by recall).
Performance of Different Rankers. We now investigate the effect of different rankers on generation performance. For this, we compare baselines using different rankers (see Table 3). All models perform better when they are given top-k ranked groundings than their Unranked baselines. For all generation models, the proposed contrastive and autoregressive rankers significantly outperform the tf-idf baseline ranker. This is because the tf-idf ranker only finds passages that feature sparse words from the input query and fails to capture semantic similarities. Moreover, by predicting a sequence of passages, each conditioned on the previously selected passages, the autoregressive ranker gives the generation model further improvements over the strong contrastive dense ranker. We also compare Recall@k for different rankers w.r.t. the oracle ranking in Table 4. The score indicates the proportion of oracle passages (obtained by the ROUGE-2 method) that are found in the top-k passages predicted by each ranker. We find that the autoregressive ranker outperforms the other two rankers.
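The Recall@k comparison against the oracle ranking can be sketched as below. The function name and the choice to pass the oracle passages as an explicit list of indices are assumptions; the source does not state how many oracle passages are used per instance.

```python
from typing import List


def recall_at_k(predicted: List[int], oracle: List[int], k: int) -> float:
    """Proportion of oracle passage indices found in the top-k predictions."""
    if not oracle:
        return 0.0
    hits = set(predicted[:k]) & set(oracle)
    return len(hits) / len(oracle)


predicted = [3, 1, 7, 2]  # ranker output, best first
oracle = [1, 2, 5]        # passages chosen by the ROUGE-2 (oracle) ranker
print(recall_at_k(predicted, oracle, k=2))
print(recall_at_k(predicted, oracle, k=4))
```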

Human Evaluation
Here, we evaluate the factuality and faithfulness of generated descriptions on AMT.
Factuality (r ↔ h). We evaluate the factuality of generated paragraphs using human annotators. We randomly sample 100 data points from the test set and evaluate paragraphs generated by BART-L using four rankers: tf-idf, contrastive dense, autoregressive, and ROUGE-2 (oracle), for a total of 400 generation examples. We ask 3 judges from AMT to evaluate the recall-oriented and precision-oriented factual correctness of each sample generation. We use the same annotation layout described for evaluating the MAFE metric (correlation analysis; §6.1). More details can be found in Appendix C.2. Table 5 shows that human annotators consistently rate the factuality of paragraphs generated using the autoregressive ranker higher than those generated using the contrastive dense ranker and lower than those generated using the oracle ranker. The result is consistent with our proposed metric as well.
We also evaluate whether the generated outputs are faithful to the top-k grounding passages. For this, we randomly sample 100 data points from the test set and ask 3 annotators from AMT to evaluate the faithfulness of generated outputs using different baselines on a scale of 1-5. Following our previous annotation layout, we show one sentence at a time and then average scores across all sentences. Table 6 shows that T5 generates more faithful paragraphs compared to other baselines.

Table 5: Human evaluation of factuality (recall- and precision-oriented, in %) for BART-L generated paragraphs using different rankers.

               T5-Large   BART-L   PEGASUS
Human Rating   4.17       3.53     3.83

Ablation Studies
Here, we discuss different ablations of our task where we remove/add certain information from/to the input and investigate its effect on the performance. We experiment with settings where there are no groundings, no keys, no factual keys, no topical keys, values w/o groundings, and values w/ groundings. Table 7 shows the results for the BART-L baseline with the autoregressive ranker. As expected, the model performance degrades the most w.r.t. all metrics when the grounding passages are removed from the input. This setting is similar to prompt-to-text generation, where the model mostly relies on its parametric knowledge and is prone to hallucination. Removing all the keys from the input is detrimental to recalling important information, as shown by the MAFE-R score. We also observe that ablating factual keys hurts the relevance of the generated paragraph (i.e., recall) w.r.t. its reference more, whereas ablating topical keys hurts the n-gram overlap metric (R-L) more. This is because factual keys are essential for making a good content selection and being rewarded by the MAFE metric, whereas topical keys mostly appear verbatim in the output. Lastly, having the gold values for the corresponding keys without the grounding passages cannot beat the performance with the original inputs. In particular, although the model can recover more information (i.e., better recall), not being grounded causes it to generate less consistent information (i.e., lower precision). This is in line with our previous findings that passages play an important role in achieving good performance. When the values are accompanied by groundings, the model achieves the best performance, emphasizing the importance of grounding.

Related Work
Natural Language Generation. Several data-to-text problems have been proposed with various input formats, such as Knowledge Graphs (Koncel-Kedziorski et al., 2019; Cheng et al., 2020), Abstract Meaning Representations (Flanigan et al., 2016; Ribeiro et al., 2019), tables and tree-structured semantic frames (Bao et al., 2018; Nan et al., 2021), and Resource Description Framework (Gardent et al., 2017). Towards a more controlled generation task, ToTTo was introduced for open-domain table-to-text generation where only some of the cells are selected as the input. However, ToTTo and most existing datasets, such as WIKIBIO (Lebret et al., 2016) and LogicNLG, focus on generating single sentences. Although generating long-form text is becoming a new frontier for NLP research, not many datasets and tasks have been proposed to explore this new direction. Available datasets such as ROTOWIRE (Wiseman et al., 2017) or MLB (Puduppully et al., 2019) are either small-scale or restricted to a single domain (e.g., sports). Unlike prior work, we propose a long-form grounded keys-to-text generation task that covers multiple domains and categories, including people, locations, organizations, events, etc.
Recently, the WIKITABLET dataset was presented for long-form text generation from multiple tables and metadata. However, this setting is overspecified because knowledge about entities may not always be available in a structured format and may get updated in real time. In a more natural setting, our ENTDEGEN dataset uses factual and topical keys as guidance but still leaves a considerable amount of content selection from grounding passages to be done by the model.
There have been several works on open-ended NLG (e.g., prompt-to-text or outline-to-text) (Fan et al., 2018; Xu et al., 2018; Yao et al., 2019; Rashkin et al., 2020; Brahman et al., 2020). Our task is also closely related to query-focused multidocument summarization Lapata, 2020, 2022) which relies on retrieval-style methods for estimating the relevance between queries and text. Additionally, our task setup can benefit from evaluation methods in the summarization domain.
Factual Consistency Evaluation. Evaluating the factual consistency of machine-generated outputs has gained growing attention in recent years. New approaches have been proposed mainly for tasks like abstractive summarization and machine translation (Zhang et al., 2020b; Sellam et al., 2020; Durmus et al., 2020). Some of these metrics are QA-based and have been used to measure common information between documents/references and summaries (Eyal et al., 2019; Scialom et al., 2019). Our proposed metric, MAFE, is inspired by these works.

Conclusion
We present a practical task of grounded keys-to-text generation and construct a large-scale dataset, ENTDEGEN, to facilitate research on this task. Experiments show the effectiveness of the proposed rankers in fetching the relevant information required to generate a factual description. The human evaluation shows that ENTDEGEN poses a challenge to state-of-the-art models in terms of achieving human-level factuality in long-form generation. Our proposed dataset and task can also foster further research on the recently emerging retrieval-augmented generation models (Lewis et al., 2020b; Zhang et al., 2021; Shuster et al., 2021), where the retriever and generator components are trained end-to-end.

Limitations
One of the limitations of our work is the reliance on a strong retriever/ranker. A weak retriever may result in generated text that is less factual and thus less trustworthy. While we proposed efficient and simple methods for training the retriever, these require large GPUs. Additionally, as the retrieved passages get longer, the quality of text generation may degrade due to known issues with encoding longer sequences.

A Implementation Details
Baselines. Top-10 grounding passages were used to train and test all baselines. We use the Transformers library (Wolf et al., 2019). Each baseline was trained for 3 epochs with an effective batch size of 8 and an initial learning rate of 5e-6 for T5 and BART, and 1e-4 for PEGASUS. We use a maximum input length of 512 tokens. During inference, we use beam search decoding with 5 beams and a repetition penalty of 1.2. Note that we use the BART-L model fine-tuned on the XSUM dataset as our initial weights. Similarly, we use Google's PEGASUS model fine-tuned on XSUM. The experiments are conducted in the PyTorch framework using a Quadro RTX 6000 GPU.

Rankers.
The contrastive dense ranker was trained for 10 epochs with a learning rate of 2e-4. The autoregressive ranker was trained for a total of 30,000 steps with a learning rate and weight decay of 1e-5 and 0.01, respectively. Rankers were trained using 4x Nvidia V100 GPU machines, each with 32G memory.
Question Generation in MAFE. The question generation (QG) module in the MAFE evaluation metric generates questions using beam search decoding with a beam size of 10.

B Dataset Quality Assessment
We conducted a human evaluation on Amazon Mechanical Turk to assess the quality of our automatically constructed dataset. In this experiment, we randomly sample 100 examples from the test set. For each example, we ask 3 annotators to read the reference description carefully and answer whether each factual key-value pair is stated in the description or can be implied by the description. We then take the majority vote between the annotations. The result shows that 74% of reference descriptions contain information about more than half of the key-value pairs, with a Fleiss' Kappa of 0.53 showing moderate agreement.
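The aggregation described above can be sketched as follows: take a majority vote over the three annotators for each key-value pair, then count the descriptions for which more than half of the pairs are judged supported. The data layout and function names are assumptions made for illustration, and the votes below are made-up toy values, not the collected annotations.

```python
from typing import List

Votes = List[bool]  # one vote per annotator for a single key-value pair


def majority(votes: Votes) -> bool:
    """True if strictly more than half of the annotators voted 'supported'."""
    return sum(votes) > len(votes) / 2


def covers_most_pairs(pair_votes: List[Votes]) -> bool:
    """True if more than half of the key-value pairs are majority-supported."""
    supported = sum(majority(v) for v in pair_votes)
    return supported > len(pair_votes) / 2


def fraction_well_covered(examples: List[List[Votes]]) -> float:
    """Share of reference descriptions that cover most of their pairs."""
    return sum(covers_most_pairs(e) for e in examples) / len(examples)


toy_examples = [
    [[True, True, False], [True, False, False]],                        # 1 of 2 pairs
    [[True, True, True], [True, True, False], [False, False, False]],   # 2 of 3 pairs
]
print(fraction_well_covered(toy_examples))
```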

C.2 Human Evaluations
For all the human evaluations, we restricted the pool of workers to those who were located in the US or CA and had a 95% approval rate for at least 1,000 previous annotations. Additionally, to further ensure the quality of annotations, we only hired master turkers, i.e., high-performing turkers who have demonstrated excellence across a wide range of tasks and are awarded the Masters Qualification. We also designed our setup to avoid annotator fatigue by asking them to read each paragraph only once and then answer several questions about it. We use a pay rate of approximately $15 per hour, based on our estimate of the time needed to complete the task. We depict our annotation layouts for evaluating precision-oriented and recall-oriented (both w.r.t. reference and factual triples) factuality in Figures 6, 7, and 8. A Likert scale of 1-5 and binary scores (supported/not supported) are used when evaluating recall w.r.t. references and factual triples, respectively. These scores are then normalized and averaged to obtain the final recall-oriented score.
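One way the final recall-oriented score could be computed is sketched below. The exact normalization is not specified in the text, so two assumptions are made here: Likert ratings are mapped linearly from 1-5 onto [0, 1], and the reference-level and triple-level means are weighted equally. The function names are hypothetical.

```python
from statistics import mean
from typing import List


def normalize_likert(rating: int, low: int = 1, high: int = 5) -> float:
    """Map a Likert rating onto [0, 1] (assumed linear rescaling)."""
    return (rating - low) / (high - low)


def recall_oriented_score(reference_ratings: List[int],
                          triple_supported: List[bool]) -> float:
    """Average of the normalized reference score and the triple support rate."""
    reference_part = mean(normalize_likert(r) for r in reference_ratings)
    triple_part = mean(1.0 if s else 0.0 for s in triple_supported)
    return (reference_part + triple_part) / 2


# Two reference sentences rated 5 and 3; one of two triples judged supported.
print(recall_oriented_score([5, 3], [True, False]))
```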