AggGen: Ordering and Aggregating while Generating

We present AggGen (pronounced 'again'), a data-to-text model which re-introduces two explicit sentence planning stages into neural data-to-text systems: input ordering and input aggregation. In contrast to previous work using sentence planning, our model is still end-to-end: AggGen performs sentence planning at the same time as generating text by learning latent alignments (via semantic facts) between the input representation and the target text. Experiments on the WebNLG and E2E challenge data show that by using fact-based alignments our approach is more interpretable, expressive, robust to noise, and easier to control, while retaining the advantages of end-to-end systems in terms of fluency. Our code is available at https://github.com/XinnuoXu/AggGen.


Introduction
Recent neural data-to-text systems generate text "end-to-end" (E2E) by learning an implicit mapping between input representations (e.g. RDF triples) and target texts. While this can lead to increased fluency, E2E methods often produce repetitions, hallucination and/or omission of important content for data-to-text (Dušek et al., 2020) as well as other natural language generation (NLG) tasks (Cao et al., 2018; Rohrbach et al., 2018). Traditional NLG systems, on the other hand, tightly control which content gets generated, as well as its ordering and aggregation. This process is called sentence planning (Reiter and Dale, 2000; Duboue and McKeown, 2001, 2002; Konstas and Lapata, 2013; Gatt and Krahmer, 2018). Figure 1 shows two different ways to arrange and combine the representations in the input, resulting in widely different generated target texts.
In this work, we combine advances of both paradigms into a single system by reintroducing sentence planning into neural architectures. We call our system AGGGEN (pronounced 'again'). AGGGEN jointly learns to generate and plan at the same time. Crucially, our sentence plans are interpretable latent states using semantic facts (obtained via Semantic Role Labelling (SRL)) that align the target text with parts of the input representation. In contrast, the plan used in other neural plan-based approaches is usually limited in terms of its interpretability, control, and expressivity. For example, in Moryossef et al. (2019b) and Zhao et al. (2020), the sentence plan is created independently, incurring error propagation; Wiseman et al. (2018) use latent segmentation that limits interpretability; Shao et al. (2019) sample from a latent variable, not allowing for explicit control; and Shen et al. (2020) aggregate multiple input representations, which limits expressiveness. AGGGEN explicitly models the two planning processes (ordering and aggregation) and can directly influence the resulting plan and generated target text, using a separate inference algorithm based on dynamic programming. Crucially, this enables us to directly evaluate and inspect the model's planning and alignment performance by comparing to manually aligned reference texts.
We demonstrate this for two data-to-text generation tasks: the E2E NLG (Novikova et al., 2017) and the WebNLG Challenge (Gardent et al., 2017a). We work with a triple-based semantic representation where a triple consists of a subject, a predicate and an object. For instance, in the last triple in Figure 1, Apollo 8, operator and NASA are the subject, predicate and object, respectively. Our contributions are as follows:
• We present a novel interpretable architecture for jointly learning to plan and generate, based on modelling ordering and aggregation by aligning facts in the target text to input representations with an HMM and a Transformer encoder-decoder.
• We show that our method generates output with higher factual correctness than vanilla encoder-decoder models without semantic information.
• We also introduce an intrinsic evaluation framework for inspecting sentence planning with a rigorous human evaluation procedure to assess factual correctness in terms of alignment, aggregation and ordering performance.

Related Work
Factual correctness is one of the main issues for data-to-text generation: How to generate text according to the facts specified in the input triples without adding, deleting or replacing information?
Several works aim to improve accuracy and controllability by dividing the end-to-end architecture into sentence planning and surface realisation. Moryossef et al. (2019b,a) use pattern matching to approximate the required planning annotation (entity mentions, their order and sentence splits). Zhao et al. (2020) use a planning stage in a graph-based model: the graph is first reordered into a plan, and the decoder conditions on both the input graph encoder and the linearized plan. Similarly, Fan et al. (2019) use a pipeline approach for story generation via SRL-based sketches. However, all of these pipeline-based approaches either require additional manual annotation or depend on a parser for the intermediate steps.
Other works, in contrast, learn planning and realisation jointly. For example, Su et al. (2018) introduce a hierarchical decoding model generating different parts of speech at different levels, while filling in slots between previously generated tokens. Puduppully et al. (2019) include a jointly trained content selection and ordering module that is applied before the main text generation step. The model is trained by maximizing the log-likelihood of the gold content plan and the gold output text. Li and Rush (2020) utilize posterior regularization in a structured variational framework to induce which input items are being described by each token of the generated text. Wiseman et al. (2018) aim for better semantic control by using a Hidden Semi-Markov Model (HSMM) for splitting target sentences into short phrases corresponding to "templates", which are then concatenated to produce the outputs; however, this trades controllability for fluency. Similarly, Shen et al. (2020) aggregate multiple input representations into a single latent segment, which limits expressiveness. In contrast to these previous works, we achieve input ordering and aggregation, input-output alignment and text generation control via interpretable states, while preserving fluency.

Joint Planning and Generation
We jointly learn to generate and plan by aligning facts in the target text with parts of the input representation. We model this alignment using a Hidden Markov Model (HMM) that follows a hierarchical structure comprising two sets of latent states, corresponding to ordering and aggregation. The model is trained end-to-end and all intermediate steps are learned in a unified framework.

Model Overview
Let x = {x_1, x_2, ..., x_J} be a collection of J input triples and y their natural language description (human-written target text). We first segment y into a sequence of T facts y_{1:T} = y_1, y_2, ..., y_T, where each fact roughly captures "who did what to whom" in one event. We follow the approach of Xu et al. (2020), where facts correspond to predicates and their arguments as identified by SRL (see Appendix B for more details).

[Example: a target sentence segmented into two facts, Fact-1 and Fact-2.]
Each fact y_t consists of a sequence of tokens y_t^1, y_t^2, ..., y_t^{N_t}. Unlike the text itself, the planning information, i.e. input aggregation and ordering, is not directly observable due to the absence of labelled datasets. AGGGEN therefore utilises an HMM probabilistic model which assumes that there is an underlying hidden process that can be modeled by a first-order Markov chain. At each time step, a latent variable (in our case input triples) is responsible for emitting an observed variable (in our case a fact text segment). The HMM specifies a joint distribution on the observations and the latent variables. Here, a latent state z_t emits a fact y_t, representing the group of input triples that is verbalized in y_t. We write the joint likelihood as

p(y_{1:T}, z_{1:T} | x) = ∏_{t=1}^{T} p(z_t | z_{t−1}, x) · p(y_t | z_t, x),

i.e., it is a product of the probabilities of each latent state transition (transition distribution) and the probability of the observations given their respective latent state (emission distribution).
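As a toy illustration of this factorization (the probability tables below are made up, and the states and observations are simple stand-ins for predicate groups and fact segments), the joint likelihood is just a product of transition and emission terms:

```python
# Toy HMM joint likelihood: product of transition and emission probabilities.
# State and observation names are illustrative, not from the trained model.

def joint_likelihood(states, facts, trans, emit):
    """p(y_{1:T}, z_{1:T} | x) = prod_t p(z_t | z_{t-1}, x) * p(y_t | z_t, x)."""
    p = 1.0
    prev = "<s>"  # marked start state
    for z, y in zip(states, facts):
        p *= trans[(prev, z)] * emit[(z, y)]
        prev = z
    return p

trans = {("<s>", "birthPlace"): 0.6, ("birthPlace", "operator"): 0.5}
emit = {("birthPlace", "fact1"): 0.8, ("operator", "fact2"): 0.7}

p = joint_likelihood(["birthPlace", "operator"], ["fact1", "fact2"], trans, emit)
```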

Parameterization
Latent State. A latent state z_t represents the input triples that are verbalized in the observed fact y_t. It is not guaranteed that one fact always verbalizes only one triple (see the bottom example in Figure 1). Thus, we represent state z_t as a sequence of latent variables o_t^1, ..., o_t^{L_t}, where L_t is the number of triples verbalized in y_t. Figure 2 shows the structure of the model: z_t, z_{t−1}, y_t, and y_{t−1} form the basic HMM structure, where z_t and z_{t−1} are latent states and y_t and y_{t−1} are observations. Inside the dashed frames is the corresponding structure for each latent state z_t, which is a sequence of latent variables o_t^1, ..., o_t^{L_t} representing the predicates that emit the observation. For example, at time step t−1, two input triples ('member of' and 'operator') are verbalized in the observed fact y_{t−1}, whose predicates are represented as latent variables o_{t−1}^1 and o_{t−1}^2. T1-4 represent the transitions introduced in Section 3.2.
Let o_t^l ∈ Q = {1, ..., K} be a set of possible latent variables; then K^{L_t} is the size of the search space for z_t. If o_t^l maps to unique triples, the search space becomes intractable for a large value of K. To make the problem tractable, we decrease K by representing triples by their predicate; Q thus stands for the collection of all predicates appearing in the corpus. To reduce the search space for z_t further, we limit L_t < L, where L = 3.

Transition Distribution. The transition distribution between latent variables (T1 in Figure 2) is a K × K matrix of probabilities, where each row sums to 1. We define this matrix via the score matrix

A B ⊙ M(q),

where ⊙ denotes the Hadamard product, and A ∈ R^{K×m} and B ∈ R^{m×K} are matrices of predicate embeddings with dimension m. q = {q_1, q_2, ..., q_J} is the set of predicates of the input triples x, and each q_j ∈ Q is the predicate of the triple x_j. M(q) is a K × K masking matrix, where M_ij = 1 if i ∈ q and j ∈ q, and M_ij = 0 otherwise. We apply a row-wise softmax over the resulting matrix to obtain probabilities.
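A minimal sketch of this masked transition distribution (the embeddings are random stand-ins; the paper describes a Hadamard mask followed by a row-wise softmax, which we realize here by restricting the softmax to the predicates present in the input so that all other predicates receive zero probability):

```python
import math
import random

# Sketch of the masked transition distribution between latent variables.
# A and B are random stand-in predicate embeddings, not learned parameters.

random.seed(0)
K, m = 5, 3                                   # K predicates, embedding dim m
A = [[random.gauss(0, 1) for _ in range(m)] for _ in range(K)]
B = [[random.gauss(0, 1) for _ in range(K)] for _ in range(m)]

def transition_row(i, q):
    """p(next predicate | predicate i), restricted to predicates in q."""
    scores = {j: sum(A[i][d] * B[d][j] for d in range(m)) for j in q}
    z = sum(math.exp(s) for s in scores.values())
    return {j: math.exp(s) / z for j, s in scores.items()}

q = [0, 2, 3]                                 # predicates of the input triples
row = transition_row(0, q)                    # a valid distribution over q
```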
The probability of generating the latent state z_t (T2 in Figure 2) can be written as the joint distribution of the latent variables o_t^1, ..., o_t^{L_t}. Assuming a first-order Markov chain, we get

p(z_t | x) = ∏_{l=1}^{L_t} p(o_t^l | o_t^{l−1}, x),

where o_t^0 is a marked start-state. On top of the generation probability of the latent states p(z_t | x) and p(z_{t−1} | x), we define the transition distribution between two latent states (T3 in Figure 2) as

p(z_t | z_{t−1}, x) = p(o_t^1 | o_{t−1}^{L_{t−1}}, x) · ∏_{l=2}^{L_t} p(o_t^l | o_t^{l−1}, x),

where o_{t−1}^{L_{t−1}} denotes the last latent variable in latent state z_{t−1}, while o_t^1 denotes the first latent variable (other than the start-state) in latent state z_t. We use two sets of parameters {A_in, B_in} and {A_out, B_out} to describe the transition distribution between latent variables within and across latent states, respectively.
Emission Distribution. The emission distribution p(y_t | z_t, x) (T4 in Figure 2) describes the generation of fact y_t conditioned on latent state z_t and input triples x. We define the probability of generating a fact as the product over token-level probabilities,

p(y_t | z_t, x) = ∏_{i=1}^{N_t} p(y_t^i | y_t^{1:(i−1)}, z_t, x).
The first and last token of a fact are marked with fact-start and fact-end tokens. We adopt the Transformer (Vaswani et al., 2017) as the model's encoder and decoder. Each triple is linearized into a list of tokens in the order subject, predicate, object. In order to represent individual triples, we insert a special [SEP] token at the end of each triple. A special [CLS] token is inserted before all input triples, representing the beginning of the entire input. An example where the encoder produces contextual embeddings for the tokens of two input triples is shown in Figure 6 in Appendix E.
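The linearization step can be sketched as follows (a simplified stand-in for the model's tokenization, using triples from Figure 1):

```python
# Sketch of triple linearization with [CLS]/[SEP] markers as described above.
# Whitespace tokenization is a simplification of the real subword tokenizer.

def linearize(triples):
    """Concatenate subject/predicate/object tokens, one [SEP] per triple."""
    tokens = ["[CLS]"]
    for subj, pred, obj in triples:
        tokens += subj.split() + pred.split() + obj.split() + ["[SEP]"]
    return tokens

tokens = linearize([("Apollo 8", "operator", "NASA"),
                    ("William Anders", "birthPlace", "British Hong Kong")])
```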
At time step t, the decoder generates fact y_t token-by-token autoregressively, conditioned on both the contextually-encoded input and the latent state z_t. To guarantee that the generation of y_t conditions only on the input triples whose predicates are in z_t, we mask out the contextual embeddings of tokens from other, unrelated triples in the encoder-decoder attention of all Transformer layers.
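A simplified sketch of the masking idea (per-triple visibility flags rather than the real per-token, per-layer Transformer attention mask):

```python
# Sketch: when generating fact y_t, only triples whose predicate is in the
# latent state z_t stay visible to the decoder. This returns one flag per
# triple; the real model expands it to a token-level attention mask.

def triple_mask(triples, z_t):
    """True for triples whose predicate belongs to the latent state z_t."""
    return [pred in z_t for _, pred, _ in triples]

mask = triple_mask([("Apollo 8", "operator", "NASA"),
                    ("Apollo 8", "crewMembers", "Frank Borman"),
                    ("William Anders", "birthPlace", "British Hong Kong")],
                   z_t={"operator", "crewMembers"})
```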
Autoregressive Decoding. The Autoregressive Hidden Markov Model (AR-HMM) introduces extra links into the HMM to capture long-term correlations between observed variables, i.e., output tokens. Following Wiseman et al. (2018), we use an AR-HMM for decoding, thus allowing interdependence between tokens to generate more fluent and natural text descriptions. Each token distribution depends on all previously generated tokens, i.e., we define the token-level probabilities as

p(y_t^i | y_{1:(t−1)}, y_t^{1:(i−1)}, z_t, x).

During training, at each time step t, we teacher-force the generation of the fact y_t by feeding the ground-truth history, y_{1:(t−1)}, to the word-level Transformer decoder. However, since only y_t depends on the current hidden state z_t, we only calculate the loss over y_t.

Learning
We apply the backward algorithm (Rabiner, 1989) to learn the parameters introduced in Section 3.2, where we maximize p(y | x), i.e., the marginal likelihood of the observed facts y given input triples x, over all the latent states z and o on the entire dataset using dynamic programming. Following Murphy (2012), and given that the latent state at time t is C, we define the conditional likelihood of future evidence as

β_t(C) = p(y_{(t+1):T} | z_t = C, x),

where C denotes a group of predicates that are associated with the emission of y. The size of C ranges from 1 to L, and each component is from the collection of predicates Q (see Section 3.2). The backward recurrences are then

β_{t−1}(C) = Σ_{C'} p(z_t = C' | z_{t−1} = C, x) · p(y_t | z_t = C', x) · β_t(C'),

with the base case β_T(C) = 1. The marginal probability of y over latent z is then obtained as

p(y | x) = Σ_C p(z_1 = C | x) · p(y_1 | z_1 = C, x) · β_1(C).

The size of the search space for C is Σ_{α=1}^{L} K^α, where K = |Q|, i.e., the number of unique predicates appearing in the dataset. The problem can still be intractable due to high K, despite the simplifications explained in Section 3.2 (cf. predicates). To tackle this issue and reduce the search space of C, we: (1) only explore permutations of C that include predicates appearing in the input; (2) introduce a heuristic based on the overlap of tokens between a triple and a fact: if a certain fact mentions most tokens appearing in the predicate and object of a triple, we hard-align it to this triple. As a result, we discard the permutations that do not include the aligned predicates.
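The backward recurrence can be illustrated on a toy HMM (made-up probability tables; a state here is a single label rather than a group of predicates). As a sanity check, the recursion matches brute-force enumeration over all state sequences:

```python
from itertools import product

# Toy backward algorithm. States/observations and all probabilities are
# illustrative stand-ins for predicate groups and fact segments.

states = ["s1", "s2"]
trans = {("<s>", "s1"): 0.6, ("<s>", "s2"): 0.4,
         ("s1", "s1"): 0.3, ("s1", "s2"): 0.7,
         ("s2", "s1"): 0.5, ("s2", "s2"): 0.5}
emit = {("s1", "a"): 0.9, ("s1", "b"): 0.1,
        ("s2", "a"): 0.2, ("s2", "b"): 0.8}

def backward_marginal(obs):
    """p(y | x) via the backward recurrence beta_t(C)."""
    beta = {c: 1.0 for c in states}                  # base case beta_T(C) = 1
    for t in range(len(obs) - 1, 0, -1):
        beta = {c: sum(trans[(c, c2)] * emit[(c2, obs[t])] * beta[c2]
                       for c2 in states) for c in states}
    return sum(trans[("<s>", c)] * emit[(c, obs[0])] * beta[c] for c in states)

def brute_force(obs):
    """Sum the joint probability over every possible state sequence."""
    total = 0.0
    for zs in product(states, repeat=len(obs)):
        p, prev = 1.0, "<s>"
        for z, y in zip(zs, obs):
            p *= trans[(prev, z)] * emit[(z, y)]
            prev = z
        total += p
    return total
```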

Inference
After the joint learning process, the model is able to plan, i.e., order and aggregate the input triples in the most likely way, and then generate a text description following the planning results. The joint prediction of (ŷ, ẑ) is therefore defined as

(ŷ, ẑ) = argmax_{y, z^(i)} p(y | z^(i), x) · p(z^(i) | x),

where {z^(i)} denotes a set of planning results, ŷ is the text description, and ẑ is the planning result that ŷ is generated from. The entire inference process (see Figure 3) includes three steps: input ordering, input aggregation, and text generation. The first two steps are responsible for the generation of {z^(i)} together with their probabilities {p(z^(i) | x)}, while the last step is for the text generation p(y | z^(i), x).
Planning: Input Ordering. The aim is to find the top-k most likely orderings of the predicates appearing in the input triples. To make the search process more efficient, we apply left-to-right beam search based on the transition distribution introduced in Section 3.2. Specifically, we use the transition distribution between latent variables within latent states, calculated with predicate embeddings A_in and B_in. To guarantee that the generated sequence does not suffer from omission or duplication of predicates, we constantly update the masking matrix M(q) by removing generated predicates from the set q. The planning process stops when q is empty.

Planning: Input Aggregation. The goal is to find the top-n most likely aggregations for each result of the Input Ordering step. To implement this process efficiently, we introduce a binary state for each predicate in the sequence: 0 indicates "wait" and 1 indicates "emit" (green squares in Figure 3). We then list all possible combinations of the binary states for the Input Ordering result. For each combination, the aggregation algorithm proceeds left-to-right over the predicates and groups those labelled as "emit" with all immediately preceding predicates labelled as "wait". In turn, we rank all the combinations with the transition distribution. In contrast to the Input Ordering step, we use the transition distribution between latent variables across latent states, calculated with predicate embeddings A_out and B_out. That is, we do not take into account transitions between two consecutive predicates if they belong to the same group; we only consider consecutive predicates across two connected groups, i.e., the last predicate of the previous group with the first predicate of the following group.

Text Generation. The final step generates a text description conditioned on the input triples and the planning result (obtained from the Input Aggregation step).
We use beam search and the planning-conditioned generation process described in Section 3.2 ("Emission Distribution").
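The wait/emit enumeration in the Input Aggregation step above can be sketched as follows (a simplified stand-in: it enumerates the possible groupings rather than scoring them with the learned transition distribution, and it assumes the last predicate must be an "emit" so that no group is left open):

```python
from itertools import product

# Sketch of Input Aggregation: every predicate gets a binary state
# (0 = "wait", 1 = "emit"); each "emit" closes a group containing itself
# and the immediately preceding "wait" predicates.

def aggregations(ordered_preds, max_group=3):
    """Enumerate all groupings of an ordered predicate sequence."""
    results = []
    for flags in product([0, 1], repeat=len(ordered_preds)):
        if flags[-1] != 1:          # trailing "wait" would leave a group open
            continue
        groups, current = [], []
        for pred, flag in zip(ordered_preds, flags):
            current.append(pred)
            if flag == 1:
                groups.append(current)
                current = []
        if all(len(g) <= max_group for g in groups):   # L_t < L constraint
            results.append(groups)
    return results

plans = aggregations(["eatType", "near", "customerRating"])
```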

Controllability over sentence plans
While the jointly learnt model is capable of fully automatic generation including the planning step (see Section 3.4), the discrete latent space allows direct access to manually control the planning component, which is useful in settings that require increased human supervision and is a unique feature of our architecture. The plans (latent variables) can be controlled in two ways: (1) via a hyperparameter: our code offers a hyperparameter that controls the level of aggregation (no aggregation, aggregating one triple, two triples, etc.); the model then predicts the most likely plan based on the input triples and the hyperparameter and generates a corresponding text description; (2) via human-written plans: the model can directly adopt a plan such as [eatType][near customer-rating], which translates to: first generate 'eatType' as an independent fact, then aggregate the predicates 'near' and 'customer-rating' in the following fact and generate their joint description.
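Parsing the bracketed plan notation can be sketched as follows (a hypothetical helper for illustration, not the released code):

```python
import re

# Sketch: parse a human-written plan such as "[eatType][near customer-rating]".
# Each bracketed group is one fact; predicates inside a group are aggregated.

def parse_plan(plan):
    """Return a list of predicate groups, one group per fact."""
    return [group.split() for group in re.findall(r"\[([^\]]+)\]", plan)]

plan = parse_plan("[eatType][near customer-rating]")
```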

Datasets
We tested our approach on two widely used data-to-text tasks: the E2E NLG (Novikova et al., 2017) and WebNLG (Gardent et al., 2017a). Compared to E2E, WebNLG is smaller, but contains more predicates and has a larger vocabulary. Statistics with examples can be found in Appendix C. We followed the original training-development-test data split for both datasets.

Evaluation Metrics
Generation Evaluation focuses on evaluating the generated text with respect to its similarity to human-authored reference sentences. To compare to previous work, we adopt their associated metrics to evaluate each task. The E2E task is evaluated using BLEU (Papineni et al., 2002), NIST (Doddington, 2002), ROUGE-L (Lin, 2004), METEOR (Lavie and Agarwal, 2007), and CIDEr (Vedantam et al., 2015). WebNLG is evaluated in terms of BLEU, METEOR, and TER (Snover et al., 2006).

Factual Correctness Evaluation tests if the generated text corresponds to the input triples (Wen et al., 2015b; Reed et al., 2018; Dušek et al., 2020). We evaluated on the E2E test set using automatic slot error rate (SER), i.e., an estimation of the occurrence of the input attributes (predicates) and their values in the outputs, implemented by Dušek et al. (2020). SER counts predicates that were added, missed or replaced with a wrong object. SER is based on regular expression matching; since only the format of the E2E data allows such patterns for evaluation, we only evaluate factual correctness on the E2E task. Note that since we propose exploring sentence planning and increasing the controllability of the generation model and do not aim for a zero-shot setup, we only focus on the seen category in WebNLG.

Intrinsic Planning Evaluation examines planning performance in Section 6.
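A simplified sketch of the SER computation (the real metric of Dušek et al. (2020) matches E2E surface forms with regular expressions; here we assume the slots have already been extracted from the output text, purely to illustrate the Add/Miss/Wrong counts):

```python
# Simplified slot error rate: (added + missed + wrong) / number of gold slots.
# The slot extraction itself is assumed to have happened upstream.

def slot_error_rate(gold_slots, output_slots):
    """gold_slots / output_slots: dicts mapping predicate -> value."""
    added = sum(1 for p in output_slots if p not in gold_slots)
    missed = sum(1 for p in gold_slots if p not in output_slots)
    wrong = sum(1 for p, v in gold_slots.items()
                if p in output_slots and output_slots[p] != v)
    return (added + missed + wrong) / len(gold_slots)

# One added slot ("near") and one missed slot ("area") over two gold slots.
ser = slot_error_rate({"food": "Chinese", "area": "city centre"},
                      {"food": "Chinese", "near": "All Bar One"})
```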

Baseline model and Training Details
To evaluate the contributions of the planning component, we choose the vanilla Transformer model as our baseline. To make our HMM-based approach converge faster, we initialized its encoder and decoder with the baseline model parameters and fine-tuned them during training of the transition distributions. Encoder and decoder parameters were chosen based on validation results of the baseline model for each task (see Appendix D for details).

Table 1 shows the generation results on the WebNLG seen category (Gardent et al., 2017b). Our model outperforms TILB-PIPE and Transformer, but performs worse than T5, PlanEnc and ADAPT. However, unlike these three models, our approach does not rely on large-scale pretraining, extra annotation, or heavy pre-processing using external resources. On the original E2E test set (Table 2), our model performs worse than both seq2seq models in terms of word-overlap metrics.

[Table 3: Evaluation of generation trained on the original E2E data, while tested on the cleaned E2E data. Note that the clean test set has more diverse MRs and fewer references per MR, which leads to lower scores; see also the paper introducing the cleaned E2E data.]

Generation Evaluation Results
However, the results in Table 3 demonstrate that our model does outperform the baselines on most surface metrics if trained on the noisy original E2E training set and tested on clean E2E data (Dušek et al., 2019). This suggests that the previous performance drop was due to text references in the original dataset that did not verbalize all triples or added information not present in the triples, which may have penalized the factually correct generations. This also shows that AGGGEN produces correct outputs even when trained on a noisy dataset. Since constructing high-quality data-to-text training sets is expensive and labor-intensive, this robustness to noise is important.

Factual Correctness Results
The results for factual correctness evaluated using SER on the original E2E test set are shown in Table 2. The SER of AGGGEN is the best among all models. In particular, the high "Miss" scores for TGen and Transformer demonstrate the high chance of information omission in vanilla seq2seq-based generators. In contrast, AGGGEN shows much better coverage over the input triples while keeping a low level of hallucination (low "Add" and "Wrong" scores). We also trained and tested models on the cleaned E2E data; the full results (including the factual correctness evaluation) are shown in Table 8 in Appendix F and follow a similar trend to the results in Table 3 when compared to Transformer.

Ablation variants
To explore the effect of input planning on text generation, we introduced two model variants: AGGGEN −OD , where we replaced the Input Ordering with randomly shuffling the input triples before input aggregation, and AGGGEN −AG , where the Input Ordering result was passed directly to the text generation and the text decoder generated a fact for each input triple individually.
The generation evaluation results on both datasets (Table 1 and Table 2) show that AGGGEN outperforms AGGGEN −OD and AGGGEN −AG substantially, which means both Input Ordering and Input Aggregation are critical. Table 2 shows that the factual correctness results for the ablative variants are much worse than full AGGGEN, indicating that planning is essential for factual correctness. An exception is the lower number of missed slots in AGGGEN −AG . This is expected since AGGGEN −AG generates a textual fact for each triple individually, which decreases the possibility of omissions at the cost of much lower fluency. This strategy also leads to a steep increase in added information.
Additionally, AGGGEN −AG performs even worse on the E2E dataset than on the WebNLG set. This result is also expected, since input aggregation is more pronounced in the E2E dataset with a higher number of facts and input triples per sentence (cf. Appendix C).

Qualitative Error Analysis
We manually examined a sample of 100 outputs (50 from each dataset) with respect to their factual correctness and fluency. For factual correctness, we follow the definition of SER and check whether there are hallucinations, substitutions or omissions in generated texts. For fluency, we check whether the generated texts suffer from grammar mistakes, redundancy, or contain unfinished sentences. Figure 4 shows example inputs, plans and outputs (more examples in Table 6 and Table 7 in Appendix A).

[Figure 4: Example inputs with plans and outputs generated by the Transformer baseline (Trans) and AGGGEN, for a WebNLG input about William Anders and Apollo 8 and an E2E input about 'The Cricketers' restaurant.]

We observe that, in general, the seq2seq Transformer model tends to compress more triples into one fluent fact, whereas AGGGEN aggregates triples in more but smaller groups, and generates a shorter/simpler fact for each group. Therefore, the texts generated by Transformer are more compressed, while AGGGEN's generations are longer with more sentences. However, the planning ensures that all input triples will still be mentioned. Thus, AGGGEN generates texts with higher factual correctness without trading off fluency.

Intrinsic Evaluation of Planning
We now directly inspect the performance of the planning component by taking advantage of the readability of SRL-aligned facts. In particular, we investigate: (1) Sentence planning performance: we study the agreement between the model's planning and reference planning for the same set of input triples; (2) Alignment performance: we use AGGGEN as an aligner and examine its ability to align segmented facts to the corresponding input triples. Since both studies require ground-truth triple-to-fact alignments, which are not part of the WebNLG and E2E data, we first introduce a human annotation process in Section 6.1.

Human-annotated Alignments
We asked crowd workers on Amazon Mechanical Turk to align input triples to their fact-based text snippets to derive a "reference plan" for each target text (human annotation is required here since anchor-based automatic alignments are not accurate enough, at 86%; see Table 5, "RB", for details). Each worker was given a set of input triples and a corresponding reference text description, segmented into a sequence of facts. The workers were then asked to select the triples that are verbalised in each fact (the annotation guidelines and an example annotation task are shown in Figure 7 in Appendix G). We sampled 100 inputs from the WebNLG test set for annotation; we chose WebNLG over E2E for its domain and predicate diversity. Each input was paired with three reference target texts from WebNLG. To guarantee the correctness of the annotation, three different workers annotated each input-reference pair. We only consider the alignments where all three annotators agree. Using Fleiss' Kappa (Fleiss, 1971) over the facts aligned by each judge to each triple, we obtained an average agreement of 0.767 for the 300 input-reference pairs, which is considered high agreement.

Study of Sentence Planning
We now check the agreement between the model-generated and reference plans based on the top-1 Input Aggregation result (see Section 3.4). We introduce two metrics: • Normalized Mutual Information (NMI) (Strehl and Ghosh, 2002) to evaluate aggregation. We represent each plan as a set of clusters of triples, where a cluster contains the triples sharing the same fact verbalization. Using NMI we measure the mutual information between two clusterings, normalized into the 0-1 range, where 0 and 1 denote no mutual information and perfect correlation, respectively.
• Kendall's tau (τ ) (Kendall, 1945) is a ranking based measure which we use to evaluate both ordering and aggregation. We represent each plan as a ranking of the input triples, where the rank of each triple is the position of its associated fact verbalization in the target text. τ measures rank correlation, ranging from -1 (strong disagreement) to 1 (strong agreement).
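Both metrics operate on plans viewed as structures over the same input triples. A minimal NMI sketch over plans-as-clusterings (arithmetic-mean normalization and illustrative cluster contents; Kendall's τ is available in standard statistics packages and is omitted here):

```python
import math

# Sketch of NMI between two plans, each represented as a clustering of the
# input triples (a cluster = triples verbalized in the same fact).

def entropy(clusters, n):
    """Entropy of a clustering over n items."""
    return -sum(len(c) / n * math.log(len(c) / n) for c in clusters)

def nmi(u, v):
    """Normalized mutual information between two clusterings (lists of sets)."""
    n = sum(len(c) for c in u)
    mi = 0.0
    for cu in u:
        for cv in v:
            k = len(cu & cv)
            if k:
                mi += k / n * math.log(k * n / (len(cu) * len(cv)))
    return 2 * mi / (entropy(u, n) + entropy(v, n))

same = nmi([{"a", "b"}, {"c"}], [{"a", "b"}, {"c"}])   # identical plans
diff = nmi([{"a", "b"}, {"c"}], [{"a", "c"}, {"b"}])   # different grouping
```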
In the crowdsourced annotation (Section 6.1), each set of input triples contains three reference texts with annotated plans. We first evaluate the correspondence among these three reference plans by calculating NMI and τ between one plan and the remaining two. In the top row of Table 4, the high average and maximum NMI indicate that the reference texts' authors tend to aggregate input triples in similar ways. On the other hand, the low average τ shows that they are likely to order the aggregated groups differently. Then, for each set of input triples, we measure NMI and τ of the top-1 Input Aggregation result (the model's plan) against each of the corresponding reference plans and compute average and maximum values (bottom row in Table 4). Compared to the strong agreement among reference plans on input aggregation, the agreement between the model's and reference plans is slightly weaker. Our model has slightly lower agreement on aggregation (NMI), but if we consider aggregation and ordering jointly (τ), the agreement between our model's plans and the reference plans is comparable to the agreement among reference plans.

Study of Alignment
In this study, we use the HMM model as an aligner and assess its ability to align input triples with their fact verbalizations on the human-annotated set. Given the sequence of observed variables, a trained HMM-based model is able to find the most likely sequence of hidden states

z* = argmax_z p(z_{1:T} | y_{1:T}, x)

using Viterbi decoding. Similarly, given a set of input triples and a factoid-segmented text, we use Viterbi with our model to align each fact with the corresponding input triple(s). We then evaluate the accuracy of the model-produced alignments against the crowdsourced alignments. The alignment evaluation results are shown in Table 5. We compare the Viterbi (Vtb) alignments with those calculated by a rule-based aligner (RB) that aligns each triple to the fact with the greatest word overlap. The precision of the Viterbi aligner is higher than that of the rule-based aligner. However, the Viterbi aligner tends to miss triples, which leads to lower recall. Since Viterbi decoding of the HMM is only locally optimal in this respect, the model cannot guarantee that each input triple is aligned once and only once.

Conclusion and Future Work
We show that explicit sentence planning, i.e., input ordering and aggregation, helps substantially to produce output which is both semantically correct and natural-sounding. Crucially, this also enables us to directly evaluate and inspect both the model's planning and alignment performance by comparing to manually aligned reference texts. Our system outperforms vanilla seq2seq models when considering semantic accuracy and word-overlap based metrics. Experimental results also show that AGGGEN is robust to noisy training data. We plan to extend this work in three directions:

Other Generation Models. We plan to plug other text generators, e.g. pre-training based approaches (Lewis et al., 2020; Kale and Rastogi, 2020), into AGGGEN to enhance their interpretability and controllability via sentence planning and generation.

Zero/Few-shot scenarios. Kale and Rastogi (2020)'s work on low-resource NLG uses a pretrained language model with a schema-guided representation and hand-written templates to guide the representation in unseen domains and slots. These techniques can be plugged into AGGGEN, which allows us to examine the effectiveness of explicit sentence planning in zero/few-shot scenarios.

Including Content Selection. In this work, we concentrate on the problem of faithful surface realization based on E2E and WebNLG data, which both operate under the assumption that all input predicates have to be realized in the output. In contrast, more challenging tasks such as RotoWire (Wiseman et al., 2017) include content selection before sentence planning. In the future, we plan to include a content selection step to further extend AGGGEN's usability.

A Examples of input and system-generated target text

Predicted Text: the cricketers is a chinese restaurant that is children friendly, has a high price range, a customer rating of 3 out of 5, is located near the portland arms and is in the city centre. it has an average customer rating. is also a children-friendly .
you can find it is called the cricketers.

B Factoid Sentence Segmentation
In order to align meaningful parts of the human-written target text with semantic triples, we first segment the target sentences into sequences of facts using SRL, following Xu et al. (2020). The aim is to break down sentences into sub-sentences (facts) that each verbalize as few input triples as possible; the original sentence can still be fully recovered by concatenating all its sub-sentences. Each fact is represented by a segment of the original text that roughly captures "who did what to whom" in one event. We first parse the sentences into SRL propositions using the implementation of He et al. (2018). We consider each predicate-argument structure as a separate fact, where the predicate stands for the event and its arguments are mapped to actors, recipients, time, place, etc. (see Figure 5). The sentence segmentation consists of two consecutive steps: (1) Tree Construction, where we construct a hierarchical tree structure for all the facts of one sentence by choosing the fact with the largest coverage as the root and recursively building sub-trees by replacing arguments with their corresponding sub-facts (ARG1 in FACT1 is replaced by FACT2).
(2) Argument Grouping, where each predicate (FACT in tree) with its leaf-arguments corresponds to a sub-sentence. For example, in Figure 5, leaf-argument "was" and "a crew member on Apollo 8" of FACT1 are grouped as one sub-sentence.
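The two steps above can be sketched as follows. This is a minimal illustration under our own simplifying assumptions, not the released implementation: SRL output is assumed to be already available as token spans, and all names (`Fact`, `build_tree`, `group_arguments`) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Fact:
    pred_span: tuple                # (start, end) of the predicate tokens
    span: tuple                     # (start, end) coverage of the whole fact
    args: dict                      # role -> (start, end) argument spans
    children: list = field(default_factory=list)

def build_tree(facts):
    """Step 1 (Tree Construction): the fact with the largest coverage becomes
    the root; any fact contained in one of its argument spans becomes a
    sub-fact replacing that argument."""
    facts = sorted(facts, key=lambda f: f.span[1] - f.span[0], reverse=True)
    root = facts[0]
    for child in facts[1:]:
        for role, span in list(root.args.items()):
            if span[0] <= child.span[0] and child.span[1] <= span[1]:
                root.children.append(child)
                del root.args[role]     # argument is replaced by the sub-fact
                break
    return root

def group_arguments(tokens, fact):
    """Step 2 (Argument Grouping): each fact plus its remaining leaf
    arguments yields one sub-sentence; recurse into nested facts."""
    spans = sorted([fact.pred_span] + list(fact.args.values()))
    sub = " ".join(" ".join(tokens[s:e]) for s, e in spans)
    return [sub] + [s for c in fact.children
                    for s in group_arguments(tokens, c)]

# Toy run on the Figure 5 sentence (spans chosen by hand for illustration).
tokens = ("William Anders , who retired on September 1st , 1969 , "
          "was a crew member on Apollo 8").split()
fact1 = Fact(pred_span=(11, 12), span=(0, 18),
             args={"ARG1": (0, 11), "ARG2": (12, 18)})
fact2 = Fact(pred_span=(4, 5), span=(3, 10),
             args={"ARG0": (3, 4), "ARGM-TMP": (5, 10)})
root = build_tree([fact1, fact2])
print(group_arguments(tokens, root))
# -> ['was a crew member on Apollo 8', 'who retired on September 1st , 1969']
```

As in Figure 5, FACT2 replaces ARG1 of FACT1, so the first sub-sentence is built from FACT1's predicate and its remaining leaf argument.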

Figure 5: Semantic Role Labeling based tree meaning representation and factoid sentence segmentation for the text "William Anders, who retired on September 1st, 1969, was a crew member on Apollo 8."

C Datasets
WebNLG. The corpus contains 21K instances (input-text pairs) from 9 different domains (e.g., astronauts, sports teams). The number of input triples ranges from 1 to 7, with an average of 2.9. The average number of facts per text is 2.4 (see Appendix B). The corpus contains 272 distinct predicates. The vocabulary size for the input and output side is 2.6K and 5K, respectively.

E2E NLG. The corpus contains 50K instances from the restaurant domain. We automatically convert the original attribute-value pairs to triples: for each instance, we take the restaurant name as the subject and combine it with the remaining attribute-value pairs as corresponding predicates and objects. The number of triples in each input ranges from 1 to 7, with an average of 4.4. The average number of facts per text is 2.6. The corpus contains 9 distinct predicates. The vocabulary size for the input and output side is 120 and 2.4K, respectively. We also tested our approach on an updated, cleaned release (Dušek et al., 2019).
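The E2E conversion described above can be sketched as a few lines of parsing. This is our own illustrative version (the function name `mr_to_triples` is hypothetical); it assumes the standard E2E MR format `attribute[value], attribute[value], ...` with the `name` attribute supplying the subject.

```python
def mr_to_triples(mr: str):
    """Convert an E2E meaning representation such as
    'name[The Cricketers], food[Chinese], area[city centre]'
    into (subject, predicate, object) triples."""
    pairs = {}
    for chunk in mr.split("], "):
        attr, _, value = chunk.partition("[")
        pairs[attr.strip()] = value.rstrip("]")
    subject = pairs.pop("name")   # restaurant name becomes the subject
    return [(subject, attr, value) for attr, value in pairs.items()]

print(mr_to_triples("name[The Cricketers], food[Chinese], area[city centre]"))
# -> [('The Cricketers', 'food', 'Chinese'),
#     ('The Cricketers', 'area', 'city centre')]
```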

D Hyperparameters
WebNLG. Both the encoder and decoder are 2-layer, 4-head Transformers with a hidden dimension of 256. The sizes of the token embeddings and predicate embeddings are 256 and 128, respectively. The Adam optimizer (Kingma and Ba, 2015) is used to update parameters. For both the baseline model and the pre-training of the HMM-based model, the learning rate is 0.1. During the training of the HMM-based model, the learning rates for the encoder-decoder fine-tuning and for the training of the transition distributions are set to 0.002 and 0.01, respectively.
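For reference, the WebNLG settings above can be collected into a single configuration dict (our own grouping; the key names are illustrative, not taken from the released code):

```python
# WebNLG hyperparameters as listed in Appendix D (key names are our own).
WEBNLG_CONFIG = {
    "encoder_layers": 2,
    "decoder_layers": 2,
    "attention_heads": 4,
    "hidden_dim": 256,
    "token_emb_dim": 256,
    "predicate_emb_dim": 128,
    "optimizer": "Adam",
    "lr_baseline_and_pretrain": 0.1,
    "lr_encdec_finetune": 0.002,
    "lr_transition": 0.01,
}
```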
E2E. Both the encoder and decoder are Transformers with a hidden dimension of 128. The sizes of the token embeddings and predicate embeddings are 128 and 32, respectively. The remaining hyperparameters are the same as for WebNLG.

Figure 6: The Transformer encoder takes linearized triples and produces contextual embeddings. We assume that, at time step t, the Transformer decoder is generating fact y_t conditioned on z_t. The number of latent variables L_t is 1; in other words, z_t = o_t1. If the value of o_t1 is the predicate of the first triple (solid borders), then the second triple (dashed borders) is masked out from the encoder-decoder attention during decoding.
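The masking behavior in Figure 6 can be sketched as follows. This is a simplified stand-in for the real implementation (the function name `attention_mask` and the linearization scheme are our own assumptions): at step t, only the tokens of triples whose predicate appears in z_t remain attendable.

```python
def attention_mask(triples, z_t):
    """triples: (subject, predicate, object) string tuples, linearized by
    whitespace tokenization; z_t: set of predicates selected by the latent
    state. Returns one boolean per encoder token (True = attendable)."""
    mask = []
    for subj, pred, obj in triples:
        visible = pred in z_t
        # every token of a linearized triple shares that triple's visibility
        n_tokens = len(subj.split()) + 1 + len(obj.split())
        mask.extend([visible] * n_tokens)
    return mask

triples = [("The Cricketers", "food", "Chinese"),     # solid borders
           ("The Cricketers", "area", "city centre")] # dashed borders
print(attention_mask(triples, z_t={"food"}))
# -> [True, True, True, True, False, False, False, False, False]
```

With z_t = {"food"}, the four tokens of the first triple stay visible while the five tokens of the second triple are masked out, mirroring the solid/dashed distinction in Figure 6.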