Explicit Syntactic Guidance for Neural Text Generation

Most existing text generation models follow the sequence-to-sequence paradigm. Generative Grammar suggests that humans generate natural language texts by learning language grammar. We propose a syntax-guided generation schema, which generates the sequence guided by a constituency parse tree in a top-down direction. The decoding process can be decomposed into two parts: (1) predicting the infilling texts for each constituent in the lexicalized syntax context given the source sentence; (2) mapping and expanding each constituent to construct the next-level syntax context. Accordingly, we propose a structural beam search method to find possible syntax structures hierarchically. Experiments on paraphrase generation and machine translation show that the proposed method outperforms autoregressive baselines, while also demonstrating effectiveness in terms of interpretability, controllability, and diversity.


Introduction
Natural language generation (NLG) tasks, such as paraphrase generation (Sun et al., 2021), text summarization (Lin et al., 2018), machine translation (Vaswani et al., 2017; Edunov et al., 2018), and language modeling (Brown et al., 2020; OpenAI, 2023), have shown remarkable progress in the past few years. Most of the highest-performing NLG models are trained on source-target correspondences and conduct autoregressive inference, which achieves competitive empirical performance yet deviates from a range of desirable attributes of human language generation, e.g., through a lack of interpretability (Alvarez-Melis and Jaakkola, 2017; He et al., 2019; Li and Yao, 2021).
Figure 1: Syntax-guided generation: searching the hypotheses hierarchically throughout the syntax tree in a top-down direction, starting from the root node "<T>". The green blocks denote the possible syntax structures at different tree depths, the blue one denotes the external modification, whereas the gray ones denote the finalized hypotheses, marking the end of search paths.

It has been shown that humans generate language by learning and manipulating language grammar (Zholkovskii and Mel'chuk, 1965, 1974), which generative grammar (Chomsky, 1965) considers as a finite rule set that combines words to form grammatical sentences, thereby avoiding enumeration of surface sequences, which would significantly increase data sparsity and reduce learning efficiency. In this process, syntax plays a crucial role, imposing constraints on how sentences are constructed. Syntax knowledge has been found to be implicitly contained in deep neural models (Kovaleva et al., 2019; Clark et al., 2019) and also useful for NLG tasks (Yang et al., 2020a; Sun et al., 2021; Xie et al., 2021). However, relatively little recent work has considered explicit syntax in NLG (Wang et al., 2018). Inspired by the above psycholinguistic observation, we propose a syntax-guided generation scheme, which generates text by following a well-defined grammar. As shown in Figure 1, instead of sequential generation, the model generates the sentence in a hierarchically top-down manner guided by the constituency parse tree, starting with the root node <T>. Syntactic categories such as noun phrases <NP> and verb phrases <VP> are integrated with tokens in the generation process, and the model simultaneously considers multiple syntax structures at each tree depth, hierarchically exploring the syntax tree for reasonable hypotheses.
Intuitively, such a generation paradigm has the following advantages compared with autoregressive generation. First, akin to the language learning process of human beings, grammar learning breaks down non-enumerable surface sequences into finite pieces, acting as a training curriculum. Second, it provides an effective and interpretable pathway to probe into the generation process. Consequently, generation errors can be traced back to specific constituent expansions at the respective tree depth. Third, one can manipulate the generation process by exerting versatile control at arbitrary depths, e.g., modifying the translation of a verb phrase or constraining the paraphrase style with syntax templates. Fourth, diverse sequences can be generated by exploring various syntax structures hierarchically throughout the syntax tree.
We implement the above process on Transformer (Vaswani et al., 2017). As shown in Figure 1, the generation process proceeds under the guidance of syntactic grammar. Starting from the root node "<T>", the model recursively generates the infilling texts (e.g., "he" and "seems <S>") for each constituent in the current lexicalized syntax context (e.g., "<NP> <VP> ."), and infills each one accordingly to construct the next-level lexicalized syntax context (e.g., "he seems <S> ."). The generation proceeds until there is no remaining constituent. The infilling texts are predicted by a Transformer-based model, which is trained by maximizing the likelihood of the infilling texts for each constituent in the syntax context based on the source input. To explore more syntactically diverse and reasonable hypotheses during inference, we propose structural beam search, which searches promising syntax structures over the entire syntax tree in a top-down manner, as shown in Figure 1.
To isolate the effect of syntax and avoid the influence of other transformation factors, we conduct experiments on two sequence-to-sequence (seq2seq) tasks with semantic equivalence between the source and target sequences: paraphrase generation and machine translation. Empirical results demonstrate that our method can generate sequences of higher quality than the seq2seq baselines. Quantitative analysis demonstrates that the generation process can be interpreted effectively. In addition, our method demonstrates the capability of executing control from both syntax templates and fine-grained manual modifications. Finally, we show the diversity advantage through both automatic evaluation and human evaluation. We release the code at https://github.com/yafuly/SyntacticGen.

Related Work
Syntax as Extra Input. A line of work incorporates syntax knowledge as extra input to boost task performance. In paraphrase generation, Iyyer et al. (2018), Kumar et al. (2020), and Sun et al. (2021) additionally encode a constituency tree to produce controllable paraphrases. For machine translation, researchers utilize syntactic information to boost neural machine translation systems using syntactic encoders (Eriguchi et al., 2019; Ma et al., 2020; Yang et al., 2020a), position encoding (Ma et al., 2019; Xie et al., 2021), attention mechanisms (Chen et al., 2018), and auxiliary training objectives (Ma et al., 2019).
Syntax for Generation Guidance. Different from the above work, we focus on guiding generation explicitly following syntactic grammar. Notably, Aharoni and Goldberg (2017) and Le et al. (2017) learn the mapping from sequences to linearized constituency trees to improve machine translation. Eriguchi et al. (2017) propose a hybrid decoder with RNNG (Dyer et al., 2016) to jointly learn parse actions and word predictions. Wu et al. (2017) and Wang et al. (2018) design a syntactic tree decoder based on LSTM (Hochreiter and Schmidhuber, 1997), with an extra rule decoder. Yang et al. (2020b) introduce a syntax-guided soft target template as extra prompts in Transformer. Different from their work, our method leverages the strengths of Transformer and breaks down the sequence-to-sequence generation process into a hierarchically top-down generation guided by the syntax tree.

Baseline Transformer
Transformer models the correspondence between the source sequence x = {x_1, ..., x_{|x|}} and the target sequence y = {y_1, ..., y_{|y|}} in an end-to-end fashion. The Transformer encoder transforms the discrete source sequence x into a continuous representation, which the Transformer decoder utilizes to generate the target sequence. The conditional probability p(y|x) can be factorized in an autoregressive way:

p_θ(y|x) = ∏_{t=1}^{|y|} p_θ(y_t | y_{<t}, x),    (1)

where θ denotes the model parameters.

Figure 2 (caption fragment): "... based on the constituency parse tree; the right part denotes the architecture of the neural decoder, which takes in the German source sentence x and the syntax context s_2 as input, and predicts the infilling text f_2."
Given a source-target training set D = {(x_i, y_i)}_{i=1}^{|D|}, the model is optimized by minimizing the cross-entropy (CE) loss:

L_CE = − ∑_{i=1}^{|D|} log p_θ(y_i | x_i).    (2)
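As a minimal illustration of this objective (our own sketch, not the authors' code), the negative log-likelihood of one target sequence can be computed from the model's per-token probabilities of the gold tokens:

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of one target sequence under the
    autoregressive factorization: -sum_t log p(y_t | y_<t, x).
    `token_probs` holds the model probability of each gold token."""
    return -sum(math.log(p) for p in token_probs)
```

Summing this quantity over all training pairs gives the CE loss minimized above.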

Syntax-guided Generation
In this section, we introduce syntax-guided generation, which generates texts by hierarchically expanding constituents in syntax contexts throughout the syntax tree, while also leveraging the strengths of Transformer. In general, the generation process can be decomposed into two stages: (1) neural generation: the neural decoder (Section 3.2.2) generates the infilling sequences based on the source sequence and the syntax context; (2) constituent expansion: predicted infilling sequences are mapped and filled into each constituent in the syntax context accordingly (Section 3.2.3), forming the next-level syntax context. To facilitate parallelism during training, we decompose the sequence-to-sequence dataset into a triplet set, where the neural decoder is optimized to maximize the probability of the infilled sequence (e.g., "<c> I <c> ate <NP> .") given the lexicalized syntax context (e.g., "<NP> <VP> ."), as shown in Figure 2.

Triplet Construction
Given a target sequence y, the corresponding constituency parse tree of depth |T| can be represented by a set of labeled spans T:

T = {T_d}_{d=1}^{|T|},  T_d = {(a_k, b_k, l_k)}_{k=1}^{|T_d|},    (3)

where a_k and b_k represent the k-th constituent span's fencepost positions at depth d, and l_k represents the constituent label. Our model is optimized to predict the next-level span set T_d given the previous one and the source input, i.e.,

p_θ(T_d | T_{d−1}, x).    (4)

Given the set of labeled spans at depth d, i.e., T_d, we transform the target sequence into a lexicalized syntax sequence of length |s_d|: s_d = {s_{d;1}, s_{d;2}, ..., s_{d;|s_d|}}, by keeping the lexical tokens and replacing the constituent spans with the corresponding labels. For instance, the sequence "I ate an apple ." is transformed to s_2 = {<NP>, <VP>, .} at depth 2, and to s_3 = {I, ate, <NP>, .} at depth 3, as shown in Figure 2. The alignment between s_2 and s_3 can be modeled as a text-infilling task. For example, the {<NP>} and {<VP>} at depth 2 are replaced by {I} and {ate <NP>} at depth 3, respectively. To generate the whole s_3 based on s_2 in one pass, we concatenate all the infilling texts with a special token "<c>", yielding an infilling sequence f_2 = {<c>, I, <c>, ate, <NP>}.
Similarly, for each syntax context s_d, we collect the respective infilling texts for each constituent in the lexicalized sequence at depth d+1, and concatenate them to construct the target infilling sequence of length |f_d|: f_d = {f_{d;1}, f_{d;2}, ..., f_{d;|f_d|}}. In this way, a triplet (x, s_d, f_d) is constructed for a source-target sequence pair at depth d. We traverse the target syntax tree in level-order to obtain the full set Φ of training triplets for a training instance:

Φ = {(x, s_d, f_d)}_{d=1}^{|T|}.    (5)

Given a sequence-to-sequence training set D = {(x_i, y_i)}_{i=1}^{|D|}, we go through the full training set to construct the complete triplet set Ψ = ⋃_{i=1}^{|D|} Φ_i.
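To make the construction concrete, here is a hedged Python sketch (function names and the span encoding are our own, not the authors' implementation) that derives the lexicalized syntax context and the infilling sequence from labeled spans at two adjacent depths:

```python
def lexicalize(tokens, spans):
    """Replace each constituent span (a, b, label), given as fencepost
    positions over `tokens`, with its label token."""
    out, i = [], 0
    for a, b, label in sorted(spans):
        out.extend(tokens[i:a])
        out.append(f"<{label}>")
        i = b
    out.extend(tokens[i:])
    return out

def build_triplet_target(tokens, spans_d, spans_d1):
    """Build (s_d, f_d): the syntax context at depth d, and the
    infilling sequence that rewrites each depth-d constituent by its
    depth-(d+1) rendering, with texts joined by the separator <c>."""
    s_d = lexicalize(tokens, spans_d)
    f_d = []
    for a, b, _ in sorted(spans_d):
        # deeper spans lying inside this constituent, re-indexed locally
        inner = [(x - a, y - a, l) for x, y, l in spans_d1
                 if a <= x and y <= b]
        f_d.append("<c>")
        f_d.extend(lexicalize(tokens[a:b], inner))
    return s_d, f_d
```

For "I ate an apple ." with depth-2 spans NP = (0, 1) and VP = (1, 4) and a depth-3 NP = (2, 4), this yields s_2 = [<NP>, <VP>, .] and f_2 = [<c>, I, <c>, ate, <NP>], matching the running example.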

Neural Decoder
Given a triplet instance Ψ j , we construct the neural decoder based on Transformer to model the generative probability p θ (f j |x j , s j ). The neural decoder takes the source sequence and the lexicalized syntax context as input and generates the corresponding infilling texts, as shown in Figure 2.
Besides the encoder that encodes source context, we introduce an extra Transformer encoder, i.e., syntax context encoder, to encode the lexicalized syntax context into a representation. On top of selfattention and source context attention, we insert an extra attention layer (syntax context attention) into each decoder layer to incorporate syntax contexts, as shown in the right part of Figure 2.
Similarly, the probability of the infilling sequence can be factorized as:

p_θ(f | x, s) = ∏_{t=1}^{|f|} p_θ(f_t | f_{<t}, x, s).    (6)

We define the scoring function for an infilling sequence as the sum of the log probabilities:

δ_f = ∑_{t=1}^{|f|} log p_θ(f_t | f_{<t}, x, s).    (7)

We adopt the standard cross-entropy (CE) loss to optimize our model, where the loss for the j-th triplet in the training set Ψ can be written as:

L_j = − log p_θ(f_j | x_j, s_j),    (8)

and the CE loss across the whole triplet set Ψ becomes:

L_CE = ∑_{j=1}^{|Ψ|} L_j.    (9)

Generation Process
Given a source sequence, our model generates the target sequence in a top-down manner grounded in syntactic grammar rules. As shown in Figure 2, the neural decoder first encodes the source sequence x into the source context representation h_src, which remains fixed and can be reused throughout the generation process. Initially, the neural decoder generates the infilling sequence f_0 given x and s_0 = {<T>}, based on Equation 6. Then the model proceeds with the generation process by iteratively generating infilling texts and expanding constituents. At each iteration step (i.e., tree depth), the neural decoder generates the infilling sequence f_d for the syntax context s_d:

f_d = argmax_f p_θ(f | x, s_d).    (10)

Then the constituent expansion function yields the next-level syntax context given the current syntax context and the infilling sequence predicted by the neural decoder:

s_{d+1} = Expand(s_d, f_d).    (11)

Specifically, we first separate the infilling sequence by the special separator "<c>" into a group of infilling texts, e.g., splitting f_2 = {<c>, I, <c>, ate, <NP>} into {{I}, {ate <NP>}}. Then we fill each of the infilling texts into the corresponding constituent in the syntax context s_2 to obtain the syntax context at the following level, e.g., s_3 = {I, ate, <NP>, .}. The syntax context encoder encodes the updated syntax context s_{d+1} and starts the next iteration. The decoding process loops between these two stages until there is no constituent label in the syntax context, or a maximum tree depth is reached, as shown in Figure 2.
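The two-stage loop above can be sketched as follows (a simplified illustration with our own helper names; the real neural decoder is a Transformer, stubbed here as a callable, and we assume angle-bracketed tokens are exactly the constituent labels):

```python
def expand(syntax_context, infilling_sequence):
    """Constituent expansion: split the infilling sequence on <c> and
    fill each piece into the corresponding constituent label of the
    syntax context."""
    fills = []
    for tok in infilling_sequence:
        if tok == "<c>":
            fills.append([])  # start the fill for the next constituent
        else:
            fills[-1].append(tok)
    out, k = [], 0
    for tok in syntax_context:
        if tok.startswith("<") and tok.endswith(">"):
            out.extend(fills[k])
            k += 1
        else:
            out.append(tok)
    return out

def generate(source, decoder, max_depth=16):
    """Loop: predict infilling texts, expand constituents, and repeat
    until no constituent label remains or the depth limit is hit."""
    ctx = ["<T>"]
    for _ in range(max_depth):
        if not any(t.startswith("<") and t.endswith(">") for t in ctx):
            break
        ctx = expand(ctx, decoder(source, ctx))
    return ctx
```

With a stub decoder replaying the paper's running example, `generate` produces "I ate an apple ." after three expansion steps.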
As the model's behavior in expanding constituents over the entire syntax tree is fully accessible, the generation process can be effectively interpreted, as shown in Section 6.2. Moreover, manual modifications can be directly incorporated into the expansion process for each constituent throughout the syntax tree (Section 6.3). Finally, more than one syntax structure can be considered simultaneously at each tree depth, enabling the search for hypotheses of better syntactic diversity (Section 6.4).

Structural Beam Search
By default, our model selects the best infilling texts greedily at each iteration. We introduce structural beam search to explore the hypothesis space for more accurate and diverse generation. Similar to standard beam search (Sutskever et al., 2014), but traversing the constituency parse tree during inference, our method is able to search promising syntax structures throughout the syntax tree in a top-down manner. We show a real example of our model generating a paraphrase in Figure 3. At each level, we apply standard beam search for neural generation and keep the top k infilling texts along with their scores, computed by Equation 7. Taking previous predictions into consideration, we introduce a moving-average mechanism to trade off confidence between the predictions from lower levels and the current-level prediction. Specifically, suppose s_i is the i-th syntax context in the k-width beam at the current depth, with an accumulated score of δ_{s_i}; and f_{j;s_i} is the j-th infilling sequence candidate from the neural generation beam given the syntax context s_i, with a score of δ_{f_{j;s_i}}. A beam of next-level syntax contexts is constructed by filling in the current syntax context with the corresponding infilling sequences:

s_{ik+j} = Expand(s_i, f_{j;s_i}).    (12)

The updated score for each of the next-level syntax contexts in the beam is given by:

δ_{s_{ik+j}} = α · δ_{s_i} + (1 − α) · δ_{f_{j;s_i}},    (13)

where α is a hyper-parameter (accumulation weight) that determines how much weight is put on predictions at lower levels. The beam is then pruned by the updated scores to maintain the beam width. For example, the first two candidate syntax contexts are selected at depth 2 in Figure 3. Algorithm implementation details can be found in Appendix A.

Figure 3: A real example of our model generating a paraphrase given the source sequence "it seems like he has made a mistake.", under structural beam search of width 2. Diverse syntax structures are explored during generation, e.g., "<VP> .", "<NP> <VP> .", and "did <NP> <VP> ?".

Evaluation

We use the BLEU score (Papineni et al., 2002) to evaluate machine translation performance. For paraphrase generation, we also adopt ROUGE (Lin, 2004). In addition, we report iBLEU (Sun and Zhou, 2012):

iBLEU = r · BLEU(hyp, ref) − (1 − r) · BLEU(hyp, src),

which evaluates the generation fidelity with novelty to the source sentence considered [1]. Following Bandel et al.
(2022), we consider two reference-free metrics: (1) the lexical diversity score, D_lex, which is the normalized character-level minimum edit distance between the bags-of-words; and (2) the syntax diversity score, D_syn, which is the normalized tree edit distance. Both scores compare generated paraphrases with the source sequences unless otherwise specified.
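A hedged sketch of the lexical diversity score described above (our reading of the definition; the exact normalization used in the original metric may differ):

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def d_lex(src, hyp):
    """Character-level edit distance between the sorted bags-of-words of
    two sequences, normalized by the longer string's length."""
    a = " ".join(sorted(src.lower().split()))
    b = " ".join(sorted(hyp.lower().split()))
    denom = max(len(a), len(b)) or 1
    return edit_distance(a, b) / denom
```

Sorting the tokens before comparison makes the score insensitive to pure word reordering, so it rewards genuinely new lexical choices rather than shuffled ones.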

Results
Paraphrase We compare our method with the baselines and previous work on syntax-controlled paraphrase generation. Two additional baselines are also listed, i.e., copying the source input and using the reference as the output. The results are shown in Table 1. For paraphrase generation without syntax control (the center section in Table 1), our method achieves higher performance than the seq2seq Transformer, in both greedy and beam search settings. Notably, our method under greedy decoding obtains results comparable to the Transformer under beam search, and even outperforms it on some metrics. The advantage of our method becomes larger for metrics such as iBLEU, D_lex, and D_syn, which consider generation novelty compared with the source input. For example, compared with Transformer (beam 5), our method (beam 5) gives a much lower self-BLEU score.

We extend our method to the pre-trained language model (PLM) setting and present the results in Table 3 (details in Appendix D). It can be seen from the table that the utilization of BART (Lewis et al., 2019) improves the generation diversity of the sequence-to-sequence model significantly. Despite the narrowed gap, our model outperforms the seq2seq counterpart in terms of iBLEU and lexical diversity by a considerable margin.

[1] r is set as 0.7.
Machine Translation As shown in Table 2, our method achieves consistent performance (BLEU score) improvement over the Transformer baseline. The improvement is larger in the greedy setting (+1.5 BLEU on average) than in the beam search setting (+1.2). This indicates that using syntax to guide and constrain generation yields more reasonable and high-quality hypotheses than greedy autoregressive generation, and thus relies less on search algorithms (e.g., beam search). Note that compared with the English-oriented datasets, our model obtains a smaller performance improvement on WMT'14 En-De. This may be because the German parser is less accurate than the English one (92.1 vs. 96.3 F1 score), resulting in a training set of lower quality.

Analysis
We first discuss the influence of grammar quality, and then examine the potential advantages of our method from three perspectives: interpretability, controllability, and diversity.

The Influence of Grammar Quality
Intuitively, learning syntactic grammar of higher quality results in better generation performance; e.g., the advantage of our method on the English-oriented datasets is larger than on the German-oriented one. To further explore the influence of grammar quality, we randomly replace a certain ratio of the constituent labels with random ones to simulate a less accurate parser. We conduct experiments on the WMT'16 Ro-En dataset. When injecting noise at ratios of 0.2 and 0.4, the model performance deteriorates from 34.9 to 34.6 and 32.3, respectively, indicating that the quality of the syntactic grammar exerts a large influence on the model's generation performance.
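The noise-injection probe can be sketched as follows (an illustrative re-implementation; LABELS is an assumed label inventory, not the full treebank tag set):

```python
import random

LABELS = ["NP", "VP", "PP", "S", "SBAR", "ADJP", "ADVP"]

def corrupt_labels(spans, ratio, rng=None):
    """Replace each constituent label with a *different* random label
    with probability `ratio`, simulating a noisier parser. Spans are
    (start, end, label) tuples; positions are left untouched."""
    rng = rng or random.Random()
    noisy = []
    for a, b, label in spans:
        if rng.random() < ratio:
            label = rng.choice([l for l in LABELS if l != label])
        noisy.append((a, b, label))
    return noisy
```

Ratio 0.0 leaves the gold spans intact, while ratio 1.0 relabels every constituent, bracketing the two noise levels probed above.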

Interpretability
We evaluate the model's interpretability based on its capability of providing explanations in terms understandable to a human (Doshi-Velez and Kim, 2017), i.e., whether it generates texts following language grammar. We trace each constituent expansion during generation and compare the model-induced tree with the tree produced by a benchmark parser, e.g., the Berkeley Parser. Specifically, we use the Berkeley Parser to parse the hypotheses generated by our model and treat the corresponding parsing results as golden parses. Quantitative results (Figure 4) show that our model achieves an average F1 score of 94.6, which demonstrates that the generation process corresponds closely to the syntactic grammar and thus can be effectively interpreted. Note that the score for WMT'14 En-De is lower (89.0), possibly due to the less accurate German parser used for constructing the syntactic grammar, as discussed in Section 6.1.

Controllability
Control with Complete Syntax Template To leverage control signals from delexicalized syntax templates (e.g., "(S (NP) (VP (NP)))" for the sequence "I ate an apple ."), we introduce a reward γ into Equation 13:

δ_{s_{ik+j}} = α · δ_{s_i} + (1 − α) · δ_{f_{j;s_i}} + γ,

where γ is a positive value if the updated syntax context s_{ik+j} matches the corresponding template pattern at depth d+1, and 0 otherwise. For example, the syntax context "<NP> <VP>" in Figure 3 matches the pattern "((NP)(VP))" at depth 2. Intuitively, the reward encourages the model to favor beam candidates that match the syntax template. We set the reward value to 0.32 based on validation results (Appendix F). The test set of ParaNMT-small is provided with human-annotated exemplars, which we use to control generation, with results shown in Table 1. More generally, golden templates can be derived by parsing the reference sentences of each dataset with a parser (e.g., the Berkeley Parser). We present the results in Table 5. Guided by the reference syntax template, our model obtains consistent improvement in terms of hypothesis similarity with the references, reflected by the decreased syntax edit distance to the references, i.e., D_syn^ref. For the multi-reference dataset NIST Zh-En, our model can generate translations of different styles, prompted by alternative syntax templates from multiple references.
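A minimal sketch of the reward term (the pattern matching here is simplified to comparing label sequences at one depth; the actual template matching may be richer, and the function names are ours):

```python
import re

def matches_template(context, level_labels):
    """True if the constituent labels appearing in a lexicalized syntax
    context equal the label sequence of one template level, e.g.
    "<NP> <VP> ." against ["NP", "VP"]."""
    return re.findall(r"<([A-Z]+)>", context) == level_labels

def rescore(score, context, level_labels, gamma=0.32):
    """Add the template-match reward gamma to a beam candidate's
    updated score (the extra term introduced above); 0.32 follows the
    validated reward value."""
    return score + (gamma if matches_template(context, level_labels) else 0.0)
```

Because the reward only shifts scores of matching candidates upward, non-matching candidates can still survive when the template is unattainable, making the control soft rather than hard.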

Control with Partial Syntax Template
We further explore whether the model can handle fine-grained arbitrary controls. Specifically, we ask three annotators to modify the intermediate syntax contexts output by the model, based on the source input. 100 instances are randomly selected from the NIST Zh-En test set, and each annotator gives different modifications for each instance. The modified contexts are fed to the model to predict the infilling texts. We then ask the annotators to evaluate whether their controls (i.e., modifications) are correctly realized by the model. We show some of the control examples in Appendix G. The average control success rate is 81%, which demonstrates the capability of our model to handle arbitrary fine-grained controls.

Diversity
Beam Diversity We expect the model to generate diverse hypotheses under beam search while maintaining generation quality. To this end, we measure the model's beam diversity by computing two average scores: (1) the average of the mutual diversity scores between every two beam candidates, i.e., D_lex^beam and D_syn^beam; (2) the average generation quality of the beam candidates, measured by BLEU scores. The results for paraphrase generation are shown in Table 6. In terms of generation quality, our model generates consistently better beam candidates on average than the baseline model. Besides, we can see that structural beam search yields more diverse beam candidates, as indicated by the higher mutual diversity (i.e., D_lex^beam and D_syn^beam) among beam candidates.

Effects of Accumulation Weight A larger accumulation weight (α in Eq. 13) puts more weight on previous decisions when re-ranking the newly updated beam candidates. As a result, early-determined syntax structures are less likely to be surpassed throughout the structural beam search. Conversely, a smaller α encourages the model to explore promising candidates at higher levels, and can therefore find more diverse hypotheses. We explore the effects of α with results shown in Figure 4. As the weight grows smaller, the model generates sequences of better syntactic diversity, i.e., D_syn. However, an overly small weight deteriorates generation quality (iBLEU), which can be caused by the model's overconfidence in local predictions without considering the predictions of syntax contexts at lower levels. Such deterioration is also seen for overly large weights (>0.95), due to limited exploration at higher levels.
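The moving-average rescoring and pruning of Eq. 13 discussed above can be sketched as follows (a simplified, illustrative re-implementation; the data structures and names are ours):

```python
import heapq

def update_scores(beam, expansions, alpha=0.9, width=2):
    """One structural-beam-search rescoring step (Eq. 13): blend each
    parent context's accumulated score with the neural score of the
    chosen infilling sequence, then prune to the beam width.

    beam:        list of (accumulated_score, parent_context)
    expansions:  dict parent_context -> list of (infill_score, next_context)
    """
    candidates = []
    for parent_score, parent_ctx in beam:
        for infill_score, next_ctx in expansions[parent_ctx]:
            score = alpha * parent_score + (1 - alpha) * infill_score
            candidates.append((score, next_ctx))
    return heapq.nlargest(width, candidates)
```

With alpha near 1, the parent term dominates and early syntax decisions persist; with alpha near 0, the current infilling score dominates, which matches the exploration/quality trade-off described above.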

Human Evaluation
We further conduct a human evaluation to assess generation quality and diversity on paraphrase generation. We ask three annotators to vote for one of two candidates: hypotheses from the seq2seq baseline and from our method. The annotators are required to decide which one is better by considering Fidelity, Novelty, and Diversity (see Appendix H for details). The results are shown in Table 7. As can be seen from the table, our method achieves much better generation novelty and beam diversity than the baseline, while maintaining semantic fidelity, which further validates the results of the automatic evaluation.

Conclusion
We proposed a syntax-guided generation paradigm, which leverages the strengths of Transformer and generates sequences by hierarchically expanding constituents in the lexicalized syntax contexts throughout the syntax tree. The neural decoder was trained by maximizing the likelihood of the infilling texts for each constituent in the syntax contexts given the source sequence. Moreover, we proposed structural beam search to better explore the hypothesis space. Empirical results demonstrated the advantage in generation quality over the seq2seq baseline, as well as the effectiveness in terms of interpretability, controllability, and diversity. Our method can be seen as a step towards explicit modelling of psycholinguistic structures during neural text generation, helping the model to have a degree of control over what it intends to generate, which can potentially address salient issues of current neural NLG, such as hallucination (Guerreiro et al., 2023; Dziri et al., 2022) and ethical issues (Sheng et al., 2019, 2021; Weidinger et al., 2021), if semantics, pragmatics, and other factors are also integrated.

Limitations
Despite the competitive performance, there are several limitations of this work: (1) As discussed in Section 6.1, the generation performance relies on the parser performance, which is strong enough for English but still less satisfactory for other languages. Dedicated methods need to be considered to compensate for the weak parser performance if we want to extend our method to more languages.
(2) In this work, we consider two NLG tasks with semantic equivalence to test whether the proposed method can convey the source semantics accurately by following the target syntactic grammar. Other tasks such as summarization and dialogue generation, where the semantics are not equivalent between the source and target, can also be tested. (3) To train the neural decoder in parallel, we break down the source-target dataset into a triplet set. However, the global dependency of the syntax parse tree is not considered, which can deteriorate generation performance. (4) Due to the recursive encoding of the syntax contexts, our model's inference speed is approximately half that of the seq2seq counterpart (Appendix E). (5) Future work should include experiments on large language models (Brown et al., 2020; OpenAI, 2023; Zeng et al., 2022; Touvron et al., 2023; Taori et al., 2023) to further demonstrate the effectiveness of our method beyond pretrained language models.

Ethics Statement
We honor the ACL Code of Ethics. No private data or non-public information is used in this work. For human annotation (Section 6.3 and Section 6.4), we recruited our annotators from the linguistics departments of local universities through public advertisement with a specified pay rate. All of our annotators are senior undergraduate students or graduate students in linguistics majors who took this annotation as a part-time job. We pay them 60 CNY an hour. The local minimum salary in the year 2022 is 25.3 CNY per hour for part-time jobs. The annotation does not involve any personally sensitive information. The annotators are required to rank the system outputs and label factual information (i.e., syntactic annotation).

C Model Architecture
We conduct experiments to compare different model architectures for incorporating syntax context on the WMT'16 Ro-En validation set. We consider the following settings:

• Concat: concatenate the syntax context with the source sequence, with the vanilla Transformer unmodified.
• Extra-attention: reuse the source encoder for encoding syntax context and insert an extra attention layer, i.e., the syntax context attention, into each decoder layer.
• Extra-encoder: introduce an additional encoder for encoding syntax context and also use the syntax context attention.
Empirical results are shown in Table 9. Based on validation results, we adopt the Extra-encoder model in all experiments except for training on BART (Table 3), where we adopt the Concat model.

D Experiments on PLM
In this section, we introduce our experimental settings for PLMs. Following previous work (Sun et al., 2021), we use BART-base (Lewis et al., 2019) as our base model. All models are finetuned for 10 epochs with a batch size of 64k tokens. The learning rate is 3e-5 with a linear decay schedule, as recommended in BART's official repository. We use the Concat model architecture (Appendix C) for extending our method to BART. The source text and the syntax context are concatenated with a special token "<sep>", e.g., "I ate an apple . <sep> <NP> <VP> .". To effectively employ our method with BART, whose inputs are byte-level tokenized sequences, the same as Radford et al. (2019), we make several modifications. In pre-processing, we make sure our special tokens (e.g., <sep>, <c>, <NP>, <VP>) are not split, and add extra byte-level spaces before and after each special token. Thanks to the unused tokens in BART's embeddings, we do not need to modify the embedding matrix; instead, we assign our special tokens to unused token indexes. Finally, in the inference stage, we find that constituent expansion causes a discrepancy between training and test inputs. We therefore first detokenize each level's outputs and then tokenize them back with the same procedure as in pre-processing to avoid this gap.
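The Concat input format can be illustrated with a small helper (a sketch of the string format only; the byte-level tokenizer handling described above is omitted):

```python
def concat_input(source, syntax_context, sep="<sep>"):
    """Join the source text and the lexicalized syntax context with the
    separator token, as used by the Concat architecture."""
    return f"{source} {sep} {' '.join(syntax_context)}"
```

This produces exactly the running example above for the sentence "I ate an apple ." at depth 2.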

E Generating Linearized Trees Directly
A baseline method to induce grammar simultaneously during generation is to generate linearized parse trees directly, i.e., training a seq2seq model that takes in source sequences and outputs linearized parse trees. We compare it with our method on WMT'16 Ro-En. Specifically, the BLEU score for WMT'16 Ro-En is only 27.6, compared to the seq2seq baseline (34.1) and our method (34.9). This may be because the additional parentheses and constituency tags in linearized trees deteriorate sequence coherence, making learning more difficult. Our method, on the other hand, breaks down syntax trees into per-level pieces to create a better learning curriculum. Furthermore, generating linearized parse trees is much slower than the seq2seq counterpart, since the average sequence length of linearized tree sequences is longer (152.3 vs. 28.4). As a result, the average speed for generating linearized parse trees is only 0.8 sentences/s, compared to 3.6 sentences/s for the seq2seq baseline. Our method achieves an inference speed of 1.7 sentences/s under the same computing conditions (V100 GPU). Additionally, generating a linearized parse tree is not easily interpretable or controllable, due to the black-box nature of the sequence-to-sequence paradigm.

F Effects of Control Reward
The magnitude of the reward γ determines how much priority is given to beam candidates that match the syntax exemplar. We experiment with different reward values to give a quantitative demonstration, shown in Figure 5. It can be seen that the control effectiveness grows as the reward value increases up to 0.64, at which point all possible matched beam candidates are re-ranked to the top of the search space.

G Control with Partial Syntax Template
We present 3 sample cases to demonstrate fine-grained control over the generation process, shown in Figure 6. Each Chinese source sentence is paired with 3 manual controls from three annotators. The model takes in the annotated syntax context and proceeds to produce the respective translations.

H Human Evaluation for Paraphrase Generation
We ask three annotators to conduct side-by-side human evaluations and report the averaged results of their annotations. For each instance, the annotators vote for one of the two outputs by the baseline and our model. The outputs contain the top-5 beam candidates under beam search. The annotators are asked to evaluate both the best candidate and the beam results as a whole, based on the following three aspects:

• Fidelity: Whether the best candidate is semantically equivalent to the input.
• Novelty: Whether the best candidate modifies the input sentence structure.
• Diversity: Whether the generated five candidates are different from each other given the input.