PIP: Parse-Instructed Prefix for Syntactically Controlled Paraphrase Generation

Syntactically controlled paraphrase generation requires language models to generate paraphrases for sentences according to specific syntactic structures. Existing fine-tuning methods for this task are costly, as all of the model's parameters need to be updated during training. Inspired by recent studies on parameter-efficient learning, we propose Parse-Instructed Prefix (PIP), a novel adaptation of prefix-tuning for tuning large pre-trained language models on the syntactically controlled paraphrase generation task in a low-data setting with significantly lower training cost. We introduce two methods to instruct a model's encoder prefix to capture syntax-related knowledge: direct initiation (PIP-Direct) and indirect optimization (PIP-Indirect). In contrast to traditional fine-tuning methods for this task, PIP is a compute-efficient alternative with 10 times fewer learnable parameters. Compared to existing prefix-tuning methods, PIP excels at capturing syntax control information, achieving significantly higher performance at the same level of learnable parameter count.


Introduction
Syntactically controlled paraphrase generation (SCPG) has attracted increasing attention as it can diversify the generated paraphrases (Iyyer et al., 2018; Huang and Chang, 2021; Sun et al., 2021). Given an input sentence and a target syntax specification, an SCPG model aims to generate paraphrases that satisfy the specific syntax requirement. Such generation systems are promising in benefiting multiple application areas in natural language processing (NLP), such as text summarization (Fan et al., 2018), dialogue systems (Niu and Bansal, 2018; Gao et al., 2020), diverse question generation (Yu and Jiang, 2021), creative generation (Tian et al., 2021), and improving the robustness of models (Iyyer et al., 2018; Huang and Chang, 2021).
However, prior studies on SCPG mainly explore fine-tuning strategies, which require updating the parameters of the entire language model to adapt to the newly included syntax information. Many previously proposed methods therefore suffer from tremendous training cost (Lewis et al., 2019; Raffel et al., 2020; Brown et al., 2020). With the recent rise of ever-larger pre-trained language models (PLMs), this problem has become even more pressing. A lightweight, more resource-efficient tuning method would allow easier application of large PLMs to the SCPG task.
Resource-efficient training methods such as prompt-tuning and prefix-tuning (Li and Liang, 2021; Lester et al., 2021) have proven effective in tuning large PLMs on various NLP tasks, such as text classification (Liu et al., 2021a), sequence labeling (Liu et al., 2021a), and summarization (Li and Liang, 2021). Prefix-tuning freezes a PLM's parameters and optimizes a small task-oriented continuous prefix that is prepended to the model's Transformer layers. It is a promising alternative to fine-tuning in a low-data setting with significantly fewer learnable parameters. However, no previous literature has explored the potential of prefix-tuning on the SCPG task.
In light of this gap, we are among the first to study the application of resource-efficient training methods to the SCPG task. Our work makes two main contributions. First, we study prefix-tuning's application to the SCPG task as a compute-efficient alternative to fine-tuning. Second, we propose parse-instructed prefix (PIP), a novel adaptation of prefix-tuning for enhanced syntax control in paraphrase generation. Like prefix-tuning, PIP freezes all parameters of a PLM and only optimizes the prefix parameters, reducing the number of tunable parameters to roughly one tenth of that required for fine-tuning. Prefix-tuning methods initialize the prefix as continuous and completely free parameters. For the SCPG task, this means that the prefix must learn the syntax control from scratch, since PLMs are not pre-trained on any syntax-related task. In contrast, PIP provides syntax-related guidance to the prefix, allowing it to better capture syntax knowledge. Specifically, we introduce two methods to guide the process of syntax knowledge capturing: direct initiation and indirect optimization. We show that prefix-tuning-based methods achieve promising performance in a low-data setting with significantly fewer learnable parameters. In addition, our proposed PIP methods outperform prefix-tuning at the same level of training cost.1

Related Work
Syntactically controlled paraphrase generation. For the SCPG task, given a source sentence and a target syntax structure, a language model is trained to output a paraphrase of the source sentence that (1) is semantically similar to the source sentence, and (2) conforms to the given target syntax structure, or the "syntax control". Prior works mainly adopted encoder-decoder model structures and used sequence-to-sequence training for the SCPG task (Iyyer et al., 2018; Kumar et al., 2020; Huang and Chang, 2021; Sun et al., 2021), while exploring different means to include the syntax control signal during training. The first type of approach encodes the source sentence and the target syntactic tree separately, then concatenates them at the decoder input (Iyyer et al., 2018; Kumar et al., 2020). The second type concatenates the linearized target constituency parse and the source sentence at the model input (Huang and Chang, 2021; Huang et al., 2022; Sun et al., 2021). However, these methods require updating all model parameters during tuning, at a high training cost.
Prompt-tuning and prefix-tuning. Prompting (Brown et al., 2020; Sun and Lai, 2020) provides PLMs with a discrete task-specific "prompt" to generate task-related outputs without task-specific fine-tuning. Prompt-tuning-based methods (Liu et al., 2021b; Qin and Eisner, 2021; Lester et al., 2021; Vu et al., 2022; Min et al., 2021), Prefix-Tuning (Li and Liang, 2021), and P-Tuning v2 (Liu et al., 2021a) derive from prompting and propose to optimize only a small sequence of continuous vectors. However, since prefix-tuning learns a prefix initialized as a continuous vector with completely free parameters, the prefix must learn task information from scratch during training. In addition, the training process for prefix-tuning does not allow for the incorporation of any task-specific guidance. In summary, existing prefix-based methods fail to consider both specific task instruction and model-learned knowledge.
Method

Problem formulation. Following previous studies (Iyyer et al., 2018; Huang and Chang, 2021; Huang et al., 2022), we use an encoder-decoder model structure and use the constituency parse as the control signal. Denote the source sentence as s_src, the target parse as t, and the target sentence as s_tgt. The goal of an SCPG model is to generate the target sentence s_tgt that is semantically similar to s_src while conforming to the syntax of t. In our study, the model is provided with s_src and t as input, and is supervised by the target sentence s_tgt during training. Note that previous methods (Sun et al., 2021; Huang et al., 2022) mainly fine-tune all PLM parameters and therefore suffer from high training costs.
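To make the setup concrete, the sketch below shows one common way to serialize the pair (s_src, t) into a single encoder input string. The separator token and the linearized-parse format here are illustrative assumptions, not the paper's exact preprocessing.

```python
def build_scpg_input(s_src: str, target_parse: str, sep: str = " <sep> ") -> str:
    """Join a linearized target parse and a source sentence into one
    encoder input string. The `<sep>` token is a hypothetical choice."""
    return target_parse + sep + s_src

# Example: paraphrase "the cat sat" under a declarative-sentence template.
x = build_scpg_input("the cat sat", "( S ( NP ) ( VP ) )")
```

The model is then trained sequence-to-sequence, with s_tgt as the decoder's supervision target.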
Prefix-tuning. We investigate a resource-efficient method for training an SCPG model based on the prefix-tuning method (Li and Liang, 2021; Liu et al., 2021a). Li and Liang (2021) freeze all pre-trained LM parameters and optimize only a small sequence of continuous prefixes that are prepended to the keys and values of the attention module in the input layer of the model's encoder and decoder. Liu et al. (2021a) further extend this approach and apply prefixes to every layer of the model encoder and decoder. We follow the previous approach (Li and Liang, 2021) and prepend additional prefix parameters to the key and value matrices of each Transformer layer in the PLM. Specifically, we establish a prefix p with length |p| for a PLM with l layers and hidden dimension dim(h), and produce a set of key prefixes K_p = {k_1, k_2, ..., k_l} and a set of value prefixes V_p = {v_1, v_2, ..., v_l}, where k_i, v_i ∈ R^{|p| × dim(h)} denote the key and value prefixes of layer i, respectively. For an encoder-decoder PLM, the key and value prefixes then influence the model's encoding and decoding process through the attention mechanism, where the prefixes directly attend with the hidden states of the model. Figure 1 visualizes the structure of the prefix-tuning method.
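As a rough illustration of how such prefixes enter the computation, the following NumPy sketch implements a single frozen attention head with trainable key and value prefixes prepended. Real implementations operate per head and per layer inside the PLM, so this is a simplified sketch under those assumptions, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prefix_attention(h, W_q, W_k, W_v, k_prefix, v_prefix):
    """Single-head attention with a prefix prepended to keys and values.

    h:        (seq, dim)  hidden states entering one Transformer layer
    k_prefix: (|p|, dim)  key prefix k_i for this layer
    v_prefix: (|p|, dim)  value prefix v_i for this layer
    In prefix-tuning, only k_prefix / v_prefix are trainable; the
    projection matrices W_q, W_k, W_v stay frozen.
    """
    q = h @ W_q
    k = np.vstack([k_prefix, h @ W_k])   # prepend prefix to keys
    v = np.vstack([v_prefix, h @ W_v])   # prepend prefix to values
    scores = q @ k.T / np.sqrt(h.shape[-1])
    return softmax(scores) @ v           # (seq, dim)
```

Because the prefix rows participate in every token's attention distribution, a small number of trained vectors can steer the frozen model's encoding.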

Parse-Instructed Prefix
Intuition. In prefix-tuning, the learned prefix acts as a context that influences the encoding of inputs by extracting task-related information from the PLM (Li and Liang, 2021). However, as prefix-tuning optimizes a prefix with completely free parameters, the prefix is learned from scratch and cannot receive task-specific guidance during tuning. Since we use the constituency parse of the target paraphrase as the control signal, which the PLM has never seen during pre-training, it takes a long time for the prefix to adapt to and learn an encoding for the control syntax. Specifically, for the SCPG task, the prefix needs to learn to: (1) capture semantic and syntax information from the model input, and (2) combine the extracted semantic and syntax knowledge to produce an encoding for paraphrase generation under the target syntax control. Since the prefix is implemented as prepended parameters for keys and values in Transformer layers, it first retrieves semantic and syntax information by attending to the source sentence and target parse in the model input. Ideally, the prefix then combines the retrieved semantic and syntax information by influencing the output encoding. The combined information is then captured by the decoder to output a paraphrase that conforms to the syntactic control. Therefore, guiding the prefix at the encoder output to capture and leverage syntax knowledge will strengthen the syntax control signal and improve performance on the SCPG task.
We therefore propose the parse-instructed prefix (PIP) at the model's last encoder layer to "instruct" and augment the prefix's ability to capture task-specific syntax control for paraphrase generation. Specifically, we introduce two PIP-based methods for better capturing syntax control information: Direct Parse-Instructed Prefix (PIP-Direct) and Indirect Parse-Instructed Prefix (PIP-Indirect). Different from prefix-tuning, where all prefix parameters are learned from scratch, PIP instructs the value prefix v_m of the last encoder layer with task-specific information. Through the attention mechanism, the instructed value prefix helps to better capture syntax information in the model's encoding output.
Direct parse-instructed prefix. We propose Direct Parse-Instructed Prefix (PIP-Direct) as an intuitive way to enhance knowledge of syntax control at model encoding. PIP-Direct directly updates the parameters of the value prefix at the last encoder layer with the model's encoding of the target parse. That is, for an input with target syntax t and an LM encoder with m layers, we first retrieve the model's encoding output of the target parse, which we denote as e(t). Then, for the set of the model encoder's value prefixes V_p = {v_1, v_2, ..., v_m}, we replace the value prefix of the last encoder layer with the parse encoding e(t). The final set of value prefixes prepended to the LM value states is then V_p' = {v_1, v_2, ..., v_{m-1}, e(t)}. This method directly augments syntax-related information at the last model encoder layer, which enables the key prefix of the same layer to capture the augmented syntax knowledge through attention. The structure of PIP-Direct is shown in Figure 2a.
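The replacement itself is a one-step operation on the encoder's list of value prefixes. A minimal sketch, treating each prefix as an opaque array, might look like:

```python
def pip_direct_value_prefixes(value_prefixes, parse_encoding):
    """PIP-Direct: swap the last encoder layer's value prefix v_m
    for the parse encoding e(t); earlier layers are left untouched.

    value_prefixes: list [v_1, ..., v_m], one entry per encoder layer
    parse_encoding: e(t), the frozen model's encoding of the target parse
    """
    return value_prefixes[:-1] + [parse_encoding]
```

Note that e(t) is recomputed per input, so the last layer's value prefix is input-dependent rather than a single learned parameter.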
Indirect parse-instructed prefix. We propose Indirect Parse-Instructed Prefix (PIP-Indirect) as an alternative way to guide the capture of target syntax knowledge at the last encoder layer. Instead of directly replacing the prefix parameters, we use a Parse Encoding Loss (PEL) to indirectly augment the prefix's ability to capture syntax knowledge. Given a parse input, the prefix influences the original model encoding by attending to the parse input, resulting in a modified encoding of the parse information. We can therefore augment the prefix's capture of syntax information by improving the ability to reconstruct the original parse encoding from the prefix-modified encoding.
For a target parse t with encoding e(t), we obtain its prefix-modified encoding through an additional prefix self-attention layer A(·), in which the prefix directly attends to the parse encoding e(t). The prefix attention layer has the same structure as a prefix Transformer layer in the model, with the key prefix k_m and value prefix v_m of the last encoder layer m prepended to the attention keys and values. We denote the output encoding of this prefix self-attention layer as A(k_m, v_m, e(t)).
To examine the ability to reconstruct the original parse encoding e(t) from the prefix-modified encoding A(k_m, v_m, e(t)), we leverage a learnable projection head, denoted by H(·): R^{dim(h)} → R^{dim(h)}, to approximate the process of reconstruction. We denote the output of the projection head as H(A(k_m, v_m, e(t))). Then, we establish the PEL loss by measuring the cosine distance between the projected output, i.e., the reconstructed parse encoding, and the original model encoding of the target parse, e(t). The PEL loss is formulated as L_PEL = Dist_cos(H(A(k_m, v_m, e(t))), e(t)), where Dist_cos denotes the cosine distance. By integrating the PEL loss with the LM output loss during optimization, we indirectly guide the prefix to better capture syntax-related information in the model encoding. The structure of PIP-Indirect is shown in Figure 2b.
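A minimal sketch of the PEL computation, assuming `prefix_attn` implements A(k_m, v_m, ·) and `proj_head` implements the learnable head H(·) (both hypothetical callables standing in for the paper's modules):

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dist_cos(a, b) = 1 - cos(a, b), computed on flattened encodings."""
    a, b = a.ravel(), b.ravel()
    return float(1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pel_loss(parse_enc: np.ndarray, prefix_attn, proj_head) -> float:
    """L_PEL = Dist_cos(H(A(k_m, v_m, e(t))), e(t))."""
    modified = prefix_attn(parse_enc)       # A(k_m, v_m, e(t))
    reconstructed = proj_head(modified)     # H(...)
    return cosine_distance(reconstructed, parse_enc)
```

If the prefix and head together reconstruct e(t) perfectly, the loss is zero; otherwise the gradient pushes the prefix toward encodings from which the parse information remains recoverable.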

Experiments
We conducted experiments on our proposed PIP-Direct and PIP-Indirect methods, as well as two baseline methods for comparison. All four training methods are implemented on the BART-base model (Lewis et al., 2019). For all models, we concatenate the source sentence and target parse as input, and train the models to output a paraphrase that follows the given target syntax.
Dataset. We use ParaNMT (Chen et al., 2019) as the training and validation dataset for all models. Specifically, we sample 30,000 and 6,400 data entries from ParaNMT as our training set and dev set, respectively. To test the models' abilities to generate paraphrases with syntax control in unseen domains, we follow previous work (Huang and Chang, 2021; Huang et al., 2022) and apply the trained models on three mainstream test datasets: Quora (Iyer et al., 2017), MRPC (Dolan et al., 2004), and PAN (Madnani et al., 2012).
Evaluation metrics. Conforming to prior works (Huang et al., 2022; Sun et al., 2021), we evaluate model generations on both alignment-based and syntactic conformation metrics. Alignment-based metrics measure the similarity between target paraphrases and model-generated paraphrases. We consider four alignment-based evaluation metrics: BLEU (Papineni et al., 2002), ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004). Syntactic conformation metrics measure the quality of syntactic control in generated paraphrases. We consider two syntactic conformation evaluation metrics: Template Matching Accuracy (TMA) and Tree-Edit Distance score (Zhang and Shasha, 1989) at height three (TED-3).
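To illustrate the syntactic side of the evaluation, the sketch below computes a simplified template-match check by trimming two constituency trees (represented as nested lists) to a fixed height and comparing them. This is a hypothetical simplification; the paper's exact TMA and TED-3 implementations may differ.

```python
def trim_parse(parse, height):
    """Trim a constituency parse (nested lists, e.g. ["S", ["NP", ...], ...])
    to the given height, keeping only constituent labels."""
    label, children = parse[0], parse[1:]
    if height == 1 or not children:
        return [label]
    return [label] + [trim_parse(c, height - 1)
                      for c in children if isinstance(c, list)]

def template_match(pred_parse, target_parse, height=2):
    """Return 1 if the top-of-tree templates agree at the given height, else 0."""
    return int(trim_parse(pred_parse, height) == trim_parse(target_parse, height))
```

Two trees that share the same top-level S → NP VP skeleton match at height two even if their lower constituents differ; deeper heights enforce stricter agreement.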
Baselines. We establish two baseline training methods for our study. The first baseline is the vanilla fine-tuning method that updates all parameters in the PLM, which we denote as Seq2Seq. The second baseline is prefix-tuning, which freezes the PLM and learns a small set of parameters as the prefix. On the Quora test set, PIP-Direct outperforms the other prefix-tuning-based methods.
Analysis. We conduct additional ablation experiments to further validate the experimental results. Specifically, we examine whether PIP-Indirect's performance gain is due to the effectiveness of its design or to its slightly higher parameter count compared to prefix-tuning. We experiment with prefix-tuning with an additional linear layer during prefix construction, denoted as Prefix-Tuning-Large. Prefix-Tuning-Large has 31.56M learnable parameters, 12.78M more than the PIP-Indirect method.
Table 2 shows the results of the ablation experiment on ParaNMT's validation set. We observe that, despite having more parameters, Prefix-Tuning-Large fails to outperform the PIP methods. It even fails to outperform the original Prefix-Tuning method, which has only 15.83M parameters. This provides further insight that (1) a larger number of parameters in prefix-tuning-based methods does not guarantee performance gains on downstream tasks, and (2) the strong performance of the PIP methods on the SCPG task is due to the effectiveness of the method design.

Conclusion
This research is amongst the first to study resource-efficient training methods for the syntactically controlled paraphrase generation task. In this work, we proposed Parse-Instructed Prefix (PIP), a compute-efficient method that requires 10× fewer learnable parameters than traditional fine-tuning methods. We introduced the Direct and Indirect PIP methods to further improve prefix-tuning's performance by providing task-specific guidance and augmenting task-related knowledge at the tuning stage. Through extensive experiments, we find that both PIP-Direct and PIP-Indirect outperform prefix-tuning in a low-data setting at the same level of parameter count, and are promising as a resource-efficient alternative to fine-tuning. With ablation studies, we further validate that the performance gain of the proposed PIP methods is due to the effectiveness of their design.

Limitations
We identify some limitations of our study. First, due to a lack of computing resources, we were not able to experiment with even larger pre-trained language models such as GPT-2 and GPT-3. In future work, we would like to investigate the potential of instructed prefix-tuning on even larger-scale language models across a variety of generation tasks. Second, since our work is amongst the first to explore the application of prefix-tuning to syntactically controlled paraphrase generation, we were not able to identify state-of-the-art prior works on the same subject against which to compare. We believe, however, that with our demonstration of the promising application of prefix-tuning to SCPG, researchers will soon propose new ideas for using prefixes to tune large PLMs on this task at even lower training costs.

Figure 1 :
Figure 1: Structure of the prefix-tuning method.

Figure 2 :
Figure 2: Structure of the proposed PIP-Direct and PIP-Indirect models. Note that we only visualize the encoder of the BART model. The model decoder follows the regular prefix-tuning setting without modifications. In (a), the value prefix of the last encoder layer is directly initiated with the model encoding of the target parse. In (b), the Parse Encoding Loss (PEL) is calculated between the prefix-attended parse encoding and the model parse encoding.

Table 1 :
Experiment results. "# Params" denotes the number of learnable parameters for each method. The PIP methods achieve the highest performance amongst the three prefix-based methods on all validation and test datasets.
Experiments on prefix-tuning in this study are based on our implementation of Li and Liang (2021) and Liu et al. (2021a).
We observe that PIP-Indirect achieves the highest performance across all metrics among the three prefix-tuning-based approaches on the validation set of ParaNMT, as well as on the PAN and MRPC test sets.