Revisiting Pivot-Based Paraphrase Generation: Language Is Not the Only Optional Pivot

Paraphrases are texts that convey the same meaning in different expression forms. Pivot-based methods, also known as round-trip translation, have shown promising results in generating high-quality paraphrases. However, existing pivot-based methods all rely on language as the pivot, for which large-scale, high-quality parallel bilingual texts are required. In this paper, we explore the feasibility of using semantic and syntactic representations as the pivot for paraphrase generation. Concretely, we transform a sentence into a variety of semantic or syntactic representations (including AMR, UD, and latent semantic representation), and then decode the sentence back from these representations. We further explore a pretraining-based approach to compress the pipeline process into an end-to-end framework. We conduct experiments comparing different approaches with different kinds of pivots. Experimental results show that taking AMR as the pivot yields paraphrases of better quality than taking language as the pivot, and that the end-to-end framework can reduce semantic shift when language is used as the pivot. Besides, several unsupervised pivot-based methods generate paraphrases of similar quality to a supervised sequence-to-sequence model, which indicates that parallel paraphrase data may not be necessary for paraphrase generation.


Introduction
Paraphrase generation is an important and challenging task in Natural Language Processing (NLP), with applications in information retrieval (Yan et al., 2016), question answering (Fader et al., 2014; Yin et al., 2015), machine translation (Cho et al., 2014), and so on.
Traditionally, paraphrase generation has been implemented with rule-based models (Fader et al., 2014; Zhao et al., 2009), lexicon-based methods (Bolshakov and Gelbukh, 2004; Kauchak and Barzilay, 2006), grammar-based methods (Narayan et al., 2016), and statistical machine translation-based methods (Quirk et al., 2004; Zhao et al., 2008). With the rapid development of deep learning techniques, neural methods have shown great power in paraphrase generation and achieved state-of-the-art results (Gupta et al., 2018; Yang et al., 2019a). Neural paraphrase generation models usually follow the encoder-decoder paradigm: given a sentence X, these models generate the paraphrase Y by directly modeling P(Y|X) with a deep neural network. However, deep neural networks are in general sensitive to domains (Stahlberg, 2020), while existing mainstream paraphrase corpora cover only a few specific domains, such as image captions (Lin et al., 2014) and questions (Fader et al., 2013). High-quality paraphrases for general domains are difficult to obtain in practice, which greatly restricts the application of these seq2seq models.
Benefiting from the rapid development of machine translation technologies, pivot-based methods (Guo et al., 2019; Wieting et al., 2017) have been proposed for paraphrase generation. Formally, pivot-based methods generate the paraphrase following P(Y|X) = P(Z|X)P(Y|Z), where Z denotes the pivot of X. Existing pivot-based methods all choose Z to be a representation in a different language, so the quality of the generated paraphrases largely depends on the pre-existing machine translation system.
Choosing language as the pivot has some disadvantages, for example: (1) the pipeline translations may incur semantic shift (Guo et al., 2019), and (2) machine translation systems are sensitive to domains, so the quality of translating out-of-domain sentences cannot be guaranteed.
In this paper, we explore the feasibility of using different pivots for pivot-based paraphrasing models, including a syntactic representation (Universal Dependencies (McDonald et al., 2013), UD), a semantic representation (Abstract Meaning Representation (Banarescu et al., 2013), AMR), and a latent semantic representation (LSR). Compared with choosing another language as the pivot, choosing a syntactic or semantic representation is more direct and less likely to incur semantic shift. Apart from pipeline pivot-based generation, we also investigate how an end-to-end pivot-based model, which produces paraphrases in a single step with the help of the pivot, affects the quality of paraphrases. In the end-to-end framework, the model directly learns the paraphrasing probability P(Y|X) from the text distributions P(X) and P(Y), the pivot distribution P(Z), and the parallel text-pivot distributions P(Z|X) and P(Y|Z).
We conduct experiments on two benchmarks for paraphrasing: the Parabank and Quora datasets. We compare in detail the pros and cons of models using different pivots in terms of fidelity, fluency, diversity, and so on. The results show that using AMR as the pivot can also produce high-quality paraphrases. Besides, the end-to-end framework can reduce semantic shift when language is the pivot.
In sum, the prime contributions of this paper are as follows: • We explore using syntactic and semantic representations as pivots for pivot-based paraphrasing models, which is more direct and less likely to incur semantic shift.
• We also investigate applying an end-to-end paraphrasing model instead of the pipeline framework.
• We conduct experiments on two paraphrasing datasets to investigate in detail the pros and cons of models using different pivots.
• We find that models taking AMR as the pivot generate better paraphrases than those taking UD or language as the pivot.
The end-to-end framework can also reduce semantic change when language is used as the pivot. Besides, several unsupervised pivot-based methods generate paraphrases as good as the supervised encoder-decoder method, indicating that parallel samples may not be essential for generating high-quality paraphrases.

Language
Using language as the pivot has been widely explored by previous works (Wieting and Gimpel, 2018;Wieting et al., 2017;Guo et al., 2019). There are hundreds of languages in the world, and a sentence has different expressions in different languages. Therefore, we can take the sentence representation in another language as the pivot.

Abstract Meaning Representation (AMR)
Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a rooted, labeled, acyclic graph that abstracts away from syntax and preserves semantics. Nodes in an AMR graph are concepts, which are highly related to English words. Edges represent semantic relations between concepts. Since AMR keeps only semantic information, paraphrases can share the same AMR graph.
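As a concrete illustration, the widely used example "The boy wants to go" has the following AMR in PENMAN notation; the short sketch below (plain Python with a deliberately simplistic regex, for illustration only) extracts its concept nodes:

```python
import re

# AMR for "The boy wants to go" in PENMAN notation (a standard example
# from the AMR literature).  Nodes introduce concepts ("want-01",
# "boy", "go-02"); edges such as :ARG0/:ARG1 are semantic relations,
# and the re-used variable "b" marks a re-entrancy (the boy is the
# agent of both "want" and "go").
amr = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-02
            :ARG0 b))
"""

# Concepts appear as "variable / concept" pairs.
concepts = re.findall(r"\w+ / ([\w-]+)", amr)
print(concepts)  # ['want-01', 'boy', 'go-02']
```

Because the graph abstracts away from word order and function words, paraphrases of this sentence can map to the same graph, which is exactly what makes AMR attractive as a paraphrasing pivot.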

Universal Dependencies (UD)
Universal Dependencies (UD) (McDonald et al., 2013) is a framework for consistent annotation of parts of speech, morphological features and syntactic dependencies across human languages. Nodes in UD are the tokens of the sentence. Different from AMR, edge labels represent syntactic information.

Latent Semantic Representation (LSR)
The latent semantic representation (i.e., a dense vector) is another simple form of meaning representation. We use a deep neural model to obtain the latent semantic representation of a given sentence.

Pipeline Pivot-based Paraphrase Generation
In the pipeline process, we first translate the input texts to pivots (Language, AMR or UD) 1 , followed by generating paraphrases from pivots. This process is shown in Figure 1 (a).

Pipeline-language
Figure 1: (a) Pipeline pivot-based paraphrase generation. (b) Left: training stage of end-to-end pivot-based paraphrase generation. Right: inference stage of end-to-end pivot-based paraphrase generation.

We train an English-German and a German-English machine translation model with Transformers (Vaswani et al., 2017). The English sentences are first translated into German and then translated back into English. The sentences in German are regarded as the pivot.
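The round-trip process can be sketched in a few lines; the two translate_* functions below are hypothetical stand-ins (toy lookup tables) for the trained Transformer EN-DE and DE-EN systems:

```python
# Round-trip ("pivot") paraphrasing: EN -> DE -> EN.  The lookup
# tables are placeholders; a real system would run beam search over
# the trained NMT models.
def translate_en_de(sentence: str) -> str:
    lookup = {"the weather is nice today .": "das wetter ist heute schön ."}
    return lookup[sentence]

def translate_de_en(sentence: str) -> str:
    lookup = {"das wetter ist heute schön .": "today the weather is pleasant ."}
    return lookup[sentence]

def paraphrase_via_language_pivot(sentence: str) -> str:
    pivot = translate_en_de(sentence)   # text -> pivot (German)
    return translate_de_en(pivot)       # pivot -> text (English)

print(paraphrase_via_language_pivot("the weather is nice today ."))
# -> today the weather is pleasant .
```

Any error introduced by either system propagates to the final paraphrase, which is the semantic-shift risk discussed above.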

Pipeline-AMR
To parse texts into AMRs, we employ one of the state-of-the-art AMR parsers (Xu et al., 2020). It is a sequence-to-sequence model in which AMR graphs are first linearized; machine translation and constituent parsing are introduced as auxiliary tasks during training. The authors first generate AMR graphs automatically with an existing AMR parser to construct a larger silver dataset; the seq2seq model is first trained on the silver dataset and then fine-tuned on the gold dataset.
For generating texts from AMRs, we choose the graph-to-text model of Ribeiro et al. (2020). This model is based on T5 (Raffel et al., 2020); it is first trained on a larger task-specific silver dataset and then fine-tuned on the gold English-AMR dataset.

Pipeline-UD
We apply the Stanza toolkit (Qi et al., 2020) to obtain UD parses. Stanza is a pipeline system performing tokenization, sentence and word segmentation, part-of-speech tagging, morphological feature tagging, lemmatization and dependency parsing. We omit the model details here, which can be found in Qi et al. (2018).
We use IMSurReal (Yu et al., 2019) to accomplish the UD-to-text task. The model first linearizes the UD trees and then inflects the lemmas into word forms. Finally, the model contracts tokenized words into single tokens.
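The first two stages can be sketched as follows; the tree encoding, the toy inflection table and the example sentence are illustrative stand-ins, not IMSurReal's actual data formats (the final contraction step is omitted):

```python
# Toy sketch of IMSurReal-style surface realisation: depth-first
# linearisation of a UD tree's lemmas, then inflection into word forms.
# Each node is (lemma, morphological features, children).
tree = ("sit", {"Tense": "Past"},
        [("cat", {"Number": "Sing"}, [("the", {}, [])])])

def linearize(node):
    """Pre-order depth-first traversal.  A trained model would instead
    predict the correct surface order of the siblings."""
    lemma, feats, children = node
    words = [(lemma, feats)]
    for child in children:
        words.extend(linearize(child))
    return words

# Toy inflection table; the real system learns inflection neurally.
INFLECT = {("sit", "Past"): "sat", ("cat", "Sing"): "cat"}

def inflect(lemma, feats):
    key = (lemma, next(iter(feats.values()), None))
    return INFLECT.get(key, lemma)

print([inflect(l, f) for l, f in linearize(tree)])
# -> ['sat', 'cat', 'the']
```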

Towards End-to-End Paraphrase Generation
The above pivot-based methods are simple and straightforward, but have two disadvantages: (1) it is difficult to control and optimize the pipeline system, and the quality of the generated paraphrases is entirely determined by the text-to-pivot and pivot-to-text systems used; (2) the pipeline system is inefficient at the inference stage.
In this paper, we also investigate the feasibility of end-to-end methods. Different from supervised paraphrasing models, our model does not involve any explicit paraphrase pairs, so it needs to generate paraphrases in a "zero-shot" way. Inspired by recent work on cross-lingual transfer (Conneau and Lample, 2019), we propose a pretraining framework to endow the model with the ability of zero-shot paraphrasing. Besides, we also experiment with using an auto-encoder to generate paraphrases; in the auto-encoder model, the encoded latent semantic representation (LSR) can be considered a kind of semantic pivot.

LSR
We train a Transformer-based auto-encoder model, and use the encoder to encode the input sentence. The dense representation, which is the output of the encoder and can be considered as the latent semantic representation, is then decoded back to a sentence by the decoder.

End-to-end Pivot-based Method (E2E-pivot)
For the E2E-pivot method, our framework contains only one encoder-decoder (Transformer) model, which is learned from the parallel text-to-pivot distribution P(Z|X), the pivot-to-text distribution P(Y|Z), the prior text distributions P(X) and P(Y), and the prior pivot distribution P(Z). At inference time, given an input sentence, we guide the model to produce the output in text form again, which is then considered the paraphrase of the input. The model architecture of the E2E-pivot method is shown in Figure 1 (b).

Language Modeling Tasks
Our language modeling task contains two sub-tasks: causal language modeling (CLM) and masked language modeling (MLM). We use the CLM and MLM objectives to help the model learn a better encoder and decoder; these objectives have been proved effective for cross-lingual transfer. Given a sentence X, the causal language modeling task models the probability of a word given its prefix: P(x_t | x_1, x_2, ..., x_{t-1}; θ), where x_t denotes the t-th word in X and θ denotes the model parameters. The training objective is to maximize the log likelihood:

max L_1(X) = Σ_t log P(x_t | x_1, x_2, ..., x_{t-1}; θ)   (1)

Our masked language modeling task is the same as in Devlin et al. (2019a), also known as the Cloze task (Taylor, 1953). Concretely, we randomly sample 15% of the tokens in the input sentence; each sampled token is replaced by the [MASK] token 80% of the time, by a random token 10% of the time, and kept unchanged 10% of the time. The training objective is to maximize the log reconstruction probability:

max L_2(X) = log P(X | X̃; θ)   (2)

where X̃ = (x̃_1, x̃_2, ..., x̃_T) is the corrupted sentence. We refer readers to Devlin et al. (2019a) for more details. The overall language modeling objective is to maximize the sum of the two objectives:

max L_LM = L_1(X) + L_2(X)   (3)

In our framework, we apply language model pretraining on both texts and pivots. AMR and UD are linearized with depth-first search.
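The MLM corruption step described above can be sketched as follows (a simplified token-level version; real implementations operate on subword IDs over a large vocabulary):

```python
import random

def mlm_corrupt(tokens, mask_token="[MASK]", vocab=None, seed=0):
    """Sample 15% of positions; replace each sampled token with
    [MASK] 80% of the time, a random token 10% of the time, and
    keep it unchanged 10% of the time."""
    rng = random.Random(seed)
    vocab = vocab or list(tokens)       # fallback vocabulary for this toy
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:
            targets[i] = tok            # position the model must predict
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token (but still predict it)
    return corrupted, targets

tokens = "which candidate handled the race question best".split()
corrupted, targets = mlm_corrupt(tokens)
print(corrupted, targets)
```

The model is then trained to recover the original tokens at the sampled positions from the corrupted sequence.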

Text-to-Pivot and Pivot-to-Text Tasks
The language modeling tasks only require nonparallel data. To leverage the parallel text-pivot data, we introduce text-to-pivot and pivot-to-text tasks.
Denoting (X, Z) as a parallel text-pivot sample, the training objective of text-to-pivot (t2p) is to maximize the log likelihood:

max L_t2p = log P(Z | X; θ)   (4)

Similarly, denoting (Z, Y) as a parallel pivot-text sample, the training objective of pivot-to-text (p2t) is:

max L_p2t = log P(Y | Z; θ)   (5)

The final objective is the sum of L_LM, L_t2p and L_p2t.

Tag and Indicator Embeddings
We add a special tag at the beginning of each sentence to specify the type of representation, for example, amr for AMR sequences and en for English sentences.
At the inference stage, we set the first token of the decoder to en to force the model to produce sentences in text form again, which are then considered as the paraphrases of the input sentences.
However, we find that the tag does not always guarantee the type of the sentences produced by the model. To keep the output type consistent, we follow Conneau and Lample (2019) and concatenate an indicator embedding onto each word embedding. Concretely, denoting the word embedding of the i-th AMR token as e_i and the indicator embedding for AMR as a_amr, we concatenate the two and feed [e_i, a_amr] as the input to the model.
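A minimal sketch of this concatenation, with made-up embedding dimensions:

```python
# Concatenate a word embedding with a type-indicator embedding.
# The 4-d word vector and 2-d indicators are hypothetical; in the
# real model both are learned parameters.
def with_indicator(word_emb, indicator_emb):
    return word_emb + indicator_emb   # vector concatenation

e_i = [0.1, 0.2, 0.3, 0.4]   # embedding of the i-th AMR token
a_amr = [1.0, 0.0]           # indicator: "this token is AMR"
a_en = [0.0, 1.0]            # indicator: "this token is English"

x = with_indicator(e_i, a_amr)
print(x)  # [0.1, 0.2, 0.3, 0.4, 1.0, 0.0]
```

Unlike the sentence-initial tag, the indicator is attached to every token, so the type signal cannot be "forgotten" as decoding proceeds.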

Datasets
In this paper, we conduct experiments on the Parabank (Hu et al., 2019) and Quora 2 datasets, two benchmarks for the paraphrase generation task.
Parabank is a large-scale paraphrasing dataset from the general (news) domain. We use the officially released test set, which contains 36,417 samples, to evaluate the models; the average sentence length in Parabank is 21.34.
The Quora dataset contains over 155,000 paraphrased question pairs from the Quora forum 3 . We adopt the Quora test set, which contains 4,000 samples with an average sentence length of 10.05, to evaluate the models.
We utilize the WMT14 EN-DE dataset to train the machine translation system. As for AMR and UD, the gold parallel datasets are AMR 2.0 (LDC2017T10) and EWT (LDC2012T13). Since these corpora come from domains similar to Parabank's, Parabank can be regarded as the in-domain test set and Quora as the out-of-domain test set, allowing us to evaluate the domain robustness of pivot-based models.

Competitive Methods
We investigate and compare the performance of pipeline methods as well as end-to-end methods. Pipeline methods include Pipeline-language, Pipeline-AMR and Pipeline-UD, which are described in Section 3. End-to-end methods consist of E2E-language, E2E-AMR and E2E-UD, which leverage language, AMR and UD as the pivot respectively and apply the end-to-end framework described in Section 4. Besides, we also compare these unsupervised methods with a supervised encoder-decoder (Enc-dec) model based on the Transformer, trained with parallel paraphrase pairs from the training set of Parabank/Quora. By analyzing the performance of these models, we want to examine (1) whether AMR or UD can serve as the pivot for paraphrase generation, (2) whether the end-to-end framework benefits paraphrase generation, and (3) whether zero-shot methods can obtain paraphrases as good as the supervised model.

Evaluation Metrics
We evaluate the paraphrasing models from the following aspects: (1) Fidelity, i.e., the semantic consistency between generated paraphrase and the original sentence. (2) Diversity, i.e., the degree of change in expression between the generated paraphrase and original sentence. (3) Fluency, i.e., the fluency of the generated paraphrase. (4) The number of parallel samples used for training the paraphrasing system.
To evaluate fidelity automatically, we use BERTScore, which has been widely used to evaluate semantic similarity (Mager et al., 2020a; Cao and Wan, 2020; Dong et al., 2021).
To evaluate diversity automatically, we calculate "Self-BLEU", i.e., the BLEU-4 score between the output and input sentences. A high Self-BLEU score means that the output is similar to the input and the diversity is poor, and vice versa.
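A simplified, unsmoothed sentence-level Self-BLEU can be sketched as follows (real evaluations typically use a toolkit implementation with smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def self_bleu(output, source, max_n=4):
    """BLEU of `output` against `source`: clipped n-gram precisions
    up to max_n, geometric mean, brevity penalty; no smoothing."""
    out, src = output.split(), source.split()
    precisions = []
    for n in range(1, max_n + 1):
        out_ngr, src_ngr = Counter(ngrams(out, n)), Counter(ngrams(src, n))
        overlap = sum((out_ngr & src_ngr).values())   # clipped matches
        total = max(sum(out_ngr.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(src) / len(out)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(self_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

A verbatim copy of the input scores 1.0 (no diversity), while a paraphrase sharing no 4-grams with the input scores much lower.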
Besides the above automatic metrics, we also conduct a human evaluation of the quality of the paraphrases generated by each model. Concretely, we randomly sample 100 test instances from Parabank and 100 from Quora, and ask volunteers to score the outputs on the following aspects: (1) Fidelity, (2) Diversity, and (3) Fluency. The scores range from 1 to 5, with 5 being the best. We guarantee that each instance is scored by at least 3 human annotators.

Implementation Details
We use the fairseq toolkit (Ott et al., 2019) to implement Pipeline-language and all end-to-end models. We set the model hidden size and feed-forward hidden size to 512 and 2048, respectively, and the number of attention heads and layers to 8 and 6, respectively. We use the Adam optimizer (Kingma and Ba, 2014) for training, and adopt the warm-up learning rate technique (Goyal et al., 2017) for the first 4,000 steps.

Results and Analysis
The automatic evaluation results are shown in Table 1. The results of human evaluation on the Parabank and Quora test sets are shown in Table 2 and Table 3 respectively. We also calculate kappa coefficients to measure the consistency of the judges' evaluations.

Fidelity
The results in Table 1, Table 2 and Table 3 show that all models achieve comparable or superior fidelity scores compared to the reference and the supervised model (Enc-dec), except for the Pipeline-language model. By checking the output files, we find that this is partially because the Pipeline-language model may introduce semantic shift during two-step translation. Compared with Pipeline-language, Pipeline-AMR reduces semantic change, since AMR graphs preserve important words as concepts and thus preserve the original meaning. Pipeline-UD, LSR and E2E-UD achieve much higher fidelity scores than the other methods, even higher than the reference. This is because they produce sentences that are very similar to the source sentence, sometimes even copying the whole source sentence entirely, so their outputs hardly change the semantics of the sentence, yielding high fidelity scores. Compared to Pipeline-language, E2E-language achieves much higher fidelity scores, as end-to-end models do not require explicitly translating texts into pivots and can thus better preserve semantic information. However, E2E-AMR does not outperform Pipeline-AMR, which also demonstrates that the Pipeline-AMR method does not change semantics substantially.

Diversity
As for the Pipeline-UD, LSR and E2E-UD models, the high Self-BLEU scores in Table 1 and the low Diversity scores in Table 2 and Table 3 reveal that paraphrases predicted by these three models are usually copied from the input texts.
On Parabank, Pipeline-AMR achieves a diversity score similar to Pipeline-language. Besides, both Pipeline-AMR and Pipeline-language achieve better diversity than E2E-language, E2E-AMR and Enc-dec on Parabank, revealing that the pipeline process can generate more diverse sentences. However, on Quora, the diversity of Pipeline-AMR is similar to that of E2E-AMR and E2E-language and far lower than that of Pipeline-language. This is because syntactic information is removed in AMR, so Pipeline-AMR tends to produce syntactically diverse sentences, whereas Pipeline-language is more likely to replace words or phrases with their synonyms. Texts in Quora are shorter and simpler than those in Parabank, making it harder for the model to produce syntactically diverse output. Thus the diversity score of Pipeline-AMR on Quora is similar to E2E-language and E2E-AMR and lower than Pipeline-language.

Fluency
The fluency scores in Table 2 and Table 3 show that all models can generate fluent texts, especially Pipeline-AMR, Pipeline-UD and LSR. With the language modeling tasks, E2E-pivot models can also generate fluent texts. Pipeline-language performs worst in fluency among these models on the Parabank dataset, since the translation systems may sometimes generate irrelevant words and phrases, which affects both fidelity and fluency. On the Quora dataset, Pipeline-language and E2E-UD get the lowest scores, which shows that these two methods may generate incoherent sentences and are sensitive to domains.

Table 4: Paraphrases generated for a Quora example.
Source Text: which candidate handled the race question best during the first presidential debate ?
Reference: who provided a better response to the question regarding race relations in the us during the first presidential debate ?
Enc-dec: which candidate deals with the question of the race best in the first presidential debate ?
Pipeline-language: Which candidate has treated the symptoms best in the first half of the year ?
Pipeline-AMR: Which candidate best handled the race question in the first presidential debate ?
Pipeline-UD: which candidate handled the race question best during the first presidential debate ?
LSR: which candidate handled the race question best during the first presidential debate ?
E2E-language: what candidate did the race question best in the first presidential debate ?
E2E-AMR: candidate did the race question best in the first presidential debate ?
E2E-UD: which candidate handled the race question best during the first presidential debate ?

Number of parallel samples required
We also analyze the cost of training each model. In terms of the number of samples used for training, training the machine translation models in the Pipeline-language method requires much more gold parallel data than the semantics- or syntax-based models.
For training the AMR-based models, the text-to-AMR model leverages 2M silver training samples and the AMR-to-text model uses 3.9M silver training samples; both also leverage the gold data with 36K samples. For training the UD-based models, we only use about 12K training samples. Since training auto-encoder models does not require any parallel samples, we can easily construct auto-encoder training samples from any non-parallel texts. In contrast, for training the language-based models, we use more than 4.5M training samples to obtain a well-performing NMT model. It has been shown that the performance of NMT models is greatly limited by the number of training samples (Koehn and Knowles, 2017); when the number of samples is small, the performance of NMT models will be greatly reduced.

Table 5: Paraphrases generated for a Parabank example.
Source Text: There may be a hundred crimes in the background , but it is only on this one that they can be tried .
Reference: There may be a hundred felonies in the background , but they can only be tried on this one .
Enc-dec: there may be hundreds of crimes in the background , but they can only be tried on this one .
Pipeline-language: There may be a hundred crimes in the background , but only on this basis can they be condemned .
Pipeline-AMR: 100s of crimes in the background, but they can only be tried in this one.
Pipeline-UD: there may be a hundred crimes in the background , but it is only on this one that they can be tried .
LSR: there may be a hundred crimes in the background , but it is only on this one that they can be tried .
E2E-language: we may have a hundred crimes in the background , but we can only try them on this one .
E2E-AMR: there may be hundreds of crimes in the background , but they can only be tried on this one .
E2E-UD: there may be a hundred crimes in the background , but it is only on this one that they can be tried .

Summary of Observations
In sum, we draw the following conclusions: • The Pipeline-language method generates paraphrases with low fidelity and fluency scores; it is more likely to change the semantics of sentences and is more sensitive to domains. The E2E-language method can alleviate the semantic changes to some extent, generating paraphrases of good quality.
• The UD-based and LSR methods tend to generate paraphrases with few changes in expression compared to the original sentence. However, they require far fewer human-annotated parallel samples for training compared to other methods.
• AMR-based methods perform well in fidelity, diversity, and fluency, which indicates that language is not the only optional pivot and using AMR as the pivot is also a good choice for pivot-based paraphrase generation systems.
• Compared to the Enc-dec method, Pipeline-AMR, E2E-language and E2E-AMR methods can generate paraphrases with similar fidelity, diversity and fluency scores, which indicates that parallel paraphrasing data may not be necessary for generating high-quality paraphrases.
Case Analysis
Table 4 shows an example from Quora, consisting of the paraphrases predicted by all competitive methods mentioned in Section 5.2. In this case, Pipeline-UD, LSR and E2E-UD generate the same sentence as the original. Pipeline-language is the only model that fails to preserve the semantics, due to the error propagation of machine translation systems. Table 5 shows another example, from Parabank. It reveals that Pipeline-AMR tends to paraphrase texts syntactically, while Pipeline-language tends to paraphrase texts by replacing words with their synonyms (e.g. "tried" and "condemned").
Related Work
Some translation-based models have also been proposed for paraphrase generation (Wieting and Gimpel, 2018; Wieting et al., 2017; Guo et al., 2019). Wieting et al. (2017) and Wieting and Gimpel (2018) select different languages as pivots to generate multiple and diverse paraphrases. Considering that two-step translation may incur semantic shift, Guo et al. (2019) build a Transformer-based language model and pre-train it on concatenated bilingual parallel sentences.

Text-to-AMR and AMR-to-Text
As for AMR parsing, some previous works (Flanigan et al., 2014; Lyu and Titov, 2018; Zhang et al., 2019a) first project words to AMR concepts and then identify the relations. Transition-based models are widely applied (Wang et al., 2015a,b; Damonte et al., 2017; Liu et al., 2018; Guo and Lu, 2018; Naseem et al., 2019). With the rapid development of sequence-to-sequence models, many works leverage them to parse texts into AMRs. Some works (Konstas et al., 2017; van Noord and Bos, 2017; Ge et al., 2019; Xu et al., 2020) linearize AMR graphs and directly use sequence-to-sequence models. Others (Zhang et al., 2019b; Cai and Lam, 2020a) use a sequence-to-sequence model to predict concepts, and jointly train the model to identify the relations.
The UD-to-text task was introduced in the Surface Realisation Shared Task (Mille et al., 2018, 2019, 2020). Several works (Ferreira et al., 2018; Castro Ferreira and Krahmer, 2019; Elder, 2020; Farahnak et al., 2020) first linearize UD trees without word reordering and then feed the linearized trees to sequence-to-sequence models or statistical machine translation models to generate texts. Others (Cabezudo and Pardo, 2018; Yu et al., 2019; Recski et al., 2020; Yu et al., 2020) first reorder the words in UD trees with neural models, followed by word inflection.

Conclusions and Future Work
In this work, we focus on pivot-based paraphrase generation. Previous works leverage language as the pivot, which may introduce semantic shift. In this work, we explore whether we can use AMR or UD as the pivot instead. We also explore an end-to-end framework in a zero-shot way, using only parallel text-pivot data. Results of the automatic metrics and human evaluations show that AMR is a good choice of pivot, as AMR graphs preserve important words as concepts and thus preserve semantics. Moreover, replacing the two-step pipeline process with the end-to-end framework is beneficial when language is the pivot, as it reduces semantic change. Besides, some unsupervised pivot-based methods can perform as well as supervised paraphrase models. In the future, we will focus on the zero-shot paraphrase generation task and explore more semantic representations as pivots for pivot-based paraphrase generation.