Neural-Driven Search-Based Paraphrase Generation

We study a search-based paraphrase generation scheme where candidate paraphrases are produced by iterated transformations of the original sentence and evaluated in terms of syntax quality, semantic distance, and lexical distance. The semantic distance is derived from BERT, and the syntax quality is assessed with GPT2 perplexity. To solve this multi-objective search problem, we propose two algorithms: Monte-Carlo Paraphrase Generation (MCPG) and Pareto Tree Search (PTS). We provide an extensive set of experiments on 5 datasets, with a rigorous reproduction and validation of several state-of-the-art paraphrase generation algorithms. These experiments show that, although not explicitly supervised, our algorithms perform well against these baselines.


Introduction
Paraphrase generation, i.e. the transformation of a sentence into a well-formed but lexically different one while preserving its original meaning, is a fundamental task of NLP. Its ability to provide diversity and coverage finds applications in several domains like question answering (McKeown, 1979; Harabagiu and Hickl, 2006), machine translation (Callison-Burch et al., 2006), dialog systems (Yan et al., 2016), privacy (Gröndahl and Asokan, 2019a) or adversarial learning (Iyyer et al., 2018).
The formal definition of a paraphrase may vary according to the targeted application and the tolerance we set along several axes, including the semantic distance from the source, which we want to minimize, the quality of the syntax, and the lexical distance from the source, which we want to maximize to ensure diversity.
The available aligned paraphrase corpora are often biased toward specific problems like question answering or image captioning. For instance, a transformer or a seq2seq model trained on a question-answering corpus will typically turn any input sentence into a question. Given the lack of generic aligned datasets, it remains challenging to train generic paraphrase models in a supervised manner.
On the other hand, with the availability of large-scale unsupervised language models like BERT and GPT2, the assessment of a given candidate paraphrase in terms of semantic distance from its source and syntactic quality has become much more tractable.
Leveraging these metrics, we propose to cast the paraphrase generation task as a multi-criteria search problem. We use PPDB 2.0 (Pavlick et al., 2015), a large-scale database of rewriting rules derived from bilingual corpora, to potentially generate billions of 'naive' candidate paraphrases by editing the source sentence. To efficiently separate the good candidates from the others, we experiment with two search algorithms. The first one, called Monte-Carlo Paraphrase Generation (MCPG), is a variant of the Monte-Carlo Tree Search algorithm (MCTS) (Kocsis and Szepesvári, 2006; Gelly and Silver, 2007; Chevelu et al., 2009). The MCTS algorithm is famous for its successes in mastering the highly combinatorial game of Go (Gelly and Silver, 2007; Silver et al., 2016).
The second one is a novel search algorithm that we call Pareto Tree Search (PTS). In contrast to MCTS, which is a single-criterion search algorithm, we designed PTS to retrieve an approximation of the whole Pareto-optimal set. This allows for more flexibility in paraphrase generation, where the balance between semantic distance, syntax quality, and lexical distance is hard to tune a priori. Another difference between MCPG and PTS is that PTS uses a randomized breadth-first exploration policy, which proves to be more efficient on this problem.
The main contribution of this article is a study on search-based paraphrase generation through the sieve of three criteria: semantic similarity, syntax quality, and lexical distance. We propose and evaluate two search algorithms: MCPG and PTS. We also provide an extensive set of experiments on English datasets with a rigorous reproduction and validation methodology for several state-of-the-art paraphrase generation algorithms.

Table 1: Examples of PPDB rewriting rules applied to the source sentence "he is speaking on june 14 ."

    PPDB rule                   Edited sentence
    is → is found               he is found speaking on june 14 .
    is speaking → 's talking    he 's talking on june 14 .
    speaking → speak now        he is speak now on june 14 .
    14 → 14th                   he is speaking on june 14th .

This article is organized as follows: the candidate paraphrase generation scheme is presented in Section 2. In Section 3, we develop the different criteria we use to qualify correct paraphrases. The two search algorithms (MCPG and PTS) are described in Section 4. Section 5 gives a survey of other state-of-the-art paraphrase generation algorithms. The comparisons with unsupervised and supervised baselines are presented in Section 6, and the methodology is discussed in Section 6.3. We conclude with a data-augmentation experiment in Section 6.6.

Paraphrase generation scheme
We model paraphrase generation as a sequence of edits and transformations turning a source sentence into its paraphrase. In this work, we only consider local transformations, i.e. replacements of certain words or groups of words by others that have the same or similar meanings, but the method we propose should work with more sophisticated transformations as well.
The Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Pavlick et al., 2015) is a large collection of scored paraphrase rules that was automatically constructed from various bilingual corpora using a pivot alignment method (Callison-Burch, 2008). The database is divided into increasingly large and decreasingly accurate subsets. We used the XL subset, and we removed the rules labeled as "Independent". This left us with a set of 5.5 million rewriting rules. We give some examples of these rules in Table 1.
By iteratively applying the rules from a source sentence like the one in Table 1, we obtain a vast lattice of candidate paraphrases. Some of these candidates, like "he's talking on june 14", are well-formed, but many are syntactically broken, like "he is speak now on june 14".
The number of rules that apply depends on the source sentence's length and the words it contains. For instance, on the MSRPARAPHRASE dataset (see Section 6.2), sentences are quite long and the median number of PPDB-XL rules that apply is around 450. After two rewriting steps, the median number of candidates is around 10^5, and by iterative rewriting, we quickly reach more than 10^8 paraphrase candidates.
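To make the rewriting scheme concrete, here is a minimal sketch of this candidate expansion. The toy in-memory rule list stands in for the 5.5 million PPDB-XL rules, and the function names are ours; this is an illustration, not the actual system:

```python
from itertools import islice

# Hypothetical miniature rule set in PPDB style: (source phrase, target phrase).
RULES = [
    ("is speaking", "'s talking"),
    ("speaking", "speak now"),
    ("14", "14th"),
]

def apply_rules(sentence):
    """Yield every sentence obtained by a single rule application."""
    for src, tgt in RULES:
        start = 0
        while (idx := sentence.find(src, start)) != -1:
            yield sentence[:idx] + tgt + sentence[idx + len(src):]
            start = idx + 1

def expand(sentences, depth):
    """Breadth-first expansion of the candidate lattice up to `depth` steps."""
    seen = set(sentences)
    frontier = set(sentences)
    for _ in range(depth):
        frontier = {c for s in frontier for c in apply_rules(s)} - seen
        seen |= frontier
    return seen

candidates = expand({"he is speaking on june 14 ."}, depth=2)
print(len(candidates), list(islice(candidates, 3)))
```

With the full PPDB-XL rule set, this breadth-first expansion is what produces the combinatorial explosion of candidates described above.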

Paraphrase selection criteria
Because it depends on the type of text considered, which may be spoken or written, casual or formal, it is not easy to define a universal semantic distance or a universal scale of well-formed syntax. However, recent advances in NLP with neural networks like BERT and GPT2, trained on huge corpora, have led to the development of metrics that can act as good proxies for these ideal notions.
For the semantic distance, a quick experiment confirms that the BERT score (Zhang et al., 2019a) performs well on difficult paraphrase identification tasks. The BERT score is an F1-measure over an alignment of the BERT contextual word embeddings of the two sentences. To assess the sensitivity of this score, we computed the Area Under the ROC Curve (AUC) on QQP and PAWS, two difficult paraphrase identification corpora (Kornél Csernai, 2017; Yang et al., 2019b). We obtained 75.2% on QQP and 67.0% on PAWS. The PAWS corpus being designed to trick paraphrase identification classifiers, 67.0% is a reasonable performance. We hence opted for the BERT score between the source sentence and the paraphrase candidate (denoted BERT_S) as our semantic score.
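As an illustration, here is a minimal BERT_S computation assuming the open-source bert-score package; the exact model and version choices behind our experiments are not specified here:

```python
# Sketch using the `bert-score` package (pip install bert-score).
from bert_score import score

source = ["he is speaking on june 14 ."]
candidate = ["he 's talking on june 14 ."]

# F1 over greedy alignments of BERT contextual embeddings; we use it as BERT_S.
P, R, F1 = score(candidate, source, lang="en")
print(f"BERT_S = {F1.item():.3f}")
```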
Regarding the syntax quality, the perplexity of GPT2 (Radford et al., 2019) is a good ranking criterion, although, as illustrated in Table 2, a rule-based spell-checker may in some cases detect errors that GPT2 would miss (and the reverse is also true). We hence opted for GPT2 as the primary criterion for syntax quality, combined with the LANGUAGETOOL spell-checker (Naber, 2003), which we only use in a second stage for performance reasons.
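A common way to compute this criterion with the transformers library is sketched below; the exact normalization of the perplexity used in our pipeline is an assumption left out here:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # The model shifts labels internally; loss is the mean token NLL.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(gpt2_perplexity("he is speaking on june 14 ."))   # lower = more fluent
print(gpt2_perplexity("he is speak now on june 14 ."))  # should be higher
```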
The lexical distance is important to ensure the diversity of the produced paraphrases. It is, however, simple to measure: some authors use the surface BLEU metric (Miao et al., 2018a); we opted here for the normalized character-level Levenshtein edit distance.
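For completeness, a small self-contained sketch of this normalized character-level distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def lev_s(source: str, candidate: str) -> float:
    """Normalized distance in [0, 1]: 0 = identical, 1 = entirely different."""
    return levenshtein(source, candidate) / max(len(source), len(candidate), 1)

print(lev_s("he is speaking on june 14 .", "he 's talking on june 14 ."))
```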
The balance between these criteria is difficult to strike. Table 2 illustrates their impact on a sentence taken from the train set of MSRPARAPHRASE. The candidate examples in this table underline the tough dilemma between maximizing the semantic similarity (a safe and conservative policy) and maximizing the lexical distance (a risk-prone one). The third and fourth examples underline the utility of the spell-checker: some low-perplexity candidates are ill-formed. In the second part of the table, we show the sentences chosen by our two models, MCPG and PTS.

Search algorithms
Searching for good paraphrases in the large lattice of candidates generated by PPDB is a costly task. We propose two algorithms that share a similar structure: an outer loop explores the lattice at different depths, while an inner loop explores the candidates at each depth. Both algorithms are anytime: they return the best solutions found so far when the time or space budget is depleted.
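The shared structure can be sketched as follows. This is a hypothetical skeleton written for illustration (the names `expand` and `evaluate` are ours), not the implementation of either algorithm:

```python
import time

def anytime_search(source, expand, evaluate, budget_s=5.0, beam=50):
    """Outer loop walks the lattice depth by depth; inner loop scores the
    candidates at each depth; the best solutions found so far are returned
    as soon as the time budget is depleted (anytime behavior)."""
    best = [(evaluate(source), source)]
    frontier = [source]
    deadline = time.monotonic() + budget_s
    while frontier and time.monotonic() < deadline:            # outer loop: depth
        scored = []
        for cand in (c for s in frontier for c in expand(s)):  # inner loop
            scored.append((evaluate(cand), cand))
            if time.monotonic() >= deadline:
                break
        scored.sort(reverse=True)
        frontier = [c for _, c in scored[:beam]]   # keep the promising candidates
        best = sorted(best + scored, reverse=True)[:10]
    return best
```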

MCPG: Monte-Carlo Tree Search for Paraphrase Generation
Following the idea of Chevelu et al. (2009), we used Monte-Carlo Tree Search (MCTS) to explore the PPDB lattice. The three key ingredients of MCTS are: a bandit policy at each node of a search tree to select the most promising paths, randomized rollouts to estimate the quality of these paths, and backpropagation of the rewards along the paths to update the bandits. We opted here for a randomized bandit policy called EXP3 (Auer et al., 2002). Since the MCTS algorithm is not designed for multi-objective problems, we needed to combine the semantic similarity BERT_S, the syntax correctness GPT2, and the surface diversity Lev_S into a single criterion. We opted for the following polynomial:

    score = α · BERT_S + β · Lev_S · BERT_S - γ · GPT2    (1)

where the product Lev_S · BERT_S is intended to avoid a trivial maximization of the score by applying many edits to the source sentence. After a few experiments on the train sets, we empirically tuned the weights to α = 3, β = 0.5 and γ = 0.025 in order to obtain a balance like the one described in Table 2.
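For illustration, here is a minimal sketch of the combined reward and of an EXP3 bandit. It assumes the additive form of equation (1) as reconstructed above and a reward rescaled into [0, 1]; it is a sketch, not the MCPG implementation:

```python
import math
import random

def combined_score(bert_s: float, gpt2: float, lev_s: float,
                   alpha: float = 3.0, beta: float = 0.5,
                   gamma: float = 0.025) -> float:
    # Assumes the additive form of (1); GPT2 is the normalized perplexity,
    # so it enters with a negative weight.
    return alpha * bert_s + beta * lev_s * bert_s - gamma * gpt2

class Exp3:
    """Minimal EXP3 bandit (Auer et al., 2002) over k arms, the kind of
    randomized policy placed at each node of the search tree."""
    def __init__(self, k: int, gamma: float = 0.1):
        self.k, self.gamma = k, gamma
        self.weights = [1.0] * k

    def probs(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def draw(self) -> int:
        return random.choices(range(self.k), weights=self.probs())[0]

    def update(self, arm: int, reward: float):
        # Importance-weighted reward estimate keeps the update unbiased;
        # rewards are assumed to be rescaled into [0, 1].
        x_hat = reward / self.probs()[arm]
        self.weights[arm] *= math.exp(self.gamma * x_hat / self.k)
```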

PTS: Pareto Tree Search in the paraphrase lattice
We observed two drawbacks of MCTS. First, it was designed for combinatorial problems like Go, where evaluation is only possible at the leaves of the search tree. This is not the case for paraphrase generation, where the neural models can evaluate any rewriting step and where rewriting from good candidates is more likely to produce good paraphrases than rewriting from bad ones. Secondly, it was designed for single-criterion search, which requires fixing the balance between the criteria definitively before any paraphrase search begins. This is not very flexible, and it becomes painful when we want to generate sets of candidates.
By plotting the distributions of the scores as in Figure 1, we noticed that most of the candidates were dominated in the Pareto sense: it was possible to eliminate most of them without any hyper-parameter tuning. Hence, we adapted MCPG to explore the paraphrase lattice and recover an approximation of the Pareto front, postponing the balance between criteria to a quick post-optimization stage. This led us to the PTS algorithm, described as pseudo-code in Table 3.

Figure 1: The optima of any positive combination of the BERT score, the normalized Levenshtein distance, and the GPT2 perplexity belong to the Pareto front (orange dots). We plotted the projections of the MCPG combined score (1) with dashed isolines. The BERT score and the Levenshtein distance being clearly anti-correlated, the balance between these two criteria is difficult to tune.

Table 2: An example of a source sentence sampled from the MSRPARAPHRASE train set with some representative candidates from its PPDB rewriting graph. For each candidate, we computed the BERT score with respect to the source (BERT_S), the normalized GPT2 perplexity (GPT2), the Levenshtein distance from the source sentence (Lev_S), and the number of spell and grammar errors detected (Errs). The PPDB edits are highlighted in green; the detected spell and grammar errors are highlighted in purple. The first candidate maximizes the semantic similarity (BERT score) and is very conservative. The second one maximizes the surface diversity (Lev_S) and takes a lot of risks. The third one shows an ill-formed paraphrase candidate. The fourth candidate emphasizes the utility of the spell-checker. The last candidate, on the fifth row, achieves our equilibrium goal. In a second part, we show the paraphrases generated by our models MCPG and PTS, and in a third part, we give the reference paraphrase from the dataset.
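Since PTS recovers an approximation of the Pareto front, its core filtering step is the computation of a non-dominated set. Below is a minimal self-contained sketch of that step (not the pseudo-code of Table 3); the scores are oriented so that higher is always better, hence the negated perplexity:

```python
def dominates(a, b):
    """a dominates b if it is at least as good on every criterion
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """`candidates` maps sentence -> (BERT_S, -GPT2, Lev_S).
    Keep only the non-dominated candidates."""
    front = {}
    for sent, score in candidates.items():
        if any(dominates(other, score) for other in front.values()):
            continue  # dominated by a member of the current front
        front = {s: v for s, v in front.items() if not dominates(score, v)}
        front[sent] = score
    return front

pool = {
    "he is speaking on june 14 .":  (1.00, -20.0, 0.00),
    "he 's talking on june 14 .":   (0.95, -25.0, 0.30),
    "he is speak now on june 14 .": (0.90, -80.0, 0.15),  # dominated by the second
}
print(list(pareto_front(pool)))
```

The balance between criteria then reduces to a quick scan of this (much smaller) front in a post-optimization stage.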

Related work
Rule-based and statistical approaches Following the path of machine translation, the paraphrase generation literature first evolved from laboriously handcrafted linguistic rules (McKeown, 1979; Meteer and Shaked, 1988; Chandrasekar and Srinivas, 1997; Carroll et al., 1999) to more automatic, data-driven rule extraction methods (Callison-Burch et al., 2006; Madnani and Dorr, 2010). As in machine translation, phrase-level substitution rules can be extracted from parallel corpora by sub-sentence alignment algorithms (Brown et al., 1993). Since building such a dedicated corpus is a long and costly task, one usually transforms other corpora through a "pivot representation" (Barzilay and McKeown, 2001; Callison-Burch et al., 2006; Ganitkevitch et al., 2013; Chen et al., 2015). The weakness of these approaches is that phrase-level rewriting rules alone are not able to build coherent sentences. A typical data-driven paraphrase generator used to be a mixture of potentially noisy handcrafted and data-driven rewriting rules, coupled with a score that had to be optimized in real-time through dynamic programming. However, dynamic programming methods like Viterbi are constrained by the requirement of a score that decomposes into a sum of word-level or phrase-level criteria (Xu et al., 2016). Some attempts were made to relax this constraint with search-based approaches (Chevelu et al., 2009; Daumé et al., 2009), but the globally optimized criteria were simplistic, and the obtained solutions were not suitable for practical deployment.
Supervised encoder-decoder approaches Like machine translation, paraphrase generation benefited from deep neural networks and evolved to efficient end-to-end architectures that can jointly learn to align and translate (Bahdanau et al., 2016; Vaswani et al., 2017). Several papers (Prakash et al., 2016a; Cao et al., 2017) cast the paraphrase generation task as a supervised sequence-to-sequence problem. As confirmed by our experiments in Section 6.4, this approach is efficient for specific types of paraphrases. It is also able to produce relatively long-range transformations, but it requires huge, high-quality sentence-level aligned datasets for training.
The paraphrase generation literature mostly reports results on the MSCOCO (Chen et al., 2015) and QQP (Kornél Csernai, 2017) datasets, which are built respectively from image captions and semi-duplicate questions. These datasets are very specific: MSCOCO is strongly biased toward image description sentences, and QQP is dedicated to questions. The Transformer model we trained on QQP typically transforms any input into a question. For instance, from "He is speaking on june 14." it produces "Who is speaking on june 14?".

Search-based approaches Search-based methods regained interest in the text generation community for several reasons, including the need for flexibility and the fact that, with deep neural networks, the search evaluation criteria have become more reliable. These methods are often slower than autoregressive text generation methods, but it is always possible to distill the models into faster ones, as we do in Section 6.6. Gröndahl and Asokan (2019b) deployed a search-based policy for style imitation, and Schwartz and Wolter (2018) and Kumagai et al. (2016) both used MCTS for text generation. Following the same trend, Miao et al. (2018a) proposed to use Metropolis-Hastings sampling (Metropolis et al., 2004) for constrained sentence generation in an algorithm called CGMH.
Starting from the source sentence, the CGMH algorithm samples a sequence of sentences using local edits: word replacement, deletion, and insertion. For paraphrase generation, CGMH constrains the sentence generation with a matching function that combines a measure of semantic similarity and a measure of English fluency. This model is therefore directly comparable with our MCPG and PTS approaches.
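For intuition, here is a toy Metropolis-Hastings loop in the spirit of CGMH. The real algorithm uses proposal-corrected acceptance ratios and neural matching functions; this sketch assumes symmetric proposals, and the `propose` and `pi` callbacks are hypothetical:

```python
import random

def mh_paraphrase(source, propose, pi, steps=200):
    """`propose(x)` returns a local edit of sentence x (replace/insert/delete
    a word); `pi(x)` is an unnormalized score combining semantic similarity
    and fluency. Returns the last accepted sentence."""
    x = source
    for _ in range(steps):
        y = propose(x)
        # Symmetric-proposal acceptance; CGMH corrects for asymmetric proposals.
        if random.random() < min(1.0, pi(y) / max(pi(x), 1e-12)):
            x = y
    return x
```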

Experiments
The evaluation metrics and datasets are described in Sections 6.1 and 6.2 respectively. We paid attention to setting up a rigorous validation protocol for our experiments. The reproducibility and methodology issues that we faced are discussed in Section 6.3. We compare our models with state-of-the-art supervised methods and with CGMH, another search-based algorithm. The technical details of these algorithms are given in Section 6.4. The results are detailed in Section 6.5.

Evaluation metrics
We rely on standard machine translation metrics that compare the generated paraphrase to one or several ground-truth references (Olive et al., 2011). We report the surface metrics BLEU and TER, and the semantic metrics METEOR and BERT, the latter being the average BERT score of the generated sentence with respect to the reference sentences.

Table 4: Dataset statistics. 'Size' is the number of instances. 'Len' is the median number of words per sentence. We also report a rough distribution of the BERT score (B) and of the Levenshtein distance (L) computed on the corpora's paraphrase pairs. The 'PPDB@x' columns give the median number of candidates that PPDB-XL generates from one sentence at one and three rewriting steps respectively. The size of the PPDB rewriting lattice is correlated with the length of the sentences.

    Dataset      Size        Len   B       L       PPDB@1   PPDB@3
    OPUSPARCUS   4.2 · 10^7    6   36.0%    0.5%      113   1.5 · 10^4
    PAWS         3.6 · 10^5   20   100%    65.6%      289   1.3 · 10^6
    QQP          1.5 · 10^5   10   64.1%   21.8%      141   1.3 · 10^5
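For reference, a minimal scoring sketch with the sacrebleu package is shown below. We assume nothing here about the exact metric tooling or parameterization used in our pipeline, and METEOR is not part of sacrebleu:

```python
import sacrebleu

hypotheses = ["he 's talking on june 14 ."]
references = [["he is speaking on june 14th ."]]  # one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
ter = sacrebleu.metrics.TER().corpus_score(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}, TER = {ter.score:.1f}")
```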

Evaluation datasets
The MSCOCO-2017 dataset (Chen et al., 2015) contains image captions, under the assumption that captions associated with the same picture are paraphrases. The strengths of MSCOCO are its size and the fact that each source sentence is associated with four reference paraphrases. However, the sentences are biased towards a descriptive style, and the quality of the paraphrases is often questionable.
The OPUSPARCUS dataset (Creutz, 2018) was extracted from movie and TV show subtitles. It contains mostly informal dialogues.
The Quora Question Pairs dataset (QQP) (Kornél Csernai, 2017) is a paraphrase identification corpus dedicated to question-answering systems.

The MSRPARAPHRASE dataset (Dolan and Brockett, 2005) is mostly built from pieces of news. The sentences of this corpus are quite long. It is a small but high-quality paraphrase identification corpus that was labeled by humans.

The PAWS-wiki dataset (PAWS) (Yang et al., 2019b) is a paraphrase identification corpus that contains several lexically similar but hard-to-classify pairs like "Flights to Florida from New York" and "Flights from Florida to New York".

Methodology and reproducibility issues
In the paraphrase generation literature, most papers report results on the MSCOCO and QQP corpora. In Table 5, we provide the BLEU scores as reported by Prakash et al. (2016b), Gupta et al. (2017), Fu et al. (2019), Miao et al. (2018b) and Egonmwan and Chali (2019a). However, even if the dataset names coincide, and even if each evaluation methodology is correct on its own, the discrepancies between methodologies make these values impossible to compare with each other.
Regarding sentence lengths, Prakash et al. (2016b) and Gupta et al. (2017) truncated all sentences to 15 words. Fu et al. (2019) set the maximum length to 16, while Egonmwan and Chali (2019a) set it to 15 and 10 for the input and target sentences respectively. Knowing that roughly 56% (resp. 5%) of MSCOCO target sentences are strictly longer than 10 (resp. 15) words, these small changes can have a great impact on the results. The vocabulary considered also differs: Fu et al. (2019) used vocabularies of 8k and 11k tokens built from the train sets of QQP and MSCOCO respectively, whereas Egonmwan and Chali (2019a) used a vocabulary of approximately 15k words constructed on both the train and test sets.
The scripts used to compute the metrics also differed from one paper to another, and it is known that BLEU scores can vary wildly under different parameterizations (Post, 2018).
We used the original code when available; otherwise, we reproduced the models from the literature as faithfully as possible. This allowed us to run the exact same preprocessing, training, and testing pipeline for all our experiments. As a side effect, it provides a grounded benchmark of the methods we could test.

Baseline systems implementation
In the next subsections, we present the reimplemented encoder-decoder neural network architectures and the weakly-supervised paraphrase generator used as baselines in our experiments.
Supervised paraphrase generators As supervised baselines, we trained three neural network architectures that were previously reported to achieve good results on MSCOCO and QQP: a Seq2Seq architecture, a Residual LSTM architecture (Prakash et al., 2016b), and a TRANSFORMER model (Egonmwan and Chali, 2019a). We extended the experiments to the other aligned corpora: MSRPARAPHRASE, OPUSPARCUS and PAWS. To be more precise, we trained a 4-layer LSTM Seq2Seq with a bidirectional encoder and a decoder using attention. This architecture is reported as SEQ2SEQ in the results. We trained a 4-layer Residual LSTM Seq2Seq as introduced by Prakash et al. (2016b) and reproduced by Fu et al. (2019). This architecture is reported as RESIDUAL LSTM in the results. The results we obtained with this model are close to the ones reported by Fu et al. (2019).
Finally, we trained a TRANSFORMER using the transformer-base hyper-parameter set from Vaswani et al. (2017). This architecture is reported as TRANSFORMER BASE in the results.
For all the encoder-decoder experiments, we used the fairseq framework (Ott et al., 2019) that implements the SEQ2SEQ and TRANSFORMER architectures. We added our own implementation of the RESIDUAL LSTM architecture.
For preprocessing, we used the Moses tokenizer and subword segmentation following Sennrich et al. (2016b), using the subword-nmt library. The maximum sentence length is set to 1024 tokens, which is the default setting in fairseq. For decoding, we performed a beam search with a beam of size 5.
Weakly-supervised paraphrase generator For the weakly-supervised strategy CGMH introduced by Miao et al. (2018b), we used the official code. We managed to reproduce their results on QQP: on our test set, we achieve a BLEU score of 22.5 while they reported 18.8. We then extended the experiment to the other datasets and metrics.

Results
Table 6 summarizes the comparison between our models, the supervised encoder-decoder neural networks, and the weakly-supervised method CGMH. Overall, these results are mixed; it is however important to keep in mind that, contrary to the supervised baselines, which are retrained for each dataset, the parameters of the CGMH, MCPG and PTS models are left unchanged across datasets.

On the MSCOCO and QQP datasets, the supervised baselines achieve clearly better results, but MCPG and PTS achieve better results on OPUSPARCUS and PAWS, except for the BERT score, on which the TRANSFORMER model achieves similar results. On MSRPARAPHRASE, the encoder-decoder models perform poorly. This result can be explained by the small number of training examples available for this corpus (see Table 4). On the weakly-supervised side, MCPG and PTS outperform the CGMH baseline on all corpora except MSCOCO, where the results are similar.
These results show that even without specialized training sets, generic search-based methods are competitive for paraphrase generation. However, encoder-decoder networks show excellent performance on text generation and have the potential to produce more complex paraphrases than those obtained by the simple local transformations used in our models.
Training a general, all-purpose paraphrase generation network would require a huge volume of data, and there are still far fewer aligned corpora available for paraphrasing than for translation.

Data-augmentation experiment
In order to get the best of both worlds, one option is to enrich the training set of a TRANSFORMER with the results of a search-based method.
To test this idea, we used our models to augment the MSRPARAPHRASE training set. For that purpose, we created new pairs of paraphrases from unused sentences of MSRPARAPHRASE (the pairs labeled as "not paraphrases") using MCPG and PTS. We then trained new supervised TRANSFORMER models on the augmented training sets.
Having no guarantee that our models generate syntactically perfect sentences, we inverted the pairs, taking the generated paraphrases as input and the dataset's sentences as output. This trick, called back-translation, forces the target model to generate correct sentences (Sennrich et al., 2016a; Edunov et al., 2018).
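A minimal sketch of this inversion is shown below; `paraphrase` stands for any generator such as MCPG or PTS, and the helper name is ours:

```python
def build_augmented_pairs(original_pairs, extra_sentences, paraphrase):
    """Extend a (source, target) training set with back-translation-style pairs:
    generated paraphrases become inputs, original human sentences become targets."""
    augmented = list(original_pairs)
    for sentence in extra_sentences:
        candidate = paraphrase(sentence)
        # Inverted direction: noisy candidate -> clean human sentence,
        # so the trained model learns to emit well-formed outputs.
        augmented.append((candidate, sentence))
    return augmented
```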
We report the results of this experiment in Table 7. The models trained with the augmented training sets achieved a significant performance gain on BLEU, TER and METEOR.

Table 7: Data-augmentation experiment summary. We trained a TRANSFORMER base model on three versions of the MSRPARAPHRASE set: the original train set (ORIG) and the original train set extended by paraphrasing other sentences from the same distribution with the MCPG (ORIG + MCPG) and PTS (ORIG + PTS) models.

Conclusion
We experimented with two search-based approaches for paraphrase generation. These approaches are pragmatic and flexible. Being generic, they did not overfit on small datasets. We performed extensive experiments with a rigorous evaluation methodology that we applied both to our algorithms and to the related methods that we tried to reproduce faithfully. These experiments confirm that our two methods, MCPG and PTS, are comparable to supervised state-of-the-art baselines despite being less tightly supervised. When compared with CGMH, another search-based and weakly-supervised method, our algorithms proved to be faster and more efficient. We plan to refine the scoring with deep reinforcement learning techniques and to enrich the edit rules with more sophisticated patterns like phrase permutations. Our search algorithms remain too slow for real-time deployment: the current versions are better suited as offline models for data augmentation. The experiment of Section 6.6 confirms that this application of search-based methods is a promising research avenue. A planning-then-realization hybridization like the ones proposed by Moryossef et al. (2019) and Fu et al. (2019) could also be considered in future work.