LEWIS: Levenshtein Editing for Unsupervised Text Style Transfer

Many types of text style transfer can be achieved with only small, precise edits (e.g. sentiment transfer from I had a terrible time... to I had a great time...). We propose a coarse-to-fine editor for style transfer that transforms text using Levenshtein edit operations (e.g. insert, replace, delete). Unlike prior single-span edit methods, our method concurrently edits multiple spans in the source text. To train without parallel style text pairs (e.g. pairs of +/- sentiment statements), we propose an unsupervised data synthesis procedure. We first convert text to style-agnostic templates using style classifier attention (e.g. I had a SLOT time...), then fill in slots in these templates using fine-tuned pretrained language models. Our method outperforms existing generation and editing style transfer methods on sentiment (Yelp, Amazon) and politeness (Polite) transfer. In particular, multi-span editing achieves higher performance and more diverse output than single-span editing. Moreover, compared to previous methods on unsupervised data synthesis, our method results in higher quality parallel style pairs and improves model performance.


Introduction
In text style transfer, a model changes the style of a source text (e.g. sentiment, politeness) into a target style, while otherwise changing as little as possible about the input. Many types of style transfer can be performed with only small, precise edits instead of generation from scratch. Consider the task of transforming a negative sentiment sentence such as the worst ribs I've ever had! into a positive sentence such as probably the best ribs ever!. Here, we need only invert the negative sentiment phrase around worst; the references to ribs should be left as-is. Recent and concurrent work on text style transfer proposes single-span editing (Wu et al., 2019; Malmi et al., 2020) as an alternative to generating the target text from scratch (Prabhumoye et al., 2018; He et al., 2020b; John et al., 2019; Shen et al., 2017; Fu et al., 2018).
We introduce a more flexible and powerful multi-span editing method that identifies multiple style-specific components of the text and concurrently edits them into the target style. Given a source text, we first predict the sequence of coarse-grain Levenshtein edit types (e.g. insert, replace, delete) that transform the source text into the target text, then fill insertion and replacement edits using a generator. In the previous example, the operations correspond to inserting the word probably before the, replacing worst with best, and removing the words I've and had. This example is illustrated in detail in Figure 1.
Learning to edit requires supervised source-target text pairs. How do we learn high-quality editors when no such supervised parallel data exists? Given a style text, we synthesize its pair by identifying style-specific content and replacing it with samples from style-specific masked language models. In our sentiment transfer example, the style-specific content of the sentence I had a great time at the theatre is had a great time. We can replace this phrase with saw a fantastic movie today to synthesize an alternative positive-sentiment sentence, or with got ripped off today to synthesize a negative-sentiment sentence. Figure 3 illustrates this example in detail.
We evaluate our editing and synthesis framework, which we call LEWIS (Levenshtein editing with unsupervised synthesis), on three style transfer tasks in sentiment (YELP, AMAZON) and politeness (POLITE) transfer, and achieve state-of-the-art results in terms of retention of style-agnostic content, similarity to the annotated target text, and transfer accuracy. LEWIS significantly outperforms prior state-of-the-art methods by 2.6-13.5% accuracy depending on the task. In further analyses, we show that (1) compared to concurrent work on editing for style transfer, our editor achieves 33.3% higher accuracy when trained on the same data; (2) compared to a competitive BART (Lewis et al., 2020) pure generation baseline, our editor achieves 5.8% higher accuracy when trained on the same data; and (3) compared to concurrent work on unsupervised synthesis of style transfer data, our synthesis procedure improves performance by 9.5 BLEU when used to train the same model. Our experiments show that our editor significantly outperforms prior pure generation and editing methods, that it yields more diverse transferred text, and that training on our synthesized data improves performance more than prior synthesis methods.

arXiv:2105.08206v1 [cs.CL] 18 May 2021

LEWIS
LEWIS consists of coarse-to-fine editing and data synthesis. The editing component, shown in Figure 1, performs local, precise edits of style-specific content of the source text to produce the target text. The data synthesis component, shown in Figure 3, produces supervised source-target text pairs, which do not exist naturally, to train the editor. To apply our method to transfer text from a source style to a target style, we first train style-specific masked language models, with which we synthesize source-target text pairs. We then compute Levenshtein operations for these source-target text pairs and train the coarse-to-fine editor to reproduce these operations. The full LEWIS pipeline is shown in Figure 2. For ease of exposition, we first describe the editor, then describe how to synthesize parallel data to train it.

Style transfer via Levenshtein editing
We propose a coarse-to-fine editor that first predicts coarse-grain Levenshtein edit types (Levenshtein, 1966), then fills in fine-grain edits with a generator. Figure 1 illustrates the editor.
Suppose we are to transfer text from a source style into a target style. Let x denote the source text, which we would like to edit into the target text y. In the example shown in Figure 1, we transform the source text the worst ribs I've ever had! into probably the best ribs ever! Our approach has two parts: the source text is first tagged with a sequence of coarse-grain Levenshtein transition types c that transform x into y. A generator then fills in phrases for insertion and replacement operations. The set of coarse Levenshtein transition types is insert, keep, replace, and delete. In the running example, the sequence of operations is to insert before the, replace worst, and delete I've and had.
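Concretely, the coarse-grain edit types can be derived from a standard dynamic-programming Levenshtein alignment over tokens. The sketch below is an illustrative implementation only; the tie-breaking order among equal-cost paths is our own assumption, since the paper does not specify one.

```python
def levenshtein_ops(src, tgt):
    """Compute a token-level Levenshtein edit script turning src into tgt.

    Returns a list of operations: ("keep", tok), ("replace", tok, new),
    ("delete", tok), ("insert", new).
    """
    m, n = len(src), len(tgt)
    # dp[i][j] = minimum number of edits turning src[:i] into tgt[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # keep or replace
                           dp[i - 1][j] + 1,        # delete from src
                           dp[i][j - 1] + 1)        # insert from tgt
    # Backtrace, preferring keep/replace over delete over insert on ties.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and src[i - 1] == tgt[j - 1] else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + sub:
            if sub == 0:
                ops.append(("keep", src[i - 1]))
            else:
                ops.append(("replace", src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", src[i - 1]))
            i -= 1
        else:
            ops.append(("insert", tgt[j - 1]))
            j -= 1
    return ops[::-1]


def apply_ops(ops):
    """Replay an edit script to produce the target token sequence."""
    out = []
    for op in ops:
        if op[0] in ("keep", "insert"):
            out.append(op[1])
        elif op[0] == "replace":
            out.append(op[2])
        # "delete" emits nothing
    return out
```

On the running example, this yields the insert/replace/delete script described above in four edit operations.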
First, we train a RoBERTa tagger (Liu et al., 2019) to generate these coarse edit types, which produces coarse edit types for each token in the source text. To accommodate the insertion operation, we produce two tags for each token. The first tag is a binary indicator of whether an additional phrase should be inserted before this token. The second tag is the non-insertion operation to take for this token. In the previous example, for instance, the word the triggers both insertion and keep operations.
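As a minimal sketch of how an edit script might be converted into this two-tags-per-token scheme (the "-" placeholder for "no insertion" is a hypothetical name, not the paper's notation):

```python
def ops_to_tags(ops):
    """Turn an edit script into two tags per source token: an insert-before
    indicator and a non-insertion operation. Insertions at the very end of
    the sentence would need an extra end-of-sequence position, which this
    sketch omits."""
    tags, pending_insert = [], False
    for op in ops:
        if op[0] == "insert":
            pending_insert = True  # attach to the next source token
        else:
            tags.append(("<ins>" if pending_insert else "-", op[0]))
            pending_insert = False
    return tags
```

On the running example, the produces the tag pair (<ins>, keep), since a phrase is inserted before it while the token itself is kept.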
Next, we train a fine-grain edit generator to produce the target text. Unlike the coarse-grain edit type tagger, which only observes the source text, the fine-grain edit generator observes both the original source text and the source text with the coarse-grain edit types applied, x_c. We use the edit types produced by the Levenshtein algorithm during training and the edit types predicted by the RoBERTa tagger during inference.
Our generator is a BART-based (Lewis et al., 2020) masked sequence-to-sequence model. The input to BART is the concatenation of the original source text x and the source text with the coarse-grain edit types applied, x_c. The generator is trained to fill in phrases for the coarse-grain edit types <ins> and <repl>. In the example, the generator is given the input text the worst ribs I've ever had ! SEP <MASK> the <MASK> ribs ever ! and fills in the two <MASK>s with probably and best respectively.
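The generator input can be assembled from the source tokens and the coarse tags roughly as follows. The SEP and <MASK> strings follow the example above; treating insertion and replacement positions uniformly as a single mask is our simplification of the scheme:

```python
def generator_input(src_tokens, tags, sep="SEP", mask="<MASK>"):
    """Concatenate the source text with its coarse-edit-typed version,
    masking insertion and replacement positions.

    tags is a list of (insert_indicator, operation) pairs, one per
    source token."""
    masked = []
    for tok, (ins, op) in zip(src_tokens, tags):
        if ins == "<ins>":
            masked.append(mask)  # placeholder for an inserted phrase
        if op == "keep":
            masked.append(tok)
        elif op == "replace":
            masked.append(mask)  # placeholder for the replacement phrase
        # "delete": token is dropped entirely
    return " ".join(src_tokens) + " " + sep + " " + " ".join(masked)
```

This reproduces the input string shown in the example above.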

Unsupervised synthesis of source-target style pairs
Training an editor requires large quantities of source-target text pairs. While there exists an abundance of style-specific data, parallel source-target pairs are difficult to collect and annotate. How do we train editing style transfer methods when no such data exists? We hypothesize that pretrained masked language models, when carefully constrained to generate only style-specific content, can provide high-quality source-target pairs for style transfer.
Our synthesis procedure, shown in Figure 3, is two-fold. First, given a text s from either style, we identify a style-agnostic template t, in which style-specific content is replaced with slots. For instance, for the style text I had a great time at the theatre, the style-agnostic template is I SLOT at the theatre. To identify style-specific content, we train a RoBERTa-based style classifier that differentiates between text from each style. Vaswani et al. (2017) and Hoover et al. (2020) show that heavily attended tokens correlate strongly with tokens that are indicative of the target class. We observe similar results when inspecting the attention matrices computed by the 12-layer Transformer for the sentiment classification task. Namely, the penultimate layer's attention weights correlate strongly with words humans identify as strongly indicative of positive vs. negative sentiment. Hence, we define style-specific content as tokens that have higher-than-average attention weights in the classifier.
Consider the multi-head attention matrix A in the penultimate Transformer layer, where A_ij represents the attention weight of the jth attention head on the ith token, normalized across all tokens. First, we max-pool A_i over all attention heads to form a_i. Conceptually, a_i represents the maximum extent to which the ith word was attended to by any attention head.

Figure 3: Unsupervised synthesis of source-target style pairs. We first train an attentive style classifier, whose attention weights we use to identify style-specific content. Next, we replace style-specific content with slots to form a style-agnostic template. This template is finally filled using style-specific masked language models for each style to synthesize parallel style text pairs.
Let N denote the sequence length. We compute the average attention weight as ã = (1/N) Σ_{i=1..N} a_i. To modify the style text s into the style-agnostic template t, we replace each token whose attention weight a_i exceeds ã with a SLOT token and keep the remaining tokens.
We merge consecutive SLOT tokens in t. In the running example, for the style text I had a great time at the theatre, the tagger generates I SLOT SLOT SLOT SLOT at the theatre, which after merging becomes I SLOT at the theatre.
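The template extraction step, max-pooling attention over heads and replacing above-average tokens with merged SLOTs, can be sketched as below. The toy attention weights in the usage are fabricated for illustration:

```python
def maxpool_heads(A):
    """Max-pool a heads-by-tokens attention matrix over heads: a_i is the
    maximum extent to which token i is attended to by any head."""
    n_tokens = len(A[0])
    return [max(head[i] for head in A) for i in range(n_tokens)]


def make_template(tokens, attn):
    """Replace tokens with above-average attention weight by SLOT, then
    merge consecutive SLOTs into one."""
    avg = sum(attn) / len(attn)
    tagged = ["SLOT" if a > avg else tok for tok, a in zip(tokens, attn)]
    merged = []
    for tok in tagged:
        if tok == "SLOT" and merged and merged[-1] == "SLOT":
            continue  # merge runs of SLOT into a single slot
        merged.append(tok)
    return merged
```

With attention concentrated on had a great time, this recovers the template I SLOT at the theatre from the running example.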
We then fine-tune style-specific masked language models BART_x and BART_y to fill in slots in the template and recover the style-specific text. During training, phrases in the input sentence are randomly discarded and the model is trained to fill the phrases back in (Lewis et al., 2020). Having trained style-specific masked language models for both the source and target styles, we use both models to generate source and target filled-in text given style-agnostic templates.
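The infilling objective can be sketched as follows: phrases are randomly discarded and replaced by a single mask, and the model is trained to reconstruct the original text. The span-length cap and masking probability here are illustrative, not the paper's hyperparameters:

```python
import random


def infill_example(tokens, mask_prob=0.3, max_span=3, seed=0):
    """Create a (masked input, reconstruction target) training pair by
    randomly discarding short phrases, BART-infilling style."""
    rng = random.Random(seed)
    masked, i = [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            span = rng.randint(1, max_span)  # discard a short phrase
            masked.append("<mask>")          # single mask covers the span
            i += span
        else:
            masked.append(tokens[i])
            i += 1
    return masked, list(tokens)
```

The model then learns to map the masked input back to the target, so at synthesis time it can fill SLOTs in a style-agnostic template with style-specific phrases.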
In our running example, sampling with the positive language model yields the sentence I saw a fantastic movie today at the theatre, while sampling with the negative language model yields the sentence I got ripped off today at the theatre.
The last step we perform is a filtering step using the classifier. For synthesized examples in style k, we keep examples for which the style classifier predicts k. In other words, we keep only examples on which the language models and the classifier agree. We find that this improves data quality and editor performance. We use the collection of synthesized source and target text pairs (x̂, ŷ) to train the editor.
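The filtering step amounts to keeping only pairs on which the classifier agrees with the intended style; a minimal sketch, assuming classify is any function mapping text to a style label:

```python
def filter_pairs(pairs, classify, target_style):
    """Keep a synthesized (source, target) pair only when the style
    classifier agrees the target text carries the intended style."""
    return [(src, tgt) for src, tgt in pairs if classify(tgt) == target_style]
```

In practice the classifier here is the same RoBERTa-based attentive classifier used to build templates; the toy keyword classifier below is only for illustration.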

Experimental Setup
We focus on two types of text style transfer: (1) sentiment transfer, in which we transform a positive sentiment sentence into a corresponding negative sentiment sentence or vice-versa without changing the core content (i.e. attributes of the sentence not concerned with sentiment); and (2) politeness transfer, in which we transform the tone of a sentence from impolite to polite. We make use of three datasets: YELP (Shen et al., 2017), AMAZON (Li et al., 2018), and POLITE (Madaan et al., 2020); dataset statistics are shown in Table 1.

Training Setup
We implement our models using fairseq (Ott et al., 2019) and HuggingFace (Wolf et al., 2020), both based on the PyTorch library (Paszke et al., 2019). For BART-based generation models, we initialize with BART-base (Lewis et al., 2020) and train using a batch size of 65K tokens for 30,000 iterations. We use a linear warmup schedule, reaching the peak learning rate of 3 × 10^-5 at 5,000 iterations, and then decay the learning rate with a polynomial decay schedule. For regularization, we use a dropout value of 0.3 and a weight decay value of 0.1. We optimize using Adam with hyperparameters β1 = 0.9, β2 = 0.98 and cross-entropy loss. For RoBERTa-based taggers and classifiers, we initialize with RoBERTa-base (Liu et al., 2019) and train using a batch size of 256 for 5,000 iterations. We optimize using Adam, warming up the learning rate to 1 × 10^-6 and then decaying it with a cosine schedule. We train all models using mixed precision (Micikevicius et al., 2018) for faster training. Similar to prior work (Wu et al., 2019; Malmi et al., 2020), we decode using a beam width of 5 and rerank outputs produced by beam search using the likelihood of the classifier trained in Section 2.2.

Comparison with existing methods
We compare LEWIS to five prior methods: Delete, Retrieve, Generate (Li et al., 2018), a retrieval method that finds text from the target domain corpus whose style-agnostic form is similar to that of the source text; Tag and Generate (Madaan et al., 2020), a generation method that conditionally generates target text from style-agnostic source text; and DeepLatentSeq (He et al., 2020b), an unsupervised machine translation-based approach where generators in each domain are regularized by a language model-based latent prior. Finally, we compare to the previous editing approaches of Malmi et al. (2019, 2020), which replace a single span of the source text on which two domain-specific language models disagree.

Evaluation
Automatic Evaluation We use five evaluation metrics: BLEU (Papineni et al., 2002) measured against the reference (denoted BLEU) to evaluate lexical overlap with the human annotation; Self-BLEU measured against the source to measure content preservation (denoted SBLEU); BERTScore and Self-BERTScore (Zhang et al., 2020) measured against the reference and the source (denoted BERT and SBERT respectively); and accuracy measured by an external classifier (denoted Accuracy) to measure how well the style was transferred. While BLEU, Self-BLEU, and accuracy are standard for this task, we propose additionally using BERTScore due to its higher correlation with human judgments (Zhang et al., 2020). Compared to BLEU and Self-BLEU, which are n-gram based, BERTScore is measured using token-wise cosine similarity between representations produced by BERT (Devlin et al., 2019).
Given this, using BERTScore addresses the potential issue of accurately transferred sentences being scored poorly due to their low n-gram overlap. Table 6 shows an example of this, where the style is accurately transferred but the output is scored poorly by BLEU as a result of low n-gram overlap.
Furthermore, following Malmi et al. (2020), who use a BERT-based classifier to score their outputs, we train a classifier initialized with RoBERTa-base (Liu et al., 2019). This model correctly classifies 98.2% of the YELP classification test set of Shen et al. (2017). Its accuracy is used to evaluate the output of style transfer models.
Human Evaluation We perform a robust human evaluation on all datasets, asking crowdworkers to rate 300 examples from Yelp (150 positive, 150 negative), 200 examples from Amazon (100 positive, 100 negative), and 100 from Politeness. Five annotators rate each pair from 1 (strongly disagree) to 5 (strongly agree) in terms of fluency, content preservation (CP), and style transfer. We compare with our strongest baseline, Tag and Generate (Madaan et al., 2020).

Results
Performance of LEWIS compared to other methods on YELP, AMAZON, and POLITE is shown in Tables 2, 3, and 4 respectively, with human evaluation shown in Table 7. LEWIS outperforms prior methods on all datasets in terms of accuracy, BLEU, and BERTScore: LEWIS achieves more successful transfers (2.6-13.5% accuracy depending on the task), has higher overlap with human annotations (4-14.4 BERTScore), and retains more source content (5.7-14.3 Self-BERTScore). Human evaluation (p = 0.01 for Yelp/Polite using pairwise bootstrap sampling (Koehn, 2004)) shows that LEWIS outperforms Tag and Generate on fluency, content preservation, and style across datasets. These results indicate that LEWIS is an effective method for style transfer. On the AMAZON dataset, which is noisier than YELP, LEWIS underperforms Tag and Generate when evaluated with BLEU, but outperforms it when evaluated with BERTScore. When we inspect the output of LEWIS, we find that it generates more diverse output, as shown in Figure 4. One reason for this diversity is that, unlike previous and concurrent editing work that uses single-span replacement (Malmi et al., 2019, 2020), our method concurrently edits multiple spans with a larger set of operations. This is inherently supported by the editor (Figure 5) as well as encouraged during unsupervised data synthesis (Figure 4). Table 8 shows that a large number of examples do require multiple edits, and that the coarse-to-fine editor indeed performs multiple edit operations on average.
In addition to comparing end-to-end systems, we also compare LEWIS to the concurrent editing and synthesis methods of Malmi et al. (2019, 2020). Table 2 shows that training the same model (LaserTagger) on our data improves BLEU by 9.5 (the accuracy difference is not directly comparable since Malmi et al. (2020) used a BERT classifier and did not release model output). This suggests that our data synthesis procedure produces higher quality data than that of Malmi et al. (2020). Furthermore, because LaserTagger only performs single-span edits, it often fails to transfer the style of the text. This also accounts for its high BLEU and BERTScore but low accuracy; we show that a model that simply copies the input also achieves high BLEU and BERTScore but low accuracy. Replacing LaserTagger with our coarse-to-fine Levenshtein editor results in a sizable 33% gain in accuracy. In Table 1 of the Appendix, we show example outputs of these models for comparison. Finally, we ablate LEWIS to investigate how its different components affect performance.
Editing outperforms pure generation We replace the coarse-to-fine editor with a sequence-to-sequence BART model, which we also train on synthesized data. This is a strong baseline that outperforms prior pure generation work on style transfer, as shown in Table 3. Nevertheless, Table 5 shows that LEWIS outperforms this baseline on all metrics. This confirms our hypothesis that editing is a more effective means of style transfer than pure generation.
Training on synthesized data improves performance Instead of training an editor using synthesized data, given a source text during inference, we convert it to a style-agnostic template and immediately fill it using the target language model. Table 5 shows that the resulting model underperforms both the sequence-to-sequence BART and the coarse-to-fine editor on all metrics. This result may be surprising, in that one expects the performance of a model trained on data synthesized by language models to be at most on par with the performance of the language models themselves. In this case, we observe that training on the synthesized data actually improves over just using the language models. We hypothesize that this gain is due to the editor learning correlations between the source language model and the target language model, namely how to precisely transform the output of the source language model into the output of the target language model. The gains we observe here may be related to gains from training on back-translated or pseudo-parallel data (Sennrich et al., 2016; Edunov et al., 2018; He et al., 2020a). More research is needed to investigate the problem conditions under which such gains occur.
Filtering improves performance Here, we forgo the filtering step, which removes ≈ 20% of the synthesized data on YELP. Table 5 shows that filtering improves the quality of the synthesized data and leads to consistent gains.

Related Work
Text style transfer Previous work on style transfer can largely be divided into two categories: (1) learning a latent space with controllable attributes, as in Shen et al. (2017) and John et al. (2019); or (2) unsupervised generative approaches, ranging from retrieval (Li et al., 2018) and tagging of style phrases (Madaan et al., 2020) to back-translation and unsupervised machine translation techniques (Prabhumoye et al., 2018; Lample et al., 2019; He et al., 2020b).
Editing for style transfer Our work is closest to Madaan et al. (2020) and Malmi et al. (2020). Madaan et al. (2020) use a tagger to mark style phrases in the source text, then generate the target text conditioned on the tagged source text. In contrast, we do not fully generate the target text and only perform small, precise edits. In concurrent work to ours, Malmi et al. (2020) train a BERT language model on each style and edit the span where the models' likelihoods disagree the most. In contrast, instead of performing single-span replacement, our editor concurrently edits multiple spans in the text and supports a wider set of operations than replacement. We showed that this results in more effective and more diverse style transfer. Such coarse-to-fine transformation of text, in which the input context is progressively refined, has also led to improvements in syntactic parsing (Charniak and Johnson, 2005), semantic parsing (Dong and Lapata, 2018), and NER (Choi et al., 2018).
Unsupervised data synthesis for style transfer Malmi et al. (2020) also generate synthetic data with which to train the editing model of Malmi et al. (2019). Our synthesis differs from Malmi et al. (2020) in how slots for generation are chosen. In their work, the span on which the models most disagree is chosen for rewriting. In our work, multiple spans of words whose attention weights exceed the average are chosen for rewriting, which allows for more flexible and diverse samples. In turn, training on our synthesized data improves the performance and diversity of the style transfer model.

Conclusion
We proposed LEWIS, a coarse-to-fine editor for style transfer that transforms text using Levenshtein edit operations. Unlike prior edit methods, our method concurrently performs multi-span edits. To train this editor, we proposed an unsupervised data synthesis procedure that converts text to style-agnostic templates using style classifier attention, then fills in slots in these templates using fine-tuned pretrained language models. LEWIS outperformed existing generation and editing style transfer methods on sentiment and politeness transfer. In addition, the proposed data synthesis procedure increased transfer performance. Given the same synthesized data, our editor outperformed prior pure generation and editing methods. In future work, we will study the application of LEWIS to general sequence-to-sequence problems.

Ethical Considerations
This work has impact in the field of controlled text generation, and as with much of language technology it has the potential both to be used for good and to be used maliciously. Our work learns to generate synthetic data in an unsupervised way, and is based on a pretrained model (BART), which is likely to capture and amplify biases found in the data. As with all text style transfer models, our model is amenable to malicious use, including impersonation and mass generation of fake opposing opinions, for example negative and positive product reviews or political statements.

C Source Code & Synthetic Data
We release source code with this work, including preprocessing scripts; training scripts for the conditional language models, editors, and coarse-grain taggers; edit-operation extraction scripts; and synthetic data generation scripts, at https://github.com/machelreid/lewis.

D Synthetic Data
For synthetic data generation, we generate approximately 2.2M pairs for Yelp, 2.0M pairs for Amazon, and 1M pairs for Polite. Note that when generating synthetic data on Polite, given the longer sequence lengths, we cap the number of SLOT tokens at the minimum of one-third of the total sequence length and 6.
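The SLOT cap on POLITE can be expressed directly; interpreting "one-third of the total sequence length" as integer division is our assumption:

```python
def max_slots(seq_len):
    """Cap on the number of SLOT tokens for longer POLITE sequences:
    the minimum of one-third of the sequence length and 6."""
    return min(seq_len // 3, 6)
```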
We release our synthetic data to help facilitate further development in approaches using synthetic data for this task.

E Qualitative Analysis
We analyzed 100 examples from YELP produced by LEWIS. 83% of transfers were correct, 6% incorrect, and 11% ambiguous (the resulting sentence expressed both styles). This is in line with the automatic metrics and shows that LEWIS is effective at transferring style. Regarding diversity of edits: in 59% of cases, LEWIS inverted key phrases (and enjoying this → and avoiding this; friendly folks, delicious authentic bagels → sorry folks, not authentic bagels); in 26%, LEWIS rewrote part of the sentence in a way that does not invert key adjectives/nouns (and he loved it → and he said it was OK); and in 10%, LEWIS performed purely syntactic editing (definitely not enough room → enough room). This contrasts with editors that rely primarily on single-phrase inversion (e.g. LaserTagger), demonstrating that LEWIS provides diverse edits.

F Further Automatic Evaluation
We further evaluate our model on semantic similarity and fluency using the classifiers released by Krishna et al. (2020). Results are shown in Tables 9 and 10. LEWIS improves fluency by a significant margin on all datasets, and outperforms other methods on 2/3 datasets on semantic similarity.

G Example Outputs
Source: the wine was very average and the food was even less .
LEWIS: the wine was very good and the food was even less .
LaserTagger: the wine was very good and the food was even better .
Reference: the wine was above average and the food was even better

Source: owner : a very rude man .
LEWIS: owner : a very nice man .
LaserTagger: owner : a very man .
Reference: The owner was such a friendly person.

Source: i love the food ... however service here is horrible .
LEWIS: i love the food and the service here is great .
LaserTagger: i love the food ... however service here is great .
Reference: i love the food ... and service here is awesome .

Figure 1: Coarse-to-fine Levenshtein editor. Given the source text, the two-step editor first generates coarse edit types via a tagger. A subsequent generator fills in insertions and replacements while taking into account the source text and the edit types.

Figure 4: Examples of synthesized parallel text on the YELP dataset.

Figure 5: Examples of coarse-to-fine editor output on the YELP dataset. We abbreviate the edit operations with K for <keep>, D for <del>, and R for <repl>. Unlike previous and concurrent edit methods, we concurrently edit multiple spans in the text.
Figure 2: LEWIS consists of two components. Given source-target style text pairs, a coarse-to-fine Levenshtein editor (yellow) first identifies the coarse-grain Levenshtein edit types to perform for each token in the source text (e.g. insert, replace, delete), then fills in the final edits with a fine-grain generator to produce the target text. In most applications, supervised source-target style text pairs rarely exist. To resolve this lack of annotated data, we perform unsupervised synthesis of source-target style pairs (blue) by first learning to produce style-agnostic templates given arbitrary style text. Next, we fill in slots in the template by sampling from style-specific masked language models. In this figure, source and intermediate data are shown in white while model components are shown in red.

Table 1: Dataset statistics for style transfer tasks. The politeness corpus does not have parallel evaluation data and only evaluates transfer from impolite to polite.

Table 2: Results on YELP. Results with † are taken from the classifier trained in Malmi et al. (2020) because the outputs for these models are not released.

Table 4: Results on POLITE.

Table 5: Ablation on YELP. "LM fill" is the ablation experiment in which we convert the source style text to a style-agnostic template and directly use the target style language model to synthesize a target style text (i.e. the editor is not used). "Seq2Seq" is a pretrained BART model fine-tuned on the synthesized data (i.e. a from-scratch generation model trained on the same data as the editor).

Table 6: Example comparing BERTScore vs. BLEU. Ref denotes the reference sentence; Hyp 1 and Hyp 2 are two example hypotheses.

Table 8: Coarse-to-fine editor statistics on YELP, after merging consecutive edit operations of the same type so that the number of operations denotes spans rather than tokens (e.g. delete, replace).

Table 11: Three examples from the Yelp test set comparing LaserTagger trained on our synthetic data and LEWIS.