T-STAR: Truthful Style Transfer using AMR Graph as Intermediate Representation

The unavailability of parallel corpora for training text style transfer (TST) models is a challenging yet common scenario. TST models also implicitly need to preserve the content while transforming a source sentence into the target style. To tackle these problems, an intermediate representation is often constructed that is devoid of style while still preserving the meaning of the source sentence. In this work, we study the usefulness of the Abstract Meaning Representation (AMR) graph as such a style-agnostic intermediate representation. We posit that semantic notations like AMR are a natural choice for an intermediate representation. Hence, we propose T-STAR: a model comprising two components, a text-to-AMR encoder and an AMR-to-text decoder. We propose several modeling improvements to enhance the style agnosticity of the generated AMRs. To the best of our knowledge, T-STAR is the first work that uses AMR as an intermediate representation for TST. Through thorough experimental evaluation we show that T-STAR significantly outperforms state-of-the-art techniques, achieving on average 15.2% higher content preservation with a negligible loss (~3%) in style accuracy. Through a detailed human evaluation with 90,000 ratings, we also show that T-STAR has up to 50% fewer hallucinations compared to state-of-the-art TST models.


Introduction
* denotes equal contribution

A well-accepted definition of style refers to the manner (via linguistic elements like word choices, syntactic structures, and metaphors) in which the semantics of a sentence are expressed (McDonald and Pustejovsky, 1985; Jin et al., 2020). Text Style Transfer (TST) is the task of rephrasing an input sentence to contain specific stylistic properties without altering the meaning of the sentence (Prabhumoye et al., 2018). We refer the reader to Jin et al. (2020) for a detailed survey of approaches to the TST problem formulation, metrics and models. In the practical scenario that we consider in this paper, where a large corpus of parallel data is not available (Niu and Bansal, 2018; Ma et al., 2020; Wu et al., 2020), two families of approaches have been proposed in the literature (Jin et al., 2020).
1. Disentanglement: Content and style are disentangled in a latent space and only the style information is varied to transform the sentence.
2. Prototype Editing: Style-bearing words in the source sentence are replaced with those corresponding to the target style. The sentence may be further re-arranged for fluency and naturalness.
Both approaches have drawbacks in the way the style-agnostic intermediate representation is constructed, described as follows. First, in the disentangling approaches, it is not easy to verify the efficacy of the separation between style and content. Recent approaches (Subramanian et al., 2018; Samanta et al., 2021) even state that content and style are so subtly entangled in text that it is difficult to disentangle the two in a latent space. Consequently, this affects the model's interpretability, in that it is hard to attribute an effect observed in the output to the latent intermediate vector or the style vector. Second, with prototype editing approaches for linguistic styles (such as author or formality), where the content and style are tightly coupled, it is not feasible to segregate style- and content-carrying words (word examples: cometh, thou). Furthermore, style-marker detection is a non-trivial NLP task and needs to be addressed for every new style added to the system, causing scalability concerns (Jin et al., 2020). In this paper, we propose T-STAR (Truthful Style Transfer using AMR Graph as Intermediate Representation), which uses a symbolic semantic graph notation called Abstract Meaning Representation (AMR) as the style-agnostic intermediate stage. AMR (Banarescu et al., 2013) is designed to capture the semantics of a given sentence in its entirety while abstracting away syntactic variations, inflections and function words. In other words, two sentences with the same meaning but written in very different styles will have very similar, if not exactly the same, AMRs (see Figure 1).
This addresses the shortcomings of the disentanglement and prototype editing approaches. First, AMR being a representation with well-defined semantics, we can inspect, interpret and measure the quality of the intermediate representation and the provenance of knowledge transfer between the source and target sentence. Second, AMR being a well-known standard, it has high quality, robust reference implementations, especially for high-resource languages (e.g., English). Our contributions are as follows:
1. We propose a novel TST approach with AMRs, an interpretable, symbolic intermediate representation, to achieve better content preservation. To this end, we enhance AMR parsing techniques to better suit the TST task. To the best of our knowledge, we are the first work to use AMR representations for the style transfer task (Sections 3, 4).
2. Through novel experimentation, we show that an AMR, as a style-agnostic intermediate representation, retains more content and less style information of the given source sentence compared to competitive baselines (Sections 5, 6).
3. On multiple datasets we show T-STAR is able to beat competitive baselines by producing sentences with significantly higher content preservation at similar style transfer scores (Section 7).
4. Through thorough human evaluations spanning 90,000 ratings we show T-STAR has ~70% better content preservation compared to the state-of-the-art baseline, with 50% fewer hallucinations (Section 8).

Related Work
Recently, some works propose to jointly optimize for content and style information, to overcome the limitations of explicitly disentangling the two. Yamshchikov et al. (2019b) illustrate that architectures with a higher quality of information decomposition perform better on style transfer tasks. Subramanian et al. (2018) argue that it is often easy to fool style discriminators without explicitly disentangling content and style information, which may lead to low content preservation (Xu et al., 2018).

Text ⇐⇒ AMR
In order to improve AMR parsing performance, neural models are receiving increasing attention. Neural AMR parsers can be divided into the following categories: i) sequence-to-sequence based AMR parsers (Xu et al., 2020a); ii) sequence-to-graph based AMR parsers (Zhang et al., 2019), where the graph is built incrementally by expanding one node at a time. A more detailed survey of related work can be found in Appendix A.

AMR as an Intermediate Representation
Abstract Meaning Representation (AMR) is a semantic formalism that abstracts away syntactic information and preserves only the semantic meaning, in a rooted, directed and acyclic graph.
In Figure 1, we present certain syntactic variations (changing the voice and tense) of a sentence without altering its meaning. All variations result in the same AMR graph. The nodes in the AMR graph ("produce-01", "they", "before", etc.) are concepts (entities/predicates) that are canonicalized and mapped to the semantic role annotations present in PropBank framesets. The edges ("ARG0", "time", "duration", etc.) are relations between these concepts. In other words, AMRs aim at decoupling "what to say" from "how to say it" in an interpretable way. We posit that this could be beneficial for text style transfer, where the goal is to alter the "how to say" aspect while preserving the "what to say".
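To make the notation concrete, consider the canonical AMR example from the literature, "The boy wants to go" (not a sentence from Figure 1). A few lines of Python suffice to pull the concept and relation inventories out of its PENMAN string:

```python
import re

# Canonical AMR example from the literature: "The boy wants to go".
# Note the re-entrancy: variable b fills ARG0 of both want-01 and go-01.
penman = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
"""

# Concepts appear as "variable / concept"; relation labels start with ":".
concepts = re.findall(r"\w+\s*/\s*([\w-]+)", penman)
relations = re.findall(r":(\w+)", penman)

print(concepts)   # ['want-01', 'boy', 'go-01']
print(relations)  # ['ARG0', 'ARG1', 'ARG0']
```

Note how "want-01" and "go-01" are canonicalized PropBank predicates, while the surface syntax (word order, function words, inflection) leaves no trace in the graph.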
Recently, the semantic meaning preservation property of AMRs has been shown to be useful in other generation tasks. In abstractive summarization, Liao et al. (2018) use AMR as an intermediate representation to first obtain a summary AMR graph from the document and then generate a summary from it. Hardy and Vlachos (2018) use AMRs to compensate for the lack of explicit semantic modeling in sequence-to-sequence models. In machine translation, Song et al. (2019) adopted AMRs to enforce meaning preservation while translating from English to German. For paraphrase generation, Cai et al. (2021) found that using AMRs as an intermediate representation reduces semantic drift. Moreover, incorporating a symbolic intermediate representation provides a way to effectively understand the reasons behind a model's shortcomings. We utilize this advantage to analyze the weaknesses of T-STAR in Section 8.2.
In order to demonstrate the semantic meaning preservation property of AMRs, we design an experiment using three publicly available paraphrase benchmarks: MRPC (Dolan and Brockett, 2005), QQP, and STS. MRPC and QQP are sentence-pair datasets, with each pair labeled yes if the sentences are paraphrases of each other and no otherwise. The STS dataset assigns a score from 0 (not similar) to 5 (exactly similar) to a sentence pair. We hypothesize that if AMRs are indeed semantic meaning preserving, two sentences with similar meaning should have highly similar AMRs. To measure the similarity between two AMRs, we use the SMATCH score (Cai and Knight, 2013), which calculates the number of overlapping triplets between two AMRs. We use an off-the-shelf AMR parser to generate the AMR for a given sentence. We plot the distribution of the SMATCH scores for the MRPC, QQP and STS datasets in Figure 2. For MRPC, we infer that the SMATCH scores for paraphrases are significantly higher than those for non-paraphrases. Similarly, for QQP, the quartile distribution of SMATCH scores for paraphrases is higher in comparison to non-paraphrases. For the STS dataset, we observe a gradual increase in the quartile distribution of SMATCH scores as we move towards more similar sentences.
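SMATCH itself searches over variable mappings to maximize triple overlap; the sketch below illustrates only the triple-overlap F1 at its core, assuming the variables of the two graphs are already aligned (the triples are hand-written toys, not the output of a real parser):

```python
def triple_f1(amr_a, amr_b):
    """Simplified SMATCH-style score: F1 over overlapping triples.

    Real SMATCH (Cai and Knight, 2013) additionally searches for the
    variable mapping that maximizes the match; here the variable names
    are assumed to be aligned already.
    """
    a, b = set(amr_a), set(amr_b)
    matched = len(a & b)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(a), matched / len(b)
    return 2 * precision * recall / (precision + recall)

# Hand-written toy triples of the form (source, relation, target).
amr_1 = {("w", "instance", "want-01"), ("b", "instance", "boy"),
         ("w", "ARG0", "b")}
amr_2 = {("w", "instance", "want-01"), ("b", "instance", "boy"),
         ("w", "ARG0", "b")}          # a paraphrase: identical triples
amr_3 = {("s", "instance", "see-01"), ("b", "instance", "boy"),
         ("s", "ARG0", "b")}          # different predicate

print(triple_f1(amr_1, amr_2))  # 1.0
print(round(triple_f1(amr_1, amr_3), 2))  # 0.33
```

A paraphrase pair yields a perfect score while a meaning-changing edit drops it sharply, which is exactly the separation the MRPC/QQP/STS distributions exhibit.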
The experiments above corroborate the claim that AMRs can preserve meaning under lexical variations like paraphrasing, tense and voice changes. Recent research, discussed earlier, has successfully used this property to achieve task improvements. Building on the above qualitative, quantitative and prior-research evidence, we further explore the applicability of AMRs to the TST task.

Proposed Solution
Our proposed model T-STAR consists of two modules (see Figure 3). First, the T-STAR Encoder generates an AMR given a source sentence in style i. Second, the T-STAR Decoder generates a sentence in style j with the same meaning as preserved in the generated intermediate AMR. We take the T5-Base (Raffel et al., 2020) pre-trained model as our basic seq2seq architecture for both modules. In order to use an AMR as a sequence in T5, we borrow the choice of DFS traversal from Bevilacqua et al. (2021), who thoroughly study the effect of various traversals on AMR parsing.
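The DFS linearization idea can be sketched as follows; this mirrors the spirit of the traversal in Bevilacqua et al. (2021), not the exact special-token vocabulary used there:

```python
def dfs_linearize(node):
    """Depth-first linearization of a nested AMR into a token sequence.

    A node is (variable, concept, [(relation, child), ...]); a re-entrant
    child is referenced by its bare variable string. This mirrors the idea
    of the DFS traversal studied by Bevilacqua et al. (2021), not the
    exact special-token vocabulary used there.
    """
    if isinstance(node, str):            # re-entrancy: bare variable
        return [node]
    var, concept, edges = node
    tokens = ["(", var, "/", concept]
    for relation, child in edges:
        tokens += [f":{relation}"] + dfs_linearize(child)
    tokens.append(")")
    return tokens

# "The boy wants to go": b is both the wanter and the goer.
amr = ("w", "want-01", [
    ("ARG0", ("b", "boy", [])),
    ("ARG1", ("g", "go-01", [("ARG0", "b")])),
])

print(" ".join(dfs_linearize(amr)))
# ( w / want-01 :ARG0 ( b / boy ) :ARG1 ( g / go-01 :ARG0 b ) )
```

The flat token sequence is what the T5 encoder and decoder actually consume and produce; the graph is recovered by matching parentheses.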

T-STAR Encoder
We train our simplest encoder, called the Vanilla T-STAR Encoder, by fine-tuning T5-Base on the open-source AMR 3.0 dataset (Knight et al., 2021). The AMR 3.0 dataset consists of roughly 59,000 generic English sentences (denoted s_i for the i-th sentence) and their corresponding AMRs (denoted A_i). In a qualitative analysis, we observe that the Vanilla T-STAR Encoder underperforms in two significant ways, as illustrated in Table 1. First, style-bearing words (such as "shew") become concepts in the AMRs (Sentence 1 in Table 1) instead of being canonicalized to their respective PropBank role annotations. Second, the meaning of the stylized sentences gets incorrectly encoded in the AMRs, as shown in the second example in Table 1. To overcome this, we propose a style-agnostic fine-tuning strategy as follows.

Style-agnostic Fine-Tuning: We hypothesize that the vanilla text-to-AMR encoder is unable to effectively transform stylized sentences to AMRs because it has only been trained on generic English sentences. Therefore, we propose a data-augmentation strategy (see Figure 4), where we use an off-the-shelf style transfer model, e.g., STRAP (Krishna et al., 2020), to stylize a generic English sentence s_i into style p, obtaining ŝ_i^p. While converting s_i to ŝ_i^p, we alter the style of the original sentence, keeping the meaning intact. For a high quality synthetic dataset, we filter out samples with low semantic similarity between s_i and ŝ_i^p. We provide a detailed empirical analysis of this filtering strategy in Appendix C. Since the meaning is preserved, we can map ŝ_i^p to the same AMR A_i. We then fine-tune our T-STAR Encoder on S ← S ∪ Ŝ, where Ŝ = {(ŝ_i^p, A_i)}_{i=1}^N ∀p ∈ P and P is the total number of styles in the dataset.
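The augmentation-and-filtering step can be sketched as below; the stylizer and the Jaccard similarity are crude stand-ins for STRAP and the semantic-similarity filter, and the threshold is arbitrary:

```python
def jaccard_similarity(a, b):
    """Crude lexical stand-in for the semantic-similarity filter."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def augment(pairs, stylize, threshold=0.5):
    """Build S ∪ Ŝ: stylize each sentence and keep (ŝ_i^p, A_i) only if
    the stylization stayed close enough in meaning to s_i."""
    augmented = list(pairs)                    # S: (sentence, AMR) pairs
    for sentence, amr in pairs:
        styled = stylize(sentence)             # ŝ_i^p from the TST model
        if jaccard_similarity(sentence, styled) >= threshold:
            augmented.append((styled, amr))    # ŝ_i^p maps to the same A_i
    return augmented

# Hypothetical "archaic" stylizer, purely for illustration.
def mock_stylize(s):
    return s.replace("you", "thou").replace("show", "shew")

pairs = [
    ("you show kindness", "(s / show-01 :ARG0 (y / you))"),
    ("the boy wants to go", "(w / want-01 :ARG0 (b / boy))"),
]
kept = augment(pairs, mock_stylize)
print(len(kept))  # 3: the heavily rewritten first sentence is filtered out
```

With a real similarity model the heavily rewritten first pair might well pass; the point is only the shape of the pipeline: stylize, filter, and reuse the original gold AMR A_i as the target for the stylized input.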

T-STAR Decoders
Due to the unavailability of parallel style corpora, we are provided P mono-style corpora R_p = {r_i^p}_{i=1}^{M_p} written in style p, where P refers to the total number of styles and r_i^p refers to the i-th sentence in R_p.

Table 2: Summary of evaluation metrics.
Content Preservation (C.P.) — SMATCH ↑ (Cai and Knight, 2013): measures the degree of overlap between the AMR graphs of S and T; the score is computed from triplet (edge) overlap by finding a variable-node mapping that maximizes the count of matching triplets. WMD ↓: Word Mover Distance (Kusner et al., 2015) measures the dissimilarity between S and T as the minimum distance between their embedded words; Yamshchikov et al. (2020) state that WMD correlates best with human evaluations of semantic similarity. SIM ↑ (Wieting et al., 2019a): uses the embedding model proposed in Wieting et al. (2019b) to measure semantic similarity (Krishna et al., 2020; Luo et al., 2019).
Style Transfer (S.T.) — Style Accuracy ↑: score of a 4-way and a 2-way fine-tuned RoBERTa-Large (Liu et al., 2019) classifier for the styles in the CDS and Author-Imitation datasets respectively (Krishna et al., 2020; Hu et al., 2017; Fu et al., 2017; Madaan et al., 2020).

Training a style-specific decoder f_p(.) to generate a sentence in style p from an AMR consists of two steps. First, we use our fine-tuned T-STAR Encoder (Section 4.1) to generate an AMR Â_i^p for every sentence r_i^p in every style corpus. Second, we fine-tune a T5-Base model to recover the original sentence r_i^p given the AMR Â_i^p, obtaining style-specific decoders. In other words, we fine-tune using M_p pairs (r_i^p, Â_i^p) constructed in the first step. Note that we experimented with a data augmentation technique for the decoders as well; however, it did not lead to an improvement in style transfer performance (refer to Appendix E).
Once style-specific decoders have been trained for every style in P, we can use the T-STAR Encoder in tandem with the T-STAR Decoders to convert between arbitrary style combinations, as in Krishna et al. (2020).
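Composing the two modules at inference time is then just function composition; the sketch below uses trivial stand-ins for the fine-tuned T5 encoder and style-specific decoders:

```python
def style_transfer(sentence, target_style, encode, decoders):
    """T-STAR inference: text -> style-agnostic AMR -> text in target style.

    `encode` plays the role of the T-STAR Encoder and `decoders` maps each
    style p to its decoder f_p; both are stand-ins for fine-tuned T5 models.
    """
    amr = encode(sentence)               # style-agnostic intermediate AMR
    return decoders[target_style](amr)   # f_p(.) renders the target style

# Trivial stand-ins, purely for illustration.
mock_encode = lambda s: "(s / show-01 :ARG0 (y / you))"
mock_decoders = {
    "modern": lambda amr: "you show it",
    "shakespeare": lambda amr: "thou shew'st it",
}

print(style_transfer("thou shew'st it", "modern", mock_encode, mock_decoders))
# you show it
```

Because the AMR carries no style, any source style can be paired with any target decoder without retraining a model per style pair.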

Iterative T-STAR
The performance of our modules, the T-STAR Encoder and the T-STAR Decoders, depends on the quality of the synthetic datasets (Ŝ, R̂_p) generated by their complementary modules. We adopt the iterative back-translation technique used in unsupervised machine translation (Hoang et al., 2018). Training proceeds in rounds of training an encoder and decoders from mono-style data. In every round, we aim to iteratively improve the quality of the encoder and decoder modules by generating increasingly better synthetic data than in the previous round. We briefly describe this process in Algorithm 1.
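Algorithm 1 is not reproduced in this excerpt; the following sketch shows one plausible shape of the iteration, with the train_* callables standing in for T5 fine-tuning:

```python
def iterative_tstar(corpora, encoder, train_encoder, train_decoder, rounds=2):
    """One plausible shape of the back-translation loop (cf. Hoang et al., 2018).

    Each round re-parses every mono-style sentence with the current encoder,
    retrains one decoder per style on (AMR -> sentence) pairs, and retrains
    the encoder on the pooled (sentence -> AMR) pairs.
    """
    decoders = {}
    for _ in range(rounds):
        synthetic = {style: [(s, encoder(s)) for s in corpus]
                     for style, corpus in corpora.items()}
        decoders = {style: train_decoder(data)
                    for style, data in synthetic.items()}
        encoder = train_encoder([pair for data in synthetic.values()
                                 for pair in data])
    return encoder, decoders

# Trivial stand-ins for T5 fine-tuning, purely for illustration.
corpora = {"bible": ["verily i say unto you"], "poetry": ["o wild west wind"]}
seed_encoder = lambda s: "(u / utter-01)"
train_decoder = lambda data: (lambda amr, data=data: data[0][0])
train_encoder = lambda data: (lambda s: "(u / utter-01)")

encoder, decoders = iterative_tstar(corpora, seed_encoder,
                                    train_encoder, train_decoder)
print(sorted(decoders))                      # ['bible', 'poetry']
print(decoders["poetry"]("(u / utter-01)"))  # o wild west wind
```

Each round's encoder produces cleaner silver AMRs for the decoders, and the decoders' reconstructions in turn yield better training pairs for the encoder.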

Experimental Setup
In this section we briefly describe the T-STAR variations analyzed in the subsequent sections, the baselines, and the implementation details. The models are validated against the metrics summarized in Table 2.

T-STAR variations
Vanilla T-STAR: The T-STAR Encoder used in this version is only trained on the AMR 3.0 dataset, and not fine-tuned on stylized sentences. T-STAR: We train the encoder and decoders using Algorithm 1 for only one iteration.
Iterative T-STAR: We follow two iterations of Algorithm 1 to obtain better quality synthetic dataset.

Baselines
UNMT (Subramanian et al., 2018) models style transfer as an unsupervised machine translation task. DLSM (He et al., 2020) is a deep generative model that unifies back-translation and adversarial loss.
RLPrompt (Deng et al., 2022a) uses a discrete prompt optimization approach based on reinforcement learning. It is adopted in a zero-shot setting where we use DistilBERT (Sanh et al., 2019) and run the optimization for 1k steps. STRAP (Krishna et al., 2020) first normalizes the style information by paraphrasing the source text into a generic English sentence, which is then passed through a style-specific GPT-2 based model to generate the styled output.

Datasets
We evaluate the performance of T-STAR on two English datasets that capture linguistic styles (Jin et al., 2020). First, the Shakespeare Author Imitation Dataset (Xu et al., 2012) consists of 18K pairs of sentences written in two styles; the original Shakespeare plays are written in Early Modern English, a significantly different style. Second, the Corpus of Diverse Styles (CDS) (Krishna et al., 2020) consists of non-parallel sentences written in 11 different styles. We present our results on a subset of four styles: Bible, Poetry, Shakespeare and Switchboard, which consist of 34.8K, 29.8K, 27.5K and 148.8K instances respectively (CDS uses the MIT License).

Implementation Details
We use the pre-trained T5-Base model architecture for both the encoder and the decoder, and follow the iterative back-translation literature (Kumari et al., 2021).

A good style-agnostic intermediate representation should preserve content: a decoder should be able to reconstruct the input sentence from it. To evaluate the robustness of AMRs across all styles for content preservation, we first generate the intermediate AMR for a sentence in style p using our encoder, and then reconstruct the same sentence using our decoder f_p(.) in style p. We study how close the generated sentence (from STRAP and T-STAR) is to the original sentence. We can infer from Table 3 that AMRs as an intermediate representation perform significantly better on content preservation compared to STRAP across all styles. Specifically, they give average absolute improvements of 0.10 and 0.06 on the SIM and WMD scores respectively across the four styles. We also obtain comparable performance on retaining the original style. We present an ablation study on the AMR parser in Appendix C, and propose a new unsupervised Text-AMR evaluation metric to measure the content preservation of AMR parsing, along with its results on the CDS dataset, in Appendix D.

Style Agnosticity of AMR
A style classifier C assigns a style class label to an input sentence S. If S does not encode style information, the classifier should perform poorly. We use this observation to design an experiment to evaluate the style-agnosticity of AMRs as follows. We train three versions of a 4-way style classifier, using original sentences, paraphrased sentences (as used in STRAP), and AMRs as the input sequences to the classifier.

We illustrate some examples in Table 1, where the T-STAR Encoder generates better AMRs compared to the Vanilla T-STAR Encoder. In the first example, the T-STAR Encoder, unlike the Vanilla T-STAR Encoder, is able to map the style-specific word "shew" to the valid concept "show" and also associate "existence" with it. In the second example, the T-STAR Encoder does not split the AMR into two sentences while parsing. Additionally, it is able to make the association between "you" and "youth". Through the above quantitative and qualitative analysis we demonstrate that the T-STAR Encoder generates AMRs that are robust in preserving the meaning of the source sentence while predominantly losing style information.
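The probing protocol, training the same classifier on different input views and comparing accuracies, can be sketched with a toy unigram classifier standing in for the fine-tuned RoBERTa-Large model (the example data and the shared AMR are hypothetical):

```python
from collections import Counter

def train_centroids(examples):
    """Per-style unigram counts: a tiny stand-in for the RoBERTa classifier."""
    centroids = {}
    for text, style in examples:
        centroids.setdefault(style, Counter()).update(text.lower().split())
    return centroids

def classify(text, centroids):
    return max(centroids,
               key=lambda s: sum(centroids[s][w] for w in text.lower().split()))

def probe_accuracy(examples, view):
    """Train and score the toy classifier on one input view of the data.

    If the view strips style cues (as an AMR should), accuracy falls
    toward chance: that is the probing logic of the experiment above.
    """
    viewed = [(view(t), s) for t, s in examples]
    centroids = train_centroids(viewed)
    correct = sum(classify(t, centroids) == s for t, s in viewed)
    return correct / len(viewed)

examples = [("thou shew'st kindness", "shakespeare"),
            ("you show kindness", "modern")]
identity = lambda t: t
# Hypothetical shared AMR for both sentences, purely for illustration.
amr_view = lambda t: "(s / show-01 :ARG0 (y / you) :ARG1 (k / kindness))"

print(probe_accuracy(examples, identity))  # 1.0: surface cues separate styles
print(probe_accuracy(examples, amr_view))  # 0.5: identical AMRs, chance level
```

The drop from perfect to chance-level accuracy on the AMR view is the signal the real experiment looks for with RoBERTa-Large on the full corpora.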

Performance on Style Transfer Tasks
In this section, we compare the performance of T-STAR and Iterative T-STAR to the baselines on two datasets: the Shakespeare Imitation Dataset and the CDS dataset.

Performance Analysis on Shakespeare Imitation Dataset
We observe that our model T-STAR performs slightly better than the STRAP model for the original-to-modern direction, but has lower performance for modern-to-original. However, Iterative T-STAR outperforms all the baselines in both directions on Weighted Style Accuracy. We observe that STRAP achieves its very high style accuracy by compromising on content preservation. Through human evaluation (Section 8) we see that STRAP relies significantly more on hallucination to achieve style transfer. Vanilla T-STAR, on the other hand, achieves high content preservation via significant copying from the source, as seen in its substantially high self-BLEU score. Iterative T-STAR finds the middle ground, achieving style transfer while not compromising on content preservation. We also found that the length of the generated sentence is similar to that of the input sentence, with on average a one-word difference. In the subsequent sections, we compare T-STAR only to the best performing baseline, STRAP.

Performance Analysis on CDS Dataset
In Table 6, we compare the performance of T-STAR and Iterative T-STAR against STRAP and Vanilla T-STAR, across all 12 directions for the {Poetry, Shakespeare, Switchboard, Bible} styles. We make the following observations. First, both our models T-STAR and Iterative T-STAR outperform the state-of-the-art baseline, STRAP, on 11 out of 12 directions, with average absolute improvements of 7.7% and 9.7% respectively. Second, Vanilla T-STAR is observed to be a stronger baseline than STRAP, as it beats STRAP on 8 out of 12 directions.

Qualitative Analysis
In Table 7, we enumerate a few examples of stylized sentences generated by STRAP and the T-STAR variations. We can infer that, although STRAP performs well in transforming the sentence to the given style, it alters the meaning (rows 1, 3 and 4).
On the other hand, Vanilla T-STAR does not always transform the style (rows 1 and 2). However, T-STAR and Iterative T-STAR are able to transform the style while keeping the meaning intact. We further quantify these observations through an extensive set of human annotations, as described below.

Human Evaluations
Automatic metrics are insufficient to thoroughly understand subjective quality measures like content preservation. Therefore, we conduct an extensive case study with human evaluations. Our analysis is twofold: first, we compare STRAP with our models T-STAR and Iterative T-STAR on meaning preservation; second, we further examine the various categories of meaning-loss failures.
For both human evaluation tasks below, the criteria for choosing annotators were: i) proficiency in English, with a minimum education of a diploma; ii) annotators first have to qualify on two simpler questions, else they are not allowed to continue on the task. Each instance is annotated by three taskers, and the final annotation is a majority vote.

Comparison on Meaning Preservation
In order to study the faithfulness of the T-STAR models, we conduct a side-by-side human evaluation. In this task, a source sentence is shown with two stylized target sentences (one from T-STAR and one from STRAP). We present the annotators with three options to judge content preservation with respect to the source sentence: the option on the left is better than the one on the right, the right is better than the left, or both are equal.
We extensively compare the two models across all 12 directions for the four styles. For each direction we randomly sample 500 instances. Each instance is rated by 3 annotators, leading to a total of 18,000 ratings. We summarize our findings in Table 8. Both T-STAR and Iterative T-STAR significantly outperform STRAP in terms of content preservation (the "> STRAP" column in Table 8). Further, Iterative T-STAR has 7% higher meaning preservation compared to T-STAR.

Error Analysis
In the next study, we further aim to characterize the nature of meaning-loss errors made by style transfer models.
We categorize these errors into three categories: i) Hallucination: new information not present in the source sentence is added to the target; ii) Semantic Drift: the target sentence has a different meaning from the source sentence; iii) Incomplete: some important content information is missing in the target. The taskers also have the option to select "No Error" if the meaning is preserved in the generated target.
As in the previous experiment, we collect 18,000 ratings; the results are summarized in Table 9. We observe that our models T-STAR and Iterative T-STAR consistently beat STRAP in the "No Error" category. Furthermore, across all styles, the amount of hallucination drops from 39.3% for STRAP to 24.46% with T-STAR, and further to 22.6% with Iterative T-STAR. The reduction in hallucination can be clearly seen as a benefit of encoding the critical information of the source sentence in a semantic parse representation like AMR. As a sign of improving AMR parsing quality, Iterative T-STAR further reduces Incomplete errors to 15.5%. For further details refer to Appendix F.

Usefulness of an interpretable intermediate representation
With the intermediate AMRs being interpretable, it is possible to broadly understand whether such errors emerge from the encoder or the decoder module.
To intuitively understand the reason for the high number of Incomplete and Semantic Drift errors, we qualitatively analyzed some instances along with the generated intermediate AMRs. We list these examples in Appendix F. We observed that for Incomplete errors, the generated AMRs did not encode the complete information, so the error percolated from the T-STAR Encoder: for the majority of instances, either some entities were missing, or a clause separated using punctuation such as ":", "," or ";" was not parsed into the intermediate graph. Semantic Drift errors indicate shortcomings in both modules: for some instances the encoder does not abstract the meaning effectively, and for others the decoder is not able to generate a sentence with the meaning encoded in the intermediate AMR.

Conclusion
We explored the use of AMR graphs as an intermediate representation for the TST task. We see that the performance of the proposed method T-STAR surpasses state-of-the-art techniques in content preservation with comparable style accuracy.
Through qualitative analysis we show that obtaining very high style accuracy scores without altering meaning is indeed a challenging problem.
Limitations

Some of the limitations of T-STAR based models are the following. First, although our proposed models perform better on the joint objective of content preservation and style transfer, they are not able to outperform Vanilla T-STAR (the overall best performing model for content preservation) and STRAP (the overall best performing model for style transfer) on the individual dimensions. Boosting performance along both dimensions without compromising either remains a promising future direction. Second, we do not incorporate graph structure in our models, so there could be some information loss while interpreting and generating the AMRs. Third, based on our error analysis, although our T-STAR Encoder generates better AMRs for stylized sentences compared to the Vanilla T-STAR model, we still generate a significant number of incomplete AMRs that miss important entities and relations needed to preserve the meaning of the source sentence. Fourth, similar to prior research on generating synthetic datasets, the initial iteration of our model depends on an existing off-the-shelf TST model; however, the quality of the generated AMRs improves significantly with the described data augmentation strategy. Fifth, our work depends on a robust AMR parsing approach, which makes it challenging to adopt our approach for other languages. However, with recent advancements in multilingual AMR parsing, this should become feasible in future work.

A Extended Related Works
A.1 TST Metrics

Table 2 summarizes the different metrics that have been used to measure content preservation and style transfer efficacy. Yamshchikov et al. (2020) present a comprehensive analysis and categorization of several such metrics with respect to human evaluations. Tikhonov et al. (2019) also point out flaws in traditional evaluation techniques and insist on using human-written reformulations for better evaluation.

A.2 Text to AMR
Recent works (Bevilacqua et al., 2021; Cai and Lam, 2020; Zhou et al., 2020) on the text-to-AMR task have pushed the state of the art, making it feasible to automatically construct an AMR for a given sentence. As a consequence, semantics-preserving NLG tasks, such as neural machine translation (Song et al., 2019; Xu et al., 2020b), abstractive summarization (Takase et al., 2016), and question decomposition for multi-hop question answering (Deng et al., 2022b), use AMRs as intermediate representations.
However, AMRs have not been explored for style transfer tasks before our work. The increasing adoption of AMRs in seq2seq tasks is due to the boost in the quality of AMR parsers. Earlier works relied on statistical approaches (Peng et al., 2017; Flanigan et al., 2014, 2016) to generate AMRs for a given piece of text. With the emergence of deep learning, various AMR parsers have been proposed, which can be divided into the following categories: i) sequence-to-sequence based AMR parsers (Xu et al., 2020a); ii) sequence-to-graph based AMR parsers (Zhang et al., 2019), where the graph is built incrementally by expanding one node at a time. More recently, several works have adopted pretrained models for AMR parsing and observed a boost in performance. Bai et al. (2022) use BART and pose the AMR parsing task as a seq2seq task, generating a traversal of the AMR graph as the output. Bai et al. (2022) incorporate a pretraining strategy to better encode graph information in the BART architecture. Xu et al. (2020a) use sentence encodings generated by a BERT model. In this work, we adopt a pretrained-model based AMR parser to generate high quality AMRs for the given stylized sentences.
Although off-the-shelf AMR parsers work well for some problems (Fan and Gardent, 2020), they often need to be modified to be useful in downstream tasks. For instance, Deng et al. (2022b) proposed a graph segmentation strategy to perform question decomposition on multi-hop queries. Xia et al. (2021) and Du and Flanigan (2021) illustrated that silver data augmentation can help improve AMR parsing. In this work, we likewise illustrate the benefit of using silver data to improve the style agnosticity of AMR graphs as an intermediate representation.

A.3 AMR to Text
Similar to text-to-AMR models, AMR-to-text frameworks can also be categorized into two types: i) sequence-to-sequence generation frameworks, and ii) graph-encoder based frameworks (Song et al., 2018; Wang et al., 2020, 2021). Bai et al. (2020) propose a decoder that back-predicts projected AMR graphs to better preserve the input meaning compared to standard decoders. Bai et al. (2022) argue that PLMs are pretrained on textual data, making them suboptimal for modeling structural knowledge, and hence propose self-supervised graph-based training objectives to improve the quality of AMR-to-text generation.

B Implementation Details
Offensive language: We used the "List of Dirty, Naughty, Obscene or Otherwise Bad Words" to validate that the source and the generated target text do not contain any offensive text.
Model Architecture: We use a standard T5-Base encoder-decoder model as described in Raffel et al. (2020). The pre-trained HuggingFace T5 transformer is used for both the text-to-AMR and AMR-to-text parts of the proposed architecture. The model is pre-trained on the Colossal Clean Crawled Corpus (C4), comprising ∼750 GB of text articles. The model comprises 220 million parameters, and is pre-trained for 2^19 steps before fine-tuning. For pre-training, the AdaFactor optimizer (Shazeer and Stern, 2018) is used with an "inverse square root" learning rate schedule.
AMR Graph Construction: We use the SPRING model (Bevilacqua et al., 2021) to generate AMR graphs from source-style text, via the amrlib package. This implementation uses T5-Base (Raffel et al., 2020) as its underlying model, as opposed to the BART model (Lewis et al., 2019) in the original SPRING architecture. It is trained on the AMR 3.0 (LDC2020T02) dataset (Knight et al., 2021), which consists of 59K manually created sentence-AMR pairs. The model is trained for 8 epochs with a learning rate of 10^-4. The source and target sequence lengths are restricted to 100 and 512 tokens respectively. Note that the T5-based SPRING model achieves a SMATCH score of 83.5, which attests to the quality of the obtained AMR representations.
AMR-Based Style Transfer: We use the T5wtense (T5 with tense) architecture from the amrlib package. The T5wtense architecture adds part-of-speech (POS) tags to the concepts in the AMR graph, which helps the generation model predict the tense of the output sentence, since AMR graphs do not retain any tense information from the corresponding sentence. This model outperforms the standard T5-based model by 10 BLEU points on the AMR 3.0 (LDC2020T02) dataset (Knight et al., 2021). To keep the training steps comparable across the subsets of the CDS dataset (Krishna et al., 2020), we train this t5-base model for 20 epochs on the Bible, Romantic Poetry, and Shakespeare datasets, and for 5 epochs on the Switchboard dataset. The model was also trained for 20 epochs on the Shakespeare Author Imitation Dataset (Xu et al., 2012). We used a learning rate of 10^-4 throughout, and restricted source and target sequence lengths to 512 and 90 tokens respectively. Everything else was kept the same as the amrlib implementation to keep the results consistent.
STRAP baseline: We train the model keeping the same hyperparameter configuration as reported in Krishna et al. (2020). We train each style-specific decoder for 3 epochs with a learning rate of 5 × 10^-5 using the Adam optimizer (Kingma and Ba, 2014). During inference we set the p-value for nucleus sampling (Holtzman et al., 2019) to 0.7 to strike an appropriate balance between content preservation and style accuracy.
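The nucleus sampling step used during inference can be illustrated with a minimal, self-contained sketch. The function and the toy distribution below are illustrative, not STRAP's actual implementation:

```python
import random

def nucleus_sample(probs, p=0.7, rng=random):
    """Top-p (nucleus) sampling (Holtzman et al., 2019): keep the smallest
    set of highest-probability tokens whose cumulative mass reaches p,
    renormalize, and sample from that set."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        total += pr
        if total >= p:
            break
    z = sum(pr for _, pr in nucleus)  # renormalization constant
    r, acc = rng.random() * z, 0.0
    for tok, pr in nucleus:
        acc += pr
        if r <= acc:
            return tok
    return nucleus[-1][0]
```

A lower p concentrates sampling on high-probability tokens (favoring content preservation), while a higher p admits more stylistic diversity; p = 0.7 trades off the two.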
Style classifiers: Similar to Krishna et al. (2020), we fine-tune a RoBERTa-large model (Liu et al., 2019) using the official implementation. The following package uses the GNU LGPL 3.0 license: gensim.

C T-STAR-Encoder Ablation Study
The performance of our T-STAR-Encoder heavily depends on the quality of the synthetic dataset generated while stylizing the sentences present in the AMR 3.0 dataset. Therefore, we conducted a thorough empirical analysis to identify a filtering strategy that boosts the performance of the encoder.
To obtain the initial set of stylized sentences, we use the state-of-the-art model available to transform generic English sentences into the relevant styles {Poetry, Shakespeare, Bible, Switchboard}, i.e., the seq2seq inverse paraphrase module (Krishna et al., 2020).
We then fine-tune our T-STAR-Encoder on synthetic datasets obtained from different filtering strategies, and compare the performance on the test split of AMR 3.0, to ensure that the quality of the generated AMRs does not drop significantly. We present our findings in Table 11. We observe that using the whole set of generated samples leads to a significant drop in performance (row 1).
Therefore, we filter out the augmented stylized sentences whose SIM similarity score (Wieting et al., 2019a) falls below a threshold δ. This filtering strategy significantly improves T-STAR-Encoder performance, giving results competitive with the non-augmented Vanilla T-STAR-Encoder. We select the best performing δ based on performance on the test split of AMR 3.0; the best performing threshold is δ = 0.7.
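The filtering step can be sketched as follows. Note that `sim_score` here is a hypothetical token-overlap placeholder standing in for the actual trained SIM embedding model of Wieting et al. (2019a):

```python
DELTA = 0.7  # best-performing threshold, selected on the AMR 3.0 test split

def sim_score(source: str, stylized: str) -> float:
    # Placeholder: token-overlap proxy for the learned SIM similarity.
    a, b = set(source.lower().split()), set(stylized.lower().split())
    return len(a & b) / max(len(a | b), 1)

def filter_augmented_pairs(pairs, delta=DELTA):
    """Keep only (source, stylized, amr) triples whose stylized sentence
    stays semantically close to the source sentence."""
    return [(src, sty, amr) for src, sty, amr in pairs
            if sim_score(src, sty) >= delta]
```

Triples that survive the filter are paired (stylized sentence, gold AMR) and used as silver training data for the encoder.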

D Unsupervised Evaluation of AMR parsing
Since we use AMR graphs as the intermediate representation for the TST task, it is important to validate the quality of the generated AMRs in terms of content preservation with respect to the input sentence. However, there does not exist an unsupervised metric to evaluate content overlap between an AMR graph and a sentence. Hence we propose to use a slight variation of the Word Mover's Distance (WMD) (Kusner et al., 2015) for this purpose. We choose WMD over other content preservation metrics for the following reasons:
• Yamshchikov et al. (2019a) illustrate the efficacy of WMD for evaluating text style transfer over other metrics, based on correlation with human evaluations. Moreover, because WMD operates on word embeddings rather than surface forms, it is more robust to the domain difference between the input and the output sentence, making it an ideal candidate for text-AMR similarity measurement.
• Syntactic metrics like BLEU cannot compute the content overlap between an AMR graph and a sentence, because word representation in an AMR graph discards noun forms and tense information, and some verb tokens are mapped to different PropBank verbs. These modifications, along with the disparity between sequential and graphical representations, make syntactic metrics infeasible for the task.
• Semantic representations like SIM are sensitive to the input sequence order and are affected by non-content-bearing words as well. WMD, however, adopts a bag-of-words paradigm, making it more suitable for the task.
We propose the following two variants of WMD:
• WMD Overall: In this variant, we aim to keep only the content-bearing tokens from the sentence and the AMR graph. For sentences, we remove stopwords (after a detailed corruption study on sentences; see Table 13), while for AMR graphs we remove AMR-notation-specific tokens (like ":op?", "ARG?"), punctuation (like "(", '"'), assigned variables, and PropBank codes for verbs (e.g., changing "s / say-01" to "say").
• WMD Verb Overall -In this variant we specifically want to compute the similarity of verbs in the parsed AMR graphs and input sentence.For this, use nltk POS tagging tool to extract out verbs from the input sentence, and directly extract out propbank based verbs from the AMR graph.

E Data Augmentation for T-STAR Decoder
We hypothesize that T-STAR decoder performance will improve if the underlying model is better at generating text given an AMR graph. To this end, we create a synthetic dataset using sentences from the Wikipedia corpus, from which we sample 10 million sentences. We first fine-tune the T5-Base model on the AMR-to-text task on this filtered dataset. We obtain a BLEU score of 49.13 on the Gold AMR test set. Note that this performance is very close to the state-of-the-art result of 49.2 BLEU for this task (Bai et al., 2022). Table 15 lists the different filtering strategies and dataset sizes we experimented with to identify the best strategy to improve performance.
We then compare the performance of the best performing model on the style transfer task against STRAP. We observe that this model does not beat STRAP across the various style directions. Therefore, we conclude that vanilla fine-tuning of the model on the AMR-to-text task does not necessarily boost performance on the downstream task.

F.1 Comparison on Meaning Preservation
We present the results across all 12 directions for the content-preservation comparative analysis in Tables 16 and 17 respectively. We observe that for every direction our models are consistently better in content preservation than the STRAP model.

F.2 Error Analysis per direction
In this section, we present the error analysis for each direction in Table 18. Note that all the models show high error rates for Semantic Drift, and improving the models on this error type can be explored in future work.

F.3 Qualitative Analysis for Incompleteness and Semantic Drift
As we use interpretable intermediate representations, it is easy to understand the intuition behind these errors and to broadly identify which module (encoder or decoder) needs further improvement. Therefore, we study a few instances and analyze the generated intermediate AMRs to understand the reason for the high number of Semantic Drift and Incomplete errors. We list some instances in Table 20 and Table 19. Across the various instances that we analyzed, we observe that the generated AMRs were not encoding the complete information: either some entities were missing (examples 1, 4, 5 and 6 in Table 19), or, when clauses were separated using ":" or ";", only one of them was parsed into the intermediate graph (examples 2 and 3 in Table 19). For semantic drift, we observed that the errors arose from shortcomings in both modules, i.e., the meaning changed either when the Encoder did not generate a faithful AMR graph, or when the Decoder was unable to interpret the AMR correctly. We list the module that is the potential source of the error in the last column of Table 20.
It is important to note that the source of errors is now easy to identify because we use a robust, interpretable and symbolic representation as the pivot to transfer from style A to style B. We have also provided a case study of the performance of the various baselines and the proposed model in Table 7 on the CDS dataset.
Source: And there were made on them, on the doors of the temple, cherubims and palm trees, like as were made upon the walls; and there were thick planks upon the face of the porch without.
Output: The cherubims and palm-trees are at the temple doors, And the thick planks on the porch face.

6. Source: And the rough goat is the king of Grecia: and the great horn that is between his eyes is the first king.
Output: The rough goat o'er the king of Greece
AMR: (h / have-org-role-91 :ARG0 (g / goat :ARG1-of (r / rough-04)) :ARG1 (c / country :name (n / name :op1 "Greece")) :ARG2 (k / king))

Table 19: Samples from Human Evaluations where the T-STAR output was marked incomplete. We observe that entities and relations missing from the output sentence were also absent from the corresponding intermediate AMRs.

Figure 1 :
Figure 1: Different syntactic variations lead to the same AMR, as they are all similar in meaning.
Logeswaran et al. (2018) use back-translation to optimize for content preservation. Luo et al. (2019) and Liu et al. (2020) use reinforcement learning frameworks with explicit rewards designed for content preservation. Wang et al. (2019) push the entangled latent representation in the targeted-style direction using style discriminators. Samanta et al. (2019) use normalizing flows to infuse the content and style information back before passing it to the decoder. Cheng et al. (2020) propose a context-aware text style transfer framework using two separate encoders for the input sentence and the context. Similar to our work, Krishna et al. (2020) also generate an interpretable intermediate representation: the authors paraphrase the given source to convert it to a destylized version before passing it to the targeted style-specific decoder. Complementary to our work, Shi et al. (2021) use syntactic graphs (dependency trees) as an intermediate representation for attribute transfer to retain the linguistic structure. In contrast, we focus on retaining the semantics using AMR graphs as the intermediate representation while modifying the linguistic structure (authorship style).

Figure 3 :
Figure 3: An overview of the T-STAR model architecture. It consists of two modules: the T-STAR Encoder, which transforms a given sentence s_i in style i into its AMR representation A_i; and the style-specific T-STAR Decoder for style j, to which A_i is passed to convert the sentence to style j.

Figure 4 :
Figure 4: Generic sentences from the AMR 3.0 corpus are stylized using a TST model. The corresponding AMRs and stylized sentences are mapped together as a silver training dataset to fine-tune the T-STAR Encoder.

by their corresponding semantic similarity scores averaged across all test instances. (Krishna et al., 2020)*, (Li et al., 2018)*

Table 2: Evaluation Metrics. Source and target sentences are denoted S and T respectively. We use all the dimensions except AMR similarity to measure the performance of TST. Note that Weighted Style Accuracy is the only metric that encompasses two crucial dimensions, content preservation (C.P.) and style transfer (S.T.), in one metric. AMR similarity is used to select the best performing T-STAR-Encoder. * marks works that use slight variations of the mentioned metrics. Note that all metrics are averaged across all test instances.

Algorithm 1: Iterative T-STAR
Input: parallel corpus S; mono-lingual corpus R_p, ∀p ∈ P
1: Ŝ = {ŝ_i^p, A_i}_N, ∀p ∈ P, using an existing TST model, e.g., STRAP
2: X := S ∪ Ŝ
3: while not converged do
4:   fine-tune the T-STAR-Encoder f_amr(·) on X
5:   use f_amr(·) to create R̃_p := {r_i^p, Ã_i^p}_M
8:   fine-tune the T-STAR-Decoder f_p(·) for style p
9:   use the T-STAR-Decoder to get Ŝ_p = {ŝ_i^p, A_i}_N
10:  Ŝ := Ŝ ∪ Ŝ_p
11:  X := S ∪ Ŝ

Here M_p denotes the number of sentences in the style-p dataset.
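Algorithm 1 (Iterative T-STAR) can be sketched in code as follows. The fine-tuning calls are hypothetical stubs standing in for the actual T5 training procedures, and the step comments refer to the algorithm's line numbers:

```python
def iterative_tstar(S, R, styles, initial_tst, finetune_encoder,
                    finetune_decoder, n_iters=2):
    """Sketch of Algorithm 1. S: gold (sentence, AMR) pairs; R[p]: a
    monolingual corpus for each style p; initial_tst: an existing TST
    model (e.g., STRAP) used to bootstrap stylized silver data."""
    # Step 1: stylize the gold sentences, pairing outputs with gold AMRs.
    S_hat = [(initial_tst(s, p), a) for p in styles for s, a in S]
    X = S + S_hat                          # step 2
    for _ in range(n_iters):               # step 3: loop until convergence
        f_amr = finetune_encoder(X)        # step 4: train the encoder on X
        for p in styles:
            # Step 5: parse the monolingual corpus of style p into AMRs.
            R_tilde = [(r, f_amr(r)) for r in R[p]]
            f_p = finetune_decoder(R_tilde, p)   # step 8: train decoder p
            # Steps 9-10: regenerate stylized sentences from gold AMRs.
            S_hat = S_hat + [(f_p(a), a) for _, a in S]
        X = S + S_hat                      # step 11: grow the training set
    return X
```

Each iteration grows the silver training set X, so the encoder sees progressively more stylized sentence-AMR pairs.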

Table 1 :
Comparison between AMRs from vanilla T-STAR Encoder and T-STAR Encoder.T-STAR Encoder generates better AMRs for stylized sentences.

Table 3 :
Reconstructing the original sentences using the intermediate semantic representation outperforms the baseline by a significant margin with respect to content preservation. T-STAR is on par with the baseline model on style retention (S.R.).
6.1 Semantic Containment of AMR
If the semantics of the input sentence are completely preserved in the intermediate representation, we

Table 4 :
Accuracy of style classifiers trained on original, paraphrased and AMR inputs. AMR has the lowest performance, denoting that it encodes the least style information.
Results of T-STAR and Iterative-T-STAR in comparison with the baselines (Krishna et al., 2020) for the Shakespeare Author Imitation dataset in both directions: original to modern style and vice versa. Weighted Style Accuracy is the primary metric, as it has been shown to effectively combine both style accuracy and content preservation (Krishna et al., 2020).

Table 5 :
Comparison of T-STAR models with the baseline models on the Shakespeare Author Imitation Dataset in both directions. Iterative T-STAR outperforms all baselines on Weighted Style Accuracy.

Table 7 :
Example of generated stylized sentences for STRAP, Vanilla T-STAR, T-STAR and Iterative T-STAR models for the given input sentence.

Table 8 :
Comparison of T-STAR and Iterative T-STAR models against STRAP for content preservation.

Table 9 :
Aggregate error analysis for the error types Hallucinations, Semantic Drift and Incomplete over 6,000 samples. With Iterative T-STAR, the number of samples with no errors increases and the number with hallucinations decreases significantly.

Table 18 .
We observe that the T-STAR and Iterative T-STAR models are consistently better on No-Error and Hallucinations across all directions. Moreover, for 7 out of 12 directions, the Iterative T-STAR model is better than STRAP on the Incompleteness error. Note that the T-STAR model was under-performing here; however, another iteration of model improvements increases that number significantly.

Table 16 :
Comparative analysis of the T-STAR and STRAP models to understand which model generates more meaning-preserving outputs.

Table 17 :
Comparison on content preservation using human evaluations of Iterative T-STAR and STRAP across all 12 directions.

Table 18 :
Type-of-error analysis across the three models (STRAP, T-STAR, Iterative T-STAR) and all four styles.