Tailor: Generating and Perturbing Text with Semantic Controls

Controlled text perturbation is useful for evaluating and improving model generalizability. However, current techniques rely on training a model for every target perturbation, which is expensive and hard to generalize. We present Tailor, a semantically-controlled text generation system. Tailor builds on a pretrained seq2seq model and produces textual outputs conditioned on control codes derived from semantic representations. We craft a set of operations to modify the control codes, which in turn steer generation towards targeted attributes. These operations can be further composed into higher-level ones, allowing for flexible perturbation strategies. We demonstrate the effectiveness of these perturbations in multiple applications. First, we use Tailor to automatically create high-quality contrast sets for four distinct natural language processing (NLP) tasks. These contrast sets contain fewer spurious artifacts and are complementary to manually annotated ones in their lexical diversity. Second, we show that Tailor perturbations can improve model generalization through data augmentation. Perturbing just ∼2% of training data leads to a 5.8-point gain on an NLI challenge set measuring reliance on syntactic heuristics.


Introduction
Controllable text generation through semantic perturbations, which modifies sentences to match certain target attributes, has been widely applied to a variety of tasks, e.g., changing sentence styles (Reid and Zhong, 2021), mitigating dataset biases (Gardner et al., 2021), explaining model behaviors (Ross et al., 2020), and improving model generalization (Teney et al., 2020; Wu et al., 2021). Existing work trains controlled generators with task-specific data, e.g., training a style transferer requires instances labeled with positive and negative sentiments (Madaan et al., 2020b). As a result, transferring to a new application is prohibitive, if at all possible, and requires costly annotation efforts and re-training for every task of interest.
Figure 1: A compositional perturbation using Tailor. Given (A) an original sentence and its semantic role parse, we abstract each span into a structured header that contains its semantic roles and keywords. We specify desired perturbations by modifying each control code (e.g., changing role LOCATIVE → TEMPORAL in (B), verb tense past → present, and patient keyword specificity complete → partial). Given these perturbed control codes in the input (C), Tailor generates a new sentence (D) that reflects the desired perturbations, recombining original text with new text generated to follow the designated semantic controls.
* denotes equal contribution.
In this work, we introduce Tailor, a system that supports application-agnostic perturbations without the need for retraining. At the core of Tailor is a controlled generator (§2) that flexibly generates full sentences from target semantic features. We combine structured control codes in our inputs to represent desired linguistic properties of outputs. As shown in Figure 1, each code is built on semantic parses derived from the PropBank formalism and specifies the semantics of a span in the target sentence (e.g., "who did what to whom" (Palmer et al., 2005)). We use unlikelihood training (Welleck et al., 2020) to encourage control code following, by penalizing generations that are not aligned with designated codes.
The multi-dimensionality of semantic roles lends Tailor the ability to perform fine-grained changes to individual arguments within a sentence (e.g., one can just change the patient span in Figure 1). Such granularity is critical for generating datasets that evaluate and improve models' natural language understanding (Kaushik et al., 2020; Wu et al., 2021). Instead of a single target attribute positive → negative, we can break down the specific linguistic transformations involved in achieving the attribute, e.g., changing sentiment polarities through negation or through antonym replacement.
To highlight perturbations feasible with Tailor, we identify and implement a list of primary perturbation operations on inputs on top of the generator (§3). More complex changes can be easily constructed by composing these operations. For example, in Figure 1, while it would be nontrivial to directly train a generator to transform sentence A to D, we can achieve the transformation through a series of perturbation operations: syntactic rewriting (changing verb tense), sentence expansion (extending "the athlete"), and data recombination (i.e., sourcing the TEMPORAL constraint).
Tailor's flexible control codes allow for broad, easily extendable applicability. We demonstrate Tailor's utility in three distinct applications: 1) We use Tailor to replicate existing contrast sets (§5) on four diverse tasks, with much less manual annotation effort. Our analysis suggests that these contrast sets not only have high rates of validity (up to 82%), but also contain lexical diversity and reduce dataset bias. 2) We show that augmenting training data with just a small ratio of Tailor perturbations (≈5%) improves the robustness of Natural Language Inference (NLI) models to inference heuristics, increasing performance on the HANS evaluation set by an average of 1.73 points (McCoy et al., 2019). 3) Without any finetuning, Tailor achieves impressive performance on fine-grained and compositional style transfers (§7) in the StylePTB benchmark (Lyu et al., 2021), even outperforming models trained on the dataset on 6 transfers.

Tailor's Controllable Generator
In this section, we provide an overview of the Tailor generator, which takes linguistic controls specifying what information should be included in a generation and how. We first motivate and outline dimensions useful for semantic perturbations (§2.1), and then explain how to embed them within inputs to the generator (§2.2). Finally, we describe how we use unlikelihood training to train our generator to follow the specified control codes (§2.3).

Controllable Dimensions
To allow for control over sentence semantics at varying levels of granularity, we incorporate a combination of semantic roles and content keywords.
To denote shallow semantics, we use semantic parsing roles derived from PropBank, which express predicate-argument structures of sentences (Palmer et al., 2005). Predicates reflect events (what happened), and are usually evoked by verbs, like "comforted" in Figure 1. Arguments, usually spans of tokens, realize the thematic roles of the predicates, including core arguments such as who (e.g., "the doctor") and to whom ("the athlete"), as well as adjunct arguments like where ("In the operation room"), how, etc. PropBank semantic analyses provide well-established feature representations for meanings and are generalizable across different verb predicates and languages (Hajič et al., 2009), making them an appealing choice for representing high-level semantics.
We further use content keywords to drive the generation of actual predicates and arguments. Depending on how much new text we want to retrieve from the generator, the keywords can be either sparse (e.g., adding a random temporal constraint) or fully specified (e.g., adding a fixed "in the midst of the earthquake"). As later shown in Table 3, such control is important for supporting different perturbation strategies and use cases.
Since the same set of thematic roles can be combined in different ways, we further add controls on span ordering. We use the form of the predicate to control the order of generated core arguments. For example, although "the athlete was comforted by the doctor" is semantically equivalent to "the doctor comforted the athlete," we can target the former ordering through a passive control on the predicate, and the latter through an active control. Additionally, we use the location of blank tokens (e.g., <id_*> in Figure 1 and Table 1) to control the position of generated arguments (Wu et al., 2021). For example, "in the operating room" can appear at the beginning or end of the generation.
Table 2: Overview of control codes for Tailor. Primary controls build on predicate/argument labels, and other secondary controls further affect the form and content of generations. The argument content can be the * symbol (meaning nothing specified), connecting words ("in"), prefixes ("in the"), noun chunks ("operation room"), or the full span. Combined with specificity, it determines how much new text to generate.

Input Format Design
We aim to integrate the aforementioned control dimensions into an input format, and finetune language models to reconstruct full sentences reflecting these controls (output). As shown in Table 1, we start our input with a bracketed header, which is a series of abstract control codes, each denoting the semantic roles and keywords for a span-to-generate (details in Table 2). We map original semantic roles in PropBank to human-readable labels (e.g., ARG0 → AGENT) in order to leverage latent knowledge learned by pretrained models about the meaning of role labels (Paolini et al., 2021). After the header, we append the context, which consists of natural language and blank tokens. The natural language part represents text that should be preserved, and the blank tokens specify where any new text following controls in the header should be generated.
Note that we explicitly separate the header from the context. This is to detach the placement of a role from its semantic representation, such that given any combination of target roles in the header (whose optimal ordering is usually unknown), the generator can recombine them in the most fluent way. We further remove possible correlations between the control codes and the blanks in the context in two ways. First, we remove relative orderings from the control codes. Instead, we begin every input header with the predicate verb (VERB), then core arguments (AGENT, PATIENT), and then randomly ordered adjunct arguments (LOCATIVE, TEMPORAL, etc.). This input-independent ordering discourages the generator from solely following the order in which the arguments appear in the header. Second, we insert extra empty blanks into the context (e.g., <id_3> in Table 1B), so the generator can learn to generate spans in the blank locations that result in the most fluent text.
With this flexibility in argument reordering comes the challenge of making strict controls on a single argument: even when we only want to change verb tense, the generator may reorder other arguments. To balance the tradeoff between generation flexibility and strict control, which facilitates minimal perturbations (Ross et al., 2020), we further vary the number of arguments encoded in the header. As in Table 1C, our generator can take inputs that only mask a subset of arguments, such that e.g., any changes on the LOCATIVE constraint or the VERB do not affect the agent and patient. More details about input formats are in §A.1.
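The header layout described above can be illustrated with a minimal Python sketch. It assumes the `ROLE+attribute: content` syntax with ` | ` separators seen in the example inputs of this paper; the function names (`format_code`, `build_header`, `build_input`) and the dictionary layout are hypothetical and are not taken from the Tailor codebase.

```python
import random

CORE_ROLES = ("AGENT", "PATIENT")  # core arguments named in the text


def format_code(role, attrs, content):
    """Render one control code, e.g. 'AGENT+complete: the doctor'."""
    return "+".join([role, *attrs]) + ": " + content


def build_header(verb, arguments, rng):
    """Order codes as: predicate, then core arguments, then shuffled adjuncts."""
    codes = [format_code("VERB", verb["attrs"], verb["content"])]
    for role in CORE_ROLES:
        if role in arguments:
            arg = arguments[role]
            codes.append(format_code(role, arg["attrs"], arg["content"]))
    adjuncts = [role for role in arguments if role not in CORE_ROLES]
    rng.shuffle(adjuncts)  # input-independent ordering for adjunct arguments
    for role in adjuncts:
        arg = arguments[role]
        codes.append(format_code(role, arg["attrs"], arg["content"]))
    return "[" + " | ".join(codes) + "]"


def build_input(header, context):
    """Concatenate the header with a context of preserved text and blank tokens."""
    return header + " " + context


verb = {"attrs": ["active", "past"], "content": "comfort"}
arguments = {
    "LOCATIVE": {"attrs": ["partial"], "content": "in"},
    "PATIENT": {"attrs": ["complete"], "content": "the athlete"},
    "AGENT": {"attrs": ["complete"], "content": "the doctor"},
}
header = build_header(verb, arguments, random.Random(0))
# The context here masks every span and adds extra empty blanks (an illustrative choice).
model_input = build_input(header, "<id_0> <id_1> <id_2> <id_3> <id_4>.")
print(model_input)
```

Keeping the header order independent of the order in which roles are supplied mirrors the motivation given in the text: the generator should not be able to read argument placement off the header.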

Training
We use OntoNotes 5.0 train (Pradhan et al., 2013) as training data and T5-base (Raffel et al., 2020) as our generator, creating original inputs and outputs using gold semantic roles in the dataset, as in Table 1. In order to train our generator to be sensitive to the different input formats described in the previous section, for each original input, we randomly sample the number of arguments to mask, the number of extra empty blanks, and the keyword content/specificity for each role. After data processing, our training data consists of 223,619 positive examples and 541,424 negative ones (details in §A.2).
A key issue in training our generator to be controllable is that standard likelihood training is insufficient for encouraging control code following, as there may exist signals beyond the control codes for the form that a generation should take. Consider the input: [VERB+active+past: comfort | AGENT+partial: athlete | PATIENT+complete: the doctor] In the operating room, <id_0>, <id_1> <id_2>. A generator trained with standard likelihood training may ignore controls AGENT and PATIENT and instead output text such as "The doctor comforted the athlete" rather than "The athlete comforted the doctor," as the former is more natural given context "in the operating room." In order to encourage reliance on controls, we incorporate unlikelihood training (Welleck et al., 2020) to penalize our generator for generating text that conflicts with inputs. That is, besides Table 1A-C which are used for maximum likelihood training, we also create "negative" samples by randomly perturbing the control codes in our header (as in Table 1N, last row), such that most spans in the target output are not aligned with the control codes anymore. As detailed in §A.1, we create three negative samples per input, which randomly perturb: 1) verb voice/tense and primary controls for arguments, 2) keyword contents, and 3) keyword specificities.
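The difference between the two training signals can be sketched in a few lines of Python. This is a simplified illustration of the token-level unlikelihood term from Welleck et al. (2020), applied to whole target sequences; how Tailor weights, masks, or batches these terms is not specified in this section, so the function names and the `eps` clamp below are assumptions.

```python
import math


def likelihood_loss(token_probs, eps=1e-8):
    """MLE term for a positive example: -sum_t log p(y_t)."""
    return -sum(math.log(max(p, eps)) for p in token_probs)


def unlikelihood_loss(token_probs, eps=1e-8):
    """Unlikelihood term for a negative example: -sum_t log(1 - p(y_t))."""
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs)


def example_loss(token_probs, is_negative):
    """Pick the loss by example type (positive vs. perturbed-header negative)."""
    if is_negative:
        return unlikelihood_loss(token_probs)
    return likelihood_loss(token_probs)


# Probabilities the model assigns to each target token (toy numbers).
confident = [0.9, 0.8, 0.95]
print(example_loss(confident, is_negative=False))  # small: target matches the controls
print(example_loss(confident, is_negative=True))   # large: target conflicts with the controls
```

Under this formulation, a model that confidently produces a target that conflicts with its header is penalized heavily, which is the behavior the negative samples are meant to discourage.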

Creating Perturbations with Tailor
With Tailor, we can create diverse perturbations by varying controls in inputs. Given an original sentence, we transform it to an input for Tailor by extracting its semantic parses, masking arguments we wish to modify (as well as the predicate), and adding their control codes to the input header. [3] Then, we modify the derived input for Tailor to generate perturbed sentences. While the input can be modified arbitrarily to control generation, we provide an easily-extendable set of macros that capture several common themes in the literature.
Footnote 3: External semantic role labelers can be used when gold annotations are not available. Our experiments use the open-sourced implementation of the SRL predictor.
Primitive perturbation operations. Perturbations in existing NLP literature broadly fall under three categories. First, syntactic rewriting primarily involves shuffling text to create paraphrases (Zhang et al., 2019) or adversarial examples (Iyyer et al., 2018). We implement such shuffling through operations that perturb predicates. For example, CHANGE_VVOICE preserves surface-level sentence meaning. Further, CHANGE_IDX, which perturbs placements of blank tokens, can be used to reshuffle textual fragments, whereas SWAP_CORE and CHANGE_CONTENT can be used to swap text fragments from different arguments to reshuffle text in meaning-changing ways.
Second, expansion and abstraction add or remove text fragments from a sentence based on context (Wu et al., 2021). We recreate these through operations on keywords. Changing specificity with CHANGE_SPEC to be sparser results in expanded fragments, and removing existing keywords with CHANGE_CONTENT can help abstract text. Additionally, we can DELETE whole arguments.
Finally, data recombination involves recombining existing textual fragments, within or across inputs (Akyürek et al., 2020; Andreas, 2020). With CHANGE_CONTENT, we can add contents not in the original sentence, such that additional context (e.g., from corresponding paragraphs in question answering tasks) can be integrated into perturbations.
In practice, these primitive perturbation operations can be used in conjunction with external knowledge bases to achieve targeted edits. For example, if used with WordNet (Miller, 1998), CHANGE_CONTENT can recreate perturbations that contain natural logic (MacCartney and Manning, 2014): In Table 3, doctor → adult creates an entailment relationship between the original and perturbed sentence, with "doctor" being a hyponym of "adult." Additionally, these operations can be composed to achieve more complex perturbation strategies, as shown in §5, §6, and §7.
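To make the idea of composable operations concrete, the following Python sketch models control codes as a dictionary and each primitive as a pure function over it. The operation names follow the text, but the data layout, the function signatures, and the `compose` helper are illustrative assumptions rather than Tailor's actual interface; in particular, the role change LOCATIVE → TEMPORAL from Figure 1 is modeled here as a delete followed by a content insertion.

```python
import copy
from functools import reduce


def change_vtense(codes, tense):
    new = copy.deepcopy(codes)
    new["VERB"]["tense"] = tense
    return new


def change_vvoice(codes, voice):
    new = copy.deepcopy(codes)
    new["VERB"]["voice"] = voice
    return new


def swap_core(codes):
    """Exchange the keyword contents of AGENT and PATIENT."""
    new = copy.deepcopy(codes)
    new["AGENT"]["content"], new["PATIENT"]["content"] = (
        new["PATIENT"]["content"], new["AGENT"]["content"])
    return new


def change_spec(codes, role, spec):
    new = copy.deepcopy(codes)
    new[role]["spec"] = spec
    return new


def change_content(codes, role, content):
    """Set keyword content; a missing role is added as 'complete' (assumption)."""
    new = copy.deepcopy(codes)
    new.setdefault(role, {"spec": "complete"})["content"] = content
    return new


def delete(codes, role):
    new = copy.deepcopy(codes)
    new.pop(role, None)
    return new


def compose(*ops):
    """Apply operations left to right."""
    return lambda codes: reduce(lambda acc, op: op(acc), ops, codes)


original = {
    "VERB": {"voice": "active", "tense": "past", "content": "comfort"},
    "AGENT": {"spec": "complete", "content": "the doctor"},
    "PATIENT": {"spec": "complete", "content": "the athlete"},
    "LOCATIVE": {"spec": "complete", "content": "in the operation room"},
}
figure1_like = compose(
    lambda c: change_vtense(c, "present"),
    lambda c: change_spec(c, "PATIENT", "partial"),
    lambda c: delete(c, "LOCATIVE"),
    lambda c: change_content(c, "TEMPORAL", "in the midst of the earthquake"),
)
perturbed = figure1_like(original)
print(perturbed)
```

Because each function returns a new dictionary, the original codes stay untouched and the same primitives can be recombined into other strategies.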

Tailor Operations: Perturbed Part of Input / Generated Text

Syntactically controlled rewriting

CHANGE_VTENSE(present)
[VERB+active+past → present: comfort] In the operation room, the doctor comforts the athlete.

CHANGE_VVOICE(passive)
[VERB+active → passive+past: comfort] In the operation room, the athlete was comforted by the doctor.

CORE(SWAP_CORE)
[AGENT+complete: the athlete → doctor | PATIENT+complete: the doctor → athlete] In the operation room, the athlete comforted the doctor.

Sentence expansion and abstraction

LOCATIVE:CHANGE_SPEC(partial)
[LOCATIVE+complete → partial: in the operation room] Under the dim light in the operation room, the doctor comforted the athlete.

LOCATIVE:CHANGE_CONTENT(in the room)
[LOCATIVE+complete: in the operation room] In the operation room, the doctor comforted the athlete.

LOCATIVE:DELETE
[LOCATIVE+complete: in the operation room] In the operation room, the doctor comforted the athlete.

Data recombination (with external labels and/or contents)

AGENT:CHANGE_CONTENT(the adult)
[AGENT+complete: the doctor → the adult] In the operation room the adult comforted the athlete.

CAUSE:CHANGE_CONTENT(because he was in pain)
Source sentence: The baby was crying because he was in pain.
[CAUSE+complete: because he was in pain] In the operation room the doctor comforted the athlete because he was in pain.

Table 3: We design a list of primitive operations to guide the perturbation on Tailor's inputs. The operations can be parsed into concrete changes on inputs, which drives Tailor to perturb a given sentence. Here, we show their usage on the example in Figure 1.
filter generations using perplexity scores computed with GPT-2 to exclude degenerate outputs.

Intrinsic Evaluation
Following desiderata identified in Polyjuice (Wu et al., 2021) and MiCE (Ross et al., 2020), we evaluate Tailor on the fluency, controllability, and closeness of its generations.
Metrics. Fluency measures whether the generated text is grammatically correct and semantically meaningful. Following Ross et al. (2020), we ask whether perturbing a sentence with Tailor drastically changes its likelihood. We compute the loss value for both the original and edited texts using a pretrained GPT-2, and report the ratio of edited / original. We aim for a value of 1.0, which indicates equivalent losses for the original and edited texts.
Controllability measures whether the generator responds to the designated control criteria. We rely on cycle consistency to evaluate the controls in Table 2, e.g., checking whether the semantic roles predicted by an SRL predictor on the generated text match the control codes in the input (i.e., whether "in the midst of the earthquake" in Figure 1 gets detected with a TEMPORAL tag). While other controls are easy to recover, the SRL predictions can be noisy; therefore, we determine how well cycle consistency measures reflect true controllability of semantic roles through manual annotation (more details in §B): We labeled whether a generated span matches the designated semantic roles for 98 spans, and compared the controllability measures from the SRL predictor with the manual annotations. We obtained a positive Matthews correlation coefficient φ = 0.49 between the two, suggesting that the cycle consistency measures positively correlate with true controllability measures.
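For readers unfamiliar with the Matthews correlation coefficient used to compare SRL-based and manual controllability judgements, here is a small self-contained Python sketch of the standard binary formula. The labels in the example are toy values, not the 98 annotated spans from the paper.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# 1 = "span follows the designated role"; toy judgements only.
manual = [1, 1, 1, 0, 0, 1, 0, 1]
srl_based = [1, 1, 0, 0, 1, 1, 0, 1]
print(matthews_corrcoef(manual, srl_based))
```

A value of 1 means the automatic check always agrees with the annotators, 0 means no better than chance, and -1 means systematic disagreement.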
Closeness captures whether the generated sentence achieves the desired perturbations only with necessary changes from the original. Since our generator takes controls on the argument span level, we measure closeness with a weighted F1 score on the expected-to-change and actually-changed spans in the original sentence. We identify expected changes from the perturbation operations; for example, in Figure 1A, we expect to change all the spans except for the agent "the doctor." Then, we find actually changed spans based on edit distance: if at least 50% of the tokens within a span are changed (e.g., "operation room" in LOCATIVE), then we consider the span edited. We weigh the spans by their lengths to arrive at the final F1.
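The closeness metric can be sketched as follows. This is one plausible reading of the description above: it uses Python's `difflib.SequenceMatcher` as a stand-in for the edit-distance alignment (an assumption, since the exact alignment procedure is not given here), marks a span as changed when at least half of its tokens are unmatched, and weights spans by token length.

```python
from difflib import SequenceMatcher


def kept_token_indices(original, edited):
    """Indices of original tokens that survive unchanged in the edited sentence."""
    matcher = SequenceMatcher(a=original, b=edited, autojunk=False)
    kept = set()
    for block in matcher.get_matching_blocks():
        kept.update(range(block.a, block.a + block.size))
    return kept


def closeness_f1(original, edited, spans, threshold=0.5):
    """spans: (start, end, expected_to_change) over original tokens, end exclusive."""
    kept = kept_token_indices(original, edited)
    tp = actual_w = expected_w = 0
    for start, end, expected in spans:
        length = end - start
        changed = sum(1 for i in range(start, end) if i not in kept)
        actual = changed / length >= threshold
        if expected:
            expected_w += length
        if actual:
            actual_w += length
        if expected and actual:
            tp += length
    if tp == 0:
        return 0.0
    precision = tp / actual_w
    recall = tp / expected_w
    return 2 * precision * recall / (precision + recall)


original = "in the operation room the doctor comforted the athlete".split()
edited = "in the operation room the doctor comforts the athlete".split()
spans = [(0, 4, False), (4, 6, False), (6, 7, True), (7, 9, False)]
print(closeness_f1(original, edited, spans))
```

In this toy tense change, only the predicate span was expected to change and only it did, so precision and recall are both perfect.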
Results. We evaluate Tailor by perturbing 1,000 randomly selected sentences from the OntoNotes 5.0 development set, created the same way as we create negative samples during training (details in §A.1). Tailor generates fluent perturbations, with a loss ratio of 0.982 indicating no notable change in language modeling loss after the edit. Its generations also tend to be close to the original sentence (with an average F1 score of 64.3%), while following the designated controls: it follows controls on predicates 75%-80% of the time, and also generates reasonably correct arguments (with 70% controllability on semantic roles and ∼65% on keywords). Controllability is a core benefit of unlikelihood training: we perform an ablation study, and compare Tailor with a baseline that is finetuned from T5 without unlikelihood training (called Tailor MLE). Table 4 shows that unlikelihood training encourages controls and minimal perturbations, with the metrics increasing by up to 20%.
Further, as mentioned in §2.2, our input format supports modulating fluency and closeness at generation time. We can increase closeness by only masking the arguments we want to perturb. To quantify this effect, we randomly select only one argument to perturb for 1,000 sentences, but vary the number of arguments masked and the number of empty blanks inserted. We maximize closeness when we only mask the target argument to perturb in the format of Table 1B (with F1 = 67.4%), whereas masking two extra arguments and inserting six empty blanks decreases closeness by 3% and 6%, respectively. On the other hand, when we want to trade off closeness to prioritize fluency (e.g., when inserting extra roles whose optimal locations in the generation are not known in advance), we can do so by adding more empty blank tokens. We experiment with this setting on another 1,000 sentences, and observe that adding six extra blanks increases the fluency ratio from 0.93 to 0.95.

Application 1: Contrast Set Creation
To demonstrate how Tailor helps assemble a variety of perturbation strategies, we use it to replicate contrast and challenge sets for different NLP tasks and datasets, including question answering

Replicating Contrast Sets with Tailor
As shown in Table 5, we take advantage of two key properties of Tailor. First, Tailor can make perturbations that are context-dependent. To recreate the BoolQ contrast set, we replicate change events in  by replacing content keywords in questions with words in the paragraph that have the same semantic roles. For example, the paragraph in Table 5 indicates "his bride" can serve as an agent. Second, Tailor allows for compositional changes. For example, as shown in Table 5, we change prepositional phrase (PP) attachments from verb to noun to recreate the UD Parsing contrast set through the following composition of perturbation operations: append the preposition to the patient keyword content (e.g., "ham or sausages with"), change patient keyword specificity from complete → partial (to generate a new PP attaching to the patient), and delete the argument with the original verb attachment (e.g., ADVERBIAL argument "with your breakfast"). Changing attachments from noun to verb involves a similar procedure, except that we remove the preposition from the patient keyword content and introduce adjunct arguments with the preposition as a partial keyword (see §C for an example). [7]
Table 5 (fragment): BoolQ contrast set, 82% validity (k=1).
Validity of generated contrast sets. Manually creating contrast sets from scratch is expensive (e.g.,  reported spending 10-15 minutes per perturbation for syntactic parsing datasets), whereas inspecting and labeling automatically generated ones can be much more efficient (Wu et al., 2021). We see our perturbation strategies as successful if they help alleviate the burden of manual creation, i.e., a contrast set author can easily label or take inspiration from Tailor's top generations. We reflect this through manual inspections: Two of the authors sampled 100 original instances per task, inspected the top-k perturbations from Tailor, and labeled an instance to be successfully perturbed if there is at least one perturbation out of k that changes the groundtruth answer while being fluent. [8] Because we exercised controls at different levels of granularity (i.e., QA implication, Matres, and BoolQ focus mostly on syntactic rewrites with predetermined content, whereas UD requires sourcing contents from the language model), we set k = 10 for UD (an upper bound for not overloading the human inspector) and k = 1 for other tasks. Table 5 shows that applying these perturbation strategies with Tailor results in contrast sets with high validity.
Footnote 7: For UD Parsing contrast set generation, we use constrained decoding (Hokamp and Liu, 2017) to prevent generation of the original prepositional phrase.
Footnote 8: We also include perturbations that produce slight changes to context, as these can be easily fixed by an annotator.
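The validity criterion (an instance counts as successfully perturbed if at least one of its top-k perturbations is judged valid) can be written down directly. The sketch below is illustrative only; the function name and the boolean encoding of annotator judgements are assumptions.

```python
def validity_at_k(judgements, k):
    """Fraction of instances with at least one valid perturbation in the top k.

    judgements: one list per original instance, holding annotator verdicts
    (True = fluent and label-changing) for its perturbations in ranked order.
    """
    if not judgements:
        return 0.0
    successes = sum(1 for labels in judgements if any(labels[:k]))
    return successes / len(judgements)


# Toy verdicts for four instances with up to three ranked perturbations each.
toy = [
    [True, False, False],
    [False, True, False],
    [False, False, False],
    [False, False, True],
]
print(validity_at_k(toy, k=1))
print(validity_at_k(toy, k=3))
```

Raising k can only increase this rate, which is why the choice of k has to be balanced against the inspection effort it imposes on the annotator.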

Measuring Contrast Set Quality
We assess the quality of Tailor-generated contrast sets by measuring their lexical diversity and impact on feature artifacts, both of which play important roles in dataset debiasing. We also compare these metrics to those of human-produced contrast sets.
Lexical diversity. We measure lexical diversity on UD Parsing contrast sets because they involve sufficient sentence expansion and data recombination. Specifically, we compare the Tailor- and human-generated contrast examples for the same 100 original UD examples: we randomly sample one contrastive edit for each valid instance, heuristically extract the modified prepositional phrases from the contrastive edits, and then compute diversity as the ratio of unique tokens to total new tokens in all the perturbed arguments, filtering out stopwords. The ratios are 0.783 and 0.883 for Tailor and humans, respectively, in the noun to verb direction. For verb to noun, they are both 1.0. These values suggest that Tailor can be used to generate contrast sets without significantly reducing lexical diversity. Tailor generations are also quite distinguishable from human generations: their unique tokens overlap by < 15% in verb to noun, and by ≈ 6% in noun to verb. These values suggest that Tailor can be used as a collaborative tool to diversify the pool of tokens in human-generated contrast sets.
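A minimal sketch of the diversity ratio follows, assuming whitespace tokenization, lowercasing, and a small illustrative stopword list; the paper does not state which tokenizer or stopword list was used, so both are placeholders.

```python
# Illustrative stopword list; the actual list used in the paper is not specified.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "with", "to", "and", "or", "for", "at", "by"}


def lexical_diversity(phrases, stopwords=STOPWORDS):
    """Unique non-stopword tokens divided by total non-stopword tokens."""
    tokens = [tok for phrase in phrases
              for tok in phrase.lower().split()
              if tok not in stopwords]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Toy prepositional phrases extracted from contrastive edits.
print(lexical_diversity(["with your breakfast", "with a fork"]))
print(lexical_diversity(["with the fork", "with a fork"]))
```

A ratio of 1.0 means every new content token is distinct, while lower values indicate that the same words are being reused across perturbed arguments.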
Feature-level artifacts. We follow Gardner et al. (2021)'s analysis to determine whether creating perturbations with Tailor helps to remove dataset artifacts. Gardner et al. (2021) showed that making minimal perturbations removes single-feature artifacts when $(1 + e_i)/s = 2$, where $e_i$ is the probability that feature $i$ is edited, and $s$ is the probability that an edit changes the label. We manually labeled the same number of automatically-perturbed examples as were in the original BoolQ contrast set, and found that Tailor produces edits with an average value of $(1 + e_i)/s$ that is close to that produced by humans:
(Dasgupta et al., 2018; Naik et al., 2018). These perturbations can either preserve or alter the meaning of the original hypothesis. For example, our meaning-changing strategy, replace core with subsequence, replaces keyword contents of core arguments with noun chunks of other arguments (e.g., The judge behind the manager saw the doctors. → The doctors saw the manager.). As in Min et al. (2020)'s setup, we map the meaning-preserving perturbations to the label entailment and meaning-altering perturbations to neutral.
We train classifiers built on the base version of RoBERTa (Liu et al., 2019) on different subsets of data: the original SNLI train data (baseline) and SNLI train data with ≈5% of hypotheses augmented with Tailor perturbations. [11] For each subset, we train 20 models, each with a different random seed governing model initialization and randomly selected data subset. We evaluate each classifier on the in-domain SNLI test set and the out-of-domain HANS test set (McCoy et al., 2019), which is designed to diagnose inference heuristics built on superficial syntactic properties. [12] As shown in Table 6, the augmentation leads to an out-of-distribution gain of +1.73 points on overall HANS and +4.46 points on the "non-entailment" subset. The gains are significant, with p = 0.002 using Student's t-test. Thus, Tailor perturbations decrease reliance on a well-known, lexical-overlap-driven inference heuristic for NLI.
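Because SNLI classifiers predict three labels while HANS is annotated with two, evaluation requires collapsing predictions, as the accompanying footnote on HANS describes. The short Python sketch below shows that mapping together with a plain accuracy computation; the function names are hypothetical and the predictions are toy values.

```python
def collapse_to_hans(label):
    """Map a three-way NLI prediction onto the binary HANS label space."""
    return "entailment" if label == "entailment" else "non-entailment"


def hans_accuracy(predictions, gold):
    """Accuracy after collapsing neutral/contradiction to non-entailment."""
    collapsed = [collapse_to_hans(p) for p in predictions]
    correct = sum(1 for p, g in zip(collapsed, gold) if p == g)
    return correct / len(gold)


predictions = ["neutral", "contradiction", "entailment", "entailment"]
gold = ["non-entailment", "non-entailment", "entailment", "non-entailment"]
print(hans_accuracy(predictions, gold))
```

The last toy item, a wrongly predicted entailment on a non-entailment example, is the kind of error that a lexical-overlap heuristic tends to produce.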

Application 3: Style Transfer
Here, we show how Tailor can be applied to style transfer. We evaluate Tailor without any finetuning [13] on the StylePTB benchmark (Lyu et al., 2021), which builds on the Penn Treebank and assesses fine-grained stylistic changes (lexical, syntactic, semantic, and thematic), as well as compositions of multiple transfers. Single transfers require editing an input sentence along one fine-grained stylistic dimension (e.g., To Future Tense). Compositional transfers require editing along multiple stylistic dimensions at the same time (e.g., To Future Tense + Active To Passive).
Footnote 11: We augment the original 549,367 SNLI train instances with 30,147 total new instances. See §D for more details.
Footnote 12: For HANS, which contains binary labels, we collapse neutral and contradiction predictions to non-entailment.
Footnote 13: This evaluation is zero-shot in spirit, as Tailor
For each transfer, we create perturbations for each predicate in the original input using the procedure described in §3. Because this process results in multiple perturbations (one per verb), we choose the one with the lowest perplexity from GPT-2 to represent the transfer. See §E for details.
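Selecting one perturbation per input by perplexity can be sketched as below. The scoring function is a hypothetical stand-in for a language model such as GPT-2 that returns per-token negative log-likelihoods; the toy lookup table exists only to make the example runnable and does not reflect real model scores.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def pick_transfer(candidates, score_fn):
    """Return the candidate (one per predicate) with the lowest perplexity."""
    return min(candidates, key=lambda text: perplexity(score_fn(text)))


# Hypothetical per-token NLLs standing in for a real language model.
TOY_NLLS = {
    "the doctor will comfort the athlete": [2.1, 3.0, 1.2, 2.5, 1.9, 3.1],
    "the doctor comfort will the athlete": [2.1, 3.0, 6.4, 5.8, 1.9, 3.1],
}
best = pick_transfer(list(TOY_NLLS), TOY_NLLS.__getitem__)
print(best)
```

Using the mean rather than the sum of token losses keeps the comparison fair across candidates of different lengths.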
We compare Tailor with multiple baselines reported by Lyu et al. (2021): GPT-2 and RetrieveEdit are the best-performing single-transfer models evaluated but require separate models to be trained for each individual transfer. CS-GPT-* are models trained on compositional subsets of data (e.g., Tense+Voice, detailed in Table 7 caption). CS-Sys-Gen are ablations of CS-GPT-* trained only on corresponding individual changes but evaluated on compositional transfers. [14] Table 7 shows mean BLEU scores of generations for single and compositional transfers. [15] We evaluate on transfers for which Lyu et al. (2021) show model results in the paper, excluding some for which our semantic-role-derived inputs are not well-suited (see §E). When perturbations using Tailor result in unsuccessful transfers, either due to a failure of perturbation strategy (e.g., no verbs are found by our SRL predictor) or due to a degenerate output (see §9), we treat them as having a BLEU score of 0.0; we also show results on a subset of the StylePTB test set that filters out these bad generations (shown in Table 7 as Filtered Test).
Footnote 14: CS-Sys-Gen refers to CS-GPT-Zero in Lyu et al. (2021).
Footnote 15: We report Bleu_1 from nlg-eval (Sharma et al., 2017).
As shown, Tailor outperforms the baseline system trained without compositional fine-tuning, CS-Sys-Gen, on 8/9 compositions and even outperforms CS-GPT-TV on Tense+Voice, which is fine-tuned specifically on this data. Tailor also performs well on single transfers, significantly outperforming 5 of the GPT-2 and 2 of the RetrieveEdit models finetuned on individual transfers. Low performance on some transfers from Tailor (i.e., ToPresent+ActiveToPassive, ToFuture+ActiveToPassive) appears to be driven by unsuccessful transfers, rather than generations that do not follow controls, as indicated by the difference in performance on the filtered subset. Importantly, with Tailor, we achieve these gains in compositional transfers and comparable performance on single transfers with a single model and without any transfer-specific finetuning.
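The two ways of aggregating scores described above (counting unsuccessful transfers as 0.0 versus filtering them out) differ only in how failures are handled. The following sketch makes that explicit; it takes precomputed per-example scores, uses `None` to mark an unsuccessful transfer, and does not itself compute BLEU. The function name and encoding are assumptions.

```python
def mean_scores(scores):
    """Return (full-test mean, filtered-test mean).

    scores: per-example BLEU values, with None marking an unsuccessful
    transfer (no verb found, or a degenerate output).
    """
    full = sum(0.0 if s is None else s for s in scores) / len(scores)
    kept = [s for s in scores if s is not None]
    filtered = sum(kept) / len(kept) if kept else 0.0
    return full, filtered


full, filtered = mean_scores([1.0, None, 0.5])
print(full, filtered)
```

A large gap between the two means indicates that low scores come mostly from failed transfers rather than from generations that ignore the controls.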
Controlled generation Controllable Text Generation has been widely used to influence the various properties of generated text, for tasks like textsummarization ( , which usually under specify the desired patterns. In contrast, Tailor facilitates a variety of linguistically-driven generations using semantic roles and keywords, and therefore concretizes otherwise sparse controls (e.g., we can specify making a sentence more negative through negation.) Recent work has explored using syntactic signals for paraphrasing (Iyyer et al., 2018;Kumar et al., 2020), which are similar to ours in their high-dimensional specification. Still, to the best of our knowledge, Tailor is the first to incorporate fine-grained semantic controls.
Our generator is also closely related to methods for structured generation, which reconstruct sentences based on semantic representations. Abstract Meaning Representation (AMR) (Banarescu et al., 2013; Mager et al., 2020) is an alternative representation worth exploring in future work. It presents a trade-off between training complexity and control flexibility: Generators that take AMR controls might be able to further handle entity recursions (Damonte and Cohen, 2019), but expressing such relationships in the inputs would be nontrivial. However, when we omit these relationships and reduce AMR graphs to sequences (e.g., as in Damonte and Cohen, 2019), semantic parses derived from PropBank have the advantage of enabling stricter control on complete keywords, as they annotate complete spans, whereas AMR only accepts key entities without the "syntactic sugar" (e.g., "as," "in") (Banarescu et al., 2013).
Data perturbation. Controlled generators have been shown to be particularly useful for text perturbation. Besides the aforementioned paraphrasing and style transfer, prior works have also successfully generated contrastive examples that are useful for model training, evaluation, and explanation. They usually rely on application-specific class labels (Ross et al., 2020; Madaan et al., 2020b; Sha et al., 2021; Akyürek et al., 2020) or heuristic perturbation strategies that need to be expressed through pairs of original and perturbed sentences (Wu et al., 2021), which are expensive to generalize. Recently, Huang and Chang (2021) designed SynPG, a paraphraser that can mimic parse tree structures learned from non-paired sentences. We similarly train linguistically controlled generators on single sentences, but with a focus on fine-grained semantic perturbations, such that we can more broadly cover different perturbation strategies by composing changes to control codes.
Also related are prior works creating minimally edited datasets through extensive human efforts, either by manually rewriting instances (Kaushik et al., 2020), or by defining perturbation functions and templates (e.g., Andreas, 2020; Li et al., 2020; Ribeiro et al., 2020, 2018; Zhang et al., 2019; Kim and Linzen, 2020; Wu et al., 2019). As demonstrated in §5, Tailor can be used to recreate many such datasets with less manual effort. Moreover, the low overlap between Tailor's perturbations and humans' motivates future explorations of manual and semiautomated generation, where the generator can compensate for humans' systematic omissions.

Conclusion and Future Work
We propose Tailor, a flexible system for a broad set of perturbations through semantic controls. By composing perturbation operations, Tailor enables complex and context-aware changes, which support various downstream applications, including contrast set generation, data augmentation, and fine-grained style transfer. Tailor demonstrates that it is possible to drive fine-grained perturbations with semantic features directly derived from an instance. Crucially, it also shows that language models can be finetuned to learn representations of control codes, if paired with unlikelihood training, which encourages reliance on structured controls, rather than surrounding natural text.
Extending Tailor. Although the applications we explore in this work are perturbation-focused, the Tailor generator is well-suited for controlled generation tasks more broadly. Given key entities or arguments as keywords and fully masked contexts, we envision that Tailor can help with argument generation (Schiller et al., 2021), compositional data augmentation (Akyürek et al., 2020), caption generation (Chen et al., 2020), etc.
The design of controls is also worthy of in-depth exploration. As mentioned in §8, AMR might be an alternative for semantic representation, if our primary goal is to specify key entities and/or to express non-sequential relations (Damonte and Cohen, 2019). On the other hand, dependency parsing labels are useful for fine-grained control over syntactic changes (see §8); future work may try to find a balance between syntactic and semantic controls.
Factors that affect Tailor's capability. Though broadly applicable, Tailor's controllability and effectiveness vary across inputs. First, creating automatic perturbations with Tailor requires external SRL predictors, which can be noisier on some semantic roles than others: the one we use predicts core arguments more accurately than adjunct arguments (e.g., F1 for ARG0 and ARGM-EXT is 92.9 versus 51.4). Empirically, most applications lean towards modifying the more common arguments, making the predictor reasonably applicable. However, the low performance of current models would present a bottleneck in perturbing more challenging language phenomena. In such cases, careful SRL predictor augmentation, or even manual semantic role annotation, might be necessary. Second, we notice that for some inputs, Tailor produces degenerate outputs. We hypothesize that this effect is a byproduct of unlikelihood training: the Tailor generator learns to reduce the likelihood of negative sequences by generating tokens that are very unlikely to appear in natural text. Certain generation hyperparameters, particularly the number of beams, can reduce the number of degenerate outputs. While we perform unlikelihood training at the sequence level, future work can investigate the effect of penalizing generation at the level of tokens or spans, which may provide a finer-grained signal for which spans should be considered unlikely, as well as more strategically balancing positive and negative samples.
Having noted these opportunities, we believe Tailor is already a powerful tool for perturbation, particularly for tasks where compositional changes are required. Tailor is open source and available at https://github.com/allenai/tailor.

A.1 Input and Output Formats
All headers in inputs to the Tailor generator begin with verb controls, followed by core argument controls (first agent, then patient), and then adjunct argument controls. Secondary controls are always given in the order of control code+voice+tense:lemma for verbs and control code+keyword specificity:keyword content for arguments. We also blank the auxiliary verbs of the predicate in an input, using spacy to detect them. We exclude discontinuous arguments (e.g., those with raw SRL labels B-C-*), as well as those with referents (e.g., those with raw SRL labels B-R-*), from input headers. We map ARG0 → AGENT and ARG1 → PATIENT. For other numbered arguments, we create human-readable labels by using argument functions included in the PropBank frame for the given predicate (Palmer et al., 2005).
On the output side, we ask the model to generate the full sentence (Table 1). We add the semantic roles for all the generated arguments to help the generator build explicit mappings between the input control codes and the output spans; this can be important when the input codes are ambiguous (e.g., a TEMPORAL argument and a LOCATIVE argument that both have the keyword "in").
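The header conventions in this section can be sketched as follows. This is a minimal illustration, not the released implementation: the `Argument` class, the `role_names` lookup (standing in for PropBank-derived labels), and the bracket and `|` delimiters are assumptions made for the example; only the ordering, the `+` and `:` separators, the ARG0/ARG1 mapping, and the exclusion of C-* and R-* arguments come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List

# Mapping stated in the text: ARG0 -> AGENT, ARG1 -> PATIENT.
ROLE_MAP = {"ARG0": "AGENT", "ARG1": "PATIENT"}


@dataclass
class Argument:  # hypothetical container for one SRL argument
    srl_label: str    # raw label, e.g. "B-ARG0", "ARGM-LOC", "B-R-ARG0"
    specificity: str  # "complete", "partial", or "sparse"
    keyword: str


def strip_bio(label: str) -> str:
    """Drop a leading BIO prefix such as "B-" or "I-" if present."""
    return label[2:] if label[:2] in ("B-", "I-") else label


def rank(raw: str) -> int:
    """Order: agent, patient, other numbered arguments, then adjuncts."""
    if raw == "ARG0":
        return 0
    if raw == "ARG1":
        return 1
    return 3 if raw.startswith("ARGM") else 2


def build_header(lemma: str, voice: str, tense: str,
                 arguments: List[Argument],
                 role_names: Dict[str, str]) -> str:
    # Verb controls always come first: control code+voice+tense:lemma.
    parts = [f"VERB+{voice}+{tense}:{lemma}"]
    kept = []
    for arg in arguments:
        raw = strip_bio(arg.srl_label)
        # Exclude discontinuous (C-*) and referent (R-*) arguments.
        if raw.startswith(("C-", "R-")):
            continue
        kept.append((raw, arg))
    kept.sort(key=lambda pair: rank(pair[0]))  # stable sort keeps ties in order
    for raw, arg in kept:
        code = ROLE_MAP.get(raw) or role_names.get(raw, raw)
        parts.append(f"{code}+{arg.specificity}:{arg.keyword}")
    # Delimiters below are assumed for illustration only.
    return "[" + " | ".join(parts) + "]"


header = build_header(
    "chase", "active", "past",
    [Argument("ARGM-LOC", "sparse", "in"),
     Argument("B-ARG1", "complete", "the ball"),
     Argument("B-ARG0", "partial", "the dog"),
     Argument("B-R-ARG0", "complete", "who")],
    role_names={"ARGM-LOC": "LOCATIVE", "ARGM-TMP": "TEMPORAL"},
)
```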

A.2 Training details
Training inputs. During training, we randomly select, with equal probabilities, whether to mask all arguments or a subset of arguments. If a subset, we uniformly select the proportion of arguments to mask. To determine the number of extra blank tokens, we uniformly select a value less than 10 and set the number of blanks to be the maximum of that selected value and the number of arguments to mask. Any extra blank tokens (i.e., remaining after masking arguments) are inserted between subtrees of the predicate.
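The sampling procedure for masked arguments and blank tokens can be sketched as below. The function name is hypothetical, and two details are assumptions where the text is not explicit: the sampled proportion is rounded to the nearest integer number of arguments, and "a value less than 10" is read as a uniform draw from 0 to 9.

```python
import random
from typing import List, Tuple


def sample_masking(num_args: int,
                   rng: random.Random) -> Tuple[List[int], int, int]:
    """Return (indices of masked arguments, total blanks, extra blanks)."""
    # With equal probability, mask all arguments or only a subset.
    if rng.random() < 0.5:
        num_masked = num_args
    else:
        proportion = rng.random()  # uniformly selected proportion
        num_masked = round(proportion * num_args)  # rounding is an assumption
    masked = sorted(rng.sample(range(num_args), num_masked))

    # Uniformly select a value less than 10 (assumed range: 0..9), then take
    # the maximum of that value and the number of masked arguments.
    sampled = rng.randrange(10)
    num_blanks = max(sampled, num_masked)

    # Blanks left over after masking arguments are the "extra" blanks that
    # get inserted between subtrees of the predicate.
    extra_blanks = num_blanks - num_masked
    return masked, num_blanks, extra_blanks


masked, num_blanks, extra_blanks = sample_masking(4, random.Random(0))
```

Taking the maximum guarantees that every masked argument has a blank available, so the number of extra blanks is never negative.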
We also randomly select keyword contents and keyword specificities. For each argument span, we extract, using spacy, four keyword types from the span: noun chunks, random subtrees, exact keywords, and prefixes. For prefixes, we uniformly select a number of tokens to include as the keyword (from 1 to the entire span). Once we extract all keyword candidates, we create corresponding keyword specificities: A keyword is complete if it contains all tokens in the original span, partial if it contains at least all but 5 tokens, and sparse otherwise. Then, we uniformly select a keyword content/specificity pair for each span from the set of keyword candidates (including the * symbol; see footnote 16).

To generate unlikelihood samples, we use three perturbation strategies on inputs:

1) Change semantic roles by swapping thematic role control codes (agent/patient), changing adjunct argument control codes to a uniformly selected other adjunct control code, and changing verb tense/voice. We swap verb tense/voice because the control code VERB does not have natural candidate swaps, given that predicates are the building blocks for semantic parses. We also swap the control codes in the target output.

2) Change keyword contents by replacing verb lemmas and keywords for both the predicate and all arguments. To make content swaps, we first gather the most commonly occurring keyword contents for each argument and predicate in Ontonotes 5.0 train, extracted according to the same process as described above for creating training inputs. For each primary control code and keyword specificity (e.g., TEMPORAL+partial), we store the 15 most commonly occurring keyword contents. To create the negative inputs, for each span, we uniformly sample from these stored keywords given the span's control code and keyword specificity. This perturbation is designed to discourage the generator from ignoring the keyword content and merely generating commonly occurring text for particular semantic roles.

3) Change keyword specificities by uniformly selecting a different specificity.

We weight each unlikelihood sample equally, with a reward of -1 (vs. +1 for positive samples).
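The keyword specificity rule and the third perturbation strategy can be illustrated with a short sketch. It assumes specificity is decided by comparing token counts (the keyword is taken to be drawn from the span), and it reads "at least all but 5 tokens" as "at most 5 tokens missing"; the function and constant names are hypothetical.

```python
import random
from typing import Sequence

SPECIFICITIES = ("complete", "partial", "sparse")

# Sample weights described in the text: +1 for positive samples,
# -1 for every unlikelihood (negative) sample.
POSITIVE_REWARD = 1.0
NEGATIVE_REWARD = -1.0


def keyword_specificity(keyword_tokens: Sequence[str],
                        span_tokens: Sequence[str],
                        max_missing: int = 5) -> str:
    """Label a keyword relative to the span it was extracted from."""
    missing = len(span_tokens) - len(keyword_tokens)
    if missing < 0:
        raise ValueError("keyword cannot be longer than its span")
    if missing == 0:
        return "complete"  # all tokens of the original span
    if missing <= max_missing:
        return "partial"   # at least all but `max_missing` tokens
    return "sparse"


def perturb_specificity(current: str, rng: random.Random) -> str:
    """Strategy 3: uniformly select a different specificity."""
    return rng.choice([s for s in SPECIFICITIES if s != current])


span = "in the summer of 1980".split()
label_full = keyword_specificity(span, span)
label_prefix = keyword_specificity(span[:1], span)
negative_label = perturb_specificity(label_full, random.Random(0))
```

Under this reading, short spans can never yield a sparse keyword, since fewer than six tokens can be missing from them.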
Hyperparameters. We train the Tailor generator using Transformers (Wolf et al., 2020) for 10 epochs with early stopping. We use batch size 4 and default values for other parameters (learning rate of 5e-5, Adam optimizer).

16 Because of how keywords are sampled, we notice that the generator is sensitive to the case of keyword contents. For example, if the keyword for a temporal span is In 1980 instead of in 1980, Tailor is biased towards generating it at the beginning of the sentence. We hypothesize that because some of the keywords we sample during training are cased (e.g., exact will lead to a cased keyword for a capitalized span beginning a sentence), the generator learns a bias towards generating spans with uppercase keywords at the beginning of the sentence. In applying the generator to perturbations, the case of keyword contents can be used to manipulate the order of generated roles when a certain order of generated contents is desired; otherwise, uncased keywords can be used.

B Intrinsic Evaluation Details
Effectiveness of cycle consistency. To evaluate to what extent cycle consistency reflects true controllability, we conducted additional manual annotation on role-following. We sampled 25 sentences from the Ontonotes 5.0 development set, transformed them into inputs with varying numbers of masked arguments and blank tokens, and created up to two perturbed inputs per sentence by randomly replacing their blanked adjunct arguments with other candidate semantic roles (using CHANGE_TAG). The candidate roles were extracted from the frameset for each predicate verb. We also changed the keyword specificity to SPARSE, to make these role swaps more plausible.
We collected Tailor and Tailor MLE generations from both the original and perturbed inputs, and one author manually validated the generated span for each specified argument (98 in total). Each span was annotated as following or not following the control (i.e., the span matches/does not match the designated semantic role), or the set of controls was annotated as impossible to follow if the annotator could not think of any generation that would satisfy the control codes, due to a conflict between the role, keywords, and blank placement. We then computed the Matthews correlation coefficient (MCC) between the controllability of the role label as measured by the SRL predictor and the gold controllability annotations, for the subset of roles not annotated as impossible. The MCCs are 0.49 and 0.51 for Tailor MLE and Tailor, respectively, suggesting that the cycle consistency measures positively correlate with true controllability measures.
Additionally, we measure to what extent the controllability measures from cycle consistency correlate with whether a set of controls is impossible to follow. The MCCs are -0.33 for both Tailor and Tailor MLE; thus, incorrect role-following as measured by cycle consistency is positively correlated with controls that are impossible to follow. 14/98 instances were manually annotated as having impossible-to-follow controls, suggesting that a nontrivial proportion of the generations that our intrinsic evaluation measures in §4 found to be unaligned with the designated role control codes may be explained by impossible-to-follow controls.
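For reference, the MCC computation over these annotations can be sketched in a few lines. The standard binary MCC formula is used; the label strings ("follow", "not_follow", "impossible") and the helper name `role_following_mcc` are hypothetical, and the toy data below is illustrative only, not the annotated sample.

```python
import math
from typing import List, Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary MCC: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def role_following_mcc(gold: List[str], srl_followed: List[bool]) -> float:
    """MCC between SRL-based and gold role-following, skipping 'impossible'."""
    pairs = [(g == "follow", s) for g, s in zip(gold, srl_followed)
             if g != "impossible"]
    return matthews_corrcoef([int(g) for g, _ in pairs],
                             [int(s) for _, s in pairs])


# Toy annotations (illustrative only).
gold = ["follow", "follow", "not_follow", "impossible", "not_follow"]
srl_followed = [True, False, False, False, False]
mcc = role_following_mcc(gold, srl_followed)
```

The second analysis in this section corresponds to calling `matthews_corrcoef` with the binary "impossible" flag as one variable and the SRL-based role-following indicator as the other.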

C Contrast Set Details (§5)
In Table 8, we illustrate our perturbation procedures for creating contrast sets. Besides BoolQ and UD English already introduced in §5, the Matres contrast set relies on within-sentence context: Because the task requires detecting and changing the temporal order of two verbs, our perturbations rely heavily on their syntactic relationships. For example, to change the appearance order of verbs in text (as described in ), we would take the parent verb as the base predicate, and MOVE the text span containing the child verb. Further, in QA implication, we combine Tailor with semantic heuristics: by defining mappings between WH-words and answer types (e.g., "who" and "the Huguenots"), we can easily create new questions that are about different targets.

Figure 2: A comparison of the dataset artifacts in the original BoolQ validation set and the contrast set created with Tailor. The figure is plotted in the same way as Figure 2 in Gardner et al. (2021). (Plot title: Artifact statistics for BoolQ before and after local edits. Legend: z = ± 2, i.i.d., edits.)
As mentioned in §5, the Tailor-generated contrast sets contain fewer artifacts compared to the original BoolQ validation set. Here, we provide a straightforward visualization to show the effect. As shown in Figure 2, many tokens in the original BoolQ validation data are biased towards the positive class (with the red dots distributed in the > 0.5 region), while most tokens in the edited set fall within the confidence region denoting no significant feature-level biases.
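A token-level bias statistic of the kind visualized in Figure 2 can be sketched as follows. This assumes a one-sample z-test of each token's positive-class rate against p0 = 0.5, in the spirit of Gardner et al. (2021); the exact statistic, tokenization, and significance threshold behind the figure are not specified here, so the details below are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def token_z_scores(examples: Iterable[Tuple[List[str], int]],
                   p0: float = 0.5) -> Dict[str, float]:
    """z-score of each token's positive-label rate against a null rate p0.

    `examples` yields (tokens, label) pairs with label in {0, 1}. Each token
    is counted once per example in which it appears.
    """
    n = Counter()    # number of examples containing the token
    pos = Counter()  # number of those examples with a positive label
    for tokens, label in examples:
        for tok in set(tokens):
            n[tok] += 1
            pos[tok] += label
    return {
        tok: (pos[tok] / n[tok] - p0) / math.sqrt(p0 * (1 - p0) / n[tok])
        for tok in n
    }


toy = [(["is", "a"], 1), (["is", "b"], 1), (["is"], 1), (["is"], 1), (["b"], 0)]
z = token_z_scores(toy)
# Tokens with |z| above a chosen threshold would be flagged as potential
# artifacts; the threshold itself is an assumption of this sketch.
flagged = sorted(tok for tok, score in z.items() if abs(score) >= 2.0)
```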

D Data Augmentation Details (§6)
Augmented data. Our five perturbation strategies are shown in Table 9. To create our augmented data, we first filter generations by perplexity scores from GPT-2 such that we retain 75% of generations. Then, for each hypothesis we perturb, we uniformly sample a successful perturbation. (An example of a failed perturbation would be one requiring both agent/patient roles, applied to a sentence without both roles.) This process results in a slight skew towards entailment labels (i.e., ≈ 2.75:1, entailment:neutral). Future work can investigate to what extent label imbalance affects augmentation results.
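The two-step selection described above (perplexity filtering, then uniform sampling per hypothesis) can be sketched as follows. The `Generation` container and function name are hypothetical, perplexities are assumed to be precomputed with GPT-2, and the sketch assumes that the retained 75% are the lowest-perplexity generations and that failed perturbations are simply absent from the input list.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Generation:  # hypothetical container for one successful perturbation
    hypothesis_id: str
    text: str
    perplexity: float  # assumed to be precomputed with GPT-2


def select_augmentations(generations: List[Generation],
                         rng: random.Random,
                         keep_fraction: float = 0.75) -> Dict[str, Generation]:
    # Step 1: keep the `keep_fraction` of generations with the lowest
    # perplexity (which end of the ranking is kept is an assumption).
    ranked = sorted(generations, key=lambda g: g.perplexity)
    kept = ranked[: int(len(ranked) * keep_fraction)]

    # Step 2: for each perturbed hypothesis, uniformly sample one of its
    # remaining successful perturbations.
    by_hypothesis = defaultdict(list)
    for gen in kept:
        by_hypothesis[gen.hypothesis_id].append(gen)
    return {hid: rng.choice(gens) for hid, gens in by_hypothesis.items()}


candidates = [
    Generation("a", "The doctor was comforted by the athlete.", 10.0),
    Generation("a", "The athlete is comforting the doctor.", 20.0),
    Generation("b", "The dog chased the ball in the park.", 30.0),
    Generation("b", "Ball the the chased park dog in.", 400.0),
]
chosen = select_augmentations(candidates, random.Random(0))
```

Hypotheses whose perturbations are all filtered out receive no augmented example, which is one way the label distribution of the augmented data can become skewed.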
Classifiers. We train all SNLI classifiers, which build on RoBERTa-base (Liu et al., 2019), using AllenNLP (Gardner et al., 2018). We train for 10 epochs using the Adam optimizer with a learning rate of 2e-05 and batch size 32; we use early stopping with a patience of 3.

E Style Transfer Details (§7)
Transfers Evaluated. We evaluate on the transfers in StylePTB for which Lyu et al. (2021) report results, as their baselines require training separate models for each transfer. Within this subset of transfers, we exclude PP Back to Front and Passive to Active from evaluation, as they contain < 5 test inputs. We also exclude the transfers Substatement Removal, Information Addition, Adjective Emphasis, and Verb/Action Emphasis, for which our semantic-role-derived inputs are not well-suited. For example, Substatement Removal involves removing substatements that represent "referring" and "situations," both of which are technical philosophical concepts that cannot be straightforwardly detected through semantic roles. As another example, Information Addition requires adding unordered keyword contents to a sentence (e.g., the work force provides the third arm of the alliance; add keywords: force black → the work force provides the third arm of the black alliance force). While the Tailor generator was only trained with ordered arguments, one could extend the keyword contents to also include unordered target tokens.

Perturbation strategies. For transfers modifying only verb tense (e.g., To Future Tense), we mask the verb, modal arguments, and negation arguments, as these are relevant to verb conjugations, and make relevant perturbations on the secondary verb control specifying tense. For transfers modifying verb voice, we mask the verb, agent, and patient. For transfers requiring removal of certain parts of speech (POS), i.e., ADJ or ADV Removal, PP Removal, and all compositional Tense + PP Removal sub-transfers, we first use spacy to detect such POS, next mask all arguments containing them, and finally perturb the keyword contents to remove the POS for these arguments. For PP Front to Back, we mask the argument at the beginning of the original text and implement the change using CHANGE_IDX.
We use cased keywords (§A.2) to encourage generations whose arguments are ordered similarly to those of the original sentence, except for the PP Front to Back transfer, which calls for differently ordered arguments. For transfers modifying verb form only, we set the number of extra blanks to 2 to allow for generation of helper verbs; for other transfers, we allow for 0 extra blanks to preserve the original order of generated spans.
We decode perturbed sentences greedily using beam search (with beam width 10) while preventing repeated bigrams.
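With a Hugging Face Transformers seq2seq model, these decoding settings could be expressed roughly as below. This is a sketch under assumptions: `num_beams`, `no_repeat_ngram_size`, `do_sample`, and `max_length` are standard arguments of `generate`, but the value of `max_length` and the helper `decode` are illustrative and not taken from the paper; the model and tokenizer are supplied by the caller.

```python
GENERATION_KWARGS = {
    "num_beams": 10,            # beam search with beam width 10
    "no_repeat_ngram_size": 2,  # prevent repeated bigrams
    "do_sample": False,         # deterministic decoding, no sampling
}


def decode(model, tokenizer, prompt: str, max_length: int = 128) -> str:
    """Generate one perturbed sentence from a header-prefixed input.

    `model` and `tokenizer` are expected to follow the Transformers
    interface (a seq2seq model with `generate`, and its tokenizer);
    `max_length` is an assumed value for this sketch.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_length=max_length, **GENERATION_KWARGS
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

As noted in the discussion of degenerate outputs, the number of beams is one of the hyperparameters that affects how often such outputs occur, so it is a natural setting to vary.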