Toward Unified Controllable Text Generation via Regular Expression Instruction

Controllable text generation is a fundamental aspect of natural language generation, with numerous methods proposed for different constraint types. However, these approaches often require significant architectural or decoding modifications, making them challenging to apply to additional constraints or to combinations of constraints. To address this, our paper introduces Regular Expression Instruction (REI), which uses an instruction-based mechanism to fully exploit the advantages of regular expressions and uniformly model diverse constraints. Specifically, REI supports all popular fine-grained controllable generation constraints, i.e., lexical, positional, and length, as well as their complex combinations, via regular expression-style instructions. Our method requires only fine-tuning on medium-scale language models or few-shot in-context learning on large language models, and needs no further adjustment when applied to various constraint combinations. Experiments demonstrate that our straightforward approach yields high success rates and adaptability to various constraints while maintaining competitiveness in automatic metrics and outperforming most previous baselines.


Input
My friends all love to go to the club to dance. They think it's a lot of fun and always invite. I finally decided to tag along last Saturday. <expression> <options> <choice_0> <mask_0> My friends decided to keep inviting me out as I am so much fun.</choice_0> <choice_1> <mask_1> The next weekend, I was asked to please stay home.</choice_1> </options> </expression>

Output
<expression> I danced terribly and broke a friend's toe. The next weekend, I was asked to please stay home.</expression>

Table 1: Input and output of instruction-prompt-based Regular Expression Instruction (REI). REI can describe various types of complex fine-grained constraints, and here we present three examples. Meta-data instruction labels are colored, lexicon constraints or the correct choice are boldfaced, and auxiliary marks for length or lexicon are shown in gray.

Fine-grained constraints arise naturally across generation tasks: story ending selection asks the model to choose among given options; abductive reasoning (Bhagavatula et al., 2020) specifies that the position of the output text is between the previous and future contexts; the summarization task (Luhn, 1957) limits the length of the output; machine translation (Bar-Hillel, 1960) demands use of the vocabulary of the target language.
Despite their reasonable performance, current methods for transformer-based language models mainly focus on certain constraints and may not be easily transferred to others, let alone to combinations of constraints. For example, Non-Residual Prompting (Carlsson et al., 2022) and A*esque Decoding (Lu et al., 2022) only consider lexical and length constraints and cannot arbitrarily specify the position at which the generated text shall occur; on the other hand, COLD (Qin et al., 2022) can generate text given past and future context, but can neither add word-inclusion constraints nor restrict the output length. Moreover, these controlling methods assume access to the probability distribution or even the gradient of the model; for large language models, where we can only obtain output tokens via an API, these methods may not be applicable, and black-box controlling techniques thus need further exploration.
To address the above challenges, we propose Regular Expression Instruction (REI), an instruction-based method for universal fine-grained controllable generation. Table 1 presents a few examples. Our instruction design is inspired by regular expressions, which can easily describe mainstream constraints and their combinations. Following Rosenbaum et al. (2022), we use a markup language to construct the expression, hoping that the model can better distinguish between meta-data (instructions) and data (actual words). We use two popular paradigms, language model fine-tuning and large language model few-shot in-context learning, to teach the model to understand the input constraint expression.
Our method has several advantages. First, our constraint expression supports all typical fine-grained controlling tasks and is powerful enough to describe composite control specifications. Second, our method can be adapted to various scenarios, such as summarization with length constraint, terminology-constrained machine translation, and alternative-ending story infilling. Third, our method is easy to implement and highly transferable to other models, since it requires only fine-tuning on medium-size models and no further modification on large language models, and it does not need access to the probability distribution or gradient.
Experiments demonstrate that current state-of-the-art language models can understand our controlling language, achieving high success rates while maintaining high automatic evaluation scores and surpassing most strong previous baselines under various constraints. We hope our work can shed light on future research.

Instruction Design
The controlling language REI follows the style of regular expressions due to their expressiveness. It is also easy to evaluate whether the input expression instruction matches the generated text. Following Rosenbaum et al. (2022), an HTML-like markup language is used, which helps the model learn that these are meaningful meta-data instructions rather than plain symbols, especially when using in-context learning on large language models with limited examples and no parameter updates. The markup labels also avoid the need for escape characters.
REI contains several special labels, as shown in Table 1. <expression> and </expression> mark the beginning and the end of the expression and can be put anywhere in the input text, assuming we only generate according to one expression at a time. <mask_i> is equivalent to the regular expression ".*" and similar to the mask token in BART (Lewis et al., 2020) and T5 (Raffel et al., 2022): at its position the model shall generate zero or more tokens. <options> and </options> are equivalent to the parentheses "(" and ")" in a regular expression; the model shall choose one expression among the group. To make recognition easier, we use <choice_i> and </choice_i> to wrap each choice. Regular expression length notation counts at the character level, but in practice we want to control the output word length. Therefore, we use the <length=n> label to denote the constraint on the output word count.
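The correspondence between REI labels and regular expressions can be made concrete with a small sketch. The function below is illustrative only (not the paper's released code) and assumes the simplified label set described above: it rewrites <mask_i> as ".*", turns an <options> group into a regex alternation, and escapes literal text so punctuation is matched verbatim.

```python
import re

def rei_to_regex(expr: str) -> str:
    """Convert a REI expression body into an equivalent Python regex.

    <mask_i>   -> ".*" (zero or more tokens)
    <options>  -> "(", </options> -> ")", choices joined with "|"
    Literal text is escaped so it is matched verbatim.
    """
    # Drop whitespace that sits purely between adjacent labels.
    expr = re.sub(r">\s+<", "><", expr)
    # Split on REI labels, keeping the labels as separate tokens.
    tokens = re.split(r"(<[^>]+>)", expr)
    out = []
    for tok in tokens:
        if re.fullmatch(r"<mask_\d+>", tok):
            out.append(".*")
        elif tok == "<options>":
            out.append("(")
        elif tok == "</options>":
            out.append(")")
        elif tok == "<choice_0>":
            out.append("")           # first alternative: no separator
        elif re.fullmatch(r"<choice_\d+>", tok):
            out.append("|")          # later alternatives: alternation bar
        elif re.fullmatch(r"</choice_\d+>", tok):
            out.append("")
        elif tok:
            out.append(re.escape(tok))
    return "".join(out)
```

With this mapping, checking whether a generated sentence satisfies the constraint reduces to an ordinary `re.fullmatch` call.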
We avoid the shortcoming of the T5 (Raffel et al., 2022) span-corruption schema, where the model only generates discontinued spans rather than full natural sentences (Lester et al., 2021). On the other hand, we also overcome the redundancy of the BART denoising schema (He, 2021), where the whole input is generated again, since we only generate the realized expression. Moreover, beyond fill-in-the-blank, we introduce choice-making, which further enriches the expressiveness of our controlling language.

Table 2: Constraint expression of each task. We fine-tune on the tasks and variations listed in Table 2a, and additionally evaluate the unseen tasks listed in Table 2b. Note that for few-shot learning, none of the tasks are trained on beforehand.

Training
Fine-tuning We can automatically construct the training data from a corpus and conduct self-supervised learning. Alternatively, we can directly convert the input of existing supervised datasets into the form of our controlling language and use them to fine-tune state-of-the-art models such as FLAN-T5 (Chung et al., 2022). The input format is shown in Table 2a. We include αNLG (Bhagavatula et al., 2020) and CommonGen (Lin et al., 2020), two English controllable generation datasets with position and lexicon constraints. In αNLG, given the past observation O1 and the future observation O2, the goal is to generate a hypothesis h that could follow O1 and trigger O2. The regular expression of the constraint is ".*", since no lexicon constraint is required. In CommonGen, given a set of k concepts C = {c0, c1, ..., ck−1}, the output text shall include those concepts while remaining consistent with common sense. In the original setting, the appearance order of the concepts and their word-sense changes are not provided, and the model shall make these decisions; in our controlling language, however, the exact words and order must be given, since otherwise we cannot construct the corresponding expression. We therefore preprocess the original instances and recover the order and word sense of the concepts from the reference text. To help the model generate the concepts sequentially and track how many concepts it has already used, we append the serial-number label (i) to every concept ci on both the input and output sides and remove the labels from the output once generation is completed. The regular expression of the constraint is ".*c0.*c1.* ... .*ck−1.*".

We also leverage these two datasets to teach the model to control the output length by simply adding the length label with the ground-truth length. To better track how many words the model itself has already generated, we append the length-number label _i to every word wi; for example, the sentence "Stephen knocked over a vase while drunk." becomes "Stephen_0 knocked_1 over_2 a_3 vase_4 while_5 drunk._6". Similarly, we remove the length-number labels after completion.
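The word-level labeling and its removal can be sketched as below. These helper names are hypothetical, and the sketch assumes whitespace tokenization; words that themselves contain underscores would need extra care.

```python
def add_length_labels(sentence: str) -> str:
    """Append `_i` to every word so the model can track how many
    words it has generated (length-controlled training targets)."""
    return " ".join(f"{w}_{i}" for i, w in enumerate(sentence.split()))

def strip_length_labels(labeled: str) -> str:
    """Remove the trailing `_i` labels once generation completes."""
    return " ".join(w.rsplit("_", 1)[0] for w in labeled.split())
```

Applying `add_length_labels` to the example sentence reproduces the labeled form shown above, and `strip_length_labels` inverts it.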
Finally, we need to teach the model the choice grammar. We use the αNLI (Bhagavatula et al., 2020) dataset, whose task is to determine whether H1 or H2 is the more plausible hypothesis given the past and future observations O1 and O2; the regular expression of the constraint is "(H1|H2)".
In-context Learning For large language models like GPT-3.5 (Brown et al., 2020), where access is typically provided via an API, many traditional controllable generation techniques cannot be applied. However, we can leverage their in-context learning ability to conduct fine-grained constrained generation. More specifically, we leverage the ability to discover and imitate repeated patterns (Madaan and Yazdanbakhsh, 2022; Min et al., 2022), which is desirable in our case: unlike other natural language understanding tasks, the specific fine-grained constraint is a well-defined simple pattern that can be easily discovered and imitated.
Given an input with a control expression, we select k instances with the same expression structure as the instruction prompt and send them to the large language model together with the input. When evaluating the test set, we can select examples from the training or validation set, or from other instances of the test set when the former are not available. Consistently, we use the same input and output format described before, which saves extra effort on prompt engineering. In addition, we simply use the popular JSON format {"input": [INPUT], "output": [OUTPUT]} for each demonstration instance and separate instances with "\n". By using JSON, we further avoid the need for escape characters if the input text happens to contain meta-data like "Input" or "\n".
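A minimal sketch of this prompt construction, under the JSON-lines scheme described above (function name hypothetical), might look like:

```python
import json

def build_prompt(demos, query_input):
    """Format k demonstrations plus the query in a JSON-lines scheme:
    one {"input": ..., "output": ...} object per line, joined by "\n".
    The final line carries the query with its "output" field left open
    for the model to complete."""
    lines = [json.dumps({"input": d_in, "output": d_out})
             for d_in, d_out in demos]
    # json.dumps on the query input handles any escaping for us.
    lines.append('{"input": ' + json.dumps(query_input) + ', "output": "')
    return "\n".join(lines)
```

Because `json.dumps` escapes quotes and newlines inside the text, demonstration boundaries stay unambiguous even when an input contains strings like "Input" or literal newlines.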

Inference
We use rejection sampling to generate output text matched by the control expression. Verifying the output is simple, since we can convert the control expression into a regular expression and check validity. Additionally, if the expression contains a length-constraint label, we count and compare the number of words in the output text. We try at most k times to avoid infinite loops and to save costs when using a large language model API. When using a medium- or small-size language model, to increase generation quality, we can perform beam search first and see whether it generates a valid result on the first try.
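This verify-and-retry loop can be sketched as follows. The `generate` argument is a stand-in for the actual model call (sampling-based API or local decoding); the function and parameter names are illustrative, not the paper's code.

```python
import re

def constrained_generate(generate, pattern, length=None, max_tries=8):
    """Rejection sampling: call `generate()` until the output matches
    `pattern` (a regex derived from the control expression), optionally
    also checking the word count against a <length=n> label.
    Returns None if no valid sample is found within `max_tries`."""
    for _ in range(max_tries):
        text = generate()
        if re.fullmatch(pattern, text) is None:
            continue                      # expression not matched: reject
        if length is not None and len(text.split()) != length:
            continue                      # word count wrong: reject
        return text
    return None
```

In practice one could first try a single beam-search decode through `generate`, falling back to repeated sampling only when that first attempt is invalid.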

Recursive Decoding
Different choices might affect the generated text.
For example, consider the case "S1 S2 S3 .* (E1|E2)", which gives the first three sentences and two alternative endings; the goal is to choose the correct ending while infilling the fourth sentence at the same time, a combination not included in our fine-tuning data. Instead of directly jumping to the answer with possibly insufficient computation, we can let the model "think step by step" (Kojima et al., 2022): we first solve each choice expression, then compare the completed choices "(S4 E1|S4′ E2)". The generalized decoding procedure is presented in Algorithm 1, which assumes that the options are independent of each other and greedily solves them from left to right. We leave the evaluation of expressions with multiple consecutive options (Lu et al., 2022) for future work.
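The two-step idea can be sketched as below. `fill_masks` and `choose` are hypothetical stand-ins for calls to the underlying language model; this is a simplified single-options-group version of the generalized procedure, not Algorithm 1 itself.

```python
def recursive_decode(fill_masks, choose, prefix, choices):
    """Greedy decoding for an expression such as "S1 S2 S3 .* (E1|E2)":
    first realize each choice branch in isolation (filling its masks),
    then let the model pick among the fully completed alternatives,
    i.e. solve the simpler expression "(S4 E1 | S4' E2)"."""
    # Step 1: solve each branch independently, e.g. ".* E1" -> "S4 E1".
    completed = [fill_masks(prefix, branch) for branch in choices]
    # Step 2: choose among complete candidates given the shared prefix.
    return choose(prefix, completed)
```

The greedy left-to-right assumption means each options group is resolved without looking ahead at later groups, which matches the independence assumption stated above.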

Setup
We conduct experiments on 2 Nvidia A100 GPUs, with about 10 total GPU hours locally. For the medium-size language model, we use FLAN-T5-xl (Chung et al., 2022) (Apache-2.0 license), which has 3B parameters and is fine-tuned on many natural language understanding and generation tasks. We use the Huggingface Transformers library (Wolf et al., 2020) (Apache-2.0 license) for fine-tuning and evaluation. We train the model for 3 epochs with a batch size of 16 and a learning rate of 3e-5. We set the beam size to 4 for beam search and p to 0.95 for top-p sampling. We generate at most k = 512 samples if we do not obtain a valid outcome.
For the large language model, we use GPT-3.5 (Brown et al., 2020), text-davinci-003 version, via the OpenAI API; this 175B model is calibrated with Reinforcement Learning from Human Feedback (Stiennon et al., 2020b). We feed 8 in-domain examples as the prompt, set the temperature to 0.7, and retry at most k = 8 times if the result is not valid. All results are from a single run.
Given only 8 examples with a clear connection between input and output, GPT-3.5 still shows competitive performance in terms of automatic text metrics and achieves high concept coverage, surpassing all previous baselines. Compared with natural language instruction, the success rate is very close. And with more supervised data to modify the model's parameters, FLAN-T5-xl performs significantly better than GPT-3.5 and other previous baselines on all metrics, successfully satisfying all lexicon constraints.

Lexicon & Length Constraint
As described in Section 2.2, we slightly modify the devset of CommonGen to introduce the additional length constraint and evaluate GPT-3.5 and FLAN-T5. For the metric, we replace Coverage (Cov.) with Success Rate (SuR.), the average percentage of outputs that match the input expression. In this composite task, the performance of GPT-3.5 degrades dramatically and it struggles to generate valid output, indicating that multi-concept inclusion and length control at the same time is challenging, especially for few-shot in-context learning. Yet REI still outperforms NLI in terms of success rate, and the "high" n-gram metrics might also indicate poor instruction-following ability on challenging fine-grained constraints, consistent with the findings of Zhou et al. (2023). FLAN-T5 has only a minor drop in performance and still maintains a high success rate, since it has been trained on this composite constraint.
With few-shot learning, GPT-3.5 outperforms the two unsupervised baselines and Diffusion-LM, demonstrating its strong in-context learning ability given only a few infilling examples. Since this is a relatively simple constraint, the performance gap between REI and NLI is very small. With our careful instruction prompt design and adequate fine-tuning, the 3B FLAN-T5 shows stronger performance than the 11B T5 and remains competitive with the 20B UL2.

Position & Length Constraint
As mentioned in Section 2.2, we slightly modify the αNLG test set to add the length constraint.
We change the BERTScore metric to Success Rate (SuR.). Table 4b shows the results. GPT-3.5 manages to imitate both position and length constraints, showing a relatively high success rate, while under NLI it performs badly. With full-scale supervised learning, FLAN-T5 robustly generates valid output on the test set 100% of the time. Also, in terms of automatic metrics, the outputs of both models do not degrade dramatically.

Position & Lexicon Constraint
We can also modify the αNLG test set to add a lexicon constraint, setting the keyword to be the first verb in the reference text. The input format is shown in Table 2b, and Table 4c shows the results.
GPT-3.5 still generates valid output nearly all of the time, and the automatic metrics improve compared with the results without the lexicon constraint, since additional gold words are provided and the verb constraint narrows the vast space of possible hypotheses. Also, REI is slightly better than NLI. As for FLAN-T5, although it has been trained on position and lexicon constraints separately, it has not seen their combination, yet it still demonstrates strong performance.

Position & Lexicon & Length Constraint
We can further combine all conditions, adding both length and lexicon constraints on the test set of αNLG. The input format is presented in Table 2b, and Table 4d shows the results. Compositional constraints challenge few-shot GPT-3.5, as it is more difficult to generate output that matches all three requirements, and the success rate drops slightly. Interestingly, NLI gets a very low success rate. But fully trained FLAN-T5 exhibits robust transfer ability: although the three simultaneous constraints are not included in the training data, FLAN-T5 still manages to achieve close to 100% success rate.

Position Constraint & Alternative Endings
On the test set of the Story Cloze Test (Mostafazadeh et al., 2016), which requires choosing between the right ending and the wrong one given a four-sentence context, we additionally mask the fourth sentence and require the model to infill the missing sentence while determining the correct ending. The input format is shown in Table 2b, and the result is shown in Table 6. We change the Success Rate (SuR.) metric to Accuracy (Acc.), since choosing either ending yields a valid match. For GPT-3.5, we directly construct prompting examples from the initial input and final output, and surprisingly find that GPT-3.5 handles the composite constraint quite well, choosing the right ending with reasonable accuracy. Also, REI comes close to NLI in performance. For FLAN-T5-xl, we use recursive decoding (Section 2.4); it shows moderate performance, with lower accuracy but higher BLEU/ROUGE compared with GPT-3.5.

Summarization with Length Constraint
REI can also easily support abstractive summarization with a desired length (Kikuchi et al., 2016; Fan et al., 2018), as long as the base model has been trained on the summarization task, which is the case for our chosen models FLAN-T5 (Chung et al., 2022) and GPT-3.5 (Ouyang et al., 2022). We evaluate on the test set of the English headline-generation dataset Gigaword (Graff et al., 2003), due to its short input and output lengths. Also, Gigaword is not included in the training set of FLAN-T5 or GPT-3.5. The input format is given in Table 2b. We use ROUGE-L (Lin, 2004) and Success Rate (SuR.) as metrics.
We compare our methods with two unsupervised, unconstrained baselines, SEQ (Baziotis et al., 2019) and TED (Yang et al., 2020); the results are shown in Table 7. Both GPT-3.5 and FLAN-T5 exceed the two baselines in ROUGE-L score, showing relatively good text quality. Since the summarization task constrains the semantics of the output more than a pure lexicon constraint (CommonGen) or position constraint (αNLG), satisfying the length constraint may be more difficult, and GPT-3.5 shows a relatively lower success rate, though NLI has the worst. Nevertheless, FLAN-T5 still achieves a 100% success rate. Notice that with limited REI training tasks, the model can still generalize to new tasks in the specified format, demonstrating robust transfer ability under supervised learning.

Terminology-Constrained Machine Translation
We can also apply REI to machine translation with terminology constraints (Dinu et al., 2019), i.e., ensuring that the given terminologies T = (t0, t1, ...) are used in the translation. We only test GPT-3.5 here, due to its superiority in multi-language understanding, whereas the majority of FLAN-T5's output language during pre-training, multi-task learning, and fine-tuning is English. We evaluate on the test sets of Wiktionary and IATE (Dinu et al., 2019), two English-German translation datasets, using BLEU-4 (Papineni et al., 2002) and Terminology Coverage (Term) as metrics.

Qualitative Results
Table 8 shows samples for lexicon & length constraints (Section 3.2.2), position & lexicon & length constraints (Section 3.3.4), position constraint with alternative endings (Section 3.3.5), summarization with length constraint (Section 3.4), and translation with terminology constraint (Section 3.5). Both FLAN-T5 and GPT-3.5 generate valid and fluent sentences. GPT-3.5 also uses more vivid or human-like words such as "antihistamines" or the abbreviation "FIA", probably due to its large model size and training corpus.

Related Work
Tasks of Controllable Text Generation Controllable text generation refers to tasks that generate text according to controlling signals (Prabhumoye et al., 2020). Typically, the output can be constrained at three levels from coarse to fine (Zhang et al., 2022): semantic, structural, and lexical. At the semantic level, the signals include topic (Tang et al., 2019), sentiment (Logeswaran et al., 2018), format (Li et al., 2020), toxicity (Krause et al., 2021), and other abstract attributes. At the structural level, the constraints include key-value data tables (Novikova et al., 2017), syntax trees, and parts-of-speech (Li et al., 2022). At the lexical level, the controlling elements include keywords (Lin et al., 2020), generating position (Shen et al., 2020), and length (Carlsson et al., 2022).

Conclusion
We proposed Regular Expression Instruction (REI), a novel instruction-based method that unifies fine-grained lexical-level constrained text generation. Our method is highly adaptable, fitting either language model fine-tuning or large language model in-context learning. Our controlling language can also easily be applied to other related tasks, including story completion with infilling, summarization with length constraint, and machine translation with terminology constraint. Experiments show that our method achieves high success rates and outperforms most previous strong baselines, demonstrating its effectiveness despite its simplicity. We leave the evaluation and improvement of more complex constraints for future work.

Limitations
Our proposed Regular Expression Instruction is serialized and cannot describe a set of keyword constraints whose appearance order is arbitrary, only a list of keywords with a determined order. Future work is needed to overcome this limitation, either by approximating the word order or by repeated random sampling. Also, to obtain valid results we use rejection sampling, which might require many repeated trials, reducing efficiency and speed. More efficient mechanisms with fewer retries are worth investigating. Additionally, given the current trend of instruction following, more sophisticated zero-shot prompts are worth investigating.

Ethics Statement
This work involves no sensitive data and uses several publicly available datasets. This work discusses controllable text generation, which aims for better usage of black-box language models and may help reduce problematic biases. We note that the method proposed in this work could be used to generate disinformation or harmful content directly via the controlling language, but such malicious usage can be mitigated by filtering out improper control inputs and stopping harmful content generation.

Table 3 :
Results on the devset of CommonGen. The best models are bolded within each metric.

Table 4 :
Results on the test set of αNLG.

Table 6 :
Results on Story Cloze Test with positional constraint.

Table 7 :
Results on the test set of Gigaword.