Towards Attribute-Entangled Controllable Text Generation: A Pilot Study of Blessing Generation

Controllable Text Generation (CTG) has achieved great success thanks to the fine-grained control obtained by focusing on multiple attributes. However, most existing CTG research overlooks how attribute entanglement can be utilized to enhance the diversity of the controlled generated texts. Facing this dilemma, we focus on a novel CTG scenario, i.e., blessing generation, which is challenging because high-quality blessing texts require CTG models to comprehensively consider the entanglement between multiple attributes (e.g., objects and occasions). To promote research on blessing generation, we present EBleT, a large-scale Entangled Blessing Text dataset containing 293K English sentences annotated with multiple attributes. Furthermore, we propose novel evaluation metrics to measure the quality of the blessing texts generated by the baseline models we design. Our study opens a new research direction for controllable text generation and enables the development of attribute-entangled CTG models.


Introduction
Controllable Text Generation (CTG) aims to automatically generate text under the restrictions of given conditions (Prabhumoye et al., 2020; Dong et al., 2021; Sun et al., 2022). As the mainstream setting, controlling multiple attributes enriches the information carried by the generation and matches the demands of application scenarios such as generating Chinese poetry (Yi et al., 2020), restaurant reviews (Chen et al., 2021), and product descriptions (Xu et al., 2019). Taking Chinese poetry generation as an example, a beautiful line of poetry should contain multiple attributes and reflect their entanglement (or mixture) through reasonable connections. For instance, in the line "胡马南来路已荒 (The enemy's warhorses march to the south, through destroyed roads)", "胡马 (enemy's warhorses)" is a representative element of military life, while "南来 (march to the south)" and "荒 (destroyed)" represent the attribute of troubled times. This line vividly depicts a picture of war in troubled times through the entanglement of attributes in just seven characters. Yi et al. (2020) also claim that considering the entanglement among attributes can effectively enhance the quality and diversity of generated poetry. Therefore, we believe that better CTG models must focus on the effect of attribute entanglement, i.e., enhancing the reflection of multiple attributes through the use of various representative elements in the generated text.
For Chinese CTG, with poetry generation as a typical scenario, researchers have conducted in-depth research on attribute entanglement, but in the English CTG field attribute entanglement remains unexplored. Therefore, to promote research on attribute-entangled CTG in the English community, in this paper we focus on blessing generation, a new CTG task that plays a key role in social scenarios. Automatically generated blessings can greatly promote interpersonal communication and enrich people's daily lives. More crucially, the blessing generation task is challenging due to its high requirement for entanglement between attributes, such as objects and occasions. As shown in Figure 1, "Santa's Workshop" connects the occasion (Christmas) and the object (Boss) into one phrase, making the blessing wonderful. A more vivid blessing embodies these two attributes in an intertwined manner, such as "keeps the office humming along like Santa's Workshop".
To fill this vacancy in blessing generation, we construct EBleT, a large-scale Entangled Blessing Text dataset annotated with multiple attributes. In particular, EBleT has the following two features: (1) it contains 23 occasions and 34 objects annotated on 293,403 blessing texts from 12 blessing websites; (2) since 92% of the blessing texts are personalized for the corresponding attributes, at least 82% of the data contains entanglement between attributes.
Additionally, common generation evaluation metrics cannot clearly reflect the characteristics of blessings. To evaluate the generated blessings more comprehensively, we propose novel metrics that automatically measure the degree of attribute entanglement and the quality of blessings. Our experiments demonstrate that mainstream CTG methods struggle to capture the entanglement. Moreover, existing methods cannot balance fluency, diversity, and entanglement between attributes. These results indicate that the blessing generation task we focus on is challenging and can serve as a useful benchmark for CTG research.

Task Definition
The blessing generation task aims to obtain a generation model G(x_1, x_2; θ) parameterized by θ. Given input attributes consisting of an object x_1 ∈ X_1 and an occasion x_2 ∈ X_2, the model G should output a blessing text y sent to x_1 for x_2, where y = {y_1, y_2, ..., y_n} is a sequence of n words, and x_i (i = 1, 2) is a word or phrase belonging to a collection of objects or occasions.
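The definition above can be sketched as a minimal interface. The template-based generator below is a hypothetical stand-in for G, invented here only to illustrate the input/output contract (the field names, template, and function name are our own, not the paper's):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlessingInput:
    obj: str       # x_1: the object the blessing is sent to, e.g. "Boss"
    occasion: str  # x_2: the occasion, e.g. "Christmas"


def generate_blessing(x: BlessingInput) -> List[str]:
    """Trivial stand-in for G(x_1, x_2; theta): returns a word sequence y.

    A real model would be a trained neural generator; this template only
    illustrates that y should mention both input attributes.
    """
    text = f"Wishing my {x.obj} a wonderful {x.occasion} full of joy"
    return text.split()


y = generate_blessing(BlessingInput(obj="Boss", occasion="Christmas"))
print(" ".join(y))
```

A trained model would of course replace the template with learned generation; the interface (two attribute strings in, a word sequence out) is the part that carries over.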
The generated text y should reflect not only the language style of blessing, but also effective entanglement between both attributes. Additionally, the evaluation metrics for the language style of blessing and entanglement are described in Section 4.

Dataset Construction
Data Collection We search blessing-related keywords (e.g., "send blessing", "send wish") via Google Search and obtain 12 blessing websites. We check the licences of these websites to ensure that their data can be legally employed for our non-profit academic research. The occasions and objects are labeled according to the page headings and subheadings of these websites: we collect the headings and subheadings together with the corresponding lists of blessing texts, and extract the occasions and objects from them. In total, we collect about 1 million texts from the web as the raw corpus.
Data Cleaning After acquiring the raw corpus, we remove exact duplicate sentences, delete all non-English text, and remove sentences that do not reflect the corresponding occasion/object attributes. Additionally, we observe that overly long or short sentences are mostly noise. Therefore, to further clean the dataset, we keep only sentences between 10 and 200 words in length.
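The cleaning steps above can be sketched as a simple filter pipeline. The ASCII check below is a crude stand-in for the non-English removal, and the function name is our own; this is an illustration, not the paper's implementation:

```python
def clean_corpus(sentences):
    """Deduplicate, drop non-English text, keep sentences of 10-200 words."""
    seen = set()
    cleaned = []
    for s in sentences:
        s = s.strip()
        if s in seen:                 # remove exact duplicates
            continue
        seen.add(s)
        if not s.isascii():           # crude stand-in for non-English removal
            continue
        n_words = len(s.split())
        if not 10 <= n_words <= 200:  # drop overly short/long sentences
            continue
        cleaned.append(s)
    return cleaned
```

For example, passing the same sentence twice keeps only one copy, and a two-word sentence is dropped by the length filter.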
Human Evaluation To manually evaluate the quality of EBleT, we randomly select 20 samples from each "object-occasion" pair, except for pairs involving the "General" object, and obtain 5,520 samples in total. We then employ 3 college students who are native English speakers as annotators to assess the personalization and entanglement of these samples. As annotation payment, we pay them 5 dollars for every 100 sentences judged. Besides, to ensure the reliability of their scores, we carefully explain the concepts of personalization and entanglement before annotation starts. Specifically, a blessing is personalized if the annotator can easily identify its labeled occasion/object. Moreover, a blessing is entangled if it cleverly blends the characteristics of the labeled occasion and object, rather than combining the two so rigidly that it could be substituted for any other occasion or object. Once familiar with these concepts, the annotators judge the sampled data and give a score (0: common, 1: personalized, 2: both personalized and entangled). We take the majority vote as the annotation result for each sample. The Fleiss' kappa (Fleiss, 1971) of the annotations is 0.837, which indicates that the annotation results can be regarded as "almost perfect agreement" (Landis and Koch, 1977). The results of the human evaluation are presented and analyzed in "Dataset Quality" of Section 3.2.

Dataset Analysis
Dataset Statistics Table 1 describes the statistics of EBleT. Compared with previous annotated CTG datasets, e.g., ROCStories (Mostafazadeh et al., 2016) with 50K stories, GYAFC (Rao and Tetreault, 2018) with 53K sentences, and ToTTo (Parikh et al., 2020) with 121K tables, our EBleT, containing 293K blessing texts with corresponding occasion and object labels, can be regarded as a sufficiently large-scale dataset. Moreover, our dataset covers up to 276 pairs crossed from 23 categories of occasions and 34 categories of objects, which makes it challenging for models to learn the characteristics of each category of occasions and objects and to entangle them. More details and examples of EBleT are shown in Appendix A.1.

Dataset Visualization After removing the stopwords and the words related to specific occasions and objects, we plot the word cloud of EBleT as shown in Figure 2. We find that some words (e.g., "wish", "love", and "happiness") appear frequently. This phenomenon not only matches our common sense, i.e., that blessing texts usually express wishes for each other, but also provides a class of words for future blessing generation models to focus on.

Evaluation Metrics

Blessing Score
To measure the quality of blessings, the Blessing Score should reflect the extent to which a sentence fits the language style of blessings. By counting word frequencies, we observe that some words, e.g., "happy", "merry", and "heart", appear far more frequently in blessing texts than in other texts. We take the 50 most frequent words, remove the stopwords, and use the remaining words to construct the blessing bag-of-words B.
For a sentence to be evaluated, to avoid the influence of irrelevant words, we use KeyBERT (Grootendorst, 2020) to extract 10 keywords forming a keyword list K that represents the sentence. All words in B and K are converted to word embeddings by a Word2Vec (Mikolov et al., 2013) model E(·). For each keyword, we calculate its maximum similarity to all words in B, and then average the maximum similarities of all keywords to obtain the Blessing Score (BLE). It is formulated as follows:

BLE = (1/|K|) Σ_{k ∈ K} max_{b ∈ B} cos(E(k), E(b))
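Under this definition, the score is the mean, over keywords, of each keyword's maximum cosine similarity to the blessing bag-of-words. The sketch below swaps KeyBERT and Word2Vec for a toy embedding table; the vectors and vocabulary are invented purely for illustration:

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def blessing_score(keywords, blessing_bow, embed):
    """Mean over keywords of the max cosine similarity to any word in B."""
    return sum(
        max(cosine(embed[k], embed[b]) for b in blessing_bow)
        for k in keywords
    ) / len(keywords)


# Toy 2-d embeddings standing in for Word2Vec vectors.
E = {"happy": (1.0, 0.1), "merry": (0.9, 0.2), "joyful": (0.95, 0.15),
     "invoice": (0.0, 1.0)}
B = ["happy", "merry"]
# A blessing-like keyword scores higher than an unrelated one:
print(blessing_score(["joyful"], B, E) > blessing_score(["invoice"], B, E))  # -> True
```

In the paper's pipeline the keyword list K would come from KeyBERT and the embedding table from a trained Word2Vec model rather than a hand-written dictionary.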

Entanglement Score
To evaluate the degree of attribute entanglement, we assume that a blessing sentence with a higher Entanglement Score should have elements related to the occasion and the object appearing simultaneously in more clauses. Furthermore, occasion-related and object-related elements should alternate more times in a more entangled blessing sentence.
We construct two bags-of-words B_1 and B_2 to represent the occasion-related and object-related elements, respectively. Specifically, the bags-of-words contain words directly related to the corresponding occasions and objects, which are listed in Table 8 and Table 9 of the Appendix.
For the Entanglement Score, we use cosine similarity to check whether words related to the two attributes occur simultaneously within each clause, and add a bonus term O¹ for cases where related words alternate multiple times. Formally, for each sentence S to be evaluated, we split S into m clauses S = {s_1, s_2, ..., s_m}, where each clause s_i consists of n words s_i = {w_{i1}, w_{i2}, ..., w_{in}}. The Entanglement Score (ENT) for S is calculated as follows:

ENT(S) = (1/m) Σ_{i=1}^{m} I(max_{w ∈ s_i, b ∈ B_1} cos(E(w), E(b)) > t) · I(max_{w ∈ s_i, b ∈ B_2} cos(E(w), E(b)) > t) + O

where I(c) is the indicator function, which takes the value 1 when condition c is satisfied, and t is a predetermined threshold.
1 The specific implementation of our designed bonus is presented in the source code of the supplementary material.
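Because the exact bonus term lives in the supplementary code, the sketch below is only a rough approximation of ENT: it splits a sentence into clauses at punctuation, uses exact word matching against B_1 and B_2 in place of thresholded cosine similarity, and adds a simple alternation bonus. All of these simplifications, including the bonus weight, are our own:

```python
import re


def entanglement_score(sentence, occ_words, obj_words, bonus_weight=0.1):
    """Fraction of clauses containing both attribute types, plus a simple
    bonus counting occasion/object alternations across the sentence."""
    clauses = [c for c in re.split(r"[,;.!?]", sentence.lower()) if c.strip()]
    both = 0
    for clause in clauses:
        words = set(clause.split())
        if words & occ_words and words & obj_words:
            both += 1
    # Alternation bonus: count switches between occasion- and object-related
    # words in the order they appear.
    labels = []
    for w in sentence.lower().split():
        w = w.strip(",;.!?")
        if w in occ_words:
            labels.append("occ")
        elif w in obj_words:
            labels.append("obj")
    switches = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return both / len(clauses) + bonus_weight * switches


occ = {"christmas", "santa's", "santa"}
obj = {"boss", "office"}
s = "May your office hum along like Santa's workshop, dear boss, this Christmas"
print(round(entanglement_score(s, occ, obj), 2))  # -> 0.63
```

Here the first clause contains both an occasion element ("Santa's") and an object element ("office"), and the attribute labels alternate three times, so the sketch rewards the intertwined phrasing over a sentence that names each attribute in isolation.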

Metric Verification
To verify the effectiveness of our proposed Blessing and Entanglement Scores, we conduct consistency analyses between the automatic scores and human annotations. We extract 11 subsets, each containing 100 samples, and vary the proportion of blessing/entangled samples (as annotated by humans) across the subsets from 0.0 to 1.0. We then compute the average Blessing Score and Entanglement Score for the 11 subsets with our metrics. The results presented in Figure 3 demonstrate that our proposed metrics are highly consistent with the manual annotations.

Experiment Setup
We set up experiments to evaluate how well existing models generate entangled blessing texts. The full dataset is divided into training, validation, and test sets in a 90%/5%/5% ratio by stratified sampling.
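A stratified 90%/5%/5% split can be sketched by splitting within each (object, occasion) stratum. The field names and seed below are our own illustration, not the paper's preprocessing code:

```python
import random
from collections import defaultdict


def stratified_split(samples, key, seed=42):
    """Split samples 90%/5%/5% within each stratum defined by key(sample)."""
    strata = defaultdict(list)
    for s in samples:
        strata[key(s)].append(s)
    rng = random.Random(seed)
    train, valid, test_set = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n = len(group)
        n_train = int(n * 0.9)
        n_valid = int(n * 0.05)
        train += group[:n_train]
        valid += group[n_train:n_train + n_valid]
        test_set += group[n_train + n_valid:]
    return train, valid, test_set
```

Splitting per stratum (rather than globally) keeps every object-occasion pair represented in all three sets at roughly the same proportion.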
To measure the consistency between the generated outputs and the reference blessing texts, we utilize BLEU (Papineni et al., 2002) and WMD (Kusner et al., 2015). WMD calculates the minimum cumulative distance that the embedded words of one document need to travel to reach those of another. In addition, we use Perplexity and Distinct-n (n = 1, 2, 3) (Li et al., 2016) to evaluate the fluency and diversity of the generated outputs. Specifically, GPT-Neo (Gao et al., 2020) is employed as the language model to obtain the perplexity. Furthermore, we use the Blessing Score and Entanglement Score described in Section 4 to evaluate the quality of the blessings.
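Distinct-n (Li et al., 2016) is simply the ratio of unique n-grams to total n-grams across the generated outputs; a minimal sketch:

```python
def distinct_n(texts, n):
    """Number of unique n-grams divided by total n-grams across all texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# 4 unique unigrams out of 6 total:
print(distinct_n(["happy new year", "happy new day"], 1))
```

Higher values mean less repetition across outputs; a model that always emits the same blessing scores near zero.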
We evaluate two widely used generation models on EBleT for our proposed task. GPT-2 (Radford et al., 2019) is a Transformer-based decoder-only model (Liu et al., 2022) that achieves stable and excellent generation performance. For this task, we design a prompt, "Send this blessing to <object> for <occasion>", where <object> and <occasion> represent the object and occasion attributes, respectively. The prompt is used as the prefix input of the GPT-2 model. Diverse Beam Search (Vijayakumar et al., 2016) is employed as the decoding method during generation to ensure the diversity of the generated blessings.
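The prompt construction and the decoding call can be sketched as follows. The `transformers` parameters shown in the commented block (`num_beams`, `num_beam_groups`, `diversity_penalty`) are Hugging Face's standard knobs for Diverse Beam Search, but the concrete values are our own guesses, not the paper's settings:

```python
def build_prompt(obj, occasion):
    """The paper's prefix prompt for GPT-2."""
    return f"Send this blessing to {obj} for {occasion}"


print(build_prompt("Boss", "Christmas"))

# Hedged illustration of the decoding call (requires `transformers`;
# the diverse-beam-search values below are illustrative only):
#
# from transformers import GPT2LMHeadModel, GPT2Tokenizer
# tok = GPT2Tokenizer.from_pretrained("gpt2")
# model = GPT2LMHeadModel.from_pretrained("gpt2")
# ids = tok(build_prompt("Boss", "Christmas"), return_tensors="pt").input_ids
# out = model.generate(ids, max_new_tokens=40, num_beams=6,
#                      num_beam_groups=3, diversity_penalty=0.5)
# print(tok.decode(out[0], skip_special_tokens=True))
```

Diverse Beam Search partitions the beams into groups and penalizes tokens already chosen by earlier groups, which is what pushes the candidate blessings apart.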
T5 (Raffel et al., 2020) is an encoder-decoder model designed for text-to-text generation tasks. The prompt mentioned above is used as the input on the encoder side of the T5 model.

Table 3: Performance of different models on EBleT. Common denotes news texts collected from the British Broadcasting Corporation, used for comparison with blessings. "↑" means higher is better for the metric; "↓" means lower is better.
Additionally, we consider applying a CVAE (Sohn et al., 2015) for generation, using the latent variables to represent the entanglement of the two input attributes. Following previous work (Fang et al., 2021), we employ pretrained GPT-2 as the backbone of the CVAE to obtain higher-quality generated results. Furthermore, we employ adversarial training (Adv.) (Yi et al., 2020) instead of minimizing the KL divergence in the CVAE, allowing the model to learn more complex entangled representations.

Experiment Results
The results in Table 3 demonstrate that: (1) Models trained on EBleT can generate fluent blessing texts, whose language style is generally consistent with that of the blessing texts in the dataset. (2) The diversity and Entanglement Score of texts generated by GPT-2 and T5 are low. Meanwhile, employing a CVAE or adversarial training architecture on top of GPT-2 effectively improves these two metrics, but slightly reduces blessing quality. Additionally, the adversarial training architecture outperforms the CVAE in both entanglement and blessing quality, suggesting that it is more appropriate for entangling the attributes during generation. (3) There remains a gap in diversity and Entanglement Score between generated texts and references, indicating that EBleT is a challenging benchmark for exploring attribute entanglement in CTG. Future work on this task should consider all of fluency, diversity, blessing quality, and entanglement to generate blessings more in line with human expression.

Related Work
Controllable text generation (CTG) usually takes a controlled element and a source text (which can be missing) as input. Based on the input, the generation model produces a target text satisfying the controlled elements. According to the core of CTG, i.e., the diversified controlled elements, we divide CTG into the following two categories. Attribute Control: Ghosh et al. (2017) add sentiment information into the generator to control the sentiment of the generated sentences. Luo et al. (2019) explore a framework including a sentiment analyzer and a sentiment generator to control the fine-grained sentiment of generation. Chen et al. (2021) introduce a mutual learning framework to generate emotionally controllable comments. In addition, Wang et al. (2019) control the style of the generated text to present a specific style of writing. Zhang et al. (2018) build a generation system to generate conversations with a specific persona. Content Control: Cao et al. (2015) control the topic of generation, exploring the latent semantics of vocabularies and texts to obtain the topic distribution. Keskar et al. (2019) add different control codes to realize topic control. Koncel-Kedziorski et al. (2016) use a generator to edit articles written by humans, changing the theme without changing the original story. Additionally, Zheng et al. (2020) build LSTML and LSTMR to make sure the entities appear in the generated summary. Xu et al. (2020) incorporate keywords into each sentence of the story over the generation process. Kikuchi et al. (2016) and Duan et al. (2020) introduce methods for controlling the output sequence length.
However, existing research on controlled generation does not cover blessing generation and neglects the entanglement among attributes. Blessings are used in many aspects of life, such as e-cards and advertisements. We therefore introduce a new task, blessing generation, and propose the corresponding dataset EBleT.

Conclusion
To explore the entanglement between attributes, we present EBleT, a blessing dataset that supports a new controllable generation task. We propose novel metrics to automatically measure attribute entanglement and the quality of blessings. We also provide several baselines and conduct experiments on blessing generation. Experimental results demonstrate that EBleT can serve as a useful benchmark for attribute entanglement in CTG.

Limitations
In this paper, we conduct experiments on EBleT with several representative mainstream models. Since our work is only a pilot study of attribute-entangled CTG, we do not conduct experiments on more controllable generation models. Given the challenge EBleT poses, we suggest that more sophisticated models be implemented to improve the performance of blessing generation.

Ethical Considerations
In this paper, to facilitate the study of attribute-entangled CTG, we propose the blessing generation task, which requires attention to attribute entanglement to obtain vivid blessings. We believe that the blessing generation task embodies humanistic care: the various generated blessing texts can not only enrich people's daily lives but also promote interpersonal relationships. We also present EBleT, a large-scale annotated blessing dataset. All the corpora used in EBleT come from freely available resources on public websites and do not involve any sensitive or illegal data. Additionally, we design new automatic evaluation metrics to measure the quality of blessings. We believe our metrics are instructive for future research on CTG tasks; after all, how to conduct effective evaluation remains an important and unsolved problem in the current CTG field.

A.1 Dataset Details
The size of each object/occasion category is shown in Table 4 and Table 5, respectively. It is worth noting that the "General" category refers to cases where the recipient of the corresponding blessing was not identified during data collection. In addition, there is mutual inclusion between some objects in our dataset. We consider this phenomenon reasonable; e.g., one may write a single blessing message for elders and, with slight modification, send it to parents, uncles, or teachers.
Some examples of EBleT are shown in Table 6, which contains the blessings and the corresponding attributes (i.e., occasions and objects).

Figure 1: Two groups of blessing examples. Each group contains blessing messages without (top) and with (bottom) attribute entanglement. Representative elements of occasion/object attributes are marked.

Figure 3: The correlation between human annotations and automatic metrics. The horizontal axis represents the proportion of each set that is manually annotated as blessing or entanglement.

Table 1: Dataset statistics of EBleT.

Table 2: Partial human evaluation results of EBleT. #Sample, #Per., and #Ent. denote the total number of sampled sentences, the number of personalized sentences, and the number of entangled sentences, respectively. The full list is presented in Table 7.

Table 4: The data size of each object category.

Table 5: The data size of each occasion category.