Controllable Conversation Generation with Conversation Structures via Diffusion Models



Introduction
Generating long-form and coherent text is an important step in many natural language generation (NLG) applications (Guan et al., 2022). While recent research has shown impressive progress in generating short texts, it is still challenging for generation models to write coherent long text, which requires comprehensively incorporating linguistic and world knowledge (Charniak, 1972). Our work takes a closer look at long conversation generation (Gunasekara et al., 2021), one of the most challenging long text generation tasks. The task is to generate an entire coherent conversation from a given short description, i.e., a summary, of it. Conversation generation has various applications ranging from daily entertainment and story generation to customer service. However, real human-human conversation logs are scarce, and crowdsourcing conversational data is time-consuming, costly, and hard to quality-control (Gunasekara et al., 2021). Thus, better conversation generation models would allow us to generate massive amounts of natural conversational data more automatically and efficiently, which in turn helps build better conversational AI systems.
Even though a growing number of studies have focused on long text generation, such as story generation with large pre-trained models (Guan et al., 2022; Fan et al., 2018; Li et al., 2022a), event planning (Guan et al., 2022; Fan et al., 2018; Li et al., 2022a), and recursive revision, directly applying them to generate long conversations may not work well due to the inherently different structures of stories and conversations. For instance, previous long text generation usually focused on generating stories about a single topic spanning five sentences to one paragraph. These are short compared to conversations, which usually cover multiple topics between different speakers over more than ten turns (Feng et al., 2020). Furthermore, there are diverse discourse relations between different speakers (Chen and Yang, 2021b), making it even more challenging to generate long and coherent conversations.
While there is a line of work on dialogue generation, it mainly concentrates on generating the next utterance autoregressively based on the given context (Ji et al., 2021; Saha et al., 2022; Ramakrishnan et al., 2022) with sequence-to-sequence models. Such methods usually neglect conversation structures (Adewumi et al., 2022), and thus easily lose focus and fail to produce long and coherent conversations after several rounds of generation (Gunasekara et al., 2021). Moreover, previously generated utterances cannot be further edited to adapt to later generated utterances. It is also unclear whether and how these sequence-to-sequence models are "gradually planning" the long conversations they produce. Therefore, designing controllable methods tailored to the structures in conversations becomes especially important for generating long and coherent conversations.

Figure 1: Overall process of our framework. The sequence-to-sequence model first generates a prototype conversation. We then corrupt the conversation through masking actions, masking utterances, and shuffling utterances in the forward process, and utilize the diffusion process to gradually enrich the prototype conversation with different levels of structured conversation information.
To this end, our work introduces a Controllable Conversation Generation Framework with Diffusion Models (Diffuse-CG, shown in Figure 1) to incorporate different conversational structures in a non-autoregressive manner, inspired by recent advances in deep generative models (Li et al., 2022b; Gong et al., 2022; He et al., 2022). Specifically, we first generate a prototype conversation using a pre-trained sequence-to-sequence model based on the input description. Then we leverage diffusion models to gradually enrich the prototype conversation with conversation structures. The diffusion process allows more flexible conversation generation by not enforcing a fixed left-to-right generation order; it also allows the model to gradually incorporate different levels of conversation structures to control the granularity, including action triples to add more specific topics and events (Gee, 2014; Chen and Yang, 2021b), dialogue acts to make the utterances more human-like (Allen and Core, 1997; Sacks et al., 1978; Chen and Yang, 2021a), and discourse relations to generate longer conversations with better coherence (Kirschner et al., 2012; Stone et al., 2013; Asher et al., 2016a). To make the diffusion process better adapted to conversation generation and more stable, we further improve the general diffusion model (Li et al., 2022b) with linguistic-informed noise: rather than pure Gaussian noise, we perturb the prototype conversation in the forward process by soft-masking action words, soft-masking utterances, and shuffling discourse relations. Experiments on two conversation datasets, SAMSum (Gliwa et al., 2019) and DialogSum (Chen et al., 2021), demonstrate the effectiveness of our framework. Moreover, by visualizing the intermediate generated conversations, we show that Diffuse-CG achieves better interpretability for understanding how the model structures and generates long-form conversations.

Related Work
Long Text Generation Long-form text generation has been a longstanding challenge in natural language generation, where models need to generate long, coherent, and open-ended narratives (Guan et al., 2022; Fan et al., 2018; Li et al., 2022a; Guan et al., 2021). Recent studies have shown impressive success in generating more coherent stories through adopting hierarchical model structures (Li et al., 2015), leveraging large pre-trained models (Fan et al., 2018), plan-then-generate frameworks (Shao et al., 2019; Tan et al., 2020; Goldfarb-Tarrant et al., 2020; Li et al., 2022a), and incorporating external knowledge (Guan et al., 2022; Fan et al., 2018; Xu et al., 2020). However, previous studies mainly focus on generating single-speaker stories and neglect one important form of long text: conversations. Such methods cannot be directly applied to generate multi-speaker conversations because of the complex linguistic structures in conversations such as back-and-forth interactions (Feng et al., 2020; Chen and Yang, 2021b).
Our work fills this gap by utilizing conversation structures to generate coherent conversations.
Dialogue Response Generation Numerous studies have been conducted on generating short responses conditioned on the previous context (Ji et al., 2021; Saha et al., 2022; Ramakrishnan et al., 2022), for example by adding a user persona (Wolf et al., 2019), paraphrasing template responses (Lippe et al., 2020), or using example guidance (Gupta et al., 2021; Cai et al., 2020). While achieving state-of-the-art performance, such models struggle to generate entire conversations because they can only generate one utterance at a time and easily lose focus when generating multiple rounds of utterances or the whole conversation (Gunasekara et al., 2021). This is largely because earlier errors cannot be corrected when generating utterance by utterance autoregressively, and because of the lack of awareness of rich conversation structures like long-distance relations in conversations (Stone et al., 2013; Asher et al., 2016a). To this end, we design a controllable and interpretable conversation generation framework that makes use of rich structures to generate the entire conversation in a non-autoregressive way.
Diffusion Model Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) are recently introduced state-of-the-art non-autoregressive generative models and have shown substantial success for visual modalities (Ramesh et al., 2022; Rombach et al., 2022). They are generally more interpretable and controllable, as they gradually denoise random vectors into the desired output via multiple intermediate steps (He et al., 2022; Austin et al., 2021). However, it is still difficult to apply diffusion models to textual data, because the input space of text is discrete and text is generally more complex in structure. A few exceptions model language generation with a diffusion process (Li et al., 2022b; Gong et al., 2022; He et al., 2022; Austin et al., 2021; Hoogeboom et al., 2021), where the continuous and discrete spaces are bridged through embedding and rounding (Li et al., 2022b; Gong et al., 2022; Dieleman et al., 2022). However, such approaches often use Gaussian noise in the forward process, which fails to leverage the linguistic structure of text when noising the input and makes diffusion models unstable and costly (He et al., 2022). Building upon these prior works, we utilize diffusion models for interpretable and controllable conversation generation and design a novel linguistic-informed noise for adapting diffusion models to generating textual conversations.

Background: Diffusion Models
Diffusion models are recent state-of-the-art deep generative models that iteratively denoise latent variables (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021). Basically, corruption (usually Gaussian noise) is gradually added to the input data distribution during a forward process. A diffusion model is then trained to recover the original input data distribution from the corrupted one step by step: every diffusion step reconstructs the small amount of information that was perturbed in the corresponding forward step.
A diffusion model usually consists of a forward noising process and a reverse diffusion denoising process. For a given sampled input data point, $x_0 \sim q(x_0)$, a Markov chain of latent variables $\{x_1, \dots, x_T\}$ is generated in the forward noising process $q(x_t \mid x_{t-1})$ by progressively adding a small amount of Gaussian noise to perturb the input data:

$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\big),$

where $\beta_t$ is a noise schedule controlling the amount of noise added at every step. Through the forward process, $x_T$ approaches an isotropic Gaussian distribution. Note that there are no trainable parameters in the forward process.
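The stepwise forward corruption can be sketched in a few lines of NumPy (the linear beta schedule and the toy shapes below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def forward_noising(x0, betas, rng):
    """Stepwise forward process:
    x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,  eps ~ N(0, I)."""
    xs = [x0]
    x = x0
    for beta_t in betas:
        eps = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * eps
        xs.append(x)
    return xs  # [x_0, x_1, ..., x_T]

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # toy linear schedule for illustration
x0 = np.ones((8, 16))                  # toy "input embeddings"
xs = forward_noising(x0, betas, rng)
# After many steps, x_T is close to an isotropic Gaussian (mean ~ 0, std ~ 1).
```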
Then a reverse diffusion process, parameterized by a learned model $p_\theta(x_{t-1} \mid x_t)$, is trained to denoise $x_T$ back to the original data $x_0$:

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\big),$

where $\mu_\theta(\cdot)$ and $\Sigma_\theta(\cdot)$ are the learned models.
The diffusion model is trained to maximize the marginal likelihood $\log p_\theta(x_0)$. Ho et al. (2020) expand and reweight the objective to obtain a mean-squared error ($L_2$) loss:

$\mathcal{L} = \sum_{t=1}^{T} \mathbb{E}_{q}\big[\, \lVert \bar{\mu}(x_t, x_0) - \mu_\theta(x_t, t) \rVert^2 \,\big],$

where $\bar{\mu}$ is the mean of the posterior $q(x_{t-1} \mid x_0, x_t)$, and $\mu_\theta$ is the predicted mean of $p_\theta(x_{t-1} \mid x_t)$, produced by the parameterized neural model.
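Under the standard Gaussian parameterization, the posterior mean and the resulting $L_2$ objective can be sketched as follows (a simplified illustration; in practice $\mu_\theta$ is the output of a neural network):

```python
import numpy as np

def posterior_mean(x0, xt, t, betas):
    """Mean of the posterior q(x_{t-1} | x_0, x_t) under the standard
    Gaussian diffusion parameterization (Ho et al., 2020)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    coef_x0 = np.sqrt(alpha_bar[t - 1]) * betas[t] / (1.0 - alpha_bar[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t])
    return coef_x0 * x0 + coef_xt * xt

def l2_objective(mu_theta, x0, xt, t, betas):
    """|| mu_bar(x_t, x_0) - mu_theta(x_t, t) ||^2, averaged over dimensions."""
    return float(np.mean((posterior_mean(x0, xt, t, betas) - mu_theta) ** 2))

betas = np.linspace(1e-4, 0.02, 100)
# A perfect prediction of the posterior mean yields zero loss:
loss = l2_objective(np.zeros(4), x0=np.zeros(4), xt=np.zeros(4), t=10, betas=betas)
```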

Our Approach
This section introduces our controllable conversation generation model to generate natural and coherent conversations, as shown in Figure 1. Basically, we first utilize a sequence-to-sequence model to generate a prototype version of the conversation based on the given short description (Section 4.1). We then gradually incorporate the conversation structure guidance to edit the prototype conversation in order from lower levels to higher levels (action triples, dialogue acts, and discourse relations) through diffusion models (Section 4.2).

Prototype Conversation Generation
We first train a sequence-to-sequence model f(F(·)) to generate the prototype conversation C_p based on the given conversation summary s, where F(·) is an encoder-decoder network and f(·) is a feed-forward network that maps the hidden representations to actual words. We initialize f(F(·)) with a pre-trained encoder-decoder model, i.e., BART-base (Lewis et al., 2020). f(F(·)) is learned on the ground-truth summary-conversation pairs (s, C_g) by minimizing the cross entropy L = −log P(C_g | s).
Once the prototype conversation generation model is learned, we utilize F(·) to generate the hidden representations X_0 = {w_0, ..., w_l} of the prototype conversation C_p with l words: X_0 = {w_0, ..., w_l} = F(s). Note that X_0 ∈ R^{l×d} is a matrix used as the initial latent variable in Section 4.2, where l is the number of words in the conversation and d is the dimension of the hidden representation.
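The two-stage decomposition f(F(·)) can be sketched with a toy stand-in for the encoder-decoder (all names, shapes, and the random "encoder" are illustrative assumptions; in the paper F(·) is BART-base):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 100

def F(summary_ids, length):
    """Toy stand-in for the encoder-decoder F(.): maps a summary to an
    l x d matrix of hidden states (the role BART-base plays in the paper)."""
    table = rng.standard_normal((vocab, d))
    ctx = table[summary_ids].mean(axis=0)   # crude "encoding" of the summary
    return np.tile(ctx, (length, 1))        # X_0 in R^{l x d}

def f(X, W):
    """Feed-forward head mapping hidden states to word ids (argmax rounding)."""
    return (X @ W).argmax(axis=1)

W = rng.standard_normal((d, vocab))
X0 = F(np.array([3, 17, 42]), length=12)    # hidden representation X_0
proto_ids = f(X0, W)                        # prototype conversation token ids
```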

Editing with Diffusion Models
With the hidden representation X_0 of the prototype conversation, we then introduce our diffusion model that gradually edits the prototype conversation to form the desired long conversation. Specifically, we first add linguistic noise to X_0 to get the noisy intermediate latent variables X_{1:T} in the forward process (Section 4.2.2), and then gradually denoise X_T to X̂_0 with different levels of conversation structure information in the diffusion process (Section 4.2.3). Last, we generate the long conversation C_l from the denoised X̂_0: C_l = f(X̂_0).

Structures in Conversations
This part introduces the three types of widely-used structures with different granularities in conversations utilized in our work: action triples, dialogue acts, and discourse relations. Action triples are "WHO-DOING-WHAT" triplets (e.g., "Sam-Asking for-Betty's number") in conversations that express specific socially situated identities and activities (Chen and Yang, 2021b). Dialogue acts describe the functions and roles of every utterance in a conversation. For example, natural conversations often contain interruption utterances with dialogue acts like acknowledgment, backchannel, and response acknowledgment (Allen and Core, 1997; Sacks et al., 1978). Discourse relations describe the relations between different utterances in a conversation (Asher et al., 2016b). For example, two utterances may be related to each other as a Question-Answer Pair.

Forward Process
We first add noise to the prototype conversation X_0 = {w_0, ..., w_l} to generate the noisy intermediate latent variables X_{1:T} in the forward process: X_{t+1} = q(X_t). To make the diffusion process more stable and efficient, the added noise needs to corrupt the prototype conversation and give the later diffusion process appropriate flexibility to generate conversations, while avoiding removing all the prior knowledge in X_0. Thus we design and apply different types of linguistic-informed noise to perturb the structured information in the conversation. We introduce three types of noise strategies based on the conversation structures into the forward process:

Soft-Masking Action Words

For soft-masking action words, we only add noise to the action words w_i in the prototype conversation in order to perturb the action information. These action words are the words that appear in the action triples extracted from the prototype conversation using OpenIE (Angeli et al., 2015; Chen and Yang, 2021b). At step t, we add a small amount of Gaussian noise to the action words w_i in the prototype conversation:

$q(w_i^{t+1} \mid w_i^{t}) = \mathcal{N}\big(w_i^{t+1}; \sqrt{1-\beta_t}\, w_i^{t}, \beta_t I\big),$

where $\beta_t$ is the amount of noise added at step t.
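A sketch of this position-restricted noising, where only action-word rows of the hidden-state matrix receive Gaussian noise (indices and shapes are illustrative):

```python
import numpy as np

def soft_mask_action_words(X, action_idx, beta_t, rng):
    """Apply one step of Gaussian noise only at action-word positions,
    leaving the rest of the prototype representation intact."""
    X = X.copy()
    eps = rng.standard_normal((len(action_idx), X.shape[1]))
    X[action_idx] = np.sqrt(1.0 - beta_t) * X[action_idx] + np.sqrt(beta_t) * eps
    return X

rng = np.random.default_rng(0)
X = np.ones((10, 8))                              # l x d prototype hidden states
Xt = soft_mask_action_words(X, [2, 5], 0.1, rng)  # noise only tokens 2 and 5
```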

Soft-Masking Utterances
For soft-masking utterances, we only add noise to all the words w_i in one utterance u of the prototype conversation so that the dialogue act of the utterance is perturbed. The utterance to mask is kept consistent across all steps for one prototype conversation, while we randomly reselect the utterance to mask in different epochs. At step t, we add a small amount of Gaussian noise to all the words w_i in the utterance u:

$q(w_i^{t+1} \mid w_i^{t}) = \mathcal{N}\big(w_i^{t+1}; \sqrt{1-\beta_t}\, w_i^{t}, \beta_t I\big), \quad \forall\, w_i \in u.$

Shuffling Discourse Relations

We further randomly switch the positions of two random utterances u_i and u_j in the conversation to perturb the discourse relations in the prototype conversation. At step t, we randomly shuffle the utterance positions:

$X_{t+1} = \mathrm{Shuffle}(X_t;\, u_i, u_j).$

In practice, we apply these three types of noise at the same time at every diffusion step t to model q(X_{t+1} | X_t). Note that the forward process does not contain any trainable parameters.
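The utterance-shuffling noise can be sketched as swapping two utterance spans in the token-level representation (the span boundaries below are illustrative):

```python
import numpy as np

def shuffle_two_utterances(X, spans, rng):
    """Swap the token spans of two randomly chosen utterances to perturb
    the discourse order of the prototype conversation."""
    i, j = rng.choice(len(spans), size=2, replace=False)
    (si, ei), (sj, ej) = spans[i], spans[j]
    parts = []
    for k, (s, e) in enumerate(spans):
        if k == i:
            parts.append(X[sj:ej])   # utterance j takes utterance i's slot
        elif k == j:
            parts.append(X[si:ei])   # and vice versa
        else:
            parts.append(X[s:e])
    return np.concatenate(parts, axis=0)

rng = np.random.default_rng(0)
X = np.arange(12, dtype=float).reshape(6, 2)   # 6 tokens, d = 2
spans = [(0, 2), (2, 4), (4, 6)]               # three 2-token utterances
Xs = shuffle_two_utterances(X, spans, rng)
```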

Diffusion Process
After corrupting the hidden representations of the prototype conversation X_0 into latent variables X_{1:T}, we gradually denoise X_T to X̂_0 through diffusion steps, X̂_{t−1} = p(X_t | θ), where θ are the learned parameters that model the state transition. In practice, the transition is modeled by Transformers. After every diffusion step t ∈ (0, T], we minimize the cross entropy between the conversation predicted from X̂_{t−1} and the ground-truth conversation C_g:

$\mathcal{L}_{ce} = -\log P(C_g \mid \hat{X}_{t-1}).$

To generate the desired conversation in a more controlled way, we incorporate the three levels of conversation-structure information introduced in Section 4.2.1 to control the generation; we describe each of them in detail below.
Action Triples

By incorporating action triple information, the conversation can include more details with diverse desired actions/events at the token level. During training, we first extract action triples A = {a_0, ..., a_m} from the ground-truth conversation C_g using OpenIE, where a_i is a "(WHO, DOING, WHAT)" triple. We then represent every triple a_i ∈ A with the average of the output embeddings from F(·). To encourage the generated conversation to describe the given action triples, after every diffusion step X̂_{t−1} = p(X_t | θ), t ∈ (t_a, T], we also minimize the sum of cosine distances between the average of every token's representation in X̂_{t−1} and every action triple's representation:

$\mathcal{L}_{a} = \sum_{a_i \in A} \big(1 - \cos(\bar{x}^{\,t-1}, \bar{a}_i)\big),$

where $\bar{x}^{\,t-1}$ is the average token representation in X̂_{t−1} and $\bar{a}_i$ is the representation of triple a_i.

Dialogue Acts

Editing the generated conversation with the desired dialogue act information encourages the generated conversation to be more diverse and more human-like at the utterance level (Allen and Core, 1997; Sacks et al., 1978). During training, we first extract the dialogue acts D = {d_0, ..., d_m} in every ground-truth conversation C_g with a learned linear dialogue act classifier, where d_i is a one-hot vector indicating the dialogue act of the i-th utterance. We sum them up to represent the dialogue act distribution of the ground-truth conversation: $\bar{d} = \sum_i d_i$.
To encourage the generated conversation to include utterances with the given dialogue acts, we force the generated conversation to have the same dialogue act distribution as the ground-truth conversation. Specifically, after every diffusion step X̂_{t−1} = p(X_t | θ), t ∈ (t_d, t_a], we first predict the dialogue acts D^{t−1} = {d^{t−1}_0, ..., d^{t−1}_n} for every utterance in X̂_{t−1} with the learned classifier, where d^{t−1}_i is the predicted vector containing the probabilities of the i-th utterance being classified as each dialogue act. We sum the predictions, $\hat{d}^{t-1} = \sum_i d^{t-1}_i$, where the j-th element in $\hat{d}^{t-1}$ denotes the expected number of j-type utterances in the conversation. We then minimize the $L_2$ distance between the ground-truth distribution and the predicted distribution from the generated conversation:

$\mathcal{L}_{d} = \big\lVert \bar{d} - \hat{d}^{t-1} \big\rVert_2^2.$

Discourse Relations

Controlling the generated conversation with discourse relation information encourages the utterances to be more related to each other, leading to a more coherent conversation at the conversation level. During training, we first pre-train a discourse parsing model on a human-annotated multiparty dialogue corpus (Asher et al., 2016b). With this parser, we extract the discourse relation matrix M ∈ R^{m×m×k} from the ground-truth conversation, where m is the number of utterances and k is the total number of different discourse relations. We sum the matrix over the first two dimensions to represent the discourse relation distribution of the ground-truth conversation, $\bar{r}_l = \sum_i \sum_j M_{i,j,l}$, where the l-th element of $\bar{r}$ is the total count of the l-th discourse relation in the conversation. We regularize the generated conversation to have the same discourse relation distribution as the ground-truth conversation. After every diffusion step X̂_{t−1} = p(X_t | θ), t ∈ (0, t_d], we first predict the discourse relation matrix M^{t−1} ∈ R^{n×n×k} with the pre-trained parser.
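The dialogue-act distribution matching can be sketched as an L2 distance between summed per-utterance distributions (the toy numbers below are illustrative):

```python
import numpy as np

def act_distribution_loss(pred_probs, gold_onehots):
    """L2 distance between the summed dialogue-act distribution of the
    generated conversation and that of the ground truth."""
    d_hat = pred_probs.sum(axis=0)     # sum of predicted per-utterance probabilities
    d_bar = gold_onehots.sum(axis=0)   # sum of one-hot gold dialogue acts
    return float(np.sum((d_bar - d_hat) ** 2))

gold = np.array([[1, 0], [0, 1], [0, 1]], dtype=float)  # 3 utterances, 2 act types
pred = np.array([[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]])   # classifier probabilities
loss = act_distribution_loss(pred, gold)                 # (1-1.2)^2 + (2-1.8)^2
```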
We also sum M^{t−1} over the first two dimensions, $\hat{r}^{t-1}_l = \sum_i \sum_j M^{t-1}_{i,j,l}$, and minimize the $L_2$ distance between it and the ground-truth distribution:

$\mathcal{L}_{r} = \big\lVert \bar{r} - \hat{r}^{t-1} \big\rVert_2^2.$

Objectives

In practice, we sequentially use all three conversation structures, from lower levels to higher levels, i.e., action triples → dialogue acts → discourse relations. The order is selected through an ablation study (Section 5.4). During training, we minimize the cross-entropy loss at every step together with the structure loss active in the corresponding step range:

$\mathcal{L} = \mathcal{L}_{ce} + \mathbb{1}\!\left[t \in (t_a, T]\right]\mathcal{L}_{a} + \mathbb{1}\!\left[t \in (t_d, t_a]\right]\mathcal{L}_{d} + \mathbb{1}\!\left[t \in (0, t_d]\right]\mathcal{L}_{r}.$
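The step-dependent gating of the three structure losses can be sketched as follows (the loss values in the example are placeholders):

```python
def structure_loss(t, T, t_a, t_d, losses):
    """Select which structure-level loss is active at diffusion step t:
    action triples for t in (t_a, T], dialogue acts for t in (t_d, t_a],
    discourse relations for t in (0, t_d]; cross entropy applies at every step."""
    total = losses["ce"]
    if t_a < t <= T:
        total += losses["action"]
    elif t_d < t <= t_a:
        total += losses["act"]
    elif 0 < t <= t_d:
        total += losses["discourse"]
    return total

losses = {"ce": 1.0, "action": 0.5, "act": 0.3, "discourse": 0.2}
loss_action_range = structure_loss(400, 500, 300, 100, losses)  # 1.0 + 0.5
```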

Datasets and Baselines
We perform experiments on two widely-used datasets, SAMSum (Gliwa et al., 2019) and DialogSum (Chen et al., 2021), as shown in Table 1. We utilize the summary as input and train the generation model to generate the long conversation. During pre-processing, we add a special token ("<s>") to indicate the beginning of every utterance. We truncate the conversation to 800 tokens. We compare our Diffuse-CG framework with several baselines: • BART-base (Lewis et al., 2020): We use BART-base as our backbone model. The input only contains the summary.
• BART-Concat: We improve pure BART by directly concatenating controlling information including the action triples, dialogue acts and discourse relations to the end of the input summary.
• Diffuse-CG-Con: We use a framework similar to our Diffuse-CG while the different levels of information are combined concurrently instead of sequentially.

Experimental Setting
We initialize the prototype conversation generation model with BART-base and train it for 20 epochs with a 3e-5 learning rate and a 0.15 warm-up ratio. The batch size is 4. For Diffuse-CG, we utilize a 4-layer Transformer with hidden dimension 512 to model p(·|θ). We set the number of diffusion steps to T = 500 (t_a = 300 and t_d = 100, i.e., 300 steps for action triples, 100 steps for dialogue acts, and 100 steps for discourse relations). We follow Li et al. (2022b) in using a sqrt schedule in the forward process. The learning rate is set to 3e-4 with a 0.1 warm-up ratio. The batch size is 4 and we train Diffuse-CG for 200k iterations. During inference, the beam size is set to 4. We perform all experiments on 4 NVIDIA V100 GPUs. For Diffuse-CG, training takes around 4.8 hours, and inference takes 1.4 seconds per generated dialogue.
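The sqrt schedule from Li et al. (2022b) can be sketched as below (the small offset s and the clipping to [0, 1] are our illustrative choices):

```python
import numpy as np

def sqrt_alpha_bar_schedule(T, s=1e-4):
    """sqrt schedule: alpha_bar_t = 1 - sqrt(t/T + s), which injects
    noise quickly at the start of the forward process."""
    t = np.arange(1, T + 1)
    return np.clip(1.0 - np.sqrt(t / T + s), 0.0, 1.0)

abar = sqrt_alpha_bar_schedule(500)
# High remaining signal early in the process, near zero at step T.
```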

Automatic Evaluation
We first evaluate all the models with the following automatic metrics: • ROUGE scores (Lin and Och, 2004) measure the n-gram overlap between the generated conversation and the ground-truth conversation.
• Action coverage rate, Dialogue act coverage rate, and Discourse relation coverage rate measure the coverage of the action triples, dialogue acts, and discourse relations in the generated conversation compared to the ground-truth conversation.
• LM score measures fluency by computing the perplexity from a GPT-2 model pre-trained on SAMSum and DialogSum.
• Length measures the length of the generated conversation.
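A minimal sketch of how such a coverage rate could be computed (exact set matching here is our simplifying assumption; the paper does not specify the matching criterion):

```python
def coverage_rate(generated_items, gold_items):
    """Fraction of gold structure items (action triples, dialogue acts, or
    discourse relations) that also appear in the generated conversation."""
    gold = set(gold_items)
    if not gold:
        return 1.0
    return len(gold & set(generated_items)) / len(gold)

gold = [("Sam", "ask for", "number"), ("Betty", "send", "number")]
gen = [("Sam", "ask for", "number")]
rate = coverage_rate(gen, gold)   # one of two gold triples is covered
```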
As shown in Table 2 and Table 3, we find that after adding the controlling structured information directly to the input, BART-Concat generates better conversations than naive BART. This shows that the introduced conversation-structure guidance provides effective information for conversation generation. By applying the diffusion process, Diffuse-CG-Con and Diffuse-CG further consistently improve the performance (e.g., 8%/28%/7% improvements in ROUGE scores), which demonstrates the effectiveness of our controllable conversation generation framework: it makes better use of both the input summary and the controlling signals by first generating the prototype conversation and then enriching it with the extra information through the diffusion process, which prevents distraction from the different information sources. Among the different noise and control signals, the soft-masking action words noise with action triple diffusion works best, followed by shuffling discourse relations with discourse diffusion, and then soft-masking utterances noise with dialogue act diffusion. Compared to the concurrent variant, our sequential Diffuse-CG works best, indicating that editing the long conversation in a suitable order (from the token level to the utterance level and then to the conversation level) is important. By gradually incorporating different levels of structure, the overall performance improves (e.g., the ROUGE scores increase from 38.12/18.45/27.38 to 40.54/19.43/28.57), suggesting that the sequential diffusion steps can edit the prototype conversation to higher quality step by step, and that all the introduced structures contribute.

Table 6: ROUGE-1 (↑), ROUGE-2 (↑), ROUGE-L (↑) scores, Action coverage rate (↑), Dialogue act coverage rate (↑), Discourse relation coverage rate (↑), language model scores (↓), and the length (↑) of the generated conversations for different orders in Diffuse-CG on the DialogSum test set. † marks the best order.
Human Evaluation We conduct a human evaluation to qualitatively assess the generated conversations. We ask Amazon Mechanical Turk workers to rank the quality of 100 randomly sampled conversations generated from a given summary by 4 different models. Specifically, we ask them to rank the outputs in terms of Coherency (the generated conversation is logical and consistent), Fluency (the generated conversation is reader-friendly), and Factualness (the generated conversation does not change the facts in the given short description). To increase annotation quality, we require turkers to have a 98% approval rate with over 10,000 approved tasks for their previous work. The pay rate was $0.5 per HIT. The rank for every summary was aggregated by majority voting. The Intra-Class Correlation (ICC1k) was 0.511, indicating moderate agreement (Koo and Li, 2016). The average ranks are shown in Table 4. Our Diffuse-CG achieves the best average rankings, indicating the effectiveness of incorporating conversation structures.

Ablation Study
This part describes our ablation studies on how our introduced linguistic-informed noises and the diffusion orders affect the model performances.
Noise Strategy We first compare the performance of Diffuse-CG with different noise strategies in Table 5. Gaussian Noise adds Gaussian noise to all the tokens in the prototype conversation in the forward process, following previous work (Li et al., 2022b), while our Linguistic-informed Noise only adds Gaussian noise to action words and a random utterance, as well as shuffling the utterances. Our noise shows significantly better performance on the SAMSum test set, indicating that a noise strategy that considers conversation structures provides more appropriate perturbation of the prototype conversation for the diffusion process: it leaves flexibility to edit the prototype conversation while preserving the prior knowledge it contains.
Diffusion Orders Regarding the impact of the order in which different structured information is added during the diffusion process, Table 6 shows that the best overall performance is achieved by the order action triples → dialogue acts → discourse relations, i.e., from a lower level (token/action level) to a higher level (conversation level). This might be because, in this order, more specific information can be introduced at the early stages, when the conversation is still flexible enough to absorb a large amount of detailed information. Once the conversation contains enough information, it is then more effective to operate at a higher level, such as the relations between different utterances. This also indicates the effectiveness of structured ordering in general, especially when there are multiple levels of controlling information.

Case Study
We further visualize the intermediate outputs of the diffusion process of our Diffuse-CG to interpret the generation process in Figure 2. As shown, the prototype conversation is short and coarse. When the action information is incorporated through the first diffusion stage, the conversation is enriched with more specific actions like "Amanda text Larry". After the dialogue act diffusion stage, the conversation is further modified to include utterances with dialogue acts like backchannel ("Urgh. All right"). Finally, with the discourse relation information utilized, the conversation becomes more interactive and coherent with more inter-utterance relations like QA pairs. These coarse-to-fine steps show how Diffuse-CG edits and generates better and longer conversations over time.

Conclusion
In this work, we introduce a novel controllable conversation generation framework that utilizes different levels of conversation structures to generate long and coherent conversations based on a given short description. Specifically, we first generate the prototype conversation and then enrich it with structure information like action triples, dialogue acts, and discourse relations, together with novel linguistic-informed noises for further adapting diffusion models to generate conversations. Experiments on SAMSum and DialogSum show the effectiveness of our framework by significantly improving over the baselines. Our proposed method also provides interpretability of how the model is gradually generating longer and better conversations.

Limitation
In this work, we mainly leverage control guidance such as action triples, dialogue acts, and discourse relations in structured forms that are extracted automatically from the corpus for training. We encourage future work to explore how to incorporate control information in natural language form (for example, natural language descriptions of the action information instead of triples). We also compose multiple modules (the prototype generator, the discourse classifier, etc.) to generate the final conversation, which might lead to larger error cascades when early modules introduce noise; future work might explore how to learn the pipeline in an end-to-end manner. Moreover, we mainly focus on using three major conversation structures to help generate the entire conversation; future work might explore other types of linguistic and human knowledge to further improve conversation generation quality.
