Improving Long Dialogue Summarization with Semantic Graph Representation

Although Large Language Models (LLMs) are successful in abstractive summarization of short dialogues, summarization of long dialogues remains challenging. To address this challenge, we propose a novel algorithm that processes complete dialogues comprising thousands of tokens into topic-segment-level Abstract Meaning Representation (AMR) graphs, which explicitly capture the dialogue structure, highlight salient semantics, and preserve high-level information. We also develop a new text-graph attention to leverage both graph semantics and a pretrained LLM that exploits the text. Finally, we propose an AMR node selection loss used jointly with conventional cross-entropy loss, to create additional training signals that facilitate graph feature encoding and content selection. Experiments show that our system outperforms the state-of-the-art models on multiple long dialogue summarization datasets, especially in low-resource settings, and generalizes well to out-of-domain data.


Introduction
Summarization of long dialogues is receiving more attention as virtual meetings become prevalent and advanced speech recognition technologies become accessible to accurately transcribe dialogues. Although the state-of-the-art (SOTA) models for abstractive summarization have achieved remarkable performance on non-conversational documents via pretraining on datasets involving Wikipedia, books, stories, and news, generating high-quality summaries for long dialogue transcripts remains challenging. In this paper, we present the use of a semantic representation, Abstract Meaning Representation (AMR), to address some of the challenges in long dialogue summarization and we demonstrate its use on two different genres of long dialogues: meetings and screenplays of TV episodes.
* These authors contributed equally to this work.
One major challenge with dialogues is their complex structure in combination with their informal nature. Chen and Yang (2021a) discussed the impact of repetitions, false-starts, and hesitations on dialogue structure and suggested that key information can be spread across different portions of the dialogue. Multiple speakers and diverse spoken language styles also complicate the interactions involving speakers and coreferences. For example, it may be challenging to associate speakers with their opinions and actions as well as with their reactions to other speaker utterances. Admittedly, recent encoder-decoder LLMs like BART (Lewis et al., 2020) achieve good performance on short dialogues given adequate finetuning. Long dialogues, however, have more content as well as speakers and this brings greater complexity to dialogue structure, leading to performance degradation for encoder-decoder LLMs.
A second challenge lies in the transformer architecture of LLMs, which often draws spurious correlations (Tu et al., 2020; Kaushik et al., 2019) and easily overfits small or homogeneous training sets. This limitation is particularly pertinent to long dialogue summarization, as it often involves low-resource domains. Meeting summarization datasets are particularly small. In fact, major meeting summary datasets, AMI (Carletta et al., 2006) and ICSI (Janin et al., 2003), have sizes of 137 and 59 transcripts respectively. In these very low-resource scenarios, a greater risk of overfitting is present.
To address these challenges, we propose a novel Abstract Meaning Representation (AMR) for long dialogues, to capture diverse entity interactions and complex structures. AMR, as a semantic graph representation, captures the most salient semantic knowledge using concept nodes and preserves inter-concept relations with labeled edges. It is believed to convey information largely orthogonal to what conventional models exploit from the text input (Song et al., 2019). AMRs thus provide reliable semantic and structural cues to alleviate overfitting, which we show is particularly helpful for the low-resource setting. Our approach generalizes the sentence AMR introduced by Banarescu et al. (2013) and specifically designs an algorithm for building topic-segment AMRs for long spans of conversations to preserve global information.
Our approach to incorporating AMRs for summarization also differs from previous research. Existing methods of AMR-based summarization rely on graph-to-graph operations for content selection and additional text generation modules (Liu et al., 2015; Lee et al., 2021), an approach that does not surpass state-of-the-art encoder-decoder LLMs. Instead, our work leverages both graph semantics and a pretrained encoder-decoder model. We develop a novel text-graph attention that exploits structural and global information in AMR to improve text encoding. We also propose a node selection loss used jointly with the standard cross-entropy loss for additional training signals.
Our work is orthogonal to the existing research on dialogue-specific pretraining (Zhong et al., 2021), which improves models' familiarity with dialogue styles and has achieved the state of the art for long-dialogue summarization with its two model variants, DialogLM and DialogLED. With our novel architecture, our AMR-based system can make full use of an LLM's pretrained weights while providing better global information aggregation and fine-grained structural cues, from topic segments to speaker-action associations and inter-utterance coreferences. Our code is available at https://github.com/Bobby-Hua/summarization-via-semantic-graph.
In sum, the contributions of our work are:
• a new algorithm to build AMRs for long dialogues and their application to summarization
• a novel AMR node selection loss for better graph encoding and content selection
• new SOTA results on 3 datasets, up by +1.24 in Rouge-1, +1.53 in Rouge-2, and +1.67 in Rouge-L (pooling best results from 3 datasets)

Related Work
Dialogue Summarization Previous approaches to dialogue summarization also seek to improve models' awareness of dialogue structures and interactions. Chen et al. (2021a) combines fact regularization via subject-verb-object (SVO) fact triplets with modeling of the relationship between summary sentences and the positions of their supporting utterances in the source text. Other features, such as topic segmentation (Li et al., 2019; Chen and Yang, 2020), dialogue acts (Goo and Chen, 2018), and conversation stages (Chen and Yang, 2020) have also been used in different models. The state-of-the-art in short dialogue summarization further suggests modeling discourse dependency and speaker-action relations (Chen and Yang, 2021b).
Our dialogue AMR is a more comprehensive representation of dialogue structures, which captures not only many useful features from the previous research but also fine-grained semantic structure and inter-utterance coreferences. The increased granularity of our representation allows it to effectively guide the encoding of text tokens via our novel cross-attention mechanism.
Summarization with AMR Current research that applies AMRs to summarization has primarily relied on graph-to-graph operations to produce summary AMR graphs. Liu et al. (2015) first introduces summarization via AMRs by merging sentence AMRs into document AMRs and conducting subgraph selection with integer linear programming. Other research adopts similar graph-to-graph methods, introducing new heuristics and algorithms for summary AMR generation and applying more recent generative models to generate fluent summaries (Dohare et al., 2018; Hardy and Vlachos, 2018; Lee et al., 2021). The graph-to-graph algorithms in these models are incompatible with the encoder-decoder LLMs behind current SOTA summarization models. Our work instead leverages both graph semantics and a pretrained encoder-decoder model, via our novel text-graph attention and additional node selection loss.
Dialogue AMR Bai et al. (2021) is the only work on dialogue AMR to the best of our knowledge. Its algorithm connects sentence AMRs with a dummy node and models coreference and identical concepts with additional edges. They only apply their dialogue AMRs to short conversations and for dialogue relation extraction and response generation. Our work is different in that we apply topic segmentation to process long dialogues and use node merging to allow a single AMR graph to capture a longer text span. Node merging, along with our speaker and utterance nodes, also creates a hierarchy of nodes with different centrality to help the graph encoder aggregate information at different levels.

Constructing AMR Representation for Long Dialogue

Topic Segmentation
A topic segment in a long dialogue is a consecutive sequence of topically-coherent lines. Topic segmentation can provide valuable high-level understanding of complex long dialogue structures, thus contributing important insights for summarization. Within a topic segment, speakers may respond to each other, request and perform actions, and refer to entities mentioned by others.
To make sure an AMR graph captures coherent semantics, we first perform topic segmentation on long dialogue. We adopt an existing strategy (Chen and Yang, 2020) that combines the classic topic segment algorithm C99 (Choi, 2000) with SentenceBERT (Reimers and Gurevych, 2019), to segment long dialogues into topic segments with reasonable lengths.
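The idea of cutting a dialogue where adjacent utterances stop resembling each other can be sketched as follows. This is a deliberately simplified stand-in and not the C99 algorithm: it places a boundary wherever the cosine similarity between consecutive utterances falls below a threshold. The `embed` function uses toy bag-of-words counts in place of SentenceBERT embeddings, and the function names and the `threshold` value are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(utterance):
    """Toy bag-of-words vector; a stand-in for a SentenceBERT embedding."""
    return Counter(utterance.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(utterances, threshold=0.1):
    """Split utterances into consecutive segments at low-similarity gaps."""
    if not utterances:
        return []
    vectors = [embed(u) for u in utterances]
    segments = [[utterances[0]]]
    for prev, cur, text in zip(vectors, vectors[1:], utterances[1:]):
        if cosine(prev, cur) < threshold:
            segments.append([text])      # similarity dropped: start a new segment
        else:
            segments[-1].append(text)    # still on the same topic
    return segments


dialogue = [
    "the remote needs fewer buttons",
    "fewer buttons makes the remote simpler",
    "our budget is twelve euros",
    "twelve euros is a tight budget",
]
topic_segments = segment(dialogue)
```

A real pipeline would replace `embed` with sentence embeddings and the threshold rule with C99's ranking and clustering steps; the sketch only shows the input and output shape of the segmentation stage.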

Topic Segment AMRs
To construct topic segment AMRs after segmentation, we adapt the steps in Bai et al. (2021) to build dialogue AMRs and we additionally introduce speaker nodes, utterance nodes, and a new procedure of node merging leveraging coreference relations. Given a topic segment consisting of multiple utterances, we use the AMR parser by Cai and Lam (2020a) to obtain an AMR graph for each utterance. Then, as illustrated by Figure 1, we construct the topic segment AMR by connecting utterance AMRs with utterance nodes, speaker nodes, and a topic segment node (the root node) with appropriate edges, and perform node merging based on coreferences. For each topic segment $s_i$, this procedure yields a single topic segment AMR.
Compared with Bai et al. (2021), we want our proposed speaker nodes and utterance nodes to encode the fine-grained hierarchical information from different levels of the topic segment AMR (synthesizing multiple utterances by one speaker, multiple sentences within one utterance, etc.). This is made possible by our graph encoder, which exploits speaker and utterance nodes' abundant and unique interactions with other AMR concepts (see Section 4.1).
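The assembly step described above can be sketched with plain Python data structures. The wiring used here (segment node to speaker node to utterance node to the root of the utterance AMR) and the edge labels `:speaker`, `:utterance`, and `:root` are assumptions made for illustration; the text specifies which node types exist but not these details. Utterance AMRs are assumed to be already parsed into triples, and concept nodes are prefixed with the utterance index so that they stay distinct until coreference-based merging.

```python
def build_segment_graph(utterances):
    """Connect per-utterance AMRs under one topic segment root.

    utterances: list of dicts with keys
      "speaker" (str), "root" (str), "triples" (list of (head, label, tail)).
    Returns (nodes, edges), where edges are (source, label, target) triples.
    """
    nodes = ["segment"]                      # topic segment node (the root)
    edges = []
    for index, utterance in enumerate(utterances):
        speaker_node = "speaker:" + utterance["speaker"]
        utterance_node = f"utterance:{index}"
        if speaker_node not in nodes:        # one node per distinct speaker
            nodes.append(speaker_node)
            edges.append(("segment", ":speaker", speaker_node))
        nodes.append(utterance_node)
        edges.append((speaker_node, ":utterance", utterance_node))
        root_node = f"{index}/{utterance['root']}"
        nodes.append(root_node)
        edges.append((utterance_node, ":root", root_node))
        for head, label, tail in utterance["triples"]:
            for concept in (head, tail):
                node = f"{index}/{concept}"
                if node not in nodes:
                    nodes.append(node)
            edges.append((f"{index}/{head}", label, f"{index}/{tail}"))
    return nodes, edges


segment = [
    {"speaker": "PM", "root": "open-01",
     "triples": [("open-01", ":ARG1", "meeting")]},
    {"speaker": "ME", "root": "present-01",
     "triples": [("present-01", ":ARG1", "research")]},
    {"speaker": "PM", "root": "thank-01", "triples": []},
]
nodes, edges = build_segment_graph(segment)
```

In this toy segment the speaker node for `PM` ends up linked to two utterance nodes, which is the kind of "multiple utterances by one speaker" aggregation the speaker nodes are meant to expose.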
We perform node merging based on coreference relations to obtain the final topic segment AMR graphs. We use a coreference resolution model by Dobrovolskii (2021) to obtain coreference relations between words, and JAMR (Flanigan et al., 2014) to obtain alignment between concepts and words. For a set of coreferencing nodes that do not contain speaker nodes, we merge them and select the most frequently appearing non-pronoun concept in the set as the concept for the merged node. Otherwise, we merge them and use the speaker node concept.
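The merging rule above can be sketched as follows, assuming coreference clusters have already been mapped to AMR node identifiers. The pronoun list, the fallback for clusters containing only pronouns, the choice of which node identifier survives, and the removal of self-loops and duplicate edges are all assumptions of this sketch rather than details given in the text.

```python
from collections import Counter

# Small illustrative pronoun list (an assumption, not the authors' list).
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "this", "that"}


def merged_concept(cluster, speaker_nodes):
    """Pick the concept for a merged node.

    cluster: list of (node_id, concept) pairs that corefer.
    speaker_nodes: set of node ids that are speaker nodes.
    """
    for node_id, concept in cluster:
        if node_id in speaker_nodes:
            return concept                   # speaker concept takes priority
    counts = Counter(c for _, c in cluster if c not in PRONOUNS)
    if not counts:
        return cluster[0][1]                 # only pronouns: keep the first (assumption)
    return counts.most_common(1)[0][0]       # most frequent non-pronoun concept


def merge_cluster(edges, cluster, speaker_nodes):
    """Redirect all edges of the cluster members to one surviving node."""
    members = {node_id for node_id, _ in cluster}
    target = next((n for n, _ in cluster if n in speaker_nodes), cluster[0][0])
    merged_edges = []
    for head, label, tail in edges:
        head = target if head in members else head
        tail = target if tail in members else tail
        edge = (head, label, tail)
        if head != tail and edge not in merged_edges:   # drop self-loops, duplicates
            merged_edges.append(edge)
    return target, merged_concept(cluster, speaker_nodes), merged_edges


edges = [("like", ":ARG1", "n1"), ("buy", ":ARG1", "n2"), ("n3", ":mod", "cheap")]
cluster = [("n1", "remote"), ("n2", "it"), ("n3", "remote")]
target, concept, merged = merge_cluster(edges, cluster, speaker_nodes=set())
speaker_concept = merged_concept([("s1", "marketing_expert"), ("n5", "i")], {"s1"})
```

After merging, the single surviving node collects the edges of every mention, which is how frequently mentioned entities gain centrality in the topic segment graph.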
Through merging, a topic segment AMR graph can contain a single node for each distinct instance of a concept mentioned in the sentence AMR graphs. Previous work on node merging in AMR graphs (Liu et al., 2015) aims at reducing graph size and directly producing summary AMR graphs, so they merge identical concepts regardless of whether they are the same entity. Lee et al. (2021) shows that merging identical concepts results in undesirable information loss. In many cases, nodes representing different instances of an entity will be merged across sentences regardless of whether those nodes actually refer to the same entity. Lee et al. (2021) also shows that merging based on the combination of concept and coreference works best at producing summary AMR graphs. In our work, we only merge coreferencing nodes, as we want to maximize the preservation of information while allowing frequently occurring entities to have greater graph centrality in the topic segment AMR, so that relevant information can be explicitly aggregated and later encoded.

Model Architecture
Our model consists of an AMR graph encoder, a text encoder with text-graph cross-attention, and a decoder. The full architecture is illustrated in Figure 2.

AMR Encoder
To exploit AMR's rich structural information and entity interactions, we apply Cai and Lam (2020b)'s graph transformer to encode a list of topic segment AMRs, $G = \{\{V_1, E_1\}, \{V_2, E_2\}, \ldots\}$. First, the relationship $r_{ij}$ between two AMR concepts (nodes) $v_i$, $v_j$ is obtained by encoding the shortest path between them with a GRU. Then, we encode the segment AMR in the graph encoder, where every concept node attends to every other concept node with a modified attention mechanism informed by the encoded relationship between them. Overall, the AMR encoder $F_G$ maps the $i$-th topic segment AMR $\{V_i, E_i\}$ to hidden states $H_i$. The graph output (hidden states) for the entire dialogue is a list of hidden states $F_G(G) = \{H_1, H_2, \ldots\}$, which will be passed on to the text encoder and the decoder.
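The input to the relation encoder, the sequence of edge labels along the shortest path between two concepts, can be found with a breadth-first search. The sketch below covers only that path extraction; the GRU that turns the label sequence into a vector is omitted. Marking edges traversed against their direction with an `-of` suffix follows the usual AMR inverse-role notation but is an assumption about how direction is represented here.

```python
from collections import deque


def shortest_relation_path(edges, source, target):
    """Return the edge labels on a shortest path from source to target.

    edges: list of (head, label, tail) triples of one AMR graph.
    Returns a list of labels, [] if source == target, or None if unreachable.
    """
    adjacency = {}
    for head, label, tail in edges:
        adjacency.setdefault(head, []).append((tail, label))
        # Allow walking an edge backwards, marked as an inverse role.
        adjacency.setdefault(tail, []).append((head, label + "-of"))
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for neighbor, label in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [label]))
    return None


amr = [
    ("want-01", ":ARG0", "boy"),
    ("want-01", ":ARG1", "go-02"),
    ("go-02", ":ARG0", "boy"),
]
forward = shortest_relation_path(amr, "want-01", "boy")
backward = shortest_relation_path(amr, "boy", "go-02")
```

Each pair of nodes in a topic segment AMR would get one such label sequence, which the relation encoder then summarizes into $r_{ij}$.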

Text Encoder
Our text encoder takes as input the text token embeddings $X = \{x_0, \ldots, x_n\}$ and the graph hidden states $F_G(G)$. For the text tokens, we adopt the simple sliding-window self-attention (Beltagy et al., 2020), which allows for a linear computation complexity instead of the $O(n^2)$ complexity in conventional transformers, and thus is more suitable for long sequence modeling.
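To make the complexity difference concrete, the following sketch counts how many query-key pairs a sliding-window pattern evaluates compared with full self-attention. The symmetric window of `window // 2` tokens on each side and the sample sizes are assumptions chosen for illustration.

```python
def window_neighbors(n_tokens, window):
    """For each position, the range of positions it may attend to."""
    half = window // 2
    return [
        range(max(0, i - half), min(n_tokens, i + half + 1))
        for i in range(n_tokens)
    ]


def attention_pairs(n_tokens, window):
    """Number of (query, key) pairs under sliding-window attention."""
    return sum(len(r) for r in window_neighbors(n_tokens, window))


small = attention_pairs(8, 4)          # 8 tokens, 2 neighbors on each side
long_window = attention_pairs(4000, 512)
long_full = 4000 * 4000                # full self-attention for comparison
```

The windowed count is bounded by `n_tokens * (window + 1)`, so it grows linearly with the sequence length for a fixed window, whereas the full count grows quadratically.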
To incorporate the graph information, we propose a new text-graph cross-attention to help the encoding of local tokens with 1) dialogue structural features explicitly exposed by AMRs and 2) the global semantic information from all the topic segments. We describe our text-graph attention and its advantages in the sections below.

Text-graph Cross-attention
Since our dialogues consist of thousands of tokens, we do not want all of them to have cross-attention to the AMRs (to avoid high computation complexity). Instead, we add special [BOU] (beginning-of-utterance) tokens to every utterance and only allow the cross-attention from the [BOU]s to all the graph hidden states. Specifically, in every text encoder layer, after a [BOU] token has full attention to the tokens in its window (sliding-window self-attention), it cross-attends to all the topic segment AMRs of the entire dialogue. This step enriches the [BOU] embedding with the structural and global information from AMRs. The enriched [BOU] embedding, when sent to the next encoder layer, will guide the encoding of the surrounding text tokens as they attend to the [BOU] via the sliding-window self-attention. This way, even if most of the text tokens do not have cross-attention on AMR graphs, they still benefit from the structural and global information indirectly, via the [BOU] hidden states.
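A minimal sketch of this restriction is given below: only positions flagged as [BOU] query the graph hidden states, and every other token state passes through unchanged. It uses a single attention head, no learned projections, and a plain residual addition, all of which are simplifying assumptions; the actual layer in the model is more elaborate.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, graph_states):
    """Scaled dot-product attention of one query over all graph hidden states."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, state)) / math.sqrt(dim)
        for state in graph_states
    ]
    weights = softmax(scores)
    return [
        sum(w * state[d] for w, state in zip(weights, graph_states))
        for d in range(dim)
    ]


def bou_cross_attention(token_states, is_bou, graph_states):
    """Enrich only the [BOU] positions with graph context (residual add)."""
    updated = []
    for state, flag in zip(token_states, is_bou):
        if flag:
            context = cross_attend(state, graph_states)
            updated.append([s + c for s, c in zip(state, context)])
        else:
            updated.append(list(state))   # ordinary tokens are left untouched
    return updated


tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
flags = [True, False, False]              # first token plays the role of [BOU]
graph = [[1.0, 0.0], [0.0, 1.0]]          # hidden states of two AMR nodes
enriched = bou_cross_attention(tokens, flags, graph)
```

The other tokens would pick up the graph information in the next layer, when they attend to the enriched [BOU] state inside their sliding window.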
Topic-segment Embedding To improve the text-graph cross-attention's information selection, we want an utterance's cross-attention to distinguish if an AMR graph represents the utterance's own topic segment (i.e. the segment that contains this utterance) vs. AMRs of other topic segments. Intuitively, good cross-attention should treat the topic segments differently depending on their relevance to an utterance, by assigning greater attention weights to the AMR of an utterance's own topic segment, for example. Therefore, we apply learnable segment embeddings (Devlin et al., 2019). For the $i$-th segment, the embedder produces a segment embedding $E_s$, which is added to the graph output of every concept in the segment. The same embedding is also added to all the [BOU] embeddings, $E_{[BOU]}$, corresponding to this segment. All segment embeddings are added to the $E_{[BOU]}$ and AMR hidden states when the text encoder first receives these inputs. Figure 3 illustrates the topic-segment embeddings.
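The bookkeeping for the segment embeddings can be sketched as below, with a fixed lookup table standing in for the learnable embedder. The function name, the argument layout, and the toy values are assumptions made for illustration.

```python
def add_segment_embeddings(graph_states, bou_states, bou_segment_ids, segment_table):
    """Add each segment's embedding to its AMR node states and its [BOU] states.

    graph_states: one list of node vectors per topic segment.
    bou_states: one vector per [BOU] token.
    bou_segment_ids: index of the topic segment containing each [BOU].
    segment_table: one embedding vector per topic segment (learned in practice).
    """
    def add(vector, embedding):
        return [v + e for v, e in zip(vector, embedding)]

    new_graph = [
        [add(node, segment_table[i]) for node in nodes]
        for i, nodes in enumerate(graph_states)
    ]
    new_bou = [
        add(state, segment_table[seg])
        for state, seg in zip(bou_states, bou_segment_ids)
    ]
    return new_graph, new_bou


segment_table = [[0.5, 0.0], [0.0, 0.5]]
graph_states = [[[1.0, 1.0]], [[2.0, 2.0], [3.0, 3.0]]]
bou_states = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
new_graph, new_bou = add_segment_embeddings(
    graph_states, bou_states, [0, 0, 1], segment_table
)
```

Because a [BOU] and the nodes of its own segment receive the same offset, their dot products shift together, which gives the cross-attention a simple signal for telling the utterance's own segment apart from the others.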

Advantages of our text-graph cross-attention
This method captures the relational information from AMRs, allowing for a better grasp of dialogue structures. Also, the root node, utterance nodes, and important entity nodes have aggregated information to different extents depending on their centrality, which helps text-graph attention aggregate structural features across different levels of granularity.
This cross-attention also allows the global semantic information from the complete dialogue AMR to guide the encoding of local tokens, since each [BOU] embedding attends to the concepts from all the topic segments of the conversation. The segment embeddings also help our text-graph attention extract relevant patterns more easily by looking at topic segments of different lengths while distinguishing its own segment from others.
Finally, cross-attention to AMR graphs is more efficient than directly attending to all the tokens in the text. AMR abstracts away the unimportant tokens and only keeps the salient information. It thus compresses the input sequence in terms of token numbers. For a typical dialogue, the number of AMR nodes ranges from half to two-thirds of the token number, leading to significantly lower computational complexity for the cross-attention on AMRs than a self-attention to all the text tokens, which is a common way to introduce global information found in LLMs like Longformer.
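As a rough worked comparison, suppose a dialogue has $n$ text tokens, $u$ [BOU] tokens, and $m$ AMR nodes. The symbols and the sample values of $n$ and $u$ are illustrative assumptions, not figures from our experiments; only the ratio $m/n$ between one half and two-thirds comes from the paragraph above. The comparison counts attention scores computed when the $u$ [BOU] tokens attend to the graph versus to all text tokens.

```latex
\begin{align*}
\text{cost}_{\text{graph}} &\propto u \cdot m, \qquad
\text{cost}_{\text{text}} \propto u \cdot n, \qquad
\frac{\text{cost}_{\text{graph}}}{\text{cost}_{\text{text}}} = \frac{m}{n}
\in \left[\tfrac{1}{2}, \tfrac{2}{3}\right] \\
n = 6000,\ u = 300 :\quad u \cdot n &= 1{,}800{,}000, \qquad
900{,}000 \le u \cdot m \le 1{,}200{,}000
\end{align*}
```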

Decoder
We use a transformer decoder to generate the summary sequence. At each decoding stage $t$, self-attention is applied to hidden states of the previous $t - 1$ generated tokens. Then, the model synthesizes information through two cross-attentions, to the text hidden states and graph hidden states respectively. In this way, the AMR information not only benefits text encoding but also directly contributes to the generation of summaries.
During training, the decoder produces a standard cross-entropy loss $L_t$ based on the teacher-forcing training strategy (Bengio et al., 2015).

Node Selection Loss
Additionally, we propose an auxiliary task of summary node selection at training time to create additional training signals for our AMR encoder and learn features useful for content selection. For each dialogue, we derive a summary AMR from the gold summary. We then apply a multilayer perceptron classifier to the graph hidden states of dialogue AMR concepts and make binary predictions on whether a node should appear in the gold summary AMR. We use a binary cross-entropy as the node selection loss $L_s$ and train this auxiliary task jointly with our summarization task. Thus, our final loss is $L = L_t + L_s$.
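The loss computation can be sketched as follows. Deriving labels by matching concept strings between the dialogue AMR and the summary AMR, and averaging the binary cross-entropy over nodes, are assumptions of this sketch; the probabilities stand in for the output of the multilayer perceptron classifier, and the generation loss value is a made-up number.

```python
import math


def node_labels(dialogue_concepts, summary_concepts):
    """Label 1.0 for dialogue nodes whose concept appears in the summary AMR."""
    summary = set(summary_concepts)
    return [1.0 if concept in summary else 0.0 for concept in dialogue_concepts]


def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Mean binary cross-entropy over nodes (eps guards against log(0))."""
    losses = [
        -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))
        for p, y in zip(probabilities, labels)
    ]
    return sum(losses) / len(losses)


def total_loss(generation_loss, selection_loss):
    """Final training loss: L = L_t + L_s."""
    return generation_loss + selection_loss


dialogue_concepts = ["remote", "weather", "teletext"]
summary_concepts = ["remote", "teletext", "decide-01"]
labels = node_labels(dialogue_concepts, summary_concepts)
selection = binary_cross_entropy([0.5, 0.5, 0.5], labels)
loss = total_loss(2.0, selection)          # 2.0 is a placeholder for L_t
```

With an uninformed classifier that outputs 0.5 for every node, the selection loss equals $\ln 2$, which gives a convenient reference point when monitoring whether the graph encoder is learning anything.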
The node selection loss is only for feature learning, unlike the previous graph-to-graph AMR-based summarization system (Lee et al., 2021), which selects nodes to construct a summary AMR and generates the summary therefrom. Our node selection loss here directly forces the AMR encoder to extract semantic information from AMRs. This change is particularly useful since our text encoder and decoder are initialized with extensively pretrained weights but the graph encoder is initialized with random weights. We hope the node selection loss can help mitigate this gap and help the graph encoder receive meaningful gradient updates from the very beginning of training. Finally, the node selection loss also intuitively helps extract graph features useful for content selection, which are utilized by the downstream text encoder-decoder via cross-attentions.

Long Dialogue Summarization Datasets
We trained and evaluated our models on the AMI (Carletta et al., 2006), ICSI (Janin et al., 2003), and ForeverDreaming (Chen et al., 2021b) datasets. AMI includes 137 transcripts from product design meetings and ICSI includes 59 transcripts from academic group meetings; both have gold summaries written by professionals. ForeverDreaming has transcripts from 4348 episodes of 66 TV shows and community-contributed gold summaries.

Implementation
During our development stage, we found our system had no significant performance difference whether our text encoder and decoder used the configuration of DialogLED-large or that of DialogLED-base. Thus, our full system only uses the base configuration and has 217 million trainable parameters in total, half the size of DialogLED-large (460 million). We report the baseline results from both DialogLED-base and the state-of-the-art DialogLED-large for the sake of completeness. We use random weights for the AMR encoder and the pre-trained weights from DialogLED-base to initialize our text encoder and decoder. We extend DialogLED's vocabulary and resize its embedding matrix to include additional AMR-specific tokens.
We adopt a learning rate of 2e-5, the Adam optimizer, and a batch size of 32. Other implementation details are described in Appendix A.

Results
We report ROUGE scores (Lin, 2004) as automatic evaluation metrics and also report human evaluation results. To test the statistical significance of the improvement brought by our model, we use the Almost Stochastic Order test (ASO) (Del Barrio et al., 2018; Dror et al., 2019) as implemented by Ulmer et al. (2022). ASO tests the statistical significance by evaluating stochastic dominance. ASO is more suitable for deep learning models because it does not require the t-test's normal distribution assumption, which likely does not hold for neural networks (Dror et al., 2019). We also report t-test results for the sake of completeness.

Performance on in-domain Test Sets
As shown in Table 1, our model outperforms DialogLED-large on the meeting datasets, including an increase in Rouge-2 for ICSI and a 1.67 increase in Rouge-L for ICSI. The improvement is greater when we compare our model with DialogLED-base, which has the same pre-trained weights for the text encoder-decoder as ours. This suggests that our AMR graphs have enhanced the small DialogLED-base backbone, allowing it to outperform its "large" counterpart. For the ForeverDreaming dataset, our model also outperforms DialogLED-large, though the margin is smaller. The smaller performance increase on ForeverDreaming than on AMI and ICSI supports our intuition that AMRs are more helpful under low-resource settings, where the risk of overfitting is higher. AMRs expose relevant semantic information and abstract away the syntactic/stylistic patterns, which helps prevent models from drawing spurious correlations. As discussed in 5.1, ForeverDreaming is a larger dataset with diverse training instances, so it may already contain enough counterexamples for the potential spurious correlations, which intuitively contributes to the model's robustness (Tu et al., 2020) and thus reduces the performance gap between the baseline and our model.

Ablation Results
We perform ablation experiments on AMI and ICSI, which require less training time and are most representative of the low-resource dialogue summarization setting we are interested in. The results are presented in Table 3. Overall, the node selection loss, encoder text-graph attention, decoder graph attention, and the dialogue AMR all contributed to our model's performance. Among the architecture changes we proposed, removing the encoder text-graph attention resulted in the greatest performance degradation for both datasets. The benefits of the node selection loss were also substantial and comparable to the improvement brought by the text-graph attention. The conventional decoder graph attention also contributed to the metrics, though removing it did bring a slight increase in Rouge-2 for AMI. Since the magnitude of this increase is small, we still consider the decoder graph attention an important component in our system. Finally, removing AMRs made the model equivalent to DialogLED-base and thus led to the lowest performance in the table.
Table 4: Human evaluation on Succinctness, Fluency, Specificity, and Faithfulness. Results are the percentage of times raters prefer instances generated by our model over DialogLED-L.

Human Evaluation
For human evaluation, we ask 8 university students to make pairwise comparisons of the summaries generated by our system and by DialogLED-large. For each source dialogue, we present its two summaries in a random order (summary 1 from our model and summary 2 from the baseline, or the opposite). This way, the students do not know which system produced a specific summary. We use the full test set of 20 meetings from AMI and a random subset of 30 episodes from ForeverDreaming. Student raters compare summaries according to four metrics: succinctness (e.g., does not contain redundant information), fluency (e.g., does not contain grammatical errors), specificity (e.g., does not contain too general or uninformative statements) (Louis and Nenkova, 2011), and faithfulness (e.g., does not contain information that contradicts the source text) (Chen and Yang, 2021b; Zhong et al., 2021). We use specificity to replace the commonly used metric, informativeness (covers the most important content (Chen and Yang, 2021b)), since for long dialogues with multiple topics, it is difficult to decide if a piece of information in the summary is important. Specific instructions and definitions of these terms can be found in Appendix B. As shown in Table 4, our system that utilized semantic information from AMR graphs generated better summaries with respect to all four metrics in both datasets. We believe our AMR graphs helped the model utilize the salient information and thus generate more specific and faithful summaries. We observe less improvement in fluency, likely due to the fact that our model and the baseline have the same decoder structure. The improvement is also smaller for the ForeverDreaming dataset, which is consistent with our ROUGE scores.

Out-of-domain Evaluation
To examine how our model generalizes, we used our model trained on ForeverDreaming and directly evaluated it on the test sets of AMI and ICSI. We noticed that the gold summary styles significantly differ for the meetings and the TV shows. Thus, we combined the TV-finetuned text and graph encoders with a decoder that only has pretrained DialogLED weights. This helped the decoder produce a more neutral style. This setup still constitutes a meaningful scenario where we have some training data for the TV domain but no data for the meeting domain. As shown in Table 5, in the out-of-domain setting, our system outperforms the baseline by an even greater margin than in the in-domain setting. These results further support our claim that our model is good at learning relevant features that are generalizable. The improved generalization brought by AMRs also suggests a promising future direction in using semantic graph representations to improve models' zero-shot performance and apply to a broad set of tasks under low-resource settings.

Conclusion
In this work, we propose a novel AMR algorithm to capture long dialogue structures. We develop a text-graph cross-attention and node selection loss to effectively extract structural features and integrate them into an encoder-decoder LLM for summarization. Empirically, we show that our system outperforms the SOTA models and achieves particularly promising results in low-resource settings and out-of-domain evaluation.

Limitations
Constructing semantic graphs, in general, requires multiple additional tools, which inevitably introduce errors. In this paper, we worked to reduce potential errors. We used the SOTA AMR parser and coreference resolution model and adopted mechanisms to reduce error propagation to our final graphs. However, we have not measured errors involving topic segmentation and AMR parsing due to the expensive human annotations required. It will be helpful to investigate how these errors can impact system performance when they are combined with encoder-decoder LLMs and if the AMR encoder is robust to small errors in AMR graphs. We leave these investigations as future work.
Finetuning LLMs for long dialogues requires many GPU hours and much energy. Therefore, our hyperparameter search was limited to 3 different values for learning rates, 3 for warmup steps, and 2 for batch size. A more extensive search may bring additional improvement to our model.

Ethics Statement
It is possible that unintended use of our system could amplify the impact of offensive language and bias in online discussions, as the salient opinions can be extracted and become more visible to the public. We propose using a toxicity classifier on our output to identify and suppress biased and offensive summaries. Despite the improved faithfulness of our system, automatic summaries in general may contain factual errors or inconsistencies with the opinions of the speaker. Thus, anyone who uses an automatic summarization system must proceed with caution and refer to the source document when necessary.

B Human Evaluation Details
Detailed definitions for our four metrics:
Succinctness: how much redundant information is there in the summary? For example, the same piece of information should not appear multiple times in slightly paraphrased forms. Please note that the simple repetition of words within a sentence should be considered as lacking fluency instead of succinctness (see the definition of fluency).
Fluency: how natural do the sentences seem to humans and how many grammatical mistakes do they have? Sentences with erroneously repeated words or phrases are also considered not fluent (e.g., "I have have not .." or "they they will not . . . ").
Specificity: how specific is the information in the summaries? Sentences could be general or specific: general sentences are broad statements made about a topic, while specific sentences contain details and can be used to support or explain the general sentences further (Louis and Nenkova, 2011).
Faithfulness: does the summary contain information not supported by the source text? Does the summary associate actions and opinions with the correct individuals? (Chen and Yang, 2021b; Zhong et al., 2021)
We instruct the raters to perform the following steps:
• Read the full source text first to get a general impression. For each instance, work on one criterion at a time and record which summary is better in terms of each criterion.
• For Succinctness and Fluency, you may make your judgment without consulting the source text again to save time.
• For Specificity and Faithfulness, separate the summaries into Summarization Content Units (SCUs) first (Nenkova and Passonneau, 2004). The goal is to split sentences into small phrases, each conveying a stand-alone piece of information.
• For example, the following summary is split into 4 SCUs: [the project manager recapped the decisions made in the previous meeting]1. [the industrial designer presented the working design]2 and [discussed the interior workings of a remote]3 and [how to incorporate the corporate image into the design]4.
• Then, for each SCU, check whether it is specific and/or unfaithful according to the definitions above. The summary with a higher number of specific SCUs is considered better in specificity. The summary with a lower number of unfaithful SCUs is considered better in faithfulness.
• Finally, record your evaluation in the provided spreadsheet.
Our raters consist of 8 university students, primarily native speakers, with some international students who are fluent in English. As shown in Table 6, we measured the inter-annotator agreement in our human evaluation using the Fleiss' kappa score. According to Landis and Koch (1977)'s interpretation of kappa scores, our scores suggest that there was mostly a fair (0.21-0.40) to moderate (0.41-0.60) agreement among our human annotators, depending on the specific evaluation category and dataset.
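For reference, Fleiss' kappa can be computed from a table of per-item category counts as sketched below. The layout (one row per compared summary pair, one column per choice such as "prefers our system" and "prefers the baseline") and the assumption that every item is rated by the same number of raters are illustrative; the toy counts are not our evaluation data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts: one row per item; row[j] is the number of raters who chose
    category j. Every row must sum to the same number of raters (>= 2).
    Undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Overall proportion of ratings that fall in each category.
    p_category = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among rater pairs for each item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_category)
    return (observed - expected) / (1.0 - expected)


perfect = fleiss_kappa([[8, 0], [0, 8]])     # all 8 raters agree on every item
split = fleiss_kappa([[4, 4], [4, 4]])       # raters split evenly on every item
```

A value of 1 indicates complete agreement, while values at or below 0 indicate agreement no better than chance under this statistic.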

C Examples
We provide some example summaries produced by the baseline system (DialogLED-L) and our system, alongside the corresponding source texts.
For better readability, we present the summary as a contiguous text instead of a list of Summarization Content Units (SCUs). We highlight the text spans we found specific (as defined by our specificity criterion) in yellow and mark the spans we found unfaithful with the red text color. When applicable, we use superscripts to mark the corresponding conflicting text span in the source text to explain unfaithfulness.
Hallucinated information is also marked as unfaithful but does not have a superscript, since there is no related text span in the source text. Additionally, we underline the redundant text spans in the summary. We did not mark fluency problems, as our model performs similarly to the baseline in fluency.

Example 1:
In this example, our human raters consistently judged our system's output to have higher specificity and faithfulness than the baseline's.
Output from Baseline System (DialogLED-L): the project manager opened the meeting and stated the agenda. the marketing expert presented research on users of remote controls, showing that remotes are being considered ugly. the group also indicated that users aged sixteen to forty five years often get lost in the room. the research showed that younger users like to have a fancy-looking remote. the user interface designer discussed the basic layout of the device and suggested that the remote should look very user-friendly. the industrial designer presented requirements for the project , including the number of buttons , power , and teletext 2 . the project manager briefed the group on some new requirements and led them in a discussion in which the group discussed their target group and what functions they would like to see in the remote they will produce. the target group will be people aged 16 to 40 1 . the remote will be used only for televisions. the corporate image will be displayed in the product. the remotes will have buttons for the most frequently used functions. there will be a docking station for the lcd screen. remotes only have the general functions of a remote. teletext will not be included in the design 3 . remote will use a scroll button 4 . the elderly mobile phone will not have internet access. the maximum production cost will be 12.50 euros. the group decided to include teletext for the first time in their remote design. 
they decided to eliminate the docking station 6 and keep the buttons for channel selection and volume selection. they also decided to ignore the audio signal 5 and to focus on the main functions of the remote. it will be too complex and the production costs will go sky-high. the users of 16 to forty-five years old were generally quite pleased with the design and the interface.
Output from Our System: the project manager opened the meeting and introduced the new requirements for the project. the marketing expert presented research on users of remote controls and indicated that users aged sixteen to forty-five typically do not like remotes. the user interface designer discussed the interior workings of a remote and stressed the importance of a simple design. he suggested that the remote should look very simple and be easy to learn. the industrial designer gave a presentation about the functions that will be included in the design. the group discussed the option to include teletext, but decided that it was not a useful feature 3 . the project manager briefed the group on some new requirements and led them in a discussion about the remote's basic layout and its basic layout, and suggested that they should focus on the target group and not on the teletext function. they also discussed the docking station which will allow the remote to be used only for televisions and which would not increase the production cost. they discussed the possibility of using a locator function and decided that the buttons for the most frequently used functions and the lcd screen should be included. the remote will have a docking station for the tv functions. there will be no teletext feature on the remote 3 because it is too complex and the time to market will be too big, if we are to have a fancy looking remote. the group also complained that remotes often get lost in the room. 
they complained that there was not enough information about how a remote works and that elderly people need to adapt to new devices. the designers of the project had problems with the group's tendency to mess things up and to
Source Text: product_manager: everybody found his place again? marketing_expert: yes. product_manager: that's. so this is our second meeting. and still failing? now we're going into the functional design. important thing of this phase is that we're going to try to get an agreement about the user requirements, technical function design, and the working design. so that we can move onto the second phase. but first this phase. first an announcement. there's a little adaptation in the air conditioning system. so there's our ghost mouse again. that that means that you can have a little trouble with, little trouble with the air conditioning, that's because of this it's in wing c _ and e _. it should be over in a while, couple of days. but it's going to be cold anyway, so i don't think you're gonna need it. marketing_expert: no. product_manager: then our agenda. now first the opening. this time i will take the minutes. you're going to have a presentation. all of you. and we've got forty minutes for the whole prese for the whole presentations. so i suggest we take about seven minutes per presentation, and then we can have a little discussion about the new project requirements which have been sent to me. and then the decision on the control functions which we wanna include and those which we don't wanna include. we've got forty minutes for all of it. i suggest let's start with the first presentation. who wants to be first? marketing_expert: think i'll go first. product_manager: just maybe it's easier if you will tell your presentation as. just which function you have and what you're gonna talk about. marketing_expert: my name is freek van ponnen. i 'm the market expert. but you already knew that. i've done some research. 
we have we have been doing research in a usability lab where we observed users operating remote controls. we let them fill out a questionnaire. we had one hundred of these test subjects. in addition we did some market research. see what the market consists of. what ages are involved. these are three quite astonishing results,. remotes are being considered ugly. f seventy five percent of the people questioned indicated that they thought their remote were was ugly. and an additional eighty percent indicated that they would spend more money on a fancy -looking remote control. so in addition remotes were not very functional. fifty percent of the people indicated they only loo used about ten percent of the buttons on a remote control. and fifty percent of the people indicated that their remote tended to get lost in their room. some things. then we did some research to the most relevant functions. channel selection and volume selection both got a ten on a scale of one to ten for relevancy. the power button got a nine. and teletext got a six and a half. so these are the most important functions on a remote control. then there are some one -time use function. that's what i like to call them. that audio settings, video settings, and channel settings buttons. which are not really used very frequently, but are still considered to be of some importance. channel selection was also indicated to be used very frequently. one hundred and sixty eight times per hour. then these are the this is the market. sixty percent of the market consists of users between the ages sixteen and forty f six. main characteristic of this group is that they're very critical on the remote control. they like to use new f new functions. but they also are very critical. they won't spend their money very easily. the users of forty six to sixty five years cons the make up forty percent of the market. they are not really very interested in features. but they do tend to spend their money a lot easier. 
what this indicates for our design. we should make a remote for the future. and this means we would have to focus on the age ages sixteen to forty five 1 . this also makes up most the biggest part of the market, so that will also be where our main profit would be gettable. this would mean we would have to make a fancy design. the results also indicated that about one quarter of the people questioned thought that the remote control caused r_s _ r_s_i _. this is certainly something to take into account. and thirty four percent thought that it was hard to learn a n how to operate a new control, remote control. so these are two factors that should be included in the design. besides that the remote must look very. and the functionality as a lot of people indicated, they only use about ten percent of the buttons, we should make very few buttons. this will also be beneficial to the design of the remote. the most frequently used buttons should be emphasised. especially the channel selection and audio selection buttons. they're used most and so they should be robust. they shouldn't break down easily. then as mo as a lot of people indicated that their remote got lost in the room, it might be and i say might be because it would certainly boost the production costs a lot. but it might be a good idea to make a docking station. and this would, could get a button in it which would send a signal to the remote which would then beep. so you'd know where it is in the room. and in addition to this it could recharge the batteries in the remote if you put it in. then a surprisingly great deal of people w indicated that an l_c_d _ screen in the remote control would be preferred. this was mostly people in the age of sixteen to twenty five. but up till forty five it remains feasible. this would also greatly increase the production costs but these are just some small factors we could consider. that would be all. product_manager: anybody have any questions until now? 
marketing_expert: any questions? product_manager: about functional requirements? user_interface_designer: mm -hmm. industrial_designer: no. product_manager: that's clear. now to the second. user_interface_designer: i've been looking at the user interface of it. f for the techno f functions of it. product_manager: you can take your time. we've got plenty of time, user_interface_designer:? marketing_expert: you should go to the top thingy. slide show. product_manager: there it is. user_interface_designer: we must use the general functions of the remote control. i've do i've done a little research on the internet and not much information about it, about interface but i 've been thinking about a simple manner to put a lot of functions in one remote control. so you 've got a lot of devi devices like d_v_d _ television, stereo. so but it must be user -friendly. you c you can't put a lot of functions in one. in one remote control. product_manager: one remote. user_interface_designer: but got many functions in one remote control, you can see, this is quite simple remote control. few buttons but this re remote control got a lot of buttons. people don't like it, so what i was thinking about was keep the general functions like they are. like the onoff button. keep it l like a red button. everybody knows it so you don't have to change that. my personal preferences. use a display for specific functions of the different device. wh what i was th thinking about was you've got this the remote control and you got here the general functions, like the on -off button sound and here you've got a s a display. it's a touchscreen. you got a general f the functions of the device for a d_v_d _ player or so the pl f for playing reverse. and you got here real buttons for selecting a device. this button is for a d_v_d _ or for every device you've got a f a b a part display of a part buttons. you never got all the buttons on w one device. that's my idea about it. and let's see. so a touchscreen. 
and th the buttons the real buttons we have to use. we better c use quite large buttons for everybody have to use it so ol even old people young people. we must keep buttons quite s simple and quite large.. that was my part of it. product_manager: anybody has questions about the technical functions? industrial_designer: if we are gonna use a touchscreen we're gonna go way above the twelve and a half euros. user_interface_designer: n i don't. you got quite a cheap touchscreen. s it's not in colour. product_manager: touchscreen. user_interface_designer: it's just one colo i seen w something on the internet not today but a few weeks ago. you got quite an a touchscreen and it's for twenty euros or less. so it's possible. product_manager: that's. marketing_expert: it would certainly make a fancy design. industrial_designer: but the it wouldn't be very robust. it's very fragile and you can get scratches on it. marketing_expert: that is true. user_interface_designer: that's true. product_manager: maybe we can first listen to your presentation? marketing_expert: we would have to look into that. product_manager: and then we have a little discussion about the requirements and design. industrial_designer: that's. product_manager: it's going to it's not too much. industrial_designer: i've got a presentation about the working design. first about how it works. it's really simple. everybody knows how a remote works. the user presses a button. the remote determines what button it is, uses the infrared to send a signal to the t_v _. the t_v _ switches to the frequency, or what function it is. so we've got the plate. it gots conductive disks for every button. when the user presses a button, a signal got sent, goes to the led and transmits tranmi transmits its to the t_v _. it's a very simple device, technically speaking. this is a schematic overview. you've got the buttons. the power source. and when a button gets pressed, its goes to the chip. 
the chip controls the infrared bulb and perf perhaps a normal bulb. when you press a button you can actually see your pressed button. we should use default materials, simple plastics. keep the inner workings simple, so it's robust. we should focus on aesthetics, the design and the user interface, because if you're going to use high -tech materials the price is going to go sky -high. and you only have to design a remote once, and if you use high -tech materials it come back in every product. it's, in my idea, it's gonna be smart to invest in di in design and not in the product itself. that's it. product_manager: now i hope everybody has a little bit more insight in the functions we all have and what we are doing right now. i'm the project manager so i'm here to mess things up and tell you some new requirements. that's, we've got to design a remote which is only suitable for t_v _. that's because it will be too complex and the time to market will be too big, if we wanna have it for more functions. so it has to be simple. another point is we have to skip the teletext, because in the world of upcoming internet we think teletext is going to be a thing of the past. and it's a function we don't need in our remote control 2 . internet is also mentioned in a function we can use. maybe also on televisions it will be available as. another one is the customer is forty plus. that's the market we have to target, because we are going to develop a new product which is specially designed for the younger customers. this is a bit pity for the marketing expert. because he was aiming on the younger persons. so we have to find a market which is above forty plus but which will suit our remote control, and the other way round. and we have to be very attent in putting the corporate image in our product. so it has to be visible in our design, in the way our device works. and we have to be very clear on this point as. i suggest let's have a discussion on the control functions. 
marketing_expert: is there any discussion possible about the new product requirement? product_manager: we can see if we can find a way between the functions we wanna use and the market we wanna reach with our product. marketing_expert: you're saying that teletext is gonna be an old feature and it's not gonna be used anymore anyway pretty soon. and new t_v_s will have internet access on them. but if you're targeting people of forty plus, the chance that they will have a t_v _ with internet access within the next like twenty years is very slim. in addition people indicated that teletext simply was an important feature for the remote control. so it's pretty dumb to put no teletext feature on it. i 'm against it. product_manager: against the no teletext? marketing_expert: besides that, the market for forty plus is like pretty small. but if i s if i see this, it's we're just gonna go for another product_manager: it's it is user_interface_designer: forty product_manager: standard remote. marketing_expert: pretty product_manager: no we can marketing_expert: and not innovative product_manager: we can do a lot with the design and the simple buttons marketing_expert: remote control. product_manager: which were also mentioned. if we put a lot of effort in those, we can make a remote control with just two or three buttons. or just a remote which is suitable for the market we wanna reach because it is forty percent of the market. if you look in holland at the whole generation of forty plus, fifty plus, it's the biggest share of the whole population now. marketing_expert: yes but it's not the biggest part of the market. product_manager: no. marketing_expert: and besides that, they're not very critical so they don't really care what the remote control is like. they'll just take the first thing they see and which looks acceptable. 
product_manager: but don't you think that if we make a remote which is typically made for this market, that people think the people think that's the device i 've looked for although i didn't realise it. let's try it. marketing_expert: that would be the case in the sixteen to forty five age category. because they are critical and they want to have a fancy remote control. people of forty plus, they want it to work, but as soo as soon as it works it's with them. industrial_designer: that if we're if we put our marketing right we can sell this just like i if you 've heard about it in the news, the elderly mobile phone? product_manager: it's a big success. industrial_designer: if we make a remote control just l with that idea in mind, we could make tons of money,. product_manager: very big success. marketing_expert: i haven't heard of it. product_manager: so as. industrial_designer: we don't have to focus on the design then but on functionality. we just change our focus on the project, and we can sell this. product_manager: i simply think that the new products we are gonna make, spef specifically design, are designed for younger people, so maybe we can focus ourself on the elderly people. and we have to see what requirements we need for those remote controls. what you told is the channel selection is important. volume selection, power and teletext. marketing_expert: but the board tends to disagree. product_manager: we haven't voted yet, teletext can be a function as. but only if it won't higher the cost, because i if it will be a lot more money to implement teletext as, but i don't think it will be a problem. or is teletext a user_interface_designer: but deaf people need teletext for subtitles. so it's marketing_expert: also. product_manager: i suggest marketing_expert: it'd definitely be a bad idea not to include teletext. user_interface_designer: it's product_manager: is anybody really against teletext? no? just that, that we just keep the teletext 3 . 
that's a good idea as, especially for the subtitles. maybe we can make that another point of advantage in our remote control, if we make a k a button ex for big subtitles, which is instantly on the remote control. for elderly people they can think, i wanna have subtitles, and they push the button and they get the big subtitles. industrial_designer: that's a good idea. product_manager: teletext can v can be very useful in our advantage. functionality should be few buttons, you said. that's very important we have a few buttons. marketing_expert: mm -hmm. product_manager: so to keep it simple. marketing_expert: but i don't think that's really an issue any more might be. user_interface_designer: if it's only for televi marketing_expert: but it, if it's only for t_v _ you're not gonna need a lot of buttons anyway. you need a one to zero button, next channel, previous channel, volume up, volume down, and some teletext buttons but product_manager: but do you need user_interface_designer: so we can s we can skip the display, marketing_expert: if you only l user_interface_designer: so we don't need it. product_manager: but do you need the buttons for one to zero. marketing_expert: nah. product_manager: maybe c we can marketing_expert: think if you're gonna include teletext you do. many people like to use that. product_manager: maybe we can use marketing_expert: if you should, if you want to switch from channel one to like thirty five, you don't wanna push the next channel button thirty five times. product_manager: no, maybe we can implement the scroll button? or a joystick like? there are other ways too. just look if you look at telephones. the sony telephone has a scroll button which is very useful in searching names or marketing_expert: that's true but i don't think there are many t_v_s that can switch channels that fast. and so you would need like the t_v _ would need an a function where you can actually view all channels and scroll through it. 
and if many channels would do have that. if many t_v_s have that. industrial_designer: and besides that it's if we're gonna focus on elderly people they'll have to adapt. they're not used to using scroll buttons. so perhaps we should s stick to the basic layout 4 . product_manager: the numbers. they can see how much buttons there are going to be on the display, and if it's too much we can reconsider it. but there won't be very much buttons. or there don't have to be a lot. marketing_expert: but i don't think if you're gonna make a remote control only to operate a t_v _, you there's not much you can gain on having as few buttons as possible. there are pretty many remote controls that can only operate a t_v _, which already only have the minimum number of buttons. i don't think there's much to be gained in that area. product_manager: the number of buttons? it's very important in the design. you can make a very fancy design with putting the buttons on the right places. and if you have less buttons you can do a lot more with marketing_expert: that is true but there's simply not much to gain on the competition when you're making a remote control only for to operate only the t_v _. product_manager: to operate only the t_v _. marketing_expert: if you have a remote control only to operate a t_v _ there's simply not a lot of buttons required. there's not a lot of functions required so most existing remote controls simply don't have a lot of buttons either. user_interface_designer: no. product_manager: so. marketing_expert: it would be very hard to actually gain on the competition here. that would that would cost a big marketing expedition product_manager: marketing_expert: which was one of the arguments to make it only for the t_v _ because we didn't have the time to market a lot. product_manager: you suggest we could better focus on the docking station. like other functions. instead of f of less buttons. 
marketing_expert: maybe., mean we need a good way to position all the buttons and but i don't think we should spend very much time in that. product_manager: no. do you think the docking station will is allowed in the budget we have? industrial_designer: it should be possible yes. if it's not too fancy. product_manager: it can be industrial_designer: and if the remote stays rather small, it should be possible product_manager: because that's that's a good advantage point as. if we have a fancy -looking docking station industrial_designer: yes. product_manager: or very that's a requirement. docking station. industrial_designer: we're just gonna focus on the extras? product_manager: so. marketing_expert: maybe we should do some research into what elderly people like to have in a like to have extra in a new remote control. product_manager: that's a good point. you said they easily get lost as. marketing_expert: fifty percent of the people indicated that remote control tended to get lost. product_manager: so maybe we should implement the audio sign,. marketing_expert: that was what i suggested. industrial_designer: like with your key -chain, if you whistle it goes it makes a sound. marketing_expert: you have it on user_interface_designer: hm. marketing_expert: you have it's on some phones too, which have a docking station. and you just press a button and the phone goes ringing. product_manager: so marketing_expert: so where it is. product_manager: audio signal should be possible as. it's not too expensive. 5 another point is the l_c_d _ screen. i if that will rise the cost too much, because industrial_designer: y i we'll have to choose between the docking station or the screen, 6 product_manager: it will be too much as. 
industrial_designer: it's marketing_expert: since a lot of people indicated that a new remote control is hard to learn, and we're focusing on elderly people here which tend to have a hard time understanding new devices, it might be a good idea to have just a little screen on it, which would explain a button if you press it. which would tell you what it does. and it wouldn't have to be touchscreen or a very expensive screen, product_manager: based. just the l_c_d _. just the normal screen. marketing_expert: just a small screen product_manager: that's a good idea. marketing_expert: with two product_manager: some extra info. feedback. that's a good idea as. marketing_expert: but if that would product_manager: as the small screen. marketing_expert: that would fit into the costs. product_manager: extra button info. that should be possible as. let's see what did we say. more. should be fancy to, fancy design, easy to learn. few buttons, we talked about that. docking station 6 , l_c_d _. general functions and default materials. that's a good idea as, because elderly people don't mind if it's a titanium cover or just a plastic one. so that doesn't really matter. we nee marketing_expert: probably elderly people would be a little bit more careful with their remote controls than youngsters. product_manager: let's specify the target group. because are we talking about elderly people or people from forty to eighty. because what we're going to design now is for people above sixty, maybe. what do we want? if we want a with the telephone for elderly people, we can target the real elderly people. marketing_expert: that would be a if we should do something like that it would be a, it would be really good for the image of the company. and, there would be a good market for it. product_manager: that's the marketing_expert: if we're able to really bring an innovative product. product_manager: now you're talking about sixty to eighty. marketing_expert: the really product_manager: sixty. 
marketing_expert: but i'd have to look into that a little more. product_manager: and different cultures. are we industrial_designer: i don't think they have different television sets in every country. product_manager: we 've got five minutes left just now. small warning. marketing_expert: and with the little screen in it, which explains the buttons. you could we n it would be a lot easier to product_manager: should marketing_expert: adapt it to different cultures. product_manager: in different languages,. or you have to put a language button in it, user_interface_designer: right. product_manager: but that will be a bit unnecessary. user_interface_designer: no. product_manager: it's better to put it on different markets with it all. that's the target. then a few small things. i will put the minutes in the project document folder. what we're going to do for the next meeting is the industrial designer will do the components concept, user interface designer the user interface concept, and the trend -watching. just keep in mind the things we 've said about the target group, requirements, and the trends which are going on. and specific instructions will be sent to you by the personal coach. i for this meeting. and we have a lunch -break now. marketing_expert: that's good. product_manager: that's a good thing.

Example 2:
In this example, our human raters gave mixed opinions on the succinctness, specificity, and faithfulness of our system's outputs, compared with the baseline's.
Output from Baseline System (DialogLED-L): the project manager presented the prototype to the team and discussed the features the prototype contained. the team then discussed the production costs of the remote and the possibility of adding a feature which would make the remote doublecurved or single-curved. the project manager discussed the interior workings of a remote with the team members. they decided to make the buttons on the top, the menu on the bottom, and the parental control. the marketing expert gave an evaluation of the project process and found that the prototype was innovative and innovative and that the team had room to be creative. the user interface designer and the industrial designer presented their prototype, which was evaluated on a scale of 1 ( excellent ) to 8 1 ( very bad ). the team felt that the remote was over budget and that there was not enough room for creativity. they also complained about the energy source and the cost of the batteries. they complained about having to fill out a questionnaire in order to use the smartboard. after the evaluation, the team discussed their initial ideas about the features they would like to include in the remote. they eventually decided to include a feature called the touch-screen, which would allow the remote to be used with only one hand. the remote will be made of rubber instead of rubber. the corporate logo will be incorporated into the design. there will be no feature for the remote except for a feature to make it easier to find when lost in the house. the production cost will be under 12.50 euro. whether to add a feature on the remote or not is dependent on how much the cost is.
Output from Our System: the project manager opened the meeting by stating the agenda and then giving the prototype presentation. 
the user interface designer and the industrial designer presented their prototype design for the remote, which was designed to look like a children's remote. the marketing expert gave an evaluation of the prototype and found that the remote was technologically innovative but was not user-friendly 2 . the group decided to make the remote single-curved or single-shaped. the remote has an lcd screen and is made of a special rubber casing. the designers discussed the features they would like to incorporate into the design, including a feature to make it easier to find when lost. the project manager then led the group in an evaluation in which the group evaluated the project process, finding that they were happy with their design but were unsatisfied with the first meeting. the team then discussed the production costs and what features they should retain or lose in order to stay under the 12.50 euro mark. after the evaluation, the group discussed whether the remote should be single-curved or double-curved. the remote will not feature a teletext button. the corporate logo will be displayed on the remote. the remote will feature a special color. the menu will be covered in a fruity, wood-like material. the buttons will be made in a special colour. the group had problems with the digital pen. the prototype was evaluated on a seven-point scale and received a 7.5. the original design was over budget but was able to meet the target user group. the new design was more suitable for younger users than the existing design was. the change in the color of the remote to use a more fanciful, more technologically innovative design did not satisfy the id c,
Source Text: product_manager: wouldn't wanna be project manager., what we going to do., once again i'm gonna take minutes. so, no presentation for me., first we have a prototype presentation by g _ and g _. afterwards some user_interface_designer: yo. marketing_expert: j _ and j _. 
product_manager: eval evalu industrial_designer: evaluation. product_manager: evaluation user_interface_designer: evaluation criteria. marketing_expert: evaluation. product_manager: s. evaluation crit criteria., in combination with the finance i received a an excel file which we have to fill in later on., you see., and then we must see if we stay under the twelve and a half euro. marketing_expert: interesting.,. product_manager: so, that's a big industrial_designer: mm -hmm. that's gonna be t problem. product_manager: l so let's it we have must, user_interface_designer: some creative product_manager: we must have some time for that because it will be, quite a lot of mathematics. user_interface_designer: product_manager: and after that, an evaluation of the process how we have done it here with the smartboard, with the with our laptops, with the all this. and afterwards, we closing. once again, forty minutes, so let's start. i would g give the word to g _ and g _ for the prototype presentation. user_interface_designer: shall i give a short introduction and then industrial_designer: . product_manager: j _ and j _. user_interface_designer: j _ and j _,. marketing_expert: jane and jane. guys, take it away. user_interface_designer: take it away. industrial_designer: hi. user_interface_designer: this was our first concept. we decided to use a single touch -screen. so, we've worked out this concepts, how to hold it, where to put the buttons and. and, we began with a form of shape, that is easy to hold w in one hand, left or right handed. so, we made i it a little bit less thick and it has some ar artistic meaning. no? this isn't nothing. idea maybe is better., during the meeting i showed you the concept of placing the buttons on top, usable with your thumb, and the menu structure, if necessary, with your other hand, so it's just gonna hold it easily. and it has to be acce accessible with your other hand too,. so we began working out a concept. 
industrial_designer: and as you saw, we would just have the basic remote with the panel l_c_d _ screen., these would be the main buttons, h you could change them later on in your own profile if you want to. but, it's standard they will be delivered with this set -up. we have the more advanced menu setting right here. we have the sub -menus and stu. we made a top, or a front view. just so like you wanna back view. as you can see, this there, there are two weird bumps in it. this is for the added effect of y youth and dynamic. and this is for the artistic effect., what we figured is we'll show you a picture later on you have more b a better idea after that. but, idea is for to stay in balance with these two. and so when you put it on the table, it will just lay down. it won't roll around or. but it will lie more in your hand like an old telephone maybe, or like these old phones. y you may get the idea. so thi this is about how we figured it should be. the s panel we g you would hide with some more rubber layers, like we discussed early on., you would s you wouldn't see the straight panel, but more fluidly and round. user_interface_designer: the panel just goes like this. but the overlaying layer is a little bit curved and. product_manager: no,. industrial_designer: and, in these bumps you could actually put some electronics that would you can make a more thinner design, and that would actually look very,. and, about the colour, what have user_interface_designer: we added that this can be held with your hands for this maximum is om, one and a half centimetres. so, you have room here for your battery and maybe even other electronic chips. s and you can just be the layer of the touchscreen and some have some wires underneath it to make it as thin as possible in the middle for good grip. industrial_designer: f, as colours, do you have the picture in now, this is the idea about the bumps., you can see there's a v a very youthful dynamic exterior. 
it you just want to hold it you are young and dynamic like us. marketing_expert: 's l it's like an easter egg. industrial_designer: it's like an e but this is for children. we we want a more adult version. but, this is like a remote control for children. product_manager: it's called a weemote industrial_designer: a weemote. marketing_expert: weemote. product_manager: weemote. industrial_designer: hey, that's actually a brilliant marketing stand., but marketing_expert: what i w got in mind. industrial_designer: so this actually basic the idea. we we just want to build a more adult vers adult version of this. product_manager: imagine that. marketing_expert: mm -hmm. industrial_designer: and and for colours, we figured starting with basic colours like white or metallic grey. those are the technological colours actually, user_interface_designer: it would be best to appeal to a broad public and make the covers exchangeable, industrial_designer: so it d user_interface_designer: so the young people will buy an orange and a red and blue and a purple, industrial_designer: or blue or whatever. user_interface_designer: but when the o older people go in the shop and they see an orange remote control, it would be less appealing than a white one. and young people, we think, are a little bit more flexible, they think, i'll buy for a couple of euros some noi hip marketing_expert: maybe it's an idea to sell it without a cover, so that you can pick a cover in the shop. user_interface_designer: a cover is necessary, als otherwise you'll just have the l_c_d _ screen. marketing_expert: . user_interface_designer: so, there must be some cheap standard cover, maybe white, marketing_expert: user_interface_designer: that's could comes with it and you can buy, so we can make extra money. product_manager: but you d you mustn't forget that our target aim is younger people. marketing_expert: oui. 
product_manager: we had decided to put some flashy fruity colours in it, and in the survey from milan and paris it came out that the d the older people are more willing to spend money on extra features. so it will be a better idea to have some flashy fruity colours as a standard, user_interface_designer: the other way around, you mean. product_manager: and for the people who really want a more sophisticated, more traditional look, they're willing to pay that. they want they want more luxury, but they have the money to do it and they want to b to buy that. user_interface_designer: mm -hmm. product_manager: so, maybe it's an idea to put that as an extra and not as a standard. industrial_designer: , maybe, perhaps you're right., i would actually agree with this sounds logical. user_interface_designer: . marketing_expert: an another idea., maybe we could develop a cover with wood style. they'll the elder users as. user_interface_designer: a colour of a wood style, a white c and a couple of h hip fruity colours. and lea l delivered standard with a fruity colour, but not too much. industrial_designer: nah. marketing_expert: yes. not not too. user_interface_designer: this is banana and mango, not purple or p orange and yellow. marketing_expert: exactly. product_manager: but, the mai th the standard must be some attractive flashy colours. marketing_expert: or blue or product_manager: not too, but w a little, user_interface_designer: mm -hmm. product_manager: because that's our aim. industrial_designer: li like this. this isn't this isn't too much, is it? user_interface_designer: . no. industrial_designer: i f product_manager: the buttons don't have to be all of industrial_designer: the buttons, marketing_expert: so. industrial_designer: i marketing_expert: except for the buttons it's it could be a standard model. product_manager: it industrial_designer: something like this would be., that's it from us. marketing_expert: it's my time now. 
user_interface_designer: it's my turn. product_manager: the marketing expert. industrial_designer: uh -oh. marketing_expert: during the design life -cycle we made lot of requirements and trend analysis and., now is the time to evaluate our prototype concept to the past requirements. so we are going to evaluate the design according to the past user requirements and trends analysis., we're going to do that with a seven point scale 1 . opening a word document now. one, i have to expla explain something. we have to be consensive about things. so, it has to be a group decision. product_manager: marketing_expert:? product_manager: so we gon we gonna evaluate the marketing_expert: we're going to vote. we product_manager: the thing we saw. marketing_expert:? the prototype. product_manager: just saw. marketing_expert: one. the remote control is designed for people with age below forty. product_manager: seven? marketing_expert: seven is false. product_manager: true. marketing_expert: b one or product_manager: one. industrial_designer: why? marketing_expert: most true? industrial_designer: it's not just designed for people under the age of forty. it's also designed for people above forty. marketing_expert: so industrial_designer: so marketing_expert: so a o one is appropriate? user_interface_designer: no no, a little more in the middle. marketing_expert: or, more like a four. user_interface_designer: no, three or. industrial_designer: i have i've marketing_expert: three. industrial_designer: two or three, because it's not just the qu question is aimed at is it designed for people with age below forty. but it's also designed for people of age above forty. so, marketing_expert: exactly. industrial_designer: i'll say it's about three. user_interface_designer: it will be primary appealing to m minus forty, but also appealing to marketing_expert: but also for,., second. the remote control is beautiful. user_interface_designer: it's marketing_expert: acco according to us, it's one? 
or user_interface_designer: it's the marketing angle on television. we have a wonderful marketing_expert: p s of c you have to be very positive and enthusiastic about your own product. user_interface_designer: it's also fancy then. marketing_expert: the remote control looks fancy. industrial_designer: yes. marketing_expert: one? user_interface_designer: we have a perfect remote. marketing_expert: good. four. the remote control has big, clear channel switching buttons. user_interface_designer: yes. they have to agree but product_manager: yes. industrial_designer: leads to user face,. user_interface_designer: i'm the user interface expert. marketing_expert: daniel., teletext buttons and volume buttons? user_interface_designer: no. product_manager: no teletext buttons. teletext is in the menu. user_interface_designer: you you've different menu. industrial_designer: false. marketing_expert: false? user_interface_designer: and volume is impo marketing_expert: and volume? product_manager: volume is true. marketing_expert: true. big and clear? industrial_designer: the they are big and clear. user_interface_designer: big and clear. product_manager: big and clear. user_interface_designer: but you could make a teletext button six. marketing_expert: hey. user_interface_designer: otherwise, the people who read this are gonna think we have no teletext button. marketing_expert: hide. industrial_designer: but the teletext button., you can ch that's in a menu. marketing_expert: it's it's not industrial_designer: so, it's w, it marketing_expert: it industrial_designer: it isn't entirely unclear, marketing_expert: j industrial_designer: but so, i wouldn't give it a seven. user_interface_designer: no. industrial_designer: i would give it a more a five or a six. marketing_expert: five? industrial_designer: i don i. what do you think, mister project manager? marketing_expert: it's. product_manager: ., i agree. i was thinking very black and white. user_interface_designer: black and red. 
product_manager: j _. user_interface_designer: don't forget to save it. marketing_expert: red. volume. the remote control is easy to be found. user_interface_designer: when we put in fancy colours, product_manager: fruity. user_interface_designer: and industrial_designer: it has these all these fruity colours and it has a strange shape. so, if you so if you have trouble finding it user_interface_designer: but, it's not making any sound, have we deciding? marketing_expert: user_interface_designer: so marketing_expert: but if you put your normal remote control under your bed, or you throw this remote control under your bed, is it better findable? user_interface_designer: it'll make a difference. we have the better re i., so. my remote control's black. marketing_expert: a li little bit maybe? user_interface_designer: a little bit, but. product_manager: we p we can do it glow in the dark. marketing_expert: four? fi product_manager: so, if it's in the dark place, you still see it glowing. user_interface_designer: k. marketing_expert: i user_interface_designer: fo fo five is. marketing_expert: five. it's it's it doesn't really make a lot of industrial_designer: then i'll go for four. because four is between three and also between true and false. user_interface_designer: you're right. marketing_expert: yes, but five is between four and six. industrial_designer: so i'll go for four. product_manager: you must see it as, w according to the other remote controls, there may be there in your t_v _ room, this one will stand out,. industrial_designer: wha marketing_expert: b _. industrial_designer: that's a better question actually. product_manager: exa that's what it's about. marketing_expert: it's user_interface_designer: if your fifteen remotes in a drawer, you find it,? product_manager: if it if this lying on your couch, you're you think what's that for kinda orange thing. so marketing_expert: but but the survey under users was that they really lost it. 
user_interface_designer: that's stupid. marketing_expert: like, no not seeing it, but lost it in the house. user_interface_designer: but when you lost it you're just not marketing_expert: but,. industrial_designer: if i if you see a strange shape lying somewhere, then you'd recognise it as, whoa, that is strange. product_manager: that's our remote control. user_interface_designer: mostly when you lose your remote control, it's under your marketing_expert: ., i agree. industrial_designer: what is that., user_interface_designer: most of times when you lose it you're sitting on it. industrial_designer: so it's marketing_expert: eight, the remote control has fresh, fruity colours. product_manager: true. user_interface_designer: i would call choose two, we decided not to make two f fresh colours, as it would not. marketing_expert: not too flashy. the remote control is made of soft material. industrial_designer: rubber, is soft. product_manager: but not too soft we have decided. user_interface_designer: kinda soft, but not this. marketing_expert: three? product_manager: three,. user_interface_designer: easy to use, product_manager: easy to use. one 2 . user_interface_designer: very afford 2 . marketing_expert: easy to use? product_manager: can it be zero? industrial_designer: i don, it is marketing_expert: top easy to use? 2 it's it's not the most easy to use user_interface_designer: you can do two, because industrial_designer: no. marketing_expert: it can be easier. user_interface_designer: it can be easier. but then you're l industrial_designer: it could. marketing_expert: jus just with ten buttons, that's the easiest. user_interface_designer: but then you'll lose function f, functionality and our fancy look, so. industrial_designer: functional ability. marketing_expert: but the most easy to use is just with one button user_interface_designer: but it is r it is rather easy to use, because you have the primary buttons always visible. 
marketing_expert: on t, but easy n not the most easy to use,. industrial_designer: no, it's it i'll go for two. my vote's on two.

ACL 2023 Responsible NLP Checklist

A For every submission:

A1. Did you describe the limitations of your work?
Limitations Section

A2. Did you discuss any potential risks of your work?
Ethics Section

A3. Do the abstract and introduction summarize the paper's main claims?
Abstract; 1. Introduction

A4. Have you used AI writing assistants when working on this paper?
Left blank.

B Did you use or create scientific artifacts?
2, 3, 4, 5

B1. Did you cite the creators of artifacts you used?
References

B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
C

B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
C

B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
Not applicable. Left blank.

B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
Not applicable. Left blank.

B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.
Section 4 and 5

C Did you run computational experiments?
Left blank.

C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
4, A

C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?
4, A

C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run?
5

C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)?
C

D Did you use human annotators (e.g., crowdworkers) or research with human participants?
Left blank.

D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.?
B

D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)?
Not applicable. Left blank.

D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used?
5, B

D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?
Left blank.

D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?
B

The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.