NewsDialogues: Towards Proactive News Grounded Conversation

Hot news is one of the most popular topics in daily conversations. However, news grounded conversation has long been stymied by the lack of a well-designed task definition and scarce data. In this paper, we propose a novel task, Proactive News Grounded Conversation, in which a dialogue system can proactively lead the conversation based on some key topics of the news. In addition, both information-seeking and chit-chat scenarios are included realistically, where the user may ask a series of questions about the news details or express their opinions and be eager to chat. To further develop this novel task, we collect a human-to-human Chinese dialogue dataset NEWSDIALOGUES, which includes 1K conversations with a total of 14.6K utterances and detailed annotations for target topics and knowledge spans. Furthermore, we propose a method named Predict-Generate-Rank, consisting of a generator for grounded knowledge prediction and response generation, and a ranker that ranks multiple responses to alleviate exposure bias. We conduct comprehensive experiments to demonstrate the effectiveness of the proposed method and further present several key findings and challenges to prompt future research.


Introduction
News, especially hot news, is widely discussed in daily conversations, enabling people to connect to others and engage with the public issues they encounter in everyday life (Swart et al., 2017). However, due to the lack of a well-designed task definition and the scarcity of training data, news grounded conversation has largely been neglected in dialogue system research (Huang et al., 2020; Ni et al., 2021; Thoppilan et al., 2022).
To pursue news grounded conversation, a natural idea is to refer to existing document-grounded conversations. However, there are two major differences. First, as news articles can be long, complex, and time-consuming to read, it is important for the dialogue system to be proactive, which means that it can actively introduce news content during the conversation. Therefore, users learn more about the news, and the conversations are more interactive and in-depth. However, traditional document-grounded dialogue datasets rarely consider this proactivity explicitly, and the conversations are more user-driven. For example, in QuAC (Choi et al., 2018), doc2dial (Feng et al., 2020), and WikiDialog (Dai et al., 2022), systems mostly respond to user questions passively based on the documents. Second, both chit-chat and information-seeking scenarios (Stede and Schlangen, 2004; Choi et al., 2018) are indispensable for news grounded conversation. Users may curiously ask a series of questions about the news details or express their opinions and be eager to chat. However, existing document-grounded conversations mostly focus on a single scenario of chit-chat or information-seeking rather than both. The work of Choi et al. (2018); Feng et al. (2020); Dai et al. (2022) considers the information-seeking scenario, where the user repeatedly asks questions and the agent answers based on the documents. Another line of research focuses more on the chit-chat scenario (Moghe et al., 2018; Dinan et al., 2019), where participants freely talk about specific topics with knowledge from the documents. For real-world applications, both scenarios should be contained naturally.
To bridge these gaps, we propose a novel task named Proactive News Grounded Conversation, which enables dialogue systems to proactively talk about news with humans in a realistic manner. Furthermore, we collect a human-to-human Chinese dialogue dataset NEWSDIALOGUES, which consists of 1K conversations with 14.6K utterances and rich annotations. We include both information-seeking and chit-chat scenarios realistically, and an example is presented in Figure 1. To explicitly model the proactivity, we first annotate the key topics of the news article to summarize its main content. Then, the agent can actively lead the conversation based on these topics, as in the 1st and 7th utterances in Figure 1. In addition, we carefully annotate the grounded knowledge of each agent utterance, including the target topic and knowledge spans, for a more informative conversation. The major differences between our NEWSDIALOGUES and other document-grounded dialogue datasets are summarized in Table 1.
To further solve the problem, we propose a simple yet effective method, Predict-Generate-Rank, which consists of a generator for grounded knowledge prediction and response generation, and a ranker that ranks multiple candidate responses to alleviate the exposure bias problem (Zhang et al., 2019; An et al., 2022). We conduct comprehensive experiments based on state-of-the-art pre-trained language models and dialogue models. Both automatic and human evaluation indicate that our method achieves substantial improvements over several baselines on NEWSDIALOGUES. Finally, we analyze the major limitations of current models to facilitate future research.
The main contributions are as follows.
• We propose a novel task named Proactive News Grounded Conversation, aiming to empower dialogue systems to proactively talk about news with humans.
• To further develop this task, we build NEWSDIALOGUES, which consists of 1K dialogues with 14.6K utterances and rich annotations.

• We propose a simple yet effective method named Predict-Generate-Rank and conduct comprehensive experiments to demonstrate its effectiveness.


Related Work

Document-Grounded Conversation. One line of research focuses on the information-seeking scenario, where dialogue systems help users gather information through conversations (Choi et al., 2018; Reddy et al., 2019; Campos et al., 2020; Qu et al., 2020). Different from traditional question answering systems, the conversation context empowers dialogue systems to address open-ended and exploratory questions that need in-depth discussion (Dai et al., 2022). To pursue more interactive conversations, Feng et al. (2020); Guo et al. (2021); Wu et al. (2022) introduce clarification questions, which means that agents can also ask questions when user queries are under-specified. Though helpful for information-seeking needs, these dialogue systems lack chatting ability.
We propose news grounded conversation, which has been neglected in previous research but is indispensable in our daily conversations.In addition, both chit-chat and information-seeking scenarios are considered realistically.
Proactive Dialogue System. The proactivity of dialogue systems has been an open challenge. Previous research models proactive topic transitions based on well-designed knowledge graphs (KGs) (Wu et al., 2019; Liu et al., 2020). However, KGs are hard to construct and have limited coverage of real-world knowledge (Razniewski et al., 2016). To explore topic connections, Sevegnani et al. (2021) propose the one-turn topic transition task and collect a crowdsourced dataset, OTTers. More recently, Cai et al. (2022) use reinforced self-play to train a teacher bot, which can actively convey knowledge during the conversation. However, they encourage token overlap between the generated responses and the grounded documents rather than proactive topic transition.
We propose proactive dialogue generation based on news articles rather than KGs.Specifically, we aim to empower dialogue systems to lead the conversation based on some key topics of the news.To this end, we build NEWSDIALOGUES, including 1K multi-turn dialogues.

Proactive News Grounded Conversation
We propose a novel task named Proactive News Grounded Conversation. As shown in Figure 1, a user converses with an agent based on a given news article in each conversation. The conversation begins with the agent, and during the conversation:

• User is curious about the news and eager to chat. They can freely ask questions or express their opinions and feelings.
• Agent plays the role of a knowledgeable expert.They not only reply to users in a passive way but also proactively lead the conversations based on the key topics of the news.
Following Choi et al. (2018); Dinan et al. (2019); Kim et al. (2022), we introduce an information-asymmetric setting, where only the agent has access to the news article, and the user is eager to learn about it through the conversation. Therefore, the conversation is more open-ended and exploratory, and the agent is more helpful in real-world applications.
Both chit-chat and information-seeking scenarios are contained naturally.

NEWSDIALOGUES
To further develop this task, we collect a human-to-human Chinese dialogue dataset NEWSDIALOGUES.

News Article Collection
We manually collect news articles from Toutiao, a famous news website in China. The criteria for selection are: (1) We prefer hot news, which humans are more eager to talk about; to this end, we select news articles from the hot list in Toutiao; (2) We only collect news articles that do not rely on image information, and leave multi-modality features for future work.

Table 2: Examples of different dialog acts of the agent. We highlight some key words of inform, guide, and answers to unanswerable questions; more details are in Sections 4.2.2 and 4.2.3. We also present the target topic for guide. For reading convenience, we translate the original Chinese to English and omit the dialog history and knowledge spans.

Dialogue Collection
In NEWSDIALOGUES, each dialogue derives from a real conversation between two human annotators, one acting as the user and the other as the agent. The conversation scenario follows the task definition in Section 3, and the annotation processes for the user and agent annotators are as follows.

User Annotator
Utterance Generation. User annotators freely ask questions or express their opinions and feelings. To further investigate their behavior, we also ask them to annotate the dialog acts (Bunt et al., 2010) of their utterances, which are either Question or Chit-chat. Here, chit-chat represents the comments or feelings of users, e.g., "He is so talented and loving!".

Agent Annotator
News Understanding. Before the conversation, the agent annotators carefully read the news article to understand its overview. Then, we ask them to write the key topics of each news article, typically 2-5 short sentences. They can write key topics in their own words or make appropriate modifications to the section titles of the news articles.

Utterance Generation. During the conversation, the agent annotators choose appropriate dialog acts for each utterance. We introduce three acts, and examples are shown in Table 2.

• Chit-chat. Naturally chat with the user without news information.

• Inform. Passively respond to the user based on knowledge from the news article. This act is appropriate when the agent answers user questions or replies to user chit-chat utterances with related news information, as in the fourth and fifth examples in Table 2.

• Guide. Proactively guide the current conversation based on the key topics and knowledge from the news. According to our analysis, this act is appropriate under the following scenarios: (1) at the dialogue beginning, as in the sixth case in Table 2; (2) when the current conversation is relevant to a key topic, and the agent can naturally steer the conversation to the topic, as in the seventh example in Table 2; (3) when the user asks an unanswerable question, and the agent can lead the conversation to a relevant key topic, as in the eighth case in Table 2. Details of unanswerable questions are given below.

Furthermore, we find that almost 10% of agent utterances first inform relevant news information and then proactively lead the conversation. We annotate these utterances with the guide act, and an example is the last case in Table 2.

Knowledge Grounding. When the act is inform or guide, the agent annotator can choose appropriate text spans from the news article and use them to craft a natural and informative utterance. We annotate these spans at the sentence level, and each sentence is called a knowledge span. Additionally, we annotate the target topic when the act is guide. These annotations are beneficial for modularized dialogue generation (Zhou et al., 2022; Shuster et al., 2022), which has shown great improvements in knowledge utilization.

Unanswerable Questions
During the annotation process, we find a large number of unanswerable questions, meaning that there is no direct answer in the news. This phenomenon is common in realistic information-seeking scenarios, because human questions are open-ended and exploratory. Most existing conversational question answering work simply replies to these questions with NO ANSWER (Choi et al., 2018; Reddy et al., 2019; Adlakha et al., 2022). In this paper, we adopt three strategies in order.

• Inform Relevant Information. When there is no direct answer but providing relevant information can possibly fulfill user needs (Wu et al., 2022), as in the fourth example in Table 2.

• Guide Topic Proactively. When there is no relevant information but the agent can naturally steer the conversation to a relevant key topic, as in the eighth case in Table 2.

• Chit-chat. When the above strategies are not suitable under the dialogue context, the agent chats with the user, as in the second example in Table 2.

Statistics
The statistics of NEWSDIALOGUES are shown in Table 3, and there are several noticeable features. First, the news articles are long, which brings a new challenge to dialogue system research. Second, as shown by the statistics of user dialog acts, both information-seeking and chit-chat scenarios are common in NEWSDIALOGUES. The large proportion of user questions (64.2%) indicates that the information-seeking scenario is indispensable for real-world applications. Third, unanswerable questions occupy a large proportion of user questions (1,057 of 4,398). Therefore, it is important for dialogue systems to address these questions properly.

Task formulation
Each conversation is grounded on a news article n with key topics k, and the dialogue system learns to generate a response r based on the dialog history d.In addition, it should also predict the grounded knowledge g, including both the target topic and the knowledge spans for generation, when needed.
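To make the formulation concrete, an instance of this task can be sketched as a small data structure (a hypothetical layout; the field names are ours and not from the released dataset):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    speaker: str                        # "user" or "agent"
    utterance: str
    dialog_act: Optional[str] = None    # e.g. "question", "chit-chat", "inform", "guide"
    target_topic: Optional[str] = None  # annotated only when the act is "guide"
    knowledge_spans: List[str] = field(default_factory=list)  # sentence-level spans

@dataclass
class NewsDialogue:
    news_article: str     # the grounding article n
    key_topics: List[str] # annotated key topics k (typically 2-5 short sentences)
    turns: List[Turn]     # the dialogue history d

# A toy example loosely mirroring Figure 1.
dialogue = NewsDialogue(
    news_article="A baby girl was hit in the head by a corn thrown from the 19th floor...",
    key_topics=["The baby girl's injury", "The police investigation"],
    turns=[
        Turn("agent", "Have you heard the bad news? ...", dialog_act="guide",
             target_topic="The baby girl's injury",
             knowledge_spans=["A baby girl was hit in the head by a corn thrown from the 19th floor."]),
        Turn("user", "Where did it happen?", dialog_act="question"),
    ],
)
```

The system's job is then: given `news_article`, `key_topics`, and the turns so far, predict the grounded knowledge (target topic plus knowledge spans) and generate the next agent utterance.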

Predict-Generate-Rank
We propose a simple yet effective method named Predict-Generate-Rank, a three-stage generation process, as shown in Figure 2.
Task 1: Grounded Knowledge Prediction. The model first predicts the grounded knowledge g for response generation. Specifically, we concatenate the target topic and the knowledge spans as g, and formulate this problem as a language generation task. The objective is the negative log-likelihood:
$$\mathcal{L}_g = -\sum_{l=1}^{L} \log P(g_l \mid g_{<l}, n, k, d),$$
where $g_l$ represents the $l$-th token of g, and L is the total length.
Task 2: Response Generation. Based on the grounded knowledge g, our model learns to generate the response autoregressively. The objective function is as follows:
$$\mathcal{L}_r = -\sum_{t=1}^{T} \log P(r_t \mid r_{<t}, g, d),$$
where $r_t$ denotes the $t$-th token of r, and T is the total length. We use the ground-truth knowledge for training and the predicted knowledge for inference.
Our generator is trained with the multi-task objective $\mathcal{L} = \mathcal{L}_g + \mathcal{L}_r$, following Peng et al. (2021).
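As a rough sketch of this multi-task objective, the two negative log-likelihood terms can be combined as follows (a minimal illustration with toy per-token log-probabilities; a real implementation would obtain them from the generator under teacher forcing):

```python
import math

def nll(token_log_probs):
    """Negative log-likelihood of a target sequence, given the
    log-probability the model assigns to each gold token."""
    return -sum(token_log_probs)

def multi_task_loss(knowledge_log_probs, response_log_probs):
    # L = L_g + L_r: the generator is trained jointly on grounded
    # knowledge prediction and response generation.
    return nll(knowledge_log_probs) + nll(response_log_probs)

# Toy example: log-probs assigned to the gold knowledge and response tokens.
l_g = [math.log(0.5), math.log(0.25)]  # knowledge tokens -> NLL = ln 8
l_r = [math.log(0.5)]                  # response tokens  -> NLL = ln 2
loss = multi_task_loss(l_g, l_r)       # total = ln 16
```

In practice both terms come from the same sequence-to-sequence generator, so a single forward pass over the concatenated target suffices.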
Task 3: Response Ranking. One major problem of the above tasks is the gap between the ground-truth knowledge and the predicted knowledge, which results in severe exposure bias (Zhang et al., 2019; An et al., 2022) for text generation.
In particular, the generated response can be low-quality if the predicted knowledge is irrelevant to the dialogue context. To alleviate this problem, we further introduce a ranking task. Specifically, the generator first samples multiple knowledge candidates and generates responses based on them. Then, a ranker is used to select the best response.
We use a simple strategy to construct datasets for training the ranker. First, we finetune the generator on the training set of NEWSDIALOGUES; then we use this model to sample knowledge and generate responses on the training set, yielding $\mathcal{D} = \{(\hat{g}_m, \hat{r}_m)\}_{m=1}^{M}$ for each example, where $\hat{g}$ is the predicted knowledge and $\hat{r}$ is the response conditioned on $\hat{g}$. For each $(\hat{g}, \hat{r})$, we compute matching scores with the ground truth $(g, r)$: $\Delta_1$ between $\hat{g}$ and $g$, and $\Delta_2$ between $\hat{r}$ and $r$. The responses with $\Delta_1 > \gamma_1$ and $\Delta_2 > \gamma_2$ are set as positive examples, which means both the knowledge and the response are similar to the ground truths, while the other responses are set as negative examples. In this way, we obtain the training set for the ranker; the validation set for the ranker is constructed with the same strategy on the validation set of NEWSDIALOGUES.
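The dataset-construction strategy above can be sketched as follows (the similarity metric behind the two matching scores is abstracted away here; the thresholds mirror the paper's settings of γ1 = 50 and γ2 = 15):

```python
def build_ranker_examples(candidates, gamma1=50.0, gamma2=15.0):
    """Split sampled (knowledge, response) candidates into positives and
    negatives based on matching scores against the ground truth.

    `candidates` is a list of dicts with precomputed scores:
      delta1 -- similarity between predicted and gold knowledge
      delta2 -- similarity between generated and gold response
    A candidate is positive only if BOTH scores exceed their thresholds.
    """
    positives, negatives = [], []
    for cand in candidates:
        if cand["delta1"] > gamma1 and cand["delta2"] > gamma2:
            positives.append(cand)
        else:
            negatives.append(cand)
    return positives, negatives

cands = [
    {"response": "a", "delta1": 70.0, "delta2": 20.0},  # both above threshold
    {"response": "b", "delta1": 70.0, "delta2": 10.0},  # response too dissimilar
    {"response": "c", "delta1": 30.0, "delta2": 20.0},  # knowledge too dissimilar
]
pos, neg = build_ranker_examples(cands)
```

Requiring both thresholds prevents a fluent but wrongly-grounded response (or a well-grounded but poorly-worded one) from being treated as a positive.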
Suppose $\mathcal{R}^+$ is the set of positive examples and $\mathcal{R}^-$ is the set of negative examples; we train the ranker with a contrastive loss that pushes the score of every positive response above the scores of the negative responses. Here $s_r = D_\phi(d, r)$ is the ranker score, and $D_\phi$ represents the ranker, which is BERT (Devlin et al., 2019) in this paper. The input is the concatenation of the dialogue history d and the response r, and $s_r \in \mathbb{R}$ is computed from the representation of the [CLS] token with a linear projection layer. We pretrain the ranker on DuConv (Wu et al., 2019) and KdConv (Zhou et al., 2020) to better capture the relation between dialogue histories and responses; more details are given in Appendix C.
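Since the exact loss formulation is not reproduced here, the sketch below shows one common margin-based instantiation of a contrastive ranking objective (an assumption for illustration, not necessarily the paper's exact loss):

```python
def contrastive_loss(pos_scores, neg_scores, margin=1.0):
    """Margin-based contrastive loss: push every positive score above
    every negative score by at least `margin`. This is one common
    instantiation; the paper's exact formulation may differ."""
    loss, pairs = 0.0, 0
    for sp in pos_scores:      # scores s_r for r in R+
        for sn in neg_scores:  # scores s_r for r in R-
            loss += max(0.0, margin - (sp - sn))
            pairs += 1
    return loss / max(pairs, 1)

# Positive already beats both negatives by the margin -> zero loss.
loss0 = contrastive_loss([2.0], [0.5, -1.0], margin=1.0)
# Positive only barely above a negative -> nonzero loss.
loss1 = contrastive_loss([0.6], [0.5], margin=1.0)
```

In a real setup the scores would come from the BERT ranker's [CLS] representation through the linear projection layer, and gradients would flow back through both positive and negative examples.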
Inference. During inference, the generator first samples k grounded knowledge candidates and generates responses based on them. Then, we use the ranker to select the response with the highest score.
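The sample-then-rank inference procedure can be sketched as follows (the generator and ranker are stand-in callables here, not the actual finetuned models):

```python
import itertools

def predict_generate_rank(generator_sample, ranker_score, history, k=16):
    """Sketch of the inference procedure: sample k grounded-knowledge
    candidates, generate a response for each, then return the
    (knowledge, response) pair whose response the ranker scores highest.

    generator_sample(history) -> (knowledge, response)  # one sampled pair
    ranker_score(history, response) -> float
    """
    candidates = [generator_sample(history) for _ in range(k)]
    return max(candidates, key=lambda kr: ranker_score(history, kr[1]))

# Toy stand-ins: the "generator" cycles through two fixed candidates,
# and the "ranker" simply prefers longer responses.
pool = itertools.cycle([("span A", "short"), ("span B", "a longer reply")])
best = predict_generate_rank(lambda h: next(pool),
                             lambda h, r: len(r), "hi", k=4)
```

Sampling more candidates widens the pool the ranker can choose from, which is exactly what the analysis of the candidate number k examines later.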
Experiments

Baselines

Dialogue Model. We first investigate the performance of dialogue models. Specifically, we finetune the models on NEWSDIALOGUES with only dialogue data: the input is the dialogue history and the target is the ground-truth response. As NEWSDIALOGUES is based on Chinese, we evaluate the performance of Chinese dialogue models, CDial-GPT (Wang et al., 2020) and EVA2.0 (Gu et al., 2022). EVA2.0 has shown state-of-the-art performance on Chinese dialogue generation.
End-to-end Model. We finetune pre-trained language models to predict the grounded knowledge and then generate the response based on it sequentially. The training process is the same as our prediction and generation tasks with a multi-task objective. We evaluate a series of models, including BLOOM (Scao et al., 2022) (multilingual GPT), mBART (Tang et al., 2020) (multilingual BART), mT5 (Xue et al., 2020) (multilingual T5), Chinese GPT (Zhao et al., 2019), Chinese BART (Wang et al., 2022), and Chinese T5 (Wang et al., 2022).

Implementation
We randomly split NEWSDIALOGUES into train / validation / test sets with a ratio of 8 : 1 : 1; the numbers of dialogues are 800, 100, and 100. For our Predict-Generate-Rank model, we use Chinese T5 (Wang et al., 2022) as the generator and Mengzi-Bert-base (Chinese BERT) (Zhang et al., 2021) as the ranker. γ1 and γ2 are set to 50 and 15, and the candidate number k is set to 16. More details are shown in Appendix C.
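A minimal sketch of the 8 : 1 : 1 random split (the seed is our own choice, only to make the sketch reproducible):

```python
import random

def split_dataset(dialogues, ratios=(8, 1, 1), seed=42):
    """Randomly split dialogues into train/validation/test sets
    according to the given ratios (8:1:1 by default)."""
    items = list(dialogues)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# 1,000 dialogues -> 800 / 100 / 100, matching the paper's split sizes.
train, val, test = split_dataset(range(1000))
```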

Automatic Evaluation
Metrics. We adopt BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and Distinct (Li et al., 2016) for the evaluation of response generation. In addition, we compute the Topic F1 score to evaluate topic prediction and the word-level F1 score for knowledge span prediction (Span F1), as in Choi et al. (2018).
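For reference, a word-level F1 of the kind used for Span F1 can be computed as below (following the standard conversational QA convention of Choi et al. (2018); for Chinese, the tokens would typically be characters or segmented words):

```python
from collections import Counter

def word_f1(prediction, reference):
    """Word-level F1 between a predicted and a gold text span:
    harmonic mean of token-overlap precision and recall."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# 5 of 5 predicted tokens match (P = 1.0); 5 of 8 gold tokens covered (R = 0.625).
score = word_f1("the baby girl was hit", "a baby girl was hit in the head")
```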

Results.
As shown in Table 4, dialogue models perform less competitively than other models. The reason stems from the lack of news information, which is indispensable for NEWSDIALOGUES. In addition, dialogue models show the best diversity; we conjecture this benefits from pre-training with large-scale conversation data, which contains abundant topics. For end-to-end models, BART performs poorly as it uses absolute position embedding with a maximum length of 1024, which is not sufficient when the news article is long. T5 models with relative position embedding and BLOOM with a maximum length of 2048 can alleviate this problem. The proposed Predict-Generate-Rank improves the performance substantially, except for diversity. We focus more on the relevance between predicted responses and ground-truth responses, which is reflected by the other metrics.

Human Interactive Evaluation
To investigate the performance more realistically, we employ human annotators to converse with different models, with humans acting as users and models acting as agents. As human interactive evaluation has a high cost, we only evaluate the best end-to-end model, Chinese T5, and our Predict-Generate-Rank. More details are shown in Appendix D.

Metrics.
(1) Fluency: whether the response is fluent and understandable. (2) Coherence: whether the response is coherent and consistent with the context. (3) Naturalness: if the response has a target topic, whether the topic transition is natural and appropriate. (4) Knowledgeability: whether the agent is knowledgeable about the news and uses knowledge reasonably. (5) Proactivity: whether the agent is proactive and helps the user understand the content of the news. (6) Engagingness: whether the conversation is engaging and pleasantly surprising. The first three metrics are utterance-level, while the others are dialogue-level. Each score is on a scale from 1 to 3, meaning bad, moderate, and good.
Results. As shown in Table 5, the two models show comparable fluency and coherence, and both are far from perfect. For the naturalness of topic transition, Predict-Generate-Rank performs slightly better. Surprisingly, the human score is only 2.60, which indicates the challenge of natural topic transition.
Regarding the dialogue-level metrics, our model greatly improves knowledgeability and proactivity, which is consistent with the better performance on topic and knowledge span prediction in automatic evaluation. Furthermore, human evaluators feel more engaged when talking with Predict-Generate-Rank. Nevertheless, there is a large gap between current models and humans in many aspects, indicating plenty of room for improvement.

Table 6: Analysis of the candidate number k of Predict-Generate-Rank.

Impact of Ranking
We conduct experiments to investigate the impact of the ranking task. As shown in Table 6, the performance improves when more candidates are generated; the Span F1 score improves by 3.78 even when only four candidates are generated. Our method achieves the best results when k = 16, which is the default setting in this paper. According to our manual check, the ranker helps select more relevant responses, thus contributing to the improvements.

Discussion
Based on the above results, we conclude three major defects of current models. First, they have poor conversation ability, as shown by the low human scores in fluency and coherence. This problem derives from the limited scale of NEWSDIALOGUES, and a possible remedy is pre-training with large-scale conversation data in the general domain. Second, current models cannot use news knowledge appropriately, as shown by the low Span F1 and Knowledgeability. According to our analysis, the reasons are manifold: (1) The grounded news is typically long and complex. (2) Many utterances are contextual, and the dialogue system needs to resolve frequent coreference and information omission (Elgohary et al., 2019) for knowledge extraction. Considering the second utterance in Figure 1, the agent needs to know that "her" refers to the "baby girl" in the first utterance. (3) Rather than answering factoid questions as in most existing QA datasets, the conversation scenario is much more open-ended, and commonsense reasoning ability is necessary. As in the 4th example in Table 2, only when the dialogue system knows the relation between "awake" and "ICU" can it find the knowledge for generation. Third, current models are incapable of natural and proactive topic transitions, as shown by the low Topic F1, Naturalness, and Proactivity. This also stems from the lack of commonsense knowledge and reasoning skills to capture the relations between current topics and relevant topics. This is a valuable characteristic of NEWSDIALOGUES, which is challenging but rewarding for dialogue system research.

Conclusion
In this paper, we define a novel task named Proactive News Grounded Conversation, where the dialogue system can proactively lead the conversation based on some key topics of the news. In addition, we collect NEWSDIALOGUES with 1K dialogues and rich annotations. Furthermore, we propose Predict-Generate-Rank, which consists of a generator trained with a multi-task objective and a ranker trained with contrastive loss. Comprehensive experiments have been conducted to investigate the performance of current models on NEWSDIALOGUES. We hope that our research will spur the development of dialogue systems that are more proactive and knowledgeable in various scenarios.

Limitations
We acknowledge the following limitations of our work.
Limitations of NEWSDIALOGUES. First, we only collect 1K human-to-human conversations with 14.6K utterances due to the high cost of the annotation process (Section 4.2). This brings difficulties for the learning of news grounded dialogue generation. Second, each conversation in NEWSDIALOGUES is grounded on one news article, which may provide limited knowledge for real-world applications. We leave the multi-article grounded setting for future work. Third, as mentioned in Section 4.1, the image information in the news articles is neglected in this version, which requires further exploration.
Limitations of Experiments. Large language models (LLMs) have shown great few-shot learning ability and generation capacity on various tasks, e.g., GPT-3 (Brown et al., 2020), OPT-175B (Zhang et al., 2022), and BLOOM-176B (Scao et al., 2022). It is important to investigate the performance of LLMs on NEWSDIALOGUES, but this has been left out of this work due to limited computational resources. In addition, it is also valuable to investigate the performance of ChatGPT on NEWSDIALOGUES, and we leave this for future work.

Offensive Content
We have taken two steps to avoid offensive content in NEWSDIALOGUES. First, we ask the annotators not to produce offensive content during the conversations. Second, we manually check all conversations after data collection and discard those containing offensive content.

Terms of Use
Upon acceptance, we will provide all the code and the proposed dataset NEWSDIALOGUES, including conversations, annotations for knowledge and topics, and corresponding URLs for the news according to the terms of use of Toutiao. NEWSDIALOGUES is only to be used for facilitating dialogue system research and cannot be used for any commercial purposes.

A Case Study
For reading convenience, we translate the original Chinese conversation to English in Figure 1. The original version is shown in Figure 3.

B Annotator Profile
We employ 30 crowdworkers with equally distributed genders for our annotations. They are all native Chinese speakers aged from 20 to 40 years old. In addition, they are from different regions of China. We pay them a wage above the average in their area. Constructing NEWSDIALOGUES costs 180,000 Chinese Yuan (CNY).
C Implementation Details

For both encoder-decoder models and decoder-only models, the input sequence is the concatenation of the news, key topics, and the dialogue history, and the output sequence is the concatenation of the grounded knowledge and the response. We truncate the input sequence according to the maximum sequence length of the model when it uses absolute position embedding. For the T5-based models with relative position embedding, we set the maximum sequence length as 2048. The maximum sequence length and parameters of each model are shown in Table 7. All generative models follow the same hyper-parameter setting. For training, we set the learning rate as 5e-5 and the batch size as 32, and use the Adam optimizer (Kingma and Ba, 2015) with a warmup learning rate schedule; the warmup ratio is 0.1. Each model is trained for 2K gradient steps, and we choose the checkpoint with the lowest perplexity score on the validation set for evaluation. For generation, we use Top-p sampling (Holtzman et al., 2020) with p=0.9. We run all experiments three times and report the best results in this paper.

Ours. Our generator is trained with the same hyper-parameter setting as above. For the ranker, the learning rate and batch size are 5e-5 and 64, respectively. The optimizer is the same as that of the generator. We set the maximum gradient steps as 20K for the pre-training stage and 10K for the finetuning stage; the checkpoint with the highest accuracy on the validation set is used for evaluation. After processing DuConv and KdConv, we have 257,146 examples for pre-training the ranker, where each example has 1 positive response and 7 negative responses randomly sampled from the datasets. We randomly split these examples with a ratio of 4 : 1 for the training and validation processes of the pre-training stage. For finetuning the ranker on NEWSDIALOGUES, we predict 96 grounded knowledge candidates for each example in the training set of NEWSDIALOGUES and generate responses based on them, finally obtaining 597,504 responses. Then, we construct positive and negative responses based on γ1 = 50 and γ2 = 15; each positive response is paired with 7 negative responses as in the pre-training stage. In total, we obtain 35,159 examples for the training process of the finetuning stage. Using the same method on the validation set of NEWSDIALOGUES, we obtain 2,854 examples for the validation process of the finetuning stage. Our ranker achieves 91.73 accuracy at the pre-training stage and 59.28 accuracy at the finetuning stage.

D Human Interactive Evaluation Setting
We employ 4 human evaluators for human interactive evaluation and collect 40 conversations for each model. Specifically, each conversation is grounded on a news article from the test set of NEWSDIALOGUES and contains at least 10 turns, 5 from the human and 5 from the model. In addition, we also select the 40 conversations with the same news articles from the test set to further investigate the performance gap between humans and current models. In total, we have 120 conversations, which are then distributed to the 4 human evaluators to score from various aspects.

Figure 1: An example of NEWSDIALOGUES. We translate the original Chinese dialogue to English for reading convenience. Notice that some content is omitted as the original version is too long; please refer to the original example in Appendix Figure 3.

Figure 2: The overview of our Predict-Generate-Rank, including a generator trained with a multi-task objective and a ranker trained with contrastive loss.

Figure 3: An example of NEWSDIALOGUES in its original Chinese version; Figure 1 presents an English translation with some information omitted. During the long conversation, the agent proactively steers the conversation to the key topics of the news.

Table 4: Automatic evaluation on the test set of NEWSDIALOGUES. All metrics evaluate the relevance between generations and ground truths except Distinct-2. We list Distinct-2 as a reference for diversity; it is the proportion of distinct bigrams in the total generations and has no relation to the ground truths.

Table 7: The maximum sequence length (MSL) and the number of parameters of each model.