Triplet-Free Knowledge-Guided Response Generation

Generating vivid and informative responses (e.g., comments for social posts and utterances for dialogues) is challenging without access to relevant knowledge. Prior works focus on constructing the "latent" knowledge first and then learning how to "ground" it based on pseudo (context, knowledge, response) triplets. However, retrieving the latent knowledge that underlies a real response is inherently difficult. In this paper, instead of focusing on how to ground knowledge given the responses, we take a different perspective and directly optimize the final responses for given guiding knowledge. This allows us to re-formulate the entire problem in a simplified yet more scalable way. Specifically, we pretrain a response language model (LM) to measure the relevance and consistency between any context and response, then use search engines to collect top-ranked passages to serve as the guiding knowledge, without explicitly optimizing the "best" latent knowledge that corresponds to a given response. The final response generation model is trained through reinforcement learning, taking both the response-LM prior and the knowledge-injection rate as rewards. For better evaluation, we construct a new Chinese benchmark, "IceKC", using fresh multimodal online social posts. Both automatic and human evaluations show that our zero-resource approach performs significantly better than prior works.


Introduction
Response generation, including dialogue utterances and post comments, is a testbed for machine intelligence and has many applications. However, previous AI models tend to output generic and bland responses (Li et al., 2016; Shao et al., 2017), which has led to a number of recent works that leverage external knowledge to improve generation quality in terms of both diversity and informativeness (Zhou et al., 2018; Dziri et al., 2019; Dinan et al., 2018; Moon et al., 2019; Wu et al., 2019; Hayati et al., 2020; Liu et al., 2021b; Komeili et al., 2022). Although impressive progress has been made, major hurdles still exist. Constructing a context-knowledge-response triplet dataset via crowd-sourcing (e.g., Dinan et al. (2018); Komeili et al. (2022)) for training purposes might be too expensive to scale up, and it also departs from the mainstream paradigm of large-scale self-supervised pretraining (Radford et al., 2021; Jia et al., 2021). As expected, prior works (Li et al., 2019b; Lian et al., 2019; Zhao et al., 2020a; Lin et al., 2020) show that models trained on such manually constructed triplets do not generalize to other languages and unseen domains. More recently, zero-resource methods (Li et al., 2020; Chen et al., 2021b; Liu et al., 2021a) have been proposed in which knowledge-aware generation is learned on triplets with either matched or inferred knowledge. However, one critical challenge remains: given a response, the corresponding "knowledge" is extremely difficult to retrieve from a vast knowledge space, especially from Internet search results. As shown in Figure 1, such constructed triplets are unreliable because the corresponding "knowledge" is scattered over the Internet and the retrieval results are noisy: irrelevant sentences can have high overlap with the response. Forcing the generator to associate the response with such pseudo knowledge is unsound. This raises the question: can we relax the requirement of constructing such triplets for training?

Figure 1: A graphical illustration of different zero-resource approaches for generation with knowledge. The triplet-based approach is highlighted in blue; it retrieves the most likely knowledge for training. Our approach is highlighted in orange; it aims to generate knowledge-guided responses directly.
To mitigate these challenges, we introduce a new training methodology that does not require constructing any triplet. Our key idea is inspired by a relaxed causal graphical model in which the conditional probability p(R|C, K) of response R given both context C and knowledge K can be approximated by a lower bound that contains only two priors, p(R|C) and p(R|K). More specifically, given a context, we use search engines to sample a knowledge passage without explicitly inferring or optimizing it; then we use reinforcement learning to optimize the response generation model by jointly considering both prior models as critics that provide reward signals. This allows us to shift the focus from knowledge selection to knowledge-guided generation, pivoting the optimization toward how to generate informative and engaging responses given both context and knowledge. Without loss of generality, to validate our idea, we leverage a pretrained response language model for p(R|C) and a non-parametric LM for p(R|K) in the current experiments, leaving more advanced prior models for future work. To further encourage investigation from the community, we construct a benchmark for Chinese multimodal knowledge-grounded social post commenting, called "IceKC", that facilitates more faithful evaluation. Extensive experiments show that our approach performs significantly better than previous work under a zero-resource setting.
Our contributions are three-fold: (1) We propose a novel zero-resource training strategy for knowledge-guided response generation without the need to build triplets; it is model-agnostic as well as more flexible and easier to scale. (2) We construct a new benchmark from social media posts, which can be used to more faithfully evaluate knowledge-aware multimodal commenting systems. (3) Experiments show that our approach generates significantly better responses than other strong baselines.

Approach
Prior works all focus on first constructing the "latent" knowledge K given both the context C and the response R. However, as argued above, such triplet-based approaches are hampered by the difficulty of knowledge retrieval, which yields noisy triplets (C, K, R) and thus hurts training. In this paper, we take a triplet-free approach by directly feeding sampled knowledge (e.g., from a search engine) into the response generation model p(R|C, K; θ) for any given context. Since for a fixed context-knowledge pair (C, K) there is no corresponding "ground-truth" response R, no supervised learning method applies directly. We resolve this unpaired training problem by leveraging a pretrained response language model that provides distribution-level reward signals.

Triplet-free Knowledge-guided Learning
Mathematically, we achieve triplet-free knowledge-guided learning by optimizing a lower bound of p(R|C, K), inspired by the causal graphical model (Hlaváčková-Schindler et al., 2007; Tuan et al., 2020), from which one can derive the following inequality:

p(R|C, K) ≥ p(R|C) · p(R|K)^α.   (1)

By further taking the logarithm on both sides of Equation 1, we get

log p(R|C, K) ≥ log p(R|C) + α log p(R|K) ≜ r_c(C, R) + α r_k(K, R),   (2)

where r_c(C, R) is defined as the reward that measures how consistent and sensible R is for the given C, while r_k(K, R) measures how much knowledge is injected from K into R. In principle, both can be flexibly defined, e.g., as pretrained language models (LMs) or adversarial discriminative networks. In our current implementation, we use a pretrained LM to model r_c(C, R) and a simple non-parametric LM to model r_k(K, R), which will be discussed in more detail later. Our entire learning objective is then simplified as

max_θ  E_{C,K} E_{R∼p(R|C,K;θ)} [ r_c(C, R) + α r_k(K, R) ],   (3)

which is trained using two separately constructed corpora: a context-response corpus D_cr = {(C_i, R_i)}_{i=1}^n and a context-knowledge corpus D_ck = {(C_j, K_j)}_{j=1}^m, where n and m denote the numbers of training samples.

Figure 2: An overview of our approach. The learning objective is to maximize the total reward obtained from two critics: the consistency reward from a pretrained LM and the knowledge-injection reward from a non-parametric knowledge-injection model.
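As a concrete illustration, the combined reward can be computed as below. This is a minimal sketch only: `r_c` and `r_k` stand in for the pretrained response LM and the non-parametric knowledge model, and their callable interfaces are our own assumption rather than the paper's exact API.

```python
def total_reward(context, knowledge, response, r_c, r_k, alpha=100.0):
    """Scalar RL reward for a sampled response: the consistency critic's
    score r_c(C, R) plus the knowledge-injection score r_k(K, R),
    scaled by the balancing hyper-parameter alpha."""
    return r_c(context, response) + alpha * r_k(knowledge, response)
```

In practice, alpha rescales the (small, bounded) knowledge score so that the two critics contribute on comparable scales.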
Consistency Reward For the sake of optimization efficiency, we first pretrain an LM on the context-response corpus D_cr with an unlikelihood objective (Welleck et al., 2019), i.e.

L(ϕ) = L_MLE(ϕ; D_cr) + L_UL(ϕ; D_cr),

where ϕ denotes the parameters of the LM and L_UL denotes the token-level unlikelihood loss. Once it is pretrained, one can easily use this prior model to estimate p(R|C; ϕ). Because language models generally favor short sentences over longer ones, we further add a length adjustment to the LM to encourage the generation of relatively long responses. Therefore, r_l(R) is defined as the length of the response clipped at a margin m_l, i.e.

r_l(R) = min(|R|, m_l),

where m_l controls the extent of encouragement for longer responses.
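The length adjustment can be sketched as a clipped length bonus; the margin value below is purely illustrative, since m_l is a tuned hyper-parameter (see the length-margin ablation in Figure 3a).

```python
def length_reward(response_tokens, m_l=20):
    """Reward the response length up to the margin m_l, countering the
    LM's bias toward short responses (m_l=20 is a placeholder value)."""
    return min(len(response_tokens), m_l)
```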
Non-parametric Knowledge Injection Reward Different from p(R|C), for which a context-response corpus can be obtained relatively easily, a natural knowledge-response corpus is hard to obtain even manually. One workaround is to design r_k(K, R) with simple heuristics, which is feasible because p(R|K) measures how likely the response R is given knowledge K, in other words, how much of K is injected into R. In our current implementation, following Li et al. (2020), we use the precision-based n-gram matching score BLEU-n (Papineni et al., 2002) clipped at a knowledge margin m_k, since it may not be wise to incorporate all available knowledge in K. Hence,

r_k(K, R) = min(BLEU-n(R; K), m_k),

where α in Equation 3 is a hyper-parameter that balances the consistency reward and the knowledge injection reward. We choose n = 2 based on the heuristic that phrases usually consist of two Chinese characters or English words. Note that, although p(R|K) is not restricted to n-gram-based methods, our empirical experiments show that the n-gram-based score gives surprisingly better results than embedding-based methods such as BLEURT (Sellam et al., 2020), possibly due to inherent flaws in the soft alignment of embeddings. We leave further investigation in this direction as future work.
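A minimal sketch of the non-parametric knowledge reward, assuming tokenized inputs; it computes a clipped BLEU-2-style n-gram precision directly rather than calling any particular BLEU implementation, and the margin value is illustrative.

```python
from collections import Counter

def ngram_precision(response, knowledge, n=2):
    """Fraction of the response's n-grams that also occur in the knowledge
    (with clipped counts, as in precision-based BLEU)."""
    resp_ngrams = [tuple(response[i:i + n]) for i in range(len(response) - n + 1)]
    if not resp_ngrams:
        return 0.0
    avail = Counter(tuple(knowledge[i:i + n]) for i in range(len(knowledge) - n + 1))
    hits = 0
    for g in resp_ngrams:
        if avail[g] > 0:
            avail[g] -= 1
            hits += 1
    return hits / len(resp_ngrams)

def knowledge_reward(response, knowledge, m_k=0.5, n=2):
    # Clip at the knowledge margin m_k: copying *all* of K is not rewarded further.
    return min(ngram_precision(response, knowledge, n), m_k)
```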
Soft Q-Learning Since Equation 3 is challenging to optimize due to the inherently non-differentiable sampling in the generation process, we resort to applying soft Q-learning (Guo et al., 2021).

Post Selection
Annotators are shown a list of Weibo trending topics, which are fresh and updated every second. To stress the role of external knowledge, annotators are instructed to select individual posts whose background information is not fully presented in the post itself. Considering that most posts contain images along with text content, we instruct annotators to select posts with and without images at a balanced ratio so that model performance on both text-only and multimodal posts can be evaluated.

Search Query
After choosing a post, annotators are instructed to use search engines to obtain background knowledge. Specifically, unlike previous work (Zhou et al., 2018; Moghe et al., 2018; Dinan et al., 2018; Komeili et al., 2022), we also consider the rich information conveyed in images.
Annotators can either use a text query or choose an image as a query to retrieve relevant documents via the Baidu search engine or the Baidu image search engine. To explore the effect of images, annotators are encouraged to use images as queries to find relevant documents. Annotators can try another query if they find the retrieved results unsatisfactory.

Knowledge Selection and Comment Generation
During benchmark construction, annotators can click on any of the retrieved documents on the first page to expand it and choose a sentence that they deem appropriate for writing a comment. Unlike previous work that focuses on factual knowledge only, such as using Wikipedia as the sole knowledge source (Dinan et al., 2018; Li et al., 2020, 2022), we do not restrict the scope of knowledge. However, following Moghe et al. (2018), we ask annotators to label knowledge as either factual (e.g., encyclopedia entries) or opinion-like (e.g., movie reviews and comments), because we find that different strategies work differently for different types of knowledge. For instance, a retrieve-rerank-rewrite approach (Cao et al., 2018; Wang et al., 2019) may perform better on opinion-like knowledge. Annotators are encouraged to balance factual and opinion-like knowledge so that model performance can be evaluated on both types. To stress the role of images in commenting, annotators are encouraged to write comments that echo the post's images whenever appropriate.

Statistics
The statistics of IceKC can be found in Table 1.
As far as we know, this is the first Chinese multimodal benchmark test set designed to evaluate knowledge-aware commenting systems. Note again that this benchmark is used only for testing, regardless of how the training set is constructed.

Dataset
We evaluate our approach on our Chinese IceKC benchmark as well as two public knowledge-grounded dialogue benchmarks, Wizard of Wikipedia (WoW) (Dinan et al., 2018) and Wizard of the Internet (WizInt) (Komeili et al., 2022). In evaluation, models are provided with the complete dialogue history and the golden knowledge sentence.
WoW The WoW dataset contains dialogues between two participants, where one of them, the wizard, is provided with potentially relevant knowledge sentences retrieved from Wikipedia passages. The test set is divided into Seen and Unseen according to whether the topic of the test dialogue appeared in the training dialogues. We further remove the few turns where the wizard chooses not to use any retrieved knowledge sentence for the response. After removal, there are 4087 turns in Test Seen and 4125 turns in Test Unseen for model evaluation.
WizInt Similar to WoW, WizInt is constructed from dialogues between a wizard and an apprentice. Different from WoW, the wizard produces Internet search queries, selects appropriate knowledge sentences from across the Internet, and gives a response to the apprentice based on the selected knowledge sentence. In evaluation, we use only the test set of WizInt, which has 1957 turns. However, because Internet-retrieved content is noisy, the golden knowledge is sometimes not accurately segmented into sentences. To avoid cases where useful information is scattered over a long paragraph, we remove any turn whose golden knowledge is more than 50 words long, leaving 1041 turns for evaluation after all preprocessing.
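The 50-word filter reduces to a one-line predicate; the turn dictionary layout below is an assumption for illustration, not the dataset's actual schema.

```python
def keep_turn(turn, max_words=50):
    """Keep a WizInt turn only if its gold knowledge is at most 50
    whitespace-separated words (longer spans indicate bad segmentation)."""
    return len(turn["gold_knowledge"].split()) <= max_words
```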

Baselines
We report the performance of the following methods for comparison:
• BASE: The context-response LM trained for the consistency reward, without using any knowledge.
• ZRKGC (Li et al., 2020): A variational method that learns the relation between response, context, and knowledge from pseudo triplets, where pseudo knowledge is inferred from a pseudo knowledge pool constructed by n-gram matching, built on the Unified pre-trained Language Model (UniLM) (Dong et al., 2019).
• UKSDG (Chen et al., 2021b): An unsupervised method that first retrieves the most likely knowledge from candidates and then leverages knowledge distillation to alleviate the noisy labeling problem.
• KAT-TSLF (Liu et al., 2021a): A three-stage learning framework that retrieves pseudo knowledge from an unlabeled knowledge base and trains the model on such weakly constructed triplets.
• OURS: Our proposed full method.
To shed light on whether zero-shot inference with large-scale pretrained models is sufficient for the task, we also evaluate two large Chinese language models, PANGU-α (Zeng et al., 2021) and EVA (Zhou et al., 2021). Since PANGU-α and EVA only support text input, we use the following prompt for zero-shot inference: "{post text}；根据下面信息进行评论：{knowledge}；评论：" ("{post text}; comment based on the following information: {knowledge}; comment:"), where {post text} and {knowledge} denote the text component of the post and the knowledge to be grounded on, respectively.

Metrics
For evaluation, following Dinan et al. (2018), Li et al. (2020), and Komeili et al. (2022), we adopt the F1 score, implemented following ParlAI (https://github.com/facebookresearch/ParlAI/blob/main/parlai/core/metrics.py). We also report BLEU scores (Papineni et al., 2002) up to 4-grams and ROUGE scores (Lin, 2004). To evaluate the models' ability to incorporate knowledge into responses, following previous work (Li et al., 2020; Komeili et al., 2022), we also report the KF1 (knowledge F1) score, i.e., the unigram F1 score computed using the knowledge as the reference. Although, ideally, a high KF1 score indicates that the model is capable of integrating knowledge into its responses, we also report the human KF1 score as a baseline for comparison.
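The F1 and KF1 metrics both reduce to a unigram-overlap F1; a minimal re-implementation (in the spirit of the ParlAI metric, but not its exact code) looks like:

```python
from collections import Counter

def unigram_f1(pred_tokens, ref_tokens):
    """Unigram F1 between a prediction and a reference. Using the gold
    response as reference gives F1; using the knowledge gives KF1."""
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```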

Implementation
To utilize the generalizability of pretrained language models and smooth the reinforcement learning process, we use BART-base (Lewis et al., 2020) for Chinese and T5-base (Raffel et al., 2020) for English as the backbones of our models. We set α to 100 for the Chinese corpus and 200 for the English corpus to balance the scales of r_c(C, R) and r_k(K, R). Following Guo et al. (2021), we warm up the generator using off-policy updates on real responses for the first 20k steps and then train with both off-policy and on-policy updates. To align the reward, we linearly transform it to be centered around 0 and approximately bounded by ±50, employing two additional hyper-parameters. We use a batch size of 4 and a learning rate of 1e-5 in our experiments. The experiments are conducted on Nvidia V100 and A100 GPUs. For details of the model architecture for multimodal contexts and the implementation, please refer to Appendix B.
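The reward-alignment step can be sketched as a clipped affine transform; `shift` and `scale` correspond to the two additional hyper-parameters mentioned above, and the values in the test are illustrative only.

```python
def align_reward(r, shift=0.0, scale=1.0, bound=50.0):
    """Linearly transform a raw reward so it is centered near 0 and
    approximately bounded by +/-bound before feeding it to soft Q-learning."""
    return max(-bound, min(bound, (r - shift) * scale))
```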

Results
We show the performance of different models on our IceKC benchmark in Table 3. Since PANGU-α, EVA, and ZRKGC are designed for text-to-text generation, for a fair comparison we evaluate the models on the subset of IceKC whose posts contain only text. Evaluation results on WoW and WizInt are in Tables 2, 4, and 5; results on the full WizInt test set are provided in Appendix F for reference. Human evaluation results are in Section 4.6. We make the following observations. First, our higher scores, compared with the low zero-shot performance of large pretrained language models, indicate the necessity of explicitly injecting knowledge into generation. While PANGU-α achieves a slightly higher KF1 score, our approach also considers the consistency between context and response, leading to higher scores while incorporating the same level of knowledge.
Second, our approach outperforms triplet-based models in incorporating knowledge into responses, as shown by our model's higher KF1 score. Triplet-based models struggle with knowledge incorporation because their pseudo knowledge, selected by n-gram overlap, is only loosely associated with the response; the overlap often falls on unimportant characters rather than keywords or topic words, even after strict processing. By optimizing the lower bound instead of directly learning p(R|C, K) from pseudo triplets, our approach overcomes this issue and achieves higher BLEU, ROUGE, and F1 scores. Furthermore, an interesting observation from Table 5 is that finetuned DIALOGPT has a much higher KF1 score but a lower F1 score than finetuned T5, which suggests that simply copying knowledge does not guarantee better performance. Our approach achieves higher scores by selectively incorporating knowledge subject to contextual constraints.
Third, the significant improvement of OURS over the base DIALOGPT model, despite a smaller parameter count, indicates that our training method, rather than model size or initialization, is primarily responsible for the performance gain.

Impact of Knowledge Type and Post Modality
To investigate the impact of different knowledge types and of including images in the context, we conduct ablation studies, presented in Table 6. Our model incorporates more knowledge into generated responses for opinion-like knowledge, as evidenced by the higher KF1 score in that category. This is expected, since opinion-like knowledge consists of users' opinions on events, which can be transformed into responses relatively easily. The effectiveness of knowledge incorporation is further supported by the higher BLEU, ROUGE, and F1 scores on opinion-like knowledge. Furthermore, comparing models with and without images in the posts, the higher scores achieved when incorporating images suggest that the visual modality enhances the ability to comment. The difference in scores is not substantial, however, which could be because the language model trained for the consistency reward is not highly sensitive to images.
Human Evaluation We conducted a human evaluation to qualitatively analyze performance.
Following Li et al. (2020) and Liu et al. (2021a), we randomly sampled 100 turns from WizInt, including the full dialogue history and golden knowledge. The generated responses were then judged on contextual relevance (CR), knowledge relevance (KR), and fluency (FL). The evaluation was double-blind; two well-educated human evaluators assigned scores of 0 ("bad"), 1 ("mediocre"), or 2 ("good"). The results are presented in Table 7. We observe that responses generated by our triplet-free approach are more knowledge-rich while maintaining a similar level of relevance to the dialogue context, suggesting that our approach enables selective integration of knowledge that aligns with contextual constraints. For a more detailed analysis, we provide specific cases in Appendix G.

Related Work
Generating vivid responses, such as comments (Zheng et al., 2017; Qin et al., 2018; Li et al., 2019a; Yang et al., 2019) and dialogue utterances (Huang et al., 2018), is known to be difficult for neural models. To address the problem, previous work uses additional key phrases (Ni and McAuley, 2018), user profiles (Zeng et al., 2019), and images (Chen et al., 2021a) as external knowledge to be grounded on. More recently, unstructured documents (Moghe et al., 2018; Zhou et al., 2018; Dinan et al., 2018), such as those retrieved from the Internet (Komeili et al., 2022), have become prevalent, especially in open-domain settings. Knowledge-grounded generation is generally decomposed into two tasks, knowledge selection (Kim et al., 2020; Zhao et al., 2020b) and knowledge-aware generation (Dinan et al., 2018), with additional search-query generation for Internet-retrieved knowledge (Komeili et al., 2022).
Given knowledge-grounded datasets, supervised approaches (Li et al., 2019b; Lin et al., 2020) have been proposed for knowledge-aware generation. Several unsupervised approaches (Lian et al., 2019; Li et al., 2020; Chen et al., 2021b; Bai et al., 2021; Liu et al., 2021a) have also been proposed. They rely on retrieving the most likely knowledge sentence for a specific response, by either n-gram matching or model inference, and leverage such weak triplets to train models. While triplets can be relatively reliable on human-annotated datasets such as WoW (Dinan et al., 2018), constructing them is generally infeasible for Internet-retrieved documents (Komeili et al., 2022) and leads to poor generalization to other domains (Chen et al., 2021b). Although some approaches (Li et al., 2020; Liu et al., 2021a) built on unreliable automatically constructed triplets avoid degenerate generation, they still struggle to incorporate knowledge, since the knowledge sentences in such triplets are only loosely associated with the responses.

Conclusion
In our study, we examine the effectiveness of previous approaches at utilizing unstructured text for knowledge-grounded response generation in a zero-resource setting, and we propose a novel approach to this challenging task. Instead of adopting a triplet-based approach, which focuses on identifying the specific knowledge underlying a response, we employ a triplet-free approach that aims to generate coherent and knowledgeable responses given appropriate knowledge. Furthermore, we develop the first benchmark specifically designed for Chinese multimodal knowledge-grounded commenting to evaluate the effectiveness of our approach. Experimental results demonstrate that optimizing the lower bound of p(R|C, K), without relying on triplets as learning signals, achieves superior performance compared with existing triplet-based approaches.

Limitations
While we have demonstrated the promising potential of using the lower bound as a substitute for p(R|C, K) in knowledge-grounded generation, several limitations should be acknowledged. First, using a language model (LM) as the learning signal can introduce flaws: the generator may exploit the LM's weaknesses by producing comments that the LM scores highly but that are nonsensical in reality, resembling adversarial samples. In our experiments, generating adversarial text samples proved harder than in vision models, and we did not encounter completely nonsensical comments; however, we did observe the model exploiting LM flaws, as indicated by certain recurring patterns in the generated comments. Second, the hard knowledge injection reward used in this study, an n-gram-matching BLEU score, has better alternatives in principle: a knowledge-grounded comment may have no word overlap with the knowledge instance at all, yielding an n-gram-based score of 0.
Ideally, an embedding-based soft knowledge reward would be preferable for this reason. However, in our experiments, soft knowledge rewards based on methods such as those of Kusner et al. (2015) and Sellam et al. (2020) were easily exploited, as the model learned to echo keywords from the context to obtain a high reward. Third, our approach primarily targets scenarios where well-constructed triplets are not readily available, such as retrieval from the Internet; when pseudo-knowledge construction is highly accurate, as in applications with more limited scope, our approach may not outperform triplet-based approaches. Fourth, our method could potentially be used to generate offensive or prejudiced text. Addressing biases in generative models is a longstanding issue and not the main focus of this work, but the ethical implications can be partially mitigated by integrating our approach with debiasing technologies.

Ethical Consideration
In this work, we introduce IceKC, which aims to facilitate Chinese multimodal knowledge-guided evaluation for future research in knowledge-powered generation. To ensure strict adherence to ethical guidelines, we took several measures during the construction of IceKC and the self-constructed training datasets. All data used for training and testing, including user posts, user comments, and Internet search results, are publicly visible; however, user information such as user ID, user name, age, and geographical location is not collected. We also filter hateful comments using swear-word lists.

Data Privacy
The construction of the IceKC benchmark was approved by an internal review board. We carefully designed the construction and release protocols to avoid any privacy violations of Weibo users. The posts, collected from historical trending posts, are publicly available, but user information is not included in the dataset. We conducted a rigorous review to remove any private user information found within the posts, such as email addresses and phone numbers; toxic and biased texts were also eliminated during this review. To further protect data privacy, IceKC is released under strict terms for academic use only, and users acquiring the data must agree to our academic-use requirements.
Annotators Annotators are Chinese undergraduate students interning at our institution. They are compensated with a monthly salary for their annotation tasks, which include IceKC as well as other duties assigned by the institution. They are well informed about the ongoing research and understand that the curated data will be used for research purposes.

C Post Comment Corpus
We crawl a large number of posts and their corresponding comments from Weibo.Due to the inherent noise in online posts and comments, we perform extensive cleaning and filtering on the raw post comment corpus to provide a cleaner corpus for LM training.
Specifically, following DialoGPT (Zhang et al., 2020), EVA (Zhou et al., 2021), and Pchatbot (Qian et al., 2021), we apply the following procedure: (1) remove posts that contain videos or URLs, have no comments, or are reposts of other posts; (2) remove posts that are votes or carry other external links; (3) remove the reply part (e.g., "@xxx") in comments; (4) remove posts that are too long (i.e., more than 256 Chinese words, using jieba for Chinese word segmentation); (5) normalize duplicated characters to at most three repetitions, e.g., "哈哈哈哈哈哈" ("hahahahahaha") to "哈哈哈" ("hahaha"); (6) reduce the number of duplicates of a comment under the same post to three; (7) remove URLs in comments; (8) reduce the number of duplicates of a comment in the whole dataset to 10,000; (9) remove comments in which 90% of the tri-grams have been seen more than 1,000 times in the whole dataset; (10) remove non-Chinese comments; (11) remove comments longer than 50 words.
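Two of the steps above, stripping reply mentions (step 3) and normalizing duplicated characters (step 5), can be sketched with regular expressions; the exact patterns used in the paper are not specified, so these are illustrative approximations.

```python
import re

def strip_replies(comment):
    """Remove "@user" reply mentions from a comment (step 3; the real
    pattern for Weibo mentions may differ)."""
    return re.sub(r"@\S+\s*", "", comment).strip()

def normalize_repeats(text, max_repeat=3):
    """Collapse runs of a repeated character longer than max_repeat
    down to max_repeat occurrences (step 5)."""
    pattern = r"(.)\1{%d,}" % max_repeat
    return re.sub(pattern, lambda m: m.group(1) * max_repeat, text)
```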

D Post Knowledge Corpus
To construct post knowledge pair corpus for model training in the second stage, we first retrieve web results using search engines and then perform extensive cleaning to remove irrelevant results in the noisy online contents.
Knowledge Retriever To gather relevant knowledge for each post, we employ both text and image queries to search the Internet. The entire text content of the post is used as the text query, while all images serve as image queries. For text search we use Baidu Search, and for image search we rely on Baidu Image Search. It is worth noting that Baidu Search truncates text queries longer than 38 Chinese characters, but posts in our dataset typically do not exceed this threshold. To maintain quality, we only retrieve passages from the first page of Baidu Search, since search results tend to be noisy and subsequent pages usually contain lower-quality content. The retrieved passages are then filtered with specific rules to eliminate merchandise advertisements and other noisy content such as online shopping websites and online novel webpages. Finally, we segment all retrieved passages into sentences using HanLP.
Semantic Reranking The sentences produced by the knowledge retriever remain highly noisy even after the rigorous advertisement filtering, primarily because the relevant information often constitutes only a small portion of the original passage. To eliminate irrelevant sentences, we reranked the sentences obtained from text queries and from image queries separately, scoring each sentence by the cosine similarity between the TF-IDF (term frequency-inverse document frequency) representations of the post text and the sentence. The reranked knowledge sentences from the two channels were then compared to determine whether the post contains well-known, retrievable information: significant semantic overlap between the text-derived and image-derived sentences suggests that the post describes a salient, well-known event. Empirically, results retrieved from image queries are considerably noisier than those from text queries, likely because ordinary images, such as views of lesser-known mountains in unknown cities, lack a relevant knowledge background. We therefore assessed the confidence of both channels and combined them to obtain the final retrieved knowledge instances. Specifically, if the TF-IDF rerank scores of the knowledge instances from both text and images exceed 0.3, the post likely pertains to a public event or otherwise well-known content, and we retained the top 10 knowledge instances by score. In addition, since longer text queries tend to yield more accurate results, we also preserved the retrieved results when the text query exceeded 25 Chinese characters. Finally, to remove low-relevance instances as well as instances with high overlap with the post, possibly because the instance was retrieved from the same source as the post, we discarded all instances with scores lower than 0.2 or higher than 0.6, along with posts whose text is shorter than 15 Chinese characters. Empirically, after these processing steps the retrieved knowledge instances are typically related to the context of the post and serve as a useful external knowledge source.
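The reranking and thresholding logic above can be sketched as follows. The character-bigram TF-IDF is a lightweight stand-in for the paper's TF-IDF representation and all function names are illustrative; only the thresholds (0.3, 0.2, 0.6), the length cutoffs (25 and 15 Chinese characters), and the top-10 cap follow the text.

```python
import math
from collections import Counter

def char_bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

def tfidf_vectors(docs):
    """TF-IDF over character bigrams: a simple stand-in representation."""
    tfs = [Counter(char_bigrams(d)) for d in docs]
    df = Counter(g for tf in tfs for g in tf)
    n = len(docs)
    idf = {g: math.log(n / df[g]) + 1.0 for g in df}
    return [{g: c * idf[g] for g, c in tf.items()} for tf in tfs]

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(post_text, sentences):
    """Score each retrieved sentence against the post text, best first."""
    vecs = tfidf_vectors([post_text] + sentences)
    return sorted(((cosine(vecs[0], v), s) for v, s in zip(vecs[1:], sentences)),
                  reverse=True)

def select_knowledge(post_text, text_ranked, image_ranked):
    """Combine the two channels: keep the top 10 when both channels' best
    scores exceed 0.3 (likely a public event) or the query exceeds 25
    characters; then drop scores < 0.2 (low relevance) or > 0.6 (likely
    same-source overlap), and skip posts shorter than 15 characters."""
    if len(post_text) < 15:
        return []
    best_text = text_ranked[0][0] if text_ranked else 0.0
    best_image = image_ranked[0][0] if image_ranked else 0.0
    if not ((best_text > 0.3 and best_image > 0.3) or len(post_text) > 25):
        return []
    merged = sorted(text_ranked + image_ranked, reverse=True)[:10]
    return [(sc, s) for sc, s in merged if 0.2 <= sc <= 0.6]
```

A production version would compute IDF over a larger background corpus rather than over each post's candidate set alone.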

E Training Corpus Statistics
Statistics of D cr and D ck are shown in Table 9.

F Additional Evaluation Results
Evaluation results of the models on the full WizInt test set are shown in Table 10.

G Cases
Example cases of different approaches on Chinese multimodal posts are shown in Figure 5. Example cases on English dialogues are shown in Tables 11, 12, 13, and 14.

Figure 3: (a) Performance of models with different length margins; (b) performance of models with different knowledge margins. The human KF1 is shown as a dotted line for reference.

Figure 4: Illustration of the architecture of both our language model and the generator.

Figure 5: Cases from the benchmark and the comments generated by different models. English translations of the post, knowledge, and generated comments are provided beneath the Chinese texts. HUMAN denotes the human-written golden comment. Knowledge injected into the comments is highlighted.

Table 1: Statistics of our IceKC benchmark. UTT. denotes the number of comments; QUERY denotes the number of queries, with TEXT and IMAGE indicating text or image queries; DOMAIN denotes the number of unique domains; and LEN. denotes the average length of the comments. "Factual", "Opinion", and "Text-only" denote cases with factual knowledge, with opinion-like knowledge, and without images, respectively.

Table 2: Evaluation results on the WoW test seen split. R-n denotes the ROUGE score using up to n-grams. DIALOGPT b denotes the pretrained DIALOGPT model used for the consistency reward. f t denotes models finetuned on the WoW train set.

Table 3: Performance on the text-only portion of our IceKC benchmark. BL-n denotes the BLEU score using up to n-grams and R-L denotes the ROUGE-L score. Model sizes are reported under PARAM.

Table 4: Evaluation results on the WoW test unseen split.

Table 5: Evaluation results on WizInt. f t denotes models finetuned on the WizInt train set.

Table 6: Performance of our model on the IceKC benchmark under different settings. −I denotes the scores when the images in posts are not fed to the model at inference time. The human KF1 score is reported as HU.

Table 7: Human evaluation results on WizInt samples. The average score, Cohen's Kappa statistic, and the paired t-test p-values are denoted "µ", "κ", and "p". The null hypothesis of each p-value is that the mean scores of the corresponding model and OURS are equal.

Table 8: Comparison between our proposed benchmark and other existing datasets. Crosses and checkmarks indicate whether a dataset can be used to evaluate a specific aspect of the model.

The LM takes image features i_1, ..., i_m and text features t_1, ..., t_n as input, while the generator takes i_1, ..., i_m, t_1, ..., t_n, e_[SEP], k_1, ..., k_l as input, where e_[SEP] represents the embedding of the [SEP] token. In the multimodal IceKC test set, images are resized to 224×224 pixels before being fed to the models, and the patch size is set to 16×16. If a post contains more than 9 images, we keep only the first 9 due to the limited capacity of the BART base model. Because the combined sequence of image patches, text tokens, and knowledge sentence tokens can be very long, we truncate it to 768 tokens. During generator training, we initialize the generator from the LM to accelerate training and facilitate the reinforcement learning process.

Table 9: Statistics of the training corpora. Numbers are obtained by counting turns for English dialogues, except that numbers marked with * represent the number of dialogues. Note that the knowledge and responses in the context-response corpus are not paired.

Table 10: Evaluation results on the full WizInt test set.

Table 11: A case from the WizInt test set. The response generated by DIALOGPT b is empty with top-p decoding. Appropriately incorporated knowledge is in bold; improperly incorporated knowledge is highlighted in red.