Learning When and What to Quote: A Quotation Recommender System with Mutual Promotion of Recommendation and Generation



Introduction
The rise of social media platforms exposes people to more opportunities to share viewpoints (Lee and Ma, 2012; Bakshy et al., 2015). People get to know each other by what they post or say, and the art of chatting on the internet has become increasingly important. Using quotations is a good way to make one's expression clearer, more elegant, and more persuasive (Booten and Hearst, 2016). However, for many individuals, thinking up a suitable quotation that fits the ongoing context is a daunting task. The issue becomes more pressing when quoting in online conversations, where quick responses are usually needed on mobile devices.
To that end, extensive efforts have been made on quotation recommendation, which aims to recommend suitable quotations given the conversation context. Nevertheless, previous work (Tan et al., 2015; Liu et al., 2019; Wang et al., 2020) mostly focuses on what to quote, i.e., ranking the quotation candidates, but ignores the problem of whether or when to quote, which should be an indispensable part of a real-world applicable system: people may not realize that quoting at the right time can enhance their persuasiveness, because of their insufficient knowledge of quotations. We extend the previous quotation recommendation task to a quotation recommender system that consists of three modules: a when-to-quote module to predict whether to recommend quotations, a recommendation module to recommend quotations given the conversation context, and a generation module to generate sentences in ordinary language or quotations to continue the conversation.
To better illustrate the system, Fig. 1 shows an example interaction between a user and the system. The system can generate sentences in ordinary language (e.g., "Well at least it worked") to continue the conversation or recommend proper quotations in more persuasive language, where the when-to-quote module decides whether to recommend quotations or generate ordinary sentences. As we are the first to formulate the quotation recommender system with two new modules (i.e., the when-to-quote prediction module and the generation module), we provide benchmarks for the two newly added modules and propose a unified framework with a novel quotation recommendation module.
For quotation recommendation, previous work employs either generation-based (Liu et al., 2019; Wang et al., 2020) or ranking-based (Tan et al., 2015; Wang et al., 2021) models, where the quotations are regarded as word sequences and labels, respectively. Rather than applying either of them alone, we choose to fully explore the advantages of both types of models and propose a novel framework with mutual promotion of recommendation and generation. Our basic recommendation module is a ranking-based framework, where a pretrained language model is adopted to obtain the context representation. We use the top quotations recommended by the basic recommendation module as pseudo references to enhance the training of the generation module. The motivation for using pseudo references comes from the observation that multiple quotations might be acceptable for one context. Taking Fig. 1 as an example, quotations q_1 to q_3 can all continue the context well. Moreover, multiple references are beneficial for the model to learn diverse generation patterns fitted to different scenarios (Zheng et al., 2018).
Besides, the pseudo-reference-enhanced generation module is further adopted to facilitate the recommendation module by re-ranking the recommended quotation list. We expect that the semantic coherence between context and quotations can be emphasized by the cross-attention in the generation decoder. Specifically, we obtain the posterior probabilities of quotations by feeding them to the decoder and re-rank the top quotations recommended by the basic recommendation module accordingly, which we denote as generative ranking. Rather than employing search algorithms such as greedy search (Wilt et al., 2010) or beam search (Cohen and Beck, 2019), which conventional generation uses because of its unlimited search space, we choose generative ranking by calculating the language probability (Yang et al., 2018), since the search space when generating quotations is limited (the number of quotation candidates is usually fixed). Compared to previous work that relies on beam search for quotation recommendation (Liu et al., 2019; Wang et al., 2020), the generative ranking method does not require any post-processing to match the generated sentences to quotations and shows better performance.
In experiments, our recommendation module outperforms previous work significantly on two datasets (Weibo and Reddit), and performs well on when-to-quote prediction and what-to-continue generation. Extensive experiments show that our mutual promotion mechanism is effective. Further experiments, such as generative ranking vs. beam search, yield a better understanding of the quotation recommendation task.
The main contributions of this work are:
• We propose a novel quotation recommendation framework with mutual promotion of recommendation and generation, which outperforms previous work significantly.
• We extend the previous quotation recommendation task to a complete recommender system with two new targets, when to quote and what to continue, and provide corresponding benchmarks.
• We conduct extensive experiments to show the effectiveness of our framework.

Related Work
Quotation recommendation is in line with content-based recommendation (Liu et al., 2019) or cloze-style reading comprehension (Zheng et al., 2019), which learns to put suitable text fragments (e.g., words, phrases, sentences) into given contexts. Current research on quotation recommendation can be divided into the two categories below.
Quotation Recommendation for Formal Writing. Studies in this category explore various settings, such as quoting famous sayings in books (Tan et al., 2015, 2016) and using idioms in news articles (Liu et al., 2019; Zheng et al., 2019). Tan et al. (2015) propose a learning-to-rank framework, where particular features (e.g., frequency, vote, web-popularity, and quote-rank) are employed. Tan et al. (2016) then first apply neural models, which avoids the time-consuming calculation and extraction of these features. For recommending idioms in news articles, Liu et al. (2019) propose a neural machine translation framework, in which the idioms are supposed to form a pseudo target language and the context is the source language to be translated. Zheng et al. (2019) formulate idiom recommendation as a cloze test task, contribute a new dataset, and provide a benchmark to evaluate the ability to understand idioms.
Quotation Recommendation for Online Conversation. This line of work faces more challenges in modeling the complex interaction and noisy context of online conversations. Lee et al. (2016) combine a recurrent neural network and a convolutional neural network to learn semantic and structural representations. Wang et al. (2020) contribute two datasets for quotation recommendation in online conversations. They propose a seq-to-seq framework for quotation recommendation and employ a neural topic model to obtain the latent representation of the conversation history and thus assist the recommendation. Wang et al. (2021) propose a ranking model where a transformation from queries to quotations is employed to enhance quotation recommendation performance.

Problem Formulation
Input to the System. The input mainly contains the observed conversation context C and the quotation list Q. The conversation C is formalized as a sequence of turns (e.g., posts or comments) {t_1, t_2, ..., t_{n_c}}, where n_c represents the length of the conversation (i.e., the number of turns). t_i (1 ≤ i ≤ n_c) denotes the i-th turn of the conversation, and we use w_i to indicate the word tokens contained in it. The quotation list Q covers all quotations that have appeared in the training corpus. It is represented as {q_1, q_2, ..., q_{n_q}}, where n_q is the number of quotations and q_k is the k-th quotation in list Q.
Output of the System. Conditioned on the observed conversation C, the when-to-quote prediction module outputs a label y^p ∈ {⟨quo⟩, ⟨gen⟩}, where ⟨quo⟩ indicates the need to quote and ⟨gen⟩ means no need to quote. Then the recommendation module outputs a label y^q ∈ {1, 2, ..., n_q} to indicate which quotation to recommend. Finally, the generation module generates an output sequence y^g = {y^g_1, ..., y^g_n} based on the conversation context C and the prediction label y^p.
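The inputs and outputs above can be sketched as simple data structures; the class and field names below are illustrative, not from the paper's code:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Conversation:
    """Observed context C: a sequence of turns t_1..t_{n_c},
    each turn a list of word tokens w_i."""
    turns: List[List[str]]

@dataclass
class SystemOutput:
    y_p: str        # prediction label: "<quo>" or "<gen>"
    y_q: int        # index of the recommended quotation in list Q
    y_g: List[str]  # generated continuation tokens
```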

Quotation Recommendation Framework
BART-based Generation Module. The generation module of our designed model follows a general Transformer (Vaswani et al., 2017) sequence-to-sequence framework. To relieve the data burden and enhance context modeling, we choose to finetune a pre-trained BART (Lewis et al., 2020) model

Figure 2: Our framework for the quotation recommender system, which consists of three modules: a recommendation module to recommend quotations fitting the context, a prediction module to decide when to quote, and a generation module to generate sentences to continue the conversation with ordinary language or quotations.
for our generation module. BART was trained with several denoising objectives on large-scale unlabeled data, and has been shown to be effective in many generation tasks, such as summarization, machine translation, and persona-based response generation. During finetuning, we concatenate the utterances t_i (1 ≤ i ≤ n_c) in context C with appended ⟨eos⟩ tokens in their chronological order as the input, and maximize the probability of the ground-truth target sequence T (T can be a quotation or an ordinary sentence in a real-world conversation). The whole process is summarized as:

H_c = Encoder(w_c),   (1)

p(y^g_k | y^g_{<k}, H_c) = Decoder(y^g_{<k}, H_c),   (2)

L_gen = − Σ_{k=1}^{n} log p(y^g_k | y^g_{<k}, H_c),   (3)

where w_c = [w_1; ⟨eos⟩; w_2; ...; w_{n_c}; ⟨mask⟩], and y^g_{<k} represents the target tokens before y^g_k. It has been proved effective (Schick and Schütze, 2021) to simulate during finetuning the operations conducted in the pretraining stage; thus we add a ⟨mask⟩ token at the end of the context, indicating that the target content should be produced in the position of the ⟨mask⟩ token.
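The construction of the input sequence w_c (turns joined by ⟨eos⟩ separators, with a trailing ⟨mask⟩ token) can be sketched as follows; the function name and the literal token strings are illustrative assumptions, not taken from the paper's code:

```python
def build_context_input(turns, eos="<eos>", mask="<mask>"):
    """Build w_c = [w_1; <eos>; w_2; ...; w_{n_c}; <mask>] from a list of
    turns, where each turn is a list of word tokens."""
    tokens = []
    for i, turn in enumerate(turns):
        tokens.extend(turn)
        # <eos> separates consecutive turns
        if i < len(turns) - 1:
            tokens.append(eos)
    # trailing <mask> marks where the target content should be produced
    tokens.append(mask)
    return tokens
```

In practice the resulting token list would be mapped to ids by the BART tokenizer before being fed to the encoder.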
When-to-Quote Prediction Module. To simplify the framework, we embed the prediction procedure into the generation process.
To that end, we treat the prediction labels as two special instruction tokens, ⟨quo⟩ and ⟨gen⟩, which indicate whether to recommend quotations or to generate ordinary sentences when continuing the conversation. Specifically, to enable our generation module to generate prediction labels, we extend its original vocabulary V with ⟨quo⟩ and ⟨gen⟩, obtaining V′ = V ∪ {⟨quo⟩, ⟨gen⟩}. Our generation module is supposed to first generate an instruction token and then generate a quotation or an ordinary sentence accordingly. Therefore, the objective in Eq. 3 is reformulated as:

L_gen = − Σ_{k=0}^{n} log p(y^g_k | y^g_{<k}, H_c),   (4)

where y^g_0 ∈ {⟨quo⟩, ⟨gen⟩} denotes the instruction token, and y^g = {y^g_1, y^g_2, ..., y^g_n} is the target sequence to be generated.
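The vocabulary extension and the prepending of the instruction token y^g_0 can be sketched as below; the function names and the token-to-id mapping are illustrative assumptions (with a real tokenizer one would add the tokens as special tokens and resize the embedding matrix):

```python
def extend_vocab(vocab, instruction_tokens=("<quo>", "<gen>")):
    """Extend vocabulary V to V' = V ∪ {<quo>, <gen>}, assigning new ids
    after the existing ones."""
    extended = dict(vocab)
    for token in instruction_tokens:
        if token not in extended:
            extended[token] = len(extended)
    return extended

def build_target(target_tokens, needs_quote):
    """Prepend the instruction token y^g_0 to the target sequence:
    <quo> before a quotation, <gen> before an ordinary sentence."""
    return (["<quo>"] if needs_quote else ["<gen>"]) + list(target_tokens)
```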
What-to-Quote Recommendation Module. We adopt the representation of the ⟨mask⟩ token in each conversation to represent the target sentence. The motivation comes from masked language models (Devlin et al., 2019), which use ⟨mask⟩ tokens to replace the original content and then use the representations at those positions to predict the original tokens. We denote the representation of the ⟨mask⟩ token as h_⟨mask⟩. It is then fed into a two-layer MLP for recommendation:

r^q = W_2 · α(W_1 · h_⟨mask⟩ + b_1) + b_2,   (5)

where W_1, W_2, b_1, and b_2 are learnable parameters, and α is a non-linear activation function (we use tanh in our work). The output representation r^q is an n_q-dimensional vector, and the candidate quotation list Q is ranked according to the probability computed based on r^q:

p_r(q = k) = softmax(r^q)_k,   (6)

We then denote the ranked quotation list as Q^r = {q^r_1, q^r_2, ..., q^r_{n_q}} and the top m quotations as Q^r_{1:m}.
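The two-layer MLP scorer over h_⟨mask⟩ followed by a softmax over the n_q quotations can be sketched in plain Python as below (a real implementation would use tensor operations; the nested-list weight layout here is purely for illustration):

```python
import math

def recommend(h_mask, W1, b1, W2, b2):
    """Score quotations from the <mask> representation:
    r_q = W2 · tanh(W1 · h + b1) + b2, then softmax to get p_r(q = k)
    and sort to obtain the ranked list Q^r."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h_mask)) + b)
              for row, b in zip(W1, b1)]
    r_q = [sum(w * x for w, x in zip(row, hidden)) + b
           for row, b in zip(W2, b2)]
    # numerically stable softmax over the n_q scores
    m = max(r_q)
    exp_r = [math.exp(v - m) for v in r_q]
    z = sum(exp_r)
    p_r = [v / z for v in exp_r]
    ranking = sorted(range(len(p_r)), key=lambda k: -p_r[k])
    return p_r, ranking
```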

Mutual Promotion of Recommendation and Generation
Pseudo-Reference-Enhanced Generation. Our promotion for generation is motivated by the fact that there might be multiple quotations acceptable for one given context. Take the instance shown in Fig. 1 as an example: the ground-truth quotation provided is q_1, while q_2 and q_3 are also suitable for that context from the perspective of semantic coherence. Therefore, we propose to use the top m_q predictions (Q^r_{1:m_q}) as pseudo references. We assume that the pseudo references can help the training of the generation module and thus enhance generation performance.
Specifically, for each input context C with a quotation target, we first extract the content of the top m_q quotations (i.e., Q^r_{1:m_q}) from the recommendation module. After prepending the instruction token ⟨quo⟩, they also serve as references for the same context C to augment the training corpus. To distinguish them from ground-truth targets, we add confidence weights to the losses computed on the pseudo instances, where the weights are set to the recommendation probabilities computed with Eq. 6. Therefore, the total objective for the generation module is summarized as:

L_gen = − Σ_{(C,T)∈D} p(C, T) Σ_{k=0}^{n} log p(y^g_k | y^g_{<k}, H_c),   (7)

where D represents the total training corpus including pseudo instances, and C and T are the context and target, respectively. p(C, T) = 1 if (C, T) is not a pseudo instance; otherwise, letting T be the content of quotation q_k, p(C, T) = p_r(q = k) computed with Eq. 6.
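The confidence-weighted loss over gold and pseudo instances can be sketched as follows; the `(token_nlls, weight)` interface is an assumption for illustration, with weight 1.0 for gold targets and p_r(q = k) for pseudo references:

```python
def pseudo_reference_loss(instances):
    """Weighted generation loss: each instance carries its per-token
    negative log-likelihoods and a confidence weight p(C, T); the total
    loss sums weight * token NLLs over all instances."""
    return sum(weight * sum(token_nlls) for token_nlls, weight in instances)
```

Down-weighting the pseudo instances lets the generation module learn from multiple plausible continuations without treating a merely top-ranked quotation as fully trusted supervision.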
Generative Re-Ranking Enhanced Recommendation. The promotion for recommendation is based on a designed re-ranking mechanism, where the ranked quotation list Q^r is re-ranked based on the generative probability calculated by our generation module. The idea comes from our assumption that a well-trained generation module can evaluate the semantic coherence of a given context and target. To that end, we choose to re-rank the top m_g quotations (Q^r_{1:m_g}) produced by our recommendation module. We first feed the generation module with the context C (as input) and each of the top m_g quotations (as target), and then derive the average log-probability for each quotation to compute a generation-based quotation probability, which is added to the original probability (i.e., p_r(q = k)) to form the final re-ranking-based recommendation probability:

p_g(q = k) = (1 / n_k) Σ_{i=1}^{n_k} log p(y^{q^r_k}_i),   (8)

p(q = k) = p_g(q = k) + λ · p_r(q = k),   (9)

where n_k is the length of quotation q^r_k, and log p(y^{q^r_k}_i) is short for log p(y^{q^r_k}_i | y^{q^r_k}_{<i}, H_c), representing the log-probability of the i-th token in quotation q^r_k. λ is a hyper-parameter to control the effects of the two probabilities, and p(q = k) is the final probability for re-ranking. The quotation with the highest probability is recommended.
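The re-ranking step can be sketched as below. The combination used here (score = average token log-probability plus λ times the recommendation probability) is one plausible reading of the description, chosen so that λ = 0 recovers pure generative ranking; the function name and argument layout are illustrative:

```python
def generative_rerank(p_r, token_logprobs, lam=0.6):
    """Re-rank the top m_g quotations. `p_r[k]` is the recommendation
    probability of the k-th top-ranked quotation; `token_logprobs[k]`
    holds the decoder log-probabilities of each of its tokens.
    Returns indices into the top-m_g list, best first."""
    scores = []
    for k, logps in enumerate(token_logprobs):
        p_g = sum(logps) / len(logps)   # average log-probability of q^r_k
        scores.append(p_g + lam * p_r[k])
    return sorted(range(len(scores)), key=lambda k: -scores[k])
```

Averaging over tokens keeps the generative score comparable across quotations of different lengths.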
Joint Training of Recommendation and Generation Modules. Previous work shows that quotation recommendation can be regarded as either a ranking task or a generation task, as the recommended quotation is also used to continue the conversation. This dual identity indicates that we can finetune our model with both recommendation and generation objectives so that the two modules promote each other. The total learning objective is therefore defined as:

L = L_rec + γ · L_gen,   (10)

where γ is a hyper-parameter to control the trade-off between the two losses, and L_rec is the learning objective for the recommendation module:

L_rec = − log p_r(q = q_c),   (11)

where q_c is the ground-truth quotation for conversation C. Alg. 1 in Appendix A depicts the detailed process of our mutual promotion.
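The joint objective can be sketched as a one-line combination of the two losses; the function name is illustrative, and γ = 0.1 follows the paper's setting:

```python
import math

def joint_objective(p_r, gold_index, gen_loss, gamma=0.1):
    """Joint training objective L = L_rec + gamma * L_gen, where
    L_rec = -log p_r(q = q_c) is the negative log-likelihood of the
    ground-truth quotation under the recommendation module."""
    rec_loss = -math.log(p_r[gold_index])
    return rec_loss + gamma * gen_loss
```

A small γ keeps the optimization focused on the recommendation loss while the generation loss still regularizes the shared encoder.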

Experimental Setup
Datasets and Statistics. For the task of what to quote, we employ two datasets, Weibo and Reddit, released by Wang et al. (2020), and adopt the same data split as Wang et al. (2020) for a fair comparison. For when to quote, we augment the two datasets by splitting the history contexts in the original datasets. Specifically, we built two different rules for Weibo and Reddit to detect possible positions in context that might need prediction (please refer to Appendix B for details). The statistics of the datasets can be found in Appendix C (Table 8).
Evaluation Metrics. For recommendation, we adopt popular evaluation metrics, including MAP (Mean Average Precision), P@1 (Precision@1), P@3 (Precision@3), and nDCG@5 (normalized Discounted Cumulative Gain@5), following Wang et al. (2020). For generation, we adopt BLEU, Rouge-1, and Rouge-L to evaluate the generated sentences, also following Wang et al. (2020). For the task of when to quote, we adopt Accuracy, F1, and Recall scores.
Parameter Setting. For Reddit, we use the English BART-Base model released by Lewis et al. (2020) to initialize our model; for Weibo, we use the Chinese BART-Base model released by Shao et al. (2021). Both have a hidden dimension of 768, 12 attention heads, and 6 layers each in the encoder and decoder. The middle dimension of our MLP recommendation layer is also 768. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-4, and the batch size is 64. Dropout (Srivastava et al., 2014) with a rate of 0.1 and L_2 regularization with a coefficient of 0.0003, as well as early stopping (Caruana et al., 2001), are used to alleviate overfitting. We set γ in Eq. (10) to 0.1 to make the model focus more on the recommendation task. The top 5 (i.e., m_q = 5) quotations are used as pseudo references and the top 30 (i.e., m_g = 30) quotations are used for re-ranking. λ in Eq. (9) is set to 0.6.
Comparisons. For our main experiments on quotation recommendation, we compare our model with two simple baselines (RANDOM and FREQUENCY) and several previously proposed methods, including two generation-based models (NCIR (Liu et al., 2019) and CTIQ (Wang et al., 2020)) and three ranking-based models (LTR (Tan et al., 2015), BERT+MLP (Devlin et al., 2019), and TRANSQQ (Wang et al., 2021)). For generation, we also compare with the two generation-based models, as well as TAKG (Wang et al., 2019), another generation model for a similar task. For when-to-quote prediction, we compare against some basic methods, as no previous work explores this task. To save space, we introduce all of them in detail in Appendix D.

Experimental Results
In this section, we first compare the recommendation results of our model with the previous quotation recommendation models in §5.1. Then we report the results of when-to-quote prediction and generation in §5.2 and §5.3, respectively. Finally, further experiments are given in §5.4 and §5.5, together with a case study in §5.6, to provide insights on how our method works.

Main Recommendation Results
We compare our model with the state-of-the-art baselines and conduct an ablation study to show the effectiveness of the designed mechanisms.
Comparison with Baselines. We first report the main comparison results of quotation recommendation in Table 1, which show that our model achieves great improvements on all evaluation metrics compared to the baselines. The improvements mainly come from two aspects: the utilization of a pretrained language model and the proposed mutual promotion mechanism. We discuss this further in the ablation study.
• Neural ranking models perform better than generative models. We can see that the neural ranking models (i.e., BERT+MLP and TRANSQQ) achieve better performance than the generative models (i.e., NCIR and CTIQ), with a larger performance gap on Reddit. This is because the generative models need to predict quotations word by word, and such an autoregressive process might introduce error accumulation and lead to worse performance. Although a post-processing procedure can match generated sentences (which are not always identical to those in the quotation list) to quotations, it can only partially reduce the errors. This is verified by the larger performance gap on Reddit, since the average length of quotations on Reddit is longer than that on Weibo (see Table 8).
Ablation Study. To better show the effectiveness of our model, we conduct an ablation study comparing our full model with four variants (please refer to Appendix D for details). We report the results in Table 2. Some observations can be drawn:
• The PLM contributes a lot, though it is not the sole source of improvement.
• Mutual promotion of the recommendation and generation modules is effective, and removing either of them causes performance degradation. From Table 2, we can see that our model without mutual promotion shows 1.8 and 1.6 MAP degradation on Weibo and Reddit, respectively, which exhibits the effectiveness of the proposed mutual promotion mechanism. We also notice that removing either promotion (i.e., GenPromo or RecPromo) causes performance degradation, and the degradation is smaller than removing both. This further validates that the two promotions are not contradictory and can improve each other.

When-to-Quote Prediction Results
Before recommending, our system is supposed to predict whether to quote according to the context. We report the prediction results in Table 3. Due to the lack of previous work on this task, we list two simple baselines (RANDOM and ALLYES) to reveal how challenging the task is, and a basic model, BILSTM (Zhou et al., 2016), to show the performance of conventional neural networks. Also compared are our full model and the variant without the PLM.
The results in Table 3 show that the simple baselines (i.e., RANDOM and ALLYES) perform much worse than neural-based models, indicating the importance of capturing semantic information from the context. We can also find that our model achieves better performance on almost all metrics (especially when using the PLM) and reaches F1 scores above 80, which validates that it is feasible and reliable to apply such a model for when-to-quote prediction. Another observation is that the prediction results on Reddit are generally better than on Weibo. This is because the Chinese quotations on Weibo are phrases, which are more flexible when used in conversations and thus more difficult for the model to predict.

What-to-Continue Generation Results
We report the generation results from two aspects: results of quotation generation (upper part of Table 4), and results of all sentences regardless of quotation or ordinary language (lower part of Table 4). From the quotation generation results in Table 4, we find that our full model outperforms the previous generation-based recommendation models (i.e., NCIR and CTIQ), which shows our model's stronger ability for quotation generation. On the other hand, the overall results drop a lot when the generated content type is disregarded, especially on Reddit. This is reasonable, since ordinary utterances might contain a variety of content that is difficult to generate, while quotations are relatively fixed sentence expressions that appear several times in the training corpus. Nonetheless, our model still performs better than conventional generation models like BILSTM. The ablation results in both comparison settings ("Quotation" and "All") also show that the promotion of the generation module and the pretrained language model are effective.

Effectiveness of Mutual Promotion
Recommendation-Based vs. Random Pseudo References. To examine whether the pseudo references provided by the recommendation module are effective, we compare them with random references (i.e., randomly selecting m_q quotations as pseudo references). From Table 5, we can find that recommendation-based references show positive effects on the performance, while random references do not show an obvious promotion. We can also notice that the models trained with random references are not sensitive to the value of m_q (the number of pseudo references). Conversely, the number of references affects the recommendation-based promotion a lot, and 5 pseudo references show the best promotion.

Table 5: MAP score comparisons between random references and references Q^r_{1:m_q} ranked by the recommendation module, with varying reference numbers (i.e., m_q = 2, 5, 10, 20).

Table 6: Comparison results (in %) of generative ranking (i.e., our model with m_g = |Q| and λ = 0) and beam search generation for quotation recommendation. † refers to adopting post-processing to match generated sentences to quotations with minimum edit distance. K denotes beam size and T is sequence length.

Generative Ranking vs. Beam Search. Previous methods (Liu et al., 2019; Wang et al., 2020) mainly use beam search to produce quotation recommendations. We argue that this is a suboptimal solution, since the number of quotations is fixed and the actual search space is limited. We propose generative ranking, which ranks the quotations by their posterior probabilities calculated by the generation module. It can be viewed as a special case of our generative re-ranking enhanced recommendation, where we set m_g = |Q| and λ = 0. We report the recommendation results of our generation module with the two methods (i.e., generative ranking and beam search) in Table 6.
It can be seen that our generative ranking performs better than beam search, even after post-processing. Nevertheless, it requires more computation than naive beam search (generally K ≪ |Q| and T ≪ |Q|), as it needs to pass all quotations through the model. This observation is one of the reasons why we propose generative re-ranking enhanced recommendation, i.e., using only the top quotations provided by our recommendation module for the generation module to re-rank, which saves computation.

Analyses on Mutual Promotion
We explore how the hyper-parameters m_q (number of quotations used as pseudo references), m_g (number of quotations for generative re-ranking), and λ (trade-off value for the two probabilities in re-ranking) affect the recommendation performance in Fig. 3.
Effects of Pseudo Reference Number. Fig. 3(a) shows the MAP results and training time per epoch (in minutes) when using different numbers (i.e., m_q) of top quotations as pseudo references. Apparently, increasing m_q results in longer training time. The best MAP score is achieved when m_q equals 5, indicating that too many unrelated pseudo references are harmful to the final results.
Effects of Quotation Number for Re-ranking. We examine how many quotations (m_g) achieve the best re-ranking performance and the corresponding time cost (in seconds) in Fig. 3(b). As can be seen, the MAP scores keep increasing as more quotations are included in the re-ranking until m_g reaches 30, and remain unchanged from 30 to 50. This indicates that the very top quotations provided by our recommendation module already contain the most likely targets, and there is no need to re-rank a longer quotation list, which saves time.
Effects of Trade-off Value in Re-ranking. We also examine how the value of the trade-off parameter λ influences the recommendation results in Fig. 3(c). As can be seen, the results on the two datasets exhibit different trends. On Reddit, the best performance is achieved around the middle of the curve (λ = 0.6), while the performance on Weibo remains unchanged for λ ∈ [0, 0.6] and degrades as λ increases beyond 0.6. We attribute this to the fact that the quotations on Weibo are easier for the generation module to predict (validated by the much better BLEU scores achieved on Weibo in Table 4). Thus the generative re-ranking on Weibo is more reliable than on Reddit.

Case Study
We use one cherry-picked example to show the distribution differences of quotation probabilities before and after the generative re-ranking in Fig. 4. We can see that among the top 5 quotations predicted by the recommendation module, the ground-truth quotation Q^r_3 ranks third, and it receives the highest probability after the generative re-ranking. This might be because the generation module can capture the semantic coherence between the context and the quotation (e.g., "entertaining" and "waste" in the context vs. "enjoy" and "wasting" in the ground-truth quotation) via cross-attention, while the recommendation module treats quotations as discrete labels and ignores that kind of information.

Conclusion
This work explores a realistic recommender system that recommends quotations and provides when-to-quote prediction and what-to-continue generation. We provide benchmarks for the two newly added tasks and propose a mutual promotion mechanism for quotation recommendation. Experiments show that our method can promote both generation and recommendation and achieves the best quotation recommendation performance.

Limitations
A key limitation of this work is that our model cannot handle the cold-start problem, because we adopt a fixed-size MLP layer in the basic recommendation module. Though the quotation candidate list is relatively static compared to other conventional recommendation scenarios (e.g., news or product recommendation), the cold-start problem for quotations is still a good point to explore in the future. Another limitation lies in the evaluation metrics. We only adopt traditional correctness-based evaluation metrics, which would not be sufficient if quotation recommendation were applied in personalized applications.

B.1 Constructing Corpus for Weibo Data
The Chinese quotations in the Weibo dataset are all Chengyu, which mostly consist of four characters and can be regarded as phrases. In our preliminary observation, a Chengyu can be a noun, a verb, or even an adjective, and can appear in any position of a sentence. Therefore, we considered any position between two words or phrases (detected by a Chinese tokenizer) to be a possible position for when-to-quote prediction. We then constructed the context-generation pairs by splitting the original conversation at those positions. To alleviate the difficulty of generation and make it closer to quotation generation, we also limited the generation target to at most four words or phrases and removed those samples with overly long content to be generated.

B.2 Constructing Corpus for Reddit Data
The English quotations in the Reddit dataset are full sentences obtained from Wikiquote. In our preliminary observation, most of the quotations appearing in the Reddit dataset are used after a complete sentence, to explain or summarize the previous statements (only 7.6% of them are used after words like "saying", "said", etc., which explicitly indicate that the following sentences might be quotations; these can be regarded as a small number of easy samples to predict). Therefore, we considered any position between two sentences to be a possible position for when-to-quote prediction. We then constructed the context-generation pairs by splitting the original conversation at those positions. To alleviate the difficulty of generation, we limited the generation target to one complete sentence and removed the rest of the content.

B.3 Human Evaluation on the Quality of the Constructed Samples
To examine whether the constructed corpus is of high quality, we conducted a human evaluation to check whether humans can predict well on the constructed corpus. We sampled 50 samples each from the original context-quotation samples and the newly constructed context-generation samples to build a human evaluation test set. We then invited three crowd-workers to predict whether quotations can be used to continue the context. Each predictor gave a yes-or-no answer or marked a sample as "not sure" (for those they thought were indistinguishable). We then evaluated their results only on the samples not marked as "not sure". The average results are displayed in Table 7.
As can be seen, humans scored higher than 60% on all metrics for both datasets, indicating that prediction on the corpus is possible from a human perspective (a random prediction would score about 50%). On the other hand, about 15% (i.e., the not-sure rate) of the samples are difficult for humans to distinguish, which is reasonable and thus worth attempting from the perspective of machines.

C Statistics of the Datasets
We display the statistics of the two datasets in Table 8, including those of our newly constructed samples. We can observe that quotations on Reddit have a longer average length, and the turn length is also much longer than on Weibo, which might make it a more difficult dataset. Our newly added samples (which serve as negative samples for when-to-quote prediction) are similar in number to the original conversations. The average turn number becomes smaller, since we constructed them by splitting the original conversation context.