Contrastive Learning for Inference in Dialogue

Inference, especially inference derived from inductive processes, is a crucial component of conversation, complementing the information implicitly or explicitly conveyed by a speaker. While recent large language models show remarkable advances in inference tasks, their performance in inductive reasoning, where not all information is present in the context, lags far behind their performance in deductive reasoning. In this paper, we analyze the behavior of the models based on the task difficulty defined by the semantic information gap -- the property that distinguishes inductive from deductive reasoning (Johnson-Laird, 1988, 1993). Our analysis reveals that the disparity in information between dialogue contexts and desired inferences poses a significant challenge to the inductive inference process. To mitigate this information gap, we investigate a contrastive learning approach that feeds the model negative samples. Our experiments suggest that negative samples help models understand what is wrong and improve their inference generations.


Introduction
In conversations, inference is essential to uncover what the speaker intended to deliver, which often goes beyond the information explicitly expressed (Rieger, 1974; Thorndyke, 1976). Inferences can be made by explicit or implicit logical reasoning based on utterances and the common ground among speakers (Clark, 1975). By reading between the lines, these inferences enable appropriate responses in dialogues. This inference process was intensively discussed in the early days of dialogue research (e.g., Thorndyke, 1976). However, research in dialogue systems nowadays often overlooks this aspect and instead relies solely on the capabilities of large language models (LLMs) to understand and comprehend dialogues.

Table 1: One example at the "Conceivable" difficulty level, comparing the generated inferences from our method, T5-base, and the gold inference. Dial. and Ques. are short for Dialogue and Question. The snippets of inferences highlighted in pink are not explicitly stated in the dialogue and require the model to conduct inference inductively. We refer to this phenomenon as the "information gap" that must be bridged to accomplish this task.
Current LLMs, such as ChatGPT (OpenAI, 2022), lack so-called "inductive reasoning" ability and tend to accomplish reasoning tasks deductively (Bang et al., 2023). This may be due to the fundamental difference between inductive and deductive processes. According to Johnson-Laird (1988, 1993), inductive reasoning involves an increase in semantic information from input to output, while the amount of information remains the same in deductive reasoning. In dialogue inference processes, especially when reading implicit messages, there are information gaps that need to be filled. For instance, somebody's invitation for a "quick lunch as always" might be enough to specify the location and time without further interaction.
In this paper, we inspect the semantic information gap between dialogue contexts and intended inferences using a recently introduced dataset designed for generating inferences in dialogue (Ghosal et al., 2022). We hypothesize that the difficulty of the task is associated with the amount of information gap that must be bridged. We manually annotate a randomly sampled subset of the dataset with its information gap and assess the performance of the models. The analysis shows a decline in model performance as the information gap increases. Furthermore, we propose applying a contrastive learning approach to improve inference performance. One limitation of current sequence-to-sequence training, especially for reasoning tasks, is that models are never exposed to negative samples (Lee et al., 2021). In deductive reasoning, all the information required to generate an output is provided in the input, and there is no information gap. However, inductive reasoning requires including something that may not be explicitly stated in the input, which is not learnable simply by exposing the model to gold samples. Thus, we need to teach the model with more guidance on the reasoning path. In a preliminary experiment using the same dataset in a multiple-choice framework with RoBERTa-large (Liu et al., 2019), we observed a significant improvement from an F1 score of 83.91 to 96.6 simply by feeding negative samples together with the other candidates, which indicates that feeding negative samples helps the model learn how to fill the information gap. Building on this initial experiment, our experimental results in the generative setting show that contrastive learning improves both overall performance and the breakdown performance at each task difficulty level, especially for the fully deductive and inductive cases. Additionally, we explore various sampling methods for generating negative samples for contrastive learning.
Our contributions are three-fold: (1) we provide data annotations based on the information gap, together with an assessment of model performance; (2) we show that the information gap accounts for the difficulty of inference generation in dialogue; and (3) our experimental results show that the contrastive learning approach helps to fill the information gap.

Inference in Conversation
As one of the most fundamental uses of natural language (Jurafsky and Martin, 2023), advances in inference in conversation have been inseparable from the flourishing of the field of natural language processing (NLP) (e.g., Mann, 1979; Phillips, 1975). Initially, the research focus of inference in conversation was to uncover the underlying rules of human conversations (e.g., Grosz, 1978; Carbonell Jr, 1978; Morgan, 1978). While this remains a core research question, recent works tend to be framed in question answering (QA) style so that models can be tested in a handier way. Thanks to powerful deep learning models, we can perform inference tasks sufficiently well, yet the underlying rules remain unclear. Recently, a number of QA datasets in conversational formats have been introduced (Choi et al., 2018; Reddy et al., 2019; Ma et al., 2018), and their main focus tends to be comprehension of non-conversational texts. To evaluate the comprehension of dialogues, various tasks have been proposed in different formulations such as span extraction (Li et al., 2020; Yang and Choi, 2019; Wu et al., 2022), multiple choice (Sun et al., 2019), next utterance prediction (Cui et al., 2020), and natural language inference (NLI) (Welleck et al., 2019). Some tasks focus on a specific aspect of conversational inference, such as speaker guessing (Sang et al., 2022) and temporal reasoning (Qin et al., 2021). In a natural language generation format, Ghosal et al. (2021, 2022) present datasets for generating inferences based on dialogue; Ghosal et al. (2021) only contains overt inferences, while Ghosal et al. (2022) contains implicit guesses as well.

Task Difficulty and Information Gap
Controlling the difficulty of tasks requires delicate tuning, as it is crucial for further advances in NLP; tasks that are too challenging or too easy cannot facilitate the growth of the technology. A task becomes more challenging if we impose additional conditions, such as limiting the amount of data or computational power, or adding modalities or other languages. Recently, some work has investigated specific tasks with controlled or annotated data. For example, Williams et al. (2022) annotate inference types, such as numerical or referential, to see which type is the most challenging in NLI. Cui et al. (2023) limit the data to assess the models' capability to properly understand what the word "respectively" refers to in NLI.
Discussing task difficulty independent of model performance is non-trivial. Current assessments of task difficulty tend to be inseparable from performance comparisons across models (e.g., Bang et al., 2023). In this way, we can observe the models' strengths and weaknesses across different tasks, but there is still a lack of absolute difficulty rankings of the tasks. One possible way to discuss difficulty in a model- or task-agnostic way is through the information gap, which is the core challenge in inductive reasoning (Johnson-Laird, 1988, 1993). It has been discussed as "given and new information" (Clark and Haviland, 1974; Clark, 1975), a foundation of conversation, but the concept can be extended to any task (McKeown, 1979). In this line of work, Rudinger et al. (2020) propose an NLI task in which an inference can be shifted when new information is offered. These days, not many works explicitly mention the "information gap" (Hayashi, 2022). However, the concept still underlies much current work. For example, QA datasets commonly contain some portion of unanswerable questions (e.g., Rajpurkar et al., 2018; Bajaj et al., 2016) given the provided context.

Contrastive Learning in NLG
Contrastive learning teaches a model to embed similar data sample pairs closer together while keeping disparate sample pairs apart (Chopra et al., 2005; Smith and Eisner, 2005). Beyond obtaining better representations of words (Mikolov et al., 2013) or sentences (Fang et al., 2020; Gao et al., 2021; Liu et al., 2021a), contrastive learning is reported to improve a wide range of NLP tasks (e.g., Li et al., 2022b; Klein and Nabi, 2020), including text generation tasks (e.g., Cai et al., 2020; Li et al., 2021; Liu et al., 2021b; Paranjape et al., 2021; Li et al., 2022a; Shu et al., 2021). The main motivation for applying contrastive learning to sequence-to-sequence text generation tasks is that it exposes the model to negative samples during training (Lee et al., 2021). Indeed, negative samples generated by rule-based perturbations (Shu et al., 2021) or by machine generation (Cao and Wang, 2021), such as entity swaps (Tang et al., 2022), are reported to be effective for faithful, less hallucinatory text generation.

Information Gap in Inference
While existing work focuses on improving model performance on inference tasks with various methods, there is still a lack of in-depth investigation of the task itself and of how model behavior changes with the improved results. To fill this gap, we first propose to connect task difficulty with the "information gap" between contexts and target inferences, and classify the inference task difficulty into three levels. Then, we focus on generative inference in dialogues with the CICERO dataset (Ghosal et al., 2022). We collect additional annotations to assess the task difficulty of a subset of samples for further analysis.

Preliminaries of the CICERO Dataset
We denote a dialogue dataset as {D_n}_{n=1}^{N}, and a dialogue as D_n = (U_1, U_2, ..., U_T), where U_i is an utterance at turn i. Given an input X = (D_I, Q, U_t), where Q is a question and U_t ∈ D_I is a target utterance, we aim to learn a model f_θ to generate a plausible inference Ã = f_θ(X).
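As a concrete illustration, the input triple X = (D_I, Q, U_t) can be sketched as a simple record. This is a minimal sketch: the class and field names below are our own and not part of the CICERO release.

```python
from dataclasses import dataclass


@dataclass
class InferenceExample:
    """One input X = (D_I, Q, U_t); names are illustrative, not CICERO's."""
    dialogue: list   # D_I: utterances U_1 .. U_T in turn order
    question: str    # Q: one of the five CICERO question types
    target_idx: int  # position of the target utterance U_t in `dialogue`

    @property
    def target(self):
        return self.dialogue[self.target_idx]


ex = InferenceExample(
    dialogue=["Can you deliver it, please?",
              "Yes, it costs two pounds fifty."],
    question="What is the prerequisite of the target utterance?",
    target_idx=0,
)
assert ex.target == "Can you deliver it, please?"
```

A model f_θ would then map such a record to a free-form inference string.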
The CICERO dataset comes with five types of questions:

1. Cause: What is or could be the cause of the target utterance?
2. Prerequisite: What is or could be the prerequisite of the target?
3. Subsequent Event (SE): What subsequent event happens or could happen following the target?
4. Motivation: What is or could be the motivation of the target?
5. Reaction: What is the possible emotional reaction of the listener in response to the target?
For the subsequent event category, the dataset also offers a more challenging setting called Subsequent Event Clipped (SE_Clipped), where the dialogue is clipped at the target utterance, so the model cannot see the utterances that follow it.

Task Difficulty of the CICERO dataset
The CICERO dataset provides commonsense inferences made by human annotators. According to the annotation instructions, the answers must be grammatically correct and consistent with the dialogue, yet they can be overt or speculative depending on the contextual scenario (Ghosal et al., 2022). While all question types are treated equally, some appear significantly more challenging than others according to the results breakdown reported in Ghosal et al. (2022). For example, Motivation scores the highest even though it only accounts for 14% of the training set.
Although the surface format of the task is unified, and the question types thus cannot be distinguished at a glance, we can sense that they challenge different abilities. For example, SE can be executed simply by summarizing the utterances after turn t, while SE_Clipped requires predicting future sequences from the dialogue. The difficulty differs even among questions of the same type. Some inferences can be derived simply by paraphrasing the utterances, while others require logical guessing to read between the lines. These differences boil down to the information gap between the answer A and the dialogue D_I. Here, we take an initial step toward investigating the task difficulties systematically and define three levels of difficulty based on the amount of information in the answer covered by the dialogue: Sufficient, Likely, and Conceivable.
Level 1: Sufficient All the information in the answer is available in the given dialogue. Since there is no information gap between inputs and outputs, questions at this level are the easiest to answer. For example, from the dialogue context below, it is overt that User A will be available on Saturday morning for the delivery.
User A: Can you deliver it, please?
User B: Yes, it costs two pounds fifty.
User A: All right, can you deliver here on Saturday?
User B: Sure. Does morning work for you?
User A: Sounds good.
Question: What is the prerequisite of the target utterance?
Answer: User A will be available on Saturday morning.
Level 2: Likely Some pieces of information in the answer are not available or directly stated, but they can be guessed by combining clues in the dialogue. Questions at this level can be compared to multi-hop question answering tasks (Yang et al., 2018; Welbl et al., 2018; Inoue et al., 2020). There are arguably different degrees of hops needed to derive an answer depending on the context (Kumar et al., 2019; Cheng et al., 2021); however, here we classify at this level all questions that require some sort of "hop", e.g., over a knowledge graph (Speer et al., 2017; Sap et al., 2019; Hwang et al., 2021), regardless of the degree. For example, in the dialogue below, we can guess that User B will check the car as per User A's request. To check the car, User B will likely try to turn on the engine.

Level 3: Conceivable The answer contains some pieces of information that are not stated in the dialogue, and there is no clear guidance for a "hop". The answer is plausible but hardly verifiable. Questions at this level are not easy even with certain knowledge sources provided, and can be compared to checking hallucinations in open-domain text generation (Ji et al., 2023). For example, in the dialogue below, Bob may be a brother of User B, and his occupation could be radio journalist, which is a plausible reason to call Bob to ask about the fire at the factory. However, we cannot verify the answer, as the dialogue lacks the evidence to guess the relationship between the speakers and Bob, or his occupation.
User A: There's been a fire at the factory.
User B: Are you sure? There is nothing in the newspaper about it.
User A: I just saw it on the 6 o'clock news.
User B: I will phone Bob.
User A: Yeah, he always knows what's going on.
Question: What is the prerequisite of the target utterance?
Answer: User B's brother Bob is a radio journalist.

Human Assessment of the Difficulty
To the best of our knowledge, there is no automatic metric that can compare two pieces of text in terms of the amount of semantic information they contain. Here, we assess the difficulty of the task defined in Section 3.2 by human annotation. We randomly select 75 samples per question type (450 samples in total) from the CICERO test set. In our annotation scheme, we assign two well-trained annotators per sample to give a difficulty-level label, and one expert to double-check and finalize the label. In the few cases where the three annotators disagreed on the label, an additional expert was assigned for confirmation.
In Table 2, we summarize the annotated results and the performance of T5-base (Raffel et al., 2020) fine-tuned on the CICERO training set, measured on the same subset. The CICERO dataset has a balanced mixture of the three levels (Sufficient: 34.2%, Likely: 33.6%, Conceivable: 32.2%), and the performance of T5-base uniformly degrades as the amount of available information decreases.

Table 2: The performance of the fine-tuned T5-base gets worse along with the decrease in the amount of information available in the dialogue.
As reported in Table 3, different question types have different proportions of difficulty levels, as anticipated. Although the proportions of Likely and Conceivable questions explain the difference in T5-base performance to a certain extent, the correlation is not straightforward. This may be due to differences in the kind of information required to bridge the gap between the dialogue and the answer. For example, speakers' emotional reactions might be easily guessed from the sentiment of the utterances, while identifying the cause of an utterance may involve a more complicated understanding of background knowledge.

Methodology
We primarily train our model f_θ by minimizing the negative log-likelihood

  L_NLL = -Σ_{j=0}^{k} log p_θ(a_j^n | X, a_{<j}^n),

where a generated inference is denoted as Ã_n = {a_j^n}_{j=0}^{k}. The contrastive learning objective is defined by

  L_CL = -log [ exp(sim(h_X, h_{Ã_n}) / τ) / ( exp(sim(h_X, h_{Ã_n}) / τ) + Σ_{A′ ∈ A} exp(sim(h_X, h_{A′}) / τ) ) ],

where sim is the cosine similarity function, A is a set of negative inference samples, h_X, h_{Ã_n}, and h_{A′} are the hidden representations of X, Ã_n, and A′, and τ is a temperature. Following Cao and Wang (2021) and Lee et al. (2021), the final training objective is L = L_NLL + λL_CL, where λ is a coefficient.
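The contrastive objective can be sketched in a few lines of plain Python. This is a minimal illustration rather than the training code: the vectors are toy stand-ins for the hidden representations h_X, h_{Ã_n}, and h_{A′}.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def contrastive_loss(h_x, h_pos, h_negs, tau=0.1):
    """InfoNCE-style loss: pull the context representation h_x toward the
    gold inference h_pos and away from the negative inferences h_negs."""
    pos = math.exp(cosine(h_x, h_pos) / tau)
    negs = sum(math.exp(cosine(h_x, h_n) / tau) for h_n in h_negs)
    return -math.log(pos / (pos + negs))


# Toy case 1: the positive is aligned with the context, negatives are not.
h_x = [1.0, 0.0]
loss_easy = contrastive_loss(h_x, [0.9, 0.1], [[-1.0, 0.0], [0.0, 1.0]])
# Toy case 2: a negative is aligned with the context instead.
loss_hard = contrastive_loss(h_x, [0.0, 1.0], [[1.0, 0.1], [0.9, 0.0]])
assert loss_easy < loss_hard
```

The loss is low when the positive pair is already close and the negatives are far, and high otherwise, which is exactly the pressure the objective applies during training.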

Selection of Negative Samples
Automatically generating a set of negative samples A for contrastive learning is non-trivial. The easiest method is to randomly sample other inferences from the dataset (usually within the same batch), but the supervision from these negative samples can be weak due to the dissimilarity of the sentences. We denote the contrastive loss for in-batch negative samples as λ_b L_CL_b. In addition, we aim to feed more informative negative samples per gold inference, whose loss we denote as λ_s L_CL_s. The training objective then becomes L = L_NLL + λ_b L_CL_b + λ_s L_CL_s. Since the CICERO dataset also serves as an MCQ task, each inference comes with four high-quality, plausible-looking yet inappropriate candidates. These counterfactual candidates are machine-generated and then filtered by human annotators. In our experiments, we explore the following fully automatic ways of generating negative samples:

Non-Optimal Generation Since simple fine-tuning with L_NLL does not yield the optimal f_θ, as reported in Table 3, we directly use inferences generated by the fine-tuned model. We use top-k sampling with k = 10 for diverse generations.
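Top-k sampling, used above to obtain diverse negative generations, can be sketched as follows. This is a hedged illustration: the log-probability table is a toy stand-in for a real model's output head, not CICERO or T5 code.

```python
import math
import random


def top_k_sample(logprobs, k=10, rng=random):
    """Sample one token from the k most probable tokens.

    `logprobs` maps token -> log-probability (illustrative values, not a
    real language-model head).
    """
    top = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [t for t, _ in top]
    weights = [math.exp(lp) for _, lp in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token scores for a single decoding step.
vocab_scores = {"lunch": -0.2, "dinner": -1.0, "car": -5.0, "the": -0.5}
tok = top_k_sample(vocab_scores, k=2, rng=random.Random(0))
assert tok in {"lunch", "the"}  # only the top-2 tokens can be drawn
```

Restricting sampling to the top k tokens keeps the generated negatives fluent while still varying them across samples.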
Replacement of Tokens Inspired by Park et al. (2021), we manipulate tokens of the gold inference using the predictions of a masked language model. More specifically, we compute the probability of each token in the gold inference A both when the whole context X and A are given and when only A is given. In this way, we can estimate which tokens in A are most affected by the context X. We directly compare the log-likelihood score of each token and select the tokens that differ by more than a threshold. The selected tokens are then replaced by randomly selected tokens from the top-k predictions of a masked language model. We apply the pretrained RoBERTa-large model (Replace_ZS) and a RoBERTa-large model trained on the CICERO MCQ task (Replace_MCQ), with k = 10 and a threshold of 0.75.
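The selection step of this method can be sketched as below. The per-token scores are illustrative stand-ins for masked-LM log-likelihoods; only the comparison logic and the 0.75 threshold follow the description above.

```python
def context_sensitive_tokens(lp_with_ctx, lp_without_ctx, threshold=0.75):
    """Return indices of tokens whose log-likelihood shifts by more than
    `threshold` when the dialogue context is removed; these are the
    replacement candidates."""
    return [i for i, (a, b) in enumerate(zip(lp_with_ctx, lp_without_ctx))
            if abs(a - b) > threshold]


# Toy scores for the gold inference
# "User A will be available on Saturday morning":
# the last two tokens are much less likely without the dialogue context,
# so they are the ones the dialogue actually pins down.
with_ctx    = [-0.1, -0.2, -0.3, -0.2, -0.4, -0.3, -0.2, -0.3]
without_ctx = [-0.1, -0.2, -0.3, -0.2, -0.4, -0.3, -1.5, -2.0]
assert context_sensitive_tokens(with_ctx, without_ctx) == [6, 7]
```

Replacing exactly these context-dependent tokens yields negatives that stay fluent but contradict what the dialogue supports.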

Baselines
We evaluate our proposed method across multiple Transformer-based models: T5-small/base/large (Raffel et al., 2020) and GPT2-base (Radford et al., 2019). For a fair comparison, these baselines are fine-tuned on the CICERO training set with L_NLL only. In addition, we compare our results with the performance of GPT-J (Wang and Komatsuzaki, 2021) and LLaMA-7B (Touvron et al., 2023) in a 3-shot setting. We report the average of three trials with randomly sampled, manually crafted prompts, as well as a strategic prompt that uses tf-idf to retrieve the 3 most similar in-context examples.
Human Evaluation For a comprehensive evaluation, we also conduct a human evaluation of the Plausibility aspect, which focuses on evaluating whether the answers are rational or not. We evaluate the same data samples as those used for the task difficulty analysis. More specifically, we compare our outputs with both the generated inferences from the T5-base model and the gold inferences. A/B testing is utilized to compare our proposed method and the corresponding baseline on the CICERO test set.

Training Details
The models are trained using a batch size of 64 after gradient accumulation, with a learning rate set at 1e-4 for T5 models and 1e-5 for GPT-2 models. We limit training to a maximum of 10 epochs, employing a linear learning rate scheduler.
The checkpoint exhibiting the lowest perplexity on the validation set is chosen as the optimal model for each trial. For contrastive learning, the temperatures τ for L_CL_b and L_CL_s are set to 0.1 and 2.5, respectively, and the two losses contribute equally to the total loss with coefficients λ_b = λ_s = 0.5.
All the experiments are executed on a single RTX 3090 Ti GPU.

Results
We report the automatic results of both our method and the baselines in Table 4. The automatic metrics based on n-gram overlap mostly improve thanks to contrastive learning. Moreover, our proposed method is architecture-agnostic, given that it shows consistent improvement on both the encoder-decoder T5 models and the decoder-only GPT-2. For GPT-J and LLaMA, we could not see any improvement from tf-idf retrieval. We suspect that even though they are lexically similar, the retrieved examples may mislead the model into making wrong predictions.

Overlap-based metrics can reflect the general quality of the generated inferences with respect to the gold answers. However, they do not reflect the inference ability of the generations, not to mention inductive inference ability. In this work, we also explore the feasibility of NLI metrics for evaluating inference ability. More discussion is included in Section 6.6.
Human Evaluation For a more comprehensive evaluation of inference ability, we conduct a human evaluation of the plausibility of the generated inferences and report the results in Table 5. We use pairwise individual t-tests to validate the significance of the improvements. Inter-annotator agreement is computed using Fleiss' kappa (κ) to assess the reliability of the evaluation. As shown in Table 5, contrastive learning significantly improves the plausibility of the generated inferences over T5-base, with substantial agreement. The inferences generated by T5-base with contrastive learning show plausibility comparable to the gold ones on the CICERO test set, with fair inter-annotator agreement. The human evaluation further confirms the effectiveness of our proposed method in improving inference ability. We investigate the improvement breakdown at each difficulty level to further analyze the effect of contrastive learning in Section 6.5.

Case Study
Table 1 illustrates one "Conceivable" example, comparing the generated inferences from our method, T5-base, and the gold inference. While T5-base tends to copy from the dialogue (highlighted in blue), contrastive learning encourages the model to infer more rational information that is not stated in the context (highlighted in pink). We include more examples in Appendix B.2.

Ablation Study
We perform an ablation study on our proposed method using T5-base as the foundational model. The effectiveness of our model is compared against variants trained without L_CL_s, without L_CL_b, or without both. In Table 6, our proposed method, employing both contrastive losses, amplifies the performance. The model without L_CL_s surpasses ours in terms of CIDEr, yet our method achieves superior results on all other metrics. Furthermore, the impact of the different contrastive losses varies across the automatic metrics. While L_CL_b exhibits minimal impact on ROUGE-L, it proves more effective for CIDEr. The most significant contribution to the ROUGE-L improvement comes from L_CL_s.

Comparison of Sampling Methods
In addition to the negative samples provided by the CICERO dataset, three different fully automated methods of generating negative samples are explored, as stated in Section 4.1. We train the model with the negative samples obtained from each method and present the performance of the models in Table 7 for comparison. While different generation methods yield different improvements in the automatic metrics, in general, feeding negative samples does not hurt training the models to perform dialogue inference. The "contradiction" negative samples from the dataset provide the largest improvement to model performance, which suggests that higher-quality negative samples can guide the models better even in smaller amounts. Another effective method is to replace the words that most affect the predictions of the RoBERTa-large model trained to differentiate positive samples from negative samples (Replace_MCQ), while replacement guided by the RoBERTa model in a zero-shot way (Replace_ZS) is less helpful. This indicates that the fine-tuned RoBERTa assigns token probabilities more informatively for inference. We expected our exploration of using a non-optimal T5-base model to generate negative samples to improve model performance through iterative self-contrasting. However, such self-improvement may not be effective without further human filtering, since rational answers may be included as negative samples, introducing noise during training.

Table 8: The effect of the amount of negative samples. We report the average of three trials for m = 1, 2, 3.

Effect of the Amount of Negative Samples
In our main experiments, we feed all four of the counterfactual candidates provided by the CICERO dataset as negative samples to compute L_CL_s. As the effective amount of negative samples for contrastive learning is under discussion (e.g., Awasthi et al., 2022; Nozawa and Sato, 2021), we conduct a controlled experiment by feeding randomly sampled counterfactual candidates (m = 1, 2, 3) to observe the effect of the number of negatives. We report the results in Table 8; note that we report the average of three trials with different random seeds for m = 1, 2, 3. The performance generally improves as the number of negative samples increases, implying that high-quality negative samples contribute to teaching the model to perform inference. Encouraged by these results, it would be interesting to quantify how much guidance is necessary at each level. For example, the "Sufficient" level may need fewer negative samples than the "Conceivable" level to achieve similar performance. It would also be beneficial to investigate the possibility of dynamically controlling the number of negative samples to feed.

Table 9: The performance is improved thanks to contrastive learning across all the difficulty levels. Conceiv. is short for Conceivable. The performance is calculated on the same subset of the CICERO test set as in Table 2.

Analysis of Improvements based on Task Difficulty
We further investigate how contrastive learning improves model performance at different task difficulties. Table 9 reports the automatic score breakdown based on the annotated difficulty. Compared to the performance of the T5-base model reported in Table 2, our method yields improvements at all levels, especially "Sufficient" and "Conceivable". Similarly, we list the breakdown of the human evaluation results by task difficulty level in Table 5. T5-base with contrastive learning outperforms T5-base on plausibility at all difficulty levels, especially "Sufficient" and "Conceivable", which is consistent with the trend of the automatic metrics. At the "Sufficient" level, the advantage of our model over T5-base is significant. This shows that contrastive learning can effectively improve the model's inference ability. Moreover, our method even significantly wins over the gold inferences at the "Conceivable" level in human evaluation, with p < 0.05. The gold inferences at this level tend to include something that is not stated or verifiable in the provided dialogue contexts, while ours tend to be better supported by the dialogue context (see Table 1). We believe this is why ours were more favored by annotators.

Challenges in Evaluation of Inductive Reasoning
As discussed in the previous sections, it is extremely challenging to evaluate inductive processes because, by their nature, the outputs contain new information that is not stated in the inputs (Johnson-Laird, 1988, 1993). While the field has been aware of the fundamental difference between induction and deduction for more than 60 years (Watanabe, 1960), there is still no way to directly compare two pieces of text in terms of the amount of "semantic information" they contain. Recently, with the rising demand for faithful and factual text generation, several metrics have been applied, mainly computing the overlap in named entities or extracted keywords (Mao et al., 2021). Although overlap-based metrics can be a decent starting point for many tasks such as summarization, they are not appropriate for inference in dialogue, where non-overlap is desired rather than avoided. Another common choice for measuring plausibility today is NLI-based metrics (Honovich et al., 2022). In Table 10, we report the model-based NLI metrics UNLI (Chen et al., 2020) and AlignScore (Zha et al., 2023). We measure entailment between generated inferences and the gold references (UNLI_gold/AS_gold), or between generated inferences and the corresponding dialogue context (UNLI_con/AS_con), on a scale of [0, 1].
The training specifics of the NLI models, as well as their performance, can be found in Appendix A.1.
Despite being promising, the NLI scores are hardly interpretable, showing a consistent trend of degradation with contrastive learning except for GPT2-base. Even gold answers are labeled as "neutral" and undeterminable, and it is difficult to associate the numbers with the quality of the generated inferences. Although NLI metrics are an effective way to quantify factuality (Zha et al., 2023), this result suggests that they are not suitable for inference in dialogue. Future work is needed to investigate possible evaluation metrics for the information gap, as such metrics could benefit a wide range of NLP tasks.

Conclusion
In this paper, we conduct an analysis of inferences in dialogue, focusing on the availability of semantic information between inputs and outputs. As expected, the models perform worse on samples with larger information gaps. We investigate a contrastive learning approach to teach models what is wrong in inference. Our experimental results suggest the effectiveness of our approach, showing a promising direction for bridging the information gap, especially for smaller models with <1B parameters.

Limitations
The main drawback of the proposed method is that it requires more computational resources and longer training time, as we increase the amount of training data to yield improvements with contrastive learning over the baselines. Although our method is model-, dataset-, and language-agnostic, our exploration is limited to popular Transformer-based architectures and a single dataset in English.
The other significant aspect we have not covered in this paper (and, to the best of our knowledge, in most of the literature) is the stopping rule of the inference process in dialogue. As suggested in Clark (1975), there is a clear boundary between what portion of the untold information should be guessed and what can be left unknown in a speaker's intention. However, even in dataset construction, this aspect has been neglected (e.g., Bhagavatula et al., 2020; Ghosal et al., 2022). The stopping rule is essential, since it can be one factor separating "Likely" questions from "Conceivable" questions. An important question for future studies is how to deal with the stopping rule, as it can also be associated with the boundary between hallucination and acceptable freedom in open-domain dialogue systems.

A Additional Details of Experiments
A.1 Details of NLI Models

UNLI Model. Following Chen et al. (2020), we apply the RoBERTa-large model (Liu et al., 2019) to the u-SNLI dataset. We first train the model as a three-way classifier over {entailment, neutral, contradiction} on SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), FEVER-NLI (Thorne et al., 2018), and ANLI (R1, R2, R3) (Nie et al., 2020), and then replace the classification head with a regression head and fine-tune on u-SNLI. A.1 In our observation, this warmup improves performance on u-SNLI. The training batch size is 16, and we train for 3 epochs with a learning rate of 1e-5. In Table A.1, we report the scores on the u-SNLI development and test sets along with the numbers reported in Chen et al. (2020): the Pearson correlation coefficient r, the Spearman rank correlation ρ, and the mean squared error (MSE). In our main experiments, we use the best model as the model-based evaluation metric.
AlignScore Model. Following Zha et al. (2023), we apply the RoBERTa-large model and the checkpoint distributed by the authors. A.2

A.2 Details of Human Evaluation
As stated in Section 5.2, a human evaluation of Plausibility is conducted on 450 data samples in total across six subtasks from the CICERO test set. We compare the inferences generated by our model with those from T5-base and with gold inferences. A/B testing is used, and each data sample is evaluated by three different annotators. For quality control, we restrict the annotators' locations to the United States, the United Kingdom, Canada, and Australia to ensure English proficiency. All annotators are required to answer 20 test questions with more than 80% accuracy before starting the annotation. During the evaluation, we present the same context and two options from different models, and the annotators decide which inference is more plausible by choosing from "Option 1", "Option 2", "both", and "neither". The annotator instruction is presented in Figure A.1. After collecting all the annotations, we first calculate the Win/Tie/Loss ratio of our model with respect to T5-base and Gold, respectively. The corresponding results are shown in Table 5. Moreover, the inter-annotator agreement is calculated with Fleiss' kappa (κ), implemented with the "statsmodels" package. A.3

A.1 https://huggingface.co/ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli
A.2 https://github.com/yuh-zha/AlignScore
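For concreteness, the agreement statistic above can be computed as follows. The paper uses the statsmodels implementation; this is a minimal, self-contained reimplementation for illustration only, taking a table whose entry (i, j) counts the annotators who assigned item i to category j:

```python
def fleiss_kappa(table):
    """Fleiss' kappa for an items-by-categories count table.

    table[i][j] = number of annotators assigning item i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(table)        # number of items
    n = sum(table[0])     # raters per item
    k = len(table[0])     # number of categories

    # Proportion of all assignments falling into each category.
    p = [sum(row[j] for row in table) / (N * n) for j in range(k)]

    # Mean observed per-item agreement.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in table
    ) / N

    # Expected agreement by chance.
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)
```

In practice, the equivalent statsmodels call is `statsmodels.stats.inter_rater.fleiss_kappa` applied to a table built with `aggregate_raters`.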
To assess the significance of the advantage of our method over the baseline and gold, we also calculate the winning rate of each model under evaluation in Table A.2. More specifically, a model gains one point if the annotator chooses the corresponding option or "both", and zero otherwise. We average over all scores and take that as the model's human evaluation result, reported as a percentage. We calculate significance levels with pairwise t-tests over the per-sample scores.
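The scoring and test statistic above can be sketched as follows. The exact test variant is not fully specified in this excerpt, so this sketch assumes a paired t-test on per-sample score differences; the helper names are hypothetical:

```python
from math import sqrt

def win_scores(choices, option):
    """Per-sample scores: 1 if the annotator picked this option or "both"."""
    return [1.0 if c in (option, "both") else 0.0 for c in choices]

def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test on per-sample score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / sqrt(var / n)
```

The corresponding p-value can be obtained from `scipy.stats.ttest_rel`, which computes the same statistic together with its two-sided significance level.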
A.3 Choice of λ_b and λ_s

The coefficients λ_b and λ_s are set to λ_b = λ_s = 0.5 based on the preliminary experiments reported in Table A.3.
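Since the exact form of the training objective is not given in this excerpt, the role of the two coefficients can be illustrated with a hedged sketch: a generation loss combined with two contrastive terms, here assumed to be margin-based hinge penalties over negative samples. Both function names and the margin formulation are illustrative assumptions, not the paper's definition:

```python
def margin_contrastive(pos_score, neg_scores, margin=1.0):
    """Assumed illustrative form: average hinge penalty pushing the
    positive sample's score above each negative's by at least `margin`."""
    return sum(max(0.0, margin - pos_score + s) for s in neg_scores) / len(neg_scores)

def total_loss(gen_loss, batch_cl, sample_cl, lam_b=0.5, lam_s=0.5):
    """Weight the two contrastive terms by lambda_b and lambda_s
    (both set to 0.5 in the paper) and add the generation loss."""
    return gen_loss + lam_b * batch_cl + lam_s * sample_cl
```

With λ_b = λ_s = 0.5, each contrastive term contributes half its raw value, keeping the generation loss dominant while still penalizing inferences that score close to the negatives.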

B.1 Score breakdown by question type
In Table B.1, we report the breakdown of the automatic evaluation results by question type on the CICERO test set.

B.2 Case Study
In Table 1, we show one example at the "Conceivable" level. We also present examples at the "Sufficient" and "Likely" levels in Table B.3 and Table B.2, respectively.

Figure A.1: Annotator instruction of the human evaluation on Plausibility.
User A: Jim, could you do me a favor?
User B: Sure, what can I do for you?
User A: My car has a problem starting. Could you please take a look at it for me?
User B: Sure thing.
Question: What subsequent event happens following the target utterance?
Answer: User B tries to turn on the car engine.

Table 3: The difficulty of the inferences varies with the question type, and so does the performance of the fine-tuned T5-base. The corresponding performance is calculated on the same subset of the CICERO test set.

Table 4: Automatic results on the CICERO test set. CL is short for contrastive learning. We bold the better result between our method and the corresponding baseline model, and underline the best results across different models.

Table 5: Human evaluation results on Plausibility, with a breakdown by difficulty level.
* Our model achieves a significant advantage over T5-base or Gold under a pairwise t-test (p < 0.05).

Table 6: Ablation study with T5-base as the base model.

Table 10: NLI-based metric results on the CICERO test set (columns: UNLI_gold, UNLI_con, AS_gold, AS_con). AS is short for AlignScore.