Consistency Analysis of ChatGPT

ChatGPT has gained huge popularity since its introduction. Its positive aspects have been reported through many media platforms, and some analyses even showed that ChatGPT achieved a decent grade in professional exams, adding extra support to the claim that AI can now assist and even replace humans in industrial fields. Others, however, doubt its reliability and trustworthiness. This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour, focusing specifically on semantic consistency and the properties of negation, symmetric, and transitive consistency. Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions. We also ascertain via experiments that prompt design, few-shot learning, and employing larger large language models (LLMs) are unlikely to be the ultimate solution to the inconsistency issue of LLMs.


Introduction
AI systems can be more reliable and trustworthy provided they behave in a similar manner to humans (De Visser et al., 2016; Jung et al., 2019). In this regard, ChatGPT, a large language model (LLM) that simulates human-like conversations (Fares, 2023), is gaining widespread popularity, reaching 100 million users only two months after its launch (Milmo, 2023). It offers many convenient features to users, such as summarising documents, writing essays, answering questions, and writing computer programs. ChatGPT has also performed astoundingly well on various examinations, including passing the United States Medical Licensing Examination (Kung et al., 2023), achieving passing grades in four real exams at the University of Minnesota Law School (Choi et al., 2023), and providing decent answers to exam questions in Operations Management, which is a core MBA course (Terwiesch, 2023). These surprising results make people believe that LLMs can assist humans even in professional areas and greatly influence various academic and industrial fields.
Others, however, question ChatGPT's reliability, pointing out its overconfidence in generating factually incorrect information (Skopeliti and Milmo, 2023), inability to comprehend the complexity of human language (Bogost, 2022), and imperfect mathematical abilities (Frieder et al., 2023). Even though these mistakes may appear insignificant in normal daily tasks, e.g., drafting an email, they raise serious concerns in conservative and risk-sensitive domains, such as law, medicine, and finance.
In this article, we investigate the reliability and trustworthiness of ChatGPT in terms of the language model's consistency. By using the BECEL dataset (Jang et al., 2022a), which is designed to ascertain whether language models satisfy various types of consistency, we analyse ChatGPT's ability to generate logically consistent predictions based on three properties: semantic equivalence, logical negation, and symmetricity. Our experimental results show that although ChatGPT understands negation expressions and antonyms much better than previous pre-trained language models (PLMs) like BERT (Devlin et al., 2019), it still violates semantic equivalence and symmetricity quite frequently. Our contributions can be briefly summarised as follows:
1. We analyse the consistency behaviour of ChatGPT by measuring semantic, negation, and symmetric consistency.
2. We observe that ChatGPT achieves a much lower negation inconsistency compared to other PLMs, indicating its improved understanding of negation expressions and antonyms.
3. We ascertain that ChatGPT is likely to generate different predictions on text inputs that deliver the same meaning, i.e., paraphrased inputs.
4. Even worse, we confirm that ChatGPT is self-contradictory, meaning that it violates semantic consistency for paraphrased inputs generated by ChatGPT itself.
5. We find that ChatGPT is extremely sensitive to the input sentence order for order-invariant tasks, e.g., semantic textual similarity (STS).
Hence, we conclude that despite its favourable reputation and positive media coverage, ChatGPT is not completely reliable, suggesting that using ChatGPT without human confirmation would be hazardous, particularly in high-risk industries.

Related Works
The consistency of language models has been an important topic in natural language processing (NLP), but it has been studied under various definitions. Semantic consistency is the most widely used concept in consistency analysis, meaning that a model should make consistent decisions in semantically equivalent contexts (Elazar et al., 2021). Semantic consistency is an indispensable property that should be satisfied for all textual data and NLP tasks. Ravichander et al. (2020) observed that PLMs are likely to generate different masked language modelling predictions when an object in queries is replaced with its plural form. Elazar et al. (2021), on the other hand, found that PLMs generate different masked language modelling predictions when given paraphrased queries. Another line of work employed the idea by introducing a consistency regularisation term for training, which penalises the violation of semantic consistency, to train more robust NLP models (Wang and Henao, 2021; Zheng et al., 2021; Kim et al., 2021).
Symmetric consistency is a consistency type based on symmetric inference, defined as f(x, y) = f(y, x). This implies that a model should be input-order invariant for tasks where the symmetric property holds. Regarding the natural language inference (NLI) task, Wang et al. (2019) argued that symmetric consistency applies to data points labelled "not entailment", i.e., "contradiction" and "neutral". They showed that many deep-learning-based NLI models change their predictions when the premise and hypothesis are switched. On the other hand, Li et al. (2019) only considered "contradiction" labels for analysis and ascertained that NLI models based on BERT (Devlin et al., 2019) are likely to violate symmetric consistency. Kumar and Joshi (2022) performed a symmetric consistency analysis on NLI and STS tasks in a more conservative manner, arguing that a model should generate not only the same predictions but also the same confidence scores if it is truly input-order invariant. They also observed that PLM-based models violated symmetric consistency and introduced a consistency regularisation term to compensate for the issue.
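The definition f(x, y) = f(y, x) can be made concrete with a minimal sketch. The two toy classifiers below are hypothetical stand-ins invented for illustration (they are not PLMs and are not taken from any of the cited works); one is order-sensitive by construction and the other is order-invariant.

```python
from typing import Callable

Classifier = Callable[[str, str], str]

def is_symmetric_consistent(predict: Classifier, x: str, y: str) -> bool:
    """Check f(x, y) == f(y, x) for a single input pair."""
    return predict(x, y) == predict(y, x)

def subset_model(s1: str, s2: str) -> str:
    """Hypothetical order-sensitive toy classifier: 'equivalent' only if
    every word of s1 also occurs in s2."""
    words1, words2 = set(s1.lower().split()), set(s2.lower().split())
    return "equivalent" if words1 <= words2 else "not_equivalent"

def jaccard_model(s1: str, s2: str, threshold: float = 0.5) -> str:
    """Hypothetical order-invariant toy classifier based on word overlap."""
    words1, words2 = set(s1.lower().split()), set(s2.lower().split())
    union = words1 | words2
    overlap = len(words1 & words2) / len(union) if union else 1.0
    return "equivalent" if overlap >= threshold else "not_equivalent"

x = "he asked for pizzas"
y = "he asked for six pepperoni pizzas"
print(is_symmetric_consistent(subset_model, x, y))   # order-sensitive model
print(is_symmetric_consistent(jaccard_model, x, y))  # order-invariant model
```

The stricter criterion of Kumar and Joshi (2022) would additionally compare confidence scores, which this sketch does not model.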
The fundamental idea underlying negation consistency is the logical negation property (p is true ⇔ ¬p is false; Aina et al., 2018). Intuitively, a model's prediction should differ for text inputs delivering the opposite meaning. Several studies investigated the negation consistency of BERT and found that the model often generates the same outputs when asked negated and non-negated masked queries, e.g., "Birds can lay [MASK]" and "Birds cannot lay [MASK]" (Kassner and Schütze, 2020; Ettinger, 2020). Hossain et al. (2020) created negated versions of NLI datasets and also observed the violation of negation consistency, suggesting that PLMs lack the understanding of negation expressions. To alleviate the issue, several works adopted data augmentation to train a model with abundant data containing negation expressions (Asai and Hajishirzi, 2020; Hosseini et al., 2021). Jang et al. (2022b) expanded the evaluation scope from negation expressions to antonyms and ascertained the same tendency in recent PLMs. They proposed a new training task named meaning-matching to enhance PLMs' textual understanding ability and observed performance improvements.
Transitive consistency is a consistency type that can measure deductive reasoning ability. It is derived from transitive inference, represented as: if X → Y ∧ Y → Z, then X → Z, for three predicates X, Y, and Z (Gazes et al., 2012; Asai and Hajishirzi, 2020). In the NLI task, Li et al. (2019) employed the concept to generate four transitive inference rules. For three sentences P, H, and Z, one of these rules is N(P, H) ∧ E(H, Z) → ¬C(P, Z), where E, N, and C denote entailment, neutral, and contradiction, respectively. They collected a new evaluation set to assess the transitive consistency of BERT-based NLI models and showed the inconsistency of the models.
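As an illustration, the rule N(P, H) ∧ E(H, Z) → ¬C(P, Z) can be checked on a triple of predicted labels as sketched below. Reading N, E, and C as the NLI labels neutral, entailment, and contradiction is an assumption based on the NLI setting, and the function name and label strings are hypothetical.

```python
def violates_neutral_entail_rule(label_ph: str, label_hz: str, label_pz: str) -> bool:
    """Return True if three predictions break N(P, H) ∧ E(H, Z) → ¬C(P, Z).

    label_ph, label_hz, label_pz are the predicted labels for the sentence
    pairs (P, H), (H, Z), and (P, Z). The rule is only violated when its
    antecedent holds and the consequent fails.
    """
    antecedent = label_ph == "neutral" and label_hz == "entailment"
    return antecedent and label_pz == "contradiction"

# The antecedent holds and (P, Z) is predicted as a contradiction: violation.
print(violates_neutral_entail_rule("neutral", "entailment", "contradiction"))
# The antecedent does not hold, so the rule says nothing about (P, Z).
print(violates_neutral_entail_rule("entailment", "entailment", "contradiction"))
```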
Other studies investigated the transitive consistency in question answering (QA) (Asai and Hajishirzi, 2020; Mitchell et al., 2022) and WordNet word senses (Lin and Ng, 2022) and ascertained that PLMs lack the ability to perform transitive inference. Jang et al. (2022a) proposed a universal definition of the language model's consistency and a taxonomy of various consistency types. They also created a new benchmark dataset that enables the evaluation of multiple types of consistency on various downstream tasks. They assessed diverse PLMs on the new benchmark and confirmed that, like the studies stated above, none of the PLMs shows consistent behaviour on all test cases. All the aforementioned works investigated the consistency of PLMs that emerged before the advent of LLMs like ChatGPT. To our knowledge, this paper is the first evaluation of LLMs from a consistency viewpoint.

Evaluation Scope
The BECEL dataset provides 19 test sets for assessing five types of consistency on seven downstream tasks. However, we reduced the scope of our experiments, mainly because of the heavy usage of ChatGPT. Specifically, our experiments do not consider additive and transitive consistency, because most PLMs were highly consistent on the former (Jang et al., 2022a), and the latter requires a much more demanding reasoning ability than the other consistency types. Regarding downstream tasks, we used the SNLI (Bowman et al., 2015), RTE (Candela-Quinonero et al., 2006), and MRPC (Dolan and Brockett, 2005) datasets, which contain test cases for measuring semantic, negation, and symmetric consistency. Table 1 shows the size of the test sets for each downstream task and consistency type.

Consistency Evaluation Method
This section briefly describes the process of consistency evaluation using the BECEL dataset. The evaluation consists of two steps. First, the predictions for the original test set and its corresponding perturbed test set are generated. Next, the predictions for the two test sets are compared to measure the consistency.
For the three downstream tasks in our evaluation scope, Jang et al. (2022a) collected the perturbed test sets for semantic and negation consistency evaluation by modifying "sentence 2" for the RTE and MRPC tasks and "hypothesis" for the SNLI task, i.e., generating paraphrases and opposite-meaning sentences for semantic and negation consistency, respectively. They switched the order of the two input texts for symmetric consistency evaluation. Figure 1 illustrates the overall process for measuring the three consistency types on the MRPC task.

Generating ChatGPT Predictions
For test cases where the size of the data exceeds 1K, e.g., the SNLI task and the symmetric consistency test sets of RTE and MRPC, we sampled 200 data points due to the heavy usage of ChatGPT. We conducted zero-shot experiments by using the same prompts designed by EleutherAI. The prompts of each downstream task and examples are presented in Table 2. Our experiments were conducted on the 30 Jan version of ChatGPT by using the pyChatGPT package, an unofficial Python wrapper for OpenAI's ChatGPT API.
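The sampling and prompting steps can be sketched as follows. This is a minimal illustration under stated assumptions: the MRPC and SNLI prompt wording is copied from the example questions quoted elsewhere in this paper, but the line breaks, the function names, and the random seed are assumptions, and the RTE prompt is omitted because its wording does not appear in the text.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def sample_if_large(data: Sequence[T], threshold: int = 1000, k: int = 200,
                    seed: int = 0) -> List[T]:
    """Keep small test sets whole; draw k points from sets larger than threshold.

    The seed is a hypothetical choice made here for reproducibility.
    """
    if len(data) <= threshold:
        return list(data)
    return random.Random(seed).sample(list(data), k)

def mrpc_prompt(sentence1: str, sentence2: str) -> str:
    """Zero-shot MRPC prompt (line breaks are an assumption)."""
    return (f"Sentence 1: {sentence1}\n"
            f"Sentence 2: {sentence2}\n"
            "Question: Do both sentences mean the same thing?\n"
            "Answer:")

def snli_prompt(premise: str, hypothesis: str) -> str:
    """Zero-shot SNLI prompt (line breaks are an assumption)."""
    return f"{premise}\nQuestion: {hypothesis} True, False or Neither?\nAnswer:"

print(len(sample_if_large(range(1500))))
print(mrpc_prompt("He asked for pizzas.", "He requested pizzas."))
```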
Normally, ChatGPT gave us an answer of "True/False/Neither (Entailment/Contradiction/Neutral)" for SNLI, "Yes/No (Equivalent/Not Equivalent)" for MRPC, and "True/False (Entailment/Not Entailment)" for RTE, along with (or without) explanations for the decision. However, we observed a few cases where the output does not follow the aforementioned format but just gives an explanation. We manually reviewed such cases and assigned the appropriate answers.
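A minimal sketch of how such free-text answers could be mapped to labels is given below. The answer-to-label mapping follows the answer formats described above; the parsing rule (inspect only the first word of the output) and all names are assumptions, and outputs that do not start with a known answer return None so that they can be reviewed by hand.

```python
import re
from typing import Optional

# Answer words and their labels, per the answer formats described in the text.
LABEL_MAPS = {
    "snli": {"true": "entailment", "false": "contradiction", "neither": "neutral"},
    "mrpc": {"yes": "equivalent", "no": "not_equivalent"},
    "rte": {"true": "entailment", "false": "not_entailment"},
}

def parse_answer(task: str, output: str) -> Optional[str]:
    """Map a raw model output to a label, or None if manual review is needed."""
    match = re.match(r'\s*"?([A-Za-z]+)', output)
    if match is None:
        return None
    return LABEL_MAPS[task].get(match.group(1).lower())

print(parse_answer("mrpc", "Yes, both sentences mean the same thing."))
print(parse_answer("snli", "Neither (the description does not mention the age.)"))
print(parse_answer("rte", "The passage only states that the event took place."))
```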

Evaluation Metrics
We used the same inconsistency metric as Jang et al. (2022a). Specifically, the metric measures the ratio of predictions that violate the target consistency type. Thus, semantic and symmetric inconsistency count the number of predictions where ChatGPT generates different answers for the original and its corresponding perturbed input. In contrast, negation inconsistency counts the results where the two predictions are the same. Unlike semantic consistency, which holds unconditionally, negation and symmetric consistencies are conditional properties. For example, negation consistency applies when the gold label is "Entailment" for the NLI task and "Equivalent" for the STS task. Regarding symmetric consistency, it applies unconditionally for the STS task and only to "Not Entailment" for the NLI task. As the BECEL dataset already reflects these conditions, Jang et al. (2022a) calculated the inconsistency metrics based on all test data points. However, this can exaggerate the inconsistency of language models if their performance is insufficient. For example, consider the following example of the MRPC task:
S1: In the evening, he asked for six pepperoni pizzas and two six-packs of soft drinks, which officers delivered.
S2: In the evening, he asked for six pizzas and soda, which police delivered.
S2-neg: In the evening, he asked for six pizzas and soda, which police did not deliver.
The gold label of the S1-S2 pair is "Equivalent". However, if the model believes that the answer is "Not Equivalent", then generating "Not Equivalent" as the answer for the S1-S2-neg pair can hardly be considered a violation of negation consistency. We observed that the zero-shot accuracy of ChatGPT is much lower than that of fine-tuned PLMs reported by Jang et al. (2022a). Therefore, we introduce a conditioned inconsistency metric, which only uses data points where ChatGPT makes correct predictions.
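The two metrics can be sketched as a single function. This is an illustrative reading of the description above, not the authors' evaluation code: the function name and signature are hypothetical, and "correct predictions" is assumed to mean that the prediction on the original (unperturbed) input matches the gold label.

```python
from typing import Optional, Sequence

def inconsistency(orig_preds: Sequence[str], pert_preds: Sequence[str],
                  consistency_type: str,
                  gold: Optional[Sequence[str]] = None) -> float:
    """Ratio of prediction pairs that violate the target consistency type.

    Without gold labels this is the original metric; with gold labels only
    points whose original prediction is correct are kept (conditioned metric).
    """
    pairs = list(zip(orig_preds, pert_preds))
    if gold is not None:
        pairs = [(o, p) for (o, p), g in zip(pairs, gold) if o == g]
    if not pairs:
        raise ValueError("no data points left to evaluate")
    if consistency_type == "negation":
        # Negated inputs should change the prediction, so equality is a violation.
        violations = sum(o == p for o, p in pairs)
    else:
        # Semantic and symmetric perturbations should leave the prediction unchanged.
        violations = sum(o != p for o, p in pairs)
    return violations / len(pairs)

gold = ["equivalent"] * 4
orig = ["equivalent", "not_equivalent", "equivalent", "equivalent"]
negated = ["not_equivalent", "not_equivalent", "equivalent", "not_equivalent"]
print(inconsistency(orig, negated, "negation"))        # all four points
print(inconsistency(orig, negated, "negation", gold))  # only the three correct ones
```

In the toy data, the second point is the situation discussed above: the model already answers "not_equivalent" on the original pair, so repeating that answer on the negated pair counts as a violation in the original metric but is excluded from the conditioned one.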
[Figure 2 diagram. Original test data: S1 "There were conflicting reports about the number of casualties yesterday." and S2 "There were sharply conflicting reports tonight on the death toll." Perturbed test data: the same S1 paired with a paraphrase of S2. Steps: (1) generate a paraphrase using ChatGPT; (2) build the perturbed data using the generated paraphrase.]

Semantic Consistency
It is widely known that ChatGPT can perform various NLP tasks, including summarisation, question answering, and paraphrasing. Therefore, in addition to the original BECEL dataset, we generated paraphrased sentences using ChatGPT and used them for evaluation. The overall procedure of this evaluation is illustrated in Figure 2.
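The construction of the self-paraphrased test set can be sketched as below. The `paraphrase` argument stands in for a call to ChatGPT; the toy function used here, the field names, and the choice to perturb only the second sentence (mirroring how the BECEL perturbed sets were built) are assumptions for illustration.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]

def build_perturbed_set(examples: List[Example],
                        paraphrase: Callable[[str], str],
                        field: str = "sentence2") -> List[Example]:
    """Copy each example and replace one field with its paraphrase."""
    perturbed = []
    for example in examples:
        new_example = dict(example)  # leave the original test data untouched
        new_example[field] = paraphrase(example[field])
        perturbed.append(new_example)
    return perturbed

def toy_paraphrase(sentence: str) -> str:
    """Hypothetical stand-in for a ChatGPT paraphrasing request."""
    return sentence.replace("police", "officers")

original = [{
    "sentence1": "He asked for six pepperoni pizzas, which officers delivered.",
    "sentence2": "He asked for six pizzas, which police delivered.",
    "label": "equivalent",
}]
perturbed = build_perturbed_set(original, toy_paraphrase)
print(perturbed[0]["sentence2"])
```

Predictions on `original` and `perturbed` would then be compared pairwise; any disagreement counts as a semantic consistency violation.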
consistency on average than the best-performing fine-tuned PLM. This suggests that ChatGPT is not completely trustworthy regarding semantic consistency. Moreover, we ascertain that ChatGPT is self-contradictory, i.e., it even produces inconsistent outputs for paraphrased inputs generated by itself with a probability of more than 10%. This implies that ChatGPT failed either to generate a proper paraphrased sentence or to capture the meaning of texts delivering the same meaning; either case undermines its reliability. Several examples where ChatGPT violates semantic consistency are presented in Table 5.

Negation Consistency
Table 4 presents the experimental results of the negation consistency evaluation. Compared to the fine-tuned PLMs, ChatGPT attains lower negation inconsistency in all three downstream tasks, an average improvement of 19% over the best-performing fine-tuned PLM, and considerably outperforms the BERT-large model. In addition, the conditioned inconsistency is 3.8% on average, and the model is perfectly consistent on the SNLI task. The results suggest that ChatGPT can better understand negation expressions and antonyms, which has been a critical issue for PLMs trained in a self-supervised fashion (Kassner and Schütze, 2020; Ettinger, 2020; Hossain et al., 2020; Hosseini et al., 2021; Jang et al., 2022b). We believe that incorporating human feedback into ChatGPT training (Ouyang et al., 2022) plays a crucial role in learning the meaning of negation expressions and antonyms, compared to previous PLMs that infer their meaning by simply relying on the context information based on the distributional hypothesis. Investigating the impact of providing human feedback on learning textual meaning is an interesting future research direction. Examples of negation consistency violations are presented in Table 7.

Symmetric Consistency
The results of the symmetric consistency evaluation are described in Table 6. There is a surprising degree of inconsistency in ChatGPT compared to fine-tuned PLMs. Compared to the best-performing PLM, ChatGPT produces three times higher symmetric inconsistency in the MRPC task and five times higher in the RTE task, even in conditioned inconsistency, which reduces the overestimation of inconsistent behaviour. We observe that the extremely high inconsistency for the SNLI task is mainly because ChatGPT fails to distinguish between the labels of "Neutral" and "Contradiction". However, the model is still not completely consistent even in SNLI-2C, which integrates "Neutral" and "Contradiction" into the same class. Although the inconsistency rate might be considered trivial, especially in the SNLI-2C case, the issue should not be overlooked, considering the simple nature of the symmetric property. Consider a model that takes a list of symptoms and generates prescriptions. For such a model that should operate conservatively, it would greatly undermine the model's trustworthiness if it generated entirely different prescriptions whenever the order of symptoms changes, even if such an error occurs with a probability of 2%. Hence, an effort should be made to make LLMs satisfy logical consistencies to enhance their reliability and safe usage in real-world applications. Table 7 presents examples of symmetric consistency violations.
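The SNLI-2C setting can be illustrated with a small sketch. The merge of "Neutral" and "Contradiction" follows the description above, but the name of the merged class and the helper functions are assumptions made for this example.

```python
from typing import List, Sequence

def to_two_class(label: str) -> str:
    """Merge 'neutral' and 'contradiction' into one class, as in SNLI-2C.

    The merged class name 'not_entailment' is a hypothetical choice.
    """
    return "not_entailment" if label in {"neutral", "contradiction"} else label

def count_flips(orig_preds: Sequence[str], swapped_preds: Sequence[str]) -> int:
    """Number of pairs whose prediction changes when the input order is swapped."""
    return sum(o != s for o, s in zip(orig_preds, swapped_preds))

orig = ["neutral", "contradiction", "neutral"]
swapped = ["contradiction", "contradiction", "entailment"]

three_class_flips = count_flips(orig, swapped)
two_class_flips = count_flips([to_two_class(l) for l in orig],
                              [to_two_class(l) for l in swapped])
print(three_class_flips, two_class_flips)
```

Flips between "neutral" and "contradiction" disappear after the merge, while a flip to "entailment" remains a violation in both settings.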

ChatGPT's Explainability
Providing explanations is a core property of trustworthy systems (Huang et al., 2020). It is widely known that generative language models like ChatGPT can provide answers with explanations. However, we observed that while ChatGPT generates plausible explanations, those explanations are not perfectly reliable. Table 8 presents some examples. For the first example, the explanations of the original and perturbed inputs contradict each other. Regarding the second example, the explanation of the perturbed input is not correct, i.e., the input did mention the age and gender of the person pushing the shopping cart ("boy" and "A young man"). These wrong explanations also contribute to undermining ChatGPT's trustworthiness.

Discussion
Can Prompt Design be a Solution? Prompts are input text consisting of a task demonstration and, for a few-shot task, some examples (Lester et al., 2021). Prompt design has been shown to be an effective method of regulating the behaviour of GPT-3 (Brown et al., 2020). Hence, one might argue that searching for an optimal prompt for each task can improve consistency. However, we are sceptical of this claim. The consistency metrics could be improved with different prompts, but we believe that prompt design cannot fundamentally resolve the inconsistency problem, because it cannot go beyond inductive reasoning. The underlying idea behind prompt design is that prompts created by experimenters might not be optimal, because language models might have acquired target information from completely different contexts (Jiang et al., 2020). That is, prompt design can be regarded as maximising the generalisation effect by searching for the prompts most closely related to how the target task appeared during training. As a result, no matter how well prompt design allows us to find the best prompt that maximises the generalisation effect, it cannot resolve the issue, as our experimental results suggest that various consistency properties are not reflected in ChatGPT's inductive bias. Moreover, consistency improvements with prompt design can be considered another violation of semantic consistency, because the prompts deliver identical semantic meaning, i.e., the task description.
Data Augmentation is Not Sustainable. Creating new data points based on certain consistency types and using them for training (Asai and Hajishirzi, 2020; Hosseini et al., 2021) or consistency regularisation (Wang and Henao, 2021; Zheng et al., 2021; Kim et al., 2021) is the most widely used approach to reflect logical consistency in the model's inductive bias. This remedy, however, is unsustainable. First, the data augmentation process requires a tremendous effort. For simple consistency types, e.g., symmetric and negation consistency, generating or collecting data points is relatively simple, but for complex consistency types, such as transitive and semantic consistency, it can be extremely challenging to cover all possible variations. Second, even if we successfully expand the data, it is doubtful whether we can afford to update an LLM on the new dataset. Considering the ever-changing character of language, the data expansion and update of an LLM should be performed continuously. However, training an LLM entails tremendous financial and environmental costs (Bender et al., 2021). For instance, training a BERT-base model without hyperparameter tuning, which is 1590 times smaller than ChatGPT, emits 650 kg of CO2, which is comparable to flying from New York to San Francisco for one passenger (Strubell et al., 2019). A simple expectation of the CO2 emission for re-training ChatGPT is 1033 t, while a human is responsible for 5 t of CO2 emission per year. Therefore, it is desirable to enlarge our viewpoint beyond LLMs to implement sustainable remedies that can fundamentally solve the inconsistency problem, particularly in a modern society facing the global climate crisis.
Examples from Table 8:
TASK: MRPC, CONSISTENCY TYPE: Symmetric
ORIGINAL QUESTION: "Sentence 1: In the evening, he asked for six pepperoni pizzas and two six-packs of soft drinks, which officers delivered. Sentence 2: In the evening, he asked for six pizzas and soda, which police delivered. Question: Do both sentences mean the same thing? Answer:"
ORIGINAL OUTPUT: "Yes, both sentences mean the same thing. They convey that in the evening, the person made a request for six pepperoni pizzas and two six-packs of soft drinks, and that the officers delivered the items. The second sentence is a simplified version of the first one, where 'pizzas' refers to pepperoni pizzas, and 'soda' refers to soft drinks."
PERTURBED QUESTION: "Sentence 1: In the evening, he asked for six pizzas and soda, which police delivered. Sentence 2: In the evening, he asked for six pepperoni pizzas and two six-packs of soft drinks, which officers delivered. Question: Do both sentences mean the same thing? Answer:"
PERTURBED OUTPUT: "No, the two sentences do not mean the same thing. The first sentence says that he asked for six pizzas and soda, while the second sentence says that he asked for six pepperoni pizzas and two six-packs of soft drinks. The type of pizza and the specific drinks requested are different in the two sentences."
TASK: SNLI, CONSISTENCY TYPE: Semantic-BECEL
ORIGINAL QUESTION: "An enthusiastic boy pushes a shopping card containing a smiling boy. Question: A boy is pushing a shopping cart. True, False or Neither? Answer:"
ORIGINAL OUTPUT: "True"
PERTURBED QUESTION: "An enthusiastic boy pushes a shopping card containing a smiling boy. Question: A young man is pushing a shopping cart. True, False or Neither? Answer:"
PERTURBED OUTPUT: "Neither (the description does not mention the age or gender of the person pushing the shopping cart, only that the person inside the cart is a smiling boy.)"
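The CO2 figures quoted in the discussion of data augmentation can be checked with simple arithmetic. The text does not state how the 1033 t estimate was derived; the sketch below assumes that emissions scale linearly with model size, which reproduces the reported number. The variable names are hypothetical.

```python
# Figures quoted in the text.
BERT_BASE_CO2_KG = 650        # CO2 for training BERT-base without tuning
SIZE_RATIO = 1590             # ChatGPT is stated to be 1590 times larger
HUMAN_CO2_T_PER_YEAR = 5      # CO2 attributed to one human per year

# Assumption: emissions grow linearly with model size.
chatgpt_co2_t = BERT_BASE_CO2_KG * SIZE_RATIO / 1000   # kilograms to tonnes
person_years = chatgpt_co2_t / HUMAN_CO2_T_PER_YEAR

print(f"Estimated re-training emission: {chatgpt_co2_t:.1f} t CO2")
print(f"Equivalent to about {person_years:.0f} person-years of emissions")
```

Under this linear-scaling assumption, one re-training run corresponds to roughly two hundred years of a single person's emissions.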

Summary and Outlook
The recent advent of ChatGPT is accelerating the developments in the NLP field driven by LLMs. Its outstanding performance captured considerable attention, resulting in many articles, posts, and analyses highlighting ChatGPT's positive aspects across numerous media. There are others, however, who question its reliability based on the model's faulty behaviours. To this end, this study examines the trustworthiness of ChatGPT in terms of the language model's consistency.
We have investigated the consistency behaviour of ChatGPT across three consistency types and downstream tasks. Our experimental results demonstrated that ChatGPT achieves a certain level of enhanced language understanding ability, especially regarding negation expressions and antonyms, showing considerable improvements in negation consistency compared to earlier PLMs. However, contrary to the widespread belief regarding the outstanding performance of ChatGPT, its overall consistency falls short of expectations. It frequently changes its decision when an input text is replaced with a paraphrased sentence, even when the paraphrase is generated by ChatGPT itself, i.e., the model is self-contradictory. Moreover, in input-order invariant tasks, ChatGPT is likely to make a different decision when the order of the input sentences is switched. Given how simple and natural symmetric consistency is in human reasoning, violating it is a huge blow to ChatGPT's reliability. These fallacious behaviours are critical in domains that operate conservatively and at high risk. Although LLMs are a revolutionary technique that brought an unprecedented era to NLP, such issues should be resolved before ChatGPT is used in real applications, particularly considering the huge economic and environmental costs of training and inference of LLMs.

Figure 2: Overall process of measuring semantic consistency by using paraphrases generated by ChatGPT.

Table 2: Format and example of prompts used in our experiments for each downstream task.

Table 5: Examples of semantic consistency violation.

Table 6: Experimental results of the symmetric consistency evaluation. τ and τ_C denote the original and conditioned symmetric inconsistency, respectively. The best performance is in bold.
The dead cavalry have been honored for more than a century with a hilltop granite obelisk and white headstones.
S2: The dead cavalry have been honored for more than a century with a hilltop granite obelisk and white headstones.
S2: The dead cavalrymen are honored with a hilltop granite obelisk and white headstones.

Table 7: Examples of negation and symmetric consistency violations.

Table 8: Examples of ChatGPT's output with explanations.