A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multimodal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts via an intermediate code generation step. Moreover, we find that ChatGPT is 63.41% accurate on average across 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs, and it generates more extrinsic hallucinations from its parametric memory as it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e., 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion. We also release the codebase for evaluation set extraction.


Introduction
ChatGPT is a successor of the large language model (LLM) InstructGPT (Ouyang et al., 2022) with a dialog interface that is fine-tuned using the Reinforcement Learning with Human Feedback (RLHF) (Christiano et al., 2017) approach. 1 In the last couple of months, ChatGPT has gathered a user base of close to 1 million (Hu, 2023) and is being used by businesses and consumers alike for a myriad of mostly textual tasks. One reason for its unprecedented popularity is that ChatGPT, through its scale and via RLHF, has shown impressive abilities in many areas of NLP as well as emergent abilities such as code generation and multimodal generation. Another reason is that its dialog interface allows users to interact with the underlying large language model more effectively and efficiently via interactive chats that are akin to multi-turn prompt engineering.
However, despite its powerful abilities, anecdotal reports on ChatGPT have consistently shown significant remaining challenges: for example, it fails in some elementary mathematical (Gilson et al., 2022; Goldberg, 2023; Frieder et al., 2023; Choi et al., 2023; Davis, 2023) and commonsense reasoning tasks (Guo et al., 2023; Davis, 2023); it hallucinates with human-like fluency and eloquence on things that are not based on truth (Shen et al., 2023; Thorp, 2023; Smith, 2023); and as a general-purpose language model trained from everything on the web, its language coverage is questionable (Lu et al., 2022; Jiao et al., 2023). OpenAI has listed many limitations of ChatGPT on its website. 2 Its CEO tweeted that "It's a mistake to be relying on [ChatGPT] for anything important right now" (Altman, 2022). Many researchers have argued that, despite appearances, LLMs like ChatGPT are only good at language abilities, not actual reasoning (Mahowald et al., 2023).
Consequently, it is not clear what people can or cannot use it for despite its popularity. For users and researchers alike, it would be beneficial to have a sense of confidence in its reliability in various NLP/AI tasks.
Previous works have discussed the ethical implications or concerns associated with ChatGPT (and other LLMs) (Jabotinsky and Sarel, 2022; Susnjak, 2022; Blanco-Gonzalez et al., 2022; Aydın and Karaarslan, 2022; Jeblick et al., 2022). However, there has not been much technical evaluation of the strengths and limitations of ChatGPT 3 . To fill this gap, we conduct experiments on ChatGPT with samples from standard public test sets on major NLP tasks such as question answering, reasoning, summarization, machine translation, automatic post-editing, sentiment analysis, language identification, task-oriented dialogue (dialogue state tracking & response generation), and misinformation detection. We evaluate its multilingual performance as well as its vision-language multimodal abilities. With additional experiments, we also quantitatively evaluate its primary limitations in reasoning and hallucination. In addition, we conduct experiments to test its multi-turn interactivity as a means for better prompt engineering. We hope to provide insights to users of ChatGPT on the above-mentioned strengths and limitations, as well as on how they can improve outcomes with interactivity. (Note that we are not able to quantitatively evaluate the RLHF aspect of ChatGPT without access to the user log. We hope that such evaluations can be carried out in the future in collaboration with OpenAI.) The following are the major insights we have gained from the evaluations:

Multitask, Multimodal, and Multilingual
• For 9/13 NLP datasets, ChatGPT outperforms previous LLMs with zero-shot learning. It even outperforms fully fine-tuned task-specific LMs on 4 different tasks. In the other cases, ChatGPT is on par with or slightly below LMs fully fine-tuned for specific NLP tasks;
• ChatGPT fails to generalize to low-resource and extremely low-resource languages (e.g., Marathi, Sundanese, and Buginese). There is an overall performance degradation in low-resource languages, especially those in non-Latin scripts, in the case of translation; its weakness lies in the generation rather than the understanding part of the translation process;
• ChatGPT can use code as an intermediate medium to bridge vision and language, even though its multimodal ability is still elementary compared to vision-language models.
Reasoning We tested 10 different reasoning categories with 600 samples in total. Based on our experiments, ChatGPT shows more weakness in inductive reasoning than in deductive or abductive reasoning. ChatGPT also lacks spatial reasoning while showing better temporal reasoning. ChatGPT also lacks mathematical reasoning, which aligns with recent findings by Frieder et al. (2023). Further, we find that ChatGPT is relatively better at commonsense reasoning than non-textual semantic reasoning. Finally, while ChatGPT shows acceptable performance in causal and analogical reasoning, it is weak at multi-hop reasoning, similar to other LLMs' weakness in complex reasoning (Ott et al., 2023).
Hallucination Similar to other LLMs (Radford et al., 2019; Muennighoff et al., 2022; Workshop et al., 2022), ChatGPT suffers from the hallucination problem. It generates more extrinsic hallucinations, i.e., factual statements that cannot be verified from the source, from its parametric memory across all tasks, since it does not have access to external knowledge bases.
Interactivity One of the primary differentiating factors of ChatGPT from its predecessors is its multi-turn dialog interactivity. This enables ChatGPT to perform multiple tasks within a dialog session. There is also significant performance improvement (8% ROUGE-1 on summarization and 2% ChrF++ on low-resource machine translation) via multi-turn interactivity in various standard NLP tasks. This process is akin to prompt engineering with feedback from the system.
Organization of This Paper: We first provide an overview of ChatGPT and related work (§2). Then, we provide evaluation results for ChatGPT on various application test sets, on multilingual test sets, and on a new multimodal task in §3. We then explore the three main strengths and weaknesses of ChatGPT, namely reasoning (§4), hallucination (§5), and interactivity (§6), in the subsequent three sections. Finally, we discuss and draw conclusions from our findings on ChatGPT.
2 Background and Related Work

Large Pretrained Models
Large Language Models (LLMs) are language models with parameter sizes of over a hundred billion, beginning with the introduction of GPT-3. Examples of LLMs include, but are not limited to, GPT-3, Gopher (Rae et al., 2021b), Megatron (Shoeybi et al., 2019), GPT-Jurassic (Lieber et al., 2021), and OPT-175B (Zhang et al., 2022). Beyond fine-tuning models with task-specific data, LLMs have shown robustness and generalizability through zero-shot and few-shot in-context learning. Scaling up the models unlocked new, emergent abilities that were not observed with smaller models (Wei et al., 2022a). Prompts are used to probe LLMs to generate the target outcome by sampling the language distribution. To enable the LLMs to demonstrate their abilities, sophisticated prompt engineering (NeuralMagic, 2023) is required. However, previous LLMs only allow one-time probing, which means the target outcome can vary a great deal with minor changes in the prompt instruction.
Whereas scaling up LLMs improves generalizability, generic LLMs may fall short in specific applications. Despite its name, ChatGPT has not been primarily used as a chatbot. Its dialog ability serves as the user interface to the underlying LLM. We nevertheless refer to other dialog systems in this paper. A number of large pre-trained dialogue models have been created, following the pre-train-then-fine-tune paradigm. LaMDA (Thoppilan et al., 2022) is a large-scale conversational model, fine-tuned from an LLM with a parameter size of 134 billion. Blenderbot 3.0 (Shuster et al., 2022), scaled up to 175 billion parameters, is also introduced with similar abilities to LaMDA. Both models are pre-trained on public dialogue and other public web documents and then fine-tuned with manually curated dialogue data. They also have access to external knowledge sources for information retrieval; thus, they have shown an excellent ability for fluent and natural dialogue generation as well as information retrieval. However, the aforementioned large dialogue models suffer from catastrophic forgetting of the knowledge obtained during pre-training. Models after fine-tuning show stable and strong performance on specific tasks, but they only preserve the knowledge learned from the task-specific data while losing generalization ability. ChatGPT, on the other hand, was trained on a large-scale conversational-style dataset constructed from web documents directly (Schulman et al., 2022), which unifies the pre-training and fine-tuning data format. Thus, ChatGPT is able to preserve the knowledge from pre-training and produce informative outputs without access to external knowledge sources.

ChatGPT
Compared to existing LLMs, ChatGPT has unique characteristics. First, it has the ability to interact with users in a conversation-like manner, while retaining the accumulated knowledge and generalization ability gained from pre-training. This is achieved by pre-training ChatGPT on a large-scale conversational-style dataset that is constructed by transforming a large-scale instruction-tuning corpus used for building InstructGPT into a conversational format, then fine-tuning the model based on a reward model to further improve the generation quality and align the generation with human preference. ChatGPT should be considered a generic language model which can be probed in a conversational manner. The biggest advantage of such conversational interaction is that, unlike previous LLMs, ChatGPT can intelligently "answer follow-up questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests" (Schulman et al., 2022).
Second, ChatGPT is trained with a better human-aligned objective function via Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017). Conventional natural language generation models, including dialogue models, are trained with maximum likelihood estimation (MLE) and might not be aligned with human preferences. For instance, for dialogue systems, humanness, engagement, and groundedness are some examples of essential criteria for success. Such a discrepancy between training objectives and evaluation criteria becomes a bottleneck to performance improvement. By using RLHF, ChatGPT aligns more closely with human preferences in generating text than by using MLE.
As ChatGPT has become available to public users through an easily accessible UI, there have been many discussions from a wide range of communities, not just from AI or NLP, but also from other disciplines. One line of discussion concerns the specific emergent abilities and strengths of ChatGPT from a more technical perspective. Guo et al. (2023) conduct linguistic analyses and human evaluations of ChatGPT's writing against human experts with their proposed corpus, named the Human ChatGPT Comparison Corpus, and find that ChatGPT responses are strictly focused on the given question, more formal, objective, and less emotional. Nov et al. (2023) study whether ChatGPT's generated medical advice passes the Turing test. Frieder et al. (2023) investigate the mathematical capabilities of ChatGPT on both publicly available and hand-crafted datasets, including graduate-level mathematics, and show that its capabilities are "significantly below those of an average mathematics graduate student." There are many investigations of ChatGPT's understanding and potential applications in different fields such as law (Choi et al., 2023), the medical domain (Blanco-Gonzalez et al., 2022; Jeblick et al., 2022), and finance (Birch, 2022; Dowling and Lucey, 2023). Jeblick et al. (2022) conduct a case study of the application of ChatGPT to simplified radiology reports. Another important line of discussion is the ethical concerns over the use of ChatGPT. The most active discussion is over its use in academic writing and exam integrity (Jabotinsky and Sarel, 2022; Susnjak, 2022). OpenAI also discusses the misuse of LMs for disinformation and possible remedies. 4 Zhuo et al. study the AI ethics of ChatGPT with respect to bias, reliability, robustness, and toxicity.

LLM benchmark and evaluation
With the advancement of LLMs' generalization ability, there have been efforts to understand their capabilities, limitations, and risks. Recently, several benchmarks collecting a large number of NLP datasets, such as BIG-Bench (Srivastava et al., 2022) and the EleutherAI LM Harness (Gao et al., 2021), have been introduced. Moreover, HELM (Liang et al., 2022) is proposed to conduct a holistic evaluation of LLMs that considers scenarios and metrics with a top-down approach. In this work, we instead focus on specific limitations and unique findings of ChatGPT that have not been discussed for previous LLMs. It is difficult to evaluate ChatGPT on the whole test sets from such benchmarks due to limited access to ChatGPT 5 .
There are also other works that discuss LLMs' emergent abilities through thorough surveys or case studies. Mahowald et al. (2023) thoroughly study LLMs' capabilities by distinguishing formal and functional linguistic competence with reference to cognitive science, psychology, and NLP to clarify the discourse surrounding LLMs' potential. Other works focus on more specific abilities such as mathematical skills (Davis, 2023) and reasoning (Webb et al., 2022a; Qiao et al., 2022). There have also been overviews of existing LLMs (Gozalo-Brizuela and Garrido-Merchan, 2023; Wolfe, 2023).

3 Multitask, Multilingual, and Multimodal Evaluations of ChatGPT

Evaluating the Multitask Ability of ChatGPT
ChatGPT has become very well-known in a short period of time to general public users, not just those in the AI, machine learning, and NLP communities who might be more familiar with LLMs. One of the main reasons is that, in addition to media reports, innumerable use cases of ChatGPT are shared by both non-academic and academic users online (Marr, 2022; Gordon, 2023; Shankland, 2023). There have been debates and panels on whether ChatGPT is approaching Artificial General Intelligence (AGI), as it seems to be able to carry out a multitude of tasks without specific fine-tuning (Desk, 2023; Johnson, 2023; Kingson, 2023). On the other hand, there has also been as much sharing of its failures in simple tasks (Gilson et al., 2022; Choi et al., 2023; Shen et al., 2023). Instead of relying on anecdotal examples, we first evaluate ChatGPT's performance on various standard NLP tasks in a zero-shot manner to obtain a better understanding of its multitask ability. We compile results from the existing literature on ChatGPT and compare them with state-of-the-art fully fine-tuned and zero-shot models across multiple tasks. We evaluate ChatGPT's performance on 21 datasets covering 8 tasks, i.e., summarization, machine translation, sentiment analysis, question answering, task-oriented dialogue, open-domain knowledge-grounded dialogue, and misinformation detection. For ChatGPT, we sample testing cases from existing standard test sets for each task, with sample sizes ranging from 30 to 200 per task.

Multitask Generalization of ChatGPT
The results of the multitask evaluation are shown in Table 1. ChatGPT achieves remarkable zero-shot performance on multiple tasks, surpassing previous state-of-the-art zero-shot models on 9 out of the 13 evaluation datasets with reported zero-shot LLM performance. In most tasks, especially task-oriented and knowledge-grounded dialogue tasks, task-specific fully fine-tuned models outperform ChatGPT. Compared to the latter, ChatGPT yields lower performance in most tasks while still surpassing their performance on 4 evaluation datasets. Furthermore, from the evaluation results, we also observe several limitations of ChatGPT, e.g., 1) limited language understanding and generation capabilities in low-resource languages, 2) lacking reasoning ability, as shown by the QA results, and 3) limited performance on task-oriented and knowledge-grounded dialogue tasks. More detailed experimental setups and analyses for each task are given in the following subsections (§3.1.1: experiment details and results; §3.1.2: ChatGPT on dialogue systems). We also provide the complete list of all the datasets used in our evaluation in Appendix C.

Summarization As shown in Table 1, ChatGPT achieves similar zero-shot performance to text-davinci-002, which is expected since they evolved from the same GPT-3 pre-trained checkpoint. However, the fine-tuned BART still outperforms zero-shot ChatGPT by a large margin. Furthermore, we evaluate ChatGPT's unique interaction capabilities in §6.
Sentiment Analysis Sentiment analysis has been widely explored for both high-resource and low-resource languages (Wang et al., 2018a; Wilie et al., 2020; Ilmania et al., 2018). We explore the sentiment analysis ability of ChatGPT through 4 languages with diverse amounts of resources in NusaX (Winata et al., 2022): English (eng), Indonesian (ind), Javanese (jav), and Buginese (bug). For each language, we sample 50 sentences from the corresponding dataset for our experiment and measure the macro F1 score as the evaluation metric. We compare the results with two baselines, i.e., the supervised state-of-the-art performance from Winata et al. (2022) and the zero-shot multilingual LLM from Cahyawijaya et al. (2022). ChatGPT outperforms the previous state-of-the-art zero-shot model by a large margin in all languages except Buginese, where it performs on par. This shows that ChatGPT still has a limited understanding of extremely low-resource languages.
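For reference, the macro F1 metric used above averages per-class F1 scores so that low-frequency classes count as much as frequent ones. A minimal sketch (the label names are illustrative, not NusaX annotations):

```python
def macro_f1(gold, pred):
    """Macro F1: the unweighted mean of per-class F1 scores."""
    labels = set(gold) | set(pred)
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

gold = ["pos", "neg", "neu", "pos"]
pred = ["pos", "neg", "pos", "pos"]
print(round(macro_f1(gold, pred), 3))  # → 0.6
```

Because the per-class scores are averaged without frequency weighting, a model that ignores a rare class is penalized heavily, which is why macro F1 is the standard choice for the class-imbalanced NusaX-style setting.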
Question Answering Since Question Answering (QA) is a broad topic, we classify QA datasets into different categories based on the knowledge/reasoning type required for the task, e.g., commonsense reasoning, spatial reasoning, temporal reasoning, etc., to obtain a clearer analysis of ChatGPT's abilities. For each category, we select several datasets, and for each dataset, we sample 30 instances and test ChatGPT on the subset. Details of each dataset are described in the corresponding subsections of §4. Furthermore, we inspect the rationales ChatGPT provides for its answers. Some of them are discussed in detail in the corresponding section (§4). Based on our experimental results, ChatGPT outperforms the existing zero-shot and some of the fine-tuned state-of-the-art models on question answering. Furthermore, ChatGPT achieves near-perfect scores on three tasks, i.e., bAbI task 15, EntailmentBank, and Pep-3k.
Misinformation Detection We test ChatGPT's ability to detect misinformation with test sets consisting of scientific and social claims related to COVID-19 (Lee et al., 2021), with 100 samples in total: half from the scientific (covid-scientific) set and the other half from the social (covid-social) set. We evaluate veracity accuracy by manually checking the generated text. ChatGPT detects misinformation with 92% (46/50) accuracy on covid-scientific and 73.33% (22/30, excluding verification-refusing cases) accuracy on covid-social.

ChatGPT on Dialogue Tasks
Given that ChatGPT has the ability to generate conversation-like responses, it is interesting to test its ability in response generation in different dialogue settings: 1) Knowledge-Grounded Open-Domain Dialogue and 2) Task-Oriented Dialogue (TOD).

Knowledge-Grounded Open-Domain Dialogue
Open-domain dialogue systems interact with humans through automatically generated responses and aim to provide users with an engaging experience. To boost informativeness, these systems leverage external knowledge, including structured knowledge such as knowledge graphs (Zhao et al., 2020; Ji et al., 2022c) and unstructured knowledge such as free text (Xu et al., 2022).
To quantitatively measure ChatGPT's performance on knowledge-grounded dialogue, we apply it to 50 samples randomly selected from the test set of OpenDialKG (Moon et al., 2019), which contains open-ended dialogues grounded on a knowledge path. We use the following instruction for this KGD task: "Can we try dialogue generation? I will give you turns, and you can generate the next turn, but only one.\n\nYou can also consider the knowledge of XXX for your reference in the dialogue." According to human judgment, the responses from ChatGPT are of high quality, with fluent response generation as well as incorporation of the provided knowledge. However, the automatic evaluation results in Table 2 are relatively low compared with GPT-2 (Radford et al., 2019), which is fine-tuned on this dataset. Specifically, ChatGPT obtains a 4.05 BLEU and an 18.62 ROUGE-L score, as the generated responses tend to be longer than the golden answers. For FeQA, which measures the generated response's faithfulness to the input source, ChatGPT gets 15.03, since some generated responses include content from the parametric knowledge injected during pre-training.
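The KGD probe above can be templated over the dialogue history and the grounded knowledge path. A minimal sketch of how such a prompt might be assembled (the helper name and example turns are ours, not from OpenDialKG):

```python
def build_kgd_prompt(turns, knowledge):
    """Assemble a knowledge-grounded dialogue prompt.

    The instruction wording follows the probe quoted in the text;
    the helper itself is our illustration, not released code.
    """
    instruction = (
        "Can we try dialogue generation? I will give you turns, and you can "
        "generate the next turn, but only one.\n\n"
        f"You can also consider the knowledge of {knowledge} for your "
        "reference in the dialogue."
    )
    # The dialogue history is appended turn by turn after the instruction.
    return instruction + "\n\n" + "\n".join(turns)

prompt = build_kgd_prompt(
    ["A: Do you like Iron Man?", "B: Sure, Robert Downey Jr. is great."],
    "Robert Downey Jr. starred_in Iron Man",  # hypothetical knowledge path
)
print(prompt)
```

The knowledge path is injected in place of the "XXX" placeholder, and the model is then expected to emit exactly one next turn.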
Task-Oriented Dialogue In task-oriented dialogue (TOD), a model needs to fulfill a specific objective by interacting in natural language with the user. This task is often split into three modules: natural language understanding with belief state tracking, decision-making through dialogue policies, and response generation; a modular approach handles each of these steps with a different model. Meanwhile, unified approaches are starting to show increasingly strong performance (Hosseini-Asl et al., 2020; Peng et al., 2021). Although ChatGPT seems more appropriate for open-domain dialogue tasks, we investigate and discuss how ChatGPT's emergent abilities and interactivity could potentially be leveraged for TOD as well. We explore two setups: A) a modular approach, testing both dialogue state tracking and response generation using oracle actions; and B) a unified approach, directly simulating the TOD interaction while leveraging information in a structured database. We provide examples of the modular and unified approaches in Appendix F.

Setup A: Modular Approach We investigate ChatGPT's ability for both dialogue state tracking and response generation on 50 dialogue-turn samples taken from MultiWOZ 2.2 (Zang et al., 2020). In detail, we ask the model to provide the belief state as domain-intent: [slot1, value1], . . . in the prompt, following previous zero-shot (Lin et al., 2021) and few-shot (Madotto et al., 2021) approaches, and provide an exhaustive list of domain-intent-slot-value options for the given dialogue. For response generation, we provide only the oracle dialogue actions (e.g., 'Hotel-Inform': ['area', 'centre']) and ask ChatGPT to generate a TOD response given the dialogue history. We assess DST with joint goal accuracy (JGA), the ratio of dialogue turns whose predicted dialogue states fully match the ground truth, and response generation with BLEU and inform rate (%). As shown in Table 3, the performance for DST is mediocre, with a JGA of 28%, but many of the failure cases come from the intent being misclassified while the slot-values are inferred correctly. We postulate that the prompt information about slot-values overwhelms that about intents (which is 4-5 times smaller in the prompt); the JGA reaches 56% when considering only domain-slot-value triplets' accuracy. For response generation, ChatGPT successfully leverages all the information provided while answering the questions, with an 83% inform rate and a 7.06 BLEU score. BLEU is computed directly on the lexicalized response, as ChatGPT skips delexicalized generation, and the generation is often as natural as, if not more natural than, the gold response.
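For clarity, JGA gives no partial credit: a turn counts as correct only if the entire predicted belief state matches the gold state. A minimal sketch (the state representation below is illustrative, not the exact MultiWOZ format):

```python
def joint_goal_accuracy(gold_states, pred_states):
    """JGA: fraction of turns whose predicted belief state matches the
    gold state exactly (every slot-value pair; no partial credit)."""
    assert len(gold_states) == len(pred_states)
    correct = sum(1 for g, p in zip(gold_states, pred_states) if g == p)
    return correct / len(gold_states)

# Each state maps (domain-intent, slot) -> value; keys/values are toy examples.
gold = [
    {("hotel-inform", "area"): "centre", ("hotel-inform", "stars"): "4"},
    {("restaurant-inform", "food"): "italian"},
]
pred = [
    {("hotel-inform", "area"): "centre", ("hotel-inform", "stars"): "4"},
    {("restaurant-inform", "food"): "chinese"},  # one wrong value fails the turn
]
print(joint_goal_accuracy(gold, pred))  # → 0.5
```

The all-or-nothing comparison explains the gap reported above: relaxing the match to domain-slot-value triplets (ignoring the misclassified intent) is what lifts the score from 28% to 56%.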
Setup B: Unified Approach We explore ChatGPT's ability to simulate a TOD interaction in an end-to-end manner by providing nothing more than a structured database and the instruction "Use the following knowledge base to complete the task of recommending a restaurant as a task-oriented dialogue system". In this setup, we investigate whether ChatGPT is able to complete basic retrieval queries and respond to users' requests such as "Give me some restaurants that serve Italian food" or "I would prefer cheap options please". However, we observe several limitations, as follows.
• Long-term Multi-turn Dependency: ChatGPT cannot keep the belief state across multiple turns within the interaction. For instance, asking for restaurants with a rating of 3 or higher will overwrite the belief state from a previous turn that asked for Italian food. However, if the user explicitly asks to recall the earlier preferences, ChatGPT is able to correct the retrieved information and incorporate the previous belief state. This is interesting, as it shows that the information given in previous turns is still usable, but needs to be recalled explicitly.

Table 4: Statistics of the languages used in our language disparity experiment. HRL denotes high-resource language, MRL medium-resource language, LRL low-resource language, and X-LRL extremely low-resource language.
• Basic Reasoning Failure: ChatGPT's responses tend to be wrong if the query requires a basic level of reasoning, such as when it is asked for "recommendations for restaurants with European food" (ChatGPT has to filter cuisine types by country) or "recommendations for restaurants with a rating of 3 or higher" (ChatGPT needs to understand that this means ratings 3, 4, and 5). Even with a basic knowledge base, ChatGPT fails to answer correctly 66% of the time.
• Extrinsic Hallucination: ChatGPT tends to generate hallucinated information beyond the given knowledge. This is especially harmful in TOD, as ChatGPT will sometimes hallucinate prices for hotel bookings or availability for restaurants.
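The retrieval queries that trip up ChatGPT in this setup are trivial to state programmatically, which highlights the gap. A minimal sketch over a toy knowledge base (all entries, field names, and the cuisine grouping are ours, not from the actual database used):

```python
# Toy structured knowledge base; entries and fields are illustrative.
restaurants = [
    {"name": "Trattoria Roma", "cuisine": "italian", "rating": 4, "price": "cheap"},
    {"name": "Le Petit Bistro", "cuisine": "french", "rating": 5, "price": "expensive"},
    {"name": "Noodle House", "cuisine": "chinese", "rating": 2, "price": "cheap"},
]

# "European food" requires mapping cuisines to regions (hypothetical grouping).
EUROPEAN_CUISINES = {"italian", "french", "spanish", "greek"}

def query(kb, min_rating=None, european=None, price=None):
    """Filter the KB the way the user requests in Setup B ask ChatGPT to."""
    results = kb
    if min_rating is not None:
        # "rating of 3 or higher" means ratings 3, 4, and 5.
        results = [r for r in results if r["rating"] >= min_rating]
    if european:
        results = [r for r in results if r["cuisine"] in EUROPEAN_CUISINES]
    if price is not None:
        results = [r for r in results if r["price"] == price]
    return [r["name"] for r in results]

print(query(restaurants, min_rating=3))                  # → ['Trattoria Roma', 'Le Petit Bistro']
print(query(restaurants, european=True, price="cheap"))  # → ['Trattoria Roma']
```

A multi-turn TOD system would accumulate these filters into a persistent belief state; the failure described above corresponds to a new filter replacing, rather than being conjoined with, the previous ones.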

Evaluating Multilinguality of ChatGPT
Training data size affects the language understanding and generation quality of LMs (Radford et al., 2019; Raffel et al., 2022; Cahyawijaya et al., 2021; Rae et al., 2021a; Workshop et al., 2022; Chowdhery et al., 2022; Hoffmann et al., 2022). As an LLM, the same premise also applies to ChatGPT, and the question is to what extent. We investigate this question through a series of experiments by analyzing 1) the language understanding capability using two different tasks, i.e., language identification (LID) and sentiment analysis, and 2) the language generation capability through machine translation using English as the pivot language. Based on the percentage of data in CommonCrawl 8 , we group languages into 3 categories: high-resource (>1%), medium-resource (>0.1%), and low-resource (<0.1%). The statistics of all the languages under study are shown in Table 4.
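The grouping can be expressed directly from the thresholds above (the function name and example shares are ours):

```python
def resource_category(commoncrawl_share_pct):
    """Map a language's share of CommonCrawl text (in %) to its
    resource level, using the thresholds stated in the text."""
    if commoncrawl_share_pct > 1.0:
        return "high-resource"
    if commoncrawl_share_pct > 0.1:
        return "medium-resource"
    return "low-resource"

# Illustrative shares, not exact CommonCrawl figures.
print(resource_category(3.3))   # → high-resource
print(resource_category(0.6))   # → medium-resource
print(resource_category(0.03))  # → low-resource
```

Languages absent from CommonCrawl's language index altogether (such as Buginese, discussed below) fall outside even the low-resource bucket, which is why we treat them as extremely low-resource.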

Language Understanding
We propose a framework for investigating the language understanding ability of ChatGPT through 3 languages from different language categories in NusaX (Winata et al., 2022), i.e., English (eng), Indonesian (ind), and Javanese (jav). In addition, we incorporate an extremely low-resource language from NusaX, i.e., Buginese (bug), which is not even listed on CommonCrawl, since the LID used in CommonCrawl 9 , i.e., CLD2 (Ooms, 2023), does not support Buginese. We sample 50 sentences per language from the corresponding dataset for our experiment.
ChatGPT fails to generalize to extremely low-resource languages As shown in Table 5, ChatGPT achieves 84%, 80%, 78%, and 56% accuracy for English, Indonesian, Javanese, and Buginese, respectively. This result supports prior works on LLMs (Chowdhery et al., 2022; Workshop et al., 2022; Muennighoff et al., 2022), which show that LLMs, including ChatGPT, yield lower performance for lower-resource languages. Interestingly, the performance gaps among English, Indonesian, and Javanese are marginal compared to the gap with Buginese. This result suggests that ChatGPT still has limitations in generalizing to extremely low-resource languages.
ChatGPT understands sentences in low-resource languages but lacks the ability to identify the language ChatGPT correctly classified the language for English and Indonesian 100% of the time, while for Javanese and Buginese, it either misclassifies the samples as other languages or is unable to determine the language (100% of the time for Javanese and 88% for Buginese). Meanwhile, its sentiment analysis performance on Javanese is only slightly lower compared to English and Indonesian, which suggests that ChatGPT can understand the semantic meaning of sentences in low-resource languages, such as Javanese, without having enough knowledge to identify the language itself.
ChatGPT displays more human-preferred responses As shown in Table 7, ChatGPT lets the user know that its prediction is uncertain when it does not completely understand the language, and it also provides broader information about the language, such as the location and the community by which the predicted language is spoken. This provides evidence of the benefit of the RLHF approach, compared to other training approaches, for aligning LLMs with human preferences.

Language Generation
We assess the multilingual language generation ability of ChatGPT through machine translation.
ChatGPT has been shown to be competitive with commercial translation products for high-resource languages (Jiao et al., 2023). We choose 2 languages from each language category, i.e., French (fra), Chinese (zho), Indonesian (ind), Korean (kor), Javanese (jav), and Sundanese (sun), from the FLORES-200 dataset (Team et al., 2022; Goyal et al., 2021). For each language, we sample 30 English-XXX parallel sentences and perform translation in both directions using English as the pivot language. The correctness of the translation results is manually validated by a native speaker of the corresponding language.
ChatGPT performs worse on low-resource languages As shown in Table 6, similar to other LLMs (Workshop et al., 2022; Muennighoff et al., 2022), ChatGPT produces better English translations from high-resource languages, such as French and Chinese. For low-resource languages, such as Javanese and Sundanese, ChatGPT tends to generate several mistranslated words/phrases and sometimes even hallucinates objects. Moreover, we also observe that ChatGPT sometimes translates the English sentence into a different but related language than the requested target language (see §6.2). This suggests that the generalization of LLMs, including ChatGPT, to low-resource languages remains an open challenge.
ChatGPT understands non-Latin scripts better than it can generate them Despite Chinese and Korean being high- and medium-resource languages, translation from English into them is much worse than into languages with Latin scripts, i.e., French or Indonesian. Similarly, prior works on transliteration (Chau and Smith, 2021; Muller et al., 2021) have shown the effectiveness of Latin scripts over other scripts, e.g., Cyrillic, Georgian, or Arabic, especially for low-resource languages. Interestingly, this problem with non-Latin scripts is less severe for translation from Chinese and Korean into English, which suggests that ChatGPT can better neutralize the effect of non-Latin scripts as source languages (Wan, 2022), but it still lacks the ability to generate text in non-Latin-script languages.

Evaluating Multimodality of ChatGPT
Since ChatGPT is a purely text-prompted language model, we cannot explore its multimodal capabilities with visual inputs as in contemporary vision-language works (Rombach et al., 2022; Ramesh et al., 2021; Yu et al., 2021a; Radford et al., 2021; Dai et al., 2022a; Lovenia et al., 2022). Hence, various ways to interact with ChatGPT and generate output data in multiple modalities have been explored in the research community. For example, as shown in Figure 1, ChatGPT can generate a well-formed and suitable intermediate representation in code format in order to synthesize images given the dialogue context and user prompts. Thanks to the code understanding and generation ability of ChatGPT, we believe programming code can serve as an intermediate medium to bridge vision and language (Rasheed, 2020; Shiryaev, 2022). Given textual prompts, ChatGPT can generate code representations of visual images using the SVG (Scalable Vector Graphics) format or APIs such as the HTML Canvas element and Python Turtle graphics. Even though the generated images are symbolic and their quality is not comparable to those generated by modern text-to-image models (Ramesh et al., 2021; Rombach et al., 2022), this direction is worth exploring for three reasons. First, it helps us investigate the visual understanding and reasoning abilities of ChatGPT, which can be seen as emergent skills acquired from very large-scale pre-training on text and code data. Second, representing images with code is a more explainable way to understand the model's behaviors and rationales in text-to-image generation. Third, it is a natural way to evaluate ChatGPT's ability in multi-turn interaction by asking for post-editing and corrections of the generated images.

Table 7: Examples of Buginese language identification responses from ChatGPT, InstructGPT, and text-davinci-003:
"The language of the text appears to be a variant of the Bugis language spoken in Indonesia."
"The language of the text is the Sasak language, spoken in Lombok, Indonesia."
"The text is written in Buginese."
"I am sorry, I do not recognize the language of the text."
"The language of the text is Koyukon Athabascan."
"The text is in the Balinese language."
"The language of the text appears to be a dialect of the Indonesian language."
"The language of the text is Indonesian."
"The language of the text is Indonesian."

Flag Drawing Task
To systematically evaluate the image generation ability of ChatGPT through code generation, we design a national flag drawing task. This is a unique task that shows how ChatGPT converts textually described knowledge (language) into a drawing (vision) through SVG (code), using multi-turn interactions in the dialogue.
Task Formulation The flag-drawing task contains three steps. Firstly, we ask ChatGPT to illustrate the appearance of the flag using the prompt "Describe how the <NATION> flag looks like". Next, based on the description, we ask ChatGPT to generate the SVG code of that flag by prompting "Generate a code snippet to represent that flag in SVG format". Finally, if the generated image contains errors, we iteratively ask ChatGPT to fix them. There are four types of errors, including 1) layout, 2) color, 3) missing components, and 4) shape/size. In each round of fixing, we ask ChatGPT to revise only one type of error with the prompt "<ERROR DESCRIPTION>. Revise the image". We terminate the conversation once the generated flag becomes perfect or we have already passed two rounds of fixing. We uniformly collect 50 national flags from different continents and conduct the flag-drawing task on ChatGPT. The full results are shown in Appendix A. The generated flag images are evaluated by the aforementioned four error types as criteria. We further assess the image quality with five grades, A ∼ E, which indicate zero to four (or above) errors with an increment of one. We assign grades to each round so that we can assess the number of improvements and degradation through conversational interactions (post-editing). An overview of the result evaluation is provided in Table 8.
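To make the code-as-image medium concrete, the sketch below (our own illustration, not ChatGPT output) shows the kind of SVG a model might emit for a flag described as "three vertical bands"; the function name and canvas size are assumptions:

```python
def tricolor_svg(colors, width=300, height=200):
    """Render a vertical three-band flag as an SVG string: one <rect>
    per band, laid out left to right."""
    band = width // 3
    rects = "".join(
        f'<rect x="{i * band}" y="0" width="{band}" height="{height}" fill="{c}"/>'
        for i, c in enumerate(colors)
    )
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">{rects}</svg>')
```

Rendering `tricolor_svg(["green", "white", "red"])` in any SVG viewer yields a plausible band layout; complex elements such as emblems instead require `<path>` data, which is exactly where flag generation tends to break down.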

Findings
Based on our results, we summarize insights as follows: ChatGPT is capable of drawing, yet better with a self-generated textual description. As demonstrated in Table 8 and Appendix A, by following the task formulation, ChatGPT can generate plausible national flags in the SVG format. To better understand the behavior of ChatGPT, we perform an ablation study by removing the description generation step. As illustrated by Figure 2, performance drops dramatically without first prompting for the textual flag description, which is generated by ChatGPT itself: quantitatively, the proportion of E-graded images increases from 32% to 62% after removing this step. Therefore, self-generated knowledge about the flag is crucial for generating flags correctly. From another point of view, explicitly describing the appearance of the flag before drawing it disentangles the image generation process, which can be seen as a form of chain-of-thought reasoning.
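The A–E grading rule described above is a direct function of the error count; as a transcription (the function name is ours):

```python
def grade_flag(num_errors):
    """Map the number of errors in a generated flag to a grade:
    A = 0, B = 1, C = 2, D = 3, E = 4 or more."""
    return "ABCDE"[min(num_errors, 4)]
```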
ChatGPT is an elementary illustrator. Among the four error types, the majority are shape/size errors, which occur 68% of the time. The other three error types (layout, color, missing components) appear 34%, 20%, and 18% of the time, respectively. For instance, ChatGPT cannot generate the exact shape of the maple leaf in the Canadian flag, while it gets the layout and the color correct without missing components (Figure 5). There are two potential reasons for this behavior. First, there might not be sufficient training data in such a pattern. To draw sophisticated shapes, the <path> tag in SVG is generally used, but it might not be commonly seen in the pre-training code data, leaving ChatGPT incapable of creating complex shapes. Second, in the textual flag description generated in the initial step, the illustration of a sophisticated shape is written in a conceptual, high-level manner; there are no detailed instructions or rules for the model to precisely draw the shape. For example, the description of the Canadian flag only says "a red maple leaf in the center", making it nearly impossible to draw the leaf correctly without having seen it before. This is a natural limitation of text-only language models, as they never see actual visual data and textual descriptions are usually conceptual.

Reasoning Evaluations of ChatGPT
Reasoning is one of the most actively discussed and debated abilities of LLMs, as scaling the model parameter size also increases the implicit knowledge in LLMs (Wei et al., 2022a; Wang et al., 2022; Huang and Chang, 2022). Mahowald et al. eloquently argue that language ability does not equal "thinking" or "reasoning" in LLMs, and that LLMs have poor reasoning skills despite possessing human-level language skills. In the NLP literature, evaluating a model's reasoning often means evaluating its various skills in arithmetic, commonsense, and symbolic reasoning on different NLP tasks that require such skills (Talmor et al., 2020; Zelikman et al., 2022; Wei et al., 2022b). This is in line with the anecdotal experience of users with ChatGPT: some examples demonstrate surprisingly good "reasoning" abilities compared to previously introduced LLMs, but at the same time ChatGPT fails on very simple reasoning problems (Venuto, 2023; Qiao et al., 2022; Cookup.ai, 2022; Labs, 2022).
In this paper, we investigate the reasoning ability of ChatGPT in a more fine-grained manner, covering deductive, inductive, abductive, analogical, causal, multi-hop, temporal, and spatial reasoning via question answering tasks. We first assign available QA tasks to these categories while avoiding overlap (i.e., choosing test sets that mainly require one specific category of reasoning), as shown in Table 9. We share experimental results for each category in the following subsections of §4 and provide details in Appendix E.

Logical Reasoning
Inductive, deductive, and abductive reasoning are common forms of logical reasoning, the process of deriving a conclusion or judgment based on given evidence or on past experience and observations (Rogers et al., 2022; Wason and Johnson-Laird, 1972; Huang and Chang, 2022). Inductive and deductive reasoning are distinguished by "the degree to which the premise supports the conclusion" in logic and philosophy (Qiao et al., 2022; Rogers et al., 2022; Hawthorne, 2021): inductive reasoning is based on "observations or evidence", while deductive reasoning is based on the "truth of the premises" (i.e., necessarily true inference) (Douven, 2017). Another way to categorize them is by the "direction of reasoning": deduction goes from premise to conclusion, while abduction goes from conclusion to the most probable premise that supports it (Walton, 2014).

Deductive vs. Inductive Reasoning
Deductive reasoning involves deriving specific conclusions from more general premises. In contrast, inductive reasoning starts from specific observations of patterns and processes them through increasingly abstract cycles of hypothetico-deductive reasoning to draw a more general conclusion (Lawson, 2005). Comparing the two types of reasoning, deduction requires less "guessing" from the perspective of ChatGPT, as induction requires figuring out the underlying rules (Rogers et al., 2022). The former can be viewed as top-down, the latter as bottom-up. We explore ChatGPT's inductive and deductive reasoning at two levels: 1) basic and 2) advanced. Basic-level tasks are prerequisites for probing reasoning. While solving these tasks does not necessarily indicate full reasoning capability, if ChatGPT fails on any of them, it will likely also fail on real-world tasks that require similar reasoning mechanisms. The advanced-level tasks, in turn, probe those capabilities in real-world settings where noise is present and where solving them requires more systematic generalization. Additionally, to dissect the capability of each reasoning mechanism in isolation, we choose tasks that do not require or depend on external knowledge, so that the solution can be derived from the premises alone.

Table 11: Prompting samples on deductive and inductive reasoning tasks. ChatGPT performs better at deduction than at induction. For both types of reasoning, explicitly asking ChatGPT to make a reasonable inference increases its performance. It also makes frequent mistakes regarding the grandson kinship.
EntailmentBank (deductive) prompt: "earth is a kind of planet. a planet is a kind of celestial object / celestial body. earth is located in the milky way galaxy. Which object is a planet found in the Milky Way Galaxy?" ChatGPT: "Earth is a planet located in the Milky Way galaxy."
bAbI (inductive) prompt: "Given facts: Bernhard is a swan. Greg is a frog. Brian is a swan. Julius is a lion. Greg is gray. Julius is yellow. Lily is a lion. Lily is green. Brian is yellow. Based on the given facts above, do a reasonable inference on this question using inductive reasoning: What color is Bernhard?" ChatGPT: "Based on the given facts, it can be inferred that Bernhard is yellow, as Brian is also stated to be yellow."

Table 12: StepGame (Basic) breakdown analysis.
Clock-position (5/20 correct). Example: "G is at Y's 6 o'clock. What is the relation of the agent Y to the agent G?" ChatGPT: "The agent Y is to the right of the agent G." Gold: above. (F)
Basic cardinal (17/20 correct). Example: "D and K are parallel, and D is under K. What is the relation of the agent K to the agent D?" ChatGPT: "The spatial relation of the agent K to the agent D is that K is above D." Gold: above. (T)
ChatGPT is a lazy reasoner that suffers more with induction We first investigate basic reasoning skills with the bAbI tasks (Weston et al., 2016b), using 30 examples each from task 15 (inductive) and task 16 (deductive). Each test example includes a list of premises from which an inference must be derived for a given question. Interestingly, when ChatGPT is asked to answer a question given the premises without any prompt engineering, it performs poorly on inductive reasoning (0 out of 30) while achieving much better performance on deductive reasoning (19 out of 30). Most of the time, ChatGPT answers "It is not specified what <attribute> <entity> is." when asked a question requiring inductive reasoning. However, when ChatGPT is explicitly asked for a reasonable inference with the prompt "Based on the given facts, do a reasonable inference on this question using inductive reasoning:", its inductive reasoning performance increases to 20 out of 30. It is still not as good as deduction, as the same prompt engineering also increases its deductive reasoning performance, to 28 out of 30.
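The two prompting conditions can be sketched as a small helper (the helper itself is illustrative; the explicit-inference wording is the prompt quoted above):

```python
def babi_prompt(premises, question, explicit=False, mode="inductive"):
    """Build either the plain prompt (premises + question) or the
    variant that explicitly asks for a reasonable inference, which
    markedly improved accuracy in our runs."""
    if explicit:
        question = ("Based on the given facts, do a reasonable inference "
                    f"on this question using {mode} reasoning: {question}")
    return "\n".join(premises) + "\n" + question
```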
When we repeat the analysis on the advanced-level tasks, specifically CLUTRR (Sinha et al., 2019) for induction and EntailmentBank (Dalvi et al., 2021) for deduction, the same conclusion holds: ChatGPT answers correctly only half of the time on CLUTRR, while it makes deductive inferences well 90% of the time on EntailmentBank. CLUTRR requires induction to extract relations between entities, and in its responses ChatGPT often asks for more information before making inferences. An interesting finding on CLUTRR is that ChatGPT cannot differentiate son and grandson, but can differentiate daughter and granddaughter, when it induces the logical rules governing kinship relationships. We show all performances in Table 10 and some of the prompting samples in Table 11. We follow the categorization of deductive and inductive reasoning datasets from Qiao et al. (2022), but we only use the QA part of EntailmentBank, which the authors took from the ARC dataset (Clark et al., 2018), as we aim to test reasoning capability. Regarding EntailmentBank, the questions might trigger universe-related knowledge in ChatGPT, which could help the model derive the correct answer even though the test set is designed to test deductive reasoning skills. A future exploration would be to check the rationale of ChatGPT with a follow-up question.

Abductive Reasoning
Abductive reasoning is the inference to the most plausible explanation given observations. For instance, "if Jenny finds her house in a mess when she returns from work, and remembers that she left a window open, she can hypothesize that a thief broke into her house and caused the mess" 10 . We test ChatGPT's language-based abductive reasoning ability with 30 samples from αNLI dataset (Bhagavatula et al., 2020), which requires the model to select the most plausible explanation given the conclusion. Based on our test, it could achieve 86.7% (26 out of 30) accuracy.

Non-textual semantic reasoning
Publicly shared ChatGPT errors and failure instances 11 often suggest that it lacks reasoning abilities that require non-textual semantic understanding, such as mathematical, temporal, and spatial reasoning. In this section, we investigate these non-textual semantic reasoning capabilities of ChatGPT.
Mathematical reasoning Mathematical capability, or numerical reasoning, has frequently been reported as lacking in LLMs, not only in ChatGPT (Frieder et al., 2023). Frieder et al. test ChatGPT's capability with publicly available datasets as well as a human-curated dataset, consisting of 728 prompts in total. Their findings on ChatGPT's mathematical capabilities include: 1) ChatGPT often understands the question but fails to provide a correct solution; 2) it performs inconsistently and poorly on graduate-level advanced mathematics; and 3) it has a great ability to search for mathematical objects. 12 We also test separately on the MATH dataset (Saxton et al., 2019), which tests mathematical reasoning. Not surprisingly, ChatGPT could only score 23.33% (7/30).
Temporal reasoning Temporal reasoning is mentioned a few times in the literature, but less often than other categories. It tests the understanding of event durations and the relations between events.
For this category, we conduct experiments on the TimeDial dataset (Qin et al., 2021), which solely requires temporal reasoning. We follow the format of the task in the BIG-Bench benchmark (Srivastava et al., 2022), which is multiple-choice with a single correct answer. Overall, ChatGPT answers correctly 86.67% of the time (26/30), suggesting that it has decent temporal reasoning ability. Moreover, compared to Chinchilla and Gopher, which have accuracies of 68.8% and 50.9% respectively, ChatGPT shows a promising improvement for LLMs in this aspect.
Spatial Reasoning Spatial reasoning is the use of an understanding of spatial relations among different objects and spaces. For spatial reasoning, we utilize two existing datasets: SpartQA (Mirzaee et al., 2021) and StepGame (Shi et al., 2022b), which consist of story-question pairs about k relations among k+1 entities (where k is up to 10) written in natural language. ChatGPT is asked to answer the spatial relation between two entities based on the provided descriptions. ChatGPT could only score 40% on SpartQA and 23.33% on StepGame with k = 9, which we refer to as StepGame (Hard). On StepGame especially, around 43% of the time ChatGPT could not provide any spatial relation and instead generated "It is not specified in the given description". Even for a fine-tuned model, performance drops as the number of relations (k) in the context description increases (Shi et al., 2022b). Table 13 summarizes the results of this part. 12 Refer to the detailed findings in the original paper.
To understand spatial reasoning ability at a more elementary level, we test with less complicated examples from StepGame, which we refer to as StepGame (Basic). It does not involve multi-hop reasoning, but only the spatial relation between two entities (e.g., "C is sitting at the top position to Y. What is the relation of the agent Y to the agent C?"). We test the basic spatial relations with 8 labels from StepGame: {left, right, above, below, lower-left, lower-right, upper-left, upper-right}. On StepGame (Basic), ChatGPT scores higher (63.33%).
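Ground truth for these basic relations can be computed mechanically. The sketch below handles the clock-position phrasing ("G is at Y's 6 o'clock" places G relative to Y, so a question about Y relative to G needs the opposite relation); the hour-to-octant mapping is our own illustrative convention, since StepGame derives its labels from coordinates:

```python
# Map a clock hour to one of StepGame's 8 relation labels
# (illustrative convention: 12 -> above, 3 -> right, and so on).
CLOCK_TO_RELATION = {
    12: "above", 1: "upper-right", 2: "upper-right", 3: "right",
    4: "lower-right", 5: "lower-right", 6: "below", 7: "lower-left",
    8: "lower-left", 9: "left", 10: "upper-left", 11: "upper-left",
}
OPPOSITE = {
    "above": "below", "below": "above", "left": "right", "right": "left",
    "upper-left": "lower-right", "lower-right": "upper-left",
    "upper-right": "lower-left", "lower-left": "upper-right",
}

def inverse_clock_relation(hour):
    """'G is at Y's <hour> o'clock' gives G's position relative to Y;
    the question asks for Y relative to G, i.e., the opposite relation."""
    return OPPOSITE[CLOCK_TO_RELATION[hour]]
```

For the Table 12 example, `inverse_clock_relation(6)` returns "above", the gold answer that ChatGPT misses.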
Investigating the errors, we find that ChatGPT often fails to understand clock directions (e.g., "W is at K's 3 o'clock") and diagonal spatial relations. We further analyze the results by breaking down the test examples of StepGame (Basic) into two comparisons: i) types of directions (basic cardinal vs. diagonal) and ii) ways of describing cardinal directions (basic cardinal 13 vs. clock-position cardinal). We take 20 more samples for each category (basic cardinal, diagonal, clock-position cardinal) and test them as illustrated in Table 12.
• ChatGPT infers poorly from clock-position descriptions. Although it is a simple cardinal direction, ChatGPT could only answer 5 samples correctly (25%), clearly worse than its performance with the basic cardinal description (17 correct answers).
• ChatGPT is worse at diagonal positions. It answers correctly around half of the time (55%), which is worse than for basic cardinal points (85%).

Table 15 ChatGPT answer: "It is implausible that a knife could chop through a rock. Knives are made to cut through softer materials such as food, paper, or wood. Rocks are much harder than these materials and a knife would not be able to make a dent in a rock, let alone chop it."

Commonsense Reasoning
Commonsense reasoning benchmarks have been widely used to evaluate LLMs (Huang and Chang, 2022; Bhargava and Ng, 2022). However, Bhargava and Ng also point out that the reasoning tasks underlying these benchmarks are still far from being solved, since most existing studies primarily report the performance of the models without a detailed examination of the quality of the rationales produced.
To evaluate ChatGPT's capability for commonsense reasoning, we first test it on two widely used benchmark datasets, CommonsenseQA (Talmor et al., 2018) and PiQA (Bisk et al., 2020). CommonsenseQA focuses on general commonsense question answering, such as "Where is a business restaurant likely to be located?", while PiQA targets physical commonsense reasoning: given a sentence such as "When boiling butter, when it's ready, you can ", the goal is to fill in the blank with one of two answer options, "Pour it onto a plate" or "Pour it onto a jar". We use the validation split for both datasets, since no labels are provided in the test sets we retrieve. We further probe ChatGPT with a more challenging commonsense reasoning dataset, evaluated in a more comprehensive way: Pep-3k (Wang et al., 2018b), which requires the model to recognize plausible but possibly novel events, such as "man swallow paintball". Each instance in Pep-3k is an s-v-o predicate, and the task is to judge whether the predicate is plausible. Instead of evaluating ChatGPT's performance only on the binary judgment, we also check whether the answer contains relevant rationales (explanations) that lead to its judgment. We sample 30 instances from each of the test sets. For the Pep-3k samples, we prepend the s-v-o predicate with "Please judge if this predicate is (likely) plausible or implausible:" to prompt ChatGPT. We show the results in Table 14. ChatGPT shows surprisingly good commonsense reasoning capability on these evaluation tasks, perhaps due to its large parametric memory: it performs well on all three datasets in terms of answer accuracy, matching our anticipation. Furthermore, as we also check the rationales in ChatGPT's answers when evaluating the Pep-3k samples, we find that ChatGPT does well not only in answer accuracy but also in generating reasonable reasoning procedures to support its answers.
We show a concrete example in Table 15. ChatGPT's answer explains well what kinds of materials are usually cut with knives (i.e., food, paper, or wood). It then reasons about why rocks cannot be chopped with a knife by explaining that "rocks are much harder than these materials". While our findings are based on 30 samples from each dataset, we see potential in ChatGPT's commonsense reasoning capability, and a further large-scale investigation is worth exploring.

Causal, Multi-Hop, and Analogical Reasoning
Causal Reasoning Causal reasoning is the process of identifying the relationship between causes/actions and effects/changes, i.e., causality (Thomason, 2018; Huang and Chang, 2022). We test ChatGPT on 30 samples of the human-annotated explainable CAusal REasoning dataset (E-CARE) (Du et al., 2022), of which it answers 24 correctly (80%). Note that our evaluation is mainly based on whether the model can make a judgment on the correct causes or effects, rather than on its generated explanation of why the causation exists; a follow-up generation of explanations can be a future exploration.

Multi-hop Reasoning
To reason over a larger context, a system has to perform multi-hop reasoning over more than one piece of information to arrive at the answer (Mavi et al., 2022). We test ChatGPT's multi-hop reasoning capability on 30 samples of the HotpotQA dataset (Yang et al., 2018) and find that ChatGPT has difficulty with this capability, answering only 8 samples correctly, even though the questions posed are only 2-hop. It is worth noting that ChatGPT often generates the answer within a short passage of explanation, so we manually evaluate each ChatGPT response to check its accuracy. This aligns with findings that LLMs are limited in several ways and fail to produce accurate predictions due to their inability to accomplish complex reasoning, such as solving tasks that require multi-hop reasoning (Ott et al., 2023).
Analogical Reasoning Analogical reasoning is a way of thinking that relies upon an analogy, comparing two or more objects or systems of objects (Bartha, 2013) to derive a conclusion. We test 30 samples from Webb et al. (2022b) and evaluate them by human judgment of whether the generated answer matches or contains the gold answer. ChatGPT answers all 30 examples correctly, which may indicate that ChatGPT has good analogical reasoning skills.

Factuality and Hallucination Evaluations in ChatGPT
LLMs are known to be susceptible to generating nonfactual, untruthful information, which is referred to as hallucination (Lee et al., 2022; Ji et al., 2022b,c; Su et al., 2022; Dai et al., 2022b). Much anecdotal evidence suggests that ChatGPT suffers from the same problem as other LLMs. To evaluate this aspect of ChatGPT, we first explore existing fact-checking test sets and QA tasks that require knowledge (§5.1). We then illustrate the challenge of hallucination in ChatGPT by sharing hallucination examples from different NLP tasks (§5.2).

Factuality in ChatGPT
We first test ChatGPT's ability to detect misinformation on test sets that consist of scientific and social claims related to COVID-19 (Lee et al., 2021). We take 50 samples each from the scientific (covid-scientific) and social (covid-social) sets. ChatGPT detects misinformation with 92% (46/50) accuracy on covid-scientific and 73.33% (22/30, excluding verification-refusing cases) on covid-social. Compared to previously reported performance on these test sets, ChatGPT's result on covid-scientific is impressive, although this may be affected by the fact that ChatGPT's training data includes information about COVID-19 while the previous models' did not. Interestingly, for the more societal claims, ChatGPT refuses to make a verification 40% of the time (20/50), saying "I'm sorry, but I don't have enough information to accurately answer your question ...". This happens more often for false-labeled claims (misinformation) than for true claims. This may be related to ChatGPT's safety layer or policy, as the covid-social test examples include political claims. OpenAI continues to tune ChatGPT with RLHF, and that aspect also contributes to its performance.
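The covid-social figure above is computed over the claims ChatGPT actually attempts to verify; a minimal sketch of that scoring (the label strings and the "refuse" sentinel are our own encoding):

```python
def verification_accuracy(responses, gold):
    """Accuracy over verified claims only: refusals are removed from
    the denominator, as in our covid-social evaluation."""
    scored = [(r, g) for r, g in zip(responses, gold) if r != "refuse"]
    refused = len(responses) - len(scored)
    correct = sum(1 for r, g in scored if r == g)
    return (correct / len(scored) if scored else 0.0), refused
```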
However, ChatGPT cannot avoid the criticism that its parameterized knowledge is obtained by memorization, as it still performs worse on questions designed to cause imitative falsehoods, for instance, "Does someone share more genes with their siblings or parents?" and "What are the health benefits of pink Himalayan salt?" 14 We test on 65 test samples 15 from TruthfulQA (Lin et al., 2022), which measures the extent to which LLMs mimic human falsehoods, and 35.38% of the time ChatGPT fails to answer truthfully.

Hallucination in ChatGPT
There exist two categories of hallucination (Ji et al., 2022b): intrinsic hallucinations, LLM generations that contradict the source/input content; and extrinsic hallucinations, LLM generations that cannot be verified from the source/input content (i.e., output that can neither be supported nor contradicted by the source).
In Table 17, we share examples of these hallucination types detected in our task explorations. Across the tasks we test, we often find extrinsic hallucinations, both untruthful and factual, in various tasks such as machine translation and question answering.
Intrinsic hallucinations are barely found in the tasks we examine, such as summarization and knowledge-grounded open-domain dialogue. For instance, in the abstractive summarization task, in which neural models usually suffer from intrinsic hallucination, ChatGPT's generated summaries did not include any intrinsic hallucination in our experiments. Rather, it shows factual extrinsic hallucination; for instance, ChatGPT correctly paraphrases "Britain and five other countries" from the source input into "P5+1 (US, UK, France, China, Russia, and Germany)", which is assessed to be factual. We also observe an interesting intrinsic hallucination in our proposed multimodal task, the flag drawing task. ChatGPT is first asked to generate a description of how the flag looks before it is asked to generate the code for the flag. Although it generates the correct description, "The flag of Mexico consists of three vertical bands [...]", the final drawing (SVG code) consists of horizontal bands.
However, extrinsic hallucinations happen often, including both untruthful and factual ones. In the question answering task, we often find non-factual extrinsic hallucinations, which harm the final performance. For instance, when asked about the relationship among entities, ChatGPT answers with a step kinship even though step kinship is never mentioned in the question, as illustrated in Table 17. We also observe that ChatGPT's weakness with extrinsic hallucination degrades machine translation: when asked to translate "Like some other experts, he is skeptical about whether diabetes can be cured, noting that these findings have no relevance to people who already have Type 1 diabetes." into Korean, it inserts a piece of information not found in the source, "저주파 치료" (transcutaneous electrical nerve stimulation), into the translated text.

Evaluating Interactivity in ChatGPT
ChatGPT has a built-in interactive ability thanks to conversational data fine-tuning and RLHF. We further delve into the benefit of exploiting this interactive ability of ChatGPT in three NLP tasks, i.e., 1) summarization, 2) machine translation, and 3) multimodal generation. Our experiments demonstrate the potential of employing multi-turn interaction to refine the quality of the generated responses and improve the task performance of ChatGPT.

Interactivity on Summarization
Summarization models aim to extract essential information from documents and to generate short, concise, and readable text (Yu et al., 2021b; Su et al., 2021). Recently, Goyal et al. (2022) showed that zero-shot prompting with GPT-3 (Brown et al., 2020) outperforms the state-of-the-art fine-tuned model (Liu et al., 2022) in human evaluation. One main advantage of ChatGPT over GPT-3 is that it interacts in a conversational way. We therefore study the interactivity of ChatGPT: in real-world applications especially, people may want to improve a summary based on the previously generated one.
In detail, we investigate ChatGPT's ability to control the length of summaries through multi-turn interaction. To run experiments, we randomly sample 50 documents from a dialogue summarization dataset called SAMSum (Gliwa et al., 2019) and conduct a two-turn iterative prompting approach. Given an input dialogue as the context, we first input the prompt "Summarize the above dialogue" to ChatGPT. However, ChatGPT usually generates an overly long summary, sometimes even longer than the input conversation itself. To refine the summary, we simply input another prompt, "Please make the summary shorter", after the first response. Given the second prompt, ChatGPT can provide a much shorter summary than the first response. In order to quantify the experimental results, we calculate the ROUGE scores of the first summary and the second summary. Experimental results show that with the second length-control prompt, the refined summaries achieve 7.99, 1.64, and 5.19 gains on ROUGE-1, ROUGE-2, and ROUGE-L, respectively. Figure 3 shows an example of how multi-turn interaction helps to control the length of the summary.

Table: Hallucination examples (columns: Task, Key, Text Content).

ChatGPT Generation: The ancient Egyptian Pharaohs built beautiful monuments that are still well-preserved after thousands of years.
Hallucination (Extrinsic): New Kingdom ancient Egyptians marveled at their predecessors' monuments, which were then well over a thousand years old.

ChatGPT Generation: The move comes following the recent framework deal between Iran and the P5+1 (US, UK, France, China, Russia, and Germany) to curb Iran's nuclear activities in exchange for sanctions relief.
Hallucination (Extrinsic, Factual): "P5+1" are the exact countries that the source document mentions, and the model can correctly list the names of these countries in the summary.

Prompt: Continue the dialogue as a task-oriented dialogue system called SYSTEM. The function to send booking confirmation by email doesn't exist and is not mentioned anywhere.

Prompt: Can we try dialogue generation? I will give you turns and you can generate the next turn, but only one. You can also consider the knowledge of "Pierre Coffin is starred in Despicable Me 2" and "Despicable Me 2 is written by Cinco Paul" for your reference in the dialogue.
User: Do you know of any movies directed by Pierre Coffin?
ChatGPT Generation: Yes, Pierre Coffin is the co-director of the animated film "Despicable Me 2." Cinco Paul is the film's writer.
Hallucination: "animated film" is a factual extrinsic hallucination.
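The two-turn summarization protocol described above, together with the ROUGE-1 measurement, can be sketched as follows. The `chat` callable is a placeholder for the model API, and the ROUGE-1 here is a plain unigram-overlap F1 rather than the full toolkit used in the paper:

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap ROUGE-1 F1 between a reference and a candidate summary
    (simplified sketch; the paper uses the standard ROUGE toolkit)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def two_turn_summarize(dialogue, chat):
    """Two-turn protocol: summarize, then ask for a shorter summary.
    `chat` is a hypothetical function mapping a message history to a reply."""
    history = [{"role": "user", "content": f"{dialogue}\n\nSummarize the above dialogue"}]
    first = chat(history)
    history += [{"role": "assistant", "content": first},
                {"role": "user", "content": "Please make the summary shorter"}]
    second = chat(history)
    return first, second
```

Comparing `rouge1_f1(reference, first)` against `rouge1_f1(reference, second)` over the 50 sampled documents reproduces the kind of gain reported above.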

Interactivity on Machine Translation
One of the capabilities of ChatGPT is to perform text translation from one language to another. Given the interactivity of ChatGPT, we explore the possibility of performing a combined machine translation and automatic post-editing task to improve its translation quality. We explore this capability for translation from English to the target language, since ChatGPT's translation quality from high-resource and medium-resource languages into English is near perfect (see §3.2). For the experiment, we adapt the dataset used in §3.2.2, which samples 30 parallel sentences for 6 language pairs in NusaX (Winata et al., 2022), i.e., Chinese (zho), French (fra), Indonesian (ind), Korean (kor), Javanese (jav), and Sundanese (sun). We experiment with a multi-turn approach: we first query ChatGPT to translate into the target language using "What is [TARGET_LANGUAGE] translation of the following sentence?\n\n [INPUT_SENTENCE]" as the prompt template, and then query for the post-editing using the following prompt template: "Could you perform a post-editing to ensure the meaning is equivalent to "?". The post-editing results are manually validated by a native speaker of the corresponding language to check: 1) whether the post-edited sentence is better than the translated one, and 2) whether the post-edited sentence is a correct translation of the given English sentence. As shown in Figure 4, although the translation and the post-editing are done by a single ChatGPT model, the multi-turn approach helps to improve the correctness of the translation by making partial or even full corrections in some cases. This result shows that performing automatic post-editing through interactive LLMs, such as ChatGPT, yields consistently better translation results than single-turn machine translation, which is especially useful for translation in low-resource languages.
We provide per-language examples of the machine-translated and post-edited sentences in Appendix D.
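The two prompt templates above can be assembled as plain functions. Note that the quoted slot in the second template is partially elided in the text, so filling it with the source sentence is an assumption on our part:

```python
def translation_prompt(target_language, sentence):
    # First-turn template, as given in the text.
    return (f"What is {target_language} translation of the following "
            f"sentence?\n\n{sentence}")

def post_editing_prompt(source_sentence):
    # Second-turn template; the quoted part is elided in the text, so
    # inserting the source sentence here is an assumption.
    return ("Could you perform a post-editing to ensure the meaning is "
            f'equivalent to "{source_sentence}"?')
```

The second prompt is sent in the same conversation as the first, so ChatGPT post-edits its own previous translation.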
To further strengthen our hypothesis, we conduct an additional experiment on the automatic post-editing (APE) shared task dataset of WMT 2022 (Bhattacharyya et al., 2022), which focuses on the English→Marathi post-editing task. Marathi (mar) is also a low-resource language, with a 0.02% share of CommonCrawl. We draw 50 samples from the corresponding dataset and conduct the evaluation in two ways: 1) human-targeted translation error rate (HTER), SacreBLEU (Post, 2018), and METEOR (Banerjee and Lavie, 2005) between the generated Marathi sentence and the human post-edited sentence; 2) HTER, SacreBLEU, METEOR, and a semantic similarity score, i.e., BERTScore (Zhang* et al., 2020), between the back-translated English sentence and the original English sentence.

Figure 5: Changes in ChatGPT's drawing of the Canadian flag over three turns. Layout, color, completion, and shape/size are marked with a check mark if they align with those of the ground truth, and with a cross otherwise.
As shown in Table 18, single-turn translation without post-editing produces a slightly better evaluation score on the Marathi side, but multi-turn translation with post-editing consistently yields better evaluation performance on the back-translated English text on all metrics. This suggests that post-editing brings the translation results closer to the actual meaning of the source text. Nevertheless, the translation into Marathi is much worse than the baseline MT provided by the APE 2022 shared task (Bhattacharyya et al., 2022), which further supports the limitations of ChatGPT in generating sentences in low-resource and non-Latin-script languages.
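HTER is the word-level edit distance from the system output to its human post-edited version, normalized by the length of the post-edited reference. A simplified sketch (plain Levenshtein distance, without the block-shift operation of the full TER metric used by the shared task):

```python
def hter(hypothesis, post_edited):
    """Simplified HTER: word-level edit distance from the hypothesis to the
    human post-edited reference, divided by the reference length.
    Sketch only: the official TER metric also counts phrase shifts."""
    hyp, ref = hypothesis.split(), post_edited.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)
```

Lower HTER means fewer edits were needed to reach the human post-edited sentence.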

Interactivity on Multimodal Generation
The multi-turn interaction ability of ChatGPT enables the refinement of text-to-image generation. Iteratively making requests of an AI tool is one of the most natural ways for humans to create artwork or product designs. For example, Figure 6 shows the process of creating an interesting painting by prompting ChatGPT with varied requirements over multiple turns.
To quantitatively study how this ability impacts text-to-image generation, as mentioned in the task formulation of the flag drawing, we conduct at most three rounds of post-editing. As shown in Figure 7, in the first round of generation, ChatGPT rarely generates errorless SVG images, except for some relatively simple flags (e.g., the Nigerian and German ones). In subsequent rounds of generation, we see a clear boost in the overall quality of the generated flag images by asking ChatGPT to fix errors based on its own description. We observe that 34% and 36% of samples improve (i.e., contain fewer errors) from turn 1 to turn 2 and from turn 2 to turn 3, respectively. Meanwhile, 6% and 8% of samples degrade after each dialog turn. In other words, while improvement is not always guaranteed, the multi-turn conversation capability of ChatGPT enables post-editing through interaction. We also test InstructGPT (davinci-003), which has the same backbone model as ChatGPT but lacks conversation ability. As demonstrated in Appendix B, InstructGPT cannot achieve a significant improvement by directly putting the intermediate results in the input context.

Figure 6: From fruits to a Christmas tree. Step-by-step image drawing and modification by ChatGPT.
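For illustration, the compositional SVG that a "simple flag" reduces to can be as small as three rectangles. The following is a hand-written sketch of the kind of output involved, not ChatGPT's actual generation:

```python
def german_flag_svg(width=300, height=180):
    """Render the German flag as three stacked <rect> elements, illustrating
    the compositional SVG structure (shape, placement, color) discussed above."""
    stripe = height // 3
    colors = ["#000000", "#DD0000", "#FFCE00"]  # black, red, gold
    rects = [
        f'<rect x="0" y="{i * stripe}" width="{width}" height="{stripe}" '
        f'fill="{color}"/>'
        for i, color in enumerate(colors)
    ]
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">' + "".join(rects) + "</svg>")
```

Because every shape is an explicit element with explicit coordinates and colors, errors (wrong stripe order, wrong color code) are localized and can be named in a follow-up prompt, which is what makes the multi-turn fixing loop feasible.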

Multitask, Multilingual, Multimodal
ChatGPT outperforms multiple state-of-the-art zero-shot LLMs on various tasks and even surpasses fine-tuned models on some tasks. Although ChatGPT performs well on most of the tasks, there are still some failure cases in each task (§3.1). In the summarization task, ChatGPT sometimes generates a summary that is even longer than the input document. In the machine translation task, ChatGPT sometimes produces an incorrect translation for some words, slightly shifting the meaning. Therefore, dealing with these special cases remains a complex but important task.
In terms of multilinguality, ChatGPT achieves strong performance in many high-resource and medium-resource languages. Nevertheless, ChatGPT still lacks the ability to understand and generate sentences in low-resource languages (§3.2). The performance disparity in low-resource languages limits the diversity and inclusivity of NLP (Joshi et al., 2020; Aji et al., 2022; Wan, 2022). Additionally, ChatGPT also lacks the ability to translate sentences in non-Latin-script languages (§3.2.2), despite these languages being high-resource. This raises the question of language representation in ChatGPT; research on shared representations for non-Latin scripts (Amrhein and Sennrich, 2020; Pfeiffer et al., 2021; Wan, 2022) is needed.
In terms of multimodality, in real applications it is very natural for visual information (images or videos) to appear within a dialogue (Sun et al., 2022; Mostafazadeh et al., 2017), whether provided by the user or generated by the model. The visual information also serves as part of the context for subsequent turns. Can textual models like ChatGPT switch to a multimodal backbone? Through our flag drawing experiments, we find that ChatGPT is able to translate visual concepts and structures into basic code formats (e.g., the circle SVG element), which define the exact shape, orientation, color, and placement of the objects. Given this structured way of generating an image, one research question is: if a model learns an image as a composition of basic shapes, would this help the model understand the abstraction of visual concepts and structures (Ji et al., 2022a)? Moreover, would it produce more interpretable results for the users?

Reasoning
The highly impressive performance of ChatGPT has sparked interest in expanding its usage beyond traditional NLP tasks into more complex domains requiring sophisticated reasoning, such as problem-solving, decision-making, and planning. Our evaluation of its reasoning abilities shows that they are not reliable. Specifically, our findings indicate that ChatGPT tends to be a lazy reasoner and that its capabilities are inconsistent across the various reasoning types.
In terms of logical reasoning, ChatGPT performs better at deductive and abductive reasoning than at inductive reasoning. As a language model, ChatGPT still lacks the ability to answer non-textual semantic reasoning tasks, such as mathematical, temporal, and spatial reasoning. Instead, many suggest pairing ChatGPT with an external computational engine, such as Wolfram, to solve each specific set of problems. In that combination, ChatGPT parses natural language input into code snippets, and the computational engine executes the code to return results. In this way, the strength of ChatGPT is maximized while its weakness is mitigated. Surprisingly, ChatGPT excels at commonsense, causal, and analogical reasoning. We suspect that all this knowledge has been encoded in the parametric memory of ChatGPT. Nevertheless, ChatGPT lacks the ability to perform multi-hop reasoning, which suggests that, like other LLMs, it possesses only a limited ability to accomplish complex reasoning tasks.
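A minimal sketch of this parse-then-execute division of labor: the model's role is reduced to emitting an arithmetic expression (e.g., "(17*24)+3"), and a restricted evaluator stands in for the external computational engine. The evaluator below is an illustrative assumption, not Wolfram's actual interface:

```python
import ast
import operator

# Arithmetic operators the evaluator is allowed to execute.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expr):
    """Execute a model-emitted arithmetic expression with a restricted
    AST walker, so only numbers and whitelisted operators can run."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))
```

The LLM handles the language-to-expression parsing it is good at, while the deterministic evaluator supplies the exact arithmetic the LLM is unreliable at.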
To support the further expansion of its use cases, it is necessary to prioritize the development of systems with robust complex reasoning capabilities, which should also be facilitated by the creation of more comprehensive benchmarks for assessing these abilities, particularly when multiple abilities are required to complete the tasks.

Factuality and Hallucinations
Although powerful, ChatGPT, like other LLMs, still makes things up (Ji et al., 2022b). To ensure factuality, it is possible to build LLMs with an interface to an external knowledge source, like BlenderBot 3.0 (Shuster et al., 2022), RETRO (Borgeaud et al., 2021), and LaMDA (Thoppilan et al., 2022). In this manner, the factual information of LLMs can be updated independently and easily in the knowledge base, without fine-tuning the whole LLM. However, how to balance the generative power of parametric memory with external knowledge sources is an active research area (Lee et al., 2022; He et al., 2023). Meanwhile, there are many forms of hallucination from LLMs that are not necessarily counterfactual but still undesirable. The RLHF process of ChatGPT incorporates human feedback to mitigate undesirable responses. However, researchers need to develop more automatic and scalable methods to detect and mitigate hallucinations and other undesirable artifacts of LLMs.

Interactivity
Compared with previous LLMs, the interactive ability of ChatGPT has made a leap according to both qualitative and quantitative measures. Based on our evaluation, through interactivity, we can improve the performance of ChatGPT by 8% ROUGE-1 on the summarization task and 2% ChrF++ on the machine translation task. However, ChatGPT sometimes retains a wrong answer even after receiving multiple rounds of prompts from the user. Improving the ability of ChatGPT to handle multiple rounds of user feedback is thus also an important challenge.
The conversational ability and multi-turn interaction of ChatGPT make it natural for people to use it as a dialog system. We carry out the very difficult task of using ChatGPT as a task-oriented dialog system, with structured knowledge given in the prompt. While ChatGPT shows strong performance in various modules, challenges remain in using ChatGPT as a full task-oriented dialog system, due to the lack of controllability and knowledge grounding in its responses.
The interactivity also inadvertently enables users to "jailbreak" ChatGPT into carrying out harmful actions. For example, a user could ask ChatGPT to turn off its safety layer, causing potential damage (Christian, 2023).

Responsible Design and Usage of LLMs

Responsible design and usage of LLMs, including ChatGPT, is an important and pressing challenge today. There are common issues with these models, such as fairness, toxicity, demographic bias, and safety, that need to be addressed. In the case of ChatGPT, OpenAI constructs safety layers and uses RLHF, and potentially other means, to filter out undesirable system responses. This process is resource-intensive and opaque to the public. We hope to see more open discussion and sharing of the responsible design of LLMs from various organizations, including OpenAI, in the future.
References

2023. ChatGPT vs Satya Nadella over biryani: The chatbot is learning from its mistakes.

A Flag Drawing Task Results
We provide the detailed results of the flag drawing task described in §3.3.1 in Figure 7.

B InstructGPT for Multimodality
We show an example of a multi-turn flag drawing of InstructGPT in Figure 8. Similar to ChatGPT, InstructGPT can revise the generated flag image in each turn, although the generation quality is still elementary.

C List of Evaluation Datasets
We provide a detailed list of all the datasets used in our experiments in Table 19.

bAbI task 15 (QA): This basic deduction task is taken from the (20) QA bAbI tasks, a set of proxy tasks that evaluate reading comprehension via question answering. The tasks measure understanding in several ways: here, whether a system is able to answer questions via simple deduction. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human.

bAbI task 16 (QA): This basic induction task is taken from the (20) QA bAbI tasks, a set of proxy tasks that evaluate reading comprehension via question answering. The tasks measure understanding in several ways: here, whether a system is able to answer questions via simple induction. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human. Weston et al. (2016b)

αNLI (QA): Abductive Natural Language Inference (αNLI) is a new commonsense benchmark dataset designed to test an AI system's capability to apply abductive reasoning and common sense to form possible explanations for a given set of observations. Formulated as a binary classification task, the goal is to pick the most plausible explanatory hypothesis given two observations from narrative contexts.

CommonsenseQA (QA): CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing splits: the "random split", which is the main evaluation split, and the "question token split"; see the paper for details.

Letter string analogy (QA): The letter string analogy domain was introduced in order to evaluate computational models of analogical reasoning. This task is composed of simple alphanumeric characters, but nevertheless requires a significant degree of abstraction to identify an analogy. Webb et al. (2022b); 30 samples.

StepGame (QA): StepGame is a robust multi-hop spatial reasoning dataset in textual format which addresses the limitations of bAbI tasks 17 and 19. In this task, the AI system is required to interpret a story of k spatial relations (e.g., left, top-right, 90 degrees clockwise) over k+1 entities (k is up to 10) expressed in natural language, and answer a question about the relative position of two of those entities. For each k from 1 to 10, there are 30,000, 1,000, and 30,000 samples for train, validation, and test, respectively.
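The multi-hop inference that StepGame targets can be illustrated by composing spatial relations as grid offsets. This simplified sketch assumes unit steps and a small relation vocabulary, ignoring StepGame's clock-direction and distance phrasings:

```python
# Map each StepGame-style relation to a unit (dx, dy) offset on a grid
# (assumed vocabulary; the real dataset uses more varied phrasings).
RELATION_OFFSETS = {
    "left": (-1, 0), "right": (1, 0), "above": (0, 1), "below": (0, -1),
    "top-left": (-1, 1), "top-right": (1, 1),
    "bottom-left": (-1, -1), "bottom-right": (1, -1),
}

def relative_position(chain):
    """Compose a chain of (head, relation, tail) facts laid head-to-tail,
    returning the net offset of the first entity relative to the last:
    the k-hop inference a StepGame question asks for."""
    dx = dy = 0
    for _head, relation, _tail in chain:
        ox, oy = RELATION_OFFSETS[relation]
        dx, dy = dx + ox, dy + oy
    return dx, dy
```

For example, from "A is left of B" and "B is above C", the net offset of A relative to C is one step left and one step up, which a model must derive by chaining both facts rather than reading either one alone.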

French
Although three people were inside the house when the car impacted it, none of them were hurt.
French 34 per cent of those in the poll share this view, wanting Queen Elizabeth II to be Australia's last monarch.

French
The qualities that determine a subculture as distinct may be linguistic, aesthetic, religious, political, sexual, geographical, or a combination of factors.

French
New Kingdom ancient Egyptians marvelled at their predecessors monuments, which were then well over a thousand year old.
Les Égyptiens antiques du Nouvel Empire étaient émerveillés par les monuments de leurs prédécesseurs, qui avaient alors bien plus de mille ans.

French
The U.S. Corps of Engineers estimated that 6 inches of rainfall could breach the previously damaged levees.

French
Several large television screens were installed in various places in Rome to let the people watch the ceremony.

French
Pests can spoil food, cause irritation, or in a worse case cause allergic reactions, spread venom, or transmit infections.

French
It is obvious enough that the world has changed much because of humankind's scientific and technological advancements, and problems have become greater because of overpopulation and mankind's extravagant lifestyle.
Il est clair que le monde a beaucoup changé grâce aux progrès scientifiques et technologiques réalisés par les êtres humains, et que les problèmes sont devenus plus importants en raison de la surpopulation et du mode de vie extravagant de l'humanité.
Il est assez évident que le monde a beaucoup changé en raison des avancées scientifiques et technologiques de l'humanité et que les problèmes se sont accrus en raison de la surpopulation et du mode de vie extravagant de l'homme.
Il est évident que le monde a beaucoup changé en raison des avancées scientifiques et technologiques de l'humanité, et que les problèmes ont augmenté à cause de la surpopulation et du mode de vie extravagant de l'homme.

Prompt
Use the following knowledge base to complete the task of "recommending a restaurant" by continuing the conversation as a task-oriented dialogue system: