ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning

Over the last few years, large language models (LLMs) have emerged as among the most important breakthroughs in natural language processing (NLP), fundamentally transforming research and development in the field. ChatGPT is one of the most exciting recently developed LLM systems, showcasing impressive language generation skills and attracting broad public attention. Beyond the various exciting applications discovered for ChatGPT in English, the model can process and generate text in multiple languages thanks to its multilingual training data. Given the broad adoption of ChatGPT for English across different problems and areas, a natural question is whether ChatGPT can also be applied effectively to other languages, or whether it is necessary to develop more language-specific technologies. Answering this question requires a thorough evaluation of ChatGPT over multiple tasks with diverse languages and large datasets (i.e., beyond reported anecdotes), which is still missing or limited in current research. Our work aims to fill this gap in the evaluation of ChatGPT and similar LLMs to provide more comprehensive information for multilingual NLP applications. While this work will be an ongoing effort that incorporates additional experiments in the future, our current paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources. We also focus on the zero-shot learning setting for ChatGPT to improve reproducibility and better simulate the interactions of general users. Compared to the performance of previous models, our extensive experimental results demonstrate a worse performance of ChatGPT for different NLP tasks and languages, calling for further research to develop better models and understanding for multilingual learning.

The recent advances in NLP feature large language models (LLMs) that have parameter sizes over a hundred billion and are pre-trained on massive data, e.g., GPT-3 (Rae et al., 2021), Megatron (Shoeybi et al., 2019), GPT-Jurassic (Lieber et al., 2021), OPT-175B (Zhang et al., 2022b), and multilingual BLOOM (Scao et al., 2022). Although still relying on the Transformer architecture, the unprecedented scale of model size and training data has enabled new emergent abilities that change the landscape and practices in NLP (Wei et al., 2022). An important emergent skill is prompt-based learning, which facilitates the probing of information from LLMs with prompts by sampling the learned language distributions (Brown et al., 2020). In this way, the models demonstrate strong generalization in few-shot and zero-shot learning while avoiding parameter updates for the underlying architectures.
To this end, ChatGPT is one of the latest developments in NLP. In the first two months after its launch, ChatGPT attracted 100 million users (Milmo, 2023). As the next iteration of InstructGPT (Ouyang et al., 2022), ChatGPT is optimized on top of a GPT-3.5 series model using reinforcement learning from human feedback (RLHF) (Christiano et al., 2017). In contrast to previous LLMs, ChatGPT and InstructGPT leverage human demonstrations of desired outputs for input prompts to train supervised models, while human rankings of generated outputs are collected to train a reward model that further optimizes the LLMs with reinforcement learning. Compared to InstructGPT, ChatGPT is trained on conversational data to allow follow-up questions. In this way, ChatGPT is able to interact with humans in multi-turn conversations to generate outputs more aligned with human interests, thus being more natural and accessible to users. In addition, due to the deployment of public APIs for general users, there have been multiple reports on the successes of ChatGPT in solving challenging tasks in various areas, e.g., passing the United States Medical Licensing Examination (Kung et al., 2022) and real exams in a law school (Choi et al., 2023), performing competitively with commercial translation services for some high-resource languages (Jiao et al., 2023), and even producing code from natural language instructions. Nonetheless, the communities have also expressed concerns about the long-term implications of ChatGPT and LLMs for society, citing issues of plagiarism, privacy, misinformation, and security (Bang et al., 2023).
Similar to other LLMs, ChatGPT is trained on a mix of training data from multiple languages. Although English accounts for the majority, the combination of multilingual data contributes to ChatGPT's ability to accept inputs and generate responses in different languages, making it accessible and widely adopted by people around the world. However, given the recency of the technology, ChatGPT has mainly been evaluated on English data. The community lacks a comprehensive, public, and independent evaluation of ChatGPT over various non-English languages for diverse NLP tasks to provide proper perspectives for future research and applications. Given ChatGPT's transformative potential, associated long-term risks, huge training cost, and limited transparency, a fundamental question is whether multilingual LLMs such as ChatGPT can be reliably adopted for different languages, or whether it is necessary to develop language-specific LLMs or other technologies to solve NLP problems for non-English languages.
To address the multilingual concerns for ChatGPT, a few recent studies have investigated ChatGPT's performance and responses for non-English languages. However, the considered tasks/languages/settings and the scale of evaluation data in existing multilingual evaluations are still limited, and are thus unable to show a comprehensive picture of the potential and performance of the technology over a diversity of other languages. For instance, (Bang et al., 2023) evaluates the multilingual performance of ChatGPT on three tasks of language identification, sentiment analysis, and machine translation; however, only a few languages are selected for each task and the number of evaluation samples per language does not exceed 50. Beyond English, the analysis of ChatGPT's responses to input questions in (Guo et al., 2023) is only done for Chinese, while the results of the medical licensing examinations for ChatGPT are only shown for Japanese in (Kasai et al., 2023). In addition, (Fang et al., 2023) and (Wang et al., 2023a) explore ChatGPT in three languages, English, Chinese, and German; however, these studies only focus on grammatical error correction or cross-lingual summarization.
To this end, our paper aims to perform a more thorough evaluation of ChatGPT for its performance in multiple languages over different NLP tasks. Our experiments consider 37 diverse languages, characterizing high-, medium-, low-, and extremely low-resource languages, to better highlight ChatGPT's potential and limitations. To our knowledge, this is one of the largest sets of languages evaluated for ChatGPT in a public study to date. In addition to Natural Language Inference (NLI), Question Answering, and Common Sense Reasoning, our current work examines the tasks of Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), Relation Extraction, and Summarization, which are not covered in previous multilingual evaluations of ChatGPT. To improve the reproducibility of the evaluations and better reflect the approach of general users, our current work focuses on the zero-shot learning setting for ChatGPT, where no human-provided examples are presented to the model. Importantly, due to the scale of available languages/tasks/datasets/models and the growing nature of multilingual learning research in NLP, we treat this work as an ongoing and public effort to evaluate ChatGPT and other LLMs for multiple languages, emphasizing understudied languages to measure robustness and democratize the impacts of the technologies. Despite potential updates with future experiments, our current experiments suggest the following tendencies:

• ChatGPT's zero-shot learning performance is generally worse than the state-of-the-art performance of supervised learning models for a majority of the considered tasks across different languages, including high-, medium-, low-, and extremely low-resource languages. The performance gaps are usually very large, demonstrating that ChatGPT is not fit as a general solver for different NLP problems. This highlights the importance of task-specific models for the development of NLP applications.

• ChatGPT's performance is generally better for English than for other languages, especially for higher-level tasks that require more complex reasoning abilities (e.g., named entity recognition, question answering, common sense reasoning, and summarization). The performance differences can be substantial for some tasks and lower-resource languages, which demonstrates the biases of ChatGPT toward English and suggests the necessity of developing language-specific models/LLMs for different languages and groups.

• ChatGPT can perform better with English prompts even when the task and input texts are intended for other languages, further confirming ChatGPT's bias toward English.

Related Work
Since the release of ChatGPT in November 2022 with impressive language abilities, there has been growing interest in evaluating ChatGPT for different aspects of natural language understanding. The first line of work concerns the performance comparison of ChatGPT and state-of-the-art systems for important NLP tasks such as text summarization (Wang et al., 2023a; Yang et al., 2023), machine translation (Hendy et al., 2023; Jiao et al., 2023; Kocmi and Federmann, 2023), question answering (Tan et al., 2023; Omar et al., 2023; Lai et al., 2023), information extraction (Wei et al., 2023; Gao et al., 2023), text classification (Kuzman et al., 2023; Amin et al., 2023), grammatical error detection (Fang et al., 2023), and stance detection (Zhang et al., 2022a). Along this line, several recent studies have attempted to examine the performance of ChatGPT more comprehensively on multiple datasets (Bang et al., 2023; Qin et al., 2023; Kocoń et al., 2023; Zhong et al., 2023). The second direction for ChatGPT evaluation focuses on the robustness/reliability of the model against possible variants of input texts. For example, (Wang et al., 2023b) explores the robustness of ChatGPT under adversarial and out-of-domain learning settings, while (Jang and Lukasiewicz, 2023) examines the logical prediction consistency of ChatGPT for inputs with semantic equivalence, logical negation, or symmetricity. Finally, the third dimension for ChatGPT evaluation discusses the potential impacts and risks of the technology for the broader society, e.g., in education (Susnjak, 2022; Khalil and Er, 2023), law (Choi et al., 2023), medicine (Kung et al., 2022), ethics (Shen et al., 2023), human-computer collaboration (Lanzi and Loiacono, 2023), and cognition (Mahowald et al., 2023). However, to our knowledge, none of the existing works has conducted large-scale evaluations of ChatGPT for multiple and diverse languages/tasks as we do.

Methodology
The goal of our research is to evaluate the performance of ChatGPT and LLMs on NLP tasks in different languages. Given the large number of NLP datasets/tasks/languages and the growing developments of LLMs, our work will be an ongoing effort to include additional experiments to become more comprehensive along the way. In the current version of the paper, we evaluate ChatGPT on seven diverse NLP tasks, i.e., Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), Relation Classification, Natural Language Inference (NLI), Question Answering (QA), Common Sense Reasoning (CSR), and Summarization. Over the different tasks, our experiments cover 37 diverse languages, characterizing high-, medium-, low-, and extremely low-resource languages to provide broader perspectives. Following (Bang et al., 2023), we employ the ratio of the data for each language in the CommonCrawl corpus, i.e., the main data used to pre-train GPT-3, to classify the resource levels. In particular, a language is considered high-, medium-, low-, or extremely low-resource if its data ratio is greater than 1%, between 0.1% and 1%, between 0.01% and 0.1%, or smaller than 0.01%, respectively. For each task and language, we evaluate ChatGPT in the zero-shot learning setting. We also report the state-of-the-art performance of supervised models for each task in each language as a reference for research progress. In zero-shot learning, an NLP task T is specified by a natural-language task description D. Given a new data sample with input text X for the task T, the concatenation of D and X is sent to the ChatGPT model G as the input prompt to generate a natural-language response R = G([D; X]). Afterward, the response R is parsed using pre-defined task-specific rules P to obtain an output in the required format for T (e.g., a pre-defined label for classification problems). Finally, the outputs Y for the examples in an evaluation dataset are scored to return ChatGPT's performance on task T.
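As a minimal sketch, the resource-level thresholds above map onto a simple classifier; the percentage values in the usage example are illustrative, not the actual CommonCrawl ratios:

```python
def resource_level(percent_of_commoncrawl):
    """Classify a language's resource level from its share (in percent)
    of the CommonCrawl corpus, following the thresholds in the text:
    >1% high, 0.1-1% medium, 0.01-0.1% low, <0.01% extremely low."""
    if percent_of_commoncrawl > 1.0:
        return "high"
    elif percent_of_commoncrawl > 0.1:
        return "medium"
    elif percent_of_commoncrawl > 0.01:
        return "low"
    else:
        return "extremely-low"
```

For example, a language occupying 0.5% of CommonCrawl would be classified as medium-resource, while one at 0.005% would be extremely low-resource.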
Different from some previous work that exploits two-stage prompting to adopt a zero-shot chain of thoughts (Kojima et al., 2022; Qin et al., 2023), we directly utilize single-stage prompting that only adds the task description D to each input X to simulate the common approach of general users of ChatGPT. Other prompting strategies can be explored in future work. As such, in the current version, we aim to design simple task descriptions D while ensuring the necessary information to indicate the task and facilitate the parsing of responses into accurate outputs Y. In addition, for tasks in a non-English target language, we evaluate task descriptions in both English and the target languages to shed light on the best approach to prompt ChatGPT in multilingual settings. To facilitate the experiments, all non-English task descriptions are obtained by using the automatic translation tool Google Translate to translate the designed English descriptions for each task. Finally, all of the responses from ChatGPT in this work were obtained between March 1 and April 5, right after ChatGPT was made available in the OpenAI APIs to enable large-scale requests from the public for comprehensive evaluations. To improve reproducibility, we clear the conversations in ChatGPT for each query to remove any previous context. In the following, due to the space constraint, we only describe the tasks, datasets, and ChatGPT's performance. The designed prompts for each task are provided in the Appendix.
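The zero-shot procedure above (prompt construction, response generation R = G([D; X]), rule-based parsing P, and scoring of outputs Y) can be sketched as a generic evaluation loop; every function argument below is a placeholder for illustration, not the paper's actual implementation:

```python
def evaluate_zero_shot(model_fn, task_description, examples, parse_fn, score_fn):
    """Generic zero-shot evaluation loop.

    model_fn         : callable standing in for G (e.g., an API call)
    task_description : the natural-language description D
    examples         : iterable of (input_text, gold_label) pairs
    parse_fn         : task-specific rules P mapping a response R to an output
    score_fn         : scorer over (predicted, gold) pairs
    """
    outputs = []
    for x, gold in examples:
        prompt = f"{task_description}\n{x}"   # the concatenation [D; X]
        response = model_fn(prompt)           # R = G([D; X])
        outputs.append((parse_fn(response), gold))
    return score_fn(outputs)
```

A toy usage with a deterministic stand-in model: a sentiment "model" that answers "Answer: positive" for inputs containing "great", a parser that strips the "Answer: " prefix, and accuracy as the scorer.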

Part-of-Speech Tagging
Part-of-Speech (POS) Tagging is a coarse-grained word classification task whose goal is to label the syntactic information of the words in a sentence.
In the experiments, we utilize the XGLUE-POS dataset from Huggingface Datasets, which only includes 17 languages (e.g., excluding Portuguese). As such, we use the test sets of XGLUE-POS, with more than 15K samples for the selected languages, in the evaluation. Appendix A provides details of our POS Tagging prompt for ChatGPT. Results: Table 2 presents the performance of ChatGPT (zero-shot learning with both English and language-specific task descriptions) and the fully supervised XLM-R model (based on XLM-RoBERTa base) (Liang et al., 2020). Here, performance is measured via the accuracy of the predicted POS tags. As can be seen, ChatGPT outperforms XLM-R on 13 out of 17 languages for multilingual POS tagging. Different from XLM-R, where English has the best POS tagging performance, ChatGPT seems to achieve better accuracy for some other languages (e.g., French, Spanish) than for English. Finally, we observe that English prompts tend to perform better than, or at least competitively with, language-specific prompts for multilingual POS tagging.
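As an illustration of the response parsing and accuracy scoring for POS tagging, a minimal sketch might extract (word, tag) tuples from a ChatGPT response and align them with the input word list; the regular expression and the fallback "X" tag are assumptions for illustration (repeated words in one sentence would need positional alignment in practice):

```python
import re

def parse_pos_response(response, words):
    """Extract ("word", "TAG") pairs from a response and align them with
    the input word list; words missing from the response get tag "X".
    Simplification: a word appearing twice keeps only its last tag."""
    pairs = dict(re.findall(r'\("([^"]+)",\s*"([^"]+)"\)', response))
    return [pairs.get(w, "X") for w in words]

def tag_accuracy(pred_tags, gold_tags):
    """Fraction of positions where predicted and gold tags agree."""
    correct = sum(p == g for p, g in zip(pred_tags, gold_tags))
    return correct / len(gold_tags)
```

For instance, the response `[("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]` against the words ["the", "cat", "sleeps", "quietly"] yields tags ["DET", "NOUN", "VERB", "X"].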

Named Entity Recognition
Named Entity Recognition (NER) is an important task in NLP (Sang and Meulder, 2002), aiming to identify the spans and semantic types of names (e.g., person, organization) in text. NER is usually formulated as a sequence tagging problem where a label is assigned to each word in a sentence to indicate names. The BIO annotation schema is often leveraged to form the labels to capture both span and type information (Ratinov and Roth, 2009).
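A minimal sketch of the BIO annotation schema described above, with hypothetical word-index spans (end index exclusive):

```python
def spans_to_bio(words, entities):
    """Encode (start, end, type) word-index spans as BIO labels:
    'B-<type>' marks the first word of an entity, 'I-<type>' any
    following word, and 'O' marks words outside all entities."""
    labels = ["O"] * len(words)
    for start, end, etype in entities:
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels
```

For example, the sentence ["john", "smith", "visited", "paris"] with a PER span over the first two words and a LOC span over the last word is encoded as ["B-PER", "I-PER", "O", "B-LOC"].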
For the multilingual NER evaluation of ChatGPT, we employ the datasets from the recent shared task MultiCoNER (Malmasi et al., 2022). Results: Table 3 evaluates the performance of ChatGPT (zero-shot learning with both English and language-specific task descriptions) and DAMO (Wang et al., 2022a), the state-of-the-art supervised model for MultiCoNER.

Results: Table 4 shows the performance of ChatGPT (zero-shot learning with both English and language-specific task descriptions) and the supervised mT5-IL model (Chen et al., 2022). The performance gaps between ChatGPT and mT5-XXL also seem smaller for high-resource languages. Finally, ChatGPT with target-language task descriptions produces significantly lower accuracy than with English task descriptions across all considered languages, suggesting the benefits of English descriptions for multilingual NLI with ChatGPT.

Question Answering
Given a context passage and a question, a Question Answering (QA) model needs to return the answer to the question, which should be a span of text in the input passage. To this end, we utilize the XQuAD dataset (Artetxe et al., 2020) to evaluate ChatGPT in multiple languages for QA. XQuAD involves 240 paragraphs and 1,190 question-answer pairs in English, plus their translations into ten other languages, for evaluation. We describe our ChatGPT prompt for QA in Appendix F. Given the responses from ChatGPT for our QA prompts, we remove the period characters at the end and directly evaluate the remaining responses using the SQuAD scorer, as suggested by the original paper of XQuAD (Artetxe et al., 2020). Results: Table 6 shows the performance of ChatGPT (zero-shot learning) and mT5-XXL (Xue et al., 2021), a state-of-the-art supervised learning model for XQuAD. For each language, mT5-XXL is trained on the combination of the English training data and its translations into the target language to achieve optimal performance. We report the performance using both the exact match (EM) and F1 scores. Table 6 illustrates that ChatGPT's zero-shot performance is significantly worse than that of the supervised model mT5-XXL for all the languages. Across different models and prompts, the QA performance for English is significantly better than for other languages, demonstrating the clear English bias of current multilingual language models. Finally, we find that prompting ChatGPT in English tends to produce better performance for multilingual QA than using the target languages.
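A sketch of SQuAD-style scoring consistent with the description above: the normalization steps (lowercasing, punctuation and article removal, whitespace collapse) mirror the official SQuAD scorer, though this is a simplified re-implementation for illustration rather than the script itself:

```python
import collections
import re
import string

def normalize(s):
    """SQuAD-style answer normalization: lowercase, strip punctuation,
    drop English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Token-level F1 over the normalized answers."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = collections.Counter(p) & collections.Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

For instance, "The Eiffel Tower." and "eiffel tower" count as an exact match after normalization, while a verbose prediction "the eiffel tower in paris" against gold "eiffel tower" gets a partial F1 credit.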

Common Sense Reasoning
Common Sense Reasoning (CSR) evaluates the reasoning of models via multiple-choice questions. The inputs involve a question and a few answer choices, and the models need to select one of the choices. To evaluate ChatGPT's multilingual abilities for CSR, we leverage two datasets: (i) X-CSQA (Talmor et al., 2019; Lin et al., 2021), which involves English data and its translations into 15 other languages, and (ii) Wikipedia Cloze QA from IndicNLPSuite (Kakwani et al., 2020), which covers 11 low- and extremely low-resource Indian languages. We evaluate the models on the dev set of X-CSQA, with 1,000 samples for each language, while the Wiki Cloze QA dataset from IndicNLPSuite contains 62,314 samples across all languages. Appendix G presents our ChatGPT prompt for CSR.
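A hedged sketch of how a multiple-choice CSR prompt might be built and its free-text answer mapped back to a choice; the template and the parsing heuristics are illustrative assumptions (the paper's actual prompt is in Appendix G):

```python
def build_csr_prompt(question, choices):
    """Hypothetical multiple-choice prompt with lettered options."""
    lettered = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return (f"{question}\n{lettered}\n"
            "Answer with the letter of the correct choice only.")

def parse_choice(response, choices):
    """Map a free-text response to a choice index: first try a leading
    letter, then fall back to matching a choice's text; -1 if no match."""
    head = response.strip()[:1].upper()
    if head and "A" <= head <= chr(64 + len(choices)):
        return ord(head) - 65
    for i, c in enumerate(choices):
        if c.lower() in response.lower():
            return i
    return -1
```

The fallback matters in practice: even when asked for a letter only, a chat model may answer with a sentence such as "The answer is bank.", which the text-matching branch still resolves.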
Results: Table 7 reports the accuracy of ChatGPT (zero-shot learning with both English and language-specific prompts) and the state-of-the-art supervised model TRT (Fang et al., 2022) on the X-CSQA dataset. TRT is based on the XLM-RoBERTa large model (Conneau et al., 2020), where commonsense knowledge from different sources is retrieved to enrich the input questions and answers. Except for English, the table illustrates the poorer performance of ChatGPT compared to TRT across all other languages for CSR on X-CSQA when the English task description is used. Interestingly, in contrast to other tasks, we find that language-specific prompts tend to perform better than English prompts for ChatGPT in CSR for high-resource languages (except for Chinese), leading to some improvement over supervised learning (e.g., for French, Spanish, and Dutch).
For IndicNLPSuite, Table 8 demonstrates the accuracy of ChatGPT and IndicBERT (Kakwani et al., 2020), a pre-trained encoder-only model using the ALBERT architecture over Indian language corpora. IndicBERT is fine-tuned on the training data to deliver state-of-the-art performance for IndicNLPSuite in the original paper (Kakwani et al., 2020). Our experimental results for IndicNLPSuite confirm the general tendency that supervised learning models still perform better than ChatGPT over different languages. However, there are two exceptions, Hindi and Kannada, where ChatGPT produces better accuracy. Finally, Table 8 suggests that English prompts are a better way to prompt ChatGPT for Indian languages than the target languages themselves (except for Marathi and Gujarati).

Our ChatGPT evaluation for multilingual summarization is included in Appendix H.

Discussion
The most important finding from our experimental results is that ChatGPT exhibits significantly worse performance than state-of-the-art supervised models for most of the considered NLP tasks in different languages. Given the huge costs of training ChatGPT and similar LLMs, as well as the necessity of paid APIs to run large numbers of requests with OpenAI, it seems more reasonable to build smaller task-specific models for NLP problems (or at least for the considered tasks) in different languages that can be hosted locally to serve at lower costs.
In addition, we notice an exception for the POS tagging task, where ChatGPT can achieve competitive or even better performance than the supervised learning models (especially with English prompts) over different languages. For instance, ChatGPT has significantly better POS tagging accuracy for Thai, Vietnamese, Bulgarian, Hindi, and Urdu, which are medium- and low-resource languages. In contrast to the other considered tasks, which require some level of semantic reasoning, POS tagging focuses on low-level syntactic analysis. We thus hypothesize that ChatGPT possesses strong grammatical skills but only low-level semantic reasoning abilities, sufficient to generate seemingly fluent texts in multiple languages. For more complicated semantic analysis, however, ChatGPT might find it more challenging to produce accurate predictions and generations.
Regarding the classification of high-, medium-, low-, and extremely low-resource languages, our work currently relies on the data ratios of the languages in the CommonCrawl corpus. According to our experiments, it is interesting that ChatGPT's performance for low- and extremely low-resource languages in some tasks is better than or comparable to that for high- or medium-resource languages. For instance, for POS tagging in Table 2, ChatGPT's performance for Urdu (a low-resource language) is better than its performance for Vietnamese and Thai (high- and medium-resource languages). In NER, ChatGPT achieves better performance for the low-resource language Bengali than for Chinese (using English prompts in Table 3). For the common sense reasoning task in Table 7, ChatGPT's performance for the extremely low-resource language Swahili is comparable to that for Polish (with English prompts). To this end, data size might not be the only factor that dictates the resource level and the performance of ChatGPT and LLMs on a task for a given language.
Compared to language-specific prompts, the superior performance of ChatGPT with English task descriptions over a majority of problems and languages suggests that ChatGPT might better understand/analyze the tasks with English prompts, leading to improved abilities to generate responses with accurate outputs. In addition, the inclusion of English task descriptions for non-English inputs can be seen as a way to shift the representations of language-specific inputs toward the English space, which ChatGPT can process better due to the domination of English in its training data. However, we also note some recent work that reveals rather different findings, suggesting that ChatGPT can perform competitively or even better with language-specific prompts for NLP tasks in target languages (Hasan et al., 2023; Deng et al., 2023). A reason for these different findings might be the potentially different versions of ChatGPT used to conduct the studies at different times. This highlights the importance of better transparency for LLMs, e.g., with respect to training data (Nguyen et al., 2023), to allow accurate and deeper investigation of the models. Finally, the better performance with English prompts also raises an interesting question of whether English is the optimal language to prompt ChatGPT, or whether it is better to employ other languages for this purpose for different target languages.

Conclusion
Toward a more comprehensive understanding of ChatGPT and LLMs in terms of their multilingual learning abilities for NLP, our work conducts an evaluation of ChatGPT on 7 different tasks, i.e., Part-of-Speech Tagging, Named Entity Recognition, Relation Extraction, Natural Language Inference, Question Answering, Common Sense Reasoning, and Summarization. Using 37 diverse languages with high, medium, low, and extremely low resources for the experiments, our results reveal the suboptimal performance of ChatGPT in the zero-shot learning setting for NLP tasks in different languages, advocating for task-specific models to secure the best performance. As ongoing research, we plan to extend the experiments to include more languages, tasks, models, criteria, and settings in future work to obtain broader and deeper insights.

Acknowledgement
This research has been supported by the Army Research Office (ARO) grant W911NF-21-1-0112, the NSF grant CNS-1747798 to the IUCRC Center for Big Learning, and the NSF grant #2239570. This research is also supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract 2022-22072200003. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Limitations
As an ongoing effort to evaluate ChatGPT and LLMs on multilingual learning tasks, our current work has several limitations that can be addressed in future studies. First, although our experiments have covered 37 languages, including low- and extremely low-resource languages, there are still many other languages that are not explored in the current work. Some tasks/datasets in our work also do not cover lower-resource languages. Future work can expand the language set with a greater focus on lower-resource languages to better understand LLMs' performance in this important direction. Second, many other tasks, including those with available multilingual datasets, have not been considered in the current work. Examining more tasks and datasets will enable a more comprehensive understanding of ChatGPT and LLMs in multilingual settings. Third, our current work only evaluates ChatGPT in the zero-shot learning setting, and is thus unable to show comparisons with other recent multilingual LLMs, e.g., BLOOM (Scao et al., 2022), GPT-4, and BARD, in various learning scenarios. While some of these models are currently less accessible for large-scale evaluations, our plan is to include more models and learning settings along the way to strengthen our evaluations and comparisons when possible. Finally, the current work only evaluates ChatGPT in terms of performance on NLP tasks in different languages. To better characterize ChatGPT and LLMs, other evaluation metrics should also be investigated to report more complete perspectives for multilingual learning, including but not limited to adversarial robustness, biases, toxic/harmful content, hallucination, accessibility, development costs, and interpretability.

A Part-of-Speech Tagging Prompt
Our prompt for POS tagging for ChatGPT consists of a task description, a note on the output format, and an input sentence, concatenated in that order, i.e., Prompt_POS = [task description; output format note; input sentence]. Notably, instead of directly using the text of the input sentence, we feed ChatGPT the list of words in the sentence to facilitate the word-label alignment and the parsing of ChatGPT's responses for POS tagging. Our task description and output format note then emphasize the expected format for ChatGPT's responses, following a tuple structure with pairs of words and their corresponding POS tags. In the experiments, this approach has led to better performance for ChatGPT than using the direct input sentence. We illustrate an example of the English POS prompts for ChatGPT in Figure 1.
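A minimal sketch of the Prompt_POS construction described above; the description and note strings in the usage example are illustrative, not the paper's exact wording:

```python
def build_pos_prompt(task_description, format_note, words):
    """Prompt_POS = [task description; output format note; input words].
    The input is passed as a rendered word list (rather than raw text)
    to ease the alignment of words with the (word, tag) tuples parsed
    from the model's response."""
    word_list = "[" + ", ".join(f'"{w}"' for w in words) + "]"
    return "\n".join([task_description, format_note, f"Input: {word_list}"])
```

Passing the explicit word list also fixes the tokenization in advance, so the response parser does not have to re-segment the sentence.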

B Named Entity Recognition Prompt
Our prompt structure for ChatGPT for Named Entity Recognition (NER) follows the prompts for POS Tagging, i.e., Prompt_NER = [task description; output format note; input sentence], which involves a task description to explain the task and list the entity types/labels of interest. We also have a note to specify the expected output format with tuples of words and predicted tags for names. However, a key difference for NER is that we explicitly ask ChatGPT to produce tags for each word in the BIO format. Although this approach seems to make the task more challenging for ChatGPT, we find that it actually improves ChatGPT's performance. Our hypothesis is that the BIO tag requirement encourages ChatGPT to solve NER as a sequence labeling problem, thus forcing it to comprehensively annotate names in input sentences. In contrast, the simpler approach of prompting ChatGPT for names without the BIO specification might suggest a reading comprehension formulation that does not tag all names with exact spans for NER. The responses from ChatGPT are also harder (i.e., more ambiguous and unpredictable) to parse into NER outputs without the BIO requirement. We provide an English prompt example for NER for ChatGPT in Figure 2.
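To score such responses, the BIO labels parsed from ChatGPT's tuples can be decoded back into typed spans; the following decoder is a sketch of this evaluation plumbing (an assumption for illustration, not the paper's code), written to tolerate a stray I- tag that starts a span:

```python
def bio_to_spans(labels):
    """Decode a BIO label sequence into (start, end, type) spans with
    exclusive end indices. An I- tag without a preceding B- of the same
    type opens a new span rather than raising an error."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels):
        if lab.startswith("B-") or (lab.startswith("I-") and etype != lab[2:]):
            if start is not None:           # close the previous span
                spans.append((start, i, etype))
            start, etype = i, lab[2:]
        elif lab == "O":
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:                   # span running to the end
        spans.append((start, len(labels), etype))
    return spans
```

The lenient handling of leading I- tags is deliberate: a generative model's BIO output is not guaranteed to be well-formed, so the decoder recovers what it can instead of discarding the prediction.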
Figure 2 example:

Task Description: You are working as a named entity recognition expert and your task is to label a given text with named entity labels. Your task is to identify and label any named entities present in the text. The named entity labels that you will be using are PER (person), LOC (location), CORP (corporation), CW (creative work), GRP (group of people), and PROD (product). You may encounter multi-word entities, so make sure to label each word of the entity with the appropriate prefix ("B" for the first word of the entity, "I" for any non-initial word of the entity). For words which are not part of any named entity, you should return "O".
Note: Your output format should be a list of tuples, where each tuple consists of a word from the input text and its corresponding named entity label.
Input: ["john", "is", "first", "mentioned", "in", "a", "charter", "from", "1247", "."]
⇒ [("john", "B-PER"), ("is", "O"), ("first", "O"), ("mentioned", "O"), ("in", "O"), ("a", "O"), ("charter", "B-CW"), ("from", "O"), ("1247", "B-PROD"), (".", "O")]

In order to better understand the performance of ChatGPT on MultiCoNER, we use the scoring script nervaluate to compute detailed scores for each entity type for ChatGPT. Table 9 shows label-wise precision, recall, and F1 scores of ChatGPT (with English prompts). We also include spurious percentages (over the total numbers of predictions), which are the percentages of ChatGPT's predictions that do not exist in the annotated data for each type. As can be seen, ChatGPT's extraction performance is very poor for GRP (group of people) and CW (creative work), which have F1 scores of less than 15%. Also, the spurious percentages of ChatGPT are generally high for all entity types, which suggests ChatGPT's verbosity and confusion for NER.
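A simplified stand-in for the per-type scoring described above, using exact-span matching only (the actual nervaluate script also reports partial and type-only matching schemes, and defines spurious via overlap rather than exact mismatch):

```python
def label_wise_scores(pred, gold):
    """Per-type precision/recall/F1 over exact (start, end, type) spans,
    plus a 'spurious' rate: the fraction of predictions of each type
    with no exact match in the gold annotations."""
    types = {t for *_, t in pred} | {t for *_, t in gold}
    scores = {}
    for t in types:
        p = {s for s in pred if s[2] == t}
        g = {s for s in gold if s[2] == t}
        tp = len(p & g)
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        spurious = (len(p) - tp) / len(p) if p else 0.0
        scores[t] = (prec, rec, f1, spurious)
    return scores
```

A high spurious rate under this definition corresponds to the verbosity issue noted in the text: many predicted entities that the annotations do not contain.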

D Relation Extraction Prompt
An input example for RE involves an input text and two entity mentions in the text for classification. To probe ChatGPT for RE on an example, we design the prompt as the concatenation of a task description, an output format note, the input text, and the two entity mentions, i.e., Prompt_RE = [task description; output format note; input text; entity 1; entity 2]. In the task description for RE, we explicitly include all the relation types to inform ChatGPT. We also introduce an output format note to specify the expected format of ChatGPT's responses for RE, thus facilitating response parsing for relation labels. To illustrate the RE prompts for ChatGPT, we present an example with the English prompt and corresponding response in Figure 3.
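The RE prompt layout above can be sketched as a simple concatenation. The relation label set below is a made-up example for illustration, not the SMiLER inventory, and the exact wording is our assumption rather than the prompt used in the paper.

```python
# Hypothetical relation types; the real experiments list the dataset's labels.
RELATION_TYPES = ["birth-place", "has-author", "no-relation"]

def build_re_prompt(text, entity1, entity2, relation_types=RELATION_TYPES):
    # task description (with all relation types) + output format note +
    # input text + the two entity mentions, as in Prompt_RE.
    task = ("Classify the relation between the two entity mentions in the "
            "text. Possible relation types: " + ", ".join(relation_types) + ".")
    note = "Answer with exactly one relation type from the list above."
    return (f"{task}\n{note}\nText: {text}\n"
            f"Entity 1: {entity1}\nEntity 2: {entity2}")

re_prompt = build_re_prompt("Mozart was born in Salzburg.", "Mozart", "Salzburg")
```

Listing the full label set in the task description is what lets a constrained answer ("exactly one relation type") be parsed directly into a classification decision.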

E Natural Language Inference Prompt
To construct the prompt for ChatGPT for each example in XNLI, we directly concatenate the task description, the premise, the hypothesis, and a multiple-choice question (over entailment, contradiction, and neutral) in this order, i.e., Prompt_NLI = [task description; premise; hypothesis; question]. An example of an English input prompt and the response from ChatGPT is shown in Figure 4.
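A sketch of this construction, reusing the task description wording from the English prompt example in Figure 4; the label-extraction heuristic is our own assumption about how a free-form reply could be mapped back to an XNLI label.

```python
def build_nli_prompt(premise, hypothesis):
    # Task description as in the Figure 4 example.
    task = ("Please identify whether the premise entails or contradicts the "
            "hypothesis in the following premise and hypothesis. The answer "
            'should be exactly "entailment", "contradiction", or "neutral".')
    return (f"{task}\nPremise: {premise}\nHypothesis: {hypothesis}\n"
            "Is it entailment, contradiction, or neutral?")

def parse_nli_response(response):
    """Heuristically map a free-form reply to one of the three XNLI labels."""
    lowered = response.lower()
    for label in ("entailment", "contradiction", "neutral"):
        if label in lowered:
            return label
    return None  # unparseable reply

nli_label = parse_nli_response("Neutral. The premise doesn't confirm or deny it.")
```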

F Question Answering Prompt
We collect the English task description for QA from the NaturalInstructions repository (Wang et al., 2022b) for ChatGPT. In addition, as ChatGPT tends to generate long responses, we introduce a note to remind the model that the answers for our dataset should be short and directly extracted from the input passage. This approach has helped ChatGPT to provide more direct answers in our experiments. To this end, for an example with an input passage and question, our prompt for ChatGPT is formed via: Prompt_QA = [task description; passage; question; note]. We demonstrate an example of the QA prompts in Figure 5.

Figure 4 (NLI prompt example): Task Description: Please identify whether the premise entails or contradicts the hypothesis in the following premise and hypothesis. The answer should be exactly "entailment", "contradiction", or "neutral". Premise: And he said, Mama, I'm home. Hypothesis: He called his mom as soon as the school bus dropped him off. Is it entailment, contradiction, or neutral? ⇒ Neutral. The premise doesn't confirm or deny the hypothesis.
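The Prompt_QA layout can be sketched as below, using the task description and note wording shown in the Figure 5 example; the function name is illustrative, not from the paper's code.

```python
def build_qa_prompt(passage, question):
    # Prompt_QA = [task description; passage; question; note].
    task = "Answer the question from the given passage."
    # The note discourages ChatGPT's tendency toward long, sentence-style answers.
    note = ("Note: Your answer should be directly extracted from the passage "
            "and be a single entity, name, or number, not a sentence.")
    return f"{task}\nPassage: {passage}\nQuestion: {question}\n{note}"

qa_prompt = build_qa_prompt("Peyton Manning was 39.", "How old was he?")
```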

G Common Sense Reasoning Prompt
In the CSR prompts for ChatGPT, we combine the task description, the question, and the multiple choices for each sample, i.e., Prompt_CSR = [task description; question; multiple choices]. Here, in the task description, we also indicate the language of the input question and multiple choices. Two examples of prompts for CSR inputs are presented in Figure 6 for the X-CSQA dataset and in Figure 7 for the Wikipedia Cloze QA dataset from IndicNLPSuite.
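A sketch of the CSR prompt and a heuristic for recovering the chosen option from a verbose reply; the task wording follows the X-CSQA example in Figure 6, while the option formatting and the letter-extraction regex are our assumptions.

```python
import re

def build_csr_prompt(question, choices, language="English"):
    # Prompt_CSR = [task description; question; multiple choices];
    # the task description names the input language, as described above.
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    task = ("In this task, you will be presented with a question that has "
            f"multiple possible answers in {language}. You should choose the "
            "most suitable option out of "
            + ", ".join(f'"{l}"' for l in letters)
            + ", based on your commonsense knowledge.")
    options = " ".join(f"Option {l}: {c}." for l, c in zip(letters, choices))
    return f"{task}\nQuestion: {question}\n{options}"

def parse_csr_response(response, n_choices):
    """Return the first standalone option letter found in the reply."""
    letters = "".join(chr(ord("A") + i) for i in range(n_choices))
    match = re.search(rf"\b([{letters}])\b", response)
    return match.group(1) if match else None

csr_choice = parse_csr_response("Option B is the most suitable answer: key.", 5)
```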

H Summarization
In summarization, systems need to provide key and concise information for a longer input text, which can be helpful for different downstream applications such as news analysis, marketing, question answering, and scientific document processing. To study the performance of ChatGPT for summarization in multiple languages, we choose the XL-Sum dataset (Hasan et al., 2021) that provides summaries of news articles in 44 languages. In contrast to extractive summarization, which selects important sentences from the input text to form a summary, XL-Sum addresses abstractive summarization, allowing text generation with more creative writing in the summary (the sentences in the summary might not necessarily appear in the input text). Despite greater challenges, abstractive summarization can produce more natural texts to better serve downstream applications.

Figure 5 (QA prompt example): Task Description: Answer the question from the given passage. Your answer should be directly extracted from the passage, and it should be a single entity, name, or number, not a sentence. Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager. Question: How old was Peyton Manning when he played in Super Bowl 50? Note: Your answer should be directly extracted from the passage and be a single entity, name, or number, not a sentence. ⇒ 39.

Figure 6 (X-CSQA prompt example, task description): Task description: In this task, you will be presented with a question that has multiple possible answers in English. You should choose the most suitable option out of "A", "B", "C", "D", and "E", based on your commonsense knowledge.
To facilitate the experiments, we select 12 languages in XL-Sum, covering high-, medium-, low-, and extremely low-resource languages, and evaluate ChatGPT's performance on the test datasets of these languages. Table 10 shows the sizes of the test data (i.e., the numbers of samples) in XL-Sum for the selected languages. In the experiments, we utilize the ROUGE-1, ROUGE-2, and ROUGE-L scores as performance measures for summarization. Note that for the non-English languages, the scorer script from the original paper of XL-Sum (Hasan et al., 2021) is used for performance computation.
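As a reference point for the metric, ROUGE-1 measures unigram overlap between a candidate summary and a reference. The sketch below is our own simplified, whitespace-tokenized stand-in for illustration only; the actual evaluation uses the XL-Sum scorer script, which handles multilingual tokenization and stemming.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified unigram-overlap ROUGE-1 F1 for whitespace-tokenized text."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    # Clipped unigram matches: each word counts at most as often as it
    # appears in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 pattern over bigrams and longest common subsequences, respectively.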
As a summary in XL-Sum is expected to be written in the same language as the input text, given an input text, our summarization prompt for ChatGPT is constructed via the concatenation: Prompt_SUM = [task description; output language specification; input text]. Accordingly, the task description is simply: "Summarize this <lang> text.", while the output language specification is expressed via: "The output should be in <lang>". Here, <lang> indicates the language that is presented in the input text and expected in the summary response. <lang> can be translated into appropriate languages as required by the language of the prompts. For instance, using English for the prompts, the summarization prompt for a French input is "Summarize this French text. The output should be in French: . . .". In the experiments, we find that ChatGPT might generate responses in English even for non-English inputs, so including the output language specification in the prompts is important to instruct ChatGPT to use the same language for the inputs and outputs.

Results: Tables 10 and 11 present the summarization performance of ChatGPT (zero-shot learning) for the selected languages in XL-Sum using English and language-specific prompts, respectively. In the tables, we also include the performance of the mT5-XXL model that is trained over the training data of specific languages in XL-Sum. mT5-XXL has achieved state-of-the-art performance for XL-Sum as reported in (Aharoni et al., 2022). It is obvious from the tables that ChatGPT's performance is consistently inferior to mT5-XXL's, with large performance gaps in different languages. To better understand the poor performance of ChatGPT, Tables 10 and 11 also report the average lengths of the human-provided summaries and the summaries generated by ChatGPT (in terms of the numbers of characters). It is clear from the tables that ChatGPT tends to generate lengthy summaries, potentially leading to its poorer performance. In addition, the tables show the success rates of ChatGPT for each language, defined as the ratio of requests sent to the ChatGPT server that received non-empty responses/summaries. As can be seen, the success rates of ChatGPT for lower-resource languages are also lower, which can further explain ChatGPT's performance and reliability issues for such languages.
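The summarization prompt template and the success-rate measure can be sketched as follows; the template wording follows the description above, while the function names are our own.

```python
def build_sum_prompt(text, lang="English"):
    # Prompt_SUM = [task description; output language specification; input text].
    # The explicit output-language specification keeps ChatGPT from
    # answering in English for non-English inputs.
    return f"Summarize this {lang} text. The output should be in {lang}: {text}"

def success_rate(responses):
    """Fraction of requests that received a non-empty response/summary."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if r and r.strip()) / len(responses)

rate = success_rate(["Une phrase.", "", "A summary."])
```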

Figure 1 :
Figure 1: Input prompt and output of ChatGPT for the XGLUE-POS dataset.

Figure 2 :
Figure 2: Input prompt and output of ChatGPT for the MultiCoNER dataset.

Figure 3 :
Figure 3: Input and output of ChatGPT for the SMiLER dataset.

Figure 4 :
Figure 4: Input prompt and output of ChatGPT for the XNLI dataset.

Figure 5 :
Figure 5: Input prompt and output of ChatGPT for the XQuAD dataset.

Question: When you return to work, you will likely need what to get in the door if you are the first to arrive? ⇒ Option B is the most suitable answer: key.

Figure 6 :
Figure 6: Input prompt and output of ChatGPT for the X-CSQA dataset.

Figure 7 :
Figure 7: Input prompt and output of ChatGPT for the Wikipedia Cloze QA dataset (IndicNLPSuite). Translation of the statement and options by Google Translate: Ratan Devasi was born on 25 September 1975 at Mount Abu in the Sirohi district of <MASK>. His father's name is Shankarlal Devasi and his wife's name is Viraj Devasi. Devasi has been a brilliant student since childhood. He is a Diploma in Hotel Management degree holder. Devasi has been quick-tempered and soft-spoken since his student life. Option A: Congress; Option B: NSUI; Option C: Rajasthan; Option D: Lok Sabha.

Table 2 :
Accuracy of ChatGPT (zero-shot learning) and XLM-R (supervised learning) on the test sets of XGLUE-POS. ChatGPT is evaluated with both English (en) and language-specific (spc) task descriptions.

Table 5 :
Accuracy of ChatGPT (zero-shot learning) and mT5-XXL (supervised learning with English and translated data) on the development set of XNLI. ChatGPT is evaluated with both English (en) and language-specific (spc) task descriptions.

..., a state-of-the-art supervised in-language prompting model for SMiLER. mT5-IL is based on the base version of mT5. Micro F1 scores are used as the performance metric for RE. From Table 4, the results suggest that mT5-
lations in the target language to achieve the best reported performance on XNLI. It is clear from the table that ChatGPT performs significantly poorer than mT5-XXL across different languages by large margins. The performance gaps between ChatGPT

Table 6 :
Performance of ChatGPT (zero-shot learning) and mT5-XXL (supervised learning with translated data) on the XQuAD dataset. en and spc indicate whether ChatGPT uses English or target-language prompts. The performance is computed using exact match (EM) and F1 scores.