Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension

The conversational machine reading comprehension (CMRC) task aims to answer questions in conversations and has become a hot research topic in recent years because of its wide applications. However, existing CMRC benchmarks, in which each conversation is assigned a single static passage, are inconsistent with real scenarios, so a model's comprehension ability in realistic settings is hard to evaluate reasonably. To this end, we propose Orca, the first Chinese CMRC benchmark, and further provide zero-shot/few-shot settings to evaluate a model's generalization ability across diverse domains. We collect 831 hot-topic-driven conversations with 4,742 turns in total. Each turn of a conversation is assigned a response-related passage, aiming to evaluate a model's comprehension ability more reasonably. The conversation topics are collected from a social media platform and cover 33 domains, keeping them consistent with real scenarios. Importantly, the answers in Orca are all well-annotated natural responses rather than the specific spans or short phrases of previous datasets. Besides, we implement three strong baselines to tackle the challenges in Orca. The results indicate the great challenge our CMRC benchmark poses. Our dataset and checkpoints are available at https://github.com/nuochenpku/Orca.


Introduction
Machine reading comprehension (MRC) is one of the most popular variations of question answering (QA) and draws a lot of attention in the natural language processing community (Rajpurkar et al., 2016; Chen et al., 2022a; You et al., 2021). Recently, researchers have commonly agreed that it is difficult to satisfy users in real scenarios by answering only a single-turn question. Subsequently, conversational machine reading comprehension (CMRC) has aroused increasing research interest, and large-scale CMRC benchmarks (Choi et al., 2018; Reddy et al., 2019; Campos et al., 2020; Zaib et al., 2022; You et al., 2022) have been proposed, such as CoQA (Reddy et al., 2019) and QuAC (Choi et al., 2018). Technically, CMRC aims to endow a machine with the ability to fully understand a given text passage/paragraph and respond appropriately to a set of questions in a conversation. However, despite notable successes across a variety of dataset types and at a sizable scale, there are still a number of important obstacles that have rarely been addressed in prior CMRC efforts: (1) Conversations in existing CMRC datasets are grounded only on a single, static document passage. Each time a question is asked, this passage must be considered the sole source of information. In practice, natural multi-turn QA is driven by a specific topic, and participants carry on the conversation with dynamic knowledge from their own minds rather than being restricted to a static evidence passage. Thus, there is a great discrepancy between the pattern of the training data and real application scenarios.
(2) Many annotated answers in these datasets are restricted to certain span texts in the provided passages. For instance, a significant proportion of CoQA answers are short phrases from the document (Campos et al., 2020). While span texts taken straight from the original document and basic phrases like "yes/no" can be treated as ground-truth answers, they do not sound like natural responses in real-life settings. Instead of being pushed to choose spans of certain types, people naturally form responses that are flexible and informative. Intuitively, constructing a CMRC dataset with well-annotated, human-crafted responses is urgently needed to build effective and reasonable CMRC models that cater to real human needs.
(3) Previous CMRC benchmarks are built only on English corpora, limiting the progress of CMRC in other languages. Besides, given the success of in-context learning with large language models (LLMs), a further crucial measure of model efficacy is performance in the few-shot setting. However, how current CMRC models perform with only a few conversation sessions remains unexplored, especially in Chinese.
To this end, we collect Orca, the first few-shot Chinese CMRC benchmark. Concretely, we hire professional annotators to annotate the following components of Orca: 1) Topic, which consists of several sentences that drive the whole conversation; 2) Domain, only one or two words that indicate the specific field of the conversation; 3) Conversations, where each turn is assigned a golden knowledgeable passage. Importantly, the question and response in each turn are human-crafted, leading to more coherent conversations, and the response-related passages are also manually selected from a search engine. Figure 1 shows an abridged conversation in Orca. In total, we collect 831 manually annotated conversations with 4,742 turns. Even though the size of our dataset could be questioned, our goal is to build strong CMRC models under few-shot/zero-shot settings, without needing to collect data for each target domain.
Our dataset has the following salient features: 1) Each turn is assigned a passage as its knowledge source; that is, the given passages change dynamically rather than being a single topic-central passage. In this manner, Orca gives more in-depth background knowledge about the specific topic, making the follow-up dialogue more natural and diversified. In this aspect, Orca poses a greater challenge to existing methods since it necessitates modeling both the dialogue history and the dynamic evidence passages.
2) Answers at each turn are natural and informative human-annotated responses rather than certain spans of the provided passage, moving the task closer to real scenarios. This way, we can evaluate both a model's comprehension ability and its generation ability. 3) Data in Orca were collected from November 2021 to November 2022 on Weibo, one of the most popular social media platforms in China. This means the collected data reflect real human interests, are quite recent, and have never been included in earlier benchmarks, posing a challenge to existing language models. Moreover, good results on Orca are of practical interest. 4) We carefully annotate conversations across 33 domains, such as Society, People, Celebrity, Book, and Finance. In contrast, among the commonly used datasets, CoQA covers only 7 domains and DoQA contains data from 3 domains. The variety of domains makes Orca closer to real scenarios and better suited to evaluating the generalization of CMRC models.
In this paper, we implement three strong baselines to tackle the challenges in Orca: 1) in-context learning with LLMs such as GPT-3 (Brown et al., 2020); 2) end-to-end training with encoder-decoder language models; 3) a three-stage framework consisting of a Query Rewriter, a Passage Reader, and a Response Rewriter (QPR for short). We conduct our experiments under both zero-shot and few-shot settings. The results indicate that Orca provides a challenging evaluation testbed for CMRC systems.

Related Work
More and more MRC datasets have been proposed to promote the development of QA systems (Chen et al., 2021, 2022c; Kim et al., 2021; Ishii et al., 2022). Table 1 presents a comparison of text-based MRC datasets from recent years. The release of QuAC (Choi et al., 2018) and CoQA (Reddy et al., 2019) aroused great interest in CMRC. In these datasets, each conversation is assigned a document passage, and the model is required to answer questions based on that passage. However, this pattern is quite different from natural human-human conversations. In reality, a natural conversation is driven by a certain topic, and an evidence passage available for each turn does not exist. Limiting the conversation content to a given document passage is inconsistent with real scenarios; thus, a model's evaluation on these datasets cannot reasonably reflect its ability in realistic scenarios. Besides, almost all CMRC datasets are collected from English corpora, limiting the language diversity of CMRC tasks. In particular, CMRC in Chinese is less explored because of the lack of corresponding benchmarks.
Unlike existing datasets, we propose a Chinese CMRC benchmark with various hot topics, where each turn of a dialogue is assigned a golden response-related passage. Thus, models' comprehension and generation abilities in real scenarios can be evaluated more reasonably. Furthermore, we collect Orca under few-shot settings, challenging models to learn unseen domains from few samples. Although one could argue that previous full-data CMRC datasets can also support few-shot training of CMRC models, we argue that this leads to unclear comparisons due to the inconsistent settings of different works on these datasets (Chen et al., 2022b). In contrast, we present a standard benchmark for thorough comparisons.

Problem Definition
Generally, the previous conversational machine reading comprehension task can be illustrated as follows: given a conversation at turn t, <P, H_t, Q_t, A_t>, where P is the evidence passage, H_t refers to the dialogue history at turn t, and Q_t, A_t denote the question and answer at the current turn, respectively, the task requires the model to predict the answer A_t based on understanding P, H_t, and Q_t. A_t often refers to spans of P.
In contrast, given a conversation at turn t, Orca formalizes the CMRC task as <T, P_t, H_t, Q_t, R_t>, where T is the topic that drives the whole conversation, P_t represents the evidence passage at turn t, and R_t is the corresponding response. Orca requires the model to generate the response R_t at the current turn based on understanding T, P_t, H_t, and Q_t. Of note: 1) since Orca delivers an evidence passage for each turn of the conversation, compared to other datasets it demands a stronger capacity to comprehend this dynamic knowledge; 2) each R_t is a human-annotated natural response rather than a span answer A_t, so Orca requires sequence-to-sequence modeling; 3) H_t in Orca is allowed to include the passages of previous turns.
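To make the formalization concrete, the following minimal Python sketch shows one possible in-memory representation of an Orca conversation; the class and field names are our own assumptions for illustration rather than a schema released with the dataset.

# Illustrative sketch of one Orca conversation; field names are assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    question: str          # Q_t, the user query at turn t
    response: str          # R_t, the human-annotated natural response
    passage: str           # P_t, the evidence passage selected for this turn

@dataclass
class Conversation:
    topic: str             # T, a few sentences that drive the whole conversation
    domain: str            # one or two words, e.g., "科技" (Technology)
    turns: List[Turn] = field(default_factory=list)

    def history(self, t: int, include_passages: bool = True) -> List[str]:
        """Return H_t: questions/responses (and optionally passages) of turns before t."""
        h = []
        for turn in self.turns[:t]:
            if include_passages:
                h.append(turn.passage)
            h.extend([turn.question, turn.response])
        return h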

Dataset Collection
As previously stated, a high-quality dataset is required for the development of successful CMRC models. In our dataset, we include the following elements: 1) Topic, a collection of several sentences that serves as the conversation's central focus; 2) Domain, where one or two words designate the specific field to which the content of the conversation belongs; 3) Conversations, where each turn is allocated a knowledgeable passage. In particular, Orca is collected in three stages: 1) topic and domain collection; 2) conversation collection; 3) quality control. We introduce each of them in the following parts.

Topic and domain collection With the goal of collecting data that reflect real human needs and daily life, we treat the social media platform Weibo as the data source. More than 200 million Chinese users share things around them and talk about hot topics on Weibo every day. We employ ten annotators to find and write down topics that interest them from the hot-topic list. If a selected topic on the list is not a grammatically correct sentence, the annotator needs to rewrite a few sentences to restate the topic. Meanwhile, the domain of the topic is annotated. To avoid individual bias, we recruit another annotator to further check the selected topics. We also require annotators not to choose topics from the same domain three times in a row, so as to increase domain diversity. The selected topics in Orca were collected from November 2021 to November 2022 on Weibo, guaranteeing the timeliness and diversity of our data.
Conversation collection After obtaining the topics, we split them into five equal parts. Then we divide the ten annotators into pairs (one questioner and one responder) and assign them the topics. We ask each pair to have a conversation around each assigned topic. Questioners are asked to pose as many follow-up questions as possible, and responders are asked to give natural responses seriously. The conversation begins with the questioner formulating a free-text question from the information in the chosen topic. To ensure that responses are accurate, comprehensive, and informative, we let responders use web search engines to find a knowledgeable passage that contains the information needed to answer the question.
Then, the responder answers the question based on the manually selected passage and their own knowledge. Responses are required to be free-form conversational text, such as a sentence with demonstrative pronouns, making the expression more fluent. To facilitate more natural and coherent conversations, the responder should provide additional information about the answer or about the topic, encouraging the questioner to continue with questions related to the latest question. Besides, we encourage the questioner to find interesting questions from the recorded discussions about the topic on Weibo. A dialogue ends when more than two unanswerable questions have been asked or when one of the partners decides to end the conversation. As a result, we obtain 831 high-quality conversations in which each turn is assigned a knowledgeable passage.
Quality Control To avoid bias and further control the data quality, we adopt several strategies to ensure the quality of conversations: • We employ another five annotators to check whether inconsistency problems exist in the conversation context. These annotators then revise the questions or responses, or directly delete the turns that cause the inconsistency.
• Simultaneously, gruff questions and responses are revised by the annotators to guarantee natural expression.
• As our annotation work lasts for about a whole year, a quality check of the annotators happens every month in order to maintain their enthusiasm and efficiency. The best-performing annotators are monetarily rewarded, and the worst performers are retrained or even dismissed.
• At last, we manually re-check the data to further ensure the high quality of Orca.
All annotators are Chinese native speakers recruited from universities. The annotation costs about 100K RMB in total, with 20 RMB per dialogue turn.

Dataset Analysis
Orca contains 831 conversations and 4,742 turns. Each conversation has 5.71 turns on average, and each turn is assigned a knowledgeable passage.

Conversation Analysis
As shown in Figure 2, there are 33 different domains in Orca, covering a large range of topics in daily life, such as Society, People, Celebrity, Book, and Finance. The variety of domains makes Orca more challenging and enables an effective evaluation of a model's generalization ability across different domains.
To evaluate a model's generalization ability across various domains, we provide zero-shot and few-shot settings: we extract 200 conversations from Orca as the support set and treat the rest of the data as the test set. Table 2 shows the statistics of Orca. Each conversation contains 5.71 turns on average, which is close to the number of QA pairs in a natural human conversation. There are about 91.9 tokens per passage and 16.35 tokens per response on average, indicating the abundant information in the manually retrieved passages. The large gap between response length and passage length demonstrates that models need strong comprehension and generation abilities to produce a pithy response from a long passage.
In terms of the link between a question and its conversation history, we categorize questions as dependent on or independent of the conversation history. According to our findings, only 19.9% of questions contain no coreference with the conversational history and can be answered on their own. Explicit coreference indicators such as he, she, and it appear in over half of the queries (57.8%); these refer to entities, people, buildings, or events mentioned in the conversation. The remaining 22.3% lack explicit coreference indicators but are indirectly related to an item or event, such as simple follow-up questions like "Why this?". This analysis shows that Orca poses challenging linguistic phenomena for building an effective CMRC model.

Query Analysis
Different from existing datasets (Choi et al., 2018; Reddy et al., 2019; Christmann et al., 2019; Saha et al., 2018), in which most queries are factoid, the query types in Orca are much more diverse. For example, factoid questions account for about 60% of CoQA, whereas 60.9% of the questions in Orca are non-factoid. As shown in Figure 3(a), Causal, List, and Confirmation questions all account for a considerable share.
Figure 3(b) shows the distribution of the bigram prefixes of queries. We observe that the most frequent first words of queries are similar to those in other datasets such as QuAC, CoQA, and CSQA. However, queries beginning with "Why" and "Is" account for a larger proportion than in other datasets. We find that the "Why" pattern is mainly related to queries of the Causal type (e.g., Why is he popular in China?) and the "Is" pattern is mainly related to queries of the Confirmation type (e.g., Is it a developed country?). The diversity and relatively uniform distribution of queries in Orca provide conversation scenarios that are as realistic as possible.

Response Analysis
According to the statistics over all responses in Orca, 69.30% of responses cannot be found as exact text spans in the passage; only 30.70% of responses overlap with spans in the given passage (e.g., entities). This indicates that most responses in Orca are free-form, more colloquial expressions, calling for models with strong generative ability.
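As an illustration, the span-overlap statistic above can be approximated with a simple check of whether a response appears verbatim in its passage (reusing the Turn/Conversation sketch from the problem definition); the authors' exact matching rule may differ, e.g., in how punctuation or whitespace is treated.

# Plausible way to reproduce the span-overlap statistic: count a response as
# "extractive" if it appears verbatim in its evidence passage. The dataset's
# official matching rule may differ.
def extractive_ratio(conversations):
    total, extractive = 0, 0
    for conv in conversations:
        for turn in conv.turns:
            total += 1
            if turn.response.strip() in turn.passage:
                extractive += 1
    return extractive / max(total, 1)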

Models
In this work, we propose three strong baselines to address the challenges in Orca: 1) in-context learning with large language models; 2) end-to-end training with generative language models; 3) a three-stage framework, QPR, which consists of three modules: a Query Rewriter, a Passage Reader, and a Response Rewriter. We briefly introduce each of them in this section.

In-context Learning
Previous works have shown that directly utilizing large language models with in-context learning can lead to promising zero-shot performance. In-context learning refers to a learning procedure in which the model learns to make predictions for a target task by conditioning on a few input-label pairs. In our experiments, we employ GPT-3, which contains 175 billion parameters, as the backbone. Concretely, given a few conversation sessions from the training set as the prompt, we validate the performance of GPT-3 on the Orca test set.
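A minimal sketch of this setup is given below, assuming the legacy openai Python client (< 1.0) and the text-davinci-002 model mentioned in the experimental setting; the prompt template itself is our own illustrative assumption, as the paper does not reproduce its exact prompt.

# Few-shot prompting sketch for Orca with GPT-3 (legacy openai client assumed).
import openai

def build_prompt(support_sessions, topic, passage, history, question):
    parts = list(support_sessions)             # a few full sessions as demonstrations
    parts.append(f"话题: {topic}")
    parts.append(f"篇章: {passage}")
    for q, r in history:                       # dialogue history H_t as (question, response) pairs
        parts.append(f"问: {q}\n答: {r}")
    parts.append(f"问: {question}\n答:")
    return "\n\n".join(parts)

def generate_response(prompt):
    out = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=128,
        temperature=0.0,
    )
    return out["choices"][0]["text"].strip()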

End-to-End
As most of the responses in Orca are free-form texts rather than spans extracted from the given passages, we build our end-to-end pipeline on several widely used generative models: T5 and BART. T5 (Raffel et al., 2020) is a pre-trained encoder-decoder Transformer that has achieved strong results on a diverse set of benchmarks, such as question answering; we use a Chinese pre-trained T5 (Wang et al., 2022) with 784 million parameters in our experiments. BART (Lewis et al., 2020) is a pre-trained encoder-decoder Transformer for text generation tasks; the version we use has 375 million parameters.
In our implementation, we optimize the model with the widely used negative log-likelihood loss during training. At turn t of a conversation, the optimization objective can be formulated as:

L_t = - \sum_{i=1}^{|R_t|} \log P(r_i \mid r_{<i}, T, P_t, H_t, Q_t),    (1)

where r_i denotes the i-th token of the response R_t.
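The following sketch illustrates one way to implement this objective with a Hugging Face sequence-to-sequence model; the checkpoint identifier is a placeholder and the input formatting (special markers for topic, passage, history, and question) is our own assumption rather than the paper's exact recipe.

# Sketch of one fine-tuning step for the end-to-end baseline (Eq. (1)).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "fnlp/bart-base-chinese"   # placeholder Chinese seq2seq checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(topic, passage, history, question, response):
    # Concatenate T, P_t, H_t and Q_t into one source sequence; the target is R_t.
    source = "[话题]" + topic + "[篇章]" + passage + "[历史]" + history + "[问题]" + question
    enc = tokenizer(source, truncation=True, max_length=512, return_tensors="pt")
    labels = tokenizer(response, truncation=True, max_length=128, return_tensors="pt").input_ids
    out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
    out.loss.backward()                 # cross-entropy over target tokens = NLL in Eq. (1)
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

Because the model's built-in cross-entropy loss over the target tokens equals the negative log-likelihood in Eq. (1), no custom loss function is needed.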

QPR Framework
As explained in Section 1, the human cognition process in CMRC can be split into three steps. To this end, given the related passage of each turn, we further design QPR, a framework that consists of a Query Rewriter, a Passage Reader, and a Response Rewriter. In this architecture, given a question at turn t, the Query Rewriter module first rewrites the question Q_t into a more natural, self-contained text Q̃_t. Then the Passage Reader module is responsible for generating an answer A_t based on (T, P_t, H_t, Q̃_t). At last, the Response Rewriter module takes A_t as input and generates the natural response; here we feed the answer together with the original query (without the Query Rewriter) into the Response Rewriter to obtain the final response. We use the dataset from Chen et al. (2023) to fine-tune the generative Response Rewriter based on BART.
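The data flow of the three stages can be sketched as follows; each module is shown as an abstract callable, since the paper's concrete module implementations (and their post-training data) are not reproduced here.

# Sketch of the QPR data flow; the three modules are abstract callables.
def qpr_pipeline(query_rewriter, passage_reader, response_rewriter,
                 topic, passage_t, history_t, question_t):
    # 1) Query Rewriter: resolve coreference/ellipsis in the current question.
    rewritten_q = query_rewriter(question_t, history_t)
    # 2) Passage Reader: produce a raw (possibly span-style) answer from the evidence.
    answer_t = passage_reader(topic, passage_t, history_t, rewritten_q)
    # 3) Response Rewriter: turn the raw answer into a natural conversational response.
    #    Following the description above, the original question is fed here.
    response_t = response_rewriter(answer_t, question_t)
    return response_t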

Experiments
In this section, we first introduce the experimental settings. Then, automatic and human evaluation metrics are introduced to evaluate model performance.
At last, we present extensive experimental results.

Experimental Setting
We conduct experiments under zero-shot and few-shot settings to evaluate a model's performance on unseen domains with few samples. Following existing work (Mi et al., 2022), we sample 5 sessions and 10 sessions from the support set and define these as the 5-shot and 10-shot settings, respectively; 0-shot means the support set is not used. In short, experiments are conducted under 0-session, 5-session, 10-session, and 200-session settings.
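For illustration, the k-session settings could be materialized from the support set as in the sketch below; the concrete sessions used in the paper (and any sampling seed) are not specified, so this is only an assumed procedure.

# Assumed procedure for building the k-shot (k-session) splits from the support set.
import random

def make_k_shot_split(support_set, k, seed=42):
    """Return k whole dialogue sessions (k in {0, 5, 10, 200}) for fine-tuning/prompting."""
    if k == 0:
        return []
    rng = random.Random(seed)
    return rng.sample(support_set, k)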
For the QPR framework, due to the lack of ground truth for the Query Rewriter and Passage Reader, it cannot be fine-tuned on the support set; therefore, we only evaluate this system under the 0-shot setting. Furthermore, we only test GPT-3 in the 0-shot and 5-shot settings with in-context learning, due to the significant expense of fine-tuning it. T5 and BART are evaluated under all settings.
We fine-tune BART for 50 epochs with a learning rate of 2e-5 and a batch size of 10; the maximum lengths of the input and output sequences are set to 512 and 128, respectively. For T5, we fine-tune for 20 epochs with an initial learning rate of 1e-4 and a batch size of 8; the maximum lengths of the input and output sequences are set to 512 and 300, respectively. In this work, we use the large version of BART, the base version of T5, and the text-davinci-002 version of GPT-3 in our experiments.

Evaluation Metrics
Automatic Metrics. We employ the following widely used metrics to evaluate the lexical overlap, semantic quality, and diversity of the generated responses: exact match (EM), ROUGE-L (Lin, 2004), BLEU-1 and BLEU-2 (Papineni et al., 2002), and Distinct-1 and Distinct-2 (Li et al., 2016). Human Evaluation. To further get a full view of how well the models work, we recruit 3 annotators to rate the generated responses of the two best models under the 0-shot and 200-shot settings. Annotators are asked to give three scores for Relevance, Completeness, and Naturalness on levels of {1, 3, 5}, which represent unreadable, mediocre, and flawless, respectively. We present the details of the scoring rules in the Appendix. For all evaluation metrics, larger numbers denote better performance.
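As a reference for the diversity metrics, the sketch below implements Distinct-n (the ratio of unique n-grams to total n-grams over all generated responses); tokenizing Chinese text at the character level here is our own assumption, and EM, ROUGE-L, and BLEU can be computed with standard packages.

# Reference implementation of Distinct-n over a set of generated responses.
def distinct_n(responses, n):
    total, unique = 0, set()
    for resp in responses:
        tokens = list(resp)                                   # character-level tokens for Chinese (assumption)
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / max(total, 1)

# Example: distinct_n(["它是天宫空间站。", "世界上有两个空间站。"], 2)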

Results
Experimental results are presented in Table 3. We analyze the results of the 0-shot setting and the few-shot settings in turn.

Results of 0-shot Setting
Automatic Evaluations For the 0-shot setting, we directly evaluate our three proposed baselines on the Orca test set. As shown in Table 3, BART and T5 perform worst. Surprisingly, the QPR system outperforms the other systems, including GPT-3; for example, it achieves a BLEU-1 score of 61.10%, surpassing the other baselines by a large margin. We argue that two reasons may contribute to its performance: 1) the architecture of QPR is more consistent with the human cognition process in CMRC; 2) post-training QPR on several similar Chinese benchmarks brings additional benefit. Furthermore, we also use conversations without passages to evaluate GPT-3. Compared with GPT-3, "GPT-3 w/o P" performs worse on most of the metrics. The performance gap indicates that Orca contains plenty of conversations with fresh knowledge, and it is hard for GPT-3 to generate accurate responses based only on parameters pre-trained on older knowledge. It also demonstrates the value of the passages provided by Orca.

Table 5 (error case of BART and T5 under the 200-shot setting):
Topic: An earthquake occurred last night in Aba, Sichuan, and lasted all night.
Conversation history, question: When did this earthquake occur?
Current turn, question: What is the maximum magnitude of it?
Golden response: The maximum magnitude of this earthquake is 6.0.
T5: It is at 5.8 level.
BART: It is 5.8.
Human Evaluations To further evaluate the QPR system and GPT-3 under the zero-shot setting, we randomly sample 100 sessions from the test set and conduct human evaluations. As shown in Table 4, the QPR system outperforms GPT-3 on all three aspects. On Relevance and Completeness, the QPR system performs significantly better than GPT-3, demonstrating its ability to understand the passage, history, and question and thus generate complete and natural responses. Analyzing the generated responses, we find that GPT-3 tends to generate long responses with extra information, sometimes ignoring response completeness. The human evaluation results largely match the automatic evaluation results.

Results of Few-shot Setting
Automatic Evaluations As shown in Table 3, we evaluate T5 and BART under the 5-shot, 10-shot, and 200-shot settings, and additionally evaluate GPT-3 under the 5-shot setting. Owing to its larger number of parameters, T5 performs better than BART across all few-shot settings.
As the number of support samples for fine-tuning increases, performance on the test set improves. Compared with the 0-shot setting, given 5 support samples the pre-trained end-to-end baselines achieve an average increase of 19.58% in ROUGE-L; comparing the 5-shot and 10-shot settings, the average improvement of the end-to-end baselines is 12.25%. This shows that it is genuinely difficult for end-to-end baselines with fewer than 1 billion parameters to generate appropriate responses under the 0-shot setting. Despite this, the pre-trained end-to-end baselines have strong learning abilities under few-shot settings: learning from only 10 support samples, their performance improves markedly and catches up with that of the pipeline system, and given 200 support samples, the end-to-end baselines achieve significant gains and surpass the pipeline system.
Human Evaluations As shown in Table 4, we conduct human evaluation of T5 and BART under the 200-shot setting. Under the same evaluation settings as the 0-shot case, T5 and BART achieve comparable performance in terms of the human evaluation metrics. Together with the results in Table 3, this indicates that these end-to-end models have a strong ability to learn to generate complete and natural responses after fine-tuning on our support set.
In summary, the results under the 0-shot and few-shot settings prove the effectiveness of the three proposed baselines, but Orca remains challenging for them. For instance, none of the models scores above 80% on the automatic metrics, and from the human evaluation perspective they achieve the worst results on Relevance, showing that they still suffer from modeling deficiencies.

Analysis
We further explore model performance on different query types. Similarly, we conduct experiments under the zero-shot and 200-shot settings.
Influence of Query Type
0-shot Figure 4(a) reports the results on different types of queries for a fine-grained analysis. We observe that the QPR system exceeds GPT-3 by the largest margin on Confirmation queries and performs worse than GPT-3 on List queries.

200-shot
Focusing on the 200-shot setting, Figure 4(b) shows the results of T5 and BART on different query types. T5 outperforms BART on almost all query types, benefiting from its larger number of parameters. We observe that BART performs poorly on Causal, List, and Hypothetical queries in particular. As shown in Figure 4, comparing BART under the 200-shot setting with the QPR system under the 0-shot setting, the performance gaps on these three query types are relatively small. According to Figure 3, these three query types account for 45.5% of Orca, showing that the variety of query types in our benchmark poses a thorny challenge for CMRC research.

Error Analysis
We conduct an error case study to illustrate the challenge of Orca. As seen in Table 5, there is a coreference phenomenon in the latest question: "it" refers to the earthquake mentioned in the preceding question. Besides, the collected passage does not clearly state the highest earthquake magnitude; instead, it states the average level, which confuses the models. As a result, both models generate 5.8 rather than the ground-truth answer 6.0. This example demonstrates that not only do the questions in our dataset cover a wide range of linguistic phenomena, but the gathered passages are also challenging to comprehend. We present more case studies in the Appendix.

Conclusion
We propose Orca, the first few-shot benchmark for Chinese CMRC. Unlike existing CMRC datasets in which each dialogue contains a single static passage, in Orca we assign each turn a golden response-related passage, which is more consistent with real scenarios. All conversations in Orca are hot-topic driven and collected from various domains, and the responses are natural and informative human annotations. In this way, a model's comprehension and generation abilities in real scenarios can be evaluated reasonably. Moreover, zero-shot and few-shot settings are provided to evaluate a model's ability to learn unseen domains from few samples. Experimental results of our three proposed baselines indicate that Orca provides a challenging testbed for CMRC research.
Table 6: Scoring rules of the three human evaluation aspects.
Relevance (Rel.)
  Score 1: The prediction is unrelated to the query or topic.
  Score 3: The prediction does not cover the semantics of the golden response accurately; it covers only part of the information in the golden response or involves some redundant information.
  Score 5: The prediction covers the semantics of the golden response accurately.
Completeness (Com.)
  Score 1: The expression of the prediction is incomplete.
  Score 3: The expression of the prediction is complete, but it is a bit awkward to read together with the query.
  Score 5: The expression of the prediction is complete and smooth.
Naturalness (Nat.)
  Score 1: The expression of the prediction is unnatural.
  Score 3: The expression of the prediction is somewhat natural but slightly insufficient, lacking demonstrative pronouns.
  Score 5: The expression of the prediction is natural and uses demonstrative pronouns.

A Details of Human Evaluation
For human evaluation, we ask annotators to give three scores according to Relevance, Completeness, and Naturalness, each on a level of {1, 3, 5}. Table 6 shows the details of the scoring rules for the three aspects.
B Case Study of GPT-3 with and without Passage under 0-shot Setting
As shown in Table 3, we evaluate GPT-3 with and without the passage under the 0-shot setting; "GPT-3 w/o P" performs worse on most of the automatic metrics. For intuitive analysis, we present cases of GPT-3 with and without the passage in Table 7. They indicate that GPT-3 cannot generate accurate responses without the passage in some cases, especially for queries that require fresh knowledge. The reason is that GPT-3, pre-trained on older knowledge, is probably unaware of new knowledge. Moreover, this comparison again demonstrates the value of the golden passages provided by Orca.

C Case Study of T5 under 200-shot Setting
As shown in Table 8, we present two cases of T5 under the 200-shot setting. Table 8(a) shows a case in which the response generated by T5 is identical to the golden response, indicating that T5 has strong learning ability and performs well under the 200-shot setting. Table 8(b) shows a case in which the T5-generated response is slightly worse than the golden response: compared with the golden response, T5's response has obvious information loss. The reason could be that the related passage is too long for the model to generate an accurate response from it.

Figure 1: An abridged conversation in Orca. Q, R, and P denote query, response, and passage, respectively.

Figure 2: Distribution of domains in Orca.

Figure 3: (a) Distribution of query types in Orca. (b) Distribution of the bigram prefixes of queries in Orca. For ease of reading, English translations are provided here.

Figure 4: Automatic evaluation results of different methods on various query types.

Table 1: Comparison of our benchmark with other typical QA datasets. Because Orca is a multi-domain CQA dataset rather than an open-domain one, we do not compare Orca with open-domain benchmarks. We do not report domain diversity for datasets whose original papers do not mention it. Free-form text denotes that the answers include span text, abstractive text, and "yes/no".

Table 2: Statistics of the collected data in Orca.

Table 3: Comparison of the three baseline systems on Orca; improvements are statistically significant with p-value < 0.05. "GPT-3 w/o P" means that we use the conversations without passages to evaluate GPT-3. Of note, the minimum unit of "shot" in our experiments is a dialogue session; in other words, 0-shot refers to 0 sessions.

Table 5: An error case of BART and T5 under the 200-shot setting. We highlight the ground-truth response and the evidence text in the table.