Evaluating Open-Domain Question Answering in the Era of Large Language Models

Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly +60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state-of-the-art on NQ-open. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistent with human judgments, although still suffering from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle in detecting hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.


Introduction
Reliable benchmarks have been a bedrock for measuring progress in open-domain QA, the task of answering information-seeking questions over a massive text corpus. In recent years, we have seen great strides in open-domain QA by novel models (Chen et al. 2017; Wang et al. 2018; Clark and Gardner 2018; Lee et al. 2019; Asai et al. 2020; Izacard and Grave 2021b,a; Khattab et al. 2021; Singh et al. 2021; Asai et al. 2022; inter alia) that continue to raise the state of the art on well-established benchmarks such as Natural Questions-OPEN (Lee et al., 2019). The standard procedure for evaluating open-domain QA models, borrowed from reading comprehension (Rajpurkar et al., 2016), is to perform lexical matching between gold answers provided in the benchmark and models' predictions. However, as the performance of open-domain QA approaches that of humans, these classic evaluation methods begin to fail. Such failures largely stem from incomplete lists of gold answers that do not cover all plausible answers. For example, in Figure 1, "Jicheng" is a correct answer to what was the city of Beijing previously known as? while not annotated as a gold answer in Natural Questions-OPEN (NQ-OPEN; Lee et al. 2019).
Figure 1: Top: Jicheng is a credible answer although not present in the list of gold answers. Existing automated evaluation mechanisms fail to identify it as correct. Bottom: A seemingly correct but unattributable answer from InstructGPT (Ouyang et al., 2022) for which automatic evaluation goes astray.
With the recent success of generative QA systems in the open-domain setting (Izacard and Grave, 2021b), it becomes harder for lexical matching to recognize correct answers, and in turn for us to recognize performance differences between models. The problem is exacerbated by a tendency of large language model (LLM)-based systems (Brown et al. 2020; Chowdhery et al. 2022; Zhang et al. 2022; Black et al. 2022; inter alia) to occasionally hallucinate plausible but incorrect answers (Dziri et al., 2022; Ye and Durrett, 2022). For instance, in Figure 1, InstructGPT (Ouyang et al., 2022) generates "Jack Nicholson" in great detail to answer who won the oscar for best actor in 1975? Although the answer looks natural, it is not factually correct (he won in 1976). Therefore, human confirmation of answer correctness demands additional effort and care due to the ability of LLMs to formulate these answers as complete and seemingly authoritative.
While it might be assumed that improved performance under lexical matching would reflect improved performance in an absolute sense, even if some correct answers are missed, we show this assumption does not hold. For this purpose, we manually re-evaluate several open-domain QA models on a random subset of NQ-OPEN (Lee et al., 2019), an established benchmark. Not only is true performance substantially underestimated by this benchmark, but the relative performance of the models alters after re-evaluation: InstructGPT (zero-shot) achieves an accuracy of 12.6% on our NQ-OPEN subset, but our human judgment reveals its true performance to be 71.4%, a nearly +60% improvement. Our linguistic analysis of the failure cases of lexical matching, an extension of a similar study by Min et al. (2021), shows that the mismatches are mostly linguistically shallow and could be captured by simple patterns, such as regular expressions.
In contrast, automated evaluation mechanisms such as BEM (Bulian et al., 2022), based on semantic matching between the gold answers and generated answers, produce a relative performance that is mostly consistent with human evaluation, although the absolute improvements are lower. However, long-form answers generated by LLMs introduce a new challenge that did not occur with prior models: they are prone to carry unattributable information (Rashkin et al., 2021). Automated evaluation models often deem the hallucinated responses correct, which is why InstructGPT (zero-shot) is overestimated under these models compared to human judgment.
We repeated this experiment with the 20-year-old CuratedTREC dataset (Voorhees, 2003) that provides its gold answers in the form of regular expressions. We observe that the relative performance of models remains mostly consistent under all three evaluation mechanisms, i.e., regular expressions, human evaluation, and semantic matching, with only slight differences in absolute performance. However, the ranking discrepancy still persists between the two LLMs, i.e., InstructGPT (zero-shot) and InstructGPT (few-shot). Also, only under human judgment does the absolute performance of LLMs exceed that of the heavily engineered statistical NLP systems from 20 years ago on this collection. Until recently, the best of these classical systems has been substantially superior to even the best of the modern neural models. In light of our observations, we highlight that while semantic matching against exact answers would have been sufficient for QA evaluation prior to LLMs, it cannot accurately evaluate LLMs.

Related Work
Answer Equivalence in QA. One way to tackle this task is through the automatic collection of alternative plausible answers from auxiliary knowledge sources such as a knowledge base (Si et al., 2021). However, the effectiveness of this approach is heavily contingent on the presence of answers in the knowledge source, which is often not the case. For instance, numerical answers or common phrases are unlikely to be found in a knowledge base. Moreover, matching gold answers with knowledge base entries can also be problematic as their surface forms may not be identical. Thus, these approaches fail to scale for various types of answers. Another line of work focuses on building models that assess semantic similarity between candidate answers and gold answers, which can supersede lexical matching for verifying answers (Chen et al., 2019; Risch et al., 2021; Bulian et al., 2022). These methods indeed work well in reading comprehension because the presence of an input context often curtails the possibilities of models' generated answers. However, they are susceptible to failure in open-domain QA where questions should be answered without any additional context. Similarly, unsupervised semantic similarity-based evaluation metrics such as BERTScore (Zhang et al., 2020) that rely on token-level matching of contextualized representations exhibit poor correlation with human judgment in QA evaluation (Chen et al., 2019) and lack the ability to capture attributability (Maynez et al., 2020).
Human Judgment in QA. Many works (Min et al., 2021) resort to human evaluation to assess QA models. Although using humans for evaluation is expensive and not scalable, Min et al. (2021) find that the performance of QA systems increases by 23% on average using human judgment. The substantial gap between the true performance and token-based metrics showcases the long-known strictness problem of lexical matching.

Open-domain QA Evaluation
The task of open-domain QA refers to finding answers for information-seeking questions given a massive knowledge source such as Wikipedia (Voorhees and Tice, 2000). The questions are typically factoid with short answers and acontextual (Rogers et al., 2022). Open-domain QA datasets encompass questions with their annotated gold answers that serve as a reference for evaluation. Following reading comprehension (Rajpurkar et al., 2016), evaluation is carried out via lexical matching using the following two widely used metrics to measure the performance of models:
• Exact-Match accuracy (EM): A candidate answer is deemed correct iff it can be found in the set of gold answers. The ratio of correct answers in the test collection is reported as EM accuracy.
• F1 score: Considering answers as bags of tokens, a candidate answer receives a partial score (F1) iff its tokens overlap with those of a gold answer. The maximum F1 score over a set of gold answers is assigned to the candidate answer. The final metric at corpus level is measured by averaging F1 scores over the test collection.
Based on the implementation of Rajpurkar et al. (2016), answers are normalized (i.e., case-folded, and punctuation and articles are discarded) to compute these metrics.
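The two metrics and the normalization step can be sketched as follows. This is a minimal illustration in the spirit of the SQuAD-style evaluation described above, not the exact script used in the experiments; the function names are our own, and the single gold answer "Peking" in the usage lines is assumed for illustration only.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Case-fold, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """True iff the normalized candidate equals some normalized gold answer."""
    cand = normalize_answer(candidate)
    return any(cand == normalize_answer(gold) for gold in gold_answers)


def f1_score(candidate: str, gold_answers: list[str]) -> float:
    """Maximum bag-of-tokens F1 between the candidate and any gold answer."""
    cand_tokens = normalize_answer(candidate).split()
    best = 0.0
    for gold in gold_answers:
        gold_tokens = normalize_answer(gold).split()
        overlap = sum((Counter(cand_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(cand_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Illustrative usage with an assumed gold list.
print(exact_match("The Peking.", ["Peking"]))  # normalization makes these equal
print(exact_match("Jicheng", ["Peking"]))      # plausible answer, yet no match
print(f1_score("ordinary citizens", ["citizens"]))
```

Corpus-level EM accuracy is then the fraction of questions for which `exact_match` holds, and corpus-level F1 is the mean of `f1_score` over the test collection.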

Models
We select open-domain QA models with publicly available codebase and reproduce their reported results. For all models, the "base" flavors are chosen for the experiments. In total, we use 12 models.
Retriever-Reader Models. DPR (Karpukhin et al., 2020) is a well-known open-domain QA model that consists of a bi-encoder retriever and leverages an extractive reader. In addition to DPR, we pair several retrievers with Fusion-In-Decoder (FiD; Izacard and Grave 2021b), a prominent generative model that conditions answer generation on a list of passages: ANCE (Xiong et al., 2021), Contriever (Izacard et al., 2022), RocketQAv2 (Ren et al., 2021), and FiD-KD (Izacard and Grave, 2021a). Further, we leverage GAR (Mao et al., 2021), a sparse retrieval model that augments questions with relevant contextual information generated by a fine-tuned T5. We fuse ANCE and GAR results with BM25, namely ANCE+ and GAR+, as they led to better results. We also use R2-D2 (Fajcik et al., 2021) that combines extractive and generative readers.
End-to-End Models. EMDR² (Singh et al., 2021) is an end-to-end model that jointly trains a dense retriever with a FiD-style reader. We also use EviGen (Asai et al., 2022) that jointly learns to predict the evidentiality of passages and to generate the final answer in a multi-task fashion.
Closed-book Models. We use InstructGPT (Ouyang et al., 2022) in two settings, following Brown et al. (2020): zero-shot and few-shot, where the prompt includes 64 question/answer pairs randomly sampled from the NQ-OPEN training data.

Dataset
We select questions from NQ-OPEN (Lee et al., 2019), a popular open-domain QA benchmark, that consists of 3610 questions in the test set. We randomly sample 301 questions from NQ-OPEN. Answers are generated via the prominent open-domain QA models, described in §3.1, for the selected questions. In total, the number of unique answers generated by the 12 models for 301 questions amounts to 1490 question/answer pairs. Our experiments are done on Wikipedia, following the same settings provided by Karpukhin et al. (2020).

Strategies for Evaluating Open-domain QA Models
Our goal is to shed light on the discrepancies between the actual and the measured accuracy of open-domain QA models. To this end, we adopt three evaluation mechanisms in addition to lexical matching to assess 12 open-domain QA models and draw a comparison between their estimated accuracy and the token-based performance.

Supervised Evaluation via Semantic Similarity
A common paradigm to evaluate QA systems is to cast evaluation as a classification task where the goal is to decide whether gold answers and candidate answers are semantically equivalent or not (Risch et al., 2021; Bulian et al., 2022). To this end, we use a recent BERT-based model, namely BEM (Bulian et al., 2022), that is trained on a human-annotated collection of answer pairs given a question, derived from SQuAD (Rajpurkar et al., 2016). For evaluation, we feed a question along with a gold answer and a candidate answer to BEM and take its prediction. For questions with multiple gold answers, each gold answer is independently tested against the candidate answer. Once matched with any of the gold answers, a candidate answer is deemed correct.
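The decision rule over multiple gold answers can be sketched as below. The sketch does not call BEM: `toy_judge` is a hypothetical stand-in (simple case-insensitive containment) for any classifier that takes a question, a gold answer, and a candidate answer and returns a binary equivalence decision. All names here are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

# (question, gold answer, candidate answer) -> are they equivalent?
Judge = Callable[[str, str, str], bool]
Example = Tuple[str, Sequence[str], str]  # (question, gold answers, candidate)


def is_correct(question: str, gold_answers: Sequence[str],
               candidate: str, judge: Judge) -> bool:
    """A candidate is correct once it matches any single gold answer."""
    return any(judge(question, gold, candidate) for gold in gold_answers)


def accuracy(examples: Sequence[Example], judge: Judge) -> float:
    """Fraction of examples whose candidate is accepted by the judge."""
    if not examples:
        raise ValueError("no examples to score")
    hits = sum(is_correct(q, golds, cand, judge) for q, golds, cand in examples)
    return hits / len(examples)


def toy_judge(question: str, gold: str, candidate: str) -> bool:
    """Hypothetical stand-in for a learned equivalence model such as BEM."""
    return gold.lower() in candidate.lower()


examples = [
    ("when did michael jordan return to the nba", ["1995"],
     "Michael Jordan returned to the NBA in 1995."),
    ("what was the city of beijing previously known as", ["Peking"],
     "Jicheng"),
]
print(accuracy(examples, toy_judge))
```

Keeping the judge as a parameter makes it easy to swap in a learned model or an LLM-based evaluator while leaving the any-match rule unchanged.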

Zero-shot Evaluation via Prompting
We also test the ability of LLMs to evaluate QA models. In open-domain QA, the task of answer equivalence requires supplementary information in the absence of a given context, e.g., matching "Jicheng" with "Peking" in Figure 1; therefore, LLMs are a reasonable choice here because they are equipped with an implicit memory that encompasses knowledge, thus serving as auxiliary information. To use LLMs for evaluating models, we issue the following prompt to InstructGPT (Ouyang et al., 2022):
Question: what was the city of Beijing previously known as?
We include the gold answer along with the candidate answer in the prompt, akin to the semantic similarity mechanism, as the objective here is to verify the correctness of the candidate. We call this evaluation method InstructGPT-eval. We also test GPT-4 (OpenAI, 2023) using the same evaluation method, namely GPT4-eval, and observe that its results, reported in §A, closely resemble those obtained from InstructGPT-eval.
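As a rough sketch of how such a prompt could be assembled and its response interpreted, consider the following. The template wording, the field labels, and the helper names `build_eval_prompt` and `parse_verdict` are assumptions for illustration; they are not the exact prompt used in our experiments, and no LLM API is called here.

```python
from typing import Optional, Sequence


def build_eval_prompt(question: str, gold_answers: Sequence[str],
                      candidate: str) -> str:
    """Assemble a zero-shot verification prompt (hypothetical template)."""
    gold = " or ".join(gold_answers)
    return (
        f"Question: {question}\n"
        f"Answer: {gold}\n"
        f"Candidate: {candidate}\n\n"
        "Is the candidate correct?"
    )


def parse_verdict(response: str) -> Optional[bool]:
    """Map a free-text response to True/False, or None when it abstains."""
    text = response.strip().lower()
    if text.startswith("yes"):
        return True
    if text.startswith("no"):
        return False
    return None  # e.g., the model asks for more context


prompt = build_eval_prompt(
    "what was the city of Beijing previously known as?", ["Peking"], "Jicheng"
)
print(prompt)
print(parse_verdict("Yes, the candidate is correct."))
```

Returning `None` for unparseable responses keeps abstentions separate from rejections, so they can be counted and inspected on their own.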

Human Evaluation
Human evaluation reflects the true performance of a model and serves as a basis for checking the feasibility of other evaluation mechanisms. For this purpose, we ask two human annotators to judge whether a given answer to a question is correct or not. We present only question/answer pairs to human annotators to avoid any inadvertent biases, i.e., the annotators do not know which answers correspond to which model nor do they know if an answer is a gold answer. Annotators are allowed to use a search engine to find evidence that supports or rejects a candidate answer. Our annotation procedure is specifically geared towards open-domain QA, unlike those of Risch et al. (2021) and Bulian et al. (2022) that are designed for reading comprehension where annotators decide equivalence between a pair of answers given a question and a context.
The Fleiss' Kappa score between the two annotators is 72.8%, i.e., 202 disagreements out of 1490 cases (13.6%), indicating substantial agreement. Most disagreements arise from questions that are more likely to possess subjective answers. They mainly fall into three categories: ambiguous (e.g., "what is the corporate tax rate in great britain"), list-style (e.g., "who dies in the lost city of z"), and time-dependent (e.g., "latest series of keeping up with the kardashians") questions. We ask a third annotator to judge the 202 cases where the two annotators diverged and take a majority vote to determine the correctness. The answers accepted by the annotators are then added to the set of gold answers for the selected questions. We compute the accuracy of the 12 models after amending the gold answers and compare it with the original accuracy that is computed via lexical matching.

Results and Discussion
Table 1 presents the accuracy of the open-domain QA models under the three evaluation mechanisms compared to the de facto EM accuracy. The accuracy of all models consistently surges across all three evaluation mechanisms, i.e., by 16%, 20%, and 23% on average for BEM, InstructGPT-eval, and Human, respectively. InstructGPT (zero-shot) and InstructGPT (few-shot) are the top 2 models with the largest increase in accuracy across the evaluation mechanisms, whereas DPR achieves the lowest increase. Moreover, the accuracy reported using BEM and InstructGPT-eval is still lower than that of human judgment, trailing by 7.6% and 3.3% on average across all open-domain QA models, respectively.
More importantly, the ranking of models changes under the three evaluation mechanisms. Figure 2 visualizes the accuracy of the open-domain QA models before (using only EM) and after our evaluation. EMDR², originally the best performing model with a +3% lead, loses the top spot to InstructGPT (few-shot) by a nearly +3% margin using human evaluation. BEM picks FiD-KD as the best model, whereas the LLM-based evaluation method estimates the highest accuracy for InstructGPT (zero-shot). Also, the Kendall's τ correlation of InstructGPT-eval and BEM with human evaluation is 0.82 and 0.70, respectively, whereas EM and F1 show a significantly weaker correlation of 0.22 and 0.37.
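To make the rank-correlation measure concrete, the sketch below computes Kendall's τ between two lists of per-model accuracies. It implements the simple tau-a variant (no tie correction), which is an assumption on our part since the variant is not specified above; the accuracy values are made up for illustration and are not results from our experiments.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with >= 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # tied pairs count toward neither group in tau-a
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies of four models under two evaluation methods.
human_acc = [0.70, 0.74, 0.78, 0.81]
metric_acc = [0.50, 0.55, 0.53, 0.60]
print(kendall_tau(human_acc, metric_acc))
```

A value of 1 means the two methods rank all models identically, while values near 0 mean the automatic metric carries little information about the human ranking.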
In contrast to human evaluation, BEM and InstructGPT-eval show that InstructGPT (zero-shot) has a 4% and 9% advantage, respectively, over InstructGPT (few-shot). To further investigate this phenomenon, we manually examine the InstructGPT (zero-shot) generated answers that are deemed incorrect by humans. We identify 47 unattributable answers out of 86 answers. The generated answers of InstructGPT (zero-shot) tend to be long statements that offer supplementary information, which raises the risk of containing hallucinated content. InstructGPT-eval accepts 30 of those answers (∼10% error over the 301 questions), whereas BEM incorrectly predicts 18 (∼6% error) answers as correct. Interestingly, GPT4-eval performs better and misidentifies only 9 cases (∼3% error). Yet, these results highlight that the automated methods are prone to misjudging hallucinated long answers, essentially rendering them unreliable against answers generated by LLMs.

Linguistic Analysis of Correct Answers
In this section, we aim to examine model answers that are not considered correct based on EM, but are in fact acceptable according to our assessment. Min et al. (2021) conducted a similar analysis on 50 questions for the participating models in the EfficientQA competition at NeurIPS 2020. In line with this work, we provide an in-depth analysis on a broader scale using more recent models to emphasize the drawbacks of widely used lexical-based evaluation metrics and semantic similarity methods. We further dissect the categories presented by Min et al. (2021) into more detailed sub-categories. Specifically, we group the 493 question/answer pairs that are deemed correct by humans but cannot be matched with gold answers into hierarchical categories as follows:
Semantic Equivalence: Model predictions and gold answers convey the same meaning while not matching verbatim:
(i) Multinominal entities, e.g., "Bhimrao Ramji Ambedkar" and "B. R. Ambedkar."
(ii) Synonymous answers, e.g., "a virtual reality simulator" and "a virtual reality world."
(iii) More elaborate answers, e.g., "Typically, no" and "not required in all jurisdictions."
(iv) Exact-Match in explanatory answers, e.g., "1995" and "Michael Jordan returned to the NBA in 1995."
(v) Bridging/Abridging, e.g., "citizens" vs. "ordinary citizens" or "in the Gospel of Luke" vs. "Gospel of Luke."
(vi) Tokenization mismatches, especially in the presence of punctuation marks, e.g., "s-block" and "s -block."
Symbolic Equivalence: In the case of numeric answers, gold answers and predicted ones can be symbolically identical, either exactly or approximately, while their surface text differs, e.g., "about 3.99 degrees" vs. "3.97 degrees" or the year "1524" vs. "the 16th century."
Intrinsic Ambiguity in Questions: Ambiguous questions have several interpretations, each of which can lead to different answers. Ambiguity has been found to be prevalent in NQ-OPEN. Unlike other categories, mismatches that stem from ambiguity are not rooted in answers and instead arise from the questions themselves. For instance, "when does the next episode of iZombie air?" presupposes a reference point in time that can only be clarified within a context. Thus, both "May 07, 2018" and "February 26, 2018" are correct, depending on when the question is asked.
Granularity Discrepancies: Predicted answers may appear at different granularity levels than the gold answers. This case often arises for answers indicating spatial or temporal references. Indeed, under different presuppositions, some granularity levels are preferable to others. Nonetheless, both predictions and gold answers are valid. We further categorize this discrepancy into:
(i) Temporal granularity discrepancy, e.g., "when was the 50th star added to the united states flag?" can be answered by both "1960" and "July 4, 1960."
(ii) Spatial granularity discrepancy, e.g., both "Camping World Stadium" and "Orlando, Florida" answer the question "where is the citrus bowl held this year?"
List-style Questions: Actual answers to these kinds of questions encompass a set of plausible answers that is not fully specified in gold answers. For these questions, model answers are deemed correct if they are among at least one gold answer. We break this group down into:
(i) List questions, e.g., gold answers to "list of strict nature reserve in the Philippines" consist of six locations, a list that is by no means comprehensive.
(ii) Open-ended questions such as "what is an example of a government monopoly in the United States?" where "the United States
(iii) Compound questions ask about multiple pieces of information in one question. They are a special case of multi-hop questions (Yang et al., 2018), e.g., "when was the canadian pacific railway started and finished?" where the gold answer is "between 1881 and 1885" vs. "Started in 1881 and finished in 1885.", which is a correct answer.
Incorrect Gold Answers: Models produce correct answers, but gold annotations are incorrect. Mismatches in this category are a byproduct of data quality issues. For example, the answer to "what is the largest ethnic group in Mexico today?" is annotated "K'iche'", whereas the correct answer is "Mestizos."

Discussion
The statistics for each category are presented in Figure 3. Semantic equivalence (50.3%) is the most common failure mode of exact matching. The most frequent subcategories within this category are bridging/abridging (11.4%), EM in explanatory answers (10.1%), and multinominal entities (9.3%).
Other frequent failure modes are list-style questions (20.6%) and granularity discrepancy (15.0%). Interestingly, most of these failure cases are related to syntactical variations of answers, which is why specifying gold answers via regular expressions can be useful in capturing these variations. Moreover, 14% of EM failures are attributed to data quality issues, i.e., ambiguity and incorrect gold answers.
Error Analysis of Automated Evaluation Methods. The answers that InstructGPT-eval and BEM reject but humans consider correct are a subset of EM failures (see footnote 7). More precisely, InstructGPT-eval and BEM reduce the 493 failure cases of EM to 149 (70% ↓) and 217 (56% ↓), respectively. For GPT4-eval, the number of failure cases is 137 (72% ↓), only slightly lower than InstructGPT-eval. The breakdown of the high-level failure categories for each evaluation method is shown in Figure 4. The three automated evaluation methods are able to fix most of the failures corresponding to semantic equivalence, granularity discrepancy, and symbolic equivalence. However, they do not perform that well on list-style questions, where InstructGPT-eval and GPT4-eval still fail on more than 10% of the EM failures, and BEM falls short on 14%. They also perform nearly on par with EM on data quality-related failure cases, i.e., incorrect gold answers and ambiguous questions.

Regex Matching on CuratedTREC
An alternative to lexical matching between gold answers and predicted answers during evaluation is to specify gold answers as regular expression patterns. Regex matching allows for capturing syntactical answer variations where exact-match falls short. In this section, our main goal is to highlight the advantages and pitfalls of using answer patterns in QA evaluation by comparing its results with our three evaluation mechanisms, described in §3.1.
Footnote 7: With only 3 exceptions: InstructGPT-eval rejects only 2 actually correct answers matching with gold answers that correspond to list questions where candidate answers appear in the middle of the gold answers. Moving the candidate answer to the top of the gold answer list would fix the issue. Similarly, BEM rejects only 1 exactly matched correct answer, i.e., "P-A-D-A-W-A-N." while the gold answer is "Padawan".
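The idea of answer patterns can be sketched with a few lines of code. The patterns below are hypothetical ones written for this illustration around examples from our linguistic analysis; they are not taken from CuratedTREC, and the helper name `regex_match` is likewise an assumption.

```python
import re
from typing import Sequence


def regex_match(candidate: str, patterns: Sequence[str]) -> bool:
    """True iff any gold pattern is found anywhere in the candidate answer."""
    return any(re.search(p, candidate, flags=re.IGNORECASE) for p in patterns)


# One pattern can cover several surface forms of the same answer.
name_patterns = [r"\bB(himrao|\.)\s*R(amji|\.)\s*Ambedkar\b"]
print(regex_match("Bhimrao Ramji Ambedkar", name_patterns))
print(regex_match("B. R. Ambedkar", name_patterns))

# Tokenization variants such as "s-block" vs. "s -block".
block_patterns = [r"\bs\s*-\s*block\b"]
print(regex_match("s -block", block_patterns))

# A pattern is still only as permissive as its author anticipated.
print(regex_match("Dr. Ambedkar", name_patterns))
```

The last call shows the remaining strictness: an arguably acceptable variant is rejected because the pattern does not foresee it.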
Dataset. We make a comparison across open-domain QA models on CuratedTREC 2002 (Baudiš and Šedivỳ, 2015), a dataset whose gold answers are specified by regular expressions. The questions in CuratedTREC are derived from the dataset in the QA tracks (Voorhees, 2003) of TREC 2001 to 2003 after a manual review to discard ambiguous or outdated questions. The knowledge source for TREC QA is originally English news text, namely AQUAINT, from three news sources (AP, NYTimes, and Xinhua), dating back to the late 90s.
Here, we opt for the original knowledge source to replicate the same environment as TREC QA 2002 so as to quantitatively gauge progress over two decades by comparing recent models with the models that took part in the QA track in 2002. This experiment is an out-of-distribution test for the neural models to check whether they are actually capable of using the knowledge source to answer questions or whether they answer from memory, because the old news articles are less likely to have appeared in the pretraining corpus. However, LLMs inevitably do not use the knowledge source as they perform the task from their memory in a closed-book fashion. CuratedTREC 2002 consists of 444 questions whose answers are looked up in the AQUAINT corpus, comprising around 1M news articles. We follow Karpukhin et al. (2020) to split the articles into non-overlapping passages of 100 words, which amounts to over 4M passages in total. Similar to NQ-OPEN, we ask two annotators to judge 1872 question/answer pairs, followed by a third annotator who evaluates the diverging cases. The Fleiss' Kappa score between the two annotators is 83.5%, i.e., 150 disagreements (8.0%), indicating an almost perfect agreement.
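The passage-splitting step can be illustrated with the following sketch. It assumes plain whitespace tokenization and ignores article titles and any other preprocessing, so it should be read as an approximation of the procedure rather than the exact pipeline; the function name is our own.

```python
def split_into_passages(text: str, words_per_passage: int = 100) -> list[str]:
    """Split an article into consecutive, non-overlapping word windows."""
    if words_per_passage <= 0:
        raise ValueError("words_per_passage must be positive")
    words = text.split()  # assumption: whitespace tokenization
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


# A synthetic 250-word "article" yields passages of 100, 100, and 50 words.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
print([len(p.split()) for p in passages])
```

The final passage of an article is simply shorter when the article length is not a multiple of the window size.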

Models
The results are shown in Figure 5. Interestingly, the ranking of models via regex matching is left unchanged by all three evaluation mechanisms, except for InstructGPT (zero-shot) and InstructGPT (few-shot). Consistent with our observation on NQ-OPEN, both BEM and InstructGPT-eval assign a higher accuracy to InstructGPT (zero-shot) than to InstructGPT (few-shot). However, in contrast to NQ-OPEN, they do not overestimate InstructGPT (zero-shot). Human evaluation shows that InstructGPT (few-shot), by scoring 92%, is the best performing model, analogous to NQ-OPEN. Among the non-LLM models, ANCE+ and Contriever consistently surpass other models. Similar to EM, regex matching is too rigid, albeit to a lesser extent. In particular, the accuracy is underestimated by 6.6%, 6.4%, and 9.9% on average via BEM, InstructGPT-eval, and human evaluation, respectively.
We note that LCCmain2002, an original TREC run, outperforms all models prior to our assessment. Human evaluation highlights that both InstructGPT models are superior to LCCmain2002 by +1.9% (for zero-shot) and +2.9% (for few-shot). However, BEM and InstructGPT-eval fail to reflect this result. For other non-LLM models, ANCE+ and Contriever surpass pris2002 via all three evaluation methods (with the exception of Contriever using InstructGPT-eval). An interesting finding here is that although neural open-domain QA models have repeatedly proven powerful in achieving state-of-the-art results, LCCmain2002, a heavily engineered statistical method from 20 years ago, outperforms them by a substantial margin of 20%. Only under human judgment does the absolute performance of LLMs surpass LCCmain2002.

Conclusion
Despite the simplicity and ubiquity of lexical matching as an evaluation metric in open-domain QA, it is unnecessarily rigid because plausible candidate answers are likely not to appear in the list of gold answers. This flaw has been long known, but the efforts to circumvent it have been mostly artisanal. In this paper, we report a systematic study of lexical matching by manually judging answers generated by several prominent open-domain QA models. We found that LLMs achieve state-of-the-art on NQ-OPEN. The accuracy of models is severely underestimated, with most EM failure cases stemming from syntactical variations of answers. Moreover, a zero-shot prompting method can be a reasonable substitute for human evaluation although it cannot detect unattributability in long-form answers. Our insights and analysis in this paper will hopefully underpin the development of solid evaluation techniques in open-domain QA.

Limitations
Our main focus in this work is limited to factoid information-seeking questions that typically prompt short answers. However, lexical matching is adopted by more complicated forms of QA that require complex reasoning. More precisely, QA tasks such as multi-hop reasoning (Yang et al., 2018), discrete reasoning (Dua et al., 2019), and causal relations (Lin et al., 2019) also warrant similar systematic analysis as studied in this paper.
For the sake of completeness, we test the ability of GPT-4 (OpenAI, 2023) for evaluating QA models as explained in §4.2. Following the layout of Table 1, Table 2 presents the accuracy of the open-domain QA models, computed using GPT4-eval in conjunction with lexical matching, InstructGPT-eval, and human judgment as reference points. The accuracy of all models consistently increases by an average of 20% using GPT4-eval, which is similar to the increase observed with InstructGPT-eval. Moreover, analogous to InstructGPT-eval, the GPT4-eval accuracies are, on average, 3.3% lower than those of human judgment. Figure 6 visualizes the accuracy of the open-domain QA models on NQ-OPEN using EM and GPT4-eval, similar to Figure 2. Unlike InstructGPT-eval, GPT4-eval estimates the highest accuracy for FiD-KD, followed by InstructGPT (zero-shot), InstructGPT (few-shot), and EMDR². Also, the Kendall's τ correlation of GPT4-eval with human judgment is 0.80, on par with 0.82 of InstructGPT-eval.
Error Analysis: As illustrated in Figure 4, GPT4-eval errors closely resemble the errors found in InstructGPT-eval. However, for a small number of cases, GPT4-eval demonstrates unique erratic behaviours. First, for 2 cases, the model exhibits overconfidence in its internal memory and disregards gold answers that can be simply matched using EM. For example, GPT4-eval incorrectly rejects the candidate answer "Jermaine Jackson" (that is also a gold answer) to the question "Who sings Somebody's Watching Me with Michael Jackson?" We also observe the contradictory response of "No, the candidate is correct" for 2 candidate answers that are correct, but are not included in the gold answers. Moreover, GPT4-eval incorrectly abstains from evaluating 2 candidate answers because it thinks more context is needed. For instance, it responds "I cannot determine if the candidate is correct, as there is not enough information provided about the show "Fall" and the character Rose. Valene Kane is an actress, but without more context, it is unclear if she is related to this specific show or character." to the question "Who is Rose in the Fall season 2?" and the candidate answer "Rose is a new character introduced in the second season of the show Fall. She is a mysterious woman who is connected to the supernatural events occurring in the town.", which is entirely fabricated.
As shown in Figure 7, GPT4-eval closely follows InstructGPT-eval on CuratedTREC 2002. Specifically, it indicates a higher accuracy for InstructGPT (zero-shot) compared to InstructGPT (few-shot) and ranks LCCmain2002 ahead of both InstructGPT models despite human evaluation suggesting otherwise.

Table 2 (table body not recovered; surviving header fields: Model, Sampled (301)). Caption fragment: Table 1 and copied here solely as a reference. GPT4-eval demonstrates approximately similar behaviour as InstructGPT-eval when ranking the models.