Perhaps PTLMs Should Go to School – A Task to Assess Open Book and Closed Book QA

Our goal is to deliver a new task and leaderboard to stimulate research on question answering and pre-trained language models (PTLMs) to understand a significant instructional document, e.g., an introductory college textbook or a manual. PTLMs have shown great success in many question-answering tasks, given significant supervised training, but much less so in zero-shot settings. We propose a new task that includes two college-level introductory texts in the social sciences (American Government 2e) and humanities (U.S. History), hundreds of true/false statements based on review questions written by the textbook authors, validation/development tests based on the first eight chapters of the textbooks, blind tests based on the remaining textbook chapters, and baseline results from state-of-the-art PTLMs. Since the questions are balanced, random performance should be ~50%. T5 fine-tuned with BoolQ achieves the same performance, suggesting that the textbooks’ content is not pre-represented in the PTLM. Taking the exam closed book, but having read the textbook (i.e., adding the textbook to T5’s pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have “understood” the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).


Introduction
Question answering (QA) is a yardstick for measuring machine understanding performance (Hermann et al., 2015). QA's popularity as an evaluation technique has led to several sub-categories: tasks can require a model to answer questions from its background knowledge, from a short passage (e.g., SQuAD, Rajpurkar et al., 2016), or with information retrieval that lets the model search for the answer in a large corpus (e.g., ARC, Clark et al., 2018). Answering can take the form of true/false classification (BoolQ, Clark et al., 2019), multiple-choice, span selection (SQuAD, Rajpurkar et al., 2016), or text generation (TriviaQA, Joshi et al., 2017).
Transformer architectures optimized for specific QA formulations have driven recent progress in question answering. For example, some models target IR-oriented QA (Guu et al., 2020) while others optimize their learning strategy to specific question types (e.g., by optimizing for expected answers to factoid questions). While specialization improves performance, it limits generalization. UnifiedQA (Khashabi et al., 2020) takes a step forward by generalizing the architecture and training over multiple data sets with different QA formulations.
Most research assumes that the information necessary to answer questions is either included with the query (e.g., BoolQ, SQuAD 1.1) or that the information was already stored in language models during initial pre-training or a task-specific second pre-training. However, this assumption limits language models relying on massive corpora (Gao et al., 2020) to learning oft-repeated facts (Petroni et al., 2019). Valuable, domain-specific information is seldom repeated often enough to be captured by language models. An evaluation of domain-specific knowledge without access to a relevant text is even more challenging, as simple strategies like identifying the answer by information retrieval are ineffective. Even reasoning tasks such as ARC (Clark et al., 2018) only target general scientific knowledge and offer large text corpora to aid QA systems.
We propose Learning from Textbooks (LEFT), a new task to classify domain-specific statements drawn from a textbook's review questions as true or false using three evaluation configurations. The first configuration tests the ability to answer questions without any domain-specific material (e.g., applying a PTLM with no access to domain-specific knowledge). This setting is equivalent to a person taking the test before taking the class. In the second configuration, a model has access to the textbook's content and may encode the information in the textbook but may not access the textbook during the test; we call this closed book. The second configuration tests a model's ability to learn by reading. In the third configuration, which we call open book, models can access the textbook during the test. Thus, LEFT supports contrasting QA formulations and reading methods to explore the strengths and weaknesses of various QA approaches. The LEFT data and leaderboard are available at https://leftleaderboard.isi.edu.

Related Work
Question Answering. Most previous research specializes QA models to target specific question formulations. Question answering with a relevant paragraph often relies on span selection (Rajpurkar et al., 2016; Yang et al., 2015) or simple reasoning (Clark et al., 2019). Previous open-book QA methods first filter a large corpus to a small set of relevant documents using information retrieval (Karpukhin et al., 2020; Robertson and Zaragoza, 2009). The document set then provides context for answering questions (Dhingra et al., 2017; Dunn et al., 2017; Joshi et al., 2017; Nguyen et al., 2016). Conversely, closed-book QA requires models to answer using only their implicit knowledge. Taking a step towards generalizing QA, UnifiedQA (Khashabi et al., 2020) proposes a unified architecture that answers various question types relying partly on knowledge encoded in its language model.

Knowledge in Pre-trained Language Models.
Pre-trained language models (PTLMs) have shown good performance in cloze-style queries (Petroni et al., 2019), fact-checking (Thorne et al., 2018), entity linking (Guo and Barbosa, 2018; Hoffart et al., 2011), and open-domain QA (Joshi et al., 2017; Kwiatkowski et al., 2019; Petroni et al., 2021). However, in most cases, the PTLMs rely on knowledge learned from massive corpora during pre-training. LEFT tests domain-specific knowledge acquired from a textbook, a small corpus of only a few hundred thousand words (see Table 1). Furthermore, it quantifies pre-trained language models' pre-existing knowledge by requiring that models take the task before and after reading LEFT's two textbooks.

Task Description
Learning from Textbooks (LEFT) contains two machine-readable college-level introductory textbooks and a set of true/false statements manually derived from review questions written by the textbook authors. The task requires that systems based on language models classify the statements before and after reading the given textbook material to separate what was learned from the book from what was known before reading. "Reading" is any algorithmic method that learns from the domain text without storing a copy of the text. To support comparisons with existing QA approaches, LEFT also supports the open-book setting, where a system can use a textbook paragraph when answering. Our goal is to support testing pre-trained language models, e.g., T5, and also those approaches that extract and store triples during reading (e.g., <U.S. Declaration of independence; signed; Aug 2, 1776>). While learning corpora appear in other question answering tasks (e.g., ARC, 14M words, Clark et al., 2018), the text included in LEFT is small and corresponds to the textbook chapters relevant to each question set. The largest text in LEFT contains only 300K words (for details, see Table 1).
LEFT includes two openly licensed² college-level introductory textbooks, American Government 2e (Krutz, 2019) and U.S. History (Corbett, 2014), and true/false statements derived from each book's review questions. We manually rewrote each textbook's multiple-choice review questions into a balanced set of true and false statements.³,⁴ We intentionally wrote the statements such that each true and false pair has high word overlap to deter classification strategies that rely on word overlap with the textbook. We include five sample statements from LEFT in Appendix A and discuss statement correctness in Appendix C.
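The statements in LEFT were written by hand, so the following is only a minimal sketch of the resulting data shape, not the authors' procedure. It pairs one true and one false statement for a single multiple-choice question and measures their word overlap. The `Statement` class, the `make_pair` and `word_overlap` helpers, and the use of Jaccard overlap over whitespace tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One true/false statement (hypothetical record layout)."""
    text: str
    label: bool


def make_pair(template: str, correct: str, distractor: str):
    """Fill the same template with the correct answer and one distractor."""
    return (
        Statement(template.format(answer=correct), True),
        Statement(template.format(answer=distractor), False),
    )


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercased whitespace tokens (an assumed metric)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


true_stmt, false_stmt = make_pair(
    "The U.S. Declaration of Independence was signed on {answer}.",
    correct="August 2, 1776",
    distractor="August 2, 1746",
)
overlap = word_overlap(true_stmt.text, false_stmt.text)
```

Because both statements share a template and differ only in the answer span, a classifier that scores statements by surface overlap with the textbook gets little signal from the pair.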
We measure task performance by accuracy. Since the two textbooks are used in teaching college students, we do not release the correct labels (see the Ethical Considerations section). We split each textbook into a Dev set consisting of the first eight chapters and a Test set consisting of the remaining chapters (see Table 1 for an overview). We allow unlimited submissions to the Dev set, but for any submission, we only provide the overall accuracy without feedback on which statements were correctly classified. This design decision aims to prevent divulging the correct answers (see the Ethical Considerations section).
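As a rough illustration of the split and the metric described above, the sketch below partitions statements by chapter number and computes accuracy. The `chapter` field name and both function names are hypothetical, and the LEFT leaderboard computes accuracy on its side since labels are not released.

```python
def split_by_chapter(statements, dev_chapters=8):
    """Dev = statements from the first `dev_chapters` chapters; Test = the rest."""
    dev = [s for s in statements if s["chapter"] <= dev_chapters]
    test = [s for s in statements if s["chapter"] > dev_chapters]
    return dev, test


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold true/false labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```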
LEFT has three evaluation configurations: (1) Prior-knowledge; (2) Closed-book, after reading; and (3) Open-book. Prior-knowledge tests the ability to answer questions without any domain-specific material. Language models must rely solely on the knowledge learned from their large pre-training corpora. In the second configuration, Closed-book, after reading, models may access the textbook's content and may encode the information in the textbook but may not access the textbook during the test. For each set, models may read the set's corresponding textbook chapters, the entire textbook, or both textbooks. We require that all model submissions to this evaluation configuration also submit to Prior-knowledge. Predictions before reading (Prior-knowledge) quantify the information included in each model through initial pre-training. The change in performance from Prior-knowledge to Closed-book, after reading illustrates each model's reading effectiveness. In the third configuration, Open-book, models can access the textbook or relevant chapter during the test. To support research on open-book question answering, with each statement, we include the textbook paragraph that provides the information necessary to classify the statement. In our experiments, we call this goldIR. Thus, LEFT supports contrasting QA formulations and reading methods to explore the strengths and weaknesses of various QA approaches.
2 Both textbooks are licensed under the Creative Commons Attribution License v4.0.
3 We construct one true and one false statement for each question to obtain a balanced data set. For example, the question "When was the U.S. Declaration of Independence signed? (A, correct) August 2, 1776; (B) December 2, 1776; (C) August 2, 1746; (D) August 22, 1976" could become "The U.S. Declaration of Independence was signed on August 2, 1776" (true) and "The U.S. Declaration of Independence was signed on August 2, 1746" (false).
4 For U.S. History's Dev set, we also process questions written by a community of instructors.

Results
We illustrate baseline performance on LEFT using two state-of-the-art language models: T5 and GPT-Neo (a GPT-3 architecture, Brown et al., 2020, trained on the open Pile corpus, Gao et al., 2020). We fine-tune the two language models using BoolQ (Clark et al., 2019). Table 2 shows results in LEFT's three evaluation settings: Prior-knowledge (out-of-the-box language models fine-tuned on BoolQ), Closed-book, after reading (language models with continued light pre-training on LEFT's text content), and Open-book (where models have access to the relevant textbook paragraph). Since the Prior-knowledge and Closed-book settings do not include the relevant paragraph for each question, we adjust fine-tuning to only use BoolQ's questions and ignore its text snippets. In the Open-book setting, we consider automatically retrieved textbook paragraphs (using sBERT, Reimers and Gurevych, 2019) and manually identified relevant paragraphs (gold information retrieval, goldIR). When selecting the relevant textbook content, we select one natural paragraph (i.e., as written by each textbook's authors). However, due to technical limitations imposed by T5's memory consumption, in our experiments, we limit the concatenated statements and paragraphs to a maximum length of 128 word pieces (see Appendix B.1).
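The open-book pipeline above (retrieve one paragraph, concatenate it with the statement, truncate) can be sketched as follows. This is not the paper's implementation. To stay self-contained it swaps sBERT embeddings for a bag-of-words cosine similarity and counts whitespace tokens instead of word pieces. The function names and the statement-first concatenation order are also assumptions.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: bag of lowercased whitespace tokens (NOT sBERT)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(statement, paragraphs):
    """Return the single paragraph most similar to the statement."""
    query = embed(statement)
    return max(paragraphs, key=lambda p: cosine(query, embed(p)))


def build_input(statement, paragraph, max_tokens=128):
    """Concatenate statement and paragraph, then truncate.

    Assumption: whitespace tokens approximate the 128 word-piece limit,
    and the statement comes first so truncation only removes context.
    """
    tokens = (statement + " " + paragraph).split()
    return " ".join(tokens[:max_tokens])
```

Replacing `retrieve` with a lookup of the manually identified paragraph gives the goldIR condition, which isolates reading comprehension from retrieval errors.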

Baseline Results
Table 2: Baseline accuracy with the current state-of-the-art language models. U.S. History's Dev set consists of statements based on the textbook's review questions and on questions from a community of instructors. In the heading, each set's name is followed by its number of statements. The order of abbreviations reflects the order of operations. All models are fine-tuned with BoolQ; +/-ctx: whether we included BoolQ's context during fine-tuning; +pt: whether we pre-trained on the relevant textbook chapters.
T5 and GPT-Neo's scores are indistinguishable from the random baseline of 50% in the Prior-knowledge setting, suggesting that the information the statements query is either not present in the two language models or not easily accessible. Continuing each model's pre-training with the relevant textbook parts sometimes helps, but not consistently. The lack of improvement after reading is further evidence that the models memorize, but not in beneficial ways, i.e., they can complete sentences but do not learn the subject matter and cannot classify the statements, even after 20 epochs. It also suggests that the closed-book setting represents a new challenge for PTLMs.
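One way to make "indistinguishable from the random baseline" concrete is an exact two-sided binomial test against 50% accuracy. The sketch below is an illustration, not an analysis the paper reports, and the set size of 100 statements in the usage lines is a hypothetical value rather than a LEFT set size.

```python
from math import comb


def binomial_two_sided_p(correct, total):
    """Exact two-sided p-value for H0: accuracy = 0.5 on a balanced set."""
    observed = abs(correct - total / 2)
    # Sum the probability of every outcome at least as far from total/2.
    tail = sum(
        comb(total, k)
        for k in range(total + 1)
        if abs(k - total / 2) >= observed
    )
    return tail / 2 ** total


# Hypothetical set of 100 statements (not an actual LEFT set size).
p_at_56 = binomial_two_sided_p(56, 100)
p_at_70 = binomial_two_sided_p(70, 100)
```

On a set this small, 56% accuracy would not be distinguishable from chance at the usual 0.05 level, whereas 70% would be. The actual conclusion depends on the real number of statements in each set.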
Accuracy in the open-book setting is far higher, especially when using goldIR (i.e., a manually selected relevant paragraph). As in the closed-book setting, we contrast models using only prior knowledge with models pre-trained on the textbook. Pre-training with the textbook never improves the system's accuracy, suggesting that even in this setting, the models are not learning by reading the textbook. The gap between goldIR- and sBERT-based retrieval suggests that there is room for retrieval-based improvement in the open-book setting. However, even with goldIR, T5 only achieves an accuracy of ~70%, suggesting that paragraph-based QA alone is not solved with existing models.

Conclusions & Future Work
There are several natural directions in which we can extend and improve LEFT. We are extending U.S. History's Test set as we did with the Dev set by including statements based on questions written by a community of instructors. We are also collecting relevant paragraphs for the extra statements. Lastly, we are categorizing the kind of knowledge required to classify each statement to better understand what kinds of knowledge pose the most difficulties.
We draw several conclusions from this work. Foremost, Learning from Textbooks (LEFT) represents a new type of challenge task for PTLMs, contrasted with the much-studied challenges of (1) common sense QA based on prior knowledge, (2) reading comprehension given a paragraph, and (3) QA using large domain-specific corpora, e.g., science at the elementary- or middle-school level. The task is intended to stimulate research on the following dimensions:
1. Zero-shot learning, much as an entering college student could do when studying a textbook,
2. Measuring a system's knowledge before vs. after "reading" the textbook,
3. Capability in both closed-book and open-book question answering,
4. The effect of IR accuracy on task accuracy compared to the system's language understanding performance.
Our baseline studies show that T5 and GPT-Neo thus far are challenged to show improvement after reading the relevant textbook, that open-book evaluation is easier than closed-book (as it is for humans), and that the gating factor in LEFT is understanding the textbook and/or the question rather than paragraph retrieval. The baseline results show there is much room for improvement.

Ethical Considerations
We have reflected on two ethical considerations when creating Learning from Textbooks (LEFT): content and environmental impact.
Content. The two textbooks in LEFT cover topics that include history, race, and politics. OpenStax textbooks follow a set of Diversity and Representation Development Guidelines, which aim to "properly represent genders, gender identities, races, cultures, geographies, ethnic backgrounds, disabilities, nationalities, ages, sexual orientations, socio-economic status, and diverse viewpoints".⁵ As creators of an NLP task, we do not make any claims, nor do we comment on the topics covered in the two textbooks. Furthermore, we understand that documents as large and complex as textbooks are bound to contain inaccuracies. We invite users with specific content accuracy concerns to consult the official textbook errata included in each textbook's instructor resources.⁶ Releasing labels for the statements in LEFT would indirectly reveal the correct answers for multiple-choice questions in the two textbooks. While both American Government 2e and U.S. History include answer keys, they are incomplete. We believe releasing the correct answers to all multiple-choice questions in the book would be detrimental to the intended primary users of the two textbooks; in other words, it might hinder students' learning. We only used full-time employees compensated according to U.S. law to rewrite the multiple-choice review questions in the two textbooks.
Environmental. We included baseline results based on large pre-trained language models. Strubell et al. (2019) raised concerns about the environmental impact of training deep learning language models. Patterson et al. (2021) pointed out that most of the energy consumption for deep learning language models comes during the initial pre-training. In this work, we limit ourselves to fine-tuning and light continued pre-training of T5 and GPT-Neo. While we do not have information about GPT-Neo's training, T5's training took place in highly efficient data centers whose energy consumption was offset by purchasing electricity from renewable sources (Patterson et al., 2021). For our light pre-training and fine-tuning, we use a machine with four NVIDIA Quadro RTX 8000 GPUs fed from California's energy grid. The total computation time for the experiments in this paper is about 500 hours, but this is an informal estimate rather than an accurate measurement.
5 See Diversity and Representation Development Guidelines in the instructor materials for each textbook.
6 See the Errata Release Notes at https://openstax.org/details/books/american-government-2e?Instructor%20resources for American Government 2e and https://openstax.org/details/books/us-history?Instructor%20resources for U.S. History.

A Sample Statements
Sample statements in LEFT. The first two statements are from American Government 2e, the following three from U.S. History:
• Public goods are available to all without payment.
• In a majoritarian voting electoral system voters select the party of their choice rather than an individual candidate.
• Europeans did not introduce Indians to wampum.
• Philadelphia served as the base for British operations for most of the Revolutionary War.
• The British bombardment of Baltimore inspired The Star-Spangled Banner.

B Training Details
For all light pre-training and fine-tuning, we use a machine with four NVIDIA Quadro RTX 8000 GPUs.

B.2 GPT-Neo 2.7B
We use GPT-Neo 2.7B from the Hugging Face Model Hub. GPT-Neo matches the architecture of GPT-3 (Brown et al., 2020), but is trained on the openly available Pile corpus (Gao et al., 2020).

C Ensuring Statement Correctness
We took several steps to ensure statements' true/false correctness and prevent data bias/tells. For true/false correctness, we manually inspected the statements to check that they correspond to the correct and incorrect choices as given by each textbook's instructor material. We then wrote a script to automatically count the statements for each chapter to ensure that there are as many true labels as there are false. If some labels were to change accidentally during our research, the script would detect the change. For the manually retrieved relevant passages, annotators read each statement and identified the relevant paragraph. In the process, they also checked each statement's label.
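A per-chapter balance check of the kind described above could look like the following minimal sketch. The authors' script is not published in this paper, so the `(chapter, label)` input format and the function name are assumptions.

```python
from collections import defaultdict


def unbalanced_chapters(statements):
    """Return {chapter: (false_count, true_count)} for unbalanced chapters.

    `statements` is an iterable of (chapter, label) pairs, where label is
    a bool (hypothetical input format).
    """
    counts = defaultdict(lambda: [0, 0])  # index 0 = false, index 1 = true
    for chapter, label in statements:
        counts[chapter][int(label)] += 1
    return {
        chapter: (n_false, n_true)
        for chapter, (n_false, n_true) in counts.items()
        if n_false != n_true
    }
```

An empty result means every chapter has as many true labels as false ones. A single accidentally flipped label leaves its chapter with a two-statement difference, so it shows up in the result.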
To prevent data bias, we wrote statement pairs to have as much word overlap as logically and grammatically possible. We used multiple annotators to write the statements for the two textbooks (two native speakers for U.S. History; one native, one fluent non-native for American Government 2e). No partition is composed of statements written exclusively by a single person, ensuring no person-specific tells. Following that, we checked all statements for grammar and punctuation issues using automated checkers and another annotator reading. This stage deals with copy-paste tells in the data and cases where statements for one label sound unnatural.