Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models

Language models have been shown to exhibit positive scaling, where performance improves as models are scaled up in terms of size, compute, or data. In this work, we introduce NeQA, a dataset consisting of questions with negation on which language models do not exhibit straightforward positive scaling. We show that this task can exhibit inverse scaling, U-shaped scaling, or positive scaling, and that the three scaling trends shift in this order as we use more powerful prompting methods or model families. We hypothesize that solving NeQA depends on two subtasks: question answering (task 1) and negation understanding (task 2). We find that task 1 has linear scaling, while task 2 has sigmoid-shaped scaling with an emergent transition point, and composing these two scaling trends yields the final scaling trend of NeQA. Our work reveals the complex scaling trends of language models and provides a way to analyze them.


Introduction
Language models have been shown to exhibit positive scaling, where task performance improves as models are scaled up in terms of size, compute, or data, like the blue curve in Figure 1 (Kaplan et al., 2020; Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022; Srivastava et al., 2022; Liang et al., 2022). However, there are exceptions. Recent works show that some tasks can exhibit inverse scaling (McKenzie et al., 2022), where performance degrades as models are scaled up (green curve), or U-shaped scaling (Wei et al., 2022b), where performance degrades first but then improves as models are scaled up (red curve). Analyzing tasks that exhibit different scaling trends, such as inverse and U-shaped scaling, is therefore useful for better understanding the behaviors of language models, identifying their limitations, and guiding future development. In this work, we introduce NeQA, a new task of answering multiple-choice questions containing negation words, constructed by transforming questions from OBQA (Mihaylov et al., 2018) and NegatedLAMA (Kassner and Schütze, 2020). We conduct experiments on this task using 4 language model families and 3 prompting methods, and show that large language models do not follow straightforward positive scaling on this task. Specifically, as we use more powerful prompting methods or model families, NeQA exhibits a gradation from inverse scaling, to U-shaped scaling, to positive scaling. This result provides a unified view of when the three types of scaling trends (inverse, U-shaped, and positive scaling) occur for language models. It also indicates that developing large language models' capability to process negation may be a complex and nuanced problem.
To further understand this nuanced scaling trend of the NeQA task, we decompose the task into two subtasks: question answering (task 1) and negation understanding (task 2). Our empirical results show that task 1 has linearly positive scaling, while task 2 has sigmoid-shaped scaling with an emergent transition point, where the transition point is influenced by the prompting method and model family. Combining these two scaling trends yields the final scaling trend observed in NeQA. The task decomposition provides a new way to think of scaling on a task in terms of a combination of its component skills.
In summary, our contributions are (1) the NeQA dataset that contains diverse distributions of texts about negation; (2) an evaluation of different large language models on the NeQA dataset, which exhibits different scaling trends; (3) a task decomposition analysis explaining the above scaling trends.
2 Dataset: NeQA

We develop NeQA, a question answering dataset designed to evaluate the ability of models to process negation in natural language. Each example of the dataset consists of a negated question and two answer choices, one correct and one incorrect. An example of NeQA looks like: (question "Child does not want?", correct choice "marriage", incorrect choice "love"). To construct the dataset, we leveraged NegatedLAMA (Kassner and Schütze, 2020) and OBQA (Mihaylov et al., 2018). The NegatedLAMA dataset includes negated questions from four subsets: ConceptNet, GoogleRE, SQuAD, and TREx. Each subset comprises multiple files that represent different question distributions, such as questions about different entity relations. Each question is associated with a negated question, an answer, and a misprimed question (i.e., a wrong answer followed by the question). For instance, when "Child wants?" is the original question, "Child does not want?" can be its associated negated question, "love" can be the answer, and "Marriage? Child wants?" can be its misprimed question. We turn this into a multiple-choice question by using the negated question as the question and pairing the wrong answer from the misprimed question with the correct answer as the two choices. For instance, in the above example, we get "Q: Child does not want? A. love B. marriage" (Appendix Table 3). To ensure diversity and representativeness, we randomly selected at most 50 questions from each file.
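For concreteness, the conversion from a NegatedLAMA record to a NeQA example can be sketched as follows. This is a minimal illustration with hypothetical field and function names, not the actual construction code (which is available in our repository):

```python
import random

def to_neqa_example(negated_question, answer, misprimed_answer):
    """Convert one NegatedLAMA record into a NeQA multiple-choice example.

    negated_question: e.g., "Child does not want?"
    answer: answer to the ORIGINAL question, e.g., "love"
    misprimed_answer: wrong answer taken from the misprimed question,
                      e.g., "marriage" (from "Marriage? Child wants?")
    """
    # For the negated question, the misprimed (wrong) answer becomes the
    # correct choice, and the original answer becomes the incorrect choice.
    choices = [misprimed_answer, answer]
    random.shuffle(choices)  # labels are rebalanced between A and B later
    label = "A" if choices[0] == misprimed_answer else "B"
    return {"question": negated_question,
            "choices": {"A": choices[0], "B": choices[1]},
            "answer": label}

print(to_neqa_example("Child does not want?", "love", "marriage"))
```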
To analyze the impact of different negation types, we also created additional data by applying diverse rules to transform questions in OBQA (Mihaylov et al., 2018) into negated ones. We defined six types of negation transformations: action verb negation (e.g., "cause" → "does not/doesn't cause"), linking verb negation (e.g., "is" → "is not/isn't"), modal verb negation (e.g., "can" → "can not/can't"), conjunction negation (e.g., "because" → "not because"), negation prefix (e.g., "able" → "unable"), and negation prompt (e.g., adding "choose the wrong answer"). For each type, we collected 50 questions by applying a rule-based transformation, sampling an incorrect answer as the correct answer, and treating the correct answer as the incorrect answer. For example, "Pushing on a pedal is an example of" is an original question in OBQA with the correct answer "force" and one of the incorrect answers "speed". We apply the rule-based transformation to change the verb "is" to "isn't" and get "Q: Pushing on a pedal isn't an example of? A. speed B. force", where "A" is the answer (Appendix Table 3).
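A minimal sketch of such a rule-based transformation is shown below. The rules here are simplified illustrations, not the exact patterns used in our pipeline, which covers more verbs and edge cases and is followed by manual verification of every output:

```python
import re

# Illustrative rewrite rules for five of the six negation types
# (the sixth, negation prompt, negates via the instruction instead).
NEGATION_RULES = [
    (r"\bis\b", "isn't"),                # linking verb negation
    (r"\bcan\b", "can't"),               # modal verb negation
    (r"\bcauses?\b", "does not cause"),  # action verb negation
    (r"\bbecause\b", "not because"),     # conjunction negation
    (r"\bable\b", "unable"),             # negation prefix
]

def negate_question(question: str) -> str:
    """Apply the first matching rule-based negation transformation."""
    for pattern, replacement in NEGATION_RULES:
        if re.search(pattern, question):
            return re.sub(pattern, replacement, question, count=1)
    # Fall back to a negation prompt appended to the question.
    return question + " (Choose the wrong answer.)"

print(negate_question("Pushing on a pedal is an example of"))
# -> "Pushing on a pedal isn't an example of"
```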
We employ post-processing techniques such as redistributing labels evenly between "A" and "B" and balancing the use of negation words such as "not" and "n't". The validity of each question is ensured through manual examination and editing. Our dataset comprises a total of 1,718 questions sourced from ConceptNet (150 questions), GoogleRE (374 questions), SQuAD (100 questions), TREx (594 questions), and OBQA (500 questions), providing a diverse range of negation types, text distributions, and prompts. We believe that this dataset serves as a valuable benchmark for assessing the ability of language models to process negation. Data distributions are shown in Figure 3.
Out of the 1718 questions, we define a set of 944 questions from ConceptNet, TREx, and a subset of OBQA that exhibit clear positive scaling on the corresponding original (non-negated) questions.
For our experiments (§3), we randomly select 100 questions from this positive set to make the scaling trends more apparent in our analysis.
For zero-shot and zero-shot with hint evaluation, we follow the evaluation protocol of the MMLU paper (Hendrycks et al., 2021). We generate a prompt composed of a question and multiple-choice options, where the options are labeled "A" and "B". For example, a prompt may be "Question: Child does not want? A. love B. marriage Answer:". We then generate one token from the language model and rank the probability of the model selecting option "A" or "B". For few-shot with CoT, we follow the evaluation protocol of the CoT paper (Wei et al., 2022c) by generating sentences until reaching the end and parsing the answer using regular expressions. As our metric, we report the accuracy of the model predictions, where chance accuracy is 50% since NeQA is a balanced two-choice dataset.
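As a concrete illustration, the zero-shot scoring step can be sketched with an open GPT-2 model from Hugging Face Transformers; this is a minimal sketch only, as the models in our experiments are instead accessed through their providers' APIs:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Question: Child does not want? A. love B. marriage Answer:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

# Rank the model's probability of answering "A" vs. "B".
# (" A"/" B" with a leading space match GPT-2's tokenization after ":".)
score_a = logits[tokenizer.encode(" A")[0]]
score_b = logits[tokenizer.encode(" B")[0]]
prediction = "A" if score_a > score_b else "B"
print(prediction)
```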

Scaling Trends
Our evaluation reveals that the scaling trends of language models on the NeQA task vary depending on the prompting method and model family used (Figure 2). We found that the scaling trends of all language model families can be altered by different prompts. For example, zero-shot prompting resulted in inverse scaling in 3 out of 4 model families, whereas few-shot CoT prompting consistently resulted in positive scaling. As the prompt becomes stronger (i.e., more information, such as rationales and demonstrations, is provided to the language model), we observed a transition from inverse scaling, to U-shaped scaling, to positive scaling. For instance, GPT-3 exhibited inverse scaling, U-shaped scaling, and positive scaling, respectively, with these three prompting methods. Additionally, we discovered that switching to a stronger model family can alter the scaling shape. For example, transitioning from GPT-3 to GPT-3 Text Series, which was further trained to align with human values on multiple tasks, resulted in a shift from inverse scaling to U-shaped scaling when the same prompting method (e.g., zero-shot) is used.
In conclusion, stronger prompts or model families lead to a transition from inverse scaling, to U-shaped scaling, to positive scaling on the NeQA task. We may also interpret this as follows: the overarching scaling trend of language models on NeQA is U-shaped. If the model is weak (i.e., a weaker prompt or model family), only the left, inverse-sloped part of the "U" is observed; if the model is strong, the right, positive-sloped part of the "U" is observed.

Task Decomposition Analysis
We conducted further empirical analysis of why the scaling trends can be inverse, U-shaped, or positive and can transition with different prompts or model families. We decomposed the NeQA task into two subtasks: task 1 is to answer the original non-negated questions, and task 2 is to "understand negation". In Figure 4, we show the scaling of task 1 and task 2 performance with the GPT-3 and GPT-3 Text Series families. Task 1 performance is measured by the accuracy of answering the original non-negated questions, and task 2 performance is measured by the accuracy of differentiating original questions from negated questions. Task examples are shown in Figure 4 (right). Both tasks are evaluated in a zero-shot manner.
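Concretely, one way to operationalize the task 2 metric is to count how often the model's prediction changes once negation is added. The sketch below reflects this assumption about the metric; `predict` is a stand-in for the zero-shot scoring procedure of §3:

```python
def task2_accuracy(question_pairs, predict):
    """Fraction of (original, negated) prompt pairs for which the model
    gives different answers, i.e., it notices the negation.

    question_pairs: list of (original_prompt, negated_prompt) tuples
    predict: function mapping a prompt to a choice label ("A" or "B")
    """
    flipped = [predict(original) != predict(negated)
               for original, negated in question_pairs]
    return sum(flipped) / len(flipped)
```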
Our experiments showed that task 1 scales mostly linearly in a positive direction, whereas task 2 scales like a sigmoid with an emergent transition point, analogous to the grokking curve (Power et al., 2022). Before this transition point, models do not "understand negation": they achieve low accuracy in differentiating original questions from negated questions, and thus output the same answer to both the original and negated questions. It is worth noting that the labels of the composed task NeQA are essentially the inverse of the non-negated QA labels of task 1. Therefore, the positive scaling in task 1 results in inverse scaling for the composed task NeQA, because the predictions remain unchanged while the ground-truth labels are inverted. After the transition point, models start to "understand negation" and predict opposite answers to the original questions, resulting in positive scaling. When the transition point never occurs within the sizes available in the model family, the overall scaling looks inverse; when the transition point occurs before the smallest model, the overall scaling looks positive; when the transition point is in the middle, the overall scaling looks U-shaped. We provide further explanations of the composed performance curve in §A and §B.
Interestingly, we found that the transition point can be moved earlier with stronger prompting methods or model families. For example, both GPT-3 and GPT-3 Text Series show that the transition point happens much earlier when using the stronger prompt compared to the weaker prompt (see Figure 4). Furthermore, GPT-3 Text Series has an earlier transition point than the GPT-3 models. This can explain why using stronger prompts or stronger model families results in a transition from inverse scaling, to U-shaped scaling, to positive scaling.
By decomposing a task and studying the scaling trends of the individual subtasks, our analysis offers a new way to understand the complexity of language model scaling trends. This analysis could be applied to various tasks beyond NeQA, especially tasks that consist of multiple subtasks of different levels of difficulty. It can provide a deeper understanding of the strengths and weaknesses of different language models and offer useful insights for the development of better models and training/prompting methods.

Related Works
Scaling trends. Recent years have seen significant scaling of language models, such as the progression from GPT-1 to GPT-3, which has led to tremendous improvements in their performance and capabilities in natural language processing (Radford et al., 2018, 2019; Brown et al., 2020). Researchers have begun to investigate the scaling trends of language models to capture the relationship between model performance and model scale, including the parameter count and the amount of training data/compute used (Kaplan et al., 2020). While most scaling papers show positive scaling trends where larger models perform better on various tasks (Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022; Srivastava et al., 2022; Liang et al., 2022), it is important to also investigate tasks that exhibit other trends such as inverse scaling, which can shed light on limitations in current language model development and guide future improvements. For instance, TruthfulQA (Lin et al., 2022) was one of the earliest tasks to exhibit inverse scaling: larger language models are more prone to hallucination and generate more untrue answers. A recent competition, the Inverse Scaling Prize (McKenzie et al., 2022), called for tasks that cause inverse scaling. In the first round, four tasks, including NeQA, redefine math, quote repetition, and hindsight neglect, showed inverse scaling. Wei et al. (2022b) then found that some of these tasks show U-shaped scaling after further scaling up language models. In this work, we unify the above findings and provide a holistic picture of scaling trends, including the transition from inverse to U-shaped to positive scaling across model families and prompting methods, and empirical explanations behind these scaling trends.
Negation understanding. Negation is a fundamental aspect of natural language understanding (Ackrill et al., 1975; Blanco and Moldovan, 2011). Existing works have found that NLP models can struggle to process negation in text (Jiménez-Zafra et al., 2020). For example, these works investigate models' abilities to process negation through natural language inference tasks (Cooper et al., 1996; Dagan et al., 2006; Hossain et al., 2020; Geiger et al., 2020), machine translation (Fancellu and Webber, 2015; Hossain et al., 2022), language model prompting (Kassner and Schütze, 2020; Ettinger, 2020; Jang et al., 2022), contrastive reading comprehension (Ravichander et al., 2022), and probing model activations (Burns et al., 2022). In response, existing works have also studied methods to improve the abilities of NLP models to process negation, such as leveraging datasets about negation (Kim et al., 2019; Jiang et al., 2021), auxiliary training objectives/tasks (Khandelwal and Sawant, 2020; Moore and Barnes, 2021; Hosseini et al., 2021; Truong et al., 2022), and neuro-symbolic reasoning modules (Yasunaga et al., 2021, 2022). While these existing works typically study a fixed size or type of model, our work provides the first study of the effect of negation on the scaling trends of language models. We find that negation can exhibit nuanced scaling trends, e.g., U-shaped scaling with increased model size and improved model families and prompting methods. This finding offers a more comprehensive insight into how to improve the abilities of language models to understand negation: the model size, training algorithm, and prompting method all matter.

Conclusion
We introduced NeQA, a new question answering dataset on which language models exhibit scaling trends different from traditional positive scaling. We then proposed task decomposition analysis, a general idea of decomposing a task to better understand complex scaling trends and their transitions. We hope that these insights can facilitate the understanding and development of language models.

Limitations
This work introduced NeQA, a question answering dataset for evaluating the ability of large language models to process negation. While NeQA attempts to cover diverse types of negation (e.g., different negation phrases and positions) and multiple data sources (e.g., OBQA, LAMA), it is possible that the dataset construction misses some types of negation or domains of text. Our future work will extend the dataset to cover more comprehensive types of negation and domains of text, beyond OBQA and LAMA. Additionally, NeQA is an English dataset, and it would be interesting to extend it to non-English languages and conduct a more comprehensive evaluation of language models, including multilingual ones.
Another potential limitation is sensitivity in language model prompting. Language model performance is known to be influenced by the specific prompt used to query the model (e.g., a rephrased prompt may lead to different model outputs), and prompt engineering, i.e., finding the "right" prompt, may be needed to obtain reasonable outputs from language models (Jiang et al., 2020; Ruis et al., 2022; Wang et al., 2022). As our language model evaluation protocol uses prompting (§3), the evaluation results may inherit such prompt sensitivity. It would be interesting future work to incorporate techniques that mitigate prompt sensitivity into language model evaluation (e.g., Burns et al., 2022).

Ethics Statement
Our work offers benchmarks and insights to help develop language models that understand negation. Developing such models is crucial to society in many ways.
First, as language models are being used in various real-world applications, including fields like finance, healthcare, and law, it is important to ensure that they understand negation and make correct predictions.If they do not understand negation, they may output the opposite of what we actually want and may make harmful decisions for humans.
Negation is also a fundamental aspect of natural language understanding, and a language model that does not process negation correctly may not be able to truly understand natural language. This can erode trust and confidence in the model's outputs, ultimately limiting its utility.
Understanding negation correctly is therefore crucial for the development of reliable language models.We hope that our benchmark and evaluation results provide insights into the behavior of current language models and inspire the future development of language models that understand negation.

Reproducibility Statement
We provide our datasets and implementations at https://github.com/yuhui-zh15/NeQA. The implementations enable researchers to reproduce the datasets and results described here, as well as to apply our negation transformations to other datasets and run their own analyses.

A Task Decomposition Simulation: Composing Subtask Scaling Trends Yields U-Shaped Scaling

In this section, we present a simple simulation to demonstrate how the U-shaped scaling trend of a composed task can be obtained from the scaling trends of each decomposed task. Assume that the accuracy of task 1 (question answering) is represented by t_1(x) and has a linear shape with an initial performance of 0.5 (random performance) and a final performance of 1.0 (perfect performance). Similarly, the accuracy of task 2 (negation understanding) is represented by t_2(x) and has a sigmoid-like shape with an initial performance of 0.5 (random performance) and a final performance of 1.0 (perfect performance), where x represents the scale (a combination of model size, data size, training computation, and prompting method). We define the score of negation understanding as s_2(x) = (t_2(x) − 0.5)/0.5, which represents the probability that the model will treat a negated sentence differently from the original sentence. The composed task, NeQA, then has an accuracy of t(x) = s_2(x) · t_1(x) + (1 − s_2(x)) · (1 − t_1(x)): with probability s_2(x) the model flips its answer to the original question and is correct with probability t_1(x), and otherwise it answers as if the question were not negated and is correct with probability 1 − t_1(x). Figure 5 shows the plots of these three curves, t_1(x), t_2(x), and t(x). The simulated performance curve of NeQA, t(x), indeed exhibits a U-shape.
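A minimal sketch of this simulation is given below. The exact parameterizations of t_1(x) and t_2(x) are illustrative choices, not necessarily the precise curves used to produce Figure 5:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 100)           # scale (size/data/compute/prompt)
t1 = 0.5 + 0.5 * x                   # task 1: linear, 0.5 -> 1.0
t2 = 0.5 + 0.5 / (1 + np.exp(-20 * (x - 0.6)))  # task 2: sigmoid, 0.5 -> 1.0
s2 = (t2 - 0.5) / 0.5                # prob. of treating negation differently

# Composed NeQA accuracy: with prob. s2 the model flips its answer
# (correct with prob. t1); otherwise it answers as if non-negated
# (correct with prob. 1 - t1).
t = s2 * t1 + (1 - s2) * (1 - t1)

for curve, label in [(t1, "task 1"), (t2, "task 2"), (t, "NeQA (composed)")]:
    plt.plot(x, curve, label=label)
plt.xlabel("scale")
plt.ylabel("accuracy")
plt.legend()
plt.show()  # the composed curve dips below 0.5 before rising: a U-shape
```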

Discussion of Task Decomposition Validity and Generalizability to Other Tasks

We first clarify that task decomposition analysis is not intended to derive scaling laws (i.e., to predict the exact performance of language models as they scale). Instead, our analysis aims to explain scaling trends (inverse, U-shaped, positive). For example, translation performance may not be a simple addition of generation performance and word translation performance, but the two should be positively correlated. Furthermore, while this exact decomposition structure might not hold for more complex tasks, our decomposition analysis is a pioneering attempt to explain scaling trends on a task other than vanilla language modeling. Investigating the applicability of decomposition to other tasks is an essential future direction, and we hope our work will inspire others to push these boundaries. Lastly, we believe that our focus on negation is already a well-scoped and significant research contribution, as negation is one of the most common linguistic phenomena.
To study negation, we collected the NeQA dataset, which exhibits inverse, U-shaped, or positive scaling. To explain this, we propose the decomposition intuition above, which works well because answering a negated question requires first answering the original question and then flipping the answer.

B Fine-tuning Simulation: Training Data Attributes and Training Compute Also Impact Scaling Trends
In addition to the prompting methods and model families that we studied in §3, we are also interested in other factors that may contribute to scaling trends, specifically those related to the training process. However, most large language models are not publicly available, and training/reproducing them from scratch would require excessive computational resources. In light of this, we conduct experiments using synthetic data and small language models to simulate and analyze the language model learning process. We adapt the SST-2 dataset (Socher et al., 2013) for our simulation. For each sentence s in SST-2, with probability 1 − x, we modify it to "s. This does suggest it is good/bad" (depending on the label), and with probability x, we change it to "s. This does not suggest it is good/bad". Then, we fine-tune different sizes of GPT-2 (Radford et al., 2019) on this synthetic corpus with the standard causal language modeling objective. We vary the number of epochs t and the negation ratio x to understand their effect on scaling trends.
To evaluate the fine-tuned language models, we use the language model to complete "s. This does suggest it is _" for the original sentiment classification task (similar to task 1 in the main paper), and to complete "s. This does not suggest it is _" for the negated sentiment classification task (similar to the composed task NeQA in the main paper). We report accuracy on both the original and negated sentiment classification tasks.
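The corpus transformation can be sketched as follows. This is a minimal illustration; in particular, we assume the negated template pairs each sentence with the opposite sentiment word so that the resulting statement stays truthful:

```python
import random

def make_synthetic_sentence(sentence: str, label: str,
                            negation_ratio: float) -> str:
    """Append a templated conclusion to an SST-2 sentence.

    With probability 1 - negation_ratio, use the affirmative template with
    the sentiment word matching the label; with probability negation_ratio,
    use the negated template with the opposite sentiment word.
    """
    word = "good" if label == "positive" else "bad"
    if random.random() < negation_ratio:
        opposite = "bad" if word == "good" else "good"
        return f"{sentence} This does not suggest it is {opposite}."
    return f"{sentence} This does suggest it is {word}."

# Evaluation then compares the model's next-word probabilities for
# "good" vs. "bad" after "... This does suggest it is" (original task)
# and "... This does not suggest it is" (negated task).
```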
Our simulation demonstrates that the scaling trends on negated sentiment classification are influenced by the negation ratio x and the number of training epochs t (Figure 6). With the same number of training epochs t = 1, increasing the negation ratio x from 0.01% to 0.1% and then to 1% causes the scaling to shift from inverse scaling, to U-shaped scaling, then to positive scaling. Additionally, increasing the number of training epochs from 1 to 3 causes the scaling trend to shift from inverse scaling to U-shaped when the negation ratio is x = 0.01%, and from U-shaped to positive when the negation ratio is x = 0.1%.
This simulation highlights that factors in the training process, such as dataset attributes (e.g., the negation ratio) and training compute, also have significant impacts on scaling trends. Together with the inference-time factors, such as the prompting methods and model families discussed in the main paper, this provides a comprehensive picture of the complexity of scaling trends and how different factors can influence them.
The transition of the scaling trends can also be explained by task decomposition, where Task 1 (original sentiment classification) is always positively scaled, while Task 2 (negation understanding) is also positive but is shaped like a sigmoid, with the transition point controlled by the number of negation examples seen by the language model.The number of negations seen can be modified by using a larger negation ratio or more training epochs.The composition of these subtask scaling trends yields the final scaling curves.
The reason why task 1 has a more linear shape while task 2 has a more sigmoid-like shape can be understood through the dynamics of deep learning. Empirical risk minimization (ERM) optimizes for average performance, and since negated sentences are significantly underrepresented compared to non-negated sentences in the training data, they are ignored at the beginning of training (Sagawa et al., 2020; Sohoni et al., 2020; Liu et al., 2021). As a result, the performance on negated sentences lags behind the average. However, once the majority of the training examples are learned, ERM finally starts to optimize for the underrepresented groups, leading to improved performance on negated sentences. This intuition adds new insights into the emergence of language model capabilities (Bommasani et al., 2021; Wei et al., 2022a), and we leave more rigorous analyses to future work.

C.1 Results
The performance of various models on the different tasks, underlying Figure 2 and Figure 4, can be found in Table 1 and Table 2.

C.2 Data
In Table 3, we provide examples showing the data generation process of the NeQA dataset that was introduced in §2.
In Tables 4 and 5, we present a list of 100 data samples from NeQA that were used throughout the paper to examine scaling behaviors and task decomposition.

C.3 Prompts
The specific prompts utilized for various prompting methods and tasks are outlined in Table 6.

C.4 Models
In Table 7, we present a list of all the models used in this work, including 4 model families and 17 models. Model details are from Liang et al. (2022).

D Additional Analyses

D.1 Few-Shot Prompting
Few-shot in-context learning has been demonstrated to be an effective method for adapting pretrained language models to specific tasks. We experimented with few-shot prompting (not few-shot chain-of-thought prompting) but did not include the results in the main paper because the scaling shapes were often the same as zero-shot prompting across 3 out of 4 studied model families (inverse for GPT-3 and Cohere, U-shaped for GPT-3 Text Series; only Jurassic changes from inverse to U-shaped). We provide the few-shot prompting results in Table 1.
Several recent works can explain why few-shot prompting does not alter the scaling curve shape. For example, Min et al. (2022) and Xie et al. (2021) show that in-context learning can be viewed as a Bayesian inference process, with the model learning more about the input-output format than the input-output mapping. When provided with demonstrations of negated question-answer pairs, the model fails to learn the mapping between them and predicts the same answer as without demonstrations.

D.2 Prompt Variations
Due to the sensitivity of language model performance to prompts (Jiang et al., 2020; Ruis et al., 2022; Wang et al., 2022) (also discussed in Limitations), we experimented with various prompts and found: (1) minor changes such as word substitution or paraphrasing result in similar scaling shapes; (2) major prompt changes can alter the curve shape, e.g., adding 'For example, "isn't", "is not", "not because", "do not" are strong signs of negation' to the zero-shot w/ hint prompt changes GPT-3 from inverse scaling to a weak U-shape, which can be seen as increasing CoT strength by providing more hints/rationales; (3) varying the amount of information in CoT prompts affects the shape: intermediate-level information yields a scaling shape between U-shaped (zero-shot w/ hint, the weakest CoT version) and strongly positive (few-shot CoT, the strongest CoT version).

D.3 Dataset Validity
NeQA is curated by applying rule-based transformations to existing QA datasets. To ensure dataset quality, we carefully designed and verified the transformation rules through manual inspection of the transformed examples. We found that the transformation rules generally work well, and we only removed a few questions due to grammatical errors after adding negation. Furthermore, as part of the submission to the Inverse Scaling Prize (McKenzie et al., 2022), the organizers ran crowdsourcing experiments to validate our dataset. Specifically, they validated labels on 50 random examples from NeQA and found that the average agreement between workers and gold labels is 100%, with no confusing questions.

D.4 Subset Selection
The NeQA dataset is composed of five subsets: ConceptNet, GoogleRE, SQuAD, TREx, and OBQA. For the purpose of this analysis, we only include ConceptNet, TREx, and OBQA. Our goal is to examine the scaling trends, so we aim for steeper scaling. However, GPT-3 does not exhibit strong positive scaling and inverse scaling on the original and negated GoogleRE and SQuAD subsets (Figure 7), so these subsets were not included in the analysis.
Furthermore, these scaling trends of NeQA subsets provide additional verification of our task decomposition analysis.When language models fail to understand negation (Task 2), a stronger positive scaling on the original dataset (Task 1) causes a stronger inverse scaling on the negated data (Composed Task).

D.5 Negation Category
In Figure 8 (left), we find that negating by adding an "un-"/"in-" prefix to a word or negating modal verbs (e.g., "can" to "cannot") does not show clear inverse scaling in zero-shot prompting. We suspect that the difference arises because these negation categories replace a word instead of adding the additional word "not". We leave further analysis to future work.

D.6 Wrong Choice
This experiment aimed to understand whether more confusing wrong choices change the scaling (Figure 8 (middle)). For example, given the question "Apple is not made by", the wrong choice can be "Microsoft" (high-ranked, more confusing), "air" (low-ranked, less confusing), or a random word such as "China". We find that the wrong choice has little impact on the scaling trends.

D.7 Mispriming
Following Kassner and Schütze (2020), we put the wrong choice (i.e., the correct choice before negation) before the question (e.g., changing "iPhone is not made by" to "Apple? iPhone is not made by"). Mispriming makes inverse scaling stronger on negated questions in the zero-shot prompting setting (Figure 8 (right)). Interestingly, we also note that a phase change happens in small-size models. While this is a very interesting finding, mispriming might not be frequent in real-world applications of language models, so we do not include it in the NeQA dataset.

Figure 1: Illustration of three types of scaling trends.

Figure 2: Scaling trends of various language models on the NeQA dataset. As we use more powerful prompting methods or model families, we observe a gradation from inverse scaling to U-shaped to positive scaling. More details in §C.

Figure 3: (Left) Statistics of NeQA data sources. (Right) Three prompting methods that yield shifts in scaling trends.

Figure 4: Task decomposition analysis of the NeQA task. NeQA can be decomposed into two subtasks: question answering (task 1) and negation understanding (task 2). Our empirical results show that task 1 has linear scaling, while task 2 has sigmoid-shaped scaling with an emergent transition point, where the transition point is influenced by the prompting method and model family. Combining these two scaling trends yields the final scaling trend observed in NeQA.

Figure 5: Task decomposition simulation illustrates the U-shape of the composed task given the scaling trends of the individual subtasks.

Figure 6: Fine-tuning simulation reveals that dataset attributes and training compute also impact scaling trends. We fine-tune different-sized GPT-2 models on a transformed SST-2 sentiment classification dataset with the causal language modeling objective. When the negation ratio in the dataset or the number of fine-tuning epochs increases, we observe a shift from inverse scaling to U-shaped to positive scaling on the negated sentiment classification task. The original sentiment classification task always shows positive scaling.

Table 3: Data generation process of the NeQA dataset. NeQA is constructed by transforming two existing QA datasets: NegatedLAMA and OBQA. All fields of the original and transformed questions are shown.