NEWTON: Are Large Language Models Capable of Physical Reasoning?

Large Language Models (LLMs), through their contextualized representations, have been empirically proven to encapsulate syntactic, semantic, word-sense, and common-sense knowledge. However, there has been limited exploration of their physical reasoning abilities, specifically concerning the crucial attributes for comprehending everyday objects. To address this gap, we introduce NEWTON, a repository and benchmark for evaluating the physical reasoning skills of LLMs. Further, to support domain-specific adaptation, we present a pipeline that enables researchers to generate a variant of this benchmark customized to the objects and attributes relevant to their application. The NEWTON repository comprises a collection of 2800 object-attribute pairs, providing the foundation for generating infinite-scale assessment templates. The NEWTON benchmark consists of 160K questions, curated using the NEWTON repository to investigate the physical reasoning capabilities of several mainstream language models across foundational, explicit, and implicit reasoning tasks. Through extensive empirical analysis, our results highlight the capabilities of LLMs for physical reasoning. We find that LLMs like GPT-4 demonstrate strong reasoning capabilities in scenario-based tasks but exhibit less consistency in object-attribute reasoning compared to humans (50% vs. 84%). Furthermore, the NEWTON platform demonstrates its potential for evaluating and enhancing language models, paving the way for their integration into physically grounded settings, such as robotic manipulation. Project site: https://newtonreasoning.github.io


Introduction
Natural Language Processing (NLP) has made remarkable progress using contextualized representations trained on extensive unprocessed text data (Zhao et al., 2023). As LLMs are increasingly used in physically embodied pipelines (Driess et al., 2023; Ahn et al., 2022; Wu et al., 2023), it is crucial to comprehensively understand the extent to which LLMs can perform physical reasoning. Some studies have proposed generalized frameworks to assess language model performance (Ribeiro et al., 2020; Kiela et al., 2021), while others have designed question answering and reading comprehension datasets to probe LLMs (Zellers et al., 2018; Chen et al., 2019; Rogers et al., 2023). However, few have explored the physical reasoning ability of LLMs.
Physical reasoning involves the cognitive process of comprehending and predicting the dynamics of physical systems based on observable phenomena and fundamental principles (McCloskey et al., 1983; Carey, 2000). It encompasses the capacity to make sense of the world by applying knowledge of attributes such as brittleness, malleability, etc. By considering brittleness, for instance, we recognize the need to handle an object with caution, while malleability suggests that an object can be easily reshaped under pressure without fracturing. These abstract concepts enable reasoning about the response of objects to interactions or changes in the environment, as shown in Figure 1.
Nevertheless, creating an evaluation framework for physical reasoning is difficult, primarily due to the lack of paired object-attribute data. Humans possess a wealth of knowledge regarding the internal structure of objects and their interactions with the physical world. However, this knowledge is often implicitly acquired, making it difficult to explicitly represent such information. Previous studies in the field of physical reasoning, such as Bisk et al. (2020); Aroca-Ouellette et al. (2021), have focused on common-sense reasoning or small-scale validation. As a result, the need for a comprehensive and systematic assessment of physical reasoning remains an open problem.
To address this gap, we propose NEWTON, a repository, pipeline, and benchmark designed to evaluate the physical reasoning capability of LLMs. The NEWTON repository consists of labeled object-centric data, crowd-sourced for 700+ objects across 8 physical attributes. The NEWTON pipeline introduces a method for systematically generating infinite evaluation questions tailored to specific use cases. The NEWTON benchmark consists of 160K pre-generated questions of progressive difficulty, spanning tasks of foundational attribute comprehension, explicit application, and implicit scenario-based analysis. Extensive empirical findings demonstrate the unique contributions of NEWTON, revealing its usefulness for evaluating LLMs' understanding of the underlying physics principles that dictate the behavior and properties of objects in everyday scenarios. Moreover, NEWTON effectively complements the existing repertoire of reasoning benchmarks and datasets, further enhancing the potential to assess and refine the physical reasoning capabilities of LLMs.

NEWTON Repository
At the core of constructing an evaluation framework for physical reasoning lies the need for a representation that captures the essential attributes of objects. In this section, we describe the NEWTON repository, including how we identify and shortlist objects and attributes, and how we obtain a set of consistent object-attribute annotations.

Objects and Attributes
To preserve grounding to physical objects, we leverage mainstream 3D object datasets: Objaverse (Deitke et al., 2023), YCB (Calli et al., 2015), Scanned Objects by Google (Downs et al., 2022), and Amazon Berkeley Objects (Collins et al., 2022). To establish a common representation of categories for the combined objects, we match the title of each 3D object to the nearest WordNet synset (Miller, 1995). Object categories are then filtered for redundancy (repeated categories), ambiguity (uncommon, abstract categories), and irrelevance (non-manipulable objects), resulting in a curated collection of 700 common household manipulable objects. Using a subset of the objects (5%), we ran iterations of pilot studies to fine-tune the format of human-facing questions and to identify the most comprehensive yet non-redundant set of physics attributes. Ultimately, we identified eight attributes: malleability, elasticity, stiffness, softness, sharpness, surface smoothness, surface hardness, and brittleness.
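A hedged sketch of this synset-matching step with NLTK is shown below; the progressively shortened-title heuristic and the example titles are illustrative assumptions, not necessarily the exact matching procedure used.

# Requires the WordNet corpus: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def nearest_synset(title):
    # Try progressively shorter suffixes of the title,
    # e.g. "blue ceramic coffee mug" -> "ceramic coffee mug" -> ... -> "mug".
    words = title.lower().split()
    for i in range(len(words)):
        synsets = wn.synsets("_".join(words[i:]), pos=wn.NOUN)
        if synsets:
            return synsets[0]
    return None

titles = ["blue ceramic coffee mug", "kitchen sponge", "abstract sculpture 042"]
categories = {nearest_synset(t) for t in titles}
categories.discard(None)   # unmatched titles are dropped; duplicate categories collapse via the set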

Crowdsourcing Framework
In order to acquire accurate ground-truth annotations for object-attribute pairs, we devised a Likert-scale annotation setup where annotators are tasked with selecting the most appropriate option from a 3-point scale to describe the given pair. The use of a 3-point Likert scale serves two primary purposes: simplicity and representation of extremes. The objective of the interface is to collect responses that facilitate the categorization of data into distinctly opposing groups based on attribute conditions. Thus, the 3-point scale offers a straightforward decision-making process for users, minimizing complexity and allowing for the selection of extreme options when there is a high level of certainty, while uncertain responses prompt the selection of the middle option. To gather these annotations, we employed Label Studio Enterprise to create tasks where a minimum of four annotators contribute overlapping annotations for each object-attribute pair.

Annotation Process
We provide an example of the annotation interface in Figure 2. Each annotation task consists of three main components: a question which specifies the object category and attribute, a set of 10 randomly sampled images illustrating common objects within the object category, and three answer choices. The question is designed to incorporate a description of the attribute of focus, and prompts the annotator to select the option which best describes the object with respect to that attribute. The annotators are asked to focus on the textual category; however, a visualization of 10 randomly sampled thumbnails is included for reference. We provide the suite of prompts used for the different attributes and the potential options in Table 1. For each task, a minimum of four annotations are collected, resulting in a total of 20,000+ annotations.

Figure 2. NEWTON Pipeline. In addition to facilitating the curation of the NEWTON repository and benchmark, the NEWTON pipeline enables convenient extensibility of evaluation measures to suit any scenario. The pipeline consists of four main components: annotation, filtering, template generation, and LLM evaluation. The annotation component starts with retrieving object categories from 3D object datasets; these categories are filtered for irrelevance, redundancy, and ambiguity, and matched to WordNet synsets to remove overlapping categories. We then obtain the object-attribute templates after combination with the physical attributes, and conduct the crowdsourcing process. Each object-attribute sample has a minimum of four overlapping annotations, the agreement between which is used to filter the annotations and form the NEWTON repository of object-attribute pairs. The template generation step begins with a generic template, which is filled through condition specification and object sampling. With the generated questions, we form a benchmark of 160K questions. The pipeline also enables formulation of infinite personalized evaluation prompts to suit any intended scenario.

NEWTON Pipeline
Leveraging the object-attribute repository, we introduce a pipeline which enables systematic generation of diverse questions based on pre-defined, personalized templates, as shown in Figure 2. The process starts with crowd-sourcing human annotations on object attributes, which are then filtered based on inter-annotator agreement. The filtered data is then used to populate template-based questions in a systematic manner, a process which is extensible to a diverse range of personalized scenarios beyond those established in the paper.

Annotation
A detailed description of the annotation interface and workflow is given in Sections 2.2 and 2.3. A minimum of four annotators contributed overlapping annotations for each pair. Over 20,000 annotations were collected, featuring 700+ objects and 8 unique physical reasoning attributes.
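An illustrative sketch of how each crowdsourcing task could be assembled before import into the annotation tool follows; the field names and thumbnail URLs are hypothetical, and only the malleability prompt (whose wording comes from Table 1) is shown.

import random

# Likert prompts per attribute (only malleability shown; wording taken from Table 1).
LIKERT_PROMPTS = {
    "malleability": ("Can a {obj} be reshaped to other forms?",
                     ["No: object cannot be reshaped.",
                      "Somewhat: object can be slightly reshaped.",
                      "Yes: the object can be reshaped to most arbitrary forms."]),
}

def build_task(obj_category, attribute, thumbnail_urls):
    question, choices = LIKERT_PROMPTS[attribute]
    return {"data": {"question": question.format(obj=obj_category),
                     "images": random.sample(thumbnail_urls,
                                             min(10, len(thumbnail_urls))),
                     "choices": choices}}

task = build_task("sponge", "malleability",
                  [f"https://example.com/sponge_{i}.jpg" for i in range(12)])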

Filtering
The selection of confident object-attribute pairs is contingent upon the level of agreement between annotators. To ensure reliability, a minimum of four overlapping annotations are mandated for each object-attribute pair, enabling the calculation of inter-annotator agreement. Inter-annotator agreement is calculated as the percentage of annotators who agreed on the majority response of each object-attribute task. A stringent filtering threshold of 0.75 is applied, necessitating at least three out of the four annotations to exhibit agreement. Additionally, annotations with extreme Likert scores (either 1 or 3) are preserved, thereby eliminating ambiguous responses and maintaining clarity in the dataset.
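A minimal sketch of this filtering rule is given below, assuming each object-attribute pair maps to its list of 3-point Likert annotations; the example data, and the reading that the majority response must itself be an extreme score, are illustrative assumptions.

from collections import Counter

raw_annotations = {("sponge", "softness"): [3, 3, 3, 3],
                   ("mug", "malleability"): [1, 1, 2, 1],
                   ("ball", "elasticity"): [2, 3, 1, 2]}

def keep_pair(annotations):
    # Require >= 4 overlapping annotations, >= 0.75 agreement on the majority
    # response, and a majority response at an extreme of the Likert scale.
    if len(annotations) < 4:
        return False
    majority, count = Counter(annotations).most_common(1)[0]
    return count / len(annotations) >= 0.75 and majority in (1, 3)

repository = {pair: votes for pair, votes in raw_annotations.items()
              if keep_pair(votes)}      # keeps the sponge and mug pairs, drops the ball pair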

Template Generation
Leveraging the NEWTON repository of object-attribute pairs, we can fill pre-defined templates. Each template is defined with associated object slots and attribute conditions. The condition statements are defined as {attribute, polarity, padding}, where attribute defines the physics attribute used for grouping, polarity defines whether to employ the highest or lowest extremity, and padding specifies how many objects of the opposite polarity to use. Each condition statement is designed to obtain a group of n objects, consisting of one object of the specified attribute and polarity, and n − 1 objects of the opposite polarity. Using the condition statements as a filter, we can obtain groups of object-attribute pairs which satisfy the given conditions and fill in the object slots, as shown in Figure 2.
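A minimal sketch of this grouping step is shown below, assuming the repository maps each object category to its 3-point Likert scores per attribute; the repository contents, the use of scores 1 and 3 as the two extremes, and the template wording are illustrative placeholders.

import random

repository = {"sponge":     {"softness": 3, "brittleness": 1},
              "wine glass": {"softness": 1, "brittleness": 3},
              "towel":      {"softness": 3, "brittleness": 1},
              "hammer":     {"softness": 1, "brittleness": 1}}

def sample_group(attribute, polarity, padding):
    # One object at the requested extreme plus `padding` objects of the opposite polarity.
    high = [o for o, attrs in repository.items() if attrs.get(attribute) == 3]
    low = [o for o, attrs in repository.items() if attrs.get(attribute) == 1]
    targets, distractors = (high, low) if polarity == "high" else (low, high)
    return random.choice(targets), random.sample(distractors, padding)

target, distractors = sample_group("brittleness", polarity="high", padding=3)
options = random.sample([target] + distractors, 4)   # shuffle the filled object slots
question = "Which of the following objects is the most brittle? Options: " + ", ".join(options)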

NEWTON Benchmark
We introduce the NEWTON benchmark, a tool to assess the cognitive ability of language models to understand and reason about physical attributes of everyday objects. The NEWTON benchmark comprises three progressively challenging tracks, with a combined 160K questions covering 700+ objects and 8 unique attribute categories.

Tasks
The NEWTON benchmark has 160K questions distributed over three reasoning tracks, namely Foundational Attribute Comprehension, Explicit Application, and Implicit Scenario-Based Analysis. These tracks are selected to align with facets of Bloom's cognitive taxonomy: comprehension, application, and analysis (Adams, 2015).
The underlying task involves multiple choice question answering: given a query q and up to four candidate choices c_1, ..., c_4, the language model must select the correct option c_correct, of which there is exactly one for any given query.
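A hedged sketch of the corresponding evaluation loop follows; the prompt layout, the first-letter answer parsing, and query_model() are illustrative assumptions rather than the exact harness used for the benchmark.

def build_prompt(query, choices):
    letters = "abcd"
    options = "\n".join(f"{letters[i]}) {c}" for i, c in enumerate(choices))
    return f"{query}\n{options}\nAnswer:"

def evaluate(questions, query_model):
    correct = 0
    for q in questions:
        reply = query_model(build_prompt(q["query"], q["choices"])).strip().lower()
        predicted = reply[0] if reply else ""                 # take the leading option letter
        if predicted == "abcd"[q["choices"].index(q["answer"])]:
            correct += 1
    return correct / len(questions)

# Example with a stub model that always answers "a":
accuracy = evaluate([{"query": "Which object is the softest?",
                      "choices": ["pillow", "rock", "knife", "mug"],
                      "answer": "pillow"}],
                    query_model=lambda prompt: "a")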
Foundational Attribute Comprehension. The first step to understanding concepts in an object-centric manner is to make the connection between objects and their attributes. This one-dimensional reasoning is the core assessment strategy within the Foundational Attribute Comprehension track, where each question involves understanding a single object-attribute pair. This track serves as a means to gauge the disparities in the distribution of comprehension between humans and language models regarding object attributes. Questions in this track closely mirror those presented to human annotators, as shown in Table 1, with minimal adjustments made to accommodate the diverse prompting formats required for different models.
Explicit Application. The ability to apply knowledge of object attributes in reasoning-type tasks is crucial for language-model integration into downstream tasks. Hence, the explicit application track aims to evaluate language models' capacity to effectively apply their understanding of object attributes in explicit reasoning tasks. Through a combination of Boolean and multiple choice questions, language models must reason about the correctness of statements concerning individual objects or pairs of objects, as shown in Table 1.
Implicit Scenario-based Analysis. This track assesses the language model's aptitude for reasoning in scenario-based tasks where the attribute to be inferred is not explicitly mentioned. Each prompt presented within this track has two components: context and question. The context serves as a description of the scenario, and implicitly highlights the attribute(s) of focus. The question presents candidate objects, one of which is the correct answer. We show example templates in Table 1.

Statistics
In total, the NEWTON benchmark consists of 160K questions distributed over the three tasks of Foundational Attribute Understanding, Explicit Application, and Implicit Scenario-Based Analysis. We provide a visualization of the distribution and analysis of the data in the Appendix. In comparison to other datasets and benchmarks which have examined the topic of physical reasoning, NEWTON differs in its object-centric focus, providing unmatched diversity and scale, as shown in Table 2.

Synthesis of Additional Challenge Sets
The NEWTON repository of object-attribute pairs can be used to synthesize additional challenge sets, as seen in Figure 3. The process begins with Context and Attribute Specification, where users identify a context and the relevant attributes. Next, object filtering uses the identified attribute(s) to automatically divide objects into a positive set and a negative set. Using the grouped objects, one can specify query templates and automatically populate them to synthesize a diverse, customized challenge set. Language models can then be evaluated for accuracy in the specific user-identified context to find the optimal prompting strategy and model. Examples of the generated dataset are shown in the Appendix.
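As a hedged end-to-end sketch of this synthesis procedure, the snippet below populates a user-defined template from positive and negative object sets; the packing scenario, object lists, and template wording are hypothetical examples rather than templates from the released benchmark.

import random

positive = ["wine glass", "ceramic plate"]                # high brittleness (illustrative)
negative = ["towel", "sponge", "wooden spoon", "pillow"]  # low brittleness (illustrative)

context = "You are packing a moving box and want to wrap the most fragile item first."
template = "{context} Which object should be wrapped first? Options: {options}"

challenge_set = []
for target in positive:
    distractors = random.sample(negative, 3)
    options = random.sample([target] + distractors, 4)    # shuffle answer order
    challenge_set.append({"question": template.format(context=context,
                                                      options=", ".join(options)),
                          "answer": target})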

Results
In this section, we evaluate the performance of state-of-the-art models on NEWTON. Specifically, we quantitatively analyze the performance of the models on the tasks of foundational attribute comprehension, explicit application, and implicit analysis. We also qualitatively examine the patterns in errors made by the models.

Experimental Design
Query Templates. We make minimal changes to the prompts between evaluations of different models. Aside from the particular formatting requirements necessary for inference by different models, we make no changes to the format of the prompt. Prompt structures for different families of models are illustrated in the Appendix.
Models. We provide an outline of the various models benchmarked with NEWTON in Table 3. We consider several families of large-scale pre-trained models, fine-tuned on different instruction or question answering datasets.
Metrics. Two metrics are used in the evaluation of LLMs' performance across the three benchmark tracks: Agreement (%) and Accuracy (%). Agreement is used to evaluate Track 1, while accuracy is used to evaluate Tracks 2 and 3.
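One formulation consistent with this description, assuming the weight for question i is the fraction of its annotator pool H_i that agrees with the majority response (with N the total number of questions), is

Agreement = \frac{100\%}{N} \sum_{i=1}^{N} \mathbf{1}\left[R_{LM}^{(i)} = R_{HM}^{(i)}\right] \cdot \frac{1}{|H_i|} \sum_{h \in H_i} \mathbf{1}\left[R_{H}^{(h,i)} = R_{HM}^{(i)}\right], \qquad Accuracy = \frac{100\%}{N} \sum_{i=1}^{N} \mathbf{1}\left[R_{LM}^{(i)} = R_{HM}^{(i)}\right],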
where R_LM^(i) denotes the response from the language model for question i, R_HM^(i) denotes the majority human response, and R_H^(h,i) denotes the response of human annotator h. The agreement metric regards the human agreement percentage as an upper limit and adjusts the calculated accuracy based on this percentage. This adjustment places a higher emphasis on questions with substantial human agreement, while assigning comparatively less significance to questions where human annotator responses exhibit greater diversity. This metric aids in evaluating the extent to which language model responses align with those of humans.
On the other hand, the accuracy metric considers the human-majority response as the definitive label and thus has a maximum attainable value of 100%. This metric gauges the proportion of responses that align with the majority-voted human response.

NEWTON Benchmark as a Diagnostic for Knowledge on Physical Attributes
The three-track setup of the NEWTON benchmark enables analysis of language models' ability to comprehend, apply, and analyze physics attributes of everyday objects. Through a quantitative analysis, we draw several insights.
Language models have a non-uniform comprehension of physics concepts of objects. The foundational attribute understanding task requires classification of objects based on physical attributes. Through this task, as shown in Table 4, we observe that this comprehension is uneven across attributes and models.
Some language models are capable of decision-making in implicit scenario-based tasks. In Table 6, we present a quantitative evaluation of language models across nine scenario templates. The results show that GPT-4 (OpenAI, 2023) consistently outperforms other models by a significant margin in the majority of scenarios. It achieved an impressive overall average of 87.7% across the defined scenarios, compared to the average of 44.5% for all the evaluated models. However, it is important to note that GPT-4 struggles to provide consistently accurate responses in certain scenarios, particularly in Scenario 8. This indicates the need for evaluation schemes that carefully consider the specific deployment situations in which language models will be utilized.

Ablative Studies
In this section, we provide an analysis of NEWTON, focusing on potential ways of leveraging NEWTON to enhance model performance in a physical reasoning context, and examining the consistency of LLMs with regard to model size, question polarity, and answer positioning.

Table 4. Track 1: Foundational attribute comprehension results for various language models. We report the agreement percentage, computed as the percentage of responses which agree with the majority-voted human response, weighted by the inter-annotator agreement. We also provide the overall averaged agreement across language models and across attributes. In addition, we report the inter-annotator agreement average for the listed attributes for reference.

Fine-tuning using NEWTON. Aside from using NEWTON to create evaluation tools, it serves as a resource for fine-tuning pre-trained LMs, improving their grasp of physical concepts. We experiment with fine-tuning on Track 2 questions that explicitly address object attributes, then evaluate on Track 3, which involves implicit attribute reasoning. Track 2 and Track 3 questions, detailed in Table 1, are distinct. Fine-tuning focuses on multiple-choice tasks using the base BERT model (Devlin et al., 2019), initially trained on SWAG (Zellers et al., 2018). We fine-tune with subsets of NEWTON's Track 2 containing 5k, 10k, and 40k samples. Models are then assessed on Track 3's implicit reasoning using NEWTON. Figure 4A reveals a significant enhancement in language models (e.g., BERT) when NEWTON is included in training, with performance increasing as the number of fine-tuning samples rises. This underscores NEWTON's potential for improving LLMs' physical reasoning through fine-tuning.
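As a hedged illustration of this fine-tuning setup, the sketch below pairs a Track 2 question with its candidate answers and trains a multiple-choice head with Hugging Face Transformers; the bert-base-uncased checkpoint, the example question, and the single optimization step are placeholders (the setup above starts from a SWAG-trained BERT and uses 5k-40k samples).

import torch
from transformers import AutoTokenizer, BertForMultipleChoice

# Placeholder checkpoint; the setup described above starts from a BERT model
# already fine-tuned on SWAG.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")

def encode(question, choices):
    # Pair the question with every candidate answer; the resulting tensors have
    # shape (1, num_choices, seq_len), as expected by BertForMultipleChoice.
    enc = tokenizer([question] * len(choices), choices, padding="max_length",
                    truncation=True, max_length=64, return_tensors="pt")
    return {k: v.unsqueeze(0) for k, v in enc.items()}

example = {"question": "Which of the following objects is the most brittle?",
           "choices": ["wine glass", "rubber band", "sponge", "towel"],
           "label": 0}                          # index of the correct choice

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
inputs = encode(example["question"], example["choices"])
labels = torch.tensor([example["label"]])
loss = model(**inputs, labels=labels).loss     # cross-entropy over the four choices
loss.backward()
optimizer.step()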
Language Model Size and Impact on Performance. To assess the influence of model size on performance, we focus on the Flan-T5 series and analyze the performance of the small, base, large, and XL model sizes, as shown in Table 7. We observe that, in general, larger model sizes lead to improved performance for Tracks 2 and 3. However, for Track 1, the Flan-T5-Base model demonstrates the best performance. This inconsistency is likely attributed to the nature of the questions, as those in Track 1 are more descriptive in nature, while questions in Tracks 2 and 3 are more concise. The contrasting question styles likely account for the varying outcomes observed across different model sizes, as certain models may excel in handling longer, more detailed queries while others excel in providing responses to shorter, more focused questions.
Polarity and Position Bias in Language Models. We explore the impact of question polarity on model performance using the NEWTON benchmark, Track 2. Prompting questions are categorized into Positive and Negative polarity, where Positive polarity questions include phrases like "is," "is more," and "is the most," while Negative polarity questions involve phrases like "is not," "is less," and "is the least" (as outlined in Table 1). Conducting a t-test reveals that GPT-4 and Dolly-V2-7B show no significant difference between the means of the different polarity groups, while for other models, there is a statistically significant difference (p < 0.05) between the means of the different polarities, indicating the presence of a polarity bias, as shown in Figure 4B. Additionally, we investigate the presence of bias based on answer position by grouping questions according to the position of the ground-truth answer, as shown in Figure 4C. Among the models, UnifiedQA demonstrates the smallest difference in accuracy between the highest- and lowest-scoring answer positions, as quantified by a t-statistic of 15.0 and p < 0.0001. On the other hand, Flan-Alpaca-GPT4-XL exhibits the largest difference between the highest and lowest position accuracy, quantified by a t-statistic of 170.1 and p < 0.0001. The observation that models exhibit inconsistencies in accuracy by position is also highlighted in (Aroca-Ouellette et al., 2021).
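The polarity comparison above can be reproduced with a standard two-sample t-test; in this illustrative sketch the per-question correctness values are made-up placeholders rather than the benchmark's actual results.

from scipy import stats

# 1 = model answered the question correctly, 0 = incorrect (placeholder data).
positive_polarity = [1, 1, 0, 1, 1, 1, 0, 1]
negative_polarity = [0, 1, 0, 0, 1, 0, 0, 1]

t_stat, p_value = stats.ttest_ind(positive_polarity, negative_polarity)
if p_value < 0.05:
    print(f"Significant polarity gap: t = {t_stat:.1f}, p = {p_value:.4f}")
else:
    print("No statistically significant difference between polarity groups")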
Effect of Prompt Engineering. We analyze the influence of prompt engineering on both GPT-Turbo and GPT-4. Table 8 illustrates our exploration of five distinct approaches to formulating prompts. These strategies are denoted as V1, V2, and V3, representing three different versions of prompts distinguished by their word choices. The Ensemble method amalgamates the outcomes from V1, V2, and V3, using a majority vote to determine the final answer. Meanwhile, the Instruction strategy supplements each prompt with a set of instructions, comprising eight example questions and corresponding answers that cover diverse attributes. Notably, these examples do not overlap with any of the tested questions. The Ensemble and Instruction-based prompting techniques enhance performance for Boolean-style questions. However, performance remains comparable or, in some cases, even less favorable for Multiple Choice-style questions. These findings underscore the significance of identifying an appropriate prompt and crafting variations that are attuned to diverse question styles.

Table 5. Track 2: Explicit application evaluation results on various LLMs. We separate the questions into two streams: Boolean, which consists of True/False style questions, and Multiple Choice, which consists of QA-style questions with four answer choices. We report the model accuracy across each stream, as well as the combined accuracy. For each stream, we report an averaged accuracy percentage across all models for each physical reasoning attribute. We also report an averaged accuracy percentage across all attributes, to gauge the overall understanding of language models across all physical reasoning attributes.

Table 6. Implicit scenario-based analysis. We present accuracies of language models on the nine scenario-based tasks from the NEWTON benchmark. Scenarios 1 to 9 represent a range of scenario-based questions, ranging from arrangement, to tool use, to safety. We provide an average across each scenario as a measurement of the overall scenario complexity, as well as an average encompassing the overall performance of the model across the nine given scenarios.

Related Works
Evaluation of Reasoning Abilities. In the past year, significant progress has been made in exploring Natural Language Processing (NLP) developments attributable to the advent of large language models. Prior studies like Bakhtin et al. (2019) focused on basic physical reasoning without language. Before the LLM surge, efforts were made to link vision and language (Das et al., 2017; Gordon et al., 2018; Shridhar et al., 2020; Duan et al., 2020).
Most of these were in embodied contexts, where agents answered questions by exploring environments (Das et al., 2017; Gordon et al., 2018), performed tasks with sub-steps (Shridhar et al., 2020), or rearranged scenes given instructions (Szot et al., 2022). Unlike these tasks, which focus on visual and semantic attributes, NEWTON challenges language models with questions about explicit/implicit physical object properties, some beyond visual inference. More recently, LLMs are being explored for grounding robotic manipulation, often generating code or instructions for actions/skills (Liang et al., 2022; Ahn et al., 2022). Yet, there is a gap in using LLMs for object-centric physical reasoning in manipulation, which stems from limited evidence of LLMs' grounding abilities. Our work focuses on this by exploring and evaluating such abilities in physically grounded settings.

Conclusion
In this work, we present NEWTON, a repository, pipeline, and benchmark to support the assessment and refinement of LLMs in the context of physical attribute understanding of everyday objects.The NEWTON repository offers a large collection of object-attribute pairs, enabling the generation of infinite test scenarios for evaluating language models, using the NEWTON pipeline.
To demonstrate its usefulness and potential, we introduce the NEWTON benchmark, which consists of 160K questions involving 700+ objects and 8 attributes. These questions cover three distinct tracks: fundamental object understanding, explicit application, and implicit analysis. NEWTON is built to enable comprehensive and automatic evaluation of language models across an array of scenarios, and to support the reliable incorporation of LLMs into physically grounded contexts and applications.

A Appendix
A.1 NEWTON Benchmark Statistics
NEWTON has three main tracks: fundamental attribute understanding, explicit application, and implicit scenario-based analysis. The three tracks comprise a total of 160K questions, covering 700+ objects and 8 attributes. Each object is annotated with 1-8 object-attribute tags. We present visualizations of the data statistics for Tracks 1, 2, and 3 in Figures 5, 6, and 7, respectively. Figure 5 illustrates the data distribution of questions and attributes, the total number of tokens per attribute, and the number of attributes remaining after filtering for each object category. Figure 6 illustrates the data distribution for the explicit application track, and includes an analysis of the distribution of the top 100 highest-occurring categories, the percentage distribution of different attribute occurrences within the track, counts of question polarity and type, and the total tokens for the questions in each attribute. Figure 7 illustrates the data distribution for the implicit scenario-based analysis track, and includes the percentage fraction of different scenarios and the total number of tokens for questions in each scenario. Taking a closer look, we can see the initial set of attributes has a bias towards the elasticity, sharpness, softness, and malleability attributes, due to the process of filtering inconsistencies. To ensure a balanced dataset, template formation for Track 2 questions involved an additional re-sampling step to ensure the questions cover the attributes uniformly, which is why Track 2 questions are more uniform in nature. Track 3 scenarios are manually designed, and hence data balance was not a primary focus.

A.2 Detailed Results for NEWTON Benchmark Track 2
Track 2 involves a mixture of question types (Boolean, multiple choice) and questions which differ by polarity (positive, negative). We provide a detailed breakdown of results by polarity for models evaluated on the NEWTON Benchmark, Track 2, in Table 9.

A.3 Qualitative Results
We present an example of model outputs for Foundational Attribute Understanding (Track 1) in Figure 8. Note that the common mistakes which language models make include (1) a tendency to choose the middle option, (2) hallucination, and (3) inaccurate understanding. In Figure 9, we examine the common failure modes of language models for a set of 10 questions from Track 2: Explicit Attribute Understanding. We find that for most models, the failure mode occurs due to the tendency to hallucinate. For instance, in Figure 9, for questions 6, 7, and 8, Dolly fails to provide an answer which reflects the given options.

A.4 Query Templates for Language Models
Different models require different variations of prompt input, and an unsuitable prompt could lead to an inaccurate assessment of the capability of language models. Hence, we adapt the question template for different families of models to adhere to the most effective prompting strategy, as shown in Figure 10.

A.5 Synthesizing Additional Challenge Sets with NEWTON
We show a snapshot of the dataset generated using the procedure for synthesizing additional challenge sets in Figure 11. Note that all objects also have corresponding 3D models and 2D thumbnails, as shown in Figure 12, should the user wish to extend the dataset to a vision-language setting.

A.6 Comparing Agreement and Accuracy Scores
In Table 10, we show the performance of various models when evaluated using the accuracy measure. In comparison to Table 4, which provides the quantitative results using the agreement measure, there is an overall increase in the absolute percentages, since the upper threshold for the maximum attainable value has increased from the human agreement percentage to 100%. However, we note that the conclusions mentioned in Section 5.2 remain the same.

A.7 Dataset Examples
In this section, we provide a snapshot of the dataset, as shown in Figure 12. Each generated question is tagged with corresponding choices, a ground-truth response, question polarity, and relevant 2D RGB thumbnails and 3D object models. While NEWTON is designed for prompting and assessing language models, potential extensions could also leverage the paired RGB and 3D models to assess visually grounded models.

Table 9. Track 2: Explicit application evaluation results on various language models. We separate the questions into two streams: Boolean, which consists of True/False style questions, and Multiple Choice, which consists of QA-style questions with four answer choices. We report the model accuracy across each stream, as well as the combined accuracy, separated by the polarity of the question. For each stream, we report an averaged accuracy percentage across all models for each physical reasoning attribute. We also report an averaged accuracy percentage across all attributes, to gauge the overall understanding of language models across all physical reasoning attributes.

Table 10. Accuracy Performance of Models on Foundational Attribute Comprehension. We report the accuracy percentage, computed as the percentage of responses which agree with the majority-voted human response. The upper threshold is 100%, which represents perfect overlap with the ground truth (majority-voted human response).

Figure 1. As works begin leveraging LLMs in physically grounded contexts, it is crucial to understand whether such models possess the ability to reason about everyday scenarios. NEWTON, a repository, pipeline, and benchmark, facilitates evaluation of various LLMs in a physical reasoning context.

Figure 3. Synthesis of additional challenge sets. Using NEWTON's broad object-attribute combinations, users can design custom challenge sets for their needs. The process starts with Context and Attribute Specification. Objects are then automatically filtered based on chosen attributes into positive and negative sets. With these, users can quickly create tailored challenge templates. This helps evaluate language models in user-defined contexts, optimizing prompt strategies and model selection.

Figure 4. Ablations. From left to right: A) BERT fine-tuning results using NEWTON; note the increase in accuracy on the unseen implicit questions after fine-tuning on NEWTON using samples of 5,000, 10,000, and 40,000, respectively. B) Accuracy by question polarity, where positive polarity represents questions phrased with is, is more, and is the most, while negative polarity represents questions phrased with is not, is less, and is the least. C) Accuracy by position, where the position value indicates the placement of the correct answer within the sequence of possible options in the question template.

Figure 5. NEWTON Benchmark Track 1 data statistics. We highlight the data distribution of questions and attributes, the total number of tokens per attribute, and the number of attributes remaining after filtering for each object category.

Figure 6. NEWTON Benchmark Track 2 data statistics. We examine the data distribution for the explicit application track, and illustrate the percentage distribution of different attribute occurrences within the track, counts of question polarity and type, and the total tokens for the questions in each attribute.

Figure 7. NEWTON Benchmark Track 3 data statistics. Graphs show the percentage fraction of different scenarios, and the total number of tokens for questions in each scenario.

Figure 8. Example of responses from NEWTON Track 1. Note that the query is marked in the yellow text box, and the correct answer to choose is marked with a green highlight. Responses from different LLMs are shown in blue text boxes, and correct responses are marked with a green check mark.

Figure 9. Success vs. Failure Examples. Typical failure modes of various models on a sample of NEWTON questions. Models typically fail due to hallucination and inaccurate understanding.

Figure 10. Prompting Templates. We show examples of the queries used for evaluating language models.

Figure 11. Snapshot of a generated challenge-set example. In this example, we demonstrate synthesis of a multiple choice challenge set. Each synthesized sample has context, question, and ground-truth annotations.

Figure 12. Dataset Sample. We provide a snapshot of the NEWTON Benchmark. Questions are populated automatically with meaningful object types. Each question is labelled with the attribute, object category, ground-truth object, question polarity, question type, and several candidate object options. Each candidate object also has several corresponding 3D object models, should the user wish to use NEWTON with perception in the loop.
a) Low: object can support very minimal weight before deforming or cracking. b) Moderate: object can withstand a small amount of weight, but something heavy can break the object. c) High: object can easily withstand more than a 10 kg weight.
Malleability: Can <MASK> be reshaped to other forms? a) No: object cannot be reshaped. b) Somewhat: object can be slightly reshaped. c) Yes: the object can be reshaped to most arbitrary forms.
Table 1. Question Templates. Styles of questions across the three tracks in the NEWTON benchmark. There are three tracks: foundational attribute comprehension, explicit application, and implicit scenario-based analysis. Track 1: Foundational attribute comprehension consists of questions identical to those used in the human-annotation process. Track 2: Explicit application consists of questions where the object attribute is mentioned explicitly in the query, formatted in Boolean or Multiple Choice style. Track 3: Implicit scenario-based analysis consists of implicit questions, where the attribute(s) of focus is not explicitly mentioned.

Table 2. Dataset Comparison. Comparison of NEWTON with two other benchmarks aimed at physics understanding. PIQA (Bisk et al., 2020) is a dataset aimed at physics common-sense reasoning, without a focus on object-centric attribute understanding. PROST (Aroca-Ouellette et al., 2021) tackles physical and affordance reasoning from an object-centric approach, using a small subset of objects. Our dataset examines the understanding of language models from a physical reasoning perspective, with a rich and diverse set of objects, attributes, and questions.

Table 3. Model Details. We provide the details of the models evaluated using NEWTON, including the name of the model, the number of parameters, the underlying foundation model, and the instruction or QA datasets used to fine-tune the foundation model. Note that dashes represent undisclosed details.

Table 7. Impact of Model Size on Performance. We provide a comparison of different sizes of the Flan-T5 models evaluated using the different tracks of the NEWTON benchmark.