Distilling ChatGPT for Explainable Automated Student Answer Assessment

Providing explainable and faithful feedback is crucial for automated student answer assessment. In this paper, we introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation. We identify the appropriate instructions by prompting ChatGPT with different templates to collect the rationales, where inconsistent rationales are refined to align with marking standards. The refined ChatGPT outputs enable us to fine-tune a smaller language model that simultaneously assesses student answers and provides rationales. Extensive experiments on the benchmark dataset show that the proposed method improves the overall QWK score by 11% compared to ChatGPT. Furthermore, our thorough analysis and human evaluation demonstrate that the rationales generated by our proposed method are comparable to those of ChatGPT. Our approach provides a viable solution to achieve explainable automated assessment in education. Code available at https://github.com/lijiazheng99/aera.


Introduction
Assessing students' responses is a critical component of the educational process. It helps educators evaluate the effectiveness of their teaching methods and provides students with valuable feedback to identify their strengths and weaknesses (Nicol and Macfarlane-Dick, 2006). Timely and detailed feedback can significantly enhance students' learning experiences and support their academic growth (Hattie and Timperley, 2007). However, evaluating and providing detailed feedback on student responses can be time-consuming, especially for large classes. Additionally, assessments conducted by different examiners may follow varying assessment standards, leading to inconsistencies in the evaluation process (Weigle, 2002). These challenges often necessitate a more efficient and consistent approach to student response assessment.

Figure 1: Text classifier-based automated student response assessment methods lack transparency, and their results are unexplainable. Text generation-based student response assessment with rationale generation provides solid evidence for a transparent assessment decision-making process and can be used for more responsive student feedback to improve learning quality.
Recent years have seen a surge in applying Pretrained Language Models (PLMs), such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2020), which have achieved state-of-the-art results in various natural language processing tasks, including sentiment analysis and topic classification. However, when applied to student response assessment, these models still suffer from a lack of transparency and trustworthiness, as they tend to produce predictions without offering an explicit rationale behind their decisions. This lack of transparency can be a significant drawback: students and educators cannot understand the reasoning behind the assessments, which limits the usefulness of the feedback and the models' applicability in educational settings.
Recent developments in Large Language Models (LLMs), such as GPT-3 (Brown et al., 2020) and ChatGPT¹ (Stiennon et al., 2020), have showcased remarkable capabilities not only in generating human-like responses but also in exhibiting advanced reasoning and inference abilities. These models have been utilized as "reasoning teachers" to fine-tune smaller language models, harnessing their superior inferential skills while maintaining computational efficiency (Ho et al., 2022; Lu et al., 2022). This approach can potentially overcome the limitations of traditional explanation techniques built on PLM-based text classifiers: LLMs can generate context-aware and nuanced reasoning in an understandable textual format without expensive human annotations. As a result, LLM-generated rationales for student response assessments offer a great opportunity to efficiently train smaller language models to provide accurate and understandable assessment feedback that is useful for both students and educators, ultimately enhancing the learning experience and assessment explainability.
In this paper, we aim to distill ChatGPT as a reasoning teacher to train a smaller language model that generates meaningful student response assessment rationales via zero/few-shot inference. We develop three prompt templates based on different reasoning levels to compare rationale quality and generate plausible explanations. Moreover, we use semantic confidence from LLMs to study the reliability of human labels and identify corrupted data, and we use LLM outputs as noisy labeled data to fine-tune a smaller language model for explainable student answer scoring. Extensive experimental results show that our method achieves strong student response assessment performance while providing transparent rationales. This method can not only improve trustworthiness when applying automated student response assessment systems in large-scale examinations but also shed new light on strategies to improve transparency in explainable text classification.
¹ https://chat.openai.com/chat

In summary, our contributions are: (1) We propose various prompt reasoning methods to explore the rationale generation capability of ChatGPT in student response assessment; (2) We adopt semantic confidence intervals and data augmentation methods to significantly improve assessment performance with rationale generation; (3) Comprehensive experiments show that our method is able to generate accurate rationales without requiring human-annotated rationales for model learning. To the best of our knowledge, our proposed method is the first to distill ChatGPT for learning a smaller language model for explainable student response assessment. We will release our code on GitHub.

Related Work
Our research is related to three topics in the literature: automated student answer assessment, rationale generation, and knowledge distillation.

Automated Student Answer Assessment
The development of automated student answer assessment systems has been an active area of research in educational technology. Early work in this area focused on employing traditional machine learning algorithms, such as Support Vector Machines and Naive Bayes, for automated essay scoring and short answer grading. With the advent of pretrained language models, such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), researchers have explored their use in various educational applications, including automated essay scoring (Filighera et al., 2022), question-answering (Lu et al., 2022), and providing feedback on student writing (Yannakoudakis et al., 2011). However, most existing approaches suffer from a lack of transparency, as they often produce predictions without providing an explicit rationale behind their decisions. Our work addresses this limitation by leveraging ChatGPT-generated rationales to train a smaller language model for explainable student answer assessment, enhancing the trustworthiness and utility of automated assessment systems.
Rationale Generation Rationale generation in NLP has gained increasing attention in recent years to enhance the interpretability and trustworthiness of NLP models (Lei et al., 2016; Yu et al., 2019). Rationales are concise, human-readable explanations for model predictions, which can be particularly useful for tasks such as sentiment analysis (Lei et al., 2016), machine translation (Wu et al., 2018), and document classification (Bastings et al., 2019). Methods for generating rationales can be broadly divided into two categories: extraction-based methods, which select a subset of input features as rationales (Lei et al., 2016), and generation-based methods, which synthesize rationales from scratch (Liu et al., 2019; Wiegreffe et al., 2021). Recent advances in rationale generation have leveraged techniques such as reinforcement learning (Yu et al., 2019) and pre-trained language models (Wiegreffe et al., 2021) to improve the quality and coherence of the generated explanations. However, challenges still remain in evaluating the quality of generated rationales.
Knowledge Distillation Knowledge distillation (Hinton et al., 2015) is a technique for training smaller models (student models) to mimic the behavior of larger, more complex models (teacher models). This process involves transferring knowledge from the teacher model to the student model, allowing the latter to achieve comparable performance with a lower computational cost. Knowledge distillation has been used for various tasks, including image classification (Zagoruyko and Komodakis, 2016; Romero et al., 2014), natural language processing (Jiao et al., 2020), and speech recognition (Ba and Caruana, 2014). The technique has been further refined by introducing approaches such as attention transfer (Zagoruyko and Komodakis, 2016), which aims to align intermediate representations between teacher and student models, and Born-Again Neural Networks (Furlanello et al., 2018), which involve multiple iterations of distillation for improved performance.
In the context of our work, we distill the knowledge from ChatGPT by leveraging its generated rationales to fine-tune a smaller language model for explainable answer scoring. This approach enables us to harness the power of large-scale pre-trained models while maintaining the efficiency of smaller models.
3 Generating Free-form Rationales for Text Classification via Reasoning Teacher Distillation

The adoption of PLMs in various downstream applications, including student answer assessment, has been hindered by concerns surrounding interpretability. Existing text classification methods struggle to provide meaningful feedback to students, making them less useful for educational purposes. Moreover, the scarcity of datasets annotated with marking rationales and the high costs associated with human annotation create barriers to developing rationale generation approaches.
To address these challenges, we introduce an LLM-based, in-context learning framework, AERA (Automated Explainable Student Response Assessment), which enables fine-tuning small language models for explainable student answer scoring. Our approach consists of three main stages: (1) using ChatGPT for zero-shot or few-shot rationale generation to create an initial set of rationales; (2) incorporating confidence intervals and predicted results for data selection and further rationale augmentation; and (3) demonstrating how these generated rationales, despite the presence of noise, can effectively fine-tune small language models for efficient and plausible rationale generation.

Notations
In a standard short student response assessment problem, the dataset consists of the question Qu, key elements Ke, rubric Ru, student answers X, and corresponding scores Y. Each element in X × Y is an i.i.d. realization of random variables, denoted as $(x_i, y_i)$, $i = 1, 2, ..., N$. The dataset also includes training materials for assessor training, containing sample student answers across a range of scores accompanied by simple rationales. We represent these student answers from the training materials, which are disjoint from the dataset, as $x_j$, their scores as $y_j$, and their rationales as $r_j$, where $j = 1, 2, ..., E$ and $E$ is the number of human-annotated fine-grained examples. We denote the predicted score and rationale as $\hat{y}_i$ and $\hat{r}_i$, respectively.
We use the following notations to describe the rationale generation process:

• I → O, which maps inputs to outputs.
• I → OR, which maps inputs to outputs and rationales.
• IO → R, which maps inputs and outputs to rationales.
In the rest of the paper, we utilize green font to differentiate the few-shot examples and blue font to emphasize texts predicted by the language model.

ChatGPT for Rationale Generation
Leveraging ChatGPT's remarkable in-context learning ability across various tasks, we employ ChatGPT as a reasoning teacher to generate free-form rationales for text classification tasks. In education, student answer assessments usually require comprehensive steps and a complex reasoning cycle rather than simply comparing the student's answer with the key elements. To evaluate the robustness and generalisability of ChatGPT on rationale generation, as shown in Table 1, we gradually lower the reasoning difficulty by providing more information in the prompt template. As noted by Brown et al. (2020), in some cases it may even be difficult for humans to understand the task format without prior examples; therefore, the example-instructed prompt template is the easiest setting in our method. We categorize rationale reasoning into three types: goal-oriented reasoning with simple instruction, goal-oriented reasoning with comprehensive instruction, and example-oriented reasoning. These can then be formatted into two types of scenarios:

Zero-shot Rationale Generation We first design a goal-oriented I → OR prompt with a simple instruction, where the language model is asked to provide the desired output without specifying the reasoning process. This template examines the reasoning cycle from planning to execution and output prediction, and begins with the field [Question]: <Qu>. Previous research (Marasovic et al., 2022) suggests that more comprehensive prompt instructions can help large language models (LLMs) achieve better inference. As a result, we also design a goal-oriented I → OR template with a more elaborate instruction. Instead of creating the reasoning process from scratch, this template offers the necessary reasoning cycle for assessing student responses, primarily focusing on examining the reliability of the procedure execution.

Few-shot Rationale Generation Although ChatGPT has demonstrated exceptional reasoning capabilities, some research indicates that unexpected prediction behaviours in free-form text generation, such as hallucinations, might affect the generated rationale's accuracy and quality (Bang et al., 2023). Thus, we propose a few-shot prompt template to investigate the ability of LLMs to recognize the rationale generation pattern from the given examples.

Evaluating the correctness of rationales generated by both zero-shot and few-shot methods is challenging since the only true label available is the score. Therefore, we follow previous approaches (Ho et al., 2022) and assume the rationale supports the score if the ChatGPT-predicted answer score is correct. Extracting scores from zero-shot outputs is difficult because the rationale and predicted score are mixed together, so we manually extract the scores for zero-shot outputs and use regular expressions for few-shot output score extraction.
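To make the few-shot (example-oriented) setting concrete, the sketch below shows how such a prompt could be assembled and sent to ChatGPT, and how a score could be pulled from the free-form output with a regular expression. It is a minimal illustration, not the exact template from Appendix B: the field wording, the `few_shot_examples` structure, and the pre-1.0 `openai` client interface are assumptions.

```python
import re
import openai  # assumes openai.api_key has been set; pre-1.0 client interface

def build_prompt(question, key_elements, rubric, few_shot_examples, student_answer):
    """Assemble an example-oriented prompt from the dataset fields (illustrative wording)."""
    parts = [f"[Question]: {question}",
             f"[Key Elements]: {key_elements}",
             f"[Marking Rubric]: {rubric}"]
    for answer, score, rationale in few_shot_examples:  # demonstrations from the training materials
        parts.append(f"[Student Answer]: {answer}\n[Assessment]: {score} points; {rationale}")
    parts.append(f"[Student Answer]: {student_answer}\n[Assessment]:")
    return "\n\n".join(parts)

def generate_rationales(prompt, n_samples=1):
    """Query gpt-3.5-turbo and return n sampled completions (score plus rationale text)."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        n=n_samples,
    )
    return [choice["message"]["content"] for choice in response["choices"]]

def extract_score(output_text):
    """Few-shot outputs follow the demonstrated '<score> points;' pattern."""
    match = re.search(r"(\d+)\s*point", output_text)
    return int(match.group(1)) if match else None
```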

Tabular Data Transformation
Some datasets contain tabular data, a structured format that is challenging to input into language models. As shown in Figure 3, we leverage ChatGPT's table-understanding capability to first create table descriptions based on a text-based table, and then verify its understanding by having ChatGPT output a table based on the provided description². All the tables in our experiments could be correctly recovered from the table descriptions by ChatGPT; we therefore adopted those table descriptions to replace all the structured tabular data in our prompts.
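The round-trip check described above can be sketched as follows, reusing the hypothetical `generate_rationales` helper from the previous snippet; the prompt wording is an assumption, not the paper's exact instruction.

```python
def table_to_description(table_text):
    # Ask ChatGPT to verbalize the table so the description can stand in for the raw tabular data.
    prompt = ("Describe the following table in plain text so that the table could be "
              "reconstructed exactly from your description:\n\n" + table_text)
    return generate_rationales(prompt)[0]

def round_trip_check(table_text):
    # Ask ChatGPT to rebuild the table from its own description, then compare with the original.
    description = table_to_description(table_text)
    prompt = "Reconstruct the original table from this description:\n\n" + description
    recovered_table = generate_rationales(prompt)[0]
    return description, recovered_table
```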

Semantic Confidence Interval
ChatGPT, being a non-deterministic language model, generates outputs that can vary with each iteration. We utilize the semantic uncertainty method outlined by Kuhn et al. (2023) to calculate confidence intervals for rationales. We treat generated rationales $\hat{r}_i$ that share the same predicted score $\hat{y}_i$ as semantically equivalent. The probability of each semantically similar group $s$ can then be represented as $p(s) = \frac{|s|}{|S|}$, where $S$ is the set of all occurrences of $\hat{y}_i$.

Through our experiments, we demonstrate that incorporating this confidence interval helps reduce the aleatoric uncertainty present in the dataset. This approach helps identify corrupted input data and human label errors, thus improving the overall data quality.
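A minimal sketch of this confidence estimate, assuming (as in the definition above) that the group probability is the relative frequency of each predicted score among the sampled rationales; `extract_score` is the hypothetical helper from the earlier snippet.

```python
from collections import Counter

def semantic_confidence(sampled_rationales, extract_score):
    """Group sampled rationales by predicted score and return p(s) = |s| / |S| per group."""
    scores = [extract_score(r) for r in sampled_rationales]
    counts = Counter(scores)
    total = len(scores)
    return {score: count / total for score, count in counts.items()}

# e.g. five samples all predicting a score of 2 -> {2: 1.0}, a 100%-confident prediction
```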

Further Rationale Refinement
Since I → OR formatted rationale generation cannot guarantee that the predicted label $\hat{y}_i$ matches the gold label $y_i$, we propose a rationale refinement strategy in which the input is given together with the correct score and the model is asked to generate the rationale: IO → R. This prompt template is similar to the example-oriented one, but the gold score is provided in the template.
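A possible shape of this IO → R refinement prompt is sketched below; the field names mirror Appendix B, while the closing instruction is an assumed paraphrase rather than the paper's exact wording.

```python
def build_refinement_prompt(question, key_elements, rubric, student_answer, gold_score):
    # The gold score is supplied, so ChatGPT only needs to justify it against the rubric.
    return (f"[Question]: {question}\n\n"
            f"[Key Elements]: {key_elements}\n\n"
            f"[Marking Rubric]: {rubric}\n\n"
            f"[Student Answer]: {student_answer}\n\n"
            f"This answer received {gold_score} points. Explain, based on the key elements "
            f"and the marking rubric, why this score was awarded.")
```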

Optimizing Student Model for Efficient Rationale Generation
Although LLMs have exhibited impressive inference and in-context learning capabilities, concerns regarding the accuracy and faithfulness of student response assessments persist due to occasionally unexpected outputs. Furthermore, the substantial running costs³ render LLMs less practical for real-world student response assessment. Consequently, we propose using ChatGPT-generated rationales to fine-tune smaller language models (e.g., T5 models) for efficient rationale generation. Initially, we utilize the free-form rationales generated by ChatGPT, as described in §3.2, and clean the data using the semantic confidence interval outlined in §3.2.2. Since no reliable automatic metrics exist for evaluating free-form generated rationales without human annotations, we assume the generated rationales support the assessed score and filter out rationales with incorrectly predicted scores. However, this may result in a data scarcity issue after removing those incorrect predictions. To address this problem, we employ the rationale refinement method detailed in §3.2.3.

Our fine-tuning process takes an input and generates an output with a rationale, I → OR. Due to the lack of automated evaluation metrics relating O and R, we use sacreBLEU to evaluate the rationale quality on the validation set and choose the best checkpoint. The prompt template resembles the example-oriented template, but without the few-shot examples.
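A condensed sketch of this distillation step with Hugging Face Transformers, under the setup reported in Appendix A.1 (long-t5-tglobal-large, 30 epochs, batch size 8) and using sacreBLEU on the validation set for checkpoint selection. The dataset objects `train_ds` and `dev_ds` stand for tokenized (prompt, "score; rationale") pairs built from the filtered ChatGPT outputs and are assumptions here.

```python
import evaluate
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-large")
bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    # Decode generated rationales and references, then score with sacreBLEU.
    preds, labels = eval_preds
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = [[tok if tok != -100 else tokenizer.pad_token_id for tok in seq] for seq in labels]
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    score = bleu.compute(predictions=decoded_preds,
                         references=[[ref] for ref in decoded_labels])["score"]
    return {"sacrebleu": score}

args = Seq2SeqTrainingArguments(
    output_dir="aera-t5",
    num_train_epochs=30,
    per_device_train_batch_size=8,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
    load_best_model_at_end=True,
    metric_for_best_model="sacrebleu",
)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=train_ds, eval_dataset=dev_ds,
                         tokenizer=tokenizer, compute_metrics=compute_metrics)
trainer.train()
```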
Evaluation Metric We adopt Accuracy (Acc), macro F1 score (F1), and Quadratic Weighted Kappa (QWK) to evaluate the classification performance. We use sacreBLEU to evaluate rationale generation during validation and select the best checkpoint.
We provide a detailed dataset description, baseline implementations, the QWK implementation, and the hyperparameter setup in Appendix A.1.

Overall Comparison
Table 2 displays the performance of student answer assessment across three task scenarios: fine-tuned text classification, zero/few-shot prompting, and fine-tuned text generation. For the text classification baselines, comparing BERT and Longformer, we observe that using a model that can accommodate longer input texts improves performance when trained solely on student answers. However, when incorporating additional information (e.g., Question, Key Elements, and Rubric) into the input alongside student answers (Longformer-all), we do not see an improvement in overall performance. Instead, three out of four datasets experience a performance decline, suggesting that these resources may be treated as noise and not utilized in the student assessment process. Moreover, text classifier-based student response assessment may not be reliable since it cannot offer explanations of its scoring decisions.
For results generated by ChatGPT, we observe that using a more elaborate instruction gives better results than a simple instruction. By providing some demonstration examples, even without elaborate instructions, we observe performance improvements in two out of four datasets, leading to the highest overall performance scores and the lowest variance across all four datasets. This shows the superiority of the few-shot setting (Example Instruction) over the zero-shot settings (Simple & Complex Instruction).
Consistent with ChatGPT's inference performance, the fine-tuned text generation model we trained using ChatGPT-generated rationales with correctly predicted scores also exhibits a steady overall performance increase.

By employing data filtering, we retain only the rationales with correct scores. To augment the size of the filtered dataset, our focus shifts towards refining incorrect ChatGPT predictions using the methods described in §3.2.2 and §3.2.3. As shown in Table 4, rationale refinement can significantly increase the QWK performance, indicating that the lower performance on #2 may be attributed to data scarcity. Moreover, we addressed potential inaccuracies in the data by rectifying a total of 44 possibly mislabelled instances, resulting in a further performance increase. This highlights the influence of data uncertainty on the overall performance.

Applying data refinement on all datasets We extend the same refinement strategy employed on dataset #2 to all the other datasets. As shown in Table 5, this approach yields consistent performance improvements across all datasets and results in the highest overall performance compared with other rationale generation methods. Although the overall performance does not match that of text classifiers, given the intricate nature of the text generation task, noteworthy performance gains are observed on datasets #5 and #6, surpassing those achieved by the BERT classifier. This shows the benefit of enhancing the transparency of automated student answer assessment by generating rationales.
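The filtering and refinement loop behind Tables 4 and 5 can be summarized as below; the helper names (`extract_score`, `build_refinement_prompt`, `generate_rationales`) refer to the hypothetical sketches in Section 3, and the per-example dictionary layout is an assumption.

```python
def build_distillation_set(examples):
    """Keep rationales whose predicted score matches the gold label; refine the rest (IO -> R)."""
    distill_set = []
    for ex in examples:  # ex: {"question", "key_elements", "rubric", "answer", "gold_score", "rationale"}
        if extract_score(ex["rationale"]) == ex["gold_score"]:
            distill_set.append((ex, ex["rationale"]))           # correctly scored: keep as-is
        else:
            prompt = build_refinement_prompt(ex["question"], ex["key_elements"],
                                             ex["rubric"], ex["answer"], ex["gold_score"])
            refined = generate_rationales(prompt)[0]             # re-query with the gold score
            if extract_score(refined) == ex["gold_score"]:
                distill_set.append((ex, refined))
    return distill_set
```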

Leveraging Confidence Interval to Mitigate Data Uncertainty
We can derive a confidence score for each instance in the training set using the semantic confidence interval method outlined in §3.2.2. ChatGPT may make errors similar to human assessors, who may count one more or one fewer key element in the student response, leading to a one-score difference from the gold label for most incorrect predictions in the training set. Instead, we focus on analyzing the 100%-confident incorrect predictions with a score difference greater than one, as illustrated in Table 6. Intriguingly, we discover that these highly confident incorrect predictions may actually be correct, suggesting that the gold labels are noisy and possibly mislabeled. For example, the first three cases are evidently incomplete student responses lacking critical information, yet the original annotation assigns a relatively high score to the student answers. In addition to low-quality answers receiving higher scores, we also identify instances that may have been annotated with lower scores than deserved. For instance, the last example in the table clearly covers two key elements based on the rubric (highlighted in orange), but the original gold label is 0.
Such mislabeled data could be difficult to detect in a text classification task, as we might lack the resources needed for identification. This unique finding from the dataset has not been highlighted in previous research, and it verifies our concern that inconsistent standards may be present in large-scale student response assessments. This inconsistency can lead to incorrect feedback for students, emphasizing the need for rationale generation.

Analysis of ChatGPT Hallucinations
In this section, we discuss various hallucination cases observed in the zero-shot rationale generation described in §3.2. Table 7 demonstrates that zero-shot rationale generation can lead to inconsistencies and inaccurate assessments, mainly due to five types of cases: (1) Incorrect scoring scale: despite providing a clear 0-3 integer score rubric, ChatGPT occasionally generates rationales with incorrect score caps, such as 5 or 12, or a fractional score scale derived from its knowledge base. (2) Inconsistent assessment: some rationales display completely contradictory scores at the beginning and end of the rationale. (3) Uncertain prediction: similar to the first case, zero-shot rationales may disregard the rubric, resulting in uncertain scores like "1-2 points". (4) Factual mistake: some rationales claim that key elements were correctly matched even though they are never mentioned in the student's answer or the key elements. (5) Redundant rationale: zero-shot generated rationales often provide vague or irrelevant explanations of the student response, which may not be helpful for feedback and could be difficult to understand.

In contrast, few-shot examples showcase the rationale generation style and scoring, leading to a more structured format. Moreover, example-oriented instructions assist ChatGPT in moving away from its knowledge base by teaching it patterns based on existing materials. Consequently, we ultimately selected example-oriented few-shot rationale generation as our primary rationale generation method.

Student Model Generated Rationale Analysis
In this section, we present an analysis of the student model assessment results and generated rationales, comparing them with the teacher model ChatGPT's outcomes, as illustrated in Table 8. We observe that the T5-generated results exhibit similarities to ChatGPT's results, with both presenting clear scores followed by well-defined rationales that highlight the reasoning behind the scores.
Table 7: Examples of hallucination cases in ChatGPT's zero-shot rationale generation.
Incorrect scoring scale: ... answer should receive 1 point out of 5.
Inconsistent assessment: Score: 1 point. This student answer ... Therefore, the answer is not relevant to the question and should receive a score of 0 points.
Uncertain prediction: ... Therefore, this answer would receive a score of 1-2 points out of 3.
Factual mistake: ... this student answer includes three of the key elements: selective permeability, passive transport, and facilitated diffusion.
Redundant rationale: ... the answer demonstrates some understanding of protein synthesis but is missing several key elements and contains some inaccuracies.

The refinement of the training data, which involved cleaning and correcting some inaccurately predicted rationales by providing accurate scores, has led to a stronger correlation between the T5 results and the test labels. For example, the ChatGPT-generated results for #1, #2, and #6 exhibit minor discrepancies due to over-matching or under-matching certain key elements.
We also noticed a small number of mistakes in the T5-generated results, primarily attributable to the students' vague descriptions, making it difficult for the language model to compare the answers with the key elements. Additionally, some questions include rubrics such as "other acceptable responses," which are particularly challenging for language models to assess, given their lack of domain-specific background knowledge.
In conclusion, our distilled T5 model demonstrates a strong capability in assessing student responses and generating rationales. Despite the occasional errors and challenges posed by vague student answers and certain rubrics, the model's overall performance is promising for applications in educational settings.

Conclusion
In this paper, we investigate the application of ChatGPT in student response assessment, demonstrating the potential of ChatGPT-generated rationales for providing explainable and transparent evaluations. We propose a rationale generation framework, AERA, and examine three prompt strategies with varying levels of reasoning difficulty, revealing that few-shot-based prompting methods are best suited for generating plausible and accurate rationales. The utilization of ChatGPT's semantic confidence interval reveals that LLM results may be more precise than human annotations, underscoring the necessity for automated response assessment models. Further experimental outcomes indicate that fine-tuned student models can offer more accurate student response assessments while generating meaningful rationales, which can be applied to provide responsive student feedback. This paper presents a paradigm for harnessing LLMs' capabilities in student response assessment with minimal fine-grained data, decreasing potential human assessor costs and enhancing the reliability and transparency of large-scale examinations.
Table 8: Example student answers with gold scores and the rationales generated by the distilled T5 model and ChatGPT.

#1: To replicate the group's experiment, the procedure would have to state how much vinegar was poured into the containers. Also, they should specify what kinds of containers they were using as this could affect the results. In the procedure, they also failed to tell what 4 samples were used in the experiment, a key variable to it's success. Score: 3
T5: 3 points; This response describes three additional pieces of information that would be needed to accurately replicate the experiment: "how much vinegar was poured into the containers...what kinds of containers they were using...and what 4 samples were used in the experiment."
ChatGPT: 2 points; This response describes two additional pieces of information that would be needed to accurately replicate the experiment: "how much vinegar was poured into the containers" and "what 4 samples were used in the experiment." It also hints at a potential third piece of information needed: "what kinds of containers were used."

#2: a. Based on the data, the strectchiest polymer plastic of the four is plastic type B, because in the both trials, it stretched the most (22 and 23 mm). b. 1.) Add more trials to increase accuracy of the results. 2.) Repeat the experiment with heavier weights to see if it affects the results. Score: 2
T5: 2 points; The student provides an acceptable conclusion based on the data: "...the stretchiest polymer plastic of the four is plastic type B...." One correct way to improve the experimental design and/or the validity of the results is given: "Add more trials to increase accuracy of the results." The second improvement given, "...trying the experiment with heavier weights to see if it affects the results," is too vague and does not receive credit.
ChatGPT: 3 points; The student provides an acceptable conclusion based on the data: "...the stretchiest polymer plastic of the four is plastic type B..." Two correct ways to improve the experimental design and/or the validity of the results are given: "1.) Add more trials to increase accuracy of the results...." and "2.) Repeat the experiment with heavier weights to see if it affects the results."

#5: tRNA will transfer it out. Score: 0
T5: 0 points; The student answer does not match any key elements given.
ChatGPT: 0 points; The student answer does not provide any coherent or relevant information on the steps involved in protein synthesis.

#6: Osmosis, Is the movement of water across the membrain. Endocitosis, is the movement of things in to the cell. Exocitosis, is the movement of things out of the cell. Score: 1
T5: 1 point; This student answer only matches one key element, "Osmosis... movement of water". The other two concepts are incorrect or incomplete.
ChatGPT: 2 points; This student answer matches two key elements, "Osmosis... movement of water across the membrane" and "Endocytosis... movement of things into the cell" but didn't include an explanation for "Exocytosis".

A Further Experiment Details and Discussion
A.1 Experimental Setup

Dataset In this paper, we employ the Hewlett Foundation: Short Answer Scoring (ASAP-SAS) dataset⁴. This dataset encompasses over 23,000 short answer responses from students in grades 7 to 10, covering ten questions spanning subjects such as Science, Biology, English, and Art. Expert human raters have manually scored each response on a scale of 0-2 or 0-3, based on predefined rubrics. Instead of focusing on assessing the grammatical or writing quality of student responses, we are more interested in response assessment for STEM-related questions. Therefore, we only selected the four subsets (#1, #2, #5, and #6) relating to Science and Biology from the ASAP-SAS dataset. We did not include the other subsets since they either focus on English and Art or contain multi-modal data (e.g., graphs) in the question that is difficult to feed into language models. As the original dataset only provides the training and test sets, we created a development set by partitioning the training set in an 8:2 ratio.
Quadratic Weighted Kappa Implementation Quadratic Weighted Kappa, a widely used metric for evaluating the agreement between two raters in student response assessment, is defined as:

$\kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}}$,

where $k$ is the number of possible scores, $w$ is the weight matrix, calculated as $w_{i,j} = \frac{(i-j)^2}{(k-1)^2}$, $O$ is a $k \times k$ histogram matrix, and $E$ is the $k \times k$ expected value matrix.
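For reference, a self-contained implementation of this definition; in practice scikit-learn's `cohen_kappa_score(..., weights="quadratic")` computes the same quantity.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, num_scores):
    """QWK = 1 - sum(w * O) / sum(w * E) with quadratic weights w_ij = (i - j)^2 / (k - 1)^2."""
    w = np.array([[(i - j) ** 2 / (num_scores - 1) ** 2
                   for j in range(num_scores)] for i in range(num_scores)])
    O = np.zeros((num_scores, num_scores))
    for t, p in zip(y_true, y_pred):            # observed score-pair histogram
        O[t, p] += 1
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(y_true)   # expected matrix under independence
    return 1.0 - (w * O).sum() / (w * E).sum()
```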

Hyper-parameters Setup
We utilized the OpenAI API with the gpt-3.5-turbo model (version 23 Mar 2023) for generating the Simple/Complex/Example instruction-based rationales. The parameters were kept at their default settings, and the temperature was set to 1.0. Our fine-tuning experiments were conducted using NVIDIA A100 80G graphics cards. For the T5 fine-tuning process, we used Long-t5-tglobal-large as our base model. Training was performed with a batch size of 8 for 30 epochs for the rationale generation (RG) task and a batch size of 16 for 30 epochs for the text classification (TC) task. Each configuration was repeated three times (RG) or five times (TC) to obtain an average performance, employing random seeds of 210, 102, 231, 314, and 146.

Model Implementation We adopted the Pytorch-Transformers package⁵ for the implementation of all our Transformer-based language models. Since there is no established automated evaluation method for assessing the quality of rationales given the output, we examine the connection between O and R by fine-tuning a text classifier to predict O using R as input. As shown in Table 10, we observe that the four datasets achieve high performance in terms of accuracy. This finding suggests that there is a discernible pattern in the rationales that points to the assessment scores. The lower accuracy on #2 verifies our observation that the LLM struggles with its rubric. Thus, our assumption that the score is supported by its rationale is verified by these results.
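The R → O verification reported in Table 10 can be reproduced with a standard sequence classification fine-tuning loop; the sketch below uses BERT as an illustrative backbone, and the dataset objects (`rationale_train`, `rationale_dev`, with a "rationale" text column and a "label" score column) are assumptions.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

clf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
clf_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                               num_labels=4)  # scores 0-3

def tokenize(batch):
    # The classifier only sees the generated rationale, never the student answer.
    return clf_tokenizer(batch["rationale"], truncation=True, padding="max_length")

trainer = Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="r-to-o", num_train_epochs=5,
                           per_device_train_batch_size=16),
    train_dataset=rationale_train.map(tokenize, batched=True),
    eval_dataset=rationale_dev.map(tokenize, batched=True),
)
trainer.train()
```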

B Prompt Details
In this section, we provide the full details of the question, key elements, and rubric used in the prompt for each dataset.

B.1 Subset #1

[Marking Rubric]: 3 points: The response describes three additional pieces of information that would be needed to accurately replicate the experiment; 2 points: The response describes two additional pieces of information that would be needed to accurately replicate the experiment; 1 point: The response describes one additional piece of information that would be needed to accurately replicate the experiment; 0 points: The response describes little or no accurate or relevant information from the acid rain investigation.

B.2 Subset #2
[Question]: A student performed the following investigation to test four different polymer plastics for stretchability. Procedure: 1. Take a sample of one type of plastic, and measure its length. 2. Tape the top edge of the plastic sample to a table so that it is hanging freely down the side of the table. 3. Attach a clamp to the bottom edge of the plastic sample. 4. Add weights to the clamp and allow them to hang for five minutes. 5. Remove the weights and clamp, and measure the length of the plastic types. 6. Repeat the procedure exactly for the remaining three plastic samples. 7. Perform a second trial (T2) exactly like the first trial (T1). The student recorded the following data from the investigation. The table shows the amount of stretch (in millimeters) for four different types of plastic, labeled as A, B, C, and D, when subjected to two different stretching forces, labeled as T1 and T2. For plastic type A, it stretched 10mm under T1 and 12mm under T2. For plastic type B, it stretched 22mm under T1 and 23mm under T2. For plastic type C, it stretched 14mm under T1 and 13mm under T2. Lastly, for plastic type D, it stretched 20mm under both T1 and T2. a. Draw a conclusion based on the student's data. b. Describe two ways the student could have improved the experimental design and/or validity of the results.
[Key Elements]: Conclusions: Plastic sample B has more stretchability than the other polymer plastics. Plastic sample A has the least amount of stretchability compared to the other polymer plastics. Not all polymer plastics have the same stretchability. Different polymer plastics have different stretchability (and are therefore suited for different applications). A reasonable conclusion cannot be drawn due to procedural errors. Other reasonable conclusions. Experimental Design Improvements: Provide the before and after measurements for length (Did the samples all start out the same size?). Make sure the samples are all of the same thickness. Variations in thickness could have caused variation in stretchability. Perform additional trials. Some of the samples have similar stretchability (A and C, B and D). Two trials may not be enough to conclusively state that one is more stretchable than the other. Indicate how many weights were added to the clamps (Was it the same number for each sample?). Other acceptable responses. [Marking Rubric]: 3 points: The response draws a valid conclusion supported by the student's data and describes two ways the student could have improved the experimental design and/or the validity of the results; 2 points: The response draws a valid conclusion supported by the student's data and describes one way the student could have improved the experimental design and/or the validity of the results -or- the response describes two ways the student could have improved the experimental design and/or the validity of the results but fails to draw or incorrectly draws a conclusion from the student's data; 1 point: The response draws a valid conclusion supported by the student's data but fails to describe, or incorrectly describes, how the student could have improved the experimental design and/or the validity of the results -or- the response describes one way the student could have improved the experimental design and/or the validity of the results but fails to draw or incorrectly draws a conclusion from the student's data; 0 points: The response provides little or no correct information from the polymer investigation.

B.3 Subset #5
[Question]: Starting with mRNA leaving the nucleus, list and describe four major steps involved in protein synthesis.
[Key Elements]: mRNA exits nucleus via nuclear pore. mRNA travels through the cytoplasm to the ribosome or enters the rough endoplasmic reticulum. mRNA bases are read in triplets called codons (by rRNA). tRNA carrying the complementary (U=A, C+G) anticodon recognizes the complementary codon of the mRNA. The corresponding amino acids on the other end of the tRNA are bonded to adjacent tRNA's amino acids. A new corresponding amino acid is added to the tRNA. Amino acids are linked together to make a protein beginning with a START codon in the P site (initiation). Amino acids continue to be linked until a STOP codon is read on the mRNA in the A site (elongation and termination).

B.4 Subset #6
[Question]: List and describe three processes used by cells to control the movement of substances across the cell membrane.
[Key elements]: Selective permeability is used by the cell membrane to allow certain substances to move across. Passive transport occurs when substances move from an area of higher concentration to an area of lower concentration. Osmosis is the diffusion of water across the cell membrane. Facilitated diffusion occurs when the membrane controls the pathway for a particle to enter or leave a cell. Active transport occurs when a cell uses energy to move a substance across the cell membrane, and/or a substance moves from an area of low to high concentration, or against the concentration gradient. Pumps are used to move charged particles like sodium and potassium ions through membranes using energy and carrier proteins. Membrane-assisted transport occurs when the membrane of the vesicle fuses with the cell membrane forcing large molecules out of the cell as in exocytosis. Membrane-assisted transport occurs when molecules are engulfed by the cell membrane as in endocytosis. Membrane-assisted transport occurs when vesicles are formed around large molecules as in phagocytosis. Membrane-assisted transport occurs when vesicles are formed around liquid droplets as in pinocytosis. Protein channels or channel proteins allow for the movement of specific molecules or substances into or out of the cell.


Figure 2 :
Figure 2: Our proposed AERA: Automated Explainable Student Response Assessment framework contains three major steps: Step 1: Query ChatGPT for rationale generation, Step 2: Apply rationale filtering or data refinement strategy, and Step 3: Distill smaller language model for more efficient rationale generation.

Figure 3 :
Figure 3: Demonstration of using ChatGPT for tabular data and text description transformation.

Dataset
In this paper, we employ the Hewlett Foundation: Short Answer Scoring (ASAP-SAS) dataset. The detailed train, development, and test splits are shown in Table

#1: "In order to replicate this procedures, you would need to know" (Label: 3, High-confidence prediction: 0)
#2: "In conclusion trial any" (Label: 2, High-confidence prediction: 0)
#2: "Plastic" (Label: 3, High-confidence prediction: 0)
#1: "In order to replicate this experiment, you would need to know: 1. how they got the mass of the four different samples 2. A list of constants 3. You would have to know how much of a sample you would place into the container of vinegar and if it's the same for all four materials." (Label: 0, High-confidence prediction: 2)

Table 6: Examples of highly confident ChatGPT predictions that identify wrongly scored data.

Table 1 :
Comparison of reasoning difficulty between different prompts.

Table 5 :
Results of data refinement (rationale refinement & label fixing) across all datasets.

Table 10 :
R → O Verification