Exploring the Potential of Large Language Models in Generating Code-Tracing Questions for Introductory Programming Courses

In this paper, we explore the application of large language models (LLMs) for generating code-tracing questions in introductory programming courses. We designed targeted prompts for GPT4, guiding it to generate code-tracing questions based on code snippets and descriptions. We established a set of human evaluation metrics to assess the quality of questions produced by the model compared to those created by human experts. Our analysis provides insights into the capabilities and potential of LLMs in generating diverse code-tracing questions. Additionally, we present a unique dataset of human and LLM-generated tracing questions, serving as a valuable resource for both the education and NLP research communities. This work contributes to the ongoing dialogue on the potential uses of LLMs in educational settings.


Introduction and Background
The teaching of introductory programming courses continues to be a challenging endeavor, despite the global uptake and popularity of such courses. High enrollment rates often result in diverse student populations, with a wide range of programming experience, from those just starting their journey to others with prior exposure (Lopez et al., 2008). Ensuring an effective learning experience that accommodates this wide disparity presents a daunting task, making the teaching of these courses complex.
One critical component in teaching introductory programming is the focus on code tracing, a skill identified as instrumental in enhancing code writing abilities (Lister et al., 2009; Venables et al., 2009; Kumar, 2013). Current educational methodologies encourage code tracing through a variety of means, such as practice questionnaires (Lehtinen et al., 2023), direct teaching strategies (Xie et al., 2018), and tracing quizzes (Sekiya and Yamaguchi, 2013). These strategies consistently utilize code-tracing questions aimed at fostering and developing a student's understanding and skills.

Our data and code are available at https://github.com/aysafanxm/llm_code_tracing_question_generation

Figure 1: We aim to assess the Large Language Models' (LLMs') capability to generate code-tracing questions, pivotal in computer science education. The accompanying illustration outlines our approach, where GPT-4 is employed to generate questions based on given code snippets and descriptions. Subsequent comparative analysis with human-created questions aids in exploring critical aspects, such as the quality and diversity of generated questions, discernibility between human and AI authors, and the relative superiority in question quality.
However, the preparation of code-tracing questions poses challenges. Manual question creation by instructors (Sekiya and Yamaguchi, 2013; Hassan and Zilles, 2021) is time-consuming and lacks scalability. Automatic generation using program analysis saves time, yet is limited by the analyzer's capabilities and lacks question diversity (Zavala and Mendoza, 2018; Thomas et al., 2019; Russell, 2021; Lehtinen et al., 2021; Stankov et al., 2023).
In light of the increasing potential of Large Language Models (LLMs) in sectors like code summarization and explanation (Chen et al., 2021; Siddiq et al., 2023), the question arises: can LLMs generate high-quality code-tracing questions? Our study explores this query using GPT-4 (OpenAI, 2023), leveraging prompts to guide its question generation based on given code snippets and descriptions. To assess the LLM's capability in this pivotal aspect of computer science education, we devised a set of human evaluation metrics. This allowed for an objective appraisal of the LLM-generated questions, and, through a comparative analysis with human-created counterparts, we explored critical aspects such as question quality, diversity, discernibility between human and AI authors, and relative superiority in quality (Figure 1). These analyses have enhanced our understanding of the potential roles of LLMs in computer science education.
This investigation provides a foundation for considering the potential inclusion of LLMs in learning platforms, which could offer new possibilities for enhancing the learning experience in introductory programming courses. Given these advancements, our study contributes to the field as follows:
• The curation of a high-quality dataset consisting of human- and LLM-generated code-tracing questions and associated code snippets.
• An exploration and evaluation of GPT-4's capability in question generation, including comparisons with both GPT-3.5-turbo and human-authored questions, and an examination of few-shot and zero-shot scenarios.
• The introduction of a human evaluation methodology and a comprehensive assessment of the quality of LLM-generated questions, offering valuable insights into the potential of LLMs in educational contexts.
Code LLMs for CS Education: Recent advances in code large language models (LLMs) (Chen et al., 2021; Wang et al., 2021; Le et al., 2022; Wang et al., 2023) have enabled various downstream applications, including code completion, retrieval, summarization, explanation, and unit test generation (Lu et al.; Siddiq et al., 2023; Tian et al., 2023). Studies have showcased LLMs' ability to generate novice programming content comparable to humans (Finnie-Ansley et al., 2022; Piccolo et al., 2023). LLMs have been utilized in classroom environments (Kazemitabaar et al., 2023), to generate coding exercises and explanations (Sarsa et al., 2022), and to create counterfactual questions (Narayanan et al., 2023). Our study represents the first exploration of LLMs for code-tracing question generation, a critical component of CS education, thus underscoring the potential of these models for generating educational content.
3 Our Approach

Task Definition
In automatic tracing question generation, given an optional description d ∈ D ∪ {∅} detailing the code context, and a code snippet c ∈ C provided by an instructor or student, the aim is to generate a set of relevant questions Q′ for student practice. This task can be formally defined as a function f : (D ∪ {∅}) × C → 2^Q, with Q′ = f(d, c), where D represents all possible descriptions, C all possible code snippets, and Q′ a subset of all possible questions Q.
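To make the mapping concrete, the following minimal Python stub renders the same interface; the names are our own illustrative choices, not the authors' released code.

```python
# An illustrative rendering of the task interface (hypothetical names, not the authors' code):
# an optional description plus a code snippet map to a set of generated tracing questions.
from typing import Optional, Set

def generate_tracing_questions(description: Optional[str], code_snippet: str) -> Set[str]:
    """Placeholder for f : (D ∪ {∅}) × C → 2^Q; a prompt-based realization is sketched later."""
    raise NotImplementedError
```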

Curating the Code-Tracing Question Dataset
For our experiment, we curated a unique dataset reflecting the range of tracing questions encountered by beginner programmers. We sourced 158 unique questions from CSAwesome, a recognized online Java course aligned with the AP Computer Science A curriculum. To enhance diversity, we added 18 questions extracted from relevant YouTube videos.
Other platforms and sources were also examined but were excluded due to a lack of explicit tracing questions. Our final dataset consists of 176 unique code snippet and question pairs, allowing a fair evaluation of LLMs' ability to generate introductory programming tracing questions.

Prompt Engineering and Model Selection
In our iterative approach to prompt engineering and model selection, we first refined prompts and then generated tracing questions using GPT-3.5-turbo and GPT-4. Using BERTScore, we assessed question diversity and similarity. Based on these insights, we combined the optimized prompt with the chosen model to determine the most effective generation approach, be it few-shot or zero-shot. Our final prompt, refined iteratively from (Brown et al., 2020) and presented in Appendix B, adopts an expert instructor's perspective, encourages deep understanding via code-tracing questions, and maximizes the inherent versatility of LLMs.
Next, we considered GPT-3.5-turbo and GPT-4 for model selection and investigated the diversity of the generated tracing questions using BERTScore (Zhang* et al., 2020). For the automatic evaluation of diversity, we adopted the following methodology: for each code snippet, we use a single human-authored tracing question as the reference. Both GPT-3.5-turbo and GPT-4 are then tasked with generating multiple tracing questions for every snippet. Following this, we employ regular expressions in a post-processing step to segment the generated content, isolating individual tracing questions. Subsequently, for each generated prediction p, its BERTScore is computed in relation to the reference, denoted as BERTScore(reference, p).
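A minimal sketch of this segmentation-and-scoring step is shown below; it assumes the bert-score package and a numbered output format, and the example strings and variable names are our own illustrative assumptions rather than the released code.

```python
# Segment a model's raw output into individual questions and score each against the
# single human-authored reference with BERTScore.
import re
from bert_score import score

generated_output = """1. What is the output of mystery(4)?
2. How many times does the loop body execute when n = 3?
3. What value does total hold after the second iteration?"""

reference_question = "What does this method print when it is called with n = 4?"

# Split on the leading "N." numbering produced by the model (post-processing step).
predictions = [q.strip() for q in re.split(r"\n?\d+\.\s*", generated_output) if q.strip()]

# BERTScore(reference, p) for each generated prediction p.
precision, recall, f1 = score(predictions, [reference_question] * len(predictions), lang="en")
for question, p, r, f in zip(predictions, precision.tolist(), recall.tolist(), f1.tolist()):
    print(f"P={p:.2f} R={r:.2f} F1={f:.2f}  {question}")
```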
The boxplot in Figure 2 displays the Precision, Recall, and F1 scores for both models. From the graph, it is clear that GPT-3.5-turbo has a median Precision score around 0.45, Recall slightly above 0.6, and an F1 score hovering around 0.5. In comparison, GPT-4 shows a more balanced performance, with a median Precision score close to 0.6, Recall near 0.55, and F1 just above 0.5. Notably, the variability in scores, particularly for GPT-4, highlights the diverse outcomes in its results. Based on our results, we chose GPT-4 for subsequent evaluations. Enhanced performance examples from GPT-4 are in Appendix C.
Next, we hypothesized that the few-shot question generation approach, which feeds the model three tracing question examples and their respective code snippets, would yield higher-quality questions than zero-shot generation, which relies solely on the prompt. Contrary to our expectations, the experiment showed that the few-shot method introduced a significant bias towards the example questions, thus narrowing the diversity in the generated questions. Consequently, we opted for zero-shot generation in our tests, which fostered a broader spectrum of question types. Detailed examples of outcomes from both the zero-shot and few-shot approaches are available in Section 4.4.
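As a rough illustration of the two settings (not the exact prompt, which appears in Appendix B), a zero-shot versus few-shot request to GPT-4 could be assembled as follows; the sketch assumes the openai Python client (v1.x) with an OPENAI_API_KEY in the environment, and all message text, example pairs, and function names are assumptions.

```python
# Zero-shot vs. few-shot request construction for tracing-question generation (illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an expert introductory-programming instructor. "
    "Write code-tracing questions that probe a student's step-by-step understanding."
)

def build_messages(code_snippet: str, description: str, examples=None):
    """Zero-shot when `examples` is None; few-shot when (code, question) example pairs are given."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for ex_code, ex_question in (examples or []):
        messages.append({"role": "user", "content": f"Code:\n{ex_code}\nDescription: (none)"})
        messages.append({"role": "assistant", "content": ex_question})
    messages.append({"role": "user", "content": f"Code:\n{code_snippet}\nDescription: {description}"})
    return messages

response = client.chat.completions.create(
    model="gpt-4",
    messages=build_messages("int x = 3; x *= 2; System.out.println(x);", "Simple arithmetic", None),
)
print(response.choices[0].message.content)
```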

Human Evaluation
Next, we conducted a human evaluation comparing the quality of GPT-4-generated and human-authored tracing questions. The expert evaluators were screened against specific criteria: they had to be computer science graduate students with at least one year of programming teaching or tutoring experience. Four such experts, meeting these criteria, participated in the evaluation.
Each evaluator was assigned a set of 44 randomly selected code snippets from the pool of 176 human-authored tracing questions. For each snippet, evaluators received a pair of questions (one human-authored and one GPT-4-generated) in a randomized order to mitigate potential ordering bias, and evaluators were kept unaware of question authorship.

The evaluators rated each question on the five criteria shown in Table 1. They also guessed each question's authorship and expressed a preference between the two questions in the pair. Detailed evaluation criteria and labels can be found in Table 1.
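A small sketch of how such an assignment and blinded, order-randomized presentation could be implemented is given below; the data structures, seed, and function names are our own assumptions, not the study's tooling.

```python
# Randomly partition the 176 snippet/question pairs across 4 evaluators (44 each) and
# present each human/LLM question pair in a random order with authorship hidden.
import random

random.seed(0)  # illustrative; any fixed seed makes the assignment reproducible

snippet_ids = list(range(176))
random.shuffle(snippet_ids)
assignments = [snippet_ids[i * 44:(i + 1) * 44] for i in range(4)]  # 44 snippets per evaluator

def present_pair(human_question: str, llm_question: str) -> list:
    """Return the two questions in random order, hiding authorship from the evaluator."""
    pair = [human_question, llm_question]
    random.shuffle(pair)
    return pair
```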

Analyses and Results
This section details our analysis and highlights the results, encompassing quality ratings, expert perceptions, and textual similarities in question generation. To assess the quality disparity between LLM-generated and human-authored questions, we applied Mann-Whitney U tests (Mann and Whitney, 1947) to the median ratings of the four evaluation criteria in Table 3. Significant differences emerged in three criteria: relevance to learning objectives, clarity, and difficulty appropriateness. However, relevance to the given code snippet showed no significant difference, indicating comparable performance.
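For reference, a minimal version of this significance test can be run with scipy; the rating vectors below are fabricated placeholders, not the study's data.

```python
# Mann-Whitney U test comparing LLM-generated and human-authored ratings on one criterion.
from scipy.stats import mannwhitneyu

llm_clarity_ratings = [4, 5, 3, 4, 4, 5, 3, 4]      # hypothetical 1-5 ratings
human_clarity_ratings = [5, 5, 4, 4, 5, 4, 5, 5]

stat, p_value = mannwhitneyu(llm_clarity_ratings, human_clarity_ratings, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")  # p < 0.05 indicates a significant difference on this criterion
```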

Comparative Analysis of Quality Ratings
Despite U-tests highlighting significant differences in some criteria, the practical quality difference was minimal. As further detailed in Table 2, LLM-generated questions had slightly lower mean ratings, yet their median ratings closely mirrored those of human-authored questions.
Considering these two analyses together, it is apparent that, despite some statistical differences, LLM-generated questions still maintain a high pedagogical standard. Consequently, while the results underline areas for potential enhancement, the LLM demonstrates proficiency in generating questions that align closely in quality and course relevance with those crafted by humans.

Expert Perception of Question Authorship
We further evaluated the discernibility of LLM-generated questions from human-authored ones using a confusion matrix (Table 5). Approximately 56% (99 out of 176) of GPT-4-generated questions were mistakenly identified by experts as human-authored, and about 20% (35 out of 176) of human-authored questions were misattributed to GPT-4. This overlap signifies the high quality of the generated questions and GPT-4's proficiency in producing pedagogically relevant tracing questions. Moreover, the matrix reveals an evaluator bias toward attributing higher-quality questions to human authorship.
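A quick check of the misattribution rates reported above, using the counts cited in the text (the confusion-matrix layout itself is described in Table 5):

```python
# Verify the reported misattribution percentages from the cited counts.
llm_total, llm_judged_human = 176, 99      # GPT-4 questions judged "human-authored" by experts
human_total, human_judged_llm = 176, 35    # human-authored questions judged "GPT-4"

print(f"GPT-4 questions mistaken for human: {llm_judged_human / llm_total:.0%}")    # ~56%
print(f"Human questions mistaken for GPT-4: {human_judged_llm / human_total:.0%}")  # ~20%
```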

Textual Similarity between Questions
Table 6 presents BLEU (Post, 2018), ROUGE-1/2/L (Lin, 2004), and BERTScore (Zhang* et al., 2020), comparing the similarity between randomly selected GPT-4-generated questions and the corresponding human-authored questions. The low BLEU and ROUGE scores suggest that GPT-4 generates distinct, non-verbatim questions compared to human-authored questions. A moderate BERTScore, reflecting semantic similarity, suggests that GPT-4-generated questions align with the context of human-authored ones. This further underscores GPT-4's capability to independently generate relevant and diverse code-tracing questions, distinct from those crafted by humans. Combined with the previous analyses, an LLM such as GPT-4 thus exhibits substantial promise in generating high-quality, course-relevant code-tracing questions, illustrating its utility as a teaching aid.
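A hedged sketch of how these similarity scores could be computed for a single question pair follows; it assumes the sacrebleu, rouge-score, and bert-score packages, and the example pair is fabricated for illustration.

```python
# Surface similarity (BLEU, ROUGE-1/2/L) and semantic similarity (BERTScore) for one pair.
from sacrebleu import sentence_bleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

generated = "What value does sum hold after the loop finishes for n = 123?"
reference = "What is printed by this program when n is 123?"

bleu = sentence_bleu(generated, [reference]).score
rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"]).score(reference, generated)
_, _, f1 = bert_score([generated], [reference], lang="en")

print(f"BLEU = {bleu:.1f}")
print({name: round(s.fmeasure, 2) for name, s in rouge.items()})
print(f"BERTScore F1 = {f1.item():.2f}")
```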

Few-shot vs Zero-shot Generation Results
Few-shot generation biased our model towards the provided examples, largely reducing question diversity. In contrast, zero-shot generation yielded more diverse questions, prompting us to favor it for broader question variety in our experiment. Detailed examples of the generated results for both zero-shot and few-shot methods can be found in Appendix D.
Table 4 provides a side-by-side comparison of GPT-4's performance in few-shot and zero-shot settings. The zero-shot results exhibit a broader range of question types, while the few-shot results appear more templated, reflecting the bias introduced by the provided examples.
Possible reasons for these observations include the anchoring effect of the in-context examples, which pull the model toward reproducing their surface form rather than exploring other question types.

Conclusion
This study explored the capability of GPT-4 in generating code-tracing questions that rival the quality of those crafted by human educators. The findings illuminate the potential of LLMs to bolster programming education, marking a significant stride in the domain of code-tracing question generation and LLM application. This sheds light on scalable, high-quality automated question generation.

Limitations and Future Work
This study marks a step toward evaluating LLMs for code-tracing question generation, but it is not without its limitations. Our research was primarily anchored to GPT-4, raising concerns about the generalizability of our findings to other LLMs, such as CodeT5+. Moreover, the study did not delve into the personalization of tracing questions based on individual student submissions, a facet that could greatly enhance the learning experience. Furthermore, the real-world educational efficacy of the LLM-generated questions remains an open question, given that our study did not involve actual students.
Several avenues beckon for further exploration. Evaluations with a broader range of models will offer a more comprehensive perspective on LLM capabilities. While our study centered on introductory Java tracing questions, assessing LLM versatility across different programming domains is imperative. The potential of LLMs extends beyond mere question generation; by tailoring questions to student needs, we can amplify their educational relevance. Our roadmap includes the development of an educational platform integrated with LLM-generated questions, followed by classroom experiments and usability testing. To ensure broader applicability, expanding our dataset is crucial. Lastly, our findings on few-shot and zero-shot learning necessitate further investigation into model adaptability, biases in question generation, and the potential of intermediate-shot learning.
These directions not only underscore the transformative potential of LLMs in AI-driven education but also emphasize the importance of comprehensive evaluations.

Ethical Statement
Our exploration of Large Language Models (LLMs) in introductory programming education was conducted ethically. We sourced public data and maintained evaluator anonymity and data confidentiality through secure storage. Evaluators were informed of the objectives and participated voluntarily. All evaluation results, as committed in the IRB forms, are securely stored. We strived for educational fairness by properly compensating the educators involved in our evaluation. We are mindful of the societal impacts of LLM integration in education. While acknowledging their promise, we believe careful consideration of pedagogical goals within the educational ecosystem is vital. Our future work will be guided by these ethical principles of privacy, informed consent, secure data handling, inclusivity, and conscientious progress focused on students' best interests.
Example of collected data and the corresponding LLM-generated tracing questions (excerpt):

Code snippet (fragment): ... n % 10; sum = sum + digit; n = n / 10; } System.out.println(sum);
Description: Identify the reverse digits on a credit card.
LLM-generated tracing questions:
1. Based on the given code, what will be the output if the value of ...
...
8. If the requirement was to identify the reverse digits on a credit card and then return them as a single integer, how would you modify the code to accomplish this?

Figure 2: Comparison of the BERTScore on all LLM-generated questions and human-authored questions.

Table 1: Criteria used for expert evaluation.

Table 4: Illustrative comparison between GPT-4's code-tracing question generation in few-shot and zero-shot settings, showcasing the diversity and specificity of generated questions. Excerpt:
Few-shot results:
Tracing question 1: What is the output of the method redo(9, 3)?
Tracing question 2: What is the output of the method redo(0, 5)?
Tracing question 3: What is the output of the method redo(27, 3)?
Zero-shot results:
1. What is the purpose of this function? What does it aim to accomplish?
2. What is the base case for this recursive function? What happens when the base case is reached?
3. If the input values are i = 8 and j = 2, what will be the output of the function? Please trace through the code step-by-step and explain your reasoning.
...
6. What happens if both input values are negative, such as i = -8 and j = -2? Does the function handle this case correctly according to the requirement? Explain your reasoning.

Table 5: Confusion matrix indicating experts' attributions of code-tracing questions. The table displays the number of actual GPT-4-generated and human-authored questions and how they were predicted by the experts, underscoring the challenge in distinguishing between the two.