The ADAIO System at the BEA-2023 Shared Task: Generating AI Teacher Responses in Educational Dialogues

This paper presents the ADAIO team’s system entry in the Building Educational Applications (BEA) 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues. The task aims to assess the performance of state-of-the-art generative models as AI teachers in producing suitable responses within a student-teacher dialogue. Our system evaluates several baseline configurations of OpenAI GPT-3 models and uses a range of prompts designed to elicit teacher responses from them. Our system achieved second place in the challenge by employing a few-shot prompt-based approach with the OpenAI text-davinci-003 model. The results highlight the few-shot learning capabilities of large language models, particularly OpenAI’s GPT-3, in the role of AI teachers.


Introduction
The current success of large language models (LLMs) in generating natural language responses that are almost indistinguishable from those of a human indicates that AI systems are steps closer to passing the Turing test. Apart from being used as conversational agents, LLMs can be employed in various educational settings, as described in Kasneci et al. (2023), including as an AI teacher to help students practice and improve. Tack et al. (2023) launched a shared task at the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), called Generating AI Teacher Responses in Educational Dialogues. Inspired by Tack and Piech (2022), this task requires teams to develop Intelligent Tutoring Systems (ITS) that generate teacher responses in real-world teacher-student interactions. This task serves as a benchmark to gauge the capability of generative models in functioning as AI teachers.
Dialogue-based ITS face various requirements and challenges in meeting the needs of effective educational support. This entails generating factually accurate content and ensuring educational efficacy by speaking to students in a teacher-like manner, understanding their needs, and helping them improve their understanding (Tack and Piech, 2022). However, several challenges must be addressed.
One significant challenge lies in acquiring appropriate data for training ITS, particularly real teacher-student interactions that cover various subjects. Another challenge involves developing models that can effectively capture the student's learning style and accommodate long-range dependencies within conversational sequences. Furthermore, evaluating the quality of teacher responses is essential. The responses should not only sound natural but also demonstrate an understanding of the student's queries and provide valuable guidance to help the student improve.

Related Work
Research on Intelligent Tutoring Systems has spanned many decades, with proposed systems including text-based systems (Graesser et al., 2005), spoken dialogue tutoring systems (Litman and Silliman, 2004), and multi-modal systems, all developed to improve student learning.
Earlier dialogue-based ITS were designed using rule-based cognitive modelling methods (Aleven, 2010; VanLehn et al., 2002) to generate teacher responses. In recent years, natural language generation (NLG) tasks have generally benefited from models using sequence-to-sequence architectures (Sutskever et al., 2014). Current state-of-the-art models such as OpenAI GPT-3 (Brown et al., 2020) have shown strong results on a range of downstream NLG tasks such as response generation. One of the major underlying components of the language model is the transformer architecture (Vaswani et al., 2017), which increases its capacity for context awareness and long-range dependencies. Currently, the application of LLMs within the educational domain (Bibauw et al., 2022; Hendrycks et al., 2021) indicates they could improve student learning outcomes. However, their efficacy in conversational tutoring has not been fully evaluated (Tack and Piech, 2022).
To benchmark the efficacy of LLMs in generating responses that accomplish teaching goals, Tack and Piech (2022) investigate the suitability of AI-teacher responses by comparing text generated by state-of-the-art models, Blender (Roller et al., 2020) and GPT-3, on real-world tutoring dialogue data. The paper comparatively analyses the responses based on a stack of evaluation methods. Furthermore, the paper suggests three pedagogical dimensions for evaluating AI-teacher generated responses: the ability to speak like a teacher, to understand a student, and to help a student. These dimensions form the core of the AI-teacher challenge.

Dataset
Teacher-Student Chatroom Corpus (TSCC). The dataset used in this task is derived from the Teacher-Student Chatroom Corpus (TSCC) (Caines et al., 2020). The TSCC consists of 102 chatrooms where English as a second language (ESL) teachers interact with students to work on language exercises and assess students' language proficiency. From each dialogue, shorter passages limited to 100 tokens were extracted, comprising sequential turns between the teacher and student. These passages serve as data samples and end with the teacher's utterance, which acts as the reference response. The dataset follows a JSON format, including fields such as id, utterances (dialogue context), and response (teacher's ending utterance).
The dataset includes a train set of 2,747 dialogues with an average of 3.9 turns per dialogue (±2.2, max=17). The dev set consists of 305 dialogues with an average of 4.0 turns (±2.2, max=16), while the test set comprises 273 dialogues with an average of 2.6 turns (±1.5, max=11). The response lengths in the train set range from 1 to 66 words, with an average of 9.1 words (±8.2), whereas the dev and test sets are provided without the response data.
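As a rough illustration of the data format described above, the following sketch parses one sample and computes its turn count and response length. Only the top-level field names (id, utterances, response) come from the dataset description; the inner speaker and text keys, the id value, and the dialogue content are assumptions invented for this example.

```python
import json
from statistics import mean

# Invented sample for illustration. Only the top-level keys (id, utterances,
# response) follow the dataset description; the inner "speaker"/"text" keys
# are an assumption about how each turn is stored.
RAW = """
[
  {"id": "train_0001",
   "utterances": [
     {"speaker": "student", "text": "I have went to the shop yesterday."},
     {"speaker": "teacher", "text": "Ok, which tense do we need with 'yesterday'?"},
     {"speaker": "student", "text": "Past simple?"}
   ],
   "response": {"speaker": "teacher",
                "text": "Exactly, so it's 'I went to the shop yesterday'."}
  }
]
"""


def load_samples(raw):
    """Parse the JSON text into a list of sample dictionaries."""
    return json.loads(raw)


def turn_count(sample):
    """Number of context turns that precede the reference response."""
    return len(sample["utterances"])


def response_length(sample):
    """Length of the reference response in whitespace-separated words."""
    return len(sample["response"]["text"].split())


samples = load_samples(RAW)
print("average turns:", mean(turn_count(s) for s in samples))
print("response words:", response_length(samples[0]))
```

Statistics such as the average number of turns or the range of response lengths reported above could be gathered by applying these helpers over a full split.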
4 System Architecture

Model
We conducted our experiments using OpenAI GPT-3 (Brown et al., 2020) pre-trained LLMs. Initial trials revealed that the text-davinci-003 model produced responses that closely resembled human-like and contextually relevant interactions, surpassing the performance of ada, curie, and babbage. Consequently, we predominantly employed this model for our experiments. However, considering the cost associated with utilizing the models, we opted for the text-ada-001 model for the fine-tuning setting described below. A schematic overview of our experimental process is depicted in Figure 1.

Training Methods
Earlier deep-learning models would employ fine-tuning techniques to update the parameters of a pre-trained model by retraining it on new data samples from the target domain. Pre-trained LLMs such as GPT-3 have demonstrated the ability to utilize natural language prompts, either with or without accompanying examples, in performing downstream NLP tasks such as classification, summarization, or generation (Brown et al., 2020; Liu et al., 2023). Within dialogue generation, fine-tuning with example data can lead to responses generated with desirable attributes or tones such as empathy, persuasion, or encouragement. In tutoring situations, there are attributes that make a good teacher, and we wanted to examine the ability of dialogue-based ITS to embody such characteristics. The training methods we explored include zero-shot, few-shot, and fine-tuning settings.
1. Zero-Shot: In this approach, we simply provided the GPT-3 model with a modified version of Prompt A (see Section 4.3), without any example dialogues.
2. Few-Shot: This approach adds five handpicked sample dialogues (see Table 3) from the training set to the prompts. These dialogues included the speaker role for each turn, i.e. student and teacher, just as in the training data. Our criterion was to choose dialogue examples with a teaching focus as defined in Caines et al. (2020). Following this teaching focus, we selected example dialogues consisting of conversational sequences that sought to provide grammatical and lexical resources to the student while also showing aspects of discourse management and interactive communication. We replicated this approach using two language models, namely text-ada-001 and text-davinci-003.
3. Fine-tuning on the TSCC corpus: We fine-tuned the text-ada-001 model on the training data following OpenAI's API documentation (https://platform.openai.com/docs/guides/finetuning). Our fine-tuning data consisted of approximately 95% of the training data, excluding the samples that we set aside as test data for our internal evaluation. Afterwards, we prompted the fine-tuned model exactly as in the few-shot approach to generate the teacher responses for the test samples.
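To make the fine-tuning setting more concrete, the sketch below converts a dialogue sample into a prompt/completion record in JSON Lines form, the general shape expected by the legacy OpenAI fine-tuning guide. This is not the authors' preprocessing: the separator string, the leading space in the completion, the stop marker, and the sample structure (speaker and text keys) are all assumptions made for illustration.

```python
import json

SEPARATOR = "\nteacher:"  # assumed marker that closes every prompt
STOP = "\n"               # assumed marker that closes every completion


def format_context(utterances):
    """Render the dialogue context as one 'speaker: text' line per turn."""
    return "\n".join(f"{u['speaker']}: {u['text']}" for u in utterances)


def to_finetune_record(sample):
    """Build one prompt/completion pair from a dialogue sample."""
    return {
        "prompt": format_context(sample["utterances"]) + SEPARATOR,
        "completion": " " + sample["response"]["text"] + STOP,
    }


def to_jsonl(samples):
    """Serialize the records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(to_finetune_record(s)) for s in samples)


# Invented sample; the inner keys are an assumption about the data layout.
sample = {
    "id": "train_0002",
    "utterances": [
        {"speaker": "teacher", "text": "What did you do at the weekend?"},
        {"speaker": "student", "text": "I go to the cinema."},
    ],
    "response": {"speaker": "teacher", "text": "Nice! Remember: I went to the cinema."},
}

jsonl = to_jsonl([sample])
print(jsonl)
```

Ending every prompt with the same marker and every completion with the same stop sequence is a common convention for this data format, since it tells the model where the teacher turn starts and ends.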

Prompt Engineering
In this section, we examine the adaptability of the dialogue-based Intelligent Tutoring System (ITS) by employing prompts that vary several aspects, including the roles of the participants, the teaching approach adopted by the tutor, and the specific teaching goals. To achieve this, we utilized the few-shot approach, providing explicit instructions to the model regarding dialogue response generation. The prompts used, along with corresponding dialogue examples, are presented below and in Table 3.
1. Prompt A: You will be given a dialogue chat between a teacher and a student, and your task is to generate a teacher response that is appropriate to the context, in which the teacher is polite, helpful, professional, on topic, and factually correct. The following are example dialogues with a teacher and a student.
2. Prompt B: You will be given a dialogue chat between a teacher and a student, and your task is to generate a teacher response and probe the student's understanding in a strict manner. The following are example dialogues with a teacher and a student.
3. Prompt C: You will be given a dialogue chat and your task is to generate a teacher response. The following are example dialogues with a teacher and a student.
4. Prompt D: You will be given a dialogue chat between an English language learner and a teacher. Your task is to generate the teacher's response to encourage conversational skills. The following are example dialogues with a teacher and a student.
5. Prompt E: You will be given a dialogue chat between two conversational partners. Generate the utterance that is appropriate within the dialogue context. The following are example dialogues.
Prompts A and B are designed to incorporate aspects of the tutor's teaching approach, with Prompt A exhibiting more desirable attributes (adopted from Tack and Piech (2022)). In contrast, Prompt B adopts a slightly different approach to probe the learner's understanding. Prompt C takes a neutral stance without any characteristics, putting more focus on the student-teacher roles of the dialogue participants. Prompt D attempts to generate responses that focus on the learning goal, namely second language acquisition skills, as specified in the TSCC corpus. Lastly, Prompt E removes the teacher-student roles and shifts towards dialogue participants with unspecified roles. The role tags in the few-shot examples are changed to Speaker A and Speaker B in this prompt.
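The way an instruction, the example dialogues, and the test context might be combined into a single few-shot prompt can be sketched as follows. The instruction strings are Prompts C and E as listed above; the layout (blank lines between blocks, an open teacher tag at the end), the mapping of student and teacher to Speaker A and Speaker B, and the example dialogues are assumptions for illustration rather than the exact format used in the experiments.

```python
PROMPT_C = (
    "You will be given a dialogue chat and your task is to generate a "
    "teacher response. The following are example dialogues with a teacher "
    "and a student."
)
PROMPT_E = (
    "You will be given a dialogue chat between two conversational partners. "
    "Generate the utterance that is appropriate within the dialogue context. "
    "The following are example dialogues."
)

# Role tags for Prompt E; which role maps to which speaker is an assumption.
NEUTRAL_TAGS = {"student": "Speaker A", "teacher": "Speaker B"}


def render(turns, tags=None):
    """Render (speaker, text) turns as lines, optionally renaming the roles."""
    lines = []
    for speaker, text in turns:
        tag = tags[speaker] if tags else speaker
        lines.append(f"{tag}: {text}")
    return "\n".join(lines)


def build_prompt(instruction, examples, context, tags=None):
    """Instruction, example dialogues, then the context with an open final turn."""
    responder = tags["teacher"] if tags else "teacher"
    parts = [instruction]
    parts += [render(example, tags) for example in examples]
    parts.append(render(context, tags) + f"\n{responder}:")
    return "\n\n".join(parts)


# Invented dialogues standing in for the handpicked few-shot examples.
examples = [
    [("student", "She don't like coffee."),
     ("teacher", "Careful with the verb: she doesn't like coffee.")],
]
context = [("student", "What is the difference between 'say' and 'tell'?")]

print(build_prompt(PROMPT_C, examples, context))
print()
print(build_prompt(PROMPT_E, examples, context, NEUTRAL_TAGS))
```

Keeping the examples and context fixed while swapping only the instruction and role tags isolates the effect of the prompt wording on the generated response.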

Implementation Details
We used the OpenAI Python library to call the GPT-3 engine to run inference on the test dialogues. Among the available models, we employed the top-performing text-davinci-003 in the zero-shot and few-shot scenarios, and text-ada-001 in the fine-tuned approach. Additionally, we compared the performance of davinci and ada in the few-shot experiments. We used the following parameters for all our experiments: temperature=0.7, max_tokens=100, top_p=0.8, frequency_penalty=0, and presence_penalty=0. We experimented with a range of values for max_tokens, including 20, 30, 70, 100, and 256. After some initial trials, we settled on max_tokens=100, as it produced a concise and relevant response most of the time. Across all our trials we kept the parameters and settings the same.
In our few-shot experimental settings, we intentionally disregarded example samples that lacked teaching material in the reference teacher response, such as turns that expressed acknowledgement, greetings, or parenthetical statements (for example, conversational turns like sure, okay, or hi). The few-shot dialogue examples were kept the same across the experiments.
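The decoding settings listed above can be collected into a single request payload, as in the sketch below. The parameter values are those reported in this section; the helper name build_request is hypothetical. With pre-1.0 versions of the OpenAI Python library, a dictionary like this could be passed as keyword arguments to openai.Completion.create; the network call itself is left out so that the sketch runs offline.

```python
# Decoding parameters as reported in the implementation details.
DECODING = {
    "temperature": 0.7,
    "max_tokens": 100,
    "top_p": 0.8,
    "frequency_penalty": 0,
    "presence_penalty": 0,
}


def build_request(prompt, model="text-davinci-003"):
    """Bundle a prompt and model name with the fixed decoding parameters."""
    return {"model": model, "prompt": prompt, **DECODING}


# The same settings are reused for every model, so only the model name changes.
davinci_request = build_request("student: hi\nteacher:")
ada_request = build_request("student: hi\nteacher:", model="text-ada-001")

print(davinci_request)
print(ada_request)
```

Building the payload in one place is a simple way to guarantee that the parameters stay identical across trials, as the setup above requires.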

Model Selection
We randomly selected fifty samples from the training data to constitute our internal test set for model selection. These samples did not overlap with the few-shot dialogue examples and thus allowed us to compare the training methods listed in Section 4.2. We used the automatic evaluation metric BERTScore (Zhang et al., 2019) and report the precision, recall, and F1 scores in Table 1 (all models were given Prompt A). The BERTScores show little variability across the models despite the apparent differences we noticed when inspecting the generated responses.

Models                     Prec.   Rec.    F1
Zero-Shot (davinci-003)    0.83    0.847   0.842
Few-Shot (ada-001)         0.848   0.839   0.844
Few-Shot (davinci-003)     0.840   0.844   0.842
Finetuned (ada-001)        0.811   0.836   0.824

Table 1: BERTScore evaluation of models on the internal test set

For our final system entry, we chose the Few-Shot davinci-003 model with Prompt A, as we believe this system generated the most meaningful responses for the purposes of the shared task. From our observation, both the few-shot and fine-tuned ada-001 models generated out-of-context and incoherent responses most of the time. We do not report BERTScores for the models given Prompts B to E, because their scores were consistent with those shown in Table 1 and because we did not have the resources for a human evaluation of the quality of the generated responses. Nevertheless, the generated responses piqued our interest, leading us to include a few in the Appendix.
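The construction of the internal test set used for model selection can be sketched as below: fifty training samples are drawn at random, excluding the few-shot example dialogues. The id format, the placeholder few-shot ids, and the fixed random seed are assumptions added for illustration; the paper does not state how the random draw was performed.

```python
import random


def select_internal_test(train_ids, few_shot_ids, k=50, seed=0):
    """Randomly pick k training samples that are not few-shot examples."""
    excluded = set(few_shot_ids)
    candidates = [i for i in train_ids if i not in excluded]
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    return rng.sample(candidates, k)


# Placeholder ids: 2,747 training dialogues, five of them used as examples.
train_ids = [f"train_{n:04d}" for n in range(2747)]
few_shot_ids = train_ids[:5]

internal_test = select_internal_test(train_ids, few_shot_ids)
print(len(internal_test), "held-out samples")
```

Excluding the few-shot examples before sampling keeps the prompt content and the evaluation data disjoint, which is what allows the training methods to be compared on the same held-out samples.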

Shared Task Results
Table 2 presents the results of our ADAIO System (Few-Shot davinci-003 model with Prompt A) during the development and evaluation phases of the shared task. The numbers in parentheses represent the system's rank among the top 10 entries. The deviation in BERTScore compared to the model selection results may be attributed to differences between the reference responses in the real test set and those in the training set. Apart from BERTScore, the shared task incorporates another automated dialogue evaluation metric known as DialogRPT (Gao et al., 2020). This metric assesses the generated response in relation to the preceding dialogue context, considering indicators such as updown (the average likelihood that the response receives the most upvotes), human vs rand (the average likelihood that the response is contextually relevant), human vs machine (the average likelihood that the response is human-written rather than machine-generated), and final (the average/maximum weighted ensemble score derived from all DialogRPT metrics). Our ADAIO System ranked second after the two phases.

Discussion
The evaluation results from both machine and human assessments of the generated responses on the test set provide evidence of the effectiveness of LLMs, particularly GPT-3, in tutoring dialogue applications. The dialogues in the TSCC corpus primarily concentrate on everyday speech and language usage, which proves advantageous for short conversational exchanges such as corrections, explanations, or clarifications. It is nevertheless crucial to examine the GPT-3 model's reliability in tutoring scenarios that involve longer sequences within a wider discourse context (Graesser et al., 1995). Furthermore, we perceived a limitation in relying solely on automatic evaluation metrics (as detailed in Section 5.1).
Prompt engineering to adapt language and tone in tutoring systems. Our experiments reveal an intriguing finding: manipulating the prompt influences the tone and language of the generated response, presenting an opportunity for tutoring systems to adapt to the students' learning styles and/or teaching goals. Further research should examine teaching instruction methods, potentially exploring the pedagogy of constructivist learning (Graesser et al., 2005) or engaging students in ill-structured exercises for productive failure (Kapur, 2008) using LLMs of this nature.
GPT-3's robust handling of errors and non-canonical forms of language. During the data preparation phase, a manual inspection of the data revealed the presence of grammatical and spelling errors in some utterances. Additionally, since the dataset originated from text-based chatroom conversations, there were instances where mathematical symbols were used instead of natural language, as in the example utterance "teacher: But e.g. pleased with their visit = good idea." It is worth noting that we did not employ any NLP processing toolkit to correct these errors or non-canonical forms in the dialogue utterances. Despite this, the GPT-3 model could still generate appropriate responses.
LLMs' potential in multilingual settings. In the context of L2 acquisition, the dialogic nature of the data in Caines et al. (2020) provides valuable opportunities for tutors to adapt to students' native languages. Such code-switching strategies have been found to enhance teaching, including the explanation of concepts (Köppe and Meisel, 1995), and leveraging AI tutoring systems can facilitate this process.
LLMs possess multilingual capabilities that enable them to address language barriers, accommodate low-resource languages, and exhibit promising performance even on unseen languages (Yong et al., 2022). To enhance accessibility, the development and adoption of open-source multilingual models, such as BLOOM (Scao et al., 2022), should be encouraged, thereby facilitating the use of LLMs in educational applications across diverse linguistic contexts.

Conclusion
In this paper, we have presented our system entry to the BEA 2023 Shared Task on AI-teacher response generation. Our approach investigates the capability of the state-of-the-art generative language model, OpenAI GPT-3, in addressing the requirements of the AI teacher challenge outlined by Tack and Piech (2022). Through experimentation with zero-shot, few-shot, and fine-tuning techniques, we investigated the adaptability of the system's responses using carefully designed prompts and handpicked dialogue examples that emphasize desirable teacher qualities. Our submitted system, featuring a few-shot prompt-based method, achieved 2nd place in the BEA 2023 Shared Task challenge.

Figure 1: The framework for the proposed GPT-3 based Intelligent Tutoring System. Depending on the experimental setup, the specified prompt followed by a few handpicked dialogue examples (if applicable) is sent to the LLM (GPT-3) to generate an AI-teacher response.

Table 2: BEA Shared Task official results of the ADAIO system