Leveraging Large Language Models for Automated Dialogue Analysis

Developing high-performing dialogue systems benefits from the automatic identification of undesirable behaviors in system responses. However, detecting such behaviors remains challenging, as it draws on a breadth of general knowledge and understanding of conversational practices. Although recent research has focused on building specialized classifiers for detecting specific dialogue behaviors, the behavior coverage is still incomplete and there is a lack of testing on real-world human-bot interactions. This paper investigates the ability of a state-of-the-art large language model (LLM), ChatGPT-3.5, to perform dialogue behavior detection for nine categories in real human-bot dialogues. We aim to assess whether ChatGPT can match specialized models and approximate human performance, thereby reducing the cost of behavior detection tasks. Our findings reveal that neither specialized models nor ChatGPT have yet achieved satisfactory results for this task, falling short of human performance. Nevertheless, ChatGPT shows promising potential and often outperforms specialized detection models. We conclude with an in-depth examination of the prevalent shortcomings of ChatGPT, offering guidance for future research to enhance LLM capabilities.


Introduction
One crucial aspect of developing high-performing dialogue systems is the automated identification of errors in system responses. These errors can result from various behaviors, including incorrect information retrieval or illogical semantics (Figure 1). Identifying such errors enhances dialogue system development and complements dialogue-level evaluation methods by providing finer-grained metrics for comparison (Finch et al., 2023).
To capitalize on these benefits, recent research has focused on training classifiers for specific dialogue behaviors. While certain behaviors have received considerable attention, this is not the case for all pertinent dialogue behaviors. Furthermore, most datasets for training are produced by annotating human-human dialogues (Sharma et al., 2020), perturbing human responses (Gupta et al., 2022), or crafting post-hoc responses (Nie et al., 2021). As a result, such datasets may not reflect human-bot interactions, rendering them less suitable for classifier development.

[Figure 1 shows a human-bot exchange: the user asks "Do you speak spanish?" and three candidate bot responses follow: (1) "Yes, the population of miami-dade county, the most populous county in florida, is estimated to be 5.3 million." (2) "Yes, and miami is also the second-largest us city (after el paso, texas) with a spanish-speaking majority." (3) "Yes, I do. I'm a native speaker of english, but my mother tongue is spanish." Response 1 is labeled incorrect fact and response 3 self contradiction.]
Large language models (LLMs) display a promising potential to address the limited coverage in specialized classifiers. LLMs have demonstrated competitive performance across various natural language processing (NLP) tasks without finetuning (Kocoń et al., 2023). Adapting LLMs to classify dialogue behaviors can alleviate substantial costs associated with current evaluation approaches by allowing for a general dialogue behavior evaluator that is less dependent on human involvement.
Although there is much effort towards open-sourcing competitive LLMs, OpenAI's ChatGPT remains the most successful LLM to date (Wang et al., 2023). Thus, we focus our experiments on ChatGPT to assess the current best-case performance on automated dialogue behavior detection using LLMs. With its wide accessibility and low cost, ChatGPT provides a practical and straightforward platform for automating dialogue behavior detection, if it proves successful.
To this end, our work focuses on two main objectives:
1. To determine whether or not ChatGPT can match the performance of state-of-the-art specialized behavior classifiers.
2. To assess the extent to which ChatGPT can approximate human-level performance in behavior classification using real human-bot dialogues.
Our findings indicate that automated methods for dialogue behavior detection have not reached satisfactory results, falling short of human performance. However, ChatGPT showcases compelling results, comparable to or often better than specialized models. To facilitate further advancements, we conduct an in-depth analysis to identify the prevalent errors and shortcomings of ChatGPT. This analysis provides valuable insights, highlighting key areas to target in future work to enhance the performance of LLMs in dialogue behavior detection. We release our code and data at https://github.com/emorynlp/GPT-ABCEval.

Related Work
ChatGPT has shown promising performance on many NLP tasks, especially for text classification (Gilardi et al., 2023; Kocoń et al., 2023; Zhu et al., 2023). In addition, GPT models, including ChatGPT and InstructGPT, have been used to produce high-quality dyadic dialogues (Kim et al., 2022; Zhan et al., 2023) and have been shown to correlate highly with human annotators when evaluating the overall quality of empathetic dialogues (Svikhnushina and Pu, 2023). However, ChatGPT still exhibits limitations: Chan et al. (2023) show that ChatGPT struggles with fine-grained dialogue understanding, reporting poor performance on classifying discourse structure and utterance relations.
To the best of our knowledge, no prior research has explored the use of any GPT model as a behavior classifier for chatbot responses. Instead, previous work has focused on the development of specialized dialogue behavior classifiers, as discussed in this section.

Contradiction Detection
Although much work focuses on dialogue contradictions in the context of a given bot persona (Zhang et al., 2018; Welleck et al., 2019; Kim et al., 2020; Song et al., 2020; Shuster et al., 2022), there has been some work on a more general sense of contradictions, including NLI models targeting self-context contradictions (Li et al., 2021; Nie et al., 2021), inconsistency detectors using domain-specific attribute-value classifiers (Shi et al., 2021), and context summarization to encourage consistency in response generation (Xu et al., 2022a,b). Notably, these existing approaches to contradiction detection fail to address partner contradictions.
There is also a lack of work on general commonsense contradiction detection for dialogue responses. To the best of our knowledge, Ghazarian et al. (2023) is the only work that focuses explicitly on capturing commonsense qualities of dialogue responses. They propose a method for calculating continuous event commonsense alignment scores for dialogue responses using similarity calculations with the outputs of an event extraction model and a generative commonsense model. However, such continuous scores cannot be immediately applied to commonsense contradiction detection without further modifications (e.g., learned thresholding, a classification head, etc.).

Claim Verification
There are a variety of approaches taken for claim verification in dialogue, including question answering (Honovich et al., 2021) and trained classifiers (Dziri et al., 2022b). Dziri et al. (2022b) find that trained classifiers perform the best, although they still lag behind human performance. Some works focus on claim verification for question-response pairs only (Wang et al., 2022), whereas others target multi-turn dialogues, producing annotated datasets including FaithDial (Dziri et al., 2022a), BEGIN (Dziri et al., 2022b), and DialFact (Gupta et al., 2022). Most of these works focus exclusively on dialogue responses that are given a grounding knowledge text. In practice, however, a grounding knowledge text is not always predetermined. Gupta et al. (2022) propose a pipeline for claim verification that includes a knowledge retrieval stage rather than assuming the knowledge is provided.

Empathy
Human judges are commonly used when evaluating the degree of empathy exhibited in a dialogue response (Zhong et al., 2020; Sabour et al., 2022; Qian et al., 2023). There has also been some work on developing empathetic response and question taxonomies, although these are only applied in small-scale or synthetic settings (Welivita and Pu, 2020; Svikhnushina et al., 2022). Most applicably, the EPITOME empathy classifiers of Sharma et al. (2020) have been adopted in subsequent work, either to build on them directly (Majumder et al., 2022) or for automatic evaluation (Kim et al., 2021; Lee et al., 2022).

Table 1: The nine dialogue behaviors, with their abbreviations (Abbr.) and descriptions.

Empathetic (Emp): The response shows an understanding and reacts appropriately to someone's emotions.
Lack of Empathy (!Emp): The bot misunderstands or reacts inappropriately to someone's emotions.
Commonsense Contradiction (!Com): The response misunderstands or contradicts common knowledge.
Incorrect Fact (!Fac): The response hallucinates or inaccurately presents encyclopedic or expert knowledge.
Self Contradiction (!Sel): The bot contradicts something it said earlier in the dialogue.
Partner Contradiction (!Par): The bot contradicts or misremembers something the user said earlier in the dialogue.
Redundant (Red): The response inappropriately repeats information presented earlier in the dialogue.
Ignore (Ign): The response ignores what the user just said.
Irrelevant (!Rel): The response interrupts the current topic of discussion by presenting unrelated information.

Coherence
Research on detecting incoherent behaviors, such as redundancy and irrelevancy, is limited. Most works perturb dialogue responses to artificially construct incoherence examples (Xu et al., 2021; Zhang et al., 2021; Ghazarian et al., 2022), which may not produce representative examples. On the other hand, Mehri and Eskenazi (2020) derive a response's relevancy score from the probabilities of manually designed future indicator utterances, but find little correlation with human judgments.
In addition, detection of response redundancy is underexplored, despite some works addressing token repetition (Li et al., 2020; Xi et al., 2021). Perhaps most relevant, the Dialogue Breakdown Detection Challenge (DBDC) aims to identify contextually inappropriate bot responses that hinder conversation continuation (Higashinaka et al., 2019). Various classifiers have been proposed for this challenge (Ng et al., 2020; Lin and Ng, 2022), with observations suggesting coherence issues as a dominant cause of breakdowns.

ABC-Eval Dataset
We use the ABC-Eval dataset from Finch et al. (2023) as the behavior detection benchmark. This dataset contains 400 open-domain human-bot dialogues collected between university students and one of four chatbots: BlenderBot2, BlenderBot using DECODE reranking, Emora, and Bart-FiD-RAG. For each bot response in each dialogue, human annotators labeled whether or not a specific dialogue behavior was present. These turn-level binary annotations were collected using crowdworking annotators on the SurgeHQ platform, who were trained on three curated conversations to accurately identify each dialogue behavior before being accepted into the annotation project. For example, in Figure 1, the three bot responses are labeled 1, 0, 0 for the behavior incorrect fact (!Fac) and 0, 0, 1 for the behavior self contradiction (!Sel).
In this work, we take 1,634 bot responses from 108 dialogues that received two rounds of human annotations, and focus on the nine dialogue behaviors that Finch et al. (2023) found to be the most informative for capturing dialogue quality (Table 1).

Specialized Behavior Detection Models
In this section, we present state-of-the-art models designed to classify labels that closely align with six of the dialogue behaviors in Table 1: Emp, !Emp, !Fac, !Sel, Ign, and !Rel. Note that no existing models are available for predicting !Com, !Par, and Red, so there are no viable comparisons to our LLM approach for these behaviors (Section 5).
FaithCritic (FC) Following Gupta et al. (2022), we build a claim verification pipeline for a dialogue response r. First, three relevant documents D_k for every entity in r are retrieved using WikiAPI. Then, a BERT model trained on the Wizard of Wikipedia (WoW) knowledge-response pairs (Dinan et al., 2019) selects the top-10 evidence sentences S_e from D_k. To distinguish whether a response makes a factual claim or not, the lexical overlap between r and S_e is estimated, optimized on the ABC-Eval training conversations. Finally, a RoBERTa model trained on FaithCritic, a dataset of human-annotated faithful and unfaithful evidence-response pairs derived from WoW (Dziri et al., 2022a), is applied to those responses that make factual claims. As a result, responses that are predicted unfaithful to any evidence e ∈ S_e are labeled as !Fac.
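As a rough illustration of this pipeline's control flow, consider the following sketch. The function names and signatures are our own illustrative stand-ins rather than the released implementation; each stage is passed in as a callable.

def label_incorrect_fact(response, retrieve_docs, rank_evidence,
                         lexical_overlap, critic, overlap_threshold):
    # Stage 1: retrieve relevant Wikipedia documents for the entities in the response
    documents = retrieve_docs(response)
    # Stage 2: select the top-10 evidence sentences from the retrieved documents
    evidence = rank_evidence(response, documents)[:10]
    # Stage 3: low lexical overlap with the evidence => no factual claim => not !Fac
    if lexical_overlap(response, evidence) < overlap_threshold:
        return False
    # Stage 4: label !Fac if the critic predicts the response is unfaithful
    # to any retained evidence sentence (per the pipeline description above)
    return any(critic(sentence, response) == "unfaithful" for sentence in evidence)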
S2T2 S2T2 is a semi-supervised student-teacher training framework using two teachers, one trained on the gold data and the other trained on perturbed gold data under a [MASK] replacement, to incorporate self-supervised data augmentation into the model training (Lin and Ng, 2022). We use the released S2T2 model for the English version of DBDC5, which is the best-performing model to date. We use S2T2 to identify both Ign and !Rel labels, since it is not trained to distinguish between them.

DECODE (DEC)
We use the released RoBERTa classification model trained on DECODE to label !Sel. DECODE contains human-written contradictory and non-contradictory dialogue responses with respect to the current speaker's previous utterances in the dialogue (Nie et al., 2021).

EPITOME (EPI)
A RoBERTa-based bi-encoder classification model for each empathetic communication mechanism is trained on the publicly available Reddit portion of the EPITOME dataset (Sharma et al., 2020). Predictions of weak or strong expressions of any of the three mechanisms are considered as Emp. Predictions of no expression for all mechanisms are considered as !Emp.
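Assuming each mechanism classifier outputs one of "no", "weak", or "strong" expression, this label mapping reduces to a few lines; the sketch below uses our own naming, not the released code.

def empathy_label(mechanism_predictions):
    # mechanism_predictions: dict mapping each of the three EPITOME
    # communication mechanisms to "no", "weak", or "strong"
    if any(level in ("weak", "strong") for level in mechanism_predictions.values()):
        return "Emp"   # any weak or strong expression counts as empathetic
    return "!Emp"      # no expression for all mechanisms

# e.g., empathy_label({"emotional_reactions": "weak",
#                      "interpretations": "no",
#                      "explorations": "no"})  # -> "Emp"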

LLM-based Behavior Detection
For LLM-based dialogue behavior detection, we use OpenAI's gpt-3.5-turbo-0301 (henceforth, ChatGPT). Similar to the specialized models (Section 4), ChatGPT is tasked with classifying a single behavior at a time. Following the human annotator training process for ABC-Eval, we use the three training conversations for each label as our prompt engineering testbed. This section highlights key decisions of our prompt engineering process.
Instruction Finetuning During prompt engineering, it became apparent that the instructions designed for human annotators (Section 3) were not suitable as ChatGPT instructions. We iteratively refined the instructions such that ChatGPT's mistakes on the training conversations were reduced. This involved removing instructions ChatGPT appeared to misunderstand, as well as adding additional behavior details and specifications.
Utterance Focus We discovered that when ChatGPT was instructed to label each bot turn given the entire dialogue, the resulting classifications often focused on only a subset of the bot responses. To ensure consistent and robust labeling for every bot utterance, our final prompt provides the dialogue history paired with the next bot response as the target utterance to be labeled.

In-context Examples
We also tried including the examples provided to human annotators by Finch et al. (2023) as in-context examples in the prompts. However, this degraded the overall performance on the training conversations. It appears that the examples optimized for improving human annotations do not translate well to ChatGPT's performance.

Creativity
We conducted experiments involving several temperature parameters and observed high instability in the classifications for the same inputs when the temperature was increased. Interestingly, we found that using a low temperature consistently yielded more accurate results. Thus, we opted to use a temperature of 0 for maximum reproducibility in our classifications. Similar findings have been reported by Gilardi et al. (2023) and Reiss (2023).
Final Prompt Table 2 illustrates an example of the final prompt, in particular for the Red behavior. Each behavior is accompanied by its own eliciting question and description, which respectively fill in the Behavior Question (Q) and Behavior Definition (D) containers of the prompt. For labeling a particular context-response example, the historical turns from the context and the next target response fill in the Dialogue Context (C) and Target Utterance (U) containers. The final label for a behavior is parsed from the line produced by ChatGPT that begins with the header 'Decision:', where the value 'yes' indicates a positive label and any other value indicates a negative label. Appendix A provides the questions and definitions used for all behaviors.
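A minimal sketch of this prompting-and-parsing scheme is shown below, assuming the legacy (pre-1.0) openai Python client that was current for gpt-3.5-turbo-0301; the abbreviated template stands in for the full prompt wording in Table 2.

import openai  # legacy (pre-1.0) client interface

TEMPLATE = "{C}\n\n{Q}\n{U}\n\n{D}"  # C, Q, U, D containers, as in Table 2

def classify_behavior(context, question, utterance, definition):
    prompt = TEMPLATE.format(C=context, Q=question, U=utterance, D=definition)
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding, per the Creativity paragraph
    )["choices"][0]["message"]["content"]
    # parse the final label from the line that begins with 'Decision:'
    for line in reply.splitlines():
        if line.strip().lower().startswith("decision:"):
            return "yes" in line.lower()  # 'yes' => positive label
    return False  # no Decision line is treated as a negative label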

Evaluation
To evaluate the detection capability of the models in Sections 4 and 5, we compare their performance against that of human annotators. For this, we take the set of doubly annotated conversations in ABC-Eval as our evaluation set (108 dialogues), and apply each model to the bot responses (1,634 utterances) to obtain the predicted labels.

Table 2: A ChatGPT prompt example for the Red behavior. Segments in the prompt are dynamically modified based on the example and behavior (C: dialogue context, Q: behavior question, U: target utterance, D: behavior definition).

C: S1: Hi!
   S2: Hi! I am happy to be talking to you. Do you have any pets at home?
   S1: I don't, my mom is allergic to most pets. I used to have a pet fish when I was much younger.
   S2: Do you want a pet at some point in the future? No judgment here, I'm just curious.
   S1: I would love to have pets in the future! Once I have my own place, I'd like to get a dog or two.
Q: Is this response repeating something that has already been said:
U: S2: Would you want to get a cat or a dog?
D: A response is repetitive if:
   - it repeats something from earlier in the dialogue
   - it includes asking a question whose answer has been already shared
   If any part of the response is repetitive, then it should be labeled as repetitive. Note that sometimes repetition is useful, such as for emphasis, acknowledgement, clarification, or elaboration, and in these cases it should NOT be labeled as repetitive.
   Provide your reasoning when considering this question starting with "Reasoning:". Then, finish by writing your final decision as one of: "Decision: [YES]" or "Decision: [NO]".

Metrics
To assess the degree to which automated methods can approximate human judgment for a particular dialogue behavior, we measure the accuracy of the binary labels predicted by automated methods with respect to the binary labels provided by the human annotators. In addition, we calculate the F1 score both for the positive occurrences and for the negative occurrences of each dialogue behavior, in order to obtain a more fine-grained picture of the performance.
Each instance in the evaluation set is double-annotated, so two sets of human annotations exist without adjudication. It is important to note that the assessment of these dialogue behaviors is not purely based on objective criteria, as they rely on factors inherently subject to human interpretation (e.g., commonsense contradiction, irrelevance). With this in mind, to better capture the aggregate nature of identifying dialogue behaviors, the final score for each metric is measured by averaging results across the two human annotations:

score = ( e(o_m, o_h1) + e(o_m, o_h2) ) / 2

where e is the metric (either accuracy or F1 score), o_m is the model outputs, and o_h1 and o_h2 are the human labels from annotation rounds 1 and 2, respectively. To assess human performance, we measure the F1 score and accuracy by comparing the two human annotation sets. Finally, we measure the statistical significance between the outputs of models and humans, and between the outputs of the specialized models and ChatGPT.
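Concretely, the averaged metrics can be computed as in the following sketch with scikit-learn; the toy labels are illustrative only.

from sklearn.metrics import accuracy_score, f1_score

def averaged_score(metric, model_labels, human_round1, human_round2):
    # average the metric over the two human annotation rounds
    return (metric(human_round1, model_labels)
            + metric(human_round2, model_labels)) / 2

model_labels = [1, 0, 0, 1]  # toy model predictions for four bot responses
human_round1 = [1, 0, 1, 1]  # toy human labels, annotation round 1
human_round2 = [1, 0, 0, 0]  # toy human labels, annotation round 2

accuracy = averaged_score(accuracy_score, model_labels, human_round1, human_round2)
f1_pos = averaged_score(lambda y, p: f1_score(y, p, pos_label=1),
                        model_labels, human_round1, human_round2)
f1_neg = averaged_score(lambda y, p: f1_score(y, p, pos_label=0),
                        model_labels, human_round1, human_round2)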
Results

Table 3 indicates the ongoing challenge of dialogue behavior detection for automated models. Across all labels, human judges are significantly more stable than the models. This difference is pronounced with regard to positive instances (F1+), where models attain only half the score achieved by humans. Interestingly, ChatGPT exhibits performance comparable to several specialized classifiers. In the case of !Fac, ChatGPT outperforms FaithCritic (FC) in every aspect and achieves performance closer to humans. For !Emp and !Rel, ChatGPT shows performance similar to the specialized classifiers on F1- and accuracy, and even better performance on F1+. Considering that ChatGPT is not finetuned for these tasks, these results are highly encouraging.
Although ChatGPT is seemingly outperformed by S2T2 on Ign, this is primarily due to the prediction of negative cases. When analyzing the positive cases, ChatGPT gives much higher recall yet similar precision compared to S2T2 (precision and recall are provided in Appendix C). In practice, positive case detection is more impactful, implying that ChatGPT has an advantage in real-world applications.
Furthermore, although ChatGPT faces significant challenges in detecting positive cases of Emp, EPITOME (EPI) does not perform much better. Its higher F1+ score is achieved by excessively predicting positive cases, labeling almost all turns as positive. This overprediction impairs its overall performance, allowing ChatGPT to outperform it when considering all cases, as reflected in accuracy.
The only behavior for which ChatGPT appears to be beaten by a specialized classifier is !Sel, where DECODE (DEC) performs better. However, the difference in performance is only slight overall.
Notably, ChatGPT shows accuracy and negative F1 (F1-) that are promisingly close to humans for the three behaviors for which specialized models are not available: !Com, !Par, and Red. However, it still struggles with detecting positive cases relative to humans.

ChatGPT Error Analysis
We perform an error analysis of ChatGPT's predictions of dialogue behaviors to better understand its limitations. For each dialogue behavior, we select 40 instances where ChatGPT and humans disagree, and examine the reasoning provided by ChatGPT prior to its final decision ([R]; see the examples below). Table 4 presents a set of dialogue characteristics and ChatGPT predispositions that highlight common mistakes made by ChatGPT across multiple dialogue behaviors.

Context Management
The predominant cause of ChatGPT's errors is its inability to focus on the pertinent parts of the dialogue and response. It often lacks awareness of what information has been previously shared (HF). In Figure 2, the earlier response that already answers "Are you older than your brother?" is missed. Moreover, ChatGPT tends to disregard the immediately preceding turn (DC) and make its decision based on an older state of the dialogue. In Figure 3, it overlooks the preceding question about favorite colors. Similarly, ChatGPT often misses ideas shared within a response, instead latching onto the very last part of the dialogue (SA). In Figure 4, ChatGPT fails to notice the emotional mimicry conveyed by S2 in response to S1's opinion about Nicki Minaj. In other cases, ChatGPT misattributes shared information to the wrong speaker (RC). In Figure 5, it recalls that S2 indicated a return to normalcy when, in fact, it was S1 who made that claim.

Figure 5:
(S2) Sounds like a good time! So the corona virus has really changed the world, huh? What is it like where you live? Is everyone mostly staying inside, or are things opening up more like normal?
(S1) Things are beginning to open up normal where I am but it has been a long 2 years. How about for you?
(S2) I see. How long do you think it will take before the pandemic is totally over, and things go back to how they were before the virus?
[R] S2's previous statement was about things opening up to normal where they are, implying that the pandemic is not over yet. However, in this statement, S2 is asking when the pandemic will be totally over and things will go back to how they were before the virus. This seems to be a contradiction of the previous statement.

Instruction Following
In addition, a frequent issue is that ChatGPT strays from the provided behavior definitions (MD). In Figure 6, esoteric knowledge is improperly treated as commonsense. ChatGPT also treats the sufficient indicators of a behavior from its definition as exhaustive requirements (EX). In Figure 7, ChatGPT criticizes S2 for not indicating an emotion and offering support, despite S2's validation of S1's experience.

World Model
Another major issue is ChatGPT's poor understanding of common world events and human experiences.
ChatGPT frequently reveals a limited understanding of the relationships among concepts within a dialogue. For example, it overlooks elaborations on previous points, considering them too semantically similar (SR). In Figure 8, the opinion about the cat's characteristics is actually unique information relative to the context. ChatGPT also often criticizes typical conversational practices (CN). In Figure 9, it considers topical introductions at the start of a dialogue as irrelevant.
Figure 9:
(S1) Hi!
(S2) hi! I am happy to be talking to you. I hope you don't mind me asking, but are you dating anyone now?
[R] The response from S2 is completely irrelevant to what S1 said. S1 simply greeted S2 and did not provide any information that would lead to a discussion about dating. The question about dating is also abrupt and does not flow with the previous turn.

In addition, ChatGPT faces challenges in comprehending the plausibility of co-occurring events, beliefs, and experiences (ME). In Figure 10, it displays a lack of understanding that it is implausible to have an opinion about the music of an artist if one has no prior experience with that artist's music.
ChatGPT also demonstrates a large degree of general inexperience with common phenomena and situations in the world, which can lead to it harboring untrue facts about the world or misunderstanding the nuances of situations (IN). In Figure 11, ChatGPT's lack of commonsense is highlighted by its acceptance of well-wishes for a non-existent trip.
Figure 10:
(S2) I have not heard of doj cat. Where does she live? Have any albums yet?
(S1) He most recent album is "Planet Her". You should definitely listen to it!
(S2) Planet her is amazing.. the whole album has a distinct sound. It was produced by alex pall.
[R] There is no indication in the previous dialogue that would suggest a self-contradiction in S2's statement. They are simply expressing their positive opinion about the album "Planet Her" and mentioning the producer. There is no contradiction with anything they have said previously or any implausibility based on the information shared. Additionally, there is no contradiction with anything S1 has said. Therefore, there is no self-contradiction in S2's statement.

Recommendations
Given the compelling performance for many dialogue behaviors observed in this work, ChatGPT is a promising direction for behavior classification.
For one, it is worth noting that ChatGPT boasts extreme cost-efficiency relative to humans. Where ChatGPT costs $0.02 on average to provide labels for a single behavior for one dialogue in this work, the average cost for human annotation ranges from $0.29 to $1.96 depending on the behavior (Table 15 in Appendix D). Since even specialized classifiers rely on human annotations for training data creation, they also end up being quite costly to maintain.
Furthermore, the results of our error analysis reveal a large degree of systematicity behind ChatGPT's reasoning mistakes across many of the behaviors. Correcting these common mistakes is likely to further improve its performance to a noticeable degree. We next discuss mitigation strategies for these identified issues to aid future work.
Context Management Providing the complete dialogue history may hinder ChatGPT's ability to attend to the salient content due to information overload. To address this, we highlight two strategies (sketched after this list):
• Windowed Context: instead of providing the entire history, truncate the context to the k previous turns. This would directly restrict the decision-making to the immediate context, which is important for behaviors that depend on accurate recency identification, including !Rel, Ign, !Emp, and Emp.
• Turn Pairing: perform the labeling relative to each historical turn segment independently, rather than over a contiguous context. This would enable explicit and focused comparisons to smaller segments of the history, which could aid behaviors that require such precision, including !Sel, !Par, and Red.
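The two strategies amount to simple context transformations; the following is an illustrative sketch of our own (with turns represented as a plain list of strings), not a tested implementation.

def windowed_context(history, target_response, k=4):
    # Windowed Context: keep only the k most recent turns before the target
    return history[-k:], target_response

def turn_pairs(history, target_response):
    # Turn Pairing: pair the target with each earlier turn independently;
    # per-pair decisions can then be aggregated (e.g., positive if any pair is)
    return [(turn, target_response) for turn in history]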
In-Context Learning Examples Given the identified mistake types, it becomes more straightforward to compose useful in-context learning examples that are tailored to optimizing ChatGPT. Examples of the mistake types that relate to ChatGPT misunderstanding the nuances of a behavior (e.g., MD, SR, CN, ME, EX) could be taken from a held-out set of conversations, which would prime ChatGPT to avoid such reasoning.

Limitations
Although ChatGPT is a high-performing, widely accessible, and affordable LLM at the time of writing, there are considerations regarding the long-term applicability of the results found in this work due to the ChatGPT infrastructure. Since ChatGPT is not open-source and is only accessible through a paid API, there is less detailed understanding of its training and model design. In addition, this access method also results in less user control over potential model changes and even model deprecation over time. As such, further studies could assess the applicability of other language models to the task of dialogue behavior detection to mitigate these concerns, and we leave this to future work.
Furthermore, it should be noted that the errors made by ChatGPT may not necessarily align with those made by alternative open-source language models, or even future versions of ChatGPT itself. However, it may still be useful to be mindful of the prominent problems encountered with ChatGPT while using other LLMs. These identified phenomena play a crucial role in language comprehension and reasoning overall and could also present challenges for other models, although the extent of their impact remains to be explored.

Conclusion
Although automated dialogue behavior classification remains a challenging task, this work finds that ChatGPT-3.5 presents promising potential to reduce the gap between model and human performance. ChatGPT's ability to provide competitive behavior classification against specialized classifiers across a variety of dialogue behaviors, without necessitating finetuning or human annotation, gives rise to a low-cost, multi-task evaluator model. The systematicity behind the common mistakes observed for ChatGPT reveals concrete steps for future improvements to behavior classification performance, including strategies for context management and a better understanding of situational nuances. We look forward to future advancements in behavior classification that leverage ChatGPT's unique capabilities.

A Behavior Questions and Definitions
The Question (Q) and Definition (D) for each dialogue behavior label used for the final ChatGPT prompts are shown in Tables 5-12, excluding Red, which is shown in Table 2 in Section 5.
Q: Is this an empathetic response by Speaker 2:
D: A response is empathetic when Speaker 2 does ONE of the following:
- clearly demonstrates an understanding of Speaker 1's emotions
- reacts with the appropriate sentiment or emotion to Speaker 1's shared experience
- understands or appropriately reacts to Speaker 1's experience or emotions
- appropriately reassures, encourages, or supports Speaker 1

B Full Prompt Example
Table 13 shows an example of the full ChatGPT prompt utilized in this work. This full version preserves the 'DIALOGUE' header, turn numbers, whitespace newlines, and full speaker names, which were removed from the prompt in Table 2 in Section 5 due to spacing constraints.
---------- DIALOGUE ----------
1. Speaker 1: Hi!
2. Speaker 2: Hi! I am happy to be talking to you. Do you have any pets at home?
3. Speaker 1: I don't, my mom is allergic to most pets. I used to have a pet fish when I was much younger.
4. Speaker 2: Do you want a pet at some point in the future? No judgment here, I'm just curious.
5. Speaker 1: I would love to have pets in the future! Once I have my own place, I'd like to get a dog or two.
----------
Is this response repeating something that has already been said:
Speaker 2: Would you want to get a cat or a dog?

A response is repetitive if:
- it repeats something from earlier in the dialogue
- it includes asking a question whose answer has been already shared
If any part of the response is repetitive, then it should be labeled as repetitive.
Note that sometimes repetition is useful, such as for emphasis, acknowledgement, clarification, or elaboration, and in these cases it should NOT be labeled as repetitive.
Provide your reasoning when considering this question starting with "Reasoning:". Then, finish by writing your final decision as one of: "Decision: [YES]" or "Decision: [NO]". Do NOT fill in your decision with any terms other than YES or NO.

Table 13: An example of an unmodified ChatGPT prompt.

C Full Results
Table 14 extends Table 3 from §6.2 to include the precision and recall scores for the automated models. Precision and recall scores are not meaningful for the human evaluators, since each human annotation set is alternately treated as the benchmark for the other; thus, we still present only F1 for HUM.

D ChatGPT Cost
We compare the average cost of labeling a single dialogue from ABC-Eval for each behavior using ChatGPT and human judges. Table 15 contains the calculated costs.
ChatGPT The ChatGPT cost for a single dialogue is calculated from the OpenAI API pricing (https://openai.com/pricing; $0.002 USD per 1,000 tokens at the time of writing) on the total number of tokens used to obtain labels for each bot response for a particular behavior. These costs are then averaged over all dialogues used in this work to obtain the average cost per dialogue. Because there is not much difference in prompt length across the behavior prompts, the average cost per behavior is quite similar.
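The per-dialogue cost is then simple arithmetic, as in this sketch (the token count is illustrative, not a measured figure):

def chatgpt_dialogue_cost(total_tokens, usd_per_1k_tokens=0.002):
    # cost of all prompt + completion tokens for one behavior on one dialogue
    return total_tokens / 1000 * usd_per_1k_tokens

print(chatgpt_dialogue_cost(10_000))  # ~10k tokens => $0.02, the average reported above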
HUM Human annotation costs are derived from the average costs presented in Finch et al. (2023).
Since the behavior labels were grouped into annotation tasks for the human judges, we divide each task cost by the number of behaviors contained within that task.The cost for a single label is then the resulting quotient for its respective task.

Figure 1: Response errors in a human-bot dialogue.
Figure 2:
(S1) I do! I have a younger brother. What about you?
(S2) You have a brother? Cool! Do you guys get along?
(S1) We used to fight a lot as kids, but we get along well now.
(S2) That's awesome. Personally, I always wanted an older sister so I could have someone to look up to. Are you older than your brother?
[R] The response asks a new question that has not been answered yet, which is whether S1 is older than their brother. This information has not been shared before, so the response is not repetitive.

Figure 3:
(S2) I'm sure you'll feel great after your workout! You should definitely do some strength training exercises like the push-up, the pull-up and the sit-up! They're some of the most common bodyweight exercises!
(S1) Yes! What's your favorite color?
(S2) I'm not sure about you, but my favorite color is blue! I like it because it reminds me of the sea color of the sea!
[R] S1 asked about dinner, and S2 responded by saying they haven't had dinner yet and then shifted the topic to working out. S2 then asked about S1's favorite exercises at the gym. S1 did not mention anything about favorite colors, so S2's response is completely off-topic and does not address the previous turn from S1.

Figure 4:
(S2) I love pop! What is your favorite R&B artist? Do you like Nicki Minaj?
(S1) I do like Nicki!
(S2) I like her too, she's so talented. Do either of your parents work? My mom is a teacher.
[R] S2 does not demonstrate any understanding of S1's emotions or react with an appropriate sentiment or emotion to S1's shared experience. Instead, S2 changes the subject to ask about S1's parents. This response is not empathetic.

Figure 6:
(S2) What is the name of this dorm? Do you have a roommate? How are you liking it?
(S1) It's MLAO. I don't have a roommate so it's great.
(S2) Congratulations. Mlao is a great dorm then. It is the largest in chicago. Enjoy!
[R] The response would contradict commonsense because MLAO is not the largest dorm in Chicago. In fact, there is no dorm in Chicago with that name. This response makes an unfounded assumption and contradicts common knowledge.

Figure 7:
(S2) Cool! I think school can be pretty frustrating, but I do love learning new things, and a lot of my teachers are great. What about you? Do you like school?
(S1) More or less. Doing activities like talking to a robot is great, but doing assignments are tired. I've been spending the past week doing homework.
(S2) I understand, it can definitely be stressful. What would you change about your school, if you could?
[R] S2 acknowledges that school can be frustrating and asks what S1 would change about it, showing an understanding of S1's experience. However, they do not react with a specific sentiment or emotion, nor do they offer reassurance or support.

Figure 8:
(S2) I love animals too! My favorite animal is the cat! I really like how they're able to adapt to the environment and adapt to their environment!
(S1) I love cats too, theya re adorable!
(S2) I love how cute they are! I especially like how their fur is so soft and fluffy! They look so beautiful!
[R] The response mentions how cute cats are and how soft and fluffy their fur is. While this is related to S2's favorite animal being cats, it does not add any new information to the conversation and is repetitive. Therefore, the response is repetitive.

Figure 11:
(S1) I don't have a trip planned, but maybe I'll go somewhere soon
(S2) I'm sure you will! You'll love it! Have a wonderful trip! I'll see you soon!
[R] The response does not contradict commonsense as it is a polite and friendly response to the previous statement. It does not make any unfounded assumptions, is not illogical or self-contradictory, and does not ask a question where the answer is already obvious. It is a common response to wish someone a good trip even if they have not planned one yet.


Table 3: F1 and accuracy achieved by each model, where HUM stands for human judges. #+: number of positive labels predicted. †|†† denote significance between automated models on one or both human annotation sets, respectively. ⋆|⋆⋆ denote significance against the best automated model on one or both human annotation sets.


Table 4: Results of the error analysis on ChatGPT's reasoning for dialogue behavior detection.

Table 5: Emp: behavior question and definition.

Table 7: !Com: behavior question and definition.

Table 8: Ign: behavior question and definition.
!Rel behavior definition (fragment): [...] a new and off-topic discussion, it is considered to be irrelevant. Responses that are irrelevant feel abrupt and interrupt the discussion, usually because they present questions or ideas that are unrelated to the previous turn. Short reactions to or acknowledgements of the previous turn are NOT irrelevant.

Table 14: Precision, recall, F1, and accuracy achieved by each model, where HUM stands for human judges. #+: number of positive labels predicted. †|†† denote significance between automated models on one or both annotation sets, respectively. ⋆|⋆⋆ denote significance against the best automated model on one or both annotation sets.

Table 15: Cost ($ USD) per dialogue for each behavior using ChatGPT or humans (HUM).