Is ChatGPT a Good Multi-Party Conversation Solver?

Large Language Models (LLMs) have emerged as influential instruments within the realm of natural language processing; nevertheless, their capacity to handle multi-party conversations (MPCs) -- a scenario marked by the presence of multiple interlocutors involved in intricate information exchanges -- remains uncharted. In this paper, we delve into the potential of generative LLMs such as ChatGPT and GPT-4 within the context of MPCs. An empirical analysis is conducted to assess the zero-shot learning capabilities of ChatGPT and GPT-4 by subjecting them to evaluation across three MPC datasets that encompass five representative tasks. The findings reveal that ChatGPT's performance on a number of evaluated MPC tasks leaves much to be desired, whilst GPT-4's results portend a promising future. Additionally, we endeavor to bolster performance through the incorporation of MPC structures, encompassing both speaker and addressee architecture. This study provides an exhaustive evaluation and analysis of applying generative LLMs to MPCs, casting a light upon the conception and creation of increasingly effective and robust MPC agents. Concurrently, this work underscores the challenges implicit in the utilization of LLMs for MPCs, such as deciphering graphical information flows and generating stylistically consistent responses.


Introduction
Large Language Models (LLMs), including notable instances such as ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023), are ushering in a new era in the field of natural language processing (NLP), showcasing remarkable zero-shot and fewshot generalization capabilities.The methods of pretraining language models on extensive text corpora (Brown et al., 2020;Touvron et al., 2023), followed by alignment fine-tuning to ensure adherence to human instructions (Wang et al., 2023b;Peng et al., 2023), has significantly amplified their proficiency in language comprehension, generation, interaction, and reasoning.The impressively high performance exhibited by these models has been extensively documented in related works (Qin et al., 2023;Wang et al., 2023a;Bubeck et al., 2023).
Multi-Party Conversations (MPCs) represent a prevalent and natural facet of human communication.In these exchanges, more than two participants engage in interactive discourse on a variety of topics.Such a dynamic introduces new challenges and opportunities for dialogue systems.
Here, the key requirement isn't merely generating coherent and relevant utterances, but also making strategic determinations about when to intervene and whom to address in the conversation.We queried ChatGPT about its potential strategies for addressing the inherent challenges of MPCs with "Can you solve multi-party conversation tasks?".ChatGPT's response was as follows: "I do not have built-in mechanisms to keep track of individual participants in a conversation.Therefore, it's important to explicitly mention the name or identifier of the participant you are addressing when providing instructions or asking questions.", which was intriguing and touched upon some of the critical aspects we are investigating in this paper.
Considerable efforts have been made to explore the capabilities of LLMs across a variety of NLP tasks (Qin et al., 2023;Sun et al., 2023).Despite these advancements, the effectiveness of LLMs in handling MPCs remains largely underexplored.In this study, we scrutinize the potential of LLMs such as ChatGPT and GPT-4 in managing MPCs by implementing five distinct tasks including Emotion Detection, Addressee Recognition, Speaker Identification, Response Selection, and Response Generation, across three different MPC datasets.Our experiments reveal that both ChatGPT and GPT-4 can achieve performance on par with supervised methods when evaluated on the EmoryNLP and MELD datasets.However, the performance of these models on the more complex Ubuntu IRC dataset remains less than satisfactory.To address this, we introduce a strategy known as MPC Structure Incorporation, which weaves the speaker and addressee structure information of MPCs into the LLMs.This addition leads to a significant performance boost of ChatGPT and GPT-4 across all three datasets, underpinning the value of this approach for improving LLM effectiveness in MPC scenarios.
In summary, our contributions in this paper are three-fold: 1) An exploratory study to examine the performance of ChatGPT and GPT-4 in handling MPCs within a zero-shot context is carried out.This is the first study of its kind to investigate how these LLMs perform in MPC scenarios.2) An MPC structure incorporation approach is proposed, which enhances the performance of ChatGPT and GPT-4 in managing MPCs.This strategy integrates the speaker and addressee structure information of MPCs into LLMs, leading to substantial performance improvements.3) We delve into the potential of LLMs in handling MPCs and shed light on the challenges that need to be tackled in future research.This discussion forms the basis for continued investigations into improving LLM effectiveness in MPC contexts.

Multi-Party Conversations
Existing methods on building MPC systems can be generally categorized into retrieval-based approaches (Ouchi and Tsuboi, 2016;Zhang et al., 2018;Wang et al., 2020;Gu et al., 2021Gu et al., , 2023) ) or generation-based (Hu et al., 2019;Gu et al., 2022;Li and Zhao, 2023).On the one hand, Ouchi and Tsuboi (2016) and Zhang et al. (2018) proposed to update speaker embeddings with conversation streams dynamically and role-sensitively.Wang et al. (2020) proposed to track the dynamic topic in a conversation.Gu et al. (2021) proposed jointly learning "who says what to whom" in a unified framework by designing selfsupervised tasks.Gu et al. (2023) present graphinduced fine-tuning which adapts Transformerbased LMs by integrating four types of edges into attention mechanisms.On the other hand, Hu et al. (2019) explored generation-based approaches by proposing a graph-structured network (GSN) that encoded the conversation context using homogeneous GNNs.Gu et al. (2022) proposed HeterMPC to model the complicated interactions between utterances and interlocutors with a heterogeneous graph.Li and Zhao (2023) proposed to iteratively generate missing addressees and optimize the generative model via the EM algorithm.
LLMs for NLP Tasks Several contemporaneous papers present empirical studies of leveraging LLMs for various NLP tasks to explore whether LLMs can achieve competitive performance.For example, Qin et al. (2023) study the zero-shot learning capability of ChatGPT by evaluating it on 7 representative task categories, such as reasoning, natural language inference, and summarization.Liu et al. (2023) and Wang et al. (2023a) explore whether ChatGPT is a good evaluator of natural language generation.Sun et al. (2023) investigate LLMs for relevance ranking in information retrieval.Zheng et al. (2023) seek to understand why ChatGPT falls short in providing truthful answers and provide a guideline towards truthfulness in question answering via LLMs.
To the best of our knowledge, this paper makes the first attempt to empirically analyze the zeroshot learning ability of LLMs on the MPC tasks, providing comprehensive evaluation and analysis to inspire the development of more effective and robust MPC agents.

Approach
To explore an out-of-box MPC solver, our emphasis lies in the zero-shot setting.Instruction for each task is shown in Figure 1.For each task, LLMs are first instructed with the prompt "You have been presented with a sequence of multi-party conversational turns, organized in chronological order.".Then, LLMs are instructed to complete the task with task-specific prompts.To stabilize

Instruction:
You have been presented with a sequence of multi-party conversational turns, organized in chronological order.${Task Instruction} ${Output Template} ${Extra Description} Use temperature=0, minimize unnecessary words to not get confused.

Input:
${History with Input Template} ${Extend Instruction} Response: the output of LLMs, the effect of temperature is amplified with the prompt "Use temperature=0, minimize unnecessary words to not get confused.".1 For more comprehensive explication, certain tasks require an Extended Instruction as shown in Table 5.

Task-Specific Prompts
Different instructions for each task are designed to guide LLMs to complete the task, namely taskspecific prompts.
Emotion Detection (ED) LLMs are tasked to predict the emotion of each utterance with the task instruction "Please evaluate the emotions of each utterance in the dialogue using the following n labels: {...}." and the output template "The output format must be: #{num} -{utterance} // {emotion}" Here, n is the number of emotion labels, and {...} is the list of emotion labels.Dialogue history is formalized as "#{num} -{utterance}".
Addressee Recognition (AR) LLMs are tasked to predict the addressee of each utterance with the task instruction "Your task is to find the addressee of each utterance."and the output template "The output format must be: #{num} -{utterance} // Reply to #{reply_i}".And we can get the addressee information if we know the reply-to utterance of each utterance.Since the addressee of the first utterance is unknown, we add the extra description "Please start from #1 since #0 is the first utterance that has no reply-to utterance.You should not leave any utterance unattended." to inform LLMs of the addressee of the first utterance.Dialogue history is formalized as the same as emotion detection.
Speaker Identification (SI) LLMs are tasked to predict the speaker of the last utterance with the task instruction "Please identify the speaker of the last sentence."and output template "The output format should be only one speaker.".Dialogue history with speaker is formalized as "#{num} -{speaker}: {utterance}".
Response Selection (RS) LLMs are tasked to select the most appropriate response from the candidates with the task instruction "Your task is to select the most appropriate response from the candidate set." and output template "The output format must be: #{num} -{utterance}".The candidate set is formalized as "#{num} -{utterance}".Thus the history with input template is formalized as "Dialogue History: {conversation turns} Candidates: {candidates}".
Response Generation (RG) LLMs are tasked to generate a response with the task instruction "Your task is to generate the most appropriate response.".There is no need to provide the output template since the generation task is free-form.

MPC Structure Incorporation
Considering that the complicated interactions between interlocutors, between utterances and between an interlocutor and an utterance naturally increase the difficulty of fully understanding MPCs, it might be helpful to incorporate the MPC structure information.

Speaker Structure
The speaker structure is incorporated into LLMs to help understand utterances.Specifically, the prompts with "{utterance}" is replaced with "{speaker}: {utterance}" to inform LLMs of the speaker of each utterance.An example is shown in Figure 2.
Addressee Structure The addressee structure of MPCs is constructed with adding sentence "Reply to #{reply_uid} -{reply_utterance}" into prompt.

Datasets
EmoryNLP EmoryNLP (Zahiri and Choi, 2018) is a collection of the TV show Friends, where
MELD The Multimodal EmotionLines Dataset (MELD) (Poria et al., 2019) was developed by improving and expanding the original EmotionLines dataset (Hsu et al., 2018).MELD includes the same multi-party dialogue instances as EmotionLines but incorporates additional audio and visual modalities alongside text.Each utterance within a dialogue is tagged with one of the seven emotions, including: Anger, Disgust, Sadness, Joy, Neutral, Surprise, Fear.This comprehensive tagging facilitates deep emotion-focused analysis.
Ubuntu IRC benchmark Ubuntu IRC (Hu et al., 2019) represents a substantial, Ubuntu Internet Relay Chat (IRC) channel corpus, replete with annotations for multiple interlocutors.Furthermore, it is distinguished by the provision of addressee labels accompanying each individual utterance.
The EmoryNLP and MELD datasets proffer emotion labels, hence enabling the execution of ED tasks.Conversely, the Ubuntu IRC dataset offers Addressee tags, thereby serving as a resource for AR tasks.(Hu et al., 2019) 311,725 5,000 5,000

Implementation Details
All supervised models were trained with the AdamW method (Loshchilov and Hutter, 2019).
The learning rate was initialized as 6.25e-5 and was decayed linearly down to 0. The batch size was set to 128 gradient accumulation steps.Models were trained in 10 epochs.For ChatGPT and GPT-4, we used the API endpoints gpt-3.5turbo-0301and gpt-4-0314 provided by OpenAI respectively.3

Metrics
To evaluate ED task, we employed the weight-F1 score, which is the harmonic mean of precision and recall.To evaluate SI and AR tasks, we employed accuracy.To evaluate RS task, we employed R10@1, which is the percentage of the first correct response selected from 10 candidates.
To evaluate the quality of the generated text, we employed the standard string-similarity-based metrics SacreBLEU (Post, 2018), ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005).
Higher is better for all metrics.All metrics were calculated by the evaluate toolkit4 .

Evaluation Results of MPC Understanding
As shown in Table 2, we evaluated the dialogue understanding performance of supervised language models and LLMs on three test sets.Here, the SOTA results of EmoryNLP and MELD on ED are copied from Song et al. (2022).The SOTA results of Ubuntu IRC of all three tasks are copied from Gu et al. (2023).

Supervised Language Models Versus LLMs
When one juxtaposes the outcomes associated with supervised language models and LLMs, it becomes apparent that the LLMs demonstrate parity in performance with their supervised counterparts on the EmoryNLP and MELD datasets.However, their performance on Ubuntu IRC falls short of the mark.
It is unsurprising to discern that the capability of GPT-4 surpasses its predecessor, ChatGPT across all four understanding tasks.In the ED task, both

Speaker Information Enhancement
To delve into the significance of the interlocutor within the MPC understanding, we incorporate speaker information into three distinctive tasks across three datasets, disregarding the ED task.Comparing the line of ChatGPT (GPT-4) 5 and ChatGPT (GPT-4) w/.Speaker, we can find that the incorporation of speaker information can improve the performance of ChatGPT (GPT-4) on all five tasks except for the RS task of Ubuntu IRC.This improvement is not surprising.In the context of MPCs, the speaker doesn't adhere to the rigid alternation characteristic of two-party dialogues, hence the incorporation of speaker information can enhance the lucidity of the conversation, rendering it more readily comprehensible.Note that the substantial advancement observed in the AR task partly stems from our modification of the task from identifying the response sentences to directly recognizing the addressee.Nonetheless, compared with ChatGPT, the subpar performance of ChatGPT w/.Speaker on the RS task of Ubuntu IRC suggests that the imparted speaker information could not be optimally assimilated and deployed by ChatGPT.On the contrary, it appeared to disrupt the process of response selection.However, GPT-4's performance on this task was substantially superior, enhancing the effectiveness of the RS task by a margin of Models Tasks EmoryNLP MELD Ubuntu IRC ED (F1) SI (ACC) RS (R10@1) ED (F1) SI (ACC) RS (R10@1) AR (ACC) SI (ACC) RS (R10@1) Supervised BERT (Devlin et al., 2019) 34 Table 3: Evaluation results of MPC Generation.Numbers in bold denoted that the results achieved the best performance.S-BLEU is the short of SacreBLEU.The vacant cells signify their incalculability. 6The SOTA of Ubuntu IRC is BART w/.EM (Li and Zhao, 2023).9.50%.This result suggests that GPT-4 is better at fusing speaker information than ChatGPT.
Addressee Information Enhancement To probe the import of the addressee within the context of MPC comprehension, we integrate addressee information into the SI and RS tasks of Ubuntu IRC.This is primarily due to the absence of addressee information within the other two datasets.The AR task is excluded from this process, given that addressee information is unavailable within the corresponding Ubuntu IRC task.When drawing comparisons between the results of ChatGPT w/.Speaker and ChatGPT w/.Speaker & Addressee on 6 For SI task, the speakers in dialogue history are needed.It means that we cannot lack SI information, so these cells are empty.For EmoryNLP and MELD datasets lacking the addressee information, therefore corresponding rows are empty.For the Ubuntu IRC AR task, it asked not to add the addressee information, so these cells are empty.
the SI and RS tasks, as well as between the results of ChatGPT and ChatGPT w/.Addressee on the RS task, we find, to our surprise, that the integration of addressee information has led to a diminution in performance.When comparing the results of GPT-4 w/.Speaker and GPT-4 w/.Speaker & Addressee on SI and RS tasks, we find that the incorporation of addressee information can slightly improve the performance.The only marginal improvement of approximately 1.00% observed on the SI and RS tasks of Ubuntu IRC can be predominantly attributed to the proficiency of GPT-4 w/.Speaker in correctly inferring addressee information, given its noteworthy accuracy of 82.50% on AR tasks.
Speaker and Addressee Information Enhancement When contrasting the results of ChatGPT (GPT-4) w/.Speaker & Addressee with the Chat-GPT (GPT-4) on RS tasks, we discern that the Table 4: One selected example of the SI task on Ubuntu IRC with speaker and addressee structure.Here we also list the results of ChatGPT (GPT-4) w/.Speaker for comparison.
integration of speaker and addressee information precipitates a performance decrement of 25.18% for ChatGPT, whilst conversely, it elicits an augmentation of 11.50% in the performance of GPT-4.Indeed, both categories of information serve to enhance the comprehension of the conversation, yet the infusion of additional data concurrently amplifies the complexity of understanding.Empirical outcomes indicate that ChatGPT grapples with the processing of this surplus information, whereas its more potent successor, GPT-4, exhibits the capacity to assimilate it effectively.

Evaluation Results of MPC Generation
As shown in Table 3, we evaluated the dialogue response generation performance of supervised language models and LLMs on three test sets.The SOTA results of Ubuntu IRC of all three tasks are copied from Li and Zhao (2023).
Supervised Language Models Versus LLMs It becomes apparent that the SacreBLEU scores of supervised models consistently eclipse those of LLMs in a significant number of cases across all three evaluative subsets.This phenomenon is largely a byproduct of ChatGPT and GPT-4's inclination to generate more prolix responses, an approach inherently detrimental to the calculation of SacreBLEU.Regarding the ROUGE L metrics, LLMs yield superior results compared to supervised models on the EmoryNLP and MELD test sets.In contrast, within the ambit of the Ubuntu IRC dataset, supervised models command a greater presence-an outcome likely driven by the amplified comprehension complexity associated with this particular set.In terms of the METEOR metric, LLMs outdistance supervised models across all test sets, thereby affirming the formidable aptitude of LLMs in generating responses.
MPC Structure Incorporation Similar to the results in MPC Understanding, the inclusion of interlocutors' information emerges as beneficial to the generation of dialogue responses.This is substantiated by the marked superiority of Chat-GPT (GPT-4) w/.Speaker over ChatGPT (GPT-4) across all three test sets, with the exception of ChatGPT's performance on the EmoryNLP dataset.Nonetheless, addressee information doesn't seem to confer a discernible advantage in the generation of dialogue responses when comparing the performance of ChatGPT w/.Addressee to the ChatGPT, as well as the performance of ChatGPT w/.Speaker & Addressee to ChatGPT w/.Speaker.When our focus shifts to GPT-4, we observe that the performance of GPT-4 w/.Addressee does not exhibit the anticipated enhancement.However, with the addition of interlocutor information, there is a noticeable improvement in the performance of GPT-4 w/.Speaker & Addressee.

Case Study
To analyze the performance of LLMs on the tasks of MPC understanding and generation specifically, case studies were conducted by presenting randomly selected examples for further illustration.resulting in inaccurate responses from both Chat-GPT and GPT-4.However, when furnished with addressee information, GPT-4 shows an improved aptitude to comprehend the dialogue, leading to a correct response.Conversely, ChatGPT appears to be confounded by the addressee information, yielding an incorrect answer that lies beyond the purview of the candidates.

MPC Understanding As shown in
MPC Generation As shown in Table 5, the response generated by BART is conspicuously devoid of substantive content.All the LLMs, especially GPT-4, exhibit the ability to produce lengthy and meaningful responses.Although ChatGPT and GPT-4 are capable of crafting responses with greater pertinence, their propensity to yield verbose responses hampers the SacreBLEU score.

Conclusion
In this paper, we have explored the abilities of generative LLMs for MPCs, which have been largely underexplored.We empirically analyze the zero-shot learning ability of ChatGPT and GPT-4 by evaluating them on three popular MPC datasets covering five representative tasks.On EmoryNLP and MELD datasets, ChatGPT and GPT4 achieve comparable performance to the supervised training models.However, ChatGPT performs poorly on the evaluated Ubuntu IRC tasks, while GPT-4 shows promising results.However, there is still a large gap relative to the supervised training on SI task of Ubuntu IRC.Taking into account the structure of the MPC, it is evident that both ChatGPT and GPT-4 exhibit enhanced performance across nearly all tasks when equipped with speaker information.However, in the context of addressee information, ChatGPT's performance may decline if it is encumbered with extraneous data.Conversely, GPT-4 aptly leverages this information to accomplish the task with heightened proficiency.Devoting efforts towards efficaciously intertwining the graphical structure inherent in MPCs with LLMs through prompts or even supervised finetuning constitutes one of future work.

Limitations
The design of prompts plays a crucial role in determining the final results, as it holds significant sway over the outcome.The way in which prompts are constructed and structured can greatly impact the performance and effectiveness of ChatGPT and GPT-4 when it comes to handling MPC tasks.However, our current prompt architecture might not fully encompass the ideal potential and capabilities of these advanced language models in tackling MPC tasks.There is a possibility that further improvements and refinements in the prompt design can unlock even greater performance and unleash the full potential of ChatGPT and GPT-4 in handling MPC tasks.It is essential to explore and enhance the prompt architecture to ensure optimal results and leverage the capabilities of these powerful language models to their fullest extent in the realm of MPC tasks.

Figure 1 :
Figure 1: The prompt template to complete the tasks.

Figure 2 :
Figure 2: The prompt and output of emotion detection with speaker structure.
Table 1 presents the statistics of the three datasets evaluated in our experiments.

Table 1 :
Statistics of the datasets evaluated in this paper.

Table 2 :
(Gu et al., 2023)2)of MPC Understanding.Numbers in bold denoted that the results achieved the best performance.The vacant cells signify their incalculability.6TheSOTA of ED task is SPCL-CL-ERC(Song et al., 2022), and the SOTA of Ubuntu IRC tasks is MPC-BERT w/.GIFT(Gu et al., 2023)

Table 4
, in the absence of addressee information, the model exhibits inadequacies in comprehending MPC, ### Instruction: You have been presented with a sequence of multi-party conversational turns, organized in chronological order.Your task is to generate the most appropriate response.The output format is "[Reply to #{reply_uid} -Speaker {reply_spk}: {reply_utterance}] Speaker {rsp_spk}: {rsp}".Use temperature=0, minimize unnecessary words to not get confused.Speaker 2: i see there 's some confusion , that 's my mistake i guess.. with active i meant that windows ca n't 'see ' the other partitions.. but i can see that 's only a historical issue.. -Please give a response on behalf of Speaker 1 for Uttenrance #4.The part of response is [Reply to #4 -Speaker 2: i see there 's some confusion , that 's my mistake i guess.. with active i meant that windows ca n't 'see ' the other partitions.. but i can see that 's only a historical issue..] Speaker 1. Please finish the response generation.thepostfixsetuproutine and use one machine as the relay for all the others.This will save you time and effort.w/.Speaker & Addressee Yes, you can automate the setup process by creating a script that answers the questions for you.This will save you time and ensure that the setup is consistent across all machines.GPT4If you're trying to automate the process and distribute it to multiple machines, you might consider creating a shell script or Ansible playbook that sets up everything necessary for node.jsdevelopment, including postfix.This way, you can handle the postfix setup with pre-defined configurations, and then package all the necessary files and tools into an AMI for EC2 or an Ubuntu image.This would be more maintainable and scalable.
w/. Speaker & Addressee Yes, you can automate the process.You can use tools like 'expect' or 'autoexpect' in Linux to handle interactive command line sessions.These tools can learn from your inputs and can generate a script that automates the process.Alternatively, you might also consider using a configuration management tool like Ansible or Puppet.Ground Truthand untar your default config on all your client machines .

Table 5 :
One selected example of the RG task on Ubuntu IRC with speaker and addressee structure.Here we also list the results of ChatGPT, GPT-4 and BART for comparison.