Who is Speaking? Speaker-Aware Multiparty Dialogue Act Classification


Utterances are the building blocks of a conversation, but they do not occur in isolation. To recover the communicative function behind an utterance, we need to consider its context. The context here refers not just to previous utterances, but also to dialogue structure, speaker behavior, etc. Utterances are multifunctional in the sense that they encode many roles, e.g., turn management and relaying communicative intentions (Bunt, 2006). One example of this is the broader category of the Question dialogue act (DA), which, along with seeking information, also implicitly gives the floor to the addressee. Therefore, interlocutor information is needed to instill situational awareness of speakers in a conversation. In a dialogue, speakers influence each other, which in turn dictates how they behave and convey their intentions. Different speakers can also fulfill different roles. For example, within the local context of a conversation, one speaker can lead the conversation by providing information while others act as active listeners and signal that they are following along by means of backchannels, as shown in Table 1.
There has not been much prior work on incorporating speaker awareness for DA classification, particularly in the case of multiparty conversations (i.e., conversations involving more than two interlocutors). While there is some research on modeling speakers for DA classification, it is mostly focused on two-party conversations. Chi et al. (2017) use a separate BiLSTM to model each speaker in a dyadic setting. Although this is one way to capture speaker behavior, it does not scale to multiparty dialogues and real-world situations where the number of speakers can vary at test time. Recently, He et al. (2021) introduced a way to capture speaker turns by learning turn embeddings. For this to work in a multiparty setting, the dialogue needs to be reduced to a dyadic one. This is not ideal, since doing so would discard crucial information on the different participants' contributions.
In the related field of emotion recognition in conversations (ERC), some work has been done to encode speakers (Lee and Choi, 2021; Ghosal et al., 2019; Zhang et al., 2019; Song et al., 2023). These works primarily use graph neural networks (GNNs) to enhance the representation of utterances. Speaker information is injected either through edges or by adding speaker nodes to the graph. Since the main objective of these methods is to encode utterances better, different types of edges are often introduced to capture the different relationships between utterances. Examples include separate edge types for past and future connections (Ishiwatari et al., 2020) and dedicated edge types learned for each individual speaker (Ghosal et al., 2019). This translates to a bigger model with more parameters as well as a denser graph with many edges. Due to the large memory requirements of such models, training them can be hard in a resource-scarce situation. We show in our work that, instead of using the speaker as a means to capture utterance context better, directly learning the speaker representations is more efficient. This results in a simpler graph with fewer edges.
In this work, we present a GNN-based framework that takes contextual utterance vectors as input and encodes conversation dynamics by connecting the node of each speaker with their utterances. The learned speaker representations are then concatenated with the utterance representations to get speaker-enriched utterance representations, which are then used for DA classification. We conduct extensive analysis on the ICSI Meeting Recorder Dialog Act (MRDA) corpus (Shriberg et al., 2004) to understand how the fine-grained DA classes are affected by incorporating speaker representations.

Related Work
DA classification. Most existing works on DA classification (Kumar et al., 2018; Chen et al., 2018; Raheja and Tetreault, 2019; Bothe et al., 2018; Khanpour et al., 2016) use recurrent neural networks (RNNs) as backbone models to represent utterances. To capture context, a hierarchical RNN with some variation is used. DA label sequences can help learn associations between tags that occur together or follow a pattern; conditional random fields (CRFs) have been used extensively for this purpose. Additionally, Raheja and Tetreault (2019) use attention to leverage context more effectively.
To make utterances speaker turn-aware, He et al. (2021) learn two turn embeddings to capture turns. The embedding vector based on the speaker label is added to the utterance vector and then passed to another RNN to capture context. Shang et al. (2020) instead use a CRF layer to model turn-taking. While these are effective ways to inject speaker information, multiparty dialogues need to be converted to dyadic dialogues for these approaches to work. Chi et al. (2017) work in the setting of conversational agents and aim to capture the speaker roles of the agent and the user using a separate BiLSTM for each. This technique cannot be applied to natural conversations where the number of participants is not fixed. Colombo et al. (2020) use three RNN encoders: word-level, speaker-level, and utterance-level. The speaker-level RNN is fed utterances grouped by speaker to capture speaker personas. The representations from this encoder are then fed to the utterance-level encoder for wider context modeling. Their work differs from ours in how each speaker is modeled: while they encode utterance context at the speaker level, we learn an explicit speaker representation.

Graph-based methods to encode speakers in dialogues
We discuss here some of the work that uses GNNs (Scarselli et al., 2008) for emotion recognition in conversations. Ghosal et al. (2019) build a graph with only utterances as nodes and incorporate context at the speaker level using different edges. There is a separate directed edge type from each speaker to every other speaker, further differentiated by the direction of past and future. Thus, there are $2M^2$ edge types in total, where $M$ is the number of unique speakers. This technique does not scale to cases where there are more speakers in a dialogue at test time than the maximum number of speakers seen during training. Shen et al. (2021b) and Sheng et al. (2020) construct a dialogue graph using only utterance nodes. Two different edge types indicate whether two connected utterances share the same speaker or not. In some ways, this is akin to converting a multiparty dialogue to a two-party one, since the utterances only carry binary speaker information.
Some other works construct the graph with both speaker and utterance nodes. This includes Lee and Choi (2021), who treat ERC as a relation extraction task through a GNN. While Zhang et al. (2019) share speaker nodes across dialogues, we are more interested in capturing the dynamics of speaker influence within a local context and assume that any new speaker can be seen during test time. Liang et al. (2021) learn fixed embeddings for each unique speaker and use them to initialize the speaker nodes.
One thing common to all these papers is that they explicitly learn utterance node representations through the graph (Sun et al., 2021; Song et al., 2023), i.e., they build the graph representations as a means to get better utterance representations. The speaker nodes serve as global nodes for effective message passing between utterances. This not only results in a very dense graph but also adds complexity by introducing different edge types based on speaker associations. In contrast, we are interested in learning speaker representations and how using those can enrich the utterances. Doing so only requires connecting each speaker to its utterance nodes. We present a simple graph construction scheme and show that modeling speakers directly is not only efficient but effective as well. Our proposed approach can be used on top of any pre-trained utterance encoding system.

Task
A dialogue $D$ with $|P|$ participants can be defined as a collection of utterances $D = \{u_0, u_1, \ldots, u_{n-1}\}$, that is, $|D| = n$. Each utterance $u_i$ is associated with a speaker given by $s(u_i) = p_j$ and a DA label $y_i$. The aim of DA classification is to assign a DA label from a set of labels $C$ to each utterance $u_i$ in $D$.
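To make the task definition concrete, the following is a minimal sketch of a dialogue as a list of utterances, each carrying its speaker and DA label. The class name `Utterance`, the helper `participants`, and the toy texts and labels are illustrative assumptions, not part of the original system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    text: str
    speaker: str  # participant identifier (hypothetical naming)
    label: str    # DA label drawn from the label set C


def participants(dialogue: List[Utterance]) -> List[str]:
    """Return the unique speakers of a dialogue in order of first appearance."""
    seen: List[str] = []
    for utt in dialogue:
        if utt.speaker not in seen:
            seen.append(utt.speaker)
    return seen


# Toy multiparty dialogue (made-up content, for illustration only).
dialogue = [
    Utterance("so where do we start", "A", "question"),
    Utterance("uh-huh", "B", "backchannel"),
    Utterance("with the agenda", "C", "statement"),
    Utterance("okay", "A", "backchannel"),
]
```

Here `len(dialogue)` plays the role of $|D|$ and `len(participants(dialogue))` the role of $|P|$.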

Model
In this section, we describe the components of our model, an overview of which is presented in Figure 1.

Utterance Encoder
The utterance encoder is the same baseline as presented in He et al. (2021); we include the base model without the speaker turn embeddings. The sentence separator <s> token from the last layer of the RoBERTa model (Liu et al., 2019; https://huggingface.co/docs/transformers/model_doc/roberta) is used to derive a representation for each utterance. We use RoBERTa-base with 12 layers and fine-tune the final layer. To get the contextual representations, the utterances are passed to a bidirectional GRU (Cho et al., 2014). These final utterance representations $h_i \in \mathbb{R}^d$ are defined by Equation (1), where $d$ is the dimension of each utterance vector.

Speaker Turn Indicating Tokens
Participants speak intermittently in a dialogue. If a single speaker has had the floor for the past few utterances and the current utterance is also by the same speaker, then the chances of certain DA labels such as backchannel, mimic, and collaborative completion decrease, whereas the chances of other DA labels such as repeat and self-correct misspeaking increase. This example shows why modeling speaker turns is crucial.
To capture this turn-taking behavior at the utterance level, a turn-indicating token can be used (Żelasko et al., 2021). Each utterance $u_i$ is prepended with a special token to get the updated utterance $u'_i$, as given by Equation (2). These updated utterances are then fed to the encoder defined in § 4.1. The upside of encoding speaker turns this way is that no new parameters are introduced into the model.
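As a rough illustration of turn-indicating tokens, the sketch below prepends a marker to each utterance depending on whether the speaker changed relative to the previous utterance. The token strings `<same>` and `<switch>`, the treatment of the first utterance as a switch, and the function name are assumptions made for this example; the exact token scheme is not specified here.

```python
from typing import List, Optional, Tuple

# Hypothetical token strings; the actual special tokens may differ.
SAME, SWITCH = "<same>", "<switch>"


def add_turn_tokens(utterances: List[Tuple[str, str]]) -> List[str]:
    """Prepend a turn token to each (speaker, text) pair.

    The token only says whether the speaker differs from the previous
    utterance's speaker, so it carries binary turn information.
    """
    out: List[str] = []
    prev: Optional[str] = None
    for speaker, text in utterances:
        token = SAME if speaker == prev else SWITCH
        out.append(f"{token} {text}")
        prev = speaker
    return out


marked = add_turn_tokens(
    [("A", "hi"), ("A", "so"), ("B", "yeah"), ("C", "ok")]
)
```

In this toy input the switches A→B and B→C receive the same token, which is the kind of binary information such a token can express.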

Graph Speaker Modeling
The addition of the turn token can still only capture binary speaker transitions. While this might be enough in the case of a dyadic conversation, it reduces a multiparty conversation to a two-party one. To overcome this, on top of the turn tokens, we also learn a graph-based representation for each speaker and use it to inform each utterance of its speaker.

Graph Structure
We define the graph as $G = \langle V, E, R \rangle$, where the nodes $v_i \in V$ can be one of two types: an utterance $v^u_i$ or a speaker $v^s_j$. Labeled edges $e_{ij} \in E$ denote edges between $v_i$ and $v_j$. Finally, $r \in R$ is the type of relation an edge represents.
We introduce a single type of relation in the graph that an edge $e_{ij}$ connecting two nodes $v_i$ and $v_j$ can take. We denote it by $(v^u_i, v^s_{s(u_i)})$, where $s(u_i)$ is the speaker of utterance $u_i$; it represents an undirected edge between an utterance and its speaker. We show in § 6.3 that, compared to more complex graph structures, this one type of edge is enough to model speakers effectively.
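The graph construction described above can be sketched as follows, assuming utterance nodes are numbered first and speaker nodes after them, and that each undirected utterance-speaker edge is stored as two directed edges. The node numbering and the function name `build_speaker_graph` are illustrative choices, not the original implementation.

```python
from typing import Dict, List, Tuple


def build_speaker_graph(speakers: List[str]) -> Tuple[int, List[Tuple[int, int]]]:
    """Build the utterance-speaker graph for one dialogue.

    speakers[i] is the speaker of utterance i. Nodes 0..n-1 are utterances;
    speaker nodes get the ids n, n+1, ... in order of first appearance.
    Returns (number_of_nodes, directed_edge_list).
    """
    n = len(speakers)
    speaker_ids: Dict[str, int] = {}
    for s in speakers:
        speaker_ids.setdefault(s, n + len(speaker_ids))

    edges: List[Tuple[int, int]] = []
    for i, s in enumerate(speakers):
        j = speaker_ids[s]
        edges.append((i, j))  # utterance -> speaker
        edges.append((j, i))  # speaker -> utterance
    return n + len(speaker_ids), edges


# Five utterances by three unique speakers.
num_nodes, edges = build_speaker_graph(["A", "B", "A", "C", "B"])
```

For this toy dialogue the node count is the number of utterances plus the number of speakers, and every utterance contributes exactly one undirected edge.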

Speaker Learning
The utterance encoder in all speaker graph experiments is initialized with the trained weights of the encoder given in § 5.3. The baseline feed-forward layer ($f_{\text{baseline}}$ in Equation 6) is removed, and the encoder weights are kept fixed during training. We use a Relational Graph Attention Network (RGAT) (Busbridge et al., 2019) to model the speakers with respect to their utterances. The graph is constructed and initialized as detailed above and passed to the RGAT to get the updated representations of the vertices, as given by Equation (3).

DA Classification
The updated speaker representations $v^s_j \in V$ from the graph are extracted and concatenated with their respective utterance representations $h_i$ from (1). The final utterance representation is given by:
$$h^{\text{speaker-enriched}}_i = [\,h_i \,;\, v^s_{s(u_i)}\,] \quad (4)$$
Here, $v^s_{s(u_i)}$ is the graph node representation of the $i$-th utterance's speaker, and $h^{\text{speaker-enriched}}_i \in \mathbb{R}^{2d}$. Finally, $h^{\text{speaker-enriched}}_i$ is passed to a feed-forward network to get the predicted label:
$$\hat{y}_i = f_{\text{Spk-Graph}}\big(h^{\text{speaker-enriched}}_i\big) \quad (5)$$
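The concatenation step can be illustrated with plain Python lists standing in for tensors. This is a minimal sketch: the function name `enrich_utterances`, the toy vectors, and the order of concatenation (utterance vector first) are assumptions made for the example.

```python
from typing import Dict, List


def enrich_utterances(
    utt_vecs: List[List[float]],
    speakers: List[str],
    speaker_vecs: Dict[str, List[float]],
) -> List[List[float]]:
    """Concatenate each utterance vector with the vector of its speaker.

    If both vectors have dimension d, each result has dimension 2d.
    """
    return [h + speaker_vecs[s] for h, s in zip(utt_vecs, speakers)]


# Toy values with d = 2 (made up for illustration).
utt_vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
speakers = ["A", "B", "A"]
speaker_vecs = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
enriched = enrich_utterances(utt_vecs, speakers, speaker_vecs)
```

Utterances by the same speaker share the same second half, so the classifier sees who is speaking alongside what is said.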

Dataset
We experiment with two publicly available datasets for DA classification. For both datasets, we use the split of Lee and Dernoncourt (2016). The dataset overview and distribution of utterances are given in Table 2.
MRDA. The first dataset is the MRDA corpus, which consists of around 72 hours of natural conversations, with an average dialogue length of 1445.41 utterances. There are 11 general and 39 specific tags, along with three types of disruptions and a non-speech label. The MRDA corpus follows a hierarchical annotation scheme, where each utterance is labeled with a compulsory general tag and zero or more specific tags as needed. Over the years, many grouping schemes have been devised to consolidate the MRDA tags into higher-level categories. One of the most widely used groupings is the basic-tags scheme introduced in Ang et al. (2005) (see footnote 4). While the majority of the existing works using MRDA focus on these five coarse-grained categories (Kumar et al., 2018; Chen et al., 2018; Raheja and Tetreault, 2019; Bothe et al., 2018; Khanpour et al., 2016; He et al., 2021), we focus on the fine-grained labels.
In particular, an utterance is always assigned the first specific tag (in case of more than one) if one is present. If the DA label consists of only a general tag, then that label is used (see footnote 5). We made three changes to the labels: (i) we dropped rising tone since it is not a DA (Dhillon et al., 2004); (ii) we dropped declarative question because it only captures the syntactic structure of the utterance, being used when a question is framed as a statement; and (iii) when a pipe symbol (|) is used to annotate the floor mechanism at the start of an utterance, we take the label assigned to the later part of the utterance instead of the floor mechanism, since floor mechanisms at the start of an utterance are not the most informative tag when only one label is chosen. Examples of annotated utterances are shown in Table 1.
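The label selection rules above can be sketched as a small function. This is only an illustration under stated assumptions: the annotation string format (`general^specific1^specific2`, with `|` separating a leading floor mechanism from the rest), the tag codes `rt` (rising tone) and `d` (declarative question), and the function name `select_label` are assumed for the example and may not match the corpus files exactly.

```python
# Assumed tag codes for the two dropped tags (rising tone, declarative question).
DROPPED = {"rt", "d"}


def select_label(annotation: str) -> str:
    """Pick a single DA label from a hierarchical annotation string."""
    # (iii) With a pipe, keep the label of the later part of the utterance.
    if "|" in annotation:
        annotation = annotation.split("|")[-1]
    general, *specific = annotation.split("^")
    # (i) and (ii): ignore the dropped specific tags.
    specific = [tag for tag in specific if tag not in DROPPED]
    # First remaining specific tag if any, otherwise the general tag.
    return specific[0] if specific else general
```

For example, under these assumptions `select_label("s^aa^rt")` yields the specific tag `aa`, while `select_label("qy^d^rt")` falls back to the general tag `qy`.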

SwDA
The second dataset we work with is the Switchboard Dialog Act (SwDA) corpus (Jurafsky, 1997), a collection of telephone conversations between two people on a pre-specified topic. The SwDA corpus (see footnote 6) contains 43 DA types, and the average dialogue length is 192.3 utterances.
4 These are Statement, Question, Floorgrabber, Backchannel, and Disruption.
5 We use the full labels as presented here: https://github.com/NathanDuran/MRDA-Corpus
6 https://github.com/cgpotts/swda

Evaluation Metric
The distributions of DAs in both the MRDA and the SwDA datasets are highly skewed, with the five most frequent classes making up 66.7% and 78.1% of the data in MRDA and SwDA, respectively. Deep learning models are known to be biased toward the classes with the largest number of samples in the data (Johnson and Khoshgoftaar, 2019). As shown in Figure 2, both datasets follow a long-tailed distribution, with a few majority classes accounting for most of the data. Previous works primarily report the accuracy of their models. However, relying solely on accuracy to judge a model on a highly imbalanced dataset can be problematic (Gu et al., 2009; Chawla, 2009; Bekkar et al., 2013). Accuracy can overestimate model performance even if the model only performs well on a few frequent classes (Kotsiantis et al., 2006). Building a system that also performs well on minority DA classes is important for many downstream tasks. For example, infrequent DAs such as repetition request and partial reject can be effective indicators of misunderstanding for conversational agents (Aberdeen and Ferro, 2003). DAs such as command and suggestion, which together make up less than 4% of the data in MRDA, are especially important for conversational AI assistants to recognize.
The F1 score is the harmonic mean of precision and recall, and it serves as a balanced assessment of both these measures (Buckland and Gey, 1994). With the end goal of building a system that can be useful for several downstream tasks, we report the macro F1, precision, and recall scores to give equal weight to each class. Although accuracy is not the most appropriate metric to use in this setting, we still report it for comparison with prior work.
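The difference between accuracy and macro F1 on skewed data can be seen in a small worked example. The sketch below computes macro F1 as the unweighted mean of per-class F1 scores over all labels that appear in the gold or predicted sequences; the function names and the toy labels are made up for illustration, and the exact averaging details used in the experiments are not specified here.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# A classifier that always predicts the majority class on toy data.
gold = ["statement"] * 8 + ["question", "backchannel"]
pred = ["statement"] * 10
```

On this toy data the majority-class predictor reaches an accuracy of 0.8 but a macro F1 of only 8/27 (about 0.30), because the two minority classes each contribute an F1 of zero.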

Baseline
As a baseline, we take the utterance representations from the encoder presented in § 4.1 and pass them to a feed-forward layer $f_{\text{baseline}}$ to get class probabilities:
$$\hat{y}_{\text{baseline}} = f_{\text{baseline}}(h_i) \quad (6)$$
We use a fixed chunk size of 128 for all the baselines. Based on the results in He et al. (2021), this is the best chunk size for SwDA. The effect of chunk size on DA classification accuracy is negligible for the MRDA dataset. We compare the speaker graph model with two baselines. The first is the plain baseline described above; the other, the turn-aware baseline, has augmented turn-aware input as presented in § 4.2.

Experimental Setup
The implementation is done in PyTorch (Paszke et al., 2019).

Results and Discussion
We compare the speaker graph model against the stronger of the two baselines and consequently also use the turn-aware input for the graph model. All results are averaged over 5 runs to account for the fluctuation introduced by randomness. The difference in performance between the speaker graph model and the baseline is statistically significant (p < 0.0001, Student's t-test).

Prior Work Baselines
As mentioned previously, prior work has focused on the high-level categories of DA and reports accuracy only. Table 3 shows the results on fine-grained classes for the presented speaker graph model along with several related systems. Our results differ from previous work in that we report the accuracy from the epoch with the best F1 score on the validation set instead of the best accuracy. Most papers on DA classification do not have their code publicly available (with the exception of He et al. (2021)), making comparisons with them difficult. We present the results for three systems from DA classification.
• BiLSTM+CRF (Kumar et al., 2018)
We also compare with related works from emotion recognition:
• DialogueGCN (Ghosal et al., 2019) leverages self- and inter-speaker dependency to model the entire conversational context using a graph neural network.
• DialogueRNN (Majumder et al., 2019) keeps track of the individual party states throughout the conversation using speaker-specific RNNs.
• DialogXL (Shen et al., 2021a)
Since some of these works use older, non-contextualized word embeddings such as GloVe (Pennington et al., 2014), we swap them with RoBERTa (Liu et al., 2019) to be comparable with our work and to make sure the gains of our approach are not due to the use of a better language model. For a fair comparison across models, we also chunk the dialogues into chunks of the same size. All results are an average over 5 random seeds. Details on changes made to any baselines to make them compatible with our data can be found in Appendix B.
SwDA. Adding graph-learned speaker representations does not help and brings the performance down slightly in terms of both F1 and accuracy. The speaker graph model is 0.18 F1 points worse than the best-performing turn-aware baseline. SwDA is a two-party dataset, and we postulate that adding speaker information in the form of a token to the input makes better use of the speaker turns. This may be because turn awareness is induced at an earlier stage, in the utterance itself, before it is passed to an RNN to capture sequential context.
On the other hand, modeling speakers using a graph has the downside of losing this sequential information (Ishiwatari et al., 2020). This could deteriorate the performance in dyadic dialogues, where a token indicating a speaker switch is enough to instill interlocutor information.
As mentioned in the earlier section, we mainly report and study the macro F1 scores to evaluate the models. We also present the accuracy in Table 3 for the sake of comparison with prior work.

MRDA
The turn-aware graph model gives the best performance on MRDA. When we compare the speaker graph model with the baseline, where the input to both also contains the turn tokens, we see an improvement of 1.53 F1 points. For the rest of the discussion, we focus on MRDA, since we are interested in a multiparty dialogue setting.

Classwise Results
In this section, we analyze how the addition of speaker representations affects individual DA types. The detailed results can be found in Table 8. Performance on floor mechanisms improves with the addition of speaker representations. Floor mechanism tags all share a very similar vocabulary (Dhillon et al., 2004). In hold, a speaker is passed the floor, while in floor grabber, the speaker tries to gain the floor. In order to disambiguate between these tags, it is vital to have information about who is speaking.
Without looking at prior turns, tags like affirmative answer and negative response can easily be confused with statements (Dhillon et al., 2004). These types of responses are hard to catch because disambiguating them requires analyzing the utterance in light of the dialogue context. The turn-aware speaker graph model gets a boost of 2.4 F1 points on affirmative answer and 5.4 F1 points on negative answer. This highlights the ability of the graph model to capture the situated speaker behavior.
Table 4: The effect of introducing different edge connections on the macro F1 score. E is the approximate number of total edges in the model. The results are sorted by increasing density of the graph.
Performance on minority classes. Of the 50 classes in MRDA, 35 each comprise less than 1% of the data. Due to this class imbalance, it is difficult to build a system that performs well on these classes. The speaker graph model is able to make improvements on many infrequent classes. Some examples include mimic, about-task, self-correct misspeaking, follow me, and downplayer.
The graph model can capture the subtle nuances of speaker behavior. Tags such as mimic, downplayer, and joke see improvements of 6.6, 3.8, and 2.2 F1 points, respectively. Mimic utterances are those where the speaker repeats another speaker. They often serve as a form of acknowledgment from listeners (Dhillon et al., 2004). This shows that certain intentions cannot be recovered from their semantic content alone.

Ablation Study on the Graph
In this section, we study how introducing different edges affects the model's ability to capture speakers. First, we reiterate the edges used in our experiments and introduce two new edge types.
• $(v^u_i, v^u_j)$: A directed edge from one utterance to another.
• $(v^s_i, v^s_j)$: An undirected edge between two speakers.
To also capture context at the graph level, we connect utterances with each other. Having an edge from every utterance to every other is not feasible due to GPU memory limitations. Following Ghosal et al. (2019), every utterance node is connected with its immediate past and future utterances falling within a window $w$.
Table 4 shows that the most important edge type is the one connecting every speaker with its own utterances. Capturing context at the graph level through the $u \to u$ edge is not needed to model speakers well and hurts the performance slightly. One possible explanation for this is that the utterance nodes are already initialized with contextualized utterance vectors. Furthermore, connecting speakers with each other (spk ↔ spk) hurts the performance in all cases. This shows that giving the speaker nodes direct access to only their own utterances is important and that adding more edges can potentially introduce noise. The simplest graph, with just a single type of edge $(v^u_i, v^s_{s(u_i)})$ between an utterance and its speaker, denoted by spk → utt, results in an F1 score that is still 1 point better than the turn-aware baseline.
The best-performing model also has edges from the utterances to their respective speakers, i.e., the $(v^u_i, v^s_{s(u_i)})$ edges between an utterance and its speaker become undirected. Any $(v^u_i, v^u_j)$ utterance-to-utterance edges introduced to this setting hurt the performance. Capturing context at the graph level becomes computationally expensive and often impossible as the dialogue size increases. One common workaround is to connect utterances falling within a window of size $w$ of each other. Even in such a setting, the memory requirements incurred are often large and prevent $w$ from being set to a large value. We show in our work that this is not only redundant but also hurts the performance. Instead, the simple objective of learning speaker representations directly is not only compute-efficient but performs better as well. The best-performing model has 2|D| edges (row three of Table 4), whereas any graph with $u \to u$ edges introduces an additional 2|D| edges. While both settings have the number of edges as a linear function of the dialogue length, in practice the former can be more effective in low-resource settings, such as under GPU memory and speed constraints.

Effect of Chunk Size
Modeling long conversations in one go is not feasible due to computational resource limitations. Therefore, we segment the dialogues into smaller, more manageable chunks. The maximum and average number of speakers corresponding to each chunk size, along with the macro F1 scores of the graph model, are shown in Table 5. It is possible that this segmentation strips away useful information in cases where an utterance at a later stage of the dialogue relies on an earlier utterance to resolve context issues.
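The segmentation step can be sketched as below, assuming consecutive, non-overlapping chunks of a fixed size with a shorter final chunk; whether the original pipeline pads or overlaps chunks is not stated, and the function names are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk(seq: Sequence[T], size: int) -> List[List[T]]:
    """Split a sequence into consecutive, non-overlapping chunks."""
    return [list(seq[i:i + size]) for i in range(0, len(seq), size)]


def speakers_per_chunk(speakers: Sequence[str], size: int) -> List[int]:
    """Number of unique speakers in each chunk of a dialogue."""
    return [len(set(c)) for c in chunk(speakers, size)]


# Toy speaker sequence for a six-utterance dialogue.
counts = speakers_per_chunk(["A", "B", "A", "A", "C", "C"], size=3)
```

Smaller chunks tend to contain fewer distinct speakers, which is why the number of speakers per chunk is reported alongside each chunk size.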
Our experiments show that choosing a smaller chunk size does not make much difference in performance. A smaller chunk size of 16 still gives a 1-point macro F1 improvement over the turn-aware baseline, which had access to a larger context of 128. Furthermore, we observed performance gains even when only 3 speakers on average are involved in a multiparty setting, as opposed to the purely dyadic case of SwDA, where no gains were observed. This highlights the usefulness of speaker modeling in multi-participant dialogues. The results on the validation set are presented in Table 7.

Utterance Graph Nodes
In this section, we present the results of the model after using the utterance nodes from the graph instead of the speaker nodes to augment the utterance representations from Equation 1. The $h^{\text{speaker-enriched}}_i$ becomes:
$$h^{\text{speaker-enriched}}_i = [\,h_i \,;\, v^u_i\,]$$
where $h_i$ is the encoder representation of the $i$-th utterance and $v^u_i$ is its updated graph node representation. Similarly, we also include the results of concatenating both speaker and utterance nodes. Table 6 shows that although all three systems perform better than the baseline, the model with speaker nodes

Conclusion
We propose a graph-based approach to learn speaker-informed utterance representations for DA classification. We show that directly learning speaker representations with a simple graph is both effective and efficient. Instilling speaker information this way helps disambiguate DA labels in the case of multiparty dialogues. The learned speaker representations can be used on top of any utterance encoding scheme to include speaker information. Future work can also look into incorporating audio features to encode a speaker, since prosody plays an important part in signaling intention.
of type of edge connections to have along with other hyperparameters such as the window size can be restrictive because of compute resource limitations.

A Experimental Setup
A learning rate (LR) scheduler reduces the LR by a factor of 0.1 after 4 epochs of no improvement on the validation set. We train the models, including all the baselines, for a maximum of 100 epochs, with early stopping after 10 epochs of no improvement in macro F1 on the validation set.
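The interaction between the LR schedule and early stopping can be sketched with a small simulation over a sequence of validation scores. This is a simplified illustration, not the training code: the starting LR of 1e-4, the assumption that the scheduler monitors the same validation macro F1 as early stopping, the exact counting of non-improving epochs, and the function name are all assumptions.

```python
from typing import Sequence, Tuple


def simulate_schedule(
    val_f1: Sequence[float],
    lr: float = 1e-4,          # hypothetical starting learning rate
    lr_patience: int = 4,
    factor: float = 0.1,
    stop_patience: int = 10,
    max_epochs: int = 100,
) -> Tuple[int, float]:
    """Return (epochs_run, final_lr) for a sequence of validation scores."""
    best = float("-inf")
    since_best = 0   # epochs since the best validation score
    since_drop = 0   # non-improving epochs since the last LR reduction
    epochs_run = 0
    for epoch, score in enumerate(val_f1[:max_epochs], start=1):
        epochs_run = epoch
        if score > best:
            best, since_best, since_drop = score, 0, 0
            continue
        since_best += 1
        since_drop += 1
        if since_drop == lr_patience:
            lr *= factor       # reduce the LR after a plateau
            since_drop = 0
        if since_best == stop_patience:
            break              # early stopping
    return epochs_run, lr


# One good epoch followed by a long plateau (made-up scores).
epochs_run, final_lr = simulate_schedule([0.5] + [0.4] * 20)
```

In this toy run the LR is reduced twice (after epochs 5 and 9) before early stopping triggers at epoch 11.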
For training the baseline encoder, we keep most of the hyperparameters the same as He et al. (2021). The GRU hidden size used for the encoder is 200, since this is the largest encoder model we could fit on the GPU due to memory constraints when using the output from the encoder as input to the graph model.
The window hyperparameter $w$ was chosen from a search space of {0, 5, 10}. There are two layers of RGAT with attention computed "across-relation" using "additive-self-attention". The number of attention heads for both layers is set to two.
The graph model on top of the encoder has around 154M parameters. All the experiments on MRDA are done on a single NVIDIA A100 GPU, and the model converges in about half an hour. The SwDA experiments are all done on a single NVIDIA RTX A6000, with a single run taking approximately an hour.

B Baseline Models
All the systems included (except DialogXL) use RoBERTa-base as the pre-trained language model. For all the graph models, the hyperparameter w is set to 10, the same as in our graph model.
• We make two changes to SUNET (Song et al., 2023) in the results we report after replicating their system. First, they use the utterance representations of all the utterances of a speaker from the training set to initialize that speaker node, whereas we randomly initialize speakers. Second, the utterance nodes are initialized using the trained encoder from § 5.3. This was because we could not fit the model on the GPU by directly using RoBERTa representations, since those have a larger dimension than the encoder output (768 vs. 400).
• For the Turn Modeling (He et al., 2021) system, we do not include topic embeddings for the SwDA dataset.

C Results on Validation set
We present the results on the validation set here. All the results are an average of 5 runs. Even though the model with chunk size 96 performs slightly better on the validation set, we picked a chunk size of 128. This makes for a fair comparison with the baseline that uses 128-sized chunks.

D Classwise Results
The test set does not have any sample of the welcome class.

Figure 1: Example graph for a dialogue with five utterances and three unique speakers. Dotted lines denote the edges between each utterance and its speaker.
Given a dialogue D with |P | speakers, utterance nodes are initialized using (1), following Zhang et al. (2019), and speaker nodes are randomly initialized. Therefore, the total number of nodes is |V | = |D | + |P |.

Figure 2: Both the datasets follow a long-tailed distribution where a few frequent classes have a very large number of samples and there are many infrequent classes with only a few samples.

Table 1: A short excerpt from a dialogue involving multiple speakers from the MRDA corpus.

Table 2: Number of classes (|C|), participants per dialogue (|P |), and number of dialogues and utterances in each split.

Table 3: F1 score of the speaker-enriched model compared to prior work. The input to both of the last two rows is augmented with turn tokens. ♣ marks results from rerunning the systems on our data and ♦ marks our reimplementations.

Table 7: Effect of chunk size on the validation set of MRDA for the speaker graph model.