MPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation Understanding

Recently, various neural models for multi-party conversation (MPC) have achieved impressive improvements on a variety of tasks such as addressee recognition, speaker identification and response prediction. However, existing methods on MPC usually represent interlocutors and utterances individually and ignore the inherently complicated structure of MPC, which may provide crucial interlocutor and utterance semantics and enhance conversation understanding. To this end, we present MPC-BERT, a pre-trained model for MPC understanding that learns who says what to whom in a unified model with several elaborately designed self-supervised tasks. Particularly, these tasks can be generally categorized into (1) interlocutor structure modeling, including reply-to utterance recognition, identical speaker searching and pointer consistency distinction, and (2) utterance semantics modeling, including masked shared utterance restoration and shared node detection. We evaluate MPC-BERT on three downstream tasks: addressee recognition, speaker identification and response selection. Experimental results show that MPC-BERT outperforms previous methods by large margins and achieves new state-of-the-art performance on all three downstream tasks on two benchmarks.


Introduction
Building a conversational agent with intelligence has drawn significant attention from both academia and industry. Most existing methods have studied understanding conversations between two participants, aiming to return an appropriate response either in a generation-based (Shang et al., 2015; Serban et al., 2016, 2017; Zhang et al., 2018b, 2020) or retrieval-based manner (Lowe et al., 2015; Wu et al., 2017; Zhou et al., 2018; Tao et al., 2019a,b; Gu et al., 2019a,b, 2020). Recently, researchers have paid more attention to a more practical and challenging scenario involving more than two participants, which is well known as multi-party conversation (MPC) (Ouchi and Tsuboi, 2016; Zhang et al., 2018a; Le et al., 2019). Table 1 shows an MPC example in the Ubuntu Internet Relay Chat (IRC) channel, which is composed of a sequence of (speaker, utterance, addressee) triples. In addition to returning an appropriate response, predicting who will be the next speaker (Meng et al., 2018) and who is the addressee of an utterance (Ouchi and Tsuboi, 2016; Zhang et al., 2018a; Le et al., 2019) are unique and important issues in MPC. An instance of MPC always contains complicated interactions between interlocutors, between utterances, and between an interlocutor and an utterance. Therefore, it is challenging to model the conversation flow and fully understand the dialogue content. Existing studies on MPC learn the representations of interlocutors and utterances with neural networks, and their representation spaces are either separate (Ouchi and Tsuboi, 2016) or interactive (Zhang et al., 2018a). However, the semantics contained in the interlocutor and utterance representations may not be effectively captured, as they come from two different representation spaces.
Recently, to take advantage of the breakthrough of pre-trained language models (PLMs) in natural language understanding, some studies proposed to integrate speaker (Gu et al., 2020) or topic (Wang et al., 2020) information into PLMs. Despite the performance improvement on response selection, these models still overlook the inherent relationships between utterances and interlocutors, such as "address-to". Furthermore, most existing studies design models for each individual MPC task (e.g., addressee recognition, speaker identification and response prediction) separately. Intuitively, these tasks are complementary to each other. Exploiting them simultaneously may produce better contextualized representations of interlocutors and utterances and enhance conversation understanding, but this is neglected in previous studies.
To address the above issues, we propose MPC-BERT, which jointly learns who says what to whom in MPC by designing self-supervised tasks for PLMs, so as to improve the ability of PLMs on MPC understanding. Specifically, the five designed tasks include reply-to utterance recognition, identical speaker searching, pointer consistency distinction, masked shared utterance restoration and shared node detection. The first three tasks model the interlocutor structure in MPC in a semantics-to-structure manner: in the output of MPC-BERT, an interlocutor is described through the encoded representations of the utterances it speaks, so the representations of utterance semantics are utilized to construct the conversation structure. The last two tasks model utterance semantics in a structure-to-semantics manner: intuitively, the conversation structure influences the information flow in MPC, so the structure information can in return be used to strengthen the representations of utterance semantics. These five self-supervised tasks are employed to jointly train MPC-BERT in a multi-task learning framework, which helps the model learn the complementary information among interlocutors and utterances, and between structure and semantics. By this means, MPC-BERT can produce better interlocutor and utterance representations which generalize effectively to multiple downstream tasks of MPC.
To measure the effectiveness of these self-supervised tasks and to test the generalization ability of MPC-BERT, we evaluate it on three downstream tasks: addressee recognition, speaker identification and response selection, which are three core research issues of MPC. Two benchmarks based on the Ubuntu IRC channel are employed for evaluation. One was released by . The other was released by Ouchi and Tsuboi (2016) and has three experimental settings according to session lengths. Experimental results show that MPC-BERT outperforms the current state-of-the-art models by margins of 3.51%, 2.86%, 3.28% and 5.36% on the test sets of these two benchmarks respectively in terms of the session accuracy of addressee recognition, by margins of 7.66%, 2.60%, 3.38% and 4.24% respectively in terms of the utterance precision of speaker identification, and by margins of 3.82%, 2.71%, 2.55% and 3.22% respectively in terms of the response recall of response selection.
In summary, our contributions in this paper are three-fold: (1) MPC-BERT, a PLM for MPC understanding, is proposed by designing five self-supervised tasks based on the interactions among utterances and interlocutors. (2) Three downstream tasks are employed to comprehensively evaluate the effectiveness of our designed self-supervised tasks and the generalization ability of MPC-BERT. (3) Our proposed MPC-BERT achieves new state-of-the-art performance on all three downstream tasks on two benchmarks.

Related Work
Existing methods on building dialogue systems can be generally categorized into studying two-party conversations and multi-party conversations (MPC). In this paper, we study MPC. In addition to predicting utterances, identifying the speaker and recognizing the addressee of an utterance are also important tasks for MPC. Ouchi and Tsuboi (2016) first proposed the task of addressee and response selection and created an MPC corpus for studying it. Zhang et al. (2018a) proposed SI-RNN, which updated speaker embeddings role-sensitively for addressee and response selection. Meng et al. (2018) proposed speaker classification as a surrogate task for speaker modeling. Le et al. (2019) proposed a who-to-whom (W2W) model to recognize the addressees of all utterances. A graph-structured network (GSN) was proposed to model the graphical information flow for response generation. Wang et al. (2020) proposed to track the dynamic topic for response selection.
Generally speaking, previous studies on MPC cannot unify the representations of interlocutors and utterances effectively. Also, they are limited to each individual task, ignoring the complementary information among different tasks. To the best of our knowledge, this paper makes the first attempt to design various self-supervised tasks for building PLMs aiming at MPC understanding, and to evaluate the performance of PLMs on three downstream tasks as comprehensively as possible.

MPC-BERT and Self-Supervised Tasks
An MPC instance is composed of a sequence of (speaker, utterance, addressee) triples, denoted as {(s_n, u_n, a_n)}_{n=1}^{N}, where N is the number of turns in the conversation. Our goal is to build a pre-trained language model for universal MPC understanding. Given a conversation, this model is expected to produce embedding vectors for all utterances which contain not only the semantic information of each utterance, but also the speaker and addressee structure of the whole conversation. Thus, it can be effectively adapted to various downstream tasks by fine-tuning model parameters.
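As a concrete illustration, the triple sequence can be held in a simple data structure. This is a toy sketch with invented speakers and utterances; as in the example of Table 1, the first turn has no addressee.

```python
from typing import NamedTuple, Optional

class Turn(NamedTuple):
    speaker: str
    utterance: str
    addressee: Optional[str]  # None when the addressee is unknown

# A toy MPC instance {(s_n, u_n, a_n)}_{n=1}^{N}; contents are invented.
conversation = [
    Turn("A", "anyone know why my install hangs?", None),
    Turn("B", "which version are you on?", "A"),
    Turn("A", "the latest LTS", "B"),
    Turn("C", "try checking the disk first", "A"),
]

N = len(conversation)
speakers = [t.speaker for t in conversation]
assert N == 4 and speakers == ["A", "B", "A", "C"]
```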

Model Overview
In this paper, BERT (Devlin et al., 2019) is chosen as the backbone of our PLM for MPC. Thus, we name it MPC-BERT. It is worth noting that our proposed self-supervised tasks for training MPC-BERT can also be applied to other types of PLMs.
We first give an overview of the input representations and the overall architecture of MPC-BERT. When constructing the input representations, in order to incorporate the speaker information of each utterance, speaker embeddings (Gu et al., 2020) are introduced as shown in Figure 1. Considering that the set of interlocutors varies across conversations, a position-based interlocutor embedding table is initialized randomly and updated during pre-training; each interlocutor in a conversation is assigned an embedding vector according to the order in which it first appears in the conversation. The speaker embedding for each utterance is then derived by looking up this table. The speaker embeddings are combined with the standard token, position and segment embeddings and then encoded by BERT. The output embeddings of BERT corresponding to different input tokens are utilized by the different self-supervised tasks for further calculation.
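The position-based interlocutor indexing can be sketched as follows. This is a pure-Python toy, not the paper's implementation: the table size, dimension and values are invented stand-ins.

```python
import random
random.seed(3)

d = 4  # toy hidden size (768 in the paper)
# Randomly initialized position-based interlocutor embedding table.
speaker_table = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(8)]

def speaker_indices(speakers):
    """Index each speaker by the order of its first appearance."""
    order = {}
    for s in speakers:
        order.setdefault(s, len(order))
    return [order[s] for s in speakers]

idx = speaker_indices(["carol", "bob", "carol", "alice"])
assert idx == [0, 1, 0, 2]

# Speaker embedding for each utterance, to be combined with the token,
# position and segment embeddings of that utterance's tokens.
speaker_embs = [speaker_table[i] for i in idx]
assert len(speaker_embs) == 4 and len(speaker_embs[0]) == d
```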

Tasks of Interlocutor Structure Modeling
The first three tasks follow the semantics-to-structure manner. In MPC-BERT, each interlocutor is described through the encoded representations of the utterances it speaks. Thus, the representations of utterance semantics are utilized to construct the conversation structure. Figure 1 shows the input representations and the model architectures of these three tasks. A [CLS] token is inserted at the start of each utterance, denoting its utterance-level representation. Then, all utterances in a conversation are concatenated and a [SEP] token is appended at the end of the whole sequence. Notably, these three tasks share the same form of input data, so the input only needs to be encoded once by BERT while the output is fed into all three tasks, which is computationally efficient. As shown in Figure 1, a task-dependent non-linear transformation layer is placed on top of BERT in order to adapt the output of BERT to the different tasks. We describe the details of these tasks as follows.

Reply-to Utterance Recognition
To enable the model to recognize the addressee of each utterance, a self-supervised task named reply-to utterance recognition (RUR) is proposed to learn which preceding utterance the current utterance replies to. After encoding by BERT, we extract the contextualized representations of the [CLS] tokens representing individual utterances. Next, a non-linear transformation followed by a layer normalization is performed to derive the utterance representations for this specific task, {u^rur_i}_{i=1}^{N}, where u^rur_i ∈ R^d and d = 768. Then, for a specific utterance U_i, its matching scores with all its preceding utterances are calculated as

    m_ij = softmax_{1≤j<i}( (u^rur_i)ᵀ A_rur u^rur_j ),   (1)

where A_rur ∈ R^{d×d} is a linear transformation, m_ij denotes the matching degree of U_j being the reply-to utterance of U_i, and 1 ≤ j < i. We construct a set S by sampling a certain number of utterances in a conversation, and this recognition operation is performed for each utterance in S. Meanwhile, a dynamic sampling strategy is adopted so that models can see more samples. Finally, the pre-training objective of this self-supervised task is to minimize the cross-entropy loss

    L_rur = − Σ_{U_i ∈ S} Σ_{j=1}^{i−1} y_ij log(m_ij),   (2)

where y_ij = 1 if U_j is the reply-to utterance of U_i and y_ij = 0 otherwise.
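The bilinear RUR scoring and its cross-entropy loss can be sketched in pure Python. All vectors and the reply-to label below are random or invented stand-ins; `d` is shrunk from 768 to keep the toy readable.

```python
import math
import random

random.seed(0)
d = 4  # toy hidden size (768 in the paper)

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Stand-in utterance representations u_i^rur and bilinear map A_rur.
u = [rand_vec(d) for _ in range(4)]
A_rur = [rand_vec(d) for _ in range(d)]

def rur_scores(i):
    """Scores m_ij of each preceding U_j being the reply-to utterance of U_i."""
    logits = [dot(matvec(A_rur, u[i]), u[j]) for j in range(i)]
    return softmax(logits)

m = rur_scores(3)             # distribution over U_0, U_1, U_2
reply_to = 1                  # suppose U_3 replies to U_1, i.e. y_31 = 1
loss = -math.log(m[reply_to]) # cross-entropy over preceding utterances
assert abs(sum(m) - 1.0) < 1e-9 and loss > 0
```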

Identical Speaker Searching
Knowing the speaker of each utterance is also important for MPC. The task of identical speaker searching (ISS) masks the speaker embedding of a specific utterance in the input representation and aims to predict its speaker given the conversation. Since the set of interlocutors varies across conversations, predicting the speaker of an utterance is reformulated as searching for the utterances that share the identical speaker.
First, for a specific utterance, its speaker embedding is masked with a special [Mask] interlocutor embedding to avoid information leakage. Given the utterance representations for this specific task, {u^iss_i}_{i=1}^{N}, where u^iss_i ∈ R^d, the matching scores of U_i with all its preceding utterances are calculated similarly to Eq. (1). Here, m_ij denotes the matching degree of U_j sharing the same speaker as U_i. For each instance in the dynamic sampling set S, there must be an utterance in previous turns sharing the same speaker; otherwise, it is removed from the set. Finally, the pre-training objective of this task is to minimize a cross-entropy loss similar to Eq. (2), where y_ij = 1 if U_j shares the same speaker with U_i and y_ij = 0 otherwise.
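The filtering condition on the sampled set S (an utterance qualifies only if some earlier turn shares its speaker) can be sketched as follows, with an invented toy speaker sequence.

```python
# Dynamic sampling for ISS: keep only utterances for which a preceding
# turn has the same speaker; the rest are removed from the set S.
speakers = ["A", "B", "A", "C", "B"]  # invented toy conversation

def iss_candidates(speakers):
    return [i for i, s in enumerate(speakers) if s in speakers[:i]]

S = iss_candidates(speakers)
# U_2 (speaker A) and U_4 (speaker B) have a preceding same-speaker turn.
assert S == [2, 4]
```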

Pointer Consistency Distinction
We design a task named pointer consistency distinction (PCD) to jointly model speakers and addressees in MPC. In this task, a pair of utterances representing the "reply-to" relationship is defined as a speaker-to-addressee pointer. Here, we assume that the representations of two pointers directing from the same speaker to the same addressee should be consistent. As illustrated in Figure 2 (a), speaker S_m speaks U_i′ and U_j′, which reply to U_i and U_j from speaker S_n respectively. Thus, the utterance tuples (U_i′, U_i) and (U_j′, U_j) both represent the pointer S_m-to-S_n, and their pointer representations should be consistent.
Given the utterance representations for this specific task, {u^pcd_i}_{i=1}^{N}, where u^pcd_i ∈ R^d, we first capture the pointer information contained in each utterance tuple. The element-wise difference and multiplication between an utterance tuple (U_i′, U_i) are computed and concatenated as

    p_i′i = [u^pcd_i′ − u^pcd_i ; u^pcd_i′ ⊙ u^pcd_i],   (3)

where p_i′i ∈ R^{2d}. Then, we compress p_i′i to obtain the pointer representation p̄_i′i as

    p̄_i′i = g( (W_pcd)ᵀ p_i′i + b_pcd ),   (4)

where W_pcd ∈ R^{2d×d} and b_pcd ∈ R^d are parameters and g is a non-linear activation. Identically, a consistent pointer representation p̄_j′j and an inconsistent one p̄_k′k sampled from this conversation are obtained. The similarity between two pointers is calculated as

    m_ij = (p̄_i′i)ᵀ p̄_j′j,   (5)

where m_ij denotes the matching degree of pointer p̄_i′i being consistent with pointer p̄_j′j; m_ik is derived accordingly. Finally, the pre-training objective of this task is to minimize the hinge loss, which enforces m_ij to be larger than m_ik by at least a margin ∆:

    L_pcd = max(0, ∆ − m_ij + m_ik).   (6)

Figure 2: Illustrations of the self-supervised tasks of (a) pointer consistency distinction and (b) shared node detection. Rectangles denote utterances, circles denote interlocutors, a solid line denotes an utterance replying to an utterance, and a dashed line denotes an utterance from an interlocutor.
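A toy sketch of the pointer representation and hinge loss follows. Two details are assumptions here rather than facts from the text: tanh stands in for the non-linear activation, and a plain dot product stands in for the pointer similarity. All vectors are random stand-ins.

```python
import math
import random

random.seed(1)
d = 4

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def pointer_rep(u_reply, u_target, W, b):
    """p = [u' - u ; u' * u] compressed to d dims; tanh is a stand-in
    for the unspecified non-linear activation."""
    p = [a - c for a, c in zip(u_reply, u_target)] + \
        [a * c for a, c in zip(u_reply, u_target)]
    return [math.tanh(dot(col, p) + b[k]) for k, col in enumerate(W)]

W = [rand_vec(2 * d) for _ in range(d)]  # d output dims, 2d inputs each
b = rand_vec(d)

p_cons_a = pointer_rep(rand_vec(d), rand_vec(d), W, b)  # pointer S_m -> S_n
p_cons_b = pointer_rep(rand_vec(d), rand_vec(d), W, b)  # same speaker pair
p_incons = pointer_rep(rand_vec(d), rand_vec(d), W, b)  # different pair

margin = 0.4
m_pos = dot(p_cons_a, p_cons_b)            # consistent-pair similarity
m_neg = dot(p_cons_a, p_incons)            # inconsistent-pair similarity
loss = max(0.0, margin - m_pos + m_neg)    # hinge loss
assert loss >= 0.0
```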

Tasks of Utterance Semantics Modeling
Intuitively, the conversation structure might influence the information flow, so that it can be used to strengthen the representations of utterance semantics. Thus, two self-supervised tasks following the structure-to-semantics manner are designed.

Masked Shared Utterance Restoration
There are usually several utterances replying to a shared utterance in MPC. Intuitively, a shared utterance is semantically relevant to more utterances in the context than non-shared ones. Based on this characteristic, we design a task named masked shared utterance restoration (MSUR). We first randomly sample an utterance from all shared utterances in a conversation and mask every token in this sampled utterance with a [MASK] token. The model is then enforced to restore the masked utterance given the rest of the conversation. Formally, assume U_i is the masked shared utterance and l_i is the number of tokens in U_i. Given the token representations for this task, {u^msur_{i,t}}_{t=1}^{l_i}, where u^msur_{i,t} ∈ R^d, the probability distribution of each masked token can be calculated as

    p_{i,t} = softmax( u^msur_{i,t} W_msur + b_msur ),   (7)

where W_msur ∈ R^{d×V} is the token embedding table, V denotes the vocabulary size, and b_msur ∈ R^V is a bias vector. Finally, the pre-training objective of this self-supervised task is to minimize the negative log-likelihood loss

    L_msur = − Σ_{t=1}^{l_i} log p_{i,t}[w_{i,t}],   (8)

where p_{i,t}[w_{i,t}] is the element of p_{i,t} corresponding to the original token w_{i,t}.
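The restoration objective can be sketched with toy numbers; the vocabulary size, logits and original token ids below are all invented.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Per-masked-token logits over a toy vocabulary of V = 5 token types.
logits_per_token = [[2.0, 0.1, -1.0, 0.3, 0.0],   # masked token 1
                    [0.2, 1.5, 0.0, -0.5, 0.1]]   # masked token 2
original_ids = [0, 1]  # vocabulary ids of the original (unmasked) tokens

# Negative log-likelihood of restoring each original token.
loss = 0.0
for logits, tok in zip(logits_per_token, original_ids):
    p = softmax(logits)
    loss += -math.log(p[tok])
assert loss > 0.0
```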

Shared Node Detection
A full MPC instance can be divided into several sub-conversations and we assume that the representations of sub-conversations under the same parent node tend to be similar. As illustrated in Figure 2 (b), two sub-conversations {U 3 , U 5 , U 7 , U 8 } and {U 4 , U 6 , U 9 } share the same parent node U 2 . Thus, they should be semantically relevant. Under this assumption, we design a self-supervised task named shared node detection (SND), which utilizes the conversation structure to strengthen the capability of models on measuring the semantic relevance of two sub-conversations. We first construct the pre-training samples for this task. Empirically, only the sub-conversations under the top shared node in a conversation are collected in order to filter out the sub-conversations with few utterances. Given a full MPC, the two sub-conversations with the most utterances form a positive pair. For each positive pair, we replace one of its elements with another sub-conversation randomly sampled from the training corpus to form a negative pair.
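The pair construction above can be sketched as follows. The sub-conversation contents are invented, and `snd_pairs` is a hypothetical helper name, not part of the paper.

```python
import random
random.seed(2)

# Toy corpus: each conversation is reduced to the sub-conversations under
# its top shared node (utterance ids are invented).
sub_convs = {
    "conv1": [["U3", "U5", "U7", "U8"], ["U4", "U6", "U9"], ["U10"]],
    "conv2": [["V2", "V4"], ["V3", "V5", "V6"]],
}

def snd_pairs(sub_convs):
    """Positive pair: the two largest sub-conversations of one dialogue.
    Negative pair: one element replaced by a sub-conversation sampled
    from another dialogue in the corpus."""
    positives, negatives = [], []
    names = list(sub_convs)
    for name, subs in sub_convs.items():
        a, b = sorted(subs, key=len, reverse=True)[:2]
        positives.append((a, b, 1))
        other = random.choice([n for n in names if n != name])
        negatives.append((a, random.choice(sub_convs[other]), 0))
    return positives + negatives

pairs = snd_pairs(sub_convs)
assert len(pairs) == 4 and {label for *_, label in pairs} == {0, 1}
```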
Formally, given two sub-conversations c_i and c_j, the utterances in each sub-conversation are first concatenated to form two segments. Then, the two segments are concatenated with a [SEP] token, and a [CLS] token is inserted at the beginning of the whole sequence. This sequence is encoded by BERT to derive the contextualized representation of the [CLS] token. A non-linear transformation with sigmoid activation is further applied to this representation to calculate the matching score m_ij, i.e., the probability of c_i and c_j sharing the same parent node. Finally, the pre-training objective of this task is to minimize the cross-entropy loss

    L_snd = − [ y_ij log(m_ij) + (1 − y_ij) log(1 − m_ij) ],   (9)

where y_ij = 1 if c_i and c_j share the same parent node and y_ij = 0 otherwise.

Multi-task Learning
In addition, we adopt the tasks of masked language model (MLM) and next sentence prediction (NSP) from the original BERT pre-training (Devlin et al., 2019), which have been proven effective for incorporating domain knowledge (Gu et al., 2020; Gururangan et al., 2020). Finally, MPC-BERT is trained by performing multi-task learning that minimizes the sum of all loss functions:

    L = L_rur + L_iss + L_pcd + L_msur + L_snd + L_mlm + L_nsp.   (10)


Downstream Tasks

Addressee Recognition
Given a multi-party conversation where part of the addressees are unknown, Ouchi and Tsuboi (2016) and Zhang et al. (2018a) recognized the addressee of the last utterance, while Le et al. (2019) recognized the addressees of all utterances in a conversation. In this paper, we follow the more challenging setting of Le et al. (2019). Formally, models are asked to predict {â_n}_{n=1}^{N} given {(s_n, u_n, a_n)}_{n=1}^{N} \ {a_n}_{n=1}^{N}, where â_n is selected from the interlocutor set of the conversation and \ denotes exclusion. When applying MPC-BERT, this task is reformulated as finding, for each utterance, the preceding utterance spoken by its addressee. For an utterance U_i, its RUR matching scores with all preceding utterances are calculated following Eq. (1). Then, the utterance with the highest score is selected, and the speaker of the selected utterance is taken as the recognized addressee. Finally, the fine-tuning objective of this task is to minimize the cross-entropy loss

    L_ar = − Σ_{i=1}^{N} Σ_{j=1}^{i−1} y_ij log(m_ij),   (11)

where m_ij is defined in Eq. (1), y_ij = 1 if the speaker of U_j is the addressee of U_i and y_ij = 0 otherwise.
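The inference step, picking the highest-scoring preceding utterance and returning its speaker, can be sketched with invented scores and speakers:

```python
# Addressee recognition as reply-to search: the speaker of the preceding
# utterance with the highest RUR score is the predicted addressee.
speakers = ["A", "B", "C", "A"]          # speakers of U_0..U_3 (invented)
rur_scores_for_last = [0.1, 0.7, 0.2]    # toy scores m_3j over U_0, U_1, U_2

best_j = max(range(len(rur_scores_for_last)),
             key=rur_scores_for_last.__getitem__)
predicted_addressee = speakers[best_j]
assert predicted_addressee == "B"
```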

Speaker Identification
This task aims to identify the speaker of the last utterance in a conversation. Formally, models are asked to predict ŝ_N given {(s_n, u_n, a_n)}_{n=1}^{N} \ s_N, where ŝ_N is selected from the interlocutor set of the conversation. When applying MPC-BERT, this task is reformulated as identifying the utterances that share the same speaker. For the last utterance U_N, its speaker embedding is masked and its ISS matching scores m_Nj with all preceding utterances are calculated following Section 3.2.2. The fine-tuning objective of this task is to minimize the cross-entropy loss

    L_si = − Σ_{j=1}^{N−1} y_Nj log(m_Nj),   (12)

where y_Nj = 1 if U_j shares the same speaker with U_N and y_Nj = 0 otherwise.

Response Selection
This task asks models to select û_N from a set of response candidates given the conversation context {(s_n, u_n, a_n)}_{n=1}^{N} \ u_N. The key is to measure the matching degree between the context and each response candidate. We concatenate each response candidate with the context and extract the contextualized representation e_[CLS] of the first [CLS] token using MPC-BERT. Then, e_[CLS] is fed into a non-linear transformation with sigmoid activation to obtain the matching score m_cr between the context and the response. Finally, the fine-tuning objective of this task is to minimize the cross-entropy loss according to the true/false labels of responses in the training set:

    L_rs = − [ y log(m_cr) + (1 − y) log(1 − m_cr) ],   (13)

where y = 1 if the response r is a proper one for the context c and y = 0 otherwise.

Datasets
We evaluated our proposed methods on two Ubuntu IRC benchmarks. In the first, both speaker and addressee labels are provided for each utterance. The other benchmark was released by Ouchi and Tsuboi (2016); here, we adopted the version shared by Le et al. (2019) for a fair comparison. Its conversation sessions were separated into three categories according to session length (Len-5, Len-10 and Len-15), following the splitting strategy of previous studies (Ouchi and Tsuboi, 2016; Zhang et al., 2018a; Le et al., 2019). Table 2 presents the statistics of the two benchmarks evaluated in our experiments.

Baseline Models
Non-pre-training-based models Ouchi and Tsuboi (2016) proposed a dynamic model DRNN which updated speaker embeddings with the conversation flow. Zhang et al. (2018a) improved DRNN to SI-RNN which updated speaker embeddings role-sensitively. Le et al. (2019) proposed W2W which jointly modeled interlocutors and utterances in a uniform framework, and predicted all addressees.
Pre-training-based models BERT (Devlin et al., 2019) was pre-trained to learn general language representations with MLM and NSP tasks. SA-BERT (Gu et al., 2020) added speaker embeddings and further pre-trained BERT on a domain-specific corpus to incorporate domain knowledge. We re-implemented SA-BERT with the pre-training corpus used in this paper to ensure fair comparison.

Implementation Details
The version of BERT-base-uncased was adopted for all our experiments. For pre-training, GELU (Hendrycks and Gimpel, 2016) was employed as the activation for all non-linear transformations, and Adam (Kingma and Ba, 2015) was employed for optimization. The learning rate was initialized to 0.00005 and the warmup proportion was set to 0.1. We pre-trained BERT for 10 epochs on the training set of the first benchmark. The maximum utterance number was set to 7 and the maximum sequence length to 230. The maximum sampling numbers per example were set to 4 for RUR, 2 for ISS and 2 for PCD. ∆ in Eq. (6) was set to 0.4, which achieved the best performance among {0.2, 0.4, 0.6, 0.8} on the validation set. Pre-training was performed on a GeForce RTX 2080 Ti GPU with a batch size of 4. For fine-tuning, some configurations differed according to the characteristics of the datasets. For the first benchmark, the maximum utterance number was set to 7 and the maximum sequence length to 230. For the three experimental settings of Ouchi and Tsuboi (2016), the maximum utterance numbers were set to 5, 10 and 15, and the maximum sequence lengths to 120, 220 and 320. All parameters in the PLMs were updated. The learning rate was initialized to 0.00002 and the warmup proportion was set to 0.1. For the first benchmark, fine-tuning was performed for 10 epochs for addressee recognition, 10 epochs for speaker identification, and 5 epochs for response selection; for Ouchi and Tsuboi (2016), the fine-tuning epochs were set to 5, 5 and 3 respectively. Fine-tuning was also performed on a GeForce RTX 2080 Ti GPU; the batch size was 16 for the first benchmark, and 40, 20 and 12 for the three experimental settings of Ouchi and Tsuboi (2016) respectively. The validation set was used to select the best model for testing.
All codes were implemented in the TensorFlow framework (Abadi et al., 2016) and are published to help replicate our results. 1

Metrics and Results
Addressee recognition We followed the metrics of previous work (Le et al., 2019) by employing precision@1 (P@1) to evaluate each utterance against the ground truth. In addition, a session is marked as positive only if the addressees of all its utterances are correctly recognized, which is measured as accuracy (Acc.). Table 3 presents the results of addressee recognition. It shows that MPC-BERT outperforms the best performing model, i.e., SA-BERT, by margins of 3.51%, 2.86%, 3.28% and 5.36% on these test sets respectively in terms of Acc., verifying the effectiveness of the proposed five self-supervised tasks as a whole. To further illustrate the effectiveness of each task, ablation tests were performed, as shown in the last five rows of Table 3. We observe that all self-supervised tasks are useful, as removing any of them causes a performance drop. Among the five tasks, RUR plays the most important role, and the tasks focusing on modeling the interlocutor structure contribute more than those for utterance semantics.

Table 4: Evaluation results of speaker identification on the test sets in terms of P@1. Numbers in bold denote that the improvement over the best performing baseline is statistically significant (t-test with p-value < 0.05).
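The two addressee-recognition metrics can be computed as follows: P@1 averages over utterances, while Acc. credits a session only when every utterance in it is correct. The toy predictions are invented.

```python
# Each session is a list of (predicted, gold) addressee pairs (invented).
sessions = [
    [("B", "B"), ("A", "A"), ("C", "C")],  # fully correct session
    [("B", "B"), ("A", "C")],              # one wrong utterance
]

flat = [p == g for sess in sessions for p, g in sess]
p_at_1 = sum(flat) / len(flat)                 # utterance-level P@1
acc = sum(all(p == g for p, g in sess)
          for sess in sessions) / len(sessions)  # session-level Acc.
assert p_at_1 == 0.8 and acc == 0.5
```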
Speaker identification Similarly, P@1 was employed as the evaluation metric of speaker identification for the last utterance of a conversation and the results are shown in Table 4. It shows that MPC-BERT outperforms SA-BERT by margins of 7.66%, 2.60%, 3.38% and 4.24% respectively in terms of P@1. Besides, from the ablation results we find that all tasks are useful for improving the performance of speaker identification and ISS and RUR contribute the most. In particular, removing PCD, MSUR and SND only leads to slight performance drop. The reason might be that the information conveyed by these tasks is redundant.
Response selection The R_n@k metrics adopted by previous studies (Ouchi and Tsuboi, 2016; Zhang et al., 2018a) were used here. Each model was tasked with selecting the k best-matched responses from n available candidates, and we calculated the recall as R_n@k. Two settings were followed, in which k was set to 1 and n was set to 2 or 10. Table 5 presents the results of response selection. It shows that MPC-BERT outperforms SA-BERT by margins of 3.82%, 2.71%, 2.55% and 3.22% respectively in terms of R_10@1. Ablation tests show that SND is the most useful task for response selection, and the two tasks focusing on utterance semantics contribute more than those focusing on the interlocutor structure.

Table 5: Evaluation results of response selection on the test sets in terms of R_2@1 and R_10@1, including the Len-5, Len-10 and Len-15 settings of Ouchi and Tsuboi (2016). Numbers in bold denote that the improvement over the best performing baseline is statistically significant (t-test with p-value < 0.05).

Figure 3 illustrates how the performance of BERT, SA-BERT and MPC-BERT changed with respect to session length on the test sets of Ouchi and Tsuboi (2016). The performance of addressee recognition and speaker identification dropped as the session length increased. The reason might be that longer sessions always contain more interlocutors, which increases the difficulty of predicting interlocutors. Meanwhile, the performance of response selection was significantly improved as the session length increased. This can be attributed to longer sessions enriching the representations of contexts with more details, which benefits response selection. Furthermore, as the session length increased, the performance of MPC-BERT dropped more slowly than that of SA-BERT on addressee recognition and speaker identification, and the R_10@1 gap between MPC-BERT and SA-BERT on response selection enlarged from 2.71% to 3.22%.
These results imply the superiority of MPC-BERT over SA-BERT on modeling long MPCs with complicated structures.
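The R_n@k metric used above can be computed as follows; the candidate scores and gold indices are invented.

```python
# R_n@k: fraction of contexts for which the true response is ranked
# among the model's top-k out of n candidates.
def recall_at_k(score_lists, gold_indices, k):
    hits = 0
    for scores, gold in zip(score_lists, gold_indices):
        top_k = sorted(range(len(scores)),
                       key=scores.__getitem__, reverse=True)[:k]
        hits += gold in top_k
    return hits / len(score_lists)

scores = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.5]]  # n = 3 candidates per context
gold = [0, 2]                                # index of the true response
assert recall_at_k(scores, gold, 1) == 0.5   # R_3@1
assert recall_at_k(scores, gold, 2) == 1.0   # R_3@2
```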

Conclusion
In this paper, we present MPC-BERT, a pre-trained language model with five self-supervised tasks for MPC understanding. These tasks jointly learn who says what to whom in MPCs. Experimental results on three downstream tasks show that MPC-BERT outperforms previous methods by large margins and achieves new state-of-the-art performance on two benchmarks.