Unsupervised Conversation Disentanglement through Co-Training

Conversation disentanglement aims to separate intermingled messages into detached sessions, which is a fundamental task in understanding multi-party conversations. Existing work on conversation disentanglement relies heavily upon human-annotated datasets, which are expensive to obtain in practice. In this work, we explore training a conversation disentanglement model without referencing any human annotations. Our method is built upon the deep co-training algorithm, which consists of two neural networks: a message-pair classifier and a session classifier. The former is responsible for retrieving local relations between two messages, while the latter categorizes a message to a session by capturing context-aware information. The two networks are separately initialized with pseudo data built from the unannotated corpus. During the deep co-training process, we use the session classifier as a reinforcement learning component to learn a session assigning policy by maximizing the local rewards given by the message-pair classifier. For the message-pair classifier, we enrich its training data by retrieving message pairs with high confidence from the disentangled sessions predicted by the session classifier. Experimental results on the large Movie Dialogue Dataset demonstrate that our proposed approach achieves competitive performance compared to previous supervised methods. Further experiments show that the predicted disentangled conversations can promote performance on the downstream task of multi-party response selection.


Introduction
With the continuing growth of the Internet and social media, online group chat channels, e.g., Slack¹ and WhatsApp², among many others, have become increasingly popular and play a significant social and economic role. Along with the convenience of instant communication brought by these applications, the inherent property that multiple topics are often discussed in one channel hinders efficient access to the conversational content. In the example shown in Figure 1, people or intelligent systems have to selectively read the messages related to the topics they are interested in from hundreds of messages in the chat channel.

[Figure 1: An example chat channel in which messages from two sessions (S1 and S2) are intermingled.]

¹ https://slack.com/
² https://www.whatsapp.com/
With the goal of automatically grouping messages with the same topic into one session, conversation disentanglement has proved to be a prerequisite for understanding multi-party conversations and solving the corresponding downstream tasks such as response selection (Elsner and Charniak, 2008; Lowe et al., 2017; Wang et al., 2020). Previous research on conversation disentanglement can be roughly divided into two categories: (1) two-step methods, and (2) end-to-end methods. In the two-step methods (Elsner and Charniak, 2008, 2011; Jiang et al., 2018), a model first retrieves the "local" relations between two messages by utilizing either feature engineering approaches or deep learning methods, and then a clustering algorithm is employed to divide an entire conversation into separate sessions based on the message-pair relations. In contrast, end-to-end methods (Tan et al., 2019; Yu and Joty, 2020) capture the "global" information contained in the context of detached sessions and calculate the matching degree between a session and a message in an end-to-end manner.
Though end-to-end methods have been proved to be more flexible and can achieve better performance, the two types of methods are interconnected and complementary, since a globally optimal clustering solution over the local relations will produce the optimal disentanglement scheme (McCallum and Wellner, 2004). Although previous research efforts have achieved impressive progress on conversation disentanglement, they all rely heavily on human-annotated corpora, which are expensive and scarce in practice (Kummerfeld et al., 2019). The heavy dependence on human annotations limits further study of conversation disentanglement as well as its applications on downstream tasks, given the wide variety of occasions where multi-party conversations can happen. In this work, we explore the possibility of training an end-to-end conversation disentanglement model without referencing any human annotations and propose a completely unsupervised disentanglement model.
Our method builds upon the co-training approach (Blum and Mitchell, 1998; Nigam and Ghani, 2000) but extends it to a deep learning framework. Viewing the disentanglement task from a local perspective and a global perspective, our method consists of a message-pair classifier and a session classifier. The message-pair classifier aims to retrieve message-pair relations, serving a similar purpose to the model in a two-step method that retrieves the local relations between two messages. The session classifier is a global context-aware model that directly categorizes a message into a session in an end-to-end fashion. The two classifiers thus view conversation disentanglement from the perspectives of a local two-step method and a global end-to-end model; they are separately initialized with pseudo data built from the unannotated corpus and updated with each other during co-training. More concretely, during the co-training procedure, we adopt reinforcement learning to learn a session assigning policy for the session classifier by maximizing the accumulated rewards between a message and a session, which are given by the message-pair classifier. After updating the parameters of the session classifier, a new set of data with high confidence is retrieved from the predicted disentanglement results of the session classifier and used for updating the message-pair classifier. As shown in Figure 2, the above process is performed iteratively, updating one classifier with the other, until the performance of the session classifier stops increasing.
We conduct experiments on the large public Movie Dialogue Dataset. Experimental results demonstrate that our proposed method outperforms strong baselines based on BERT (Devlin et al., 2019) in two-step settings, and achieves competitive results compared to those of the state-of-the-art supervised end-to-end methods. Moreover, we apply the disentangled conversations predicted by our method to the downstream task of multi-party response selection and get significant improvements compared to a baseline system. In summary, our main contributions are three-fold:

• To the best of our knowledge, this is the first work to investigate unsupervised conversation disentanglement with deep neural models.
• We propose a novel approach based on cotraining which can perform unsupervised conversation disentanglement in an end-to-end fashion.
• We show that our method can achieve performance competitive with supervised methods on the large public Movie Dialogue Dataset. Further experiments show that our method can be easily adapted to downstream tasks and achieve significant improvements.

Related Work
Conversation Disentanglement Conversation disentanglement has long been regarded as a fundamental task for understanding multi-party conversations (Elsner and Charniak, 2008, 2010) and can be combined with downstream tasks to boost their performance (Wang et al., 2020). Previous methods for conversation disentanglement are mostly supervised and can be classified into two categories: (1) two-step approaches and (2) end-to-end methods. The two-step methods (Elsner and Charniak, 2008, 2010, 2011; Chen et al., 2017; Jiang et al., 2018; Kummerfeld et al., 2019) first retrieve the relations between two messages, e.g., "reply-to" relations (Guo et al., 2018), and then adopt a clustering algorithm to construct individual sessions. The end-to-end models (Tan et al., 2019; Yu and Joty, 2020), instead, perform disentanglement in an end-to-end manner, where the context information of detached sessions is exploited to classify a message to a session. End-to-end models tend to achieve better performance than two-step models, but both often need large annotated datasets to get fully trained, which are expensive to obtain; this encourages the demand for unsupervised algorithms. A few preliminary studies perform unsupervised thread detection in email systems based on two-step methods (Wu and Oard, 2005; Erera and Carmel, 2008; Domeniconi et al., 2016), but these methods use handcrafted features that cannot be extended to various datasets. Compared with previous work, our method conducts end-to-end conversation disentanglement in a completely unsupervised fashion, and can be easily adapted to downstream tasks and used in a wide variety of applications.
Dialogue Structure Learning One problem that may be related to conversation disentanglement is dialogue structure learning (Zhai and Williams, 2014;Shi et al., 2019). Both are related to understanding multi-party conversation structures but they are different tasks. Dialogue structure learning aims to discover latent dialogue topics and construct an implicit utterance dependency tree to represent a multi-party dialogue's turn taking (Qiu et al., 2020), while the goal of conversation disentanglement is to learn an explicit dividing scheme that separates intermingled messages into sessions.
Co-training Co-training (Blum and Mitchell, 1998; Nigam and Ghani, 2000) has been widely used as a low-resource learning algorithm in natural language processing (Wu et al., 2018). It assumes that the data has two complementary views and utilizes two models to iteratively provide pseudo training signals to each other. Our method consists of a message-pair classifier and a session classifier, which respectively view the unannotated dataset from the perspective of the local relations between two messages and that of the context-aware relations between a session and a message. To the best of our knowledge, this is the first work that utilizes co-training in the research of conversation disentanglement. We extend the co-training idea to the deep learning paradigm to construct novel models for disentanglement.

Formulation and Notations
Given a conversation C = [m_1, m_2, · · · , m_N] that contains K intermingled sessions, where the session assignments and K are unknown to the model, our goal is to learn a dividing scheme that indicates which session a message m_i belongs to. We solve this task in an end-to-end fashion, formulating unsupervised conversation disentanglement as an unsupervised sequence labeling task. For a given message m_i, there exists a session set T = {T_1, · · · , T_z(i)}, where z(i) indicates the number of detached sessions when m_i is being processed. The model needs to decide if m_i belongs to any session in T. If m_i ∈ T_k, then m_i is appended to T_k; otherwise a new session T_z(i)+1 is built, initialized by m_i, and added to T.

Method
In this section, we describe our co-training based framework in detail, which contains the following components:

1. A message-pair classifier that retrieves the relations between two messages. The relation scores will be used as rewards for updating the session classifier during co-training.

2. A session classifier that performs end-to-end conversation disentanglement by retrieving the relations between a message and a session. The predicted results will be used to build new pseudo data to train the message-pair classifier during co-training.

3. A co-training algorithm involving the message-pair classifier and the session classifier. The two classifiers help to update each other until the performance of the session classifier stops growing.

We will introduce the details of the three components in the following sections.

Message-pair Classifier
The message-pair classifier is a binary classifier, which we denote as F_m in the remainder of this paper. Due to the lack of annotated data in unsupervised settings, the goal of F_m is to predict whether two messages are in the same session, i.e., whether they talk about the same topic, which differs from most previous work that predicts the "reply-to" relation. In our experiments, we adopt pretrained BERT (Devlin et al., 2019) in the base version as our message encoder.

Model
Given two messages m_i and m_j, we separately obtain the sentence embeddings of the two messages with the BERT encoder:

v_m_i = BERT(m_i), (1)
v_m_j = BERT(m_j). (2)

The probability of m_i and m_j belonging to the same session is computed with the dot product between v_m_i and v_m_j:

p(m_i, m_j) = σ(v_m_i^⊤ v_m_j). (3)

We abbreviate Eq. 1-3 as:

p(m_i, m_j) = F_m(m_i, m_j). (4)

F_m is trained to minimize the cross-entropy loss. The predicted probabilities between message pairs will be used as rewards during the co-training process to update the session classifier.
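The pair score is simply a sigmoid over the dot product of the two sentence embeddings. A minimal sketch, with plain Python lists standing in for BERT sentence embeddings (the function names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_probability(v_i, v_j):
    """Probability that two messages share a session: a sigmoid over the
    dot product of their sentence embeddings, as in Eq. 3."""
    return sigmoid(sum(a * b for a, b in zip(v_i, v_j)))
```

With orthogonal embeddings the score is exactly 0.5 (an uninformative pair); a large positive dot product pushes it toward 1.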

Initialization
One important step in the standard co-training algorithm is to initialize the classifiers with a small amount of annotated data. Since our dataset is completely unlabeled, we create a pseudo dataset to initialize the message-pair classifier. The assumption we use in our experiments is that one speaker mostly participates in only one session.⁴

⁴ We verify this assumption on two natural multi-party conversation datasets: the Reddit dataset (Tan et al., 2019) and the Ubuntu IRC dataset (Kummerfeld et al., 2019). Statistics show that only 6% of speakers join multiple sessions on the Reddit dataset and 20% on the IRC dataset.
To construct the pseudo data D_m, we use message pairs from the same speaker in one conversation as the positive cases, while randomly sampling messages from different conversations as the negative pairs. In this way we obtain a retrieved dataset D_m^ret containing 937K positive cases and 2,184K negative cases. However, we observe that the positive cases constructed from the above process are very noisy because: (1) some speakers still appear in multiple sessions, and (2) even message pairs from the same speaker in the same session can be semantically very different, since they are not contiguous messages. These noisy training cases will result in low confidence for the predicted probabilities of F_m, which will be used later in co-training. Thus we randomly select some messages from the unlabeled dataset and use a pretrained DialoGPT (Zhang et al., 2020) to generate direct responses to form new positive cases, which we denote as D_m^gen. In this way, we finally obtain the pseudo data D_m = D_m^ret ∪ D_m^gen, which contains 1,212K positive cases and 2,184K negative cases, to initialize F_m.
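The retrieval of same-speaker positives and cross-conversation negatives can be sketched as follows. The data layout (a list of `(speaker, message)` lists) and the function name are illustrative, and the DialoGPT-generated positives are omitted:

```python
import random

def build_pseudo_pairs(conversations, n_negatives=1, seed=0):
    """Sketch of building D_m^ret.
    Positives: message pairs from the same speaker within one conversation.
    Negatives: messages sampled from two different conversations."""
    rng = random.Random(seed)
    positives, negatives = [], []
    for conv in conversations:
        by_speaker = {}
        for speaker, msg in conv:
            by_speaker.setdefault(speaker, []).append(msg)
        for msgs in by_speaker.values():
            for i in range(len(msgs)):
                for j in range(i + 1, len(msgs)):
                    positives.append((msgs[i], msgs[j], 1))
    # sample negatives from two distinct conversations
    for _ in range(n_negatives * len(positives)):
        c1, c2 = rng.sample(range(len(conversations)), 2)
        m1 = rng.choice(conversations[c1])[1]
        m2 = rng.choice(conversations[c2])[1]
        negatives.append((m1, m2, 0))
    return positives + negatives
```

In the paper the negative set is larger than the positive set; `n_negatives` controls that ratio here.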

Two-step Disentanglement
After being trained on the pseudo data D_m, the message-pair classifier F_m can be exploited for two-step conversation disentanglement. Given an unlabeled conversation C = [m_1, m_2, · · · , m_N], we first use F_m to predict the probability for each message pair in C. Then we perform the greedy search algorithm widely used in previous work (Elsner and Charniak, 2008) to segment C into detached sessions.
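A sketch of this two-step pipeline under a simple variant of greedy search: each message joins the session of its highest-scoring predecessor when that pairwise probability clears a threshold, and otherwise starts a new session. The 0.5 threshold and the helper names are assumptions, not the paper's exact algorithm:

```python
def greedy_disentangle(messages, pair_prob, threshold=0.5):
    """Greedy single-pass clustering over pairwise probabilities.
    `pair_prob(m_a, m_b)` plays the role of F_m. Returns a session
    id for each message, in order."""
    labels = []
    next_id = 0
    for i, m in enumerate(messages):
        best_j, best_p = None, threshold
        for j in range(i):                 # scan all predecessors
            p = pair_prob(messages[j], m)
            if p > best_p:
                best_j, best_p = j, p
        if best_j is None:                 # no predecessor is similar enough
            labels.append(next_id)
            next_id += 1
        else:                              # inherit the best match's session
            labels.append(labels[best_j])
    return labels
```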

Session Classifier
The session classifier, denoted as F_t, aims to calculate the relation between a session and a message, indicating whether the message belongs to the session. Given the current context of a session T = [m_1, · · · , m_|T|] and a message m, the goal of F_t is to decide if m can be appended to T.

Model
For each message m_j ∈ T, we obtain its sentence embedding v_m_j with a bidirectional LSTM network (Hochreiter and Schmidhuber, 1997) followed by a multilayer perceptron (MLP):

h_m_j = BiLSTM(m_j), (5)
v_m_j = MLP(h_m_j). (6)

After obtaining the sentence embeddings of all the messages in T as [v_m_1, · · · , v_m_|T|], we adopt a self-attention mechanism (Yang et al., 2016) to calculate the session embedding v_T by aggregating the information from different messages. Specifically,

α_j = softmax_j(w^⊤ v_m_j + b), (7)
v_T = Σ_j α_j v_m_j, (8)

where w and b are trainable parameters. For the message m, we use the same bidirectional LSTM network and MLP as in Equations 5 and 6 to obtain its sentence embedding v_m. Then the probability of m belonging to T is calculated with the dot product between v_m and v_T:

p(m, T) = σ(v_m^⊤ v_T). (9)

We abbreviate the above process as:

p(m, T) = F_t(m, T). (10)

F_t is trained to minimize the cross-entropy loss.
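The self-attention pooling and dot-product matching can be sketched in plain Python. The vectors stand in for the BiLSTM+MLP message embeddings, and `w`, `b` are the trainable attention parameters:

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def session_probability(session_vecs, msg_vec, w, b):
    """Pool message embeddings into a session embedding with self-attention,
    then score the candidate message with a sigmoid dot product."""
    scores = [sum(wi * vi for wi, vi in zip(w, v)) + b for v in session_vecs]
    alphas = softmax(scores)          # one attention weight per message
    v_t = [sum(a * v[d] for a, v in zip(alphas, session_vecs))
           for d in range(len(msg_vec))]
    dot = sum(x * y for x, y in zip(msg_vec, v_t))
    return 1.0 / (1.0 + math.exp(-dot))
```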

Initialization
Similar to the message-pair classifier, we build a pseudo dataset D_t to initialize the session classifier F_t, so that it can decide whether a message is semantically consistent with a sequence of messages. We construct D_t based on the same assumption that one speaker is involved in just one session most of the time.
Given a conversation C = [m_1, m_2, · · · , m_N] from the unlabeled corpus, we retrieve the messages from a speaker S as C_S = [m_S_1, m_S_2, · · · ], where C_S ⊂ C. Under our assumption, the messages in C_S are in the same session, so any message m_S_i ∈ C_S with i ≠ 1, together with its preceding context, can be regarded as a positive input of F_t. Consider a positive case with m_S_2 as the message; the message m and the session T are then defined as:

m = m_S_2, T = [m_1, · · · , m_S_2−1].

The reason is that m_S_1 ∈ [m_1, · · · , m_S_2−1], so [m_1, · · · , m_S_2−1] and m_S_2 should be semantically consistent according to the assumption.
For the negative instances of D t , we randomly sample a conversation as T from the corpus, and a message from another conversation as m. As such we obtain a pseudo dataset D t consisting of 460K positive instances and 1,158K negative cases.
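The construction of D_t can be sketched as follows. The function name and data layout are illustrative; each conversation is a list of `(speaker, message)` tuples:

```python
import random

def build_session_pseudo_data(conversations, seed=0):
    """Sketch of building D_t.
    Positive: a speaker's non-first message paired with its full preceding
    context (which contains that speaker's earlier message).
    Negative: a whole conversation paired with a message from another one."""
    rng = random.Random(seed)
    positives, negatives = [], []
    for ci, conv in enumerate(conversations):
        seen = set()
        for i, (speaker, msg) in enumerate(conv):
            if speaker in seen:                       # speaker appeared before
                context = [m for _, m in conv[:i]]
                positives.append((context, msg, 1))
            seen.add(speaker)
        others = [c for cj, c in enumerate(conversations) if cj != ci]
        if others:
            neg_msg = rng.choice(rng.choice(others))[1]
            negatives.append(([m for _, m in conv], neg_msg, 0))
    return positives + negatives
```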
[Algorithm 1: An end-to-end method for conversation disentanglement with the session classifier. Input: an unlabeled conversation C and the initialized session classifier F_t. Output: a set of sessions T.]

End-to-end Disentanglement
Note that after being initialized with the pseudo data D_t, the session classifier F_t can be directly applied to perform end-to-end conversation disentanglement. Suppose message m_i is being processed, where m_i ∈ C and C = [m_1, m_2, · · · , m_N]. We first calculate the probability of m_i belonging to its preceding context C_i = [m_1, · · · , m_i−1], which we denote as F_t(m_i, C_i). If F_t predicts that m_i does not belong to C_i, a new session T_z(i)+1 will be initialized by m_i, where z is a function indicating the number of disentangled sessions in C_i; otherwise m_i will be used to calculate the matching probability with each session in T, and will be classified to the session with the greatest matching probability. The overall end-to-end algorithm is shown in Algorithm 1.
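The loop above can be sketched as follows, where `session_prob(session_or_context, msg)` stands in for F_t and the 0.5 threshold on the binary decision is an assumption:

```python
def end_to_end_disentangle(messages, session_prob, threshold=0.5):
    """Sketch of Algorithm 1. Each message either joins the existing
    session with the highest matching probability, or starts a new
    session when F_t decides it does not belong to its preceding
    context. Returns the list of sessions."""
    sessions = []
    for i, m in enumerate(messages):
        context = messages[:i]
        if not sessions or session_prob(context, m) < threshold:
            sessions.append([m])                       # open a new session
        else:                                          # join the best session
            best = max(sessions, key=lambda s: session_prob(s, m))
            best.append(m)
    return sessions
```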

Co-Training
The confidence of F_m and F_t is not high because they are initialized with noisy pseudo data. We propose to adapt the idea of co-training to the disentanglement task, iteratively updating the two classifiers with the help of each other. The session classifier utilizes the local probabilities provided by the message-pair classifier through reinforcement learning, while the message-pair classifier is fed more training data built from the outcomes of the session classifier. We introduce the details in this subsection.

Updating Session Classifier
Since no labeled data is provided to train F t , we formulate the disentanglement task as a deterministic Markov Decision Process and adopt the Policy Gradient algorithm (Sutton et al., 1999) for the optimization. For each co-training iteration, F t will be initialized with the pseudo data D t and then updated by reinforcement learning.
State The state s_i of the i-th disentanglement step consists of three components: the current message m_i; its preceding context C_i; and the detached session set T, which contains z(i) sessions.
Action The action space of the i-th disentanglement step consists of two types of actions:

1. Classifying m_i to a new session, which we denote as a_i^new ∈ {0, 1}. If a_i^new is 0, m_i will be used to initialize a new session T_z(i)+1; otherwise m_i will be categorized into an existing session.

2. Categorizing m_i to an existing session in T, which we denote as a_i^t ∈ {1, · · · , z(i)}.
Policy network We parameterize the actions with a hierarchical policy network π. The first-layer policy π^new(a_i^new | s_i; θ_1) decides whether message m_i belongs to C_i, and the first-layer action is sampled as:

a_i^new ∼ π^new(a_i^new | s_i; θ_1).

If a_i^new is 1, which means m_i belongs to a session in T, the second-layer policy π^t(a_i^t | s_i; θ_2) decides which existing session m_i should be categorized to:

a_i^t ∼ π^t(a_i^t | s_i; θ_2),

where θ_1 and θ_2 are the policy parameters.
Reward The rewards are provided by the message-pair classifier F_m. For a_i^new = 0, we want m_i to be different from all the messages in C_i, so the message-pair reward is defined as the negative average of the probabilities between m_i and all the messages in C_i; for a_i^new = 1 and a_i^t = k, we want m_i to be similar to all the messages in T_k, so the reward is defined as the average of the probabilities between m_i and all the messages in T_k:

r_i^m = −(1 / |C_i|) Σ_{m_j ∈ C_i} F_m(m_i, m_j), if a_i^new = 0,
r_i^m = (1 / |T_k|) Σ_{m_j ∈ T_k} F_m(m_i, m_j), if a_i^new = 1, a_i^t = k.

An issue associated with r_i^m is that its confidence might be low because F_m is trained on noisy pseudo data. We hence design another speaker reward r_i^S based on our assumption: for a message m_i initializing a new session T_z(i)+1, its speaker S_i should not appear in C_i, while for a message m_i categorized to an existing session T_k, it should receive a positive reward if its speaker S_i appears in T_k. The final reward r_i for an action is calculated as:

r_i = γ r_i^m + (1 − γ) r_i^S,

where γ ∈ [0, 1] is a parameter balancing r_i^m and r_i^S, which we set to 0.6 in our experiments. The policy network parameters θ_1 and θ_2 are learned with the policy gradient algorithm by maximizing the expected accumulated reward:

J(θ_1, θ_2) = E_π[ Σ_i r_i ].
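The combined reward for a single step can be sketched as below. The ±1 values for the speaker reward are an illustrative assumption, since the text specifies only its sign, and the helper names are hypothetical:

```python
def step_reward(msg, context, session, pair_prob, speaker_of,
                new_session, gamma=0.6):
    """Reward for one disentanglement action. `pair_prob` plays the role
    of F_m and `speaker_of` returns a message's speaker; gamma balances
    the message-pair reward and the speaker reward."""
    if new_session:
        # message should differ from everything in the preceding context
        r_m = -sum(pair_prob(m, msg) for m in context) / max(len(context), 1)
        r_s = 1.0 if all(speaker_of(m) != speaker_of(msg) for m in context) else -1.0
    else:
        # message should resemble the chosen session
        r_m = sum(pair_prob(m, msg) for m in session) / len(session)
        r_s = 1.0 if any(speaker_of(m) == speaker_of(msg) for m in session) else -1.0
    return gamma * r_m + (1 - gamma) * r_s
```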

Updating Message-pair Classifier
As mentioned in Section 4.1.2, the pseudo data D_m for initializing the message-pair classifier F_m is noisy. Thus we enrich D_m with new training instances D_m^new retrieved from the predicted disentanglement results of F_t.
Given a conversation C, F_t can predict the disentangled sessions as T = {T_k}_{k=1}^K. Take session T_k = [m_1^k, · · · , m_|T_k|^k] as an example: for a message m_i^k, we retrieve its preceding M messages in T_k and form the M pairs {(m_i−M^k, m_i^k), · · · , (m_i−1^k, m_i^k)} as new positive pseudo message pairs. To raise the confidence of the newly added data, we filter out pairs in which the two messages have fewer than 2 overlapping tokens after removing stopwords. For each co-training iteration, F_m is retrained on the data D_m ∪ D_m^new.
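The retrieval-and-filter step can be sketched as follows; the window size default, the toy stopword list, and the function name are illustrative:

```python
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "it", "i"}

def new_positive_pairs(session, window=2, min_overlap=2):
    """For each message in a predicted session, pair it with its preceding
    `window` messages, keeping only pairs that share at least `min_overlap`
    non-stopword tokens."""
    def content(msg):
        return {t for t in msg.lower().split() if t not in STOPWORDS}
    pairs = []
    for i, msg in enumerate(session):
        for j in range(max(0, i - window), i):
            if len(content(session[j]) & content(msg)) >= min_overlap:
                pairs.append((session[j], msg, 1))
    return pairs
```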

Dataset
A large corpus is often required for end-to-end conversation disentanglement. In this work, we conduct experiments on the publicly available Movie Dialogue Dataset, which is built from online movie scripts. It contains 29,669/2,036/2,010 instances for the train/dev/test split, with a total of 827,193 messages, where the session number in one instance can be 2, 3 or 4. Since we work in unsupervised settings, no labels are used in our training.

Implementation Details
We adopt BERT (Devlin et al., 2019) (the uncased base version) as the message-pair classifier. For the session classifier, we set the hidden dimension to 300, and the word embeddings are initialized with 300-d GloVe vectors (Pennington et al., 2014). For training, we use Adam (Kingma and Ba, 2015) for optimization; the learning rate is set to 1e-5 for the message-pair classifier, 1e-4 for initializing the session classifier, and 1e-5 for updating the session classifier with reinforcement learning. We run co-training for 3 iterations, at which point the best performance is achieved on the development set.

Evaluation Metrics
Four clustering metrics widely used in previous work (Elsner and Charniak, 2008; Kummerfeld et al., 2019; Tan et al., 2019) are adopted: Normalized Mutual Information (NMI), One-to-One Overlap (1-1), Loc_3 and Shen F score (Shen-F). More explanations about the metrics can be found in Appendix A.1. Following previous work, we also report the mean squared error (MSE) between the predicted session numbers and the golden session numbers. This metric measures whether the model can disentangle a given dialogue into the correct number of sessions.

Disentanglement Performance

Table 1 shows the results of unsupervised conversation disentanglement for different methods. We can observe that: (1) For two-step methods, BERT has a very poor performance without finetuning, while after being finetuned on our pseudo dataset, its performance improves by a relatively large margin. (2) Utilizing the pseudo pairs generated by a pretrained DialoGPT further improves the performance of BERT based on D_m^ret. We consider this is because the messages from one speaker are usually not contiguous in a conversation, while DialoGPT can directly produce a response to a message, which helps BERT capture the differences between two messages. (3) With the co-training process, the pseudo pairs retrieved from the predictions of the session classifier help BERT achieve a performance close to that of a supervised BERT, which demonstrates the effectiveness of our proposed co-training framework.

[Table 2: The performance of the session classifier and the message-pair classifier in each co-training iteration. Columns NMI, 1-1, Loc_3 and Shen-F are for the session classifier, and column F1 is for the message-pair classifier. "Base" represents the session classifier trained on D_t and the message-pair classifier finetuned on D_m.]
(4) BERT finetuned with golden message pairs has only a marginal performance advantage over BERT finetuned on the pseudo data D_m. This is caused by the weakness of two-step methods, in which the clustering algorithm is a performance bottleneck.
In general, end-to-end methods perform much better than two-step methods, as shown in the table, which is in accordance with the conclusions of previous work under supervised settings (Yu and Joty, 2020). The session classifier trained on the pseudo data D_t achieves a Shen-F score of 59.61, a +5.29 improvement over the supervised BERT in two-step settings. This proves that the model structure and the approach to building D_t are effective for unsupervised conversation disentanglement. Meanwhile, our proposed co-training framework further improves the performance of the session classifier and achieves competitive results with the current state-of-the-art supervised method. With further updating during the co-training process, the session classifier raises the NMI score from 24.96 to 29.72 and the 1-1 score from 54.26 to 56.38. Such a performance gain shows that our co-training framework is an important component in handling unsupervised conversation disentanglement.
Moreover, as we can see in the table, two-step methods have a high MSE on the predicted session numbers, but with the pseudo data D_m, BERT achieves performance much better than without finetuning and even comparable with finetuning on the golden pairs. The end-to-end session classifier achieves a significant improvement on the MSE by reducing it from 1.4602 to 0.8059, while our proposed co-training framework further improves it to 0.6871, which is close to the performance of the supervised model. This demonstrates that the co-training method helps the session classifier better understand the semantics in the conversation and thus more accurately disentangle the conversation into sessions.

[Table 3: The performance on multi-party response selection with disentangled conversations. The first column respectively stands for no disentanglement, the disentangled conversations predicted by our method, and the golden disentangled conversations.]

Analysis of Co-training
In this section we analyze the iterative process of co-training. Table 2 shows the performance of the session classifier in different iterations. We also include, in the last column of Table 2, the performance of the message-pair classifier on the task of pair relation prediction. As we can see, model performance improves with each iteration. For the first iteration, the reward r_i^m is provided by the base message-pair classifier, whose F1 score on relation prediction is 68.26. After the first iteration, new pseudo pairs are retrieved from the disentanglement results and used to improve the performance of the message-pair classifier to 68.44, so that a better reward r_i^m is provided to update the session classifier. As shown in the table, with such a co-training procedure, the performance of both the session classifier and the message-pair classifier is significantly enhanced.

Performance on Response Selection
Conversation disentanglement is a prerequisite for understanding multi-party conversations. In this section we apply our predicted sessions to the downstream task: multi-party response selection.
We create a response selection dataset based on the Movie Dialogue Dataset. We adopt an LSTM-based network to encode the conversations/sessions, and use an attention mechanism to aggregate the information from different sessions, following prior work. More details of the model and implementation can be found in Appendix A.2.
The results are shown in Table 3. Note that the three experiments are performed with models of the same number of parameters. We can see that with the disentangled conversations predicted by our method, there is a significant performance gain compared with the baseline model. Though golden disentanglement brings the best performance, the annotations are usually expensive to acquire. With our method, a disentanglement scheme can be obtained for better understanding multi-party conversations at no annotation cost.

Conclusion
This is the first work to investigate unsupervised conversation disentanglement with deep neural models. We propose a novel approach based on co-training which consists of a message-pair classifier and a session classifier. By iteratively updating the two classifiers with the help of each other, the proposed model attains a performance comparable to that of the state-of-the-art supervised disentanglement methods. Experiments on downstream tasks prove that our method can help better understand multi-party conversations. Our method can be easily adapted to a different assumption, and it can also be extended to other low-resource scenarios such as semi-supervised settings, which we leave as future work.

A.1 Metric Explanation
We use four metrics in our experiments: Normalized Mutual Information (NMI), One-to-One Overlap (1-1), Loc_3 and Shen F score (Shen-F). NMI is a normalization of mutual information, a method for evaluating a clustering against class labels. 1-1 describes how well we can extract whole conversations intact. Loc_3 counts agreements and disagreements within a context window of size 3. Shen-F calculates the F-score for each gold-system conversation pair, takes the maximum for each gold conversation, and averages the results weighted by the size of the gold conversation.

A.2 Multi-party Response Selection
Given a conversation C = [m_1, · · · , m_N] and a candidate message m, the goal of response selection is to decide whether message m is a correct response to the conversation C.
We obtain the disentanglement scheme of C as T = {T_1, T_2, · · · , T_K}, where session T_k = [m_1^k, · · · , m_|T_k|^k]. For each session T_k, we encode each message m_i^k within it by a bidirectional LSTM network and a multilayer perceptron (MLP), and then adopt a self-attention mechanism (Yang et al., 2016) to calculate the session embedding v_T_k by aggregating the information from the different messages, where w and b are trainable parameters, in the same way as the session classifier. In this way we acquire all the session representations {v_T_1, v_T_2, · · · , v_T_K}. Meanwhile, we obtain the candidate message representation v_m with the same LSTM and MLP. Following prior work, we aggregate the information from the different sessions with an attention mechanism in which v_m attends over the session representations, and the final matching score between the conversation and the candidate message is given by the dot product between v_m and the aggregated representation.

[Figure 3: The model structure incorporating disentangled sessions for the task of response selection.]

For the vanilla model using conversation C without any disentanglement, we use the same LSTM, MLP and self-attention to obtain its vector representation v_C, and the matching score is calculated by the dot product between v_C and v_m.
The whole model is trained to minimize the cross-entropy loss of both positive instances and negative instances.