CSAGN: Conversational Structure Aware Graph Network for Conversational Semantic Role Labeling

Conversational semantic role labeling (CSRL) is believed to be a crucial step towards dialogue understanding. However, handling conversational structural information remains a major challenge for existing CSRL parsers. In this paper, we present a simple and effective architecture for CSRL that aims to address this problem. Our model is based on a conversational structure aware graph network which explicitly encodes speaker-dependent information. We also propose a multi-task learning method to further improve the model. Experimental results on benchmark datasets show that our model, trained with our proposed objectives, significantly outperforms previous baselines.


Introduction
Recent research has achieved impressive improvements on conversation-based tasks, such as dialogue response generation (Li et al., 2017; Dinan et al., 2019; Wu et al., 2019), task-oriented dialogue modeling (Mrkšić et al., 2017; Budzianowski et al., 2018) and conversational question answering (Choi et al., 2018; Reddy et al., 2019). However, the frequent occurrence of ellipsis and anaphora in human conversations still poses significant challenges for dialogue understanding. To address this, Xu et al. (2021) proposed the Conversational Semantic Role Labeling (CSRL) task, whose goal is to extract predicate-argument structures across the entire conversation. Figure 1 illustrates an example, where a CSRL parser needs to identify "《泰坦尼克号》(Titanic)" as the ARG1 argument of the predicate "看过 (watched)" and the ARG0 argument of the predicate "是 (is)". One can see that in the original conversation, "《泰坦尼克号》(Titanic)" is omitted in the second turn and referred to as "这 (this)" in the last turn. Xu et al. (2021) have demonstrated the usefulness of CSRL on many downstream tasks such as dialogue generation and dialogue rewriting.
Despite these successes, the existing CSRL model (Xu et al., 2021) is a simple extension of BERT. Specifically, they first encode each utterance into local contextual representations with pre-trained language models, and then utilize a stack of self-attention layers to obtain global contextual representations. We argue that their model may suffer from two main problems. First, in the local feature extraction phase, they ignore the fact that jointly considering the predicate and the context utterances could help the model better identify relevant omitted arguments. Second, in the global feature extraction phase, some vital conversational structural information, such as speaker information, is not properly encoded in their model. Indeed, speaker-dependent information is necessary for modelling inter-speaker and intra-speaker dependencies, both of which could help the model better handle coreference resolution and zero pronoun resolution.
Motivated by the above observations, we propose a new CSRL model, which consists of three main components. First, we use a pre-trained language model to generate local contextual representations for tokens (Sec. 2.1), similar to Xu et al. (2021). Then, we propose a new attention strategy to learn predicate-aware contextual representations for tokens (Sec. 2.2). Finally, we propose a Conversational Structure Aware Graph Network (CSAGN) for learning high-level structural features to represent utterances (Sec. 2.3). The resulting utterance representations are combined with the token representations obtained in the previous two components. With the enhanced token representations, our model predicts the arguments for the given predicate.
In addition, we introduce a multi-task learning method with two new objectives. Experimental results on benchmark datasets show that our model substantially outperforms existing baselines. Our proposed training objectives also help the model to better learn predicate-aware token representations and structure-aware utterance representations. Our code is publicly available at https://github.com/syxu828/CSRL_dataset.

Model
The overall architecture of CSAGN is illustrated in Figure 2. It consists of three main components, which we introduce as follows:

Input Representation
Given a dialogue C = (u_1, u_2, ..., u_K) of K utterances, where u_k = (w_{k,1}, w_{k,2}, ..., w_{k,|u_k|}) is a sequence of words, we first use a pre-trained language model such as BERT to obtain the initial contextual representations e.

Predicate-Aware Utterance Representation
We propose a new attention strategy to better learn predicate-aware contextual representations for tokens. Specifically, a token is only allowed to attend to tokens in the same utterance or in the utterance that contains the predicate:

M_{ij} = 0 if U_i = U_j or U_j = U_pred, and M_{ij} = -∞ otherwise,

where i, j are token indexes in the dialogue, U_i and U_j are the utterances to which the i-th and j-th tokens belong, and U_pred is the utterance that contains the given predicate. For example, in Figure 2(b), assuming U_4 contains the predicate, the previous utterances U_1, U_2 and U_3 can attend to themselves and to U_4, while U_4 only attends to itself. In practice, we use four additional self-attention blocks (Vaswani et al., 2017) with our proposed attention strategy to learn predicate-aware contextual representations from e, which yields token representations p. We then obtain utterance representations u by max-pooling over the words in each utterance.
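The masking rule above can be sketched as follows (a minimal illustration; the function name and the list-based mask layout are our own, not taken from the released implementation):

```python
def predicate_aware_mask(utt_ids, pred_utt):
    """Build a boolean attention mask over all dialogue tokens.

    utt_ids[i] is the index of the utterance containing token i;
    pred_utt is the index of the utterance containing the predicate.
    mask[i][j] is True iff token i may attend to token j: either both
    tokens share an utterance, or token j lies in the predicate utterance.
    """
    n = len(utt_ids)
    return [[utt_ids[i] == utt_ids[j] or utt_ids[j] == pred_utt
             for j in range(n)] for i in range(n)]
```

For the Figure 2(b) example, tokens of an earlier utterance may attend within their own utterance and into the predicate utterance, while predicate-utterance tokens attend only to themselves.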

Conversational Structure Aware Graph Network
We present the Conversational Structure Aware Graph Network (CSAGN) to capture speaker dependent contextual information in a conversation.
In particular, we build a directed graph over the encoded utterances to capture the interaction between the speakers. Formally, a conversation with K utterances is represented as a directed graph G = (V, E, R, W), where v_i ∈ V are the vertices, labeled edges (relations) e_ij ∈ E with label r_ij ∈ R are the relations between vertices v_i and v_j, and α_ij ∈ W is the weight of the relational edge r_ij, with 0 ≤ α_ij ≤ 1. The graph is constructed from the utterances in the following way.

Edges: Each utterance (vertex) is contextually dependent on its past utterances in the conversation, so each vertex v_i has an edge with each of the vertices that represent its past utterances: {v_0, v_1, ..., v_{i-1}}.
Edge weights: We calculate edge weights with an attention mechanism. For vertex v_i, the weight of the incoming edge from v_j is

α_ij = softmax_{j ∈ {0, ..., i-1}}(u_i^T W_e u_j),

where W_e is an attention matrix learnt during training. This ensures that vertex v_i, which has incoming edges from vertices v_0, ..., v_{i-1}, receives a total weight contribution of 1.
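The normalisation step amounts to a plain softmax over the scores of a vertex's incoming edges, as this simplified sketch shows (the function name is hypothetical, and we assume the raw scores u_i^T W_e u_j have already been computed):

```python
import math

def edge_weights(scores):
    """Normalise incoming-edge attention scores for one vertex.

    scores[j] is the unnormalised score u_i^T W_e u_j for past vertex v_j.
    A numerically stable softmax guarantees the weights sum to 1.
    """
    m = max(scores)                              # shift for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```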

Relations:
The relation r_ij of an edge depends on two aspects. Speaker dependency: the relation depends on the speakers of both constituting vertices, i.e., the speakers of u_i and u_j. Predicate dependency: the relation also depends on whether the utterance u_i or u_j contains the predicate.
If there are M distinct speakers involved in a conversation, then the number of relational edge types is M (from-speaker) × M (to-speaker) × 2 (containing the predicate or not) = 2M².
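One possible encoding of this relation inventory into consecutive integer ids (the function name and index scheme are illustrative, not from the paper):

```python
def relation_id(speaker_i, speaker_j, has_predicate, num_speakers):
    """Map (from-speaker, to-speaker, predicate flag) to one of the
    2 * M^2 relation types, as a dense integer id in [0, 2*M^2)."""
    return (speaker_i * num_speakers + speaker_j) * 2 + int(has_predicate)
```

With M = 2 speakers this yields exactly 2 × 2² = 8 distinct relation types.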
Graph feature transformation We now discuss how to propagate global information among the nodes. Following previous work (Schlichtkrull et al., 2018; Ghosal et al., 2019), we use a two-step graph convolution process, which can be understood as a special case of the message passing method (Gilmer et al., 2017), to encode the nodes. We formulate the process as follows:

h_i^(1) = σ( Σ_{r∈R} Σ_{j∈N_i^r} α_ij W_r^(1) u_j + α_ii W_0^(1) u_i ),
h_i^(2) = σ( Σ_{j∈N_i} W^(2) h_j^(1) + W_0^(2) h_i^(1) ),

where h_i^(l) is the i-th encoded node feature from the l-th layer, N_i^r denotes the neighboring nodes of the i-th node under relation r ∈ R, and σ is the ReLU activation (Nair and Hinton, 2010). After the message propagation, the node representations are updated with the initial node embeddings and the message representations. The final utterance representations are denoted as h.
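A single relation-aware message-passing step can be sketched in plain Python as follows (a minimal illustration under stated assumptions: edge weights and per-relation transform matrices are precomputed, all names are our own, and batching/vectorisation are omitted for clarity):

```python
def rgcn_layer(h, edges, weights, W_rel, W_self):
    """One relational graph-convolution step (message passing sketch).

    h       : list of node feature vectors (one per utterance)
    edges   : list of (src, dst, rel) triples over past utterances
    weights : {(src, dst): alpha} edge weights from the attention step
    W_rel   : {rel: matrix} per-relation transforms; W_self: self-loop matrix
    Returns ReLU(self transform + sum of weighted relational messages).
    """
    def matvec(W, x):
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

    out = [matvec(W_self, hi) for hi in h]          # self-loop term
    for src, dst, rel in edges:
        msg = matvec(W_rel[rel], h[src])            # relation-typed message
        out[dst] = [o + weights[(src, dst)] * m for o, m in zip(out[dst], msg)]
    return [[max(0.0, v) for v in node] for node in out]  # ReLU
```

Stacking two such steps corresponds to the two-step convolution described above.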

Multi-Task Learning
In this section, we describe how we train the model based on the representations e, p, g and h. The overall loss is a weighted combination of the three objectives below, balanced by the weights α_1, α_2 and α_3.
SRL Objective. Formally, given an input conversation x, this objective minimizes the negative log likelihood of the corresponding correct label sequence y. Our model predicts each label y_t based on the token representation p_t and its corresponding utterance representation h_k:

L_SRL = - Σ_t log( softmax(W_c [p_t ⊕ h_k])^T δ_{y_t} ),

where W_c is the softmax matrix and δ_{y_t} is a Kronecker delta with a dimension for each output symbol, so softmax(W_c [p_t ⊕ h_k])^T δ_{y_t} is exactly the y_t-th element of the distribution defined by the softmax.
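The token-level negative log-likelihood can be sketched as follows, assuming the logits W_c [p_t ⊕ h_k] have already been computed for each token (the function name is ours; log-space computation avoids underflow):

```python
import math

def srl_nll(logits_per_token, gold_labels):
    """Negative log-likelihood of the gold label sequence.

    logits_per_token[t] holds the scores over the label set for token t;
    the log-softmax probability of gold label y_t is accumulated per token.
    """
    loss = 0.0
    for logits, y in zip(logits_per_token, gold_labels):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        loss -= logits[y] - log_z   # -log softmax(logits)[y]
    return loss
```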
Intra-Argument Objective. The arguments in CSRL can be categorized into two classes, i.e., intra- and cross-arguments, where the former occur in the same dialogue turn as the predicate while the latter usually occur in the dialogue history. Intuitively, the model should be able to identify intra-arguments without using the dialogue contextual information. Motivated by this observation, we introduce a new loss function that only uses e and p to predict the intra-arguments:

L_intra = - Σ_t σ(y_t) log( softmax(W_c [e_t ⊕ p_t])^T δ_{y_t} ),

where σ(y_t) is a boolean scalar that indicates whether y_t is an intra-argument token, and W_c is the softmax matrix used in the SRL objective.
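The indicator σ(y_t) simply masks the token-level loss, as this sketch illustrates (an illustration only; we assume precomputed logits over e and p, and the names are our own):

```python
import math

def intra_argument_nll(logits_per_token, gold_labels, is_intra):
    """Token-level NLL restricted to intra-argument positions.

    is_intra[t] plays the role of the indicator sigma(y_t): tokens whose
    gold label is not an intra-argument contribute nothing to the loss.
    """
    loss = 0.0
    for logits, y, keep in zip(logits_per_token, gold_labels, is_intra):
        if not keep:
            continue
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        loss -= logits[y] - log_z
    return loss
```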
Utterance Type Objective. We additionally introduce an utterance type objective to learn better utterance representations. Specifically, we classify all utterances into three categories, namely predicate-utterances (utterances containing the predicate), argument-utterances (utterances containing arguments) and irrelevant-utterances (utterances without any arguments). We use the utterance representations g and h to classify the utterance type:

L_UT = - Σ_{k=1}^{K} log( softmax(W_u [g_k ⊕ h_k])^T δ_{y_k} ),

where y_k is the utterance type, K is the total number of utterances, and W_u is the softmax matrix for the three-way classification.
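Deriving the three gold utterance classes from the annotation can be sketched as follows (illustrative only; we assume a predicate-utterance keeps that label even if it also contains arguments, which the paper does not specify):

```python
def utterance_types(num_utts, pred_utt, arg_utts):
    """Assign each utterance one of three classes:
    0 = predicate-utterance, 1 = argument-utterance, 2 = irrelevant-utterance.

    pred_utt is the index of the utterance containing the predicate;
    arg_utts is the set of utterance indexes containing any argument.
    """
    types = []
    for k in range(num_utts):
        if k == pred_utt:
            types.append(0)          # predicate takes priority (assumption)
        elif k in arg_utts:
            types.append(1)
        else:
            types.append(2)
    return types
```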

Experiments
We evaluate our model on three datasets, i.e., DuConv (Wu et al., 2019), NewsDialog (Wang et al., 2021) and PersonalDialog (Wu et al., 2019). DuConv is a multi-turn dialogue dataset that focuses on the domain of movies and stars, while NewsDialog and PersonalDialog are open-domain dialogue datasets. We use the same train/dev/test split as Xu et al. (2021): the DuConv annotations are split 80%/10%/10% into train/dev/in-domain test sets, while the NewsDialog and PersonalDialog annotations are treated as out-of-domain test sets.
The hyper-parameters used in our model are as follows. The hop size and embedding dimension of CSAGN are set to 4 and 100, respectively. The weights α_1, α_2 and α_3 are all set to 1.0. The batch size is set to 128.
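For reference, these settings can be collected into a single configuration sketch (the key names, and our reading of α_1–α_3 as the multi-task loss weights, are our own):

```python
# Hyper-parameters reported for CSAGN; key names are illustrative.
CONFIG = {
    "csagn_hops": 4,          # hop size of the graph network
    "csagn_dim": 100,         # embedding dimension of the graph network
    "loss_weights": {         # alpha_1, alpha_2, alpha_3 (assumed mapping)
        "srl": 1.0,
        "intra_argument": 1.0,
        "utterance_type": 1.0,
    },
    "batch_size": 128,
}
```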
Results and Discussion. We report micro-average F1 scores over the (predicate, argument, label) tuples. Following Xu et al. (2021), we also evaluate F1 scores over intra- and cross-arguments separately. We compare with two baselines that use different strategies to encode the dialogue history and speaker information. In particular, SimpleBERT (Shi and Lin, 2019) uses BERT as its backbone and simply concatenates the entire dialogue history with the predicate; CSRL-BERT (Xu et al., 2021) also uses BERT but attempts to encode the dialogue structural information by integrating dialogue-turn and speaker embeddings in the input embedding layer. Table 1 summarizes the results of our model and these baselines.
We can see that our model significantly outperforms existing baselines on both the in-domain and out-of-domain datasets. We can also see that our model benefits from the multi-task training. In particular, when only using the SRL objective, the F1_all scores drop by 0.37, 2.16 and 2.28 on the three datasets. Removing either the intra-argument or the utterance-type objective also decreases performance on all datasets. Moreover, we observe that introducing the intra-argument objective consistently improves F1_intra, while the utterance-type objective is more important for F1_cross. We think this is because (1) the intra-argument objective reduces the noise from irrelevant information within the dialogue context, and (2) identifying cross-arguments requires a better understanding of the global dialogue structure.
Let us now look at the impact of different components on our model. From Table 1, we can see that both the predicate-aware representations and the speaker-aware graph network (SAGN) improve the F1_cross performance. These results indicate that (1) the predicate-aware attention strategy helps the model better capture long-distance dependencies between arguments and predicates, and (2) the speaker information encoded in the SAGN is also helpful for identifying arguments across dialogue turns. Furthermore, we also experiment with a full attention strategy for obtaining predicate-aware representations, in which each token attends to all tokens in the entire dialogue. From Table 1, we can see that this strategy performs worse on all metrics. This result is expected, since full attention treats all utterances equally and may therefore encode irrelevant and noisy dependencies into the contextual representations.
Recall that we model two types of dependencies in our graph neural network, i.e., the speaker and predicate dependencies. We investigate the impact of each dependency on our model, and the results are shown in Table 1. We can see that removing either dependency hurts the performance, especially the F1_cross score. This result suggests that this structural information is useful for identifying cross-arguments.

Conclusion
In this paper, we propose a conversational structure aware graph network for the task of conversational semantic role labeling, along with a multi-task learning method. Experimental results on benchmark datasets show that our method significantly outperforms previous baselines and achieves state-of-the-art performance.