SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues

Inferring social relations from dialogues is vital for building emotionally intelligent robots that interpret human language better and act accordingly. We model the social network as an And-Or Graph, named SocAoG, to enforce the consistency of relations among a group and to leverage attributes as inference cues. Moreover, we formulate a sequential structure prediction task and propose an α-β-γ strategy to incrementally parse SocAoG for dynamic inference upon any incoming utterance: (i) an α process predicting attributes and relations conditioned on the semantics of dialogues, (ii) a β process updating the social relations based on related attributes, and (iii) a γ process updating individuals' attributes based on interpersonal social relations. Empirical results on DialogRE and MovieGraph show that our model infers social relations more accurately than the state-of-the-art methods. Moreover, the ablation study shows that the three processes complement each other, and the case study demonstrates the dynamic relational inference.


Introduction
Social relations form the basic structure of our society, defining not only our self-images but also our relationships (Sztompka, 2002). Robots with a higher emotional quotient (EQ) have the potential to understand users' social relations better and act appropriately. Given a dialogue as context and a set of entities, the task of Dialogue Relation Extraction (DRE) predicts the relation types between the entities from a predefined relation set. Table 5 shows such an example from the dataset DialogRE.

S1: Well then we'll-we'll see you the day after tomorrow.

      Argument Pair    Trigger        Relation Type
R1    (S2, S1)         dad            per:children
R2    (S3, Gunther)    sexy blonde    per:positive impression
R3    (S3, S1)         mom            per:children
R4    (S1, S3)         mom            per:parents
R5    (S1, S2)         dad            per:parents

Table 5: An example from DialogRE. Trigger word annotations are not used for training; they are shown for illustration only.

Existing methods (e.g., Chen et al., 2020) focus on identifying entities' relations from the semantics of dialogues: they utilize either the attention mechanism or a refined token graph to locate informative words (e.g., "dad" and "mom") that imply the argument pairs' relations. However, according to our observations, current models for social relation inference still miss three parts. First, current models lack explicit modeling of the relational consistency among a group of people. Such consistency helps humans reason about the social relation of two targets by using their relations with a third person. For the example in Table 5, by knowing S2 and S3 are S1's parents and S3 is S1's mother, we can infer that S2 is S1's dad. Second, personal attribute cues (e.g., gender and profession) can also aid the relational inference but are not fully utilized. In the above example, besides inferring that S3 is S1's mother according to S3's feminine attribute, we can also guess that Gunther is a waiter, which might be useful for future social-relational inference. Third, since the BERT-based and token-graph-based models take dialogues as a whole for relation prediction, they cannot perform dynamic inference, i.e., updating the relational belief with an incoming dialogic utterance. This can limit their ability to track evolving relations along social interactions (e.g., strangers become friends over a good chat (Kukleva et al., 2020)), to unveil intermediate reasoning results, and to deal with long dialogues.
Motivated by these observations, we propose to model social relations as an attributed And-Or graph (AoG) (Zhu et al., 1998; Zhu and Mumford, 2007; Wu and Zhu, 2011; Shu et al., 2016; Qi et al., 2018), named SocAoG, and develop an incremental graph parsing algorithm to jointly infer human attributes and social relations from a dialogue. Specifically, SocAoG describes social relations and personal attributes with contextual constraints of groups and hierarchical representations. To incrementally parse SocAoG and track social relations, we apply Markov Chain Monte Carlo (MCMC) to sample from the posterior probability calculated by three complementary processes (α-β-γ) (Qu et al., 2020; Zayaraz et al., 2015). Figure 1 schematically demonstrates a graph update of both relations (i.e., disambiguating mom/dad and adding a new party) and attributes (e.g., gender and profession) with the utterance "S2: Your mother just added him to the list." from the example dialogue in Table 5.
We evaluate our method on two datasets, DialogRE and MovieGraph (Vicol et al., 2018), for relation inference, and the results show that our method outperforms the state-of-the-art (SoTA) ones. Overall, we make the following contributions: (i) We propose to model and infer social relations and individuals' attributes jointly with SocAoG for the consistency of attributes and social relations among a group. To the best of our knowledge, this is the first time it has been done in the dialogue domain; (ii) The MCMC sampling from the α-β-γ posterior enables dynamic inference, i.e., incrementally parsing the social relation graph, which can be useful for tracking relational evolution, reflecting the reasoning process, and handling long dialogues; (iii) We perform an ablation study on each process of α-β-γ to investigate its information contribution, and perform case studies to show the effectiveness of our dynamic reasoning.

Related Work
We review the related works on the social relation inference from documents, which is a well-studied task, and those from dialogues, which is the emerging task that our work is focused on.

Relation Inference from Documents
Most of the existing literature focuses on relation extraction from professionally edited news reports or websites. These methods typically output a set of "subject-predicate-object" triples after reading the entire document (Bach and Badaskar, 2007; Mintz et al., 2009; Kumar, 2017). While early works mostly utilize feature-based methods (Kambhatla, 2004; Miwa and Sasaki, 2014; Gormley et al., 2015) and kernel-based methods (Zelenko et al., 2003; Zhao and Grishman, 2005; Mooney and Bunescu, 2006), more recent studies use deep learning methods such as recurrent neural networks or transformers (Kumar, 2017). For example, Zhou et al. (2016) propose a bidirectional LSTM model to capture the long-term dependency between entity pairs, Zhang et al.
Two streams of work are closely related to our method. Regarding social network modeling, while most works treat pairs of entities in isolation (Xue et al., 2020b; Chen et al., 2020), Srivastava et al. (2016) formulate interpersonal relation inference as structured prediction (Belanger and McCallum, 2016), inferring the collective assignment of relations among all entities from a document (Li et al., 2020; Jin et al., 2020). Regarding relation evolution, a few works aim to learn the dynamics in social networks, i.e., the development of relations, from narratives by Hidden Markov Models (Chaturvedi et al., 2017), Recurrent Neural Networks (Kim and Klinger, 2019), and deep recurrent autoencoders (Iyyer et al., 2016). Our method differs from the aforementioned works by modeling the structured social relations and their changes concurrently, which can be useful for tracking social network evolution (Doreian and Stokman, 1997) and unveiling the reasoning process of relations. We achieve this by parsing the graph incrementally per utterance with the proposed α-β-γ strategy.

Relation Inference from Dialogues
The first human-annotated dialogue-based relation extraction dataset, DialogRE, was introduced recently; in it, relations are annotated between arguments that appear in a dialogue session. Compared with traditional relation extraction tasks, DialogRE emphasizes the importance of tracking speaker-related information within the context across multiple sentences. SoTA methods can be categorized into token-graph models and pre-trained language models. For typical token-graph models, Chen et al. (2020) present a token graph attention network, and Xue et al. (2020b) further generate a latent multi-view graph to capture relationships among tokens, which is then refined to select important words for relation extraction. For pre-trained models, Yu et al. (2020) evaluate a BERT-based baseline model (Devlin et al., 2018) and a modified version, BERT_S, which takes speaker arguments into consideration. Xue et al. (2020a) propose a simple yet effective BERT-based model, SimpleRE, that takes a novel input format to capture the interrelations among all pairs of entities.
Both categories of SoTA models take a discriminative approach while ignoring two key constraints on relations: (i) social relation consistency in a group and (ii) human attributes. In contrast, our method formulates the task as dialogue generation from an attributed relation graph, so that the posterior relation estimation models both constraints. Moreover, SoTA models assume that the relations are static and therefore cannot learn the dynamics of the relations, whereas our incremental graph updating strategy naturally enables dynamic relation inference.

Problem Formulation
Our goal is to construct a social network through utterances in dialogue. The network is a heterogeneous physical system (Yongqiang et al., 1997) with particles representing entities and different types of edges representing social relations. Each entity is associated with multiple types of attributes, while each type of relation is governed by a potential function defined in human attribute and value space, acting as the social norm. The relations are often asymmetric, e.g., A being B's father does not mean that B is A's father. To model the network, we utilize an attributed And-Or Graph (A-AoG), a probabilistic grammar model with attributes on nodes. This design takes advantage of the reconfigurability of its probabilistic context-free grammar to reflect the alternative attributes and relations, and of the contextual relations defined on a Markov Random Field to model the social norm constraints.
The social network graph, named SocAoG, is diagrammatically shown in Figure 2. Formally, SocAoG is defined as a 5-tuple $\langle S, V, E, X, P \rangle$ (1), where $S$ is the root node representing the society of interest. $V = V_{and} \cup V_{or} \cup V_T^e \cup V_T^a$ denotes the collection of all nodes. Among them, the And-nodes $V_{and}$ represent the set of social communities, each of which can be decomposed into a set of entity terminal nodes, $V_T^e$, representing human members. Community detection is based on social network analysis (Bedi and Sharma, 2016; Du et al., 2007), and can benefit the modeling of loosely connected social relations. Each human entity is associated with an And-node that breaks down the attributes into subtypes such as gender, age, and profession. All the subtypes constitute an Or-node set, $V_{or}$, representing branches to alternative attribute values. Meanwhile, all the attribute values are represented as a set of terminal nodes $V_T^a$. We denote $E$ to be the edge set describing social relations, $X(v_i)$ to be the attributes associated with node $v_i$, and $X(e_{ij})$ to be the social relation type of edge $e_{ij} \in E$.

[Figure legend: And-node; And-node (attribute); Or-node (attribute subtype); Terminal node (attribute); Social relation; Word context.]

Given $P$ to be the probability model defined on SocAoG, a parse graph $pg$ is an instantiation of SocAoG with determined attribute selections for every Or-node and relation types for every edge. For a dialogue session with $T$ turns, $D_T = \{D^{(1)}, D^{(2)}, \dots, D^{(T)}\}$, where $D^{(t)}$ is the utterance at turn $t$, our method infers the attributes and social relations incrementally over turns, producing the sequence

$$\{pg^{(1)}, pg^{(2)}, \dots, pg^{(T)}\}, \tag{2}$$

where $pg^{(t)}$ represents the belief of SocAoG at dialogue turn $t$. We incrementally update $pg$ by maximizing the posterior probability:

$$pg^{*} = \arg\max_{pg}\, p(pg \mid D; \theta), \tag{3}$$

where $pg^{*}$ is the optimum social relation belief, and $\theta$ is the set of model parameters.
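To make the notion of a parse graph concrete, the following is a minimal sketch of one possible in-memory representation. The class name `ParseGraph`, its fields, and its methods are illustrative assumptions and are not taken from the paper; the entity and relation labels follow the Table 5 example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ParseGraph:
    """Hypothetical container for one SocAoG instantiation: one selected value
    per attribute Or-node and one relation type per directed edge."""
    # attributes[entity][subtype] -> selected attribute value (terminal node)
    attributes: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # relations[(head, tail)] -> relation type; directed because relations are asymmetric
    relations: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def set_attribute(self, entity: str, subtype: str, value: str) -> None:
        self.attributes.setdefault(entity, {})[subtype] = value

    def set_relation(self, head: str, tail: str, relation: str) -> None:
        self.relations[(head, tail)] = relation


# Belief after reading part of the example dialogue (values chosen for illustration).
pg = ParseGraph()
pg.set_attribute("S3", "gender", "female")
pg.set_relation("S1", "S3", "per:parents")   # R4 in the example table
pg.set_relation("S3", "S1", "per:children")  # R3 in the example table
```

Storing edges as ordered pairs keeps the two directions of a relation separate, which matches the asymmetry noted in the problem formulation.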

α-β-γ for Graph Inference
For simplicity, we denote $X(v_i)$ as $v_i$ and $X(e_{ij})$ as $e_{ij}$ in the rest of the paper. We introduce three processes, i.e., the α, β, and γ processes, to infer any SocAoG belief $pg^{*}$. We start by rewriting the posterior probability as a Gibbs distribution:

$$p(pg \mid D; \theta) = \frac{1}{Z} \exp\{-E(D \mid pg; \theta) - E(pg; \theta)\}, \tag{4}$$

where $Z$ is the partition function. $E(D \mid pg; \theta)$ and $E(pg; \theta)$ are the dialogue-based and social norm-based energy potentials, respectively, measuring the cost of assigning a graph instantiation.
Denoting a dialogue as a sequence of words, $D = \{w_1, w_2, \dots, w_T\}$, the dialogue likelihood energy term $E(D \mid pg; \theta)$ can be expressed with a language model conditioned on the parse graph:

$$E(D \mid pg; \theta) = -\sum_{t=1}^{T} \log p(w_t \mid c_t, pg), \tag{5}$$

where $c_t = [w_1, \dots, w_{t-1}]$ is the context vector. Intuitively, the word selection depends on the word context, the entities' attributes, and their interpersonal relations. We approximate the likelihood by fine-tuning a BERT-based transformer with a customized input format, which is a concatenation of the dialogue history $D$ and a flattened parse graph string encoding the current belief. We call the estimation of $pg$ from the dialogue likelihood $p(w_t \mid c_t, pg)$ the α process. The α process lacks explicit constraints for social norms related to interpersonal relations and human attributes.
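As an illustration of the Gibbs form, the sketch below turns the total energies of a few candidate parse graphs into normalized probabilities, assuming the usual convention that probability is proportional to the exponential of the negative energy. The function name `gibbs_posterior` and the numeric energies are made up for the example; enumerating candidates like this is only feasible for toy cases, which is why the paper resorts to sampling.

```python
import math


def gibbs_posterior(energies):
    """Map total energies E(D|pg) + E(pg) of candidate parse graphs to
    probabilities exp(-E) / Z. Energies are shifted by their minimum for
    numerical stability; the shift cancels in the normalization."""
    e_min = min(energies.values())
    weights = {name: math.exp(-(e - e_min)) for name, e in energies.items()}
    z = sum(weights.values())  # partition function over the listed candidates only
    return {name: w / z for name, w in weights.items()}


# Hypothetical energies: dialogue-based term + social norm-based term.
candidates = {"pg_a": 1.2 + 0.3, "pg_b": 2.0 + 0.1, "pg_c": 0.9 + 1.5}
posterior = gibbs_posterior(candidates)
best = max(posterior, key=posterior.get)  # lowest total energy wins
```

Note that `pg_c` has the lowest dialogue-based energy but loses once the social norm term is added, which is the effect the combined posterior is meant to capture.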
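The dialogue-based energy can be sketched as a negative log-likelihood accumulated word by word, assuming the energy is the negative sum of the conditional log-probabilities. In the sketch below, `toy_cond_prob` is a hand-written stand-in for the fine-tuned BERT-based model and its probabilities are invented; only the accumulation pattern is the point.

```python
import math


def dialogue_energy(words, pg, cond_prob):
    """Negative log-likelihood of a word sequence given a parse graph.
    cond_prob(word, context, pg) stands in for the conditioned language model."""
    energy = 0.0
    for t, word in enumerate(words):
        context = words[:t]  # c_t = [w_1, ..., w_{t-1}]
        energy -= math.log(cond_prob(word, context, pg))
    return energy


def toy_cond_prob(word, context, pg):
    # Hypothetical model: kinship words are likelier when pg holds a parent relation.
    kinship = {"mother", "mom", "dad"}
    has_parent = "per:parents" in pg.values()
    if word in kinship:
        return 0.2 if has_parent else 0.02
    return 0.1


words = "your mother just added him to the list".split()
pg_parent = {("S1", "S3"): "per:parents"}
pg_friend = {("S1", "S3"): "per:friends"}
e_parent = dialogue_energy(words, pg_parent, toy_cond_prob)
e_friend = dialogue_energy(words, pg_friend, toy_cond_prob)
```

Under this toy model, the parse graph containing the parent relation receives the lower energy for an utterance that mentions "mother".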
For the social norm-based potential, we design it to be composed of three potential terms:

$$E(pg; \theta) = -\beta \sum \log p(e_{ij} \mid v_i, v_j) - \gamma_l \sum \log p(v_i \mid e_{ij}) - \gamma_r \sum \log p(v_j \mid e_{ij}), \tag{6}$$

where each sum runs over $v_i, v_j \in V(pg)$ and $e_{ij} \in E(pg)$, and $V(pg)$ and $E(pg)$ are the sets of terminal nodes and relations in the parse graph, respectively. We call the term $p(e_{ij} \mid v_i, v_j)$ the β process, in which we bind the attributes of nodes $v_i$ and $v_j$ to update their relation edge $e_{ij}$, in order to model the constraint on relations from human attributes. Conversely, we call the terms $p(v_i \mid e_{ij})$ and $p(v_j \mid e_{ij})$ the γ process, in which we use the social relation edge $e_{ij}$ to update the attributes of nodes $v_i$ and $v_j$. This models the impact of a relation on the attributes of the related entities. $\beta$, $\gamma_l$, and $\gamma_r$ are weight factors balancing the α, β, and γ processes. Figure 3(a) shows the graph inference schema with the three processes. Combining Equations 4, 5, and 6, we get a posterior probability estimate $p(pg \mid D; \theta)$ of the parse graph $pg$ that respects attribute and social norm consistency. We also provide a reduced version of our model, SocAoG_reduced, which applies when annotations of characters' attributes are not available for training. With the same dialogue-based energy potential, we define the parse graph prior energy over a set of relation triangles:

$$E(pg; \theta) = -\beta \sum_{e_{ij}, e_{ik}, e_{jk} \in E(pg)} \log\big(p(e_{ij} \mid e_{ik}, e_{jk})\big). \tag{7}$$
The method directly models the constraint of two entities' relation from their relations to others, with the inference schema demonstrated in Figure 3(b).
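The two social-norm energies can be sketched as follows, assuming each is a weighted negative sum of log-probabilities over the edges (full model) or over relation triangles (reduced model). All probability tables here (`p_rel`, `p_head`, `p_tail`, `p_tri`) and the relation labels "mother", "father", "parent", "spouse", and "friend" are hand-set, hypothetical stand-ins for what a trained model would provide.

```python
import math
from itertools import permutations


def norm_energy(attrs, rels, p_rel, p_head, p_tail, beta=1.0, gamma_l=1.0, gamma_r=1.0):
    """Beta term (relation given both attributes) plus gamma terms
    (each endpoint's attribute given the relation), summed over all edges."""
    energy = 0.0
    for (i, j), e in rels.items():
        energy -= beta * math.log(p_rel(e, attrs[i], attrs[j]))
        energy -= gamma_l * math.log(p_head(attrs[i], e))
        energy -= gamma_r * math.log(p_tail(attrs[j], e))
    return energy


def triangle_energy(rels, p_tri, beta=1.0):
    """Reduced prior: score e_ij given the other two edges e_ik and e_jk."""
    entities = sorted({x for pair in rels for x in pair})
    energy = 0.0
    for i, j, k in permutations(entities, 3):
        if (i, j) in rels and (i, k) in rels and (j, k) in rels:
            energy -= beta * math.log(p_tri(rels[(i, j)], rels[(i, k)], rels[(j, k)]))
    return energy


# Toy probabilities; an edge (i, j) labeled "mother" means j is i's mother.
def p_rel(e, v_i, v_j):
    expected = {"mother": "female", "father": "male"}[e]
    return 0.9 if v_j == expected else 0.1


def p_head(v_i, e):
    return 0.5  # the child's gender is uninformative in this toy setup


def p_tail(v_j, e):
    expected = {"mother": "female", "father": "male"}[e]
    return 0.9 if v_j == expected else 0.1


def p_tri(e_ij, e_ik, e_jk):
    # If k is i's parent and j's spouse, then j is most likely i's parent too.
    if e_ik == "parent" and e_jk == "spouse":
        return 0.9 if e_ij == "parent" else 0.1
    return 0.5


attrs = {"S1": "male", "S3": "female"}
e_good = norm_energy(attrs, {("S1", "S3"): "mother"}, p_rel, p_head, p_tail)
e_bad = norm_energy(attrs, {("S1", "S3"): "father"}, p_rel, p_head, p_tail)

base = {("S1", "S3"): "parent", ("S2", "S3"): "spouse"}
tri_good = triangle_energy({**base, ("S1", "S2"): "parent"}, p_tri)
tri_bad = triangle_energy({**base, ("S1", "S2"): "friend"}, p_tri)
```

In both cases the assignment that agrees with the attribute cue or with the other two edges of the triangle receives the lower energy, so it is favored by the posterior.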

Incremental Graph Parsing
Incrementally parsing the SocAoG is accomplished by repeatedly sampling a new parse graph $pg^{(t)}$ from the posterior probability $p(pg^{(t)} \mid D^{(t)}; \theta)$. We utilize a Markov Chain Monte Carlo (MCMC) sampler to update our parse graph because of the complexity of the problem caused by multiple energy terms.

Algorithm 1: Incremental SocAoG Parsing for Social Relation Inference
  Input: dialogue $D_T = \{D^{(1)}, D^{(2)}, \dots, D^{(T)}\}$, target argument pairs $\{a_1, a_2\}$.
  Initialize $pg^{(0)}$. Initialize $v_i$ and $e_{ij}$.
  for t = 1, ..., T do
    for s = 1, ..., S do
      Compute the posterior $p(pg \mid D^{(t)}; \theta)$.
      Make proposal moves with probabilities $q_1$, $q_2$ to get a new parse graph $pg'$.
      Compute the posterior $p(pg' \mid D^{(t)}; \theta)$.
      Compute the acceptance rate $\alpha(pg' \mid pg, D^{(t)}; \theta)$.
      Accept/reject $pg'$ according to the acceptance rate.
    end for
    return $e_{a_1, a_2}$ from the average of accepted $pg$ samples.
  end for
At each dialogue turn $t$, we initialize the parse graph with the α classification process, by replacing all the Or-node tokens with a special token [CLS]. We sample the parse graph for $S$ steps and use the average value of the obtained samples as an approximation of $pg^{(t)}$. We design two types of Markov chain dynamics, used at random with probabilities $q_i$, $i = 1, 2$, to make proposal moves:
• Dynamics $q_1$: randomly pick a relation edge $e_{ij}$ under the uniform distribution, and flip its social relation type $e_{ij}$ according to the prior distribution given by the β process: $e_{ij} \sim p(e_{ij} \mid v_i, v_j)$ (8).
• Dynamics $q_2$: randomly pick a terminal node $v_i$ and its attribute subtype under the uniform distribution, and flip the one-hot value of attribute $v_i$ according to the prior distribution given by the γ process: $v_i \sim p(v_i \mid e_{ij})$ (9).
Using the Metropolis-Hastings algorithm (Chib and Greenberg, 1995), the proposed new parse graph $pg'$ is accepted according to the following acceptance probability:

$$\alpha(pg' \mid pg, D; \theta) = \min\left(1, \frac{p(pg' \mid D; \theta)\, p(pg \mid pg')}{p(pg \mid D; \theta)\, p(pg' \mid pg)}\right) = \min\left(1, \frac{p(pg' \mid D; \theta)}{p(pg \mid D; \theta)}\right), \tag{10}$$

where the proposal probability ratio cancels out since the proposal moves are symmetric in probability. We summarize the incremental SocAoG parsing in Algorithm 1. Dialogues give a continuously evolving energy landscape: at the beginning of the iterations, $p(pg^{(0)} \mid D; \theta)$ is a "hot" distribution with a large energy value; by iterating the α-β-γ processes for $pg$ updates through the dialogue, $pg$ converges to $pg^{*}$, which is much cooler.
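The per-turn sampling loop can be sketched with a toy Metropolis-Hastings sampler over relation edges only. This is a simplified illustration under stated assumptions, not the authors' implementation: the proposal picks an edge and a different label uniformly at random (which is symmetric, so the acceptance ratio reduces to the posterior ratio), the `energy` function is a hand-made stand-in for the α-β-γ posterior, and the label set `RELATIONS` is hypothetical.

```python
import math
import random
from collections import Counter

RELATIONS = ["friend", "parent", "spouse"]  # hypothetical label set


def energy(pg, evidence):
    """Toy energy: an edge costs 0 if it agrees with the cue observed so far
    (or if there is no cue for it) and 2 otherwise."""
    return sum(0.0 if evidence.get(edge) in (None, rel) else 2.0
               for edge, rel in pg.items())


def mh_turn(pg, evidence, steps, rng):
    """Run `steps` Metropolis-Hastings moves for one dialogue turn and
    count, per edge, how often each label is the current state."""
    votes = {edge: Counter() for edge in pg}
    for _ in range(steps):
        edge = rng.choice(sorted(pg))
        proposal = dict(pg)
        proposal[edge] = rng.choice([r for r in RELATIONS if r != pg[edge]])
        # Symmetric proposal: accept with min(1, p(pg') / p(pg)) = min(1, exp(E - E')).
        accept = min(1.0, math.exp(energy(pg, evidence) - energy(proposal, evidence)))
        if rng.random() < accept:
            pg = proposal
        for e, rel in pg.items():
            votes[e][rel] += 1
    return pg, votes


rng = random.Random(0)
pg = {("S1", "S2"): "friend", ("S1", "S3"): "friend"}  # initial belief
# Cues accumulated after turn 1 and turn 2 (invented for the example).
turn_evidence = [
    {("S1", "S3"): "parent"},
    {("S1", "S3"): "parent", ("S1", "S2"): "parent"},
]
beliefs = []
for evidence in turn_evidence:
    pg, votes = mh_turn(pg, evidence, steps=2000, rng=rng)  # warm start from last turn
    beliefs.append({e: c.most_common(1)[0][0] for e, c in votes.items()})
```

Each turn starts from the graph left by the previous turn, so only the edges affected by new cues need to move, which mirrors the incremental nature of the parsing.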

Datasets
We use DialogRE (V2) and MovieGraph (Vicol et al., 2018) for evaluating our method. Detailed descriptions of the two datasets, e.g., relation and attribute types, are provided in Appendix A.
DialogRE contains 36 relation types (17 of them are interpersonal) that exist between pairs of arguments. For the joint parsing of relations and attributes, we further annotate the entity arguments with attributes from four subtypes (following the practice of MovieGraph (Vicol et al., 2018)): gender, age, profession, and ethnicity, according to Friends Central in Fandom. DialogRE is split into training (1073), validation (358), and test (357) sets. Following previous works (Xue et al., 2020b), we report macro F1 scores in both the standard and conversational settings (F1_c).
MovieGraph provides graph-based annotations of social situations from 51 movies. Each graph comprises nodes representing the characters, their emotional and physical attributes, relationships, and interactions. We use a subset (40) of MovieGraph with available full transcripts and split the dataset into training (26), validation (6), and test (8) sets. For MovieGraph, we only evaluate with F1 since the trigger word annotation for computing F1_c is not available.

Experiment Settings
We learn the SocAoG model with a contrastive loss (Hadsell et al., 2006) comparing the posterior of a positive parse graph against a negative one. All parameters are learned by gradient descent using the Adam optimizer (Kingma and Ba, 2014). During the inference stage, for each utterance, we run the MCMC for $S = \min\{w \times (KM + K(K-1)N),\ S_{max}\}$ steps given $K$ entities, $M$ attributes, $N$ relations, and a sweep number of $w$. The probability of flipping the relation, $q_1$, is set to 0.7 to bias towards the relation prediction at first.
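As a quick worked example of the step budget, the helper below evaluates the formula for the number of MCMC steps per utterance. The function name `mcmc_steps` is illustrative, and the values chosen for the sweep number and the cap are assumptions, since the text does not state them; M = 4 and N = 36 follow the attribute subtypes and relation types described for DialogRE.

```python
def mcmc_steps(num_entities, num_attributes, num_relations, sweeps, max_steps):
    """S = min(w * (K*M + K*(K-1)*N), S_max): one sweep visits every attribute
    slot (K*M) and every directed entity pair's relation choices (K*(K-1)*N)."""
    k, m, n = num_entities, num_attributes, num_relations
    return min(sweeps * (k * m + k * (k - 1) * n), max_steps)


# Five entities, four attribute subtypes, 36 relation types; w and S_max are assumed.
one_sweep = mcmc_steps(5, 4, 36, sweeps=1, max_steps=1000)   # 20 + 720 = 740
two_sweeps = mcmc_steps(5, 4, 36, sweeps=2, max_steps=1000)  # 1480, capped at 1000
```

The quadratic growth in the number of entities is what makes the cap necessary for dialogues with many speakers.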

Baseline Models
We compare our method with both transformer-based (BERT, BERT_S, SimpleRE) and graph-based (GDPNet) models. Given the dialogue history D and a target argument pair, BERT (Devlin et al., 2018) predicts the relation between the two arguments. BERT_S is a speaker-aware modification of BERT, which also takes speaker information into consideration by converting it into a special token. SimpleRE (Xue et al., 2020a) models the relations between each pair of entities with a customized input format. GDPNet (Xue et al., 2020b) takes in token representations from BERT and constructs a multi-view graph with a Gaussian Graph Generator. The graph is then refined through graph convolution and DTWPool to identify indicative words.

Performance Comparison
Table 2 shows the performance comparison between different methods on the two datasets. Both of our models, SocAoG and SocAoG_reduced, outperform the existing methods on all the metrics. Specifically, without using any additional attribute information, SocAoG_reduced surpasses the state-of-the-art method (SimpleRE) by 1.9% (F1)/2.1% (F1_c) on the DialogRE test set, and by 5.1% (F1_c) on the MovieGraph test set. Such improvement shows the importance of relational consistency for the modeling, and supports the effectiveness of our SocAoG formulation in introducing the social norm constraints.

                              DialogRE (V2)                                      MovieGraph
Method                        Val F1      Val F1_c    Test F1     Test F1_c     Val F1      Test F1
BERT (Devlin et al., 2018)    59.4 (0.7)  54.7 (0.8)  57.9 (1.0)  53.1 (0.7)    50.6 (1.2)  53.6 (0.3)
BERT_S                        62.2 (1.3)  57.0 (1.0)  59.5 (2.1)  54.2 (1.4)    50.7 (1.1)  53.6 (0.4)
GDPNet (Xue et al., 2020b)    67

Table 2: Performance comparison between BERT, BERT_S, GDPNet, SimpleRE, SocAoG_reduced, and SocAoG. We report 5-run average results and the standard deviation (σ).

Moreover, by comparing SocAoG and SocAoG_reduced, we see that SocAoG further improves most of the metrics by leveraging the attribute information for relation reasoning, e.g., 69.1% vs. 68.6% for DialogRE test F1 and 64.1% vs. 63.2% for MovieGraph test F1. The results demonstrate that our method can effectively take advantage of the attributes as cues for social relation predictions. We compare our SocAoG model with the most accurate existing model (SimpleRE) by relation type, and see consistent improvements for all types. Part of the results is shown in Figure 4. We also observe larger accuracy boosts for relations between human entities than for relations involving non-human entities (e.g., human-place), by an average of +2.5% vs. +1.8% in F1, which is also reflected in Figure 4 (left 10 bars vs. right 10 bars). A likely explanation is that relation/attribute constraints are more meaningful for interpersonal relations, e.g., there are more constraints on the relations among three humans than on the relations among two humans and a place.
Table 2 also shows a larger accuracy improvement on MovieGraph than on DialogRE (+6.0% vs. +3.2% in test F1_c using SimpleRE as the baseline). This is possibly because the dynamic inference nature of our method makes it effective for dealing with dialogues with more turns: while existing methods either truncate dialogues or use sliding windows, our method continuously updates the relation graph given an incoming turn. We present a case study of the dynamic inference in the next subsection.

Case Study on Dynamic Inference
Our method incrementally updates the relation and attribute information for a group of entities upon each input utterance with the proposed α-β-γ strategy. Such dynamic inference can potentially help reflect evolving relations, unveil the reasoning process, and deal with long dialogues. Figure 5 shows the parse graph sequence inferred by SocAoG from the DialogRE test dialogue shown in Table 3. We can see that the method continuously refines the relations and attributes from an initial guess with incoming contexts, e.g., S2-S3: friends→parents in turn 5. Besides, the case also shows that attributes can aid relation predictions, e.g., the inferred age of Emma clarifies her relation with S3. Moreover, since our method models the relation consistency among a group, it can predict the relation between two humans who do not talk to each other directly. For example, S1 and S2 are inferred to be a couple from their dialogues with S5 in turn 7. Figure 5 also plots the average MCMC acceptance rate for the case, as defined in Equation 10, indicating the convergence of the inference. We see that the algorithm only needs to update the current graph belief slightly when a new utterance is perceived. A peak in the curve can indicate that a key piece of information has been detected that contradicts the existing belief: e.g., there is a peak in the convergence curve in turn 7, which corresponds to "S5: Hey Mr. and Mrs. Geller!", indicating that S1 and S2 are a couple rather than friends. Accordingly, the algorithm updates several relations. We also show the convergence plots for 50 random test cases from DialogRE in Figure 6, with the mean and standard deviation of the convergence rate drawn as the black line and blue shade. These plots suggest that our updating algorithm converges robustly.

Ablation Study on α-β-γ
The α-β-γ strategy is designed to update relations and attributes jointly, letting the input information flow through the parse graph for the consistency of predictions. To validate the design, we ablate the processes on DialogRE to evaluate their impact on performance. Table 4 shows that the α process, which is the discriminative model, makes the fundamental contribution, whereas the β and γ processes alone cannot recognize social relations since they cannot perceive information from dialogues. Notably, removing either of the β and γ processes decreases the overall performance since the inference efficiency is reduced.

Conclusion
The paper proposes a SocAoG model with α-β-γ processes for the consistent inference of social relations in dialogues. The model can also leverage attribute information to assist the inference. MCMC is used to parse the relation graph incrementally, enabling dynamic inference upon any incoming utterance. Experiments show that our model outperforms state-of-the-art methods; case studies and ablation studies are provided for analysis. In the future, we will further explore how different initializations of the parse graph could help warm-start the inference under various situations and how multi-modal cues could be leveraged.

Ethical Considerations
Enabling AI to understand social relations is an essential step towards building emotionally intelligent agents. By jointly inferring individual attributes and social relations, our incremental parsing algorithm enables consistent and dynamic relational inference in dialogue systems, which can be remarkably useful for a wide range of applications such as a chatbot that constantly perceives new information and conducts social relation inference. However, we never forget the other side of the coin. We emphasize that an ethical design principle must be in place throughout all stages of development and evaluation. First, as discussed in Larson (2017), we model the attributes as a social construct from a performative view. For example, "gender performativity is not merely performance, but rather performances that correspond to, or are constrained by, norms or conventions and simultaneously reinforce them." Second, our model relies upon the attribute-category ascription provided by MovieGraph (Vicol et al., 2018) and Friends Central in Fandom. However, we acknowledge that the annotation could be prone to a partial understanding of human relationships, and the real situation could be more complicated. Lastly, self-identification should be the gold standard for ascribing attribute categories. Practitioners are advised to prompt users to provide self-identification and to respect the difficulties of respondents when asking. Our model helps increase the interpretability of the relational inference process by tracking the attributes and updating the relational belief. We expect that the biases from relation recognition can be easier to measure, and our α-β-γ processes may provide a multidimensional way of correcting them.