PRIDE: Predicting Relationships in Conversations

Automatically extracting interpersonal relationships of conversation interlocutors can enrich personal knowledge bases to enhance personalized search, recommenders and chatbots. To infer speakers’ relationships from dialogues we propose PRIDE, a neural multi-label classifier that uses BERT and a Transformer to create a conversation representation. PRIDE utilizes dialogue structure and augments it with external knowledge about speaker features and conversation style. Unlike prior works, we address multi-label prediction of fine-grained relationships. We release large-scale datasets, based on screenplays of movies and TV shows, with directed relationships of conversation participants. Extensive experiments on both datasets show superior performance of PRIDE compared to the state-of-the-art baselines.


Introduction
Motivation and Problem. Personal knowledge about individual users is a valuable asset for personalizing downstream applications, such as intelligent assistants, recommender systems and search engines. However, such personalized services are commonly achieved with end-to-end learning approaches, where user information is confined to latent representations and is inaccessible to users. Explicit Personal Knowledge Bases (PKBs) (Balog and Kenter, 2019), which are built independently of any downstream application, serve as background knowledge for personalization. PKBs are crucial for empowering users with control over what can be learned from their data collected by big tech companies. Such PKBs will also provide transparency and explainability to end users about inferred personal knowledge and any personalized decisions made by the systems.
With the ubiquity of social media and online forums, user-generated content is available in abundance. Mining personal knowledge from user-generated content to populate PKBs, or user profiling, is a long-standing topic in NLP (e.g., Flekova et al., 2016; Basile et al., 2017; Tigunova et al., 2019). While users' demographic attributes and interests can be learned from their profile descriptions and posts, interpersonal relationships with other users are rarely mentioned explicitly and may only be inferred from their interactions and conversations. In this work, we develop an automatic method for predicting fine-grained relationships between two speakers, given their logged conversation history.
Consider the example in Figure 1. From the excerpt of interactions between A and B, the reader can figure out that B is the child of A by observing (i) the address term 'sweetheart', (ii) the commanding but soft tone of user A, (iii) the reference to the other family member 'dad', and (iv) the context created by the word 'homework'. Yet, neither of the speakers directly mentions their relationship, making this task difficult for automatic methods relying on explicit patterns.
The relationship information extracted from such conversations, e.g., (B, child_of, A), can be entered into the PKBs of users A and B. By combining such relationship information with User B's age and personal interests (e.g., playing guitar, Justin Bieber) inferable from User B's social media (exemplified in Figure 1), a system will be able to provide user A with relevant personalized recommendations for a query "birthday present ideas for my daughter".
Prior Work and its Limitations. There has been considerable research on extracting relationships between characters in literary texts such as novels (Chaturvedi et al., 2016, 2017). These methods are inappropriate for conversational data, though, which is colloquial and less structured than literary texts. Moreover, predicting relationships is often modeled as a binary task of sentiment classification (i.e., person A is positive or negative about person B). Prior works on conversational data are restricted to small-scale data (Yu et al., 2020), or merely handle coarse labels of relationship aspects (Rashid and Blanco, 2018; Qamar et al., 2021). Most approaches use general models for text classification (Jia et al., 2021), which disregard the particularities of conversational settings.

Approach and Contributions. We present PRIDE, a neural multi-label classifier for Predicting Relationships In DialoguE. PRIDE performs inference over 12 fine-grained directed relationships (like child or boss, see Table 2) from conversational data by hierarchically creating utterance representations and combining them with signals on the users' personal attributes (e.g., age and occupation) and the conversation style (e.g., intense or superficial). PRIDE uses BERT (Devlin et al., 2019) to create contextual word embeddings for each utterance, and Transformer encoders (Vaswani et al., 2017) to build conversation representations that preserve information about the sequence and speakers of utterances.
The contributions of this paper are: (i) a method for inferring speakers' relationships, which outperforms strong baselines; (ii) the largest conversational dataset 1 of 1.1K speaker pairs annotated with multi-label, directed relationships; and (iii) an exhaustive analysis of the model's performance.
1 https://pkb.mpi-inf.mpg.de/pride/

Related work
Relationship Prediction. There is only limited research on relationship prediction in dialogues, as most studies focus on literary texts. The relationships in novels are often predicted at a coarse granularity (positive or negative sentiment) (Chaturvedi et al., 2016), modelled as emotion-related classes (anger, fear) (Kim and Klinger, 2019), or described in a topic-modelling manner (Iyyer et al., 2016; Chaturvedi et al., 2017). While fictional texts often contain dialogues, they are interleaved with narratives, where the language is less colloquial and more descriptive, which aids explicit extraction of fictional characters' relationships.
On the other hand, screenplays or scripts of theatre plays, movies or TV series are more similar to real-life conversations. Nalisnick and Baird (2013) explored Shakespeare plays to analyze the polarity and intensity of emotions of characters towards each other. The same data is used in Azab et al. (2019), where fine-grained relationship classes adopted from Massey et al. (2015) are predicted by applying a logistic regression classifier on a pair of learned character embeddings. However, such an approach predicts relationships solely based on characters' latent attributes without considering any conversational context. Rashid and Blanco (2018) investigated the prediction of interpersonal dimensions (Wish et al., 1976) of utterances in the Friends series, where SVM classifiers on bag-of-words were trained per dimension to determine whether an utterance is, for instance, equal or hierarchical. Similarly, Qamar et al. (2021) leveraged vector representations of emotion words to classify a dialogue taken from a movie script corpus into four attachment styles (e.g., friend, family) and four association types (e.g., secure, fearful), which are then combined into 16 relationship classes. Neither approach provides explicit and detailed information about the speakers' relationships, such as who is the parent of whom; both instead focus on relationship characteristics. To improve our approach's ability to predict specific relationships, we leverage interpersonal dimensions as an additional signal following Rashid and Blanco (2018).
Speakers' relationships are part of the 36 predicates investigated by Yu et al. (2020), who focused on the general relation extraction task between two arguments appearing in a dialogue (e.g., spouse, place_of_residence) taken from the Friends series; 14 of the predicates refer to relationships between people. The authors used BERT to predict relations contained in a dialogue snippet, taking as input the conversation text concatenated with the two relation arguments. Similarly, another study collected conversations from Chinese TV series scripts and used three annotators to label them with 24 relationships and 7 emotions. The relationship labels were hierarchically split by field (family, school, company, other) and seniority (elder, peer, junior); only one relationship label was allowed per dialogue excerpt. On the resulting dataset the authors ran predictive models (CNN and BERT) using a single subsequent pair of utterances as input, which is suboptimal given the short length of such input and the absence of surrounding context. In contrast to both of the above works, our model can handle the full history of conversations, enabling it to distinguish multiple labels per speaker pair.
Jia et al. (2021) annotated relationships of the characters in movie scripts with 13 relationship labels, belonging to four main categories (family, intimacy, official, others), resulting in the DDRel dataset. Their best performing model is based on BERT, fine-tuned for classifying a dialogue session between a pair of speakers; we used their model as one of our baselines. Unlike Jia et al. (2021), we consider directed relationships (e.g., parent and child as separate labels) and each pair can have multiple relationship labels. Moreover, our annotated data, which is almost twice the size of DDRel, is arguably more reliable, using the agreement of 4 out of 6 annotators per speaker pair, as opposed to DDRel, which was labeled by a single annotator.
Multi-speaker Dialogue Representations. Many NLP tasks based on conversational speech (chatbot answer generation, utterance intent classification, emotion prediction, etc.) require creating a representation of a given multi-speaker conversation as input. Our approach draws inspiration from these methods and adds extensions to better model conversations and incorporate signals relevant for relationship prediction.
One popular way to represent a conversation is to model words and utterances in a hierarchical manner. The hierarchical approach is widely applied to microblog sentiment and emotion classification. There are also non-hierarchical approaches to representing conversations. Welch et al. (2019) used a BiLSTM to process conversation spans represented by GloVe embeddings. The model is run on the conversations from a single individual to predict the attributes of their interlocutors, including personal relationships. However, both the list of predicted relationships and the size of the input samples are limited. Prior work in response retrieval for chatbots (e.g., Lu et al., 2020; Gu et al., 2020) used BERT to encode dialogue context and response, additionally enhancing the input with speaker embeddings.

Methodology
The neural model architecture, inspired by Li et al. (2020), is shown in Figure 2. PRIDE hierarchically creates word and utterance representations, which are then combined with representations of personal attributes and interpersonal dimensions (Table 1) to create a representation of the full conversation history. Given this representation of the conversation, a multi-label classification layer predicts one or more of the twelve relationship labels (Table 2). The model is trained with supervision on the relationship labels. In the following subsections we describe the model's components in more detail.

Contextual word representations
The input for a pair of speakers $(sp_A, sp_B)$ is $N$ utterances $u_1, \dots, u_N$, where the $i$-th utterance consists of words $w_i^1, \dots, w_i^{n_i}$. In the first step, the word representations $r_i^j$ are created with a function $f_{word}(w_1^1, \dots, w_1^{n_1}, \dots, w_N^{n_N}) = r_i^j$, which takes as input the concatenation of all utterances and produces the representations for each word. We chose BERT (Devlin et al., 2019) to create word representations, because this model efficiently captures contextual information.
Considering that the maximal input length of BERT is 512 tokens, we split the input sequence of utterances into chunks and run BERT several times. Each chunk in the split has the maximal possible length that fits into one run without breaking individual utterances. We find this splitting strategy more effective than running BERT on single utterances or on short sequences which do not fully utilize the 512-token limit (Jia et al., 2021). In our method more conversational context is provided to create word representations. Also, simply truncating input to 512 tokens (Lu et al., 2020) might cause a loss of important cues.
As information about the current speaker we use BERT's segment embeddings, so that the A-segment corresponds to tokens from $sp_A$ and the B-segment to $sp_B$. Furthermore, we encode the information about the utterance boundaries by prepending special tokens before each utterance: [s1] for the utterances of speaker A and [s2] for speaker B.
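The splitting strategy can be illustrated with a minimal Python sketch. It assumes that per-utterance token counts are already available from the tokenizer; the function name `pack_utterances`, the two positions reserved for [CLS] and [SEP], and the handling of a single over-long utterance are illustrative assumptions rather than details given in the paper.

```python
def pack_utterances(token_counts, max_tokens=512, reserved=2):
    """Greedily group consecutive utterances into chunks that fit one encoder run.

    token_counts[i] is the number of tokens in utterance i (including its
    speaker marker). `reserved` leaves room for [CLS] and [SEP] (assumption).
    Returns a list of chunks, each a list of utterance indices.
    """
    budget = max_tokens - reserved
    chunks, current, used = [], [], 0
    for idx, n in enumerate(token_counts):
        # Assumption: an utterance longer than the budget is truncated
        # and placed in a chunk of its own.
        n = min(n, budget)
        if current and used + n > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        chunks.append(current)
    return chunks


chunks = pack_utterances([200, 200, 150, 300, 100, 512])
print(chunks)
```

Each chunk is filled as far as possible before a new one is started, so no utterance is split across two runs of the encoder.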

Utterance representations
Next, word representations $r_i^j$ are aggregated within each utterance to create utterance representations $r_i$ with the aggregation function $a_{word}(r_i^1, \dots, r_i^{n_i}) = r_i$. The aggregation is performed on the utterances from all runs of BERT and outputs $r_1, \dots, r_N$ as the representations of utterances. In our hyperparameter search we tried instantiating $a_{word}$ with max, average and self-attention weighted average functions, or taking the representation of BERT's [CLS] token as a sequence summary.
Some of the $r_i$ are produced by separate runs of BERT due to its input length limitation. Therefore, we create enriched utterance representations $\tilde{r}_i$ in the unified context of all BERT runs with the function $f_{utt}(\hat{r}_1, \dots, \hat{r}_N) = \tilde{r}_i$. We instantiate $f_{utt}$ with a Transformer encoder (Vaswani et al., 2017), which allows us to input long sequences of utterances. Before computing enriched representations, we sum the utterance representations $r_i$ with sinusoidal positional encodings $p_i$ and speaker embeddings $sp_i$, yielding $\hat{r}_i = r_i + p_i + sp_i$. The speaker embeddings are randomly initialized and learned during model training. Positional encoding is performed following Vaswani et al. (2017).
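As a rough illustration of the summation step, the following pure-Python sketch adds a sinusoidal positional encoding and a speaker embedding to each utterance representation before it would be passed to the Transformer encoder. The function names and the dictionary of speaker embeddings are hypothetical; in the model the speaker embeddings are learned, whereas here they are simply given as fixed vectors.

```python
import math


def sinusoidal_encoding(position, dim):
    """Sinusoidal positional encoding (Vaswani et al., 2017) for one position."""
    enc = []
    for k in range(dim):
        # Dimensions 2i and 2i+1 share the same frequency.
        angle = position / (10000 ** (2 * (k // 2) / dim))
        enc.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return enc


def add_position_and_speaker(utt_reprs, speakers, speaker_emb):
    """Return r_i + p_i + sp_i for every utterance representation r_i."""
    dim = len(utt_reprs[0])
    out = []
    for i, (r, spk) in enumerate(zip(utt_reprs, speakers)):
        p = sinusoidal_encoding(i, dim)
        sp = speaker_emb[spk]
        out.append([a + b + c for a, b, c in zip(r, p, sp)])
    return out


# Toy example: two 4-dimensional utterance vectors from speakers A and B.
speaker_emb = {"A": [1.0] * 4, "B": [2.0] * 4}  # stand-in for learned embeddings
summed = add_position_and_speaker([[0.0] * 4, [0.0] * 4], ["A", "B"], speaker_emb)
```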

Classification layer
Finally, the enriched utterance representations $\tilde{r}_i$ are aggregated with the function $a_{utt}(\tilde{r}_1, \dots, \tilde{r}_N) = C$. The function $a_{utt}$ is instantiated with the same aggregation functions as $a_{word}$. For the case with the [CLS] representation we prepend a trainable embedding to the sequence.
We incorporate additional information relevant to the relationship prediction by concatenating the conversation representation $C$ with embeddings of personal attributes and interpersonal dimensions, which are described in the following subsections: $\hat{C} = C \mid r^{age} \mid r^{dim_l}$. A fully connected layer takes the resulting concatenated representation $\hat{C}$ as input and produces probability scores for each of the $L$ relationship labels. Since some relationships are not symmetric (e.g., parent/child), the labels represent directed relationships from $sp_A$ to $sp_B$.

Incorporating personal attributes
Additional personal information about the speakers from a PKB, such as their age or occupation, could improve relationship prediction. In this work, we investigate the benefits of incorporating age information into the model, since some relationships in our dataset can commonly be characterized by age differences between the speakers. For instance, children are usually much younger than their parents (and a parent can never be younger). Similarly, employees are generally younger than their bosses (but the magnitude of their age difference is less than in parent/child pairs).
To do so, we introduce a representation for the age difference of speakers, $r^{age}$. We first calculate $d = age_A - age_B$, which belongs to one of the age difference bins (see Appendix C.1). For each difference bin, we learn an $m$-dimensional embedding, where $m$ is a tuned hyperparameter (see Appendix C.3). We take the corresponding embedding for $d$ as $r^{age}$.
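The lookup of the age-difference embedding can be sketched as follows. The bin edges below are hypothetical placeholders (the actual bins are defined in Appendix C.1), and the embedding table is randomly initialized here only to stand in for embeddings that would be learned during training.

```python
import random
from bisect import bisect_right

# Hypothetical bin edges in years; the paper's real bins are in its Appendix C.1.
BIN_EDGES = [-20, -5, 5, 20]


def age_difference_bin(age_a, age_b, edges=BIN_EDGES):
    """Map the signed difference d = age_A - age_B to a bin index."""
    return bisect_right(edges, age_a - age_b)


def make_age_embeddings(m, edges=BIN_EDGES, seed=0):
    """One m-dimensional vector per bin (random stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(len(edges) + 1)]


age_embeddings = make_age_embeddings(m=8)
r_age = age_embeddings[age_difference_bin(45, 12)]  # e.g., a parent and a child
```

Because the difference is signed, swapping the two speakers selects a different bin, which matches the directed nature of the labels.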

Incorporating interpersonal dimensions
Rather than fine-grained relationship labels such as colleague or child, interpersonal relationships can also be characterized by various aspects in their interactions (e.g., spatially near vs distant) and communication styles (e.g., intimate vs unintimate). One way to organize such aspects was proposed by Rashid and Blanco (2018), who define several interpersonal dimensions describing speakers' interactions (which take place when the speakers refer to each other in their utterances) and relationships (which are defined as a sequence of interactions), shown in Table 1. Most of the relationship labels considered in our experiments can be characterized by a set of these dimensions; for instance, a boss/employee relationship is hierarchical, while colleague is an equal one. Similarly, spouse is an intimate relationship, in contrast with colleague.
Given a hint of the applicable dimensions, a model can better predict the underlying relationship. For instance, in Figure 1 the pleasure-oriented ("dad said we'll go somewhere exciting!"), intimate ("of course, sweetheart") and hierarchical ("make sure to finish your homework first!") relationship is most likely a parent/child relationship. In our model we use all 11 proposed dimensions to provide a comprehensive summary of the relationship's fine-grained characteristics.
Using the data provided by Rashid and Blanco (2018) we train a separate BERT classifier on the utterance level for each dimension $dim_l$, where $l$ is the index of the dimension, ranging over the number of interpersonal dimensions that we use. We obtain a $K$-dimensional [CLS] representation from the trained classifier for each utterance, thus producing a $K$-dimensional representation $r_i^{dim_l}$ for the $i$-th input utterance. To incorporate these representations into our model, we obtain a single representation $r^{dim_l}$ at the conversation level by performing max pooling over all utterance representations for a given speaker pair.
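The pooling step can be sketched in a few lines of Python. The helper names are hypothetical, and concatenating the pooled vectors of all dimensions into one feature vector is an assumption about how the per-dimension representations are combined.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def conversation_dimension_features(per_dimension_utterance_vectors):
    """Pool each dimension's utterance vectors and concatenate the results.

    per_dimension_utterance_vectors[l][i] is the vector of utterance i
    produced by the classifier for interpersonal dimension l.
    """
    features = []
    for utterance_vectors in per_dimension_utterance_vectors:
        features.extend(max_pool(utterance_vectors))
    return features


# Toy example: 2 dimensions, 2 utterances, K = 2.
dims = [
    [[0.1, -0.5], [0.4, -0.2]],
    [[1.0, 0.0], [0.5, 2.0]],
]
features = conversation_dimension_features(dims)
```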

Dataset
We present FiRe, a Film Relationship dataset, consisting of labeled relationships of fictional characters in popular movies, obtained via crowdsourcing. FiRe is based on movie scripts, which are a good approximation for real-life conversations. To the best of our knowledge, this is the first and the largest conversational dataset with directed, multi-label relationship labels.
Data preparation. We use the Jinni Movie Dataset collected in Gorinski and Lapata (2018), which provides speaker labels for each utterance as well as the film genre metadata. We selected the movies which:
• can be automatically associated with their Wikipedia page for annotation purposes, and
• have real-life genres, such as drama or family (see Appendix A.1), to better approximate real-life conversations.
The selection of realistic movie scripts distinguishes FiRe from DDRel (Jia et al., 2021). The model trained on FiRe is potentially more adaptive to real-life dialogues.
For each pair of characters we kept only the film scenes where they are the only participants. Additionally, we include all uninterrupted dialogue spans of the considered pair in the 3-character scenes (details are in Appendix A.2). We kept only the pairs which have at least 30 utterances throughout the whole movie.

Crowdsourcing annotation
Inspired by Massey et al. (2015), we manually created a list of 21 fine-grained relationships, divided into 3 categories: Family, Social and Professional (Table 2). We annotated character pairs in our dataset using Mechanical Turk (MTurk), following the task design described in Massey et al. (2015). For each character pair a worker was supposed to indicate all applicable relationships, given the links to the movie descriptions (Wikipedia and GradeSaver, if available). Further details of the MTurk annotation task are included in Appendix B.1. Based on several pilot runs we opted to assign the labels agreed by 4 out of 6 annotators.

Table 2
Family: parent*, child*, sibling*, (ex-)spouse*, engaged, distant family member
Social: friend*, enemy*, (ex-)love interest (lover)*, fan, idol, members of the same club
Professional: colleague/co-worker*, boss/employer/master*, doctor/patient (medical)*, employee/servant*, client/seller (commercial)*, religious relationship, classmate, teacher, student

Label aggregation. We selected the best label aggregation method based on the evaluation of several state-of-the-art models, ranging from basic Majority Voting to more complex resource-intensive methods. To create the ground truth for comparison, we manually annotated 15% of the pairs, retaining the labels on which 2 out of 3 annotators agreed. The full details of the evaluation are included in Appendix B.2. Ultimately, we calculate workers' scores based on the Honeypot method (Lee et al., 2010) and use Majority Voting weighted by these scores.
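To make the aggregation step more concrete, here is a minimal Python sketch of score-weighted majority voting for multi-label annotations. The scoring rule is an assumption: each worker is scored by accuracy on items with known answers, which is one simple reading of a honeypot-style check. The function names and the 0.5 threshold are likewise illustrative and not taken from the paper.

```python
from collections import defaultdict


def worker_scores(honeypot_answers, gold):
    """Score each worker by accuracy on items with known answers (assumed rule)."""
    scores = {}
    for worker, answers in honeypot_answers.items():
        correct = sum(answers[item] == gold[item] for item in answers)
        scores[worker] = correct / len(answers)
    return scores


def weighted_majority(votes, scores, threshold=0.5):
    """Keep labels whose supporting score mass exceeds `threshold` of the total.

    votes maps worker -> set of labels chosen for one character pair.
    """
    total = sum(scores[w] for w in votes)
    support = defaultdict(float)
    for worker, labels in votes.items():
        for label in labels:
            support[label] += scores[worker]
    return {label for label, mass in support.items() if mass > threshold * total}


scores = {"w1": 1.0, "w2": 0.5, "w3": 0.5}
votes = {"w1": {"friend"}, "w2": {"friend", "lover"}, "w3": {"lover"}}
labels = weighted_majority(votes, scores)
```

In this toy case "friend" is kept because the workers supporting it carry more than half of the total score mass, while "lover" reaches exactly half and is dropped.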
Dataset analysis. We calculated Fleiss' kappa for the multi-label case (see Appendix B.3 for details) and obtained a kappa of 0.45, which corresponds to moderate agreement. We obtained 783 annotated character pairs from 254 films, of which 5% are labeled with more than one relationship. The original set of labels was filtered to include only those which have at least 20 representative samples, resulting in 12 labels. Summary statistics of the final dataset are given in Table 3 and the relationship label distribution in Table 7.

Series dataset
We created an additional dataset of labeled TV series scripts, which are slightly different from film screenplays because they contain a longer history of interactions. We crawled https://transcripts.foreverdreaming.org/ for the scripts of popular series. As there is no information about scene boundaries in the gathered scripts, for a given speaker pair we kept only the uninterrupted sequences of at least 7 utterance turns.
For inclusion in the dataset, we selected series that are realistic and diverse in topics (see the full list in Appendix A.1). Following the same crowdsourcing annotation procedure as for FiRe, we collected 365 labeled pairs with 0.33 Fleiss' kappa agreement; the dataset's statistics are included in Table 3. Compared to FiRe, character pairs in this dataset have a larger number of utterances, around four times as many on average.

Experimental setup
Data splitting and preprocessing. We performed five-fold cross-validation, with the folds arranged so that the movies from which the character pairs are drawn are disjoint across folds. We additionally balanced label distributions as described in Appendix C.1. We trained the models on three folds and chose hyperparameter settings according to the performance on the one-fold validation set. We report the results on the remaining one-fold test set.
From the input scripts we removed personal names and movie-specific words (which we defined as words found in only one movie script), to reduce overfitting to the movie domain or genre.
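A movie-disjoint split of this kind can be sketched as below. The greedy assignment (largest movie first, into the currently smallest fold) and the choice of which fold serves as validation are assumptions made for illustration; the label balancing from Appendix C.1 is not reproduced here.

```python
from collections import defaultdict


def movie_disjoint_folds(pair_to_movie, n_folds=5):
    """Assign character pairs to folds so that no movie is split across folds."""
    by_movie = defaultdict(list)
    for pair, movie in pair_to_movie.items():
        by_movie[movie].append(pair)
    folds = [[] for _ in range(n_folds)]
    # Place movies with many pairs first, always into the smallest fold so far.
    for movie in sorted(by_movie, key=lambda m: (-len(by_movie[m]), m)):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_movie[movie])
    return folds


def rotate(folds, test_idx):
    """Return (train, validation, test): three folds, one fold, one fold."""
    n = len(folds)
    val_idx = (test_idx + 1) % n  # assumption: the next fold is used for validation
    train = [p for k, f in enumerate(folds) if k not in (test_idx, val_idx) for p in f]
    return train, folds[val_idx], folds[test_idx]


# Toy data: 10 movies with 2 character pairs each.
pairs = {f"m{m}_p{p}": f"m{m}" for m in range(10) for p in range(2)}
folds = movie_disjoint_folds(pairs)
train, val, test = rotate(folds, 0)
```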

Model setup and evaluation metrics.
We fine-tuned a pretrained BERT model (bert-base-uncased) to create word embeddings. For incorporating the information on the age difference of speakers, we gathered the data about speakers' ages by crawling imdb.com for the ages of the corresponding actors in the year the film/series was made. For each speaker pair we calculate the age difference between the speakers and assign it to one of the age difference bins, defined in Appendix C.1. To produce interpersonal dimension embeddings, we train BERT on the labeled data from Rashid and Blanco (2018) on each dimension separately, resulting in 768-dimensional representations.
We trained the model with Binary Cross Entropy loss. During training we oversampled the underrepresented labels. We performed grid search to tune hyperparameters, detailed in Appendix C.3. We perform multi-label classification by predicting all labels with scores over a certain threshold, which we treat as a hyperparameter. We compute macro-averaged multi-label precision, recall and F1 scores as evaluation metrics. During grid search we optimized the F1 score on the development set. The data and source code for all models are provided at https://pkb.mpi-inf.mpg.de/pride/.
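The thresholded prediction and the macro-averaged metrics can be sketched as follows. This is an illustrative implementation, not the released evaluation code: macro F1 is computed here as the mean of per-label F1 scores, which is one common definition, and labels with no predictions or no gold instances contribute a score of zero.

```python
def predict_labels(scores, threshold):
    """Return every label whose score is strictly above the threshold."""
    return {label for label, s in scores.items() if s > threshold}


def macro_prf(gold, pred, labels):
    """Macro-averaged precision, recall and F1 over labels for multi-label sets."""
    ps, rs, fs = [], [], []
    for label in labels:
        tp = sum(label in g and label in p for g, p in zip(gold, pred))
        fp = sum(label not in g and label in p for g, p in zip(gold, pred))
        fn = sum(label in g and label not in p for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec)
        rs.append(rec)
        fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n


gold = [{"friend"}, {"lover"}, {"friend", "lover"}]
pred = [{"friend"}, {"friend"}, {"lover"}]
precision, recall, f1 = macro_prf(gold, pred, ["friend", "lover"])
```

Tuning the threshold then amounts to repeating `predict_labels` over a grid of values and keeping the one with the best development-set F1.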
Results and discussion

Quantitative results
The main quantitative results are presented in Table 4. PRIDE outperforms all baselines by a large margin, including other BERT-based models. Unlike BERT$_{ddrel}$, which aggregates predictions on conversation snippets outside of the model, PRIDE internally learns the conversation representation. Furthermore, PRIDE has the advantage that it makes use of the full history of conversations.
We also analyze PRIDE's transfer learning performance on the Series dataset as our test data. From the results shown in Table 4, we observe the same behaviour of the models, with PRIDE outperforming the baselines. F1 scores are generally lower than on the FiRe dataset, due to the different nature of the data (longer input sequences). PRIDE's precision is similar on both datasets, but the larger amount of input with Series seems to reduce recall.

Comparison with human performance
It is often complicated even for humans to recognize the relationship between the speakers in a given conversation. Thus, human performance can be regarded as an upper bound on the model's performance.
To obtain this upper bound estimation, we asked three human annotators to read the complete conversation history of two movie characters (the same as the input given to the model) and identify the applicable relationships. (This differs from our main dataset because annotations are based on conversations rather than on character descriptions.) We sampled 5 pairs for each relationship label, resulting in 60 pairs. As human-predicted labels we assigned the relationships selected by at least 2 out of 3 annotators. The results on this dataset are shown in Table 5. While PRIDE substantially outperforms the baselines, it achieves about half of human precision, illustrating the difficulty of this task.

Ablation study
To investigate the impact of different components of PRIDE on its performance, we run an ablation study, removing one PRIDE component at a time. The ablation on Transformer is done by substituting it with aggregation operations on word and utterance levels consecutively. Results are shown in Table 6. It can be observed that positional encoding has the least impact. On the other hand, the quality drops considerably when the Transformer is removed, which is caused by a very low recall. Removing other elements causes a drop in precision, suggesting that incorporating age differences and interpersonal dimensions improves performance.

Varying input length
To investigate how many utterances are needed to make accurate predictions, we ran the trained PRIDE model on a subset of data with inputs of varying lengths. To do so, we selected a subset of user pairs with at least 150 utterances, and performed inference while increasing the amount of input utterances in a sequence from 10 to 150. This was repeated over 100 runs. The averaged results are shown in Figure 3. We observe that approximately 40 utterances are needed to maximize performance.

Per class analysis
In Table 7 we show the label distribution and per class F1 scores for PRIDE and two ablated versions. We observe that using speaker embeddings benefits predictions on asymmetric classes, such as child and parent, as their F1 scores drop significantly when speaker embeddings are not used. Removing interpersonal dimensions damages performance on spouse and child in particular, illustrating how this signal can help differentiate relationships that use similar vocabulary.

Misclassification analysis
The confusion matrix for PRIDE's predictions is shown in Figure 4 with correct predictions omitted. We observe that there are many misclassifications into friend and lover, which are the most common labels (see columns). This can be attributed to the model's tendency to predict majority classes because of a considerable class imbalance. Considering specific pairs, we see that the model often confuses spouse for lover (red line). They may talk to each other in a similar tone and use the same address terms. Conceptually, however, these classes are different, with spouses having tighter family bonds, discussing children and household issues, and lovers talking more casually. Similarly, child and spouse are often confused as well (purple line). Both may use terms related to family and discuss similar topics. The differences between lover and friend are indeed subtle (yellow square), and these pairs were also sometimes confused by human annotators.
Finally, we investigated the impact of confusion within asymmetric classes (for example, confusing parent with child). We found that if we accept the model's predictions of either label as correct, the average number of false positives for such classes drops by 34%, resulting in an increase of the average F1 score from 0.38 to 0.43. This illustrates the challenge posed by considering relationship directions and the importance of including asymmetric labels.

Conclusion
We presented PRIDE, a model for predicting fine-grained relationships from conversations. Our results illustrate the utility of our approach, showing that PRIDE outperforms state-of-the-art baselines and can transfer effectively to different types of dialogue data. In ablation experiments we demonstrated that the design decisions behind the model improve the quality of relationship prediction in conversations. To support future work on this topic, we created and released the largest labeled collection of relationships in conversations, which additionally improves over existing datasets by including asymmetric relationships.

FiRe is built on the Jinni Movie Dataset (Gorinski and Lapata, 2018), which contains preprocessed scripts with information about scene boundaries and the speakers of utterances, presented in XML format. Moreover, the dataset includes films' metadata crawled from the Jinni website 4, such as genre and plot keywords. First, we filtered the dataset to include only the films containing sufficient descriptions (e.g., having the plot summary section) in their Wikipedia pages. Second, we selected the film genres that make the dialogues more likely to resemble real-life ones. We used the Jinni attributes style, genre and attitude, shown in Table A1, to restrict our movie list.
However, such strong restrictions made us reject many popular films belonging to the excluded genres, such as 'Thriller'. To alleviate this, we additionally included the 100 most popular movies (by IMDb 5 popularity), whose plots we manually checked for being realistic (see Table A2).
Series dataset. We selected series that are (i) realistic and (ii) diverse in topics, yielding the following 14 TV shows: Gilmore Girls, FRIENDS, The O.C., One Tree Hill, Veronica Mars, The Office, How I Met Your Mother, Secret Life of an American Teenager, Queer As Folk, Greek, Dawson's Creek, The Big Bang Theory, Republic of Doyle and Frasier.

A.2 Three-character scene processing
From the scenes containing utterances by exactly three characters, we extracted uninterrupted sequences of utterances of two characters with at least three utterance turns. Assume that we have speakers A, B and C in the scene and we are interested in extracting interchanges for pairs (A,B) and (A,C). If the sequence of utterances in the scene looks like ABAACABA, then it can be broken into homogeneous sequences: {ABAA, AACA, ABA}. Thus, the number of utterance turns for each pair in the given scene will be seven for (A,B) and four for (A,C).
4 http://jinni.com/
5 http://imdb.com
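The extraction of two-speaker spans from a three-character scene can be sketched in Python as follows. The function name is hypothetical, and requiring both speakers of the pair to appear in a span is an assumption; the minimum length of three follows the description above.

```python
from itertools import groupby


def pair_spans(speakers, pair, min_len=3):
    """Maximal uninterrupted runs of utterances spoken only by the given pair.

    speakers is the sequence of speaker labels in scene order.
    A run is kept if it has at least `min_len` utterances and both
    speakers of the pair take part in it (assumption).
    """
    allowed = set(pair)
    spans = []
    for inside, run in groupby(speakers, key=lambda s: s in allowed):
        run = list(run)
        if inside and len(run) >= min_len and set(run) == allowed:
            spans.append(run)
    return spans


scene = list("ABAACABA")
ab = pair_spans(scene, ("A", "B"))  # runs ABAA and ABA
ac = pair_spans(scene, ("A", "C"))  # run AACA
print(sum(len(s) for s in ab), sum(len(s) for s in ac))
```

On the example scene this reproduces the counts given above: seven utterances for (A,B) and four for (A,C).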

B Crowdsourcing Annotation
Manually annotating datasets for the character relationship prediction task is common practice in related work (Kim and Klinger, 2019; Chaturvedi et al., 2016; Azab et al., 2019). We conducted our study on Mechanical Turk (MTurk), following the literary character annotation by Massey et al. (2015). Our work still differs significantly from Massey et al. (2015), because we allow for multiple relationship labels for each sample, discard changing relationships and aggregate results from many annotators.

B.1 MTurk task details
A screenshot of the MTurk task is shown in Figure B.1. In one task, the worker had to indicate the relationships for a given pair of characters, supplied with a link to the movie's Wikipedia page and the movie description on Gradesaver 6 (if available). The annotators were asked to indicate all the relationships applicable to the pair of characters, subject to the given rules. In the remainder of the subsection we list the exact instructions for the workers.
Task rules
Read plot summary and/or character descriptions from given link(s). Pay attention that the relationships are directed, mark the relation only from A to B.
Inspect relationships in all 3 categories and select all that apply, at least one relationship in this HIT should be selected. Select not more than one (can be zero) relationships from each category (some categories can be empty). There are the following exceptions to this rule:
• General: if the relationship changes during the film you can select several labels from the same category only for the following labels:
- Family: spouse, engaged
- Social: friend, enemy, lover
- Professional: classmate, teacher, student, colleague, boss, employee
[...] checkbox, which will also allow you to select one other label from social. Example: A and B are pals but A has a secret love for B, then the correct selection will be [friend, lover, one-way relationship].

Important notes
• Friend does not mean just positive sentiment, it means a stronger bond, like 'buddy' or 'pal'. Enemy is not a negative sentiment, but a stronger adverse relationship, like 'policeman vs. criminal'.
• If the business hierarchy level between A and B is not clear (whether it is higher/lower/same position), select colleague/co-workers.
• If you selected spouse, do not mark lover as it follows automatically.

B.2 Label aggregation
We first conducted several dry runs of the study with 10 annotators, after which we revised the labeling rules and the list of relationships. We used a manually annotated subset to fine-tune the number of annotators based on the F1 score. We found that selecting 6 annotators to label each pair did not result in a significant drop in precision and ensured greater recall, while saving annotation resources.
Table B1: Comparison of answer aggregation
We used an existing benchmark 7 of aggregation approaches, which enabled us to try out at least 7 different aggregation methods. Here we report only the best performing ones:
• Dawid-Skene model (DS, Dawid and Skene, 1979) is based on the Expectation Maximization (EM) algorithm, which jointly estimates the expertise of workers and the task label. This method has shown consistently optimal performance in many studies.
• Generative model of Labels, Abilities, and Difficulties (GLAD, Whitehill et al., 2009) is an extension to EM that additionally estimates the difficulty of each task.
• Bayesian Classifier Combination (BCC, Kim and Ghahramani, 2012) uses Gibbs sampling to optimize the posterior joint probability of labels and workers.
We compare them to the basic Majority Voting (MV) approach. Note that most of the models assume single-label answers, so we had to reformulate the problem as multiple binary decision problems to fit them. Since each pair can have multiple labels associated with it, and agreement may be reached on only a subset of those labels, we propose to evaluate both partial accuracy (workers' answers partially match the golden set) and total accuracy (workers' answers and the golden set are identical). Additionally, we evaluate precision and recall. The results are shown in Table B1.
The results for all models are close, with MV having the greatest total accuracy and BCC yielding the best recall. We opted to use MV aggregation, as we consider high precision and accuracy more important for this task. Additionally, MV has the advantage of being easier to interpret. One reason why the iterative approaches work only as well as simple majority voting could be the large number of workers, most of whom do only 1-2 tasks, which prevents the iterative models from effectively inferring the workers' expertise.
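The majority-voting aggregation over binary per-label decisions, together with the partial and total accuracy measures, can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used in the study: the function names are hypothetical, a label is assumed to be accepted when strictly more than half of the workers selected it, and a "partial match" is assumed to mean that the aggregated answer shares at least one label with the golden set.

```python
from collections import Counter


def majority_vote(worker_answers):
    """Aggregate the multi-label answers of several workers for one pair.

    Each label is treated as a separate binary decision and is kept when
    strictly more than half of the workers selected it (assumed cut-off).
    """
    n_workers = len(worker_answers)
    counts = Counter(label for answer in worker_answers for label in set(answer))
    return {label for label, c in counts.items() if c > n_workers / 2}


def partial_and_total_accuracy(predicted, gold):
    """Return (partial, total) accuracy over a list of character pairs.

    partial: aggregated answer overlaps the golden set (assumed definition);
    total:   aggregated answer is identical to the golden set.
    """
    n = len(gold)
    partial = sum(1 for p, g in zip(predicted, gold) if p & g) / n
    total = sum(1 for p, g in zip(predicted, gold) if p == g) / n
    return partial, total


# Toy data: two character pairs, labeled by six and three workers.
pair_1 = [{"friend", "colleague"}, {"friend"}, {"friend", "lover"},
          {"colleague"}, {"friend", "colleague"}, {"enemy"}]
pair_2 = [{"spouse"}, {"spouse"}, {"spouse", "friend"}]
predicted = [majority_vote(pair_1), majority_vote(pair_2)]
gold = [{"friend", "colleague"}, {"spouse"}]
partial_acc, total_acc = partial_and_total_accuracy(predicted, gold)
```

In the toy data, "colleague" is selected by exactly half of the workers for the first pair and is therefore rejected, so that pair counts as a partial but not a total match.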
To further ensure the high quality of our annotated data, we additionally tried the Honeypot method (Lee et al., 2010), where questions with known true answers (honeypots) are mixed into the task. Each worker's score is calculated as the fraction of honeypots they answered correctly; workers who did not get any honeypots were assigned the average score. After that, all workers' answers are scaled by the obtained scores, and a label is considered correct if the sum of its votes exceeds a threshold, fine-tuned on the annotated set.
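The honeypot-based weighting can be sketched as follows. The function names and data layout are hypothetical, and two details are assumptions not stated in the text: the "average score" is taken as the mean over workers who answered at least one honeypot, and the threshold is applied to the raw (unnormalized) sum of worker scores per label.

```python
from collections import defaultdict


def worker_scores(honeypot_results, all_workers):
    """Score each worker by the fraction of honeypots answered correctly.

    `honeypot_results` maps a worker id to a list of booleans (one per
    honeypot seen). Workers without honeypots get the average score.
    """
    scored = {w: sum(r) / len(r) for w, r in honeypot_results.items() if r}
    # Fallback of 1.0 when nobody saw a honeypot is an assumption.
    average = sum(scored.values()) / len(scored) if scored else 1.0
    return {w: scored.get(w, average) for w in all_workers}


def weighted_labels(answers, scores, threshold):
    """Keep the labels whose score-weighted vote sum exceeds `threshold`.

    `answers` maps a worker id to the set of labels chosen for one pair.
    """
    votes = defaultdict(float)
    for worker, labels in answers.items():
        for label in labels:
            votes[label] += scores[worker]
    return {label for label, v in votes.items() if v > threshold}


scores = worker_scores(
    {"w1": [True, True], "w2": [True, False], "w3": []},
    all_workers=["w1", "w2", "w3", "w4"],
)
answers = {"w1": {"friend"}, "w2": {"friend", "enemy"},
           "w3": {"enemy"}, "w4": {"friend"}}
accepted = weighted_labels(answers, scores, threshold=1.5)
```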

B.3 Details on Fleiss' kappa calculations
In this subsection we present the calculation of Fleiss' kappa coefficient for the multiclass, multilabel case.
Let N be the number of annotated pairs, indexed by i = 1, ..., N, and let K be the total number of possible labels, indexed by j = 1, ..., K. k_i is the number of labels that were selected by at least one annotator for the i-th pair, n is the total number of annotators, and n_ij is the number of annotators who assigned the j-th label to the i-th pair. Then kappa κ is calculated as follows: [...] the agreement of annotators per pair: [...] the number of assignments per label: [...]
We perform five-fold cross-validation, arranged so that the sets of movies from which the input character pairs come are disjoint. With that as a hard restriction, we tried to maximally balance the label distributions across the folds. For that we created multiple random assignments of movies to folds and chose the one that maximized the balance metric, which we defined as follows: mean([d_l / S_l for l in labels]), where S_l denotes the number of pairs for label l, and s^i_l the number of pairs for label l in fold i.
To create age embeddings we calculate the age difference (diff) between the speakers and assign it to one of the predefined diff bins. We set diff bins to be [(-inf; -13], [-12; -6]

C.2 Training mechanism
We train PRIDE in two steps. First, we train the model without external representations (age difference and interpersonal dimensions). We save the best checkpoint, based on the development set performance, and load it into the full model with external representations (except for the final classification layer). Then we train the full PRIDE model again with all the weights frozen, except for the external representations and the classification layer weights.

C.3 Training and hyperparameters
In our experiments we used a cluster with 46 GPUs (MEGWARE Gigabyte G291-Z20 server), with 4core NVIDIA Quadro RTX 8000 (48 GB GDDR6, 295 W). We used grid search with 144 parameter combinations (128 to create a checkpoint without external representations and another 16 to tune the full model). We picked the best combination based on the F1 score on the development set (in case of a tie on the F1 score, we maximized precision). The development set performance for the experiments described in the paper is given in Tables C1 and C2. The decision threshold was tuned on the model's predictions on the development set after training with the best hyperparameter setup. We also tried tuning the decision threshold on a per-class basis, but that did not significantly change the results.
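The model selection rule (best development F1, ties broken by precision) and the tuning of a single global decision threshold can be sketched as follows. The function names, the dictionary layout of a grid-search run and the candidate thresholds are hypothetical, and micro-averaged F1 is assumed here because the averaging used for threshold tuning is not specified in this section.

```python
def select_best_run(runs):
    """Pick the run with the highest dev F1; ties are broken by precision."""
    return max(runs, key=lambda r: (r["f1"], r["precision"]))


def micro_f1_at_threshold(probs, gold, threshold):
    """Micro-averaged F1 (assumed) of multi-label predictions at a threshold.

    `probs` and `gold` are lists of per-sample lists, one entry per label.
    """
    tp = fp = fn = 0
    for prob_row, gold_row in zip(probs, gold):
        for p, g in zip(prob_row, gold_row):
            predicted = p >= threshold
            if predicted and g:
                tp += 1
            elif predicted and not g:
                fp += 1
            elif g:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def tune_threshold(probs, gold, candidates):
    """Return the candidate threshold with the best dev-set micro F1."""
    return max(candidates, key=lambda t: micro_f1_at_threshold(probs, gold, t))


# Toy grid-search results and dev-set predictions (illustrative values only).
runs = [
    {"params": {"lr": 1e-5}, "f1": 0.40, "precision": 0.38},
    {"params": {"lr": 3e-5}, "f1": 0.40, "precision": 0.45},
    {"params": {"lr": 5e-5}, "f1": 0.35, "precision": 0.90},
]
best_run = select_best_run(runs)
dev_probs = [[0.9, 0.2], [0.6, 0.4]]
dev_gold = [[1, 0], [1, 1]]
best_threshold = tune_threshold(dev_probs, dev_gold, [0.3, 0.5, 0.7])
```

A per-class variant would call `tune_threshold` once per label column instead of once for the whole prediction matrix.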
We tuned the following hyperparameters: [...]
One epoch of training PRIDE with 420 training samples takes 17 seconds on average, with 12 minutes to train until the best epoch (all times are averaged across 5 folds). Inference on one test fold with an average of 156 samples takes 6.3 seconds. In addition, prior to training we create the interpersonal dimension representations; the inference for one dimension takes 36.2 minutes on average.
The number of parameters in PRIDE is given in Table C3. We separately calculated the parameters in BERT input embeddings, other BERT components, Transformer and the remaining components of PRIDE (such as age and speaker embeddings, classification layer and fully-connected layers for attention mechanism).
Additionally we tried several other training strategies: learning rate scheduling, word and utterance dropout, pretraining BERT and Transformer on movie script data and fine-tuning only BERT bias terms. We also experimented with attaching learned emotion representations to each utterance. We found that none of these modifications significantly changed the performance.