Annotation and Evaluation of Coreference Resolution in Screenplays

Screenplays refer to characters using different names, pronouns, and nominal expressions. We need to resolve these mentions to the correct referent character for better story understanding and holistic research in computational narratology. Coreference resolution of character mentions in screenplays is challenging because of the large document lengths, unique structural features like scene headers, interleaving of action and speech passages, and reliance on the accompanying video. In this work, we first adapt widely-used annotation guidelines to address domain-specific issues in screenplays. We develop an automatic screenplay parser to extract the structural information and design coreference rules based upon the structure. Our model exploits these structural features and outperforms a benchmark coreference model on the screenplay coreference resolution task.


Introduction
Screenplays are semi-structured text documents containing the dialogue and directions of a film. Automated screenplay analysis provides an opportunity early in the creative process to offer insights into character representations and portrayals (of who interacts with whom, about what, and how), including from a diversity, inclusion, and social impact perspective (Ramakrishna et al., 2017; Shafaei et al., 2020; Martinez et al., 2020). A typical screenplay contains indented blocks of text that can be classified into scene headers, scene descriptions, speakers, and utterances, as shown in Figure 1 (Agarwal et al., 2014). A scene header starts a new scene and provides location and temporal information. Scene descriptions describe the characters and their actions, and the speaker and utterance blocks contain the characters' names and speech.
Figure 1: Coreference-annotated screenplay excerpt from the movie The Shawshank Redemption (1994). Mentions of the same character are underlined with the same color.
Screenplays can refer to a character by different names, pronouns, and nominal expressions. For example, the screenplay excerpt shown in Figure 1 refers to the character Andy by the mentions Andy Dufresne (name), The wife-killin' banker (nominal), and his (pronoun). Many downstream tasks need to find and map all such mentions to the correct referent. For example, Gorinski and Lapata (2015) resolved pronominal mentions to their correct antecedent (prior co-referring mention) to find speaker-listener and semantic relations between characters for the movie-summarization task. Chen and Choi (2016) crowdsourced character-mention labels in TV show transcripts for automatic character identification. Deleris et al. (2018) used direct and indirect character references in utterances to study social relationships between characters. Thus, resolving character mentions in screenplays is an essential subtask in many applications. In NLP literature, this task is formally called coreference resolution (Jurafsky and Martin, 2009) and has been studied extensively (Sukthanker et al., 2020; Stylianou and Vlahavas, 2021).
Most existing coreference datasets focus on news and web text (Pradhan et al., 2012; Webster et al., 2018) and do not include screenplays, which have distinct content and structure. First, screenplays are much longer than news articles (Gorinski and Lapata, 2015), which increases the computational complexity of antecedent scoring (Lee et al., 2017). Second, scene headers alter the story's context, affecting coreference between mentions in different scenes. Lastly, coreference annotation of some mentions requires knowledge of the accompanying movie or TV clip because textual descriptions may not capture all the post-production visual details of a scene. Thus, the two main challenges for coreference resolution in screenplays are (1) the lack of suitable annotation rules to handle domain-specific issues and (2) the need for methods that leverage the unique screenplay structure to improve coreference resolution.
Our objective in this work is to address the coreference resolution of characters in screenplays. We focus only on characters because most modern narrative studies are centered around their role as agents driving the plot (Bamman et al., 2013; Labatut and Bost, 2019). Our contributions are: 1) we establish coreference annotation guidelines for screenplays and use them to label screenplay excerpts, 2) we develop a screenplay parser to convert the semi-structured text into a machine-readable format, and 3) we use the structural information of screenplays to design coreference rules, which improve coreference resolution performance when combined with a benchmark coreference resolution model (Lee et al., 2018).

Related Work
Screenplay Parsing: Weng et al. (2009) motivated the need for screenplay parsing for social network analysis. Agarwal et al. (2014) formalized the screenplay parsing task and developed an SVM-based parser. Winer and Young (2017) extended it to extract fine-grained information from scene headers. Our parser uses a rule-based algorithm to achieve comparable performance.
Coreference Resolution: OntoNotes 5 is the benchmark dataset for English coreference resolution, containing documents from newswire, broadcast news, telephone conversations, and weblogs (Pradhan et al., 2012). However, it does not contain screenplay texts. The closest work to screenplay coreference is the LitBank dataset (Bamman et al., 2020), which contains coreference annotations of 100 works of fiction taken from Project Gutenberg (Lahiri, 2014).
Few previous works address coreference annotation in screenplays. Chen and Choi (2016) labeled character mentions for two TV shows: Friends and The Big Bang Theory. Zhou and Choi (2018) later extended this dataset by including plural mentions, but mainly focused on character utterances and did not consider action notes. Ramanathan et al. (2014) created a dataset of 19 TV episodes for joint coreference resolution in visual and textual media content. Gorinski and Lapata (2015) created the ScriptBase corpus of movie screenplays, which included coreference labels. However, they obtained the labels automatically with the Stanford CoreNLP system (Lee et al., 2011), which has not been evaluated for screenplay coreference.

Annotation
We annotate screenplays with character mentions for coreference resolution (see Figure 1). Following OntoNotes 5 annotation guidelines, we mark the maximal extent of noun phrases, pronouns, and possessives that refer to some character (Pradhan et al., 2012). Characters do not have to be persons; consider, for example, the spider Aragog in the Harry Potter movies. We include singleton character entities. We do not label mentions that refer to multiple characters because finding the correct antecedent of plural mentions often requires the aid of the accompanying video clip (Zhou and Choi, 2018). For example, without watching the movie, it is difficult to decide whether They refers to Vosen and agents, Others, or both in the following lines.
[Vosen and agents]x come running out of the front door.
[Others]y leave through a side entrance.
[They] jump in sedans. (Bourne Ultimatum, 2007)
We follow OntoNotes 5 annotation guidelines to handle appositions (adjacent noun phrases separated by a comma, colon, or parentheses), copula (noun phrases connected by linking verbs such as is and look), and generic you mentions (Pradhan et al., 2012). If a mention's referent is revealed to be identical with another character as the story progresses, we tag the mention with the latter character (Bamman et al., 2020). The screenplay sometimes contains references to the reader or the camera's point of view. We tag such instances with a special READER entity.
We downloaded the screenplay documents from IMSDb. We chose these movies because the annotators were familiar with them, and they cover a wide range of genres (drama, action, thriller, and war). Three doctoral students were each assigned one screenplay, which they annotated for coreference according to the guidelines of Section 3. The lead author checked the annotations independently for labeling errors. Fewer than 1% of the mentions required correction, suggesting high overall agreement. Table 1 describes some statistics of the labeled data. The corpus contains 2,807 mentions in total, covering 106 characters. More than 44% of the mentions are pronouns, and about 11% are nominal mentions.

Model
Our coreference model consists of two parts: 1) a screenplay parser to extract structural information, and 2) coreference rules to resolve mentions occurring in speaker blocks.

Screenplay Parser
The screenplay parser reads raw screenplay text documents and assigns a structural tag (scene header, scene description, speaker, utterance, or other; see Figure 1) to each line, following the tagset of Agarwal et al. (2014). The parser uses regular expressions to assign the structural tags.
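A regular-expression tagger of this kind can be sketched as follows. The patterns, the indentation threshold, and the function name tag_lines are illustrative assumptions; the actual expressions used by our parser are not listed here, and real screenplays need considerably more cases.

```python
import re

# Hypothetical patterns for illustration only.
# Scene headers usually open with an interior/exterior marker.
SCENE_HEADER = re.compile(r"^\s*(INT\./EXT\.|INT\.|EXT\.)\s+\S")
# Speaker lines are assumed to be deeply indented, upper-case names,
# optionally followed by a parenthetical such as (V.O.).
SPEAKER = re.compile(
    r"^\s{15,}[A-Z][A-Z .'\-]*(\s*\((V\.O\.|O\.S\.|CONT'D)\))?\s*$"
)


def tag_lines(lines):
    """Assign one structural tag to each line of a raw screenplay."""
    tags = []
    in_utterance = False  # True while inside a speaker's speech block
    for line in lines:
        if not line.strip():
            tags.append("other")  # blank lines end any speech block
            in_utterance = False
        elif SCENE_HEADER.match(line):
            tags.append("scene header")
            in_utterance = False
        elif SPEAKER.match(line):
            tags.append("speaker")
            in_utterance = True
        elif in_utterance:
            tags.append("utterance")
        else:
            tags.append("scene description")
    return tags
```

The sketch keeps a single piece of state (whether the previous non-blank line opened a speech block) so that utterance lines need no pattern of their own.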

Coreference Resolution
We use the following strategy to find the coreference clusters of characters in screenplays.
Add says: Given a screenplay, we parse it using our screenplay parser and collect all lines tagged as scene header, scene description, speaker, or utterance. We add the word says after speaker-tagged lines and concatenate all lines, separated by a newline delimiter. This lexical addition tells the model that the character mentioned in the speaker-tagged line speaks the succeeding utterance-tagged lines.
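The Add says step can be sketched as below, assuming the parser output is available as two parallel lists of lines and tags. The function name add_says and the whitespace stripping are assumptions made for this illustration.

```python
def add_says(lines, tags):
    """Build the coreference model input from parsed screenplay lines.

    Lines tagged 'other' are dropped, the word 'says' is appended to
    speaker-tagged lines, and the result is joined with newlines.
    """
    kept = {"scene header", "scene description", "speaker", "utterance"}
    out = []
    for line, tag in zip(lines, tags):
        if tag not in kept:
            continue  # skip lines tagged 'other'
        text = line.strip()  # stripping indentation is an assumption
        if tag == "speaker":
            text = f"{text} says"
        out.append(text)
    return "\n".join(out)
```

With this transformation, a speaker line such as RED followed by an utterance reads to the model as "RED says" followed by the speech, which gives first-person pronouns in the utterance a nearby candidate antecedent.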
Keep speakers: We apply a coreference resolution model, pre-trained on OntoNotes 5, to the concatenated text to find coreference clusters. Since OntoNotes 5 annotates for unrestricted coreference (find coreference clusters of all ACE entities and events), we need to prune the clusters to keep only those containing character mentions. We keep clusters that contain any mention which appears in a speaker-tagged line.
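A minimal sketch of the Keep speakers filter follows. It assumes that mentions and speaker-tagged lines are both represented as end-exclusive character offsets into the concatenated text; this representation and the function name keep_speaker_clusters are hypothetical.

```python
def keep_speaker_clusters(clusters, speaker_spans):
    """Keep clusters with at least one mention inside a speaker-tagged line.

    clusters: list of clusters, each a list of (start, end) offsets.
    speaker_spans: list of (start, end) offsets of speaker-tagged lines.
    Offsets are end-exclusive (an assumption of this sketch).
    """

    def in_speaker_line(mention):
        start, end = mention
        return any(s <= start and end <= e for s, e in speaker_spans)

    return [
        cluster
        for cluster in clusters
        if any(in_speaker_line(mention) for mention in cluster)
    ]
```

A cluster whose mentions never occur in a speaker-tagged line, for instance a cluster for an object or an event, is discarded by this filter.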
Merge clusters: Due to the large document size (see Table 1), long coreference chains, like those of main characters, sometimes get segmented and occur as separate clusters. We merge the segmented clusters using speaker information. Screenplays usually refer to a character by a unique name in the speaker-tagged lines. If two speaker-tagged lines belonging to separate clusters contain identical names, we merge the corresponding clusters.
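The Merge clusters step can be sketched with a small union-find structure, so that merges chain transitively when several clusters share a name. The input representation (mention ids plus a dictionary from speaker-line mentions to names) and the function name merge_by_speaker_name are assumptions of this illustration.

```python
def merge_by_speaker_name(clusters, speaker_names):
    """Merge clusters that contain speaker-line mentions with identical names.

    clusters: list of clusters, each a list of hashable mention ids.
    speaker_names: dict mapping a mention id that lies in a speaker-tagged
        line to the name on that line (hypothetical representation).
    """
    parent = list(range(len(clusters)))

    def find(i):
        # Follow parent links to the representative cluster index.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_cluster_with = {}  # name -> index of the first cluster seen with it
    for idx, cluster in enumerate(clusters):
        for mention in cluster:
            name = speaker_names.get(mention)
            if name is None:
                continue  # mention is not in a speaker-tagged line
            key = name.strip()
            if key in first_cluster_with:
                parent[find(idx)] = find(first_cluster_with[key])
            else:
                first_cluster_with[key] = idx

    merged = {}
    for idx, cluster in enumerate(clusters):
        merged.setdefault(find(idx), []).extend(cluster)
    return list(merged.values())
```

Names are compared exactly after trimming whitespace; any looser matching (case folding, aliases) would be a further assumption beyond the rule stated above.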

Screenplay Parser Evaluation
We annotated lines of 39 movie screenplay excerpts for the structural tags. These movies were different from the ones annotated for coreference. Three annotators, all doctoral students, labeled 9,758 lines, with a Krippendorff's alpha of 0.983 (strong agreement). We parsed the annotated excerpts using our rule-based parser and evaluated its classification performance.
Table 2 (ablation rows): change in each score when one rule is removed from the full model. The last value in each row is the change in avg. F1.
-Add says         -1.0   -2.5   -1.9   -9.3  -15.0  -12.7   -5.1   -3.6   -4.3   -5.1   -7.0   -6.3
-Keep speakers    -2.5   -2.2   -2.4   -2.4   -2.5   -2.4  -17.7   +3.5   -6.5   -7.5   -0.4   -3.7
-Merge clusters   -0.2   -2.9   -1.7   +3.8   -9.8   -5.3   -1.1   -6.3   -3.8   +0.8   -6.3   -3.6
[...], 1998) and CEAFe (Luo, 2005) measures to evaluate our model. We also used the NEC score (Agarwal et al., 2019) to evaluate coreference resolution by mention type: name, pronoun, and nominal. Tables 2 and 5 show the performance of our model and the baseline in coreference resolution and character mention identification, respectively.
Ablation Study: We study how each coreference rule described in Section 4.2 (Add says, Keep speakers, and Merge clusters) contributes to the model's performance. Table 2 shows the results of the ablation experiments. -Add says means that we do not add says after speaker-tagged lines, -Keep speakers means that we retain clusters that contain any mention whose named entity tag is PERSON instead of those that contain any mention appearing in speaker-tagged lines, and -Merge clusters means that we do not merge clusters.

Discussion
The results of Table 2 suggest that inputting the raw screenplay directly to the pre-trained coreference model performs poorly. The performance improves substantially when we use the coreference rules (+17.3 avg. F1). The improvement is largest for pronouns (+22.3 NEC F1), as shown in Table 4, possibly because the Add says rule helps the model find the antecedent of personal pronouns in utterance-tagged lines. The rule adds 6.3 avg. F1 to the overall performance (Table 2). Coreference resolution of named mentions also improves greatly (+8.5 NEC F1), probably because the Merge clusters rule joins clusters that contain mentions in speaker-tagged lines with identical names. It contributes 3.6 avg. F1 to the final score (Table 2). The Keep speakers rule adds 3.7 avg. F1, which suggests that retaining clusters containing speaker-tagged mentions is a better way to retrieve character references than keeping those containing PERSON-tagged (NER) mentions.

Applications
We show two applications of coreference resolution in computational narratology: finding mention-type interactions and character actions.
Mention-type Interactions: Figures 2 and 3 show character networks of the top five speaking characters from the movie Inglourious Basterds (2009), capturing speech and mention-type interactions, respectively. The edge weight between characters A and B in the speech network (Figure 2) is the number of times A speaks right after B or vice versa (Ramakrishna et al., 2017). The directed edge weight from character A to character B in the mention network (Figure 3) is the number of times A mentions B in their speech. We used the structural tags and coreference annotations to create these networks. We observe that the two character networks provide different insights. By degree centrality, Bridget is the least 'important' character in terms of speech interactions, but is the character most mentioned by others. This is consistent with the movie plot, which contains a scene where the Basterds discuss their plans of meeting Bridget without her being there.
Character Actions: We can use semantic role labeling and coreference resolution to find character actions. Figure 4 shows a subgraph of the character action network of the movie The Shawshank Redemption (1994). The directed edge label is the action, the head node is the agent (ARG0), and the tail node is the patient (ARG1) of the action. We applied the SRL model of Shi and Lin (2019) to the screenplay's sentences, and then substituted the semantic roles with their referred character, wherever possible. From Figure 4, we observe that Andy had positive interactions with Red, but was negatively treated by Boggs and Hadley, which is in line with the movie plot.
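The two edge-weight definitions used for the speech and mention networks can be sketched as follows. The input formats are assumptions: an ordered list of speakers, one per turn, and a list of (speaker, mentioned character) pairs derived from the coreference annotations. Ignoring scene boundaries between turns and skipping self-mentions are also assumptions of this sketch.

```python
from collections import Counter


def speech_network(speaker_turns):
    """Undirected weights: how often two characters speak in adjacent turns.

    speaker_turns: ordered list of speaker names, one per turn.
    Returns a Counter keyed by frozenset({A, B}).
    """
    weights = Counter()
    for prev, curr in zip(speaker_turns, speaker_turns[1:]):
        if prev != curr:  # consecutive turns by the same speaker add no edge
            weights[frozenset((prev, curr))] += 1
    return weights


def mention_network(utterance_mentions):
    """Directed weights: how often speaker A mentions character B in speech.

    utterance_mentions: list of (speaker, mentioned_character) pairs.
    Returns a Counter keyed by (A, B).
    """
    weights = Counter()
    for speaker, mentioned in utterance_mentions:
        if speaker != mentioned:  # skipping self-mentions is an assumption
            weights[(speaker, mentioned)] += 1
    return weights
```

Because the speech network is keyed by unordered pairs and the mention network by ordered pairs, a character who rarely speaks can still have high in-degree in the mention network, which is the kind of contrast described above.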

Summary and Future Work
We presented a coreference annotation guideline for screenplays and developed rules based on the screenplay's structure to improve coreference resolution performance. Our work can facilitate future annotation and modeling of coreference resolution in screenplays to support computational narratology studies. We plan to label more screenplays to train an end-to-end coreference model and study character interactions using coreference clusters. The data is available in the supplementary material.