Character Coreference Resolution in Movie Screenplays

Introduction
Screenplays are semi-structured text documents that narrate the events within the story of a movie, TV show, theatrical play, or other creative media. They contain stage directions that provide a blueprint for the produced content, offering valuable information for media understanding tasks (Turetsky and Dimitrova, 2004; Sap et al., 2017; Martinez et al., 2022). Automated text processing methods can enable a scalable and nuanced analysis of screenplays and provide key novel insights about narrative elements and story characteristics (Ramakrishna et al., 2017). This paper focuses on a critical aspect of automatic screenplay analysis: character coreference resolution.
A screenplay has a hierarchical structure that includes a sequence of scenes, with each scene containing action passages and character utterances (Argentini, 1998). A scene starts with a slugline that specifies where and when the scene takes place. The action passages introduce the characters and describe their actions, and the utterances contain their verbal interactions. Sometimes, transition phrases, such as FADE IN and CUT OUT, separate adjacent scenes, detailing the camera movement. These structural elements form the document-level syntax of the screenplay. Identifying this modular structure, called screenplay parsing, is an essential preprocessing step for downstream analyses.
Screenwriters usually follow a uniform indentation and word case scheme while formatting the screenplay. Some standard conventions are to arrange sluglines and action passages at the same indentation level and to write sluglines and speaker names in upper case (Riley, 2009). However, publicly available screenplays can exhibit wide variability in document formatting and deviate from these norms by removing indentations, containing watermark characters, or omitting location keywords (INT and EXT) in sluglines. A screenplay parser should be robust to these issues to extract the document's structure correctly and consistently. Once parsed, we can process the extracted structural segments for narrative analysis.
Narrativity occurs when characters interact with each other in some spatiotemporal context (Piper et al., 2021). Computational narrative understanding involves natural language tasks such as named entity recognition (NER) to identify the characters (Srivastava et al., 2016), coreference resolution to gather co-referring character mentions (Elson et al., 2010; Baruah et al., 2021), semantic role labeling to find their actions (Martinez et al., 2022), relation extraction to discover character attributes (Yu et al., 2022), sentiment analysis to understand the attitude of their interactions (Nalisnick and Baird, 2013), and summarization to spot the key events (Gorinski and Lapata, 2015). Of these tasks, coreference resolution presents a unique challenge because of the long document size of screenplays (Baruah et al., 2021). Modern coreference models rely on transformers to capture discourse semantics. However, the average screenplay length (30K words) far exceeds the restricted context of transformers (512 tokens). As document size increases, the number of words between a mention and its antecedent grows; several scenes can elapse before a character is mentioned again in the screenplay. A screenplay coreference model should be able to handle such distant coreference relations. Fig 1 shows an example of a screenplay excerpt annotated with character coreference labels.
In this paper, we tackle character coreference resolution in movie screenplays. We focus on characters because they are the primary agents that drive the plot (Bamman et al., 2013) through rich and dynamic interactions (Labatut and Bost, 2019). Long-range coreference relations are also more common for characters than for other entities (Bamman et al., 2020). To support our modeling, we augment existing screenplay datasets synthetically and through human annotations. First, we systematically add different formatting perturbations to screenplay documents and train our screenplay parser to be robust to these variations. We use this parser to find speakers and segment boundaries as a preprocessing step to coreference resolution. Second, we annotate character mentions in six full-length screenplays and model the coreference relation by scoring word pairs (Dobrovolskii, 2021). We adapt the model inference to long screenplay documents through fusion-based and hierarchical methods. In summary, we contribute a robust screenplay parser trained on synthetically perturbed screenplays, new coreference annotations for six full-length screenplays, and two inference methods that scale word-level coreference to long documents.

Related Work

Screenplay Parsing. Agarwal et al. (2014) trained support vector machines on synthetic training data to build a robust screenplay parser. We adopt a similar approach, but handle a wider set of document-related issues and leverage modern sequence embedding models instead of hand-crafted features. Winer and Young (2017) used a recursive descent parser to extract the spatiotemporal information from sluglines. Ramakrishna et al. (2017) built a rule-based screenplay parser to find character names and utterances, ignoring the action passages. Baruah et al. (2021) annotated the line-wise structural tag of 39 screenplay excerpts to evaluate their rule-based parser. We extend this dataset with synthetic formatting variations to train our robust screenplay parser.

Screenplay Coreference Resolution. Baruah et al. (2021) established annotation guidelines for the character coreference task in screenplays. They annotated three screenplay excerpts to evaluate pretrained coreference models. They combined the neural model with rules inspired by the narrative structure of screenplays. The limitation of their work is that they used excerpts instead of full-length scripts. We adopt their annotation guidelines to label six full-length screenplays, enabling us to study how our models scale to the entire narrative.
Literary Coreference Resolution. Several past studies have tried to extract social networks from literary texts to study character interactions, where they naturally need to unify different character mentions to create the network's nodes (Labatut and Bost, 2019). Most methods clustered person names using heuristics such as matching gender and honorifics or finding name variations and nicknames (Elson et al., 2010; Elsner, 2012; Coll Ardanuy and Sporleder, 2014; Vala et al., 2015). Modern end-to-end neural models instead score pairs of candidate mention spans (Lee et al., 2017, 2018); combined with mention pruning, they achieved quadratic complexity in document size. Dobrovolskii (2021) substituted spans with head words, removing the need for mention pruning while maintaining the same quadratic runtime. Bohnet et al. (2022) used a text-to-text paradigm to make sentence-level co-referential decisions and achieved state-of-the-art performance using the T5 text-to-text model (Raffel et al., 2022).
However, these methods do not scale to long documents because the quadratic complexity of scoring mention (span or token) pairs becomes intractable as document size increases. Memory-bounded methods (Xia et al., 2020; Toshniwal et al., 2020; Thirukovalluru et al., 2021) keep a finite set of entity representations and update it incrementally for each mention. Most of these models are evaluated on the OntoNotes (Pradhan et al., 2012) and LitBank (Bamman et al., 2020) corpora, whose average document lengths are less than 500 and 2K tokens, respectively: an order of magnitude less than the average screenplay size. The entity spread (the number of tokens between the first and last mention) and the maximum active entity count (the active entity count of a token is the number of entities whose spread includes the token) are larger for screenplays because the main characters tend to appear throughout the story in bursts (Bamman et al., 2020). In this work, we adapt Dobrovolskii's (2021) word-level coreference model to movie screenplays by fusing the word-pair coreference scores from overlapping segments or by running inference hierarchically.
Screenplay Parsing

Problem Setup
Agarwal et al. (2014) posed screenplay parsing as a classification task of assigning a single structural label to each screenplay line. The structural types include slugline, action, speaker, expression, utterance, and transition. We define a screenplay segment as a contiguous sequence of lines with the same structural label.
Sluglines indicate the beginning of a new scene. They contain information about the scene's location, time, and whether it occurs in an interior (INT) or exterior (EXT) setting. Action lines describe the characters and their actions. Most non-dialogue lines fall under this class. Speaker lines contain character names that immediately precede their utterances. Utterance lines comprise words spoken by the characters. Expression lines describe speech acts or other scene information that provide more context to utterances, such as shouting, interrupting, contd., pause, and O.S. (off-screen). Screenwriters usually enclose expressions in parentheses. Expressions can appear alongside speakers and utterances on the same line. We classify such lines into the latter class (speaker or utterance) to avoid ambiguity. Transition lines detail the camera movement when the scene changes on screen.

Screenplay Document Issues
Screenplay documents retrieved from online resources like IMSDB and DailyScript, or even shooting drafts shared by movie studios, can contain optical character recognition (OCR) errors or disregard formatting conventions. The most common issues found are:
1. No Whitespace - The screenwriting convention is that sluglines and action lines should have the smallest indent, followed by utterances and speakers. Screenwriters should separate segments by at least one blank line (Riley, 2009). Non-adherence to these rules or OCR errors might remove all indents and blank lines, making it challenging to determine segment boundaries.
2. Missing Scene Keywords -Sluglines omit the INT or EXT keyword.
3. Uncapitalization -Sluglines and speaker names are written in lowercase.
4. Watermark - Some screenplays might only be publicly available as PDF files containing watermark logos or stamps to credit the creator. OCR conversion of the PDF might retain text-based watermarks as spurious letters in screenplay lines.
5. Speaker Name contains Keyword -Speaker names might include keywords used in sluglines or transitions, for example, CLINT, CUTTER, DEXTER. These names can confuse rule-based parsers relying on keyword lookup for classification.
6. Extra Expressions -Expressions might be misplaced and appear between action lines, instead of with utterances.
7. Extra Symbols -Asterisks or page numbers can occur in the beginning or end of screenplay lines.
Fig 2b shows some screenplay formatting issues. The performance of rule-based parsers declines as these anomalies become more prevalent.

Data Augmentation
Following Agarwal et al. (2014), we create synthetic data to train a robust parser. We collect clean screenplay excerpts annotated line-wise with structural labels. Then, for each issue described in the previous section, we create a noisy copy of the labeled data wherein some lines are changed or inserted to contain the corresponding anomaly.
For example, for the Missing Scene Keyword type, we remove INT and EXT keywords from all sluglines. For the Watermark type, we randomly insert a new watermark line containing spurious English letters. We create the other copies similarly. We augment the original clean corpus with these noisy copies and use the combined collection for training and validating our screenplay parser.
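As a concrete illustration, here is a minimal Python sketch of such perturbation functions, assuming line-wise label strings; the function names, label for noise lines, and insertion probability are our own assumptions, not the authors' code.

import random
import re
import string

def remove_scene_keywords(lines, labels):
    """Missing Scene Keywords: strip INT/EXT from sluglines."""
    out = []
    for line, label in zip(lines, labels):
        if label == "slugline":
            line = re.sub(r"\b(INT|EXT)\.?\s*", "", line)
        out.append(line)
    return out, labels

def insert_watermarks(lines, labels, p=0.05):
    """Watermark: insert lines of spurious letters at random positions."""
    out_lines, out_labels = [], []
    for line, label in zip(lines, labels):
        if random.random() < p:
            junk = "".join(random.choices(string.ascii_letters, k=8))
            out_lines.append(junk)
            out_labels.append("other")   # assumed label for noise lines
        out_lines.append(line)
        out_labels.append(label)
    return out_lines, out_labels

def strip_whitespace(lines, labels):
    """No Whitespace: drop indentation and blank lines."""
    kept = [(l.strip(), t) for l, t in zip(lines, labels) if l.strip()]
    return [l for l, _ in kept], [t for _, t in kept]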

Screenplay Parsing Model
In a typical screenplay, action lines follow sluglines, utterances follow speakers, and transitions mainly occur before sluglines. We model our screenplay parser as a recurrent neural network (RNN) to exploit this structural arrangement.
We encode each line using a concatenation of sentence embeddings and syntactic, orthographic, and keyword-based features. Sentence embeddings capture the semantics of a sentence in a fixed-dimensional vector representation. Syntactic features include counts of part-of-speech and named entity tags. Orthographic features comprise counts of left and right parentheses and capitalized words. Keyword-based features contain tallies of slugline and transition keywords.
We input the feature representations to a bidirectional RNN to obtain a sequence of hidden vectors. Each hidden vector corresponds to a screenplay line. We feed each vector into a densely connected feed-forward neural network. We compute the label probability as a softmax function over the output neurons of the dense layer. We train the model using class-weighted cross entropy loss.
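To make this concrete, here is a minimal PyTorch sketch of such a line tagger. Only the single-layer bidirectional RNN, the dense layer with softmax output, and the class-weighted cross entropy come from the text; the class name, feature dimension, and the uniform class weights are assumptions.

import torch
import torch.nn as nn

class ScreenplayParser(nn.Module):
    """Bidirectional LSTM tagger over per-line feature vectors (a sketch;
    the 256-dim hidden size matches the setup reported later)."""

    def __init__(self, feat_dim, n_labels=6, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=1,
                           bidirectional=True, batch_first=True)
        self.out = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_labels))

    def forward(self, feats):          # feats: (batch, n_lines, feat_dim)
        hidden, _ = self.rnn(feats)    # (batch, n_lines, 2 * hidden)
        return self.out(hidden)        # per-line label logits

# Class-weighted cross entropy, as described; the weights are placeholders.
loss_fn = nn.CrossEntropyLoss(weight=torch.ones(6))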

Character Coreference Resolution

Problem Setup
Character coreference resolution is a document-level hard-clustering task where each cluster is a set of text spans that refer to a single unique character. We refer to text spans that mention a character as character mentions. Character mentions can occur in any structural segment of the screenplay.

Screenplay Coreference Model
We adapt the word-level coreference resolution model of Dobrovolskii (2021) to the screenplay character coreference resolution task. The model first finds the coreference links between individual words and then expands each word to the subsuming span. We chose this model because it achieved near-state-of-the-art performance on the OntoNotes dataset while having a simple architecture and maintaining quadratic complexity. Following Lee et al. (2017), we formulate the coreference resolution task as a set of antecedent assignments y_i for each word i.
The assignment y_i can be a preceding word or a dummy antecedent ϵ if word i does not have an antecedent or is not a character mention. We model the probability distribution P(y_i) over candidate antecedents as:

P(y_i = j) = exp(s(i, j)) / Σ_{j′ ∈ Y(i)} exp(s(i, j′))   (1)

where Y(i) is the set of candidate antecedents of word i and s(i, j) is the coreference score between words i and j. We fix s(i, ϵ) = 0. The following steps show how we compute s(i, j).

Word Representations. Given a screenplay S containing n words, we tokenize each word to obtain m wordpiece subtokens. We encode the subtokens using BERT-based transformers (Devlin et al., 2019) to obtain contextualized subtoken embeddings T. We do not pass the whole subtoken sequence to the transformer because the sequence length m is usually greater than the transformer's context window. Instead, we split the subtoken sequence into non-overlapping segments and encode each segment separately; Joshi et al. (2019) showed that overlapping segments provide no improvement. We obtain the word representations X as a weighted sum of each word's subtoken embeddings. We find the weights by applying a softmax function over the attention scores of its subtokens. We calculate the subtoken attention scores A by a linear transformation W on T:

A = T · W   (2)
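As a minimal sketch of this subtoken-to-word pooling, assuming a word-to-subtoken index map and the tensor shapes noted in the comments (both our assumptions):

import torch

def pool_words(T, A, word2subtok):
    """Pool subtoken embeddings into word representations.
    T: (m, d) subtoken embeddings; A: (m,) attention scores, e.g.
    A = (T @ W).squeeze(-1); word2subtok: list of index tensors,
    one per word, giving the subtoken positions of that word."""
    X = []
    for idx in word2subtok:
        weights = torch.softmax(A[idx], dim=0)             # softmax over the word's subtokens
        X.append((weights.unsqueeze(-1) * T[idx]).sum(0))  # weighted sum of embeddings
    return torch.stack(X)                                  # (n_words, d)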
Character Scores. The character score of a word estimates its likelihood of being the head word of a character mention. We calculate character scores because we only want to model coreference between characters instead of all entity types. We obtain word-level character representations Z by concatenating the word representations X with word feature embeddings. The word features include the part-of-speech, named entity, and structural tags of the word. The word's structural tag is the structural tag of the screenplay line containing the word. We apply a bidirectional RNN to Z to obtain hidden vectors H. We input each hidden vector H_i to a feed-forward neural network with a single output neuron to find the character score s_r(i) for word i:

s_r(i) = FFN_r(H_i)   (4)
Coarse Coreference Scores. The coarse coreference score is a computationally efficient but crude estimate of how likely two words corefer. We calculate it as the sum of a bilinear transformation W_c of the word embeddings and the character scores of the two words:

s_c(i, j) = X_i · W_c · X_j^⊺ + s_r(i) + s_r(j)   (5)

Antecedent Coreference Scores. We retain the top k likely antecedents of each word according to the coarse coreference scores s_c. We encode a word and candidate antecedent pair (i, j) by concatenating their word embeddings, the element-wise product of their word embeddings, and a pairwise representation ϕ, which encodes the distance between words i and j and whether they are spoken by the same character. We feed the word-antecedent representations to a feed-forward neural network to obtain antecedent coreference scores s_a(i, j).
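The coarse scoring and top-k pruning step can be sketched as follows; the function name and the antecedent-only masking detail are our assumptions, though k = 50 matches the setup reported later.

import torch

def coarse_prune(X, s_r, W_c, k=50):
    """Coarse scores (Eq 5) and top-k antecedent pruning; a sketch.
    X: (n, d) word embeddings, s_r: (n,) character scores, W_c: (d, d)."""
    n = X.size(0)
    s_c = X @ W_c @ X.T + s_r.unsqueeze(1) + s_r.unsqueeze(0)  # (n, n)
    # A word may only take preceding words as antecedents (j < i).
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool))
    s_c = s_c.masked_fill(mask, float("-inf"))
    top_scores, top_idx = s_c.topk(min(k, n), dim=1)           # per-word candidates
    return top_scores, top_idx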
The final coreference score of a word and candidate antecedent pair (i, j) is the sum of the coarse and antecedent coreference scores: s(i, j) = s_c(i, j) + s_a(i, j). The predicted antecedent for word i is the antecedent with the maximum s(i, j) score. Word i has no antecedent if s(i, j) is negative for all k candidate antecedents.

Span Boundary Detection. We find the span boundaries for words that are coreferent with some other word. We concatenate the word embedding with the embeddings of neighboring words and pass the pairwise representations through a convolutional neural network followed by a feed-forward network to get start and end scores. The preceding and succeeding words with the maximum start and end scores mark the span boundaries.
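A minimal sketch of such a span extractor, assuming the pairing scheme, kernel size, and hidden size below (the convolution-plus-feed-forward structure comes from the text):

import torch
import torch.nn as nn

class SpanExtractor(nn.Module):
    """Score start/end positions around a head word; a sketch."""

    def __init__(self, d, hidden=256, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(2 * d, hidden, kernel, padding=kernel // 2)
        self.ffn = nn.Linear(hidden, 2)          # start and end scores

    def forward(self, X, head):                  # X: (n, d); head: word index
        # Pair every word's embedding with the head word's embedding.
        pair = torch.cat([X, X[head].expand_as(X)], dim=-1)   # (n, 2d)
        h = self.conv(pair.T.unsqueeze(0)).squeeze(0).T       # (n, hidden)
        scores = self.ffn(h)                                  # (n, 2)
        start = scores[: head + 1, 0].argmax()   # best start at or before head
        end = head + scores[head:, 1].argmax()   # best end at or after head
        return int(start), int(end)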
We obtain the final coreference clusters by using graph traversal on the predicted antecedent relationships between the head words of the spans. The time and space complexity of the model is O(n²).

Training
The large document size of screenplays prevents us from calculating gradients over an entire document within our resource constraints. Therefore, we split the screenplay at segment boundaries (segments are defined in the parsing problem setup) into non-overlapping subdocuments and train on each subdocument separately. We do not split at scene boundaries because scenes can get very long.
Following Dobrovolskii (2021), we use the marginal log-likelihood L_MLL and binary cross entropy L_BCE to train the coarse and antecedent coreference scorers. We optimize the marginal likelihood because the antecedents are latent and only the clustering information is available, as shown in Eq 7, where G_i denotes the set of words in the gold cluster containing word i. The binary cross entropy term L_BCE improves the coreference scores of individual coreferring word pairs; we scale it by a factor α. We use cross entropy loss to train the character scorer and span boundary detection modules, denoted as L_Char and L_Span, respectively. We train the coreference, character scorer, and span boundary detection modules jointly (Eq 8).
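Equations 7 and 8 are not reproduced above; a plausible reconstruction from the surrounding definitions and Dobrovolskii (2021) is:

L_MLL = −Σ_i log Σ_{j ∈ G_i} P(y_i = j)   (7)

L = L_MLL + α · L_BCE + L_Char + L_Span   (8)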

Inference
Unlike training, we cannot run inference separately on non-overlapping subdocuments because we would miss coreference links between words occurring in different subdocuments, and each coreference cluster would be confined to a single subdocument. We devise two approaches to scale inference to long screenplays: one based on fusing coreference scores and the other on hierarchical clustering.

Fusion-Based Inference
We split the screenplay into overlapping subdocuments and run inference separately on each to obtain coreference scores s_k(i, j) for each subdocument k. If a word pair (i, j) lies within the overlap region of two adjacent subdocuments k1 and k2, we might calculate two different coreference scores s_k1(i, j) and s_k2(i, j). We average the two scores to obtain the final coreference score s(i, j) and use it to find the final coreference clusters. This method finds coreference scores for all word pairs whose separation is less than the overlap length.
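The fusion step reduces to averaging per-subdocument scores over shared word pairs; a minimal sketch, assuming each subdocument's scores arrive as a dict keyed by global word-pair indices (our representation, not the authors'):

from collections import defaultdict

def fuse_scores(subdoc_scores):
    """Average coreference scores over overlapping subdocuments.
    subdoc_scores: list of dicts mapping a global word pair (i, j)
    to the score s_k(i, j) computed within subdocument k."""
    totals, counts = defaultdict(float), defaultdict(int)
    for scores_k in subdoc_scores:
        for pair, score in scores_k.items():
            totals[pair] += score
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}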

Hierarchical Inference
We split the screenplay into non-overlapping subdocuments and run inference to find coreference clusters for each subdocument separately. For each subdocument coreference cluster, we sample a few representative words with the highest character scores s_r. We then calculate the coarse and antecedent coreference scores for every word pair (i, j) where words i and j are representative words of coreference clusters from different subdocuments. If the average coreference score s(i, j) over two clusters' representative pairs is positive, we merge the corresponding subdocument coreference clusters. We repeat this until no further merging can take place and return the resulting clusters. This method allows merging distant clusters, as the sketch below illustrates.
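A minimal sketch of this greedy merging loop, assuming clusters are dicts with word sets, representative words, and an origin-subdocument id (our representation; keeping the merged cluster's original subdoc id is a simplification):

def hierarchical_merge(clusters, pair_score, n_rep=3):
    """Greedily merge subdocument clusters whose representatives have a
    positive average coreference score. pair_score(i, j) returns s(i, j)."""
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if clusters[a]["subdoc"] == clusters[b]["subdoc"]:
                    continue  # only merge clusters from different subdocuments
                scores = [pair_score(i, j)
                          for i in clusters[a]["reps"][:n_rep]
                          for j in clusters[b]["reps"][:n_rep]]
                if sum(scores) / len(scores) > 0:  # positive average score
                    clusters[a]["words"] |= clusters[b]["words"]
                    clusters[a]["reps"] += clusters[b]["reps"]
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break
    return clusters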
Experiments

Screenplay Parsing
Implementation. We use a pretrained sentence embedding model based on MPNet (Song et al., 2020) to obtain sentence embeddings. The model uses a Siamese and triplet network structure to obtain sentence representations. We employ English Spacy models (Honnibal et al., 2020) to find the syntactic features. The parser's RNN layer is a single-layer LSTM with 256-dimensional hidden vectors. We train the parser on sequences of 10 screenplay lines. We use learning rates of 1e-3 and 1e-5 for the sentence encoder and LSTM, respectively. We train for 5 epochs using the Adam optimizer.
Evaluation. We use leave-one-movie-out cross-validation and average the performance across the 39 excerpts to obtain the final evaluation scores. We use per-class F1 as the evaluation metric.
Screenplay Coreference Resolution
Dataset. We label six full-length screenplays for character coreference using the annotation guidelines of Baruah et al. (2021). The scripts are publicly available from IMSDB. Three trained individuals annotated two unique screenplays each, plus an additional script excerpt previously labeled by experts for rater-reliability measures. The average LEA F1 (Moosavi and Strube, 2016) of the annotators against the expert labels is 85.6. We used the CorefAnnotator tool to annotate the screenplay documents (Reiter, 2018). The six movies are Avengers: Endgame (2019), Dead Poets Society (1989), John Wick (2014), The Prestige (2006), A Quiet Place (2018), and Zootopia (2016). We add these movies to the coreference dataset annotated by Baruah et al. (2021) to create the MovieCoref dataset. The average document length of the full-length screenplays is about 30K words. MovieCoref covers 25,793 character mentions for 418 characters in 201,804 words. The maximum active entity count (defined in the Introduction) is 54. The same statistic is 24 for OntoNotes and 18 for LitBank (Toshniwal et al., 2020). Table 1 shows per-movie statistics of the MovieCoref dataset.
Baseline. We use the screenplay coreference model of Baruah et al. (2021) as our baseline. It combines the neural model of Lee et al. (2018) with structural rules to adapt to the movie domain.
Implementation. We retain the architecture of the word-level coreference model of Dobrovolskii (2021) for the word encoder, coreference scorers, and span boundary detection modules. We pretrain these modules on the OntoNotes corpus. Following Dobrovolskii (2021), we use RoBERTa (Zhuang et al., 2021) to encode the subtokens. The character scorer uses a single-layer bidirectional GRU with 256-dimensional hidden vectors.
We train the coreference model using a learning rate of 2e-5 for the RoBERTa transformer and 2e-4 for the other modules. We decrease the learning rates linearly after an initial warmup of 50 training steps. We use L2 regularization with a decay rate of 1e-3. The size of the training subdocuments is 5120 words because it is the maximum we could fit on 48 GB NVIDIA A40 GPUs. We retain the top 50 antecedent candidates during the pruning stage. We set the binary cross entropy scaling factor α = 0.5. Appendix A contains additional details.
Evaluation. We use leave-one-movie-out cross-validation to evaluate the model. We obtain the final evaluation score by averaging across the six full-length screenplays. The conventional evaluation metric for coreference resolution is CoNLL F1, which is the average of the F1 scores of three metrics: MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), and CEAF_e (Luo, 2005). Moosavi and Strube (2016) pointed out interpretability and discriminativeness issues with these metrics and proposed the alternative LEA (Link-Based Entity-Aware) metric. LEA calculates a weighted sum of the resolution scores of the clusters; the resolution score of a cluster is the fraction of detected coreference links, and its weight equals its cardinality. We use LEA F1 as the primary evaluation metric.

Results

The rule-based parser is a better slugline detector for well-formatted screenplays, but its performance degrades significantly for noisy screenplays. The performance of our parser varies significantly less across document issues than that of the rule-based parser, proving its robustness (F-test, p < 1e-4).
Table 3 shows the cross-validation results of our model and the baseline (Baruah et al., 2021) on the character coreference resolution task. For the fusion-based inference, we split the screenplay into overlapping subdocuments, each of length 5120 words with an overlap of 2048 words. For the hierarchical inference, we split the screenplay into non-overlapping subdocuments, each of length 5120 words, and sample three representative words per cluster. Both inference approaches achieved significantly better F1 scores than the baseline, except on the CEAF_e metric, but did not differ significantly from each other (t-test, p < 0.05); we use Bonferroni correction (n = 3) to adjust for multiple comparisons. The hierarchical approach retrieves more coreference links (+10 LEA recall) but is less precise than the fusion-based approach. This might be because the hierarchical approach performs a second round of coreference clustering, which merges distant clusters but also introduces wrong coreference links.

Character Scorer Ablation
Model                    LEA F1
w/o character scorer     72.2
w/ character scorer      74.2

The main difference between our model and Dobrovolskii's (2021) word-level coreference model is the inclusion of the character scorer module. Excluding the character scorer module means we do not have the s_r terms in Eq 5 and the L_Char term in Eq 8. Table 4 shows that the coreference performance of the fusion-based approach improves (+2 LEA F1) with the character scorer module. Similar results hold for the hierarchical approach.

Inference Ablations

Subdoc (words)   Overlap 256   Overlap 512   Overlap 1024   Overlap 2048
2048             51.5 (7.6)    53.1 (7.6)    -              -
3072             52.4 (7.7)    59.2 (7.7)    63.8 (7.7)     -
4096             58.6 (7.8)    58.4 (7.8)    68.7 (7.8)     -
5120             60.1 (7.9)    62.4 (7.9)    70.8 (8.0)     74.2 (8.0)
8192             64.6 (8.5)    67.0 (8.5)    72.9 (8.5)     76.6 (8.5)

Cells report LEA F1 with memory usage (GB) in parentheses; overlap lengths are in words.

Table 5 shows the coreference performance and memory usage of the fusion-based inference approach across different subdocument and overlap lengths. Performance improves significantly when we split the screenplay into larger subdocuments with greater overlap (t-test, p < 0.05). Increasing the subdocument size enables the model to directly find the coreference scores of more word pairs. Increasing the overlap between adjacent subdocuments allows the model to score all word pairs whose separation is less than the overlap length. Memory consumption remains almost steady across overlap sizes at a given subdocument length.

Table 6 shows the coreference performance and memory usage of the hierarchical inference approach across different subdocument sizes and numbers of representative words. Similar to the fusion-based approach, performance improves significantly upon increasing the subdocument length (t-test, p < 0.05). Sampling more words per subdocument cluster also improves performance because they provide more information about the character from different discourse locations. However, it substantially increases memory usage. Memory consumption decreases for greater subdocument sizes at a given number of representative words. This might be because increasing the subdocument length decreases the total number of subdocuments, which reduces the number of clusters obtained from the first round of coreference scoring.

Conclusion
We introduce two movie screenplay datasets for the parsing and character coreference resolution tasks. We develop a robust screenplay parser that can handle various document formatting issues. We devise inference approaches that scale coreference models to long documents without drastically increasing memory consumption and evaluate them on full-length screenplays. Future work entails applying our screenplay coreference model to gather longitudinal insights on movie character interactions.

Limitations
The coreference annotations of the MovieCoref dataset exclude plural character mentions because the annotation guidelines did not cover them (Baruah et al., 2021). The dataset contains few singleton coreference clusters (65). Our model only identifies singular characters and cannot retrieve singleton clusters. All the movies in the dataset have a linear narrative; non-linear stories can confuse a coreference model because of time skips and flashbacks, which we do not explore in this work. Both our inference approaches require at least 10 GB of GPU memory to find coreference clusters in full-length screenplays.

Ethics Statement
Our work adheres to the ACL Ethics Policy. Using our proposed models, we can scale coreference resolution to long documents while leveraging transformer-based mention-pair scorers and without substantially increasing memory consumption.