Dramatic Conversation Disentanglement

We present a new dataset for studying conversation disentanglement in movies and TV series. While previous work has focused on conversation disentanglement in IRC chatroom dialogues, movies and TV shows provide a space for studying complex pragmatic patterns of floor and topic change in face-to-face multi-party interactions. In this work, we draw on theoretical research in sociolinguistics, sociology, and film studies to operationalize a conversational thread (including the notion of a floor change) in dramatic texts, and use that definition to annotate a dataset of 10,033 dialogue turns (comprising 2,209 threads) from 831 movies. We compare the performance of several disentanglement models on this dramatic dataset, and apply the best-performing model to disentangle 808 movies. We see that, contrary to expectation, average thread lengths do not decrease significantly over the past 40 years, and characters portrayed by actors who are women, while underrepresented, initiate more new conversational threads relative to their speaking time.


Introduction
Movie and TV dialogues, or dramatic dialogues more generally, have offered linguists a wealth of resources to study conversational behaviors (Lakoff and Tannen, 1984; He and Herman, 1998; Richardson, 2010), including within NLP (Danescu-Niculescu-Mizil and Lee, 2011; Ramakrishna et al., 2017; Sap et al., 2017; Azab et al., 2019). While dramatic dialogues do not necessarily mimic conversations in real life, they present complex pragmatic and sociolinguistic phenomena that warrant study; given the widespread viewership of movies and TV, what appears on screen, both visually and in dialogue, can have a real social impact in the world (Rosen, 1973; hooks, 1992; Heldman, 2016).
An important feature of such dialogues is that they are entangled. In his work on dramatic dialogues, McKee (2016, p. 3) articulates a speech act view: in screenplays, "all talk responds to a need, engages a purpose, and performs an action." In any given scene in a movie or TV show, then, we can often see multiple needs expressed by different characters in one sequence of conversation. Consider this scene from Young Sheldon (Fig. 1): Missy relays a message, Sheldon wants to know where his bow tie is, and Georgie seeks to avoid showing up with Sheldon at school; each of them starts a new conversational thread (or subconversation) with a speech act that reflects those different intents. In the screenplay, there is no explicit structure that indicates where each subconversation starts and ends. If, however, we could disentangle dramatic conversations, we could ask such questions as: What kind of characters get to start a thread? How long do conversations tend to last? Answers to those questions can enhance our understanding of cultural representations on screen.
Much of the work on conversation disentanglement in NLP has studied Internet Relay Chat (IRC) logs, most notably #Linux (Elsner and Charniak, 2008) and #Ubuntu (Kummerfeld et al., 2019). IRC logs are in a different domain from dramatic conversations, and some salient features in IRC, such as invoking a username to indicate replies to that user, are not found in screenplays. Conversely, there is no equivalent for "off-screen" speakers in IRC. Given the face-to-face nature of conversations in drama, movie and TV characters can start new conversational threads by entering the scene. Those major differences mean chat logs may be insufficient to train models that disentangle drama.
To bridge the gap, we present in this work a new annotated dataset to support the study of conversation disentanglement in the domain of movies and TV shows. We draw heavily on the theoretical resources found in film studies, sociology, and linguistics as we design our annotation framework, with particular attention paid to the semantic and pragmatic signals of the start of a new thread.
In this work we make the following contributions:
• We draw on theoretical research in sociolinguistics, sociology, and film studies to operationalize a conversational thread (including the notion of a floor change) in dramatic texts, and use that definition to annotate a dataset of 10,033 dialogue turns (comprising 2,209 threads) from 831 movies. All annotations are freely available for public use under a CC BY-NC-SA license on GitHub.1
• We compare the performance of several disentanglement models on this dramatic dataset to see whether model architectures designed for, or models trained on, Kummerfeld et al. (2019) perform well in the domain of drama.
• We apply the best-performing model to analyze and disentangle 808 films in SCRIPTBASE-J (Gorinski and Lapata, 2015, 2018), investigating both the relationship between historical thread length and intensified continuity style (Bordwell, 2002) and the relationship between gender and power in floor claiming. In this data, we see that, unlike shot lengths, average thread lengths do not decrease significantly over the past 40 years (contrary to expectation), and characters portrayed by actors who are women, while underrepresented, initiate more new conversational threads relative to their speaking time.

Related work
Conversation disentanglement. Conversation disentanglement seeks to identify threads (or clusters, subconversations) in a sequence of utterances. Conceptually, this task requires a robust operationalization of thread, which is usually understood as related to topic or floor change (O'Neill and Martin, 2003; Shen et al., 2006; Jiang et al., 2018). Elsner and Charniak (2008, 2010) considered this problem in the context of chat history, which has been extensively studied since (for a recent survey, see Gu et al., 2022).
In terms of modeling, there are two popular approaches for this task (Zhu et al., 2021): two-step (models link individual utterances first, and thread membership is then recovered) or end-to-end (models predict thread membership directly). Our work adopts the two-step method: we first calculate a similarity score to identify the reply-to relation between two utterances (the link prediction task), and then apply a greedy clustering algorithm to group utterances that reply to one another into threads (the clustering task). For the two-step method, there have been attempts to adopt a multi-task learning setup, with which we also experimented: at training time, when gold cluster information is available, an auxiliary task calculates a second loss function dedicated to thread prediction, which can be used to improve the performance of link prediction, the main task.
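The two-step pipeline can be sketched as follows; `score` is a hypothetical stand-in for a trained link-prediction model, and the window size and greedy parent selection mirror the link-then-cluster procedure, not the paper's exact implementation.

```python
def disentangle(utterances, score, window=6):
    """Two-step disentanglement sketch: greedy link prediction, then
    recovery of threads as connected components of reply-to links.

    `score(i, j)` is assumed to be a trained model's matching score
    for "utterance i replies to utterance j"; j == i means
    self-linking, i.e., utterance i starts a new thread.
    """
    parent = {}
    for i in range(len(utterances)):
        # Candidate pool: the preceding utterances in the window, plus self.
        candidates = list(range(max(0, i - window + 1), i + 1))
        parent[i] = max(candidates, key=lambda j: score(i, j))

    # Recover threads: follow parent links up to each thread starter.
    def root(i):
        while parent[i] != i:
            i = parent[i]
        return i

    threads = {}
    for i in range(len(utterances)):
        threads.setdefault(root(i), []).append(i)
    return list(threads.values())
```

With a toy scoring function that links utterance 1 to 0 and utterance 3 to 2, the recovered threads are `[[0, 1], [2, 3]]`.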
Datasets for conversation disentanglement are currently limited. In Mahajan and Shaikh's (2021) comprehensive survey on multi-party dialogue understanding, most datasets are not curated with this purpose in mind. Kummerfeld et al.'s (2019) corpus, built on annotations of IRC chat logs, has been the standard benchmark dataset for this task. Liu et al. (2020, p. 3871) released a dataset of movie dialogues, where they "collect 869 movie scripts that explicitly indicate the plot changing" and "extract 56,562 sessions from the scripts and manually intermingle these sessions to construct a synthetic dataset." In this work, we present a new annotated dataset built on movies and TV shows, adding a spoken, scripted corpus to facilitate this line of work. Instead of inferring from the narrative structure and threading conversations through a synthetic process, we developed an annotation framework (described in §4).
Theoretical approaches to conversation. Conversation has been extensively theorized and studied in sociology, linguistics, and film studies, which this work draws on. For our annotation scheme, Goffman's (1963) idea that conversation is a form of focused interaction and McKee's (2016) speech act view on dramatic dialogues provide us with the theoretical foundation.
The ideas related to focus and topic are further explored in the following: Ervin-Tripp (1964), who considers the surface and semantic features of sustained attention in conversational organization; Roberts (1996), who expounds on "topic under discussion" in pragmatics; and Ng and Bradac (1993), who see topic change and floor claiming from the perspective of power dynamics between the speakers involved. More broadly, Sacks et al. (1974) and Goodwin (1981) detail the organizational and pragmatic principles for conversation, and He and Herman (1998) consider them specifically in the context of drama.
This work is further motivated by the social implications of conversations taking place on screen, and we find the following particularly relevant: Richardson (2010) carried out a sociolinguistic study of TV dialogues; Silverman (1988), Boon (2008), and O'Meara (2019) highlight related issues as they pertain to gender and race.
Modeling multi-party conversation structure. The interaction structure between speakers in a multi-party conversation has been shown to be useful for conversation disentanglement (Mayfield et al., 2012). More recently, various neural architectures have been proposed to encode utterances and the hierarchical structure of conversations (Jiang et al., 2018; Henderson et al., 2020; Wu et al., 2020; Yu and Joty, 2020). Pre-training with self-supervised tasks (Zhu et al., 2020; Wang et al., 2020; Gu et al., 2021) is also used to derive contextual embeddings while factoring in the conversational structure (replacing, for example, next sentence prediction with next utterance prediction). With or without self-supervision, additional embedding layers or attention mechanisms have been proposed to encode the information of conversation structure. Gu et al. (2020) and Liu et al. (2021) incorporated speaker embeddings, and Sang et al. (2022) and Ma et al. (2022) emphasize the interaction between speakers through attention. For our experiments, as a baseline, we start with a simple and intuitive approach, where we train embedding layers based on structural features like the distance between utterances, and concatenate the relevant embedding vectors with standard contextual embeddings from pre-trained language models that we can fine-tune for this task.

[Figure 2: An excerpt from the screenplay of Citizen Kane. "EXT. ONE OF THE EXITS - MADISON SQUARE GARDEN - NIGHT" is the scene header. Dialogue portions ("Is Pop Governor yet, Mom?") are usually indented, while action statements ("Emily and Junior are standing, waiting for Kane."; "Just then, Kane appears, with Reilly and several other men. Kane rushes toward Emily and Junior, as the men politely greet Emily.") are not indented. Speaker labels (e.g., KANE) are found in cue lines in all caps, followed by their dialogue lines.]

In this work, conversation structure is situated in the specific domain of films and TV shows, spoken by characters in a scene. The structural richness of screenplays makes them an interesting textual representation of movies in NLP (Bhat et al., 2021; Chen et al., 2022), and here, we use screenplays to create an annotated dataset to facilitate future work on conversation disentanglement.

Data
To study conversation disentanglement in drama, we consider 831 titles: 340 movies and 491 TV series,2 randomly sampling one scene from each title for annotation. Movies are taken from SCRIPTBASE-J (Gorinski and Lapata, 2018), based on IMSDB,3 because of its extensive coverage of genres and temporal span, along with rich metadata that can adequately support NLP research related to movies. Since TV dialogues have different linguistic styles from movie dialogues (Nelmes, 2010), we curated a new dataset, TVPILOTS, using teleplays of pilot episodes made available by TV Writing.4 All screenplays and teleplays from these sources come in the standard format (Fig. 2): they have distinct scene headers, speaker labels, and other typographical features for annotators to distinguish between an action statement and a dialogue line. Since those features are consistent, we have reliable scene markers and speaker labels, and annotators can refer back to the original screenplays when necessary. Action statements also helped annotators better understand the scene.
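The typographical conventions of the standard screenplay format can be approximated with a simple heuristic classifier; the indentation thresholds below are illustrative assumptions, not the values used by any actual screenplay parser.

```python
import re

def classify_line(line):
    """Heuristically classify one screenplay line by its typography.
    A simplified sketch of the conventions described above; real
    screenplay parsers (e.g., Bhat et al., 2021) are more involved.
    """
    stripped = line.strip()
    if not stripped:
        return "blank"
    # Scene headers start with INT./EXT. sluglines.
    if re.match(r"^(INT|EXT)\.", stripped):
        return "scene_header"
    indent = len(line) - len(line.lstrip(" "))
    # Cue lines: heavily indented, all caps (speaker labels).
    if indent >= 20 and stripped.isupper():
        return "cue"
    # Dialogue: indented, but not a cue line.
    if indent >= 10:
        return "dialogue"
    # Everything else at the left margin is an action statement.
    return "action"
```

Run on the Figure 2 excerpt, the scene header, action statement, cue line, and dialogue line each fall into the expected class.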

Task definition
We consider conversation disentanglement applied to the domain of scripted conversations in TV series and movies. At the highest level, given a segmented scene in a screenplay, we want to identify threads in a conversation between multiple characters. Intuitively, in a scene, a character can change the subject or redirect other characters' attention to themselves, which, in our formulation, means they start a new conversational thread. Threading drama can help us understand conversational patterns, such as who gets to start a new conversation, who dominates the conversation, and how long an average conversation lasts.
The interaction structure involves an utterance of interest (UOI) and its parent utterance. An utterance is a dialogue line spoken by a character. In this work, an utterance of interest holds a directed edge to its parent utterance: each utterance of interest has only one parent, but one parent can have multiple children. A thread is the transitive closure of all such pairwise links.

Threading dramatic conversations
In defining conversational threads in drama, we first adapt Goffman's (1963, p. 24) definition of conversation:5 a thread is a kind of focused interaction, one where "persons gather close together and openly cooperate to sustain a single focus of attention, typically by taking turns at talking." Second, we assume conversations in a given scene in drama are entangled, and there is often more than one thread in any given conversation. Taken together, we define a thread to be a cluster of semantically and pragmatically coherent utterances that are part of a conversation. Those utterances share a single, sustainable focus of attention (cf. Ervin-Tripp, 1964; Sacks et al., 1974), either on a character (who has other characters' attention) or a topic (often related to the wants and needs of a character), as well as other observable contextual relations (O'Neill and Martin, 2003).6 In a conversation, attention can be paid to a character (who has the floor) or a topic (why they are having this conversation):

Floor. Elsner and Charniak (2010, p. 392) describe the start of a new conversational thread as the process of participants (or, in our case, characters) "hav[ing] refocused their attention . . . away from whoever held the floor in the parent conversation." As in Goffman (1963), attention is key: the character holding the floor can safely assume that they have the attention of others. Such attention is singular and must be sustained throughout the thread; otherwise, someone else has the floor. Sheldon's line off screen in Fig. 1 further demonstrates how characters can start a thread by gaining the floor from other characters present in the scene, so they can express their need in their own voice. His line, coming out of nowhere, also does not respond to the speech acts of others, making it a typical example of thread starters in drama.
Topic. Since we follow McKee (2016) and see dialogues as speech acts, we can often relate the topic of a thread to the desire or intent of the character who started the conversation: characters can express their own needs, or respond to someone else's needs, and the need acts as the driving passion of the dramatic conversation. Operationally, a topic change within a scene usually occurs when the conversation is no longer about the original speech act that starts the scene. In Fig. 1, Georgie's line, "Can I drive in with you?" is an example of topic change and a new expressed intent from a different character in the scene.
For further details and examples, see Appendix B.

Annotation process
Prior to annotation, all the plays in TVPILOTS are OCR'd, and all plays were pre-processed following Bhat et al. (2021): we extracted the content of structural components (scene headers, character cue lines, dialogue and action lines) in each screenplay and stored this information as tabular data. We further segmented each dialogue line into sentences with spaCy 3.3.0 (Honnibal and Johnson, 2015), which allows us to annotate at a greater granularity, since sentences in the same dialogue turn can reply to different previous sentences. We also assigned each scene, action line, dialogue turn, and each sentence in a turn an ID. All annotations were carried out by the authors of the paper over the span of six months. We spent two months on pilot runs as we revised annotation guidelines and discussed edge cases, and the rest on independent annotation. On average, it took an hour to work through 500 dialogue and action lines. Our agreement rate is reported in Table 1; we considered standard metrics (described in Appendix A) based on 3,271 jointly annotated lines. Our agreement rate is comparable to previous work (Kummerfeld et al., 2019).
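A minimal sketch of this preprocessing step; a naive regex splitter stands in for spaCy's sentence segmentation, and the ID scheme is illustrative rather than the paper's exact format.

```python
import re

def segment_and_id(scenes):
    """Assign hierarchical IDs to scenes, turns, and sentences.
    `scenes` is a list of scenes; each scene is a list of
    (speaker, turn_text) pairs. Each dialogue turn is split into
    sentences so that each sentence can reply to a different parent.
    """
    rows = []
    for s_id, scene in enumerate(scenes):
        for t_id, (speaker, turn) in enumerate(scene):
            # Naive splitter: break after ., !, or ? followed by whitespace.
            sentences = [x for x in re.split(r"(?<=[.!?])\s+", turn) if x]
            for u_id, sent in enumerate(sentences):
                rows.append({"scene": s_id, "turn": t_id, "sent": u_id,
                             "speaker": speaker, "text": sent})
    return rows
```

A two-sentence turn thus yields two rows that share the same scene and turn IDs but carry distinct sentence IDs.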

Experiments
We compare seven models to see how well existing model architectures that have been proposed for conversation disentanglement perform in the domain of dramatic texts. We would also like to know whether models trained on Kummerfeld et al.'s (2019) data, but evaluated on our dramatic data, can leverage that dataset's size (seven times larger than ours) to compensate for the striking difference in domain.

Notation
We define a dataset D = {(c_j, u_j, u_i)}, where:
• i is the index of a UOI and j that of a candidate parent utterance; an utterance u is a sentence of a dialogue line
• S_i,j denotes the scene both u_i and u_j are in
• c_j is the context of u_j, defined as all the dialogue and action lines preceding u_j in S_i,j
• u_j^+ is a true parent, and u_j^- is a negative example (u_j ∈ S_i,j)
• u_i = {t_i, k_i, w_i^1, w_i^2, . . ., w_i^m} is an utterance of m tokens spoken by character k_i in turn t_i; turn information is often given in the play parenthetically as (CONT'D)

Models
We consider the following models:

Previous. Adapting Kummerfeld et al. (2019), we connect all UOIs to their immediate previous utterances; i.e., for u_i, the parent utterance is u_{i-1}.
Featurized. Zhu et al. (2021) showed that manually selected features can offer a robust baseline; we take inspiration from Kummerfeld et al. (2019) and selected 8 features to train a featurized model.
For each utterance:
1. The number of other speakers this character speaks after
2. The number of utterances ago this character last spoke
3. Whether the next utterance is spoken by the same character
Pairwise, between u_i and candidate parent utterance u_j:
4. The number of WordPiece tokens u_i and u_j have in common
5. The distance between the two utterances, |i − j|
6. Whether there are utterances from either speaker between u_i and u_j
7. Whether u_i and u_j are in the same turn
8. Whether u_i and u_j are from the same speaker

BERT baseline. We adapt the Siamese encoders used in previous work on conversation analysis (Jiang et al., 2018; Henderson et al., 2020; Wu et al., 2020). We extract the [CLS] token, denoted e_[CLS].
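The hand-built pairwise features (4, 5, 6, and 8) can be sketched as follows; token overlap here uses whitespace tokens as a stand-in for WordPiece, and the utterance representation as a (speaker, text) pair is an assumption for illustration.

```python
def pairwise_features(utts, i, j):
    """Compute a few of the hand-built pairwise features.
    `utts` is a list of (speaker, text) pairs; i is the UOI index
    and j the candidate parent index (j < i).
    """
    spk_i, text_i = utts[i]
    spk_j, text_j = utts[j]
    # Feature 4 (approximated with whitespace tokens, not WordPiece).
    overlap = len(set(text_i.lower().split()) & set(text_j.lower().split()))
    # Feature 6: any utterance by either speaker strictly between j and i.
    between = any(utts[k][0] in (spk_i, spk_j) for k in range(j + 1, i))
    return {
        "token_overlap": overlap,        # feature 4
        "distance": i - j,               # feature 5
        "speaker_between": between,      # feature 6
        "same_speaker": spk_i == spk_j,  # feature 8
    }
```

For example, two Sheldon lines separated by one Missy line share two tokens, are at distance 2, have no intervening line by either speaker of the pair (Missy is neither), and share a speaker.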
Feature-based embeddings. To enhance the expressivity of our models, we introduced additional embedding layers, randomly initialized, to encode information pertinent to the conversation structure in the scene. Each of the following features is assigned an embedding vector: utterance distance f^d_i (feature 5 from the featurized model), turn f^t_i (whether this line is in the same turn as the last, feature 7), and scene speaker f^k_i (whether the two speakers are the same, feature 8). All f ∈ R^{250×2}. We can then represent each utterance pair u_{i,j} as the concatenation of all embeddings:

u_{i,j} = e_[CLS] ⊕ f^d_i ⊕ f^t_i ⊕ f^k_i (3)

Finally, we pass u_{i,j} through a non-linearity before the sigmoid output layer to compute the matching score m_{i,j}. For training, we used a self-link token [SELF] as a parent candidate for every u_i, which is assigned as the true parent if u_i is the start of a thread. For each (u_j^+, u_i) pair in our annotation, we sample five u_j^-. The objective is to minimize the binary cross-entropy loss, L_link.
At inference time, for each u_i we have a candidate pool P_i of the preceding utterances plus the self-link token. Since in 90% of our annotation the true parent is within 5 utterances, we picked candidate pool size C = 6. We calculate the matching score between u_i and all u_c ∈ P_i and select argmax_{P_i} m_{i,c} as the parent.
In addition to the BERT baseline, we adapt three recent architectures for dramatic conversation disentanglement. They are designed with Kummerfeld et al.'s (2019) IRC chat log in mind, and while many textual features do not have equivalents in our dramatic domain, we incorporate some designs as we saw fit, described below.

BERT with soft attention alignment. We adapt the soft alignment mechanism in the pointer module from Yu and Joty (2020) to emphasize the textual similarity between u_i and u_j, where H_i = (h_{i,0}, . . ., h_{i,p}) and H_j = (h_{j,0}, . . ., h_{j,q}) are the bidirectional LSTM representations for u_i and u_j. H_i is used as query vectors to compute attention over the key/value vectors in H_j, yielding the set of attended vectors H′_i, one for each h_i ∈ H_i. In Eq. 8-9 we enhance the interactions by applying difference and element-wise product between the original and attended vectors. Finally, we swap out the BERT-based contextual embeddings for u_i and u_j with h^f_i and h^f_j, the resulting representations of the two utterances. The matching score is calculated using Eq. 4-5.
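The alignment mechanism can be sketched with plain Python lists; the paper operates on biLSTM hidden states, while here H_i and H_j are arbitrary vectors, so this is a sketch of the attention-plus-enhancement computation, not the paper's implementation.

```python
import math

def soft_align(H_i, H_j):
    """Soft attention alignment between two utterances. For each
    query vector h in H_i, compute softmax attention over H_j, form
    the attended vector h', and return the enhanced representation
    [h; h'; h - h'; h * h'] (difference and element-wise product).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    enhanced = []
    for h in H_i:
        # Softmax attention of query h over key/value vectors in H_j.
        scores = [math.exp(dot(h, k)) for k in H_j]
        z = sum(scores)
        weights = [s / z for s in scores]
        attended = [sum(w * k[d] for w, k in zip(weights, H_j))
                    for d in range(len(h))]
        enhanced.append(h + attended
                        + [a - b for a, b in zip(h, attended)]
                        + [a * b for a, b in zip(h, attended)])
    return enhanced
```

Each enhanced vector is four times the input dimensionality, since it concatenates the original vector, its attended counterpart, their difference, and their product.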
6-way classifier. The structural characterization of conversation proposed by Ma et al. (2022) is the current state of the art. We use their architecture without reference dependency modeling, since we don't have mentions in our movie data in the same format as IRC, but retain the rest. Their goal is to train a C-way classifier: for each u_i, pick one parent from C candidates, including u_i itself. The UOI and candidate pairs are concatenated and fed into a pre-trained model as an input sequence of length L. To obtain aggregated contextualized representations, we extract the [CLS] token. For the candidate window size, we chose C = 6 (including self-pointing as one candidate).
Their architecture features two components: speaker property modeling and the Syn-LSTM module. Speaker property modeling leverages the masked Multi-Head Self-Attention (MHSA; Liu et al., 2021) mechanism to account for utterances from the same speaker with a speaker-aware mask matrix M, which we include in our adaptation. Syn-LSTM (Xu et al., 2021) is a biLSTM with an additional input gate to retain the information of utterances within the candidate window, designed to make the model context-aware. The representation p_{ij} for each [u_i, u_j] pair, including the self-pointing [u_i, u_i] pair, is then fed into the classification head to predict the parent. The training objective is to minimize the cross-entropy loss.
Multi-task learning. We follow Yu and Joty (2020), Zhu et al. (2021), and Huang et al. (2022) and introduce an auxiliary task, a binary classifier to predict whether u_i and u_j belong to the same thread, with probability p_{i,j} computed from u_{i,j}, the representation of the given utterance pair. The objective is to minimize the binary cross-entropy loss L_thread, where y_{i,j} = 1 if u_i and u_j are in the same thread, and 0 otherwise. The total training loss L of this model class is L = L_link + α L_thread, where α is a hyper-parameter to control the impact of the auxiliary task. We experimented with α = {0.1, 0.5, 1.0} and 0.1 performed best.

To train on Kummerfeld et al.'s (2019) IRC data, we used usernames as speaker labels and treated system messages as action statements. Since individual users can send multiple messages in what would be one dialogue turn in movies, we did not perform sentence segmentation on messages.
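The combined objective can be sketched as follows; treating both losses as averaged binary cross-entropy is an assumption, since the reduction is not specified above.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p against gold label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def total_loss(link_pairs, thread_pairs, alpha=0.1):
    """Combined training loss L = L_link + alpha * L_thread.
    Each pair is (predicted probability, gold label); alpha = 0.1
    is the value that performed best in the experiments above.
    """
    l_link = sum(bce(p, y) for p, y in link_pairs) / len(link_pairs)
    l_thread = sum(bce(p, y) for p, y in thread_pairs) / len(thread_pairs)
    return l_link + alpha * l_thread
```

With a single link prediction of 0.5 against label 1 and a single thread prediction of 0.5 against label 0, the total loss is 1.1 × ln 2.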

Setup
We trained all our models for 10 epochs and used the dev set for early stopping, with learning rates of 5 × 10^-6 (BERT-based) and 10^-3 (linear). Our train-test split is reported in Table 2. For our BERT implementation, we used bert-base-cased from HuggingFace 4.19.2 with PyTorch 1.10.0.7 An epoch took 1 hour 8 minutes on average on two NVIDIA GeForce RTX 2080 Ti GPUs.

Results
Experimental results are presented in Table 3. We report the standard set of metrics (described in Appendix A), along with their 95% bootstrap confidence intervals. We first note that the clustering metrics are low while link prediction accuracy is high: the most reasonable parent utterance for most UOIs (90%) is the immediately previous utterance, which leads to high baseline accuracy for link prediction but low clustering baselines when considering entire threads.
The performance of models trained on Kummerfeld et al.'s (2019) data suggests that domain difference matters. While seven times larger, their dataset is in an entirely different domain, and intuitively, chatroom users interact differently from movie characters. Such differences might account for the inferior performance, especially on stricter cluster metrics like One-to-One and Exact Match.
Enhancements to the baseline lead to minor, statistically insignificant improvements, and the 6-way classifier outperforms the other model classes on most metrics. Therefore, for the analysis below, we use the 6-way classifier.

Analysis
To illustrate the usefulness of conversation disentanglement in drama, we disentangled 808 movies from SCRIPTBASE-J and carried out two analyses enabled by this work to explore two questions that engage previous work in film studies:8

Are conversational threads in movies getting shorter over the years? In his analysis of visual style in contemporary films, film historian Bordwell (2002, p. 16) observed, "For many of us, today's popular American cinema is always fast": the average shot length is decreasing, and cuts and camera movements have become more rapid over the course of the twentieth century, leading to an impression of "intensified continuity." A similar observation is made in Cutting et al.'s (2010) empirical work, which relates this trend to our natural fluctuation of attention. Notably, such work emphasizes the visual aspect of films, which reinforces the established hierarchy in film studies: the film is a visual medium, and image is more important than sound. This hierarchy is critiqued in studies on film dialogues in particular (Kozloff, 2000), since characters converse with one another only after the advent of sound films. It then leads us to ask: Are conversational threads in films also getting shorter over the years?
Since we have disentangled movie conversations into threads, we can calculate the average number of utterances in a thread in a given movie and in a given year. Our movie data, while spanning from the 1930s to the 2010s, is not evenly distributed. As a result, for this analysis, we aggregated movies in 5-year ranges. While we would expect thread lengths to also decrease, Fig. 3 tells a different story.

8 The list of movies we used for analysis can be found in our GitHub repo: https://github.com/kentchang/dramatic-conversation-disentanglement/blob/bf3d2fbc00f9d64356c308a2c0ca6b2e73580c19/list/titles-for-analysis.txt
We can see the average thread length decreasing (although not statistically significantly so) until 1970, and the trend has been flat since. Film dialogues seem to be resisting the broader trend associated with visual styles.
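The 5-year aggregation behind this analysis can be sketched as follows; the input format, a list of (year, thread_lengths) pairs, is illustrative.

```python
def mean_thread_length_by_bin(movies, bin_size=5):
    """Aggregate average thread length into fixed-size year bins.
    `movies` is a list of (year, thread_lengths) pairs, one per film;
    all thread lengths falling into a bin are pooled before averaging.
    """
    bins = {}
    for year, lengths in movies:
        start = (year // bin_size) * bin_size
        bins.setdefault(start, []).extend(lengths)
    return {start: sum(ls) / len(ls) for start, ls in sorted(bins.items())}
```

For example, films from 1981 and 1984 pool into the 1980 bin, while a 1986 film falls into the 1985 bin.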
What is the pattern of floor claiming between men and women in movies? It has been pointed out that the film industry became increasingly dominated by men over the twentieth century (Boon, 2008). The Bechdel Test (Bechdel, 1986) is a popular and well-known measure of the representation of women in films, often used to advocate that women on screen should "speak up" (O'Meara, 2016) to encourage more diverse representation. Through this work, we would like to add an additional dimension to it: How often do characters who are women start a conversation in films?
In the tradition of continental philosophy, to initiate a conversation, or to become a speaking subject, is a socially and ethically significant act (Foucault, 1972; Lacan, 2006). This has influenced much of feminist film studies, which considers the presence and absence of women's voices in films (Silverman, 1988; Lawrence, 1991; Sjogren, 2006). This work inspires us to frame and measure the agency of characters who are women in relation to the frequency with which they get to start a new conversational thread and claim the floor. We start by using TMDb's API to look up the gender of the actor portraying the character.9 For this analysis, we only consider movies released after 1979, after which point we have at least five movies each year. In our data, 30.4% of threads are started by women. This is aligned with the oft-stated observation that men talk more than women in films (Ramakrishna et al., 2017; Lauzen, 2019), and the trend has not significantly changed for decades.
However, as we see in Fig. 4, when we subtract the percentage of speaking time by women (the number of lines they have) from the percentage of threads started by women (e.g., in 2011, 32.7% of threads were started by women and 30.8% of lines were spoken by women, an absolute difference of +1.9), we see that women generally start more threads relative to their speaking time, and this is also relatively constant over time. In the figure, any year in which the 95% confidence interval does not overlap with 0 is significant at that level; while this is not significant for many individual years (given the limited number of movies per year), it is significant over all years (+1.0, [0.07, 0.14]). This finding is surprising because it suggests that, despite their under-representation, characters who are women are written to initiate conversations more, relative to their speaking time, than their male counterparts.
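The percentage-point gap (threads started minus lines spoken) can be computed as follows; the input dictionaries and their gender labels are illustrative.

```python
def floor_claiming_gap(threads_by_starter, lines_by_speaker):
    """Percentage-point gap between threads started by women and
    lines spoken by women, as in the 2011 example above (32.7% of
    threads started minus 30.8% of lines spoken gives +1.9).
    Inputs map a gender label to a count.
    """
    pct_threads = 100 * threads_by_starter["woman"] / sum(threads_by_starter.values())
    pct_lines = 100 * lines_by_speaker["woman"] / sum(lines_by_speaker.values())
    return pct_threads - pct_lines
```

A positive gap means women start more threads than their share of speaking time would predict.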

Conclusion
We present in this work a new dataset for studying conversation disentanglement in movies and TV shows in order to enrich the landscape of this line of work in NLP. Movie and TV dialogues offer pragmatic patterns and interaction structures different from those of chat logs, on which standard benchmarks for this task are built. To ensure the high quality of this dataset, we digitized teleplays written for TV pilots, so we have more screenplays in the standard format, which we find most useful for annotation. In addition, we draw on theoretical resources from sociolinguistics, sociology, and film studies to create a robust annotation scheme that considers topic and floor changes specifically in the context of drama, which we believe speaks to the needs of the wider scholarly community.
To the best of our knowledge, no previous technical or theoretical work has offered a working operationalization of the conversational thread in the context of dramatic (scripted, spoken, face-to-face) conversations, or examined the significance of initiating a conversation or gaining the floor in this domain. While we do not claim that the results from our analysis are definitive, our work has demonstrated a new method to further investigate the sound-image hierarchy, gendered power dynamics in films, and communication behaviors in cultural representations on screen. We hope this will encourage and facilitate future research on drama and conversation in NLP, film studies, and the computational humanities.

Limitations
This work is first limited by the availability of screenplays in the standard format. While movie or TV show transcripts are more readily available (subject to permission to use), they are less ideal for annotation due to unreliable scene headers and speaker labels. This limits the size of our corpus, as digitization and correction are labor-intensive. In our analysis, we relied on metadata from SCRIPTBASE-J (each movie has a Jinni10 profile that includes its corresponding IMDb11 ID) and TMDb.12 Given its scale, we weren't able to check individually whether the release year or the gender of actors in those community-built resources is correct or up-to-date. This work is also unimodal, while movies and TV shows are multimodal: we did not have access to the video for annotation, and we could not compare thread length and shot length, among other things.

Ethics Statement
We are aware that this dataset and the analytical work that follows represent only a limited set of cultural and ethnic groups as well as language uses. The dataset we're annotating highlights US movies (and not, e.g., Bollywood, Nollywood, or the global film industry more generally), so one risk is centering that culture (and its conversational norms) at the expense of others. There have been documented allocational and distributional biases in the film industry (Baker and Faulkner, 1991; Ravid, 1999; O'Brien, 2014; Khadilkar et al., 2022), and we encourage those interested in furthering this line of work to acquaint themselves with the relevant discourses. We are also aware that the dataset contains potentially problematic content, such as vulgar, violent, or offensive language in screenplays, or other biases held by individual screenwriters.

A Evaluation metrics
• Adjusted Rand Index (ARI) (Halkidi et al., 2002) measures clustering agreement corrected for chance. With $n_{ij}$ the number of utterances shared by true thread $i$ and detected thread $j$, $a_i = \sum_j n_{ij}$, $b_j = \sum_i n_{ij}$, and $n$ the total number of utterances:

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \big[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\big] / \binom{n}{2}}{\frac{1}{2}\big[\sum_i\binom{a_i}{2} + \sum_j\binom{b_j}{2}\big] - \big[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\big] / \binom{n}{2}}$$

• Variation of Information (VI) is the information gain or loss when going from one clustering to another (Meilă, 2007). It is the sum of conditional entropies $H(Y|X) + H(X|Y)$, where $X$ and $Y$ are clusterings of the same set of items. We report $1 - \mathrm{VI}$, so the larger the value the better.
• Shen F1 (Shen et al., 2006): Given a detected thread $j$ and true thread $i$, let $n_{i,j}$ be the number of messages of true thread $i$ in detected thread $j$, $n_j$ the number of messages in detected thread $j$, and $n_i$ the number in true thread $i$. Then $\mathrm{Precision}(i,j) = n_{i,j}/n_j$, $\mathrm{Recall}(i,j) = n_{i,j}/n_i$, and $F(i,j)$ is their harmonic mean; the overall score is $F = \sum_i \frac{n_i}{n} \max_j F(i,j)$.
• One-to-One Overlap (Elsner and Charniak, 2008) calculates the percentage of overlap between two sets of conversational threads paired up with the max-flow algorithm.
• Exact Match F1 (Kummerfeld et al., 2019) calculates the number of perfectly matched conversational threads between two sets. During our annotation process, we did see threads with only one dialogue line, which function quite differently from system messages in the context of IRC, so we did not exclude conversations with only one dialogue line.
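For concreteness, the metrics above can be sketched in a few lines of standard-library Python. This is an illustrative re-implementation, not the paper's evaluation code: thread assignments are represented as flat per-utterance label lists, and the brute-force thread matching is only viable for the handful of threads found in a single scene.

```python
from collections import Counter
from itertools import permutations
from math import comb

def adjusted_rand_index(gold, pred):
    """ARI via the pair-counting contingency table."""
    counts = Counter(zip(gold, pred))
    sum_ij = sum(comb(c, 2) for c in counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(gold).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(len(gold), 2)
    return (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)

def one_to_one_overlap(gold, pred):
    """Best one-to-one pairing of threads; fraction of utterances covered."""
    gold_ids, pred_ids = sorted(set(gold)), sorted(set(pred))
    counts = Counter(zip(gold, pred))
    # Pad with None so a gold thread may be left unmatched (None matches nothing).
    pool = pred_ids + [None] * len(gold_ids)
    best = max(
        sum(counts[(g, p)] for g, p in zip(gold_ids, perm))
        for perm in permutations(pool, len(gold_ids))
    )
    return best / len(gold)

def exact_match_f1(gold, pred):
    """F1 over threads that match perfectly as sets of utterance indices."""
    as_sets = lambda labels: {
        frozenset(i for i, t in enumerate(labels) if t == x) for x in set(labels)
    }
    gold_sets, pred_sets = as_sets(gold), as_sets(pred)
    matched = len(gold_sets & pred_sets)
    prec, rec = matched / len(pred_sets), matched / len(gold_sets)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [0, 0, 1, 1, 2]  # thread label for each of 5 utterances
pred = [0, 0, 1, 2, 2]
print(adjusted_rand_index(gold, pred))  # → 0.375
print(one_to_one_overlap(gold, pred))   # → 0.8
```

Note that Shen F1 and VI follow the same contingency-table pattern and are omitted here for brevity.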

B Annotation guidelines B.1 Building intuitions
This is a study of conversational behaviors of characters in drama: here, we consider TV shows and movies and the specific task of conversation disentanglement. At the highest level, we want to identify threads in a conversation between multiple characters in a scene of a TV show or movie. In the same scene, some characters can change the subject of a conversation, or redirect other characters' attention to themselves, while others might never do so. Characters in a closer relationship might converse more frequently with each other. We annotate to build a dataset that can help us investigate those inquiries. To (hopefully) ease understanding, all examples below are drawn from Gilmore Girls, which follows the story of Lorelai Gilmore and her daughter, Rory, in a small town in Connecticut. In the excerpts below, we see Luke, Lorelai's will-they-or-won't-they love interest throughout the series, and Emily, Lorelai's mother. Gilmore Girls is famous for its fast-paced dialogue and offers illustrations of the conversational behaviors that we wish to study.

B.1.1 Dialogue, conversation, and reply-to
Dialogue as speech act. Our definition of dialogue is an all-encompassing one taken from McKee (2016, p. 2): "Any words said by any character to anyone." A line of dialogue is, then, a sequence of words uttered (or, an utterance) by a character to themselves, another character, or a few other characters. This view, unlike a narrower one where dialogue is a conversation held between characters, sees dialogue as a verbal tactic initiated by a character to achieve a certain goal: "All talk responds to a need, engages a purpose, and performs an action" (McKee, 2016, p. 3). In other words, a dialogue is a speech act. Characters use dialogue to inform us of an event (exposition), tell us something about themselves (characterization), or try to make something happen (action).
Conversation and conversational thread. We adopt a broad definition of a conversation: a conversation is a talk between characters. In defining thread, we first adapt Goffman's (1963) definition of conversation: a thread is a kind of focused interaction, one where "persons gather close together and openly cooperate to sustain a single focus of attention, typically by taking turns at talking" (Elsner and Charniak, 2008, p. 24). Second, we assume conversations in a given scene in drama are entangled, and there is often more than one thread in any given conversation. Taken together, here is our operative definition: A thread is a cluster of semantically and pragmatically coherent utterances that are part of a conversation. Those utterances share a single, sustainable focus of attention (Goffman, 1963; Ervin-Tripp, 1964; Sacks et al., 1974), either on a character (who has other characters' attention) or a topic (often related to the wants and needs of a character), as well as other observable contextual relations (O'Neill and Martin, 2003).
In a conversation, attention can be paid to a character (who has the floor) or a topic (why they are having this conversation). In the context of drama, we can often relate the topic of a thread to the desire or intent of the character who started the conversation (this loops back to McKee's idea that dialogues are speech acts): characters can express their own needs, or respond to someone else's needs, and the need acts as the driving passion of the dramatic conversation (and, subsequently, plot, characterization, etc.). Attention can be paid to a character or a topic, but such attention should be sustainable over multiple utterances to form a thread. In other words, when an utterance redistributes the focus of other characters (a change of floor) or shows us new wants and needs of a character, and such distribution of attention or topical focus is carried over to the next couple of utterances, it usually marks the start of a new conversational thread. However, there are occasions where threads can be short, which we will describe at the end of this section.
It has been noted that the exact start of a conversational thread is not easy to determine (McDaniel et al., 1996; Elsner and Charniak, 2010), so we will dedicate the second section (§B.1.2) to this topic. There, we try to unpack our operational definition with more examples.
Reply-to relationship. Following conventions in NLP (Zhu et al., 2021), we understand individual dialogues in a multi-party conversation in terms of a parent utterance and an utterance of interest (UOI); an utterance is roughly synonymous with a dialogue line. The following are the first two utterances in the entire series of Gilmore Girls:

LORELAI Please, Luke.
LUKE How many cups have you had this morning? (I.i)

In practice, the UOI is the line you are currently annotating, and its parent utterance is the previous line it most logically replies to. In the example above, if Luke's utterance is the UOI, its parent utterance is the immediately preceding utterance by Lorelai ("Please, Luke . . ."). In fact, the default parent utterance is usually the previous line.
Perhaps we can think of conversations in drama as sequences of sentences, where one triggers the next. Given our qualitative observations of drama, we make the following remarks on the UOI, parent utterance, and thread:

• If a UOI does not have any parent utterance, it is the start of a new thread.
• One UOI can have only ONE parent utterance. An utterance can have multiple children (utterances that point to it as their parent utterance).
• The default parent utterance is the previous utterance.
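These rules make each annotated scene a forest: every utterance stores at most one parent pointer, and threads fall out by following parents to a root. A minimal sketch follows; the list-of-parents representation is our own illustration, not the released annotation format.

```python
def recover_threads(parents):
    """parents[i] is the index of utterance i's parent, or None for a thread start."""
    def root(i):
        # Follow the single parent pointer until we reach a thread start.
        while parents[i] is not None:
            i = parents[i]
        return i

    threads = {}
    for i in range(len(parents)):
        threads.setdefault(root(i), []).append(i)
    return list(threads.values())

# A scene where utterances 0 and 2 start new threads;
# 1 replies to 0, 3 replies to 2, and 4 replies to 1.
print(recover_threads([None, 0, None, 2, 1]))  # → [[0, 1, 4], [2, 3]]
```

Because each UOI has exactly one parent, this recovery is unambiguous: a clustering of utterances into threads is fully determined by the reply-to links.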
Nuances of reply-to and sentences in one dialogue turn. We first note that topic is not entirely about coherence from a semantic point of view: we expect to see metaphors or jokes in TV shows and movies, and there is no puppy dog, queen, or crucifix really present in the scene; and just like in real life, we might bring up something that appears entirely random in any conversation ("You have to be in my brain to see the connection!"). Still, we can intuitively tell there are two threads of conversation going on, and we are not perplexed when Rory says "All hail to the queen of non-sequiturs." We can perhaps look at that exchange from a pragmatic perspective: Lorelai starts the "Who is this" conversation because she wants to find out who this girl in Rory's dorm room is. Venereal disease has nothing to do with that. It does not follow.
On the other hand, floor has to do with ownership and attention. Who has started and owns the conversation? Who controls our (and other characters') attention? Who would we direct our gaze to if we were also in the scene? Lorelai's last line can help us understand the nature of a conversational thread: a specific character first started and owned the thread, and then they let someone else do so. One character is ready to speak or keeps speaking, and others listen and reply. Here, Lorelai wants to buy Emily a DVD player, and the latter refuses. Emily leaving the room can be understood as her being unwilling to pay more attention to Lorelai: Emily does not want to stay in the conversation anymore. Also, Lorelai asking "Where are you going" is a shift in topic: it has nothing to do with DVD players. We will emphasize this again, but floor change tends to be more common when the thread involves at least three characters. Floor still exists in two-party conversations: if a high school student finds herself in the principal's office, we can expect the principal to be the one doing most of the talking and holding the floor. But floor change should be more frequent when we have more than two characters.
This exchange between Emily and Lorelai is also an interesting instance where Emily tries to gain control over the conversation (or floor) and Lorelai does not let her, so she just leaves. This behavior of floor gaining and changing is a particularly interesting aspect we want to consider for analysis. In the world of Gilmore Girls, Emily is the matriarch of the family, and she does have the most power in her conversations with Lorelai (her daughter) and Rory (her granddaughter). Who tends to gain the floor? How often does a new thread start? How long is a thread? Those are interesting empirical questions we hope this work can eventually help us answer.
Let's try another one and revisit this example:

LORELAI (D1) Michel, come on, we've got to get into these budgets.

Here, we have two threads: thread one has D1, D2, and D5 (topic: the budget meeting); thread two has D3 and D4 (topic: Michel's recording device). D5 relates most strongly to Lorelai's desire to get Michel to join the meeting, which is already expressed in D1, so that relation takes precedence over the weak semantic link between it (D3) and that machine (D5) and makes D5 part of the first thread.
Here's one last example for intuition. This is a family dinner scene. Rory, the granddaughter, brings her boyfriend, Dean, along, who meets Richard, the grandfather, for the first time. Interrogation ensues. Pay attention to the intention and desire of each character, and to who controls where the conversation is going.

In another scene, Emily and Lorelai are having a quarrel. Rory tries to distract everyone and change the topic, yelling "Bangalore" entirely out of the blue. Rory's "Bangalore" also starts a new thread.
In both examples, the new thread has only one utterance, which is shorter than usual, but contextually it makes sense.
In the next section, we will talk in detail about how to determine a topic/floor change or thread start in a more principled manner. But ultimately you need to use your judgment, and hopefully you now have a rough sense of what a topic or a floor is.

B.1.2 Threading drama
This section offers more instructions on how to decide when a conversational thread starts in drama. We already noted the two layers at work in the previous section:

1. Conceptually, the start of a thread introduces a new floor ("focus of attention", Goffman, 1963) or an observable change in topic (contextual relations, O'Neill and Martin, 2003).
2. Practically, the start of a thread has no parent utterance.
The rest of this section further elaborates our (perhaps idiosyncratic) definitions of floor and topic as they apply to our conversational analysis of drama.
Floor: Who are we paying attention to? Elsner and Charniak (2010, p. 392) describe the start of a new conversational thread as the process of participants (or, in our case, characters) "hav[ing] refocused their attention . . . away from whoever held the floor in the parent conversation." As in Goffman (1963), attention is the operative word: the character holding the floor can safely assume that they have the attention of others. Such attention is singular and must be sustained throughout the thread; otherwise, someone else has the floor. A test we could use is whether we could logically insert "Now, everyone listen to me!" or "Now, everyone look at me!" before the UOI that we see as a potential start of a new thread. Consider the following example:

TAYLOR This goes well beyond a head of lettuce, young man. The charges against your nephew are numerous. He stole the "save the bridge" money!
LUKE He gave that back.
TAYLOR He stole a gnome from Babette's garden.
LUKE Pierpont was also returned.
MISS PATTY He hooted at one of my dance classes.
FRAN He took a garden hose from my yard.
ANDREW My son said he set off the fire alarms at school last week.
LORELAI I heard he controls the weather and wrote the screenplay to Glitter.

Adapting McDaniel et al. (1996), we introduce below some basic semantic and pragmatic mechanisms that signal the continuation of a thread:

• semantically and pragmatically coherent response: By definition, if the UOI makes a coherent response to any previous lines, the closest such candidate is its parent utterance.
• semantically non-topical speech: Expressive speech ("Ouch") or greetings ("Hi") do not have a "manifest topic" (Ervin-Tripp, 1964), so we consider them non-topical. Unless they are used to gain the floor, they are always regarded as a continuation of the same thread.
• successive greetings: If there are other things going on in the scene, "Hi" and "Goodbye" alone do not form a thread.
• co-reference: Pronouns whose referent would be less ambiguous if we determine that the UOI continues the current thread.
• term of address: If one character addresses another directly, there's a higher chance that they are in the same thread.
• acknowledgement: If one character acknowledges, in their utterance, the presence of another character in the same scene, they are more likely in the same thread.
• same physical location: If all characters stay where they are at the start of the thread, they might be in the same thread.

This last example is also an important reminder that ultimately this project is about finding the reply-to relationship (and, from there, threads of conversations). It is not about who replies to whom or who is listening. Addressees or participant roles should not be your primary criterion in deciding the reply-to relationship.
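The cues above are stated qualitatively. Purely for illustration, here is how they might be turned into a naive candidate-parent scorer; the dict schema, the cue list, and the integer weights are our inventions, not part of the paper or its guidelines.

```python
import re

# Hypothetical inventory of non-topical expressions (expressives, greetings).
NON_TOPICAL = {"hi", "hello", "hey", "ouch", "goodbye", "bye", "thanks"}

def parent_score(uoi, candidate, same_location=True):
    """Score a candidate parent utterance for a UOI using the cues above.

    Both arguments are dicts with 'speaker' and 'text' keys (a toy schema);
    higher scores mean a likelier reply-to link.
    """
    score = 0
    # Term of address: the UOI names the candidate's speaker directly.
    if re.search(rf"\b{re.escape(candidate['speaker'])}\b", uoi["text"], re.I):
        score += 2
    # Co-reference: pronouns suggest continuation of the candidate's thread.
    if re.search(r"\b(he|she|it|they|that)\b", uoi["text"], re.I):
        score += 1
    # Non-topical speech ("Hi", "Ouch") continues the current thread by default.
    if uoi["text"].strip(" !.?,").lower() in NON_TOPICAL:
        score += 1
    # Same physical location as the thread start.
    if same_location:
        score += 1
    return score

uoi = {"speaker": "LUKE", "text": "He gave that back."}
cand = {"speaker": "TAYLOR", "text": "He stole the save-the-bridge money!"}
print(parent_score(uoi, cand))  # → 2 (pronoun cue + same location)
```

In an actual annotation or modeling pipeline these cues would interact with floor and topic judgments rather than being summed independently; the sketch only makes the individual signals concrete.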
Lastly, many social media and instant messaging apps have this notion of "thread" built in: a Twitter or Slack thread usually explores the same topic; if one person explicitly indicates to which previous message they are replying, and then another one replies to that message, those messages naturally form a thread.Threads work in a similar way here.

B.2 Annotation in action
Data disclaimer. Please be aware that our sampled scenes may contain potentially problematic content, such as vulgar, violent, or offensive language in screenplays, or other biases held by individual screenwriters.
Annotation principles. Here are the general rules for annotation:

1. Intuitively, dialogues follow the basic economic principle, where D_n replies to D_{n−1}.
2. A new thread starts when a speaker refocuses other characters' attention or starts a new topic.
3. Use speaker labels, action lines, and dialogue turn information to enhance your understanding of the scene.
4. Always quickly skim through a couple of lines and get a sense of what's happening in the scene before starting to annotate.

Summary of symbols.
For each sentence in a dialogue line, annotate with the following symbols.
The next section explains how to use those symbols.
• This line is the start of a new thread:
- T: both floor and topic changes occur, or any signals indicate that the previous conversation was over. Also use this symbol at the start of any scene.
- F: you're certain only a Floor change occurs; you can add "now, everyone listen to me" at the beginning (or a phrase that serves the same function is actually part of the line).
- P: you're certain only a toPic change occurs; you can add "switching subject" at the beginning (or a phrase that serves the same function is actually part of the line).
• -: this line replies to the preceding line.
• Dx: this line replies to sentence D_x.
• Symbols for editorial convenience (to be used very sparingly):
- S: skip the current sentence, due to significant OCR/parsing errors.
- X: this line requires further discussion for adjudication.

Handling parsing/OCR errors. Fig. 5 shows an example. If a line contains neither an action statement nor dialogue but only information about the script or purely logistical information (like Untitled Project (04/12/22), CBS Studio Production), put S (skip). There are also long-tail errors: screenplays to Star Trek movies put all dialogues in Klingon in parentheses, and they will be parsed as action lines. It's impossible to find all of them through regular expressions, and we don't need our model to see them.
Fourth, pay attention to ellipses and em dashes: IMSDB/Scriptbase-J can use '[sentence] . ..' (with a leading space) or '[sentence]. ..' (without), and there can be spaces between the dots ('. ..' vs. '...'). Em dashes can be ' - ' (a dash separated by spaces), '--' (two dashes, no surrounding space), or '-' (which looks just like two words being linked together). There is no easy way for us to clean and normalize these in our pipeline, and in some cases they interfere with the semantics. So correct those too.
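A normalization pass of this kind could also be scripted. The sketch below handles only the variants listed above and deliberately leaves intra-word hyphens alone; the exact inventory of variants in IMSDB/Scriptbase-J files may differ.

```python
import re

def normalize_script_line(line):
    """Collapse ellipsis and dash variants to a single canonical form."""
    # '. ..', '.. .', '. . .' (any spacing between three dots) -> '...'
    line = re.sub(r"\.\s*\.\s*\.", "...", line)
    # '--' (with or without surrounding space) -> a single spaced dash
    line = re.sub(r"\s*--\s*", " - ", line)
    # collapse extra whitespace around a spaced single dash
    line = re.sub(r"\s+-\s+", " - ", line)
    return line

print(normalize_script_line("Please, Luke . .."))  # → 'Please, Luke ...'
print(normalize_script_line("well--maybe not"))    # → 'well - maybe not'
```

Even with such a pass, manual correction remains necessary for the ambiguous single-hyphen case, since a bare '-' can be either a dash or a legitimate hyphenated word.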

Figure 1 :
Figure 1: Example of dramatic conversations, taken from a scene in Young Sheldon. Speaker labels are in boldface and SMALL CAPS. Curved arrows indicate the reply-to relations between dialogue lines. Each thread is distinguished by color.

Figure 2 :
Figure 2: Example of the standard screenplay format (Citizen Kane). EXT. ONE OF THE EXITS is the scene header. The dialogue portions are usually indented, while action statements ("Emily and Junior are standing") are not. Speaker labels are found in cue lines in all caps, followed by their dialogue lines.
to independently encode representations of the utterance of interest, the parent utterance, and their associated scene context. We used two classes of embeddings: contextual and feature-based.

Contextual embeddings. Given a pre-trained model F like BERT, each utterance u spoken by speaker k is strung together with special tokens as [CLS] k [SEP] u [LINE], where [SEP] separates a speaker label from the associated line, and [LINE] marks the end of the line. Here, [LINE] is a custom token whose representation is learned during training. The contextual embeddings are derived by
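As a sketch, the input format just described can be made concrete as follows. In practice one would also register [LINE] as an additional special token with the tokenizer (e.g., via Hugging Face's add_special_tokens) and resize the model's embedding matrix; the template function below only illustrates the string layout.

```python
def encode_input(speaker, utterance):
    """Build the '[CLS] k [SEP] u [LINE]' input string for one utterance."""
    return f"[CLS] {speaker} [SEP] {utterance} [LINE]"

# With a real BERT tokenizer, pass this string with add_special_tokens=False,
# since [CLS] and [SEP] are already spelled out in the template.
print(encode_input("LORELAI", "Please, Luke."))
# → [CLS] LORELAI [SEP] Please, Luke. [LINE]
```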

Figure 3 :
Figure 3: The average thread lengths of movies in 5-year bins, along with 95% confidence intervals.

Figure 4 :
Figure 4: The percentage of threads started by women relative to their speaking time, along with 95% CIs.
SOOKIE (D2) Now.
MICHEL (D3) Does the red light mean it's programmed?
SOOKIE (D4) [to Lorelai] I explained it a hundred times.
LORELAI (D5) Michel, you've been setting that machine for 20 minutes now. (IV.xvi)

Figure 5: An example of typical OCR/parsing errors. It is unclear why Eddie's line is broken into three paragraphs (which is not normal), but since our pipeline relied heavily on line breaks, our parser wrongly recognized Eddie's lines as action statements: you cannot distinguish a true action statement ("As Steve pours water . . .") from a dialogue line by line breaks alone. Errors like these are easily fixed, however: supply dummy dialogue turn and dialogue IDs, and annotate accordingly.
B.2.1 Questions to ask while annotating

1. This is the beginning of a scene.
• Put T.
• Read a couple of lines ahead and gain a sense of: Why does the character speak at all? What do they want? Who has the floor?

2. This is a new sentence:
(a) Can D_{n−1} be a sensible reply to D_n? If so, put -.
(b) If not, what previous line leads to this line? Is there any previous line that triggers (or gives the necessary context for us to understand) this UOI? If there is one, put that utterance ID.
(c) If not, is there a topic/floor change?
i. Is there any new intent or desire being expressed? Can I insert "switching subject" at the beginning of the UOI? If so, put P (toPic).
ii. Is there a character replacing another as the center of attention? Can I insert "Now, look at me/listen to me" at the beginning of the UOI? If so, put F (Floor).
iii. Do floor and topic changes happen at the same time? Or are there other signals that the previous conversation was over? If so, put T.

Table 3 :
Experimental results.All metrics are reported with 95% bootstrap confidence intervals.
Those preliminary remarks don't always apply. Often one dialogue line is too large a unit for us to fully understand conversational behaviors, and more often the next line isn't the response to the current:

EMILY You're being stubborn, as usual.
LORELAI . . . Now then, do they validate parking here?

"Now then, do they validate parking here?" has nothing to do with whether Lorelai is stubborn or not (which we tentatively call a topic). To reflect this, in this work we study utterances at the sentence level. In other words, a screenplay/teleplay exhibits the following hierarchy:

SOOKIE And if we go down after two years . . .
LORELAI It'll be the most exciting two years of our lives! (II.viii)

In conversations, people can finish each other's lines. So we say, for multiple sentences in one dialogue turn, the parent utterance of sentence n is, by default, sentence n − 1.

SOOKIE [to Lorelai] I explained it a hundred times.
LORELAI Michel, you've been setting that machine for 20 minutes now.

LORELAI Who is that?
RORY I don't know. She just followed me in here like a puppy dog without saying a word.
LORELAI Maybe she's lost.
RORY Or, maybe she's one of my new suitemates who I'm already off to a swell start with.
LORELAI Do you know how vulnerable you are to venereal disease?
RORY All hail to the queen of the non-sequiturs.
RORY Mom.
LORELAI Or tell her where you live.
RORY Too late.
LORELAI Oh, you touched the doorknob.
RORY Good grief. (IV.ii)

Antonia, please bring out the Twinkies.

Rory and Lorelai were talking about Rory's grandparents at first. From the action statements, we see Luke bring plates and forks to the table, and Rory's "Thanks" at the end replies to that. It does not belong in the previous thread about her grandparents, and here the only logical annotation indicates that Rory starts a new thread, which has only one sentence, before the scene ends.
RICHARD I'm talking to Dean.
DEAN I don't know yet.
RICHARD You don't?
DEAN No, not yet.
RICHARD Well, what kind of grades do you get?
EMILY Richard, please, don't grill the boy.
RICHARD I'm not grilling the boy, Emily. It's an easy question. A's, B's, C's?
DEAN I get a mixture actually.
RICHARD
LORELAI He changes a mean water bottle.
DEAN

BOOTSY [Now, everyone listen to me:] I think it's time for me to pipe up here. Excuse me, but I've got the floor.
LUKE Unbelievable.
BOOTSY
LUKE You don't have the floor.

With that in mind, when you see the first utterance in a scene, try to identify the need, purpose, or action it speaks to, which you can describe in a verb phrase. From there, extrapolate a broader, general topic, which you can describe in a noun phrase. Remember, you might not be able to do such extrapolation without knowing what's happening in the scene, so read ahead and skim a little. A new conversational thread begins when the UOI has nothing to do with that previous topic. Otherwise, it continues the current thread. A simple test to see if a UOI starts a new topic is to insert a parenthetical statement such as "Changing subject" at the beginning, and see if the conversation still makes sense.

Fig. 5 shows what kind of parsing/OCR-related errors you might encounter and why they are there. If you spot an obvious/easily fixable OCR or parsing error, please correct it. If you suspect you've spotted an error of any kind, you can take a look at the original txt or pdf file. It also comes with experience. After you're sure there are errors, here's how to fix them:

First, if you are turning an action line into a dialogue line, supply a dummy dialogue turn ID and dialogue ID. Our suggestion is something like La and Da. We do so because those lines you are rescuing might become parent utterances of UOIs to come, in which case you can annotate with Da.

Second, you might need to turn a dialogue line into an action line: many entities are singled out
printed in uppercase, such as THE BATHROOM here. When they appear alone in a line, there is no way to distinguish them from a regular speaker label (we also don't want to simply exclude locations because, e.g., MAN IN THE STREET is a well-formed speaker label). To correct this, simply change the dialogue ID to A. Changing the line type is NOT necessary (this saves you two seconds, which add up). Since we don't really use action IDs for annotation, it's not necessary to add them. Removing the dialogue turn ID or speaker label is optional. This is what that row should be:

Third, if a line is now empty after your correction, or if you spot a line that does not contain an action statement or dialogue but only information about the script, put S (skip).