Improving Automatic Quotation Attribution in Literary Novels

Current models for quotation attribution in literary novels assume varying levels of available information in their training and test data, which poses a challenge for in-the-wild inference. Here, we approach quotation attribution as a set of four interconnected sub-tasks: character identification, coreference resolution, quotation identification, and speaker attribution. We benchmark state-of-the-art models on each of these sub-tasks independently, using a large dataset of annotated coreferences and quotations in literary novels (the Project Dialogism Novel Corpus). We also train and evaluate models for the speaker attribution task in particular, showing that a simple sequential prediction model achieves accuracy scores on par with state-of-the-art models.


Introduction
We focus on the task of automatic quotation attribution, or speaker identification, in full-length English-language literary novels. The task involves attributing each quotation (dialogue) in the novel to the character who utters it. The task is complicated by several factors: characters in a novel are referred to by various names and aliases (Elizabeth, Liz, Miss Bennet, her sister); these aliases can change and be added over the course of the novel; and authors often employ differing patterns of dialogue in the text, whereby quotations are sometimes attached to the speaker explicitly via a speech verb, and at other times require keeping track of character turns over multiple paragraphs. The development of automated methods has also been hindered by the paucity of annotated datasets on which models can be trained and evaluated.
Existing methods for quotation attribution fall into one of two groups: those that directly attribute the quotation to a named character entity and those that treat it as a two-step process in which quotations are first attached to the nearest relevant mention of a character and mentions are then resolved to a canonical character name via a coreference resolution model. We contend that most use-cases of a quotation attribution system involve resolving the speaker mention to one among a list of character entities. Thus, the usability of these systems is very much dependent on their ability to compile such a list of character entities and to resolve each attributed mention to an entity from this list. Here, we use the Project Dialogism Novel Corpus (Vishnubhotla et al., 2022), a large dataset of annotated coreferences and quotations in literary novels, to design and evaluate pipelines of quotation attribution. Our analysis shows that state-of-the-art models are still quite poor at character identification and coreference resolution in this domain, thus hindering functional quotation attribution.

Background and Prior Work
Elson and McKeown (2010) introduce the CQSA corpus, which contains quotations from excerpts from 4 novels and 7 short stories that are annotated for the nearest speaker mention, which can be named (e.g., Elizabeth), or nominal (her friend). On average, only 25% of the attributions in CQSA are to a named entity.
In contrast, He et al. (2013) link quotations directly to entities, and a list of characters and aliases is required for attribution. This list is generated with a named entity recognition (NER) model to obtain entity terms, which are then grouped together using Web resources such as Wikipedia.
The GutenTag package from Brooke et al. (2015) contains modules for generating character lists and identifying speakers in literary texts. The former is based on the LitNER model (Brooke et al., 2016a), which bootstraps a classifier from a low-dimensional Brown clustering of named entities from Project Gutenberg texts. The speaker attribution model is a simple rule-based approach that identifies the nearest named entity.
Sims and Bamman (2020) annotate the first 2000 tokens of 100 novels from the LitBank dataset. Quotations are linked to a unique speaker from a predefined list of entities. LitBank also contains annotations for coreference for these tokens (Bamman et al., 2020). The BookNLP package from the same group contains pre-trained models for NER, coreference resolution, and speaker attribution, although the latter is only at the mention-level.
arXiv:2307.03734v1 [cs.CL] 7 Jul 2023
Cuesta-Lazaro et al. (2022) attempt to reconcile the differences in pre-requisites and methodologies of prior attribution systems by proposing a modularization of the task into three sub-tasks: quotation identification, character identification, and speaker attribution. They evaluate baselines for each component, propose a new state-of-the-art method for speaker attribution, and quantify the relative importance of each module in an end-to-end pipeline. Their speaker attribution module, however, considers only named mentions in the text as candidate speakers, leading to a lower performance on implicit and anaphoric quotations. Neither their dataset of 15 novels nor their model for speaker attribution has been made public, precluding comparison with our work below.
In our work, we follow this modular formulation, with some key differences: (a) we evaluate an additional sub-task of coreference resolution, allowing us to (b) test an attribution model that can work with both named and pronominal candidate mentions surrounding a quotation; and (c) we evaluate our models on a publicly available dataset.

Dataset: PDNC
We briefly describe here the Project Dialogism Novel Corpus (Vishnubhotla et al., 2022). PDNC consists of 22 full-length English novels, published in the 19th and 20th centuries, annotated with the following information:
Characters: A list of characters in the novel. This includes characters who speak, are addressed, or are referred to multiple times in the novel. Each character is identified by a main name (e.g., Elizabeth Bennet), as well as a set of aliases (Liz, Lizzie, Eliza). We do not distinguish between the two, and treat each character entity as identifiable by a set of names (so that Elizabeth Bennet, Liz, Lizzie, Eliza forms one character entity).
Quotations: Each uttered quotation in the novel is annotated with its speaker and addressee(s); with the referring expression, if any, that indicates who the speaker is; and with internal mentions, i.e., named or pronominal phrases within the quotation that refer to one or more character entities.
The annotations in PDNC make it ideal for evaluating several aspects of quotation attribution in novels, including named entity recognition, coreference resolution, and speaker attribution.

Modularization of the Task
Character identification: The goal of this sub-task is to build a list of the unique character entities in a novel. Although NER models perform quite well at identifying spans of text that constitute a named entity (here, a character name), the task is complicated by the fact that characters can have multiple aliases in the text. Moreover, some characters may be introduced and referred to only by social titles (the policeman, the Grand Inquisitor, the little old man, the bystander).
Coreference resolution: The goals here are to identify text spans that refer to a character entity (which we refer to as mentions) and to link each mention to the correct character entity or entities to which it refers. In addition to mentions that are personal pronouns such as he, she, and them, literary texts have an abundance of pronominal phrases that reflect relationships between characters, such as her husband and their father. Such phrases can also occur within quotations uttered by a character (e.g., my father), requiring quotation attribution as a prerequisite for complete coreference resolution.
Quotation identification: Perhaps the most straightforward of our sub-tasks, here we identify all text spans in a novel that constitute dialogue, i.e., are uttered by a character entity or entities.
Speaker attribution: Finally, this sub-task links each identified quotation to a named character identity. While most models are designed to solve the more tractable and practical problem of linking quotations to the nearest relevant speaker mention, we subsume the mention-entity linking tasks under the coreference resolution module, equating the two tasks.

Models and Evaluation Metrics
We evaluate each of the modules described in the previous section separately. In order not to confound the evaluation with cascading errors, at each step, we "correct" the outputs of the automated system from the previous step by using annotations from PDNC.

Character Identification
We evaluate two pipelines, GutenTag and BookNLP, on their ability to identify the set of characters in a novel, and potentially, the set of aliases for each character. We also test the NER system from the spaCy module (https://explosion.ai/blog/spacy-v3) as a proxy for the state of the art in NER that is not trained explicitly for the literary domain.
Character recognition (CR): For each novel, we compute the proportion of annotated character entities that are identified as named entities of the category 'PERSON' (Doddington et al., 2004). We use a simple string-matching approach, where we try for either a direct match, or a unique match when common prefixes such as Mr. and Sir are removed. Thus, if a particular novel has N annotated character entities, the NER model outputs a list of K named 'PERSON' entities, and K′ of these entities are in turn matched with M of the N characters, the CR metric is calculated as M/N.
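The matching step and the M/N ratio can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for the reported scores: the helper names (`strip_prefix`, `match_character`, `character_recognition`), the honorific list beyond "Mr." and "Sir", and the toy NER output are all hypothetical.

```python
# Assumed honorific list; the text names only "Mr." and "Sir" as examples.
PREFIXES = ("Mr.", "Mrs.", "Miss", "Sir", "Lady", "Dr.")

def strip_prefix(name):
    """Drop leading honorifics, keeping at least one token of the name."""
    parts = name.split()
    while len(parts) > 1 and parts[0] in PREFIXES:
        parts = parts[1:]
    return " ".join(parts)

def match_character(name, characters):
    """Return the id of the single character matched by `name`, else None."""
    # First try a direct match against the annotated aliases.
    direct = [cid for cid, aliases in characters.items() if name in aliases]
    if len(direct) == 1:
        return direct[0]
    # Otherwise compare with prefixes removed, accepting only a unique match.
    bare = strip_prefix(name)
    loose = [cid for cid, aliases in characters.items()
             if bare in {strip_prefix(a) for a in aliases}]
    return loose[0] if len(loose) == 1 else None

def character_recognition(ner_names, characters):
    """CR = M / N: matched annotated characters over all annotated characters."""
    matched = {match_character(name, characters) for name in ner_names}
    matched.discard(None)
    return len(matched) / len(characters)

# Toy annotation (N = 3) and a hypothetical list of 'PERSON' entities (K = 3).
characters = {
    "elizabeth": {"Elizabeth Bennet", "Eliza", "Lizzie", "Liz"},
    "mary": {"Mary Bennet", "Mary"},
    "queen": {"The Queen"},
}
ner_names = ["Eliza", "Miss Mary Bennet", "London"]
cr = character_recognition(ner_names, characters)
```

Here two of the three NER outputs match an annotated character, and they cover M = 2 of the N = 3 characters, so `cr` is 2/3.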
Character clustering: We use the clustering evaluation metrics of homogeneity (C.Hom), completeness (C.Comp), and their harmonic mean, v-score, to evaluate named entity clusters. Homogeneity (between 0 and 1) is the fraction of named clusters that link to the same character entity; completeness is the number of homogeneous clusters a single entity is distributed over (ideal value of 1).
As an example, consider the case where we have three annotated characters for a novel: Elizabeth Bennet, Mary Bennet, and The Queen. The sets of annotated aliases for the characters are {Elizabeth Bennet, Eliza, Lizzie, Liz}, {Mary Bennet, Mary}, and {The Queen}. Say model M1 outputs the following entity clusters: {Elizabeth Bennet, Eliza}, {Liz, Lizzie} and {Mary Bennet, Mary}; model M2 outputs {Elizabeth Bennet, Mary Bennet, Eliza, Mary}, {Liz, Lizzie}. Each model has recognized two out of the three characters in our list; this evaluates to a CR score of 2/3. Each of the three clusters from model M1 refers solely to one character entity, resulting in a homogeneity score of 1.0. However, these three clusters are formed for only two unique character entities, resulting in a completeness score of 1.5 (v-score 0.6). Analogously, model M2 has a homogeneity score of 0.5 and a completeness score of 1.0 (v-score 0.5).
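The worked example can be reproduced with a short sketch. It assumes one reading of the definitions: homogeneity is the share of output clusters whose names all map to a single annotated entity, and completeness is the average number of homogeneous clusters per entity that has at least one. The function name `evaluate_clusters` is hypothetical, and the v-score is left out because the text does not spell out how the two values are combined when completeness exceeds 1.

```python
from collections import Counter

def evaluate_clusters(clusters, characters):
    """Return (CR, homogeneity, completeness) for a list of name clusters."""
    recognized = set()   # annotated entities matched by any cluster
    homogeneous = []     # entity id of each cluster that maps to one entity
    for cluster in clusters:
        matched = {cid for cid, aliases in characters.items()
                   if cluster & aliases}
        recognized |= matched
        if len(matched) == 1:
            homogeneous.append(next(iter(matched)))
    cr = len(recognized) / len(characters)
    hom = len(homogeneous) / len(clusters)
    per_entity = Counter(homogeneous)
    # Average number of homogeneous clusters per covered entity (ideal: 1).
    comp = len(homogeneous) / len(per_entity) if per_entity else 0.0
    return cr, hom, comp

characters = {
    "elizabeth": {"Elizabeth Bennet", "Eliza", "Lizzie", "Liz"},
    "mary": {"Mary Bennet", "Mary"},
    "queen": {"The Queen"},
}
m1 = [{"Elizabeth Bennet", "Eliza"}, {"Liz", "Lizzie"}, {"Mary Bennet", "Mary"}]
m2 = [{"Elizabeth Bennet", "Mary Bennet", "Eliza", "Mary"}, {"Liz", "Lizzie"}]

scores_m1 = evaluate_clusters(m1, characters)
scores_m2 = evaluate_clusters(m2, characters)
```

Under this reading, M1 yields CR 2/3, homogeneity 1.0, and completeness 1.5, and M2 yields CR 2/3, homogeneity 0.5, and completeness 1.0, matching the values in the example.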

Coreference Resolution
We consider two pipelines for coreference resolution: BookNLP (based on Ju et al. (2018)) and spaCy (based on Dobrovolskii (2021)). Given a text, these neural coreference resolution models output a set of clusters, each comprising a set of coreferent mention spans from the input.
Evaluating this module requires annotations that link each mention span in a novel to the character entity referred to. PDNC, unfortunately, contains these mention annotations only for text spans within quotations.
We therefore evaluate coreference resolution only on a subset of the mention spans in a novel, extracted as follows: We first identify the set of mention clusters from our models that can be resolved to an annotated character entity, using the character lists from PDNC and the string-matching approach described above. We then prune this to only include those mention spans that are annotated in the PDNC dataset, i.e., mention spans that occur within quotations, and evaluate the accuracy of the resolution.
Mention clustering (M-Clus): We compute the fraction of mention clusters that can be matched to a unique (Uniq) annotated character entity rather than to multiple (Mult) or no (None) entities.
Mention resolution (M-Res): For those mention spans within PDNC that are identified by the model and are assigned to a cluster that can be uniquely matched to a character entity (# Eval), we compute the accuracy of the linking (Acc.).
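A minimal sketch of how M-Clus and M-Res could be computed from model output is given below. The data layout is an assumption made for illustration: each cluster is a list of `(span, text)` pairs, the gold annotations map a span inside a quotation to the set of entities it refers to, and cluster-to-entity matching is simplified to verbatim alias matching. None of these names come from BookNLP or spaCy.

```python
from collections import Counter

def cluster_entities(cluster, characters):
    """Entities with an alias appearing verbatim among the cluster's mentions."""
    texts = {text for _, text in cluster}
    return {cid for cid, aliases in characters.items() if texts & aliases}

def evaluate_coreference(clusters, characters, gold_mentions):
    """Return (M-Clus fractions, # Eval, Acc.) for one novel."""
    kinds = Counter()
    predicted = {}  # span -> entity, only for uniquely matched clusters
    for cluster in clusters:
        entities = cluster_entities(cluster, characters)
        kind = "Uniq" if len(entities) == 1 else "Mult" if entities else "None"
        kinds[kind] += 1
        if kind == "Uniq":
            entity = next(iter(entities))
            for span, _ in cluster:
                predicted[span] = entity
    m_clus = {k: kinds[k] / len(clusters) for k in ("Uniq", "Mult", "None")}
    # Only gold spans that the model placed in a uniquely matched cluster count.
    evaluated = [span for span in gold_mentions if span in predicted]
    correct = sum(predicted[span] in gold_mentions[span] for span in evaluated)
    acc = correct / len(evaluated) if evaluated else 0.0
    return m_clus, len(evaluated), acc

# Toy data: spans are (start, end) character offsets.
characters = {
    "emma": {"Emma Woodhouse", "Miss Woodhouse", "Emma"},
    "knightley": {"Mr. Knightley"},
}
clusters = [
    [((0, 4), "Emma"), ((10, 13), "she"), ((40, 42), "my"), ((45, 48), "her")],
    [((20, 23), "him"), ((50, 53), "you")],
    [((60, 73), "Mr. Knightley"), ((80, 84), "Emma")],
]
gold_mentions = {
    (40, 42): {"emma"},
    (45, 48): {"knightley"},
    (50, 53): {"knightley"},
}
m_clus, n_eval, acc = evaluate_coreference(clusters, characters, gold_mentions)
```

In this toy case one cluster falls into each of the Uniq, Mult, and None categories, two gold mentions are evaluated, and one of them is linked correctly.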

Quotation Identification
Most models, rule-based or neural, can identify quotation marks and thus quotations. We evaluate how many of such quoted text instances actually constitute dialogue, in that they are uttered by one or more characters. Our gold standard is the set of quotations that have been annotated in PDNC, which includes quotations uttered by multiple characters and by unnamed characters such as a crowd.

Speaker Attribution
The speaker-attribution part of BookNLP's pipeline is a BERT-based model that uses contextual and positional information to score the BERT embedding for the quotation span against the embeddings of mention spans that occur within a 50-word context window around the quotation; the highest-scoring mention is selected as the speaker. We supplement this approach by limiting the set of candidates to resolved mention spans from the coreference resolution step, thereby directly performing quotation-to-entity linking. As we see from our results, this method, which we refer to as BookNLP+, greatly improves the performance of the speaker attribution model by eliminating spurious candidate spans.
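The candidate restriction behind BookNLP+ can be illustrated with a small sketch. It does not call BookNLP: the scores stand in for whatever the BERT-based scorer would produce, the window is measured from a single quotation position for simplicity, and `attribute_speaker` and its argument layout are hypothetical.

```python
def attribute_speaker(quote_pos, candidates, resolved, window=50):
    """Pick the best-scoring candidate mention that resolves to an entity.

    candidates: list of (word_position, mention_id, score) tuples, where the
                score is assumed to come from an external mention scorer.
    resolved:   dict mapping mention_id -> character entity, produced by the
                coreference resolution step.
    """
    # Keep mentions inside the context window around the quotation.
    in_window = [c for c in candidates if abs(c[0] - quote_pos) <= window]
    # Discard mentions that could not be resolved to a character entity.
    kept = [c for c in in_window if c[1] in resolved]
    if not kept:
        return None
    best = max(kept, key=lambda c: c[2])
    return resolved[best[1]]

candidates = [
    (95, "m1", 0.9),    # top-scoring mention, but unresolved
    (110, "m2", 0.7),
    (120, "m3", 0.4),
    (300, "m4", 0.99),  # resolved, but outside the window
]
resolved = {"m2": "Elizabeth Bennet", "m3": "Mr. Darcy", "m4": "Jane Bennet"}
speaker = attribute_speaker(100, candidates, resolved)
```

Without the restriction, the unresolved mention `m1` would win and the quotation could not be linked to an entity; with it, the quotation is attributed to the entity behind `m2`.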
We also evaluate a sequential prediction model that predicts the speaker of a quotation simply by looking at the sequence of speakers and mentions that occur in some window around the quotation.We implement this as a one-layer RNN that is fed a sequence of tokens representing the five characters mentioned most recently prior to the quotation text, one character mention that occurs right after, and, optionally, the set of characters mentioned within the quotation.
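The input to such a sequential model could be assembled along the following lines. This sketch covers only the construction of the mention window, not the RNN itself; the special tokens `<pad>`, `<quote>`, and `<inside>`, the function name, and the position-based layout are assumptions made for illustration.

```python
def build_input_sequence(mentions, quote_start, quote_end,
                         n_before=5, include_inside=True):
    """Build the character-token sequence around one quotation.

    mentions: list of (token_position, character_id), sorted by position.
    """
    # The n_before characters mentioned most recently before the quotation.
    before = [c for pos, c in mentions if pos < quote_start][-n_before:]
    before = ["<pad>"] * (n_before - len(before)) + before
    # The first character mention right after the quotation, if any.
    after = [c for pos, c in mentions if pos >= quote_end][:1] or ["<pad>"]
    sequence = before + ["<quote>"] + after
    if include_inside:
        # Optionally, the set of characters mentioned within the quotation.
        inside = sorted({c for pos, c in mentions
                         if quote_start <= pos < quote_end})
        sequence += ["<inside>"] + inside
    return sequence

mentions = [(3, "Elizabeth"), (10, "Darcy"), (25, "Elizabeth"),
            (31, "Jane"), (40, "Darcy")]
sequence = build_input_sequence(mentions, quote_start=20, quote_end=35)
```

Each token in the resulting sequence would then be mapped to an embedding and fed to the recurrent layer, with the speaker predicted from among the characters in the window.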

Experimental Setup
We evaluate the models for character identification, coreference resolution, and quotation identification on the entire set of 22 novels in PDNC, since we are neither training nor fine-tuning these on this dataset.For the speaker attribution models, we define the training setup below.
We curate the set of mention candidates for each novel in the following manner: the mention clusters generated by BookNLP are used to extract the set of mention spans that could be successfully resolved to a character entity from the annotated PDNC character lists for each novel. We append to this set the annotated mention spans (within quotations) from PDNC, as well as explicit mention spans, that is, text spans that directly match a named alias from the character list.
Overlaps between the three sets are resolved with a priority ranking, whereby PDNC annotations are considered to be more accurate than explicit name matches, which in turn take precedence over the automated coreference resolution model.
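The priority ranking can be sketched as a greedy merge in which higher-priority sources claim their spans first. The span representation (half-open character offsets) and the function names are assumptions for illustration.

```python
def overlaps(a, b):
    """True if half-open spans a = (start, end) and b = (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def merge_candidates(pdnc, explicit, coref):
    """Merge three lists of (span, entity); earlier arguments take priority."""
    merged = []
    # Visit sources from most to least trusted: PDNC annotations, explicit
    # name matches, then the automated coreference output.
    for source in (pdnc, explicit, coref):
        for span, entity in source:
            if not any(overlaps(span, kept) for kept, _ in merged):
                merged.append((span, entity))
    return sorted(merged)

pdnc = [((10, 12), "emma")]
explicit = [((10, 14), "harriet"), ((30, 34), "emma")]
coref = [((30, 34), "knightley"), ((50, 52), "emma")]
merged = merge_candidates(pdnc, explicit, coref)
```

In this toy case the PDNC annotation at (10, 12) overrides the overlapping explicit match, and the explicit match at (30, 34) overrides the coreference output for the same span.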
We test with 5-fold cross-validation in two ways: splitting the annotated quotations in each novel 80/20 and splitting the set of entire novels 80/20.
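The difference between the two cross-validation settings can be made concrete with a small sketch. It uses a plain round-robin partition without shuffling, which is an assumption (the text does not say how folds were drawn), and the function names are hypothetical.

```python
def k_folds(items, k=5):
    """Yield (train, test) pairs; each fold is the test set once (about 20%)."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def quotation_splits(quotes_by_novel, k=5):
    """Split the quotations inside every novel 80/20."""
    per_novel = {n: list(k_folds(q, k)) for n, q in quotes_by_novel.items()}
    for i in range(k):
        train = [q for n in per_novel for q in per_novel[n][i][0]]
        test = [q for n in per_novel for q in per_novel[n][i][1]]
        yield train, test

def novel_splits(quotes_by_novel, k=5):
    """Split whole novels 80/20, so test novels are unseen in training."""
    for train_novels, test_novels in k_folds(sorted(quotes_by_novel), k):
        train = [q for n in train_novels for q in quotes_by_novel[n]]
        test = [q for n in test_novels for q in quotes_by_novel[n]]
        yield train, test

# Toy corpus: 5 novels with 10 quotations each, stored as (novel, index).
quotes_by_novel = {f"novel{i}": [(f"novel{i}", j) for j in range(10)]
                   for i in range(5)}
by_quotation = list(quotation_splits(quotes_by_novel))
by_novel = list(novel_splits(quotes_by_novel))
```

In the quotation split every novel contributes to both training and test data, so a model can learn novel-specific speaker patterns; in the novel split the test novels share no quotations with the training set.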

Results
From Table 1, we see that the neural NER models of spaCy and BookNLP are better at recognizing character names than GutenTag's heuristic system (0.81 and 0.85 vs 0.60). However, the strengths of GutenTag's simpler Brown-clustering-based NER system are evident when looking at the homogeneity; when two named entities are assigned as aliases of each other, it is almost always correct. This shows the advantage of document-level named entity clustering as opposed to local span-level mention clustering for character entity recognition. The cluster quality metric, on the other hand, tells us that GutenTag still tends to be conservative with its clustering compared to BookNLP, which nonetheless is a good strategy for the literary domain, where characters often share surnames.
Performance of these models on the coreference resolution task is significantly lower (Table 2). A majority of the mention clusters from both BookNLP and spaCy's coreference resolution modules end up as unresolved clusters, containing no named identifier that could be linked to a PDNC character entity. However, when we evaluate mention-to-entity linking on the subset of clusters that can be resolved, both systems achieve accuracy scores of close to 0.78, although spaCy is able to resolve far fewer mentions (499 vs 1127).
The importance of the character identification and coreference resolution tasks can be quantified by looking at the performance of the speaker attribution models (Table 3). The end-to-end pretrained BookNLP pipeline, when evaluated on the set of PDNC quotations (which were identified with an accuracy of 0.94), achieves an accuracy of 0.42. When we restrict the set of candidate mentions for each quotation to only those spans that can be resolved to a unique character entity, the attribution accuracy increases to 0.61. However, the RNN model still beats this performance with an accuracy of 0.72 on the random data split. When the contextual model is trained on data from PDNC, its accuracy improves to 0.78. These scores drop to 0.63 and 0.68 for the entire-novel split, where we have the disadvantage of being restricted only to patterns of mention sequences, and not speakers.

Analysis
We briefly go over some qualitative analyses of the errors made by models in the different sub-tasks, which serve to highlight the challenges presented by literary text and opportunities for future research.
Character Identification and Coreference Resolution: We manually examine the mention clusters identified by our coreference resolution modules that could not be matched to a unique character entity as annotated in PDNC. We find that, by far, the most common error is conflating characters with the same surname or family name within a novel. For example, several of the women characters in these novels are often referred to by the names of their husbands or fathers, prefixed with an honorific such as Mrs. or Miss. Thus Mrs. Archer refers to May Welland in The Age of Innocence and Miss Woodhouse refers to Emma Woodhouse in Emma. However, a surname without a title, such as Archer or Woodhouse, generally refers to the corresponding male character. This results in the formation of mention clusters that take the spans Miss Woodhouse and Woodhouse to be coreferent, despite being different character entities. We see similar issues with father-son character pairs, such as George Emerson and Mr. Emerson in A Room With A View, and with character pairs that are siblings.

Speaker Attribution: We first quantify the proportion of quotations attributed to a mention cluster that cannot be resolved to a named character entity with the end-to-end application of the BookNLP pipeline.
On average, 47.7% of identified quotations are assigned to an unresolved mention cluster as the speaker. This value ranges from as low as 12.5% (The Invisible Man) to as high as 78.7% (Northanger Abbey). A majority of these unresolved attributions occur with implicit and anaphoric quotations (76.2%), where the speaker is not explicitly indicated by a referring expression such as Elizabeth said, as opposed to explicit quotations (23.8%).
In Table 4, we break down the performance of the speaker attribution models by quotation type. We see that even our local context-based RNN model is able to identify the speaker of explicit quotations with a relatively high accuracy, and that the speaker for non-explicit quotations can also generally be modeled using the sequence of 5-6 characters mentioned in the vicinity of the quotation. The transformer-based models are of course able to use this local context more effectively by making use of linguistic cues and non-linear patterns of mentions and speakers in the surrounding text. Still, our best performing model achieves an accuracy of only 0.53 on implicit and anaphoric quotations when applied to novels unseen in the training set (the Novels split).

Conclusions and Future Work
In this work, we quantitatively evaluated the key components of a functional quotation attribution system. We showed that the initial task of recognizing characters and their aliases in a novel remains quite a challenge, but doing so greatly improves the performance of speaker attribution by limiting the set of candidate speakers. However, with existing coreference resolution systems, a large portion of mention clusters (around 90%) remain unresolved, so this remains a problem for new research.

Limitations
There is much variation in literary writing and narrative styles, and our work here deals with a small, curated subset of this domain. The novels we analyze are all in the English language, and were published between the early 19th and early 20th centuries. The authors and novels themselves are drawn from what is considered to be the established literary canon, and are not necessarily representative of all the works of that era, let alone literary works of other eras. The texts we analyze are largely uniform in narrative style. We limit ourselves to only those quotations that are explicitly indicated as such in the text by quotation marks, thereby eliminating more-complex styles such as free indirect discourse (Brooke et al., 2016b) and stream-of-consciousness novels. We do not deal with nuances such as letters and diary entries nor quotations within quotations. The models we analyze for named entity recognition and coreference resolution use a fixed, binary formulation of the gender information conveyed by pronominal terms. Though the development of fairer, more representative models is constrained by current datasets, we note that there is encouraging progress being made in this area (Bamman et al., 2020; Yoder et al., 2021).

Table 1: Character identification: Average scores across all the novels in the dataset. Column headings are defined in the text. Scores for each individual novel are reported in Appendix B.

Table 2: Coreference resolution: All scores are averaged over the 22 novels in PDNC. Column headings are defined in the text.

Table 4: Attribution accuracy for the speaker attribution models, broken down by quotation type, for the Quotations and Novels cross-validation splits. Column Exp. refers to explicit quotations, and column Rest refers to implicit and anaphoric quotations.

Table 5: Results of character identification for each novel with BookNLP and GutenTag. '#Chars' is the number of characters in the novel. Other headers are the same as in Table 1.

Table 6: Results of character identification for each novel with spaCy. '#Chars' is the number of characters in the novel. Other headers are the same as in Table 1.