Symbolization, Prompt, and Classification: A Framework for Implicit Speaker Identification in Novels



Introduction
Speaker identification in novel dialogues, also known as dialogue attribution (Muzny et al., 2017; Cuesta-Lázaro et al., 2022), aims at identifying the speaking characters of utterances in fiction texts (Glass and Bangay, 2007; Elson and McKeown, 2010). It is an important task for various downstream applications like automatically assigning appropriate voices to utterances in audiobook production (Pan et al., 2021) and novel-to-script conversion (Soo et al., 2019). As dialogues serve as the major interaction between characters in novels, automatic identification of speakers can also be useful for novel-based knowledge mining tasks such as social network extraction (Jia et al., 2020) and personality profiling of the characters (Sang et al., 2022). Table 1 shows an example randomly sampled from the Chinese novel World of Plainness. For utterances U1, U2, U3, and U5, the speakers can be easily determined by recognizing explicit narrative patterns like "Said Wu Zhongping" in the previous or following sentence. U4 is an exception that does not fall into this explicit pattern; correctly predicting the speaker of U4 requires an understanding of the conversation flow. Although many speaker identification cases can be solved by recognizing narrative patterns, many complex examples still call for a deep understanding of the surrounding context. We refer to these complex examples as implicit speaker identification cases. They pose difficulties for existing speaker identification algorithms.
Most recent approaches for speaker identification (Chen et al., 2021; Cuesta-Lázaro et al., 2022; Yu et al., 2022) are based on pre-trained language models (PLMs) like BERT (Devlin et al., 2019). PLM-based methods enjoy the advantage of the PLM's internal linguistic and commonsense knowledge obtained from pre-training. However, two main downsides should be highlighted for these methods. On the one hand, some methods truncate a context into shorter textual segments before feeding them to PLMs (Chen et al., 2021; Cuesta-Lázaro et al., 2022). This approach inevitably introduces a bias to focus on short and local texts, and to identify speakers by recognizing explicit narrative patterns that usually appear in the local context of the utterance. It may fail in implicit speaker identification cases where such patterns are not available and long-range semantic understanding is indispensable. On the other hand, some methods adopt an end-to-end setting (Yu et al., 2022) that sometimes extracts uninformative speakers like personal pronouns. Besides, they only perform mention-level speaker identification, in which two extracted aliases of the same character are not taken as the same speaker. In recent months, large language models (LLMs) have been the most exciting development in the NLP community. Although LLMs show impressive zero-shot/few-shot capabilities on many benchmarks (Patel et al., 2023; Ouyang et al., 2022; Chung et al., 2022), how well they work for speaker identification remains unknown.
Drawing inspiration from the successful application of prompt learning and pattern-exploiting training in various tasks like sentiment analysis (Patel et al., 2023; Schick and Schütze, 2021), we propose a framework to identify speakers in novels via symbolization, prompt, and classification (SPC). SPC first symbolizes the mentions of candidate speakers to construct a unified label set for speaker classification. Then it inserts a prompt to introduce a placeholder for generating a feature for the speaker classifier. This approach minimizes the gap between the training objective of speaker identification and the pre-training task of masked language modeling, and helps leverage the internal knowledge of PLMs. In previous studies, the inter-utterance correlation in conversations was shown to be useful for speaker identification in sequential utterances (He et al., 2013; Muzny et al., 2017; Chen et al., 2021). SPC therefore also introduces auxiliary character classification tasks to incorporate supervision signals from speaker identification of neighborhood utterances. In this way, SPC can learn to capture the implicit speaker indication in a long-range context.
To measure the effectiveness of the proposed method and to test its speaker identification ability, SPC is evaluated on four benchmarks for speaker identification in novels: the web novel collection, World of Plainness, Jin-Yong novels, and the CSI dataset. Compared to previous studies that conduct experiments on merely 1-18 labeled books (Yu et al., 2022), the web novel collection dataset contains 230 labeled web novels. This dataset enables us to fully supervise the neural models and to analyze their performance at different training data scales. Experimental results show that the proposed method outperforms the best-performing baseline model by a large margin of 4.8% accuracy on the web novel collection. Besides, this paper presents a comparison with the popular ChatGPT, and results indicate that SPC outperforms it on two benchmarks. To facilitate reproduction of our results, we have released our source code.
To sum up, our contributions in this paper are three-fold: (1) We propose SPC, a novel framework for implicit speaker identification in novels via symbolization, prompt, and classification. (2) The proposed method outperforms existing methods on four benchmarks for speaker identification in novels, and shows superior cross-domain performance after being trained on sufficient labeled data.
(3) We evaluate ChatGPT on two benchmarks and present a comparison with the proposed method.

Related Work
In recent years, deep neural networks have shown great superiority in all kinds of NLP tasks (Kim, 2014; Chen and Manning, 2014), including text-based speaker identification. Candidate Scoring Network (CSN) (Chen et al., 2021) is the first deep learning approach developed for speaker identification and outperforms manual feature-based methods by a significant margin. For each candidate speaker of an utterance, CSN first encodes with BERT the shortest text fragment that covers both the utterance and a mention of the candidate speaker. Then the features for the speaker classifier are extracted from the output of BERT. In this way, the model learns to identify speakers by recognizing superficial narrative patterns instead of understanding the context. Cuesta-Lázaro et al. (2022) migrate a dialogue state tracking style architecture to speaker identification. It encodes each sentence separately with a DistilBERT (Sanh et al., 2019) before modeling cross-sentence interaction with a Gated Recurrent Unit (Chung et al., 2014). Then a matching score is calculated between each character and each utterance. However, this method still only utilizes the PLM to model local texts and results in poor performance. Yu et al. (2022) adopt an end-to-end setting that directly locates the span of the speaker in the context. It feeds the concatenation of the utterance and its context to a RoBERTa (Liu et al., 2019; Cui et al., 2021), after which the start and end tokens are predicted on the output hidden states. Yet under the end-to-end setting, only mention-level speaker identification is performed, in which two extracted aliases of the same character are taken as two different speakers.
Previous studies have shown the great advantage of applying deep-learning methods to speaker identification in novel dialogues, but these methods are either limited to recognizing superficial patterns or only identify speakers at the mention level. Our study proposes a method that identifies speakers based on context understanding and improves performance on implicit speaker identification problems where long-range semantic information is needed. The proposed method outperforms other competitors given sufficient labeled data, and is more efficient in data utilization.

Task Definition
Before we dive into the details of our proposed approach, it is necessary to declare the basic settings for the task of novel-based speaker identification. The sentences in the novel have been split, and each sentence is identified as either an utterance or a narrative. The names and aliases of the speaking characters in the novel have been collected in advance. The occurrences of the speaking characters in the novel are referred to as mentions. For the target utterance whose speaker we intend to identify, a selected context that covers the target utterance is extracted and denoted as ctx = {x_{-N1}, ..., x_{-1}, u, x_1, ..., x_{N2}}, where u denotes the target utterance and x_{-N1}, ..., x_{-1}, x_1, ..., x_{N2} denote the N1 + N2 sentences surrounding u. The speaker of the target utterance is assumed to be mentioned in the selected context, while the exceptions (which should be rare) are discarded from the dataset. Assume that m candidate speakers are located in ctx by matching the text in ctx to the speakers' names.
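As an illustration, the context selection and candidate matching described above can be sketched in a few lines of Python. This is our own minimal sketch: the helper names (`select_context`, `find_candidates`) and the `name_to_character` dictionary format are illustrative assumptions, not part of any released code.

```python
def select_context(sentences, target_idx, n1=10, n2=10):
    """Select up to n1 sentences before and n2 sentences after the target
    utterance, returning the window and the target's position inside it."""
    start = max(0, target_idx - n1)
    end = min(len(sentences), target_idx + n2 + 1)
    return sentences[start:end], target_idx - start

def find_candidates(context, name_to_character):
    """Locate candidate speakers by matching character names against the
    text of the selected context (first occurrence order)."""
    found = []
    for sent in context:
        for name, character in name_to_character.items():
            if name in sent and character not in found:
                found.append(character)
    return found
```

In practice the alias matching would be more careful (e.g., longest-match first over all known aliases), but the window-then-match order is the key point.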

Framework Introduction
Figure 1 shows the framework of the proposed SPC. SPC takes the selected context as input and generates the likelihood of character classification as its output. First, the mentioned characters in the input context are replaced with symbols. Then, a prompt with a placeholder (the [MASK]) is inserted to the right of the target utterance. After that, placeholders of auxiliary tasks are introduced into the context. At last, the PLM encodes the processed context and classifies the characters at each placeholder.

Character Symbolization
Character symbolization unifies the candidate speaker sets in different contexts, after which speaker identification can be formulated as a classification task. We assign a unique symbol C_j (j = 1, ..., m) to each candidate character mentioned in ctx, and replace the mentions of each character with its corresponding symbol in ctx. Note that this mapping from characters to symbols is only used for the specific selected context rather than for the whole novel, so as to reduce the number of classification labels. The character symbols form a local candidate set CS = {C_1, ..., C_m}. Let M be the maximum number of candidate characters. C_1, ..., C_M have been added to the special token vocabulary of the PLM in advance. The embeddings of these special tokens are randomly initialized and jointly optimized with the other model parameters.
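A minimal sketch of character symbolization might look as follows. The helper name `symbolize` and the alias-dictionary format are our own illustrative assumptions; the concrete symbol strings stand for the special tokens added to the PLM vocabulary, and the real alias matching may be more sophisticated.

```python
def symbolize(context, aliases_by_character, max_candidates=50):
    """Replace each mentioned character's aliases with a context-local symbol.

    aliases_by_character maps a canonical character id to its known aliases;
    the symbol mapping is local to this context only, which keeps the label
    set small (at most max_candidates labels).
    """
    symbol_of = {}
    out = []
    for sent in context:
        for char_id, aliases in aliases_by_character.items():
            # try longer aliases first so "Tommy" is not clobbered by "Tom"
            for alias in sorted(aliases, key=len, reverse=True):
                if alias in sent:
                    if char_id not in symbol_of:
                        if len(symbol_of) >= max_candidates:
                            continue
                        symbol_of[char_id] = f"[C{len(symbol_of) + 1}]"
                    sent = sent.replace(alias, symbol_of[char_id])
        out.append(sent)
    return out, symbol_of
```

Because the mapping is rebuilt per context, the same character can receive different symbols in different examples, which is exactly what makes the label set unified across contexts.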

Prompt Insertion
Next, we insert into the selected context a prompt p = "[prefix] ___ [postfix]", right after the target utterance. In the inserted prompt, "___" is the placeholder we aim to classify, while [prefix] and [postfix] are manually crafted strings on both sides of the placeholder. In practice, we choose "(" and "说了这句话)" for [prefix] and [postfix] respectively, which combine to mean "(___ said these words)" in English. The [MASK] token in the PLM's vocabulary is used for the placeholder.
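Concretely, prompt insertion amounts to splicing one extra sentence into the context. A sketch, assuming the context is held as a list of sentence strings:

```python
def insert_prompt(context, target_rel_idx,
                  prefix="(", postfix="说了这句话)", mask="[MASK]"):
    """Insert the prompt "([MASK] said these words)" right after the
    target utterance at position target_rel_idx."""
    prompt = f"{prefix}{mask}{postfix}"
    return context[:target_rel_idx + 1] + [prompt] + context[target_rel_idx + 1:]
```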

Speaker Classification based on a PLM
Then we feed the processed context to a PLM that has been pre-trained on masked language modeling (MLM) to classify the missing character at the placeholder. Specifically, we utilize the pre-trained MLM head of the language model to produce a scalar score s_j (j = 1, ..., m) for each candidate character C_j. At the training stage, as our goal is to assign the correct speaker a higher score than the other candidates, we choose margin ranking loss to guide the optimization. Assume the correct speaker of the target utterance is C_j; then C_j is paired with each candidate C_k in CS\{C_j}. The speaker identification loss of u is calculated on the scores of each candidate pair as L_SI = (1/(m-1)) Σ_{C_k ∈ CS\{C_j}} L_mgn(s_j, s_k), with L_mgn(s_j, s_k) = max(0, mgn - (s_j - s_k)), (3) where mgn is the ideal margin between the scores of the two candidates. At the inference stage, the candidate assigned the highest score is identified as the speaker.
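A plain-Python sketch of the training loss and inference rule, under our reading of the description above (one margin term per candidate pair, averaged over pairs):

```python
def margin_ranking_loss(scores, gold_idx, mgn=1.0):
    """Average margin ranking loss pairing the gold speaker with every
    other candidate; each pair contributes max(0, mgn - (s_gold - s_other))."""
    s_gold = scores[gold_idx]
    others = [s for i, s in enumerate(scores) if i != gold_idx]
    return sum(max(0.0, mgn - (s_gold - s)) for s in others) / len(others)

def predict(scores):
    """At inference, the highest-scoring candidate is taken as the speaker."""
    return max(range(len(scores)), key=lambda i: scores[i])
```

When the gold score already exceeds every other score by at least the margin, the loss is zero; otherwise each violating pair contributes linearly.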

Auxiliary Character Classification
Based on the classification task form, we designed two auxiliary character classification tasks: 1) neighborhood utterances speaker classification (NUSC), to utilize the speaker alternation pattern between adjacent utterances, and 2) masked mention classification (MMC), to alleviate the excessive reliance on explicit narrative patterns in the neighborhood. NUSC teaches the model to classify the speakers of the utterances neighboring the target utterance. In this way, the model learns to utilize the dependency between the speakers of neighboring utterances. MMC randomly masks the character mentions in the sentences neighboring the target utterance and quizzes the model on classifying the masked characters. It corrupts explicit narrative patterns like "Tom said", which usually exist in the sentences neighboring the utterance, and guides the model to utilize long-range semantic information.
We relabel the target utterance as u_i and its previous and following utterances as u_{i-1} and u_{i+1}. To perform NUSC, the prompt p is also inserted after u_{i-1} and u_{i+1}, as long as these two utterances and their speakers are covered in ctx. Then, the model classifies the speakers of u_{i-1} and u_{i+1} as described in Section 3.5. The loss introduced by NUSC is the average speaker classification loss of the two neighborhood utterances: L_NUSC = (L_SI(u_{i-1}) + L_SI(u_{i+1})) / 2. For MMC, we randomly mask the character mentions by a probability in the previous and following sentences of the target utterance (i.e., x_{-1} and x_1), provided the masked characters are mentioned somewhere other than these two sentences. The model is required to predict the masked characters. Let the indexes of the masked characters be MC; then the average classification loss over the masked mentions is L_MMC = (1/|MC|) Σ_{k ∈ MC} L_mgn(k), in which L_mgn is the same margin ranking loss function as shown in Eq. (3).
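The MMC masking step can be sketched as follows. `mention_sents` (mapping each character symbol to the sentence indexes where it appears) is a hypothetical helper structure of ours, and the eligibility rule mirrors the constraint above: a character may only be masked if it is also mentioned somewhere other than the two adjacent sentences.

```python
import random

def mask_mentions(sentences, target_idx, mention_sents,
                  mask_prob=0.5, mask="[MASK]", rng=None):
    """Randomly mask character symbols in the sentences adjacent to the
    target utterance (x_-1 and x_+1), keeping only characters that remain
    mentioned elsewhere in the context."""
    rng = rng or random.Random(0)
    adjacent = {target_idx - 1, target_idx + 1}
    out = list(sentences)
    for sym, idxs in mention_sents.items():
        if not (idxs - adjacent):
            continue  # not mentioned outside the adjacent sentences: skip
        for i in idxs & adjacent:
            if 0 <= i < len(out) and rng.random() < mask_prob:
                out[i] = out[i].replace(sym, mask)
    return out
```

At inference no masking is applied, so the explicit narrative patterns stay intact.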
After incorporating NUSC and MMC into the training objective, the loss function becomes L = L_SI + α L_NUSC + β L_MMC, (6) where α and β control the strength of the supervision signals from NUSC and MMC respectively. At the inference stage, we do not mask the character mentions in x_1 and x_{-1}, so as to retain the explicit narrative patterns. The prompt is still inserted after u_{i-1} and u_{i+1} at the inference stage to keep training-inference consistency.

Datasets
We conducted experiments on four Chinese speaker identification datasets: the web novel collection (WN), World of Plainness (WP) (Chen et al., 2019), Jin-Yong novels (JY) (Jia et al., 2020), and the end-to-end Chinese Speaker Identification dataset (CSI) (Yu et al., 2022). WN is a large internal speaker identification dataset; its annotation details can be found in Appendix D. WN and CSI both consist of web novels of various genres and writing styles, while WP and JY are printed literature with more normative writing. WP and JY can therefore serve as cross-domain evaluation datasets for WN. As no test data was provided in CSI, we use the development set of CSI as the test data and randomly sampled 10% of instances from the training set for validation. The number of novels and utterances in each subset of each dataset is shown in Table 2. There are no overlapping novels across the different subsets of WN.
For clarity, each subset of a dataset is named as "dataset-subset", e.g., "WN-train" for the training set of WN. To evaluate the impact of a smaller training set, we also sampled 5 novels from WN-train to make a smaller training set with about 31k utterances. To distinguish between the whole training set and the sampled one, we refer to them as WN-large and WN-small respectively.

Baselines
We compared SPC to the following baselines. CSN (Chen et al., 2021) feeds a candidate-specific textual segment to the PLM and then outputs the likelihood of the corresponding candidate. A revision algorithm based on the speaker alternation pattern is adopted to revise the speaker identification results in two-party dialogues. DST_SI (Cuesta-Lázaro et al., 2022) encodes each sentence separately with a DistilBERT before modeling cross-sentence interaction with a GRU. The embeddings of the character mention and the utterance are dotted to obtain the likelihood.
A CRF is employed to model the dependency between speakers of neighborhood utterances. Due to limited GPU memory, we only use a base model for DST_SI. E2E_SI (Yu et al., 2022) feeds the concatenation of the utterance and its context to the PLM, after which the start and end tokens are predicted on the output hidden states. GPT-3.5-turbo is based on GPT-3 (Patel et al., 2023) and aligned to human preference by reinforcement learning from human feedback (Ouyang et al., 2022). We prompted the model with the format "{context}#{utterance}#The speaker of this utterance is:". We used a tolerant metric in which the response is considered correct if the true speaker's name is a substring of the response. We only evaluated this baseline on WP and JY due to the expensive API costs.
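For reference, the query format and the tolerant evaluation criterion for the GPT-3.5-turbo baseline can be expressed in a few lines; this is a sketch of our own, and the actual API call is omitted.

```python
def build_query(context, utterance):
    """Assemble the prompt sent to the LLM baseline."""
    return f"{context}#{utterance}#The speaker of this utterance is:"

def tolerant_correct(response, true_speaker):
    """A response counts as correct if the true speaker's name is a
    substring of the model's response."""
    return true_speaker in response
```

As noted later, this criterion is lenient: a response naming several characters is counted correct as long as it contains the true speaker's name.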

Implementation Details
As context selection plays a critical part in speaker identification, we detail the context selection procedures for the methods we implemented. For SPC, we selected a 21-sentence context window surrounding the target utterance, which corresponds to N1 = N2 = 10 in Section 3.1. If the context window exceeds the PLM's length limit (512 tokens for RoBERTa), we truncate the context window to fit the input length requirement. Since DST_SI is not open source, we implemented it ourselves. We followed their paper and segmented conversations by restricting the number of intervening narratives between utterances to 1. We further included the previous and following 10 sentences of each conversation, and limited the maximum number of involved sentences to 30. More details can be found in Appendix A.
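One way to implement the truncation is to drop whole sentences from the two ends of the window, alternating sides, until the token budget fits while keeping the target utterance. The exact truncation strategy in our code may differ, so treat this as a sketch under that assumption:

```python
def truncate_around_target(tokens_per_sentence, target_rel_idx, max_tokens=512):
    """Return the (left, right) sentence indexes of the largest window that
    fits max_tokens, trimming whole sentences from alternating ends while
    always keeping the target utterance."""
    left, right = 0, len(tokens_per_sentence) - 1
    total = sum(tokens_per_sentence)
    drop_left = True
    while total > max_tokens and left < right:
        if drop_left and left < target_rel_idx:
            total -= tokens_per_sentence[left]
            left += 1
        elif right > target_rel_idx:
            total -= tokens_per_sentence[right]
            right -= 1
        else:
            break
        drop_left = not drop_left
    return left, right
```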

Overall Results
We tuned and evaluated the models on the same dataset (in-domain), or tuned the models on WN and evaluated them on WP and JY (cross-domain).
Note that although we compared zero-shot GPT-3.5-turbo to other cross-domain results, it had not been tuned on any data. The released CSI dataset masked 10% of tokens due to copyright issues, so we collected E2E_SI's performance on the masked CSI from the GitHub page of CSI. Validation/Test accuracies are shown in Table 3. We will mainly discuss the test results and leave the validation results for reference. First, we can conclude from the table that RoBERTa-large performed better than RoBERTa-base and BERT-base for the same method. Regardless of the specific PLM, the comparative relationship between different methods remains consistent, so we mainly focus on the performance of different methods based on RoBERTa-large. SPC based on RoBERTa-large consistently performed better than or comparably to all non-LLM baselines in both in-domain and cross-domain evaluation. In the in-domain evaluation on WN, SPC outperformed the best opponent CSN by 4.8% and 3.9% when trained on WN-large and WN-small, achieving overall accuracies of 94.6% and 90.0%. These are remarkable improvements, as the errors are reduced by 47% and 28%. As WN-test includes 70 web novels of various genres, we believe it reflects general performance on web novels. In the in-domain evaluation on WP, which is the only dataset evaluated in the CSN paper (Chen et al., 2021), SPC still outperformed CSN by 1.0%. We observed that MMC might cause serious overfitting on very small datasets like WP-train, so we did not adopt MMC for WP-train. In cross-domain evaluations, SPC also consistently outperformed all non-LLM baselines, which shows its better generalizability to novels of unseen domains.
Although GPT-3.5-turbo underperformed WN-large-tuned SPC, its zero-shot performance is still remarkable. In comparison, GPT-3.5-turbo has a much larger number of parameters and benefits from its vast and general pre-training corpus, while SPC excelled by effectively tuning on an adequate domain-specific corpus. It is worth mentioning that the response of GPT-3.5-turbo may contain more than one name, e.g., "Jin-bo's sister, Xiu" and "Runye's husband (Xiang-qian)". These responses may fool our evaluation criterion, as the response is only required to cover the true speaker.

Ablation study
We conducted ablation studies based on SPC to investigate the effect of each module, with results shown in Table 4.
We first removed ACC from SPC. As can be seen in the table, removing ACC dropped the evaluation accuracies by 0.8% and 0.6% on WN, by 0.2% on JY, and by 3.7% on WP. This indicates that the auxiliary character classification tasks are beneficial for improving speaker identification performance. Only the NUSC task was applied when training on WP-train, and it contributed a lot to the performance on WP-test. We think this is because the writing of WP is more normative than the writing of the novels in WN. The sequential utterances in WP usually obey the speaker alternation pattern, which can be easily learned and utilized.
We further ablated prompting. To this end, we did not insert the prompt but extracted CSN-style features from the output of the PLM to produce the likelihood of each candidate speaker. After ablating the prompt-based architecture, the performance of models trained on WN-large decreased by 0.6%, whereas those trained on WN-small and WP-train decreased drastically by 4.8% and 11.9%. This shows that prompting helps boost performance in low-resource settings and verifies our starting point for developing this approach: prompting closes the gap between the training objective of speaker identification and the pre-training task, which can help the PLM understand the task and leverage its internal knowledge. JY is the exception in which performance did not degrade after this ablation, although its training set only contains 15k samples. We believe this is because JY is too easy and lacks challenging cases to discriminate between different ablations.
To gain insight into how the performance of SPC and its ablated methods varies with the scale of training data, we trained them on varying numbers of novels sampled from WN-large and evaluated their performance on WN-test. To facilitate comparison, we performed a similar evaluation on CSN. As is shown in Figure 2, SPC outperformed its ablated variants especially in low-data scenarios, and the improvement became less significant as training utterances increased.
The deeper ablated SPC (SPC w/o ACC & prompt) initially under-performed CSN, but it overtook CSN when the training data reached about 150k utterances. In comparison, CSN did not benefit much from the increment of training data. As a possible explanation, the deeper ablated method, using a longer context window (445.5 tokens on average), gradually learned to handle implicit speaker identification cases that require long-range context understanding after being trained on more data, whereas CSN, using a short context window (131.3 tokens on average), still failed in these cases. This also suggests that generally more data is needed to train models that take a longer context window as input. However, with prompting and ACC, SPC overcomes this difficulty and learns to identify speakers in a long context window with a much smaller data requirement. As is shown in Figure 2, SPC took merely 31k utterances to reach an accuracy of 90%, while about 180k utterances were needed by its deeper ablated method to attain the same performance.

Implicit Speaker Identification Analysis
In this section, we investigate whether our proposed method shows a stronger ability to identify implicit speakers than methods using a shorter context window, like CSN.
Intuitively, if the speaker's mention is close to the utterance, then the speaker of this utterance can probably be identified by recognizing explicit narrative patterns like "Tom said". On the contrary, utterances with distant speaker mentions are typically implicit speaker identification cases and require context understanding. We calculated the speaker identification accuracies of the utterances in WN-test, categorizing them based on the proximity of each utterance to the closest mention of its speaker. The comparison between SPC and CSN is shown in Figure 3. The term "sentence distance" refers to the absolute value of the utterance's sentence index minus the index of the sentence where the speaker's closest mention is found.
It can be seen from the figure that, as the sentence distance increased, both SPC and CSN identified speakers less accurately. At sentence distance = 1, both models performed comparably and achieved accuracies above 95%. However, when the sentence distance came to 2, the identification accuracy of CSN drastically dropped to 74%. Utterances with sentence distance > 1 can be regarded as implicit speaker identification cases, so CSN is not good at identifying the speakers of these utterances. In contrast, SPC maintained accuracies over 80% until the sentence distance reached 5, and consistently outperformed CSN by 8.4% to 13.4% for sentence distances greater than 1. This verifies that SPC has a better understanding of the context than CSN, and thus can better deal with implicit speakers. The proportion of test utterances at each sentence distance is also shown in Figure 3: 81.1% of the utterances are situated at sentence distance = 1, elucidating the reason behind CSN's commendable overall performance despite its incapacity to handle implicit speakers.
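The grouping metric used in this analysis can be reproduced with a short script; `bucket_accuracy` is an illustrative helper name of ours.

```python
def sentence_distance(utt_idx, mention_idxs):
    """Distance from the utterance to the closest mention of its true
    speaker: |utterance sentence index - closest mention sentence index|."""
    return min(abs(utt_idx - m) for m in mention_idxs)

def bucket_accuracy(examples):
    """Group (distance, correct) pairs by distance and compute per-distance
    accuracy in percent."""
    hits, counts = {}, {}
    for dist, correct in examples:
        counts[dist] = counts.get(dist, 0) + 1
        hits[dist] = hits.get(dist, 0) + int(correct)
    return {d: 100.0 * hits[d] / counts[d] for d in counts}
```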
We also conducted a case study on the implicit speaker identification cases in WN-test, with a few translated examples provided in Appendix C. SPC did perform better than CSN in many challenging speaker identification cases.

Conclusions
In this paper, we propose SPC, an effective framework for implicit speaker identification in novels via symbolization, prompt, and classification. Experimental results show SPC's superiority on four speaker identification benchmarks and its remarkable cross-domain capability. Furthermore, SPC significantly outperforms the strongest competitor CSN in implicit speaker identification cases that require deeper context understanding. We also evaluate ChatGPT on two speaker identification benchmarks and present a comparison with SPC. In the future, we hope to harness LLMs with longer input length limits to further improve speaker identification performance.

Limitations
Although SPC has proved its effectiveness in novel-based speaker identification, we consider two aspects that can be further improved. First, we only implemented SPC on small base models containing less than 1 billion parameters. In light of the swift progress in LLMs, investigating the full potential of these advanced LLMs holds significant value and promise for future advancements. Second, in real-world applications, our approach operates on the output of a character name extraction module which might produce incorrect results. Thus, it is worth studying how to improve the robustness of the speaker identification approach and make it less vulnerable to errors in the character names.

We adopted Chinese RoBERTa-wwm-ext-large and RoBERTa-wwm-ext-base for CSN and SPC, while only the base model was adopted for DST_SI due to memory limitation. The large and base models contain 355M and 125M parameters respectively. We used the BertTokenizer in the Transformers toolkit to tokenize the texts.

A Implementation Details
For SPC, we selected "(___说了这句话)" as the prompt, which means "(___ said these words)" in English. The maximum number of candidate speakers M was set to 50. The training criterion of SPC has been described in Section 3, with the ideal margin mgn set to 1.0. The auxiliary character classification loss factors α and β in Eq. (6) were both set to 0.3. The mask probability for the masked mention classification task was 0.5. The optimization criterion for DST_SI was minimizing the negative log-likelihood of the true speaker sequence. The Adam optimizer (Kingma and Ba, 2015) was employed for optimization, and the learning rate decayed every epoch by a multiplicative factor of 0.98. The best models were selected by validation accuracy. Other hyper-parameters are shown in Table 5.
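The per-epoch multiplicative decay corresponds to the simple schedule below (equivalent to PyTorch's `ExponentialLR` with `gamma=0.98`); the base learning rates themselves are listed in Table 5.

```python
def decayed_lr(base_lr, epoch, factor=0.98):
    """Learning rate after multiplicative per-epoch decay:
    lr(epoch) = base_lr * factor ** epoch."""
    return base_lr * factor ** epoch
```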

B.1 Using Different Prompts
We tried different prompt templates to see how they affect speaker identification performance. As can be seen from Table 7, different prompt templates do not affect the performance much, but using an empty prompt hurts the performance. This indicates SPC's insensitivity to prompt selection.

B.2 Using N-neighborhood Utterances
We conducted pilot experiments on using different numbers of neighborhood utterances for neighborhood utterance speaker classification (NUSC). We trained the models on WN-small, showing their validation performance in Figure 4. It is clear that using 1-neighborhood utterances for NUSC (our setting for SPC) brings some improvement compared to 0-neighborhood (not applying NUSC). However, extending to more neighborhood utterances does not bring further improvements. A possible explanation is that the dependency between neighborhood utterances mostly lies in adjacent utterances, instead of distant ones.

C Case Study

Figure 5 shows three translated examples from WN-test, with the speaker identification results of different methods listed at the bottom of each example. They are all implicit speaker identification examples without explicit narrative patterns. We observed SPC did much better than CSN in solving these challenging speaker identification cases that require deeper context understanding. Example 3 shows that the auxiliary character classification tasks (ACC) help the model to better capture the inter-utterance dependency.

D Web Novel Collection Details
The web novel collection was annotated on 230 web novels by a group of Chinese native annotators.
The utterances in the novels were first identified based on quotes.Then, the annotators were instructed to mark the name of the speaker in the neighborhood context of each utterance, and the names of other characters were also marked if found.
Copyrights of the web novels belong to their respective proprietors. The authors are allowed to use the data for research purposes and should follow the principle of fair use. The data annotators are a group of employed professional data annotators, each with at least a bachelor's degree. Their wages are in line with local regulations. Before the novels were handed to the annotators, they had been reviewed by a group of content reviewers to filter out any offensive information, including violence, terror, abuse, etc. The annotation instruction informed the annotators of the potential use of the annotated data for speaker identification research.

Figure 1 :
Figure 1: The framework of SPC. The block in green represents the target utterance, while the blocks in blue represent the inserted prompt templates. The uncolored blocks are other sentences in the selected context. The mentions of candidate speakers are also colored.
The score is computed from the output hidden state of the PLM corresponding to the placeholder, where d is the hidden size of the PLM. W_T ∈ R^{d×d} and b_T ∈ R^d are the weight and bias of the linear transform layer. W_D ∈ R^{M×d} and b_D ∈ R^M are the weight and bias of the decoder layer. W_T and b_T are pre-trained along with the PLM, while W_D and b_D are randomly initialized. s = [s_1, ..., s_M] is the output score vector, in which the first m scores correspond to the m candidates in CS.

Figure 2 :
Figure 2: The speaker identification accuracies (%) of SPC, CSN, and the ablated methods of SPC on WN-test, trained on increasing numbers of utterances sampled from WN-large. The number in parentheses denotes the number of novels used for training.

Figure 3 :
Figure 3: The speaker identification accuracies (%) of utterance groups divided by the sentence distance between the utterance and its speaker. The numbers in parentheses along the x-axis are the proportions (%) of utterances in the groups.

Figure 5 :
Figure 5: Three translated examples from WN-test, with the speaker identification results of different methods listed at the bottom of each example.

Table 1 :
A translated example from the Chinese novel World of Plainness. U1-U5 are utterances and N1-N4 are narratives.

Table 3 :
Validation/Test accuracies (%) of SPC and the baselines (†: numbers reported in (Yu et al., 2022); §: number reported in (Chen et al., 2021)). The results of GPT-3.5-turbo were obtained under a zero-shot setting without using any training data. The cross-domain validation accuracies were obtained on WP-val/JY-val. The highest test accuracy in each column is emphasized in bold.

Table 4 :
Evaluation accuracies (%) of SPC and its ablated methods. The indentation indicates that each ablation is based on the previous ones. ACC stands for auxiliary character classification.

Table 5 :
Hyper-parameters for each method. Some parameters took different values for WN/WP/JY/CSI.

GPU hours consumed in training are listed in Table 6. Some of the parameters took different values for WN, WP, JY, and CSI respectively. On WN-large, the training of each investigated model converges in 30 to 50 epochs.

Table 6 :
GPU hours consumed in training one epoch on WN-large.

Table 7 :
Validation accuracies (%) on WN-val of different prompt templates. The English translations of meaningful prompts are provided. "-" represents an empty string.