AbstractWe propose a speaker clustering model for textual dialogues, which groups the utterances of a multi-party dialogue without speaker annotations, so that the actual speakers are identical inside each cluster. We find that, without knowing the speakers, the interactions between utterances are still implied in the text, which suggest the relations between speakers. In this work, we model the semantic content of utterance with a pre-trained language model, and the relations between speakers with an utterance-level pairwise matrix. The semantic content representation can be further instructed by cross-corpus dialogue act modeling. The speaker labels are finally generated by spectral clustering. Experiments show that our model outperforms the sequence classification baseline, and benefits from the auxiliary dialogue act classification task. We also discuss the detail of determining the number of speakers (clusters), eliminating the interference caused by semantic similarity, and the impact of utterance distance.