M³Seg: A Maximum-Minimum Mutual Information Paradigm for Unsupervised Topic Segmentation in ASR Transcripts

Topic segmentation aims to detect topic boundaries and split automatic speech recognition (ASR) transcriptions (e.g., meeting transcripts) into segments, each bounded by a coherent thematic meaning. In this work, we propose M³Seg, a novel Maximum-Minimum Mutual information paradigm for linear topic segmentation without using any parallel data. Specifically, using sentence representations provided by pre-trained language models, M³Seg first learns a region-based segment encoder by maximizing the mutual information between the global segment representation and the local contextual sentence representations. Second, an edge-based boundary detection module splits the whole transcript into topics by minimizing the mutual information between different segments. Experimental results on two public datasets demonstrate the effectiveness of M³Seg, which outperforms state-of-the-art methods by a significant margin (18%–37% improvement).


Introduction
Automatic speech recognition (ASR, also known as computer speech recognition or speech-to-text) (Rabiner and Juang, 1993; Graves and Jaitly, 2014) has brought us great convenience by transcribing conversations into text anytime and anywhere to aid human understanding. However, the generated unstructured transcriptions are sometimes too lengthy for users to grasp a high-level meaning quickly. Meeting transcripts are usually long and contain multiple heterogeneous topics in various structures, such as opening sessions, different discussion subjects, and closing sections. As one or more topics usually drive conversations or discussions, topic segmentation (Labadié and Prince, 2008) can improve the readability of transcriptions and facilitate downstream long-text tasks such as meeting summarization, passage retrieval, and automated article generation (Mohri et al.; Feng et al.).

Figure 1: Illustration of M³Seg, which first learns segment representations by mutual information maximization, and then segments the entire ASR transcript by mutual information minimization. PLM refers to pre-trained language models, and dotted red lines indicate topic changes.
However, acquiring annotated training data to support topic segmentation is expensive. Unsupervised methods have therefore attracted attention due to their lesser reliance on annotated data. Most existing unsupervised works can be categorized into edge-based and region-based methods. Edge-based methods aim to detect discontinuities (i.e., topic changes) at points in a sequence where semantic features change rapidly, based on word frequency (Hearst, 1997), embeddings from pre-trained language models (Solbiati et al., 2021), etc. (Choi, 2000). Region-based methods aggregate neighbouring sentences through a homogeneity criterion, such as latent Dirichlet allocation (Riedl and Biemann, 2012) or perplexity calculated by a PLM (Feng et al.). Nevertheless, these methods rely on capturing local sentence information rather than global topic information to perform segmentation, which may make them sensitive to noise and leave a gap relative to the ground-truth semantic boundaries.
In contrast to prior works, which fail to capture the global semantic-level topic information of different segments, we propose a novel Maximum-Minimum Mutual information paradigm for unsupervised topic segmentation (M³Seg). Our work is inspired by mutual information, which measures how much one random variable tells us about another. Intuitively, sentences in the same segment should depend on the same topic (high mutual information), while different topic segments should be independent of each other (low mutual information). Based on this insight, M³Seg divides an ASR transcript into topic segments through a two-stage process: region-based segment modeling and edge-based boundary detection. The first stage learns a segment encoder by maximizing the mutual information between the global segment representation and each local contextualized sentence representation (provided by pre-trained language models). The second stage detects topic boundaries by minimizing the mutual information between different segments.
Experimental results on two widely-used benchmark datasets show that M³Seg consistently surpasses five existing methods by a wide margin. We also conduct ablation studies to demonstrate the effectiveness of the proposed model components and to show the great utility of mutual information in this task.

Method
Formally, let s denote a meeting transcript produced by an automatic speech recognition (ASR) system, consisting of a list of n utterances s = {s_1, s_2, …, s_n}. Topic segmentation can be seen as a problem of topic change detection: it aims to cut the transcript into consecutive segments {s_{1:i−1}, s_{i:j}, …, s_{k:n}} based on the underlying topic structure. Each segment s_{i:j} = {s_i, …, s_j} covers the i-th through j-th utterances of the transcript.
For example, an hour-long meeting transcript can be broken down into different topic segments (e.g., opening sessions, different discussion subjects, and closing remarks) to make it more readable. Note that we cannot access gold-standard segments, as human annotations do not exist.
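To make the formulation concrete, the following toy sketch (ours, not the authors' code) shows how a list of boundary indices induces consecutive segments s_{1:i−1}, s_{i:j}, … over an utterance list; the function name and example transcript are illustrative only.

```python
# Toy illustration: a segmentation is fully determined by boundary indices.
# A boundary at position i means a new topic segment starts at utterance i
# (0-indexed). The helper below is a hypothetical name, not the paper's API.

def split_by_boundaries(utterances, boundaries):
    """Split a list of utterances into segments given boundary indices."""
    cut_points = sorted(boundaries) + [len(utterances)]
    segments, start = [], 0
    for end in cut_points:
        if end > start:
            segments.append(utterances[start:end])
        start = end
    return segments

transcript = ["hi all", "agenda today", "budget is tight", "cut costs",
              "next item", "hiring plan"]
# hypothetical predicted topic changes before utterances 2 and 4
print(split_by_boundaries(transcript, [2, 4]))
```

This inverse mapping (boundaries to segments) is what the boundary detection stage ultimately produces.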
Model Overview As aforementioned, given a transcript s = {s_1, s_2, …, s_n}, topic segmentation seeks to split it into several topic segments, where the sentences in each segment belong to the same topic (requirement i) but different segments represent relatively independent topics (requirement ii). Inspired by this idea, we propose a maximum-minimum mutual information paradigm for topic segmentation (titled M³Seg) in an unsupervised manner, which learns segment representations by maximizing mutual information and partitions different segments by minimizing mutual information.
Mutual information (MI) measures the dependence between two random variables (Shannon, 1948). Given two random variables a and b, the MI between them is I(a; b) = Σ_{a,b} P(a, b) log [P(a|b) / P(a)]. Intuitively, I(a; b) measures the degree to which a reduces the uncertainty in b, or vice versa; for example, I(a; b) = 0 when a and b are independent. Therefore, from the view of MI, different topic segments should be as independent of each other as possible (i.e., MI minimization), while sentences within the same topic segment should depend on the same topic as much as possible (i.e., MI maximization). Based on this insight, M³Seg consists of two stages (as depicted in Figure 1):

(1) Segment Modeling Based on MI Maximization. Intuitively, knowing a sentence reduces the uncertainty of its corresponding topic. Thus, we train a segment encoder E to learn the global segment representation y_{i:j} = E_θ(r_{i:j}) of the segment s_{i:j} by maximizing the MI between it and each of its local sentence representations r_k, k ∈ [i, j] (requirement i):

J_SM = max_θ Σ_{k∈[i,j]} I(y_{i:j}; r_k). (1)

Given an ASR transcript, we first use a pre-trained language model (PLM) to obtain the contextualized representation r_i ∈ R^d of each sentence (Peters et al., 2018), where d is the representation dimension of the PLM. In our case, r_i is computed by applying a mean-over-time pooling layer to the last-layer token representations of a RoBERTa-base (Liu et al., 2019) model. We then use the segment encoder E to obtain the segment representation y_{i:j} ∈ R^d of the text segment s_{i:j}: y_{i:j} = E_θ(r_{i:j}), where θ denotes the parameters of E.
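The MI definition above can be sanity-checked on discrete variables, where it is computable exactly. The snippet below (purely illustrative; the paper estimates MI with neural lower bounds instead) verifies that MI vanishes for independent variables and equals log 2 for perfectly correlated binary variables.

```python
# Minimal numpy check of the MI definition:
# I(a; b) = sum_{a,b} P(a,b) * log(P(a|b) / P(a))
#         = sum_{a,b} P(a,b) * log(P(a,b) / (P(a) P(b))).
import numpy as np

def mutual_information(joint):
    """Exact MI (in nats) of two discrete variables from their joint table."""
    pa = joint.sum(axis=1, keepdims=True)   # marginal P(a)
    pb = joint.sum(axis=0, keepdims=True)   # marginal P(b)
    mask = joint > 0                        # skip zero-probability cells
    return float((joint[mask] * np.log(joint[mask] / (pa * pb)[mask])).sum())

independent = np.outer([0.5, 0.5], [0.5, 0.5])   # product of marginals: I = 0
correlated = np.array([[0.5, 0.0], [0.0, 0.5]])  # a determines b: I = log 2
print(mutual_information(independent), mutual_information(correlated))
```

These two extremes mirror the paper's two requirements: within-segment dependence (high MI) and cross-segment independence (zero MI).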
The segment encoder E is trained by maximizing the MI between the global segment representation and each of its local sentence representations. However, MI estimation is generally intractable for continuous, high-dimensional random variables, so we instead maximize the InfoNCE (Logeswaran and Lee, 2018) lower-bound estimator of Eq. 1. Following Kong et al. (2020), maximizing InfoNCE is analogous to minimizing a standard cross-entropy loss:

L_SM = −log [ exp(y_{i:j} · y⁺_{i:j}) / Σ_{y⁻} exp(y_{i:j} · y⁻_{i:j}) ],

where y⁺_{i:j} ∈ r_{i:j} is sampled from the sentence representations within the segment s_{i:j}, and y⁻_{i:j} is sampled from the rest. We use the dot product between embeddings (denoted ·) to measure similarity in the vector space. Note that we only train the segment encoder E, while the PLM's parameters remain fixed.
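The InfoNCE-style objective can be sketched in a few lines of numpy: the segment representation y should score its positive sentence representation higher than negatives under a dot-product softmax. Shapes, names, and the random toy vectors below are illustrative, not the authors' implementation.

```python
# Hedged sketch of an InfoNCE-style loss: cross-entropy of identifying the
# positive among {positive} ∪ negatives via dot-product scores.
import numpy as np

def info_nce_loss(y, pos, negs):
    """Negative log-probability of `pos` under a softmax over dot products."""
    scores = np.array([y @ pos] + [y @ n for n in negs])
    scores -= scores.max()                     # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return -log_softmax[0]                     # positive sits at index 0

rng = np.random.default_rng(0)
y = rng.normal(size=8)
# easy case: positive aligned with y, negatives random
loss_easy = info_nce_loss(y, pos=y, negs=[rng.normal(size=8) for _ in range(4)])
# hard case: positive anti-aligned, negatives identical to y
loss_hard = info_nce_loss(y, pos=-y, negs=[y for _ in range(4)])
print(loss_easy < loss_hard)   # an aligned positive yields a lower loss
```

Minimizing this loss over random segments drives E toward representations that are predictive of every sentence inside the segment, which is the MI-maximization intuition of Eq. 1.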
Our assumption is that, in a meeting transcript, any continuous sequence of utterances carries thematic information, which can be either coarse-grained or fine-grained. For example, the entire meeting's utterances may be driven by a certain motivation for convening (a coarse-grained theme), while if divided into fine-grained segments, each group of utterances can be associated with a certain issue (a fine-grained theme). Based on this assumption, we construct training data from arbitrary continuous sequences of utterances in the same meeting transcript and use MI maximization to train the segment encoder E to learn global segment representations (Eq. 1) (Wang and Wan, 2020, 2021). This also contributes to the effectiveness of our method.
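The self-supervised data construction described above can be sketched as follows: from a single meeting's sentence vectors, sample a random contiguous span, take a positive from inside the span and a negative from outside it. All names are illustrative; the real encoder E and PLM features are omitted.

```python
# Sketch of per-meeting training-pair construction (hypothetical helper name).
import numpy as np

def sample_span_pair(sentence_reprs, rng):
    """Sample a random span s_{i:j}, a positive inside it, a negative outside."""
    n = len(sentence_reprs)
    i, j = sorted(rng.choice(n, size=2, replace=False))
    span = sentence_reprs[i:j + 1]                 # segment s_{i:j}
    pos = span[rng.integers(len(span))]            # y+ drawn from the span
    outside = [k for k in range(n) if k < i or k > j]
    neg = sentence_reprs[rng.choice(outside)] if outside else None
    return span, pos, neg

rng = np.random.default_rng(1)
reprs = rng.normal(size=(10, 4))                   # 10 sentences, toy dim 4
span, pos, neg = sample_span_pair(reprs, rng)
```

Because any contiguous span is assumed to carry some theme, no gold boundaries are needed to generate these pairs.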
(2) Boundary Detection Based on MI Minimization. After obtaining the segment representation of any region, we propose an edge-based boundary detection module to detect topic changes by minimizing the MI between the segment representations of different topic segments (requirement ii):

J_BD = min_{i,j,k∈[1,n], i≠k} I(y_{i:j}; y_{j:k}). (2)

Specifically, since we are primarily interested in maximizing the MI gap rather than in its precise value, we can rely on non-Kullback-Leibler divergences, which may offer favorable trade-offs. Following the InfoNCE estimator in Poole et al. (2019), we define a Jensen-Shannon mutual information estimator Î_JSD(a; b) of two factorized latent variables a and b, where the negative b⁻ is sampled from the complement set b̄ of b. In order to achieve a single-pass computation (low time complexity), we calculate the difference in estimated MI between the proposed boundary i and its adjacent offsets of +/−1 as the MI bound; this can easily be extended to more complex bounds in the future. As a consequence, we propose a metric based on the MI gap (MIG) to quantitatively assess the effectiveness of disentanglement between two neighboring regions s_{1:i−1} and s_{i:j}: MIG(y_{1:i−1}; y_{i:j}). Finally, we derive the topic boundaries as pairs of regions y_{1:i−1} and y_{i:j} whose MIG(y_{1:i−1}; y_{i:j}) scores are greater than a certain threshold δ. The MIG measures how much the segment representations change when we move the boundary by one position. A positive MIG means that the segments become more dissimilar when we shift the boundary, indicating a potential topic change; a negative MIG means that the segments become more similar, indicating a coherent topic. Therefore, we can use the MIG as a criterion for detecting topic boundaries.
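The MIG-style boundary search can be illustrated with a toy sketch. Here a stand-in `mi_estimate` (cosine similarity of the two regions' mean vectors) replaces the learned JSD estimator, and the exact MIG formula is our illustrative guess: a boundary is kept when shifting it by one position in either direction raises the cross-region dependence.

```python
# Toy MIG-style boundary detection; names and formula are illustrative only.
import numpy as np

def mi_estimate(left, right):
    """Stand-in MI proxy: cosine similarity between the regions' mean vectors."""
    a, b = left.mean(axis=0), right.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_boundaries(reprs, delta):
    """Flag position i when moving the cut by +/-1 makes regions more dependent."""
    boundaries = []
    for i in range(2, len(reprs) - 2):
        here = mi_estimate(reprs[:i], reprs[i:])
        shifted = min(mi_estimate(reprs[:i - 1], reprs[i - 1:]),
                      mi_estimate(reprs[:i + 1], reprs[i + 1:]))
        if shifted - here > delta:   # positive gap: i is a disentangling cut
            boundaries.append(i)
    return boundaries

# two clearly separated toy "topics": constant vectors per topic
reprs = np.vstack([np.tile([1.0, 0.0], (5, 1)), np.tile([0.0, 1.0], (5, 1))])
print(detect_boundaries(reprs, delta=0.05))
```

On this toy input the cross-region similarity is minimal exactly at the topic change, so only that position survives the threshold δ.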

Experiments
To demonstrate the effectiveness of our model, we evaluate M³Seg on two widely-used benchmark datasets, the AMI Meeting Corpus (Carletta et al., 2005) and the ICSI Meeting Corpus (dataset details are shown in Appendix A). According to the taxonomy of Ghosh et al. (2022), the datasets we use are best characterized as unstructured chats: transcriptions of spontaneous spoken language, which often results in incoherent and incomplete utterances with some word errors.
We compare M³Seg with several state-of-the-art unsupervised topic segmentation methods, including region-based methods (i.e., TopicTiling (Riedl and Biemann, 2012) and DialoGPT (Feng et al.)) and edge-based methods (i.e., TextTiling (Hearst, 1997), C99 (Choi, 2000), and TextTiling-BERT (Solbiati et al., 2021)). For analysis, we also report the Random and Even baselines, which place topic boundaries randomly and at every n-th utterance, respectively. We adopt two standard evaluation metrics, Pk (Beeferman et al., 1999) and WinDiff (Wd) (Pevzner and Hearst, 2002) scores (details are shown in Appendix B); model implementation details are shown in Appendix C. Regarding the detection of multiple boundaries, we use a threshold δ to control it: whenever the mutual information gap between a pair of regions is greater than δ, a topic boundary is placed (detailed in Section 2). For the Random baseline, following the settings of TextTiling-BERT, we set each utterance to have a 0.30 probability of being assigned a topic boundary (this value is roughly consistent with the proportion of utterances that are boundaries).

Influence of Pre-trained Language Models (PLMs) To investigate the importance of the PLM's contextualized representations, we perform ablations using different PLMs in Table 3. From the results, we can see that larger pre-trained models lead to more significant improvements, indicating that the contextual information provided by PLMs trained with more data and parameters is more conducive to topic segmentation. Moreover, M³Seg brings consistent and significant improvements across different pre-trained language models, reflecting the effectiveness of our method.

Influence of the MIG Threshold (δ) We also investigate the impact of different MI gap thresholds δ on the results. Figure 2 presents the Pk score and the number of topic segments produced by our model on the AMI dataset under different MIG thresholds δ. As δ increases, the number of segments decreases, indicating that a larger MI gap threshold leads to a coarser segmentation of topics. Interestingly, the Pk score first decreases to a minimum and then increases, potentially due to the presence of an optimal decision point with respect to the selection of the MIG threshold δ.

Conclusion and Limitations
In this study, we introduced the M³Seg framework, a novel approach to unsupervised topic segmentation. We transform the task into an optimization problem that maximizes intra-segment mutual information and minimizes inter-segment mutual information. Our experiments and analysis demonstrate that the proposed model outperforms competitive systems, reducing error rates by 18% to 37%. These results emphasize the effectiveness of using mutual information for topic segmentation and suggest future opportunities to develop more complex and controllable systems.
However, our approach has limitations that should be acknowledged. First, it has only been tested on English, and further experimentation is required to evaluate its performance on low-resource languages. Additionally, our method relies on pre-trained language models, which may not always be available or suitable for certain applications. Nevertheless, we believe that our maximum-minimum mutual information paradigm has the potential to advance the development of unsupervised topic segmentation systems.

B Metrics
Following previous works (Solbiati et al., 2021), we adopt two standard evaluation metrics, Pk (Beeferman et al., 1999) and WinDiff (Wd) (Pevzner and Hearst, 2002), to measure the performance of the segmented results. Both metrics slide a fixed-size window over the document and calculate the segmentation error by comparing the boundaries in the ground truth with the boundaries predicted by the topic segmentation model within each window.
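As an illustration of the window-based evaluation described above, here is a minimal implementation of Pk following Beeferman et al. (1999): slide a probe of width k and count positions where the reference and the hypothesis disagree about whether the probe's two endpoints fall in the same segment. The default choice of k (half the mean reference segment length) is the common convention, stated here as an assumption.

```python
# Minimal Pk sketch; segmentations are given as a segment id per utterance.
def pk_score(reference, hypothesis, k=None):
    """Pk error: fraction of width-k probes on which the two segmentations
    disagree about same-segment membership of the probe's endpoints."""
    n = len(reference)
    if k is None:  # convention: half the mean reference segment length
        k = max(1, round(n / (len(set(reference)) * 2)))
    errors = 0
    for i in range(n - k):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        errors += same_ref != same_hyp
    return errors / (n - k)

perfect = pk_score([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2], k=2)
worst = pk_score([0, 0, 0, 1, 1, 1], [0, 1, 2, 3, 4, 5], k=2)
print(perfect, worst)  # prints 0.0 0.5
```

Lower is better: a perfect segmentation scores 0, while heavily over-segmented output is penalized on every probe that straddles a spurious boundary.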

C Implementation
Our segment encoder E is a one-layer Transformer (Vaswani et al., 2017) with a dimension of d = 768. The threshold δ of the mutual information gap score is set to 0.01. The pre-trained language model is RoBERTa-base (Liu et al., 2019), and our method can easily be migrated to other PLMs. We implement our model in PyTorch and use two Tesla V100 graphics cards for training. In order to use the original features obtained from the PLM without additional scaling, we set Dim(E) to the same dimension as the PLM's output layer.
We train E separately for each meeting input, using only the input text list and random segment boundaries, with mutual information maximization as the training objective. No gold segment boundaries are required, which is consistent with the test scenario; that is, the model is not biased by other meeting data.


Table 1: Results of topic segmentation. The lower the values of Pk and Wd, the better the segmentation performance. We mark the best results in bold.
Module Effectiveness Analysis To investigate the importance of the model's individual components, we perform ablations by removing the region-based segment modeling module and the edge-based boundary detection module. For "w/o segment modeling", we directly apply a max-over-time pooling operation to the sentence representations instead of the segment encoder E. For "w/o boundary detection", we apply the TextTiling (Hearst, 1997) algorithm, which calculates similarity based on the segment representations provided by E. From Table 2, both components play a role, with the most significant drop occurring when the region-based segment modeling module is removed, demonstrating the great effectiveness of using MI to model segments.

Table 3: Results of M³Seg with different PLMs.

Table 4: Results of M³Seg with different contextualized sentence representations.