Unsupervised Abstractive Dialogue Summarization with Word Graphs and POV Conversion

We advance the state of the art in unsupervised abstractive dialogue summarization by utilizing multi-sentence compression graphs. Starting from well-founded assumptions about word graphs, we present simple but reliable path-reranking and topic segmentation schemes. The robustness of our method is demonstrated on datasets across multiple domains, including meetings, interviews, movie scripts, and day-to-day conversations. We also identify possible avenues for augmenting our heuristic-based system with deep learning. We open-source our code to provide a strong, reproducible baseline for future research into unsupervised dialogue summarization.


Introduction
Compared to traditional text summarization, dialogue summarization introduces a unique challenge: the conversion of first- and second-person speech into third-person reported speech. This discrepancy between the observed text and the expected model output puts greater emphasis on abstractive transduction than in traditional summarization tasks. The challenge is further exacerbated by the fact that each of many diverse dialogue types calls for a different form of transduction: short dialogues require terse abstractions, while meeting transcripts require summaries organized by agenda.
Thus, despite the steady emergence of dialogue summarization datasets, the field of dialogue summarization is still bottlenecked by a scarcity of training data. Training a truly robust dialogue summarization model requires transcript-summary pairs not only across diverse dialogue domains, but also across multiple dialogue types. The lack of diverse annotated summarization data is especially pronounced in low-resource languages. From this state of the literature, we identify a need for unsupervised dialogue summarization. Our method builds upon previous research on unsupervised summarization with word graphs. Starting from the simple assumption that a good summary sentence is at least as informative as any single input sentence, we develop novel schemes for path extraction from word graphs. Our contributions are as follows:
1. We present a novel scheme for path reranking in graph-based summarization. We show that, in practice, simple keyword counting performs better than more complex baselines. For longer texts, we present an optional topic segmentation scheme.
2. We introduce a point-of-view (POV) conversion module to convert semi-extractive summaries into fully abstractive summaries. The new module by itself improves all scores of baseline methods, as well as of our own.
3. Finally, we verify our model on datasets beyond those traditionally used in the literature, to provide a strong baseline for future research.
With just an off-the-shelf part-of-speech (POS) tagger and a list of stopwords, our model can be applied across different types of dialogue summarization.

Multi-sentence compression graphs
Pioneered by Filippova (2010), a multi-sentence compression graph (MSCG) is a graph whose nodes are words from the input text and whose edges carry cooccurrence statistics between adjacent words. During preprocessing, the tokens "<bos>" (beginning-of-sentence) and "<eos>" (end-of-sentence) are prepended and appended, respectively, to every input sentence. Thus, every sentence from the input is represented in the graph as a single path from the <bos> node (v_bos) to the <eos> node (v_eos). Overlapping words among sentences create intersecting paths within the MSCG, yielding new paths from v_bos to v_eos that are unseen in the original text. Capturing these possibly shorter but informative paths is the key to performant summarization with MSCGs. Ganesan et al. (2010) introduce an abstractive sentence generation method over word graphs to produce opinion summaries. Tixier et al. (2016) show that nodes with maximal neighbors (a concept captured by graph degeneracy) likely belong to important keywords of the document. Shortest paths from v_bos to v_eos are scored according to how many keyword nodes they contain. Subsequently, a budget-maximization scheme is introduced to find the set of paths that maximizes the score sum within a designated word count (Tixier et al., 2017). We also adopt graph degeneracy to identify keyword nodes in the MSCG.

Unsupervised Abstractive Dialogue Summarization
Aside from MSCGs, unsupervised dialogue summarization usually employs end-to-end neural architectures.

Summarization strategy
In the following subsections we outline our proposed summarization process.

Word graph construction
First, we assemble a word graph G from the input text. We use a modified version of Filippova (2010)'s algorithm for graph construction:
• Let SW be a set of stopwords and T = s_0, s_1, ... be the sequence of sentences in the input text.
• Decompose every s_i ∈ T into a sequence of POS-tagged words:
s_i = ("bos", "meta"), (w_{i,0}, pos_{i,0}), ..., (w_{i,n-1}, pos_{i,n-1}), ("eos", "meta") (1)
• For every (w_{i,j}, pos_{i,j}) ∈ s_i such that w_{i,j} ∉ SW and s_i ∈ T, add a node v to G. If a node v with the same lowercased word w_{i,k} and tag pos_{i,k} (with j ≠ k) already exists, pair (w_{i,j}, pos_{i,j}) with v instead of creating a new node. If multiple such matches exist, select the node with maximal overlapping context (w_{i,j-1} and w_{i,j+1}).
• Add stopword nodes ((w_{i,j}, pos_{i,j}) ∈ s_i such that w_{i,j} ∈ SW and s_i ∈ T) to G with the algorithm described above.
• For all s_i ∈ T, add a directed edge between node pairs that correspond to subsequent words. The edge weight w between nodes v_1 and v_2 is calculated according to Equation (3), where w_{i,j} and w_{i,k} are the words in s_i that correspond to v_1 and v_2, respectively.
In the edge weight calculation, w favors edges with strong cooccurrence, while w^{-1} favors edges with greater salience, as measured by word frequency.
It follows from the above that only a single <bos> node and a single <eos> node exist once the graph is complete.
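The construction above can be sketched in simplified form. This is not the paper's implementation: context-based node disambiguation and the exact edge weight of Equation (3) are omitted, and the dict-based graph layout and function name are our own.

```python
from collections import defaultdict

def build_word_graph(tagged_sentences):
    """Simplified MSCG construction: nodes are (lowercased word, POS)
    pairs, so repeated words across sentences collapse into a single
    node; edge weights here are raw cooccurrence counts between
    adjacent words."""
    graph = defaultdict(lambda: defaultdict(int))  # u -> {v: count}
    for sent in tagged_sentences:
        path = ([("<bos>", "meta")]
                + [(w.lower(), p) for w, p in sent]
                + [("<eos>", "meta")])
        for u, v in zip(path, path[1:]):
            graph[u][v] += 1  # directed edge between subsequent words
    return graph

sentences = [
    [("Poodles", "NOUN"), ("are", "VERB"), ("smart", "ADJ")],
    [("Poodles", "NOUN"), ("are", "VERB"), ("loyal", "ADJ")],
]
g = build_word_graph(sentences)
# Both sentences share the "poodles are" prefix, so the graph
# only forks after the "are" node.
assert g[("poodles", "NOUN")][("are", "VERB")] == 2
assert len(g[("are", "VERB")]) == 2
```

Because every sentence runs from the shared <bos> node to the shared <eos> node, any new combination of overlapping fragments is itself a <bos>-to-<eos> path.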

Keyword extraction
The graph resulting from the previous step captures syntactic importance. Traditional approaches utilize centrality measures to identify important nodes within word graphs (Mihalcea and Tarau, 2004; Erkan and Radev, 2004). In this work we use graph degeneracy to extract keyword nodes. In a k-degenerate word graph, words that belong to k-core nodes of the graph are considered keywords. We collect KW, the set of nodes belonging to the k-core subgraph. The k-core of a graph is its maximal subgraph in which every node has degree at least k.
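A minimal sketch of k-core extraction by iterative peeling; the `k_core_nodes` helper and its dict-of-sets input format are our own, and a production system would likely call a graph library routine instead.

```python
def k_core_nodes(adj, k):
    """Nodes of the k-core: treat the graph as undirected and
    repeatedly peel off nodes whose degree is below k, until every
    surviving node has degree >= k."""
    nbrs = {}
    for u, vs in adj.items():  # symmetrize the adjacency
        nbrs.setdefault(u, set()).update(vs)
        for v in vs:
            nbrs.setdefault(v, set()).add(u)
    while True:
        weak = [u for u, vs in nbrs.items() if len(vs) < k]
        if not weak:
            return set(nbrs)
        for u in weak:
            for v in nbrs.pop(u):
                if v in nbrs:
                    nbrs[v].discard(u)

# Triangle a-b-c with a pendant node d: the 2-core drops d.
adj = {"a": {"b", "c", "d"}, "b": {"c"}}
assert k_core_nodes(adj, 2) == {"a", "b", "c"}
```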

Path threshold calculation
Once keyword nodes are identified, we score every path from v_bos to v_eos that corresponds to a sentence from the original text. Contrary to previous research into word graph-based summarization, we use a simple keyword coverage score for every path, where V_i is the set of all nodes in path p_i, the representation of sentence s_i ∈ T within the word graph. We calculate the path threshold t as the mean score of all sentences in the original text. Later, when summaries are extracted from the word graph, candidates with a path score less than t are discarded. We also experimented with setting t to the minimum or maximum of all original path scores, but such configurations yielded inferior summaries influenced by outlier path scores.
Our path score function is reminiscent of the diversity reward function of Shang et al. (2018). However, we use the function as a measure of coverage instead of diversity. More importantly, we utilize the score to derive a threshold based on all input sentences, which differs significantly from Shang et al. (2018)'s use of the function as a monotonically increasing scorer in submodularity maximization.
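A sketch of the thresholding step. The exact coverage formula is not reproduced in this copy, so the form below (fraction of the keyword set KW present in a path) is an assumption; the mean-based threshold follows the text.

```python
def coverage_score(path_nodes, keywords):
    """Assumed scoring form: fraction of the keyword set KW that the
    path's node set V_i covers. The paper's exact formula may
    normalize differently."""
    return len(set(path_nodes) & keywords) / len(keywords)

def path_threshold(sentence_paths, keywords):
    """Path threshold t: mean coverage score over the paths of all
    original sentences; summary candidates scoring below t are
    discarded."""
    scores = [coverage_score(p, keywords) for p in sentence_paths]
    return sum(scores) / len(scores)

kw = {"poodles", "smart"}
paths = [["poodles", "are", "loyal"], ["poodles", "are", "smart"]]
assert coverage_score(paths[0], kw) == 0.5
assert path_threshold(paths, kw) == 0.75
```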

Topic segmentation
For long texts, we apply an optional topic segmentation step. Our summarization algorithm is then applied separately to each segmented text. Similar to path ranking in the next section, topics are determined according to keyword frequency. For every sentence in the input, we construct a topic coverage vector c, a zero-initialized row vector of length |KW|. Each column of the row vector is a binary indicator of the presence of a single element of KW. The topic coverage vector of a path containing two keywords from KW, for instance, would contain two columns set to 1.
Every transition between sentences is a potential topic boundary. Since each sentence (and its corresponding path) has an associated topic coverage vector, we quantify the topic distance d between a sentence and the next as the negative cosine distance of their topic vectors. If p is a hyperparameter representing the total number of topics, one can segment the original text at the p − 1 sentence boundaries with the greatest topic distance. Alternatively, sentence boundaries with topic distance greater than a designated threshold can be selected as topic boundaries. For simplicity, we proceed with the former (top-p boundary) setup when necessary.
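The segmentation step can be sketched as follows. We use 1 − cosine similarity as the dissimilarity, which ranks boundaries identically to a negative-cosine formulation; the helper names are our own.

```python
import math

def coverage_vector(sentence_nodes, keyword_order):
    """Binary row vector: one column per keyword, 1 if present."""
    return [1 if kw in sentence_nodes else 0 for kw in keyword_order]

def topic_distance(c1, c2):
    """Dissimilarity of two coverage vectors (1 - cosine similarity,
    which orders boundaries the same way as negative cosine)."""
    dot = sum(a * b for a, b in zip(c1, c2))
    n1 = math.sqrt(sum(a * a for a in c1))
    n2 = math.sqrt(sum(b * b for b in c2))
    if n1 == 0 or n2 == 0:
        return 1.0  # no keyword overlap possible: maximally distant
    return 1.0 - dot / (n1 * n2)

def topic_boundaries(vectors, p):
    """Indices i of the p-1 sentence transitions (between sentence i
    and i+1) with the greatest topic distance."""
    dists = [(topic_distance(vectors[i], vectors[i + 1]), i)
             for i in range(len(vectors) - 1)]
    top = sorted(dists, key=lambda di: di[0], reverse=True)[:p - 1]
    return sorted(i for _, i in top)

# Two sentences about one keyword group, then two about another:
vecs = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1]]
assert topic_boundaries(vecs, 2) == [1]  # cut between sentences 1 and 2
```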

Summary path extraction
We generate a summary per speaker. Our construction of the word graph allows fast extraction of subgraphs containing only nodes pertaining to utterances from a single speaker. For each speaker subgraph, we generate summary sentences as follows:
1. We obtain the k shortest paths from v_bos to v_eos by applying the k-shortest-paths algorithm (Yen, 1971) to our word graph.
2. Iterating from the shortest path, we collect any path with a keyword coverage score above the threshold calculated in Section 3.3.
3. For each path found, we track the set of encountered keywords in KW. We stop the search when all keywords in KW have been encountered, or when a pre-defined number of iterations (the search depth) is reached.
A good summary has to be both concise and informative. Intuitively, the edge weights of the proposed word graph capture the former, while keyword thresholding prioritizes the latter.
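A small sketch of the extraction loop above. For brevity it enumerates simple paths by depth-first search instead of Yen's k-shortest-paths algorithm and uses plain word strings as nodes; both are simplifications, and the graph layout is the assumed dict format from earlier sketches.

```python
def ranked_paths(graph, src, dst, max_len=12):
    """Enumerate simple src->dst paths by DFS and rank them by total
    edge weight; a small-graph stand-in for Yen's k-shortest-paths
    algorithm."""
    found, stack = [], [(src, (src,), 0.0)]
    while stack:
        node, path, cost = stack.pop()
        if node == dst:
            found.append((cost, path))
            continue
        if len(path) >= max_len:
            continue
        for nxt, w in graph.get(node, {}).items():
            if nxt not in path:  # keep paths simple (no revisits)
                stack.append((nxt, path + (nxt,), cost + w))
    return [list(p) for _, p in sorted(found, key=lambda cp: cp[0])]

def extract_summary(graph, keywords, threshold, depth=50):
    """Collect low-cost paths whose keyword coverage clears the
    threshold; stop once every keyword in KW has been seen or the
    search depth is exhausted."""
    seen, chosen = set(), []
    for i, path in enumerate(ranked_paths(graph, "<bos>", "<eos>")):
        if i >= depth:
            break
        covered = set(path) & keywords
        if len(covered) / len(keywords) >= threshold:
            chosen.append(path)
            seen |= covered
            if seen == keywords:
                break
    return chosen

graph = {
    "<bos>": {"poodles": 1.0},
    "poodles": {"are": 0.5},
    "are": {"smart": 1.0, "loyal": 2.0},
    "smart": {"<eos>": 1.0},
    "loyal": {"<eos>": 1.0},
}
summary = extract_summary(graph, {"poodles", "smart"}, threshold=1.0)
assert summary == [["<bos>", "poodles", "are", "smart", "<eos>"]]
```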

POV conversion
Finally, we convert the collected semi-extractive summaries into abstractive reported speech using a rule-based POV conversion module. We describe sentences extracted from our word graph as semi-extractive rather than extractive, to recognize the distinction between previously unseen sentences assembled from pieces of text and sentences taken verbatim from the original text. Similar to existing extract-then-abstract summarization pipelines (Mao et al., 2021; Liu et al., 2021), our method hinges on the assumption that the extractive path-reranking step optimizes for summary content, while the succeeding abstractive POV-conversion step does so for summary style. FewSum (Bražinskas et al., 2020) also applies POV conversion, in a few-shot summarization setting. FewSum conditions the summary generator to produce sentences in targeted styles, which is achieved by nudging the decoder to generate pronouns appropriate for each designated tone.
Popular literature has established that defining an all-encompassing set of rules for indirect speech conversion is infeasible (Partee, 1973; Li, 2011). In fact, English grammar is mostly descriptive rather than prescriptive: no set of official rules dictated by a single governing authority exists. Even so, rule-based POV conversion provides a strong baseline compared to state-of-the-art techniques such as end-to-end Transformer networks (Lee et al., 2020). In this study, we limit our scope to rule-based conversion because, among all methods tested in Lee et al. (2020), only the rule-based system conforms to the unsupervised nature of this paper. We encourage further research into integrating more advanced reported speech conversion techniques into the abstractive summarization pipeline.
In this work, we apply four conversion rules:
1. Change pronouns from first person to third person.
2. Change modal verbs can, may, and must to could, might, and had to, respectively.
4. Fix subject-verb agreement after applying rules above.
We notably omit the prepend rules suggested in Lee et al. (2020), because the input domain of our summarization system is unbounded, unlike task-oriented spoken commands for virtual assistants. We also leave tense conversion for future research.
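A minimal sketch of rules 1 and 2. The pronoun mapping below assumes, for illustration only, a single male-referenced speaker; a real module would substitute speaker names, and rule 4 (subject-verb agreement) is out of scope for this token-level sketch.

```python
FIRST_TO_THIRD = {"i": "he", "me": "him", "my": "his", "mine": "his",
                  "we": "they", "us": "them", "our": "their",
                  "ours": "theirs"}
MODAL_SHIFT = {"can": "could", "may": "might", "must": "had to"}

def convert_pov(sentence):
    """Token-level substitution implementing rules 1 and 2.
    Capitalization, speaker names, and subject-verb agreement
    (rule 4) are intentionally left out of this sketch."""
    out = []
    for tok in sentence.split():
        low = tok.lower()
        out.append(FIRST_TO_THIRD.get(low) or MODAL_SHIFT.get(low) or tok)
    return " ".join(out)

assert convert_pov("I can help") == "he could help"
assert convert_pov("we must leave") == "they had to leave"
```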

Datasets
We test our model on dialogue summarization datasets across multiple domains. Table 1 provides detailed statistics and descriptions for each dataset.
For AMI and ICSI, we conduct several ablation experiments with different components of our model omitted: semi-extractive summarization without POV conversion is compared with fully abstractive summarization with POV conversion, and utilization of the pre-segmented text provided by Shang et al. (2018) is compared with application of the topic segmentation suggested in this paper.

Baselines
For meeting summaries, we compare our method with previous research on unsupervised dialogue summarization. Along with Filippova (2010), Shang et al. (2018), and Fu et al. (2021), we select Boudin and Morin (2013) and Mehdad et al. (2013) as our baselines.

Table 3: Results on day-to-day, interview, screenplay, and debate summarization datasets. All reported scores are F-1 measures. In our method, topic segmentation is applied to datasets with average transcription length greater than 5,000 characters (MediaSum, SummScreen), and POV conversion is applied to all datasets.
Because summary distributions in several document types tend to be front-heavy (Grenander et al., 2019; Zhu et al., 2021), LEAD-3 provides a competitive extractive baseline with negligible computational burden.
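The LEAD-3 baseline itself is trivially implementable:

```python
def lead3(sentences):
    """LEAD-3 baseline: the summary is simply the document's first
    three sentences, taken verbatim."""
    return sentences[:3]

doc = ["S1.", "S2.", "S3.", "S4."]
assert lead3(doc) == ["S1.", "S2.", "S3."]
```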

Meeting summarization
Table 2 records experimental results on the AMI and ICSI datasets. In all categories, our method, or a baseline augmented with our POV conversion module, outperforms the previous state of the art.

Effect of suggested path reranking
Our proposed path reranking without POV conversion yields semi-extractive output summaries competitive with abstractive summarization baselines. Segmenting raw transcripts into topic groups with our method generally yields higher F-measures than using pre-segmented transcripts in semi-extractive summarization.

Effect of topic segmentation
Summarizing pre-segmented dialogue transcripts results in higher R2, while applying our topic segmentation method results in higher R1 and RL. This observation is in line with our method's emphasis on keyword extraction, in contrast to the keyphrase extraction seen in several baselines (Boudin and Morin, 2013; Shang et al., 2018). Models that preserve token adjacency achieve higher R2, while models that preserve token presence achieve higher R1. RL additionally penalizes wrong token order, but token order in extracted summaries tends to be well preserved in word graph-based summarization schemes.

Effect of POV conversion module
Our POV conversion module improves benchmark scores of all tested baselines, as well as of our own system. It is only natural that a conversion module that translates text from semi-extractive to abstractive raises scores on abstractive benchmarks. However, applying our POV module to already abstractive summarization systems also resulted in higher scores in all cases. We attribute this to the fact that previous abstractive summarization systems do not generate sufficiently reportive summaries; past research either emphasizes other linguistic aspects like hyponym conversion (Shang et al., 2018) or treats POV conversion as a byproduct of an end-to-end summarization pipeline (Fu et al., 2021).

Day-to-day, interview, screenplay, and debate summarization
Our method outperforms the LEAD-3 baseline on most benchmarks (Table 3). The model shows consistent performance across multiple domains in R1 and RL, but greater inconsistency in R2.
Variance in the latter metric can be attributed, as in Section 5.1.2, to our model's tendency to optimize for single keywords rather than keyphrases. Robustness of our model, as measured by the consistency of ROUGE measures across multiple datasets, is shown in Figure 4. Notably, our method falters on the MediaSum benchmark. Compared to other benchmarks, MediaSum's reference summaries display heavy positional bias towards the beginning of their transcripts, which benefits the LEAD-3 approach. It is also the only dataset in which reference summaries are not generated for the purpose of summary evaluation, but are scraped from source news providers. Reference summaries for MediaSum utilize less reported speech compared to other datasets, and thus our POV module fails to boost the precision of summaries generated by our model.

Conclusion

Improving MSCG summarization
This paper improves upon previous work on multi-sentence compression graphs for summarization. We find that simpler and more adaptive path reranking schemes can boost summarization quality. We also demonstrate a promising possibility for integrating point-of-view conversion into summarization pipelines.
Compared to previous research, our model is still deficient in keyphrase and bigram preservation. This phenomenon is captured by inconsistent R2 scores across benchmarks. We believe incorporating findings from keyphrase-based summarizers (Riedhammer et al., 2010; Boudin and Morin, 2013) can mitigate this shortcoming.

Avenues for future research
While our method demonstrates improved benchmark results, its mostly heuristic nature leaves much room for enhancement through the integration of statistical models. POV conversion in particular can benefit from deep learning-based approaches (Lee et al., 2020). With recent advances in unsupervised sequence-to-sequence transduction (Li et al., 2020; He et al., 2020), we expect further research into more advanced POV conversion techniques to improve unsupervised dialogue summarization.
Another possibility for augmenting our research with deep learning is employing graph networks (Cui et al., 2020) to represent MSCGs. With graph networks, each word node and edge can be represented as a contextualized vector. Such schemes would enable more flexible and interpolatable manipulation of the syntax captured by traditional word graphs.
One notable shortcoming of our system is the generation of summaries that lack grammatical coherence or fluency (Table 4). We intentionally leave out complex path filters that gauge linguistic validity or factual correctness; we only minimally inspect our summaries to check for the inclusion of verb nodes, as in Filippova (2010). Our system can easily be augmented with such additional filters, which we leave for future work.

Figure 2: Construction of the word graph. Red nodes and edges denote the selected summary path. The node highlighted in purple ("Poodles") is the only non-stopword node included in the k-core subgraph of the word graph. We use nodes from the k-core subgraph as keyword nodes. All original sentences from the unabridged input are present as possible paths from v_bos to v_eos. Paths that contain more information than those original paths are extracted as summaries.
Zhang et al. (2021) and Zou et al. (2021) utilize text variational autoencoders (VAEs) (Kingma and Welling, 2014; Bowman et al., 2016) to decode conditional or denoised abridgements. Fu et al. (2021) reformulate summary generation as a self-supervised task by equipping the training architecture with auxiliary objectives. Among end-to-end frameworks we include only Fu et al. (2021) as our baseline, because the brittle nature of training text VAEs, coupled with the lack of detail on the data and parameters used to train the models, renders VAE-based methods practically irreproducible.

Figure 3: Topic segmentation on AMI meeting ID ES2005b. Green bars indicate the sentence boundaries with the highest topic distance.

Figure 4: Normalized standard deviation (also called coefficient of variation) of R1, R2, and RL scores across all datasets. Normalized standard deviation is calculated as σ/x̄, where σ is the standard deviation and x̄ is the mean.

Table 1: Statistics for benchmark datasets. All character-level and word-level statistics are averaged over the test set and rounded to the nearest whole number.

Table 2: Results on the AMI and ICSI meeting summarization datasets. All reported scores are F-1 measures. Models with POV indicate post-processing with our suggested POV conversion module. PreSeg models utilize the topic segmentations provided in Shang et al. (2018), and TopicSeg models intake unsegmented raw transcripts and perform the topic segmentation algorithm suggested in this paper. Results for RepSum are quoted from the original paper.
All but Fu et al. (2021) are word graph-based summarizers. For all other categories, we choose LEAD-3 as our unsupervised baseline. LEAD-3 selects the first three sentences of a document as the summary.