Indicative Summarization of Long Discussions

Online forums encourage the exchange and discussion of different stances on many topics. Not only do they provide an opportunity to present one's own arguments, but they may also gather a broad cross-section of others' arguments. However, the resulting long discussions are difficult to overview. This paper presents a novel unsupervised approach that uses large language models (LLMs) to generate indicative summaries for long discussions, which serve as tables of contents. Our approach first clusters argument sentences, generates cluster labels as abstractive summaries, and classifies the generated cluster labels into argumentation frames, resulting in a two-level summary. Based on an extensively optimized prompt engineering approach, we evaluate 19 LLMs for generative cluster labeling and frame classification. To evaluate the usefulness of our indicative summaries, we conduct a purpose-driven user study via a new visual interface called DISCUSSION EXPLORER: it shows that our proposed indicative summaries serve as a convenient navigation tool to explore long discussions. (Code: https://github.com/webis-de/EMNLP-23)


Introduction
Online discussion forums are a popular medium for discussing a wide range of topics. As the size of a community grows, so does the average length of the discussions held there, especially when current controversial topics are discussed. On ChangeMyView (CMV; https://www.reddit.com/r/changemyview/), for example, discussions often go into the hundreds of arguments covering many perspectives on the topics in question. Initiating, participating in, or reading discussions generally has two goals: to learn more about others' views on a topic and/or to share one's own.
To help their users navigate large volumes of arguments in long discussions, many forums offer basic features to sort them, for example, by time of creation or popularity. However, these alternative views may not capture the full range of perspectives exchanged, so it is still necessary to read most of the arguments for a comprehensive overview. In this paper, we depart from previous approaches to summarizing long discussions by generating indicative summaries instead of informative summaries. Figure 1 illustrates our three-step approach: first, the sentences of the arguments are clustered according to their latent subtopics. Then, a large language model generates a concise abstractive summary for each cluster as its label. Finally, the argument frame (Chong and Druckman, 2007; Boydstun et al., 2014) of each cluster label is predicted as a generalizable operationalization of perspectives on a discussion's topic. From this, a hierarchical summary is created in the style of a table of contents, where frames act as headings and cluster labels as subheadings. To our knowledge, indicative summaries of this type have not been explored before (see Section 2).
Our four main contributions are: (1) A fully unsupervised approach to indicative summarization of long discussions (Section 3). We develop robust prompts for generative cluster labeling and frame assignment based on extensive empirical evaluation and best practices (Section 4). (2) A comprehensive evaluation of 19 state-of-the-art, prompt-based large language models (LLMs) for both tasks, supported by quantitative and qualitative assessments (Section 5). (3) A user study of the usefulness of indicative summaries for exploring long discussions (Section 5). (4) DISCUSSION EXPLORER, an interactive visual interface for exploring the indicative summaries generated by our approach and the corresponding discussions. Our results show that the GPT variants of OpenAI (GPT3.5, ChatGPT, and GPT4) outperform all open-source models at the time of writing. LLaMA and T0 perform well, but are not competitive with the GPT models. Regarding the usefulness of the summaries, users preferred our summaries over alternative views to explore long discussions with hundreds of arguments.

Figure 1: Left: Illustration of our approach to generating indicative summaries for long discussions. The main steps are (1) unit clustering, (2) generative cluster labeling, and (3) multi-label frame assignment in order of relevance. Right: Conceptual and exemplary (simplified) presentation of our indicative summary in table-of-contents style. Frames act as headings and the corresponding cluster labels as subheadings.

Related Work
Previous approaches to generating discussion summaries have mainly focused on generating extractive summaries, using two main strategies: extracting significant units (e.g., responses, paragraphs, or sentences), or grouping them into specific categories, which are then summarized. In this section, we review the relevant literature.

Extractive Summarization
Extractive approaches use supervised learning or domain-specific heuristics to extract important entities from discussions as extractive summaries.
For example, Klaas (2005) summarized UseNet newsgroup threads by considering thread structure and lexical features to measure message importance. Tigelaar et al. (2010) identified key sentences based on author names and citations, focusing on coherence and coverage in summaries. Ren et al. (2011) developed a hierarchical Bayesian model for tracking topics, using a random walk algorithm to select representative sentences. Ranade et al. (2013) extracted topic-relevant and emotive sentences, while Bhatia et al. (2014) and Tarnpradab et al. (2017) used dialogue acts to summarize question-answering forum discussions. Egan et al. (2016) extracted key points using dependency parse graphs, and Kano et al. (2018) summarized Reddit discussions using local and global context features. These approaches generate informative summaries that substitute for the discussions without referencing back to them.

Grouping-based Summarization
Grouping-based approaches group discussion units, such as posts or sentences, either implicitly or explicitly. The groups are based on queries, aspects, topics, dialogue acts, argument facets, or key points annotated by experts. Once the units are grouped, an individual summary is generated for each group by selecting representative members. This grouping-then-summarization paradigm has been primarily applied to multi-document summarization of news articles (Radev et al., 2004). Follow-up work proposed cluster link analysis (Wan and Yang, 2008), cluster sentence ranking (Cai et al., 2010), and density peak identification in clusters (Zhang et al., 2015). For abstractive multi-document summarization, Nayeem et al. (2018) clustered sentence embeddings using a hierarchical agglomerative algorithm, identifying representative sentences from each cluster using TextRank (Mihalcea and Tarau, 2004) on the induced sentence graph. Similarly, Fuad et al. (2019) clustered sentence embeddings and selected subsets of clusters based on importance, coverage, and variety. These subsets were then input to a transformer model trained on the CNN/DailyMail dataset (Nallapati et al., 2016) to generate a summary. Recently, Ernst et al. (2022) used agglomerative clustering of salient statements to summarize sets of news articles, involving a supervised ranking of clusters by importance.
For Wikipedia discussions, Zhang et al. (2017) proposed the creation of a dynamic summary tree to ease subtopic navigation at different levels of detail, requiring editors to manually summarize each tree node's cluster. Misra et al. (2015) used summarization to identify arguments with similar aspects in dialogues from the Internet Argument Corpus (Walker et al., 2012). Similarly, Reimers et al. (2019) used agglomerative clustering of contextual embeddings and aspects to group sentence-level arguments. Bar-Haim et al. (2020a,b) examined the mapping of debate arguments to key points written by experts to serve as summaries.
Our approach clusters discussion units, but instead of a supervised selection of key cluster members, we use vanilla LLMs for abstractive summarization. Moreover, our summaries are hierarchical, using issue-generic frames as headings (Chong and Druckman, 2007; Boydstun et al., 2014) and generating concise abstractive summaries of corresponding clusters as subheadings. Thus, our approach is unsupervised, facilitating a scalable and generalizable summarization of discussions.

Cluster Labeling
Cluster labeling involves assigning representative labels to document clusters to facilitate clustering exploration. Labeling approaches include comparing term distributions (Manning et al., 2008), selecting key terms closest to the cluster centroid (Role and Nadif, 2014), formulating key queries (Gollub et al., 2016), identifying keywords through hypernym relationships (Poostchi and Piccardi, 2018), and using weak supervision to generate topic labels (Popa and Rebedea, 2021). These approaches often select a small set of terms as labels that do not describe a cluster's contents in closed form. Our approach overcomes this limitation by treating cluster labeling as a zero-shot abstractive summarization task.

Frame Assignment
Framing involves emphasizing certain aspects of a topic for various purposes, such as persuasion (Entman, 1993; Chong and Druckman, 2007). Frame analysis for discussions provides insights into different perspectives on a topic (Morstatter et al., 2018; Liu et al., 2019). It also helps to identify biases in discussions resulting, e.g., from word choice (Hamborg et al., 2019b,a). Thus, frames can serve as valuable reference points for organizing long discussions. We use a predefined inventory of media frames (Boydstun et al., 2014) for discussion summarization. Instead of supervised frame assignment (Naderi and Hirst, 2017; Ajjour et al., 2019; Heinisch and Cimiano, 2021), we use prompt-based LLMs for more flexibility.

Indicative Discussion Summarization
Our indicative summarization approach takes the sentences of a discussion as input and generates a summary in the form of a table of contents, as shown in Figure 1. Its three steps consist of clustering discussion sentences, cluster labeling, and frame assignment to cluster labels.

Unit Clustering
Given a discussion, we extract its sentences as discussion units. The set of sentences is then clustered using the density-based hierarchical clustering algorithm HDBSCAN (Campello et al., 2013). Each sentence is embedded using SBERT (Reimers and Gurevych, 2019), and these embeddings are then mapped to a lower dimensionality using UMAP (McInnes et al., 2017). Unlike previous approaches that rank and filter clusters to generate informative summaries (Ernst et al., 2022; Syed et al., 2023), our summaries incorporate all clusters. The sentences of each cluster are ranked by centrality, which is determined by the λ value of HDBSCAN. A number of central sentences per cluster are selected as input for cluster labeling by abstractive summarization.
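A minimal sketch of this clustering pipeline, assuming the sentence-transformers, umap-learn, and hdbscan packages; the SBERT variant and the min_cluster_size default are illustrative placeholders (Appendix B documents the actual parameter choices, including a regression for min_cluster_size):

```python
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

def cluster_discussion(sentences, min_cluster_size=15):
    # 1. Contextual sentence embeddings via SBERT (model name assumed).
    encoder = SentenceTransformer("all-mpnet-base-v2")
    embeddings = encoder.encode(sentences)
    # 2. Dimensionality reduction via UMAP (parameters from Appendix B).
    reduced = umap.UMAP(metric="cosine", n_neighbors=30,
                        n_components=10, min_dist=0.0).fit_transform(embeddings)
    # 3. Density-based clustering via HDBSCAN; "leaf" avoids oversized clusters.
    clusterer = hdbscan.HDBSCAN(metric="euclidean",
                                cluster_selection_method="leaf",
                                min_cluster_size=min_cluster_size).fit(reduced)
    # Membership strengths (derived from HDBSCAN's lambda values) can serve
    # as the centrality ranking for selecting each cluster's input sentences.
    return clusterer.labels_, clusterer.probabilities_
```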
Meta-sentence filtering Some sentences in a discussion do not contribute directly to the topic, but reflect the interaction between its participants. Examples include sentences such as "I agree with you." or "You are setting up a straw man." Pilot experiments have shown that such meta-sentences may cause our summarization approach to include them in the final summary. As these are irrelevant to our goal, we apply corpus-specific and channel-specific meta-sentence filtering. Corpus-specific filtering is based on a small set of frequently used meta-sentences M in a large corpus (e.g., on Reddit). It is bootstrapped during preprocessing, and all sentences in it are omitted by default. Our pilot experiments revealed that some sentences in discussions are also channel-specific (e.g., for the ChangeMyView Subreddit). Therefore, we augment our sentence clustering approach by adding a random sample M′ ⊂ M to the set of sentences D of each individual discussion before clustering, where |M′| = max{300, |D|}. The maximum value for the number of meta-sentences |M′| is chosen empirically to maximize the likelihood that channel-specific meta-sentences are clustered with corpus-specific ones. After clustering the joint set of meta-sentences and discussion sentences D ∪ M′, we obtain the clustering C. Let m_C = |C ∩ M′| denote the number of meta-sentences and d_C = |C ∩ D| the number of discussion sentences in a cluster C ∈ C. The proportion of meta-sentences in a cluster is then estimated as p_C = m_C / (m_C + d_C), which assumes that meta-sentences are independent of others in a discussion. The noise threshold θ = 2/3 was chosen empirically. Sentences in a discussion that either belong to a meta-sentence cluster or whose nearest cluster is considered to be one are omitted. In our evaluation, an average of 23% of sentences are filtered from discussions. Figure 2 illustrates the effect of meta-sentence filtering on a discussion's set of sentences.
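The cluster classification step can be sketched as follows; the sketch omits the neighboring-cluster test illustrated in Figure 2, and the use of a simple ≥ comparison against θ is our assumption:

```python
THETA = 2 / 3  # empirically chosen noise threshold from the text

def is_meta_cluster(m_c: int, d_c: int, theta: float = THETA) -> bool:
    """m_c: sampled meta-sentences in the cluster,
    d_c: discussion sentences in the cluster."""
    p_c = m_c / (m_c + d_c)  # proportion of meta-sentences in the cluster
    return p_c >= theta
```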

Generative Cluster Labeling
Most cluster labeling approaches extract keywords or key phrases as labels, which limits their fluency. These approaches may also require acquiring training data for supervised learning. We instead formulate cluster labeling as an unsupervised abstractive summarization task and experiment with prompt-based large language models in zero-shot and few-shot settings. This enables generalization across multiple domains, eliminates the need for supervised learning, and yields fluent cluster labels with higher readability than keywords or phrases.
We develop several prompt templates specifically tailored to different types of LLMs. For encoder-decoder models, we carefully develop appropriate prompts based on PromptSource (Bach et al., 2022), a toolkit that provides a comprehensive collection of natural language prompts for various tasks across 180 datasets. In particular, we analyze prompts for text summarization datasets with respect to (1) descriptive words for the generation of cluster labels using abstractive summarization, (2) commonly used separators to distinguish instructions from context, (3) the position of instructions within prompts, and (4) the granularity level of input data (full text, document title, or sentence). Since our task is about summarizing groups of sentences, we chose prompts that require the full text as input to ensure that enough contextual information is provided (within the limits of each model's input size). Section 4.1 provides details on the prompt engineering process.

Frame Assignment
Any controversial topic can be discussed from different perspectives. For example, "the dangers of social media" can be discussed from a moral or a health perspective, among others. In our indicative summaries, we use argumentation frame labels as top-level headings to operationalize different perspectives. An argumentation frame may include one or more groups of relevant arguments. We assign frame labels from the issue-generic frame inventory shown in Table 1 (Boydstun et al., 2014) to each cluster label derived in the previous step (see Table 6 in the Appendix for detailed label descriptions). We use prompt-based models in both zero-shot and few-shot settings for frame assignment (see Appendices C and D for details). In our experiments with instruction-tuned models, we designed two types of instructions, shown in Figure 10: direct instructions for models trained on instruction-response samples, and dialogue instructions for chat models. The instructions are included along with the cluster labels in the prompts. Moreover, including the citation of the frame inventory used in our experiments has a positive effect on the effectiveness of some models (see Appendix D.1 for details).
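A minimal sketch of a zero-shot frame assignment call with the instruction from Figure 3, assuming the OpenAI Python client; the frame list is only a subset of the inventory in Table 1, and the helper itself is illustrative rather than the paper's code:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FRAMES = ["Economic", "Morality", "Health & Safety", "Public Opinion",
          "Quality of Life", "Fairness & Equality"]  # subset of Table 1

INSTRUCTION = (
    "The following list contains all available media frames as defined "
    f"in the work from Boydstun et al. (2014): {FRAMES} "
    "For every input, you answer with three of these media frames "
    "corresponding to that input, in order of importance."
)

def assign_frames(cluster_label: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": INSTRUCTION},
                  {"role": "user", "content": cluster_label}],
    )
    return response.choices[0].message.content
```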

Indicative Summary Presentation
Given the generated labels of all sentence clusters and the frame labels assigned to each cluster label, our indicative summary groups the cluster labels by their respective frame labels. Within each frame, the cluster labels are then ordered by cluster size. This results in a two-level indicative summary, as shown in Figures 1 and 4.
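The assembly step amounts to a simple aggregation; a sketch, assuming each cluster carries its generated label, its frames in order of relevance, and its size:

```python
from collections import defaultdict

def build_toc(clusters):
    # clusters: list of dicts with keys "label", "frames" (ordered), "size".
    toc = defaultdict(list)
    for cluster in clusters:
        toc[cluster["frames"][0]].append(cluster)  # group by top-ranked frame
    # Within each frame, order cluster labels by descending cluster size.
    return {frame: sorted(group, key=lambda c: c["size"], reverse=True)
            for frame, group in toc.items()}
```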

Prompt Engineering
Using prompt-based LLMs for generative cluster labeling and frame assignment requires model-specific prompt engineering as a preliminary step. We explored the 19 model variants listed in Table 2. To select the most appropriate models for our task, we consulted the HELM benchmark (Liang et al., 2022), which compares the effectiveness of different LLMs on different tasks. Further, we included various open-source models (with optimized instructions) as they were released during our research, reusing previously optimized prompts for the newer models.

Table 2 (excerpt): Overview of the 19 evaluated LLMs, grouped by instruction style. For example, OPT-66B is an autoregressive model with similar effectiveness to GPT-3, but more efficient data collection and training; under direct instruction, LLaMA-CoT is vanilla LLaMA-30B fine-tuned on chain-of-thought and reasoning samples (Si and Lin, 2023).

GPT3.5 for Generative Cluster Labeling
Generate a single descriptive phrase that describes the following debate in very simple language, without talking about the debate or the author. Debate: """{text}"""

GPT4 for Frame Assignment
The following {input_type} contains all available media frames as defined in the work from {authors}: {frames} For every input, you answer with three of these media frames corresponding to that input, in order of importance.

({input_type} is a list of frame labels or a JSON with frame labels and their descriptions.)
Figure 3: The best performing instructions for cluster labeling and frame assignment. For frame assignment, citing the frame inventory using the placeholder {authors} has a positive impact on the effectiveness of some models (see Appendix D.1 for details).
The position of the instruction within a prompt was also varied, taking into account prefix and suffix positions. For decoder-only models like BLOOM, GPT-NeoX, and OPT-66B, we experimented with hand-crafted prompts. For GPT3.5, we followed the best practices described in OpenAI's API documentation and created a single prompt.
Prompts were evaluated on a manually annotated set of 300 cluster labels using BERTScore (Zhang et al., 2020). We selected the most effective prompt for each of the above models for cluster labeling. Our evaluation in Section 5 shows that GPT3.5 performs best on this task; ChatGPT and GPT4 were released after this evaluation. Figure 3 (top) shows the best prompt for this model.
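The prompt selection loop reduces to scoring each template's generated labels against the references; a minimal sketch, assuming the bert-score package and a hypothetical generate(template, cluster) helper that queries the model under test:

```python
from bert_score import score

def best_prompt(templates, clusters, references, generate):
    results = {}
    for template in templates:
        candidates = [generate(template, cluster) for cluster in clusters]
        # Mean F1 BERTScore of generated labels vs. manual reference labels.
        _, _, f1 = score(candidates, references, lang="en")
        results[template] = f1.mean().item()
    return max(results, key=results.get)
```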

Frame Assignment
For frame assignment, models were prompted to predict a maximum of three frame labels for a given cluster label, ordered by relevance. Experiments were conducted with both direct instructions and dialogue prompts in zero-shot and few-shot settings. In the zero-shot setting, we formulated three prompts containing (1) only frame labels, (2) frame labels with short descriptions, and (3) frame labels with full-text descriptions (see Appendix D.2 for details). For the few-shot setting, we manually annotated up to two frames from the frame inventory of Table 1 for each of the 300 cluster labels generated by the best model, GPT3.5, in the previous step. The few-shot prompt included, for each frame, the frame label, its full-text description, and three examples (42 examples in total, 3 per frame). The remaining 285 examples were used for the subsequent frame assignment evaluation. Our evaluation in Section 5 shows that GPT4 performs best on this task. Figure 3 (bottom) shows its best prompt.

Evaluation
To evaluate our approach, we conducted automatic and manual evaluations focused on cluster labeling quality and frame assignment accuracy. We also evaluated the utility of our indicative summaries in a purpose-driven user study in which participants had the opportunity to explore long discussions and provide us with feedback.

Data and Preprocessing
We used the "Winning Arguments" corpus from Tan et al. (2016) as a data source for long discussions.It contains 25,043 discussions from the ChangeMyView Subreddit that took place between 2013 and 2016.The corpus was preprocessed by first removing noise replies and then meta-sentences.Noise replies are marked in the metadata of the corpus as "deleted" by their respective authors, posted by bots, or removed by moderators.In addition, replies that referred to the Reddit guidelines or forum-specific moderation were removed using pattern matching (see Appendix A for details).The remaining replies were split into a set of sentences using Spacy (Honnibal et al., 2020).To enable the unit clustering (of sentences) as described in Section 3.1, the set of meta-sentences M is bootstrapped by first clustering the entire set of sentences from all discussions in the corpus and then manually examining the clusters to identify those that contain meta-sentences, resulting in |M | = 955 meta-sentences.After filtering out channel-specific noise, the (cleaned) sets of discussion sentences are clustered as described.
Evaluation Data From the preprocessed discussions, 300 sentence clusters were randomly selected. Then, we manually created a cluster label and up to three frame labels for each cluster. Due to the short length of the cluster labels, up to two frames per label were sufficient. After excluding 57 examples with ambiguous frame assignments, we obtained a reference set of 243 cluster label samples, each labeled with up to two frames.

Generative Cluster Labeling
The results of the automatic cluster labeling evaluation using BERTScore and ROUGE are shown in (Appendix) Tables 7 and 8, respectively. We find that ChatGPT performs best. To manually evaluate the quality of the cluster labels, we used a ranking-based method in which four annotators scored the generated cluster labels against the manually annotated reference labels of each of the 300 clusters.
To provide additional context for the cluster content, the five most semantically similar sentences to the reference label from each cluster were included, as well as five randomly selected sentences from the cluster. To avoid possible bias due to the length of the cluster labels of different models, longer labels were truncated to 15 tokens (Figure 9 in the Appendix shows the annotation interface). To determine an annotator's model ranking, we merged the preference rankings for all clusters using reciprocal rank fusion (Cormack et al., 2009). Annotator agreement was calculated using Kendall's W for rank correlation (Kendall, 1948), which yielded a value of 0.66, indicating substantial agreement.
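Reciprocal rank fusion scores each model by summing 1 / (k + rank) over all per-cluster preference rankings; a minimal sketch, where k = 60 follows Cormack et al. (2009) since the paper does not report its choice:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked lists of model names (best first), one per cluster.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, model in enumerate(ranking, start=1):
            scores[model] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```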
The average ranking of each model is shown in Table 3, along with the length distributions of the generated cluster labels. (As newer models were published after our manual evaluation, the Appendix reports an automatic evaluation of all models using human and GPT3.5-based reference labels in Tables 7 and 8.) GPT3.5 showed superior effectiveness in generating high-quality cluster labels. It ranked first in 225 out of 300 comparisons, with an average score of 1.38 by the four annotators. The cluster labels generated by GPT3.5 were longer on average (9.4 tokens) and thus more informative than those generated by the other models, which often generated disjointed or incomplete labels. In particular, T0 generated very short labels on average (3.1 tokens) that were generic and non-descriptive.

Frame Assignment
In the zero-/few-shot frame assignment settings described in Section 4.2, we prompted the models to predict three frames per cluster label in order of relevance. Using the manually annotated reference set of 243 cluster labels and their frame labels, we evaluated the accuracy of the predicted frames against the reference frames for each cluster label. The results for the first predicted frame are shown in Table 4. In most cases, GPT4 outperforms all other models in the various settings, with the exception of the zero-shot setting with a short prompt, where GPT3.5 narrowly outperforms GPT4 with 60.9% accuracy versus 60.5%. Among the top five models, the GPT* models that follow direct user instructions perform consistently well, with the LLaMA-65B/CoT and T0 models showing strong effectiveness among the open-source LLMs. Conversely, the OPT model performs consistently worse in all settings. The few-shot setting shows greater variance in results, suggesting that the models are more sensitive to the labeled examples provided in the prompts. Including a citation to the frame inventory paper in the instructions (see Figure 10) significantly improved the effectiveness of Falcon-40B (12%) and LLaMA-65B (9%) in the zero-shot setting (see Appendix D.1 for details).

Usefulness Evaluation
In addition to assessing each step of our approach, we conducted a user study to evaluate the effectiveness of the resulting indicative summaries. In this study, we considered two key tasks: exploration and participation. With respect to exploration, our goal was to evaluate the extent to which the summaries help users explore the discussion and discover new perspectives. With respect to participation, we wanted to assess how effectively the summaries enabled users to contribute new arguments by identifying the appropriate context and location for a response. We asked five annotators to explore five randomly selected discussions from our dataset, for which we generated indicative summaries using our approach with GPT3.5. To facilitate intuitive exploration, we developed DISCUSSION EXPLORER (see Section 5.5), an interactive visual interface for the evaluated discussions and their indicative summaries. In addition to our summaries, two baselines were provided to the annotators for comparison: (1) the original web page of the discussion on ChangeMyView, and (2) a search engine interface powered by Spacerini (Akiki et al., 2023). The search engine indexed the sentences within a discussion using the BM25 retrieval model. This allowed users to explore interesting perspectives by issuing self-selected keywords as queries, as opposed to the frame and cluster labels that our summaries provide. Annotators selected the best of these interfaces for exploration and participation.
Results With respect to the exploration task, the five annotators agreed that our summaries outperformed the two baselines in terms of discovering arguments from the different perspectives presented by participants. The inclusion of argumentation frames proved to be a valuable tool for the annotators, facilitating the rapid identification of different perspectives, with the accompanying cluster labels showing the relevant subtopics in the discussion. For the participation task, three annotators preferred the original web page, while our summaries and the search engine were each preferred by one of the remaining two annotators when it came to identifying the appropriate place in the discussion to place their arguments. In a post-study questionnaire, the annotators revealed that the original web page was preferred because of its better display of the various response threads, a feature not comparably reimplemented in DISCUSSION EXPLORER; the original web page felt "more familiar." However, we anticipate that this limitation can be addressed by seamlessly integrating our indicative summaries into a given discussion forum's web page, creating a consistent experience and a comprehensive and effective user interface for discussion participation.

DISCUSSION EXPLORER
Our approach places emphasis on summary presentation by structuring indicative summaries into a table of contents for discussions (see Section 3). To demonstrate the effectiveness of this presentation style in exploring long discussions, we have developed an interactive tool called DISCUSSION EXPLORER, which illustrates how such summaries can be practically applied. Users can participate in discussions by selecting argumentation frames or cluster labels. Figure 4 presents indicative summaries generated by different models, providing a quick overview of the different perspectives. This two-level, table-of-contents-like summary provides effortless navigation, allowing users to switch between viewing all arguments in a frame and understanding the context of sentences in a cluster of the discussion (see Figure 5).

Conclusion
We have developed an unsupervised approach to generating indicative summaries of long discussions to facilitate their effective exploration and navigation. Our summaries resemble tables of contents, which list argumentation frames and concise abstractive summaries of the latent subtopics for a comprehensive overview of a discussion. By analyzing 19 prompt-based LLMs, we found that GPT3.5 and GPT4 perform impressively, with LLaMA fine-tuned using chain-of-thought being the second best. A user study of long discussions showed that our summaries were valuable for exploring and uncovering new perspectives in long discussions, an otherwise tedious task when relying solely on the original web pages. Finally, we presented DISCUSSION EXPLORER, an interactive visual tool designed to navigate through long discussions using the generated indicative summaries. This serves as a practical demonstration of how indicative summaries can be used effectively.

Figure 4: DISCUSSION EXPLORER provides a concise overview of indicative summaries from various models for a given discussion (here: CMV: The "others have it worse" argument is terrible and should never be used in an actual conversation with a depressed person). The summary is organized hierarchically: the argument frames act as headings, while the associated cluster labels act as subheadings, similar to a table of contents. Cluster sizes are also indicated. Clicking on a frame lists all argument sentences in a discussion that are assigned to that frame, while clicking on a cluster label shows the associated argument sentences that discuss a subtopic in the context of the discussion (see Figure 5).

Limitations
We focused on developing a technology that facilitates the exploration of long, argumentative discussions on controversial topics. We strongly believe that our method can be easily generalized to other types of discussions, but budget constraints prevented us from exploring these as well. We also investigated state-of-the-art language models to summarize these discussions and found that commercial models (GPT3.5, GPT4) outperformed open-source models (LLaMA, T0) in generating indicative summaries. Since the commercial models are regularly updated, it is important to note that the results of our approach may differ in the future. Although one can define a fixed set of prompts for each model, our systematic search for optimal prompts based on an evaluation metric is intended to improve the reproducibility of our approach, as newer models are released regularly.
To evaluate the effectiveness of the generated summaries, we conducted a user study with five participants, which demonstrated the summaries' usefulness in exploring discussions. Further research is needed on how to effectively integrate summaries of this type into discussion platform interfaces, which was beyond the scope of this paper.

A Preprocessing
Deleted posts were matched using: "[deleted]", "[removed]", "[Wiki][Code][/r/DeltaBot]", "[History]". To remove posts from moderators, we used:
• "hello, users of cmv! this is a footnote from your moderators"
• "comment has been remove"
• "comment has been automatically removed"
• "if you would like to appeal, please message the moderators by clicking this link."
• "this comment has been overwritten by an open source script to protect"
• "then simply click on your username on reddit, go to the comments tab, scroll down as far as possibe (hint:use res), and hit the new overwrite button at the top."
• "reply to their comment with the delta symbol"

B Clustering Implementation
We employed HDBSCAN, a soft clustering algorithm (Campello et al., 2013), to cluster the contextual sentence embeddings from SBERT (Reimers and Gurevych, 2019). As these embeddings are high-dimensional, we follow Grootendorst (2022) and apply dimensionality reduction to these embeddings via UMAP (McInnes et al., 2017) and cluster them based on their Euclidean distance. Most parameters were selected according to the official recommendations for UMAP and HDBSCAN.

UMAP Parameters

metric We set this to "cosine" because this is the natural metric for SBERT embeddings.
n_neighbors We set this to 30 instead of the default value of 15 because this makes the reduction focus more on the global structure. This is important since the local structure is more sensitive to noise.
n_components We set this value to 10.
min_dist We set this value to 0 because this allows the points to be packed closer together which makes separating the clusters easier.

HDBSCAN Parameters

metric We set this to "euclidean" because this is the target metric that UMAP uses for reducing the points.
cluster_selection_method We set this value to "leaf". An alternative choice for this option is "eom", which has the tendency to create unreasonably large clusters; there are instances where it creates only two or three clusters even for very large discussions. The "leaf" method does not suffer from this problem, but it is more dependent on the min_cluster_size parameter.
min_cluster_size This parameter is the most important one for this approach. It is also not straightforward to find a value for it, since the sizes of the main subtopics of a discussion depend on the size of the discussion. To find a good value, we sampled 50 discussions randomly and 50 discussions stratified by discussion length from all discussions. We computed the clustering for all 100 discussions for different values of min_cluster_size and manually determined a lower and upper bound for min_cluster_size that give a good clustering.
We computed a regression model using the power-law function family f(x) = a · x^b as a basis, where the input variable x is the number of sentences in the discussion and the output variable is the average of the upper and lower bound. This yields the following function for computing min_cluster_size: f(x) = 0.421 · x^0.559. Figure 6 visualizes the upper and lower bounds as well as the fitted model.
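In code, this heuristic is a one-liner; the max(2, ·) guard is our addition, since HDBSCAN requires min_cluster_size to be at least 2:

```python
def min_cluster_size(num_sentences: int, a: float = 0.421, b: float = 0.559) -> int:
    # Power-law fit f(x) = a * x**b to the manually annotated bounds.
    return max(2, round(a * num_sentences ** b))
```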

C Generative Cluster Labeling
Model Descriptions Given the large number of models investigated in the paper for both tasks, we categorized them based on their release timelines. Models older than GPT3.5 are listed under Pre-InstructGPT, such as T0, BLOOM, GPT-NeoX, and OPT. The Direct and Dialogue labels refer to models released after GPT3.5, which differ in their prompt styles as shown in Figure 7. The best prompts for the manually evaluated models (Section 5.2) are shown in Figure 8.

1. T0 (Sanh et al., 2022) is an encoder-decoder LLM that handles a variety of tasks, including summarization, and surpasses GPT-3 in some tasks despite being much smaller. It was trained on prompted datasets where supervised datasets were transformed into prompts.
2. BLOOM (Scao et al., 2022) is an autoregressive LLM with 176B parameters, which specializes in prompt-based text completion for multiple languages. It also supports instruction-based task completions for previously unseen tasks.
4. OPT (Zhang et al., 2022) is an autoregressive LLM with 66B parameters from a suite of decoder-only pre-trained transformers. These models offer similar performance and sizes as GPT-3 while employing more efficient practices for data collection and model training.
5. GPT3.5 (Brown et al., 2020; Ouyang et al., 2022) is an instruction-following LLM with 175B parameters that outperforms the GPT-3 model across several tasks by consistently adhering to user-provided instructions and generating high-quality, longer outputs. We used the text-davinci-003 variant. Unlike the open-source models above, it is accessible exclusively through the OpenAI API.

Prompt Descriptions We investigated several prompt templates for each model and selected the best performing one. All the prompts investigated for the encoder-decoder T0 model are shown in Table 11. Prompt templates for the decoder-only Pre-InstructGPT models (BLOOM, OPT, GPT-NeoX) are listed in Table 12. Prompt templates for the instruction-following LLMs are listed in Table 13.
Automatic Evaluation For the sake of completeness, we automatically evaluated the recently released (at the time of writing) instruction-following models. To adapt them to generative cluster labeling, we devised two instructions (Figure 7) similar to the direct and dialogue style instructions used for frame assignment (Section 3.3). Next, we computed BERTScore and ROUGE against two sets of references: (1) manually annotated ground-truth labels for 300 clusters, and (2) cluster labels from GPT3.5, which was the best model as per our manual evaluation (Section 5.2, Table 3). Complete results for BERTScore along with length distributions of the generated cluster labels are shown in Table 7, while results for ROUGE are shown in Table 8.

Direct Instruction for Cluster Labeling
Every input is the content of a debate. For every input, you generate a single descriptive phrase that describes that input in very simple language, without talking about the debate or the author.

Dialogue Instruction for Cluster Labeling
A chat between a curious user and an artificial intelligence assistant. The user presents a debate and the assistant generates a single descriptive phrase that describes the debate in very simple language, without talking about the debate or the author.

D Frame Assignment
Model Descriptions We categorize the models according to the instruction style followed for fine-tuning and generation. Instructions for each type are shown in Figure 10. The best prompts for each model are listed in Figure 11.

Direct Instruction Models
1. LLaMA-CoT is a model fine-tuned on datasets inducing chain-of-thought and logical deductions (Si and Lin, 2023).
2. Alpaca (Taori et al., 2023) is fine-tuned from LLaMA on instruction-following demonstrations generated with OpenAI's text-davinci-003.

Dialogue Instruction Models

2. Vicuna (Chiang et al., 2023) is fine-tuned from LLaMA using user-shared conversations collected from ShareGPT. It has shown competitive performance when evaluated using GPT-4 as a judge. We used the 7B and 13B variants of this model.
3. Baize (Xu et al., 2023) is an open-source chat model trained on 100k dialogues generated by allowing ChatGPT (GPT-3.5-turbo) to converse with itself. We used the 7B and 13B variants of this model.

D.1 Citation Impact on Frame Assignment
We conducted additional experiments to evaluate the impact of providing the citation of the media frames corpus paper by Boydstun et al. (2014) as additional information in the instructions shown in Section 3.3. This piece of information was provided after the substring "defined by" in the prompt template. Table 5 shows the results. We note that providing the citation information has a positive impact on the performance of the models.

Direct Instruction for Frame Assignment
The following {input_type} contains all available media frames as defined in the work from {authors}: {frames} For every input, you answer with three of these media frames corresponding to that input, in order of importance.

({input_type} is a list of frame labels or a JSON with frame labels and their descriptions.)

Dialogue Instruction for Frame Assignment
A chat between a curious user and an artificial intelligence assistant. The assistant knows all media frames as defined by ... : {frames}. The assistant answers with three of these media frames corresponding to the user's text, in order of importance.

Table 5: Analysis of the impact of providing the citation of the media frames corpus paper as additional information in the instructions for frame assignment. Providing citation information (Cite.) shows up to 12% improvement for Falcon-40B and 9% for LLaMA-65B in the zero-shot setting (with only frame labels in the prompt).
The improvement is up to 12% for Falcon-40B and 9% for LLaMA-65B in the zero-shot setting (only frame labels without descriptions in the prompt). This improvement can be attributed to the models being trained on a large text corpus, with the citation serving as a strong signal for generating more accurate labels. ChatGPT, however, is only slightly affected.

Cultural Identity
The social norms, trends, values and customs constituting culture(s), as they relate to a specific policy issue.

Economic
The costs, benefits, or monetary/financial implications of the issue (to an individual, family, community or to the economy as a whole).
External Regulation & Reputation A country's external relations with another nation; the external relations of one state with another; or relations between groups.This includes trade agreements and outcomes, comparisons of policy outcomes or desired policy outcomes.
Fairness & Equality Equality or inequality with which laws, punishment, rewards, and resources are applied or distributed among individuals or groups.Also the balance between the rights or interests of one individual or group compared to another individual or group.
Health & Safety Healthcare access and effectiveness, illness, disease, sanitation, obesity, mental health effects, prevention of or perpetuation of gun violence, infrastructure and building safety.

Morality
Any perspective, policy objective, or action (including proposed action) that is compelled by religious doctrine or interpretation, duty, honor, righteousness, or any other sense of ethics or social responsibility.
Policy Prescription & Evaluation Particular policies proposed for addressing an identified problem, and figuring out if certain policies will work, or if existing policies are effective.

Political
Any political considerations surrounding an issue. Issue actions or efforts or stances that are political, such as partisan filibusters, lobbyist involvement, bipartisan efforts, deal-making and vote trading, appealing to one's base, and mentions of political maneuvering. Explicit statements that a policy issue is good or bad for a particular political party.

Public Opinion
References to general social attitudes, polling and demographic information, as well as implied or actual consequences of diverging from or getting ahead of public opinion or polls.

Quality of Life
The effects of a policy, an individual's actions or decisions, on individuals' wealth, mobility, access to resources, happiness, social structures, ease of day-to-day routines, quality of community life, etc.
Security & Defense Security, threats to security, and protection of one's person, family, in-group, nation, etc. Generally an action or a call to action that can be taken to protect the welfare of a person, group, nation sometimes from a not yet manifested threat.
Other Any frames that do not fit into the above categories.
Table 6: Descriptions of frames as per Boydstun et al. (2014). For zero-shot prompts, we experimented with providing (1) a list of frames, (2) frames with relevant aspects from the descriptions (e.g., the Economic frame has the aspects "costs", "benefits", and "financial implications"), and (3) frames with complete descriptions as additional context.
Table 7: Complete results of automatic evaluation via BERTScore for the cluster labeling task of all 19 LLMs. We compared them against the manually annotated reference and GPT3.5, the best model from our manual evaluation. The top three models are indicated for each metric. Similar to the ROUGE evaluation, we see a strong performance by ChatGPT and LLaMA-CoT. Also shown are statistics of the length of the generated cluster labels (in number of tokens).

Table 8: Complete results of automatic evaluation via ROUGE for the cluster labeling task of all 19 LLMs. We compared them against the manually annotated reference and GPT3.5, the best model from our manual evaluation. The top three models are indicated for each metric. We see that ChatGPT and LLaMA-CoT perform strongly across the board.
CMV: The "others have it worse" argument is terrible and should never be used in an actual conversation with a depressed person Indicative Summary (GPT3.5)

Fairness & Equality
• Complexities of comparing one's own struggles to those of others.
[97] (Quality of Life) • Advice can be helpful or unhelpful depending on how it is used.
[25] (Morality) • Focusing on personal goals and eliminating negative self-talk to create a growth mindset.

Figure 2: Effect of meta-sentence filtering: (a) Joint clustering of a discussion and meta-sentences D ∪ M′, with (b) the sampled meta-sentences M′ ⊂ M highlighted gray. (c) Each cluster is then classified as a meta-sentence cluster based on its proportion of meta-sentences and its neighboring clusters; meta-sentence clusters are omitted.

Figure 5: An exploratory view provided by DISCUSSION EXPLORER to quickly navigate a long discussion via an indicative summary. On the left, clicking on a cluster label lists all its constituent sentences. On the right, a specific sentence from the chosen cluster is presented in the context of the discussion. Softly highlighted are the sentences from other clusters that surround the selected sentence. Users can thus easily skim a discussion with several arguments for relevant information using the indicative summary in this exploratory view.

Figure 6: Blue vertical bars show the upper and lower bounds for min_cluster_size that yield a good clustering for the corresponding discussion. The red curve shows the optimal fit for the regression.

Figure 7: Direct and dialogue-style instructions for generative cluster labeling prompts. The best prompts for each model are shown in Figure 8.

Figure 8: Best prompts for generative cluster labeling for each model. These prompts were chosen based on the automatic evaluation of several prompts for each model against 300 manually annotated cluster labels.

Figure 9: Annotation interface for ranking-based qualitative evaluation of cluster labels.

Figure 10: Best performing instructions for frame assignment. Providing the citation for the frame inventory via the placeholder {authors} positively affects the performance of some models (Appendix D.1).

CMV: The "others have it worse" argument is terrible and should never be used in an actual conversation with a depressed person

Indicative Summary (GPT3.5)

Fairness & Equality
• Complexities of comparing one's own struggles to those of others. [97] (Quality of Life)
• Advice can be helpful or unhelpful depending on how it is used. [25] (Morality)
• Focusing on personal goals and eliminating negative self-talk to create a growth mindset.
• How to help those with depression. [35] (Quality of Life)

Morality
• Differences between sadness and depression. [98] (Quality of Life)
• Reflecting on blessings and practicing gratitude to increase happiness. [39] (Quality of Life)
• Mindful awareness and reprogramming of thought patterns to take charge of emotions. [22] (Quality of Life)
• Gaining perspective to appreciate life and understand how one's actions affect others. [22] (Fairness & Equality)
• A journey of self-discovery and growth through difficult times. [17] (Quality of Life)

Table 15: Indicative summary from GPT3.5.

CMV: The "others have it worse" argument is terrible and should never be used in an actual conversation with a depressed person

Indicative Summary (GPT4)

Fairness & Equality
• Acknowledging personal struggles while recognizing others' hardships. [97] (Quality of Life)

Health & Safety
• Understanding and managing depression as a complex mental state. [98] (Quality of Life)
• Importance of gratitude for happiness and mental health. [39] (Quality of Life)
• Impact of different approaches to supporting depressed individuals. [35] (Quality of Life)
• Controlling and reprogramming thought patterns through mindful awareness and rational evaluation of emotions. [22] (Quality of Life)

Policy Prescription & Evaluation
• Effectiveness of advice depends on individual and context. [25] (Quality of Life)

Quality of Life
• Gaining perspective for personal growth and understanding. [22] (Morality)
• Focusing on positive mindset and self-growth. [21] (Health & Safety)
• Overcoming challenges and finding happiness through personal growth and change. [17] (Morality)

Table 3: Results of the qualitative evaluation of generative cluster labeling. Shown are (1) the mean rank of a model from four annotators and (2) the number of times a model was ranked first by an annotator. GPT3.5 (text-davinci-003) performed better than the other models and generated longer labels on average.