Using LLM for Improving Key Event Discovery: Temporal-Guided News Stream Clustering with Event Summaries



Introduction
Analyzing the dynamics of discussions within the stream of news coverage has been an important tool for researchers to visualize and characterize media discourse around a topic (Field et al., 2018; Liu et al., 2019; Li and Goldwasser, 2019; Roy and Goldwasser, 2020; Luo et al., 2020; Liu et al., 2021; Lei et al., 2022; Dutta et al., 2022). News media discourse is typically centered around real-world events that catch media attention and give rise to streams of news reports. With the vast, ever-growing amount of news information available, we need automatic ways of identifying such key events.
In this paper, we study the problem of identifying and characterizing key events from a large collection of news articles. Since the number of news events is usually not known in advance, past work has typically formulated the problem as a form of non-parametric clustering of news articles, using Hierarchical Dirichlet Processes (Zhou et al., 2015; Beykikhoshk et al., 2018) or stream clustering (Laban and Hearst, 2017; Miranda et al., 2018; Staykovski et al., 2019; Saravanakumar et al., 2021). Rather than relying on the output of such clustering algorithms directly, we view the discovered clusters as event candidates, and leverage recent advances in Large Language Models (LLMs) (Brown et al., 2020) to characterize these candidates and reason about their validity. From a bird's-eye view, the process is related to past work on interactive clustering (Hu et al., 2014; Pacheco et al., 2022, 2023), but instead of using human feedback to shape the emergent clusters, we rely on LLM inference.
We propose a framework for clustering an archive of news articles into temporally motivated news events. A high-level overview of our approach is shown in Figure 1. We first retrieve relevant issue-specific articles (details about the document retrieval module are in App. A) and perform temporal analysis to identify "peaks", in which the number of articles is significantly higher. We then use HDBSCAN (Campello et al., 2013), a non-parametric clustering algorithm, to generate candidate event clusters. We then characterize the candidate clusters by performing few-shot multi-document summarization of the top-K articles assigned to each cluster, identify inconsistent clusters by assessing the (dis)agreement between the summary and each article individually, and identify redundant clusters by assessing the similarity between cluster pairs' summaries (details in Sec. 2.1). These low-quality candidates are removed, resulting in higher-quality event clusters. We demonstrate this property over the NELA dataset (Horne et al., 2022) and show improvement both in terms of event coherence and document mapping quality.

Event Discovery and Article Inference

Event Discovery

Temporal Filtering. The first step towards generating event candidates is to identify temporal landmarks, or peaks, where the media coverage surges with respect to one or more real-world events. We represent the news articles as time-series data X = {x_1, x_2, ..., x_T}, where x_t denotes the number of articles published at time step t. The task is to identify a set of peaks, P = {p_1, p_2, ..., p_m}, at different points in time. With this formulation, we hypothesize that the resulting clusters from our framework can segregate discussions at various time steps and form more coherent events than other approaches. We use an existing outlier detection algorithm (Palshikar et al., 2009) for this task. More details are in Appendix B.
Peak-Specific Clustering. Within each peak, the increased media coverage can be attributed to multiple relevant events. We categorize the documents in each peak p_i into a set of events, E_i = {e_1, e_2, ..., e_q}, and form an overall event set pertaining to the issue. We embed the title and first 4 lines of each news article using a dense retriever model (Ni et al., 2021). The embedded documents are clustered using HDBSCAN to identify key news events. Prior to clustering, we reduce the dimensionality of the document embeddings using UMAP (McInnes et al., 2018). Details are in Appendix C.
Event Characterization. The event set obtained at each peak (E_i) is still prone to noise and is not easily interpretable without significant effort. Characterizing the news events makes the clusters interpretable and helps remove inconsistencies. The candidate events are characterized by generating a multi-document summary using GPT-3.5. The prompts are engineered to generate short event-specific summaries in a two-shot setting. The two closest documents to each centroid are used in the prompt to generate event summaries.

Table 1: Incoherent cluster removal. The cluster summary aligns with the 1st and the 2nd article, while the 3rd article is off-topic compared to the other two.
Post summary generation, we perform a cluster inconsistency check. A cluster is deemed incoherent if the top-K closest documents to the centroid do not align with the summary embedding. We embed the event summaries using the same dense retriever model, and compute the cosine similarity score between the summary embedding and the top-K documents for the cluster (K = 5). Based on a threshold value, we treat the incoherent clusters as noise and discard them. Note that we only discard clusters, not the documents associated with them; those documents are still used for cluster membership assignment in the next stage of our framework. Tab. 1 shows an example of a discarded cluster.

Event Title: President Biden's Climate Plan
Event Description: This is about President Joe Biden's executive orders aimed at tackling climate change by reducing the U.S. carbon footprint and emissions, stopping oil and gas leases on public lands, and prioritizing climate change as a national security concern.

Event Title: Biden's Climate Change Actions
Event Description: This is about President Joe Biden's executive actions to combat climate change by prioritizing science and evidence-based policy across federal agencies, pausing oil drilling on public lands, and aiming to cut oil, gas, and coal emissions.

Event Title: Texas Abortion Ban
Event Description: This is about a new Texas law that bans abortions after 6 weeks and empowers regular citizens to bring civil lawsuits against anyone who aids a woman looking to terminate a pregnancy.

Event Title: Texas Abortion Law
Event Description: This is about the controversial Texas abortion law that bans abortions after six weeks and has been condemned by President Joe Biden as an unprecedented assault on women's rights.
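The inconsistency check described above can be sketched as follows. This is a minimal sketch: the function names are ours, the embeddings are placeholders, and the aggregation of the top-K similarities into a single decision (here, their mean) is our assumption, since the text only specifies a per-cluster threshold.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_incoherent(summary_emb, doc_embs, threshold=0.6, k=5):
    """Deem a cluster incoherent if the mean cosine similarity between
    the summary embedding and the top-k documents falls below the
    threshold (0.6 in the paper; averaging is one plausible aggregation)."""
    sims = [cosine(summary_emb, d) for d in doc_embs[:k]]
    return sum(sims) / len(sims) < threshold
```

Incoherent clusters are dropped, while their documents remain available for the membership-assignment stage.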
We perform an additional cleaning step by merging clusters that share a similar event summary. We devise a simple greedy algorithm which utilizes GPT-3.5 for inference. In the first iteration of the algorithm, we start by constructing a set, S = {(s_1, s_2), ..., (s_{n-1}, s_n)}, that contains every pairwise combination of event summaries. For each element in S, we prompt the LLM to infer whether the pair of summaries discusses the same event. If two event summaries, say (s_1, s_2), are equivalent, then we merge them, and update the set S by removing every element that contains s_1 or s_2. In the second iteration, we construct a new set, S', that holds every combination of the updated event summaries, and repeat the previous step. We run the algorithm for two iterations, halting early if there are no merges after the first iteration. Tab. 2 shows an example where the event summaries clearly indicate that the clusters need to be merged. Details about the hyperparameter selections and prompts are in Appendices C and B.
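The greedy merge above can be sketched as follows. The `same_event` callable stands in for the GPT-3.5 equivalence prompt, and the string concatenation used as a "merged summary" is purely a placeholder; both are our assumptions for illustration.

```python
from itertools import combinations

def greedy_merge(summaries, same_event, max_iters=2):
    """Greedy cluster-summary merging: enumerate summary pairs, merge
    equivalent ones, drop pairs touching an already-merged summary, and
    repeat (at most max_iters times, halting early if nothing merged)."""
    summaries = list(summaries)
    for _ in range(max_iters):
        merged_any = False
        pairs = list(combinations(summaries, 2))
        consumed = set()
        for s1, s2 in pairs:
            if s1 in consumed or s2 in consumed:
                continue  # this pair touches an already-merged summary
            if same_event(s1, s2):
                consumed.update([s1, s2])
                summaries = [s for s in summaries if s not in (s1, s2)]
                summaries.append(s1 + " / " + s2)  # placeholder merge
                merged_any = True
        if not merged_any:
            break
    return summaries
```

Merged summaries only participate in comparisons from the next iteration onward, mirroring the two-pass structure described above.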

Inference: Map Articles to Events
In this stage of our framework, we decide cluster membership using a similarity module. We embed the updated event summaries using the same encoder, and compute the cosine similarity score between the summary and the document of interest. By thresholding, we determine whether the article can be mapped to an event. For cluster membership, we extend the temporal window by d days before and after the peak (d = 1), and consider all documents published in that timeframe.
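As a sketch, the mapping step might look like the following. The data layout, the "assign to the best match above threshold" tie-breaking, and all names are our assumptions; the text only specifies thresholding within the extended temporal window.

```python
import numpy as np

def map_article(article_emb, article_date, events, threshold=0.69, d=1):
    """Assign an article to the best-matching event summary, provided the
    article falls within +/- d days of the event's peak and the cosine
    similarity clears the threshold (0.69 in the paper). `events` is a
    list of (event_id, peak_date, summary_emb); returns an id or None."""
    best_id, best_sim = None, threshold
    for event_id, peak_date, summary_emb in events:
        if abs((article_date - peak_date).days) > d:
            continue  # outside the extended temporal window
        sim = float(np.dot(article_emb, summary_emb) /
                    (np.linalg.norm(article_emb) * np.linalg.norm(summary_emb)))
        if sim >= best_sim:
            best_id, best_sim = event_id, sim
    return best_id
```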

Experiments and Results
We conduct experiments on the NELA dataset, a large collection of news articles (see Appendix A). Using our document retrieval module, we collect a total of 335k relevant news articles on 11 contemporary issues. The application of temporal filters reduces the article count to 90k, which is the basis for our analysis. The retrieved articles are mapped to a four-way {left, right, center, conspiracy-pseudoscience} political rating. Details about the dataset, the document retrieval module, and the four-way political rating can be found in Appendix A.
Evaluation Metrics. We evaluate our framework's ability to create coherent event clusters at the desired granularity with three automatic metrics inspired by Mimno et al. (2011). Given an event e_i and the top-10 relevant entities V^{e_i} = {v_l^{e_i}}, l ∈ [1..10], ranked by TF-IDF, entity purity measures the percentage of documents that mention at least one of the top-10 entities; coverage counts the percentage of documents accounted for in the cluster assignments. In addition, entity coherence considers co-occurrences of central entity pairs in the clustered documents to measure coherency for an event:

C(e_i, V^{e_i}) = Σ_{m=2}^{10} Σ_{l=1}^{m-1} log( (F(v_m^{e_i}, v_l^{e_i}) + ε) / F(v_l^{e_i}) ),

where F(v_m, v_l) indicates the co-occurrence frequency of two entities in documents and F(v_l) the document frequency of v_l. An entity coherence value closer to zero indicates a highly coherent news event cluster. We offer a more detailed explanation of the metrics in Appendix D.
Baselines. We compare our method's performance against several competitive topic models as baselines. We consider LDA (Blei et al., 2003; Hoffman et al., 2010) in two different settings: LDA and LDA (Temporal). The topics are estimated individually at each temporal peak for LDA (Temporal), whereas the topics are estimated across all peaks at once for LDA. We include three additional baselines: Temporal Filtering, HDBSCAN, and BERTopic (Grootendorst, 2022). Note that BERTopic is an off-the-shelf neural baseline for clustering documents. For methods other than ours, we do not incorporate a cluster membership module, as we directly estimate the topics for all documents in an extended temporal window of d days before and after the peak (d = 1). Preprocessing and hyperparameter details are in Appendix C.

Table 3: Evaluation results averaged for all issues. The last column shows the average of the total event count from each peak and for each issue. For LDA (Temporal), we assigned the document to its most probable topic if the probability was ≥ 0.5.
Results. Tab. 3 shows the aggregated results obtained for the various methods across all issues. For LDA (baseline), the events are estimated over a union of all documents from every peak for an issue. We study the impact of event estimation with the temporal component by comparing the LDA (baseline) and Temporal Filtering methods. We observe only a slight drop in average purity (−3 points) for the Temporal Filtering method. Further, Tab. 8 shows that for the Free Speech, Abortion, and Immigration issues, the purity scores are higher than LDA (baseline), which supports our hypothesis that adding a temporal dimension to event identification can help form coherent events.

Coverage vs. Purity Trade-off
We evaluate the trade-off between coverage and entity purity among the methods that take event temporality into account. We observe that LDA (Temporal) has very high coverage with the lowest purity, which can be attributed to noise associated with the topic distributions. BERTopic (https://maartengr.github.io/BERTopic) improves over this method in both coverage and purity across the 11 issues; it even outperforms HDBSCAN on both metrics. However, while BERTopic has increased coverage, it still fails to outperform our method in terms of purity, which can be primarily attributed to our inference mechanism based on generated event summaries.
To address the low-coverage issue of our method, we propose running our framework for a second iteration, updating each event summary embedding with the mean of the top-10 most representative document embeddings in the cluster (from the first iteration). In doing so, average coverage increased by +12.5 points across all issues, with a minimal decrease of < 5 points in purity. Tab. 6 shows the results for each issue after the second iteration.
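The second-iteration update can be sketched as follows. Ranking "most representative" documents by similarity to the current summary embedding is our assumption; the paper does not spell out the ranking criterion.

```python
import numpy as np

def refresh_summary_embedding(summary_emb, doc_embs, top_k=10):
    """Second-iteration update: replace the event-summary embedding with
    the mean of the top-k most representative document embeddings
    (ranked here by cosine similarity to the current summary embedding)."""
    doc_embs = np.asarray(doc_embs, dtype=float)
    sims = doc_embs @ summary_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(summary_emb))
    top = np.argsort(-sims)[:top_k]
    return doc_embs[top].mean(axis=0)
```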

Impact of Merge/Remove Operations
We investigate the impact of removing cluster inconsistencies from the generated candidate events. For this analysis, we compare against HDBSCAN with the same hyperparameters and input data as our method. We observe that the average inter-event cosine similarity score between event pairs, across all issues, is lower by 0.14 for our method. This indicates that our method achieves improved cluster separability after eliminating inconsistencies. Tab. 5 shows the results for each issue. Overall, the score is reduced, with one exception for the issue of Corruption. Manual inspection suggests that this increase may be due to the removal of "good" clusters. An example is shown in Fig. 7.

KEYEVENTS ⇒ More Event Coherence
To better understand the advantages and disadvantages of our method, the authors manually annotated a small set of data samples for Climate Change. We test for event coherence and mapping quality on this dataset. We define an event to be coherent if the top-K most representative documents of that event are in agreement with each other (K = 3). We also annotate the validity of document-to-event assignments (mapping quality), where we check for agreement between the document and its respective event summary. Details about the experimental setup can be found in Appendix E.
The test is conducted across all events for our method, HDBSCAN, and BERTopic. To measure coherence, we first identify the top-K documents for an event based on their cosine similarity scores with the event centroid. In addition, we estimate mapping quality by judging whether document pairs should be clustered together.
Results. The results of the human evaluation are shown in Tab. 4. Our method failed to generate coherent events for 5 out of 56 cases for Climate Change, while BERTopic failed in 9 out of 62 cases (ignoring 3 cases where the annotator provided a label of −1), and HDBSCAN failed in 8 out of 53 cases. Overall, the event coherence scores of BERTopic and HDBSCAN trail our method by a margin of approximately 6 points, implying that the events generated by these methods are also largely coherent. However, considering the event purity scores, we conclude that these two methods are noisier. In terms of mapping quality, our method outperforms HDBSCAN by a large margin. The precision score of BERTopic is better than that of HDBSCAN, indicating the effectiveness of BERTopic in grouping 'good' item pairs together over a small sample of randomly selected data points for the issue of Climate Change. More details are in Appendix E.

LLM Usage and Efficiency
As temporal filtering results in an average of 55 event clusters per issue, using the LLM for event summarization and cluster merging incurs a reasonable cost, as we discuss in Limitations.

Broader Impact
Our method and the resulting KEYEVENTS dataset could be useful for analyzing political discourse across different ideologies. As a simple case study, we illustrate how the portrayal of events varies across political ideologies. We take an entity-based approach (Rashkin et al., 2016; Field and Tsvetkov, 2019), focusing on mentions of Senator Joe Manchin in Climate Change events. Fig. 2 shows that left-leaning articles mention him significantly more than the other two ideologies in some of the events (e.g., the 5th, 9th, and 14th). Analyzing these events' articles shows that left-leaning articles criticize his ties to the coal industry and his opposition to climate change legislation, while articles with other ideological leanings mention him less (or not at all) under the same events.
Different ideologies also express different sentiments when mentioning the same entity. In Biden's Executive Actions on Climate Change (the 16th event in Fig. 2), articles from different ideologies mention Joe Manchin with comparable frequency. We prompt GPT-3.5 to classify the sentiment expressed towards him (positive, neutral, or negative). Interestingly, none of the articles from any ideology expresses a positive sentiment; 86% of the articles from the left express a negative attitude towards him, whereas only 38% and 0% of the articles from the center and the right, respectively, have negative sentiments. This distinction shows that the same entities can be portrayed differently within each event to strengthen beliefs along political lines.

Conclusion
We presented a framework for key event identification and showed, through quantitative measures and human evaluation, that the events generated by our approach are coherent. We also presented a simple qualitative study to showcase the potential of KEYEVENTS for investigating various political perspectives in nuanced settings.

Limitations
As the temporal filtering step of our framework relies on the publication date of documents as input, we work under the assumption that the documents have a timestamp attached to them. However, the main ideas of event characterization using an LLM and associating documents with their closest event summary are applicable to other settings with no changes.
Our approach relies on GPT-3.5 for generating multi-document event summaries and for cluster merging. We chose GPT-3.5 over open-source counterparts mostly due to computational resource constraints. Since all GPT calls are made at the cluster level, we were able to keep the total experimental cost of the paper under $5 with respect to the OpenAI API. To minimize the reliance on and cost associated with LLM usage, we use only the pair of documents with the most similar vector representations to generate each event summary. We opt for a more efficient approach here, and leave the exploration of the efficiency vs. performance trade-off for future work.

A Document Retrieval Module
This module retrieves news articles relevant to an issue of interest. The user is expected to provide an issue name, or a set of issue names, around which documents are to be retrieved. Using this input, we generate a set of relevant keywords associated with each issue by prompting GPT-3.5. We craft the prompt so that GPT-3.5 generates a list of keywords that appear in the context of the issue specified by the user. We then use the BM25 algorithm on the indexed NELA data to retrieve documents associated with each keyword for the issue. We use BM25 with the default setting for b, and only vary the term frequency saturation parameter, k1 = 1.3, as we are dealing with longer news documents.
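For reference, a compact Okapi BM25 scorer with the setting above (k1 = 1.3, default b = 0.75) might look like the following; this is a from-scratch sketch, not the indexed retrieval stack used in the paper.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.3, b=0.75):
    """Minimal Okapi BM25: score each pre-tokenized document in `docs`
    against `query_terms`. k1 controls term-frequency saturation, b
    controls document-length normalization."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

A larger k1 lets repeated query terms keep contributing to the score, which is why longer news documents motivate raising it slightly above typical defaults.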
NELA Dataset. It is a collection of ≈ 1.8M news documents from 367 news outlets, published between January 1st, 2021 and December 31st, 2021. NELA organizes the news articles based on their ideological bias. However, this structure is not well-suited to characterizing the differences in discourse between political ideologies in online news media.
In this work, we primarily focus on 207 news sources based in the USA. The political ratings corresponding to these sources are mapped to a four-way {left, right, center, conspiracy-pseudoscience} scheme. The ratings are based on MBFC. Using the scores provided by MBFC, we map the left-center and right-center political ratings to one of the {center, left, right} ratings.

B Event Candidate Generation
Temporal Filtering. We implement an outlier detection algorithm (Palshikar et al., 2009) which considers a temporal window of 2k points around each data point x: the k points before x and the k points after x. Using these 2k data points, we compute the mean and standard deviation. The data point is considered a local peak if it is at least one standard deviation above the mean. Among the detected local peaks, we further apply a filter to retrieve global peaks: we compute the mean and standard deviation over the detected local peaks, and if the value at a local peak is above this mean, we mark it as a global peak. In the case of multiple peaks within a temporal window of k days, we merge them to form a single peak. We set k = 3 for our experiments. Figure 3 shows the result of this algorithm for the issue of Abortion.
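The two-stage peak detection above can be sketched as follows (merging peaks that fall within k days of each other is omitted for brevity; function and variable names are ours):

```python
import statistics

def detect_peaks(series, k=3):
    """Local-then-global peak detection in the spirit of Palshikar (2009):
    a point is a local peak if it exceeds the mean of its 2k neighbours by
    at least one standard deviation; a local peak is kept as a global peak
    if its value is at or above the mean over all local-peak values."""
    local = []
    for i in range(k, len(series) - k):
        window = series[i - k:i] + series[i + 1:i + k + 1]
        mu, sd = statistics.mean(window), statistics.pstdev(window)
        if series[i] > mu + sd:
            local.append(i)
    if not local:
        return []
    peak_mu = statistics.mean(series[i] for i in local)
    return [i for i in local if series[i] >= peak_mu]
```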

C Models and Hyperparameters
To obtain topics from LDA with variational Bayes inference (under both settings), we use the Gensim (Rehurek and Sojka, 2011) implementation. We follow the preprocessing steps of Hoyle et al. (2021), and estimate the number of topics in a data-driven manner by maximizing topic coherence. We do a grid search over {2, 3, 4, 5} for the LDA (Temporal) method. The set of candidate topic counts for LDA (baseline) is {10, 20, ..., 60}.
In the case of HDBSCAN, both when used within our method and as a standalone clustering model, we use a data-driven approach to estimate the best number of topics by maximizing the DBCV score (Moulavi et al., 2014). We retain the default settings for the cluster_selection_method and metric parameters, while we tune min_cluster_size to get more sensible topics. This number is selected based on a grid search whose values are sensitive to the number of input data points. Letting |X| denote the number of data points, the grid for HDBSCAN in our method is {0.05 × |X|, 0.06 × |X|, ..., 0.1 × |X|}. For HDBSCAN (standalone), we consider only the last three elements of this grid; otherwise, we observe an unusually high number of topics per peak. We set the n_neighbors parameter of the UMAP embedding model to min_cluster_size.
For the cluster incoherency check, we choose a threshold of 0.6. If the cosine similarity score between the event summary embedding and a document embedding is lower than this threshold, we discard those documents as noise.
For our method's similarity module, we choose a threshold of 0.69, based on evaluating the trade-off between purity, coherence, and coverage values.
Prior to computing the TF-IDF scores used to retrieve the top-K entities, we apply a simple yet effective entity linking method (Ratinov et al., 2011) based on Wikipedia mentions.

D Evaluation Metrics
In this section, we describe the evaluation metrics proposed in our work.
Several past studies (Nanni et al., 2017; Spitz and Gertz, 2018; Chen et al., 2023) have shown that entities, and the context associated with them, can effectively represent a topic or an event. With this premise, we devise entity-based evaluation metrics that help us quantify the quality of the resulting clusters. We further validate our results through a simple human evaluation process on partially annotated data for the issue of Climate Change.
We define entity purity for an event as the proportion of documents mapped to that event that mention at least one entity overlapping with the top-K TF-IDF-based entities for that event (K = 10). The idea is that the central entities associated with a news event should be reflected in the documents clustered for that event. Note that, in order to discount entities commonly repeated in news, such as Biden and Trump, we take the top-K TF-IDF-based entities for an event as its central entities. A purity score of 100% for an event indicates that every document in the cluster mentions at least one of the top-K central entities, suggesting that each document is potentially discussing that event.
We also define an entity coherence metric as an additional measure of cluster quality. We adapt the topic coherence metric of Mimno et al. (2011) to define the entity coherence C of an event e_i as

C(e_i, V^{e_i}) = Σ_{m=2}^{10} Σ_{l=1}^{m-1} log( (F(v_m^{e_i}, v_l^{e_i}) + ε) / F(v_l^{e_i}) ),

where e_i denotes an event, V^{e_i} = {v_1^{e_i}, v_2^{e_i}, ..., v_10^{e_i}} denotes the top-10 TF-IDF-based entities for e_i, F(v_m^{e_i}, v_l^{e_i}) indicates the co-document frequency (the joint document frequency of entity types v_m and v_l), F(v_l) indicates the document frequency of entity type v_l, and ε is a smoothing factor. Informally, it considers co-occurrences of central entity pairs (as opposed to topic words) in the clustered documents to measure coherency for an event. Note that a higher value indicates a more coherent news event; by virtue of the log in the formula, a value closer to zero is more desirable than a largely negative one. We further observe that this measure is positively correlated with entity purity, indicating that purity can be a good proxy for cluster coherence.
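The two entity-based metrics can be computed as in the following sketch. Document sets, function names, and ε = 1 are illustrative; we also assume the central entities are drawn from the cluster's own documents, so each document frequency F(v_l) is at least 1.

```python
import math

def entity_purity(doc_entities, central):
    """Percentage of documents mentioning at least one central
    (top-K TF-IDF) entity. `doc_entities` is a list of entity sets."""
    hits = sum(1 for ents in doc_entities if set(ents) & set(central))
    return 100.0 * hits / len(doc_entities)

def entity_coherence(doc_entities, central, eps=1.0):
    """Mimno-style coherence over central entity pairs; values closer
    to zero indicate a more coherent event."""
    def F(*ents):  # (co-)document frequency of the given entities
        return sum(1 for d in doc_entities if all(e in d for e in ents))
    c = 0.0
    for m in range(1, len(central)):        # 0-indexed version of m = 2..10
        for l in range(m):                  # l = 1..m-1 in the formula
            c += math.log((F(central[m], central[l]) + eps) / F(central[l]))
    return c
```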
In addition to these, we have a coverage metric, which counts the percentage of documents accounted for in the clustering process. Ideally, we want a clustering algorithm to reject noise while still accounting for every document in the corpus; post noise removal, a good clustering algorithm is expected to approach a coverage of 100%.

E Human Evaluation
For the event coherence case, the annotators are asked to verify whether the top-3 documents for an event are in agreement with each other. They provide a score of 1 if the documents are in agreement, 0 if they are not, or −1 if they are unsure about the label. We show only the title and the first four lines of each news article. We did not receive any −1 labels for this case.
To evaluate the mapping quality of our model, we randomly sample a set of peaks, and within each peak we randomly sample 50 documents, forming an overall set of 430 documents mapped to various events for the issue of Climate Change. We show the annotators the title and first four lines of each news article, along with the event summary. As in the coherence case, we ask annotators to provide a score of 1 if the document aligns with the summary, 0 for no alignment, or −1 if they are unsure. There is no strict definition of alignment; we let the annotators make this judgement. We received a total of 6 "not sure" labels. After eliminating the unsure instances, our method got 352 out of 424 instances correct, which translates to a precision of ≈ 0.83.
However, in order to compare the performance of our method with the other models, we devise a strategy to derive 'good' and 'bad' example pairs by treating the human-labeled data as the gold standard. We assume that if two documents within the same event each receive a score of 1, then they must be 'equivalent'.
With this assumption, for a given temporal peak, we construct the 'good/positive' example set by considering, within every event, every possible document pair from valid cluster assignments. To construct the 'bad/negative' example set, we take the union of the following: (a) document pairs from valid cluster assignments belonging to different events; (b) document pairs consisting of an invalid and a valid cluster assignment within the same event.
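The pair-construction procedure above can be sketched as follows; the input layout (a mapping from event id to per-document annotator scores) is our assumption.

```python
from itertools import combinations

def build_eval_pairs(labels):
    """`labels` maps event_id -> {doc_id: score}, with score 1 (valid
    assignment) or 0 (invalid). Good pairs: two valid docs in the same
    event. Bad pairs: valid docs across different events, plus
    valid-invalid pairs within the same event."""
    good, bad = set(), set()
    for ev, docs in labels.items():
        valid = [d for d, s in docs.items() if s == 1]
        invalid = [d for d, s in docs.items() if s == 0]
        good.update(frozenset(p) for p in combinations(valid, 2))
        bad.update(frozenset((v, i)) for v in valid for i in invalid)
    for e1, e2 in combinations(list(labels), 2):
        for d1, s1 in labels[e1].items():
            for d2, s2 in labels[e2].items():
                if s1 == 1 and s2 == 1:
                    bad.add(frozenset((d1, d2)))
    return good, bad
```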
The task is to evaluate how well each method retains the good example pairs within the same cluster. We first remove all documents that are not mapped to any event by a given method. Owing to the nature of the data collection, we report only precision values for the three methods under consideration.

G Prompt Templates
This section shows the prompt templates used to generate a multi-document summary (Fig. 9), and to verify whether a pair of cluster characterizations is equivalent (Fig. 10).

Table 8: Compares the results obtained for each method and issue. The last column shows the summation of all event counts (from each detected temporal peak). For LDA (Temporal), we assigned the document to its most probable topic if the probability was ≥ 0.5.

Figure 1: High-level overview of our framework for KEYEVENTS identification.

Figure 2: Frequency of the entity Joe Manchin (y-axis: #entity mentions per article within each event) in Climate Change events (x-axis: event indices across time).

Figure 3: Dynamic analysis of documents from Jan 1 to Dec 31, 2021, for the issue Abortion. The x-axis represents time (one-day intervals); red dots indicate detected peaks.
Event Title: Climate Justice and African Activists
Event Description: This is about the challenges faced by African climate activists in bringing attention to the climate crisis and the need for climate justice.
Article 1: "There Will Never Be Climate Justice If African Activists Keep Being Ignored". We go to Kampala, Uganda, to speak to climate activist Vanessa Nakate on the occasion of her first book being published, A Bigger Picture. ...
Article 2: "The Looking Glass World Of 'Climate Injustice'". In our wacky world where almost nothing makes sense anymore, there is no shortage of examples of politicians, let alone self-important academics, journalists, and wealthy elites, looking foolish with self-contradictory policy demands. ...
Article 3: "New Miss Universe Urges Action on Climate Change: Choice to Kill or Save Nature". A new Miss Universe has been crowned and she is a climate alarmist. ...

Table 2: Illustrates two cases of cluster merge, from the issues Climate Change and Abortion respectively.

Table 4: Human evaluation results of our method.

Table 5: Shows the impact of the cluster merge and remove operations for each issue. Note that the input data and hyperparameters used by HDBSCAN in this setting are the same as for our method. A lower similarity score is better.