Discord Questions: A Computational Approach To Diversity Analysis in News Coverage

There are many potential benefits to news readers accessing diverse sources. Modern news aggregators do the hard work of organizing the news, offering readers a plethora of source options, but choosing which source to read remains challenging. We propose a new framework to assist readers in identifying source differences and gaining an understanding of news coverage diversity. The framework is based on the generation of Discord Questions: questions with a diverse answer pool, explicitly illustrating source differences. To assemble a prototype of the framework, we focus on two components: (1) discord question generation, the task of generating questions answered differently by sources, for which we propose an automatic scoring method and create a model that improves performance over current question generation (QG) methods by 5%, and (2) answer consolidation, the task of grouping answers to a question that are semantically similar, for which we collect data and repurpose a method that achieves 81% balanced accuracy on our realistic test set. We illustrate the framework's feasibility through a prototype interface. Even though model performance at discord QG still lags human performance by more than 15%, generated questions are judged to be more interesting than factoid questions and can reveal differences in the level of detail, sentiment, and reasoning of sources in news coverage.


Introduction
News coverage often contains bias linked to the source of the content, and as many readers rely on few sources to get informed (Newman et al., 2021), readers risk exposure to such bias on critical societal issues such as elections and international affairs (Bernhardt et al., 2008). Modern news aggregators such as Google News propose an engineering solution to the problem, offering news readers diverse source alternatives for any given topic. In practice, however, users of news aggregators interested in diverse coverage must invest more time and effort, reading through several sources and sifting through overlapping content to build an understanding of a story's coverage diversity.
[Figure 1 panel: Fed Rate Increase Story. How many rate hikes will there be?]
Prior work has explored methods to present coverage diversity information. For example, AllSides offers meta-data about the sources, such as political alignment (AllSides, 2021). But source-based information can be overly generic. Other projects have proposed to use article clustering and topic-modeling-based approaches to provide the user with story-specific insights about source diversity. Yet clustering interpretation can be complex for untrained users (Spinde et al., 2020; Saisubramanian et al., 2020).
In this work, we propose a new framework to discover and present news diversity in multi-source settings: the Discord Questions framework. Discord questions are meant to be: (1) answered by most sources that cover the story, and (2) answered in semantically diverse and sometimes contradicting ways by the sources. The use of questions to accompany readers is motivated by prior work showing automatically generated questions can improve reader comprehension (Therrien et al., 2006), and foster an environment for active reading and comprehension (Singer, 1978).
The discord questions and the consolidated groups of answers are intended to be an interpretable slice through the sources' coverage, indicating how sources align for a specific issue in the story. Figure 1 presents two illustrative discord questions that were generated by our framework from existing Google News stories. In the first example, the sources and experts they introduce make forecasts that are subjective and uncertain: in a story about the Federal Reserve's rate increase, news sources predict that anywhere between 4 and 8 hikes might happen in 2022. In the second example, in a story about the US House passing a bill about gun regulations, some sources chose to be more optimistic, focusing on how many Republicans were required for the bill to pass, while others employed a more pessimistic tone, writing that the bill did not have a serious chance to pass.
We hypothesize that a well-phrased question and a consolidated set of answers from the sources can reveal the coverage diversity of a story in a flexible and interpretable way for end-users. In our work, we operationalize the Discord Questions framework into a pipeline with three main components as shown in Figure 2. More specifically, we focus on two tasks: answer consolidation for the news domain and discord question generation. We create evaluation settings for each, allowing us to build high-performing models to use in a prototype implementation of the framework.
For answer consolidation, we repurpose existing work on QA evaluation (Chen et al., 2020), adapting it to achieve a balanced accuracy above 80% on our test set. For discord question generation, we train a question generation model that improves the percentage of generated discord questions by 5% compared to a strong baseline. However, we estimate that our best-performing model still lags human-written question quality by at least 15% in our evaluation setting.
We prototype the Discord Questions framework in a live demonstration. We rely on the Google News aggregator to obtain a listing of sources that cover a story and use our pipeline to generate several discord questions. Manual inspection reveals that questions generated by our system are found to be more interesting than other types of questions (such as factoid questions) and that the consolidated answers help surface diversity in terms of the level of detail, answer aspects, sentiment, and reasoning of sources, successfully revealing differences in coverage from news sources.

Framework Definition
We first define terminology, then introduce components of the discord questions framework.

Terminology
A news story (sometimes topic or event) is a group of news articles published around the same time that discuss a common event and set of entities. Individual news articles of a story are each published by a source, a media organization that often hosts the article on its distribution platform. An article is composed of a headline, the article's content, and optionally a summary. We denote the collection of articles' contents as the full context of a story.

Discord Question Pipeline
The pipeline is visually summarized in Figure 2. It takes as input a story's news articles and follows three steps: (1) question generation, in which candidate discord questions are generated, (2) question answering, in which answers to a question are extracted from each source's content, and (3) answer consolidation, in which a question's extracted answers are organized into semantic groups. The output is a set of questions and corresponding answer groups, which can be used to surface news coverage diversity.

Discord Question Generation
Discord Question Generation consists of using any of the sources' content to generate a question satisfying two properties: (1) high coverage, with most of the sources providing an answer to the question, and (2) answered diversely, with answers exhibiting semantic diversity which can be organized into semantic groups. We define cutoffs that assess if each property is respected. For property (1), the question should be answered by 30% or more of the sources. For property (2), when grouping a question's answers, the largest group should contain no more than 70% of all answers.
In Figure 2, out of the 4 candidate questions, only Q2 and Q3 satisfy both properties and are considered discord questions. Questions such as Q1, which breaks property (2), are labeled as consensus questions, as a majority of the sources' answers are in the same semantic group (i.e., circles). Factoid questions tend to be consensus questions (e.g., Who is the president of France?). Questions such as Q4, which breaks property (1), are labeled as peripheral questions, as a minority of sources answer the question. We hypothesize that consensus and peripheral questions are not pertinent to the study of a story's coverage diversity, as they do not reveal dimensions of source discord. Section 5 explores ways to automatically generate discord questions.
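The two cutoffs above can be illustrated with a minimal sketch. The function name `label_question`, its arguments, and the assumption that each source contributes at most one answer are hypothetical choices made for illustration; only the 30% coverage and 70% largest-group cutoffs come from the text.

```python
from collections import Counter

def label_question(source_to_group, num_sources,
                   min_coverage=0.30, max_group_share=0.70):
    """Label a candidate question as peripheral, consensus, or discord.

    source_to_group maps each source that answered the question to the id
    of the semantic group its answer was placed in (hypothetical format,
    assuming one answer per source).
    """
    if num_sources <= 0:
        raise ValueError("num_sources must be positive")
    answered = len(source_to_group)
    # Property (1): at least 30% of the sources must answer the question.
    if answered == 0 or answered / num_sources < min_coverage:
        return "peripheral"
    # Property (2): the largest group may hold at most 70% of the answers.
    largest = max(Counter(source_to_group.values()).values())
    if largest / answered > max_group_share:
        return "consensus"
    return "discord"

# Six of ten sources answer, spread over three groups of size 3, 2, and 1.
example = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": "C"}
print(label_question(example, num_sources=10))
```

Both cutoffs are exposed as parameters because the thresholds are described as application-dependent.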

Question Answering
Once a candidate question is generated, the question answering (QA) module extracts each source's answer, if any, to the question.
We leverage two properties of QA models in the Discord Questions framework. First, the QA model we use is extractive, selecting spans of text in the source's content that most directly answer the question without modification. Second, the model discerns when a source does not contain any answer to a question, predicting a No Answer special token.
In this work, we use a standard QA model, a RoBERTa-Large trained on common extractive QA datasets (details in Appendix A), and reflect on the choice of QA model in the Limitations section.

Answer Consolidation
Once a question's answers are extracted, the final step is answer consolidation (Zhou et al., 2022). The objective is to organize answers into semantic groups, with answers within the same group conveying semantically similar content.
We follow Zhou et al. (2022) and decompose answer consolidation into two sub-tasks: (1) answer-pair similarity prediction (also answer equivalence), in which a model is tasked with assessing the similarity $S_{12}$ between two answers $(a_1, a_2)$ to a question $Q$, and (2) the consolidation step, in which given a set of answers $(a_1, a_2, \ldots, a_n)$ and all pairwise similarities $S_{12}, S_{1n}, S_{2n}, \ldots$, the model must organize the answers into semantic groups.
Because answer-pair similarity can involve subjective opinion, Chen et al. (2020) framed the task as a regression problem, collecting human annotations on a 5-point Likert scale. Bulian et al. (2022) later simplify the task by framing it as binary classification and still achieve high inter-annotator agreement. We adopt the binary classification framing, as it simplifies annotation procedures. In Section 4, we collect an evaluation set for news answer consolidation and explore diverse transfer learning strategies, finding resources to build high-performing models for our application.

Related Work
Analysis of media diversity and bias often attempts to examine news coverage based on the media organizations that own the sources (Hendrickx et al., 2020). The objective can be to map a source onto a left-right political range (Baly et al., 2018), or geopolitical origin (e.g., country) (Hamborg et al., 2018). Information about source bias can be conveyed to the user through clustering (Park et al., 2009) or matrix visualization (Hamborg et al., 2018). Prior work has however shown that using visualization to increase news reader awareness can be challenging (Spinde et al., 2020). In the Discord Questions framework, we envision a new approach to news coverage diversity by revealing concrete examples of questions and organized answer groups that reveal source alignments.
Answer Equivalence & Consolidation. Pretrained models and large datasets have boosted QA performance, yet shallow metrics (exact match and token F1) remain the most popular to assess performance (Chen et al., 2019). Recent work on answer equivalence, MOCHA (Chen et al., 2020) and Answer Equivalence (AE) (Bulian et al., 2022), builds methods to improve QA evaluation by manually collecting datasets of semantic similarities between reference and system answers to a question. Zhou et al. (2022) formulate the task of answer consolidation and collect a large dataset to explore model performance on the task in the domain of online forums (i.e., Quora). In our work, we frame the answer consolidation task in the news domain and re-purpose answer equivalence models to achieve high performance on the task.
Question generation has expanded from the answer-aware sequence-to-sequence task (Du et al., 2017) to include many domain-specific applications, including clarification QG (Rao and Daumé III, 2018), inquisitive QG during a reading exercise (Ko et al., 2020), conversation recommendations (Laban et al., 2020), factual consistency evaluation in summarization (Fabbri et al., 2021), and the decomposition of fact-checking claims (Chen et al., 2022). With discord questions, we add a new practical application of QG, to enable analysis of news coverage diversity.
Multi-document summarization (MDS), applied to product reviews (Di Fabbrizio et al., 2014; Bražinskas et al., 2021) or in the news domain (Fabbri et al., 2019), can be seen as related to discord questions. In MDS, models learn from the dataset content selection techniques, and whether to include or omit discordant information. Discord questions can be seen as targeted MDS focusing on story elements that involve source disagreement.

News Answer Consolidation
We collected an evaluation set we name NAnCo (News Answer Consolidation) and evaluated several transfer learning strategies to select the best-performing model for the pipeline.

NAnCo Data
To build a challenging evaluation set, we used a manual process to select questions and source answers for annotation. At the time of annotation, we selected a hundred large stories in the recent section of Google News. Although Google News most likely applies a filter on the stories that appear in the recent section, we did not curate story selection beyond selecting stories with at least 25 sources. For each story, we use a baseline QG model, a BART-large model (Lewis et al., 2020) trained on NewsQA (Trischler et al., 2017), to generate several thousand candidate questions. We then use a QA model to extract answers from the story's full context. We filter to questions with at least 25 answers and manually select eight questions for which preliminary inspection reveals discord. In addition, we ensured that selected questions represented diverse topics (e.g., geopolitics, business, science), and structures (e.g., Why, How, What, and Who questions).
Statistics of NAnCo are summarized in Appendix A1. For each question and answer set, we tasked three human annotators with grouping the answers semantically. The annotators were first shown an example question with pre-annotated groups by an author of the paper and could discuss the task before beginning annotation. Instructions given to the annotators are listed in Appendix B.
We follow Laban et al. (2021)'s procedure to aggregate multiple grouping annotations into global groups, using a combination of majority voting and graph-based clustering (Blondel et al., 2008). We then measure inter-annotator agreement using the Adjusted Rand Index measure between each annotator and a leave-one-out version of the global groups, and find an overall agreement of 0.76, confirming that consensus amongst annotators is high.
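For readers unfamiliar with the Adjusted Rand Index, the following sketch computes it between two groupings of the same answers using the standard pair-counting formula. It only covers the comparison of two groupings; the leave-one-out construction of global groups is not reproduced here, and the function name and input format are illustrative assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two groupings of the same items.

    labels_a[i] and labels_b[i] are the group ids given to item i by the
    two groupings (e.g., one annotator versus the aggregated groups).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both groupings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items placed together by both groupings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items placed together by each grouping separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g., both groupings put everything in one group.
        return 1.0
    return (together_both - expected) / (maximum - expected)

# Identical groupings score 1.0 even when the group ids differ.
print(adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"]))
```

Because the index is corrected for chance, values near 0 indicate agreement no better than random grouping, which helps put the reported 0.76 in context.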
In the final dataset, questions have an average of 9.4 answer groups (ranging from 5-12), each with an average of 3.0 distinct answers per group (ranging from 1-25). We separate questions into two groups: four questions to a validation set available for hyper-parameter tuning, and four to a test set.

Experimental Setting
To facilitate experimentation, we convert final group labels into a binary classification task on pairs of answers. For each question, we look at all pairs of answers, assigning a label of 1 if the two answers are in the same global group, and 0 otherwise. In total, we obtain 3,267 pairs, with a class imbalance of 25% of positive pairs. The NAnCo data is large enough for evaluation, but too small for model training. We explore the reuse of existing resources to assess which transfers best to our task, specifically looking at models from NLI, sentence similarity, and answer equivalence.
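The conversion from group labels to labeled answer pairs can be sketched as follows. The function name and the dictionary input format are hypothetical; the labeling rule (1 for the same global group, 0 otherwise) follows the description above.

```python
from itertools import combinations

def groups_to_pairs(answer_to_group):
    """Turn per-answer group ids into binary-labeled answer pairs.

    answer_to_group maps each distinct answer of one question to its
    global group id (hypothetical input format).
    """
    pairs = []
    for (a1, g1), (a2, g2) in combinations(answer_to_group.items(), 2):
        # Label 1 when both answers share a group, 0 otherwise.
        pairs.append((a1, a2, int(g1 == g2)))
    return pairs

answers = {"four hikes": 0, "4 increases": 0, "up to eight": 1}
for pair in groups_to_pairs(answers):
    print(pair)
```

A question with n distinct answers yields n(n-1)/2 pairs, which is why a handful of questions is enough to produce several thousand evaluation pairs.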
For NLI models, we explore two models: Rob-L-MNLI, a RoBERTa-Large model (Liu et al., 2019) trained on the popular MNLI dataset (Williams et al., 2018), and Rob-L-VitC trained on the more recent Vitamin C dataset (Schuster et al., 2021), which has shown promise in other semantic comparison tasks such as factual inconsistency detection (Laban et al., 2022a). Model prediction (Equation 1) is computed from P(E|...) and P(C|...), the model probabilities of the entailment and contradiction classes. During validation, minor modifications such as symmetric scoring and using only P(E|...) had negligible influence on overall performance.
We explore two sentence embedding models, selected on the Hugging Face model hub as strong performers on the Sentence Embedding Benchmark. The first is BERT-STS, a BERT-base model (Devlin et al., 2018) finetuned on the Semantic Text Similarity Benchmark (STS-B) (Cera et al.). The second is MPNet-all, an MPNet-base model (Song et al., 2020) trained on a large corpus of sentence similarity tasks (Reimers and Gurevych, 2019).
Finally, we select four answer equivalence models. First, LERC is a BERT-base model introduced in Chen et al. (2020). Second, Rob-L-MOCHA is a RoBERTa-Large model trained on MOCHA's regression task, which requires predicting an answer pair's similarity on a scale from 1 to 5. Third, Rob-L-AE is a RoBERTa-Large model we train on AE's binary classification task, which determines whether an answer pair is similar or not. Fourth, the Rob-L-MOCHA-AE model, which we train on a union of MOCHA and AE, adapting the classification labels to regression values (i.e., label 1 to value 5, label 0 to value 0).
We note that not all models have access to the same input. NLI and Sentence Embeddings models are not trained on tasks that involve questions, and we only provide answer pairs for those models. Answer equivalence-based models see the question as well as the answer pair, as prior work has shown that it can improve performance (Chen et al., 2020).
All models produce continuous values as predictions. The threshold for classification is selected on the validation set, and used on the test set to assess realistic performance. Technical details for training and usage of the eight models are in Appendix D.
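The threshold selection protocol can be sketched as below, using balanced accuracy (the mean of the recall on each class) as the selection criterion. This is a minimal illustration under assumptions: the function names are hypothetical, and the exhaustive search over observed scores is one simple way to pick a threshold, not necessarily the procedure used in the paper.

```python
def balanced_accuracy(labels, predictions):
    """Mean of the recall on the positive class and on the negative class."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    true_neg = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return 0.5 * (true_pos / positives + true_neg / negatives)

def select_threshold(val_scores, val_labels):
    """Pick the score cutoff that maximizes validation balanced accuracy."""
    best_threshold, best_accuracy = None, -1.0
    for threshold in sorted(set(val_scores)):
        predictions = [int(s >= threshold) for s in val_scores]
        accuracy = balanced_accuracy(val_labels, predictions)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy

# Toy validation scores; the chosen cutoff is then frozen for the test set.
threshold, val_accuracy = select_threshold([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
test_predictions = [int(s >= threshold) for s in [0.2, 0.9, 0.5]]
print(threshold, val_accuracy, test_predictions)
```

Freezing the threshold before touching the test set is what makes the test number a realistic estimate, at the cost of a lower score than a threshold tuned on the test data itself.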

Results
We report balanced accuracy on NAnCo to account for class imbalance. On all datasets, answer equivalence models perform best, followed by sentence embeddings models, and NLI models perform worst. Within answer equivalence models, Rob-L-MOCHA tops performance, outperforming both LERC (a smaller model trained on the same data) and AE-trained models. We hypothesize that the more precise granularity of MOCHA provides additional signals useful to our task. Surprisingly, training on the union of MOCHA and AE does not improve performance, hinting at differences between the datasets, and a closer resemblance of our task to MOCHA.
All models see a decrease in performance when transitioning from validation to test settings. This drop in performance reflects the reality of using models in practice, in which a threshold must be selected in advance.
Although a test balanced accuracy of 81.3% is far from errorless, the performance is encouraging and we use Rob-L-MOCHA when assembling the framework in Section 6. In practice, for a set of answers to a question, we run Rob-L-MOCHA on all answer pairs, build a graph based on predictions, and run the Louvain clustering algorithm (Blondel et al., 2008) to obtain answer groups.
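To make the graph-based consolidation step concrete, the sketch below builds groups from pairwise predictions. Note the substitution: it uses connected components (via union-find) as a simpler stand-in for the Louvain algorithm, so it will merge groups more aggressively than Louvain when a few spurious edges link otherwise separate clusters. The function name and input format are hypothetical, and answers are assumed to be distinct strings.

```python
def consolidate(answers, similar_pairs):
    """Group answers connected by predicted-similar pairs.

    answers: list of distinct answer strings.
    similar_pairs: (answer, answer) tuples the similarity model judged
    equivalent, i.e., the edges of the answer graph.
    Connected components are used here instead of Louvain clustering.
    """
    parent = {a: a for a in answers}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for a in answers:
        groups.setdefault(find(a), []).append(a)
    return list(groups.values())

answers = ["4 hikes", "four increases", "8 hikes", "up to eight", "unclear"]
edges = [("4 hikes", "four increases"), ("8 hikes", "up to eight")]
print(consolidate(answers, edges))
```

A modularity-based method such as Louvain is more tolerant of occasional false-positive edges, which matters given that the pairwise model is far from errorless.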

Discord Question Generation
The Discord Question framework relies on obtaining story-relevant questions. QG models are known to excel at generating factoid questions but are limited on realistic curiosity-driven questions (Scialom and Staiano, 2020). We propose an automatic method to evaluate QG models on the ability to generate discord questions, based on the intuition that we can use a story's full context to evaluate a question. The method is illustrated in Figure 3.

Evaluation Method
We select 200 news stories on the recent section of Google News, omitting stories with fewer than 10 sources. For each, we extract the full context, as well as a summary selected from one of the articles.
All QG models receive the summary and generate a candidate question. Crucially, models do not have access to the full context but must generate questions with diverse answers in the full context.
Once a candidate is generated, the QA module extracts all potential answers (A) to the question from the full context, and the answer consolidation module groups answers semantically. If no answer is extracted, or answers were extracted from fewer than 30% of sources, we label the question as a peripheral question. If answer consolidation finds that a single answer group accounts for at least 70% of answers, we label the question as a consensus question. We find that factoid questions often fall in this category (e.g., Who is the president of X?).
We note that the thresholds set to filter out peripheral and consensus questions were chosen empirically and can be modified depending on the application setting. For example, regarding the threshold for labeling a question as peripheral, lowering the threshold leads to producing more discord questions, including more specific questions that are not central to the story, while increasing the threshold would lead the pipeline to produce no discord questions.
A common limitation of QG is a preference for vague and common questions (Heilman and Smith, 2010), a problem that exists in other NLG domains such as dialogue response generation (Li et al., 2016). With overly vague questions (e.g., What did they say?), models increase the likelihood of being answered. Vague questions are undesirable in our framework, as differing answer groups might arise not from discord, but from ambiguity.
We devise an automatic method to detect vague questions, borrowing from the concepts of TF-IDF and term specificity (Jones, 1972). We use 10 distractor news articles published several months before the news story. For a candidate question, we extract all answers to the question from the distractor articles ($A_{dis}$). We compute a question specificity score $\mathrm{Spec}(Q, A, A_{dis})$, where we set $\epsilon = 0.001$ for numerical stability, and $|A|$ is the number of answers. If there are few distractor answers, specificity is large; otherwise, if $\mathrm{Spec}(Q, A, A_{dis}) \leq 2$, we label the question as vague. Other candidates are labeled as discord questions, as they (1) are answered by a large proportion of sources, (2) have several groups of answers, and (3) are specific to the story.
A confounding factor in QG is the choice of start word. Start words affect the difficulty of generating discord questions, with a difference between words that more often lead to factoid questions (e.g., Where), or reasoning starting words (e.g., Why). A model that generates a larger fraction of Why questions might be advantaged, regardless of its ability on all start words. To counter the start word's effect, we enforce that models are compared using the same start words.
For each of the 200 test stories, models generate one question for four start words: Why, How, What, and Who (we skip Where and When as our observation revealed a very low percentage of discord questions), for a total of 800 candidate questions.
To understand task feasibility, we collect human-generated discord questions. We manually wrote a candidate discord question for each story and start word combination. Although not necessarily an upper bound of performance, it can serve as a rough estimate of human performance.

Results
Results for QG models and human performance are shown in Table 2. Overall, humans outperform models by a large margin for all start words. As expected, the start word affects task difficulty, with discord percentages lower for Who questions, even in the human-written condition.
The dataset influences performance more than model choice, and in particular different datasets lead to the best performance on different start words. For example, NewsQA models achieve the highest performance on the Why questions, FairyTale models on the What questions, and Inquisitive models on the How and Who questions. This insight leads us to aggregate a Discord dataset by concatenating: (Inqui/How, NewsQA/Why, FairyTale/What, and Inqui/Who). We train a T5-large model on Discord, and achieve the highest overall performance of 48% discord questions generated, an absolute improvement of 5.7%, even though performance still lags human-written questions by around 15%.
Automatic evaluation is inherently limited. We next complement our results with manual annotation of generated discord questions.

Discord Questions Assembled
We assemble the Discord Questions framework with the best-performing components (Rob-L-MOCHA for consolidation and T5-Discord for QG) and make a public web interface. We perform a manual evaluation of the system, first evaluating the relative interestingness of discord questions to potential readers, and second analyzing types of diversity surfaced by the system.

Assembly Detail
In the demonstration, we collect stories as they are added to the English version of Google News, filtering to stories with at least 10 distinct sources. For each story, we obtain article content using the newspaper library and run the Discord Questions pipeline, often generating several hundred candidate questions and filtering down to discord questions that receive the highest source coverage.
We design two interfaces to visualize stories: Q&A view and Grid view, both shown in Figure 4. In the Q&A view, the user sees a list of selected discord questions and a horizontal carousel with a representative answer from each answer group. Sources are linked explicitly, and the user can click to access the original article. In the Grid view, information is condensed into a matrix to facilitate comparison between sources: each row lists a question, each column represents a source, and in each entry, a shape indicates whether a source answered a question, and the shape's color indicates the source's answer group.
Work on the interface is preliminary, and serves as a proof of concept, demonstrating it can be run on several hundred stories a day with a moderate amount of infrastructure. Further investigation through usability studies is required to understand the usefulness of the framework to news readers.

Are Discord Questions Interesting?
The automatic evaluation in Section 5.2 does not consider the interestingness of the question: a question might qualify as a discord question while not covering an interesting aspect of the story for the reader. Interest in a question is inherently subjective, and we perform a manual annotation of questions to evaluate the relative interestingness of discord questions to other question categories.
We randomly select 300 question pairs from Section 5.2's experiment. Each pair contains one question marked as discord, and one marked as any other category (i.e., peripheral, consensus, vague) for the same story. Three annotators read the shuffled question pair $(Q_1, Q_2)$ and optionally the story's summary, and select the question they would be more interested in seeing answered. The annotator can choose: $Q_1$ wins, $Q_2$ wins, both are not interesting, or both are interesting. Appendix E relays task instructions and detailed results.
We compute the inter-annotator agreement level through Cohen's Kappa, and find an agreement level of 0.51 or moderate agreement, confirming that though interest in a question is subjective, some agreement amongst annotators exists. We find that annotators preferred discord questions in 68% of cases, confirming that discord questions are not only relevant for surfacing diversity in source coverage, but they also are more interesting to news readers. We note that a preference in 68% of cases shows that in many cases, consensus and peripheral questions are interesting as well, and discord questions are one of many ways to generate interesting questions in news reading applications.

Types of Diversity Surfaced
To gain an understanding of the types of surfaced diversity, we inspect discord questions generated by our pipeline from 32 Google News stories from the Business, World, and Science sections. For each story, we annotate up to five questions with at least 3 answer groups, annotating 100 questions.
We annotated each question with whether the question qualifies as a discord question, and if so, the type of diversity it reveals. We found that 16% of questions are erroneously tagged as discord and the remaining 84% surface four types of diversities.
Causes for errors were: (1) the question is vague and sources answer different question interpretations (14%), and (2) answers are all semantically similar but the consolidation module mistakenly creates multiple groups (2%).
For valid discord questions, we labeled each with up to four types of coverage diversity it reveals, expanding on prior work in answer equivalence (Bulian et al., 2022):
• Level-of-detail Difference. 79% of valid questions surface differences between coarse and precise answers to the question.
• Aspect Difference. 66% bring to light differences in the aspects answers focus on (e.g., economics vs. politics).
• Sentiment Difference. 41% reveal source answers being more positive, neutral, or negative towards a question.
• Reason Difference. 22% expose differences in the reason or prediction a source makes about a question.
See Appendix F for examples of each type of diversity. From this analysis, we conclude that although the pipeline produces errors, the majority of generated questions reveal some coverage diversity.

Conclusion
We introduce the Discord Questions framework, in which we hypothesize that a question accompanied by an organized set of answers from the sources can spotlight specific ways in which sources disagree, providing concrete examples of coverage diversity to a news reader. We decompose the framework into required components, and design evaluation methodology for each. We select high-performing models for each component and assemble them into a working prototype of the framework. We confirm through manual analysis that questions generated within our framework are of interest to potential users, and in a majority of cases surface four types of diversity in news coverage, from varying levels of detail in the reporting, to differing sentiment or reasoning about an event's cause, confirming that discord questions are an interpretable tool to uncover coverage diversity.

Limitations
In this work, we focus on generating discord questions, filtering out other types of questions. Discarded questions can however be valued in other settings, and our selection process should not be seen as a general assessment of question quality. For instance, Fabbri et al. (2021) show that generating highly specific factoid questions can boost performance in factual inconsistency detection in summarization, and unanswered questions help challenge QA models (Rajpurkar et al., 2018).
Our demonstration relies on components susceptible to making errors, which can compound as one module's errors are forwarded to the next. For example, an inaccurately extracted answer by the QA model will lower the quality of answer groups in the consolidation step. In particular, extractive QA models can be limiting when answers are indirect or implied (Chen et al., 2022). On the bright side, modularity enables us to swap to improved components, for instance as generative QA becomes available (Tafjord and Clark, 2021).
The framework we propose assumes that exposing a news reader to coverage from a diverse set of sources is beneficial. However, exposure to media bias can be detrimental, in some cases misrepresenting important geopolitical events such as elections (Allcott and Gentzkow, 2017) and wars (Kuypers, 2006). Therefore, a careful balance is required in source selection to present diverse perspectives to the user, while not promoting dangerous misrepresentations. In the implementation presented in Section 6, we rely on Google News' source selection process, which accounts for transparency and editorial practices of a source. Google News is however not a gold standard, as it is known to have Western bias (Watanabe, 2013) and the aggregator recently removed major Russian sources from its platform.
Our current prototype is inherently limited by our focus on English-language news, as coverage diversity on international topics is likely to come from non-English news sources. However, improvements in automatic news translation (Tran et al., 2021), as well as multi-lingual models (Hu et al., 2020), point toward a multi-lingual version of our prototype.
As stated earlier, the prototype interface we built remains a work in progress, and usability studies, planned as future work, are required to examine the effect of discord questions on readers' understanding of the news. Furthermore, beyond the setting of Google News stories, discord questions could be beneficial in the study of long, unfolding news stories (Laban and Hearst, 2017), helping readers form evolving opinions over time. Finally, future work should aim to integrate discord questions into non-standard news interfaces such as chatbots (Laban et al., 2020) or podcasts (Laban et al., 2022b).

Ethical Consideration
We focused our experiments for the Discord Questions framework on the English language, and even though we expect the framework to be adaptable to other languages and settings, we have not verified this assumption experimentally and limit our claims to the English language.
The models and datasets utilized primarily reflect the culture of the English-speaking populace. Gender, age, race, and other socio-economic biases may exist in the dataset, and models trained on these datasets may propagate these biases. Question generation and answering tasks have previously been shown to contain these biases.
We note that the models we use are imperfect and can make errors. Our framework's outputs should be interpreted in terms of probability, not certainty. For example, if our system states that a source did not answer a specific discord question, there is a probability that the source answered the question, but the question-answering module we use failed to extract such an answer.
To build the components of our prototype, we relied on several datasets as well as pre-trained language models. We explicitly verified that all datasets and models are publicly released for research purposes and that we have proper permission to reuse and modify the models.

B NAnCo Annotation Instructions
The instructions that were given to the three annotators are listed in Figure A1. As listed in the instructions, the annotators were tasked with first looking through an annotated example before starting the annotation. One annotator asked clarification questions about the use of "-1" annotations before proceeding.

C NAnCo Statistics
Table A1 lists the eight questions included in the NAnCo evaluation we created, with the first four questions in the validation set, and the last four in the test set.

Welcome! This sheet contains 9 tabs (Q1-Q9), each containing a question (at the top) and 25-50 answers to that question from different sources. The goal is to annotate, in each tab, which answers give the same answer elements (are in the same cluster).

D NAnCo Model Details
In each tab, you should fill out the Cluster Annotation column.
- For each answer row, the cluster annotation should be a number, such that all the answers that you believe give the same answer should receive the same cluster number.
- The cluster numbers do not need to be consecutive (for instance if you change your mind about a cluster).
- You can move the answer rows around if you want to (for example put similar answer rows next to each other), but it is not necessary.
- If you believe an answer does not contain a valid answer, you should annotate that with a "-1". These will be removed from annotation and not considered an answer cluster.
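The cluster annotation scheme above can be turned into pairwise labels for scoring a consolidation model. The following is a minimal sketch under the assumption that evaluation is done on pairs of answers with balanced accuracy; the function names are hypothetical, and rows annotated "-1" are dropped before pairing, as the instructions state.

```python
from itertools import combinations


def cluster_labels_to_pairs(labels):
    """Turn per-row cluster numbers into (i, j, same_cluster) triples.

    Rows annotated -1 (no valid answer) are removed first. Cluster numbers
    only need to match within a cluster; they need not be consecutive.
    """
    valid = [(i, c) for i, c in enumerate(labels) if c != -1]
    return [(i, j, ci == cj) for (i, ci), (j, cj) in combinations(valid, 2)]


def balanced_accuracy(gold, pred):
    """Mean of the recall on same-cluster pairs and on different-cluster pairs."""
    pos = [p for g, p in zip(gold, pred) if g]
    neg = [p for g, p in zip(gold, pred) if not g]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    recall_pos = sum(1 for p in pos if p) / len(pos)
    recall_neg = sum(1 for p in neg if not p) / len(neg)
    return (recall_pos + recall_neg) / 2


# Hypothetical annotation column for five answer rows.
pairs = cluster_labels_to_pairs([3, 3, 7, -1, 7])
gold = [same for _, _, same in pairs]
```

Balanced accuracy is used in this sketch because same-cluster pairs are typically much rarer than different-cluster pairs, so plain accuracy would reward a model that never groups anything.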
The first tab Q1 has already received annotation. Review the sheet's annotation, and if you disagree, or want to discuss an annotation choice, reach out to Philippe to discuss. If you have other questions about the task, reach out as well. Once you feel like you understand the task, feel free to start annotation. We anticipate that annotating the 8 spreadsheets (Q2-Q9) will take 2-3 hours.

Welcome! On each row, read through the two questions. If it is unclear what news story it is about, you can read through the summary as additional context (note that it is ok if you can't find an answer to a question in the summary). The task consists of choosing which question you believe is more interesting and central to the story. That is, select the question whose answers from different source articles you are more curious to know. Options for preference are:
- 1 (if you believe question1 is more interesting)
- 2 (if you believe question2 is more interesting)
- 0 (if both questions are roughly equally not interesting)
- 3 (if both questions are roughly equally interesting)
The first example (row 5) is labeled as an example (news story about Wikipedia and Bitcoin). The preference is set to 1, as the first question (How did Wikipedia's decision affect the free web-based encyclopedia?) is judged to be more interesting than question 2 (How long will Wikipedia stop accepting cryptocurrency donations?), which asks about a detail that might not be stated.

E Preference Annotation Details
Figure A2 details the instructions that were given to the three annotators that participated in the preference selection task described in Section 6.2.

Figure 1: Discord Questions surface news coverage diversity. By finding questions that sources answer differently, concrete examples of coverage diversity for a particular story can be surfaced.

Figure 2: Overview of the Discord Questions Framework. The pipeline consists of: (1) question generation, (2) question answering, and (3) answer consolidation, to find questions that news sources answer differently.

Figure 3: Diagram of automatic evaluation of question candidates. Questions are tagged into one of four categories: peripheral, factoid, vague and discord.

Figure 4: Prototype interface of the Discord Questions demonstration. The Q&A view (left) lists the most covered discord questions and answers. The Grid view (right) condenses the story information in a matrix.

Figure A1: Instructions for the annotation of the NAnCo evaluation set.

Figure A2: Instructions for the annotation of the preference over question interestingness.

Table 1 ,
we report Pearson correlation scores for MOCHA, and balanced accuracy for AE and

Table 2: Results on Discord QG. For each model (BART, T5, MixQG), and dataset (SQuAD, NewsQA, Fairy, and Inqui) we report the % of questions tagged as discord. T5-Discord is the model trained on data we curate, and we report a human performance estimate.