PREME: Preference-based Meeting Exploration through an Interactive Questionnaire

The recent increase in the volume of online meetings necessitates automated tools for organizing the material, especially when an attendee has missed the discussion and needs assistance in quickly exploring it. In this work, we propose a novel end-to-end framework for generating interactive questionnaires for preference-based meeting exploration. As a result, users are supplied with a list of suggested questions reflecting their preferences. Since the task is new, we introduce an automatic evaluation strategy that measures how answerable the questions generated via the questionnaire are, to ensure factual correctness, and how much of the source meeting they cover, to gauge the depth of possible exploration.


Introduction
In recent years, video conferencing technology has improved substantially, and online meetings have thus become easily accessible and more prominent. Primarily due to the pandemic and working from home, the need for video calling has grown significantly. The resulting high volume of online meetings necessitates automated tools for managing and organizing essential information for attendees. This is especially critical when an attendee has missed an online meeting and needs to access the required information, since reading through the transcript is time-consuming.
Providing meeting summaries is a promising direction (Wang and Cardie, 2013; Jacquenet et al., 2019; Zhao et al., 2019; Singhal et al., 2020). However, recent studies show that 1) users' needs do not fully align with current approaches to automatic text summarization (ter Hoeve et al., 2020, 2022) and 2) approaches designed for document summarization cannot be effectively applied to meeting transcripts (Murray et al., 2010; Mehdad et al., 2013; Li et al., 2019), for the following potential reasons: (R1) Structure: standard documents are well structured compared to meeting transcripts; (R2) Language: spoken language used in meetings is less regular than documents; and (R3) Multiple speakers: the speaker role is essential. Moreover, there is little publicly available meeting data that can be used for experimentation compared to regular documents such as news or articles. In contrast with document summarization, when summarizing a meeting, different users tend to have different preferences about what content should be included in the summary. Therefore, there is an increasing call for alternative ways of summarizing, especially for meeting transcripts. Recently, Zhong et al. (2021) attempted to tackle this problem by proposing query-based multi-domain meeting summarization, where a user provides a query in question form, e.g., 'What was the discussion about the jog dial's function when talking about changes in the current design?', to locate the part of the transcript that is related to the query and then summarize it. However, attendees who have missed the meeting cannot formulate such questions because they have no prior knowledge about the meeting. To overcome this, we aim to address the following research challenge: How can attendees effectively explore the content of a meeting without having prior knowledge about it?
Figure 1: An example of exploring one of the meetings from the collection (Carletta et al., 2005) based on user preferences through an interactive questionnaire.
This work is motivated by the fact that asking questions is a more efficient way for humans to acquire information than reading notes in plain text (Lawson et al., 2006, 2007; Aliannejadi et al., 2021). Thus, we address preference-based meeting exploration by automatically generating a structured interactive questionnaire for a transcript that covers most of the discussed topics and quickly walks users through the discussed content. An example of the desired questionnaire is shown in Fig. 1. First, the user can express their preferences regarding the subjects that have been discussed (Solbiati et al., 2021; Huang et al., 2018; Zhang and Zhou, 2019; Sehikh et al., 2017). Next, the questionnaire interactively suggests narrowing down the exploration, where possible, by displaying a list of related aspects. As a result, a ranked list of questions reflecting the user's preferences is generated. The user can then pick the question that best reflects their information need and is redirected to the part of the meeting containing an answer. Interactively asking for preferences in the questionnaire is beneficial because the user gets an overview of what was covered during the meeting they missed. In Section 4.2, we describe a user study with a number of professionals who find such an application useful in their daily work. Hence, the goal of the proposed questionnaires is twofold: (G1) to compactly represent the discussed content; (G2) to guide users to form questions that express their preferences regarding the transcript.
We require the generated questionnaire to satisfy the following properties: (P1) Coverage: coverage is the amount of information from the source text that a questionnaire points to; the generated questionnaire must cover as much of the meeting as possible. (P2) Answerability: a given meeting transcript should contain the answers to the questions generated as a result of the questionnaire.
To address the defined challenge, we propose a framework, PREME, which consists of several concrete sequential steps highlighted in Fig. 2. We start by applying an existing method to extract meeting segments (Solbiati et al., 2021). Due to the conversational nature of meetings, topic detection from the segments is challenging (Huang et al., 2018; Zhang and Zhou, 2019; Sehikh et al., 2017). Thus, we extract the topics indirectly as follows. First, we generate questions from each segment (Brown et al., 2020), since extracting topics from questions is much better studied. Next, we employ a trained Conditional Random Field (CRF) model (Wallach, 2004) to tag subjects and aspects (Fig. 1) in the questions generated from each segment. Once we have obtained each segment's topic list, we apply a normalization strategy to reduce the number of options in the questionnaire. Recently, Deutsch et al. (2020) demonstrated that QA-based evaluation is strongly correlated with human opinion. Thus, to evaluate PREME, we employ a similar QA-based strategy.
To summarize, the main contributions are: (C1) We propose PREME, a novel framework that enables meeting exploration based on users' preferences through an interactive questionnaire; (C2) We propose a new method for subject normalization, which returns the most informative subject from a set of phrases and keywords; (C3) We introduce a new automatic evaluation strategy for measuring the effectiveness of the proposed questionnaire with respect to the required properties P1 and P2, which according to (Deutsch et al., 2020) [...]
[...] Nallapati et al., 2017; See et al., 2017; Celikyilmaz et al., 2018; Liu and Lapata, 2019; Zhang et al., 2020), academic papers (Manakul and Gales, 2021; Huang et al., 2021) and books (Kryściński et al., 2021). Meeting summarization has also emerged as a widespread need recently. Due to the unique discourse structure of dialogues, conventional document summarization systems face challenges when summarizing meetings (Li et al., 2019; Zhu et al., 2020). Thus, new models have been proposed for this task. Wang and Cardie (2013) employ decisions and action items in dialogues to generate the summary progressively. Oya et al. (2014) propose a template-based meeting summarization system that learns the relationship between summaries and their source meeting transcripts. Shang et al. (2018) design an unsupervised meeting summarization model with multi-sentence compression techniques. Li et al. (2019) introduce multi-modal information into meeting summarization with a hierarchical attention mechanism. Zhu et al. (2020) propose a hierarchical meeting summarizer that can process both word-level and turn-level information of dialogues. Furthermore, the community has noted that, due to the lengthy content and distributed information, a general summary of a meeting does not necessarily satisfy what users seek. Thus, query-based summarization methods have become more prevalent for generating concise and specific summaries (Litvak and Vanetik, 2017; Nema et al., 2017; Baumel et al., 2018; Ishigaki et al., 2020; Kulkarni et al., 2020, 2021; Pasunuru et al., 2021). Recently, Zhong et al. (2021) proposed a new framework of query-based summarization for meetings, for which they annotated QMSUM, a query-based multi-domain meeting dataset. Each QMSUM meeting comes with a set of queries at different levels of abstractness, i.e., general queries and specific queries. Human annotators write these queries, and the summaries that align with them, after reading the meeting transcripts.
While query-based summarization can be a proper path to provide users with meeting information at different specificity levels, we argue that issuing such specific queries still requires a certain degree of background knowledge. In real-life scenarios, users might not be equipped with that knowledge and thus cannot issue informative queries, especially when they did not attend the meeting. Hence, they cannot benefit from query-based summarization techniques to explore the meetings. We address the drawbacks of query-based summarizers by providing users with an interactive questionnaire that offers them potential queries and allows them to explore the meetings more flexibly.

Evaluation of Summaries Factuality
Automatically generated summaries have often been called out for hallucination issues (Maynez et al., 2020). Thus, Wang et al. (2020) propose a framework to evaluate the factual consistency of summaries with the source text. Similarly, Deutsch et al. (2020) propose a Question Answering (QA)-based approach for evaluating the content quality of summaries. They measure how much information is contained in a candidate summary by calculating the proportion of questions it can answer. These approaches inspired our automated end-to-end evaluation of the questionnaires.

Question Generation and Filtering
Early work on question generation leveraged crowd-sourcing or rule-based methods to generate pre-defined question templates (Mostow and Chen, 2009; Rus et al., 2010; Lindberg et al., 2013; Fabbri et al., 2020; Mazidi and Nielsen, 2014; Labutov et al., 2015). Heilman and Smith (2010) tackled this problem by over-generating candidate questions and then using a learning-to-rank framework to rank them and filter out the low-quality ones. SQUASH (Krishna and Iyyer, 2019) is a recent work in which the authors used question generation methods to convert a document into a hierarchy of question-answer pairs, with a focus on the questions' granularity level. They employed a neural encoder-decoder model trained on three reading comprehension datasets, i.e., SQuAD (Rajpurkar et al., 2016), QuAC (Choi et al., 2018), and CoQA (Reddy et al., 2019), to generate the questions, and they further filtered out the unanswerable questions using heuristics and question answering models. While question generation using question answering datasets seems a general approach, it does not work well for meeting-related questions, for several reasons: (1) the structure of meetings differs from that of documents; (2) few question-answering datasets built from meetings are available; (3) the answers to questions generated from meetings can be very long, making it hard to fit the context into neural models. In our work, we introduce an automatic method that generates questions about the meeting, avoiding the high cost of collecting them from annotators.

Questionnaire Organization
Obtaining users' preferences has always been a challenging task (Jiang et al., 2008; Rokach and Kisilevich, 2012; Anava et al., 2015; Christakopoulou et al., 2016; Sepliarskaia et al., 2018). The task becomes more challenging when we aim to minimize the number of interactions needed to learn users' preferences. Sepliarskaia et al. (2018) reformulate this task as an optimization problem. They propose a static questionnaire built by choosing a minimal and diverse set of questions. Similarly, Liu et al. (2019) proposed a dynamic questionnaire generation method for searching clinical trials. Quiz-style question generation has also been explored recently by Lelkes et al. (2021). The authors formulate the problem as two sequence-to-sequence tasks: a question-answer generation step and an incorrect-answer generation step. We argue that while the former step seems relevant to our work, it cannot be adapted to meeting transcripts, since their proposed model has been trained on factual question answering datasets and cannot be used for meeting purposes. All in all, we conclude that questionnaire creation is still under-explored across domains. Hence, our effort in organizing a questionnaire, especially for meetings, is timely and useful for future research.

Proposed Framework: PREME
This section explains PREME, our proposed methodology for exploring meetings based on users' preferences through an interactive questionnaire. An overview of our methodology is shown in Fig. 2. We first apply a topic segmentation method (Solbiati et al., 2021) to the meeting transcript to retrieve segments with different topics (Section 3.1). Then, we generate a set of all possible questions from each segment (Section 3.2). Further, we extract the most informative part of the questions, i.e., the subject and aspect of each question (Section 3.3). In the last step, we map the normalized subjects and aspects to the generated questions and form the questionnaire (Section 3.4).

Meeting Segmentation
A meeting transcript can be extremely long and contain discussions of various topics. Therefore, our goal is to divide the meeting text into a sequence of topically coherent chunks. To this end, we adopt an unsupervised topic segmentation method based on a contextualized representation of the meeting (Solbiati et al., 2021). In this method, the authors compute BERT embeddings for every utterance of the meeting transcript. They then group the utterances into blocks and perform a block-wise max-pooling operation to generate a contextualized embedding for each block. Next, the semantic similarity between two adjacent blocks is computed, and a change in topic is detected if two adjacent blocks show similarity below a certain threshold. This approach has several advantages: (1) it is unsupervised; (2) since we only split the meeting into smaller pieces, we do not lose any part of the meeting.
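The block-wise comparison described above can be illustrated with a minimal sketch. It assumes utterance embeddings are already available as plain lists of floats; the function names, the block size of 2, and the threshold of 0.5 are illustrative choices, not values taken from Solbiati et al. (2021), whose method may differ in its details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_pool(block):
    """Element-wise maximum over the utterance embeddings of one block."""
    return [max(dims) for dims in zip(*block)]


def topic_boundaries(embeddings, block_size=2, threshold=0.5):
    """Return indices i where a topic change is detected before utterance i.

    The block of utterances ending at i is compared with the block starting
    at i; a boundary is placed when their similarity falls below threshold.
    """
    boundaries = []
    for i in range(block_size, len(embeddings) - block_size + 1):
        left = max_pool(embeddings[i - block_size:i])
        right = max_pool(embeddings[i:i + block_size])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries


def split_at(utterances, boundaries):
    """Cut the utterance list at the boundaries; no utterance is dropped."""
    starts = [0] + boundaries
    ends = boundaries + [len(utterances)]
    return [utterances[s:e] for s, e in zip(starts, ends)]


# Toy 2-d embeddings (assumed): two utterances per topic.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = split_at(["u0", "u1", "u2", "u3"], topic_boundaries(toy))
```

Because the transcript is only cut, concatenating the returned segments reproduces the original utterance list, which mirrors advantage (2) above.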

Question Generation
For question generation from a segment, we leverage the GPT-3 model (Brown et al., 2020). An impressive capability of GPT-3 is that it generates very realistic results from few training samples or even none (few-shot and zero-shot learning). The variety of the generated content can be controlled using a temperature hyper-parameter. To make the pool of generated questions as large as possible, for each segment the API is called in zero-shot mode with temperature values in [0, 1] in steps of 0.05, where values closer to 1 yield more diversified questions. We set the maximum output length to 128 tokens and repeat the process for 10 trials at each temperature. Given that the maximum context window of the API was 2048 tokens, whenever a segment contains more than 2048 tokens we truncate it to a 2048-token window and slide the window by half its size. Because sampling is random in each API call, different lists of questions are obtained even with the same hyper-parameters. We extracted five questions on average per segment in each call. Finally, the union across all runs forms our question pool.
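The sampling schedule above can be sketched as follows. This is only an illustration under stated assumptions: `generate` is a hypothetical callable standing in for the language-model API (its signature is invented here), the half-window stride is one reading of the sliding procedure, and the sketch ignores the share of the context window taken by the prompt and the generated output.

```python
def temperature_grid(step=0.05):
    """Temperatures 0.0, 0.05, ..., 1.0 (21 values for the default step)."""
    n = round(1 / step)
    return [round(i * step, 2) for i in range(n + 1)]


def sliding_windows(tokens, window=2048):
    """Split a long token list into windows that overlap by half a window."""
    if len(tokens) <= window:
        return [tokens]
    stride = window // 2
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return windows


def build_question_pool(segment_tokens, generate, trials=10):
    """Union of questions over all windows, temperatures, and trials.

    `generate(tokens, temperature, max_tokens)` is a hypothetical stand-in
    for the API call and must return an iterable of question strings.
    """
    pool = set()
    for window in sliding_windows(segment_tokens):
        for temperature in temperature_grid():
            for _ in range(trials):
                pool.update(generate(window, temperature, max_tokens=128))
    return pool
```

Collecting the results in a set implements the union across runs and removes exact duplicates; near-duplicate questions would still need a separate filtering step.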

Subject and Aspect Extraction
Each generated question has one or more subjects, defined as the principal matter that attendees have discussed, i.e., the main concern of the question. Some questions also point to one or more specific aspects of the subject, defined as the mentioned details about a given subject. We aim to extract the primary subjects from every question, along with the detailed aspect if one is mentioned.
Table 1 shows examples of annotated subjects and aspects for a few questions. For instance, in the question "What is the arrow symbol on the remote control for?", "remote control" is annotated as the subject and "arrow symbol" is the specific aspect of the subject. To extract the subjects and aspects from the questions, we use a CRF (Wallach, 2004). We examined state-of-the-art keyword extraction models and topic extraction models based on contextualized neural embeddings; however, the CRF model, which uses each word's identity, suffix, shape, and POS tag as features, worked best among them.
To train the CRF model, we needed questions annotated with subject and aspect labels. We designed an annotation study on the UHRS crowd-sourcing platform, where we carefully trained annotators with detailed instructions to label 1000 randomly selected questions generated by GPT-3 with their subjects and aspects. Each question was assigned to two annotators, and we report the annotators' agreement in Section 4. We then employ the trained CRF model to extract subjects and aspects from the questions.
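A minimal sketch of per-token features of the kind listed above (word identity, suffix, shape, and POS tag) is given below. The feature names, the three-character suffix, the neighbouring-token features, and the hand-written POS tags are assumptions made for illustration; the exact feature templates used for the trained model are not specified in the text.

```python
def word_shape(token):
    """Map characters to X (upper), x (lower), d (digit); keep the rest."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def token_features(tokens, pos_tags, i):
    """Feature dictionary for token i of one question."""
    token = tokens[i]
    features = {
        "word": token.lower(),          # word identity
        "suffix3": token[-3:].lower(),  # suffix (length 3 is an assumption)
        "shape": word_shape(token),     # word shape
        "pos": pos_tags[i],             # part-of-speech tag
    }
    # Neighbouring-token context: an assumption, not stated in the text.
    if i > 0:
        features["prev_word"] = tokens[i - 1].lower()
        features["prev_pos"] = pos_tags[i - 1]
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next_word"] = tokens[i + 1].lower()
        features["next_pos"] = pos_tags[i + 1]
    else:
        features["EOS"] = True
    return features


def question_features(tokens, pos_tags):
    """One feature dictionary per token, in order."""
    return [token_features(tokens, pos_tags, i) for i in range(len(tokens))]


# The example question from Table 1, with hand-written (assumed) POS tags
# and per-token labels following the annotation described in the text.
tokens = ["What", "is", "the", "arrow", "symbol", "on",
          "the", "remote", "control", "for", "?"]
pos_tags = ["WP", "VBZ", "DT", "NN", "NN", "IN", "DT", "JJ", "NN", "IN", "."]
labels = ["N/A", "N/A", "N/A", "ASPECT", "ASPECT", "N/A",
          "N/A", "SUBJECT", "SUBJECT", "N/A", "N/A"]
features = question_features(tokens, pos_tags)
```

A sequence of such dictionaries paired with one label per token is the input shape expected by common CRF toolkits, so the same function can serve both training and prediction.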

Questionnaire Generation
Given a meeting transcript, for each of its segments $T$, which is supposed to coherently cover one subject, we generate $Q_T$, a set of questions from $T$. In other words, given an ideal meeting segmentation method, each segment points to one subject. Thus, we assume that each segment has only one valid topic and, as shown in Figure 2, each segment is represented by one $S_{norm}$. We create a set $S_{Q_T}$ by extracting the subjects from each question in $Q_T$. Therefore, for the segment $T$, we have at least $|Q_T|$ subjects. The subjects extracted from a question set originating from the same segment must be normalized so that one comprehensive, general, and informative subject represents the segment. The more of the other concepts in $S_{Q_T}$ the selected representative subject covers, the better the normalization. This subject normalization reduces the number of subjects shown to the user in the first step of the questionnaire and decreases the user's effort, so that users' preferences can be determined by asking them a minimal number of questions. In other words, our goal is to select a single subject $S_{norm}$ from $S_{Q_T}$ that represents $S_{Q_T}$ in the most informative way. To do so, we define the notion of the subject network as follows.
Definition 3.1. Given a segment $T$, a set of generated questions $Q_T$, and extracted subjects $S_{Q_T}$, the subject network of $T$ is the triple $(V, E, w)$. It is a weighted undirected graph, where $V = \{s_i \in S_{Q_T}\}$ and $E = \{e_{s_i,s_j} : \forall s_i, s_j \in V\}$. The function $w : E \to [0, 1]$ assigns to an edge $e_{s_i,s_j}$ the cosine similarity between the contextualized embedding vectors of its two incident subjects, i.e., $v_{s_i}$ and $v_{s_j}$.
In Def. 3.1, we propose a subject network in which subjects are connected and edge weights represent the semantic similarity between two subjects. We hypothesize that the node with the highest similarity and connection to the others is the most central one. In other words, since it is highly similar to the other subjects, there is a high probability that it points to a more generic concept that covers them. Hence, the node $S_{norm}$ should have high centrality in order to represent the main subject of segment $T$. We employ the PageRank (Haveliwala, 2003) value to find the most important and informative node in this network. PageRank has been shown to correlate highly with node importance and has been used for tasks such as quantifying term specificity and ranking in various information retrieval tasks (Arabzadeh et al., 2019, 2020; Kurland and Lee, 2010). We measure the PageRank score of each node and select the node with the highest PageRank value as the representative subject $S_{norm}$ of the subject set $S_{Q_T}$ for segment $T$. In other words, we represent each segment $T$ by the subject $S_{norm}$ such that $\mathrm{PageRank}(S_{norm}) \geq \mathrm{PageRank}(s_i)$ for every $s_i \in V$.
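The selection of the representative subject can be sketched with a small weighted PageRank computed by power iteration. The embeddings below are toy 2-d vectors chosen for illustration, and the damping factor of 0.85, the iteration count, and the clipping of negative similarities to 0 (so that weights stay in [0, 1]) are assumptions rather than settings reported here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on a symmetric weight matrix."""
    n = len(weights)
    ranks = [1.0 / n] * n
    strength = [sum(row) for row in weights]  # total edge weight per node
    for _ in range(iterations):
        new_ranks = []
        for i in range(n):
            incoming = sum(
                ranks[j] * weights[j][i] / strength[j]
                for j in range(n)
                if strength[j] > 0
            )
            new_ranks.append((1 - damping) / n + damping * incoming)
        ranks = new_ranks
    return ranks


def normalize_subject(subjects, vectors):
    """Build the fully connected subject network and return its top node."""
    n = len(subjects)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Clip to keep edge weights inside [0, 1].
                weights[i][j] = max(0.0, cosine(vectors[i], vectors[j]))
    ranks = pagerank(weights)
    best = max(range(n), key=ranks.__getitem__)
    return subjects[best], ranks


# Toy embeddings (assumed): "Education" lies between the two narrower subjects.
subjects = ["Education", "Schools", "Young people who are leaving school"]
vectors = [[1.0, 1.0], [1.0, 0.2], [0.2, 1.0]]
s_norm, ranks = normalize_subject(subjects, vectors)
```

In this toy network the broader subject is similar to both narrower ones, while the narrower ones are less similar to each other, so the broader subject collects the largest share of rank.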
Fig. 4 displays a subject network built from the subjects extracted from one segment of a meeting in the QMSUM dataset. Subjects such as "Education", "Schools," and "Young people who are leaving school" are included in this subject set and represented by nodes in the subject network. We connect every pair of nodes in this graph, and the edge weight is directly related to their semantic similarity. As presented in Fig. 4, edges with higher weights are drawn with greater width. We compute PageRank in this weighted network. Here, "Education" obtains the highest PageRank value. Hence, we represent these subjects by one subject, i.e., "Education". "Education" is a promising representative for these subjects, as it covers more specific concepts such as "schools", "statutory education," and "post 12 education." Next, the aspects extracted from each question set should be mapped to their representative subject. We remove redundant and repetitive aspects and subjects by discarding those that share highly similar n-grams. In addition, there might be several subjects in $S_{Q_T}$ that all point to $S_{norm}$ and are semantically very similar. In this step, we must take care not to lose any aspect because of subject normalization. We aim to map to $S_{norm}$ every aspect of $S_{norm}$ itself and of every $s_i$ in $S_{Q_T}$ that is highly similar to $S_{norm}$, in order to maximize the pool of questions we might show at the end of the questionnaire. For instance, Fig. 3 displays a few subjects and aspects extracted from one segment. If we only consider "education" and its related aspects, we lose many aspects that users might be interested in, and as a result, the questionnaire coverage drops. On the other hand, if we merge the representative subject with highly similar subjects, e.g., "school setting" and "Education and Skills Committee," we have a broader set of questions to suggest to users. Therefore, we filter out the subjects in $S_{Q_T}$ that are dissimilar to $S_{norm}$ and map the aspects extracted for the remaining subjects to $S_{norm}$, as shown in Fig. 3. As a result, if "education" is the subject of interest for a user, they can select which aspects of education they are more interested in, such as the "role" of education or the "challenges" of education. Finally, we show users the questions in which the selected aspects and normalized subjects appear.
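The mapping step can be sketched as a small grouping routine. Everything specific in it is an assumption made for illustration: the function name, the similarity threshold of 0.6, the lookup-table similarity, and the toy questions, which are invented and do not come from the dataset.

```python
def build_questionnaire_entry(s_norm, tagged_questions, similarity, threshold=0.6):
    """Group questions by aspect under one normalized subject.

    tagged_questions: iterable of (subject, aspect, question) triples, where
    aspect may be None. A question is kept when its subject is s_norm itself
    or is at least `threshold` similar to s_norm; other subjects are filtered.
    """
    aspects = {}
    for subject, aspect, question in tagged_questions:
        if subject != s_norm and similarity(subject, s_norm) < threshold:
            continue  # dissimilar subject: not mapped to s_norm
        key = aspect.lower() if aspect else None
        questions = aspects.setdefault(key, [])
        if question not in questions:  # drop exact repeats
            questions.append(question)
    return {"subject": s_norm, "aspects": aspects}


# Toy similarity table and questions (all assumed for illustration).
toy_similarity = {
    ("school setting", "education"): 0.8,
    ("remote control", "education"): 0.1,
}


def similarity(a, b):
    return toy_similarity.get((a, b), 0.0)


tagged = [
    ("education", "role", "What is the role of education?"),
    ("school setting", "challenges", "What are the challenges in the school setting?"),
    ("remote control", "arrow symbol", "What is the arrow symbol on the remote control for?"),
]
entry = build_questionnaire_entry("education", tagged, similarity)
```

A user who selects "education" would then be offered the keys of `entry["aspects"]` as the next choice, and the questions stored under the chosen aspect as the final suggestions.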

Evaluation Methodology
For experiments, we use the QMSUM dataset (Zhong et al., 2021), which includes 232 product, academic, and committee meetings (Janin et al., 2003; Carletta et al., 2005). Each meeting comes with a set of general and specific questions; the general ones are out of the scope of this work since they refer to very broad concepts, e.g., "summarize the whole meeting." Further evaluations are conducted on the QMSUM test set.

Evaluating Framework Components
The proposed framework consists of several steps (Fig. 2). The meeting segmentation method we use (Solbiati et al., 2021) has been shown to outperform baselines (Hearst, 1997; Beeferman et al., 1999; Badjatiya et al., 2018). Hence, we refer to the original paper for its evaluation results.
Evaluating Question Generation: We evaluate the quality of our generated questions by measuring the fraction of the questions written by human annotators in QMSUM that are covered by PREME. We assume the specific queries in the QMSUM dataset are of relatively high quality because annotators issued them after comprehensively reading the transcript (gold standard questions). Hence, Fig. 5 reports the similarity between the gold questions and the most similar questions generated by PREME under three similarity metrics, i.e., Sentence-BERT similarity (Reimers and Gurevych, 2019), ROUGE F-1 score (Lin, 2004), and BLEU-4 score (Papineni et al., 2002). We consider a question from QMSUM covered if there is at least one question generated by PREME whose similarity to it is higher than a certain threshold $t \in \{1, 0.9, \ldots, 0.1, 0\}$. We report the percentage of 'Covered/Not Covered' questions for the different similarity thresholds. Based on Fig. 5, we conclude that while we cover a fair number of specific questions, there is still room for improvement. However, we should note that the questions in QMSUM are very limited and were never intended to cover all possible questions that one could raise about the meeting. Additionally, we observe that the questions in QMSUM, which are issued by humans, include more abstractive questions, while our generated questions are inclined toward more factual ones.
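The threshold sweep for the 'Covered/Not Covered' analysis can be sketched as follows. As a stand-in for the three metrics named above, the sketch uses a simple unigram-overlap F1, which is only a rough approximation of ROUGE-1; it also treats a gold question as covered when the best similarity is greater than or equal to the threshold, so that an exact match counts at t = 1. Both choices are assumptions, and the example questions are invented.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over lower-cased whitespace tokens (a rough ROUGE-1 stand-in)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def covered_fraction(gold, generated, threshold, similarity=unigram_f1):
    """Fraction of gold questions whose best match reaches the threshold."""
    if not gold:
        return 0.0
    covered = sum(
        1
        for g in gold
        if max((similarity(q, g) for q in generated), default=0.0) >= threshold
    )
    return covered / len(gold)


def coverage_curve(gold, generated, similarity=unigram_f1):
    """Covered fraction for thresholds 1.0, 0.9, ..., 0.0."""
    thresholds = [i / 10 for i in range(10, -1, -1)]
    return {t: covered_fraction(gold, generated, t, similarity) for t in thresholds}


# Invented toy questions, for illustration only.
gold = ["what is the budget for the remote control",
        "who will design the interface"]
generated = ["what is the budget for the remote control",
             "when is the next meeting"]
curve = coverage_curve(gold, generated)
```

Passing a different `similarity` callable, for example one backed by sentence embeddings, reuses the same sweep for another metric.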
Evaluating Subject and Aspect Extraction: To assess the quality of the collected dataset, we measure Krippendorff's alpha agreement between annotators (Krippendorff, 2011) for the extracted subjects and aspects of the 1000 questions generated from the training set. Tab. 2 shows that annotators have an agreement of ∼0.4, which is interpreted as "Moderate" agreement for such a challenging task. Since different annotators might select different sections of the text, Tab. 2 reports both hard and soft agreements. We trained the CRF model using the crfsuite library and evaluated it with 10-fold cross-validation. Given each term in a question, the model predicts whether the term is part of the subject, part of the aspect, or not applicable for labeling (N/A). Tab. 3 shows the results of the CRF model evaluation in terms of precision, recall, and F1 scores. We notice that the model performs better at detecting aspects than at detecting subjects.

Evaluating Questionnaires
To the best of our knowledge, we are the first to propose a preference-based questionnaire as a way of exploring meetings; thus, no gold standard benchmark or evaluation metrics exist. Since we require users to express their preferences, it is challenging to simulate 'enough imaginative context' among annotators. Thus, we conducted a user study to highlight the usefulness of exploring meetings through an interactive questionnaire. We provided 20 participants, professional workers and graduate students aged between 24 and 41, with detailed explanations and examples of results generated by PREME, such as the one in Figure 1. On average, participants had over 5 hours of online meetings per week. Over 80% of them reported that they need to explore the content of a past meeting at least a couple of times a week. Finally, over 80% of participants agreed that PREME is useful for meeting exploration. We also introduce a new evaluation strategy that assesses the desired properties of coverage (P1) and the existence of answers in the transcript (P2). The proposed automatic metrics indicate whether our framework is ready to be tested through a more comprehensive user study in the future, when we can run a pair-wise preference-based comparison between PREME and other meeting exploration methods.
Automatic evaluation: We utilize the state-of-the-art Locator model of Zhong et al. (2021), which, given a query, extracts the relevant spans from the meeting. The Locator employs a hierarchical ranking-based model structure based on CNN (Kim, 2014) and Transformer (Vaswani et al., 2017) architectures. It embeds each utterance of the meeting, feeds it to a CNN to capture local features, and uses Transformer layers to obtain contextualized turn-level representations. In addition, the speaker's embedding is concatenated to the feature list. Finally, the model uses an MLP to score each turn, and the turns with the highest scores are considered the relevant spans for each question.
To measure the coverage (to satisfy P1), we adopt the recently proposed QA-style evaluation (Deutsch et al., 2020; Wang et al., 2020), which has been shown to correlate substantially with human judgments of question quality. Coverage is defined as the fraction of a meeting that a questionnaire encompasses. To measure it, we first locate the relevant answer spans for the questions in a questionnaire. We then compute the proportion of utterances in the whole meeting transcript that the Locator detected as part of an answer span for any question in the questionnaire. We believe that this is a promising indicator of questionnaire informativeness. We hypothesize that a good questionnaire should ideally include questions related to every part of the meeting, so that users are able to explore whichever section of the meeting interests them. Therefore, the more of the meeting the questionnaire covers, the better it is. We run our experiments on the QMSUM test set. Tab. 4 shows the details of this test set. We over-generate the questions, and after removing the duplicates, the questionnaire has on average 1257 unique questions for Academic meetings, 1105 questions for Committee meetings, and 724 questions for Product meetings. Further, Tab. 4 reports the percentage of utterances covered in each meeting. On average, our proposed questionnaire covers 81% of the meeting. We also compared the coverage across different types of meetings. While our generated questionnaire covered Committee meetings the least (64%), the Product and Academic meetings show higher coverage (over 80%). Further, we evaluate how answerable the questions generated by PREME are (to satisfy P2). Inspired by Krishna and Iyyer (2019), we run a pretrained QA model (Sanh et al., 2019) over the generated questions and report the confidence score for each QA pair in Fig. 6. We use DistilBERT fine-tuned on the SQuAD (Rajpurkar et al., 2016) dataset. We observe that more than 73% of the questions generated by PREME on the meetings in the QMSUM test set have a confidence score higher than 0.5, and more than 42% of the questions have a confidence score greater than 0.7. The results confirm that a promising portion of the generated questions are answerable.
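Both automatic measures reduce to simple ratios once the answer spans and QA confidence scores are available. The sketch below assumes that each located answer span is given as an inclusive pair of utterance indices and that confidence scores are floats in [0, 1]; the function names and the toy numbers are illustrative and are not results from the experiments.

```python
def questionnaire_coverage(answer_spans, num_utterances):
    """Fraction of utterances that fall inside at least one answer span.

    answer_spans: iterable of (start, end) utterance indices, inclusive,
    collected over all questions in the questionnaire.
    """
    if num_utterances <= 0:
        return 0.0
    covered = set()
    for start, end in answer_spans:
        first = max(start, 0)
        last = min(end, num_utterances - 1)
        covered.update(range(first, last + 1))
    return len(covered) / num_utterances


def answerable_fraction(confidences, threshold):
    """Fraction of questions whose QA confidence exceeds the threshold."""
    if not confidences:
        return 0.0
    return sum(1 for c in confidences if c > threshold) / len(confidences)


# Toy inputs (assumed): overlapping spans are counted once.
coverage = questionnaire_coverage([(0, 3), (2, 5), (8, 9)], num_utterances=10)
above_half = answerable_fraction([0.9, 0.6, 0.4, 0.75], threshold=0.5)
above_07 = answerable_fraction([0.9, 0.6, 0.4, 0.75], threshold=0.7)
```

Using a set of utterance indices ensures that an utterance located by several questions contributes to the coverage only once.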

Conclusions and Future Work
We proposed an end-to-end framework, called PREME, that automatically builds a questionnaire enabling users to explore most of the discussed subjects and, if desired, their aspects. As a result, users are supplied with questions about the meeting that express their information needs and whose answers can be found in the transcript. Since simulating actual users' preferences is challenging and requires hired annotators, we ran a small user study as well as an automatic end-to-end evaluation to demonstrate the desired properties (P1 and P2) of the generated questionnaires. We publicly release the collected dataset of questions annotated with their subjects and aspects, the code for questionnaire generation, and our evaluation procedure to advance the state of the art for the newly formulated problem. In the future, proposing a new method for questionnaire generation will allow us to run a user study for pair-wise comparison of the methods and reveal the correlation between human and automatic evaluation metrics for the suggested task.

Limitations
Generally, there is not much data available for meeting exploration. Thus, all studies in this domain are limited by small training and exploratory datasets. It would therefore be beneficial for the community to collect more labelled meeting data for meeting exploration and organization purposes. Since PREME is composed of different state-of-the-art components, its performance is also limited by the individual components. In the future, novel attempts can be made to address this problem with an end-to-end framework. In addition, future work should include an extensive human evaluation that will reveal additional requirements for PREME to satisfy, which in turn will suggest additional evaluation metrics. Finally, since this is the first work to tackle meeting exploration via a questionnaire, a preference-based evaluation is not yet possible.

Figure 2: Overview of our framework, Preference-based Meeting Exploration through an Interactive Questionnaire (PREME), where $Q$ is a comprehensive set of questions, and $S_i$ and $A_j$ are extracted pairs of subjects and aspects.

Figure 3: An example of how extracted subjects and aspects from a given segment are normalized.

Figure 4: An example of a subject network built for one extracted segment from (Janin et al., 2003). The edge weights represent the semantic similarity between each pair of nodes. Higher weights are shown with greater width.

Figure 5: Coverage of PREME on the QMSUM test set for different similarity metrics and thresholds.

Figure 6: Histogram of Confidence Scores of Question-Answering model on generated questions from PREME.

Table 1: Examples of annotated questions with their subjects and aspects. Subjects are highlighted in red and aspects are highlighted in green.
Q1 What is the arrow symbol on the remote control for?
Q2 What are the main frustrations people have with the

Table 3: CRF performance on extracting subjects and aspects of questions using 10-fold cross-validation.

Table 4: Test set statistics and PREME performance: average number of generated questions and coverage.