PSED: A Dataset for Selecting Emphasis in Presentation Slides

Emphasizing words in presentation slides allows viewers to direct their gaze to focal points without reading the entire slide, keeping their attention on the speaker. Despite many studies on automatic slide generation, few have addressed helping authors choose which words to emphasize. Motivated by this, we study the problem of choosing candidates for emphasis by introducing a new dataset of presentation slides covering a wide variety of topics. We evaluated a range of state-of-the-art models on this novel dataset by organizing a shared task and inviting multiple researchers to model emphasis in slides.


Introduction
Presentation slides have become so commonplace that researchers have developed resources for designing effective slides (Alley and Robertshaw, 2004; Alley and Neeley, 2005; Jennings, 2009). These guidelines cover advice on the overall style, such as choosing colors and font sizes to ensure readability from a distance, as well as ways to help the content stand out more distinctly. However, authoring could be improved even further with recommendations that enhance a slide's communicative power.
Our goal is to predict which words to emphasize in presentation slides. Emphasis uses special formatting such as boldface or italics to make words stand out. Well-designed emphasis can significantly increase viewers' retention by guiding their focus to a few words (Alley and Robertshaw, 2004). Instead of reading the entire slide, they can read only the emphasized parts, keeping their attention on the speaker and their speech, as Figure 1 illustrates. The Emphasis Selection (ES) task was initially introduced by Shirani et al. (2019) for short written text in social media, and later became a SemEval 2020 task (Shirani et al., 2020b). In this paper, we focus on presentation slides, introducing a new corpus as well as automated emphasis prediction approaches. We are among the first to use the content of slides to provide automated design assistance.
Task Characteristics Emphasis selection poses new challenges specific to presentation slides. Slides can have different structures, and authors may follow traditional styles or modern styles with more visual content. Slides cover a wide range of topics, from technical, marketing, and legal presentations to children's illustrations. The requirement to generalize across domains and cover a variety of topics poses new challenges and encourages the development of robust language understanding models. We rely only on the input text, without additional context from the user or the rest of the design. The task is highly subjective, but the goal is straightforward: use natural language understanding techniques to discover the most common interpretation of a slide page and to generate emphasis that makes the page easier to understand quickly.
Benchmarking the Task Instead of providing baselines for the proposed dataset ourselves, we organized a shared task and invited researchers to work on the new corpus. Section 6 describes the top-performing methods. We also provide several analyses examining the challenges the dataset poses.

Related Work
Prior work explored automatically generating presentation slides from documents such as scientific articles (Beamer and Girju, 2009; Wang et al., 2017; Hu and Wan, 2013; Shibata and Kurohashi, 2005; Sravanthi et al., 2009). These projects assume that a slide page is a summary of some part of the paper, and many summarization methods have been proposed to improve the effectiveness of this process.
Other studies provide guidelines or alternatives to traditional designs to communicate a presentation's content more effectively (Alley and Robertshaw, 2004; Jennings, 2009; Alley et al., 2006; Atkinson, 2005; Doumont, 2005). These approaches create slides with sentence headlines and visual elements that reinforce ideas and increase the audience's retention of the information during the presentation.
Many applications provide design assistance for images and text, but most use only basic heuristics. Recent work uses AI-based models to recommend design attributes based on the content (Zhao et al., 2018b,a; Shirani et al., 2020a). Shirani et al. (2019) introduced Emphasis Selection for written text in visual media. Their end-to-end sequence tagging model uses label distribution learning (LDL) (Geng, 2016) to handle the task's subjectivity and predicts emphasis scores for short written texts. They trained and evaluated the model on a collection of short social media texts from Adobe Spark. Later, in the SemEval 2020 shared task (Shirani et al., 2020b), 31 teams proposed novel approaches to model emphasis more effectively. The organizers augmented the social media dataset with a large dataset of short quotations. Top-performing teams (Huang et al., 2020; Morio et al., 2020; Singhal et al., 2020) used rich contextualized pre-trained language models such as ERNIE 2.0, XLM-RoBERTa, XLNet (Yang et al., 2019), and T5 (Raffel et al., 2019).
This study focuses on a new domain, presentation slides, where emphasis serves a different purpose than in social media. In social media, the main purpose is to draw the audience's attention, while in presentations it is to help the audience better understand the content. Identifying emphasis in presentations thus brings unique challenges.

Task Definition
Given a sequence of tokens in a slide page, C = {x_1, ..., x_n}, the task is to compute a real value y_i ∈ [0, 1] for each x_i in C, indicating the degree to which the token needs to be emphasized.
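As a minimal illustration of this input/output format (the tokens and scores below are made up, not taken from PSED):

    # One slide page is a token sequence C = {x_1, ..., x_n}; the goal is one
    # emphasis score y_i in [0, 1] per token. All values here are illustrative.
    tokens = ["Risk", "management", "is", "essential", "for", "every", "project"]
    emphasis_scores = [0.875, 0.750, 0.125, 0.500, 0.000, 0.125, 0.375]
    assert len(tokens) == len(emphasis_scores)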

Data Collection
The Presentation Slides Emphasis Dataset (PSED) is a collection of presentation slides covering a wide range of topics, from technical slides on various subjects to non-technical ones such as children's material. Each instance in PSED represents one slide page along with eight annotations. We focused only on English slides. To cover a wide range of topics and areas, we collected data from different sources, such as websites with .ORG and .GOV domains and slides from the ACL Anthology. We pre-processed all slide pages to make sure they included clean pieces of text. We removed slides that contained only equations, mathematical formulas, tables, or figures, and used the PDFMiner Python library to extract the text. Quality-control steps ensured that the extracted text and the slide matched.
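As a rough sketch of the extraction step, the snippet below uses the high-level API of the pdfminer.six package; the file path, page handling, and whitespace cleanup are illustrative, and the paper's exact pipeline may differ.

    from pdfminer.high_level import extract_text

    def extract_slide_text(pdf_path, page_number):
        # page_numbers takes zero-based page indices; here we extract a single slide page.
        text = extract_text(pdf_path, page_numbers=[page_number])
        # Collapse whitespace so the page becomes one clean piece of text.
        return " ".join(text.split())

    print(extract_slide_text("slides.pdf", 0))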

Annotation Process
In an MTurk experiment, we asked nine annotators to label each page. We showed the image of the slide as well as the corresponding text and asked workers to select words to emphasize as if they were preparing the slides for their own presentation. Ten percent of the HITs included quality-control questions to make sure the annotators actually read the slides.
We observed a low Fleiss' Kappa score (Shrout and Fleiss, 1979) of 0.1414 on the dataset. A closer examination revealed that the dataset included some technical and domain-specific slides that were not entirely understandable to a general audience. Therefore, we removed slides with a score below -0.05, and the overall score increased to 0.1797. We also noticed that many cases included at least one annotator with a very different selection. To provide a more consistently annotated dataset for training, we removed, for each slide, the annotator with the lowest agreement with the other annotators. The final dataset contains annotations from eight annotators and has a Fleiss' Kappa score of 0.2092. This score is similar to the one reported by Shirani et al. (2020b) and indicates the existence of multiple points of view about emphasis in the dataset.
Table 1 shows an example bullet point annotated with B/I/O tags. It shows that there is more agreement on selecting words such as "risk" and "management" than on the others.
Table 1: An example bullet point along with emphasis probabilities. "B" indicates the beginning of an emphasis span, "I" the inside, and "O" non-emphasis words. "Freq." shows the frequencies of "B", "I", and "O". "Emphasis Probs." shows the emphasis probability ("B+I") over the eight annotations.
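As an illustration of the agreement computation, the sketch below uses statsmodels to compute Fleiss' Kappa; it simplifies the annotation to a binary emphasized/not-emphasized decision per token, and the example ratings are made up.

    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # One row per token, one column per annotator; 1 = emphasized, 0 = not emphasized.
    ratings = np.array([
        [1, 1, 1, 0, 1, 1, 1, 1],   # e.g. "risk"
        [1, 1, 0, 1, 1, 1, 0, 1],   # e.g. "management"
        [0, 0, 0, 0, 1, 0, 0, 0],
    ])

    table, _ = aggregate_raters(ratings)  # per-token counts for each category
    print(fleiss_kappa(table, method="fleiss"))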
Table 3 describes the length of instances in the PSED dataset, giving the minimum, mean, and maximum number of words per slide for each split.
Table 3: Minimum, mean, and maximum number of words per slide in each split of PSED.
Split   Min   Mean   Max
Train   13    78     180
Dev     15    71     164
Test    17    79     181
As previous research has suggested, word types play a significant role in the selection of appropriate emphasis. We therefore examine the role of part-of-speech (POS) tags in this task. Specifically, we choose the 20 POS tags that occur most frequently in the training and development sets to analyze the feature's effectiveness. We used the spaCy library to obtain POS tags for all tokens. To examine how the emphasis probabilities are distributed, we divided them evenly into four intervals. Figure 2 shows the occurrence of the top 20 POS tags for all token labels in our training and development sets. POS tags such as "IN", ",", ".", and ":" tend to have low emphasis probabilities (0-0.25). Interestingly, some POS tags like "DT", "CD", and "VBZ" have no words in the highest emphasis probability interval (0.75-1.0). Overall, most POS tags fall into the lowest emphasis probability interval, and the difference lies in the (0.25-0.5) interval, where POS tags like "NN", "NNS", and "VBG" mostly appear.
Similar to POS tags, other hand-crafted features such as punctuation and upper-case tokens helped improve the results of some models. This motivated us to examine the degree of emphasis probability for different lexical features. Figure 3 shows the average emphasis scores for each category in the training and development sets. Comparing all lexical features, "Uppercase start" has the highest average emphasis score, while "Contain numbers" and "Punctuation" have the lowest. This indicates some general trends for emphasis with respect to word categories. We also performed an error analysis of how the length of slides affects the predictions; the results show that longer slides are more challenging because they offer more options to select.
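The sketch below illustrates the tagging and bucketing steps of this analysis, assuming the spaCy library and its small English model (en_core_web_sm) are installed; the function names and feature checks are illustrative.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def pos_tags(slide_text):
        # token.tag_ gives fine-grained Penn Treebank tags such as "NN", "IN", or "CD".
        return [(token.text, token.tag_) for token in nlp(slide_text)]

    def emphasis_interval(probability):
        # Bucket an emphasis probability into the four intervals used in Figure 2.
        if probability < 0.25:
            return "0-0.25"
        if probability < 0.5:
            return "0.25-0.5"
        if probability < 0.75:
            return "0.5-0.75"
        return "0.75-1.0"

    def lexical_features(token):
        # Simple lexical categories in the spirit of Figure 3.
        return {
            "Uppercase start": token[:1].isupper(),
            "Contain numbers": any(ch.isdigit() for ch in token),
            "Punctuation": all(not ch.isalnum() for ch in token),
        }

    print(pos_tags("Risk management is essential."))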

Evaluation Metric
For better comparison with previous work on ES, we followed an evaluation method similar to that of Shirani et al. (2020b). This metric is specifically designed to account for the subjectivity of the task. For each instance x in the test set D_test, let S_m^(x) be the set of m tokens with the highest ground-truth emphasis probabilities and Ŝ_m^(x) the set of m tokens with the highest predicted scores. Match_m is then defined as
Match_m = ( Σ_{x ∈ D_test} |S_m^(x) ∩ Ŝ_m^(x)| / min(m, |x|) ) / |D_test|
To rank models, we compute the average value of Match_m over all values of m and call this averaged value RANK. We treat ground-truth words that have the same probability equally, so if the model predicts any of the tied tokens, we count it as correct.
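The following sketch shows one way to compute Match_m and the RANK score as described above; the specific values of m are left as a parameter, and the tie handling follows the description in the text.

    def match_m(gold_probs, pred_scores, m):
        # gold_probs and pred_scores are per-token lists for one slide page.
        n = len(gold_probs)
        k = min(m, n)
        # Ground-truth tokens tied with the k-th highest probability all count as correct.
        cutoff = sorted(gold_probs, reverse=True)[k - 1]
        gold_top = {i for i, p in enumerate(gold_probs) if p >= cutoff}
        pred_top = set(sorted(range(n), key=lambda i: pred_scores[i], reverse=True)[:k])
        return len(gold_top & pred_top) / k

    def rank_score(gold, pred, m_values):
        # Average Match_m over all test instances and over the chosen values of m.
        per_m = [sum(match_m(g, p, m) for g, p in zip(gold, pred)) / len(gold)
                 for m in m_values]
        return sum(per_m) / len(per_m)

    print(rank_score([[0.9, 0.1, 0.4]], [[0.8, 0.2, 0.3]], m_values=(1, 2)))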

Performance Benchmarks
To better examine the challenges of the dataset and to benchmark the task, we organized a shared task and invited the community to participate in modeling emphasis in this new domain. Participants proposed a range of novel solutions. Table 4 shows the scores and the best methods of the top three teams. The most popular approach was an ensemble of Transformer-based models. Many hand-crafted features, such as part-of-speech (POS) tags, keywords, and lexical features (e.g., capitalized words and punctuation), were explored to improve the models' performance. We describe and compare the top-performing approaches next.
The top-performing team, UBRI-604 (Hu et al., 2021), proposed an end-to-end Transformer-based approach and ranked first with a RANK score of 0.525. The team explored several rich pre-trained Transformer language models, including ALBERT (Lan et al., 2020), GPT-2 (Radford et al., 2019), RoBERTa, ERNIE 2.0, XLNet (Yang et al., 2019), XLM-RoBERTa, and BERT (Devlin et al., 2019). Comparing the results of all seven models, XLM-RoBERTa performed the best. Besides pre-trained language models, UBRI-604 leveraged lexical features such as capitalized words and punctuation for further improvement.
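The sketch below shows the general shape of such a Transformer-based token scorer, assuming PyTorch and the Hugging Face transformers library; it is an illustration of the overall approach, not UBRI-604's exact implementation, and subword-to-word alignment and the extra lexical features are omitted.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class EmphasisScorer(nn.Module):
        def __init__(self, model_name="xlm-roberta-base"):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            self.head = nn.Linear(self.encoder.config.hidden_size, 1)

        def forward(self, input_ids, attention_mask):
            hidden = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            # One emphasis score in [0, 1] per subword token.
            return torch.sigmoid(self.head(hidden)).squeeze(-1)

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    batch = tokenizer(["Risk management is essential"], return_tensors="pt")
    scores = EmphasisScorer()(batch["input_ids"], batch["attention_mask"])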
The DeepBlueAI team ranked second with a RANK score of 0.519, 0.006 lower than the first team's. DeepBlueAI introduced an ensemble Transformer-based model with two fully-connected layers, combined with POS tag embeddings and hand-crafted features. The ensemble takes advantage of the BERT, SciBERT (Beltagy et al., 2019), and ERNIE 2.0 pre-trained language models by averaging the scores they predict.
Lastly, Cisco (Ghosh et al.), with a score 0.001 lower than the second team's, ranked third. Cisco explored two approaches building on the BiLSTM+ELMo model of Shirani et al. (2019), with KL divergence (Kullback and Leibler, 1951) used as the loss function during the training phase.
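As an illustration of a label-distribution-style training objective, the sketch below computes a KL-divergence loss over per-token emphasis distributions, assuming PyTorch; it treats each token's gold emphasis probability as a two-class distribution and is not any team's exact formulation.

    import torch
    import torch.nn.functional as F

    def kl_emphasis_loss(pred_scores, gold_probs, eps=1e-8):
        # pred_scores and gold_probs: tensors of shape (num_tokens,) with values in [0, 1].
        pred = torch.stack([pred_scores, 1 - pred_scores], dim=-1).clamp_min(eps)
        gold = torch.stack([gold_probs, 1 - gold_probs], dim=-1).clamp_min(eps)
        # F.kl_div expects log-probabilities as input and probabilities as target.
        return F.kl_div(pred.log(), gold, reduction="batchmean")

    loss = kl_emphasis_loss(torch.tensor([0.7, 0.2]), torch.tensor([0.875, 0.125]))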

Discussion
The PSED dataset contains slides of different lengths. To examine how slide length affects prediction quality, we performed an error analysis. We divided the test set into three groups based on instance length, namely <60, 60-90, and >90 tokens. We then computed the average Match_m scores over all shared task submissions, four in total, for every example in each group. As shown in Table 5, short slides always achieve better scores than medium and long slides. This indicates that predicting emphasis in longer instances is more challenging, as there are more options (words) to select for emphasis.
Many slides in the PSED dataset contain scientific words. Besides using pre-trained models trained on general-domain text, some teams decided to handle scientific words differently. For example, DeepBlueAI explored the SciBERT (Beltagy et al., 2019) model, which is pre-trained on scientific articles. Cisco, on the other hand, explored training a scientific keyword predictor and used its output as a feature for the model. Extending the proposed approaches to better address the diverse vocabulary of the dataset is an important future direction.

Conclusion
We presented a new dataset for emphasis selection on presentation slides, posing new challenges for modeling emphasis. We organized a shared task and invited researchers to model emphasis for presentation slides. We provided several analyses of the dataset and summarized the insights gained from the shared task. A future extension could explore more robust techniques to address the challenges posed by the PSED dataset's diversity in topic, structure, and length.

Ethics
The data proposed in this work was collected from public-domain sources and does not intrude on user privacy. For the manual annotation work, crowd workers were fairly compensated ($0.55 reward per response, which is above the US minimum wage).