SemEval-2020 Task 10: Emphasis Selection for Written Text in Visual Media

In this paper, we present the main findings and compare the results of SemEval-2020 Task 10, Emphasis Selection for Written Text in Visual Media. The goal of this shared task is to design automatic methods for emphasis selection, i.e. choosing candidates for emphasis in textual content to enable automated design assistance in authoring. The main focus is on short text instances for social media, with a variety of examples, from social media posts to inspirational quotes. Participants were asked to model emphasis using plain text with no additional context from the user or other design considerations. The SemEval-2020 Emphasis Selection shared task attracted 197 participants in the early phase, and a total of 31 teams made submissions to this task. The highest-ranked submission achieved a Match_m score of 0.823. The analysis of systems submitted to the task indicates that BERT and RoBERTa were the most common choices of pre-trained models, and part-of-speech (POS) tags were the most useful feature. Full results can be found on the task’s website.


Introduction
In visual communication, emphasis is an intentional focus on one or more components to create a main focal point or center of interest within the composition. It has been found that it takes the human eye only 50 milliseconds to form an opinion on a visual composition (Lindgaard et al., 2006). Therefore, it is important for a visual element to deliver a clear message by calling attention to specific information. Whether on flyers, posters, ads, social media posts or motivational messages, emphasis is usually designed to grab a viewer's attention by being distinct from the rest of the design elements.
Thanks to various online platforms, a massive amount of digital text is being generated by users every day. These media are filled with content competing for attention and are usually highly designed to engage viewers' attention to convey their message in the most efficient way. For textual content, word emphasis is used as a powerful tool to better convey the desired meaning of the written text to the audience. Utilizing emphasis techniques can potentially add another dimension to the text through visualization. Emphasis on textual content can be done with colors, backgrounds, or fonts, or with styles like italic and boldface to clarify or even change the meaning of a sentence by drawing attention to some specific information. Figure 1a shows an example that is aesthetically appealing but fails to effectively communicate its intent. Understanding the text would allow the system to propose a different layout that emphasizes words that contribute more to the communication of the intent, as shown in Figure 1b.
In the last few years, we have observed many significant improvements in various platforms for generating, formatting, and editing digital text. For example, some graphic design applications such as Adobe Spark perform automatic text emphasis using templates that include images and text with different design effects. However, the layout algorithms used are often inflexible in that they rigidly emphasize words based on the visual attributes (e.g., word length) of those words, rather than the semantics of the text. As a result, the outcome may fail to accurately communicate the meaning of the written text, resulting in unintended emphasis and the wrong message to the audience. In contrast, an emphasis selection model can potentially make better suggestions by having a better understanding of the input text.
Task Characteristics This emphasis selection task poses new challenges associated with the nature of the task: (1) No additional context from the user or the rest of the design such as background image is provided. Therefore the proposed task requires a computational understanding of the written text. (2) The dataset contains very short texts, usually fewer than ten words. Generally, working with short text instances is challenging since the decision needs to be made by only considering a few words.
(3) Word emphasis patterns are author- and domain-specific; therefore, without knowing the author's intent and considering only the input text, multiple emphasis selections are valid. However, a good model should be able to capture the inter-subjectivity or common sense within the given annotations and label words according to higher agreement.
Expected Impact of the Task The ultimate goal is to enable design assistance for authors by suggesting words that are good candidates to emphasize. The typical applications of this task include, but are not limited to, creating flyers, posters, drawings, advertisements and other visual material one may find online and across social media platforms such as Pinterest, Instagram and Snapchat. Moreover, emphasis selection models have applications in many design programs such as Adobe Spark, Apache OpenOffice Impress, GIMP, or Microsoft PowerPoint.
SemEval Emphasis Selection Task In this shared task, we invited research on a novel Natural Language Processing (NLP) task that represents unique algorithmic and modeling challenges due to its nature. We observed a diverse and interesting set of solutions to tackle the existing challenges from a large number of participants, both from academia and industry. As part of this shared task, we released a dataset annotated with word emphases, which served as a benchmark to compare various techniques. Furthermore, we expect the task to be interesting for researchers studying relevant tasks such as machine-human interaction, reading comprehension, graphic design and user experience. In the following sections, we describe the setup, participation, results, and more importantly, the insights gained from the task.

Task Definition
Given a sequence of tokens $C = \{x_1, \ldots, x_n\}$, a real number $y_i \in [0, 1]$ needs to be assigned to each token in the sequence, indicating the degree to which the token should be emphasized. In other words, we define the emphasis score $y_i$ as the probability or weight of the $i$-th token in the sequence. During evaluation, the final set of emphasized tokens is generated by selecting the tokens with the highest values (described in Section 5).
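The selection step in this definition can be illustrated with a minimal sketch. The function name `select_top_m`, the tie-breaking rule (earlier position wins), and the example scores are assumptions made for illustration; they are not taken from the task's official tooling.

```python
from typing import List, Sequence


def select_top_m(scores: Sequence[float], m: int) -> List[int]:
    """Return the positions of the m highest-scoring tokens.

    Ties are broken by position (earlier token first); this rule is an
    assumption of the sketch, not part of the task definition.
    """
    if m < 0:
        raise ValueError("m must be non-negative")
    # Sort positions by descending score, then by ascending position.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:m])


# Hypothetical emphasis scores y_i for a four-token sequence.
tokens = ["Imagination", "rules", "the", "world"]
scores = [0.9, 0.4, 0.1, 0.5]

chosen = select_top_m(scores, 2)
emphasized = [tokens[i] for i in chosen]
```

Working on positions rather than word strings keeps repeated words in a sentence distinct.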

Related Work
We first introduced and formulated the task of emphasis selection in Shirani et al. (2019), in which an end-to-end label distribution learning (LDL) model in a sequence tagging architecture is proposed to model emphasis. We evaluated the model against different baselines on the Spark dataset (introduced in Section 3). Keyword or key-phrase detection may be the closest topic to emphasis selection. Keywords can capture the main topics described in a given document (Turney, 2002). Modeling keywords or key-phrases has been widely addressed in different domains such as news articles (Wan et al., 2007), scientific publications (Nguyen and Kan, 2007) and Twitter data (Zhang et al., 2016; Bellaachia and Al-Dhelaan, 2012). Keyword detection mainly focuses on finding important nouns or noun phrases (Augenstein et al., 2017). In contrast, emphasis could be applied to a subset of words with different roles in a sentence. Generally, word emphasis may be used to express emotions, show contrast, capture a reader's interest or clarify a message. Moreover, emphasis selection in social media posts deals with very short texts, and the prediction needs to be made based on a single instance.
In the context of expressive prosody generation, emphasis has been addressed based on acoustic and prosodic features that exist in spoken data. For example, Nakajima et al. (2014) predicted emphasized accent phrases from advertisement text information, and Mass et al. (2018) modeled word emphasis on audience-addressed speeches.

Data Collection
The data used for this shared task combines two datasets from different sources, created from scratch based on texts collected from the Adobe Spark and Wisdom Quotes websites. The dataset can be found in the task's data repository. The two datasets are described below. The Spark dataset is a collection of 1,195 instances from Adobe Spark. It contains a variety of subjects featured in flyers, posters, advertisements or motivational memes on social media. The Quotes dataset is a collection of 2,681 quotes from well-known authors, collected from Wisdom Quotes. Table 1 provides details about the length of instances in the datasets. The combined Emphasis dataset, with 3,876 instances, consists of 44,976 words and 4,886 unique words. We used Amazon Mechanical Turk and asked nine annotators to label each piece of text. More precisely, we asked annotators to select word(s) in the given text that should be emphasized. Having nine annotators gives us the ability to capture different viewpoints, each focusing on different parts of the sentence. Figure 2 shows an example of text annotated with nine annotations. In this example, there is more consensus in emphasizing words like "inspiration" and "Genius". On the other hand, words like "is" and "percent" are not good candidates based on general agreement. To ensure high-quality annotation, we included carefully designed quality questions in 10 percent of the HITs. Moreover, we only allowed master annotators to participate.
The data was split randomly into training, development and test sets. A training set of 2,741 instances, a development set of 392 instances, and a test set of 743 instances were released to the participants. A Fleiss' Kappa score (Shrout and Fleiss, 1979) of 24.60 was observed on the dataset. Such a Kappa score indicates the existence of multiple perspectives about emphasis in the dataset. Table 2 shows an example of a short text annotated with the BIO annotations. As shown, words such as "Best" are selected more often for emphasis than other words in the sequence. First, we compute the label distribution for each instance, which corresponds to the count per label normalized by the total number of annotations (shown in the "Norm. Freq." column). Then we compute emphasis probabilities for all the words in the sequence. The final evaluation is against ground truth emphasis probabilities (explained in Section 5).
Table 2: An example from the Spark dataset along with its nine annotations. In this table, "B/I"s and "O"s represent emphasis and non-emphasis words respectively. "B"s indicate the beginning and "I"s indicate the inside of emphasis. The "Freq." and "Norm. Freq." columns show the raw and normalized values for label frequencies respectively.
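The conversion from BIO annotations to emphasis probabilities described above can be sketched as follows. This is a minimal illustration that assumes the emphasis probability of a word is the normalized frequency of its "B" and "I" labels combined; the function names and the three-token example are hypothetical and not taken from the released data.

```python
from collections import Counter
from typing import Dict, List, Sequence

LABELS = ("B", "I", "O")


def label_distribution(word_labels: Sequence[str]) -> Dict[str, float]:
    """Count per label, normalized by the number of annotations."""
    counts = Counter(word_labels)
    total = len(word_labels)
    return {label: counts[label] / total for label in LABELS}


def emphasis_probabilities(annotations: Sequence[Sequence[str]]) -> List[float]:
    """annotations[a][i] is the label annotator a gave to token i.

    Assumption of this sketch: a token counts as emphasized when it is
    labeled either "B" or "I".
    """
    n_tokens = len(annotations[0])
    if any(len(ann) != n_tokens for ann in annotations):
        raise ValueError("all annotations must cover the same tokens")
    probs = []
    for i in range(n_tokens):
        dist = label_distribution([ann[i] for ann in annotations])
        probs.append(dist["B"] + dist["I"])
    return probs


# Hypothetical three-token text labeled by nine annotators.
annotations = (
    [["B", "O", "O"]] * 3      # three annotators emphasize only token 0
    + [["B", "I", "O"]] * 3    # three emphasize tokens 0 and 1
    + [["O", "O", "O"]] * 3    # three emphasize nothing
)
probs = emphasis_probabilities(annotations)
```

In this example token 0 is emphasized by six of nine annotators and token 1 by three of nine, so their probabilities are 6/9 and 3/9.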

Data Analysis
Many systems reported performance gains from using part-of-speech (POS) tags in their models. In this section, we analyze the effectiveness of this feature by closely examining the top 20 POS tags in our dataset. We used the Stanford Part-Of-Speech Tagger (Toutanova et al., 2003) to obtain POS tags for all tokens in our dataset. We divide the emphasis probabilities into four intervals (0-0.25, 0.25-0.50, 0.50-0.75 and 0.75-1.00) and compute how the POS tags are distributed across these four intervals. Figure 3 shows the occurrence of the top 20 POS tags in the four emphasis probability intervals for all token labels in our training set. Tokens with POS tags like ",", ".", "DT" and "PRP" tend to have low emphasis probabilities (0-0.25). Interestingly, words with the highest probabilities (0.75-1.00) usually have the tags "NNP", "NN" and "JJ". As we expected, there are some general trends for emphasized words with respect to the type of words in sentences, which make POS tags a useful feature for modeling emphasis.
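The interval analysis can be sketched in a few lines. The sketch assumes tokens have already been tagged (the tagger itself is not invoked here) and that a probability lying exactly on a boundary falls into the upper interval, which the text does not specify; the function names and the tagged examples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

INTERVALS = ("0-0.25", "0.25-0.50", "0.50-0.75", "0.75-1.00")


def interval_index(p: float) -> int:
    """Map an emphasis probability to one of the four intervals.

    Boundary values go to the upper interval (an assumption); 1.0 is
    clamped into the last interval.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return min(int(p / 0.25), len(INTERVALS) - 1)


def pos_interval_counts(
    tagged: Iterable[Tuple[str, float]]
) -> Dict[str, List[int]]:
    """Count, per POS tag, how many tokens fall into each interval."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0] * len(INTERVALS))
    for tag, prob in tagged:
        counts[tag][interval_index(prob)] += 1
    return dict(counts)


# Hypothetical (POS tag, emphasis probability) pairs.
tagged_tokens = [
    ("NN", 0.89), ("DT", 0.0), ("NN", 0.56),
    (".", 0.11), ("JJ", 0.78), ("DT", 0.22),
]
counts = pos_interval_counts(tagged_tokens)
```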

Evaluation Metric
The evaluation was performed on the test set. Participants were asked to provide a real value (greater than or equal to zero) for each token in the test set, indicating the probability of the token being emphasized. All models were evaluated with the Match_m metric and ranked based on the average of the scores for m = 1, 2, 3, 4.
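Match_m: For each instance $x$ in the test set $D_{test}$, we selected a set $S_m^{(x)}$ of $m \in \{1 \ldots 4\}$ words with the top $m$ probabilities according to the ground truth. Analogously, we selected a prediction set $\hat{S}_m^{(x)}$ for each $m$, based on the predicted probabilities. The metric Match_m is defined as:

$$\mathrm{Match}_m := \frac{\sum_{x \in D_{test}} |S_m^{(x)} \cap \hat{S}_m^{(x)}| / \min(m, |x|)}{|D_{test}|}$$

Finally, we computed the average value of Match_m for all $m \in \{1 \ldots 4\}$ and ranked the submitted systems based on this averaged value (RANK). To better handle word duplicates, the computation is based on the position of words in a sentence rather than the actual words. Note that there were many cases where two or more tokens have exactly the same probability. In such cases, if the model predicted any one of the tied tokens, we considered it a correct answer. Table 5 shows some examples from the dataset, illustrating how the metric is computed.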
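The metric can be sketched in code under stated assumptions. The sketch assumes that Match_m is the size of the overlap between the top-m ground-truth positions and the top-m predicted positions, divided by min(m, |x|) and averaged over instances, and that ground-truth ties are handled by crediting any tied token up to the number of open slots. All function names are hypothetical, and this is not the official scorer.

```python
from typing import List, Sequence, Set


def top_positions(scores: Sequence[float], k: int) -> Set[int]:
    """Positions of the k highest scores (earlier position wins ties)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def match_m_instance(gold: Sequence[float], pred: Sequence[float], m: int) -> float:
    """Tie-aware overlap between gold and predicted top-m positions."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned")
    k = min(m, len(gold))
    predicted = top_positions(pred, k)
    # The k-th largest gold value is the cut-off; exact float equality is
    # intended here because gold values come from the same annotation counts.
    cutoff = sorted(gold, reverse=True)[k - 1]
    required = {i for i, g in enumerate(gold) if g > cutoff}
    tied = {i for i, g in enumerate(gold) if g == cutoff}
    open_slots = k - len(required)
    hits = len(predicted & required) + min(len(predicted & tied), open_slots)
    return hits / k


def match_m(golds: List[Sequence[float]], preds: List[Sequence[float]], m: int) -> float:
    """Average of the per-instance scores over the test set."""
    return sum(match_m_instance(g, p, m) for g, p in zip(golds, preds)) / len(golds)


def rank_score(golds: List[Sequence[float]], preds: List[Sequence[float]]) -> float:
    """Average of Match_m over m = 1..4."""
    return sum(match_m(golds, preds, m) for m in range(1, 5)) / 4


# Hypothetical gold probabilities and system predictions for two instances.
golds = [[0.89, 0.33, 0.0, 0.33], [0.1, 0.8, 0.2]]
preds = [[0.7, 0.2, 0.1, 0.6], [0.9, 0.3, 0.1]]
rank = rank_score(golds, preds)
```

In the first instance, positions 1 and 3 are tied in the gold data, so predicting either as the second word counts as a match. In the second instance, the system misses the top word, which lowers the scores for m = 1 and m = 2.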

Baseline Model
We provided a baseline model for this task. This model (DL-BiLSTM-ELMo) is a sequence-labeling model that uses ELMo contextualized embeddings (Peters et al., 2018) together with two BiLSTM layers to label emphasis. During the training phase, the Kullback-Leibler Divergence (KL-DIV) (Kullback and Leibler, 1951) is used as the loss function. Further analysis and the complete description of this model are provided in Shirani et al. (2019).
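To make the training objective concrete, the following is a minimal sketch of a KL-divergence loss between a ground-truth label distribution and a predicted one for each token. It is written in plain Python for readability rather than in a deep learning framework, the two-label layout (emphasis, non-emphasis) and the averaging over tokens are assumptions, and the function names are hypothetical; it is not the baseline's actual implementation.

```python
import math
from typing import Sequence


def kl_divergence(target: Sequence[float], predicted: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(target || predicted) for two discrete distributions.

    Terms with zero target mass contribute nothing; predicted values are
    floored at eps to avoid division by zero.
    """
    if len(target) != len(predicted):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for t, p in zip(target, predicted):
        if t > 0.0:
            total += t * math.log(t / max(p, eps))
    return total


def sequence_kl_loss(targets: Sequence[Sequence[float]],
                     predictions: Sequence[Sequence[float]]) -> float:
    """Mean per-token KL divergence over a sequence (an assumed reduction)."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("targets and predictions must be non-empty and aligned")
    return sum(kl_divergence(t, p) for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical per-token distributions over (emphasis, non-emphasis).
targets = [[1.0, 0.0], [0.5, 0.5]]
predictions = [[0.5, 0.5], [0.5, 0.5]]
loss = sequence_kl_loss(targets, predictions)
```

Training against the full label distribution, rather than a single majority label, lets a model reflect how strongly annotators agreed on each token.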

Systems and Results
This task attracted 197 participants, and a total of 31 teams made submissions. The teams that submitted papers for the SemEval-2020 proceedings are listed in Table 3. In total, 25 teams scored higher than the baseline and six teams scored lower. Thirteen of the 31 teams also submitted system description papers.
The base models used in the task submissions ranged from ELMo (Peters et al., 2018) to transformer-based pre-trained language models. Figure 4 shows the different pre-trained models used in this task. Among them, BERT and RoBERTa were used most often. Ensembles of transformer-based models were one of the most popular approaches (26% of submissions). All submissions applied deep neural network techniques to model emphasis. Moreover, some teams explored hand-crafted features, such as part-of-speech tags, named entities, and valence, arousal, and dominance scores, to enhance the performance of their models.

Top Systems
The results for each of the four scores, as well as the RANK score, are shown in Table 3. The top three teams based on the RANK score are ERNIE (Huang et al., 2020), Hitachi (Morio et al., 2020), and IITK (Singhal et al., 2020). The top-performing team, ERNIE, achieved the highest averaged Match_m score of 0.823, which is 0.009 points higher than the second team and 0.013 points higher than the third team. ERNIE achieved the highest score not only in RANK but across all four individual scores. The next system on our leaderboard is Hitachi, with a score of 0.814. Finally, IITK stands in third place with a RANK score of 0.810.

Best Paper Awards
Our shared task awarded several best paper distinctions to complement the top performing systems. Here are the categories of best papers and the winners for each:
• Best system description paper: IDS (Shin et al., 2020), this paper, with interesting analysis components, advances our understanding regarding the effectiveness of pre-trained language models for this specific task.
• Best result interpretation paper: MIDAS (Anand et al., 2020), the authors go the extra mile to analyze the results in this paper.
• Best negative results paper: UIC-NLP (Hossu and Parde, 2020), the authors performed extensive experiments with non-contextualized pre-trained models as well as a variety of hand-crafted features. Through the error analysis, the authors identified a number of common challenging patterns for the model, including late-phrase words, sequences of words, and abnormal/poetic sentence structure.

Top Performing Systems and Novel Architectures
In this section, we provide a brief description of the best performing and novel approaches. Table 4 shows a high level summary of these systems. ERNIE achieved the highest score by fine-tuning ERNIE 2.0 as the base model. They also reported high performance by using other pre-trained models like XLM-RoBERTa, RoBERTa and ALBERT. They further boosted the model by utilizing data augmentation and hand-crafted features like word capitalization and the occurrence of hashtags in instances.
Hitachi tackled the task by combining rich contextualized embeddings and fine-tuning seven Pre-trained Language Models (PLMs) on the task such as BERT, GPT-2 (Radford et al., 2019), RoBERTa, XLM-RoBERTa, XLNet (Yang et al., 2019), XLM (Lample and Conneau, 2019), and T5. In addition, they added POS tags and token embeddings from a character-level LSTM layer. They introduced a distribution fusion system to fuse the output distributions of the fine-tuned models and find the optimal hyperparameter set. They showed the performance gain of the fusion model over average ensemble as well as individual PLMs. Among all PLMs, BERT and XLNet models were more successful in predicting emphasis individually.
IITK, the team ranking in third place, proposed an ensemble model where the base models were BERT, RoBERTa, and XLNet. To aggregate the outputs, they computed the average of the scores predicted by these models. The authors also provided different baselines, from a character-level BiLSTM model with attention to transformer-based models like XLM-RoBERTa, ALBERT and GPT-2. When comparing all individual models, XLNet-Large performed the best.
A wide range of novel methods were used to model emphasis. For example, FPAI (Guo et al., 2020) converted the task of emphasis selection to a simplified query-based machine reading comprehension (MRC) task, where the goal was to answer the fixed query, "Find candidates for emphasis".
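The score averaging described for IITK's ensemble can be illustrated with a minimal sketch. The function name and the example scores are hypothetical, and the sketch only shows unweighted averaging of per-token scores, not any team's actual code.

```python
from typing import List, Sequence


def average_ensemble(model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token emphasis scores across several models.

    model_scores[k][i] is the score model k assigned to token i.
    """
    if not model_scores:
        raise ValueError("at least one model is required")
    n_tokens = len(model_scores[0])
    if any(len(scores) != n_tokens for scores in model_scores):
        raise ValueError("all models must score the same tokens")
    n_models = len(model_scores)
    return [
        sum(scores[i] for scores in model_scores) / n_models
        for i in range(n_tokens)
    ]


# Hypothetical outputs of three fine-tuned models for a three-token text.
model_scores = [
    [0.9, 0.2, 0.4],
    [0.7, 0.4, 0.3],
    [0.8, 0.3, 0.2],
]
ensemble = average_ensemble(model_scores)
```

Because the evaluation only depends on the ordering of tokens within an instance, averaging is a simple way to combine models whose scores are on comparable scales.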
To tackle the low inter-annotator agreement in the dataset, Tëxtmarkers (Glocker and Markianos Wright, 2020) attempted to model multiple annotators jointly by adapting a crowd layer architecture (Rodrigues and Pereira, 2018), introducing initialization with agreement dependent noise. The crowd layer is intended to help the model to outperform a baseline trained with token level majority voting.
IDS (Shin et al., 2020) performed an interesting analysis of pre-trained models to investigate whether PLMs contain enough knowledge to select proper words for emphasis. They compared different zero-shot models in which self-attention distributions of PLMs were used to emphasize words. More precisely, the authors investigated individual attention heads of different models like BERT, DistilBERT, GPT-2, RoBERTa, XLNet, and XLM to probe their ability to identify emphasis without any fine-tuning. Their interesting findings indicate that DistilBERT is more successful in predicting emphasis while XLNet and GPT-2 perform poorly when there is no training for this task.
The top non-transformer-based model, Procyon (ranked 12th), successfully proposed an ELMo-based multi-modal model with two sub-networks to learn emphasis scores based on word representations and POS tags separately.

Discussion
To better understand the challenges of this task, we perform an error analysis to examine where the models succeed and in what situations they face difficulties in selecting emphasis words. More specifically, we compute the average Match_m score over all 31 submissions for every example in the test set and examine the challenging cases for all models. Table 5 shows some interesting examples from the test set with three Match_m scores (m1-m3) from all submissions, where m1 stands for the average score for system predictions obtained by selecting the top word, and m3 stands for results from selecting the top three words. In many cases, selecting emphasis words was unchallenging for most of the systems (e.g., S1 in Table 5, with "Imagination" as the top word and "rules" and "world" with the same emphasis probability). In some examples, there is no single token standing out in the sentence, so it was not easy to select one single word with certainty. S2 is a good example with low m1 and high m2 and m3, indicating disagreement between models and annotators in choosing the first word with the highest probability. We also observed many cases where one word clearly stands out in the sentence but it is not clear which words should be selected next. S3 is an example of this, where most systems were able to select the top word "talked" correctly but faced difficulties in predicting other words for that sentence.
There are some cases where prediction is easy for humans but still poses challenges for models. For example, most annotators agreed on selecting "basketball" with the highest probability in S4; however, many models failed to select this word in the top position, probably due to the unusual structure of the sentence. In this example, "East", "Sleep" and "Watch" have equal probabilities in the annotation.
Table 5: Examples from the test set with averaged Match_m scores across all submitted systems. Words with high emphasis probability labels are shown in bold.

Num | Sentence | Match_m
S1-1 | Imagination rules the world. | m1 = 0.9354
S1-2 | Imagination rules the world. | m2 = 1.0
S1-3 | Imagination rules the world. | m3 = 1.0
S2-1 | All successes begin with self-discipline. It starts with you. | m1 = 0.0322
S2-2 | All successes begin with self-discipline. It starts with you. | m2 = 0.8225
S2-3 | All successes begin with self-discipline. It starts with you. | m3 = 0.9677
S3-1 | I learned most, not from those who taught me but from those who talked with me. | m1 = 0.8387
S3-2 | I learned most, not from those who taught me but from those who talked with me. | m2 = 0.5645
S3-3 | I learned most, not from those who taught me but from those who talked with me. |

Conclusion
This paper summarizes the insights gained from organizing Task 10 at SemEval-2020. Given a short piece of text, the task consists of selecting candidate words to emphasize. We received 31 system submissions, with 13 teams submitting a system description paper. While there were many differences between individual systems, we observed a strong trend favoring the use of transformer-based models as a key ingredient in the proposed architectures. Many description papers present valuable analyses of the data and task. We encourage readers interested in this task to take a careful look at these papers for additional inspiration on how to improve results further.