MIDAS at SemEval-2020 Task 10: Emphasis Selection Using Label Distribution Learning and Contextual Embeddings

This paper presents our submission to SemEval-2020 Task 10 on emphasis selection in written text. We approach emphasis selection as a sequence labeling task in which the underlying text is represented with various contextual embedding models. We also employ label distribution learning to account for disagreements among annotators. We experiment with the choice of model architecture, the trainability of layers, and different contextual embeddings. Our best performing system is an ensemble of different models, which achieved an overall match score of 0.783, placing us 15th out of 31 participating teams. Lastly, we analyze the results in terms of parts of speech tags, sentence lengths, and word ordering.


Introduction
Emphasis selection is an emerging research problem (Shirani et al., 2019) in the natural language processing domain, which involves automatic identification of words or phrases from a short text that would serve as good candidates for visual emphasis. This research is most relevant to visual media such as flyers, posters, ads, and motivational messages where certain words or phrases can be visually emphasized with the use of different color, font, or other typographic features. This type of emphasis can help with expressing an intent, providing more clarity, or drawing attention towards specific information in the text. Automatic emphasis selection is therefore useful in graphic design and presentation applications to assist users with appropriate choice of text layout.
Prior work in speech processing (Mishra et al., 2012; Chen and Pan, 2017) has modeled word-level emphasis using acoustic and prosodic features. Understanding emphasis in speech is critical to many downstream applications such as text-to-speech synthesis (Nakajima et al., 2014), speech-to-speech translation (Do et al., 2015), and computer-assisted pronunciation training (Felps et al., 2009). In computational linguistics, emphasis selection is closely related to the problem of keyphrase extraction (Turney, 2002). Keyphrases typically refer to nouns and noun phrases that capture the most salient topics in long documents such as scientific articles (Sahrawat et al.; Mahata et al., 2018; Swaminathan et al., 2020), news articles (Hulth and Megyesi, 2006), and web pages (Yih et al., 2006). In contrast, emphasis selection deals with very short texts (e.g. social media posts), and emphasis can be applied to words belonging to various parts of speech.
The goal of SemEval-2020 Task 10 is to design methods for automatic emphasis selection in short texts. To this end, the organizers (Shirani et al., 2020) provided a dataset consisting of over 3,000 sentences annotated for token-level emphasis by multiple annotators. The authors employed the standard I-O tagging schema, which is widely used for the annotation of token-level tags. We approached emphasis selection as a sequence labeling task solved using a Bidirectional Long Short-Term Memory (BiLSTM) model, where the individual tokens are represented using various contextual embedding models. We also employ label distribution learning (LDL) (Geng, 2016) to account for disagreements among the annotators.
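To make the I-O schema concrete, the following minimal sketch shows a token-aligned annotation; the sentence and tags are a made-up illustration, not drawn from the task dataset:

```python
# I-O tagging: "I" marks a token inside an emphasized span, "O" a token outside it.
# The sentence and tags below are a made-up illustration, not from the dataset.
tokens = ["Never", "stop", "dreaming", "!"]
tags   = ["O",     "I",    "I",        "O"]

# The annotation is token-aligned: exactly one tag per token.
assert len(tokens) == len(tags)
emphasized = [t for t, tag in zip(tokens, tags) if tag == "I"]
print(emphasized)  # ['stop', 'dreaming']
```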

Methods
Let d = {w_1, w_2, ..., w_n} be the input text, where w_i is the i-th token. The problem of emphasis selection is to assign each token w_i one of two possible labels E = {e_I, e_O}, where e_I denotes emphasis on the token and e_O means otherwise. We approach this problem as a sequence labeling task solved using a BiLSTM model. We first represent each token w_i with a dense vector x_i of a fixed size. To this end, we explore three different embedding architectures: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and XL-Net (Yang et al., 2019). Thus the given input text d is transformed into a sequence of vectors {x_1, x_2, ..., x_n}. We then feed these vectors to a BiLSTM model, which captures the sequential relations between the tokens. The hidden state h_i of the BiLSTM is associated with the token w_i. Thus h_i provides a fixed-size representation for token w_i while incorporating information from the surrounding tokens.
In a standard sequence prediction problem, we could apply an affine transformation to map h_i to the class space. However, in this paper, as in (Shirani et al., 2019), we employ LDL (Geng, 2016), which transforms the output space into a distribution over the labels E. That is, the objective of the model is not to assign a single label to a token but a real-valued vector. This vector is a distribution over the labels E, where the values are proportional to the number of annotations. To achieve this objective, we use the KL-divergence between the predictions and the ground truth as the loss function for the model:

L = Σ_{e_j ∈ E} p(e_j) log ( p(e_j) / p̂(e_j) )

The above equation gives the loss for one sample, where p(e_j) is the ground-truth distribution and p̂(e_j) is the model prediction. Note that the above equation reduces to the negative log-likelihood in the case of standard sequence prediction. The entire architecture is described in Figure 1.
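As a concrete sketch of the LDL objective (plain Python, with hypothetical helper names), the snippet below builds a ground-truth distribution over {e_I, e_O} from annotator counts and computes the per-token KL-divergence loss; it also shows the reduction to negative log-likelihood when the target is one-hot:

```python
import math

def emphasis_distribution(num_emphasized, num_annotators):
    """Ground-truth label distribution for one token: the fraction of
    annotators who marked it as emphasized (e_I) vs. not (e_O)."""
    p_I = num_emphasized / num_annotators
    return [p_I, 1.0 - p_I]

def kl_loss(p, p_hat, eps=1e-12):
    """KL divergence between the ground-truth distribution p and the
    model prediction p_hat for a single token."""
    return sum(pj * math.log((pj + eps) / (qj + eps))
               for pj, qj in zip(p, p_hat))

# A token emphasized by 6 of 9 annotators:
p = emphasis_distribution(6, 9)        # [0.666..., 0.333...]
loss = kl_loss(p, [0.7, 0.3])

# With a one-hot ground truth, the loss reduces to negative log-likelihood:
nll = kl_loss([1.0, 0.0], [0.8, 0.2])  # ≈ -log(0.8)
```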

Experimental work

Dataset
The dataset provided for this SemEval task consists of 3,134 samples labeled for token-level emphasis by multiple annotators. The data was split into a training set of 2,742 samples and a development set of 392 samples. The training set has approximately 12 tokens per sample, with the longest sample containing 38 tokens and the shortest one token. Likewise, the development set also has approximately 12 tokens per sample, with the longest sample containing 31 tokens and the shortest two tokens. Shirani et al. (2019) provide more details about the experimental protocols used for data collection.

Experimental settings
We trained all the BiLSTM models using stochastic gradient descent in batched mode with a batch size of 32. We used four different contextual embedding models for word representation: BERT (bert-base-uncased), BERT cased (bert-base-cased), RoBERTa (roberta-base), and XL-Net (xlnet-base-cased). We also experimented with replacing the BiLSTM layer with a simple feed-forward dense layer, and with the trainability of the different layers in the architecture: none of the layers trainable, only the last layer trainable, or all the layers trainable. All the models were trained for 20 epochs; after each epoch, we evaluated on the development dataset and stored the model from the best performing epoch. The hidden layers of the BiLSTM models were set to 128 units, the dense layers had 256 units, and the models were trained at learning rates ranging from 2e-5 to 3e-4. We evaluated all the models in terms of match scores as described in (Shirani et al., 2019). These match scores, for a given cardinality m, quantify the intersection between the top m model predictions for emphasis and the ground truth obtained from the annotations.

Table 1 presents the performance of both the BiLSTM and the dense models for different choices of embeddings and varying numbers of trainable layers. The first observation from these results is that the choice of architecture (BiLSTM vs. dense) did not make a big difference in performance. Second, the choice of embeddings did contribute significantly to the performance: RoBERTa-based models most often obtained the best scores, while XL-Net-based models obtained the lowest. Lastly, we observed that model performance improved with more trainable layers, irrespective of the choice of architecture or embeddings. The best performing model, with an average match score of 0.788, was RoBERTa with a dense layer and all layers set to be trainable.
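Our reading of the match score can be sketched as follows; `match_m` is a hypothetical name, and ties in the top-m selection are broken arbitrarily here (the exact metric definition is in Shirani et al., 2019):

```python
def match_m(pred_scores, gold_scores, m):
    """Match score for cardinality m: the overlap between the m tokens
    with the highest predicted emphasis and the m tokens with the
    highest ground-truth emphasis, divided by m."""
    def top(scores):
        # Indices of the m highest-scoring tokens (ties broken by position).
        return set(sorted(range(len(scores)), key=lambda i: -scores[i])[:m])
    return len(top(pred_scores) & top(gold_scores)) / m

pred = [0.9, 0.1, 0.8, 0.3, 0.2]
gold = [0.7, 0.2, 0.1, 0.6, 0.3]
score = match_m(pred, gold, 2)  # top-2 pred {0, 2}, top-2 gold {0, 3} -> 0.5
```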

Ensembling
We experimented with two model ensembling approaches: average and weighted average. Average ensembling predicts the output simply as the average of the outputs from all the models. In weighted averaging, we use each model's performance on the development dataset to weight its contribution towards the final prediction. We observed that the difference between these two ensembling approaches was rather minimal. We also tried ensembles of models with different combinations of architectures and embeddings, but eventually observed that the ensemble of all the models obtained the best performance. Table 2 summarizes the results from some of these experiments. Our best system achieved an average match score of 0.783 on the final test dataset, placing us 15th out of 31 teams. The highest score achieved in the task was 0.823.
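The two ensembling schemes can be sketched as a single helper (hypothetical function name; assumes all models produce per-token emphasis probabilities for the same tokenized sample):

```python
def ensemble(model_probs, dev_scores=None):
    """Combine per-token emphasis probabilities from several models.
    With dev_scores given, each model is weighted by its development-set
    match score; otherwise a plain average is used."""
    n = len(model_probs)
    if dev_scores is None:
        weights = [1.0 / n] * n
    else:
        total = sum(dev_scores)
        weights = [s / total for s in dev_scores]
    num_tokens = len(model_probs[0])
    return [sum(w * probs[t] for w, probs in zip(weights, model_probs))
            for t in range(num_tokens)]

# Three models scoring the same 3-token sample:
probs = [[0.9, 0.2, 0.4], [0.7, 0.3, 0.5], [0.8, 0.1, 0.6]]
avg = ensemble(probs)                                   # simple average
wavg = ensemble(probs, dev_scores=[0.78, 0.75, 0.77])   # weighted by dev performance
```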

Emphasis vs. Parts of Speech
We wanted to understand how the model predictions compare to the annotations for various parts of speech (POS) tags. Table 3 presents the average emphasis score of the human annotators on the development dataset for various POS tags, alongside the predictions from the best BERT and RoBERTa models. Of the various POS tags, nouns, proper nouns, and adjectives are the classes with the most emphasis. This is also the case with the model outputs; however, the models seem to predict higher emphasis scores for proper nouns than for nouns or adjectives. At the other end of the spectrum are coordinating conjunctions, adpositions, and punctuation. Figure 2 shows an example where our models achieve very low match scores: the models predicted the nouns 'Happiness' and 'Unhappiness' to have high emphasis, but the annotators emphasized tokens that are verbs and adverbs.

Shuffling Word Order
We wanted to demonstrate that our models are not just picking up on certain keyphrases but are capturing important semantics in the data. To this end, we trained a new set of models on the training dataset where, for each sample, the order of the words was randomly shuffled. The resulting models were then evaluated on the development dataset. We repeated this experiment five times; Table 4 presents the average performance across these runs. Also included in this table is a baseline model which predicts a random score for each token. As expected, the models trained on shuffled data are significantly worse than their counterparts in Table 1. Another interesting observation is that the performance of these models is comparable to the random baseline. This suggests that word order, and therefore semantic structure, is very important to the emphasis selection problem.

Table 4: Performance of models trained on data where the sentences were randomly shuffled.
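The shuffling ablation can be sketched with a hypothetical helper; the key point is that each token keeps its emphasis label while the word order is randomized:

```python
import random

def shuffle_sample(tokens, labels, seed=None):
    """Shuffle a sample's word order for the ablation, keeping each
    token paired with its emphasis label."""
    rng = random.Random(seed)
    pairs = list(zip(tokens, labels))
    rng.shuffle(pairs)
    shuffled_tokens, shuffled_labels = zip(*pairs)
    return list(shuffled_tokens), list(shuffled_labels)

tokens = ["never", "stop", "dreaming"]
labels = [0.2, 0.8, 0.9]
t, l = shuffle_sample(tokens, labels, seed=13)
# The token set and token-label pairing are preserved; only the order changes.
```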

Length vs. Performance
We also wanted to understand how the model performance is influenced by the length of the samples. As mentioned earlier, the average length of a sample in the dataset is 12 tokens, and the standard deviation of the length is around 6. Driven by these statistics, we split the development data into three sets: Short (<6 tokens, 80 samples), Medium (6 to 18 tokens, 262 samples), and Long (>18 tokens, 50 samples). Table 5 summarizes the results (average match scores) of all the models split into these three groups. All the models, irrespective of the choice of architecture or embeddings, deteriorate with increasing sample length. The difference between the longest and shortest samples is most pronounced for BERT-based models. RoBERTa-based models seem to handle longer samples much better than the other two embeddings.
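The length bucketing used in this analysis can be expressed as a small helper (hypothetical name; we assume the "6 to 18" boundaries of the Medium bucket are inclusive):

```python
def length_bucket(tokens):
    """Assign a sample to the Short/Medium/Long split used in the
    length analysis: Short (<6 tokens), Medium (6-18), Long (>18)."""
    n = len(tokens)
    if n < 6:
        return "Short"
    elif n <= 18:
        return "Medium"
    return "Long"

sample = "you only live once".split()
bucket = length_bucket(sample)  # 4 tokens -> "Short"
```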

Conclusion
In this paper, we presented our submission to SemEval-2020 Task 10 on emphasis selection in written text. Our best performing model achieved an overall match score of 0.783, placing us 15th out of 31 participating teams. We approached emphasis selection as a sequence prediction problem solved using BiLSTMs. Our experimental work demonstrates the effect of model architectures, trainability of layers, and choice of embeddings on performance. We analyzed the results in terms of parts of speech tags and sentence lengths; this analysis provides some interesting insight into the shortcomings of the models and the challenges of emphasis selection.