Randomseed19 at SemEval-2020 Task 10: Emphasis Selection for Written Text in Visual Media

This paper describes our approach to emphasis selection for written text in visual media as a solution for SemEval 2020 Task 10. We used an ensemble of several different Transformer-based models and cast the task as a sequence labeling problem with two tags: ‘I’ as ‘emphasized’ and ‘O’ as ‘non-emphasized’ for each token in the text.


Introduction
The purpose of SemEval 2020 Task 10 is to design automatic methods for emphasis selection, i.e. choosing candidates for emphasis in short written text, to enable automated design assistance in authoring. The given task differs from related ones in that word emphasis patterns are person- and domain-specific, so different selections can be valid depending on the audience and the intent. Examples of different emphases are presented in Figure 1.
The dataset for this shared task includes short sentences from the following two sources:
1. Spark dataset: a collection of short texts from Adobe Spark, covering a variety of subjects featured in flyers, posters, advertisements or motivational memes on social media.
2. Quotes dataset: a collection of quotes from well-known authors, collected from Wisdom Quotes.
The data was labeled by 9 annotators; for further details please refer to the task paper (Shirani et al., 2020). Each token in the dataset is labeled using the BIO notation. The main target of the task is to predict the emphasis probability of each token in a given sentence, calculated as the share of 'B' and 'I' labels among the 9 annotators. An example of training and validation data is presented in Table 1. We approach the task as a sequence labeling problem with two tags: 'I' for 'emphasized' (the original 'B' was transformed to 'I') and 'O' for 'non-emphasized' for each token in the text. The probability of the predicted 'I' tag was then taken as the emphasis probability.
We considered two ways to convert the given dataset into a sequence labeling dataset:
1. Separate annotations for each sentence (9 examples per given sentence).
2. Majority-vote annotation for each sentence (1 example per given sentence); e.g. in Table 1 all annotations for the token 'evolve', B|I|I|O|B|B|O|O|O, were converted to 'I' (emphasized), since the 5 'B' and 'I' labels outweigh the 4 'O' labels.
Models were trained on both types of dataset representation (separate annotations or majority-vote annotation).
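A minimal sketch of this conversion (the helper name is ours, not part of the task code):

    # Map one token's nine per-annotator BIO tags to (a) the gold emphasis
    # probability and (b) a majority-vote sequence-labeling tag.
    def convert_token_annotations(labels):
        emphasized = sum(1 for label in labels if label in ('B', 'I'))
        probability = emphasized / len(labels)        # share of 'B'/'I' labels
        majority_tag = 'I' if emphasized > len(labels) - emphasized else 'O'
        return probability, majority_tag

    # The 'evolve' example from Table 1: 5 'B'/'I' labels vs. 4 'O' labels.
    print(convert_token_annotations('B|I|I|O|B|B|O|O|O'.split('|')))  # ≈ (0.556, 'I')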

Models
The main approach is quite simple: a pretrained Transformer-based model (Vaswani et al., 2017) with a token classification head on top (a linear layer over the hidden-state outputs).
Given a sequence of tokens, the model labels each token with its appropriate class ('I' or 'O'). The emphasis probability of each token was calculated as the softmax over that token's logits, taking the value for the label 'I' as the result. For words that were split into several word pieces, the word-level probability was taken as the average over its pieces, as in the sketch below. The model architecture is shown in Figure 2.
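A minimal inference sketch of this computation with the transformers library; "bert-base-cased" here is a placeholder for a fine-tuned two-label checkpoint ('O' = 0, 'I' = 1), not our released model:

    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-cased", num_labels=2)
    model.eval()

    words = ["You", "can", "evolve"]
    encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits[0]           # (sequence_length, 2)
    piece_probs = torch.softmax(logits, dim=-1)[:, 1]  # P('I') per word piece

    # Average the 'I' probabilities of the word pieces belonging to each word.
    per_word = {}
    for piece_idx, word_id in enumerate(encoding.word_ids(0)):
        if word_id is not None:                        # skip [CLS]/[SEP]
            per_word.setdefault(word_id, []).append(piece_probs[piece_idx].item())
    emphasis = [sum(p) / len(p) for _, p in sorted(per_word.items())]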
Three types of pretrained models were used: BERT, RoBERTa and XLNet. The final evaluation metric is the mean of Match_m over m ∈ {1, 2, 3, 4}.
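Our reading of the Match_m score from the task paper, sketched below: for each sentence, the m tokens with the highest gold probabilities are compared with the m tokens with the highest predicted probabilities, and the overlap is normalized by min(m, sentence length); ties are broken arbitrarily in this sketch.

    def match_m(gold_probs, pred_probs, m):
        # Top-m token indices by probability; ties broken by index order.
        top = lambda probs: set(sorted(range(len(probs)), key=lambda i: -probs[i])[:m])
        return len(top(gold_probs) & top(pred_probs)) / min(m, len(gold_probs))

    def final_score(gold, pred):
        # Corpus-level Match_m averages per-sentence scores; the final
        # evaluation metric averages Match_1 through Match_4.
        per_m = [sum(match_m(g, p, m) for g, p in zip(gold, pred)) / len(gold)
                 for m in range(1, 5)]
        return sum(per_m) / 4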

Experimental setup
We used the dataset split scheme provided by the organizers: the data contained 3,877 sentences, split into 70% training, 10% validation and 20% test sets. All models were trained only on the given training part of the dataset.
For each set of hyperparameters, the model was trained for up to 10 epochs with 4 validations per epoch. We used linear learning-rate scheduling with warm-up (the warm-up fraction was set to 0.05 for all models) and the AdamW optimizer (Loshchilov and Hutter, 2017).
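A minimal sketch of this optimization setup, reusing the model from the previous snippet; the learning rate and step counts are illustrative, not our tuned values:

    import torch
    from transformers import get_linear_schedule_with_warmup

    steps_per_epoch = 100                              # e.g. len(train_loader)
    total_steps = 10 * steps_per_epoch                 # up to 10 epochs
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.05 * total_steps),      # warm-up fraction 0.05
        num_training_steps=total_steps,
    )
    # During training, call scheduler.step() after each optimizer.step().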

Test results
For the test data we submitted the results of 4 models:
- the best single model based on validation loss;
- an ensemble of the best single models of each type based on validation loss;
- the best single model based on validation score;
- an ensemble of the best single models of each type based on validation score.
An ensemble prediction is simply the average of the predictions of the models that compose it. The best test score was achieved by the ensemble of single models chosen by the validation score. The ensemble and its components' parameters and results are presented in Table 2.

Table 2: Parameters and scores of best models selected by validation score

The ensemble of single models chosen by validation loss, its components' parameters and results, are presented in Table 3.

Table 3: Parameters and scores of best models selected by validation loss

As can be seen from the tables above, the models trained on separate annotations showed better scores than the models trained on majority-vote annotations.
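The ensembling step itself is trivial; a minimal sketch (the probability values below are made up for illustration):

    import numpy as np

    # The ensemble prediction is the plain mean of the members' per-token
    # emphasis probabilities.
    def ensemble_probs(member_probs):
        return np.mean(np.asarray(member_probs), axis=0)

    # e.g. hypothetical BERT, RoBERTa and XLNet outputs for a 3-token sentence:
    print(ensemble_probs([[0.9, 0.1, 0.4],
                          [0.8, 0.2, 0.5],
                          [0.7, 0.3, 0.6]]))  # -> [0.8 0.2 0.5]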
A part of the competition leaderboard for the evaluation phase is presented in Table 4; our model took 4th place. The 'Mean Score' column contains the final evaluation metric, the mean of Match_m over m ∈ {1, 2, 3, 4}. The model's place under each evaluation metric is given in parentheses.
Our model showed good results for the Match_3 and Match_4 evaluation metrics (2nd place) and worse results for Match_1 and Match_2 (8th and 4th place, respectively). This means the model does not predict the one or two most emphasized tokens very well, but works satisfactorily at predicting the top three and four. The model performed better than the baseline DL-BiLSTM+ELMo model provided by the organizers (Shirani et al., 2019).

Prediction examples
Below are prediction examples for the two types of model checkpoints: those chosen by the best validation score and those chosen by the best validation loss. Models selected by validation loss predict more extreme values (values near zero for tokens unlikely to be emphasized and higher values for tokens likely to be emphasized) compared to models selected by validation score.

Future Improvements
We did not consider the following things that might improve the model:
- Other Transformer-based models, e.g. ERNIE (Sun et al., 2019).
- The sentence source (Quotes dataset or Spark dataset) as an additional feature.
- Part-of-speech tags for tokens, which can be obtained with one of the existing models and libraries such as nltk or spacy (see the sketch below).
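A minimal sketch of the last idea, using nltk (spacy would work equally well); the tagger model must be downloaded once:

    import nltk

    # Tag each token of a sentence with its part of speech.
    nltk.download("averaged_perceptron_tagger", quiet=True)
    print(nltk.pos_tag(["You", "can", "evolve"]))
    # -> [('You', 'PRP'), ('can', 'MD'), ('evolve', 'VB')]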

Conclusion
In this paper we presented the system we used in the SemEval-2020 Emphasis Selection for Written Text in Visual Media competition. The proposed approach casts the task as sequence labeling and relies on pretrained Transformer-based models. The final model was an ensemble of 3 models (BERT, RoBERTa, XLNet), each trained with the best hyperparameters according to the validation score. The model showed good results and took 4th place.