TextLearner at SemEval-2020 Task 10: A Contextualized Ranking System in Solving Emphasis Selection in Text

This paper describes the emphasis selection system of the team TextLearner for SemEval 2020 Task 10: Emphasis Selection For Written Text in Visual Media. The system aims to learn the emphasis selection distribution using contextual representations extracted from pre-trained language models and a two-staged ranking model. The experimental results demonstrate the strong contextual representation power of the recent advanced transformer-based language model RoBERTa, which can be exploited using a simple but effective architecture on top.


Introduction
Visual communication aims at conveying the intended information from the author to the audience with the help of visual elements. One of its essential design principles is to reduce ambiguity and increase the effectiveness of communication. Visual communication usually consists of two elements: text and image. For the text, the author often emphasizes some parts by changing their visual representation to better convey the intention. Visual communication design tools often fail to understand the meaning of the text and the user's intention: they rely solely on visual attributes when suggesting which textual parts to emphasize and how to render them, which often leads to the wrong text emphasis (Shirani et al., 2020). Therefore, a system that helps design tools understand the semantic meaning and common human interpretation of the text would better assist the user in designing effective visual communication.
SemEval 2020 Task 10: Emphasis Selection For Written Text in Visual Media is a shared task that invites participants to propose methods that can model human emphasis selection for short written English text (Shirani et al., 2020). The task poses several challenges. First, emphasis selection is highly subjective; different annotators may emphasize different parts of the text. Second, there is a lack of additional context information; the model therefore has to rely on the text alone. Third, the amount of training data is relatively small; the training set provided by the organizers contains only 2742 instances of short written text.
Our strategy for tackling these three challenges is to model the emphasis selection distribution of the annotators by combining contextual representations of the text, extracted from a pre-trained language model, with a two-staged ranking model that suggests possible tokens for emphasis. The experimental results demonstrate that our simple two-staged ranking model built on top of RoBERTa (large model, fine-tuned on MNLI) (Liu et al., 2019) outperforms the LSTM-based baseline system (Shirani et al., 2020), which indicates that transformer-based pre-trained language models such as RoBERTa have strong contextual representation power and that combining such language models with a simple ranking model is beneficial for the problem of emphasis selection.

Task Definition
The task organizers formalized this task as determining an emphasis subset S of a sequence of tokens C = [x_1, x_2, ..., x_n]. Our goal is to propose a system that generates this subset S with an emphasis score assigned to each token in it, with which we can rank the tokens and select the top K emphasized ones.
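To make the input/output format concrete, the following minimal sketch shows how an emphasis subset is obtained from per-token scores. The sentence and the scores are invented for illustration; in the real system the scores are predicted by the model.

```python
# Minimal sketch of the task format: given per-token emphasis scores,
# return the top-K tokens. Scores here are invented for illustration.
def select_top_k(tokens, scores, k):
    """Return the k tokens with the highest emphasis scores."""
    ranked = sorted(zip(tokens, scores), key=lambda p: p[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

tokens = ["never", "stop", "dreaming"]
scores = [0.31, 0.18, 0.74]
print(select_top_k(tokens, scores, 2))  # -> ['dreaming', 'never']
```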

Dataset
The SemEval-2020 Task 10 dataset is a collection of short English texts crawled from Adobe Spark and Wisdom Quotes (Shirani et al., 2019). It contains 3877 instances with a total of 44977 tokens. The task organizers randomly split 70% of the data into the training set, 10% into the development set, and the remaining 20% into the test set. The sentences in the training set contain an average of 11.8 words.
Based on the dataset statistics, we draw the following conclusions: First, the amount of training data is relatively small for training deep neural networks from scratch. Second, the short texts and the lack of additional information make it difficult for a model to capture contextual information, but they also reduce the problem of long-range dependencies. We therefore utilize a pre-trained language model to obtain contextual representations of the tokens and feed those into a ranking model that approximates the emphasis selection distribution given by the training dataset.

System
We propose a two-staged emphasis selection system. The first stage performs the initial top K word selection to pick candidate words for emphasizing. The second stage then re-ranks these K words by re-assigning an emphasis score to each of them. Figure 1 depicts the overall system architecture.

First Stage
The first stage consists of an embedding layer and a base ranking model. Facing the limited amount of training data and the lack of additional contextual information, the design of the first stage originates from the idea of utilizing a pre-trained language model to extract contextually rich representations. The base ranking model built on top of the embedding layer uses those representations to learn the emphasis selection distribution from the training data.
Given a sentence C = [x_1, x_2, ..., x_n], the system first inputs this sentence into the embedding layer to obtain a list of word embedding vectors E = [e_1, e_2, ..., e_n]. Each of these is then fed into the base ranking model, which consists of two fully connected layers:

h_b = α(W_{b,1} e_i + b_{b,1})
s_i = W_{b,2} h_b + b_{b,2}

where e_i ∈ R^d is the d-dimensional embedding vector of token x_i, h_b ∈ R^m is the hidden vector, W_{b,1} ∈ R^{m×d} and W_{b,2} ∈ R^{1×m} are the weight matrices, b_{b,1} ∈ R^m and b_{b,2} ∈ R^1 are the biases, and α is an activation function. The base ranking model predicts an emphasis probability score s_i for each word embedding. Based on these scores, the top k words are selected as the set S_k ⊆ C of candidate tokens for emphasizing.
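The base ranking model can be sketched as a two-layer feed-forward pass over a single token embedding. The dimensions (d = 8192 for concatenated RoBERTa layers, m = 8) follow the appendix; the weights below are random stand-ins, not trained parameters.

```python
import numpy as np

# Sketch of the base ranking model: two fully connected layers mapping a
# d-dimensional token embedding e_i to a scalar emphasis score s_i.
# Weights are random stand-ins, not trained parameters.
rng = np.random.default_rng(0)
d, m = 8192, 8  # embedding dim and hidden dim from the appendix

W_b1, b_b1 = rng.normal(size=(m, d)) * 0.01, np.zeros(m)
W_b2, b_b2 = rng.normal(size=(1, m)) * 0.01, np.zeros(1)

def leaky_relu(x, slope=0.01):
    # the activation alpha used in the paper
    return np.where(x > 0, x, slope * x)

def base_rank(e_i):
    h_b = leaky_relu(W_b1 @ e_i + b_b1)  # hidden vector in R^m
    return float(W_b2 @ h_b + b_b2)      # scalar emphasis score s_i

e_i = rng.normal(size=d)  # stand-in token embedding
s_i = base_rank(e_i)
```

In the full system, this scoring is applied to every token in the sentence and the k highest-scoring tokens form the candidate set S_k.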

Second Stage
The second stage performs a re-ranking of the candidate set S k by performing a pairwise comparison of all tokens x i ∈ S k . It is inspired by Eberts and Ulges (2019), in particular by their relation classifier, which is used for joint entity and relation extraction.
The system first retrieves the embedding vectors of the top words of S_k to compose a set E_k. As a second step, it creates contextual word pairs: for each pair (e_i, e_j) of word embeddings from E_k, we define three types of context vectors, the pre-context c_pre,ij (representing the tokens before the pair), the btw-context c_btw,ij (the tokens between the two words), and the post-context c_post,ij (the tokens after the pair). Concatenating the pair (e_i, e_j) and its context vectors gives a new training sample e_new,ij = [c_pre,ij, e_i, c_btw,ij, e_j, c_post,ij] for the re-ranking model. The re-ranking model consists of N fully connected layers followed by a softmax layer:

h_1 = α(W_{r,1} e_new,ij + b_{r,1})
h_l = α(W_{r,l} h_{l-1} + b_{r,l}),  l ∈ [2, N]
r = softmax(h_N)

where the input dimension of the first layer depends on which context vectors are used (see the Contextual Word Pairs experiment), h_l are the hidden vectors, and W_{r,l} and b_{r,l} are the weight matrices and biases for l ∈ [1, N]. The result vector r = [r_i>j, r_i=j, r_i<j] predicts whether word x_i has a higher emphasis probability (r_i>j), the same emphasis probability (r_i=j), or a lower emphasis probability (r_i<j) than word x_j. Finally, for a word x_j ∈ S_k, the final emphasis score is obtained by aggregating these pairwise predictions over all other candidates x_i with i ∈ top_k, where top_k is the set of indices corresponding to the candidate set S_k.
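The pairwise aggregation can be sketched as follows. The pairwise softmax outputs below are invented stand-ins for the re-ranking model's predictions, and the aggregation shown (summing each candidate's "wins" plus half of its "ties") is one plausible reading of the score calculation, not necessarily the authors' exact formula.

```python
# Sketch of the second-stage aggregation. pairwise[(i, j)] stands in for
# the re-ranking model's softmax output r = [r_i>j, r_i=j, r_i<j];
# the values are invented. The "wins + half ties" aggregation is an
# assumption, not the paper's verified formula.
top_k = [0, 1, 2]                 # indices of the candidate set S_k
pairwise = {
    (0, 1): [0.7, 0.1, 0.2],      # x_0 likely more emphasized than x_1
    (0, 2): [0.2, 0.1, 0.7],      # x_2 likely more emphasized than x_0
    (1, 2): [0.1, 0.1, 0.8],      # x_2 likely more emphasized than x_1
}

def final_score(j):
    """Aggregate all pairwise comparisons involving candidate j."""
    score = 0.0
    for (a, b), (gt, eq, lt) in pairwise.items():
        if a == j:
            score += gt + 0.5 * eq  # j appears as the left element
        elif b == j:
            score += lt + 0.5 * eq  # j appears as the right element
    return score

scores = {j: final_score(j) for j in top_k}
```

Under this toy input, candidate 2 wins both of its comparisons and receives the highest aggregated score.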

Experiments
This section describes the experiments on selecting the embedding layer for the first stage and examining the different compositions of contextual word pairs in improving the performance of the re-ranking model at the second stage.

Evaluation Metric
The task organizers provided the task-specific evaluation metric match_m (Shirani et al., 2020): for each sentence x in the test set D_test, the m words with the top emphasis probabilities according to the ground truth form the set S_m, and the m words with the top predicted probabilities form the predicted set Ŝ_m; match_m is the size of the intersection of the two sets divided by m, averaged over D_test. We used this metric as the only evaluation criterion in our experiments.
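The metric can be sketched for a single sentence as follows. The gold and predicted probabilities are invented for illustration, and the exact normalization in the official scorer (e.g. for sentences shorter than m) may differ slightly.

```python
# Sketch of the per-sentence match_m metric: overlap of the top-m
# ground-truth words and the top-m predicted words. The corpus-level
# score averages this ratio over the test set.
def match_m(gold_scores, pred_scores, m):
    idx = range(len(gold_scores))
    top = lambda s: set(sorted(idx, key=lambda i: s[i], reverse=True)[:m])
    return len(top(gold_scores) & top(pred_scores)) / min(m, len(gold_scores))

gold = [0.9, 0.1, 0.8, 0.2]    # invented ground-truth emphasis probabilities
pred = [0.7, 0.6, 0.9, 0.1]    # invented model predictions
print(match_m(gold, pred, 2))  # top-2 gold {0, 2}, top-2 pred {2, 0} -> 1.0
```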

Embedding Layer Selection
This experiment selects the best-performing pre-trained conventional word embeddings or pre-trained language model to serve as the embedding layer in the first stage.
We trained the base ranking model with each candidate as the embedding layer and compared the performances based on the match_4 scores, since we select at least 4 words for the second stage.
From Table 1, the base ranking model using GloVe pre-trained on Common Crawl performs the worst across the match_m and match_average scores. For the base ranking model using RoBERTa-large-mnli, both the concatenation of the 17th-24th layers and that of the 21st-24th layers outperform the baseline. Among those, the concatenated 17th-24th layers yield the best match_4 among all candidate pre-trained embeddings.
Compared to GloVe, RoBERTa-large-mnli can embed a word with rich contextual as well as syntactic and semantic information, which provides comprehensive representations for the base ranking model to learn emphasis selection. Therefore, we selected RoBERTa-large-mnli (17th-24th layers) as the embedding layer in the first stage of our proposed emphasis selection system.

Table 1: match_m scores for the base ranking model using candidate embeddings, trained on the training set and tested on the development set. The base ranking model using RoBERTa-large-mnli (17th-24th layers) yields the best match_2,3,4 and match_average scores.
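The layer-concatenation step can be sketched with stand-in hidden states. The array below mimics RoBERTa-large's dimensionality (24 layers, hidden size 1024) with random values; in the real system the hidden states come from the pre-trained model.

```python
import numpy as np

# Sketch of the chosen embedding layer: concatenating the hidden states
# of layers 17-24 of RoBERTa-large-mnli per token. Hidden states here are
# random stand-ins with RoBERTa-large's shape (24 layers x 1024 dims).
rng = np.random.default_rng(0)
n_tokens, n_layers, hidden = 5, 24, 1024
hidden_states = rng.normal(size=(n_layers, n_tokens, hidden))

# 1-indexed layers 17..24 correspond to the 0-indexed slice 16:24
selected = hidden_states[16:24]                 # (8, n_tokens, 1024)
embeddings = np.concatenate(selected, axis=-1)  # (n_tokens, 8 * 1024)
```

Each token thus receives an 8192-dimensional embedding, which matches the input dimension of the base ranking model described in the appendix.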

Contextual Word Pairs
In the second stage, the composition of the contextual word pairs determines the amount of contextual information given to the re-ranking model for re-ranking the candidate words from the first stage. This experiment searches for the optimal contextual word pair composition. We fixed the number of top words at K = 5 (the top K words from the first stage to be re-ranked in the second stage) and varied the composition: the surrounding context vectors (pre-context and post-context) are toggled together, and the in-between context vector independently, yielding four settings. Table 2 shows the experimental results. The proposed emphasis selection system achieves the best match_average = 0.786 with two settings: [pre-context=True, btw-context=True, post-context=True] and [pre-context=False, btw-context=True, post-context=False]. Compared to the best-performing base ranking model (RoBERTa-large-mnli, 17th-24th layers) in the first stage, the best-performing re-ranking model in the second stage improves the match_1 score by 0.056 and the match_average score by 0.016.
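The grid of tested compositions can be enumerated explicitly; this is a sketch of the experimental configurations as we read them from the description, with pre- and post-context tied together.

```python
# The four contextual word pair compositions tested in this experiment:
# the surrounding context vectors (pre and post) are toggled together,
# the in-between context vector independently.
compositions = [
    {"pre": surround, "btw": btw, "post": surround}
    for surround in (False, True)
    for btw in (False, True)
]
```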
Regarding the contribution of the context information, the re-ranking model can improve the scores even when comparing word pairs without any type of context vector involved. Including both the surrounding context vectors and the in-between context vectors, or only the in-between context vectors, slightly improves the match_m scores, whereas adding only the surrounding context vectors worsens the performance of the re-ranking model. In the post-evaluation phase, we tested our proposed emphasis selection system on the test set with the same experimental settings. As Table 3 shows, the proposed emphasis selection system with the contextual word pair setting [pre-context=False, btw-context=True, post-context=False] outperforms the baseline system across all match_m and match_average scores, with an improvement of 0.034 in match_1 and 0.028 in match_average. With our proposed system, we ranked 21st on the final leaderboard of the evaluation phase. The match_m scores in the post-evaluation phase differ slightly from those of the submitted system in the evaluation phase due to different random initialization. Figure 2 shows some example predictions of our proposed system compared to the ground truth of the development set.

Table 3: match_m scores for our proposed system (base ranking model + re-ranking model) using RoBERTa-large-mnli (17th-24th layers), trained on the training set and tested on the test set.

Conclusion
In this paper, we presented our emphasis selection system for SemEval 2020 Task 10: Emphasis Selection For Written Text in Visual Media. The system is based on an embedding layer that extracts contextual representations from the input sentence, followed by a two-staged ranking model that learns the emphasis selection distribution. The experimental results show that the transformer-based language model RoBERTa provides rich contextual representations that enable the proposed two-staged ranking system to outperform the LSTM-based baseline system (Shirani et al., 2020) and successfully cope with this task.

A Implementation Details
Following our design in section 2, we defined the base ranking model with two fully connected layers. The first layer has the input dimension corresponding to the dimension of the word vector and output dimension m = 8. The second layer has input dimension m = 8, and output dimension r = 1.
The re-ranking model at the second stage consists of four fully connected layers. The first layer has an input dimension corresponding to the contextual word pairs and output dimension of 1024. The second layer has an input dimension of 1024 and output dimension of 64. The third layer has an input dimension of 64 and output dimension of 4. The input dimension of the last layer is 4, and the output dimension is 3.
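The re-ranking model's layer dimensions can be sketched as a plain forward pass. The weights are random stand-ins, and the input dimension is reduced here for brevity; with the 17th-24th-layer RoBERTa embeddings and all three context vectors, the real input is 5 x 8192 dimensions.

```python
import numpy as np

# Sketch of the re-ranking model's architecture from the appendix:
# input (contextual word pair) -> 1024 -> 64 -> 4 -> 3, with a softmax
# over the three comparison outcomes. Weights are random stand-ins, and
# pair_dim is a reduced stand-in (real input: 5 x 8192 when all three
# context vectors are used).
rng = np.random.default_rng(0)
pair_dim = 5 * 512
dims = [pair_dim, 1024, 64, 4, 3]

layers = [(rng.normal(size=(o, i)) * 0.01, np.zeros(o))
          for i, o in zip(dims[:-1], dims[1:])]

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def rerank(e_pair):
    """Forward pass producing r = [r_i>j, r_i=j, r_i<j]."""
    h = e_pair
    for W, b in layers[:-1]:
        h = leaky_relu(W @ h + b)
    W, b = layers[-1]
    logits = W @ h + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

r = rerank(rng.normal(size=pair_dim))
```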
The activation function for both base ranking model and re-ranking model is LeakyReLU (Maas et al., 2013). When training the base ranking model, the optimizer is Adam (Kingma and Ba, 2014) with learning rate of 0.0005, and loss function is mean squared error. For the re-ranking model, we selected stochastic gradient descent (learning rate = 0.001, decay=0.0, momentum=0.9) as optimizer and categorical cross entropy as loss function. We used the training dataset provided by the task organizers to train the models and the development set for the initial evaluation. We implemented our system using Keras (2.3.1) (Chollet and others, 2015) and Flair (0.4.4) (Akbik et al., 2019).