EL-BERT at SemEval-2020 Task 10: A Multi-Embedding Ensemble Based Approach for Emphasis Selection in Visual Media

In visual media, text emphasis is the strengthening of words in a text to convey the intent of the author. Text emphasis in visual media is generally done by using different colors, backgrounds, or fonts for the text; it helps in conveying the actual meaning of the message to the readers. Emphasis selection is the task of choosing candidate words for emphasis; it helps in automatically designing posters and other media content with written text. If we consider only the text and do not know the intent, then there can be multiple valid emphasis selections. We propose the use of ensembles for emphasis selection to improve over single emphasis selection models. We show that the use of multiple embeddings enhances the results of base models. To show the efficacy of the proposed approach, we also compare our results with state-of-the-art models.


Introduction
The SemEval-2020 Task 10 (Shirani et al., 2020) challenge focuses on emphasis selection in visual media. Emphasis is the process of giving importance to some parts of communication to convey the message in a better way. It is used to draw the attention of readers to a specific section of the information and to remove ambiguity in the message. In vocal communication, emphasis is generally conveyed by stressing specific words. In visual communications like flyers, posters, and advertisements, emphasis is conveyed by using different fonts, colors, or backgrounds for the text.
Text designing systems like Adobe Spark can automatically provide template-based layouts for text. However, these algorithms generally rely on visual features of the text, such as word length, and suggest designs based on those features. Sometimes this type of method does not emphasize the proper words and might not help in conveying important information, or may even convey wrong information. In Figure 1a, we show the automatic design provided by Adobe Spark. Even though Figure 1a is aesthetically pleasing, it does not emphasize essential words and might fail to convey the message. Figure 1b, instead, uses a different layout and emphasizes vital words.
Given only the text and not the intent of the message, there can be multiple valid emphases, and different authors may prefer to emphasize different words. Therefore, we cannot use a single label to say whether a word should be emphasized or not. To tackle this, we use label distribution learning (LDL) (Gao et al., 2017). LDL assigns each word a real number representing the probability of that word being emphasized.
Our main contributions can be summarized as:
• We propose the use of different embeddings for the task of emphasis selection and show that different combinations of embeddings can improve emphasis selection.
• We show that encoding sentences and words using different embeddings (multi-embedding) introduces new information to the models and improves performance.
• We propose the use of ensemble models for emphasis selection in visual media.
• We compare our results with baselines and state-of-the-art models and also give a qualitative comparison of the different methods.

Related Work
In text data, the majority of work focuses on identifying important keywords in long texts. There are two main approaches to keyword extraction: supervised and unsupervised. Supervised methods generally treat keyword identification as a classification problem and classify each word as a keyword or not (Frank et al., 1999; Tang et al., 2004; Medelyan and Witten, 2006). Unsupervised methods usually utilize TF-IDF scores (Hasan and Ng, 2010) or clustering methods (Liu et al., 2009) for keyword identification.
Recently, Zhang et al. (2016) also proposed a model using RNNs for keyword identification. Several methods have been proposed for emphasis selection in audio. Most works use acoustic features such as loudness and pitch to detect emphasized words in audio data (Kochanski et al., 2005; Wang and Narayanan, 2007). More recently, some works predict word emphasis in text to improve text-to-speech (TTS) systems (Nakajima et al., 2014; Mass et al., 2018). Sun (2002) proposed ensemble-based models for emphasis selection in audio. Shirani et al. (2019) proposed the use of label distribution learning (LDL) for emphasis selection in short texts in visual media. Here, we show that ensembles of models trained with different embeddings perform better than the base models.

Problem Definition
Given a sentence S with tokens C = {w_1, w_2, ..., w_n}, where 1 < |S| < n, emphasis selection is the task of finding candidate words in C to emphasize in order to convey the meaning of the message in an effective manner.

Label distribution learning (LDL)
We use the "IO" scheme for labels, where "I" represents emphasis and "O" represents non-emphasis. Label distribution learning is then the task of learning a probability d_w^y for each word w ∈ C, denoting the degree to which word w belongs to label y, where d_w^y ∈ [0, 1] and Σ_y d_w^y = 1.

Dataset
The dataset consists of two sub-datasets: the Spark dataset and the Quotes dataset.
Spark dataset: This dataset is collected from Adobe Spark and is a collection of short texts from flyers, posters, and advertisements; it includes 1,200 instances.
Quotes dataset: This dataset is a collection of quotes from well-known authors collected from Wisdom Quotes; it contains 2,718 instances.
For further analysis, the dataset is split into training, test, and development sets with 70%, 20%, and 10% of the samples, respectively.

Model
We use the DL-BiLSTM model proposed by Shirani et al. (2019) as our base model; to the best of our knowledge, this model provides state-of-the-art results for emphasis selection in visual media. Each word of the input sequence is represented with a word embedding. Part-of-speech (POS) and sentence embeddings are added as additional information. Two bidirectional LSTM (Hochreiter and Schmidhuber, 1997) layers are used to capture the sequence information. The last layer of the model is fully connected and assigns probabilities using the hidden states of the LSTM. Figure 2 shows the overall architecture of the DL-BiLSTM model.
We have used different sets of embeddings to train the models. WordBERT and SentBERT denote the word and sentence embeddings generated using pre-trained BERT (Devlin et al., 2019), respectively. POSEmbd denotes the one-hot-encoded POS tag. In Model-4, Model-5, and Model-6, we use ELMo (Peters et al., 2018) as the word embedding.
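Since the embeddings are concatenated before the BiLSTM, the BiLSTM input size is just the sum of the component dimensions (the idea behind Table 1). The sketch below is illustrative: the 1024/2048 dimensions match those stated in the experimental settings, but the 12-tag POS size (NLTK's universal tagset) and which embeddings each model combines are assumptions for the example.

```python
# Hypothetical bookkeeping for BiLSTM input sizes: the input to the
# first BiLSTM layer is the concatenation of the chosen embeddings.
EMB_DIMS = {
    "WordBERT": 1024,  # pre-trained BERT word embedding
    "SentBERT": 1024,  # pre-trained BERT sentence embedding
    "WordELMo": 2048,  # ELMo word embedding
    "POSEmbd": 12,     # one-hot universal POS tag (12 tags, assumption)
}

def bilstm_input_size(embedding_set):
    """Input dimension of the first BiLSTM layer for a given embedding set."""
    return sum(EMB_DIMS[name] for name in embedding_set)

elmo_only = bilstm_input_size(["WordELMo"])
elmo_full = bilstm_input_size(["WordELMo", "SentBERT", "POSEmbd"])
```

This makes explicit why each row of Table 1 needs its own BiLSTM size: adding SentBERT or POSEmbd grows the input vector rather than changing the sequence length.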

Baseline Models
Here, we discuss the baseline models and their implementation. SL-BiLSTM: This model has the same architecture as DL-BiLSTM, but the distribution is mapped to binary labels. Also, for training the SL-BiLSTM model (Shirani et al., 2019), we use negative log-likelihood loss in place of the KL-divergence loss (Kullback and Leibler, 1951).
CRF (Conditional Random Fields): Similar to Shirani et al. (2019), this model is trained with handcrafted features such as word identity, word suffix, word shape, and the POS tags of the current and nearby words.

Experimental Settings
We have used the BERT-as-Service framework and 1024-dimensional pre-trained cased BERT embeddings for encoding the words and sentences. We have used the universal POS tagger from NLTK to get the POS tags for sentences, and then used a bag-of-words embedding to encode the POS tag information. We have also used 2048-dimensional ELMo embeddings for encoding word information.
The size of the BiLSTM layer in the model depends on the set of embeddings used; Table 1 lists the BiLSTM layer sizes for the different sets of embeddings. All the proposed models are trained for 10 epochs with Adam (Kingma and Ba, 2015) as the optimizer and a learning rate of 0.001. The baseline models are trained for 160 epochs. To prevent the models from over-fitting, we also use two dropout layers with a dropout rate of 0.5 in the sequence and inference layers.
For training the DL-BiLSTM model we use the KL-divergence loss, and for training the SL-BiLSTM model we use the negative log-likelihood loss.
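The per-word KL-divergence objective can be sketched as follows; this is a minimal stdlib illustration of KL(target || predicted) over the two "IO" labels, not the paper's actual training code, and the epsilon smoothing and toy numbers are assumptions.

```python
import math

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) between two discrete label distributions."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted))

# Per-word loss for the "IO" scheme: the target distribution comes
# from the annotators, the prediction from the model (toy values).
target = [0.8, 0.2]     # [d_w^I, d_w^O] from annotations
predicted = [0.6, 0.4]  # model's softmax output
loss = kl_divergence(target, predicted)
```

The loss is zero when the predicted distribution matches the annotation distribution exactly and grows as the two diverge, which is what pushes the model toward the full distribution rather than a hard label.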
We have also used different ensembles of the above models; the final score of an ensemble is the average of the scores produced by its member models.

Results
As proposed by Shirani et al. (2019), we use the Match_m score to evaluate our models. Table 2 lists the results of the base, baseline, state-of-the-art, and top five ensemble models. Figure 3 shows a qualitative comparison of ensembles and base models. It can be seen from Figures 3a and 3b that the single model and the ensemble model produce similar results in most cases. In some cases (examples 1 and 4), the results of the ensemble model are closer to the target scores shown in Figure 3c. From Table 2, we notice that the addition of POSEmbd improves the scores for models with ELMo word embeddings (Model-4, Model-5, Model-6) but does not improve the results for models with BERT word embeddings (Model-1, Model-2, Model-3). This suggests that POSEmbd does not add information beyond what the BERT embedding already contains; we can infer that the BERT embedding is inherently able to capture POS information.
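A single-instance sketch of the Match_m score (Shirani et al., 2019) is shown below: the top-m words ranked by predicted score are compared with the top-m words ranked by ground-truth score, and the overlap is normalized. This is a simplified formulation assuming ties are broken by word index; the corpus-level score would be the mean over instances, and the example scores are invented.

```python
def match_m(pred, gold, m):
    """|top-m(pred) ∩ top-m(gold)| / min(m, #words) for one instance."""
    def top(scores):
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        return set(order[:m])
    return len(top(pred) & top(gold)) / min(m, len(pred))

# Toy four-word instance: predicted vs. annotator emphasis scores.
pred = [0.9, 0.1, 0.8, 0.3]
gold = [0.7, 0.2, 0.9, 0.1]
```

Here match_m(pred, gold, 1) is 0.0 because the single top-ranked word differs (word 0 vs. word 2), while match_m(pred, gold, 2) is 1.0 because the top-2 sets coincide; evaluating at several values of m rewards models that get the overall ranking right even when the single most-emphasized word is debatable.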

Analysis
We have also noticed that the use of BERT sentence embeddings (SentBERT) improves the results for models with ELMo word embeddings (Model-4, Model-5, Model-6) but does not improve the results for models with BERT word embeddings (Model-1, Model-2, Model-3). This shows the ability of the BiLSTM model to encode the sequence information correctly. The improvement in the ELMo-based models shows that combining information from different embedding spaces introduces complementary information and helps improve model performance.
We have also noticed that ensemble models perform better than single base models in almost all cases. We further compare our results with the other baseline models and with state-of-the-art models.

Conclusion & Future Work
We show that the use of multiple embeddings encoding different information (i.e., words and sentences) can help improve model performance. We also show that ensemble models perform better than single base models for emphasis selection. In the future, we will work on ensembles with different model architectures and methods for emphasis selection. We will also work on generating task-specific word embeddings for emphasis selection.