YNU-HPCC at SemEval-2020 Task 10: Using a Multi-granularity Ordinal Classification of the BiLSTM Model for Emphasis Selection

In this study, we propose a multi-granularity ordinal classification method to address the problem of emphasis selection. In detail, the word embedding is learned from Embeddings from Language Model (ELMO) to extract feature vector representation. Then, the ordinal classifica-tions are implemented on four different multi-granularities to approximate the continuous em-phasize values. Comparative experiments were conducted to compare the model with baseline in which the problem is transformed to label distribution problem.


Introduction
Short texts have a great impact on visual communication. They are usually designed to grab a viewer's attention and convey a message efficiently. For text, word emphasis is used to capture the intent better, removing the ambiguity that may exist in some plain texts. Word emphasis can clarify or even change the meaning of a sentence by drawing attention to some specific information, and it can be done with colors, backgrounds, fonts, italics, or boldface. The purpose of this task is to design automatic methods for emphasis selection, i.e., choosing candidates for emphasis in short written text, to enable automated design assistance in authoring (Shirani et al., 2020). In the task definition, given a sequence of words or tokens, there is a subset of words that are good candidates to emphasize, and the word probability represents the degree of emphasis, which needs to be predicted.
In a previous work, a rule-based approach is used (Widera et al., 1997). Later, Text-based features such as part-of-speech (POS), information content, position in the sentence and other information was adopted (Volker Strom, 2007). Some methods have been proposed for predicting emphasized words for expressive Text-To-Speech (TTS) based on a deep neural network (DNN) (Mass et al., 2018;Rosenberg et al., 2015). To address the multiple annotators problem, a majority voting ensemble has been used to transform the problem into single-label learning (Laws et al., 2011). Shirani (2019) suggest that the task should be transformed into label distribution learning (LDL). The main difference between such a method and previous works is that the label is not a single dispersed value, but rather a continuous label distribution. Therefore, the model can fit more accurately to a real label instead of having the distortion of transforming a probability into a single label.
The essence of the problem is to use a sequence labeling model to predict a probability value for each word. The existing methods mainly applied either a softmax or conditional random field (CRF) (Huang et al., 2015;Lafferty et al., 2001) layer to predict labels, which cannot be used in this task. Therefore, using a linear decoder to output continuous linear values or using sigmoid output nonlinear probability values as  Figure 1: Overview of our model. the result have been considered, but the performance was not suitable.
In this paper, we propose a multi-granularity ordinal classification method for the task. This involves an ensemble of four different models with different levels of granularity, including Binary Classification, Ternary Classification, Quaternary Classification, and Quintuple Classification. Specifically, it divides the probability between (0,1) into different parts as category labels. For the Binary Classification, the probability (0,1) is divided into two categories, i.e., (0-0.5) and (0.5-1). For the Ternary Classification, the probability (0,1) is divided into three categories, i.e., (0-0.33), (0.33-66), and (0.66-1). For the Quaternary Classification, the probability (0,1) is divided into four categories, i.e., (0-0.25), (0.25-0.5), (0.5-0.75), and (0.75-1). The Quintuple Classification divides (0,1) into five categories, i.e., (0-0.2), (0.2-0.4), (0.4-0.6), (0.6-0.8), and (0.8-1). The reason why we use multi-granularity ordinal classification is that Binary Classification ignores the differences between word emphasis probability. For example, if two words' emphasis probabilities are 0.3 and 0.4, they belong to the same class for Binary Classification, while they belong to different classes for ternary classification. The finer the granularity of the division is, the more the model can learn the difference between key words. However, there is still a balance between granularity and the number of classifications, that is, the performance of multi-granularity classification will decrease as the granularity increases. This approach can use different granularities of information to improve performance.
The rest of the paper is organized as follows. In Section 2, we describe the multi-granularity ordinal classification of the bidirectional long short-term memory (BiLSTM) model in detail. The comparative experimental results are presented in Section 3. Conclusions are drawn in Section 4.

Multi-granularity Ordinal Classification of BiLSTM Model
We use an ensemble of multi-granularity ordinal classification of the BiLSTM model to learn emphasis patterns. Figure 1 shows the overall architecture of the proposed model. First, we fine-grain words' emphasis probabilities, and the granularity labels (0, 1, 2, 3, 4) denote the emphasis degrees from high to low. After we load the test into the word embedding layer, a BiLSTM is used to obtain word annotations that summarize the information from both directions and to learn more abstract features. A fully connected softmax layer is used to output a probability distribution over all classes from the BiLSTM output, which can be weighted into a single-value probability. Finally, we use a voting ensemble to obtain the final probability.

Multi-granularity Ordinal Classification
Multi-granularity ordinal classification is widely applied in many domains. Multi-granularity classification is used to analyze visual objects from subordinate categories, e.g., species of birds or models of cars in computer vision (Wei et al., 2019;Berg et al., 2014). Multi-granularity classification is also used in many NLP tasks such as sentiment classification (Hao et al., 2019), neural machine translation (Mehri and Eskénazi, 2019), Dialog (Mehri and Eskénazi, 2019)and named entity recognition (Mai et al., 2018).
In this task, there are two methods of fine-graining, ROC-AUC and ORDER. ROC-AUC: the probability of each word between (0,1) is divided into different parts as category labels  (0, 1, 2, 3, 4), representing the degree of emphasis. Experimental results show that when the threshold is set to 0.35, the model performs best in Binary Classification. ORDER: First, we arrange the words according to probability from large to small, and then take the previous k words into a category, after which we take the remaining m words into a category, and so on until the specified granularity is obtained. We find that when k is set to 4, the model performs best in Binary Classification. Then, we compare ROC-AUC and ORDER with respect to Binary Classification. The comparison of ROC-AUC and ORDER with respect to Binary Classification is shown in Figure 2, where the score classes are defined in section 3.3.
As indicated, ORDER performed better than ROC-AUC with respect to Binary Classification. Therefore, we use ORDER as our fine-graining method. Experimental results show that different partition sizes behave differently. For Binary Classification, we take the top four in probability as label 0, and the rest as label 1. For Ternary Classification, we take the top two in probability as label 0, the third to fourth as label 1, and the remaining as label 2. The division process for Ternary Classification is shown in Figure 3. For Quaternary Classification, we take the top two in probability as label 0, the third through fourth as label 1, the fifth through sixth as label 2, and the rest as label 3. For Quintuple Classification, we take the top two in probability as label 0, the third through fourth as label 1, the fifth through sixth as label 2, the seventh as label 3, and the remaining as label 4.

Word Embedding
To capture semantic and syntactic information of a word, word embedding has been widely used in the NLP domain (Lai et al., 2016).The 1024-dimensional ELMo word vector is used in the first layer of the model, and the word vector matrix is loaded into the embedding layer. Also, the max sequence length is set to 38 because 99% of sentence lengths are shorter than this value.

Bi-directional Long Short-Term Memory
Bi-directional Long Short-Term Memory (BiLSTM) is based on LSTM and is a special kind of RNN (Hochreiter, 1997), which is capable of learning long-term dependencies. Employing two BiLSTM layers helps to build a deeper feature extractor. We also find that having more than two stacked LSTM layers does not help the performance, as the model becomes too complicated.

Ensemble
The output of full connected layer is a label distribution. The output must be weighted and summed into a single value p k (k = 2, 3, 4, 5). For Binary Classification, the output of full connected layer is [a, b], where a represents the probability of label 0, b represents the probability of label 1. Compared with label 1, label 0 represents a higher probability of emphasis. So, the weighted single value p 2 = 0.8 * a + 0.2 * b. For Ternary Classification, the output of full connected layer is [a, b, c], and the weighted single value p 3 = 0.44 * a + 0.33 * b + 0.23 * c. For Quaternary Classification, the output of full connected layer is [a, b, c, d], and the weighted single value p 4 = 0.4 * a + 0.3 * b + 0.2 * c + 0.1 * d. For Quintuple Classification, the output of full connected layer is [a, b, c, d, e], and the weighted single value p 5 = 0.4 * a + 0.3 * b + 0.2 * c + 0.1 * d + 0 * e. Finally, The ensemble output p = 0.25 * p 2 + 0.25 * p 3 + 0.25 * p 4 + 0.25 * p 5 .

Experiments and Evaluation
Experiments were conducted to evaluate the proposed model. We report the results of the official review. The details of the experiment are described as follows.

Data Preprocessing
The data that the organizers of the competition provided contained 6 columns, including word ID, word, begin-inside-outside (BIO) annotations, BIO frequencies, emphasis probability, and POS tags. We only used the word and corresponding emphasis probability. Then, we transformed the probability into labels using a multi-granularity ordinal classification. We obtained the max length of sentences to pad the embedding vectors and label vectors.

Implementation Details
This experiment used Keras in TensorFlow2.1. We used ELMo pretrained word vectors in tensorflow-hub. The hyperparameters were tuned to the performance of training and dev dataset using the given metric function. Different classifiers may have their own optimization parameters. For all classifiers, the learning rate is 0.001 and epsilon is 1 × 10 −6 . The optimizer is RMSprop (Ruder, 2016) and loss function is mean squared error. The activation of the fully connected layer is softmax. Additional best-tuned parameters are shown in Table 1

Evaluation Metrics
For evaluation, M atch m is the evaluation metric for this task: For each instance X in the test set Dtest, a set S where score 0 is the average of score 1 , score 2 , score 3 and score 4 .

Results and Discussion
We use the DL-BiLSTM+ELMo model (Shirani et al., 2019) as the baseline. The dev dataset experiment results are shown in Table 2. The results of test data in the post-evaluation period are 0.607, 0.731, 0.802, which is lower than the baseline of test data:0.608, 0.737, 0.807, 0.849, 0.75. Accroding to the dev dataset experiment results, we find that Ternary Classification performs the best, and Binary classifier performs the worst. The ensemble results are higher than any other single classification. The experimental results do not indicate that increasing the granularity of the division leads to better results. This is due to the imbalance of the data; In fact, the performance of the classifier decreases as the granularity increases.

Conclusion
In this paper, we describe a task system that we submitted to SemEval-2020 for emphasis selection. We propose a Multi-granularity Ordinal Classification of the BiLSTM model. In future work, we will attempt to generalize models with better capabilities.