LAST at SemEval-2020 Task 10: Finding Tokens to Emphasise in Short Written Texts with Precomputed Embedding Models and LightGBM

To select the tokens to be emphasised in short texts, a system mainly based on precomputed embedding models, such as BERT and ELMo, and on LightGBM is proposed. Its performance is low. Additional analyses suggest that its effectiveness is poor at predicting the highest emphasis scores, although these are the most important for the challenge, and that it is very sensitive to the specific instances provided during learning.


Data and Evaluation Settings
The materials for this task consisted of 3,876 brief pieces of text (or items) of 2 to 38 words. The challenge organizers recruited raters through Amazon Mechanical Turk to annotate these items. For each item, nine raters had to decide whether each token should be emphasized, a difficult decision for human judges since Shirani et al. (2019) reported an agreement (Fleiss' kappa) below 0.65. The items were randomly split into three sets (see Shirani et al. (2020) for details): the training set (Train: 2,741 items and 32,399 tokens), the development set (Dev: 392 items and 4,385 tokens) and the test set (Test: 743 items and 8,192 tokens). The true labels for the Train and Dev sets were available from the beginning of the training phase, while those for the Test set still remain hidden.
Systems participating in this shared task had to submit, for each token of each item, a predicted emphasis score. Following Shirani et al. (2019), the task organizers decided to assess performance with an ad hoc measure called Match_m, roughly defined in Shirani et al. (2019) but implemented in a python function. It assesses the accuracy with which the tokens with the highest emphasis scores according to the submission are also those with the highest emphasis scores in the gold standard. This calculation is carried out successively for the one, two, three and four tokens with the highest scores in the gold standard. The final Match_m for a system is the unweighted average over all items of the average of the four Match_m scores. Being an average of accuracy scores, its maximum value is 1. The minimum value for a given item depends both on its length in tokens and on the number of ties among the highest scores. An important property of this evaluation measure is that it is computed per item on ranked data. It follows that it treats in the same way two tokens ranked first in their respective items, even when one was selected by all the annotators and the other by fewer than half, although identifying the former can be considered more important.
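The exact implementation, including the handling of ties, is defined by the organizers' python function; a minimal sketch of the measure as described above, ignoring tie handling and assuming the scores are given as one array per item, could look as follows (function names are illustrative):

```python
import numpy as np

def match_m(gold_scores, pred_scores, m):
    """Average overlap between the m tokens ranked highest in the gold
    standard and the m tokens ranked highest in the submission."""
    per_item = []
    for gold, pred in zip(gold_scores, pred_scores):
        k = min(m, len(gold))
        top_gold = set(np.argsort(gold)[::-1][:k])   # indices of the top-k gold tokens
        top_pred = set(np.argsort(pred)[::-1][:k])   # indices of the top-k predicted tokens
        per_item.append(len(top_gold & top_pred) / k)
    return float(np.mean(per_item))

def final_score(gold_scores, pred_scores):
    # unweighted average of Match_1 to Match_4 over all items
    return float(np.mean([match_m(gold_scores, pred_scores, m) for m in range(1, 5)]))
```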

System Overview
The proposed approach treats the prediction of token emphasis scores as a regression problem and uses LightGBM (Ke et al., 2017) to perform this prediction. The first analyses carried out indicated that using classical text-categorization features such as token, lemma and POS-tag n-grams fell far short of the effectiveness of the approach proposed by Shirani et al. (2019), which the task organizers chose as the baseline. One potential explanation is that the majority of tokens in the materials appear only once. To take this difficulty into account, I chose to represent each token by means of precomputed embedding models and to use the corresponding vectors as features for LightGBM. The expectation was that the embeddings would make rare tokens similar to more frequent ones in the training materials. The first analyses carried out with this approach showed that adding the n-gram features tested first to these embedding features was not useful, so I decided to focus on the embeddings. However, a limitation of this approach is that no information is extracted from the context (i.e., the whole item). An extended version of the system was therefore developed: it takes as input the emphasis score predicted by the base system, to which contextual features are added, some of them obtained by processing the items with Stanford CoreNLP (Manning et al., 2014).
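A minimal sketch of the resulting regression setup, assuming the per-token embedding features and emphasis scores have already been assembled into arrays (the variable names and placeholder data are illustrative, not the actual pipeline):

```python
import numpy as np
import lightgbm as lgb

# X_train: one row per token, columns = precomputed embedding dimensions
# y_train: emphasis score of each token (number of annotators selecting it)
X_train = np.random.rand(1000, 2048)            # placeholder feature matrix
y_train = np.random.randint(0, 10, size=1000)   # placeholder scores
X_new = np.random.rand(200, 2048)

booster = lgb.train({"objective": "regression"},
                    lgb.Dataset(X_train, label=y_train))
predicted_emphasis = booster.predict(X_new)     # one predicted score per token
```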

Procedure to Build the Systems
With the exception of obtaining the embeddings themselves and of the Stanford CoreNLP annotations, all processing steps were performed by means of a series of custom SAS programs running in SAS University Edition (freely available for research at www.sas.com/en_us/software/university-edition.html). All the predictive models were built with LightGBM. For developing the systems and fine-tuning the parameters, a 7-fold cross-validation (CV) procedure based on the combined Train and Dev sets was used.
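A sketch of such a cross-validation loop is given below. It assumes that folds are formed at the item level, so that all tokens of an item fall in the same fold (this grouping is an assumption consistent with the analysis of repeated items reported later); X, y and item_ids are hypothetical arrays.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GroupKFold

def run_cv(X, y, item_ids, params, n_splits=7):
    """Out-of-fold predictions for a 7-fold CV grouped by item."""
    oof_pred = np.zeros(len(y))
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=item_ids):
        booster = lgb.train(params, lgb.Dataset(X[train_idx], label=y[train_idx]))
        oof_pred[test_idx] = booster.predict(X[test_idx])
    return oof_pred
```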

Pretrained Embedding Models and Features
Two pretrained embedding models were used. The first was the 24 × 1024-dim uncased BERT embeddings (Devlin et al., 2019), more precisely wwm_uncased_L-24_H-1024_A-16, obtained by means of bert-as-service (Xiao, 2018). Since tests carried out with the pretokenized version showed that a large proportion of the tokens were unknown to BERT, I let BERT tokenize the items, adding up the embeddings of several BERT tokens when necessary (e.g., un + ##lea + ##rn + ##ing to get the layers for unlearning). The second was the pretrained ELMo model (Peters et al., 2018), applied without further learning and item by item to the tokens provided in the original materials, giving rise to the 1024-dim ELMo embeddings.
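The system obtained these layers through bert-as-service; the snippet below illustrates the same wordpiece-summing idea with the transformers library instead, which is not the tool actually used but makes the alignment between original tokens and BERT wordpieces explicit:

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased-whole-word-masking")
model = BertModel.from_pretrained("bert-large-uncased-whole-word-masking",
                                  output_hidden_states=True)

tokens = ["unlearning", "is", "hard"]          # original tokens of an item
enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    layer22 = model(**enc).hidden_states[22]   # (1, n_wordpieces, 1024)

word_ids = enc.word_ids()                      # wordpiece -> original token index
vectors = []
for i in range(len(tokens)):
    piece_idx = [j for j, w in enumerate(word_ids) if w == i]
    vectors.append(layer22[0, piece_idx].sum(dim=0))   # sum the wordpiece vectors
```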
The analyses carried out during system development led to first selecting the 22nd layer of BERT (B22), to which were added the 1024 ELMo dimensions, the 12th layer of BERT (B12), and 1024 features obtained by summing, dimension by dimension, the layer-22 vectors of the preceding token and of the target token. A binary feature was used to indicate whether the BERT layers were computed by adding several BERT tokens, that is, as explained just above, when it was necessary to add the embeddings of several BERT tokens to obtain the embeddings of an original token.
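A minimal sketch of the final feature vector of one token under these choices (the vector names are illustrative, and b22_prev is assumed to be a zero vector for the first token of an item):

```python
import numpy as np

def token_features(b22, elmo, b12, b22_prev, several_bert_tokens):
    """Concatenate the selected embedding features for one token."""
    return np.concatenate([
        b22,                           # 22nd BERT layer of the target token
        elmo,                          # 1024-dim ELMo vector of the target token
        b12,                           # 12th BERT layer of the target token
        b22_prev + b22,                # element-wise sum of preceding and target layer-22 vectors
        [float(several_bert_tokens)],  # binary flag: several BERT tokens were summed
    ])
```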

Contextual Features
Two sets of contextual features were used. The first set (called LR) was obtained by encoding, for each item, the lemma, POS tag, dependency governor and entity code produced by Stanford CoreNLP for all the tokens to the left and to the right of the target token, with a weight proportional to the distance between the target token and each of these tokens.
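A hedged sketch of this encoding is given below; the exact attribute names and weighting function are not specified here, so a simple linear weight following the wording above is assumed, and item_tokens is a hypothetical list of per-token CoreNLP annotations:

```python
from collections import defaultdict

def lr_features(item_tokens, target_index):
    """Sparse bag of left/right contextual features for one target token."""
    feats = defaultdict(float)
    for i, tok in enumerate(item_tokens):
        if i == target_index:
            continue
        side = "L" if i < target_index else "R"
        weight = abs(i - target_index)     # assumption: weight as a function of the distance
        for attr in ("lemma", "pos", "governor", "ner"):
            feats[f"{side}:{attr}={tok[attr]}"] += weight
    return feats
```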
The second set of contextual features (called REP) has its origin in the fact that a significant number of items contain repeated lemmas and that this seems to affect the position of the emphasis. For example, in the item Positive mind. Positive vibes. Positive life., the emphasis seems to increase with the repetitions. This type of information was encoded by applying the following procedure to items in which at least one lemma is repeated: for each pair of repeated lemmas, the position of every token in the item relative to that pair is one-hot encoded according to the following cases: is it before the first occurrence of the repeated lemma? is it the first occurrence? is it between the first and the second occurrence? is it the second occurrence? is it after the second occurrence?
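A minimal sketch of this one-hot encoding for a lemma occurring at positions first_pos and second_pos in an item of n_tokens tokens (how lemmas repeated more than twice are handled is not detailed here):

```python
def rep_features(n_tokens, first_pos, second_pos):
    """Five one-hot features per token describing its position relative
    to the two occurrences of a repeated lemma."""
    return [[
        int(i < first_pos),                # before the first occurrence
        int(i == first_pos),               # the first occurrence itself
        int(first_pos < i < second_pos),   # between the two occurrences
        int(i == second_pos),              # the second occurrence itself
        int(i > second_pos),               # after the second occurrence
    ] for i in range(n_tokens)]
```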

LightGBM Parameters
For the base system, the LightGBM parameters were left at their default values except for the following: num_iterations: 6000, max_bin: 510, bagging_freq: 5, bagging_fraction: 0.38, boost_from_average: false, feature_fraction: 0.05, learning_rate: 0.0095, max_depth: 6, min_data_in_leaf: 40, num_leaves: 25. For the extended system, the same parameters were used except for: feature_fraction: 0.09, max_depth: 9, min_data_in_leaf: 10, num_leaves: 27.
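Expressed as LightGBM parameter dictionaries, these settings correspond to the following (the explicit regression objective reflects the regression framing described above):

```python
base_params = {
    "objective": "regression",        # regression on the emphasis scores
    "num_iterations": 6000,
    "max_bin": 510,
    "bagging_freq": 5,
    "bagging_fraction": 0.38,
    "boost_from_average": False,
    "feature_fraction": 0.05,
    "learning_rate": 0.0095,
    "max_depth": 6,
    "min_data_in_leaf": 40,
    "num_leaves": 25,
}

# Extended system: same settings except for the four parameters below.
extended_params = {**base_params,
                   "feature_fraction": 0.09,
                   "max_depth": 9,
                   "min_data_in_leaf": 10,
                   "num_leaves": 27}
```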

Results
As a reminder, the base system uses only features from precomputed embedding models, while the extended system uses the prediction of the base system (one feature) to which contextual features are added. These two systems were submitted to the challenge using the combination of the Train and Dev sets for learning (3,133 items and 36,784 tokens). The better of the two was the extended system, which achieved a mean Match_m of 0.756 (see the Full model in Table 2), barely 0.003 more than the base system (see the Full model in Table 1). It ranked twenty-third out of 31 participants, very far from the best team, which obtained a mean Match_m of 0.823, and only 0.006 above the task organizers' Baseline system.
In order to assess the usefulness of the different feature sets, an ablation procedure was used. It must be noted that all the compared models include the binary feature coding whether the embeddings of several BERT tokens were added to obtain the embeddings of an original token. The values obtained on the official test set are given in Table 1 for the base system and in Table 2 for the extended system. At the request of the challenge organizers, these tables also present the performances obtained when training was carried out only on the Train set. Table 1 also gives the performance of the full model when the default LightGBM parameters are used. Two cases are reported: all the default parameters, including the number of iterations, which is then fixed at 100 (Def. 100), and the same model with this number fixed at 6,000 as in the other cases (Def. 6,000).
Def. 100 < Def. 6,000 < B22 = B22 ELMo = B22 ELMo B12 = FullM1 < FullM2

Figure 1: Statistically significant differences between the main models for the 7-fold CV.

It is clearly the parameter optimization that improves the performance of the base system. The differences between the other conditions are very small and vary depending on which training set is used. Such small differences between the various versions of the systems raise the question of whether they are merely the result of random fluctuations.
In order to determine whether the observed differences were statistically significant, a Monte-Carlo permutation test for related samples (Howell, 2008, Chap. 18) was used to compare selected conditions at a threshold level of 0.01, using 1,000 random permutations. These analyses were performed by means of a cross-validation procedure since the gold standard for the test materials is not available. As Bestgen (2020) observed that performing a single k-fold cross-validation provides sufficient information to assess the reproducibility of observed differences, the 7-fold split used during the development of the system was employed. Figure 1 shows that the default parameters are significantly less efficient than the optimized ones, while the extended model is significantly better than the other models.
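A minimal sketch of such a permutation test for two related samples of per-item scores (the input format is an illustrative assumption):

```python
import numpy as np

def permutation_test(scores_a, scores_b, n_perm=1000, seed=0):
    """Two-sided Monte-Carlo permutation test for related samples."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(diff.mean())
    count = 0
    for _ in range(n_perm):
        signs = rng.choice([-1, 1], size=len(diff))   # randomly swap the two systems per item
        if abs((signs * diff).mean()) >= observed:
            count += 1
    return count / n_perm        # estimated p-value, compared to the 0.01 threshold
```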
Figure 2: Bubble plot of true and predicted emphasis scores for the 7-fold CV.

To get a better idea of the successes and failures of the extended system, Figure 2 presents a bubble plot of the true and predicted scores of all tokens in the 7-fold CV, the size of a bubble being proportional to the number of observations in that area. It clearly shows that the system's effectiveness is low when predicting the highest scores (see the large variability of the predicted scores for true scores of 8 and 9), which are the most important for obtaining a high Match_m in the challenge.
In this figure, three areas deserve special attention. The bottom left corner contains a large number of effective predictions. These are very largely punctuation marks and grammatical words correctly predicted by the model as having to obtain (in general) low scores. For example, out of the 289 occurrences of that in the combined Train and Dev sets, 96.5% have an emphasis score less than or equal to 2 and none has a score higher than 5.
Table 3: True and predicted scores for the item You 're never a loser until you quit trying ., which occurs twice in the Train set.
The two areas highlighted in Figure 2 contain the tokens for which the errors are the greatest. They correspond, among others, to grammatical words considered by the annotators to be emphasized because they are part of a chunk such as university of life. There is also a somewhat strange case (see Table 3): the same item occurs twice in the Train set, and the raters who evaluated these two versions strongly disagreed. The model is mistaken each time on the word never, but once it gives it a score that is far too low and once a score that is far too high. The explanation for this contrasting behavior is that the two occurrences were not in the same CV test set. It thus appears that the model is very sensitive to the specific instances provided during learning.

Conclusion
The system proposed for Task 10 of SemEval-2020 to select the tokens to be highlighted in short texts was mainly based on precomputed embedding models and LightGBM. Its performance was mediocre, since it did barely better than the baseline. It would not be fair to invoke the fact that the human annotators themselves do not agree with each other on this task, since other teams proposed systems capable of achieving much better performance. The performance of the extended system might have been improved by using a weighting function for the features, such as bi-normal separation feature scaling (Forman, 2008) or BM25, which proved useful in the VarDial challenge (Bestgen, 2017). Yet, it seems to me that the major limitation of the proposed system is that it takes very little contextual information into account.