CS-UM6P at SemEval-2021 Task 7: Deep Multi-Task Learning Model for Detecting and Rating Humor and Offense

Humor detection has become a topic of interest for several research teams, especially those involved in socio-psychological studies, with the aim of detecting the humor and temper of a targeted population (e.g. a community, a city, a country, or the employees of a given company). Most existing studies formulate humor detection as a binary classification task, whereas the problem also involves learning the sense of humor by evaluating its different degrees. In this paper, we propose an end-to-end deep Multi-Task Learning (MTL) model to detect and rate humor and offense. It consists of a pre-trained transformer encoder and task-specific attention layers. The model is trained using MTL uncertainty loss weighting to adaptively combine the objective functions of all sub-tasks. Our MTL model tackles all sub-tasks of SemEval-2021 Task 7 in one end-to-end deep learning system and shows very promising results.


Introduction
Humor is a human trait that defines the emotional and behavioral characteristics of an individual. It refers to the quality of being amusing, comic, sarcastic, etc. Most dictionaries also define humor as a message whose ingenuity, verbal skill, or incongruity has the power to make an individual laugh.
Fine-tuning pre-trained transformer-based language models on the target task data has shown state-of-the-art (SOTA) results in many NLP applications (Devlin et al., 2019). For instance, several research works on humor and offensive language detection have achieved SOTA performances using pre-trained transformer-based language models (Zampieri et al., 2019; Weller and Seppi, 2019; Zampieri et al., 2020).
In this paper, we describe our system submitted to the SemEval-2021 Task 7 (Sub-Tasks 1 and 2) (Meaney et al., 2021). We propose an end-to-end deep Multi-Task Learning (MTL) model based on a RoBERTa encoder (Liu et al., 2019) and task-specific attention layers. The attention mechanism is applied on top of the encoder's contextualized word embedding to extract task-specific features. The classification and regression modules are fed with their task-specific attention output and the shared pooled output of the encoder. In order to adaptively combine all tasks' losses, we employed the MTL uncertainty loss weighting method (Kendall et al., 2017). We also investigate the base and the large variants of the BERT (Devlin et al., 2019) and RoBERTa encoders for both single-task learning and MTL. The obtained results show that our MTL model outperforms its single-task counterparts on both Task 1 and Task 2. The best performances are obtained using the RoBERTa-large encoder. Our system is ranked 18th, 9th, 7th and 20th on Sub-Tasks 1a, 1b, 1c and 2a, respectively.
The remainder of this paper is organized as follows. Section 2 describes the SemEval-2021 Task 7 and the provided data. Section 3 presents our MTL system. Section 4 summarizes the obtained results. Section 5 concludes the paper.

Task description
The SemEval-2021 Task 7 consists of two main tasks: the first task seeks to recognize and rate humor, while the second aims to rate offense (Meaney et al., 2021). To this end, the organizers have provided 8000 sentences for training and 1000 sentences each for validation and testing. All training and validation sentences are labeled for humor detection and offense rating, while only humorous sentences are labeled for humor and controversy rating. The dataset was labeled by 20 annotators covering a balanced set of age groups from 18 to 70.

Task 1: Humor detection
The aim is to predict three target values for the following sub-tasks:
• Task 1a: This sub-task is a binary classification task where the aim is to classify texts as humorous or not.
• Task 1b: This sub-task consists of predicting the humor degree of a text. The degree is based on the average rating (from 0 to 5) given by the annotators.
• Task 1c: This sub-task consists of predicting whether the humor rating would be considered controversial, i.e. whether the variance of the annotators' ratings is higher than the median rating.

Task 2: Offense rating
This task has one sub-task for offense rating:
• Task 2a: This sub-task predicts the degree of offense conveyed in a text regardless of its humor label. The offense degree varies from 0 (not offensive) to 5 (very offensive).

System description
We propose an end-to-end deep MTL model based on a pre-trained transformer-based language model (Devlin et al., 2019; Liu et al., 2019) and task-specific attention layers. First, we apply the encoder to the input text in order to obtain its Contextual Word Embedding (CWE). The task-specific attention layers are applied on the CWE. The classifier (Task 1a, Task 1c) or the regressor (Task 1b, Task 2a) is fed with the concatenation of its task-specific attention output and the encoder's pooled output. The model is then trained to minimize the binary cross-entropy loss and the RMSE loss for the classification and regression tasks, respectively. Finally, these losses are combined using uncertainty loss weighting for MTL.
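For illustration, the following minimal PyTorch sketch traces the tensor shapes of this pipeline, with random tensors standing in for the encoder outputs; the variable names and dimensions are ours and not part of the submitted system.

```python
import torch

n, d = 32, 768                     # n wordpieces, embedding dimension d (illustrative values)
H = torch.randn(n, d)              # contextual word embedding (CWE) of the input sentence
h_pooled = torch.randn(1, d)       # pooled embedding of the [CLS] / <s> token

# A task-specific attention layer collapses the CWE into one task representation s*.
alpha = torch.softmax(torch.randn(n), dim=0)   # attention weights over the n wordpieces
s_task = alpha @ H                             # weighted sum of h_1..h_n, shape (d,)

# Each classifier/regressor is fed the concatenation [h_pooled, s*].
head_input = torch.cat([h_pooled.squeeze(0), s_task])   # shape (2d,)
assert head_input.shape == (2 * d,)
```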

Transformer encoder
In order to recognize the most important patterns in an input text, we encode it using a state-of-the-art pre-trained transformer encoder. We compare four transformer encoders, namely BERT, BERT-large, RoBERTa and RoBERTa-large (Devlin et al., 2019; Liu et al., 2019). The tokenizer of the encoder splits the input sentence into wordpieces $[T_1, T_2, ..., T_n]$ and encodes them using its vocabulary. The transformer encoder is fed with the encoded input and outputs the pooled embedding $h_{pooled} \in \mathbb{R}^{1 \times d}$ (the embedding of the [CLS] token for BERT, resp. the <s> token for RoBERTa) and the CWE $H = [h_1, h_2, ..., h_n] \in \mathbb{R}^{n \times d}$, where $d$ is the embedding dimension.
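A minimal sketch of this step using the Hugging Face transformers library; the model name, example sentence, and variable names are illustrative assumptions rather than the authors' exact setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")

text = "Why did the chicken cross the road?"     # hypothetical input sentence
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = encoder(**inputs)

H = outputs.last_hidden_state      # (1, n, d): contextual word embeddings h_1..h_n
h_pooled = outputs.pooler_output   # (1, d): pooled embedding of the <s> token
```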

Task-specific attention layer
We use one task-specific attention layer for each task. Using $H$, the CWE of the input sentence, the attention mechanism (Bahdanau et al., 2015; Yang et al., 2016) extracts the task-specific representation $s_*$ (where $*$ denotes the task) as a weighted sum of $h_1, h_2, ..., h_n$. Here, $W_a \in \mathbb{R}^{d \times 1}$ and $W_\alpha \in \mathbb{R}^{n \times n}$ are the trainable parameters of the attention layer, $U \in \mathbb{R}^{n \times 1}$ is the attention mechanism's context vector, and $\alpha \in [0, 1]^n$ weights $h_1, h_2, ..., h_n$ according to their contribution to the task objective.
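As an illustration, the sketch below implements a standard additive attention layer in the spirit of Bahdanau et al. (2015) and Yang et al. (2016); since the exact equation is not reproduced here, the parameterization below is an assumption rather than the authors' precise formulation.

```python
import torch
import torch.nn as nn

class TaskAttention(nn.Module):
    """Additive (Bahdanau/Yang-style) attention over the CWE; a plausible
    reading of the task-specific attention layer, not the exact one."""

    def __init__(self, d: int):
        super().__init__()
        self.proj = nn.Linear(d, d)                  # W_a-like projection
        self.context = nn.Linear(d, 1, bias=False)   # context vector U

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, n, d) contextual word embeddings
        scores = self.context(torch.tanh(self.proj(H)))  # (batch, n, 1)
        alpha = torch.softmax(scores, dim=1)             # attention weights in [0, 1]
        return (alpha * H).sum(dim=1)                    # (batch, d) task representation s*
```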

Task Classification/Regression module
As the SemEval-2021 Task 7 consists of two classification tasks (1a and 1c) and two regression tasks (1b and 2a), we employ two classification modules and two regression modules. Each of these task-specific modules is composed of one hidden layer and one output layer, and takes as input the concatenation $[h_{pooled}, s_*]$ of the pooled output ($h_{pooled}$) and its task-specific attention output ($s_*$).
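A minimal sketch of such a task-specific head; the hidden size and activation are assumptions, and a single output unit covers both a binary-classification logit and a rating regression output.

```python
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    """One hidden layer plus one output layer, fed with [h_pooled, s*]."""

    def __init__(self, d: int, hidden: int = 256, out_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * d, hidden),   # input: concatenation [h_pooled, s*]
            nn.Tanh(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, h_pooled: torch.Tensor, s_task: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([h_pooled, s_task], dim=-1))
```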

MTL objective
Our MTL model is trained to minimize the losses of the four tasks. Specifically, it minimizes the binary cross-entropy loss and the RMSE loss for the classification and regression tasks, respectively:
• Binary cross-entropy loss for humor classification (Task 1a)
• RMSE loss for humor rating (Task 1b)
• Binary cross-entropy loss for controversy classification (Task 1c)
• RMSE loss for offense rating (Task 2a)
These losses take the standard forms
$$L_{BCE} = -\frac{1}{N} \sum_{j=1}^{N} \left[ y_j \log \hat{y}_j + (1 - y_j) \log (1 - \hat{y}_j) \right], \qquad L_{RMSE} = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (y_j - \hat{y}_j)^2},$$
where $y$ and $\hat{y}$ are the ground-truth and the predicted values, respectively, and $N$ is the number of training examples. In order to adaptively weight the losses of the four tasks, we combine them using MTL uncertainty loss weighting (Kendall et al., 2017):
$$L_{total} = \sum_{i=1}^{4} \left( \frac{1}{2\sigma_i^2} L_i + \log \sigma_i \right),$$
where $\sigma_i$ ($i = 1, ..., 4$) captures the amount of noise in the output of task $i$ and is used to tune the impact of each loss in the MTL optimization. Finally, the MTL model is trained to minimize the overall loss $L_{total}$ with respect to the network parameters as well as the noise parameters $\sigma_i$.
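A minimal PyTorch sketch of this uncertainty weighting, assuming the common log-variance parameterization of Kendall et al. (2017) rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Homoscedastic-uncertainty loss weighting (Kendall et al., 2017).
    log(sigma_i^2) is learned for numerical stability; this is a common
    simplification, not necessarily the submitted system's exact code."""

    def __init__(self, num_tasks: int = 4):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # log(sigma_i^2)

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])          # 1 / sigma_i^2
            total = total + 0.5 * (precision * loss + self.log_vars[i])
        return total
```

Since the noise parameters are ordinary trainable parameters, they can simply be passed to the same optimizer as the network weights.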

Experiment Settings
We have evaluated the performance of our model and its single-task counterparts using both the base and the large models of BERT and RoBERTa:
• BERT-base: 12 transformer blocks, d = 768, 12 attention heads, and 110M parameters.
• BERT-large: 24 transformer blocks, d = 1024, 16 attention heads, and 340M parameters.
• RoBERTa-base: 12 transformer blocks, d = 768, 12 attention heads, and 125M parameters.
• RoBERTa-large: 24 transformer blocks, d = 1024, 16 attention heads, and 355M parameters.
For text preprocessing, we have implemented a simple pipeline that normalizes contractions. All evaluated models are trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of $1 \times 10^{-5}$. The batch size and the number of epochs are fixed to 16 and 5, respectively. We have investigated both single-task training and MTL for all tasks. It is worth mentioning that, for single-task learning, we also apply an attention layer on top of the contextualized word embedding, which improves the single-task models as well. All models are trained on the full training set, validated on the development set, and evaluated on the test set of each task. For evaluation, we use the shared task's metrics, namely the F1-score, the Accuracy, and the Root Mean Squared Error (RMSE).
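As an illustration of these hyper-parameters, the following self-contained sketch wires them into a training loop; the model and data are random stand-ins, not the actual system or dataset.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data; the point is only to show the reported
# hyper-parameters (Adam, lr = 1e-5, batch size 16, 5 epochs) wired together.
model = nn.Linear(768, 1)
data = TensorDataset(torch.randn(64, 768), torch.randn(64, 1))
loader = DataLoader(data, batch_size=16, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.MSELoss()

model.train()
for epoch in range(5):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```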

Experiment Results
Table 1 presents the obtained results for all tasks using the single-task and MTL models. The results show that our MTL model surpasses its single-task counterparts on all tasks. The large variants of the BERT and RoBERTa encoders offer better performance than their base variants, and the best performance is obtained using our MTL model on top of the RoBERTa-large encoder. These results can be explained by the fact that deeper encoders capture more complex patterns from the input text, while MTL leverages useful signals from the related tasks.

To investigate the effectiveness of the task-specific attention layers and the uncertainty loss weighting on the performance of our MTL model, we have performed an ablation study. Table 2 presents the results of our model without these components. The results show that both components improve the performance of our MTL model, with the largest gain coming from the task-specific attention layers. Besides, the adaptive loss weighting outperforms the simple sum of the task losses ($L_{total} = l_1 + l_2 + l_3 + l_4$).

Conclusion
In this paper, we have presented our system for humor and offense detection and rating. Our system consists of an end-to-end MTL model based on a state-of-the-art pre-trained transformer encoder and task-specific attention layers. The attention layers are applied on top of the contextualized word embedding to extract task-discriminative features. We have employed two classification modules and two regression modules to tackle the four sub-tasks. Our MTL model is trained to minimize the losses of the four tasks, weighting them adaptively using MTL uncertainty loss weighting. We have also investigated the performance of our MTL model, as well as its single-task counterparts, using four pre-trained transformer-based encoders. The best performances are obtained using our MTL model with the RoBERTa-large encoder.
In future work, we would like to improve our model by taking the relationships between the different tasks into account. We also plan to use our model not only to detect humorous and offensive content, but also to perform other related tasks.