HumorHunter at SemEval-2021 Task 7: Humor and Offense Recognition with Disentangled Attention

In this paper, we describe our system submitted to SemEval 2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense. The task aims at predicting whether a given text is humorous, the average humor rating given by the annotators, and whether the humor rating is controversial. In addition, the task involves predicting how offensive the text is. Our approach adopts the DeBERTa architecture with its disentangled attention mechanism, where the attention scores between words are calculated based on their content vectors and relative position vectors. We took advantage of pre-trained language models and fine-tuned the DeBERTa model on all four subtasks. We experimented with several BERT-like architectures and found that the large DeBERTa model generally performs best. During the evaluation phase, our system achieved an F-score of 0.9480 on subtask 1a, an RMSE of 0.5510 on subtask 1b, an F-score of 0.4764 on subtask 1c, and an RMSE of 0.4230 on subtask 2a (rank 3 on the leaderboard).


Introduction
Humor, appreciated by people of almost any age or cultural background, is perhaps one of the most fascinating human behaviors. Besides providing entertainment, humor can also benefit mental health by serving as a moderator of life stress (Lefcourt and Martin, 2012), and it plays an important role in regulating human-human interaction. As Reeves and Nass (1996) have pointed out, people respond to computers in the same way as they do to real people, which suggests that modeling humor computationally could bring positive effects to human-computer interaction (Nijholt et al., 2003). Despite being universal to human beings, the extent to which people find something humorous varies according to one's age, gender, or socio-economic status, making humor a highly subjective experience. This poses many challenges to the field of computational humor. Abundant research has been done to enable computers to automatically decide whether humor is present in a given piece of text. Early work (Mihalcea and Strapparava, 2005; Mihalcea et al., 2010) uses manually engineered features to recognize humor in text, while more recent work (Chen and Soo, 2018; Weller and Seppi, 2019) adopts deep learning approaches and pre-trained language models.
SemEval 2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense (Meaney et al., 2021) aims at detecting and rating humor as well as offense in short English texts. There are four subtasks involved. Subtask 1a is a binary classification task, predicting whether the text would be considered humorous by an average user. Subtask 1b is a regression task, predicting the humor rating of the text if it is considered humorous. Subtask 1c is another binary classification task, predicting whether the humor rating is controversial; its ground-truth label is derived from the variance of the annotators' ratings. The task also involves offense detection: subtask 2a predicts how offensive the text is to a general user. All the regression subtasks have scores ranging from 0 to 5.
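The variance-based labeling scheme for subtask 1c can be sketched as follows. This is a minimal illustration only: the actual variance threshold used by the organizers is not stated here, so the value below is a placeholder assumption, as is the function name.

```python
from statistics import pvariance

def controversy_label(ratings, threshold=1.0):
    """Return 1 (controversial) when the variance of the annotators'
    humor ratings exceeds `threshold`, else 0. The threshold value is
    a placeholder assumption, not the organizers' actual cutoff."""
    return 1 if pvariance(ratings) > threshold else 0
```

A text whose annotators split between very low and very high ratings would thus be labeled controversial, while unanimous ratings would not.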
In this paper, we present our system submitted to SemEval 2021 Task 7. We followed the architecture of DeBERTa (He et al., 2020), which improves on BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) with two novel techniques: disentangled attention and an enhanced mask decoder. We mainly relied on the disentangled attention mechanism, where the attention weights of the input words are calculated based on their content vectors and relative position vectors. For the four subtasks, we used the same base structure; the only difference is at the output layer, where the classification tasks have two output units and the regression tasks have only one. The pre-trained DeBERTa model has two variants that differ in size. During the evaluation phase, the large version achieved an F-score of 0.9480 on subtask 1a, an RMSE of 0.5510 on subtask 1b, an F-score of 0.4764 on subtask 1c, and an RMSE of 0.4230 on subtask 2a (rank 3 on the leaderboard). In addition, we experimented with the BERT and RoBERTa models as our baselines, and found that DeBERTa generally outperforms them. Our code has been made publicly available.1

Related Work
Mihalcea and Strapparava (2005) used several human-centric features such as alliteration and synonymy to recognize humor in one-liners. Mihalcea et al. (2010) approached the problem by calculating the semantic relatedness between the set-up and the punchline. Morales and Zhai (2017) proposed a generative language model and leveraged background text sources to identify humor in Yelp reviews. Liu et al. (2018) proposed to model sentiment association between elementary discourse units and designed features based on discourse relations. Xie et al. (2020) calculated the uncertainty and surprisal of the set-up and the punchline according to the incongruity theory of humor, which were found useful in humor recognition. Recent work has also developed neural-network-based models to recognize humor in text. Chen and Lee (2017) and Chen and Soo (2018) adopted convolutional neural networks, while Weller and Seppi (2019) used a Transformer architecture.

Dataset
SemEval 2021 Task 7 provides three datasets: the training set (8,000 examples), the validation set (1,000), and the final test set (1,000). Table 1 summarizes the statistics of the three datasets, and lists the respective information for humorous (positive) and non-humorous (negative) examples. Each example is a piece of English text accompanied by four labels: is humor (subtask 1a), humor rating (subtask 1b), humor controversy (subtask 1c), and offense rating (subtask 2a). For subtasks 1b and 2a, the labels range from 0 to 5. Table 2 gives two samples, one humorous and the other non-humorous. For subtask 2a, whose goal is to predict the offense rating of the input text, we also visualize the top 200 most frequent unigrams for examples with offense rating ≥ 2 and < 2, respectively, illustrated as two word clouds (Figure 1a and Figure 1b). As we can observe, Figure 1a contains words that are expected to appear in offensive text, usually targeting a specific group of people (e.g., "black", "gay", "chinese", "muslim", etc.), while Figure 1b contains more ordinary words, which generally do not imply offense.
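The word-cloud preprocessing described above amounts to splitting the examples at the offense-rating cutoff and counting unigram frequencies. A minimal sketch (function names and the simple regex tokenizer are our assumptions, not the exact preprocessing used for the figures):

```python
from collections import Counter
import re

def top_unigrams(texts, k=200):
    """Return the k most frequent lowercase unigrams across `texts`
    (the raw frequency input to a word cloud)."""
    counts = Counter()
    for text in texts:
        # Naive regex tokenizer: lowercase alphabetic words only.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(k)

def split_by_offense(examples, cutoff=2.0):
    """Split (text, offense_rating) pairs at the rating cutoff
    used for the two word clouds (>= 2 vs. < 2)."""
    high = [t for t, r in examples if r >= cutoff]
    low = [t for t, r in examples if r < cutoff]
    return high, low
```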

System Overview
With increasingly powerful neural networks such as the Transformer (Vaswani et al., 2017), the performance on many downstream NLP tasks has been greatly improved by fine-tuning large pre-trained language models on smaller but task-specific datasets. Traditional Transformer-based language models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) use absolute positional embeddings in the input layer, which are added to the word embeddings and serve as the input to the following Transformer layers. The self-attention weights between the tokens are calculated solely based on their hidden representations. However, recent work (Shaw et al., 2018; Dai et al., 2019) has shown that relative position representations are more effective for NLP tasks.
Our system leverages the disentangled attention mechanism from the DeBERTa model (He et al., 2020), where the attention weights between input tokens are calculated based on their content vectors as well as their relative positions. As shown in Figure 2, instead of adding absolute positional embeddings at the input layer, we create a relative positional embedding table, which is shared across all layers, to represent the relative position between token i and token j. More specifically, the index of the relative position between token i and token j is defined as

$$\delta(i, j) = \begin{cases} 0 & \text{for } i - j \le -k \\ 2k - 1 & \text{for } i - j \ge k \\ i - j + k & \text{otherwise,} \end{cases}$$

where $k$ is the maximum distance we consider. Similar to the standard Transformer attention mechanism, the content representations $H$ and the relative position representations $P \in \mathbb{R}^{2k \times d}$ are transformed to queries, keys, and values:

$$Q_c = H W_{q,c}, \quad K_c = H W_{k,c}, \quad V_c = H W_{v,c}, \quad Q_r = P W_{q,r}, \quad K_r = P W_{k,r}.$$

Then, the attention weight $\tilde{A}_{ij}$ between token i and token j is calculated as follows:

$$\tilde{A}_{ij} = Q_i^c {K_j^c}^\top + Q_i^c {K_{\delta(i,j)}^r}^\top + K_j^c {Q_{\delta(j,i)}^r}^\top.$$

When aggregating the input representations $H$, we apply a scaling factor $1/\sqrt{3d}$ to obtain the output representations $H_o$:

$$H_o = \mathrm{softmax}\!\left(\frac{\tilde{A}}{\sqrt{3d}}\right) V_c.$$

For subtasks 1a and 1c, which are binary classification tasks, we use a softmax output layer and the cross-entropy loss. For subtasks 1b and 2a, which are regression tasks, we use the mean square error as the loss function. Otherwise, the base structure is the same, and we initialize the model with the pre-trained DeBERTa weights.
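The clipped relative position index is the piece that maps an arbitrary token-pair distance into the fixed-size embedding table. A minimal sketch assuming the DeBERTa-style clipping (the function name is ours):

```python
def rel_pos_index(i, j, k):
    """Relative position bucket delta(i, j) between tokens i and j,
    clipped to the maximum distance k. Values fall in [0, 2k - 1],
    matching the 2k rows of the relative embedding table P."""
    if i - j <= -k:
        return 0
    if i - j >= k:
        return 2 * k - 1
    return i - j + k
```

Because the index is clipped, tokens farther apart than k positions in either direction share the boundary buckets 0 and 2k − 1, keeping the table size independent of sequence length.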

Experimental Setup
We evaluated and compared our system with several baselines on the provided datasets, whose statistics are given in Section 3. In this section, we elaborate on our experimental setup.

Baselines
In our experiment, we consider the following approaches as our baselines:
• Bag of words (BoW). In this approach, we neglect the order of the input tokens, and simply add up the word embeddings of the tokens to form the vector representation of the input text. We implemented logistic regression for subtasks 1a and 1c, and linear regression for subtasks 1b and 2a, using the 300d GloVe word embeddings (Pennington et al., 2014).
• Convolutional neural network (CNN). Convolutional neural networks have been widely adopted in computer vision and image recognition. When applied to NLP tasks, the input is a 2D matrix with each row being the word embedding of the respective token, and the convolution operates along the rows with a fixed window size. We follow the CNN model in the work of Chen and Lee (2017), which includes an extra highway layer before the final fully connected layer, allowing shortcut connections with gate functions.
• Bidirectional long short-term memory (Bi-LSTM). LSTM (Hochreiter and Schmidhuber, 1997) has been shown to perform quite well in handling sequential inputs, making it suitable for many NLP tasks. A bidirectional LSTM incorporates two LSTMs, one in the forward direction and the other in the backward direction, thus better modeling the context. In this approach, we use a Bi-LSTM with hidden size 200 and one hidden layer.
• BERT. BERT (Devlin et al., 2019) is a deep bidirectional Transformer pre-trained on BooksCorpus and English Wikipedia with two training objectives: (1) masked language modeling, where some of the input tokens are randomly masked and are to be recovered by the model; (2) next sentence prediction, where the goal is to predict whether the second input sentence follows the first. By fine-tuning the pre-trained BERT, the performance on a wide range of NLP tasks can be largely improved compared with previous models such as LSTMs.
• RoBERTa. RoBERTa (Liu et al., 2019) is an optimized version of BERT, which was trained on bigger datasets and longer sequences. In addition, the next sentence prediction objective was removed, which was found to slightly improve the performance of downstream tasks. RoBERTa reportedly achieved better results than BERT on benchmarks such as GLUE, RACE and SQuAD.
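The BoW baseline above can be sketched in a few lines; a minimal illustration, assuming the embedding lookup is a plain word-to-vector mapping (the function name and skip-OOV behavior are our assumptions):

```python
def bow_vector(tokens, embeddings, dim=300):
    """Sum the word embeddings of the tokens, ignoring word order.
    `embeddings` is assumed to map a word to a list of `dim` floats
    (e.g., 300d GloVe); out-of-vocabulary tokens are skipped."""
    vec = [0.0] * dim
    for tok in tokens:
        emb = embeddings.get(tok)
        if emb is not None:
            # Element-wise accumulation of the token's embedding.
            vec = [v + e for v, e in zip(vec, emb)]
    return vec
```

The resulting fixed-length vector is then fed to logistic regression (subtasks 1a and 1c) or linear regression (subtasks 1b and 2a).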

Implementation
All the Transformer-based models in the experiment have two variants that differ in model size.
The base version has 12 Transformer layers, 768 hidden units, and 12 attention heads. The large version has 24 Transformer layers, 1024 hidden units, and 16 attention heads. We used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 5 × 10⁻⁶ and a batch size of 16. All the models were trained until the loss on the validation set reached its minimum.
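The stopping criterion (training until the validation loss reaches its minimum) amounts to early stopping. A minimal sketch; the `patience` value and the function names are our assumptions, not details from the paper:

```python
def train_until_min_val_loss(train_epoch, val_loss, patience=3):
    """Run epochs until the validation loss has not improved for
    `patience` consecutive epochs; return the best epoch and its loss.
    `train_epoch()` runs one training epoch; `val_loss()` evaluates
    the current model on the validation set."""
    best_loss, best_epoch = float("inf"), 0
    epoch, waited = 0, 0
    while waited < patience:
        train_epoch()
        epoch += 1
        loss = val_loss()
        if loss < best_loss:
            # New minimum: remember it and reset the patience counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
    return best_epoch, best_loss
```

In practice one would also checkpoint the model weights at each new minimum and restore the best checkpoint for evaluation.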

Evaluation Metrics
For the classification tasks 1a and 1c, we use precision, recall, F-score, and accuracy as the evaluation metrics. For the regression tasks 1b and 2a, we use the root mean square error (RMSE) as the evaluation metric:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2},$$

where $\hat{y}_n$ is the predicted value, and $y_n$ is the ground-truth value.
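The RMSE metric is a direct translation of its definition into code:

```python
from math import sqrt

def rmse(y_pred, y_true):
    """Root mean square error between predicted and ground-truth
    values: sqrt of the mean of the squared differences."""
    n = len(y_true)
    return sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n)
```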

Results
The performance of our system and the baselines is shown in Table 3 (subtasks 1a and 1b) and Table 4 (subtasks 1c and 2a). We show the performance scores on both the validation and the test set. Generally speaking, the large version of our system performs quite well on all four subtasks compared with the other models. It can also be observed that Transformer-based models always outperform the traditional methods by a large margin, except for subtask 1c, where all the models perform poorly and similarly. We conjecture this is because humor controversy is itself highly subjective and difficult to judge even for humans. We also observe that the large versions of the BERT-like models are generally better than their base counterparts, which is natural since larger models with more parameters usually bring better performance. Table 5 gives the confusion matrix of our system on the test set for subtask 1a. We can see that in both the positive and negative cases, the system performs quite well and makes only a few errors. We manually examined some cases where our system makes a false prediction, and found that when our system predicts humorous but the ground truth is non-humorous, the input text usually contains a question, e.g., There are 2 kinds of families on Thanksgiving. Which one are you?
We infer this is because most of the humorous examples in the training set contain a question, usually followed by a short answer serving as the punchline.

Conclusion
In this paper, we described our system submitted to SemEval 2021 Task 7. We adopted the disentangled attention mechanism from the DeBERTa model and participated in all four subtasks. During the evaluation phase, we ranked 3rd on the leaderboard for subtask 2a. For future work, we would like to combine human-centric features with the current disentangled-attention architecture to develop a hybrid model. In addition, we plan to expand the provided dataset with extra jokes from various sources such as Reddit forums, hoping to further improve the performance of our system.