SarcasmDet at SemEval-2021 Task 7: Detect Humor and Offensive based on Demographic Factors using RoBERTa Pre-trained Model

This paper presents one of the top winning solution systems for Task 7 at SemEval-2021, HaHackathon: Detecting and Rating Humor and Offense. The competition is divided into two tasks: task 1, with three sub-tasks (1a, 1b, and 1c), and task 2. The goal of task 1 is to predict whether a text would be considered humorous and, if so, how humorous it is and whether its humor rating would be perceived as controversial. The goal of task 2 is to predict how offensive the text would be considered by users in general. Our solution was developed using the RoBERTa pre-trained model combined with ensemble techniques. The paper describes the architecture of the submitted system, along with the experiments and hyperparameter tuning that led to this robust result. Our model ranked third and fourth out of 50 teams in tasks 1c and 1a, with F1-scores of 0.6270 and 0.9675, respectively. It also ranked among the top 10 models in tasks 1b and 2, with RMSE scores of 0.5446 and 0.4469, respectively.


Introduction
In our daily lives, the difficulties of dealing with sarcasm, bullying, and abuse of all kinds are increasing day by day (Sheehan et al., 1999; Cleary et al., 2009; Tucker and Maunder, 2015; van Verseveld et al., 2021). Technically, sarcasm and bullying are among the most complex and challenging topics that major companies and institutes seek to address. Artificial intelligence and text processing techniques are currently the most potent methods for detecting these problems in texts and images. Sarcasm and abuse are associated with attacking a specific person or group of people, either through an unintended joke or, in many cases, by directly harming the target's psyche. Irony and offensiveness are characterized by vocabulary peppered with humor that conceals the opposite meaning (Lee and Katz, 1998).
Task 7 at SemEval-2021, "HaHackathon: Detecting and Rating Humor and Offense", comprises two main tasks: task 1, with three sub-tasks (1a, 1b, 1c), and task 2. The goal of task 1 is to predict whether a text would be considered humorous and, if so, how funny it is and whether its humor rating would be perceived as controversial. The goal of task 2 is to predict how offensive the text would be considered by users in general. Our solution, SarcasmDet, ranked among the top four teams in two sub-tasks. The proposed approach uses the provided dataset, which contains 10,000 rows of raw text data. We experimented with several pre-trained language models using the simpletransformers library. It is worth mentioning that using the hard-voting ensemble technique increased our score remarkably.
The paper is organized as follows: Section 2 reviews related work. Sections 3 and 4 describe the shared task and the provided dataset, respectively. Section 5 describes our system. Section 6 presents our experiments. Section 7 reports the results, and Section 8 concludes the paper.

Related Works
In recent years, the development and growth of social media have motivated the NLP research community to detect humor and offensiveness. In 2018, SemEval offered shared tasks on detecting emotions and irony in tweets (Mohammad et al., 2018; Van Hee et al., 2018). The models proposed by the top teams mostly used LSTMs and word embeddings (Abdullah and Shaikh, 2018; Badaro et al., 2018; Wu et al., 2018). In 2019, SemEval introduced a shared task on discovering offensive language in social media. The researchers in (Liu et al., 2019a) used the Offensive Language Identification Dataset (OLID) provided by (Zampieri et al., 2019). They ranked first in the task with a macro F1 score of 0.8286 by applying a linear model, an LSTM, and a pre-trained BERT model. In 2020, one of the shared tasks presented at SemEval concerned how to change a chunk of text to make it funnier. The authors of (Mahurkar and Patil, 2020; Shatnawi et al., 2020) applied a pre-trained BERT model with different preprocessing to the presented dataset. This paper presents and explains in detail our solution to Task 7 at SemEval-2021, which detects humor and offense simultaneously.

Tasks Description
Each subtask of SemEval-2021 Task 7 has different requirements. In this section, we describe each task in detail.

Task 1a Humor Detection
Task 1a is a binary classification problem. A text should be classified as humorous or not based on the answers of 20 participants who were asked whether the text was intended to be funny. A text is considered funny if the majority of the participants judged it so. Table 1 shows an example of the training dataset for task 1a.

Task 1b Average Humor Score
Task 1b is a regression task; the humor rating depends on the labels from task 1a. If a text was classified as funny (humorous), participants were asked to rate how humorous it is on a scale of 1-5, and the average rating was taken as the label. If the text is not humorous, the label is 0. Table 2 shows an example of the training dataset for task 1b.

Task 1c Humor Controversy
Task 1c is a binary classification problem; humor controversy depends on the labels from task 1a. If a text was classified as funny (humorous), the task is to determine whether its humor classification is controversial (1) or not (0). Table 3 shows an example of the training dataset for task 1c.

Task 2 Average Offensiveness Score
Task 2 is a regression problem. Participants were asked whether a text is offensive in general and, if so, how offensive it is on a scale of 1-5. Table 4 shows an example of the training dataset for task 2.

Dataset Description
The dataset provided by the SemEval-2021 Task 7 organizers (Meaney et al., 2021) contains 10,000 rows of text data and four label columns, one per subtask: a binary humor label (task 1a), an average humor rating (task 1b), a binary humor controversy label (task 1c), and an average offensiveness score (task 2).

Data Preprocessing
No preprocessing was needed for the datasets of tasks 1a and 2. However, the datasets for tasks 1b and 1c contain null values. We first attempted to convert all null values to zeros, but this lowered the data's quality. We therefore used another technique, dropping the records that contain nulls, which increased the data quality and gave better performance.
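As a concrete illustration, the following is a minimal sketch of this cleaning step in Python with pandas; the file name and the label column names (humor_rating, humor_controversy) are assumptions for illustration, not necessarily the exact names in the released files.

```python
import pandas as pd

# Load the shared-task training data (file name assumed).
df = pd.read_csv("train.csv")

# Tasks 1a and 2: used as-is, no preprocessing required.

# Tasks 1b and 1c: drop the rows whose labels are null
# (non-humorous texts carry no humor rating / controversy label).
df_1b = df.dropna(subset=["humor_rating"])
df_1c = df.dropna(subset=["humor_controversy"])
```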

Systems Description
In our solution, we used the pre-trained language model RoBERTa (Liu et al., 2019b), which applies a robustly optimized training method to improve Bidirectional Encoder Representations from Transformers. We also used the pre-trained BERT model (Devlin et al., 2018). RoBERTa builds on BERT's language masking strategy, in which the model learns to predict deliberately masked sections of text within unannotated language examples. We chose the RoBERTa pre-trained model because of the significant performance improvements gained by tuning the BERT training procedure and the architecture based on BERT-large. We experimented with several deep learning models. In our best-performing solution, we applied an ensemble technique (Chou et al., 2009) to the best-scoring models, which include RoBERTa-large and BERT-large (24 layers, 1024 hidden units, 16 attention heads, 355M parameters). We used the RoBERTa model from HuggingFace (Wolf et al., 2019) and the simpletransformers pre-trained models. More details about each subtask follow.
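To make the setup concrete, the following is a minimal sketch of fine-tuning a RoBERTa classifier with the simpletransformers library, using the hyperparameter values reported in the task sections below; the toy training data here is a placeholder.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Hyperparameters as reported in this paper.
model_args = {
    "learning_rate": 1e-5,
    "manual_seed": 17,
    "train_batch_size": 16,
    "num_train_epochs": 5,
}

# RoBERTa-large binary classifier; simpletransformers wraps HuggingFace models.
model = ClassificationModel("roberta", "roberta-large", args=model_args)

# simpletransformers expects a DataFrame with "text" and "labels" columns.
train_df = pd.DataFrame({"text": ["example joke", "plain statement"],
                         "labels": [1, 0]})
model.train_model(train_df)

predictions, raw_outputs = model.predict(["is this text humorous?"])
```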

Task 1a
We applied RoBERTa (base/large) with different hyperparameters, then used the hard-voting ensemble technique over the best models to predict the labels in the test dataset. Our best approach scored a 0.9513 F-score in the development phase and a 0.9675 F-score in the test phase, with learning rate = 1e-5, manual seed = 17, train batch size = 16, and number of training epochs = 5.
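The hard-voting step can be sketched as a simple majority vote over the binary predictions of the individual fine-tuned models; the exact mechanics below are our illustrative assumption, shown with an odd number of models so that no tie-breaking rule is needed.

```python
import numpy as np

def hard_vote(prediction_lists):
    """Majority vote over binary (0/1) predictions from several models.

    prediction_lists: list of equal-length arrays, one per model.
    """
    stacked = np.stack(prediction_lists)   # shape: (n_models, n_examples)
    votes_for_one = stacked.sum(axis=0)    # how many models predicted 1
    return (votes_for_one > len(prediction_lists) / 2).astype(int)

# Example: three models voting on four test examples.
ensemble = hard_vote([np.array([1, 0, 1, 1]),
                      np.array([1, 1, 0, 1]),
                      np.array([0, 0, 1, 1])])
# ensemble -> array([1, 0, 1, 1])
```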

Task 1b
In this sub-task, we applied BERT (base/large) cased with different hyperparameters, then applied the hard-voting ensemble technique over the best models to predict the labels in the test set (learning rate = 1e-5, manual seed = 17, train batch size = 16, and number of training epochs = 5).
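Because task 1b is a regression task, the simpletransformers configuration differs slightly from the classification setup: a single output unit with regression mode enabled. A minimal sketch under that assumption, with toy training data as a placeholder:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

model_args = {
    "learning_rate": 1e-5,
    "manual_seed": 17,
    "train_batch_size": 16,
    "num_train_epochs": 5,
    "regression": True,          # predict a continuous humor rating
}

# BERT-large-cased with one output unit for the 0-5 humor score.
model = ClassificationModel("bert", "bert-large-cased",
                            num_labels=1, args=model_args)

train_df = pd.DataFrame({"text": ["a mildly funny pun", "not a joke"],
                         "labels": [2.4, 0.0]})
model.train_model(train_df)
```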

Task 1c
In this sub-task, we applied several pre-trained NLP models, such as BERT (base/large), XLNet (large), and RoBERTa (large), but the best solution was derived from the two previous sub-tasks (1a and 1b). We used the best predictions from tasks 1a and 1b to predict task 1c: if the task 1a prediction is 1 (humorous) and the predicted humor rating is greater than or equal to 3, we set humor controversy to 1; otherwise, we set it to 0.
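This heuristic translates directly into code; a minimal sketch, with the function and argument names chosen here for illustration:

```python
def predict_controversy(is_humor: int, humor_rating: float) -> int:
    """Derive the task 1c label from the task 1a and 1b predictions."""
    # Humorous texts with a predicted rating of 3 or more are marked
    # controversial; everything else is not.
    if is_humor == 1 and humor_rating >= 3:
        return 1
    return 0
```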

Task2
We applied RoBERTa (large/base) with different hyperparameters, then used the hard-voting ensemble technique over the best models to predict the labels in the test dataset (learning rate = 1e-5, manual seed = 17, train batch size = 16, and number of training epochs = 5).

Experiments
We experimented with several pre-trained NLP models for detecting humor and offense through the development and evaluation phases. The pre-trained models include BERT (base/large), developed by Google researchers; ALBERT (base/large) (Lan et al., 2019), a lite version of BERT that reduces the number of parameters and increases model speed by lowering memory consumption; and XLNet (base/large) (Yang et al., 2019), which introduced an autoregressive pre-training method and outperformed BERT on several tasks, such as sentiment analysis and question answering. Finally, we used the RoBERTa model, which outperformed most, if not all, of the other pre-trained models. We ran our experiments on Google Colab using CPU and GPU; using the Colab GPU doubled the speed of the experiments. We used the simpletransformers library with various hyperparameters: learning rate = 1e-5, manual seed = 17, train batch sizes of 8, 16, and 32, and 2, 3, or 5 epochs, as sketched below. Our best results on all tasks were accomplished by applying the hard-voting ensemble technique on top of the best-scoring results from RoBERTa-large and BERT-large-cased; the hard-voting technique increased the performance and accuracy remarkably. In the development phase, our model ranked first in three tasks (1a, 1b, and 2) and second in task 1c. In the evaluation phase, we ranked 4th in task 1a and 3rd in task 1c. Table 5 shows details of all hyperparameters used on all models for the two phases.
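A minimal sketch of such a hyperparameter sweep, iterating over the batch sizes and epoch counts listed above (the toy training data and the model choice are placeholders):

```python
from itertools import product

import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.DataFrame({"text": ["example joke", "plain statement"],
                         "labels": [1, 0]})

# Grid reported in the paper: fixed LR and seed, varied batch size and epochs.
for batch_size, epochs in product([8, 16, 32], [2, 3, 5]):
    model = ClassificationModel("roberta", "roberta-large", args={
        "learning_rate": 1e-5,
        "manual_seed": 17,
        "train_batch_size": batch_size,
        "num_train_epochs": epochs,
        "overwrite_output_dir": True,   # reuse the output dir across runs
    })
    model.train_model(train_df)
```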

Results
Our solution system's results are divided into two phases: the development phase and the evaluation phase. In the development phase, we experimented with several pre-trained language models (RoBERTa, BERT, ALBERT, and XLNet), implemented using the simpletransformers library. In the evaluation phase (test phase), we improved our system's capabilities by using different hyperparameters with ensemble techniques. The following sub-sections detail the results of the evaluation phase.

Task 1a
In task 1a, the RoBERTa-large model outperformed all other models with a 0.9669 F-score and 0.9590 accuracy. We used the hard-voting ensemble technique to improve our results, combining the top five scores achieved by the RoBERTa-large and RoBERTa-base models with different hyperparameters. With this method, we increased our solution's performance and achieved 4th place with a 0.9675 F-score and 0.9600 accuracy. Table 6 shows the ensemble results and the top results for the RoBERTa model.

Task 1b
In task 1b, BERT-large-cased outperformed all other models with an RMSE of 0.5468. We improved the result with the same ensemble method and hyperparameters used in the previous sub-task, combining the top four scores achieved by the BERT-large-cased and BERT-base-cased models, and achieved 10th place with an RMSE of 0.5446. Table 7 shows the ensemble results and the top results for the BERT model.
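For reference, RMSE here is the root mean squared error between the predicted and gold humor ratings; a minimal sketch with scikit-learn, using toy values in place of the actual predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([0.0, 2.4, 3.1, 1.8])   # gold average humor ratings
y_pred = np.array([0.2, 2.0, 3.5, 1.5])   # model predictions

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.4f}")
```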

Error Analysis
Our model predicted well in task 1a, with an F1-score of 0.9675, but in task 1c the performance dropped to an F1-score of 0.52. Figures 2 and 3 show the confusion matrices for tasks 1a and 1c.
The reason for this is the distribution of the datasets: the task 1a dataset was balanced, while the task 1c dataset was imbalanced, as it contained null values.
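For completeness, such a confusion matrix can be produced with scikit-learn; a minimal sketch with toy labels standing in for the actual predictions:

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # gold binary labels
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions

print(confusion_matrix(y_true, y_pred))   # rows: gold, columns: predicted
print(f"F1: {f1_score(y_true, y_pred):.4f}")
```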

Conclusion
This paper presents and describes our solution for SemEval-2021 Task 7, HaHackathon: Detecting and Rating Humor and Offense. We applied several pre-trained language models, such as RoBERTa, BERT, ALBERT, and XLNet, with a hard-voting ensemble technique to detect humor and offense. Our final solution was based on the BERT-large-cased and RoBERTa-large models, which showed remarkable improvements and high overall performance. Our solution system ranked 4th in task 1a with a 0.9675 F-score, 10th in task 1b with a 0.5446 RMSE, 3rd in task 1c with a 0.6270 F-score, and 10th in task 2 with a 0.4469 RMSE.