IIITH at SemEval-2021 Task 7: Leveraging transformer-based humorous and offensive text detection architectures using lexical and HurtLex features and task-adaptive pretraining

This paper describes our approach (IIITH) for SemEval-2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense. Our work focuses on two major objectives: (i) the effect of task-adaptive pretraining on the performance of transformer-based models, and (ii) how lexical and HurtLex features help in quantifying humour and offense. In this paper, we provide a detailed description of our approach along with the comparisons mentioned above.


Introduction
Humour is an important part of human conversation. It also has a social function and can play an important role in group cohesiveness (Ziv, 2010). Hence, humorous content is found on various social media websites. While there has always been a fine line between funny and offensive humour, the anonymity, distance and isolation of being online can increase instances of offensive or controversial humour being posted (Weitz, 2017). For this task, we present a transformer-based approach combined with lexical and HurtLex feature sets to quantify the humour and offense of a piece of text.
We achieved an F1 score of 0.959 in the humor classification task and 0.592 in the humor controversy task. For the regression tasks, we achieved RMSE scores of 0.541 and 0.488 in the humor regression and offense regression tasks respectively.

Related work
There have been many attempts at computational humour detection. In this section, we briefly describe other work in this area. Blinov et al. (2019) used the universal language model fine-tuning method for humour recognition. Convolutional neural networks (CNNs) have also been used for this task (Chen and Soo, 2018), whereas Weller and Seppi (2019) used transformers to classify humour.
There have also been several shared tasks and workshops related to computational humour. One of them is SemEval-2020 Task 7: Assessing Humor in Edited News Headlines (Hossain et al., 2020), where Zhang et al. (2020) used bidirectional neural networks with an attention mechanism and incorporated lexical features to assess humour in edited news headlines.
There has been a lot of work on hate speech and offensive speech detection as well. CNNs and gated recurrent units (GRUs) have been used for this task (Zhang and Luo, 2018). Recurrent neural networks combined with user-related information have also been used for hate speech detection on Twitter data (Pitsilis et al., 2018), whereas Ghosh Roy et al. (2021) leveraged multilingual transformer architectures to detect hostile content in English, Hindi and German.

Task and dataset overview
The task (Meaney et al., 2021) is divided into 4 sub-tasks.
1. Humour detection: This is a binary classification task where the model predicts whether the text is humorous (1) or not (0).
2. Humour rating: This is a regression task where the model rates how humorous the text is, where the value can vary between 0 and 5.
3. Humour controversy: This is a binary classification task where the model predicts whether the humour in the text is controversial (1) or not (0).
4. Offense rating: This is a regression task where the model rates how offensive the text is. The value can vary between 0 and 5.
The dataset for the tasks was provided by the workshop organizers. It consisted of 10,000 sentences: 8,000 were provided for training, 1,000 for validation, and the remaining 1,000 were used for testing. Each row consisted of a unique identifier, the text, and the label values for "is humor", "humor rating", "humor controversy" and "offense rating".

Hurtlex features
HurtLex (Bassignana et al., 2018) is a lexicon of offensive, aggressive, and hateful words in over 50 languages, divided into 17 categories. Identifying such words can potentially help in offensive content detection. Moreover, in some cases, a humorous piece of text might contain such a word to denote humour, so we also experimented with this feature for the humour classification and regression tasks.
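The HurtLex features amount to per-category counts of lexicon hits in a text. The following is a minimal sketch of that idea with a toy stand-in lexicon; the real HurtLex lexicon (tens of thousands of entries across 17 categories) and its featurizer are not reproduced here, and the example words and category codes are illustrative only.

```python
import re
from collections import Counter

# Toy stand-in for the HurtLex lexicon: word -> category code (illustrative).
TOY_HURTLEX = {
    "idiot": "cds",  # derogatory words
    "moron": "cds",
    "pig": "an",     # animal-based insults
}
CATEGORIES = ["cds", "an"]  # the real lexicon has 17 categories

def hurtlex_features(text: str) -> list:
    """Count, per category, how many lexicon words the text contains."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(TOY_HURTLEX[t] for t in tokens if t in TOY_HURTLEX)
    return [counts.get(c, 0) for c in CATEGORIES]

print(hurtlex_features("You absolute idiot, you pig."))  # -> [1, 1]
```

The resulting fixed-length count vector can be concatenated with other features, as done in our final setup.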

Lexical features
The structure of humorous and offensive texts can differ from that of normal texts. We leveraged a lexical feature set to capture this information and help distinguish humorous and offensive texts. The lexical features are:
• Counts of the total number of letters, punctuation marks, upper-case letters and numbers within the text.
• Presence of any named entity. For detecting named entities, we used the AllenNLP named entity recogniser, which uses pretrained GloVe vectors for token embeddings and a GRU encoder (Peters et al., 2017).
• Presence of interrogation, identified by a '?' symbol or any WH-word.
• The number of personal pronouns and their kind: first-person, second-person or third-person. For detecting personal pronouns, we used a pre-defined list of personal pronouns.

Sentence embeddings
For generating the sentence embeddings, we experimented with 4 different pre-trained transformer models: bert-base-uncased (Devlin et al., 2018), roberta-base (Liu et al., 2019), google/electra-base-discriminator (Clark et al., 2020) and xlnet-base-cased (Yang et al., 2019). Initially, we fine-tuned each of the pre-trained models for each task and made predictions on the validation set. On the basis of this performance, we selected one pre-trained model per task for our final setup (Section 4.5). For the binary humour classification, humour regression and offense regression tasks, we selected roberta-base, whereas google/electra-base-discriminator gave the best performance for the humour controversy task.

Task adaptive pretraining
Gururangan et al. (2020) demonstrate the benefits of continued pretraining of pre-trained transformer models on unlabelled task-specific data, or Task-Adaptive Pretraining (TAPT), before fine-tuning them on a downstream task such as text classification. Raha et al. (2021) showcase similar gains from further pretraining of the IndicBERT (Kakwani et al., 2020) model for hostility detection in Hindi. We experimented with the same approach for all our downstream tasks: a pretrained transformer model (roberta-base for humour classification, humour regression and offense regression) is further pretrained on the training data with the masked language modelling (MLM) objective. In our results (Section 5), we show the benefits gained from task-adaptive pretraining for each task. Note that task-adaptive pretraining was not done on google/electra-base-discriminator for the humour controversy classification.
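In practice the MLM objective is handled by the pretraining library's data collator, but the core corruption step is simple. The sketch below shows the standard BERT/RoBERTa dynamic-masking recipe (select ~15% of tokens; of those, 80% become a mask token, 10% a random token, 10% stay unchanged) on word-level tokens; the token list, vocabulary and seed are illustrative, and real TAPT operates on subword IDs.

```python
import random

MASK = "<mask>"                       # RoBERTa-style mask token
VOCAB = ["the", "a", "funny", "joke", "cat", "told"]  # toy vocabulary

def mlm_mask(tokens, p=0.15, seed=0):
    """Return (corrupted tokens, labels) under BERT/RoBERTa dynamic masking.

    labels[i] is the original token if position i is scored by the MLM
    loss, and None otherwise.
    """
    rng = random.Random(seed)
    out, labels = [], []
    for tok in tokens:
        if rng.random() < p:          # position selected for prediction
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                out.append(MASK)      # 80%: replace with the mask token
            elif r < 0.9:
                out.append(rng.choice(VOCAB))  # 10%: random token
            else:
                out.append(tok)       # 10%: keep the original token
        else:
            out.append(tok)
            labels.append(None)       # not scored by the MLM loss
    return out, labels
```

Because the masking is re-sampled every pass, the model sees different corruptions of the same unlabelled training texts across epochs of continued pretraining.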

Final setup
In this subsection, we outline our final architecture, from the set of input features to the final label generation for each task. First, we generate the lexical and HurtLex features on the training, validation and testing data; for the HurtLex features, we used the featurizer provided with HurtLex. The embeddings generated by the transformer models are then concatenated with the HurtLex and lexical features to form the final vector representation for a particular text. For optimization, we used the Adam optimizer (Kingma and Ba, 2017) with the learning rate set to 1e-5, and dropout (Srivastava et al., 2014) with a probability of 0.1. We updated weights based on cross-entropy loss values for the classification tasks and mean squared error for the regression tasks. A dense multi-layer perceptron serves as the final binary classification head or regression head. The model weights were saved and evaluated on the development set at the end of every epoch, and fine-tuning continued for 10 epochs. We report the scores of the models that yielded the best F1 score on the development set and used them to predict on the test set. We also experimented with and without the HurtLex and lexical features to showcase the gains or losses attributed to them.
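The concatenation and classification-head step above can be sketched as a single forward pass. All dimensions and weights below are illustrative stand-ins (random values rather than trained parameters): a 768-dimensional roberta-base sentence embedding, 17 HurtLex category counts and 8 lexical counts feeding a small dense head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the three input blocks (real values come from the
# fine-tuned transformer and the feature extractors).
emb = rng.normal(size=768)     # sentence embedding from roberta-base
hurtlex = np.zeros(17)         # HurtLex category counts
lexical = np.zeros(8)          # lexical feature counts

# Final vector representation: concatenation of all three blocks.
x = np.concatenate([emb, hurtlex, lexical])

# Dense MLP head for binary classification (illustrative weights).
W1 = rng.normal(scale=0.02, size=(x.size, 64))
b1 = np.zeros(64)
W2 = rng.normal(scale=0.02, size=(64, 1))
b2 = np.zeros(1)

h = np.maximum(x @ W1 + b1, 0.0)       # ReLU hidden layer
logit = float(h @ W2 + b2)
prob = 1.0 / (1.0 + np.exp(-logit))    # sigmoid -> P(is_humor)
```

For the regression tasks the same concatenated vector feeds a regression head, with the sigmoid replaced by a raw output trained under mean squared error.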

Results
The gains attributed to task-adaptive pretraining of roberta-base on the humour classification task are shown in Table 2. We can see that continued pretraining of roberta-base improved model performance significantly.
In Table 1, we show the results of including and excluding the lexical and HurtLex features for each task. We notice that the lexical and HurtLex features do contribute to the performance of humour classification. Combining HurtLex and lexical features with the transformer embeddings improved the results of both the humour classification and humour regression tasks. For offense regression, the HurtLex features played an important role, while the lexical features degraded performance, probably because the lexical features were curated for the identification of humour. For humour controversy, excluding the lexical and HurtLex features gave the best results; this might be because textual features played a much more important role than the lexical and HurtLex features for this task.
Table 3: Results on the test split for each task and their respective ranks on the leaderboard during the evaluation phase. Task 1a refers to humour classification, Task 1b to humour regression, Task 1c to humour controversy and Task 2 to offense regression.
In Table 3, we report the results obtained on the test set during the evaluation phase and the rank of our models on the official leaderboard. We used the best-performing models on the validation set to achieve those results.
Overall, this work shows how task-adaptive pretraining can improve model performance on downstream tasks, and the role of HurtLex and lexical features in humour and offense detection.

Conclusion
All the experiments above were performed with default hyperparameters (unless explicitly mentioned) due to resource constraints. The model performances might have improved if we had searched for optimal hyperparameters using cross-validation. Furthermore, the regression tasks could benefit from an ensemble of the best-performing models for the final predictions.