DUTH at SemEval-2021 Task 7: Is Conventional Machine Learning for Humorous and Offensive Tasks enough in 2021?

This paper describes the approach developed by the DUTH team for SemEval-2021 Task 7 (Hahackathon: Incorporating Demographic Factors into Shared Humor Tasks). We used and compared a variety of preprocessing techniques, vectorization methods, and numerous conventional machine learning algorithms in order to construct classification and regression models for the given subtasks. To improve our system's performance, we combined the models' outputs with small Neural Networks (NN) via majority voting for the classification subtasks and via averaging for the regression subtasks. While these methods proved weaker than modern deep learning models, they remain relevant in research because of their low computational requirements and fast training.


Introduction
The underpinnings of humor have proven far more vexing than those of other emotional experiences. Humor is a highly subjective topic, and scholars in philosophy, linguistics, psychology, and sociology have attempted to construct theories for understanding its fundamental elements. Some theories, e.g. the Benign Violation Theory (Warren and McGraw, 2015), suggest that humor can be described as linguistic violations that still make grammatical sense. According to this theory, for a joke to be classified as humorous, it must be neither too harmless nor too offensive.
Numerous studies on humor sentiment analysis have appeared in the last decade. In microblogging, Reyes et al. (2012) extracted linguistic devices from tweets to use as features for classifying those tweets as humorous or ironic, while Raz (2012) approached the classification of humorous tweets as a multi-class problem over 11 types of humor, in an attempt to better attribute a tweet's real sentiment. Recent attempts at humor detection in SemEval-2020 Task 7 indicate that transformer models like BERT (Mahurkar and Patil, 2020) far outperform traditional machine learning algorithms (S et al., 2020).
This paper describes our submissions to SemEval-2021 Task 7 and is structured as follows. Section 2 describes the subtasks, training data, and evaluation measures. Section 3 describes the key methods and algorithms used. Section 4 describes our proposed system, while Section 5 analyzes our results. Finally, we draw our conclusions in Section 6, where we also propose directions for future work.

Background
In this section we describe each subtask's objective, the given data, and evaluation measures.

Subtasks
The main objectives of SemEval 2021 Task 7 were split into four subtasks. Subtask 1a required us to classify short texts as humorous or not, while Subtasks 1b and 1c required us to rate each text's humor and to further classify it as controversial or not, respectively. Finally, in Subtask 2, we had to rate how offensive each text was, humorous or not. All texts were in English.

Dataset
The organizers released the full training data in three parts: trial, development, and evaluation. Our final training dataset consisted of 9,000 texts in CSV format, annotated with labels for each subtask, while our test set consisted of 1,000 texts. Statistics for the training dataset are presented in Table 1 and Figure 1.

Evaluation Measures
For the classification Subtasks 1a and 1c, we use the F1 measure. For the regression Subtasks 1b and 2, we use the root mean squared error (RMSE).
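For concreteness, both measures are available in scikit-learn; the snippet below is a minimal sketch with placeholder labels rather than the actual task data.

    import numpy as np
    from sklearn.metrics import f1_score, mean_squared_error

    # Placeholder predictions, not the shared-task data.
    y_true_cls = np.array([1, 0, 1, 1])
    y_pred_cls = np.array([1, 0, 0, 1])
    f1 = f1_score(y_true_cls, y_pred_cls)  # Subtasks 1a and 1c

    y_true_reg = np.array([2.4, 0.0, 3.1])
    y_pred_reg = np.array([2.0, 0.5, 3.0])
    rmse = mean_squared_error(y_true_reg, y_pred_reg, squared=False)  # Subtasks 1b and 2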

Experimental Setup
In this section we describe the preprocessing and vectorization methods as well as the machine learning algorithms used.

Preprocessing
Text preprocessing is the backbone of every text classification task. We applied the following techniques:
1. Tokenizing and Lowercasing words: We lowercased the words in the texts and split them into tokens.
2. Stemming or Lemmatisation: Reducing noise from texts while (generally) improving system performance.
3. Removing Stopwords: Stopwords do not add much meaning to a sentence, so removing them helps reduce the number of features and improves results.
4. Tagging words with capital letters: Tagging each word containing a capital letter that is not the first word in a sentence. Applied (when used) prior to word lowercasing.
5. Replacing Emojis: Very few emojis were found in the texts, so we replaced them with their corresponding sentiment.
6. Replacing Contractions: We replaced contractions with their full forms using a dictionary.
7. Removing integers: Numbers have no emotional value, so we removed them.
8. Part-of-Speech (POS) tagging: We used POS tagging of the words, following two different approaches: appending the POS tags to the words (8a), and removing words with low sentimental value from the data by targeting specific tags (8b).
9. Numeric Feature Extraction: We extracted counts of characters, words, exclamation points, and numbers, as well as the numbers of declarative, interrogative, and imperative/exclamative sentences in a text. Finally, we extracted the counts of verbs, nouns, and adjectives for each text.
The performance comparison of each preprocessing method is shown in Section 5. Stemming, lemmatisation, POS tagging, and most of the numeric feature extraction were performed with tools from the well-established NLTK (Elhadad, 2010). We were also guided by the survey of Ravi and Ravi (2015) on the most commonly used techniques in text preprocessing for sentiment analysis, and by our previous work (Symeonidis et al., 2018) on this subject.
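To make these steps concrete, the following minimal sketch illustrates steps 1, 2, 3, 7, and part of 9 with NLTK; it is a simplified illustration under assumed default settings, not our exact pipeline.

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # One-time setup: nltk.download('punkt'); nltk.download('stopwords');
    # nltk.download('wordnet')
    STOPWORDS = set(stopwords.words('english'))
    LEMMATIZER = WordNetLemmatizer()

    def preprocess(text):
        tokens = nltk.word_tokenize(text.lower())           # 1: tokenize + lowercase
        tokens = [t for t in tokens if t not in STOPWORDS]  # 3: remove stopwords
        tokens = [t for t in tokens if not t.isdigit()]     # 7: remove integers
        return [LEMMATIZER.lemmatize(t) for t in tokens]    # 2: lemmatise

    def numeric_features(text):                             # 9 (partial): numeric features
        return {'chars': len(text),
                'words': len(text.split()),
                'exclamations': text.count('!')}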

Machine Learning
The training of our classification and regression models aimed to improve the evaluation measure used for the corresponding task. Many probabilistic, linear and tree-based algorithms were used, as well as small neural network architectures.
The algorithms/Neural Networks (NN) that performed best were used in our systems and are listed below:
• Linear Models: Linear SVM, Bayesian Ridge Regression, and LASSO
• Non-Linear Models: Naive Bayes, LightGBM (Ke et al., 2017), and XGBoost (Chen and Guestrin, 2016)
• NN Models: Dense and Long Short-Term Memory (LSTM) Networks, using the Keras API (https://github.com/fchollet/keras)
Linear SVM models were excluded from our final systems, mainly due to our better tuning of the LGBM and XGB models during the last phase.
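As an illustration, the sketch below trains two of the listed gradient-boosting models on stand-in features; the hyperparameters shown are illustrative defaults, not our tuned values.

    import numpy as np
    from lightgbm import LGBMClassifier
    from xgboost import XGBClassifier

    # Stand-in feature matrix and labels in place of vectorized texts.
    X = np.abs(np.random.randn(200, 50))
    y = np.random.randint(0, 2, 200)

    lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.05).fit(X, y)
    xgb = XGBClassifier(n_estimators=200, max_depth=6).fit(X, y)
    predictions = lgbm.predict(X), xgb.predict(X)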

Vectorization and Embedding
We used the following vectorizers from the scikit-learn toolkit to extract features from the preprocessed data, using word unigrams or bigrams:
• Tf-idf Vectorizer: translates the word-count matrix into a matrix of tf-idf features.
• Delta tf-idf Vectorizer: proposed by Martineau and Finin (2009), it creates tf-idf features similarly to the tf-idf vectorizer, but applies a weighting scheme reflecting the difference in each word's tf-idf value between the texts of two classes. We used Subtask 1a's labels for weighting the tf-idf values in every subtask.
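Since the delta tf-idf vectorizer is less common, the sketch below shows the core idea on toy texts; the smoothing constants are our own simplification of Martineau and Finin's (2009) scheme.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["what a funny pun", "a dry factual report", "another funny joke"]
    labels = np.array([1, 0, 1])  # toy Subtask 1a labels: 1 = humorous

    counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts).toarray()
    pos, neg = counts[labels == 1], counts[labels == 0]

    # Smoothed per-class document frequencies and their idf difference.
    df_pos = (pos > 0).sum(axis=0) + 1
    df_neg = (neg > 0).sum(axis=0) + 1
    delta_idf = np.log2((len(pos) + 1) / df_pos) - np.log2((len(neg) + 1) / df_neg)

    # Terms typical of one class get large positive or negative weights.
    delta_tfidf = counts * delta_idf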
In order to create features for the LSTM models, we translate the words of each text into word vectors. This translation is achieved through the use of a well-known, pre-trained model: GloVe (Pennington et al., 2014).
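A minimal sketch of this step with Keras follows; the GloVe file name, vocabulary size, and layer sizes are illustrative choices, and word_index stands in for a tokenizer's vocabulary.

    import numpy as np
    from tensorflow.keras.layers import Dense, Embedding, LSTM
    from tensorflow.keras.models import Sequential

    VOCAB, EMB_DIM, MAX_LEN = 10000, 100, 50
    word_index = {'funny': 1, 'joke': 2}  # hypothetical tokenizer vocabulary

    # Load pre-trained GloVe vectors (file name is an assumption).
    embeddings = {}
    with open('glove.6B.100d.txt', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')

    # Build the embedding matrix for the known vocabulary.
    matrix = np.zeros((VOCAB, EMB_DIM))
    for word, i in word_index.items():
        if i < VOCAB and word in embeddings:
            matrix[i] = embeddings[word]

    model = Sequential([
        Embedding(VOCAB, EMB_DIM, weights=[matrix], input_length=MAX_LEN,
                  trainable=False),
        LSTM(64),
        Dense(1, activation='sigmoid'),  # e.g. humorous vs. not (Subtask 1a)
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')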

System Overview
In this section we describe the proposed system for each subtask.

Proposed System
We trained each model with all possible combinations of preprocessing and vectorization. During the development phase, we evaluated these models using 10-fold cross validation on the training data. These evaluations guided us through hyperparameter tuning and model selection.
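For reference, a minimal sketch of that 10-fold evaluation with scikit-learn, using a stand-in model and stand-in features:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB

    X = np.abs(np.random.randn(300, 50))  # stand-in for tf-idf features
    y = np.random.randint(0, 2, 300)      # stand-in binary labels

    scores = cross_val_score(MultinomialNB(), X, y, cv=10, scoring='f1')
    print(scores.mean(), scores.std())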
During the evaluation phase, we combined the outputs of these models in order to produce our system's predictions (Figure 2). The system's performance was evaluated using the test data from the development phase. That data was also used as validation data for training and tuning the dense and LSTM networks.
Figure 2: System Architecture

Model selection for our final systems was a repetitive but simple process. We selected the best performing model per algorithm and vectorization method, some weaker models whose outputs correlated less with the outputs of the best performing models, and all the NN models that we had trained. We then combined all selected model outputs in all possible combinations of at least 3 models and picked the best performing combination as our final system for task prediction.
This process was repeated for each subtask. After finding the combination that produced the best results on the development phase's test data, we re-trained the system's models after appending that test data to the training data.
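The combination step itself reduces to majority voting for the classification subtasks and averaging for the regression subtasks; a minimal sketch follows, including the tie-breaking behavior described for Subtask 1a below.

    import numpy as np

    def majority_vote(predictions):
        """predictions: (n_models, n_samples) array of 0/1 labels.
        Ties fall to label 0, i.e. the 'non-humorous' tag in Subtask 1a."""
        preds = np.asarray(predictions)
        return (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)

    def mean_rating(predictions):
        """predictions: (n_models, n_samples) array of real-valued ratings."""
        return np.asarray(predictions).mean(axis=0)

    # Example: four models, second sample tied 2-2 and resolved to 0.
    print(majority_vote([[1, 1], [1, 0], [0, 1], [1, 0]]))  # -> [1 0]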

Subtask 1a: Humour Classification
In Subtask 1a, each text needs to be classified as humorous or not. Table 2 showcases the various preprocessing methods, vectorization tools, and ML algorithms that comprised our final system. Since the number of models is even, majority voting favors the 'non-humorous' label in case of a tie, i.e. the less represented label in the dataset.

Preprocessing         Vectorizer              ML
1, 2, 5, 8a, 9        Delta tf-idf unigrams   Shallow NN
1, 2, 9               Tf-idf unigrams         LGBM
1, 2, 8a, 9           Delta tf-idf bigrams    Naive Bayes
1, 2, 6, 7, 8a, 9     Delta tf-idf bigrams    Naive Bayes
1, 7, 9               Tf-idf bigrams          XGB
1                     Word2vec embeddings     LSTM
Table 2: Preprocessing techniques, vectorizers, and ML algorithms comprising our final system for Subtask 1a.

Subtask 1b: Humour Rating
In Subtask 1b, each humorous text needs to be rated on a scale of 0-5 on how humorous it is. For this subtask, the best combination found amounted to 13 models; thus, we do not include a table for this task. All proposed preprocessing techniques except 3 and 8b were used, as well as every vectorizer. Interestingly, NN models were ruled out in this subtask, since a combination of LGBM, XGB, and Bayesian Ridge models produced the best outcome.

Subtask 1c: Controversial Humor Classification
In Subtask 1c, each humorous text has to be classified as controversial or not. This is the only subtask in which a single model outperformed every combination of models we tried to assemble: extracting numeric features (9), appending POS tags (8a), lowercasing and tokenizing (1), vectorizing with delta tf-idf bigrams, and training an LGBM model yielded the best result.

Subtask 2: Offense Rating
Finally, in Subtask 2, each text is rated on a scale of 0-5 on how offensive it is. The final system's output is the average of the predictions of the 7 individual models described in Table 3.

Results
Each preprocessing method had an impact on model performance, as shown in Table 4 for each subtask. The results for each subtask are shown in Table 5, where we present the scores of our best performing single models and model combinations on the development test set, as well as our submissions on the evaluation test data.
We can detect a pattern in the difference between the winning team's submissions and ours, and between the performance of our NN and non-NN models. It indicates, for Subtask 1a, that our conventional machine learning models, while achieving respectable performance, cannot handle some outliers. This can also be observed in the results of Subtask 2, where outliers have a greater impact on the RMSE metric. Our false negatives on humor are mostly ironic, reference-based, or highly controversial jokes, while our false positives are mostly conversational texts such as microblogging posts. The basic LSTM models we created were able to slightly close the gap with the superior transformer models, but there is still plenty of headroom for improvement. On the other hand, our systems for Subtasks 1b and 1c were much closer to the top-performing models. While Subtask 1b's results could be attributed to a large extent to the distribution of humor ratings, Subtask 1c seems to be a much harder task regardless of the approach. With an average accuracy of 0.5 across all submissions, humor controversiality seems to puzzle even the most complex models; in any case, humor controversiality is much more subjective than humor itself.
Nevertheless, great advantages of conventional machine learning are its training speed and low hardware requirements. State-of-the-art boosting models like the ones we used (LightGBM and XGBoost) can be accelerated with GPUs, while our small NNs can be trained in a couple of minutes. Training and tuning deep learning models, on the other hand, requires expensive hardware and can be very time-consuming.

Conclusions
In this report, we presented our approach to humorous and offensive text classification and rating, based on combining the outputs of different preprocessing techniques, vectorization methods, and machine learning algorithms. Our proposed systems were outperformed by other teams in the main tasks, while our conventional machine learning models were mostly inferior to our neural networks. Our future work will focus on expanding our preprocessing methods, introducing further ensemble methods and stacking, as well as experimenting with deep learning approaches such as transformer models.