SSS at SemEval-2023 Task 10: Explainable Detection of Online Sexism using Majority Voted Fine-Tuned Transformers

This paper describes our submission to SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS), which is divided into three subtasks. The recent rise of social media platforms has been accompanied by disproportionate levels of sexism experienced by women on them. This has made detecting and explaining online sexist content more important than ever in order to make social media safer and more accessible for women. Our approach consists of experimenting with and fine-tuning BERT-based models and using a majority-voting ensemble that outperforms the individual baseline models. Our system achieves a macro F1 score of 0.8392 for Task A, 0.6092 for Task B, and 0.4319 for Task C.


Introduction
Social media has become a free and powerful tool for communication, allowing people to share their ideas and connect with others worldwide. For women, social media has provided a platform for bringing visibility to, empowering people on, and educating people about women's issues. However, as the number of users has grown, sexism on these platforms has increased significantly, leading to online harassment and disinformation, making social media less accessible and more unwelcoming, and continuing to uphold unequal power dynamics and unfair treatment within society. Social media's anonymity and ease of access have made it easier for individuals to spread sexist content online. This has led to a growing interest in, and need for, detecting online sexist content and explaining why that content is sexist. Natural Language Processing (NLP) techniques have shown promise in enabling the automated detection of sexist language and providing insights into its nature. In this paper, we describe our contributions to SemEval 2023 Task 10: Explainable Detection of Online Sexism (Kirk et al., 2023), which proposed a new dataset of sexist content from Gab and Reddit to detect and explain online sexist content through three hierarchical classification subtasks: Task A, a binary classification task to classify whether a post is sexist or not sexist; Task B, a multi-class classification task that further classifies sexist posts into one of four categories; and Task C, another multi-class classification task that classifies sexist posts into one of eleven fine-grained vectors. We conducted detailed experiments on BERT-based models, such as RoBERTa (Liu et al., 2019) and DeBERTaV3 (He et al., 2021), using different training methods and loss functions to reduce the effect of class imbalance, along with hyperparameter tuning.
We then ensembled these individual models using majority voting, and we show how these methods affect model performance and how an ensemble of these models performs better than the individual baselines.

Problem and Data Description
SemEval 2023 Task 10: Explainable Detection of Online Sexism aims to detect and flag any abuse or negative sentiment directed towards women based on their gender, or based on their gender combined with one or more other identity attributes. The dataset consists of 20,000 labelled entries, of which 10,000 are sampled from Gab and 10,000 from Reddit. Initially, three trained annotators assign labels to all entries, and any discrepancies are resolved by one of two experts. In Task A, if all annotators unanimously agree on a label, it is considered the final label; if there is any disagreement, an expert reviews the entry and determines the final label. For Tasks B and C, if at least two annotators agree on a label, that is considered the final label, while an expert determines the label in the case of a three-way disagreement. The challenge is divided into three hierarchical subtasks. Subtask A: Binary Sexism Classification is a two-class (binary) classification task that detects whether a post is sexist. The training set consists of 14,000 entries (a 70% split), of which 3,398 are sexist, as shown in Table 1.

Post: yeah but in order to keep the benefit i have to be good tommorow because i told her we could try
Label: not sexist

Post: As Roosh said having discussions with woman is a waste of time anyway
Label: sexist

Table 1: Sample entries from the training set.

Related Work
Online sexism is considered a type of hate speech by many and can be found on several large platforms such as Twitter and Reddit. This increases the importance of improving its detection and classification (Gorrell et al., 2019). (Fersini et al., 2022) aimed to detect misogynous memes on the web by taking advantage of available texts and images, and (Mina et al., 2021) aimed to automatically identify sexism in social media content by applying machine learning methods. However, there has not been much work on the explainable detection of online sexism, which is addressed in this task. (Agrawal and Mamidi, 2022) and (Rao, 2022) performed binary and multi-class classification using transformer-based models and showed how ensembling them outperforms the baseline model performances.

System Overview
After conducting extensive experiments on pretrained transformer-based models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2020), DeBERTaV3 (He et al., 2021), ALBERT (Lan et al., 2019) and XLNet (Yang et al., 2019), and comparing their baseline performances in terms of accuracy and macro F1 score, we finally used DeBERTaV3 for Task A and both RoBERTa and DeBERTaV3 for Tasks B and C.

Ensemble
To increase the overall accuracy of the predictions and the robustness of the predictive system, the models were first individually tuned on the entire dataset, and their predictions were then combined using ensembling methods. We considered two schemes: majority voting, a hard-voting method in which the class predicted by the most models is taken as the final output, and a weighted-average ensemble, in which the weights are obtained by grid search or other optimization techniques on the validation dataset. The final results were obtained using majority voting.
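As an illustration, majority voting over hard labels can be written in a few lines of NumPy. This is a minimal sketch, not our submission code, and the model predictions below are made up for demonstration:

```python
import numpy as np

def majority_vote(predictions):
    """Combine hard-label predictions from several models.

    predictions: list of 1-D integer arrays, one per model, each holding
    the predicted class index for every example. Returns the per-example
    label chosen by the most models (ties resolve to the lowest class
    index, the behaviour of np.bincount followed by argmax).
    """
    stacked = np.stack(predictions)      # shape: (n_models, n_examples)
    n_examples = stacked.shape[1]
    voted = np.empty(n_examples, dtype=int)
    for i in range(n_examples):
        voted[i] = np.bincount(stacked[:, i]).argmax()
    return voted

# Three hypothetical models vote on four posts (0 = not sexist, 1 = sexist)
m1 = np.array([0, 1, 1, 0])
m2 = np.array([0, 1, 0, 0])
m3 = np.array([1, 1, 1, 0])
print(majority_vote([m1, m2, m3]))  # → [0 1 1 0]
```

With an odd number of models, as in our three-model Task A ensemble, binary votes can never tie.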

Task A: Binary Classification
This task consisted of classifying texts into two classes: sexist and not sexist. The data was trained on the pretrained BERT-based models, of which DeBERTaV3 gave the best baseline score. We experimented with the DeBERTaV3 model, tuning parameters such as the learning rate and the number of epochs to test their effect on the model's accuracy. We also used Focal Loss (Lin et al., 2017) and Class Weights to deal with class imbalance in the dataset. We finally applied majority voting over the top three models to obtain the final labels.

Class Imbalance
We noticed a major class imbalance in the Task A training dataset, with 10,602 not sexist examples and only 3,398 sexist examples. To deal with this, we experimented with class-imbalance handling measures such as Focal Loss and per-class Class Weights.
Focal Loss: Focal Loss is primarily used in classification tasks to deal with class imbalance. It is an improvement over the cross-entropy loss function (Zhang and Sabuncu, 2018) that addresses class imbalance by assigning more weight to hard or easily misclassified examples and down-weighting easy examples, thereby emphasizing the correction of misclassified ones.
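This down-weighting behaviour can be seen in a minimal NumPy sketch of the binary focal loss (illustrative only; our submission used a framework implementation rather than this function):

```python
import numpy as np

def focal_loss(p, y, alpha=1.0, gamma=2.0):
    """Binary focal loss, averaged over examples.

    p: predicted probability of the positive class, shape (n,)
    y: gold labels in {0, 1}, shape (n,)
    alpha is a class-weighting factor; gamma controls how strongly
    easy (well-classified) examples are down-weighted.
    """
    p_t = np.where(y == 1, p, 1.0 - p)              # prob. of the true class
    loss = -alpha * (1.0 - p_t) ** gamma * np.log(p_t)
    return loss.mean()

# An easy, well-classified example contributes far less than a hard one:
easy = focal_loss(np.array([0.95]), np.array([1]))  # p_t = 0.95
hard = focal_loss(np.array([0.30]), np.array([1]))  # p_t = 0.30
assert easy < hard
```

Setting gamma to 0 removes the modulating factor and recovers plain cross-entropy, which makes the "improvement over cross-entropy" relationship explicit.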

Task B and C: Multi-Class Classification
These tasks were multi-class classification problems: we had to predict the category label in Task B and the vector label in Task C for texts already classified as sexist in Task A. We experimented with parameters such as the learning rate and the number of epochs, and used Focal Loss and Class Weights to handle class imbalance. Apart from these methods, we also trained the baseline DeBERTaV3 and RoBERTa models using 5-fold cross-validation and compared the effect of this training method on accuracy. The best-performing models were combined with majority voting to obtain the final labels.

Experimental Setup
The training data consists of 14,000 of the 20,000 total entries and is used for Task A. Of these 14,000 entries, the 3,398 labelled as sexist are further used in Tasks B and C. We strictly used the data provided by the organisers to train and test our models in the development and testing phases. We used 80% of the training dataset to train the models, while the remaining 20% was used for validation. Upon trying both ensembling methods, we found that majority voting gives better results than the weighted-average ensemble. Using majority voting reduced individual model errors and minimized the impact of outliers, leading to a more accurate and robust system.

For Task A, we used our three best-performing models for the ensemble, including the baseline DeBERTaV3 model, with focal loss and different hyperparameter values. The exact hyperparameters used for these three models are given in Table 4. The focal loss is defined as

FL(p_t) = -α (1 - p_t)^γ log(p_t)    (1)

where α is a weighting factor that gives more importance to the minority class, and γ is a tunable parameter that modulates the degree of down-weighting. After experimenting, we took α = 1.0 and γ = 2.0, the default values used in the original paper. We also made use of Class Weights along with Focal Loss, computed for each class as

n_samples / (n_classes × np.bincount(y))    (2)

However, upon experimenting, we observed that the results obtained were worse than those of the baseline models, hence we did not include this configuration in the final ensemble.
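The class-weight formula above matches scikit-learn's "balanced" heuristic and can be computed directly with NumPy. A short sketch using the Task A label counts reported in this paper:

```python
import numpy as np

# Balanced class weights: n_samples / (n_classes * np.bincount(y)).
# y reproduces the Task A training distribution: 10,602 not sexist (0)
# and 3,398 sexist (1) examples.
y = np.array([0] * 10602 + [1] * 3398)
weights = len(y) / (len(np.unique(y)) * np.bincount(y))
print(weights)  # the minority (sexist) class receives the larger weight
```

By construction, each class's weight times its count equals n_samples / n_classes, so both classes contribute equally to a weighted loss.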
In Tasks B and C, after much experimentation, we noticed that an ensemble of RoBERTa and DeBERTaV3 models trained with 5-fold cross-validation gave better results than the baseline models. The learning rate for both was 2e-5, with 4 epochs in each fold. The macro F1 score was the primary evaluation metric. We used the pretrained models available in the Hugging Face library and implemented the models using the ktrain library (Maiya, 2020), a lightweight wrapper for deep learning libraries.
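The 5-fold training scheme relies on stratified splits so that each fold preserves the class proportions. As a rough sketch of what such a splitter does (in practice one would use sklearn.model_selection.StratifiedKFold rather than this hand-rolled stand-in):

```python
import numpy as np

def stratified_kfold_indices(y, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose validation folds keep the
    per-class proportions of y. A minimal stand-in for StratifiedKFold."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        # deal this class's shuffled indices across the folds
        for i, chunk in enumerate(np.array_split(idx, n_splits)):
            folds[i].extend(chunk.tolist())
    for i in range(n_splits):
        val = np.array(sorted(folds[i]))
        train = np.array(sorted(j for k, f in enumerate(folds) if k != i
                                for j in f))
        yield train, val

# Toy imbalanced labels: every validation fold keeps the 2:1 class ratio.
y = np.array([0] * 100 + [1] * 50)
for train, val in stratified_kfold_indices(y):
    assert len(train) + len(val) == len(y)
```

Each of the 5 folds would then be used once for validation while a model is fine-tuned on the remaining 4, and the per-fold predictions feed into the ensemble.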

Results
Results for Tasks A, B and C are given in Table 5. While using focal loss does not improve the F1 score over the baseline DeBERTaV3 model in Task A, ensembling both models together with DeBERTaV3 at a learning rate of 4e-5 improves the overall performance. For Tasks B and C, the Focal Loss models performed poorly compared to the baselines despite some classes being imbalanced. We also noticed that a different training method, 5-fold cross-validation, improved performance over the baseline models for multi-class classification, and ensembling these models improved performance further.

Conclusion and Future Works
In this work, we benchmarked various pretrained Transformer-based models such as BERT, RoBERTa and DeBERTaV3, as well as their majority-voting ensembles. This detection and classification of online sexism was done under SemEval 2023 Task 10: Explainable Detection of Online Sexism (EDOS). We also used Focal Loss to deal with class imbalance. DeBERTaV3 gave the best F1 score of 0.8348 on Task A, 0.6381 on Task B, and 0.6381 on Task C. RoBERTa gave results close to DeBERTaV3 on both Tasks B and C.
In future work, we would like to (1) compare the results of the models on different datasets and languages; (2) use unsupervised learning to improve the results of our models; and (3) propose a novel custom architecture that provides more robust predictions.