HULAT at SemEval-2023 Task 10: Data Augmentation for Pre-trained Transformers Applied to the Detection of Sexism in Social Media

This paper describes our participation in SemEval-2023 Task 10, whose goal is the detection of sexism in social media. We explore some of the most popular transformer models, such as BERT, DistilBERT, RoBERTa, and XLNet. We also study different data augmentation techniques to enlarge the training dataset. During the development phase, our best results for tasks B and C were obtained by using RoBERTa with data augmentation. However, the use of synthetic data does not improve the results for task A. We participated in the three subtasks. Our approach still has much room for improvement, especially in the two fine-grained classifications. All our code is available in the repository https://github.com/isegura/hulat_edos.


Introduction
Sexism can be defined as behaviors or beliefs that support gender inequality and result in discrimination, generally against women. Contrary to what one might believe, sexism is still very present, even in the most developed and technologically advanced societies (Ridgeway, 2011). Proof of this is that many gender stereotypes are still present in our belief system today (for example, that men should not wear dresses). Unfortunately, social networks are used to spread hateful and sexist messages against women (Rodríguez-Sánchez et al., 2020).
During the last few years, various research efforts (Rodríguez-Sánchez et al., 2022; Fersini et al., 2022) have been devoted to the development of automatic tools for the detection of sexist content. While these automated tools have addressed the classification of sexist content, this is a high-level classification that does not provide additional information to help us understand why the content is sexist. The goal of SemEval-2023 Task 10, Explainable Detection of Online Sexism (EDOS) (Kirk et al., 2023), is to promote the development of fine-grained classification models for detecting sexism in posts written in English, which were collected from social networks such as Gab and Reddit. The organizers of the task proposed three subtasks: A) Binary Sexism Detection; B) Category of Sexism, a four-class classification task; and C) Fine-grained Vector of Sexism, an 11-class classification task. A detailed description of these classifications can be found in Kirk et al. (2023).
In our approach, we explored some of the most popular pre-trained transformer models, such as BERT (Devlin et al., 2019), DistilBERT (Sanh et al., 2019), RoBERTa (Zhuang et al., 2021), and XLNet (Yang et al., 2019). Moreover, we used different data augmentation techniques, such as EDA (Wei and Zou, 2019) and the NLPAug library, to create synthetic data. The synthetic data and the training data were then used to fine-tune the models. Based on our experiments during the development phase, we decided to use the RoBERTa transformer model to generate our predictions for the test dataset during the test phase.
We participated in the three subtasks. In task A, our system obtained a macro F1-score of 0.8298, ranking 43rd out of the 84 teams in the final ranking. The top system achieved a macro F1-score of 0.8746, while the lowest macro F1-score was 0.5029. About half of the systems achieved a macro F1-score below 0.83. In task B, our system ranked 45th out of the 69 participating systems. Our macro F1-score was 0.5877, while the lowest and highest macro F1-scores were 0.229 and 0.7326, respectively. In task C, our team ranked 27th out of the 63 participating systems. The lowest and highest macro F1-scores were 0.06 and 0.56, respectively. About half of the systems achieved a macro F1-score below 0.42, while our system had a macro F1-score of 0.44.
Our systems, which ranked roughly in the middle of the three rankings, show modest results on the three subtasks. Our approach still has much room for improvement, especially in the two fine-grained classifications. The results showed that the use of synthetic data does not appear to provide a significant improvement in the performance of the transformers. All our code is available in the repository https://github.com/isegura/hulat_edos.

Background
The goal of this task is to detect sexist content. The task is composed of three subtasks: A, B, and C. Task A is a binary classification task to distinguish between sexist and non-sexist texts. Tasks B and C aim at a finer-grained classification with four and eleven classes, respectively.
The full dataset consists of 20,000 posts written in English. Half of the posts were taken from Reddit and the other half from Gab. Gab is a social network known for its far-right users. The dataset was divided into three splits with a ratio of 70:10:20. That is, 14,000 posts were used for training, 2,000 for development, and 4,000 for the final evaluation.
We have studied the class distribution in each task. In task A, a binary classification, the two classes are not balanced: the not-sexist class is the majority class. The same distribution is observed in the three datasets (see Fig. 3). We also plot the distribution of categories for task B (see Fig. 4). In the dataset, the label for the second task is the field 'label_category'. It contains four different categories: "1. threats, plans to harm and incitement", "2. derogation", "3. animosity", and "4. prejudiced discussions". To obtain the distribution of these categories, we removed the records that were annotated as 'not sexist'. The majority category is "2. derogation". The class with the second largest number of instances is "3. animosity". The other two classes, "4. prejudiced discussions" and "1. threats, plans to harm and incitement", are the minority classes and have a similar number of instances. The same distribution is observed in the three datasets. Regarding the distribution of the vectors in task C (see Fig. 5), the vector "2.1 descriptive attacks" is the majority class, while "3.4 condescending explanations or unwelcome advice" is the minority class. The vectors follow a distribution similar to that of their corresponding categories. For example, the vectors with the largest number of instances are usually the vectors of the category "2. derogation", followed by the vectors corresponding to the category "3. animosity".
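As an illustration of how the task B distribution can be obtained, the following minimal sketch counts the categories after discarding the records annotated as 'not sexist'. Only the field name 'label_category' comes from the dataset description; the field 'label_sexist', the value 'none', and the toy records are assumptions made for this example.

```python
from collections import Counter

# Toy records. 'label_sexist' and the value 'none' are assumed names;
# 'label_category' is the field mentioned in the text.
records = [
    {"label_sexist": "not sexist", "label_category": "none"},
    {"label_sexist": "sexist", "label_category": "2. derogation"},
    {"label_sexist": "sexist", "label_category": "2. derogation"},
    {"label_sexist": "sexist", "label_category": "3. animosity"},
    {"label_sexist": "not sexist", "label_category": "none"},
    {"label_sexist": "sexist", "label_category": "1. threats, plans to harm and incitement"},
]


def category_distribution(rows):
    """Count task B categories after dropping rows annotated as 'not sexist'."""
    sexist_rows = [row for row in rows if row["label_sexist"] != "not sexist"]
    return Counter(row["label_category"] for row in sexist_rows)


distribution = category_distribution(records)
majority_category, majority_count = distribution.most_common(1)[0]
print(majority_category, majority_count)
```

The same counting can be applied to each of the three splits to check that they follow a similar distribution.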
We also studied the length of the texts in the datasets (see Fig. 6). There are no significant differences between the three datasets. The mean number of tokens is around 23, and the maximum length is approximately 55 tokens.
We also wanted to know whether there are differences in the length of the texts between the two main classes: sexist and non-sexist. As the three datasets show a similar distribution, we created a density graph for the whole dataset (Fig. 1) that shows the distribution of sexist texts and non-sexist texts. Although sexist and non-sexist texts appear to have a very similar distribution of their lengths, we can observe that some sexist texts may be slightly longer than non-sexist texts. Figure 2 shows the length distribution of the texts for each category in task B. We can see that the texts classified as "4. prejudiced discussions" appear to be longer than the other texts. The category "1. threats, plans to harm and incitement" has the shortest texts. Indeed, the average length of the texts in the first category is around 22 tokens, while in the fourth category it is around 27 tokens. The other two categories, "2. derogation" and "3. animosity", show very similar distributions, with an average length of 25 tokens for their texts.
We also studied the length distribution of the texts for each vector. As there are eleven vectors, it is very difficult to compare their distributions (see Fig. 7). For this reason, we created a density graph for the vectors of each category (see Appendix). All vectors have a very similar distribution of text length. Texts classified as '4.1 supporting mistreatment of individual women' or '4.2 supporting systemic discrimination against women as a group' tend to have the largest average length, between 27 and 30 tokens. The vector '2.1 descriptive attacks' has an average length of 26 tokens. The vector '1.2 incitement and encouragement of harm' has the smallest average length (around 22 tokens). The other vectors have an average length between 23 and 25 tokens. Therefore, there do not seem to be significant differences between the lengths of the texts of each vector.
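The per-class averages discussed above can be computed with a few lines of code. The sketch below is only illustrative: it assumes whitespace tokenization (the tokenizer used for the length analysis is not specified here) and uses placeholder texts and labels.

```python
from collections import defaultdict


def mean_length_by_label(texts, labels):
    """Average number of whitespace-separated tokens per label."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for text, label in zip(texts, labels):
        totals[label] += len(text.split())
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Placeholder texts; real posts would be read from the dataset files.
texts = ["a b c", "a b c d e", "x y", "x y z w"]
labels = ["sexist", "sexist", "not sexist", "not sexist"]
means = mean_length_by_label(texts, labels)
print(means)
```

The same function works for the four categories of task B or the eleven vectors of task C by passing the corresponding labels.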
3 System Overview
BERT (Devlin et al., 2019) is the most popular transformer model due to its excellent results in many NLP tasks. BERT is an encoder trained using two strategies: masked language modeling (MLM) and next sentence prediction (NSP). The multilingual version of BERT was pre-trained on more than one hundred languages using Wikipedia. DistilBERT (Sanh et al., 2019) is a smaller version of BERT, which can achieve results similar to those of BERT with less training time.
RoBERTa (Zhuang et al., 2021) is based on BERT. RoBERTa was pre-trained using additional data. Unlike BERT, RoBERTa does not use the next sentence prediction (NSP) strategy. Regarding the MLM strategy, the tokens are masked dynamically during pre-training. Another difference from BERT is that RoBERTa uses a byte-level BPE tokenizer, which has a larger vocabulary than that of BERT (50k vs. 30k). This larger vocabulary can provide better results, but at the cost of increased complexity.
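To make the idea of dynamic masking more concrete, the following simplified sketch draws a new masking pattern every time a sequence is seen, instead of fixing the pattern once during preprocessing. It is an illustration under stated assumptions, not the actual pre-training code: the mask token string and the helper name are hypothetical, and the sketch omits the usual replacement of some selected tokens with random or unchanged tokens.

```python
import random

MASK = "<mask>"  # assumed mask token string, for illustration only


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens where each position is masked with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else token for token in tokens]


tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)

# Static masking would call mask_tokens once and reuse the result in every epoch.
# Dynamic masking draws a fresh pattern each time the sequence is fed to the model.
# A higher probability than the usual 0.15 is used here so the short example shows masks.
epochs = [mask_tokens(tokens, rng, mask_prob=0.3) for _ in range(3)]
for epoch, masked in enumerate(epochs):
    print(epoch, " ".join(masked))
```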
XLNet (Yang et al., 2019) is an autoregressive model. That is, it was pre-trained to predict the next token for a given input sequence of tokens. XLNet does not use any masking strategy. Instead, it uses permutation language modeling, which captures context by training an autoregressive model on all possible permutations of the words in a sentence. This allows the model to create bidirectional contextualized representations of words. Like BERT, this model was trained with Wikipedia and BooksCorpus, but also with Giga5, ClueWeb 2012-B, and Common Crawl.
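The following toy sketch illustrates the idea behind permutation language modeling: for one sampled factorization order, each token is predicted only from the tokens that precede it in that order, which may lie to its left or to its right in the sentence. The function name and the example sentence are assumptions for illustration; the sketch shows the prediction contexts only and does not train any model.

```python
import random


def permutation_contexts(tokens, rng):
    """For one sampled factorization order, pair each target token with its visible context."""
    order = list(range(len(tokens)))
    rng.shuffle(order)
    pairs = []
    for step, position in enumerate(order):
        # Only tokens that come earlier in the sampled order are visible;
        # they are listed in their original left-to-right positions.
        visible = sorted(order[:step])
        pairs.append((tokens[position], [tokens[i] for i in visible]))
    return order, pairs


tokens = ["New", "York", "is", "a", "city"]
order, pairs = permutation_contexts(tokens, random.Random(1))
for target, context in pairs:
    print(f"predict {target!r} from {context}")
```

Averaged over many sampled orders, every token is predicted from contexts on both sides, which is how the bidirectional representations arise.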

Data augmentation
Data augmentation (DA) aims to increase the size of the training set by applying different transformations to the original dataset. For example, in computer vision, images can be modified by cropping, flipping, changing colors, and rotating. In NLP, these transformations include swapping tokens (but also characters or sentences), random deletion or insertion of tokens (but also characters or sentences), and back translation of texts between different languages. While such transformations are easy to implement in computer vision, they are challenging in NLP because they can alter the grammatical structure of a text.
Another advantage is that these techniques help to enhance the diversity of the examples in the dataset. Moreover, they also help to avoid overfitting. Unfortunately, data augmentation does not always improve the results in NLP tasks.
In this task, we used different data augmentation techniques, such as EDA (Wei and Zou, 2019) and the NLPAug library, to create synthetic data.
EDA has been implemented in the textaugment library for Python. EDA uses four simple operations: Synonym Replacement, Random Insertion, Random Swap, and Random Deletion. The first operation randomly chooses n words in a sentence (which are not stopwords). Then, these words are replaced with synonyms from WordNet, a very large lexicon for English. Random Insertion chooses a random word (which is not a stopword). Then, it finds a random synonym of that word, which is inserted in a random position in the sentence. The third operation, Random Swap, randomly chooses two words in the sentence and swaps their positions. The fourth operation, Random Deletion, randomly removes a word from a sentence. These operations can be repeated several times.
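The four operations described above can be sketched in plain Python as follows. This is a minimal illustration written from the description in this section, not the textaugment implementation: a small hand-made synonym table and stopword list (both assumptions) stand in for WordNet, and all function names are hypothetical.

```python
import random

# Toy synonym table and stopword list standing in for WordNet (assumptions).
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"], "big": ["large", "huge"]}
STOPWORDS = {"the", "a", "is", "and", "very"}


def candidates(words):
    """Indices of words that are not stopwords and have a known synonym."""
    return [i for i, w in enumerate(words) if w not in STOPWORDS and w in SYNONYMS]


def synonym_replacement(words, n, rng):
    """Replace up to n randomly chosen non-stopwords with a random synonym."""
    out = list(words)
    idxs = candidates(out)
    for i in rng.sample(idxs, min(n, len(idxs))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words, rng):
    """Insert a synonym of a random non-stopword at a random position."""
    out = list(words)
    idxs = candidates(out)
    if idxs:
        synonym = rng.choice(SYNONYMS[out[rng.choice(idxs)]])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, rng):
    """Swap the positions of two randomly chosen words."""
    out = list(words)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, rng):
    """Remove one randomly chosen word, keeping at least one word."""
    out = list(words)
    if len(out) > 1:
        del out[rng.randrange(len(out))]
    return out


rng = random.Random(13)
sentence = "the quick dog is very happy".split()
for operation in (random_insertion, random_swap, random_deletion):
    print(operation.__name__, "->", " ".join(operation(sentence, rng)))
print("synonym_replacement ->", " ".join(synonym_replacement(sentence, 1, rng)))
```

Each function returns a new list and leaves the input untouched, so the operations can be chained or repeated to produce several synthetic variants of the same text.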
NLPAug also provides an efficient implementation of DA techniques. In particular, NLPAug offers three types of augmentation: character-level, word-level, and sentence-level augmentation. At each of these levels, NLPAug provides all the operations described above, that is, synonym replacement, random deletion, random insertion, and swapping. Regarding synonym replacement, the most effective way is to use word embeddings to select the synonyms. This technique allows us to obtain a sentence with the same meaning but with different words. NLPAug uses non-contextual embeddings (such as GloVe, word2vec, etc.) or contextual embeddings (such as BERT, RoBERTa, etc.).
In this work, we use the synonym replacement provided by EDA, which is based on WordNet. In addition, using NLPAug, we generate new texts with a contextualized language model such as BERT.

Experimental Setup
During the development phase, we divided the training dataset into three splits: training, validation, and test, with a ratio of 70:10:20. These three splits were used to train and evaluate the different models and data augmentation techniques. During the development phase, the data augmentation techniques were applied only to the training split.
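The following sketch illustrates this setup: the data are split 70:10:20, and synthetic examples are generated from the training split only, so that the validation and test splits contain only original texts. The random shuffle, the seed, and the placeholder `augment` function are assumptions for illustration; they do not reproduce the exact procedure used in our experiments.

```python
import random


def split_70_10_20(items, seed=0):
    """Shuffle the items and split them into training, validation, and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 70 // 100
    n_val = len(items) * 10 // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


def augment(example):
    """Placeholder augmentation (assumption): tags the text instead of rewriting it."""
    text, label = example
    return (text + " [aug]", label)


data = [(f"post {i}", i % 2) for i in range(100)]
train, val, test = split_70_10_20(data)

# Synthetic examples are created from the training split only.
train_augmented = train + [augment(example) for example in train]
print(len(train_augmented), len(val), len(test))
```

Keeping the augmentation after the split avoids evaluating on synthetic variants of texts that were seen during training.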
However, during the test phase, we used the full training dataset provided by the organizers to train our model. Moreover, we applied the data augmentation techniques to the full training dataset to obtain more synthetic data. The organizers published the gold labels for the development dataset, so we could use the development dataset as our validation set when training our model and creating the final predictions for each task.
Based on our results in the development phase, for tasks B and C, we decided to use RoBERTa combined with data augmentation techniques to generate the final predictions. However, for task A, we only used RoBERTa, because the data augmentation techniques did not appear to improve the results for the binary classification.

Results
HULAT participated in the three subtasks. Below we present our results in each task.

Task A
As mentioned above, we fine-tuned a RoBERTa model using the full training dataset, without using synthetic data. Our system obtained a macro F1-score of 0.8298, ranking 43rd out of a total of 84 participating systems. The highest macro F1-score was 0.8746, while the lowest was 0.5029. About half of the systems achieved a macro F1-score below 0.83. Table 1 shows the results on the test dataset for task A. We evaluated all the combinations that we studied during the development phase.
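The scores reported here are macro F1-scores, that is, the unweighted mean of the per-class F1-scores. The sketch below is written from the standard definition and is not the official scorer of the task; the toy labels are invented to show why this metric is informative when the classes are imbalanced.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority class on an 8:2 toy dataset.
y_true = ["not sexist"] * 8 + ["sexist"] * 2
y_pred = ["not sexist"] * 10
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the accuracy is 0.8, but the macro F1-score is only about 0.44, because the minority class contributes an F1 of zero. The same effect explains why a class for which no instance is correctly predicted lowers the macro F1-score considerably in the fine-grained tasks.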

We evaluated both the uncased and the cased versions of BERT. BERT uncased shows better results than the cased version (more than one point of improvement). The use of data augmentation does not improve the results of the BERT model in either of its versions, cased or uncased.
DistilBERT obtains slightly lower results than BERT, though its training time is much shorter. Data augmentation helps to increase recall, but at the cost of lower precision. The improvement in F1 is not significant. There are hardly any differences between the results of the cased and uncased versions of DistilBERT.
XLNet has very similar results to those obtained by the uncased version of BERT. The data augmentation techniques do not appear to improve the results.
RoBERTa outperforms all the previous approaches, with improvements in macro F1-score of between 1 and 3 points. In particular, RoBERTa achieves better precision than DistilBERT and BERT. Regarding the results obtained with data augmentation, the use of synthetic data negatively affects the precision of RoBERTa.
In sum, all the models show very close results, and data augmentation does not improve the results. RoBERTa slightly outperforms the other models.

Task B
In this task, we fine-tuned the RoBERTa model using the full training dataset and the synthetic data created with the data augmentation techniques described in section 3. Our system ranked 45th out of the 69 participating systems. Our macro F1-score was 0.5877, while the lowest and highest macro F1-scores were 0.229 and 0.7326, respectively. Table 2 shows the results on the test dataset for task B. We evaluated all the combinations that we studied during the development phase.

In task B, we again evaluated both the uncased and the cased versions of BERT. Although both versions obtain close results, the uncased version shows slightly better precision and recall than the cased one. For the cased version of BERT, data augmentation improves the precision (around one point) but significantly lowers the recall (more than three points). It also has a negative effect on the performance of the BERT uncased model.
While BERT and its simplified version, DistilBERT, show close results in task A, DistilBERT shows worse performance than BERT (around six points lower in macro F1-score) in task B, a four-class classification. Contrary to BERT, the cased version of DistilBERT is slightly superior to its uncased version. However, the results are so close that these differences are not statistically significant. The use of data augmentation yields an improvement of around five points in recall (in both versions of DistilBERT), but with a slight decrease in precision. In terms of macro F1-score, data augmentation obtains an improvement of two points. Therefore, unlike BERT, DistilBERT obtains some improvement thanks to the use of data augmentation.
XLNet outperforms DistilBERT, showing similar results to BERT. As with BERT, data augmentation does not appear to help XLNet in classifying the four categories of sexism. Like BERT and XLNet, RoBERTa achieves a macro F1-score of 0.595. Data augmentation increases the recall, but with a significant decrease in precision. However, RoBERTa with data augmentation obtained the best results on the development set during the development phase. For this reason, we decided to use this combination for our final submission in the test phase.
Table 3 shows the results of RoBERTa with data augmentation for each category. Although the category '1. threats, plans to harm and incitement' has the lowest number of instances in the dataset (see Fig. 4), it obtains the top F1 (0.624). The posts in this category are shorter than the posts in the rest of the categories (see Fig. 2). Moreover, an analysis of these texts shows that they usually use a very violent vocabulary. Indeed, some of their most common words are: 'bitch', 'kill', 'rape', 'fuck', 'punch', 'beat', 'kick', 'hang', 'death', or 'slap'. The category with the lowest F1 is '4. prejudiced discussions', with around 10 points less than the F1 for the first category. The lower score may be due to the fact that this category has very few instances compared to the second (derogation) and third (animosity) categories (see Fig. 4). Moreover, its texts tend to be longer than the texts of the first category (threats) (see Fig. 2). The scarcity of examples in this category, together with the fact that its texts do not use aggressive vocabulary as those of the first category do, may make them very challenging to classify.

Task C
In task C, we used the same approach as for task B, that is, RoBERTa and data augmentation techniques.
Our system obtained a macro F1-score of 0.4458, ranking 27th out of the 63 participating systems. The lowest and highest macro F1-scores were 0.06 and 0.56, respectively. About half of the systems achieved a macro F1-score below 0.42. Table 4 shows the results on the test dataset for task C. We evaluated all the combinations that we studied during the development phase.
The cased version of BERT slightly outperforms the uncased version. Unlike in tasks A and B, data augmentation techniques appear to have a positive effect on the results for task C. They yield an improvement of more than 10 points for BERT uncased and of eight points for the cased version.
DistilBERT provides lower results than BERT. Both versions of DistilBERT, cased and uncased, show very close results. As with BERT, data augmentation improves the results.
XLNet outperforms BERT with an increase of around three points in macro F1-score when data augmentation is used. RoBERTa obtains the best scores, outperforming the other models. In addition, when data augmentation is used, the model obtains a significant improvement of 10 points in macro F1-score.
In sum, RoBERTa trained with training and synthetic data is the best approach for task C. Table 5 shows the results of RoBERTa with data augmentation for each vector. The model could not classify any instance of the vector '3.4 condescending explanations or unwelcome advice', which has only 14 instances in the test dataset and 47 in the training dataset. Although our model was trained with synthetic examples (in particular, 94 for this label), the total number of examples for this vector is still very small. Although the vector '1.2 incitement and encouragement of harm' is not one of the vectors with the largest number of instances, it shows the best F1-score (0.657). As previously discussed for category 1, the texts classified with this vector tend to be shorter and include very violent words such as 'bitch', 'fuck', 'kill', or 'kick'. The vector '3.1 casual use of gendered slurs, profanities, and insults' achieves the second highest F1-score (0.646). The vector 3.1 has the third highest number of instances in the dataset. Regarding the other vectors, we observe that the fewer instances a vector has, the lower the F1-score it obtains. When RoBERTa is trained without synthetic data, it cannot classify any instance of three vectors: 1.1, 3.3, and 3.4. Therefore, data augmentation techniques improve the results of task C.

Conclusion
Our team participated in the three tasks with an approach based on RoBERTa fine-tuned with training data and synthetic data created by data augmentation techniques. This approach shows very modest results on the three tasks (our systems rank approximately in the middle of the three rankings). We still have much room for improvement, especially in the two fine-grained classifications. While data augmentation does not achieve a significant improvement in tasks A and B, it has a positive effect on the results in task C.
As future work, we plan to extend our research on data augmentation techniques to augment the training data. For example, we plan to use back translation (Sugiyama and Yoshinaga, 2019). In addition, we will exploit other datasets for the detection of sexist content, such as the EXIST dataset (Rodríguez-Sánchez et al., 2021) or MAMI (Fersini et al., 2022), to also approach the task from two different scenarios: multilingual and multimodal.

A Appendix
In this section, we provide supplementary material for our research.
Figure 3 shows the distribution of the classes sexist and not sexist in the three datasets. Figure 4 shows the distribution of the four categories in task B. Figure 5 shows the distribution of the eleven vectors in task C.

Figure 1: Density graph of the length of texts for the classes sexist and not sexist.

Figure 2: Density graph of the length of texts for each category (task B).

Figure 3: Class distribution for task A.

Figure 4: Class distribution for task B.

Figure 5: Class distribution for task C.

Figure 6: Distribution of text length (number of tokens) for each dataset.

Figure 6 shows the distribution of text length in each dataset. Figure 7 is a density graph showing the distribution of text length for each vector in task C. In addition, Figures 8-11 show the distribution of the text length for the vectors of each of the four categories: "1. threats, plans to harm and incitement", "2. derogation", "3. animosity", and "4. prejudiced discussions".

Figure 7: Density graph of the length of texts for each vector in task C.

Figure 8: Density graph of the length of texts for vectors of the category 1.

Figure 9: Density graph of the length of texts for vectors of the category 2.

Figure 10: Density graph of the length of texts for vectors of the category 3.

Figure 11: Density graph of the length of texts for vectors of the category 4.

Table 1: Results for task A on the final test dataset.

Table 2: Results for task B on the final test dataset.

Table 3: Results provided by RoBERTa and data augmentation on the test dataset (task B) for categories: 1. threats, plans to harm and incitement, 2. derogation, 3. animosity, and 4. prejudiced discussions.

Table 4: Results for task C on the final test dataset.

References
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS 2019.
Amane Sugiyama and Naoki Yoshinaga. 2019. Data augmentation using back-translation for context-aware neural machine translation. In Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), pages 35-44.
Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6383-6389, Hong Kong, China. Association for Computational Linguistics.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32.
Liu Zhuang, Lin Wayne, Shi Ya, and Zhao Jun. 2021. A robustly optimized BERT pre-training approach with post-training. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 1218-1227, Huhhot, China. Chinese Information Processing Society of China.