Controversy and Conformity: from Generalized to Personalized Aggressiveness Detection

Content such as hate speech and offensive, toxic, or aggressive documents is perceived differently by its consumers. Such content is commonly identified using classifiers based solely on textual content that generalize pre-agreed meanings of difficult problems. Such models provide the same results for each user, which leads to a high misclassification rate, observable especially for contentious, aggressive documents. Both document controversy and user nonconformity require new solutions. Therefore, we propose novel personalized approaches that respect individual beliefs expressed by either user conformity-based measures or various embeddings of their previous text annotations. We found that only a few annotations of the most controversial documents are enough for all our personalization methods to significantly outperform classic, generalized solutions. The more controversial the content, the greater the gain. The personalized solutions may be used to efficiently filter unwanted aggressive content in a way adjusted to a given person.


Introduction
Unfortunately, in the pursuit of knowledge on the Internet, one may come across content that they consider inappropriate for various reasons, such as being too aggressive. Many users regularly come across content that offends them while surfing the Internet. This can cause discomfort and discourage them from further expanding their knowledge. To avoid this, it is important to effectively filter out content that a given user may find unwanted. This poses a risk of erroneous assessment of whether a given text is considered inappropriate by a given person. For that purpose, we need to extend commonly applied generalizing solutions and develop personalized methods that take into account the beliefs and preferences of the individual user. We expect this information can be obtained from the individual's prior opinions about the offensiveness of some texts. Then, it is crucial to select the relevant texts that allow deriving as much information about the user's preferences as possible. Our new idea is to use some known, most controversial texts whose offensiveness is very ambiguous and depends more on subjective personal judgment. We examined how many documents have to be annotated by a given user to encapsulate their beliefs sufficiently and to improve personalized reasoning. Independently, we considered personal measures quantifying the conformity of each individual. In other words, we measured to what extent a person evaluates documents similarly to others, i.e. "is a part of the mainstream". The conformity measures are used as input features for the classifier. This way, it is possible to find out the user beliefs based on their opinions regarding a relatively small number of texts. In this paper, we present novel methods of personalized aggressive content detection based on the representation of user opinion about aggressive texts. We propose: (1) conformity-based personalization, (2) class-based embeddings, and (3) annotation-based embeddings (Sec. 6).
Our experiments were performed on the only relevant dataset, Wikipedia Talk Labels: Aggression (Sec. 3). Having defined and calculated the controversy of documents and the conformity of users (Sec. 4), we validated our methods. The results revealed that additional individualized features (simple user conformity measures computed on a few texts, or embeddings of even four controversial texts) significantly boost our personalized classification (Sec. 8). The gain provided by our personalized methods is greater for more controversial documents. This work is based on the results obtained in the article . In addition, in (Milkowski et al., 2021), we showed that the personalized approach is also effective for other subjective problems in NLP, such as recognizing emotions elicited by text. The source code we used to conduct experiments and evaluation is publicly available in the CLARIN-PL GitHub repository 1.

Related work
A steady increase is observable in the number of offensive (Levmore and Nussbaum, 2010), hate (Breckheimer, 2001; Brown, 2018), aggressive, toxic, cyberbullying (Chen et al., 2012), or simply socially unacceptable online messages (Ljubešić et al., 2019). There are many definitions of offensive speech, which can be summarised as speech that targets specific social groups in a way that is harmful to them (Jacobs, 2002). Some countries, such as the USA, protect the right to use this type of speech as an acceptable form of political expression (Heyman, 2008). In turn, the law prohibits hate speech in many EU countries (Rosenfeld, 2002). Such laws pose a challenge for operators of social networking sites and other online services to identify and moderate unacceptable content. Large companies such as Facebook and Google are often accused of not doing enough to ensure that their platforms are not used to attack other people (Ben-David and Fernández, 2016). On the other hand, attempts to automatically control content often lead to the accidental blocking of content that was not intended to offend anyone.
Ambiguity of the definition of offensiveness is a serious problem. This inconsistency is visible in many reviews related to automatic detection of hate speech (Fortuna and Nunes, 2018;Schmidt and Wiegand, 2017;Alrehili, 2019;Poletto et al., 2020) or more specifically on aggressiveness detection (Sadiq et al., 2021;Modha et al., 2020).
In articles focused on the detection of aggressiveness (Modha et al., 2018; Risch and Krestel, 2018; Safi Samghabadi et al., 2020), the most often used datasets were those shared at the Workshops on Trolling, Aggression and Cyberbullying (TRAC) (Kumar et al., 2018, 2020) at LREC. A few others also used the Wikipedia Talk Labels: Aggression (Wulczyn et al., 2017b), where all individual annotations are available, not just the majority vote. Unfortunately, we have not found any other aggression dataset for which this information would also be given. Moreover, the authors focus mainly on the multilingual aspect of aggression detection (Modha et al., 2018; Risch and Krestel, 2018; Safi Samghabadi et al., 2020). In addition to deep neural models, less complex methods such as logistic regression are also used (Modha et al., 2018; Risch and Krestel, 2018).
To the best of our knowledge, there is no work that has dealt with the subjective problem of aggressiveness detection in a personalized way. The disagreement between annotators is usually measured by a single value, e.g. using Cohen's kappa or Krippendorff's alpha, and not investigated further. Researchers prefer a higher agreement level rather than controversy. Therefore, majority annotation is used in modeling, which to some extent leads to the loss of valuable information.
There are several studies focusing on the problem of the disagreement in data annotations. This provides valuable information not only about the annotators, but also about the instances by reflecting their ambiguity (Aroyo and Welty, 2013). There may be no single right label for every text.
The disagreement was used to divide annotators into polarized groups (Akhtar et al., 2020) or to filter out the spammers (Raykar and Yu, 2012; Soberón et al., 2013). In (Gao et al., 2019), attention was also drawn to the problem of conformity bias, where the reviewers tend to issue similar opinions. Less frequently, the disagreement is examined at the instance level, to measure its controversy or ambiguity, as in (Aroyo and Welty, 2013). For example, (Chklovski and Mihalcea, 2003) used confusion matrices in word sense tagging task to create and explore coarse sense clusters.

Dataset: Wikipedia Talk Labels
We used the Wikipedia Talk Labels: Aggression data, gathered in the Wikipedia Detox project (Wulczyn et al., 2017b,a). Unlike other collections, it provides information about all annotations given by Crowdflower workers (not only the majority vote) for 100k+ comments from English Wikipedia. The assigned aggression score ranged from very aggressive (-3), via neutral (0), to very friendly (3). It was binarized to '1 -aggressive' for negative scores or '0 -nonaggressive' for neutral or friendly annotations. The dataset contained a suggested data split into train, dev and test set.
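The binarization rule described above can be sketched in a few lines of Python. The function name `binarize` is ours and purely illustrative; it is not taken from the released source code.

```python
def binarize(score: float) -> int:
    """Map an aggression score in [-3, 3] to a binary label.

    Negative scores become 1 (aggressive); neutral (0) and
    friendly (positive) scores become 0 (nonaggressive).
    """
    return 1 if score < 0 else 0


# Example: very aggressive, mildly aggressive, neutral, friendly
labels = [binarize(s) for s in (-3, -1, 0, 2)]
```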
To enable our experiments, we removed annotations assigned by workers with fewer than 100 annotations in the train set, fewer than 20 in the dev set, or fewer than 20 in the test set. Otherwise, we would not have data to extract user beliefs from and to perform personalization. We also removed users who did not assign any aggressive label in the dev set. Information about at least one text that a specific user considered aggressive was crucial to model their individual perception of such content. Finally, there were 2,450 annotators left (Tab. 1), so we randomly divided them into 10 equal-sized folds.
The train set is used to calculate the representations (embeddings) of documents being classified. This is the only data exploited in the classic, generalizing approach (our baseline). The dev set provides information about user beliefs, i.e. their previous annotations. Individualized input features are extracted from dev data: (1) conformity measures and (2) personal embeddings in class-based and annotation-based personalization. Personalization-related calculations on the dev set refer to both the training and the testing procedure. The documents from the test set are embedded and classified by the trained model for validation purposes. The dev texts are solely used to quantify user beliefs: user conformity and personal embeddings. In Fig. 1, each cell is a single text (comment) and its individual annotation.
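The filtering and fold assignment can be illustrated with the following Python sketch. The data layout (tuples of annotator id, split name, and binary label), the function names, and the fixed random seed are our assumptions for illustration only.

```python
import random
from collections import Counter


def eligible_annotators(annotations, min_train=100, min_dev=20, min_test=20):
    """Return sorted ids of annotators that pass the filtering rules.

    `annotations` is an iterable of (annotator_id, split, label) tuples,
    with split in {"train", "dev", "test"} and label in {0, 1}.
    """
    counts = Counter()
    aggressive_in_dev = set()
    for annotator, split, label in annotations:
        counts[(annotator, split)] += 1
        if split == "dev" and label == 1:
            aggressive_in_dev.add(annotator)
    annotators = {a for a, _ in counts}
    return sorted(
        a for a in annotators
        if counts[(a, "train")] >= min_train
        and counts[(a, "dev")] >= min_dev
        and counts[(a, "test")] >= min_test
        and a in aggressive_in_dev  # at least one aggressive label in dev
    )


def user_folds(annotators, n_folds=10, seed=0):
    """Randomly split annotators into n_folds folds of near-equal size."""
    shuffled = list(annotators)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]
```

With 2,450 remaining annotators, `user_folds` yields ten folds of 245 users each.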

Controversy and Conformity Measures
For training and testing purposes, both controversy Contr for documents and conformity GConf, WConf for users are calculated within the dev set.

Controversy
Controversy Contr(d) ∈ [0, 1] of document d is an entropy-based measure expressed in the following way:
$$\mathrm{Contr}(d) = -\sum_{c \in \{0,1\}} \frac{n_d^c}{n_d} \log_2 \frac{n_d^c}{n_d}$$
where n_d^0 and n_d^1 are the numbers of negative (nonaggressive) and positive (aggressive) annotations assigned to document d, respectively; n_d is the total number of document d's annotations, n_d = n_d^0 + n_d^1; n_d^c / n_d approximates the probability that an annotation of document d is of class c, and a term with n_d^c = 0 contributes 0. Contr(d) = 0 means that all users annotated d the same; Contr(d) = 1 when 50% of users perceived it as aggressive and 50% did not.
Controversy Contr(d) is used to rank documents from the dev dataset. The most controversial texts (top k) are embedded in class-based or annotation-based personalization. Independently, controversy is computed within the test data in order to investigate differences in reasoning quality for more and less controversial documents.
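As an illustration, the controversy measure and the top-k ranking can be sketched in Python. We assume here that Contr(d) is the binary Shannon entropy (base 2) of the annotation distribution, which is consistent with the stated properties (0 for unanimous documents, 1 for a 50/50 split); the function names and the dictionary layout are hypothetical.

```python
import math


def controversy(n0: int, n1: int) -> float:
    """Binary-entropy controversy of a document (assumed form of Contr).

    n0, n1 are the numbers of nonaggressive and aggressive annotations.
    Returns 0.0 for unanimous documents and 1.0 for a 50/50 split.
    """
    total = n0 + n1
    if total == 0:
        raise ValueError("document has no annotations")
    result = 0.0
    for n_c in (n0, n1):
        if n_c > 0:  # treat 0 * log2(0) as 0
            p = n_c / total
            result -= p * math.log2(p)
    return result


def top_k_controversial(vote_counts, k):
    """Return ids of the k most controversial documents.

    `vote_counts` maps a document id to its (n0, n1) pair.
    """
    ranked = sorted(
        vote_counts, key=lambda d: controversy(*vote_counts[d]), reverse=True
    )
    return ranked[:k]
```

For instance, a document with five aggressive and five nonaggressive votes receives the maximal controversy of 1.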

General conformity
General conformity GConf(a, C) ∈ [0, 1] of human a quantifies how often a belongs to the majority of annotators evaluating individual texts. It can be of different kinds depending on the class C we consider. In its definition, A_a is the set of documents annotated by a; C denotes the conformity type related to the considered classes, i.e. C = {0}, {1} or {0, 1}; l_{d,a} is the class label assigned by a to document d; l_d is d's class label obtained by majority voting. In case of an equal number of annotations for both classes, document d is considered aggressive. GConf(a, C) = 1 when a annotated all documents d ∈ A_a the same as the others and no one annotated them otherwise. Note that depending on C, conformity can be calculated in three variants: for nonaggressive (C = {0}), aggressive (C = {1}) or any documents (C = {0, 1}) annotated by a. These three conformity values are used as input features in conformity-based personalization, Sec. 7.
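The exact formula is not reproduced in the text, so the following Python sketch encodes only one plausible reading of the description: the share of the annotator's labels, restricted to class set C, that agree with the majority label. Restricting by the annotator's own label (rather than by the majority label) and returning 0 when no annotation falls into C are our assumptions, as are all names used here.

```python
def majority_label(n0: int, n1: int) -> int:
    """Majority vote; ties count as aggressive (1), as stated in the text."""
    return 1 if n1 >= n0 else 0


def general_conformity(user_labels, vote_counts, classes=(0, 1)):
    """Share of the user's annotations within `classes` matching the majority.

    `user_labels` maps doc id -> label given by the user (0 or 1).
    `vote_counts` maps doc id -> (n0, n1) over all annotators.
    ASSUMPTION: documents are restricted by the user's own label; the
    text does not spell out this detail.
    """
    considered = [d for d, label in user_labels.items() if label in classes]
    if not considered:
        return 0.0  # assumed fallback when the user has no such annotations
    agreeing = sum(
        1 for d in considered
        if user_labels[d] == majority_label(*vote_counts[d])
    )
    return agreeing / len(considered)
```

Calling the function with `classes=(0,)`, `(1,)`, and `(0, 1)` gives the three conformity variants used as input features.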

Weighted conformity
Weighted conformity WConf(a, C) ∈ [0, 1] is similar to general conformity GConf(a, C), but it respects the size of the group the annotator belongs to while evaluating the document: the larger the group with annotator a, the greater the conformity of annotator a.
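Since the formula itself is not reproduced in the text, the sketch below is only one possible interpretation of the description: for every considered document, the annotator is credited with the fraction of all annotators who gave the same label, and these fractions are averaged. The averaging, the restriction by the annotator's own label, and all names are assumptions made for illustration.

```python
def weighted_conformity(user_labels, vote_counts, classes=(0, 1)):
    """Average share of annotators who agree with the user's label.

    `user_labels` maps doc id -> label given by the user (0 or 1).
    `vote_counts` maps doc id -> (n0, n1), the user's own vote included.
    ASSUMPTION: this averaged form is our reading of the description,
    not a formula taken from the paper.
    """
    considered = [d for d, label in user_labels.items() if label in classes]
    if not considered:
        return 0.0  # assumed fallback when the user has no such annotations
    total = 0.0
    for d in considered:
        counts = vote_counts[d]
        # size of the group sharing the user's label, relative to all votes
        total += counts[user_labels[d]] / sum(counts)
    return total / len(considered)
```

Under this reading, an annotator who always sides with large groups approaches 1, while one who often stands alone approaches 0.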

Controversy Analysis
To gain some insight into our data, we calculated controversy Contr(d) on each dataset (train/dev/test). Fig. 2 presents the distribution of annotations for the controversy measure in the dev and test sets. In both, the ratio of aggressive to nonaggressive documents increases and reaches 0.5 for the most controversial documents, i.e. Contr(d) = 1, resulting from the same number of aggressive and nonaggressive votes. Examples of such texts are the following: "Your behaviour is inappropriate and your reaction is ludicrous. Do they give out admin rights in cornflake packets now?", n_d^0 = n_d^1 = 5. "Far from being ridiculous, it is the recommended approach to follow on wikipedia. We don't simply state what either side claims, rather we report on how they are viewed by neutral 3rd party sources. Take it to WP:NPOVN if you don't believe me, rather than indulging in your continued disruptive habit of always having the WP:LASTWORD.", n_d^0 = n_d^1 = 14. We learned that classic methods based solely on content analysis (not personalized) perform worse the more controversial the tested documents are, Fig. 6. This was the main inspiration for our personalized methods.
We also checked the contribution of aggressive texts among the consecutive most controversial documents included in the personal user embeddings, Fig. 3.

Methods for Personalized Aggressiveness Detection
We assume that personal beliefs can be expressed by user activity, i.e. their individual annotations. It means that we can use information about k documents previously annotated by the user in the form of their embeddings or user conformity measures. It leads us to three novel personalization methods: (1) conformity-based, (2) class-based, and (3) annotation-based, Fig. 4. According to our initial studies, the most informative were user annotations provided for the most controversial documents.
In conformity-based personalization, we exploited simple conformity measures that represent the beliefs of one user in the aggregated way: GConf and WConf. Each of them can deliver three separate values: for only aggressive, only nonaggressive, and all texts. Finally, we examined input feature sets based on only GConf, only WConf, and on both, Sec. 7.
We also propose two versions of personal embeddings for previously annotated texts: class-based and annotation-based.
The class-based embedding consists of two fastText embeddings of k documents from the dev set that the user rated as (1) nonaggressive and (2) separately as aggressive, Fig. 4. Each of the two embeddings can aggregate an arbitrary and different number of previous user annotations; the embedding size is fixed for every k. If the user has not annotated any texts of a given class (e.g. aggressive), the embedding represents an empty string (zeros). Overall, it is a very rare case in our experiments, mostly happening for k = 1.
The annotation-based embeddings consider all k user annotations individually. For each such text d, we use the following features: (1) the embedding of d's content, (2) its controversy Contr(d), (3) the percentage of users who rated d as nonaggressive, (4) the rating of the given user (0/1), and (5) the information on whether this rating is consistent with the majority rating. Thus, we receive a relatively large number of input features: 300 + k * 304.
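A minimal sketch of how such a fixed-size, two-part user representation could be assembled is given below. Mean pooling of per-document vectors is used here only as a stand-in for the fastText text embedding; this aggregation choice and all names are assumptions, not the implementation used in the experiments.

```python
def mean_vector(vectors, dim):
    """Average a list of vectors; an empty list yields a zero vector."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def class_based_embedding(annotated, dim):
    """Concatenate one aggregate vector per class (nonaggressive, aggressive).

    `annotated` is a list of (doc_vector, user_label) pairs for the k
    personalization texts. The output size is 2 * dim regardless of k.
    ASSUMPTION: mean pooling stands in for the fastText text embedding.
    """
    nonaggressive = [vec for vec, label in annotated if label == 0]
    aggressive = [vec for vec, label in annotated if label == 1]
    return mean_vector(nonaggressive, dim) + mean_vector(aggressive, dim)
```

Note how a user with no aggressive annotations among the k texts still receives a vector of the same size, with zeros in the aggressive part.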
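The feature layout described above can be sketched as follows. The binary-entropy form of Contr(d), the tie rule for the majority label, and the function names are assumptions made for this illustration; real fastText vectors are replaced by plain lists.

```python
import math

EMB_DIM = 300  # fastText vector size reported in the paper


def annotation_features(doc_vec, n0, n1, user_label):
    """Features (1)-(5) for one previously annotated text: EMB_DIM + 4 values."""
    total = n0 + n1
    # (2) controversy, assumed to be the base-2 entropy of the vote split
    contr = 0.0 - sum((n / total) * math.log2(n / total) for n in (n0, n1) if n > 0)
    share_nonaggressive = n0 / total            # (3)
    majority = 1 if n1 >= n0 else 0             # ties count as aggressive
    consistent = 1.0 if user_label == majority else 0.0  # (5)
    return list(doc_vec) + [contr, share_nonaggressive, float(user_label), consistent]


def personalized_input(target_vec, annotated):
    """Embedding of the evaluated text followed by k per-annotation blocks."""
    features = list(target_vec)
    for doc_vec, n0, n1, user_label in annotated:
        features.extend(annotation_features(doc_vec, n0, n1, user_label))
    return features
```

For k = 18 this yields 300 + 18 * 304 = 5,772 input features, which matches the input size reported for the best annotation-based model.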
Our general personalized aggressiveness detection procedure is as follows:
1. We ask users to annotate the k most controversial documents from the pre-defined set (here dev).
2. Information from the first step is used to extract individual features reflecting personal user beliefs, i.e. conformity measures or embeddings of these k texts (class-based and annotation-based methods).
3. Some users (upper rows in Fig. 1) annotate next documents. The data about their following annotations (embeddings of texts from train) together with data from step 2 are used to train the classifier.
4. For some other users (lower rows in Fig. 1), we also collect their annotations (the test set). Together with the information about their individual preferences (step 2), they are used for validation (testing) purposes only.

Experimental setup
To validate our three personalized methods, we utilized Wikipedia Talk Labels: Aggression, see Sec. 3. We applied 10-fold cross-validation based on users. The first nine sets are used to train the model (upper rows in Fig. 1), while the remaining 10th set is used for testing (lower rows in Fig. 1). The results presented in plots are averaged over all ten folds. Since only dev texts with annotations are assumed to represent prior knowledge about users, they were used to test personalization scenarios for each of our three methods: class-based, annotation-based, and conformity-based. The last one was in three variants: only three GConf(a, C) measures (for C = {0}, {1}, {0, 1}), only three WConf(a, C) measures, and all six conformity values. Thus, we analyzed five methods in total. For each of them, we considered: (1) different numbers k = 1, 2, ..., 20 of texts d previously annotated by user a, d ∈ A_a (for conformity-based methods |A_a| = k), (2) different selection procedures for texts d ∈ A_a used to represent a's beliefs (personalization): (2a) k most controversial texts d ∈ A_a, (2b) k class-balanced most controversial (like 2a but with class balancing), (2c) most aggressive d ∈ A_a (ranked according to the % of aggressive annotations among all for d), (2d) random selection of k texts d ∈ A_a. In total, we tested: 10 folds x (5 methods x 20 distinct k no. of texts x 4 selection + 1 baseline) = 4,010 models.
Figure 4: A classic approach generalizing output based solely on textual content (the same decision for all users): the upper flow (our baseline). Three personalized methods proposed in the paper: (1) Conformity-based: additional input features, i.e. personal conformity measures (GConf, WConf or both, each for aggressive, nonaggressive or any texts); (2) Class-based: two embeddings of k = 4 texts previously annotated by a given user, one embedding for one aggressive text and the second for three nonaggressive ones; (3) Annotation-based: embeddings, classes and additional features for each of k = 4 most controversial texts previously annotated by a given user.
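The size of the experimental grid can be verified by enumerating it. The labels used for methods and selection procedures below are our shorthand for the variants named in the text.

```python
from itertools import product

FOLDS = range(10)
METHODS = [
    "class-based",
    "annotation-based",
    "conformity-GConf",
    "conformity-WConf",
    "conformity-both",
]
K_VALUES = range(1, 21)  # k = 1, 2, ..., 20
SELECTIONS = [
    "most controversial",
    "class-balanced most controversial",
    "most aggressive",
    "random",
]

# One baseline model per fold plus every personalized configuration per fold.
configs = [("baseline", None, None, fold) for fold in FOLDS]
configs += list(product(METHODS, K_VALUES, SELECTIONS, FOLDS))
```

The list contains 10 x (5 x 20 x 4 + 1) = 4,010 entries, one per trained model.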
The logistic regression models were optimized during the training process by using the L2 regularization and the early stopping mechanism. Both of them aim to prevent overfitting and the early stopping mechanism additionally ensures that the model instance that achieved the best loss function score is preserved. The models were run on Intel Xeon Processor E5-2650 v4.
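To make the training regime concrete, the sketch below implements logistic regression with an L2 penalty and early stopping that preserves the best model instance. It is a pure-Python illustration, not the code used in the experiments; monitoring a held-out validation loss, the full-batch gradient descent, and all hyperparameter values are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    """Probability of the positive class; weights[0] is the bias."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def log_loss(weights, data):
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(weights, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train(train_data, val_data, l2=0.01, lr=0.5, max_epochs=500, patience=10):
    """Full-batch gradient descent with L2 regularization and early stopping."""
    dim = len(train_data[0][0])
    weights = [0.0] * (dim + 1)
    best_weights, best_loss, stale = list(weights), float("inf"), 0
    n = len(train_data)
    for _ in range(max_epochs):
        grad = [0.0] * (dim + 1)
        for x, y in train_data:
            err = predict(weights, x) - y
            grad[0] += err
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        weights[0] -= lr * grad[0] / n
        for i in range(1, dim + 1):
            # L2 penalty is applied to non-bias weights only
            weights[i] -= lr * (grad[i] / n + l2 * weights[i])
        val_loss = log_loss(weights, val_data)
        if val_loss < best_loss:
            # keep the model instance with the best loss so far
            best_weights, best_loss, stale = list(weights), val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stopping: no improvement for `patience` epochs
    return best_weights, best_loss
```

The returned weights are those of the epoch with the lowest monitored loss, not necessarily those of the last epoch.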
We also compared our personalized methods with the baseline, i.e. the commonly investigated approach generalizing user perception. It exploited only the evaluated text embeddings as the input.
We considered classification performance not only for the whole test set but also in its breakdown into 10-percent buckets according to three independent rankings of test documents: (1) most controversial (Contr(d)), (2) with least conformity GConf(a, {0, 1}), averaged over all a ∈ Test annotating d, (3) least WConf(a, {0, 1}). Here, the measures were computed for the test set only, not for dev. This breakdown was used to investigate where our models outperform the baseline the most. In order to generate text embeddings in each personalization method, we used the fastText library (Bojanowski et al., 2017; Joulin et al., 2017). It offers pre-trained word vectors for 157 languages, based on the continuous bag of words (CBOW) model in a 300-dimensional space, with character n-grams of length 5.
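The bucket breakdown can be sketched as below, assuming the test documents have already been ranked (for example, from most to least controversial). The function name and the handling of sizes not divisible by ten are our choices for illustration.

```python
def percentile_buckets(ranked_docs, n_buckets=10):
    """Split an already ranked list into n_buckets consecutive parts.

    With documents sorted from most to least controversial, bucket 0
    holds the top 10%, bucket 1 the next 10%, and so on. Bucket sizes
    differ by at most one when the list length is not divisible.
    """
    n = len(ranked_docs)
    return [
        ranked_docs[(i * n) // n_buckets:((i + 1) * n) // n_buckets]
        for i in range(n_buckets)
    ]
```

A per-bucket metric (e.g. F1 for the aggressive class) can then be computed separately on each part and compared against the baseline.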

Validation of personalization methods
Both class-based and annotation-based methods were tested using various rankings while selecting texts for personal embeddings: most controversial, class-balanced most controversial, most aggressive, and random. The conformity-based methods were evaluated in terms of the measure variant used: general conformity, weighted conformity, and both, all with random selection of texts.

Conformity-based Personalization
The results for three conformity-based personalization methods, i.e. three different sets of input conformity features (Sec. 7) and various numbers k of texts used to calculate user conformity, are shown in Fig. 5a. A greater k results in a more precise evaluation of user conformity. It also directly and positively impacts model performance, although gains for k > 15 are very small.
Figure 5: [...] (d) comparison of the best method of each type. Both (b) and (c) were evaluated using various rankings while selecting texts for personal embeddings: most controversial, class-balanced most controversial, most aggressive, and random. Macro F1 score for both classes has the same shape but with a different range for Y: 0.68-0.73.
Additionally, we considered the performance for more and less controversial documents in the test set, Fig. 6a. It is clearly visible that the nonpersonalized method is completely lost for the most controversial documents. However, our conformity-based models lose relatively less. It appears that their gain (smaller loss) is greater for the 30% most controversial texts. In other words, the greater the controversy, the greater the gain from personalization.

Class-based Embeddings
Fig. 5b describes the evaluation of class-based embeddings for various text selection approaches and different numbers of previously annotated texts. The performance is shown only for texts from the aggression class (the plot shapes were the same for macro F1 and both classes). The models using the most controversial texts for selection reached the best results in 14 out of 20 cases (70%). The highest F1 score was achieved for only 4 texts representing user beliefs. It was greater than that of the model without any personalization by over 7 percentage points.

Annotation-based Embeddings
Annotation-based embeddings were tested for the same rankings as in Sec. 8.2, Fig. 5c. The most controversial texts used to generate user representations and feed the model provided the best results in 17 out of 20 cases (85%). The best performance was achieved while using 18 texts to represent user personal beliefs; then, the input consisted of 5,772 features. The F1 score of this model was greater than the baseline by over 10 percentage points.
The greater gain compared to the nonpersonalized method is exposed for 50% of the most controversial texts in the test set; the greatest for 10% of the most controversial, even 22.7 percentage points (twice better: 44.0% vs. 21.3%), Fig. 6b.

Comparison of personalization methods
The best models from each personalization method, which were achieved for annotations of most controversial texts, are compared in Fig. 5d. Models based on annotation-based embeddings provided significantly better results than the others in 10 out of 20 cases of k values (50%). The conformity-based models performed better than other models in 3 out of 20 cases (15%); this applied to the smallest numbers of texts considered (k = 1 to 3). The highest value of F1 score was achieved by the model using 18 texts to represent user personal beliefs. However, this solution used 5,772 input features, whereas the much simpler conformity-based model with 306 input features was only 2.7 percentage points worse. Simultaneously, training of the conformity-based model was 38.6 times faster than that of the annotation-based one, Fig. 7.
Figure 6: Performance of two personalized methods proposed in the paper, only for the aggression class: (a) conformity-based; (b) annotation-based. Both were evaluated on documents d in the test set, sorted in ascending order by Contr(d) measure, 0-10 denotes 10% of the most controversial texts.
Practically, we would like to avoid bothering the user with too many previous annotations, i.e. we may want to limit k to just a few, for example k = 4. Then, we should select the k most controversial texts and use either class-based or conformity-based personalization. Both learn equally fast and maintain a similar gain over the baseline: 7.3 and 5.7 percentage points greater F1 for the aggressive class, respectively, and 3.9 and 3.2 percentage points greater macro F1 (for both classes), respectively.
The worst performance was observed for models using class-based embeddings. The results of evaluation on all texts are presented in Fig. 5d.
Random selection of k texts for personalization is almost always worse than dedicated rankings, Fig. 5b,c. The most controversial texts turned out to be the best option, usually outperforming the most aggressive and class-balanced most controversial selections.

Discussion
A valuable observation from our experiments is that even one document used to evaluate user beliefs is enough to significantly improve reasoning, Fig. 5d. More texts in personalization keep boosting the performance, but about 4-5 previously annotated most controversial documents seem to be a reasonable trade-off between reasoning quality [...]. Annotation-based embeddings most precisely express user opinions, but this comes at the cost of linearly longer learning and a demand for more samples. They also cannot easily adapt to a different number k of personalization documents.
We decided to utilize a very fast logistic regression model with fastText embeddings, since we wanted to examine thousands of models related to multiple scenarios, not all of which are presented here.
We believe our personalization methods establish a new research direction: how to effectively and efficiently embed user beliefs? We expect new methods will be developed for that purpose.
One of the most important postulates derived from our research is the demand for new dataset collections. We need annotations of individual humans rather than aggregated and agreed general beliefs obtained by majority voting, by annotator training, or by removal of controversial texts.
Besides, our personalization methods may be applied to any NLP problem with inconsistencies between people. It especially refers to diverse emotions evoked by textual content, hate speech, detection of cyberbullying or offensive, toxic, abusive, harmful, or socially unaccepted content.
The common problem of imbalanced classes in aggressiveness detection (Tab. 1, Fig. 12) will be addressed in future work.

Conclusions
The main conclusion from our research is that the natural controversies associated with individual perceptions of contents should not be overlooked or reduced but rather directly exploited in personalized solutions. Ultimately, this reflects the diversity in our societies.
Our three new personalization methods make use of texts previously annotated by a given user by means of conformity measures, class-based or annotation-based embeddings. Just a few documents are able to capture individual user beliefs, the more so, the more controversial documents they relate to. As a result, all our methods outperform classic solutions that generalize offensiveness understanding. The gain is greater for more controversial documents.
The personalization solutions can also be applied to other NLP problems, where the content tends to be subjectively perceived as hate speech, cyberbullying, abusive or offensive, as well as in prediction of emotions elicited by text (Kocoń et al., 2019a; Milkowski et al., 2021) and even in sentiment analysis (Kocoń et al., 2019; Kanclerz et al., 2020).
We continue to test our methods on more resource-demanding but also more state-of-the-art language representations: XLNet, RoBERTa (Liu et al., 2019), and XLM-RoBERTa (Conneau et al., 2020).