Multiple Instance Learning for Offensive Language Detection



Introduction
Detection of offensive content online has attracted widespread attention in recent years. Offensive content can not only be human-produced on social media platforms such as Twitter and Facebook, but can also be system-generated due to the pervasive usage of pre-trained language models (Gehman et al., 2020). To tackle the problem in practice, a general solution is to train models capable of identifying messages containing offensive language. The identified messages are then checked and processed. There have been a great many studies on dealing with offensive language based on deep learning approaches (Pitsilis et al., 2018; Pitenis et al., 2020).

Figure 1: MIL scenarios in offensive language detection. Natural labels such as reports and bans are associated with the entire bag. The model is supposed to learn to predict the bag label and locate all offensive instances at the same time.
Previous research mainly works on the detection task with sentence-level annotated corpora from a specific resource (Zampieri et al., 2019; Kumar et al., 2018), which requires massive manual effort. Such approaches are effective but resource-consuming, especially when transferring to a new platform or language. In social media, there are many existing information sources that could substitute for manual labeling. Historical records of online platforms, such as user feedback (reports or dislikes) and punishments made by moderators, could act as supervision. Unfortunately, in many cases those "natural labels" are associated with a larger object (i.e., a user, an article or a dialogue) rather than one certain sentence, and are therefore not suitable for fully supervised learning. However, by regarding each sentence as an instance and the article/dialogue as a bag of instances, we can formalize this scenario into a multiple instance learning task. In this way, we are able to train neural models using only the natural bag-level labels from platforms. As illustrated in Figure 1, the model can not only predict bag-level tags, but also locate offensive sentences to provide more explainable results for moderators.
MIL is a typical weakly-supervised task where each label is associated with a bag of instances, following the rule that a bag is labeled positive if it contains at least one positive instance. In the offensive language detection task, an offensive sentence is regarded as a positive instance. Most MIL works concentrate on the original main target of MIL, which is to predict test bag labels. In offensive language detection, models are supposed to predict not only the bag label but also the sentence labels in order to locate the offensive content.
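The MIL bag-labeling rule stated above is simple enough to write down directly; a minimal sketch:

```python
def bag_label(instance_labels):
    # MIL rule: a bag is positive iff it contains at least one positive instance.
    # In offensive language detection, "positive" means an offensive sentence.
    return int(any(instance_labels))
```

For example, a dialogue with one offensive sentence among several benign ones receives bag label 1, while a fully benign dialogue receives 0.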
As benchmarks are necessary for studying MIL in offensive language detection, we first reconstruct an existing supervised corpus, OLID (Zampieri et al., 2019), into bag form. Since bags pieced together randomly from independent instances may lack internal relevance, we first cluster the sentences and then sample bags inside each cluster. We also collect a new corpus named the Multi-INstance Offensive Response (MINOR) dataset. Bags in MINOR are constructed from tweets and their replies, which better matches practical applications.
In order to study MIL methods for natural language processing tasks systematically, we break down the design of MIL methods into four categories according to where instance-level information is fused into the bag level: text fusion, embedding fusion, score fusion and hybrid fusion. We notice that embedding fusion methods perform well in bag-level prediction, while score fusion methods have advantages at the instance level. We therefore propose a hybrid fusion method with a mutual-attention mechanism, which enhances both instance-level and bag-level representations at the same time. Experimental results demonstrate that our mutual-att method outperforms other models at both levels. Ablation studies further illustrate the effectiveness of each component of mutual-att.
To summarize, our contributions are as follows:
• We formalize offensive language detection into a MIL task to utilize coarse-grained natural labels from online platforms.
• We present two datasets, OLID-bags and MINOR, to study the multi-instance offensive language detection task.
• After categorizing the existing MIL methods and revisiting their relative merits on our datasets, we propose a new hybrid fusion MIL method, mutual-att, which outperforms existing methods at both the bag level and the instance level.

Related Work
Offensive Language Detection Offensive language detection has long been a topic of concern for researchers. Great effort has been made to collect corpora from social media (i.e., Twitter) (Waseem and Hovy, 2016) and to establish benchmarks (i.e., OLID, TRAC) (Zampieri et al., 2019; Kumar et al., 2018). As offensive language online is a world-wide problem, researchers have also constructed many non-English (Pitenis et al., 2020; Mubarak et al., 2021) and multi-language (Kumar et al., 2018) datasets. The semi-supervised dataset SOLID (Rosenthal et al., 2021) has also been proposed to provide large-scale training data without heavy annotation effort.
Offensive language detection is a typical text classification task. Classic machine learning classifiers, including naive Bayes and support vector machines, have been widely employed to detect offensive language. Besides universal features such as bag-of-words (McEnery et al., 2000) and n-grams (Pendar, 2007), Chen et al. (2012) also developed task-specific feature extraction methods. Neural models like LSTMs and CNNs have been applied in numerous recent studies, while pre-trained language models like BERT have achieved SOTA performance in a number of challenges (Liu et al., 2019).
Multiple Instance Learning Multiple instance learning was originally proposed by Dietterich et al. (1997) for drug activity prediction. As the framework (Maron and Ratan, 1998) of MIL can be extended into various scenarios, it has attracted attention from communities in many areas. Computer vision (CV) is the major application field of MIL. Numerous studies have applied MIL methods to CV tasks including image classification (Wu et al., 2015), object tracking (Babenko et al., 2009) and medical prediction (Yao et al., 2020; Li et al., 2021). Several natural language processing (NLP) tasks like document modeling (Pappas and Popescu-Belis, 2017) and sentiment analysis (Pappas and Popescu-Belis, 2014; Angelidis and Lapata, 2018; Ji et al., 2020) also meet the definition of MIL.
Various models have been adopted as the base model in MIL tasks. The base model of MIL varies from classic machine learning methods (Gärtner et al., 2002; Andrews et al., 2003) to deep models (Shi et al., 2020; Ilse et al., 2018) over time, and from convolutional networks (Wu et al., 2015; Li et al., 2021) to language models (Angelidis and Lapata, 2018; Ji et al., 2020) across application fields. Besides the choice of base model, we find in this study that MIL fusion methods are essential to MIL frameworks. In Section 3, we discuss in detail when and how instance-level information is fused into the bag level.

MIL fusion
The input of the multi-instance offensive language detection task is the text of the instances in a bag, while the supervision is the bag-level label. The calculation process of a MIL method can therefore be represented as a path from instance-level text to a bag-level score, as shown in Figure 2. According to when instance-level information is fused into the bag level, we organize MIL methods for NLP tasks into four categories: text-level fusion, embedding-level fusion, score-level fusion and hybrid fusion. We also discuss different fusion operations, including pooling methods and the attention mechanism, in Section 3.2.

Fusion Level
Text-level Fusion Text fusion is an intuitive method for textual MIL tasks, as shown in Equation 1. First, the n instance-level text inputs are fused into the bag level by concatenating them into a long text. Then a neural model is applied for long-text classification. During the inference phase, instance-level predictions are made by taking a single sentence as the input text.
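The text fusion path can be sketched as follows. The `classify` function stands in for the underlying neural classifier (BERT in our experiments); the toy lexicon-based classifier in the usage note is purely illustrative.

```python
def text_fusion(instances, classify):
    # Bag-level: concatenate the n instance texts and classify the long text.
    bag_score = classify(" ".join(instances))
    # Instance-level (inference only): feed each sentence on its own.
    instance_scores = [classify(s) for s in instances]
    return bag_score, instance_scores
```

For a quick check, a toy `classify` can score a text by the fraction of its tokens that appear in a small (hypothetical) offensive lexicon; the real model would of course be a trained classifier.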
Embedding-level Fusion By combining the n hidden sentence embeddings into the bag level as shown in Equation 2, embedding fusion methods retain an informative representation for the final classification layer, which benefits bag-level prediction. However, most embedding fusion methods cannot predict instance labels independently. Only the attentional model (Ilse et al., 2018) is explainable at the instance level, but it still cannot make a direct prediction.
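A minimal sketch of attention-based embedding fusion in the style of Ilse et al. (2018), where the parameter shapes `V` (k, d) and `w` (k,) are illustrative choices rather than the paper's exact configuration:

```python
import numpy as np

def attention_fuse(H, V, w):
    # H: (n, d) instance embeddings; V and w are learned attention parameters.
    scores = w @ np.tanh(V @ H.T)        # (n,) unnormalised instance scores
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()          # softmax over instances
    return alpha @ H, alpha              # bag embedding (d,), attention weights
```

The attention weights make the fusion explainable at the instance level, but as noted above they are not themselves instance-label predictions.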
Score-level Fusion As shown in Equation 3, score-level fusion methods first predict the n instance labels independently and then calculate the bag score from the instance scores. They have advantages in instance-level prediction, but are weak in bag-level performance because representation information is lost before the bag-level decision.
Hybrid Fusion Hybrid fusion is a combination of embedding fusion and score fusion. By fusing embeddings and scores at the same time, the model obtains a rich bag representation while retaining the ability to predict each instance label. Loss-based attention (Shi et al., 2020) is a typical hybrid attention method, which introduces a loss function over both the bag score and the instance scores, as Figure 3 shows. Although the bag-level and instance-level scores are both trained via backpropagation, they do not interact during the forward pass. Thus, we develop our hybrid fusion method with mutual attention, which allows both levels to enhance each other's prediction.

Fusion Operation
Pooling methods are adopted as fusion operations in many MIL works, of which the two most basic are max-pooling and mean-pooling. Max-pooling conforms well to the bag-labeling rule of the MIL task. However, within each bag, only the instances/neurons with the maximum output will be trained, which can cause low training efficiency and poor instance-level performance. Although mean-pooling provides gradients for all instances, treating every instance equally is clearly not suitable for MIL tasks. Therefore, pooling methods including log-sum-exp pooling (Ramon and De Raedt, 2000) and noisy-or pooling (Zhang et al., 2005) have been adopted to provide gradients for all instances while treating each instance differently.
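The four pooling operators above can be sketched directly (the sharpness parameter `r` for log-sum-exp is an illustrative choice):

```python
import numpy as np

def max_pool(s):
    return float(np.max(s))

def mean_pool(s):
    return float(np.mean(s))

def log_sum_exp_pool(s, r=8.0):
    # Smooth, differentiable stand-in for max: larger r -> closer to max,
    # but every instance still receives a gradient.
    s = np.asarray(s, dtype=float)
    return float(np.log(np.mean(np.exp(r * s))) / r)

def noisy_or_pool(p):
    # p are instance probabilities; the bag is positive unless every
    # instance is negative -- a probabilistic reading of the MIL rule.
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.prod(1.0 - p))
```

Note that log-sum-exp always lies between the mean and the max, which is exactly the compromise described above.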
In order to develop a flexible and trainable fusion operation, attention mechanisms have been introduced as MIL fusion operations (Ilse et al., 2018). Recent works with attention mechanisms (Shi et al., 2020; Li et al., 2021) have obtained state-of-the-art performance. Besides, neural networks such as CNNs (Kotzias et al., 2015) and GRUs (Karamanolakis et al., 2019) can also be used as fusion operations in MIL models.

Figure 4: Mutual-attention mechanism. The red arrow stands for I2B-att, while the green one stands for B2I-att.

Method
We propose a mutual-attention mechanism composed of instance-to-bag attention (I2B-att) and bag-to-instance attention (B2I-att). As Figure 4 shows, the representations and scores of instances are fused into the bag level via I2B-att, while the bag score enhances the instance scores via B2I-att.
Instance to Bag Attention Instance embeddings and scores are fused via the same attention, I2B-att. Following Shi et al. (2020), we directly calculate the instance weight α_i from the output of the instance prediction layer z_i. In this way, the instance weight is guaranteed to be consistent with the prediction probability; that is, the instance with a higher probability of being positive receives a larger attention weight. Moreover, no extra parameters need to be introduced, so the model remains efficient.
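A minimal sketch of I2B-att: the attention weights are a softmax over the instance logits themselves, so the fusion adds no parameters and the most-likely-offensive instance dominates. The exact normalization in the paper is not reproduced here, so treat this as one plausible instantiation.

```python
import numpy as np

def i2b_attention(z, H):
    # z: (n,) instance logits; H: (n, d) instance embeddings.
    # Weights come straight from the instance scores (following
    # Shi et al., 2020), keeping weights consistent with predictions.
    alpha = np.exp(z - np.max(z))
    alpha = alpha / alpha.sum()
    bag_emb = alpha @ H        # bag embedding fused from instance embeddings
    bag_score = alpha @ z      # bag score fused from instance scores
    return alpha, bag_emb, bag_score
```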
Bag to Instance Attention In order to avoid disagreement between instance and bag predictions, B2I-att is applied to constrain the instance predictions with the bag score. When the bag label is offensive, an instance label can be either offensive or non-offensive, so only a weak constraint from the bag level should be imposed on the instance scores. Conversely, an instance label should be constrained toward non-offensive if the bag is not offensive. The trainable weight β in Equation 5 controls how much the instance predictions are influenced by the bag prediction.
Fusions of the Mutual-Attention Model Given the I2B and B2I weights α and β, we calculate the final prediction P_final following Equation 6, where λ is a hyper-parameter.
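Since Equations 5 and 6 are not reproduced in this text, the sketch below shows one plausible instantiation of the two combinations, under our own assumptions: B2I-att pulls instance probabilities down toward a low bag probability, and the final bag score interpolates (via λ) the score from the fused embedding with the score fused from instance predictions. The specific functional forms here are illustrative, not the paper's exact equations.

```python
import numpy as np

def b2i_constrain(p_inst, p_bag, beta=0.5):
    # Hypothetical form of Eq. 5: when the bag probability is low, instance
    # probabilities are pulled toward it; when the bag is confidently
    # offensive, the constraint is weak. beta is trainable in the model.
    p_inst = np.asarray(p_inst, dtype=float)
    return (1.0 - beta) * p_inst + beta * np.minimum(p_inst, p_bag)

def final_bag_score(p_from_embedding, p_from_scores, lam=0.8):
    # Hypothetical form of Eq. 6: interpolate the bag probability from the
    # fused embedding with the one fused from instance scores.
    return lam * p_from_embedding + (1.0 - lam) * p_from_scores
```

With λ = 0.8, as in our main experiments, the embedding-side prediction dominates while the instance-side scores act as a correction.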
Experimental Setup

Dataset Construction
OLID-bags OLID (Zampieri et al., 2019) is an offensive language dataset containing annotated tweets. In order to study the MIL task, we reconstruct it into bag form. First, we cluster the sentences using the K-means algorithm with TF-IDF features. The number of clusters is set to 595/67/43 for the train/dev/test sets so that each cluster contains 20 sentences on average. Then bags are randomly sampled inside each cluster. Each bag contains 2 to 8 instances and its bag label follows the definition of MIL. Table 3 shows the statistics of OLID-bags. Since the bag label is offensive whenever any instance in the bag is offensive, the proportions of offensive and non-offensive samples at the instance level and the bag level differ greatly. Offensive instances account for only about one third, while most bags are offensive. This inconsistency between instance-level and bag-level ratios makes it more challenging for models to perform well at both levels.
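The bag-sampling step can be sketched as below. `sample_bags` is an illustrative helper, and the clustering itself (K-means over TF-IDF features) is assumed to have been done upstream, yielding the sentences of one cluster:

```python
import random

def sample_bags(cluster_sentences, cluster_labels, n_bags, seed=0):
    # cluster_sentences/labels: sentences (and their labels) in one cluster.
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        size = rng.randint(2, 8)                       # 2 to 8 instances per bag
        idx = rng.sample(range(len(cluster_sentences)),
                         min(size, len(cluster_sentences)))
        bag = [cluster_sentences[i] for i in idx]
        label = int(any(cluster_labels[i] for i in idx))   # MIL bag label
        bags.append((bag, label))
    return bags
```

Because a single offensive instance makes the whole bag offensive, a cluster with roughly one third offensive sentences yields mostly offensive bags, reproducing the label-ratio inconsistency described above.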

MINOR dataset
In order to construct more "real" bags, we collect and annotate the MINOR dataset. Each bag in MINOR is composed of a tweet and several corresponding responses. To obtain a sufficient proportion of offensive language, we take the IDs of tweets and responses from the Stance in Replies and Quotes (SRQ) dataset (Villa-Cox et al., 2020), whose topics are highly controversial. The test set of MINOR is manually annotated and contains 389 bags and 1,501 instances. Our definition of offensive and non-offensive text follows OLID (Zampieri et al., 2019). We use OLID sentences as examples in the annotation guideline. Each instance is labeled by 3 annotators with a Cohen's Kappa agreement (Cohen, 1960).

Comparison Clearly, OLID-bags has higher label quality because it is manually labeled, while MINOR has a larger data scale. As we can see in Table 2, due to the clustering step, instances in an OLID bag often share similar vocabulary or topics. Instances in a MINOR bag have stronger and more direct connections, as they are a post and its responses. The witness rate (WR) is a MIL concept that stands for the proportion of positive instances in positive bags. The WR of OLID-bags is 40.5% while MINOR's is 53.7%, because MINOR has a higher inner-bag similarity. Previous studies (Carbonneau et al., 2018) have shown that tasks with a lower WR are more challenging, so MINOR can be easier for a model even though it is automatically labeled.
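The witness rate just described can be computed directly from the instance labels; a minimal sketch:

```python
def witness_rate(bags):
    # bags: lists of binary instance labels.
    # WR = proportion of positive instances inside positive bags.
    pos_bags = [b for b in bags if any(b)]
    pos = sum(sum(b) for b in pos_bags)
    total = sum(len(b) for b in pos_bags)
    return pos / total
```

Negative bags are excluded from the denominator, which is why WR can be much lower than the overall positive-instance rate.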

Settings and Baselines
Experimental Settings In order to compare MIL methods fairly, we set the base model f(·) to BERT-base for all experiments. The learning rate of all methods is set to 1e-5 and the batch size is 8. The threshold for prediction is determined by searching on the dev set for the best macro-F1 score. The λ value is set to 0.8 for mutual attention and loss-based attention (Shi et al., 2020). A detailed analysis of λ is carried out in Section 6.2.
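The dev-set threshold search can be sketched as follows; `best_threshold` and its candidate grid are illustrative choices, not the authors' exact procedure:

```python
import numpy as np

def macro_f1(y_true, y_pred):
    # Macro-F1 over the two classes (offensive / non-offensive).
    f1s = []
    for c in (0, 1):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / 2

def best_threshold(dev_probs, dev_labels, grid=np.linspace(0.05, 0.95, 19)):
    # Pick the decision threshold that maximises macro-F1 on the dev set.
    y = np.asarray(dev_labels)
    p = np.asarray(dev_probs)
    return max(grid, key=lambda t: macro_f1(y, (p >= t).astype(int)))
```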
Baselines In this paper, we refer to baselines using our categorization rather than their original names, because the task and the base model differ, and this naming shows the features of each method more clearly. Specifically, each MIL method is named by its fusion level + fusion operation. The correspondence between baselines and their names is shown in Table 5.
We also show the results of instance-level supervised learning, but note that they are for reference only because the label levels are different.

Evaluation Metrics
Bag-level Prediction We evaluate bag-level performance with the macro-F1 score and accuracy. For methods that cannot make bag-level predictions (i.e., supervised), we infer the bag-level label from the instance-level predictions according to the rule of MIL.
Instance-level Prediction Instance-level performance is also measured with the macro-F1 score and accuracy. As mentioned in Section 3.1, some embedding fusion methods cannot predict instance labels directly. So in this paper, we evaluate the instance-level performance of embedding fusion methods by letting them predict the labels of single-instance bags. For the emb+att model, instance-level predictions p_i = σ(z_i) are made from the logit values z_i before the softmax calculation of the attention mechanism.

Main Results
Experimental results of our method and the baselines on OLID-bags and MINOR are shown in Table 6.
Supervised learning can be regarded as the ceiling of instance-level performance, as it is trained with full instance labels. Surprisingly, most MIL methods have comparable or even better bag-level performance on OLID-bags than fully supervised learning. Also, some MIL methods come close to supervised learning in instance-level performance on both datasets, especially our mutual-att method. These results indicate that multiple instance learning is a feasible and promising way to make use of natural bag labels in online offensive language detection.

Table 6: Main results on OLID-bags and MINOR. All experiments are conducted over 5 runs, and we report the averaged results along with standard deviations. The bold numbers stand for the best performance except supervised learning. The bold results with * are significantly better than the second highest ones (α = 0.05). Underlined ones represent the 2nd and 3rd highest results.
Among the four fusion-level categories, text fusion is the simplest method, with average but reliable performance. As mentioned in Section 3, embedding fusion methods are good at bag-level prediction, while score fusion methods have better instance-level performance. We find in the table that emb+att and score+att obtain remarkable results at the bag and instance levels respectively, second only to the hybrid fusion methods. By combining their advantages, the two hybrid fusion methods achieve high performance at both levels. In particular, our mutual-att method outperforms loss-att at both levels and is more stable at the instance level. As for fusion operations, we find that max-pooling and the attention mechanism perform well, while mean-pooling has poor and unstable performance.
All models achieve much higher performance on MINOR than on OLID-bags, which implies that MINOR is less challenging even though it is annotated by a model. There are two main reasons. One is that MINOR has more data, and the other is that MINOR has a higher witness rate, as mentioned in Section 5.1. We discuss in detail why models can obtain high results on such a semi-supervised dataset in Section 6.3.

Ablation Study and Parameter Analysis
Ablation Study In order to investigate the effectiveness of each component of our mutual-attention method, we conduct an ablation study whose results are shown in Table 7. Note that in the experiment "w/o I2B-att" we only remove score-level I2B-att, and only during testing. The results imply that I2B-att mainly enhances bag predictions, while B2I-att mainly improves instance-level performance. We observe that these improvements may come from eliminating disagreements between the bag prediction and the instance predictions. We find that existing MIL methods suffer from this disagreement problem: about 8% of the bags predicted by the loss-att model have conflicts between the two prediction levels. For example, the model may predict that a whole paragraph is non-offensive while predicting that one sentence in it is offensive, which is illogical and will confuse the decision maker in practical use. Our B2I and I2B attention reduce those disagreements by letting the two predictions influence each other. When both are applied, the disagreement rate is reduced to 3%.
Parameter Analysis We carry out experiments to investigate the influence of λ in Equation 6. The curves in Figure 6 show how the performance of our model changes as λ varies. Note that simply setting λ = 0 or 1 makes the model unreasonable, so these results are not included in Figure 6; the results of removing components of our model are discussed in the ablation study. We find that F1 performance on OLID-bags drops dramatically when λ falls below 0.5, especially at the bag level. The model is less sensitive to the λ value on MINOR. From these results, 0.8 to 0.9 is a proper range for λ. Empirically, we set λ = 0.8 in our main experiments.

Semi-Supervised Labeling
Since the training set of MINOR is labeled by a model trained on OLID, one may wonder whether a model directly trained on OLID-bags would perform better. We carry out an experiment to show that this semi-supervised strategy is effective and necessary. The results in Table 8 demonstrate that training on the auto-labeled data is far more effective than directly transferring from OLID data. Although the supervision comes indirectly from OLID, the unique distribution of real tweet-and-response bags in MINOR is also necessary for bag modeling. Besides, MINOR has a larger amount of data than OLID, which may provide the model with more diverse expressions.
Moreover, since the bag label is not necessarily associated with every instance label, the MIL task may suffer less from label noise. For example, if one instance in a bag is correctly labeled as offensive, the bag label will be correct regardless of the other instances. As Table 9 shows, the annotator model fails on 19.2% of MINOR instances, while the bag-level error rate is only 12.1%. "Natural labels" such as moderator punishments and user reports also require no manual effort and may likewise contain noise. The high performance on MINOR indicates that it is possible to utilize those free resources with MIL methods even if their label quality is limited.

Conclusion and Discussion
In this paper, we propose to use MIL to utilize historical information from social media platforms as natural labels for offensive language detection. We present two multi-instance datasets for offensive language detection: OLID-bags and MINOR. OLID-bags is reconstructed from OLID, while the bags of MINOR are composed of real tweets and responses.
We systematically categorize MIL methods for textual tasks into four classes, namely text fusion, embedding fusion, score fusion and hybrid fusion. We observe on OLID-bags and MINOR that embedding fusion has higher bag-level performance, while score fusion methods are good at instance-level prediction. Hybrid fusion methods can integrate the advantages of both, among which our proposed mutual-attention achieves state-of-the-art performance on both datasets at both levels. We also carry out a detailed ablation study to investigate the effectiveness of the proposed I2B-att and B2I-att and how they address the prediction disagreement problem. The discussion in Section 6.3 also verifies that our semi-supervised strategy is efficient and effective.
The proposed datasets, OLID-bags and MINOR, can support future studies on multi-instance offensive language detection. The presented MIL formalization could also be extended to other online risky-content identification tasks. We hope our work can inspire researchers from social media platforms and be instantiated in real scenarios.

Ethical Statement
The data used in this study all come from publicly released datasets. We strictly follow the ethical practices of previous research related to the data sources. The source of our data is anonymous to a certain extent. We further remove private information by filtering out nicknames, phone numbers and URL links with rule-based methods. The annotators in this study are all co-authors and were shown only anonymized tweets when annotating. Moreover, because of the natural bias in terms of political stance, race, gender, etc. when defining and annotating offensive language, we urge users to cautiously examine the ethical implications of offensive language detection models in real-world applications.

Figure 5: The construction process of the MINOR dataset.
The inter-annotator agreement on the MINOR test set is a Cohen's Kappa of 0.868 and a Spearman correlation (Delgado and Tibau, 2019) of 0.881. As OLID and MINOR are both tweet data, we label the training set of MINOR using a BERT (Devlin et al., 2019) annotator model trained on OLID with full supervision. The accuracy of the annotator model is 84.8% on OLID and 80.8% on MINOR. Such a semi-supervised labeling strategy saves manual effort while providing a larger data scale.

Table 1: Symbol description of the paper.

Table 2: Example bags from OLID-bags and MINOR. We use * to mask names of public figures and some sensitive words.

Table 4: Data split of MINOR.

Table 7: Results of the ablation study on OLID-bags. Disagree stands for the proportion of bags whose instance predictions conflict with the bag prediction.

Table 8: Comparison of transferring and semi-supervised learning. The model is our hybrid+mutual-att method.

Table 9: Performance of the annotator model on the MINOR test set.