BERT-Beta: A Proactive Probabilistic Approach to Text Moderation

Text moderation for user-generated content, which helps promote healthy interaction among users, has been widely studied, and many machine learning models have been proposed. In this work, we explore an alternative perspective by augmenting reactive reviews with proactive forecasting. Specifically, we propose a new concept, text toxicity propensity, to characterize the extent to which a text tends to attract toxic comments. We then introduce Beta regression for the probabilistic modeling, which is demonstrated to function well in comprehensive experiments. We also propose an explanation method to communicate the model decision clearly. Both propensity scoring and interpretation benefit text moderation in a novel manner. Finally, the proposed scaling mechanism for the linear model offers useful insights beyond this work.


Introduction
Text moderation is essential for maintaining a nontoxic online community on media platforms (Nobata et al., 2016). Many efforts from both academia and industry have been made to address this critical problem. Recently, the most prototypical thread has been to perform sophisticated feature engineering or to develop powerful learning algorithms (Nobata et al., 2016; Badjatiya et al., 2017; Bodapati et al., 2019; Tan et al., 2020; Tran et al., 2020). Automatic comment moderation schemes plus human review are certainly the cornerstone of the fight against toxicity.
These existing works, however, are reactive approaches that handle user-generated text in response to the publication of new articles. In this paper, we revisit this challenge from a proactive perspective. Specifically, we introduce a novel concept, text toxicity propensity, to quantify how prone an article is to incur toxic comments. This is a proactive outlook index for news articles prior to publication, which differs radically from the existing reactive approaches to comments.
In this context, reactive describes comment-level moderation algorithms applied after the publication of news articles (e.g., Perspective (Perspectiveapi)), which quantify whether comments are toxic and should be taken down or sent for human review. Proactive emphasizes article-level moderation effort before publication (without access to comments), which forecasts how likely articles are to attract toxic comments in the future and gives suggestions in advance (e.g., rephrasing news articles appropriately). Our work can be viewed as the first machine learning effort toward a proactive stance against toxicity.
Formally, we propose a probabilistic approach based on the Beta distribution to regress article toxicity propensity on article text. For previously published news articles with comments, we take the average of the comments' toxicity scores as the ground-truth label for model learning. The effectiveness of this approach is demonstrated on both the test set and human labeling. We also develop a scheme that provides a convincing explanation for the decisions of the deep learning model.
Recently, context in the form of parent posts has been studied, but only as regular text snippets for lifting the performance of toxicity classifiers while screening posts (Pavlopoulos et al., 2020). Our work instead focuses on predicting the proactive toxicity propensity of articles before they receive user comments.
The Beta distribution is usually utilized as a prior in Bayesian statistics. The most popular example in natural language processing is the topic model (Blei et al., 2003), where the multivariate version of the Beta distribution (a.k.a. the Dirichlet distribution) generates the parameters of mixture models. Beta regression was originally proposed for modeling rate and proportion data (Ferrari and Cribari-Neto, 2004) by parameterizing mean and dispersion and regressing the parameters of interest. It has been applied to evaluate grid-search parameters in optimization (McKinney-Bock and Bedrick, 2019), to model emotional dimensions (Aggarwal et al., 2020), and to model the statistical processes of child-adult linguistic coordination and alignment (Misiek et al., 2020).

Beta Regression
In this work, both the comment toxicity score and the derived article toxicity propensity score (to be detailed in Section 4.1) range from 0 to 1. Empirically, their distributions are asymmetric and may not be modelled well by the Gaussian distribution (Figs. 2 and 3 of Appendix A). Furthermore, the comment toxicity score distributions of individual articles vary with article content, as shown in Fig. 3 of Appendix A. Modelling the entire distribution of an article's comment toxicity scores is thus a reasonable approach. The Beta distribution is very flexible: it can model a wide range of well-known distribution families, from the symmetric uniform (α = β = 1) and bell-shaped distributions (α = β = 2) to asymmetric shapes (α ≠ β).
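As a concrete illustration of this flexibility (ours, not the paper's), the snippet below evaluates the Beta density at the parameter settings just listed; scipy is an assumed dependency.

```python
# Illustration: Beta(alpha, beta) covers uniform, bell-shaped, and skewed regimes.
import numpy as np
from scipy.stats import beta

ys = np.array([0.1, 0.5, 0.9])
for a, b in [(1, 1),    # uniform
             (2, 2),    # symmetric bell shape
             (2, 5)]:   # asymmetric (alpha != beta), right-skewed
    print(f"alpha={a}, beta={b}:", np.round(beta.pdf(ys, a, b), 2))
```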
In this context, the toxicity propensity score y is assumed to follow the Beta distribution with probability density function (pdf)

$$p(y; \alpha, \beta) = \frac{y^{\alpha - 1} (1 - y)^{\beta - 1}}{B(\alpha, \beta)}, \quad (1)$$

where α and β are two positive shape parameters controlling the distribution, B(α, β) is the normalization constant, and the support satisfies y ∈ [0, 1]. Eq. 1 captures the probabilistic randomness given α and β; we thus impose a regression structure on them through the text content. Formally, given a training set $D = \{(x_n, y_n)\}_{n=1}^{N}$ with raw text feature vector $x_n$ and label $y_n$ for sample n, we apply feature engineering or a text embedding g(·) and then regress α_n (> 0) and β_n (> 0) on g(x_n), respectively, as

$$\alpha_n = f_\alpha(g(x_n)), \qquad \beta_n = f_\beta(g(x_n)), \quad (2)$$

where f_α(·) and f_β(·) are learned jointly. g(·) can be either pre-fixed or learned together with f_α(·) and f_β(·), as detailed in the subsequent section. Specifically, the learning procedure fits f_α(·), f_β(·), and g(·) (if applicable) by minimizing the negative log-likelihood loss

$$\mathcal{L} = -\sum_{n=1}^{N} \log p(y_n; \alpha_n, \beta_n).$$

Substituting Eqs. 1 and 2 into it gives the final objective function.
In the inference phase, with the learned f_α(·), f_β(·) and g(·), α_m and β_m for a new sample x_m can be readily derived from Eq. 2. We take the mean of Eq. 1 as the point estimator, $\hat{y}_m = \alpha_m / (\alpha_m + \beta_m)$, because we are predicting the average toxicity.
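The paper releases no code, so the following PyTorch sketch shows one plausible implementation of Eqs. 1-2, the loss, and the mean point estimator. The softplus link that keeps the shape parameters positive is our assumption (any positivity-enforcing transform would do), and all class and function names are ours.

```python
# Minimal PyTorch sketch of the Beta regression head (our reading of Eqs. 1-2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaRegressionHead(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.f_alpha = nn.Linear(hidden_dim, 1)  # single-layer head for alpha
        self.f_beta = nn.Linear(hidden_dim, 1)   # single-layer head for beta

    def forward(self, g_x: torch.Tensor):
        # softplus keeps both shape parameters strictly positive (assumed link)
        alpha = F.softplus(self.f_alpha(g_x)) + 1e-6
        beta = F.softplus(self.f_beta(g_x)) + 1e-6
        return alpha.squeeze(-1), beta.squeeze(-1)

def beta_nll_loss(alpha, beta, y, eps=1e-6):
    # Negative log-likelihood of Eq. 1; clamp y away from {0, 1}
    y = y.clamp(eps, 1 - eps)
    dist = torch.distributions.Beta(concentration1=alpha, concentration0=beta)
    return -dist.log_prob(y).mean()

def predict_mean(alpha, beta):
    # Point estimate: the Beta mean alpha / (alpha + beta)
    return alpha / (alpha + beta)
```

For BERT-β, g_x would be the fine-tuned [CLS] embedding; for BOW-β, the TF-IDF vector fed through the same single-layer heads.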

Dataset
We collect a dataset of articles published on Yahoo media outlets, all written in English. We exclude articles with low comment volume to make the distribution learning reliable. The number of comments for 99% of the analyzed articles lies in [10, 8K], with a 25% quantile of 20, a median of 50, and a mean of 448. The dataset is then split into training, validation, and test parts by publishing date with a ratio of 8:1:1, as described in Table 1. It is worth noting that the input text x_n is the concatenation of the article title and text body. The toxicity propensity score y_n of article n is defined as the average toxicity score of all associated comments. Comments are scored by Google's Perspective (Perspectiveapi), with scores lying in [0, 1]. Perspective takes user-generated text and outputs a toxicity probability. It is a convolutional neural net (Noever, 2018) trained on a Wikipedia comments dataset labeled by multiple people per majority rule.
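The paper does not describe its data pipeline; the fragment below is a hypothetical sketch of the label construction just described, assuming per-comment Perspective scores have already been obtained. The minimum-comment cutoff of 10 is our reading of the reported [10, 8K] range.

```python
# Hypothetical sketch of label construction: the article-level propensity
# label y_n is the mean Perspective toxicity score of the article's comments.
from statistics import mean

def article_label(comment_scores: list[float], min_comments: int = 10) -> float | None:
    """Return the propensity label, or None for low-volume articles."""
    if len(comment_scores) < min_comments:
        return None  # excluded to keep distribution learning reliable
    return mean(comment_scores)

def build_input(title: str, body: str) -> str:
    # x_n is the concatenation of article title and text body
    return f"{title} {body}"
```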

Experiment Setup
In Eq. 2, we set both f_α(·) and f_β(·) to single-layer neural networks. For g(·), we experiment with either Bag of Words (BOW) or BERT embeddings (Devlin et al., 2019). Specifically, we take unigram and bigram word sequences and compute the corresponding Term Frequency-Inverse Document Frequency (TF-IDF) vectors, which leads to around 5.8 million tokens for BOW. For BERT, we take the base version and fine-tune f_α(·) and f_β(·) on top of the [CLS] embedding, which amounts to 110 million parameters. If the input text exceeds the maximum length (510, as [CLS] and [SEP] are reserved), we adopt a simple yet effective truncation scheme (Sun et al., 2019): we empirically keep the first 128 and the last 382 tokens of long text. The rationale is that the informative snippets are more likely to reside in the beginning and the end. The batch size is 16 and the learning rate is 1e-5 with the Adam optimizer (Kingma and Ba, 2015). The two models are called BOW-β and BERT-β for short.
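A minimal sketch of the head-plus-tail truncation just described, operating on token ids (the surrounding tokenizer plumbing is omitted):

```python
# Head+tail truncation (Sun et al., 2019): keep the first 128 and the last
# 382 tokens when the text exceeds 510 tokens, leaving room for [CLS]/[SEP].
def truncate_head_tail(token_ids: list[int], head: int = 128, tail: int = 382) -> list[int]:
    max_len = head + tail  # 510
    if len(token_ids) <= max_len:
        return token_ids
    return token_ids[:head] + token_ids[-tail:]
```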

Baseline Methods and Metrics
We compare with a linear regression method using BOW features, as well as the BERT base model. Each is combined with one of two loss functions, Mean Absolute Error (MAE) or Mean Squared Error (MSE). We call them BOW-MAE, BOW-MSE, BERT-MAE, and BERT-MSE, respectively. The experiment settings are the same as for Beta regression.
Since we are interested in identifying articles of high toxicity propensity, we want to make sure that an article with high average toxicity is ranked higher than one with low propensity. Thus, in addition to Mean Absolute Error, Root Mean Squared Error (RMSE), and the Area Under the Precision-Recall Curve (AUC@PR), we measure performance with two ranking metrics, Kendall's coefficient and Spearman's coefficient.
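For concreteness, these metrics can be computed with scipy and scikit-learn as sketched below. The library choice is ours (the paper does not specify its implementation), and the propensity threshold defining the positive class for AUC@PR is a hypothetical placeholder.

```python
# Sketch of the evaluation metrics on numpy arrays of true/predicted scores.
import numpy as np
from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import average_precision_score

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, toxic_threshold: float = 0.5):
    mae = np.abs(y_true - y_pred).mean()
    rmse = np.sqrt(((y_true - y_pred) ** 2).mean())
    kendall, _ = kendalltau(y_true, y_pred)
    spearman, _ = spearmanr(y_true, y_pred)
    # AUC@PR approximated by average precision, with a hypothetical
    # propensity threshold defining the positive class
    auc_pr = average_precision_score(y_true >= toxic_threshold, y_pred)
    return dict(MAE=mae, RMSE=rmse, Kendall=kendall, Spearman=spearman, AUC_PR=auc_pr)
```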

Results
We perform evaluation on the whole test set and on human labels.

Test Set
Table 2 details the performance comparison. Overall, Beta regression stands out across the different metrics regardless of feature engineering, owing to its modeling flexibility. BERT-based methods also outperform BOW ones in terms of feature engineering and representation. This is reasonable, as the former has about 20 times as many parameters as the latter and offers contextual embeddings. Interestingly, the MAE and MSE schemes do not achieve the minimum MAE and RMSE although they are dedicated to these objectives, which might result from the limitation of the point estimator.

Human Labels
As the labels are machine-derived, we want a sanity check to ensure that the model decisions conform to human intuition. Namely, when the model classifies an article as having high toxicity propensity, we want to make sure that this correlates well with human judgement. To this end, we divide the test set into 10 equal buckets with an interval of 0.1 and merge the last 4 buckets into [0.6, 1] because far fewer articles score above 0.6 (as shown in Fig. 2). We then randomly take 100 samples per bucket, set aside 10% for human training, and have the remainder labelled by human judges as the benchmark set. We recruit two groups of people for independent annotation, who are required to pick one of five levels (a reasonable balance between smoothness and accuracy for manually labeling toxicity propensity, per the judges' suggestion) to describe the extent to which an article is likely to attract toxic comments: Very Unlikely (VU), Unlikely (U), Neutral (N), Likely (L), and Very Likely (VL). Table 3 is the confusion matrix showing how much the two groups of human judges agree with each other. Moreover, Cohen's Kappa is about 0.23 when expected chance agreement is taken into account. In light of this, we jointly score the set by assigning -2, -1, 0, 1, and 2 to VU, U, N, L, and VL, respectively. Since each article has two labels, their addition gives an integer score in the interval [-4, 4]. Table 4 reports the performance with human labels as the ground truth, which confirms the previous finding that BERT-β performs best. Additionally, we pick scores 2, 3, and 4 as thresholds to monitor precision and recall curves (Fig. 1). Likewise, the proposed schemes achieve compelling performance across the board.
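A small sketch of the joint scoring and agreement computation just described; sklearn is our choice for Cohen's Kappa, as the paper does not name a library.

```python
# Map each annotator's five-level label to {-2,...,2} and sum the two labels,
# giving an integer joint score in [-4, 4]; agreement via Cohen's Kappa.
from sklearn.metrics import cohen_kappa_score

LEVELS = {"VU": -2, "U": -1, "N": 0, "L": 1, "VL": 2}

def joint_score(label_a: str, label_b: str) -> int:
    return LEVELS[label_a] + LEVELS[label_b]

def agreement(labels_a: list[str], labels_b: list[str]) -> float:
    return cohen_kappa_score(labels_a, labels_b)
```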
Taken together, our probabilistic methods agree more with both machine and human judgements.

Explanation
As we focus on pre-publication text moderation, a reasonable explanation is an essential step to convince stakeholders of subsequent operations. For the BERT-β explanation, we adopt gradient-based saliency map variants from computer vision (Simonyan et al., 2013; Shrikumar et al., 2017). We compute the gradient ∇f(x) with respect to the input token embeddings e(x), where f(x) = α(x)/(α(x) + β(x)) is the mean prediction for sample x (Section 3), and x = (t_1, t_2, ..., t_L) with each t_l (l = 1, 2, ..., L) a single token. Each element of ∇f(x) is the partial derivative ∂f/∂e(t_l)(x), which measures the token-level contribution to the score. The explanation assumes the article is controversial, and we want to figure out which words cause some comments to be toxic; it therefore also makes sense to target the peak toxicity of the comments. We thus experiment with f(x) = (α(x) − 1)/(α(x) + β(x) − 2), which is the mode (corresponding to the peak of the Beta pdf) under the reasonable assumption α, β > 1. We denote the resulting scheme by the subscript "mode".
For the saliency map (SM) (Simonyan et al., 2013), the metric is the magnitude $\|\partial f / \partial e(t_l)(x)\|_2$, without direction. A variant is the dot product (DP) between the token embedding and the gradient element, $e(t_l)^\top \cdot \partial f / \partial e(t_l)(x)$, with direction (Shrikumar et al., 2017). We also propose a hybrid (HB) scheme that takes the magnitude of SM and the direction of DP to form a new metric. In addition, we perform an ablation study (AS) that deletes each single token t_l in turn and computes the score discrepancy between the original x and x_{¬l}. As a reference, we examine the regression coefficients (RC) of the linear BOW-MSE, which are easy to check for explaining the contribution of the corresponding words.
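The three gradient-based attribution metrics can be sketched in PyTorch as follows. The variable names and autograd plumbing are ours; the ablation study AS requires a separate forward pass per deleted token and is omitted.

```python
import torch

def token_attributions(f_x: torch.Tensor, embeddings: torch.Tensor):
    """Compute SM, DP, and HB scores for each of the L tokens.

    f_x: scalar prediction (mean or mode of the Beta output).
    embeddings: (L, d) token embeddings e(t_l) with requires_grad=True.
    """
    grads, = torch.autograd.grad(f_x, embeddings)   # (L, d) gradients wrt e(t_l)
    sm = grads.norm(p=2, dim=-1)                    # saliency map: magnitude only
    dp = (embeddings * grads).sum(dim=-1)           # dot product: signed
    hb = sm * dp.sign()                             # hybrid: SM magnitude, DP sign
    return sm, dp, hb
```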
A few well-trained human judges are recruited to tag the k most important words (k is example-specific, determined by the annotators). We then rank tokens with the different metrics and pick the top k as candidates. The hit rate (the proportion of human-annotated tokens covered by a scheme) is used to compare the different tools; a sketch follows below. We take 1,000 examples for human review and compute the average hit rate, as compared in Table 5. All schemes for BERT-β are much better than the linear scheme RC, which is consistent with the discrepancy in predictive performance. SM and HB are close and outperform the black-box ablation study, which implies the valuable role of model-aware gradients in the explanation. DP is inferior to AS and seems inconsistent with both human annotation and the other gradient-based methods. In practice, we take SM for the explanation (Appendix B) due to its superior performance and simplicity. As expected, the mode (SM_mode) covers more annotated words than the mean (SM) on average (more discussion in Appendix C).
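For reference, the hit-rate metric can be sketched as below (our formulation; token indices stand in for annotated words):

```python
# Hit rate: fraction of human-annotated tokens covered by a scheme's top-k
# tokens, with k set to the (example-specific) number of annotated tokens.
def hit_rate(scheme_scores: list[float], human_tokens: set[int]) -> float:
    k = len(human_tokens)
    top_k = sorted(range(len(scheme_scores)),
                   key=lambda i: scheme_scores[i], reverse=True)[:k]
    return len(set(top_k) & human_tokens) / k
```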

Discussion
Motivated by the gap between the linear BOW models and BERT-β, we also propose a lightweight scaling mechanism for the linear model. Concretely, we pre-compute a per-feature weight

$$w_j = \frac{\sum_{n=1}^{N} \left(\tau(X)_{nj} - \bar{\tau}(X)_j\right)\left(y_n - \bar{y}\right)}{\sum_{n=1}^{N} \left(\tau(X)_{nj} - \bar{\tau}(X)_j\right)^2}, \qquad j = 1, \ldots, M,$$

where X is the training corpus, $y \in [0, 1]^{N \times 1}$ and $\tau(X) \in \mathbb{R}^{N \times M}$ (M = 5.8 million) are the training labels and the TF-IDF matrix, and $\bar{y}$ and $\bar{\tau}(X)$ are their column-wise means. The pre-computed w can be viewed as a surrogate of the regression coefficient for the linear regression problem; it is used to scale the TF-IDF features of BOW-MSE in both the training and inference phases. We call this Naive Bayes Linear Regression (NBLR) for short.
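Under our reading of the scaling equation above, NBLR can be sketched as follows. Dense numpy is used for clarity (the real 5.8-million-feature TF-IDF matrix would be sparse), and sklearn is our library choice.

```python
# Sketch of NBLR: a per-feature weight w scales the TF-IDF matrix before
# ordinary linear regression (our reading of the reconstructed equation).
import numpy as np
from sklearn.linear_model import LinearRegression

def nb_scale_weights(tfidf: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    tc = tfidf - tfidf.mean(axis=0)          # column-centered TF-IDF
    yc = y - y.mean()                        # centered labels
    # per-feature univariate regression coefficient: cov / var
    return (tc * yc[:, None]).sum(axis=0) / ((tc ** 2).sum(axis=0) + eps)

def fit_nblr(tfidf: np.ndarray, y: np.ndarray):
    w = nb_scale_weights(tfidf, y)           # pre-computed, then held fixed
    model = LinearRegression().fit(tfidf * w, y)
    return model, w
```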
The scaling benefits the performance, as shown in Table 6: NBLR improves upon BOW-MSE significantly, although it is still not as good as BERT-β.

Our work can benefit text moderation. The proactive propensity offers a toxicity outlook for the comments, which could be utilized in multiple ways. For example, stricter moderation rules can be enforced for articles that are predicted to have a high toxicity propensity. Furthermore, the propensity could serve as an additional feature for downstream reactive toxicity recognition models, as well as for allocating appropriate human resources.
The explanation tool can also remind editors to rephrase controversial words and so mitigate the odds of attracting toxic comments. Text moderation is an important yet challenging task; our proactive work attempts to open up a new perspective that augments the traditional reactive procedure. Our current model, however, is not perfect, as shown by article b in Fig. 3 of Appendix A, where the learned distribution does not fit the observed histogram well. Technically, NBLR is an encouraging lightweight extension of linear regression. Likewise, we will continue to work toward improving the non-linear Beta regression.

Conclusion
We approach text moderation by developing a well-motivated probabilistic model to learn a proactive toxicity propensity. An explanation scheme is also proposed to visually explain the connection between this new prospective score and the text content. Our experiments show the superior performance of the proposed BERT-β algorithm, compared with a number of baselines, in predicting both the average toxicity score and the human judgement.

A Toxicity Score and Beta Distribution
The distribution of news articles' toxicity propensity scores is reported in Fig. 2. The comment score distributions of two articles, together with their predictive distributions, are given in Fig. 3.

B SM Explanation Examples
We pick two samples from the test set and leverage SM (Section 4.5) to highlight key words for illustration purposes, as shown in Fig. 4. The color intensity is proportional to the normalized saliency map value: the darker a token's color, the more important it is to the scoring. There is also a positional bias toward the first sentence, as it is the article title.

C BERT-β mode
We also explore the mode of BERT-β as a point estimator and compare it with the mean. Table 7 details the performance discrepancy on the test set and on human labels. For toxicity propensity prediction on the test set, it makes sense for the mean to slightly outperform the mode, as the ground-truth labels are the mean of the comment scores. When it comes to human labels and explanation, people annotate news articles based on the perceived controversial words most likely to incur toxic comments. The mode is thus able to capture the worst case better and agrees more with human annotations. This finding is in line with the better explanation performance compared in Table 5.