Analyzing Norm Violations in Live-Stream Chat

Toxic language, such as hate speech, can deter users from participating in online communities and enjoying popular platforms. Previous approaches to detecting toxic language and norm violations have been primarily concerned with conversations from online forums and social media, such as Reddit and Twitter. These approaches are less effective when applied to conversations on live-streaming platforms, such as Twitch and YouTube Live, as each comment is only visible for a limited time and lacks a thread structure that establishes its relationship with other comments. In this work, we share the first NLP study dedicated to detecting norm violations in conversations on live-streaming platforms. We define norm violation categories in live-stream chats and annotate 4,583 moderated comments from Twitch. We articulate several facets of live-stream data that differ from other forums, and demonstrate that existing models perform poorly in this setting. By conducting a user study, we identify the informational context humans use in live-stream moderation, and train models leveraging context to identify norm violations. Our results show that appropriate contextual information can boost moderation performance by 35%.


Introduction
Interactive live streaming services such as Twitch and YouTube Live have emerged as some of the most popular and widely used social platforms. Unfortunately, streamers on these platforms struggle with an increasing volume of toxic comments and norm-violating behavior. While there has been extensive research on mitigating similar problems for online conversations across various platforms such as Twitter (Waseem and Hovy, 2016; Davidson et al., 2017; Founta et al., 2018; Basile et al., 2019; ElSherief et al., 2021), Reddit (Datta and Adar, 2019; Kumar et al., 2018; Park et al., 2021), Stack Overflow (Cheriyan et al., 2017), and GitHub (Miller et al., 2022), efforts to extend them to live-streaming platforms have been absent. In this paper, we study the unique characteristics of comments on live-streaming services and develop new datasets and models for appropriately using contextual information to automatically moderate toxic content and norm violations.
Conversations in online communities studied in previous work are asynchronous: utterances are grouped into threads that structurally establish conversational context, allowing users to respond to prior utterances without time constraints. The lack of time constraints allows users to formulate longer, better-thought-out responses and to more easily reference prior context.
On the other hand, conversations on live-streaming platforms are synchronous, i.e., held in real time: utterances are presented in temporal order without a thread-like structure. Context is mostly established by consecutive utterances (Li et al., 2021). The transient nature of live-stream utterances encourages fast responses and the production of multiple short comments that may be more prone to typos (70% of comments are made up of fewer than four words). Figure 1 illustrates the contrasting temporal and length patterns of the asynchronous and synchronous platforms.
Owing to these different characteristics, we find that previous approaches for detecting norm violations are ineffective for live-streaming platforms.
To address this limitation, we present the first NLP study of detecting norm violations in live-stream chats. We first establish norms of interest by collecting 329 rules from Twitch streamers' channels and define 15 fine-grained norm categories through an iterative coding process. Next, we collect 4,583 moderated chats and their corresponding context from Twitch live streams and annotate them with these norm categories (§2.1-§2.3). With our data, we explore the following research questions: (1) How do norm violations in live-stream chats, i.e., synchronous conversations, differ from those in previous social media datasets, i.e., asynchronous conversations? (2) Are existing norm-violation and toxicity detection models robust to the distributional shift between asynchronous and synchronous platforms? (§3.1, §3.3) (3) Which features (e.g., context and domain knowledge) are important for detecting norm violations in synchronous conversations? (§3.2) From our explorations, we discover that (1) live-stream chats have unique characteristics and norm-violating behavior that diverge from those in the previous toxicity and norm-violation literature; (2) existing moderation models perform poorly at detecting norm violations in live-stream chats; and (3) additional information, such as chat and video context, provides crucial features for identifying norm violations in live-stream chats. We show that incorporating such information increases inter-annotator agreement for categorizing moderated content and that selecting temporally proximal chat context is crucial for enhancing the performance of norm-violation detection models in live-stream chats.

NormVio-RT
To investigate norm violations in live-stream chat, we first collect Norm Violations in Real-Time Conversations (NormVio-RT), which contains 4,583 norm-violating comments on Twitch that were moderated by channel moderators (please contact the authors for the anonymized study data). An overview of our data collection procedure is illustrated in Figure 2. We first select the top 200 Twitch streamers and collect moderated comments from their streamed sessions (§2.1). To understand why these chats are moderated, we collect chat rules from these streamers and aggregate them to define coarse- and fine-grained norm categories (§2.2). We design a three-step annotation process to determine the impact of chat history, video context, and external knowledge on labeling decisions (§2.3). Lastly, we present an analysis of the collected data (§2.4).

Data Collection
We collected data using the Twitch API and IRC from streamers whose videos were available for download among the top 200 Twitch streamers as of June 2022. We specifically looked for comments that triggered a moderation event during a live stream (e.g., user ban, user timeout) and collected each moderated comment along with the corresponding video and chat logs up to two minutes prior to the moderation event. Logs of moderation events from August 22, 2022 to September 3, 2022 were collected. We excluded comments that were moderated less than one second after being posted, as they are likely to have been moderated by bots rather than humans.
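The bot-filtering step above can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the event fields (`comment_ts`, `moderation_ts`) and the helper name are hypothetical.

```python
from datetime import datetime, timedelta

def filter_human_moderation(events, min_delay_s=1.0, window_s=120):
    """Keep moderation events likely issued by human moderators.

    Each event is a dict with hypothetical fields:
      'comment_ts'    -- when the moderated comment was posted
      'moderation_ts' -- when the ban/timeout was issued
    Events moderated within `min_delay_s` of posting are dropped, since
    near-instant actions are typically automated (bot moderation).
    Each kept event is annotated with the start of its context window
    (up to `window_s` seconds of prior chat and video).
    """
    kept = []
    for ev in events:
        delay = (ev["moderation_ts"] - ev["comment_ts"]).total_seconds()
        if delay < min_delay_s:
            continue  # likely moderated by a bot, not a human
        ev = dict(ev)
        ev["context_start"] = ev["moderation_ts"] - timedelta(seconds=window_s)
        kept.append(ev)
    return kept
```

Any real pipeline would also need to join these events against the downloaded VODs and IRC logs, which is omitted here.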

Norm Categorization
Twitch streamers can set their own rules for their channels, and these channel-specific rules are essential for understanding why comments were moderated. We first collect 329 rules from the top 200 Twitch streamers' channels. Next, following Fiesler et al. (2018), we take an iterative coding approach in which the authors of this paper individually code rule types into candidate categories, come together to reconcile differences, and then repeat the coding process. Through this process, we aggregate similar rules into 15 fine-grained norm categories (e.g., controversial topics, begging) and cluster the fine-grained categories into 8 coarse-grained norm categories (e.g., off-topic). To better understand the targets of offensive comments in the HIB (Harassment, Intimidation, Bullying) class, we add an additional dimension indicating whether the target is the broadcaster (streamer), participants in the channel (e.g., moderators and viewers), or someone not directly involved in the broadcast. We ask annotators to assign "Incivility" to comments that were moderated even though annotators do not believe a specific pre-defined rule type was violated. Examples of "Incivility" are provided in Appendix A.4.
Table 1 shows the resulting norm categories and corresponding fine-grained norms with examples.

Violated Norm Type Annotation
We recruited three annotators who are fluent in English and spend at least 10 hours a week on live streaming platforms to ensure that annotators understood live streaming content and conventions.
Their fluency was verified through several rounds of pilot annotation work. Internal auditors conducted intermittent audits to ensure that annotators fully understood the guidelines.
Annotators were asked to annotate each moderated comment in three stages. Lastly, to examine how much external knowledge matters in understanding comments on live-streaming platforms, we asked annotators to (1) indicate whether external knowledge is necessary to understand why a comment triggered a moderation event and, if so, (2) describe what that knowledge is. We focus on two types of external knowledge: platform-specific and streamer-specific. Platform-specific knowledge includes the implicit meaning of particular emojis, emotes, and slang that are commonly used on Twitch. Streamer-specific knowledge involves the streamer's personal background and previous streaming sessions. As shown in Table 2, we provide templates for each type that annotators can easily fill out (more details in Appendix A.3).

Data Statistics and Analysis
General Observations. We identified three characteristics that distinguish real-time live-streaming chat from other domains. First, the majority of comments are very short: 70% of comments are made up of fewer than four words. Additionally, they are often very noisy due to the real-time nature of the communication, which leads to a high number of typos, abbreviations, acronyms, and slang terms. Lastly, some comments use unusual visual devices, such as ASCII art and all-caps text, to make themselves more noticeable. This is because each comment is visible for only a short time in popular streams (on average, there are around 316 chats per minute for the streamers in our data). The chat window on live-streaming platforms can display only a limited number of comments, so viewers are incentivized to use visual devices to draw the streamer's attention in these fast-paced conditions.
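The two summary statistics above (the share of sub-four-word comments and the chat rate) are straightforward to compute from a chat log; the sketch below assumes a hypothetical list of message strings and a known stream duration.

```python
def chat_statistics(comments, duration_minutes):
    """Summary statistics over a chat log (an illustrative sketch).

    comments: list of raw message strings from a stream segment.
    duration_minutes: length of the segment in minutes.
    Returns the fraction of comments with fewer than four
    whitespace-delimited tokens and the chat rate per minute.
    """
    short = sum(1 for c in comments if len(c.split()) < 4)
    return {
        "short_fraction": short / len(comments),
        "chats_per_minute": len(comments) / duration_minutes,
    }
```

Tokenizing by whitespace is a simplification; the paper's exact word-counting convention is not specified.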
False positives in data. We find that the "Incivility" category contains many false positives, as it includes cases that seem to have been moderated for no particular reason. We asked annotators to put all miscellaneous cases into the "Incivility" category, and also to mark a comment as "Incivility" if they could not identify any reason for the moderation. We found that many cases are not identifiable, as shown in Table 3. It is natural that many cases are non-identifiable in stage 1, as annotators are given only the moderated comment and no context. However, the 7.45% of non-identifiable cases that remain even after stage 3 could be false positives, or they could be cases where the moderation event occurred more than two minutes after the problematic comment was made.
Interestingly, providing context helps mitigate annotator bias, as shown by the increase in inter-annotator agreement from stage 1 to stages 2 and 3 in Table 4. Here, exact match requires all three annotators to select exactly the same rules; partial match requires at least one rule in common among the three annotators; and majority vote chooses the rule types selected by at least two annotators. Non-identifiable and disagreement cases also drop significantly when context is given, as shown in Table 3. Similarly, when determining rule types, context helps annotators identify targets for HIB and reduces inconsistencies between annotators. Our observations emphasize the importance of context in synchronous communication and differ from previous findings that context-sensitive toxic content is rare in asynchronous communication (Pavlopoulos et al., 2020; Xenos et al., 2021). Analysis details are in Appendix A.2.
External knowledge helps annotations.

Norm Category Distribution. Table 3 shows the norm category distribution of streamers' rules and the moderated comments. While the categories are not directly comparable to the ones defined in NormVio for Reddit (Park et al., 2021), we identified a few similar patterns. First, in both domains, harassment and incivility (i.e., Discrimination, HIB, Incivility) make up a significant portion of the entire set of norm violations. Also, the two domains show a similar pattern in which rules for Off-Topic, Inappropriate Contents, and Privacy exist but are relatively less enforced in practice. However, the two domains also differ in various ways. For example, Spam and Meta-Rules cover significantly higher portions of both rules and moderated comments on Twitch than on Reddit. On the other hand, there are fewer rules about content on Twitch, which implies that streamers are less concerned about the content of comments than Reddit community moderators are. As our data shows that norm-violating comments in live chats exhibit distinctive rules and patterns, existing norm-violation detection systems may not perform well without domain adaptation to account for these distributional differences. We examine this hypothesis empirically in the following section and suggest appropriate modeling adjustments to better detect toxicity in real-time comments.

Toxicity Detection in Live-stream Chat
In this section, we first check whether norm-violation and toxicity detection models are robust to the distributional shift from asynchronous to synchronous conversations and vice versa, and then identify how important context and domain knowledge are for detecting toxicity and norm violations in synchronous conversations.

Performance of Existing Frameworks.
To examine the difference in toxicity detection between asynchronous and synchronous communication, we investigate whether existing toxicity detection models are effective for synchronous communication. We evaluate the performance of four existing tools on NormVio-RT: Google's Perspective API (Lees et al., 2022), the OpenAI content filter, OpenAI moderation (Markov et al., 2022), and a RoBERTa-large model fine-tuned on ToxiGen, a machine-generated toxicity dataset (Hartvigsen et al., 2022). We only use examples from the Discrimination and HIB categories in NormVio-RT, as they are most similar to the label space these models are trained for (e.g., hateful content, sexual content, violence, self-harm, and harassment). Categories are determined based on the stage 1 consolidated labels, as we do not provide any context to the models. Additionally, we select an equal number of random chats from the collected streams to construct negative examples.
To ensure the quality of negative examples, we only select chats that do not fall within the two minutes preceding any moderation event, as such chats are less likely to contain norm violations. We also only select chats from users who have never been moderated in our data. To obtain predictions, we check whether the toxicity score is greater than or equal to 0.5 for the Perspective API; for the OpenAI models, we check the value of the "flagged" field, which indicates whether OpenAI's content policy is violated. We use the binary classification outputs for ToxiGen. The results illustrate that while existing models do not frequently produce false positives (high precision), they perform poorly at detecting the toxic messages found in synchronous chats, with a detection rate of only around 55% at best (low recall).
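The evaluation above reduces each tool to a binary prediction (e.g., thresholding the Perspective API toxicity score at 0.5) and scores it against the moderation labels. A minimal sketch of that scoring step, with a hypothetical `scores`/`labels` interface:

```python
def evaluate_binary(scores, labels, threshold=0.5):
    """Precision/recall for a toxicity scorer thresholded at `threshold`.

    scores: per-message toxicity scores in [0, 1] from some model.
    labels: gold labels, 1 = moderated (toxic), 0 = unmoderated.
    """
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

In this framing, the finding above corresponds to high precision (few false positives) but low recall (many missed toxic chats).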

Norm Classification in NormVio-RT
To examine how context affects model performance, we experiment with model variants that use different types of input context: (1) Single-user context is the chat log of only the moderated user, from up to two minutes before the moderation event; (2) Multi-user context (event) is the N messages that directly precede the moderation event, regardless of whether they belong to the moderated user; (3) Multi-user context (utterance) is the N messages that directly precede the single utterance, i.e., the moderated user's last message before the moderation event (chat 3 in Figure 3); and (4) Multi-user context (first) is the first N messages of the collected two-minute chat log. The intuition for this last selection is that the offending behavior may have taken place much earlier than the moderation event. In all multi-user contexts, we use N = 5. We additionally consider: (5) Broadcast category, the category streamers have chosen for their broadcast, usually the title of a game or "just chatting"; and (6) Rule text, a representative rule example as shown in Table 1.

Experimental Results. Table 6 presents the performance of norm classification for coarse-level norm categories. "All" refers to binary moderation detection, i.e., whether the message is moderated or not, regardless of the specific norm type. First, we can see that additional context improves the performance of "All," but context does not consistently improve the performance of category-specific norm classifiers. For example, context reduces performance for categories where the issue is usually limited to the utterance itself (e.g., Discrimination and Privacy). In contrast, categories that rely on the relationships between utterances, such as HIB and Incivility, show improved performance with context. Secondly, multi-user context performs quite well compared to the other contexts, indicating that a more global context that includes utterances from other users helps determine the toxicity of target utterances. Lastly, the strong performance of Multi-user context (first) suggests that earlier messages in the two-minute window are more important, meaning that the temporal distance between the moderation event and the actual offending utterance may be substantial in many cases. Thus, our results encourage future efforts to develop more sophisticated approaches for context selection.
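The four context-selection variants can be sketched as follows. This is an illustrative reconstruction under assumed data structures (a chronological list of `(user, message)` pairs and the index of the moderated user's last message), not the authors' implementation.

```python
def select_context(chats, moderated_user, last_msg_idx, mode, n=5):
    """Context-selection variants for norm classification (a sketch).

    chats: chronological list of (user, message) pairs from the two-minute
    window preceding the moderation event.
    last_msg_idx: index of the moderated user's final message in `chats`.
    Modes mirror the configurations described in the text:
      'single'    -- all prior messages by the moderated user
      'event'     -- the n messages directly preceding the moderation event
      'utterance' -- the n messages directly preceding the last message
      'first'     -- the first n messages of the two-minute window
    """
    if mode == "single":
        return [m for u, m in chats[:last_msg_idx] if u == moderated_user]
    if mode == "event":
        return [m for _, m in chats[-n:]]
    if mode == "utterance":
        return [m for _, m in chats[max(0, last_msg_idx - n):last_msg_idx]]
    if mode == "first":
        return [m for _, m in chats[:n]]
    raise ValueError(f"unknown mode: {mode}")
```

The paper uses n = 5 for the multi-user variants; the context-size experiments below vary this from 1 to 25.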
Availability of Context. To compare human decisions with those of our models, we conduct experiments varying the context available to annotators and models. For example, we expect models trained on single utterances alone to perform best when stage 1 (utterance-only) labels are used as ground truth, since humans are likewise given no context at stage 1. Indeed, in Figure 4, using the stage 1 labels as ground truth yields the best performance for a model trained without any context, while using the stage 2 (context) labels as ground truth yields the best performance for a model trained with the previous chat history. Since our experiments handle only text inputs, it is not surprising that using stage 3 (video) labels as ground truth yields worse performance than using stage 2 labels. Interestingly, however, the gap is not large, which indicates that the gains from a multi-modal model incorporating information from the video may be small, and that single-modality (text-only) models can be sufficient for the majority of moderation instances.
Context Size. To understand how the amount of available context affects moderation performance, we compare the multi-user context configurations with varying numbers of messages, from 1 to 25. Figure 5 shows that including 15 to 20 messages prior to the moderated user's message helps moderation performance the most (see utterance and first). However, increasing the number of messages that directly precede the moderation event actually lowers moderation performance (see event). It may be that most of this context serves as noise.

Distribution Shift in Norm Classification.
Existing tools often focus on identifying harmful speech, but NormVio (Park et al., 2021) considers a wider range of norm-violating comments on Reddit, similar to NormVio-RT but in a different domain. We compare NormVio and NormVio-RT by evaluating a model fine-tuned on NormVio against NormVio-RT, and vice versa, to examine the impact of the distribution shift between these domains. We choose the six coarse-level categories that overlap between the two, as shown in Table 7. To measure with-context performance, we use the previous comment history for Reddit and multi-user context (utterance) for Twitch to simulate the most similar setup in both domains. Overall, the experimental results show a pronounced distribution shift between Reddit (asynchronous) and Twitch (synchronous). Interestingly, models trained on Twitch generalize better than models trained on Reddit despite having 6x less training data. Specifically, models trained on the out-of-domain Twitch+context data perform comparably on the Reddit test set to those trained on the in-domain Reddit+context data.
Beyond Binary Toxicity Detection. Treating toxicity detection as a binary task may not be enough to understand nuanced intents and people's reactions to toxic language use (Jurgens et al., 2019; Rossini, 2022). To analyze toxicity holistically, recent works take a more fine-grained and multi-dimensional approach: (1) Explainability explains why a particular chat is toxic, with highlighted rationales (Mathew et al., 2021), free-text annotations of the implied stereotype (Sap et al., 2020; ElSherief et al., 2021; Sridhar and Yang, 2022), or pre-defined violated norms (Chandrasekharan et al., 2018; Park et al., 2021); these explanations can be used not only to improve the performance of toxicity detection models but also to train models that generate explanations. (2) Target identification finds the targets of toxic speech, such as whether the target is an individual or a group, or the name of the group (e.g., race, religion, gender) (Ousidhoum et al., 2019; Mathew et al., 2021). (3) Context sensitivity determines toxicity by leveraging context, such as previous tweets (Menini et al., 2021), comments (Pavlopoulos et al., 2020; Xenos et al., 2021), or previous sentences and phrases within the comment (Gong et al., 2021); these works show that context can alter annotators' labeling decisions but does not largely impact model performance (Pavlopoulos et al., 2020; Xenos et al., 2021; Menini et al., 2021). (4) Implication addresses veiled toxicity implied through codewords and emojis (Taylor et al., 2017; Lees et al., 2021) and microaggressions that subtly express prejudiced attitudes toward certain groups (Breitfeller et al., 2019; Han and Tsvetkov, 2020). (5) Subjectivity measures annotation bias (Sap et al., 2022) and manages the annotator subjectivity involved in labeling various types of toxicity, which arises from differences in social and cultural backgrounds (Davani et al., 2022). In this paper, we analyze the toxicity of synchronous conversations along the aforementioned dimensions by identifying explanations of toxicity in the form of norm categories (explainability), finding the targets of HIB (target identification), leveraging context for both annotation and modeling (context sensitivity), asking annotators for implied knowledge statements (implication), and examining how human decisions align with machine decisions under different amounts of information (subjectivity).

Conclusion
In this paper, we analyzed messages flagged by human moderators on Twitch to understand the nature of norm violations in live-stream chats, a previously overlooked domain. We annotated 4,583 moderated chats from live streams with their norm-violation categories and contrasted them with those from asynchronous platforms. We shed light on the unique characteristics of live-stream chats and showed that models trained on existing datasets perform poorly at detecting toxic messages in our data, which motivates the development of specialized approaches for the synchronous setting. Our experiments established that selecting relevant context is important for detecting norm violations in the synchronous domain. We hope our work will help develop tools that enable human moderators to efficiently moderate problematic comments in real-time synchronous settings and make the user experience in these communities more pleasant.

Limitations
Our data, analysis, and findings have certain limitations. Our research is restricted to the English language and the Twitch platform, although the methods used to detect rule violations in live-stream chat and to collect data can be adapted to other languages. Additionally, we recognize that our annotators were recruited from one country, which may result in a lack of diversity in perspectives and potential societal biases. Furthermore, we established a two-minute context window for each moderated comment, but this may not capture all relevant context. The small size of our human-annotated data may also limit the generalizability of our findings to other situations. We recognize that our dataset may not represent all instances of rule violations in real-world scenarios, due to the biases of moderators in choosing which users or comments to moderate or in prioritizing certain types of violations over others. Also, the randomly sampled data we annotated may not be representative of the entire population, and given the imbalance of rule-violation classes, our dataset may not contain enough samples of rare categories to support definitive conclusions.
Our experimental results indicate that models trained to detect norm violations using our data are far from perfect and may produce errors. When such models are used in real-world applications, this can result in overlooking potentially problematic comments or incorrectly flagging non-problematic ones. Therefore, we recommend using AI-based tools to assist human moderators rather than to fully replace them. Practitioners should also be aware that there may be users with malicious intent who try to bypass moderation by making their comments appear innocent. With access to moderation models, malicious users may be better able to craft toxic messages undetectable by existing models. As mentioned above, a final step of human review or verification of the model output will be beneficial. Additionally, it may be necessary to continuously update the model and to limit public access to it.

Ethical Considerations
We took several steps to ensure that our data collection was ethical and legal. We set the hourly rate of compensation for workers at $16.15, well above the country's minimum wage at the time ($7.4). To ensure the safety and well-being of our workers, we maintained open communication channels, allowing them to voice any questions, concerns, or feedback about the data annotation. This also helped improve the quality of the collected data, as we promptly addressed issues reported by workers throughout the process. We also allotted enough time for each annotation instance so as not to pressure annotators (40 days for 4,583 instances). We did not collect any personal information from annotators, and we did not conduct any experiments with human subjects.
We confirm that we collected and used chats (also referred to as user content) in accordance with Twitch's Terms of Service, and we do not publicly release the data, as doing so may violate laws against the unauthorized distribution of user content. However, we intend to make the platform-specific knowledge statements we compiled available to support future research on real-time chat in the live-streaming domain. During the collection process, we used the official Twitch API to monitor and retrieve chats.
Lastly, we want to emphasize that careful consideration must be given to user privacy when using moderation events to study norm violations. While users may be aware that their comments can be viewed by others in the chat room, researchers must also recognize that users have the right to request exclusion from the data. Researchers should establish a mechanism for users to contact them to have their data removed, and should refrain from publicly releasing the data, instead sharing it on a need-to-know basis to control who has access.

Acknowledgement
We would like to thank Yeeun Shin (SoftlyAI) for managing the data annotation process and Kyumin Park (SoftlyAI) for developing the web annotation framework. Datumo, known as SELECTSTAR in South Korea, provided the crowdsourcing platform for the annotation of the data.

A Annotation Details
We engaged in active discussions with annotators and provided detailed feedback after multiple rounds of pilot studies to ensure data quality.

A.1 Annotation UI
To make it easy for annotators to work with the various types of context, we created an annotation tool. The tool has three views, and the annotator selects the view corresponding to each annotation stage. Figure 6 shows the UI for stage 1, which displays only the user's last chat (the offending utterance) before the moderation event. Figure 7 shows the multi-user context panel with the chat logs from up to two minutes before the moderation event. To make it easier for annotators to find previous chats from the moderated user, we also created a single-user context panel that displays only the moderated user's chat logs from the multi-user context. Figure 8 shows both the chat logs and the video context. The video context shows a one-minute video clip around the moderation event.

A.2 Annotation Consolidation
To determine the final label for each moderated event, we aggregate the annotators' labels using a majority vote with heuristic rules. Each annotator a_i identifies a list of violated rules L_{a_i} = {l_1, l_2, ..., l_k} for a moderated event e at each stage s ∈ {1, 2, 3}. Here, we do not consider the target for HIB. We first evaluate the percentage agreement to measure inter-annotator agreement in each stage, using exact match and partial match. Exact match requires all three annotators to select exactly the same rules (L_{a_1} = L_{a_2} = L_{a_3}), and partial match requires at least one rule shared by all three annotators (|L_{a_1} ∩ L_{a_2} ∩ L_{a_3}| > 0). Table 4 shows the inter-annotator agreement percentages. We find that 98% of exact-match agreements are single-label cases (i.e., 98% of exact matches have only one label) and that many disagreements are resolved by the partial-match criterion. Of the disagreements that persist even under partial match, 92% are cases where one or two annotators marked a comment as violating the "Incivility" rule while the others did not. Finally, to determine the gold label from the three annotations, we apply a majority-vote approach, choosing the rule types selected by at least two annotators. We discard approximately 3% of events that cannot be consolidated because all three annotators provided different labels.

Target Agreement for HIB. For cases consolidated as HIB by the majority vote, we further analyze the inter-annotator agreement on target labels among the annotators who marked them as HIB. In cases where an annotator was unable to identify the target, we asked them to mark the target as "non-identifiable".
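The consolidation procedure (exact match, partial match, and majority vote) can be sketched as below, assuming each annotator's labels arrive as a set of rule types; the function name is ours.

```python
from collections import Counter

def consolidate(annotations):
    """Consolidate three annotators' rule labels for one moderated event.

    annotations: list of three sets of rule-type labels, one per annotator.
    Returns exact-match and partial-match agreement flags and the
    majority-vote gold labels (rule types chosen by at least two
    annotators); an empty label list means the event is discarded.
    """
    exact = annotations[0] == annotations[1] == annotations[2]
    partial = bool(set.intersection(*annotations))
    counts = Counter(l for ann in annotations for l in ann)
    majority = sorted(l for l, c in counts.items() if c >= 2)
    return {"exact": exact, "partial": partial, "labels": majority}
```

An event where all three annotators give disjoint labels yields an empty majority list, matching the ~3% of discarded events.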

B Experimental Setup Details
Each fine-tuning experiment uses one NVIDIA RTX A5000 GPU with FP16. We implement models using PyTorch (Paszke et al., 2019) and Huggingface Transformers (Wolf et al., 2019). We use the Adam optimizer with a maximum sequence length of 256 and a batch size of 4. We train for up to 100 epochs and validate performance every 100 steps, with an early-stopping patience of 10. For each dataset, we search for the best learning rate for our model out of [1e-5, 2e-5, 5e-5, 1e-4, 3e-4]. We then report the average score of 3 runs with different random seeds (42, 2023, 5555). Each run takes 10 to 30 minutes. To determine the ratio of positives to negatives in the training data, we searched for the best distribution out of [1:1, 1:2, 1:5, original] using random negative sampling. As shown in Table 12, we found that the even distribution (1:1) shows the most stable performance, with the lowest standard deviation both with and without context. Data statistics for both Twitch and Reddit (Park et al., 2021) are presented in Tables 13-14. Note that we report the data statistics after sampling the same number of negative samples as positive samples.

C Ablation Study C.1 Context Arrangement
To understand how the context arrangement in the input affects performance, we conduct experiments with multiple variants of context arrangement for moderation detection (see Table 15). First, the results show that randomly shuffled context consistently harms performance. This indicates that context order matters, in contrast to findings from dialog system studies (Sankar et al., 2019; He et al., 2021). Moreover, presenting the input in the sequential chat order used by the context-aware model of Pavlopoulos et al. (2020), or adding more contexts (e.g., broadcast category, rule text), degrades performance. This indicates that the target text should always be placed first, and that some contexts may not be helpful.
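A minimal sketch of the arrangements compared in this ablation (helper and argument names are hypothetical; only the target-first layout corresponds to our final model):

```python
import random

def arrange_input(target, contexts, arrangement="target_first", seed=0):
    """Build a classifier input string under different context orders.

    Context chats are joined with spaces; a single [SEP] separates the
    target message from the context block.
    """
    ctx = list(contexts)
    if arrangement == "shuffled":
        random.Random(seed).shuffle(ctx)   # randomly ordered context
    joined = " ".join(ctx)
    if arrangement == "sequential":
        return f"{joined} [SEP] {target}"  # chat order, target last
    return f"{target} [SEP] {joined}"      # target first (best setting)
```

For instance, `arrange_input("MSG", ["c1", "c2"])` yields `"MSG [SEP] c1 c2"`, whereas the sequential variant yields `"c1 c2 [SEP] MSG"`.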
Figure 1: A Motivating Example. Chat in the synchronous domain has different characteristics than chat in the asynchronous domain: (1) the temporal gap between chats and the message length are much smaller; and (2) relationships between chats are less clearly defined. Such differences make chats in the synchronous domain more difficult to moderate with existing approaches.

Figure 2: Data Construction. Norms are manually defined based on the chat rules of the top 200 streamers, and annotators label the violated norm of each moderated event in three stages.
NormVio-RT. To understand the model's ability to detect norm violations and how additional information affects detection, we train binary classification models for each category with different types of context, including conversation history, broadcast category, and rule description, following Park et al. (2021).

Experimental Setup. For each coarse-level category, we train a RoBERTa-base model with a binary cross-entropy loss to determine whether a message violates that norm. Following Park et al. (2021), we perform an 80-10-10 train/dev/test random split of moderated messages and add the same number of unmoderated messages to each split. Next, for each binary classification, we treat the target category label as 1 and all others as 0, and construct a balanced training data set. Appendix B (See Table

Figure 3: Multi-user context consists of the chat logs that occurred up to two minutes before the moderation event; single-user context consists of the moderated user's chat logs within the multi-user context; and single utterance is the moderated user's last message before the moderation event.
. The rule text is only used for training examples because it is not possible to know which rule was violated for unseen examples; we use randomly selected rule text for unmoderated negative examples during training. All contexts are appended to the input text (single utterance) with a special token ([SEP]) added between the input text and the context. Chat logs for multi-user context and single-user context are placed sequentially with spaces between chats. Training details and data statistics are presented in Appendix B.
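The context extraction and input construction described here can be sketched as follows; the tuple-based chat-log layout and helper names are our assumptions, not the released implementation:

```python
from datetime import datetime, timedelta

def extract_contexts(chat_log, event_time, event_user, window_minutes=2):
    """Split a chat log into the two context types described above.

    `chat_log` is a list of (timestamp, user, message) tuples in
    chronological order. Multi-user context: all chats in the two
    minutes before the moderation event; single-user context: the
    moderated user's chats within that window.
    """
    start = event_time - timedelta(minutes=window_minutes)
    multi = [(t, u, m) for t, u, m in chat_log if start <= t < event_time]
    single = [(t, u, m) for t, u, m in multi if u == event_user]
    return multi, single

def build_input(utterance, context_msgs):
    """Append context to the single utterance with a [SEP] token;
    context chats are joined sequentially with spaces."""
    if not context_msgs:
        return utterance
    return f"{utterance} [SEP] " + " ".join(m for _, _, m in context_msgs)
```

A model input is then built from the moderated user's last message plus either context list, e.g. `build_input(last_msg, multi)`.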

Figure 4: Performance (F1 score) of moderation detection by ground-truth label for each context type.

Figure 5: Performance (F1 score) trend of moderation detection with varying context length.

Figure 6: Step 1. Single utterance shows only the user's last chat before the moderation event.

Figure 7: Step 2. + chat context shows chat logs from up to two minutes before the moderation event (multi-user context); single-user context shows only the moderated user's messages within those two minutes.

Table 1: Live-streaming norms. We map rules from the top 200 Twitch streamers' channels to coarse- and fine-grained norms. Some rules specify targets (OOB: Others Outside of Broadcast; OIB: Others In Broadcast).

Table 2: A knowledge statement template.

Table 3: Data Statistics. "# of rules" indicates the number of streamers specifying the norm in their channels, and "# of violations" indicates the actual number of messages that violate the corresponding norms.

Table 4: Inter-annotator Agreement. Presents the percentage agreement over the 4,583 moderated events; the numbers in parentheses indicate the number of events.

Table 5: Performance (Binary F1) of toxicity detection models on HIB and Discrimination data. Binary F1 refers to the results for the 'toxic' class.

Table 6: Performance on Norm Classification. Macro F1 score for each coarse-level norm category. "All" refers to binary classification between moderated and unmoderated messages without considering norm category. Best models are in bold and second-best models are underlined. Scores are averages of 3 runs (3 random seeds).

Table 8: Inter-annotator Percent Agreement for Targets of HIB. Presents the agreement percentage for HIB after majority vote. The numbers in parentheses indicate the absolute number of events.

Table 9: Data statistics of knowledge statements.

Table 8
Note that some examples require both types of knowledge. The statistics demonstrate that HIB and Incivility most often require domain knowledge to understand the meaning behind them.

Knowledge statement examples. For each coarse-level norm category, we present example statements (see Table 10):

- (Platform) "... is text that means boredom, originating from a man who fell asleep on stream." (Streamer) "[streamer] was a LoL (League of Legends) game streamer, but he seems to have quit and now streams gambling."
- Spam: (Platform) "Gamba is text that means gambling, and [streamer] ordered mods to ban whoever types it." (Streamer) "[streamer] has banned using the emoji "PogU"."
- Meta-Rules: (Platform) "Shoutout is text that means highlighting notable members in chat, prompting others to follow their channel." (Streamer) "[streamer] is a game streamer on Twitch."
- Incivility: (Platform) "!sac or !sacme is text that means sacrifice; people in chat sacrifice themselves (get timed out) by typing them to earn some kind of points." (Streamer) "[streamer] has declined [person]'s fight offer."
A.4 Examples of Incivility.

Table 11: Example cases of Incivility.
Table 11 presents examples of chat moderation by streamers where the underlying reason for moderation is not apparent. These cases highlight potentially uncomfortable situations that streamers may encounter.