Say ‘YES’ to Positivity: Detecting Toxic Language in Workplace Communications

Warning: this paper contains content that may be offensive or upsetting. Workplace communication (e.g., email, chat) is a central part of enterprise productivity. Healthy conversations are crucial for creating an inclusive environment and maintaining harmony in an organization. Toxic communications at the workplace can negatively impact overall job satisfaction and are often subtle, hidden, or demonstrate human biases. The linguistic subtlety of mild yet hurtful conversations has made it difficult for researchers to quantify and extract toxic conversations automatically. While offensive language and hate speech have been extensively studied in social communities, there has been little work studying toxic workplace communications. Specifically, the lack of a corpus, the sparsity of toxicity in enterprise emails, and the absence of well-defined criteria for annotating toxic conversations have prevented researchers from addressing the problem at scale. We take the first step towards studying toxicity in workplace communications by providing (1) a general and computationally viable taxonomy to study toxic language at the workplace, (2) a dataset to study toxic language at the workplace based on the taxonomy, and (3) an analysis of why offensive language and hate-speech datasets are not suitable for detecting workplace toxicity. Our implementation, analysis, and data will be available at https://aka.ms/ToxiScope.


Introduction
Studies have shown that more than 80% of the issues affecting employees' productivity and satisfaction are related to negative work environment behaviors such as harassment, bullying, ostracism, gossiping, and incivility (Anjum et al., 2018). Moreover, workplace gossiping results in distracted employees and low morale. Duffy et al. (2002) and Kong (2018) find that workplace incivility leads to social undermining of employees, which can lead to trust issues, difficulty in establishing cooperative relationships, lower job satisfaction, and attitudinal outcomes such as gaining personal power and reputation (Aquino and Thau, 2009; Baumeister, 1995; Ellwardt et al., 2012; McAndrew et al., 2007).

* Most of the work was done while the first author was an intern at Microsoft Research.

Figure 1: The highlighted sentence was annotated as toxic and gossip by annotators. This instance has a confidence score of 0.15 on Perspective API.
Many organizations enact policies that prohibit extremely toxic behaviors such as bullying, verbal threats, profanity, harassment, and discrimination; yet detecting more subtle forms of toxicity, such as negative gossiping, stereotyping, sarcasm, and microaggressions, in conversations remains a challenge.
Toxicity can manifest in different ways. It spans a wide spectrum that includes subtle and indirect signals, which can often be no less toxic than overtly offensive language. While the research community has made enormous progress in detecting overtly offensive language and hate speech (Schmidt and Wiegand, 2017; Waseem et al., 2018; Fortuna and Nunes, 2018; Qian et al., 2019), there has been less focus on computationally evaluating other subtle expressions of toxicity.
Qualitative studies have found these subtle signals to have long-lasting negative effects (Sue, 2010; Nadal et al., 2014). As Figure 1 shows, currently popular toxicity detection tools cannot detect subtle yet hurtful conversations as harmful. We argue that it is equally important to detect these subtly aggressive conversations and educate employees for a healthy workplace. Detecting wider aspects of toxic text can be challenging. Subtle signals like stereotyping and mild aggression can be context-sensitive, sparse, and highly subjective, and do not have well-defined annotation guidelines; whereas overtly toxic language and hate speech are rarely context-sensitive (Pavlopoulos et al., 2020) and have well-defined guidelines (Waseem et al., 2017). In this paper, we take first steps towards (1) defining a taxonomy for studying toxic language in a workplace setting by analyzing the definitions from impoliteness theory and psychology, (2) building a dataset of human annotations on a publicly available email corpus, (3) providing computational methods to establish baselines for detecting toxic language in enterprise emails, and (4) analyzing why current datasets and tools for detecting hate speech do not work in our setting.

Related Work
Offensive Language Detection: Perspective API is a popular toxicity detector for offensive conversations. Waseem et al. (2018) devised a taxonomy and created a dataset to detect hate speech and discrimination. Xu et al. (2012) studied bullying traces in social media, and Rajamanickam et al. (2020) showed that jointly modeling emotion and abusive language detection improves model performance. However, toxic language in the workplace often consists of subtly aggressive conversations rather than overtly offensive text. Subtly aggressive conversations can be covert faux pas or unintentional, whereas offensive text is overt and involves an intentional choice of words. Also, workplace conversations are more formal than social media text. Due to this fundamentally different structure, current datasets and models trained on them are not able to properly detect workplace toxicity. Microaggression datasets: Breitfeller et al. (2019) released a dataset from Reddit, Gab, and www.microaggressions.com, showing that it is possible to annotate these highly subjective and linguistically subtle uncivil communications and detect them using computational methods. The dataset focuses on gender-based discrimination because of its availability in social media, and the annotation guidelines also use gender as the discrimination axis to determine toxicity. In contrast, we are interested in formal conversations that are context-dependent and are mainly targeted towards the individuals addressed in emails, irrespective of gender. Wang and Potts (2019) introduced a new Reddit dataset with labels corresponding to condescending linguistic acts in conversations and showed that, by leveraging context, it is possible to detect this type of challenging toxic language. Similarly, Caselli et al. (2020) leveraged the context of occurrences to create a Twitter dataset for implicit and explicit abusive language. Implicit abusive language does not immediately insinuate abuse.
However, its true meaning is often concealed by the lack of profanity or hateful terms, which makes it difficult to detect. Oprea and Magdy (2020) released a corpus of sarcasm self-annotated by authors on Reddit. However, these datasets mainly contain abusive language and sarcastic tweets on popular social events and are informal.
To the best of our knowledge, there is no available dataset in our community to study toxic language in emails. The most similar work to ours is Raman et al. (2020); however, its focus has been mostly offensive language in the GitHub community, whereas our work focuses on detecting toxicity in workplace emails. Email Communications: There is also some prior work on email corpora for sociolinguistic downstream tasks. Prabhakaran et al. (2014) explored the relation between power and gender on the Enron corpus. They showed that the manifestations of power differ significantly between genders and that gender information can be used to predict the power of people in conversations. Similarly, Bramsen et al. (2011) studied social power relationships between members of a social network, based purely on the content of their interpersonal communication, using statistical methods. Madaan et al. (2020) released an automatically labeled Enron corpus for politeness; however, their definition of politeness does not capture toxic language. Chhaya et al. (2018) devised a computational method to identify conversation tone in the Enron corpus. They categorize tones as frustration, formal, and polite, and find that affect-based features are important for detecting tone in conversation. However, affect-based features do not capture subtle offensive text. We are interested in studying subtle and offensive text in workplace emails, which differs from the prior work in this area.

Toxicity in Enterprise Email
Our goal is to study and understand toxic workplace communications through one of the most frequently used modes of communication in organizational settings: email (The Radicati Group, 2020). The distribution of our dataset (Section 3.2) demonstrates the significant presence of implicit and subtle toxic language in workplace email communications, in contrast to social media and open-source communities. We created a taxonomy (Section 3.1) and a crowd-sourced annotation task (Section 3.2) to manually annotate toxic language in the Avocado research email collection (Oard et al., 2015). This collection contains corporate emails from an information technology company referred to as "Avocado": an anonymized version of the full content of emails, along with meta information, from the Outlook mailboxes of its employees. The full collection covers 279 employees and 938,035 emails.
In addition, we analyze the different emotional affects associated with each category of toxic language. From previous work, we understand that toxic language has a strong correlation with negative emotions. We also studied whether using context is beneficial in determining toxicity. To this end, we conducted an analysis of whether humans benefit from context when detecting toxic language in emails. We assume that to determine toxicity in a text, humans read the entire email body and previous emails, not only the given text. We quantify these observations through annotations before using context-aware representations in our modeling.

Taxonomy for toxic language
We leveraged descriptions of different negative workplace practices, with definitions from impoliteness theory (Culpeper, 1996) and offensive language detection in social media (Zampieri et al., 2019b,a), to define a taxonomy for toxic language in workplace communications. We had the following goals in mind: the taxonomy should (1) generalize across different organizations, (2) be sufficiently represented in our corpus, and (3) cover the main dimensions of negative workplace culture from cross-domain literature. We summarize the definitions in Table 2 and describe each category below. Non-Toxic: The non-toxic class has instances of friendly, knowledge-sharing, and formally respectful conversations. These conversations often have positive or neutral connotations. Impolite: The impolite class has instances of sarcasm, stereotyping, and rude statements. These conversations often have polarity opposite to their previous context, with negative or neutral connotations, which might complement the work on benevolent sexism (Jha and Mamidi, 2017). Following impoliteness theory (Culpeper, 1996), we define 'rude' as direct, intentionally disrespectful words to the addressee, whereas sarcasm (implicature expressing the opposite of what is said) and stereotyping (unintentional) need not be direct, yet are disrespectful comments to the addressee in the conversation. Negative Gossip: The gossip class includes rude, mocking conversations about a person not involved in the conversation. We find these instances have negative connotations, with a tone of complaint and a lack of respect toward the target. Kong (2018) found that repeated gossip conversations in organizations caused hostility and stress among employees. As shown by Wulczyn et al. (2017), conversations targeted towards a third person need not be extreme, yet can be disrespectful.
Evidently, our annotators feel gossip conversations are more annoying, whereas impolite conversations carry more sadness, with higher overlap with the offensive category (Figure 3). We refer to this type as "Gossip" in the rest of the paper. Offensive: Detecting overtly toxic language has been extensively studied in the research community. We follow a definition of offensive language similar to Zampieri et al. (2019b), which refers to any form of unacceptable language used to insult a targeted individual or group. In our setting, we define offensive language to include five broad categories: profanity, bullying, harassment, discrimination, and violence.
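The taxonomy above can be captured in a small label schema. The sketch below is our own illustration (the names `ToxicityType` and `is_toxic` are not from a released package); it records that sub-types are sets, since they are not mutually exclusive.

```python
from enum import Enum

class ToxicityType(Enum):
    NON_TOXIC = "non-toxic"    # friendly, knowledge-sharing, respectful
    IMPOLITE = "impolite"      # sarcasm, stereotyping, rude statements
    GOSSIP = "gossip"          # mocking talk about an uninvolved third party
    OFFENSIVE = "offensive"    # profanity, bullying, harassment, discrimination, violence

def is_toxic(labels: set) -> bool:
    """A sentence is toxic if it carries any label other than non-toxic."""
    return any(label is not ToxicityType.NON_TOXIC for label in labels)
```

Because a sentence can be, say, both gossip and offensive, downstream code should treat the label as a set rather than a single class.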

Annotation task
We design a hierarchical annotation framework to collect instances of sentences in emails and the corresponding labels on a crowd-sourcing platform. Before working on the task, annotators go through a brief set of guidelines explaining the task. We collect the dataset in batches of around 1,000 examples each. For the first three batches, we upload 75-100 instances manually labeled as toxic by the group of researchers working on the project to check whether the annotators followed the guidelines. We repeat this pilot testing until desirable performance is achieved. We also manually review a sample of the examples submitted by each annotator after each batch, exclude annotators who do not provide accurate inputs from the annotator pool, and redo all their annotations. A key characteristic of subtle toxic emails is that they often result from prior experiences, cultural differences, or the backgrounds of the individuals involved (Sue et al., 2007). Hence, designing an annotation task for detecting toxicity is difficult, and there will be discrepancies in perceived toxicity between annotators. To minimize ambiguity and provide clearer context to the annotators, we provide the email body, subject, and the prior email in the thread as context information.
For each highlighted sentence, annotators indicate whether it is toxic, the type of toxicity, whether the target of the toxic comment is the recipient or someone else, whether the prior email provided as context was helpful, the kind of negative affect associated with the toxicity, and whether the whole email is toxic. We provide a subset of negative affects to the annotators from WordNet-Affect (Strapparava and Valitutti, 2004). The annotators answer the questions on the type of toxicity and the target only if they indicate potential toxicity during annotation. They can also choose multiple toxic categories for a highlighted sentence. Finally, the annotators are provided an optional text box for additional details if the highlighted sentence does not belong to any of the categories we defined. Note that the sub-types of toxicity do not have clear boundaries and are not mutually exclusive.
A total of 76 annotators participated in this task. All annotators were fluent in English and came from four countries: the USA, Canada, Great Britain, and India, with the majority residing in the USA. Each highlighted statement in an email was annotated by three annotators, and they were compensated at an hourly rate (as opposed to per annotation) to encourage them to optimize for quality. They took an average of 5 minutes per annotation. We consider a sentence toxic even if only one of the three annotators perceived it as toxic. We adopt this principle to be inclusive of every individual's background, culture, and sexual orientation, and because implicit toxic language can be subtle. Similarly, we include the union of the toxicity types selected by the three annotators for each instance. A snapshot of our crowd-sourcing framework can be found in Appendix 5. Due to the scarcity of toxic conversations in emails, we adopt a two-round approach for data collection. For the first round of annotations, we use several heuristics to increase the chances of identifying positive instances in the sample. We tried running the Perspective API and the microaggression model (Breitfeller et al., 2019) against the Avocado corpus. The coverage of the Perspective API is extremely low (0.1%), since little overtly toxic text is present in the Avocado corpus. On the other hand, the microaggression model's output has low precision (0.12%). To further prune false positives, we employ filtering methods over the outputs of the microaggression model before sending the positive labels for annotation. The first round of annotations yielded a positive label ratio of 2.74%, compared to 0.29% from a manually annotated batch of around 800 random email sentences. This implies the need to be selective about the emails we submit for annotation. In addition, for the second round of annotations, we used an SVM classifier to pick positive instances from the unlabeled email corpus.
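The label-aggregation rule described above (a sentence counts as toxic if any of the three annotators flags it, and its type set is the union of their selections) can be sketched as follows; the function name and tuple layout are our own illustrative choices, not from a released codebase.

```python
def aggregate(annotations):
    """Combine per-annotator judgments for one sentence.

    annotations: list of (is_toxic, set_of_type_labels) tuples, one per annotator.
    Returns (toxic, types): toxic if ANY annotator flagged the sentence,
    with types being the union of all selected toxicity types.
    """
    toxic = any(flag for flag, _ in annotations)
    types = set().union(*(t for _, t in annotations)) if toxic else set()
    return toxic, types
```

The "any" rule trades precision for inclusiveness, which matches the stated goal of being sensitive to each annotator's background.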
To avoid model biases, we randomly sample unlabeled email sentences based on their probability scores, with more instances sampled from the higher score ranges. The second round of annotations yielded a positive label ratio of 11.2%, which is significantly higher than in our previous rounds. The classifier is updated with more examples after each round of annotations.
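One way to realize the score-weighted sampling described above is to draw without replacement with probability proportional to the classifier score. This is a sketch under our own assumptions, not the authors' exact procedure:

```python
import random

def score_weighted_sample(sentences, scores, k, seed=0):
    """Sample k distinct sentences, favoring higher classifier scores.

    Items are drawn without replacement, each draw weighted by its score,
    so high-scoring sentences are over-represented but low-scoring ones
    can still appear (reducing selection bias toward the model's favorites).
    """
    rng = random.Random(seed)
    pool = list(zip(sentences, scores))
    chosen = []
    for _ in range(min(k, len(pool))):
        total = sum(s for _, s in pool)
        r = rng.uniform(0, total)
        acc = 0.0
        for i, (sent, s) in enumerate(pool):
            acc += s
            if acc >= r:
                chosen.append(sent)
                pool.pop(i)
                break
    return chosen
```

With uniform scores this degenerates to uniform sampling without replacement, which is a useful sanity check.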
Overall, the final dataset contains 10,110 email sentences, of which 1,120 are labeled as toxic by annotators. We call this dataset for studying toxic language in workplace communications ToxiScope. Note that we asked the annotators to identify spam emails and their types, including advertisement, adult content, and derogatory content. We observed that 99% of the emails in the spam category are advertisements, and we decided to exclude those emails since advertising content is not in the scope of toxic language detection. Figure 2 shows the distribution of toxic emails over the sub-categories of toxic language, which indicates a higher frequency of impolite emails. Annotator Agreement: Overall, the annotations showed an inter-annotator agreement of Krippendorff's α = 0.718 on whether a given sentence was toxic or not. Broken down by category, annotators agreed on a sentence being offensive at Krippendorff's α = 0.77, impolite at α = 0.29, and gossip at α = 0.32. The high agreement score on overall toxicity shows that annotator judgments are reliable, while the lower agreement scores on sub-types are indicative of the subjectivity and lack of objectivity of implicit toxicity (Lilienfeld, 2017), not of annotation quality. We also note several prior works on toxicity and other tasks that lack objectivity and have inter-annotator agreement scores in our range: the microaggression dataset has a score of 0.41 for 200 instances, and Rashkin et al. (2016) report an inter-annotator agreement of 0.25.
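Krippendorff's α for nominal labels, as reported above, can be computed from per-item label lists. The following is a minimal stdlib sketch of the standard coincidence-matrix formulation (α = 1 − D_o/D_e), offered for illustration rather than as the authors' implementation:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists, each inner list holding the labels
    assigned to one item by its annotators (length >= 2 to count).
    """
    coincidence = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # items with a single rating carry no pairable information
        for a, b in permutations(labels, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), w in coincidence.items():
        totals[a] += w
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidence.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    return 1.0 if expected == 0 else 1 - observed / expected
```

For perfectly agreeing annotators the function returns 1.0, and for a single item with two disagreeing labels it returns 0.0, matching the usual conventions for the statistic.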
Insights from the annotation task: Defining a clear boundary between categories of toxic language is sometimes challenging because they are not mutually exclusive; a statement can therefore belong to multiple toxic categories. For example, the content of an email can be gossip and at the same time be discriminatory against a certain group of people. Our analysis shows that 92% of emails belong to a single toxic category, while the rest contain two or more types of toxic language. Figure 3 shows the co-occurrence of different toxic contents in the same email. We observe that the Offensive and Impolite categories are slightly more likely to occur in the same email than either is with Gossip. Since our task is highly subjective, in order to understand the reasons behind perceived toxicity we ask annotators several questions about the target and affect of the toxic statement, and whether the context (previous email) is useful in determining its toxicity. We find that in 41% of the instances, context information was helpful for determining toxicity. In 76.86% of the toxic instances, the language was targeted at another individual or a group. Understandably, all the toxic instances have negative affect, with anger and hostility present in most cases. However, annotators find gossip examples more disgusting, and they find a toxic sentence 6.1% more annoying when it is targeted at another individual not in the conversation.
We use 70% of the data for training and 10% as a validation set, holding out the remaining 20% as a test set.
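The 70/10/20 split can be reproduced with a seeded shuffle. The exact split indices were not released with the text above, so this sketch (with an arbitrary seed) is illustrative only:

```python
import random

def train_val_test_split(items, seed=42):
    """Shuffle and split into 70% train, 10% validation, 20% test."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n = len(items)
    n_train = int(0.7 * n)
    n_val = int(0.1 * n)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Fixing the seed keeps the split stable across experiments, which matters when comparing context variants on the same held-out test set.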

Detecting toxic conversations in Emails
We design our experiments with the following goals: (1) Investigate whether contextual information (the email body, the parent email) helps in determining toxicity, and study which categories of toxic language benefit from adding context to the sentence.
(2) Test our hypothesis that current toxic language datasets cannot identify indirectly aggressive or impolite sentences; we consider current state-of-the-art toxic language detectors for this task.
(3) Evaluate our baseline models on other datasets, including Wiki Comments (Wulczyn et al., 2017) and GitHub (Raman et al., 2020), to study whether understanding subtle signals helps in determining overtly toxic language. We experimented with publicly available state-of-the-art models. Table 4 summarizes the performance of models trained and tested on ToxiScope. The baseline performance is reported for binary classification (toxic vs. non-toxic). We report evaluation metrics in F1 (macro and micro) and per-class accuracy (TPR and TNR) due to class imbalance. For the models in Table 4 that required context as an input, we took the prior email in the thread during pre-processing. The results imply that pretrained Bert models fine-tuned on ToxiScope perform better than non-pretrained models; hence, we focus on these models to evaluate the effect of context on the outcome. In addition, the low recall, or True Positive Rate (TPR), demonstrates the challenge of detecting subtle toxic instances in communications; from here on we pay more attention to the TPR and F1 metrics.
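The reported metrics for the binary toxic/non-toxic task (TPR, TNR, macro-F1) can be computed directly from predictions; a stdlib sketch, with 1 denoting the toxic class:

```python
def binary_metrics(y_true, y_pred):
    """Return (TPR, TNR, macro-F1) for binary labels with 1 = toxic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall on the toxic class
    tnr = tn / (tn + fp) if tn + fp else 0.0  # recall on the non-toxic class

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro-F1: unweighted mean of per-class F1 (swap roles for the 0 class).
    macro_f1 = (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2
    return tpr, tnr, macro_f1
```

Under heavy class imbalance, macro-F1 and the two per-class rates are far more informative than plain accuracy, which motivates the metric choice above.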

Results and Analysis
Effect of adding context: As outlined in Section 3.2, annotators find the prior email and email body helpful for determining toxicity. Pavlopoulos et al. (2020) showed that adding context did not help pre-trained models like Bert boost performance. However, the dataset in their setting was small, and the target comments were mostly offensive. Those observations may not generalize to our case, since we are interested in detecting implicit and subtle cases of aggressive language. In order to evaluate the effect of contextual information, we experimented with different variations of the context. Table 5 presents the TPR for different categories of toxic language. Based on our experiments, models find context helpful for detecting toxicity. Interestingly, models do not find contextual information necessary for detecting offensive language, unlike the other categories. We also observed that the gossip category benefits the most from using the neighboring sentences as context. The majority of the gossip emails in our dataset belong to the complaint sub-category, which is spread across multiple sentences; hence, many of the neighboring sentences could have had negative connotations that aided the models. However, on average, using the previous email in the thread is most helpful for detecting toxic language. In general, finding implicit toxic language is a difficult task. This is evident in the low TPR for the gossip and impolite classes, as well as their sparse labels and low inter-annotator agreement scores. Generalization to other domains: To investigate how other domains can leverage our dataset, we trained the baseline models for toxic language detection (Breitfeller et al., 2019; Raman et al., 2020) and context-aware sentence classification on ToxiScope. Then, we tested these models against different toxic language datasets.
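The context variants compared above amount to different choices of a second input segment for a Bert-style encoder. The sketch below shows the pairing as a plain whitespace-tokenized string so it stays dependency-free; in practice a subword tokenizer would do the splitting, and the left-truncation policy here is an assumption of ours, not the authors' stated pre-processing:

```python
def build_input(sentence, context=None, max_tokens=128):
    """Pair a target sentence with optional context, Bert-style.

    The context (e.g. the prior email in the thread) is truncated from
    the left so the most recent text, nearest the target, survives.
    """
    sent_toks = sentence.split()
    ctx_toks = context.split() if context else []
    budget = max_tokens - len(sent_toks) - 3  # room for [CLS] and two [SEP]
    if budget < 0:
        sent_toks = sent_toks[:max_tokens - 3]
        budget = 0
    ctx_toks = ctx_toks[-budget:] if budget else []
    return " ".join(["[CLS]", *ctx_toks, "[SEP]", *sent_toks, "[SEP]"])
```

Swapping what fills the first segment (prior email, email body, or neighboring sentences) reproduces the different context conditions compared in Table 5.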
Since we did not find any dataset studying toxic language in the workplace (with both implicit and explicit toxic text), we picked datasets that overlap with one or a few categories of interest. The results are presented in Table 6, which shows that Bert-based models outperform other methods in all of the domains. Note that on the microaggression dataset we achieve a TPR of 0.54, which is better than the model provided by Breitfeller et al. (2019), whose best TPR is 0.36. On the Wiki Comments dataset, our baseline Bert models have good accuracy (TPR 0.86) in detecting toxic text, comparable to the TPR of the Perspective API (0.85). The reason for the high false positive rate could be that the Wiki Comments dataset does not consider subtly aggressive text toxic. The best-performing classifier by Raman et al. (2020) on the GitHub dataset has a TPR of 0.35. One reason for the poor scores on the GitHub dataset can be attributed to noisy labels.
We sampled a few instances from the GitHub dataset and found 15% of them to be noisy. Overall, these experimental results imply the potential benefit of using our dataset for detecting toxic language in the social media and open-source community domains.
Leveraging social media and open-source community data to detect workplace toxicity: Offensive language is widely studied in social media, and there are several datasets and methods available for this task. Table 8 presents the performance of the publicly available models and API on ToxiScope. The model from Breitfeller et al. (2019) has reasonable performance on ToxiScope. Their method uses lexicons for microaggressions from external sources. Leveraging these external sources as weak supervision signals might help boost the performance of models on ToxiScope as well.
Next, we investigated whether these datasets can be helpful for training models to detect workplace toxicity. We fine-tuned and trained Bert-based models on Microaggression, GitHub, and Wiki Comments and ran inference on ToxiScope. As expected, Table 7 shows that the models trained on the Microaggression dataset are the most applicable to workplace toxic language detection. However, they still perform worse than the in-domain models (Table 4). The impolite and gossip categories (consisting of sarcasm, stereotyping, and rudeness) are predominant in ToxiScope, while there are few datasets available for these tasks and the existing ones are small. This could explain the inadequate performance of these models.

Conclusion
We identified a gap in the available resources for detecting negative workplace communications; based on our observations, the Microaggression dataset was the only resource applicable to this domain, and it did not show promising performance. Hence, we created ToxiScope to close this gap. We presented a taxonomy and annotation guidelines for studying toxic language in workplace emails, and we provided baseline methods for detecting toxic language in ToxiScope. Further, we demonstrated the necessity of a new dataset for detecting workplace toxicity, since models trained on existing overtly toxic datasets and on the Microaggression dataset do not detect subtle toxic text. In addition, we observed that context helps Bert-based models detect subtle toxic sentences. However, our results indicate that we need more sophisticated models and better representations of context to detect implicit toxic sentences. In the future, we will explore other methods, such as weak supervision from other sources and self-training, for better performance. Going forward, we will also investigate other research questions pertaining to the likelihood of an individual using toxic language repeatedly, the correlation of power and gender dynamics with toxicity, the presence of bias (racial/gender) in ToxiScope, and the degree of severity of toxic text. We hope our work will encourage researchers in the community to study and develop methods to detect workplace toxicity.

Annotation
In this work, we leverage the publicly available Avocado corpus, which belongs to the Linguistic Data Consortium (LDC). This email dataset has been processed and anonymized by LDC. We received approval from our organization's Institutional Review Board (IRB) before starting the annotation task to make sure we comply with the Avocado Research Email Collection license agreements as well as ethical guidelines. We understand that annotating potentially toxic content can have a negative impact on the workers. In order to reduce these effects, we provided warnings and information about the research project in a consent form. We asked the annotators to read the consent form and only proceed if they agreed to its terms (Figure 4). The risks and benefits of working on this annotation task were presented to annotators in the consent form:

Benefits: There are no direct benefits to you that might reasonably be expected as a result of being in this study. The research team expects to learn to detect micro-aggressive and toxic language in email communications from the results of this research, as well as any public benefit that may come from these Research Results being shared with the greater scientific community.

Risks: During your participation, you may experience some discomfort being exposed to profanity, toxic and discriminatory language in emails. To mitigate this risk, the research team makes it possible for you to take a break or skip tasks without adversely affecting your ratings within the crowdsourcing platform. This research may involve risks to you that are currently unforeseeable.

In addition, we did not collect any personal or demographic information other than the annotators' crowd-sourcing platform identification numbers. The consent form explains how we manage their information and provides details about their compensation. Resources were also provided to answer the annotators' questions and concerns.
Moreover, we limited the number of emails an annotator can work on in a task and paid them above minimum wage ($12-15 per hour).

Deployment
Detecting harmful language in email communication is a difficult task even for humans. Recent work has shown that toxic language detection models are also very prone to racial biases (Sap et al., 2019; Davidson et al., 2019) because they are trained on biased datasets. In this work, we hired annotators from different English-speaking countries to reduce the bias in our dataset. However, this is a research paper whose goal is to better understand the problem of toxic language in workplace communications and to encourage other researchers to work on it. We believe further study of this dataset is needed to ensure it is not biased before deploying any computational model.
In addition, deploying this technology requires access to employees' communications. To the best of our knowledge, most workplaces do not provide any guarantee of privacy for employees' communications on enterprise systems. Moreover, several existing technologies are already applied to workplace communications to improve users' productivity, such as response generation and intent detection in emails. These technologies are used without violating users' privacy thanks to advances in unsupervised learning and privacy-preserving machine learning.
Moreover, this technology has multiple applications, some of which could potentially be used to harm employees and their friends and family. For example, using this model to detect toxic language and report employees to HR or their manager is a high-stakes application. If such a system makes a false positive error, it may damage an employee's reputation, force the employee to defend themselves, and diminish their trust in the company. This technology can also be used to provide feedback to employees about their written communication style. Such a tool could be used for training purposes and to increase workers' awareness of micro-aggressive language. If the system makes frequent false positive errors, employees will become annoyed and less productive, causing an eventual drop in the company's profits. Companies can pursue mitigation steps and allow employees to provide feedback and dispute the system's predictions.