KOLD: Korean Offensive Language Dataset

Recent directions for offensive language detection are hierarchical modeling, identifying the type and the target of offensive language, and interpretability with offensive span annotation and prediction. These improvements are focused on English and do not transfer well to other languages because of cultural and linguistic differences. In this paper, we present the Korean Offensive Language Dataset (KOLD) comprising 40,429 comments, which are annotated hierarchically with the type and the target of offensive language, accompanied by annotations of the corresponding text spans. We collect the comments from NAVER news and the YouTube platform and provide the titles of the articles and videos as context information for the annotation process. We use these annotated comments as training data for Korean BERT and RoBERTa models and find that they are effective at offensiveness detection, target classification, and target span detection, while having room for improvement in target group classification and offensive span detection. We discover that the target group distribution differs drastically from the existing English datasets, and observe that providing the context information improves the model performance in offensiveness detection (+0.3), target classification (+1.5), and target group classification (+13.1). We publicly release the dataset and baseline models.


Introduction
Online offensive language is a growing societal problem. It propagates negative stereotypes about the targeted social groups, causing representational harm (Barocas et al., 2017).

Figure 1: An illustration of the annotation process of KOLD. The title is given to provide context to the annotators. Along with the categorical labels, annotators are asked to mark the spans of a sentence that justify their decision.

Among various research directions for offensive language detection, Waseem et al. (2017) propose a taxonomy to distinguish hate speech and cyberbullying based on whether the offensive language is directed toward a specific individual or entity, or toward a generalized group. Zampieri et al. (2019) integrate this taxonomy into a hierarchical annotation process to create the Offensive Language Identification Dataset (OLID), which has been adopted in other languages as well (Zeinert et al., 2021; Zampieri et al., 2020). Many problems remain in offensive language detection, such as failure to generalize (Gröndahl et al., 2018; Karan and Šnajder, 2018), over-sensitivity to commonly attacked identities (Dixon et al., 2018; Kennedy et al., 2020), and bias propagated through annotation (Sap et al., 2019; Davidson et al., 2019). Among various attempts to solve these problems, the first is offensive language target detection, which identifies the individuals or groups who are the victims of offensive language (Sap et al., 2020; Mathew et al., 2021; Shvets et al., 2021). The second is offensive language span detection, which provides some explanation for offensive language (Pavlopoulos et al., 2022; Mathew et al., 2021). The third is providing context information of offensive language to improve offensive language detection (Vidgen et al., 2021; de Gibert et al., 2018; Gao and Huang, 2017).
However, these steps forward are limited to English because we lack comprehensive datasets in other languages (Mubarak et al., 2017; Fortuna et al., 2019; Chiril et al., 2020; Rizwan et al., 2020). Language-specific datasets are essential for offensive language, which is highly culture- and language-dependent (Reichelmann et al., 2020; Albadi et al., 2018; Ousidhoum et al., 2019). Although Korean is a comparably high-resourced language (Joshi et al., 2020), there are only a few publicly available Korean offensive language corpora (Moon et al., 2020), and they do not consider the type of the target (e.g., group, individual) to differentiate hate speech from cyberbullying, do not include context information such as the titles of articles, and do not annotate text spans for explainability.
In this paper, we describe and publicly release the Korean Offensive Language Dataset (KOLD), 40,429 comments collected from news articles and videos.¹ The unique characteristics of our dataset are as follows:

• It is the first Korean dataset with a hierarchical taxonomy of offensive language (see Figure 1). If a comment is group-targeted offensive language, we additionally annotate it with labels from a set of 21 target groups tailored to Korean culture.
• The specific spans of text that are offensive or that reveal targeted communities are annotated (see Table 1 for examples). KOLD is the first publicly released dataset to provide both types of spans for offensive language in Korean.
• The comments in our dataset are annotated with their original context. We provide the titles of the articles and videos during the annotation process, which resembles the realistic setting of actual usage.

Annotation Task Design
We use a hierarchical annotation framework based on the multi-layer annotation schema in OLID (Zampieri et al., 2019). Additionally, we identify the specific target group of the offensive language.
We also annotate the spans that support the labeling decision if the comment is offensive and/or contains a target of offensiveness. Figure 1 illustrates an overview of our annotation task, and Table 1 shows examples.
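Concretely, a single annotated comment can be pictured as one record carrying all three levels plus span offsets. The JSON field names below are our own illustration (the released schema may differ); the comment is the English translation of the last row of Table 1, with spans stored as half-open character offsets:

```python
import json

# A hypothetical KOLD-style record (field names are illustrative, not the
# released schema). Spans are half-open character offsets into `comment`.
record = json.loads("""
{
  "title": "GS25's poster of Brave Girls evokes dispute over gender issues yet again",
  "comment": "Feminism is not about gender inequality, it is a mental disorder",
  "level_a": {"offensive": true, "offensive_span": [[47, 64]]},
  "level_b": {"target_type": "GRP", "target_span": [[0, 8]]},
  "level_c": {"target_groups": ["Feminist"]}
}
""")

start, end = record["level_b"]["target_span"][0]
print(record["comment"][start:end])  # prints "Feminism", the target span text
```

Storing spans as character offsets (rather than copied substrings) keeps the annotation unambiguous even when the same word occurs twice in a comment.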

Level A: Offensive Language Detection
At level A, we determine whether the comment is offensive (OFF) or not (NOT), and which part of the comment makes it offensive (offensive span).
We consider a comment offensive if it contains any form of untargeted profanity or targeted offense such as insults and threats, which can be implicit or explicit (Zampieri et al., 2019). We define the offensive span as a specific segment of text that justifies why a comment is offensive, also known as a rationale (Zaidan et al., 2007). In parallel with the definition of offensiveness, the span includes not only explicit profanity but also implicit offensive language (e.g., sarcasm or metaphor (ElSherief et al., 2021)). If the offensiveness is conveyed across multiple sentences in the comment, all of them are captured as the offensive span. Taking into account the faithfulness of a rationale (DeYoung et al., 2020), the offensive span is the minimal snippet of text (i.e., sufficient) that includes all forms of expression conveying even the slightest intensity of offense (i.e., comprehensive), such as affixes and emojis.

Level B: Target Type Categorization of Offensive Language
Level B categorizes the type of the target and highlights the supporting span of the target (target span).
There are four possible categories.
• Untargeted (UNT): An offensive comment that does not contain a specific target.
• Individual (IND): An offensive comment that is targeted at a specific individual. This includes a famous person or a named/unnamed individual with a specific reference in the text. Comments targeted at an individual are categorized as cyberbullying (Chen et al., 2012).
• Group (GRP): An offensive comment targeted at a group of people with shared protected characteristics, such as gender or religion. Offensive language in this category is generally considered hate speech (Zhang and Luo, 2019).
• Others (OTH): An offensive comment whose target does not belong to the above two categories, such as an organization, a company, or an event.
We define a target span as a span of characters in the comment that indicates the target of the offensive language. It is collected for all types of targeted offensive speech, regardless of the target type (IND, GRP, OTH). If the term used to indicate the target is itself offensive, the target span can overlap with the offensive span (e.g., jjangkkae, which corresponds to ching-chong in English).

Level C: Target Group Identification of Group Targeted Offensive Language
Level C identifies the specific targets of offensive language, which consists of two hierarchical levels: target group attribute and target group. A target group represents a specific social or demographic group that shares the same identity (e.g., Women, Muslim, Chinese), and the target group attribute is a superclass of the target group. We allow multi-group annotation if the target entity of the comment belongs to more than one group. For instance, "페미년 (feminist bitch)", a word that disparages a feminist woman, targets two groups: Women and Feminist. Table 10 in the Appendix contains the full set of 21 target groups. To determine the set of target groups, we begin by categorizing targets as in Sap et al. (2020) and add several categories to better reflect the Korean language and culture. After analyzing the targets in 1,000 initial samples, we add Chinese, Korean-Chinese, and Indian to the Race, Ethnicity & Nationality attribute, as they take up larger portions than the initial target groups (White, Asian). Group characteristics that do not belong to the four target group attributes (e.g., Disabled, Feminist) are grouped under Miscellaneous. Note that we are aware that feminism is a gender-related issue, but we classify Feminist under Miscellaneous because feminists embody a group of people that share the same ideology rather than being a subclass of gender. We show the distribution of the top two levels (A, B) and the target group attributes (Level C) in Table 2, and the subsequent target group categories in Table 3.

Source Corpora Collection
We choose two social media platforms, NAVER and YouTube, as our sources of data; they are two of the top three mobile apps used in Korea in 2021.² In particular, we collect titles and comments on NAVER news articles and YouTube videos published from March 2020 to March 2022. Due to the scarcity of offensive comments, we collect articles and posts using predefined keywords, which is a commonly used method in hate speech dataset construction (Waseem and Hovy, 2016). Every keyword is potentially highly correlated with articles or videos that may have abusive comments. Keywords are listed in Appendix A.
To ensure we do not reveal users' personally identifiable information, we do not collect user IDs. We replace mentions of a username with <user> tokens, URLs with <url> tokens, and emails with <email> tokens to conceal private information.

Annotation Procedure
The steps we take for high-quality annotations include providing a detailed guideline, selecting annotators deliberately, and managing the annotation process carefully. In the guideline, we resolve predictable difficulties in the process. For example, we provide rules for delimiting morphological boundaries specific to Korean to collect consistent text spans, and provide guidance on implicit hate speech based on the taxonomy proposed in ElSherief et al. (2021). To ensure overall annotation quality, we only allow annotators who pass a qualification test to participate in the main annotation process. We instruct annotators to refer to the title of the article or video as context to reduce ambiguity. As hate speech annotation can be influenced by the biases of the annotators (Davidson et al., 2019; Sap et al., 2022; Al Kuwatly et al., 2020), we include annotators of diverse demographic backgrounds by limiting the maximum amount of annotation per worker to 1% of the data. A total of 3,124 annotators participate in creating the final dataset. To ensure each label is genuine, we embed questions with clear known answers and remove users who answer them incorrectly. We use SelectStar, a crowdsourcing platform in Korea, to collect annotations. The full task is shown in the appendix (Figure 2). To decide on the gold label, we apply majority voting among the three annotations for the categorical labels, and take the character offsets that at least two out of three annotators highlighted for the text spans. When the gold label cannot be determined by majority voting, which is possible since there are more than two choices, inspectors³ resolve the disagreement to determine the gold label.
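The aggregation step can be sketched as follows; this is a minimal illustration of majority voting over three categorical annotations and of keeping character offsets marked by at least two annotators, not the authors' released code:

```python
from collections import Counter

def gold_label(votes):
    """Majority vote over categorical annotations (None if all differ)."""
    (label, count), = Counter(votes).most_common(1)
    return label if count >= 2 else None  # unresolved cases go to an inspector

def gold_span(annotator_spans):
    """Character offsets highlighted by at least two annotators."""
    counts = Counter(i for span in annotator_spans for i in span)
    return {i for i, c in counts.items() if c >= 2}

print(gold_label(["OFF", "OFF", "NOT"]))               # OFF
print(gold_label(["UNT", "IND", "GRP"]))               # None
print(sorted(gold_span([{0, 1, 2}, {1, 2, 3}, {2}])))  # [1, 2]
```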

Annotation Result
Overall, the average Krippendorff's α for inter-annotator agreement across the annotation levels is 0.55. The label distribution of the collected data is shown in Table 2.
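For reference, Krippendorff's α for nominal labels can be computed from a coincidence matrix as α = 1 − D_o/D_e. The sketch below assumes nominal data with at least two annotations per item, and is our own illustration rather than the authors' evaluation code:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of per-item label lists (one label per annotator);
    items rated by fewer than two annotators are ignored.
    """
    o = Counter()                       # coincidence matrix o[(c, k)]
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for c, k in permutations(labels, 2):
            o[(c, k)] += 1 / (m - 1)
    n_c = Counter()                     # marginal label frequencies
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    d_o = sum(v for (c, k), v in o.items() if c != k)  # observed disagreement
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    if d_e == 0:                        # all annotators used a single label
        return 1.0
    return 1.0 - d_o / d_e
```

With perfect agreement (every annotator gives the same label per item) the function returns 1.0, and agreement at chance level yields 0.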
Level A: Offensive Language Detection The dataset contains 40,429 comments, of which 20,130 are classified as offensive language. Offensive comments targeted at group characteristics (also known as hate speech) comprise 30.7% of the whole data. Specifically, Krippendorff's α for inter-annotator agreement on classifying offensiveness is 0.55.
Level B: Target Type Categorization of Offensive Language Among the 20,130 offensive comments, 2,596 are classified as untargeted offense (UNT). These include comments with non-targeted profanity and swearing. Group (GRP) is the most common type of target, taking up 70.1% of the three types of targeted offenses, followed by individual (IND) (22.0%) and others (OTH) (7.9%). Krippendorff's α for agreement on deciding the type of target is 0.45.

Dataset Analysis

Target Group Distribution
Our novel finding is that target groups must be defined based on the specific language and culture to embrace ongoing social phenomena and reflect them in the dataset. As shown in Table 4, the distribution of the target groups in KOLD differs largely from the English HateXplain dataset (Mathew et al., 2021). We observe that groups such as Jewish, Arabs, and Hispanic, which commonly appear in English datasets (e.g., Sap et al. (2020); Ousidhoum et al. (2019)), do not frequently appear in KOLD.
Africans, the target group that appears most frequently in HateXplain, is not among the top ten groups in our dataset, and the reverse is true for Feminist, the top-ranked target group in our dataset. While Women, LGBTQ+ (which includes Homosexual), and Men are common target groups in both datasets, other identity groups such as Chinese, Korean-Chinese, Progressive, and Conservative only appear in our dataset. Specifically, we observe that within the Korean language, the Asian race as a target of offensive language should be more finely partitioned. While Asians appear as a frequent target in English datasets without racial or ethnic division (An et al., 2021; Ousidhoum et al., 2019; Hartvigsen et al., 2022), in KOLD they are further separated into fine-grained targets grouped by nationality or ethnicity (e.g., Chinese, Indian, Southeast Asian). The large portion of Chinese- and Korean-Chinese-targeted offensive comments in KOLD highlights the uniqueness of offensive language in Korean and reflects the cultural differences between Korean speakers and English speakers. This demonstrates the prevalence of social bias among Asians even though they share similar cultural values and phenotypes (Lee et al., 2017).

The Role of Title for Target Group Identification
In KOLD, 55.1% of group-targeted offensive comments have no target span marked in the comment, which implies that in the majority of cases the titles contain the information about the targets. For example, given the article title "'Islam' in Korea / Yonhap News", it is easy to identify the target of a comment such as "I don't care what (they) believe, what matters is the fact that (they) kill people" (penultimate row of Table 1), even though the comment never explicitly mentions the target.

Experiments and Results
We experiment with three different model architectures: (1) a sequence classification model for predicting the offensiveness, target type, and target group categories, (2) a token classification model for predicting the offensive and target spans, and (3) a multi-task model for predicting both the category and the span at once. We report the scores of single-task models using various sizes of Korean BERT and RoBERTa (Park et al., 2021), and compare the results against the multi-task model.

We further conduct an ablation study excluding the title from the input to discover how much it impacts the prediction performance of the model. We also compare the results of our models against translated versions of English data and a multilingual span prediction model.

For the experiments, we use an 80-10-10 split for each task, and report the best performance based on the F1 score on the test set with tuned hyperparameters. Training details are reported in Appendix B.
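An 80-10-10 split of this kind can be reproduced with a seeded shuffle; the helper below is our own sketch (the authors' exact split indices and seed are not specified here):

```python
import random

def split_80_10_10(examples, seed=0):
    """Shuffle with a fixed seed and split into train/dev/test (80/10/10)."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_dev = int(0.1 * len(items))
    train, dev = items[:n_train], items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_80_10_10(range(40429))
print(len(train), len(dev), len(test))  # 32343 4042 4044
```

Flooring the 80% and 10% cut points and letting the test set absorb the remainder keeps the three splits disjoint and exhaustive.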

Category Prediction
For each level of annotation, we fine-tune the pre-trained models to predict the label given the title and comment, and then evaluate the model using the precision, recall, and F1 scores of the positive class.
In Table 5, we observe that the more fine-grained the categories are, the more difficult the task becomes, and that larger models show better performance.

Span Prediction
We convert each span in the comment to BIO tags to formulate span prediction as a token classification task, and fine-tune the pre-trained models to predict the BIO tag assigned to each token. To evaluate the model, we follow Da San Martino et al. (2019) and compute the F1 score of the predicted character offsets against the ground truth. If the ground truth is empty, a perfect score (F1 = 1) is assigned, whereas if the predicted set of offsets is empty, a score of zero (F1 = 0) is assigned.
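The two steps described above — converting gold character offsets to BIO tags over tokens, and scoring predicted offsets against the ground truth — can be sketched as below. Treating the both-empty case as a perfect score is our reading of the rule; this is an illustration, not the authors' evaluation script:

```python
def spans_to_bio(token_offsets, gold_chars):
    """Tag each token B/I/O from its (start, end) character offsets."""
    tags, prev_inside = [], False
    for start, end in token_offsets:
        inside = any(i in gold_chars for i in range(start, end))
        tags.append("I" if inside and prev_inside else "B" if inside else "O")
        prev_inside = inside
    return tags

def char_f1(pred_chars, gold_chars):
    """Character-offset F1 in the style of Da San Martino et al. (2019)."""
    if not gold_chars:
        return 1.0 if not pred_chars else 0.0
    if not pred_chars:
        return 0.0
    tp = len(pred_chars & gold_chars)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_chars), tp / len(gold_chars)
    return 2 * precision * recall / (precision + recall)

print(spans_to_bio([(0, 2), (2, 4), (5, 7)], {2, 3}))  # ['O', 'B', 'O']
print(char_f1({1, 2}, {2, 3}))                         # 0.5
```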
As demonstrated in Table 6, the best character-level F1 score is 45.4 for the offensive span and 62.5 for the target span. The pattern of a higher score with a larger model is consistent with the results of the category prediction.

Table 7: Evaluation results of the multi-task baseline models compared against the single-task baseline models. "Seq" refers to the sequence classification model, "Span" refers to the span prediction model, and "Seq+Span" refers to the multi-task model. All models fine-tune the same pre-trained language model, BERT-base.

Category and Span Prediction
We employ a multi-task learning approach to train a model capable of classifying the category and predicting the span at the same time. In the multi-task model, the sequence classifier and the token classifier share the neural representation of the pre-trained model and only differ in the output layers for each task. The representation of the first token ([CLS]) is fed into an output layer for sequence classification, and the other representations are fed into the layer for token classification. The model jointly learns the global information of a given input sequence and the span information. We train two types of multi-task models. First, we train a multi-task model with the binary label of offensiveness and the corresponding offensive span. Second, using the data labeled as offensive, we train a multi-task model to predict the target type (Level B) and the corresponding target span.
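The shared-encoder design can be sketched with plain arrays: one hidden state per token comes out of the (shared) pre-trained encoder, the [CLS] row feeds the sequence classification head, and the remaining rows feed the token classification head. The dimensions below are toy values, not the actual BERT configuration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, hidden = 6, 8          # 6 tokens (first is [CLS]), toy hidden size
n_labels, n_tags = 2, 3         # OFF/NOT vs. B/I/O

# Shared representation from the pre-trained encoder (random stand-in here).
hidden_states = rng.normal(size=(seq_len, hidden))

W_seq = rng.normal(size=(hidden, n_labels))  # sequence-classification head
W_tok = rng.normal(size=(hidden, n_tags))    # token-classification head

seq_probs = softmax(hidden_states[0] @ W_seq)   # [CLS] -> offensiveness
tok_probs = softmax(hidden_states[1:] @ W_tok)  # other tokens -> BIO tags

print(seq_probs.shape, tok_probs.shape)  # (2,) (5, 3)
```

Because both heads read the same `hidden_states`, gradients from the sequence loss and the token loss both update the shared encoder, which is what lets the two tasks inform each other.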
As shown in Table 7, multi-task models outperform single-task models in span prediction by 10% in the F1 score, mainly due to joint learning of both types of information.However, the performance of sequence classification drops in both models.
Examples of model predictions and ground truths are illustrated in Table 11 and Table 12 in the Appendix.

Title Ablation on Classification Tasks
To find out how much the context information (titles of the articles and the videos) contributes to the offensiveness and target classifications, we conduct an ablation study by excluding the titles from the input.
As can be seen in Table 8, if only the comments are given, the accuracy and F1 scores drop significantly compared to the setting where the titles and comments are given together. This phenomenon becomes more pronounced as the granularity of the label increases. When predicting the fine-grained target group, the F1 score drops by more than 30%. We conclude that providing the context along with the comments helps the model predict the target groups more precisely, as the comments alone may not contain sufficient information.

Translated Data and Multilingual Models
To see how effective translation and multilingual models are at distinguishing offensiveness, we compare our baselines against (1) a sequence classification model trained on a translated dataset, and (2) a multilingual offensive span detection model. For the translation experiment, we translate OLID to Korean via the Google Translate API⁵ and use the dataset to train the same sequence classifier described in Section 5.1. For the multilingual experiment, we adopt the multilingual token classification model MUDES (Ranasinghe and Zampieri, 2021) trained on an English toxic span dataset (Pavlopoulos et al., 2021). For all tasks, evaluation is done on the KOLD test set. We report the results in Table 9. Overall, neither the translation nor the multilingual approach is more effective than our baselines. For offensive category prediction, our model scores 13.7 points higher, and for span prediction, 27.8 points higher. Although MUDES scores high on English (61.6), its performance drops significantly on Korean (12.8).
Related Work

Offensiveness & Hate Speech Detection
Most datasets created for the detection of offensive language have dealt with subtypes of offensive language such as hate speech, cyberbullying, and profanity as a flat multi-level classification task (Waseem and Hovy, 2016; Davidson et al., 2017; Wiegand et al., 2018; Mollas et al., 2022). Waseem et al. (2017) and Zampieri et al. (2019) have proposed hierarchical taxonomies of offensive speech, emphasizing the need to annotate specific dimensions of offensive language, such as the content's explicitness and the type of target. Rosenthal et al. (2021) further expand the size of the OLID dataset proposed by Zampieri et al. (2019) with a semi-supervised method. The hierarchical annotation has also enabled systematic expansion to subtypes of hate speech in following works, such as misogyny (Zeinert et al., 2021). Our work also builds upon the taxonomy proposed by Zampieri et al. (2019), further identifying the targeted social group of offensive language.
Recent papers focus on more diverse aspects, such as interpretability and context information. To train human-interpretable classification models, Sap et al. (2020) collect the social biases implied about the targeted group in a free-text format. In a similar spirit, Pavlopoulos et al. (2022) and Mathew et al. (2021) create datasets annotated with the particular spans of text that make a post toxic (Zaidan et al., 2007). As most text in the real world appears in context (Seaver, 2015), considering context is important for the development of practical models. Recent work on offensive language detection incorporates the context of the post (Vidgen et al., 2021; de Gibert et al., 2018; Gao and Huang, 2017), although the benefits of context are debated (Pavlopoulos et al., 2020; Xenos et al., 2021). Using hierarchical annotation, the KOLD dataset systematically classifies multiple depths of contextualized offensiveness, and at the same time collects textual spans that justify such classification.

Non-English Datasets
There is relatively little work on developing offensive language datasets in languages other than English. Simple translation of English datasets is not enough, as there are well-known issues with using automatically translated English datasets in NLP, such as translationese (Koppel and Ordan, 2011) and over-representation of the source language's culture (Hu et al., 2020). Several papers have emphasized the need for high-quality monolingual data (Hu et al., 2021; Park et al., 2021). This is also true for offensive language datasets. The focus of hatred differs by culture and country (Reichelmann et al., 2020). Ousidhoum et al. (2019) observe significant differences in target attributes and target groups across the three languages (English, French, Arabic) for which they constructed hate speech datasets. Moreover, Nozza (2021) shows that zero-shot, cross-lingual transfer learning of English hate speech has limitations. Some datasets for the detection of toxicity or abuse exist in other languages (e.g., Zeinert et al. (2021) for Danish, Fortuna et al. (2019) for Portuguese, Mubarak et al. (2021) for Arabic, and Çöltekin (2020) for Turkish). For Korean, Moon et al. (2020) have paved the way for hate speech detection, but their corpora are relatively small in size and lack focus on the target of offensiveness. In comparison, KOLD is built upon an extensive taxonomy that can handle a broad range of offensive language, with clearly annotated textual spans and target groups in 21 categories.

Conclusion
We present KOLD, a dataset of 40,429 comments on news articles and video clips, annotated within context. It is the first to introduce a hierarchical taxonomy of offensive language in Korean with textual spans of the offensiveness and the targets. We establish baseline performance for a multi-task model that detects both the categories and the spans that support the classification. Through analysis and experiments, we show that target terms are often omitted in offensive comments, and that title information helps models predict the target of the offense. This finding can be applied to other syntactically null-subject languages besides Korean (e.g., Arabic, Chinese, Modern Greek) as well. By comparing the distribution of target groups with existing English data and showing the inadequacy of multilingual models, we demonstrate that an offensive language corpus customized for the language and its corresponding culture is necessary. We acknowledge that our dataset does not cover all communities of Korean social media, whose offensive language patterns may differ from each other. Despite this limitation, KOLD will serve as a stepping stone to developing more accurate and adaptive offensive language detection models in Korean.

Ethical Considerations
This study has been approved by the KAIST Institutional Review Board (#KH2021-177). During the annotation process, we informed the annotators that the content might be offensive or upsetting, and limited the amount that each worker could provide. Annotators were also paid above the minimum wage. We are aware of the risk of releasing a dataset containing offensive language. This dataset must not be used as training data to automatically generate and publish offensive language online, but by publicly releasing it, we cannot prevent all malicious use. We will explicitly state that we do not condone any malicious use. We urge researchers and practitioners to use it in beneficial ways (e.g., to filter out hate speech). Another consideration is that the names of political figures and popular entertainers mentioned in the comments remain in our dataset. This is because offensive language detection becomes difficult without those mentions. This is consistent with common practice in other offensive language datasets, and as a community, we need to deliberate and discuss the potential implications.

Limitations
We discuss two limitations of our work in this section. First, our annotation method requires high annotation costs and a lot of time. Guiding the annotators to familiarize them with our annotation process takes much time, since the guideline is complicated and is updated whenever ambiguous comments are reported during the process, to give annotators clear direction. Furthermore, as we collect three annotations per comment, we need a large number of annotators (3,124) and spent a significant amount on annotation to pay them above the minimum wage.
In most cases, there is a trade-off between quantity and quality. For example, if one plans to build a large-scale dataset with a limited amount of resources, one must sacrifice either the complexity of the annotations, by reducing the amount of work for each annotator, or the accuracy of the labels, by decreasing the number of annotators per sample. This is why it is challenging to build a large dataset with accurate and rich annotations. Recently, an approach to building a large-scale machine-generated hate speech detection dataset has emerged (Hartvigsen et al., 2022). This might be an alternative to overcome such limitations. By collaborating with such models, we can obtain large-scale datasets with accurate labels while reducing annotation costs and time.
Second, detecting patterns of offensive language that change over time requires constant updates. For example, hateful comments related to COVID-19 emerged recently, and offensive language toward political figures or celebrities also changes constantly. It is difficult to train a model that captures such changes well with a dataset from a limited time period. A model trained on our dataset might not perform well in detecting hateful comments that emerge in the future. To overcome this limitation, continuous updates of datasets, as well as methods to efficiently update models (Qian et al., 2021), are needed.
Table 1: Examples of the comments in KOLD, along with the annotation results. The Title is either the headline of the news article or the title of the video on which the comment was posted. The subject in parentheses is omitted in the original Korean sentence. (OFF: offensive, NOT: not offensive, UNT: untargeted, IND: individual, OTH: other, GRP: group; blue: offensive span, green: target span)

Table 2: Statistics of labels in KOLD. Here, the lowest level of data granularity is the target group attribute in Level C.

Table 3: Breakdown of target group attributes of group-targeted offensive comments (Level C). We present the three most frequent target groups for each target group attribute. Multi-targeted groups are split into single groups when counting. Political Affiliation and Religion both appear in less than 5% of the data.

Table 3 shows a breakdown of the target group attributes with their top three most frequent groups, as well as the group Others, which is tagged at the target group attribute level but does not belong to the target group choices we provide. In Race, Ethnicity & Nationality, Others takes up the largest portion with 1,605 comments, since it includes small but various origins ranging from Afghans and Americans to North Koreans and North Korean defectors. The three most frequently targeted group characteristics in the whole dataset are Feminist, LGBTQ+, and Women, which amount to 11.42%, 10.55%, and 8.7% of the group-targeted offensive language, respectively. Krippendorff's α is 0.65 for specifying the target group of the offensive language.

Table 5: Evaluation results of the category prediction models for each task. Binary-F1 is used to evaluate the offensiveness task, and macro-F1 is used for the other tasks, whose number of labels is more than two. Bold indicates the best performance across the models.

Table 6: Evaluation results of the offensive/target span prediction models. Bold indicates the best performance across the models.

Table 8: Category prediction results on the offensiveness (Level A), target type (Level B), and target group (Level C) categories with and without the title. Binary-F1 is used to evaluate the offensiveness task, and macro-F1 is used for the other tasks. ΔF1 refers to the absolute and relative performance gap. For all models, we use the same BERT-base model.