2024
A Corpus of Hindi-English Code-Mixed Posts to Hate Speech Detection
Prashant Kapil | Asif Ekbal
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Social media content, such as blog posts, comments, and tweets, often contains offensive language, including racial hate speech, personal attacks, and sexual harassment. Detecting inappropriate language is crucial for user safety and for preventing hateful behavior and aggression. This study introduces HECM, a corpus of Hindi-English code-mixed tweets, to fill a gap in Hindi language resources. The corpus comprises approximately 9.4K tweets labeled as hateful or non-hateful. The study provides detailed information on the data, including the annotation schema, the label definitions, and an inter-annotator agreement score of 85%. It evaluates the effectiveness of traditional machine learning, deep neural network, and transformer encoder-based approaches. The results show a significant improvement in macro-F1 and weighted-F1 scores. Additionally, a lexicon of 2000 entries tagged in 21 categories is created based on the multilingual HURTLEX lexicon. Merging this lexicon with the transformer encoder yields a marginal improvement in macro-F1 and weighted-F1. The study also experiments with a Hindi-Devanagari dataset to assess the impact of the lexicon on performance metrics.
A Survey on Combating Hate Speech through Detection and Prevention in English
Prashant Kapil | Asif Ekbal
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
The rapid rise of social networks has brought with it an increase in hate speech, which poses a significant challenge to society, researchers, companies, and policymakers. Hate speech can take the form of text or multimodal content, such as memes, GIFs, audio, or videos, and the scientific study of hate speech from a computer science perspective has gained attention in recent years. The detection and combating of hate speech is mostly considered a supervised task, with annotated corpora and shared resources playing a crucial role. Social networks are using modern AI tools to combat hate speech, and this survey comprehensively examines the work done to combat hate in the English language. It delves into state-of-the-art methodologies for unimodal and multimodal hate identification, the role of explainable AI, prevention of hate speech through style transfer, and counternarrative generation, while also discussing the efficacy and limitations of these methods. Compared with earlier surveys, this paper offers a well-organized presentation of methods to combat hate.
2023
A Unified Multi task Learning Architecture for Hate Detection Leveraging User-based Information
Prashant Kapil | Asif Ekbal
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Hate speech, offensive language, aggression, racism, sexism, and other abusive language are common phenomena on social media. There is a need for Artificial Intelligence (AI) based intervention that can filter hate content at scale. Most existing hate speech detection solutions treat each post as an isolated input instance for classification. This paper addresses this issue by introducing a model that improves hate speech identification for the English language by utilising intra-user and inter-user information. The experiments are conducted under single-task learning (STL) and multi-task learning (MTL) paradigms using deep neural networks: a convolutional neural network (CNN), a gated recurrent unit (GRU), Bidirectional Encoder Representations from Transformers (BERT), and A Lite BERT (ALBERT). We use three benchmark datasets and conclude that combining certain user features with textual features gives significant improvements in macro-F1 and weighted-F1.
2020
Leveraging Multi-domain, Heterogeneous Data using Deep Multitask Learning for Hate Speech Detection
Prashant Kapil | Asif Ekbal
Proceedings of the 17th International Conference on Natural Language Processing (ICON)
With the exponential rise in user-generated web content on social media, abusive language directed at individuals or groups is also proliferating rapidly across different sections of the internet. It is very challenging for human moderators to identify offensive content and filter it out. Deep neural networks have shown promise, with reasonable accuracy, for hate speech detection and allied applications. However, these classifiers depend heavily on the size and quality of the training data, and such large, high-quality datasets are not easy to obtain. Moreover, the datasets that have emerged in recent times were not created following the same annotation guidelines and are often concerned with different types and sub-types of hate. To address this data sparsity problem and to obtain more globally representative features, we propose Convolutional Neural Network (CNN) based multi-task learning (MTL) models that leverage information from multiple sources. Empirical analysis on three benchmark datasets shows the efficacy of the proposed approach, with a significant improvement in accuracy and F-score, achieving state-of-the-art performance with respect to existing systems.
2019
NLP at SemEval-2019 Task 6: Detecting Offensive language using Neural Networks
Prashant Kapil | Asif Ekbal | Dipankar Das
Proceedings of the 13th International Workshop on Semantic Evaluation
In this paper, we build several deep learning architectures to participate in the SemEval-2019 shared task OffensEval: Identifying and Categorizing Offensive Language in Social Media. The dataset was annotated with a three-level annotation scheme, and the tasks were to distinguish offensive from non-offensive posts, to categorize offensive content, and to identify its target. Deep learning models with POS information as a feature were also leveraged for classification. The best-performing models on the individual sub-tasks were a stacked CNN-Bi-LSTM with attention, a Bi-LSTM with POS information added to word features, and a Bi-LSTM for the third task. Our models achieved macro-F1 scores of 0.7594, 0.5378, and 0.4588 on sub-tasks A, B, and C, respectively, ranking 33rd, 54th, and 52nd out of 103, 75, and 65 submissions. All three best-performing models are neural networks.
2018
An Ensemble Approach for Aggression Identification in English and Hindi Text
Arjun Roy | Prashant Kapil | Kingshuk Basak | Asif Ekbal
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)
This paper describes our system submitted to the COLING 2018 shared task TRAC-1: Aggression Identification. The objective of this task was to predict aggression expressed in online textual posts or comments. The dataset was released in two languages, English and Hindi. We submitted a single system for Hindi and a single system for English. Both systems are based on an ensemble architecture in which the individual models are a Convolutional Neural Network and a Support Vector Machine. Evaluation shows promising results for both languages. There were 30 submissions in total for English and 15 for Hindi. On the English Facebook and social media data, our system obtained F1 scores of 0.5151 and 0.5099, respectively; on the Hindi Facebook and social media data, it obtained F1 scores of 0.5599 and 0.3790, respectively.