The scarcity of large-scale datasets, especially for resource-impoverished languages, motivates the exploration of data-efficient methods for hate speech detection. Hateful intent is expressed both explicitly (through cuss, swear, and abusive words) and implicitly (indirectly and contextually). In this work, we advance implicit and explicit hate speech detection using an input-level data augmentation technique, task reformulation using entailment, and cross-learning across five languages. Our proposed data augmentation technique, EasyMix, improves performance across all English datasets by ~1% and across multilingual datasets by ~1-9%. We also observe substantial gains of ~2-8% by reformulating hate speech detection as an entailment problem. We further probe the contextual models and observe that higher layers encode implicit hate while lower layers focus on explicit hate, highlighting the importance of token-level understanding for explicit and context-level understanding for implicit hate speech detection. Code and dataset splits: https://anonymous.4open.science/r/data_efficient_hatedetect/
A Simple Unsupervised Approach for Coreference Resolution using Rule-based Weak Supervision
Alessandro Stolfo | Chris Tanner | Vikram Gupta | Mrinmaya Sachan
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics
Labeled data for the task of Coreference Resolution is a scarce resource, requiring significant human effort. While state-of-the-art coreference models rely on such data, we propose an approach that leverages an end-to-end neural model in settings where labeled data is unavailable. Specifically, using weak supervision, we transfer the linguistic knowledge encoded by Stanford’s rule-based coreference system to the end-to-end model, which jointly learns rich, contextualized span representations and coreference chains. Our experiments on the English OntoNotes corpus demonstrate that our approach effectively benefits from the noisy coreference supervision, producing an improvement over Stanford’s rule-based system (+3.7 F1) and outperforming the previous best unsupervised model (+0.9 F1). Additionally, we validate the efficacy of our method on two other datasets: PreCo and Litbank (+2.5 and +5 F1 on Stanford’s system, respectively).
Virtual Adversarial Training (VAT) has been effective in learning robust models under supervised and semi-supervised settings for both computer vision and NLP tasks. However, the efficacy of VAT for multilingual and multilabel emotion recognition has not been explored before. In this work, we explore VAT for multilabel emotion recognition with a focus on leveraging unlabelled data from different languages to improve model performance. We perform extensive semi-supervised experiments on the SemEval2018 multilabel and multilingual emotion recognition dataset and show performance gains of 6.2% (Arabic), 3.8% (Spanish) and 1.8% (English) over supervised learning with the same amount of labelled data (10% of training data). We also improve the existing state-of-the-art by 7%, 4.5% and 1% (Jaccard Index) for Spanish, Arabic and English respectively, and perform probing experiments to understand the impact of different layers of the contextual models.
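For readers unfamiliar with the Jaccard Index as a multilabel metric, the following is a minimal sketch of how a sample-averaged score can be computed. It is not taken from the paper's code: the function name `multilabel_jaccard`, the example emotion labels, and the convention of scoring a sample as 1.0 when both the gold and predicted label sets are empty are assumptions made for illustration.

```python
from typing import Sequence, Set


def multilabel_jaccard(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Sample-averaged Jaccard index: mean of |G ∩ P| / |G ∪ P| over samples."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of samples")
    if not gold:
        return 0.0

    total = 0.0
    for g, p in zip(gold, pred):
        union = g | p
        # Assumed convention: if both label sets are empty, count the sample as a match.
        total += 1.0 if not union else len(g & p) / len(union)
    return total / len(gold)


# Hypothetical gold and predicted emotion labels for two texts.
gold_labels = [{"joy", "love"}, {"anger"}]
pred_labels = [{"joy"}, {"anger", "fear"}]
score = multilabel_jaccard(gold_labels, pred_labels)
print(f"Jaccard index: {score:.2f}")
```

Each sample is scored by the overlap between its gold and predicted label sets, so a prediction that misses one label or adds a spurious one is partially credited rather than counted as wholly wrong.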