Jonathan Rusert


2022

pdf bib
Suum Cuique: Studying Bias in Taboo Detection with a Community Perspective
Osama Khalid | Jonathan Rusert | Padmini Srinivasan
Findings of the Association for Computational Linguistics: ACL 2022

Prior research has discussed and illustrated the need to consider linguistic norms at the community level when studying taboo (hateful/offensive/toxic etc.) language. However, a methodology for doing so, that is firmly founded on community language norms is still largely absent. This can lead both to biases in taboo text classification and limitations in our understanding of the causes of bias. We propose a method to study bias in taboo classification and annotation where a community perspective is front and center. This is accomplished by using special classifiers tuned for each community’s language. In essence, these classifiers represent community level language norms. We use these to study bias and find, for example, biases are largest against African Americans (7/10 datasets and all 3 classifiers examined). In contrast to previous papers we also study other communities and find, for example, strong biases against South Asians. In a small scale user study we illustrate our key idea which is that common utterances, i.e., those with high alignment scores with a community (community classifier confidence scores) are unlikely to be regarded taboo. Annotators who are community members contradict taboo classification decisions and annotations in a majority of instances. This paper is a significant step toward reducing false positive taboo decisions that over time harm minority communities.

pdf bib
Don’t sweat the small stuff, classify the rest: Sample Shielding to protect text classifiers against adversarial attacks
Jonathan Rusert | Padmini Srinivasan
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Deep learning (DL) is being used extensively for text classification. However, researchers have demonstrated the vulnerability of such classifiers to adversarial attacks. Attackers modify the text in a way which misleads the classifier while keeping the original meaning close to intact. State-of-the-art (SOTA) attack algorithms follow the general principle of making minimal changes to the text so as to not jeopardize semantics. Taking advantage of this we propose a novel and intuitive defense strategy called Sample Shielding.It is attacker and classifier agnostic, does not require any reconfiguration of the classifier or external resources and is simple to implement. Essentially, we sample subsets of the input text, classify them and summarize these into a final decision. We shield three popular DL text classifiers with Sample Shielding, test their resilience against four SOTA attackers across three datasets in a realistic threat setting. Even when given the advantage of knowing about our shielding strategy the adversary’s attack success rate is <=10% with only one exception and often < 5%. Additionally, Sample Shielding maintains near original accuracy when applied to original texts. Crucially, we show that the ‘make minimal changes’ approach of SOTA attackers leads to critical vulnerabilities that can be defended against with an intuitive sampling strategy.

pdf bib
Adversarial Authorship Attribution for Deobfuscation
Wanyue Zhai | Jonathan Rusert | Zubair Shafiq | Padmini Srinivasan
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advances in natural language processing have enabled powerful privacy-invasive authorship attribution. To counter authorship attribution, researchers have proposed a variety of rule-based and learning-based text obfuscation approaches. However, existing authorship obfuscation approaches do not consider the adversarial threat model. Specifically, they are not evaluated against adversarially trained authorship attributors that are aware of potential obfuscation. To fill this gap, we investigate the problem of adversarial authorship attribution for deobfuscation. We show that adversarially trained authorship attributors are able to degrade the effectiveness of existing obfuscators from 20-30% to 5-10%. We also evaluate the effectiveness of adversarial training when the attributor makes incorrect assumptions about whether and which obfuscator was used. While there is a a clear degradation in attribution accuracy, it is noteworthy that this degradation is still at or above the attribution accuracy of the attributor that is not adversarially trained at all. Our results motivate the need to develop authorship obfuscation approaches that are resistant to deobfuscation.

pdf bib
On the Robustness of Offensive Language Classifiers
Jonathan Rusert | Zubair Shafiq | Padmini Srinivasan
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Social media platforms are deploying machine learning based offensive language classification systems to combat hateful, racist, and other forms of offensive speech at scale. However, despite their real-world deployment, we do not yet comprehensively understand the extent to which offensive language classifiers are robust against adversarial attacks. Prior work in this space is limited to studying robustness of offensive language classifiers against primitive attacks such as misspellings and extraneous spaces. To address this gap, we systematically analyze the robustness of state-of-the-art offensive language classifiers against more crafty adversarial attacks that leverage greedy- and attention-based word selection and context-aware embeddings for word replacement. Our results on multiple datasets show that these crafty adversarial attacks can degrade the accuracy of offensive language classifiers by more than 50% while also being able to preserve the readability and meaning of the modified text.

2021

pdf bib
NLP_UIOWA at Semeval-2021 Task 5: Transferring Toxic Sets to Tag Toxic Spans
Jonathan Rusert
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

We leverage a BLSTM with attention to identify toxic spans in texts. We explore different dimensions which affect the model’s performance. The first dimension explored is the toxic set the model is trained on. Besides the provided dataset, we explore the transferability of 5 different toxic related sets, including offensive, toxic, abusive, and hate sets. We find that the solely offensive set shows the highest promise of transferability. The second dimension we explore is methodology, including leveraging attention, employing a greedy remove method, using a frequency ratio, and examining hybrid combinations of multiple methods. We conduct an error analysis to examine which types of toxic spans were missed and which were wrongly inferred as toxic along with the main reasons why they occurred. Finally, we extend our method via ensembles, which achieves our highest F1 score of 55.1.

2020

pdf bib
NLP_UIOWA at SemEval-2020 Task 8: You’re Not the Only One Cursed with Knowledge - Multi Branch Model Memotion Analysis
Ingroj Shrestha | Jonathan Rusert
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We propose hybrid models (HybridE and HybridW) for meme analysis (SemEval 2020 Task 8), which involves sentiment classification (Subtask A), humor classification (Subtask B), and scale of semantic classes (Subtask C). The hybrid model consists of BLSTM and CNN for text and image processing respectively. HybridE provides equal weight to BLSTM and CNN performance, while HybridW provides weightage based on the performance of BLSTM and CNN on a validation set. The performances (macro F1) of our hybrid model on Subtask A are 0.329 (HybridE), 0.328 (HybridW), on Subtask B are 0.507 (HybridE), 0.512 (HybridW), and on Subtask C are 0.309 (HybridE), 0.311 (HybridW).

2019

pdf bib
NLP@UIOWA at SemEval-2019 Task 6: Classifying the Crass using Multi-windowed CNNs
Jonathan Rusert | Padmini Srinivasan
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper proposes a system for OffensEval (SemEval 2019 Task 6), which calls for a system to classify offensive language into several categories. Our system is a text based CNN, which learns only from the provided training data. Our system achieves 80 - 90% accuracy for the binary classification problems (offensive vs not offensive and targeted vs untargeted) and 63% accuracy for trinary classification (group vs individual vs other).