Sahajpreet Singh
2024
Hate Personified: Investigating the role of LLMs in content moderation
Sarah Masud | Sahajpreet Singh | Viktor Hangya | Alexander Fraser | Tanmoy Chakraborty
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
For subjective tasks such as hate detection, where people perceive hate differently, the Large Language Model’s (LLM) ability to represent diverse groups is unclear. By including additional context in prompts, we comprehensively analyze LLM’s sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected. Our findings on two LLMs, five languages, and six datasets reveal that mimicking persona-based attributes leads to annotation variability. Meanwhile, incorporating geographical signals leads to better regional alignment. We also find that the LLMs are sensitive to numerical anchors, indicating the ability to leverage community-based flagging efforts and exposure to adversaries. Our work provides preliminary guidelines and highlights the nuances of applying LLMs in culturally sensitive cases.
Intent-conditioned and Non-toxic Counterspeech Generation using Multi-Task Instruction Tuning with RLAIF
Amey Hengle | Aswini Padhi | Sahajpreet Singh | Anil Bandhakavi | Md Shad Akhtar | Tanmoy Chakraborty
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Counterspeech, defined as a response to mitigate online hate speech, is increasingly used as a non-censorial solution. The effectiveness of addressing hate speech involves dispelling the stereotypes, prejudices, and biases often subtly implied in brief, single-sentence statements or abuses. These expressions challenge language models, especially in seq2seq tasks, as model performance typically excels with longer contexts. Our study introduces CoARL, a novel framework enhancing counterspeech generation by modeling the pragmatic implications underlying social biases in hateful statements. The first two phases of CoARL involve sequential multi-instruction tuning, teaching the model to understand intents, reactions, and harms of offensive statements, and then learning task-specific low-rank adapter weights for generating intent-conditioned counterspeech. The final phase uses reinforcement learning to fine-tune outputs for effectiveness and nontoxicity. CoARL outperforms existing benchmarks in intent-conditioned counterspeech generation, showing an average improvement of ∼3 points in intent-conformity and ∼4 points in argument-quality metrics. Extensive human evaluation supports CoARL’s efficacy in generating superior and more context-appropriate responses compared to existing systems, including prominent LLMs like ChatGPT.
Co-authors
- Tanmoy Chakraborty 2
- Md. Shad Akhtar 1
- Anil Bandhakavi 1
- Alexander Fraser 1
- Viktor Hangya 1