Rainbow-Teaming for the Polish Language: A Reproducibility Study

Aleksandra Krasnodębska, Maciej Chrabaszcz, Wojciech Kusa


Abstract
The development of multilingual large language models (LLMs) presents challenges in evaluating their safety across all supported languages. Enhancing safety in one language (e.g., English) may inadvertently introduce vulnerabilities in others. To address this issue, we implement a methodology for the automatic creation of red-teaming datasets for safety evaluation in the Polish language. Our approach generates both harmful and non-harmful prompts by sampling different risk categories and attack styles. We test several open-source models, including those trained on Polish data, and evaluate them using metrics such as Attack Success Rate (ASR) and False Reject Rate (FRR). The results reveal clear gaps in safety performance between models and underscore the need for more thorough cross-lingual safety evaluation.
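The abstract reports two evaluation metrics, Attack Success Rate (ASR) and False Reject Rate (FRR). The following is a minimal sketch of how such rates are typically computed from judged model responses; the record fields ("unsafe_response", "refused") and the per-response judging step are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the paper's code): ASR over harmful prompts and FRR over
# non-harmful prompts, given per-response safety judgements from some judge.

def attack_success_rate(harmful_results):
    # ASR: fraction of harmful prompts for which the model produced an unsafe response.
    if not harmful_results:
        return 0.0
    successes = sum(1 for r in harmful_results if r["unsafe_response"])
    return successes / len(harmful_results)

def false_reject_rate(benign_results):
    # FRR: fraction of non-harmful prompts that the model nevertheless refused.
    if not benign_results:
        return 0.0
    refusals = sum(1 for r in benign_results if r["refused"])
    return refusals / len(benign_results)

# Toy usage with hand-labelled records (field names are assumptions):
harmful = [{"unsafe_response": True}, {"unsafe_response": False}]
benign = [{"refused": False}, {"refused": True}, {"refused": False}]
print(f"ASR = {attack_success_rate(harmful):.2f}")  # 0.50
print(f"FRR = {false_reject_rate(benign):.2f}")     # 0.33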
Anthology ID:
2025.trustnlp-main.12
Volume:
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Trista Cao, Anubrata Das, Tharindu Kumarage, Yixin Wan, Satyapriya Krishna, Ninareh Mehrabi, Jwala Dhamala, Anil Ramakrishna, Aram Galstyan, Anoop Kumar, Rahul Gupta, Kai-Wei Chang
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
155–165
URL:
https://aclanthology.org/2025.trustnlp-main.12/
Cite (ACL):
Aleksandra Krasnodębska, Maciej Chrabaszcz, and Wojciech Kusa. 2025. Rainbow-Teaming for the Polish Language: A Reproducibility Study. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 155–165, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Rainbow-Teaming for the Polish Language: A Reproducibility Study (Krasnodębska et al., TrustNLP 2025)
PDF:
https://aclanthology.org/2025.trustnlp-main.12.pdf