Proceedings of the PolEval 2025 Workshop
Łukasz Kobyliński, Alina Wróblewska, Maciej Ogrodniczuk (Editors)
- Anthology ID: 2025.poleval-main
- Month: November
- Year: 2025
- Address: Warsaw
- Venues: PolEval | WS
- SIG:
- Publisher: Institute of Computer Science PAS and Association for Computational Linguistics
- URL: https://aclanthology.org/2025.poleval-main/
- DOI:
- PDF: https://aclanthology.org/2025.poleval-main.pdf
PolEval is an annual shared-task evaluation campaign dedicated to advancing natural language processing for the Polish language. This paper presents an overview of PolEval 2025, the eighth edition of the campaign, which included three completed tasks covering machine-generated text detection, gender-inclusive language generation, and speech emotion recognition. The evaluation was conducted using standardised datasets and metrics via the AmuEval platform. PolEval 2025 attracted 15 teams and over 100 submissions, demonstrating continued engagement from the Polish NLP community. We describe the organisation of the campaign, the evaluation setup, and the role of PolEval in fostering reproducible research and community-driven benchmarking.
PolEval 2025 Task 1 Śmigiel: Spotting Machine-Generated Text from LLMs for Polish
Piotr Przybyła | Jakub Strebeyko | Alina Wróblewska
This paper introduces the first shared task on machine-generated text (MGT) detection for Polish, organised as part of the PolEval 2025 evaluation campaign. The task evaluates participating systems under three scenarios – unsupervised, constrained, and open – designed to reflect different levels of access to training data. In total, seven systems were submitted. The results indicate that MGT detection for Polish is feasible, with the best-performing constrained systems achieving over 90% accuracy on the main evaluation set. However, performance drops when models are tested on unseen domains or generator models, revealing substantial limitations in generalisation. In the most challenging settings, unsupervised approaches beat the supervised ones. This shared task establishes a new benchmark for MGT detection in Polish. The publicly released Śmigiel dataset is intended to support future research on robust and generalisable MGT detection.
Perplexity-Driven Contrastive Scoring for Unsupervised Detection of AI-Generated Texts in Polish
Damian Stachura
The SMIGIEL competition at PolEval 2025 focuses on distinguishing Polish human-written text from AI-generated text. I participated in one of the subtasks that required a zero-shot detection method. My solution adapts the Binoculars detector by pairing language models and using calibrated thresholds. Specifically, I replaced the English language models from the original Binoculars method with models trained on Polish corpora. This approach achieved first place in the chosen competition track. Overall, my findings demonstrate that domain-specific language models and careful thresholding enable state-of-the-art zero-shot AI-text detection performance across new languages and domains. The code is publicly available at https://github.com/damian1996/2025-smigiel.
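The detector described above builds on contrastive perplexity scoring. Below is a minimal Binoculars-style sketch; the Polish model identifiers and the threshold are assumptions for illustration and do not reproduce the submission's actual configuration, and the paired models are presumed to share a tokenizer.

```python
# Minimal Binoculars-style score for Polish MGT detection.
# Model names and the threshold are illustrative assumptions only.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

OBSERVER = "CYFRAGOVPL/PLLuM-12B-base"        # assumed Polish observer model
PERFORMER = "CYFRAGOVPL/PLLuM-12B-instruct"   # assumed Polish performer model

tok = AutoTokenizer.from_pretrained(OBSERVER)
observer = AutoModelForCausalLM.from_pretrained(OBSERVER, torch_dtype=torch.float16)
performer = AutoModelForCausalLM.from_pretrained(PERFORMER, torch_dtype=torch.float16)

@torch.no_grad()
def binoculars_score(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    obs_logits = observer(ids).logits[0, :-1]
    per_logits = performer(ids).logits[0, :-1]
    targets = ids[0, 1:]
    # log-perplexity of the text under the observer model
    log_ppl = torch.nn.functional.cross_entropy(obs_logits, targets)
    # cross-perplexity: how surprising the performer's next-token
    # distribution is to the observer, averaged over positions
    per_probs = per_logits.softmax(-1)
    x_ppl = -(per_probs * obs_logits.log_softmax(-1)).sum(-1).mean()
    return (log_ppl / x_ppl).item()

# Scores below a threshold calibrated on held-out data would be flagged as machine-generated.
THRESHOLD = 0.9  # placeholder; calibrate on a development set
print("machine-generated" if binoculars_score("Przykładowy tekst.") < THRESHOLD else "human")
```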
Inspired by zero-shot detection methods that compare perplexity across model pairs, we investigate whether computing perplexity differences on whole-text character-level perplexity can effectively detect LLM-generated Polish text. Unlike token-level ratio methods that require compatible tokenizers, our approach enables pairing any models regardless of tokenization. Through systematic evaluation of 91 model pairs on the PolEval 2025 ŚMIGIEL shared task, we identify Gemma-3-27B and PLLuM-12B as optimal, achieving 81.22% accuracy on test data with unseen generators. Our difference-based approach outperforms token-level ratio methods (+5.5pp) and single-model baselines (+8.3pp) without using training labels, capturing asymmetric reactions where human text causes greater perplexity divergence than LLM text. We demonstrate that complementary model pairing (multilingual + monolingual) and architectural quality matter more than raw model size for this task.
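Because this variant normalises the negative log-likelihood by character count rather than by tokens, any two models can be paired regardless of tokenizer. A rough sketch is given below; the loaded models are passed in by the caller, and the decision rule and threshold are illustrative rather than the paper's calibrated setup.

```python
# Sketch of a character-level perplexity difference between two models with
# different tokenizers; the threshold and decision rule are assumptions.
import math
import torch

def char_level_ppl(model, tok, text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, :-1]
    nll = torch.nn.functional.cross_entropy(
        logits, ids[0, 1:], reduction="sum"
    ).item()
    # normalising by character count makes scores comparable across tokenizers
    return math.exp(nll / max(len(text), 1))

def detect(text, model_a, tok_a, model_b, tok_b, threshold):
    diff = char_level_ppl(model_a, tok_a, text) - char_level_ppl(model_b, tok_b, text)
    # a large divergence between the two models is treated here as a sign of
    # human-written text, following the asymmetry described in the abstract
    return "human" if abs(diff) > threshold else "machine-generated"
```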
This paper presents the results of the PolEval 2025 shared task on gender-inclusive large language models for Polish. The primary goal of this task is to encourage the development of models capable of generating grammatically well-formed, contextually appropriate, and gender-inclusive output — a property of increasing importance in both human-centred NLP and NLG applications. To support this objective, we employed the newly developed Inclusive Polish Instruction Set (IPIS), a high-quality, human-annotated resource designed to guide models toward gender-inclusive behaviour. The shared task comprised two subtasks: gender-inclusive proofreading, which evaluates the ability of a model to transform masculine-generic Polish text into an inclusive equivalent, and gender-sensitive Polish-English translation, which investigates gender marking across languages. A total of six system submissions were received — three for each subtask. The evaluation demonstrates that the top-performing gender-inclusive systems outperform both the baseline and state-of-the-art models. These findings highlight the effectiveness of IPIS-tuned approaches and establish strong benchmarks for future research on gender inclusivity in Polish NLP.
Less is More—Achieving SOTA at PolEval 2025 Task 2a: Gender-inclusive LLMs for Polish (Proofreading) with LoRA and Qwen3-8B
Adam Majczyk
In this paper, the winning solution to PolEval 2025 Task 2a is presented. The approach uses LoRA fine-tuning of the Qwen3-8B model. Multiple LoRA matrix ranks are explored, and versions with and without the system prompt in the loss calculation are evaluated. A new SOTA was established at F1=0.6039, beating the previous best model at F1=0.5985. After the task's conclusion, the solution was further improved, reaching F1=0.6283±0.0056.
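A hedged sketch of such a setup with the peft library follows; the rank, scaling factor and target modules are illustrative choices rather than the paper's reported hyperparameters, and the label-masking helper only shows one way a system prompt can be excluded from the loss.

```python
# Sketch of LoRA fine-tuning setup for Qwen3-8B with the peft library.
# Rank, alpha and target modules are illustrative, not the paper's exact values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

lora = LoraConfig(
    r=16,                       # one of several ranks one might explore
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

def build_labels(prompt_ids, target_ids, mask_prompt: bool):
    # excluding the (system) prompt from the loss is done by setting its
    # label positions to -100, which cross-entropy ignores
    prompt_labels = [-100] * len(prompt_ids) if mask_prompt else list(prompt_ids)
    return prompt_labels + list(target_ids)
```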
Prompt-Based Gender-Inclusive Polish-English Translation Using Bielik Large Language Model with Structured Output
Krzysztof Wróbel
We present a simple yet effective approach to gender-inclusive Polish↔English translation for the PolEval 2025 Task 2 shared task. Without any fine-tuning, our solution leverages the Bielik 11B v2.6 model with carefully engineered system prompts and structured output, achieving a chrF score of 84.03 and securing first place in the translation subtask. The approach demonstrates that prompt engineering with few-shot examples and structured output can effectively handle the complex task of generating and removing gender-inclusive forms with the inclusive asterisk notation in Polish text. Per-direction analysis reveals stronger performance on PL→EN (chrF 88.24) compared to EN→PL (chrF 79.88), highlighting the asymmetric difficulty of adding versus removing inclusive forms.
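The sketch below illustrates prompt-based translation with a JSON-structured answer; the checkpoint name, system prompt, few-shot example and output schema are assumptions for illustration and do not reproduce the submitted prompts.

```python
# Hedged sketch of prompting a Bielik instruct model for gender-inclusive
# translation with JSON output; names and prompts are illustrative only.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "speakleash/Bielik-11B-v2.6-Instruct"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": (
        "Translate Polish to English. Preserve gender-inclusive forms "
        "(asterisk notation) where present. Answer only with JSON: "
        '{"translation": "..."}'
    )},
    # one few-shot example (illustrative)
    {"role": "user", "content": "Wszyscy uczestnicy*czki są mile widziani*e."},
    {"role": "assistant", "content": '{"translation": "All participants are welcome."}'},
    {"role": "user", "content": "Nowi pracownicy*ce otrzymają umowy."},
]

inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs, max_new_tokens=128, do_sample=False)
reply = tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)
translation = json.loads(reply)["translation"]  # structured output parsed as JSON
```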
The Polish language, like some Slavic and Romance languages, has a masculine-centric bias in its generic forms, leading to frequent use of masculine nouns when referring to women or mixed-gender groups. This presents a linguistic challenge for the development of gender-inclusive technologies, addressed in PolEval Task 2. This paper presents a pragmatic instruction fine-tuning approach, using Low-Rank Adaptation (LoRA) on the pre-trained Polish PLT5 sequence-to-sequence model.
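As a rough illustration of this setup, the sketch below applies LoRA adapters to a Polish T5-style checkpoint; the model name and hyperparameters are assumptions, not the values used in the submission.

```python
# Sketch of LoRA instruction tuning for a Polish T5-style seq2seq model.
# Checkpoint name and hyperparameters are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

base = "allegro/plt5-large"  # assumed Polish PLT5 checkpoint
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q", "v"],   # attention projections in T5-style blocks
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of weights are trained
```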
Lightweight IPIS Instruction Tuning of Bielik-7B for Gender-Inclusive Polish↔English Translation: System Description for PolEval 2025 Task 2 (IPIS-translation)
Mateusz Czajka
We describe a compact but fully open-source system submitted to PolEval 2025 Task 2 (Gender-inclusive LLMs for Polish), subtask B: IPIS-translation. The goal of this subtask is gender-sensitive Polish↔English translation, including the production of gender-inclusive Polish outputs that follow specific orthographic conventions such as gender stars and slash forms. Our method performs instruction tuning of the Polish LLM Bielik-7B-Instruct using parameter-efficient LoRA adapters, with optional 4-bit NF4 quantization for single-GPU training. Samples from the Inclusive Polish Instruction Set (IPIS) are converted into a chat-style format with a task-provided gender-inclusive system prompt. Despite a deliberately lightweight tuning budget and greedy decoding, our submission placed 3rd on the hidden test B split, achieving bleu_pe = 20.7871. We detail the training and inference pipeline, discuss design choices and limitations, and outline directions for improving inclusive translation quality in Polish.
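A compact sketch of such a pipeline is shown below; the checkpoint name, LoRA hyperparameters and the example chat message are assumptions, meant only to illustrate 4-bit NF4 loading with adapter tuning on a single GPU.

```python
# Hedged sketch of 4-bit NF4 loading plus LoRA adapters for a Bielik-7B
# instruct model; exact hyperparameters and data formatting are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL = "speakleash/Bielik-7B-Instruct-v0.1"  # assumed checkpoint name
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)

# IPIS samples would be rendered into chat messages with the task-provided
# gender-inclusive system prompt before tokenisation, e.g.:
example = [
    {"role": "system", "content": "Przetłumacz w sposób inkluzywny płciowo."},
    {"role": "user", "content": "Translate: All employees are invited."},
]
text = tok.apply_chat_template(example, tokenize=False, add_generation_prompt=True)
```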
This paper introduces the Polish Speech Emotion Recognition Challenge, a shared task aimed at advancing research on cross-lingual emotion recognition in low-resource languages. The challenge’s objective was to develop systems that could recognize emotional states in Polish speech using only multilingual training data, with no access to Polish training examples. The final test set consisted of newly recorded Polish speech samples created specifically for the challenge, ensuring a fully blind evaluation. Participants submitted emotion predictions for six target classes. System performance was assessed using the macro-averaged F1 score as the primary metric.
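For reference, macro-averaged F1 weights all six emotion classes equally, as in this minimal illustration with made-up labels and predictions.

```python
# Minimal illustration of the macro-averaged F1 metric; labels are made up.
from sklearn.metrics import f1_score

y_true = ["anger", "joy", "fear", "sadness", "joy", "neutral"]
y_pred = ["anger", "joy", "fear", "joy", "joy", "neutral"]

# macro averaging gives each emotion class equal weight,
# regardless of how many test samples it has
print(f1_score(y_true, y_pred, average="macro"))
```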
Cross-lingual Speech Emotion Recognition (SER) is frequently hindered by speaker-specific prosodic variations that obscure universal emotional cues. Standard models often fail to generalize across languages due to the domain shift caused by differing acoustic standards. To address this, we present a novel SER approach that integrates unsupervised speaker adaptation directly at inference time. Our architecture utilizes a frozen, pretrained HuBERT encoder and introduces a Greedy Cluster Assignment Algorithm. This method groups a speaker's utterances to form emotion-dependent centroids, enforcing speaker-consistent labeling without the computational cost of retraining. We evaluated this approach in a cross-lingual setting using the Polish nEMO dataset, which was excluded from training. Our method achieved the best performance in PolEval 2025 Task 4, improving the Macro F1 score from 0.619 to 0.753 on validation data and securing 1st place on the official leaderboard. Results demonstrate that inference-only clustering effectively disentangles ambiguous high-arousal categories, such as Fear and Surprise, by calibrating to the individual speaker's vocal range.
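The exact Greedy Cluster Assignment procedure is described in the paper; the sketch below is only an assumed, simplified reading of the idea: seed per-emotion centroids from a speaker's most confident predictions, then relabel that speaker's utterances by nearest centroid.

```python
# Assumed, simplified interpretation of inference-time speaker adaptation.
# Not the paper's exact algorithm; shown only to illustrate the idea.
import numpy as np

def greedy_cluster_assign(embeddings, probs):
    """embeddings: (n, d) utterance vectors from a frozen encoder (one speaker);
    probs: (n, k) per-utterance emotion probabilities from the base classifier."""
    n, _ = probs.shape
    labels = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    centroids = {}
    # greedily take the most confident utterance of each predicted emotion as its centroid
    for i in np.argsort(-confidence):
        c = labels[i]
        if c not in centroids:
            centroids[c] = embeddings[i]
    # relabel every utterance of this speaker by its nearest emotion centroid
    for i in range(n):
        dists = {c: np.linalg.norm(embeddings[i] - mu) for c, mu in centroids.items()}
        labels[i] = min(dists, key=dists.get)
    return labels
```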
Zero-Shot Transfer of Pretrained Speech Representations for Multilingual Emotion Recognition
Tomasz Kuczyński
Speech emotion recognition remains a challenging task, particularly in low-resource language settings. In this work, we explore the development of a system capable of identifying emotional states in Polish speech using training data exclusively from other languages. Our approach relies on a pretrained speech representation model and follows a strict zero-shot training paradigm, enabling cross-lingual knowledge transfer without access to any Polish data. The system was developed in the context of the Polish Speech Emotion Recognition Challenge (PolEval 2025), which required participants to train models solely on multilingual resources and evaluate them on Polish speech in a zero-shot setup. We present a complete solution encompassing model selection, audio preprocessing, and fine-tuning strategy, and discuss the potential of large-scale language models for cross-lingual emotion recognition.