Krzysztof Wróbel


2025

Inspired by zero-shot detection methods that compare perplexity across model pairs, we investigate whether differences in whole-text character-level perplexity between paired models can effectively detect LLM-generated Polish text. Unlike token-level ratio methods that require compatible tokenizers, our approach enables pairing any models regardless of tokenization. Through systematic evaluation of 91 model pairs on the PolEval 2025 ŚMIGIEL shared task, we identify Gemma-3-27B and PLLuM-12B as the optimal pair, achieving 81.22% accuracy on test data with unseen generators. Our difference-based approach outperforms token-level ratio methods (+5.5pp) and single-model baselines (+8.3pp) without using training labels, capturing asymmetric reactions where human text causes greater perplexity divergence than LLM text. We demonstrate that complementary model pairing (multilingual + monolingual) and architectural quality matter more than raw model size for this task.
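The difference score described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: add-one-smoothed character-unigram models stand in for the two LLMs (Gemma-3-27B and PLLuM-12B in the paper), and the zero decision threshold is an illustrative assumption.

```python
import math
from collections import Counter

def char_perplexity(text, counts, total, vocab=256):
    # exp of the negative mean per-character log-probability,
    # with add-one smoothing over an assumed 256-symbol alphabet
    logps = [math.log((counts.get(c, 0) + 1) / (total + vocab)) for c in text]
    return math.exp(-sum(logps) / len(logps))

# Stand-in "models": character statistics from two tiny corpora.
# In the paper, two LLMs would each score the same whole text.
corpus_a = "przykladowy polski tekst napisany przez czlowieka " * 10
corpus_b = "tekst wygenerowany przez duzy model jezykowy " * 10
counts_a, counts_b = Counter(corpus_a), Counter(corpus_b)

def detect(text, threshold=0.0):
    # Difference (not ratio) of whole-text perplexities: no shared
    # tokenizer is needed, since both scores are per character.
    diff = (char_perplexity(text, counts_a, len(corpus_a))
            - char_perplexity(text, counts_b, len(corpus_b)))
    return ("human" if diff > threshold else "llm"), diff
```

Because both models score the same character sequence, any pair can be combined regardless of tokenization; in practice the threshold would be chosen without training labels.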
We present a simple yet effective approach to gender-inclusive Polish–English translation for the PolEval 2025 Task 2 shared task. Without any fine-tuning, our solution leverages the Bielik 11B v2.6 model with carefully engineered system prompts and structured output, achieving a chrF score of 84.03 and securing first place in the translation subtask. The approach demonstrates that prompt engineering with few-shot examples and structured output can effectively handle the complex task of generating and removing gender-inclusive forms with the inclusive asterisk notation in Polish text. Per-direction analysis reveals stronger performance on PL→EN (chrF 88.24) compared to EN→PL (chrF 79.88), highlighting the asymmetric difficulty of adding versus removing inclusive forms.
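The prompting setup can be illustrated with a minimal chat-style message builder. The system-prompt wording, the JSON output schema, and the few-shot placeholders below are hypothetical sketches, not the actual prompts used in the submission.

```python
import json

def build_messages(source_text, direction):
    # direction: "en2pl" produces Polish with inclusive asterisk forms,
    # "pl2en" translates such forms away into plain English
    target = ("Polish with inclusive asterisk forms"
              if direction == "en2pl" else "English")
    system = (
        f"Translate the user's text into {target}. "
        'Answer only with JSON of the form {"translation": "..."}.'
    )
    # One illustrative few-shot pair; real examples would be
    # task-specific sentences with gold inclusive forms.
    few_shot = [
        {"role": "user", "content": "<source sentence>"},
        {"role": "assistant",
         "content": json.dumps({"translation": "<target sentence>"})},
    ]
    return [{"role": "system", "content": system}, *few_shot,
            {"role": "user", "content": source_text}]
```

Constraining the model to a JSON object makes the translation trivially parseable, which is one reason structured output pairs well with few-shot prompting here.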

2022

The paper presents a submission to the EvaLatin 2022 shared task. Our system places first for lemmatization, part-of-speech tagging, and morphological tagging in both closed and open modalities. The results for the cross-genre and cross-time sub-tasks show that the system handles the diachronic and diastratic variation of Latin. The architecture employs state-of-the-art transformer models: for part-of-speech and morphological tagging we use XLM-RoBERTa large, while for lemmatization we employ a ByT5 small model. The paper features a thorough discussion of part-of-speech and lemmatization errors, which shows how system performance may be improved for Classical, Medieval, and Neo-Latin texts.

2016