Tomasz Walkowiak

2026

Evaluating Cost-Efficiency of LLMs in a RAG Setup on Polish Wikipedia: Quality vs. Energy Consumption
Patrycja Smits | Tomasz Walkowiak
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Retrieval-augmented generation has become the dominant paradigm for deploying large language models in knowledge-intensive applications, yet practitioners lack guidance on model selection when both quality and cost matter. We evaluate language models from 4B to 70B parameters, including the PLLuM and Bielik families of Polish LLMs, within a Polish Wikipedia-based RAG pipeline. Quality assessment uses GPT-4o pairwise comparison across 1,000 PolQA questions with bias mitigation and Bradley-Terry ranking, while energy measurements capture inference costs on NVIDIA H100 hardware. Our findings challenge conventional scaling assumptions: parameter scaling beyond 12B offers minimal quality gains, with mid-size PLLuM-12 matching 70B performance while reducing energy consumption by 83%.

Stylometric Approach to AI-generated Texts. An Analysis of Contemporary French-Language Literature
Adam Pawłowski | Tomasz Walkowiak
Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026

The article focuses on a stylometric analysis of authentic literary texts and thematically related texts generated by large language models. The texts under study represent a fairly broad cross-section of twentieth-century French literature. Five models were used to generate the texts (ChatGPT 4-o, GPT 4-o mini, DeepSeek v.3, c4ai-command-r-plus, and c4ai-command-a). The original human-written stories of approximately 20,000 characters were summarized, and new narratives were then generated on the basis of these abstracts. In terms of plot and style, they were intended to resemble the originals. The research carried out with TF-IDF of the most frequent words showed that texts generated by specific LLMs and written by humans cluster relatively well as distinct groups. The experiments also showed that the "authorial" specificity of machine-generated texts partly matches the original clustering of human-written source texts.

2024

NLP for Digital Humanities: Processing Chronological Text Corpora
Adam Pawłowski | Tomasz Walkowiak
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

The paper focuses on the integration of Natural Language Processing (NLP) techniques to analyze extensive chronological text corpora. This research underscores the synergy between humanistic inquiry and computational methods, especially in the processing and analysis of sequential textual data known as lexical series. A reference workflow for chronological corpus analysis is introduced, outlining the methodologies applicable to the ChronoPress corpus, a data set that encompasses 22 years of Polish press from 1945 to 1966. The study showcases the potential of this approach in uncovering cultural and historical patterns through the analysis of lexical series. The findings highlight both the challenges and opportunities present in leveraging lexical series analysis within Digital Humanities, emphasizing the necessity for advanced data filtering and anomaly detection algorithms to effectively manage the vast and intricate datasets characteristic of this field.

2023

Great Bibliographies as a Source of Data for the Humanities – NLP in the Analysis of Gender of Book Authors in German Countries and in Poland (1801-2021)
Adam Pawłowski | Tomasz Walkowiak
Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

The subject of this article is the application of NLP and text-mining methods to the analysis of two large bibliographies: a Polish one, based on the catalogs of the National Library in Warsaw, and a German one, created by the Deutsche Nationalbibliothek. The data in both collections are stored in MARC 21 format, allowing the selection of relevant fields that are used for further processing (basically author, title, and date). After filtering out non-relevant or incomplete items, the Polish corpus comprises 1.4 million records and the German corpus 7.5 million records. The time span of both bibliographies extends from 1801 to 2021. The aim of the study is to compare the gender distribution of book authors in Polish and German databases over more than two centuries. The proportions of male and female authors since 1801 were calculated automatically, and NLP methods such as document vector embedding based on deep BERT networks were used to extract topics from titles. The gender of the Polish authors was recognized based on the morphology of the first names, and that of the German authors based on a predefined list. The study found that the proportion of female authors has been steadily increasing both in Poland and in German countries (currently around 43%). However, the topics of women’s and men’s writings have remained consistently different since 1801.

2021

Comprehensive Punctuation Restoration for English and Polish
Michał Pogoda | Tomasz Walkowiak
Findings of the Association for Computational Linguistics: EMNLP 2021

Punctuation restoration is a fundamental requirement for the readability of text derived from Automatic Speech Recognition (ASR) systems. Most contemporary solutions are limited to predicting only a few of the most frequently occurring marks, such as periods, commas, and question marks, and only one mark per word. However, in written language, we deal with a much larger number of punctuation characters (such as parentheses, hyphens, etc.) and their combinations (like a parenthesis followed by a dot). Such comprehensive punctuation cannot always be unambiguously reduced to a basic set of the most frequently occurring marks. In this work, we evaluate several methods in the comprehensive punctuation reconstruction task. We conduct experiments on parallel corpora of two different languages, English and Polish, which have relatively simple and relatively complex morphology, respectively. We also investigate how building a model on comprehensive punctuation influences the quality of the basic punctuation restoration task.

Text Document Clustering: Wordnet vs. TF-IDF vs. Word Embeddings
Michał Marcińczuk | Mateusz Gniewkowski | Tomasz Walkowiak | Marcin Będkowski
Proceedings of the 11th Global Wordnet Conference

In the paper, we deal with the problem of unsupervised text document clustering for the Polish language. Our goal is to compare the modern approaches based on language modeling (doc2vec and BERT) with the classical ones, i.e., TF-IDF and wordnet-based. The experiments are conducted on three datasets containing qualification descriptions. The results showed that wordnet-based similarity measures could compete with and even outperform modern embedding-based approaches.

2019

Evaluation of vector embedding models in clustering of text documents
Tomasz Walkowiak | Mateusz Gniewkowski
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

The paper presents an evaluation of word embedding models in clustering of texts in the Polish language. The authors evaluated six different embedding models, ranging from the widely used word2vec, through fastText with character n-gram embeddings, to the deep learning-based ELMo and BERT. Moreover, four standardisation methods, three distance measures and four clustering methods were evaluated. The analysis was performed on two corpora of texts in Polish classified into subjects. The Adjusted Mutual Information (AMI) metric was used to verify the quality of clustering results. The experiments show that Skipgram models with character n-gram embeddings, built on the KGR10 corpus and provided by Clarin-PL, outperform other publicly available models for Polish. Moreover, the presented results suggest that the Yeo–Johnson transformation for document vector standardisation and Agglomerative Clustering with a cosine distance should be used for grouping text documents.
