Arkadiusz Janz

2025

Alignment is the critical process of minimizing harmful outputs by teaching large language models (LLMs) to prefer safe, helpful and appropriate responses. While the majority of alignment research and datasets remain overwhelmingly English-centric, ensuring safety across diverse linguistic and cultural contexts requires localized resources. In this paper, we introduce the first Polish preference dataset PLLuM-Align, created entirely through human annotation to reflect Polish language and cultural nuances. The dataset includes response rating, ranking, and multi-turn dialog data. Designed to reflect the linguistic subtleties and cultural norms of Polish, this resource lays the groundwork for more aligned Polish LLMs and contributes to the broader goal of multilingual alignment in underrepresented languages.

2024

pdf bib abs
BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language
Konrad Wojtasik | Kacper Wołowiec | Vadim Shishkin | Arkadiusz Janz | Maciej Piasecki
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The BEIR dataset is a large, heterogeneous benchmark for Information Retrieval (IR), garnering considerable attention within the research community. However, BEIR and analogous datasets are predominantly restricted to English language. Our objective is to establish extensive large-scale resources for IR in the Polish language, thereby advancing the research in this NLP area. In this work, inspired by mMARCO and Mr. TyDi datasets, we translated all accessible open IR datasets into Polish, and we introduced the BEIR-PL benchmark – a new benchmark which comprises 13 datasets, facilitating further development, training and evaluation of modern Polish language models for IR tasks. We executed an evaluation and comparison of numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore, we publish pre-trained open IR models for Polish language, marking a pioneering development in this field. The BEIR-PL is included in MTEB Benchmark and also available with trained models at URL https://huggingface.co/clarin-knext.

2023

pdf bib abs
Wordnet for Definition Augmentation with Encoder-Decoder Architecture
Konrad Wojtasik | Arkadiusz Janz | Bartłomiej Alberski | Maciej Piasecki
Proceedings of the 12th Global Wordnet Conference

Data augmentation is a difficult task in Natural Language Processing. Simple methods that can be relatively easily applied in other domains like insertion, deletion or substitution, mostly result in changing the sentence meaning significantly and obtaining an incorrect example. Wordnets are potentially a perfect source of rich and high quality data that when integrated with the powerful capacity of generative models can help to solve this complex task. In this work, we use plWordNet, which is a wordnet of the Polish language, to explore the capability of encoder-decoder architectures in data augmentation of sense glosses. We discuss the limitations of generative methods and perform qualitative review of generated data samples.

pdf bib abs
Data Augmentation Method for Boosting Multilingual Word Sense Disambiguation
Arkadiusz Janz | Marek Maziarz
Proceedings of the 12th Global Wordnet Conference

Recent advances in Word Sense Disambiguation suggest neural language models can be successfully improved by incorporating knowledge base structure. Such class of models are called hybrid solutions. We propose a method of improving hybrid WSD models by harnessing data augmentation techniques and bilingual training. The data augmentation consist of structure augmentation using interlingual connections between wordnets and text data augmentation based on multilingual glosses and usage examples. We utilise language-agnostic neural model trained both with SemCor and Princeton WordNet gloss and example corpora, as well as with Polish WordNet glosses and usage examples. This augmentation technique proves to make well-known hybrid WSD architecture to be competitive, when compared to current State-of-the-Art models, even more complex.

pdf bib abs
Word Sense Disambiguation Based on Iterative Activation Spreading with Contextual Embeddings for Sense Matching
Arkadiusz Janz | Maciej Piasecki
Proceedings of the 12th Global Wordnet Conference

Many knowledge-based solutions were proposed to solve Word Sense Disambiguation (WSD) problem with limited annotated resources. Such WSD algorithms are able to cover very large sense repositories, but still being outperformed by supervised ones on benchmark data. In this paper, we start with analysis identifying key properties and issues in application of spreading activation algorithms in knowledge-based WSD, e.g. influence of the network local structures, interaction with context information and sense frequency. Taking our observations as a point of departure, we introduce a novel solution with new context-to-sense matching using BERT embeddings, iterative parallel spreading activation function and selective sense alignment using contextual BERT embeddings. The proposed solution obtains performance beyond the state-of-the-art for the contemporary knowledge-based WSD approaches for both English and Polish data.

2021

pdf bib abs
Discriminating Homonymy from Polysemy in Wordnets: English, Spanish and Polish Nouns
Arkadiusz Janz | Marek Maziarz
Proceedings of the 11th Global Wordnet Conference

We propose a novel method of homonymy-polysemy discrimination for three Indo-European Languages (English, Spanish and Polish). Support vector machines and LASSO logistic regression were successfully used in this task, outperforming baselines. The feature set utilised lemma properties, gloss similarities, graph distances and polysemy patterns. The proposed ML models performed equally well for English and the other two languages (constituting testing data sets). The algorithms not only ruled out most cases of homonymy but also were efficacious in distinguishing between closer and indirect semantic relatedness.

pdf bib abs
Neural Language Models vs Wordnet-based Semantically Enriched Representation in CST Relation Recognition
Arkadiusz Janz | Maciej Piasecki | Piotr Wątorski
Proceedings of the 11th Global Wordnet Conference

Neural language models, including transformer-based models, that are pre-trained on very large corpora became a common way to represent text in various tasks, including recognition of textual semantic relations, e.g. Cross-document Structure Theory. Pre-trained models are usually fine tuned to downstream tasks and the obtained vectors are used as an input for deep neural classifiers. No linguistic knowledge obtained from resources and tools is utilised. In this paper we compare such universal approaches with a combination of rich graph-based linguistically motivated sentence representation and a typical neural network classifier applied to a task of recognition of CST relation in Polish. The representation describes selected levels of the sentence structure including description of lexical meanings on the basis of the wordnet (plWordNet) synsets and connected SUMO concepts. The obtained results show that in the case of difficult relations and medium size training corpus semantically enriched text representation leads to significantly better results.

2020

pdf bib abs
Brand-Product Relation Extraction Using Heterogeneous Vector Space Representations
Arkadiusz Janz | Łukasz Kopociński | Maciej Piasecki | Agnieszka Pluwak
Proceedings of the Twelfth Language Resources and Evaluation Conference

Relation Extraction is a fundamental NLP task. In this paper we investigate the impact of underlying text representation on the performance of neural classification models in the task of Brand-Product relation extraction. We also present the methodology of preparing annotated textual corpora for this task and we provide valuable insight into the properties of Brand-Product relations existing in textual corpora. The problem is approached from a practical angle of applications Relation Extraction in facilitating commercial Internet monitoring.

2019

pdf bib abs
Propagation of emotions, arousal and polarity in WordNet using Heterogeneous Structured Synset Embeddings
Jan Kocoń | Arkadiusz Janz
Proceedings of the 10th Global Wordnet Conference

In this paper we present a novel method for emotive propagation in a wordnet based on a large emotive seed. We introduce a sense-level emotive lexicon annotated with polarity, arousal and emotions. The data were annotated as a part of a large study involving over 20,000 participants. A total of 30,000 lexical units in Polish WordNet were described with metadata, each unit received about 50 annotations concerning polarity, arousal and 8 basic emotions, marked on a multilevel scale. We present a preliminary approach to propagating emotive metadata to unlabeled lexical units based on the distribution of manual annotations using logistic regression and description of mixed synset embeddings based on our Heterogeneous Structured Synset Embeddings.

pdf bib abs
Testing Zipf’s meaning-frequency law with wordnets as sense inventories
Francis Bond | Arkadiusz Janz | Marek Maziarz | Ewa Rudnicka
Proceedings of the 10th Global Wordnet Conference

According to George K. Zipf, more frequent words have more senses. We have tested this law using corpora and wordnets of English, Spanish, Portuguese, French, Polish, Japanese, Indonesian and Chinese. We have proved that the law works pretty well for all of these languages if we take - as Zipf did - mean values of meaning count and averaged ranks. On the other hand, the law disastrously fails in predicting the number of senses for a single lemma. We have also provided the evidence that slope coefficients of Zipfian log-log linear model may vary from language to language.

pdf bib abs
A Comparison of Sense-level Sentiment Scores
Francis Bond | Arkadiusz Janz | Maciej Piasecki
Proceedings of the 10th Global Wordnet Conference

In this paper, we compare a variety of sense-tagged sentiment resources, including SentiWordNet, ML-Senticon, plWordNet emo and the NTU Multilingual Corpus. The goal is to investigate the quality of the resources and see how well the sentiment polarity annotation maps across languages.

pdf bib abs
Word Sense Disambiguation based on Constrained Random Walks in Linked Semantic Networks
Arkadiusz Janz | Maciej Piasecki
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Word Sense Disambiguation remains a challenging NLP task. Due to the lack of annotated training data, especially for rare senses, the supervised approaches are usually designed for specific subdomains limited to a narrow subset of identified senses. Recent advances in this area have shown that knowledge-based approaches are more scalable and obtain more promising results in all-words WSD scenarios. In this work we present a faster WSD algorithm based on the Monte Carlo approximation of sense probabilities given a context using constrained random walks over linked semantic networks. We show that the local semantic relatedness is mostly sufficient to successfully identify correct senses when an extensive knowledge base and a proper weighting scheme are used. The proposed methods are evaluated on English (SenseEval, SemEval) and Polish (Składnica, KPWr) datasets.

2018

pdf bib abs
Wordnet-based Evaluation of Large Distributional Models for Polish
Maciej Piasecki | Gabriela Czachor | Arkadiusz Janz | Dominik Kaszewski | Paweł Kędzia
Proceedings of the 9th Global Wordnet Conference

The paper presents construction of large scale test datasets for word embeddings on the basis of a very large wordnet. They were next applied for evaluation of word embedding models and used to assess and compare the usefulness of different word embeddings extracted from a very large corpus of Polish. We analysed also and compared several publicly available models described in literature. In addition, several large word embeddings models built on the basis of a very large Polish corpus are presented.

pdf bib abs
Recognition of Hyponymy and Meronymy Relations in Word Embeddings for Polish
Gabriela Czachor | Maciej Piasecki | Arkadiusz Janz
Proceedings of the 9th Global Wordnet Conference

Word embeddings were used for the extraction of hyponymy relation in several approaches, but also it was recently shown that they should not work, in fact. In our work we verified both claims using a very large wordnet of Polish as a gold standard for lexico-semantic relations and word embeddings extracted from a very large corpus of Polish. We showed that a hyponymy extraction method based on linear regression classifiers trained on clusters of vectors can be successfully applied on large scale. We presented also a possible explanation for contradictory findings in the literature. Moreover, in order to show the feasibility of the method we extended it to the recognition of meronymy.

pdf bib abs
Context-sensitive Sentiment Propagation in WordNet
Jan Kocoń | Arkadiusz Janz | Maciej Piasecki
Proceedings of the 9th Global Wordnet Conference

In this paper we present a comprehensive overview of recent methods of the sentiment propagation in a wordnet. Next, we propose a fully automated method called Classifier-based Polarity Propagation, which utilises a very rich set of features, where most of them are based on wordnet relation types, multi-level bag-of-synsets and bag-of-polarities. We have evaluated our solution using manually annotated part of plWordNet 3.1 emo, which contains more than 83k manual sentiment annotations, covering more than 41k synsets. We have demonstrated that in comparison to existing rule-based methods using a specific narrow set of semantic relations our method has achieved statistically significant and better results starting with the same seed synsets.

pdf bib
Classifier-based Polarity Propagation in a WordNet
Jan Kocoń | Arkadiusz Janz | Maciej Piasecki
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib abs
Graph-Based Approach to Recognizing CST Relations in Polish Texts
Paweł Kędzia | Maciej Piasecki | Arkadiusz Janz
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

This paper presents an supervised approach to the recognition of Cross-document Structure Theory (CST) relations in Polish texts. In the proposed, graph-based representation is constructed for sentences. Graphs are built on the basis of lexicalised syntactic-semantic relation extracted from text. Similarity between sentences is calculated from graph, and the similarity values are input to classifiers trained by Logistic Model Tree. Several different configurations of graph, as well as graph similarity methods were analysed for this tasks. The approach was evaluated on a large open corpus annotated manually with 17 types of selected CST relations. The configuration of experiments was similar to those known from SEMEVAL and we obtained very promising results.

Venues

Fix author