Michał Perełkiewicz


2024

PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods
Slawomir Dadas | Michał Perełkiewicz | Rafał Poświata
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present the Polish Information Retrieval Benchmark (PIRB), a comprehensive evaluation framework encompassing 41 text information retrieval tasks for Polish. The benchmark incorporates existing datasets as well as 10 new, previously unpublished datasets covering diverse topics such as medicine, law, business, physics, and linguistics. We conduct an extensive evaluation of over 20 dense and sparse retrieval models, including baseline models trained by us as well as other available Polish and multilingual methods. Finally, we introduce a three-step process for training highly effective language-specific retrievers, consisting of knowledge distillation, supervised fine-tuning, and building sparse-dense hybrid retrievers using a lightweight rescoring model. To validate our approach, we train new text encoders for Polish and compare their results with previously evaluated methods. Our dense models outperform the best solutions available to date, and the use of hybrid methods further improves their performance.
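The idea of combining sparse and dense retrievers can be illustrated with a minimal sketch. Note that this is not the lightweight rescoring model described in the abstract: it swaps in a plain linear interpolation of min-max normalised scores, and the function names, the `alpha` weight, and the sample scores are all hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sparse, dense, alpha=0.5):
    """Fuse sparse (e.g. BM25) and dense scores by linear interpolation.

    alpha weights the dense score; a document missing from one
    retriever's result list contributes 0 for that retriever.
    Ties are broken by document id to keep the order deterministic.
    """
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    docs = set(sparse_n) | set(dense_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in docs
    }
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical scores for one query
sparse_scores = {"a": 10.0, "b": 5.0, "c": 0.0}
dense_scores = {"a": 0.2, "b": 0.9, "d": 0.1}
ranking = hybrid_rank(sparse_scores, dense_scores)
```

A learned rescoring model, as in the paper, would replace the fixed `alpha` with a function fitted on training data.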

2022

OPI@LT-EDI-ACL2022: Detecting Signs of Depression from Social Media Text using RoBERTa Pre-trained Language Models
Rafał Poświata | Michał Perełkiewicz
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

This paper presents our winning solution for the Shared Task on Detecting Signs of Depression from Social Media Text at LT-EDI-ACL2022. The task was to create a system that, given social media posts in English, detects the level of depression as ‘not depressed’, ‘moderately depressed’ or ‘severely depressed’. We based our solution on transformer-based language models. We fine-tuned selected models (BERT, RoBERTa, XLNet), of which RoBERTa obtained the best results. Then, using the prepared corpus, we trained our own language model called DepRoBERTa (RoBERTa for Depression Detection). Fine-tuning this model improved the results. The third solution was to use ensemble averaging, which turned out to be the best solution. It achieved a macro-averaged F1-score of 0.583. The source code of the prepared solution is available at https://github.com/rafalposwiata/depression-detection-lt-edi-2022.
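Ensemble averaging, as mentioned in the abstract above, can be sketched as follows. This is an illustrative example only, assuming each model outputs one probability per class; the function name and the sample probabilities are hypothetical and do not come from the paper or its repository.

```python
LABELS = ["not depressed", "moderately depressed", "severely depressed"]


def ensemble_average(prob_lists):
    """Average per-class probabilities across models and pick the top class.

    prob_lists holds one probability vector per model, each ordered
    like LABELS.
    """
    n_models = len(prob_lists)
    avg = [sum(column) / n_models for column in zip(*prob_lists)]
    best = max(range(len(avg)), key=avg.__getitem__)
    return LABELS[best], avg


# Hypothetical outputs of three fine-tuned models for a single post
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.5, 0.2],
]
label, averaged = ensemble_average(model_probs)
```

Here the first model alone would predict ‘not depressed’, but the averaged distribution favours ‘moderately depressed’.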

2020

Annobot: Platform for Annotating and Creating Datasets through Conversation with a Chatbot
Rafał Poświata | Michał Perełkiewicz
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

In this paper, we introduce Annobot: a platform for annotating and creating datasets through conversation with a chatbot. This natural form of interaction has allowed us to create a more accessible and flexible interface, especially for mobile devices. Our solution has a wide range of applications such as data labelling for binary, multi-class/label classification tasks, preparing data for regression problems, or creating sets for issues such as machine translation, question answering or text summarization. Additional features include pre-annotation, active sampling, online learning and real-time inter-annotator agreement. The system is integrated with the popular messaging platform Facebook Messenger. A usability experiment showed the advantages of the proposed platform compared to other labelling tools. The source code of Annobot is available under the GNU LGPL license at https://github.com/rafalposwiata/annobot.

Evaluation of Sentence Representations in Polish
Slawomir Dadas | Michał Perełkiewicz | Rafał Poświata
Proceedings of the Twelfth Language Resources and Evaluation Conference

Methods for learning sentence representations have been actively developed in recent years. However, the lack of pre-trained models and datasets annotated at the sentence level has been a problem for low-resource languages such as Polish, which has led to less interest in applying these methods to language-specific tasks. In this study, we introduce two new Polish datasets for evaluating sentence embeddings and provide a comprehensive evaluation of eight sentence representation methods, including Polish and multilingual models. We consider classic word embedding models, recently developed contextual embeddings and multilingual sentence encoders, showing the strengths and weaknesses of specific approaches. We also examine different methods of aggregating word vectors into a single sentence vector.
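One common way of aggregating word vectors into a single sentence vector is mean pooling. The sketch below illustrates that baseline only; the abstract does not state which aggregation methods were examined, and the function name and the toy two-dimensional embeddings are hypothetical.

```python
def mean_pool(tokens, embeddings):
    """Average the vectors of all tokens found in the embedding table.

    Out-of-vocabulary tokens are skipped; returns None if no token
    has a vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy 2-dimensional embeddings (hypothetical values)
toy_embeddings = {
    "ala": [1.0, 0.0],
    "ma": [0.0, 1.0],
    "kota": [1.0, 1.0],
}
sentence_vector = mean_pool(["ala", "ma", "kota", "xyz"], toy_embeddings)
```

Alternatives such as max pooling or frequency-weighted averaging follow the same pattern with a different reduction over the token vectors.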

2019

CX-ST-RNM at SemEval-2019 Task 3: Fusion of Recurrent Neural Networks Based on Contextualized and Static Word Representations for Contextual Emotion Detection
Michał Perełkiewicz
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, I describe a fusion model combining contextualized and static word representations for approaching the EmoContext task in the SemEval 2019 competition. The model is based on two Recurrent Neural Networks: the first is fed with a state-of-the-art ELMo deep contextualized word representation, and the second is fed with a static Word2Vec embedding augmented with a 10-dimensional affective word feature vector. The proposed model is compared with two baseline models, one based on a static word representation and one on a contextualized word representation. My approach officially achieved a micro-averaged F1 score of 0.7278 on the test dataset, ranking 47th out of 165 participants.

Numbers Normalisation in the Inflected Languages: a Case Study of Polish
Rafał Poświata | Michał Perełkiewicz
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

Text normalisation in Text-to-Speech systems is the process of converting written expressions to their spoken forms. This task is complicated because in many cases the normalised form depends on the context. Furthermore, languages such as Croatian, Lithuanian, Polish, Russian or Slovak pose an additional difficulty related to their inflected nature. In this paper, we show how to deal with this problem for one of these languages, Polish, without having a large dedicated data set and using solutions prepared for other NLP tasks. We limited our study to number expressions, which are the most common non-standard words to normalise. The proposed solution is a combination of a morphological tagger and a transducer supported by a dictionary of numbers in their spoken forms. The data set used for evaluation is based on part of the 1-million-word subset of the National Corpus of Polish. The accuracy of the described approach is presented with a comparison to a simple baseline and two commercial systems: Google Cloud Text-to-Speech and Amazon Polly.