Rafał Poświata


2024

PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods
Slawomir Dadas | Michał Perełkiewicz | Rafał Poświata
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present the Polish Information Retrieval Benchmark (PIRB), a comprehensive evaluation framework encompassing 41 text information retrieval tasks for Polish. The benchmark incorporates existing datasets as well as 10 new, previously unpublished datasets covering diverse topics such as medicine, law, business, physics, and linguistics. We conduct an extensive evaluation of over 20 dense and sparse retrieval models, including the baseline models trained by us as well as other available Polish and multilingual methods. Finally, we introduce a three-step process for training highly effective language-specific retrievers, consisting of knowledge distillation, supervised fine-tuning, and building sparse-dense hybrid retrievers using a lightweight rescoring model. In order to validate our approach, we train new text encoders for Polish and compare their results with previously evaluated methods. Our dense models outperform the best solutions available to date, and the use of hybrid methods further improves their performance.
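
For readers unfamiliar with sparse-dense hybrids, the sketch below illustrates the general idea of fusing scores from a dense and a sparse retriever. The function names, normalisation scheme and weight are hypothetical placeholders; the paper itself uses a learned lightweight rescoring model rather than a fixed fusion formula.

```python
# Illustrative sketch of sparse-dense score fusion for hybrid retrieval.
# The min-max normalisation and the fixed weight alpha are assumptions for
# illustration only, not the rescoring model described in the paper.

def min_max(scores):
    """Normalise a score dict to [0, 1] so sparse and dense scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_rank(dense_scores, sparse_scores, alpha=0.7):
    """Combine per-document scores from a dense and a sparse retriever."""
    dense, sparse = min_max(dense_scores), min_max(sparse_scores)
    docs = set(dense) | set(sparse)
    fused = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical document ids with raw retriever scores.
dense = {"doc1": 0.82, "doc2": 0.75, "doc3": 0.40}
sparse = {"doc1": 11.2, "doc4": 9.8, "doc2": 3.1}
print(hybrid_rank(dense, sparse))
```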

2022

OPI@LT-EDI-ACL2022: Detecting Signs of Depression from Social Media Text using RoBERTa Pre-trained Language Models
Rafał Poświata | Michał Perełkiewicz
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

This paper presents our winning solution for the Shared Task on Detecting Signs of Depression from Social Media Text at LT-EDI-ACL2022. The task was to create a system that, given social media posts in English, should detect the level of depression as ‘not depressed’, ‘moderately depressed’ or ‘severely depressed’. We based our solution on transformer-based language models. We fine-tuned selected models: BERT, RoBERTa and XLNet, of which the best results were obtained for RoBERTa. Then, using a corpus of depression-related posts that we prepared, we trained our own language model called DepRoBERTa (RoBERTa for Depression Detection). Fine-tuning this model improved the results. The third solution was to use ensemble averaging, which turned out to be the best approach, achieving a macro-averaged F1-score of 0.583. The source code of the prepared solution is available at https://github.com/rafalposwiata/depression-detection-lt-edi-2022.
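
The ensemble-averaging step can be illustrated with a minimal sketch: average the class probabilities produced by several fine-tuned models and pick the most probable label. The model outputs below are invented for illustration and are not results from the paper.

```python
import numpy as np

# Minimal sketch of ensemble averaging over per-class probabilities.
LABELS = ["not depressed", "moderately depressed", "severely depressed"]

def ensemble_average(prob_matrices):
    """prob_matrices: list of (n_examples, n_classes) arrays, one per model."""
    mean_probs = np.mean(np.stack(prob_matrices), axis=0)
    return [LABELS[i] for i in mean_probs.argmax(axis=1)]

# Two hypothetical models scoring the same three posts.
model_a = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6]])
model_b = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5], [0.2, 0.3, 0.5]])
print(ensemble_average([model_a, model_b]))
```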

OPI at SemEval-2022 Task 10: Transformer-based Sequence Tagging with Relation Classification for Structured Sentiment Analysis
Rafał Poświata
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper presents our solution for SemEval-2022 Task 10: Structured Sentiment Analysis. The solution consisted of two modules: the first for sequence tagging and the second for relation classification. In both modules we used transformer-based language models. In addition to utilizing language models specific to each of the five competition languages, we also adopted multilingual models. This approach allowed us to apply the solution to both monolingual and cross-lingual sub-tasks, where we obtained average Sentiment Graph F1 of 54.5% and 53.1%, respectively. The source code of the prepared solution is available at https://github.com/rafalposwiata/structured-sentiment-analysis.
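
A minimal sketch of the two-module idea follows: a sequence tagger produces BIO labels for sentiment elements, and a second model then decides which extracted spans belong to the same sentiment tuple. The tag names and decoding logic below are simplified placeholders, not the authors' implementation.

```python
# Step 1 of the sketch: decode BIO tags into labelled spans.
def bio_to_spans(tokens, tags):
    """Turn BIO tags (e.g. B-TARGET, I-TARGET, O) into (label, text) spans."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((label, " ".join(current)))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((label, " ".join(current)))
            current, label = [], None
    if current:
        spans.append((label, " ".join(current)))
    return spans

tokens = ["The", "food", "was", "really", "great"]
tags = ["O", "B-TARGET", "O", "B-EXPRESSION", "I-EXPRESSION"]
print(bio_to_spans(tokens, tags))
# Step 2 (not shown) would classify whether ("TARGET", "food") and
# ("EXPRESSION", "really great") form one sentiment relation.
```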

2020

Annobot: Platform for Annotating and Creating Datasets through Conversation with a Chatbot
Rafał Poświata | Michał Perełkiewicz
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

In this paper, we introduce Annobot: a platform for annotating and creating datasets through conversation with a chatbot. This natural form of interaction has allowed us to create a more accessible and flexible interface, especially for mobile devices. Our solution has a wide range of applications, such as data labelling for binary, multi-class and multi-label classification tasks, preparing data for regression problems, or creating sets for issues such as machine translation, question answering or text summarization. Additional features include pre-annotation, active sampling, online learning and real-time inter-annotator agreement. The system is integrated with the popular messaging platform Facebook Messenger. A usability experiment showed the advantages of the proposed platform compared to other labelling tools. The source code of Annobot is available under the GNU LGPL license at https://github.com/rafalposwiata/annobot.
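
One of the listed features, real-time inter-annotator agreement, can be illustrated with a small sketch. The metric shown here is Cohen's kappa for two annotators; which agreement measure Annobot actually uses is not stated in the abstract, and the labels below are hypothetical examples.

```python
from collections import Counter

# Minimal sketch of Cohen's kappa for two annotators over the same items.
def cohens_kappa(ann1, ann2):
    assert len(ann1) == len(ann2)
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum(c1[l] * c2[l] for l in set(ann1) | set(ann2)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

a = ["pos", "neg", "pos", "neu", "pos"]
b = ["pos", "neg", "neu", "neu", "pos"]
print(round(cohens_kappa(a, b), 3))
```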

Evaluation of Sentence Representations in Polish
Slawomir Dadas | Michał Perełkiewicz | Rafał Poświata
Proceedings of the Twelfth Language Resources and Evaluation Conference

Methods for learning sentence representations have been actively developed in recent years. However, the lack of pre-trained models and datasets annotated at the sentence level has been a problem for low-resource languages such as Polish, which has led to less interest in applying these methods to language-specific tasks. In this study, we introduce two new Polish datasets for evaluating sentence embeddings and provide a comprehensive evaluation of eight sentence representation methods, including Polish and multilingual models. We consider classic word embedding models, recently developed contextual embeddings and multilingual sentence encoders, showing the strengths and weaknesses of specific approaches. We also examine different methods of aggregating word vectors into a single sentence vector.
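
The aggregation methods mentioned in the last sentence can be illustrated with a minimal sketch of two common strategies, mean pooling and max pooling; the toy word vectors below are made up for illustration and are not from the evaluated models.

```python
import numpy as np

# Two simple ways of aggregating word vectors into one sentence vector.
def mean_pool(word_vectors):
    """Element-wise average of all word vectors in the sentence."""
    return np.mean(word_vectors, axis=0)

def max_pool(word_vectors):
    """Element-wise maximum over all word vectors in the sentence."""
    return np.max(word_vectors, axis=0)

# Toy 4-dimensional "embeddings" for a three-word Polish sentence.
words = np.array([
    [0.1, -0.2, 0.4, 0.0],   # "ala"
    [0.3,  0.1, 0.0, 0.2],   # "ma"
    [-0.1, 0.5, 0.2, 0.1],   # "kota"
])
print(mean_pool(words))
print(max_pool(words))
```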

2019

ConSSED at SemEval-2019 Task 3: Configurable Semantic and Sentiment Emotion Detector
Rafał Poświata
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our system participating in SemEval-2019 Task 3: EmoContext: Contextual Emotion Detection in Text. The goal was, for a given textual dialogue, i.e. a user utterance along with two turns of context, to identify the emotion of the user utterance as one of the emotion classes: Happy, Sad, Angry or Others. Our system, ConSSED, is a configurable combination of semantic and sentiment neural models. The official task submission achieved a micro-averaged F1 score of 75.31, which placed us 16th out of 165 participating systems.
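
A minimal sketch of the "combination" idea, assuming the semantic and sentiment representations are simply concatenated before classification; the vectors, dimensions and linear scorer below are hypothetical placeholders, not the ConSSED architecture.

```python
import numpy as np

# Sketch: concatenate a semantic and a sentiment representation of the
# utterance, then score the combined features against the emotion classes.
EMOTIONS = ["Happy", "Sad", "Angry", "Others"]

def combine(semantic_vec, sentiment_vec):
    """Concatenate the two representations into one feature vector."""
    return np.concatenate([semantic_vec, sentiment_vec])

def classify(features, weights):
    """Linear scoring over the combined features (weights are hypothetical)."""
    return EMOTIONS[int(np.argmax(weights @ features))]

semantic = np.array([0.2, -0.1, 0.4])       # e.g. from a semantic encoder
sentiment = np.array([0.8, 0.1])            # e.g. from a sentiment model
weights = np.random.rand(len(EMOTIONS), 5)  # 4 classes x 5 combined features
print(classify(combine(semantic, sentiment), weights))
```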

Numbers Normalisation in the Inflected Languages: a Case Study of Polish
Rafał Poświata | Michał Perełkiewicz
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

Text normalisation in Text-to-Speech systems is the process of converting written expressions to their spoken forms. This task is complicated because in many cases the normalised form depends on the context. Furthermore, for languages like Croatian, Lithuanian, Polish, Russian or Slovak, there is an additional difficulty related to their inflected nature. In this paper we show how to deal with this problem for one of these languages, Polish, without a large dedicated data set, using solutions prepared for other NLP tasks. We limited our study to number expressions, which are the most common non-standard words to normalise. The proposed solution is a combination of a morphological tagger and a transducer, supported by a dictionary of numbers in their spoken forms. The data set used for evaluation is based on part of the 1-million-word subset of the National Corpus of Polish. The accuracy of the described approach is presented together with a comparison to a simple baseline and two commercial systems: Google Cloud Text-to-Speech and Amazon Polly.
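
The dictionary component of this approach can be illustrated with a minimal sketch: look up the spoken form of a number for the grammatical case predicted by the morphological tagger. The tiny dictionary below covers only a few forms of the Polish numeral 2 and is purely illustrative; the actual system additionally relies on a transducer and a full dictionary of spoken forms.

```python
# Minimal sketch of a (number, case) -> spoken-form lookup for Polish.
# The keys and case abbreviations are assumptions for illustration only.
SPOKEN_FORMS = {
    (2, "nom"): "dwa",      # nominative:    "dwa koty"
    (2, "gen"): "dwóch",    # genitive:      "dwóch kotów"
    (2, "ins"): "dwoma",    # instrumental:  "dwoma kotami"
}

def normalise_number(value, case):
    """Return the inflected spoken form for a number in the given case."""
    try:
        return SPOKEN_FORMS[(value, case)]
    except KeyError:
        raise ValueError(f"no spoken form for {value} in case {case!r}")

print(normalise_number(2, "gen"))  # -> "dwóch"
```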