PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods

Slawomir Dadas; Michał Perełkiewicz; Rafał Poświata

Abstract

We present the Polish Information Retrieval Benchmark (PIRB), a comprehensive evaluation framework encompassing 41 text information retrieval tasks for Polish. The benchmark incorporates existing datasets as well as 10 new, previously unpublished datasets covering diverse topics such as medicine, law, business, physics, and linguistics. We conduct an extensive evaluation of over 20 dense and sparse retrieval models, including baseline models trained by us as well as other available Polish and multilingual methods. Finally, we introduce a three-step process for training highly effective language-specific retrievers, consisting of knowledge distillation, supervised fine-tuning, and building sparse-dense hybrid retrievers using a lightweight rescoring model. To validate our approach, we train new text encoders for Polish and compare their results with previously evaluated methods. Our dense models outperform the best solutions available to date, and the use of hybrid methods further improves their performance.

Anthology ID: 2024.lrec-main.1117
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 12761–12774
URL: https://aclanthology.org/2024.lrec-main.1117
Cite (ACL): Slawomir Dadas, Michał Perełkiewicz, and Rafał Poświata. 2024. PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12761–12774, Torino, Italia. ELRA and ICCL.
Cite (Informal): PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods (Dadas et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.1117.pdf
