2024
Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction
Albert Sawczyn | Katsiaryna Viarenich | Konrad Wojtasik | Aleksandra Domogała | Marcin Oleksy | Maciej Piasecki | Tomasz Kajdanowicz
Findings of the Association for Computational Linguistics: ACL 2024
Advancements in AI and natural language processing have revolutionized machine-human language interactions, with question answering (QA) systems playing a pivotal role. The knowledge base question answering (KBQA) task, which utilizes structured knowledge graphs (KGs), allows for handling extensive knowledge-intensive questions. However, a significant gap exists in KBQA datasets, especially for low-resource languages. Many existing construction pipelines for these datasets are outdated and labor-intensive, and they do not use modern assisting tools such as large language models (LLMs) to reduce the workload. To address this, we designed and implemented a modern, semi-automated approach for creating datasets for KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR), tailored explicitly for low-resource environments. We executed this pipeline and introduced the PUGG dataset, the first Polish KBQA dataset, together with novel datasets for MRC and IR. Additionally, we provide a comprehensive implementation, insightful findings, detailed statistics, and an evaluation of baseline models.
2023
ISO 24617-2 on a cusp of languages
Krzysztof Hwaszcz | Marcin Oleksy | Aleksandra Domogała | Jan Wieczorek
Proceedings of the 19th Joint ACL-ISO Workshop on Interoperable Semantics (ISA-19)
The article discusses the challenges of cross-linguistic dialogue act annotation, which involves using methods developed for one language to annotate conversations in another. It focuses on research on dialogue act annotation in Polish, based on the ISO standard developed for English. Using selected examples from the DiaBiz.Kom corpus, it examines differences between Polish and English in dialogue act annotation, such as the use of honorifics in Polish, the use of inflection to convey meaning, the tendency to use complex sentence structures, and cultural differences that may play a role in the annotation of dialogue acts. The article also discusses the creation of DiaBiz.Kom, a Polish dialogue corpus based on the ISO 24617-2 standard applied to 1100 transcripts.
2022
DiaBiz.Kom - towards a Polish Dialogue Act Corpus Based on ISO 24617-2 Standard
Marcin Oleksy | Jan Wieczorek | Dorota Drużyłowska | Julia Klyus | Aleksandra Domogała | Krzysztof Hwaszcz | Hanna Kędzierska | Daria Mikoś | Anita Wróż
Proceedings of the 29th International Conference on Computational Linguistics
This article presents the specification and evaluation of DiaBiz.Kom, a corpus of dialogue texts in Polish. The corpus contains transcriptions of telephone conversations conducted according to a prepared scenario. The transcripts have been manually annotated with a layer of information concerning communicative functions. DiaBiz.Kom is the first corpus of this type prepared for the Polish language and will be used to develop a dialogue analysis system and modules for creating advanced chatbots.