Matej Klemen


2024

SENTA: Sentence Simplification System for Slovene
Aleš Žagar | Matej Klemen | Marko Robnik-Šikonja | Iztok Kosem
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Ensuring universal access to written content, regardless of users’ language proficiency and cognitive abilities, is of paramount importance. Sentence simplification, which involves converting complex sentences into more accessible forms while preserving their meaning, plays a crucial role in enhancing text accessibility. This paper introduces SENTA, a system for sentence simplification in Slovene. The system consists of two components. First, a neural classifier identifies sentences that require simplification; second, a large Slovene language model based on the T5 architecture is fine-tuned to transform complex texts into a simpler form, achieving an excellent SARI score of 41. Both automatic and qualitative evaluations provide important insights into the problem, highlighting areas for future research in multilingual applications and fluency maintenance. Finally, SENTA is integrated into a freely accessible, user-friendly interface, offering a valuable service to less-fluent Slovene users.

SI-NLI: A Slovene Natural Language Inference Dataset and Its Evaluation
Matej Klemen | Aleš Žagar | Jaka Čibej | Marko Robnik-Šikonja
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Natural language inference (NLI) is an important language understanding benchmark. Two deficiencies of this benchmark are: i) most existing NLI datasets exist for English and a few other well-resourced languages, and ii) most NLI datasets are formed with a narrow set of annotators’ instructions, allowing the prediction models to capture linguistic clues instead of measuring true reasoning capability. We address both issues and introduce SI-NLI, the first dataset for Slovene natural language inference. The dataset is constructed from scratch using knowledgeable annotators with carefully crafted guidelines aiming to avoid commonly encountered problems in existing NLI datasets. We also manually translate SI-NLI into English to enable cross-lingual model training and evaluation. Using the newly created dataset and its translation, we train and evaluate a variety of large transformer language models in monolingual and cross-lingual settings. The results indicate that larger models, in general, achieve better performance. The qualitative analysis shows that the SI-NLI dataset is diverse and that there remains plenty of room for improvement even for the largest models.

2022

ULFRI at SemEval-2022 Task 4: Leveraging uncertainty and additional knowledge for patronizing and condescending language detection
Matej Klemen | Marko Robnik-Šikonja
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

We describe the ULFRI system used in Subtask 1 of SemEval-2022 Task 4, Patronizing and Condescending Language Detection. Our models are based on the RoBERTa model, modified in two ways: (1) by injecting additional knowledge (coreferences, named entities, dependency relations, and sentiment) and (2) by leveraging the task uncertainty by using soft labels, Monte Carlo dropout, and threshold optimization. We find that the injection of additional knowledge is not helpful, but the uncertainty management mechanisms lead to small but consistent improvements. Our final system based on these findings achieves F1 = 0.575 in the online evaluation, ranking 19th out of 78 systems.
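
As a rough illustration of two of the uncertainty mechanisms named in the abstract, the sketch below averages several stochastic predictions per example (as Monte Carlo dropout would produce) and then searches for the decision threshold that maximizes F1 on held-out data. It is not the authors' implementation: the abstract does not say how thresholds were searched, so the grid search, the function names, and the toy numbers are all assumptions made for illustration.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_probability(samples_per_example):
    """Average the positive-class probabilities from several stochastic
    forward passes (hypothetical Monte Carlo dropout outputs)."""
    return [sum(s) / len(s) for s in samples_per_example]


def best_threshold(y_true, probs, candidates=None):
    """Grid-search the threshold that maximizes F1 (assumed search strategy).
    Ties are resolved in favor of the smallest candidate."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation data: three stochastic passes per example (made-up numbers).
y_true = [1, 0, 1, 0]
mc_samples = [
    [0.40, 0.50, 0.30],
    [0.10, 0.20, 0.30],
    [0.45, 0.35, 0.40],
    [0.20, 0.10, 0.30],
]
probs = mean_probability(mc_samples)
threshold, score = best_threshold(y_true, probs)
```

In this toy case the default threshold of 0.5 would label every example negative, whereas a lower threshold separates the two classes, which is the kind of effect threshold optimization targets on imbalanced tasks.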