Marian Simko - ACL Anthology

Marian Simko

Also published as: Marián Šimko

2026

Investigating Language and Retrieval Bias in Multilingual Previously Fact-Checked Claim Detection
Ivan Vykopal | Antonia Karamolegkou | Jaroslav Kopčan | Qiwei Peng | Tomáš Javůrek | Michal Gregor | Marian Simko
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Multilingual Large Language Models (LLMs) offer powerful capabilities for cross-lingual fact-checking. However, these models often exhibit language bias, performing disproportionately better on high-resource languages such as English than on low-resource counterparts. We also present and inspect a novel concept, retrieval bias, in which information retrieval systems tend to favor certain information over others, leaving the retrieval process skewed. In this paper, we study language and retrieval bias in the context of Previously Fact-Checked Claim Detection (PFCD). We evaluate six open-source multilingual LLMs across 20 languages using a fully multilingual prompting strategy, leveraging the AMC-16K dataset. By translating task prompts into each language, we uncover disparities in monolingual and cross-lingual performance and identify key trends based on model family, size, and prompting strategy. Our findings highlight persistent bias in LLM behavior and offer recommendations for improving equity in multilingual fact-checking. To investigate retrieval bias, we employ multilingual embedding models and examine the frequency of retrieved claims. Our analysis reveals that certain claims are retrieved disproportionately across different posts, leading to inflated retrieval performance for popular claims while under-representing less common ones.

Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Ivan Vykopal | Matúš Pikuliak | Simon Ostermann | Marian Simko
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants’ web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants, with Perplexity achieving the highest source credibility, whereas GPT-4o exhibits elevated citation of non-credible sources on sensitive topics. This work provides the first systematic comparison of commonly used chat assistants for fact-checking behavior, offering a foundation for evaluating AI systems in high-stakes information environments.

2025

SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval
Qiwei Peng | Robert Moro | Michal Gregor | Ivan Srba | Simon Ostermann | Marian Simko | Juraj Podrouzek | Matúš Mesarčík | Jaroslav Kopčan | Anders Søgaard
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

The rapid spread of online disinformation presents a global challenge, and machine learning has been widely explored as a potential solution. However, multilingual settings and low-resource languages are often neglected in this field. To address this gap, we conducted a shared task on multilingual claim retrieval at SemEval 2025, aimed at identifying fact-checked claims that match newly encountered claims expressed in social media posts across different languages. The task includes two subtracks: 1) a monolingual track, where social posts and claims are in the same language, and 2) a crosslingual track, where social posts and claims might be in different languages. A total of 179 participants registered for the task, contributing 52 test submissions, and 23 out of 31 teams submitted system papers. In this paper, we report the best-performing systems as well as the most common and the most effective approaches across both subtracks. This shared task, along with its dataset and participating systems, provides valuable insights into multilingual claim retrieval and automated fact-checking, supporting future research in this field.

Soft Language Prompts for Language Transfer
Ivan Vykopal | Simon Ostermann | Marian Simko
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Cross-lingual knowledge transfer, especially between high- and low-resource languages, remains challenging in natural language processing (NLP). This study offers insights for improving cross-lingual NLP applications through the combination of parameter-efficient fine-tuning methods. We systematically explore strategies for enhancing cross-lingual transfer through the incorporation of language-specific and task-specific adapters and soft prompts. We present a detailed investigation of various combinations of these methods, exploring their efficiency across 16 languages, focusing on 10 mid- and low-resource languages. We further present, to our knowledge, the first use of soft prompts for language transfer, a technique we call soft language prompts. Our findings demonstrate that, in contrast to claims of previous work, a combination of language and task adapters does not always work best; instead, combining a soft language prompt with a task adapter outperforms most configurations in many cases.

Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
Ernesto Luis Estevanell-Valladares | Alicia Picazo-Izquierdo | Tharindu Ranasinghe | Besik Mikaberidze | Simon Ostermann | Daniil Gurgurov | Philipp Mueller | Claudia Borg | Marián Šimko
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages

Large Language Models for Multilingual Previously Fact-Checked Claim Detection
Ivan Vykopal | Matúš Pikuliak | Simon Ostermann | Tatiana Anikina | Michal Gregor | Marian Simko
Findings of the Association for Computational Linguistics: EMNLP 2025

In our era of widespread false information, human fact-checkers often face the challenge of duplicating efforts when verifying claims that may have already been addressed in other countries or languages. As false information transcends linguistic boundaries, the ability to automatically detect previously fact-checked claims across languages has become an increasingly important task. This paper presents the first comprehensive evaluation of large language models (LLMs) for multilingual previously fact-checked claim detection. We assess seven LLMs across 20 languages in both monolingual and cross-lingual settings. Our results show that while LLMs perform well for high-resource languages, they struggle with low-resource languages. Moreover, translating original texts into English proved to be beneficial for low-resource languages. These findings highlight the potential of LLMs for multilingual previously fact-checked claim detection and provide a foundation for further research on this promising application of LLMs.

skLEP: A Slovak General Language Understanding Benchmark
Marek Suppa | Andrej Ridzik | Daniel Hládek | Tomáš Javůrek | Viktória Ondrejová | Kristína Sásiková | Martin Tamajka | Marian Simko
Findings of the Association for Computational Linguistics: ACL 2025

In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we also release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at https://github.com/slovak-nlp/sklep in the hope of fostering reproducibility and driving future research in Slovak NLU.

When the Dictionary Strikes Back: A Case Study on Slovak Migration Location Term Extraction and NER via Rule-Based vs. LLM Methods
Miroslav Blšták | Jaroslav Kopčan | Marek Šuppa | Samuel Harvan | Andrej Findor | Martin Takáč | Marián Šimko
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)

This study explores the task of automatically extracting migration-related locations (source and destination) from media articles, focusing on the challenges posed by Slovak, a low-resource and morphologically complex language. We present the first comparative analysis of rule-based dictionary approaches (NLP4SK) versus Large Language Models (LLMs, e.g. SlovakBERT, GPT-4o) for both geographical relevance classification (Slovakia-focused migration) and specific source/target location extraction. To facilitate this research and future work, we introduce the first manually annotated Slovak dataset tailored for migration-focused locality detection. Our results show that while a fine-tuned SlovakBERT model achieves high accuracy for classification, specialized rule-based methods still have the potential to outperform LLMs for specific extraction tasks, though improved LLM performance with few-shot examples suggests future competitiveness as research in this area continues to evolve.

2024

Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling
Matúš Pikuliak | Stefan Oresko | Andrea Hrckova | Marian Simko
Findings of the Association for Computational Linguistics: EMNLP 2024

We present GEST – a new manually created dataset designed to measure gender-stereotypical reasoning in language models and machine translation systems. GEST contains samples for 16 gender stereotypes about men and women (e.g., Women are beautiful, Men are leaders) that are compatible with the English language and 9 Slavic languages. The definition of said stereotypes was informed by gender experts. We used GEST to evaluate English and Slavic masked LMs, English generative LMs, and machine translation systems. We discovered significant and consistent amounts of gender-stereotypical reasoning in almost all the evaluated models and languages. Our experiments confirm the previously postulated hypothesis that the larger the model, the more stereotypical it usually is.

ChatGPT as Your n-th Annotator: Experiments in Leveraging Large Language Models for Social Science Text Annotation in Slovak Language
Endre Hamerlik | Marek Šuppa | Miroslav Blšták | Jozef Kubík | Martin Takáč | Marián Šimko | Andrej Findor
Proceedings of the 4th Workshop on Computational Linguistics for the Political and Social Sciences: Long and short papers

Large Language Models (LLMs) are increasingly influential in Computational Social Science, offering new methods for processing and analyzing data, particularly in lower-resource language contexts. This study explores the use of OpenAI’s GPT-3.5 Turbo and GPT-4 for automating annotations for a unique news media dataset in a lower resourced language, focusing on stance classification tasks. Our results reveal that prompting in the native language, explanation generation, and advanced prompting strategies like Retrieval Augmented Generation and Chain of Thought prompting enhance LLM performance, particularly noting GPT-4’s superiority in predicting stance. Further evaluation indicates that LLMs can serve as a useful tool for social science text annotation in lower resourced languages, notably in identifying inconsistencies in annotation guidelines and annotated datasets.

2022

Average Is Not Enough: Caveats of Multilingual Evaluation
Matúš Pikuliak | Marian Simko
Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL)

This position paper discusses the problem of multilingual evaluation. Using simple statistics, such as average language performance, might inject linguistic biases in favor of dominant language families into evaluation methodology. We argue that a qualitative analysis informed by comparative linguistics is needed for multilingual results to detect this kind of bias. We show in our case study that results in published works can indeed be linguistically biased, and we demonstrate that visualization based on the URIEL typological database can detect it.

SlovakBERT: Slovak Masked Language Model
Matúš Pikuliak | Štefan Grivalský | Martin Konôpka | Miroslav Blšták | Martin Tamajka | Viktor Bachratý | Marian Simko | Pavol Balážik | Michal Trnka | Filip Uhlárik
Findings of the Association for Computational Linguistics: EMNLP 2022

We introduce a new Slovak masked language model called SlovakBERT. To the best of our knowledge, this is the first paper discussing Slovak transformer-based language models. We evaluate our model on several NLP tasks and achieve state-of-the-art results. This evaluation is likewise the first attempt to establish a benchmark for Slovak language models. We publish the masked language model, as well as the fine-tuned models for part-of-speech tagging, sentiment analysis, and semantic textual similarity.

2020

NLFIIT at SemEval-2020 Task 11: Neural Network Architectures for Detection of Propaganda Techniques in News Articles
Matej Martinkovic | Samuel Pecar | Marian Simko
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Since propaganda has become a more common technique in news, it is very important to look for ways to detect it automatically. In this paper, we present the neural model architecture submitted to the SemEval-2020 Task 11 competition: “Detection of Propaganda Techniques in News Articles”. We participated in both subtasks, propaganda span identification and propaganda technique classification. Our model utilizes recurrent Bi-LSTM layers with pre-trained word representations and also takes advantage of a self-attention mechanism. Our model achieved an F1 score of 0.405 for subtask 1 and 0.553 for subtask 2 on the test set, resulting in 17th and 16th place in subtask 1 and subtask 2, respectively.

2019

Improving Sentiment Classification in Slovak Language
Samuel Pecar | Marian Simko | Maria Bielikova
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

Different neural network architectures are widely used for many NLP tasks. Unfortunately, most of the research is performed and evaluated only in English, and minor languages are often omitted. We believe using similar architectures for other languages can show interesting results. In this paper, we present our study on methods for improving sentiment classification in the Slovak language. We performed several experiments on two different datasets, one containing customer reviews, the other general Twitter posts. We compare the performance of different neural network architectures and different word representations. We show that a further improvement can be achieved by using a model ensemble, and we performed experiments with different ensembling methods. Our proposed models achieved better results than previous models on both datasets. Our experiments also revealed other potential research areas.

NL-FIIT at SemEval-2019 Task 9: Neural Model Ensemble for Suggestion Mining
Samuel Pecar | Marian Simko | Maria Bielikova
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we present the neural model architecture submitted to the SemEval-2019 Task 9 competition: “Suggestion Mining from Online Reviews and Forums”. We participated in both subtasks, domain-specific and cross-domain suggestion mining. We proposed a recurrent neural network architecture that employs Bi-LSTM layers and a self-attention mechanism. Our architecture encodes words via ELMo word representations and ensembles multiple models to achieve better results. We highlight the importance of pre-processing user-generated samples and its contribution to the overall results. We performed experiments with different setups of our proposed model involving weighting of prediction classes for the loss function. In the official test evaluation, our best model achieved a score of 0.6816 for subtask A and 0.6850 for subtask B. In the official results, we achieved 12th and 10th place in subtasks A and B, respectively.

2018

NL-FIIT at IEST-2018: Emotion Recognition utilizing Neural Networks and Multi-level Preprocessing
Samuel Pecar | Michal Farkas | Marian Simko | Peter Lacko | Maria Bielikova
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In this paper, we present neural models submitted to the Shared Task on Implicit Emotion Recognition, organized as part of WASSA 2018. We propose a Bi-LSTM architecture with regularization through dropout and Gaussian noise. Our models use three different embedding layers: GloVe word embeddings trained on a Twitter dataset, ELMo embeddings, and sentence embeddings. We see preprocessing as one of the most important parts of the task. We focused on handling emojis, emoticons, hashtags, and various shortened word forms. In some cases, we proposed removing some parts of the text, as they do not affect the emotion of the original sentence. We also experimented with other modifications, such as category weights for learning and stacking multiple layers. Our model achieved a macro average F1 score of 65.55%, significantly outperforming the baseline model produced by a simple logistic regression.

Improving Moderation of Online Discussions via Interpretable Neural Models
Andrej Švec | Matúš Pikuliak | Marián Šimko | Mária Bieliková
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

The growing number of comments makes online discussions difficult for human moderators alone to moderate. Antisocial behavior is a common occurrence that often discourages other users from participating in discussion. We propose a neural network based method that partially automates the moderation process. It consists of two steps. First, we detect inappropriate comments for moderators to see. Second, we highlight inappropriate parts within these comments to make the moderation faster. We evaluated our method on data from a major Slovak news discussion platform.
