Juraj Vladika

2025

Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models
Juraj Vladika | Mahdi Dhaini | Florian Matthes
Findings of the Association for Computational Linguistics: EMNLP 2025

The growing capabilities of Large Language Models (LLMs) can enhance healthcare by assisting medical researchers, physicians, and improving access to health services for patients. LLMs encode extensive knowledge within their parameters, including medical knowledge derived from many sources. However, the knowledge in LLMs can become outdated over time, posing challenges in keeping up with evolving medical recommendations and research. This can lead to LLMs providing outdated health advice or failures in medical reasoning tasks. To address this gap, our study introduces two novel biomedical question-answering (QA) datasets derived from medical systematic literature reviews: MedRevQA, a general dataset of 16,501 biomedical QA pairs, and MedChangeQA, a subset of 512 QA pairs whose verdict changed though time. By evaluating the performance of eight popular LLMs, we find that all models exhibit memorization of outdated knowledge to some extent. We provide deeper insights and analysis, paving the way for future research on this challenging aspect of LLMs.

pdf bib abs

Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning
Juraj Vladika | Ivana Hacajova | Florian Matthes
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Fact verification (FV) aims to assess the veracity of a claim based on relevant evidence. The traditional approach for automated FV includes a three-part pipeline relying on short evidence snippets and encoder-only inference models. More recent approaches leverage the multi-turn nature of LLMs to address FV as a step-by-step problem where questions inquiring additional context are generated and answered until there is enough information to make a decision. This iterative method makes the verification process rational and explainable. While these methods have been tested for encyclopedic claims, exploration on domain-specific and realistic claims is missing. In this work, we apply an iterative FV system on three medical fact-checking datasets and evaluate it with multiple settings, including different LLMs, external web search, and structured reasoning using logic predicates. We demonstrate improvements in the final performance over traditional approaches and the high potential of step-by-step FV systems for domain-specific claims.

pdf bib abs

Natural Language Inference Fine-tuning for Scientific Hallucination Detection
Tim Schopf | Juraj Vladika | Michael Färber | Florian Matthes
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

Modern generative Large Language Models (LLMs) are capable of generating text that sounds coherent and convincing, but are also prone to producing hallucinations, facts that contradict the world knowledge. Even in the case of Retrieval-Augmented Generation (RAG) systems, where relevant context is first retrieved and passed in the input, the generated facts can contradict or not be verifiable by the provided references. This has motivated SciHal 2025, a shared task that focuses on the detection of hallucinations for scientific content. The two subtasks focused on: (1) predicting whether a claim from a generated LLM answer is entailed, contradicted, or unverifiable by the used references; (2) predicting a fine-grained category of erroneous claims. Our best performing approach used an ensemble of fine-tuned encoder-only ModernBERT and DeBERTa-v3 models for classification. Out of nine competing teams, our approach achieved the first place in sub-task 1 and the second place in sub-task 2.

pdf bib abs

On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems
Juraj Vladika | Florian Matthes
Findings of the Association for Computational Linguistics: NAACL 2025

Retrieval-augmented generation (RAG) has emerged as an approach to augment large language models (LLMs) by reducing their reliance on static knowledge and improving answer factuality. RAG retrieves relevant context snippets and generates an answer based on them. Despite its increasing industrial adoption, systematic exploration of RAG components is lacking, particularly regarding the ideal size of provided context, and the choice of base LLM and retrieval method. To help guide development of robust RAG systems, we evaluate various context sizes, BM25 and semantic search as retrievers, and eight base LLMs. Moving away from the usual RAG evaluation with short answers, we explore the more challenging long-form question answering in two domains, where a good answer has to utilize the entire context. Our findings indicate that final QA performance improves steadily with up to 15 snippets but stagnates or declines beyond that. Finally, we show that different general-purpose LLMs excel in the biomedical domain than the encyclopedic one, and that open-domain evidence retrieval in large corpora is challenging.

pdf bib

FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain
Anum Afzal | Juraj Vladika | Florian Matthes
Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025)

pdf bib abs

Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge
Juraj Vladika | Ihsan Soydemir | Florian Matthes
Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER)

While large language models (LLMs) have shown remarkable capabilities to generate coherent text, they suffer from the issue of hallucinations – factually inaccurate statements. Among numerous approaches to tackle hallucinations, especially promising are the self-correcting methods. They leverage the multi-turn nature of LLMs to iteratively generate verification questions inquiring additional evidence, answer them with internal or external knowledge, and use that to refine the original response with the new corrections. These methods have been explored for encyclopedic generation, but less so for domains like news summaries. In this work, we investigate two state-of-the-art self-correcting systems by applying them to correct hallucinated summaries, using evidence from three search engines. We analyze the results and provide insights into systems’ performance, revealing interesting practical findings on the benefits of search engine snippets and few-shot prompts, as well as high alignment of G-Eval and human evaluation.

2024

pdf bib abs

Comparing Knowledge Sources for Open-Domain Scientific Claim Verification
Juraj Vladika | Florian Matthes
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The increasing rate at which scientific knowledge is discovered and health claims shared online has highlighted the importance of developing efficient fact-checking systems for scientific claims. The usual setting for this task in the literature assumes that the documents containing the evidence for claims are already provided and annotated or contained in a limited corpus. This renders the systems unrealistic for real-world settings where knowledge sources with potentially millions of documents need to be queried to find relevant evidence. In this paper, we perform an array of experiments to test the performance of open-domain claim verification systems. We test the final verdict prediction of systems on four datasets of biomedical and health claims in different settings. While keeping the pipeline’s evidence selection and verdict prediction parts constant, document retrieval is performed over three common knowledge sources (PubMed, Wikipedia, Google) and using two different information retrieval techniques. We show that PubMed works better with specialized biomedical claims, while Wikipedia is more suited for everyday health concerns. Likewise, BM25 excels in retrieval precision, while semantic search in recall of relevant evidence. We discuss the results, outline frequent retrieval patterns and challenges, and provide promising future directions.

pdf bib abs

MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering
Juraj Vladika | Phillip Schneider | Florian Matthes
Findings of the Association for Computational Linguistics: ACL 2024

In recent years, Large Language Models (LLMs) have demonstrated an impressive ability to encode knowledge during pre-training on large text corpora. They can leverage this knowledge for downstream tasks like question answering (QA), even in complex areas involving health topics. Considering their high potential for facilitating clinical work in the future, understanding the quality of encoded medical knowledge and its recall in LLMs is an important step forward. In this study, we examine the capability of LLMs to exhibit medical knowledge recall by constructing a novel dataset derived from systematic reviews – studies synthesizing evidence-based answers for specific medical questions. Through experiments on the new MedREQAL dataset, comprising question-answer pairs extracted from rigorous systematic reviews, we assess six LLMs, such as GPT and Mixtral, analyzing their classification and generation performance. Our experimental insights into LLM performance on the novel biomedical QA dataset reveal the still challenging nature of this task.

pdf bib abs

HealthFC: Verifying Health Claims with Evidence-Based Medical Fact-Checking
Juraj Vladika | Phillip Schneider | Florian Matthes
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In the digital age, seeking health advice on the Internet has become a common practice. At the same time, determining the trustworthiness of online medical content is increasingly challenging. Fact-checking has emerged as an approach to assess the veracity of factual claims using evidence from credible knowledge sources. To help advance automated Natural Language Processing (NLP) solutions for this task, in this paper we introduce a novel dataset HealthFC. It consists of 750 health-related claims in German and English, labeled for veracity by medical experts and backed with evidence from systematic reviews and clinical trials. We provide an analysis of the dataset, highlighting its characteristics and challenges. The dataset can be used for NLP tasks related to automated fact-checking, such as evidence retrieval, claim verification, or explanation generation. For testing purposes, we provide baseline systems based on different approaches, examine their performance, and discuss the findings. We show that the dataset is a challenging test bed with a high potential for future use.

pdf bib abs

Improving Health Question Answering with Reliable and Time-Aware Evidence Retrieval
Juraj Vladika | Florian Matthes
Findings of the Association for Computational Linguistics: NAACL 2024

In today’s digital world, seeking answers to health questions on the Internet is a common practice. However, existing question answering (QA) systems often rely on using pre-selected and annotated evidence documents, thus making them inadequate for addressing novel questions. Our study focuses on the open-domain QA setting, where the key challenge is to first uncover relevant evidence in large knowledge bases. By utilizing the common retrieve-then-read QA pipeline and PubMed as a trustworthy collection of medical research documents, we answer health questions from three diverse datasets. We modify different retrieval settings to observe their influence on the QA pipeline’s performance, including the number of retrieved documents, sentence selection process, the publication year of articles, and their number of citations. Our results reveal that cutting down on the amount of retrieved documents and favoring more recent and highly cited documents can improve the final macro F1 score up to 10%. We discuss the results, highlight interesting examples, and outline challenges for future research, like managing evidence disagreement and crafting user-friendly explanations.

pdf bib

DP-MLM: Differentially Private Text Rewriting Using Masked Language Models
Stephen Meisenbacher | Maulik Chevli | Juraj Vladika | Florian Matthes
Findings of the Association for Computational Linguistics: ACL 2024

2023

pdf bib abs

Sebis at SemEval-2023 Task 7: A Joint System for Natural Language Inference and Evidence Retrieval from Clinical Trial Reports
Juraj Vladika | Florian Matthes
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

With the increasing number of clinical trial reports generated every day, it is becoming hard to keep up with novel discoveries that inform evidence-based healthcare recommendations. To help automate this process and assist medical experts, NLP solutions are being developed. This motivated the SemEval-2023 Task 7, where the goal was to develop an NLP system for two tasks: evidence retrieval and natural language inference from clinical trial data. In this paper, we describe our two developed systems. The first one is a pipeline system that models the two tasks separately, while the second one is a joint system that learns the two tasks simultaneously with a shared representation and a multi-task learning approach. The final system combines their outputs in an ensemble system. We formalize the models, present their characteristics and challenges, and provide an analysis of achieved results. Our system ranked 3rd out of 40 participants with a final submission.

pdf bib abs

Scientific Fact-Checking: A Survey of Resources and Approaches
Juraj Vladika | Florian Matthes
Findings of the Association for Computational Linguistics: ACL 2023

The task of fact-checking deals with assessing the veracity of factual claims based on credible evidence and background knowledge. In particular, scientific fact-checking is the variation of the task concerned with verifying claims rooted in scientific knowledge. This task has received significant attention due to the growing importance of scientific and health discussions on online platforms. Automated scientific fact-checking methods based on NLP can help combat the spread of misinformation, assist researchers in knowledge discovery, and help individuals understand new scientific breakthroughs. In this paper, we present a comprehensive survey of existing research in this emerging field and its related tasks. We provide a task description, discuss the construction process of existing datasets, and analyze proposed models and approaches. Based on our findings, we identify intriguing challenges and outline potential future directions to advance the field.

2022

pdf bib abs

A Decade of Knowledge Graphs in Natural Language Processing: A Survey
Phillip Schneider | Tim Schopf | Juraj Vladika | Mikhail Galkin | Elena Simperl | Florian Matthes
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In pace with developments in the research field of artificial intelligence, knowledge graphs (KGs) have attracted a surge of interest from both academia and industry. As a representation of semantic relations between entities, KGs have proven to be particularly relevant for natural language processing (NLP), experiencing a rapid spread and wide adoption within recent years. Given the increasing amount of research work in this area, several KG-related approaches have been surveyed in the NLP research community. However, a comprehensive study that categorizes established topics and reviews the maturity of individual research streams remains absent to this day. Contributing to closing this gap, we systematically analyzed 507 papers from the literature on KGs in NLP. Our survey encompasses a multifaceted review of tasks, research types, and contributions. As a result, we present a structured overview of the research landscape, provide a taxonomy of tasks, summarize our findings, and highlight directions for future work.

pdf bib abs

TUM sebis at GermEval 2022: A Hybrid Model Leveraging Gaussian Processes and Fine-Tuned XLM-RoBERTa for German Text Complexity Analysis
Juraj Vladika | Stephen Meisenbacher | Florian Matthes
Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text

The task of quantifying the complexity of written language presents an interesting endeavor, particularly in the opportunity that it presents for aiding language learners. In this pursuit, the question of what exactly about natural language contributes to its complexity (or lack thereof) is an interesting point of investigation. We propose a hybrid approach, utilizing shallow models to capture linguistic features, while leveraging a fine-tuned embedding model to encode the semantics of input text. By harmonizing these two methods, we achieve competitive scores in the given metric, and we demonstrate improvements over either singular method. In addition, we uncover the effectiveness of Gaussian processes in the training of shallow models for text complexity analysis.

2019

pdf bib abs

In this paper, we demonstrate the system built to solve the SemEval-2019 task 4: Hyperpartisan News Detection (Kiesel et al., 2019), the task of automatically determining whether an article is heavily biased towards one side of the political spectrum. Our system receives an article in its raw, textual form, analyzes it, and predicts with moderate accuracy whether the article is hyperpartisan. The learning model used was primarily trained on a manually prelabeled dataset containing news articles. The system relies on the previously constructed SVM model, available in the Python Scikit-Learn library. We ranked 6th in the competition of 42 teams with an accuracy of 79.1% (the winning team had 82.2%).