Tomáš Harmatha
2025
Exploring the Performance of Large Language Models for Event Detection and Extraction in the Health Domain
Hristo Tanev | Nicolas Stefanovitch | Tomáš Harmatha | Diana F. Sousa
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Large Language Models (LLMs) have entered the world of NLP at a fast pace. LLMs have been used for summarization, translation, named entity recognition, and sentiment analysis. Recently, different research groups have experimented with event detection and extraction, using LLMs at various stages of processing: LLMs have proven to be a very relevant technology, from data preparation to event argument extraction. In particular, open-source LLMs like Mistral are very important, since they can be shared and modified by the research community. Still, little effort has been made to study the performance of these models in NLP tasks like event extraction. In this paper we describe an experiment evaluating several state-of-the-art open LLMs on the tasks of event detection and event extraction in the health domain. The models were prompted to detect health-related events: mostly disease outbreaks, but also natural and man-made disasters that directly or indirectly affect people's health. The models were also asked to extract the place, time, number of human and animal cases, and number of human fatalities. On a test set of 800 news abstracts, each containing the title and lead sentences of a health-related news article, the LLMs performed better than a state-of-the-art knowledge-based system. We compared the event detection and event argument extraction performance of the open LLMs with that of two knowledge-based event extraction systems, NEXUS and Medical NEXUS. Our evaluation shows that all the open LLMs outperform the knowledge-based systems; the largest improvement is 0.2 in F1 score for detecting the number of human fatalities (0.84 vs. 0.64), where the best-performing LLM was Llama 3.3 70B Instruct.
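The per-slot F1 scores reported in the abstract can be illustrated with a minimal sketch. The abstract does not state how extracted values were matched against the gold annotations, so the exact-match criterion, the function name `slot_f1`, and the toy data below are all assumptions made for illustration, not the authors' evaluation code.

```python
from typing import Dict, Optional, Sequence


def slot_f1(gold: Sequence[Optional[str]],
            pred: Sequence[Optional[str]]) -> Dict[str, float]:
    """Precision, recall and F1 for one extracted slot (e.g. fatalities).

    Assumption: a prediction counts as correct only if it exactly matches
    the gold value; None means "no value" for that news abstract.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # True positives: a value was predicted and it equals the gold value.
    tp = sum(1 for g, p in zip(gold, pred) if p is not None and g == p)
    n_pred = sum(1 for p in pred if p is not None)  # values the system emitted
    n_gold = sum(1 for g in gold if g is not None)  # values it should have found

    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data: fatality counts for four news abstracts.
gold = ["12", None, "3", "40"]
pred = ["12", "5", None, "40"]
scores = slot_f1(gold, pred)
print(scores)
```

Under this criterion a spurious value lowers precision and a missed value lowers recall, so a system that extracts fatality counts both more completely and more accurately scores a higher F1.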