JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models

Junfeng Jiang; Jiahao Huang; Akiko Aizawa

Abstract

Recent developments in Japanese large language models (LLMs) primarily focus on general domains, with fewer advancements in Japanese biomedical LLMs. One obstacle is the absence of a comprehensive, large-scale benchmark for comparison. Furthermore, the resources for evaluating Japanese biomedical LLMs are insufficient. To advance this field, we propose a new benchmark including eight LLMs across four categories and 20 Japanese biomedical datasets across five tasks. Experimental results indicate that: (1) LLMs with a better understanding of Japanese and richer biomedical knowledge achieve better performance in Japanese biomedical tasks, (2) LLMs that are not mainly designed for Japanese biomedical domains can still perform unexpectedly well, and (3) there is still much room for improving the existing LLMs in certain Japanese biomedical tasks. Moreover, we offer insights that could further enhance development in this field. Our evaluation tools tailored to our benchmark as well as the datasets are publicly available to facilitate future research.

Anthology ID: 2025.coling-main.395
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 5918–5935
URL: https://aclanthology.org/2025.coling-main.395/
Cite (ACL): Junfeng Jiang, Jiahao Huang, and Akiko Aizawa. 2025. JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 5918–5935, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models (Jiang et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.395.pdf
