Abdulsalam Obaid Alharbi


2025

Evaluating Large Language Models on Health-Related Claims Across Arabic Dialects
Abdulsalam Obaid Alharbi | Abdullah Alsuhaibani | Abdulrahman Abdullah Alalawi | Usman Naseem | Shoaib Jameel | Salil Kanhere | Imran Razzak
Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script

While Large Language Models (LLMs) have proven effective across many tasks, their ability to handle health-related claims in diverse linguistic and cultural contexts, such as the Saudi, Egyptian, Lebanese, and Moroccan Arabic dialects, has not been thoroughly explored. To this end, we develop a comprehensive evaluation framework to assess how LLMs, particularly GPT-4, respond to health-related claims. Our framework focuses on measuring factual accuracy, consistency, and cultural adaptability. It introduces a new metric, the “Cultural Sensitivity Score”, to evaluate the model’s ability to adjust responses based on dialectal differences. Additionally, we analyze the models’ reasoning patterns to assess how effectively they engage with claims across these dialects. Our findings show that while LLMs excel at recognizing true claims, they struggle with mixed and ambiguous claims, especially in underrepresented dialects. This work underscores the importance of dialect-specific evaluations to ensure accurate, contextually appropriate, and culturally sensitive LLM responses in real-world applications.
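The abstract does not specify how the Cultural Sensitivity Score is computed. A minimal sketch of one plausible formulation, assuming the score rewards verdicts that remain factually correct across dialects while adapting their wording to each dialect, might look like the following; all names (`DialectResponse`, `cultural_sensitivity_score`, the `adapted` judgment, and the 0.5 penalty) are hypothetical illustrations, not the paper's actual definition:

```python
from dataclasses import dataclass


@dataclass
class DialectResponse:
    dialect: str   # e.g. "Saudi", "Egyptian", "Lebanese", "Moroccan"
    verdict: str   # model's label for the claim: "true", "false", or "mixed"
    adapted: bool  # human judgment: was the wording adapted to the dialect?


def cultural_sensitivity_score(responses: list[DialectResponse],
                               gold_verdict: str) -> float:
    """Hypothetical metric: average over dialects of verdict correctness,
    discounted when the response ignores dialect-specific phrasing."""
    if not responses:
        return 0.0
    per_dialect = [
        (1.0 if r.verdict == gold_verdict else 0.0)  # factual accuracy
        * (1.0 if r.adapted else 0.5)                # dialectal adaptation
        for r in responses
    ]
    return sum(per_dialect) / len(per_dialect)


# Usage: score one health claim evaluated across four Arabic dialects.
responses = [
    DialectResponse("Saudi", "true", True),
    DialectResponse("Egyptian", "true", True),
    DialectResponse("Lebanese", "true", False),
    DialectResponse("Moroccan", "mixed", True),
]
print(cultural_sensitivity_score(responses, gold_verdict="true"))  # 0.625
```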