Reem Masoud
2025
Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede’s Cultural Dimensions
Reem Masoud | Ziquan Liu | Martin Ferianc | Philip C. Treleaven | Miguel Rodrigues
Proceedings of the 31st International Conference on Computational Linguistics
The deployment of large language models (LLMs) raises concerns regarding their cultural misalignment and potential ramifications for individuals and societies with diverse cultural backgrounds. While the discourse has focused mainly on political and social biases, our research proposes a Cultural Alignment Test (Hofstede’s CAT) to quantify cultural alignment using Hofstede’s cultural dimension framework, which offers an explanatory cross-cultural comparison through latent variable analysis. We apply our approach to quantitatively evaluate LLMs (namely Llama 2, GPT-3.5, and GPT-4) against the cultural dimensions of regions such as the United States, China, and Arab countries, using different prompting styles and exploring the effects of language-specific fine-tuning on the models’ behavioural tendencies and cultural values. Our results quantify the cultural alignment of LLMs and reveal the differences between LLMs in explanatory cultural dimensions. Our study demonstrates that while all LLMs struggle to grasp cultural values, GPT-4 shows a unique capability to adapt to cultural nuances, particularly in Chinese settings. However, it faces challenges with American and Arab cultures. The research also highlights that fine-tuning Llama 2 models with different languages changes their responses to cultural questions, emphasizing the need for culturally diverse development in AI for worldwide acceptance and ethical use. For more details or to contribute to this research, visit our GitHub page: https://github.com/reemim/Hofstedes_CAT
AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic
Emad A. Alghamdi | Reem Masoud | Deema Alnuhait | Afnan Y. Alomairi | Ahmed Ashraf | Mohamed Zaytoon
Proceedings of the 31st International Conference on Computational Linguistics
The swift progress and widespread acceptance of artificial intelligence (AI) systems highlight a pressing requirement to comprehend both the capabilities and potential risks associated with AI. Given the linguistic complexity, cultural richness, and underrepresented status of Arabic in AI research, there is a pressing need to focus on the performance and safety of large language models (LLMs) on Arabic-related tasks. Despite some progress in their development, the lack of comprehensive trustworthiness evaluation benchmarks presents a major challenge in accurately assessing and improving the safety of LLMs when prompted in Arabic. In this paper, we introduce AraTrust, the first comprehensive trustworthiness benchmark for LLMs in Arabic. AraTrust comprises 522 human-written multiple-choice questions addressing diverse dimensions related to truthfulness, ethics, privacy, illegal activities, mental health, physical health, unfairness, and offensive language. We evaluated a set of LLMs against our benchmark to assess their trustworthiness. GPT-4 was the most trustworthy LLM, while open-source models, particularly AceGPT 7B and Jais 13B, struggled to achieve a score of 60% in our benchmark. The benchmark dataset is publicly available at https://huggingface.co/datasets/asas-ai/AraTrust