Towards Truly Open, Language-Specific, Safe, Factual, and Specialized Large Language Models

Preslav Nakov


Abstract
First, we will argue for the need for fully transparent open-source large language models (LLMs), and we will describe the efforts of MBZUAI’s Institute on Foundation Models (IFM) towards that goal, based on the LLM360 initiative. Second, we will argue for the need for language-specific LLMs, and we will share our experience from building Jais, the world’s leading open Arabic-centric foundation and instruction-tuned large language model, Nanda, our recently released open Hindi LLM, and some other models. Third, we will argue for the need for safe LLMs, and we will present Do-Not-Answer, a dataset for evaluating the guardrails of LLMs, which is at the core of the safety mechanisms of our LLMs. Fourth, we will argue for the need for factual LLMs, and we will discuss the factuality challenges that LLMs pose. We will then present some recent relevant tools for addressing these challenges developed at MBZUAI: (i) OpenFactCheck, a framework for fact-checking LLM output, for building customized fact-checking systems, and for benchmarking LLMs for factuality, (ii) LM-Polygraph, a tool for predicting an LLM’s uncertainty in its output using cheap and fast uncertainty quantification techniques, and (iii) LLM-DetectAIve, a tool for machine-generated text detection. Finally, we will argue for the need for specialized models, and we will present the zoo of LLMs currently being developed at MBZUAI’s IFM.
Anthology ID:
2025.bucc-1.3
Volume:
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Serge Sharoff, Ayla Rigouts Terryn, Pierre Zweigenbaum, Reinhard Rapp
Venues:
BUCC | WS
Publisher:
Association for Computational Linguistics
Pages:
18
URL:
https://aclanthology.org/2025.bucc-1.3/
Cite (ACL):
Preslav Nakov. 2025. Towards Truly Open, Language-Specific, Safe, Factual, and Specialized Large Language Models. In Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC), page 18, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Towards Truly Open, Language-Specific, Safe, Factual, and Specialized Large Language Models (Nakov, BUCC 2025)
PDF:
https://aclanthology.org/2025.bucc-1.3.pdf