Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval

Vera Pavlova

Abstract

This study examines the use of Natural Language Processing (NLP) technology in the Islamic domain, focusing on the development of an Islamic neural retrieval model. Starting from the XLM-R base model, the research applies a language reduction technique to create a lightweight bilingual large language model (LLM). Our domain adaptation approach addresses a challenge specific to the Islamic domain: substantial in-domain corpora exist only in Arabic, while corpora in other languages, including English, are limited. The work uses a multi-stage training process for retrieval models, combining large retrieval datasets, such as MS MARCO, with smaller in-domain datasets to improve retrieval performance. Additionally, we curated an in-domain retrieval dataset in English using data augmentation techniques and drawing on a reliable Islamic source. This enlarges the domain-specific retrieval dataset and leads to further performance gains. The findings suggest that combining domain adaptation with multi-stage training enables the bilingual Islamic neural retrieval model to outperform monolingual models on downstream retrieval tasks.

Anthology ID: 2025.clrel-1.4
Volume: Proceedings of the New Horizons in Computational Linguistics for Religious Texts
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Sane Yagi, Majdi Sawalha, Bayan Abu Shawar, Abdallah T. AlShdaifat, Norhan Abbas
Venues: CLRel | WS
Publisher: Association for Computational Linguistics
Pages: 42–52
URL: https://aclanthology.org/2025.clrel-1.4/
Cite (ACL): Vera Pavlova. 2025. Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval. In Proceedings of the New Horizons in Computational Linguistics for Religious Texts, pages 42–52, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval (Pavlova, CLRel 2025)
PDF: https://aclanthology.org/2025.clrel-1.4.pdf
