Building an Efficient Multilingual Non-Profit IR System for the Islamic Domain Leveraging Multiprocessing Design in Rust

Vera Pavlova, Mohammed Makhlouf


Abstract
The widespread use of large language models (LLMs) has dramatically improved many applications of Natural Language Processing (NLP), including Information Retrieval (IR). However, domains that are not driven by commercial interest often lag behind in benefiting from AI-powered solutions. One such area is religious and heritage corpora. Like other corpora of this kind, Islamic literature holds significant cultural value and is regularly used by scholars and the general public. Navigating this large body of text is challenging, and there is currently no unified resource that allows this data to be searched easily with advanced AI tools. This work focuses on the development of a multilingual non-profit IR system for the Islamic domain. The task poses several major challenges: preparing multilingual domain-specific corpora when data is limited in certain languages, deploying a model on resource-constrained devices, and enabling fast search on a limited budget. Continued pre-training for domain adaptation, combined with language reduction to decrease model size, yields a lightweight multilingual retrieval model that outperforms larger models pre-trained on general-domain data. Furthermore, an evaluation of the proposed architecture, which builds on the capabilities of the Rust language, shows that efficient semantic search is feasible in a low-resource setting.
Anthology ID:
2024.emnlp-industry.73
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
November
Year:
2024
Address:
Miami, Florida, US
Editors:
Franck Dernoncourt, Daniel Preoţiuc-Pietro, Anastasia Shimorina
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
981–990
URL:
https://aclanthology.org/2024.emnlp-industry.73
Cite (ACL):
Vera Pavlova and Mohammed Makhlouf. 2024. Building an Efficient Multilingual Non-Profit IR System for the Islamic Domain Leveraging Multiprocessing Design in Rust. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 981–990, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal):
Building an Efficient Multilingual Non-Profit IR System for the Islamic Domain Leveraging Multiprocessing Design in Rust (Pavlova & Makhlouf, EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-industry.73.pdf
Poster:
2024.emnlp-industry.73.poster.pdf
Presentation:
2024.emnlp-industry.73.presentation.pdf
Video:
2024.emnlp-industry.73.video.mp4