Riad Souissi
2026
Improving LLM Domain Certification with Pretrained Guide Models
Jiaqian Zhang | Zhaozhi Qian | Faroq AL-Tam | Ignacio Iacobacci | Muhammad AL-Qurishi | Riad Souissi
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) often generate off-domain or harmful responses when deployed in specialized, high-stakes domains, motivating the need for rigorous LLM domain certification. While the VALID algorithm (Emde et al., 2025) achieves formal domain-certification guarantees using a guide model G trained from scratch on in-domain data, it suffers from poor generalization due to limited training. In this work, we propose PRISM, a novel approach that overcomes this key limitation by leveraging pretrained language models as guide models, enhanced via contrastive fine-tuning to sharply distinguish acceptable from refused content. We explore and experiment with variants of PRISM that use different loss functions, ensuring the model exploits the rich world knowledge of pretrained models while remaining aligned to the target domain. We show that two variants of PRISM, PRISM-BC and PRISM-GA, achieve superior out-of-domain (OOD) rejection and tighter certification bounds across eight diverse data regimes and perturbations, establishing a more reliable approach to domain-adherent LLM deployment.
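As a rough illustration of the contrastive fine-tuning idea described above (not the paper's released code or its PRISM-BC/PRISM-GA objectives, which are not specified in the abstract), the sketch below fine-tunes a small pretrained causal LM so that it assigns higher likelihood to in-domain text than to refused text. The margin objective, the choice of GPT-2, and the toy data are all placeholder assumptions.

```python
# Hypothetical sketch: contrastive fine-tuning of a pretrained guide model.
# The margin loss and GPT-2 backbone are illustrative stand-ins, not PRISM itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
guide = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(guide.parameters(), lr=1e-5)

def avg_log_likelihood(texts):
    """Mean per-token log-likelihood the guide assigns to each text."""
    enc = tok(texts, return_tensors="pt", padding=True)
    logits = guide(**enc).logits[:, :-1, :]          # next-token predictions
    targets = enc.input_ids[:, 1:]
    mask = enc.attention_mask[:, 1:].float()         # ignore padding
    logp = torch.log_softmax(logits, dim=-1)
    token_ll = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_ll * mask).sum(dim=1) / mask.sum(dim=1)

# toy examples; real training would use curated in-domain and refused corpora
in_domain = ["Common symptoms of hypertension include headaches."]
refused = ["Here is how to bypass a building's alarm system."]

opt.zero_grad()
margin = 2.0
# push in-domain likelihood above refused likelihood by at least the margin
loss = torch.relu(margin - (avg_log_likelihood(in_domain).mean()
                            - avg_log_likelihood(refused).mean()))
loss.backward()
opt.step()
```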
2025
Hildoc: Leveraging Hilbert Curve Representation for Accurate and Efficient Document Retrieval
Muhammad AL-Qurishi | Zhaozhi Qian | Faroq AL-Tam | Riad Souissi
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Document retrieval is a critical challenge in information retrieval systems, where the goal is to efficiently retrieve relevant documents in response to a given query. Dense retrieval methods, which utilize vector embeddings to represent semantic information, require effective indexing to ensure fast and accurate retrieval. Existing methods, such as MEVI, have attempted to address this by using hierarchical K-Means for clustering, but they often face limitations in computational efficiency and retrieval accuracy. In this paper, we introduce the Hildoc Index, a novel document indexing approach that leverages the Hilbert Curve to map document embeddings onto a one-dimensional space. This representation facilitates efficient clustering using a 1D quantile-based algorithm, ensuring uniform partition sizes and preserving the inherent structure of the data. As a result, the Hildoc Index not only reduces training complexity but also enhances retrieval accuracy and speed during inference. Our method can be seamlessly integrated into both dense retrieval systems and hybrid ensemble systems. Through comprehensive experiments on standard benchmarks such as MSMARCO Passage and Natural Questions, we demonstrate that the Hildoc Index significantly outperforms the current state-of-the-art MEVI in terms of both retrieval speed and recall. These results establish the Hildoc Index as an effective solution for fast and accurate dense document retrieval.
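The sketch below illustrates the indexing idea as stated in the abstract: embeddings are mapped to positions along a Hilbert curve, and the resulting 1D keys are cut at quantiles so every partition holds roughly the same number of documents. The grid resolution, toy dimensionality, and use of the `hilbertcurve` package are assumptions; the paper's actual pipeline may differ.

```python
# Hypothetical sketch: Hilbert-curve keys + quantile partitioning.
import numpy as np
from hilbertcurve.hilbertcurve import HilbertCurve

rng = np.random.default_rng(0)
emb = rng.normal(size=(10_000, 8))  # toy 8-d embeddings; real ones are higher-d

# quantize each dimension onto a 2^p integer grid, as the curve requires
p, n = 6, emb.shape[1]
lo, hi = emb.min(axis=0), emb.max(axis=0)
grid = np.floor((emb - lo) / (hi - lo + 1e-9) * (2**p - 1)).astype(int)

# 1D Hilbert keys: nearby keys tend to correspond to nearby embeddings
hc = HilbertCurve(p, n)
keys = np.array(hc.distances_from_points(grid.tolist()))

# quantile cuts yield uniform partition sizes, unlike k-means cluster sizes
n_parts = 64
cuts = np.quantile(keys, np.linspace(0, 1, n_parts + 1)[1:-1])
part_id = np.searchsorted(cuts, keys)  # partition index for each document
```

At query time, an index of this shape would map the query embedding to its Hilbert key and probe only the few partitions whose key ranges surround it, which is where the speed advantage over exhaustive or hierarchical clustering search would come from.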
2022
AraLegal-BERT: A pretrained language model for Arabic Legal text
Muhammad Al-qurishi | Sarah Alqaseemi | Riad Souissi
Proceedings of the Natural Legal Language Processing Workshop 2022
The effectiveness of the bidirectional encoder representations from transformers (BERT) model for multiple linguistic tasks is well documented. However, its potential for a narrow and specific domain, such as legal, has not been fully explored. In this study, we examine the use of BERT in the Arabic legal domain, training the model from scratch on domain-relevant datasets and customizing it for several downstream tasks. We introduce AraLegal-BERT, a bidirectional encoder transformer-based model that has been thoroughly tested and carefully optimized with the goal of amplifying the impact of natural language processing-driven solutions on jurisprudence, legal documents, and legal practice. We fine-tuned AraLegal-BERT and evaluated it against three BERT variants for the Arabic language on three natural language understanding tasks. The results showed that the base version of AraLegal-BERT achieved better accuracy than the general-purpose and original BERT models on legal texts.
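For readers unfamiliar with the fine-tuning step the abstract refers to, the sketch below shows a generic task-specific fine-tuning run of the kind described. The checkpoint (a public Arabic BERT used as a stand-in, since AraLegal-BERT's hub name is not given here), the toy data, and the hyperparameters are all placeholders.

```python
# Hypothetical sketch: fine-tuning an Arabic BERT variant for a legal
# text-classification task. Checkpoint, data, and settings are stand-ins.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import Dataset

checkpoint = "aubmindlab/bert-base-arabertv2"  # public Arabic BERT; placeholder
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=2)

# toy examples; a real run would use labeled Arabic legal documents
data = Dataset.from_dict({
    "text": ["عقد بيع بين طرفين", "حكم قضائي في نزاع تجاري"],
    "label": [0, 1],
}).map(lambda ex: tok(ex["text"], truncation=True,
                      padding="max_length", max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aralegal-ft", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=data,
)
trainer.train()
```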