Youssef Mellah


2023

pdf bib
Automated De-Identification of Arabic Medical Records
Veysel Kocaman | Youssef Mellah | Hasham Haq | David Talby
Proceedings of ArabicNLP 2023

As Electronic Health Records (EHR) become ubiquitous in healthcare systems worldwide, including in Arabic-speaking countries, the dual imperative of safeguarding patient privacy and leveraging data for research and quality improvement grows. This paper presents a first-of-its-kind automated de-identification pipeline for medical text specifically tailored for the Arabic language. This includes accurate medical Named Entity Recognition (NER) for identifying personal information; data obfuscation models to replace sensitive entities with fake entities; and an implementation that natively scales to large datasets on commodity clusters. This research makes two contributions. First, we adapt two existing NER architectures— BERT For Token Classification (BFTC) and BiLSTM-CNN-Char – to accommodate the unique syntactic and morphological characteristics of the Arabic language. Comparative analysis suggests that BFTC models outperform Bi-LSTM models, achieving higher F1 scores for both identifying and redacting personally identifiable information (PII) from Arabic medical texts. Second, we augment the deep learning models with a contextual parser engine to handle commonly missed entities. Experiments show that the combined pipeline demonstrates superior performance with micro F1 scores ranging from 0.94 to 0.98 on the test dataset, which is a translated version of the i2b2 2014 de-identification challenge, across 17 sensitive entities. This level of accuracy is in line with that achieved with manual de-identification by domain experts, suggesting that a fully automated and scalable process is now viable.

2022

pdf bib
LARSA22 at Qur’an QA 2022: Text-to-Text Transformer for Finding Answers to Questions from Qur’an
Youssef Mellah | Ibtissam Touahri | Zakaria Kaddari | Zakaria Haja | Jamal Berrich | Toumi Bouchentouf
Proceedinsg of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

Question Answering (QA) is one of the main foсuses of Natural Language Proсessing (NLP) researсh. However, Arabiс Question Answering is still not within reaсh. The сhallenges of the Arabiс language and the laсk of resourсes have made it diffiсult to provide powerful Arabiс QA systems with high aссuraсy. While low aссuraсy may be aссepted for general purpose systems, it is сritiсal in some fields suсh as religious affairs. Therefore, there is a need for speсialized aссurate systems that target these сritiсal fields. In this paper, we propose a Transformer-based QA system using the mT5 Language Model (LM). We finetuned the model on the Qur’aniс Reading Сomprehension Dataset (QRСD) whiсh was provided in the сontext of the Qur’an QA 2022 shared task. The QRСD dataset сonsists of question-passage pairs as input, and the сorresponding adequate answers provided by expert annotators as output. Evaluation results on the same DataSet show that our best model сan aсhieve 0.98 (F1 Sсore) on the Dev Set and 0.40 on the Test Set. We disсuss those results and сhallenges, then propose potential solutions for possible improvements. The sourсe сode is available on our repository.