Hanan Aldarmaki - ACL Anthology

Hanan Aldarmaki

2026

JEEM: Vision-Language Understanding in Four Arabic Dialects
Karima Kadaoui | Hanin Atwany | Hamdan Al-Ali | Abdelrahman Mohamed | Ali Mekky | Sergei Tilga | Natalia Fedorova | Ekaterina Artemova | Hanan Aldarmaki | Yova Kementchedjhieva
Findings of the Association for Computational Linguistics: EACL 2026

We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, The Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4o, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4o ranks best in this comparison, the model’s linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally-diverse evaluation paradigms.

2025

Dialectal Coverage And Generalization in Arabic Speech Recognition
Amirbek Djanibekov | Hawau Olamide Toyin | Raghad Alshalan | Abdullah Alatir | Hanan Aldarmaki
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Developing robust automatic speech recognition (ASR) systems for Arabic requires effective strategies to manage its diversity. Existing ASR systems mainly cover the modern standard Arabic (MSA) variety and a few high-resource dialects, but fall short in coverage and generalization across the multitude of spoken variants. Code-switching with English and French is also common in different regions of the Arab world, which challenges the performance of monolingual Arabic models. In this work, we introduce a suite of ASR models optimized to effectively recognize multiple variants of spoken Arabic, including MSA, various dialects, and code-switching. We provide open-source pre-trained models that cover data from 17 Arabic-speaking countries, and fine-tuned MSA and dialectal ASR models that include at least 11 variants, as well as multi-lingual ASR models covering embedded languages in code-switched utterances. We evaluate ASR performance across these spoken varieties and demonstrate both coverage and performance gains compared to prior models.

Voice of a Continent: Mapping Africa’s Speech Technology Frontier
AbdelRahim A. Elmadany | Sang Yun Kwon | Hawau Olamide Toyin | Alcides Alcoba Inciarte | Hanan Aldarmaki | Muhammad Abdul-Mageed
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Africa’s rich linguistic diversity remains significantly underrepresented in speech technologies, creating barriers to digital inclusion. To alleviate this challenge, we systematically map the continent’s speech space of datasets and technologies, leading to a new comprehensive benchmark SimbaBench for downstream African speech tasks. Using SimbaBench, we introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks. Our benchmark analysis reveals critical patterns in resource availability, while our model evaluation demonstrates how dataset quality, domain diversity, and language family relationships influence performance across languages. Our work highlights the need for expanded speech technology resources that better reflect Africa’s linguistic diversity and provides a solid foundation for future research and development efforts toward more inclusive speech technologies.

SparQLe: Speech Queries to Text Translation Through LLMs
Amirbek Djanibekov | Hanan Aldarmaki
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

With the growing influence of Large Language Models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multi-modal processing and speech understanding. This study introduces a novel approach that combines self-supervised speech representations with instruction-tuned LLMs for speech-to-text translation. The proposed approach leverages a modality adapter to align extracted speech features with instruction-tuned LLMs using English speech data. Our experiments demonstrate that this method effectively preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, offering a promising approach for various speech understanding applications.

NADI 2025: The First Multidialectal Arabic Speech Processing Shared Task
Bashar Talafha | Hawau Olamide Toyin | Peter Sullivan | AbdelRahim A. Elmadany | Abdurrahman Juma | Amirbek Djanibekov | Chiyu Zhang | Hamad Alshehhi | Hanan Aldarmaki | Mustafa Jarrar | Nizar Habash | Muhammad Abdul-Mageed
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

We present the findings of the sixth Nuanced Arabic Dialect Identification (NADI 2025) Shared Task, which focused on Arabic speech dialect processing across three subtasks: spoken dialect identification (Subtask 1), speech recognition (Subtask 2), and diacritic restoration for spoken dialects (Subtask 3). A total of 44 teams registered, and during the testing phase, 100 valid submissions were received from eight unique teams. The distribution was as follows: 34 submissions for Subtask 1 (five teams), 47 submissions for Subtask 2 (six teams), and 19 submissions for Subtask 3 (two teams). The best-performing systems achieved 79.8% accuracy on Subtask 1, 35.68/12.20 WER/CER (overall average) on Subtask 2, and 55/13 WER/CER on Subtask 3. These results highlight the ongoing challenges of Arabic dialect speech processing, particularly in dialect identification, recognition, and diacritic restoration. We also summarize the methods adopted by participating teams and briefly outline directions for future editions of NADI.

SPIRIT: Patching Speech Language Models against Jailbreak Attacks
Amirbek Djanibekov | Nurdaulet Mukhituly | Kentaro Inui | Hanan Aldarmaki | Nils Lukas
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Speech Language Models (SLMs) enable natural interactions via spoken instructions, which more effectively capture user intent by detecting nuances in speech. The richer speech signal introduces new security risks compared to text-based models, as adversaries can better bypass safety mechanisms by injecting imperceptible noise into speech. We analyze adversarial attacks under white-box access and find that SLMs are substantially more vulnerable to jailbreak attacks, which can achieve a perfect 100% attack success rate in some instances. To improve security, we propose post-hoc patching defenses that intervene during inference by modifying the SLM’s activations, improving robustness up to 99% with (i) negligible impact on utility and (ii) no re-training. We conduct ablation studies to maximize the efficacy of our defenses and improve the utility/security trade-off, validated with large-scale benchmarks unique to SLMs.

Uncertainty-driven Partial Diacritization for Arabic Text
Humaid Ali Alblooshi | Artem Shelmanov | Hanan Aldarmaki
Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)

We present an uncertainty-based approach to Partial Diacritization (PD) for Arabic text. We evaluate three uncertainty metrics for this task: Softmax Response, BALD via MC-dropout, and Mahalanobis Distance. We further introduce a lightweight Confident Error Regularizer to improve model calibration. Our preliminary exploration illustrates possible ways to use uncertainty estimation for selectively retaining or discarding diacritics in Arabic text with an analysis of performance in terms of correlation with diacritic error rates. For instance, the model can be used to detect words with high diacritic error rates which tend to have higher uncertainty scores at inference time. On the Tashkeela dataset, the method maintains low Diacritic Error Rate while reducing the amount of visible diacritics on the text by up to 50% with thresholding-based retention.
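As a rough illustration of thresholding-based retention with the Softmax Response metric, the sketch below keeps a predicted diacritic only when the model's top class probability reaches a threshold. The function names, the input format (one probability dictionary per letter), and the choice to retain confident predictions and hide uncertain ones are assumptions made for this example; the abstract does not specify them, and this is not the paper's implementation.

```python
def softmax_response(class_probs):
    """Confidence score: the probability of the most likely class."""
    return max(class_probs.values())


def partially_diacritize(letters, predictions, threshold):
    """Attach a diacritic to a letter only when the prediction is confident.

    letters:     string of undiacritized letters
    predictions: one dict per letter mapping a diacritic ("" = none) to its
                 predicted probability (hypothetical model output format)
    threshold:   minimum Softmax Response needed to display the diacritic
    """
    output = []
    for letter, class_probs in zip(letters, predictions):
        output.append(letter)
        best_mark = max(class_probs, key=class_probs.get)
        if best_mark and softmax_response(class_probs) >= threshold:
            output.append(best_mark)
    return "".join(output)


FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
letters = "\u0643\u062a\u0628"  # kaf, ta, ba
predictions = [
    {FATHA: 0.95, DAMMA: 0.05},  # confident: diacritic shown
    {FATHA: 0.55, KASRA: 0.45},  # uncertain: diacritic hidden
    {FATHA: 0.90, "": 0.10},     # confident: diacritic shown
]
partial = partially_diacritize(letters, predictions, threshold=0.8)
```

Raising the threshold hides more diacritics, so in a setup like this the threshold is the knob that trades visible diacritics against the risk of displaying a wrong one.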

Exploring Transliteration-Based Zero-Shot Transfer for Amharic ASR
Hellina Hailu Nigatu | Hanan Aldarmaki
Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)

The performance of Automatic Speech Recognition (ASR) depends on the availability of transcribed speech datasets, which are often scarce or non-existent for many of the world’s languages. This study investigates alternative strategies to bridge the data gap using zero-shot cross-lingual transfer, leveraging transliteration as a method to utilize data from other languages. We experiment with transliteration from various source languages and demonstrate ASR performance in a low-resourced language, Amharic. We find that source data that align with the character distribution of the test data achieve the best performance, regardless of language family. We also experiment with fine-tuning with minimal transcribed data in the target language. Our findings demonstrate that transliteration, particularly when combined with a strategic choice of source languages, is a viable approach for improving ASR in zero-shot and low-resourced settings.

2024

STTATTS: Unified Speech-To-Text And Text-To-Speech Model
Hawau Olamide Toyin | Hao Li | Hanan Aldarmaki
Findings of the Association for Computational Linguistics: EMNLP 2024

Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models while significantly saving computational and memory costs (~50% reduction in the total number of parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-resource language due to shortage of TTS data. Our models are trained with publicly available data, and both the training code and model checkpoints are openly available for further research.

PALM: Few-Shot Prompt Learning for Audio Language Models
Asif Hanif | Maha Tufail Agro | Mohammad Areeb Qazi | Hanan Aldarmaki
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks, which match features of audio waveforms with class-specific text prompt features, inspired by advancements in Vision-Language Models (VLMs). Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is either on par with or outperforms other approaches while being computationally less demanding. Our code is publicly available at https://asif-hanif.github.io/palm/.

PolyWER: A Holistic Evaluation Framework for Code-Switched Speech Recognition
Karima Kadaoui | Maryam Al Ali | Hawau Olamide Toyin | Ibrahim Mohammed | Hanan Aldarmaki
Findings of the Association for Computational Linguistics: EMNLP 2024

Code-switching in speech, particularly between languages that use different scripts, can potentially be correctly transcribed in various forms, including different ways of transliteration of the embedded language into the matrix language script. Traditional methods for measuring accuracy, such as Word Error Rate (WER), are too strict to address this challenge. In this paper, we introduce PolyWER, a proposed framework for evaluating how speech recognition systems handle language-mixing. PolyWER accepts transcriptions of code-mixed segments in different forms, including transliterations and translations. We demonstrate the algorithm’s use cases through detailed examples, and evaluate it against human judgement. To enable the use of this metric, we extended the annotations of a publicly available Arabic-English code-switched dataset with transliterations and translations of code-mixed speech. We also utilize these additional annotations for fine-tuning ASR models and compare their performance using PolyWER. In addition to our main finding on PolyWER’s effectiveness, our experiments show that alternative annotations could be more effective for fine-tuning monolingual ASR models.
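To make the strictness of standard WER concrete, the sketch below computes word-level WER via edit distance and contrasts it with a relaxed variant in which each reference position lists several acceptable forms (original script, transliteration, translation). This is a simplified illustration and not the published PolyWER algorithm: the function names, the per-word sets of alternatives, and the one-word-per-alternative restriction are all assumptions made for this example.

```python
def edit_distance(reference, hypothesis, matches):
    """Word-level Levenshtein distance with a pluggable match test."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = 0 if matches(ref_item, hyp_word) else 1
            current.append(min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + substitution,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference_words, hypothesis_words):
    """Standard WER: only exact word matches count as correct."""
    errors = edit_distance(reference_words, hypothesis_words,
                           lambda ref, hyp: ref == hyp)
    return errors / len(reference_words)


def relaxed_wer(reference_alternatives, hypothesis_words):
    """Relaxed WER: a hypothesis word is correct if it equals any listed form."""
    errors = edit_distance(reference_alternatives, hypothesis_words,
                           lambda forms, hyp: hyp in forms)
    return errors / len(reference_alternatives)


# The hypothesis transliterates the English word "meeting" into Arabic script.
reference = ["أنا", "في", "meeting"]
hypothesis = ["أنا", "في", "ميتنج"]
alternatives = [{"أنا"}, {"في"}, {"meeting", "ميتنج", "اجتماع"}]

strict_score = wer(reference, hypothesis)
relaxed_score = relaxed_wer(alternatives, hypothesis)
```

In this toy case the strict score charges one error out of three words for what is arguably a correct transcription, while the relaxed score charges none. Alternatives that span several words (a common case for translations) would need segment-level alignment, which this sketch does not attempt.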

Data Augmentation for Speech-Based Diacritic Restoration
Sara Shatnawi | Sawsan Alqahtani | Shady Shehata | Hanan Aldarmaki
Proceedings of the Second Arabic Natural Language Processing Conference

This paper describes a data augmentation technique for boosting the performance of speech-based diacritic restoration. Our experiments demonstrate the utility of this approach, resulting in improved generalization of all models across different test sets. In addition, we describe the first multi-modal diacritic restoration model, utilizing both speech and text as input modalities. This type of model can be used to diacritize speech transcripts. Unlike previous work that relies on an external ASR model, the proposed model is far more compact and efficient. While the multi-modal framework does not surpass the ASR-based model for this task, it offers a promising approach for improving the efficiency of speech-based diacritization, with a potential for improvement using data augmentation and other methods.

Towards Automated Game-Based Early Screening for Language Disorder
Hamdan Hamid Al-Ali | Elsa Soares | Goncalo Leal | Rita Valente | Nicole Agrela | Alexandra Marquis | Hanan Aldarmaki
Proceedings of the 2024 CLASP Conference on Multimodality and Interaction in Language Learning

This paper examines the potential of gamifying early childhood language disorder screening to make the process more accessible and scalable. We provide an overview of current practices in screening and assessment, and a description of our on-going work towards automation of early screening. By integrating developmental milestones into a video game format and employing automatic speech recognition and natural language processing, this approach aims to enhance the efficiency and reach of early screening in order to identify children who need further professional assessment.

Mixat: A Data Set of Bilingual Emirati-English Speech
Maryam Khalifa Al Ali | Hanan Aldarmaki
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

This paper introduces Mixat: a dataset of Emirati speech code-mixed with English. Mixat was developed to address the shortcomings of current speech recognition resources when applied to Emirati speech, and in particular, to bilingual Emirati speakers who often mix and switch between their local dialect and English. The data set consists of 15 hours of speech derived from two public podcasts featuring native Emirati speakers, one of which is in the form of conversations between the host and a guest. Therefore, the collection contains examples of Emirati-English code-switching in both formal and natural conversational contexts. In this paper, we describe the process of data collection and annotation, and describe some of the features and statistics of the resulting data set. In addition, we evaluate the performance of pre-trained Arabic and multi-lingual ASR systems on our dataset, demonstrating the shortcomings of existing models on this low-resource dialectal Arabic, and the additional challenge of recognizing code-switching in ASR. The dataset will be made publicly available for research use.

Automatic Restoration of Diacritics for Speech Data Sets
Sara Shatnawi | Sawsan Alqahtani | Hanan Aldarmaki
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Automatic text-based diacritic restoration models generally have high diacritic error rates when applied to speech transcripts as a result of domain and style shifts in spoken language. In this work, we explore the possibility of improving the performance of automatic diacritic restoration when applied to speech data by utilizing parallel spoken utterances. In particular, we use the pre-trained Whisper ASR model fine-tuned on relatively small amounts of diacritized Arabic speech data to produce rough diacritized transcripts for the speech utterances, which we then use as an additional input for diacritic restoration models. The proposed framework consistently improves diacritic restoration performance compared to text-only baselines. Our results highlight the inadequacy of current text-based diacritic restoration models for speech data sets and provide a new baseline for speech-based diacritic restoration.

2023

Handling Realistic Label Noise in BERT Text Classification
Maha Tufail Agro | Hanan Aldarmaki
Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023)

Yet Another Model for Arabic Dialect Identification
Ajinkya Kulkarni | Hanan Aldarmaki
Proceedings of ArabicNLP 2023

In this paper, we describe a spoken Arabic dialect identification (ADI) model that consistently outperforms previously published results on two benchmark datasets: ADI-5 and ADI-17. We explore two architectural variations: ResNet and ECAPA-TDNN, coupled with two types of acoustic features: MFCCs and features extracted from the pre-trained self-supervised model UniSpeech-SAT Large, as well as a fusion of all four variants. We find that individually, the ECAPA-TDNN network outperforms ResNet, and models with UniSpeech-SAT features outperform models with MFCCs by a large margin. Furthermore, a fusion of all four variants consistently outperforms individual models. Our best models outperform previously reported results on both datasets, with accuracies of 84.7% and 96.9% on ADI-5 and ADI-17, respectively.

ArTST: Arabic Text and Speech Transformer
Hawau Olamide Toyin | Amirbek Djanibekov | Ajinkya Kulkarni | Hanan Aldarmaki
Proceedings of ArabicNLP 2023

We present ArTST, a pre-trained Arabic text and speech transformer for supporting open-source speech technologies for the Arabic language. The model architecture follows the unified-modal framework, SpeechT5, that was recently released for English, and is focused on Modern Standard Arabic (MSA), with plans to extend the model for dialectal and code-switched Arabic in future editions. We pre-trained the model from scratch on MSA speech and text data, and fine-tuned it for the following tasks: Automatic Speech Recognition (ASR), Text-To-Speech synthesis (TTS), and spoken dialect identification. In our experiments comparing ArTST with SpeechT5, as well as with previously reported results in these tasks, ArTST performs on par with or exceeds the current state-of-the-art in all three tasks. Moreover, we find that our pre-training is conducive to generalization, which is particularly evident in the low-resource TTS task. The pre-trained model as well as the fine-tuned ASR and TTS models are released for research use.

2022

Supervised Acoustic Embeddings And Their Transferability Across Languages
Sreepratha Ram | Hanan Aldarmaki
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)

2019

Context-Aware Cross-Lingual Mapping
Hanan Aldarmaki | Mona Diab
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Cross-lingual word vectors are typically obtained by fitting an orthogonal matrix that maps the entries of a bilingual dictionary from a source to a target vector space. Word vectors, however, are most commonly used for sentence or document-level representations that are calculated as the weighted average of word embeddings. In this paper, we propose an alternative to word-level mapping that better reflects sentence-level cross-lingual similarity. We incorporate context in the transformation matrix by directly mapping the averaged embeddings of aligned sentences in a parallel corpus. We also implement cross-lingual mapping of deep contextualized word embeddings using parallel sentences with word alignments. In our experiments, both approaches resulted in cross-lingual sentence embeddings that outperformed context-independent word mapping in sentence translation retrieval. Furthermore, the sentence-level transformation could be used for word-level mapping without loss in word translation quality.

Scalable Cross-Lingual Transfer of Neural Sentence Embeddings
Hanan Aldarmaki | Mona Diab
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

We develop and investigate several cross-lingual alignment approaches for neural sentence embedding models, such as the supervised inference classifier, InferSent, and sequential encoder-decoder models. We evaluate three alignment frameworks applied to these models: joint modeling, representation transfer learning, and sentence mapping, using parallel text to guide the alignment. Our results support representation transfer as a scalable approach for modular cross-lingual alignment of neural sentence embeddings, where we observe better performance compared to joint models in intrinsic and extrinsic evaluations, particularly with smaller sets of parallel data.

Homograph Disambiguation through Selective Diacritic Restoration
Sawsan Alqahtani | Hanan Aldarmaki | Mona Diab
Proceedings of the Fourth Arabic Natural Language Processing Workshop

Lexical ambiguity, a challenging phenomenon in all natural languages, is particularly prevalent for languages with diacritics that tend to be omitted in writing, such as Arabic. Omitting diacritics leads to an increase in the number of homographs: different words with the same spelling. Diacritic restoration could theoretically help disambiguate these words, but in practice, the increase in overall sparsity leads to performance degradation in NLP applications. In this paper, we propose approaches for automatically marking a subset of words for diacritic restoration, which leads to selective homograph disambiguation. Compared to full or no diacritic restoration, these approaches yield selectively-diacritized datasets that balance sparsity and lexical disambiguation. We evaluate the various selection strategies extrinsically on several downstream applications: neural machine translation, part-of-speech tagging, and semantic textual similarity. Our experiments on Arabic show promising results, where our devised strategies on selective diacritization lead to a more balanced and consistent performance in downstream applications.

Efficient Sentence Embedding using Discrete Cosine Transform
Nada Almarwani | Hanan Aldarmaki | Mona Diab
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Vector averaging remains one of the most popular sentence embedding methods in spite of its obvious disregard for syntactic structure. While more complex sequential or convolutional networks potentially yield superior classification performance, the improvements in classification accuracy are typically mediocre compared to the simple vector averaging. As an efficient alternative, we propose the use of discrete cosine transform (DCT) to compress word sequences in an order-preserving manner. The lower order DCT coefficients represent the overall feature patterns in sentences, which results in suitable embeddings for tasks that could benefit from syntactic features. Our results in semantic probing tasks demonstrate that DCT embeddings indeed preserve more syntactic information compared with vector averaging. With practically equivalent complexity, the model yields better overall performance in downstream classification tasks that correlate with syntactic features, which illustrates the capacity of DCT to preserve word order information.
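The idea of compressing a word sequence with DCT can be sketched as follows: apply a DCT along the word axis separately for each embedding dimension, keep the first few coefficients, and concatenate them. This is a minimal pure-Python sketch under stated assumptions, not the authors' code: the orthonormal DCT-II variant, zero-padding for sentences shorter than the number of kept coefficients, and the names `dct_ii` and `dct_sentence_embedding` are choices made for this example.

```python
import math


def dct_ii(signal):
    """Orthonormal DCT-II of a list of floats."""
    n = len(signal)
    coefficients = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coefficients.append(scale * total)
    return coefficients


def dct_sentence_embedding(word_vectors, num_coeffs):
    """Keep the first num_coeffs DCT coefficients of every embedding dimension.

    word_vectors: list of equal-length word embeddings, in sentence order
    Returns a flat vector of size num_coeffs * embedding_dim, ordered as
    [all dimensions of coefficient 0, all dimensions of coefficient 1, ...].
    """
    dim = len(word_vectors[0])
    per_dimension = []
    for d in range(dim):
        column = [vector[d] for vector in word_vectors]
        kept = dct_ii(column)[:num_coeffs]
        kept += [0.0] * (num_coeffs - len(kept))  # pad short sentences
        per_dimension.append(kept)
    return [per_dimension[d][k] for k in range(num_coeffs) for d in range(dim)]


# Same words, opposite order: toy 2-dimensional embeddings.
sentence = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
forward = dct_sentence_embedding(sentence, num_coeffs=2)
backward = dct_sentence_embedding(sentence[::-1], num_coeffs=2)
```

The zeroth coefficient is a scaled sum of the word vectors, so it carries the same information as vector averaging and is identical for both word orders. The higher coefficients differ between `forward` and `backward`, which is where the order information comes from.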

2018

Unsupervised Word Mapping Using Structural Similarities in Monolingual Embeddings
Hanan Aldarmaki | Mahesh Mohan | Mona Diab
Transactions of the Association for Computational Linguistics, Volume 6

Most existing methods for automatic bilingual dictionary induction rely on prior alignments between the source and target languages, such as parallel corpora or seed dictionaries. For many language pairs, such supervised alignments are not readily available. We propose an unsupervised approach for learning a bilingual dictionary for a pair of languages given their independently-learned monolingual word embeddings. The proposed method exploits local and global structures in monolingual vector spaces to align them such that similar words are mapped to each other. We show empirically that the performance of bilingual correspondents that are learned using our proposed unsupervised method is comparable to that of using supervised bilingual correspondents from a seed dictionary.

Evaluation of Unsupervised Compositional Representations
Hanan Aldarmaki | Mona Diab
Proceedings of the 27th International Conference on Computational Linguistics

We evaluated various compositional models, from bag-of-words representations to compositional RNN-based models, on several extrinsic supervised and unsupervised evaluation benchmarks. Our results confirm that weighted vector averaging can outperform context-sensitive models in most benchmarks, but structural features encoded in RNN models can also be useful in certain classification tasks. We analyzed some of the evaluation datasets to identify the aspects of meaning they measure and the characteristics of the various models that explain their performance variance.

2016

Learning Cross-lingual Representations with Matrix Factorization
Hanan Aldarmaki | Mona Diab
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

GWU NLP at SemEval-2016 Shared Task 1: Matrix Factorization for Crosslingual STS
Hanan Aldarmaki | Mona Diab
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Robust Part-of-speech Tagging of Arabic Text
Hanan Aldarmaki | Mona Diab
Proceedings of the Second Workshop on Arabic Natural Language Processing
