Ahmed Ali - ACL Anthology

Ahmed Ali

2026

The Complementary Role of Para-linguistic cues for Robust Pronunciation Assessment
Yassine El Kheir | Shammur Absar Chowdhury | Ahmed Ali
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology

Research on pronunciation assessment systems focuses on utilizing phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within the para-linguistic cues. In this study, we proposed a novel pronunciation assessment framework, IntraVerbalPA.[The source code will be available to the public upon acceptance.] The framework innovatively incorporates both fine-grained frame- and abstract utterance-level para-linguistic cues, alongside the raw speech and phoneme representations. Additionally, we introduce the “Goodness of phonemic-duration” metric to model phoneme duration distribution within the framework effectively. Our results validate the effectiveness of the proposed IntraVerbalPA framework and its individual components, yielding performance that matches or outperforms existing research works.

2025

DialG2P: Dialectal Grapheme-to-Phoneme. Arabic as a Case Study
Majd Hawasly | Hamdy Mubarak | Ahmed Abdelali | Ahmed Ali
Proceedings of The Third Arabic Natural Language Processing Conference

Grapheme-to-phoneme (G2P) models are essential components in text-to-speech (TTS) and pronunciation assessment applications. While standard forms of languages have gained attention in that regard, dialectal speech, which often serves as the primary means of spoken communication for many communities, as it is the case for Arabic, has not received the same level of focus. In this paper, we introduce an end-to-end dialectal G2P for Egyptian Arabic, a dialect without standard orthography. Our novel architecture accomplishes three tasks: (i) restores short vowels of the diacritical marks for the dialectal text; (ii) maps certain characters that happen only in the spoken version of the dialectal Arabic to their dialect-specific character transcriptions; and finally (iii) converts the previous step output to the corresponding phoneme sequence. We benchmark G2P on a modular cascaded system, a large language model, and our multi-task end-to-end architecture.

Octopus: Towards Building the Arabic Speech LLM Suite
Sara Althubaiti | Vasista Sai Lodagala | Tjad Clark | Yousseif Ahmed Elshahawy | Daniel Izham | Abdullah Alrajeh | Aljawahrah Bin Tamran | Ahmed Ali
Proceedings of The Third Arabic Natural Language Processing Conference

We present Octopus, a first family of modular speech-language models designed for Arabic-English ASR, dialect identification, and speech translation. Built on Whisper-V3 and enhanced with large language models like ALLaM, LLaMA, and DeepSeek, Octopus bridges speech and text through a lightweight projection layer and Q-Former. To broaden its scope beyond speech, Octopus integrates BEATs, a general-purpose audio encoder allowing it to understand both linguistic and acoustic events. Despite its simplicity, this dual-encoder design supports robust performance across multilingual and code-switched scenarios. We also introduce TinyOctopus, a distilled variant using smaller models (Distil-Whisper + LLaMA3-1B / DeepSeek-1.5B), achieving competitive results with just a fraction of the parameters. Fine-tuning on synthetic code-switched data further boosts its performance. Octopus demonstrates the power of compact, extensible architectures in Arabic-centric speech modeling and sets the stage for unified multilingual audio-language understanding.

Iqra’Eval: A Shared Task on Qur’anic Pronunciation Assessment
Yassine El Kheir | Amit Meghanani | Hawau Olamide Toyin | Nada Almarwani | Omnia Ibrahim | Yousseif Ahmed Elshahawy | Mostafa Shahin | Ahmed Ali
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

We present the findings of the first shared task on Qur’anic pronunciation assessment, which focuses on addressing the unique challenges of evaluating the precise pronunciation of Qur’anic recitation. To fill an existing research gap, the Iqra’Eval 2025 shared task introduces the first open benchmark for Mispronunciation Detection and Diagnosis (MDD) in Qur’anic recitation, using Modern Standard Arabic (MSA) reading of Qur’anic texts as its case study. The task provides a comprehensive evaluation framework with increasingly complex subtasks: error localization and detailed error diagnosis. Leveraging the recently developed QuranMB benchmark dataset along with auxiliary training resources, this shared task aims to stimulate research in an area of both linguistic and cultural significance while addressing computational challenges in pronunciation assessment.

AMCrawl: An Arabic Web-Scale Dataset of Interleaved Image-Text Documents and Image-Text Pairs
Shahad Aboukozzana | Muhammad Kamran J Khan | Ahmed Ali
Proceedings of The Third Arabic Natural Language Processing Conference

In this paper, we present the Arabic Multimodal Crawl (AMCrawl), the first native-based Arabic multimodal dataset to our knowledge, derived from the Common Crawl corpus and rigorously filtered for quality and safety. Image-text pair datasets are the standard choice for pretraining multimodal large language models. However, they are often derived from image alt-text metadata, which is typically brief and context-poor, disconnecting images from their broader meaning. Although significant advances have been made in building interleaved image-text datasets for English, such as the OBELICS dataset, a substantial gap remains for native Arabic content. Our processing covered 8.6 million Arabic web pages, yielding 5.8 million associated images and 1.3 billion text tokens. The final dataset includes interleaved image-text documents and question-answer pairs, featuring 2.8 million high-quality interleaved documents and 5 million QA pairs. Alongside the dataset, we release the complete pipeline and code, ensuring reproducibility and encouraging further research and development. To demonstrate the effectiveness of AMCrawl, we introduce a publicly available native Arabic Vision Language model, trained with 13 billion parameters. These models achieve competitive results when benchmarked against publicly available datasets. AMCrawl bridges a critical gap in Arabic multimodal resources, providing a robust foundation for developing Arabic multimodal large language models and fostering advancements in this underrepresented area. Code: github.com/shahad-aboukozzana/AMCrawl

2024

Beyond Orthography: Automatic Recovery of Short Vowels and Dialectal Sounds in Arabic
Yassine El Kheir | Hamdy Mubarak | Ahmed Ali | Shammur Chowdhury
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper presents a novel Dialectal Sound and Vowelization Recovery framework, designed to recognize borrowed and dialectal sounds within phonologically diverse and dialect-rich languages, that extends beyond its standard orthographic sound sets. The proposed framework utilized quantized sequence of input with(out) continuous pretrained self-supervised representation. We show the efficacy of the pipeline using limited data for Arabic, a dialect-rich language containing more than 22 major dialects. Phonetically correct transcribed speech resources for dialectal Arabic is scare. Therefore, we introduce ArabVoice15, a first of its kind, curated test set featuring 5 hours of dialectal speech across 15 Arab countries, with phonetically accurate transcriptions, including borrowed and dialect-specific sounds. We described in detail the annotation guideline along with the analysis of the dialectal confusion pairs. Our extensive evaluation includes both subjective – human perception tests and objective measures. Our empirical results, reported with three test sets, show that with only one and half hours of training data, our model improve character error rate by ≈7% in ArabVoice15 compared to the baseline.

LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
Fahim Dalvi | Maram Hasanain | Sabri Boughorbel | Basel Mousi | Samir Abdaljalil | Nizi Nazar | Ahmed Abdelali | Shammur Absar Chowdhury | Hamdy Mubarak | Ahmed Ali | Majd Hawasly | Nadir Durrani | Firoj Alam
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available, their customization capabilities for specific tasks and datasets are often complex for different users. In this study, we introduce the LLMeBench framework, which can be seamlessly customized to evaluate LLMs for any NLP task, regardless of language. The framework features generic dataset loaders, several model providers, and pre-implements most standard evaluation metrics. It supports in-context learning with zero- and few-shot settings. A specific dataset and task can be evaluated for a given LLM in less than 20 lines of code while allowing full flexibility to extend the framework for custom datasets, models, or tasks. The framework has been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points. We open-sourced LLMeBench for the community (https://github.com/qcri/LLMeBench/) and a video demonstrating the framework is available online (https://youtu.be/9cC2m_abk3A).

Recent advancements in Large Language Models (LLMs) have significantly influenced the landscape of language and speech research. Despite this progress, these models lack specific benchmarking against state-of-the-art (SOTA) models tailored to particular languages and tasks. LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks, including sequence tagging and content classification across different domains. We utilized models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM, employing zero and few-shot learning techniques to tackle 33 distinct tasks across 61 publicly available datasets. This involved 98 experimental setups, encompassing ~296K data points, ~46 hours of speech, and 30 sentences for Text-to-Speech (TTS). This effort resulted in 330+ sets of experiments. Our analysis focused on measuring the performance gap between SOTA models and LLMs. The overarching trend observed was that SOTA models generally outperformed LLMs in zero-shot learning, with a few exceptions. Notably, larger computational models with few-shot learning techniques managed to reduce these performance gaps. Our findings provide valuable insights into the applicability of LLMs for Arabic NLP and speech processing tasks.

2023

Automatic Pronunciation Assessment - A Review
Yassine El Kheir | Ahmed Ali | Shammur Absar Chowdhury
Findings of the Association for Computational Linguistics: EMNLP 2023

Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic. We categorize the main challenges observed in prominent research trends, and highlight existing limitations, and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work.

2021

QASR: QCRI Aljazeera Speech Resource A Large Scale Annotated Arabic Speech Corpus
Hamdy Mubarak | Amir Hussein | Shammur Absar Chowdhury | Ahmed Ali
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, speaker information among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics- based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to QASR transcription, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR reports a competitive word error rate compared to the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks such as named entity recognition using speech transcript. We also report the first baseline for Arabic punctuation restoration. We make the corpus available for the research community.

2020

What Was Written vs. Who Read It: News Media Profiling Using Text Analysis and Social Media Context
Ramy Baly | Georgi Karadzhov | Jisun An | Haewoon Kwak | Yoan Dinkov | Ahmed Ali | James Glass | Preslav Nakov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Predicting the political bias and the factuality of reporting of entire news outlets are critical elements of media profiling, which is an understudied but an increasingly important research direction. The present level of proliferation of fake, biased, and propagandistic content online has made it impossible to fact-check every single suspicious claim, either manually or automatically. Thus, it has been proposed to profile entire news outlets and to look for those that are likely to publish fake or biased content. This makes it possible to detect likely “fake news” the moment they are published, by simply checking the reliability of their source. From a practical perspective, political bias and factuality of reporting have a linguistic aspect but also a social context. Here, we study the impact of both, namely (i) what was written (i.e., what was published by the target medium, and how it describes itself in Twitter) vs. (ii) who reads it (i.e., analyzing the target medium’s audience on social media). We further study (iii) what was written about the target medium (in Wikipedia). The evaluation results show that what was written matters most, and we further show that putting all information sources together yields huge improvements over the current state-of-the-art.

2019

Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Shervin Malmasi | Nikola Ljubešić | Jörg Tiedemann | Ahmed Ali
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

2018

Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Shervin Malmasi | Ahmed Ali
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

Word Error Rate Estimation for Speech Recognition: e-WER
Ahmed Ali | Steve Renals
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Measuring the performance of automatic speech recognition (ASR) systems requires manually transcribed data in order to compute the word error rate (WER), which is often time-consuming and expensive. In this paper, we propose a novel approach to estimate WER, or e-WER, which does not require a gold-standard transcription of the test set. Our e-WER framework uses a comprehensive set of features: ASR recognised text, character recognition results to complement recognition output, and internal decoder features. We report results for the two features; black-box and glass-box using unseen 24 Arabic broadcast programs. Our system achieves 16.9% WER root mean squared error (RMSE) across 1,400 sentences. The estimated overall WER e-WER was 25.3% for the three hours test set, while the actual WER was 28.5%.

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks, including two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) –, and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks, and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

2017

QCRI Live Speech Translation System
Fahim Dalvi | Yifan Zhang | Sameer Khurana | Nadir Durrani | Hassan Sajjad | Ahmed Abdelali | Hamdy Mubarak | Ahmed Ali | Stephan Vogel
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

This paper presents QCRI’s Arabic-to-English live speech translation system. It features modern web technologies to capture live audio, and broadcasts Arabic transcriptions and English translations simultaneously. Our Kaldi-based ASR system uses the Time Delay Neural Network (TDNN) architecture, while our Machine Translation (MT) system uses both phrase-based and neural frameworks. Although our neural MT system is slower than the phrase-based system, it produces significantly better translations and is memory efficient. The demo is available at https://st.qcri.org/demos/livetranslation.

Findings of the VarDial Evaluation Campaign 2017
Marcos Zampieri | Shervin Malmasi | Nikola Ljubešić | Preslav Nakov | Ahmed Ali | Jörg Tiedemann | Yves Scherrer | Noëmi Aepli
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL’2017. This year, we included four shared tasks: Discriminating between Similar Languages (DSL), Arabic Dialect Identification (ADI), German Dialect Identification (GDI), and Cross-lingual Dependency Parsing (CLP). A total of 19 teams submitted runs across the four tasks, and 15 of them wrote system description papers.

Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
Preslav Nakov | Marcos Zampieri | Nikola Ljubešić | Jörg Tiedemann | Shevin Malmasi | Ahmed Ali
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

The SUMMA Platform Prototype
Renars Liepins | Ulrich Germann | Guntis Barzdins | Alexandra Birch | Steve Renals | Susanne Weber | Peggy van der Kreeft | Hervé Bourlard | João Prieto | Ondřej Klejch | Peter Bell | Alexandros Lazaridis | Alfonso Mendes | Sebastian Riedel | Mariana S. C. Almeida | Pedro Balage | Shay B. Cohen | Tomasz Dwojak | Philip N. Garner | Andreas Giefer | Marcin Junczys-Dowmunt | Hina Imran | David Nogueira | Ahmed Ali | Sebastião Miranda | Andrei Popescu-Belis | Lesly Miculicich Werlen | Nikos Papasarantopoulos | Abiola Obamuyide | Clive Jones | Fahim Dalvi | Andreas Vlachos | Yang Wang | Sibo Tong | Rico Sennrich | Nikolaos Pappas | Shashi Narayan | Marco Damonte | Nadir Durrani | Sameer Khurana | Ahmed Abdelali | Hassan Sajjad | Stephan Vogel | David Sheppey | Chris Hernon | Jeff Mitchell
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semantic parsing to detect relationships between entities, and automatic construction / augmentation of factual knowledge bases. Implemented on the Docker platform, it can easily be deployed, customised, and scaled to large volumes of incoming media streams.

2016

Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task
Shervin Malmasi | Marcos Zampieri | Nikola Ljubešić | Preslav Nakov | Ahmed Ali | Jörg Tiedemann
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

We present the results of the third edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial’2016 workshop at COLING’2016. The challenge offered two subtasks: subtask 1 focused on the identification of very similar languages and language varieties in newswire texts, whereas subtask 2 dealt with Arabic dialect identification in speech transcripts. A total of 37 teams registered to participate in the task, 24 teams submitted test results, and 20 teams also wrote system description papers. High-order character n-grams were the most successful feature, and the best classification approaches included traditional supervised learning methods such as SVM, logistic regression, and language models, while deep learning approaches did not perform very well.

2015

Multi-Reference Evaluation for Dialectal Speech Recognition System: A Study for Egyptian ASR
Ahmed Ali | Walid Magdy | Steve Renals
Proceedings of the Second Workshop on Arabic Natural Language Processing

Best Practices for Crowdsourcing Dialectal Arabic Speech Transcription
Samantha Wray | Hamdy Mubarak | Ahmed Ali
Proceedings of the Second Workshop on Arabic Natural Language Processing

2014

Advances in dialectal Arabic speech recognition: a study using Twitter to improve Egyptian ASR
Ahmed Ali | Hamdy Mubarak | Stephan Vogel
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers

This paper reports results in building an Egyptian Arabic speech recognition system as an example for under-resourced languages. We investigated different approaches to build the system using 10 hours for training the acoustic model, and results for both grapheme system and phoneme system using MADA. The phoneme-based system shows better results than the grapheme-based system. In this paper, we explore the use of tweets written in dialectal Arabic. Using 880K Egyptian tweets reduced the Out Of Vocabulary (OOV) rate from 15.1% to 3.2% and the WER from 59.6% to 44.7%, a relative gain 25% in WER.

2012

Arabic Retrieval Revisited: Morphological Hole Filling
Kareem Darwish | Ahmed Ali
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Co-authors

Shervin Malmasi 6

Jörg Tiedemann 6

Marcos Zampieri 6

Yassine El Kheir 5

Nadir Durrani 4

Kareem Darwish 3

Stephan Vogel 3

Samir Abdaljalil 2

Ibrahim Abu Farha 2

Sharefah Al-Ghamdi 2

Badr Alkhamissi 2

Sakhar Alkhereyf 2

Rawan Almatham 2

Areeb Alowisheq 2

Waad Thuwaini Alshammari 2

Zaid Alyafeai 2

Sabri Boughorbel 2

Yousseif Ahmed Elshahawy 2

Maram Hasanain 2

Salam Khalifa 2

Sameer Khurana 2

Hassan Sajjad 2

Yves Scherrer 2

Samia Touileb 2

Wajdi Zaghouani 2

Shahad Aboukozzana 1

Nada Almarwani 1

Mariana S. C. Almeida 1

Abdullah Alrajeh 1

Sara Althubaiti 1

Pedro Balage Filho 1

Guntis Barzdins 1

Aljawahrah Bin Tamran 1

Alexandra Birch 1

Hervé Bourlard 1

Shay B. Cohen 1

Marco Damonte 1

Tomasz Dwojak 1

Youssef Elshahawy 1

Philip N. Garner 1

Ulrich Germann 1

Andreas Giefer 1

Stefan Grondelaers 1

Omnia Ibrahim 1

Marcin Junczys-Dowmunt 1

Georgi Karadzhov 1

Muhammad Kamran J Khan 1

Ondřej Klejch 1

Bornini Lahiri 1

Alexandros Lazaridis 1

Renārs Liepins 1

Vasista Sai Lodagala 1

Amit Meghanani 1

Alfonso Mendes 1

Lesly Miculicich Werlen 1

Nataša Milić-Frayling 1

Sebastião Miranda 1

Jeff Mitchell 1

Shashi Narayan 1

David Nogueira 1

Abiola Obamuyide 1

Nelleke Oostdijk 1

Nikos Papasarantopoulos 1

Nikolaos Pappas 1

Andrei Popescu-Belis 1

Sebastian Riedel 1

Tanja Samardzic 1

Rico Sennrich 1

Mostafa Shahin 1

David Sheppey 1

Dirk Speelman 1

Hawau Olamide Toyin 1

Andreas Vlachos 1

Susanne Weber 1

Samantha Wray 1

Antal van den Bosch 1

Peggy van der Kreeft 1

Chris van der Lee 1

Venues