Ahmed Oumar El-Shangiti


2025

The Geometry of Numerical Reasoning: Language Models Compare Numeric Properties in Linear Subspaces
Ahmed Oumar El-Shangiti | Tatsuya Hiraoka | Hilal AlQuabeh | Benjamin Heinzerling | Kentaro Inui
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

This paper investigates whether large language models (LLMs) utilize numerical attributes encoded in a low-dimensional subspace of the embedding space when answering questions involving numeric comparisons, e.g., Was Cristiano born before Messi? We first identified, using partial least squares regression, these subspaces, which effectively encode the numerical attributes associated with the entities in comparison prompts. Further, we demonstrate causality, by intervening in these subspaces to manipulate hidden states, thereby altering the LLM’s comparison outcomes. Experiments conducted on three different LLMs showed that our results hold across different numerical attributes, indicating that LLMs utilize the linearly encoded information for numerical reasoning.

2024

Casablanca: Data and Models for Multidialectal Arabic Speech Recognition
Bashar Talafha | Karima Kadaoui | Samar Mohamed Magdy | Mariem Habiboullah | Chafei Mohamed Chafei | Ahmed Oumar El-Shangiti | Hiba Zayed | Mohamedou Cheikh Tourad | Rahaf Alhamouri | Rwaa Assi | Aisha Alraeesi | Hour Mohamed | Fakhraddin Alwajih | Abdelrahman Mohamed | Abdellah El Mekki | El Moatez Billah Nagoudi | Benelhadj Djelloul Mama Saadia | Hamzah A. Alsayadi | Walid Al-Dhabyani | Sara Shatnawi | Yasir Ech-chammakhy | Amal Makouar | Yousra Berrachedi | Mustafa Jarrar | Shady Shehata | Ismail Berrada | Muhammad Abdul-Mageed
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

In spite of the recent progress in speech processing, the majority of world languages and dialects remain uncovered. This situation only furthers an already wide technological divide, thereby hindering technological and socioeconomic inclusion. This challenge is largely due to the absence of datasets that can empower diverse speech systems. In this paper, we seek to mitigate this obstacle for a number of Arabic dialects by presenting Casablanca, a large-scale community-driven effort to collect and transcribe a multi-dialectal Arabic dataset. The dataset covers eight dialects: Algerian, Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni, and includes annotations for transcription, gender, dialect, and code-switching. We also develop a number of strong baselines exploiting Casablanca. The project page for Casablanca is accessible at: www.dlnlp.ai/speech/casablanca.