Aly M. Kassem
2026
How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities
Aly M. Kassem | Bernhard Schölkopf | Zhijing Jin
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance by dynamically assigning queries to the most appropriate model based on query complexity. Despite recent advances showing that preference-data-based routers can outperform traditional methods, current evaluation benchmarks remain limited—they largely focus on general model capabilities while overlooking task-specific behaviors and critical concerns such as privacy, safety, and potential backdoor vulnerabilities introduced through preference data. In response, we propose the DSC benchmark (Diverse, Simple, and Categorized), an evaluation framework that categorizes router performance across a broad spectrum of query types—including coding, translation, mathematics, human instructions, general knowledge, and LLM jailbreaking—and integrates privacy and safety assessments to reveal hidden risks. Our experiments on three preference-based routers and two commercial counterparts demonstrate that while these systems improve efficiency, they often make suboptimal, category-driven decisions; for instance, a BERT-based router directs all coding and mathematics queries to the most powerful LLM—even when simpler models would suffice—while routing jailbreaking attempts to weaker models, thereby elevating safety risks.
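The category-driven failure mode the abstract describes can be sketched as a toy routing rule (a minimal illustration, not any router's actual implementation; the category names, threshold, and model labels are hypothetical):

```python
def category_router(category, complexity, threshold=0.5):
    """Toy router exhibiting the category-driven shortcut the DSC
    benchmark exposes: coding and mathematics queries always go to the
    strongest (most expensive) model, regardless of the per-query
    complexity estimate, while other categories are routed by complexity."""
    if category in {"coding", "mathematics"}:
        return "strong"  # suboptimal when the individual query is easy
    return "strong" if complexity >= threshold else "weak"
```

A cost-optimal router would instead route the easy coding query below to the weak model; the shortcut sends it to the strong one anyway, which is exactly the inefficiency the benchmark measures.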
2025
REVIVING YOUR MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing
Aly M. Kassem | Zhuan Shi | Negar Rostamzadeh | Golnoosh Farnadi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
LLMs are frequently fine-tuned or unlearned to adapt to new tasks or eliminate undesirable behaviors. While existing evaluation methods assess performance after such interventions, there remains no general approach for detecting unintended side effects—such as unlearning biology content degrading performance on chemistry tasks—particularly when these effects are unpredictable or emergent. To address this issue, we introduce MNEME, Model diffiNg for Evaluating Mechanistic Effects, a framework for identifying these side effects using sparse model diffing. MNEME compares base and fine-tuned models on out-of-distribution (OOD) data (e.g., The Pile, LMSYS-Chat-1M), without access to the fine-tuning data, to isolate behavioral shifts. Applied to five LLMs across three scenarios—WMDP knowledge unlearning, emergent misalignment, and benign fine-tuning—MNEME achieves up to 95% accuracy in predicting side effects, aligning with known benchmarks and requiring no custom heuristics. Our results demonstrate that sparse probing and diffing offer a scalable and automated lens into fine-tuning-induced model changes, providing practical tools for understanding and managing LLM behavior.
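The core diffing step can be illustrated with a minimal sketch: given activations from a base and a fine-tuned model on the same OOD prompts, rank hidden dimensions by mean absolute shift to surface candidate side-effect features (the function name, data layout, and ranking rule here are illustrative assumptions, not MNEME's actual pipeline):

```python
def diff_features(base_acts, tuned_acts, top_k=3):
    """Rank hidden dimensions by mean absolute activation shift between
    a base and a fine-tuned model over the same OOD prompts.

    base_acts / tuned_acts: lists of per-prompt activation vectors
    (list of lists of floats), aligned prompt-by-prompt.
    Returns the indices of the top_k most-shifted dimensions.
    """
    n_prompts = len(base_acts)
    dims = len(base_acts[0])
    shift = []
    for d in range(dims):
        total = sum(abs(tuned_acts[p][d] - base_acts[p][d])
                    for p in range(n_prompts))
        shift.append(total / n_prompts)
    # Largest shifts first: these are candidate "side-effect" features.
    return sorted(range(dims), key=lambda d: -shift[d])[:top_k]
```

In practice the activations would come from a sparse autoencoder over model hidden states, which is what makes the flagged dimensions interpretable; this sketch only shows the ranking logic.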
ALPACA AGAINST VICUNA: Using LLMs to Uncover Memorization of LLMs
Aly M. Kassem | Omar Mahmoud | Niloofar Mireshghallah | Hyunwoo Kim | Yulia Tsvetkov | Yejin Choi | Sherif Saad | Santu Rana
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
In this paper, we investigate the overlooked impact of instruction-tuning on memorization in large language models (LLMs), which has largely been studied in base, pre-trained models. We propose a black-box prompt optimization method where an attacker LLM agent uncovers higher levels of memorization in a victim agent, surpassing traditional approaches that prompt the model directly with training data. Using an iterative rejection-sampling process, we design instruction-based prompts that minimize overlap with training data to avoid providing direct solutions while maximizing overlap between the victim’s output and the training data to induce memorization. Our method shows 23.7% more overlap with training data compared to state-of-the-art baselines. We explore two attack settings: an analytical approach that determines the empirical upper bound of the attack, both with and without access to responses for prompt initialization, and a practical classifier-based method for assessing memorization without access to memorized data. Our findings reveal that instruction-tuned models can expose pre-training data as much as, or more than, base models; contexts beyond the original training data can lead to leakage; and instructions generated by other LLMs open new avenues for automated attacks, which we believe require further exploration.
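The rejection-sampling criterion in the abstract — minimize prompt/training-data overlap while maximizing output/training-data overlap — can be sketched with a simple n-gram overlap score (a toy illustration under assumed names and thresholds; the paper's actual overlap metric and acceptance rule may differ):

```python
def ngram_set(text, n=3):
    """Set of word n-grams occurring in a string."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b, n=3):
    """Fraction of a's n-grams that also appear in b (0.0 if a has none)."""
    ga, gb = ngram_set(a, n), ngram_set(b, n)
    return len(ga & gb) / len(ga) if ga else 0.0

def accept(prompt, victim_output, train_doc, max_prompt_overlap=0.2):
    """Rejection-sampling criterion: discard a candidate prompt that
    shares too much with the training text (it would hand the victim
    the answer); otherwise score it by how much the victim's output
    overlaps the training text, i.e., how much memorization it induced."""
    if overlap(prompt, train_doc) > max_prompt_overlap:
        return None  # prompt leaks the target text; reject the candidate
    return overlap(victim_output, train_doc)
```

An attacker LLM would repeatedly propose prompts, keep only those that pass `accept`, and retain the highest-scoring survivors — the iterative loop the abstract describes.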
2022
On The Arabic Dialects’ Identification: Overcoming Challenges of Geographical Similarities Between Arabic dialects and Imbalanced Datasets
Salma Jamal | Aly M. Kassem | Omar Mohamed | Ali Ashraf
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Arabic is one of the world’s richest languages, with a diverse range of dialects based on geographical origin. In this paper, we present a solution to tackle subtask 1 (country-level dialect identification) of the Nuanced Arabic Dialect Identification (NADI) shared task 2022, achieving third place with an average macro F1 score between the two test sets of 26.44%. In the preprocessing stage, we removed the most frequent terms from all sentences across all dialects, and in the modeling step, we employed a hybrid loss function approach that combines weighted cross-entropy loss and Vector Scaling (VS) loss. On test sets A and B, our model achieved macro F1 scores of 35.68% and 17.192%, respectively.
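The hybrid loss can be sketched as a convex combination of weighted cross-entropy and a VS-adjusted cross-entropy, where the VS adjustment rescales and shifts each class logit using training-set class frequencies (a minimal pure-Python sketch; the mixing weight `alpha` and the hyperparameters `tau` and `gamma` are illustrative assumptions, not the paper's tuned values):

```python
import math

def softmax_ce(logits, target, weights=None):
    """(Optionally class-weighted) cross-entropy for a single example."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift by max for stability
    log_prob = math.log(exps[target] / sum(exps))
    w = weights[target] if weights is not None else 1.0
    return -w * log_prob

def vs_adjusted_logits(logits, class_counts, tau=1.0, gamma=0.15):
    """Vector Scaling (VS) adjustment: a multiplicative and an additive
    per-class term derived from training-set class frequencies, so that
    rare classes are not drowned out in an imbalanced dataset."""
    n_max = max(class_counts)
    n_total = sum(class_counts)
    adjusted = []
    for z, n in zip(logits, class_counts):
        mult = (n / n_max) ** gamma        # multiplicative scaling
        add = tau * math.log(n / n_total)  # additive logit-adjustment term
        adjusted.append(mult * z + add)
    return adjusted

def hybrid_loss(logits, target, class_counts, alpha=0.5, weights=None):
    """Convex combination of weighted CE and VS loss for one example."""
    ce = softmax_ce(logits, target, weights)
    vs = softmax_ce(vs_adjusted_logits(logits, class_counts), target)
    return alpha * ce + (1 - alpha) * vs
```

Setting `alpha=1.0` recovers plain weighted cross-entropy, so the two terms can be traded off against each other on the imbalanced NADI label distribution.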