Arshia Hemmat
2026
MEENA (PersianMMMU): Multimodal-Multilingual Educational Exams for N-level Assessment
Omid Ghahroodi | Arshia Hemmat | Marzia Nouri | Seyed Mohammad Hadi Hosseini | Doratossadat Dastgheib | Mohammad Vali Sanian | Alireza Sahebi | Reihaneh Zohrabi | Mohammad Hossein Rohban | Ehsaneddin Asgari | Mahdieh Soleymani Baghshah
Findings of the Association for Computational Linguistics: EACL 2026
Recent advancements in large vision-language models (VLMs) have primarily focused on English, with limited attention given to other languages. To address this gap, we introduce MEENA (also known as PersianMMMU), the first dataset designed to evaluate Persian VLMs across scientific, reasoning, and human-level understanding tasks. Our dataset comprises approximately 7,500 Persian and 3,000 English questions, covering a wide range of topics such as reasoning, mathematics, physics, diagrams, charts, and Persian art and literature. Key features of MEENA include: (1) diverse subject coverage spanning various educational levels, from primary to upper secondary school, (2) rich metadata, including difficulty levels and descriptive answers, (3) original Persian data that preserves cultural nuances, (4) a bilingual structure to assess cross-linguistic performance, and (5) a series of diverse experiments assessing various capabilities, including overall performance, the model’s ability to attend to images, and its tendency to generate hallucinations. We hope this benchmark contributes to enhancing VLM capabilities beyond English.
Unmasking the Factual-Conceptual Gap in Persian Language Models
Alireza Sakhaeirad | Ali Ma'manpoosh | Arshia Hemmat
The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family
While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DIVANBENCH, a diagnostic benchmark focused on superstitions and customs: arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model's ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.
2025
VAGUE‐Gate: Plug‐and‐Play Local‐Privacy Shield for Retrieval‐Augmented Generation
Arshia Hemmat | Matin Moqadas | Ali Mamanpoosh | Amirmasoud Rismanchian | Afsaneh Fatemi
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Retrieval-augmented generation (RAG) still *forwards* raw passages to large language models, so private facts slip through. Prior defenses are either (i) **heavyweight** (full DP training, impractical for today's 70B-parameter models) or (ii) **over-zealous** (blanket redaction of every named entity, which slashes answer quality). We introduce **VAGUE-Gate**, a lightweight, *locally* differentially-private gate deployable in front of *any* RAG system. A precision pass drops low-utility tokens under a user budget ε; then up to k(ε) high-temperature paraphrase passes further cloud residual cues, and post-processing guarantees preserve the same ε-LDP bound. To measure both privacy and utility, we release **BlendPriv** (3k blended-sensitivity QA pairs) and two new metrics: a lexical Information-Leakage Score and an LLM-as-Judge score. Across eight pipelines and four SOTA LLMs, **VAGUE-Gate** at ε = 0.3 lowers lexical leakage by **70%** and semantic leakage by **1.8** points (1–5 scale) while retaining **91%** of Plain-RAG faithfulness with only a **240 ms** latency overhead. All code, data, and prompts are publicly released:

- Code: <https://github.com/arshiahemmat/LDP_RAG>
- Dataset: <https://huggingface.co/datasets/AliMnp/BlendPriv>