Faezeh Hosseini


2026

Figurative language, especially fixed figurative expressions (FFEs) such as idioms and proverbs, poses unique challenges for large language models (LLMs). Unlike literal phrases, FFEs are culturally grounded and often non-compositional, making them vulnerable to figurative hallucination: the generation or acceptance of plausible-sounding but culturally invalid expressions. We introduce FFE-Hallu, the first comprehensive benchmark for evaluating LLMs' ability to generate, detect, and translate FFEs in Persian, a linguistically rich but underrepresented language. FFE-Hallu includes 600 carefully curated examples spanning three tasks: FFE generation from meaning, detection of fabricated FFEs (across four controlled categories), and FFE-to-FFE translation from English to Persian. Our evaluation of six state-of-the-art multilingual LLMs reveals persistent weaknesses in both cultural grounding and figurative competence. While models like GPT-4.1 display relative strength in rejecting fabricated FFEs and retrieving authentic ones, most systems struggle to reliably distinguish real FFEs from high-quality fabrications and often hallucinate in translation. These findings show that LLMs still have significant gaps in understanding and using figurative language, and that specialized benchmarks like FFE-Hallu are needed to measure them.

2025

Large language models (LLMs) have shown superior capabilities in translating figurative language compared to neural machine translation (NMT) systems. However, the impact of different prompting methods and LLM-NMT combinations on idiom translation has yet to be thoroughly investigated. This paper introduces two parallel datasets of sentences containing idiomatic expressions for Persian→English and English→Persian translations, with Persian idioms sampled from our PersianIdioms resource, a collection of 2,200 idioms and their meanings, with 700 including usage examples. Using these datasets, we evaluate various open- and closed-source LLMs, NMT models, and their combinations. Translation quality is assessed through idiom translation accuracy and fluency. We also find that automatic evaluation methods like LLM-as-a-judge, BLEU, and BERTScore are effective for comparing different aspects of model performance. Our experiments reveal that Claude-3.5-Sonnet delivers outstanding results in both translation directions. For English→Persian, combining weaker LLMs with Google Translate improves results, while Persian→English translations benefit from simple prompts for simpler models and complex prompts for advanced ones.