Markarit Vartampetian


2026

Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but it is primarily used as a quantitative tool, i.e. with numerical scores as the main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach aims to provide developers with meaningful insights into which improvements can be made to a given NLG system, and consists of two main steps: open-ended per-instance issue analysis, and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that the instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 of cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling those composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve NLG system performance.

2025

In this article, we present SuperGPQA-HCE-FR, a French adaptation of a subset of the SuperGPQA benchmark focused on the domains of hydraulic and civil engineering. It comprises 285 multiple-choice questions designed to evaluate and specialize multilingual large language models (LLMs) on technical tasks. The automatic translation is then evaluated by domain experts. Finally, we present initial results on general-purpose multilingual Instruct models, comparing performance on the original English corpus with that on the French-translated corpus.