Proceedings of the First Workshop on Multilingual Multicultural Evaluation

Pinzhen Chen, Vilém Zouhar, Hanxu Hu, Simran Khanuja, Wenhao Zhu, Barry Haddow, Alexandra Birch, Alham Fikri Aji, Rico Sennrich, Sara Hooker (Editors)


Anthology ID:
2026.mme-main
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venues:
MME | WS
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2026.mme-main/
ISBN:
979-8-89176-368-5
PDF:
https://aclanthology.org/2026.mme-main.pdf

Span annotation - annotating specific text features at the span level - can be used to evaluate texts where single-score metrics fail to provide actionable feedback. Until recently, span annotation was done by human annotators or fine-tuned models. In this paper, we study whether large language models (LLMs) can serve as an alternative to human annotators. We compare the abilities of LLMs with those of skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We show that, overall, LLMs have only moderate inter-annotator agreement (IAA) with human annotators. However, we demonstrate that LLMs make errors at a rate similar to that of skilled crowdworkers, while producing annotations at a fraction of the cost per annotation. We release a dataset of over 40k model and human span annotations for further research.
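The abstract does not specify how span-level IAA is computed; as an illustrative proxy only (not necessarily the paper's measure), agreement between two annotators' span sets over the same text can be scored as character-overlap F1:

```python
# Illustrative character-overlap F1 between two annotators' span sets.
# This is a generic proxy, not necessarily the IAA measure used in the paper.

def chars_covered(spans):
    """Return the set of character offsets covered by (start, end) spans."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered

def span_overlap_f1(spans_a, spans_b):
    """F1 of character overlap between two span annotations of the same text."""
    a, b = chars_covered(spans_a), chars_covered(spans_b)
    if not a and not b:
        return 1.0  # both annotators marked nothing: perfect agreement
    overlap = len(a & b)
    precision = overlap / len(a) if a else 0.0
    recall = overlap / len(b) if b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one annotator marks characters 10-20, the other marks 12-25.
print(span_overlap_f1([(10, 20)], [(12, 25)]))  # ~0.70
```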
Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Values Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance, which measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. This shows that even a high average agreement with human data, when LLM responses are considered independently, does not guarantee structural alignment in responses. Additionally, we reveal a weak correlation between two common evaluation metrics, mean-squared distance and KL divergence, both of which treat survey answers as independent of each other. For future research, we recommend CoT prompting, sampling-based decoding with dozens of samples, and robust analysis using multiple metrics, including self-correlation distance.
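The abstract gives only the intuition behind self-correlation distance; a minimal sketch of that intuition, under the assumption that it compares the question-by-question correlation structure of sampled LLM answers with that of human answers (the aggregation below is an illustrative choice, not the paper's definition), could look like this:

```python
import numpy as np

def correlation_structure(responses):
    """Question-by-question correlation matrix.

    responses: array of shape (n_respondents_or_samples, n_questions),
    e.g. human survey answers or repeated sampled LLM answers.
    """
    return np.corrcoef(responses, rowvar=False)

def self_correlation_distance(llm_samples, human_responses):
    """Mean absolute difference between the two correlation matrices.

    Illustrative only: the paper may aggregate the comparison differently.
    """
    c_llm = correlation_structure(llm_samples)
    c_human = correlation_structure(human_responses)
    return float(np.mean(np.abs(c_llm - c_human)))

# Toy example: 50 sampled LLM answer sets vs. 200 human respondents, 10 questions.
rng = np.random.default_rng(0)
llm_answers = rng.integers(1, 5, size=(50, 10))
human_answers = rng.integers(1, 5, size=(200, 10))
print(self_correlation_distance(llm_answers, human_answers))
```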
Code-switching is a common feature of multilingual communication, and reliably identifying where the language switches is essential for downstream tasks such as generating code-switched machine translations. This paper introduces CSDI, a Code-Switching Detection (CSD) system for Indic text, which jointly learns CSD, Named Entity Recognition, and Part-of-Speech tagging through a shared encoder. Leveraging multitask learning, CSDI captures linguistic cues that signal switching boundaries and achieves a new state-of-the-art macro-F1 score with near-zero ΔCMI across six Indic languages. The model also demonstrates strong cross-lingual transfer, effectively leveraging high-resource languages to improve low-resource performance. Despite challenges such as intra-word code-mixing and limited token-level context, CSDI establishes a new baseline for scalable, low-resource NLP research in code-mixed environments.
This paper introduces Vinclat, a novel evaluation dataset for Catalan carefully designed to assess the reasoning capabilities and cultural knowledge of LLMs. It comprises 1,000 high-quality instances, meticulously crafted and reviewed by human annotators. Each instance presents a complex riddle that requires a two-step reasoning process involving inferential and abductive reasoning, along with other cognitive skills such as lexical retrieval, paraphrasing, flexibility in interpretation, pattern recognition, and associative thinking. Given four independent clues, models should infer intermediate concepts which, despite being seemingly unrelated, can be creatively connected to reach a final solution. The task targets a unique blend of capabilities, distinguishing it from existing NLP benchmarks. Our evaluation of state-of-the-art models reveals that these still fall significantly short of human-level reasoning, although scaling trends suggest that the performance gap may narrow over time. This indicates that Vinclat provides a robust and long-term challenge, resisting the rapid saturation that is commonly observed in many existing evaluation datasets.
Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at https://github.com/IyatomiLab/CCI.
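A compact statement of this definition, writing G(s, c) for the generality estimate of sentence s within culture c over a comparison set of cultures C (the notation is introduced here only for illustration and is not taken from the paper):

```latex
\mathrm{CCI}(s, c) \;=\; G(s, c) \;-\; \frac{1}{\lvert \mathcal{C} \setminus \{c\} \rvert} \sum_{c' \in \mathcal{C} \setminus \{c\}} G(s, c')
```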
Food carries cultural meaning beyond nutrition. It shapes identity, memory, and social norms, which makes it a central concern in anthropology. Given the diversity of food practices across cultures, analyzing them at scale while preserving their depth (“thick” descriptions) remains difficult for ethnographic methods, where Natural Language Processing (NLP) methods can help. Earlier NLP tools often captured only surface-level “thin” descriptions. Recent methods, especially Large Language Models (LLMs), create openings to recover cultural nuance. In this position paper, we outline research questions at the intersection of food anthropology and NLP, and discuss how LLMs can enable a scalable and culturally grounded anthropology of food. We present a case study examining what LLMs represent about global eating habits, which are often shaped by colonial histories and globalization. Our findings suggest that LLMs’ internal representations recognize cultural clusters, such as shared food habits among formerly colonized regions, but fail to grasp the pragmatic and experiential aspects of food, like the worldwide spread of dishes like pizza or biryani. We conclude by highlighting some of the potential risks and gaps of using NLP for cultural analysis.
Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but it is primarily used as a quantitative tool, i.e., with numerical scores as the main output. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights into what improvements can be made to a given NLG system, and it consists of two main steps: open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in about two out of three cases, and that LLM-as-a-qualitative-judge is capable of producing error-type reports resembling those composed by human annotators. We also demonstrate in a case study how LLM-as-a-qualitative-judge can substantially improve NLG system performance.
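The "intuitive cumulative algorithm" is not spelled out in the abstract; one plausible reading, sketched here with a hypothetical embed() helper and an arbitrary similarity threshold, assigns each newly discovered issue to the most similar existing cluster or opens a new one:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cumulative_cluster(issue_texts, embed, threshold=0.75):
    """Greedy one-pass clustering of free-text issue descriptions.

    embed: hypothetical callable mapping a string to a NumPy vector
    (e.g. any sentence-embedding model). The threshold is arbitrary.
    """
    clusters = []  # each cluster: {"centroid": vector, "members": [texts]}
    for text in issue_texts:
        vec = embed(text)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": vec, "members": [text]})
        else:
            best["members"].append(text)
            n = len(best["members"])
            best["centroid"] = (best["centroid"] * (n - 1) + vec) / n
    return clusters
```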
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
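One standard way to quantify the ranking (in)stability described here, given per-language judge scores for a fixed set of models, is a rank correlation such as Spearman's rho (the model names and scores below are placeholders, not results from the paper):

```python
from scipy.stats import spearmanr

# Placeholder judge coherence scores for the same four models in two languages.
scores_et = {"model_a": 4.1, "model_b": 3.8, "model_c": 3.5, "model_d": 3.2}
scores_fi = {"model_a": 3.3, "model_b": 3.9, "model_c": 3.1, "model_d": 4.0}

models = sorted(scores_et)
rho, _ = spearmanr([scores_et[m] for m in models],
                   [scores_fi[m] for m in models])
print(f"Spearman rho across languages: {rho:.2f}")
# Values near 1 indicate stable rankings; near-zero or negative values
# correspond to the rank inversions reported for pragmatic judgments.
```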
Argument mining in multilingual settings has rarely been investigated, due to the lack of annotated resources and the inherent difficulty of the task. We benchmark the performance of models on cross-lingual and cross-country argument component detection, focusing on political data from the US and France. To do so, we introduce FrenchPolArg, a corpus of argumentative political discourse in French, and we automatically translate existing US-English resources. We benchmark three different cross-lingual and cross-country pipelines and compare their results to identify the best-performing one. We obtain promising results that can be integrated into semi-automatic annotation workflows to reduce the time and cost of annotation.
This paper introduces UNSC-Bench, a benchmark for evaluating Large Language Models (LLMs) in simulating diplomatic decision-making through United Nations Security Council (UNSC) vote prediction. The dataset includes 469 UNSC resolutions from 1947 to 2025, with voting records for the five permanent members (P5: United States, China, France, Russia, and the United Kingdom) and translations in four languages. We analyze 26 LLMs, along with their thinking variants, across multiple P5 roles and find that (1) without explicit role assignment, models are diplomatically unaligned, defaulting to high yes rates and failing to match any P5 voting pattern, indicating they lack inherent diplomatic identity; (2) model capability (as measured by MMLU-Pro) is strongly correlated with role-playing accuracy; (3) regional models do not outperform others in predicting their home country’s votes; and (4) multilingual evaluation reveals that prompt language impacts model predictions, particularly for minority vote outcomes.
Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weight models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources to detect biases in non-English languages, especially from Latin America (Latam), a continent containing many distinct cultures that nonetheless share a common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science in order to create a dataset of question/answer (Q/A) pairs based on the popular and social cultures of various Latin American countries. We build a database of around 23k questions and associated answers extracted from 23k Wikipedia articles and transformed into multiple-choice questions (MCQs) in Spanish and Portuguese, in turn translated into English. We use these MCQs to quantify the degree of knowledge of various LLMs and find (i) a discrepancy in performance between Latam countries, some being easier than others for the majority of the models, (ii) that the models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam ones. Our code, the results needed for reproduction, and all datasets by region will be made available.
We study retrieval-augmented generation (RAG) evaluation in the Trendyol QA Assistant using 150k real e-commerce interactions. Our framework combines user satisfaction labels, LLM-as-a-judge scoring, and factor-based diagnostics to separate retrieval errors from generation errors. We find that judge models broadly reflect user satisfaction trends, though important nuances of dissatisfaction are often missed. Factor-level analysis highlights systematic error patterns across query types and context quality, demonstrating that hybrid evaluation, combining multiple LLM judges with direct user feedback, offers the most reliable assessment strategy for production RAG systems.
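A minimal sketch of the kind of judge-versus-user cross-check implied above, comparing binarized judge verdicts against user satisfaction labels (the labels and the choice of Cohen's kappa are illustrative assumptions, not the paper's protocol):

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder labels: 1 = satisfied / judged good, 0 = dissatisfied / judged bad.
user_satisfaction = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge_verdicts    = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]

agreement = sum(u == j for u, j in zip(user_satisfaction, judge_verdicts)) / len(judge_verdicts)
kappa = cohen_kappa_score(user_satisfaction, judge_verdicts)
print(f"raw agreement={agreement:.2f}, Cohen's kappa={kappa:.2f}")
# High raw agreement with only moderate kappa is one way the "missed nuances
# of dissatisfaction" pattern described above can surface.
```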
When multilingual users switch languages mid-conversation, how should LLMs respond? We extend MultiChallenge to evaluate cross-turn language switching, translating 182 multi-turn conversations into German, Chinese, Spanish, and Arabic. Across five frontier models, we observe asymmetric behavior: switching into a foreign language (EN→X) yields high query-language fidelity (89–99%), but switching back to English (X→EN) reveals divergent policies. GPT-5 follows the query language (>95%), while Claude Opus 4.5 and Command R+ maintain the established conversation language (<8%). Task accuracy remains stable across conditions regardless of language selection differences. A simple explicit system prompt shows limited effectiveness in modifying these defaults.
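Query-language fidelity is reported above as a percentage; a minimal sketch of one way to compute it, assuming an off-the-shelf language identifier such as langdetect (the detector actually used is not stated in the abstract):

```python
from langdetect import DetectorFactory, detect  # pip install langdetect

DetectorFactory.seed = 0  # make detection deterministic

def query_language_fidelity(turns):
    """Fraction of assistant responses whose detected language matches
    the language of the user query they answer.

    turns: list of (user_query, assistant_response) string pairs.
    """
    matches = sum(detect(response) == detect(query) for query, response in turns)
    return matches / len(turns) if turns else 0.0

# Toy example: the user switches back to English on the second turn.
turns = [
    ("Wie funktioniert das Rückgaberecht?",
     "Sie können Artikel innerhalb von 30 Tagen zurückgeben."),
    ("Can you summarise that in English?",
     "Sie können Artikel innerhalb von 30 Tagen zurückgeben."),
]
print(query_language_fidelity(turns))  # 0.5: the second reply ignores the switch back to English
```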
Machine translation benchmarks sourced from the real world quickly become obsolete, because most of their examples are easy for state-of-the-art translation models. This limits a benchmark’s ability to distinguish which model is better or to reveal models’ weaknesses. Current methods for creating difficult test cases, such as subsampling or from-scratch synthesis, either fall short of identifying difficult examples or suffer from a lack of diversity and naturalness. Inspired by the iterative process of human experts probing for model failures, we propose MT-breaker, a method in which a large language model iteratively refines a source text to increase its translation difficulty. The LLM iteratively queries a target machine translation model to guide its generation of difficult examples. Our approach generates examples that are more challenging for the target MT model while preserving the diversity of natural texts. Although the examples are tailored to a particular machine translation model during generation, the difficulty also transfers to other models and languages.
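A schematic of the iterative probing loop described above, with all callables left as hypothetical placeholders, since the actual prompts, difficulty signal, and stopping rule are not given in the abstract:

```python
def harden_source(seed_text, rewrite_with_llm, translate, score_difficulty,
                  max_rounds=5):
    """Iteratively rewrite a source text so it becomes harder to translate.

    rewrite_with_llm(text, feedback) -> new source text    (hypothetical LLM call)
    translate(text) -> translation by the target MT model  (hypothetical MT call)
    score_difficulty(source, translation) -> float, higher = harder
    """
    best_source = seed_text
    best_score = score_difficulty(best_source, translate(best_source))
    for _ in range(max_rounds):
        feedback = (f"The MT system handled the current text with difficulty {best_score:.2f}; "
                    "rewrite it to be more challenging to translate while staying natural and fluent.")
        candidate = rewrite_with_llm(best_source, feedback)
        score = score_difficulty(candidate, translate(candidate))
        if score > best_score:  # keep only rewrites that make translation harder
            best_source, best_score = candidate, score
    return best_source, best_score
```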
As the deployment of large language models (LLMs) expands, there is an increasing demand for personalized LLMs. One method to personalize and guide the outputs of these models is to assign a persona, i.e., a role that describes the expected behavior of the LLM (e.g., a man, a woman, an engineer). This study examines whether an LLM’s interpretation of social norms varies based on the assigned persona and whether these variations stem from biases embedded within the models. In our research, we tested 34 distinct personas from 12 categories (e.g., age, gender, beauty) across four different LLMs. We find that LLMs’ cultural norm interpretation varies based on the persona used, and that the variations within a persona category (e.g., a fat person and a thin person in the physical appearance category) follow a trend in which an LLM with the more socially desirable persona (e.g., a thin person) interprets social norms more accurately than with the less socially desirable persona (e.g., a fat person). While persona-based conditioning can enhance model adaptability, it also risks reinforcing stereotypes rather than providing an unbiased representation of cultural norms. We also discuss how different types of social biases, arising from the stereotypical assumptions of LLMs, may contribute to the results that we observe.