Evaluating Multilingual Sentiment Classifiers Using an LLM-Annotated Wikipedia Benchmark

Milena Stróżyna; Włodzimierz Lewoniewski; Izabela Czumałowska

Abstract

We present a multilingual study of sentiment evaluation on Wikipedia articles covering various topics in five languages (German, English, Spanish, Polish, and Russian). We compare three large language models (Gemini Pro 3.1, Claude Opus 4.6, and GPT 5.2), each queried three times per sentence, with two popular multilingual sentiment classifiers. This setup allows us to analyze not only inter-model differences but also intra-model stability as a proxy for confidence. To support systematic evaluation, we construct a benchmark dataset based on strict consensus across evaluators and analyze sentiment distributions across topics and languages. We show substantial variation in sentiment distributions, agreement, and consistency across models and languages. Our results suggest that sentiment evaluation on encyclopedic text remains an underexplored challenge for multilingual NLP.
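
The abstract does not define how intra-model stability or strict consensus are computed, so the following is only a minimal sketch under stated assumptions: stability is taken as the share of a model's repeated runs that match its most frequent label, and strict consensus is taken to mean that every run of every evaluator yields the same label. The function names `stability` and `strict_consensus` and the label strings are hypothetical and are not drawn from the paper.

```python
from collections import Counter
from typing import Optional


def stability(runs: list[str]) -> float:
    """Share of repeated runs that agree with the most frequent label.

    Assumption: this is one plausible stability measure, not necessarily
    the one used by the authors.
    """
    if not runs:
        raise ValueError("runs must not be empty")
    most_common_count = Counter(runs).most_common(1)[0][1]
    return most_common_count / len(runs)


def strict_consensus(annotations: dict[str, list[str]]) -> Optional[str]:
    """Return the shared label if all runs of all evaluators agree, else None.

    `annotations` maps a (hypothetical) evaluator name to the labels it
    produced for one sentence across repeated queries.
    """
    labels = {label for runs in annotations.values() for label in runs}
    return labels.pop() if len(labels) == 1 else None
```

Under these assumptions, a sentence enters the benchmark only when `strict_consensus` returns a label, while `stability` can be reported per model and per sentence as a rough confidence proxy.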

Anthology ID: 2026.gem-main.63
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 692–703
URL: https://aclanthology.org/2026.gem-main.63/
Cite (ACL): Milena Stróżyna, Włodzimierz Lewoniewski, and Izabela Czumałowska. 2026. Evaluating Multilingual Sentiment Classifiers Using an LLM-Annotated Wikipedia Benchmark. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 692–703, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Evaluating Multilingual Sentiment Classifiers Using an LLM-Annotated Wikipedia Benchmark (Stróżyna et al., GEM 2026)
PDF: https://aclanthology.org/2026.gem-main.63.pdf
