Guillermo Marco


2025

Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
Eva Sánchez Salido | Roser Morante | Julio Gonzalo | Guillermo Marco | Jorge Carrillo-de-Albornoz | Laura Plaza | Enrique Amigo | Andrés Fernandez García | Alejandro Benito-Santos | Adrián Ghajari Espinosa | Victor Fresno
Proceedings of the 31st International Conference on Computational Linguistics

In this article we present UNED-ACCESS 2024, a bilingual dataset that consists of 1003 multiple-choice questions from university entrance level exams in Spanish and English. Questions were originally formulated in Spanish and manually translated into English, and have never been publicly released, ensuring minimal contamination when evaluating Large Language Models with this dataset. A selection of current open-source and proprietary models is evaluated in a uniform zero-shot experimental setting both on the UNED-ACCESS 2024 dataset and on an equivalent subset of MMLU questions. Results show that (i) smaller models not only perform worse than the largest models, but also degrade faster in Spanish than in English: the performance gap between the two languages is negligible for the best models, but grows up to 37% for smaller models; (ii) model ranking on UNED-ACCESS 2024 is almost identical (0.98 Pearson correlation) to the one obtained with MMLU (a similar, but publicly available benchmark), suggesting that contamination affects all models similarly; and (iii) as in publicly available datasets, reasoning questions in UNED-ACCESS are more challenging for models of all sizes.

Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs
Guillermo Marco | Luz Rello | Julio Gonzalo
Proceedings of the 31st International Conference on Computational Linguistics

In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART-large, and compare its performance to human writers and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human study in which 68 participants rated short stories from humans and the SLM on grammaticality, relevance, creativity, and attractiveness, and (ii) a qualitative linguistic analysis examining the textual characteristics of stories produced by each model. In the first experiment, BART-large outscored average human writers overall (2.11 vs. 1.85), a 14% relative improvement, though the slight human advantage in creativity was not statistically significant. In the second experiment, qualitative analysis showed that while GPT-4o demonstrated near-perfect coherence and used fewer clichéd phrases, it tended to produce more predictable language, with only 3% of its synopses featuring surprising associations (compared to 15% for BART). These findings highlight how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks, and demonstrate that smaller models can, in certain contexts, rival both humans and larger models.

2024

Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?
Guillermo Marco | Julio Gonzalo | M.Teresa Mateo-Girona | Ramón Del Castillo Santos
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Are LLMs ready to compete in creative writing skills with a top (rather than average) novelist? To provide an initial answer to this question, we have carried out a contest between Patricio Pron (an award-winning novelist, considered one of the best of his generation) and GPT-4 (one of the top performing LLMs), in the spirit of AI-human duels such as Deep Blue vs Kasparov and AlphaGo vs Lee Sedol. We asked Pron and GPT-4 to provide thirty titles each, and then to write short stories for both their titles and their opponent’s. Then, we prepared an evaluation rubric inspired by Boden’s definition of creativity, and we collected several detailed expert assessments of the texts, provided by literary critics and scholars. The results of our experimentation indicate that LLMs are still far from challenging a top human creative writer. We also observed that GPT-4 writes more creatively using Pron’s titles than its own titles (which is an indication of the potential for human-machine co-creation). Additionally, we found that GPT-4 has a more creative writing style in English than in Spanish.

A Web Portal about the State of the Art of NLP Tasks in Spanish
Enrique Amigó | Jorge Carrillo-de-Albornoz | Andrés Fernández | Julio Gonzalo | Guillermo Marco | Roser Morante | Laura Plaza | Jacobo Pedrosa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper presents a new web portal with information about the state of the art of natural language processing tasks in Spanish. It provides information about forums, competitions, tasks and datasets in Spanish, which would otherwise be scattered across multiple articles and websites. The portal consists of overview pages, where information can be searched for and filtered by several criteria, and individual pages with detailed information and hyperlinks to facilitate navigation. Information has been manually curated from publications that describe competitions and NLP tasks from 2013 until 2023 and will be updated as new tasks appear. A total of 185 tasks and 128 datasets from 94 competitions have been introduced.