Julio Gonzalo

2025

In this article we present UNED-ACCESS 2024, a bilingual dataset consisting of 1003 multiple-choice questions from university entrance-level exams in Spanish and English. The questions were originally formulated in Spanish and manually translated into English, and have never been publicly released, ensuring minimal contamination when evaluating Large Language Models with this dataset. A selection of current open-source and proprietary models is evaluated in a uniform zero-shot experimental setting on both the UNED-ACCESS 2024 dataset and an equivalent subset of MMLU questions. Results show that (i) smaller models not only perform worse than the largest models, but also degrade faster in Spanish than in English: the performance gap between the two languages is negligible for the best models, but grows up to 37% for smaller models; (ii) the model ranking on UNED-ACCESS 2024 is almost identical (0.98 Pearson correlation) to the one obtained with MMLU (a similar, but publicly available benchmark), suggesting that contamination affects all models similarly; and (iii) as in publicly available datasets, reasoning questions in UNED-ACCESS are more challenging for models of all sizes.

Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs
Guillermo Marco | Luz Rello | Julio Gonzalo
Proceedings of the 31st International Conference on Computational Linguistics

In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART-large, and compare its performance to human writers and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human study in which 68 participants rated short stories from humans and the SLM on grammaticality, relevance, creativity, and attractiveness, and (ii) a qualitative linguistic analysis examining the textual characteristics of stories produced by each model. In the first experiment, BART-large outscored average human writers overall (2.11 vs. 1.85), a 14% relative improvement, though the slight human advantage in creativity was not statistically significant. In the second experiment, qualitative analysis showed that while GPT-4o demonstrated near-perfect coherence and used fewer clichéd phrases, it tended to produce more predictable language, with only 3% of its synopses featuring surprising associations (compared to 15% for BART). These findings highlight how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks, and demonstrate that smaller models can, in certain contexts, rival both humans and larger models.

2024

Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?
Guillermo Marco | Julio Gonzalo | M.Teresa Mateo-Girona | Ramón Del Castillo Santos
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Are LLMs ready to compete in creative writing skills with a top (rather than average) novelist? To provide an initial answer to this question, we have carried out a contest between Patricio Pron (an award-winning novelist, considered one of the best of his generation) and GPT-4 (one of the top-performing LLMs), in the spirit of AI-human duels such as Deep Blue vs Kasparov and AlphaGo vs Lee Sedol. We asked Pron and GPT-4 to provide thirty titles each, and then to write short stories for both their own titles and their opponent’s. Then, we prepared an evaluation rubric inspired by Boden’s definition of creativity, and we collected several detailed expert assessments of the texts, provided by literature critics and scholars. The results of our experimentation indicate that LLMs are still far from challenging a top human creative writer. We also observed that GPT-4 writes more creatively using Pron’s titles than its own titles (which is an indication of the potential for human-machine co-creation). Additionally, we found that GPT-4 has a more creative writing style in English than in Spanish.

A Web Portal about the State of the Art of NLP Tasks in Spanish
Enrique Amigó | Jorge Carrillo-de-Albornoz | Andrés Fernández | Julio Gonzalo | Guillermo Marco | Roser Morante | Laura Plaza | Jacobo Pedrosa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper presents a new web portal with information about the state of the art of natural language processing tasks in Spanish. It provides information about forums, competitions, tasks and datasets in Spanish that would otherwise be scattered across multiple articles and web sites. The portal consists of overview pages, where information can be searched and filtered by several criteria, and individual pages with detailed information and hyperlinks to facilitate navigation. Information has been manually curated from publications that describe competitions and NLP tasks from 2013 until 2023 and will be updated as new tasks appear. A total of 185 tasks and 128 datasets from 94 competitions have been introduced.

2020

An Effectiveness Metric for Ordinal Classification: Formal Properties and Experimental Results
Enrique Amigó | Julio Gonzalo | Stefano Mizzaro | Jorge Carrillo-de-Albornoz
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In Ordinal Classification tasks, items have to be assigned to classes that have a relative ordering, such as “positive”, “neutral”, “negative” in sentiment analysis. Remarkably, the most popular evaluation metrics for ordinal classification tasks either ignore relevant information (for instance, precision/recall on each of the classes ignores their relative ordering) or assume additional information (for instance, Mean Absolute Error assumes absolute distances between classes). In this paper we propose a new metric for Ordinal Classification, the Closeness Evaluation Measure, that is rooted in Measurement Theory and Information Theory. Our theoretical analysis and experimental results over both synthetic data and data from NLP shared tasks indicate that the proposed metric captures quality aspects from different traditional tasks simultaneously. In addition, it generalizes some popular classification (nominal scale) and error minimization (interval scale) metrics, depending on the measurement scale in which it is instantiated.

The importance of evaluation in promoting research and development in the information retrieval and natural language processing domains has long been recognised, but is this sufficient? In many areas there is still a considerable gap between the results achieved by the research community and their implementation in commercial applications. This is particularly true for the cross-language or multilingual retrieval areas. Despite the strong demand for and interest in multilingual IR functionality, there are still very few operational systems on offer. The Cross Language Evaluation Forum (CLEF) is now taking steps aimed at changing this situation. The paper provides a critical assessment of the main results achieved by CLEF so far and discusses plans now underway to extend its activities in order to have a more direct impact on the application sector.

2007

The SemEval-2007 WePS Evaluation: Establishing a benchmark for the Web People Search Task
Javier Artiles | Julio Gonzalo | Satoshi Sekine
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

MT Evaluation: Human-Like vs. Human Acceptable
Enrique Amigó | Jesús Giménez | Julio Gonzalo | Lluís Màrquez
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

QARLA: A Framework for the Evaluation of Text Summarization Systems
Enrique Amigó | Julio Gonzalo | Anselmo Peñas | Felisa Verdejo
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

Evaluating DUC 2004 Tasks with the QARLA Framework
Enrique Amigó | Julio Gonzalo | Anselmo Peñas | Felisa Verdejo
Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization

2004

Using syntactic information to extract relevant terms for multi-document summarization
Enrique Amigó | Julio Gonzalo | Víctor Peinado | Anselmo Peñas | Felisa Verdejo
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

The Future of Evaluation for Cross-Language Information Retrieval Systems
Carol Peters | Martin Braschler | Khalid Choukri | Julio Gonzalo | Michael Kluck
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

The objective of the Cross-Language Evaluation Forum (CLEF) is to promote research in the multilingual information access domain. In this short paper, we list the achievements of CLEF during its first four years of activity and describe how the range of tasks has been considerably expanded during this period. The aim of the paper is to demonstrate the importance of evaluation initiatives with respect to system research and development and to show how essential it is for such initiatives to keep abreast of and even anticipate the emerging needs of both system developers and application communities if they are to have a future.

An Empirical Study of Information Synthesis Task
Enrique Amigó | Julio Gonzalo | Víctor Peinado | Anselmo Peñas | Felisa Verdejo
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

2003

Automatic Association of Web Directories with Word Senses
Celina Santamaría | Julio Gonzalo | Felisa Verdejo
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus

2002

A Study of Polysemy and Sense Proximity in the Senseval-2 Test Suite
Irina Chugur | Julio Gonzalo | Felisa Verdejo
Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions

2001

Framework and Results for the Spanish SENSEVAL
German Rigau | Mariona Taulé | Ana Fernandez | Julio Gonzalo
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

The UNED Systems at SENSEVAL-2
David Fernández-Amorós | Julio Gonzalo | Felisa Verdejo
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

2000

Evaluating Wordnets in Cross-language Information Retrieval: the ITEM Search Engine
Felisa Verdejo | Julio Gonzalo | Anselmo Peñas | Fernando López | David Fernández
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Sense clusters for Information Retrieval: Evidence from Semcor and the EuroWordNet InterLingual Index
Julio Gonzalo | Irina Chugur | Felisa Verdejo
ACL-2000 Workshop on Word Senses and Multi-linguality

1999

Towards a Universal Index of Meaning
Piek Vossen | Wim Peters | Julio Gonzalo
SIGLEX99: Standardizing Lexical Resources

Lexical ambiguity and Information Retrieval revisited
Julio Gonzalo | Anselmo Peñas | Felisa Verdejo
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

An Open Distance Learning Web-Course for NLP in IR
Felisa Verdejo | Julio Gonzalo | Anselmo Peñas
EACL 1999: Computer and Internet Supported Education in Language and Speech Technology

1998

Indexing with WordNet synsets can improve text retrieval
Julio Gonzalo | Felisa Verdejo | Irina Chugur | Juan Cigarran
Usage of WordNet in Natural Language Processing Systems

1995

Generic Rules and Non-Constituent Coordination
Julio Gonzalo | Teresa Solías
Proceedings of the Fourth International Workshop on Parsing Technologies

We present a metagrammatical formalism, generic rules, to give a default interpretation to grammar rules. Our formalism introduces a process of dynamic binding that interfaces the level of pure grammatical knowledge representation with the parsing level. We present an approach to non-constituent coordination within categorial grammars, and reformulate it as a generic rule. This reformulation is context-free parsable and drastically reduces the search space associated with the parsing task for such phenomena.
