pdf
bib
Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models
Alicia Picazo-Izquierdo
|
Ernesto Luis Estevanell-Valladares
|
Ruslan Mitkov
|
Rafael Muñoz Guillena
|
Raúl García Cerdá
pdf
bib
abs
CoVeGAT: A Hybrid LLM & Graph‐Attention Pipeline for Accurate Citation‐Aligned Claim Verification
Max Bader
|
Akshatha Arunkumar
|
Ohan Ahmad
|
Maruf Hassen
|
Charles Duong
|
Vasu Sharma
|
Sean O’Brien
|
Kevin Zhu
Modern LLMs often generate fluent text yet fabricate, misquote, or misattribute evidence. To quantify this flaw, we built a balanced Citation‐Alignment Dataset of 500 genuine, expert‐verified claim–quote pairs and 500 minimally perturbed false variants from news, legal, scientific, and literary sources. We then propose CoVeGAT, which converts claims and citations into SVO triplets (with trigram fallback), scores each pair via an LLM‐driven chain of verification, and embeds them in a weighted semantic graph. A Graph Attention Network over BERT embeddings issues strict pass/fail judgments on alignment. Zero‐shot evaluation of seven top LLMs (e.g., GPT‐4o, Gemini 1.5, Mistral 7B) reveals a trade‐off: decisive models reach 82.5 % accuracy but err confidently, while cautious ones fall below 50 %. A MiniLM + RBF kernel baseline, by contrast, achieves 96.4 % accuracy, underscoring the power of simple, interpretable methods.
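As a concrete illustration of the simple baseline this abstract contrasts with the LLMs, the sketch below pairs MiniLM sentence embeddings with an RBF-kernel SVM. The toy claim-quote pairs and the pair representation (concatenated embeddings plus their absolute difference) are assumptions for illustration, not the authors' exact setup.

# Minimal sketch of a MiniLM + RBF-kernel baseline for claim-quote alignment.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

encoder = SentenceTransformer("all-MiniLM-L6-v2")

claims = ["The court ruled in favour of the plaintiff.",
          "The study reports a 12% rise in emissions.",
          "The novel opens in a small coastal town.",
          "The vaccine reduced infections by half."] * 2
quotes = ["The judgment was entered for the plaintiff.",
          "Emissions increased by 12% over the period.",
          "The story begins in a quiet seaside village.",
          "Infections fell by roughly 50% after vaccination.",
          "The court dismissed the plaintiff's case.",      # minimally perturbed
          "Emissions decreased by 12% over the period.",    # minimally perturbed
          "The story begins in a bustling capital city.",   # minimally perturbed
          "Infections doubled after vaccination."]          # minimally perturbed
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = aligned, 0 = misaligned

c_emb = encoder.encode(claims)
q_emb = encoder.encode(quotes)
X = np.hstack([c_emb, q_emb, np.abs(c_emb - q_emb)])  # assumed pair features

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))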
pdf
bib
abs
A Comparative Study of Vision Transformers and Multimodal Language Models for Violence Detection in Videos
Tomas Ditchfield-Ogle
|
Ruslan Mitkov
This project compares methods for detecting violent videos, which are crucial for ensuring real-time safety in surveillance and digital moderation. It evaluates four approaches: a random forest classifier, a transformer model, and two multimodal vision-language models. The process involves preprocessing datasets, training models, and assessing accuracy, interpretability, scalability, and real-time suitability. Results show that traditional methods are simple but less effective. The transformer model achieved high accuracy, and the multimodal models offered high violence recall with descriptive justifications. The study highlights trade-offs and provides practical insights for the deployment of automated violence detection.
pdf
bib
abs
Detection of AI-generated Content in Scientific Abstracts
Ernesto Luis Estevanell-Valladares
|
Alicia Picazo-Izquierdo
|
Ruslan Mitkov
The growing use of generative AI in academic writing raises urgent questions about authorship and the integrity of scientific communication. This study addresses the detection of AI-generated scientific abstracts by constructing a temporally anchored dataset of paired abstracts: each pair combines a human-written abstract from work published before 2021 with a synthetic counterpart generated using GPT-4.1. We evaluate three approaches to authorship classification: zero-shot large language models (LLMs), fine-tuned encoder-based transformers, and traditional machine learning classifiers. Results show that LLMs perform near chance level, while a LoRA-fine-tuned DistilBERT and a PassiveAggressive classifier achieve near-perfect performance. These findings suggest that shallow lexical or stylistic patterns still differentiate human and AI writing, and that supervised learning is key to capturing these signals.
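As a concrete picture of the traditional-classifier route mentioned in the abstract, the sketch below pairs TF-IDF features with a Passive-Aggressive classifier. The feature settings (word 1-2-grams) and the toy abstracts are assumptions rather than the paper's configuration.

# Minimal sketch: TF-IDF + Passive-Aggressive classifier for human vs AI abstracts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

human = ["We investigate the thermal stability of perovskite films under humidity.",
         "A corpus study of hedging in biomedical abstracts is presented."]
synthetic = ["This paper presents a comprehensive and novel framework for analysis.",
             "We propose an innovative approach that significantly advances the field."]
texts = human + synthetic
labels = [0, 0, 1, 1]  # 0 = human-written, 1 = AI-generated

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    PassiveAggressiveClassifier(max_iter=1000, random_state=0),
)
clf.fit(texts, labels)
print(clf.predict(["We introduce a novel and comprehensive method for detection."]))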
pdf
bib
abs
A Comparative Study of Hyperbole Detection Methods: From Rule-Based Approaches through Deep Learning Models to Large Language Models
Silvia Gargova
|
Nevena Grigorova
|
Ruslan Mitkov
We address hyperbole detection as a binary classification task, comparing rule-based methods, fine-tuned transformers (BERT, RoBERTa), and large language models (LLMs) in zero-shot and few-shot prompting (Gemini, LLaMA). Fine-tuned transformers achieved the best overall performance, with RoBERTa attaining an F1-score of 0.82. Rule-based methods performed lower (F1 = 0.58) but remain effective in constrained linguistic contexts. LLMs showed mixed results: zero-shot performance was variable, while few-shot prompting notably improved outcomes, reaching F1-scores up to 0.79 without task-specific training data. We discuss the trade-offs between interpretability, computational cost, and data requirements across methods. Our results highlight the promise of LLMs in low-resource scenarios and suggest future work on hybrid models and broader figurative language tasks.
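To make the few-shot prompting setting concrete, the sketch below assembles a prompt from a handful of labelled examples. The instruction wording and example sentences are assumptions; the resulting prompt would be sent to an LLM such as Gemini or LLaMA through its own API.

# Minimal sketch of a few-shot prompt for binary hyperbole classification.
FEW_SHOT_EXAMPLES = [
    ("I've told you a million times to close the door.", "hyperbole"),
    ("The meeting starts at nine tomorrow.", "literal"),
    ("My backpack weighs a ton.", "hyperbole"),
]

def build_prompt(sentence: str) -> str:
    lines = ["Decide whether the sentence is a hyperbole or literal.",
             "Answer with exactly one word: hyperbole or literal.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Sentence: {text}\nLabel: {label}\n")
    lines.append(f"Sentence: {sentence}\nLabel:")
    return "\n".join(lines)

print(build_prompt("I'm so hungry I could eat a horse."))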
pdf
bib
abs
Evaluating the Performance of Transformers in Translating Low-Resource Languages through Akkadian
Daniel A. Jones
|
Ruslan Mitkov
In this paper, we evaluate the performance of various fine-tuned, transformer-based models in translating Akkadian into English. Using annotated Akkadian data, we seek to establish potential considerations for developing models for other low-resource languages that do not yet have comparably robust data. The results of this study show the potency, but also the cost inefficiency, of Large Language Models compared to smaller Neural Machine Translation models. We also found significant evidence for the importance of fine-tuning machine translation models from related languages.
pdf
bib
abs
United We Fine-Tune: Structurally Complementary Datasets for Hope Speech Detection
Priya Dharshini Krishnaraj
|
Tulio Ferreira Leite da Silva
|
Gonzalo Freijedo Aduna
|
Samuel Chen
|
Farah Benamara
|
Alda Mari
We propose a fine-tuning strategy for English Multi-class Hope Speech Detection using Mistral, leveraging two complementary datasets: PolyHope and CDB, a new unified framework for hope speech detection. While the former provides nuanced hope-related categories such as GENERALIZED, REALISTIC, and UNREALISTIC HOPE, the latter introduces linguistically grounded dimensions including COUNTERFACTUAL, DESIRE, and BELIEF. By fine-tuning Mistral on both datasets, we enable the model to capture deeper semantic representations of hope. In addition to fine-tuning, we developed advanced prompting strategies which provide interpretable, zero-shot alternatives and further inform annotation and classification designs. Our approach achieved third place in the multi-class (Macro F1=71.77) and sixth in the binary (Macro F1=85.35) settings.
pdf
bib
abs
Does Anaphora Resolution Improve LLM Fine-Tuning for Summarisation?
Yi Chun Lo
|
Ruslan Mitkov
This study investigates whether adding anaphora resolution as a preprocessing step before fine-tuning LLMs for text summarisation can improve the quality of the summary output. Two sets of training runs with the T5-base and BART-large models were conducted on the SAMSum dataset: one uses the original text and the other uses text processed by a simplified version of MARS (Mitkov's Anaphora Resolution System). The experiment reveals that when the T5-base model is fine-tuned on the anaphora-resolved inputs, the ROUGE metrics improve. In contrast, the BART-large model shows only a slight improvement after fine-tuning under the same conditions, which is not statistically significant. Further analysis of the generated summaries indicates that anaphora resolution helps with semantic alignment.
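One way to picture the side-by-side comparison is sketched below: the same dialogue is summarised with and without anaphora-resolved input and scored with ROUGE against a reference. The hand-resolved text stands in for the simplified MARS step, whose implementation is not described in the abstract, and the off-the-shelf T5-base summariser stands in for the fine-tuned models.

# Minimal sketch: compare ROUGE for original vs anaphora-resolved input.
from transformers import pipeline
import evaluate

summarizer = pipeline("summarization", model="t5-base")
rouge = evaluate.load("rouge")

original = ("Amanda: I baked cookies. Do you want some? "
            "Jerry: Sure! Amanda: I'll bring you some tomorrow.")
resolved = ("Amanda: Amanda baked cookies. Does Jerry want some cookies? "
            "Jerry: Sure! Amanda: Amanda will bring Jerry some cookies tomorrow.")
reference = ["Amanda baked cookies and will bring Jerry some tomorrow."]

for name, text in [("original", original), ("anaphora-resolved", resolved)]:
    summary = summarizer(text, max_length=30, min_length=5)[0]["summary_text"]
    scores = rouge.compute(predictions=[summary], references=reference)
    print(name, {k: round(v, 3) for k, v in scores.items()})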
pdf
bib
abs
Transformers and Large Language Models for Hope Speech Detection: A Multilingual Approach
Diana Patricia Madera-Espíndola
|
Zoe Caballero-Domínguez
|
Valeria J. Ramírez-Macías
|
Sabur Butt
|
Hector G. Ceballos
With the rise of Generative AI (GenAI) models in recent years, it is necessary to understand how they perform compared with other Deep Learning techniques, across tasks and across different languages. In this study, we benchmark ChatGPT-4 and XLM-RoBERTa, a multilingual transformer-based model, on the Multilingual Binary and Multiclass Hope Speech Detection tasks of the PolyHope-M 2025 competition. Furthermore, we explored prompting techniques and data augmentation to determine which approach yields the best performance. In our experiments, XLM-RoBERTa frequently outperformed ChatGPT-4. It attained F1 scores of 0.86 for English, 0.83 for Spanish, 0.86 for German, and 0.94 for Urdu in Task 1, while achieving 0.73 for English, 0.70 for Spanish, 0.69 for German, and 0.60 for Urdu in Task 2.
pdf
bib
abs
Beyond BLEU: Ethical Risks of Misleading Evaluation in Domain-Specific QA with LLMs
Ayoub Nainia
|
Régine Vignes-Lebbe
|
Hajar Mousannif
|
Jihad Zahir
Large Language Models (LLMs) are increasingly used in scientific question answering (QA), including high-stakes fields such as biodiversity informatics. However, standard evaluation metrics such as BLEU, ROUGE, Exact Match (EM), and BERTScore remain poorly aligned with the factual and domain-specific requirements of these tasks. In this work, we investigate the gap between automatic metrics and expert judgment in botanical QA by comparing metric scores with human ratings across five dimensions: accuracy, completeness, relevance, fluency, and terminology usage. Our results show that standard metrics often misrepresent response quality, particularly in the presence of paraphrasing, omission, or domain-specific language. Through both quantitative analysis and qualitative examples, we show that high-scoring responses may still exhibit critical factual errors or omissions. These findings highlight the need for domain-aware evaluation frameworks that incorporate expert feedback and raise important ethical concerns about the deployment of LLMs in scientific contexts.
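The metric-versus-judgment gap the abstract describes is easy to reproduce: a paraphrased but factually correct answer can score lower on surface metrics than a fluent answer containing a single factual error. The botanical sentences below are invented for illustration.

# Minimal sketch: surface metrics reward lexical overlap, not factual accuracy.
import evaluate

reference = ["The leaves of this species are opposite, ovate, and 3-5 cm long."]
paraphrase_correct = ["Its ovate leaves, three to five centimetres in length, grow in opposite pairs."]
fluent_wrong = ["The leaves of this species are alternate, ovate, and 3-5 cm long."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
for name, cand in [("paraphrase (correct)", paraphrase_correct),
                   ("fluent (factual error)", fluent_wrong)]:
    b = bleu.compute(predictions=cand, references=[reference])
    r = rouge.compute(predictions=cand, references=reference)
    print(name, "BLEU:", round(b["bleu"], 3), "ROUGE-L:", round(r["rougeL"], 3))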
pdf
bib
abs
From Zero to Hero: Building Serbian NER from Rules to LLMs
Milica Ikonić Nešić
|
Sasa Petalinkar
|
Ranka Stanković
|
Ruslan Mitkov
Named Entity Recognition (NER) presents specific challenges in Serbian, a morphologically rich language. To address these challenges, a comparative evaluation of distinct model paradigms across diverse text genres was conducted. A rule-based system (SrpNER), a traditional deep learning model (Convolutional Neural Network – CNN), fine-tuned transformer architectures (Jerteh and Tesla), and Large Language Models (LLMs), specifically ChatGPT 4.0 Nano and 4.1 Mini, were evaluated and compared. For the LLMs, a one-shot prompt engineering approach was employed, using prompt instructions aligned with the entity type definitions used in the manual annotation guidelines. Evaluation was performed on three Serbian datasets representing varied domains: newspaper articles, history textbook excerpts, and a sample of literary texts from the srpELTeC collection. The highest performance was consistently achieved by the fine-tuned transformer models, with F1 scores ranging from 0.78 on newspaper articles to 0.96 on the primary school history textbook sample.
pdf
bib
abs
Enhancing the Performance of Spoiler Review Detection by a LLM with Hints
Genta Nishi
|
Einoshin Suzuki
We investigate the effects of various hints, including an introduction text, a few examples, and prompting techniques, on the performance of a Large Language Model (LLM) in detecting spoiler reviews of movies. Detecting a spoiler review of a movie is an important Natural Language Processing (NLP) task that resists the Deep Learning (DL) approach due to its highly subjective nature and the scarcity of data. The highly subjective nature is also the main reason for the poor performance of LLM-based methods, which explains their scarcity for the target problem. We address this problem by providing the LLM with an introduction text for the movie and a few reviews with their class labels, as well as equipping it with a prompt that selects and exploits spoiler types with reasoning. Experiments using 400 manually labeled reviews and about 3200 LLM-labeled reviews show that our CAST (Clue And Select Types prompting) outperforms (0.05 higher) or is on par with (only 0.01 lower) cutting-edge LLM-based methods in three out of four movies in ROC-AUC. We believe our study provides evidence of a target problem in which the knowledge-intensive approach outperforms the learning-based approach.
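The sketch below assembles a hint-based prompt in the spirit of the method described: the movie's introduction text, a few labelled example reviews, and an instruction to first select relevant spoiler types before judging. The wording and the spoiler-type list are assumptions, not the paper's exact CAST prompt.

# Minimal sketch of a hint-based spoiler-detection prompt.
SPOILER_TYPES = ["plot twist", "ending", "character fate", "mystery resolution"]

def build_spoiler_prompt(intro: str, examples: list, review: str) -> str:
    parts = [f"Movie introduction:\n{intro}\n", "Labelled example reviews:"]
    for text, label in examples:
        parts.append(f"- Review: {text}\n  Label: {label}")
    parts.append(
        "\nFirst, select which of these spoiler types the new review might reveal: "
        + ", ".join(SPOILER_TYPES)
        + ".\nThen explain your reasoning briefly and answer 'spoiler' or 'not spoiler'.")
    parts.append(f"\nNew review: {review}\nAnswer:")
    return "\n".join(parts)

print(build_spoiler_prompt(
    "A detective investigates a series of disappearances in a small town.",
    [("Great pacing and an atmosphere you can cut with a knife.", "not spoiler"),
     ("I can't believe the detective turned out to be the kidnapper!", "spoiler")],
    "The final scene recontextualises everything that came before."))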
pdf
bib
abs
Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets
Julian Oestreich
|
Lydia Müller
We present a comprehensive evaluation of structured decoding for text-to-table generation with large language models (LLMs). While previous work has primarily focused on unconstrained generation of tables, the impact of enforcing structural constraints during generation remains underexplored. We systematically compare schema-guided (structured) decoding to standard one-shot prompting across three diverse benchmarks - E2E, Rotowire, and Livesum - using open-source LLMs of up to 32B parameters, assessing the performance of table generation approaches in resource-constrained settings. Our experiments cover a wide range of evaluation metrics at cell, row, and table levels. Results demonstrate that structured decoding significantly enhances the validity and alignment of generated tables, particularly in scenarios demanding precise numerical alignment (Rotowire), but may degrade performance in contexts involving densely packed textual information (E2E) or extensive aggregation over lengthy texts (Livesum). We further analyze the suitability of different evaluation metrics and discuss the influence of model size.
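To make the schema-guided setting concrete, the sketch below shows the kind of schema contract a generated table must satisfy and validates a candidate output against it. The schema and example output are invented; actual enforcement during generation would rely on a JSON-schema- or grammar-constrained decoder rather than post-hoc validation.

# Minimal sketch: validate a generated table against a fixed JSON schema.
import json
from jsonschema import validate

TABLE_SCHEMA = {
    "type": "object",
    "properties": {
        "columns": {"type": "array", "items": {"type": "string"}},
        "rows": {"type": "array",
                 "items": {"type": "array",
                           "items": {"type": ["string", "number"]}}},
    },
    "required": ["columns", "rows"],
}

model_output = """
{"columns": ["player", "points", "assists"],
 "rows": [["J. Smith", 27, 8], ["A. Jones", 14, 11]]}
"""

table = json.loads(model_output)
validate(instance=table, schema=TABLE_SCHEMA)  # raises if the structure is invalid
print("valid table with", len(table["rows"]), "rows")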
pdf
bib
abs
Evaluating the LLM and NMT Models in Translating Low-Resourced Languages
Julita JP Pucinskaite
|
Ruslan Mitkov
Machine translation has advanced significantly thanks to the transformer architecture, which underlies many modern deep-learning models. However, low-resource languages such as Lithuanian still face challenges stemming from the limited availability of training data and resource constraints. This study examines the translation capabilities of Neural Machine Translation (NMT) models and Large Language Models (LLMs), comparing their performance on low-resource translation tasks. Furthermore, it assesses the impact of parameter scaling and fine-tuning on model performance. The evaluation showed that while LLMs demonstrated proficiency in low-resource translation, their results were lower than those of the NMT models, which remained consistent across smaller variants. However, as model size increased, the NMT lead became less prominent, a trend reflected in both automatic and human evaluations. Fine-tuning proved to be an effective strategy for enhancing translation accuracy, yielding improvements in vocabulary expansion and structural coherence in both architectures. These findings highlight the importance of diverse datasets, comprehensive model design, and fine-tuning techniques in addressing the challenges of low-resource language translation. This project, one of the first studies to focus on the low-resourced Lithuanian language, aims to contribute to the broader discourse and ongoing efforts to enhance accessibility and inclusivity in Natural Language Processing.
pdf
bib
abs
KGEIR: Knowledge Graph-Enhanced Iterative Reasoning for Multi-Hop Question Answering
Tianda Sun
|
Dimitar Kazakov
Multi-hop question answering (MHQA) requires systems to retrieve and connect information across multiple documents, a task where large language models often struggle. We introduce Knowledge Graph-Enhanced Iterative Reasoning (KGEIR), a framework that dynamically constructs and refines knowledge graphs during question answering to enhance multi-hop reasoning. KGEIR identifies key entities from questions, builds an initial graph from retrieved paragraphs, reasons over this structure, identifies information gaps, and iteratively retrieves additional context to refine the graph until sufficient information is gathered. Evaluations on HotpotQA, 2WikiMultiHopQA, and MuSiQue benchmarks show competitive or superior performance to state-of-the-art methods. Ablation studies confirm that structured knowledge representations significantly outperform traditional prompting approaches like Chain-of-Thought and Tree-of-Thought. KGEIR’s ability to explicitly model entity relationships while addressing information gaps through targeted retrieval offers a promising direction for integrating symbolic and neural approaches to complex reasoning tasks.
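The iterative build-reason-retrieve loop described above can be pictured roughly as follows. The entity extraction, gap detection, and retrieval functions are placeholder stubs; in the actual framework these steps are driven by an LLM and a document retriever.

# Minimal sketch of an iterative graph-construction loop for multi-hop QA.
import networkx as nx

def extract_entities(text):              # stub: the real system uses an LLM
    return {w.strip(".,?") for w in text.split() if w[:1].isupper()}

def retrieve(query, k=3):                # stub: stands in for the document retriever
    return [f"Retrieved Passage mentioning {query}"]

def find_gaps(graph, question_entities): # stub: question entities still missing
    return [e for e in question_entities if e not in graph]

def build_graph(question, max_iters=3):
    graph = nx.Graph()
    q_entities = extract_entities(question)
    context = retrieve(question)
    for _ in range(max_iters):
        for passage in context:
            ents = list(extract_entities(passage)) or ["passage"]
            graph.add_edges_from(zip(ents, ents[1:]))    # crude co-occurrence edges
        gaps = find_gaps(graph, q_entities)
        if not gaps:                                     # enough structure gathered
            break
        context = [p for g in gaps for p in retrieve(g)] # targeted re-retrieval
    return graph

g = build_graph("Which university did the author of Dracula attend?")
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")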
pdf
bib
abs
From Handcrafted Features to LLMs: A Comparative Study in Native Language Identification
Aliyah C. Vanterpool
|
Katsiaryna Aharodnik
This study compares a traditional machine learning feature-engineering approach with a large language model (LLM) fine-tuning method for Native Language Identification (NLI). We explored the COREFL corpus, which consists of L2 English narratives produced by Spanish and German L1 speakers with lower-advanced English proficiency (C1) (Lozano et al., 2020). For the feature-engineering approach, we extracted language productivity, linguistic diversity, and n-gram features for Support Vector Machine (SVM) classification. We also examined sentence embeddings with SVM and logistic regression. For the LLM approach, we evaluated BERT-like models and GPT-4. The feature-engineering approach, particularly n-grams, outperformed the LLMs. Sentence-BERT embeddings with SVM achieved the second-highest accuracy (93%), while GPT-4 reached an average accuracy of 90.4% across three runs when prompted with labels. These findings suggest that feature engineering remains a robust method for NLI, especially for smaller datasets with subtle linguistic differences between classes. This study contributes to the comparative analysis of traditional machine learning and transformer-based LLMs, highlighting current LLM limitations in handling domain-specific data and their need for larger training resources.
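A compact illustration of the n-gram route that performed best is sketched below: word and character n-gram TF-IDF features feeding a linear SVM. The toy learner sentences and feature settings are assumptions, not the COREFL setup.

# Minimal sketch: word + character n-gram features with an SVM for NLI.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

texts = ["Yesterday I have seen a very interesting film in the cinema.",
         "I am living in Madrid since three years.",
         "When I was child, we made always holidays at the sea.",
         "I became this book from my grandmother for Christmas."]
labels = ["es", "es", "de", "de"]  # L1 of the writer (toy labels)

features = make_union(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
)
clf = make_pipeline(features, LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["I am working here since two months."]))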
pdf
bib
abs
Systematic Evaluation of Rule-Based Analytics for LLM-Driven Graph Data Modelling
Fabio Antonio Yanez
|
Andrés Montoyo
|
Armando Suárez
|
Alejandro Piad-Morffis
|
Yudivián Almeida Cruz
This paper presents a novel multi-agent system for automatically generating graph database schemas from tabular data, strategically integrating rule-based analytics with large language models (LLMs). The framework leverages a lightweight rule system to select the most suitable analytic methods based on column data types, providing targeted insights that guide schema generation.
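A lightweight rule system of the kind the abstract describes could look roughly like the sketch below: column data types are mapped to analytic methods whose results would then be handed to the LLM as schema-generation hints. The rules and method names are assumptions for illustration.

# Minimal sketch: select analytic methods per column based on its data type.
import pandas as pd
from pandas.api import types as ptypes

def select_analytics(df: pd.DataFrame) -> dict:
    plan = {}
    for col in df.columns:
        s = df[col]
        if ptypes.is_numeric_dtype(s):
            plan[col] = ["distribution_summary", "correlation_scan"]
        elif ptypes.is_datetime64_any_dtype(s):
            plan[col] = ["temporal_range", "granularity_detection"]
        elif s.nunique(dropna=True) == len(s.dropna()):
            plan[col] = ["key_candidate_check", "cross_table_reference_scan"]
        else:
            plan[col] = ["category_frequency", "co_occurrence_analysis"]
    return plan

df = pd.DataFrame({"order_id": ["A-1", "A-2", "A-3"],
                   "customer": ["ana", "bo", "ana"],
                   "amount": [10.5, 3.2, 7.8],
                   "placed_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02"])})
print(select_analytics(df))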
pdf
bib
abs
Improved Contrastive Learning over Commonsense Knowledge Graphs for Unsupervised Reasoning
Rongwen Zhao
|
Jeffrey Flanigan
Knowledge-augmented methods leverage external resources such as commonsense knowledge graphs (CSKGs) to improve downstream reasoning tasks. Recent work has explored contrastive learning over relation-aware sequence pairs derived from CSKG triples to inject commonsense knowledge into pre-trained language models (PLMs). However, existing approaches suffer from two key limitations: they rely solely on randomly sampled in-batch negatives, overlooking more informative hard negatives, and they ignore additional plausible positives that could strengthen training. Both factors limit the effectiveness of contrastive knowledge learning. In this paper, we propose an enhanced contrastive learning framework for CSKGs that integrates hard negative sampling and positive set expansion. Hard negatives are dynamically selected based on semantic similarity to ensure the model learns from challenging distinctions, while positive set expansion exploits the property that similar head entities often share overlapping tail entities, allowing the recovery of missing positives. We evaluate our method on unsupervised commonsense question answering and inductive CSKG completion using ConceptNet and ATOMIC. Experimental results demonstrate consistent improvements over strong baselines, confirming that our approach yields richer commonsense-aware representations and more effective knowledge injection into PLMs.
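The hard-negative component described above can be pictured with the short PyTorch sketch below: the most similar candidates from a pool are mined as negatives for an InfoNCE-style loss. The encoder, batch construction, and temperature are illustrative assumptions, not the paper's configuration, and in practice known positives would be excluded from the mined negatives.

# Minimal sketch: InfoNCE loss with similarity-mined hard negatives.
import torch
import torch.nn.functional as F

def info_nce_with_hard_negatives(anchor, positive, pool, temperature=0.05, k=4):
    """anchor, positive: (B, d) embeddings; pool: (N, d) candidate negatives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    pool = F.normalize(pool, dim=-1)

    # Mine the k most similar pool items per anchor as hard negatives.
    sim_to_pool = anchor @ pool.T                      # (B, N)
    hard_idx = sim_to_pool.topk(k, dim=-1).indices     # (B, k)
    hard_negs = pool[hard_idx]                         # (B, k, d)

    pos_logits = (anchor * positive).sum(-1, keepdim=True)      # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, hard_negs)  # (B, k)
    logits = torch.cat([pos_logits, neg_logits], dim=-1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long)     # positive sits at index 0
    return F.cross_entropy(logits, targets)

loss = info_nce_with_hard_negatives(torch.randn(8, 32), torch.randn(8, 32), torch.randn(100, 32))
print(float(loss))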