Raúl García-Cerdá

Also published as: Raúl García Cerdá

2025

Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models
Alicia Picazo-Izquierdo | Ernesto Luis Estevanell-Valladares | Ruslan Mitkov | Rafael Muñoz Guillena | Raúl García Cerdá
Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models

pdf bib abs

Building a Lightweight Classifier to Distinguish Closely Related Language Varieties with Limited Supervision: The Case of Catalan vs Valencian
Raúl García-Cerdá | María Miró Maestre | Miquel Canal
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages

Dialectal variation among closely related languages poses a major challenge in low-resource NLP, as their linguistic similarity increases confusability for automatic systems. We introduce the first supervised classifier to distinguish standard Catalan from its regional variety Valencian. Our lightweight approach fine-tunes a RoBERTa-base model on a manually curated corpus of 20 000 sentences—without any Valencian-specific tools—and achieves 98 % accuracy on unseen test data. In a human evaluation of 90 mixed-variety items per reviewer, acceptance rates reached 96.7 % for Valencian and 97.7 % for Catalan (97.2 % overall). We discuss limitations with out-of-distribution inputs and outline future work on confidence calibration and dialect-aware tokenization. Our findings demonstrate that high-impact dialect classification is feasible with minimal resources.

Co-authors

Alicia Picazo-Izquierdo 1

Venues

Fix author