Aiala Rosá

Also published as: Aiala Rosa


2024

A Language Model Trained on Uruguayan Spanish News Text
Juan Pablo Filevich | Gonzalo Marco | Santiago Castro | Luis Chiruzzo | Aiala Rosá
Proceedings of the Second International Workshop Towards Digital Language Equality (TDLE): Focusing on Sustainability @ LREC-COLING 2024

This paper presents a language model trained from scratch exclusively on a brand new corpus consisting of about 6 GiB of Uruguayan newspaper text. We trained the model for 30 days on a single Nvidia P100 using the RoBERTa-base architecture but with considerably fewer parameters than other standard RoBERTa models. We evaluated the model on two NLP tasks and found that it outperforms BETO, the widely used Spanish BERT pre-trained model. We also compared our model on the masked-word prediction task with two popular multilingual BERT-based models, Multilingual BERT and XLM-RoBERTa, obtaining outstanding results on sentences from the Uruguayan press domain. Our experiments show that training a language model on a domain-specific corpus can significantly improve performance even when the model is smaller and was trained with significantly less data than more standard pre-trained models.

Automatic Crossword Clues Extraction for Language Learning
Santiago Berruti | Arturo Collazo | Diego Sellanes | Aiala Rosá | Luis Chiruzzo
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Crosswords are a powerful tool that could be used in educational contexts, but they are not that easy to build. In this work, we present experiments on automatically extracting clues from simple texts that could be used to create crosswords, with the aim of using them in the context of teaching English at the beginner level. We present a series of heuristic patterns based on NLP tools for extracting clues, and use them to create a set of 2209 clues from a collection of 400 simple texts. Human annotators labeled the clues, and this dataset is used to evaluate the performance of our heuristics, and also to create a classifier that predicts if an extracted clue is correct. Our best classifier achieves an accuracy of 84%.

RETUYT-INCO at MLSP 2024: Experiments on Language Simplification using Embeddings, Classifiers and Large Language Models
Ignacio Sastre | Leandro Alfonso | Facundo Fleitas | Federico Gil | Andrés Lucas | Tomás Spoturno | Santiago Góngora | Aiala Rosá | Luis Chiruzzo
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

In this paper we present the participation of the RETUYT-INCO team at the BEA-MLSP 2024 shared task. We followed different approaches, from Multilayer Perceptron models with word embeddings to Large Language Models fine-tuned on different datasets: already existing, crowd-annotated, and synthetic. Our best models are based on fine-tuning Mistral-7B, either with a manually annotated dataset or with synthetic data.

Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Atul Kr. Ojha | A. Seza Doğruöz | Harish Tayyar Madabushi | Giovanni Da San Martino | Sara Rosenthal | Aiala Rosá
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

2023

RETUYT-InCo at BEA 2023 Shared Task: Tuning Open-Source LLMs for Generating Teacher Responses
Alexis Baladón | Ignacio Sastre | Luis Chiruzzo | Aiala Rosá
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

This paper presents the results of our participation in the BEA 2023 shared task, which focuses on generating AI teacher responses in educational dialogues. We conducted experiments using several Open-Source Large Language Models (LLMs) and explored fine-tuning techniques along with prompting strategies, including Few-Shot and Chain-of-Thought approaches. Our best model was ranked 4.5 in the competition with a BertScore F1 of 0.71 and a DialogRPT final (avg) of 0.35. Nevertheless, our internal results did not exactly correlate with those obtained in the competition, which showed the difficulty in evaluating this task. Other challenges we faced were data leakage on the train set and the irregular format of the conversations.

Experiments on Automatic Error Detection and Correction for Uruguayan Learners of English
Romina Brown | Santiago Paez | Gonzalo Herrera | Luis Chiruzzo | Aiala Rosá
Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning

2022

Using NLP to Support English Teaching in Rural Schools
Luis Chiruzzo | Laura Musto | Santiago Gongora | Brian Carpenter | Juan Filevich | Aiala Rosa
Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI)

We present a web application for creating games and exercises for teaching English as a foreign language with the help of NLP tools. The application contains different kinds of games such as crosswords, word searches, a memory game, and a multiplayer game based on the classic battleship pen and paper game. This application was built with the aim of supporting teachers in rural schools that are teaching English lessons, so they can easily create interactive and engaging activities for their students. We present the context and history of the project, the current state of the web application, and some ideas on how we will expand it in the future.

2020

HAHA 2019 Dataset: A Corpus for Humor Analysis in Spanish
Luis Chiruzzo | Santiago Castro | Aiala Rosá
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper presents the development of a corpus of 30,000 Spanish tweets that were crowd-annotated with humor value and funniness score. About 38.6% of the tweets in the corpus are humorous, and the humorous tweets have an average score of 2.04 on a scale from 1 to 5. The corpus has been used in an automatic humor recognition and analysis competition, obtaining encouraging results from the participants.

2018

A Crowd-Annotated Spanish Corpus for Humor Analysis
Santiago Castro | Luis Chiruzzo | Aiala Rosá | Diego Garat | Guillermo Moncecchi
Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media

Computational Humor involves several tasks, such as humor recognition, humor generation, and humor scoring, for which it is useful to have human-curated data. In this work we present a corpus of 27,000 tweets written in Spanish and crowd-annotated by their humor value and funniness score, with about four annotations per tweet, tagged by 1,300 people over the Internet. It is equally divided between tweets coming from humorous and non-humorous accounts. The inter-annotator agreement Krippendorff’s alpha value is 0.5710. The dataset is available for general usage and can serve as a basis for humor detection and as a first step to tackle subjectivity.

A High Coverage Method for Automatic False Friends Detection for Spanish and Portuguese
Santiago Castro | Jairo Bonanata | Aiala Rosá
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

False friends are words in two languages that look or sound similar, but have different meanings. They are a common source of confusion among language learners. Methods to detect them automatically do exist; however, they either rely on large aligned bilingual corpora, which are hard to find and expensive to build, or have trouble dealing with infrequent words. In this work we propose a high coverage method that uses word vector representations to build a false friends classifier for any pair of languages, which we apply to the particular case of Spanish and Portuguese. The required resources are a large corpus for each language and a small bilingual lexicon for the pair.

2016

Factuality Annotation and Learning in Spanish Texts
Dina Wonsever | Aiala Rosá | Marisa Malcuori
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a proposal for annotating the factuality of event mentions in Spanish texts, along with a freely available annotated corpus. Our factuality model aims to capture a pragmatic notion of factuality, trying to reflect a casual reader's judgements about the realis / irrealis status of mentioned events. Some learning experiments (SVM and CRF) were also carried out, showing encouraging results.

2010

Opinion Identification in Spanish Texts
Aiala Rosá | Dina Wonsever | Jean-Luc Minel
Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas

2008

Identification automatique de marques d’opinion dans des textes
Aiala Rosá
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

We present a conceptual model for representing opinions, analyzing the elements that compose them and some of their properties. This conceptual model is implemented, and we describe its annotation set. Spanish texts are annotated automatically by applying contextual rules. A first subset of rules has been written to identify some elements of the model. We analyze the first results of applying them.