Iria de-Dios-Flores

2024

CorpusNÓS: A massive Galician corpus for training large language models
Iria de-Dios-Flores | Silvia Paniagua Suárez | Cristina Carbajal Pérez | Daniel Bardanca Outeiriño | Marcos Garcia | Pablo Gamallo
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

Exploring the effects of vocabulary size in neural machine translation: Galician as a target language
Daniel Bardanca Outeirinho | Pablo Gamallo Otero | Iria de-Dios-Flores | José Ramom Pichel Campos
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

2023

Dependency resolution at the syntax-semantics interface: psycholinguistic and computational insights on control dependencies
Iria de-Dios-Flores | Juan Garcia Amboage | Marcos Garcia
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Using psycholinguistic and computational experiments, we compare the ability of humans and several pre-trained masked language models to correctly identify control dependencies in Spanish sentences such as ‘José le prometió/ordenó a María ser ordenado/a’ (‘Joseph promised/ordered Mary to be tidy’). These structures underlie complex anaphoric and agreement relations at the interface of syntax and semantics, allowing us to study lexically-guided antecedent retrieval processes. Our results show that while humans correctly identify the (un)acceptability of the strings, language models often fail to identify the correct antecedent in non-adjacent dependencies, showing their reliance on linearity. Additional experiments on Galician reinforce these conclusions. Our findings are equally valuable for the evaluation of language models’ ability to capture linguistic generalizations and for psycholinguistic theories of anaphor resolution.

2022

The development of language technologies (LTs) such as machine translation, text analytics, and dialogue systems is essential in the current digital society, culture, and economy. These LTs, widely supported in languages in high demand worldwide, such as English, are also necessary for smaller and less economically powerful languages, as they are a driving force in the democratization of the communities that use them due to their great social and cultural impact. As an example, dialogue systems allow us to communicate with machines in our own language; machine translation increases access to content in different languages, thus facilitating intercultural relations; and text-to-speech and speech-to-text systems broaden different categories of users’ access to technology. In the case of Galician (a co-official language, together with Spanish, in the autonomous region of Galicia, located in northwestern Spain), incorporating the language into state-of-the-art AI applications can not only significantly favor its prestige (a decisive factor in language normalization), but also guarantee citizens’ language rights, reduce social inequality, and narrow the digital divide. This is the main motivation behind the Nós Project (Proxecto Nós), which aims to make a significant contribution to the development of LTs in Galician (currently considered a low-resource language) by providing openly licensed resources, tools, and demonstrators in the area of intelligent technologies.