2024
Extending Off-the-shelf NER Systems to Personal Information Detection in Dialogues with a Virtual Agent: Findings from a Real-Life Use Case
Mario Mina | Carlos Rodríguez | Aitor Gonzalez-Agirre | Marta Villegas
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)
We present the findings and results of our pseudonymisation system, which was developed for a real-life use case involving users and an informative chatbot in the context of the COVID-19 pandemic. In their message exchanges with the chatbot, users provide information about themselves and their residential area, which could easily allow for their re-identification. We create a modular pipeline to detect personally identifiable information (PII) and perform basic deidentification so that the data can be stored while mitigating privacy concerns. The use case presents several challenging aspects, the most difficult of which is the logistical challenge of not being able to directly view or access the data due to the very privacy issues we aim to resolve. Nevertheless, our system achieves a high recall of 0.99, correctly identifying almost all instances of personal data. However, this comes at the expense of precision, which only reaches 0.64. We describe the sensitive information identification in detail, explaining the design principles behind our decisions. We additionally highlight the particular challenges we encountered.
Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan
Aitor Gonzalez-Agirre | Montserrat Marimon | Carlos Rodriguez-Penagos | Javier Aula-Blasco | Irene Baucells | Carme Armentano-Oller | Jorge Palomar-Giner | Baybars Kulebi | Marta Villegas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
LLM-based applications are steadily becoming available to everyone with reliable access to technology and the internet. These applications offer benefits to their users that leave those without access to them at a serious disadvantage. Given the vast amount of data needed to train LLMs, the gap between languages with access to that quantity of data and those without it is currently larger than ever. To bridge this gap, the Aina Project was created to provide Catalan with the resources necessary to remain relevant in the context of AI/NLP applications based on LLMs. We thus present a set of strategies to consider when improving technology support for a mid- or low-resource language, especially addressing the sustainability of high-quality data acquisition and the challenges involved in the process. We also introduce a large amount of new annotated data for Catalan. Our hope is that those interested in replicating this work for another language can learn from what worked for us, the challenges that we faced, and the sometimes disheartening truth of working with mid- and low-resource languages.
2022
ParlamentParla: A Speech Corpus of Catalan Parliamentary Sessions
Baybars Kulebi | Carme Armentano-Oller | Carlos Rodriguez-Penagos | Marta Villegas
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference
Recently, various end-to-end architectures for Automatic Speech Recognition (ASR) have been showcased as an important step towards providing language technologies to all languages instead of a select few such as English. However, many languages still suffer from the “digital gap,” lacking the thousands of hours of openly accessible transcribed speech data needed to train modern ASR architectures. Although Catalan already has access to various open speech corpora, these corpora lack diversity and are limited in total volume. To address this lack of resources for the Catalan language, in this work we present ParlamentParla, a corpus of more than 600 hours of speech from Catalan Parliament sessions. This corpus has already been used to train state-of-the-art ASR systems and proof-of-concept text-to-speech (TTS) models. In this work we explain in detail the pipeline that converts the information publicly available on the parliamentary website into a speech corpus suitable for training ASR and possibly TTS models.
2021
Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan
Jordi Armengol-Estapé | Casimiro Pio Carrino | Carlos Rodriguez-Penagos | Ona de Gibert Bonet | Carme Armentano-Oller | Aitor Gonzalez-Agirre | Maite Melero | Marta Villegas
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
2014
Adapting Freely Available Resources to Build an Opinion Mining Pipeline in Portuguese
Patrik Lambert | Carlos Rodríguez-Penagos
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present a complete UIMA-based pipeline for sentiment analysis in Portuguese news using freely available resources and a minimal set of manually annotated training data. We obtained good precision on binary classification but concluded that news feeds are a challenging environment in which to detect the extent of opinionated text.
2013
FBM: Combining lexicon-based ML and heuristics for Social Media Polarities
Carlos Rodríguez-Penagos | Jordi Atserias Batalla | Joan Codina-Filbà | David García-Narbona | Jens Grivolla | Patrik Lambert | Roser Saurí
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)
2012
A Hybrid Framework for Scalable Opinion Mining in Social Media: Detecting Polarities and Attitude Targets
Carlos Rodríguez-Penagos | Jens Grivolla | Joan Codina-Filba
Proceedings of the Workshop on Semantic Analysis in Social Media
2010
Language Technology Challenges of a ‘Small’ Language (Catalan)
Maite Melero | Gemma Boleda | Montse Cuadros | Cristina España-Bonet | Lluís Padró | Martí Quixal | Carlos Rodríguez | Roser Saurí
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper, we present a brief snapshot of the state of affairs in the computational processing of Catalan and the initiatives that are starting to take place in an effort to move the field a step forward: making better and more efficient use of existing resources and tools, bridging the gap between research and market, and establishing regular meeting points for the community. In particular, we present the results of the First Workshop on the Computational Processing of Catalan, which succeeded in bringing together a fair representation of the research in the area and received attention from both industry and the administration. Aside from facilitating communication among researchers and between developers and users, the Workshop provided the organizers with valuable information about existing resources, tools, developers and providers. This information has allowed us to go a step further by setting up a harvesting procedure which will hopefully form the seed of a portal-catalogue-observatory of language resources and technologies in Catalan.
2004
Mining Metalinguistic Activity in Corpora to Create Lexical Resources Using Information Extraction Techniques: the MOP System
Carlos Rodriguez Penagos
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)
Metalinguistic Information Extraction for Terminology
Carlos Rodríguez Penagos
Proceedings of CompuTerm 2004: 3rd International Workshop on Computational Terminology