Pablo Gamallo


2024

Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1
Pablo Gamallo | Daniela Claro | António Teixeira | Livy Real | Marcos Garcia | Hugo Gonçalo Oliveira | Raquel Amaro

CorpusNÓS: A massive Galician corpus for training large language models
Iria de-Dios-Flores | Silvia Paniagua Suárez | Cristina Carbajal Pérez | Daniel Bardanca Outeiriño | Marcos Garcia | Pablo Gamallo
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2
Pablo Gamallo | Daniela Claro | António Teixeira | Livy Real | Marcos Garcia | Hugo Gonçalo Oliveira | Raquel Amaro

Training and Fine-Tuning NMT Models for Low-Resource Languages Using Apertium-Based Synthetic Corpora
Aleix Sant | Daniel Bardanca | José Ramom Pichel Campos | Francesca De Luca Fornaciari | Carlos Escolano | Javier Garcia Gilabert | Pablo Gamallo | Audrey Mash | Xixian Liao | Maite Melero
Proceedings of the Ninth Conference on Machine Translation

In this paper, we present the two strategies employed for the WMT24 Shared Task on Translation into Low-Resource Languages of Spain. We participated in the language pairs of Spanish-to-Aragonese, Spanish-to-Aranese, and Spanish-to-Asturian, developing neural-based translation systems and moving away from rule-based approaches for these language directions. To create these models, two distinct strategies were employed. The first strategy involved a thorough cleaning process and curation of the limited provided data, followed by fine-tuning the multilingual NLLB-200-600M model (Constrained Submission). The other strategy involved training a transformer from scratch using a vast amount of synthetic data (Open Submission). Both approaches relied on generated synthetic data and resulted in high ChrF and BLEU scores. However, given the characteristics of the task, the strategy used in the Constrained Submission resulted in higher scores that surpassed the baselines across the three translation directions, whereas the strategy employed in the Open Submission yielded slightly lower scores than the highest baseline.

2022

The Nós Project: Opening routes for the Galician language in the field of language technologies
Iria de-Dios-Flores | Carmen Magariños | Adina Ioana Vladu | John E. Ortega | José Ramom Pichel | Marcos García | Pablo Gamallo | Elisa Fernández Rei | Alberto Bugarín-Diz | Manuel González González | Senén Barro | Xosé Luis Regueira
Proceedings of the Workshop Towards Digital Language Equality within the 13th Language Resources and Evaluation Conference

The development of language technologies (LTs) such as machine translation, text analytics, and dialogue systems is essential in the current digital society, culture and economy. These LTs, widely supported in languages in high demand worldwide, such as English, are also necessary for smaller and less economically powerful languages, as they are a driving force in the democratization of the communities that use them due to their great social and cultural impact. As an example, dialogue systems allow us to communicate with machines in our own language; machine translation increases access to content in different languages, thus facilitating intercultural relations; and text-to-speech and speech-to-text systems broaden different categories of users’ access to technology. In the case of Galician (co-official language, together with Spanish, in the autonomous region of Galicia, located in northwestern Spain), incorporating the language into state-of-the-art AI applications can not only significantly favor its prestige (a decisive factor in language normalization), but also guarantee citizens’ language rights, reduce social inequality, and narrow the digital divide. This is the main motivation behind the Nós Project (Proxecto Nós), which aims to make a significant contribution to the development of LTs in Galician (currently considered a low-resource language) by providing openly licensed resources, tools, and demonstrators in the area of intelligent technologies.

2020

CitiusNLP at SemEval-2020 Task 3: Comparing Two Approaches for Word Vector Contextualization
Pablo Gamallo
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This article describes some unsupervised strategies submitted to SemEval 2020 Task 3, a task which consists of considering the effect of context to compute word similarity. More precisely, given two words in context, the system must predict the degree of similarity of those words considering the context in which they occur, and the system score is compared against human prediction. We compare one approach based on pre-trained BERT models with another strategy relying on static word embeddings and syntactic dependencies. The BERT-based method clearly outperformed the dependency-based strategy.

2019

CiTIUS-COLE at SemEval-2019 Task 5: Combining Linguistic Features to Identify Hate Speech Against Immigrants and Women on Multilingual Tweets
Sattam Almatarneh | Pablo Gamallo | Francisco J. Ribadas Pena
Proceedings of the 13th International Workshop on Semantic Evaluation

This article describes the strategy submitted by the CiTIUS-COLE team to SemEval 2019 Task 5, a binary classification task in which the system predicts whether or not a tweet in English or in Spanish is hateful against women or immigrants. The proposed strategy relies on combining linguistic features to improve the classifier’s performance. More precisely, the method combines textual and lexical features, embedding words with the bag of words in Term Frequency-Inverse Document Frequency (TF-IDF) representation. The system performance reaches about 81% F1 when it is applied to the training dataset, but its F1 drops to 36% on the official test dataset for English and 64% for Spanish concerning the hate speech class.

Contextualized Translations of Phrasal Verbs with Distributional Compositional Semantics and Monolingual Corpora
Pablo Gamallo | Susana Sotelo | José Ramom Pichel | Mikel Artetxe
Computational Linguistics, Volume 45, Issue 3 - September 2019

This article describes a compositional distributional method to generate contextualized senses of words and identify their appropriate translations in the target language using monolingual corpora. Word translation is modeled in the same way as contextualization of word meaning, but in a bilingual vector space. The contextualization of meaning is carried out by means of distributional composition within a structured vector space with syntactic dependencies, and the bilingual space is created by means of transfer rules and a bilingual dictionary. A phrase in the source language, consisting of a head and a dependent, is translated into the target language by selecting both the nearest neighbor of the head given the dependent, and the nearest neighbor of the dependent given the head. This process is expanded to larger phrases by means of incremental composition. Experiments were performed on English and Spanish monolingual corpora in order to translate phrasal verbs in context. A new bilingual data set to evaluate strategies aimed at translating phrasal verbs in restricted syntactic domains has been created and released.

Unsupervised Compositional Translation of Multiword Expressions
Pablo Gamallo | Marcos Garcia
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

This article describes a dependency-based strategy that uses compositional distributional semantics and cross-lingual word embeddings to translate multiword expressions (MWEs). Our unsupervised approach performs translation as a process of word contextualization by taking into account lexico-syntactic contexts and selectional preferences. This strategy is suited to translating phraseological combinations and phrases whose constituent words are lexically restricted by each other. Several experiments on adjective-noun and verb-object compounds show that mutual contextualization (co-compositionality) clearly outperforms other compositional methods. The paper also contributes a new freely available dataset of English-Spanish MWEs used to validate the proposed compositional strategy.

2018

Measuring language distance among historical varieties using perplexity. Application to European Portuguese.
Jose Ramom Pichel Campos | Pablo Gamallo | Iñaki Alegria
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

The objective of this work is to quantify, with a simple and robust measure, the distance between historical varieties of a language. The measure will be inferred from text corpora corresponding to historical periods. Different approaches have been proposed for similar aims: Language Identification, Phylogenetics, Historical Linguistics or Dialectology. In our approach, we used a perplexity-based measure to calculate language distance between all the historical periods of a specific language: European Portuguese. Perplexity has also proven to be a robust metric to calculate distance between languages. However, this measure has not yet been tested to identify diachronic periods within the historical evolution of a specific language. For this purpose, a historical Portuguese corpus has been constructed from different open sources containing texts with spelling close to the original. The results of our experiments show that Portuguese keeps an important degree of homogeneity over time. We expect this metric to serve as a starting point for application to other languages.

CitiusNLP at SemEval-2018 Task 10: The Use of Transparent Distributional Models and Salient Contexts to Discriminate Word Attributes
Pablo Gamallo
Proceedings of the 12th International Workshop on Semantic Evaluation

This article describes the unsupervised strategy submitted by the CitiusNLP team to the SemEval 2018 Task 10, a task which consists of predicting whether a word is a discriminative attribute between two other words. Our strategy relies on the correspondence between discriminative attributes and relevant contexts of a word. More precisely, the method uses transparent distributional models to extract salient contexts of words which are identified as discriminative attributes. The system performance reaches about 70% accuracy when it is applied to the development dataset, but its accuracy goes down (63%) on the official test dataset.

2017

A rule-based system for cross-lingual parsing of Romance languages with Universal Dependencies
Marcos Garcia | Pablo Gamallo
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This article describes MetaRomance, a rule-based cross-lingual parser for Romance languages submitted to the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. The system is an almost delexicalized parser which does not need training data to analyze Romance languages. It contains linguistically motivated rules based on PoS-tag patterns. The rules included in MetaRomance were developed in about 12 hours by one expert with no prior knowledge of Universal Dependencies, and can be easily extended using a transparent formalism. In this paper we compare the performance of MetaRomance with other supervised systems participating in the competition, paying special attention to the parsing of different treebanks of the same language. We also compare our system with a delexicalized parser for Romance languages, and take advantage of the harmonized annotation of Universal Dependencies to propose a language ranking based on the syntactic distance each variety has from Romance languages.

Citius at SemEval-2017 Task 2: Cross-Lingual Similarity from Comparable Corpora and Dependency-Based Contexts
Pablo Gamallo
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This article describes the distributional strategy submitted by the Citius team to the SemEval 2017 Task 2. Even though the team participated in two subtasks, namely monolingual and cross-lingual word similarity, the article is mainly focused on the cross-lingual subtask. Our method uses comparable corpora and syntactic dependencies to extract count-based and transparent bilingual distributional contexts. The evaluation of the results shows that our method is competitive with other cross-lingual strategies, even those using aligned and parallel texts.

A Web Interface for Diachronic Semantic Search in Spanish
Pablo Gamallo | Iván Rodríguez-Torres | Marcos Garcia
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

This article describes a semantic system which is based on distributional models obtained from a chronologically structured language resource, namely Google Books Syntactic Ngrams. The models were created using dependency-based contexts and a strategy for reducing the vector space, which consists in selecting the most informative and relevant word contexts. The system allows linguists to analyze meaning change of Spanish words in the written language across time.

A Perplexity-Based Method for Similar Languages Discrimination
Pablo Gamallo | Jose Ramom Pichel | Iñaki Alegria
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This article describes the system submitted by the Citius_Ixa_Imaxin team to the VarDial 2017 (DSL and GDI tasks). The strategy underlying our system is based on a language distance computed by means of model perplexity. The best model configuration we have tested is a voting system making use of several n-gram models of both words and characters, even if word unigrams turned out to be a very competitive model with reasonable results in the tasks in which we participated. An error analysis has been performed in which we identified many test examples with no linguistic evidence to distinguish among the variants.
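
The idea of discriminating languages by model perplexity can be illustrated with a minimal sketch. It is not the system described in the abstract: it assumes a single character trigram model per language with add-one smoothing (no word n-grams, no voting), and the names `CharNgramModel`, `char_ngrams` and `identify` are hypothetical. A text is assigned to the language whose model gives it the lowest perplexity.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    # Left-pad with spaces so every character has a full (n-1)-char context.
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


class CharNgramModel:
    """Character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, text, n=3):
        self.n = n
        grams = char_ngrams(text, n)
        self.ngram_counts = Counter(grams)
        self.context_counts = Counter(g[:-1] for g in grams)
        # Seen characters plus one slot reserved for unseen characters.
        self.vocab_size = len(set(text)) + 1

    def perplexity(self, text):
        # Assumes a non-empty text.
        grams = char_ngrams(text, self.n)
        log_prob = 0.0
        for g in grams:
            numerator = self.ngram_counts[g] + 1
            denominator = self.context_counts[g[:-1]] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        # Perplexity is the exponential of the average negative log-probability.
        return math.exp(-log_prob / len(grams))


def identify(text, models):
    # Pick the language whose model is least "surprised" by the text.
    return min(models, key=lambda name: models[name].perplexity(text))


# Toy training texts, far too small for real use.
models = {
    "pt": CharNgramModel("o gato come o peixe e o cão dorme na casa"),
    "en": CharNgramModel("the cat eats the fish and the dog sleeps in the house"),
}
```

With these toy models, `identify("the dog eats", models)` selects the English model and `identify("o cão come", models)` the Portuguese one. The same perplexity value, computed between models and texts of two varieties, can also serve as a rough distance between them.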

Compositional Semantics using Feature-Based Models from WordNet
Pablo Gamallo | Martín Pereira-Fariña
Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications

This article describes a method to build semantic representations of composite expressions in a compositional way by using WordNet relations to represent the meaning of words. The meaning of a target word is modelled as a vector in which its semantically related words are assigned weights according to both the type of the relationship and the distance to the target word. Word vectors are compositionally combined by syntactic dependencies. Each syntactic dependency triggers two complementary compositional functions: the so-called head function and dependent function. The experiments show that the proposed compositional method outperforms the state of the art for both intransitive subject-verb and transitive subject-verb-object constructions.

Sense Contextualization in a Dependency-Based Compositional Distributional Model
Pablo Gamallo
Proceedings of the 2nd Workshop on Representation Learning for NLP

Little attention has been paid to distributional compositional methods which employ syntactically structured vector models. As word vectors belonging to different syntactic categories have incompatible syntactic distributions, no trivial compositional operation can be applied to combine them into a new compositional vector. In this article, we generalize the method described by Erk and Padó (2009) by proposing a dependency-based framework that contextualizes not only lemmas but also selectional preferences. The main contribution of the article is to expand their model to a fully compositional framework in which syntactic dependencies are put at the core of semantic composition. We claim that semantic composition is mainly driven by syntactic dependencies. Each syntactic dependency generates two new compositional vectors representing the contextualized sense of the two related lemmas. The sequential application of the compositional operations associated with the dependencies results in as many contextualized vectors as lemmas the composite expression contains. At the end of the semantic process, we do not obtain a single compositional vector representing the semantic denotation of the whole composite expression, but one contextualized vector for each lemma of the whole expression. Our method avoids the troublesome high-order tensor representations by defining lemmas and selectional restrictions as first-order tensors (i.e. standard vectors). A corpus-based experiment is performed both to evaluate the quality of the compositional vectors built with our strategy and to compare them to other approaches to distributional compositional semantics. The experiments show that our dependency-based compositional method performs as well as (or even better than) the state of the art.

2016

Comparing Two Basic Methods for Discriminating Between Similar Languages and Varieties
Pablo Gamallo | Iñaki Alegria | José Ramom Pichel | Manex Agirrezabal
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

This article describes the systems submitted by the Citius_Ixa_Imaxin team to the Discriminating Similar Languages Shared Task 2016. The systems are based on two different strategies: classification with ranked dictionaries and Naive Bayes classifiers. The results of the evaluation show that ranking dictionaries are more sound and stable across different domains, while basic Bayesian models perform reasonably well on in-domain datasets, but their performance drops when they are applied to out-of-domain texts.

TweetMT: A Parallel Microblog Corpus
Iñaki San Vicente | Iñaki Alegría | Cristina España-Bonet | Pablo Gamallo | Hugo Gonçalo Oliveira | Eva Martínez Garcia | Antonio Toral | Arkaitz Zubiaga | Nora Aranberri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus, and present the results of the shared task in which it was tested.

2015

Dependency Parsing with Compression Rules
Pablo Gamallo
Proceedings of the 14th International Conference on Parsing Technologies

2014

TweetNorm_es: an annotated corpus for Spanish microtext normalization
Iñaki Alegria | Nora Aranberri | Pere Comas | Víctor Fresno | Pablo Gamallo | Lluis Padró | Iñaki San Vicente | Jordi Turmo | Arkaitz Zubiaga
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we introduce TweetNorm_es, an annotated corpus of tweets in Spanish, which we make publicly available under the terms of the CC-BY license. This corpus is intended for development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort from different research groups. In this paper we describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, as the first evaluation environment where the corpus was used.

Multilingual corpora with coreferential annotation of person entities
Marcos Garcia | Pablo Gamallo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents three corpora with coreferential annotation of person entities for Portuguese, Galician and Spanish. They contain coreference links between several types of pronouns (including elliptical, possessive, indefinite, demonstrative, relative and personal clitic and non-clitic pronouns) and nominal phrases (including proper nouns). Some statistics have been computed, showing distributional aspects of coreference both in journalistic and in encyclopedic texts. Furthermore, the paper shows the importance of coreference resolution for a task such as Information Extraction, by evaluating the output of an Open Information Extraction system on the annotated corpora. The corpora are freely distributed in two formats: (i) the SemEval-2010 format and (ii) the format of the brat rapid annotation tool, so they can be enlarged and improved collaboratively.

Citius: A Naive-Bayes Strategy for Sentiment Analysis on English Tweets
Pablo Gamallo | Marcos Garcia
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

An Entity-Centric Coreference Resolution System for Person Entities with Rich Linguistic Information
Marcos Garcia | Pablo Gamallo
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2012

Dependency-Based Open Information Extraction
Pablo Gamallo | Marcos Garcia | Santiago Fernández-Lanza
Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP

2011

Dependency-Based Text Compression for Semantic Relation Extraction
Marcos Garcia | Pablo Gamallo
Proceedings of the RANLP 2011 Workshop on Information Extraction and Knowledge Acquisition

Evaluating Various Linguistic Features on Semantic Relation Extraction
Marcos Garcia | Pablo Gamallo
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2005

Clustering Syntactic Positions with Similar Semantic Requirements
Pablo Gamallo | Alexandre Agustini | Gabriel P. Lopes
Computational Linguistics, Volume 31, Number 1, March 2005

2004

Disambiguation and Optional Co-Composition
Pablo Gamallo | Gabriel P. Lopes | Alexandre Agustini
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

This paper describes a specific semantic property underlying binary dependencies: co-composition. We propose a more general definition than that given by Pustejovsky, which we call “optional co-composition”. The aim of the paper is to explore the benefits of optional co-composition in two disambiguation tasks: word sense disambiguation and structural disambiguation. Concerning the second task, some experiments were performed on large corpora.

2002

Using Co-Composition for Acquiring Syntactic and Semantic Subcategorisation
Pablo Gamallo | Alexandre Agustini | Gabriel P. Lopes
Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition