2022
Is anisotropy really the cause of BERT embeddings not being semantic?
Alejandro Fuster Baggetto, Victor Fresno
Findings of the Association for Computational Linguistics: EMNLP 2022
In this paper we conduct a set of experiments aimed at improving our understanding of the lack of semantic isometry in BERT, i.e. the lack of correspondence between the embedding and meaning spaces of its contextualized word representations. Our empirical results show that, contrary to popular belief, anisotropy is not the root cause of the poor performance of these contextual models’ embeddings in semantic tasks. What does affect both anisotropy and semantic isometry is a set of known biases: frequency, subword, punctuation, and case. For each of them, we measure its magnitude and the effect of its removal, showing that these biases contribute to, but do not completely explain, the anisotropy and lack of semantic isometry of these contextual language models.
Information Theory–based Compositional Distributional Semantics
Enrique Amigó, Alejandro Ariza-Casabona, Victor Fresno, M. Antònia Martí
Computational Linguistics, Volume 48, Issue 4 - December 2022
In the context of text representation, Compositional Distributional Semantics models aim to fuse the Distributional Hypothesis and the Principle of Compositionality. Text embedding is based on co-occurrence distributions and the representations are in turn combined by compositional functions taking into account the text structure. However, the theoretical basis of compositional functions is still an open issue. In this article we define and study the notion of Information Theory–based Compositional Distributional Semantics (ICDS): (i) we first establish formal properties for embedding, composition, and similarity functions based on Shannon’s Information Theory; (ii) we analyze the existing approaches under this prism, checking whether or not they comply with the established desirable properties; (iii) we propose two parameterizable composition and similarity functions that generalize traditional approaches while fulfilling the formal properties; and finally (iv) we perform an empirical study on several textual similarity datasets that include sentences with high and low lexical overlap, and on the similarity between words and their descriptions. Our theoretical analysis and empirical results show that fulfilling formal properties positively affects the accuracy of text representation models in terms of correspondence (isometry) between the embedding and meaning spaces.
2014
TweetNorm_es: an annotated corpus for Spanish microtext normalization
Iñaki Alegria, Nora Aranberri, Pere Comas, Víctor Fresno, Pablo Gamallo, Lluis Padró, Iñaki San Vicente, Jordi Turmo, Arkaitz Zubiaga
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper we introduce TweetNorm_es, an annotated corpus of Spanish-language tweets, which we make publicly available under the terms of the CC-BY license. This corpus is intended for the development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort by different research groups. We describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, the first evaluation environment in which the corpus was used.
A Data Driven Approach for Person Name Disambiguation in Web Search Results
Agustín D. Delgado, Raquel Martínez, Víctor Fresno, Soto Montalvo
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers
2009
Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez
Proceedings of the NAACL HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing
2006
Multilingual Document Clustering: An Heuristic Approach Based on Cognate Named Entities
Soto Montalvo, Raquel Martínez, Arantza Casillas, Víctor Fresno
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics