Enrique Alfonseca


2023

SamToNe: Improving Contrastive Loss for Dual Encoder Retrieval Models with Same Tower Negatives
Fedor Moiseev | Gustavo Hernandez Abrego | Peter Dornbach | Imed Zitouni | Enrique Alfonseca | Zhe Dong
Findings of the Association for Computational Linguistics: ACL 2023

Dual encoders have been used for retrieval tasks and representation learning with good results. A standard way to train dual encoders is using a contrastive loss with in-batch negatives. In this work, we propose an improved contrastive learning objective by adding queries or documents from the same encoder towers to the negatives, which we name “contrastive loss with SAMe TOwer NEgatives” (SamToNe). By evaluating on question answering retrieval benchmarks from MS MARCO and MultiReQA, and on heterogeneous zero-shot information retrieval benchmarks (BEIR), we demonstrate that SamToNe can effectively improve the retrieval quality for both symmetric and asymmetric dual encoders. By directly probing the embedding spaces of the two encoding towers via the t-SNE algorithm (van der Maaten and Hinton, 2008), we observe that SamToNe ensures the alignment between the embedding spaces of the two encoder towers. Based on the analysis of the embedding distance distributions of the top-1 retrieved results, we further explain the efficacy of the method from the perspective of regularisation.
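The abstract describes folding same-tower comparisons (for example, query-to-query similarities) into the denominator of the usual in-batch contrastive softmax. The NumPy sketch below shows what such an objective could look like; the cosine-similarity scoring, the temperature value, and all names are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def samtone_like_loss(q_emb, d_emb, tau=0.05):
    """Contrastive loss with in-batch negatives plus same-tower (query-to-query)
    negatives, sketched from the abstract. q_emb and d_emb are (B, D) L2-normalised
    query/document embeddings; row i of each forms a positive pair."""
    b = q_emb.shape[0]
    qd = q_emb @ d_emb.T / tau                 # query-document similarities, (B, B)
    qq = q_emb @ q_emb.T / tau                 # same-tower query-query similarities, (B, B)
    np.fill_diagonal(qq, -np.inf)              # a query is never its own negative
    logits = np.concatenate([qd, qq], axis=1)  # positives sit on qd's diagonal
    log_prob = qd[np.arange(b), np.arange(b)] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())

# Toy usage with random unit vectors.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(4, 16)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(samtone_like_loss(q, d))
```

Dropping the `qq` term recovers the standard in-batch-negatives contrastive loss, which is the baseline the paper improves on.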

2022

SKILL: Structured Knowledge Infusion for Large Language Models
Fedor Moiseev | Zhe Dong | Enrique Alfonseca | Martin Jaggi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Large language models (LLMs) have demonstrated human-level performance on a vast spectrum of natural language tasks. However, it is largely unexplored whether they can better internalize knowledge from structured data, such as a knowledge graph, or from text. In this work, we propose a method to infuse structured knowledge into LLMs by directly training T5 models on factual triples of knowledge graphs (KGs). We show that models pre-trained on the Wikidata KG with our method outperform the T5 baselines on FreebaseQA and WikiHop, as well as on the Wikidata-answerable subsets of TriviaQA and NaturalQuestions. The models pre-trained on factual triples compare competitively with the ones pre-trained on natural language sentences that contain the same knowledge. Trained on a smaller KG, WikiMovies, we saw a 3x improvement in exact match score on the MetaQA task. The proposed method has the advantage that no alignment between the knowledge graph and a text corpus is required when curating training data. This makes our method particularly useful when working with industry-scale knowledge graphs.
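The abstract's key idea is to pre-train T5 directly on (subject, relation, object) triples rather than on text aligned with a KG. One plausible minimal serialization is to mask the object with a T5 sentinel token and train the model to recover it; the exact input/target format used in the paper may differ, and the triples below are illustrative only.

```python
def triple_to_example(subject: str, relation: str, obj: str) -> dict:
    """Turn one knowledge-graph triple into a seq2seq training pair by masking
    the object with a T5 sentinel token (span-corruption style)."""
    return {
        "input": f"{subject} {relation} <extra_id_0>",
        "target": f"<extra_id_0> {obj}",
    }

# Illustrative Wikidata-style triples, not a sample of the actual training data.
triples = [
    ("Douglas Adams", "educated at", "St John's College"),
    ("The Matrix", "director", "Lana Wachowski"),
]

for s, r, o in triples:
    print(triple_to_example(s, r, o))
```

Because the training examples are built from triples alone, no sentence-level alignment between the KG and a text corpus is needed, which is the practical advantage the abstract highlights.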

Exploring Dual Encoder Architectures for Question Answering
Zhe Dong | Jianmo Ni | Dan Bikel | Enrique Alfonseca | Yuan Wang | Chen Qu | Imed Zitouni
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Dual encoders have been used for question-answering (QA) and information retrieval (IR) tasks with good results. There are two major types of dual encoders: Siamese Dual Encoders (SDE), with parameters shared across the two encoders, and Asymmetric Dual Encoders (ADE), with two distinctly parameterized encoders. In this work, we explore dual encoder architectures for QA retrieval tasks. By evaluating on MS MARCO, open-domain NQ, and the MultiReQA benchmarks, we show that SDE performs significantly better than ADE. We further propose three different improved versions of ADEs. Based on the evaluation of QA retrieval tasks and direct analysis of the embeddings, we demonstrate that sharing parameters in the projection layers enables ADEs to perform competitively with SDEs.
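The abstract distinguishes SDEs (one set of parameters for both towers), ADEs (fully separate parameters), and improved ADE variants that share only the projection layers. The toy sketch below shows where parameters are shared in each case; the layer shapes, the tanh nonlinearity, and all names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(dim_in, dim_out):
    """A toy 'layer': just a randomly initialised weight matrix."""
    return rng.normal(scale=0.02, size=(dim_in, dim_out))

def encode(x, encoder_w, projection_w):
    """Toy encoder: one nonlinear transform followed by a projection, L2-normalised."""
    h = np.tanh(x @ encoder_w) @ projection_w
    return h / np.linalg.norm(h, axis=-1, keepdims=True)

D, P = 16, 8

# Siamese Dual Encoder (SDE): the same parameters encode queries and documents.
sde_enc, sde_proj = linear(D, D), linear(D, P)

# Asymmetric Dual Encoder (ADE): each tower has its own encoder and projection.
q_enc, q_proj = linear(D, D), linear(D, P)
d_enc, d_proj = linear(D, D), linear(D, P)

# ADE with shared projection layers: separate encoders, but a single projection
# maps both towers into one embedding space (the variant the abstract highlights).
shared_proj = linear(D, P)

query, doc = rng.normal(size=(1, D)), rng.normal(size=(1, D))
sde_score = (encode(query, sde_enc, sde_proj) @ encode(doc, sde_enc, sde_proj).T).item()
ade_score = (encode(query, q_enc, q_proj) @ encode(doc, d_enc, d_proj).T).item()
ade_spl_score = (encode(query, q_enc, shared_proj) @ encode(doc, d_enc, shared_proj).T).item()
print(sde_score, ade_score, ade_spl_score)
```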

2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Marta R. Costa-jussà | Enrique Alfonseca
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2015

Sentence Compression by Deletion with LSTMs
Katja Filippova | Enrique Alfonseca | Carlos A. Colmenares | Lukasz Kaiser | Oriol Vinyals
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Idest: Learning a Distributed Representation for Event Patterns
Sebastian Krause | Enrique Alfonseca | Katja Filippova | Daniele Pighin
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Modelling Events through Memory-based, Open-IE Patterns for Abstractive Summarization
Daniele Pighin | Marco Cornolti | Enrique Alfonseca | Katja Filippova
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

HEADY: News headline abstraction through event pattern clustering
Enrique Alfonseca | Daniele Pighin | Guillermo Garrido
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

Pattern Learning for Relation Extraction with a Hierarchical Topic Model
Enrique Alfonseca | Katja Filippova | Jean-Yves Delort | Guillermo Garrido
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

DualSum: a Topic-Model based approach for update summarization
Jean-Yves Delort | Enrique Alfonseca
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2010

Instance Sense Induction from Attribute Sets
Ricardo Martin-Brualla | Enrique Alfonseca | Marius Pasca | Keith Hall | Enrique Robledo-Arnuncio | Massimiliano Ciaramita
Coling 2010: Posters

2009

Gazpacho and summer rash: lexical relationships from temporal patterns of web search queries
Enrique Alfonseca | Massimiliano Ciaramita | Keith Hall
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches
Eneko Agirre | Enrique Alfonseca | Keith Hall | Jana Kravalova | Marius Paşca | Aitor Soroa
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Large-scale Computation of Distributional Similarities for Queries
Enrique Alfonseca | Keith Hall | Silvana Hartmann
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

2008

Decompounding query keywords from compounding languages
Enrique Alfonseca | Slaven Bilac | Stefan Pharies
Proceedings of ACL-08: HLT, Short Papers

2007

Support Vector Machines for Query-focused Summarization trained and evaluated on Pyramid data
Maria Fuentes | Enrique Alfonseca | Horacio Rodríguez
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

Towards Large-scale Non-taxonomic Relation Extraction: Estimating the Precision of Rote Extractors
Enrique Alfonseca | Maria Ruiz-Casado | Manabu Okumura | Pablo Castells
Proceedings of the 2nd Workshop on Ontology Learning and Population: Bridging the Gap between Text and Knowledge

A Rote Extractor with Edit Distance-Based Generalisation and Multi-Corpora Precision Calculation
Enrique Alfonseca | Pablo Castells | Manabu Okumura | Maria Ruiz-Casado
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

Building a Parallel Multilingual Corpus (Arabic-Spanish-English)
Doaa Samy | Antonio Moreno Sandoval | José M. Guirao | Enrique Alfonseca
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents the results of the first phase of ongoing research in the Computational Linguistics Laboratory at the Autónoma University of Madrid (LLI-UAM), aimed at developing a multilingual parallel corpus (Arabic-Spanish-English) aligned at the sentence level and tagged at the POS level. A multilingual parallel corpus bringing together Arabic, Spanish and English is a new resource for the NLP community that completes the present panorama of parallel corpora. In the first part of this study, we introduce the novelty of our approach and the challenges encountered in creating such a corpus. This introductory part highlights the main features of the corpus and the criteria applied during the selection process. The second part focuses on two main stages: basic processing (tokenization and segmentation) and alignment. The alignment methodology is explained in detail, and the results obtained for the three language pairs are compared. POS tagging and the tools used at this stage are discussed in the third part. The final output is available in two versions: non-aligned and aligned. The latter adopts the TMX (Translation Memory eXchange) standard format. Finally, the section on future work points out the key stages in extending the corpus and the studies that could benefit, directly or indirectly, from such a resource.
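The aligned version the abstract mentions uses TMX (Translation Memory eXchange), an XML format in which each translation unit groups the corresponding segments across languages. The sketch below writes one trilingual unit with Python's standard library; the header attributes and example sentences are illustrative and not taken from the corpus.

```python
import xml.etree.ElementTree as ET

tmx = ET.Element("tmx", version="1.4")
ET.SubElement(tmx, "header", {
    "srclang": "en", "adminlang": "en", "segtype": "sentence",
    "datatype": "plaintext", "o-tmf": "plaintext",
    "creationtool": "example", "creationtoolversion": "0.1",
})
body = ET.SubElement(tmx, "body")

# One translation unit holding the same sentence in the three languages.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialised as xml:lang
tu = ET.SubElement(body, "tu")
for lang, text in [("en", "The meeting starts tomorrow."),
                   ("es", "La reunión empieza mañana."),
                   ("ar", "يبدأ الاجتماع غداً.")]:
    tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
    ET.SubElement(tuv, "seg").text = text

print(ET.tostring(tmx, encoding="unicode"))
```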

The wraetlic NLP suite
Enrique Alfonseca | Antonio Moreno-Sandoval | José María Guirao | María Ruiz-Casado
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we describe the second release of wraetlic, a suite of language analysers developed over the last five years, which includes tools for several partial parsing tasks for both English and Spanish. It has been successfully used in fields such as Information Extraction, thesaurus acquisition, Text Summarisation and Computer Assisted Assessment.

2004

Application of the BLEU Method for Evaluating Free-text Answers in an E-learning Environment
Diana Pérez | Enrique Alfonseca | Pilar Rodríguez
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

Improving an Ontology Refinement Method with Hyponymy Patterns
Enrique Alfonseca | Suresh Manandhar
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Proposal for Evaluating Ontology Refinement Methods
Enrique Alfonseca | Suresh Manandhar
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)