Stefano Campese

2025

pdf bib abs
Improving Document Retrieval Coherence for Semantically Equivalent Queries
Stefano Campese | Alessandro Moschitti | Ivano Lauriola
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Dense Retrieval (DR) models have proven to be effective for Document Retrieval and Information Grounding tasks. Usually, these models are trained and optimized for improving the relevance of top-ranked documents for a given query. Previous work has shown that popular DR models are sensitive to the query and document lexicon: small variations of it may lead to a significant difference in the set of retrieved documents. In this paper, we propose a variation of the Multi-Negative Ranking loss for training DR that improves the coherence of models in retrieving the same documents with respect to semantically similar queries. The loss penalizes discrepancies between the top-k ranked documents retrieved for diverse but semantically equivalent queries. We conducted extensive experiments on various datasets, MS-MARCO, Natural Questions, BEIR, and TREC DL 19/20. The results show that (i) models optimizes by our loss are subject to lower sensitivity, and, (ii) interestingly, higher accuracy.

pdf bib abs
Analyzing and Improving Coherence of Large Language Models in Question Answering
Ivano Lauriola | Stefano Campese | Alessandro Moschitti
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) have recently revolutionized natural language processing. These models, however, often suffer from instability or lack of coherence, that is the ability of the models to generate semantically equivalent outputs when receiving diverse yet semantically equivalent input variations. In this work, we analyze the behavior of multiple LLMs, including Mixtral-8x7B, Llama2-70b, Smaug-72b, and Phi-3, when dealing with multiple lexical variations of the same info-seeking questions. Our results suggest that various LLMs struggle to consistently answer diverse equivalent queries. To address this issue, we show how redundant information encoded as a prompt can increase the coherence of these models. In addition, we introduce a Retrieval-Augmented Generation (RAG) technique that supplements LLMs with the top-k most similar questions from a question retrieval engine. This knowledge-augmentation leads to 4-8 percentage point improvement in end-to-end performance in factual question answering tasks. These findings underscore the need to enhance LLM stability and coherence through semantic awareness.

2024

pdf bib abs
Pre-Training Methods for Question Reranking
Stefano Campese | Ivano Lauriola | Alessandro Moschitti
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

One interesting approach to Question Answering (QA) is to search for semantically similar questions, which have been answered before. This task is different from answer retrieval as it focuses on questions rather than only on the answers, therefore it requires different model training on different data.In this work, we introduce a novel unsupervised pre-training method specialized for retrieving and ranking questions. This leverages (i) knowledge distillation from a basic question retrieval model, and (ii) new pre-training task and objective for learning to rank questions in terms of their relevance with the query. Our experiments show that (i) the proposed technique achieves state-of-the-art performance on QRC and Quora-match datasets, and (ii) the benefit of combining re-ranking and retrieval models.

pdf bib abs
Datasets for Multilingual Answer Sentence Selection
Matteo Gabburo | Stefano Campese | Federico Agostini | Alessandro Moschitti
Findings of the Association for Computational Linguistics: EMNLP 2024

Answer Sentence Selection (AS2) is a critical task for designing effective retrieval-based Question Answering (QA) systems. Most advancements in AS2 focus on English due to the scarcity of annotated datasets for other languages. This lack of resources prevents the training of effective AS2 models in different languages, creating a performance gap between QA systems in English and other locales. In this paper, we introduce new high-quality datasets for AS2 in five European languages (French, German, Italian, Portuguese, and Spanish), obtained through supervised Automatic Machine Translation (AMT) of existing English AS2 datasets such as ASNQ, WikiQA, and TREC-QA using a Large Language Model (LLM). We evaluated our approach and the quality of the translated datasets through multiple experiments with different Transformer architectures. The results indicate that our datasets are pivotal in producing robust and powerful multilingual AS2 models, significantly contributing to closing the performance gap between English and other languages.

2023

pdf bib abs
QUADRo: Dataset and Models for QUestion-Answer Database Retrieval
Stefano Campese | Ivano Lauriola | Alessandro Moschitti
Findings of the Association for Computational Linguistics: EMNLP 2023

An effective approach to design automated Question Answering (QA) systems is to efficiently retrieve answers from pre-computed databases containing question/answer pairs. One of the main challenges to this design is the lack of training/testing data. Existing resources are limited in size and topics and either do not consider answers (question-question similarity only) or their quality in the annotation process. To fill this gap, we introduce a novel open-domain annotated resource to train and evaluate models for this task. The resource consists of 15,211 input questions. Each question is paired with 30 similar question/answer pairs, resulting in a total of 443,000 annotated examples. The binary label associated with each pair indicates the relevance with respect to the input question. Furthermore, we report extensive experimentation to test the quality and properties of our resource with respect to various key aspects of QA systems, including answer relevance, training strategies, and models input configuration.

Co-authors

Venues

Fix author