Guido Zuccon

2026

AutoBool: Reinforcement-Learned LLM for Effective Automatic Systematic Reviews Boolean Query Generation
Shuai Wang | Harrisen Scells | Bevan Koopman | Guido Zuccon
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We present AutoBool, a reinforcement learning (RL) framework that trains large language models (LLMs) to generate effective Boolean queries for medical systematic reviews. Boolean queries are the primary mechanism for literature retrieval in this domain and must achieve high recall while maintaining reasonable precision—a challenging balance that existing prompt-based LLM approaches often struggle to achieve.A major limitation in this space is the lack of ground-truth best Boolean queries for each topic, which makes supervised fine-tuning impractical. AutoBool addresses this challenge by leveraging RL to directly optimize query generation against retrieval performance metrics, without requiring ideal target queries. To support this effort, we create and release the largest dataset of its kind: 65 588 topics in total for training and evaluating the task of automatic Boolean query formulation.Experiments on our new dataset and two established datasets (CLEF TAR and Seed Collection) show that AutoBool significantly outperforms zero-shot/few-shot prompting and matches or exceeds the effectiveness of much larger GPT-based models (e.g., GPT-4o, O3) using smaller backbones. It also approaches effectiveness of expert-authored queries while retrieving 10–16 times fewer documents. Ablation studies reveal the critical roles of model backbone, size, decoding temperature, and prompt design. Code and data are available at https://github.com/ielab/AutoBool.

2025

pdf bib abs

VISA: Retrieval Augmented Generation with Visual Source Attribution
Xueguang Ma | Shengyao Zhuang | Bevan Koopman | Guido Zuccon | Wenhu Chen | Jimmy Lin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems. However, existing approaches in RAG primarily link generated content to document-level references, making it challenging for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. Leveraging large vision-language models (VLMs), VISA identifies the evidence and highlights the exact regions that support the generated answers with bounding boxes in the retrieved document screenshots. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain. Experimental results demonstrate the effectiveness of VISA for visual source attribution on documents’ original look, as well as highlighting the challenges for improvement.

2024

pdf bib abs

PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval
Shengyao Zhuang | Xueguang Ma | Bevan Koopman | Jimmy Lin | Guido Zuccon
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Utilizing large language models (LLMs) for zero-shot document ranking is done in one of two ways: (1) prompt-based re-ranking methods, which require no further training but are only feasible for re-ranking a handful of candidate documents due to computational costs; and (2) unsupervised contrastive trained dense retrieval methods, which can retrieve relevant documents from the entire corpus but require a large amount of paired text data for contrastive training.In this paper, we propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus. Our method only requires prompts to guide an LLM to generate query and document representations for effective document retrieval. Specifically, we prompt the LLMs to represent a given text using a single word, and then use the last token’s hidden states and the corresponding logits associated with the prediction of the next token to construct a hybrid document retrieval system. The retrieval system harnesses both dense text embedding and sparse bag-of-words representations given by the LLM.Our experimental evaluation on the MSMARCO, TREC deep learning and BEIR zero-shot document retrieval datasets illustrates that this simple prompt-based LLM retrieval method can achieve a similar or higher retrieval effectiveness than state-of-the-art LLM embedding methods that are trained with large amounts of unsupervised data, especially when using a larger LLM.

2023

pdf bib abs

Dr ChatGPT tell me what I want to hear: How different prompts impact health answer correctness
Bevan Koopman | Guido Zuccon
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

This paper investigates the significant impact different prompts have on the behaviour of ChatGPT when used for health information seeking. As people more and more depend on generative large language models (LLMs) like ChatGPT, it is critical to understand model behaviour under different conditions, especially for domains where incorrect answers can have serious consequences such as health. Using the TREC Misinformation dataset, we empirically evaluate ChatGPT to show not just its effectiveness but reveal that knowledge passed in the prompt can bias the model to the detriment of answer correctness. We show this occurs both for retrieve-then-generate pipelines and based on how a user phrases their question as well as the question type. This work has important implications for the development of more robust and transparent question-answering systems based on generative large language models. Prompts, raw result files and manual analysis are made publicly available at https://github.com/ielab/drchatgpt-health_prompting.

pdf bib abs

Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking
Shengyao Zhuang | Bing Liu | Bevan Koopman | Guido Zuccon
Findings of the Association for Computational Linguistics: EMNLP 2023

In the field of information retrieval, Query Likelihood Models (QLMs) rank documents based on the probability of generating the query given the content of a document. Recently, advanced large language models (LLMs) have emerged as effective QLMs, showcasing promising ranking capabilities. This paper focuses on investigating the genuine zero-shot ranking effectiveness of recent LLMs, which are solely pre-trained on unstructured text data without supervised instruction fine-tuning. Our findings reveal the robust zero-shot ranking ability of such LLMs, highlighting that additional instruction fine-tuning may hinder effectiveness unless a question generation task is present in the fine-tuning dataset. Furthermore, we introduce a novel state-of-the-art ranking system that integrates LLM-based QLMs with a hybrid zero-shot retriever, demonstrating exceptional effectiveness in both zero-shot and few-shot scenarios. We make our codebase publicly available at https://github.com/ielab/llm-qlm.

2022

pdf bib abs

Entity Alignment (EA) aims to find equivalent entities between two Knowledge Graphs (KGs). While numerous neural EA models have been devised, they are mainly learned using labelled data only. In this work, we argue that different entities within one KG should have compatible counterparts in the other KG due to the potential dependencies among the entities. Making compatible predictions thus should be one of the goals of training an EA model along with fitting the labelled data: this aspect however is neglected in current methods. To power neural EA models with compatibility, we devise a training framework by addressing three problems: (1) how to measure the compatibility of an EA model; (2) how to inject the property of being compatible into an EA model; (3) how to optimise parameters of the compatibility model. Extensive experiments on widely-used datasets demonstrate the advantages of integrating compatibility within EA models. In fact, state-of-the-art neural EA models trained within our framework using just 5% of the labelled data can achieve comparable effectiveness with supervised training using 20% of the labelled data.

2021

pdf bib

Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association
Afshin Rahimi | William Lane | Guido Zuccon
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association

pdf bib abs

Dealing with Typos for BERT-based Passage Retrieval and Ranking
Shengyao Zhuang | Guido Zuccon
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Passage retrieval and ranking is a key task in open-domain question answering and information retrieval. Current effective approaches mostly rely on pre-trained deep language model-based retrievers and rankers. These methods have been shown to effectively model the semantic matching between queries and passages, also in presence of keyword mismatch, i.e. passages that are relevant to a query but do not contain important query keywords. In this paper we consider the Dense Retriever (DR), a passage retrieval method, and the BERT re-ranker, a popular passage re-ranking method. In this context, we formally investigate how these models respond and adapt to a specific type of keyword mismatch – that caused by keyword typos occurring in queries. Through empirical investigation, we find that typos can lead to a significant drop in retrieval and ranking effectiveness. We then propose a simple typos-aware training framework for DR and BERT re-ranker to address this issue. Our experimental results on the MS MARCO passage ranking dataset show that, with our proposed typos-aware training, DR and BERT re-ranker can become robust to typos in queries, resulting in significantly improved effectiveness compared to models trained without appropriately accounting for typos.

pdf bib abs

ActiveEA: Active Learning for Neural Entity Alignment
Bing Liu | Harrisen Scells | Guido Zuccon | Wen Hua | Genghong Zhao
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Entity Alignment (EA) aims to match equivalent entities across different Knowledge Graphs (KGs) and is an essential step of KG fusion. Current mainstream methods – neural EA models – rely on training with seed alignment, i.e., a set of pre-aligned entity pairs which are very costly to annotate. In this paper, we devise a novel Active Learning (AL) framework for neural EA, aiming to create highly informative seed alignment to obtain more effective EA models with less annotation cost. Our framework tackles two main challenges encountered when applying AL to EA: (1) How to exploit dependencies between entities within the AL strategy. Most AL strategies assume that the data instances to sample are independent and identically distributed. However, entities in KGs are related. To address this challenge, we propose a structure-aware uncertainty sampling strategy that can measure the uncertainty of each entity as well as its impact on its neighbour entities in the KG. (2) How to recognise entities that appear in one KG but not in the other KG (i.e., bachelors). Identifying bachelors would likely save annotation budget. To address this challenge, we devise a bachelor recognizer paying attention to alleviate the effect of sampling bias. Empirical results show that our proposed AL strategy can significantly improve sampling quality with good generality across different datasets, EA models and amount of bachelors.

2016

pdf bib abs

Building Evaluation Datasets for Consumer-Oriented Information Retrieval
Lorraine Goeuriot | Liadh Kelly | Guido Zuccon | Joao Palotti
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Common people often experience difficulties in accessing relevant, correct, accurate and understandable health information online. Developing search techniques that aid these information needs is challenging. In this paper we present the datasets created by CLEF eHealth Lab from 2013-2015 for evaluation of search solutions to support common people finding health information online. Specifically, the CLEF eHealth information retrieval (IR) task of this Lab has provided the research community with benchmarks for evaluating consumer-centered health information retrieval, thus fostering research and development aimed to address this challenging problem. Given consumer queries, the goal of the task is to retrieve relevant documents from the provided collection of web pages. The shared datasets provide a large health web crawl, queries representing people’s real world information needs, and relevance assessment judgements for the queries.

pdf bib

The Benefits of Word Embeddings Features for Active Learning in Clinical Information Extraction
Mahnoosh Kholghi | Lance De Vine | Laurianne Sitbon | Guido Zuccon | Anthony Nguyen
Proceedings of the Australasian Language Technology Association Workshop 2016