Bruno Martins - ACL Anthology

2025

Leveraging LLMs to Streamline the Review of Public Funding Applications
João DS Marques | Andre Vicente Duarte | André Mendes Marques de Carvalho | Gil Rocha | Bruno Martins | Arlindo L. Oliveira
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Every year, the European Union and its member states allocate millions of euros to fund various development initiatives. However, the increasing number of applications received for these programs often creates significant bottlenecks in evaluation processes, due to limited human capacity. In this work, we detail the real-world deployment of AI-assisted evaluation within the pipeline of two government initiatives: (i) corporate applications aimed at international business expansion, and (ii) citizen reimbursement claims for investments in energy-efficient home improvements. While these two cases involve distinct evaluation procedures, our findings confirm that AI effectively enhanced processing efficiency and reduced workload across both types of applications. Specifically, in the citizen reimbursement claims initiative, our solution increased reviewer productivity by 20.1%, while keeping a negligible false-positive rate based on our test set observations. These improvements resulted in an overall reduction of more than 2 months in the total evaluation time, illustrating the impact of AI-driven automation in large-scale evaluation workflows.

From Tower to Spire: Adding the Speech Modality to a Translation-Specialist LLM
Kshitij Ambilduke | Ben Peters | Sonal Sannigrahi | Anil Keshwani | Tsz Kin Lam | Bruno Martins | Andre Martins | Marcely Zanon Boito
Findings of the Association for Computational Linguistics: EMNLP 2025

We introduce Spire, a speech-augmented language model (LM) capable of both translating and transcribing speech input from English into 10 other languages as well as translating text input in both language directions. Spire integrates the speech modality into an existing multilingual LM via speech discretization and continued pre-training using only 42.5K hours of speech. In particular, we adopt the pretraining framework of multilingual LMs and treat discretized speech input as an additional translation language. This approach not only equips the model with speech capabilities, but also preserves its strong text-based performance. We achieve this using significantly less data than existing speech LMs, demonstrating that discretized speech input integration as an additional language is feasible during LM adaptation. We make our code and models available to the community.

Explainable ICD Coding via Entity Linking
Leonor Barreiros | Isabel Coutinho | Gonçalo Correia | Bruno Martins
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)

Clinical coding is a critical task in healthcare, although traditional methods for automating clinical coding may not provide sufficient explicit evidence for coders in production environments. This evidence is crucial, as medical coders have to make sure there exists at least one explicit passage in the input health record that justifies the attribution of a code. We therefore propose to reframe the task as an entity linking problem, in which each document is annotated with its set of codes and respective textual evidence, enabling better human-machine collaboration. By leveraging parameter-efficient fine-tuning of Large Language Models (LLMs), together with constrained decoding, we introduce three approaches to solve this problem that prove effective at disambiguating clinical mentions and that perform well in few-shot scenarios.

A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates
Goncalo Emanuel Cavaco Gomes | Bruno Martins | Chrysoula Zerva
Findings of the Association for Computational Linguistics: ACL 2025

This study explores current limitations of learned image captioning evaluation metrics, specifically the lack of granular assessments for errors within captions, and the reliance on single-point quality estimates without considering uncertainty. To address the limitations, we propose a simple yet effective strategy for generating and calibrating distributions of CLIPScore values. Leveraging a model-agnostic conformal risk control framework, we calibrate CLIPScore values for task-specific control variables, tackling the aforementioned limitations. Experimental results demonstrate that using conformal risk control, over score distributions produced with simple methods such as input masking, can achieve competitive performance compared to more complex approaches. Our method effectively detects erroneous words, while providing formal guarantees aligned with desired risk levels. It also improves the correlation between uncertainty estimations and prediction errors, thus enhancing the overall reliability of caption evaluation metrics.

Evaluation of Multilingual Image Captioning: How far can we get with CLIP models?
Goncalo Emanuel Cavaco Gomes | Chrysoula Zerva | Bruno Martins
Findings of the Association for Computational Linguistics: NAACL 2025

The evaluation of image captions, looking at both linguistic fluency and semantic correspondence to visual contents, has received significant research effort. Still, despite advancements such as the CLIPScore metric, multilingual captioning evaluation has remained relatively unexplored. This work presents several strategies, and extensive experiments, related to evaluating CLIPScore variants in multilingual settings. To address the lack of multilingual test data, we consider two different strategies: (1) using quality-aware machine-translated datasets with human judgements, and (2) re-purposing multilingual datasets that target semantic inference and reasoning. Our results highlight the potential of finetuned multilingual models to generalize across languages and to handle complex linguistic challenges. Tests with machine-translated data show that multilingual CLIPScore models can maintain a high correlation with human judgements across different languages, and additional tests with natively multilingual and multicultural data further attest to the high-quality assessments.

Efficient Architectures for High Resolution Vision-Language Models
Miguel Carvalho | Bruno Martins
Proceedings of the 31st International Conference on Computational Linguistics

Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves a high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.

2024

Accurate and Well-Calibrated ICD Code Assignment Through Attention Over Diverse Label Embeddings
Goncalo Gomes | Isabel Coutinho | Bruno Martins
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Although the International Classification of Diseases (ICD) has been adopted worldwide, manually assigning ICD codes to clinical text is time-consuming, error-prone, and expensive, motivating the development of automated approaches. This paper describes a novel approach for automated ICD coding, combining several ideas from previous related work. We specifically employ a strong Transformer-based model as a text encoder and, to handle lengthy clinical narratives, we explored either (a) adapting the base encoder model into a Longformer, or (b) dividing the text into chunks and processing each chunk independently. The representations produced by the encoder are combined with a label embedding mechanism that explores diverse ICD code synonyms. Experiments with different splits of the MIMIC-III dataset show that the proposed approach outperforms the current state-of-the-art models in ICD coding, with the label embeddings significantly contributing to the good performance. Our approach also leads to properly calibrated classification results, which can effectively inform downstream tasks such as quantification.

PAELLA: Parameter-Efficient Lightweight Language-Agnostic Captioning Model
Rita Ramos | Emanuele Bugliarello | Bruno Martins | Desmond Elliott
Findings of the Association for Computational Linguistics: NAACL 2024

We introduce PAELLA, a Parameter-Efficient Lightweight Language-Agnostic image captioning model designed to be both parameter and data-efficient using retrieval augmentation. The model is trained by learning a small mapping network with 34M parameters between a pre-trained visual model and a multilingual language model that is conditioned on two types of input: (i) the image itself, and (ii) a set of retrieved captions in the target language. The retrieved examples play a key role in guiding the model to generate captions across languages. Through retrieval, the model can be lightweight in terms of the number of trainable parameters, which only exist in its mapping network, and also in the amount of multilingual training data that is required. Experiments on the XM3600 dataset, featuring 36 languages, show that PAELLA can outperform or compete against some models with 3–77× more learned parameters and 35–863× more data, particularly in low-resource languages. We also find that PAELLA can be trained on only monolingual data and still show strong zero-shot abilities in other languages.

Lisbon Computational Linguists at SemEval-2024 Task 2: Using a Mistral-7B Model and Data Augmentation
Artur Guimarães | Bruno Martins | João Magalhães
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper describes our approach to the SemEval-2024 safe biomedical Natural Language Inference for Clinical Trials (NLI4CT) task, which concerns classifying statements about Clinical Trial Reports (CTRs). We explored the capabilities of Mistral-7B, a general-purpose open-source Large Language Model (LLM). We developed a prompt for the NLI4CT task, and fine-tuned a quantized version of the model using a slightly augmented version of the training dataset. The experimental results show that this approach can produce notable results in terms of the macro F1-score, while having limitations in terms of faithfulness and consistency. All the developed code is publicly available on a GitHub repository.

Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval
João Coelho | Bruno Martins | Joao Magalhaes | Jamie Callan | Chenyan Xiong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This study investigates the existence of positional biases in Transformer-based language models for text representation learning, particularly in the context of web document retrieval. We build on previous research that demonstrated loss of information in the middle of input sequences for causal language models, extending it to the domain of embedding learning. We examine positional biases at multiple stages of the training pipeline for an encoder-decoder neural retrieval model, namely language model pre-training, contrastive pre-training, and contrastive fine-tuning. Experiments with the MS-MARCO document collection reveal that after contrastive pre-training the model already generates embeddings that better capture the beginning of the input content, with fine-tuning further aggravating this effect.

2023

Prompting, Retrieval, Training: An exploration of different approaches for task-oriented dialogue generation
Gonçalo Raposo | Luisa Coheur | Bruno Martins
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Task-oriented dialogue systems need to generate appropriate responses to help fulfill users’ requests. This paper explores different strategies, namely prompting, retrieval, and fine-tuning, for task-oriented dialogue generation. Through a systematic evaluation, we aim to provide valuable insights and guidelines for researchers and practitioners working on developing efficient and effective dialogue systems for real-world applications. Evaluation is performed on the MultiWOZ and Taskmaster-2 datasets, and we test various versions of FLAN-T5, GPT-3.5, and GPT-4 models. Costs associated with running these models are analyzed, and dialogue evaluation is briefly discussed. Our findings suggest that when testing data differs from the training data, fine-tuning may decrease performance, favoring a combination of a more general language model and a prompting mechanism based on retrieved examples.

LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting
Rita Ramos | Bruno Martins | Desmond Elliott
Findings of the Association for Computational Linguistics: ACL 2023

Multilingual image captioning has recently been tackled by training with large-scale machine-translated data, which is an expensive, noisy, and time-consuming process. Without requiring any multilingual caption data, we propose LMCap, an image-blind few-shot multilingual captioning model that works by prompting a language model with retrieved captions. Specifically, instead of following the standard encoder-decoder paradigm, given an image, LMCap first retrieves the captions of similar images using a multilingual CLIP encoder. These captions are then combined into a prompt for an XGLM decoder, in order to generate captions in the desired language. In other words, the generation model does not directly process the image; instead, it processes the retrieved captions. Experiments on the XM3600 dataset of geographically diverse images show that our model is competitive with fully-supervised multilingual captioning models, without requiring any supervised training on any captioning data.

Retrieval-augmented Image Captioning
Rita Ramos | Desmond Elliott | Bruno Martins
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Inspired by retrieval-augmented language generation and pretrained Vision and Language (V&L) encoders, we present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore, as opposed to the image alone. The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT, while the decoder attends to the multimodal encoder representations, benefiting from the extra textual evidence from the retrieved captions. Experimental results on the COCO dataset show that image captioning can be effectively formulated from this new perspective. Our model, named EXTRA, benefits from using captions retrieved from the training dataset, and it can also benefit from using an external dataset without the need for retraining. Ablation studies show that retrieving a sufficient number of captions (e.g., k=5) can improve captioning quality. Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.

2022

Dense Template Retrieval for Customer Support
Tiago Mesquita | Bruno Martins | Mariana Almeida
Proceedings of the 29th International Conference on Computational Linguistics

Templated answers are used extensively in customer support scenarios, providing an efficient way to cover a plethora of topics, with an easily maintainable collection of templates. However, the number of templates is often too high for an agent to manually search. Automatically suggesting the correct template for a given question can thus improve service efficiency, reducing costs and leading to better customer satisfaction. In this work, we propose a dense retrieval framework for the customer support scenario, adapting a standard in-batch negatives technique to support unpaired sampling of queries and templates. We also propose a novel loss that extends the typical query-centric similarity, exploiting other similarity relations in the training data. Experiments show that our approach achieves considerable improvements, in terms of performance and training speed, over more standard dense retrieval methods. This includes methods such as DPR, and also ablated versions of the proposed approach.

Annotating Arguments in a Corpus of Opinion Articles
Gil Rocha | Luís Trigo | Henrique Lopes Cardoso | Rui Sousa-Silva | Paula Carvalho | Bruno Martins | Miguel Won
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Interest in argument mining has resulted in an increasing number of argument annotated corpora. However, most focus on English texts with explicit argumentative discourse markers, such as persuasive essays or legal documents. Conversely, we report on the first extensive and consolidated Portuguese argument annotation project focused on opinion articles. We briefly describe the annotation guidelines based on a multi-layered process and analyze the manual annotations produced, highlighting the main challenges of this textual genre. We then conduct a comprehensive inter-annotator agreement analysis, including argumentative discourse units, their classes and relations, and resulting graphs. This analysis reveals that each of these aspects tackles very different kinds of challenges. We observe differences in annotator profiles, motivating our aim of producing a non-aggregated corpus containing the insights of every annotator. We note that the interpretation and identification of token-level arguments is challenging; nevertheless, tasks that focus on higher-level components of the argument structure can obtain considerable agreement. We lay down perspectives on corpus usage, exploiting its multi-faceted nature.

2020

Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
André Martins | Helena Moniz | Sara Fumega | Bruno Martins | Fernando Batista | Luisa Coheur | Carla Parra | Isabel Trancoso | Marco Turchi | Arianna Bisazza | Joss Moorkens | Ana Guerberof | Mary Nurminen | Lena Marg | Mikel L. Forcada
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

2015

INESC-ID: A Regression Model for Large Scale Twitter Sentiment Lexicon Induction
Silvio Amir | Ramon F. Astudillo | Wang Ling | Bruno Martins | Mario J. Silva | Isabel Trancoso
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics
David S. Batista | Bruno Martins | Mário J. Silva
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

INESC-ID: Sentiment Analysis without Hand-Coded Features or Linguistic Resources using Embedding Subspaces
Ramon F. Astudillo | Silvio Amir | Wang Ling | Bruno Martins | Mario J. Silva | Isabel Trancoso
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

ULisboa: Recognition and Normalization of Medical Concepts
André Leal | Bruno Martins | Francisco Couto
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

ULisboa: Identification and Classification of Medical Concepts
André Leal | Diogo Gonçalves | Bruno Martins | Francisco M. Couto
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

TUGAS: Exploiting unlabelled data for Twitter sentiment analysis
Silvio Amir | Miguel B. Almeida | Bruno Martins | João Filgueiras | Mário J. Silva
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2013

REACTION: A naive machine learning approach for sentiment classification
Silvio Moreira | João Filgueiras | Bruno Martins | Francisco Couto | Mário J. Silva
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2011

Named entity translation using anchor texts
Wang Ling | Pável Calado | Bruno Martins | Isabel Trancoso | Alan Black | Luísa Coheur
Proceedings of the 8th International Workshop on Spoken Language Translation: Papers

This work describes a process to extract Named Entity (NE) translations from the text available in web links (anchor texts). It translates an NE by retrieving a list of web documents in the target language, extracting the anchor texts from the links to those documents, and finding the best translation from the anchor texts, using a combination of features, some of which are specific to anchor texts. Experiments performed on a manually built corpus suggest that over 70% of the NEs, ranging from unpopular to popular entities, can be translated correctly using solely anchor texts. Tests on a Machine Translation task indicate that the system can be used to improve the quality of the translations of state-of-the-art statistical machine translation systems.

Co-authors

Desmond Elliott 3

Isabel Coutinho 2

Ramón Fernandez Astudillo 2

João Filgueiras 2

Goncalo Emanuel Cavaco Gomes 2

João Magalhães 2

André F. T. Martins 2

Chrysoula Zerva 2

Mariana Almeida 1

Miguel B. Almeida 1

Kshitij Ambilduke 1

Leonor Barreiros 1

Fernando Batista 1

David S. Batista 1

Arianna Bisazza 1

Alan W. Black 1

Emanuele Bugliarello 1

Pável Calado 1

Miguel Carvalho 1

Paula Carvalho 1

Gonçalo Correia 1

Andre Vicente Duarte 1

Mikel L. Forcada 1

Goncalo Gomes 1

Diogo Gonçalves 1

Ana Guerberof 1

Artur Guimarães 1

Anil Keshwani 1

Henrique Lopes Cardoso 1

João DS Marques 1

Tiago Mesquita 1

Joss Moorkens 1

Silvio Moreira 1

Mary Nurminen 1

Arlindo L. Oliveira 1

Carla Parra Escartín 1

Gonçalo Raposo 1

Sonal Sannigrahi 1

Rui Sousa-Silva 1

Chenyan Xiong 1

Marcely Zanon Boito 1

André Mendes Marques de Carvalho 1

Venues