Diogo Glória-Silva
NOVA
Also published as: Diogo Silva
2026
VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva | David Semedo | Joao Magalhaes
Findings of the Association for Computational Linguistics: EACL 2026
Diogo Glória-Silva | David Semedo | Joao Magalhaes
Findings of the Association for Computational Linguistics: EACL 2026
We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work which focuses mainly on text-only guidance, or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were done on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
AMALIA: A Fully Open Large Language Model for European Portuguese
Afonso Simplício | Gonçalo Vinagre | Miguel Moura Ramos | Diogo Tavares | Rafael Ferreira | Giuseppe Attanasio | Duarte M. Alves | Inês Calvo | Inês Vieira | Rui Guerra | James Furtado | Beatriz Canaverde | Iago Paulo | Vasco Ramos | Diogo Glória-Silva | Miguel Faria | Marcos Treviso | Daniel Gomes | Pedro Gomes | David Semedo | André Martins | João Magalhães
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Afonso Simplício | Gonçalo Vinagre | Miguel Moura Ramos | Diogo Tavares | Rafael Ferreira | Giuseppe Attanasio | Duarte M. Alves | Inês Calvo | Inês Vieira | Rui Guerra | James Furtado | Beatriz Canaverde | Iago Paulo | Vasco Ramos | Diogo Glória-Silva | Miguel Faria | Marcos Treviso | Daniel Gomes | Pedro Gomes | David Semedo | André Martins | João Magalhães
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant’s linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.
ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs
Inês Vieira | Inês Calvo | Iago Paulo | James Furtado | Rafael Ferreira | Diogo Tavares | Diogo Glória-Silva | David Semedo | João Magalhães
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
Inês Vieira | Inês Calvo | Iago Paulo | James Furtado | Rafael Ferreira | Diogo Tavares | Diogo Glória-Silva | David Semedo | João Magalhães
Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1
As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency in linguistic-related tasks in pt-PT across eight linguistic dimensions, including Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of pt-PT generated language. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools in pt-PT.
2024
Show and Guide: Instructional-Plan Grounded Vision and Language Model
Diogo Glória-Silva | David Semedo | Joao Magalhaes
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Diogo Glória-Silva | David Semedo | Joao Magalhaes
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Guiding users through complex procedural plans is an inherently multimodal task in which having visually illustrated plan steps is crucial to deliver an effective plan guidance. However, existing works on plan-following language models (LMs) often are not capable of multimodal input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. Specifically, we bring cross-modality through two key tasks: Conversational Video Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation, where the model generates the next step in a plan, conditioned on an image of the user’s current progress. MM-PlanLLM is trained using a novel multitask-multistage approach, designed to gradually expose the model to multimodal instructional-plans semantic layers, achieving strong performance on both multimodal and textual dialogue in a plan-grounded setting. Furthermore, we show that the model delivers cross-modal temporal and plan-structure representations aligned between textual plan steps and instructional video moments.
Plan-Grounded Large Language Models for Dual Goal Conversational Settings
Diogo Glória-Silva | Rafael Ferreira | Diogo Tavares | David Semedo | Joao Magalhaes
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Diogo Glória-Silva | Rafael Ferreira | Diogo Tavares | David Semedo | Joao Magalhaes
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Training Large Language Models (LLMs) to follow user instructions has shown to supply the LLM with ample capacity to converse fluently while being aligned with humans. Yet, it is not completely clear how an LLM can lead a plan-grounded conversation in mixed-initiative settings where instructions flow in both directions of the conversation, i.e. both the LLM and the user provide instructions to one another. In this paper, we tackle a dual goal mixed-initiative conversational setting where the LLM not only grounds the conversation on an arbitrary plan but also seeks to satisfy both a procedural plan and user instructions. The LLM is then responsible for guiding the user through the plan and, at the same time, adapting to new circumstances, answering questions, and activating safety guardrails when needed. We propose a novel LLM that grounds the dialogue on a procedural plan, can take the dialogue initiative, and enforces guardrails on the system’s behavior, while also improving the LLM’s responses to unexpected user behavior. Experiments in controlled settings and with real users show that the best-performing model, which we call PlanLLM, achieves a 2.1x improvement over a strong baseline. Moreover, experiments also show good generalization to unseen domains.
Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
João Bordalo | Vasco Ramos | Rodrigo Valério | Diogo Glória-Silva | Yonatan Bitton | Michal Yarom | Idan Szpektor | Joao Magalhaes
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
João Bordalo | Vasco Ramos | Rodrigo Valério | Diogo Glória-Silva | Yonatan Bitton | Michal Yarom | Idan Szpektor | Joao Magalhaes
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multistep instructions, such as recipes and how-to guides, greatly benefit from visual aids, such as a series of images that accompany the instruction steps. While Large Language Models (LLMs) have become adept at generating coherent textual steps, Large Vision/Language Models (LVLMs) are less capable of generating accompanying image sequences. The most challenging aspect is that each generated image needs to adhere to the relevant textual step instruction, as well as be visually consistent with earlier images in the sequence. To address this problem, we propose an approach for generating consistent image sequences, which integrates a Latent Diffusion Model (LDM) with an LLM to transform the sequence into a caption to maintain the semantic coherence of the sequence. In addition, to maintain the visual coherence of the image sequence, we introduce a copy mechanism to initialise reverse diffusion processes with a latent vector iteration from a previously generated image from a relevant step. Both strategies will condition the reverse diffusion process on the sequence of instruction steps and tie the contents of the current image to previous instruction steps and corresponding images. Experiments show that the proposed approach is preferred by humans in 46.6% of the cases against 26.6% for the second best method. In addition, automatic metrics showed that the proposed method maintains semantic coherence and visual consistency across steps in both domains.
2022
Polite Task-oriented Dialog Agents: To Generate or to Rewrite?
Diogo Silva | David Semedo | João Magalhães
Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis
Diogo Silva | David Semedo | João Magalhães
Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis
For task-oriented dialog agents, the tone of voice mediates user-agent interactions, playing a central role in the flow of a conversation. Distinct from domain-agnostic politeness constructs, in specific domains such as online stores, booking platforms, and others, agents need to be capable of adopting highly specific vocabulary, with significant impact on lexical and grammatical aspects of utterances. Then, the challenge is on improving utterances’ politeness while preserving the actual content, an utterly central requirement to achieve the task goal. In this paper, we conduct a novel assessment of politeness strategies for task-oriented dialog agents under a transfer learning scenario. We extend existing generative and rewriting politeness approaches, towards overcoming domain-shifting issues, and enabling the transfer of politeness patterns to a novel domain. Both automatic and human evaluation is conducted on customer-store interactions, over the fashion domain, from which contribute with insightful and experimentally supported lessons regarding the improvement of politeness in task-specific dialog agents.
Search
Fix author
Co-authors
- João Magalhães 7
- David Semedo 6
- Rafael Ferreira 3
- Diogo Tavares 3
- Inês Calvo 2
- James Furtado 2
- Iago Paulo 2
- Vasco Ramos 2
- Inês Vieira 2
- Duarte M. Alves 1
- Giuseppe Attanasio 1
- Yonatan Bitton 1
- João Bordalo 1
- Beatriz Canaverde 1
- Miguel Faria 1
- Daniel Gomes 1
- Pedro Gomes 1
- Rui Guerra 1
- André F. T. Martins 1
- Miguel Moura Ramos 1
- Afonso Simplício 1
- Idan Szpektor 1
- Marcos Treviso 1
- Rodrigo Valério 1
- Gonçalo Vinagre 1
- Michal Yarom 1