Ioannis Papaioannou

2025

pdf bib abs
AveniBench: Accessible and Versatile Evaluation of Finance Intelligence
Mateusz Klimaszewski | Pinzhen Chen | Liane Guillou | Ioannis Papaioannou | Barry Haddow | Alexandra Birch
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)

Over the last few years, there has been great interest in applying large language models (LLMs) to problems in the finance industry, and the field needs a robust LLM benchmark to support this work. Current financial LLM benchmarks contain simple tasks which are not representative of real use cases and have test sets with licences that do not allow commercial use. In response, we release AveniBench, a permissively licensed benchmark that tests a group of six key finance-related skills: tabular reasoning, numerical reasoning, question answering, long context modelling, summarisation and dialogue. We refactor the test sets to ensure that metrics are comparable, providing a unified framework. Furthermore, AveniBench introduces two task difficulty modes, easy and hard, enabling scalable evaluation based on real-world deployment needs. We use our benchmark to evaluate a diverse set of 20 widely used LLMs, from small open-weight models to proprietary systems like GPT-4. This evaluation initiates our public leaderboard, providing valuable insights for future academic research and commercial development.

2024

AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present , a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate ‘s capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%.Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the advancement of next-generation Embodied AI.

2023

pdf bib abs
The Dangers of trusting Stochastic Parrots: Faithfulness and Trust in Open-domain Conversational Question Answering
Sabrina Chiesurin | Dimitris Dimakopoulos | Marco Antonio Sobrevilla Cabezudo | Arash Eshghi | Ioannis Papaioannou | Verena Rieser | Ioannis Konstas
Findings of the Association for Computational Linguistics: ACL 2023

Large language models are known to produce output which sounds fluent and convincing, but is also often wrong, e.g. “unfaithful” with respect to a rationale as retrieved from a knowledge base. In this paper, we show that task-based systems which exhibit certain advanced linguistic dialog behaviors, such as lexical alignment (repeating what the user said), are in fact preferred and trusted more, whereas other phenomena, such as pronouns and ellipsis are dis-preferred. We use open-domain question answering systems as our test-bed for task based dialog generation and compare several open- and closed-book models. Our results highlight the danger of systems that appear to be trustworthy by parroting user input while providing an unfaithful response.

pdf bib abs
No that’s not what I meant: Handling Third Position Repair in Conversational Question Answering
Vevake Balaraman | Arash Eshghi | Ioannis Konstas | Ioannis Papaioannou
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

The ability to handle miscommunication is crucial to robust and faithful conversational AI. People usually deal with miscommunication immediately as they detect it, using highly systematic interactional mechanisms called repair. One important type of repair is Third Position Repair (TPR) whereby a speaker is initially misunderstood but then corrects the misunderstanding as it becomes apparent after the addressee’s erroneous response. Here, we collect and publicly release REPAIR-QA, the first large dataset of TPRs in a conversational question answering (QA) setting. The data is comprised of the TPR turns, corresponding dialogue contexts, and candidate repairs of the original turn for execution of TPRs. We demonstrate the usefulness of the data by training and evaluating strong baseline models for executing TPRs. For stand-alone TPR execution, we perform both automatic and human evaluations on a fine-tuned T5 model, as well as OpenAI’s GPT-3 LLMs. Additionally, we extrinsically evaluate the LLMs’ TPR processing capabilities in the downstream conversational QA task. The results indicate poor out-of-the-box performance on TPR’s by the GPT-3 models, which then significantly improves when exposed to REPAIR-QA.

2017

pdf bib abs
Sympathy Begins with a Smile, Intelligence Begins with a Word: Use of Multimodal Features in Spoken Human-Robot Interaction
Jekaterina Novikova | Christian Dondrup | Ioannis Papaioannou | Oliver Lemon
Proceedings of the First Workshop on Language Grounding for Robotics

Recognition of social signals, coming from human facial expressions or prosody of human speech, is a popular research topic in human-robot interaction studies. There is also a long line of research in the spoken dialogue community that investigates user satisfaction in relation to dialogue characteristics. However, very little research relates a combination of multimodal social signals and language features detected during spoken face-to-face human-robot interaction to the resulting user perception of a robot. In this paper we show how different emotional facial expressions of human users, in combination with prosodic characteristics of human speech and features of human-robot dialogue, correlate with users’ impressions of the robot after a conversation. We find that happiness in the user’s recognised facial expression strongly correlates with likeability of a robot, while dialogue-related features (such as number of human turns or number of sentences per robot utterance) correlate with perceiving a robot as intelligent. In addition, we show that the facial expression emotional features and prosody are better predictors of human ratings related to perceived robot likeability and anthropomorphism, while linguistic and non-linguistic features more often predict perceived robot intelligence and interpretability. As such, these characteristics may in future be used as an online reward signal for in-situ Reinforcement Learning-based adaptive human-robot dialogue systems.