Stuart E. Middleton


2024

Extracting and Summarizing Evidence of Suicidal Ideation in Social Media Contents Using Large Language Models
Loitongbam Gyanendro Singh | Junyu Mao | Rudra Mutalik | Stuart E. Middleton
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)

This paper explores the use of Large Language Models (LLMs) in analyzing social media content for mental health monitoring, specifically focusing on detecting and summarizing evidence of suicidal ideation. We utilized the LLMs Mixtral-8x7B and Tulu-2-DPO-70B, applying diverse prompting strategies for effective content extraction and summarization. Our methodology included detailed analysis of Few-shot and Zero-shot learning, and evaluated the effectiveness of Chain-of-Thought and Direct prompting strategies. The study achieved notable success in the CLPsych 2024 shared task (ranked first for the evidence extraction task and second for the summarization task), demonstrating the potential of LLMs in mental health interventions and setting a precedent for future research in digital mental health monitoring.
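
The abstract contrasts Direct and Chain-of-Thought prompting under zero-shot and few-shot settings. A minimal sketch of how such prompts might be assembled follows; the templates and the build_prompt() helper are illustrative assumptions, not the authors' exact prompts.

```python
# Direct prompting asks for the answer outright; Chain-of-Thought asks the
# model to reason step by step first. Both can be used zero-shot (no
# demonstrations) or few-shot (with worked examples prepended).

DIRECT_TEMPLATE = (
    "Post: {post}\n"
    "Task: List the sentences in the post that provide evidence of "
    "suicidal ideation. If there is none, answer 'No evidence'.\n"
    "Evidence:"
)

COT_TEMPLATE = (
    "Post: {post}\n"
    "Task: Think step by step. First describe the emotional state expressed "
    "in the post, then list the sentences that provide evidence of suicidal "
    "ideation, then summarise that evidence in one sentence.\n"
    "Reasoning:"
)

def build_prompt(post: str, strategy: str = "direct",
                 examples: list[tuple[str, str]] | None = None) -> str:
    """Assemble a zero-shot (no examples) or few-shot (with examples) prompt."""
    template = COT_TEMPLATE if strategy == "cot" else DIRECT_TEMPLATE
    demos = ""
    for demo_post, demo_answer in examples or []:  # few-shot demonstrations
        demos += template.format(post=demo_post) + " " + demo_answer + "\n\n"
    return demos + template.format(post=post)

print(build_prompt("I can't see a way forward anymore.", strategy="cot"))
```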

Do Prompt Positions Really Matter?
Junyu Mao | Stuart E. Middleton | Mahesan Niranjan
Findings of the Association for Computational Linguistics: NAACL 2024

Prompt-based models have attracted considerable attention from researchers due to their remarkable advances in zero-shot and few-shot learning. Developing an effective prompt template plays a critical role in their performance. However, prior studies have mainly focused on prompt vocabulary search or embedding initialization within a predefined template, with the prompt position fixed. In this empirical study, we conduct the most comprehensive analysis to date of prompt position across diverse Natural Language Processing (NLP) tasks. Our findings quantify the substantial impact prompt position has on model performance. We observe that the prompt positions used in prior studies are often sub-optimal, and that this holds even for widely used instruction-tuned models. These findings suggest prompt position optimisation as a valuable research direction for augmenting prompt engineering methodologies, and prompt position-aware instruction tuning as a potential way to build more robust models in the future.
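
A minimal sketch of the core idea, that prompt position is itself a searchable hyperparameter, is shown below; the task, instruction text, and positional templates are illustrative assumptions, not the templates evaluated in the paper.

```python
# The same instruction can be placed before, after, or between parts of the
# input. Sweeping the position and scoring each variant on a dev set treats
# placement as part of prompt search, alongside wording and initialisation.

INSTRUCTION = "Classify the sentiment of the sentence as positive or negative."

POSITION_TEMPLATES = {
    "before_input": "{instruction}\nSentence: {text}\nAnswer:",
    "after_input":  "Sentence: {text}\n{instruction}\nAnswer:",
    "mid_input":    "Sentence: {text}\nTask. {instruction}\nAnswer:",
}

def render(text: str, position: str) -> str:
    """Render one positional variant of the prompt for a given input."""
    return POSITION_TEMPLATES[position].format(instruction=INSTRUCTION,
                                               text=text)

for pos in POSITION_TEMPLATES:
    print(pos, "->", render("The film was a delight.", pos))
```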

Rationale-based Learning Using Self-Supervised Narrative Events for Text Summarisation of Interactive Digital Narratives
Ashwathy T Revi | Stuart E. Middleton | David E. Millard
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper explores using rationale-based learning with supervised attention to focus the training of text summarisation models on the words and sentences surrounding choice points in Interactive Digital Narratives (IDNs). IDNs let players shape the story through choice points, making choices central to these narratives. Exploiting this knowledge of narrative structure during model training can help ensure that key narrative information appears in generated summaries of narrative-based text, and thus improve the quality of these summaries. We experiment with word-level and sentence-level rationales that encode the proximity of words and sentences to self-supervised choice points. Our results indicate that rationale-based learning can improve the ability of attention-based text summarisation models to create higher-quality summaries that better encode key narrative information across different playthroughs of the same interactive narrative. These results suggest a promising new direction for narrative-based text summarisation models.
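
One way to realise supervised attention with rationales is an auxiliary loss pulling the model's attention towards tokens near choice points. The sketch below is an assumed formulation, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def rationale_attention_loss(attn_weights: torch.Tensor,
                             rationale: torch.Tensor) -> torch.Tensor:
    """KL divergence between model attention and a rationale distribution.

    attn_weights: (batch, seq_len) attention over tokens, rows sum to 1.
    rationale:    (batch, seq_len) non-negative proximity scores, where
                  tokens closer to a choice point score higher.
    """
    target = rationale / rationale.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return F.kl_div(attn_weights.clamp_min(1e-8).log(), target,
                    reduction="batchmean")

# The joint objective adds this term to the usual summarisation loss.
attn = torch.softmax(torch.randn(2, 6), dim=-1)     # stand-in attention
prox = torch.tensor([[0., 1., 2., 1., 0., 0.],      # stand-in proximity scores
                     [1., 0., 0., 0., 2., 1.]])
print(rationale_attention_loss(attn, prox).item())
```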

2023

Augmenting pre-trained language models with audio feature embedding for argumentation mining in political debates
Rafael Mestre | Stuart E. Middleton | Matt Ryan | Masood Gheasi | Timothy Norman | Jiatong Zhu
Findings of the Association for Computational Linguistics: EACL 2023

The integration of multimodality in natural language processing (NLP) tasks seeks to exploit the complementary information contained in two or more modalities, such as text, audio and video. This paper investigates the integration of often under-researched audio features with text, using argumentation mining (AM) as a case study. We take a previously reported dataset and present an audio-enhanced version (the Multimodal USElecDeb60To16 dataset). We report the performance of two text models based on BERT and GloVe embeddings, one audio model (based on a CNN and a Bi-LSTM) and multimodal combinations, on a dataset of 28,850 utterances. The results show that multimodal models do not outperform text-based models when the full dataset is used. However, audio features add value in fully supervised scenarios with limited data: when data is scarce (e.g. 10% of the original dataset), multimodal models yield improved performance, whereas the performance of BERT-based text models drops considerably. Finally, we conduct a study with artificially generated voices and an ablation study to investigate the importance of different audio features in the audio models.
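
A minimal sketch of the kind of late-fusion architecture the abstract describes is shown below; the dimensions and layer choices are assumptions, not the paper's exact configuration. A text embedding (e.g. a BERT [CLS] vector) is concatenated with a Bi-LSTM summary of frame-level audio features before classification.

```python
import torch
import torch.nn as nn

class TextAudioFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=40, hidden=128, n_classes=2):
        super().__init__()
        self.audio_rnn = nn.LSTM(audio_dim, hidden, batch_first=True,
                                 bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + 2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, text_emb, audio_frames):
        # text_emb: (batch, text_dim); audio_frames: (batch, frames, audio_dim)
        _, (h, _) = self.audio_rnn(audio_frames)
        audio_emb = torch.cat([h[-2], h[-1]], dim=-1)  # final fwd/bwd states
        return self.classifier(torch.cat([text_emb, audio_emb], dim=-1))

model = TextAudioFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 100, 40))  # toy batch
print(logits.shape)  # torch.Size([4, 2])
```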

2022

Detecting Moments of Change and Suicidal Risks in Longitudinal User Texts Using Multi-task Learning
Tayyaba Azim | Loitongbam Gyanendro Singh | Stuart E. Middleton
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology

This work describes the classification system proposed for the Computational Linguistics and Clinical Psychology (CLPsych) Shared Task 2022. We propose a multi-task learning approach with a bidirectional long short-term memory (Bi-LSTM) model for predicting changes in a user’s mood and their suicidal risk level. Previously, the two classification tasks have been solved either independently or in an augmented way, where the output of one task is leveraged for learning the other; this work instead proposes an ‘all-in-one’ framework that jointly learns the related mental health tasks. The experimental results suggest that the proposed multi-task framework outperforms the single-task frameworks submitted to the challenge, as evaluated via the timeline-based and coverage-based performance metrics shared by the organisers. We also assess the potential of various feature embedding schemes that could prove useful in initialising the Bi-LSTM model for better multi-task learning in the mental health domain.
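
A minimal sketch of the ‘all-in-one’ idea follows; the layer sizes, pooling, and head shapes are assumptions, not the submitted system. One Bi-LSTM encodes a user's timeline of post embeddings, and two task heads are trained jointly: one predicting a per-post moment-of-change label, one predicting a per-timeline risk level.

```python
import torch
import torch.nn as nn

class MultiTaskBiLSTM(nn.Module):
    def __init__(self, emb_dim=300, hidden=128, n_mood=3, n_risk=4):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.mood_head = nn.Linear(2 * hidden, n_mood)   # per-post labels
        self.risk_head = nn.Linear(2 * hidden, n_risk)   # per-timeline label

    def forward(self, posts):
        # posts: (batch, n_posts, emb_dim) pre-computed post embeddings
        states, _ = self.encoder(posts)                   # (batch, n_posts, 2*hidden)
        mood_logits = self.mood_head(states)              # one prediction per post
        risk_logits = self.risk_head(states.mean(dim=1))  # pooled timeline prediction
        return mood_logits, risk_logits

# The joint loss is a weighted sum of the two task losses, so gradients from
# both tasks shape the shared encoder.
model = MultiTaskBiLSTM()
mood, risk = model(torch.randn(2, 10, 300))
print(mood.shape, risk.shape)  # (2, 10, 3) and (2, 4)
```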

IDN-Sum: A New Dataset for Interactive Digital Narrative Extractive Text Summarisation
Ashwathy T. Revi | Stuart E. Middleton | David E. Millard
Proceedings of The Workshop on Automatic Summarization for Creative Writing

Summarizing Interactive Digital Narratives (IDNs) presents unique challenges to existing text summarization models, especially around capturing interactive elements in addition to important plot points. In this paper, we describe the first IDN dataset (IDN-Sum) designed specifically for training and testing IDN text summarization algorithms. Our dataset is generated using random playthroughs of 8 IDN episodes, taken from 2 different IDN games, and consists of 10,000 documents. Playthrough documents are annotated through automatic alignment with fan-sourced summaries using a commonly used alignment algorithm. We also report and discuss results from experiments applying common baseline extractive text summarization algorithms to this dataset. Qualitative analysis of the results reveals shortcomings in common annotation approaches and evaluation methods when applied to narrative and interactive narrative datasets. The dataset is released as open source for future researchers to train and test their own approaches for IDN text summarization.
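
The abstract names only “a commonly used alignment algorithm”; the sketch below uses a simple token-overlap scorer as a stand-in for it, to illustrate the general annotation scheme: each fan-summary sentence is matched to its most similar playthrough sentence, which is then labelled as extractive ground truth.

```python
def overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def align(playthrough: list[str], summary: list[str]) -> set[int]:
    """Indices of playthrough sentences chosen as extractive labels."""
    return {max(range(len(playthrough)),
                key=lambda i: overlap(playthrough[i], s))
            for s in summary}

doc = ["The hero enters the forest.",
       "A stranger offers a choice.",
       "You choose to follow the stranger."]
print(align(doc, ["The player chooses to follow the stranger."]))  # {2}
```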

2021

M-Arg: Multimodal Argument Mining Dataset for Political Debates with Audio and Transcripts
Rafael Mestre | Razvan Milicin | Stuart E. Middleton | Matt Ryan | Jiatong Zhu | Timothy J. Norman
Proceedings of the 8th Workshop on Argument Mining

Argumentation mining aims to extract, analyse and model people’s arguments, but large, high-quality annotated datasets are limited, and no multimodal datasets exist for this task. In this paper, we present M-Arg, a multimodal argument mining dataset built from a corpus of US 2020 presidential debates and labelled through crowd-sourced annotation. This dataset allows models to be trained to extract arguments from natural dialogue such as debates, using information like the intonation and rhythm of the speaker. Our dataset contains 7 hours of annotated US presidential debates, 6,527 utterances and 4,104 relation labels, and we report results from different baseline models, namely a text-only model, an audio-only model, and multimodal models that extract features from both text and audio. With accuracy reaching 0.86 for multimodal models, we find that audio features provide added value relative to text-only models.
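
A short sketch of extracting the kind of prosodic features the abstract alludes to (intonation and rhythm); the specific features, parameters, and the utterance.wav filename are assumptions, not the paper's exact pipeline. It requires the librosa package and a WAV file of one debate utterance.

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)         # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # timbral envelope
f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0,  # pitch contour
                             sr=sr)                     # (intonation)
rms = librosa.feature.rms(y=y)                          # frame energy (stress/rhythm proxy)
print(mfcc.shape, f0.shape, rms.shape)
```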