Diego Molla

Also published as: Diego Mollá Aliod, Diego Molla-Aliod, Diego Mollá, Diego Mollá-Aliod, Diego Mollá Aliod

2026

MULSUM: A Multimodal Summarization System with Vis-Aligner and Diversity-Aware Image Selection
Abid Ali | Diego Molla | Usman Naseem
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The abundance of multimodal news in digital form has intensified demand for systems that condense articles and images into concise, faithful digests. Yet most approaches simply conduct unimodal text summarization and attach the most-similar images with the text summary, which leads to redundancy both in processing visual content as well as in selection of images to complement the summary. We propose MULSUM, a two-step framework: (i) a Cross-Vis Aligner that projects image-level embeddings into a shared space and conditions a pre-trained LLM decoder to generate a visually informed text summary, and (ii) a Diversity-Aware Image Selector that, after the summary is produced, maximizes images-relevance to the summary while enforcing pairwise image diversity, yielding a compact, complementary image set. Experimental results on the benchmark MSMO (Multimodal Summarization with Multimodal Output) corpus show that MULSUM consistently outperforms strong baselines on automatic metrics such as ROUGE, while qualitative inspection shows that selected images act as explanatory evidence rather than ornamental add-ons. Human evaluation results shows that our diverse set of selected images was 13% more helpful than mere similarity-based image selection.

2025

pdf bib abs

Hassles and Uplifts Detection on Social Media Narratives
Jiyu Chen | Sarvnaz Karimi | Diego Molla | Andreas Duenser | Maria Kangas | Cecile Paris
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Hassles and uplifts are psychological constructs of individuals’ positive or negative responses to daily minor incidents, with cumulative impacts on mental health. These concepts are largely overlooked in NLP, where existing tasks and models focus on identifying general sentiment expressed in text. These, however, cannot satisfy targeted information needs in psychological inquiry. To address this, we introduce Hassles and Uplifts Detection (HUD), a novel NLP application to identify these constructs in social media language.We evaluate various language models and task adaptation approaches on a probing dataset collected from a private, real-time emotional venting platform. Some of our models achieve F scores close to 80%. We also identify open opportunities to improve affective language understanding in support of studies in psychology.

pdf bib abs

Can an LLM Elicit Information from Users in Simple Optimization Modelling Dialogues?
Yelaman Abdullin | Diego Mollá | Bahadorreza Ofoghi | Vicky Mak-Hau | John Yearwood
Proceedings of the 23rd Annual Workshop of the Australasian Language Technology Association

For a natural language dialogue system to engage in a goal-oriented conversation, it must elicit information from a user. Research on large language models (LLMs) often focuses on aligning them with user goals. Consequently, studies show these models can serve as chat assistants and answer the user questions. However, their information-elicitation abilities remain understudied. This work evaluates these abilities in goal-oriented dialogues for optimisation modelling. We compare two GPT-4-based settings that generate conversations between a modeller and a user over NL4Opt, a collection of simple optimisation problem descriptions, and analyse the modeller’s information elicitation. In the first, the modeller LLM has access to problem details and asks targeted questions, simulating an informed modeller. In the second, the LLM infers problem details through interaction — asking clarifying questions, interpreting responses, and gradually constructing an understanding of the task. This comparison assesses whether LLMs can elicit information and navigate problem discovery without prior knowledge of the problem. We compare modeller turns in both settings using human raters across criteria at the whole-dialogue and turn levels. Results show that a non-informed LLM can elicit information nearly as well as an informed one, producing high-quality dialogues. In particular, the success levels of both agents in the system without modeller access to the problem details are comparable to those in a system with full access. Dialogues rate well on coherence, and a post-annotation error analysis identified useful types for improving quality. GPT-4’s capability to elicit information in optimisation modelling dialogues suggests newer LLMs may possess even greater capability.

pdf bib abs

CSIRO-LT at SemEval-2025 Task 11: Adapting LLMs for Emotion Recognition for Multiple Languages
Jiyu Chen | Necva Bölücü | Sarvnaz Karimi | Diego Molla | Cecile Paris
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Detecting emotions across different languages is challenging due to the varied and culturally nuanced ways of emotional expressions. The Semeval 2025 Task 11: Bridging the Gap in Text-Based emotion shared task was organised to investigate emotion recognition across different languages. The goal of the task is to implement an emotion recogniser that can identify the basic emotional states that general third-party observers would attribute to an author based on their written text snippet, along with the intensity of those emotions. We report our investigation of various task-adaptation strategies for LLMs in emotion recognition. We show that the most effective method for this task is to fine-tune a pre-trained multilingual LLM for each language.

pdf bib abs

To Labor is Not to Suffer: Exploration of Polarity Association Bias in LLMs for Sentiment Analysis
Jiyu Chen | Sarvnaz Karimi | Diego Molla | Cecile Paris
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Large language models (LLMs) are widely used for modeling sentiment trends on social media text. We examine whether LLMs have a polarity association bias—positive or negative—when encountering specific types of lexical word mentions. Such polarity association bias could lead to the wrong classification of neutral statements and thus a distorted estimation of sentiment trends. We estimate the severity of the polarity association bias across five widely used LLMs, identifying lexical word mentions spanning a diverse range of linguistic and psychological categories that correlate with this bias. Our results show a moderate to strong degree of polarity association bias in these LLMs.

pdf bib abs

Overview of the 2025 ALTA Shared Task: Normalise Adverse Drug Events
Diego Mollá | Xiang Dai | Sarvnaz Karimi | Cécile Paris
Proceedings of the 23rd Annual Workshop of the Australasian Language Technology Association

The ALTA shared tasks have been running annually since 2010. In 2025, the task focuses on the normalisation of Adverse Drug Events (ADE) found in forum posts to their corresponding standard term specified by the Medical Dictionary for Regulatory Activities (MedDRA). This is a comprehensive ontology of ADEs, which contains more ADE descriptions than those mentioned in the available training dataset. This makes the task more challenging than a straightforward supervised classification. We present the task, the evaluation criteria, and the results of the systems participating in the shared task.

2024

pdf bib abs

Using Large Language Models to Evaluate Biomedical Query-Focused Summarisation
Hashem Hijazi | Diego Molla | Vincent Nguyen | Sarvnaz Karimi
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

Biomedical question-answering systems remain popular for biomedical experts interacting with the literature to answer their medical questions. However, these systems are difficult to evaluate in the absence of costly human experts. Therefore, automatic evaluation metrics are often used in this space. Traditional automatic metrics such as ROUGE or BLEU, which rely on token overlap, have shown a low correlation with humans. We present a study that uses large language models (LLMs) to automatically evaluate systems from an international challenge on biomedical semantic indexing and question answering, called BioASQ. We measure the agreement of LLM-produced scores against human judgements. We show that LLMs correlate similarly to lexical methods when using basic prompting techniques. However, by aggregating evaluators with LLMs or by fine-tuning, we find that our methods outperform the baselines by a large margin, achieving a Spearman correlation of 0.501 and 0.511, respectively.

pdf bib abs

Overview of the 2024 ALTA Shared Task: Detect Automatic AI-Generated Sentences for Human-AI Hybrid Articles
Diego Mollá | Qiongkai Xu | Zijie Zeng | Zhuang Li
Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association

The ALTA shared tasks have been running annually since 2010. In 2024, the purpose of the task is to detect machine-generated text in a hybrid setting where the text may contain portions of human text and portions machine-generated. In this paper, we present the task, the evaluation criteria, and the results of the systems participating in the shared task.

pdf bib abs

Exploring Instructive Prompts for Large Language Models in the Extraction of Evidence for Supporting Assigned Suicidal Risk Levels
Jiyu Chen | Vincent Nguyen | Xiang Dai | Diego Molla-Aliod | Cecile Paris | Sarvnaz Karimi
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)

Monitoring and predicting the expression of suicidal risk in individuals’ social media posts is a central focus in clinical NLP. Yet, existing approaches frequently lack a crucial explainability component necessary for extracting evidence related to an individual’s mental health state. We describe the CSIRO Data61 team’s evidence extraction system submitted to the CLPsych 2024 shared task. The task aims to investigate the zero-shot capabilities of open-source LLM in extracting evidence regarding an individual’s assigned suicide risk level from social media discourse. The results are assessed against ground truth evidence annotated by psychological experts, with an achieved recall-oriented BERTScore of 0.919. Our findings suggest that LLMs showcase strong feasibility in the extraction of information supporting the evaluation of suicidal risk in social media discourse. Opportunities for refinement exist, notably in crafting concise and effective instructions to guide the extraction process.

2023

pdf bib abs

Synthetic Dialogue Dataset Generation using LLM Agents
Yelaman Abdullin | Diego Molla | Bahadorreza Ofoghi | John Yearwood | Qingyang Li
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Linear programming (LP) problems are pervasive in real-life applications. However, despite their apparent simplicity, an untrained user may find it difficult to determine the linear model of their specific problem. We envisage the creation of a goal-oriented conversational agent that will engage in conversation with the user to elicit all information required so that a subsequent agent can generate the linear model. In this paper, we present an approach for the generation of sample dialogues that can be used to develop and train such a conversational agent. Using prompt engineering, we develop two agents that “talk” to each other, one acting as the conversational agent, and the other acting as the user. Using a set of text descriptions of linear problems from NL4Opt available to the user only, the agent and the user engage in conversation until the agent has retrieved all key information from the original problem description. We also propose an extrinsic evaluation of the dialogues by assessing how well the summaries generated by the dialogues match the original problem descriptions. We conduct human and automatic evaluations, including an evaluation approach that uses GPT-4 to mimic the human evaluation metrics. The evaluation results show an overall good quality of the dialogues, though research is still needed to improve the quality of the GPT-4 evaluation metrics. The resulting dialogues, including the human annotations of a subset, are available to the research community. The conversational agent used for the generation of the dialogues can be used as a baseline.

pdf bib abs

Overview of the 2023 ALTA Shared Task: Discriminate between Human-Written and Machine-Generated Text
Diego Molla | Haolan Zhan | Xuanli He | Qiongkai Xu
Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association

The ALTA shared tasks have been running annually since 2010. In 2023, the purpose of the task is to build automatic detection systems that can discriminate between human-written and synthetic text generated by Large Language Models (LLM). In this paper we present the task, the evaluation criteria, and the results of the systems participating in the shared task.

pdf bib abs

Exploring Causal Directions through Word Occurrences: Semi-supervised Bayesian Classification Framework
King Tao Jason Ng | Diego Molla
Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association

Determining causal directions in sentences plays a critical role into understanding a cause-and-effect relationship between entities. In this paper, we show empirically that word occurrences from several Internet domains resemble the characteristics of causal directions. Our research contributes to the knowledge of the underlying data generation process behind causal directions. We propose a two-phase method: 1. Bayesian framework, which generates synthetic data from posteriors by incorporating word occurrences from the Internet domains. 2. Pre-trained BERT, which utilises semantics of words based on the context to perform classification. The proposed method achieves an improvement in performance for the Cause-Effect relations of the SemEval-2010 dataset, when compared with random guessing.

2022

pdf bib abs

The Construction and Evaluation of the LEAFTOP Dataset of Automatically Extracted Nouns in 1480 Languages
Gregory Baker | Diego Molla
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The LEAFTOP (language extracted automatically from thousands of passages) dataset consists of nouns that appear in multiple places in the four gospels of the New Testament. We use a naive approach — probabilistic inference — to identify likely translations in 1480 other languages. We evaluate this process and find that it provides lexiconaries with accuracy from 42% (Korafe) to 99% (Runyankole), averaging 72% correct across evaluated languages. The process translates up to 161 distinct lemmas from Koine Greek (average 159). We identify nouns which appear to be easy and hard to translate, language families where this technique works, and future possible improvements and extensions. The claims to novelty are: the use of a Koine Greek New Testament as the source language; using a fully-annotated manually-created grammatically parse of the source text; a custom scraper for texts in the target languages; a new metric for language similarity; a novel strategy for evaluation on low-resource languages.

pdf bib abs

Number Theory Meets Linguistics: Modelling Noun Pluralisation Across 1497 Languages Using 2-adic Metrics
Gregory Baker | Diego Molla
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

A simple machine learning model of pluralisation as a linear regression problem minimising a p-adic metric substantially outperforms even the most robust of Euclidean-space regressors on languages in the Indo-European, Austronesian, Trans New-Guinea, Sino-Tibetan, Nilo-Saharan, Oto-Meanguean and Atlantic-Congo language families. There is insufficient evidence to support modelling distinct noun declensions as a p-adic neighbourhood even in Indo-European languages.

pdf bib abs

Overview of the 2022 ALTA Shared task: PIBOSO sentence classification, 10 years later
Diego Mollá
Proceedings of the 20th Annual Workshop of the Australasian Language Technology Association

The 2022 ALTA shared task has been running annually since 2010. This year, the shared task is a re-visit of the 2012 ALTA shared task. The purpose of this task is to classify sentences of medical publications using the PIBOSO taxonomy. This is a multi-label classification task which can help medical researchers and practitioners conduct Evidence Based Medicine (EBM). In this paper we present the task, the evaluation criteria, and the results of the systems participating in the shared task.

2021

pdf bib abs

Vent is a specialised iOS/Android social media platform with the stated goal to encourage people to post about their feelings and explicitly label them. In this paper, we study a snapshot of more than 100 million messages obtained from the developers of Vent, together with the labels assigned by the authors of the messages. We establish the quality of the self-annotated data by conducting a qualitative analysis, a vocabulary based analysis, and by training and testing an emotion classifier. We conclude that the self-annotated labels of our corpus are indeed indicative of the emotional contents expressed in the text and thus can support more detailed analyses of emotion expression on social media, such as emotion trajectories and factors influencing them.

pdf bib abs

Overview of the 2021 ALTA Shared Task: Automatic Grading of Evidence, 10 years later
Diego Mollá
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association

The 2021 ALTA shared task is the 12th instance of a series of shared tasks organised by ALTA since 2010. Motivated by the advances in machine learning in the last 10 years, this yearï¿½s task is a re-visit of the 2011 ALTA shared task. Set within the framework of Evidence Based Medicine (EBM), the goal is to predict the qual-ity of the clinical evidence present in a set of documents. This yearï¿½s participant results didnot improve over those of participants from 2011.

2020

pdf bib abs

Benchmarking of Transformer-Based Pre-Trained Models on Social Media Text Classification Datasets
Yuting Guo | Xiangjue Dong | Mohammed Ali Al-Garadi | Abeed Sarker | Cecile Paris | Diego Mollá Aliod
Proceedings of the 18th Annual Workshop of the Australasian Language Technology Association

Free text data from social media is now widely used in natural language processing research, and one of the most common machine learning tasks performed on this data is classification. Generally speaking, performances of supervised classification algorithms on social media datasets are lower than those on texts from other sources, but recently-proposed transformer-based models have considerably improved upon legacy state-of-the-art systems. Currently, there is no study that compares the performances of different variants of transformer-based models on a wide range of social media text classification datasets. In this paper, we benchmark the performances of transformer-based pre-trained models on 25 social media text classification datasets, 6 of which are health-related. We compare three pre-trained language models, RoBERTa-base, BERTweet and ClinicalBioBERT in terms of classification accuracy. Our experiments show that RoBERTa-base and BERTweet perform comparably on most datasets, and considerably better than ClinicalBioBERT, even on health-related datasets.

pdf bib abs

Overview of the 2020 ALTA Shared Task: Assess Human Behaviour
Diego Mollá
Proceedings of the 18th Annual Workshop of the Australasian Language Technology Association

The 2020 ALTA shared task is the 11th in stance of a series of shared tasks organised by ALTA since 2010. The task is to classify texts posted in social media according to human judgements expressed in them. The data used for this task is a subset of SemEval 2018 AIT DISC, which has been annotated by domain experts for this task. In this paper we introduce the task, describe the data and present the results of participating systems.

2019

pdf bib abs

Overview of the 2019 ALTA Shared Task: Sarcasm Target Identification
Diego Molla | Aditya Joshi
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association

We present an overview of the 2019 ALTA shared task. This is the 10th of the series of shared tasks organised by ALTA since 2010. The task was to detect the target of sarcastic comments posted on social media. We intro- duce the task, describe the data and present the results of baselines and participants. This year’s shared task was particularly challenging and no participating systems improved the re- sults of our baseline.

2018

pdf bib abs

Macquarie University at BioASQ 6b: Deep learning and deep reinforcement learning for query-based summarisation
Diego Mollá
Proceedings of the 6th BioASQ Workshop A challenge on large-scale biomedical semantic indexing and question answering

This paper describes Macquarie University’s contribution to the BioASQ Challenge (BioASQ 6b, Phase B). We focused on the extraction of the ideal answers, and the task was approached as an instance of query-based multi-document summarisation. In particular, this paper focuses on the experiments related to the deep learning and reinforcement learning approaches used in the submitted runs. The best run used a deep learning model under a regression-based framework. The deep learning architecture used features derived from the output of LSTM chains on word embeddings, plus features based on similarity with the query, and sentence position. The reinforcement learning approach was a proof-of-concept prototype that trained a global policy using REINFORCE. The global policy was implemented as a neural network that used tf.idf features encoding the candidate sentence, question, and context.

pdf bib abs

Supervised Machine Learning for Extractive Query Based Summarisation of Biomedical Data
Mandeep Kaur | Diego Mollá
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

The automation of text summarisation of biomedical publications is a pressing need due to the plethora of information available online. This paper explores the impact of several supervised machine learning approaches for extracting multi-document summaries for given queries. In particular, we compare classification and regression approaches for query-based extractive summarisation using data provided by the BioASQ Challenge. We tackled the problem of annotating sentences for training classification systems and show that a simple annotation approach outperforms regression-based summarisation.

pdf bib abs

Overview of the 2018 ALTA Shared Task: Classifying Patent Applications
Diego Mollá | Dilesha Seneviratne
Proceedings of the Australasian Language Technology Association Workshop 2018

We present an overview of the 2018 ALTA shared task. This is the 9th of the series of shared tasks organised by ALTA since 2010. The task was to classify Australian patent classifications following the sections defined by the International Patient Classification (IPC), using data made available by IP Australia. We introduce the task, describe the data and present the results of the participating teams. Some of the participating teams outperformed state of the art.

Diego Molla

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2008

2007

2006

2005

2004

2003

2002

2000

Co-authors

Venues