Cecile Paris

Also published as: Cécile Paris, Cecile L. Paris

2025

This study investigates the extent to which linguistic typology influences the performance of two automatic speech recognition (ASR) systems across diverse language families. Using the FLEURS corpus and typological features from the World Atlas of Language Structures (WALS), we analysed 40 languages grouped by phonological, morphological, syntactic, and semantic domains. We evaluated two state-of-the-art multilingual ASR systems, Whisper and Seamless, to examine how their performance, measured by word error rate (WER), correlates with linguistic structures. Random Forests and Mixed Effects Models were used to quantify feature impact and statistical significance. Results reveal that while both systems leverage typological patterns, they differ in their sensitivity to specific domains. Our findings highlight how structural and functional linguistic features shape ASR performance, offering insights into model generalisability and typology-aware system development.

pdf bib abs

Overview of the 2025 ALTA Shared Task: Normalise Adverse Drug Events
Diego Mollá | Xiang Dai | Sarvnaz Karimi | Cécile Paris
Proceedings of the 23rd Annual Workshop of the Australasian Language Technology Association

The ALTA shared tasks have been running annually since 2010. In 2025, the task focuses on the normalisation of Adverse Drug Events (ADE) found in forum posts to their corresponding standard term specified by the Medical Dictionary for Regulatory Activities (MedDRA). This is a comprehensive ontology of ADEs, which contains more ADE descriptions than those mentioned in the available training dataset. This makes the task more challenging than a straightforward supervised classification. We present the task, the evaluation criteria, and the results of the systems participating in the shared task.

pdf bib abs

In an era where misinformation spreads freely, fact-checking (FC) plays a crucial role in verifying claims and promoting reliable information. While automated fact-checking (AFC) has advanced significantly, existing systems remain vulnerable to adversarial attacks that manipulate or generate claims, evidence, or claim-evidence pairs. These attacks can distort the truth, mislead decision-makers, and ultimately undermine the reliability of FC models. Despite growing research interest in adversarial attacks against AFC systems, a comprehensive, holistic overview of key challenges remains lacking. These challenges include understanding attack strategies, assessing the resilience of current models, and identifying ways to enhance robustness. This survey provides the first in-depth review of adversarial attacks targeting FC, categorizing existing attack methodologies and evaluating their impact on AFC systems. Additionally, we examine recent advancements in adversary-aware defenses and highlight open research questions that require further exploration. Our findings underscore the urgent need for resilient FC frameworks capable of withstanding adversarial manipulations in pursuit of preserving high verification accuracy.

pdf bib abs

Hassles and Uplifts Detection on Social Media Narratives
Jiyu Chen | Sarvnaz Karimi | Diego Molla | Andreas Duenser | Maria Kangas | Cecile Paris
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Hassles and uplifts are psychological constructs of individuals’ positive or negative responses to daily minor incidents, with cumulative impacts on mental health. These concepts are largely overlooked in NLP, where existing tasks and models focus on identifying general sentiment expressed in text. These, however, cannot satisfy targeted information needs in psychological inquiry. To address this, we introduce Hassles and Uplifts Detection (HUD), a novel NLP application to identify these constructs in social media language.We evaluate various language models and task adaptation approaches on a probing dataset collected from a private, real-time emotional venting platform. Some of our models achieve F scores close to 80%. We also identify open opportunities to improve affective language understanding in support of studies in psychology.

pdf bib abs

To Labor is Not to Suffer: Exploration of Polarity Association Bias in LLMs for Sentiment Analysis
Jiyu Chen | Sarvnaz Karimi | Diego Molla | Cecile Paris
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Large language models (LLMs) are widely used for modeling sentiment trends on social media text. We examine whether LLMs have a polarity association bias—positive or negative—when encountering specific types of lexical word mentions. Such polarity association bias could lead to the wrong classification of neutral statements and thus a distorted estimation of sentiment trends. We estimate the severity of the polarity association bias across five widely used LLMs, identifying lexical word mentions spanning a diverse range of linguistic and psychological categories that correlate with this bias. Our results show a moderate to strong degree of polarity association bias in these LLMs.

pdf bib abs

Generating Impact and Critique Explanations of Predictions made by a Goal Recognizer
Jair da Silva Ferreira Junior | Ingrid Zukerman | Enes Makalic | Cecile L. Paris | Mor Vered
Proceedings of the 18th International Natural Language Generation Conference

In this paper, we generate two types of explanations, Impact and Critique, of predictions made by a Goal Recognizer (GR) – a system that infers agents’ goals from observations. Impact explanations describe the top-predicted goal(s) and the main observations that led to these predictions. Critique explanations augment these explanations with evidence that challenges the GR’s predictions if so warranted. Our user study compares users’ goal-recognition accuracy for Impact and Critique explanations, and users’ views about these explanations, under three prediction-correctness conditions: correct, partially correct and incorrect. Our results show that (1) users stick with a GR’s predictions, even when a Critique explanation highlights its flaws; yet (2) Critique explanations are deemed better than Impact explanations in most respects.

pdf bib abs

CSIRO-LT at SemEval-2025 Task 11: Adapting LLMs for Emotion Recognition for Multiple Languages
Jiyu Chen | Necva Bölücü | Sarvnaz Karimi | Diego Molla | Cecile Paris
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Detecting emotions across different languages is challenging due to the varied and culturally nuanced ways of emotional expressions. The Semeval 2025 Task 11: Bridging the Gap in Text-Based emotion shared task was organised to investigate emotion recognition across different languages. The goal of the task is to implement an emotion recogniser that can identify the basic emotional states that general third-party observers would attribute to an author based on their written text snippet, along with the intensity of those emotions. We report our investigation of various task-adaptation strategies for LLMs in emotion recognition. We show that the most effective method for this task is to fine-tune a pre-trained multilingual LLM for each language.

2024

pdf bib abs

Presentation Matters: How to Communicate Science in the NLP Venues and in the Wild
Sarvnaz Karimi | Cecile Paris | Gholamreza Haffari
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)

Each year a large number of early career researchers join the NLP/Computational Linguistics community, with most starting by presenting their research in the *ACL conferences and workshops. While writing a paper that has made it to these venues is one important step, what comes with communicating the outcome is equally important and sets the path to impact of a research outcome. In addition, not all PhD candidates get the chance of being trained for their presentation skills. Research methods courses are not all of the same quality and may not cover scientific communications, and certainly not all are tailored to the NLP community. We are proposing an introductory tutorial that covers a range of different communication skills, including writing, oral presentation (posters and demos), and social media presence. This is to fill in the gap for the researchers who may not have access to research methods courses or other mentors who could help them acquire such skills. The interactive nature of such a tutorial would allow attendees to ask questions and clarifications which would not be possible from reading materials alone.

pdf bib abs

Do LLMs Generate Creative and Visually Accessible Data visualisations?
Clarissa Miranda-Pena | Andrew Reeson | Cécile Paris | Josiah Poon | Jonathan K. Kummerfeld
Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association

Data visualisation is a valuable task that combines careful data processing with creative design. Large Language Models (LLMs) are now capable of responding to a data visualisation request in natural language with code that generates accurate data visualisations (e.g., using Matplotlib), but what about human-centered factors, such as the creativity and accessibility of the data visualisations? In this work, we study human perceptions of creativity in the data visualisations generated by LLMs, and propose metrics for accessibility. We generate a range of visualisations using GPT-4 and Claude-2 with controlled variations in prompt and inference parameters, to encourage the generation of different types of data visualisations for the same data. Subsets of these data visualisations are presented to people in a survey with questions that probe human perceptions of different aspects of creativity and accessibility. We find that the models produce visualisations that are novel, but not surprising. Our results also show that our accessibility metrics are consistent with human judgements. In all respects, the LLMs underperform visualisations produced by human-written code. To go beyond the simplest requests, these models need to become aware of human-centered factors, while maintaining accuracy.

pdf bib abs

Exploring Instructive Prompts for Large Language Models in the Extraction of Evidence for Supporting Assigned Suicidal Risk Levels
Jiyu Chen | Vincent Nguyen | Xiang Dai | Diego Molla-Aliod | Cecile Paris | Sarvnaz Karimi
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)

Monitoring and predicting the expression of suicidal risk in individuals’ social media posts is a central focus in clinical NLP. Yet, existing approaches frequently lack a crucial explainability component necessary for extracting evidence related to an individual’s mental health state. We describe the CSIRO Data61 team’s evidence extraction system submitted to the CLPsych 2024 shared task. The task aims to investigate the zero-shot capabilities of open-source LLM in extracting evidence regarding an individual’s assigned suicide risk level from social media discourse. The results are assessed against ground truth evidence annotated by psychological experts, with an achieved recall-oriented BERTScore of 0.919. Our findings suggest that LLMs showcase strong feasibility in the extraction of information supporting the evaluation of suicidal risk in social media discourse. Opportunities for refinement exist, notably in crafting concise and effective instructions to guide the extraction process.

pdf bib abs

Large language models are able to generate code for visualisations in response to simple user requests.This is a useful application and an appealing one for NLP research because plots of data provide grounding for language.However, there are relatively few benchmarks, and those that exist may not be representative of what users do in practice.This paper investigates whether benchmarks reflect real-world use through an empirical study comparing benchmark datasets with code from public repositories.Our findings reveal a substantial gap, with evaluations not testing the same distribution of chart types, attributes, and actions as real-world examples.One dataset is representative, but requires extensive modification to become a practical end-to-end benchmark. This shows that new benchmarks are needed to support the development of systems that truly address users’ visualisation needs.These observations will guide future data creation, highlighting which features hold genuine significance for users.

pdf bib abs

Generating Simple, Conservative and Unifying Explanations for Logistic Regression Models
Sameen Maruf | Ingrid Zukerman | Xuelin Situ | Cecile Paris | Gholamreza Haffari
Proceedings of the 17th International Natural Language Generation Conference

In this paper, we generate and compare three types of explanations of Machine Learning (ML) predictions: simple, conservative and unifying. Simple explanations are concise, conservative explanations address the surprisingness of a prediction, and unifying explanations convey the extent to which an ML model’s predictions are applicable. The results of our user study show that (1) conservative and unifying explanations are liked equally and considered largely equivalent in terms of completeness, helpfulness for understanding the AI, and enticement to act, and both are deemed better than simple explanations; and (2)users’ views about explanations are influenced by the (dis)agreement between the ML model’s predictions and users’ estimations of these predictions, and by the inclusion/omission of features users expect to see in explanations.

2021

pdf bib abs

Learning to Explain: Generating Stable Explanations Fast
Xuelin Situ | Ingrid Zukerman | Cecile Paris | Sameen Maruf | Gholamreza Haffari
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The importance of explaining the outcome of a machine learning model, especially a black-box model, is widely acknowledged. Recent approaches explain an outcome by identifying the contributions of input features to this outcome. In environments involving large black-box models or complex inputs, this leads to computationally demanding algorithms. Further, these algorithms often suffer from low stability, with explanations varying significantly across similar examples. In this paper, we propose a Learning to Explain (L2E) approach that learns the behaviour of an underlying explanation algorithm simultaneously from all training examples. Once the explanation algorithm is distilled into an explainer network, it can be used to explain new instances. Our experiments on three classification tasks, which compare our approach to six explanation algorithms, show that L2E is between 5 and 7.5×10ˆ4 times faster than these algorithms, while generating more stable explanations, and having comparable faithfulness to the black-box model.

pdf bib abs

Harnessing Privileged Information for Hyperbole Detection
Rhys Biddle | Maciek Rybinski | Qian Li | Cecile Paris | Guandong Xu
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association

The detection of hyperbole is an important stepping stone to understanding the intentions of a hyperbolic utterance. We propose a model that combines pre-trained language models with privileged information for the task of hyperbole detection. We also introduce a suite of behavioural tests to probe the capabilities of hyperbole detection models across a range of hyperbole types. Our experiments show that our model improves upon baseline models on an existing hyperbole detection dataset. Probing experiments combined with analysis using local linear approximations (LIME) show that our model excels at detecting one particular type of hyperbole. Further, we discover that our experiments highlight annotation artifacts introduced through the process of literal paraphrasing of hyperbole. These annotation artifacts are likely to be a roadblock to further improvements in hyperbole detection.

pdf bib abs

Vent is a specialised iOS/Android social media platform with the stated goal to encourage people to post about their feelings and explicitly label them. In this paper, we study a snapshot of more than 100 million messages obtained from the developers of Vent, together with the labels assigned by the authors of the messages. We establish the quality of the self-annotated data by conducting a qualitative analysis, a vocabulary based analysis, and by training and testing an emotion classifier. We conclude that the self-annotated labels of our corpus are indeed indicative of the emotional contents expressed in the text and thus can support more detailed analyses of emotion expression on social media, such as emotion trajectories and factors influencing them.

pdf bib abs

Lifelong Explainer for Lifelong Learners
Xuelin Situ | Sameen Maruf | Ingrid Zukerman | Cecile Paris | Gholamreza Haffari
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Lifelong Learning (LL) black-box models are dynamic in that they keep learning from new tasks and constantly update their parameters. Owing to the need to utilize information from previously seen tasks, and capture commonalities in potentially diverse data, it is hard for automatic explanation methods to explain the outcomes of these models. In addition, existing explanation methods, e.g., LIME, which are computationally expensive when explaining a static black-box model, are even more inefficient in the LL setting. In this paper, we propose a novel Lifelong Explanation (LLE) approach that continuously trains a student explainer under the supervision of a teacher – an arbitrary explanation algorithm – on different tasks undertaken in LL. We also leverage the Experience Replay (ER) mechanism to prevent catastrophic forgetting in the student explainer. Our experiments comparing LLE to three baselines on text classification tasks show that LLE can enhance the stability of the explanations for all seen tasks and maintain the same level of faithfulness to the black-box model as the teacher, while being up to 10ˆ2 times faster at test time. Our ablation study shows that the ER mechanism in our LLE approach enhances the learning capabilities of the student explainer. Our code is available at https://github.com/situsnow/LLE.

2020

pdf bib abs

An Effective Transition-based Model for Discontinuous NER
Xiang Dai | Sarvnaz Karimi | Ben Hachey | Cecile Paris
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Unlike widely used Named Entity Recognition (NER) data sets in generic domains, biomedical NER data sets often contain mentions consisting of discontinuous spans. Conventional sequence tagging techniques encode Markov assumptions that are efficient but preclude recovery of these mentions. We propose a simple, effective transition-based model with generic neural encoding for discontinuous NER. Through extensive experiments on three biomedical data sets, we show that our model can effectively recognize discontinuous mentions without sacrificing the accuracy on continuous mentions.

pdf bib abs

Benchmarking of Transformer-Based Pre-Trained Models on Social Media Text Classification Datasets
Yuting Guo | Xiangjue Dong | Mohammed Ali Al-Garadi | Abeed Sarker | Cecile Paris | Diego Mollá Aliod
Proceedings of the 18th Annual Workshop of the Australasian Language Technology Association

Free text data from social media is now widely used in natural language processing research, and one of the most common machine learning tasks performed on this data is classification. Generally speaking, performances of supervised classification algorithms on social media datasets are lower than those on texts from other sources, but recently-proposed transformer-based models have considerably improved upon legacy state-of-the-art systems. Currently, there is no study that compares the performances of different variants of transformer-based models on a wide range of social media text classification datasets. In this paper, we benchmark the performances of transformer-based pre-trained models on 25 social media text classification datasets, 6 of which are health-related. We compare three pre-trained language models, RoBERTa-base, BERTweet and ClinicalBioBERT in terms of classification accuracy. Our experiments show that RoBERTa-base and BERTweet perform comparably on most datasets, and considerably better than ClinicalBioBERT, even on health-related datasets.

pdf bib abs

Assessing Social License to Operate from the Public Discourse on Social Media
Chang Xu | Cecile Paris | Ross Sparks | Surya Nepal | Keith VanderLinden
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

Organisations are monitoring their Social License to Operate (SLO) with increasing regularity. SLO, the level of support organisations gain from the public, is typically assessed through surveys or focus groups, which require expensive manual efforts and yield quickly-outdated results. In this paper, we present SIRTA (Social Insight via Real-Time Text Analytics), a novel real-time text analytics system for assessing and monitoring organisations’ SLO levels by analysing the public discourse from social posts. To assess SLO levels, our insight is to extract and transform peoples’ stances towards an organisation into SLO levels. SIRTA achieves this by performing a chain of three text classification tasks, where it identifies task-relevant social posts, discovers key SLO risks discussed in the posts, and infers stances specific to the SLO risks. We leverage recent language understanding techniques (e.g., BERT) for building our classifiers. To monitor SLO levels over time, SIRTA employs quality control mechanisms to reliably identify SLO trends and variations of multiple organisations in a market. These are derived from the smoothed time series of their SLO levels based on exponentially-weighted moving average (EWMA) calculation. Our experimental results show that SIRTA is highly effective in distilling stances from social posts for SLO level assessment, and that the continuous monitoring of SLO levels afforded by SIRTA enables the early detection of critical SLO changes.

pdf bib abs

Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media
Xiang Dai | Sarvnaz Karimi | Ben Hachey | Cecile Paris
Findings of the Association for Computational Linguistics: EMNLP 2020

Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.

Cecile Paris

2025

2024

2021

2020

2019

2018

2017

2016

2015

2013

2012

2011

2010

2009

2008

2006

2005

2004

2003

2002

1996

1994

1993

1989

1988

1987

1985

Co-authors

Venues