Noura Al Moubayed

2025

Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
Chenghao Xiao | Hou Pong Chan | Hao Zhang | Mahani Aljunied | Lidong Bing | Noura Al Moubayed | Yu Rong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

While understanding the knowledge boundaries of LLMs is crucial to prevent hallucination, research on the knowledge boundaries of LLMs has predominantly focused on English. In this work, we present the first study to analyze how LLMs recognize knowledge boundaries across different languages by probing their internal representations when processing known and unknown questions in multiple languages. Our empirical studies reveal three key findings: 1) LLMs’ perceptions of knowledge boundaries are encoded in the middle to middle-upper layers across different languages. 2) Language differences in knowledge boundary perception follow a linear structure, which motivates our proposal of a training-free alignment method that effectively transfers knowledge boundary perception ability across languages, thereby helping reduce hallucination risk in low-resource languages; 3) Fine-tuning on bilingual question pair translation further enhances LLMs’ recognition of knowledge boundaries across languages. Given the absence of standard testbeds for cross-lingual knowledge boundary analysis, we construct a multilingual evaluation suite comprising three representative types of knowledge boundary data. Our code and datasets are publicly available at https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.

This paper presents the setup and results of the third edition of the BioLaySumm shared task on Lay Summarization of Biomedical Research Articles and Radiology Reports, hosted at the BioNLP Workshop at ACL 2025. In this task edition, we aim to build on the first two editions’ successes by further increasing research interest in this important task and encouraging participants to explore novel approaches that will help advance the state-of-the-art. Specifically, we introduce the new task of Radiology Report Generation with Layman’s terms, which is parallel to the task of lay summarization of biomedical articles in the first two editions. Overall, our results show that a broad range of innovative approaches were adopted by task participants, including inspiring explorations of the latest RL techniques adopted in the training of general-domain large reasoning models.

Early Detection and Reduction of Memorization for Domain Adaptation and Instruction Tuning
Dean L. Slack | Noura Al Moubayed
Transactions of the Association for Computational Linguistics, Volume 13

Although large language models excel across many tasks, they can memorize training data and thereby expose private or copyrighted text. Most defenses target the pre-training stage, leaving memorization during fine-tuning, especially for domain adaptation and instruction tuning, poorly understood. We fine-tune Pythia, Llama3, and Mistral models spanning 1.4B–70B parameters on common evaluation datasets and track verbatim memorization throughout training. We find that memorization increases dramatically in the first few epochs, often significantly before either validation perplexity or evaluation performance is optimized. We use a simple but effective n-gram memorization score which reliably precedes verbatim memorization; using it as an early-stopping criterion mitigates memorization with minimal performance loss. Further, we introduce an n-gram–aware loss regularizer and show that it reduces memorization across all model families tested by up to 40% while minimizing evaluation performance trade-offs when compared to an existing memorization mitigation strategy. These results yield practical, scalable insights into memorization dynamics during language model fine-tuning.
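
As a rough illustration of the kind of n-gram memorization score the abstract above describes, the sketch below measures the fraction of n-grams in a model's continuation that also occur in the reference training text, and compares it against a stopping threshold. The function names, the whitespace tokenization, the default `n=4`, the threshold value, and the exact definition of the score are all assumptions made for illustration; the paper's own formulation may differ.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_memorization_score(generated, reference, n=4):
    """Fraction of n-grams in `generated` that also appear in `reference`.

    Assumption: whitespace tokenization and a simple overlap ratio.
    Returns 0.0 when `generated` is too short to contain any n-gram.
    """
    generated_ngrams = ngrams(generated.split(), n)
    if not generated_ngrams:
        return 0.0
    reference_ngrams = set(ngrams(reference.split(), n))
    hits = sum(1 for g in generated_ngrams if g in reference_ngrams)
    return hits / len(generated_ngrams)


def should_stop_early(score, threshold=0.5):
    """Hypothetical early-stopping rule: stop once the score reaches the threshold."""
    return score >= threshold
```

In a fine-tuning loop, one might compute this score on a held-out set of training prefixes after each epoch and stop once `should_stop_early` returns `True`; the threshold would need to be tuned per task.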

Adversarial Defense without Adversarial Defense: Enhancing Language Model Robustness via Instance-level Principal Component Removal
Yang Wang | Chenghao Xiao | Yizhi Li | Stuart E. Middleton | Noura Al Moubayed | Chenghua Lin
Transactions of the Association for Computational Linguistics, Volume 13

Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defenses or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimizes the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalization.

PetEVAL: A veterinary free text electronic health records benchmark
Sean Farrell | Alan Radford | Noura Al Moubayed | Peter-John Noble
Proceedings of the 24th Workshop on Biomedical Language Processing

We introduce PetEVAL, the first benchmark dataset derived from real-world, free-text veterinary electronic health records (EHRs). PetEVAL comprises 17,600 professionally annotated EHRs from first-opinion veterinary practices across the UK, partitioned into training (11,000), evaluation (1,600), and test (5,000) sets with distinct clinic distributions to assess model generalisability. Each record is annotated with International Classification of Disease 11 (ICD-11) syndromic chapter labels (20,408 labels), disease Named Entity Recognition (NER) tags (429 labels), and anonymisation NER tags (8,244 labels). PetEVAL enables evaluating Natural Language Processing (NLP) tools across applications, including syndrome surveillance and disease outbreak detection. We implement a multistage anonymisation protocol, replacing identifiable information with clinically relevant pseudonyms while establishing the first definition of identifiers in veterinary free text. PetEVAL introduces three core tasks: syndromic classification, disease entity recognition, and anonymisation. We provide baseline results using BERT-base, PetBERT, and LLaMA 3.1 8B generative models. Our experiments demonstrate the unique challenges of veterinary text, showcasing the importance of domain-specific approaches. By fostering advancements in veterinary informatics and epidemiology, we envision PetEVAL catalysing innovations in veterinary care, animal health, and comparative biomedical research through access to real-world, annotated veterinary clinical data.

2024

Multi-modal information retrieval (MMIR) is a rapidly evolving field where significant progress has been made through advanced representation learning and cross-modality alignment research, particularly in image-text pairing. However, current benchmarks for evaluating MMIR performance on image-text pairings overlook the scientific domain, which has a notable gap with the generic data since the caption of scientific charts and tables usually describes the analysis of experimental results or scientific principles in contrast to human activity or scenery depicted in generic images. To bridge this gap, we develop a scientific domain-specific MMIR benchmark (SciMMIR) by leveraging open-access research paper corpora to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions from scientific documents. We further annotate the image-text pairs with a two-level subset-subcategory hierarchy to facilitate a more comprehensive evaluation of the baselines. We conduct zero-shot and fine-tuned evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP, BLIP, and BLIP-2. Our findings offer critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the effects of different visual and textual encoders.

2023

Towards more Human-like Language Models based on Contextualizer Pretraining Strategy
Chenghao Xiao | G Thomas Hudson | Noura Al Moubayed
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

On Isotropy, Contextualization and Learning Dynamics of Contrastive-based Sentence Representation Learning
Chenghao Xiao | Yang Long | Noura Al Moubayed
Findings of the Association for Computational Linguistics: ACL 2023

Incorporating contrastive learning objectives in sentence representation learning (SRL) has yielded significant improvements on many sentence-level NLP tasks. However, it is not well understood why contrastive learning works for learning sentence-level semantics. In this paper, we aim to help guide future designs of sentence representation learning methods by taking a closer look at contrastive SRL through the lens of isotropy, contextualization and learning dynamics. We interpret its successes through the geometry of the representation shifts and show that contrastive learning brings isotropy, and drives high intra-sentence similarity: when in the same sentence, tokens converge to similar positions in the semantic space. We also find that what we formalize as “spurious contextualization” is mitigated for semantically meaningful tokens, while augmented for functional ones. We find that the embedding space is directed towards the origin during training, with more areas now better defined. We ablate these findings by observing the learning dynamics with different training temperatures, batch sizes and pooling methods.
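
To make the notion of intra-sentence similarity in the abstract above more concrete, here is a minimal sketch that averages the cosine similarity between each token vector and the mean vector of its sentence. This particular definition, the function names, and the use of plain Python lists as token vectors are assumptions for illustration; the paper may formalize the measure differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def intra_sentence_similarity(token_vectors):
    """Average cosine similarity between each token vector and the sentence mean.

    Assumption: the sentence vector is the unweighted mean of its token vectors.
    Values near 1.0 indicate that tokens sit at similar positions in the space.
    """
    if not token_vectors:
        return 0.0
    dim = len(token_vectors[0])
    count = len(token_vectors)
    mean = [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]
    return sum(cosine(vec, mean) for vec in token_vectors) / count
```

Under this definition, a sentence whose token vectors all point in the same direction scores 1.0, while tokens spread over orthogonal directions score lower.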

Length is a Curse and a Blessing for Document-level Semantics
Chenghao Xiao | Yizhi Li | G Hudson | Chenghua Lin | Noura Al Moubayed
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In recent years, contrastive learning (CL) has been extensively utilized to recover sentence and document-level encoding capability from pre-trained language models. In this work, we question the length generalizability of CL-based models, i.e., their vulnerability towards length-induced semantic shift. We verify not only that length vulnerability is a significant yet overlooked research gap, but also that we can devise unsupervised CL methods solely depending on the semantic signal provided by document length. We first derive the theoretical foundations underlying length attacks, showing that elongating a document would intensify the high intra-document similarity that is already brought by CL. Moreover, we found that isotropy promised by CL is highly dependent on the length range of text exposed in training. Inspired by these findings, we introduce a simple yet universal document representation learning framework, **LA(SER)³**: length-agnostic self-reference for semantically robust sentence representation learning, achieving state-of-the-art unsupervised performance on the standard information retrieval benchmark. [Our code is publicly available.](https://github.com/gowitheflow-1998/LA-SER-cubed)

2022

MuLD: The Multitask Long Document Benchmark
George Hudson | Noura Al Moubayed
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The impressive progress in NLP techniques has been driven by the development of multi-task benchmarks such as GLUE and SuperGLUE. While these benchmarks focus on tasks for one or two input sentences, there has been exciting work in designing efficient techniques for processing much longer inputs. In this paper, we present MuLD: a new long document benchmark consisting of only documents over 10,000 tokens. By modifying existing NLP tasks, we create a diverse benchmark which requires models to successfully model long-term dependencies in the text. We evaluate how existing models perform, and find that our benchmark is much more challenging than their ‘short document’ equivalents. Furthermore, by evaluating both regular and efficient transformers, we show that models with increased context length are better able to solve the tasks presented, suggesting that future improvements in these models are vital for solving similar long document problems. We release the data and code for baselines to encourage further research on efficient NLP models.

Generating Textual Explanations for Machine Learning Models Performance: A Table-to-Text Task
Isaac Ampomah | James Burton | Amir Enshaei | Noura Al Moubayed
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Numerical tables are widely employed to communicate or report the classification performance of machine learning (ML) models with respect to a set of evaluation metrics. For non-experts, domain knowledge is required to fully understand and interpret the information presented by numerical tables. This paper proposes a new natural language generation (NLG) task where neural models are trained to generate textual explanations, analytically describing the classification performance of ML models based on the metrics’ scores reported in the tables. Presenting the generated texts along with the numerical tables will allow for a better understanding of the classification performance of ML models. We constructed a dataset comprising numerical tables paired with their corresponding textual explanations written by experts to facilitate this NLG task. Experiments on the dataset are conducted by fine-tuning pre-trained language models (T5 and BART) to generate analytical textual explanations conditioned on the information in the tables. Furthermore, we propose a neural module, Metrics Processing Unit (MPU), to improve the performance of the baselines in terms of correctly verbalising the information in the corresponding table. Evaluation and analysis conducted indicate that exploring pre-trained models for data-to-text generation leads to better generalisation performance and can produce high-quality textual explanations.

2021

Towards Equal Gender Representation in the Annotations of Toxic Language Detection
Elizabeth Excell | Noura Al Moubayed
Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing

Classifiers tend to propagate biases present in the data on which they are trained. Hence, it is important to understand how the demographic identities of the annotators of comments affect the fairness of the resulting model. In this paper, we focus on the differences in the ways men and women annotate comments for toxicity, investigating how these differences result in models that amplify the opinions of male annotators. We find that the BERT model associates toxic comments containing offensive words with male annotators, causing the model to predict 67.7% of toxic comments as having been annotated by men. We show that this disparity between gender predictions can be mitigated by removing offensive words and highly toxic comments from the training data. We then apply the learned associations between gender and language to toxic language classifiers, finding that models trained exclusively on female-annotated data perform 1.8% better than those trained solely on male-annotated data, and that training models on data after removing all offensive words reduces bias in the model by 55.5% while increasing the sensitivity by 0.4%.