Potsawee Manakul


2024

pdf bib
LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models
Adian Liusie | Potsawee Manakul | Mark Gales
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Current developments in large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks. An interesting application of these systems is in the automated assessment of natural language generation (NLG), a highly challenging area with great practical benefit. In this paper, we explore two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment: absolute score prediction, and comparative assessment which uses relative comparisons between pairs of candidates. Though comparative assessment has not been extensively studied in NLG assessment, we note that humans often find it more intuitive to compare two options rather than scoring each one independently. This work examines comparative assessment from multiple perspectives: performance compared to absolute grading; positional biases in the prompt; and efficient ranking in terms of the number of comparisons. We illustrate that LLM comparative assessment is a simple, general and effective approach for NLG assessment. For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring, and in many cases can achieve performance competitive with state-of-the-art methods. Additionally, we demonstrate that LLMs often exhibit strong positional biases when making pairwise comparisons, and we propose debiasing methods that can further improve performance.

2023

pdf bib
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Potsawee Manakul | Adian Liusie | Mark Gales
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Generative Large Language Models (LLMs) such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-factual statements which can undermine trust in their output. Existing fact-checking approaches either require access to the output probability distribution (which may not be available for systems such as ChatGPT) or external databases that are interfaced via separate, often complex, modules. In this work, we propose “SelfCheckGPT”, a simple sampling-based approach that can be used to fact-check the responses of black-box models in a zero-resource fashion, i.e. without an external database. SelfCheckGPT leverages the simple idea that if an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. However, for hallucinated facts, stochastically sampled responses are likely to diverge and contradict one another. We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset, and manually annotate the factuality of the generated passages. We demonstrate that SelfCheckGPT can: i) detect non-factual and factual sentences; and ii) rank passages in terms of factuality. We compare our approach to several baselines and show that our approach has considerably higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods.

pdf bib
CUED at ProbSum 2023: Hierarchical Ensemble of Summarization Models
Potsawee Manakul | Yassir Fathullah | Adian Liusie | Vyas Raina | Vatsal Raina | Mark Gales
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

In this paper, we consider the challenge of summarizing patients medical progress notes in a limited data setting. For the Problem List Summarization (shared task 1A) at the BioNLP Workshop 2023, we demonstrate that ClinicalT5 fine-tuned to 765 medical clinic notes outperforms other extractive, abstractive and zero-shot baselines, yielding reasonable baseline systems for medical note summarization. Further, we introduce Hierarchical Ensemble of Summarization Models (HESM), consisting of token-level ensembles of diverse fine-tuned ClinicalT5 models, followed by Minimum Bayes Risk (MBR) decoding. Our HESM approach lead to a considerable summarization performance boost, and when evaluated on held-out challenge data achieved a ROUGE-L of 32.77, which was the best-performing system at the top of the shared task leaderboard.

pdf bib
MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization
Potsawee Manakul | Adian Liusie | Mark Gales
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Mitigating Word Bias in Zero-shot Prompt-based Classifiers
Adian Liusie | Potsawee Manakul | Mark Gales
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

2021

pdf bib
Long-Span Summarization via Local Attention and Content Selection
Potsawee Manakul | Mark Gales
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization. Typically these systems are trained by fine-tuning a large pre-trained model to the target task. One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows. Thus, for long document summarization, it can be challenging to train or fine-tune these models. In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization using two methods: local self-attention; and explicit content selection. These approaches are compared on a range of network configurations. Experiments are carried out on standard long-span summarization tasks, including Spotify Podcast, arXiv, and PubMed datasets. We demonstrate that by combining these methods, we can achieve state-of-the-art results on all three tasks in the ROUGE scores. Moreover, without a large-scale GPU card, our approach can achieve comparable or better results than existing approaches.

pdf bib
Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems
Potsawee Manakul | Mark Gales
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization. Training and inference using large transformer models can be computationally expensive. Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder. Modified encoder architectures such as LED or LoBART use local attention patterns to address this problem for summarization. In contrast, this work focuses on the transformer’s encoder-decoder attention mechanism. The cost of this attention becomes more significant in inference or training approaches that require model-generated histories. First, we examine the complexity of the encoder-decoder attention. We demonstrate empirically that there is a sparse sentence structure in document summarization that can be exploited by constraining the attention mechanism to a subset of input sentences, whilst maintaining system performance. Second, we propose a modified architecture that selects the subset of sentences to constrain the encoder-decoder attention. Experiments are carried out on abstractive summarization tasks, including CNN/DailyMail, XSum, Spotify Podcast, and arXiv.