Jey Han Lau


2024

pdf bib
A Sentiment Consolidation Framework for Meta-Review Generation
Miao Li | Jey Han Lau | Eduard Hovy
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Modern natural language generation systems with Large Language Models (LLMs) exhibit the capability to generate a plausible summary of multiple documents; however, it is uncertain if they truly possess the capability of information consolidation to generate summaries, especially on documents with opinionated information. We focus on meta-review generation, a form of sentiment summarisation for the scientific domain. To make scientific sentiment summarization more grounded, we hypothesize that human meta-reviewers follow a three-layer framework of sentiment consolidation to write meta-reviews. Based on the framework, we propose novel prompting methods for LLMs to generate meta-reviews and evaluation metrics to assess the quality of generated meta-reviews. Our framework is validated empirically as we find that prompting LLMs based on the framework — compared with prompting them with simple instructions — generates better meta-reviews.

pdf bib
CMA-R: Causal Mediation Analysis for Explaining Rumour Detection
Lin Tian | Xiuzhen Zhang | Jey Han Lau
Findings of the Association for Computational Linguistics: EACL 2024

We apply causal mediation analysis to explain the decision-making process of neural models for rumour detection on Twitter.Interventions at the input and network level reveal the causal impacts of tweets and words in the model output.We find that our approach CMA-R – Causal Mediation Analysis for Rumour detection – identifies salient tweets that explain model predictions and show strong agreement with human judgements for critical tweets determining the truthfulness of stories.CMA-R can further highlight causally impactful words in the salient tweets, providing another layer of interpretability and transparency into these blackbox rumour detection systems. Code is available at: https://github.com/ltian678/cma-r.

pdf bib
To Aggregate or Not to Aggregate. That is the Question: A Case Study on Annotation Subjectivity in Span Prediction
Kemal Kurniawan | Meladel Mistica | Timothy Baldwin | Jey Han Lau
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

This paper explores the task of automatic prediction of text spans in a legal problem description that support a legal area label. We use a corpus of problem descriptions written by laypeople in English that is annotated by practising lawyers. Inherent subjectivity exists in our task because legal area categorisation is a complex task, and lawyers often have different views on a problem. Experiments show that training on majority-voted spans outperforms training on disaggregated ones.

2023

pdf bib
Annotating and Detecting Fine-grained Factual Errors for Dialogue Summarization
Rongxin Zhu | Jianzhong Qi | Jey Han Lau
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A series of datasets and models have been proposed for summaries generated for well-formatted documents such as news articles. Dialogue summaries, however, have been under explored. In this paper, we present the first dataset with fine-grained factual error annotations named DIASUMFACT. We define fine-grained factual error detection as a sentence-level multi-label classification problem, and weevaluate two state-of-the-art (SOTA) models on our dataset. Both models yield sub-optimal results, with a macro-averaged F1 score of around 0.25 over 6 error classes. We further propose an unsupervised model ENDERANKER via candidate ranking using pretrained encoder-decoder models. Our model performs on par with the SOTA models while requiring fewer resources. These observations confirm the challenges in detecting factual errors from dialogue summaries, which call for further studies, for which our dataset and results offer a solid foundation.

pdf bib
The uncivil empathy: Investigating the relation between empathy and toxicity in online mental health support forums
Ming-Bin Chen | Jey Han Lau | Lea Frermann
Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association

We explore the relationship between empathy and toxicity in the context of online mental health forums. Despite the common assumption of a negative correlation between these concepts, it has not been empirically examined. We augment the EPITOME mental health empathy dataset with toxicity labels using two widely employed toxic/harmful content detection APIs: Perspective API and OpenAI moderation API. We find a notable presence of toxic/harmful content (17.77%) within empathetic responses, and only a very weak negative correlation between the two variables. Qualitative analysis revealed contributions labeled as empathetic often contain harmful content such as promotion of suicidal ideas. Our results highlight the need for reevaluating empathy independently from toxicity in future research and encourage a reconsideration of empathy’s role in natural language generation and evaluation.

pdf bib
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages
Genta Indra Winata | Alham Fikri Aji | Samuel Cahyawijaya | Rahmad Mahendra | Fajri Koto | Ade Romadhony | Kemal Kurniawan | David Moeljadi | Radityo Eko Prasojo | Pascale Fung | Timothy Baldwin | Jey Han Lau | Rico Sennrich | Sebastian Ruder
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes sentiment and machine translation datasets, and bilingual lexicons. We provide extensive analyses and describe challenges for creating such resources. We hope this work can spark NLP research on Indonesian and other underrepresented languages.

pdf bib
Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective
Zijian Zhang | Chang Shu | Ya Xiao | Yuan Shen | Di Zhu | Youxin Chen | Jing Xiao | Jey Han Lau | Qian Zhang | Zheng Lu
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; and (2) only considering the most difficult-to-distinguish negative sample leads to slow convergence and poor Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy to dynamically select a group of negative samples to make the optimization converge faster and perform better. Experimental results on Flickr30K and MS-COCO demonstrate that a standard VSE using our pooling and optimization strategies outperforms current state-of-the-art systems (at least 1.0% on the metrics of recall) in image-to-text and text-to-image retrieval. Source code of our experiments is available at https://github.com/96-Zachary/vse_2ad .

pdf bib
Unsupervised Paraphrasing of Multiword Expressions
Takashi Wada | Yuji Matsumoto | Timothy Baldwin | Jey Han Lau
Findings of the Association for Computational Linguistics: ACL 2023

We propose an unsupervised approach to paraphrasing multiword expressions (MWEs) in context. Our model employs only monolingual corpus data and pre-trained language models (without fine-tuning), and does not make use of any external resources such as dictionaries. We evaluate our method on the SemEval 2022 idiomatic semantic text similarity task, and show that it outperforms all unsupervised systems and rivals supervised systems.

pdf bib
Unsupervised Lexical Simplification with Context Augmentation
Takashi Wada | Timothy Baldwin | Jey Han Lau
Findings of the Association for Computational Linguistics: EMNLP 2023

We propose a new unsupervised lexical simplification method that uses only monolingual data and pre-trained language models. Given a target word and its context, our method generates substitutes based on the target context and also additional contexts sampled from monolingual data. We conduct experiments in English, Portuguese, and Spanish on the TSAR-2022 shared task, and show that our model substantially outperforms other unsupervised systems across all languages. We also establish a new state-of-the-art by ensembling our model with GPT-3.5. Lastly, we evaluate our model on the SWORDS lexical substitution data set, achieving a state-of-the-art result.

pdf bib
The Next Chapter: A Study of Large Language Models in Storytelling
Zhuohan Xie | Trevor Cohn | Jey Han Lau
Proceedings of the 16th International Natural Language Generation Conference

To enhance the quality of generated stories, recent story generation models have been investigating the utilization of higher-level attributes like plots or commonsense knowledge. The application of prompt-based learning with large language models (LLMs), exemplified by GPT-3, has exhibited remarkable performance in diverse natural language processing (NLP) tasks. This paper conducts a comprehensive investigation, utilizing both automatic and human evaluation, to compare the story generation capacity of LLMs with recent models across three datasets with variations in style, register, and length of stories. The results demonstrate that LLMs generate stories of significantly higher quality compared to other story generation models. Moreover, they exhibit a level of performance that competes with human authors, albeit with the preliminary observation that they tend to replicate real stories in situations involving world knowledge, resembling a form of plagiarism.

2022

pdf bib
Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation
Thinh Hung Truong | Yulia Otmakhova | Timothy Baldwin | Trevor Cohn | Jey Han Lau | Karin Verspoor
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Negation is poorly captured by current language models, although the extent of this problem is not widely understood. We introduce a natural language inference (NLI) test suite to enable probing the capabilities of NLP methods, with the aim of understanding sub-clausal negation. The test suite contains premise–hypothesis pairs where the premise contains sub-clausal negation and the hypothesis is constructed by making minimal modifications to the premise in order to reflect different possible interpretations. Aside from adopting standard NLI labels, our test suite is systematically constructed under a rigorous linguistic framework. It includes annotation of negation types and constructions grounded in linguistic theory, as well as the operations used to construct hypotheses. This facilitates fine-grained analysis of model performance. We conduct experiments using pre-trained language models to demonstrate that our test suite is more challenging than existing benchmarks focused on negation, and show how our annotation supports a deeper understanding of the current NLI capabilities in terms of negation and quantification.

pdf bib
An Interpretable Neuro-Symbolic Reasoning Framework for Task-Oriented Dialogue Generation
Shiquan Yang | Rui Zhang | Sarah Erfani | Jey Han Lau
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We study the interpretability issue of task-oriented dialogue systems in this paper. Previously, most neural-based task-oriented dialogue systems employ an implicit reasoning strategy that makes the model predictions uninterpretable to humans. To obtain a transparent reasoning process, we introduce neuro-symbolic to perform explicit reasoning that justifies model decisions by reasoning chains. Since deriving reasoning chains requires multi-hop reasoning for task-oriented dialogues, existing neuro-symbolic approaches would induce error propagation due to the one-phase design. To overcome this, we propose a two-phase approach that consists of a hypothesis generator and a reasoner. We first obtain multiple hypotheses, i.e., potential operations to perform the desired task, through the hypothesis generator. Each hypothesis is then verified by the reasoner, and the valid one is selected to conduct the final prediction. The whole system is trained by exploiting raw textual dialogues without using any reasoning chain annotations. Experimental studies on two public benchmark datasets demonstrate that the proposed approach not only achieves better results, but also introduces an interpretable decision process.

pdf bib
The patient is more dead than alive: exploring the current state of the multi-document summarisation of the biomedical literature
Yulia Otmakhova | Karin Verspoor | Timothy Baldwin | Jey Han Lau
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Although multi-document summarisation (MDS) of the biomedical literature is a highly valuable task that has recently attracted substantial interest, evaluation of the quality of biomedical summaries lacks consistency and transparency. In this paper, we examine the summaries generated by two current models in order to understand the deficiencies of existing evaluation approaches in the context of the challenges that arise in the MDS task. Based on this analysis, we propose a new approach to human evaluation and identify several challenges that must be overcome to develop effective biomedical MDS systems.

pdf bib
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
Alham Fikri Aji | Genta Indra Winata | Fajri Koto | Samuel Cahyawijaya | Ade Romadhony | Rahmad Mahendra | Kemal Kurniawan | David Moeljadi | Radityo Eko Prasojo | Timothy Baldwin | Jey Han Lau | Sebastian Ruder
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia’s 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.

pdf bib
Automatic Explanation Generation For Climate Science Claims
Rui Xing | Shraey Bhatia | Timothy Baldwin | Jey Han Lau
Proceedings of the 20th Annual Workshop of the Australasian Language Technology Association

Climate change is an existential threat to humanity, the proliferation of unsubstantiated claims relating to climate science is manipulating public perception, motivating the need for fact-checking in climate science. In this work, we draw on recent work that uses retrieval-augmented generation for veracity prediction and explanation generation, in framing explanation generation as a query-focused multi-document summarization task. We adapt PRIMERA to the climate science domain by adding additional global attention on claims. Through automatic evaluation and qualitative analysis, we demonstrate that our method is effective at generating explanations.

pdf bib
Easy-First Bottom-Up Discourse Parsing via Sequence Labelling
Andrew Shen | Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the 3rd Workshop on Computational Approaches to Discourse

We propose a novel unconstrained bottom-up approach for rhetorical discourse parsing based on sequence labelling of adjacent pairs of discourse units (DUs), based on the framework of Koto et al. (2021). We describe the unique training requirements of an unconstrained parser, and explore two different training procedures: (1) fixed left-to-right; and (2) random order in tree construction. Additionally, we introduce a novel dynamic oracle for unconstrained bottom-up parsing. Our proposed parser achieves competitive results for bottom-up rhetorical discourse parsing.

pdf bib
LipKey: A Large-Scale News Dataset for Absent Keyphrases Generation and Abstractive Summarization
Fajri Koto | Timothy Baldwin | Jey Han Lau
Proceedings of the 29th International Conference on Computational Linguistics

Summaries, keyphrases, and titles are different ways of concisely capturing the content of a document. While most previous work has released the datasets of keyphrases and summarization separately, in this work, we introduce LipKey, the largest news corpus with human-written abstractive summaries, absent keyphrases, and titles. We jointly use the three elements via multi-task training and training as joint structured inputs, in the context of document summarization. We find that including absent keyphrases and titles as additional context to the source document improves transformer-based summarization models.

pdf bib
Unsupervised Lexical Substitution with Decontextualised Embeddings
Takashi Wada | Timothy Baldwin | Yuji Matsumoto | Jey Han Lau
Proceedings of the 29th International Conference on Computational Linguistics

We propose a new unsupervised method for lexical substitution using pre-trained language models. Compared to previous approaches that use the generative capability of language models to predict substitutes, our method retrieves substitutes based on the similarity of contextualised and decontextualised word embeddings, i.e. the average contextual representation of a word in multiple contexts. We conduct experiments in English and Italian, and show that our method substantially outperforms strong baselines and establishes a new state-of-the-art without any explicit supervision or fine-tuning. We further show that our method performs particularly well at predicting low-frequency substitutes, and also generates a diverse list of substitute candidates, reducing morphophonetic or morphosyntactic biases induced by article-noun agreement.

pdf bib
Cloze Evaluation for Deeper Understanding of Commonsense Stories in Indonesian
Fajri Koto | Timothy Baldwin | Jey Han Lau
Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)

Story comprehension that involves complex causal and temporal relations is a critical task in NLP, but previous studies have focused predominantly on English, leaving open the question of how the findings generalize to other languages, such as Indonesian. In this paper, we follow the Story Cloze Test framework of Mostafazadeh et al. (2016) in evaluating story understanding in Indonesian, by constructing a four-sentence story with one correct ending and one incorrect ending. To investigate commonsense knowledge acquisition in language models, we experimented with: (1) a classification task to predict the correct ending; and (2) a generation task to complete the story with a single sentence. We investigate these tasks in two settings: (i) monolingual training and ii) zero-shot cross-lingual transfer between Indonesian and English.

pdf bib
Can Pretrained Language Models Generate Persuasive, Faithful, and Informative Ad Text for Product Descriptions?
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)

For any e-commerce service, persuasive, faithful, and informative product descriptions can attract shoppers and improve sales. While not all sellers are capable of providing such interesting descriptions, a language generation system can be a source of such descriptions at scale, and potentially assist sellers to improve their product descriptions. Most previous work has addressed this task based on statistical approaches (Wang et al., 2017), limited attributes such as titles (Chen et al., 2019; Chan et al., 2020), and focused on only one product type (Wang et al., 2017; Munigala et al., 2018; Hong et al., 2021). In this paper, we jointly train image features and 10 text attributes across 23 diverse product types, with two different target text types with different writing styles: bullet points and paragraph descriptions. Our findings suggest that multimodal training with modern pretrained language models can generate fluent and persuasive advertisements, but are less faithful and informative, especially out of domain.

pdf bib
Robust Task-Oriented Dialogue Generation with Contrastive Pre-training and Adversarial Filtering
Shiquan Yang | Xinting Huang | Jey Han Lau | Sarah Erfani
Findings of the Association for Computational Linguistics: EMNLP 2022

Data artifacts incentivize machine learning models to learn non-transferable generalizations by taking advantage of shortcuts in the data, andthere is growing evidence that data artifacts play a role for the strong results that deep learning models achieve in recent natural language processing benchmarks.In this paper, we focus on task-oriented dialogue and investigate whether popular datasets such as MultiWOZ contain such data artifacts.We found that by only keeping frequent phrases in the trainingexamples, state-of-the-art models perform similarly compared to the variant trained with full data, suggesting they exploit these spurious correlationsto solve the task. Motivated by this, we propose a contrastive learning based framework to encourage the model to ignore these cues and focus on learning generalisable patterns. We also experiment with adversarial filtering to remove easy training instances so that the model would focus on learning from the harder instances. We conduct a number of generalization experiments — e.g., cross-domain/dataset and adversarial tests — to assess the robustness of our approach and found that it works exceptionally well.

pdf bib
M3: Multi-level dataset for Multi-document summarisation of Medical studies
Yulia Otmakhova | Karin Verspoor | Timothy Baldwin | Antonio Jimeno Yepes | Jey Han Lau
Findings of the Association for Computational Linguistics: EMNLP 2022

We present M3 (Multi-level dataset for Multi-document summarisation of Medical studies), a benchmark dataset for evaluating the quality of summarisation systems in the biomedical domain. The dataset contains sets of multiple input documents and target summaries of three levels of complexity: documents, sentences, and propositions. The dataset also includes several levels of annotation, including biomedical entities, direction, and strength of relations between them, and the discourse relationships between the input documents (“contradiction” or “agreement”). We showcase usage scenarios of the dataset by testing 10 generic and domain-specific summarisation models in a zero-shot setting, and introduce a probing task based on counterfactuals to test if models are aware of the direction and strength of the conclusions generated from input studies.

pdf bib
DUCK: Rumour Detection on Social Media by Modelling User and Comment Propagation Networks
Lin Tian | Xiuzhen Zhang | Jey Han Lau
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Social media rumours, a form of misinformation, can mislead the public and cause significant economic and social disruption. Motivated by the observation that the user network — which captures who engage with a story — and the comment network — which captures how they react to it — provide complementary signals for rumour detection, in this paper, we propose DUCK (rumour  ̲detection with  ̲user and  ̲comment networ ̲ks) for rumour detection on social media. We study how to leverage transformers and graph attention networks to jointly model the contents and structure of social media conversations, as well as the network of users who engaged in these conversations. Over four widely used benchmark rumour datasets in English and Chinese, we show that DUCK produces superior performance for detecting rumours, creating a new state-of-the-art. Source code for DUCK is available at: https://github.com/ltian678/DUCK-code.

pdf bib
LED down the rabbit hole: exploring the potential of global attention for biomedical multi-document summarisation
Yulia Otmakhova | Thinh Hung Truong | Timothy Baldwin | Trevor Cohn | Karin Verspoor | Jey Han Lau
Proceedings of the Third Workshop on Scholarly Document Processing

In this paper we report the experiments performed for the submission to the Multidocument summarisation for Literature Review (MSLR) Shared Task. In particular, we adopt Primera model to the biomedical domain by placing global attention on important biomedical entities in several ways. We analyse the outputs of 23 resulting models and report some patterns related to the presence of additional global attention, number of training steps and the inputs configuration.

pdf bib
Cross-linguistic Comparison of Linguistic Feature Encoding in BERT Models for Typologically Different Languages
Yulia Otmakhova | Karin Verspoor | Jey Han Lau
Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Though recently there have been an increased interest in how pre-trained language models encode different linguistic features, there is still a lack of systematic comparison between languages with different morphology and syntax. In this paper, using BERT as an example of a pre-trained model, we compare how three typologically different languages (English, Korean, and Russian) encode morphology and syntax features across different layers. In particular, we contrast languages which differ in a particular aspect, such as flexibility of word order, head directionality, morphological type, presence of grammatical gender, and morphological richness, across four different tasks.

2021

pdf bib
Findings on Conversation Disentanglement
Rongxin Zhu | Jey Han Lau | Jianzhong Qi
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association

Conversation disentanglement, the task to identify separate threads in conversations, is an important pre-processing step in multi-party conversational NLP applications such as conversational question answering and con-versation summarization. Framing it as a utterance-to-utterance classification problem — i.e. given an utterance of interest (UOI), find which past utterance it replies to — we explore a number of transformer-based models and found that BERT in combination with handcrafted features remains a strong baseline. We then build a multi-task learning model that jointly learns utterance-to-utterance and utterance-to-thread classification. Observing that the ground truth label (past utterance) is in the top candidates when our model makes an error, we experiment with using bipartite graphs as a post-processing step to learn how to best match a set of UOIs to past utterances. Experiments on the Ubuntu IRC dataset show that this approach has the potential to out-perform the conventional greedy approach of simply selecting the highest probability candidate for each UOI independently, indicating a promising future research direction.

pdf bib
Exploring Story Generation with Multi-task Objectives in Variational Autoencoders
Zhuohan Xie | Jey Han Lau | Trevor Cohn
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association

GPT-2 has been frequently adapted in story generation models as it provides powerful generative capability. However, it still fails to generate consistent stories and lacks diversity. Current story generation models leverage additional information such as plots or commonsense into GPT-2 to guide the generation process. These approaches focus on improving generation quality of stories while our work look at both quality and diversity. We explore combining BERT and GPT-2 to build a variational autoencoder (VAE), and extend it by adding additional objectives to learn global features such as story topic and discourse relations. Our evaluations show our enhanced VAE can provide better quality and diversity trade off, generate less repetitive story content and learn a more informative latent variable.

pdf bib
Top-down Discourse Parsing via Sequence Labelling
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We introduce a top-down approach to discourse parsing that is conceptually simpler than its predecessors (Kobayashi et al., 2020; Zhang et al., 2020). By framing the task as a sequence labelling problem where the goal is to iteratively segment a document into individual discourse units, we are able to eliminate the decoder and reduce the search space for splitting points. We explore both traditional recurrent models and modern pre-trained transformer models for the task, and additionally introduce a novel dynamic oracle for top-down parsing. Based on the Full metric, our proposed LSTM model sets a new state-of-the-art for RST parsing.

pdf bib
IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually-trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing with the average BERT subword embedding makes pretraining five times faster, and is more effective than proposed methods for vocabulary adaptation in terms of extrinsic evaluation over seven Twitter-based datasets.

pdf bib
Evaluating the Efficacy of Summarization Evaluation across Languages
Fajri Koto | Jey Han Lau | Timothy Baldwin
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora
Takashi Wada | Tomoharu Iwata | Yuji Matsumoto | Timothy Baldwin | Jey Han Lau
Proceedings of the 1st Workshop on Multilingual Representation Learning

We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. Through sharing model parameters among different languages, our model jointly trains the word embeddings in a common cross-lingual space. We also propose to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an encoder-decoder translation model is beneficial for learning cross-lingual representations even in extremely low-resource conditions. Furthermore, our model also works well on high-resource conditions, achieving state-of-the-art performance on a German-English word-alignment task.

pdf bib
Automatic Classification of Neutralization Techniques in the Narrative of Climate Change Scepticism
Shraey Bhatia | Jey Han Lau | Timothy Baldwin
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Neutralisation techniques, e.g. denial of responsibility and denial of victim, are used in the narrative of climate change scepticism to justify lack of action or to promote an alternative view. We first draw on social science to introduce the problem to the community of nlp, present the granularity of the coding schema and then collect manual annotations of neutralised techniques in text relating to climate change, and experiment with supervised and semi- supervised BERT-based models.

pdf bib
Discourse Probing of Pretrained Language Models
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Existing work on probing of pretrained language models (LMs) has predominantly focused on sentence-level syntactic tasks. In this paper, we introduce document-level discourse probing to evaluate the ability of pretrained LMs to capture document-level relations. We experiment with 7 pretrained LMs, 4 languages, and 7 discourse probing tasks, and find BART to be overall the best model at capturing discourse — but only in its encoder, with BERT performing surprisingly well as the baseline model. Across the different models, there are substantial differences in which layers best capture discourse information, and large disparities between models.

pdf bib
Grey-box Adversarial Attack And Defence For Sentiment Classification
Ying Xu | Xu Zhong | Antonio Jimeno Yepes | Jey Han Lau
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We introduce a grey-box adversarial attack and defence framework for sentiment classification. We address the issues of differentiability, label preservation and input reconstruction for adversarial attack and defence in one unified framework. Our results show that once trained, the attacking model is capable of generating high-quality adversarial examples substantially faster (one order of magnitude less in time) than state-of-the-art attacking methods. These examples also preserve the original sentiment according to human evaluation. Additionally, our framework produces an improved classifier that is robust in defending against multiple adversarial attacking methods. Code is available at: https://github.com/ibm-aur-nlp/adv-def-text-dist.

pdf bib
Semi-automatic Triage of Requests for Free Legal Assistance
Meladel Mistica | Jey Han Lau | Brayden Merrifield | Kate Fazio | Timothy Baldwin
Proceedings of the Natural Legal Language Processing Workshop 2021

Free legal assistance is critically under-resourced, and many of those who seek legal help have their needs unmet. A major bottleneck in the provision of free legal assistance to those most in need is the determination of the precise nature of the legal problem. This paper describes a collaboration with a major provider of free legal assistance, and the deployment of natural language processing models to assign area-of-law categories to real-world requests for legal assistance. In particular, we focus on an investigation of models to generate efficiencies in the triage process, but also the risks associated with naive use of model predictions, including fairness across different user demographics.

2020

pdf bib
Liputan6: A Large-scale Indonesian Dataset for Text Summarization
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document–summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by examining machine-generated summaries that have low ROUGE scores, and expose both issues with ROUGE itself, as well as with extractive and abstractive summarization models.

pdf bib
Give Me Convenience and Give Her Death: Who Should Decide What Uses of NLP are Appropriate, and on What Basis?
Kobi Leins | Jey Han Lau | Timothy Baldwin
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

As part of growing NLP capabilities, coupled with an awareness of the ethical dimensions of research, questions have been raised about whether particular datasets and tasks should be deemed off-limits for NLP research. We examine this question with respect to a paper on automatic legal sentencing from EMNLP 2019 which was a source of some debate, in asking whether the paper should have been allowed to be published, who should have been charged with making such a decision, and on what basis. We focus in particular on the role of data statements in ethically assessing research, but also discuss the topic of dual use, and examine the outcomes of similar debates in other scientific disciplines.

pdf bib
IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP
Fajri Koto | Afshin Rahimi | Jey Han Lau | Timothy Baldwin
Proceedings of the 28th International Conference on Computational Linguistics

Although the Indonesian language is spoken by almost 200 million people and the 10th most spoken language in the world, it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse. We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM, in addition to benchmarking it against existing resources. Our experiments show that IndoBERT achieves state-of-the-art performance over most of the tasks in IndoLEM.

pdf bib
How Furiously Can Colorless Green Ideas Sleep? Sentence Acceptability in Context
Jey Han Lau | Carlos Armendariz | Shalom Lappin | Matthew Purver | Chang Shu
Transactions of the Association for Computational Linguistics, Volume 8

We study the influence of context on sentence acceptability. First we compare the acceptability ratings of sentences judged in isolation, with a relevant context, and with an irrelevant context. Our results show that context induces a cognitive load for humans, which compresses the distribution of ratings. Moreover, in relevant contexts we observe a discourse coherence effect that uniformly raises acceptability. Next, we test unidirectional and bidirectional language models in their ability to predict acceptability ratings. The bidirectional models show very promising results, with the best model achieving a new state-of-the-art for unsupervised acceptability prediction. The two sets of experiments provide insights into the cognitive aspects of sentence processing and central issues in the computational modeling of text and discourse.

2019

pdf bib
Early Rumour Detection
Kaimin Zhou | Chang Shu | Binyang Li | Jey Han Lau
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Rumours can spread quickly through social media, and malicious ones can bring about significant economical and social impact. Motivated by this, our paper focuses on the task of rumour detection; particularly, we are interested in understanding how early we can detect them. Although there are numerous studies on rumour detection, few are concerned with the timing of the detection. A successfully-detected malicious rumour can still cause significant damage if it isn’t detected in a timely manner, and so timing is crucial. To address this, we present a novel methodology for early rumour detection. Our model treats social media posts (e.g. tweets) as a data stream and integrates reinforcement learning to learn the number minimum number of posts required before we classify an event as a rumour. Experiments on Twitter and Weibo demonstrate that our model identifies rumours earlier than state-of-the-art systems while maintaining a comparable accuracy.

pdf bib
From Shakespeare to Li-Bai: Adapting a Sonnet Model to Chinese Poetry
Zhuohan Xie | Jey Han Lau | Trevor Cohn
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association

In this paper, we adapt Deep-speare, a joint neural network model for English sonnets, to Chinese poetry. We illustrate characteristics of Chinese quatrain and explain our architecture as well as training and generation procedure, which differs from Shakespeare sonnets in several aspects. We analyse the generated poetry and find that model works well for Chinese poetry, as it can: (1) generate coherent 4-line quatrains of different topics; and (2) capture rhyme automatically (to a certain extent).

pdf bib
Improved Document Modelling with a Neural Discourse Parser
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association

Despite the success of attention-based neural models for natural language generation and classification tasks, they are unable to capture the discourse structure of larger documents. We hypothesize that explicit discourse representations have utility for NLP tasks over longer documents or document sequences, which sequence-to-sequence models are unable to capture. For abstractive summarization, for instance, conventional neural models simply match source documents and the summary in a latent space without explicit representation of text structure or relations. In this paper, we propose to use neural discourse representations obtained from a rhetorical structure theory (RST) parser to enhance document representations. Specifically, document representations are generated for discourse spans, known as the elementary discourse units (EDUs). We empirically investigate the benefit of the proposed approach on two different tasks: abstractive summarization and popularity prediction of online petitions. We find that the proposed approach leads to substantial improvements in all cases.

2018

pdf bib
Topic Intrusion for Automatic Topic Model Evaluation
Shraey Bhatia | Jey Han Lau | Timothy Baldwin
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Topic coherence is increasingly being used to evaluate topic models and filter topics for end-user applications. Topic coherence measures how well topic words relate to each other, but offers little insight on the utility of the topics in describing the documents. In this paper, we explore the topic intrusion task — the task of guessing an outlier topic given a document and a few topics — and propose a method to automate it. We improve upon the state-of-the-art substantially, demonstrating its viability as an alternative method for topic model evaluation.

pdf bib
Deep-speare: A joint neural model of poetic language, meter and rhyme
Jey Han Lau | Trevor Cohn | Timothy Baldwin | Julian Brooke | Adam Hammond
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we propose a joint architecture that captures language, rhyme and meter for sonnet modelling. We assess the quality of generated poems using crowd and expert judgements. The stress and rhyme models perform very well, as generated poems are largely indistinguishable from human-written poems. Expert evaluation, however, reveals that a vanilla language model captures meter implicitly, and that machine-generated poems still underperform in terms of readability and emotion. Our research shows the importance expert evaluation for poetry generation, and that future research should look beyond rhyme/meter and focus on poetic language.

pdf bib
The Influence of Context on Sentence Acceptability Judgements
Jean-Philippe Bernardy | Shalom Lappin | Jey Han Lau
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We investigate the influence that document context exerts on human acceptability judgements for English sentences, via two sets of experiments. The first compares ratings for sentences presented on their own with ratings for the same set of sentences given in their document contexts. The second assesses the accuracy with which two types of neural models — one that incorporates context during training and one that does not — predict these judgements. Our results indicate that: (1) context improves acceptability ratings for ill-formed sentences, but also reduces them for well-formed sentences; and (2) context helps unsupervised systems to model acceptability.

pdf bib
Preferred Answer Selection in Stack Overflow: Better Text Representations ... and Metadata, Metadata, Metadata
Steven Xu | Andrew Bennett | Doris Hoogeveen | Jey Han Lau | Timothy Baldwin
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

Community question answering (cQA) forums provide a rich source of data for facilitating non-factoid question answering over many technical domains. Given this, there is considerable interest in answer retrieval from these kinds of forums. However this is a difficult task as the structure of these forums is very rich, and both metadata and text features are important for successful retrieval. While there has recently been a lot of work on solving this problem using deep learning models applied to question/answer text, this work has not looked at how to make use of the rich metadata available in cQA forums. We propose an attention-based model which achieves state-of-the-art results for text-based answer selection alone, and by making use of complementary meta-data, achieves a substantially higher result over two reference datasets novel to this work.

2017

pdf bib
Multimodal Topic Labelling
Ionut Sorodoc | Jey Han Lau | Nikolaos Aletras | Timothy Baldwin
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Topics generated by topic models are typically presented as a list of topic terms. Automatic topic labelling is the task of generating a succinct label that summarises the theme or subject of a topic, with the intention of reducing the cognitive load of end-users when interpreting these topics. Traditionally, topic label systems focus on a single label modality, e.g. textual labels. In this work we propose a multimodal approach to topic labelling using a simple feedforward neural network. Given a topic and a candidate image or textual label, our method automatically generates a rating for the label, relative to the topic. Experiments show that this multimodal approach outperforms single-modality topic labelling systems.

pdf bib
End-to-end Network for Twitter Geolocation Prediction and Hashing
Jey Han Lau | Lianhua Chi | Khoi-Nguyen Tran | Trevor Cohn
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We propose an end-to-end neural network to predict the geolocation of a tweet. The network takes as input a number of raw Twitter metadata such as the tweet message and associated user account information. Our model is language independent, and despite minimal feature engineering, it is interpretable and capable of learning location indicative words and timing patterns. Compared to state-of-the-art systems, our model outperforms them by 2%-6%. Additionally, we propose extensions to the model to compress representation learnt by the network into binary codes. Experiments show that it produces compact codes compared to benchmark hashing algorithms. An implementation of the model is released publicly.

pdf bib
An Automatic Approach for Document-level Topic Model Evaluation
Shraey Bhatia | Jey Han Lau | Timothy Baldwin
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Topic models jointly learn topics and document-level topic distribution. Extrinsic evaluation of topic models tends to focus exclusively on topic-level evaluation, e.g. by assessing the coherence of topics. We demonstrate that there can be large discrepancies between topic- and document-level model quality, and that basing model evaluation on topic-level analysis can be highly misleading. We propose a method for automatically predicting topic model quality based on analysis of document-level topic allocations, and provide empirical evidence for its robustness.

pdf bib
Topically Driven Neural Language Model
Jey Han Lau | Timothy Baldwin | Trevor Cohn
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language models are typically applied at the sentence level, without access to the broader document context. We present a neural language model that incorporates document context in the form of a topic model-like architecture, thus providing a succinct representation of the broader document context outside of the current sentence. Experiments over a range of datasets demonstrate that our model outperforms a pure sentence-based model in terms of language model perplexity, and leads to topics that are potentially more coherent than those produced by a standard LDA topic model. Our model also has the ability to generate related sentences for a topic, providing another way to interpret topics.

pdf bib
Decoupling Encoder and Decoder Networks for Abstractive Document Summarization
Ying Xu | Jey Han Lau | Timothy Baldwin | Trevor Cohn
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres

Abstractive document summarization seeks to automatically generate a summary for a document, based on some abstract “understanding” of the original document. State-of-the-art techniques traditionally use attentive encoder–decoder architectures. However, due to the large number of parameters in these models, they require large training datasets and long training times. In this paper, we propose decoupling the encoder and decoder networks, and training them separately. We encode documents using an unsupervised document encoder, and then feed the document vector to a recurrent neural network decoder. With this decoupled architecture, we decrease the number of parameters in the decoder substantially, and shorten its training time. Experiments show that the decoupled model achieves comparable performance with state-of-the-art models for in-domain documents, but less well for out-of-domain documents.

2016

pdf bib
Automatic Labelling of Topics with Neural Embeddings
Shraey Bhatia | Jey Han Lau | Timothy Baldwin
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Topics generated by topic models are typically represented as list of terms. To reduce the cognitive overhead of interpreting these topics for end-users, we propose labelling a topic with a succinct phrase that summarises its theme or idea. Using Wikipedia document titles as label candidates, we compute neural embeddings for documents and words to select the most relevant labels for topics. Comparing to a state-of-the-art topic labelling system, our methodology is simpler, more efficient and finds better topic labels.

pdf bib
The Sensitivity of Topic Coherence Evaluation to Topic Cardinality
Jey Han Lau | Timothy Baldwin
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
LexSemTm: A Semantic Dataset Based on All-words Unsupervised Sense Distribution Learning
Andrew Bennett | Timothy Baldwin | Jey Han Lau | Diana McCarthy | Francis Bond
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
Jey Han Lau | Timothy Baldwin
Proceedings of the 1st Workshop on Representation Learning for NLP

2015

pdf bib
Unsupervised Prediction of Acceptability Judgements
Jey Han Lau | Alexander Clark | Shalom Lappin
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

pdf bib
Novel Word-sense Identification
Paul Cook | Jey Han Lau | Diana McCarthy | Timothy Baldwin
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality
Jey Han Lau | David Newman | Timothy Baldwin
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses Using Topic Models
Jey Han Lau | Paul Cook | Diana McCarthy | Spandana Gella | Timothy Baldwin
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Automatic Detection and Language Identification of Multilingual Documents
Marco Lui | Jey Han Lau | Timothy Baldwin
Transactions of the Association for Computational Linguistics, Volume 2

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions. We demonstrate the effectiveness of our method over synthetic data, as well as real-world multilingual documents collected from the web.

2013

pdf bib
Unsupervised Word Class Induction for Under-resourced Languages: A Case Study on Indonesian
Meladel Mistica | Jey Han Lau | Timothy Baldwin
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
unimelb: Topic Modelling-based Word Sense Induction for Web Snippet Clustering
Jey Han Lau | Paul Cook | Timothy Baldwin
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
unimelb: Topic Modelling-based Word Sense Induction
Jey Han Lau | Paul Cook | Timothy Baldwin
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

pdf bib
On-line Trend Analysis with Topic Models: #twitter Trends Detection Topic Model Online
Jey Han Lau | Nigel Collier | Timothy Baldwin
Proceedings of COLING 2012

pdf bib
Bayesian Text Segmentation for Index Term Identification and Keyphrase Extraction
David Newman | Nagendra Koilada | Jey Han Lau | Timothy Baldwin
Proceedings of COLING 2012

pdf bib
Word Sense Induction for Novel Sense Detection
Jey Han Lau | Paul Cook | Diana McCarthy | David Newman | Timothy Baldwin
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

pdf bib
Automatic Labelling of Topic Models
Jey Han Lau | Karl Grieser | David Newman | Timothy Baldwin
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
Best Topic Word Selection for Topic Labelling
Jey Han Lau | David Newman | Sarvnaz Karimi | Timothy Baldwin
Coling 2010: Posters

pdf bib
Automatic Evaluation of Topic Coherence
David Newman | Jey Han Lau | Karl Grieser | Timothy Baldwin
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics