Michael J. Witbrock

Also published as: Michael Witbrock


2024

Can Large Language Models Learn Independent Causal Mechanisms?
Gael Gendron | Bao Trung Nguyen | Alex Yuxuan Peng | Michael Witbrock | Gillian Dobbie
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Despite impressive performance on language modelling and complex reasoning tasks, Large Language Models (LLMs) fall short on the same tasks in uncommon settings or with distribution shifts, exhibiting a lack of generalisation ability. By contrast, systems such as causal models, that learn abstract variables and causal relationships, can demonstrate increased robustness against changes in the distribution. One reason for this success is the existence and use of Independent Causal Mechanisms (ICMs) representing high-level concepts that only sparsely interact. In this work, we apply two concepts from causality to learn ICMs within LLMs. We develop a new LLM architecture composed of multiple sparsely interacting language modelling modules. We show that such causal constraints can improve out-of-distribution performance on abstract and causal reasoning tasks. We also investigate the level of independence and domain specialisation and show that LLMs rely on pre-trained partially domain-invariant mechanisms resilient to fine-tuning.

Abstract Meaning Representation-Based Logic-Driven Data Augmentation for Logical Reasoning
Qiming Bao | Alex Peng | Zhenyun Deng | Wanjun Zhong | Gael Gendron | Timothy Pistotti | Neset Tan | Nathan Young | Yang Chen | Yonghua Zhu | Paul Denny | Michael Witbrock | Jiamou Liu
Findings of the Association for Computational Linguistics: ACL 2024

Combining large language models with logical reasoning enhances their capacity to address problems in a robust and reliable manner. Nevertheless, the intricate nature of logical reasoning poses challenges when gathering reliable data from the web to build comprehensive training datasets, subsequently affecting performance on downstream tasks. To address this, we introduce a novel logic-driven data augmentation approach, AMR-LDA. AMR-LDA converts the original text into an Abstract Meaning Representation (AMR) graph, a structured semantic representation that encapsulates the logical structure of the sentence, upon which operations are performed to generate logically modified AMR graphs. The modified AMR graphs are subsequently converted back into text to create augmented data. Notably, our methodology is architecture-agnostic and enhances both generative large language models, such as GPT-3.5 and GPT-4, through prompt augmentation, and discriminative large language models through contrastive learning with logic-driven data augmentation. Empirical evidence underscores the efficacy of our proposed method with improvement in performance across seven downstream tasks, such as reading comprehension requiring logical reasoning, textual entailment, and natural language inference. Furthermore, our method leads on the ReClor leaderboard. The source code and data are publicly available.

2023

Multi2Claim: Generating Scientific Claims from Multi-Choice Questions for Scientific Fact-Checking
Neset Tan | Trung Nguyen | Josh Bensemann | Alex Peng | Qiming Bao | Yang Chen | Mark Gahegan | Michael Witbrock
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Training machine learning models to successfully perform scientific fact-checking tasks is challenging due to the expertise bottleneck that limits the availability of appropriate training datasets. In this task, models use textual evidence to confirm scientific claims, which requires data that contains extensive domain-expert annotation. Consequently, the number of existing scientific-fact-checking datasets and the sizes of those datasets are limited. However, these limitations do not apply to multiple-choice question datasets because of the necessity of domain exams in the modern education system. As one of the first steps towards addressing the fact-checking dataset scarcity problem in scientific domains, we propose a pipeline for automatically converting multiple-choice questions into fact-checking data, which we call Multi2Claim. By applying the proposed pipeline, we generated two large-scale datasets for scientific-fact-checking tasks: Med-Fact and Gsci-Fact for the medical and general science domains, respectively. These two datasets are among the first examples of large-scale scientific-fact-checking datasets. We developed baseline models for the verdict prediction task using each dataset. Additionally, we demonstrated that the datasets could be used to improve performance with respect to the F1 weighted metric on existing fact-checking datasets such as SciFact, HEALTHVER, COVID-Fact, and CLIMATE-FEVER. In some cases, the improvement in performance was up to a 26% increase.

2022

Eye Gaze and Self-attention: How Humans and Transformers Attend Words in Sentences
Joshua Bensemann | Alex Peng | Diana Benavides-Prado | Yang Chen | Neset Tan | Paul Michael Corballis | Patricia Riddle | Michael Witbrock
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Attention describes cognitive processes that are important to many human phenomena including reading. The term is also used to describe the way in which transformer neural networks perform natural language processing. While attention appears to be very different under these two contexts, this paper presents an analysis of the correlations between transformer attention and overt human attention during reading tasks. An extensive analysis of human eye tracking datasets showed that the dwell times of human eye movements were strongly correlated with the attention patterns occurring in the early layers of pre-trained transformers such as BERT. Additionally, the strength of a correlation was not related to the number of parameters within a transformer. This suggests that something about the transformers’ architecture determined how closely the two measures were correlated.

Explicit Graph Reasoning Fusing Knowledge and Contextual Information for Multi-hop Question Answering
Zhenyun Deng | Yonghua Zhu | Qianqian Qi | Michael Witbrock | Patricia Riddle
Proceedings of the 2nd Workshop on Deep Learning on Graphs for Natural Language Processing (DLG4NLP 2022)

Current graph-neural-network-based (GNN-based) approaches to multi-hop questions integrate clues from scattered paragraphs in an entity graph, achieving implicit reasoning by synchronous update of graph node representations using information from neighbours; this is poorly suited for explaining how clues are passed through the graph in hops. In this paper, we describe a structured Knowledge and contextual Information Fusion GNN (KIFGraph) whose explicit multi-hop graph reasoning mimics human step-by-step reasoning. Specifically, we first integrate clues at multiple levels of granularity (question, paragraph, sentence, entity) as nodes in the graph, connected by edges derived using structured semantic knowledge, then use a contextual encoder to obtain the initial node representations, followed by step-by-step two-stage graph reasoning that asynchronously updates node representations. Each node can be related to its neighbour nodes through fused structured knowledge and contextual information, reliably integrating their answer clues. Moreover, a masked attention mechanism (MAM) filters out noisy or redundant nodes and edges, to avoid ineffective clue propagation in graph reasoning. Experimental results show performance competitive with published models on the HotpotQA dataset.

AbductionRules: Training Transformers to Explain Unexpected Inputs
Nathan Young | Qiming Bao | Joshua Bensemann | Michael Witbrock
Findings of the Association for Computational Linguistics: ACL 2022

Transformers have recently been shown to be capable of reliably performing logical reasoning over facts and rules expressed in natural language, but abductive reasoning - inference to the best explanation of an unexpected observation - has been underexplored despite significant applications to scientific discovery, common-sense reasoning, and model interpretability. This paper presents AbductionRules, a group of natural language datasets designed to train and test generalisable abduction over natural-language knowledge bases. We use these datasets to finetune pretrained Transformers and discuss their performance, finding that our models learned generalisable abductive techniques but also learned to exploit the structure of our data. Finally, we discuss the viability of this approach to abductive reasoning and ways in which it may be improved in future work.

TaKG: A New Dataset for Paragraph-level Table-to-Text Generation Enhanced with Knowledge Graphs
Qianqian Qi | Zhenyun Deng | Yonghua Zhu | Lia Jisoo Lee | Michael Witbrock | Jiamou Liu
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

We introduce TaKG, a new table-to-text generation dataset with the following highlights: (1) TaKG defines a long-text (paragraph-level) generation task as opposed to well-established short-text (sentence-level) generation datasets. (2) TaKG is the first large-scale dataset for this task, containing three application domains and ~750,000 samples. (3) To address the divergence phenomenon, TaKG enhances table input using external knowledge graphs, extracted by a new Wikidata-based method. We then propose a new Transformer-based multimodal sequence-to-sequence architecture for TaKG that integrates two pretrained language models RoBERTa and GPT-2. Our model shows reliable performance on long-text generation across a variety of metrics, and outperforms existing models for short-text generation tasks.

Prompt-based Conservation Learning for Multi-hop Question Answering
Zhenyun Deng | Yonghua Zhu | Yang Chen | Qianqian Qi | Michael Witbrock | Patricia Riddle
Proceedings of the 29th International Conference on Computational Linguistics

Multi-hop question answering (QA) requires reasoning over multiple documents to answer a complex question and provide interpretable supporting evidence. However, providing supporting evidence is not enough to demonstrate that a model has performed the desired reasoning to reach the correct answer. Most existing multi-hop QA methods fail to answer a large fraction of sub-questions, even if their parent questions are answered correctly. In this paper, we propose the Prompt-based Conservation Learning (PCL) framework for multi-hop QA, which acquires new knowledge from multi-hop QA tasks while conserving old knowledge learned on single-hop QA tasks, mitigating forgetting. Specifically, we first train a model on existing single-hop QA tasks, and then freeze this model and expand it by allocating additional sub-networks for the multi-hop QA task. Moreover, to condition pre-trained language models to stimulate the kind of reasoning required for specific multi-hop questions, we learn soft prompts for the novel sub-networks to perform type-specific reasoning. Experimental results on the HotpotQA benchmark show that PCL is competitive for multi-hop QA and retains good performance on the corresponding single-hop sub-questions, demonstrating the efficacy of PCL in mitigating knowledge loss by forgetting.

2020

Convolutional and Recurrent Neural Networks for Spoken Emotion Recognition
Aaron Keesing | Ian Watson | Michael Witbrock
Proceedings of the 18th Annual Workshop of the Australasian Language Technology Association

We test four models proposed in the speech emotion recognition (SER) literature on 15 public and academic-licensed datasets in speaker-independent cross-validation. Results indicate differences in the performance of the models, which are partly dependent on the dataset and features used. We also show that a standard utterance-level feature set still performs competitively with neural models on some datasets. This work serves as a starting point for future model comparisons, and we open-source the testing code.

2019

Graph Enhanced Cross-Domain Text-to-SQL Generation
Siyu Huo | Tengfei Ma | Jie Chen | Maria Chang | Lingfei Wu | Michael Witbrock
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

Semantic parsing is a fundamental problem in natural language understanding, as it involves the mapping of natural language to structured forms such as executable queries or logic-like knowledge representations. Existing deep learning approaches for semantic parsing have shown promise on a variety of benchmark data sets, particularly on text-to-SQL parsing. However, most text-to-SQL parsers do not generalize to unseen data sets in different domains. In this paper, we propose a new cross-domain learning scheme to perform text-to-SQL translation and demonstrate its use on Spider, a large-scale cross-domain text-to-SQL data set. We improve upon a state-of-the-art Spider model, SyntaxSQLNet, by constructing a graph of column names for all databases and using graph neural networks to compute their embeddings. The resulting embeddings offer better cross-domain representations and SQL queries, as evidenced by substantial improvement on the Spider data set compared to SyntaxSQLNet.

2018

A Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset
Michael Boratko | Harshit Padigela | Divyendra Mikkilineni | Pritish Yuvraj | Rajarshi Das | Andrew McCallum | Maria Chang | Achille Fokoue-Nkoutche | Pavan Kapanipathi | Nicholas Mattei | Ryan Musa | Kartik Talamadupula | Michael Witbrock
Proceedings of the Workshop on Machine Reading for Question Answering

The recent work of Clark et al. (2018) introduces the AI2 Reasoning Challenge (ARC) and the associated ARC dataset that partitions open domain, complex science questions into easy and challenge sets. That paper includes an analysis of 100 questions with respect to the types of knowledge and reasoning required to answer them; however, it does not include clear definitions of these types, nor does it offer information about the quality of the labels. We propose a comprehensive set of definitions of knowledge and reasoning types necessary for answering the questions in the ARC dataset. Using ten annotators and a sophisticated annotation interface, we analyze the distribution of labels across the challenge set and statistics related to them. Additionally, we demonstrate that although naive information retrieval methods return sentences that are irrelevant to answering the query, sufficient supporting text is often present in the (ARC) corpus. Evaluating with human-selected relevant sentences improves the performance of a neural machine comprehension model by 42 points.

Word Mover’s Embedding: From Word2Vec to Document Embedding
Lingfei Wu | Ian En-Hsu Yen | Kun Xu | Fangli Xu | Avinash Balakrishnan | Pin-Yu Chen | Pradeep Ravikumar | Michael J. Witbrock
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

While the celebrated Word2Vec technique yields semantically rich representations for individual words, there has been relatively less success in extending it to generate unsupervised sentence or document embeddings. Recent work has demonstrated that a distance measure between documents called Word Mover’s Distance (WMD), which aligns semantically similar words, yields unprecedented KNN classification accuracy. However, WMD is expensive to compute, and it is hard to extend its use beyond a KNN classifier. In this paper, we propose the Word Mover’s Embedding (WME), a novel approach to building an unsupervised document (sentence) embedding from pre-trained word embeddings. In our experiments on 9 benchmark text classification datasets and 22 textual similarity tasks, the proposed technique consistently matches or outperforms state-of-the-art techniques, with significantly higher accuracy on problems of short length.

An Interface for Annotating Science Questions
Michael Boratko | Harshit Padigela | Divyendra Mikkilineni | Pritish Yuvraj | Rajarshi Das | Andrew McCallum | Maria Chang | Achille Fokoue | Pavan Kapanipathi | Nicholas Mattei | Ryan Musa | Kartik Talamadupula | Michael Witbrock
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Recent work introduces the AI2 Reasoning Challenge (ARC) and the associated ARC dataset that partitions open domain, complex science questions into an Easy Set and a Challenge Set. That work includes an analysis of 100 questions with respect to the types of knowledge and reasoning required to answer them. However, it does not include clear definitions of these types, nor does it offer information about the quality of the labels or the annotation process used. In this paper, we introduce a novel interface for human annotation of science question-answer pairs with their respective knowledge and reasoning types, in order that the classification of new questions may be improved. We build on the classification schema proposed by prior work on the ARC dataset, and evaluate the effectiveness of our interface with a preliminary study involving 10 participants.

2009

Building Conversational Agents with Basilica
Rohit Kumar | Carolyn P. Rosé | Michael J. Witbrock
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Demonstration Session

2004

Inferring parts of speech for lexical mappings via the Cyc KB
Tom O’Hara | Stefano Bertolo | Michael Witbrock | Bjørn Aldag | Jon Curtis | Kathy Panton | Dave Schneider | Nancy Salay
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2000

Headline Generation Based on Statistical Translation
Michele Banko | Vibhu O. Mittal | Michael J. Witbrock
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics