Jan Buys


2023

pdf bib
Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation
Francois Meyer | Jan Buys
Findings of the Association for Computational Linguistics: ACL 2023

Subword segmenters like BPE operate as a preprocessing step in neural machine translation and other (conditional) language models. They are applied to datasets before training, so translation or text generation quality relies on the quality of segmentations. We propose a departure from this paradigm, called subword segmental machine translation (SSMT). SSMT unifies subword segmentation and MT in a single trainable model. It learns to segment target sentence words while jointly learning to generate target sentences. To use SSMT during inference we propose dynamic decoding, a text generation algorithm that adapts segmentations as it generates translations. Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages. Gains are strongest in the very low-resource scenario. SSMT also learns subwords that are closer to morphemes compared to baselines and proves more robust on a test set constructed for evaluating morphological compositional generalisation.

pdf bib
Policy-based Reinforcement Learning for Generalisation in Interactive Text-based Environments
Edan Toledo | Jan Buys | Jonathan Shock
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Text-based environments enable RL agents to learn to converse and perform interactive tasks through natural language. However, previous RL approaches applied to text-based environments show poor performance when evaluated on unseen games. This paper investigates the improvement of generalisation performance through the simple switch from a value-based update method to a policy-based one, within text-based environments. We show that by replacing commonly used value-based methods with REINFORCE with baseline, a far more general agent is produced. The policy-based agent is evaluated on Coin Collector and Question Answering with interactive text (QAit), two text-based environments designed to test zero-shot performance. We see substantial improvements on a variety of zero-shot evaluation experiments, including tripling accuracy on various QAit benchmark configurations. The results indicate that policy-based RL has significantly better generalisation capabilities than value-based methods within such text-based environments, suggesting that RL agents could be applied to more complex natural language environments.

2022

pdf bib
Generic Overgeneralization in Pre-trained Language Models
Sello Ralethe | Jan Buys
Proceedings of the 29th International Conference on Computational Linguistics

Generic statements such as “ducks lay eggs” make claims about kinds, e.g., ducks as a category. The generic overgeneralization effect refers to the inclination to accept false universal generalizations such as “all ducks lay eggs” or “all lions have manes” as true. In this paper, we investigate the generic overgeneralization effect in pre-trained language models experimentally. We show that pre-trained language models suffer from overgeneralization and tend to treat quantified generic statements such as “all ducks lay eggs” as if they were true generics. Furthermore, we demonstrate how knowledge embedding methods can lessen this effect by injecting factual knowledge about kinds into pre-trained language models. To this end, we source factual knowledge about two types of generics, minority characteristic generics and majority characteristic generics, and inject this knowledge using a knowledge embedding model. Our results show that knowledge injection reduces, but does not eliminate, generic overgeneralization, and that majority characteristic generics of kinds are more susceptible to overgeneralization bias.

pdf bib
Subword Segmental Language Modelling for Nguni Languages
Francois Meyer | Jan Buys
Findings of the Association for Computational Linguistics: EMNLP 2022

Subwords have become the standard units of text in NLP, enabling efficient open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is viewed as a preprocessing step applied to the corpus before training. This can lead to sub-optimal segmentations for low-resource languages with complex morphologies. We propose a subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling. By unifying subword segmentation and language modelling, our model learns subwords that optimise LM performance. We train our model on the 4 Nguni languages of South Africa. These are low-resource agglutinative languages, so subword information is critical. As an LM, SSLM outperforms existing approaches such as BPE-based models on average across the 4 languages. Furthermore, it outperforms standard subword segmenters on unsupervised morphological segmentation. We also train our model as a word-level sequence model, resulting in an unsupervised morphological segmenter that outperforms existing methods by a large margin for all 4 languages. Our results show that learning subword segmentation is an effective alternative to existing subword segmenters, enabling the model to discover morpheme-like subwords that improve its LM capabilities.

pdf bib
A Sequence Modelling Approach to Question Answering in Text-Based Games
Gregory Furman | Edan Toledo | Jonathan Shock | Jan Buys
Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022)

Interactive Question Answering (IQA) requires an intelligent agent to interact with a dynamic environment in order to gather information necessary to answer a question. IQA tasks have been proposed as means of training systems to develop language or visual comprehension abilities. To this end, the Question Answering with Interactive Text (QAit) task was created to produce and benchmark interactive agents capable of seeking information and answering questions in unseen environments. While prior work has exclusively focused on IQA as a reinforcement learning problem, such methods suffer from low sample efficiency and poor accuracy in zero-shot evaluation. In this paper, we propose the use of the recently proposed Decision Transformer architecture to provide improvements upon prior baselines. By utilising a causally masked GPT-2 Transformer for command generation and a BERT model for question answer prediction, we show that the Decision Transformer achieves performance greater than or equal to current state-of-the-art RL baselines on the QAit task in a sample efficient manner. In addition, these results are achievable by training on sub-optimal random trajectories, therefore not requiring the use of online agents to gather data.

pdf bib
University of Cape Town’s WMT22 System: Multilingual Machine Translation for Southern African Languages
Khalid Elmadani | Francois Meyer | Jan Buys
Proceedings of the Seventh Conference on Machine Translation (WMT)

The paper describes the University of Cape Town’s submission to the constrained track of the WMT22 Shared Task: Large-Scale Machine Translation Evaluation for African Languages. Our system is a single multilingual translation model that translates between English and 8 South / South East African Languages, as well as between specific pairs of the African languages. We used several techniques suited for low-resource machine translation (MT), including overlap BPE, back-translation, synthetic training data generation, and adding more translation directions during training. Our results show the value of these techniques, especially for directions where very little or no bilingual training data is available.

2021

pdf bib
Discourse Understanding and Factual Consistency in Abstractive Summarization
Saadia Gabriel | Antoine Bosselut | Jeff Da | Ari Holtzman | Jan Buys | Kyle Lo | Asli Celikyilmaz | Yejin Choi
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We introduce a general framework for abstractive summarization with factual consistency and distinct modeling of the narrative flow in an output summary. Our work addresses current limitations of models for abstractive summarization that often hallucinate information or generate summaries with coherence issues. To generate abstractive summaries with factual consistency and narrative flow, we propose Cooperative Generator-Discriminator Networks (Co-opNet), a novel transformer-based framework where the generator works with a discriminator architecture to compose coherent long-form summaries. We explore four different discriminator objectives which each capture a different aspect of coherence, including whether salient spans of generated abstracts are hallucinated or appear in the input context, and the likelihood of sentence adjacency in generated abstracts. We measure the ability of Co-opNet to learn these objectives with arXiv scientific papers, using the abstracts as a proxy for gold long-form scientific article summaries. Empirical results from automatic and human evaluations demonstrate that Co-opNet learns to summarize with considerably improved global coherence compared to competitive baselines.

pdf bib
RepGraph: Visualising and Analysing Meaning Representation Graphs
Jaron Cohen | Roy Cohen | Edan Toledo | Jan Buys
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present RepGraph, an open source visualisation and analysis tool for meaning representation graphs. Graph-based meaning representations provide rich semantic annotations, but visualising them clearly is more challenging than for fully lexicalized representations. Our application provides a seamless, unifying interface with which to visualise, manipulate and analyse semantically parsed graph data represented in a JSON-based serialisation format. RepGraph visualises graphs in multiple formats, with an emphasis on showing the relation between nodes and their corresponding token spans, whilst keeping the representation compact. Additionally, the web-based tool provides NLP researchers with a clear, visually intuitive way of interacting with these graphs, and includes a number of graph analysis features. The tool currently supports the DMRS, EDS, PTG, UCCA, and AMR semantic frameworks. A live demo is available at https://repgraph.vercel.app/.

2019

pdf bib
Neural Text Generation from Rich Semantic Representations
Valerie Hajdik | Jan Buys | Michael Wayne Goodman | Emily M. Bender
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose neural models to generate high-quality text from structured representations based on Minimal Recursion Semantics (MRS). MRS is a rich semantic representation that encodes more precise semantic detail than other representations such as Abstract Meaning Representation (AMR). We show that a sequence-to-sequence model that maps a linearization of Dependency MRS, a graph-based representation of MRS, to text can achieve a BLEU score of 66.11 when trained on gold data. The performance of the model can be improved further using a high-precision, broad coverage grammar-based parser to generate a large silver training corpus, achieving a final BLEU score of 77.17 on the full test set, and 83.37 on the subset of test data most closely matching the silver data domain. Our results suggest that MRS-based representations are a good choice for applications that need both structured semantics and the ability to produce natural language text as output.

pdf bib
Benchmarking Hierarchical Script Knowledge
Yonatan Bisk | Jan Buys | Karl Pichotta | Yejin Choi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Understanding procedural language requires reasoning about both hierarchical and temporal relations between events. For example, “boiling pasta” is a sub-event of “making a pasta dish”, typically happens before “draining pasta,” and requires the use of omitted tools (e.g. a strainer, sink...). While people are able to choose when and how to use abstract versus concrete instructions, the NLP community lacks corpora and tasks for evaluating if our models can do the same. In this paper, we introduce KidsCook, a parallel script corpus, as well as a cloze task which matches video captions with missing procedural details. Experimental results show that state-of-the-art models struggle at this task, which requires inducing functional commonsense knowledge not explicitly stated in text.

pdf bib
BottleSum: Unsupervised and Self-supervised Sentence Summarization using the Information Bottleneck Principle
Peter West | Ari Holtzman | Jan Buys | Yejin Choi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The principle of the Information Bottleneck (Tishby et al., 1999) produces a summary of information X optimized to predict some other relevant information Y. In this paper, we propose a novel approach to unsupervised sentence summarization by mapping the Information Bottleneck principle to a conditional language modelling objective: given a sentence, our approach seeks a compressed sentence that can best predict the next sentence. Our iterative algorithm under the Information Bottleneck objective searches gradually shorter subsequences of the given sentence while maximizing the probability of the next sentence conditioned on the summary. Using only pretrained language models with no direct supervision, our approach can efficiently perform extractive sentence summarization over a large corpus. Building on our unsupervised extractive summarization, we also present a new approach to self-supervised abstractive summarization, where a transformer-based language model is trained on the output summaries of our unsupervised method. Empirical results demonstrate that our extractive method outperforms other unsupervised models on multiple automatic metrics. In addition, we find that our self-supervised abstractive model outperforms unsupervised baselines (including our own) by human evaluation along multiple attributes.

2018

pdf bib
Neural Syntactic Generative Models with Exact Marginalization
Jan Buys | Phil Blunsom
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We present neural syntactic generative models with exact marginalization that support both dependency parsing and language modeling. Exact marginalization is made tractable through dynamic programming over shift-reduce parsing and minimal RNN-based feature sets. Our algorithms complement previous approaches by supporting batched training and enabling online computation of next word probabilities. For supervised dependency parsing, our model achieves a state-of-the-art result among generative approaches. We also report empirical results on unsupervised syntactic models and their role in language modeling. We find that our model formulation of latent dependencies with exact marginalization do not lead to better intrinsic language modeling performance than vanilla RNNs, and that parsing accuracy is not correlated with language modeling perplexity in stack-based models.

pdf bib
Learning to Write with Cooperative Discriminators
Ari Holtzman | Jan Buys | Maxwell Forbes | Antoine Bosselut | David Golub | Yejin Choi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite their local fluency, long-form text generated from RNNs is often generic, repetitive, and even self-contradictory. We propose a unified learning framework that collectively addresses all the above issues by composing a committee of discriminators that can guide a base RNN generator towards more globally coherent generations. More concretely, discriminators each specialize in a different principle of communication, such as Grice’s maxims, and are collectively combined with the base RNN generator through a composite decoding objective. Human evaluation demonstrates that text generated by our model is preferred over that of baselines by a large margin, significantly enhancing the overall coherence, style, and information of the generations.

2017

pdf bib
Robust Incremental Neural Semantic Graph Parsing
Jan Buys | Phil Blunsom
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Parsing sentences to linguistically-expressive semantic representations is a key goal of Natural Language Processing. Yet statistical parsing has focussed almost exclusively on bilexical dependencies or domain-specific logical forms. We propose a neural encoder-decoder transition-based parser which is the first full-coverage semantic graph parser for Minimal Recursion Semantics (MRS). The model architecture uses stack-based embedding features, predicting graphs jointly with unlexicalized predicates and their token alignments. Our parser is more accurate than attention-based baselines on MRS, and on an additional Abstract Meaning Representation (AMR) benchmark, and GPU batch processing makes it an order of magnitude faster than a high-precision grammar-based parser. Further, the 86.69% Smatch score of our MRS parser is higher than the upper-bound on AMR parsing, making MRS an attractive choice as a semantic representation.

pdf bib
Oxford at SemEval-2017 Task 9: Neural AMR Parsing with Pointer-Augmented Attention
Jan Buys | Phil Blunsom
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We present a neural encoder-decoder AMR parser that extends an attention-based model by predicting the alignment between graph nodes and sentence tokens explicitly with a pointer mechanism. Candidate lemmas are predicted as a pre-processing step so that the lemmas of lexical concepts, as well as constant strings, are factored out of the graph linearization and recovered through the predicted alignments. The approach does not rely on syntactic parses or extensive external resources. Our parser obtained 59% Smatch on the SemEval test set.

2016

pdf bib
Online Segment to Segment Neural Transduction
Lei Yu | Jan Buys | Phil Blunsom
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Cross-Lingual Morphological Tagging for Low-Resource Languages
Jan Buys | Jan A. Botha
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

pdf bib
A Bayesian Model for Generative Transition-based Dependency Parsing
Jan Buys | Phil Blunsom
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

pdf bib
Generative Incremental Dependency Parsing with Neural Networks
Jan Buys | Phil Blunsom
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2013

pdf bib
A Tree Transducer Model for Grammatical Error Correction
Jan Buys | Brink van der Merwe
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task