Jan Buys - ACL Anthology

Jan Buys

2025

Story Generation with Large Language Models for African Languages
Catherine Nana Nyaah Essuman | Jan Buys
Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)

The development of Large Language Models (LLMs) for African languages has been hindered by the lack of large-scale textual data. Previous research has shown that relatively small language models, when trained on synthetic data generated by larger models, can produce fluent, short English stories, providing a data-efficient alternative to large-scale pretraining. In this paper, we apply a similar approach to develop and evaluate small language models for generating childrens stories in isiZulu and Yoruba, using synthetic datasets created through translation and multilingual prompting. We train six language-specific models varying in dataset size and source, and based on the GPT-2 architecture. Our results show that models trained on synthetic low-resource data are capable of producing coherent and fluent short stories in isiZulu and Yoruba. Models trained on larger synthetic datasets generally perform better in terms of coherence and grammar, and also tend to generalize better, as seen by their lower evaluation perplexities. Models trained on datasets generated through prompting instead of translation generate similar or more coherent stories and display more creativity, but perform worse in terms of generalization to unseen data. In addition to the potential educational applications of the automated story generation, our approach has the potential to be used as the foundation for more data-efficient low-resource language models.

Cross-Lingual Knowledge Projection and Knowledge Enhancement for Zero-Shot Question Answering in Low-Resource Languages
Sello Ralethe | Jan Buys
Proceedings of the 31st International Conference on Computational Linguistics

Knowledge bases (KBs) in low-resource languages (LRLs) are often incomplete, posing a challenge for developing effective question answering systems over KBs in those languages. On the other hand, the size of training corpora for LRL language models is also limited, restricting the ability to do zero-shot question answering using multilingual language models. To address these issues, we propose a two-fold approach. First, we introduce LeNS-Align, a novel cross-lingual mapping technique which improves the quality of word alignments extracted from parallel English-LRL text by combining lexical alignment, named entity recognition, and semantic alignment. LeNS-Align is applied to perform cross-lingual projection of KB triples. Second, we leverage the projected KBs to enhance multilingual language models’ question answering capabilities by augmenting the models with Graph Neural Networks embedding the projected knowledge. We apply our approach to map triples from two existing English KBs, ConceptNet and DBpedia, to create comprehensive LRL knowledge bases for four low-resource South African languages. Evaluation on three translated test sets show that our approach improves zero-shot question answering accuracy by up to 17% compared to baselines without KB access. The results highlight how our approach contributes to bridging the knowledge gap for low-resource languages by expanding knowledge coverage and question answering capabilities.

The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
Francois Meyer | Jan Buys
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: Isi-Xhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isi-Xhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offers a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.

KERLQA: Knowledge-Enhanced Reinforcement Learning for Question Answering in Low-resource Languages
Sello Ralethe | Jan Buys
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Question answering in low-resource languages faces critical challenges when models encounter questions beyond their knowledge boundaries, often producing confident but incorrect answers. We propose Knowledge-Enhanced Reinforcement Learning for Question Answering (KERLQA), a novel approach that combines knowledge graph integration with reinforcement learning to enable principled abstention decisions. Unlike existing refusal-tuned methods that make binary decisions based solely on internal confidence, KERLQA implements a three-way decision process: answer with internal knowledge, answer with external knowledge assistance, or abstain. Using a composite reward function that jointly optimizes for correctness, appropriate abstention, and efficient knowledge utilization, we train policies via PPO and DPO with dynamic calibration for low-resource settings. Experiments on CommonsenseQA and OpenBookQA across English and four South African languages show KERLQA achieves improved F1 scores, with up to 6.2 point improvements in low-resource languages. Our analysis reveals that KERLQA reduces false positive abstention rates by 30% while expanding the boundary of answerable questions through external knowledge integration.

Cross-Lingual Knowledge Augmentation for Mitigating Generic Overgeneralization in Multilingual Language Models
Sello Ralethe | Jan Buys
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)

Generic statements like “birds fly” or “lions have manes” express generalizations about kinds that allow exceptions, yet language models tend to overgeneralize them to universal claims. While previous work showed that ASCENT KB could reduce this effect in English by 30-40%, the effectiveness of broader knowledge sources and the cross-lingual nature of this phenomenon remain unexplored. We investigate generic overgeneralization across English and four South African languages (isiZulu, isiXhosa, Sepedi, SeSotho), comparing the impact of ConceptNet and DBpedia against the previously used ASCENT KB. Our experiments show that ConceptNet reduces overgeneralization by 45-52%% for minority characteristic generics, while DBpedia achieves 48-58%% for majority characteristics, with combined knowledge bases reaching 67%% reduction. These improvements are consistent across all languages, though Nguni languages show higher baseline overgeneralization than Sotho-Tswana languages, potentially suggesting that morphological features may influence this semantic bias. Our findings demonstrate that commonsense and encyclopedic knowledge provide complementary benefits for multilingual semantic understanding, offering insights for developing NLP systems that capture nuanced semantics in low-resource languages.

2024

Multipath parsing in the brain
Berta Franzluebbers | Donald Dunagan | Miloš Stanojević | Jan Buys | John Hale
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Humans understand sentences word-by-word, in the order that they hear them. This incrementality entails resolving temporary ambiguities about syntactic relationships. We investigate how humans process these syntactic ambiguities by correlating predictions from incremental generative dependency parsers with timecourse data from people undergoing functional neuroimaging while listening to an audiobook. In particular, we compare competing hypotheses regarding the number of developing syntactic analyses in play during word-by-word comprehension: one vs more than one. This comparison involves evaluating syntactic surprisal from a state-of-the-art dependency parser with LLM-adapted encodings against an existing fMRI dataset. In both English and Chinese data, we find evidence for multipath parsing. Brain regions associated with this multipath effect include bilateral superior temporal gyrus.

A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation
Francois Meyer | Jan Buys
Findings of the Association for Computational Linguistics: NAACL 2024

Multilingual modelling can improve machine translation for low-resource languages, partly through shared subword representations. This paper studies the role of subword segmentation in cross-lingual transfer. We systematically compare the efficacy of several subword methods in promoting synergy and preventing interference across different linguistic typologies. Our findings show that subword regularisation boosts synergy in multilingual modelling, whereas BPE more effectively facilitates transfer during cross-lingual fine-tuning. Notably, our results suggest that differences in orthographic word boundary conventions (the morphological granularity of written words) may impede cross-lingual transfer more significantly than linguistic unrelatedness. Our study confirms that decisions around subword modelling can be key to optimising the benefits of multilingual modelling.

Neural Machine Translation between Low-Resource Languages with Synthetic Pivoting
Khalid N. Elmadani | Jan Buys
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Training neural models for translating between low-resource languages is challenging due to the scarcity of direct parallel data between such languages. Pivot-based neural machine translation (NMT) systems overcome data scarcity by including a high-resource pivot language in the process of translating between low-resource languages. We propose synthetic pivoting, a novel approach to pivot-based translation in which the pivot sentences are generated synthetically from both the source and target languages. Synthetic pivot sentences are generated through sequence-level knowledge distillation, with the aim of changing the structure of pivot sentences to be closer to that of the source or target languages, thereby reducing pivot translation complexity. We incorporate synthetic pivoting into two paradigms for pivoting: cascading and direct translation using synthetic source and target sentences. We find that the performance of pivot-based systems highly depends on the quality of the NMT model used for sentence regeneration. Furthermore, training back-translation models on these sentences can make the models more robust to input-side noise. The results show that synthetic data generation improves pivot-based systems translating between low-resource Southern African languages by up to 5.6 BLEU points after fine-tuning.

NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages
Francois Meyer | Haiyue Song | Abhisek Chakrabarty | Jan Buys | Raj Dabre | Hideki Tanaka
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The Nguni languages have over 20 million home language speakers in South Africa. There has been considerable growth in the datasets for Nguni languages, but so far no analysis of the performance of NLP models for these languages has been reported across languages and tasks. In this paper we study pretrained language models for the 4 Nguni languages - isiXhosa, isiZulu, isiNdebele, and Siswati. We compile publicly available datasets for natural language understanding and generation, spanning 6 tasks and 11 datasets. This benchmark, which we call NGLUEni, is the first centralised evaluation suite for the Nguni languages, allowing us to systematically evaluate the Nguni-language capabilities of pretrained language models (PLMs). Besides evaluating existing PLMs, we develop new PLMs for the Nguni languages through multilingual adaptive finetuning. Our models, Nguni-XLMR and Nguni-ByT5, outperform their base models and large-scale adapted models, showing that performance gains are obtainable through limited language group-based adaptation. We also perform experiments on cross-lingual transfer and machine translation. Our models achieve notable cross-lingual transfer improvements in the lower resourced Nguni languages (isiNdebele and Siswati). To facilitate future use of NGLUEni as a standardised evaluation suite for the Nguni languages, we create a web portal to access the collection of datasets and publicly release our models.

Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation
Francois Meyer | Jan Buys
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Most data-to-text datasets are for English, so the difficulties of modelling data-to-text for low-resource languages are largely unexplored. In this paper we tackle data-to-text for isiXhosa, which is low-resource and agglutinative. We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG, which presents a new linguistic context that shifts modelling demands to subword-driven techniques. We also develop an evaluation framework for T2X that measures how accurately generated text describes the data. This enables future users of T2X to go beyond surface-level metrics in evaluation. On the modelling side we explore two classes of methods - dedicated data-to-text models trained from scratch and pretrained language models (PLMs). We propose a new dedicated architecture aimed at agglutinative data-to-text, the Subword Segmental Pointer Generator (SSPG). It jointly learns to segment words and copy entities, and outperforms existing dedicated models for 2 agglutinative languages (isiXhosa and Finnish). We investigate pretrained solutions for T2X, which reveals that standard PLMs come up short. Fine-tuning machine translation models emerges as the best method overall. These findings underscore the distinct challenge presented by T2X: neither well-established data-to-text architectures nor customary pretrained methodologies prove optimal. We conclude with a qualitative analysis of generation errors and an ablation study.

2023

Policy-based Reinforcement Learning for Generalisation in Interactive Text-based Environments
Edan Toledo | Jan Buys | Jonathan Shock
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Text-based environments enable RL agents to learn to converse and perform interactive tasks through natural language. However, previous RL approaches applied to text-based environments show poor performance when evaluated on unseen games. This paper investigates the improvement of generalisation performance through the simple switch from a value-based update method to a policy-based one, within text-based environments. We show that by replacing commonly used value-based methods with REINFORCE with baseline, a far more general agent is produced. The policy-based agent is evaluated on Coin Collector and Question Answering with interactive text (QAit), two text-based environments designed to test zero-shot performance. We see substantial improvements on a variety of zero-shot evaluation experiments, including tripling accuracy on various QAit benchmark configurations. The results indicate that policy-based RL has significantly better generalisation capabilities than value-based methods within such text-based environments, suggesting that RL agents could be applied to more complex natural language environments.

Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation
Francois Meyer | Jan Buys
Findings of the Association for Computational Linguistics: ACL 2023

Subword segmenters like BPE operate as a preprocessing step in neural machine translation and other (conditional) language models. They are applied to datasets before training, so translation or text generation quality relies on the quality of segmentations. We propose a departure from this paradigm, called subword segmental machine translation (SSMT). SSMT unifies subword segmentation and MT in a single trainable model. It learns to segment target sentence words while jointly learning to generate target sentences. To use SSMT during inference we propose dynamic decoding, a text generation algorithm that adapts segmentations as it generates translations. Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages. Gains are strongest in the very low-resource scenario. SSMT also learns subwords that are closer to morphemes compared to baselines and proves more robust on a test set constructed for evaluating morphological compositional generalisation.

2022

Generic Overgeneralization in Pre-trained Language Models
Sello Ralethe | Jan Buys
Proceedings of the 29th International Conference on Computational Linguistics

Generic statements such as “ducks lay eggs” make claims about kinds, e.g., ducks as a category. The generic overgeneralization effect refers to the inclination to accept false universal generalizations such as “all ducks lay eggs” or “all lions have manes” as true. In this paper, we investigate the generic overgeneralization effect in pre-trained language models experimentally. We show that pre-trained language models suffer from overgeneralization and tend to treat quantified generic statements such as “all ducks lay eggs” as if they were true generics. Furthermore, we demonstrate how knowledge embedding methods can lessen this effect by injecting factual knowledge about kinds into pre-trained language models. To this end, we source factual knowledge about two types of generics, minority characteristic generics and majority characteristic generics, and inject this knowledge using a knowledge embedding model. Our results show that knowledge injection reduces, but does not eliminate, generic overgeneralization, and that majority characteristic generics of kinds are more susceptible to overgeneralization bias.

Subword Segmental Language Modelling for Nguni Languages
Francois Meyer | Jan Buys
Findings of the Association for Computational Linguistics: EMNLP 2022

Subwords have become the standard units of text in NLP, enabling efficient open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is viewed as a preprocessing step applied to the corpus before training. This can lead to sub-optimal segmentations for low-resource languages with complex morphologies. We propose a subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling. By unifying subword segmentation and language modelling, our model learns subwords that optimise LM performance. We train our model on the 4 Nguni languages of South Africa. These are low-resource agglutinative languages, so subword information is critical. As an LM, SSLM outperforms existing approaches such as BPE-based models on average across the 4 languages. Furthermore, it outperforms standard subword segmenters on unsupervised morphological segmentation. We also train our model as a word-level sequence model, resulting in an unsupervised morphological segmenter that outperforms existing methods by a large margin for all 4 languages. Our results show that learning subword segmentation is an effective alternative to existing subword segmenters, enabling the model to discover morpheme-like subwords that improve its LM capabilities.

University of Cape Town’s WMT22 System: Multilingual Machine Translation for Southern African Languages
Khalid N. Elmadani | Francois Meyer | Jan Buys
Proceedings of the Seventh Conference on Machine Translation (WMT)

The paper describes the University of Cape Town’s submission to the constrained track of the WMT22 Shared Task: Large-Scale Machine Translation Evaluation for African Languages. Our system is a single multilingual translation model that translates between English and 8 South / South East African Languages, as well as between specific pairs of the African languages. We used several techniques suited for low-resource machine translation (MT), including overlap BPE, back-translation, synthetic training data generation, and adding more translation directions during training. Our results show the value of these techniques, especially for directions where very little or no bilingual training data is available.

A Sequence Modelling Approach to Question Answering in Text-Based Games
Gregory Furman | Edan Toledo | Jonathan Shock | Jan Buys
Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022)

Interactive Question Answering (IQA) requires an intelligent agent to interact with a dynamic environment in order to gather information necessary to answer a question. IQA tasks have been proposed as means of training systems to develop language or visual comprehension abilities. To this end, the Question Answering with Interactive Text (QAit) task was created to produce and benchmark interactive agents capable of seeking information and answering questions in unseen environments. While prior work has exclusively focused on IQA as a reinforcement learning problem, such methods suffer from low sample efficiency and poor accuracy in zero-shot evaluation. In this paper, we propose the use of the recently proposed Decision Transformer architecture to provide improvements upon prior baselines. By utilising a causally masked GPT-2 Transformer for command generation and a BERT model for question answer prediction, we show that the Decision Transformer achieves performance greater than or equal to current state-of-the-art RL baselines on the QAit task in a sample efficient manner. In addition, these results are achievable by training on sub-optimal random trajectories, therefore not requiring the use of online agents to gather data.

2021

Discourse Understanding and Factual Consistency in Abstractive Summarization
Saadia Gabriel | Antoine Bosselut | Jeff Da | Ari Holtzman | Jan Buys | Kyle Lo | Asli Celikyilmaz | Yejin Choi
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We introduce a general framework for abstractive summarization with factual consistency and distinct modeling of the narrative flow in an output summary. Our work addresses current limitations of models for abstractive summarization that often hallucinate information or generate summaries with coherence issues. To generate abstractive summaries with factual consistency and narrative flow, we propose Cooperative Generator-Discriminator Networks (Co-opNet), a novel transformer-based framework where the generator works with a discriminator architecture to compose coherent long-form summaries. We explore four different discriminator objectives which each capture a different aspect of coherence, including whether salient spans of generated abstracts are hallucinated or appear in the input context, and the likelihood of sentence adjacency in generated abstracts. We measure the ability of Co-opNet to learn these objectives with arXiv scientific papers, using the abstracts as a proxy for gold long-form scientific article summaries. Empirical results from automatic and human evaluations demonstrate that Co-opNet learns to summarize with considerably improved global coherence compared to competitive baselines.

RepGraph: Visualising and Analysing Meaning Representation Graphs
Jaron Cohen | Roy Cohen | Edan Toledo | Jan Buys
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present RepGraph, an open source visualisation and analysis tool for meaning representation graphs. Graph-based meaning representations provide rich semantic annotations, but visualising them clearly is more challenging than for fully lexicalized representations. Our application provides a seamless, unifying interface with which to visualise, manipulate and analyse semantically parsed graph data represented in a JSON-based serialisation format. RepGraph visualises graphs in multiple formats, with an emphasis on showing the relation between nodes and their corresponding token spans, whilst keeping the representation compact. Additionally, the web-based tool provides NLP researchers with a clear, visually intuitive way of interacting with these graphs, and includes a number of graph analysis features. The tool currently supports the DMRS, EDS, PTG, UCCA, and AMR semantic frameworks. A live demo is available at https://repgraph.vercel.app/.

2019

BottleSum: Unsupervised and Self-supervised Sentence Summarization using the Information Bottleneck Principle
Peter West | Ari Holtzman | Jan Buys | Yejin Choi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The principle of the Information Bottleneck (Tishby et al., 1999) produces a summary of information X optimized to predict some other relevant information Y. In this paper, we propose a novel approach to unsupervised sentence summarization by mapping the Information Bottleneck principle to a conditional language modelling objective: given a sentence, our approach seeks a compressed sentence that can best predict the next sentence. Our iterative algorithm under the Information Bottleneck objective searches gradually shorter subsequences of the given sentence while maximizing the probability of the next sentence conditioned on the summary. Using only pretrained language models with no direct supervision, our approach can efficiently perform extractive sentence summarization over a large corpus. Building on our unsupervised extractive summarization, we also present a new approach to self-supervised abstractive summarization, where a transformer-based language model is trained on the output summaries of our unsupervised method. Empirical results demonstrate that our extractive method outperforms other unsupervised models on multiple automatic metrics. In addition, we find that our self-supervised abstractive model outperforms unsupervised baselines (including our own) by human evaluation along multiple attributes.

Neural Text Generation from Rich Semantic Representations
Valerie Hajdik | Jan Buys | Michael Wayne Goodman | Emily M. Bender
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose neural models to generate high-quality text from structured representations based on Minimal Recursion Semantics (MRS). MRS is a rich semantic representation that encodes more precise semantic detail than other representations such as Abstract Meaning Representation (AMR). We show that a sequence-to-sequence model that maps a linearization of Dependency MRS, a graph-based representation of MRS, to text can achieve a BLEU score of 66.11 when trained on gold data. The performance of the model can be improved further using a high-precision, broad coverage grammar-based parser to generate a large silver training corpus, achieving a final BLEU score of 77.17 on the full test set, and 83.37 on the subset of test data most closely matching the silver data domain. Our results suggest that MRS-based representations are a good choice for applications that need both structured semantics and the ability to produce natural language text as output.

Benchmarking Hierarchical Script Knowledge
Yonatan Bisk | Jan Buys | Karl Pichotta | Yejin Choi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Understanding procedural language requires reasoning about both hierarchical and temporal relations between events. For example, “boiling pasta” is a sub-event of “making a pasta dish”, typically happens before “draining pasta,” and requires the use of omitted tools (e.g. a strainer, sink...). While people are able to choose when and how to use abstract versus concrete instructions, the NLP community lacks corpora and tasks for evaluating if our models can do the same. In this paper, we introduce KidsCook, a parallel script corpus, as well as a cloze task which matches video captions with missing procedural details. Experimental results show that state-of-the-art models struggle at this task, which requires inducing functional commonsense knowledge not explicitly stated in text.

2018

Neural Syntactic Generative Models with Exact Marginalization
Jan Buys | Phil Blunsom
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We present neural syntactic generative models with exact marginalization that support both dependency parsing and language modeling. Exact marginalization is made tractable through dynamic programming over shift-reduce parsing and minimal RNN-based feature sets. Our algorithms complement previous approaches by supporting batched training and enabling online computation of next word probabilities. For supervised dependency parsing, our model achieves a state-of-the-art result among generative approaches. We also report empirical results on unsupervised syntactic models and their role in language modeling. We find that our model formulation of latent dependencies with exact marginalization do not lead to better intrinsic language modeling performance than vanilla RNNs, and that parsing accuracy is not correlated with language modeling perplexity in stack-based models.

Learning to Write with Cooperative Discriminators
Ari Holtzman | Jan Buys | Maxwell Forbes | Antoine Bosselut | David Golub | Yejin Choi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite their local fluency, long-form text generated from RNNs is often generic, repetitive, and even self-contradictory. We propose a unified learning framework that collectively addresses all the above issues by composing a committee of discriminators that can guide a base RNN generator towards more globally coherent generations. More concretely, discriminators each specialize in a different principle of communication, such as Grice’s maxims, and are collectively combined with the base RNN generator through a composite decoding objective. Human evaluation demonstrates that text generated by our model is preferred over that of baselines by a large margin, significantly enhancing the overall coherence, style, and information of the generations.

2017

Robust Incremental Neural Semantic Graph Parsing
Jan Buys | Phil Blunsom
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Parsing sentences to linguistically-expressive semantic representations is a key goal of Natural Language Processing. Yet statistical parsing has focussed almost exclusively on bilexical dependencies or domain-specific logical forms. We propose a neural encoder-decoder transition-based parser which is the first full-coverage semantic graph parser for Minimal Recursion Semantics (MRS). The model architecture uses stack-based embedding features, predicting graphs jointly with unlexicalized predicates and their token alignments. Our parser is more accurate than attention-based baselines on MRS, and on an additional Abstract Meaning Representation (AMR) benchmark, and GPU batch processing makes it an order of magnitude faster than a high-precision grammar-based parser. Further, the 86.69% Smatch score of our MRS parser is higher than the upper-bound on AMR parsing, making MRS an attractive choice as a semantic representation.

Oxford at SemEval-2017 Task 9: Neural AMR Parsing with Pointer-Augmented Attention
Jan Buys | Phil Blunsom
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We present a neural encoder-decoder AMR parser that extends an attention-based model by predicting the alignment between graph nodes and sentence tokens explicitly with a pointer mechanism. Candidate lemmas are predicted as a pre-processing step so that the lemmas of lexical concepts, as well as constant strings, are factored out of the graph linearization and recovered through the predicted alignments. The approach does not rely on syntactic parses or extensive external resources. Our parser obtained 59% Smatch on the SemEval test set.

2016

Online Segment to Segment Neural Transduction
Lei Yu | Jan Buys | Phil Blunsom
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Cross-Lingual Morphological Tagging for Low-Resource Languages
Jan Buys | Jan A. Botha
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Generative Incremental Dependency Parsing with Neural Networks
Jan Buys | Phil Blunsom
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

A Bayesian Model for Generative Transition-based Dependency Parsing
Jan Buys | Phil Blunsom
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

2013

A Tree Transducer Model for Grammatical Error Correction
Jan Buys | Brink van der Merwe
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task

Venues