Philip Arthur

2025

Large language models (LLMs) have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.

2021

Learning Coupled Policies for Simultaneous Machine Translation using Imitation Learning
Philip Arthur | Trevor Cohn | Gholamreza Haffari
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We present a novel approach to efficiently learn a simultaneous translation model with coupled programmer-interpreter policies. First, we present an algorithmic oracle to produce oracle READ/WRITE actions for training bilingual sentence-pairs using the notion of word alignments. These oracle actions are designed to capture enough information from the partial input before writing the output. Next, we perform a coupled scheduled sampling to effectively mitigate the exposure bias when learning both policies jointly with imitation learning. Experiments on six language-pairs show our method outperforms strong baselines in terms of translation quality while keeping the delay low.
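As a rough illustration of how READ/WRITE actions can be derived from word alignments, the sketch below reads source tokens until every source word aligned to the next target word has been seen, then writes that target word. This is an assumed, simplified construction and not necessarily the oracle used in the paper; the function name `oracle_actions`, the 0-based alignment pairs, and the handling of unaligned target words (written immediately) are all hypothetical choices made for the example.

```python
def oracle_actions(tgt_len, alignments):
    """Derive a READ/WRITE action sequence from word alignments.

    alignments: iterable of (src_idx, tgt_idx) pairs, 0-based.
    Assumption for this sketch: a target word may be written once all
    source words aligned to it have been read; unaligned target words
    are written immediately.
    """
    # needed[j] = number of source tokens that must be read before writing target j
    needed = [0] * tgt_len
    for src_idx, tgt_idx in alignments:
        needed[tgt_idx] = max(needed[tgt_idx], src_idx + 1)

    actions = []
    read = 0  # number of source tokens consumed so far
    for j in range(tgt_len):
        while read < needed[j]:
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    return actions


# Three target words; the second target word aligns to the third source word.
print(oracle_actions(3, [(0, 0), (2, 1), (1, 2)]))
```

In this sketch, source tokens that no target word depends on are never read; a fuller version would have to decide where those remaining READ actions belong.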

It Is Not As Good As You Think! Evaluating Simultaneous Machine Translation on Interpretation Data
Jinming Zhao | Philip Arthur | Gholamreza Haffari | Trevor Cohn | Ehsan Shareghi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Most existing simultaneous machine translation (SiMT) systems are trained and evaluated on offline translation corpora. We argue that SiMT systems should be trained and tested on real interpretation data. To illustrate this argument, we propose an interpretation test set and conduct a realistic evaluation of SiMT trained on offline translations. Our results, on our test set along with 3 existing smaller scale language pairs, highlight a difference of up to 13.83 BLEU when SiMT models are evaluated on translation vs interpretation data. In the absence of interpretation training data, we propose a translation-to-interpretation (T2I) style transfer method which allows converting existing offline translations into interpretation-style data, leading to an improvement of up to 2.8 BLEU. However, the evaluation gap remains notable, calling for constructing large-scale interpretation corpora better suited for evaluating and developing SiMT systems.

Multilingual Simultaneous Neural Machine Translation
Philip Arthur | Dongwon Ryu | Gholamreza Haffari
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2018

2017

Neural Machine Translation via Binary Code Prediction
Yusuke Oda | Philip Arthur | Graham Neubig | Koichiro Yoshino | Satoshi Nakamura
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we propose a new method for calculating the output layer in neural machine translation systems. The method is based on predicting a binary code for each word and can reduce computation time/memory requirements of the output layer to be logarithmic in vocabulary size in the best case. In addition, we also introduce two advanced approaches to improve the robustness of the proposed model: using error-correcting codes and combining softmax and binary codes. Experiments on two English-Japanese bidirectional translation tasks show proposed models achieve BLEU scores that approach the softmax, while reducing memory usage to the order of less than 1/10 and improving decoding speed on CPUs by x5 to x10.
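To make the "logarithmic in vocabulary size" claim concrete, the sketch below assigns each word ID its plain binary representation, so a vocabulary of V words needs only about log2(V) output bits instead of V softmax scores. The helper names (`bits_needed`, `word_to_code`, `code_to_word`) and the choice of plain binary encoding are assumptions for illustration; the paper's actual code assignment, and its error-correcting and hybrid softmax variants, are not reproduced here.

```python
def bits_needed(vocab_size):
    """Number of bits required to give every word ID a distinct code."""
    # (V - 1).bit_length() is ceil(log2(V)) for V >= 2, computed with integers only.
    return max(1, (vocab_size - 1).bit_length())


def word_to_code(word_id, n_bits):
    """Encode a word ID as a list of bits, most significant bit first."""
    return [(word_id >> i) & 1 for i in reversed(range(n_bits))]


def code_to_word(bits):
    """Decode a list of bits (most significant first) back to a word ID."""
    word_id = 0
    for bit in bits:
        word_id = (word_id << 1) | bit
    return word_id


n_bits = bits_needed(65536)          # a 65,536-word vocabulary needs 16 bits
code = word_to_code(12345, n_bits)   # the model would predict these bits
assert code_to_word(code) == 12345   # decoding recovers the word ID
```

With plain binary codes a single mispredicted bit yields a different word, which is one way to see why the abstract mentions error-correcting codes as a robustness measure.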

2016

Incorporating Discrete Translation Lexicons into Neural Machine Translation
Philip Arthur | Graham Neubig | Satoshi Nakamura
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

Multi-Target Machine Translation with Multi-Synchronous Context-free Grammars
Graham Neubig | Philip Arthur | Kevin Duh
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Semantic Parsing of Ambiguous Input through Paraphrasing and Verification
Philip Arthur | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Transactions of the Association for Computational Linguistics, Volume 3

We propose a new method for semantic parsing of ambiguous and ungrammatical input, such as search queries. We do so by building on an existing semantic parsing framework that uses synchronous context-free grammars (SCFG) to jointly model the input sentence and output meaning representation. We generalize this SCFG framework to allow not one, but multiple outputs. Using this formalism, we construct a grammar that takes an ambiguous input string and jointly maps it into both a meaning representation and a natural language paraphrase that is less ambiguous than the original input. This paraphrase can be used to disambiguate the meaning representation via verification using a language model that calculates the probability of each paraphrase.
