Arul Menezes


2023

pdf bib
Dissecting In-Context Learning of Translations in GPT-3
Vikas Raunak | Arul Menezes | Hany Awadalla
Findings of the Association for Computational Linguistics: EMNLP 2023

Most of the recent work in leveraging Large Language Models (LLMs) such as GPT-3 for Machine Translation (MT) has focused on selecting the few-shot samples for prompting. In this work, we try to better understand the role of demonstration attributes for the in-context learning of translations through perturbations of high-quality, in-domain demonstrations. We find that asymmetric perturbation of the source-target mappings yield vastly different results. We show that the perturbation of the source side has surprisingly little impact, while target perturbation can drastically reduce translation quality, suggesting that it is the output text distribution that provides the most important learning signal during in-context learning of translations. We propose a method named Zero-Shot-Context to add this signal automatically in Zero-Shot prompting. We demonstrate that it improves upon the zero-shot translation performance of GPT-3, even making it competitive with few-shot prompted translations.

pdf bib
TRIP: Accelerating Document-level Multilingual Pre-training via Triangular Document-level Pre-training on Parallel Data Triplets
Hongyuan Lu | Haoyang Huang | Shuming Ma | Dongdong Zhang | Wai Lam | Zhaochuan Gao | Anthony Aue | Arul Menezes | Furu Wei
Findings of the Association for Computational Linguistics: EMNLP 2023

Despite the success of multilingual sequence-to-sequence pre-training, most existing approaches rely on document-level monolingual corpora in many different languages, sentence-level bilingual corpora, and sometimes synthetic document-level bilingual corpora. This hampers the performance with cross-lingual document-level tasks such as document-level translation. Hence, we propose to mine and leverage document-level trilingual parallel corpora to improve sequence-to-sequence multilingual pre-training. We present Triangular Document-level Pre-training (TRIP) as the first in the field to accelerate the conventional monolingual and bilingual objectives into a trilingual objective with a novel method called Grafting. Experiments show that TRIP achieves several strong state-of-the-art (SOTA) scores on three multilingual document-level machine translation benchmarks and one cross-lingual abstractive summarization benchmark, including consistent improvements by up to 3.11 d-BLEU points and 8.9 ROUGE-L points.

pdf bib
Leveraging GPT-4 for Automatic Translation Post-Editing
Vikas Raunak | Amr Sharaf | Yiren Wang | Hany Awadalla | Arul Menezes
Findings of the Association for Computational Linguistics: EMNLP 2023

While Neural Machine Translation (NMT) represents the leading approach to Machine Translation (MT), the outputs of NMT models still require translation post-editing to rectify errors and enhance quality under critical settings. In this work, we formalize the task of direct translation post-editing with Large Language Models (LLMs) and explore the use of GPT-4 to automatically post-edit NMT outputs across several language pairs. Our results demonstrate that GPT-4 is adept at translation post-editing, producing meaningful and trustworthy edits to translations that help improve its general quality as well as remove different classes of major errors in translations. In particular, human evaluations on assessing edit trustworthiness show that GPT-4 exhibits a large improvement over the prior state-of-the-art LLM. Notably, we improve upon state-of-the-art performance on WMT-22 English-Chinese, English-German, Chinese-English and German-English language pairs using GPT-4 based post-editing, as evaluated by state-of-the-art MT quality metrics. However, we also show that GPT-4 could produce hallucinated edits, thereby urging caution in its use as an expert translation post-editor.

pdf bib
Do GPTs Produce Less Literal Translations?
Vikas Raunak | Arul Menezes | Matt Post | Hany Hassan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Large Language Models (LLMs) such as GPT-3 have emerged as general-purpose language models capable of addressing many natural language generation or understanding tasks. On the task of Machine Translation (MT), multiple works have investigated few-shot prompting mechanisms to elicit better translations from LLMs. However, there has been relatively little investigation on how such translations differ qualitatively from the translations generated by standard Neural Machine Translation (NMT) models. In this work, we investigate these differences in terms of the literalness of translations produced by the two systems. Using literalness measures involving word alignment and monotonicity, we find that translations out of English (E-X) from GPTs tend to be less literal, while exhibiting similar or better scores on MT quality metrics. We demonstrate that this finding is borne out in human evaluations as well. We then show that these differences are especially pronounced when translating sentences that contain idiomatic expressions.

2022

pdf bib
Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks
Vikas Raunak | Arul Menezes
Findings of the Association for Computational Linguistics: EMNLP 2022

Memorization presents a challenge for several constrained Natural Language Generation (NLG) tasks such as Neural Machine Translation (NMT), wherein the proclivity of neural models to memorize noisy and atypical samples reacts adversely with the noisy (web crawled) datasets. However, previous studies of memorization in constrained NLG tasks have only focused on counterfactual memorization, linking it to the problem of hallucinations. In this work, we propose a new, inexpensive algorithm for extractive memorization (exact training data generation under insufficient context) in constrained sequence generation tasks and use it to study extractive memorization and its effects in NMT. We demonstrate that extractive memorization poses a serious threat to NMT reliability by qualitatively and quantitatively characterizing the memorized samples as well as the model behavior in their vicinity. Based on empirical observations, we develop a simple algorithm which elicits non-memorized translations of memorized samples from the same model, for a large fraction of such samples. Finally, we show that the proposed algorithm could also be leveraged to mitigate memorization in the model through finetuning. We have released the code to reproduce our results at https://github.com/vyraun/Finding-Memo.

pdf bib
SALTED: A Framework for SAlient Long-tail Translation Error Detection
Vikas Raunak | Matt Post | Arul Menezes
Findings of the Association for Computational Linguistics: EMNLP 2022

Traditional machine translation (MT) metrics provide an average measure of translation quality that is insensitive to the long tail of behavioral problems. Examples include translation of numbers, physical units, dropped content and hallucinations. These errors, which occur rarely and unpredictably in Neural Machine Translation (NMT), greatly undermine the reliability of state-of-the-art MT systems. Consequently, it is important to have visibility into these problems during model development. Towards this end, we introduce SALTED, a specifications-based framework for behavioral testing of NMT models. At the core of our approach is the use of high-precision detectors that flag errors (or alternatively, verify output correctness) between a source sentence and a system output. These detectors provide fine-grained measurements of long-tail errors, providing a trustworthy view of problems that were previously invisible. We demonstrate that such detectors could be used not just to identify salient long-tail errors in MT systems, but also for higher-recall filtering of the training data, fixing targeted errors with model fine-tuning in NMT and generating novel data for metamorphic testing to elicit further bugs in models.

2021

pdf bib
To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation
Tom Kocmi | Christian Federmann | Roman Grundkiewicz | Marcin Junczys-Dowmunt | Hitokazu Matsushita | Arul Menezes
Proceedings of the Sixth Conference on Machine Translation

Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system’s quality over another. The community choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluating metrics correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on – to the best of our knowledge – the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the performance of various metrics across different language pairs and domains. Lastly, we show that the sole use of BLEU impeded the development of improved models leading to bad deployment decisions. We release the collection of 2.3M sentence-level human judgements for 4380 systems for further analysis and replication of our work.

pdf bib
The Curious Case of Hallucinations in Neural Machine Translation
Vikas Raunak | Arul Menezes | Marcin Junczys-Dowmunt
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this work, we study hallucinations in Neural Machine Translation (NMT), which lie at an extreme end on the spectrum of NMT pathologies. Firstly, we connect the phenomenon of hallucinations under source perturbation to the Long-Tail theory of Feldman, and present an empirically validated hypothesis that explains hallucinations under source perturbation. Secondly, we consider hallucinations under corpus-level noise (without any source perturbation) and demonstrate that two prominent types of natural hallucinations (detached and oscillatory outputs) could be generated and explained through specific corpus-level noise patterns. Finally, we elucidate the phenomenon of hallucination amplification in popular data-generation processes such as Backtranslation and sequence-level Knowledge Distillation. We have released the datasets and code to replicate our results.

2015

pdf bib
Pre-Computable Multi-Layer Neural Network Language Models
Jacob Devlin | Chris Quirk | Arul Menezes
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
An AMR parser for English, French, German, Spanish and Japanese and a new AMR-annotated corpus
Lucy Vanderwende | Arul Menezes | Chris Quirk
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

2014

pdf bib
Speech translation for everyone – breaking down the barriers
Arul Menezes
Proceedings of the 11th International Workshop on Spoken Language Translation: Keynotes

2013

pdf bib
Social Text Normalization using Contextual Graph Random Walks
Hany Hassan | Arul Menezes
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2008

pdf bib
Syntactic Models for Structural Word Insertion and Deletion during Translation
Arul Menezes | Chris Quirk
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

pdf bib
Generative models of noisy translations with applications to parallel fragment extraction
Chris Quirk | Raghavendra Udupa U. | Arul Menezes
Proceedings of Machine Translation Summit XI: Papers

pdf bib
Using Dependency Order Templates to Improve Generality in Translation
Arul Menezes | Chris Quirk
Proceedings of the Second Workshop on Statistical Machine Translation

2006

pdf bib
Do we need phrases? Challenging the conventional wisdom in Statistical Machine Translation
Chris Quirk | Arul Menezes
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf bib
Effectively Using Syntax for Recognizing False Entailment
Rion Snow | Lucy Vanderwende | Arul Menezes
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf bib
Microsoft Research Treelet Translation System: NAACL 2006 Europarl Evaluation
Arul Menezes | Kristina Toutanova | Chris Quirk
Proceedings on the Workshop on Statistical Machine Translation

2005

pdf bib
MindNet: An Automatically-Created Lexical Resource
Lucy Vanderwende | Gary Kacmarcik | Hisami Suzuki | Arul Menezes
Proceedings of HLT/EMNLP 2005 Interactive Demonstrations

pdf bib
Dependency Treelet Translation: Syntactically Informed Phrasal SMT
Chris Quirk | Arul Menezes | Colin Cherry
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

pdf bib
Microsoft Research Treelet Translation System: IWSLT Evaluation
Arul Menezes | Chris Quirk
Proceedings of the Second International Workshop on Spoken Language Translation

pdf bib
Dependency Treelet Translation: The Convergence of Statistical and Example-based Machine-translation?
Arul Menezes | Chris Quirk
Workshop on example-based machine translation

We describe a novel approach to machine translation that combines the strengths of the two leading corpus-based approaches: Phrasal SMT and EBMT. We use a syntactically informed decoder and reordering model based on the source dependency tree, in combination with conventional SMT models to incorporate the power of phrasal SMT with the linguistic generality available in a parser. We show that this approach significantly outperforms a leading string-based Phrasal SMT decoder and an EBMT system. We present results from two radically different language pairs, and investigate the sensitivity of this approach to parse quality by using two distinct parsers and oracle experiments. We also validate our automated BLEU scores with a small human evaluation.

2004

pdf bib
Statistical machine translation using labeled semantic dependency graphs
Anthony Aue | Arul Menezes | Bob Moore | Chris Quirk | Eric Ringger
Proceedings of the 10th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

2002

pdf bib
English-Japanese Example-Based Machine Translation Using Abstract Linguistic Representations
Chris Brockett | Takako Aikawa | Anthony Aue | Arul Menezes | Chris Quirk | Hisami Suzuki
COLING-02: Machine Translation in Asia

pdf bib
Better contextual translation using machine learning
Arul Menezes
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

One of the problems facing translation systems that automatically extract transfer mappings (rules or examples) from bilingual corpora is the trade-off between contextual specificity and general applicability of the mappings, which typically results in conflicting mappings without distinguishing context. We present a machine-learning approach to choosing between such mappings, using classifiers that, in effect, selectively expand the context for these mappings using features available in a linguistic representation of the source language input. We show that using these classifiers in our machine translation system significantly improves the quality of the translated output. Additionally, the set of distinguishing features selected by the classifiers provides insight into the relative importance of the various linguistic features in choosing the correct contextual translation.

2001

pdf bib
Achieving commercial-quality translation with example-based methods
Stephen Richardson | William Dolan | Arul Menezes | Jessie Pinkham
Proceedings of Machine Translation Summit VIII

We describe MSR-MT, a large-scale example-based machine translation system under development for several language pairs. Trained on aligned English-Spanish technical prose, a blind evaluation shows that MSR-MT’s integration of rule-based parsers, example based processing, and statistical techniques produces translations whose quality in this domain exceeds that of uncustomized commercial MT systems.

pdf bib
A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora
Arul Menezes | Stephen D. Richardson
Workshop on Example-Based machine Translation

Translation systems that automatically extract transfer mappings (rules or examples) from bilingual corpora have been hampered by the difficulty of achieving accurate alignment and acquiring high quality mappings. We describe an algorithm that uses a best-first strategy and a small alignment grammar to significantly improve the quality of the mappings extracted. For each mapping, frequencies are computed and sufficient context is retained to distinguish competing mappings during translation. Variants of the algorithm are run against a corpus containing 200K sentence pairs and evaluated based on the quality of resulting translations.

pdf bib
Overcoming the customization bottleneck using example-based MT
Stephen D. Richardson | William B. Dolan | Arul Menezes | Monica Corston-Oliver
Proceedings of the ACL 2001 Workshop on Data-Driven Methods in Machine Translation

pdf bib
A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora
Arul Menezes | Stephen D. Richardson
Proceedings of the ACL 2001 Workshop on Data-Driven Methods in Machine Translation