Shankar Kumar


2022

Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models
Felix Stahlberg | Ilia Kulikov | Shankar Kumar
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In many natural language processing (NLP) tasks, the same input (e.g. source sentence) can have multiple possible outputs (e.g. translations). To analyze how this ambiguity (also known as intrinsic uncertainty) shapes the distribution learned by neural sequence models, we measure sentence-level uncertainty by computing the degree of overlap between references in multi-reference test sets from two different NLP tasks: machine translation (MT) and grammatical error correction (GEC). At both the sentence and the task level, intrinsic uncertainty has major implications for various aspects of search such as the inductive biases in beam search and the complexity of exact search. In particular, we show that well-known pathologies such as a high number of beam search errors, the inadequacy of the mode, and the drop in system performance with large beam sizes apply to tasks with a high level of ambiguity such as MT but not to less uncertain tasks such as GEC. Furthermore, we propose a novel exact n-best search algorithm for neural sequence models, and show that intrinsic uncertainty affects model uncertainty as the model tends to overly spread out the probability mass for uncertain tasks and sentences.
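
As a rough illustration of measuring sentence-level uncertainty from reference overlap, the sketch below scores one input by how little its references agree. The abstract does not say which overlap measure is used, so the mean pairwise unigram Jaccard overlap here is an assumption, and the function names `jaccard` and `sentence_uncertainty` are hypothetical.

```python
from itertools import combinations
from typing import List


def jaccard(a: List[str], b: List[str]) -> float:
    """Unigram Jaccard overlap between two tokenized references."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty references agree trivially
    return len(set_a & set_b) / len(set_a | set_b)


def sentence_uncertainty(references: List[str]) -> float:
    """Return 1 minus the mean pairwise overlap between references.

    Assumption: whitespace tokenization and Jaccard overlap stand in for
    whatever overlap measure the paper actually uses.
    """
    if len(references) < 2:
        raise ValueError("need at least two references to measure overlap")
    tokenized = [ref.split() for ref in references]
    overlaps = [jaccard(a, b) for a, b in combinations(tokenized, 2)]
    return 1.0 - sum(overlaps) / len(overlaps)
```

Under this proxy, identical references give an uncertainty of 0 and references with no shared tokens give 1, so a multi-reference MT test set would be expected to score higher on average than a GEC test set.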

Scaling Language Model Size in Cross-Device Federated Learning
Jae Ro | Theresa Breiner | Lara McConnaughey | Mingqing Chen | Ananda Suresh | Shankar Kumar | Rajiv Mathews
Proceedings of the First Workshop on Federated Learning for Natural Language Processing (FL4NLP 2022)

Most studies in cross-device federated learning focus on small models, due to the server-client communication and on-device computation bottlenecks. In this work, we leverage various techniques for mitigating these bottlenecks to train larger language models in cross-device federated learning. With systematic applications of partial model training, quantization, efficient transfer learning, and communication-efficient optimizers, we are able to train a 21M parameter Transformer that achieves the same perplexity as that of a similarly sized LSTM with ∼10× smaller client-to-server communication cost and 11% lower perplexity than smaller LSTMs commonly studied in literature.

2021

Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models
Felix Stahlberg | Shankar Kumar
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to generate the broad range of grammatical errors made by human writers. In this work, we use error type tags from automatic annotation tools such as ERRANT to guide synthetic data generation. We compare several models that can produce an ungrammatical sentence given a clean sentence and an error type tag. We use these models to build a new, large synthetic pre-training data set with error tag frequency distributions matching a given development set. Our synthetic data set yields large and consistent gains, improving the state-of-the-art on the BEA-19 and CoNLL-14 test sets. We also show that our approach is particularly effective in adapting a GEC system, trained on mixed native and non-native English, to a native English test set, even surpassing real training data consisting of high-quality sentence pairs.

Data Strategies for Low-Resource Grammatical Error Correction
Simon Flachs | Felix Stahlberg | Shankar Kumar
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

Grammatical Error Correction (GEC) is a task that has been extensively investigated for the English language. However, for low-resource languages the best practices for training GEC systems have not yet been systematically determined. We investigate how best to take advantage of existing data sources for improving GEC systems for languages with limited quantities of high-quality training data. We show that methods for generating artificial training data for GEC can benefit from including morphological errors. We also demonstrate that noisy error correction data gathered from Wikipedia revision histories and the language learning website Lang8 are valuable data sources. Finally, we show that GEC systems pre-trained on noisy data sources can be fine-tuned effectively using small amounts of high-quality, human-annotated data.

2020

Seq2Edits: Sequence Transduction Using Span-level Edit Operations
Felix Stahlberg | Shankar Kumar
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We propose Seq2Edits, an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts. In this approach, each sequence-to-sequence transduction is represented as a sequence of edit operations, where each operation either replaces an entire source span with target tokens or keeps it unchanged. We evaluate our method on five NLP tasks (text normalization, sentence fusion, sentence splitting & rephrasing, text simplification, and grammatical error correction) and report competitive results across the board. For grammatical error correction, our method speeds up inference by up to 5.2x compared to full sequence models because inference time depends on the number of edits rather than the number of target tokens. For text normalization, sentence fusion, and grammatical error correction, our approach improves explainability by associating each edit operation with a human-readable tag.
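
The edit-operation representation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Edit` tuple layout (tag, exclusive span end, optional replacement), the tag names, and the function `apply_edits` are assumptions chosen for the example.

```python
from typing import List, NamedTuple, Optional


class Edit(NamedTuple):
    tag: str                          # human-readable label (hypothetical names)
    span_end: int                     # exclusive end index of the source span
    replacement: Optional[List[str]]  # None means "keep the span unchanged"


def apply_edits(source: List[str], edits: List[Edit]) -> List[str]:
    """Rebuild the target by walking the source left to right, one span per edit."""
    output: List[str] = []
    start = 0
    for edit in edits:
        if not start <= edit.span_end <= len(source):
            raise ValueError(f"invalid span end {edit.span_end}")
        if edit.replacement is None:
            output.extend(source[start:edit.span_end])  # copy the span as-is
        else:
            output.extend(edit.replacement)             # replace the whole span
        start = edit.span_end
    if start != len(source):
        raise ValueError("edits do not cover the whole source")
    return output


source = "He go to school yesterday".split()
edits = [
    Edit("SELF", 1, None),       # keep "He"
    Edit("VERB", 2, ["went"]),   # replace "go"
    Edit("SELF", 5, None),       # keep "to school yesterday"
]
target = apply_edits(source, edits)
```

Here three edits reproduce a five-token target, which illustrates why the number of decoding steps can track the number of edits rather than the number of target tokens.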

Data Weighted Training Strategies for Grammatical Error Correction
Jared Lichtarge | Chris Alberti | Shankar Kumar
Transactions of the Association for Computational Linguistics, Volume 8

Recent progress in the task of Grammatical Error Correction (GEC) has been driven by addressing data sparsity, both through new methods for generating large and noisy pretraining data and through the publication of small and higher-quality finetuning data in the BEA-2019 shared task. Building upon recent work in Neural Machine Translation (NMT), we make use of both kinds of data by deriving example-level scores on our large pretraining data based on a smaller, higher-quality dataset. In this work, we perform an empirical study to discover how to best incorporate delta-log-perplexity, a type of example scoring, into a training schedule for GEC. In doing so, we perform experiments that shed light on the function and applicability of delta-log-perplexity. Models trained on scored data achieve state-of-the-art results on common GEC test sets.
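
The abstract does not spell out how delta-log-perplexity is computed. The sketch below assumes one common formulation: the log-perplexity of an example under a base model (trained on the noisy data) minus its log-perplexity under the same model after fine-tuning on the higher-quality data. The function names and the per-token log-probability inputs are hypothetical.

```python
from typing import List


def log_perplexity(token_log_probs: List[float]) -> float:
    """Mean negative log-probability per target token."""
    if not token_log_probs:
        raise ValueError("example has no target tokens")
    return -sum(token_log_probs) / len(token_log_probs)


def delta_log_perplexity(base_log_probs: List[float],
                         tuned_log_probs: List[float]) -> float:
    """Assumed definition: log-ppl under the base model minus log-ppl under
    the fine-tuned model. A positive score means fine-tuning made the
    example more likely, which suggests it resembles the high-quality data.
    """
    return log_perplexity(base_log_probs) - log_perplexity(tuned_log_probs)


# Hypothetical per-token log-probabilities for one pretraining example.
score = delta_log_perplexity([-2.0, -1.0, -3.0], [-1.0, -0.5, -1.5])
```

Scores of this kind could then be used to filter or weight pretraining examples, which is the sort of training-schedule decision the study investigates.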

2019

Corpora Generation for Grammatical Error Correction
Jared Lichtarge | Chris Alberti | Shankar Kumar | Noam Shazeer | Niki Parmar | Simon Tong
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Grammatical Error Correction (GEC) has recently been modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. The first method extracts source-target pairs from Wikipedia edit histories with minimal filtration heuristics, while the second method introduces noise into Wikipedia sentences via round-trip translation through bridge languages. Both strategies yield similarly sized parallel corpora containing around 4B tokens. We employ an iterative decoding strategy that is tailored to the loosely supervised nature of our constructed corpora. We demonstrate that neural GEC models trained using either type of corpora give similar performance. Fine-tuning these models on the Lang-8 corpus and ensembling allows us to surpass the state of the art on both the CoNLL '14 benchmark and the JFLEG task. We present systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.

2015

Multilingual Open Relation Extraction Using Cross-lingual Projection
Manaal Faruqui | Shankar Kumar
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2010

Expected Sequence Similarity Maximization
Cyril Allauzen | Shankar Kumar | Wolfgang Macherey | Mehryar Mohri | Michael Riley
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Model Combination for Machine Translation
John DeNero | Shankar Kumar | Ciprian Chelba | Franz Och
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2009

Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
Shankar Kumar | Wolfgang Macherey | Chris Dyer | Franz Och
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation
Roy Tromble | Shankar Kumar | Franz Och | Wolfgang Macherey
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

Improving Word Alignment with Bridge Languages
Shankar Kumar | Franz J. Och | Wolfgang Macherey
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2005

Local Phrase Reordering Models for Statistical Machine Translation
Shankar Kumar | William Byrne
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

A Smorgasbord of Features for Statistical Machine Translation
Franz Josef Och | Daniel Gildea | Sanjeev Khudanpur | Anoop Sarkar | Kenji Yamada | Alex Fraser | Shankar Kumar | Libin Shen | David Smith | Katherine Eng | Viren Jain | Zhen Jin | Dragomir Radev
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

Minimum Bayes-Risk Decoding for Statistical Machine Translation
Shankar Kumar | William Byrne
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

2003

A Weighted Finite State Transducer Implementation of the Alignment Template Model for Statistical Machine Translation
Shankar Kumar | William Byrne
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

2002

Minimum Bayes-Risk Word Alignments of Bilingual Texts
Shankar Kumar | William Byrne
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)