Lane Schwartz


2021

pdf bib
Expanding Universal Dependencies for Polysynthetic Languages: A Case of St. Lawrence Island Yupik
Hyunji Park | Lane Schwartz | Francis Tyers
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper describes the development of the first Universal Dependencies (UD) treebank for St. Lawrence Island Yupik, an endangered language spoken in the Bering Strait region. While the UD guidelines provided a general framework for our annotations, language-specific decisions were made necessary by the rich morphology of the polysynthetic language. Most notably, we annotated a corpus at the morpheme level as well as the word level. The morpheme level annotation was conducted using an existing morphological analyzer and manual disambiguation. By comparing the two resulting annotation schemes, we argue that morpheme-level annotation is essential for polysynthetic languages like St. Lawrence Island Yupik. Word-level annotation results in degenerate trees for some Yupik sentences and often fails to capture syntactic relations that can be manifested at the morpheme level. Dependency parsing experiments provide further support for morpheme-level annotation. Implications for UD annotation of other polysynthetic languages are discussed.

2020

pdf bib
Improved Finite-State Morphological Analysis for St. Lawrence Island Yupik Using Paradigm Function Morphology
Emily Chen | Hyunji Hayley Park | Lane Schwartz
Proceedings of the 12th Language Resources and Evaluation Conference

St. Lawrence Island Yupik is an endangered polysynthetic language of the Bering Strait region. While conducting linguistic fieldwork between 2016 and 2019, we observed substantial support within the Yupik community for language revitalization and for resource development to support Yupik education. To that end, Chen & Schwartz (2018) implemented a finite-state morphological analyzer as a critical enabling technology for use in Yupik language education and technology. Chen & Schwartz (2018) reported a morphological analysis coverage rate of approximately 75% on a dataset of 60K Yupik tokens, leaving considerable room for improvement. In this work, we present a re-implementation of the Chen & Schwartz (2018) finite-state morphological analyzer for St. Lawrence Island Yupik that incorporates new linguistic insights; in particular, in this implementation we make use of the Paradigm Function Morphology (PFM) theory of morphology. We evaluate this new PFM-based morphological analyzer, and demonstrate that it consistently outperforms the existing analyzer of Chen & Schwartz (2018) with respect to accuracy and coverage rate across multiple datasets.

2019

pdf bib
Community lexical access for an endangered polysynthetic language: An electronic dictionary for St. Lawrence Island Yupik
Benjamin Hunt | Emily Chen | Sylvia L.R. Schreiner | Lane Schwartz
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

In this paper, we introduce a morphologically-aware electronic dictionary for St. Lawrence Island Yupik, an endangered language of the Bering Strait region. Implemented using HTML, Javascript, and CSS, the dictionary is set in an uncluttered interface and permits users to search in Yupik or in English for Yupik root words and Yupik derivational suffixes. For each matching result, our electronic dictionary presents the user with the corresponding entry from the Badten (2008) Yupik-English paper dictionary. Because Yupik is a polysynthetic language, handling of multimorphemic word forms is critical. If a user searches for an inflected Yupik word form, we perform a morphological analysis and return entries for the root word and for any derivational suffixes present in the word. This electronic dictionary should serve not only as a valuable resource for all students and speakers of Yupik, but also for field linguists working towards documentation and conservation of the language.

pdf bib
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)
Antti Arppe | Jeff Good | Mans Hulden | Jordan Lachler | Alexis Palmer | Lane Schwartz | Miikka Silfverberg
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

pdf bib
Bootstrapping a Neural Morphological Analyzer for St. Lawrence Island Yupik from a Finite-State Transducer
Lane Schwartz | Emily Chen | Benjamin Hunt | Sylvia L.R. Schreiner
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

pdf bib
Unsupervised Learning of PCFGs with Normalizing Flow
Lifeng Jin | Finale Doshi-Velez | Timothy Miller | Lane Schwartz | William Schuler
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Unsupervised PCFG inducers hypothesize sets of compact context-free rules as explanations for sentences. PCFG induction not only provides tools for low-resource languages, but also plays an important role in modeling language acquisition (Bannard et al., 2009; Abend et al. 2017). However, current PCFG induction models, using word tokens as input, are unable to incorporate semantics and morphology into induction, and may encounter issues of sparse vocabulary when facing morphologically rich languages. This paper describes a neural PCFG inducer which employs context embeddings (Peters et al., 2018) in a normalizing flow model (Dinh et al., 2015) to extend PCFG induction to use semantic and morphological information. Linguistically motivated sparsity and categorical distance constraints are imposed on the inducer as regularization. Experiments show that the PCFG induction model with normalizing flow produces grammars with state-of-the-art accuracy on a variety of different languages. Ablation further shows a positive effect of normalizing flow, context embeddings and proposed regularizers.

2018

pdf bib
A Morphological Analyzer for St. Lawrence Island / Central Siberian Yupik
Emily Chen | Lane Schwartz
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Unsupervised Grammar Induction with Depth-bounded PCFG
Lifeng Jin | Finale Doshi-Velez | Timothy Miller | William Schuler | Lane Schwartz
Transactions of the Association for Computational Linguistics, Volume 6

There has been recent interest in applying cognitively- or empirically-motivated bounds on recursion depth to limit the search space of grammar induction models (Ponvert et al., 2011; Noji and Johnson, 2016; Shain et al., 2016). This work extends this depth-bounding approach to probabilistic context-free grammar induction (DB-PCFG), which has a smaller parameter space than hierarchical sequence models, and therefore more fully exploits the space reductions of depth-bounding. Results for this model on grammar acquisition from transcribed child-directed speech and newswire text exceed or are competitive with those of other models when evaluated on parse accuracy. Moreover, grammars acquired from this model demonstrate a consistent use of category labels, something which has not been demonstrated by other acquisition models.

pdf bib
Depth-bounding is effective: Improvements and evaluation of unsupervised PCFG induction
Lifeng Jin | Finale Doshi-Velez | Timothy Miller | William Schuler | Lane Schwartz
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

There have been several recent attempts to improve the accuracy of grammar induction systems by bounding the recursive complexity of the induction model. Modern depth-bounded grammar inducers have been shown to be more accurate than early unbounded PCFG inducers, but this technique has never been compared against unbounded induction within the same system, in part because most previous depth-bounding models are built around sequence models, the complexity of which grows exponentially with the maximum allowed depth. The present work instead applies depth bounds within a chart-based Bayesian PCFG inducer, where bounding can be switched on and off, and then samples trees with or without bounding. Results show that depth-bounding is indeed significantly effective in limiting the search space of the inducer and thereby increasing accuracy of resulting parsing model, independent of the contribution of modern Bayesian induction techniques. Moreover, parsing results on English, Chinese and German show that this bounded model is able to produce parse trees more accurately than or competitively with state-of-the-art constituency grammar induction models.

2017

pdf bib
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages
Antti Arppe | Jeff Good | Mans Hulden | Jordan Lachler | Alexis Palmer | Lane Schwartz
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2016

pdf bib
Normalized Log-Linear Interpolation of Backoff Language Models is Efficient
Kenneth Heafield | Chase Geigle | Sean Massung | Lane Schwartz
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track
Spence Green | Lane Schwartz
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

pdf bib
Fast, Scalable Phrase-Based SMT Decoding
Hieu Hoang | Nikolay Bogoychev | Lane Schwartz | Marcin Junczys-Dowmunt
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

The utilization of statistical machine translation (SMT) has grown enormously over the last decade, many using open-source software developed by the NLP community. As commercial use has increased, there is need for software that is optimized for commercial requirements, in particular, fast phrase-based decoding and more efficient utilization of modern multicore servers. In this paper we re-examine the major components of phrase-based decoding and decoder implementation with particular emphasis on speed and scalability on multicore machines. The result is a drop-in replacement for the Moses decoder which is up to fifteen times faster and scales monotonically with the number of cores.

pdf bib
Conferences of the Association for Machine Translation in the Americas: MT Users' Track
Spence Green | Lane Schwartz
Conferences of the Association for Machine Translation in the Americas: MT Users' Track

pdf bib
Memory-Bounded Left-Corner Unsupervised Grammar Induction on Child-Directed Input
Cory Shain | William Bryce | Lifeng Jin | Victoria Krakovna | Finale Doshi-Velez | Timothy Miller | William Schuler | Lane Schwartz
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper presents a new memory-bounded left-corner parsing model for unsupervised raw-text syntax induction, using unsupervised hierarchical hidden Markov models (UHHMM). We deploy this algorithm to shed light on the extent to which human language learners can discover hierarchical syntax through distributional statistics alone, by modeling two widely-accepted features of human language acquisition and sentence processing that have not been simultaneously modeled by any existing grammar induction algorithm: (1) a left-corner parsing strategy and (2) limited working memory capacity. To model realistic input to human language learners, we evaluate our system on a corpus of child-directed speech rather than typical newswire corpora. Results beat or closely match those of three competing systems.

2015

pdf bib
Automated Translation of a Literary Work: A Pilot Study
Laurent Besacier | Lane Schwartz
Proceedings of the Fourth Workshop on Computational Linguistics for Literature

pdf bib
The University of Illinois submission to the WMT 2015 Shared Translation Task
Lane Schwartz | Bill Bryce | Chase Geigle | Sean Massung | Yisi Liu | Haoruo Peng | Vignesh Raja | Subhro Roy | Shyam Upadhyay
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Effects of word alignment visualization on post-editing quality & speed
Lane Schwartz | Isabel Lacruz | Tatyana Bystrova
Proceedings of Machine Translation Summit XV: Papers

2014

pdf bib
Monolingual post-editing by a domain expert is highly effective for translation triage
Lane Schwartz
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

Various small-scale pilot studies have found that for at least some documents, monolingual target language speakers may be able to successfully post-edit machine translations. We begin by analyzing previously published post-editing data to ascertain the effect, if any, of original source language on post-editing quality. Schwartz et al. (2014) hypothesized that post-editing success may be more pronounced when the monolingual post-editors are experts in the domain of the translated documents. This work tests that hypothesis by asking a domain expert to post-edit machine translations of a French scientific article (Besacier, 2014) into English. We find that the monolingual domain expert post-editor was able to successfully post-edit 86.7% of the sentences without requesting assistance from a bilingual post-editor. We evaluate the post-edited sentences according to a bilingual adequacy metric, and find that 96.5% of those sentences post-edited by only a monolingual post-editor are judged to be completely correct. These results confirm that a monolingual domain expert can successfully triage the post-editing effort, substantially reducing the workload on the bilingual post-editor by only sending the most challenging sentences to the bilingual post-editor.

pdf bib
An open source desktop post-editing tool
Lane Schwartz
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

We present a simple user interface for post-editing that presents the user with the source sentence, machine translation, and word alignments for each sentence in a test document (Figure 1). This software is open source, written in Java, and has no external dependencies; it can be run on Linux, Mac OS X, and Windows. This software was originally designed for monolingual post-editors, but should be equally usable by bilingual post-editors. While it may seem counter-intuitive to present monolingual post-editors with the source sentence, we found that the presence of alignment links between source words and target words can in fact aid a monolingual post-editor, especially with regard to correcting word order. For example, in our experiments using this interface (Schwartz et al., 2014), post-editors encountered some sentences where a word or phrase was enclosed within bracketing punctuation marks (such as quotation marks, commas, or parentheses) in the source sentence, and the machine translation system incorrectly reordered the word or phrase outside the enclosing punctuation; by examining the alignment links the post-editors were able to correct such reordering mistakes.

pdf bib
Machine Translation and Monolingual Postediting: The AFRL WMT-14 System
Lane Schwartz | Timothy Anderson | Jeremy Gwinnup | Katherine Young
Proceedings of the Ninth Workshop on Statistical Machine Translation

2013

pdf bib
The MIT-LL/AFRL IWSLT-2013 MT system
Michaeel Kazi | Michael Coury | Elizabeth Salesky | Jessica Ray | Wade Shen | Terry Gleason | Tim Anderson | Grant Erdmann | Lane Schwartz | Brian Ore | Raymond Slyh | Jeremy Gwinnup | Katherine Young | Michael Hutt
Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the MIT-LL/AFRL statistical MT system and the improvements that were developed during the IWSLT 2013 evaluation campaign [1]. As part of these efforts, we experimented with a number of extensions to the standard phrase-based model that improve performance on the Russian to English, Chinese to English, Arabic to English, and English to French TED-talk translation task. We also applied our existing ASR system to the TED-talk lecture ASR task. We discuss the architecture of the MIT-LL/AFRL MT system, improvements over our 2012 system, and experiments we ran during the IWSLT-2013 evaluation. Specifically, we focus on 1) cross-entropy filtering of MT training data, and 2) improved optimization techniques, 3) language modeling, and 4) approximation of out-of-vocabulary words.

2011

pdf bib
Incremental Syntactic Language Models for Phrase-based Translation
Lane Schwartz | Chris Callison-Burch | William Schuler | Stephen Wu
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
Joshua 2.0: A Toolkit for Parsing-Based Machine Translation with Syntax, Semirings, Discriminative Training and Other Goodies
Zhifei Li | Chris Callison-Burch | Chris Dyer | Juri Ganitkevitch | Ann Irvine | Sanjeev Khudanpur | Lane Schwartz | Wren Thornton | Ziyuan Wang | Jonathan Weese | Omar Zaidan
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
Reproducible Results in Parsing-Based Machine Translation: The JHU Shared Task Submission
Lane Schwartz
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
Broad-Coverage Parsing Using Human-Like Memory Constraints
William Schuler | Samir AbdelRahman | Tim Miller | Lane Schwartz
Computational Linguistics, Volume 36, Number 1, March 2010

2009

pdf bib
Articles: A Framework for Fast Incremental Interpretation during Speech Decoding
William Schuler | Stephen Wu | Lane Schwartz
Computational Linguistics, Volume 35, Number 3, September 2009

pdf bib
Demonstration of Joshua: An Open Source Toolkit for Parsing-based Machine Translation
Zhifei Li | Chris Callison-Burch | Chris Dyer | Juri Ganitkevitch | Sanjeev Khudanpur | Lane Schwartz | Wren N. G. Thornton | Jonathan Weese | Omar F. Zaidan
Proceedings of the ACL-IJCNLP 2009 Software Demonstrations

pdf bib
Joshua: An Open Source Toolkit for Parsing-Based Machine Translation
Zhifei Li | Chris Callison-Burch | Chris Dyer | Sanjeev Khudanpur | Lane Schwartz | Wren Thornton | Jonathan Weese | Omar Zaidan
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

pdf bib
Toward a Psycholinguistically-Motivated Model of Language Processing
William Schuler | Samir AbdelRahman | Tim Miller | Lane Schwartz
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Multi-Source Translation Methods
Lane Schwartz
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop

Multi-parallel corpora provide a potentially rich resource for machine translation. This paper surveys existing methods for utilizing such resources, including hypothesis ranking and system combination techniques. We find that despite significant research into system combination, relatively little is know about how best to translate when multiple parallel source languages are available. We provide results to show that the MAX multilingual multi-source hypothesis ranking method presented by Och and Ney (2001) does not reliably improve translation quality when a broad range of language pairs are considered. We also show that the PROD multilingual multi-source hypothesis ranking method of Och and Ney (2001) cannot be used with standard phrase-based translation engines, due to a high number of unreachable hypotheses. Finally, we present an oracle experiment which shows that current hypothesis ranking methods fall far short of the best results reachable via sentence-level ranking.