John Miller

Also published as: John E. Miller


2023

Detecting Lexical Borrowings from Dominant Languages in Multilingual Wordlists
John E. Miller | Johann-Mattis List
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Language contact is a pervasive phenomenon reflected in the borrowing of words from donor to recipient languages. Most computational approaches to borrowing detection treat all languages under study as equally important, even though dominant languages have a stronger impact on heritage languages than vice versa. We test new methods for lexical borrowing detection in contact situations where dominant languages play an important role, applying two classical sequence comparison methods and one machine learning method to a sample of seven Latin American languages which have all borrowed extensively from Spanish. All systems perform well, with the supervised machine learning system outperforming the classical systems. A review of detection errors shows that borrowing detection could be substantially improved by taking into account donor words with divergent meanings from recipient words.

2021

Representation of Yine [Arawak] Morphology by Finite State Transducer Formalism
Adriano Ingunza Torres | John Miller | Arturo Oncevay | Roberto Zariquiey Biondi
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We represent the complexity of Yine (Arawak) morphology with a finite state transducer (FST) based morphological analyzer. Yine is a low-resource indigenous polysynthetic Peruvian language spoken by approximately 3,000 people and is classified as ‘definitely endangered’ by UNESCO. We review Yine morphology focusing on morphophonology, possessive constructions and verbal predicates. Then we develop FSTs to model these components proposing techniques to solve challenging problems such as complex patterns of incorporating open and closed category arguments. This is a work in progress and we still have more to do in the development and verification of our analyzer. Our analyzer will serve both as a tool to better document the Yine language and as a component of natural language processing (NLP) applications such as spell checking and correction.

Neural Borrowing Detection with Monolingual Lexical Models
John Miller | Emanuel Pariasca | Cesar Beltran Castañon
Proceedings of the Student Research Workshop Associated with RANLP 2021

Identification of lexical borrowings, transfer of words between languages, is an essential practice of historical linguistics and a vital tool in analysis of language contact and cultural events in general. We seek to improve tools for automatic detection of lexical borrowings, focusing here on detecting borrowed words from monolingual wordlists. Starting with a recurrent neural lexical language model and competing entropies approach, we incorporate a more current Transformer based lexical model. From there we experiment with several different models and approaches including a lexical donor model with augmented wordlist. The Transformer model reduces execution time and minimally improves borrowing detection. The augmented donor model shows some promise. A substantive change in approach or model is needed to make significant gains in identification of lexical borrowings.

2018

Toward Universal Dependencies for Shipibo-Konibo
Alonso Vasquez | Renzo Ego Aguirre | Candy Angulo | John Miller | Claudia Villanueva | Željko Agić | Roberto Zariquiey | Arturo Oncevay
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

We present an initial version of the Universal Dependencies (UD) treebank for Shipibo-Konibo, the first South American, Amazonian, Panoan and Peruvian language with a resource built under UD. We describe the linguistic aspects of how the tagset was defined and the treebank was annotated; in addition we present our specific treatment of linguistic units called clitics. Although the treebank is still under development, it allowed us to perform a typological comparison against Spanish, the predominant language in Peru, and dependency syntax parsing experiments in both monolingual and cross-lingual approaches.

2017

Topic Model Stability for Hierarchical Summarization
John Miller | Kathleen McCoy
Proceedings of the Workshop on New Frontiers in Summarization

We envisioned responsive generic hierarchical text summarization with summaries organized by section and paragraph based on hierarchical structure topic models. But we had to be sure that topic models were stable for the sampled corpora. To that end we developed a methodology for aligning multiple hierarchical structure topic models run over the same corpus under similar conditions, calculating a representative centroid model, and reporting stability of the centroid model. We ran stability experiments for standard corpora and a development corpus of Global Warming articles. We found flat and hierarchical structures of two levels plus the root offer stable centroid models, but hierarchical structures of three levels plus the root didn’t seem stable enough for use in hierarchical summarization.

Globally Normalized Reader
Jonathan Raiman | John Miller
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Rapid progress has been made towards question answering (QA) systems that can extract answers from text. Existing neural approaches make use of expensive bi-directional attention mechanisms or score all possible answer spans, limiting scalability. We propose instead to cast extractive QA as an iterative search problem: select the answer’s sentence, start word, and end word. This representation reduces the space of each search step and allows computation to be conditionally allocated to promising search paths. We show that globally normalizing the decision process and back-propagating through beam search makes this representation viable and learning efficient. We empirically demonstrate the benefits of this approach using our model, Globally Normalized Reader (GNR), which achieves the second highest single model performance on the Stanford Question Answering Dataset (68.4 EM, 76.21 F1 dev) and is 24.7x faster than bi-attention-flow. We also introduce a data-augmentation method to produce semantically valid examples by aligning named entities to a knowledge base and swapping them with new entities of the same type. This method improves the performance of all models considered in this work and is of independent interest for a variety of NLP tasks.

2015

Traversing Knowledge Graphs in Vector Space
Kelvin Guu | John Miller | Percy Liang
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Experimental Design to Improve Topic Analysis Based Summarization
John Miller | Kathleen McCoy
Proceedings of the 8th International Natural Language Generation Conference (INLG)

2007

Building Domain-Specific Taggers without Annotated (Domain) Data
John Miller | Manabu Torii | K. Vijay-Shanker
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Adaptation of POS Tagging for Multiple BioMedical Domains
John E. Miller | Manabu Torii | K. Vijay-Shanker
Biological, translational, and clinical language processing

2006

Rapid Adaptation of POS Tagging for Domain Specific Uses
John E. Miller | Michael Bloodgood | Manabu Torii | K. Vijay-Shanker
Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology