Scott Miller


2020

SEARCHER: Shared Embedding Architecture for Effective Retrieval
Joel Barry | Elizabeth Boschee | Marjorie Freedman | Scott Miller
Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)

We describe an approach to cross-lingual information retrieval that does not rely on explicit translation of either document or query terms. Instead, both queries and documents are mapped into a shared embedding space where retrieval is performed. We discuss potential advantages of the approach in handling polysemy and synonymy. We present a method for training the model, and give details of the model implementation. We present experimental results for two cases: Somali-English and Bulgarian-English CLIR.

2019

The Challenges of Optimizing Machine Translation for Low Resource Cross-Language Information Retrieval
Constantine Lignos | Daniel Cohen | Yen-Chieh Lien | Pratik Mehta | W. Bruce Croft | Scott Miller
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

When performing cross-language information retrieval (CLIR) for lower-resourced languages, a common approach is to retrieve over the output of machine translation (MT). However, there is no established guidance on how to optimize the resulting MT-IR system. In this paper, we examine the relationship between the performance of MT systems and both neural and term frequency-based IR models to identify how CLIR performance can be best predicted from MT quality. We explore performance at varying amounts of MT training data, byte pair encoding (BPE) merge operations, and across two IR collections and retrieval models. We find that the choice of IR collection can substantially affect the predictive power of MT tuning decisions and evaluation, potentially introducing dissociations between MT-only and overall CLIR performance.

Cross-lingual Joint Entity and Word Embedding to Improve Entity Linking and Parallel Sentence Mining
Xiaoman Pan | Thamme Gowda | Heng Ji | Jonathan May | Scott Miller
Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)

Entities, which refer to distinct objects in the real world, can be viewed as language universals and used as effective signals to generate less ambiguous semantic representations and align multiple languages. We propose a novel method, CLEW, to generate cross-lingual data that is a mix of entities and contextual words based on Wikipedia. We replace each anchor link in the source language with its corresponding entity title in the target language if it exists, or in the source language otherwise. A cross-lingual joint entity and word embedding learned from this kind of data not only can disambiguate linkable entities but can also effectively represent unlinkable entities. Because this multilingual common space directly relates the semantics of contextual words in the source language to that of entities in the target language, we leverage it for unsupervised cross-lingual entity linking. Experimental results show that CLEW significantly advances the state-of-the-art: up to 3.1% absolute F-score gain for unsupervised cross-lingual entity linking. Moreover, it provides reliable alignment on both the word/entity level and the sentence level, and thus we use it to mine parallel sentences for all $\binom{302}{2}$ language pairs in Wikipedia.

SARAL: A Low-Resource Cross-Lingual Domain-Focused Information Retrieval System for Effective Rapid Document Triage
Elizabeth Boschee | Joel Barry | Jayadev Billa | Marjorie Freedman | Thamme Gowda | Constantine Lignos | Chester Palen-Michel | Michael Pust | Banriskhem Kayang Khonglah | Srikanth Madikeri | Jonathan May | Scott Miller
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

With the increasing democratization of electronic media, vast information resources are available in less-frequently-taught languages such as Swahili or Somali. That information, which may be crucially important and not available elsewhere, can be difficult for monolingual English speakers to effectively access. In this paper we present an end-to-end cross-lingual information retrieval (CLIR) and summarization system for low-resource languages that 1) enables English speakers to search foreign language repositories of text and audio using English queries, 2) summarizes the retrieved documents in English with respect to a particular information need, and 3) provides complete transcriptions and translations as needed. The SARAL system achieved the top end-to-end performance in the most recent IARPA MATERIAL CLIR+summarization evaluations. Our demonstration system provides end-to-end open query retrieval and summarization capability, and presents the original source text or audio, speech transcription, and machine translation, for two low resource languages.

2017

CADET: Computer Assisted Discovery Extraction and Translation
Benjamin Van Durme | Tom Lippincott | Kevin Duh | Deana Burchfield | Adam Poliak | Cash Costello | Tim Finin | Scott Miller | James Mayfield | Philipp Koehn | Craig Harman | Dawn Lawrie | Chandler May | Max Thomas | Annabelle Carrell | Julianne Chaloux | Tongfei Chen | Alex Comerford | Mark Dredze | Benjamin Glass | Shudong Hao | Patrick Martin | Pushpendre Rastogi | Rashmi Sankepally | Travis Wolfe | Ying-Ying Tran | Ted Zhang
Proceedings of the IJCNLP 2017, System Demonstrations

Computer Assisted Discovery Extraction and Translation (CADET) is a workbench for helping knowledge workers find, label, and translate documents of interest. It combines a multitude of analytics together with a flexible environment for customizing the workflow for different users. This open-source framework allows for easy development of new research prototypes using a micro-service architecture based atop Docker and Apache Thrift.

2012

Use of Modality and Negation in Semantically-Informed Syntactic MT
Kathryn Baker | Michael Bloodgood | Bonnie J. Dorr | Chris Callison-Burch | Nathaniel W. Filardo | Christine Piatko | Lori Levin | Scott Miller
Computational Linguistics, Volume 38, Issue 2 - June 2012

2010

Semantically-Informed Syntactic Machine Translation: A Tree-Grafting Approach
Kathryn Baker | Michael Bloodgood | Chris Callison-Burch | Bonnie Dorr | Nathaniel Filardo | Lori Levin | Scott Miller | Christine Piatko
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

We describe a unified and coherent syntactic framework for supporting a semantically-informed syntactic approach to statistical machine translation. Semantically enriched syntactic tags assigned to the target-language training texts improved translation quality. The resulting system significantly outperformed a linguistically naive baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu-English translation task. This finding supports the hypothesis (posed by many researchers in the MT community, e.g., in DARPA GALE) that both syntactic and semantic information are critical for improving translation quality—and further demonstrates that large gains can be achieved for low-resource languages with different word order than English.

2004

Name Tagging with Word Clusters and Discriminative Training
Scott Miller | Jethran Guinness | Alex Zamanian
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

2001

Experiments in Multi-Modal Automatic Content Extraction
Lance Ramshaw | Elizabeth Boschee | Sergey Bratus | Scott Miller | Rebecca Stone | Ralph Weischedel | Alex Zamanian
Proceedings of the First International Conference on Human Language Technology Research

FactBrowser Demonstration
Scott Miller | Sergey Bratus | Lance Ramshaw | Ralph Weischedel | Alex Zamanian
Proceedings of the First International Conference on Human Language Technology Research

2000

A Novel Use of Statistical Parsing to Extract Information from Text
Scott Miller | Heidi Fox | Lance Ramshaw | Ralph Weischedel
1st Meeting of the North American Chapter of the Association for Computational Linguistics

1998

BBN: Description of the SIFT System as Used for MUC-7
Scott Miller | Michael Crystal | Heidi Fox | Lance Ramshaw | Richard Schwartz | Rebecca Stone | Ralph Weischedel | The Annotation Group
Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998

Semantic Tagging using a Probabilistic Context Free Grammar
Michael Collins | Scott Miller
Sixth Workshop on Very Large Corpora

Algorithms That Learn to Extract Information BBN: TIPSTER Phase III
Scott Miller | Michael Crystal | Heidi Fox | Lance Ramshaw | Richard Schwartz | Rebecca Stone | Ralph Weischedel
TIPSTER TEXT PROGRAM PHASE III: Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998

1997

Nymble: a High-Performance Learning Name-finder
Daniel M. Bikel | Scott Miller | Richard Schwartz | Ralph Weischedel
Fifth Conference on Applied Natural Language Processing

1996

A Fully Statistical Approach to Natural Language Interfaces
Scott Miller | David Stallard | Robert Bobrow | Richard Schwartz
34th Annual Meeting of the Association for Computational Linguistics

1994

Hidden Understanding Models of Natural Language
Scott Miller | Robert Bobrow | Robert Ingria | Richard Schwartz
32nd Annual Meeting of the Association for Computational Linguistics

Automatic Grammar Acquisition
Scott Miller | Heidi J. Fox
Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994

Statistical Language Processing Using Hidden Understanding Models
Scott Miller | Richard Schwartz | Robert Bobrow | Robert Ingria
Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994

1993

Example-Based Correction of Word Segmentation and Part of Speech Labelling
Tomoyoshi Matsukawa | Scott Miller | Ralph Weischedel
Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993

BBN’s PLUM Probabilistic Language Understanding System
Ralph Weischedel | Damaris Ayuso | Heidi Fox | Tomoyoshi Matsukawa | Constantine Papageorgiou | Dawn MacLaughlin | Masaichiro Kitagawa | Tsutomu Sakai | June Abe | Hiroto Hosiho | Yoichi Miyamoto | Scott Miller
TIPSTER TEXT PROGRAM: PHASE I: Proceedings of a Workshop held at Fredericksburg, Virginia, September 19-23, 1993

BBN: Description of the PLUM System as Used for MUC-5
Ralph Weischedel | Damaris Ayuso | Sean Boisen | Heidi Fox | Robert Ingria | Tomoyoshi Matsukawa | Constantine Papageorgiou | Dawn MacLaughlin | Masaichiro Kitagawa | Tsutomu Sakai | June Abe | Hiroto Hosiho | Yoichi Miyamoto | Scott Miller
Fifth Message Understanding Conference (MUC-5): Proceedings of a Conference Held in Baltimore, Maryland, August 25-27, 1993