James Mayfield


2023

On the Surprising Effectiveness of Name Matching Alone in Autoregressive Entity Linking
Elliot Schumacher | James Mayfield | Mark Dredze
Proceedings of the First Workshop on Matching From Unstructured and Structured Data (MATCHING 2023)

Fifteen years of work on entity linking has established the importance of different information sources in making linking decisions: mention and entity name similarity, contextual relevance, and features of the knowledge base. Modern state-of-the-art systems build on these features, including through neural representations (Wu et al., 2020). In contrast to this trend, the autoregressive language model GENRE (De Cao et al., 2021) generates normalized entity names for mentions and beats many other entity linking systems, despite making no use of knowledge base (KB) information. How is this possible? We analyze the behavior of GENRE on several entity linking datasets and demonstrate that its performance stems from memorization of name patterns. In contrast, it fails in cases that might benefit from using the KB. We experiment with a modification to the model to enable it to utilize KB information, highlighting challenges to incorporating traditional entity linking information sources into autoregressive models.

2022

Zero-shot Cross-Language Transfer of Monolingual Entity Linking Models
Elliot Schumacher | James Mayfield | Mark Dredze
Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL)

Most entity linking systems, whether mono- or multilingual, link mentions to a single English knowledge base. Few have considered linking non-English text to a non-English KB, and therefore transferring an English entity linking model to both a new document language and a new KB language. We consider the task of zero-shot cross-language transfer of entity linking systems to a new language and KB. We find that a system trained with multilingual representations does reasonably well, and propose improvements to system training that lead to improved recall in most datasets, often matching the in-language performance. We further conduct a detailed evaluation to elucidate the challenges of this setting.

2021

Cross-Lingual Transfer in Zero-Shot Cross-Language Entity Linking
Elliot Schumacher | James Mayfield | Mark Dredze
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

Tagging Location Phrases in Text
Paul McNamee | James Mayfield | Cash Costello | Caitlyn Bishop | Shelby Anderson
Proceedings of the Twelfth Language Resources and Evaluation Conference

For over thirty years researchers have studied the problem of automatically detecting named entities in written language. Throughout this time the majority of such work has focused on detection and classification of entities into coarse-grained types such as PERSON, ORGANIZATION, and LOCATION. Less attention has been focused on non-named mentions of entities, including non-named location phrases such as “the medical clinic in Telonge” or “2 km below the Dolin Maniche bridge”. In this work we describe the Location Phrase Detection task to identify such spans. Our key accomplishments include: developing a sequential tagging approach; crafting annotation guidelines; building annotated datasets for English and Russian news; and conducting experiments in automated detection of location phrases with both statistical and neural taggers. This work is motivated by extracting rich location information to support situational awareness during humanitarian crises such as natural disasters.

Building OCR/NER Test Collections
Dawn Lawrie | James Mayfield | David Etter
Proceedings of the Twelfth Language Resources and Evaluation Conference

Named entity recognition (NER) identifies spans of text that contain names. Over the past two decades, many researchers have reported the results of NER on text created through optical character recognition (OCR). Unfortunately, the test collections that support this research are annotated with named entities after OCR has been run. This means that the collection must be re-annotated if the OCR output changes. Instead, by tying annotations to character locations on the page, a collection can be built that supports OCR and NER research without requiring re-annotation when either improves. This means that named entities are annotated on the transcribed text. The transcribed text is all that is needed to evaluate the performance of OCR. For NER evaluation, the tagged OCR output is aligned to the transcriptions, creating modified files of each, which are scored. This paper presents a methodology for building such a test collection and releases a collection of Chinese OCR-NER data constructed using the methodology. The paper provides performance baselines for current OCR and NER systems applied to this new collection.

Dragonfly: Advances in Non-Speaker Annotation for Low Resource Languages
Cash Costello | Shelby Anderson | Caitlyn Bishop | James Mayfield | Paul McNamee
Proceedings of the Twelfth Language Resources and Evaluation Conference

Dragonfly is an open source software tool that supports annotation of text in a low resource language by non-speakers of the language. Using semantic and contextual information, non-speakers of a language familiar with the Latin script can produce high quality named entity annotations to support construction of a name tagger. We describe a procedure for annotating low resource languages using Dragonfly that others can use, which we developed based on our experience annotating data in more than ten languages. We also present performance comparisons between models trained on native speaker and non-speaker annotations.

2018

Platforms for Non-speakers Annotating Names in Any Language
Ying Lin | Cash Costello | Boliang Zhang | Di Lu | Heng Ji | James Mayfield | Paul McNamee
Proceedings of ACL 2018, System Demonstrations

We demonstrate two annotation platforms that allow an English speaker to annotate names for any language without knowing the language. These platforms provided high-quality “silver standard” annotations for low-resource language name taggers (Zhang et al., 2017) that achieved state-of-the-art performance on two surprise languages (Oromo and Tigrinya) at LoreHLT2017 and ten languages at TAC-KBP EDL2017 (Ji et al., 2017). We discuss strengths and limitations and compare other methods of creating silver- and gold-standard annotations using native speakers. We will make our tools publicly available for research use.

2017

CADET: Computer Assisted Discovery Extraction and Translation
Benjamin Van Durme | Tom Lippincott | Kevin Duh | Deana Burchfield | Adam Poliak | Cash Costello | Tim Finin | Scott Miller | James Mayfield | Philipp Koehn | Craig Harman | Dawn Lawrie | Chandler May | Max Thomas | Annabelle Carrell | Julianne Chaloux | Tongfei Chen | Alex Comerford | Mark Dredze | Benjamin Glass | Shudong Hao | Patrick Martin | Pushpendre Rastogi | Rashmi Sankepally | Travis Wolfe | Ying-Ying Tran | Ted Zhang
Proceedings of the IJCNLP 2017, System Demonstrations

Computer Assisted Discovery Extraction and Translation (CADET) is a workbench for helping knowledge workers find, label, and translate documents of interest. It combines a multitude of analytics together with a flexible environment for customizing the workflow for different users. This open-source framework allows for easy development of new research prototypes using a micro-service architecture based atop Docker and Apache Thrift.

Language-Independent Named Entity Analysis Using Parallel Projection and Rule-Based Disambiguation
James Mayfield | Paul McNamee | Cash Costello
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

The 2017 shared task at the Balto-Slavic NLP workshop requires identifying coarse-grained named entities in seven languages, identifying each entity’s base form, and clustering name mentions across the multilingual set of documents. The fact that no training data is provided to systems for building supervised classifiers further adds to the complexity. To complete the task we first use publicly available parallel texts to project named entity recognition capability from English to each evaluation language. We ignore entirely the subtask of identifying non-inflected forms of names. Finally, we create cross-document entity identifiers by clustering named mentions using a procedure-based approach.

2013

KELVIN: a tool for automated knowledge base construction
Paul McNamee | James Mayfield | Tim Finin | Tim Oates | Dawn Lawrie | Tan Xu | Douglas Oard
Proceedings of the 2013 NAACL HLT Demonstration Session

UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems
Lushan Han | Abhay L. Kashyap | Tim Finin | James Mayfield | Jonathan Weese
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2012

A Context-Aware Approach to Entity Linking
Veselin Stoyanov | James Mayfield | Tan Xu | Douglas Oard | Dawn Lawrie | Tim Oates | Tim Finin
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)

Evaluating the Quality of a Knowledge Base Populated from Text
James Mayfield | Tim Finin
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)

Creating and Curating a Cross-Language Person-Entity Linking Collection
Dawn Lawrie | James Mayfield | Paul McNamee | Douglas Oard
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

To stimulate research in cross-language entity linking, we present a new test collection for evaluating the accuracy of cross-language entity linking in twenty-one languages. This paper describes an efficient way to create and curate such a collection, judiciously exploiting existing language resources. Queries are created by semi-automatically identifying person names on the English side of a parallel corpus, using judgments obtained through crowdsourcing to identify the entity corresponding to the name, and projecting the English name onto the non-English document using word alignments. Name projections are then curated, again through crowdsourcing. This technique resulted in the first publicly available multilingual cross-language entity linking collection. The collection includes approximately 55,000 queries, comprising between 875 and 4,329 queries for each of twenty-one non-English languages.

2011

Cross-Language Entity Linking
Paul McNamee | James Mayfield | Dawn Lawrie | Douglas Oard | David Doermann
Proceedings of 5th International Joint Conference on Natural Language Processing

2009

Translation Corpus Source and Size in Bilingual Retrieval
Paul McNamee | James Mayfield | Charles Nicholas
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

2008

Learning Named Entity Hyponyms for Question Answering
Paul McNamee | Rion Snow | Patrick Schone | James Mayfield
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

2006

Translation of Multiword Expressions Using Parallel Suffix Arrays
Paul McNamee | James Mayfield
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

Accurately translating multiword expressions is important to obtain good performance in machine translation, cross-language information retrieval, and other multilingual tasks in human language technology. Existing approaches to inducing translation equivalents of multiword units have focused on agglomerating individual words or on aligning words in a statistical machine translation system. We present a different approach based upon information theoretic heuristics and the exact counting of frequencies of occurrence of multiword strings in aligned parallel corpora. We apply a technique introduced by Yamamoto and Church that uses suffix arrays and longest common prefix arrays. Evaluation of the method in multiple language pairs was performed using bilingual lexicons of domain-specific terminology as a gold standard. We found that performance of 50-70%, as measured by mean reciprocal rank, can be obtained for terms that occur more than about 10 times.

2003

Named Entity Recognition using Hundreds of Thousands of Features
James Mayfield | Paul McNamee | Christine Piatko
Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003

2002

Entity Extraction without Language-Specific Resources
Paul McNamee | James Mayfield
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

1992

University of Maryland/ConQuest: MUC-4 Test Results and Analysis
James Mayfield
Fourth Message Understanding Conference (MUC-4): Proceedings of a Conference Held in McLean, Virginia, June 16-18, 1992

University of Maryland/ConQuest: Description of the ICTOAN System as Used for MUC-4
James Mayfield
Fourth Message Understanding Conference (MUC-4): Proceedings of a Conference Held in McLean, Virginia, June 16-18, 1992

1991

Synchronetics: MUC-3 Test Results and Analysis
James Mayfield | Edwin Addison
Third Message Understanding Conference (MUC-3): Proceedings of a Conference Held in San Diego, California, May 21-23, 1991

Synchronetics: Description of the Synchronetics System Used for MUC-3
James Mayfield
Third Message Understanding Conference (MUC-3): Proceedings of a Conference Held in San Diego, California, May 21-23, 1991

1988

The Berkeley Unix Consultant Project
Robert Wilensky | David N. Chin | Marc Luria | James Martin | James Mayfield | Dekai Wu
Computational Linguistics, Volume 14, Number 4, December 1988, LFP: A Logic for Linguistic Descriptions and an Analysis of its Complexity