Mark Sammons


On the Strength of Character Language Models for Multilingual Named Entity Recognition
Xiaodong Yu | Stephen Mayhew | Mark Sammons | Dan Roth
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Character-level patterns have been widely used as features in English Named Entity Recognition (NER) systems. However, to date there has been no direct investigation of the inherent differences between name and non-name tokens in text, nor of whether this property holds across multiple languages. This paper analyzes the capabilities of corpus-agnostic Character-level Language Models (CLMs) in the binary task of distinguishing name tokens from non-name tokens. We demonstrate that CLMs provide a simple and powerful model for capturing these differences, identifying named entity tokens in a diverse set of languages at close to the performance of full NER systems. Moreover, by adding very simple CLM-based features we can significantly improve the performance of an off-the-shelf NER system for multiple languages.
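
The binary name/non-name task described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a character trigram model with add-one smoothing, one model per class, and a decision rule that picks the class whose model gives the token lower perplexity. The class `CharNgramLM`, the function `is_name`, and the toy training lists are all hypothetical.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram language model with add-one smoothing (illustrative only)."""

    def __init__(self, order=3):
        self.order = order
        self.ngrams = Counter()    # counts of context + next character
        self.contexts = Counter()  # counts of each context
        self.vocab = set()

    def _pad(self, token):
        # "^" marks the start of a token, "$" marks its end.
        return "^" * (self.order - 1) + token + "$"

    def train(self, tokens):
        for token in tokens:
            padded = self._pad(token)
            self.vocab.update(padded)
            for i in range(self.order - 1, len(padded)):
                context = padded[i - self.order + 1:i]
                self.ngrams[context + padded[i]] += 1
                self.contexts[context] += 1

    def perplexity(self, token):
        padded = self._pad(token)
        v = len(self.vocab) + 1  # one extra slot reserves mass for unseen characters
        log_prob = 0.0
        steps = 0
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            count = self.ngrams[context + padded[i]]
            log_prob += math.log((count + 1) / (self.contexts[context] + v))
            steps += 1
        return math.exp(-log_prob / steps)


def is_name(token, name_lm, other_lm):
    """Label a token as a name when the name model finds it less surprising."""
    return name_lm.perplexity(token) < other_lm.perplexity(token)


# Toy training data, made up for this sketch.
name_lm = CharNgramLM()
name_lm.train(["London", "Paris", "Berlin", "Madrid", "Lisbon"])
other_lm = CharNgramLM()
other_lm.train(["the", "walked", "and", "quickly", "over", "running"])

print(is_name("London", name_lm, other_lm))
print(is_name("walked", name_lm, other_lm))
```

With training lists this small the sketch only separates tokens that resemble its training data; any realistic use would need far more text per class, and the smoothing and n-gram order are arbitrary choices here.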

CogCompNLP: Your Swiss Army Knife for NLP
Daniel Khashabi | Mark Sammons | Ben Zhou | Tom Redman | Christos Christodoulopoulos | Vivek Srikumar | Nicholas Rizzolo | Lev Ratinov | Guanheng Luo | Quang Do | Chen-Tse Tsai | Subhro Roy | Stephen Mayhew | Zhili Feng | John Wieting | Xiaodong Yu | Yangqiu Song | Shashank Gupta | Shyam Upadhyay | Naveen Arivazhagan | Qiang Ning | Shaoshi Ling | Dan Roth
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


Adapting to Learner Errors with Minimal Supervision
Alla Rozovskaya | Dan Roth | Mark Sammons
Computational Linguistics, Volume 43, Issue 4 - December 2017

This article considers the problem of correcting errors made by English as a Second Language writers from a machine learning perspective, and addresses the important issue of developing an appropriate training paradigm for the task, one that accounts for error patterns of non-native writers using minimal supervision. Existing training approaches present a trade-off between large amounts of cheap data offered by the native-trained models and additional knowledge of learner error patterns provided by the more expensive method of training on annotated learner data. We propose a novel training approach that draws on the strengths offered by the two standard training paradigms (training either on native or on annotated learner data) and that outperforms both of these standard methods. Using the key observation that parameters relating to error regularities exhibited by non-native writers are relatively simple, we develop models that can incorporate knowledge about error regularities based on a small annotated sample but that are otherwise trained on native English data. The key contribution of this article is the introduction and analysis of two methods for adapting the learned models to error patterns of non-native writers: one method that applies to generative classifiers and a second that applies to discriminative classifiers. Both methods demonstrated state-of-the-art performance in several text correction competitions. In particular, the Illinois system that implements these methods ranked at the top in two recent CoNLL shared tasks on error correction. We conduct further evaluation of the proposed approaches, studying the effect of using error data from speakers of the same native language, languages that are closely related linguistically, and unrelated languages.


EDISON: Feature Extraction for NLP, Simplified
Mark Sammons | Christos Christodoulopoulos | Parisa Kordjamshidi | Daniel Khashabi | Vivek Srikumar | Dan Roth
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

When designing Natural Language Processing (NLP) applications that use Machine Learning (ML) techniques, feature extraction becomes a significant part of the development effort, whether developing a new application or attempting to reproduce results reported for existing NLP tasks. We present EDISON, a Java library of feature generation functions used in a suite of state-of-the-art NLP tools, based on a set of generic NLP data structures. These feature extractors populate simple data structures encoding the extracted features, which the package can also serialize to an intuitive JSON file format that can be easily mapped to formats used by ML packages. EDISON can also be used programmatically with JVM-based (Java/Scala) NLP software to provide the feature extractor input. The collection of feature extractors is organised hierarchically and a simple search interface is provided. In this paper we include examples that demonstrate the versatility and ease-of-use of the EDISON feature extraction suite to show that this can significantly reduce the time spent by developers on feature extraction design for NLP systems. The library is publicly hosted at, and we hope that other NLP researchers will contribute to the set of feature extractors. In this way, the community can help simplify reproduction of published results and the integration of ideas from diverse sources when developing new and improved NLP applications.


Improving a Pipeline Architecture for Shallow Discourse Parsing
Yangqiu Song | Haoruo Peng | Parisa Kordjamshidi | Mark Sammons | Dan Roth
Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task


The Illinois-Columbia System in the CoNLL-2014 Shared Task
Alla Rozovskaya | Kai-Wei Chang | Mark Sammons | Dan Roth | Nizar Habash
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task

ILLINOISCLOUDNLP: Text Analytics Services in the Cloud
Hao Wu | Zhiye Fei | Aaron Dai | Mark Sammons | Dan Roth | Stephen Mayhew
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Natural Language Processing (NLP) continues to grow in popularity in a range of research and commercial applications. However, installing, maintaining, and running NLP tools can be time-consuming, and many commercial and research end users have only intermittent need for large processing capacity. This paper describes ILLINOISCLOUDNLP, an on-demand framework built around NLPCURATOR and Amazon Web Services’ Elastic Compute Cloud (EC2). This framework provides a simple interface through which end users can deploy one or more NLPCURATOR instances on EC2, upload plain text documents, specify a set of Text Analytics tools (NLP annotations) to apply, and process and store or download the processed data. It also allows end users to use a model trained on their own data: ILLINOISCLOUDNLP takes care of training, hosting, and applying it to new data just as it does with existing models within NLPCURATOR. As a representative use case, we describe our use of ILLINOISCLOUDNLP to process 3.05 million documents used in the 2012 and 2013 Text Analysis Conference Knowledge Base Population tasks at a relatively deep level of processing, in approximately 20 hours, at an approximate cost of US$500; this is about 20 times faster than doing so on a single server and requires no human supervision and no NLP or Machine Learning expertise.
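
The use-case figures in this abstract imply some rough rates, worked out in the short sketch below. The inputs are the approximate values quoted in the abstract; the derived numbers are back-of-the-envelope estimates, not results reported by the authors, and the variable names are chosen only for this illustration.

```python
# Approximate figures quoted in the abstract.
documents = 3_050_000   # documents processed
cloud_hours = 20        # approximate wall-clock time on EC2
cost_usd = 500          # approximate total cost
speedup = 20            # "about 20 times faster" than a single server

# Derived estimates (all approximate).
docs_per_hour = documents / cloud_hours           # throughput on the cloud deployment
cost_per_1000_docs = cost_usd / documents * 1000  # US dollars per thousand documents
single_server_hours = cloud_hours * speedup       # implied single-server run time

print(f"{docs_per_hour:,.0f} documents per hour")
print(f"${cost_per_1000_docs:.3f} per 1,000 documents")
print(f"{single_server_hours} hours on a single server")
```

Under these assumptions the single-server run would take on the order of two to three weeks of continuous processing.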


The University of Illinois System in the CoNLL-2013 Shared Task
Alla Rozovskaya | Kai-Wei Chang | Mark Sammons | Dan Roth
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task


The UI System in the HOO 2012 Shared Task on Error Correction
Alla Rozovskaya | Mark Sammons | Dan Roth
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

Illinois-Coref: The UI System in the CoNLL-2012 Shared Task
Kai-Wei Chang | Rajhans Samdani | Alla Rozovskaya | Mark Sammons | Dan Roth
Joint Conference on EMNLP and CoNLL - Shared Task

An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines)
James Clarke | Vivek Srikumar | Mark Sammons | Dan Roth
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Natural Language Processing continues to grow in popularity in a range of research and commercial applications, yet managing the wide array of potential NLP components remains a difficult problem. This paper describes Curator, an NLP management framework designed to address some common problems and inefficiencies associated with building NLP process pipelines; and Edison, an NLP data structure library in Java that provides streamlined interactions with Curator and offers a range of useful supporting functionality.


Inference Protocols for Coreference Resolution
Kai-Wei Chang | Rajhans Samdani | Alla Rozovskaya | Nick Rizzolo | Mark Sammons | Dan Roth
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

University of Illinois System in HOO Text Correction Shared Task
Alla Rozovskaya | Mark Sammons | Joshua Gioja | Dan Roth
Proceedings of the 13th European Workshop on Natural Language Generation


“Ask Not What Textual Entailment Can Do for You...”
Mark Sammons | V.G.Vinod Vydiswaran | Dan Roth
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Textual Entailment
Mark Sammons | Idan Szpektor | V.G.Vinod Vydiswaran
NAACL HLT 2010 Tutorial Abstracts


A Framework for Entailed Relation Recognition
Dan Roth | Mark Sammons | V.G.Vinod Vydiswaran
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers


Extraction of Entailed Semantic Relations Through Syntax-Based Comma Resolution
Vivek Srikumar | Roi Reichart | Mark Sammons | Ari Rappoport | Dan Roth
Proceedings of ACL-08: HLT


Semantic and Logical Inference Model for Textual Entailment
Dan Roth | Mark Sammons
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing


Demonstrating an Interactive Semantic Role Labeling System
Vasin Punyakanok | Dan Roth | Mark Sammons | Wen-tau Yih
Proceedings of HLT/EMNLP 2005 Interactive Demonstrations