Justin Mott

2020

This paper describes a new morphology resource created by Linguistic Data Consortium and the University of Pennsylvania for the DARPA LORELEI Program. The data consists of approximately 2000 tokens annotated for morphological segmentation in each of 9 low resource languages, along with root information for 7 of the languages. The languages annotated show a broad diversity of typological features. A minimal annotation scheme for segmentation was developed such that it could capture the patterns of a wide range of languages and also be performed reliably by non-linguist annotators. The basic annotation guidelines were designed to be language-independent, but included language-specific morphological paradigms and other specifications. The resulting annotated corpus is designed to support and stimulate the development of unsupervised morphological segmenters and analyzers by providing a gold standard for their evaluation on a more typologically diverse set of languages than has previously been available. By providing root annotation, this corpus is also a step toward supporting research in identifying richer morphological structures than simple morpheme boundaries.

This paper introduces the parallel Chinese-English Entities, Relations and Events (ERE) corpora developed by Linguistic Data Consortium under the DARPA Deep Exploration and Filtering of Text (DEFT) Program. Original Chinese newswire and discussion forum documents are annotated for two versions of the ERE task. The texts are manually translated into English and then annotated for the same ERE tasks on the English translation, resulting in a rich parallel resource that has utility for performers within the DEFT program, for participants in NIST’s Knowledge Base Population evaluations, and for cross-language projection research more generally.

pdf bib

2015

pdf bib

2014

pdf bib abs

Incorporating Alternate Translations into English Translation Treebank
Ann Bies | Justin Mott | Seth Kulick | Jennifer Garland | Colin Warner
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

New annotation guidelines and new processing methods were developed to accommodate English treebank annotation of a parallel English/Chinese corpus of web data that includes alternate English translations (one fluent, one literal) of expressions that are idiomatic in the Chinese source. In previous machine translation programs, alternate translations of idiomatic expressions had been present in untreebanked data only, but due to the high frequency of such expressions in informal genres such as discussion forums, machine translation system developers requested that alternatives be added to the treebanked data as well. In consultation with machine translation researchers, we chose a pragmatic approach of syntactically annotating only the fluent translation, while retaining the alternate literal translation as a segregated node in the tree. Since the literal translation alternates are often incompatible with English syntax, this approach allows us to create fluent trees without losing information. This resource is expected to support machine translation efforts, and the flexibility provided by the alternate translations is an enhancement to the treebank for this purpose.

pdf bib

Parser Evaluation Using Derivation Trees: A Complement to evalb
Seth Kulick | Ann Bies | Justin Mott | Anthony Kroch | Beatrice Santorini | Mark Liberman
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib

Inter-annotator Agreement for ERE annotation
Seth Kulick | Ann Bies | Justin Mott
Proceedings of the Second Workshop on EVENTS: Definition, Detection, Coreference, and Representation

2013

pdf bib

Using Derivation Trees for Informative Treebank Inter-Annotator Agreement Evaluation
Seth Kulick | Ann Bies | Justin Mott | Mohamed Maamouri | Beatrice Santorini | Anthony Kroch
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

pdf bib abs

Further Developments in Treebank Error Detection Using Derivation Trees
Seth Kulick | Ann Bies | Justin Mott
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This work describes how derivation tree fragments based on a variant of Tree Adjoining Grammar (TAG) can be used to check treebank consistency. Annotation of word sequences are compared both for their internal structural consistency, and their external relation to the rest of the tree. We expand on earlier work in this area in three ways. First, we provide a more complete description of the system, showing how a naive use of TAG structures will not work, leading to a necessary refinement. We also provide a more complete account of the processing pipeline, including the grouping together of structurally similar errors and their elimination of duplicates. Second, we include the new experimental external relation check to find an additional class of errors. Third, we broaden the evaluation to include both the internal and external relation checks, and evaluate the system on both an Arabic and English treebank. The evaluation has been successful enough that the internal check has been integrated into the standard pipeline for current English treebank construction at the Linguistic Data Consortium

pdf bib

Using Supertags and Encoded Annotation Principles for Improved Dependency to Phrase Structure Conversion
Seth Kulick | Ann Bies | Justin Mott
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies