Zhiyi Song


2019

Corpus Building for Low Resource Languages in the DARPA LORELEI Program
Jennifer Tracey | Stephanie Strassel | Ann Bies | Zhiyi Song | Michael Arrigo | Kira Griffitt | Dana Delgado | Dave Graff | Seth Kulick | Justin Mott | Neil Kuster
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

2018

Laying the Groundwork for Knowledge Base Population: Nine Years of Linguistic Resources for TAC KBP
Jeremy Getman | Joe Ellis | Stephanie Strassel | Zhiyi Song | Jennifer Tracey
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Cross-Document, Cross-Language Event Coreference Annotation Using Event Hoppers
Zhiyi Song | Ann Bies | Justin Mott | Xuansong Li | Stephanie Strassel | Christopher Caruso
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

A Comparison of Event Representations in DEFT
Ann Bies | Zhiyi Song | Jeremy Getman | Joe Ellis | Justin Mott | Stephanie Strassel | Martha Palmer | Teruko Mitamura | Marjorie Freedman | Heng Ji | Tim O’Gorman
Proceedings of the Fourth Workshop on Events

Event Nugget and Event Coreference Annotation
Zhiyi Song | Ann Bies | Stephanie Strassel | Joe Ellis | Teruko Mitamura | Hoa Trang Dang | Yukari Yamakawa | Sue Holm
Proceedings of the Fourth Workshop on Events

Parallel Chinese-English Entities, Relations and Events Corpora
Justin Mott | Ann Bies | Zhiyi Song | Stephanie Strassel
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper introduces the parallel Chinese-English Entities, Relations and Events (ERE) corpora developed by the Linguistic Data Consortium under the DARPA Deep Exploration and Filtering of Text (DEFT) Program. Original Chinese newswire and discussion forum documents are annotated for two versions of the ERE task. The texts are manually translated into English and then annotated for the same ERE tasks on the English translation, resulting in a rich parallel resource that has utility for performers within the DEFT program, for participants in NIST’s Knowledge Base Population evaluations, and for cross-language projection research more generally.

2015

Event Nugget Annotation: Processes and Issues
Teruko Mitamura | Yukari Yamakawa | Susan Holm | Zhiyi Song | Ann Bies | Seth Kulick | Stephanie Strassel
Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation

From Light to Rich ERE: Annotation of Entities, Relations, and Events
Zhiyi Song | Ann Bies | Stephanie Strassel | Tom Riese | Justin Mott | Joe Ellis | Jonathan Wright | Seth Kulick | Neville Ryant | Xiaoyi Ma
Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation

2014

Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus
Zhiyi Song | Stephanie Strassel | Haejoong Lee | Kevin Walker | Jonathan Wright | Jennifer Garland | Dana Fore | Brian Gainor | Preston Cabe | Thomas Thomas | Brendan Callahan | Ann Sawyer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The DARPA BOLT Program develops systems capable of allowing English speakers to retrieve and understand information from informal foreign language genres. Phase 2 of the program required large volumes of naturally occurring informal text (SMS) and chat messages from individual users in multiple languages to support evaluation of machine translation systems. We describe the design and implementation of a robust collection system capable of capturing both live and archived SMS and chat conversations from willing participants. We also discuss the challenges of recruitment at a time when potential participants have acute and growing concerns about their personal privacy in the realm of digital communication, and we outline the techniques adopted to confront those challenges. Finally, we review the properties of the resulting BOLT Phase 2 Corpus, which comprises over 6.5 million words of naturally occurring chat and SMS in English, Chinese and Egyptian Arabic.

A Comparison of the Events and Relations Across ACE, ERE, TAC-KBP, and FrameNet Annotation Standards
Jacqueline Aguilar | Charley Beller | Paul McNamee | Benjamin Van Durme | Stephanie Strassel | Zhiyi Song | Joe Ellis
Proceedings of the Second Workshop on EVENTS: Definition, Detection, Coreference, and Representation

Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus
Ann Bies | Zhiyi Song | Mohamed Maamouri | Stephen Grimes | Haejoong Lee | Jonathan Wright | Stephanie Strassel | Nizar Habash | Ramy Eskander | Owen Rambow
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

2012

Linguistic Resources for Handwriting Recognition and Translation Evaluation
Zhiyi Song | Safa Ismael | Stephen Grimes | David Doermann | Stephanie Strassel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe efforts to create corpora to support development and evaluation of handwriting recognition and translation technology. LDC has developed a stable pipeline and infrastructure for collecting and annotating handwriting linguistic resources to support the evaluation of MADCAT and OpenHaRT. We collect and annotate handwritten samples of pre-processed Arabic and Chinese data that has already been translated into English and is used in the GALE program. To date, LDC has recruited more than 600 scribes and collected, annotated and released more than 225,000 handwriting images. Most linguistic resources created for these programs will be made available to the larger research community through publication in LDC's catalog. The phase 1 MADCAT corpus is now available.

2010

Enhanced Infrastructure for Creation and Collection of Translation Resources
Zhiyi Song | Stephanie Strassel | Gary Krug | Kazuaki Maeda
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Statistical Machine Translation (MT) systems have achieved impressive results in recent years, due in large part to the increasing availability of parallel text for system training and development. This paper describes recent efforts at Linguistic Data Consortium to create linguistic resources for MT, including corpora, specifications and resource infrastructure. We review LDC's three-pronged approach to parallel text corpus development (acquisition of existing parallel text from known repositories, harvesting and aligning of potential parallel documents from the web, and manual creation of parallel text by professional translators), and describe recent adaptations that have enabled significant expansions in the scope, variety, quality, efficiency and cost-effectiveness of translation resource creation at LDC.

2008

Entity Translation and Alignment in the ACE-07 ET Task
Zhiyi Song | Stephanie Strassel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Entities (people, organizations, locations and the like) have long been a central focus of natural language processing technology development, since entities convey essential content in human languages. For multilingual systems, accurate translation of named entities and their descriptors is critical. LDC produced Entity Translation pilot data to support the ACE ET 2007 Evaluation, and the current paper delves more deeply into the entity alignment issue across languages, combining the automatic alignment techniques developed for ACE-07 with manual alignment. Altogether 84% of the Chinese-English entity mentions and 74% of the Arabic-English entity mentions are perfectly aligned. The results of this investigation offer several important insights. Automatic alignment algorithms predicted that perfect alignment for the ET corpus was likely to be no greater than 55%; perfect alignment on the 15 pilot documents was predicted at 62.5%. Our results suggest the actual perfect alignment rate is substantially higher (82% average, 92% for NAM entities). The careful analysis of alignment errors also suggests strategies for human translation to support the ET task; for instance, translators might be given additional guidance about preferred treatments of name versus nominal translation. These results can also contribute to refined methods of evaluating ET systems.

Linguistic Resources and Evaluation Techniques for Evaluation of Cross-Document Automatic Content Extraction
Stephanie Strassel | Mark Przybocki | Kay Peterson | Zhiyi Song | Kazuaki Maeda
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The NIST Automatic Content Extraction (ACE) Evaluation expands its focus in 2008 to encompass the challenge of cross-document and cross-language global integration and reconciliation of information. While past ACE evaluations have been limited to local (within-document) detection and disambiguation of entities, relations and events, the current evaluation adds global (cross-document and cross-language) entity disambiguation tasks for Arabic and English. This paper presents the 2008 ACE XDoc evaluation task and associated infrastructure. We describe the linguistic resources created by LDC to support the evaluation, focusing on new approaches required for data selection, data processing, annotation task definitions and annotation software, and we conclude with a discussion of the metrics developed by NIST to support the evaluation.