Yilun Zhu


2021

Overview of AMALGUM – Large Silver Quality Annotations across English Genres
Luke Gessler | Siyao Peng | Yang Liu | Yilun Zhu | Shabnam Behzad | Amir Zeldes
Proceedings of the Society for Computation in Linguistics 2021

OntoGUM: Evaluating Contextualized SOTA Coreference Resolution on 12 More Genres
Yilun Zhu | Sameer Pradhan | Amir Zeldes
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

SOTA coreference resolution produces increasingly impressive scores on the OntoNotes benchmark. However, the lack of comparable data following the same scheme for more genres makes it difficult to evaluate generalizability to open-domain data. This paper provides a dataset and a comprehensive evaluation showing that the latest neural LM-based end-to-end systems degrade very substantially out of domain. We make publicly available an OntoNotes-like coreference dataset called OntoGUM, converted with deterministic rules, which we evaluate, from GUM, an English corpus covering 12 genres. Thanks to the rich syntactic and discourse annotations in GUM, we are able to create the largest human-annotated coreference corpus following the OntoNotes guidelines, and the first to be evaluated for consistency with the OntoNotes scheme. Out-of-domain evaluation across 12 genres shows nearly 15-20% degradation for both deterministic and deep learning systems, indicating a lack of generalizability or covert overfitting in existing coreference resolution models.

2020

AMALGUM – A Free, Balanced, Multilayer English Web Corpus
Luke Gessler | Siyao Peng | Yang Liu | Yilun Zhu | Shabnam Behzad | Amir Zeldes
Proceedings of the 12th Language Resources and Evaluation Conference

We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources, the corpus is meant to offer a more sizable alternative to smaller, manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a “better than NLP” benchmark and evaluate the accuracy of the resulting resource.

A Corpus of Adpositional Supersenses for Mandarin Chinese
Siyao Peng | Yang Liu | Yilun Zhu | Austin Blodgett | Yushi Zhao | Nathan Schneider
Proceedings of the 12th Language Resources and Evaluation Conference

Adpositions are frequent markers of semantic relations, but they are highly ambiguous and vary significantly from language to language. Moreover, there is a dearth of annotated corpora for investigating the cross-linguistic variation of adposition semantics, or for building multilingual disambiguation systems. This paper presents a Mandarin Chinese corpus in which all adpositions have been semantically annotated; to the best of our knowledge, this is the first Chinese corpus to be broadly annotated with adposition semantics. Our approach adapts a framework that defined a general set of supersenses according to ostensibly language-independent semantic criteria, though its development focused primarily on English prepositions (Schneider et al., 2018). We find that the supersense categories are well-suited to Chinese adpositions despite syntactic differences from English. On a Mandarin translation of The Little Prince, we achieve high inter-annotator agreement and analyze semantic correspondences of adposition tokens in bitext.

2019

GumDrop at the DISRPT2019 Shared Task: A Model Stacking Approach to Discourse Unit Segmentation and Connective Detection
Yue Yu | Yilun Zhu | Yang Liu | Yan Liu | Siyao Peng | Mackenzie Gong | Amir Zeldes
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

In this paper, we present GumDrop, Georgetown University’s entry at the DISRPT 2019 Shared Task on automatic discourse unit segmentation and connective detection. Our approach relies on model stacking: a heterogeneous ensemble of classifiers feeds into a metalearner for each final task. The system encompasses three trainable component stacks: one for sentence splitting, one for discourse unit segmentation, and one for connective detection. The flexibility of each ensemble allows the system to generalize well to datasets of different sizes and with varying levels of homogeneity.