Yang Janet Liu

Georgetown University; 刘洋

Also published as: Yang Liu

Other people with similar names: Yang Liu (May refer to several people), Yang Liu (3M Health Information Systems), Yang Liu (University of Helsinki), Yang Liu (National University of Defense Technology), Yang Liu (Edinburgh), Yang Liu (The Chinese University of Hong Kong (Shenzhen)), Yang Liu (刘扬; Ph.D Purdue; ICSI, Dallas, Facebook, Liulishuo, Amazon), Yang Liu (刘洋; ICT, Tsinghua, Beijing Academy of Artificial Intelligence), Yang Liu (Microsoft Cognitive Services Research), Yang Liu (Peking University), Yang Liu (Univ. of Michigan, UC Santa Cruz)


2021

pdf bib
Proceedings of the 2nd Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2021)
Amir Zeldes | Yang Janet Liu | Mikel Iruskieta | Philippe Muller | Chloé Braud | Sonia Badene
Proceedings of the 2nd Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2021)

pdf bib
The DISRPT 2021 Shared Task on Elementary Discourse Unit Segmentation, Connective Detection, and Relation Classification
Amir Zeldes | Yang Janet Liu | Mikel Iruskieta | Philippe Muller | Chloé Braud | Sonia Badene
Proceedings of the 2nd Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2021)

In 2021, we organized the second iteration of a shared task dedicated to the underlying units used in discourse parsing across formalisms: the DISRPT Shared Task (Discourse Relation Parsing and Treebanking). Adding to the 2019 tasks on Elementary Discourse Unit Segmentation and Connective Detection, this iteration of the Shared Task included for the first time a track on discourse relation classification across three formalisms: RST, SDRT, and PDTB. In this paper we review the data included in the Shared Task, which covers nearly 3 million manually annotated tokens from 16 datasets in 11 languages, survey and compare submitted systems and report on system performance on each task for both annotated and plain-tokenized versions of the data.

pdf bib
DisCoDisCo at the DISRPT2021 Shared Task: A System for Discourse Segmentation, Classification, and Connective Detection
Luke Gessler | Shabnam Behzad | Yang Janet Liu | Siyao Peng | Yilun Zhu | Amir Zeldes
Proceedings of the 2nd Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2021)

This paper describes our submission to the DISRPT2021 Shared Task on Discourse Unit Segmentation, Connective Detection, and Relation Classification. Our system, called DisCoDisCo, is a Transformer-based neural classifier which enhances contextualized word embeddings (CWEs) with hand-crafted features, relying on tokenwise sequence tagging for discourse segmentation and connective detection, and a feature-rich, encoder-less sentence pair classifier for relation classification. Our results for the first two tasks outperform SOTA scores from the previous 2019 shared task, and results on relation classification suggest strong performance on the new 2021 benchmark. Ablation tests show that including features beyond CWEs are helpful for both tasks, and a partial evaluation of multiple pretrained Transformer-based language models indicates that models pre-trained on the Next Sentence Prediction (NSP) task are optimal for relation classification.

pdf bib
Overview of AMALGUM – Large Silver Quality Annotations across English Genres
Luke Gessler | Siyao Peng | Yang Liu | Yilun Zhu | Shabnam Behzad | Amir Zeldes
Proceedings of the Society for Computation in Linguistics 2021

2020

pdf bib
AMALGUM – A Free, Balanced, Multilayer English Web Corpus
Luke Gessler | Siyao Peng | Yang Liu | Yilun Zhu | Shabnam Behzad | Amir Zeldes
Proceedings of the 12th Language Resources and Evaluation Conference

We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a “better than NLP” benchmark and evaluate the accuracy of the resulting resource.

pdf bib
A Corpus of Adpositional Supersenses for Mandarin Chinese
Siyao Peng | Yang Liu | Yilun Zhu | Austin Blodgett | Yushi Zhao | Nathan Schneider
Proceedings of the 12th Language Resources and Evaluation Conference

Adpositions are frequent markers of semantic relations, but they are highly ambiguous and vary significantly from language to language. Moreover, there is a dearth of annotated corpora for investigating the cross-linguistic variation of adposition semantics, or for building multilingual disambiguation systems. This paper presents a corpus in which all adpositions have been semantically annotated in Mandarin Chinese; to the best of our knowledge, this is the first Chinese corpus to be broadly annotated with adposition semantics. Our approach adapts a framework that defined a general set of supersenses according to ostensibly language-independent semantic criteria, though its development focused primarily on English prepositions (Schneider et al., 2018). We find that the supersense categories are well-suited to Chinese adpositions despite syntactic differences from English. On a Mandarin translation of The Little Prince, we achieve high inter-annotator agreement and analyze semantic correspondences of adposition tokens in bitext.

2019

pdf bib
A Discourse Signal Annotation System for RST Trees
Luke Gessler | Yang Liu | Amir Zeldes
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

This paper presents a new system for open-ended discourse relation signal annotation in the framework of Rhetorical Structure Theory (RST), implemented on top of an online tool for RST annotation. We discuss existing projects annotating textual signals of discourse relations, which have so far not allowed simultaneously structuring and annotating words signaling hierarchical discourse trees, and demonstrate the design and applications of our interface by extending existing RST annotations in the freely available GUM corpus.

pdf bib
Beyond The Wall Street Journal: Anchoring and Comparing Discourse Signals across Genres
Yang Liu
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

Recent research on discourse relations has found that they are cued not only by discourse markers (DMs) but also by other textual signals and that signaling information is indicative of genres. While several corpora exist with discourse relation signaling information such as the Penn Discourse Treebank (PDTB, Prasad et al. 2008) and the Rhetorical Structure Theory Signalling Corpus (RST-SC, Das and Taboada 2018), they both annotate the Wall Street Journal (WSJ) section of the Penn Treebank (PTB, Marcus et al. 1993), which is limited to the news domain. Thus, this paper adapts the signal identification and anchoring scheme (Liu and Zeldes, 2019) to three more genres, examines the distribution of signaling devices across relations and genres, and provides a taxonomy of indicative signals found in this dataset.

pdf bib
GumDrop at the DISRPT2019 Shared Task: A Model Stacking Approach to Discourse Unit Segmentation and Connective Detection
Yue Yu | Yilun Zhu | Yang Liu | Yan Liu | Siyao Peng | Mackenzie Gong | Amir Zeldes
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

In this paper we present GumDrop, Georgetown University’s entry at the DISRPT 2019 Shared Task on automatic discourse unit segmentation and connective detection. Our approach relies on model stacking, creating a heterogeneous ensemble of classifiers, which feed into a metalearner for each final task. The system encompasses three trainable component stacks: one for sentence splitting, one for discourse unit segmentation and one for connective detection. The flexibility of each ensemble allows the system to generalize well to datasets of different sizes and with varying levels of homogeneity.