Amir Zeldes


2021

pdf bib
A Balanced and Broadly Targeted Computational Linguistics Curriculum
Emma Manning | Nathan Schneider | Amir Zeldes
Proceedings of the Fifth Workshop on Teaching NLP

This paper describes the primarily-graduate computational linguistics and NLP curriculum at Georgetown University, a U.S. university that has seen significant growth in these areas in recent years. We reflect on the principles behind our curriculum choices, including recognizing the various academic backgrounds and goals of our students; teaching a variety of skills with an emphasis on working directly with data; encouraging collaboration and interdisciplinary work; and including languages beyond English. We reflect on challenges we have encountered, such as the difficulty of teaching programming skills alongside NLP fundamentals, and discuss areas for future growth.

pdf bib
Overview of AMALGUM – Large Silver Quality Annotations across English Genres
Luke Gessler | Siyao Peng | Yang Liu | Yilun Zhu | Shabnam Behzad | Amir Zeldes
Proceedings of the Society for Computation in Linguistics 2021

pdf bib
OntoGUM: Evaluating Contextualized SOTA Coreference Resolution on 12 More Genres
Yilun Zhu | Sameer Pradhan | Amir Zeldes
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

SOTA coreference resolution produces increasingly impressive scores on the OntoNotes benchmark. However lack of comparable data following the same scheme for more genres makes it difficult to evaluate generalizability to open domain data. This paper provides a dataset and comprehensive evaluation showing that the latest neural LM based end-to-end systems degrade very substantially out of domain. We make an OntoNotes-like coreference dataset called OntoGUM publicly available, converted from GUM, an English corpus covering 12 genres, using deterministic rules, which we evaluate. Thanks to the rich syntactic and discourse annotations in GUM, we are able to create the largest human-annotated coreference corpus following the OntoNotes guidelines, and the first to be evaluated for consistency with the OntoNotes scheme. Out-of-domain evaluation across 12 genres shows nearly 15-20% degradation for both deterministic and deep learning systems, indicating a lack of generalizability or covert overfitting in existing coreference resolution models.

2020

pdf bib
Treebanking User-Generated Content: A Proposal for a Unified Representation in Universal Dependencies
Manuela Sanguinetti | Cristina Bosco | Lauren Cassidy | Özlem Çetinoğlu | Alessandra Teresa Cignarella | Teresa Lynn | Ines Rehbein | Josef Ruppenhofer | Djamé Seddah | Amir Zeldes
Proceedings of the 12th Language Resources and Evaluation Conference

The paper presents a discussion on the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.

pdf bib
AMALGUM – A Free, Balanced, Multilayer English Web Corpus
Luke Gessler | Siyao Peng | Yang Liu | Yilun Zhu | Shabnam Behzad | Amir Zeldes
Proceedings of the 12th Language Resources and Evaluation Conference

We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a “better than NLP” benchmark and evaluate the accuracy of the resulting resource.

pdf bib
A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging
Shabnam Behzad | Amir Zeldes
Proceedings of the 12th Web as Corpus Workshop

Part of speech tagging is a fundamental NLP task often regarded as solved for high-resource languages such as English. Current state-of-the-art models have achieved high accuracy, especially on the news domain. However, when these models are applied to other corpora with different genres, and especially user-generated data from the Web, we see substantial drops in performance. In this work, we study how a state-of-the-art tagging model trained on different genres performs on Web content from unfiltered Reddit forum discussions. We report the results when training on different splits of the data, tested on Reddit. Our results show that even small amounts of in-domain data can outperform the contribution of data an order of magnitude larger coming from other Web domains. To make progress on out-of-domain tagging, we also evaluate an ensemble approach using multiple single-genre taggers as input features to a meta-classifier. We present state of the art performance on tagging Reddit data, as well as error analysis of the results of these models, and offer a typology of the most common error types among them, broken down by training corpus.

pdf bib
Proceedings of the 14th Linguistic Annotation Workshop
Stefanie Dipper | Amir Zeldes
Proceedings of the 14th Linguistic Annotation Workshop

pdf bib
Exhaustive Entity Recognition for Coptic: Challenges and Solutions
Amir Zeldes | Lance Martin | Sichang Tu
Proceedings of the The 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Entity recognition provides semantic access to ancient materials in the Digital Humanities: it exposes people and places of interest in texts that cannot be read exhaustively, facilitates linking resources and can provide a window into text contents, even for texts with no translations. In this paper we present entity recognition for Coptic, the language of Hellenistic era Egypt. We evaluate NLP approaches to the task and lay out difficulties in applying them to a low-resource, morphologically complex language. We present solutions for named and non-named nested entity recognition and semi-automatic entity linking to Wikipedia, relying on robust dependency parsing, feature-based CRF models, and hand-crafted knowledge base resources, enabling high accuracy NER with orders of magnitude less data than those used for high resource languages. The results suggest avenues for research on other languages in similar settings.

2019

pdf bib
The Making of Coptic Wordnet
Laura Slaughter | Luis Morgado Da Costa | So Miyagawa | Marco Büchler | Amir Zeldes | Heike Behlmer
Proceedings of the 10th Global Wordnet Conference

With the increasing availability of wordnets for ancient languages, such as Ancient Greek and Latin, gaps remain in the coverage of less studied languages of antiquity. This paper reports on the construction and evaluation of a new wordnet for Coptic, the language of Late Roman, Byzantine and Early Islamic Egypt in the first millenium CE. We present our approach to constructing the wordnet which uses multilingual Coptic dictionaries and wordnets for five different languages. We further discuss the results of this effort and outline our on-going/future work.

pdf bib
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019
Amir Zeldes | Debopam Das | Erick Maziero Galani | Juliano Desiderato Antonio | Mikel Iruskieta
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

pdf bib
Introduction to Discourse Relation Parsing and Treebanking (DISRPT): 7th Workshop on Rhetorical Structure Theory and Related Formalisms
Amir Zeldes | Debopam Das | Erick Galani Maziero | Juliano Antonio | Mikel Iruskieta
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

This overview summarizes the main contributions of the accepted papers at the 2019 workshop on Discourse Relation Parsing and Treebanking (DISRPT 2019). Co-located with NAACL 2019 in Minneapolis, the workshop’s aim was to bring together researchers working on corpus-based and computational approaches to discourse relations. In addition to an invited talk, eighteen papers outlined below were presented, four of which were submitted as part of a shared task on elementary discourse unit segmentation and connective detection.

pdf bib
A Discourse Signal Annotation System for RST Trees
Luke Gessler | Yang Liu | Amir Zeldes
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

This paper presents a new system for open-ended discourse relation signal annotation in the framework of Rhetorical Structure Theory (RST), implemented on top of an online tool for RST annotation. We discuss existing projects annotating textual signals of discourse relations, which have so far not allowed simultaneously structuring and annotating words signaling hierarchical discourse trees, and demonstrate the design and applications of our interface by extending existing RST annotations in the freely available GUM corpus.

pdf bib
The DISRPT 2019 Shared Task on Elementary Discourse Unit Segmentation and Connective Detection
Amir Zeldes | Debopam Das | Erick Galani Maziero | Juliano Antonio | Mikel Iruskieta
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

In 2019, we organized the first iteration of a shared task dedicated to the underlying units used in discourse parsing across formalisms: the DISRPT Shared Task on Elementary Discourse Unit Segmentation and Connective Detection. In this paper we review the data included in the task, which cover 2.6 million manually annotated tokens from 15 datasets in 10 languages, survey and compare submitted systems and report on system performance on each task for both annotated and plain-tokenized versions of the data.

pdf bib
GumDrop at the DISRPT2019 Shared Task: A Model Stacking Approach to Discourse Unit Segmentation and Connective Detection
Yue Yu | Yilun Zhu | Yang Liu | Yan Liu | Siyao Peng | Mackenzie Gong | Amir Zeldes
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

In this paper we present GumDrop, Georgetown University’s entry at the DISRPT 2019 Shared Task on automatic discourse unit segmentation and connective detection. Our approach relies on model stacking, creating a heterogeneous ensemble of classifiers, which feed into a metalearner for each final task. The system encompasses three trainable component stacks: one for sentence splitting, one for discourse unit segmentation and one for connective detection. The flexibility of each ensemble allows the system to generalize well to datasets of different sizes and with varying levels of homogeneity.

2018

pdf bib
A Deeper Look into Dependency-Based Word Embeddings
Sean MacAvaney | Amir Zeldes
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

We investigate the effect of various dependency-based word embeddings on distinguishing between functional and domain similarity, word similarity rankings, and two downstream tasks in English. Variations include word embeddings trained using context windows from Stanford and Universal dependencies at several levels of enhancement (ranging from unlabeled, to Enhanced++ dependencies). Results are compared to basic linear contexts and evaluated on several datasets. We found that embeddings trained with Universal and Stanford dependency contexts excel at different tasks, and that enhanced dependencies often improve performance.

pdf bib
A Predictive Model for Notional Anaphora in English
Amir Zeldes
Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference

Notional anaphors are pronouns which disagree with their antecedents’ grammatical categories for notional reasons, such as plural to singular agreement in: “the government ... they”. Since such cases are rare and conflict with evidence from strictly agreeing cases (“the government ... it”), they present a substantial challenge to both coreference resolution and referring expression generation. Using the OntoNotes corpus, this paper takes an ensemble approach to predicting English notional anaphora in context on the basis of the largest empirical data to date. In addition to state of the art prediction accuracy, the results suggest that theoretical approaches positing a plural construal at the antecedent’s utterance are insufficient, and that circumstances at the anaphor’s utterance location, as well as global factors such as genre, have a strong effect on the choice of referring expression.

pdf bib
A Linked Coptic Dictionary Online
Frank Feder | Maxim Kupreyev | Emma Manning | Caroline T. Schroeder | Amir Zeldes
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We describe a new project publishing a freely available online dictionary for Coptic. The dictionary encompasses comprehensive cross-referencing mechanisms, including linking entries to an online scanned edition of Crum’s Coptic Dictionary, internal cross-references and etymological information, translated searchable definitions in English, French and German, and linked corpus data which provides frequencies and corpus look-up for headwords and multiword expressions. Headwords are available for linking in external projects using a REST API. We describe the challenges in encoding our dictionary using TEI XML and implementing linking mechanisms to construct a Web interface querying frequency information, which draw on NLP tools to recognize inflected forms in context. We evaluate our dictionary’s coverage using digital corpora of Coptic available online.

pdf bib
All Roads Lead to UD: Converting Stanford and Penn Parses to English Universal Dependencies with Multilayer Annotations
Siyao Peng | Amir Zeldes
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

We describe and evaluate different approaches to the conversion of gold standard corpus data from Stanford Typed Dependencies (SD) and Penn-style constituent trees to the latest English Universal Dependencies representation (UD 2.2). Our results indicate that pure SD to UD conversion is highly accurate across multiple genres, resulting in around 1.5% errors, but can be improved further to fewer than 0.5% errors given access to annotations beyond the pure syntax tree, such as entity types and coreference resolution, which are necessary for correct generation of several UD relations. We show that constituent-based conversion using CoreNLP (with automatic NER) performs substantially worse in all genres, including when using gold constituent trees, primarily due to underspecification of phrasal grammatical functions.

pdf bib
A Characterwise Windowed Approach to Hebrew Morphological Segmentation
Amir Zeldes
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper presents a novel approach to the segmentation of orthographic word forms in contemporary Hebrew, focusing purely on splitting without carrying out morphological analysis or disambiguation. Casting the analysis task as character-wise binary classification and using adjacent character and word-based lexicon-lookup features, this approach achieves over 98% accuracy on the benchmark SPMRL shared task data for Hebrew, and 97% accuracy on a new out of domain Wikipedia dataset, an improvement of ≈4% and 5% over previous state of the art performance.

pdf bib
The Coptic Universal Dependency Treebank
Amir Zeldes | Mitchell Abrams
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

This paper presents the Coptic Universal Dependency Treebank, the first dependency treebank within the Egyptian subfamily of the Afro-Asiatic languages. We discuss the composition of the corpus, challenges in adapting the UD annotation scheme to existing conventions for annotating Coptic, and evaluate inter-annotator agreement on UD annotation for the language. Some specific constructions are taken as a starting point for discussing several more general UD annotation guidelines, in particular for appositions, ambiguous passivization, incorporation and object-doubling.

2017

pdf bib
A Distributional View of Discourse Encapsulation: Multifactorial Prediction of Coreference Density in RST
Amir Zeldes
Proceedings of the 6th Workshop on Recent Advances in RST and Related Formalisms

2016

pdf bib
When Annotation Schemes Change Rules Help: A Configurable Approach to Coreference Resolution beyond OntoNotes
Amir Zeldes | Shuo Zhang
Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)

pdf bib
Different Flavors of GUM: Evaluating Genre and Sentence Type Effects on Multilayer Corpus Annotation Quality
Amir Zeldes | Dan Simonson
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

pdf bib
An NLP Pipeline for Coptic
Amir Zeldes | Caroline T. Schroeder
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib
rstWeb - A Browser-based Annotation Interface for Rhetorical Structure Theory and Discourse Relations
Amir Zeldes
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

2009

pdf bib
Information Structure in African Languages: Corpora and Tools
Christian Chiarcos | Ines Fiedler | Mira Grubic | Andreas Haida | Katharina Hartmann | Julia Ritz | Anne Schwarz | Amir Zeldes | Malte Zimmermann
Proceedings of the First Workshop on Language Technologies for African Languages

pdf bib
Quantifying Constructional Productivity with Unseen Slot Members
Amir Zeldes
Proceedings of the Workshop on Computational Approaches to Linguistic Creativity