Ann Bies


2020

Morphological Segmentation for Low Resource Languages
Justin Mott | Ann Bies | Stephanie Strassel | Jordan Kodner | Caitlin Richter | Hongzhi Xu | Mitchell Marcus
Proceedings of the 12th Language Resources and Evaluation Conference

This paper describes a new morphology resource created by Linguistic Data Consortium and the University of Pennsylvania for the DARPA LORELEI Program. The data consists of approximately 2000 tokens annotated for morphological segmentation in each of 9 low resource languages, along with root information for 7 of the languages. The languages annotated show a broad diversity of typological features. A minimal annotation scheme for segmentation was developed such that it could capture the patterns of a wide range of languages and also be performed reliably by non-linguist annotators. The basic annotation guidelines were designed to be language-independent, but included language-specific morphological paradigms and other specifications. The resulting annotated corpus is designed to support and stimulate the development of unsupervised morphological segmenters and analyzers by providing a gold standard for their evaluation on a more typologically diverse set of languages than has previously been available. By providing root annotation, this corpus is also a step toward supporting research in identifying richer morphological structures than simple morpheme boundaries.

2019

Corpus Building for Low Resource Languages in the DARPA LORELEI Program
Jennifer Tracey | Stephanie Strassel | Ann Bies | Zhiyi Song | Michael Arrigo | Kira Griffitt | Dana Delgado | Dave Graff | Seth Kulick | Justin Mott | Neil Kuster
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

2018

Simple Semantic Annotation and Situation Frames: Two Approaches to Basic Text Understanding in LORELEI
Kira Griffitt | Jennifer Tracey | Ann Bies | Stephanie Strassel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Cross-Document, Cross-Language Event Coreference Annotation Using Event Hoppers
Zhiyi Song | Ann Bies | Justin Mott | Xuansong Li | Stephanie Strassel | Christopher Caruso
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

A Comparison of Event Representations in DEFT
Ann Bies | Zhiyi Song | Jeremy Getman | Joe Ellis | Justin Mott | Stephanie Strassel | Martha Palmer | Teruko Mitamura | Marjorie Freedman | Heng Ji | Tim O’Gorman
Proceedings of the Fourth Workshop on Events

Event Nugget and Event Coreference Annotation
Zhiyi Song | Ann Bies | Stephanie Strassel | Joe Ellis | Teruko Mitamura | Hoa Trang Dang | Yukari Yamakawa | Sue Holm
Proceedings of the Fourth Workshop on Events

Large Multi-lingual, Multi-level and Multi-genre Annotation Corpus
Xuansong Li | Martha Palmer | Nianwen Xue | Lance Ramshaw | Mohamed Maamouri | Ann Bies | Kathryn Conger | Stephen Grimes | Stephanie Strassel
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

High accuracy for automated translation and information retrieval calls for linguistic annotations at various language levels. The plethora of informal internet content sparked the demand for porting state-of-the-art natural language processing (NLP) applications to new social media as well as diverse language adaptation. The effort launched by the BOLT (Broad Operational Language Translation) program at DARPA (Defense Advanced Research Projects Agency) successfully addressed this internet content with enhanced NLP systems. BOLT aims for automated translation and linguistic analysis for informal genres of text and speech in online and in-person communication. As a part of this program, the Linguistic Data Consortium (LDC) developed valuable linguistic resources in support of the training and evaluation of such new technologies. This paper focuses on methodologies, infrastructure, and procedure for developing linguistic annotation at various language levels, including Treebank (TB), word alignment (WA), PropBank (PB), and co-reference (CoRef). Inspired by the OntoNotes approach with adaptations to the tasks to reflect the goals and scope of the BOLT project, this effort has introduced more annotation types of informal and free-style genres in English, Chinese and Egyptian Arabic. The corpus produced is by far the largest multi-lingual, multi-level and multi-genre annotation corpus of informal text and speech.

Rapid Development of Morphological Analyzers for Typologically Diverse Languages
Seth Kulick | Ann Bies
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Low Resource Language research conducted under DARPA’s Broad Operational Language Translation (BOLT) program required the rapid creation of text corpora of typologically diverse languages (Turkish, Hausa, and Uzbek) which were annotated with morphological information, along with other types of annotation. Since the output of morphological analyzers is a significant aid to morphological annotation, we developed a morphological analyzer for each language in order to support the annotation task, and also as a deliverable by itself. Our framework for analyzer creation results in tables similar to those used in the successful SAMA analyzer for Arabic, but with a more abstract linguistic level, from which the tables are derived. A lexicon was developed from available resources for integration with the analyzer, and given the speed of development and uncertain coverage of the lexicon, we assumed that the analyzer would necessarily be lacking in some coverage for the project annotation. Our analyzer framework was therefore focused on rapid implementation of the key structures of the language, together with accepting “wildcard” solutions as possible analyses for a word with an unknown stem, building upon our similar experiences with morphological annotation with Modern Standard Arabic and Egyptian Arabic.

Parallel Chinese-English Entities, Relations and Events Corpora
Justin Mott | Ann Bies | Zhiyi Song | Stephanie Strassel
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper introduces the parallel Chinese-English Entities, Relations and Events (ERE) corpora developed by Linguistic Data Consortium under the DARPA Deep Exploration and Filtering of Text (DEFT) Program. Original Chinese newswire and discussion forum documents are annotated for two versions of the ERE task. The texts are manually translated into English and then annotated for the same ERE tasks on the English translation, resulting in a rich parallel resource that has utility for performers within the DEFT program, for participants in NIST’s Knowledge Base Population evaluations, and for cross-language projection research more generally.

2015

Event Nugget Annotation: Processes and Issues
Teruko Mitamura | Yukari Yamakawa | Susan Holm | Zhiyi Song | Ann Bies | Seth Kulick | Stephanie Strassel
Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation

From Light to Rich ERE: Annotation of Entities, Relations, and Events
Zhiyi Song | Ann Bies | Stephanie Strassel | Tom Riese | Justin Mott | Joe Ellis | Jonathan Wright | Seth Kulick | Neville Ryant | Xiaoyi Ma
Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation

Balancing the Existing and the New in the Context of Annotating Non-Canonical Language
Ann Bies
Proceedings of The 9th Linguistic Annotation Workshop

2014

Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology on Annotation and Tool Development
Mohamed Maamouri | Ann Bies | Seth Kulick | Michael Ciul | Nizar Habash | Ramy Eskander
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the parallel development of an Egyptian Arabic Treebank and a morphological analyzer for Egyptian Arabic (CALIMA). By the very nature of Egyptian Arabic, the data collected is informal, for example Discussion Forum text, which we use for the treebank discussed here. In addition, Egyptian Arabic, like other Arabic dialects, is sufficiently different from Modern Standard Arabic (MSA) that tools and techniques developed for MSA cannot simply be transferred over to Egyptian Arabic. In particular, a morphological analyzer for Egyptian Arabic is needed to mediate between the written text and the segmented, vocalized form used for the syntactic trees. This led to the necessity of a feedback loop between the treebank team and the analyzer team, as improvements in each area were fed to the other. Therefore, by necessity, there needed to be close cooperation between the annotation team and the tool development team, which was to their mutual benefit. Collaboration on this type of challenge, where tools and resources are limited, proved to be remarkably synergistic and opens the way to further fruitful work on Arabic dialects.

Incorporating Alternate Translations into English Translation Treebank
Ann Bies | Justin Mott | Seth Kulick | Jennifer Garland | Colin Warner
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

New annotation guidelines and new processing methods were developed to accommodate English treebank annotation of a parallel English/Chinese corpus of web data that includes alternate English translations (one fluent, one literal) of expressions that are idiomatic in the Chinese source. In previous machine translation programs, alternate translations of idiomatic expressions had been present in untreebanked data only, but due to the high frequency of such expressions in informal genres such as discussion forums, machine translation system developers requested that alternatives be added to the treebanked data as well. In consultation with machine translation researchers, we chose a pragmatic approach of syntactically annotating only the fluent translation, while retaining the alternate literal translation as a segregated node in the tree. Since the literal translation alternates are often incompatible with English syntax, this approach allows us to create fluent trees without losing information. This resource is expected to support machine translation efforts, and the flexibility provided by the alternate translations is an enhancement to the treebank for this purpose.

Parser Evaluation Using Derivation Trees: A Complement to evalb
Seth Kulick | Ann Bies | Justin Mott | Anthony Kroch | Beatrice Santorini | Mark Liberman
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Inter-annotator Agreement for ERE annotation
Seth Kulick | Ann Bies | Justin Mott
Proceedings of the Second Workshop on EVENTS: Definition, Detection, Coreference, and Representation

Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus
Ann Bies | Zhiyi Song | Mohamed Maamouri | Stephen Grimes | Haejoong Lee | Jonathan Wright | Stephanie Strassel | Nizar Habash | Ramy Eskander | Owen Rambow
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

2013

Using Derivation Trees for Informative Treebank Inter-Annotator Agreement Evaluation
Seth Kulick | Ann Bies | Justin Mott | Mohamed Maamouri | Beatrice Santorini | Anthony Kroch
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Automatic Correction and Extension of Morphological Annotations
Ramy Eskander | Nizar Habash | Ann Bies | Seth Kulick | Mohamed Maamouri
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

2012

Further Developments in Treebank Error Detection Using Derivation Trees
Seth Kulick | Ann Bies | Justin Mott
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This work describes how derivation tree fragments based on a variant of Tree Adjoining Grammar (TAG) can be used to check treebank consistency. Annotations of word sequences are compared both for their internal structural consistency and for their external relation to the rest of the tree. We expand on earlier work in this area in three ways. First, we provide a more complete description of the system, showing how a naive use of TAG structures will not work, leading to a necessary refinement. We also provide a more complete account of the processing pipeline, including the grouping together of structurally similar errors and the elimination of duplicates. Second, we include the new experimental external relation check to find an additional class of errors. Third, we broaden the evaluation to include both the internal and external relation checks, and evaluate the system on both an Arabic and an English treebank. The evaluation has been successful enough that the internal check has been integrated into the standard pipeline for current English treebank construction at the Linguistic Data Consortium.

Parallel Aligned Treebanks at LDC: New Challenges Interfacing Existing Infrastructures
Xuansong Li | Stephanie Strassel | Stephen Grimes | Safa Ismael | Mohamed Maamouri | Ann Bies | Nianwen Xue
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Parallel aligned treebanks (PAT) are linguistic corpora annotated with morphological and syntactic structures that are aligned at sentence as well as sub-sentence levels. They are valuable resources for improving machine translation (MT) quality. Recently, there has been an increasing demand for such data, especially for divergent language pairs. The Linguistic Data Consortium (LDC) and its academic partners have been developing Arabic-English and Chinese-English PATs for several years. This paper describes the PAT corpus creation effort for the program GALE (Global Autonomous Language Exploitation) and introduces the potential issues of scaling up this PAT effort for the program BOLT (Broad Operational Language Translation). Based on existing infrastructures and in the light of current annotation process, challenges and approaches, we are exploring new methodologies to address emerging challenges in constructing PATs, including data volume bottlenecks, dialect issues of Arabic languages, and new genre features related to rapidly changing social media. Preliminary experimental results are presented to show the feasibility of the approaches proposed.

Expanding Arabic Treebank to Speech: Results from Broadcast News
Mohamed Maamouri | Ann Bies | Seth Kulick
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Treebanking a large corpus of relatively structured speech transcribed from various Arabic Broadcast News (BN) sources has allowed us to begin to address the many challenges of annotating and parsing a speech corpus in Arabic. The now completed Arabic Treebank BN corpus consists of 432,976 source tokens (517,080 tree tokens) in 120 files of manually transcribed news broadcasts. Because news broadcasts are predominantly scripted, most of the transcribed speech is in Modern Standard Arabic (MSA). As such, the lexical and syntactic structures are very similar to the MSA in written newswire data. However, because this is spoken news, cross-linguistic speech effects such as restarts, fillers, hesitations, and repetitions are common. There is also a certain amount of dialect data present in the BN corpus, from on-the-street interviews and similar informal contexts. In this paper, we describe the finished corpus and focus on some of the necessary additions to our annotation guidelines, along with some of the technical challenges of a treebanked speech corpus and an initial parsing evaluation for this data. This corpus will be available to the community in 2012 as an LDC publication.

Using Supertags and Encoded Annotation Principles for Improved Dependency to Phrase Structure Conversion
Seth Kulick | Ann Bies | Justin Mott
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2011

Using Derivation Trees for Treebank Error Detection
Seth Kulick | Ann Bies | Justin Mott
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

From Speech to Trees: Applying Treebank Annotation to Arabic Broadcast News
Mohamed Maamouri | Ann Bies | Seth Kulick | Wajdi Zaghouani | Dave Graff | Mike Ciul
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Arabic Treebank (ATB) Project at the Linguistic Data Consortium (LDC) has embarked on a large corpus of Broadcast News (BN) transcriptions, and this has led to a number of new challenges for the data processing and annotation procedures that were originally developed for Arabic newswire text (ATB1, ATB2 and ATB3). The corpus requirements currently posed by the DARPA GALE Program, including English translation of Arabic BN transcripts, word-level alignment of Arabic and English data, and creation of a corresponding English Treebank, place significant new constraints on ATB corpus creation, and require careful coordination among a wide assortment of concurrent activities and participants. Nonetheless, in spite of the new challenges posed by BN data, the ATB’s newly improved pipeline and revised annotation guidelines for newswire have proven to be robust enough that very few changes were necessary to account for the new genre of data. This paper presents the points where some adaptation has been necessary, and the overall pipeline as used in the production of BN ATB data.

Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank
Seth Kulick | Ann Bies | Mohamed Maamouri
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Complications arise for standoff annotation when the annotation is not on the source text itself, but on a more abstract representation. This is particularly the case in a language such as Arabic with morphological and orthographic challenges, and we discuss various aspects of these issues in the context of the Arabic Treebank. The Standard Arabic Morphological Analyzer (SAMA) is closely integrated into the annotation workflow, as the basis for the abstraction between the explicit source text and the more abstract token representation. However, this integration with SAMA gives rise to various problems for the annotation workflow and for maintaining the link between the Treebank and SAMA. In this paper we discuss how we have overcome these problems with consistent and more precise categorization of all of the tokens for their relationship with SAMA. We also discuss how we have improved the creation of several distinct alternative forms of the tokens used in the syntactic trees. As a result, the Treebank provides a resource relating the different forms of the same underlying token with varying degrees of vocalization, in terms of how they relate (1) to each other, (2) to the syntactic structure, and (3) to the morphological analyzer.

A TAG-derived Database for Treebank Search and Parser Analysis
Seth Kulick | Ann Bies
Proceedings of the 10th International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+10)

A Treebank Query System Based on an Extracted Tree Grammar
Seth Kulick | Ann Bies
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2008

Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation
Mohamed Maamouri | Seth Kulick | Ann Bies
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The Arabic Treebank (ATB), released by the Linguistic Data Consortium, contains multiple annotation files for each source file, due in part to the role of diacritic inclusion in the annotation process. The data is made available in both “vocalized” and “unvocalized” forms, with and without the diacritic marks, respectively. Much parsing work with the ATB has used the unvocalized form, on the basis that it more closely represents the “real-world” situation. We point out some problems with this usage of the unvocalized data and explain why the unvocalized form does not in fact represent “real-world” data. This is due to some aspects of the treebank annotation that to our knowledge have never before been published.

Enhancing the Arabic Treebank: a Collaborative Effort toward New Annotation Guidelines
Mohamed Maamouri | Ann Bies | Seth Kulick
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The Arabic Treebank team at the Linguistic Data Consortium has significantly revised and enhanced its annotation guidelines and procedure over the past year. Improvements were made to both the morphological and syntactic annotation guidelines, and annotators were trained in the new guidelines, focusing on areas of low inter-annotator agreement. The revised guidelines are now being applied in annotation production, and the combination of the revised guidelines and a period of intensive annotator training has raised inter-annotator agreement f-measure scores already and has also improved parsing results.

A Pilot Arabic Propbank
Martha Palmer | Olga Babko-Malaya | Ann Bies | Mona Diab | Mohamed Maamouri | Aous Mansouri | Wajdi Zaghouani
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we present the details of creating a pilot Arabic proposition bank (Propbank). Propbanks exist for both English and Chinese. However, the morphological and syntactic expression of linguistic phenomena in Arabic yields a very different type of process in creating an Arabic propbank. Hence, we highlight those characteristics of Arabic that make creating a propbank for the language a different challenge compared to the creation of an English Propbank. We believe that many of the lessons learned in dealing with Arabic could generalise to other languages that exhibit equally rich morphology and relatively free word order.

2006

Diacritization: A Challenge to Arabic Treebank Annotation and Parsing
Mohamed Maamouri | Seth Kulick | Ann Bies
Proceedings of the International Conference on the Challenge of Arabic for NLP/MT

Arabic diacritization (referred to sometimes as vocalization or vowelling), defined as the full or partial representation of short vowels, shadda (consonantal length or gemination), tanween (nunation or definiteness), and hamza (the glottal stop and its support letters), is still largely understudied in the current NLP literature. In this paper, the lack of diacritics in standard Arabic texts is presented as a major challenge to most Arabic natural language processing tasks, including parsing. Recent studies (Messaoudi, et al. 2004; Vergyri & Kirchhoff 2004; Zitouni, et al. 2006 and Maamouri, et al. forthcoming) about the place and impact of diacritization in text-based NLP research are presented along with an analysis of the weight of the missing diacritics on Treebank morphological and syntactic analyses and the impact on parser development.

Developing and Using a Pilot Dialectal Arabic Treebank
Mohamed Maamouri | Ann Bies | Tim Buckwalter | Mona Diab | Nizar Habash | Owen Rambow | Dalila Tabessi
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we describe the methodological procedures and issues that emerged from the development of a pilot Levantine Arabic Treebank (LATB) at the Linguistic Data Consortium (LDC) and its use at the Johns Hopkins University (JHU) Center for Language and Speech Processing workshop on Parsing Arabic Dialects (PAD). This pilot, consisting of morphological and syntactic annotation of approximately 26,000 words of Levantine Arabic conversational telephone speech, was developed under severe time constraints; hence the LDC team drew on their experience in treebanking Modern Standard Arabic (MSA) text. The resulting Levantine dialect treebanked corpus was used by the PAD team to develop and evaluate parsers for Levantine dialect texts. The parsers were trained on MSA resources and adapted using dialect-MSA lexical resources (some developed especially for this task) and existing linguistic knowledge about syntactic differences between MSA and dialect. The use of the LATB for development and evaluation of syntactic parsers allowed the PAD team to provide feedback to the LDC treebank developers. In this paper, we describe the creation of resources for this corpus, as well as transformations on the corpus to eliminate speech effects and lessen the gap between our pre-existing MSA resources and the new dialectal corpus.

Linguistic Resources for Speech Parsing
Ann Bies | Stephanie Strassel | Haejoong Lee | Kazuaki Maeda | Seth Kulick | Yang Liu | Mary Harper | Matthew Lease
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We report on the success of a two-pass approach to annotating metadata, speech effects and syntactic structure in English conversational speech: separately annotating transcribed speech for structural metadata, or structural events (fillers, speech repairs (or edit disfluencies), and SUs, or syntactic/semantic units), and for syntactic structure (treebanking constituent structure and shallow argument structure). The two annotations were then combined into a single representation. Certain alignment issues between the two types of annotation led to the discovery and correction of annotation errors in each, resulting in a more accurate and useful resource. The development of this corpus was motivated by the need to have both metadata and syntactic structure annotated in order to support synergistic work on speech parsing and structural event detection. Automatic detection of these speech phenomena would simultaneously improve parsing accuracy and provide a mechanism for cleaning up transcriptions for downstream text processing. Similarly, constraints imposed by text processing systems such as parsers can be used to help improve identification of disfluencies and sentence boundaries. This paper reports on our efforts to develop a linguistic resource providing both spoken metadata and syntactic structure information, and describes the resulting corpus of English conversational speech.

Issues in Synchronizing the English Treebank and PropBank
Olga Babko-Malaya | Ann Bies | Ann Taylor | Szuting Yi | Martha Palmer | Mitch Marcus | Seth Kulick | Libin Shen
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006

2005

Parallel Entity and Treebank Annotation
Ann Bies | Seth Kulick | Mark Mandel
Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky

2004

Developing an Arabic Treebank: Methods, Guidelines, Procedures, and Tools
Mohamed Maamouri | Ann Bies
Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages

Integrated Annotation for Biomedical Information Extraction
Seth Kulick | Ann Bies | Mark Liberman | Mark Mandel | Ryan McDonald | Martha Palmer | Andrew Schein | Lyle Ungar | Scott Winters | Pete White
HLT-NAACL 2004 Workshop: Linking Biological Literature, Ontologies and Databases

1994

The Penn Treebank: Annotating Predicate Argument Structure
Mitchell Marcus | Grace Kim | Mary Ann Marcinkiewicz | Robert MacIntyre | Ann Bies | Mark Ferguson | Karen Katz | Britta Schasberger
Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994