Steven Bird


2021

A Computational Model for Interactive Transcription
William Lane | Mat Bettinson | Steven Bird
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances

Transcribing low resource languages can be challenging in the absence of a good lexicon and trained transcribers. Accordingly, we seek a way to enable interactive transcription whereby the machine amplifies human efforts. This paper presents a data model and a system architecture for interactive transcription, supporting multiple modes of interactivity, increasing the likelihood of finding tasks that engage local participation in language work. The approach also supports other applications which are useful in our context, including spoken document retrieval and language learning.

2020

Enabling Interactive Transcription in an Indigenous Community
Eric Le Ferrand | Steven Bird | Laurent Besacier
Proceedings of the 28th International Conference on Computational Linguistics

We propose a novel transcription workflow which combines spoken term detection and human-in-the-loop, together with a pilot experiment. This work is grounded in an almost zero-resource scenario where only a few terms have so far been identified, involving two endangered languages. We show that in the early stages of transcription, when the available data is insufficient to train a robust ASR system, it is possible to take advantage of the transcription of a small number of isolated words in order to bootstrap the transcription of a speech collection.

Decolonising Speech and Language Technology
Steven Bird
Proceedings of the 28th International Conference on Computational Linguistics

After generations of exploitation, Indigenous people often respond negatively to the idea that their languages are data ready for the taking. By treating Indigenous knowledge as a commodity, speech and language technologists risk disenfranchising local knowledge authorities, reenacting the causes of language endangerment. Scholars in related fields have responded to calls for decolonisation, and we in the speech and language technology community need to follow suit, and explore what this means for our practices that involve Indigenous languages and the communities who own them. This paper reviews colonising discourses in speech and language technology, and suggests new ways of working with Indigenous communities, and seeks to open a discussion of a postcolonial approach to computational methods for supporting language vitality.

Interactive Word Completion for Morphologically Complex Languages
William Lane | Steven Bird
Proceedings of the 28th International Conference on Computational Linguistics

Text input technologies for low-resource languages support literacy, content authoring, and language learning. However, tasks such as word completion pose a challenge for morphologically complex languages thanks to the combinatorial explosion of possible words. We have developed a method for morphologically-aware text input in Kunwinjku, a polysynthetic language of northern Australia. We modify an existing finite state recognizer to map input morph prefixes to morph completions, respecting the morphosyntax and morphophonology of the language. We demonstrate the portability of the method by applying it to Turkish. We show that the space of proximal morph completions is many orders of magnitude smaller than the space of full word completions for Kunwinjku. We provide a visualization of the morph completion space to enable the text completion parameters to be fine-tuned. Finally, we report on a web services deployment, along with a web interface which helps users enter morphologically complex words and which retrieves corresponding entries from the lexicon.

Bootstrapping Techniques for Polysynthetic Morphological Analysis
William Lane | Steven Bird
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Polysynthetic languages have exceptionally large and sparse vocabularies, thanks to the number of morpheme slots and combinations in a word. This complexity, together with a general scarcity of written data, poses a challenge to the development of natural language technologies. To address this challenge, we offer linguistically-informed approaches for bootstrapping a neural morphological analyzer, and demonstrate its application to Kunwinjku, a polysynthetic Australian language. We generate data from a finite state transducer to train an encoder-decoder model. We improve the model by “hallucinating” missing linguistic structure into the training data, and by resampling from a Zipf distribution to simulate a more natural distribution of morphemes. The best model accounts for all instances of reduplication in the test set and achieves an accuracy of 94.7% overall, a 10 percentage point improvement over the FST baseline. This process demonstrates the feasibility of bootstrapping a neural morph analyzer from minimal resources.

Sparse Transcription
Steven Bird
Computational Linguistics, Volume 46, Issue 4 - December 2020

The transcription bottleneck is often cited as a major obstacle for efforts to document the world’s endangered languages and supply them with language technologies. One solution is to extend methods from automatic speech recognition and machine translation, and recruit linguists to provide narrow phonetic transcriptions and sentence-aligned translations. However, I believe that these approaches are not a good fit with the available data and skills, or with long-established practices that are essentially word-based. In seeking a more effective approach, I consider a century of transcription practice and a wide range of computational approaches, before proposing a computational model based on spoken term detection that I call “sparse transcription.” This represents a shift away from current assumptions that we transcribe phones, transcribe fully, and transcribe first. Instead, sparse transcription combines the older practice of word-level transcription with interpretive, iterative, and interactive processes that are amenable to wider participation and that open the way to new methods for processing oral languages.

2019

Towards A Robust Morphological Analyzer for Kunwinjku
William Lane | Steven Bird
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association

Kunwinjku is an indigenous Australian language spoken in northern Australia which exhibits agglutinative and polysynthetic properties. Members of the community have expressed interest in co-developing language applications that promote their values and priorities. Modeling the morphology of the Kunwinjku language is an important step towards accomplishing the community’s goals. Finite State Transducers have long been the go-to method for modeling morphologically rich languages, and in this paper we discuss some of the distinct modeling challenges present in the morphosyntax of verbs in Kunwinjku. We show that a fairly straightforward implementation using standard features of the foma toolkit can account for much of the verb structure. Continuing challenges include robustness in the face of variation and unseen vocabulary, as well as how to handle complex reduplicative processes. Our future work will build off the baseline and challenges presented here.

2018

Evaluating Phonemic Transcription of Low-Resource Tonal Languages for Language Documentation
Oliver Adams | Trevor Cohn | Graham Neubig | Hilaria Cruz | Steven Bird | Alexis Michaud
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Developing a Suite of Mobile Applications for Collaborative Language Documentation
Mat Bettinson | Steven Bird
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

Multilingual Training of Crosslingual Word Embeddings
Long Duong | Hiroshi Kanayama | Tengfei Ma | Steven Bird | Trevor Cohn
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Crosslingual word embeddings represent lexical items from different languages using the same vector space, enabling crosslingual transfer. Most prior work constructs embeddings for a pair of languages, with English on one side. We investigate methods for building high quality crosslingual word embeddings for many languages in a unified vector space. In this way, we can exploit and combine the strengths of many languages. We obtained high performance on bilingual lexicon induction, monolingual similarity and crosslingual document classification tasks.

Cross-Lingual Word Embeddings for Low-Resource Language Modeling
Oliver Adams | Adam Makarucha | Graham Neubig | Steven Bird | Trevor Cohn
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Most languages have no established writing system and minimal written records. However, textual data is essential for natural language processing, and particularly important for training language models to support speech recognition. Even in cases where text data is missing, there are some languages for which bilingual lexicons are available, since creating lexicons is a fundamental task of documentary linguistics. We investigate the use of such lexicons to improve language models when textual training data is limited to as few as a thousand sentences. The method involves learning cross-lingual word embeddings as a preliminary step in training monolingual language models. Results across a number of languages show that language models are improved by this pre-training. Application to Yongning Na, a threatened language, highlights challenges in deploying the approach in real low-resource environments.

2016

Learning Crosslingual Word Embeddings without Bilingual Corpora
Long Duong | Hiroshi Kanayama | Tengfei Ma | Steven Bird | Trevor Cohn
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Learning a Lexicon and Translation Model from Phoneme Lattices
Oliver Adams | Graham Neubig | Trevor Cohn | Steven Bird | Quoc Truong Do | Satoshi Nakamura
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

An Attentional Model for Speech Translation Without Transcription
Long Duong | Antonios Anastasopoulos | David Chiang | Steven Bird | Trevor Cohn
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

A Neural Network Model for Low-Resource Universal Dependency Parsing
Long Duong | Trevor Cohn | Steven Bird | Paul Cook
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Inducing bilingual lexicons from small quantities of sentence-aligned phonemic transcriptions
Oliver Adams | Graham Neubig | Trevor Cohn | Steven Bird
Proceedings of the 12th International Workshop on Spoken Language Translation: Papers

Collective Document Classification with Implicit Inter-document Semantic Relationships
Clint Burford | Steven Bird | Timothy Baldwin
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

Low Resource Dependency Parsing: Cross-lingual Parameter Sharing in a Neural Network Parser
Long Duong | Trevor Cohn | Steven Bird | Paul Cook
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Cross-lingual Transfer for Unsupervised Dependency Parsing Without Parallel Data
Long Duong | Trevor Cohn | Steven Bird | Paul Cook
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

2014

Aikuma: A Mobile App for Collaborative Language Documentation
Steven Bird | Florian R. Hanke | Oliver Adams | Haejoong Lee
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages

Collecting Bilingual Audio in Remote Indigenous Communities
Steven Bird | Lauren Gawne | Katie Gelbart | Isaac McAlister
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

What Can We Get From 1000 Tokens? A Case Study of Multilingual POS Tagging For Resource-Poor Languages
Long Duong | Trevor Cohn | Karin Verspoor | Steven Bird | Paul Cook
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Large-Scale Text Collection for Unwritten Languages
Florian R. Hanke | Steven Bird
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Increasing the Quality and Quantity of Source Language Data for Unsupervised Cross-Lingual POS Tagging
Long Duong | Paul Cook | Steven Bird | Pavel Pecina
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Simpler unsupervised POS tagging with bilingual projections
Long Duong | Paul Cook | Steven Bird | Pavel Pecina
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

Machine Translation for Language Preservation
Steven Bird | David Chiang
Proceedings of COLING 2012: Posters

Fangorn: A System for Querying very large Treebanks
Sumukh Ghodke | Steven Bird
Proceedings of COLING 2012: Demonstration Papers

2011

Towards a Data Model for the Universal Corpus
Steven Abney | Steven Bird
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

Normalising Audio Transcriptions for Unwritten Languages
Adel Foda | Steven Bird
Proceedings of 5th International Joint Conference on Natural Language Processing

A Breadth-First Representation for Tree Matching in Large Scale Forest-Based Translation
Sumukh Ghodke | Steven Bird | Rui Zhang
Proceedings of 5th International Joint Conference on Natural Language Processing

Collective Classification of Congressional Floor-Debate Transcripts
Clinton Burfoot | Steven Bird | Timothy Baldwin
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Fast Query for Large Treebanks
Sumukh Ghodke | Steven Bird
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

The Human Language Project: Building a Universal Corpus of the World’s Languages
Steven Abney | Steven Bird
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

2009

Last Words: Natural Language Processing and Linguistic Fieldwork
Steven Bird
Computational Linguistics, Volume 35, Number 3, September 2009

2008

Defining a Core Body of Knowledge for the Introductory Computational Linguistics Curriculum
Steven Bird
Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics

Multidisciplinary Instruction with the Natural Language Toolkit
Steven Bird | Ewan Klein | Edward Loper | Jason Baldridge
Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics

The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics
Steven Bird | Robert Dale | Bonnie Dorr | Bryan Gibson | Mark Joseph | Min-Yen Kan | Dongwon Lee | Brett Powley | Dragomir Radev | Yee Fan Tan
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics. Its primary purpose is to serve as a reference repository of research results, but we believe that it can also be an object of study and a platform for research in its own right. We describe an enriched and standardized reference corpus derived from the ACL Anthology that can be used for research in scholarly document processing. This corpus, which we call the ACL Anthology Reference Corpus (ACL ARC), brings together the recent activities of a number of research groups around the world. Our goal is to make the corpus widely available, and to encourage other researchers to use it as a standard testbed for experiments in both bibliographic and bibliometric research.

Toward a Global Infrastructure for the Sustainability of Language Resources
Gary Simons | Steven Bird
Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation

2007

Dynamic Path Prediction and Recommendation in a Museum Environment
Karl Grieser | Timothy Baldwin | Steven Bird
Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007)

2006

NLTK: The Natural Language Toolkit
Steven Bird
Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions

Reconsidering Language Identification for Written Language Resources
Baden Hughes | Timothy Baldwin | Steven Bird | Jeremy Nicholson | Andrew MacKinlay
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approaches to written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite this widespread acceptance, a review of previous research in written language identification reveals a number of questions which remain open and ripe for further investigation.

Analysis and Prediction of User Behaviour in a Museum Environment
Karl Grieser | Timothy Baldwin | Steven Bird
Proceedings of the Australasian Language Technology Workshop 2006

2005

Structuring Documents Efficiently
Robert Marshall | Steven Bird | Peter Stuckey
Proceedings of the Australasian Language Technology Workshop 2005

LPath+: A First-Order Complete Language for Linguistic Tree Query
Catherine Lai | Steven Bird
Proceedings of the 19th Pacific Asia Conference on Language, Information and Computation

2004

Representing and Rendering Linguistic Paradigms
David Penton | Steven Bird
Proceedings of the Australasian Language Technology Workshop 2004

Querying and Updating Treebanks: A Critical Survey and Requirements Analysis
Catherine Lai | Steven Bird
Proceedings of the Australasian Language Technology Workshop 2004

NLTK: The Natural Language Toolkit
Steven Bird | Edward Loper
Proceedings of the ACL Interactive Poster and Demonstration Sessions

Securing Interpretability: The Case of Ega Language Documentation
Dafydd Gibbon | Catherine Bow | Steven Bird | Baden Hughes
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Functional Requirements for an Interlinear Text Editor
Baden Hughes | Catherine Bow | Steven Bird
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Management of Metadata in Linguistic Fieldwork: Experience from the ACLA Project
Baden Hughes | David Penton | Steven Bird | Catherine Bow | Gillian Wigglesworth | Patrick McConvell | Jane Simpson
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Talkbank: Building an Open Unified Multimodal Database of Communicative Interaction
Brian MacWhinney | Steven Bird | Christopher Cieri | Craig Martell
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

Encoding and presenting interlinear text using XML technologies
Baden Hughes | Steven Bird | Catherine Bow
Proceedings of the Australasian Language Technology Workshop 2003

Grid-Enabling Natural Language Engineering By Stealth
Baden Hughes | Steven Bird
Proceedings of the HLT-NAACL 2003 Workshop on Software Engineering and Architecture of Language Technology Systems (SEALTS)

2002

The Open Language Archives Community
Steven Bird | Hans Uszkoreit | Gary Simons
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Models and Tools for Collaborative Annotation
Xiaoyi Ma | Haejoong Lee | Steven Bird | Kazuaki Maeda
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

TableTrans, MultiTrans, InterTrans and TreeTrans: Diverse Tools Built on the Annotation Graph Toolkit
Steven Bird | Kazuaki Maeda | Xiaoyi Ma | Haejoong Lee | Beth Randall | Salim Zayat
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Creating Annotation Tools with the Annotation Graph Toolkit
Kazuaki Maeda | Steven Bird | Xiaoyi Ma | Haejoong Lee
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

An integrated framework for treebanks and multilayer annotations
Scott Cotton | Steven Bird
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

NLTK: The Natural Language Toolkit
Edward Loper | Steven Bird
Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics

2001

The OLAC Metadata Set and Controlled Vocabularies
Steven Bird | Gary Simons
Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources

Annotation Graphs and Servers and Multi-Modal Resources: Infrastructure for Interdisciplinary Education, Research and Development
Christopher Cieri | Steven Bird
Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources

Annotation Tools Based on the Annotation Graph API
Steven Bird | Kazuaki Maeda | Xiaoyi Ma | Haejoong Lee
Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources

The Annotation Graph Toolkit: Software Components for Building Linguistic Annotation Tools
Kazuaki Maeda | Steven Bird | Xiaoyi Ma | Haejoong Lee
Proceedings of the First International Conference on Human Language Technology Research

2000

ATLAS: A Flexible and Extensible Architecture for Linguistic Annotation
Steven Bird | David Day | John Garofolo | John Henderson | Christophe Laprun | Mark Liberman
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Transcribing with Annotation Graphs
Edouard Geoffrois | Claude Barras | Steven Bird | Zhibiao Wu
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Towards a Query Language for Annotation Graphs
Steven Bird | Peter Buneman | Wang-Chiew Tan
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Many Uses, Many Annotations for Large Speech Corpora: Switchboard and TDT as Case Studies
David Graff | Steven Bird
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1999

Annotation Graphs as a Framework for Multidimensional Linguistic Data Analysis
Steven Bird | Mark Liberman
Towards Standards and Tools for Discourse Tagging

1997

A Lexical Database Tool for Quantitative Phonological Research
Steven Bird
Computational Phonology: Third Meeting of the ACL Special Interest Group in Computational Phonology

1994

One-Level Phonology: Autosegmental Representations and Rules as Finite Automata
Steven Bird | T. Mark Ellison
Computational Linguistics, Volume 20, Number 1, March 1994

Phonological Analysis in Typed Feature Systems
Steven Bird | Ewan Klein
Computational Linguistics, Volume 20, Number 3, September 1994

Automated Tone Transcription
Steven Bird
Computational Phonology

1992

Finite-State Phonology in HPSG
Steven Bird
COLING 1992 Volume 1: The 14th International Conference on Computational Linguistics

1991

A Logical Approach to Arabic Phonology
Steven Bird | Patrick Blackburn
Fifth Conference of the European Chapter of the Association for Computational Linguistics