Kenneth Church

Also published as: Ken Church, Kenneth W. Church, Kenneth Ward Church


2024

No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages
Youssef Mohamed | Runjia Li | Ibrahim Said Ahmad | Kilichbek Haydarov | Philip Torr | Kenneth Church | Mohamed Elhoseiny
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Research in vision and language has made considerable progress thanks to benchmarks such as COCO. COCO captions focused on unambiguous facts in English; ArtEmis introduced subjective emotions and ArtELingo introduced some multilinguality (Chinese and Arabic). However, we believe there should be more multilinguality. Hence, we present ArtELingo-28, a vision-language benchmark that spans 28 languages and encompasses approximately 200,000 annotations (140 annotations per image). Traditionally, vision research focused on unambiguous class labels, whereas ArtELingo-28 emphasizes diversity of opinions over languages and cultures. The challenge is to build machine learning systems that assign emotional captions to images. Baseline results will be presented for three novel conditions: Zero-Shot, Few-Shot and One-vs-All Zero-Shot. We find that cross-lingual transfer is more successful for culturally-related languages. Data and code will be made publicly available.

Are Generative Language Models Multicultural? A Study on Hausa Culture and Emotions using ChatGPT
Ibrahim Ahmad | Shiran Dudy | Resmi Ramachandranpillai | Kenneth Church
Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP

Large Language Models (LLMs), such as ChatGPT, are widely used to generate content for various purposes and audiences. However, these models may not reflect the cultural and emotional diversity of their users, especially for low-resource languages. In this paper, we investigate how ChatGPT represents Hausa culture and emotions. We compare responses generated by ChatGPT with those provided by native Hausa speakers on 37 culturally relevant questions. We conducted experiments using emotion analysis and two similarity metrics to measure the alignment between human and ChatGPT responses, and also collected human participants' ratings of and feedback on ChatGPT responses. Our results show that ChatGPT has some level of similarity to human responses, but also exhibits gaps and biases in its knowledge and awareness of Hausa culture and emotions. We discuss the implications and limitations of our methodology and analysis and suggest ways to improve the performance and evaluation of LLMs for low-resource languages.

Comparing Edge-based and Node-based Methods on a Citation Prediction Task
Peter Vickers | Kenneth Church
Findings of the Association for Computational Linguistics: EMNLP 2024

Citation Prediction, estimating whether paper a cites paper b, is particularly interesting in a forecasting setting where the model is trained on papers published before time t, and evaluated on papers published after t + h, where h is the forecast horizon. Performance improves with t (larger training sets) and degrades with h (longer forecast horizons). The trade-off between edge-based methods and node-based methods depends on t. Because edges grow faster than nodes, larger training sets favor edge-based methods. We introduce a new forecast-based Citation Prediction benchmark of 3 million papers to quantify these trends. Our benchmark shows that desirable policies for combining edge- and node-based methods depend on h and t. We release our benchmark, evaluation scripts, and embeddings.

On Translating Technical Terminology: A Translation Workflow for Machine-Translated Acronyms
Richard Yue | John Ortega | Kenneth Church
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

The typical workflow for a professional translator to translate a document from its source language (SL) to a target language (TL) is not always focused on what many language models in natural language processing (NLP) do: predict the next word in a series of words. While high-resource languages like English and French are reported to achieve near human parity on common metrics such as BLEU and COMET, we find that an important step is being missed: the translation of technical terms, specifically acronyms. Some publicly available, state-of-the-art machine translation systems like Google Translate can be erroneous when dealing with acronyms, as much as 50% of the time in our findings. This article addresses acronym disambiguation for MT systems by proposing an additional step in the SL-TL (FR-EN) translation workflow: we first offer a new acronym corpus for public consumption and then experiment with a search-based thresholding algorithm that achieves nearly a 10% improvement over Google Translate and OpusMT.

2023

A Research-Based Guide for the Creation and Deployment of a Low-Resource Machine Translation System
John E. Ortega | Kenneth Church
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

The machine translation (MT) field seems to focus heavily on English and other high-resource languages. However, low-resource MT (LRMT) is receiving more attention than in the past. Successful LRMT systems (LRMTS) should make a compelling business case in terms of demand, cost and quality in order to be viable for end users. When used by communities where low-resource languages are spoken, LRMT quality should not only be determined by traditional metrics like BLEU, but should also take into account other factors in order to be inclusive and not risk overall rejection by the community. MT systems based on neural methods tend to perform better with high volumes of training data, but such volumes may be unrealistic, and even harmful, to assume for LRMT. It is clear that for research purposes, the development and creation of LRMTS is necessary. However, in this article, we argue that two main workarounds could be considered by companies that are considering deployment of LRMTS in the wild: human-in-the-loop and sub-domains.

2022

A Gentle Introduction to Deep Nets and Opportunities for the Future
Kenneth Church | Valia Kordoni | Gary Marcus | Ernest Davis | Yanjun Ma | Zeyu Chen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

The first half of this tutorial will make deep nets more accessible to a broader audience, following “Deep Nets for Poets” and “A Gentle Introduction to Fine-Tuning.” We will also introduce GFT (general fine tuning), a little language for fine tuning deep nets with short (one line) programs that are as easy to code as regression in statistics packages such as R using glm (general linear models). Based on the success of these methods on a number of benchmarks, one might come away with the impression that deep nets are all we need. However, we believe the glass is half-full: while there is much that can be done with deep nets, there is always more to do. The second half of this tutorial will discuss some of these opportunities.

ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture
Youssef Mohamed | Mohamed Abdelfattah | Shyma Alhuwaider | Feifan Li | Xiangliang Zhang | Kenneth Church | Mohamed Elhoseiny
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

This paper introduces ArtELingo, a new benchmark and dataset designed to encourage work on diversity across languages and cultures. Following ArtEmis, a collection of 80k artworks from WikiArt with 0.45M emotion labels and English-only captions, ArtELingo adds another 0.79M annotations in Arabic and Chinese, plus 4.8K in Spanish to evaluate “cultural-transfer” performance. 51K artworks have 5 or more annotations in 3 languages. This diversity makes it possible to study similarities and differences across languages and cultures. Further, we investigate captioning tasks, and find diversity improves the performance of baseline models. ArtELingo is publicly available at www.artelingo.org with standard splits and baseline models. We hope our work will help ease future research on multilinguality and culturally-aware AI.

Training on Lexical Resources
Kenneth Church | Xingyu Cai | Yuchen Bian
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We propose using lexical resources (thesaurus, VAD) to fine-tune pretrained deep nets such as BERT and ERNIE. Then at inference time, these nets can be used to distinguish synonyms from antonyms, as well as VAD distances. The inference method can be applied to words as well as texts such as multiword expressions (MWEs), out of vocabulary words (OOVs), morphological variants and more. Code and data are posted on https://github.com/kwchurch/syn_ant.

Data Augmentation for the Post-Stroke Speech Transcription (PSST) Challenge: Sometimes Less Is More
Jiahong Yuan | Xingyu Cai | Kenneth Church
Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference

We employ the method of fine-tuning wav2vec2.0 for recognition of phonemes in aphasic speech. Our effort focuses on data augmentation, by supplementing data from both in-domain and out-of-domain datasets for training. We found that although a modest amount of out-of-domain data may be helpful, the performance of the model degrades significantly when the amount of out-of-domain data is much larger than in-domain data. Our hypothesis is that fine-tuning wav2vec2.0 with a CTC loss not only learns bottom-up acoustic properties but also top-down constraints. Therefore, out-of-domain data augmentation is likely to degrade performance if there is a language model mismatch between “in” and “out” domains. For in-domain audio without ground truth labels, we found that it is beneficial to exclude samples with less confident pseudo labels. Our final model achieves 16.7% PER (phoneme error rate) on the validation set, without using a language model for decoding. The result represents a relative error reduction of 14% over the baseline model trained without data augmentation. Finally, we found that “canonicalized” phonemes are much easier to recognize than manually transcribed phonemes.
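The pseudo-label filtering step mentioned above (keeping only in-domain samples with confident pseudo labels) can be sketched as a simple confidence threshold. This is an illustrative sketch, not the paper's pipeline; the data layout, the `confidence` field (e.g. a mean per-frame posterior of the predicted phonemes), and the threshold value are all hypothetical.

```python
def filter_pseudo_labels(samples, threshold=0.9):
    """Keep only unlabeled samples whose pseudo label the model is
    confident about. 'confidence' is a hypothetical per-sample score
    attached when the model generated the pseudo label."""
    return [s for s in samples if s["confidence"] >= threshold]

# Toy pseudo-labeled utterances (hypothetical data layout).
samples = [
    {"id": "utt1", "confidence": 0.95},
    {"id": "utt2", "confidence": 0.60},  # low confidence: excluded
    {"id": "utt3", "confidence": 0.92},
]
kept = filter_pseudo_labels(samples)
print([s["id"] for s in kept])  # ['utt1', 'utt3']
```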

2021

On Attention Redundancy: A Comprehensive Study
Yuchen Bian | Jiaji Huang | Xingyu Cai | Jiahong Yuan | Kenneth Church
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The multi-layer, multi-head self-attention mechanism is widely applied in modern neural language models. Attention redundancy has been observed among attention heads but has not been deeply studied in the literature. Using the BERT-base model as an example, this paper provides a comprehensive study of attention redundancy, which is helpful for model interpretation and model compression. We analyze attention redundancy with the Five Ws and How. (What) We define and focus the study on redundancy matrices generated from pre-trained and fine-tuned BERT-base models on GLUE datasets. (How) We use both token-based and sentence-based distance functions to measure the redundancy. (Where) Clear and similar redundancy patterns (cluster structure) are observed among attention heads. (When) Redundancy patterns are similar in both the pre-training and fine-tuning phases. (Who) We discover that redundancy patterns are task-agnostic; similar redundancy patterns even exist for randomly generated token sequences. (Why) We also evaluate the influence of pre-training dropout ratios on attention redundancy. Based on the phase-independent and task-agnostic attention redundancy patterns, we propose a simple zero-shot pruning method as a case study. Experiments on fine-tuning GLUE tasks verify its effectiveness. These comprehensive analyses of attention redundancy make model understanding and zero-shot model pruning promising.

Data Collection vs. Knowledge Graph Completion: What is Needed to Improve Coverage?
Kenneth Church | Yuchen Bian
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This survey/position paper discusses ways to improve coverage of resources such as WordNet. Rapp estimated correlations, rho, between corpus statistics and psycholinguistic norms; rho improves with quantity (corpus size) and quality (balance). 1M words is enough for simple estimates (unigram frequencies), but at least 100x more is required for good estimates of word associations and embeddings. Given such estimates, WordNet’s coverage is remarkable. WordNet was developed on SemCor, a small sample (200k words) from the Brown Corpus. Knowledge Graph Completion (KGC) attempts to learn missing links from subsets. But Rapp’s estimates of the sizes required suggest it would be more profitable to collect more data than to infer missing information that is not there.

Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future
Kenneth Church | Mark Liberman | Valia Kordoni
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future

Benchmarking: Past, Present and Future
Kenneth Church | Mark Liberman | Valia Kordoni
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future

Where have we been, and where are we going? It is easier to talk about the past than the future. These days, benchmarks evolve more bottom up (such as papers with code). There used to be more top-down leadership from government (and industry, in the case of systems, with benchmarks such as SPEC). Going forward, there may be more top-down leadership from organizations like MLPerf and/or influencers like David Ferrucci, who was responsible for IBM’s success with Jeopardy, and has recently written a paper suggesting how the community should think about benchmarking for machine comprehension. Tasks such as reading comprehension become even more interesting as we move beyond English. Multilinguality introduces many challenges, and even more opportunities.

2020

Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework
Mingbo Ma | Baigong Zheng | Kaibo Liu | Renjie Zheng | Hairong Liu | Kainan Peng | Kenneth Church | Liang Huang
Findings of the Association for Computational Linguistics: EMNLP 2020

Text-to-speech synthesis (TTS) has witnessed rapid progress in recent years, with neural methods becoming capable of producing audio with high naturalness. However, these efforts still suffer from two types of latencies: (a) the computational latency (synthesizing time), which grows linearly with the sentence length, and (b) the input latency in scenarios where the input text is incrementally available (such as in simultaneous translation, dialog generation, and assistive technologies). To reduce these latencies, we propose a neural incremental TTS approach using the prefix-to-prefix framework from simultaneous translation. We synthesize speech in an online fashion, playing a segment of audio while generating the next, resulting in O(1) rather than O(n) latency. Experiments on English and Chinese TTS show that our approach achieves similar speech naturalness to full-sentence TTS, but with only a constant (1-2 word) latency.
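The prefix-to-prefix scheduling idea in the abstract (emit audio for earlier words while later words are still arriving, instead of waiting for the full sentence) can be sketched as a generator with a fixed lookahead. This is a toy illustration, not the paper's model; the `lookahead` parameter and the `synthesize` stand-in are illustrative assumptions.

```python
def incremental_tts(word_stream, synthesize, lookahead=2):
    """Toy prefix-to-prefix schedule: emit audio for word i as soon as
    i + lookahead words of text have arrived, giving a constant input
    latency (O(1)) rather than waiting for the whole sentence (O(n))."""
    buffer, emitted = [], 0
    for word in word_stream:
        buffer.append(word)
        # Enough right context available? Synthesize the oldest pending word.
        while emitted + lookahead < len(buffer):
            yield synthesize(buffer[emitted])
            emitted += 1
    # End of input: flush the remaining buffered words.
    while emitted < len(buffer):
        yield synthesize(buffer[emitted])
        emitted += 1

# Stand-in "synthesizer" just uppercases each word.
audio = list(incremental_tts("this is a test sentence".split(),
                             synthesize=str.upper))
print(audio)  # ['THIS', 'IS', 'A', 'TEST', 'SENTENCE']
```

With `lookahead=2`, the first audio segment is emitted as soon as the third word arrives, so the delay behind the incoming text stays constant regardless of sentence length.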

Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training
Renjie Zheng | Mingbo Ma | Baigong Zheng | Kaibo Liu | Jiahong Yuan | Kenneth Church | Liang Huang
Findings of the Association for Computational Linguistics: EMNLP 2020

Simultaneous speech-to-speech translation is an extremely challenging but widely useful scenario that aims to generate target-language speech only a few seconds behind the source-language speech. In addition, we have to continuously translate speech consisting of multiple sentences, but all recent solutions merely focus on the single-sentence scenario. As a result, current approaches accumulate more and more latency in later sentences when the speaker talks faster, and introduce unnatural pauses into the translated speech when the speaker talks slower. To overcome these issues, we propose Self-Adaptive Translation, which flexibly adjusts the length of translations to accommodate different source speech rates. At similar levels of translation quality (as measured by BLEU), our method generates more fluent target speech with lower latency than the baseline, in both Zh<->En directions.

Improving Bilingual Lexicon Induction for Low Frequency Words
Jiaji Huang | Xingyu Cai | Kenneth Church
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

This paper designs a Monolingual Lexicon Induction task and observes that two factors accompany the degraded accuracy of bilingual lexicon induction for rare words: first, a diminishing margin between similarities in the low-frequency regime, and second, exacerbated hubness at low frequency. Based on these observations, we propose two methods to address these factors, respectively. The larger issue is hubness; addressing it improves induction accuracy significantly, especially for low-frequency words.

2019

Hubless Nearest Neighbor Search for Bilingual Lexicon Induction
Jiaji Huang | Qiang Qiu | Kenneth Church
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Bilingual Lexicon Induction (BLI) is the task of translating words from corpora in two languages. Recent advances in BLI work by aligning the two word embedding spaces. Following that, a key step is to retrieve the nearest neighbor (NN) in the target space given the source word. However, a phenomenon called hubness often degrades the accuracy of NN. Hubness appears as some data points, called hubs, being extraordinarily close to many of the other data points. Reducing hubness is necessary for retrieval tasks. One successful example is Inverted Softmax (ISF), recently proposed to improve NN. This work proposes a new method, Hubless Nearest Neighbor (HNN), to mitigate hubness. HNN differs from NN by imposing an additional equal-preference assumption. Moreover, the HNN formulation explains why ISF works as well as it does. Empirical results demonstrate that HNN outperforms NN, ISF and other state-of-the-art methods. For reproducibility and follow-ups, we have published all code.
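The hubness phenomenon and the Inverted Softmax correction described above can be illustrated on synthetic embeddings. This sketch is not the paper's code (it does not implement HNN); the random data, dimensionality, and the temperature `beta` are illustrative assumptions.

```python
import numpy as np

def nn_retrieve(src, tgt):
    """Plain nearest-neighbor retrieval by cosine similarity
    (rows of src and tgt are assumed L2-normalized)."""
    return (src @ tgt.T).argmax(axis=1)

def isf_retrieve(src, tgt, beta=10.0):
    """Inverted Softmax: normalize each target's similarity scores over
    all source queries, penalizing hubs that attract many sources."""
    sims = np.exp(beta * (src @ tgt.T))
    sims /= sims.sum(axis=0, keepdims=True)  # softmax over sources, per target
    return sims.argmax(axis=1)

rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 50))
tgt = rng.normal(size=(1000, 50))
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)

# Hubness shows up as a few targets retrieved extraordinarily often.
counts = np.bincount(nn_retrieve(src, tgt), minlength=len(tgt))
counts_isf = np.bincount(isf_retrieve(src, tgt), minlength=len(tgt))
print("max retrievals of one target (NN): ", counts.max())
print("max retrievals of one target (ISF):", counts_isf.max())
```

On high-dimensional data the NN histogram is typically far more skewed than uniform retrieval would predict, which is exactly the degradation ISF (and HNN) aim to correct.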

2016

C2D2E2: Using Call Centers to Motivate the Use of Dialog and Diarization in Entity Extraction
Ken Church | Weizhong Zhu | Jason Pelecanos
Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods

2014

The Case for Empiricism (With and Without Statistics)
Kenneth Church
Proceedings of Frame Semantics in NLP: A Workshop in Honor of Chuck Fillmore (1929-2014)

2011

Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
Shane Bergsma | David Yarowsky | Kenneth Church
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

How Many Multiword Expressions do People Know?
Kenneth Church
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

Proceedings of the IJCNLP 2011 System Demonstrations
Kenneth Church | Yunqing Xia
Proceedings of the IJCNLP 2011 System Demonstrations

A Fast Re-scoring Strategy to Capture Long-Distance Dependencies
Anoop Deoras | Tomáš Mikolov | Kenneth Church
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

NLP on Spoken Documents Without ASR
Mark Dredze | Aren Jansen | Glen Coppersmith | Ken Church
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Using Web-scale N-grams to Improve Base NP Parsing Performance
Emily Pitler | Shane Bergsma | Dekang Lin | Kenneth Church
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

New Tools for Web-Scale N-grams
Dekang Lin | Kenneth Church | Heng Ji | Satoshi Sekine | David Yarowsky | Shane Bergsma | Kailash Patil | Emily Pitler | Rachel Lathbury | Vikram Rao | Kapil Dalwani | Sushant Narsale
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges.
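The abstract's definition of an N-gram corpus (how often each sequence of words up to length N occurs) can be made concrete with a toy counter. This is an illustrative sketch, not the tools the paper describes.

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count every word sequence of length 1..max_n. These counts are
    all an N-gram corpus stores in place of the full text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens, 3)
print(counts[("the",)])        # 2
print(counts[("the", "cat")])  # 1
```

A real web-scale N-gram corpus additionally prunes low-count sequences and, in the enhanced form proposed here, attaches source annotations such as part-of-speech tags to each entry.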

2009

Using Word-Sense Disambiguation Methods to Classify Web Queries by Intent
Emily Pitler | Ken Church
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Repetition and Language Models and Comparable Corpora
Ken Church
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

2007

K-Best Suffix Arrays
Kenneth Church | Bo Thiesson | Robert Ragno
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

A Sketch Algorithm for Estimating Two-Way and Multi-Way Associations
Ping Li | Kenneth W. Church
Computational Linguistics, Volume 33, Number 3, September 2007

Compressing Trigram Language Models With Golomb Coding
Kenneth Church | Ted Hart | Jianfeng Gao
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2005

Last Words: Reviewing the Reviewers
Kenneth Church
Computational Linguistics, Volume 31, Number 4, December 2005

Using Sketches to Estimate Associations
Ping Li | Kenneth W. Church
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

The Wild Thing
Ken Church | Bo Thiesson
Proceedings of the ACL Interactive Poster and Demonstration Sessions

2002

NLP Found Helpful (at least for one Text Categorization Task)
Carl Sable | Kathleen McKeown | Kenneth Church
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

2001

Using Bins to Empirically Estimate Term Weights for Text Categorization
Carl Sable | Kenneth W. Church
Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing

Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus
Mikio Yamamoto | Kenneth W. Church
Computational Linguistics, Volume 27, Number 1, March 2001

2000

Empirical Term Weighting and Expansion Frequency
Kyoji Umemura | Kenneth W. Church
2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

Empirical Estimates of Adaptation: The chance of Two Noriegas is closer to p/2 than p²
Kenneth W. Church
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

1999

What’s Happened Since the First SIGDAT Meeting?
Kenneth Ward Church
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

1998

Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus
Mikio Yamamoto | Kenneth W. Church
Sixth Workshop on Very Large Corpora

1996

Panel: The limits of automation: optimists vs skeptics.
Eduard Hovy | Ken Church | Denis Gachot | Marge Leon | Alan Melby | Sergei Nirenburg | Yorick Wilks
Conference of the Association for Machine Translation in the Americas

1995

Inverse Document Frequency (IDF): A Measure of Deviations from Poisson
Kenneth Church | William Gale
Third Workshop on Very Large Corpora

1994

Fax: An Alternative to SGML
Kenneth W. Church | William A. Gale | Jonathan I. Helfman | David D. Lewis
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

K-vec: A New Approach for Aligning Parallel Texts
Pascale Fung | Kenneth Ward Church
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

Termight: Identifying and Translating Technical Terminology
Ido Dagan | Ken Church
Fourth Conference on Applied Natural Language Processing

Is MT Research Doing Any Good?
Kenneth Church | Bonnie Dorr | Eduard Hovy | Sergei Nirenburg | Bernard Scott | Virginia Teller
Proceedings of the First Conference of the Association for Machine Translation in the Americas

1993

Robust Bilingual Word Alignment for Machine Aided Translation
Ido Dagan | Kenneth Church | William Gale
Very Large Corpora: Academic and Industrial Perspectives

Introduction to the Special Issue on Computational Linguistics Using Large Corpora
Kenneth W. Church | Robert L. Mercer
Computational Linguistics, Volume 19, Number 1, March 1993, Special Issue on Using Large Corpora: I

A Program for Aligning Sentences in Bilingual Corpora
William A. Gale | Kenneth W. Church
Computational Linguistics, Volume 19, Number 1, March 1993, Special Issue on Using Large Corpora: I

Char_align: A Program for Aligning Parallel Texts at the Character Level
Kenneth Ward Church
31st Annual Meeting of the Association for Computational Linguistics

1992

One Sense Per Discourse
William A. Gale | Kenneth W. Church | David Yarowsky
Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992

Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs
William Gale | Kenneth Ward Church | David Yarowsky
30th Annual Meeting of the Association for Computational Linguistics

Using bilingual materials to develop word sense disambiguation methods
William A. Gale | Kenneth W. Church | David Yarowsky
Proceedings of the Fourth Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

1991

Book Reviews: Theory and Practice in Corpus Linguistics
Kenneth Ward Church
Computational Linguistics, Volume 17, Number 1, March 1991

A Program for Aligning Sentences in Bilingual Corpora
William A. Gale | Kenneth W. Church
29th Annual Meeting of the Association for Computational Linguistics

Identifying Word Correspondences in Parallel Texts
William A. Gale | Kenneth W. Church
Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California, February 19-22, 1991

1990

A Spelling Correction Program Based on a Noisy Channel Model
Mark D. Kernighan | Kenneth W. Church | William A. Gale
COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics

Poor Estimates of Context are Worse than None
William A. Gale | Kenneth W. Church
Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27,1990

Word Association Norms, Mutual Information, and Lexicography
Kenneth Ward Church | Patrick Hanks
Computational Linguistics, Volume 16, Number 1, March 1990

1989

Parsing, Word Associations and Typical Predicate-Argument Relations
Kenneth Church | William Gale | Patrick Hanks | Donald Hindle
Proceedings of the First International Workshop on Parsing Technologies

There are a number of collocational constraints in natural languages that ought to play a more important role in natural language parsers. Thus, for example, it is hard for most parsers to take advantage of the fact that wine is typically drunk, produced, and sold, but (probably) not pruned. So too, it is hard for a parser to know which verbs go with which prepositions (e.g., set up) and which nouns fit together to form compound noun phrases (e.g., computer programmer). This paper will attempt to show that many of these types of concerns can be addressed with syntactic methods (symbol pushing), and need not require explicit semantic interpretation. We have found that it is possible to identify many of these interesting co-occurrence relations by computing simple summary statistics over millions of words of text. This paper will summarize a number of experiments carried out by various subsets of the authors over the last few years. The term collocation will be used quite broadly to include constraints on SVO (subject verb object) triples, phrasal verbs, compound noun phrases, and psycholinguistic notions of word association (e.g., doctor/nurse).

Word Association Norms, Mutual Information, and Lexicography
Kenneth Ward Church | Patrick Hanks
27th Annual Meeting of the Association for Computational Linguistics

Parsing, Word Associations and Typical Predicate-Argument Relations
Kenneth Church | William Gale | Patrick Hanks | Donald Hindle
Speech and Natural Language: Proceedings of a Workshop Held at Cape Cod, Massachusetts, October 15-18, 1989

Enhanced Good-Turing and Cat-Cal: Two New Methods for Estimating Probabilities of English Bigrams (abbreviated version)
Kenneth W. Church | William A. Gale
Speech and Natural Language: Proceedings of a Workshop Held at Cape Cod, Massachusetts, October 15-18, 1989

Session 11 Natural Language III
Kenneth Ward Church
Speech and Natural Language: Proceedings of a Workshop Held at Cape Cod, Massachusetts, October 15-18, 1989

1988

A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text
Kenneth Ward Church
Second Conference on Applied Natural Language Processing

Complexity, Two-Level Morphology and Finnish
Kimmo Koskenniemi | Kenneth Ward Church
Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics

1986

Morphological Decomposition and Stress Assignment for Speech Synthesis
Kenneth Church
24th Annual Meeting of the Association for Computational Linguistics

1985

Stress Assignment in Letter to Sound Rules for Speech Synthesis
Kenneth Church
23rd Annual Meeting of the Association for Computational Linguistics

1983

A Finite-State Parser for Use in Speech Recognition
Kenneth W. Church
21st Annual Meeting of the Association for Computational Linguistics

1982

Coping with Syntactic Ambiguity or How to Put the Block in the Box on the Table
Kenneth Church | Ramesh Patil
American Journal of Computational Linguistics, Volume 8, Number 3-4, July-December 1982

1980

On Parsing Strategies and Closure
Kenneth Church
18th Annual Meeting of the Association for Computational Linguistics