John S. White

Also published as: John White

2009

This paper looks at granularity issues in machine translation evaluation. We start with work by White (2001), who examined the correlation between intelligibility and fidelity at the document level. His work showed that the two do not correlate well at that level. These dissimilarities lead to our investigation of evaluation granularity. In particular, we revisit the intelligibility and fidelity relationship at the corpus level. We expect the results to support certain assumptions in both evaluations as well as indicate issues germane to future evaluations.

2001

The naming of things and the confusion of tongues: an MT metric
Florence Reeder | Keith Miller | Jennifer Doyon | John White
Workshop on MT Evaluation

This paper reports the results of an experiment in machine translation (MT) evaluation, designed to determine whether easily and rapidly collected metrics can predict the human-generated quality parameters of MT output. In this experiment we evaluated a system’s ability to translate named entities, and compared this measure with previous evaluation scores of fidelity and intelligibility. There are two significant benefits potentially associated with a correlation between traditional MT measures and named entity scores: the ability to automate named entity scoring and thus MT scoring; and insights into the linguistic aspects of task-based uses of MT, as captured in previous studies.

Predicting intelligibility from fidelity in MT evaluation
John White
Workshop on MT Evaluation

Attempts to formulate methods of automatically evaluating machine translation (MT) have generally looked at some attribute of translation and then tried, explicitly or implicitly, to extrapolate the measurement to cover a broader class of attributes. In particular, some studies have focused on measuring fidelity of translation, and inferring intelligibility from that, and others have taken the opposite approach. In this paper we examine the more fundamental question of whether, and to what extent, the one attribute can be predicted by the other. As a starting point we use the 1994 DARPA MT corpus, which has measures for both attributes, and perform a simple comparison of the behavior of each. Two hypotheses about a predictable inference between fidelity and intelligibility are compared with the comparative behavior across all language pairs and all documents in the corpus.

Predicting MT fidelity from noun-compound handling
John White | Monika Forner
Workshop on MT Evaluation

Approaches to the automation of machine translation (MT) evaluation have attempted, or presumed, to connect some rapidly measurable phenomenon with general attributes of the MT output and/or system. In particular, measurements of the fluency of output are often asserted to be predictive of the usefulness of MT output in information-intensive, downstream tasks. The connections between the fluency (“intelligibility”) of translation and its informational adequacy (“fidelity”) are not actually straightforward. This paper discusses a small experiment in isolating a particular contrastive linguistic phenomenon common to both French-English and Spanish-English pairs, and attempts to associate that behavior in machine and human translations with known fidelity properties of those translations. Our results show a definite correlative trend.

2000

Contemplating automatic MT evaluation
John S. White
Proceedings of the Fourth Conference of the Association for Machine Translation in the Americas: Technical Papers

Researchers, developers, translators and information consumers all share the problem that there is no accepted standard for machine translation. The problem is much further confounded by the fact that MT evaluations properly done require a considerable commitment of time and resources, an anachronism in this day of cross-lingual information processing when new MT systems may be developed in weeks instead of years. This paper surveys the needs addressed by several of the classic “types” of MT, and speculates on ways that each of these types might be automated to create relevant, near-instantaneous evaluation of approaches and systems.

Book Reviews: Breadth and Depth of Semantic Lexicons
John S. White
Computational Linguistics, Volume 26, Number 4, December 2000

Determining the Tolerance of Text-handling Tasks for MT Output
John White | Jennifer Doyon | Susan Talbott
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Task Tolerance of MT Output in Integrated Text Processes
John S. White | Jennifer B. Doyon | Susan W. Talbott
ANLP-NAACL 2000 Workshop: Embedded Machine Translation Systems

1999

MT evaluation
Margaret King | Eduard Hovy | Benjamin K. Tsou | John White | Yusoff Zaharin
Proceedings of Machine Translation Summit VII

This panel deals with the general topic of evaluation of machine translation systems. The first contribution sets out some recent work on creating standards for the design of evaluations. The second, by Eduard Hovy, takes up the particular issue of how metrics can be differentiated and systematized. Benjamin K. Tsou suggests that whilst men may evaluate machines, machines may also evaluate men. John S. White focuses on the question of the role of the user in evaluation design, and Yusoff Zaharin points out that circumstances and settings may have a major influence on evaluation design.

Task-based evaluation for machine translation
Jennifer B. Doyon | Kathryn B. Taylor | John S. White
Proceedings of Machine Translation Summit VII

In an effort to reduce the subjectivity, cost, and complexity of evaluation methods for machine translation (MT) and other language technologies, task-based assessment is examined as an alternative to metrics based on human judgments about MT, i.e., the previously applied adequacy, fluency, and informativeness measures. For task-based evaluation strategies to be employed effectively to evaluate language-processing technologies in general, certain key elements must be known. Most importantly, the objectives the technology’s use is expected to accomplish must be known, the objectives must be expressed as tasks that accomplish them, and successful outcomes must then be defined for the tasks. For MT, task-based evaluation is correlated to a scale of tasks, and has as its premise that certain tasks are more forgiving of errors than others. In other words, a poor translation may suffice to determine the general topic of a text, but may not permit accurate identification of participants or the specific event. The ordering of tasks according to their tolerance for errors, as determined by actual task outcomes provided in this paper, is the basis of a scale and repeatable process by which to measure MT systems that has advantages over previous methods.

1998

MT evaluation
John S. White
Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Tutorial Descriptions

Predicting what MT is good for: user judgments and task performance
Kathryn Taylor | John White
Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Technical Papers

As part of the Machine Translation (MT) Proficiency Scale project at the US Federal Intelligent Document Understanding Laboratory (FIDUL), Litton PRC is developing a method to measure MT systems in terms of the tasks for which their output may be successfully used. This paper describes the development of a task inventory, i.e., a comprehensive list of the tasks analysts perform with translated material, and details the capture of subjective user judgments and insights about MT samples. Also described are the user exercises conducted using machine and human translation samples and the assessment of task performance. By analyzing translation errors, user judgments about errors that interfere with task performance, and user task performance results, we isolate source language patterns which produce output problems. These patterns can then be captured in a single diagnostic test set, to be easily applied to any new Japanese-English system to predict the utility of its output.

1997

MT evaluation: old, new, and recycled
John White
Proceedings of Machine Translation Summit VI: Tutorials

The tutorial addresses the issues peculiar to machine translation evaluation, namely the difficulty in determining what constitutes correct translation, and which types of evaluation are the most meaningful for evaluation "consumers." The tutorial is structured around evaluation methods designed for particular purposes: types of MT design, stages in the development lifecycle, and intended end-use of a system that includes MT. It will provide an overview of the issues and classic approaches to MT evaluation. The traditional processes, such as those outlined in the ALPAC report, will be examined for their value historically and in terms of today's environments. The tutorial also provides an insight into the latest evaluation techniques, designed to capture the value of MT systems in the context of current and future automated text handling processes.