Calandra R. Tate
2008
A Statistical Analysis of Automated MT Evaluation Metrics for Assessments in Task-Based MT Evaluation
Calandra R. Tate
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers
This paper applies nonparametric statistical techniques to Machine Translation (MT) Evaluation using data from a large-scale task-based study. In particular, it studies the relationship between human task performance on an information extraction task with translated documents and well-known automated translation evaluation metric scores for those documents. Findings from a correlation analysis of this connection are presented and contrasted with current strategies for evaluating translations. An extended analysis that involves a novel idea for assessing partial rank correlation in the presence of grouping factors is also discussed. This work exposes the limitations of the descriptive statistics generally used in this area, mainly correlation analysis, when using automated metrics for task-handling assessments.
2006
Task-based Evaluation of Machine Translation (MT) Engines. Measuring How Well People Extract Who, When, Where-Type Elements in MT Output
Clare R. Voss | Calandra R. Tate
Proceedings of the 11th Annual Conference of the European Association for Machine Translation
Task-based MT Evaluation: From Who/When/Where Extraction to Event Understanding
Jamal Laoudi | Calandra R. Tate | Clare R. Voss
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Task-based machine translation (MT) evaluation asks, how well do people perform text-handling tasks given MT output? This method of evaluation yields an extrinsic assessment of an MT engine, in terms of users' task performance on MT output. While this method is time-consuming, its key advantage is that MT users and stakeholders understand how to interpret the assessment results. Prior experiments showed that subjects can extract individual who-, when-, and where-type elements of information from MT output passages that were not especially fluent. This paper presents the results of a pilot study to assess a slightly more complex task: when given such wh-items already identified in an MT output passage, how well can subjects properly select from and place these items into wh-typed slots to complete a sentence-template about the passage's event? The results of the pilot with nearly sixty subjects, while only preliminary, indicate that this task was extremely challenging: given six test templates to complete, half of the subjects had no completely correct templates and 42% had exactly one completely correct template. The provisional interpretation of this pilot study is that event-based template completion defines a task ceiling, against which to evaluate future improvements on MT engines.