Calandra Tate
2006
Combining Evaluation Metrics via Loss Functions
Calandra Tate | Clare Voss
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers
When response metrics for evaluating the utility of machine translation (MT) output on a given task do not yield a single ranking of MT engines, how are MT users to decide which engine best supports their task? When the costs of different types of response errors vary, how are MT users to factor that information into their rankings? What impact do different costs have on response-based rankings? Starting with data from an extraction experiment detailed in Voss and Tate (2006), this paper describes three response-rate metrics developed to quantify different aspects of MT users’ performance in identifying who/when/where-items in MT output. It then presents a loss function analysis over these rates to derive a single customizable metric, applying a range of values to correct responses and a range of costs to different error types. For the given experimental dataset, loss function analyses provided a clearer characterization of the engines’ relative strengths than did comparing the response rates to each other. For one MT engine, varying the costs had no impact: the engine consistently ranked best. By contrast, cost variations did affect the ranking of the other two engines: a rank reversal occurred on who-item extractions when incorrect responses were penalized more than non-responses. Future work with loss analysis, developing operational cost ratios of error rates to correct response rates, will require user studies and expert document-screening personnel to establish baseline values for effective MT engine support on wh-item extraction.
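The abstract does not give the form of the loss function, so the following is only a minimal sketch of one plausible reading: a weighted sum in which error rates add cost and the correct-response rate subtracts value. The class `ResponseRates`, the functions `expected_loss` and `rank_engines`, the engine names, and all numeric rates are hypothetical illustrations, not the paper's definitions or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseRates:
    """Per-engine response rates (hypothetical container, not from the paper)."""
    correct: float
    incorrect: float
    no_response: float


def expected_loss(rates, value_correct=1.0, cost_incorrect=1.0, cost_no_response=1.0):
    """Assumed linear loss: errors add cost, correct responses subtract value."""
    return (cost_incorrect * rates.incorrect
            + cost_no_response * rates.no_response
            - value_correct * rates.correct)


def rank_engines(rates_by_engine, **weights):
    """Return engine names ordered from lowest loss (best) to highest loss."""
    return sorted(
        rates_by_engine,
        key=lambda name: expected_loss(rates_by_engine[name], **weights),
    )


# Made-up rates chosen only to illustrate how a rank reversal can arise.
rates = {
    "engine_a": ResponseRates(correct=0.70, incorrect=0.10, no_response=0.20),
    "engine_b": ResponseRates(correct=0.50, incorrect=0.35, no_response=0.15),
    "engine_c": ResponseRates(correct=0.50, incorrect=0.20, no_response=0.30),
}

# Non-responses penalized more than incorrect responses.
print(rank_engines(rates, cost_incorrect=1.0, cost_no_response=2.0))
# Incorrect responses penalized more than non-responses.
print(rank_engines(rates, cost_incorrect=2.0, cost_no_response=1.0))
```

With these invented rates, one engine stays in first place under both weightings while the other two swap places, which mirrors the kind of behavior the abstract reports without reproducing its actual figures.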
2003
Task-based MT evaluation: tackling software, experimental design, & statistical models.
Calandra Tate | Sooyon Lee | Clare R. Voss
Workshop on Systemizing MT Evaluation
Even with recent, renewed attention to MT evaluation, due in part to n-gram-based metrics (Papineni et al., 2001; Doddington, 2002) and the extensive online catalogue of MT metrics on the ISLE project (Hovy et al., 2001, 2003), few reports involving task-based metrics have surfaced. This paper presents our work on three parts of task-based MT evaluation: (i) software to track and record users' task performance via a browser, run from a desktop computer or remotely over the web, (ii) a factorial experimental design with replicate observations to compare the MT engines, based on the accuracy of users' task responses, and (iii) the use of chi-squared tests and generalized linear models (GLMs) to permit finer-grained data analyses. We report the experimental results of a six-way document categorization task, used to evaluate three Korean-English MT engines. The statistical models of the probabilities of correct responses yield an ordering of the MT engines, with one engine having a statistically significant lead over the other two. Future research will involve testing user performance on linguistically more complex tasks, as well as extending our initial GLMs with the documents' Bleu scores as variables, to test the scores as independent predictors of task results.
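As a rough illustration of the chi-squared step mentioned in the abstract, the sketch below computes a Pearson chi-squared statistic for an engines-by-outcome table of correct and incorrect task responses. The function name `chi_squared_statistic` and the counts are hypothetical; the paper's actual data and model specification are not given here, and the GLM analysis is not attempted.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of engine and outcome.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up counts: rows are three engines, columns are (correct, incorrect).
counts = [
    [80, 20],
    [60, 40],
    [58, 42],
]

stat, dof = chi_squared_statistic(counts)

# 5.991 is the standard 0.05 critical value for 2 degrees of freedom.
CRITICAL_VALUE_DF2_ALPHA_05 = 5.991
print(f"chi-squared = {stat:.2f}, dof = {dof}")
print("engines differ at the 0.05 level:", stat > CRITICAL_VALUE_DF2_ALPHA_05)
```

A test of this kind only indicates that response accuracy is not independent of the engine; identifying which engine leads would still need the finer-grained modeling the abstract describes.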