Debbie Elliott


2005

Estimating the Predictive Power of N-gram MT Evaluation Metrics across Language and Text Types
Bogdan Babych | Anthony Hartley | Debbie Elliott
Proceedings of Machine Translation Summit X: Posters

The use of n-gram metrics to evaluate the output of MT systems is widespread. Typically, they are used in system development, where an increase in the score is taken to represent an improvement in the output of the system. However, purchasers of MT systems or services are more concerned to know how well a score predicts the acceptability of the output to a reader-user. Moreover, they usually want to know if these predictions will hold across a range of target languages and text types. We describe an experiment involving human and automated evaluations of four MT systems across two text types and 23 language directions. It establishes that the correlation between human and automated scores is high, but that the predictive power of these scores depends crucially on target language and text type.
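As a rough illustration of what the abstract refers to (not the paper's own method or data), the following Python sketch computes a simple n-gram precision for MT output against a reference and the Pearson correlation between system-level automated and human scores; the function names and sample scores are illustrative assumptions.

# Illustrative sketch: simple n-gram precision plus Pearson correlation
# between system-level automated and human scores (hypothetical values).
from collections import Counter
import math

def ngram_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams that also occur in the reference (clipped)."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    if not cand_ngrams:
        return 0.0
    matched = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return matched / sum(cand_ngrams.values())

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical system-level scores for four MT systems.
automated = [0.41, 0.36, 0.52, 0.47]
human_adequacy = [3.1, 2.8, 3.9, 3.5]
print(ngram_precision("the cat sat on the mat", "the cat is on the mat"))
print(pearson(automated, human_adequacy))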

2004

Extending MT evaluation tools with translation complexity metrics
Bogdan Babych | Debbie Elliott | Anthony Hartley
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Calibrating Resource-light Automatic MT Evaluation: a Cheap Approach to Ranking MT Systems by the Usability of Their Output
Bogdan Babych | Debbie Elliott | Anthony Hartley
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A fluency error categorization scheme to guide automated machine translation evaluation
Debbie Elliott | Anthony Hartley | Eric Atwell
Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Technical Papers

Existing automated MT evaluation methods often require expert human translations. These are produced for every language pair evaluated and, due to this expense, subsequent evaluations tend to rely on the same texts, which do not necessarily reflect real MT use. In contrast, we are designing an automated MT evaluation system, intended for use by post-editors, purchasers and developers, that requires nothing but the raw MT output. Furthermore, our research is based on texts that reflect corporate use of MT. This paper describes our first step in system design: a hierarchical classification scheme of fluency errors in English MT output, to enable us to identify error types and frequencies, and guide the selection of errors for automated detection. We present results from the statistical analysis of 20,000 words of MT output, manually annotated using our classification scheme, and describe correlations between error frequencies and human scores for fluency and adequacy.
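As a rough illustration only, the sketch below tallies annotated fluency errors by category and correlates per-document error rates with human fluency ratings; the error categories, documents and ratings are hypothetical, not the paper's annotation scheme or data.

# Illustrative sketch: tally fluency-error annotations by category and
# relate per-document error rates to human fluency ratings.
from collections import Counter
from statistics import correlation  # Pearson correlation (Python 3.10+)

# Hypothetical per-document annotations: error counts by category plus word counts.
docs = [
    {"words": 250, "errors": {"agreement": 4, "word_order": 2, "untranslated": 1}},
    {"words": 310, "errors": {"agreement": 1, "word_order": 1, "untranslated": 0}},
    {"words": 280, "errors": {"agreement": 6, "word_order": 3, "untranslated": 2}},
]
human_fluency = [2.8, 3.9, 2.1]  # hypothetical 1-5 ratings

# Overall error-type frequencies across the annotated sample.
totals = Counter()
for d in docs:
    totals.update(d["errors"])
print(totals.most_common())

# Errors per 100 words for each document, correlated with fluency ratings.
rates = [100 * sum(d["errors"].values()) / d["words"] for d in docs]
print(correlation(rates, human_fluency))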