Comparison Between ATA Grading Framework Scores and Auto Scores
Evelyn Garland | Carola Berger | Jon Ritzdorf
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track), 2022
The authors of this study compared two types of translation quality scores assigned to the same sets of translation samples: (1) ATA Grading Framework scores assigned by human experts, and (2) automatic scores, including BLEU, TER, and COMET (with and without reference). They further explored the impact of different reference translations on the automatic scores. Key findings include: (1) automatic scores that rely on reference translations depend heavily on which reference is used; (2) referenceless COMET appears promising for evaluating translations of short passages (250-300 English words); and (3) evidence suggests good agreement between the ATA Framework score and some automatic scores within a middle range, but the relationship becomes non-monotonic beyond that range. The study is limited by its small sample size and by its retrospective, exploratory design, which was not intended to test a pre-defined hypothesis.
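For readers unfamiliar with how such automatic scores are typically obtained, the sketch below shows one way to compute reference-based BLEU and TER and both reference-based and referenceless COMET for a small set of passages. The library choices (sacrebleu, unbabel-comet), the COMET checkpoint names, and the toy data are illustrative assumptions only; the paper does not specify which implementations or models the authors used.

# Minimal sketch, assuming sacrebleu >= 2.0 and unbabel-comet >= 2.0.
from sacrebleu.metrics import BLEU, TER
from comet import download_model, load_from_checkpoint

# Hypothetical passage-level data: system outputs, human references, sources.
hypotheses = ["The translated passage produced by the system under evaluation."]
references = ["A human reference translation of the same passage."]
sources = ["The corresponding source-language passage."]

# Reference-based string metrics: BLEU and TER.
bleu = BLEU().corpus_score(hypotheses, [references])
ter = TER().corpus_score(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  TER: {ter.score:.2f}")

# Reference-based COMET (checkpoint name is an illustrative assumption).
comet_ref = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
ref_data = [{"src": s, "mt": h, "ref": r}
            for s, h, r in zip(sources, hypotheses, references)]
ref_out = comet_ref.predict(ref_data, batch_size=8, gpus=0)
print("COMET (with reference):", ref_out.system_score)

# Referenceless (quality-estimation) COMET needs only source and hypothesis.
comet_qe = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
qe_data = [{"src": s, "mt": h} for s, h in zip(sources, hypotheses)]
qe_out = comet_qe.predict(qe_data, batch_size=8, gpus=0)
print("COMET (referenceless):", qe_out.system_score)

Because the reference-based metrics take the reference translation as ground truth, swapping in a different reference changes the scores directly, which is consistent with the study's first finding; the referenceless variant sidesteps that dependence at the cost of relying entirely on the learned quality-estimation model.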