Date of release: Fri Jun 06, 2008
Version: mt08_official_release_v0
The NIST 2008 Machine Translation Evaluation (MT-08) is part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations in order to support machine translation (MT) research and help advance the state of the art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT-08 evaluation plan.
Disclaimer

These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government.

Certain commercial equipment, instruments, software, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor is it intended to imply that the equipment, instruments, software, or materials are necessarily the best available for the purpose.
There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.
The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems.
For these reasons, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best for a particular application.
The MT-08 evaluation consisted of four tasks. Each task required a system to perform translation from a given source language into the target language. The source and target language pairs that made up the four MT-08 tasks were:
· Arabic to English
· Chinese to English
· Urdu to English
· English to Chinese
MT research and development requires language data resources. System performance is strongly affected by the type and amount of resources used. Therefore, two different resource categories were defined as conditions of evaluation. The categories differed solely in the amount of data that was available for use in the training and development of the core MT engine. The evaluation conditions were called "Constrained Data Track" and "Unconstrained Data Track".
· Note: For the Urdu task, the constrained data condition required that core system development use only the provided resource DVD. No other data was allowed for the primary condition of interest.
Submissions that do not fall into the categories described above are not reported in this release.
Source Data
MT-08 evaluation data sets contained documents drawn from newswire text and web-based newsgroups. The source documents were encoded in UTF-8.
The test data was selected from a pool of data collected by the LDC during July 2007. The selection process sought a variety of sources (see below) and publication dates while meeting the target test set size.
Source Language | Newswire Sources | Newsgroup / Web Sources
Arabic | AAW, AFP, AHR, ASB, HYT, NHR, QDS, XIN | various web forums
Chinese | AFP, CNS, GMW, PDA, PLA, XIN | various web forums
Urdu | BBC, JNG, PTB, VOA | various web forums
English | AFP, APW, LTW, NYT, XIN | n/a
Reference Data
MT-08 reference data consists of four independently generated, high-quality translations produced by professional translation companies. Each translation agency was required to have native speakers of the source and target languages working on the translations.
Current versus Progress Data Division
For those willing to abide by the strict processing rules, a "PROGRESS" test set was distributed to serve as a blind benchmark across several evaluations. Teams that processed this data submitted their translations to NIST and deleted all related files (source, translations, and any other derivative files). The scores on the progress test set were reported to the participants but are not reported here. Future OpenMT evaluations will report PROGRESS test set scores from year to year.
Machine translation quality was measured automatically using an N-gram co-occurrence statistic developed by IBM and referred to as BLEU. BLEU measures translation accuracy according to the N-grams, or sequences of N words, that the system translation shares with one or more high-quality reference translations: the more co-occurrences, the better the score. BLEU is an accuracy metric, ranging from 0 to 1, with 1 being the best possible score. A detailed description of BLEU can be found in Papineni, Roukos, Ward, and Zhu (2001), "BLEU: a Method for Automatic Evaluation of Machine Translation" (keyword = RC22176).
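The core of the computation can be sketched in Python as follows. This is a minimal illustration of the clipped N-gram matching and brevity penalty described above, operating on pre-tokenized sentences; it is not the official mteval scoring code, which handles tokenization and other details differently.

```python
# A minimal sketch of BLEU-4: clipped N-gram precisions (N = 1..4)
# combined by a geometric mean, times a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, references):
    """Sentence-level, unsmoothed BLEU-4 for a tokenized candidate
    against one or more tokenized reference translations."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        # Clip each n-gram count by its maximum count in any one reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if matched == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(matched / sum(cand_counts.values())))
    # Brevity penalty against the reference length closest to the candidate's.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / 4)

# Example: a verbatim five-word hypothesis scores 1.0.
# bleu4("the cat sat on mats".split(), ["the cat sat on mats".split()])
```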
Although BLEU was the official metric for MT-08, measuring translation quality is an ongoing research topic in the MT community. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.
Automatic metrics reported: BLEU-4 (primary), IBM BLEU, NIST, TER, and METEOR (see the score tables and descriptions below).
The table below lists the organizations that entered as participants in MT-08.
Site ID | Organization | Location
apptek | Applications Technology Inc. | USA
auc | The American University in Cairo | Egypt
basistech | Basis Technology | USA
bbn | BBN Technologies | USA
bjut-mtg | Beijing University of Technology | China
cas-ia | Chinese Academy of Sciences, Institute of Automation | China
cas-ict | Chinese Academy of Sciences, Institute of Computing Technology | China
cas-is | Chinese Academy of Sciences, Institute of Software | China
cmu-ebmt | Carnegie Mellon | USA
cmu-smt | Carnegie Mellon, interACT | USA
cmu-xfer | Carnegie Mellon | USA
columbia | Columbia University | USA
cued | University of Cambridge, Dept. of Engineering | UK
edinburgh | University of Edinburgh | UK
google | Google | USA
hit-ir | Harbin Institute of Technology, Information Retrieval Laboratory | China
hkust | Hong Kong University of Science and Technology | China
ibm | IBM | USA
lium | Université du Maine (Le Mans), Laboratoire d'Informatique | France
msra | Microsoft Research Asia | China
nrc | National Research Council | Canada
nthu | National Tsing Hua University | Taiwan
ntt | NTT Communication Science Laboratories | Japan
qmul | Queen Mary University of London | UK
sakhr | Sakhr Software Co. | Egypt
sri | SRI International | USA
stanford | Stanford University | USA
uka | Universitaet Karlsruhe | Germany
umd | University of Maryland | USA
upc-lsi | Universitat Politècnica de Catalunya, LSI | Spain
upc-talp | Universitat Politècnica de Catalunya, TALP | Spain
xmu-iai | Xiamen University, Institute of Artificial Intelligence | China

Collaborations
ibm_umd | IBM / University of Maryland | USA
jhu_umd | Johns Hopkins University / University of Maryland | USA
isi_lw | USC-ISI / Language Weaver | USA
msr_msra | Microsoft Research / Microsoft Research Asia | USA / China
msr_nrc_sri | Microsoft Research / National Research Council Canada / SRI International | USA / Canada
nict_atr | NICT / ATR | Japan
nrc_systran | National Research Council Canada / SYSTRAN | Canada / France
Each site/team could submit up to four systems for evaluation, with one system marked as its primary system. The primary system indicated the site/team's best effort. This official public version of the results reports results only for the primary systems. Note that these charts show an absolute ranking according to the primary metric.
Systems that fail to meet the requirements for either track will not be reported here.
"significance groups*" shows areas where the wilcoxon signed rank test was not able to differenciate system performance at the 95% confidence level. That is, if two systems belong to the same significance group (by sharing the same number), then they are determined to be comparble, based n BLEU-4 scoring.
Late and corrected submissions will be linked here.
Entire Current Evaluation Test Set: Arabic to English
significance | system | BLEU-4* | IBM BLEU | NIST | TER | METEOR
Constrained Training Track
1 | google_arabic_constrained_primary | 0.4557 | 0.4526 | 10.8821 | 48.535 | 0.6857
2 | IBM-UMD_arabic_constrained_primary | 0.4525 | 0.4300 | 10.6183 | 48.436 | 0.6539
3 | IBM_arabic_constrained_primary | 0.4507 | 0.4276 | 10.5904 | 48.547 | 0.6530
3 | bbn_arabic_constrained_primary | 0.4340 | 0.4290 | 10.6590 | 49.599 | 0.6784
4 | LIUM_arabic_constrained_primary | 0.4298 | 0.4105 | 10.2732 | 50.484 | 0.6490
5 | isi-lw_arabic_constrained_primary | 0.4248 | 0.4227 | 10.4077 | 51.820 | 0.6695
6 | CUED_arabic_constrained_primary | 0.4238 | 0.4018 | 9.9486 | 51.557 | 0.6274
6 | SRI_arabic_constrained_primary | 0.4229 | 0.4031 | 10.1935 | 49.780 | 0.6430
7 | Edinburgh_arabic_constrained_primary | 0.4029 | 0.3833 | 9.9641 | 51.165 | 0.6396
8 | UMD_arabic_constrained_primary | 0.3906 | 0.3784 | 10.1176 | 52.158 | 0.6553
9 | UPC_arabic_constrained_primary | 0.3743 | 0.3576 | 9.6553 | 53.260 | 0.6380
10 | columbia_arabic_constrained_primary | 0.3740 | 0.3594 | 9.4806 | 51.973 | 0.6092
9,10 | NTT_arabic_constrained_primary | 0.3671 | 0.3540 | 9.8806 | 56.077 | 0.6312
11 | CMUEBMT_arabic_constrained_primary | 0.3481 | 0.3479 | 9.2165 | 57.376 | 0.6057
12 | qmul_arabic_constrained_primary | 0.3308 | 0.3181 | 8.8124 | 55.145 | 0.5893
13 | SAKHR_arabic_constrained_primary | 0.3133 | 0.3133 | 9.1373 | 57.159 | 0.6659
14 | UPC.lsi_english_constrained_primary | 0.3021 | 0.2876 | 8.6350 | 58.228 | 0.5639
15 | BASISTECH_arabic_constrained_primary | 0.2529 | 0.2423 | 7.8781 | 63.015 | 0.5454
16 | AUC_arabic_constrained_primary | 0.1415 | 0.1359 | 6.3210 | 76.406 | 0.4468
Unconstrained Training Track
17 | google_arabic_unconstrained_primary | 0.4772 | 0.4739 | 11.1864 | 46.853 | 0.6996
18 | IBM_arabic_unconstrained_primary | 0.4717 | 0.4527 | 11.0591 | 46.755 | 0.6902
19 | apptek_arabic_unconstrained_primary | 0.4483 | 0.4474 | 10.8420 | 48.263 | 0.7160
20 | cmu-smt_arabic_unconstrained_primary | 0.4312 | 0.4114 | 10.3617 | 50.082 | 0.6672
* designates primary metric
Entire Current Evaluation Test Set: Chinese to English
significance | system | BLEU-4* | IBM BLEU | NIST | TER | METEOR
Constrained Training Track
1 | MSR-NRC-SRI_chinese_constrained_primary | 0.3089 | 0.2947 | 8.5059 | 58.460 | 0.5379
1 | bbn_chinese_constrained_primary | 0.3059 | 0.2959 | 8.2023 | 57.067 | 0.5468
1 | isi-lw_chinese_constrained_primary | 0.3041 | 0.2940 | 8.0950 | 57.734 | 0.5467
1 | google_chinese_constrained_primary | 0.2999 | 0.2887 | 8.5143 | 58.359 | 0.5567
2 | MSR-MSRA_chinese_constrained_primary | 0.2901 | 0.2766 | 8.1480 | 60.073 | 0.5171
3 | SRI_chinese_constrained_primary | 0.2697 | 0.2575 | 7.8942 | 61.622 | 0.5101
3 | Edinburgh_chinese_constrained_primary | 0.2608 | 0.2513 | 7.8117 | 60.654 | 0.5142
4 | SU_chinese_constrained_primary | 0.2547 | 0.2420 | 7.7994 | 63.288 | 0.5122
4,5 | UMD_chinese_constrained_primary | 0.2506 | 0.2387 | 7.8236 | 62.134 | 0.5167
4,5 | NTT_chinese_constrained_primary | 0.2469 | 0.2270 | 7.9511 | 63.415 | 0.5126
5 | NRC_chinese_constrained_primary | 0.2458 | 0.2373 | 7.9964 | 63.835 | 0.5362
5 | CASIA_chinese_constrained_primary | 0.2407 | 0.2310 | 7.5790 | 62.518 | 0.4999
6 | NICT-ATR_chinese_constrained_primary | 0.2269 | 0.2184 | 7.1635 | 64.524 | 0.4962
6 | ICT_chinese_constrained_primary | 0.2258 | 0.2213 | 6.1551 | 61.387 | 0.4878
7 | JHU-UMD_chinese_constrained_primary | 0.2111 | 0.2079 | 6.0509 | 61.834 | 0.4691
8 | XMU_chinese_constrained_primary | 0.1979 | 0.1938 | 6.7514 | 63.139 | 0.4780
9 | HITIRLab_chinese_constrained_primary | 0.1866 | 0.1795 | 6.5942 | 67.376 | 0.4458
10 | hkust_large_primary | 0.1678 | 0.1624 | 6.7124 | 75.803 | 0.4332
10 | ISCAS_chinese_constrained_primary | 0.1569 | 0.1520 | 5.9557 | 68.221 | 0.4354
11 | NTHU_Chinese_constrained_primary | 0.0393 | 0.0390 | 3.5096 | 93.892 | 0.3209
Unconstrained Training Track
12 | google_chinese_unconstrained_primary | 0.3195 | 0.3069 | 8.8628 | 57.009 | 0.5707
13 | cmu-smt_chinese_unconstrained_primary | 0.2597 | 0.2474 | 8.0026 | 62.411 | 0.5363
14 | NRC-SYSTRAN_chinese_unconstrained_primary | 0.2523 | 0.2443 | 8.0473 | 63.002 | 0.5490
15 | UKA_chinese_unconstrained_primary | 0.2406 | 0.2323 | 7.4571 | 61.706 | 0.4916
16 | CMUXfer_chinese_unconstrained_primary | 0.1310 | 0.1309 | 6.2452 | 76.722 | 0.4614
17 | BJUT_chinese_unconstrained_primary | 0.0735 | 0.0694 | 4.7239 | 77.685 | 0.3944
* designates primary metric
Entire Current Evaluation Test Set: Urdu to English
significance | system | BLEU-4* | IBM BLEU | NIST | TER | METEOR
Constrained Training Track
1 | google_urdu_constrained_primary | 0.2281 | 0.2280 | 7.8406 | 69.906 | 0.5693
2 | bbn_urdu_constrained_primary | 0.2028 | 0.2026 | 7.6927 | 70.885 | 0.5437
2 | IBM_urdu_constrained_primary | 0.2026 | 0.1999 | 7.7022 | 68.860 | 0.5096
2 | isi-lw_urdu_constrained_primary | 0.1983 | 0.1985 | 7.3030 | 72.749 | 0.5239
3 | UMD_urdu_constrained_primary | 0.1829 | 0.1826 | 7.2905 | 68.748 | 0.5053
4 | MITLLAFRL_urdu_constrained_primary | 0.1666 | 0.1666 | 7.0460 | 72.859 | n/a
5 | UPC_urdu_constrained_primary | 0.1614 | 0.1614 | 7.0958 | 72.839 | 0.4904
6 | columbia_urdu_constrained_primary | 0.1459 | 0.1460 | 6.5474 | 78.686 | 0.4903
6,7 | Edinburgh_urdu_constrained_primary | 0.1456 | 0.1455 | 6.4393 | 75.982 | 0.5215
7,8 | NTT_urdu_constrained_primary | 0.1394 | 0.1383 | 6.9604 | 75.605 | 0.5022
8 | qmul_urdu_constrained_primary | 0.1338 | 0.1338 | 6.2915 | 81.457 | 0.4728
8 | CMU-XFER_urdu_constrained_primary# | 0.1016 | 0.1017 | 4.1885 | 108.167 | 0.3518
* designates primary metric
# designates system with known alignment problem; corrected system submitted late.
Here is a description of the scores:
· BLEU-4*: the primary metric, produced using mteval-v12, a language-independent version that tokenizes on every Unicode symbol.
· BLEU-4 normalized: uses a mapping file to normalize both the reference and system translations to a single variant of certain symbols.
· NIST: the Doddington improvement to BLEU, as reported by mteval-v12.
· BLEU-4 word segmented: mteval-v12 with word-level scoring, using a standard word segmenter on both reference and system translations (see the sketch following this list).
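To make the difference between the character-tokenized and word-segmented variants concrete, the small Python example below counts 4-grams both ways for a hypothetical Chinese sentence; the string and its segmentation are invented for illustration, and the segmenter output is assumed.

```python
# Character tokenization (the language-independent mode treats every
# Unicode symbol as a token) versus hypothetical word segmentation.
hyp = "我喜欢机器翻译"

char_tokens = list(hyp)                       # 7 single-character tokens
word_tokens = ["我", "喜欢", "机器", "翻译"]  # assumed segmenter output: 4 words

print(len(char_tokens) - 3)  # 4 four-grams under character tokenization
print(len(word_tokens) - 3)  # 1 four-gram under word segmentation
```

Because far fewer N-grams exist and match at the word level, the word-segmented BLEU-4 scores in the table below are much lower than the character-tokenized ones.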
We are not identifying significance groups for this task.
Entire Current Evaluation Test Set: English to Chinese
system | BLEU-4* | BLEU-4 normalized | NIST | BLEU-4 word segmented
Constrained Training Track
google_english_constrained_primary | 0.4142 | 0.4309 | 9.7727 | 0.1643
MSRA_English_constrained_primary | 0.4099 | 0.4343 | 9.4918 | 0.1769
isi-lw_english_constrained_primary | 0.3857 | 0.4163 | 8.6810 | 0.1687
NICT-ATR_english_constrained_primary | 0.3438 | 0.3718 | 7.9608 | 0.1416
HITIRLab_english_constrained_primary | 0.3225 | 0.3436 | 7.3768 | 0.0946
ICT_english_constrained_primary | 0.3176 | 0.3411 | 7.7030 | 0.0879
CMUEBMT_english_constrained_primary | 0.2738 | 0.2954 | 7.3042 | 0.0760
XMU_english_constrained_primary | 0.2502 | 0.2664 | 6.2083 | 0.0593
UMD_english_constrained_primary | 0.1982 | 0.2391 | 3.6922 | 0.0899
Unconstrained Training Track
google_english_unconstrained_primary | 0.4710 | 0.4914 | 10.7868 | 0.1963
BJUT_english_unconstrained_primary | 0.2765 | 0.2906 | 7.8185 | 0.1046
* designates primary metric
All scores reported below were computed on the entire "CURRENT" data sets and are broken down by genre; all primary submissions are shown.
All scores are BLEU-4*

system | All data | NW (newswire) | WB (web)

Arabic to English
AUC_arabic_constrained_primary | 0.1415 | 0.1718 | 0.0983
BASISTECH_arabic_constrained_primary | 0.2529 | 0.2951 | 0.1900
CMUEBMT_arabic_constrained_primary | 0.3481 | 0.4094 | 0.2695
CUED_arabic_constrained_primary | 0.4238 | 0.4819 | 0.3456
Edinburgh_arabic_constrained_primary | 0.4029 | 0.4675 | 0.3008
IBM-UMD_arabic_constrained_primary | 0.4525 | 0.5085 | 0.3489
IBM_arabic_constrained_primary | 0.4507 | 0.5089 | 0.3432
LIUM_arabic_constrained_primary | 0.4298 | 0.4830 | 0.3431
NTT_arabic_constrained_primary | 0.3671 | 0.4186 | 0.2923
SAKHR_arabic_constrained_primary | 0.3133 | 0.3505 | 0.2622
SRI_arabic_constrained_primary | 0.4229 | 0.4886 | 0.3171
UMD_arabic_constrained_primary | 0.3906 | 0.4452 | 0.3117
UPC.lsi_english_constrained_primary | 0.3021 | 0.3475 | 0.2292
UPC_arabic_constrained_primary | 0.3743 | 0.4281 | 0.2840
bbn_arabic_constrained_primary | 0.4340 | 0.4919 | 0.3497
columbia_arabic_constrained_primary | 0.3740 | 0.4431 | 0.2797
google_arabic_constrained_primary | 0.4557 | 0.5164 | 0.3724
isi-lw_arabic_constrained_primary | 0.4248 | 0.4870 | 0.3355
qmul_arabic_constrained_primary | 0.3308 | 0.4005 | 0.2358
UNCONSTRAINED SYSTEMS
IBM_arabic_unconstrained_primary | 0.4717 | 0.5264 | 0.3762
apptek_arabic_unconstrained_primary | 0.4483 | 0.4900 | 0.3925
cmu-smt_arabic_unconstrained_primary | 0.4312 | 0.4884 | 0.3392
google_arabic_unconstrained_primary | 0.4772 | 0.5385 | 0.3940

Chinese to English
CASIA_chinese_constrained_primary | 0.2407 | 0.2756 | 0.1936
Edinburgh_chinese_constrained_primary | 0.2608 | 0.2976 | 0.2116
HITIRLab_chinese_constrained_primary | 0.1866 | 0.2116 | 0.1529
ICT_chinese_constrained_primary | 0.2258 | 0.2760 | 0.1586
ISCAS_chinese_constrained_primary | 0.1569 | 0.1805 | 0.1257
JHU-UMD_chinese_constrained_primary | 0.2111 | 0.2502 | 0.1586
MSR-MSRA_chinese_constrained_primary | 0.2901 | 0.3435 | 0.2175
MSR-NRC-SRI_chinese_constrained_primary | 0.3089 | 0.3614 | 0.2376
NICT-ATR_chinese_constrained_primary | 0.2269 | 0.2579 | 0.1854
NRC_chinese_constrained_primary | 0.2458 | 0.2679 | 0.2150
NTHU_Chinese_constrained_primary | 0.0393 | 0.0367 | 0.0425
NTT_chinese_constrained_primary | 0.2469 | 0.2828 | 0.1991
SRI_chinese_constrained_primary | 0.2697 | 0.3154 | 0.2075
SU_chinese_constrained_primary | 0.2547 | 0.2924 | 0.2039
UMD_chinese_constrained_primary | 0.2506 | 0.2939 | 0.1871
XMU_chinese_constrained_primary | 0.1979 | 0.2401 | 0.1401
bbn_chinese_constrained_primary | 0.3059 | 0.3639 | 0.2273
google_chinese_constrained_primary | 0.2999 | 0.3489 | 0.2344
hkust_large_primary | 0.1678 | 0.1891 | 0.1377
isi-lw_chinese_constrained_primary | 0.3041 | 0.3676 | 0.2176
UNCONSTRAINED SYSTEMS
BJUT_chinese_unconstrained_primary | 0.0735 | 0.0751 | 0.0689
CMUXfer_chinese_unconstrained_primary | 0.1310 | 0.1536 | 0.0994
NRC-SYSTRAN_chinese_unconstrained_primary | 0.2523 | 0.2757 | 0.2192
UKA_chinese_unconstrained_primary | 0.2406 | 0.2846 | 0.1810
cmu-smt_chinese_unconstrained_primary | 0.2597 | 0.2909 | 0.2127
google_chinese_unconstrained_primary | 0.3195 | 0.3701 | 0.2515

Urdu to English
CMU-XFER_urdu_constrained_primary# | 0.1016 | 0.1827 | 0.0183
Edinburgh_urdu_constrained_primary | 0.1456 | 0.1609 | 0.1291
IBM_urdu_constrained_primary | 0.2026 | 0.2347 | 0.1668
MITLLAFRL_urdu_constrained_primary | 0.1666 | 0.1939 | 0.1373
NTT_urdu_constrained_primary | 0.1394 | 0.1630 | 0.1155
UMD_urdu_constrained_primary | 0.1829 | 0.2160 | 0.1478
UPC_urdu_constrained_primary | 0.1614 | 0.1878 | 0.1320
bbn_urdu_constrained_primary | 0.2028 | 0.2388 | 0.1632
columbia_urdu_constrained_primary | 0.1459 | 0.1714 | 0.1195
google_urdu_constrained_primary | 0.2281 | 0.2619 | 0.1903
isi-lw_urdu_constrained_primary | 0.1983 | 0.2292 | 0.1645
qmul_urdu_constrained_primary | 0.1338 | 0.1578 | 0.1077
* designates primary metric
# designates system with known alignment problem; corrected system submitted late.