NIST 2008 Open Machine Translation Evaluation - (MT08)

Official Evaluation Results

Date of release: Fri Jun 06, 2008

Version: mt08_official_release_v0

The NIST 2008 Machine Translation Evaluation (MT-08) is part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations in order to support machine translation (MT) research and help advance the state-of-the-art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT-08 evaluation plan.

Disclaimer: These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT-08 was an evaluation of research algorithms, the MT-08 test design required local implementation by each participant. As such, participants were only required to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

Certain commercial equipment, instruments, software, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor is it intended to imply that the equipment, instruments, software, or materials are necessarily the best available for the purpose.

There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.

The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain or in the amount of data used to build a system can greatly influence system performance, and changes to the task protocols could reveal different performance strengths and weaknesses for these same systems.

For these reasons, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best suited to a particular application.


Evaluation Tasks

The MT-08 evaluation consisted of four tasks. Each task required a system to perform translation from a given source language into the target language. The source and target language pairs that made up the four MT-08 tasks were:

  • Translate Arabic text into English text
  • Translate Chinese text into English text
  • Translate Urdu text into English text
  • Translate English text into Chinese text

Evaluation Conditions

MT research and development requires language data resources. System performance is strongly affected by the type and amount of resources used. Therefore, two different resource categories were defined as conditions of evaluation. The categories differed solely by the amount of data that was available for use in the training and development of the core MT engine. The evaluation conditions were called "Constrained Data Track" and "Un-Constrained Data Track".

  • Constrained Data Track - limited the training data to data in the LDC public catalogue existing before July 1st, 2007.

    Note: For the Urdu task, the constrained data condition required that core system development use only the provided resource DVD. No other data was allowed for this primary condition of interest.

  • Un-Constrained Data Track - extended the training data to any publicly available data existing before July 1st, 2007.

Submissions that do not fall into the categories described above are not reported in this release.

Evaluation Data

Source Data

The MT-08 evaluation data sets contained documents drawn from newswire text and web-based newsgroups. The source documents were encoded in UTF-8.

The test data was selected from a pool of data collected by the LDC during July 2007. The selection process sought a variety of sources (see below) and publication dates while meeting the target test set size.

Source Language | Newswire Sources                                                       | Newsgroup / Web Sources
Arabic          | AAW, AFP, AHR, ASB (Assabah), HYT, NHR, QDS, XIN (Xinhua News Agency)  | various web forums
Chinese         | AFP, CNS, GMW, PDA, PLA, XIN                                           | various web forums
Urdu            | BBC, JNG, PTB, VOA                                                     | various web forums
English         | AFP, APW, LTW, NYT, XIN                                                | n/a

Reference Data

MT-08 reference data consisted of four independently generated, high-quality translations produced by professional translation companies. Each translation agency was required to have native speaker(s) of the source and target languages working on the translations.

Current versus Progress Data Division

For those willing to abide by the strict processing rules, a "PROGRESS" test set was distributed for use as a blind benchmark across several evaluations. Teams that processed this data submitted their translations to NIST and then deleted all related files (source, translations, and any other derivative files). The scores for the progress test set were reported to the participants but are not reported here. Future OpenMT evaluations will report PROGRESS test set scores from year to year.

Performance Measurement

Machine translation quality was measured automatically using an N-gram co-occurrence statistic developed by IBM and referred to as BLEU. BLEU measures translation accuracy according to the N-grams, or sequences of N words, that a translation shares with one or more high-quality reference translations. Thus, the more co-occurrences, the better the score. BLEU is an accuracy metric, ranging from "0" to "1" with "1" being the best possible score. A detailed description of BLEU can be found in the paper Papineni, Roukos, Ward, and Zhu (2001), "Bleu: a Method for Automatic Evaluation of Machine Translation" (keyword = RC22176).
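For reference, the standard formulation from that paper combines the four modified (clipped) n-gram precisions with a brevity penalty that discourages overly short output; a sketch of the formula, with uniform weights w_n = 1/4, is:

```latex
% Standard BLEU formulation (Papineni et al., 2001); c is the candidate
% length, r the effective reference length, p_n the clipped n-gram precision.
\mathrm{BLEU} = \mathrm{BP}\cdot\exp\Big(\sum_{n=1}^{4} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r,\\
e^{1 - r/c} & \text{if } c \le r.
\end{cases}
```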

Although BLEU was the official metric for MT-08, measuring translation quality is an ongoing research topic in the MT community. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.
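As a concrete illustration of the clipped n-gram matching and brevity penalty that BLEU is built on, here is a minimal, single-reference sketch. It is not the official mteval-v11b scorer and omits smoothing, multiple references, and corpus-level aggregation.

```python
# Minimal single-reference sketch of BLEU-style scoring (illustrative only).
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand, n)
        ref_ngrams = ngram_counts(ref, n)
        # Clipped counts: a candidate n-gram is credited at most as often
        # as it appears in the reference.
        clipped = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_precision_sum += math.log(max(clipped, 1e-9) / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(log_precision_sum / max_n)

print(round(sentence_bleu("the quick brown fox jumps over the lazy dog",
                          "the quick brown fox jumped over the lazy dog"), 4))
```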

Automatic metrics reported:

  • BLEU-4 (MTeval-v11b: official metric)
  • IBM BLEU (IBM's BLEU with original brevity penalty)
  • NIST (NIST's refinement of BLEU, commonly referred to as NIST)
  • TER
  • METEOR

Other metrics (to be) reported:

  • Human Assessments of Adequacy (judged by participants and others)
  • Human judgments of Preference (judged by participants and others)
  • MT Comprehension Test (implemented by MIT-LL)

Evaluation Participants

The table below lists the organizations that participated and the evaluation tasks for which they were registered in MT-08.

Site ID    | Organization                                                       | Location
apptek     | Applications Technology Inc.                                       | USA
auc        | The American University in Cairo                                   | Egypt
basistech  | Basis Technology                                                   | USA
bbn        | BBN Technologies                                                   | USA
bjut-mtg   | Beijing University of Technology, Machine Translation Group       | China
cas-ia     | Chinese Academy of Sciences, Institute of Automation               | China
cas-ict    | Chinese Academy of Sciences, Institute of Computing Technology     | China
cas-is     | Chinese Academy of Sciences, Institute of Software                 | China
cmu-ebmt   | Carnegie Mellon                                                    | USA
cmu-smt    | Carnegie Mellon, interACT                                          | USA
cmu-xfer   | Carnegie Mellon                                                    | USA
columbia   | Columbia University                                                | USA
cued       | University of Cambridge, Dept. of Engineering                      | UK
edinburgh  | University of Edinburgh                                            | UK
google     | Google                                                             | USA
hit-ir     | Harbin Institute of Technology, Information Retrieval Laboratory   | China
hkust      | Hong Kong University of Science and Technology                     | China
ibm        | IBM                                                                | USA
lium       | Universite du Maine (Le Mans), Laboratoire d'Informatique          | France
msra       | Microsoft Research Asia                                            | China
nrc        | National Research Council                                          | Canada
nthu       | National Tsing Hua University                                      | Taiwan
ntt        | NTT Communication Science Laboratories                             | Japan
qmul       | Queen Mary University of London                                    | UK
sakhr      | Sakhr Software Co.                                                 | Egypt
sri        | SRI International                                                  | USA
stanford   | Stanford University                                                | USA
uka        | Universitaet Karlsruhe                                             | Germany
umd        | University of Maryland                                             | USA
upc-lsi    | Universitat Politecnica de Catalunya, LSI                          | Spain
upc-talp   | Universitat Politecnica de Catalunya, TALP                         | Spain
xmu-iai    | Xiamen University, Institute of Artificial Intelligence            | China

Collaborations

Site ID      | Organization                                                                                         | Location
ibm_umd      | IBM / University of Maryland                                                                         | USA
jhu_umd      | Johns Hopkins University / University of Maryland                                                    | USA
isi_lw       | USC-ISI / Language Weaver Inc.                                                                       | USA
msr_msra     | Microsoft Research / Microsoft Research Asia                                                         |
msr_nrc_sri  | Microsoft Research / Microsoft Research Asia / National Research Council Canada / SRI International  |
nict_atr     | NICT / ATR                                                                                           | Japan
nrc_systran  | National Research Council Canada / SYSTRAN                                                           |

Evaluation Systems

Each site/team could submit up to four systems for evaluation, with one system marked as its primary system. The primary system indicated the site/team's best effort. This official public version of the results reports results only for the primary systems. Note that these charts show an absolute ranking according to the primary metric.

Systems that failed to meet the requirements for either track are not reported here.

"significance groups*" shows areas where the wilcoxon signed rank test was not able to differenciate system performance at the 95% confidence level. That is, if two systems belong to the same significance group (by sharing the same number), then they are determined to be comparble, based n BLEU-4 scoring.


Results Section

Contains valid on-time submissions.

Late and corrected submissions will be linked here.

 


Overall System Results

Arabic to English (primary system) Results

Entire Current Evaluation Test Set

significance groups* | system | BLEU-4* | IBM BLEU | NIST | TER | METEOR

Constrained Training Track
1     | google_arabic_constrained_primary     | 0.4557 | 0.4526 | 10.8821 | 48.535 | 0.6857
2     | IBM-UMD_arabic_constrained_primary    | 0.4525 | 0.4300 | 10.6183 | 48.436 | 0.6539
3     | IBM_arabic_constrained_primary        | 0.4507 | 0.4276 | 10.5904 | 48.547 | 0.6530
3     | bbn_arabic_constrained_primary        | 0.4340 | 0.4290 | 10.6590 | 49.599 | 0.6784
4     | LIUM_arabic_constrained_primary       | 0.4298 | 0.4105 | 10.2732 | 50.484 | 0.6490
5     | isi-lw_arabic_constrained_primary     | 0.4248 | 0.4227 | 10.4077 | 51.820 | 0.6695
6     | CUED_arabic_constrained_primary       | 0.4238 | 0.4018 |  9.9486 | 51.557 | 0.6274
6     | SRI_arabic_constrained_primary        | 0.4229 | 0.4031 | 10.1935 | 49.780 | 0.6430
7     | Edinburgh_arabic_constrained_primary  | 0.4029 | 0.3833 |  9.9641 | 51.165 | 0.6396
8     | UMD_arabic_constrained_primary        | 0.3906 | 0.3784 | 10.1176 | 52.158 | 0.6553
9     | UPC_arabic_constrained_primary        | 0.3743 | 0.3576 |  9.6553 | 53.260 | 0.6380
10    | columbia_arabic_constrained_primary   | 0.3740 | 0.3594 |  9.4806 | 51.973 | 0.6092
9,10  | NTT_arabic_constrained_primary        | 0.3671 | 0.3540 |  9.8806 | 56.077 | 0.6312
11    | CMUEBMT_arabic_constrained_primary    | 0.3481 | 0.3479 |  9.2165 | 57.376 | 0.6057
12    | qmul_arabic_constrained_primary       | 0.3308 | 0.3181 |  8.8124 | 55.145 | 0.5893
13    | SAKHR_arabic_constrained_primary      | 0.3133 | 0.3133 |  9.1373 | 57.159 | 0.6659
14    | UPC.lsi_english_constrained_primary   | 0.3021 | 0.2876 |  8.6350 | 58.228 | 0.5639
15    | BASISTECH_arabic_constrained_primary  | 0.2529 | 0.2423 |  7.8781 | 63.015 | 0.5454
16    | AUC_arabic_constrained_primary        | 0.1415 | 0.1359 |  6.3210 | 76.406 | 0.4468

UnConstrained Training Track
17    | google_arabic_unconstrained_primary   | 0.4772 | 0.4739 | 11.1864 | 46.853 | 0.6996
18    | IBM_arabic_unconstrained_primary      | 0.4717 | 0.4527 | 11.0591 | 46.755 | 0.6902
19    | apptek_arabic_unconstrained_primary   | 0.4483 | 0.4474 | 10.8420 | 48.263 | 0.7160
20    | cmu-smt_arabic_unconstrained_primary  | 0.4312 | 0.4114 | 10.3617 | 50.082 | 0.6672

* designates primary metric

Chinese to English (primary system) Results

Entire Current Evaluation Test Set

significance groups* | system | BLEU-4* | IBM BLEU | NIST | TER | METEOR

Constrained Training Track
1     | MSR-NRC-SRI_chinese_constrained_primary   | 0.3089 | 0.2947 | 8.5059 | 58.460 | 0.5379
1     | bbn_chinese_constrained_primary           | 0.3059 | 0.2959 | 8.2023 | 57.067 | 0.5468
1     | isi-lw_chinese_constrained_primary        | 0.3041 | 0.2940 | 8.0950 | 57.734 | 0.5467
1     | google_chinese_constrained_primary        | 0.2999 | 0.2887 | 8.5143 | 58.359 | 0.5567
2     | MSR-MSRA_chinese_constrained_primary      | 0.2901 | 0.2766 | 8.1480 | 60.073 | 0.5171
3     | SRI_chinese_constrained_primary           | 0.2697 | 0.2575 | 7.8942 | 61.622 | 0.5101
3     | Edinburgh_chinese_constrained_primary     | 0.2608 | 0.2513 | 7.8117 | 60.654 | 0.5142
4     | SU_chinese_constrained_primary            | 0.2547 | 0.2420 | 7.7994 | 63.288 | 0.5122
4,5   | UMD_chinese_constrained_primary           | 0.2506 | 0.2387 | 7.8236 | 62.134 | 0.5167
4,5   | NTT_chinese_constrained_primary           | 0.2469 | 0.2270 | 7.9511 | 63.415 | 0.5126
5     | NRC_chinese_constrained_primary           | 0.2458 | 0.2373 | 7.9964 | 63.835 | 0.5362
5     | CASIA_chinese_constrained_primary         | 0.2407 | 0.2310 | 7.5790 | 62.518 | 0.4999
6     | NICT-ATR_chinese_constrained_primary      | 0.2269 | 0.2184 | 7.1635 | 64.524 | 0.4962
6     | ICT_chinese_constrained_primary           | 0.2258 | 0.2213 | 6.1551 | 61.387 | 0.4878
7     | JHU-UMD_chinese_constrained_primary       | 0.2111 | 0.2079 | 6.0509 | 61.834 | 0.4691
8     | XMU_chinese_constrained_primary           | 0.1979 | 0.1938 | 6.7514 | 63.139 | 0.4780
9     | HITIRLab_chinese_constrained_primary      | 0.1866 | 0.1795 | 6.5942 | 67.376 | 0.4458
10    | hkust_large_primary                       | 0.1678 | 0.1624 | 6.7124 | 75.803 | 0.4332
10    | ISCAS_chinese_constrained_primary         | 0.1569 | 0.1520 | 5.9557 | 68.221 | 0.4354
11    | NTHU_Chinese_constrained_primary          | 0.0393 | 0.0390 | 3.5096 | 93.892 | 0.3209

UnConstrained Training Track
12    | google_chinese_unconstrained_primary      | 0.3195 | 0.3069 | 8.8628 | 57.009 | 0.5707
13    | cmu-smt_chinese_unconstrained_primary     | 0.2597 | 0.2474 | 8.0026 | 62.411 | 0.5363
14    | NRC-SYSTRAN_chinese_unconstrained_primary | 0.2523 | 0.2443 | 8.0473 | 63.002 | 0.5490
15    | UKA_chinese_unconstrained_primary         | 0.2406 | 0.2323 | 7.4571 | 61.706 | 0.4916
16    | CMUXfer_chinese_unconstrained_primary     | 0.1310 | 0.1309 | 6.2452 | 76.722 | 0.4614
17    | BJUT_chinese_unconstrained_primary        | 0.0735 | 0.0694 | 4.7239 | 77.685 | 0.3944

* designates primary metric

Urdu to English (primary system) Results

significance groups* | system | BLEU-4* | IBM BLEU | NIST | TER | METEOR

Constrained Training Track
1     | google_urdu_constrained_primary     | 0.2281 | 0.2280 | 7.8406 |  69.906 | 0.5693
2     | bbn_urdu_constrained_primary        | 0.2028 | 0.2026 | 7.6927 |  70.885 | 0.5437
2     | IBM_urdu_constrained_primary        | 0.2026 | 0.1999 | 7.7022 |  68.860 | 0.5096
2     | isi-lw_urdu_constrained_primary     | 0.1983 | 0.1985 | 7.3030 |  72.749 | 0.5239
3     | UMD_urdu_constrained_primary        | 0.1829 | 0.1826 | 7.2905 |  68.748 | 0.5053
4     | MITLLAFRL_urdu_constrained_primary  | 0.1666 | 0.1666 | 7.0460 |  72.859 |
5     | UPC_urdu_constrained_primary        | 0.1614 | 0.1614 | 7.0958 |  72.839 | 0.4904
6     | columbia_urdu_constrained_primary   | 0.1459 | 0.1460 | 6.5474 |  78.686 | 0.4903
6,7   | Edinburgh_urdu_constrained_primary  | 0.1456 | 0.1455 | 6.4393 |  75.982 | 0.5215
7,8   | NTT_urdu_constrained_primary        | 0.1394 | 0.1383 | 6.9604 |  75.605 | 0.5022
8     | qmul_urdu_constrained_primary       | 0.1338 | 0.1338 | 6.2915 |  81.457 | 0.4728
8     | CMU-XFER_urdu_constrained_primary#  | 0.1016 | 0.1017 | 4.1885 | 108.167 | 0.3518

* designates primary metric
# designates system with known alignment problem, corrected system submitted late.

English to Chinese (primary system) Results

Here is a description of the scores:

  • BLEU-4*: primary metric, produced using mteval-v12, a language-independent version that tokenizes on every Unicode symbol.

  • BLEU-4 normalized: uses a mapping file to normalize both the reference and system translations to a single variant of certain symbols.

  • NIST: the Doddington improvement to BLEU, as reported by mteval-v12.

  • BLEU-4 word segmented: mteval-v12 with word-level scoring, using a standard word segmenter on both the reference and system translations.

We are not identifying significance groups for this task.
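To illustrate the difference between character-level tokenization (used for the primary BLEU-4 score on Chinese output) and word-segmented scoring, here is a minimal sketch; the word segmentation shown is only a hypothetical example, not the standard segmenter referenced above.

```python
# Minimal sketch contrasting character-level tokenization with a
# word-segmented view of the same Chinese sentence.
def char_tokenize(text: str) -> list:
    # Treat every non-space character (every Unicode symbol) as a token.
    return [ch for ch in text if not ch.isspace()]

hypothesis = "今天天气很好"          # "The weather is nice today"
print(char_tokenize(hypothesis))     # ['今', '天', '天', '气', '很', '好']

word_segmented = ["今天", "天气", "很", "好"]   # hypothetical segmentation
print(word_segmented)
```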

system                                 | BLEU-4* | BLEU-4 normalized | NIST    | BLEU-4 word segmented

Constrained Training Track
google_english_constrained_primary     | 0.4142  | 0.4309            |  9.7727 | 0.1643
MSRA_English_constrained_primary       | 0.4099  | 0.4343            |  9.4918 | 0.1769
isi-lw_english_constrained_primary     | 0.3857  | 0.4163            |  8.6810 | 0.1687
NICT-ATR_english_constrained_primary   | 0.3438  | 0.3718            |  7.9608 | 0.1416
HITIRLab_english_constrained_primary   | 0.3225  | 0.3436            |  7.3768 | 0.0946
ICT_english_constrained_primary        | 0.3176  | 0.3411            |  7.7030 | 0.0879
CMUEBMT_english_constrained_primary    | 0.2738  | 0.2954            |  7.3042 | 0.0760
XMU_english_constrained_primary        | 0.2502  | 0.2664            |  6.2083 | 0.0593
UMD_english_constrained_primary        | 0.1982  | 0.2391            |  3.6922 | 0.0899

UnConstrained Training Track
google_english_unconstrained_primary   | 0.4710  | 0.4914            | 10.7868 | 0.1963
BJUT_english_unconstrained_primary     | 0.2765  | 0.2906            |  7.8185 | 0.1046

* designates primary metric


Results by Genre

All reported scores are computed over the entire "CURRENT" data sets only. All primary submissions are shown here.

Site Results (alphabetical order)

All scores are BLEU-4*

system                                      | All data | NW     | WB

Arabic to English - Constrained Systems
AUC_arabic_constrained_primary              | 0.1415   | 0.1718 | 0.0983
BASISTECH_arabic_constrained_primary        | 0.2529   | 0.2951 | 0.1900
CMUEBMT_arabic_constrained_primary          | 0.3481   | 0.4094 | 0.2695
CUED_arabic_constrained_primary             | 0.4238   | 0.4819 | 0.3456
Edinburgh_arabic_constrained_primary        | 0.4029   | 0.4675 | 0.3008
IBM-UMD_arabic_constrained_primary          | 0.4525   | 0.5085 | 0.3489
IBM_arabic_constrained_primary              | 0.4507   | 0.5089 | 0.3432
LIUM_arabic_constrained_primary             | 0.4298   | 0.4830 | 0.3431
NTT_arabic_constrained_primary              | 0.3671   | 0.4186 | 0.2923
SAKHR_arabic_constrained_primary            | 0.3133   | 0.3505 | 0.2622
SRI_arabic_constrained_primary              | 0.4229   | 0.4886 | 0.3171
UMD_arabic_constrained_primary              | 0.3906   | 0.4452 | 0.3117
UPC.lsi_english_constrained_primary         | 0.3021   | 0.3475 | 0.2292
UPC_arabic_constrained_primary              | 0.3743   | 0.4281 | 0.2840
bbn_arabic_constrained_primary              | 0.4340   | 0.4919 | 0.3497
columbia_arabic_constrained_primary         | 0.3740   | 0.4431 | 0.2797
google_arabic_constrained_primary           | 0.4557   | 0.5164 | 0.3724
isi-lw_arabic_constrained_primary           | 0.4248   | 0.4870 | 0.3355
qmul_arabic_constrained_primary             | 0.3308   | 0.4005 | 0.2358

Arabic to English - Unconstrained Systems
IBM_arabic_unconstrained_primary            | 0.4717   | 0.5264 | 0.3762
apptek_arabic_unconstrained_primary         | 0.4483   | 0.4900 | 0.3925
cmu-smt_arabic_unconstrained_primary        | 0.4312   | 0.4884 | 0.3392
google_arabic_unconstrained_primary         | 0.4772   | 0.5385 | 0.3940

Chinese to English - Constrained Systems
CASIA_chinese_constrained_primary           | 0.2407   | 0.2756 | 0.1936
Edinburgh_chinese_constrained_primary       | 0.2608   | 0.2976 | 0.2116
HITIRLab_chinese_constrained_primary        | 0.1866   | 0.2116 | 0.1529
ICT_chinese_constrained_primary             | 0.2258   | 0.2760 | 0.1586
ISCAS_chinese_constrained_primary           | 0.1569   | 0.1805 | 0.1257
JHU-UMD_chinese_constrained_primary         | 0.2111   | 0.2502 | 0.1586
MSR-MSRA_chinese_constrained_primary        | 0.2901   | 0.3435 | 0.2175
MSR-NRC-SRI_chinese_constrained_primary     | 0.3089   | 0.3614 | 0.2376
NICT-ATR_chinese_constrained_primary        | 0.2269   | 0.2579 | 0.1854
NRC_chinese_constrained_primary             | 0.2458   | 0.2679 | 0.2150
NTHU_Chinese_constrained_primary            | 0.0393   | 0.0367 | 0.0425
NTT_chinese_constrained_primary             | 0.2469   | 0.2828 | 0.1991
SRI_chinese_constrained_primary             | 0.2697   | 0.3154 | 0.2075
SU_chinese_constrained_primary              | 0.2547   | 0.2924 | 0.2039
UMD_chinese_constrained_primary             | 0.2506   | 0.2939 | 0.1871
XMU_chinese_constrained_primary             | 0.1979   | 0.2401 | 0.1401
bbn_chinese_constrained_primary             | 0.3059   | 0.3639 | 0.2273
google_chinese_constrained_primary          | 0.2999   | 0.3489 | 0.2344
hkust_large_primary                         | 0.1678   | 0.1891 | 0.1377
isi-lw_chinese_constrained_primary          | 0.3041   | 0.3676 | 0.2176

Chinese to English - Unconstrained Systems
BJUT_chinese_unconstrained_primary          | 0.0735   | 0.0751 | 0.0689
CMUXfer_chinese_unconstrained_primary       | 0.1310   | 0.1536 | 0.0994
NRC-SYSTRAN_chinese_unconstrained_primary   | 0.2523   | 0.2757 | 0.2192
UKA_chinese_unconstrained_primary           | 0.2406   | 0.2846 | 0.1810
cmu-smt_chinese_unconstrained_primary       | 0.2597   | 0.2909 | 0.2127
google_chinese_unconstrained_primary        | 0.3195   | 0.3701 | 0.2515

Urdu to English - Constrained Systems
CMU-XFER_urdu_constrained_primary#          | 0.1016   | 0.1827 | 0.0183
Edinburgh_urdu_constrained_primary          | 0.1456   | 0.1609 | 0.1291
IBM_urdu_constrained_primary                | 0.2026   | 0.2347 | 0.1668
MITLLAFRL_urdu_constrained_primary          | 0.1666   | 0.1939 | 0.1373
NTT_urdu_constrained_primary                | 0.1394   | 0.1630 | 0.1155
UMD_urdu_constrained_primary                | 0.1829   | 0.2160 | 0.1478
UPC_urdu_constrained_primary                | 0.1614   | 0.1878 | 0.1320
bbn_urdu_constrained_primary                | 0.2028   | 0.2388 | 0.1632
columbia_urdu_constrained_primary           | 0.1459   | 0.1714 | 0.1195
google_urdu_constrained_primary             | 0.2281   | 0.2619 | 0.1903
isi-lw_urdu_constrained_primary             | 0.1983   | 0.2292 | 0.1645
qmul_urdu_constrained_primary               | 0.1338   | 0.1578 | 0.1077

* designates primary metric
# designates system with known alignment problem, corrected system submitted late.