NIST 2006 Machine Translation Evaluation Official Results

Date of Updated Release: November 1, 2006, version 4

The NIST 2006 Machine Translation Evaluation (MT-06) was part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations to support machine translation (MT) research and help advance the state of the art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT-06 evaluation plan.

Disclaimer: These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT-06 was an evaluation of research algorithms, the MT-06 test design required local implementation by each participant. As such, participants were only required to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST. There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.

The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain, or in the amount of data used to build a system, can greatly influence system performance, and changes in the task protocols could reveal different performance strengths and weaknesses for these same systems.

For these reasons, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best suited to a particular application.

Evaluation Tasks

The MT-06 evaluation consisted of two tasks. Each task required a system to perform translation from a given source language into the target language. The source languages were Arabic and Chinese, and the target language was English.

  • Translate Arabic text into English text
  • Translate Chinese text into English text

Evaluation Conditions

MT research and development requires language data resources. System performance is strongly affected by the type and amount of resources used. Therefore, two different resource categories were defined as conditions of evaluation. The categories differed solely by the amount of data that was available for use in system training and development. The evaluation conditions were called "Large Data Track" and "Unlimited Data Track".

  • Large Data Track – limited the training data to data in the LDC public catalogue existing before February 1st, 2006.
  • Unlimited Data Track – extended the training data to any publicly available data existing before February 1st, 2006.

and an unofficial track, added late to the evaluation to accommodate the use of non-publicly available data:

  • Unlimited Plus Data Track – further extended the training data to include non-publicly available data existing before February 1st, 2006. [see end of page]

Submissions that do not fall into the categories described above are not reported here.

Evaluation Data

Source Data

In an effort to reduce data creation costs, the MT-06 evaluation made use of GALE-06 evaluation data (GALE subset). NIST augmented the GALE subset with additional data of equal or greater size for most of the genres (NIST subset). This provided a larger and more diverse test set. Each set contained documents drawn from newswire text documents, web-based newsgroup documents, human transcription of broadcast news, and human transcription of broadcast conversations. The source documents were encoded in UTF-8.

The test data was selected from a pool of data collected by the LDC during February 2006. The careful selection process sought to include a variety of sources (see below), publication dates, and difficulty ratings while meeting the target test set size.

Genre                    Arabic Sources                                          Arabic Target Size    Chinese Sources                                        Chinese Target Size

Newswire                 Agence France Presse, Assabah, Xinhua News Agency       30K                   Agence France Presse, Xinhua News Agency               30K
Newsgroup                Google's groups, Yahoo's groups                         20K                   Google's groups                                        20K
Broadcast News           Dubai TV, Al Jazeera, Lebanese Broadcast Corporation    20K                   Central China TV, New Tang Dynasty TV, Phoenix TV      20K
Broadcast Conversation   Dubai TV, Al Jazeera, Lebanese Broadcast Corporation    10K                   Central China TV, New Tang Dynasty TV, Phoenix TV      10K

Target sizes are given in number of reference words.

Reference Data

The GALE subset had one adjudicated high quality translation that was produced by the National Virtual Translation Center. The NIST subset had four independently generated high quality translations that were produced by professional translation companies. In both subsets, each translation agency was required to have native speakers of the source and target languages working on the translations.

Performance Measurement

Machine translation quality was measured automatically using an N-gram co-occurrence metric developed by IBM and referred to as BLEU. BLEU measures translation accuracy according to the N-grams, or sequences of N words, that the system output shares with one or more high quality reference translations: the more co-occurrences, the better the score. BLEU is an accuracy metric, ranging from "0" to "1" with "1" being the best possible score. A detailed description of BLEU can be found in the paper by Papineni, Roukos, Ward, and Zhu (2001), "BLEU: a Method for Automatic Evaluation of Machine Translation" (IBM Research Report RC22176).
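
For illustration only, the short Python sketch below computes a corpus-level BLEU-4 score in the spirit of the metric described above. It uses NLTK's BLEU implementation rather than the official NIST scoring software, and the segments, tokenization, and smoothing choice are assumptions made for the example, so its output is not comparable to the official MT-06 results.

```python
# A minimal, hypothetical sketch: corpus-level BLEU-4 with NLTK,
# not the official NIST scoring software used for MT-06.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One tokenized system output per segment (hypothetical data).
hypotheses = [
    "the cat sat on the mat".split(),
    "there is a book on the table".split(),
]
# For each segment, a list of one or more tokenized reference translations.
references = [
    ["the cat sat on the mat".split(), "a cat was sitting on the mat".split()],
    ["a book is on the table".split()],
]

# BLEU-4: geometric mean of 1- to 4-gram precisions (uniform weights),
# multiplied by a brevity penalty computed over the whole corpus.
score = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,  # avoids zero scores on tiny corpora
)
print(f"corpus BLEU-4: {score:.4f}")
```

Official MT-06 scoring applied additional conventions (for example, case-sensitive scoring), so an off-the-shelf BLEU implementation will generally not reproduce the reported values exactly.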

Although BLEU was the official metric for MT-06, measuring translation quality is an ongoing research topic in the MT community. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance. Three additional automatic metrics (METEOR, TER, and BLEU-refinement), as well as human assessment, were also used to report system performance. As stated in the evaluation specification document, this official public version of the results reports only the scores as measured by BLEU.

Evaluation Participants

The table below lists the organizations involved in submitting MT-06 evaluation results. Most submitted results representing their own organizations, some participated only in a collaborative effort (marked by the @ symbol), and some did both (marked by the + symbol).

Site ID      Organization                                                        Location
apptek       Applications Technology Inc.                                        USA
arl          Army Research Laboratory                                            USA
auc          The American University in Cairo                                    Egypt
bbn          BBN Technologies                                                    USA
cu           Cambridge University                                                UK
cmu          Carnegie Mellon University                                          USA
casia        Institute of Automation Chinese Academy of Sciences                 China
columbia     Columbia University                                                 USA
dcu          Dublin City University                                              Ireland
google       Google                                                              USA
hkust        Hong Kong University of Science and Technology                      China
ibm          IBM                                                                 USA
ict          Institute of Computing Technology Chinese Academy of Sciences       China
iscas        Institute of Software Chinese Academy of Sciences                   China
isi          Information Sciences Institute+                                     USA
itcirst      ITC-irst                                                            Italy
jhu          Johns Hopkins University                                            USA
ksu          Kansas State University                                             USA
kcsl         KCSL Inc.                                                           Canada
lw           Language Weaver                                                     USA
lcc          Language Computer                                                   USA
lingua       Lingua Technologies Inc.                                            Canada
msr          Microsoft Research                                                  USA
mit          MIT                                                                 USA
nict         National Institute of Information and Communications Technology    Japan
nlmp         National Laboratory on Machine Perception Peking University         China
ntt          NTT Communication Science Laboratories                              Japan
nrc          National Research Council Canada                                    Canada
qmul         Queen Mary University of London                                     England
rwth         RWTH Aachen University+                                             Germany
sakhr        Sakhr Software Co.                                                  USA
sri          SRI International                                                   USA
ucb          University of California Berkeley                                   USA
edinburgh    University of Edinburgh                                             Scotland
uka          University of Karlsruhe                                             Germany
umd          University of Maryland                                              USA
upenn        University of Pennsylvania                                          USA
upc          Universitat Politecnica de Catalunya                                Spain
uw           University of Washington                                            USA
xmu          Xiamen University                                                   China

Site ID           Team/Collaboration                                                                                        Location
arl-cmu           Army Research Laboratory & Carnegie Mellon University                                                     USA
cmu-uka           Carnegie Mellon University & University of Karlsruhe                                                      USA, Germany
edinburgh-mit     University of Edinburgh & MIT                                                                             Scotland, USA
isi-cu            Information Sciences Institute & Cambridge University                                                     USA, England
rwth-sri-nrc-uw   RWTH Aachen University, SRI International, National Research Council Canada, University of Washington    Germany, USA, Canada, USA
umd-jhu           University of Maryland & Johns Hopkins University                                                         USA

  • DFKI GmbH registered but dropped out of the evaluation on July 28, 2006.
  • Fitchburg State College registered but dropped out of the evaluation on August 3, 2006.

Evaluation Systems

Each site/team could submit one or more systems for evaluation, with one system marked as its primary system. The primary system indicated the site/team's best effort. This official public version of the results reports results only for the primary systems.

Evaluation Results

The tables below list the results of the NIST 2006 Machine Translation Evaluation. The results are sorted by BLEU score and reported separately for the GALE subset and the NIST subset because the two subsets do not have the same number of reference translations. The results are also reported for each data domain. Note that scoring was case-sensitive, so case errors are reflected in these scores.

Friedman's Rank Test for k Correlated Samples was used to test for significant difference among the systems. The initial null hypothesis was that all systems were the same. If the null hypothesis was rejected at the 95% level of confidence, the lowest scoring system was taken out of the pool of systems to be tested, and the Friedman's Rank Test was repeated for the remaining systems until no significant difference was found. The remaining systems that were not removed from the pool were deemed to be statistically equivalent. The process was repeated for the systems taken out of the pool. Alternating colors (white and yellow backgrounds) show the different groups.
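
As an illustration of this grouping procedure, the sketch below runs one pass of the removal loop using SciPy's Friedman test on hypothetical per-document scores. The data, the interpretation of the 0.05 threshold, and the use of the mean score to identify the lowest-scoring system are assumptions made for the example; this is not the exact NIST implementation.

```python
# A minimal sketch of the first pass of the grouping procedure described above,
# using SciPy's Friedman test on hypothetical per-document scores.
from scipy.stats import friedmanchisquare

# Per-document scores for each system, aligned on the same documents (hypothetical).
scores = {
    "sysA": [0.41, 0.38, 0.45, 0.36, 0.40, 0.43],
    "sysB": [0.39, 0.37, 0.44, 0.35, 0.38, 0.41],
    "sysC": [0.30, 0.29, 0.33, 0.28, 0.31, 0.32],
    "sysD": [0.22, 0.20, 0.25, 0.19, 0.23, 0.24],
}

pool = dict(scores)
removed = []                                  # systems dropped, lowest first
while len(pool) >= 3:                         # the test needs at least 3 samples
    statistic, p_value = friedmanchisquare(*pool.values())
    if p_value >= 0.05:                       # null hypothesis not rejected at 95% confidence
        break                                 # remaining systems form one equivalence group
    # Drop the lowest-scoring system (here: lowest mean score) and retest the rest.
    worst = min(pool, key=lambda s: sum(pool[s]) / len(pool[s]))
    removed.append(worst)
    del pool[worst]

print("statistically equivalent top group:", sorted(pool))
print("removed in order:", removed)
```

In the official analysis, the same procedure is then repeated on the systems that were removed from the pool, yielding the successive groups shown in the original color-coded tables.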

Key:

  • An asterisk (*) indicates a late submission.
  • A pound sign (#) indicates a bug-fix submission. A bug-fix means that one or more errors were found in the system during the testing period; the site fixed the errors and reran the test. Format errors do not count as bug-fixes.

Note: Site 'nlmp' was unable to process the entire test set. No result is listed for that site.

Arabic-to-English Results

Large Data Track

NIST Subset

Overall BLEU Scores

Site ID              BLEU-4
google               0.4281
ibm                  0.3954
isi                  0.3908
rwth                 0.3906
apptek*#             0.3874
lw                   0.3741
bbn                  0.3690
ntt                  0.3680
itcirst              0.3466
cmu-uka              0.3369
umd-jhu              0.3333
edinburgh*#          0.3303
sakhr                0.3296
nict                 0.2930
qmul                 0.2896
lcc                  0.2778
upc                  0.2741
columbia             0.2465
ucb                  0.1978
auc                  0.1531
dcu                  0.0947
kcsl*#               0.0522

Newswire BLEU Scores

Site ID              BLEU-4
google               0.4814
ibm                  0.4542
rwth                 0.4441
isi                  0.4426
lw                   0.4368
bbn                  0.4254
apptek*#             0.4212
ntt                  0.4035
umd-jhu              0.3997
edinburgh*#          0.3945
cmu-uka              0.3943
itcirst              0.3798
qmul                 0.3737
sakhr                0.3736
nict                 0.3568
lcc                  0.3089
upc                  0.3049
columbia             0.2759
ucb                  0.2369
auc                  0.1750
dcu                  0.0875
kcsl*#               0.0423

Newsgroup BLEU Scores

Site ID              BLEU-4
apptek*#             0.3311
google               0.3225
ntt                  0.2973
isi                  0.2895
ibm                  0.2774
bbn                  0.2771
rwth                 0.2726
itcirst              0.2696
sakhr                0.2634
lw                   0.2503
cmu                  0.2436
edinburgh*#          0.2208
lcc                  0.2135
columbia             0.2111
umd-jhu              0.2059
nict                 0.1875
upc                  0.1842
ucb                  0.1690
dcu                  0.1177
qmul                 0.1116
auc                  0.1099
kcsl*#               0.0770

Broadcast News BLEU Scores

Site ID              BLEU-4
google               0.3781
apptek*#             0.3729
lw                   0.3646
isi                  0.3630
ibm                  0.3612
rwth                 0.3511
ntt                  0.3324
bbn                  0.3302
umd-jhu              0.3148
itcirst              0.3128
edinburgh*#          0.2925
cmu                  0.2874
sakhr                0.2814
qmul                 0.2768
upc                  0.2463
nict                 0.2458
lcc                  0.2445
columbia             0.2054
auc                  0.1419
ucb                  0.1114
dcu                  0.0594
kcsl*#               0.0326

GALE Subset

Overall BLEU Scores

Site ID              BLEU-4
apptek*#             0.1918
google               0.1826
isi                  0.1714
ibm                  0.1674
sakhr                0.1648
rwth                 0.1639
lw                   0.1594
ntt                  0.1533
itcirst              0.1475
bbn                  0.1461
cmu                  0.1392
umd-jhu              0.1370
qmul                 0.1345
edinburgh*#          0.1305
nict                 0.1192
upc                  0.1149
lcc                  0.1129
columbia             0.0960
ucb                  0.0732
auc                  0.0635
dcu                  0.0320
kcsl*#               0.0176

Newswire BLEU Scores

Site ID              BLEU-4
google               0.2647
ibm                  0.2432
isi                  0.2300
rwth                 0.2263
apptek*#             0.2225
sakhr                0.2196
lw                   0.2193
ntt                  0.2180
bbn                  0.2170
itcirst              0.2104
umd-jhu              0.2084
cmu                  0.2055
edinburgh*#          0.2052
qmul                 0.1984
nict                 0.1773
lcc                  0.1648
upc                  0.1575
columbia             0.1438
ucb                  0.1299
auc                  0.0937
dcu                  0.0466
kcsl*#               0.0182

Newsgroup BLEU Scores

Site ID              BLEU-4
apptek*#             0.1747
sakhr                0.1331
google               0.1130
ibm                  0.1060
rwth                 0.1017
isi                  0.0918
ntt                  0.0906
lw                   0.0853
cmu                  0.0840
bbn                  0.0837
itcirst              0.0821
qmul                 0.0818
umd-jhu              0.0754
edinburgh*#          0.0681
lcc                  0.0643
nict                 0.0639
columbia             0.0634
upc                  0.0603
ucb                  0.0411
auc                  0.0326
dcu                  0.0254
kcsl*#               0.0089

Broadcast News BLEU Scores

Site ID              BLEU-4
apptek*#             0.1944
isi                  0.1766
google               0.1721
lw                   0.1649
rwth                 0.1599
ibm                  0.1588
sakhr                0.1495
itcirst              0.1471
ntt                  0.1469
bbn                  0.1391
cmu                  0.1362
umd-jhu              0.1309
qmul                 0.1266
edinburgh*#          0.1240
nict                 0.1152
upc                  0.1150
lcc                  0.1016
columbia             0.0879
auc                  0.0619
ucb                  0.0412
dcu                  0.0252
kcsl*#               0.0229

Broadcast Conversation BLEU Scores

Site ID              BLEU-4
isi                  0.1756
apptek*#             0.1747
google               0.1745
rwth                 0.1615
lw                   0.1582
ibm                  0.1563
ntt                  0.1512
sakhr                0.1446
itcirst              0.1425
bbn                  0.1400
umd-jhu              0.1277
qmul                 0.1265
cmu                  0.1261
edinburgh*#          0.1203
upc                  0.1200
lcc                  0.1157
nict                 0.1156
columbia             0.0866
ucb                  0.0783
auc                  0.0620
dcu                  0.0306
kcsl*#               0.0183

Unlimited Data Track

NIST Subset

Overall BLEU Scores

Site ID              BLEU-4
google               0.4535
lw                   0.4008
rwth                 0.3970
rwth+sri+nrc+uw*     0.3966
nrc                  0.3750
sri                  0.3743
edinburgh*#          0.3449
cmu                  0.3376
arl-cmu              0.1424

Newswire BLEU Scores

Site ID              BLEU-4
google               0.5034
lw                   0.4589
rwth+sri+nrc+uw*     0.4493
rwth                 0.4458
nrc                  0.4300
sri                  0.4240
edinburgh*#          0.4133
cmu                  0.3974
arl-cmu              0.1402

Newsgroup BLEU Scores

Site ID              BLEU-4
google               0.3652
lw                   0.2851
rwth                 0.2829
nrc                  0.2799
rwth+sri+nrc+uw*     0.2755
sri                  0.2534
cmu                  0.2372
edinburgh*#          0.2287
arl-cmu              0.1485

Broadcast News BLEU Scores

Site ID              BLEU-4
google               0.4018
lw                   0.3685
rwth                 0.3662
rwth+sri+nrc+uw*     0.3639
sri                  0.3326
nrc                  0.3312
edinburgh*#          0.3049
cmu                  0.2988
arl-cmu              0.1363

GALE Subset

Overall BLEU Scores

Site ID              BLEU-4
google               0.1957
lw                   0.1721
rwth+sri+nrc+uw*     0.1710
rwth                 0.1680
sri                  0.1614
nrc                  0.1517
cmu                  0.1382
edinburgh*#          0.1365
arl-cmu              0.0736

Newswire BLEU Scores

Site ID              BLEU-4
google               0.2812
lw                   0.2294
rwth+sri+nrc+uw*     0.2289
rwth                 0.2258
nrc                  0.2172
sri                  0.2081
edinburgh*#          0.2068
cmu                  0.2006
arl-cmu              0.0858

Newsgroup BLEU Scores

Site ID              BLEU-4
google               0.1267
rwth                 0.1133
rwth+sri+nrc+uw*     0.1078
lw                   0.1007
nrc                  0.1007
sri                  0.0953
cmu                  0.0894
edinburgh*#          0.0722
arl-cmu              0.0558

Broadcast News BLEU Scores

Site ID              BLEU-4
google               0.1868
rwth+sri+nrc+uw*     0.1730
lw                   0.1715
sri                  0.1661
rwth                 0.1625
nrc                  0.1415
edinburgh*#          0.1293
cmu                  0.1276
arl-cmu              0.0855

Broadcast Conversation BLEU Scores

Site ID              BLEU-4
google               0.1824
lw                   0.1756
rwth+sri+nrc+uw*     0.1676
sri                  0.1671
rwth                 0.1658
nrc                  0.1429
edinburgh*#          0.1341
cmu                  0.1322
arl-cmu              0.0584

Chinese-to-English Results

Large Data Track

NIST Subset

Overall BLEU Scores

Site ID              BLEU-4
isi                  0.3393
google               0.3316
lw                   0.3278
rwth                 0.3022
ict                  0.2913
edinburgh*#          0.2830
bbn                  0.2781
nrc                  0.2762
itcirst              0.2749
umd-jhu              0.2704
ntt                  0.2595
nict                 0.2449
cmu                  0.2348
msr                  0.2314
qmul                 0.2276
hkust                0.2080
upc                  0.2071
upenn                0.1958
iscas                0.1816
lcc                  0.1814
xmu                  0.1580
lingua*              0.1341
kcsl*#               0.0512
ksu                  0.0401

Newswire BLEU Scores

Site ID              BLEU-4
isi                  0.3486
google               0.3470
lw                   0.3404
ict                  0.3085
rwth                 0.3022
nrc                  0.2867
umd-jhu              0.2863
edinburgh*#          0.2776
bbn                  0.2774
itcirst              0.2739
ntt                  0.2656
nict                 0.2509
cmu                  0.2496
msr                  0.2387
qmul                 0.2299
upenn                0.2064
upc                  0.2057
hkust                0.1999
lcc                  0.1721
iscas                0.1715
xmu                  0.1619
lingua*              0.1412
kcsl*#               0.0510
ksu                  0.0380

Newsgroup BLEU Scores

Site ID              BLEU-4
google               0.2620
isi                  0.2571
lw                   0.2454
edinburgh*#          0.2434
rwth                 0.2417
nrc                  0.2330
ict                  0.2325
bbn                  0.2275
itcirst              0.2264
umd-jhu              0.2061
ntt                  0.2036
nict                 0.2006
msr                  0.1878
cmu                  0.1865
hkust                0.1851
qmul                 0.1840
iscas                0.1681
upenn                0.1665
lcc                  0.1634
upc                  0.1619
xmu                  0.1406
lingua*              0.1207
kcsl*#               0.0531
ksu                  0.0361

Broadcast News BLEU Scores

Site ID              BLEU-4
rwth                 0.3501
google               0.3481
isi                  0.3463
lw                   0.3327
bbn                  0.3197
edinburgh*#          0.3172
itcirst              0.3128
ict                  0.2977
ntt                  0.2928
umd-jhu              0.2928
nrc                  0.2914
qmul                 0.2571
nict                 0.2568
msr                  0.2527
cmu                  0.2468
upc                  0.2403
hkust                0.2376
iscas                0.2090
lcc                  0.2046
upenn                0.2008
xmu                  0.1652
lingua*              0.1323
kcsl*#               0.0475
ksu                  0.0464

GALE Subset

Overall BLEU Scores

Site ID              BLEU-4
google               0.1470
isi                  0.1413
lw                   0.1299
edinburgh*#          0.1199
itcirst              0.1194
nrc                  0.1194
rwth                 0.1187
ict                  0.1185
bbn                  0.1165
umd-jhu              0.1140
cmu                  0.1135
ntt                  0.1116
nict                 0.1106
hkust                0.0984
msr                  0.0972
qmul                 0.0943
upc                  0.0931
upenn                0.0923
iscas                0.0860
lcc                  0.0813
xmu                  0.0747
lingua*              0.0663
ksu                  0.0218
kcsl*#               0.0199

Newswire BLEU Scores

Site ID              BLEU-4
google               0.1905
isi                  0.1685
lw                   0.1596
ict                  0.1515
edinburgh*#          0.1467
rwth                 0.1448
bbn                  0.1433
umd-jhu              0.1419
nrc                  0.1404
itcirst              0.1377
cmu                  0.1353
ntt                  0.1350
msr                  0.1280
hkust                0.1161
nict                 0.1155
qmul                 0.1102
upenn                0.1068
upc                  0.1039
iscas                0.0947
lcc                  0.0878
xmu                  0.0861
lingua*              0.0657
kcsl*#               0.0178
ksu                  0.0138

Newsgroup BLEU Scores

Site ID              BLEU-4
google               0.1365
isi                  0.1235
edinburgh*#          0.1140
lw                   0.1137
ict                  0.1130
itcirst              0.1108
nrc                  0.1098
nict                 0.1075
rwth                 0.1071
cmu                  0.1054
bbn                  0.1049
ntt                  0.1026
umd-jhu              0.0978
upenn                0.0941
hkust                0.0892
qmul                 0.0858
upc                  0.0851
msr                  0.0841
lcc                  0.0765
iscas                0.0745
lingua*              0.0687
xmu                  0.0681
ksu                  0.0249
kcsl*#               0.0177

Broadcast News BLEU Scores

Site ID              BLEU-4
isi                  0.1441
google               0.1409
lw                   0.1343
rwth                 0.1231
itcirst              0.1193
nrc                  0.1192
cmu                  0.1159
bbn                  0.1146
ict                  0.1146
edinburgh*#          0.1110
ntt                  0.1096
nict                 0.1090
umd-jhu              0.1084
hkust                0.1005
upc                  0.0986
qmul                 0.0951
msr                  0.0922
iscas                0.0891
upenn                0.0882
lcc                  0.0814
xmu                  0.0705
lingua*              0.0609
kcsl*#               0.0204
ksu                  0.0192

Broadcast Conversation BLEU Scores

Site ID              BLEU-4
isi                  0.1280
google               0.1262
edinburgh*#          0.1119
lw                   0.1112
itcirst              0.1106
nict                 0.1106
umd-jhu              0.1102
nrc                  0.1095
bbn                  0.1060
ntt                  0.1016
rwth                 0.1013
ict                  0.0990
cmu                  0.0973
hkust                0.0891
msr                  0.0873
qmul                 0.0870
upc                  0.0848
iscas                0.0842
upenn                0.0815
lcc                  0.0796
xmu                  0.0753
lingua*              0.0700
ksu                  0.0270
kcsl*#               0.0223

Unlimited Data Track

NIST Subset

Overall BLEU Scores

Site ID              BLEU-4
google               0.3496
rwth                 0.2975
edinburgh*#          0.2843
cmu                  0.2449
casia                0.1894
xmu                  0.1713

Newswire BLEU Scores

Site ID              BLEU-4
google               0.3634
rwth                 0.2974
edinburgh*#          0.2852
cmu                  0.2430
casia                0.1905
xmu                  0.1696

Newsgroup BLEU Scores

Site ID              BLEU-4
google               0.2870
edinburgh*#          0.2450
rwth                 0.2307
cmu                  0.2004
casia                0.1709
xmu                  0.1618

Broadcast News BLEU Scores

Site ID              BLEU-4
google               0.3649
rwth                 0.3509
edinburgh*#          0.3142
cmu                  0.2644
casia                0.1889
xmu                  0.1818

GALE Subset

Overall BLEU Scores

Site ID              BLEU-4
google               0.1526
edinburgh*#          0.1187
rwth                 0.1172
cmu                  0.1034
casia                0.0900
xmu                  0.0793

Newswire BLEU Scores

Site ID              BLEU-4
google               0.2057
edinburgh*#          0.1465
rwth                 0.1436
cmu                  0.1158
casia                0.1001
xmu                  0.0817

Newsgroup BLEU Scores

Site ID              BLEU-4
google               0.1432
edinburgh*#          0.1070
rwth                 0.1032
cmu                  0.1015
casia                0.0916
xmu                  0.0782

Broadcast News BLEU Scores

Site ID              BLEU-4
google               0.1482
rwth                 0.1224
edinburgh*#          0.1090
cmu                  0.1020
casia                0.0891
xmu                  0.0775

Broadcast Conversation BLEU Scores

Site ID              BLEU-4
google               0.1206
edinburgh*#          0.1157
rwth                 0.1010
cmu                  0.0957
casia                0.0812
xmu                  0.0801



Unlimited Plus Data Track

NIST data set (BLEU-4 scores)

Site ID    Language    Overall    Newswire    Newsgroup    Broadcast News
google     Arabic      0.4569     0.5060      0.3727       0.4076
google     Chinese     0.3615     0.3725      0.2926       0.3859

GALE data set (BLEU-4 scores)

Site ID    Language    Overall    Newswire    Newsgroup    Broadcast News    Broadcast Conversation
google     Arabic      0.2024     0.2820      0.1359       0.1932            0.1925
google     Chinese     0.1576     0.2086      0.1454       0.1532            0.1300



Release History

  • Version 1: Initial release of preliminary results to evaluation participants
  • Version 2: Added late and bug-fixed submissions; added METEOR, TER, and BLEU-refinement scores.
  • Version 3: Public version of the results (included only the BLEU scores for primary systems)
  • Version 4: Public version of the results, updated disclaimer