Chi-kiu Lo - ACL Anthology

Chi-kiu Lo

Also published as: Chi-Kiu Lo

2025

NRC Systems for the WMT2025-LRSL Shared Task
Samuel Larkin | Chi-Kiu Lo | Rebecca Knowles
Proceedings of the Tenth Conference on Machine Translation

We describe the NRC team systems for the WMT25 Shared Tasks on Large Language Models (LLMs) with Limited Resources for Slavic Languages. We participate in the Lower Sorbian and Upper Sorbian Machine Translation and Question Answering tasks. On the machine translation tasks, our primary focus, our systems rank first according to the automatic MT evaluation metric (chrF). Our systems underperform on the QA tasks.

MSLC25: Metric Performance on Low-Quality Machine Translation, Empty Strings, and Language Variants
Rebecca Knowles | Samuel Larkin | Chi-Kiu Lo
Proceedings of the Tenth Conference on Machine Translation

In this challenge set, we examine how automatic metrics for machine translation perform on a wide variety of machine translation output, covering a wider range of quality than the WMT submissions. We also explore metric results on specific types of corner cases, such as empty strings, wrong- or mixed-language text, and more. We primarily focus on Japanese–Chinese data, with some work on English and Czech.

The WMT25 Shared Task on Automated Translation Evaluation Systems evaluates metrics and quality estimation systems that assess the quality of language translation systems. This task unifies and consolidates the separate WMT shared tasks on Machine Translation Evaluation Metrics and Quality Estimation from previous years. Our primary goal is to encourage the development and assessment of new state-of-the-art translation quality evaluation systems. The shared task this year consisted of three subtasks: (1) segment-level quality score prediction, (2) span-level translation error annotation, and (3) quality-informed segment-level error correction. The evaluation data for the shared task were provided by the General MT shared task and were complemented by “challenge sets” from both the organizers and participants. Task 1 results indicate the strong performance of large LLMs at the system level, whilereference-based baseline metrics outperform LLMs at the segment level. Task 2 results indicate that accurate error detection and balancing precision and recall are persistent challenges. Task 3 results show that minimal editing is challenging even when informed by quality indicators. Robustness across the broad diversity of languages remains a major challenge across all three subtasks.

Challenges in Technical Regulatory Text Variation Detection
Shriya Vaagdevi Chikati | Samuel Larkin | David Minicola | Chi-kiu Lo
Proceedings of the 1st Regulatory NLP Workshop (RegNLP 2025)

We present a preliminary study on the feasibility of using current natural language processing techniques to detect variations between the construction codes of different jurisdictions. We formulate the task as a sentence alignment problem and evaluate various sentence representation models for their performance in this task. Our results show that task-specific trained embeddings perform marginally better than other models, but the overall accuracy remains a challenge. We also show that domain-specific fine-tuning hurts the task performance. The results highlight the challenges of developing NLP applications for technical regulatory texts.

Gender-Neutral Machine Translation Strategies in Practice
Hillary Dawkins | Isar Nejadgholi | Chi-Kiu Lo
Proceedings of the 3rd Workshop on Gender-Inclusive Translation Technologies (GITT 2025)

Gender-inclusive machine translation (MT) should preserve gender ambiguity in the source to avoid misgendering and representational harms. While gender ambiguity often occurs naturally in notional gender languages such as English, maintaining that gender neutrality in grammatical gender languages is a challenge. Here we assess the sensitivity of 21 MT systems to the need for gender neutrality in response to gender ambiguity in three translation directions of varying difficulty. The specific gender-neutral strategies that are observed in practice are categorized and discussed. Additionally, we examine the effect of binary gender stereotypes on the use of gender-neutral translation. In general, we report a disappointing absence of gender-neutral translations in response to gender ambiguity. However, we observe a small handful of MT systems that switch to gender neutral translation using specific strategies, depending on the target language.

Machine Translation Metrics for Indigenous Languages Using Fine-tuned Semantic Embeddings
Nathaniel Krasner | Justin Vasselli | Belu Ticona | Antonios Anastasopoulos | Chi-Kiu Lo
Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

This paper describes the Tekio submission to the AmericasNLP 2025 shared task on machine translation metrics for Indigenous languages. We developed two primary metric approaches leveraging multilingual semantic embeddings. First, we fine-tuned the Language-agnostic BERT Sentence Encoder (LaBSE) specifically for Guarani, Bribri, and Nahuatl, significantly enhancing semantic representation quality. Next, we integrated our fine-tuned LaBSE into the semantic similarity metric YiSi-1, exploring the effectiveness of averaging multiple layers. Additionally, we trained regression-based COMET metrics (COMET-DA) using the fine-tuned LaBSE embeddings as a semantic backbone, comparing Mean Absolute Error (MAE) and Mean Squared Error (MSE) loss functions. Our YiSi-1 metric using layer-averaged embeddings chosen by having the best performance on the development set for each individual language achieved the highest average correlation across languages among our submitted systems, and our COMET models demonstrated competitive performance for Guarani.

2024

MSLC24 Submissions to the General Machine Translation Task
Samuel Larkin | Chi-Kiu Lo | Rebecca Knowles
Proceedings of the Ninth Conference on Machine Translation

The MSLC (Metric Score Landscape Challenge) submissions for English-German, English-Spanish, and Japanese-Chinese are constrained systems built using Transformer models for the purpose of better evaluating metric performance in the WMT24 Metrics Task. They are intended to be representative of the performance of systems that can be built relatively simply using constrained data and with minimal modifications to the translation training pipeline.

MSLC24: Further Challenges for Metrics on a Wide Landscape of Translation Quality
Rebecca Knowles | Samuel Larkin | Chi-Kiu Lo
Proceedings of the Ninth Conference on Machine Translation

In this second edition of the Metric Score Landscape Challenge (MSLC), we examine how automatic metrics for machine translation perform on a wide variety of machine translation output, ranging from very low quality systems to the types of high-quality systems submitted to the General MT shared task at WMT. We also explore metric results on specific types of data, such as empty strings, wrong- or mixed-language text, and more. We raise several alarms about inconsistencies in metric scores, some of which can be resolved by increasingly explicit instructions for metric use, while others highlight technical flaws.

WMT24 Test Suite: Gender Resolution in Speaker-Listener Dialogue Roles
Hillary Dawkins | Isar Nejadgholi | Chi-Kiu Lo
Proceedings of the Ninth Conference on Machine Translation

We assess the difficulty of gender resolution in literary-style dialogue settings and the influence of gender stereotypes. Instances of the test suite contain spoken dialogue interleaved with external meta-context about the characters and the manner of speaking. We find that character and manner stereotypes outside of the dialogue significantly impact the gender agreement of referents within the dialogue.

Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task
Markus Freitag | Nitika Mathur | Daniel Deutsch | Chi-Kiu Lo | Eleftherios Avramidis | Ricardo Rei | Brian Thompson | Frederic Blain | Tom Kocmi | Jiayi Wang | David Ifeoluwa Adelani | Marianna Buchicchio | Chrysoula Zerva | Alon Lavie
Proceedings of the Ninth Conference on Machine Translation

The WMT24 Metrics Shared Task evaluated the performance of automatic metrics for machine translation (MT), with a major focus on LLM-based translations that were generated as part of the WMT24 General MT Shared Task. As LLMs become increasingly popular in MT, it is crucial to determine whether existing evaluation metrics can accurately assess the output of these systems.To provide a robust benchmark for this evaluation, human assessments were collected using Multidimensional Quality Metrics (MQM), continuing the practice from recent years. Furthermore, building on the success of the previous year, a challenge set subtask was included, requiring participants to design contrastive test suites that specifically target a metric’s ability to identify and penalize different types of translation errors.Finally, the meta-evaluation procedure was refined to better reflect real-world usage of MT metrics, focusing on pairwise accuracy at both the system- and segment-levels.We present an extensive analysis on how well metrics perform on three language pairs: English to Spanish (Latin America), Japanese to Chinese, and English to German. The results strongly confirm the results reported last year, that fine-tuned neural metrics continue to perform well, even when used to evaluate LLM-based translation systems.

Evaluation Briefs: Drawing on Translation Studies for Human Evaluation of MT
Ting Liu | Chi-kiu Lo | Elizabeth Marshman | Rebecca Knowles
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

In this position paper, we examine ways in which researchers in machine translation and translation studies have approached the problem of evaluating the output of machine translation systems and, more broadly, the questions of what it means to define translation quality. We explore their similarities and differences, highlighting the role that the purpose and context of translation plays in translation studies approaches. We argue that evaluation of machine translation (e.g., in shared tasks) would benefit from additional insights from translation studies, and we suggest the introduction of an ‘evaluation brief” (analogous to the ‘translation brief’) which could help set out useful context for annotators tasked with evaluating machine translation.

Some Tradeoffs in Continual Learning for Parliamentary Neural Machine Translation Systems
Rebecca Knowles | Samuel Larkin | Michel Simard | Marc A Tessier | Gabriel Bernier-Colborne | Cyril Goutte | Chi-kiu Lo
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

In long-term translation projects, like Parliamentary text, there is a desire to build machine translation systems that can adapt to changes over time. We implement and examine a simple approach to continual learning for neural machine translation, exploring tradeoffs between consistency, the model’s ability to learn from incoming data, and the time a client would need to wait to obtain a newly trained translation system.

2023

Metric Score Landscape Challenge (MSLC23): Understanding Metrics’ Performance on a Wider Landscape of Translation Quality
Chi-kiu Lo | Samuel Larkin | Rebecca Knowles
Proceedings of the Eighth Conference on Machine Translation

The Metric Score Landscape Challenge (MSLC23) dataset aims to gain insight into metric scores on a broader/wider landscape of machine translation (MT) quality. It provides a collection of low- to medium-quality MT output on the WMT23 general task test set. Together with the high quality systems submitted to the general task, this will enable better interpretation of metric scores across a range of different levels of translation quality. With this wider range of MT quality, we also visualize and analyze metric characteristics beyond just correlation.

Results of WMT23 Metrics Shared Task: Metrics Might Be Guilty but References Are Not Innocent
Markus Freitag | Nitika Mathur | Chi-kiu Lo | Eleftherios Avramidis | Ricardo Rei | Brian Thompson | Tom Kocmi | Frederic Blain | Daniel Deutsch | Craig Stewart | Chrysoula Zerva | Sheila Castilho | Alon Lavie | George Foster
Proceedings of the Eighth Conference on Machine Translation

This paper presents the results of the WMT23 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT23 News Translation Task. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). Following last year’s success, we also included a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics’ ability to capture and penalise specific types of translation errors. Furthermore, we improved our meta-evaluation procedure by considering fewer tasks and calculating a global score by weighted averaging across the various tasks. We present an extensive analysis on how well metrics perform on three language pairs: Chinese-English, Hebrew-English on the sentence-level and English-German on the paragraph-level. The results strongly confirm the results reported last year, that neural-based metrics are significantly better than non-neural metrics in their levels of correlation with human judgments. Further, we investigate the impact of bad reference translations on the correlations of metrics with human judgment. We present a novel approach for generating synthetic reference translations based on the collection of MT system outputs and their corresponding MQM ratings, which has the potential to mitigate bad reference issues we observed this year for some language pairs. Finally, we also study the connections between the magnitude of metric differences and their expected significance in human evaluation, which should help the community to better understand and adopt new metrics.

Beyond Correlation: Making Sense of the Score Differences of New MT Evaluation Metrics
Chi-kiu Lo | Rebecca Knowles | Cyril Goutte
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

While many new automatic metrics for machine translation evaluation have been proposed in recent years, BLEU scores are still used as the primary metric in the vast majority of MT research papers. There are many reasons that researchers may be reluctant to switch to new metrics, from external pressures (reviewers, prior work) to the ease of use of metric toolkits. Another reason is a lack of intuition about the meaning of novel metric scores. In this work, we examine “rules of thumb” about metric score differences and how they do (and do not) correspond to human judgments of statistically significant differences between systems. In particular, we show that common rules of thumb about BLEU score differences do not in fact guarantee that human annotators will find significant differences between systems. We also show ways in which these rules of thumb fail to generalize across translation directions or domains.

Data Sampling and (In)stability in Machine Translation Evaluation
Chi-kiu Lo | Rebecca Knowles
Findings of the Association for Computational Linguistics: ACL 2023

We analyze the different data sampling approaches used in selecting data for human evaluation and ranking of machine translation systems at the highly influential Conference on Machine Translation (WMT). By using automatic evaluation metrics, we are able to focus on the impact of the data sampling procedure as separate from questions about human annotator consistency. We provide evidence that the latest data sampling approach used at WMT skews the annotated data toward shorter documents, not necessarily representative of the full test set. Lastly, we examine a new data sampling method that uses the available labour budget to sample data in a more representative manner, with the goals of improving representation of various document lengths in the sample and producing more stable rankings of system translation quality.

2022

Test Set Sampling Affects System Rankings: Expanded Human Evaluation of WMT20 English-Inuktitut Systems
Rebecca Knowles | Chi-kiu Lo
Proceedings of the Seventh Conference on Machine Translation (WMT)

We present a collection of expanded human annotations of the WMT20 English-Inuktitut machine translation shared task, covering the Nunavut Hansard portion of the dataset. Additionally, we recompute News rankings to take into account the completed set of human annotations and certain irregularities in the annotation task construction. We show the effect of these changes on the downstream task of the evaluation of automatic metrics. Finally, we demonstrate that character-level metrics correlate well with human judgments for the task of automatically evaluating translation into this polysynthetic language.

Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust
Markus Freitag | Ricardo Rei | Nitika Mathur | Chi-kiu Lo | Craig Stewart | Eleftherios Avramidis | Tom Kocmi | George Foster | Alon Lavie | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper presents the results of the WMT22 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT22 News Translation Task on four different domains: news, social, ecommerce, and chat. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages, among other things: (i) expert-based evaluation is more reliable, (ii) we extended the pool of translations by 5 additional translations based on MBR decoding or rescoring which are challenging for current metrics. In addition, we initiated a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics’ ability to capture and penalise specific types of translation errors. Finally, we present an extensive analysis on how well metrics perform on three language pairs: English to German, English to Russian and Chinese to English. The results demonstrate the superiority of neural-based learned metrics and demonstrate again that overlap metrics like Bleu, spBleu or chrf correlate poorly with human ratings. The results also reveal that neural-based metrics are remarkably robust across different domains and challenges.

2021

Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain
Markus Freitag | Ricardo Rei | Nitika Mathur | Chi-kiu Lo | Craig Stewart | George Foster | Alon Lavie | Ondřej Bojar
Proceedings of the Sixth Conference on Machine Translation

This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT21 News Translation Task with automatic metrics on two different domains: news and TED talks. All metrics were evaluated on how well they correlate at the system- and segment-level with human ratings. Contrary to previous years’ editions, this year we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages: (i) expert-based evaluation has been shown to be more reliable, (ii) we were able to evaluate all metrics on two different domains using translations of the same MT systems, (iii) we added 5 additional translations coming from the same system during system development. In addition, we designed three challenge sets that evaluate the robustness of all automatic metrics. We present an extensive analysis on how well metrics perform on three language pairs: English to German, English to Russian and Chinese to English. We further show the impact of different reference translations on reference-based metrics and compare our expert-based MQM annotation with the DA scores acquired by WMT.

2020

Extended Study on Using Pretrained Language Models and YiSi-1 for Machine Translation Evaluation
Chi-kiu Lo
Proceedings of the Fifth Conference on Machine Translation

We present an extended study on using pretrained language models and YiSi-1 for machine translation evaluation. Although the recently proposed contextual embedding based metrics, YiSi-1, significantly outperform BLEU and other metrics in correlating with human judgment on translation quality, we have yet to understand the full strength of using pretrained language models for machine translation evaluation. In this paper, we study YiSi-1’s correlation with human translation quality judgment by varying three major attributes (which architecture; which inter- mediate layer; whether it is monolingual or multilingual) of the pretrained language mod- els. Results of the study show further improvements over YiSi-1 on the WMT 2019 Metrics shared task. We also describe the pretrained language model we trained for evaluating Inuktitut machine translation output.

Improving Parallel Data Identification using Iteratively Refined Sentence Alignments and Bilingual Mappings of Pre-trained Language Models
Chi-kiu Lo | Eric Joanis
Proceedings of the Fifth Conference on Machine Translation

The National Research Council of Canada’s team submissions to the parallel corpus filtering task at the Fifth Conference on Machine Translation are based on two key components: (1) iteratively refined statistical sentence alignments for extracting sentence pairs from document pairs and (2) a crosslingual semantic textual similarity metric based on a pretrained multilingual language model, XLM-RoBERTa, with bilingual mappings learnt from a minimal amount of clean parallel data for scoring the parallelism of the extracted sentence pairs. The translation quality of the neural machine translation systems trained and fine-tuned on the parallel data extracted by our submissions improved significantly when compared to the organizers’ LASER-based baseline, a sentence-embedding method that worked well last year. For re-aligning the sentences in the document pairs (component 1), our statistical approach has outperformed the current state-of-the-art neural approach in this low-resource context.

Machine Translation Reference-less Evaluation using YiSi-2 with Bilingual Mappings of Massive Multilingual Language Model
Chi-kiu Lo | Samuel Larkin
Proceedings of the Fifth Conference on Machine Translation

We present a study on using YiSi-2 with massive multilingual pretrained language models for machine translation (MT) reference-less evaluation. Aiming at finding better semantic representation for semantic MT evaluation, we first test YiSi-2 with contextual embed- dings extracted from different layers of two different pretrained models, multilingual BERT and XLM-RoBERTa. We also experiment with learning bilingual mappings that trans- form the vector subspace of the source language to be closer to that of the target language in the pretrained model to obtain more accurate cross-lingual semantic similarity representations. Our results show that YiSi-2’s correlation with human direct assessment on translation quality is greatly improved by replacing multilingual BERT with XLM-RoBERTa and projecting the source embeddings into the tar- get embedding space using a cross-lingual lin- ear projection (CLP) matrix learnt from a small development set.

This paper presents the results of the news translation task and the similar language translation task, both organised alongside the Conference on Machine Translation (WMT) 2020. In the news task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting mainly of news stories. The task was also opened up to additional test suites to probe specific aspects of translation. In the similar language translation task, participants built machine translation systems for translating between closely related pairs of languages.

The Nunavut Hansard Inuktitut–English Parallel Corpus 3.0 with Preliminary Machine Translation Results
Eric Joanis | Rebecca Knowles | Roland Kuhn | Samuel Larkin | Patrick Littell | Chi-kiu Lo | Darlene Stewart | Jeffrey Micher
Proceedings of the Twelfth Language Resources and Evaluation Conference

The Inuktitut language, a member of the Inuit-Yupik-Unangan language family, is spoken across Arctic Canada and noted for its morphological complexity. It is an official language of two territories, Nunavut and the Northwest Territories, and has recognition in additional regions. This paper describes a newly released sentence-aligned Inuktitut–English corpus based on the proceedings of the Legislative Assembly of Nunavut, covering sessions from April 1999 to June 2017. With approximately 1.3 million aligned sentence pairs, this is, to our knowledge, the largest parallel corpus of a polysynthetic language or an Indigenous language of the Americas released to date. The paper describes the alignment methodology used, the evaluation of the alignments, and preliminary experiments on statistical and neural machine translation (SMT and NMT) between Inuktitut and English, in both directions.

2019

NRC Parallel Corpus Filtering System for WMT 2019
Gabriel Bernier-Colborne | Chi-kiu Lo
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We describe the National Research Council Canada team’s submissions to the parallel corpus filtering task at the Fourth Conference on Machine Translation.

YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources
Chi-kiu Lo
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We present YiSi, a unified automatic semantic machine translation quality evaluation and estimation metric for languages with different levels of available resources. Underneath the interface with different language resources settings, YiSi uses the same representation for the two sentences in assessment. Besides, we show significant improvement in the correlation of YiSi-1’s scores with human judgment is made by using contextual embeddings in multilingual BERT–Bidirectional Encoder Representations from Transformers to evaluate lexical semantic similarity. YiSi is open source and publicly available.

Multi-Source Transformer for Kazakh-Russian-English Neural Machine Translation
Patrick Littell | Chi-kiu Lo | Samuel Larkin | Darlene Stewart
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We describe the neural machine translation (NMT) system developed at the National Research Council of Canada (NRC) for the Kazakh-English news translation task of the Fourth Conference on Machine Translation (WMT19). Our submission is a multi-source NMT taking both the original Kazakh sentence and its Russian translation as input for translating into English.

Fully Unsupervised Crosslingual Semantic Textual Similarity Metric Based on BERT for Identifying Parallel Data
Chi-kiu Lo | Michel Simard
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

We present a fully unsupervised crosslingual semantic textual similarity (STS) metric, based on contextual embeddings extracted from BERT – Bidirectional Encoder Representations from Transformers (Devlin et al., 2019). The goal of crosslingual STS is to measure to what degree two segments of text in different languages express the same meaning. Not only is it a key task in crosslingual natural language understanding (XLU), it is also particularly useful for identifying parallel resources for training and evaluating downstream multilingual natural language processing (NLP) applications, such as machine translation. Most previous crosslingual STS methods relied heavily on existing parallel resources, thus leading to a circular dependency problem. With the advent of massively multilingual context representation models such as BERT, which are trained on the concatenation of non-parallel data from each language, we show that the deadlock around parallel resources can be broken. We perform intrinsic evaluations on crosslingual STS data sets and extrinsic evaluations on parallel corpus filtering and human translation equivalence assessment tasks. Our results show that the unsupervised crosslingual STS metric using BERT without fine-tuning achieves performance on par with supervised or weakly supervised approaches.

2018

Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the Parallel Corpus Filtering task
Chi-kiu Lo | Michel Simard | Darlene Stewart | Samuel Larkin | Cyril Goutte | Patrick Littell
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present our semantic textual similarity approach in filtering a noisy web crawled parallel corpus using YiSi—a novel semantic machine translation evaluation metric. The systems mainly based on this supervised approach perform well in the WMT18 Parallel Corpus Filtering shared task (4th place in 100-million-word evaluation, 8th place in 10-million-word evaluation, and 6th place overall, out of 48 submissions). In fact, our best performing system—NRC-yisi-bicov is one of the only four submissions ranked top 10 in both evaluations. Our submitted systems also include some initial filtering steps for scaling down the size of the test corpus and a final redundancy removal step for better semantic and token coverage of the filtered corpus. In this paper, we also describe our unsuccessful attempt in automatically synthesizing a noisy parallel development corpus for tuning the weights to combine different parallelism and fluency features.

Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task
Patrick Littell | Samuel Larkin | Darlene Stewart | Michel Simard | Cyril Goutte | Chi-kiu Lo
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g. past WMT data) as a supervisory signal to learn what a “clean” corpus looks like. However, in lower-resource situations it often happens that the target corpus of the language is the only sample of parallel text in that language. We therefore made several unsupervised entries, setting ourselves an additional constraint that we not utilize the additional clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions, and for one task—translating the European Medicines Agency corpus (Tiedemann, 2009)—scored among the best systems even in the 10M-word conditions.

2017

MEANT 2.0: Accurate semantic MT evaluation for any output language
Chi-kiu Lo
Proceedings of the Second Conference on Machine Translation

NRC Machine Translation System for WMT 2017
Chi-kiu Lo | Boxing Chen | Colin Cherry | George Foster | Samuel Larkin | Darlene Stewart | Roland Kuhn
Proceedings of the Second Conference on Machine Translation

2016

NRC Russian-English Machine Translation System for WMT 2016
Chi-kiu Lo | Colin Cherry | George Foster | Darlene Stewart | Rabib Islam | Anna Kazantseva | Roland Kuhn
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

CNRC at SemEval-2016 Task 1: Experiments in Crosslingual Semantic Textual Similarity
Chi-kiu Lo | Cyril Goutte | Michel Simard
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Improving evaluation and optimization of MT systems against MEANT
Chi-kiu Lo | Philipp Dowling | Dekai Wu
Proceedings of the Tenth Workshop on Statistical Machine Translation

2014

Lexical Access Preference and Constraint Strategies for Improving Multiword Expression Association within Semantic MT Evaluation
Dekai Wu | Chi-kiu Lo | Markus Saers
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

Better Semantic Frame Based MT Evaluation via Inversion Transduction Grammars
Dekai Wu | Chi-kiu Lo | Meriem Beloucif | Markus Saers
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

XMEANT: Better semantic MT evaluation without reference translations
Chi-kiu Lo | Meriem Beloucif | Markus Saers | Dekai Wu
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

On the reliability and inter-annotator agreement of human semantic MT evaluation via HMEANT
Chi-kiu Lo | Dekai Wu
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present analyses showing that HMEANT is a reliable, accurate and fine-grained semantic frame based human MT evaluation metric with high inter-annotator agreement (IAA) and correlation with human adequacy judgments, despite only requiring a minimal training of about 15 minutes for lay annotators. Previous work shows that the IAA on the semantic role labeling (SRL) subtask within HMEANT is over 70%. In this paper we focus on (1) the IAA on the semantic role alignment task and (2) the overall IAA of HMEANT. Our results show that the IAA on the alignment task of HMEANT is over 90% when humans align SRL output from the same SRL annotator, which shows that the instructions on the alignment task are sufficiently precise, although the overall IAA where humans align SRL output from different SRL annotators falls to only 61% due to the pipeline effect on the disagreement in the two annotation task. We show that instead of manually aligning the semantic roles using an automatic algorithm not only helps maintaining the overall IAA of HMEANT at 70%, but also provides a finer-grained assessment on the phrasal similarity of the semantic role fillers. This suggests that HMEANT equipped with automatic alignment is reliable and accurate for humans to evaluate MT adequacy while achieving higher correlation with human adequacy judgments than HTER.

Improving MEANT based semantically tuned SMT
Meriem Beloucif | Chi-kiu Lo | Dekai Wu
Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign

We discuss various improvements to our MEANT tuned system, previously presented at IWSLT 2013. In our 2014 system, we incorporate this year’s improved version of MEANT, improved Chinese word segmentation, Chinese named entity recognition and dedicated proper name translation, and number expression handling. This results in a significant performance jump compared to last year’s system. We also ran preliminary experiments on tuning to IMEANT, our new ITG based variant of MEANT. The performance of tuning to IMEANT is comparable to tuning on MEANT (differences are statistically insignificant). We are presently investigating if tuning on IMEANT can produce even better results, since IMEANT was actually shown to correlate with human adequacy judgment more closely than MEANT. Finally, we ran experiments applying our new architectural improvements to a contrastive system tuned to BLEU. We observed a slightly higher jump in comparison to last year, possibly due to mismatches of MEANT’s similarity models to our new entity handling.

2013

MEANT at WMT 2013: A Tunable, Accurate yet Inexpensive Semantic Frame Based MT Evaluation Metric
Chi-kiu Lo | Dekai Wu
Proceedings of the Eighth Workshop on Statistical Machine Translation

Improving machine translation by training against an automatic semantic frame based evaluation metric
Chi-kiu Lo | Karteek Addanki | Markus Saers | Dekai Wu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Can Informal Genres be better Translated by Tuning on Automatic Semantic Metrics?
Chi-Kiu Lo | Dekai Wu
Proceedings of Machine Translation Summit XIV: Papers

Improving machine translation into Chinese by tuning against Chinese MEANT
Chi-kiu Lo | Meriem Beloucif | Dekai Wu
Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign

We present the first ever results showing that Chinese MT output is significantly improved by tuning a MT system against a semantic frame based objective function, MEANT, rather than an n-gram based objective function, BLEU, as measured across commonly used metrics and different test sets. Recent work showed that by preserving the meaning of the translations as captured by semantic frames in the training process, MT systems for translating into English on both formal and informal genres are constrained to produce more adequate translations by making more accurate choices on lexical output and reordering rules. In this paper we describe our experiments in IWSLT 2013 TED talk MT tasks on tuning MT systems against MEANT for translating into Chinese and English respectively. We show that the Chinese translation output benefits more from tuning a MT system against MEANT than the English translation output due to the ambiguous nature of word boundaries in Chinese. Our encouraging results show that using MEANT is a promising alternative to BLEU in both evaluating and tuning MT systems to drive the progress of MT research across different languages.

Human semantic MT evaluation with HMEANT for IWSLT 2013
Chi-kiu Lo | Dekai Wu
Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign

We present the results of large-scale human semantic MT evaluation with HMEANT on the IWSLT 2013 German-English MT and SLT tracks and show that HMEANT evaluates the performance of the MT systems differently compared to BLEU and TER. Together with the references, all the translations are annotated by annotators who are native English speakers in both semantic role labeling stage and role filler alignment stage of HMEANT. We obtain high inter-annotator agreement and low annotation time costs which indicate that it is feasible to run a large-scale human semantic MT evaluation campaign using HMEANT. Our results also show that HMEANT is a robust and reliable semantic MT evaluation metric for running large-scale evaluation campaigns as it is inexpensive and simple while maintaining the semantic representational transparency to provide a perspective which is different from BLEU and TER in order to understand the performance of the state-of-the-art MT systems.

2012

Accuracy and robustness in measuring the lexical similarity of semantic role fillers for automatic semantic MT evaluation
Anand Karthik Tumuluru | Chi-kiu Lo | Dekai Wu
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

Unsupervised vs. supervised weight estimation for semantic MT evaluation metrics
Chi-kiu Lo | Dekai Wu
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

Fully Automatic Semantic MT Evaluation
Chi-kiu Lo | Anand Karthik Tumuluru | Dekai Wu
Proceedings of the Seventh Workshop on Statistical Machine Translation

LTG vs. ITG Coverage of Cross-Lingual Verb Frame Alternations
Karteek Addanki | Chi-kiu Lo | Markus Saers | Dekai Wu
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

2011

Structured vs. Flat Semantic Role Representations for Machine Translation Evaluation
Chi-kiu Lo | Dekai Wu
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation

MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
Chi-kiu Lo | Dekai Wu
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Mining Parallel Documents Using Low Bandwidth and High Precision CLIR from the Heterogeneous Web
Simon Shi | Pascale Fung | Emmanuel Prochasson | Chi-kiu Lo | Dekai Wu
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

Semantic vs. Syntactic vs. N-gram Structure for Machine Translation Evaluation
Chi-kiu Lo | Dekai Wu
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

Evaluating Machine Translation Utility via Semantic Role Labels
Chi-kiu Lo | Dekai Wu
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present the methodology that underlies mew metrics for semantic machine translation evaluation we are developing. Unlike widely-used lexical and n-gram based MT evaluation metrics, the aim of semantic MT evaluation is to measure the utility of translations. We discuss the design of empirical studies to evaluate the utility of machine translation output by assessing the accuracy for key semantic roles. These roles are from the English 5W templates (who, what, when, where, why) used in recent GALE distillation evaluations. Recent work by Wu and Fung (2009) introduced semantic role labeling into statistical machine translation to enhance the quality of MT output. However, this approach has so far only been evaluated using lexical and n-gram based SMT evaluation metrics like BLEU which are not aimed at evaluating the utility of MT output. Direct data analysis are still needed to understand how semantic models can be leveraged to evaluate the utility of MT output. In this paper, we discuss a new methodology for evaluating the utility of the machine translation output, by assessing the accuracy with which human readers are able to complete the English 5W templates.

2007

HKUST statistical machine translation experiments for IWSLT 2007
Yihai Shen | Chi-kiu Lo | Marine Carpuat | Dekai Wu
Proceedings of the Fourth International Workshop on Spoken Language Translation

This paper describes the HKUST experiments in the IWSLT 2007 evaluation campaign on spoken language translation. Our primary objective was to compare the open-source phrase-based statistical machine translation toolkit Moses against Pharaoh. We focused on Chinese to English translation, but we also report results on the Arabic to English, Italian to English, and Japanese to English tasks.

Co-authors

Markus Freitag 5

Michel Simard 5

Eleftherios Avramidis 4

Meriem Beloucif 4

Patrick Littell 4

Nitika Mathur 4

Frédéric Blain 3

Daniel Deutsch 3

Craig Stewart 3

Brian Thompson 3

Chrysoula Zerva 3

Karteek Addanki 2

David Ifeoluwa Adelani 2

Gabriel Bernier-Colborne 2

Ondřej Bojar 2

Hillary Dawkins 2

Isar Nejadgholi 2

Anand Karthik Tumuluru 2

Sweta Agrawal 1

Antonios Anastasopoulos 1

Loic Barrault 1

Magdalena Biesialska 1

Marianna Buchicchio 1

Marine Carpuat 1

Sheila Castilho 1

Shriya Vaagdevi Chikati 1

Marta R. Costa-jussà 1

Sourabh Deoghare 1

Philipp Dowling 1

Christian Federmann 1

Yvette Graham 1

Roman Grundkiewicz 1

Greg Hanneman 1

Matthias Huck 1

Diptesh Kanojia 1

Anna Kazantseva 1

Philipp Koehn 1

Nathaniel Krasner 1

Nikola Ljubešić 1

Elizabeth Marshman 1

André F. T. Martins 1

Jeffrey Micher 1

David Minicola 1

Christof Monz 1

Makoto Morishita 1

Masaaki Nagata 1

Toshiaki Nakazawa 1

Emmanuel Prochasson 1

Archchana Sindhujan 1

Marc A Tessier 1

Justin Vasselli 1

Marcos Zampieri 1

Vilém Zouhar 1

Venues