Runzhe Zhan


pdf bib
Test-time Adaptation for Machine Translation Evaluation by Uncertainty Minimization
Runzhe Zhan | Xuebo Liu | Derek F. Wong | Cuilian Zhang | Lidia S. Chao | Min Zhang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The neural metrics recently received considerable attention from the research community in the automatic evaluation of machine translation. Unlike text-based metrics that have interpretable and consistent evaluation mechanisms for various data sources, the reliability of neural metrics in assessing out-of-distribution data remains a concern due to the disparity between training data and real-world data. This paper aims to address the inference bias of neural metrics through uncertainty minimization during test time, without requiring additional data. Our proposed method comprises three steps: uncertainty estimation, test-time adaptation, and inference. Specifically, the model employs the prediction uncertainty of the current data as a signal to update a small fraction of parameters during test time and subsequently refine the prediction through optimization. To validate our approach, we apply the proposed method to three representative models and conduct experiments on the WMT21 benchmarks. The results obtained from both in-domain and out-of-distribution evaluations consistently demonstrate improvements in correlation performance across different models. Furthermore, we provide evidence that the proposed method effectively reduces model uncertainty. The code is publicly available at

pdf bib
Revisiting Commonsense Reasoning in Machine Translation: Training, Evaluation and Challenge
Xuebo Liu | Yutong Wang | Derek F. Wong | Runzhe Zhan | Liangxuan Yu | Min Zhang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The ability of commonsense reasoning (CR) decides whether a neural machine translation (NMT) model can move beyond pattern recognition. Despite the rapid advancement of NMT and the use of pretraining to enhance NMT models, research on CR in NMT is still in its infancy, leaving much to be explored in terms of effectively training NMT models with high CR abilities and devising accurate automatic evaluation metrics. This paper presents a comprehensive study aimed at expanding the understanding of CR in NMT.For the training, we confirm the effectiveness of incorporating pretrained knowledge into NMT models and subsequently utilizing these models as robust testbeds for investigating CR in NMT. For the evaluation, we propose a novel entity-aware evaluation method that takes into account both the NMT candidate and important entities in the candidate, which is more aligned with human judgement. Based on the strong testbed and evaluation methods, we identify challenges in training NMT models with high CR abilities and suggest directions for further unlabeled data utilization and model design. We hope that our methods and findings will contribute to advancing the research of CR in NMT. Source data, code and scripts are freely available at

pdf bib
Towards Zero-Shot Multilingual Poetry Translation
Wai Lei Song | Haoyun Xu | Derek F. Wong | Runzhe Zhan | Lidia S. Chao | Shanshan Wang
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

The application of machine translation in the field of poetry has always presented significant challenges. Conventional machine translation techniques are inadequate for capturing and translating the unique style of poetry. The absence of a parallel poetry corpus and the distinctive structure of poetry further restrict the effectiveness of traditional methods. This paper introduces a zero-shot method that is capable of translating poetry style without the need for a large-scale training corpus. Specifically, we treat poetry translation as a standard machine translation problem and subsequently inject the poetry style upon completion of the translation process. Our injection model only requires back-translation and easily obtainable monolingual data, making it a low-cost solution. We conducted experiments on three translation directions and presented automatic and human evaluations, demonstrating that our proposed method outperforms existing online systems and other competitive baselines. These results validate the feasibility and potential of our proposed approach and provide new prospects for poetry translation.

pdf bib
Human-in-the-loop Machine Translation with Large Language Model
Xinyi Yang | Runzhe Zhan | Derek F. Wong | Junchao Wu | Lidia S. Chao
Proceedings of Machine Translation Summit XIX, Vol. 2: Users Track

The large language model (LLM) has garnered significant attention due to its in-context learning mechanisms and emergent capabilities. The research community has conducted several pilot studies to apply LLMs to machine translation tasks and evaluate their performance from diverse perspectives. However, previous research has primarily focused on the LLM itself and has not explored human intervention in the inference process of LLM. The characteristics of LLM, such as in-context learning and prompt engineering, closely mirror human cognitive abilities in language tasks, offering an intuitive solution for human-in-the-loop generation. In this study, we propose a human-in-the-loop pipeline that guides LLMs to produce customized outputs with revision instructions. The pipeline initiates by prompting the LLM to produce a draft translation, followed by the utilization of automatic retrieval or human feedback as supervision signals to enhance the LLM’s translation through in-context learning. The human-machine interactions generated in this pipeline are also stored in an external database to expand the in-context retrieval database, enabling us to leverage human supervision in an offline setting. We evaluate the proposed pipeline using the GPT-3.5-turbo API on five domain-specific benchmarks for German-English translation. The results demonstrate the effectiveness of the pipeline in tailoring in-domain translations and improving translation performance compared to direct translation instructions. Additionally, we discuss the experimental results from the following perspectives: 1) the effectiveness of different in-context retrieval methods; 2) the construction of a retrieval database under low-resource scenarios; 3) the observed differences across selected domains; 4) the quantitative analysis of sentence-level and word-level statistics; and 5) the qualitative analysis of representative translation cases.

pdf bib
TransGEC: Improving Grammatical Error Correction with Translationese
Tao Fang | Xuebo Liu | Derek F. Wong | Runzhe Zhan | Liang Ding | Lidia S. Chao | Dacheng Tao | Min Zhang
Findings of the Association for Computational Linguistics: ACL 2023

Data augmentation is an effective way to improve model performance of grammatical error correction (GEC). This paper identifies a critical side-effect of GEC data augmentation, which is due to the style discrepancy between the data used in GEC tasks (i.e., texts produced by non-native speakers) and data augmentation (i.e., native texts). To alleviate this issue, we propose to use an alternative data source, translationese (i.e., human-translated texts), as input for GEC data augmentation, which 1) is easier to obtain and usually has better quality than non-native texts, and 2) has a more similar style to non-native texts. Experimental results on the CoNLL14 and BEA19 English, NLPCC18 Chinese, Falko-MERLIN German, and RULEC-GEC Russian GEC benchmarks show that our approach consistently improves correction accuracy over strong baselines. Further analyses reveal that our approach is helpful for overcoming mainstream correction difficulties such as the corrections of frequent words, missing words, and substitution errors. Data, code, models and scripts are freely available at

pdf bib
Yu Sheng: Human-in-Loop Classical Chinese Poetry Generation System
Jingkun Ma | Runzhe Zhan | Derek F. Wong
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

The development of poetry generation system mainly focuses on enhancing the capacity of generation model. However, the demands of customization and polishing are generally ignored, which highly reduces the scope of application. In this work, we present Yu Sheng, a web-based poetry generation system that is featured a human-in-loop generation framework, providing various customization options for users with different backgrounds to engage in the process of poetry composition. To this end, we propose two methods and train the models that can perform constrained generation and fine-grained polishing. The automatic and human evaluation results show that our system has a strong ability to generate and polish poetry compared to other vanilla models. Our system is publicly accessible at:


pdf bib
中国语言学研究 70 年:核心期刊的词汇增长(70 Years of Linguistics Research in China: Vocabulary Growth of Core Journals)
Shan Wang (王珊) | Runzhe Zhan (詹润哲) | Shuangyun Yao (姚双云)
Proceedings of the 21st Chinese National Conference on Computational Linguistics

“建国以来我国语言学经过 70 年的发展取得了瞩目的成就,已有研究主要以回顾主要历史事件的方式介绍这一进程,但尚缺少使用量化手段分析其历时发展的研究。本文以词汇增长为切入点探究这一主题,首次创建大规模语言学中文核心期刊摘要的历时语料库,并使用三大词汇增长模型预测语料库中词汇的变化。本文选择拟合效果最好的 Heaps 模型分阶段深入分析语言学词汇的变化,显示出国家政策的指导作用和特定时代的语言生活特征。此外,与时序无关的验证程序支撑了本文研究方法的有效性。 关键词:中国语言学;词汇增长;核心期刊;摘要;语料库;历时发展”


pdf bib
Difficulty-Aware Machine Translation Evaluation
Runzhe Zhan | Xuebo Liu | Derek F. Wong | Lidia S. Chao
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The high-quality translation results produced by machine translation (MT) systems still pose a huge challenge for automatic evaluation. Current MT evaluation pays the same attention to each sentence component, while the questions of real-world examinations (e.g., university examinations) have different difficulties and weightings. In this paper, we propose a novel difficulty-aware MT evaluation metric, expanding the evaluation dimension by taking translation difficulty into consideration. A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function, and conversely. Experimental results on the WMT19 English-German Metrics shared tasks show that our proposed method outperforms commonly used MT metrics in terms of human correlation. In particular, our proposed method performs well even when all the MT systems are very competitive, which is when most existing metrics fail to distinguish between them. The source code is freely available at