Yuchen Eleanor Jiang
2026
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
Siwei Wu | JinCheng Ren | Xeron Du | Shuyue Guo | Xingwei Qu | Yiming Liang | Jie Liu | Yunwen Li | Tyler Loakman | Tianyu Zheng | Boyu Feng | Huaqing Yuan | Zili Wang | Jiaheng Liu | Wenhao Huang | Chenglin Cai | Haoran Que | Jian Yang | Yuelin Bai | Zekun Moore Wang | Zhouliang Yu | Qunshu Lin | Ding Pan | Yuchen Eleanor Jiang | Tiannan Wang | Wangchunshu Zhou | Shenzhi Wang | Xingyuan Bu | Minghao Liu | Guoyin Wang | Ge Zhang | Chenghua Lin
Findings of the Association for Computational Linguistics: EACL 2026
Existing Chinese preference datasets suffer from limited scale, restricted domain coverage, and insufficiently rigorous data validation. Human annotation significantly limits the scalability of human preference datasets. As a result, Chinese Alignment and Chinese Reward Models (CRM) have not yet been thoroughly explored. To address these challenges, we design an LLM-based data annotation pipeline with no human intervention. Based on this pipeline, we curate COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset consisting of 1M Chinese preference pairs and 92k carefully curated Chinese queries across diverse domains, including Chat, Coding, Maths, and others. We conduct experiments to verify the quality of COIG-P from two perspectives. (1) COIG-P brings significant performance improvements for the Qwen2/2.5 and Infinity-Instruct model series on AlignBench through DPO, with gains ranging from 2% to 12%. Furthermore, it significantly outperforms other existing Chinese preference datasets. (2) We train an 8B-sized CRM and manually annotate a Chinese Reward Benchmark (CRBench). Our CRM demonstrates robust scoring ability on CRBench. In addition, in practical data construction experiments, the quality of the data constructed by our CRM is comparable to that produced by GPT-4o.
2025
OAgents: An Empirical Study of Building Effective Agents
He Zhu | Tianrui Qin | King Zhu | Heyuan Huang | Yeyi Guan | Jinxiang Xia | Hanhao Li | Yi Yao | Ningning Wang | Pai Liu | Tianhao Peng | Xin Gui | Li Xiaowan | Yuhui Liu | Xiangru Tang | Jian Yang | Ge Zhang | Xitong Gao | Yuchen Eleanor Jiang | Changwang Zhang | Jun Wang | Jiaheng Liu | Wangchunshu Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025
Recently, Agentic AI has become an increasingly popular field of research. However, we argue that current practices in agent research are far from standard, rigorous scientific research, which makes it hard to conduct apples-to-apples comparisons among and against existing methods. As a result, it remains unclear how different design choices in an agent framework affect its effectiveness, and measuring progress in agent research is very hard. In this work, we conduct a systematic empirical study on the GAIA benchmark to investigate the impact of popular design choices within key agent components in a fair and rigorous way. To begin with, we find that the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, and that the variance between different random runs is often non-negligible. Therefore, we first introduce a more robust evaluation protocol to make comparisons more stable. Our empirical study then reveals which components and designs, as well as correlations between these designs, are key to building effective agents, while others are redundant despite seemingly making sense. With the insights gained from our empirical study, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects, providing a good starting point and guidelines for building effective agents. More importantly, OAgents supports various design choices for agent components in a modularized way, facilitating future scientific research on Agentic AI.
OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use
Xueyu Hu | Tao Xiong | Biao Yi | Zishu Wei | Ruixuan Xiao | Yurun Chen | Jiasheng Ye | Meiling Tao | Xiangxin Zhou | Ziyu Zhao | Yuhuai Li | Shengze Xu | Shenzhi Wang | Xinchen Xu | Shuofei Qiao | Zhaokai Wang | Kun Kuang | Tieyong Zeng | Liang Wang | Jiwei Li | Yuchen Eleanor Jiang | Wangchunshu Zhou | Guoyin Wang | Keting Yin | Zhou Zhao | Hongxia Yang | Fan Wu | Shengyu Zhang | Fei Wu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S. from Iron Man has long captivated imaginations. With the evolution of multi-modal large language models ((M)LLMs), this dream is closer to reality: (M)LLM-based agents that automate tasks by using computers, mobile phones, and web browsers, operating within the environments and interfaces (e.g., the Graphical User Interface (GUI) and Command Line Interface (CLI)) provided by operating systems (OS), have advanced significantly. This paper presents a comprehensive survey of these advanced agents, designated OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components and capabilities. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation metrics and benchmarks highlights how OS Agents are assessed across diverse platforms and tasks. Finally, we discuss current challenges and identify promising directions for future research. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field.
2023
Findings of the WMT 2023 Shared Task on Machine Translation with Terminologies
Kirill Semenov | Vilém Zouhar | Tom Kocmi | Dongdong Zhang | Wangchunshu Zhou | Yuchen Eleanor Jiang
Proceedings of the Eighth Conference on Machine Translation
The WMT 2023 Terminology Shared Task investigates progress in machine translation of texts with specialized vocabulary. Participants were given the source text and segment-level terminology dictionaries for three language pairs: Chinese→English, English→Czech, and German→English. We evaluate 21 submissions from 7 teams on two main criteria: general translation quality and the effectiveness of translating specialized terminology. Systems took varied approaches, incorporating terminology at inference time or using weakly supervised training with terminology access. While incorporating terminology dictionaries leads to improvements in translation quality, incorporating an equal amount of information from the reference leads to similar results. This challenges the position that terminologies are the crux of meaning in translation; it may also be explained by inadequate metrics that are not terminology-centric.
Poor Man’s Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference
Vilém Zouhar | Shehzaad Dhuliawala | Wangchunshu Zhou | Nico Daheim | Tom Kocmi | Yuchen Eleanor Jiang | Mrinmaya Sachan
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference. State-of-the-art QE systems based on pretrained language models have achieved remarkable correlations with human judgements, yet they are computationally heavy and require human annotations, which are slow and expensive to create. To address these limitations, we define the problem of metric estimation (ME), in which one predicts automated metric scores, also without the reference. We show that even without access to the reference, our model can estimate automated metrics (ρ = 60% for BLEU, ρ = 51% for other metrics) at the sentence level. Because automated metrics correlate with human judgements, we can leverage the ME task for pre-training a QE model. For the QE task, we find that pre-training on TER is better (ρ = 23%) than training from scratch (ρ = 20%).
Discourse-Centric Evaluation of Document-level Machine Translation with a New Densely Annotated Parallel Corpus of Novels
Yuchen Eleanor Jiang | Tianyu Liu | Shuming Ma | Dongdong Zhang | Mrinmaya Sachan | Ryan Cotterell
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Several recent papers claim to have achieved human parity at sentence-level machine translation (MT), especially between high-resource language pairs. In response, the MT community has, in part, shifted its focus to document-level translation. Translating documents requires a deeper understanding of the structure and meaning of text, which is often captured by various kinds of discourse phenomena such as consistency, coherence, and cohesion. However, this renders conventional sentence-level MT evaluation benchmarks inadequate for evaluating the performance of context-aware MT systems. This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al. (2022a). The new BWB annotation introduces four extra evaluation aspects, i.e., entity, terminology, coreference, and quotation, covering 15,095 entity mentions in both languages. Using these annotations, we systematically investigate the similarities and differences between the discourse structures of source and target languages, and the challenges they pose to MT. We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures. This gives us a new perspective on the challenges and opportunities in document-level MT. We make our resource publicly available to spur future research in document-level MT and its generalization to other language translation tasks.
2022
Autoregressive Structured Prediction with Language Models
Tianyu Liu | Yuchen Eleanor Jiang | Nicholas Monath | Ryan Cotterell | Mrinmaya Sachan
Findings of the Association for Computational Linguistics: EMNLP 2022
Recent years have seen a paradigm shift in NLP towards using pretrained language models (PLMs) for a wide range of tasks. However, there are many difficult design decisions in representing structures (e.g., tagged text, coreference chains) in a way that can be captured by PLMs. Prior work on structured prediction with PLMs typically flattens the structured output into a sequence, which limits the quality of the structural information being learned and leads to inferior performance compared to classic discriminative models. In this work, we describe an approach to modeling structures as sequences of actions in an autoregressive manner with PLMs, allowing in-structure dependencies to be learned without any loss. Our approach achieves a new state of the art on all the structured prediction tasks we consider, namely named entity recognition, end-to-end relation extraction, and coreference resolution.