Cheng Yang - ACL Anthology

Cheng Yang

2025

GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents
Lingxiao Diao | Xinyue Xu | Wanxuan Sun | Cheng Yang | Zhuosheng Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real-world applications. Previous studies have made notable progress in benchmarking the instruction following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge. Recently, LLMs have been increasingly deployed as domain-oriented agents, which rely on domain-oriented guidelines that may conflict with their commonsense knowledge. These guidelines exhibit two key characteristics: they consist of a wide range of domain-oriented rules and are subject to frequent updates. Despite these challenges, the absence of comprehensive benchmarks for evaluating the domain-oriented guideline following capabilities of LLMs presents a significant obstacle to their effective assessment and further development. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate guideline following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines. Data and code are available at Anonymous.

Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion
Ben Liu | Jihai Zhang | Fangquan Lin | Cheng Yang | Min Peng
Proceedings of the 31st International Conference on Computational Linguistics

Large Language Models (LLMs) present massive inherent knowledge and superior semantic comprehension capability, which have revolutionized various tasks in natural language processing. Despite their success, a critical gap remains in enabling LLMs to perform knowledge graph completion (KGC). Empirical evidence suggests that LLMs consistently perform worse than conventional KGC approaches, even through sophisticated prompt design or tailored instruction-tuning. Fundamentally, applying LLMs on KGC introduces several critical challenges, including a vast set of entity candidates, hallucination issue of LLMs, and under-exploitation of the graph structure. To address these challenges, we propose a novel instruction-tuning-based method, namely FtG. Specifically, we present a filter-then-generate paradigm and formulate the KGC task into a multiple-choice question format. In this way, we can harness the capability of LLMs while mitigating the issue casused by hallucinations. Moreover, we devise a flexible ego-graph serialization prompt and employ a structure-text adapter to couple structure and text information in a contextualized manner. Experimental results demonstrate that FtG achieves substantial performance gain compared to existing state-of-the-art methods. The instruction dataset and code are available at https://github.com/LB0828/FtG.

RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution
Jiahui Li | Lin Li | Tai-Wei Chang | Kun Kuang | Long Chen | Jun Zhou | Cheng Yang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Reinforcement learning from human feedback (RLHF) offers a promising approach to aligning large language models (LLMs) with human preferences. Typically, a reward model is trained or supplied to act as a proxy for humans in evaluating generated responses during the reinforcement training phase. However, current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. This approach may overlook the significant contributions of individual tokens toward the desired outcome. To this end, we propose a more fine-grained, token-level guidance approach for RL training. Specifically, we introduce RED, a novel REward reDistribition method that evaluates and assigns specific credit to each token using an off-the-shelf reward model. Utilizing these fine-grained rewards enhances the model’s understanding of language nuances, leading to more precise performance improvements. Notably, our method does not require modifying the reward model or introducing additional training steps, thereby incurring minimal computational costs. Experimental results across diverse datasets and tasks demonstrate the superiority of our approach.

Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
Shuliang Liu | Xinze Li | Zhenghao Liu | Yukun Yan | Cheng Yang | Zheni Zeng | Zhiyuan Liu | Maosong Sun | Ge Yu
Findings of the Association for Computational Linguistics: ACL 2025

Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models provide the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, utilizes the judge-consistency to evaluate these judgments, and selects the chosen and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge have a high agreement with the superior LLM. All codes are available at https://github.com/OpenBMB/ConsJudge.

Multi-Agent Collaboration via Cross-Team Orchestration
Zhuoyun Du | Chen Qian | Wei Liu | Zihao Xie | YiFei Wang | Rennai Qiu | Yufan Dang | Weize Chen | Cheng Yang | Ye Tian | Xuantang Xiong | Lei Han
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) have significantly impacted various domains, especially through organized LLM-driven autonomous agents. A representative scenario is in software development, where agents can collaborate in a team like humans, following predefined phases to complete sub-tasks sequentially. However, for an agent team, each phase yields only one possible outcome. This results in the completion of only one development chain, thereby losing the opportunity to explore multiple potential decision paths within the solution space. Consequently leading to suboptimal results or extensive trial and error. To address this, we introduce Cross-Team Orchestration (Croto), a scalable multi-team framework that enables orchestrated teams to jointly propose various task-oriented solutions and interact with their insights in a self-independence while cross-team collaboration environment for superior solutions generation. Experiments reveal a notable increase in software quality compared to state-of-the-art baselines. We further tested our framework on story generation tasks, which demonstrated a promising generalization ability of our framework in other domains. The code and data is available at https://github.com/OpenBMB/ChatDev/tree/macnet

DAGS: A Dependency-Based Dual-Attention and Global Semantic Improvement Framework for Metaphor Recognition
Puli Chen | Cheng Yang | Xingmao Zhang | Qingbao Huang
Findings of the Association for Computational Linguistics: ACL 2025

Current metaphor recognition mainly rely on Metaphor Detection Theory (MDT), such as the Metaphor Identification Procedure, which recognizes metaphors by comparing the basic meaning of target word with context meaning. Existing studies have gradually adopted literal annotations to model basic meanings, rejecting the aggregated meanings of target words. However, these methods ignore the problem of interference caused by literal annotations, and do not make full use of semantic expression relations of MDT, making the models difficult to detect and generalize. To address these challenges, we propose a dependency-based Dual-Attention and Global Semantic Improvement (DAGS) framework. DAGS first extracts literal annotations of target words as basic meaning from several mainstream corpora. Then, we apply dependency tree and dual-attention while filtering on input sentences and basic meanings. Finally, we improve the MDT to further consider the global semantic relationship on contexts. The DAGS can not only extract features from multiple information sources but alsoeffectively removes redundancy, while focusing on mission-critical information. We achieve state-of-the-art on several mainstream metaphor datasets (e.g., VUA ALL, VUAverb, TroFi and PSUCMC), which suggests that filtering and global semantic improvement of contexts is crucial for enhancing metaphor recognition performance.

Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
Weize Chen | Jiarui Yuan | Chen Qian | Cheng Yang | Zhiyuan Liu | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Model (LLM) based multi-agent systems (MAS) show remarkable potential in collaborative problem-solving, yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter-updating optimization methods. We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness in LLM-based MAS through training. Optima employs an iterative generate, rank, select, and train paradigm with a reward function balancing task performance, token efficiency, and communication readability. We explore various algorithms, including Supervised Fine-Tuning, Direct Preference Optimization, and their hybrid approaches, providing insights into their effectiveness-efficiency trade-offs. We integrate Monte Carlo Tree Search-inspired techniques for DPO data generation, treating conversation turns as tree nodes to explore diverse interaction paths. Evaluated on common multi-agent tasks, including information-asymmetric question answering and complex reasoning, Optimashows consistent and substantial improvements over single-agent baselines and vanilla MAS based on Llama 3 8B / 3.2 3B, achieving up to 2.8x performance gain with less than 10% tokens on tasks requiring heavy information exchange. Moreover, Optima’s efficiency gains enable more effective compute utilization during inference, leading to improved inference-time scaling laws. By addressing fundamental challenges in LLM-based MAS, Optima shows the potential towards scalable, efficient, and effective MAS.

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training
Ao Sun | Weilin Zhao | Xu Han | Cheng Yang | Xinrong Zhang | Zhiyuan Liu | Chuan Shi | Maosong Sun
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Training large language models (LLMs) heavily relies on distributed training strategies, among which pipeline parallelism (PP) plays a crucial role. As training sequences extend to 32k or even 128k tokens, current PP methods face severe bottlenecks, including substantial pipeline bubbles and high memory footprint, greatly hindering training throughput and model scalability. This paper introduces a sequence-level one-forward-one-backward (1F1B) PP method, named Seq1F1B, tailored for training LLMs on long sequences with high training throughput and memory efficiency. Unlike typical PP methods, which adopt batch-level pipeline schedule, Seq1F1B schedules the pipeline of training LLMs at the sequence level. It uses a computational strategy to partition sequences appropriately, significantly reducing pipeline bubbles and memory footprint. Compared to competitive PP baselines such as Megatron 1F1B PP, Seq1F1B achieves 1.14X training throughput with half memory footprint.Notably, Seq1F1B trains an LLM with 30B parameters on sequences up to 64k tokens using 64X NVIDIA A100 GPUs without using recomputation strategies, a feat unachievable with existing methods.We have released our code on GitHub to facilitate further research and development in LLM training on long sequences: https://github.com/thunlp/Seq1F1B.

LLM2: Let Large Language Models Harness System 2 Reasoning
Cheng Yang | Chufan Shi | Siheng Li | Bo Shui | Yujiu Yang | Wai Lam
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We posit that these limitations are rooted in the foundational autoregressive architecture of LLMs, which inherently lacks mechanisms for differentiating between desirable and undesirable results. Drawing inspiration from the dual-process theory of human cognition, we introduce LLM2, a novel framework that combines an LLM (System 1) with a process-based verifier (System 2). Within LLM2, the LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs. The verifier is trained with a pairwise comparison loss on synthetic process-supervision data generated through our token quality exploration strategy. Empirical results on mathematical reasoning benchmarks substantiate the efficacy of LLM2, exemplified by an accuracy enhancement from 50.3 to 57.8 (+7.5) for Llama3-1B on GSM8K. Furthermore, when combined with self-consistency, LLM2 achieves additional improvements, boosting major@20 accuracy from 56.2 to 70.2 (+14.0).

2024

Can ChatGPT’s Performance be Improved on Verb Metaphor Detection Tasks? Bootstrapping and Combining Tacit Knowledge
Cheng Yang | Puli Chen | Qingbao Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Metaphors detection, as an important task in the field of NLP, has been receiving sustained academic attention in recent years. Current researches focus supervised metaphors detection systems, which usually require large-scale, high-quality labeled data support. The emerge of large language models (e.g., ChatGPT) has made many NLP tasks (e.g., automatic summarization and dialogue systems) a qualitative leap. However, it is worth noting that the use of ChatGPT for unsupervised metaphors detection is often challenged with less-than-expected performance. Therefore, the aim of our work is to explore how to bootstrap and combine ChatGPT by detecting the most prevalent verb metaphors among metaphors. Our approach first utilizes ChatGPT to obtain literal collocations of target verbs and subject-object pairs of verbs in the text to be detected. Subsequently, these literal collocations and subject-object pairs are mapped to the same set of topics, and finally the verb metaphors are detected through the analysis of entailment relations. The experimental results show that our method achieves the best performance on the unsupervised verb metaphors detection task compared to existing unsupervised methods or direct prediction using ChatGPT. Our code is available at https://github.com/VILAN-Lab/Unsupervised-Metaphor-Detection.

Experiential Co-Learning of Software-Developing Agents
Chen Qian | Yufan Dang | Jiahao Li | Wei Liu | Zihao Xie | YiFei Wang | Weize Chen | Cheng Yang | Xin Cong | Xiaoyin Che | Zhiyuan Liu | Maosong Sun
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in large language models (LLMs) have brought significant changes to various domains, especially through LLM-driven autonomous agents. A representative scenario is in software development, where LLM agents demonstrate efficient collaboration, task division, and assurance of software quality, markedly reducing the need for manual involvement. However, these agents frequently perform a variety of tasks independently, without benefiting from past experiences, which leads to repeated mistakes and inefficient attempts in multi-step task execution. To this end, we introduce Experiential Co-Learning, a novel LLM-agent learning framework in which instructor and assistant agents gather shortcut-oriented experiences from their historical trajectories and use these past experiences for future task execution. The extensive experiments demonstrate that the framework enables agents to tackle unseen software-developing tasks more effectively. We anticipate that our insights will guide LLM agents towards enhanced autonomy and contribute to their evolutionary growth in cooperative learning. The code and data are available at https://github.com/OpenBMB/ChatDev.

ChatDev: Communicative Agents for Software Development
Chen Qian | Wei Liu | Hongzhang Liu | Nuo Chen | Yufan Dang | Jiahao Li | Cheng Yang | Weize Chen | Yusheng Su | Xin Cong | Juyuan Xu | Dahai Li | Zhiyuan Liu | Maosong Sun
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Software development is a complex task that necessitates cooperation among multiple members with diverse skills. Numerous studies used deep learning to improve specific phases in a waterfall model, such as design, coding, and testing. However, the deep learning model in each phase requires unique designs, leading to technical inconsistencies across various phases, which results in a fragmented and ineffective development process. In this paper, we introduce ChatDev, a chat-powered software development framework in which specialized agents driven by large language models (LLMs) are guided in what to communicate (via chat chain) and how to communicate (via communicative dehallucination). These agents actively contribute to the design, coding, and testing phases through unified language-based communication, with solutions derived from their multi-turn dialogues. We found their utilization of natural language is advantageous for system design, and communicating in programming language proves helpful in debugging. This paradigm demonstrates how linguistic communication facilitates multi-agent collaboration, establishing language as a unifying bridge for autonomous task-solving among LLM agents. The code and data are available at https://github.com/OpenBMB/ChatDev.

BC-Prover: Backward Chaining Prover for Formal Theorem Proving
Yuhang He | Jihai Zhang | Jianzhu Bao | Fangquan Lin | Cheng Yang | Bing Qin | Ruifeng Xu | Wotao Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Despite the remarkable progress made by large language models in mathematical reasoning, interactive theorem proving in formal logic still remains a prominent challenge. Previous methods resort to neural models for proofstep generation and search. However, they suffer from exploring possible proofsteps empirically in a large search space. Moreover, they directly use a less rigorous informal proof for proofstep generation, neglecting the incomplete reasoning within. In this paper, we propose BC-Prover, a backward chaining framework guided by pseudo steps. Specifically, BC-Prover prioritizes pseudo steps to proofstep generation. The pseudo steps boost the proof construction in two aspects: (1) Backward Chaining that decomposes the proof into sub-goals for goal-oriented exploration. (2) Step Planning that makes a fine-grained planning to bridge the gap between informal and formal proofs. Experiments on the miniF2F benchmark show significant performance gains by our framework over the state-of-the-art approaches. Our framework is also compatible with existing provers and further improves their performance with the backward chaining technique.

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
Yuxiang Zhang | Jing Chen | Junjie Wang | Yaxin Liu | Cheng Yang | Chufan Shi | Xinyu Zhu | Zihao Lin | Hanwen Wan | Yujiu Yang | Tetsuya Sakai | Tian Feng | Hayato Yamana
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications. Due to the lack of benchmarks, the community has yet to fully understand the hallucination issues within these models. To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH. Specifically, we assess the LLM’s hallucinations through two perspectives: depth and breadth. In terms of depth, we propose a multi-level diagnostic process, including (1) solvability detection, (2) solution planning, and (3) missing-tool analysis. For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools. Furthermore, we developed seven tasks and collected 700 evaluation samples through multiple rounds of manual annotation. The results show the significant challenges presented by the ToolBH benchmark. The current advanced models Gemini-1.5-Pro and GPT-4o only achieve total scores of 45.3 and 37.0, respectively, on a scale of 100. In this benchmark, larger model parameters do not guarantee better performance; the training data and response strategies also play crucial roles in tool-enhanced LLM scenarios. Our diagnostic analysis indicates that the primary reason for model errors lies in assessing task solvability. Additionally, open-weight models suffer from performance drops with verbose replies, whereas proprietary models excel with longer reasoning.

Merely Judging Metaphor is Not Enough: Research on Reasonable Metaphor Detection
Puli Chen | Cheng Yang | Qingbao Huang
Findings of the Association for Computational Linguistics: EMNLP 2024

Metaphor, as an advanced form of cognition, is challenging to understand their meaning. Current metaphor detection tasks only provide labels (i.e., metaphor or literal) without interpreting how to understand them. In this paper, we improve the metaphor detection task and explore the reason of metaphor. To the best of our knowledge, we are the first work to reason about metaphor using mainstream Large Language Models (LLMs). Specifically, we utilized ChatGPT3.5 to expand the mainstream datasets in current metaphor detection, including VUA ALL, TroFi, and MOH-X. We input the original sentence, target word, and usage (metaphor or literal) into ChatGPT, guiding it to generate corresponding metaphor reason. Then, we designed supervised baseline experiments (e.g., RoBERTa, GPT-2) and zero-shot experiments with LLMs (e.g., LLaMA3). For the results generated by the above experiments, we provided the case study. We devised four methods that include manual evaluation to evaluate the reason performance of the model, and discussed extensively the advantages and disadvantages of these evaluation methods. Our code is available at https://github.com/yc-cy/Metaphorical-Reasoning.

HoLLMwood: Unleashing the Creativity of Large Language Models in Screenwriting via Role Playing
Jing Chen | Xinyu Zhu | Cheng Yang | Chufan Shi | Yadong Xi | Yuxiang Zhang | Junjie Wang | Jiashu Pu | Tian Feng | Yujiu Yang | Rongsheng Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024

Generative AI has demonstrated unprecedented creativity in the field of computer vision, yet such phenomena have not been observed in natural language processing. In particular, large language models (LLMs) can hardly produce written works at the level of human experts due to the extremely high complexity of literature writing. In this paper, we present HoLLMwood, an automated framework for unleashing the creativity of LLMs and exploring their potential in screenwriting, which is a highly demanding task. Mimicking the human creative process, we assign LLMs to different roles involved in the real-world scenario. In addition to the common practice of treating LLMs as Writer, we also apply LLMs as Editor, who is responsible for providing feedback and revision advice to Writer. Besides, to enrich the characters and deepen the plots, we introduce a role-playing mechanism and adopt LLMs as Actors that can communicate and interact with each other. Evaluations on automatically generated screenplays show that HoLLMwood substantially outperforms strong baselines in terms of coherence, relevance, interestingness and overall quality.

MoE-I²: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition
Cheng Yang | Yang Sui | Jinqi Xiao | Lingyi Huang | Yu Gong | Yuanlin Duan | Wenqi Jia | Miao Yin | Yu Cheng | Bo Yuan
Findings of the Association for Computational Linguistics: EMNLP 2024

The emergence of Mixture of Experts (MoE) LLMs has significantly advanced the development of language models. Compared to traditional LLMs, MoE LLMs outperform traditional LLMs by achieving higher performance with considerably fewer activated parameters. Despite this efficiency, their enormous parameter size still leads to high deployment costs. In this paper, we introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost. First, in the inter-expert pruning stage, we analyze the importance of each layer and propose the Layer-wise Genetic Search and Block-wise KT-Reception Field with the non-uniform pruning ratio to prune the individual expert. Second, in the intra-expert decomposition stage, we apply the low-rank decomposition to further compress the parameters within the remaining experts. Extensive experiments on Qwen1.5-MoE-A2.7B, Deepseek-V2-Lite, and Mixtral-8×7B, demonstrate that our proposed methods can both reduce the model size and enhance inference efficiency while maintaining performance in various zero-shot tasks.

Beyond Natural Language: LLMs Leveraging Alternative Formats for Enhanced Reasoning and Communication
Weize Chen | Chenfei Yuan | Jiarui Yuan | Yusheng Su | Chen Qian | Cheng Yang | Ruobing Xie | Zhiyuan Liu | Maosong Sun
Findings of the Association for Computational Linguistics: EMNLP 2024

Natural language (NL) has long been the predominant format for human cognition and communication, and by extension, has been similarly pivotal in the development and application of Large Language Models (LLMs). Yet, besides NL, LLMs have seen various non-NL formats during pre-training, such as code and logical expression. NL’s status as the optimal format for LLMs, particularly in single-LLM reasoning and multi-agent communication, has not been thoroughly examined. In this work, we challenge the default use of NL by exploring the utility of non-NL formats in these contexts. We show that allowing LLMs to autonomously select the most suitable format before reasoning or communicating leads to a 3.3 to 5.7% improvement in reasoning efficiency for different LLMs, and up to a 72.7% reduction in token usage in multi-agent communication, all while maintaining communicative effectiveness. Our comprehensive analysis further reveals that LLMs can devise a format from limited task instructions and that the devised format is effectively transferable across different LLMs. Intriguingly, the structured communication format decided by LLMs exhibits notable parallels with established agent communication languages, suggesting a natural evolution towards efficient, structured communication in agent communication. Our code will be released to facilitate further exploration.

Solving General Natural-Language-Description Optimization Problems with Large Language Models
Jihai Zhang | Wei Wang | Siyan Guo | Li Wang | Fangquan Lin | Cheng Yang | Wotao Yin
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)

Optimization problems seek to find the best solution to an objective under a set of constraints, and have been widely investigated in real-world applications. Modeling and solving optimization problems in a specific domain typically require a combination of domain knowledge, mathematical skills, and programming ability, making it difficult for general users and even domain professionals. In this paper, we propose a novel framework called OptLLM that augments LLMs with external solvers. Specifically, OptLLM accepts user queries in natural language, convert them into mathematical formulations and programming codes, and calls the solvers to calculate the results for decision-making. In addition, OptLLM supports multi-round dialogues to gradually refine the modeling and solving of optimization problems. To illustrate the effectiveness of OptLLM, we provide tutorials on three typical optimization applications and conduct experiments on both prompt-based GPT models and a fine-tuned Qwen model using a large-scale self-developed optimization dataset. Experimental results show that OptLLM works with various LLMs, and the fine-tuned model achieves an accuracy boost compared to the prompt-based models. Some features of OptLLM framework have been available for trial since June 2023 (https://opt.alibabacloud.com/chat or https://opt.aliyun.com/chat).

An Energy-based Model for Word-level AutoCompletion in Computer-aided Translation
Cheng Yang | Guoping Huang | Mo Yu | Zhirui Zhang | Siheng Li | Mingming Yang | Shuming Shi | Yujiu Yang | Lemao Liu
Transactions of the Association for Computational Linguistics, Volume 12

Word-level AutoCompletion (WLAC) is a rewarding yet challenging task in Computer-aided Translation. Existing work addresses this task through a classification model based on a neural network that maps the hidden vector of the input context into its corresponding label (i.e., the candidate target word is treated as a label). Since the context hidden vector itself does not take the label into account and it is projected to the label through a linear classifier, the model cannot sufficiently leverage valuable information from the source sentence as verified in our experiments, which eventually hinders its overall performance. To alleviate this issue, this work proposes an energy-based model for WLAC, which enables the context hidden vector to capture crucial information from the source sentence. Unfortunately, training and inference suffer from efficiency and effectiveness challenges, therefore we employ three simple yet effective strategies to put our model into practice. Experiments on four standard benchmarks demonstrate that our reranking-based approach achieves substantial improvements (about 6.07%) over the previous state-of-the-art model. Further analyses show that each strategy of our approach contributes to the final performance.1

2023

AutoConv: Automatically Generating Information-seeking Conversations with Large Language Models
Siheng Li | Cheng Yang | Yichun Yin | Xinyu Zhu | Zesen Cheng | Lifeng Shang | Xin Jiang | Qun Liu | Yujiu Yang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Information-seeking conversation, which aims to help users gather information through conversation, has achieved great progress in recent years. However, the research is still stymied by the scarcity of training data. To alleviate this problem, we propose AutoConv for synthetic conversation generation, which takes advantage of the few-shot learning ability and generation capacity of large language models (LLM). Specifically, we formulate the conversation generation problem as a language modeling task, then finetune an LLM with a few human conversations to capture the characteristics of the information-seeking process and use it for generating synthetic conversations with high quality. Experimental results on two frequently-used datasets verify that AutoConv has substantial improvements over strong baselines and alleviates the dependence on human annotation. In addition, we also provide several analysis studies to promote future research.

Question Answering as Programming for Solving Time-Sensitive Questions
Xinyu Zhu | Cheng Yang | Bei Chen | Siheng Li | Jian-Guang Lou | Yujiu Yang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Question answering plays a pivotal role in human daily life because it involves our acquisition of knowledge about the world. However, due to the dynamic and ever-changing nature of real-world facts, the answer can be completely different when the time constraint in the question changes. Recently, Large Language Models (LLMs) have shown remarkable intelligence in question answering, while our experiments reveal that the aforementioned problems still pose a significant challenge to existing LLMs. This can be attributed to the LLMs’ inability to perform rigorous reasoning based on surface-level text semantics. To overcome this limitation, rather than requiring LLMs to directly answer the question, we propose a novel approach where we reframe the Question Answering task as Programming (QAaP). Concretely, by leveraging modern LLMs’ superior capability in understanding both natural language and programming language, we endeavor to harness LLMs to represent diversely expressed text as well-structured code and select the best matching answer from multiple candidates through programming. We evaluate our QAaP framework on several time-sensitive question answering datasets and achieve decent improvement, up to 14.5% over strong baselines.

Specialist or Generalist? Instruction Tuning for Specific NLP Tasks
Chufan Shi | Yixuan Su | Cheng Yang | Yujiu Yang | Deng Cai
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The potential of large language models (LLMs) to simultaneously perform a wide range of natural language processing (NLP) tasks has been the subject of extensive research. Although instruction tuning has proven to be a data-efficient method for transforming LLMs into such generalist models, their performance still lags behind specialist models trained exclusively for specific tasks. In this paper, we investigate whether incorporating broadcoverage generalist instruction tuning can contribute to building a specialist model. We hypothesize that its efficacy depends on task specificity and skill requirements. Our experiments assess four target tasks with distinct coverage levels, revealing that integrating generalist instruction tuning consistently enhances model performance when the task coverage is broad. The effect is particularly pronounced when the amount of task-specific training data is limited. Further investigation into three target tasks focusing on different capabilities demonstrates that generalist instruction tuning improves understanding and reasoning abilities. However, for tasks requiring factual knowledge, generalist data containing hallucinatory information may negatively affect the model’s performance. Overall, our work provides a systematic guide for developing specialist models with general instruction tuning.

Enhancing Dialogue Generation with Conversational Concept Flows
Siheng Li | Wangjie Jiang | Pengda Si | Cheng Yang | Qiu Yao | Jinchao Zhang | Jie Zhou | Yujiu Yang
Findings of the Association for Computational Linguistics: EACL 2023

Human conversations contain natural and reasonable topic shifts, reflected as the concept flows across utterances. Previous researches prove that explicitly modeling concept flows with a large commonsense knowledge graph effectively improves response quality. However, we argue that there exists a gap between the knowledge graph and the conversation. The knowledge graph has limited commonsense knowledge and ignores the characteristics of natural conversations. Thus, many concepts and relations in conversations are not included. To bridge this gap, we propose to enhance dialogue generation with conversational concept flows. Specifically, we extract abundant concepts and relations from natural conversations and build a new conversation-aware knowledge graph. In addition, we design a novel relation-aware graph encoder to capture the concept flows guided by the knowledge graph. Experimental results on the large-scale Reddit conversation dataset indicate that our method performs better than strong baselines, andfurther analysis verifies the effectiveness of each component. All our code and data will be publicly available after acceptance.

NewsDialogues: Towards Proactive News Grounded Conversation
Siheng Li | Yichun Yin | Cheng Yang | Wangjie Jiang | Yiwei Li | Zesen Cheng | Lifeng Shang | Xin Jiang | Qun Liu | Yujiu Yang
Findings of the Association for Computational Linguistics: ACL 2023

Hot news is one of the most popular topics in daily conversations. However, news grounded conversation has long been stymied by the lack of well-designed task definition and scarce data. In this paper, we propose a novel task, Proactive News Grounded Conversation, in which a dialogue system can proactively lead the conversation based on some key topics of the news. In addition, both information-seeking and chit-chat scenarios are included realistically, where the user may ask a series of questions about the news details or express their opinions and be eager to chat. To further develop this novel task, we collect a human-to-human Chinese dialogue dataset NewsDialogues, which includes 1K conversations with a total of 14.6K utterances and detailed annotations for target topics and knowledge spans. Furthermore, we propose a method named Predict-Generate-Rank, consisting of a generator for grounded knowledge prediction and response generation, and a ranker for the ranking of multiple responses to alleviate the exposure bias. We conduct comprehensive experiments to demonstrate the effectiveness of the proposed method and further present several key findings and challenges to prompt future research.

2022

IIGROUP Submissions for WMT22 Word-Level AutoCompletion Task
Cheng Yang | Siheng Li | Chufan Shi | Yujiu Yang
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper presents IIGroup’s submission to the WMT22 Word-Level AutoCompletion(WLAC) Shared Task in four language directions. We propose to use a Generate-then-Rerank framework to solve this task. More specifically, the generator is used to generate candidate words and recall as many positive candidates as possible. To facilitate the training process of the generator, we propose a span-level mask prediction task. Once we get the candidate words, we take the top-K candidates and feed them into the reranker. The reranker is used to select the most confident candidate. The experimental results in four language directions demonstrate the effectiveness of our systems. Our systems achieve competitive performance ranking 1st in English to Chinese subtask and 2nd in Chinese to English subtask.

2021

Treasures Outside Contexts: Improving Event Detection via Global Statistics
Rui Li | Wenlin Zhao | Cheng Yang | Sen Su
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Event detection (ED) aims at identifying event instances of specified types in given texts, which has been formalized as a sequence labeling task. As far as we know, existing neural-based ED models make decisions relying entirely on the contextual semantic features of each word in the inputted text, which we find is easy to be confused by the varied contexts in the test stage. To this end, we come up with the idea of introducing a set of statistical features from word-event co-occurrence frequencies in the entire training set to cooperate with contextual features. Specifically, we propose a Semantic and Statistic-Joint Discriminative Network (SS-JDN) consisting of a semantic feature extractor, a statistical feature extractor, and a joint event discriminator. In experiments, SS-JDN effectively exceeds ten recent strong baselines on ACE2005 and KBP2015 datasets. Further, we perform extensive experiments to comprehensively probe SS-JDN.

KACC: A Multi-task Benchmark for Knowledge Abstraction, Concretization and Completion
Jie Zhou | Shengding Hu | Xin Lv | Cheng Yang | Zhiyuan Liu | Wei Xu | Jie Jiang | Juanzi Li | Maosong Sun
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

Graph Neural News Recommendation with Unsupervised Preference Disentanglement
Linmei Hu | Siyong Xu | Chen Li | Cheng Yang | Chuan Shi | Nan Duan | Xing Xie | Ming Zhou
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

With the explosion of news information, personalized news recommendation has become very important for users to quickly find their interested contents. Most existing methods usually learn the representations of users and news from news contents for recommendation. However, they seldom consider high-order connectivity underlying the user-news interactions. Moreover, existing methods failed to disentangle a user’s latent preference factors which cause her clicks on different news. In this paper, we model the user-news interactions as a bipartite graph and propose a novel Graph Neural News Recommendation model with Unsupervised Preference Disentanglement, named GNUD. Our model can encode high-order relationships into user and news representations by information propagation along the graph. Furthermore, the learned representations are disentangled with latent preference factors by a neighborhood routing algorithm, which can enhance expressiveness and interpretability. A preference regularizer is also designed to force each disentangled subspace to independently reflect an isolated preference, improving the quality of the disentangled representations. Experimental results on real-world news datasets demonstrate that our proposed model can effectively improve the performance of news recommendation and outperform state-of-the-art news recommendation methods.

2019

Improving Distantly-Supervised Relation Extraction with Joint Label Embedding
Linmei Hu | Luhao Zhang | Chuan Shi | Liqiang Nie | Weili Guan | Cheng Yang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Distantly-supervised relation extraction has proven to be effective to find relational facts from texts. However, the existing approaches treat labels as independent and meaningless one-hot vectors, which cause a loss of potential label information for selecting valid instances. In this paper, we propose a novel multi-layer attention-based model to improve relation extraction with joint label embedding. The model makes full use of both structural information from Knowledge Graphs and textual information from entity descriptions to learn label embeddings through gating integration while avoiding the imposed noise with an attention mechanism. Then the learned label embeddings are used as another atten- tion over the instances (whose embeddings are also enhanced with the entity descriptions) for improving relation extraction. Extensive experiments demonstrate that our model significantly outperforms state-of-the-art methods.

GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification
Jie Zhou | Xu Han | Cheng Yang | Zhiyuan Liu | Lifeng Wang | Changcheng Li | Maosong Sun
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Fact verification (FV) is a challenging task which requires to retrieve relevant evidence from plain text and use the evidence to verify given claims. Many claims require to simultaneously integrate and reason over several pieces of evidence for verification. However, previous work employs simple models to extract information from evidence without letting evidence communicate with each other, e.g., merely concatenate the evidence for processing. Therefore, these methods are unable to grasp sufficient relational and logical information among the evidence. To alleviate this issue, we propose a graph-based evidence aggregating and reasoning (GEAR) framework which enables information to transfer on a fully-connected evidence graph and then utilizes different aggregators to collect multi-evidence information. We further employ BERT, an effective pre-trained language representation model, to improve the performance. Experimental results on a large-scale benchmark dataset FEVER have demonstrated that GEAR could leverage multi-evidence information for FV and thus achieves the promising result with a test FEVER score of 67.10%. Our code is available at https://github.com/thunlp/GEAR.

Jiuge: A Human-Machine Collaborative Chinese Classical Poetry Generation System
Zhipeng Guo | Xiaoyuan Yi | Maosong Sun | Wenhao Li | Cheng Yang | Jiannan Liang | Huimin Chen | Yuhui Zhang | Ruoyu Li
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Research on the automatic generation of poetry, the treasure of human culture, has lasted for decades. Most existing systems, however, are merely model-oriented, which input some user-specified keywords and directly complete the generation process in one pass, with little user participation. We believe that the machine, being a collaborator or an assistant, should not replace human beings in poetic creation. Therefore, we proposed Jiuge, a human-machine collaborative Chinese classical poetry generation system. Unlike previous systems, Jiuge allows users to revise the unsatisfied parts of a generated poem draft repeatedly. According to the revision, the poem will be dynamically updated and regenerated. After the revision and modification procedure, the user can write a satisfying poem together with Jiuge system collaboratively. Besides, Jiuge can accept multi-modal inputs, such as keywords, plain text or images. By exposing the options of poetry genres, styles and revision modes, Jiuge, acting as a professional assistant, allows constant and active participation of users in poetic creation.

2018

Stylistic Chinese Poetry Generation via Unsupervised Style Disentanglement
Cheng Yang | Maosong Sun | Xiaoyuan Yi | Wenhao Li
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The ability to write diverse poems in different styles under the same poetic imagery is an important characteristic of human poetry writing. Most previous works on automatic Chinese poetry generation focused on improving the coherency among lines. Some work explored style transfer but suffered from expensive expert labeling of poem styles. In this paper, we target on stylistic poetry generation in a fully unsupervised manner for the first time. We propose a novel model which requires no supervised style labeling by incorporating mutual information, a concept in information theory, into modeling. Experimental results show that our model is able to generate stylistic poems without losing fluency and coherency.

Co-authors

Qingbao Huang 3

Xu Han (韩旭) 2

Wangjie Jiang 2

Yuxiang Zhang (张宇翔) 2

Tai-Wei Chang 1

Lingxiao Diao 1

Guoping Huang 1

Changcheng Li 1

Jiannan Liang 1

Hongzhang Liu 1

Zhenghao Liu (刘正皓) 1

Jian-Guang Lou 1

Bing Qin (秦兵) 1

Tetsuya Sakai 1

Xuantang Xiong 1

Ruifeng Xu (徐睿峰) 1

Hayato Yamana 1

Yukun Yan (闫宇坤) 1

Mingming Yang 1

Ge Yu (于戈) 1

Jinchao Zhang 1

Rongsheng Zhang 1

Zhuosheng Zhang 1

Xingmao Zhang 1

Xinrong Zhang 1

Venues