2024
pdf
bib
abs
Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement
Weimin Xiong
|
Yifan Song
|
Xiutian Zhao
|
Wenhao Wu
|
Xun Wang
|
Ke Wang
|
Cheng Li
|
Wei Peng
|
Sujian Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language model agents have exhibited exceptional performance across a range of complex interactive tasks. Recent approaches have utilized tuning with expert trajectories to enhance agent performance, yet they primarily concentrate on outcome rewards, which may lead to errors or suboptimal actions due to the absence of process supervision signals. In this paper, we introduce the **I**terative step-level **P**rocess **R**efinement **(IPR)** framework, which provides detailed step-by-step guidance to enhance agent training. Specifically, we adopt the Monte Carlo method to estimate step-level rewards. During each iteration, the agent explores along the expert trajectory and generates new actions. These actions are then evaluated against the corresponding step of expert trajectory using step-level rewards. Such comparison helps identify discrepancies, yielding contrastive action pairs that serve as training data for the agent. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines. Moreover, our analytical finds highlight the effectiveness of IPR in augmenting action efficiency and its applicability to diverse models.
pdf
bib
abs
An Electoral Approach to Diversify LLM-based Multi-Agent Collective Decision-Making
Xiutian Zhao
|
Ke Wang
|
Wei Peng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Modern large language models (LLMs) have exhibited cooperative synergy on complex task-solving, and collective decision-making (CDM) is a pivotal component in LLM-based multi-agent collaboration frameworks. Our survey on 52 recent such systems uncovers a severe lack of diversity, with a heavy reliance on dictatorial and plurality voting for CDM. Through the lens of social choice theory, we scrutinize widely-adopted CDM methods and identify their limitations. To enrich current landscape of LLM-based CDM, we present GEDI, an electoral CDM module that incorporates various ordinal preferential voting mechanisms. Our empirical case study across three benchmarks shows that the integration of certain CDM methods can markedly improve the reasoning capabilities and robustness of some leading LLMs, all without requiring intricate system designs. Additionally, we find that some CDM mechanisms generate positive synergies even with as few as three agents. The voting-based methods also demonstrate robustness against single points of failure, as well as diversity in terms of hit-rate@k and subject-wise impacts.
pdf
bib
abs
Measuring the Inconsistency of Large Language Models in Preferential Ranking
Xiutian Zhao
|
Ke Wang
|
Wei Peng
Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)
Despite large language models’ (LLMs’) recent advancements, their bias and hallucination issues persist, and their ability to offer consistent and preferential rankings remains underexplored. This study investigates the capacity of LLMs to provide consistent ordinal preferences, a crucial aspect in scenarios lacking absolute answers. We introduce a formalization of consistency based on order theory, outlining criteria such as transitivity, asymmetry, reversibility, and independence from irrelevant alternatives. Our diagnostic experiments on selected state-of-the-art LLMs reveal their inability to meet these criteria, indicating a strong positional bias and poor transitivity, with preferences easily swayed by irrelevant alternatives. These findings highlight a significant inconsistency in LLM-generated preferential rankings, underscoring the need for further research to address these limitations.
pdf
bib
abs
Enhancing the General Agent Capabilities of Low-Paramter LLMs through Tuning and Multi-Branch Reasoning
Qinhao Zhou
|
Zihan Zhang
|
Xiang Xiang
|
Ke Wang
|
Yuchuan Wu
|
Yongbin Li
Findings of the Association for Computational Linguistics: NAACL 2024
Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities, making them highly successful in a variety of tasks. However, when used as agents for dealing with complex problems in the real world, their performance is far inferior to large commercial models such as ChatGPT and GPT-4. As intelligent agents, LLMs need to have the capabilities of task planning, long-term memory, and the ability to leverage external tools to achieve satisfactory performance. Various methods have been proposed to enhance the agent capabilities of LLMs. On the one hand, methods involve constructing agent-specific data and fine-tuning the models. On the other hand, some methods focus on designing prompts that effectively activate the reasoning abilities of the LLMs. We explore both strategies on the 7B and 13B models. We propose a comprehensive method for constructing agent-specific data using GPT-4. Through supervised fine-tuning with constructed data, we find that for these models with a relatively small number of parameters, supervised fine-tuning can significantly reduce hallucination outputs and formatting errors in agent tasks. Furthermore, techniques such as multi-path reasoning and task decomposition can effectively decrease problem complexity and enhance the performance of LLMs as agents. We evaluate our method on five agent tasks of AgentBench and achieve satisfactory results.
pdf
bib
abs
AgentBank: Towards Generalized LLM Agents via Fine-Tuning on 50000+ Interaction Trajectories
Yifan Song
|
Weimin Xiong
|
Xiutian Zhao
|
Dawei Zhu
|
Wenhao Wu
|
Ke Wang
|
Cheng Li
|
Wei Peng
|
Sujian Li
Findings of the Association for Computational Linguistics: EMNLP 2024
Fine-tuning on agent-environment interaction trajectory data holds significant promise for surfacing generalized agent capabilities in open-source large language models (LLMs). In this work, we introduce AgentBank, by far the largest trajectory tuning data collection featuring more than 50k diverse high-quality interaction trajectories which comprises 16 tasks covering five distinct agent skill dimensions. Leveraging a novel annotation pipeline, we are able to scale the annotated trajectories and generate a trajectory dataset with minimized difficulty bias. Furthermore, we fine-tune LLMs on AgentBank to get a series of agent models, Samoyed. Our comparative experiments demonstrate the effectiveness of scaling the interaction trajectory data to acquire generalized agent capabilities. Additional studies also reveal some key observations regarding trajectory tuning and agent skill generalization.
pdf
bib
abs
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents
Ruixuan Xiao
|
Wentao Ma
|
Ke Wang
|
Yuchuan Wu
|
Junbo Zhao
|
Haobo Wang
|
Fei Huang
|
Yongbin Li
Findings of the Association for Computational Linguistics: EMNLP 2024
LLM-based agents have emerged as promising tools, which are crafted to fulfill complex tasks by iterative planning and action. However, these agents are susceptible to undesired planning hallucinations when lacking specific knowledge for expertise-intensive tasks. To address this, preliminary attempts are made to enhance planning reliability by incorporating external workflow-related knowledge. Despite the promise, such infused knowledge is mostly disorganized and diverse in formats, lacking rigorous formalization and comprehensive comparisons. Motivated by this, we formalize different formats of workflow knowledge and present FlowBench, the first benchmark for workflow-guided planning. FlowBench covers 51 different scenarios from 6 domains, with knowledge presented in diverse formats. To assess different LLMs on FlowBench, we design a multi-tiered evaluation framework. We evaluate the efficacy of workflow knowledge across multiple formats, and the results indicate that current LLM agents need considerable improvements for satisfactory planning. We hope that our challenging benchmark can pave the way for future agent planning research.
pdf
bib
abs
MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs
Zimu Lu
|
Aojun Zhou
|
Houxing Ren
|
Ke Wang
|
Weikang Shi
|
Junting Pan
|
Mingjie Zhan
|
Hongsheng Li
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have exhibited great potential in mathematical reasoning. However, there remains a performance gap in this area between existing open-source models and closed-source models such as GPT-4. In this paper, we introduce MathGenie, a novel method for generating diverse and reliable math problems by leveraging the ground-truth solutions of the seed data. We augment these ground-truth solutions and use a specially finetuned model to translate these augmented solutions back into new questions. Subsequently, we generate code-integrated solutions for these questions. To ensure the correctness of the code-integrated solutions, we employ rationale-based verification for filtering. Then, we finetune various pretrained models, ranging from 7B to 70B, on the newly curated data, resulting in a family of models known as MathGenie. These models consistently outperform previous open-source models across five representative mathematical reasoning datasets, achieving state-of-the-art performance. In particular, MathGenie-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score.
2023
pdf
bib
abs
Easy Guided Decoding in Providing Suggestions for Interactive Machine Translation
Ke Wang
|
Xin Ge
|
Jiayi Wang
|
Yuqi Zhang
|
Yu Zhao
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Machine translation technology has made great progress in recent years, but it cannot guarantee error-free results. Human translators perform post-editing on machine translations to correct errors in the scene of computer aided translation. In favor of expediting the post-editing process, many works have investigated machine translation in interactive modes, in which machines can automatically refine the rest of translations constrained by human’s edits. Translation Suggestion (TS), as an interactive mode to assist human translators, requires machines to generate alternatives for specific incorrect words or phrases selected by human translators. In this paper, we utilize the parameterized objective function of neural machine translation (NMT) and propose a novel constrained decoding algorithm, namely Prefix-Suffix Guided Decoding (PSGD), to deal with the TS problem without additional training. Compared to state-of-the-art lexical-constrained decoding method, PSGD improves translation quality by an average of 10.6 BLEU and reduces time overhead by an average of 63.4% on benchmark datasets. Furthermore, on both the WeTS and the WMT 2022 Translation Suggestion datasets, it is superior over other supervised learning systems trained with TS annotated data.
pdf
bib
abs
Disambiguated Lexically Constrained Neural Machine Translation
Jinpeng Zhang
|
Nini Xiao
|
Ke Wang
|
Chuanqi Dong
|
Xiangyu Duan
|
Yuqi Zhang
|
Min Zhang
Findings of the Association for Computational Linguistics: ACL 2023
Lexically constrained neural machine translation (LCNMT), which controls the translation generation with pre-specified constraints, is important in many practical applications. Current approaches to LCNMT typically assume that the pre-specified lexicon constraints are contextually appropriate. This assumption limits their application to real-world scenarios where a source lexicon may have multiple target constraints, and disambiguation is needed to select the most suitable one. In this paper, we propose disambiguated LCNMT (D-LCNMT) to solve the problem. D-LCNMT is a robust and effective two-stage framework that disambiguates the constraints based on contexts at first, then integrates the disambiguated constraints into LCNMT. Experimental results show that our approach outperforms strong baselines including existing data argumentation based approaches on benchmark datasets, and comprehensive experiments in scenarios where a source lexicon corresponds to multiple target constraints demonstrate the constraint disambiguation superiority of our approach.
pdf
bib
abs
Improving Neural Machine Translation by Multi-Knowledge Integration with Prompting
Ke Wang
|
Jun Xie
|
Yuqi Zhang
|
Yu Zhao
Findings of the Association for Computational Linguistics: EMNLP 2023
Improving neural machine translation (NMT) systems with prompting has achieved significant progress in recent years. In this work, we focus on how to integrate multi-knowledge, multiple types of knowledge, into NMT models to enhance the performance with prompting. We propose a unified framework, which can integrate effectively multiple types of knowledge including sentences, terminologies/phrases and translation templates into NMT models. We utilize multiple types of knowledge as prefix-prompts of input for the encoder and decoder of NMT models to guide the translation process. The approach requires no changes to the model architecture and effectively adapts to domain-specific translation without retraining. The experiments on English-Chinese and English-German translation demonstrate that our approach significantly outperform strong baselines, achieving high translation quality and terminology match accuracy.
pdf
bib
abs
PROSE: A Pronoun Omission Solution for Chinese-English Spoken Language Translation
Ke Wang
|
Xiutian Zhao
|
Yanghui Li
|
Wei Peng
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Neural Machine Translation (NMT) systems encounter a significant challenge when translating a pro-drop (‘pronoun-dropping’) language (e.g., Chinese) to a non-pro-drop one (e.g., English), since the pro-drop phenomenon demands NMT systems to recover omitted pronouns. This unique and crucial task, however, lacks sufficient datasets for benchmarking. To bridge this gap, we introduce PROSE, a new benchmark featured in diverse pro-drop instances for document-level Chinese-English spoken language translation. Furthermore, we conduct an in-depth investigation of the pro-drop phenomenon in spoken Chinese on this dataset, reconfirming that pro-drop reduces the performance of NMT systems in Chinese-English translation. To alleviate the negative impact introduced by pro-drop, we propose Mention-Aware Semantic Augmentation, a novel approach that leverages the semantic embedding of dropped pronouns to augment training pairs. Results from the experiments on four Chinese-English translation corpora show that our proposed method outperforms existing methods regarding omitted pronoun retrieval and overall translation quality.
pdf
bib
abs
M3Seg: A Maximum-Minimum Mutual Information Paradigm for Unsupervised Topic Segmentation in ASR Transcripts
Ke Wang
|
Xiutian Zhao
|
Yanghui Li
|
Wei Peng
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Topic segmentation aims to detect topic boundaries and split automatic speech recognition transcriptions (e.g., meeting transcripts) into segments that are bounded by thematic meanings. In this work, we propose M3Seg, a novel Maximum-Minimum Mutual information paradigm for linear topic segmentation without using any parallel data. Specifically, by employing sentence representations provided by pre-trained language models, M3Seg first learns a region-based segment encoder based on the maximization of mutual information between the global segment representation and the local contextual sentence representation. Secondly, an edge-based boundary detection module aims to segment the whole by topics based on minimizing the mutual information between different segments. Experiment results on two public datasets demonstrate the effectiveness of M3Seg, which outperform the state-of-the-art methods by a significant (18%–37% improvement) margin.
pdf
bib
abs
ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization
Xiutian Zhao
|
Ke Wang
|
Wei Peng
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Dialogue agents have been receiving increasing attention for years, and this trend has been further boosted by the recent progress of large language models (LLMs). Stance detection and dialogue summarization are two core tasks of dialogue agents in application scenarios that involve argumentative dialogues. However, research on these tasks is limited by the insufficiency of public datasets, especially for non-English languages. To address this language resource gap in Chinese, we present ORCHID (Oral Chinese Debate), the first Chinese dataset for benchmarking target-independent stance detection and debate summarization. Our dataset consists of 1,218 real-world debates that were conducted in Chinese on 476 unique topics, containing 2,436 stance-specific summaries and 14,133 fully annotated utterances. Besides providing a versatile testbed for future research, we also conduct an empirical study on the dataset and propose an integrated task. The results show the challenging nature of the dataset and suggest a potential of incorporating stance detection in summarization for argumentative dialogue.
2022
pdf
bib
abs
TSMind: Alibaba and Soochow University’s Submission to the WMT22 Translation Suggestion Task
Xin Ge
|
Ke Wang
|
Jiayi Wang
|
Nini Xiao
|
Xiangyu Duan
|
Yu Zhao
|
Yuqi Zhang
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper describes the joint submission of Alibaba and Soochow University to the WMT 2022 Shared Task on Translation Suggestion (TS). We participate in the English to/from German and English to/from Chinese tasks. Basically, we utilize the model paradigm fine-tuning on the downstream tasks based on large-scale pre-trained models, which has recently achieved great success. We choose FAIR’s WMT19 English to/from German news translation system and MBART50 for English to/from Chinese as our pre-trained models. Considering the task’s condition of limited use of training data, we follow the data augmentation strategies provided by Yang to boost our TS model performance. And we further involve the dual conditional cross-entropy model and GPT-2 language model to filter augmented data. The leader board finally shows that our submissions are ranked first in three of four language directions in the Naive TS task of the WMT22 Translation Suggestion task.
pdf
bib
abs
Gated Mechanism Enhanced Multi-Task Learning for Dialog Routing
Ziming Huang
|
Zhuoxuan Jiang
|
Ke Wang
|
Juntao Li
|
Shanshan Feng
|
Xian-Ling Mao
Proceedings of the 29th International Conference on Computational Linguistics
Currently, human-bot symbiosis dialog systems, e.g. pre- and after-sales in E-commerce, are ubiquitous, and the dialog routing component is essential to improve the overall efficiency, reduce human resource cost and increase user experience. To satisfy this requirement, existing methods are mostly heuristic and cannot obtain high-quality performance. In this paper, we investigate the important problem by thoroughly mining both the data-to-task and task-to-task knowledge among various kinds of dialog data. To achieve the above target, we propose a comprehensive and general solution with multi-task learning framework, specifically including a novel dialog encoder and two tailored gated mechanism modules. The proposed Gated Mechanism enhanced Multi-task Model (G3M) can play the role of hierarchical information filtering and is non-invasive to the existing dialog systems. Experiments on two datasets collected from the real world demonstrate our method’s effectiveness and the results achieve the state-of-the-art performance by relatively increasing 8.7%/11.8% on RMSE metric and 2.2%/4.4% on F1 metric.
2021
pdf
bib
abs
TermMind: Alibaba’s WMT21 Machine Translation Using Terminologies Task Submission
Ke Wang
|
Shuqin Gu
|
Boxing Chen
|
Yu Zhao
|
Weihua Luo
|
Yuqi Zhang
Proceedings of the Sixth Conference on Machine Translation
This paper describes our work in the WMT 2021 Machine Translation using Terminologies Shared Task. We participate in the shared translation terminologies task in English to Chinese language pair. To satisfy terminology constraints on translation, we use a terminology data augmentation strategy based on Transformer model. We used tags to mark and add the term translations into the matched sentences. We created synthetic terms using phrase tables extracted from bilingual corpus to increase the proportion of term translations in training data. Detailed pre-processing and filtering on data, in-domain finetuning and ensemble method are used in our system. Our submission obtains competitive results in the terminology-targeted evaluation.
pdf
bib
abs
QEMind: Alibaba’s Submission to the WMT21 Quality Estimation Shared Task
Jiayi Wang
|
Ke Wang
|
Boxing Chen
|
Yu Zhao
|
Weihua Luo
|
Yuqi Zhang
Proceedings of the Sixth Conference on Machine Translation
Quality Estimation, as a crucial step of quality control for machine translation, has been explored for years. The goal is to to investigate automatic methods for estimating the quality of machine translation results without reference translations. In this year’s WMT QE shared task, we utilize the large-scale XLM-Roberta pre-trained model and additionally propose several useful features to evaluate the uncertainty of the translations to build our QE system, named QEMind . The system has been applied to the sentence-level scoring task of Direct Assessment and the binary score prediction task of Critical Error Detection. In this paper, we present our submissions to the WMT 2021 QE shared task and an extensive set of experimental results have shown us that our multilingual systems outperform the best system in the Direct Assessment QE task of WMT 2020.
pdf
bib
TransSum: Translating Aspect and Sentiment Embeddings for Self-Supervised Opinion Summarization
Ke Wang
|
Xiaojun Wan
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
pdf
bib
abs
Beyond Glass-Box Features: Uncertainty Quantification Enhanced Quality Estimation for Neural Machine Translation
Ke Wang
|
Yangbin Shi
|
Jiayi Wang
|
Yuqi Zhang
|
Yu Zhao
|
Xiaolin Zheng
Findings of the Association for Computational Linguistics: EMNLP 2021
Quality Estimation (QE) plays an essential role in applications of Machine Translation (MT). Traditionally, a QE system accepts the original source text and translation from a black-box MT system as input. Recently, a few studies indicate that as a by-product of translation, QE benefits from the model and training data’s information of the MT system where the translations come from, and it is called the “glass-box QE”. In this paper, we extend the definition of “glass-box QE” generally to uncertainty quantification with both “black-box” and “glass-box” approaches and design several features deduced from them to blaze a new trial in improving QE’s performance. We propose a framework to fuse the feature engineering of uncertainty quantification into a pre-trained cross-lingual language model to predict the translation quality. Experiment results show that our method achieves state-of-the-art performances on the datasets of WMT 2020 QE shared task.
2020
pdf
bib
abs
Alibaba’s Submission for the WMT 2020 APE Shared Task: Improving Automatic Post-Editing with Pre-trained Conditional Cross-Lingual BERT
Jiayi Wang
|
Ke Wang
|
Kai Fan
|
Yuqi Zhang
|
Jun Lu
|
Xin Ge
|
Yangbin Shi
|
Yu Zhao
Proceedings of the Fifth Conference on Machine Translation
The goal of Automatic Post-Editing (APE) is basically to examine the automatic methods for correcting translation errors generated by an unknown machine translation (MT) system. This paper describes Alibaba’s submissions to the WMT 2020 APE Shared Task for the English-German language pair. We design a two-stage training pipeline. First, a BERT-like cross-lingual language model is pre-trained by randomly masking target sentences alone. Then, an additional neural decoder on the top of the pre-trained model is jointly fine-tuned for the APE task. We also apply an imitation learning strategy to augment a reasonable amount of pseudo APE training data, potentially preventing the model to overfit on the limited real training data and boosting the performance on held-out data. To verify our proposed model and data augmentation, we examine our approach with the well-known benchmarking English-German dataset from the WMT 2017 APE task. The experiment results demonstrate that our system significantly outperforms all other baselines and achieves the state-of-the-art performance. The final results on the WMT 2020 test dataset show that our submission can achieve +5.56 BLEU and -4.57 TER with respect to the official MT baseline.
pdf
bib
abs
Adversarial Text Generation via Sequence Contrast Discrimination
Ke Wang
|
Xiaojun Wan
Findings of the Association for Computational Linguistics: EMNLP 2020
In this paper, we propose a sequence contrast loss driven text generation framework, which learns the difference between real texts and generated texts and uses that difference. Specifically, our discriminator contains a discriminative sequence generator instead of a binary classifier, and measures the ‘relative realism’ of generated texts against real texts by making use of them simultaneously. Moreover, our generator uses discriminative sequences to directly improve itself, which not only replaces the gradient propagation process from the discriminator to the generator, but also avoids the time-consuming sampling process of estimating rewards in some previous methods. We conduct extensive experiments with various metrics, substantiating that our framework brings improvements in terms of training stability and the quality of generated texts.
pdf
bib
abs
Computer Assisted Translation with Neural Quality Estimation and Automatic Post-Editing
Ke Wang
|
Jiayi Wang
|
Niyu Ge
|
Yangbin Shi
|
Yu Zhao
|
Kai Fan
Findings of the Association for Computational Linguistics: EMNLP 2020
With the advent of neural machine translation, there has been a marked shift towards leveraging and consuming the machine translation results. However, the gap between machine translation systems and human translators needs to be manually closed by post-editing. In this paper, we propose an end-to-end deep learning framework of the quality estimation and automatic post-editing of the machine translation output. Our goal is to provide error correction suggestions and to further relieve the burden of human translators through an interpretable model. To imitate the behavior of human translators, we design three efficient delegation modules – quality estimation, generative post-editing, and atomic operation post-editing and construct a hierarchical model based on them. We examine this approach with the English–German dataset from WMT 2017 APE shared task and our experimental results can achieve the state-of-the-art performance. We also verify that the certified translators can significantly expedite their post-editing processing with our model in human evaluation.