2024
pdf
bib
abs
How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition
Guanting Dong
|
Hongyi Yuan
|
Keming Lu
|
Chengpeng Li
|
Mingfeng Xue
|
Dayiheng Liu
|
Wei Wang
|
Zheng Yuan
|
Chang Zhou
|
Jingren Zhou
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) with enormous pre-training tokens and parameters emerge diverse abilities, including math reasoning, codegeneration, and instruction following. These abilities are further enhanced by supervised fine-tuning (SFT). While the open-source community has explored ad-hoc SFT for enhancing individual capabilities, proprietary LLMs exhibit versatility across various skills. Therefore, understanding the facilitation of multiple abilities via SFT is paramount. In this study, we specificially focuses on the interplay of data composition between mathematical reasoning, code generation, and general human-aligning abilities during SFT. We propose four intriguing research questions to explore the association between model performance and various factors including data amount, composition ratio, model size and SFT strategies. Our experiments reveal that distinct capabilities scale differently and larger models generally show superior performance with same amount of data. Mathematical reasoning and code generation consistently improve with increasing data amount, whereas general abilities plateau after roughly a thousand samples. Moreover, we observe data composition appears to enhance various abilities under limited data conditions, yet can lead to performance conflicts when data is plentiful. Our findings also suggest the amount of composition data influences performance more than the composition ratio. In analysis of SFT strategies, we find that sequentially learning multiple skills risks catastrophic forgetting. Our proposed Dual-stage Mixed Fine-tuning (DMT) strategy offers a promising solution to learn multiple abilities with different scaling patterns.
pdf
bib
abs
DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning
Yejie Wang
|
Keqing He
|
Guanting Dong
|
Pei Wang
|
Weihao Zeng
|
Muxi Diao
|
Weiran Xu
|
Jingang Wang
|
Mengdi Zhang
|
Xunliang Cai
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code Large Language Models (Code LLMs) have demonstrated outstanding performance in code-related tasks. Various instruction finetuning approaches have been proposed to boost the code generation performance of pre-trained Code LLMs. In this paper, we introduce a diverse instruction model DolphCoder with self-evaluating for code generation. It learns diverse instruction targets and combines a code evaluation objective to enhance its code generation ability. Our model achieves superior performance on the HumanEval and MBPP benchmarks, demonstrating new insights for future code instruction tuning work. Our key findings are: (1) Augmenting more diverse responses with more distinct reasoning paths increases the code capability of LLMs. (2) Improving one’s ability to evaluate the correctness of code also enhances their ability to create it.
pdf
bib
abs
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning
Chengpeng Li
|
Zheng Yuan
|
Hongyi Yuan
|
Guanting Dong
|
Keming Lu
|
Jiancan Wu
|
Chuanqi Tan
|
Xiang Wang
|
Chang Zhou
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In math reasoning with large language models (LLMs), fine-tuning data augmentation by query evolution and diverse reasoning paths is empirically verified effective, profoundly narrowing the gap between open-sourced LLMs and cutting-edge proprietary LLMs. In this paper, we conduct an investigation for such data augmentation in math reasoning and are intended to answer: (1) What strategies of data augmentation are more effective; (2) What is the scaling relationship between the amount of augmented data and model performance; and (3) Can data augmentation incentivize generalization to out-of-domain mathematical reasoning tasks?To this end, we create two new dataset AugGSM8K and AugMATH, by complicating and diversifying the queries and sampling multiple reasoning paths from GSM8K and MATH.We obtained a series of LLMs called MuggleMath by fine-tuning LLaMA models on AugGSM8K and AugMATH. MuggleMath substantially achieves new state-of-the-art on GSM8K and MATH.A log-linear relationship and a segmented log-linear are presented between MuggleMath’s performance and the amount of augmented data on GSM8K and MATH, respectively.We also find that it is weak in out-of-domain math reasoning generalization from AugGSM8K to MATH and from AugMATH to GSM8K, which suggests that augmenting queries that cover a broader range of subjects is more beneficial for generalization.
pdf
bib
abs
MSI-Agent: Incorporating Multi-Scale Insight into Embodied Agents for Superior Planning and Decision-Making
Dayuan Fu
|
Biqing Qi
|
Yihuai Gao
|
Che Jiang
|
Guanting Dong
|
Bowen Zhou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Insight gradually becomes a crucial form of long-term memory for an agent. However, the emergence of irrelevant insight and the lack of general insight can greatly undermine the effectiveness of insight. To solve this problem, in this paper, we introduce **M**ulti-**S**cale **I**nsight Agent (MSI-Agent), an embodied agent designed to improve LLMs’ planning and decision-making ability by summarizing and utilizing insight effectively across different scales. MSI achieves this through the experience selector, insight generator, and insight selector. Leveraging a three-part pipeline, MSI can generate task-specific and high-level insight, store it in a database, and then use relevant insight from it to aid in decision-making. Our experiments show that MSI outperforms another insight strategy when planning by GPT3.5. Moreover, We delve into the strategies for selecting seed experience and insight, aiming to provide LLM with more useful and relevant insight for better decision-making. Our observations also indicate that MSI exhibits better robustness when facing domain-shifting scenarios.
pdf
bib
abs
How Do Your Code LLMs perform? Empowering Code Instruction Tuning with Really Good Data
Yejie Wang
|
Keqing He
|
Dayuan Fu
|
Zhuoma GongQue
|
Heyang Xu
|
Yanxu Chen
|
Zhexu Wang
|
Yujia Fu
|
Guanting Dong
|
Muxi Diao
|
Jingang Wang
|
Mengdi Zhang
|
Xunliang Cai
|
Weiran Xu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recently, there has been a growing interest in studying how to construct better code instruction tuning data. However, we observe Code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which dataset genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show Xcoder achieves new state-of-the-art performance using fewer training data, which verify the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis on the data composition and find existing code datasets have different characteristics according to their construction methods, which provide new insights for future code LLMs.
pdf
bib
abs
ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models
Haoran Luo
|
Haihong E
|
Zichen Tang
|
Shiyao Peng
|
Yikai Guo
|
Wentai Zhang
|
Chenghao Ma
|
Guanting Dong
|
Meina Song
|
Wei Lin
|
Yifan Zhu
|
Anh Tuan Luu
Findings of the Association for Computational Linguistics: ACL 2024
Knowledge Base Question Answering (KBQA) aims to answer natural language questions over large-scale knowledge bases (KBs), which can be summarized into two crucial steps: knowledge retrieval and semantic parsing. However, three core challenges remain: inefficient knowledge retrieval, mistakes of retrieval adversely impacting semantic parsing, and the complexity of previous KBQA methods. To tackle these challenges, we introduce ChatKBQA, a novel and simple generate-then-retrieve KBQA framework, which proposes first generating the logical form with fine-tuned LLMs, then retrieving and replacing entities and relations with an unsupervised retrieval method, to improve both generation and retrieval more directly. Experimental results show that ChatKBQA achieves new state-of-the-art performance on standard KBQA datasets, WebQSP, and CWQ. This work can also be regarded as a new paradigm for combining LLMs with knowledge graphs (KGs) for interpretable and knowledge-required question answering.
pdf
bib
abs
Clear Up Confusion: Advancing Cross-Domain Few-Shot Relation Extraction through Relation-Aware Prompt Learning
Ge Bai
|
Chenji Lu
|
Daichi Guo
|
Shilong Li
|
Ying Liu
|
Zhang Zhang
|
Guanting Dong
|
Ruifang Liu
|
Sun Yong
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Cross-domain few-shot Relation Extraction (RE) aims to transfer knowledge from a source domain to a different target domain to address low-resource problems.Previous work utilized label descriptions and entity information to leverage the knowledge of the source domain.However, these models are prone to confusion when directly applying this knowledge to a target domain with entirely new types of relations, which becomes particularly pronounced when facing similar relations.In this work, we propose a relation-aware prompt learning method with pre-training.Specifically, we empower the model to clear confusion by decomposing various relation types through an innovative label prompt, while a context prompt is employed to capture differences in different scenarios, enabling the model to further discern confusion. Two pre-training tasks are designed to leverage the prompt knowledge and paradigm.Experiments show that our method outperforms previous sota methods, yielding significantly better results on cross-domain few-shot RE tasks.
2023
pdf
bib
abs
Large Language Models Meet Open-World Intent Discovery and Recognition: An Evaluation of ChatGPT
Xiaoshuai Song
|
Keqing He
|
Pei Wang
|
Guanting Dong
|
Yutao Mou
|
Jingang Wang
|
Yunsen Xian
|
Xunliang Cai
|
Weiran Xu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
The tasks of out-of-domain (OOD) intent discovery and generalized intent discovery (GID) aim to extend a closed intent classifier to open-world intent sets, which is crucial to task-oriented dialogue (TOD) systems. Previous methods address them by fine-tuning discriminative models. Recently, although some studies has been exploring the application of large language models (LLMs) represented by ChatGPT to various downstream tasks, it is still unclear for the ability of ChatGPT to discover and incrementally extent OOD intents. In this paper, we comprehensively evaluate ChatGPT on OOD intent discovery and GID, and then outline the strengths and weaknesses of ChatGPT. Overall, ChatGPT exhibits consistent advantages under zero-shot settings, but is still at a disadvantage compared to fine-tuned models. More deeply, through a series of analytical experiments, we summarize and discuss the challenges faced by LLMs including clustering, domain-specific understanding, and cross-domain in-context learning scenarios. Finally, we provide empirical guidance for future directions to address these challenges.
pdf
bib
abs
Generative Zero-Shot Prompt Learning for Cross-Domain Slot Filling with Inverse Prompting
Xuefeng Li
|
Liwen Wang
|
Guanting Dong
|
Keqing He
|
Jinzheng Zhao
|
Hao Lei
|
Jiachi Liu
|
Weiran Xu
Findings of the Association for Computational Linguistics: ACL 2023
Zero-shot cross-domain slot filling aims to transfer knowledge from the labeled source domain to the unlabeled target domain. Existing models either encode slot descriptions and examples or design handcrafted question templates using heuristic rules, suffering from poor generalization capability or robustness. In this paper, we propose a generative zero-shot prompt learning framework for cross-domain slot filling, both improving generalization and robustness than previous work. Besides, we introduce a novel inverse prompting strategy to distinguish different slot types to avoid the multiple prediction problem, and an efficient prompt tuning strategy to boost higher performance only training fewer prompt parameters. Experiments and analysis demonstrate the effectiveness of our proposed framework, especially huge improvements (+13.44% F1) on the unseen slots.
pdf
bib
abs
Pay Attention to Implicit Attribute Values: A Multi-modal Generative Framework for AVE Task
Yupeng Zhang
|
Shensi Wang
|
Peiguang Li
|
Guanting Dong
|
Sirui Wang
|
Yunsen Xian
|
Zhoujun Li
|
Hongzhi Zhang
Findings of the Association for Computational Linguistics: ACL 2023
Attribute Value Extraction (AVE) boosts many e-commerce platform services such as targeted recommendation, product retrieval and question answering. Most previous studies adopt an extractive framework such as named entity recognition (NER) to capture subtokens in the product descriptions as the corresponding values of target attributes. However, in the real world scenario, there also exist implicit attribute values that are not mentioned explicitly but embedded in the image information and implied text meaning of products, for which the power of extractive methods is severely constrained. To address the above issues, we exploit a unified multi-modal AVE framework named DEFLATE (a multi-modal unifieD framEwork For impLicit And expliciT AVE) to acquire implicit attribute values in addition to the explicit ones. DEFLATE consists of a QA-based generation model to produce candidate attribute values from the product information of different modalities, and a discriminative model to ensure the credibility of the generated answers. Meanwhile, to provide a testbed that close to the real world, we collect and annotate a multi-modal dataset with parts of implicit attribute values. Extensive experiments conducted on multiple datasets demonstrate that DEFLATE significantly outperforms previous methods on the extraction of implicit attribute values, while achieving comparable performance for the explicit ones.
pdf
bib
abs
DemoSG: Demonstration-enhanced Schema-guided Generation for Low-resource Event Extraction
Gang Zhao
|
Xiaocheng Gong
|
Xinjie Yang
|
Guanting Dong
|
Shudong Lu
|
Si Li
Findings of the Association for Computational Linguistics: EMNLP 2023
Most current Event Extraction (EE) methods focus on the high-resource scenario, which requires a large amount of annotated data and can hardly be applied to low-resource domains. To address EE more effectively with limited resources, we propose the Demonstration-enhanced Schema-guided Generation (DemoSG) model, which benefits low-resource EE from two aspects: Firstly, we propose the demonstration-based learning paradigm for EE to fully use the annotated data, which transforms them into demonstrations to illustrate the extraction process and help the model learn effectively. Secondly, we formulate EE as a natural language generation task guided by schema-based prompts, thereby leveraging label semantics and promoting knowledge transfer in low-resource scenarios. We conduct extensive experiments under in-domain and domain adaptation low-resource settings on three datasets, and study the robustness of DemoSG. The results show that DemoSG significantly outperforms current methods in low-resource scenarios.
pdf
bib
abs
DemoNSF: A Multi-task Demonstration-based Generative Framework for Noisy Slot Filling Task
Guanting Dong
|
Tingfeng Hui
|
Zhuoma GongQue
|
Jinxu Zhao
|
Daichi Guo
|
Gang Zhao
|
Keqing He
|
Weiran Xu
Findings of the Association for Computational Linguistics: EMNLP 2023
Recently, prompt-based generative frameworks have shown impressive capabilities in sequence labeling tasks. However, in practical dialogue scenarios, relying solely on simplistic templates and traditional corpora presents a challenge for these methods in generalizing to unknown input perturbations. To address this gap, we propose a multi-task demonstration-based generative framework for noisy slot filling, named DemoNSF. Specifically, we introduce three noisy auxiliary tasks, namely noisy recovery (NR), random mask (RM), and hybrid discrimination (HD), to implicitly capture semantic structural information of input perturbations at different granularities. In the downstream main task, we design a noisy demonstration construction strategy for the generative framework, which explicitly incorporates task-specific information and perturbed distribution during training and inference. Experiments on two benchmarks demonstrate that DemoNSF outperforms all baseline methods and achieves strong generalization. Further analysis provides empirical guidance for the practical application of generative frameworks. Our code is released at https://github.com/dongguanting/Demo-NSF.
pdf
bib
abs
Semantic Parsing by Large Language Models for Intricate Updating Strategies of Zero-Shot Dialogue State Tracking
Yuxiang Wu
|
Guanting Dong
|
Weiran Xu
Findings of the Association for Computational Linguistics: EMNLP 2023
Zero-shot Dialogue State Tracking (DST) addresses the challenge of acquiring and annotating task-oriented dialogues, which can be time-consuming and costly. However, DST extends beyond simple slot-filling and requires effective updating strategies for tracking dialogue state as conversations progress. In this paper, we propose ParsingDST, a new In-Context Learning (ICL) method, to introduce additional intricate updating strategies in zero-shot DST. Our approach reformulates the DST task by leveraging powerful Large Language Models (LLMs) and translating the original dialogue text to JSON through semantic parsing as an intermediate state. We also design a novel framework that includes more modules to ensure the effectiveness of updating strategies in the text-to-JSON process. Experimental results demonstrate that our approach outperforms existing zero-shot DST methods on MultiWOZ, exhibiting significant improvements in Joint Goal Accuracy (JGA) and slot accuracy compared to existing ICL methods.
2022
pdf
bib
abs
PSSAT: A Perturbed Semantic Structure Awareness Transferring Method for Perturbation-Robust Slot Filling
Guanting Dong
|
Daichi Guo
|
Liwen Wang
|
Xuefeng Li
|
Zechen Wang
|
Chen Zeng
|
Keqing He
|
Jinzheng Zhao
|
Hao Lei
|
Xinyue Cui
|
Yi Huang
|
Junlan Feng
|
Weiran Xu
Proceedings of the 29th International Conference on Computational Linguistics
Most existing slot filling models tend to memorize inherent patterns of entities and corresponding contexts from training data. However, these models can lead to system failure or undesirable outputs when being exposed to spoken language perturbation or variation in practice. We propose a perturbed semantic structure awareness transferring method for training perturbation-robust slot filling models. Specifically, we introduce two MLM-based training strategies to respectively learn contextual semantic structure and word distribution from unsupervised language perturbation corpus. Then, we transfer semantic knowledge learned from upstream training procedure into the original samples and filter generated data by consistency processing. These procedures aims to enhance the robustness of slot filling models. Experimental results show that our method consistently outperforms the previous basic methods and gains strong generalization while preventing the model from memorizing inherent patterns of entities and contexts.
pdf
bib
abs
Exploiting domain-slot related keywords description for Few-Shot Cross-Domain Dialogue State Tracking
Gao Qixiang
|
Guanting Dong
|
Yutao Mou
|
Liwen Wang
|
Chen Zeng
|
Daichi Guo
|
Mingyang Sun
|
Weiran Xu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Collecting dialogue data with domain-slot-value labels for dialogue state tracking (DST) could be a costly process. In this paper, we propose a novel framework based on domain-slot related description to tackle the challenge of few-shot cross-domain DST. Specifically, we design an extraction module to extract domain-slot related verbs and nouns in the dialogue. Then, we integrates them into the description, which aims to prompt the model to identify the slot information. Furthermore, we introduce a random sampling strategy to improve the domain generalization ability of the model. We utilize a pre-trained model to encode contexts and description and generates answers with an auto-regressive manner. Experimental results show that our approaches substantially outperform the existing few-shot DST methods on MultiWOZ and gain strong improvements on the slot accuracy comparing to existing slot description methods.
pdf
bib
abs
Entity-level Interaction via Heterogeneous Graph for Multimodal Named Entity Recognition
Gang Zhao
|
Guanting Dong
|
Yidong Shi
|
Haolong Yan
|
Weiran Xu
|
Si Li
Findings of the Association for Computational Linguistics: EMNLP 2022
Multimodal Named Entity Recognition (MNER) faces two specific challenges: 1) How to capture useful entity-related visual information. 2) How to alleviate the interference of visual noise. Previous works have gained progress by improving interacting mechanisms or seeking for better visual features. However, existing methods neglect the integrity of entity semantics and conduct cross-modal interaction at token-level, which cuts apart the semantics of entities and makes non-entity tokens easily interfered with by irrelevant visual noise. Thus in this paper, we propose an end-to-end heterogeneous Graph-based Entity-level Interacting model (GEI) for MNER. GEI first utilizes a span detection subtask to obtain entity representations, which serve as the bridge between two modalities. Then, the heterogeneous graph interacting network interacts entity with object nodes to capture entity-related visual information, and fuses it into only entity-associated tokens to rid non-entity tokens of the visual noise. Experiments on two widely used datasets demonstrate the effectiveness of our method. Our code will be available at https://github.com/GangZhao98/GEI.
pdf
bib
abs
Semi-Supervised Knowledge-Grounded Pre-training for Task-Oriented Dialog Systems
Weihao Zeng
|
Keqing He
|
Zechen Wang
|
Dayuan Fu
|
Guanting Dong
|
Ruotong Geng
|
Pei Wang
|
Jingang Wang
|
Chaobo Sun
|
Wei Wu
|
Weiran Xu
Proceedings of the Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems (SereTOD)
Recent advances in neural approaches greatly improve task-oriented dialogue (TOD) systems which assist users to accomplish their goals. However, such systems rely on costly manually labeled dialogs which are not available in practical scenarios. In this paper, we present our models for Track 2 of the SereTOD 2022 challenge, which is the first challenge of building semisupervised and reinforced TOD systems on a large-scale real-world Chinese TOD dataset MobileCS. We build a knowledge-grounded dialog model to formulate dialog history and local KB as input and predict the system response. And we perform semi-supervised pretraining both on the labeled and unlabeled data. Our system achieves the first place both in the automatic evaluation and human interaction, especially with higher BLEU (+7.64) and Success (+13.6%) than the second place.