Boyang Xue


2024

pdf bib
AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction
Hongru Wang | Rui Wang | Boyang Xue | Heming Xia | Jingtao Cao | Zeming Liu | Jeff Z. Pan | Kam-Fai Wong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) can interact with the real world by connecting with versatile external APIs, resulting in better problem-solving and task automation capabilities. Previous research primarily either focuses on APIs with limited arguments from a single source or overlooks the complex dependency relationship between different APIs. However, it is essential to utilize multiple APIs collaboratively from various sources, especially for complex user instructions. In this paper, we introduce MetaBench, the first benchmark to evaluate LLMs’ ability to plan and execute multiple APIs from various sources in order to complete the user’s task. Specifically, we consider two significant challenges in multiple APIs: 1) graph structures: some APIs can be executed independently while others need to be executed one by one, resulting in graph-like execution order; and 2) permission constraints: which source is authorized to execute the API call. We have experimental results on 9 distinct LLMs; e.g., GPT-4o achieves only a 2.0% success rate at the most complex instruction, revealing that the existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning and finetuning. Our code and data are publicly available at https://github.com/ruleGreen/AppBench.

pdf bib
Role Prompting Guided Domain Adaptation with General Capability Preserve for Large Language Models
Rui Wang | Fei Mi | Yi Chen | Boyang Xue | Hongru Wang | Qi Zhu | Kam-Fai Wong | Ruifeng Xu
Findings of the Association for Computational Linguistics: NAACL 2024

The growing interest in Large Language Models (LLMs) for specialized applications has revealed a significant challenge: when tailored to specific domains, LLMs tend to experience catastrophic forgetting, compromising their general capabilities and leading to a suboptimal user experience. Additionally, crafting a versatile model for multiple domains simultaneously often results in a decline in overall performance due to confusion between domains. In response to these issues, we present the RolE Prompting Guided Multi-Domain Adaptation (REGA) strategy. This novel approach effectively manages multi-domain LLM adaptation through three key components: 1) Self-Distillation constructs and replays general-domain exemplars to alleviate catastrophic forgetting. 2) Role Prompting assigns a central prompt to the general domain and a unique role prompt to each specific domain to minimize inter-domain confusion during training. 3) Role Integration reuses and integrates a small portion of domain-specific data to the general-domain data, which are trained under the guidance of the central prompt. The central prompt is used for a streamlined inference process, removing the necessity to switch prompts for different domains.Empirical results demonstrate that REGA effectively alleviates catastrophic forgetting and inter-domain confusion. This leads to improved domain-specific performance compared to standard fine-tuned models, while still preserving robust general capabilities.

pdf bib
LoRA Meets Dropout under a Unified Framework
Sheng Wang | Liheng Chen | Jiyue Jiang | Boyang Xue | Lingpeng Kong | Chuan Wu
Findings of the Association for Computational Linguistics: ACL 2024

With the remarkable capabilities, large language models (LLMs) have emergedas essential elements in numerous NLP applications, while parameter-efficientfinetuning, especially LoRA, has gained popularity as a lightweight approachfor model customization. Meanwhile, various dropout methods, initially designedfor full finetuning with all the parameters updated, alleviates overfittingassociated with excessive parameter redundancy. Hence, a possible contradictionarises from negligible trainable parameters of LoRA and the effectiveness ofprevious dropout methods, which has been largely overlooked. To fill this gap,we first confirm that parameter-efficient LoRA is also overfitting-prone. Wethen revisit transformer-specific dropout methods, and establish theirequivalence and distinctions mathematically and empirically. Building upon thiscomparative analysis, we introduce a unified framework for a comprehensiveinvestigation, which instantiates these methods based on dropping position,structural pattern and compensation measure. Through this framework, we revealthe new preferences and performance comparisons of them when involved withlimited trainable parameters. This framework also allows us to amalgamate themost favorable aspects into a novel dropout method named HiddenKey. Extensiveexperiments verify the remarkable superiority and sufficiency of HiddenKeyacross multiple models and tasks, which highlights it as the preferred approachfor high-performance and parameter-efficient finetuning of LLMs.

pdf bib
Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting
Rui Wang | Hongru Wang | Fei Mi | Boyang Xue | Yi Chen | Kam-Fai Wong | Ruifeng Xu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Numerous works are proposed to align large language models (LLMs) with human intents to better fulfill instructions, ensuring they are trustful and helpful.Nevertheless, some human instructions are often malicious or misleading and following them will lead to untruthful and unsafe responses.Previous work rarely focused on understanding how LLMs manage instructions based on counterfactual premises, referred to here as inductive instructions, which may stem from users’ false beliefs or malicious intents.In this paper, we aim to reveal the behaviors of LLMs towards inductive instructions and enhance their truthfulness and helpfulness accordingly. Specifically, we first introduce a benchmark of Inductive Instructions (INDust), where the false knowledge is incorporated into instructions in multiple different styles. After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions.Additionally, we identified that different inductive styles affect the models’ ability to identify the same underlying errors,and the complexity of the underlying assumptions also influences the model’s performance.Motivated by these results, we propose Dual-critique prompting to improve LLM robustness against inductive instructions.Our experiments demonstrate that Dual-critique prompting significantly bolsters the robustness of a diverse array of LLMs, even when confronted with varying degrees of inductive instruction complexity and differing inductive styles.

pdf bib
Fine-tuning after Prompting: an Explainable Way for Classification
Zezhong Wang | Luyao Ye | Hongru Wang | Boyang Xue | Yiming Du | Bin Liang | Kam-Fai Wong
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)

Prompting is an alternative approach for utilizing pre-trained language models (PLMs) in classification tasks. In contrast to fine-tuning, prompting is more understandable for humans because it utilizes natural language to interact with the PLM, but it often falls short in terms of accuracy. While current research primarily focuses on enhancing the performance of prompting methods to compete with fine-tuning, we believe that these two approaches are not mutually exclusive, each having its strengths and weaknesses. In our study, we depart from the competitive view of prompting versus fine-tuning and instead combine them, introducing a novel method called F&P. This approach enables us to harness the advantages of Fine-tuning for accuracy and the explainability of Prompting simultaneously. Specifically, we reformulate the sample into a prompt and subsequently fine-tune a linear classifier on top of the PLM. Following this, we extract verbalizers according to the weight of this classifier. During the inference phase, we reformulate the sample in the same way and query the PLM. The PLM generates a word, which is then subject to a dictionary lookup by the verbalizer to obtain the prediction. Experiments show that keeping only 30 keywords for each class can achieve comparable performance as fine-tuning. On the other hand, both the prompt and verbalizers are constructed in natural language, making them fully understandable to humans. Hence, the F&P method offers an effective and transparent way to employ a PLM for classification tasks.

pdf bib
PRoLoRA: Partial Rotation Empowers More Parameter-Efficient LoRA
Sheng Wang | Boyang Xue | Jiacheng Ye | Jiyue Jiang | Liheng Chen | Lingpeng Kong | Chuan Wu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the rapid scaling of large language models (LLMs), serving numerouslow-rank adaptations (LoRAs) concurrently has become increasingly impractical,leading to unaffordable costs and necessitating more parameter-efficientfinetuning methods. In this work, we introduce Partially Rotation-enhanced Low-Rank Adaptation (PRoLoRA), an intra-layer sharing mechanism comprising fouressential components: broadcast reduction, rotation enhancement,partially-sharing refinement, and rectified initialization strategy. As asuperset of LoRA, PRoLoRA retains its advantages, and effectively circumventthe drawbacks of peer parameter-sharing methods with superior model capacity,practical feasibility, and broad applicability. Empirical experimentsdemonstrate the remarkably higher parameter efficiency of PRoLoRA in bothspecific parameter budget and performance target scenarios, and its scalabilityto larger LLMs. Notably, with one time less trainable parameters, PRoLoRA stilloutperforms LoRA on multiple instruction tuning datasets. Subsequently, anablation study is conducted to validate the necessity of individual componentsand highlight the superiority of PRoLoRA over three potential variants.Hopefully, the conspicuously higher parameter efficiency can establish PRoLoRAas a resource-friendly alternative to LoRA.

pdf bib
UniRetriever: Multi-task Candidates Selection for Various Context-Adaptive Conversational Retrieval
Hongru Wang | Boyang Xue | Baohang Zhou | Rui Wang | Fei Mi | Weichao Wang | Yasheng Wang | Kam-Fai Wong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Conversational retrieval refers to an information retrieval system that operates in an iterative and interactive manner, requiring the retrieval of various external resources, such as persona, knowledge, and even response, to effectively engage with the user and successfully complete the dialogue. However, most previous work trained independent retrievers for each specific resource, resulting in sub-optimal performance and low efficiency. Thus, we propose a multi-task framework function as a universal retriever for three dominant retrieval tasks during the conversation: persona selection, knowledge selection, and response selection. To this end, we design a dual-encoder architecture consisting of a context-adaptive dialogue encoder and a candidate encoder, aiming to attention to the relevant context from the long dialogue and retrieve suitable candidates by simply a dot product. Furthermore, we introduce two loss constraints to capture the subtle relationship between dialogue context and different candidates by regarding historically selected candidates as hard negatives. Extensive experiments and analysis establish state-of-the-art retrieval quality both within and outside its training domain, revealing the promising potential and generalization capability of our model to serve as a universal retriever for different candidate selection tasks simultaneously.

2023

pdf bib
Improving Factual Consistency for Knowledge-Grounded Dialogue Systems via Knowledge Enhancement and Alignment
Boyang Xue | Weichao Wang | Hongru Wang | Fei Mi | Rui Wang | Yasheng Wang | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023

Pretrained language models (PLMs) based knowledge-grounded dialogue systems are prone to generate responses that are factually inconsistent with the provided knowledge source. In such inconsistent responses, the dialogue models fail to accurately express the external factual knowledge they rely upon. Inspired by previous work which identified that feedforward networks (FFNs) within Transformers are responsible for factual knowledge expressions, we investigate two methods to efficiently improve the factual expression capability of FFNs by knowledge enhancement and alignment respectively. We first propose K-Dial, which explicitly introduces extended FFNs in Transformers to enhance factual knowledge expressions given the specific patterns of knowledge-grounded dialogue inputs. Additionally, we apply the reinforcement learning for factual consistency (RLFC) method to implicitly adjust FFNs’ expressions in responses by aligning with gold knowledge for the factual consistency preference. To comprehensively assess the factual consistency and dialogue quality of responses, we employ extensive automatic measures and human evaluations including sophisticated fine-grained NLI-based metrics. Experimental results on WoW and CMU_DoG datasets demonstrate that our methods efficiently enhance the ability of the FFN module to convey factual knowledge, validating the efficacy of improving factual consistency for knowledge-grounded dialogue systems.