2024
ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency
Yuhang Yao | Han Jin | Alay Dilipbhai Shah | Shanshan Han | Zijian Hu | Dimitris Stripelis | Yide Ran | Zhaozhuo Xu | Salman Avestimehr | Chaoyang He
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual sub-procedures, e.g., local inference and communication; however, no comprehensive framework provides a holistic system view for optimizing LLM serving in an end-to-end manner. In this work, we conduct a detailed analysis to identify major bottlenecks that impact end-to-end latency in LLM serving systems. Our analysis reveals that a comprehensive LLM serving endpoint must address a series of efficiency bottlenecks that extend beyond LLM inference. We then propose ScaleLLM, an optimized system for resource-efficient LLM serving. Our extensive experiments show that with 64 concurrent requests on Mixtral 8x7B, ScaleLLM achieves a 4.3× speedup over vLLM and outperforms state-of-the-art systems with 1.5× higher throughput.
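The bottleneck analysis the abstract describes amounts to instrumenting every stage of the serving path, not just the model forward pass. Below is a minimal, hypothetical sketch of such a per-stage latency breakdown; the stage names and the placeholder tokenizer are illustrative assumptions, not ScaleLLM's actual pipeline.

```python
import time
from contextlib import contextmanager

stage_latency = {}

@contextmanager
def timed(stage):
    start = time.perf_counter()
    yield
    stage_latency[stage] = stage_latency.get(stage, 0.0) + time.perf_counter() - start

def serve(request):
    with timed("tokenize"):
        tokens = request.split()                    # placeholder tokenizer
    with timed("inference"):
        time.sleep(0.01)                            # stand-in for the model forward passes
        output_tokens = tokens
    with timed("detokenize_and_respond"):
        response = " ".join(output_tokens)
    return response

serve("hello world")
print(sorted(stage_latency.items(), key=lambda kv: -kv[1]))  # biggest bottleneck first
```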
TensorOpera Router: A Multi-Model Router for Efficient LLM Inference
Dimitris Stripelis | Zhaozhuo Xu | Zijian Hu | Alay Dilipbhai Shah | Han Jin | Yuhang Yao | Jipeng Zhang | Tong Zhang | Salman Avestimehr | Chaoyang He
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective LLM query response methods. Yet, no single LLM exists that efficiently balances this trilemma. Some models are powerful but extremely costly, while others are fast and inexpensive but qualitatively inferior. To address this challenge, we present TO-Router, a non-monolithic LLM querying system that seamlessly integrates various LLM experts into a single query interface and dynamically routes incoming queries to the best-performing expert based on the query’s requirements. Through extensive experiments, we demonstrate that when compared to standalone expert models, TO-Router improves query efficiency by up to 40% and leads to significant cost reductions of up to 30%, while maintaining or enhancing model performance by up to 10%.
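As a rough illustration of the routing idea, the sketch below picks the cheapest expert predicted to clear a quality bar. The expert names, costs, and quality scores are invented for the example; TO-Router's actual routing decision is learned from the query, not a lookup table.

```python
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    cost_per_call: float           # illustrative dollars per query
    quality: dict                  # category -> predicted answer quality in [0, 1]

EXPERTS = [
    Expert("large-generalist", 0.030, {"code": 0.92, "chat": 0.95}),
    Expert("small-coder",      0.002, {"code": 0.88, "chat": 0.60}),
    Expert("small-chat",       0.001, {"code": 0.40, "chat": 0.85}),
]

def route(category, min_quality=0.8):
    """Pick the cheapest expert predicted to meet the quality bar."""
    viable = [e for e in EXPERTS if e.quality.get(category, 0.0) >= min_quality]
    if not viable:  # no expert clears the bar: fall back to the best one
        return max(EXPERTS, key=lambda e: e.quality.get(category, 0.0))
    return min(viable, key=lambda e: e.cost_per_call)

print(route("code").name)   # small-coder: clears the bar at a fraction of the cost
print(route("chat").name)   # small-chat
```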
Ethos: Rectifying Language Models in Orthogonal Parameter Space
Lei Gao | Yue Niu | Tingting Tang | Salman Avestimehr | Murali Annavaram
Findings of the Association for Computational Linguistics: NAACL 2024
Language models (LMs) have greatly propelled research on natural language processing. However, LMs also raise concerns regarding the generation of biased or toxic content and the potential disclosure of private information from the training dataset. In this work, we present a new efficient approach, Ethos, that rectifies LMs to mitigate toxicity and bias in outputs and avoid privacy leakage. Ethos is built on task arithmetic. However, unlike current task arithmetic algorithms, Ethos distinguishes general beneficial knowledge from undesired knowledge when reconstructing task vectors. Specifically, Ethos first obtains a set of principal components from the pre-trained models using singular value decomposition. Then, by projecting the task vector onto these principal components, Ethos separates the components that encode general knowledge from those associated with undesired knowledge. Ethos performs forgetting or unlearning by negating only the portion of the task vector carrying undesired knowledge, thereby minimizing collateral damage to general model utility. We demonstrate the efficacy of our approach on three different tasks: bias, toxicity, and memorization unlearning. Evaluations show Ethos is more effective in removing undesired knowledge while maintaining overall model performance compared to current task arithmetic methods.
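The projection-and-negation step can be sketched in a few lines of numpy on toy matrices. One loud simplification: how Ethos decides which components carry undesired knowledge (via a model fine-tuned on undesired data) is reduced here to an arbitrary magnitude filter, so this is a shape-of-the-algorithm sketch, not the paper's criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
W_pre = rng.normal(size=(8, 8))                    # toy pre-trained weight matrix
W_ft = W_pre + 0.1 * rng.normal(size=(8, 8))       # after fine-tuning on undesired data

# 1) Principal components of the pre-trained weights via SVD.
U, S, Vt = np.linalg.svd(W_pre)

# 2) The task vector, expressed as coefficients in the SVD basis.
tau = W_ft - W_pre
tau_coeff = U.T @ tau @ Vt.T

# 3) Flag components deemed to carry undesired knowledge (here: simply the
#    top-decile coefficients by magnitude, an illustrative criterion only).
mask = np.abs(tau_coeff) >= np.quantile(np.abs(tau_coeff), 0.9)
tau_undesired = U @ (tau_coeff * mask) @ Vt

# 4) Negate only that filtered task vector, sparing general knowledge.
W_rectified = W_pre - tau_undesired
print(np.linalg.norm(W_rectified - W_pre))         # small, targeted edit
```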
Revisiting OPRO: The Limitations of Small-Scale LLMs as Optimizers
Tuo Zhang | Jinyue Yuan | Salman Avestimehr
Findings of the Association for Computational Linguistics: ACL 2024
Numerous recent works aim to enhance the efficacy of Large Language Models (LLMs) through strategic prompting. In particular, the Optimization by PROmpting (OPRO) approach provides state-of-the-art performance by leveraging LLMs as optimizers, where the optimization task is to find instructions that maximize the task accuracy. In this paper, we revisit OPRO for automated prompting with relatively small-scale LLMs, such as the LLaMA-2 family and Mistral 7B. Our investigation reveals that OPRO shows limited effectiveness for small-scale LLMs, whose limited inference capabilities constrain its optimization ability. We suggest that future automatic prompt engineering consider both model capabilities and computational costs. Additionally, for small-scale LLMs, we recommend direct instructions that clearly outline objectives and methodologies as robust prompt baselines, ensuring efficient and effective prompt engineering in ongoing research.
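For context, the OPRO loop being revisited can be schematized as follows: the optimizer LLM is shown previously scored instructions and asked to propose a better one. Both call_llm and evaluate_accuracy below are toy stand-ins of our own, not interfaces from the paper; the finding above is that this loop stalls when the optimizer is a small-scale model.

```python
def call_llm(meta_prompt):
    # Toy stand-in for a chat-completion call; always returns a canned proposal.
    return "Let's work through the problem step by step and state the answer."

def evaluate_accuracy(instruction):
    # Toy scorer standing in for accuracy on a held-out task set.
    return min(1.0, len(instruction) / 100)

def opro_step(history):
    scored = "\n".join(f"text: {ins}\nscore: {acc:.2f}" for ins, acc in history)
    meta_prompt = (
        "Below are instructions with their task accuracies:\n"
        f"{scored}\n"
        "Propose a new instruction that achieves a higher accuracy."
    )
    candidate = call_llm(meta_prompt)
    return candidate, evaluate_accuracy(candidate)

history = [("Answer the question.", 0.55)]
for _ in range(3):
    history.append(opro_step(history))
print(max(history, key=lambda t: t[1]))   # best instruction found so far
```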
MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs
Yavuz Faruk Bakman | Duygu Nur Yaldiz | Baturalp Buyukates | Chenyang Tao | Dimitrios Dimitriadis | Salman Avestimehr
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generative Large Language Models (LLMs) are widely utilized for their excellence in various tasks. However, their tendency to produce inaccurate or misleading outputs poses a potential risk, particularly in high-stakes environments. Therefore, estimating the correctness of generative LLM outputs is an important task for enhanced reliability. Uncertainty Estimation (UE) in generative LLMs is an evolving domain, where SOTA probability-based methods commonly employ length-normalized scoring. In this work, we propose Meaning-Aware Response Scoring (MARS) as an alternative to length-normalized scoring for UE methods. MARS is a novel scoring function that considers the semantic contribution of each token in the generated sequence in the context of the question. We demonstrate that integrating MARS into UE methods results in a universal and significant improvement in UE performance. We conduct experiments using three distinct closed-book question-answering datasets across five popular pre-trained LLMs. Lastly, we validate the efficacy of MARS on a Medical QA dataset. Code can be found here.
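The contrast with length-normalized scoring can be made concrete: length normalization is a geometric mean of token probabilities, i.e., uniform weights over token log-probabilities, while a MARS-style score reweights the same log-probabilities by per-token semantic importance. The probabilities and importance weights below are made up for illustration; the paper computes the weights from the question context rather than assuming them.

```python
import math

token_probs = [0.9, 0.2, 0.95, 0.9]      # p(token_i | question, prefix), toy values
importance = [0.05, 0.75, 0.05, 0.15]    # per-token semantic weights, assumed given

length_normalized = math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))
meaning_aware = math.exp(sum(w * math.log(p) for w, p in zip(importance, token_probs)))

print(f"length-normalized: {length_normalized:.3f}")  # every token counts equally
print(f"meaning-aware:     {meaning_aware:.3f}")      # dominated by the key token
```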
2022
Federated Learning with Noisy User Feedback
Rahul Sharma | Anil Ramakrishna | Ansel MacLaughlin | Anna Rumshisky | Jimit Majmudar | Clement Chung | Salman Avestimehr | Rahul Gupta
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Machine Learning (ML) systems are getting increasingly popular, and drive more and more applications and services in our daily life. This has led to growing concerns over user privacy, since human interaction data typically needs to be transmitted to the cloud in order to train and improve such systems. Federated learning (FL) has recently emerged as a method for training ML models on edge devices using sensitive user data and is seen as a way to mitigate concerns over data privacy. However, since ML models are most commonly trained with label supervision, we need a way to extract labels on edge to make FL viable. In this work, we propose a strategy for training FL models using positive and negative user feedback. We also design a novel framework to study different noise patterns in user feedback, and explore how well standard noise-robust objectives can help mitigate this noise when training models in a federated setting. We evaluate our proposed training setup through detailed experiments on two text classification datasets and analyze the effects of varying levels of user reliability and feedback noise on model performance. We show that our method improves substantially over a self-training baseline, achieving performance closer to models trained with full supervision.
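A toy end-to-end sketch of this setup, under assumptions of our own: clients derive binary labels from thumbs-up/down feedback, a fraction of which is flipped to model noise; plain label smoothing stands in for the noise-robust objectives studied, and FedAvg aggregates the updates.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_labels(y_true, flip_prob):
    """Binary feedback labels, flipped with probability flip_prob."""
    flips = rng.random(y_true.shape) < flip_prob
    return np.where(flips, 1 - y_true, y_true)

def local_train(w, X, y, lr=0.1, steps=50, smooth=0.1):
    y_soft = y * (1 - smooth) + 0.5 * smooth        # label smoothing vs. noise
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))            # logistic model
        w -= lr * X.T @ (p - y_soft) / len(y)       # gradient of smoothed BCE
    return w

dim = 5
w_true = rng.normal(size=dim)                       # shared "true" decision rule
w_global = np.zeros(dim)
for _ in range(10):                                 # federated rounds
    updates = []
    for _ in range(3):                              # three clients per round
        X = rng.normal(size=(40, dim))
        y = (X @ w_true > 0).astype(float)
        updates.append(local_train(w_global.copy(), X, noisy_labels(y, 0.2)))
    w_global = np.mean(updates, axis=0)             # FedAvg aggregation
print(np.corrcoef(w_global, w_true)[0, 1])          # clearly positive despite noise
```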
FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks
Bill Yuchen Lin | Chaoyang He | Zihang Ze | Hulin Wang | Yufen Hua | Christophe Dupuy | Rahul Gupta | Mahdi Soltanolkotabi | Xiang Ren | Salman Avestimehr
Findings of the Association for Computational Linguistics: NAACL 2022
Increasing concerns and regulations about data privacy and sparsity necessitate the study of privacy-preserving, decentralized learning methods for natural language processing (NLP) tasks. Federated learning (FL) provides promising approaches for a large number of clients (e.g., personal devices or organizations) to collaboratively learn a shared global model that benefits all clients while allowing users to keep their data locally. Despite interest in studying FL methods for NLP tasks, a systematic comparison and analysis is lacking in the literature. Herein, we present FedNLP, a benchmarking framework for evaluating federated learning methods on four different task formulations: text classification, sequence tagging, question answering, and seq2seq. We propose a universal interface between Transformer-based language models (e.g., BERT, BART) and FL methods (e.g., FedAvg, FedOPT, etc.) under various non-IID partitioning strategies. Our extensive experiments with FedNLP provide empirical comparisons between FL methods and help us better understand the inherent challenges of this direction. The comprehensive analysis points to intriguing and exciting future research aimed at developing FL methods for NLP tasks.
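One concrete piece of such a benchmark is the non-IID partitioning. A common label-skew recipe, shown below as an illustration (not necessarily FedNLP's exact implementation), draws each client's label mixture from a Dirichlet distribution; smaller alpha means heavier skew.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split example indices so each client's label mix is Dirichlet-skewed."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.where(labels == cls)[0])
        shares = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, chunk in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(chunk.tolist())
    return client_indices

labels = np.repeat([0, 1, 2, 3], 250)                 # toy 4-class dataset
for i, part in enumerate(dirichlet_partition(labels, n_clients=5, alpha=0.1)):
    print(i, np.bincount(labels[part], minlength=4))  # heavily skewed label counts
```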
ActPerFL: Active Personalized Federated Learning
Huili Chen | Jie Ding | Eric Tramel | Shuang Wu | Anit Kumar Sahu | Salman Avestimehr | Tao Zhang
Proceedings of the First Workshop on Federated Learning for Natural Language Processing (FL4NLP 2022)
In the context of personalized federated learning (FL), the critical challenge is to balance local model improvement and global model tuning when the personal and global objectives may not be exactly aligned. Inspired by Bayesian hierarchical models, we develop ActPerFL, a self-aware personalized FL method where each client can automatically balance the training of its local personal model and the global model that implicitly contributes to other clients’ training. Such a balance is derived from the inter-client and intra-client uncertainty quantification. Consequently, ActPerFL can adapt to the underlying clients’ heterogeneity with uncertainty-driven local training and model aggregation. With experimental studies on Sent140 and Amazon Alexa audio data, we show that ActPerFL can achieve superior personalization performance compared with the existing counterparts.
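The local/global balance can be caricatured as a precision-weighted combination of parameter estimates: the more uncertain a client's local estimate, the more it leans on the global model. The sketch below is a toy Bayesian-flavored illustration of that intuition, not ActPerFL's actual update rule.

```python
import numpy as np

def personalize(w_local, var_local, w_global, var_global):
    """Precision-weighted average of local and global parameter estimates."""
    prec_l, prec_g = 1.0 / var_local, 1.0 / var_global
    return (prec_l * w_local + prec_g * w_global) / (prec_l + prec_g)

w_global = np.array([1.0, 1.0])
w_local = np.array([2.0, 0.0])

# A data-rich client (low local variance) keeps most of its personal model...
print(personalize(w_local, 0.1, w_global, 1.0))
# ...while a data-poor client (high local variance) stays near the global one.
print(personalize(w_local, 10.0, w_global, 1.0))
```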