Vijay Gadepally


2024

pdf bib
Sprout: Green Generative AI with Carbon-Efficient LLM Inference
Baolin Li | Yankai Jiang | Vijay Gadepally | Devesh Tiwari
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The rapid advancement of generative AI has heightened environmental concerns, particularly regarding carbon emissions. Our framework, Sprout, addresses these challenges by reducing the carbon footprint of inference in large language models (LLMs). Sprout introduces “generation directives” to guide the autoregressive generation process, achieving a balance between ecological sustainability and high-quality outputs. By employing a strategic optimizer for directive assignment and a novel offline quality evaluator, Sprout reduces the carbon footprint of generative LLM inference by over 40% in real-world evaluations, using the Llama model and global electricity grid data. This work is crucial as the rising interest in inference time compute scaling laws amplifies environmental concerns, emphasizing the need for eco-friendly AI solutions.

2022

pdf bib
Great Power, Great Responsibility: Recommendations for Reducing Energy for Training Language Models
Joseph McDonald | Baolin Li | Nathan Frey | Devesh Tiwari | Vijay Gadepally | Siddharth Samsi
Findings of the Association for Computational Linguistics: NAACL 2022

The energy requirements of current natural language processing models continue to grow at a rapid, unsustainable pace. Recent works highlighting this problem conclude there is an urgent need for methods that reduce the energy needs of NLP and machine learning more broadly. In this article, we investigate techniques that can be used to reduce the energy consumption of common NLP applications. In particular, we focus on techniques to measure energy usage and different hardware and datacenter-oriented settings that can be tuned to reduce energy consumption for training and inference for language models. We characterize the impact of these settings on metrics such as computational performance and energy consumption through experiments conducted on a high performance computing system as well as popular cloud computing platforms. These techniques can lead to significant reduction in energy consumption when training language models or their use for inference. For example, power-capping, which limits the maximum power a GPU can consume, can enable a 15% decrease in energy usage with marginal increase in overall computation time when training a transformer-based language model.