Ilia Shumailov
2023
Revisiting Automated Prompting: Are We Actually Doing Better?
Yulin Zhou | Yiren Zhao | Ilia Shumailov | Robert Mullins | Yarin Gal
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Current literature demonstrates that Large Language Models (LLMs) are great few-shot learners, and prompting significantly increases their performance on a range of downstream tasks in a few-shot learning setting. An attempt to automate human-led prompting followed, with some progress achieved. In particular, subsequent work demonstrates that automation can outperform fine-tuning in certain K-shot learning scenarios. In this paper, we revisit techniques for automated prompting on six different downstream tasks and a larger range of K-shot learning settings. We find that automated prompting does not consistently outperform simple manual prompting. Our work suggests that, in addition to fine-tuning, manual prompting should be used as a baseline in this line of research.
Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?
Cheng Zhang | Jianyi Cheng | Ilia Shumailov | George Constantinides | Yiren Zhao
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Inference with large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve 19× higher arithmetic density and 5× higher memory density than the float32 baseline, surpassing the prior-art 8-bit quantisation by 2.5× in arithmetic density and 1.2× in memory density, without requiring any data calibration or re-training. We also share our insights into sub-8-bit LLM quantisation, including the mismatch between activation and weight distributions, optimal fine-tuning strategies, and a lower quantisation granularity inherent in the statistical properties of LLMs. The latter two tricks enable nearly-lossless 4-bit LLMs on downstream tasks. Our code is open-sourced.
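As a rough illustration of what sharing a scaling factor across a block of numbers can look like, the Python sketch below quantises a list of floats block by block. It is a simplified sketch under stated assumptions, not the paper's implementation: the function names, the symmetric signed-integer format, the default block size of 16, and the max-abs scaling rule are all hypothetical choices made here for illustration.

```python
def quantise_block(block, bits=6):
    """Quantise one block of floats to signed integer codes sharing one scale.

    Assumption: symmetric integer codes with a max-abs scale. The paper studies
    several block formats; this is only a generic stand-in.
    """
    qmax = 2 ** (bits - 1) - 1                  # 31 for 6-bit signed codes
    max_abs = max(abs(x) for x in block)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp into the representable range.
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return codes, scale


def dequantise_block(codes, scale):
    """Map integer codes back to floats using the block's shared scale."""
    return [c * scale for c in codes]


def block_quantise(values, block_size=16, bits=6):
    """Quantise then dequantise `values`, one shared scale per block."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        codes, scale = quantise_block(block, bits)
        out.extend(dequantise_block(codes, scale))
    return out


# Small and large magnitudes side by side: a scaling offset between groups.
vals = [0.01, -0.02, 0.015, 0.005, 10.0, -20.0, 15.0, 5.0]
fine = block_quantise(vals, block_size=4)    # one scale per group of four
coarse = block_quantise(vals, block_size=8)  # one scale for everything
```

With `block_size=4`, the small values get their own scale and survive quantisation; with a single scale over all eight values, the large entries dominate the scale and the small ones round to zero. This is one way to picture how finer-grained shared scaling can reduce the effect of scaling offsets.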
Co-authors
- Yiren Zhao 2
- Jianyi Cheng 1
- George Constantinides 1
- Yarin Gal 1
- Robert Mullins 1