Yewei Fang
2024
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
Weilin Zhao
|
Yuxiang Huang
|
Xu Han
|
Wang Xu
|
Chaojun Xiao
|
Xinrong Zhang
|
Yewei Fang
|
Kaihuo Zhang
|
Zhiyuan Liu
|
Maosong Sun
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Speculative decoding is a widely used method that accelerates the generation process of large language models (LLMs) with no compromise in model performance. It achieves this goal by using an existing smaller model for drafting and then employing the target LLM to verify the draft in a low-cost parallel manner. Under such a drafting-verification framework, drafting efficiency has become a bottleneck in the final speedup of speculative decoding. Therefore, generating longer drafts at less cost can lead to better decoding speedup. To achieve this, we introduce Ouroboros, which can generate draft phrases to parallelize the drafting process and meanwhile lengthen drafts in a training-free manner. The experimental results on various typical text generation tasks show that Ouroboros can achieve speedups of up to 2.4× over speculative decoding and 3.9× over vanilla decoding, without fine-tuning draft and target models. Code available at https://github.com/thunlp/Ouroboros.
Tokenization Falling Short: On Subword Robustness in Large Language Models
Yekun Chai
|
Yewei Fang
|
Qiwei Peng
|
Xuhong Li
Findings of the Association for Computational Linguistics: EMNLP 2024
Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of tokens—issues we term *the curse of tokenization*. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We release our evaluation code and data at https://github.com/FloatAI/TKEval.