Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions

Haw-Shiuan Chang, Andrew McCallum


Abstract
Neural language models (LMs) such as GPT-2 estimate the probability distribution over the next word with a softmax over the vocabulary. The softmax layer produces this distribution from the dot products between a single hidden state and the embeddings of the words in the vocabulary. However, we discover that this single hidden state cannot represent all probability distributions, regardless of the LM size or training data size, because the hidden state embedding cannot be close to the embeddings of all the possible next words simultaneously when other interfering word embeddings lie between them. In this work, we demonstrate the importance of this limitation both theoretically and practically. Our work not only deepens our understanding of the softmax bottleneck and the mixture of softmax (MoS) but also inspires us to propose the multi-facet softmax (MFS) to address the limitations of MoS. Extensive empirical analyses confirm our findings and show that, compared to MoS, the proposed MFS achieves two-fold improvements in the perplexity of GPT-2 and BERT.
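
A minimal sketch of the contrast the abstract describes, assuming a toy vocabulary and randomly initialized parameters (this is not the paper's released code): a standard softmax head produces the next-word distribution from one hidden state's dot products with the word embeddings, while a mixture-of-softmax head projects the hidden state into several "facet" states and mixes their softmax distributions, which can place high probability on several well-separated words at once. The dimensions, variable names, and the per-facet projection matrices below are illustrative assumptions.

# Illustrative sketch only; shapes and parameters are made up for this example.
import numpy as np

rng = np.random.default_rng(0)
V, d, K = 10, 4, 3               # vocabulary size, hidden size, number of facets

E = rng.normal(size=(V, d))      # output word embeddings
h = rng.normal(size=(d,))        # single hidden state from the LM

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Standard softmax head: one dot product per vocabulary word.
# The next-word distribution is fully determined by where h sits relative
# to the word embeddings, which is the source of the bottleneck.
p_single = softmax(E @ h)

# Mixture of softmax: project h into K facet states, take a softmax for each
# facet, and average the K distributions with mixture weights (here random).
W_facets = rng.normal(size=(K, d, d))         # one projection per facet (assumed)
facets = np.stack([W @ h for W in W_facets])  # (K, d)
pi = softmax(rng.normal(size=(K,)))           # mixture weights
p_mos = (pi[:, None] * softmax(facets @ E.T)).sum(axis=0)

print(p_single.round(3))   # single-mode-prone distribution
print(p_mos.round(3))      # can express multi-mode distributions

In a trained model the projections and mixture weights are learned rather than random; the sketch only illustrates why multiple facet states give the output layer more expressive power than a single hidden state.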
Anthology ID:
2022.acl-long.554
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
8048–8073
URL:
https://aclanthology.org/2022.acl-long.554
DOI:
10.18653/v1/2022.acl-long.554
Cite (ACL):
Haw-Shiuan Chang and Andrew McCallum. 2022. Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8048–8073, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions (Chang & McCallum, ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.554.pdf
Software:
2022.acl-long.554.software.zip
Data:
ProtoQAWebText