Absolute Position Embedding Learns Sinusoid-like Waves for Attention Based on Relative Position

Yuji Yamamoto, Takuya Matsuzaki


Abstract
Attention weights are a clue to interpreting how a Transformer-based model makes an inference. In some attention heads, the attention focuses on the neighbors of each token. This allows the output vector of each token to depend on the surrounding tokens and contributes to making the inference context-dependent. We analyze the mechanism behind the concentration of attention on nearby tokens. We show that the phenomenon emerges as follows: (1) the learned position embedding has sinusoid-like components, (2) such components are transmitted to the query and the key in the self-attention, (3) the attention head shifts the phases of the sinusoid-like components so that the attention concentrates on nearby tokens at specific relative positions. In other words, a certain type of Transformer-based model acquires the sinusoidal positional encoding to some extent on its own through Masked Language Modeling.
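(As a rough illustration of step (3), not taken from the paper: the sketch below shows, in NumPy, why a phase shift applied to a sinusoid-like component makes the query-key dot product depend only on the relative position. The frequency `omega`, the phase shift `phi`, and the restriction to a single 2-D sinusoidal subspace are assumptions made for the illustration.)

```python
# Minimal sketch (assumed setup, not the authors' code): a position embedding
# with one sinusoid-like component of frequency omega; the query projection
# applies a phase shift phi (a 2-D rotation), the key projection does not.
# The resulting score q_i . k_j = cos(omega * (i - j) - phi) depends only on
# the relative position i - j, so attention peaks at a fixed offset.
import numpy as np

omega = 0.3          # assumed frequency of the sinusoid-like component
phi = -2 * omega     # assumed phase shift applied by the attention head
positions = np.arange(16)

# Position embedding restricted to one 2-D sinusoidal subspace.
pos_emb = np.stack([np.sin(omega * positions),
                    np.cos(omega * positions)], axis=-1)   # shape (16, 2)

# Query: the same sinusoid with its phase shifted by phi.
rot = np.array([[np.cos(phi), -np.sin(phi)],
                [np.sin(phi),  np.cos(phi)]])
queries = pos_emb @ rot.T
keys = pos_emb

scores = queries @ keys.T   # scores[i, j] = cos(omega * (i - j) - phi)
i = 8
print(np.argmax(scores[i]))  # prints 10: attention peaks at j = i + 2
```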
Anthology ID:
2023.emnlp-main.2
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
15–28
URL:
https://aclanthology.org/2023.emnlp-main.2
DOI:
10.18653/v1/2023.emnlp-main.2
Cite (ACL):
Yuji Yamamoto and Takuya Matsuzaki. 2023. Absolute Position Embedding Learns Sinusoid-like Waves for Attention Based on Relative Position. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15–28, Singapore. Association for Computational Linguistics.
Cite (Informal):
Absolute Position Embedding Learns Sinusoid-like Waves for Attention Based on Relative Position (Yamamoto & Matsuzaki, EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.2.pdf
Video:
https://aclanthology.org/2023.emnlp-main.2.mp4