Understanding How Positional Encodings Work in Transformer Model

Taro Miyazaki, Hideya Mino, Hiroyuki Kaneko


Abstract
A transformer model is used both for general tasks, such as pre-trained language models, and for specific tasks, such as machine translation. Such a model relies mainly on positional encodings (PEs) to handle the sequential order of input vectors. PEs come in several variants, such as absolute and relative, and several studies have reported on the superiority of relative PEs. In this paper, we analyze, through a series of experiments, in which parts of a transformer model PEs work and how absolute and relative PEs differ in their characteristics. Experimental results indicate that PEs work in both the self- and cross-attention blocks of a transformer model, and that PEs should be added only to the query and key of an attention mechanism, not to the value. We also found that applying two PEs in combination, a relative PE in the self-attention block and an absolute PE in the cross-attention block, can improve translation quality.
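To make the abstract's central finding concrete, here is a minimal sketch of single-head attention in which absolute sinusoidal PEs are added to the query and key but not to the value, as the paper recommends. This is an illustrative simplification, not the authors' implementation: the helper names (`sinusoidal_pe`, `attention_pe_qk_only`) are hypothetical, and learned Q/K/V projections and multi-head splitting are omitted for clarity.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Absolute sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

def attention_pe_qk_only(x, pe):
    """Single-head attention where the PE is added to the query and key
    only; the value remains position-free (content only)."""
    q = x + pe                                   # query carries position
    k = x + pe                                   # key carries position
    v = x                                        # value: no PE added
    d = x.shape[-1]
    scores = q @ k.T / np.sqrt(d)                # scaled dot-product
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v
```

Under this scheme position influences only where attention looks (the score matrix), while the vectors being mixed stay purely content-based; a conventional formulation would instead add the PE to `x` once, letting it leak into the value as well.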
Anthology ID:
2024.lrec-main.1478
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
17011–17018
URL:
https://aclanthology.org/2024.lrec-main.1478
Cite (ACL):
Taro Miyazaki, Hideya Mino, and Hiroyuki Kaneko. 2024. Understanding How Positional Encodings Work in Transformer Model. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17011–17018, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Understanding How Positional Encodings Work in Transformer Model (Miyazaki et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1478.pdf