TLM: Token-Level Masking for Transformers

Yangjun Wu; Kebin Fang; Dongxiang Zhang; Han Wang; Hao Zhang; Gang Chen

doi:10.18653/v1/2023.emnlp-main.871

Abstract

Structured dropout approaches, such as attention dropout and DropHead, have been investigated to regularize the multi-head attention mechanism in Transformers. In this paper, we propose a new regularization scheme that operates at the token level rather than the structure level to reduce overfitting. Specifically, we devise a novel Token-Level Masking (TLM) training strategy for Transformers to regularize the connections of self-attention. It consists of two masking techniques that are effective and easy to implement. The underlying idea is to manipulate the connections between tokens in the multi-head attention via masking, so that the network is forced to exploit only part of its neighbors' information to produce a meaningful representation. The generality and effectiveness of TLM are evaluated through extensive experiments on 4 diverse NLP tasks across 18 datasets, including the natural language understanding benchmark GLUE, ChineseGLUE, Chinese Grammatical Error Correction, and data-to-text generation. The results indicate that TLM consistently outperforms attention dropout and DropHead, e.g., it improves on DropHead by 0.5 points with BERT-large on GLUE. Moreover, TLM establishes a new record on the data-to-text benchmark Rotowire (18.93 BLEU). Our code will be publicly available at https://github.com/Young1993/tlm.
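
To make the underlying idea concrete, the following is a minimal sketch of masking token connections before the attention softmax. It is not the authors' implementation: the abstract does not specify the two masking techniques, so the rule used here (a picked token may not attend to itself) is an assumption chosen for illustration, and the names `masked_attention_weights` and `mask_rate` are hypothetical.

```python
import math
import random


def masked_attention_weights(scores, mask_rate=0.1, rng=None, training=True):
    """Row-wise softmax over `scores` with an illustrative token-level mask.

    scores: square list of lists; scores[i][j] is the raw attention score
    of query token i for key token j.
    """
    rng = rng or random.Random()
    n = len(scores)
    weights = []
    for i, row in enumerate(scores):
        # Assumption: each token is picked for masking independently,
        # and only during training.
        picked = training and n > 1 and rng.random() < mask_rate
        # Assumption: a picked token loses the connection to itself, so it
        # has to build its representation from the other tokens.
        kept = [j for j in range(n) if not (picked and j == i)]
        top = max(row[j] for j in kept)  # subtract the max for stability
        exps = {j: math.exp(row[j] - top) for j in kept}
        total = sum(exps.values())
        # Masked connections receive exactly zero attention weight.
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights
```

Calling the function with `training=False` returns the ordinary softmax weights, which reflects the usual practice of applying a regularizer of this kind only during training.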

Anthology ID: 2023.emnlp-main.871
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 14099–14111
URL: https://aclanthology.org/2023.emnlp-main.871/
DOI: 10.18653/v1/2023.emnlp-main.871
Cite (ACL): Yangjun Wu, Kebin Fang, Dongxiang Zhang, Han Wang, Hao Zhang, and Gang Chen. 2023. TLM: Token-Level Masking for Transformers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14099–14111, Singapore. Association for Computational Linguistics.
Cite (Informal): TLM: Token-Level Masking for Transformers (Wu et al., EMNLP 2023)
PDF: https://aclanthology.org/2023.emnlp-main.871.pdf
Video: https://aclanthology.org/2023.emnlp-main.871.mp4
