MaxMatch-Dropout: Subword Regularization for WordPiece

Tatsuya Hiraoka


Abstract
We present a subword regularization method for WordPiece, which tokenizes text with a maximum matching algorithm. The proposed method, MaxMatch-Dropout, randomly drops vocabulary words during the maximum matching search. This makes it possible to fine-tune popular pretrained language models such as BERT-base with subword regularization. The experimental results demonstrate that MaxMatch-Dropout improves performance on text classification and machine translation tasks to a degree comparable to other subword regularization methods. Moreover, we provide a comparative analysis of three subword regularization methods: subword regularization with SentencePiece (Unigram), BPE-Dropout, and MaxMatch-Dropout.
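The idea of dropping vocabulary words during a maximum matching search can be illustrated with a minimal sketch. This is not the author's implementation (see the linked code repository for that); the function name `maxmatch_dropout`, the toy vocabulary, the `##` continuation prefix, and the `[UNK]` fallback are assumptions made for illustration. The sketch also assumes that single-character pieces are never dropped, so that a word whose characters are all in the vocabulary can always be tokenized.

```python
import random


def maxmatch_dropout(word, vocab, p=0.1, rng=None, unk="[UNK]"):
    """Tokenize one word by greedy longest-match-first search,
    skipping each multi-character match with probability p.

    Hypothetical helper for illustration only.
    """
    rng = rng or random
    pieces = []
    start = 0
    while start < len(word):
        match, match_end = None, None
        # Try the longest candidate first, then shrink from the right.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # WordPiece-style continuation marker
            if piece not in vocab:
                continue
            # Assumption: single-character pieces are never dropped,
            # so the search can always fall back to characters.
            if end - start > 1 and rng.random() < p:
                continue  # "drop" this match and try a shorter one
            match, match_end = piece, end
            break
        if match is None:
            return [unk]  # no piece in the vocabulary fits here
        pieces.append(match)
        start = match_end
    return pieces


# Toy vocabulary (assumed for the example).
vocab = {"un", "##aff", "##able",
         "u", "##n", "##a", "##f", "##b", "##l", "##e"}

# p=0.0 reproduces plain maximum matching.
print(maxmatch_dropout("unaffable", vocab, p=0.0))
# → ['un', '##aff', '##able']

# With 0 < p < 1, repeated calls can yield different segmentations.
print(maxmatch_dropout("unaffable", vocab, p=0.5, rng=random.Random(0)))
```

With `p=0.0` the sketch behaves like deterministic maximum matching; with `p=1.0` every multi-character match is skipped and the word falls back to single characters. Values in between produce the varied segmentations that subword regularization relies on during training.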
Anthology ID:
2022.coling-1.430
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
4864–4872
URL:
https://aclanthology.org/2022.coling-1.430
Cite (ACL):
Tatsuya Hiraoka. 2022. MaxMatch-Dropout: Subword Regularization for WordPiece. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4864–4872, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
MaxMatch-Dropout: Subword Regularization for WordPiece (Hiraoka, COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.430.pdf
Code:
tathi/maxmatch_dropout
Data:
GLUE, KLUE, QNLI, SST