Application of Mix-Up Method in Document Classification Task Using BERT

Naoki Kikuta, Hiroyuki Shinnou


Abstract
The mix-up method (Zhang et al., 2017), a data augmentation technique, is known to be easy to implement and highly effective. Although the mix-up method was designed for image classification, it can also be applied to natural language processing. In this paper, we apply the mix-up method to a document classification task using bidirectional encoder representations from transformers (BERT) (Devlin et al., 2018). Since BERT accepts two-sentence input, we concatenate the word sequences of two documents with different labels and use a mixture of their one-hot label vectors as the supervised signal. In an experiment on the livedoor news corpus, a Japanese corpus, we compared the document classification accuracy of two strategies for selecting the documents to be concatenated against that of ordinary document classification. We found that the proposed method outperforms ordinary classification when documents from under-represented labels are preferentially mixed. This indicates that the choice of documents for mix-up has a significant impact on the results.
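As a rough sketch of the construction described in the abstract (this is not the authors' code; the class count, example labels, and mixing ratio `lam` are illustrative assumptions), one mix-up training example might be built like this:

```python
# Sketch of mix-up for BERT document classification:
# two documents with different labels are concatenated in BERT's
# two-segment format, and the target is a mixture of their one-hot vectors.
# NUM_CLASSES and lam are assumptions for illustration; the livedoor news
# corpus has 9 categories.

NUM_CLASSES = 9

def mixup_example(tokens_a, label_a, tokens_b, label_b, lam=0.5):
    """Build one mixed training example.

    tokens_a/tokens_b: token lists of the two documents.
    label_a/label_b:   their (different) class indices.
    lam:               mixing weight on the first document's label.
    """
    # BERT-style two-sentence input: [CLS] doc_a [SEP] doc_b [SEP]
    input_tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Soft target: lam weight on label_a, (1 - lam) on label_b
    target = [0.0] * NUM_CLASSES
    target[label_a] += lam
    target[label_b] += 1.0 - lam
    return input_tokens, target

tokens, target = mixup_example(["sports", "news"], 0, ["it", "topics"], 3)
print(tokens)  # ['[CLS]', 'sports', 'news', '[SEP]', 'it', 'topics', '[SEP]']
print(target)  # [0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
```

A real model would then be trained with a soft-label cross-entropy loss against `target` instead of a single class index.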
Anthology ID:
2021.ranlp-1.77
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month:
September
Year:
2021
Address:
Held Online
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
679–683
URL:
https://aclanthology.org/2021.ranlp-1.77
Cite (ACL):
Naoki Kikuta and Hiroyuki Shinnou. 2021. Application of Mix-Up Method in Document Classification Task Using BERT. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 679–683, Held Online. INCOMA Ltd.
Cite (Informal):
Application of Mix-Up Method in Document Classification Task Using BERT (Kikuta & Shinnou, RANLP 2021)
PDF:
https://aclanthology.org/2021.ranlp-1.77.pdf