JLBert: Japanese Light BERT for Cross-Domain Short Text Classification

Chandrai Kayal, Sayantan Chattopadhyay, Aryan Gupta, Satyen Abrol, Archie Gugol


Abstract
Models such as BERT have made a significant breakthrough in the Natural Language Processing (NLP) domain, achieving state-of-the-art results on more than 11 tasks. This is accomplished by pre-training on large-scale unlabelled text and leveraging the Transformer architecture, making BERT the “Jack of all NLP trades”. However, one of the most popular and challenging sequence classification tasks is Short Text Classification (STC): short texts are brief, ambiguous, and non-standard. In this paper, we address two major problems: 1. improving STC performance in Japanese, a language with many varieties and dialects; 2. building a lightweight Japanese BERT model with cross-domain functionality and accuracy comparable to state-of-the-art (SOTA) BERT models. To this end, we propose a novel cross-domain, scalable model called JLBert, which is pre-trained on a rich, diverse, and less explored Japanese e-commerce corpus. We present results from extensive experiments showing that JLBert outperforms SOTA multilingual and Japanese-specialized BERT models on three short-text datasets by approximately 1.5% across various domains.
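The paper's implementation details are not reproduced on this page. As a rough illustration of the downstream setup the abstract describes (fine-tuning a pre-trained Japanese BERT-style encoder for short text classification), the sketch below uses the Hugging Face transformers API with a public Japanese checkpoint (cl-tohoku/bert-base-japanese-v3) as a stand-in for JLBert, which is an assumption; the texts and labels are hypothetical and not taken from the paper.

```python
# Minimal sketch: fine-tuning a pre-trained Japanese BERT encoder for short text
# classification. "cl-tohoku/bert-base-japanese-v3" is a public stand-in for
# JLBert (whose checkpoint is not linked on this page); the example texts and
# labels are hypothetical. Requires: transformers, torch, fugashi, unidic-lite.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cl-tohoku/bert-base-japanese-v3"   # assumption: stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Hypothetical short e-commerce texts (review snippets) with made-up category labels.
texts = ["配送がとても速い", "サイズが合わなかった", "値段の割に品質が良い"]
labels = torch.tensor([0, 1, 2])

# Short texts: a small max_length is usually sufficient.
batch = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)          # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
print(f"training loss: {outputs.loss.item():.4f}")
```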
Anthology ID:
2024.lrec-main.833
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
9536–9542
URL:
https://aclanthology.org/2024.lrec-main.833
Cite (ACL):
Chandrai Kayal, Sayantan Chattopadhyay, Aryan Gupta, Satyen Abrol, and Archie Gugol. 2024. JLBert: Japanese Light BERT for Cross-Domain Short Text Classification. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9536–9542, Torino, Italia. ELRA and ICCL.
Cite (Informal):
JLBert: Japanese Light BERT for Cross-Domain Short Text Classification (Kayal et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.833.pdf