BERT Goes Off-Topic: Investigating the Domain Transfer Challenge using Genre Classification

Dmitri Roussinov, Serge Sharoff


Abstract
While performance on many text classification tasks has recently improved thanks to Pretrained Language Models (PLMs), in this paper we show that they still suffer from a performance gap when the underlying distribution of topics changes. For example, a genre classifier trained on political topics often fails when tested on documents in the same genre but about sport or medicine. In this work, we quantify this phenomenon empirically with a large corpus and a large set of topics. We thus verify that domain transfer remains challenging both for classic PLMs, such as BERT, and for modern large language models (LLMs), such as GPT. We develop a data augmentation approach that generates texts in any desired genre and on any desired topic, even when the training corpus contains no documents that are both in that particular genre and on that particular topic. When we augment the training dataset with the topically-controlled synthetic texts, F1 improves by up to 50% for some topics, approaching on-topic training, while showing little or no improvement for other topics. While our empirical results focus on genre classification, our methodology is applicable to other classification tasks such as gender, authorship, or sentiment classification.
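The off-topic evaluation setup described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the record fields `genre` and `topic`, the function names, and the choice of macro-averaged F1 are all assumptions made for the example.

```python
def topic_holdout_split(docs, held_out_topic):
    """Train on every topic except one; test only on the held-out topic.

    `docs` is a list of dicts with hypothetical keys "genre" and "topic".
    """
    train = [d for d in docs if d["topic"] != held_out_topic]
    test = [d for d in docs if d["topic"] == held_out_topic]
    return train, test


def missing_cells(docs):
    """List (genre, topic) pairs that have no documents in the corpus.

    These are the combinations a generator would need to cover when
    producing topically-controlled synthetic training texts.
    """
    genres = sorted({d["genre"] for d in docs})
    topics = sorted({d["topic"] for d in docs})
    seen = {(d["genre"], d["topic"]) for d in docs}
    return [(g, t) for g in genres for t in topics if (g, t) not in seen]


def macro_f1(gold, pred):
    """Macro-averaged F1 over all labels seen in gold or pred (assumed metric)."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, the domain transfer gap for a topic would be the difference between `macro_f1` for a classifier trained on that topic and for one trained on the output of `topic_holdout_split`, with `missing_cells` indicating where synthetic texts might be added.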
Anthology ID:
2023.findings-emnlp.34
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
468–483
URL:
https://aclanthology.org/2023.findings-emnlp.34
DOI:
10.18653/v1/2023.findings-emnlp.34
Cite (ACL):
Dmitri Roussinov and Serge Sharoff. 2023. BERT Goes Off-Topic: Investigating the Domain Transfer Challenge using Genre Classification. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 468–483, Singapore. Association for Computational Linguistics.
Cite (Informal):
BERT Goes Off-Topic: Investigating the Domain Transfer Challenge using Genre Classification (Roussinov & Sharoff, Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.34.pdf