Training and Domain Adaptation for Supervised Text Segmentation

Goran Glavaš, Ananya Ganesh, Swapna Somasundaran


Abstract
Unlike traditional unsupervised text segmentation methods, recent supervised segmentation models rely on Wikipedia as the source of large-scale segmentation supervision. These models have, however, predominantly been evaluated on in-domain (Wikipedia-based) test sets, which prevents conclusions about their general segmentation efficacy. In this work, we focus on the domain transfer performance of supervised neural text segmentation in the educational domain. To this end, we first introduce K12Seg, a new dataset for evaluation of supervised segmentation, created from educational reading material for students from grade 1 to college level. We then benchmark a hierarchical text segmentation model (HITS), based on RoBERTa, in both in-domain and domain-transfer segmentation experiments. While HITS produces state-of-the-art in-domain performance (on three Wikipedia-based test sets), we show that, when subjected to standard full fine-tuning, it is susceptible to domain overfitting. We identify adapter-based fine-tuning as a remedy that substantially improves transfer performance.
Anthology ID:
2021.bea-1.11
Volume:
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications
Month:
April
Year:
2021
Address:
Online
Venue:
BEA
SIG:
SIGEDU
Publisher:
Association for Computational Linguistics
Pages:
110–116
URL:
https://aclanthology.org/2021.bea-1.11
Cite (ACL):
Goran Glavaš, Ananya Ganesh, and Swapna Somasundaran. 2021. Training and Domain Adaptation for Supervised Text Segmentation. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 110–116, Online. Association for Computational Linguistics.
Cite (Informal):
Training and Domain Adaptation for Supervised Text Segmentation (Glavaš et al., BEA 2021)
PDF:
https://aclanthology.org/2021.bea-1.11.pdf