Unsupervised Text Segmentation Based on Native Language Characteristics

Shervin Malmasi, Mark Dras, Mark Johnson, Lan Du, Magdalena Wolska


Abstract
Most work on segmenting text does so on the basis of topic changes, but it can be of interest to segment by other, stylistically expressed characteristics such as change of authorship or native language. We propose a Bayesian unsupervised text segmentation approach to the latter. While baseline models achieve essentially random segmentation on our task, indicating its difficulty, a Bayesian model that incorporates appropriately compact language models and alternating asymmetric priors can achieve scores on the standard metrics around halfway to perfect segmentation.
Anthology ID:
P17-1134
Volume:
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2017
Address:
Vancouver, Canada
Editors:
Regina Barzilay, Min-Yen Kan
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1457–1469
URL:
https://aclanthology.org/P17-1134
DOI:
10.18653/v1/P17-1134
Cite (ACL):
Shervin Malmasi, Mark Dras, Mark Johnson, Lan Du, and Magdalena Wolska. 2017. Unsupervised Text Segmentation Based on Native Language Characteristics. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1457–1469, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Text Segmentation Based on Native Language Characteristics (Malmasi et al., ACL 2017)
PDF:
https://aclanthology.org/P17-1134.pdf