The Labeled Segmentation of Printed Books

Lara McConnaughey, Jennifer Dai, David Bamman


Abstract
We introduce the task of book structure labeling: segmenting and assigning a fixed category (such as Table of Contents, Preface, Index) to the document structure of printed books. We manually annotate the page-level structural categories for a large dataset totaling 294,816 pages in 1,055 books evenly sampled from 1750–1922, and present empirical results comparing the performance of several classes of models. The best-performing model, a bidirectional LSTM with rich features, achieves an overall accuracy of 95.8 and a class-balanced macro F-score of 71.4.
Anthology ID:
D17-1077
Volume:
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Martha Palmer, Rebecca Hwa, Sebastian Riedel
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
737–747
URL:
https://aclanthology.org/D17-1077
DOI:
10.18653/v1/D17-1077
Cite (ACL):
Lara McConnaughey, Jennifer Dai, and David Bamman. 2017. The Labeled Segmentation of Printed Books. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 737–747, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
The Labeled Segmentation of Printed Books (McConnaughey et al., EMNLP 2017)
PDF:
https://aclanthology.org/D17-1077.pdf