Sequence Models for Document Structure Identification in an Undeciphered Script

Logan Born; M. Monroe; Kathryn Kelley; Anoop Sarkar

doi:10.18653/v1/2022.emnlp-main.620

Abstract

This work describes the first thorough analysis of “header” signs in proto-Elamite, an undeciphered script from 3100–2900 BCE. Headers are a category of signs which have been provisionally identified through painstaking manual analysis of this script by domain experts. We use unsupervised neural and statistical sequence modeling techniques to provide new and independent evidence for the existence of headers, without supervision from domain experts. Having affirmed the existence of headers as a legitimate structural feature, we next arrive at a richer understanding of their possible meaning and purpose by (i) examining which features predict their presence; (ii) identifying correlations between these features and other document properties; and (iii) examining cases where these features predict the presence of a header in texts where domain experts do not expect one (or vice versa). We provide more concrete processes for labeling headers in this corpus and a clearer justification for existing intuitions about document structure in proto-Elamite.

Anthology ID: 2022.emnlp-main.620
Volume: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates
Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 9111–9121
URL: https://aclanthology.org/2022.emnlp-main.620/
DOI: 10.18653/v1/2022.emnlp-main.620
Cite (ACL): Logan Born, M. Monroe, Kathryn Kelley, and Anoop Sarkar. 2022. Sequence Models for Document Structure Identification in an Undeciphered Script. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9111–9121, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): Sequence Models for Document Structure Identification in an Undeciphered Script (Born et al., EMNLP 2022)
PDF: https://aclanthology.org/2022.emnlp-main.620.pdf
Dataset: 2022.emnlp-main.620.dataset.zip
Video: https://aclanthology.org/2022.emnlp-main.620.mp4
