Automated Parsing of Interlinear Glossed Text from Page Images of Grammatical Descriptions

Erich Round, Mark Ellison, Jayden Macklin-Cordes, Sacha Beniamine


Abstract
Linguists seek insight from all human languages; however, accessing information from most of the full store of extant global linguistic descriptions is not easy. One of the most common kinds of information that linguists have documented is vernacular sentences, as recorded in descriptive grammars. Typically these sentences are formatted as interlinear glossed text (IGT). Most descriptive grammars, however, exist only as hardcopy or scanned PDF documents. Consequently, parsing IGTs in scanned grammars is a priority, in order to significantly increase the volume of documented linguistic information that is readily accessible. Here we demonstrate fundamental viability for a technology that can assist in making a large number of linguistic data sources machine readable: the automated identification and parsing of interlinear glossed text from scanned page images. For example, we attain high median precision and recall (>0.95) in the identification of example sentences in IGT format. Our results will be of interest to those who are keen to see more of the existing documentation of human language, especially for less-resourced and endangered languages, become more readily accessible.
Anthology ID:
2020.lrec-1.351
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2878–2883
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.351
Cite (ACL):
Erich Round, Mark Ellison, Jayden Macklin-Cordes, and Sacha Beniamine. 2020. Automated Parsing of Interlinear Glossed Text from Page Images of Grammatical Descriptions. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2878–2883, Marseille, France. European Language Resources Association.
Cite (Informal):
Automated Parsing of Interlinear Glossed Text from Page Images of Grammatical Descriptions (Round et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.351.pdf