Automatic Orality Identification in Historical Texts

Katrin Ortmann, Stefanie Dipper


Abstract
Independently of the medial representation (written or spoken), language can exhibit characteristics of conceptual orality or literacy, which manifest themselves mainly on the lexical or syntactic level. In this paper we aim to automatically identify conceptually oral historical texts, with the ultimate goal of gaining knowledge about the spoken language of historical time stages. We apply a set of general linguistic features that have proven effective for the classification of modern language data to historical German texts from various registers. Many of the features turn out to be as useful for determining the conceptuality of historical data as they are for modern data, especially the frequency of different types of pronouns and the ratio of verbs to nouns. Other features, such as sentence length, particles, or interjections, point to peculiarities of the historical data and reveal problems with adopting a feature set that was developed on modern language data.
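To make the kind of features named in the abstract concrete, the following is a minimal sketch of how pronoun frequency, the verb-to-noun ratio, and mean sentence length might be computed from POS-tagged sentences. It is not the authors' implementation: the function name `orality_features`, the coarse tag labels (`PRON`, `VERB`, `NOUN`, `DET`), and the toy example sentences are assumptions made for illustration, and the paper's actual tagset and feature definitions may differ.

```python
from typing import Dict, List, Tuple

# A sentence is a list of (token, coarse POS tag) pairs.
# The tag labels used here are an assumption, not the paper's tagset.
Sentence = List[Tuple[str, str]]


def orality_features(sentences: List[Sentence]) -> Dict[str, float]:
    """Compute three simple surface features over tagged sentences."""
    tags = [tag for sentence in sentences for _, tag in sentence]
    n_tokens = len(tags)
    n_pronouns = tags.count("PRON")
    n_verbs = tags.count("VERB")
    n_nouns = tags.count("NOUN")
    return {
        # Average number of tokens per sentence.
        "mean_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
        # Share of all tokens that are pronouns.
        "pronoun_freq": n_pronouns / n_tokens if n_tokens else 0.0,
        # Verbs per noun; 0.0 when the text contains no nouns.
        "verb_noun_ratio": n_verbs / n_nouns if n_nouns else 0.0,
    }


# Hypothetical toy input: two short tagged sentences.
example = [
    [("ich", "PRON"), ("sage", "VERB"), ("dir", "PRON"), ("das", "PRON")],
    [("der", "DET"), ("Herr", "NOUN"), ("kam", "VERB")],
]
features = orality_features(example)
```

In a classification setting, values like these would be computed per text and passed to a classifier; which classifier and which feature normalization are used is not stated in the abstract.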
Anthology ID:
2020.lrec-1.162
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1293–1302
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.162
Cite (ACL):
Katrin Ortmann and Stefanie Dipper. 2020. Automatic Orality Identification in Historical Texts. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1293–1302, Marseille, France. European Language Resources Association.
Cite (Informal):
Automatic Orality Identification in Historical Texts (Ortmann & Dipper, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.162.pdf