Automatic Topological Field Identification in (Historical) German Texts

Katrin Ortmann


Abstract
For the study of certain linguistic phenomena and their development over time, large amounts of textual data must be enriched with relevant annotations. Because creating such annotations manually requires considerable effort, automating the process with NLP methods is desirable. Yet the required amounts of training data are usually not available for non-standard or historical language. The present study investigates whether models trained on modern newspaper text can be used to automatically identify topological fields, i.e. syntactic structures, in different modern and historical German texts. The evaluation shows that, in general, a parser model can be transferred to other registers or time periods, with overall F1-scores >92%. However, an error analysis makes clear that additional rules and domain-specific training data would be beneficial when sentence structures differ significantly from the training data, e.g. in the case of Early New High German.
Anthology ID:
2020.latechclfl-1.2
Volume:
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Month:
December
Year:
2020
Address:
Online
Venues:
CLFL | COLING | LaTeCH | LaTeCHCLfL
Publisher:
International Committee on Computational Linguistics
Pages:
10–18
URL:
https://aclanthology.org/2020.latechclfl-1.2
Cite (ACL):
Katrin Ortmann. 2020. Automatic Topological Field Identification in (Historical) German Texts. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 10–18, Online. International Committee on Computational Linguistics.
Cite (Informal):
Automatic Topological Field Identification in (Historical) German Texts (Ortmann, LaTeCHCLfL 2020)
PDF:
https://aclanthology.org/2020.latechclfl-1.2.pdf
Code:
rubcompling/latech2020