GHisBERT – Training BERT from scratch for lexical semantic investigations across historical German language stages

Christin Beck, Marisa Köllner


Abstract
While static embeddings have dominated computational approaches to lexical semantic change for quite some time, recent approaches leverage the contextualized embeddings generated by the language model BERT to identify semantic shifts in historical texts. However, despite their utility for detecting changes in the more recent past, it remains unclear how well language models scale to investigations going back further in time, where the language differs substantially from the training data underlying the models. In this paper, we present GHisBERT, a BERT-based language model trained from scratch on historical data covering all attested stages of German (going back to Old High German, c. 750 CE). Given the lack of ground-truth data for investigating lexical semantic change across historical German language stages, we evaluate our model via a lexical similarity analysis of ten stable concepts. We show that, compared with an unmodified and a fine-tuned German BERT-base model, our model performs best at assessing both inter-concept similarity and intra-concept similarity over time. This in turn argues for the necessity of pre-training historical language models from scratch when working with historical linguistic data.
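Illustrative sketch (not from the paper): the evaluation described above compares contextualized embeddings of concept attestations via cosine similarity, both within a concept across time and between concepts. The Python sketch below shows the general shape of such a comparison, assuming the Hugging Face transformers API; bert-base-german-cased is used as a stand-in checkpoint (the GHisBERT checkpoint path is not given on this page), and the Old High German example sentence is invented for illustration.

import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "bert-base-german-cased"  # stand-in; substitute the GHisBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def word_embedding(sentence: str, word: str) -> torch.Tensor:
    """Mean of the last-layer vectors of the subword tokens spanning `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
    # Locate the word's subword span via a simple id-subsequence match.
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(word_ids) + 1):
        if ids[i : i + len(word_ids)] == word_ids:
            return hidden[i : i + len(word_ids)].mean(dim=0)
    raise ValueError(f"{word!r} not found in tokenized sentence")

# Intra-concept similarity over time: the concept WATER in an Old High
# German-like context vs. a modern German context (example sentences invented).
e_old = word_embedding("thaz wazzar was kalt", "wazzar")
e_new = word_embedding("das Wasser war kalt", "Wasser")
sim = torch.nn.functional.cosine_similarity(e_old, e_new, dim=0)
print(f"cosine similarity: {sim.item():.3f}")

In the paper itself, embeddings are extracted for attestations of ten stable concepts and similarities are tracked within and across concepts over the attested language stages; the sketch above only shows the embed-and-compare core of that analysis.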
Anthology ID:
2023.lchange-1.4
Volume:
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change
Month:
December
Year:
2023
Address:
Singapore
Editors:
Nina Tahmasebi, Syrielle Montariol, Haim Dubossarsky, Andrey Kutuzov, Simon Hengchen, David Alfter, Francesco Periti, Pierluigi Cassotti
Venue:
LChange
Publisher:
Association for Computational Linguistics
Pages:
33–45
URL:
https://aclanthology.org/2023.lchange-1.4
DOI:
10.18653/v1/2023.lchange-1.4
Cite (ACL):
Christin Beck and Marisa Köllner. 2023. GHisBERT – Training BERT from scratch for lexical semantic investigations across historical German language stages. In Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change, pages 33–45, Singapore. Association for Computational Linguistics.
Cite (Informal):
GHisBERT – Training BERT from scratch for lexical semantic investigations across historical German language stages (Beck & Köllner, LChange 2023)
PDF:
https://aclanthology.org/2023.lchange-1.4.pdf