2023
pdf
bib
abs
GHisBERT – Training BERT from scratch for lexical semantic investigations across historical German language stages
Christin Beck
|
Marisa Köllner
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change
While static embeddings have dominated computational approaches to lexical semantic change for quite some time, recent approaches try to leverage the contextualized embeddings generated by the language model BERT for identifying semantic shifts in historical texts. However, despite their usability for detecting changes in the more recent past, it remains unclear how well language models scale to investigations going back further in time, where the language differs substantially from the training data underlying the models. In this paper, we present GHisBERT, a BERT-based language model trained from scratch on historical data covering all attested stages of German (going back to Old High German, c. 750 CE). Given a lack of ground truth data for investigating lexical semantic change across historical German language stages, we evaluate our model via a lexical similarity analysis of ten stable concepts. We show that, in comparison with an unmodified and a fine-tuned German BERT-base model, our model performs best in terms of assessing inter-concept similarity as well as intra-concept similarity over time. This in turn argues for the necessity of pre-training historical language models from scratch when working with historical linguistic data.
2022
pdf
bib
abs
Negation, Coordination, and Quantifiers in Contextualized Language Models
Aikaterini-Lida Kalouli
|
Rita Sevastjanova
|
Christin Beck
|
Maribel Romero
Proceedings of the 29th International Conference on Computational Linguistics
With the success of contextualized language models, much research explores what these models really learn and in which cases they still fail. Most of this work focuses on specific NLP tasks and on the learning outcome. Little research has attempted to decouple the models’ weaknesses from specific tasks and focus on the embeddings per se and their mode of learning. In this paper, we take up this research opportunity: based on theoretical linguistic insights, we explore whether the semantic constraints of function words are learned and how the surrounding context impacts their embeddings. We create suitable datasets, provide new insights into the inner workings of LMs vis-a-vis function words and implement an assisting visual web interface for qualitative analysis.
2021
pdf
bib
abs
Explaining Contextualization in Language Models using Visual Analytics
Rita Sevastjanova
|
Aikaterini-Lida Kalouli
|
Christin Beck
|
Hanna Schäfer
|
Mennatallah El-Assady
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Despite the success of contextualized language models on various NLP tasks, it is still unclear what these models really learn. In this paper, we contribute to the current efforts of explaining such models by exploring the continuum between function and content words with respect to contextualization in BERT, based on linguistically-informed insights. In particular, we utilize scoring and visual analytics techniques: we use an existing similarity-based score to measure contextualization and integrate it into a novel visual analytics technique, presenting the model’s layers simultaneously and highlighting intra-layer properties and inter-layer differences. We show that contextualization is neither driven by polysemy nor by pure context variation. We also provide insights on why BERT fails to model words in the middle of the functionality continuum.
2020
pdf
bib
abs
Representation Problems in Linguistic Annotations: Ambiguity, Variation, Uncertainty, Error and Bias
Christin Beck
|
Hannah Booth
|
Mennatallah El-Assady
|
Miriam Butt
Proceedings of the 14th Linguistic Annotation Workshop
The development of linguistic corpora is fraught with various problems of annotation and representation. These constitute a very real challenge for the development and use of annotated corpora, but as yet not much literature exists on how to address the underlying problems. In this paper, we identify and discuss five sources of representation problems, which are independent though interrelated: ambiguity, variation, uncertainty, error and bias. We outline and characterize these sources, discussing how their improper treatment can have stark consequences for research outcomes. Finally, we discuss how an adequate treatment can inform corpus-related linguistic research, both computational and theoretical, improving the reliability of research results and NLP models, as well as informing the more general reproducibility issue.
pdf
bib
abs
DiaSense at SemEval-2020 Task 1: Modeling Sense Change via Pre-trained BERT Embeddings
Christin Beck
Proceedings of the Fourteenth Workshop on Semantic Evaluation
This paper describes DiaSense, a system developed for Task 1 ‘Unsupervised Lexical Semantic Change Detection’ of SemEval 2020. In DiaSense, contextualized word embeddings are used to model word sense changes. This allows for the calculation of metrics which mimic human intuitions about the semantic relatedness between individual use pairs of a target word for the assessment of lexical semantic change. DiaSense is able to detect lexical semantic change in English, German, Latin and Swedish (accuracy = 0.728). Moreover, DiaSense differentiates between weak and strong change.