Christian Bauckhage
2025
Resource-Efficient Anonymization of Textual Data via Knowledge Distillation from Large Language Models
Tobias Deußer | Max Hahnbück | Tobias Uelwer | Cong Zhao | Christian Bauckhage | Rafet Sifa
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Protecting personal and sensitive information in textual data is increasingly crucial, especially when leveraging large language models (LLMs) that may pose privacy risks due to their API-based access. We introduce a novel approach and pipeline for anonymizing text across arbitrary domains without the need for manually labeled data or extensive computational resources. Our method employs knowledge distillation from LLMs into smaller encoder-only models via named entity recognition (NER) coupled with regular expressions to create a lightweight model capable of effective anonymization while preserving the semantic and contextual integrity of the data. This reduces computational overhead, enabling deployment on less powerful servers or even personal computing devices. Our findings suggest that knowledge distillation offers a scalable, resource-efficient pathway for anonymization, balancing privacy preservation with model performance and computational efficiency.
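A minimal sketch of the pipeline the abstract describes, under stated assumptions: a teacher LLM annotates unlabeled text with entity spans, a regex layer catches structured identifiers, and the combined silver labels serve as training data for a small encoder-only NER student. The `client.complete` interface and the regex patterns are illustrative assumptions, not the paper's actual prompts or rules.

```python
import re

# Regex layer for structured identifiers; these patterns are
# illustrative stand-ins, not the rules used in the paper.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d ()/-]{7,}\d"),
}

def regex_entities(text):
    """Catch structured identifiers that regexes find more reliably than NER."""
    return [
        (m.start(), m.end(), label)
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]

def llm_entities(text, client):
    """Query the teacher LLM for person/location/organization spans.
    `client.complete` is a hypothetical chat-completion wrapper."""
    prompt = (
        "Mark every personal name, location, and organization in the text "
        "below. Return one entity per line as: start<TAB>end<TAB>label.\n\n"
        + text
    )
    spans = []
    for line in client.complete(prompt).splitlines():
        start, end, label = line.split("\t")
        spans.append((int(start), int(end), label))
    return spans

def build_silver_dataset(corpus, client):
    """Merge LLM and regex annotations into silver-labeled training data
    for fine-tuning a small encoder-only NER student."""
    return [
        {"text": text, "entities": llm_entities(text, client) + regex_entities(text)}
        for text in corpus
    ]
```

From there, any standard token-classification fine-tuning recipe over a small encoder (e.g. a DistilBERT-sized model) yields the lightweight student, which can then anonymize text locally without further API access.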
2023
A New Aligned Simple German Corpus
Vanessa Toborek | Moritz Busch | Malte Boßert | Christian Bauckhage | Pascal Welke
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
“Leichte Sprache”, the German counterpart to Simple English, is a regulated language that aims to make complex written content accessible to groups of people who would otherwise be excluded from it. We present a new sentence-aligned monolingual corpus for Simple German – German. It contains multiple document-aligned sources, which we have aligned using automatic sentence-alignment methods. We evaluate our alignments on a manually labelled subset of aligned documents. The quality of our sentence alignments, as measured by the F1-score, surpasses previous work. We publish the dataset under a CC BY-SA license and the accompanying code under the MIT license.
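One common form of automatic sentence alignment, sketched below under assumptions: greedy matching by embedding similarity with a confidence threshold, scored by F1 against a manually labelled subset. The `embed` function and the threshold value are illustrative placeholders, not necessarily the methods the paper compares.

```python
import numpy as np

def align_sentences(simple_sents, complex_sents, embed, threshold=0.7):
    """Greedily pair each Simple German sentence with the most similar
    sentence of the standard German document. `embed` is assumed to map
    a list of sentences to unit-normalized vectors (any multilingual
    sentence encoder would do)."""
    sim = embed(simple_sents) @ embed(complex_sents).T  # cosine similarities
    pairs = []
    for i in range(len(simple_sents)):
        j = int(np.argmax(sim[i]))
        if sim[i, j] >= threshold:  # discard low-confidence matches
            pairs.append((i, j))
    return pairs

def alignment_f1(predicted, gold):
    """Score predicted (i, j) index pairs against a manually labelled subset."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```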
2019
Improving Word Embeddings Using Kernel PCA
Vishwani Gupta | Sven Giesselbach | Stefan Rüping | Christian Bauckhage
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)
Word-based embedding approaches such as Word2Vec capture the meaning of words and the relations between them particularly well when trained on large text collections; however, they fail to do so with small datasets. Extensions such as fastText slightly reduce the amount of data needed; however, the joint task of learning meaningful morphological, syntactic, and semantic representations still requires a lot of data. In this paper, we introduce a new approach to warm-start embedding models with morphological information in order to reduce training time and enhance their performance. We use word embeddings generated by both Word2Vec and fastText models and enrich them with morphological information about words, derived from kernel principal component analysis (KPCA) of word similarity matrices. This can be seen as explicitly feeding morphological similarities into the network and letting it learn semantic and syntactic similarities on its own. Evaluating our models on word similarity and analogy tasks in English and German, we find that they not only achieve higher accuracies than the original skip-gram and fastText models but also require significantly less training data and time. Another benefit of our approach is that it generates high-quality representations of infrequent words, such as those found in very recent news articles with rapidly changing vocabularies. Lastly, we evaluate the different models on a downstream sentence classification task in which a CNN is initialized with our embeddings, and we find promising results.
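A minimal sketch of the KPCA step, assuming a character-n-gram Jaccard kernel as the word similarity and scikit-learn's KernelPCA; the similarity function and dimensions are assumptions, standing in for whatever word-form similarity the paper actually uses.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def ngram_similarity(u, v, n=3):
    """Morphological word similarity as Jaccard overlap of character n-grams;
    a stand-in for the actual similarity behind the paper's matrices."""
    grams = lambda w: {w[i:i + n] for i in range(max(1, len(w) - n + 1))}
    a, b = grams(u), grams(v)
    return len(a & b) / len(a | b)

def kpca_word_vectors(vocab, dim=100):
    """Embed every word via kernel PCA of the pairwise similarity matrix;
    the resulting vectors can warm-start skip-gram or fastText embeddings.
    Note: dim must not exceed the vocabulary size."""
    kernel = np.array([[ngram_similarity(u, v) for v in vocab] for u in vocab])
    kpca = KernelPCA(n_components=dim, kernel="precomputed")
    return kpca.fit_transform(kernel)  # shape: (len(vocab), dim)
```

Because the kernel is built purely from word forms, the resulting vectors already place morphological variants (e.g. "laufen", "läuft", "gelaufen") near each other before any corpus text is seen, which is what allows the warm-started model to converge with less data.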