Accounting for different degrees of formality is crucial for producing contextually appropriate language. To assist NLP applications concerned with this problem and formality analysis in general, we present the first dataset of sentences from a wide range of genres assessed on a continuous informal-formal scale via comparative judgments. It is the first corpus with a comprehensive perspective on German sentence-level formality overall. We compare machine learning models for formality scoring, a task we treat as a regression problem, on our dataset. Finally, we investigate the relation between sentence- and document-level formality and evaluate leveraging sentence-based annotations for assessing formality on documents.
We address the task of identifying euphemistic abuse (e.g. “You inspire me to fall asleep”) paraphrasing simple explicitly abusive utterances (e.g. “You are boring”). For this task, we introduce a novel dataset that has been created via crowdsourcing. Special attention has been paid to the generation of appropriate negative (non-abusive) data. We report on classification experiments showing that classifiers trained on previous datasets are less capable of detecting such abuse. Best automatic results are obtained by a classifier that augments training data from our new dataset with automatically-generated GPT-3 completions. We also present a classifier that combines a few manually extracted features that exemplify the major linguistic phenomena constituting euphemistic abuse.
We address the task of distinguishing implicitly abusive sentences on identity groups (“Muslims contaminate our planet”) from other group-related negative polar sentences (“Muslims despise terrorism”). Implicitly abusive language are utterances not conveyed by abusive words (e.g. “bimbo” or “scum”). So far, the detection of such utterances could not be properly addressed since existing datasets displaying a high degree of implicit abuse are fairly biased. Following the recently-proposed strategy to solve implicit abuse by separately addressing its different subtypes, we present a new focused and less biased dataset that consists of the subtype of atomic negative sentences about identity groups. For that task, we model components that each address one facet of such implicit abuse, i.e. depiction as perpetrators, aspectual classification and non-conformist views. The approach generalizes across different identity groups and languages.
The exploding amount of user-generated content has spurred NLP research to deal with documents from various digital communication formats (tweets, chats, emails, etc.). Using these texts as language resources implies complying with legal data privacy regulations. To protect the personal data of individuals and preclude their identification, we employ pseudonymization. More precisely, we identify those text spans that carry information revealing an individual’s identity (e.g., names of persons, locations, phone numbers, or dates) and subsequently substitute them with synthetically generated surrogates. Based on CodE Alltag, a German-language email corpus, we address two tasks. The first task is to evaluate various architectures for the automatic recognition of privacy-sensitive entities in raw data. The second task examines the applicability of pseudonymized data as training data for such systems since models learned on original data cannot be published for reasons of privacy protection. As outputs of both tasks, we, first, generate a new pseudonymized version of CodE Alltag compliant with the legal requirements of the General Data Protection Regulation (GDPR). Second, we make accessible a tagger for recognizing privacy-sensitive information in German emails and similar text genres, which is trained on already pseudonymized data.
Abusive language detection is an emerging field in natural language processing which has received a large amount of attention recently. Still the success of automatic detection is limited. Particularly, the detection of implicitly abusive language, i.e. abusive language that is not conveyed by abusive words (e.g. dumbass or scum), is not working well. In this position paper, we explain why existing datasets make learning implicit abuse difficult and what needs to be changed in the design of such datasets. Arguing for a divide-and-conquer strategy, we present a list of subtypes of implicitly abusive language and formulate research tasks and questions for future research.
To track different levels of formality in written discourse, we introduce a novel type of lexicon for the German language, with entries ordered by their degree of (in)formality. We start with a set of words extracted from traditional lexicographic resources, extend it by sentence-based similarity computations, and let crowdworkers assess the enlarged set of lexical items on a continuous informal-formal scale as a gold standard for evaluation. We submit this lexicon to an intrinsic evaluation related to the best regression models and their effect on predicting formality scores and complement our investigation by an extrinsic evaluation of formality on a German-language email corpus.
The vast amount of social communication distributed over various electronic media channels (tweets, blogs, emails, etc.), so-called user-generated content (UGC), creates entirely new opportunities for today’s NLP research. Yet, data privacy concerns implied by the unauthorized use of these text streams as a data resource are often neglected. In an attempt to reconciliate the diverging needs of unconstrained raw data use and preservation of data privacy in digital communication, we here investigate the automatic recognition of privacy-sensitive stretches of text in UGC and provide an algorithmic solution for the protection of personal data via pseudonymization. Our focus is directed at the de-identification of emails where personally identifying information does not only refer to the sender but also to those people, locations, dates, and other identifiers mentioned in greetings, boilerplates and the content-carrying body of emails. We evaluate several de-identification procedures and systems on two hitherto non-anonymized German-language email corpora (CodE AlltagS+d and CodE AlltagXL), and generate fully pseudonymized versions for both (CodE Alltag 2.0) in which personally identifying information of all social actors addressed in these mails has been camouflaged (to the greatest extent possible).
In this paper, we describe a workflow for the data-driven acquisition and semantic scaling of a lexicon that covers lexical items from the lower end of the German language register—terms typically considered as rough, vulgar or obscene. Since the fine semantic representation of grades of obscenity can only inadequately be captured at the categorical level (e.g., obscene vs. non-obscene, or rough vs. vulgar), our main contribution lies in applying best-worst scaling, a rating methodology that has already been shown to be useful for emotional language, to capture the relative strength of obscenity of lexical items. We describe the empirical foundations for bootstrapping such a low-end lexicon for German by starting from manually supplied lexicographic categorizations of a small seed set of rough and vulgar lexical items and automatically enlarging this set by means of distributional semantics. We then determine the degrees of obscenity for the full set of all acquired lexical items by letting crowdworkers comparatively assess their pejorative grade using best-worst scaling. This semi-automatically enriched lexicon already comprises 3,300 lexical items and incorporates 33,000 vulgarity ratings. Using it as a seed lexicon for fully automatic lexical acquisition, we were able to raise its coverage up to slightly more than 11,000 entries.
We deal with the pseudonymization of those stretches of text in emails that might allow to identify real individual persons. This task is decomposed into two steps. First, named entities carrying privacy-sensitive information (e.g., names of persons, locations, phone numbers or dates) are identified, and, second, these privacy-bearing entities are replaced by synthetically generated surrogates (e.g., a person originally named ‘John Doe’ is renamed as ‘Bill Powers’). We describe a system architecture for surrogate generation and evaluate our approach on CodeAlltag, a German email corpus.