Acquiring a Formality-Informed Lexical Resource for Style Analysis

To track different levels of formality in written discourse, we introduce a novel type of lexicon for the German language, with entries ordered by their degree of (in)formality. We start with a set of words extracted from traditional lexicographic resources, extend it by sentence-based similarity computations, and let crowdworkers assess the enlarged set of lexical items on a continuous informal-formal scale as a gold standard for evaluation. We submit this lexicon to an intrinsic evaluation related to the best regression models and their effect on predicting formality scores and complement our investigation by an extrinsic evaluation of formality on a German-language email corpus.


Introduction
The computational treatment of style in verbal communication has long been dominated by application concerns, e.g., the identification or profiling of authors in forensic linguistics (Ding et al., 2019) or the recognition of plagiarism (Alzahrani et al., 2012). This research was conducted assuming that simple lexico-statistic patterns identified by stylometric computations were sufficient to solve authorship and plagiarism assignment problems.
Despite their undisputed success in those limited fields, these studies scratched only the surface of the notion of 'style' as discussed in linguistic pragmatics (Hickey, 1993). From the many ways 'style' can be approached from a pragmatics perspective, we here focus on its inherent formality dimension, i.e., the distinction between formal (standard) and informal (colloquial) language use (for a survey, cf. Heylighen and Dewaele (1999)), with further extensions directed at the higher level of formal (e.g., elevated style) and the lower level of informal (e.g., vulgar) phrasing. Such distinctions of formality levels are crucial for the appropriateness of verbal expressions in a given discourse context.
In order to track different levels of formality in written communication, we introduce a novel type of lexicon for the German language, with entries ordered by their degree of (in)formality. We start with a set of words extracted from traditional lexicographic resources, extend it by sentence-based similarity computations, and let crowdworkers assess the enlarged set of lexical items on a continuous informal-formal scale. This workflow is described in Section 3. The resulting lexicon comprising words with their respective formality scores subsequently serves as a gold standard for evaluation. In Section 4, we submit this lexicon to an intrinsic evaluation related to the best regression models and their effect on predicting formality scores, and complement our investigation by an extrinsic evaluation of formality on a German-language email corpus in Section 5.
Much of the effort devoted to language style has, however, been spent in application niches, such as author identification or plagiarism detection. Most of the methodological contributions developed in this forensic branch are summarized under the label of stylometrics and have recently found their way into NLP analytics to unveil deception (Potthast et al., 2018; Pascucci et al., 2020a) or linguistic aggression (Harpalani et al., 2011; Nogueira dos Santos et al., 2018; Pascucci et al., 2020b).
The computational analysis of style according to stylometric principles has, from its inception, been closely linked with lexical frequency counts. Typically, function words (such as articles, pronouns, conjunctions, contractions, common abbreviations, hedging terms, also including punctuation marks) are assembled in small-sized dictionaries, together, if at all, with only a few content words (domain-specific nouns, verbs, adjectives). The frequency distributions resulting from counting these dictionary entries at the document or corpus level are already very beneficial for successfully dealing with disputed authorship problems (mostly for literary texts, but also for the detection of spam, fake news, or other kinds of toxic language) or uncovering plagiarism (mostly in scientific or news documents). Similar in spirit, word and sentence length criteria originating from readability metrics (Flesch-Kincaid, etc.) and several measures of vocabulary richness (e.g., type-token ratios, Yule's K and Burrows' ∆) were also incorporated into stylometric toolkits (Eder et al., 2016).
As a simple extension from these uni-grams, lexical or pseudo-lexical character n-grams (mostly bi- or tri-grams) were determined and counted, as well. Slightly extending this (pseudo-)lexically focused approach by syntactic information, part-of-speech n-grams (or part-of-speech frequencies) were also considered to trace the human 'stylome', although lexical factors were found to be more relevant for style analysis than syntactic (POS sequence) patterns (van Halteren et al., 2005).
Simple frequency metrics have increasingly been complemented by various forms of lexical association measures (such as information gain, mutual information), and more sophisticated probabilistic models (principal component analysis (PCA), latent semantic analysis (LSA), or other types of topic models). Comprehensive lists of criteria and metrics are provided by Sheikha and Inkpen (2010), Neal et al. (2017), and Ding et al. (2019).
We claim that, despite their relevance for applications such as authorship attribution and plagiarism detection, these mechanisms merely serve as easy-to-trace proxies for characterizing linguistic style. In our work, we take a closer look at the style-marking semantic connotation of single lexical items as explicit carriers of linguistic formality, an important facet of language style.
A milestone for the formal definition of formality was the pioneering work of Heylighen and Dewaele (1999), who defined the F-score (close in spirit to the simple lexico-statistic frequency metrics from stylometry) as the normalized difference between the percentage of non-deictic parts of speech (nouns, adjectives, prepositions, articles) and the percentage of deictic ones (pronouns, verbs, adverbs, interjections) in a document; F ranges between 0 and 100, with higher F indicating higher formality. This document-level perspective was adapted by Lahiri et al. (2011) to sentence-level formality analysis.
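As a worked illustration of the F-score, the following minimal sketch computes it from a list of coarse part-of-speech tags. It assumes the formula as published by Heylighen and Dewaele (1999), F = (noun + adjective + preposition + article − pronoun − verb − adverb − interjection + 100) / 2, with all frequencies given as percentages of the total word count. The tag names and the helper names `pos_percentages` and `f_score` are hypothetical and chosen only for this example.

```python
from collections import Counter

# Categories that raise formality (non-deictic) and lower it (deictic);
# the string labels are illustrative assumptions, not a fixed tag set.
NON_DEICTIC = ("noun", "adjective", "preposition", "article")
DEICTIC = ("pronoun", "verb", "adverb", "interjection")


def pos_percentages(tags):
    """Return the percentage of each tag relative to all words."""
    counts = Counter(tags)
    total = len(tags)
    return {tag: 100.0 * n / total for tag, n in counts.items()}


def f_score(percentages):
    """Compute the F-score (0 to 100) from POS percentages."""
    plus = sum(percentages.get(tag, 0.0) for tag in NON_DEICTIC)
    minus = sum(percentages.get(tag, 0.0) for tag in DEICTIC)
    return (plus - minus + 100.0) / 2.0


# Toy document of four words: two nouns, one verb, one article.
example = pos_percentages(["noun", "verb", "article", "noun"])
print(f_score(example))
```

A text consisting only of non-deictic categories reaches the maximum of 100, one consisting only of deictic categories the minimum of 0.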
A complementary lexical dimension for the formalization of formality was introduced by Brooke et al. (2010). They define the formality score for a word as a real-valued number in the range from 1 to −1, with 1 representing an extremely formal word and −1 an extremely informal one, and assign a formality score to each lexical item based on standard word length, morphology-based features, lexical distribution criteria or association methods (LSA). Our work follows theirs in the way formality is scored in a formality lexicon and in the use of manually supplied seed sets (as starters). It differs markedly, however, in whether the lexicon is considered a static (Brooke et al., 2010) or a dynamic resource (as we do; in a later study, Brooke and Hirst (2014) also proposed a dynamic acquisition method that assigns a continuous formality score to single words based on their co-occurrence frequency with a hand-picked seed set of formal, neutral and informal words), and in the way semantic similarity is computed (LSA vs. embeddings). Further, we do not induce formality levels for a near-synonym task automatically but rather crowdsource nuances of formality for a relationally unrestricted lexical inventory from human raters.
Pavlick and Tetreault (2016) proposed a model of formality based on an empirical analysis of human formality perceptions. They apply their approach to analyze language use in online debate forums for multiple genres (news, blogs, emails, and community question answering sites). Formality assessments are solicited via Amazon Mechanical Turk (following the protocol established by Lahiri (2015)) using a 7-point Likert scale, with labels ranging from −3 (Very Informal) to 3 (Very Formal). A ridge regression model uses 11 different feature groups to determine the formality level of sentences: five rarely used ones (among them WORD2VEC embeddings (Mikolov et al., 2013), parse trees, dependency tuples, and named entities) and six much more common ones (among them lower/upper casing, punctuation, readability scores, POS tags, and length-normalized formality and subjectivity scores). Cross-genre analysis reveals that n-grams and word embeddings perform best among all tested features (they achieve over 80% of the performance of the full model in all cases). This work comes closest to our approach, yet with differences in the way formality is assessed (Likert scales vs. best-worst scaling) and lexicon building is dealt with. Pavlick and Tetreault (2016) employ an acquisition method to score the formality of unseen phrases along the formal-casual dimension from scratch, as described in earlier work by Pavlick and Nenkova (2015), who use a log ratio metric based on the occurrence of phrases in various style-tagged corpora, in contrast to the embedding-based similarity model we propose.
Earlier computational models for detecting formality were proposed by Sheikha and Inkpen (2010); Peterson et al. (2011); Mosquera and Moreda (2012). The first two perform a binary classification only into formal vs. informal utterances, the third model classifies into four levels of (in)formality, and all of them operate at the document (as opposed to sentence) level.

Getting Started with VULGER
Previous work on computational lexicons (and lexicon acquisition) incorporating formality information focuses exclusively on the English language (Brooke et al., 2010; Brooke and Hirst, 2014; Pavlick and Nenkova, 2015; Pavlick and Tetreault, 2016). For German, VULGER (Eder et al., 2019; https://github.com/ee-2/VulGer) constitutes a lexical resource that can be reused for such purposes to some degree. It comprises 3,300 German words scored by vulgarity/neutrality within a range of −1 (most vulgar) to +1 (most neutral). Accordingly, it covers the lower half of the formality spectrum quite well but completely lacks its upper half (formal up to elevated language). This study attempts to fill this gap by introducing I-FORGER, a comprehensive lexicon for Informal and Formal German. To acquire a lexicon covering the formal spectrum as well, we gathered formality-marked lexical entries in several ways as described in the following subsections.

Input from Lexicographic Resources
As a first lexical acquisition step, we gathered lexical items from existing lexicographic resources based on their manually assigned categorical (in)formality tags:
Swear Words. As there is an overlap between swear words and vulgar lexicalizations, we used 500 lexical items randomly chosen from three German swear word lists to feed the lower end of formality in I-FORGER.
Colloquial Items. In addition, we extracted 500 arbitrary terms marked as 'colloquial' ('ugs.' or 'umgangssprachlich,' in German) from the German slice of WIKTIONARY and the German OPENTHESAURUS, which are supposed to range somewhere between vulgar and neutral on our scale.
Elevated Items. To extend the scale to the upper levels of linguistic formality, we also picked lexical items marked as 'elevated' ('geh.' or 'gehoben,' in German) from OPENTHESAURUS and WIKTIONARY, yielding 1,000 additional terms (for the sake of balancing informal and formal entries in that phase).
The reuse of manually curated lexicon resources (as seeds) thus follows the approach proposed by Brooke et al. (2010).

Lexicon Extension via Sentence Similarity
Given the intrinsic limitations of any manually curated lexicon resource, in the next step, we augmented I-FORGER by automatic means. We here suggest harvesting lexical candidates potentially carrying formality information from semantically similar sentences. This proposal goes beyond the standard way to utilize word embeddings in order to find close semantic neighbors based on the distributional hypothesis (see, e.g., Tulkens et al. (2016); Wiegand et al. (2018a) for detecting abusive lexicalizations this way). Rather than only discovering semantically related words, we extended our scope to semantically similar sentences to identify other relevant lexical candidates in the mined sentences, like an adjective modifying an offensive noun or other vulgar, yet otherwise unrelated, words in a vulgar word's context. On the flip side, this method admittedly gathers a considerable amount of noise (cf. Section 4 for a scoring approach to account for this problem).
Figure 1: Generic language-independent workflow for gathering words for formality scoring approaches utilizing similar sentences (in blue) and its instantiation for our use case to acquire SIMSENTWORDS (in green)

Sentence Embeddings
As is well known, BERT (Devlin et al., 2019) set new state-of-the-art results for various NLP problems, including semantic similarity tasks. However, finding semantically similar sentences close in vector space with BERT is computationally expensive. As a remedy, Reimers and Gurevych (2019) introduced SENTENCE-BERT (SBERT), which modifies the pre-trained BERT network using siamese and triplet networks and produces semantically meaningful sentence embeddings that can be compared employing standard cosine similarity.

Sentence Similarity
To obtain candidate sentences for similarity computation for the German language, we employed a wide range of corpora. Our choices were guided by the requirements that these corpora should possess a high stylistic variance and contain vocabulary from the lower language register, too. We came up with:
• CODE ALLTAG (Eder et al., 2020), comprising roughly 1.5M German-language emails,
• ONE MILLION POSTS CORPUS (Schabus et al., 2017), containing about 1M user comments on news articles from the Austrian daily broadsheet newspaper DER STANDARD,
• DORTMUNDER CHAT KORPUS (Beißwenger, 2013), with more than 140,000 German-language chats,
• HATE SPEECH TOWARDS FOREIGNERS (Bretschneider and Peters, 2017), with about 6,000 posts and comments on German anti-foreign FACEBOOK pages,
• GERMEVAL 2018/19, collected for the task of identifying offensive language (Wiegand et al., 2018b; Struß et al., 2019), including roughly 15,000 German-language tweets.
Using VULGER as a seed lexicon, we extracted sentences from these corpora by separating those containing VULGER entries from those that did not contain any VULGER item. To further enlarge the number of sentences for each seed item, we also gathered sentences given as examples on the WIKTIONARY pages for the entries included in VULGER. From the resulting pool of sentences with seed words, we collected up to six sentences per word. We chose them randomly but tried to take one sentence from each of the six resources to keep some balance, both formality-wise and genre-wise.
These sentences served as seeds for the computation of similar sentences. Like the remaining sentences not containing any seed words, they were embedded with SENTENCE TRANSFORMERS (Reimers and Gurevych, 2019) using the multilingual model supporting German (Reimers and Gurevych, 2020). Then, for each seed sentence embedding, we retrieved the most similar sentence among the remaining sentence embeddings using cosine distance (the acquisition step proper). From these most similar sentences, we gathered lemmatized nouns, finite verbs, adjectives, and adverbs, omitting named entities. An overview of the entire acquisition procedure is depicted in Figure 1.
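The retrieval step described above can be sketched as a nearest-neighbor search under cosine similarity. This is only a dependency-free illustration: the short toy vectors stand in for real sentence embeddings, and the function names `cosine_similarity` and `most_similar` are hypothetical, not part of the SENTENCE TRANSFORMERS library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar(seed_vectors, pool_vectors):
    """For each seed vector, return the index of its closest pool vector."""
    result = []
    for seed in seed_vectors:
        best = max(
            range(len(pool_vectors)),
            key=lambda j: cosine_similarity(seed, pool_vectors[j]),
        )
        result.append(best)
    return result


# Toy 2-d "embeddings": two seed sentences, three candidate sentences.
seeds = [[1.0, 0.0], [0.0, 1.0]]
pool = [[0.0, 2.0], [3.0, 0.1], [-1.0, 0.0]]
print(most_similar(seeds, pool))
```

Maximizing cosine similarity is equivalent to minimizing cosine distance (one minus the similarity). For corpora of the size used here, a vectorized or approximate search would be needed in practice; the quadratic loop only shows the logic.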
From the resulting word list, we randomly chose 1,000 items (denoted SIMSENTWORDS, in the following) to evaluate the regression approach and the acquisition strategy of automatically gathering new words to score. As we also wanted to measure the acquisition noise, we further divided the words into 500 items manually cleansed from spelling mistakes, etc. (SIMSENTWORDS_cleansed), and left 500 as-is (SIMSENTWORDS_noisy).

I-FORGER at a Glance
Putting these pieces together, I-FORGER, the final lexicon, comprises 3,000 words in total, with three major divisions: 1,000 terms from elevated language usage (ELEVATEDWORDS); 1,000 words, swear words and colloquial items combined (SWEARWORDS and COLLOQUIALWORDS, 500 each), presumably linked to the lower stylistic inventory; and 1,000 words harvested from similar sentences (SIMSENTWORDS) that should tend to occur at the informal end of our scale but potentially include words from all stylistic levels (see Table 1).

Human Assessment of I-FORGER
To establish a gold standard for subsequent evaluation, we gathered human formality assessments. For that, I-FORGER was annotated with Best-Worst Scaling (BWS), a method that delivers high-quality annotations with only a relatively small number of annotation steps compared to standard point-interval based methods (e.g., Likert scales) for human assessment tasks. BWS also adheres to the principle that a "continuum of formality" (Heylighen and Dewaele, 1999) exists rather than n-ary categorical distinctions between formal and informal utterances (see also Lahiri et al. (2011); Brooke and Hirst (2014) for works based on degrees of formality). BWS was introduced into NLP for emotion scaling by Kiritchenko and Mohammad (2016, 2017). Annotators are presented with n items at a time (an n-tuple, where n > 1, and typically n = 4). They then have to decide which item from the n-tuple is the best (highest in terms of the property of interest) and which is the worst (lowest in terms of the property of interest).
In our case, judges had to select the most elevated and the most vulgar term per given n-tuple. We used the BWS tool from Kiritchenko and Mohammad (2016, 2017) to generate 6,000 4-tuples for human assessment. Tuples were produced randomly under the constraints that each term occurred in exactly eight different tuples and that each tuple was unique.
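The two constraints on tuple generation (every term in exactly eight tuples, no tuple repeated) can be met with a simple round-based scheme. The sketch below is an assumption-laden illustration and not the algorithm of the BWS tool used in the study: it shuffles the vocabulary once per round, cuts it into consecutive 4-tuples, and redraws a round if it reproduces an earlier tuple. The function name `make_bws_tuples` and all parameters are hypothetical.

```python
import random
from collections import Counter


def make_bws_tuples(items, tuple_size=4, appearances=8, seed=0, max_attempts=1000):
    """Generate unique tuples so that each item occurs `appearances` times."""
    if len(items) % tuple_size != 0:
        raise ValueError("number of items must be divisible by tuple_size")
    rng = random.Random(seed)
    seen = set()      # item sets of all tuples accepted so far
    tuples = []
    for _ in range(appearances):
        for _attempt in range(max_attempts):
            order = list(items)
            rng.shuffle(order)
            # Each item lands in exactly one tuple of this round.
            round_tuples = [
                tuple(order[i:i + tuple_size])
                for i in range(0, len(order), tuple_size)
            ]
            keys = [frozenset(t) for t in round_tuples]
            if not any(k in seen for k in keys):
                break
        else:
            raise RuntimeError("could not draw a round without duplicates")
        seen.update(keys)
        tuples.extend(round_tuples)
    return tuples


# Toy vocabulary of 12 placeholder terms: 8 rounds x 3 tuples = 24 tuples.
vocab = [f"w{i}" for i in range(12)]
generated = make_bws_tuples(vocab)
print(len(generated), Counter(w for t in generated for w in t)["w0"])
```

With 3,000 terms, eight rounds of 750 tuples each give the 6,000 4-tuples reported above.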
For the annotation process proper, we used the crowdsourcing platform CLICKWORKER, where we had each n-tuple assessed by five annotators (Kiritchenko and Mohammad (2016) showed that as few as 2-3 responses per tuple are sufficient to get reliable scores, at least for the assessment of sentiment). In order to get real-valued scores from the BWS annotations, we applied COUNTS ANALYSIS (Orme, 2009) and subtracted the percentage of times the term was chosen as worst from the percentage of times the term was chosen as best. Thus, we got scores between +1 (most formal) and −1 (most informal). We computed the split-half reliability by randomly splitting the annotations of a tuple into two halves, calculating scores independently for these halves, and measuring the correlation between the resulting two sets of scores. We got an average Spearman's ρ of 0.8954 (± 0.0030) over 100 trials.
Figure 2 displays the distribution of human assessed scores per resource for I-FORGER. While SWEARWORDS and, to a lesser degree, also COLLOQUIALWORDS are linked to lower scores, and ELEVATEDWORDS obtained higher scores, SIMSENTWORDS are found in the middle, spreading over the entire scale of scores and also comprising a fair amount of words from the lower end of formality.
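The counts analysis and the split-half reliability check described above can be sketched as follows. The data layout (one `(tuple, best, worst)` triple per annotation) and the function names are assumptions made for this illustration; Spearman's ρ is computed here as the Pearson correlation of average ranks, which handles ties.

```python
import random
from collections import Counter, defaultdict


def bws_scores(annotations):
    """Score = share chosen as best minus share chosen as worst."""
    shown, best, worst = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        shown.update(items)
        best[b] += 1
        worst[w] += 1
    return {t: (best[t] - worst[t]) / shown[t] for t in shown}


def _ranks(values):
    """Average ranks (1-based), ties share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0.0 or sy == 0.0:
        return float("nan")  # undefined for constant input
    return cov / (sx * sy)


def split_half_reliability(annotations, trials=100, seed=0):
    """Average rho between scores from two random halves per tuple."""
    rng = random.Random(seed)
    by_tuple = defaultdict(list)
    for ann in annotations:
        by_tuple[ann[0]].append(ann)
    rhos = []
    for _ in range(trials):
        half_a, half_b = [], []
        for anns in by_tuple.values():
            shuffled = anns[:]
            rng.shuffle(shuffled)
            mid = len(shuffled) // 2
            half_a.extend(shuffled[:mid])
            half_b.extend(shuffled[mid:])
        sa, sb = bws_scores(half_a), bws_scores(half_b)
        terms = sorted(set(sa) & set(sb))
        rhos.append(spearman([sa[t] for t in terms], [sb[t] for t in terms]))
    return sum(rhos) / len(rhos)


# Two toy annotations of the same 4-tuple (placeholder terms a to d).
toy = [(("a", "b", "c", "d"), "a", "d"), (("a", "b", "c", "d"), "a", "c")]
print(bws_scores(toy))
```

With five annotations per tuple, the two halves hold two and three annotations, respectively.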

Intrinsic Evaluation of I-FORGER
Rather than increasing the size and thus the coverage of lexicons to improve performance on potential applications, we intend to score (unseen) words on the fly. Hence, we first evaluate the word scoring model (Section 4.1). Next, we assess the four main input streams of I-FORGER (Section 4.2) and the extension of the scale regarding formality levels (Section 4.3). Figure 3 illustrates the schematic workflow for our word scoring procedure, including (and marked in green) the three evaluation tasks.

Regression Models for Word Scoring
We adopted an approach that uses a seed lexicon (more precisely, its entries' word embeddings) as training data for regression models that automatically score new lexical items for their formality connotation (see, e.g., Li et al. (2017) and Buechel and Hahn (2018) for a similar scenario for automatic emotion induction).
As input features, we chose FASTTEXT word embeddings (Grave et al., 2018) with their built-in out-of-vocabulary (OOV) functionality. We found that they performed better than either taking the OOV handling from BPEMB subword embeddings (Heinzerling and Strube, 2018), which are based on Byte Pair Encoding (BPE) (Sennrich et al., 2016), or relying solely on BPEMB embeddings.
We evaluated different regression models. Besides RIDGE REGRESSION, which is linear regression with L2 regularization during training (we used the SCIKIT-LEARN implementation with the default parameters), we also experimented with DENSIFIER (Rothe et al., 2016), which learns an orthogonal transformation of the embedding space, and a modified, more robust variant of the latter, DENSRAY (Dufter and Schütze, 2019; we used their code provided at https://github.com/pdufter/densray). We ran a feed-forward neural network with one hidden layer combined with the boosting algorithm AdaBoost.R2 (BOOSTED FFNN) as proposed by Du and Zhang (2016; we used their code from https://github.com/StevenLOL/ialp2016_Shared_Task). Further, we tested neural networks with more than one hidden layer, namely two hidden layers with 256 and 128 units (NN_2Hidden) and three hidden layers with 256, 128 and 64 units (NN_3Hidden). For these, we used KERAS in TENSORFLOW with the following hyperparameters: embedding/input layer with 0.2 and hidden layers with 0.5 dropout, MaxNorm weight constraint of 3, random normal weight initialization, ReLU activation, Adam optimizer, batch size of 32, mean squared error loss and 1,000 epochs with early stopping.
Table 2 shows that DENSIFIER and DENSRAY performed worse than all the others. Also, RIDGE REGRESSION yielded significantly lower results than the BOOSTED FFNN model. We found no difference between NN_2Hidden, NN_3Hidden and BOOSTED FFNN since all three reached a strong Spearman's ρ of 0.77. As a higher number of hidden layers did not significantly improve results, we used BOOSTED FFNN for further processing.
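To make the word scoring idea concrete, the following dependency-free sketch fits the simplest of the models above, an L2-regularized linear regression, by gradient descent and then scores an unseen vector. It is not the SCIKIT-LEARN implementation used in the experiments: the objective here is the mean squared error plus `alpha` times the squared weight norm, which scales the penalty differently. The 2-dimensional toy vectors stand in for FASTTEXT embeddings, and `fit_ridge` and `predict` are hypothetical names.

```python
def fit_ridge(X, y, alpha=1.0, lr=0.1, epochs=5000):
    """Minimize mean squared error + alpha * ||w||^2 by gradient descent.

    The bias term is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [
            sum(wj * xj for wj, xj in zip(w, x)) + b - target
            for x, target in zip(X, y)
        ]
        grad_w = [
            2.0 / n * sum(e * x[j] for e, x in zip(errors, X)) + 2.0 * alpha * w[j]
            for j in range(d)
        ]
        grad_b = 2.0 / n * sum(errors)
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Formality score predicted for one embedding vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Toy seed lexicon: vectors with made-up formality scores in [-1, 1].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
y = [0.5, -0.5, 0.0, 0.5]
w, b = fit_ridge(X, y, alpha=0.0)
print(predict(w, b, [0.8, 0.2]))  # score for an "unseen word"
```

Raising `alpha` shrinks the weights toward zero, trading fit on the seed lexicon for robustness on unseen words.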
Table 2: Averaged Spearman's ρ for different models (10-fold cross-validation on I-FORGER); statistically significant differences (using the two-sided Wilcoxon signed-rank test on Spearman's ρ) are marked with '*' for p < 0.005 with respect to BOOSTED FFNN

Assessment of Input Streams
Table 3 pinpoints the predictability of formality for each input stream of I-FORGER in a 10-fold cross-validation setting. Learning scores of COLLOQUIALWORDS and ELEVATEDWORDS seems harder than scoring SWEARWORDS and SIMSENTWORDS. The lower human agreement on choosing the most elevated item supports this finding for the upper half of the formality spectrum.
The data also reveal that the regression model is somewhat prone to noise, since the original SIMSENTWORDS_noisy achieved much lower results than the curated SIMSENTWORDS_cleansed. However, this acquisition strategy seems to be a choice worth considering for scoring approaches.

Assessment of Formality Scale Extension
A comparison with VULGER suggests that scoring an extended range of linguistic styles is a more difficult task, since evaluating the BOOSTED FFNN model on VULGER achieved a higher Spearman's ρ of 0.827 (10-fold cross-validation) than on I-FORGER (see Table 3). Nevertheless, applying a model trained on I-FORGER to VULGER gave a Spearman's ρ of 0.678, which indicates that such a model still captures the vulgar-neutral dimension despite being trained on an extended scale with fewer words (3,000 vs. 3,300). It also shows that the word scoring approach per se indeed yields reliable results on the informal-formal dimension.

Extrinsic Evaluation of I-FORGER
In order to gather evidence for the value of I-FORGER in combination with the word scoring approach within a realistic use case, we ran experiments with emails, which possess a higher stylistic variability than news concerning their formality spread (Pavlick and Tetreault, 2016). Other work related to the formality of emails is typically carried out in the context of communication behavior studies in enterprises, with a focus on determining social factors (social distance, relative power, and the weight of imposition) that affect the sender's choice of formality (Peterson et al., 2011) or on the affective dimension of email exchanges (Chhaya et al., 2018) in terms of the prediction of frustration of employees from email data.

Email Corpus and Formality Gold Standard
Again using BWS and the tools from Kiritchenko and Mohammad (2016, 2017) mentioned before, we manually scored 800 German emails from CODE ALLTAG S+d, a specialized, metadata-rich subset of CODE ALLTAG (Eder et al., 2020), for their formality. 35 annotators had to select the most formal email and the most informal email from four emails per rating step. Altogether, we had 1,600 4-tuples assessed three times. We got an average Spearman's ρ of 0.9198 (± 0.0043) over 100 trials. The resulting scores on an informal-formal scale from −1 (informal) to +1 (formal) served as the basis for our experiments.
Figure 4: Distribution of I-FORGER scores for formal (with formality scores from 0 to +1) and informal emails (rated from −1 to 0)
Figure 5: Overview of our workflow to score emails for formality using I-FORGER scores

Distribution of I-FORGER Scores
Under the assumption that formal emails include more formal words and informal emails more informal terms, we first examined the distribution of scores calculated for the emails' words with the BOOSTED FFNN model learned on I-FORGER. We tentatively split our dataset into two parts: emails with scores from −1 to 0 formed the informal part, whereas emails rated with positive numbers in a range from 0 to +1 were regarded as formal. Figure 4 indicates that, in comparison, informal emails indeed contain more negatively scored terms and formal emails comprise more words in the upper part of the informal-formal scale.

Formality Scoring of Emails
In the final evaluation setup, we tested whether word formality scoring works better than lexicon look-up in traditional resources and whether categorized items or continuous scores get better results.
To determine the proper features for a linear regressor predicting formality scores, we used a vector comprising the relative frequencies of an email's word scores as input (count per score divided by the total number of scored words). The regressor was a neural network with two hidden layers (128 and 64 units) and the same configuration in the KERAS library in TENSORFLOW as reported for NN_2Hidden and NN_3Hidden. In one setting, the I-FORGER word scorer tagged (unseen) nouns, finite verbs and adjectives (Figure 5 depicts the workflow for this experiment). In another setting, we only counted the scores of words already present in I-FORGER (without acquisition step). Besides relative score frequencies, we also tested taking the average score per document (sum of all calculated scores divided by the total number of scored words) as input feature for both settings. Using the average scores directly to determine a correlation with the emails' formality scores gave comparable results.
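The two document-level feature types described above can be sketched as follows. Since word scores are continuous, this illustration assumes that "count per score" refers to counts over equal-width score bins; the number of bins, the toy word scores and the function name `email_features` are assumptions for this example and are not taken from the experiments.

```python
def email_features(tokens, word_scores, n_bins=10):
    """Return (relative score frequencies per bin, average score).

    Only tokens with a known score are counted; scores are expected
    to lie in [-1, 1].
    """
    scores = [word_scores[t] for t in tokens if t in word_scores]
    if not scores:
        return [0.0] * n_bins, 0.0
    histogram = [0] * n_bins
    for s in scores:
        # Map [-1, 1] onto bin indices 0 .. n_bins - 1.
        index = int((s + 1.0) / 2.0 * n_bins)
        histogram[max(0, min(n_bins - 1, index))] += 1
    relative = [count / len(scores) for count in histogram]
    average = sum(scores) / len(scores)
    return relative, average


# Made-up scores for four German words; "unbekannt" has no score.
toy_scores = {"hallo": -0.1, "mist": -0.9, "sehr": 0.1, "geehrte": 0.9}
tokens = ["hallo", "mist", "sehr", "geehrte", "unbekannt"]
print(email_features(tokens, toy_scores, n_bins=4))
```

In the first setting, `word_scores` would be filled by the word scorer for all nouns, finite verbs and adjectives of an email; in the second, it would contain only the entries of the lexicon itself.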
For a comparison of scores against pre-specified categories, we mapped the scores of I-FORGER to formality categories. We divided the scale into five distinct sections (e.g., scores between 0.6 and 1.0 form one category), assigned the respective category to each score and used a classifier instead of a linear regressor to learn the categories of new words. The relative frequencies of the categories then served as input for the linear regressor. We also experimented with ignoring OOV words and only utilizing lexicon look-up for the categorical scenario. For this setting, we exploited the complete pre-categorized word lists from which we had drawn the SWEARWORDS, COLLOQUIALWORDS and ELEVATEDWORDS in order to increase coverage. In this way, in the case of swear words, e.g., we did not only use the 500 items assembled in the I-FORGER lexicon, but used a list of more than 13,000 entries. As features instead of scores, we counted the frequency of swear words, colloquial words and elevated words separately in each email and divided it by the total number of words found in the lexicons.
[...] words also performed better with score frequencies than using the average. However, compared to employing a word scorer for unseen words, the results for simple lexicon look-up are lower, a finding that seems to be due to the limited coverage of I-FORGER. Therefore, we can conclude that our way of scoring potentially unseen words is an effective and advantageous alternative to using fixed-size, and thus limited, lexical resources. Employing the relative frequencies of formality categories instead of scores also yielded lower results for both settings, classifying new words (see I-FORGER_cat) and utilizing lexicon look-up with pre-categorized items (CATEGORIES). This demonstrates the benefit of a scaling approach instead of relying on coarse-grained categories.
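The mapping from continuous scores to five categories used in the categorical setting can be sketched as below. Only the top section (0.6 to 1.0) is given in the text; the sketch assumes that the remaining sections are equally wide, which yields the boundaries −0.6, −0.2, 0.2 and 0.6. The category labels and the function name `score_to_category` are hypothetical placeholders.

```python
from bisect import bisect_right

# Assumed equal-width section boundaries on the scale from -1 to +1.
BOUNDARIES = [-0.6, -0.2, 0.2, 0.6]
# Placeholder labels, ordered from the informal to the formal end.
LABELS = ["cat_1", "cat_2", "cat_3", "cat_4", "cat_5"]


def score_to_category(score):
    """Return the label of the section that contains the score.

    A score exactly on a boundary is assigned to the upper section.
    """
    return LABELS[bisect_right(BOUNDARIES, score)]


print([score_to_category(s) for s in (-0.9, -0.3, 0.0, 0.4, 0.8)])
```

Counting these labels per email and dividing by the number of categorized words gives the relative category frequencies that replace the score frequencies in this setting.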

Conclusion
Different levels of formality these days find increasing attention, both in methodological approaches and NLP applications. The necessity of choosing a socially appropriate tone is particularly evident in digitally mediated discourse, e.g., formal business or informal private email communication (Chhaya et al., 2018) or social media interaction via reviews, chats, or blogs (Pavlick and Tetreault, 2016; Gonzàlez Bermúdez, 2015). The increasing relevance of conversationally adequate virtual personal assistants (Shamekhi et al., 2016), chatbots (Chaves et al., 2019) and automatic procedures for smart response generation (Kannan et al., 2016) requires sensitivity on the generator's side to strike the right tone and avoid the false one. Similarly, machine translation poses special problems when expressions of (in)formality have to be adequately transferred between different languages (Niu et al., 2018). Progress in monitoring formality levels is a methodological prerequisite for several downstream applications that have to comply with users' habitual expectations or increase user satisfaction, e.g., in commercial interactions (customer service communication) (Liebrecht et al., 2020; Elsholz et al., 2019) or medical consultation (Fadhil and Schiavo, 2019).
As a methodological contribution, we here propose a lexical approach to computational style analysis based on I-FORGER, a lexicon whose (3,000) items are scaled on a continuous informal-formal spectrum. We make three new contributions to style analysis: First, a language-independent lexicon acquisition architecture employing sentence embeddings forms the basis for computing sentence similarity, thus finding formality-sensitive lexical items not contained in the seeds. Second, best-worst scaling is used for creating gold standards available for an in-depth intrinsic and extrinsic evaluation of the new lexical resource. Finally, I-FORGER stands out as the first formality-informed lexicon for the German language. This resource is available at https://github.com/ee-2/I-ForGer.
Despite our lexical focus, we are aware of the fact that formality is not only lexically expressed. Consequently, a lexicon-based approach has to be complemented by methods that account for nonlexicalized varieties of formality. Such forms may include syntactic variability, linguistic complexity and readability, as well as correctness of language use regarding orthography, morphology and syntax. For research on formality detection incorporating its syntactic, semantic and discourse facets, cf., e.g., Heylighen and Dewaele (1999), Li et al. (2013) or Pavlick and Tetreault (2016). These branches will also be part of our future work. Still, a (potentially) large portion of formality assessments is rooted in lexical signals, which we capture by the methodology advanced in this paper.