A Unified Feature Representation for Lexical Connotations

Ideological attitudes and stance are often expressed through subtle meanings of words and phrases. Understanding these connotations is critical to recognizing the cultural and emotional perspectives of the speaker. In this paper, we use distant labeling to create a new lexical resource representing connotation aspects for nouns and adjectives. Our analysis shows that it aligns well with human judgments. Additionally, we present a method for creating lexical representations that capture connotations within the embedding space and show that using the embeddings provides a statistically significant improvement on the task of stance detection when data is limited.


Introduction
Expressions of ideological attitudes are widespread in today's online world, influencing how we perceive and react to events and people on a daily basis. These attitudes are often expressed through subtle expressions or associations (Somasundaran and Wiebe, 2010; Murakami and Putra, 2010). For example, the sentence "the people opposed gun control" conveys no information about the author's opinion. However, by adding just one word, "the selfish people opposed gun control", the author can convey their stance on both gun control (against) and the people who support it (not valuable and disliked). Discerning such subtle meaning is crucial for fully understanding and recognizing the hidden influences behind everyday content.
Recent studies in NLP have begun to examine these hidden influences through framing in social media and news (Asur and Huberman, 2010; Hartmann et al., 2019; Klenner, 2017) and style detection in hyperpartisan news (Potthast et al., 2018). Lexical connotations provide a method to study these influences, including stance, in more detail. Connotations are implied cultural and emotional associations for words that augment their literal meanings (Carpuat, 2015; Feng et al., 2011). Connotation values are associated with a phrase (e.g., fear is associated with "cancer") (Feng et al., 2011) and capture a range of nuances, such as whether a phrase is an insult or implies value (see Figure 1).
In this paper, we define six new fine-grained connotation aspects for nouns and adjectives, filling a gap in the literature on connotation lexica, which has focused on verbs (Sap et al., 2017; Rashkin et al., 2016) and coarse-grained polarity (Feng et al., 2011; Kang et al., 2014). We create a new distantly labeled English lexicon that maps nouns and adjectives to our six aspects and show that it aligns well with human judgments. In addition, we show that our lexicon confirms existing hypotheses about subtle semantic differences between synonyms.
We then learn a single connotation embedding space for words from all parts of speech, combining our lexicon with existing verb lexica and contributing to the literature on unifying lexica (Hoyle et al., 2019). Intrinsic evaluation shows that our embedding space captures clusters of connotatively similar words. In addition, our embedding model can generate representations for new words without the numerous training examples required by standard word-embedding methods. Finally, we show that our connotation embeddings improve performance on stance detection, particularly in a low-resource setting.
Our contributions are as follows: (1) we create a new connotation lexicon and show that it aligns well with human judgments, (2) we train a connotation feature embedding for all parts of speech and show that it captures connotations within the embedding space, and (3) we show the connotation embeddings improve stance detection when data is limited. Our resources are available at https://github.com/emilyallaway/connotation-embedding.

Related Work
Studies of connotation build upon the literature examining subtle language nuances, including good and bad effects of verbs (Choi and Wiebe, 2014), evoked sentiments and emotions (Mohammad et al., 2013a; Mohammad and Turney, 2010; Mohammad, 2018b), multi-dimensional sentiment (Whissell, 2009; Mohammad, 2018a; Whissell, 1989), offensiveness (Klenner et al., 2018), and psycho-sociological properties of words (Stone and Hunt, 1963; Tausczik and Pennebaker, 2009). Work explicitly on connotations has focused primarily on detailed aspects for verbs (Rashkin et al., 2016; Sap et al., 2017; Klenner, 2017) or single polarities for many parts of speech (Feng et al., 2011, 2013; Kang et al., 2014). One exception is the work of Field et al. (2019), which extends limited detailed connotation dimensions from verbs to nouns within the context of certain verbs. Our work is unique in directly defining detailed aspects for nouns and adjectives.
Early work on stance detection applied topic-specific models to various genres, including online debate forums (Sridhar et al., 2015; Somasundaran and Wiebe, 2010; Murakami and Putra, 2010; Hasan and Ng, 2013, 2014) and student essays (Faulkner, 2014). More recent studies have used a single model for many topics to predict stance in Tweets (Mohammad et al., 2016; Augenstein et al., 2016; Xu et al., 2018) and as part of the fact extraction and verification pipeline (Conforti et al., 2018; Ghanem et al., 2018; Riedel et al., 2017; Hanselowski et al., 2018). Klenner et al. (2017) explore the relationship between connotations and stance through verb frames. In contrast, our work studies stance using connotation representations from a learned joint embedding space for words from all parts of speech. Recently, Webson et al. (2020) examine representations of political ideology as connotations and their use in information retrieval. Representation learning has been used for stance detection of online debates by Li et al. (2018), who develop a joint representation of the text and the authors. Our work, however, uses a representation of word connotations and does not use any author information (a strong feature in fully-supervised datasets but one that may not be available in real-world settings).

Connotation Lexicon
We build a connotation lexicon for nouns and adjectives by defining six new aspects of connotation. We take inspiration from verb connotation frames and their extensions (Rashkin et al., 2016; Sap et al., 2017), which define aspects of connotation in terms of the agent and theme of transitive verbs. Rashkin et al. (2016) define six aspects of connotation for verbs (entities', writer's, and reader's perspectives, effect, value, and mental state) in connotation frames (e.g., "suffer": negative effect on the agent) and Sap et al. (2017) extend these aspects to include power and agency.
We first define our six new aspects of connotation for nouns and adjectives (§3.1), then describe our distant labeling procedure (§3.2) and the human evaluation of the final lexicon (§3.3).

Definitions
We use w to indicate a word and w′ to indicate the person, thing, or attribute signified by w.
For each w, we define (1) Social Value: whether w′ is considered valuable by society, (2) Politeness (Polite): whether w is a socially polite term, (3) Impact: whether w′ has an impact on society (or on the thing modified by w, if w is an adjective), (4) Factuality (Fact): whether w′ is tangible, (5) Sentiment (Sent): the sentiment polarity of w, and (6) Emotional Association (Emo): the emotions associated with w′. We show examples in Table 1.
(1) Social Value includes both the value of objects or concepts and the social status and power of people or people-referring nouns (e.g., occupations). "Sociocultural pragmatic reasoning" (Colston and Katz, 2005) about such factors is crucial for understanding language such as connotations.
Initial work on connotation polarity lexica recognized the important role of Social Value in overall connotation by defining a 'positive' connotation for objects and concepts that people value (Feng et al., 2011). Later work made this idea more explicit.
(2) Politeness follows the definition of Lakoff (1973) in noting words that make the addressee feel good, but also includes notions of formality. These notions have been previously studied within the context of politeness as a set of behaviors and linguistic cues (Brown and Levinson, 1987; Danescu-Niculescu-Mizil et al., 2013; Aubakirova and Bansal, 2016). We focus on purely lexical distinctions because how one comprehends these distinctions affects one's "attitude towards the speaker ... or some issue" as well as whether one feels insulted by the exchange (Colston and Katz, 2005). This aspect of perspective is a component of verb connotation frames, and we extend it to nouns and adjectives in our lexicon through Politeness.
(3) Impact and effect have been studied in verb connotation frames and other verb lexica (Choi and Wiebe, 2014), capturing notions of implicit benefit or harm on the arguments of the verb. We extend this idea to nouns and adjectives by observing that while they do not directly have arguments, nouns (e.g. "democracy") often impact society and adjectives (e.g. "sick") impact the nouns they modify. Thus, we define Impact in this way.
(4) Factuality captures whether words correspond to real-world objects or attributes, following the sense of Saurí and Pustejovsky (2009). Klenner and Clematide (2016) argue that the factuality of events is crucial for understanding sentiment inferences. Building upon this, Klenner et al. (2017) use factuality as a key component of German verb connotations and apply those connotations to analyze stance and sides in German Facebook posts. Imagery, as an "indicator of abstraction" (Whissell, 2009), also models an attribute similar to event factuality for all parts of speech. Given its importance, we include a notion of Factuality for nouns and adjectives as an aspect of connotation.
(5) Sentiment polarity has been used to convey overall connotations since the early work on connotation lexica (Feng et al., 2011, 2013; Kang et al., 2014). As such, we deem it important to include this polarity in our lexicon.
(6) Emotional Associations for words can be strong, persisting long after they are formed and improving the recall of memories triggered by those words (Rubin, 2006). Emotions are also impacted when people process non-literal meaning (Colston and Katz, 2005). To fully understand what a piece of text is trying to convey, it is important to understand what emotional associations exist in the text. For example, news headlines often aim to evoke strong emotions in their readers (Mohammad and Turney, 2013). To capture this, we include Emotional Association as an aspect of connotation.

Labeling Connotations
We use distant labeling to build our lexicon, since complete manual annotation of a lexical resource is a lengthy and costly process. Although crowdsourcing can lessen these burdens, the results are often unreliable with low inter-annotator agreement and, for this reason, many lexical resources are automatically created (Mohammad, 2012; Mohammad et al., 2013b; Kang et al., 2014). Following these researchers, we automatically generate our lexicon by combining several existing lexica.
To generate our lexicon, we map dimensions from existing lexica to connotation aspects (see Table 1). We use dimensions from the Harvard General Inquirer (Stone and Hunt, 1963) for Social Value, Politeness, and Impact. For Factuality we map the real-valued 'Imagery' dimension, Imagery(w), from the revised Dictionary of Affect in Language (Whissell, 2009) into distinct classes. For Sentiment we directly use the polarity v from Connotation WordNet (Kang et al., 2014) and for Emotional Association we use the eight Plutchik emotions (Plutchik, 2001) from the NRC Emotion Lexicon (Mohammad and Turney, 2013) (see appendix B for full rules).
The labels are word-sense-independent, following other automatically generated lexica, such as the Sentiment140 lexicon (Mohammad et al., 2013b), which do not treat word sense. In addition, sense-level annotations are not available for all lexica in our distant labeling method, and therefore sense-level connotations would require both extensive manual annotation and automated word-sense disambiguation, introducing cost and additional noise. As a result, we use sense-level distinctions (e.g., in the Harvard General Inquirer) when available and combine the labels for an aspect across senses to obtain the final connotation aspect label. These aggregate aspect labels represent a word's connotative potential, rather than its exact value. Our resulting lexicon has 7,578 words fully labeled for all aspects, with an additional ~93k words labeled only for some aspects (e.g., only Sentiment), resulting in 100,176 words total. For each non-emotion aspect, we have a label l ∈ {-1, 0, 1}. For Emotional Association, each of the eight emotions has label l ∈ {0, 1}.
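The cross-sense aggregation can be sketched as follows. The specific combination rule shown (a non-neutral label is kept only when the senses do not conflict; conflicting senses fall back to neutral) is an illustrative assumption, not the paper's exact procedure:

```python
def aggregate_senses(sense_labels):
    """Combine per-sense labels in {-1, 0, 1} for one aspect into a single
    word-level label representing the word's connotative potential.
    Hypothetical rule: a unanimous non-neutral polarity survives;
    conflicting polarities across senses collapse to neutral (0)."""
    non_neutral = {l for l in sense_labels if l != 0}
    if len(non_neutral) == 1:
        return non_neutral.pop()
    return 0
```

Under this rule, a word whose senses are labeled [+1, 0, +1] receives +1, while mixed senses such as [+1, -1] yield 0.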

We find that many aspects exhibit uneven class distributions (e.g., 10.5% of words are polite and only 1% are impolite) (see Table 2). For emotions, we calculate the class distribution using the number of fully-labeled words with at least one associated emotion (1,373 words, or 18%). For these 1,373 words, the average number of associated emotions is ~2. Our distributions are similar to previous work on verb connotations, where distributions range from 1.4% to 20.2% for the smallest class (Rashkin et al., 2016).

Human Evaluation
We evaluate the quality of the lexicon by creating a gold-labeled set and comparing the labels created with distant supervision against the human labels. We ask nine NLP researchers to annotate 350 words (175 nouns, 175 adjectives) for Social Value, Politeness, Impact, and Factuality. We do not annotate Sentiment or Emotional Association, since these labels come directly from existing lexica.
Annotators are given a word w, along with its definitions (for all senses) and related words, and annotate connotation independent of word sense. This setup mimics the input to the representation learning models in §4. The average Fleiss' κ across nouns and adjectives is 0.60 (see Table 3), indicating substantial agreement. We take the majority vote as the final annotator label.
We find that the distantly labeled lexicon agrees with human annotators the majority of the time (on average 64.2%, or Cohen's κ = 0.368 (Cohen, 1988)). If we consider non-conflicting value agreement (NC), the lexicon agreement with humans rises to 90%, where NC agreement is defined as follows: the pairs (+, neutral) and (-, neutral) agree, but (+, -) does not. This shows that the lexicon and humans rarely select opposite values and instead disagree on neutral vs. non-neutral.
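The two agreement measures can be sketched as follows (a minimal illustration, where `lex` and `gold` are hypothetical maps from words to labels in {-1, 0, 1}):

```python
def agreement(lex, gold, non_conflicting=False):
    """Fraction of words on which the distant lexicon agrees with humans.
    Under NC (non-conflicting) agreement, any pair involving neutral (0)
    counts as agreement; only opposite polarities (+, -) conflict."""
    hits = 0
    for w in gold:
        a, b = lex[w], gold[w]
        if a == b or (non_conflicting and (a == 0 or b == 0)):
            hits += 1
    return hits / len(gold)
```

For example, if the lexicon says neutral where a human says positive, strict agreement counts a miss while NC agreement counts a hit; only a (+, -) pair misses under both measures.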
Looking closer at disagreements between neutral and non-neutral, we see that most result from human annotators selecting a non-neutral label. That is, the lexicon makes fewer distinctions between neutral and non-neutral than humans; humans select a non-neutral value 68% of the time, compared to 56% in the lexicon. Despite this tendency towards neutral, the lexicon aligns with human judgments, agreeing the majority of the time and rarely providing a value opposite to humans.

Synonym Analysis
We also evaluate the ability of our lexicon to capture subtle semantic differences between words using lexical paraphrases (synonyms). In the paraphrase literature, it has been argued that paraphrases actually differ in many ways, including in connotations (Bhagat and Hovy, 2013). In fact, Clark (1992) proposes that absolute synonymy between linguistic forms does not exist. With this in mind, we hypothesize that our connotation lexicon should differentiate between lexical paraphrases.
To test this, we select synonym paraphrase pairs from lexical PPDB (Pavlick et al., 2015) where one element in the pair is in the WordNet synset of the other. We find that of the 2,216 resulting pairs where both words are in our lexicon, 74.3% have connotations that differ in some aspect. Many words agree on Sentiment (67.5% the same), following the intuition that two synonyms likely have the same sentiment but differ in more fine-grained ways. Other pairs agree on Politeness (76.1% the same), resulting from the extreme class imbalance for this aspect (88.5% neutral). However, the lexicon does still capture differences along these dimensions, for example in terms of formality (e.g., "gentleman" vs. "man").
Looking more closely, we find that many times agreements along a particular dimension accurately represent synonyms that differ along other dimensions. For example, "weariness" and "fatigue" both have a negative Impact, but "weariness" is associated with sadness and "fatigue" is not.
On the other hand, the majority of differences across almost all aspects (79% on average) are between neutral and non-neutral polarities within a synonym pair, for example, between "position" (possibly tangible) and "post" (tangible), from Factuality. This confirms the intuition that synonyms often do not have opposing connotation values, although examples do exist (e.g., the Social Value of "relentless" vs. "persistent") (see Table 4). As a whole, our analysis confirms our hypothesis and the claims of Clark (1992) about synonymy.
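The pairwise check behind this analysis can be sketched as follows (the aspect names and the {-1, 0, 1} label encoding mirror the lexicon; the function itself is an illustrative reconstruction, not the authors' code):

```python
def differing_aspects(conn1, conn2):
    """Compare two synonyms' connotation dicts (aspect -> label in {-1,0,1})
    and report, per differing aspect, whether the difference is between
    opposite polarities or between neutral and non-neutral."""
    diffs = {}
    for aspect in conn1:
        a, b = conn1[aspect], conn2[aspect]
        if a != b:
            diffs[aspect] = "opposite" if a * b < 0 else "neutral-vs-nonneutral"
    return diffs
```

Applied to a pair like "weariness"/"fatigue", such a check would flag only the sadness dimension while Impact agrees, matching the example above.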

Methods
Using our connotation lexicon, we train a dense connotation feature representation for words from all parts of speech. We combine three lexica (our lexicon and two verb lexica) into a single vector space, making connotations easier to use as model input and providing a single representation method for the connotations of any word.
We design a novel multi-task learning model that jointly predicts all of the connotation labels for a word w from a learned representation v_w. Each task is to predict the label for exactly one connotation aspect: the 6 aspects in §3.2 for nouns and adjectives and the 11 aspects in CFs+ (connotation frames and their extension to power and agency) for verbs (Rashkin et al., 2016; Sap et al., 2017).
To learn a representation for w we encode dictionary definitions of the word w and words related to w (e.g., synonyms, hypernyms) in a single vector, which we then use to predict connotation labels. We use definitions and related words since linguists have argued that definitions and related words convey a word's meaning (Guralnik, 1958).
Let w be a word with part of speech t. The input to the connotation encoding model is then: (1) a set of dictionary definitions D_{w_t} and (2) a set of words related to w_t, R_{w_t}. We use multiple definitions to capture multiple senses of w_t. To emphasize more prevalent senses of w_t, we use similar repeated definitions for the same sense, collected from multiple sources. From D_{w_t} and R_{w_t}, the encoder produces a connotation feature embedding v_{w_t} ∈ R^d of dimension d = 300. Then we use v_{w_t} to predict the label ℓ_a for each connotation aspect a (see Figure 2).

Encoding Models
For a word w_t, the input to our encoder is [d_1; ...; d_N] ∈ R^{N×d_in}, the sequence of fixed pre-trained token embeddings for the concatenated definitions in D_{w_t}. We then take as our embedding the normalized final hidden state of a BiLSTM, a standard architecture for text encoding: v_{w_t} = h_{w_t} / ||h_{w_t}||, where h_{w_t} ∈ R^{2H} is the concatenation of the last forward and backward hidden states (model CE).
As a variation of our model, we apply scaled dot-product attention (Vaswani et al., 2017) over the related words R_{w_t}, using h_{w_t} as the attention query. We add the attention result to h_{w_t} before normalizing to obtain v_{w_t} (model CE+R).
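The CE+R fusion step can be sketched as a toy, dependency-free function; the add-then-normalize combination follows the description above, while the pure-list implementation and dimensions are illustrative:

```python
import math

def attend_and_fuse(h, related_vecs):
    """CE+R sketch: scaled dot-product attention over related-word
    embeddings with the encoder state h as query; the attention output
    is added to h and the sum is length-normalized."""
    d = len(h)
    # Scaled dot-product scores between the query h and each related word.
    scores = [sum(hi * ri for hi, ri in zip(h, r)) / math.sqrt(d)
              for r in related_vecs]
    # Softmax over the related words (shifted for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Attention context: weighted sum of related-word embeddings.
    ctx = [sum(wt * r[i] for wt, r in zip(weights, related_vecs))
           for i in range(d)]
    # Residual add, then L2-normalize to obtain the final embedding.
    fused = [hi + ci for hi, ci in zip(h, ctx)]
    norm = math.sqrt(sum(x * x for x in fused)) or 1.0
    return [x / norm for x in fused]
```

A related word aligned with the query pulls the fused embedding toward itself, while the normalization keeps all connotation embeddings on the unit sphere.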

Label Classifier
For each connotation aspect, we train a separate linear layer plus softmax with the input [v_{w_t}; e_{w_t}], where e_{w_t} is the pre-trained embedding for w_t. For the non-emotion aspects, the layer has three target classes {-1, 0, 1} for most aspects (four classes for the 'power' and 'agency' verb aspects) and we predict the label with the highest output probability. For emotions, we do multi-label classification by thresholding the output probabilities for each emotion dimension with a fixed θ ∈ R. We include e_{w_t} in the predictor input to encourage v_{w_t} to model connotation information that is complementary to the information present in pre-trained embeddings.
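The multi-label emotion decision can be sketched as follows (the threshold value shown is illustrative, not the tuned θ):

```python
# The eight Plutchik emotions used for the Emo aspect.
EMOTIONS = ["anticipation", "joy", "trust", "fear",
            "surprise", "sadness", "disgust", "anger"]

def predict_emotions(probs, theta=0.5):
    """Multi-label prediction: keep every emotion whose output
    probability reaches the fixed threshold theta."""
    return [emo for emo, p in zip(EMOTIONS, probs) if p >= theta]
```

Unlike the argmax used for the non-emotion aspects, this thresholding lets a word carry zero, one, or several emotional associations at once.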

Learning
For each non-emotion connotation aspect a (e.g., Impact) we calculate the weighted cross-entropy loss L_a. For Emo we calculate the one-versus-all cross-entropy loss on each of the eight emotions, L_{Emo_i} for 1 ≤ i ≤ 8, and their sum is L_{Emo}. In our multi-task joint learning framework (J), we minimize the weighted sum of L_a across all connotation aspects. We also experiment with training a separate encoding model for each connotation aspect a that minimizes L_a (S).
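A minimal sketch of the objective follows; the per-class and per-aspect weights are illustrative placeholders, since the paper does not report the weight values:

```python
import math

def weighted_ce(probs, gold, class_weights):
    """Weighted cross-entropy for one example of one aspect:
    -w_gold * log p(gold), where class weights counter class imbalance."""
    return -class_weights[gold] * math.log(probs[gold])

def joint_loss(aspect_losses, aspect_weights):
    """Model J objective sketch: weighted sum of per-aspect losses.
    The Emo term is itself the sum of eight one-vs-all losses."""
    return sum(aspect_weights[a] * l for a, l in aspect_losses.items())
```

The separate-model variant (S) simply minimizes a single `weighted_ce`-style term per aspect instead of this sum.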

Baselines and Models
For each baseline, we implement one classifier per connotation aspect, or, for Emo, one classifier for each emotion. Following Rashkin et al. (2016) we implement a Logistic Regression classifier trained on the 300-dimensional pre-trained word embedding for w using the standard L-BFGS optimization algorithm and sample re-weighting (LR). We also implement a majority class baseline (Maj).
We present three variations of our model: (i) trained jointly for all parts of speech and all connotation aspects (CE(J)), (ii) trained on each aspect individually with related word attention (CE+R(S)), and (iii) trained jointly on all parts of speech and all connotation aspects with related word attention (CE+R(J)).

Data and Parameters
For nouns and adjectives, we train using the aspects described in §3 (6 aspects). For verbs, we train on 9 aspects from Rashkin et al. (2016) (11 aspects total). We split our connotation lexicon (§3) into train (60%), development (20%), and test (20%). For the verb CFs+, we preserve the originally published data splits where possible. We move words only to ensure that all parts of speech for a word are in the same split (e.g., 'evil' as both a noun and an adjective is in the dev set). We collect dictionary definitions and related words from all seven dictionaries available through the Wordnik API (https://www.wordnik.com/). These are extracted for each word and part-of-speech pair. We preprocess definitions by removing stopwords, punctuation, and the word itself. We use pre-trained ConceptNet Numberbatch embeddings (Speer et al., 2016).

Results Label Prediction
We present results on the connotation prediction task to check the quality of our representation learning system. Given dictionary definitions and related words, we predict the labels from our lexicon (§3) and CFs+ (see Table 5).
First, we observe that joint learning (models (J)) improves over training representations individually (CE+R(S)). We hypothesize that joint learning provides regularization across all aspects. Second, we compare joint learning with (CE+R(J)) and without (CE(J)) related words to the strong LR baseline. We find that the model with related words (CE+R(J)) is statistically indistinguishable from the baseline (for p ≤ 0.05, using an approximate randomization test). In contrast, our model without related words (CE(J)) is significantly worse than the LR baseline for one aspect (see Appendix D for aspect-level results). Thus we conclude that related words are beneficial for learning connotations. Overall, our approach provides a single unified feature representation for the lexical connotations of all parts of speech, without any loss in label prediction performance. Specifically, our best representation learning model (CE+R(J)) has label prediction performance comparable to a strong baseline (LR) that does not learn any kind of representation. We use CE+R(J) to generate the connotation embeddings used in all further evaluation.

Observations
Our connotation representation learning model presents several advantages. Since the model uses dictionary definitions, we can generate representations for slang words (e.g., "gucci" meaning "really good"), where knowledge-base entries (e.g., in ConceptNet) do not capture the slang meaning. For example, in our connotation embedding space, the nearest neighbors of "gucci" include words related to the slang connotations (e.g., "beneficial", which has positive impact and is not factual), whereas neighbors in a pre-trained word embedding space are specific to the fashion meaning and connotations (e.g., "buy", "italy", "textile"). Along with slang, our model can also generate representations for new or rare words (e.g., "merchantile") that do not have a pre-trained word representation.

Intrinsic Evaluation
To evaluate the connotation embedding space, we look at the 50 nearest neighbors, by Euclidean distance, of every word in our training and development sets. We find that neighbors in the connotation embedding space are more closely related based on the connotation label than in the pre-trained embedding space.
Looking at example nearest neighbors (Table 6), we see that nearest neighbors in the pre-trained embedding space include antonyms (e.g., "inability" is close to "ability") and topically related words (e.g., "merry" is close to "wives"), while in the connotation space, neighbors often share connotation labels even though they may be topically or denotatively unrelated. For example, "slug" (noun) is close to many impolite but otherwise unrelated words (e.g., "shove", "murder", "scum") in the connotation embedding space, while in the pre-trained space "slug" is close to topically related (e.g., "bug") but polite words. Therefore, we can see that words with similar connotations are placed closer together than in the pre-trained semantic space.
To quantify the semantic differences, we measure neighbor-cluster connotation label purity. Specifically, for each connotation aspect a (e.g., Social Value) and each non-neutral label c (e.g., valuable (+)), we calculate r_c^{a(C)}: the average ratio of neighbors with label c to neighbors with the opposite label -c, over the nearest-neighbor sets of all words with label c for aspect a. We compare it against the same ratio for the nearest neighbors selected using the same pre-trained word embeddings as in §4.2, denoted r_c^{a(P)}. We find that across connotation aspects, these ratios are higher for the learned connotation embeddings than for the pre-trained embeddings (see Table 7). This shows the connotation embeddings reshape the pre-trained semantic space.
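The purity ratio can be sketched as follows; note that the denominator (neighbors carrying the opposite label -c) is an assumed reading of the metric, and the neighbor sets here are toy inputs:

```python
def purity_ratio(neighbors, labels, c):
    """For every word labeled c, compute the ratio of its nearest
    neighbors labeled c to those labeled -c, and average over words.
    `neighbors` maps a word to its neighbor list; `labels` maps words
    to labels in {-1, 0, 1}. Assumed reading of the metric, not the
    authors' exact code."""
    ratios = []
    for word, lab in labels.items():
        if lab != c:
            continue
        same = sum(1 for n in neighbors[word] if labels.get(n) == c)
        opp = sum(1 for n in neighbors[word] if labels.get(n) == -c)
        ratios.append(same / max(opp, 1))  # avoid division by zero
    return sum(ratios) / len(ratios)
```

A higher value means a word's neighborhood is dominated by its own connotation class rather than the opposing one.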

Extrinsic Evaluation
We further evaluate our connotation embeddings using the stance detection task, hypothesizing they will lead to improvement. Given a text on a topic (e.g., "gun control"), the task is to predict the stance (pro/con/neutral) towards the topic (see Figure 1).

Methods and Experiments Models
As a baseline architecture, we implement the bidirectional conditional encoding model (Augenstein et al., 2016). This model encodes a text as h_T with a BiLSTM, conditioned on a separate topic encoding h_P, and predicts stance from h_T (BiC). We include connotation embeddings through scaled dot-product attention over the noun, adjective, and verb embeddings from the text, with h_P as the query (see Figure 3). We experiment with three types of embeddings in the attention: pre-trained word embeddings (BiC+W), our connotation embeddings (BiC+C), and randomly initialized embeddings (BiC+R), as a baseline to measure the importance of attention. We also implement a Bag-of-Word-Vectors baseline (BoWV), encoding the text and topic as separate BoW vectors and passing their concatenation to a Logistic Regression classifier.

Data and Parameters
We use the Internet Argument Corpus (Abbott et al., 2016): ~59k posts from online debate forums. Of the 16 total topics, four are large (L, with > 7k examples each), five are medium (M, with ~2k examples each), and seven are small (S, with 30-300 examples each).
Since not every text will take a position on every topic, we automatically generate 'neutral' examples for the data. To do this, we sample a pro/con example and then assign it a new (different) topic, randomly sampled from the original topic distribution. We split the data into train, development, and test such that no posts by one author are in multiple splits and preprocess the data by removing stopwords and punctuation and lowercasing.
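The neutral-example generation can be sketched as follows (the example tuple layout and resampling loop are illustrative; the function assumes at least two distinct topics):

```python
import random

def make_neutral(examples, topics, rng=random.Random(0)):
    """Generate one 'neutral' example: sample a pro/con example and
    reassign it a different topic drawn from the original topic
    distribution (pass `topics` with repeats to reflect frequencies)."""
    text, topic, _stance = rng.choice(examples)
    new_topic = topic
    while new_topic == topic:  # resample until the topic changes
        new_topic = rng.choice(topics)
    return (text, new_topic, "neutral")
```

Because the post was written about a different topic, the model sees text that takes no position on its (new) assigned topic.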
Stance is topic-dependent and as a result, models require numerous training examples for each individual topic. However, many examples are not always available for every topic. Since there are hundreds of thousands of potential topics, the vast majority of which will have very few examples, our goal is to build models that exhibit strong performance across all topics, regardless of size.

Therefore, we experiment with three data scenarios: (i) training and evaluating using all the data (All Data), (ii) truncating each topic in training to M size (at most 2k examples) and evaluating using all data (Trunc Train), and (iii) truncating each topic to M size in both training and evaluation (Trunc All), so that topics have the same frequency for both training and evaluation.

Results
We find that when using all of the training data, the pre-trained embeddings and our connotation embeddings perform comparably (significance level p = 0.3). Note that both the connotation and pre-trained embeddings outperform the random embeddings in all scenarios, showing that the architecture difference is not the only reason for improvement when adding embeddings. We find that in both scenarios where data is limited per topic (Trunc Train and Trunc All), the connotation embeddings improve significantly over the pre-trained word embeddings. In fact, the same trend is visible across varying numbers of training examples (see Figure 4). Our results demonstrate that the connotation information is useful for detecting stance when data is limited.
We find further evidence that the connotation embeddings (BiC+C) make the model robust to loss of training data when we look at the results at the individual topic level. Namely, in setting Trunc Train, BiC+C has a significant improvement (with p < 0.05) over BiC+W on six topics, including four of the M and truncated L topics. In fact, for the four M/L topics, the average per-topic decrease in F1 for BiC+C is 1/4 that of BiC+W. These per-topic results further highlight the robustness of BiC+C when training data is restricted.
We conclude that connotation embeddings improve stance performance when training data is limited, suggesting they can be used in future work that generalizes stance models to topics with no training data (i.e. most topics).

Conclusion
We create a new lexicon with six new connotation aspects for nouns and adjectives that aligns well with human judgments. We also show that the lexicon confirms hypotheses about semantic divergences between synonyms. We then use our lexicon to train a unified connotation representation for words from all parts of speech, yielding an embedding space that captures more connotative information than pre-trained word embeddings.
We evaluate our connotation representations on stance detection. Since the stance detection tasks encountered in real life concern a very large number of topics, zero-shot and few-shot stance detection are important subtasks. We show that models using our connotation representations are well suited for few-shot stance detection and may also generalize well to zero-shot settings.
In future work, we plan to explore the relationships between connotations, context, and word sense, as well as adapting our methods to learn multi-lingual connotation representations that accurately capture cultural and linguistic variations.

A Overview
The data and software are provided as supplementary material here: https://github.com/connotationembeddingteam/connotation-embedding.

B Connotation Labeling
We construct labels for our connotation lexicon (§3) using categories from the following existing resources: HGI - the Harvard General Inquirer (Stone and Hunt, 1963), DAL - the revised Dictionary of Affect in Language (Whissell, 2009), CWN - Connotation WordNet (Kang et al., 2014), and NRCEmoLex - the NRC Emotion Lexicon (Mohammad and Turney, 2013). The HGI consists of 183 psycho-sociological categories. Each lexical entry (~11k total) is tagged with a non-zero number of categories. Different senses (noted through brief definitions) and parts of speech for the same word have separate entries. The available categories include valence (i.e., positive and negative), words related to a particular entity or social structure (e.g., institutions, communication), and value judgements (e.g., concern with respect).
The DAL consists of ~8k words with scores for 3 categories: pleasantness, activation, and imagery. Word entries include inflection but do not explicitly mark part-of-speech. CWN is a lexicon of connotation polarity scores (ranging from 0 to 1) for ~180k words, explicitly marked for part of speech. Finally, NRCEmoLex consists of word entries marked for any number of the eight Plutchik emotions (anticipation, joy, trust, fear, surprise, sadness, disgust, anger) as well as positive and negative sentiment. Two versions of the lexicon are available: with and without sense-level distinctions. Neither version includes explicit information on part-of-speech, and so we infer part-of-speech using the words provided to distinguish different senses.
We provide the complete distant labeling rules for each of the connotation aspects in Table 9 (see http://www.wjh.harvard.edu/~inquirer/homecat.htm for complete information on abbreviations). Within each connotation aspect, we determine the connotation polarity using the additional categories: Positiv, Negativ, Strong, Weak, Hostile, Submit, Active, and Power.
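As a rough illustration, the distant labeling step can be sketched as follows. The category-to-polarity mapping below is illustrative only; it does not reproduce the actual per-aspect rules in Table 9.

```python
# Illustrative sketch of distant labeling from HGI-style category tags.
# The mapping below is an assumption for demonstration, not Table 9.
ASPECT_POLARITY = {
    "Positiv": 1, "Strong": 1, "Active": 1, "Power": 1,
    "Negativ": -1, "Weak": -1, "Hostile": -1, "Submit": -1,
}

def label_polarity(categories):
    """Assign a polarity label from a word's HGI category tags.

    Returns +1 (positive), -1 (negative), or 0 (neutral or
    conflicting evidence).
    """
    score = sum(ASPECT_POLARITY.get(c, 0) for c in categories)
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0
```

For example, under this illustrative mapping a word tagged Positiv and Strong would be labeled positive, while a word tagged both Positiv and Negativ would fall back to neutral.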

[Table 9 (columns: Aspect, General Inquirer Categories) lists the distant labeling rules for each connotation aspect. In these rules, x denotes either the Imagery score or the sentiment score, each normalized to [-1, 1].]
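The normalization to [-1, 1] referenced in the table notes can be implemented as a linear rescaling; we assume simple min-max normalization here, since the exact formula is not spelled out in the text.

```python
def normalize_to_signed_unit(x, lo, hi):
    """Linearly rescale x from [lo, hi] to [-1, 1].

    Assumed min-max normalization; used e.g. to map CWN sentiment
    scores from [0, 1] onto [-1, 1].
    """
    return 2.0 * (x - lo) / (hi - lo) - 1.0
```

Under this assumption, a CWN sentiment score of 0.5 maps to 0, and the endpoints 0 and 1 map to -1 and 1 respectively.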

C Analysis of the Connotation Lexicon
In this section, we provide further analysis of the connotation lexicon, illustrating its content and properties.

C.1 Human Evaluation
We show the instructions provided to annotators for the manual labeling of samples from the connotation lexicon in §3.3 (see Figures 5 and 6). We include Cohen's kappa score for agreement between the lexicon and human annotators for individual connotation aspects in Table 10.
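For reference, the lexicon-annotator agreement reported in Table 10 uses Cohen's kappa, which can be computed as in this minimal sketch (function name is ours):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences over the same items.

    p_o is the observed agreement; p_e is the agreement expected by
    chance from each annotator's marginal label distribution.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)
```

In practice one would use a library implementation (e.g., scikit-learn's `cohen_kappa_score`); the sketch is just to make the agreement statistic concrete.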

C.2 Gender Bias Analysis
Connotations have been used to study gender bias in movie scripts (Sap et al., 2017) and online media (Field et al., 2019). Here we use our connotation lexicon to analyze gender bias in two new domains: celebrity news (Celebrity) and student reviews of computer science professors (Professors).
We use existing datasets for these domains and the accompanying methodology of Chang and McKeown (2019) to infer word-level gender associations. Then, for the gender-associated words that are in our lexicon, for each connotation aspect and domain, we examine the percentage of positive and negative polarity words and find that these quantify known trends in gender-biased portrayals.
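The per-aspect percentage computation can be sketched as follows, assuming the lexicon for one aspect is available as a word-to-polarity dictionary (the names are hypothetical):

```python
def polarity_percentages(words, lexicon):
    """Percentage of positive- and negative-polarity words among a
    list of gender-associated words, for one connotation aspect.

    `lexicon` maps word -> polarity (+1 or -1); words not in the
    lexicon are skipped, as in the analysis described above.
    """
    polarities = [lexicon[w] for w in words if w in lexicon]
    if not polarities:
        return 0.0, 0.0
    n = len(polarities)
    pos = 100.0 * sum(p == 1 for p in polarities) / n
    neg = 100.0 * sum(p == -1 for p in polarities) / n
    return pos, neg
```

Comparing these percentages between the female- and male-associated word lists for each aspect yields the trends discussed below.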
In the Celebrity domain, Factuality highlights the tendency of news media to focus on physical characteristics of female celebrities (Selby, 2014). More words with positive Factuality polarity (tangible concepts and attributes) are associated with women and more words with negative polarity (abstract concepts and attributes) are associated with men. For example, women are described as "beautiful" and "slim", while men are described as "political" and "presidential". In fact, even many of the female-associated words labeled not tangible still align with physical attributes (e.g., "chic"), further emphasizing the biased portrayal.
We also find that in the Professors domain, patterns in Social Value and Impact agree with the observations of Chang and McKeown (2019) and with social science literature that finds male teachers are praised more than female teachers for being experts (both socially valuable and positively impacting society). For example, men are associated with positive Social Value (socially valuable) words such as "knowledge" and "experience", while women are relatively less often associated with the same type of words. Finally, in the Celebrity domain we find our lexicon reflects the media coverage of recent sexual harassment allegations against male celebrities. Namely, women are associated with more positive Social Value words and men are associated with many more negative Social Value words (see Table 11). Overall, our results quantitatively validate previous observations and known patterns of gender bias.

For reference, we reproduce the text of the annotation instructions for nouns (Figure 5) below:

Thanks for participating in this HIT! You will read several definitions for a NOUN, as well as a list of related words. Then you will label the connotations of that word.

A couple of notes on labeling:

Please label all connotations based on what society as a whole believes, NOT based on your own personal beliefs. Connotations can be subjective, but we are interested in general connotations that hold for most people.

Consider connotations as word-sense INDEPENDENT. If a word has multiple senses, please consider the connotations of ALL senses and label the most common connotations.

The Task: For a word X, read definitions of X and words related to X, then label the following connotations of X:

1. Social Value (NOTE: here X = the person/thing X refers to.)
Is X valued by society?
For example: "power" and "beauty" would be Socially Valuable while "illness" and "poverty" would be Not Socially Valuable.
For people, "social value" is equivalent to social status. For example: "boss" and "doctor" would be Socially Valuable while "terrorist" and "janitor" would be Not Socially Valuable.

2. Politeness
Is X a polite term?
Polite: words that make the receiver feel good (Lakoff), as well as words one would use in a socially formal setting and politically correct terms. For example: "father" and "homeless person" would be Polite, while "daddy" and "bum" would be Impolite.
Impolite: words that make the receiver feel bad, as well as words for socially informal settings, curse words, and slang. For example: "bro" and "shit" would both be Impolite.

D Connotation Representations
D.1 Connotation Aspects
For nouns and adjectives we use six aspects: Social Value, Politeness, Impact, Factuality, Sentiment, and Emotional Association. For verbs we use 11 aspects: perspective of the writer on the theme P(wt) and agent P(wa), perspective of the agent on the theme P(at), effect on the theme E(t) and agent E(a), value of the theme V(t) and agent V(a), mental state of the theme S(t) and agent S(a), power, and agency.

D.2 Hyperparameters
All models are trained with hidden size H = 150, number of definition words N = 42, number of related words |R_{w_t}| = 20, and dropout of 0.5 to prevent overfitting. For emotion prediction we set the threshold θ = 0.5. We use ConceptNet Numberbatch embeddings (Speer et al., 2016) because we find empirically that these outperform other pre-trained embeddings (GloVe and dependency-based embeddings (Levy and Goldberg, 2014)) on the development set.
We tune our only hyperparameters on the development set: the weights λ_a for the contribution of each loss term L_a to the total loss Σ_a λ_a L_a (see §4.1.3). We experiment with 10 manually selected weight combinations, where each λ_a ∈ (0, 5). We report the optimal weights in Table 12.
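The weighted multi-task objective Σ_a λ_a L_a can be sketched in plain Python; the weights shown are illustrative, not the tuned values from Table 12.

```python
def total_loss(aspect_losses, weights):
    """Weighted multi-task loss: sum over aspects a of lambda_a * L_a.

    `aspect_losses` maps aspect name -> scalar loss L_a for the batch;
    `weights` maps aspect name -> lambda_a, tuned in (0, 5).
    """
    return sum(weights[a] * aspect_losses[a] for a in aspect_losses)
```

In a framework like PyTorch the same combination would be applied to tensor-valued losses before calling backward; only the scalar arithmetic is shown here.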

D.3 Training
We optimize using Adam (Kingma and Ba, 2014) with learning rate 0.001 and minibatch size 64 for 80 epochs with early stopping. We optimize the parameters W_a, b_a for each noun and adjective aspect a separately from the parameters for each verb aspect a, allowing both to update the parameters of the definition encoder and attention layer.

D.4 Detailed results
We present aspect-level results for the task of connotation label prediction (see Table 13).

E Extrinsic Stance Evaluation

E.1 Dataset Details
We map the topic-stance annotations in the Internet Argument Corpus to individual topics and labels (e.g., 'pro-life' → topic 'abortion' with label 'con'). We show dataset statistics in Table 14: topics in the upper part of the table are large sized, topics in the middle part are medium sized, and topics in the lower part are small sized.
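The annotation-to-topic mapping can be sketched as a lookup table. Only the 'pro-life' example is taken from the text; the second entry is an assumed analogue for illustration.

```python
# Hypothetical fragment of the annotation -> (topic, label) mapping;
# the full table used in the paper is not reproduced here.
STANCE_MAP = {
    "pro-life": ("abortion", "con"),
    "pro-choice": ("abortion", "pro"),  # assumed analogous entry
}

def map_annotation(annotation):
    """Map a raw topic-stance annotation to its (topic, label) pair."""
    return STANCE_MAP[annotation]
```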

E.2 Training Details
We split the data into 60% train, 20% development, and 20% test. We train our models using pre-trained 100-dimensional word embeddings from GloVe (Pennington et al., 2014), as these are comparable to and more time-efficient than larger word embeddings. We use a hidden size of 60, dropout of 0.5, and train for 70 epochs with early stopping on the development set. We optimize using Adam with learning rate 0.001 and minibatch size 64 on the cross-entropy loss.
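A minimal sketch of the 60/20/20 split described above; the seeded shuffle is our assumption, since the paper does not specify the splitting procedure.

```python
import random

def split_dataset(examples, seed=0):
    """Shuffle and split examples into 60% train / 20% dev / 20% test."""
    rng = random.Random(seed)  # fixed seed for reproducibility (assumed)
    examples = list(examples)
    rng.shuffle(examples)
    n = len(examples)
    n_train = int(0.6 * n)
    n_dev = int(0.2 * n)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]
    return train, dev, test
```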

E.3 Topic Stance Analysis
We present a detailed analysis of the results of the models BiC+W and BiC+C on stance detection for individual topics. First, we find that when the models are trained with all of the data (All Data), there are statistically significant differences on only two topics, one of which is very small (see Table 15a). This is further evidence that the models are comparable in this setting. We then find that when trained with truncated training data (see §5.2 for details) (Trunc Train), BiC+C improves over BiC+W on six topics, including four of the medium or truncated large topics (see Table 15b). When trained and evaluated with truncated data (Trunc All), BiC+W and BiC+C have statistically significant improvements over each other on the same number of topics (two each) but BiC+C is significantly better overall (see Table 15c). These results further show that connotations help to learn stance when data is limited.
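The significance test used for these per-topic comparisons is not specified in this excerpt; a paired bootstrap test over per-example correctness is one common choice for comparing two classifiers on the same test set, sketched here under that assumption.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_samples=1000, seed=0):
    """Paired bootstrap test of whether model A outperforms model B.

    `correct_a` and `correct_b` are per-example 0/1 correctness
    indicators on the same test set. Returns the fraction of bootstrap
    resamples in which A does NOT beat B, an approximate p-value for
    the claim "A is more accurate than B".
    """
    rng = random.Random(seed)
    n = len(correct_a)
    not_better = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        acc_a = sum(correct_a[i] for i in idx)
        acc_b = sum(correct_b[i] for i in idx)
        if acc_a <= acc_b:
            not_better += 1
    return not_better / n_samples
```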