WordTies: Measuring Word Associations in Language Models via Constrained Sampling



Introduction
What do you think of when you see a word? Word association is a task where a human participant is shown a cue word, and is asked to quickly list words (formally, responses) that come to mind without thinking (Nelson et al., 2004; De Deyne et al., 2019). These associations provide a way to measure human representations of semantic knowledge (Rodriguez and Merlo, 2020). Similarly, researchers have been mirroring the human word association task on pretrained language models (LMs), as a method for intrinsic evaluations of word embeddings (Thawani et al., 2019) and for measuring and mitigating social biases in language models (Kaneko and Bollegala, 2021; Bommasani et al., 2020). Word associations could thus be used as a proxy for measuring linguistic and commonsense knowledge in language models. Existing approaches that probe word associations in language models (Rodriguez and Merlo, 2020; Kaneko and Bollegala, 2021; Bommasani et al., 2020; May et al., 2019) investigate the word embedding spaces of LMs. Word embeddings are contextualized in LMs, and they are converted to static embeddings for analysis with the help of external corpora or templates, which introduces confounding biases. Meanwhile, associativity is often measured by the cosine similarity between the embeddings of the cue word and the response. A major problem here is that cosine similarity is symmetric, while human word associations are not (Rodriguez and Merlo, 2020).

Our code and data are available at https://github.com/U-Alberta/WordTies.

Figure 1: Overview of the workflow of our word association probing algorithm. The network plotted shows example word associations where the word language is a cue or response. The associations are probed from BERT (Devlin et al., 2019) using the proposed algorithm. The radius of a word circle represents the average frequency of the word being a response for one of the cues. The length of connections represents the relative associative strength between words. Note that the lengths might not be the same for the two directions between the same pair of words.
Instead of investigating embedding spaces, we propose to perform association rule mining on discrete word sequences sampled from LMs with constraints. To the best of our knowledge, this is the first application of association rule mining to the investigation of word associations in distributional semantic models. This novel approach more closely imitates human word association, and allows us to probe language models as a whole, without the use of external inputs. Our algorithm, named WordTies, samples sentences from language models under the constraint that the cue word must appear in the sentence, and uses the conditional probability that a word co-occurs with the cue word in the sample as the associativity score. The workflow of the WordTies algorithm is illustrated in Figure 1. We validate our probing method by measuring the overlap between associations found in LMs by our algorithm and human associations, and by testing whether distance properties of human associations, such as asymmetry, are preserved by our algorithm.
In another part of this work, we attempt to uncover what linguistic and commonsense knowledge and reasoning are involved in the word association process, for both humans and language models. In order to find a plausible cause for a given cue-to-response association, we link the two words simultaneously to a lexical knowledge graph (WordNet; Miller, 1995) and a commonsense knowledge graph (ASCENT++; Nguyen et al., 2021), which leads to new discoveries about word associations.

Human Word Associations
Human word associations exhibit certain intriguing properties, such as stability, asymmetry, and intransitivity (Rodriguez and Merlo, 2020). Stability is the property that different people usually come up with similar associations, which correlates with one definition of commonsense knowledge: that it is shared among most human beings (Sap et al., 2020). This suggests that word associations could potentially be used as a signal for inferring commonsense knowledge. Secondly, some associations are not symmetric, as demonstrated by Rodriguez and Merlo's (2020) example that participants indicate North Korea is more closely associated with China than vice versa. Finally, intransitivity means that associations do not follow the triangle inequality. For example, iPhone is associated with apple and apple is associated with sour, but iPhone is not associated with sour. These two geometric properties indicate that traditional tools for interpreting language models, such as vector norms for word embeddings, will not be sufficient to discover word associations as humans do.
It was previously shown that humans often associate words based on similarity, contrast, and contiguity (Thawani et al., 2019). We further investigated what specific types of semantic knowledge and reasoning, including lexical and commonsense knowledge and reasoning, are involved in human word associations, by breaking down the relations between cue and response word pairs.

Association Norms
Collections of human word associations are called word association norms. We use the data from the English Small World of Words (SWOW; De Deyne et al., 2019) project as our word association norms. In SWOW, up to 100 responses were collected for each of 12,292 cues, along with an association strength computed from the frequency with which a word appears as a top-3 response. This serves as the ground truth when evaluating word associations generated from language models.
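To make the strength computation concrete, below is a minimal sketch. The participant responses and the `association_strengths` helper are invented for illustration; the actual SWOW pipeline involves additional preprocessing and normalization.

```python
from collections import Counter

def association_strengths(top3_responses):
    """Estimate cue -> response strength as the fraction of participants
    who listed the word among their top-3 responses (a simplified reading
    of the SWOW strength; the toy data below is invented)."""
    counts = Counter(w for triple in top3_responses for w in triple)
    n = len(top3_responses)
    return {w: c / n for w, c in counts.items()}

# Four hypothetical participants responding to the cue "language":
responses = [
    ("word", "speech", "english"),
    ("word", "english", "tongue"),
    ("speech", "word", "grammar"),
    ("english", "word", "speech"),
]
strengths = association_strengths(responses)
# "word" appears in all four top-3 lists, so its strength is 1.0
```

Stronger responses are listed by more participants, which is what makes the norms usable as ground truth for ranking-based evaluation.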
Compared to other popular word association norms, for example the University of South Florida norms (USF; Nelson et al., 2004) and the Edinburgh Associative Thesaurus (EAT; Kiss et al., 1973), SWOW is more contemporary and heterogeneous, and includes a much larger number of cues and responses (De Deyne et al., 2019). In the USF study, participants were instructed to list words that are "meaningfully related or strongly associated" to the cue word, while in both SWOW and our analogue for LMs, no such constraints are imposed.

Semantic Knowledge
The two research questions we would like to answer here are: what semantic knowledge do humans rely on to produce word associations, and what kind of reasoning is built on that knowledge for word associations?
We attempt to answer the two questions by finding a possible "reasoning path" for each of the cue-response pairs in the SWOW dataset. Based on the observations of human word associations discussed at the beginning of §2, such as stability and the reliance on similarity, contrast, and contiguity, it is natural to assume that there exists a certain lexical relation between the pair, or that the two words are related by some commonsense knowledge.

Knowledge Graphs WordNet (Miller, 1995) is used as the lexical knowledge graph. It provides relations between senses of English words, such as hypernymy / hyponymy and antonymy. Specifically, the version we choose is English WordNet 2020 (McCrae et al., 2020), a fork of the original Princeton WordNet (Miller, 1995) that accommodates emerging phenomena in the English language and is openly available.
For the commonsense knowledge graph, we use ASCENT++ (Nguyen et al., 2021), which contains over 2 million commonsense relationships for 10,000 concepts collected from a large web corpus. At the time of writing, this is the state-of-the-art commonsense knowledge graph in terms of precision and recall. Relations in ASCENT++ concern properties of general concepts, such as CapableOf and UsedFor.
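Finding a reasoning path amounts to a shortest-path search over the relation edges of the knowledge graphs. The sketch below uses a tiny hand-built graph with invented edges as a stand-in for WordNet and ASCENT++; the `shortest_reasoning_path` helper is ours, not part of either resource.

```python
from collections import deque

# A toy merged knowledge graph standing in for WordNet / ASCENT++.
# Every edge here is invented purely for illustration.
edges = {
    "iphone": [("apple", "MadeBy")],
    "apple": [("fruit", "Hypernym"), ("sour", "HasProperty")],
    "fruit": [("food", "Hypernym")],
}

def shortest_reasoning_path(graph, cue, response):
    """Breadth-first search for the shortest relation path cue -> response."""
    queue = deque([(cue, [])])
    seen = {cue}
    while queue:
        node, path = queue.popleft()
        if node == response:
            return path
        for neighbor, relation in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [relation]))
    return None  # no path found: the pair would be categorized as "Unknown"

path = shortest_reasoning_path(edges, "iphone", "sour")
# -> ["MadeBy", "HasProperty"], a 2-hop reasoning path
```

In the real setting the search would run over both graphs, and the graph (lexical vs. commonsense) containing the shorter path determines the knowledge type assigned to the pair.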

Breakdown of Knowledge Types
If the reasoning path is shorter in the lexical graph, then the cue-response pair is assumed to be more likely to involve lexical knowledge; otherwise, it is assumed to be related to commonsense knowledge. The majority of pairs in the SWOW dataset can be linked to the two knowledge graphs, and the shortest reasoning paths are almost evenly split between lexical and commonsense knowledge. About 20% of the pairs in SWOW have no connection in either of the knowledge graphs (categorized as Unknown in Table 2).

Observations of Reasoning Paths
The reasoning path provides an explanation of the reasoning process behind a word association. The most frequent reasoning paths are provided in Table 1.
The majority of responses can be reached within 3 hops from the cue word, as illustrated in Figure 2. We found that the length of reasoning paths only has a slightly negative correlation with the relative order with which a response comes up in SWOW (reflected by the association strengths in the SWOW dataset), with a Spearman correlation coefficient of −0.083 (p < 0.01).
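The Spearman coefficient reported above can be computed directly from ranks. As a self-contained illustration (the helper functions and toy data are ours, not from the SWOW release), a minimal pure-Python sketch:

```python
def rank(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank variables."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

rho = spearman([1, 2, 2, 3], [3, 5, 4, 9])  # mostly increasing, rho close to 1
```

A small negative rho, as reported for path length vs. response order, means longer reasoning paths only mildly push a response later in the human response list.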

Word Association Mining
The proposed WordTies algorithm finds word associations in a language model by sampling discrete sentences from the language model, with the constraint that the sampled sentence must contain the given cue word (see §3.2 for details). It then applies association rule mining to the sampled sentences, and picks the words that most frequently appear in the sampled sentences as the response words.¹ Intuitively, the language model is asked to "write sentences" with the given cue word. The more likely the LM is to use a word in such sentences, the higher the chance that this word is associated with the cue word by the LM.
More formally, a language model, parametrized by Θ, is a probability distribution P(·; Θ) that assigns a probability P(x; Θ) to any given word sequence x = x_1 x_2 ⋯ x_n. This probability is commonly factorized over prefixes of the sequence, for example in the autoregressive form

P(x; Θ) = ∏_{i=1}^{n} P(x_i | x_1 ⋯ x_{i-1}; Θ).  (1)

Each word pair w_1, w_2 is assigned a score score(w_1 → w_2) that indicates the associative strength with which the response word w_2 is associated with the cue word w_1. Suppose x is a random sequence drawn from the distribution defined by the LM. We then use the following conditional probability as the word association score:

score(w_1 → w_2) = P(w_2 ∈ x | w_1 ∈ x),  (2)

i.e., the conditional probability that, given that the cue word w_1 is in a sentence sampled from the LM, the response word w_2 is also in that sentence.
In practice, the association score is calculated by estimating the expectation

score(w_1 → w_2) = E_x[ 1(w_2 ∈ x) | w_1 ∈ x ],

where 1(·) is the indicator function. This is done by sampling from the LM with the hard constraint that the cue word w_1 is in the sentence, and counting the words that co-occur with w_1 in the sampled sentences. It is computationally infeasible to estimate the score from unconstrained samples, i.e., sampling sentences directly from the LM and discarding the sentences in which the cue word does not appear: word frequencies in common corpora, from which LMs are trained, follow Zipf's law and have a long-tail distribution (Zhao and Marcus, 2012), which means exponentially more samples would be needed for rarer cue words.

¹ For obvious reasons, stop words are excluded.
For each cue word, we pick the words with the highest association scores as the response words, while filtering out stop words. In practice, we use the spaCy (Honnibal et al., 2020) tokenizer from its en_core_web_sm model to tokenize the sampled sentences, and only keep words that exist in WordNet to reduce noise. Readers can refer to Tables 6-8 in the appendix for samples of the mined word associations. In the terminology of the association rule mining literature (e.g., Piatetsky-Shapiro, 1991), the association score we define is the confidence, and the filtering of stop words is equivalent to setting a threshold on the lift.
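The confidence and lift statistics can be estimated from the sampled sentences with simple counting. A minimal sketch, with invented toy samples (real samples would come from the constrained sampler and an unconstrained pool for the marginal):

```python
def confidence_and_lift(samples, cue, word):
    """confidence = P(word in s | cue in s); lift = confidence / P(word in s).
    `samples` is a toy pool of tokenized sentences (sets of words)."""
    with_cue = [s for s in samples if cue in s]
    conf = sum(word in s for s in with_cue) / len(with_cue)
    marginal = sum(word in s for s in samples) / len(samples)
    return conf, conf / marginal

samples = [
    {"language", "model", "the"},
    {"language", "english", "the"},
    {"music", "the"},
    {"the", "model"},
]
conf, lift = confidence_and_lift(samples, "language", "the")
# A stop word like "the" co-occurs with everything: confidence is high,
# but lift is 1, so a lift threshold filters it out.
```

In contrast, a genuinely associated word such as "english" in this toy pool has lower confidence but lift above 1, which is exactly the behavior the lift threshold exploits.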

Constrained Sampling
From Masked LMs Masked LMs (Devlin et al., 2019; Liu et al., 2019), or MLMs in short, are not trained with the traditional language modeling objective of minimizing the negative log-likelihood of training sequences. Instead, they are autoencoder-based de-noising models trained to predict what a masked token should be, as a distribution P_MLM(x_m | x_\m) over the vocabulary, given an input sequence x_\m in which the token at position m is replaced with a mask. Wang and Cho (2019) proved mathematically that a masked LM trained with this different objective still conforms to the definition of a language model described in §3.1, in the sense that it provides a probability for each sequence as a Markov random field. In the Markov random field defined by a masked LM, the tokens of a sequence form a fully-connected graph, and the probability of a sequence is the normalized potential of that graph (the largest clique):

P(x) = φ(x) / Z,

where φ(x) is the potential assigned to the fully-connected clique and Z is the normalizing factor. Although the exact value of Z cannot be tractably computed, it is still possible to sample from the distribution with Markov chain Monte Carlo methods. For example, Wang and Cho (2019) provide a Gibbs sampling algorithm for masked LMs. Starting from a randomly initialized sequence, at each step we choose a random position i, sample a token from the distribution P_MLM(x_i | x_\i), and replace the token at position i with the sampled token. We modified this procedure to impose the hard constraint that the cue word is in the sequence while sampling, by keeping certain tokens fixed as the cue, as shown in Algorithm 1.
Algorithm 1: Sampling from a masked LM with the hard constraint that the cue word must be in the sequence. L_min and L_max control the length of the sampled sequence, and S is the number of MCMC steps. Adapted from Wang and Cho's (2019) algorithm.
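Algorithm 1 runs on a real MLM such as BERT. To illustrate the pinned-position Gibbs idea in a self-contained way, the sketch below substitutes the MLM conditional with an invented toy distribution (`toy_mlm_conditional` and `VOCAB` are made up; a real implementation would query P_MLM via a model's masked-token logits):

```python
import random

random.seed(0)
VOCAB = ["language", "model", "word", "music"]

def toy_mlm_conditional(seq, i):
    """Stand-in for P_MLM(x_i | x_\\i): prefers 'model' when 'language'
    is in the context, uniform otherwise. (Invented, not a real MLM.)"""
    context = seq[:i] + seq[i + 1:]
    if "language" in context:
        return {"model": 0.7, "word": 0.1, "music": 0.1, "language": 0.1}
    return {w: 0.25 for w in VOCAB}

def constrained_gibbs(cue, length=4, steps=200):
    # Pin the cue at position 0 and never resample it: the hard constraint.
    seq = [cue] + ["[MASK]"] * (length - 1)
    for _ in range(steps):
        i = random.randrange(1, length)  # skip the pinned position
        dist = toy_mlm_conditional(seq, i)
        words, weights = zip(*dist.items())
        seq[i] = random.choices(words, weights=weights)[0]
    return seq

sample = constrained_gibbs("language")
# The cue survives every step; the co-occurring words reflect the model.
```

Because the pinned position is simply excluded from resampling, the chain explores the conditional distribution of sequences containing the cue, which is exactly the distribution needed for the score in Eq. 2.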
From Causal LMs Causal LMs factorize the probability of a sequence in an autoregressive way, as described in Eq. 1. Usually, sampling or decoding from a causal LM is also done in an autoregressive fashion, for example generating one token at a time from left to right. However, the conditional probability P(x | c) of a sequence x under a constraint c no longer has this convenient left-to-right structure, which poses a major obstacle for sampling.
Recent practice utilizes the fact that P(x | c) ∝ P(x; Θ) · P(c | x), where P(c | x) is a differentiable classifier for the constraint, and samples from the unnormalized distribution defined by the product of the two distribution functions with variants of Hamiltonian Monte Carlo (Neal, 2011), such as Langevin Monte Carlo (Kumar et al., 2022; Qin et al., 2022). In this Markov chain Monte Carlo process, a randomly initialized text sample is updated over a sufficient number of steps by gradient descent on the negative log probability, with added Gaussian noise.
In our case, we could define the constraint classifier P(c | x) based on the distance (measured in embedding or simplex space) between the cue word and a token in the sequence, as suggested by Kumar et al. (2022). This Langevin-dynamics-based method provides a theoretically plausible way to apply WordTies to causal LMs such as GPT-2 (Radford et al., 2019). However, we have not yet been able to produce good samples from Kumar et al.'s (2022) algorithm with the provided hyperparameters and some tuning, and we leave continuing in this direction as future work.

Evaluation
We evaluate the performance of WordTies as a word association mining algorithm, by calculating the alignment with human associations and the precision of finding asymmetric associations, and comparing to methods from previous work.

Setting
Dataset We execute the experiments on a subset of 3,000 cues in SWOW. The subset of cues is chosen by uniformly sampling without replacement from the set of cues, and is available in the supplementary materials. For the filtering of responses, we use English WordNet 2020 (McCrae et al., 2020) and the stop word list from NLTK (Bird et al., 2009).

Contextualized2Static Following previous work, static word embeddings are obtained by averaging the contextualized embeddings from the first layer of the model for each subtoken of the word and each context.

Pre-trained Models
Vocab Embedding In Rodriguez and Merlo's (2020) recent analysis of word associations in LMs, the authors directly measured the cosine similarity between embeddings in the vocabulary layer without contextualization.

Corpus Only
We directly apply the same algorithm and score (2) as in WordTies to the same corpora, English Wikipedia and BookCorpus (Zhu et al., 2015), that were used to train BERT.

Statistical Tests
Since the WordTies algorithm involves sampling, we introduce statistical tests to make sure that an irrelevant word will not be chosen as a response by chance. Words are sampled from the multinomial distribution defined in Eq. 2, and to conclude that a response is not a noisy word that ends up among the top 50 most probable words by chance, the following null hypothesis needs to be rejected: there exist at least N − 50 words whose probability, as defined in Eq. 2, is significantly lower than that of the chosen word, where N is the size of the vocabulary. For each pair of words, we test the null hypothesis that the probability of the first word is higher than that of the second with a binomial test. In our experiments, most of the words in the top-10 response lists are statistically significant (p < 0.1).
Responses that passed the tests are highlighted in Tables 6-8 in the appendix. Such tests provide a guideline for choosing the number of samples to generate.
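An exact one-sided binomial tail is enough for the pairwise test. A minimal sketch under the simplifying assumption that each constrained sample is an independent "duel" between two candidate words (the counts below are invented):

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): one-sided p-value for observing
    at least k successes under the null hypothesis of rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Toy: out of 100 constrained samples where exactly one of the two candidate
# words appeared, the first candidate appeared 65 times.
p_value = binomial_tail(65, 100)
# A small p-value rejects the null that the two words are equally probable.
```

Larger sample budgets shrink the p-values, which is how the tests translate into a guideline for how many sentences to draw per cue.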

Alignment
We measure how well the word associations produced from LMs by the algorithms align with human associations. Alignment is measured by both precision@k, which reflects the overlap, and Spearman's correlation coefficient (ρ) between the scores from the algorithms and the strengths in SWOW, which indicates whether the LM and humans produce word associations in the same order. The results are shown in Table 3. Our method achieves much better precision@k than the baselines on both BERT and RoBERTa, and comparable results for DistilBERT. It also achieves higher ρ on all three models. This means that associations obtained with WordTies share more similarity with human associations in terms of both word choices and strengths.
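The precision@k metric is a straightforward overlap count. A minimal sketch with invented LM and human response lists:

```python
def precision_at_k(predicted, gold, k):
    """Fraction of the top-k predicted responses that appear anywhere
    in the human (gold) response set."""
    return sum(w in gold for w in predicted[:k]) / k

# Hypothetical ranked LM responses and human responses for one cue:
lm_responses = ["word", "english", "speech", "url", "grammar"]
human_responses = {"word", "speech", "grammar", "tongue"}
p3 = precision_at_k(lm_responses, human_responses, 3)  # 2 of the top 3 overlap
```

Spearman's ρ on the overlapping responses then compares whether the two sides rank those shared responses in the same order.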

Asymmetry
We test whether the association scores produced by WordTies can be used to find asymmetries in word associations, an important feature of human word associations that previous methods fail to accommodate. The level of asymmetry is measured by the ratio between the scores of the two directions of an association:

asym(w_1, w_2) = score(w_1 → w_2) / score(w_2 → w_1).

We evaluate precision by whether a found asymmetric pair has the same direction as in the SWOW dataset. It is meaningless to measure recall, because virtually every pair of words is asymmetric in human associations. For the same reason, precision is only calculated on the overlap between SWOW and the output of WordTies. Additionally, we measure the Spearman's ρ of the asymmetry measure between WordTies and human word associations, to see whether the LM and humans perceive similar levels of asymmetry. See Table 4 for the results. The baseline methods are unable to find asymmetric word associations because cosine similarity is symmetric, and they are therefore not listed for comparison. Meanwhile, WordTies is able to find asymmetric word associations that have the same direction as in SWOW, and there is a positive correlation in the level of asymmetry.
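The asymmetry ratio and the direction check can be sketched in a few lines. The directed scores below are invented, loosely echoing the North Korea / China example from §2:

```python
def asymmetry(score, w1, w2):
    """Ratio of the two directed association scores. Values far from 1
    indicate an asymmetric pair; the larger direction is dominant."""
    return score[(w1, w2)] / score[(w2, w1)]

# Hypothetical directed confidence scores:
scores = {
    ("north korea", "china"): 0.30,
    ("china", "north korea"): 0.05,
}
ratio = asymmetry(scores, "north korea", "china")  # about 6: strongly asymmetric
dominant = ("north korea", "china") if ratio > 1 else ("china", "north korea")
```

Direction precision is then the fraction of asymmetric pairs whose dominant direction matches the dominant direction in SWOW.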

Discussion
Running Time On average, it takes around 12 seconds to generate 1,000 samples from BERT or RoBERTa with Algorithm 1, and 6 seconds for DistilBERT. Time is measured on a single NVIDIA A40 GPU with a batch size of 2048. In our experiments we generated at least 3,000 samples per cue. For other models, the statistical tests described in §3.3.3 provide a framework for estimating the number of samples needed, and hence the running time.

Comparison of Methods
We have already discussed how the symmetric nature of the cosine similarity used in previous methods does not fit well with word association. Beyond that, we suspect there are two other reasons behind the inferior performance of the baseline methods. First, previous methods try to obtain a unified embedding for each word from contextualized models, either by averaging embeddings across different contexts or by simply using the layers before contextualization. Such conversions defeat the purpose of building contextualized models and incur information loss. For example, contextualized BERT embeddings for a polysemous word are distinct enough for accurate word sense disambiguation (Hadiwinoto et al., 2019), while averaging eliminates these distinctions. Second, embeddings from a contextualized model must be computed with a context sentence, which is sampled from an external corpus in the Contextualized2Static method. The choice of corpus or context affects the embeddings, which introduces confounding biases to word association measurements. Conversely, a "pseudo corpus" is generated from the LM in WordTies, similar to a training data extraction attack (Carlini et al., 2021) on the LM. No external factors are involved, so it is certain that we are only examining the LM itself. When we apply the same score as in WordTies to the real corpus used to train the LM, we observe an overlap with human word associations that is larger than for any of the LMs evaluated. This observation hints that, whether or not an LM can overcome reporting bias (Shwartz and Choi, 2020) and extrapolate beyond its training corpus, it still has a gap to close before reaching this upper bound on word associations.
Comparison of Models BERT was trained on English Wikipedia and BookCorpus (Zhu et al., 2015), and achieves the best overlap with human word associations. RoBERTa is a replica of BERT with more carefully selected hyper-parameters and a larger training corpus, which additionally incorporates news, stories, and web content. However, despite its better training settings, it performs worse than BERT on the word association task. In this sense, world knowledge in RoBERTa is not as similar to that of humans, and we suspect this is because of the relevance and quality of the additional training corpus. The sampling process in WordTies is a reflection of the corpus (Carlini et al., 2021), and we observed more URLs and email addresses in the samples from RoBERTa, which are irrelevant to the knowledge involved in word association. DistilBERT is a smaller model trained on the same corpus as BERT and with BERT as the teacher. Embedding-based baselines perform on par with the sampling-based method for DistilBERT, and we conjecture the reason to be that DistilBERT is not as good an MLM in the first place. Sanh et al. (2019) only reported that the model performs as well as BERT on downstream tasks, not on the MLM objective, and few studies have used DistilBERT in MLM-based zero-shot tasks. We now turn to the properties of those associations.

Semantic Knowledge
We observed that LMs are slightly better at associating words by commonsense knowledge than lexically, judging by the precision@k for cue-response pairs broken down by the type of knowledge (Figure 3). This is consistent with the finding that humans use commonsense knowledge more for associations (Table 2).
Reasoning Path Length LMs' ability to find human-like associations is negatively associated with the length of the reasoning path to the response. In other words, the more hops it takes to get from the cue to the response in the KGs, the harder it is for LMs to associate the cue with the response. See Table 5 for the correlation coefficients. Meanwhile, longer reasoning paths only slightly degrade human association strength (§2.2).

Related Work
The study of Rodriguez and Merlo (2020) is the most similar to ours; they concluded that properties of human word associations, discovered in the 1970s (Tversky, 1977; Tversky and Gati, 1978; Tversky and Hutchinson, 1986), still hold in language models. They probed associations by ranking words by the cosine similarity of embeddings in the vocabulary layer, and measured asymmetry with handcrafted templates. Evert and Lapesa (2021) also tested word associations with word embeddings, but they shared our view that it is self-contradictory to obtain decontextualized embeddings from a contextualized LM, and therefore did not extend their study to LMs. Measuring and mitigating social biases in pre-trained LMs, often formulated as measuring associations to a certain set of words, is a more popular task. Associations to words related to social aspects are often measured by the cosine similarity of embeddings aggregated from context sentences (May et al., 2019; Bommasani et al., 2020; Kaneko and Bollegala, 2021). As we have been arguing, cosine similarity is not compatible with the asymmetry of word associations, while our algorithm takes asymmetry into consideration. In some work, biases are also measured via constrained generation, where the constraints (often prompts or templates) are collected from the web (Dhamala et al., 2021) or by crowdsourcing (Nangia et al., 2020). In comparison, our method relies on no external resources, and consequently no confounder is introduced.
Constrained text generation is used to evaluate the commonsense reasoning ability of LMs through other tasks. CommonGen (Lin et al., 2020) is a task where, instead of only one cue word as in our study, multiple words pertaining to commonsense concepts are required to be present in the generated text, as a way to measure how well LMs can link concepts together with commonsense knowledge. In abductive commonsense reasoning (Bhagavatula et al., 2020), LMs are used to complete text when the beginning and ending are given, to test their ability to reason about pre- and post-conditions.
It is considered non-trivial to impose constraints on left-to-right generation for causal LMs. Mostly, recent work (Qin et al., 2022; Dathathri et al., 2020) focuses on constrained (also known as controlled) decoding, the related problem of finding the sequence that maximizes the likelihood, by modifying the original distribution. Prior to the Langevin dynamics algorithms by Qin et al. (2022) and Kumar et al. (2022), Miao et al. (2019) proposed CGMH, a constrained sampling algorithm in discrete space based on Metropolis-Hastings sampling, but it uses a bidirectional causal LM to reduce computation (i.e., the LM also predicts the previous word based on suffixes). More recent causal LMs, such as GPT (Radford et al., 2019; Brown et al., 2020), are uni-directional, and it is therefore not very meaningful to apply CGMH in our work.
In a broader context, it has been an interesting idea to explain the behavior of neural networks by optimizing over the input. Sampling from a language model, as in our WordTies algorithm and in Carlini et al.'s (2021) attack, can be seen as optimizing over the discrete input text sequence to minimize the negative log-likelihood with noise, and it provides a way to uncover how LMs associate words, or properties of the training corpus. Bäuerle and Wexler (2020) optimized the activation of certain neurons in BERT over the input sequence, as an attempt to find the responsibilities of individual neurons, and Goh et al. (2021) applied similar ideas to vision-language models.

Conclusion
In this study, we verified the proposition that examining discrete sequence samples from LMs is a better approach to probing word associations than inspecting embedding spaces. We also explored properties related to semantic knowledge and reasoning in both human and LM word associations. These results reveal the high potential of word associations as a proxy for probing, and as a signal for fine-tuning, language models.

Limitations
We have yet to apply the WordTies algorithm to popular causal LMs such as GPT-2, despite having provided a theoretically sound method to do so in §3.2. Due to limited computational resources, we only evaluated our algorithm on the base versions of popular pre-trained LMs. Models with a larger number of parameters, such as bert-large-cased and roberta-large, are yet to be evaluated. For the same reason, we were only able to run the experiments on a subset of SWOW. Our method is notably slower than simply running k-nearest-neighbor search in embedding spaces, although the running time is still acceptable and we have a method for estimating the running time required (§3.3.6). Potential downstream use cases of word associations, such as measuring social biases in language models, are not evaluated in this paper.

Ethics Statement
As discussed in §5, measuring and mitigating social biases have been a prominent and motivating application of word associations. The algorithm we propose contributes a practical way to measure associations to words related to social aspects (such as profession, gender, and race) in language models, with higher precision and fewer confounders. These associations, in addition to being a measure of bias, could potentially serve as a signal for fine-tuning LMs, leading to language models with fewer biases.

Figure 2 :
Figure 2: Distribution of reasoning path lengths in the SWOW dataset. The maximum path length in SWOW is 14.

Table 1 :
Most frequent reasoning paths for the cue-response pairs in the SWOW dataset, with a potential interpretation for each path. HasProperty is shortened as HasProp. The superscript −1 denotes an inverse relation; for example, A HasProp⁻¹ B means A is a property of B. "-" indicates that there is no concise interpretation for the path.
Table 2 provides a breakdown of the knowledge types involved in the SWOW dataset.

Table 2 :
Number of cue-response pairs in the SWOW dataset with reasoning paths in the lexical and commonsense knowledge graphs.

Table 3 :
Evaluation results for the alignment between human word associations and LM word associations. C2S is short for the Contextualized2Static baseline, Vocab for the Vocab Embedding baseline, and Corpus for the corpus-only baseline. All reported Spearman's ρs are statistically significant (p < 0.01). The best results for each metric and model combination are marked in bold.

Table 4 :
Precision and Spearman's ρ of WordTies for finding asymmetric association pairs. All ρs are statistically significant (p < 0.01).

Table 5 :
Correlation between precision@50 and reasoning path length for different models.All ρs are statistically significant (p < 0.01).