Evaluating Word Embeddings with Categorical Modularity

We introduce categorical modularity, a novel low-resource intrinsic metric to evaluate word embedding quality. Categorical modularity is a graph modularity metric based on the $k$-nearest neighbor graph constructed with embedding vectors of words from a fixed set of semantic categories, in which the goal is to measure the proportion of words that have nearest neighbors within the same categories. We use a core set of 500 words belonging to 59 neurobiologically motivated semantic categories in 29 languages and analyze three word embedding models per language (FastText, MUSE, and subs2vec). We find moderate to strong positive correlations between categorical modularity and performance on the monolingual tasks of sentiment analysis and word similarity calculation and on the cross-lingual task of bilingual lexicon induction both to and from English. Overall, we suggest that categorical modularity provides non-trivial predictive information about downstream task performance, with breakdowns of correlations by model suggesting some meta-predictive properties about semantic information loss as well.


Introduction
Word embeddings represent words and phrases in continuous low-dimensional vector spaces. They are usually trained with neural language models or word collocations (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013b; Levy et al., 2015), such that similar assignments in a space reflect similar usage patterns. The rise of monolingual embeddings such as Word2Vec (Mikolov et al., 2013b), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017), coupled with the need to transfer lexical knowledge across languages, has also led to the development of cross-lingual word embeddings, in which different languages share a single distributed representation and are mapped into the same vector space. Such methods use different bilingual supervision signals (at the level of words, sentences, or documents) with varying levels of strength.
A central task in the study of word embeddings is finding metrics to evaluate their quality. These metrics can either be extrinsic, where embeddings are used as input features for downstream NLP tasks and evaluated on their performance, or intrinsic, where embeddings are directly tested for how well they capture syntactic or semantic properties in their own right (Qiu et al., 2018). Extrinsic methods are not always feasible for low-resource languages due to a lack of annotated data. Moreover, downstream model components can be finetuned to achieve higher performance on certain tasks without necessarily indicating improvement in the semantic representation of words in an embedding space (Leszczynski et al., 2020). This paper presents categorical modularity, a low-resource intrinsic evaluation metric for both monolingual and cross-lingual word embeddings based on the notion of graph modularity. The underlying principle is that in good embeddings, words in the same semantic category should be closer to each other than to words in different categories. We quantify this by building the k-nearest neighbor graph with a fixed set of words' semantic categories and computing the graph's modularity for a given embedding space. Modularity measures the strength of division of a graph with densely connected groups of vertices, with sparser connections between groups (Newman, 2006).
We source our semantic categories from Binder et al. (2016). In contrast to other semantic and ontological categories in the literature, these are motivated by a set of experiential attributes with neurobiological consistency, covering sensory, motor, spatial, temporal, affective, social, and cognitive dimensions. We refer to these attributes collectively as Binder categories. The resulting dataset consists of 500 English words, each labeled with three categories at three levels of semantic granularity. For example, the word chair belongs to Concrete Objects (Level 1), Artifacts (Level 2), and Furniture (Level 3). 442 words are pulled from Binder et al. (2016); on top of these, we add a few words to even out the distributions of categories and replace a few English-specific words with words that are more easily translated into non-English languages.
We then translate these 500 English words into 28 more languages, selected based on their availability in the form of pre-trained vectors from the MUSE library (Conneau et al., 2018). We produce 300-dimensional embeddings for these words using three popular embedding models: the monolingual FastText (Bojanowski et al., 2017) and subs2vec (Paridon and Thompson, 2020) models and the cross-lingual MUSE (Conneau et al., 2018) model. Using these embeddings, we obtain the nearest-neighbor sets among the 500 words within each (language, model) pair and use those relationships to calculate a modularity score for the pair. We compare modularity scores to performance on three downstream tasks: sentiment analysis (monolingual classification), word similarity (monolingual regression), and word-level bilingual lexicon induction (BLI, cross-lingual regression) both to and from English. We obtain moderate to strong positive correlations on all three tasks, with slightly stronger results on the monolingual tasks. We also provide an analysis of correlations broken down by individual model and explore potential meta-predictive properties of categorical modularity.
We further show that estimating modularity on Binder categories yields relevant information that cannot simply be derived from naturally occurring distributions of word clusters in embedding spaces. We show this by replicating all three downstream task correlation analyses with modularity scores based on clusters obtained with unsupervised community detection methods (Clauset et al., 2004), which we henceforth refer to as unsupervised clusters. After establishing the utility of categorical modularity, we show some of its use cases for comparing and selecting models for specific NLP problems, and we discuss preliminary results about the individual categories we find to be most predictive of downstream task performance.
Our code and data are available to the public. 1

Related Work

Word Embedding Evaluation Metrics
While word embeddings have become crucial tools in NLP, there is still little consensus on how to best evaluate them. Evaluation methods commonly fall into two categories: those motivated by an extrinsic downstream task and those motivated by the intrinsic study of the nature of semantics and the cognitive sciences (Bakarov, 2018). Intrinsic and extrinsic metrics do not always align, as some models have high quality as suggested by intrinsic scores but low extrinsic performance, and vice versa (Schnabel et al., 2015;Glavaš et al., 2019).
Our categorical modularity metric is inspired by Fujinuma et al. (2019), who study the modularity of cross-lingual embeddings based on the premise that different languages are well-mixed in good cross-lingual embeddings and thus have low modularity with respect to language. Our metric improves on that measure by sidestepping one of its failure modes: a purely random distribution of word vectors can also yield low modularity and thus be mistaken for high embedding quality, whereas it is unlikely that a random distribution would coincidentally form highly modular clusters corresponding to Binder categories. Moreover, our metric can evaluate both monolingual and cross-lingual word embeddings, allowing comparisons between the two types (e.g., FastText and MUSE), and it incorporates cognitive information through the use of brain-based semantic categories.

Cognitive Approaches to NLP
Recent work on word embeddings has explored the connections between NLP word representations and cognitively grounded representations of words. Such connections enrich both computational and neuroscientific research: external cognitive signals can enhance the capacity of artificial neural networks to understand language, while language processing in neural networks can shed light on how the human brain stores, categorizes, and processes words (Muttenthaler et al., 2020).
Cognitive approaches to lexical semantics propose a model in which words are defined by how they are organized in the brain (Lakoff, 1988). Based on this premise, Hollenstein et al. (2019) propose CogniVal, a framework for word embedding evaluation with cognitive language processing data. They evaluate six different word embeddings against a combination of 15 cognitive data sources acquired via eye-tracking, electroencephalography (EEG), and functional magnetic resonance imaging (fMRI). In a similar line of work, both Søgaard (2016) and Beinborn et al. (2019) evaluate word embeddings using fMRI datasets.
The use of cognitive data in NLP goes well beyond the evaluation of word embeddings. Utsumi (2020) uses the neurosemantically inspired categories from Binder et al. (2016) to identify the knowledge encoded in word vectors. Among other conclusions, they find that the prediction accuracy of cognitive and social information is higher than that of perceptual and spatiotemporal information.

Modularity and k-NN Graphs
The concept of modularity has its origins in network science, where it was introduced by Newman (2006). The goal of the modularity measure is to quantify the strength of the division of a network into clusters; such networks are usually represented as graphs. Intuitively, the modularity of a graph measures the difference between the fraction of edges that connect two nodes of the same category and the expected value of that fraction if the graph's edges were distributed at random. Thus, the higher the proportion of edges between nodes that belong to the same category, the higher the modularity.
In our case, we construct the pertinent graph with the k-nearest neighbors algorithm. Given a set S_w of N words and a set S_c of categories such that each of the N words belongs to exactly one of the categories in S_c, we map each of the N words into a d-dimensional word embedding vector space and obtain a d-dimensional vector for each word.
For each pair (w_i, w_j), where w_i, w_j ∈ S_w and 1 ≤ i, j ≤ |S_w|, with corresponding d-dimensional vectors v_i and v_j, we compute their cosine similarity (the cosine of the angle between them), which we denote by similarity(i, j).
We create a matrix M_D of dimensions |S_w| × |S_w|, where entry (M_D)_{i,j} is similarity(i, j). For a given k ∈ Z_{>0}, we build the |S_w| × |S_w| k-nearest neighbor matrix (denoted k-NNM) as follows: entry (i, j) of k-NNM is equal to 1 if and only if word j is one of the k nearest neighbors of word i (i.e., if similarity(i, j) is among the k largest cosine similarities between i and all other words in S_w). We note that M_D and k-NNM are not necessarily symmetric, as word i being among the k nearest neighbors of word j does not imply the reverse. Finally, we define the k-NN graph of S_w as the graph given by k-NNM viewed as an adjacency matrix. We can now describe how to compute the modularity score following the schema in Fujinuma et al. (2019).
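The k-NNM construction above can be sketched in a few lines of NumPy (a sketch; `knn_matrix` is our hypothetical helper name, not from the released code):

```python
import numpy as np

def knn_matrix(vectors, k):
    """Build the k-nearest-neighbor matrix (k-NNM) from an (N, d) array of
    word embedding vectors, using cosine similarity. Entry (i, j) is 1 iff
    word j is among the k nearest neighbors of word i; the result is not
    necessarily symmetric."""
    # Normalize rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T                 # the matrix M_D of similarity(i, j)
    np.fill_diagonal(sim, -np.inf)      # a word is never its own neighbor
    nnm = np.zeros(sim.shape, dtype=int)
    for i in range(sim.shape[0]):
        nnm[i, np.argsort(sim[i])[-k:]] = 1
    return nnm
```

Each row then sums to exactly k, while column sums vary, which is the source of the asymmetry noted above.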
Let d_i denote the degree of node i, that is, d_i = Σ_j (k-NNM)_{i,j}, and let g_i denote the category of word i. For each category c ∈ S_c, the fraction a_c of edge ends attached to words of category c is

a_c = (1 / 2m) Σ_i d_i · I[g_i = c],   (1)

where m is the total number of edges in the k-NN graph and I is the indicator function that evaluates to 1 if the argument is true and to 0 otherwise; a_c² is then the expected fraction of edges that would fall within c if edges were placed at random. The fraction of edges e_c that connect words of the same semantic category c is

e_c = (1 / 2m) Σ_{i,j} (k-NNM)_{i,j} · I[g_i = c] · I[g_j = c].   (2)

By weighting the |S_c| different semantic categories together, we calculate the overall modularity Q as follows:

Q = Σ_{c ∈ S_c} (e_c − a_c²).   (3)

Finally, we normalize Q by setting

Q_norm = Q / Q_max,  where Q_max = 1 − Σ_{c ∈ S_c} a_c².   (4)

Figure 1: (a) A high-modularity semantic k-NN graph. (b) A low-modularity semantic k-NN graph.

In our setting, Q_norm indicates the modularity score of one (language, model) pair overall, but we denote by Q_c the modularity of said (language, model) pair with respect to category c ∈ S_c. The definition of Q_c (normalized) is deduced from Equation 4 by restricting the sums to the single category c:

Q_c = (e_c − a_c²) / (1 − a_c²).   (5)

A higher value of Q_norm indicates that more of the words belonging to the same categories appear connected in the k-NN graph. In Sections 5 and 6, we analyze the values of Q_norm for each of the languages, and in Section 8, we make some observations about the different values of Q_c.
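The computation of Q_norm described above can be sketched as follows (a sketch only; the exact convention for handling the asymmetric adjacency matrix is our assumption, as the paper follows the schema of Fujinuma et al. (2019)):

```python
import numpy as np

def categorical_modularity(nnm, labels):
    """Normalized categorical modularity Q_norm of a k-NN adjacency matrix
    `nnm` (0/1, possibly asymmetric) with respect to `labels`, one
    category per word."""
    labels = np.asarray(labels)
    m = nnm.sum()                 # total number of edges in the k-NN graph
    d = nnm.sum(axis=1)           # node degrees d_i
    q, q_max = 0.0, 1.0
    for c in np.unique(labels):
        in_c = labels == c
        e_c = nnm[np.ix_(in_c, in_c)].sum() / m   # fraction of within-c edges
        a_c = d[in_c].sum() / (2 * m)             # fraction of edge ends in c
        q += e_c - a_c ** 2                       # Equation 3, per category
        q_max -= a_c ** 2                         # accumulates Q_max
    return q / q_max                              # Equation 4
```

On a graph whose edges fall entirely within categories, this returns 1.0; on a random graph, it hovers near 0.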
In our conclusions about how our categorical modularity scores correlate with downstream task performance, we also want to prove that our selected neurosemantic-based categories are nontrivial and are better predictors than the unsupervised clusters that emerge from the embeddings. To find these clusters, we use the Clauset-Newman-Moore greedy modularity method (Clauset et al., 2004). This algorithm iteratively joins the pair of communities that most increases modularity until no such pair exists. For each value of k, we obtain the unsupervised communities in this manner and compute their modularity scores. In Section 6, we show that Binder categories are significantly better predictors than the unsupervised clusters using the same set of 500 words.
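The unsupervised baseline above can be reproduced with the NetworkX implementation of the Clauset-Newman-Moore algorithm (a sketch; we symmetrize the k-NN adjacency matrix first since the algorithm expects an undirected graph, and `unsupervised_clusters` is our hypothetical name):

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def unsupervised_clusters(nnm):
    """Community label per node from the Clauset-Newman-Moore greedy
    modularity algorithm, run on the symmetrized k-NN graph."""
    graph = nx.from_numpy_array(np.maximum(nnm, nnm.T))
    communities = greedy_modularity_communities(graph)  # list of frozensets
    labels = np.empty(nnm.shape[0], dtype=int)
    for idx, community in enumerate(communities):
        labels[list(community)] = idx
    return labels
```

These labels can then be fed to the same modularity computation in place of the Binder categories.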

Dataset
In this section, we define the sets S_w and S_c of words and their semantic categories, respectively, that we use to compute categorical modularity scores for 29 languages. 2 As outlined in Section 1, our motivation to take a cognitive approach in the study of word embeddings prompts us to use words and categories that reflect a brain-based computational model of semantic representation as in Binder et al. (2016). We have 500 words (comprising nouns, adjectives, and verbs) with 3 levels of categories, from most general (Level 1) to most specific (Level 3). Each word is tagged with 3 categories (one per level), which are listed in Table 1. With this dataset of 500 words belonging to three levels of semantic categories, we compute the modularity scores of each of the 29 languages for each of the three word embedding models, yielding 87 (language, model) pairs: FastText, 4 MUSE, 5 and subs2vec. 6 We briefly summarize the properties of each of these embeddings.
FastText. Monolingual embeddings for 157 languages trained on Common Crawl and Wikipedia that use CBOW with position-weights and character n-grams (Bojanowski et al., 2017).
MUSE. Cross-lingual embeddings resulting from the alignment of 30 FastText embeddings into a common space under the supervision of groundtruth bilingual dictionaries (Conneau et al., 2018).
subs2vec. Monolingual embeddings for 55 languages trained on the OpenSubtitles corpus of speech transcriptions from television shows and movies using the FastText implementation of the skipgram algorithm (Paridon and Thompson, 2020). The authors claim that subtitles are closer to the human linguistic experience (Paridon and Thompson, 2020).
Information about the sizes of each (language, model) pair can be found in Appendix B. For each pair, we build the k-NN graph and compute modularity for different values of k and different levels of categories, which we treat as our two hyperparameters. We consider small values of k (namely k ∈ {2, 3, 4}) because categories such as States have as few as 4 words.

Downstream Task Experiments
We test the reliability of categorical modularity by running a few downstream tasks and computing the Spearman rank correlations between categorical modularity scores and performance on these tasks.
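The rank correlations throughout this section can be computed with SciPy; the numbers below are hypothetical placeholders, not values from our experiments:

```python
from scipy.stats import spearmanr

# One modularity score and one task score per (language, model) pair
# (placeholder values only).
modularity = [0.21, 0.35, 0.18, 0.42, 0.30]
task_score = [0.71, 0.80, 0.69, 0.84, 0.76]

# Spearman's rho is the Pearson correlation of the two rank orderings,
# so it lies in [-1, 1] and is insensitive to monotone rescalings.
rho, pval = spearmanr(modularity, task_score)
```

Because Spearman correlation depends only on ranks, it is robust to the fact that modularity scores and task metrics live on very different scales.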
After determining the optimal set of hyperparameters (k and level of semantic categories) for each task, we compare the correlation produced by that set of hyperparameters with the correlation produced by the corresponding value of k with the modularity of the unsupervised clusters constructed by the community detection algorithm described in Section 3, thereby establishing the non-triviality of the predictive properties of the chosen semantic categories. Table 2 provides a summary of correlation values for four tasks: movie review sentiment analysis (Sentiment), word similarity (WordSim), bilingual lexicon induction from English (BLI from), and bilingual lexicon induction to English (BLI to). Appendix A contains full tables with the correlation results. A visual summary of the results can be found in Figure 2.

4 https://fasttext.cc/docs/en/crawl-vectors.html
5 https://github.com/facebookresearch/MUSE#download
6 https://github.com/jvparidon/subs2vec

Sentiment Analysis
We first test our modularity scores through correlations with performance on the binary classification task of sentiment analysis, where the input is a movie review and the output is a binary label corresponding to either positive or negative sentiment. For this task, our data consists of 5,000 randomly selected positive movie reviews and 5,000 randomly selected negative reviews from the IMDB Movie Reviews dataset (Maas et al., 2011). We randomly partition these 10,000 reviews into 80% training and 20% testing data. Because this dataset is only available in English, we use the Google Translate API 7 to translate the data into 15 more languages (Arabic, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Estonian, Finnish, French, Greek, Hebrew, Hungarian, Indonesian, and Italian), for a total of n = 48 observations. The languages and the dataset size of 10,000 are chosen due to Google Translate API rate limits. For each (language, model) pair, we convert the raw text of each review to a 300-dimensional embedding vector. We use the built-in position-weighted continuous bag-of-words embedding model for FastText and subs2vec (Bojanowski et al., 2017), and we use a simple mean of individual word embeddings for MUSE, as the MUSE library does not have multi-word phrase embeddings built into its functionality. Using a vanilla linear support vector machine with scikit-learn's default settings, 8 we run the task on each (language, model) pair 30 times and record the mean accuracy and precision scores for each pair. We then calculate the Spearman correlations of each of the 9 modularity scores with both the accuracy and precision values.

Figure 2: Summary of modularity vs. performance metrics across tasks. Each language is represented with its 2-letter ISO 639-1 code. Hebrew, an outlier on the low end, is not included in this plot. Full modularity and performance data is included in our public GitHub repository.
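A minimal sketch of this pipeline (mean-of-word-vectors features for the MUSE case plus a default linear SVM; `word_vectors` and the helper names are hypothetical, not from the released code):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score

def review_vector(review, word_vectors):
    """Mean of individual word embeddings as the review embedding
    (`word_vectors` is a hypothetical token -> vector lookup)."""
    tokens = [t for t in review.lower().split() if t in word_vectors]
    return np.mean([word_vectors[t] for t in tokens], axis=0)

def run_sentiment_trial(X, y, seed):
    """One trial: random 80/20 split, vanilla linear SVM with defaults,
    returning (accuracy, precision) on the held-out 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    clf = LinearSVC().fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return accuracy_score(y_te, pred), precision_score(y_te, pred)
```

Repeating `run_sentiment_trial` with 30 different seeds and averaging gives the per-pair scores described above.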
Furthermore, we analyze the overall merged correlations (taking all 48 data points for a given modularity score and performance metric) as well as the correlations within models (taking only the 16 data points within each single model), giving us a total of 72 Spearman correlation values.
We find that the optimal set of hyperparameters is Level 3 categories with k = 2, which gives a Spearman correlation of ρ = 0.54 with the accuracy metric. Breaking this down by individual model, we have ρ_ft = 0.44 for FastText, ρ_m = 0.68 for MUSE, and ρ_s = 0.46 for subs2vec. For k = 2, the correlations of unsupervised clusters with accuracy are ρ = 0.09 for all 48 observations merged, ρ_ft = 0.1, ρ_m = 0.4, and ρ_s = 0.35, providing evidence that Binder categories contain non-trivial predictive information that is not present in naturally emerging clusters.

8 https://scikit-learn.org/stable/modules/svm.html

Word Similarity
Our next downstream task is the monolingual regression task of word similarity, in which the input is a pair of words in one language and the output is a real number between 0 and 4 representing how similar the two words are (a higher score represents a greater degree of similarity). We use the English, Italian, and Spanish word pair datasets from SemEval-2017 (Camacho et al., 2017), and we use the same Google Translate API from Section 6.1 to translate the English dataset into the remaining 26 languages. Each language's dataset then has 500 word pairs, which we randomly split into 400 training pairs and 100 testing pairs for each trial. Given a language and a model, we take each word pair, compute the 300-dimensional embeddings of both words, and calculate the Euclidean distance, Manhattan distance, and cosine similarity between the embeddings. We then feed these three scalars as a vector of inputs into a standard linear regression model from Python's scikit-learn package with default settings, 9 whose output is the similarity score given in the dataset. To evaluate task performance, we compute the Mean Squared Error (MSE) loss for each run and record the mean MSE loss over 30 trials per (language, model) pair.
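A sketch of this regression setup, assuming the word embeddings are already available as NumPy vectors (the helper names are ours, not from the released code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def pair_features(v1, v2):
    """The three scalar features per word pair: Euclidean distance,
    Manhattan distance, and cosine similarity."""
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return [np.linalg.norm(v1 - v2), np.abs(v1 - v2).sum(), cos]

def similarity_mse(train_pairs, train_y, test_pairs, test_y):
    """Fit a default LinearRegression on the 3-feature inputs and
    return the test MSE (lower is better)."""
    X_tr = np.array([pair_features(a, b) for a, b in train_pairs])
    X_te = np.array([pair_features(a, b) for a, b in test_pairs])
    reg = LinearRegression().fit(X_tr, train_y)
    return mean_squared_error(test_y, reg.predict(X_te))
```

As in the paper, performance is the mean of this MSE over 30 random 400/100 splits per (language, model) pair.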
We then calculate the Spearman correlations of each of the 9 modularity scores with the negatives of the losses (such that positive correlation means that high modularity predicts high performance), both merged (87 data points) and within individual models (29 data points per model), for a total of 36 correlation values.
We find that all of the merged correlations are moderately to strongly positive. In particular, with the optimal hyperparameters of Level 2, k = 2, we have ρ = 0.71 overall, ρ ft = 0.59 for FastText, ρ m = 0.34 for MUSE, and ρ s = 0.8 for subs2vec. In comparison, the correlations of the unsupervised cluster modularities with mean MSE loss for k = 2 are ρ = 0.27, ρ ft = 0.36, ρ m = 0.3, and ρ s = 0.42, all weaker than their Binder counterparts.

Word-Level Bilingual Lexicon Induction
In addition to both monolingual classification and monolingual regression tasks, we also test our modularity metric on the cross-lingual regression task of bilingual lexicon induction. Using the groundtruth bilingual dictionaries provided by the publishers of MUSE (Conneau et al., 2018), we run this task with the 28 non-English languages listed in footnote 2 in two directions: translation to and from English. We use the 5,000-1,500 train-test split provided in the MUSE dictionary dataset and formulate the tasks as multivariate, multi-output regression tasks: for each observation in each (language, model) pair, we convert the English source word to its 300-dimensional embedding specified by the English version of the model and feed this vector as input to the same scikit-learn linear regression model as in Section 6.2, of which the output is a 300-dimensional vector in the target language model space representing the embedding of the target word.
We follow this procedure in the other direction as well by converting source non-English words to embeddings in the appropriate non-English model spaces, feeding those embeddings into the linear regression model, and computing 300-dimensional predictions for the target English word vectors in the English model spaces. To measure task performance in the "from English" direction, we convert the ground-truth non-English target words into vectors in the corresponding non-English embedding model space, compute the cosine similarities between each ground-truth vector and its corresponding predicted vector, and record the mean of those cosine similarities as a measure of how close we are to the ground truth on average. We run 30 trials per (language, model) pair and record the mean of the mean cosine similarities.
In the "to English" direction, we similarly convert the ground-truth English target words into vectors and compute the mean cosine similarity over the prediction-ground-truth pairs. We calculate the Spearman correlations of each of the 9 modularity scores with the 30-trial means of mean cosine similarities in both directions. Once again, we calculate the correlations both across all models and within each individual model, yielding 72 total correlation values.
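The scoring in either direction reduces to the following sketch (assuming source and target embedding matrices are already aligned row-by-row via the MUSE dictionaries; the function name is hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def bli_mean_cosine(src_train, tgt_train, src_test, tgt_test):
    """Multi-output linear map from source-language embeddings (n, 300)
    to target-language embeddings, scored by the mean cosine similarity
    between predicted and ground-truth target vectors."""
    reg = LinearRegression().fit(src_train, tgt_train)
    pred = reg.predict(src_test)
    cos = (pred * tgt_test).sum(axis=1) / (
        np.linalg.norm(pred, axis=1) * np.linalg.norm(tgt_test, axis=1))
    return cos.mean()
```

Swapping which language supplies `src` and which supplies `tgt` gives the "from English" and "to English" directions, respectively.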
The optimal set of hyperparameters for the merged correlation in the "to English" direction is also Level 3, k = 2, giving a moderate ρ = 0.5 overall, a weak ρ ft = 0.29, a moderate ρ m = 0.56, and a very strong ρ s = 0.82. The corresponding unsupervised cluster correlations for k = 2 are ρ = 0.35, ρ ft = 0.04, ρ m = 0.27, and ρ s = 0.65.
Notably, in both the word similarity and BLI tasks, ρ s is significantly stronger than ρ ft and ρ m . This may be due to the fact that compared to sources such as Wikipedia and Common Crawl, the subtitles used as training data for subs2vec are more representative of how the human brain semantically maps language, as suggested by the model's creators (Paridon and Thompson, 2020). Overall, these downstream task experiments suggest that categorical modularity is a non-trivially significant predictor of performance on both monolingual and cross-lingual NLP tasks (though it is stronger on monolingual tasks) and that it may have potential to be a meta-predictor of how well a particular model matches the information encoded in the human brain.

Use Cases: Comparing and Selecting Models
After having established substantial evidence of the predictive properties of categorical modularity, we present some examples of how the research community can make use of the metric for model evaluation and selection.

Comparing Models within a Language
The best hyperparameters for the tasks described in Section 6 are Level 3 with k = 2 and Level 2 with k = 2. Across the 29 languages at Level 2 with k = 2, FastText has the highest modularity 9 times (Arabic, Catalan, Estonian, Finnish, Greek, Macedonian, Polish, Turkish, Ukrainian), MUSE 3 times (Hungarian, Russian, Spanish), and subs2vec 17 times. For Level 3 with k = 2, FastText has the highest modularity 13 times, MUSE 2 times (Russian, Vietnamese), and subs2vec 15 times. Though individual choices should be made per language, this suggests that subs2vec may be a strong default choice for monolingual tasks overall.

Comparing Languages within a Model
We also present some evidence that categorical modularity predicts bilingual lexicon induction performance moderately well, and predictive properties are especially strong within subs2vec. For the optimal set of hyperparameters found in that task within subs2vec (Level 2, k = 2, ρ s = 0.77 from English and ρ s = 0.81 to English), the languages with the highest modularities in subs2vec are Dutch

Categorical Modularity as a Potential Meta-Predictor
We find evidence that categorical modularity reveals some information about how well models map to the human brain, as suggested by subs2vec's significantly stronger correlations. This is particularly true with regression tasks. Given a new or existing embedding model, calculating its categorical modularities and assessing their correlations with regression tasks such as word similarity may reveal if the model space is representative of how linguistic information is encoded in the brain.

Discussion and Future Work
Categorical modularity shows promise as an intrinsic word embedding evaluation metric based on our preliminary experiments. We can envision extending this work in several directions. For one, we can calculate single-category modularities (denoted by Q c as defined in Equation 5) and test which individual categories contain the most predictive properties. Our limited experiments in this direction with the movie sentiment analysis task suggest that concrete and non-living categories have better predictive capabilities than abstract and living ones: for the sentiment analysis task, the 5 most strongly correlated categories are Nonverbal Sounds, Artifacts, Concrete Objects, Vehicles, and Manufactured Foods, while the 5 least correlated categories are Abstract Properties, Abstract Constructs, Miscellaneous Actions, Humans, and Abstract Actions. We may also extend our work to more models and languages to see if the predictive properties truly hold across all languages and models. Additionally, as more multilingual research and data becomes available in this space, we may probe different sets of semantic categories, further downstream tasks (particularly multi-class classification, monolingual text generation, and sentence-level bilingual lexicon induction), and further variations of models used in downstream tasks (e.g., deeper neural networks instead of vanilla SVMs and linear regressions). We can also envision improvements upon the categorical modularity metric itself, perhaps by way of a lower-resource metric or a metric that works well on contextualized word embeddings for which the word-vector mappings may have more complex geometries. Our code and data, which are available to the public, 10 can also enable researchers and practitioners to replicate our results and experiment with different models, words, languages, and categories.

Conclusion
In this paper, we introduce categorical modularity, a novel low-resource metric that may serve as a tool to evaluate word embeddings intrinsically. We present evidence that categorical modularity has strong non-trivial predictive properties with respect to overall monolingual task performance, moderate predictive properties with respect to cross-lingual task performance, and potential meta-predictive properties of model space similarity to cognitive encodings of language.

Impact Statement

Ethical Concerns
All of the data used in this paper is either our own or from publicly released and licensed sources. Our data is mainly aimed towards researchers and developers who wish to assess the qualities of word embedding models and gain some intuition for embedding model selection for downstream tasks. In particular, our conclusions would be suited for researchers working among the 29 functioning languages given in the MUSE library, which are heavily skewed towards Indo-European languages. Though we do not directly introduce novel NLP applications, we provide resources that may be useful in selecting technologies to deploy and informing the development of word embedding systems.
Categorical modularity is intended to be an informational tool that sheds light on semantic representation of natural language information in computational word embeddings, and there are many aspects of its capabilities that can be improved upon, extended, or further explored. We would also like to emphasize that we have only tested our metric on three specific downstream tasks with basic downstream models, and these may not be representative of all NLP tasks in general. Categorical modularity also has not yet been shown to reveal information on representational harms inherent in word embedding spaces, so evidence of good downstream task performance should not be misconstrued as indicative of strong and beneficial performance across all NLP domains.

Environmental Impact
We acknowledge the pressing threat of climate change and therefore record some statistics on the computational costs of our experiments. All of our experiments are run with a 13-inch 2019 MacBook Pro with a 1.7 GHz Quad-Core Intel Core i7 processor running Python 3.8.3 in Terminal Version 2.11 on MacOS Big Sur Version 11.1. For the English language, generating FastText embeddings for our 500 core words took 20.31 seconds, generating the 500 × 500 k-NNM took 1 hour and 25.72 seconds, generating MUSE embeddings for the 500 words took 23.13 seconds, and generating the 500 × 500 k-NNM took 7 minutes and 39.32 seconds. For the downstream task of movie review sentiment analysis, it took 42.03 seconds to generate FastText sentence embeddings for 10,000 English reviews and 6 minutes and 38.32 seconds to generate these embeddings with MUSE. It took 0.35 seconds per review to translate from English to Spanish using the Google Translate API, and it took 2.6 seconds to run 30 trials of the sentiment analysis task for English FastText using scikit-learn's LinearSVC. For the task of word similarity calculation, English FastText embeddings and 3-dimensional input data took 21.25 seconds to generate for 500 word pairs, English MUSE-based embedding data took 33.47 seconds to generate, and the word similarity task using scikit-learn's LinearRegression took 0.09 seconds on the generated English FastText-based inputs. For bilingual lexicon induction, FastText English-Spanish embedding data took 55.87 seconds to generate, MUSE English-Spanish embedding data took 27 minutes and 29.56 seconds to generate, and the BLI task took a combined 4.35 seconds for both directions of English-Spanish using FastText. All other tasks took less than one second per (language, model) pair.

A Appendix: Full Results for Correlations of Categorical Modularities with Downstream Tasks
This section contains full tables of correlations of general categorical modularities and unsupervised cluster modularities with downstream tasks. As above, ρ is the overall correlation, ρ_ft is the correlation within FastText, ρ_m is the correlation within MUSE, and ρ_s is the correlation within subs2vec. On notation: in the "Model" columns, "a, b" represents the hyperparameters of Level a Binder categories with k = b neighbors, while "C, a" represents "control" unsupervised clusters with the hyperparameter of k = a neighbors.

B Sizes of Embedding Models
For contextual reference, we summarize the sizes of each of the embedding models used in this paper.