Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs

In comparative linguistics, colexification refers to the phenomenon of a lexical form conveying two or more distinct meanings. Existing work on colexification patterns relies on annotated word lists, limiting scalability and usefulness in NLP. In contrast, we identify colexification patterns of more than 2,000 concepts across 1,335 languages directly from an unannotated parallel corpus. We then propose simple and effective methods to build multilingual graphs from the colexification patterns: ColexNet and ColexNet+. ColexNet's nodes are concepts and its edges are colexifications. In ColexNet+, concept nodes are additionally linked through intermediate nodes, each representing an ngram in one of 1,334 languages. We use ColexNet+ to train →ColexNet+, high-quality multilingual embeddings that are well-suited for transfer learning. In our experiments, we first show that ColexNet achieves high recall on CLICS, a dataset of crosslingual colexifications. We then evaluate →ColexNet+ on roundtrip translation, sentence retrieval, and sentence classification, and show that our embeddings surpass several transfer learning baselines. This demonstrates the benefits of using colexification as a source of information in multilingual NLP.


Introduction
Multilingual representations are beneficial in natural language processing (NLP) due to their ability to transfer knowledge across languages (Artetxe and Schwenk, 2019; Conneau et al., 2020; Fan et al., 2021). Typically, such representations are learned through pre-training Large Language Models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023) or multilingual word embeddings (Ammar et al., 2016; Lample et al., 2018; Dufter et al., 2018). However, LLMs require enormous amounts of data to train, limiting their use mostly to high-resource and medium-resource languages (Zhou et al., 2023). Alternatively, multilingual word embeddings are widely used in NLP because of their simplicity and good performance (Ammar et al., 2016; Lample et al., 2018; Jawanpuria et al., 2019). However, most existing multilingual embeddings are learned from word-context information, without leveraging global cooccurrence information within individual languages or across languages, which can help distinguish the distinct meanings conveyed by a lexical form. Therefore, we see a pressing need in NLP for massively multilingual word embeddings that span a large number of languages (1,335 in our case), specifically account for global cooccurrence, and provide a good basis for crosslingual transfer learning.
Colexification has gained increasing attention in comparative linguistics and crosslingual NLP. According to François (2008), a language colexifies two distinct meanings if it expresses them with the same lexical form. Different languages have different colexification patterns. For example, while English has separate words for <hand> and <arm>,1 Russian 'рука' colexifies these two concepts. Most prior work explores colexification (Floyd et al., 2021; Brochhagen and Boleda, 2022; List, 2023) using manually curated crosslingual datasets that consist of multilingual word lists such as CLICS (List, 2018; List et al., 2018; Rzymski et al., 2020). However, relying on these datasets has several limitations: extending them to more languages and more concepts is challenging, and because they contain lists of lemmata, they cannot easily be used (in a corpus-based approach for low-resource languages without morphological resources) for processing occurrences in context.
To overcome these limitations and boost crosslingual transfer learning, especially for low-resource languages, we use the Parallel Bible Corpus (PBC) (Mayer and Cysouw, 2014), which has verse-level aligned translations of the Bible in 1,335 languages, to identify colexification patterns (a verse in PBC roughly corresponds to a sentence). With the identified patterns between a wide range of concepts, we propose novel algorithms that efficiently build large-scale multilingual graphs. To the best of our knowledge, this is the first work that constructs graphs of colexification and trains multilingual representations for crosslingual transfer learning directly from a parallel corpus on a large scale. We show that the graphs capture the links between concepts across languages and that the derived multilingual representations considerably improve crosslingual transfer on downstream tasks. Previous work on building monolingual graphs (Jauhar et al., 2015; Ustalov et al., 2017) or multilingual graphs (Harvill et al., 2022; Jafarinejad, 2023; Chen et al., 2023) is different in that it (1) does not consider words in context and only uses lemmata, (2) is based on external sense inventories such as WordNet (Miller, 1995) and BabelNet (Navigli and Ponzetto, 2012; Navigli et al., 2021), which are not available for many low-resource languages, and (3) does not investigate the crosslingual transferability of the multilingual representations on NLP downstream tasks such as sentence retrieval or classification in a crosslingual scenario.
The contributions of this work are as follows: (i) We present ColexNet, a graph of concepts based on colexification patterns that are directly extracted from a parallel corpus. (ii) By extending ColexNet, we further present ColexNet+, a large-scale multilingual graph that additionally contains ngrams in 1,334 languages that instantiate those patterns. (iii) We contribute to crosslingual transfer learning by using ColexNet+ to generate multilingual embeddings: →ColexNet+. We show that →ColexNet+ outperforms several baselines on roundtrip translation, verse retrieval, and classification. (iv) We evaluate ColexNet on CLICS and show that we identify a large portion of the ground-truth colexifications. (v) Going beyond many works on crosslingual transfer that focus on transfer from English, we systematically investigate the effect of the source language on successful transfer with →ColexNet+: we use 1,245 languages as sources and experiment on 1,245 × 1,245 transfer directions. (vi) We make our code, graphs, and embeddings publicly available.2

Related Work
There are many ways to learn multilingual word embeddings. One common approach is to first learn monolingual embeddings for each language separately through, e.g., Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), or fastText (Bojanowski et al., 2017), and then map them into the same space (Artetxe et al., 2017; Lample et al., 2018; Artetxe et al., 2018). Another group of methods uses parallel corpora to directly learn bilingual embeddings (Hermann and Blunsom, 2014; Chandar et al., 2014; Levy et al., 2017). Our work is related to that of Dufter et al. (2018), which also learns embeddings on the PBC but does not take advantage of colexification, i.e., the explicit modeling of relations between colexified concepts/ngrams. We use S-ID (Levy et al., 2017) and the embeddings from Dufter et al. (2018) as baselines.
One of the best-known and most widely used multilingual resources is BabelNet (Navigli and Ponzetto, 2012; Navigli et al., 2021). BabelNet has been used for learning or enhancing embeddings (Iacobacci et al., 2015; Camacho-Collados and Pilehvar, 2018; Conia and Navigli, 2020; Levine et al., 2020; Harvill et al., 2022; Chen et al., 2023) for lexical-level tasks such as semantic word similarity and word sense disambiguation (Speer and Lowry-Duda, 2017; Conia and Navigli, 2020; Procopio et al., 2021; Navigli et al., 2022). Our focus is on covering many more languages (i.e., larger scale in terms of languages) for crosslingual transfer learning. While hand-curated lexica often have better quality than automatically learned resources, they are not available for most of our languages. Ultimately, the two approaches should be combined.
Colexification was introduced by Haspelmath (2003) in the context of grammatical semantics. François (2008) then used colexification as the foundation for studying semantic change crosslinguistically. CLICS (List, 2018; List et al., 2018; Rzymski et al., 2020) is a crosslingual database that facilitates research on colexification. Languages can differ in their colexification patterns, which are influenced by many factors such as human cognition, language family, and geographic area (Jackson et al., 2019; Xu et al., 2020; Segerer and Vanhove, 2022). An empirical study by Bao et al. (2021) indicates that no pair of concepts is colexified in every language. On the other hand, a recent investigation of conceptualization based on the PBC shows that some concepts are more likely to be involved in colexification than others (Liu et al., 2023). Such universal colexification patterns across languages reflect crosslinguistic similarities (Youn et al., 2016; Georgakopoulos et al., 2022). Therefore, by integrating the colexification patterns of as many languages as possible, we can generate multilingual representations that are suitable for massively crosslingual transfer learning.

Data
We use 1,335 Bible translations from the PBC corpus (Mayer and Cysouw, 2014). Each translation is from a different language (identified by its ISO 639-3 code). Prior work (Asgari and Schütze, 2017; Dufter et al., 2018; Weissweiler et al., 2022) has used subsets of the corpus. In contrast, we follow Conceptualizer (Liu et al., 2023) and use all parallel verses between English and the other languages. This gives us better coverage of concepts and the contexts in which they occur.

Colexification pattern identification
Concept Pool. Conceptualizer (Liu et al., 2023) uses a small, manually selected group of focal concepts, i.e., concepts of interest (83 in total), and constructs a set of strings to represent each concept. For example, it uses {$belly$, $bellies$} to represent the focal concept <belly>, where $ is the word boundary. Manually defining the sets is not feasible when a large number of concepts are to be explored. Thus, in this work, we lemmatize the English corpus and regard each lemma as a concept. The set of all lemmata forms the concept pool F.

Conceptualizer (Liu et al., 2023) creates a bipartite directed alignment graph between source-language concepts and target-language strings. It consists of a forward pass (FP) and a backward pass (BP). This kind of two-step workflow is also used to extract semantic relations (Dyvik, 2004) and paraphrases (Bannard and Callison-Burch, 2005) from bilingual parallel corpora. A key difference from this prior work is that Conceptualizer works on the ngram level instead of the token level; this facilitates the extraction of associations hidden inside words. In Conceptualizer, FP first searches for target-language ngrams highly associated with a given focal concept; BP then searches for English ngrams highly correlated with the target ngrams identified in FP. The association is measured using the χ² score. The process can detect whether the conceptualization of the focal concept diverges in any language. For example, starting from the concept <hand>, FP finds the Russian ngram 'рук', and BP then finds two English ngrams, 'hand' and 'arm'. This indicates that the conceptualization of these concepts diverges between English and Russian. The divergence of conceptualization in the lexical forms indicates a difference in colexification patterns: Russian colexifies the concepts <hand> and <arm> (in the word 'рук') while English does not.

Conceptualizer
Forward Pass. Let f be a focal concept in F and V_f the set of verses in which f occurs. FP identifies a set of ngrams T in target language l where each ngram can refer to concept f, i.e., T = FP(f, l). We exhaustively search all ngrams t within all tokens3 in the parallel corpus in target language l for high correlation with V_f. This procedure is similar to Östling and Kurfalı (2023)'s subword-level alignment, except that we align concepts in English with subwords in the other target languages. E.g., we start from <hand> and find that the Russian ngram 'рук' has the highest correlation with V_<hand>, which indicates that 'рук' can refer to <hand>. Like Conceptualizer, we use χ² as the measure of correlation and iterate FP until the ngrams in T cumulatively cover a fraction α = 0.9 of the occurrences of focal concept f, for a maximum of M = 3 iterations. See §A for a discussion of these hyperparameters.
Backward Pass. BP is essentially the same as FP, but the search direction is reversed. Let V_T be the set of verses in which at least one ngram in T (identified in FP for target language l and concept f) occurs. We exhaustively search all concepts c in the concept pool F for high correlation with V_T. Let C = BP(T, l) be the final set of identified concepts. If C = {f}, i.e., the single identified concept equals f, the ngrams can only refer to concept f according to the bilingual context. Alternatively, if |C| > 1, language l colexifies the concepts in C through the ngrams T. For example, performing BP on the ngram 'рук' yields <hand> and <arm>, which indicates that Russian colexifies <hand> and <arm>. Notably, since we consider ngrams instead of tokens on the target language side, we can also identify partial colexification patterns in BP, i.e., patterns that involve not an entire word but only part of it. We show such examples in §B.
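The χ²-based search in FP can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy corpus format, the `best_ngram` helper, and the maximum ngram length are assumptions for the example; a real forward pass also iterates and checks cumulative coverage.

```python
def chi2_score(n11, n1_, n_1, n):
    """2x2 chi-squared association between an ngram and a concept.
    n11: verses containing both; n1_: verses with the ngram;
    n_1: verses with the concept; n: total number of verses."""
    n10 = n1_ - n11          # ngram without concept
    n01 = n_1 - n11          # concept without ngram
    n00 = n - n11 - n10 - n01
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = n1_ * (n - n1_) * n_1 * (n - n_1)
    return num / den if den else 0.0

def best_ngram(concept_verses, target_verses, max_len=6):
    """One FP step: the target-language ngram (substring of a token)
    most associated with the verses where the focal concept occurs."""
    occ = {}  # candidate ngram -> set of verse IDs it occurs in
    for vid, text in target_verses.items():
        seen = set()
        for tok in text.split():
            for i in range(len(tok)):
                for j in range(i + 1, min(i + max_len, len(tok)) + 1):
                    seen.add(tok[i:j])
        for ng in seen:
            occ.setdefault(ng, set()).add(vid)
    n = len(target_verses)
    best, best_score = None, -1.0
    for ng, vids in occ.items():
        s = chi2_score(len(vids & concept_verses), len(vids),
                       len(concept_verses), n)
        if s > best_score:
            best, best_score = ng, s
    return best
```

BP would apply the same scoring in the reverse direction, searching English lemmata against the verse set of the identified target ngrams.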

ColexNet
We run FP and BP for all 1,806 focal concepts in the English concept pool F that have a frequency between 5 and 2,000 and for every language l in our set of 1,334 target languages L (excluding English). This allows us to uncover the colexification patterns in 1,334 languages. We formalize the relations among the colexification patterns as an undirected graph, where each node is a concept represented by an English lemma and each edge indicates that at least one language colexifies the two connected concepts. Formally, let G(F, E, w_c, w_n) be a weighted undirected graph on the vertices F, i.e., the concept pool, where E is a set of undirected edges; w_c is an edge weighting (counting) function F × F → Z+, which returns, for a pair of concepts, the number of languages colexifying them; and w_n is an edge record function, which returns all ngrams that colexify a given pair of concepts. We show the graph construction in Algorithm 1.
In this study, we use a threshold λ to control the confidence of the colexification edges: we remove an edge e if w_c(e) < λ. The intuition is that if two concepts f1 and f2 are colexified in many languages, we can be more certain that the edge between them is correctly identified. Conversely, if two concepts are colexified in only a few languages, the pattern may have been wrongly identified because of verse-level misalignment, free translation, or other errors in the data. See Table 4 for the influence of different λ on graph statistics. In addition, we remove zero-degree nodes to filter out isolated concepts.

ColexNet+
ColexNet only contains concepts that are expressed as English lemmata and cannot be directly used to learn multilingual representations for the target languages. Therefore, we propose ColexNet+, a large multilingual graph expanded from ColexNet by including target-language ngrams that instantiate the colexification patterns identified in ColexNet. Specifically, we replace each edge (f1, f2) with a set of new pairs of edges: (1) find the set of ngrams w_n((f1, f2)) that colexify concepts f1 and f2 (in any language) and (2) for each ngram v in the set, insert the new edges (f1, v) and (v, f2). To obtain a clean bipartite structure, we do not keep the original edge (f1, f2); this guarantees that only concept-ngram edges and no concept-concept edges occur. In addition, any two related concepts (i.e., concepts connected by an edge in ColexNet) are always implicitly connected in ColexNet+ through the ngram nodes that associate them. Figure 1 shows a subnetwork of ColexNet+ consisting of a few concepts and ngrams in different languages that colexify them. The graph construction is shown in Algorithm 1.
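The edge-expansion step can be sketched as follows; this assumes w_c and w_n are dictionaries keyed by concept pairs (as produced by the ColexNet construction), and language-tagged ngram nodes are an illustrative choice to keep homographic ngrams from different languages distinct.

```python
def expand_to_colexnet_plus(w_c, w_n):
    """Replace each concept-concept edge (f1, f2) with concept-ngram
    edges (f1, v) and (v, f2) for every ngram v recorded on the edge.
    The result is bipartite: no concept-concept or ngram-ngram edges."""
    edges = set()
    for (f1, f2) in w_c:
        for lang, ngram in w_n[(f1, f2)]:
            v = (lang, ngram)   # ngram node, tagged with its language
            edges.add((f1, v))
            edges.add((v, f2))
    return edges
```

Note that the original concept-concept edge is deliberately not emitted, matching the clean bipartite structure described above.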
Since ColexNet+ is expanded from ColexNet, we only include pairs of edges expanded from reliable edges (w_c(e) ≥ λ) in ColexNet. The number of nodes and edges in ColexNet+ is thus influenced by λ: the higher λ, the fewer nodes and edges ColexNet+ contains. §A presents statistics and performance for different values of λ.

Multilingual Embedding learning
To capture the semantic relations among the nodes and the structure of ColexNet+, we use Node2Vec (Grover and Leskovec, 2016) to generate node representations. Let v be the node that a random walk currently resides in, t the node that the walk traversed in the last step, and x a node that the walk may visit in the next step. Node2Vec calculates the unnormalized transition probability from v to x as π_vx = α_pq(t, x) · w((v, x)) for sampling the next node x in the graph, where w((v, x)) is the weight of the undirected edge (v, x) and the search bias α_pq(t, x) depends on d_tx, the shortest-path distance between t and x. The transition probability determines whether a new node or an already-visited node (regardless of whether it is a concept or ngram node) will be sampled. In ColexNet+, d_tx ≠ 1 for any nodes t and x, because a concept (resp. ngram) node never connects to other concept (resp. ngram) nodes. We set the return parameter p = 0.5 and the in-out parameter q = 2 in the hope of encoding more "local" information, as this setting approximates breadth-first sampling according to Grover and Leskovec (2016).
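The Node2Vec search bias can be made concrete with a small sketch, assuming the standard definition from Grover and Leskovec (2016): α_pq is 1/p when x = t (distance 0), 1 at distance 1, and 1/q at distance 2. The adjacency and weight containers here are illustrative.

```python
def alpha_pq(d_tx, p=0.5, q=2.0):
    """Node2Vec search bias as a function of the distance d_tx
    between the previous node t and candidate next node x."""
    if d_tx == 0:       # step back to t
        return 1.0 / p
    if d_tx == 1:       # common neighbor of t and v
        return 1.0
    return 1.0 / q      # move further away from t

def transition_probs(t, v, neighbors, weight, p=0.5, q=2.0):
    """Unnormalized pi_vx = alpha_pq(t, x) * w(v, x) for every
    neighbor x of v. In a bipartite graph such as ColexNet+,
    d_tx is never 1, so alpha is always either 1/p or 1/q."""
    probs = {}
    for x in neighbors[v]:
        if x == t:
            d = 0
        elif x in neighbors[t]:
            d = 1
        else:
            d = 2
        probs[x] = alpha_pq(d, p, q) * weight[(v, x)]
    return probs
```

With p = 0.5 and q = 2, returning to the previous node is weighted four times higher than moving away, which keeps the walk local.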
Below, we show that the multilingual representations trained this way have desirable properties, e.g., representations of ngrams from different languages that refer to the same concept can be highly cosine-similar, which is important for zero-shot crosslingual transfer learning.

Experiments
To evaluate our proposed methods, we conduct the following experiments: (1) colexification identification, comparing ColexNet against CLICS; (2) roundtrip translation; (3) verse retrieval; and (4) verse classification.

Baselines
To evaluate the effectiveness of →ColexNet+, our multilingual embeddings, we consider several previously proposed strong multilingual embeddings as baselines for the downstream tasks. The dimension of all embeddings (ours and the baselines) is set to 200 for a fair comparison. In addition, we consider three non-embedding baselines: bag-of-words (BOW), XLM-R (Conneau et al., 2020), and Glot500-m (ImaniGooghari et al., 2023). The first is a random baseline and is expected to perform the worst because a BOW model is only trained on the English corpus, which does not directly transfer to other languages. The latter two are strong multilingual pretrained models. XLM-R is pretrained on 100 languages, while Glot500-m is a version of XLM-R continued-pretrained on the Glot500-c corpus (ImaniGooghari et al., 2023), which includes more than 500 languages. We choose the base version of these multilingual pretrained models. We introduce the embedding baselines below.
S-ID embedding. Levy et al. (2017) show that S-ID embeddings, which leverage the sentence ID feature, are effective for learning good multilingual embeddings from parallel corpora. We use pairs of a verse ID identifier and a token in that verse as input to Word2Vec (Mikolov et al., 2013) to train S-ID embeddings. For example, the pairs (01049027, Wolf) and (01049027, 狼) will be present in the data because '狼' (resp. 'Wolf') occurs in Chinese (resp. German) in verse number 01049027. This is a strong baseline because the verse number is an abstract representation of the context; it therefore encourages words occurring across languages in the same verse to have similar embeddings.
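Generating the S-ID training pairs can be sketched as follows; the corpus layout and the language-prefixed token form (to keep homographs from different languages apart) are illustrative assumptions, and the resulting pairs would be fed to a Word2Vec-style trainer.

```python
def sid_pairs(corpus):
    """corpus: {lang: {verse_id: verse_text}}.
    Emit (verse_id, token) training pairs: the verse ID acts as a
    shared abstract context across all languages, which pulls
    tokens from aligned verses toward similar embeddings."""
    pairs = []
    for lang, verses in corpus.items():
        for vid, text in verses.items():
            for tok in text.split():
                pairs.append([vid, f"{lang}:{tok}"])
    return pairs
```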
CLIQUE & N(t) embedding. CLIQUE embeddings (Dufter et al., 2018) are learned on cliques extracted from PBC. Each clique is a set of tokens from different languages that refer to the same concept. The embeddings are then learned from token-clique pairs. Additionally, to take the connections between concepts into account, Dufter et al. (2018) consider the neighbors of each token (tokens that are connected with the current node in the dictionary graph) and train embeddings on those pairs of neighbors, which we refer to as the N(t) embedding.
Eflomal-aligned embedding. We construct an alignment graph of words using Eflomal (Östling and Tiedemann, 2016) and learn embeddings on the graph as another strong baseline. Specifically, we align the English Bible with the Bibles in all other target languages. We define the edge set of the graph as the set of all edges that connect an English word with an aligned target-language word (if there are at least two such alignments). Finally, we use Node2Vec (with the same hyperparameters as for ColexNet+) to learn multilingual embeddings.
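The edge-set construction for this baseline can be sketched as follows, assuming the aligner's output has been flattened into (english_word, target_word) link tuples; the function name is illustrative.

```python
from collections import Counter

def alignment_edges(aligned_pairs, min_count=2):
    """aligned_pairs: iterable of (english_word, target_word) links
    produced by a word aligner such as Eflomal. Keep an edge only if
    the same link is observed at least min_count times, filtering out
    one-off (likely spurious) alignments."""
    counts = Counter(aligned_pairs)
    return {edge for edge, c in counts.items() if c >= min_count}
```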

Colexification identification
We first evaluate how well ColexNet identifies colexification patterns. We use CLICS (List, 2018; List et al., 2018; Rzymski et al., 2020), a database of colexifications, as the gold standard. Each node in CLICS is a concept expressed in English. In ColexNet, we use English lemmata as expressions of concepts, whereas CLICS also includes short noun phrases. We only consider the common concepts, i.e., concepts that are expressed as English words and occur in both CLICS and ColexNet. For each start concept s in the set of common concepts P, let T(s) be the neighbors in CLICS, i.e., the set of concepts that have a colexification relation with s, and C(s) the neighbors in ColexNet. We then compute the recall for s as |T(s) ∩ C(s)|/|T(s)|.
To get a global view of the performance, we report the micro average recall (MicroRec.), i.e., Σ_{s∈P} |T(s) ∩ C(s)| / Σ_{s∈P} |T(s)|, as well as the macro average over the per-concept recalls. If the constraint λ is stricter, fewer concepts and fewer edges (both colexification edges contained and not contained in CLICS) are included in ColexNet. Thus, we observe a consistent drop in both micro and macro recall. On the other hand, we observe a decrease in #aw_colex (the number of identified edges absent from CLICS) when we increase λ, as CLICS edges are less likely to be removed than edges missing from CLICS: many languages can share the same colexification patterns for some concepts, whereas edges not in CLICS are not shared across many languages. This is confirmed by the steepness of the decrease in #aw_colex: from λ = 1 to 5, around 500 edges not in CLICS are removed per concept; when λ > 5, the rate of removal slows, suggesting the remaining identified colexification edges are more reliable. In summary, the high recall indicates that we successfully identify many ground-truth colexifications directly from PBC. It is important to note that CLICS' coverage is far from complete for low-resource languages: for many of them, fewer than 100 concepts are included in CLICS. Therefore, #aw_colex gives some indication of the discrepancy between CLICS and ColexNet, but many of the edges not in CLICS are actually correct. On the other hand, ColexNet is not immune to semantic errors (Peirsman and Padó, 2008), such as antonyms, hypernyms, or hyponyms, caused by co-occurrence or free translation. See §B for a detailed analysis of the identified colexifications.
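The micro and macro recall computations above can be sketched directly; the neighbor-dictionary input format is an assumption for illustration.

```python
def clics_recall(clics_nbrs, colexnet_nbrs):
    """clics_nbrs / colexnet_nbrs: {concept: set of colexified concepts}.
    Micro recall pools all gold edges of the common concepts;
    macro recall averages the per-concept recalls."""
    common = set(clics_nbrs) & set(colexnet_nbrs)
    hits, total, per_concept = 0, 0, []
    for s in common:
        t, c = clics_nbrs[s], colexnet_nbrs[s]
        hits += len(t & c)
        total += len(t)
        per_concept.append(len(t & c) / len(t))
    micro = hits / total
    macro = sum(per_concept) / len(per_concept)
    return micro, macro
```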

Roundtrip translation
We additionally use roundtrip translation (Dufter et al., 2018) to assess the quality of the multilingual representations. Let [l_0, l_1, l_2, ..., l_R] be a sequence of languages where l_0 = l_R is the source language and the l_i (1 ≤ i ≤ R−1, l_i ≠ l_0) are distinct intermediate languages. Roundtrip translation starts with a word w_0 in l_0 and iteratively finds the word w_r in language l_r (1 ≤ r ≤ R) that is closest to word w_{r−1} in language l_{r−1} in the embedding space. If w_0 = w_R, the R−1 "intermediate" words have representations similar to w_0 and represent the meaning of w_0. We compute the percentage of roundtrips for w_0 that are successful, i.e., w_0 = w_R (top-1 accuracy). In addition, we report top-5 and top-10 accuracies (i.e., w_0 is among the k = 5 or k = 10 nearest neighbors). We set R = 4 and l_0 = English and take 1,654 English words that occur in all embedding spaces as starting points w_0. For each trial, we randomly select three intermediate languages and then compute results for each of the 1,654 starting words. We run this experiment ten times and report averages, ensuring that the intermediate languages differ across trials.
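A single roundtrip can be sketched as follows; the per-language vocabularies, the flat embedding dictionary, and the helper names are illustrative assumptions.

```python
def cos(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def nearest(vec, lang, vocab, emb, topk=1):
    """The topk words of language `lang` closest to `vec`."""
    return sorted(vocab[lang], key=lambda w: -cos(vec, emb[w]))[:topk]

def roundtrip_success(w0, langs, vocab, emb, k=1):
    """One roundtrip over langs = [l_0, ..., l_R] with l_0 = l_R:
    hop through the nearest neighbor in each intermediate language;
    success if w0 is among the k nearest words back in l_0."""
    w = w0
    for lang in langs[1:-1]:
        w = nearest(emb[w], lang, vocab, emb)[0]
    return w0 in nearest(emb[w], langs[-1], vocab, emb, k)
```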

Verse retrieval
Similarly to Glot500-m, we use 500 English-aligned verses from PBC for verse retrieval. 1,250 languages are used (we remove 85 languages that cover fewer than 400 of the 500 verses). We represent each verse as the average of the embeddings of its units. Given a verse v_e in English, we find the most cosine-similar verses v_l in target language l. We then compute top-1, top-5, and top-10 accuracy for the returned ranking (i.e., whether the correct verse is among the top-1, top-5, or top-10 nearest neighbors) and average first over verses and then over languages.
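The retrieval metric can be sketched as follows; the verse dictionaries and helper names are illustrative assumptions, with verse representations computed as averages of unit embeddings as described above.

```python
def cos(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) *
                  (sum(x * x for x in b) ** 0.5))

def verse_vec(tokens, emb):
    """Verse representation: average of the embeddings of its units."""
    vecs = [emb[t] for t in tokens if t in emb]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def retrieval_topk(eng_verses, tgt_verses, emb, k=1):
    """Top-k accuracy: fraction of English verses whose aligned
    target verse (same verse ID) is among the k most cosine-similar
    target verses."""
    hits = 0
    for vid, toks in eng_verses.items():
        q = verse_vec(toks, emb)
        ranked = sorted(tgt_verses,
                        key=lambda v: -cos(q, verse_vec(tgt_verses[v], emb)))
        hits += vid in ranked[:k]
    return hits / len(eng_verses)
```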

Verse classification
We evaluate our multilingual embeddings on Taxi1500 (Ma et al., 2023). It provides 860/106/111 verses for the train/valid/test sets in more than 1,500 languages. Each verse is annotated with one of six classes: 'recommendation', 'faith', 'description', 'sin', 'grace', and 'violence'. We use a subset of 1,245 languages, those covered by both Taxi1500 and →ColexNet+. We perform zero-shot transfer by training a logistic classifier on the English train set and evaluating on the test sets of the other 1,244 languages. As in verse retrieval, we represent a verse as the average of its unit embeddings. We report macro F1, first averaged over verses (per language) and then averaged over languages.
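Why zero-shot transfer works here can be illustrated with a deliberately simplified stand-in: a nearest-centroid classifier instead of the paper's logistic classifier (the data format and helper names are assumptions). Because aligned units from different languages share one embedding space, class centroids fit on English verses remain meaningful for target-language verses.

```python
def centroid_zero_shot(train, test, emb):
    """train: list of (tokens, label) in the source language;
    test: list of (tokens, gold_label) in a target language.
    Fit class centroids on the source language, predict on the
    target language by cosine similarity; returns accuracy."""
    def vec(tokens):
        vs = [emb[t] for t in tokens if t in emb]
        return [sum(col) / len(vs) for col in zip(*vs)]
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / ((sum(x * x for x in a) ** 0.5) *
                      (sum(x * x for x in b) ** 0.5))
    grouped = {}
    for toks, lab in train:
        grouped.setdefault(lab, []).append(vec(toks))
    cents = {lab: [sum(col) / len(vs) for col in zip(*vs)]
             for lab, vs in grouped.items()}
    correct = 0
    for toks, gold in test:
        pred = max(cents, key=lambda lab: cos(vec(toks), cents[lab]))
        correct += pred == gold
    return correct / len(test)
```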
→ColexNet+ shows a large improvement over the baselines, especially on roundtrip translation and verse retrieval. The bad performance of BOW is expected, as mentioned above, because the English vocabulary does not necessarily transfer to other languages. →ColexNet+'s improvement over S-ID is probably due to the fact that S-ID only receives verse IDs as verse-level context information; token-level alignment information is not available to it. In other words, abstract context identifiers alone do not provide enough information to learn good multilingual embeddings for crosslingual transfer. When comparing →ColexNet+ with XLM-R and Glot500-m, we see a clear improvement in verse retrieval as well as verse classification. The major reason is that neither XLM-R nor Glot500-m is trained on all languages supported by →ColexNet+. Due to the lack of data in some low-resource languages, it is difficult to train a good language model for them. In contrast, →ColexNet+ demonstrates the potential of multilingual embeddings: a small multilingual corpus from which we can extract colexifications is already enough to support large-scale zero-shot transfer to low-resource languages by training embeddings. CLIQUE, N(t), and Eflomal-aligned achieve similar performance on roundtrip translation (top-1). However, for larger k (k = 5 or 10), CLIQUE performs better than N(t) and Eflomal-aligned. This is not surprising, since CLIQUE specifically creates cliques of tokens that are translations of each other in different languages, so the representations of translations should be similar. Eflomal-aligned also achieves good performance on roundtrip translation for large k (k = 5 or 10) and performance very close to →ColexNet+ in verse retrieval / classification. There are a few possible explanations. First, the word alignments in Eflomal-aligned are noisy because it operates on the token level, so any information hidden inside tokens (i.e., ngrams within each token) cannot be extracted and utilized (see also the discussion in Liu et al. (2023)). Therefore, increasing k in roundtrip translation offsets the influence of such alignment noise, resulting in better results. Second, we use the average of the token embeddings in a verse as the verse representation in verse retrieval / classification; this can mitigate the impact of unimportant tokens.
For verse classification, we find that the different embeddings achieve similar performance, except for S-ID. On the one hand, this indicates that S-ID, although it learns from abstract context information, cannot align words from different languages that refer to the same concepts well, thus preventing transfer from English to low-resource languages. On the other hand, it might indicate that classification is a less difficult task: it does not require the model to have equally good alignment for all concepts, as the model can achieve good results just by aligning the important concepts. Nevertheless, →ColexNet+ still achieves better results than the other baselines, suggesting it has better zero-shot transferability. See §E for complete results.

Analyses on ColexNet
Basic statistics. We find that ColexNet has one very large connected component along with a few small connected components. See Figure 4 in the Appendix for a visualization of the largest community in ColexNet (λ = 50). Therefore, within the largest component, there is always a path in the colexification graph between two concepts, even if they are only loosely related. Figure 2 shows the degree/betweenness centrality and degree distribution of ColexNet. From the figure, we can infer that the connectivity can be attributed to (1) a small group of concepts that are involved in many colexification patterns and (2) a small group of edges serving as "bridges" that connect concepts that are only rarely colexified. Therefore, ColexNet, a graph built from the colexification patterns identified across many languages, approximately forms a small-world or scale-free (Barabási and Bonabeau, 2003) network. See §A.2 for graph-related statistics of ColexNet under different λ.
Communities. We use the Louvain algorithm (Blondel et al., 2008) to find communities in ColexNet and identify 288 communities. Each community forms a cluster of semantically related concepts. Figure 3 gives the example of community #29: it contains several concepts related to <wind>, <storm>, and <wave>. We see that <wind> is often colexified with <blow> (wind blows), with <wave> (waves are caused by wind), and with <violent> (winds can be fierce). At the center of a community, we often find a densely connected clique, indicating that these connections are strong in many languages. Some concepts, located at the fringe of the community and connected with one of the densely connected concepts in the center, are less related to the semantic field of the community and serve as "bridges" to other communities. See §C for further details on the identified communities.

Transfer learning beyond English
NLP research in general, and even typological studies, are frequently conducted from an English-centric perspective. To reduce this bias and further verify our multilingual embeddings' transfer capability, we additionally use all available languages (1,245 languages) as the query/train languages for the retrieval and classification tasks. To this end, we conduct large-scale experiments that cover 1,245 × 1,245 transfer directions. The setup is the same as in §4, with each language taking the role of English as the query/train language. We again represent each verse as the average of the embeddings of its units. For each language, we calculate the average top-k (k = 1, 5, or 10) accuracy for verse retrieval and the macro F1 for verse classification over all other languages.
In Table 3, we list the transfer performance of three major languages that are typologically different from English: Arabic (arb), Russian (rus), and Chinese (zho); and the three languages with the worst overall performance: Apinayé (apn), Mündü (muh), and Salt-Yui (sll). See §F for complete results for all languages. For the high-resource languages, the performance is close to that achieved for English (see Table 2), indicating that the ngrams are well aligned and →ColexNet+ has good transfer ability. Chinese performs better than Arabic and Russian. Possible reasons are as follows: (1) Both Arabic and Russian are morphologically rich whereas Chinese is not; morphological variation makes finding aligned ngrams in the forward pass harder, with a negative impact on performance. (2) To prevent bad tokenization for Chinese, we allow all ngrams (combinations of consecutive characters of unlimited length) in a verse to be candidates in the forward pass; this setting gives ngrams more freedom, so better results are expected. The three low-resource languages diverge morphologically and typologically from most high-resource languages. Apinayé and Mündü seem to frequently use several consecutive whitespace-tokenized syllables to express a single concept, which makes finding the correct alignments much harder. Salt-Yui, on the other hand, appears to be highly ambiguous because its writing does not reflect its contrastive tones (Irwin, 1974). We hypothesize that such ambiguity can negatively influence performance. See §F for an analysis of the factors that can influence transfer performance.

Table 3: Verse retrieval/classification for three high-resource languages, the three worst-performing languages, and the average over all languages (avg.). We also report BOW results for verse classification (in parentheses), which serve as the random baseline. In contrast to the good performance for Arabic (arb), Russian (rus), and Chinese (zho), Apinayé (apn), Mündü (muh), and Salt-Yui (sll) each pose specific difficulties for inducing reliable colexification patterns.

Conclusion
In this work, we present the multilingual graphs ColexNet and ColexNet+, based on colexifications extracted from a highly parallel corpus. Comparing against CLICS, we show that ColexNet recovers many gold-standard patterns. In addition, we analyze the structure of ColexNet and show that it nearly forms a scale-free graph, with many communities of semantically related concepts. −−−−−−−→ColexNet+ outperforms several approaches, including multilingual embeddings and pretrained models, on three downstream tasks. This indicates that embeddings learned from colexification graphs improve crosslingual transfer, especially for low-resource languages for which pretraining good models is often infeasible. Finally, our embeddings exhibit robust transfer performance across many different source languages.

Limitations
Theoretically, one could identify, explore, and analyze colexification patterns from any parallel corpus and construct colexification graphs using the methods proposed in this paper. In this work we use the PBC, a genre-specific parallel corpus, which can limit some of the concepts to the religious domain. Nevertheless, the goal of this work is to explore colexification patterns in as many languages as possible, including many low-resource languages, without relying on any external resources. Therefore, the PBC corpus is a good fit for us.
We conduct extensive experiments to verify the crosslingual transfer capability of the multilingual embeddings learned on ColexNet+. However, some experiments are in-domain (the evaluation tasks are still related to the Bible), e.g., verse retrieval and verse classification. The main reason is that we want to test the embeddings' performance on all our supported languages. Unfortunately, as far as we know, evaluation datasets that cover such a wide range of languages, including low-resource languages, are missing in the community. Some datasets, for example Tatoeba, support hundreds of languages but contain many concepts, e.g., pizza, that do not occur in the Bible. Therefore, we do not evaluate our embeddings on those datasets.

A Choice of hyperparameters and discussion
A.1 Forward/backward pass Two hyperparameters in the forward and backward passes for finding colexification patterns can influence the results: (1) the maximum number of iterations M for a given concept in each language and (2) the threshold α for the minimum cumulative coverage of the set of identified ngrams. We set M = 3 and α = 0.9 as default values for all involved computations. Our setting of M differs from Conceptualizer (Liu et al., 2023): Conceptualizer sets M to 5 whereas we set it to 3, for the following reason. We are searching for colexification patterns with high accuracy, which requires identifying the target-language ngrams that instantiate the colexifications with high certainty. Based on empirical exploration, we find that when M is large (e.g., > 3), we can include less accurate or even unrelated ngrams (because those ngrams are rare and occur in the same verses as the concept, as also discussed by Liu et al. (2023)). By setting M = 3 in the forward pass, we can be more confident that the identified target-language ngrams are highly correlated with the concept; this setting also achieved the best performance for a few examples in our manual inspection. For the minimum cumulative coverage threshold α, we directly follow the Conceptualizer setting of 0.9, which ensures that the forward and backward passes find enough ngrams/concepts while guaranteeing the quality of the associations.
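The cumulative-coverage criterion of the forward pass might be read as a greedy set-cover loop bounded by M iterations. The sketch below is an illustrative interpretation of the description above (the function name and data layout are assumptions), not the exact Conceptualizer algorithm:

```python
def select_ngrams(concept_verses, ngram_verses, alpha=0.9, max_iter=3):
    """Greedily pick target-language ngrams until the selected set covers
    at least an alpha fraction of the verses containing the concept,
    or until max_iter ngrams have been chosen.

    concept_verses: set of verse ids in which the concept occurs.
    ngram_verses:   dict mapping candidate ngram -> set of verse ids.
    """
    remaining = set(concept_verses)
    total = len(remaining)
    selected = []
    for _ in range(max_iter):
        # stop once cumulative coverage reaches the threshold
        if 1 - len(remaining) / total >= alpha:
            break
        # pick the ngram covering the most still-uncovered verses
        best = max(ngram_verses, key=lambda n: len(ngram_verses[n] & remaining))
        if not ngram_verses[best] & remaining:
            break
        selected.append(best)
        remaining -= ngram_verses[best]
    return selected
```

With M = 3 and α = 0.9, at most three ngrams are selected, and selection stops early once 90% of the concept's verses are covered.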

A.2 ColexNet/ColexNet+ construction
In the construction of ColexNet and ColexNet+, we have an important hyperparameter λ: the minimum number of languages in which a colexification must be attested for its edge to be included. As shown in Table 4, different λ influence the number of nodes and edges in ColexNet as well as the number of connected components. Both #edges and degree decrease dramatically from λ = 1 to 5, which might indicate that (1) increasing λ removes incorrectly identified colexification patterns (e.g., due to verse-level misalignment) and (2) some colexification patterns are specific to very few languages. Because of the many plausibly incorrect edges between concepts, ColexNet forms one large connected graph when λ = 1, 5, or 10. When λ is larger (e.g., 50 or 100), the graph is no longer connected because many less reliable edges are removed.
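A minimal sketch of the λ filter could look like this, assuming colexification evidence has already been collected per language (the data layout is hypothetical):

```python
from collections import defaultdict

def build_colexnet(colexifications, lam=50):
    """Build a concept graph keeping only edges attested in >= lam languages.

    colexifications: dict mapping a language id to a set of
                     (concept, concept) pairs colexified in that language.
    Returns (nodes, edges) where edges maps a sorted concept pair to the
    number of supporting languages.
    """
    support = defaultdict(set)
    for lang, pairs in colexifications.items():
        for a, b in pairs:
            # normalize edge direction so (a, b) and (b, a) coincide
            support[tuple(sorted((a, b)))].add(lang)
    edges = {e: len(langs) for e, langs in support.items() if len(langs) >= lam}
    nodes = {c for e in edges for c in e}
    return nodes, edges
```

Raising `lam` prunes edges supported by few languages, which is exactly how the graph fragments into multiple connected components at λ = 50 or 100.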
The influence of λ also applies to ColexNet+, since edges removed from ColexNet also affect ColexNet+: pairs of edges that were expanded from removed ColexNet edges are not included in ColexNet+. We show the number of nodes and edges as well as the average degree of ColexNet+ under different λ in Table 5. The changes in degree with increasing λ are not as prominent as in ColexNet (shown in Table 4), mainly because ColexNet+ has far more nodes than ColexNet. Most nodes in ColexNet+ are associated with only around 3 other nodes, indicating that many target-language ngrams colexify about three concepts, since most nodes in ColexNet+ are ngram nodes. Each concept, however, is frequently associated with more than 3 concepts in ColexNet: the average degree of ColexNet (λ = 50) is around 5.
The number of nodes and edges also influences the random walks we use for sampling, and thus the quality of the multilingual embeddings trained on ColexNet+ with Node2Vec (Grover and Leskovec, 2016). Therefore, we conduct experiments using embeddings trained on ColexNet+ under different λ. As in §4, we evaluate on roundtrip translation, verse retrieval, and verse classification. For roundtrip translation, we again set l0 = English and use the 2,221 words that occur in all embeddings as start points. For verse retrieval (resp. classification), we also use English as the query (resp. train) language and report top-k accuracy (resp. macro F1 score), averaged over all languages. Results are shown in Table 6.
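The random-walk sampling that turns the graph into training sequences can be illustrated with uniform walks. Note that Node2Vec proper biases the transitions with its return/in-out parameters p and q, which this simplified sketch omits; the names are illustrative:

```python
import random

def random_walks(adj, walks_per_node=2, walk_length=5, seed=0):
    """Generate uniform random walks over an adjacency dict.

    adj: dict mapping each node to the set of its neighbors.
    The resulting node sequences can be fed to a skip-gram trainer,
    analogous to how Node2Vec trains embeddings on sampled walks.
    """
    rng = random.Random(seed)
    walks = []
    for node in sorted(adj):
        for _ in range(walks_per_node):
            walk = [node]
            while len(walk) < walk_length:
                neighbors = adj.get(walk[-1])
                if not neighbors:
                    break
                walk.append(rng.choice(sorted(neighbors)))
            walks.append(walk)
    return walks
```

Fewer edges (a higher λ) constrain the walks to more reliable neighborhoods, which is why λ interacts with embedding quality.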
We see different trends between λ and performance across tasks: (1) roundtrip translation performance is positively correlated with λ, with the best result at λ = 100; (2) verse retrieval performance is also positively correlated with λ; (3) verse classification performance is generally negatively correlated with λ, with the best result at λ = 1. These trends can be explained as follows. Roundtrip translation and verse retrieval, compared with verse classification, require better alignment quality between concepts and ngrams. When λ is small, some incorrect edges are included in the graph; these edges introduce noise into the sampling, yielding slightly noisy embeddings and hurting performance. For verse classification, the results suggest that fewer out-of-vocabulary ngrams in the embeddings (a higher λ induces fewer ngrams in ColexNet+) give slightly better performance. Moreover, λ has a more pronounced impact on roundtrip translation and verse retrieval than on verse classification. In summary, the results support our choice of λ = 50, a relatively large value, in the main body of this paper, as it offers very competitive results compared to other choices while not losing many interesting patterns.

B Investigation of identified colexifications
In §4.2, we show that we identify many ground-truth colexification patterns compared with CLICS. However, quite a few of our colexification patterns are not present in CLICS. Therefore, we conduct a qualitative investigation of those patterns. We classify each pattern (a colexification edge in ColexNet between two concepts) into one of the following categories: (1) full colexification, (2) partial colexification, and (3) incorrect colexification.
Full colexification. Full colexification means that a word in a language directly colexifies two concepts. We list four examples of colexifications not included in CLICS but verified by us. An obvious example is that <ground> and <land> are colexified in many languages, e.g., through 土地 in Japanese (jpn) and 大地 in Chinese (zho). <early> and <tomorrow> are frequently colexified in Turkic languages, e.g., through $эртен in Southern Altai (alt).
Partial colexification. Some concepts may be expressed using multiple lexemes, forming a compound, and a part of the compound may also occur in the expression of a different concept. Note that in some languages (such as German) a compound is written without any spaces, whereas in others (such as English) there is a space between the parts. In either case, these are considered compounds, since all the separate elements together constitute one concept. This must not be confused with co-occurrence, where the two concepts themselves co-occur. For example, in Tatar (tat), the two colors <purple> and <scarlet> are partially colexified through $кызыл$, because <purple> is куе кызыл (literally 'thick red'), which contains the part кызыл meaning 'red, scarlet'. Such partial relations also frequently occur in numbers. For example, empat belas (resp. 十四), which means 14, and empat puluh (resp. 四十), which means 40, are partially colexified in Indonesian (ind) (resp. Chinese (zho)), as empat (resp. 四) means 4. Since languages such as Chinese and German construct compounds without inserting spaces between the lexemes, we also observe many partial colexifications in Chinese and Germanic languages. In summary, many identified colexification patterns in ColexNet belong to this category, which explains why we found many patterns that do not exist in CLICS: CLICS only includes full colexification patterns.
Incorrect colexification. As an automatic statistical method, our approach is not immune to errors. Typically, we find that incorrectly identified colexifications are mainly due to two reasons: (1) co-occurrence and (2) free translation. We list some incorrectly identified colexifications in Table 7. Co-occurrence means that two particular concepts co-occur so often that the algorithm wrongly establishes a connection between them. For example, we found that <four> and <twenty> are associated in Catalan (cat) because the ngrams $quatre$ and $vint, which refer to the two concepts respectively, co-occur very frequently in the PBC. Similarly, <left> and <right> for Nogai (nog) and <want> and <know> for Low German (nds) belong to this type of error. Free translation means that the translation is not done word by word, so that the corresponding word for a specific concept does not occur in the same sentence.
In this case, the algorithm has no chance of finding the corresponding ngram, which would ideally align with the intended concept. Free translation is very common in the Bible because of its nature as a religious text. For example, in Catalan (cat), the English verse #40012048 starts with "But to the man who told him", but the Catalan translation starts with "Però ell va contestar al qui deia això", which means "But he answered the one who said this": the concept <man> does not occur in the Catalan verse and the concept <answer> does not occur in the English verse. Similarly, the Chinese word 十万 means one hundred thousand, i.e., 100,000 (with 十 being 10 and 万 being 10,000). As the formation of this number expression in Chinese differs from its English counterpart, the algorithm wrongly associates <hundred> and <thousand>.

C Communities of ColexNet
There are 288 communities in total detected in ColexNet (λ = 50) by the Louvain community detection algorithm (Blondel et al., 2008). Its two important hyperparameters, resolution and random seed, are set to 0.1 and 114514, respectively. As mentioned in §5, each community is a cluster of concepts that are semantically related to each other. We create a demonstration website to show the subnetworks of each concept and the community figures. For illustration purposes, we randomly select 15 communities with more than 10 nodes; see Figure 5 for their visualizations.

D Influence of language families & areas
We create subnetworks specific to each language family and each area. We consider six language families that have more than 50 languages in the PBC: Austronesian (aust), Atlantic-Congo (atla), Indo-European (indo), Nuclear Trans New Guinea (nucl), Otomanguean (otom), and Sino-Tibetan (sino). We consider five areas: South America (SA), North America (NA), Eurasia, Africa, and Papunesia. For the subnetwork of each language family (resp. area), we keep only the ColexNet edges that occur in that family (resp. area). To quantify the agreement of community structure, we use the adjusted Rand index (ARI) (Hubert and Arabie, 1985; Steinley, 2004), similar to Jackson et al. (2019). We also compute the ARI between ColexNet and each subnetwork. Figures 6 and 7 show pairwise ARIs for language families and areas. Clearly, no single language family subnetwork can represent the global colexification patterns encoded in ColexNet, since no family's ARI with ColexNet is high. In addition, no two language families have a similar community structure according to ARI: the pair with the highest ARI, atla-aust, reaches only ARI = 0.5. In comparison, area-specific subnetworks generally have larger pairwise ARIs. The two areas Africa and Papunesia have a very high mutual ARI of 0.76 and also high ARIs with ColexNet (0.78 and 0.80). This can be explained by (1) the many languages in these two areas, which means more possible colexifications are included in the subnetworks, and (2) the high diversity (in terms of colexification) of the languages spoken there. In summary, the relatively low ARIs between families and areas also suggest that many colexification patterns are specific to a small group of languages (either in a specific language family or in an area).
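The adjusted Rand index used above can be computed directly from the contingency counts of two partitions. A minimal stdlib sketch (function name illustrative):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same node set.

    labels_a, labels_b: cluster labels for the same nodes, in the same order.
    Returns 1.0 for identical partitions and ~0.0 for random agreement.
    """
    n = len(labels_a)
    # contingency counts: how many nodes share cluster i in A and cluster j in B
    pairs = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

Unlike the raw Rand index, the adjustment subtracts the agreement expected by chance, so family subnetworks with very different community counts remain comparable.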

E English-centric transfer learning
We have shown the English-centric transfer performance of verse retrieval and verse classification, averaged over languages, in Table 2. We believe it is also important to have a fine-grained view of the results for individual languages, to better understand the crosslingual transfer capability of −−−−−−−→ColexNet+. Therefore, we show the transfer performance (sentence retrieval and sentence classification) of each individual language, clustered by its language family, in Figure 8. Globally, we see that results vary not only across language families but also within each family. We find in Figure 8 (top) that, although a top-10 accuracy higher than 0.5 is achieved for all languages, the average retrieval accuracy in the Indo-European language family is slightly better than in other families with many languages (e.g., the Sino-Tibetan or Otomanguean families). We speculate this is because other Indo-European languages can learn more accurate alignments, as our source language, English, belongs to the same family. Better alignments improve the quality of the embeddings in a language and therefore the transfer performance.
The trend in classification, shown in Figure 8 (bottom), is slightly different: the average F1 remains stable at around 0.5 for almost all language families, with less variance within each family. This is evidence for our conjecture that classification is the less difficult task: apparently, good performance can be obtained as long as the words referring to the most important concepts, those highly associated with specific classes, are well aligned.
In summary, the good performance indicates that −−−−−−−→ColexNet+ assigns similar representations to ngrams that refer to the same concept, thus improving crosslingual transfer.
Figure 5: Visualizations of 15 randomly selected communities with more than 10 nodes, out of the 288 communities detected in ColexNet. Each community forms a cluster of concepts that are semantically related to each other. E.g., community #60 is related to the concept <hunger>, and community #73 to the concept <money>.

F Beyond English-centric transfer
We show the complete transfer performance using any language as the train/query language (1,245 languages in total; we filtered out languages whose train or test sets are very small).
The results are shown in Tables 10, 11, 12, 13, and 14. We hypothesize that the quality of the identified colexifications can influence transfer performance. Some languages, whose morphology, typology, or conceptualization differ greatly from other languages, might pose difficulties for finding reliable colexification patterns and thus be detrimental to crosslingual transfer. To quantify this, we compute the average number of colexification patterns per ngram (avg_colex) for each language. That is, for language l, we compute the average number of neighbors of an ngram in ColexNet+. The neighbors of an ngram node are concept nodes, which indicate the concepts this ngram can refer to. The higher avg_colex is for a language, the more polysemous or ambiguous its ngrams tend to be. Of course, the extracted colexification patterns are not always correct, due to verse-level misalignment, free translation, or language-specific properties such as morphology. Therefore, avg_colex can, to some degree, indicate how difficult it is to find correct alignments.
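Under the edge-list representation assumed below (a hypothetical (ngram, language, concept) layout, not the paper's actual data format), avg_colex reduces to the mean number of distinct concept neighbors per ngram node:

```python
from collections import defaultdict

def avg_colex(edges, language):
    """Average number of concept neighbors per ngram node for one language.

    edges: iterable of (ngram, language, concept) links in ColexNet+.
    Returns 0.0 if the language has no ngram nodes.
    """
    neighbors = defaultdict(set)
    for ngram, lang, concept in edges:
        if lang == language:
            neighbors[ngram].add(concept)
    if not neighbors:
        return 0.0
    return sum(len(c) for c in neighbors.values()) / len(neighbors)
```

For example, a German-style $gift$ ngram linked to both <poison> and <present> would contribute 2 to the numerator, pushing avg_colex up.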
We list the number of target-language ngrams in ColexNet+ (#ngrams) as well as avg_colex for the languages discussed in §5.2: Arabic (arb), Russian (rus), Chinese (zho), Apinayé (apn), Mündü (muh), and Salt-Yui (sll), as well as the average over all languages, in Table 8. The three high-resource languages, although typologically and morphologically different from each other, show similar trends: more ngrams are included in ColexNet+ while avg_colex is below average. This might indicate that these languages are less ambiguous and the extracted colexifications are mostly reliable, which explains the good crosslingual performance when they are used as the train/query languages. The three worst-performing languages show exactly the inverse trend, indicating that it is harder to identify reliable colexifications for them, and thus that performance suffers when they serve as the source languages.
To further test our hypothesis, we compute the Pearson correlation between the performance (classification F1 score and retrieval accuracy) and avg_colex. The results are shown in Table 9: #ngrams is weakly positively correlated with the performance while avg_colex is negatively correlated. However, it is important to note that the correlations are not high: quite a few languages with small #ngrams but large avg_colex perform quite well when used as source languages for large-scale transfer. For example, Bislama (bis), whose #ngrams is only 1,202 but whose avg_colex is 4.81, achieves good performance: 0.41, 0.46, 0.66, and 0.73 for classification and retrieval top-1, top-5, and top-10, respectively. We speculate this is because Bislama is highly influenced by English (Tryon, 1987), so the extracted patterns are reliable since the concepts are represented by English lemmata. We leave further exploration of finding reliable colexifications from a parallel corpus to future research.

Table 8: Number of target-language ngrams in ColexNet+ (#ngrams) and the average number of colexified concepts per ngram (avg_colex) for Arabic (arb), Russian (rus), Chinese (zho), Apinayé (apn), Mündü (muh), and Salt-Yui (sll), as well as the average over all languages. The three worst performing languages have fewer #ngrams but larger avg_colex than the average over all languages.

            c      r1     r5     r10
#ngrams     0.20   0.28   0.25   0.24
avg_colex  -0.18  -0.25  -0.21  -0.19

Table 9: Pearson correlations between #ngrams / avg_colex and the transfer performance (c: classification F1 score; r1/r5/r10: retrieval top-1/top-5/top-10 accuracy). All values are statistically significant at p = 0.01.
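The correlations in Table 9 are standard sample Pearson coefficients; for completeness, a minimal stdlib version (the function name is illustrative):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Here `xs` would be per-language #ngrams or avg_colex values and `ys` the matching per-language scores (classification F1 or retrieval accuracy).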
To sum up, the quality of the colexification patterns extracted for a language is closely related to the transfer performance when that language serves as the train/query language. Due to various language-specific properties, the model can have difficulties inducing reliable colexification patterns.

Figure 3 :
Figure 3: Community #29. Line thickness indicates the number of languages that instantiate a colexification.

Figure 4 :
Figure 4: Visualization of the largest community, which contains 2,581 of the 2,591 nodes in ColexNet. Each node is a concept and each edge indicates that the two concepts are colexified in at least 50 languages.

Figure 6 :
Figure 6: Pairwise ARIs between language-family-specific subnetworks. Each subfigure contains pairwise ARIs between one family (indicated by the color: atla, aust, indo, nucl, otom, sino, base) and all other families (indicated on the edges). The ARIs are averaged over 50 runs of the Louvain algorithm with different random states. Pairs of the same family, e.g., indo-indo, are not shown because their ARI is always 1. base is the graph including all edges, i.e., ColexNet. Note that the scale is adjusted for each family individually.

Figure 7 :
Figure 7: Pairwise ARIs between area-specific subnetworks. Each subfigure contains pairwise ARIs between one area (indicated by the color: Africa, Eurasia, Papunesia, NA, SA, base) and all other areas (indicated on the edges). The ARIs are averaged over 50 runs of the Louvain algorithm with different random states. Pairs of the same area, e.g., Africa-Africa, are not shown because their ARI is always 1. base is the graph including all edges, i.e., ColexNet. Note that the scale is adjusted for each area individually.

Figure 8:
Figure 8: Top-10 accuracy of verse retrieval (top) and F1 for verse classification (bottom) by language family for −−−−−−−→ColexNet+. Each small dot represents a language, each large dot …

Table 4 :
Basic statistics of ColexNet under different thresholds λ. We report the number of nodes (#nodes), the number of edges (#edges), the average degree per node (degree), and the number of connected components (#components).

Table 5 :
Basic statistics of ColexNet+ under different thresholds λ. We report the number of nodes (#nodes), the number of edges (#edges), and the average degree per node (degree).
…character 愛), as the character means both <love> and <wish>. Lastly, in Western Frisian (fry), <dragon> and <snake> are colexified through the word $draek$, which we manually verified in the PBC. It is worth noting that there is another word, slang, which denotes <snake> in Western Frisian.

Table 7 :
Examples of incorrectly identified colexifications in ColexNet.


Table 10 :
Transfer performance using other languages as the train/query language (Part I).

Table 11 :
Transfer performance using other languages as the train/query language (Part II).

Table 12 :
Transfer performance using other languages as the train/query language (Part III).

Table 13 :
Transfer performance using other languages as the train/query language (Part IV).