Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness

Colexification refers to the linguistic phenomenon where a single lexical form is used to convey multiple meanings. By studying cross-lingual colexifications, researchers have gained valuable insights into fields such as psycholinguistics and cognitive sciences (Jackson et al., 2019; Xu et al., 2020; Karjus et al., 2021; Schapper and Koptjevskaja-Tamm, 2022; François, 2022). While several multilingual colexification datasets exist, there is untapped potential in using this information to bootstrap datasets across such semantic features. In this paper, we aim to demonstrate how colexifications can be leveraged to create such cross-lingual datasets. We showcase curation procedures which result in a dataset covering 142 languages across 21 language families across the world. The dataset includes ratings of concreteness and affectiveness, mapped with phonemes and phonological features. We further analyze the dataset along different dimensions to demonstrate the potential of the proposed procedures in facilitating further interdisciplinary research in psychology, cognitive science, and multilingual natural language processing (NLP). Based on initial investigations, we observe that i) concept pairs that are closer in concreteness/affectiveness are more likely to colexify; ii) certain initial/last phonemes are significantly correlated with concreteness/affectiveness within language families, such as /k/ as the initial phoneme in both Turkic and Tai-Kadai being correlated with concreteness, and /p/ in Dravidian and Sino-Tibetan being correlated with valence; iii) the type-to-token ratio (TTR) of phonemes is positively correlated with concreteness across several language families, while the length of phoneme segments is negatively correlated with concreteness; iv) certain phonological features are negatively correlated with concreteness across languages. The dataset is made publicly available online for further research.
Introduction
Semantic typology studies cross-lingual semantic categorization (Evans et al., 2010). Within this area, the term "colexification" was first introduced and used by François (2008) and Haspelmath (2003) to create semantic maps. The study of colexifications focuses on cross-lingual colexification patterns, where the same lexical form is used in distinct languages to express multiple concepts. For instance, mapu in Mapudungun and apakee in Ignaciano both express the concepts EARTH and WORLD (Rzymski et al., 2020). Colexifications have been found to be pervasive across languages and cultures. The investigation of colexifications has led to interesting findings across different fields, such as linguistic typology (Schapper and Koptjevskaja-Tamm, 2022), psycholinguistics (Jackson et al., 2019), and cognitive science (Gibson et al., 2019), but remains relatively unexplored in NLP (Harvill et al., 2022; Chen et al., 2023).
In recent years, with the increasing popularity of automatic methods and big data in linguistics, datasets such as Concepticon (List et al., 2022) and BabelNet (Navigli and Ponzetto, 2012) have been developed, affording large-scale cross-lingual semantic comparisons. The Database of Cross-Linguistic Colexifications (CLICS³) (Rzymski et al., 2020) was created based on the Concepticon concepts, including 4,228 colexification patterns across 3,156 languages, to facilitate research in colexifications. Studies have also curated large-scale colexification networks from BabelNet, consisting of over 6 million synsets across 520 languages (Harvill et al., 2022; Chen et al., 2023).
The study of phonemes and phonological features has furthermore been essential to, e.g., addressing the problem of non-arbitrariness in languages and investigating universals of spoken languages (de Varda and Strapparava, 2022). Studies such as Gast and Koptjevskaja-Tamm (2022) demonstrate genealogical stability (persistence) and susceptibility to change (diffusibility) by studying the patterns of phonemes/phonological forms and colexifications across European languages. However, this study is limited to a small range of languages, and the investigated concepts are restricted to the 100-item Swadesh list (Swadesh, 1950). With our proposed procedures, a wider range of concepts and phonological forms across language families can be curated.
In this paper, we create a synset graph based on multilingual WordNet (Miller, 1995) data from BabelNet 5.0. We then develop a cross-lingual dataset that includes ratings of concreteness and affectiveness, as this approach yields more comprehensive data than using CLICS³. In addition, we meticulously select and organize phonemes and phonological features for the lexicons that represent the concepts. Our methodology for data creation is not limited to the constructed dataset, as it has potential for broader applications. We showcase the versatility of our approach through analysis across various dimensions, and make our dataset freely available.
Jackson et al. (2019) conducted a study on cross-lingual colexifications related to emotions and found that different languages associate emotional concepts differently. For example, Persian speakers associate GRIEF closely with REGRET, while Dargwa speakers associate it with ANXIETY. The variations in cultural background and universal structure in emotion semantics provide interesting insights into the field of NLP. Bao et al. (2021) analyzed colexifications from various sources, including BabelNet, Open Multilingual WordNet, and CLICS³, and demonstrated that there is no universal colexification pattern.
In the field of NLP, Harvill et al. (2022) constructed a synset graph from BabelNet to boost performance on the lexical semantic similarity task. More recently, Chen et al. (2023) use colexifications to construct language embeddings and further model language similarities. Our goal is to utilize colexifications to construct cross-lingual datasets, including diverse ratings and phonological forms and features, to support further research, particularly in low-resource languages where norms and ratings are notably scarce.
Norms and Ratings A large number of words in high-resource languages have been assigned norms and ratings by researchers in psychology (Brysbaert et al., 2014; Warriner et al., 2013). Norms and ratings of words are essential components in psychology and linguistics, and have recently become widely used in NLP. Norms refer to the typical frequency and context in which words are used in a particular language, while ratings represent subjective judgements of individuals on various dimensions such as concreteness, valence, arousal, and imageability. These norms and ratings can improve performance on downstream tasks, such as sentiment analysis, emotion recognition, word sense disambiguation, and affective computing (Kwong, 2008; Tjuka et al., 2022; Strapparava and Mihalcea, 2007; Mohammad and Turney, 2010).
The study of concreteness and abstractness of concepts is interdisciplinary and spans various fields, including linguistics, psychology, psycholinguistics, and neurophysiology (Solovyev, 2021). Concrete concepts are those that can be perceived by the senses, such as CAT and MOUNTAIN, while abstract concepts, like RELATIONSHIP and UNDERSTANDING, cannot be perceived by the senses. Brysbaert et al. (2014) conducted a study on concreteness ratings for 37,058 English words and 2,896 two-word expressions, involving over 4,000 participants, which has provided insights across various linguistic disciplines. The concreteness ratings are based on a scale of 1 (abstract) to 5 (concrete). These ratings have been used in conjunction with various tasks such as classification of metaphoricity (Haagsma and Bjerva, 2016) and animacy (Bjerva, 2014), as well as cultural studies (Berger and Packard, 2022).
Apart from concreteness, affective ratings are also essential for interdisciplinary research in psychology, linguistics, and NLP. The affective norms for English words (ANEW) dataset, providing ratings of valence, arousal, and dominance for English words, has been widely used in both psychology and NLP research (Bradley and Lang, 1999). Subsequently, the affective norms for French words (FAN) and the affective norms for German words (ANGST) datasets, providing similar affective ratings for French and German words, respectively, have also been developed (Monnier and Syssau, 2014; Schmidtke et al., 2014). The Spanish version of ANEW was developed by Redondo et al. (2007). Extending the English ANEW, Warriner et al. (2013) cover nearly 14,000 English lemmas, providing ratings for valence (the pleasantness of a stimulus), arousal (the intensity of emotion provoked by a stimulus), and dominance (the degree of control exerted by a stimulus). For creating our dataset, we use the ratings from Warriner et al. (2013); see details in Section 3.
The data for linguistic norms and ratings is usually collected only for one language. For low-resource languages, such data is obviously lacking. Using our procedures, the norms and ratings can be bootstrapped for low-resource languages by sharing cross-lingual concepts through colexifications.
Phonemes and Phonological Features While direct phonetic comparison across languages is difficult, a common practice in comparing phonological characteristics across languages is to combine similar sounds into one multilingual phone set (Salesky et al., 2020). While more advanced methods for phonological typology do exist, e.g., Cotterell and Eisner (2017, 2018), a basic approach to phonology is found via the International Phonetic Alphabet (IPA), which classifies sounds based on general phonological properties. In this vein, WikiPron was created as an open-source tool for mining phonemic pronunciation data from Wiktionary and remains under continuous maintenance (Lee et al., 2020). To date, it contains more than 1.8 million word/pronunciation pairs across 543 languages. The pronunciations are given in IPA, and segmented in a way that IPA diacritics can be properly recognized (Lee et al., 2020).
Demonstrating that phonological features outperform character-based models, PanPhon was created and used for various NER-related tasks (Mortensen et al., 2016). To date, PanPhon is a database relating over 5,000 IPA segments to 24 subsegmental articulatory features. It has been used for various purposes, such as the cross-modal and cross-lingual study of iconicity in languages (Zhu et al., 2021), and cross-linguistic phonosemantic correspondence using a deep-learning framework (de Varda and Strapparava, 2021).
In this paper, we build upon this work by diving into the relationship between phonological features and the concreteness and affectiveness of sense lemmas across a wide set of languages. The paper is inspired by findings that the sounds of words can influence their meaning and emotional impact. For example, words with round vowel sounds are often associated with positive emotions, while harsher, more angular sounds can convey negative emotions (Ćwiek et al., 2022). This study aims to initiate research on the intricate interplay between sound and affective/abstract meanings.

Dataset Curation
A colexification pattern refers to a case where two concepts are colexified, such as DAD-POPE shown in Figure 1. Specifically, a colexification is an instance of a colexification pattern, such as far in Danish, as shown in Table 1.
In order to leverage colexifications to create a cross-lingual dataset incorporating norms and ratings in psychology and other fields, we propose the following procedures for data curation and creation, as illustrated in Fig. 2.

Building the Synset/Concept Graph In WordNet, a sense is a discrete representation of one aspect of the meaning of a word. For example, the lemma bank can either mean the sense FINANCIAL INSTITUTION or the sense SLOPING MOUND. The set of near-synonyms for a sense is called a synset, which is a primitive in WordNet (Jurafsky and Martin, 2023). Synsets are groups of words sharing the same concept. In order to construct colexification networks, i) the WordNet synsets are extracted from BabelNet; ii) for each synset, all the included word senses with their lemmas in the respective language are elicited; iii) finally, the sets of synsets sharing the same lemmas are extracted to represent a synset graph, with the nodes being the synsets and the edges being the lemmas and their languages. The construction of a synset graph from BabelNet was first formalized by Harvill et al. (2022), and adapted by Chen et al. (2023) to incorporate information on the languages and lemmas; see Algorithm 1.
We adopt the algorithm presented in Chen et al. (2023) to construct a large-scale synset graph from WordNet synsets for our study. The difference between Chen et al. (2023) and Harvill et al. (2022) lies in the addition of G_s at lines 3 and 9, as shown in Algorithm 1. G_s affords the construction of colexification patterns and the modeling of language relations.
Algorithm 1 (Construction of Colexification Graph): Given a set of languages L and corresponding vocabularies V, create graph edges between all colexified synset pairs (nodes), where each edge consists of the set of tuples of lemmas and their language.

A WordNet synset comprises a sense word, a part-of-speech (POS) tag, and a sense number, e.g., dad#n#1. The sense numbers indicate the prevalence of the use of senses, with the most frequently used sense labeled 1. The frequency of use is determined by how often a sense is tagged in semantic concordance texts. Our assumption is that the mean score of lexicon ratings, annotated by multiple humans across domains and languages, represents the ratings for the most prevalent sense. However, when it comes to cross-lingual synset-to-concept mapping, there may be variations in the sense annotations between languages. Suppose that in French the main sense of KNOT is knot#n#4, which refers to a unit of speed, while in English, the annotation for KNOT likely refers to an actual knot that you tie, which is the 1st sense of the synset. As a result, we cannot expect the same ratings of concreteness or affectiveness for these two different senses. Therefore, to map synsets to concepts, we always select the initial sense of the synsets.
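As a rough illustration of Algorithm 1, the sketch below builds a colexification graph from synsets and their (lemma, language) pairs. The function name and input format are our own illustration, not the implementation of Harvill et al. (2022) or Chen et al. (2023).

```python
from collections import defaultdict
from itertools import combinations

def build_colexification_graph(lemmas_by_synset):
    """Connect two synsets whenever the same lemma in the same
    language expresses both; edges store the shared (lemma,
    language) tuples."""
    # Invert the mapping: which synsets does each pair express?
    synsets_by_lemma = defaultdict(set)
    for synset, pairs in lemmas_by_synset.items():
        for pair in pairs:
            synsets_by_lemma[pair].add(synset)
    # Every synset pair sharing a (lemma, language) is colexified.
    edges = defaultdict(set)
    for pair, synsets in synsets_by_lemma.items():
        for s1, s2 in combinations(sorted(synsets), 2):
            edges[(s1, s2)].add(pair)
    return dict(edges)

# Danish "far" colexifies DAD and POPE (cf. Table 1).
graph = build_colexification_graph({
    "dad#n#1":  {("far", "dan"), ("dad", "eng")},
    "pope#n#1": {("far", "dan"), ("pope", "eng")},
})
```

Here the DAD-POPE pattern surfaces as a single edge carrying the tuple ("far", "dan").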
Once filtered by the 1st sense of the synsets, as illustrated in Table 1, we derive concepts by extracting the sense word from each synset. The resulting concept graph comprises nodes representing the 1st senses of synsets and edges indicating the corresponding languages and sense lemmas.
Phonemes Extraction To facilitate the analysis of phonetic characteristics cross-lingually in the context of colexifications and against ratings of concreteness and affectiveness, we extract phonemes from WikiPron, which to date includes 1,882,240 pairs in 543 languages. To map the pronunciations to our data, we mapped their word/language code pairs to the pairs of sense lemma/language code extracted from BabelNet. As a result, there are 139,698 sense lemma/phoneme pairs across 142 languages, presented as in Table 1. In our dataset, the median number of phonemes per language is 32.
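The mapping between WikiPron pronunciations and BabelNet lemmas described above amounts to a key intersection over (word, language code) pairs; the function and toy entries below are hypothetical.

```python
def align_pronunciations(lemma_lang_pairs, wikipron):
    """Keep only the (lemma, language) pairs from the synset graph
    that have a WikiPron phoneme entry; all others are dropped."""
    return {pair: wikipron[pair]
            for pair in lemma_lang_pairs if pair in wikipron}

# Toy WikiPron table keyed by (word, ISO 639-3 code).
wikipron = {("far", "dan"): ["f", "ɑ"], ("dad", "eng"): ["d", "æ", "d"]}
aligned = align_pronunciations({("far", "dan"), ("mapu", "arn")}, wikipron)
```

Only the Danish entry survives the intersection, since the toy table has no pronunciation for Mapudungun mapu.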
Phonological Features Extraction Phonological features have been proposed as the foundation of spoken language universals. Despite variations in phones across languages, the set of phonological features remains constant. Phones can be constructed from a set of phonological features. In our study, we extract phonemes for sense lemmas and then further extract phonological (articulatory) features based on the subsegments using PanPhon. PanPhon generates 24 phonological features for each segment: syllabic, sonorant, consonantal, continuant, delayed release, lateral, nasal, strident, voice, spread glottis, constricted glottis, anterior, coronal, distributed, labial, high (vowel/consonant, not tone), low (vowel/consonant, not tone), back, round, velaric airstream mechanism (click), tense, long, hitone, and hireg. Each feature is assigned a value of '1', '-1', or '0', where '1' indicates a positive value of the feature, '-1' indicates a negative value, and '0' indicates that the feature is absent for that sound. For instance, a vowel cannot possess consonant features, so these are marked as '0'. We use PanPhon to convert each phone into a vector of length 24 in our dataset.

Incorporating Norms and Ratings Having built the concept graph from the synset graph by selecting the 1st senses of the synsets across languages, we map the concepts from databases containing norms and ratings to the concept graph. As shown in Table 1, the concept_1 DAD is mapped from the concreteness/affectiveness rating lists to the synset_1 dad#n#1, while the concept_2 POPE is mapped to the synset_2 pope#n#1, by intersecting the datasets on the sense words. When each concept in the colexification pair has a rating, the distance in concreteness/affectiveness can be calculated as the absolute distance between the two. When concept_1 has a (mean) concreteness of conc_1 and concept_2 has a (mean) concreteness of conc_2, the Conc.Dist is calculated as |conc_1 − conc_2|. Similar procedures are used for computing the distances of valence (V.Dist), arousal (A.Dist), and dominance (D.Dist).
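The distance computation sketched above is a per-dimension absolute difference; the rating values below are made-up illustrative numbers, not the actual Brysbaert or Warriner norms.

```python
def rating_distances(ratings1, ratings2):
    """Absolute per-dimension distance between two rated concepts;
    None when either concept lacks the rating (the '-' in Table 1)."""
    dims = ("conc", "valence", "arousal", "dominance")
    return {
        dim: abs(ratings1[dim] - ratings2[dim])
        if dim in ratings1 and dim in ratings2 else None
        for dim in dims
    }

# Hypothetical ratings for a concept pair; arousal and dominance
# are missing for one concept, so those distances stay undefined.
d = rating_distances({"conc": 4.93, "valence": 7.29},
                     {"conc": 4.70, "valence": 5.67})
```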
To analyze the correlations between phonemes/phonological features and concreteness/affectiveness, the rating for each phoneme is calculated as the average of the ratings of the included concepts, grouped by phoneme and language, respectively.
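A minimal sketch of this per-phoneme averaging, assuming each entry carries a language code, a phoneme segment list, and the (mean) rating of the underlying concept; entries and codes are illustrative only.

```python
from collections import defaultdict

def phoneme_ratings(entries):
    """Average concept ratings over every occurrence of a phoneme,
    grouped by (phoneme, language)."""
    sums = defaultdict(lambda: [0.0, 0])
    for lang, phonemes, rating in entries:
        for p in phonemes:
            sums[(p, lang)][0] += rating
            sums[(p, lang)][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

means = phoneme_ratings([
    ("dan", ["f", "ɑ"], 4.0),   # a more concrete concept
    ("dan", ["f", "i"], 2.0),   # a more abstract one
])
```

The phoneme /f/, occurring in both lemmas, ends up with the mean of the two concept ratings.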
Undergoing these procedures, we create a dataset in 142 languages across 21 language families, including ratings of concreteness/affectiveness, and phonemes for lemmas. The overall statistics of the data are shown in Table 2. The map of the data, color-coded by language families, is presented in Fig. 3. As shown, the data is highly skewed towards Indo-European languages, and quite scarce in the Americas.

Previous studies show that abstract concepts are often understood by reference to more concrete concepts (Lakoff and Johnson, 2008), and that words that first arise with concrete meanings often later gain an abstract one (Xu et al., 2017). Xu et al. (2020) lean on these findings to show that concepts more dissimilar in concreteness and affective valence are more likely to colexify. To test this, we calculate the correlation coefficients between the number of colexifications and the concreteness/affectiveness distances of the colexified concepts across languages. However, the results show the exact opposite of these previous theories and findings. As shown in Table 3, there is a statistically significant and relatively strong negative correlation between colexifications and the distances of concreteness, valence, arousal, and dominance. This indicates that a pair of concepts is more likely to colexify when they are closer in concreteness and affectiveness. Our result on affectiveness in colexifications is also corroborated by Di Natale et al. (2021).
Since both the distances of concreteness and affectiveness are correlated with colexifications, it is intuitive to assume they might be correlated with each other. To test this, we calculate the correlation coefficients between each dimension of concreteness and affectiveness. As shown in Fig. 4, the distances of valence and dominance are more strongly correlated with each other than other pairs. Concreteness distance is not significantly correlated with any dimension of affectiveness.

Phonemes vs. Concreteness/Affectiveness
Previous studies suggest that characteristics of the initial and the last phoneme have the most significant impact on the phonetic characteristics of the whole phone set (Pimentel et al., 2020). To test whether there are universals linking the initial/last phoneme and concreteness/affectiveness, we calculate the correlations between them per language family.
Since the full results are too large to present, we report here only the results where the correlations are statistically significant and whose absolute value is greater than 0.1. To prevent data from incorrectly appearing to be statistically significant, we correct the p-value with the Bonferroni correction, dividing it by the number of languages within the language family that is tested. Only results that are statistically significant at the 95% level after applying the Bonferroni correction are reported.
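The correction step might look as follows, assuming one test per language within a family; the language codes and p-values are invented for illustration.

```python
def bonferroni_filter(p_values, alpha=0.05):
    """Keep only tests that stay significant after a Bonferroni
    correction: each raw p-value must fall below alpha divided by
    the number of tests (here: languages in the family)."""
    threshold = alpha / len(p_values)
    return {name: p for name, p in p_values.items() if p < threshold}

# Three languages in a family -> per-test threshold of 0.05 / 3.
significant = bonferroni_filter({"tur": 0.001, "kaz": 0.04, "uzb": 0.02})
```

Note that 0.02 and 0.04 would pass an uncorrected 0.05 threshold but are rejected here, which is exactly the false-positive control the correction provides.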
We can observe in Table 4 that, when correlating against concreteness, /p/ as the initial phoneme and /í/ as the last phoneme are significantly and more strongly correlated within Dravidian languages, and /a/ as the first phoneme within Artificial languages, compared to others. Across language families, /k/ is correlated with concreteness.
Similarly, we test the correlations against affectiveness. Only the results for valence are reported, since the correlations of the phonemes against the other affective ratings are not significant. As shown in Table 6, /p/ as the initial phoneme presents correlations with valence across language families, i.e., Sino-Tibetan and Dravidian.
To represent the complexity of phonemes within language families, we calculate the TTR as the ratio of unique phonemes to the total number of phoneme segments for each lemma. Furthermore, the correlation between TTR and concreteness/arousal is computed, as shown in Table 7. The length of the phoneme segments is also calculated for a similar correlation test. Across all 8 language families, the segment length is statistically negatively correlated with concreteness, but positively correlated with arousal. Meanwhile, the correlation between TTR and concreteness shows that the more concrete a concept is, the more diverse (complex) its phonemes are.
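The TTR as defined above reduces to a one-line computation over a lemma's phoneme segments:

```python
def phoneme_ttr(segments):
    """Type-to-token ratio: unique phoneme types over the total
    number of phoneme segments of a lemma."""
    return len(set(segments)) / len(segments)

# /d æ d/ repeats /d/, so its TTR drops below 1.
ttr = phoneme_ttr(["d", "æ", "d"])
```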

Phonological Features vs. Concreteness/Affectiveness
To test whether the phonological features of phonemes correlate with concreteness or affectiveness, for each phoneme/lemma pair the phonological feature vectors are calculated, and the values are aggregated by the frequency of the present features. As indicated in Table 5, in the reported data, all the phonological features are negatively correlated with concreteness. While the correlation coefficients are in general quite small, this hints at the possible existence of effects of these phonological features on concreteness. For instance, the coronal obstruent (cor) feature is highly negatively correlated with concreteness in all four language families, indicating a general tendency for such sounds to occur in less concrete concepts.
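The aggregation step above can be sketched with toy three-dimensional feature vectors standing in for PanPhon's 24-dimensional ones; the feature names and values are illustrative, not PanPhon output.

```python
def count_positive_features(feature_vectors, feature_names):
    """For one lemma's pronunciation, count how often each
    phonological feature takes the positive value (+1) across
    its phoneme segments (-1 = negative, 0 = absent)."""
    counts = dict.fromkeys(feature_names, 0)
    for vec in feature_vectors:
        for name, value in zip(feature_names, vec):
            if value == 1:
                counts[name] += 1
    return counts

# Two toy segments over three features instead of PanPhon's 24.
counts = count_positive_features([[1, 1, -1], [-1, 1, 1]],
                                 ["syl", "son", "cor"])
```

These per-lemma counts are then what gets correlated against the lemma's concreteness and affectiveness ratings.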

Conclusion and Future Work
In this study, we proposed a set of procedures to leverage colexifications to bootstrap cross-lingual datasets, incorporating human ratings of concreteness and affective meanings. The created dataset presents data in 142 languages across 21 language families and 5 language macro-areas. However, the procedures can be applied beyond the datasets used in this paper.
Inspired by previous work, we test the correlations between i) the distance in concreteness/affectiveness and the number of colexifications; ii) the phonemes and concreteness/affectiveness; and iii) the phonological features and the ratings. It is shown that i) concept pairs closer in concreteness/affectiveness are more likely to colexify; ii) certain initial/last phonemes do present statistically significant correlations with the ratings across languages; iii) there is a positive correlation between phoneme diversity and concreteness; and finally iv) certain phonological features are negatively correlated with the ratings. While it is difficult to draw any meaningful conclusions from these findings without a prior hypothesis, we hope that future work can use this dataset to make well-founded findings on the interactions between phonology, concreteness, and affectiveness.
We have showcased the soundness and validity of our approach to curate data from different domains and create a cross-lingual dataset mapping the information. The initial analyses and findings could inspire further applications in NLP as well as other fields, such as psychology and psycholinguistics, which we will explore extensively in future work.
Nevertheless, the analyses conducted in this study are confined to individual correlation tests, which are inadequate for reaching definitive conclusions. For future work, we will employ multivariate modeling techniques utilizing affective/concreteness ratings and phonetic features to delve deeper into the connections between human conceptualization and sounds across diverse languages and cultures.

Limitations
A limitation of this study is the fact that the concreteness ratings of Brysbaert et al. (2014) are curated solely from self-identified U.S. residents, and the affectiveness ratings of Warriner et al. (2013) are curated solely in English. As such, there is a risk of an anglocentric bias in the created dataset. Nonetheless, the goal of this study is to explore the potential of leveraging colexifications to bootstrap cross-lingual datasets in as many languages as possible, including many low-resource languages.

Figure 1 :
Figure 1: Colexification subgraph for DAD. The weight of the edges is proportional to the frequency of the colexification pattern in the dataset.

Figure 2 :
Figure 2: The workflow of the procedures for creating the cross-lingual dataset using colexifications.

Figure 3 :
Figure 3: The map of language families of our data. The size of the points is proportional to the number of concepts in each language. Colors represent language families.

Figure 4 :
Figure 4: Correlation between affectiveness and concreteness distances between the colexified concepts. The size of the squares represents the correlation coefficients.

Table 1 :
An example of the dataset. {CONC,V,D,A}.Dist represent the distances of concreteness, valence, dominance, and arousal of the pair of concepts for each lexicon. The value is unknown (-) if either of the concepts does not have a rating.

Table 2 :
Statistics of the Dataset.

Table 3 :
Correlation between #Colexifications and the Concreteness/Affectiveness Distances between the Colexified Concepts; p-values are in brackets. The sign * indicates the statistical significance of the correlation at 95% (p < 0.0001).

Table 4 :
Correlation between the Initial/Last Phoneme and the Concreteness of Sense Lemma across Languages per Language Family. All presented coefficients (in brackets) are statistically significant and greater than 0.1 or smaller than -0.1, corrected with the Bonferroni correction (p < 0.05/#Lang.).

Table 6 :
Correlation between the Initial/Last Phoneme and the Valence of Sense Lemma across Languages per Language Family. All presented coefficients (in brackets) are statistically significant and greater than 0.1 or smaller than -0.1, corrected with the Bonferroni correction (p < 0.05/#Lang.).

Table 7 :
Correlation between TTR (Type-to-Token Ratio)/Segment Length and the Concreteness of Sense Lemma per Language Family. All presented coefficients (in brackets) are statistically significant and greater than 0.1 or smaller than -0.1, corrected with the Bonferroni correction (p < 0.05/#Lang.).