Cross-Cultural Similarity Features for Cross-Lingual Transfer Learning of Pragmatically Motivated Tasks

Much work in cross-lingual transfer learning has explored how to select better transfer languages for multilingual tasks, primarily focusing on typological and genealogical similarities between languages. We hypothesize that these measures of linguistic proximity are not enough when working with pragmatically-motivated tasks, such as sentiment analysis. As an alternative, we introduce three linguistic features that capture cross-cultural similarities that manifest in linguistic patterns and quantify distinct aspects of language pragmatics: language context-level, figurative language, and the lexification of emotion concepts. Our analyses show that the proposed pragmatic features do capture cross-cultural similarities and align well with existing work in sociolinguistics and linguistic anthropology. We further corroborate the effectiveness of pragmatically-driven transfer in the downstream task of choosing transfer languages for cross-lingual sentiment analysis.


Introduction
Hofstede et al. (2005) defined culture as the collective mind which "distinguishes the members of one group of people from another." Cultural idiosyncrasies affect and shape people's beliefs and behaviors. Linguists have particularly focused on the relationship between culture and language, revealing in qualitative case studies how cultural differences are manifested as linguistic variations (Siegel, 1977).
Quantifying cross-cultural similarities from linguistic patterns has largely been unexplored in NLP, with the exception of studies that focused on cross-cultural differences in word usage (Garimella et al., 2016; Lin et al., 2018). In this work, we aim to quantify cross-cultural similarity, focusing on semantic and pragmatic differences across languages. We devise a new distance measure between languages based on linguistic proxies of culture. We hypothesize that it can be used to select transfer languages and improve cross-lingual transfer learning, specifically in pragmatically-motivated tasks such as sentiment analysis, since expressions of subtle sentiment or emotion, such as subjective well-being, anger (Oster, 2019), or irony (Karoui et al., 2017), have been shown to vary significantly by culture.
*The first three authors contributed equally.
We focus on three distinct aspects at the intersection of language and culture, and propose features to operationalize them. First, every language and culture relies on a different level of context in communication. Western European languages are generally considered low-context languages, whereas Korean and Japanese are considered high-context languages (Hall, 1989). Second, similar cultures construct and construe figurative language similarly (Casas and Campoy, 1995; Vulanović, 2014). Finally, emotion semantics is similar between languages that are culturally related (Jackson et al., 2019). For example, in Persian, 'grief' and 'regret' are expressed with the same word, whereas 'grief' is co-lexified with 'anxiety' in Dargwa. Therefore, Persian speakers may perceive 'grief' as more similar to 'regret,' while Dargwa speakers may associate the concept with 'anxiety.'
We validate the proposed features qualitatively, and also quantitatively through an extrinsic evaluation. We first analyze each linguistic feature to confirm that it captures the intended cultural patterns. We find that the results corroborate existing work in sociolinguistics and linguistic anthropology. Next, as a practical application of our features, we use them to rank transfer languages for cross-lingual transfer learning. Lin et al. (2019) have shown that selecting the right set of transfer languages with syntactic and semantic language-level features can significantly boost the performance of cross-lingual models. We incorporate our features into the ranking model of Lin et al. (2019) to evaluate the new cultural features' utility in selecting better transfer languages. Experimental results show that incorporating the features improves the performance for cross-lingual sentiment analysis, but not for dependency parsing. These results support our hypothesis that cultural features are more helpful when the cross-lingual task is driven by pragmatic knowledge.

Pragmatically-motivated Features
We propose three language-level features that quantify the cultural similarities across languages.
Language Context-level Ratio A language's context-level reflects the extent to which the language leaves the identity of entities and predicates to context. For example, the English sentence Did you eat lunch? explicitly includes the pronoun you, whereas the equivalent Korean sentence 점심 먹었니? (= Did eat lunch?) omits the pronoun. Context-level is considered one of the distinctive attributes of a language's pragmatics in linguistics and communication studies, and if two languages have similar levels of context, their speakers are more likely to be from similar cultures (Nada et al., 2001).
The language context-level ratio (LCR) feature approximates this linguistic quality. We compute the pronoun- and verb-token ratios, $ptr(l_k)$ and $vtr(l_k)$, for each language $l_k$ using part-of-speech tagging results. We first run language-specific POS taggers over a large monolingual corpus for each language. Next, we compute ptr as the ratio of the count of pronouns in the corpus to the count of all tokens; vtr is obtained likewise with verb tokens. Low ptr and vtr values may indicate that a language leaves the identity of entities and predicates, respectively, to context. We then compare these values between the target language $l_{tg}$ and the transfer language $l_{tf}$, which leads to the following definition of LCR:
$$\text{LCR-pron}(l_{tf}, l_{tg}) = \frac{ptr(l_{tg})}{ptr(l_{tf})}, \qquad \text{LCR-verb}(l_{tf}, l_{tg}) = \frac{vtr(l_{tg})}{vtr(l_{tf})}$$
Literal Translation Quality Similar cultures tend to share similar figurative expressions, including idiomatic multiword expressions (MWEs) and metaphors (Kövecses, 2003, 2010). For example, like father like son in English can be translated word-by-word into a similar idiom tel père tel fils in French. However, in Japanese, a similar idiom 蛙の子は蛙 (Kaeru no ko wa kaeru) "A frog's child is a frog." cannot be literally translated.
The literal translation quality (LTQ) feature quantifies how well a given language pair's MWEs are preserved in literal (word-by-word) translation, using a bilingual dictionary. A well-curated list of MWEs is not available for the majority of languages. We thus follow an automatic approach to MWE extraction (Tsvetkov and Wintner, 2010). First, a variant of pointwise mutual information, $\mathrm{PMI}^3$ (Daille, 1994), is used to extract noisy lists of top-scoring n-grams from two large monolingual corpora from different domains. Intersecting the two lists filters out domain-specific n-grams and retains the language-specific top-k MWEs. Then, a bilingual dictionary between $l_{tf}$ and $l_{tg}$ and a parallel corpus between the pair are used. For each n-gram in $l_{tg}$'s MWEs, we search for its literal translations, extracted using the dictionary, in parallel sentences containing the n-gram. For each word in the n-gram, if a translation appears in the parallel sentence, we count a hit; otherwise, we count a miss. We then calculate the hit ratio as $\frac{hit}{hit + miss}$ for each n-gram found in the parallel corpus. Finally, we average the hit ratios of all n-grams and z-normalize over the transfer languages to obtain $\mathrm{LTQ}(l_{tf}, l_{tg})$.
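The hit-ratio step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the contiguous n-gram matching, the treatment of sentences as token lists, and the use of the population standard deviation for z-normalization are all assumptions made for the example.

```python
from statistics import mean, pstdev


def hit_ratio(ngram, parallel_pairs, dictionary):
    """Hit ratio of one target-language n-gram (hypothetical helper).

    ngram: tuple of target-language words
    parallel_pairs: list of (target_tokens, transfer_tokens) sentence pairs
    dictionary: maps a target word to a set of transfer-language translations
    Returns None when the n-gram never occurs in the parallel corpus.
    """
    hits = misses = 0
    n = len(ngram)
    for tgt_tokens, tf_tokens in parallel_pairs:
        # Assumption: the n-gram must occur contiguously in the target sentence.
        found = any(tuple(tgt_tokens[i:i + n]) == ngram
                    for i in range(len(tgt_tokens) - n + 1))
        if not found:
            continue
        tf_vocab = set(tf_tokens)
        for word in ngram:
            # A hit means some dictionary translation appears on the other side.
            if dictionary.get(word, set()) & tf_vocab:
                hits += 1
            else:
                misses += 1
    total = hits + misses
    return hits / total if total else None


def raw_ltq(mwes, parallel_pairs, dictionary):
    """Average hit ratio over the n-grams found in the parallel corpus."""
    ratios = [hit_ratio(m, parallel_pairs, dictionary) for m in mwes]
    ratios = [r for r in ratios if r is not None]
    return mean(ratios) if ratios else 0.0


def z_normalize(scores):
    """scores: dict of transfer language -> raw LTQ for one fixed target."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())  # assumption: population std
    return {k: (v - mu) / sigma if sigma else 0.0 for k, v in scores.items()}
```

With a one-sentence English-French corpus built from the idiom example above, `hit_ratio` returns 1.0 when every word of the n-gram has a dictionary translation in the French sentence, and drops proportionally when some words do not.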
Emotion Semantics Distance Emotion semantic distance (ESD) measures how similarly emotions are lexicalized across languages. This is inspired by Jackson et al. (2019) who used colexification patterns (i.e., when different concepts are expressed using the same lexical item) to capture the semantic similarity of languages. However, colexification patterns require human annotation, and existing annotations may not be comprehensive. We extend Jackson et al. (2019)'s method by using cross-lingual word embeddings.
We define ESD as the average distance of emotion word vectors in transfer and target languages, after aligning word embeddings into the same space. More specifically, we use 24 emotion concepts defined in Jackson et al. (2019) and use bilingual dictionaries to expand each concept into every other language (e.g., love and proud to Liebe and stolz in German). We then remove the emotion word pairs from the bilingual dictionaries, and use the remaining pairs to align the source-language word embeddings into the space of the target language. We hypothesize that if words corresponding to the same emotion concept in different languages (e.g., proud and stolz) have similar meanings, they should be aligned to the same point despite the lack of supervision. However, because each language possesses different emotion semantics, emotions are scattered into different positions. We thus define ESD as the average cosine distance between languages:
$$\mathrm{ESD}(l_{tf}, l_{tg}) = \frac{1}{|E|} \sum_{e \in E} d_{\cos}(v_{tf,e}, v_{tg,e})$$
where $E$ is the set of emotion concepts, $v_{tf,e}$ is the aligned emotion word vector of language $l_{tf}$, and $d_{\cos}$ denotes cosine distance.
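As a rough illustration of the ESD computation, the sketch below aligns embeddings with an orthogonal Procrustes map and then averages cosine distances over emotion concepts. The Procrustes step is an assumption standing in for whichever supervised alignment method is actually used, and the function names `procrustes_align` and `esd` are hypothetical.

```python
import numpy as np


def procrustes_align(src, tgt):
    """Orthogonal W minimizing ||src @ W - tgt||_F.

    src, tgt: arrays of shape (n_pairs, dim); row i of each holds the
    embeddings of one non-emotion dictionary pair. Assumption: an
    orthogonal Procrustes map is used as the alignment method.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def esd(vecs_tf, vecs_tg):
    """Average cosine distance over emotion concepts present in both dicts.

    vecs_tf, vecs_tg: dict of concept -> vector, already in a shared space.
    """
    concepts = sorted(set(vecs_tf) & set(vecs_tg))
    dists = []
    for e in concepts:
        a, b = vecs_tf[e], vecs_tg[e]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        dists.append(1.0 - cos)
    return float(np.mean(dists))
```

In use, one would fit `W` on the dictionary pairs that remain after removing emotion words, map each transfer-language emotion vector with `v @ W`, and pass the mapped vectors together with the target-language vectors to `esd`.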

Feature Analysis
In this section, we evaluate the proposed pragmatically-motivated features intrinsically. Throughout the analyses, we use the 16 languages listed in Figure 4, which are later used for extrinsic evaluation (§5).

Implementation Details
We used multilingual word tokenizers from NLTK and RDR POS Tagger (Nguyen et al., 2014) for most of the languages except for Arabic, Chinese, Japanese, and Korean, where we used PyArabic, Jieba, Kytea, and Mecab, respectively. For monolingual corpora, we used the news-crawl 1M corpora from Leipzig (Goldhahn et al., 2012) for both LCR and LTQ. We used bilingual dictionaries from Choe et al. (2020) and TED talks corpora (Qi et al., 2018) for both parallel corpora and an additional monolingual corpus for LTQ. We focused on bigrams and trigrams and set k, the number of extracted MWEs, to 500. We followed  to generate the supervised cross-lingual word embeddings for ESD.

LCR and Language Context-level
ptr approximates how often discourse entities are indexed with pronouns rather than left to be inferred from context. Similarly, vtr estimates the rate at which predicates appear explicitly as verbs. To examine to what extent these features reflect context-levels, we plot languages on a two-dimensional plane in Figure 1, where the x-axis indicates ptr and the y-axis indicates vtr.
The plot reveals a clear pattern of context-levels in different languages. Low-context languages such as German and English (Hall, 1989) possess the largest values of ptr. On the other extreme are located Korean and Japanese with low ptr, which are representative of high-context languages. One thing to notice is the isolated location of Turkish with a high vtr. This is morphosyntactically plausible as a lot of information is expressed by the affixation to verbs in Turkish.
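For reference, the ptr and vtr values plotted here, and the LCR features defined from them earlier, could be computed from POS-tagged tokens roughly as follows. The sketch assumes tokens arrive as (word, tag) pairs using Universal POS tag names (`PRON`, `VERB`); the actual taggers may use different tag sets, and the function names are hypothetical.

```python
from collections import Counter


def token_ratios(tagged_tokens, pron_tag="PRON", verb_tag="VERB"):
    """Return (ptr, vtr) for a corpus given as (word, tag) pairs.

    Assumption: tags follow Universal POS names; adjust the tag
    arguments for taggers with other tag sets.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty corpus")
    return counts[pron_tag] / total, counts[verb_tag] / total


def lcr(tagged_tf, tagged_tg):
    """LCR-pron and LCR-verb: target ratio divided by transfer ratio."""
    ptr_tf, vtr_tf = token_ratios(tagged_tf)
    ptr_tg, vtr_tg = token_ratios(tagged_tg)
    # Note: raises ZeroDivisionError if the transfer corpus has no
    # pronouns or no verbs, which is unlikely for a large corpus.
    return {"LCR-pron": ptr_tg / ptr_tf, "LCR-verb": vtr_tg / vtr_tf}
```

Applied to the toy pair from the context-level example (English as transfer, Korean as target), the missing Korean pronoun drives LCR-pron toward zero, which matches the reading that a low ptr signals a higher-context language.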

LTQ and MWEs
LTQ uses n-grams with high PMI scores as proxies for figurative-language MWEs (PMI MWEs). We evaluate the quality of the selected MWEs and the resulting LTQ by comparing them with human-curated lists of figurative-language MWEs (gold MWEs) that are available for some languages. We collected gold MWEs in multiple languages from Wiktionary. We discarded languages with fewer than 2,000 phrases on the list, resulting in four languages (English, French, German, Spanish) for analysis. First, we check how many PMI MWEs are actually in the gold MWEs. Out of the top-500 PMI bigrams and trigrams, 19.0% of bigrams and 3.8% of trigrams are included in the gold MWE list (averaged over the four languages). For example, the trigrams keep an eye and take into account in the PMI MWEs are considered to be in the gold MWEs because keep an eye peeled and take into account are in the list. The seemingly low percentages are reasonable, given that PMI scores are designed to extract collocation patterns rather than figurative language itself.
Figure 2: (a) Network based on Emotion Semantics Distance. (b) Network based on syntactic distance.
Secondly, to validate using PMI MWEs as proxies, we compare the LTQ of PMI MWEs with the LTQ using gold MWEs. Specifically, we obtained the LTQ scores of each language pair with target languages limited to the four European languages mentioned above. Then for each target language, we measured Pearson correlation coefficient between the two LTQ scores based on the two MWE lists. The average coefficient was 0.92, which indicates a strong correlation between the two resulting LTQ scores, and thus justifies using PMI MWEs for all other languages.

ESD and Cultural Grouping
We investigate what is captured by ESD by visualizing and inspecting the nearest neighbors of emotion vectors. Jackson et al. (2019) used word colexification patterns to reveal that the same emotion concepts cluster with different emotions according to the language family they belong to. For instance, in Tai-Kadai languages, hope appears in the same cluster as want and pity, while hope associates with good and love in the Nakh-Daghestanian language family. Our results derived from ESD do not rely on colexification patterns, but also support this finding. The nearest neighbors of the Chinese word for hope were want and pity, while the nearest neighbors of hope in Arabic were love and joy.
In Figure 2, we compare ESD to the syntactic distance between languages by constructing two networks of languages based on each feature. Figure 2a uses ESD as reference while Figure 2b uses the syntactic distance from the URIEL database (Littell et al., 2017). Each node represents a language, color-coded by its cultural area. For each language, we sort the other languages according to the distance value. When a language is in the list of top-k closest languages, we draw an edge between the two. We set k = 2.
We see that languages in the same cultural areas tend to form more cohesive clusters in Figure 2a compared to Figure 2b. The portion of edges within the cultural areas is 76% for ESD while it is 59% for syntactic distance. These results indicate that ESD effectively extracts linguistic information that aligns well with the commonly shared perception of cultural areas.
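The network construction and the within-area edge proportion reported above can be sketched as below. This is an illustrative reading, not the authors' code: edges are treated as undirected and counted once even when two languages select each other, which is an assumption, and the function names are hypothetical.

```python
def knn_edges(dist, k=2):
    """Build undirected edges linking each language to its k closest others.

    dist: dict of language -> dict of other language -> distance.
    """
    edges = set()
    for lang, row in dist.items():
        others = [o for o in row if o != lang]
        nearest = sorted(others, key=lambda o: row[o])[:k]
        for o in nearest:
            # frozenset makes (a, b) and (b, a) the same edge (assumption).
            edges.add(frozenset((lang, o)))
    return edges


def within_area_fraction(edges, area):
    """Share of edges whose two endpoints belong to the same cultural area."""
    same = sum(1 for e in edges if len({area[l] for l in e}) == 1)
    return same / len(edges)
```

Running `within_area_fraction(knn_edges(dist, k=2), area)` once with ESD distances and once with syntactic distances would give the two proportions being compared.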

Correlation with Geographical Distance
Regarding the language clusters in Figure 2a, some may suspect that geographic distance can substitute for the pragmatically-inspired features. For Chinese, Korean and Japanese are the closest languages by ESD, which can also be explained by their geographical proximity. Do our features add additional pragmatic information, or can they simply be replaced by geographical distance?
To test this possibility, we compute Pearson's correlation coefficient between each pragmatic feature value and the geographical distance from URIEL. The feature with the strongest correlation was ESD (r=0.4). The least correlated was LCR-verb (r=0.03), followed by LCR-pron (r=0.17) and LTQ (r=−0.31; see footnote 6). The results suggest that the pragmatic features contain extra information that cannot be subsumed by geographic distance.
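For readers who want to reproduce this kind of check, Pearson's r between a feature and geographic distance over language pairs can be computed with a few lines of plain Python. The function name and the flat-list input format are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold one feature value per language pair and `ys` the corresponding geographic distances. A feature that grows with similarity, such as LTQ, is expected to correlate negatively with distance.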

Extrinsic Evaluation: Ranking Transfer Languages
To demonstrate the utility of our features, we apply them to a transfer language ranking task for cross-lingual transfer learning. We first present the overall task setting, including the datasets and models used for the two cross-lingual tasks. Next, we describe the transfer language ranking model and its evaluation metrics.

Task Setting
We define our task as a language ranking problem: given the target language $l_{tg}$, we want to rank a set of $n$ candidate transfer languages $L_{tf} = \{l_{tf}^{(1)}, \ldots, l_{tf}^{(n)}\}$ by their usefulness when transferred to $l_{tg}$, which we refer to as transferability (illustrated in Figure 3). The effectiveness of cross-lingual transfer is often measured by evaluating the joint training or zero-shot transfer performance (Wu and Dredze, 2019; Schuster et al., 2019). In this work, we quantify the effectiveness as the zero-shot transfer performance, following Lin et al. (2019). Our goal is to train a model that ranks available transfer languages in $L_{tf}$ by their transferability for a target language $l_{tg}$.
To train the ranking model, we first need to find the ground-truth transferability rankings, which serve as the model's training data. We evaluate the zero-shot performance $z_{tf,tg}$ by training a task-specific cross-lingual model solely on transfer language $l_{tf}$ and testing on $l_{tg}$. After evaluating $z_{tf,tg}$ for each candidate transfer language in $L_{tf}$, we obtain the optimal ranking of languages $r_{tg}$ by sorting languages according to the measured $z_{tf,tg}$. Note that $r_{tg}$ also depends on the downstream task.
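The ground-truth ranking step amounts to sorting candidate transfer languages by their zero-shot score for a fixed target. A minimal sketch, with a hypothetical function name and made-up scores used purely for illustration:

```python
def ground_truth_ranking(zero_shot):
    """Rank transfer languages for one target, best first.

    zero_shot: dict of transfer language -> zero-shot score z_{tf,tg}
    (for example macro F1 or LAS) measured on the fixed target language.
    """
    return sorted(zero_shot, key=zero_shot.get, reverse=True)


# Illustrative numbers only, not results from the paper.
example_scores = {"ar": 0.80, "ja": 0.60, "ko": 0.70}
example_ranking = ground_truth_ranking(example_scores)
```

Because the scores come from a task-specific model, the same target language can have different rankings for sentiment analysis and dependency parsing.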
Next, we train the language ranking model. The ranking model predicts the transfer ranking of candidate languages. Each (transfer, target) pair $(l_{tf}, l_{tg})$ is represented as a vector of language features $f_{tf,tg}$, which may include phonological similarity, typological similarity, and word overlap, to name a few. The ranking model takes $f_{tf,tg}$ of every $l_{tf}$ as input, and predicts the transferability ranking $\hat{r}_{tg}$. Using $r_{tg}$ from the previous step as training data, the model learns to find optimal transfer languages based on $f_{tf,tg}$. The trained model can be used either to select the optimal set of transfer languages or to decide which language to additionally annotate during the data creation process.
Figure 3: Illustration of the transfer language ranking problem when the target language is French (fr) and there are three available transfer languages: Arabic (ar), Russian (ru), and Chinese (zh). The output ranking $\hat{r}_{fr}$ is compared to the ground-truth ranking $r_{fr}$, which is determined by the zero-shot performance $z$ of cross-lingual models.
6 When two languages are more similar, LTQ is higher whereas geographic distance is smaller.

Task & Dataset
We apply the proposed features to train a ranking model for two distinctive tasks: multilingual sentiment analysis (SA) and multilingual dependency parsing (DEP). The tasks are chosen based on our hypothesis that high-order information such as pragmatics would assist sentiment analysis while it may be less significant for dependency parsing, where lower-order information such as syntax is relatively stressed.
SA As there is no single sentiment analysis dataset covering a wide variety of languages, we collected various review datasets from different sources. All samples are labeled as either positive or negative. For datasets rated on a five-point Likert scale, we mapped 1-2 to negative and 4-5 to positive. We settled on a dataset consisting of 16 languages categorized into five distinct cultural groups: West Europe, East Europe, East Asia, South Asia, and Middle East (Figure 4).
DEP To compare the effectiveness of the proposed features on syntax-focused tasks, we chose datasets of the same set of 16 languages from Universal Dependencies v2.2 (Nivre et al., 2018).

Task-Specific Cross-Lingual Models
SA Multilingual BERT (mBERT) (Devlin et al., 2019), a multilingual extension of BERT pretrained on 104 different languages, has shown strong results in various text classification tasks in cross-lingual settings (Sun et al., 2019). We use mBERT to conduct zero-shot cross-lingual transfer and to extract optimal transfer language rankings: we fine-tune mBERT on transfer language data and test it on target language data. The performance is measured by the macro F1 score on the test set.
DEP We adopt the setting from Ahmad et al. (2018) to perform cross-lingual zero-shot transfer. We train deep biaffine attentional graph-based models (Dozat and Manning, 2016), which achieved state-of-the-art performance in dependency parsing for many languages. The performance is evaluated using labeled attachment scores (LAS).

Ranking Model & Evaluation
Ranking Model For the language ranking model, we employ gradient boosted decision trees, LightGBM, which is one of the state-of-the-art models for ranking tasks.
Ranking Evaluation Metric We evaluate the ranking models' performance with two standard metrics for ranking tasks: Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain at position p (NDCG@p) (Järvelin and Kekäläinen, 2002). While MAP assumes a binary concept of relevance, NDCG is a more fine-grained measure that reflects the ranking positions. The relevant languages for computing MAP are defined as the top-k languages in terms of zero-shot performance in the downstream task. In our experiments, we set k to 3 for MAP. Similarly, we use NDCG@3.
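The two metrics can be sketched in plain Python as below. The sketch makes two assumptions that the text does not specify: NDCG uses a linear gain with a log2 discount (an exponential gain is also common), and the graded gains are supplied by the caller. MAP is then the mean of `average_precision` over all target languages. Function names are hypothetical.

```python
import math


def average_precision(predicted, relevant):
    """AP of one predicted ranking against a set of relevant languages.

    predicted: list of languages, best first.
    relevant: set of languages treated as relevant (e.g., the top-3 by
    zero-shot performance).
    """
    hits, score = 0, 0.0
    for i, lang in enumerate(predicted, start=1):
        if lang in relevant:
            hits += 1
            score += hits / i  # precision at this cut-off
    return score / len(relevant) if relevant else 0.0


def ndcg_at_p(predicted, gain, p=3):
    """NDCG@p with linear gain and log2 discount (assumed formulation).

    gain: dict of language -> graded relevance; missing languages get 0.
    """
    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    actual = dcg([gain.get(lang, 0) for lang in predicted[:p]])
    ideal = dcg(sorted(gain.values(), reverse=True)[:p])
    return actual / ideal if ideal else 0.0
```

A ranking that places the three best transfer languages first, in order, scores 1.0 on both metrics. Pushing them down the list lowers AP, and NDCG@3 additionally penalizes the order within the top three.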
We train and evaluate the model using leave-one-out cross-validation, where one language is set aside as the test language while the other languages are used to train the ranking model. Among the training languages, each language is taken in turn as the target language while the others serve as the transfer languages.
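The leave-one-out protocol can be made concrete with a small generator that yields, for each held-out test language, the training queries formed from the remaining languages. The function name and the (target, transfer list) query format are assumptions for illustration.

```python
def leave_one_out_splits(languages):
    """Yield (test_language, training_queries) for each held-out language.

    Each training query is (target, [candidate transfer languages]),
    built only from the languages that are not held out.
    """
    for test_lang in languages:
        train_langs = [l for l in languages if l != test_lang]
        queries = [
            (tg, [tf for tf in train_langs if tf != tg])
            for tg in train_langs
        ]
        yield test_lang, queries
```

With 16 languages this produces 16 folds, each containing 15 training queries with 14 candidate transfer languages apiece.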

Baselines
LANGRANK LANGRANK (Lin et al., 2019) uses 13 features to train the ranking model: The dataset size in transfer language (tf size), target language (tg size), and the ratio between the two (ratio size); Type-token-ratio (ttr) which measures lexical diversity and word overlap for lexical similarity between a pair of languages; various distances between a language pair from the URIEL database (geographic geo, genetic gen, inventory inv, syntactic syn, phonological phon and featural feat).
MTVEC Malaviya et al. (2017) proposed to learn a language representation while training a neural machine translation (NMT) system in a similar fashion to Johnson et al. (2017). During training, a language token is prepended to the source sentence and the learned token's embedding becomes the language vector. Bjerva et al. (2019) have shown that such language representations contain various types of linguistic information ranging from word order to typological information. We used the vectors released by Malaviya et al. (2017), which have 512 dimensions.

Individual Feature Contribution
We first look into whether the proposed features are helpful in ranking transfer languages for sentiment analysis and dependency parsing (Table 1). We add all three features (PRAG) to the two baseline features (LANGRANK, MTVEC) and compare the performance in the two tasks. Results show that our features improve both baselines in SA, implying that the pragmatic information captured by our features is helpful for discerning the subtle differences in sentiment among languages.
In the case of DEP, including our features yields inconsistent results: the features improve the performance of MTVEC while they degrade the performance of LANGRANK. Although some performance increase was observed when applied to MTVEC, the performance of MTVEC in DEP remains extremely poor. These conflicting trends suggest that pragmatic information is not crucial to less pragmatically-driven tasks, represented by dependency parsing in our case. The low performance of MTVEC in DEP is notable, as MTVEC is generally believed to contain a significant amount of syntactic information, with much higher dimensionality than LANGRANK. It also suggests a limitation of using distributional representations as language features: their lack of interpretability makes it difficult to control the kinds of information used in a model.
We additionally conduct ablation studies by removing each feature from the +PRAG models to examine each feature's contribution. The SA results show that LCR and LTQ significantly contribute to overall improvements achieved by adding our features, while ESD turns out to be less helpful. Sometimes, removing ESD resulted in a better performance. In contrast, the results of DEP show that ESD consistently made a significant contribution, and LCR and LTQ were not useful. The results imply that the emotion semantics information of languages is surprisingly not useful in sentiment analysis, but more so in dependency parsing.

Group-wise Contribution
The previous experiment suggests that the same pragmatic information can be helpful to different extents depending on the downstream task. We further investigate to what extent each kind of information is useful to each task by conducting group-wise comparisons. To this end, we group the features into six categories: Pretrain-specific, Data-specific, Typology, Geography, Orthography, and Pragmatic. Pretrain-specific features cover factors that may be related to the performance of pretrained language models used in our task-specific cross-lingual models. Specifically, we used the size of the Wikipedia training corpus of each language used in training mBERT. Note that we do not measure this feature group's performance on DEP as no pretrained language model was used in DEP. Data-specific features include tf size, tg size, and ratio size. Typological features include geo, syn, feat, phon, and inv distances. Geography includes geo distance in isolation. The Orthographic feature is the word overlap between languages. Finally, the Pragmatic group consists of ttr and the three proposed features, LCR, LTQ, and ESD. ttr is included in Pragmatic as Richards (1987) has suggested that it encodes a significant amount of cultural information.
Table 2 reports the performance of ranking models trained with the respective feature category. Interestingly, the two tasks showed significantly different results; the Pragmatic group showed the best performance in SA while the Typology group outperformed all other groups in DEP. This again confirms that the features indicating cross-lingual transferability differ depending on the target task. Although the Pretrain-specific features were more predictive than the Geography and Orthography features, they were not as helpful as the Pragmatic features.

Controlling for Dataset Size
The performance of cross-lingual transfer depends not only on the cultural similarity between transfer and target languages but also on other factors, including dataset size and label distributions. Although our model already accounts for the dataset size to some extent by including tf size as input, we conduct a more rigorous experiment to better understand the importance of cultural similarity in language selection. Specifically, we control the data size by down-sampling all SA data to match both the size and label distribution of the Turkish dataset, which is the second smallest. We then trained two ranking models equipped with different sets of features: LANGRANK and LANGRANK+PRAG.
In terms of languages, we focus on a setting where Turkish is the target and Arabic, Japanese and Korean are the transfer languages. This is a particularly interesting set of languages because the source languages are similar/dissimilar to Turkish in different aspects; Korean and Japanese are typologically similar to Turkish, yet in cultural terms, Arabic is more similar to Turkish.
In this controlled setting, the ground-truth ranking reveals that the optimal transfer language among the three is Arabic, followed by Korean and Japanese. This indicates the important role of cultural resemblance in sentiment analysis, reflecting the rich historical relationship shared between Arabic- and Turkish-speaking communities. LANGRANK+PRAG chose Arabic as the best transfer language, suggesting that the cultural similarity information from the features helped the ranking model learn the cultural tie between the two languages. On the other hand, LANGRANK ranked Japanese above Arabic, possibly because the provided features mainly focus on typological similarity rather than cultural similarity.
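The down-sampling used to control for dataset size in this experiment can be sketched as a per-label random sample that matches a reference dataset's label counts. This is one plausible reading of "match both the size and label distribution"; the function name, the (text, label) sample format, and the fixed seed are assumptions.

```python
import random
from collections import Counter, defaultdict


def downsample_to_match(samples, reference_labels, seed=0):
    """Sample from `samples` so label counts equal those of the reference.

    samples: list of (text, label) pairs from a larger dataset.
    reference_labels: labels of the reference dataset (here, Turkish).
    """
    target_counts = Counter(reference_labels)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed for reproducibility
    selected = []
    for label, n in target_counts.items():
        pool = by_label[label]
        if len(pool) < n:
            raise ValueError(f"not enough samples with label {label!r}")
        selected.extend(rng.sample(pool, n))
    rng.shuffle(selected)
    return selected
```

Matching label counts exactly fixes both the total size and the positive/negative balance, so remaining differences in zero-shot performance are less likely to stem from the amount of training data.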

Related Work
Quantifying Cross-cultural Similarity A few recent works in psycholinguistics and NLP have aimed to measure cultural differences, mainly from word-level semantics. Lin et al. (2018) suggested a cross-lingual word alignment method that preserves the cultural and social context of words. They derive cross-cultural similarity from the embeddings of a bilingual lexicon in the shared representation space. Thompson et al. (2018) computed similarity by comparing the nearest neighborhood of words in different languages, showing that words in some domains (e.g., time, quantity) exhibit higher cross-lingual alignment than other domains (e.g., politics, food, emotions). Jackson et al. (2019) represented each language as a network of emotion concepts derived from their colexification patterns and measured the similarity between networks.
Auxiliary Language Selection in Cross-lingual Tasks There has been active work on leveraging multiple languages to improve cross-lingual systems (Neubig and Hu, 2018; Ammar et al., 2016). Adapting auxiliary language datasets to the target language task can be done through either language selection or data selection. Previous work on language selection mostly relied on syntactic or semantic resemblance between languages (e.g., n-gram overlap) to choose the best transfer languages (Zoph et al., 2016; Wang and Neubig, 2019). Our approach extends this line of work by leveraging cross-cultural pragmatics, an aspect that has been unexplored by prior work.

Future Directions
Typology of Cross-cultural Pragmatics The features proposed here provide three dimensions in a provisional quantitative cross-linguistic typology of pragmatics in language. Having been validated, both intrinsically and extrinsically, they can be used in studies as a stand-in for cross-cultural similarity. They also open a new avenue of research, raising questions about what other quantitative features of language are correlates of cultural and pragmatic difference.
Model Probing Fine-tuning pretrained models on downstream tasks has become the de facto standard in various NLP tasks, and their success has promoted the development of their multilingual extensions (Devlin et al., 2019; Lample and Conneau, 2019). While the performance gains from these models are undeniable, their learning dynamics remain obscure. This issue has prompted various probing methods designed to test what kind of linguistic information the models retain, including syntactic and semantic knowledge (Ravishankar et al., 2019; Tenney et al., 2019). Similarly, our features can be employed as a touchstone to evaluate a model's knowledge of cross-cultural pragmatics. Investigating how different pretraining tasks affect the learning of pragmatic knowledge will also be an interesting direction of research.

Conclusion
In this work, we propose three pragmatically-inspired features that capture cross-cultural similarities that arise as linguistic patterns: language context-level ratio, literal translation quality, and emotion semantic distance. Through feature analyses, we examine whether our features can operate as valid proxies of cross-cultural similarity. From a practical standpoint, the experimental results show that our features can help select the best transfer language for cross-lingual transfer in pragmatically-driven tasks, such as sentiment analysis.