To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer

Choosing an appropriate tokenization scheme is often a bottleneck in low-resource cross-lingual transfer. To understand the downstream implications of text representation choices, we perform a comparative analysis on language models having diverse text representation modalities including 2 segmentation-based models (\texttt{BERT}, \texttt{mBERT}), 1 image-based model (\texttt{PIXEL}), and 1 character-level model (\texttt{CANINE}). First, we propose a scoring Language Quotient (LQ) metric capable of providing a weighted representation of both zero-shot and few-shot evaluation combined. Utilizing this metric, we perform experiments comprising 19 source languages and 133 target languages on three tasks (POS tagging, Dependency parsing, and NER). Our analysis reveals that image-based models excel in cross-lingual transfer when languages are closely related and share visually similar scripts. However, for tasks biased toward word meaning (POS, NER), segmentation-based models prove to be superior. Furthermore, in dependency parsing tasks where word relationships play a crucial role, models with their character-level focus, outperform others. Finally, we propose a recommendation scheme based on our findings to guide model selection according to task and language requirements.


Introduction
The performance of multilingual language models varies substantially across languages, with low-resource languages demonstrating particularly sub-optimal results compared to their highresource counterparts.This disparity poses a global challenge for deploying effective NLP applications, given the diverse linguistic landscape worldwide (Blasi et al., 2022).
To address this challenge, cross-lingual transfer has emerged as a promising solution.By leveraging knowledge from high-resource languages, crosslingual transfer aims to enhance the performance of low-resource ones.However, the effectiveness of cross-lingual knowledge transfer is not uniformly observed across all language pairs.It is influenced by various factors, including language style, structure, origin, dataset quality (Yu et al., 2022;Kreutzer et al., 2022), and the specific relationship between the source and target languages (Ahmad et al., 2019;He et al., 2019).On top of that, the selection of an appropriate language model becomes crucial to achieve successful cross-lingual knowledge transfer.While most state-of-the-art models rely on tokenization (Schuster and Nakajima, 2012;Gage, 1994), yielding high scores for various linguistic downstream tasks, their performance in terms of cross-lingual transfer has room for further investigation.Considering that word formation can significantly vary across different languages, differences in tokenization techniques can hinder the transfer of linguistic capabilities between languages (Hofmann et al., 2022).Hence, the exploration of tokenization-free models is also imperative.
This study thoroughly investigates the role and effectiveness of both tokenization-based (Devlin et al., 2019a) and tokenization-free models (Rust et al., 2022) in cross-lingual knowledge transfer.Our selection of models encompasses BERT and mBERT (Devlin et al., 2019a), which uses traditional subword-based segmentation.In addition, we delve into tokenization-free models such as CANINE (Clark et al., 2022) and PIXEL (Rust et al., 2022).CANINE leverages character-level information to accommodate the diverse word formations and structures found in different languages.On the other hand, PIXEL represents texts using visual elements, introducing new possibilities for scriptbased transfer in visually similar languages.
In this study, we perform standard syntactic task evaluation in both zero-shot and few-shot manner arXiv:2310.08078v1[cs.CL] 12 Oct 2023 to evaluate the cross-lingual transfer capabilities of these models.While accuracy, F1 score, Labeled Attachment Score (LAS), etc. are all effective evaluation indicators of the goodness of a model, they are not particularly representative of how much a model has learned in a short span of training.We utilize these common metrics over zero-shot and few-shot steps and propose the Learning Quotient (LQ) metric, a novel scoring metric that depends on the relation between the zero-shot and few-shot scores.The metric evaluates the linguistic characteristics of the languages with the model's performance on the tasks.This metric enables a comprehensive evaluation of cross-lingual transfer capabilities, offering valuable insights into the strengths and weaknesses of the models.Our findings suggest contrastive downstream performance that relates to the model architecture.Furthermore, we present a decision tree framework, based on this extensive analysis providing practical guidance for selecting appropriate models based on specific task requirements and language relationships.This framework serves as a tool for researchers and practitioners seeking to harness the potential of NLP applications across diverse languages.

Methodology
Problem formulation In this work, we use pretrained language models and fine-tune them on source languages followed by few-shot training on the target languages.Consider the sets of target T = {t 1 , t 2 , . . ., t m } and source languages S = {s 1 , s 2 , . . ., s n }.We assume source languages s ∈ S have adequate resources for effective language model training.Conversely, target languages t ∈ T are low-resource languages with limited data.For any language pair (s, t), we aim to quantify how efficiently a language model can learn the target language t using knowledge transferred from the source language s.Given the scarcity of data for t, our focus lies on the model's performance in the early stages of fine-tuning it, denoted by the evaluation score E.
Let (M ) ∞ s represents a language model M fully finetuned on the language s and (M ) c t represents the model finetuned up to c steps.We investigate how fast can a model learn the language t in the early steps if it was previously finetuned on s.Essentially, we measure the performance of the model ((M ) ∞ s ) c t where c is a small positive integer.It's important, however, to acknowledge that the effi-ciency of this method can be influenced by factors such as the similarities between the source and target languages, as well as the quality and quantity of data available for both.
Our methodology can be broadly divided into two steps: Fine-tuning on Sources Following the pretrained model selection, each system is fine-tuned using the selected source languages.This finetuning stage allows each system to adjust and optimize its parameters based on specific requirements.Once fine-tuned, the systems are prepared for the evaluation phase in a cross-lingual transfer scenario.

Evaluation and Scoring
The last step involves evaluating each system's performance on target language tasks after undergoing a certain amount of fine-tuning.Two scores are measured at this point: zero-shot and few-shot scores.To measure the final score, we calculate the LQ-score ( §2).This score allows us to determine the speed and efficiency at which each system learns a new language based on the knowledge transferred from the source language.
Learning Quotient(LQ) metric Let us denote E (tc) s as the score achieved by the model (M )s ∞ on the language t after c steps of training on t.For different tasks, E can be different.We use accuracy for POS tagging and NER, and Labeled Attachment Score (LAS) for dependency parsing.E (t 0 ) s stands for the zero-shot score of the model on t.Using the same logic, is the average zero-shot score across all source languages, denoted as Z A .Now, let's introduce our proposed scoring metric, applicable for any pair of languages t ∈ T and s ∈ S: LQ(t, s) is comprised of two primary terms, along with a normalization factor.The first term measures the performance of the model after fewshot training on language t, relative to the average zero-shot scores for that target language.The second term simply sums the zero-shot and the fewshot scores.To normalize the metric value, we employ the average zero-shot score, Z A .A minute value ϵ is added to the denominator to avoid division by zero cases.The LQ score provides positive reinforcement for both zero-shot and few-shot scores.Any fewshot score that falls below the zero-shot average incurs a substantial penalty.This metric proves effective in quantifying the pace at which a model adapts to a new language. 23 Experimentation Task Selection We perform the evaluation on three downstream tasks that heavily depend on fundamental linguistic capabilities and syntactic structure: Dependency Parsing, Part-of-Speech (POS) tagging and Named Entity Recognition (NER).These tasks can work as indicators of a model's understanding of language dynamics and its ability to comprehend and interpret linguistic information (Chen and Manning, 2014;Manning, 2011;Lample et al., 2016) Language and Dataset Selection For the execution of POS tagging and Dependency Parsing, we utilized the Universal Dependencies (UD) Dataset 2 The proof can be found in Appendix A.2 (Nivre et al., 2017(Nivre et al., , 2020)).To maintain focus and ensure a meaningful study, we selected 9 languages (as listed in Figure 3(a)) as our source languages and 123 languages as our target languages for the experiments3 .All the models were comprehensively fine-tuned on the selected source languages, thereby establishing a baseline for performance comparison4 .For NER, we utilized the MashakhaNER dataset (Adelani et al., 2021) and all its associated languages as sources and targets (as described in Figure 3(b)).MasakhaNER mainly focuses on a few African languages.These languages are quite low-resource.Hence, these were perfect for this research.
Model Selection To ensure a fair comparison, we use BERT, mBERT, CANINE, and PIXEL as our choice of pre-trained models.BERT and mBERT use subword segmentation whereas CANINE is a characterbased model.Unlike these, PIXEL represents text using visual elements rather than traditional tokens.We selected BERT, as it is the most well-established tokenization-based model that aligns with PIXEL's pre-training dataset.On the other hand, characterlevel models provide another perspective for understanding and processing languages, capturing the distinct attributes of word formations.CANINE, with its pre-training on 104 languages, emerged as a strong candidate.As a counterpart, we chose mBERT, which shares a similar scope of pre-training languages.
Experimental Setup Our experiments involved two major training phases followed by a result extraction step.In the first training phase, each language model was fully fine-tuned on each of the source languages for each task.The experimental setup maintained a high computational standard to ensure robust training and evaluation.All experiments were conducted on a remote server equipped with an A100 GPU.The analysis was conducted over 4 (models) x 9 (source languages) x 123 (target languages) data points for Dependency Parsing and POS tagging.For NER, the analysis was conducted over all 4 (models) x 12 (source languages) x 12 (target languages) data points.We used 10 fine-tuning steps (for §1, set c = 10) for the target languages for all tasks.
For reproducing the results, the language models can be fully fine-tuned on the source languages (our finetuned versions can be used directly from Hug-gingFace) to get the zero-shot results.These models can then be finetuned on the target languages for 10 steps to get the few-shot score.

Results and Discussion
First, we break down the results by several key variables including the visual similarity of languages, their lexical correspondence, and the type of language task.Then, we discuss the performance of these models in light of these variables, revealing patterns regarding model characteristics.

Visual similarity is all you need
Case1 (English → European) Both of PIXEL and BERT are pre-trained in English.Therefore, for a fair comparison with other models, we perform a comparison where English is the only source language.For evaluation, we consider various European languages, taking into account both lexical similarity and the LQ score on the POS tagging task.Figure 4 represent the LQ scores of PIXEL and CANINE when English is used as the source language and various other languages as the targets.Here, in Figure 4(a) we observe the proficiency of PIXEL in handling tasks between languages sharing a similar script.For example, English shares similar degrees of lexical similarity with French (0.27) and Russian (0.24) ( §A.5 and §A.6).However, when considering the LQ scores, French significantly outperforms Russian for PIXEL.Moreover, despite Spanish and Portuguese exhibiting low lexical similarity coefficients with English, they both have achieved high LQ scores.A key factor contributing to these scores is the usage of the Latin script.French, Spanish, and Portuguese, which have all garnered high scores, also use the Latin script.Russian employs a different (Cyrillic) script, which likely explains its relatively lower score.Finnish, despite its use of the Latin script, belongs to a different language family compared to English, which may account for the less impressive performances.Moreover, when the script is non-Latin as presented in Figure 4  Case2 (Hindi → Urdu | Marathi) Despite the high mutual intelligibility and substantial grammatical and linguistic similarities between Hindi and Urdu, as acknowledged in the literature (Bhatt, 2005), the LQ score on the POS tagging task attained by PIXEL for this language pairing is not as high as one would anticipate (ranked 94th).The relatively low performance can be attributed to their disparate scripts, underscoring the importance of visual similarity when using image-based language models such as PIXEL.However, for the other three models, with Hindi as the source, Urdu ranked in the top 3 target languages.Table 1 represents this phenomenon.
On the flip side, Hindi and Marathi are not mutually intelligible.But both of these languages use the Devanagari script.Sorting the LQ scores for Hindi as the source language, Marathi comes out as one of the top-performing target languages (4th).
Case3 (Arabic → X) In the case of Arabic as the source language, PIXEL received the highest scores for Persian (ranked 2nd) and Urdu (ranked 3rd) as  respective source languages.Persian and Urdu are both Indo-European languages and are not at all lexically similar to Arabic.However, these are both written using Arabic script.On the contrary, like Arabic, Maltese is an Afro-Asiatic language with Semitic origin.But PIXEL performed extremely poorly in the case of Maltese (ranked 81st).This, we suspect, is due to the use of Latin script in Maltese, which further emphasizes the effect of visual similarity for PIXEL.
In the case of mBERT and CANINE, these patterns of favoring similar-looking scripts were absent.Rather, we saw an average score for the languages irrespective of the script.
Case4 (African → African) We've compared all four models using 10 African languages from the MasakhaNER dataset for the Named Entity Recognition (NER) task.Aside from Amharic, which uses the Ge'ez script, all other languages use the Latin script.Figure 5 shows the average LQ score obtained by PIXEL and CANINE models for each lan-guage as sources.The Table shows Amharic as an unfit choice for the source language when the target languages are in Latin script.Comparing PIXEL and CANINE, we notice CANINE outperforms PIXEL.Since PIXEL was only pre-trained on English, it is comparatively difficult for PIXEL to perform well on African languages.Conversely, CANINE was pre-trained on Yoruba (an African language) which has strong linguistic similarities with other African languages.
Observation Clearly, the above findings highlight the positive correlation between the performance of PIXEL, an image-based language model, and the visual similarity between languages.It is logical to expect that visually similar language would demonstrate better performance in crosslingual transfer when utilizing PIXEL.The findings in the CANINE and mBERT comparison further reinforce the notion that language models that do not rely on visual representations do not exhibit a strong correlation between their scores and the visual similarity of the source and target languages.

Task Specific Performance
POS tagging In general, mBERT learns quickly compared to other models.This can be attributed to several reasons.First of all, mBERT operates on token-level representations and manifests heavy reliance on word-level semantics.So it is easier to associate the word or subword tokens with their respective POS tags, compared to character-level models like CANINE.Moreover, mBERT's predefined vocabulary, which includes commonly used subwords can potentially expedite the learning process as the model can leverage semantic associations between these known tokens and their POS tags.On the contrary, character-level models have larger input sequence lengths and may require more examples to adequately learn the pattern in data which can lead to slower learning as compared to the tokenization-based models.
In addition, mBERT is trained on multilingual data.So it is more efficient than BERT at transferring knowledge from a high-resource language to a lowresource language, enhancing its few-shot learning capabilities for POS tagging tasks across different languages.
Dependency Parsing Interestingly, CANINE performs better than mBERT or BERT.This may be partly attributed to the nature of the task.Parsing is centered more on understanding the syntactic relationships between words in a sentence rather than on the meanings of individual words.As CANINE works on character level, it is more equipped to capture finer-grained patterns in these relationships, outperforming mBERT, exactly because the necessary information is marked with affixal morphemes in many languages.Moreover, CANINE operates without a predefined vocabulary, and its language independence might be advantageous when parsing sentences in a low-resource language or multilingual context.As a result, it can transfer knowledge across languages more fluidly.On top of that, the occurrence of out-of-vocabulary words or rare words can impact the parsing accuracy.As a character-level model, CANINE is better equipped in handling out-of-vocabulary words, which might be the reason for its improved performance in parsing in few-shot scenarios.Table 3: Few-shot accuracy for POS tagging task with Coptic as the source language highlighting the performance of BERT (monolingually pre-trained) over mBERT and CANINE.Coptic is the only source language (in our analysis) that is not part of the pre-training languages of mBERT and CANINE and the only language where BERT significantly outperforms mBERT and CANINE Named Entity Recognition NER, like POS tagging, leans heavily on understanding the meanings of individual words in order to accurately identify and classify named entities.This semantic nature of the task presents an advantage for segmentation-based models such as mBERT over character-level models like CANINE.Despite the multilingual strength of CANINE, its focus on character-level patterns may not sufficiently capture the semantic nuances needed for effective NER.Conversely, mBERT, with its token-based approach, can better handle the word meanings central to NER tasks.Therefore, in our analysis, mBERT demonstrates slightly superior performance in NER compared to CANINE.This suggests that while character-level models may excel in tasks centered on syntactic relationships, segmentationbased models may still hold the edge in tasks with a strong semantic dependency.

Unseen Languages
BERT performs better than mBERT and CANINE on some languages that these multilingual models were not pre-trained on.For example, consider the case study of Coptic.In comparison to CANINE and mBERT, BERT has better scores for POS tagging when Coptic is used as the source language (Table 3).Multilingual models like CANINE and mBERT underperform in this case.Among all the source languages used in our analysis, Coptic is the only source that is not part of the pre-training languages of CANINE and mBERT.It is also the only language where BERT has consistently outperformed the multi-lingually pre-trained models.This inability to effectively adapt to a new unseen language could be attributed to the influence of the scripts of those languages.In these cases, transliterating the target to a high-resource language has been shown to improve performance on downstream tasks (Muller et al., 2021).

Model Recommendation Tree
Based on our findings, we propose a model selection pathway predicated on three primary considerations: resource availability for the target language, the presence of a visually similar high-resource language, and the task's semantic dependency.
High Resource Languages In the context of high-resource languages, we recommend employing the most advanced models.Our research indicates that both character-based models like CANINE and tokenization-based models like mBERT ex-hibit superior performances in this setting.Generally, multilingual pre-training grants these models a notable edge over their monolingually trained counterparts, making them well-suited for tasks involving high-resource languages and ensuring efficient performance.
Visual Similarity In cases where the target language is resource-poor but visually resembles a high-resource language, our suggestion is to undertake a cross-lingual transfer from the high-resource language using a tokenization-free model like the PIXEL.PIXEL is explicitly designed to discern and capitalize on visual correspondences between languages, which makes it an optimal choice in instances where such resemblances can be exploited.
Semantic Dependency If a high-resource language somewhat closely related to the target language has been used in pre-training a multilingual model, the choice between different models should be guided by the task's semantic content requirements.If the task depends heavily on semantic understanding, models like mBERT or similar tokenization-based models are advisable.These models excel in scenarios where deep semantic comprehension is key.Conversely, if the task doesn't require a strong understanding of semantics, character-based models like CANINE may be a more efficient choice.These models typically perform well in scenarios where semantic dependence is lower.
Special Cases For scenarios that do not fall within the purview of the above-mentioned conditions, a multitude of factors come into play.For instance, when the source language was not part of the pre-training set for the multilingual model, we suggest transliterating the target language to a high-resource language.Transliterating those languages substantially enhances the performance of these multilingual models on downstream tasks.

Related Work
Cross-lingual transfer Cross-lingual transfer has emerged as a valuable approach to enhance model performance in low-resource languages without requiring extensive amounts of target language data (Conneau et al., 2020).XLM-R, proposed by Conneau et al., demonstrates the effectiveness of pre-training on a large-scale masked language model trained on 100 languages from CommonCrawl data.It outperforms multilingual BERT (mBERT) on various cross-lingual benchmarks.Similarly, Devlin et al. and Xue et al. propose finetuning approaches for existing pre-trained language models (PLMs).Recently, another approach by Lee et al. employs adapters for crosslingual transfer in low-resource languages.Fusing Multiple Adapters for Cross-Lingual Transfer (FAD-X) utilizes language adapters and task adapters to address the imbalance in lower-resource languages.MAD-X (Pfeiffer et al., 2020) is another adapter-based method that employs language, task, and invertible adapters.Moreover, this similar setting coupled with language phylogeny information proved to be useful for low-resource cross-lingual transfer (Faisal and Anastasopoulos, 2022).
Tokenization-free models Tokenization-based models such as BERT (Devlin et al., 2019b), RoBERTa (Liu et al., 2019), GPT-3 (Brown et al., 2020), ALBERT (Lan et al., 2020), T5 (Raffel et al., 2020) and ELECTRA (Clark et al., 2020b) are leading the field when it comes to performance across a broad range of natural language processing tasks.However, tokenization-based models like BERT demonstrate poor performance in unexplored domains (Boukkouri et al., 2020) and lack resilience to noisy data such as typos and missed clicks (Sun et al., 2020).
Studies have shown that models using visual text representations are more robust (Salesky et al., 2021).PIXEL (Rust et al., 2022) proposes the use of visual embeddings for language modeling, eliminating the need for a fixed vocabulary.Research suggests that models utilizing visual text representations exhibit greater resilience to noisy texts and enable rapid adaptation to new languages while maintaining performance.
CANINE (Clark et al., 2022), a character-based model, provides an alternative approach that eliminates the reliance on predefined vocabularies.CANINE surpasses vanilla BERT on the TyDiQA benchmark (Clark et al., 2020a) by downsampling input sequences to achieve similar speeds.
ByT5 (Xue et al., 2021a) introduces a modified version of the standard transformer that processes byte sequences, addressing the limitations of a finite vocabulary.Similarly, CHARFORMER (Tay et al., 2021) proposes a gradient-based sub-word tokenization method that operates directly on a byte level.It performs on par with tokenizer-based approaches and outperforms most byte-level methods.
Language Similarity Metrics Several researchers have proposed different methodologies to quantify similarity among languages.For instance, (Petroni and Serva, 2010) introduced a measure of lexical distance, which quantifies the difference between languages based on their vocabulary.On the other hand, (Chiswick and Miller, 2005) suggests a metric of linguistic distance that represents how challenging it is for English speakers to learn other languages.However, this method relies on English speakers' learning difficulty, making it language-biased and not generalizable for speakers of other languages.
A different approach is presented by Ciobanu and Dinu, who propose an automated method for identifying pairs of cognates (words with a common etymology) across languages.But this cognate identification method requires a known list of cognates, limiting its usefulness for less-studied languages, and it may overlook non-lexical aspects of language similarity.
Another common tool is the Automated Similarity Judgment Program (Automated Similarity Judgment Program, 2023) which uses a comprehensive database of vocabulary to analyze linguistic relationships but has been criticized for its simplified standard orthography and its reliance on a limited vocabulary list.

Conclusion
This study provides pivotal insights into the practical application of tokenization-based as well as tokenization-free models in cross-lingual transfer tasks, accentuating the importance of context and task-based model selection.However, there's an abundance of uncharted territory awaiting exploration.The gaps in our understanding of tokenization-free models such as PIXEL and CANINE present a significant opportunity for further research.These models, though promising, are still in their early stages of development.This paves the way for studies aiming to enhance their performance, potentially through the integration of advanced learning algorithms or novel feature extraction techniques.
Additionally, investigating the role of tokenization in handling different language families could provide profound insights.For instance, how do these models perform with agglutinative languages like Turkish or Finnish, or with logographic languages like Chinese?Exploring such linguistic diversity could further clarify the strengths and weaknesses of different model types.An iterative inclusion of extinct or less commonly spoken languages is also essential at this point.
In summary, this study marks a significant step in understanding the capabilities and limitations of different models in cross-lingual transfer tasks.It opens several doors for future research, promising an exciting trajectory for the evolution of language modeling and translation tasks.The journey ahead, albeit challenging, presents a wealth of opportunities for innovation and discovery.

Limitations
This research, while extensive, presents certain limitations.Our study focuses primarily on syntactic tasks, leaving semantic tasks unexplored.While our work delves into the performance of specific models like BERT, mBERT, PIXEL, and CANINE, other models, especially emerging ones like decoder-based language models, remain unexamined in this context.The research also predominantly concerns low-resource languages, potentially limiting the applicability of our findings to high-resource contexts.Moreover, the consideration of different language families, such as agglutinative or logographic languages, is lacking in this analysis.Looking ahead, we plan to address these limitations by incorporating a broader range of language tasks, investigating a wider array of language models, and expanding our research to include high-resource languages and different language families.This will allow us to present a more holistic understanding of cross-lingual transfer in future studies.

A Appendix
A.1 Frequently Asked Questions 1. Q: What did the authors mean by 'few-shot' and 'zero-shot'?A: The term 'few-shot' is quite loosely used in this paper.Each model is at first fully trained on a source language and then evaluated on some target language.In the evaluation phase, the model is either (i) directly evaluated on the target language (termed as zero-shot), or (ii) fine-tuned for a few steps on the target language (termed as few-shot).
. We can rewrite the LQ score as: We assume that a score would effectively measure the cross-lingual transfer capabilities if it gets positively rewarded for a higher score after a few shots of training in comparison to other language pairs and in comparison to the state before few-shot training.That means the growth of F from Z 0 and the difference of F with Z A should play a high impact on the score.
Simplifying the right-hand-side of Eqn 1, we get, In equation 5, the term (F + Z 0 ) /Z A will be greater than 1 when either F is very large or Z 0 is significantly larger than Z A .That means a strong positive score can be obtained when the few-shot score is very high or the leap from zero-shot to few-shot is high.The remaining term F 1 + Z 0 F ensures the stability of the score.So, if a model learns quickly and gains good accuracy/las in the early steps of training, the LQ score will give out a strong score.If a model achieves a good score in zero-shot learning, it also receives a good LQ score.

Limitations of LQ Score
The score utilizes a normalizing term that averages the zero-shot scores across all source languages.So, for any pair of languages, x and k, the LQ score will not always be the same.It will vastly depend on the list of source languages used in the experimentation.So, the numeric value of the LQ score does not have a direct meaning.However, for a given source, the relation between the target languages is indicative of how compatible the source and target are.On the flip side, for a target language, the relation between the source languages is also meaningful.Table 4 provides a comprehensive analysis of the PIXEL model's performance in terms of accuracy in the POS-tagging task, evaluated in both zeroshot and few-shot scenarios.Here, the set of source languages also serves as the target languages, creating a self-referential evaluation method.This unique approach further allows for a deeper understanding of the model's strengths and weaknesses when dealing with identical sources and target languages.

A.5 List of target languages
Tables 5, 6, and 7 give an elaborate list of languages and their scripts along with their respective families.The languages are spread across multiple scripts and multiple families.

A.6 Lexical Similarity
Lexical similarity is the percentage obtained by comparing standardized wordlists from two linguistic varieties and tallying words similar in form and meaning (Ethnologue, 2023).It ranges from 0 to 100, representing the vocabulary overlap between two languages.Values over 85% often suggest the speech variant may be a dialect of the compared language.The proportion of lexical similarity between two kinds of language is calculated by comparing standardized lists of words and tallying the forms that demonstrate similarity in both structure and meaning.

Figure 1 :
Figure 1: Distribution of the languages according to their sub-families.The majority of these are of Indo-European origin.The languages belong to 28 subfamilies spanning 13 different families Figure 3: Geographic distribution of source languages (with script and family) used in the analysis across tasks.
(b), CANINE has an edge over PIXEL.The lexical similarities between different European languages are outlined in

Figure 4 :
Figure 4: LQ score obtained by PIXEL and CANINE on Latin and non-Latin scripts on POS tagging.PIXEL outperforms CANINE on the POS tagging task when both source and target use the same script (on the left portion of the graph).Conversely, PIXEL does not outperform CANINE when the scripts are dissimilar (on the right portion of the graph)

Figure 5 :
Figure 5: Average LQ scores with each language as sources for NER task (for PIXEL and CANINE) shows Amharic (only non-Latin script) pairs significantly worse with other languages that use Latin script

Table 1 :
Table 8 in the appendix.
Comparison between different language models on Hindi as the source and Urdu and Marathi as target shows CANINE and mBERT massively favor linguistically similar languages.PIXEL favors visual similarity

Table 2 :
LQ score and rank of PIXEL with Arabic as the source language shows PIXEL receives a high score when scripts are visually similar rather than when languages are only linguistically similar.