A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models

Recent work on tokenizer-free multilingual pretrained models shows promising results in improving cross-lingual transfer and reducing engineering overhead compared to subword-based alternatives. However, previous work mainly focuses on reporting accuracy on a limited set of tasks and data settings, placing less emphasis on other important factors when tuning and deploying the models in practice, such as memory usage, inference speed, and finetuning data efficiency. We attempt to fill this gap by performing a comprehensive empirical comparison of multilingual tokenizer-free and subword-based models along these dimensions. Surprisingly, we find that subword-based models might still be the most practical choice in many settings, achieving better performance with lower inference latency and memory usage. Based on these results, we encourage future work on tokenizer-free methods to consider these factors when designing and evaluating new models.


Introduction
Several recent results (Clark et al., 2022; Xue et al., 2022) have excited the research community with the possibility of "tokenizer-free" models, i.e., character-level and byte-level models, as an alternative to more traditional subword-based models. Tokenizer-free models are especially appealing to practitioners as they can eschew the two-step processing pipeline of subword segmentation followed by modeling and reduce the corresponding difficulties in cross-lingual transfer (Hu et al., 2020; Maronikolakis et al., 2021; Rust et al., 2021; Wang et al., 2021) or domain adaptation (Sato et al., 2020; Liu et al., 2021) caused by inconsistent subword units.
However, in our own attempts to apply tokenizer-free methods, we encountered a number of practical difficulties. This paper is a chronicle of some of the concerns we uncovered; we highlight challenges with applying these models and propose best practices for future results reporting in this area.
Specifically, we perform experiments finetuning pretrained multilingual models, evaluating them with respect to finetuning data efficiency, inference time, and memory consumption. Based on these multiple dimensions, we come to the somewhat surprising conclusion that subword-based models, in particular mBERT (Devlin et al., 2019), might still be the most practical choice in most settings, as they perform best while maintaining a relatively low inference cost.

Tokenizer-free Multilingual Models
While multilingual pretrained models (Devlin et al., 2019; Lample and Conneau, 2019; Liu et al., 2020; Xue et al., 2021) have led to impressive performance improvements for low-resource languages through cross-lingual transfer, the standard word representation method in these models relies on subword segmentation (Sennrich et al., 2016; Kudo, 2018). In multilingual settings, subword tokenization can be sub-optimal, as supporting hundreds of languages with various scripts and vocabularies causes segmentation mismatches between languages and over-segmentation in lower-resourced languages (Wang et al., 2020).
To alleviate this problem, recent works propose removing the subword segmentation step entirely by using characters or bytes as lexical units (Clark et al., 2022; Xue et al., 2022). In particular, these "tokenizer-free" methods have been applied to both encoder-only and encoder-decoder models. Tab. 1 presents an overview of the different tokenizer-free multilingual models alongside comparable subword models. Next, we briefly describe the two tokenizer-free models we consider in this work. To keep the parameter count fixed between mT5 and ByT5, ByT5 allocates the parameters saved from the embedding layer to additional encoder layers. Although adding more depth to the encoder is a reasonable design choice, our results in § 4 show that ByT5 suffers from a much higher inference cost due to the deeper encoder, especially when input/output sequence lengths are longer.
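To make the difference in tokenization granularity concrete, the short snippet below (our own illustration, using publicly available Hugging Face checkpoints rather than the exact pretraining setups of the compared papers) contrasts the sequence lengths that a subword tokenizer and a byte-level tokenizer produce for the same input; the much longer byte-level sequences are what drive the inference costs discussed in § 4.

# Minimal sketch: compare tokenized sequence lengths for a subword model (mBERT)
# and a byte-level model (ByT5); checkpoints are the public Hugging Face releases.
from transformers import AutoTokenizer

text = "Tokenizer-free models operate directly on characters or bytes."

subword_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # ~120k subword vocabulary
byte_tok = AutoTokenizer.from_pretrained("google/byt5-small")                # raw UTF-8 bytes, no learned vocabulary

subword_len = len(subword_tok(text)["input_ids"])
byte_len = len(byte_tok(text)["input_ids"])
print(subword_len, byte_len)  # the byte-level sequence is several times longer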

Experimental settings
We conduct a multi-dimensional evaluation focusing on two aspects, finetuning data efficiency (§ 4.1) and inference cost (§ 4.2), to provide a better understanding of the practical applicability of tokenizer-free models. We finetune and evaluate two subword-based models (mBERT, mT5) and two tokenizer-free models (CANINE, ByT5), as mBERT-CANINE and mT5-ByT5 are directly comparable counterparts in terms of their pretraining corpora, as shown in Tab. 1 (mT5 and ByT5 are both pretrained on the mC4 corpus: https://www.tensorflow.org/datasets/catalog/c4#c4multilingual). For the T5 models, we consider only the Small variants of mT5 and ByT5, as the focus of our work is on the practical implications of using multilingual pretrained models in relatively resource-constrained settings.
Specifically, we finetune the models on three multilingual natural language understanding tasks adopted from the XTREME benchmark (Hu et al., 2020). The three tasks we choose cover various input and output formats: sequence-level classification (XNLI), token-level classification (NER), and extractive question answering (TyDi QA-GoldP).

Tasks
XNLI The Cross-lingual Natural Language Inference (XNLI) dataset (Conneau et al., 2018) is a sequence classification task in which the model predicts whether the hypothesis sentence is an entailment of, a contradiction of, or neutral with respect to the premise sentence. The task is provided in 15 languages.
NER Named Entity Recognition (NER) is a structured prediction task in which the model predicts a tag (location, person, or organization) in IOB2 format for each token in the input sentence. We use the WikiAnn dataset (Pan et al., 2017) and select 20 out of 282 languages for multilingual training based on linguistic diversity and language availability in the other two tasks we consider.
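As a concrete illustration of the IOB2 scheme (the example sentence below is ours, not drawn from WikiAnn), each token receives either O or an entity tag with a B-/I- prefix:

# Illustrative IOB2 tagging: B- marks the beginning of an entity span, I- its
# continuation, and O marks tokens outside any entity.
tokens = ["Barack", "Obama", "visited", "New", "York", "."]
tags   = ["B-PER",  "I-PER", "O",       "B-LOC", "I-LOC", "O"]
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")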
TyDi QA-GoldP The Typologically Diverse Question Answering (TyDi QA) dataset (Clark et al., 2020) is an extractive QA benchmark in 11 languages. While the original dataset includes two "primary" tasks (SelectP, MinSpan), the secondary GoldP task is the most widely adopted as it is compatible with other SQuAD-style QA tasks (Rajpurkar et al., 2016; Artetxe et al., 2020). For this reason, we mainly compare models on TyDi QA-GoldP.
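For intuition about the GoldP format, the sketch below (our own minimal version; the untrained QA head only illustrates the interface, not our finetuned models) shows how an encoder-only model frames extractive QA as predicting the start and end positions of the answer span within the gold passage.

# Hedged sketch of extractive QA with an encoder-only model: predict start/end
# positions of the answer span in the passage (the QA head here is untrained).
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-multilingual-cased")

question = "Where was Marie Curie born?"
passage = "Marie Curie was born in Warsaw and later moved to Paris."
inputs = tokenizer(question, passage, return_tensors="pt")
outputs = model(**inputs)

start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])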

Details of Hardware and Measurements
We use a single Tesla V100 (32GB) GPU for all inference cost measurements. To obtain peak GPU memory and inference latency, we randomly select 100 samples from the English test set of each task and measure the average cost of predicting one example at a time.
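A minimal PyTorch sketch of this measurement protocol is shown below; the predict_fn callback and the example format are placeholders, and our actual measurement scripts may differ in details.

# Sketch: per-example inference latency and peak GPU memory, averaged over a
# sample of test examples, with batch size 1 (one example at a time).
import time
import torch

@torch.no_grad()
def measure_inference_cost(model, examples, predict_fn):
    model.eval().cuda()
    torch.cuda.reset_peak_memory_stats()
    latencies = []
    for example in examples:            # e.g., 100 randomly sampled test examples
        torch.cuda.synchronize()
        start = time.perf_counter()
        predict_fn(model, example)      # placeholder for the task-specific forward/generate call
        torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return sum(latencies) / len(latencies), peak_mem_gb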

Finetuning data efficiency
Most work presenting multilingual pretrained models evaluates downstream task performance under multilingual finetuning or zero-shot scenarios. In practice, however, downstream task datasets are often available in the language of interest. Thus, in addition to multilingual training, we compare models tuned on different data sizes within a single language to evaluate their finetuning data efficiency.
Specifically, we finetune the four pretrained models with varying numbers of task examples: 10^2, 10^3, 10^4 (when available), all target-language samples (Single), and multilingual training (Multi), the latter covering situations where the task dataset is available in multiple languages. We experiment with four downstream task languages, English, Arabic, Russian, and Swahili, chosen based on both linguistic diversity and varying pretraining resource conditions (the pretraining corpus sizes are noted in § B.4, Tab. 8). While the controlled experiments are done on a subset of languages, we report task performance in all languages for zero-shot evaluation, single-language training, and multilingual training in § B.3 for comprehensiveness.
In Fig. 1, we report the models' task performance averaged over languages under different finetuning settings. Notably, we find that mBERT achieves the highest score in most settings; the only exception is XNLI Single and Multi, where ByT5 slightly outperforms mBERT. As the dataset size decreases, it becomes more evident that mBERT is the most sample-efficient, especially in the most data-scarce scenario where only 100 finetuning examples are available. The fact that mBERT outperforms mT5 and ByT5 on smaller datasets is quite surprising, as one might expect the T5 models to generalize better in low-resource settings given their much larger pretraining corpus.
Interestingly, we find that CANINE performs poorly compared to mBERT on all three tasks, and the performance gap increases as less finetuning data is available. To explain this phenomenon, we hypothesize that character-level models bear the additional burden of learning to compose characters into semantically meaningful units and thus require more data to learn task-specific higher-level semantics. These results align with the NER results on the CoNLL and MasakhaNER datasets in Clark et al. (2022), where mBERT outperformed CANINE in all languages except Amharic, a language not covered by mBERT's vocabulary.
However, mBERT's stronger performance on TyDi QA-GoldP was unexpected, as CANINE performed better on the TyDi QA primary tasks in Clark et al. (2022). Through replication experiments to reconcile the contradictory findings, we found that mBERT also outperforms CANINE on the primary tasks when finetuned for more epochs with our codebase, suggesting that the previous mBERT baseline was potentially undertrained (we include the finetuning code in our released codebase).
For mT5 and ByT5, we find that the two models perform comparably on smaller datasets, while on larger sets ByT5 consistently outperforms mT5 on all tasks. We note that the mT5-Small model could have been penalized in terms of capacity, as 85% of its parameters are allocated to embeddings (as shown in Tab. 1), leaving only 44M parameters for the non-vocabulary layers. This is even less than mBERT (86M), and drastically smaller than ByT5-Small, which assigns 298.5M parameters to the non-vocabulary layers. Also, given that the tasks concerned are not generation-heavy, the extra encoder depth (12 layers for ByT5 vs. 8 for mT5) might have favored ByT5 over mT5.
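The vocabulary vs. non-vocabulary parameter split referenced above can be recomputed along the following lines (a rough sketch using the public mT5-Small checkpoint; the name-based filter is our own heuristic, not the exact accounting behind Tab. 1):

# Sketch: split a T5-style checkpoint's parameters into vocabulary-related
# (input embeddings and, when untied, the output projection) and the rest.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def is_vocab_param(name: str) -> bool:
    return any(key in name for key in ("shared", "embed_tokens", "lm_head"))

vocab = sum(p.numel() for n, p in model.named_parameters() if is_vocab_param(n))
other = sum(p.numel() for n, p in model.named_parameters() if not is_vocab_param(n))
print(f"vocabulary params: {vocab / 1e6:.0f}M, non-vocabulary params: {other / 1e6:.0f}M")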

Inference cost
Another key concern in utilizing pretrained models for downstream applications is the inference cost, such as memory consumption and latency. In Fig. 2, we plot each model's inference latency and peak memory consumption, color-coding their task performance to provide a comprehensive view of the trade-offs of deploying each model in practice.
In general, the encoder-only models, mBERT and CANINE, incur much lower memory usage and inference latency than mT5 and ByT5. Considering performance alongside inference cost, we find that mBERT is still the most practical choice among the four models, achieving the best performance while maintaining a relatively low inference cost.
While producing longer sequences than mBERT, CANINE does not necessarily incur higher memory or latency costs, as it has fewer parameters than mBERT. This helps CANINE, especially in sentence-level tasks (XNLI, NER) where inputs are relatively shorter. However, for tasks with much longer inputs (TyDi QA), the computational overhead from the sequence length dominates the parameter reduction, leading to higher memory usage and slower inference for CANINE.
For mT5 and ByT5, inference costs vary according to the task's input and output lengths. For tasks with short inputs and outputs like XNLI, ByT5 yields better performance than mT5 while retaining similar costs. However, for token-level prediction tasks like NER, ByT5 needs to generate tags autoregressively at the byte level, which drastically slows down inference; the additional cost is negligible in terms of memory consumption, though, as the inputs are still relatively short. For TyDi QA, we observe the opposite pattern: since the input is a long passage, the extended input sequence significantly increases the memory consumption of ByT5, requiring more effort in tuning the batch size to fit into GPU memory.
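One way to see why byte-level tagging is slow is that each output byte costs one decoder step. The snippet below (ours, with a hypothetical tag-sequence target string, not necessarily the exact seq2seq output format we use) compares how many decoding steps the same target would take for ByT5 and mT5.

# Sketch: number of autoregressive decoding steps needed to emit the same NER
# target string, byte-level (ByT5) vs. subword-level (mT5).
from transformers import AutoTokenizer

target = "B-PER I-PER O B-LOC I-LOC O"   # hypothetical seq2seq NER target string
byt5_steps = len(AutoTokenizer.from_pretrained("google/byt5-small")(target)["input_ids"])
mt5_steps = len(AutoTokenizer.from_pretrained("google/mt5-small")(target)["input_ids"])
print(byt5_steps, mt5_steps)  # one step per byte vs. fewer subword steps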

Related work
Large-scale NLP models have achieved remarkable performance on various natural language tasks, with the recent ChatGPT demonstrating near human-level language understanding capabilities. While these models achieve impressive results in standard benchmark settings, their applicability has remained limited, mainly due to practical considerations including their high energy consumption and environmental impact (Strubell et al., 2019). Both the NLP and computer vision communities have proposed evaluating models based on practical metrics, such as training/inference efficiency (Canziani et al., 2016; Dehghani et al., 2021; Zhou et al., 2021), energy usage (Henderson et al., 2020), robustness (Ribeiro et al., 2020; Kiela et al., 2021; Koh et al., 2021), and expected performance (Dodge et al., 2019). Similarly, a recent study by Liang et al. (2022) suggests a comprehensive evaluation suite for generative NLP models, including measures of robustness, fairness, and efficiency. Our multi-dimensional evaluation is an attempt to extend these evaluation protocols to multilingual settings and examine the trade-offs of various tokenization schemes.

Conclusion
In this paper, we present a multi-dimensional evaluation of tokenizer-free multilingual models, focusing on their efficiency with respect to finetuning dataset size and inference cost. Based on our experiments, we find that mBERT might still be the most cost-effective choice for many tasks, and show that the efficiency trade-offs of model design choices (tokenization, decoder availability) depend heavily on the task's length statistics. Despite our findings, tokenizer-free models still have a significant advantage in reducing engineering effort and potentially increasing robustness to noisy data. We believe more work should be done on developing efficient tokenizer-free models, and encourage the community to consider these criteria of practical applicability when developing and evaluating tokenizer-free pretrained models.

Limitations
This paper mainly covers three NLP tasks, focusing on smaller multilingual pretrained models. In future work, it would be interesting to run the multi-dimensional evaluation we suggest on a broader set of tasks and models. Although our results show that subword models are a more practical choice for some tasks, we note that other tasks or datasets may exist where tokenizer-free methods achieve better relative performance. For instance, tokenizer-free models have been reported to excel on word-level tasks and in noisy environments (Xue et al., 2022), and the conclusions we reached may differ in such settings. Moreover, we did not explore more complicated generation tasks like translation or summarization, where the difficulty of decoding and longer decoding horizons could paint a different picture in a multi-dimensional evaluation.

Ethics Statement
We hope our results encourage the community to consider the practical concerns of running large language models (LLMs) and of designing tokenizer-free pretrained models. As state-of-the-art LLMs become more computationally expensive, it has become increasingly difficult for researchers and practitioners with fewer resources to utilize these models for downstream applications. We hope our multi-dimensional analysis can help researchers and practitioners with limited computational resources decide which model to use in practice.

A Tasks
For all tasks and models, we refer to the original papers' codebases for hyperparameters.

XNLI For encoder-only models, the first token ([CLS]) is used to map the sentence representation to the label distribution. For encoder-decoder models, we generate the index of the label (e.g., '0') directly.
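A hedged sketch of the encoder-only setup (our own minimal version, not the exact finetuning code): a linear classifier over the [CLS] representation maps each premise/hypothesis pair to the three XNLI labels.

# Sketch: encoder-only XNLI classification via a linear head on the [CLS] token.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 3)  # entailment / neutral / contradiction

inputs = tokenizer("A man is playing a guitar.", "A person is making music.", return_tensors="pt")
cls_repr = encoder(**inputs).last_hidden_state[:, 0]   # representation of the first ([CLS]) token
logits = classifier(cls_repr)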
NER For encoder-decoder models, we follow the input-output format (e.g., input: