Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing

We introduce Trankit, a light-weight Transformer-based toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines on sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing while maintaining competitive performance for tokenization, multi-word token expansion, and lemmatization over 90 Universal Dependencies treebanks. Despite the use of a large pretrained transformer, our toolkit is still efficient in memory usage and speed. This is achieved by our novel plug-and-play mechanism with Adapters, where a multilingual pretrained transformer is shared across pipelines for different languages. Our toolkit, along with pretrained models and code, is publicly available at: https://github.com/nlp-uoregon/trankit. A demo website for our toolkit is also available at: http://nlp.uoregon.edu/trankit. Finally, we create a demo video for Trankit at: https://youtu.be/q0KGP3zGjGc.


Introduction
Many efforts have been devoted to developing multilingual NLP systems to overcome language barriers (Aharoni et al., 2019; Liu et al., 2019a; Taghizadeh and Faili, 2020; Zhu, 2020; Kanayama and Iwamoto, 2020; Nguyen and Nguyen, 2021). A large portion of existing multilingual systems has focused on downstream NLP tasks that critically depend on upstream linguistic features, ranging from basic information such as token and sentence boundaries for raw text to more sophisticated structures such as part-of-speech tags, morphological features, and dependency trees of sentences (called fundamental NLP tasks). As such, building effective multilingual systems/pipelines for these fundamental upstream NLP tasks has the potential to transform multilingual downstream systems.
There have been several NLP toolkits that concern multilingualism for fundamental NLP tasks, featuring spaCy 1, UDify (Kondratyuk and Straka, 2019), Flair (Akbik et al., 2019), CoreNLP (Manning et al., 2014), UDPipe (Straka, 2018), and Stanza (Qi et al., 2020). However, these toolkits have their own limitations. spaCy is designed to focus on speed, thus sacrificing performance. UDify and Flair cannot process raw text as they depend on external tokenizers. CoreNLP supports raw text, but it does not offer state-of-the-art performance. UDPipe and Stanza are recent toolkits that leverage word embeddings, i.e., word2vec (Mikolov et al., 2013) and fastText (Bojanowski et al., 2017), to deliver current state-of-the-art performance for many languages. However, Stanza's and UDPipe's pipelines for different languages are trained separately and do not share any components, especially the embedding layers that account for most of the model size. This makes their memory usage grow aggressively as pipelines for more languages are simultaneously needed and loaded into memory (e.g., for language learning apps). Most importantly, none of these toolkits have explored contextualized embeddings from pretrained transformer-based language models, which have the potential to significantly improve the performance of the NLP tasks, as demonstrated in many prior works (Devlin et al., 2019; Liu et al., 2019b; Conneau et al., 2020).
In this paper, we introduce Trankit, a multilingual Transformer-based NLP toolkit that overcomes these limitations, supporting 56 languages with 90 pretrained pipelines trained on 90 Universal Dependencies v2.5 treebanks (Zeman et al., 2019). By utilizing the state-of-the-art multilingual pretrained transformer XLM-Roberta (Conneau et al., 2020), Trankit advances state-of-the-art performance for sentence segmentation, part-of-speech (POS) tagging, morphological feature tagging, and dependency parsing while achieving competitive or better performance for tokenization, multi-word token expansion, and lemmatization over the 90 treebanks. It also obtains competitive or better performance for named entity recognition (NER) on 11 public datasets.

Unlike previous work, our token and sentence splitter is wordpiece-based instead of character-based to better exploit contextual information, which is beneficial in many languages. Consider the following sentence: "John Donovan from Argghhh! has put out a excellent slide show on what was actually found and fought for in Fallujah." Trankit correctly recognizes this as a single sentence, while the character-based sentence splitters of Stanza and UDPipe are easily fooled by the exclamation mark "!", treating it as two separate sentences. To our knowledge, this is the first work to successfully build a wordpiece-based token and sentence splitter that works well for 56 languages.

Figure 1 presents the overall architecture of the Trankit pipeline, which features three novel transformer-based components for: (i) the joint token and sentence splitter, (ii) the joint model for POS tagging, morphological tagging, and dependency parsing, and (iii) the named entity recognizer. One potential concern for our use of a large pretrained transformer model (i.e., XLM-Roberta) in Trankit involves GPU memory, where different transformer-based components in the pipeline for one or multiple languages must be simultaneously loaded into memory to serve multilingual tasks.
This could extensively consume memory if different versions of the large pretrained transformer (fine-tuned for each component) were employed in the pipeline. As such, we introduce a novel plug-and-play mechanism with Adapters to address this memory issue. Adapters are small networks injected inside all layers of the pretrained transformer model that have shown their effectiveness as a lightweight alternative to the traditional fine-tuning of pretrained transformers (Houlsby et al., 2019; Peters et al., 2019; Pfeiffer et al., 2020a,b). In Trankit, a set of adapters (for transformer layers) and task-specific weights (for final predictions) are created for each transformer-based component of each language, while only one single large multilingual pretrained transformer is shared across components and languages. Adapters allow us to learn language-specific features for tasks. During training, the shared pretrained transformer is fixed while only the adapters and task-specific weights are updated. At inference time, depending on the language of the input text and the currently active component, the corresponding trained adapters and task-specific weights are activated and plugged into the pipeline to process the input. This mechanism not only solves the memory problem but also substantially reduces training time.

Related Work
There have been works using pretrained transformers to build models for character-based word segmentation for Chinese (Yang, 2019; Tian et al., 2020; Che et al., 2020); POS tagging for Dutch, English, Chinese, and Vietnamese (de Vries et al., 2019; Tenney et al., 2019; Tian et al., 2020; Che et al., 2020; Nguyen and Nguyen, 2020); morphological feature tagging for Estonian and Persian (Kittask et al., 2020; Mohseni and Tebbifakhr, 2019); and dependency parsing for English and Chinese (Tenney et al., 2019; Che et al., 2020). However, all of these works are developed only for specific languages, and are thus potentially unable to support and scale to the multilingual setting.
Some works have designed multilingual transformer-based systems via multilingual training on the combined data of different languages (Tsai et al., 2019; Kondratyuk and Straka, 2019; Ustün et al., 2020). However, multilingual training is suboptimal (see Section 5). Also, these systems still rely on external resources to perform tokenization and sentence segmentation, and are thus unable to consume raw text. To our knowledge, this is the first work to successfully build a multilingual transformer-based NLP toolkit where different transformer-based models for many languages can be simultaneously loaded into GPU memory and process raw text inputs of different languages.

Design and Architecture
Adapters. Adapters play a critical role in making Trankit memory- and time-efficient for training and inference. Figure 2 shows the architecture and the location of an adapter inside a layer of the transformer. We use the adapter architecture proposed by Pfeiffer et al. (2020a,b), which consists of two projection layers Up and Down (feed-forward networks) and a residual connection:

c_i = AddNorm(r_i),  h_i = Up(ReLU(Down(c_i))) + r_i    (1)

where r_i is the input vector from transformer layer i for the adapter and h_i is the output vector for transformer layer i. During training, all the weights of the pretrained transformer (i.e., the gray boxes in Figure 2) are fixed and only the adapter weights of the two projection layers and the task-specific weights outside the transformer (for final predictions) are updated. As demonstrated in Figure 1, Trankit involves six components, described as follows.
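The adapter computation in Eq. (1) can be sketched as follows; the hidden size (768, as in the XLM-Roberta base encoder), the bottleneck size, and the random weights are illustrative assumptions, not Trankit's trained parameters.

```python
import numpy as np

# Hidden size and bottleneck size are illustrative assumptions.
HIDDEN, BOTTLENECK = 768, 24
rng = np.random.default_rng(0)
W_down = rng.normal(0, 0.02, (HIDDEN, BOTTLENECK))  # Down projection
W_up = rng.normal(0, 0.02, (BOTTLENECK, HIDDEN))    # Up projection
gamma, beta = np.ones(HIDDEN), np.zeros(HIDDEN)     # LayerNorm parameters

def add_norm(r):
    # Layer normalization over the hidden dimension
    mu, sigma = r.mean(-1, keepdims=True), r.std(-1, keepdims=True)
    return gamma * (r - mu) / (sigma + 1e-12) + beta

def adapter(r):
    # c_i = AddNorm(r_i); h_i = Up(ReLU(Down(c_i))) + r_i
    c = add_norm(r)
    return np.maximum(c @ W_down, 0.0) @ W_up + r   # ReLU bottleneck + residual

r = rng.normal(size=(5, HIDDEN))  # 5 wordpiece vectors from a transformer layer
h = adapter(r)
print(h.shape)                    # (5, 768): same shape as the input
```

Because only W_down, W_up, and the LayerNorm parameters are trained per language and component, the shared transformer weights stay untouched.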
Multilingual Encoder with Adapters. This is our core component, shared across the different transformer-based components for different languages of the system. Given an input raw text s, we first split it into substrings by spaces. Afterward, SentencePiece, a multilingual subword tokenizer (Kudo and Richardson, 2018; Kudo, 2018), is used to further split each substring into wordpieces. By concatenating the wordpiece sequences for the substrings, we obtain an overall sequence of wordpieces w = [w_1, w_2, ..., w_K] for s. In the next step, w is fed into the pretrained transformer, which is already integrated with adapters, to obtain the wordpiece representations:

x^{l,m}_{1:K} = Transformer(w_{1:K}; θ^{l,m}_{AD})

Here, θ^{l,m}_{AD} represents the adapter weights for language l and component m of the system. As such, we have specific adapters in all transformer layers for each component m and language l. Note that if K is larger than the maximum input length of the pretrained transformer (i.e., 512), we further divide w into consecutive chunks, each with length less than or equal to the maximum length. The pretrained transformer is then applied over each chunk to obtain a representation vector for each wordpiece in w. Finally, x^{l,m}_{1:K} is sent to component m to perform the corresponding task.
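The chunking step can be sketched as follows; only the maximum length of 512 comes from the text, while the function name is ours.

```python
# Wordpiece sequences longer than the transformer's maximum input length
# (512) are split into consecutive chunks, each encoded separately.
MAX_LEN = 512

def chunk_wordpieces(wordpieces, max_len=MAX_LEN):
    """Split a wordpiece sequence into consecutive chunks of length <= max_len."""
    return [wordpieces[i:i + max_len] for i in range(0, len(wordpieces), max_len)]

w = [f"wp{i}" for i in range(1100)]  # a document with 1,100 wordpieces
chunks = chunk_wordpieces(w)
print([len(c) for c in chunks])      # [512, 512, 76]
```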
Joint Token and Sentence Splitter. Given the wordpiece representations x^{l,m}_{1:K} for this component, each vector x^{l,m}_i for w_i ∈ w is consumed by a feed-forward network with a softmax at the end to predict whether w_i is the end of a single-word token, the end of a multi-word token, or the end of a sentence. The predictions for all wordpieces in w are then aggregated to determine the token, multi-word token, and sentence boundaries for s.
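The aggregation step can be sketched as follows, with an assumed integer encoding of the boundary labels (0 = inside a token, 1 = end of a single-word token, 2 = end of a multi-word token, 3 = end of a sentence); the label ids and helper function are illustrative, not Trankit's actual implementation.

```python
def aggregate(wordpieces, labels):
    """Turn per-wordpiece boundary labels into sentences of (token, is_MWT) pairs."""
    sentences, tokens, current = [], [], []
    for wp, label in zip(wordpieces, labels):
        current.append(wp)
        if label in (1, 2, 3):                             # any token boundary
            tokens.append(("".join(current), label == 2))  # (token, is_MWT)
            current = []
        if label == 3:                                     # sentence boundary
            sentences.append(tokens)
            tokens = []
    return sentences

wps = ["Hel", "lo", "world", "!", "Bye", "."]
labels = [0, 1, 1, 3, 1, 3]
print(aggregate(wps, labels))
# [[('Hello', False), ('world', False), ('!', False)], [('Bye', False), ('.', False)]]
```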

Multi-word Token Expander. This component is responsible for expanding each detected multi-word token (MWT) into multiple syntactic words 2. We follow Stanza in deploying a character-based seq2seq model for this component. This decision is based on our observation that the task is done best at the character level, and the character-based model (with character embeddings) is very small.
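To illustrate what expansion produces (this toy dictionary lookup stands in for Trankit's actual character-based seq2seq model, which is learned from treebank data), the example expansions follow Universal Dependencies conventions:

```python
# Illustrative lookup table; Trankit learns expansions with a seq2seq model.
EXPANSIONS = {
    "don't": ["do", "n't"],  # English clitic
    "du": ["de", "le"],      # French contraction
    "del": ["de", "el"],     # Spanish contraction
}

def expand(token):
    """Return the syntactic words for a multi-word token (identity otherwise)."""
    return EXPANSIONS.get(token, [token])

print(expand("du"))   # ['de', 'le']
print(expand("cat"))  # ['cat']
```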

Joint Model for POS Tagging, Morphological Tagging and Dependency Parsing. In Trankit, given the detected sentences and tokens/words, we use a single model to jointly perform POS tagging, morphological feature tagging, and dependency parsing at the sentence level. Joint modeling mitigates error propagation, saves memory, and speeds up the system. In particular, given a sentence, the representation for each word is computed as the average of its wordpieces' transformer-based representations in x^{l,m}_{1:K}. Let t_{1:N} = [t_1, t_2, ..., t_N] be the representations of the words in the sentence. We compute the following vectors using feed-forward networks FFN_*:

r^{upos}_{1:N} = FFN_{upos}(t_{1:N}), r^{xpos}_{1:N} = FFN_{xpos}(t_{1:N}), r^{ufeats}_{1:N} = FFN_{ufeats}(t_{1:N})

The vectors for the words in r^{upos}_{1:N}, r^{xpos}_{1:N}, and r^{ufeats}_{1:N} are then passed to a softmax layer to make predictions for the UPOS, XPOS, and UFeats tags of each word. For dependency parsing, we use the classification token <s> to represent the root node, and apply Deep Biaffine Attention (Dozat and Manning, 2017) and the Chu-Liu/Edmonds algorithm (Chu, 1965; Edmonds, 1967) to assign a syntactic head and the associated dependency relation to each word in the sentence.
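Deep biaffine arc scoring (Dozat and Manning, 2017) can be sketched as follows: each word gets a "dependent" and a "head" vector from feed-forward networks, and score(i, j) rates word j as the head of word i. The tiny dimensions, random weights, and greedy decoding here are illustrative only; Trankit decodes a well-formed tree with the Chu-Liu/Edmonds algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 4, 8                    # 4 words (incl. the <s> root), 8-dim vectors
dep = rng.normal(size=(N, D))  # outputs of a "dependent" feed-forward network
head = rng.normal(size=(N, D)) # outputs of a "head" feed-forward network
U = rng.normal(size=(D, D))    # biaffine weight matrix
b = rng.normal(size=D)         # bias term scoring each word as a head

# scores[i, j] = dep_i^T U head_j + b^T head_j
scores = dep @ U @ head.T + head @ b
predicted_heads = scores.argmax(axis=1)  # greedy head per word (illustration only)
print(scores.shape)  # (4, 4)
```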
Lemmatizer. This component receives sentences and their predicted UPOS tags to produce the canonical form (lemma) of each word. We also employ a character-based seq2seq model for this component, as in Stanza.

Named Entity Recognizer. Given a sentence, the named entity recognizer determines spans of entity names by assigning a BIOES tag to each token in the sentence. We deploy a standard sequence labeling architecture using transformer-based representations of tokens, involving a feed-forward network followed by a Conditional Random Field.
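Decoding a BIOES tag sequence into entity spans can be sketched as follows (B/I/E mark the beginning/inside/end of a multi-token entity, S a single-token entity, and O a non-entity token); the helper function is illustrative, not part of Trankit's API.

```python
def bioes_to_spans(tags):
    """Convert BIOES tags into (start, end, type) entity spans (inclusive indices)."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        kind = tag[0]
        if kind == "S":                           # single-token entity
            spans.append((i, i, tag[2:]))
        elif kind == "B":                         # entity opens
            start = i
        elif kind == "E" and start is not None:   # entity closes
            spans.append((start, i, tag[2:]))
            start = None
    return spans

tags = ["B-PER", "E-PER", "O", "S-LOC"]
print(bioes_to_spans(tags))  # [(0, 1, 'PER'), (3, 3, 'LOC')]
```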

Trankit Installation. Trankit is written in Python and available on PyPI: https://pypi.org/project/trankit/. Users can install our toolkit via pip using: pip install trankit

Initialize a Pipeline. Lines 1-4 in Figure 3 show how to initialize a pretrained pipeline for English; it is instructed to run on a GPU and to store downloaded pretrained models in the specified cache directory. Trankit will not download pretrained models if they already exist.

Multilingual Usage. Figure 3 also shows how to initialize a multilingual pipeline and process inputs of different languages in Trankit.

Basic Functions. Trankit can process inputs that are untokenized (raw) or pretokenized strings, at both the sentence and document levels. Figure 4 illustrates a simple code snippet that performs all the supported tasks for an input text. We organize Trankit's outputs into hierarchical native Python dictionaries, which can be easily inspected by users. Figure 5 demonstrates the outputs of line 6 in Figure 4.
Training your own Pipelines. Trankit also provides a trainable pipeline for 100 languages via the class TPipeline. This ability is inherited from the XLM-Roberta encoder, which is pretrained on those languages. Figure 6 illustrates how to train a token and sentence splitter with TPipeline.

Datasets & Hyper-parameters
To achieve a fair comparison, we follow Stanza (Qi et al., 2020) to train and evaluate all models on the same canonical data splits of 90 Universal Dependencies treebanks v2.5 (UD2.5) 3 (Zeman et al., 2019), and 11 public NER datasets provided in the following corpora: AQMAR (Mohit et al., 2012), CoNLL02 (Tjong Kim Sang, 2002), CoNLL03, and WikiNER (Nothman et al., 2012), among others. We skip 10 treebanks whose languages are not supported by XLM-Roberta. Hyper-parameters for all models and datasets are selected based on the development data in this work.

Figure 6: Training a token and sentence splitter using CoNLL-U formatted data (Nivre et al., 2020).

Universal Dependencies Performance
We evaluate the systems using the official evaluation script of the CoNLL 2018 Shared Task 4. On the five illustrated languages, Trankit achieves competitive performance on tokenization, MWT expansion, and lemmatization. Importantly, Trankit outperforms the other toolkits on all remaining tasks (e.g., POS and morphological tagging), with substantial and significant improvements for sentence segmentation and dependency parsing. For example, English enjoys a 7.22% improvement for sentence segmentation, and improvements of 3.92% and 4.37% for UAS and LAS in dependency parsing. For Arabic, Trankit has a remarkable improvement of 16.16% for sentence segmentation, while Chinese observes improvements of 12.31% and 12.72% in UAS and LAS for dependency parsing. Over all 90 treebanks, Trankit outperforms the previous state-of-the-art framework Stanza on most of the tasks, particularly sentence segmentation (+3.24%), POS tagging (+1.44% for UPOS and +1.55% for XPOS), morphological tagging (+1.46%), and dependency parsing (+4.0% for UAS and +5.01% for LAS), while maintaining competitive performance on tokenization, multi-word expansion, and lemmatization.

Table 3 compares Trankit with Stanza (v1.1.1), Flair (v0.7), and spaCy (v2.3) on the test sets of the 11 considered NER datasets. Following Stanza, we report the performance of the other toolkits with their pretrained models on the canonical data splits when available. Otherwise, their best configurations are used to train the models on the same data splits (inherited from Stanza). Also, for the Dutch datasets, we retrain the models in Flair as those models (for Dutch) have been updated in version v0.7. As can be seen, Trankit obtains competitive or better performance for most of the languages, clearly demonstrating the benefit of using the pretrained transformer for multilingual NER.

We compare the model sizes of Trankit and Stanza for several languages in Table 5.
As can be seen, besides the multilingual transformer, model packages in Trankit only take dozens of megabytes, while Stanza consumes hundreds of megabytes for each package. This leads to Stanza using much more memory when pipelines for these languages are loaded at the same time. In fact, Trankit only takes 4.9GB to load all 90 pretrained pipelines for the 56 supported languages.

Ablation Study
This section compares Trankit with two other possible strategies to build a multilingual system for fundamental NLP tasks. In the first strategy (called "Multilingual"), we train a single pipeline where all components are trained with the combined training data of all languages. The second strategy (called "No-adapters") eliminates adapters from XLM-Roberta in Trankit. As such, in "No-adapters", pipelines are still trained separately for each language; the pretrained transformer is fixed; and only the task-specific weights (for predictions) in the components are updated during training.

For evaluation, we select 9 treebanks from 3 different groups, i.e., high-resource, medium-resource, and low-resource, depending on the sizes of the treebanks. In particular, the high-resource group includes Czech, Russian, and Arabic; the medium-resource group includes French, English, and Chinese; and the low-resource group involves Belarusian, Telugu, and Lithuanian. Table 2 compares the average performance of Trankit, "Multilingual", and "No-adapters". As can be seen, "Multilingual" and "No-adapters" are significantly worse than the proposed adapter-based Trankit. We attribute this to the fact that multilingual training might suffer from unbalanced treebank sizes, causing high-resource languages to dominate the others and impairing overall performance. For "No-adapters", fixing the pretrained transformer might significantly limit the models' capacity for multiple tasks and languages.

Conclusion and Future Work
We introduce Trankit, a transformer-based multilingual toolkit that significantly improves the performance for fundamental NLP tasks, including sentence segmentation, part-of-speech, morphological tagging, and dependency parsing over 90 Universal Dependencies v2.5 treebanks of 56 different languages. Our toolkit is fast on GPUs and efficient in memory use, making it usable for general users. In the future, we plan to improve our toolkit by investigating different pretrained transformers such as mBERT and XLM-Roberta large . We also plan to provide Named Entity Recognizers for more languages and add modules to perform more NLP tasks.