YATO: Yet Another deep learning based Text analysis Open toolkit

We introduce YATO, an open-source, easy-to-use toolkit for text analysis with deep learning. Different from existing heavily engineered toolkits and platforms, YATO is lightweight and user-friendly for researchers from cross-disciplinary areas. Designed in a hierarchical structure, YATO supports free combinations of three types of widely used features including 1) traditional neural networks (CNN, RNN, etc.); 2) pre-trained language models (BERT, RoBERTa, ELECTRA, etc.); and 3) user-customized neural features via a simple configuration file. Benefiting from its flexibility and ease of use, YATO can facilitate fast reproduction and refinement of state-of-the-art NLP models, and promote the cross-disciplinary applications of NLP techniques. The code, examples, and documentation are publicly available at https://github.com/jiesutd/YATO. A demo video is also available at https://www.youtube.com/playlist?list=PLJ0mhzMcRuDUlTkzBfAftOqiJRxYTTjXH.


Introduction
Large language models (LLMs) such as GPT-3 (Brown et al., 2020), ChatGPT (OpenAI, 2022), and LLaMA (Touvron et al., 2023a,b) have made significant progress in natural language processing (NLP), showing strong abilities to understand text and competitive performance across various NLP tasks. However, these models are either closed-source or difficult to fine-tune due to high computational costs, which makes them inconvenient for academic research or practical implementation.
Alternatively, traditional neural models, such as recurrent neural networks (RNN, Hochreiter and Schmidhuber, 1997), convolutional neural networks (CNN, LeCun et al., 1989), and pre-trained language models (PLMs, Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020), have been widely studied and utilized for text understanding. These models benefit from large-scale training data and can be quickly fine-tuned for specific uses. Recent work also shows that they can offer useful guidance to LLMs (Xu et al., 2023). Therefore, small open-source deep learning models remain important in current NLP systems, especially in scenarios with limited computation and data resources.
However, due to the complexity of deep learning model architectures, it is challenging to implement methods or reproduce results from the literature. Different implementations of these models can lead to unfair comparisons or misleading results. Most existing frameworks were designed for professional developers, which brings additional obstacles for less experienced users, especially researchers with little or no artificial intelligence (AI) background (Zacharias et al., 2018; Zhang et al., 2020; Johnson et al., 2021). In addition, these frameworks seldom support the user-defined features required by various domain applications (e.g., in medical named entity recognition, customized lexicons can be supplied as external features, such that additional labels are tagged as features when a word occurs in the lexicon). For non-expert, cross-domain users, customizing models with additional features via source code is complex. To promote interdisciplinary applications of cutting-edge NLP techniques, it is necessary to build a flexible, user-friendly, and effective text representation framework that supports a wide range of deep learning architectures and customized domain features.
Several text analysis toolkits already exist in the NLP community. CoreNLP (Manning et al., 2014) and spaCy (Honnibal and Montani, 2017) offer pipelines for many traditional NLP tasks, but their performance is sometimes suboptimal due to the use of less powerful models. AllenNLP (Gardner et al., 2017) and flairNLP (Akbik et al., 2019) do not support user-defined features. FairSeq (Ott et al., 2019) is designed for sequence-to-sequence tasks such as machine translation and document summarization.
Transformers (Wolf et al., 2020) offers implementations of state-of-the-art models for various tasks across different modalities, but it is heavily engineered. PaddleNLP (Contributors, 2021) and EasyNLP (Wang et al., 2022a) are specifically designed for industrial applications and commercial usage, and are not lightweight enough for research purposes. The above toolkits are mostly developed for professional AI researchers or engineers, where heavy coding effort is necessary during model development and deployment. The learning curve for fully leveraging these toolkits is steep for cross-disciplinary researchers (e.g., in medicine or finance) who need to build models with lightweight code.

This paper presents a toolkit, YATO (Yet Another deep learning based Text analysis Open toolkit), for researchers looking for a convenient way to build state-of-the-art models for the two most popular types of NLP tasks: sequence labeling (e.g., part-of-speech tagging, named entity recognition) and sequence classification (e.g., sentiment analysis, document classification). YATO is built on NCRF++ (Yang and Zhang, 2018), a popular neural sequence labeling toolkit with over 250 citations from research papers, 1,900+ stars, and 120+ merged pull requests on GitHub as of Oct. 2023. NCRF++ has been utilized in many cross-disciplinary research projects, including medicine (Yang et al., 2020) and finance (Wan et al., 2021). YATO retains its strengths, integrates advanced pre-trained language models, and adds capabilities for sequence classification and data visualization.

Highlights of YATO
• Lightweight. YATO focuses on two fundamental yet popular NLP tasks: sequence labeling and sequence classification, covering many downstream applications such as information extraction, sentiment analysis, and text classification. Different from heavily engineered libraries, YATO is concise and lightweight with few library dependencies. It can be quickly developed and deployed in various environments, making it a user-friendly toolkit for less experienced users.

• Flexible. Most existing libraries do not support the combination of various neural features. With YATO, users can customize their models through free combinations of various neural models, including traditional neural networks (CNN, RNN) and state-of-the-art PLMs, as well as handcrafted features for domain adaptation. YATO also supports various inference layers, including attention pooling, softmax, conditional random field (CRF), and n-best decoding.
• Configurable. To minimize coding effort, all model development in YATO can be conducted by editing a configuration file. YATO loads the configuration file and constructs the deep learning model accordingly (see the illustrative configuration sketch after this list).
• Easy to Use. YATO is built on PyTorch and has been released on PyPI; installation can be done through pip install ylab-yato. For non-AI users, editing a configuration file to build deep learning models is simple and intuitive. For AI users, YATO provides various modularized functions for professional development (see the usage sketch after this list).
• High Performance. Extensive experiments on sequence labeling and classification tasks show that YATO achieves state-of-the-art performance on most tasks and datasets. YATO offers flexibility in terms of hardware resources, supporting both GPU and CPU for training and inference. It allows users to specify the desired device configuration, facilitating efficient utilization of multiple GPUs on a single server.
• Visualization. YATO offers an interface for visualizing text attention, which can help users further interpret and analyze the results.
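To make the configuration-driven workflow concrete, below is a minimal training-configuration sketch in the key=value format that YATO inherits from NCRF++; the exact key names are assumptions based on NCRF++ conventions and may differ in a given release.

### illustrative I/O settings (key names assumed from NCRF++ conventions)
train_dir=data/train.txt
dev_dir=data/dev.txt
test_dir=data/test.txt
model_dir=outputs/demo
### word-level encoder (LSTM/CNN/GRU) and character-level encoder
word_seq_feature=LSTM
char_seq_feature=CNN
### CRF inference layer and training hyperparameters
use_crf=True
optimizer=SGD
learning_rate=0.015
batch_size=10
iteration=50

Given such a file, training reduces to a few lines of Python. The class and method names below follow the project README and should be treated as assumptions if the installed version differs.

# a minimal usage sketch; assumes `pip install ylab-yato` and that the
# package exposes a YATO class taking a configuration file path
from yato import YATO

model = YATO("demo.train.config")  # parse the configuration file
model.train()                      # build and train the configured model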
Architecture of YATO

YATO is designed hierarchically to support flexible combinations of character-level features, word-level features, and pre-trained language models, as well as handcrafted features. As illustrated in Figure 1, YATO supports four patterns for representing text as embeddings, with flexible choices for adding handcrafted features and inference layers.

Text Representations
Pure Pre-trained Language Model. YATO enables the initialization of parameters with pre-trained language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ELECTRA (Clark et al., 2020), and fine-tunes them on training data. Leveraging the rich knowledge inside the PLM, these models have demonstrated strong performance on downstream tasks. To better leverage models with domain-specific knowledge, YATO also supports pre-trained models designed for specific domains, such as SciBERT (Beltagy et al., 2019), BioBERT (Lee et al., 2020), and others.

Hierarchical Pre-trained Language Model. The hierarchical pre-trained language model in YATO differs from the conventional notion of hierarchy, which typically describes relationships between word, sentence, and document structures. Instead, it signifies the ordinal relation between the traditional neural network and the pre-trained language model. Specifically, YATO supports using both word sequence features and pre-trained language model representations in a hierarchical way, where the word and character features are explicitly encoded in advance and used as the input to the pre-trained language model.

Traditional Neural Network (TNN) & Pre-trained Language Model. In contrast to the hierarchical combination, we can use the word sequence features directly before the final prediction layer, combined with the representations from the PLM. Such a feature-based approach is also used in ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019); it shows similar performance while not requiring fine-tuning of the pre-trained models.
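To clarify the feature-based TNN⊕PLM pattern, the following PyTorch sketch concatenates BiLSTM word features with frozen PLM hidden states before the prediction layer. It is a schematic illustration, not YATO's actual implementation; the module and argument names are ours, and word/subword alignment is assumed to be handled beforehand.

import torch
import torch.nn as nn

class TNNPlusPLM(nn.Module):
    """Schematic sketch of the TNN(+)PLM pattern: concatenate BiLSTM word
    features with frozen PLM hidden states before the prediction layer."""

    def __init__(self, plm, word_emb_dim, hidden_dim, num_labels):
        super().__init__()
        self.plm = plm  # e.g., a HuggingFace AutoModel, kept frozen here
        self.bilstm = nn.LSTM(word_emb_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(
            hidden_dim + plm.config.hidden_size, num_labels)

    def forward(self, word_embs, input_ids, attention_mask):
        # word-level features from the traditional neural network
        lstm_out, _ = self.bilstm(word_embs)  # (B, T, hidden_dim)
        # feature-based use of the PLM: representations without fine-tuning
        with torch.no_grad():
            plm_out = self.plm(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # assumes PLM outputs are already aligned to the T word positions
        combined = torch.cat([lstm_out, plm_out], dim=-1)
        return self.classifier(combined)  # per-token label scores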
Pure Traditional Neural Network. Besides the transformer-based pre-trained models, YATO also supports traditional neural models such as RNN, CNN, and BiLSTM. Compared with Transformers, these models usually have fewer parameters and have also been shown effective for sequence modeling (Ma and Hovy, 2016; Lample et al., 2016; Yang et al., 2018), especially when training data is limited.

Handcrafted Features and Inference
Handcrafted Features. YATO provides feature embedding modules to encode any handcrafted features (Yang et al., 2017).
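For instance, in the NCRF++-style input format that YATO inherits, handcrafted features can be attached to each token as bracketed tags; the feature names below are illustrative, and a matching line such as feature=[POS] emb_size=20 in the configuration file assigns each feature its own embedding (key syntax assumed from NCRF++ conventions).

EU      [POS]NNP [Cap]1 B-ORG
rejects [POS]VBZ [Cap]0 O
German  [POS]JJ  [Cap]1 B-MISC
calls   [POS]NNS [Cap]0 O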

Datasets and Main Results
To evaluate our framework, we conduct experiments on 8 datasets covering sequence labeling and classification tasks in both English and Chinese, including named entity recognition (NER) on CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003), OntoNotes (Hovy et al., 2006), and MSRA (Levow, 2006); CCG supertagging on CCGBank (Hockenmaier and Steedman, 2007); and sentiment analysis on SST2, SST5 (Socher et al., 2013), and ChnSentiCorp (Tan and Zhang, 2008). Table 3 and Table 4 demonstrate that YATO can reproduce both classical and state-of-the-art deep learning models on most sequence labeling and classification tasks. For some results, such as BERT on CoNLL, the originally reported 92.4 F1 score by Devlin et al. (2019) may not be achievable with current libraries, as discussed in previous literature (Stanislawek et al., 2019; Gui et al., 2020). Overall, YATO achieves the best performance on MSRA, OntoNotes 4.0, CCG supertagging, and ChnSentiCorp. The compatibility and reproducibility across different models and tasks demonstrate that YATO can serve as a platform for reproducing and comparing different methods, from classical neural models to state-of-the-art PLMs.

Comparison of Different Patterns
Table 5 shows the performance of the four model patterns on both sequence labeling and classification tasks (one dataset for each task). The combination patterns, Hierarchical PLM and TNN⊕PLM (patterns 2 and 3), outperform the pure models (patterns 1 and 4) on SST5. However, the pure PLM achieves the best performance on the CoNLL 2003 NER dataset. These results demonstrate that complex models are not always better than simple ones, and that a flexible framework is necessary to provide various model candidates.

Results Using Handcrafted Features
To demonstrate the effectiveness of encoding handcrafted features in domain applications, Table 6 compares models with and without such features. The results show that handcrafted features can improve model performance in the medical domain.

Comparison with Transformers
The aforementioned results show that YATO can achieve the reported values across various tasks. We further use tasks from the GLUE benchmark (Wang et al., 2018) and compare against the results obtained with Huggingface Transformers (Wolf et al., 2020), one of the most popular libraries. Table 7 shows the results using the BERT-base-uncased model; the Huggingface Transformers values are sourced from the corresponding GitHub page. With default settings, YATO achieves comparable and overall better performance than Huggingface Transformers.

Visualization of Attention Map
Beyond performance, YATO provides a visualization tool that takes a list of words and their corresponding weights as input and generates LaTeX code for visualizing attention-based results. Figure 2 provides visualization examples of attention on sentiment prediction tasks. Words or characters with sentiment polarity can be automatically extracted and highlighted using this YATO module. As shown in the figure, words that have a high impact on the sentiment are highlighted. This visualization module improves the interpretability of the deep learning models in our toolkit.
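The idea can be sketched as follows; this is a hypothetical helper written for illustration, not YATO's actual interface, mapping each word's attention weight to a LaTeX colorbox (requires the xcolor package).

def attention_to_latex(words, weights):
    """Illustrative sketch (not YATO's API): render attention weights as
    LaTeX colorboxes, where darker background means higher attention.
    Assumes words contain no LaTeX special characters."""
    max_w = max(weights) if max(weights) > 0 else 1.0
    chunks = []
    for word, weight in zip(words, weights):
        intensity = int(50 * weight / max_w)  # scale to a color percentage
        chunks.append(r"\colorbox{red!%d}{\strut %s}" % (intensity, word))
    return " ".join(chunks)

# e.g., attention_to_latex(["a", "great", "movie"], [0.05, 0.85, 0.10])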

Efficiency Analysis
YATO is implemented with a fully batched computing approach, making it efficient in both model training and decoding. With the help of GPUs and large batch sizes, models built on YATO can be decoded efficiently. Figure 3 shows the decoding speed of the four patterns under different batch sizes.

Conclusion
YATO is an open-source toolkit for text analysis that supports various combinations of state-of-the-art deep learning models and user-customized features, with high flexibility and minimal coding effort. YATO is maintained by core developers from YLab (https://ylab.top/). It aims to help AI researchers build state-of-the-art NLP models and assist non-AI researchers in conducting cross-disciplinary research with advanced NLP techniques. Given the success of its predecessor, NCRF++, we believe that YATO will greatly promote the application of NLP in various cross-disciplinary fields and reduce disparities in AI adoption across these areas. In the future, we plan to integrate advanced LLMs and to customize modules that support time series modeling, multimodal features, and domain-specific features.

Limitations
Our proposed text analysis toolkit mainly focuses on discriminative-style tasks, most of which are treated as token-level or sentence-level classification. Recent studies show that generative-style language models such as GPT (Radford et al., 2018), BART (Lewis et al., 2019), and T5 (Raffel et al., 2020) can also achieve promising zero-shot and few-shot results by adding user-defined prompts or instructions as external inputs; we leave this as future work.

Figure 2: Visualization of attention weights. Different degrees of background color reflect the distribution of attention over words or characters.

Figure 3: Decoding speed of the four patterns under different batch sizes. Tested on an NVIDIA RTX 2080Ti GPU.

Table 1: Comparison between existing popular text analysis libraries and our proposed YATO.

Table 2: A sample of a configuration file.
Model Decoding. Similar to model training, a simple file configuration can be used to run YATO decoding. Besides greedy decoding, YATO also supports n-best decoding, which decodes the label sequences with the top-n probabilities using Viterbi decoding in the neural CRF layer. The n-best results can serve as important resources for further optimization, e.g., reranking.
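As a sketch of the decoding workflow (the class and method names follow the project README and should be treated as assumptions), decoding is driven by a configuration file in the same way as training:

# a minimal decoding sketch; assumes a decode configuration file that
# points to the trained model and the input to be labeled
from yato import YATO

decoder = YATO("demo.decode.config")
results = decoder.decode()  # with nbest set in the configuration, the
                            # top-n label sequences are returned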

Table 5: Performance of the four training patterns.

Table 7: Results of fine-tuning the BERT-base-uncased model on the GLUE benchmark.