TextBox: A Unified, Modularized, and Extensible Framework for Text Generation

We release an open-source library, called TextBox, which provides a unified, modularized, and extensible text generation framework. TextBox aims to support a broad set of text generation tasks and models. In TextBox, we implement several text generation models on benchmark datasets, covering the categories of VAE, GAN, pre-trained language models, etc. Meanwhile, our library maintains sufficient modularity and extensibility by properly decomposing the model architecture, inference, and learning process into highly reusable modules, which allows new models to be easily incorporated into our framework. It is especially suitable for researchers and practitioners who want to efficiently reproduce baseline models and develop new models. TextBox is implemented based on PyTorch and released under the Apache License 2.0 at https://github.com/RUCAIBox/TextBox.


Introduction
Text generation, which has emerged as an important branch of natural language processing (NLP), is often formally referred to as natural language generation (NLG). It aims to produce plausible and understandable text in human language from input data (e.g., a sequence, keywords) or machine representations. Owing to the remarkable performance of deep learning models, many classic text generation tasks have achieved rapid progress, such as machine translation (Vaswani et al., 2017a), dialogue systems (Li et al., 2016), text summarization (See et al., 2017), and text paraphrasing (Madnani and Dorr, 2010).
To facilitate the building of text generation models, a few remarkable open-source libraries have been developed (Britz et al., 2017; Klein et al., 2017b; Miller et al., 2017b; Zhu et al., 2018; Hu et al., 2019). These frameworks are mainly designed for one or a small number of specific tasks, particularly machine translation and dialogue systems. They usually focus on one kind of text generation technique, such as generative adversarial networks (GAN), or have limitations in covering commonly used baseline implementations. Even for an experienced researcher, it is difficult to implement all compared baselines under a unified framework. Therefore, it is highly desirable to re-consider the implementation of text generation algorithms in a unified and modularized way, especially with deep learning.

To alleviate the above issues, we initiate a project to provide a unified framework for text generation algorithms. We implement an open-source text generation library, called TextBox, aiming to enhance the reproducibility of existing models, standardize the implementation and evaluation protocols of text generation algorithms, and ease the development of new algorithms. Our work is also useful for supporting several real-world applications in the field of text generation. We have extensively surveyed related text generation libraries and broadly fused their merits into TextBox. The key features and capabilities of our library are summarized in the following three aspects:

• Unified and modularized framework. TextBox is built upon PyTorch (Paszke et al., 2019), which is one of the most popular deep learning frameworks (especially in the research community). Moreover, it is designed to be highly modularized, by decoupling text generation models into a set of highly reusable modules, including data modules, model modules, evaluation modules, and many common components and functionalities.
In our library, it is convenient to compare different text generation algorithms with built-in evaluation protocols via simple yet flexible configuration, or develop new text generation models at a highly conceptual level by plugging in or swapping out modules.
• Comprehensive models, benchmark datasets, and standardized evaluations. TextBox contains a wide range of text generation models, covering the categories of variational auto-encoders (VAE), generative adversarial networks (GAN), recurrent neural networks (RNN), Transformer-based models, and pre-trained language models (PLM). We provide flexible mechanisms, via the configuration file or command line, to run, compare, and test these traditional and state-of-the-art algorithms. Based on these models, we implement two major text generation settings: unconditional text generation and conditional text generation (e.g., text summarization and machine translation). To construct a reusable benchmark, we incorporate many commonly used datasets for different text generation tasks. Our library supports a series of widely adopted evaluation protocols for testing and comparing text generation algorithms, such as perplexity, negative log-likelihood, BLEU, and ROUGE.
• Extensible and flexible framework. TextBox provides convenient interfaces for various common functions or modules in text generation models, e.g., RNN-based encoder-decoders, Transformer-based encoder-decoders, and pre-trained language models. Within our library, users can conveniently choose different API interfaces for building and evaluating their own models. Besides, the interfaces of our library are fully compatible with the PyTorch interface, which allows seamless integration of user-customized modules and enables users to integrate external components as needed.
Architecture and Design

Figure 1 presents an illustration of the main functionalities and modules of our library TextBox. The configuration module at the bottom helps users set up the experimental environment (e.g., hyper-parameters and running details). Built upon the configuration module, the data, model, and evaluation modules form the core elements of our library. In the following, we describe the detailed structure of these three modules.

Data Module
A major design principle of our library is to support different text generation tasks. For this purpose, the data module is the fundamental part, providing various data structures and functions adapted to different tasks.
For extensibility and reusability, our data module implements a unified data flow that feeds input text into the models. The data flow can be described as: input text → Dataset → DataLoader → models. The class Dataset supports two special data structures, i.e., single sequence and paired sequence, which are oriented to unconditional and conditional text generation tasks, respectively, as shown in Table 1. The single sequence structure requires users to preprocess the input text into one sequence per line in the input file, while the paired sequence structure requires users to separate the source and target text into two files, with one sequence per line in each file. Which data structure is used is determined by the hyper-parameter task_type, e.g., unconditional, translation, or summarization. The implementation of Dataset contains many common data preprocessing functionalities, such as lowercasing and word tokenization using NLTK 1 . The class DataLoader builds on these two data structures and is responsible for organizing the data stream. To compare different generation models within our framework, we have collected commonly used benchmarks for text generation tasks, which makes it quite convenient for users to get started with our library. The statistics of these benchmark datasets for different tasks are presented in Table 1.
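The data flow above can be sketched as follows. This is an illustrative stand-in, not TextBox's actual classes: the class names and the plain whitespace tokenizer (instead of NLTK) are simplifications for exposition.

```python
# Sketch of the single-sequence and paired-sequence data structures and a
# minimal DataLoader. Class names are illustrative, not TextBox's own code.

def preprocess(line):
    """Lowercase and whitespace-tokenize one line (a stand-in for NLTK)."""
    return line.lower().split()

class SingleSequenceDataset:
    """Unconditional generation: one sequence per line in a single file."""
    def __init__(self, lines):
        self.data = [preprocess(l) for l in lines]
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

class PairedSequenceDataset:
    """Conditional generation: source and target text in two parallel files."""
    def __init__(self, src_lines, tgt_lines):
        assert len(src_lines) == len(tgt_lines)
        self.data = [(preprocess(s), preprocess(t))
                     for s, t in zip(src_lines, tgt_lines)]
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

def dataloader(dataset, batch_size):
    """Minimal DataLoader: yields batches of examples in order."""
    for i in range(0, len(dataset), batch_size):
        yield [dataset[j] for j in range(i, min(i + batch_size, len(dataset)))]
```

In the real library, the choice between the two dataset structures would be driven by the task_type hyper-parameter rather than made by hand.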

Model module
To support a variety of models, we set up the model module by decoupling the algorithm implementation from other components and extracting a set of frequently used modules, e.g., encoder, decoder. TextBox allows flexible combinations among these modules following the required interface to connect with input and evaluation modules. Based on this abstract design, it is convenient to switch between different text generation tasks, and change from one modeling paradigm to another by simply plugging in or swapping out modules.
In addition to the modularized design, our library includes a large number of text generation baseline models for reproducibility. In the first released version, we have implemented several baseline models within four categories of text generation models, namely VAE-based, GAN-based, RNN-based, and Transformer-based models, corresponding to different generation tasks. For example, GAN-based models consist of a generator and a discriminator, and VAE-based models contain an encoder and a decoder. We summarize all the implemented models in Table 2. For all the implemented models, we test their performance on the corresponding benchmarks and invite a code reviewer to examine the correctness of the implementation. Overall, the extensible and comprehensive model module can be beneficial for fast exploration of new algorithms for a specific task and quick comparison between different models.
Specifically, each model exposes two interface functions, i.e., calculate_loss and generate, used for training and testing, respectively. These functions are general across text generation algorithms, so that we can implement various algorithms in a highly unified way. Such a design also enables quick development of new models.
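The two-function interface can be sketched as below. The abstract base class and the toy unigram model are illustrative assumptions, not TextBox's actual code; only the two method names come from the paper.

```python
# Sketch of the unified model interface: every model implements
# calculate_loss (training) and generate (testing).
import math

class AbstractModel:
    def calculate_loss(self, batch):
        """Return the training loss for one batch; called during training."""
        raise NotImplementedError
    def generate(self, batch):
        """Return generated token sequences; called during testing."""
        raise NotImplementedError

class UnigramModel(AbstractModel):
    """Toy model: accumulates unigram counts and always emits the mode."""
    def __init__(self):
        self.counts = {}
    def calculate_loss(self, batch):
        for seq in batch:
            for tok in seq:
                self.counts[tok] = self.counts.get(tok, 0) + 1
        total = sum(self.counts.values())
        # average negative log-likelihood under the unigram distribution
        return -sum(c * math.log(c / total) for c in self.counts.values()) / total
    def generate(self, batch):
        top = max(self.counts, key=self.counts.get)
        return [[top] for _ in batch]
```

Because a trainer only ever calls these two methods, swapping one model for another requires no changes to the training or evaluation loop.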
To improve the quality of generation results, we also implement a series of generation strategies, such as greedy search, top-k search, and beam search. Users can switch between different generation strategies, which may lead to better performance, by setting the hyper-parameter decoding_strategy. Besides, we add model saving and loading functions to store and reuse learned models. In the training process, one can print and monitor the change of the loss value and apply training tricks such as warm-up and early stopping. These small conveniences largely improve the experience of using our library.
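Switching strategies through a single hyper-parameter might look like the sketch below. The dispatch table and the simplified greedy and top-k routines (operating on one probability distribution) are hypothetical stand-ins for the library's real decoders.

```python
# Sketch of decoding-strategy dispatch via a decoding_strategy-style
# hyper-parameter. Implementations are deliberately simplified.
import random

def greedy(probs):
    """Pick the most probable token id."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k(probs, k, rng=random):
    """Keep the k most probable tokens, renormalize, and sample one."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in order)
    r, acc = rng.random() * mass, 0.0
    for i in order:
        acc += probs[i]
        if r <= acc:
            return i
    return order[-1]

STRATEGIES = {"greedy": greedy, "top-k": lambda p: top_k(p, k=2)}

def decode(probs, decoding_strategy="greedy"):
    return STRATEGIES[decoding_strategy](probs)
```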

Evaluation Module
The evaluation module implements commonly used evaluation protocols for text generation. It is important that different models be compared under unified evaluation protocols, which helps standardize the evaluation of text generation. Our library supports both logit-based and word-based evaluation metrics. The logit-based metrics (for the unconditional text generation task) include negative log-likelihood 2 (NLL) and perplexity 3 (PPL), measuring how well a probability model predicts a sample compared with the ground truth. The word-based metrics (for both unconditional and conditional text generation tasks) include the most widely used generation metrics, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), measuring the ratio of overlapping n-grams between generated and real samples. Besides, to evaluate the diversity of generated samples, we also include the Self-BLEU (Zhu et al., 2018) metric. In summary, users can choose the evaluation protocols for a specific generation task by setting the hyper-parameter metrics.
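The two metric families can be illustrated with minimal implementations: perplexity as the exponentiated average negative log-likelihood per token, and the n-gram overlap ratio that underlies BLEU and ROUGE. These are pedagogical simplifications (no brevity penalty, no clipping), not the library's actual metric code.

```python
# Minimal logit-based and word-based metric sketches.
import math

def perplexity(log_probs):
    """PPL = exp(average negative log-likelihood per token)."""
    nll = -sum(log_probs) / len(log_probs)
    return math.exp(nll)

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference."""
    cand = ngrams(candidate, n)
    ref = set(ngrams(reference, n))
    if not cand:
        return 0.0
    return sum(1 for g in cand if g in ref) / len(cand)
```

Self-BLEU applies the same overlap computation between pairs of generated samples, so a high score indicates low diversity.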
In practice, as a model may generate many text pieces, evaluation efficiency is an important concern. Hence, we integrate an efficient computing package, fastBLEU (Alihosseini et al., 2019), to compute evaluation scores. Compared with other packages, fastBLEU adopts a multi-threaded C++ implementation.

System Usage
In this section, we provide a detailed guideline for using our library. Users can run existing models or add their own models as needed.

Running existing models
To run an existing model within TextBox, users need to specify the dataset, model, and task by setting hyper-parameters, e.g., dataset, model, and task. Experiments can then be run with a simple command-line interface; for instance, one can run the GPT-2 (Radford et al., 2019) model on the COCO dataset (Lin et al., 2015) for the unconditional text generation task. TextBox mainly provides two kinds of YAML configuration files, i.e., dataset configuration and model configuration, which allow running many experiments without modifying source code. Users may also modify a YAML configuration file and include it in the command line, which is useful for specifically defined parameters. TextBox is designed to run on different hardware devices. By default, CUDA devices are used if the hyper-parameter use_gpu is set to True; otherwise, the CPU is used. Users can determine the ID of the CUDA device used by setting the hyper-parameter gpu_id.
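The configuration precedence described above (built-in defaults, overridden by configuration files, overridden in turn by command-line parameters) can be sketched as follows. The merge function is a hypothetical stand-in; only the hyper-parameter names (model, dataset, task_type, use_gpu, gpu_id) come from the text.

```python
# Sketch of layered configuration: defaults < config file < command line.
DEFAULTS = {"use_gpu": True, "gpu_id": 0}

def build_config(file_config, cli_args):
    """Later sources win; CLI arguments look like '--model=GPT2'."""
    config = dict(DEFAULTS)
    config.update(file_config)           # values loaded from YAML files
    for arg in cli_args:
        key, _, value = arg.lstrip("-").partition("=")
        config[key] = value              # command line overrides everything
    return config
```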
Based on the configuration, we provide an auxiliary function to split a dataset into train, validation, and test sets according to the hyper-parameter split_ratio, or to load a pre-split dataset. Moreover, TextBox allows users to load and re-train a saved model to speed up reproduction, rather than training from scratch. Figure 2 presents the general usage flow when running a model in our library. The running procedure relies on the experimental configuration, obtained from files, the command line, or parameter dictionaries. The dataset and model are prepared and initialized according to the configured settings, and the execution module is responsible for training and evaluating models.
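A split_ratio-driven split can be sketched in a few lines. This is an illustrative version, not TextBox's actual auxiliary function; a real implementation might also shuffle before splitting.

```python
# Sketch of splitting a dataset into train/validation/test by split_ratio.
def split_dataset(data, split_ratio=(0.8, 0.1, 0.1)):
    assert abs(sum(split_ratio) - 1.0) < 1e-9
    n = len(data)
    n_train = int(n * split_ratio[0])
    n_valid = int(n * split_ratio[1])
    return (data[:n_train],
            data[n_train:n_train + n_valid],
            data[n_train + n_valid:])
```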

Implementing a New Model
With the unified data and evaluation modules, one only needs to implement a specific model class with three mandatory functions:

• __init__() function. In this function, the user performs parameter initialization, global variable definition, and so on. It is worth noting that the new model should be a sub-class of the abstract model class defined in our library. One can reuse the modules (e.g., Transformer) and layers (e.g., Highway network) already existing in our library for convenience. A configuration file is preferable for further flexible adjustment.
• calculate_loss() function. This function calculates the training loss to be optimized and the validation loss used to avoid overfitting. Based on the returned training loss, our library automatically applies different optimization methods to learn the model parameters according to the pre-defined configuration.
• generate() function. This function generates output text, either conditioned on input text or freely. Our library also provides several generation strategies, such as beam search and top-k search, for users to improve generation results.
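The three functions above can be sketched together as a toy new model. The abstract parent class and the "copy" model are assumptions for illustration; in TextBox the new class would inherit from the library's own abstract model class and reuse its existing modules and layers.

```python
# Sketch of a new model implementing the three mandatory functions.
class AbstractModel:
    def __init__(self, config):
        self.config = config
    def calculate_loss(self, batch):
        raise NotImplementedError
    def generate(self, batch):
        raise NotImplementedError

class CopyModel(AbstractModel):
    """Toy conditional model that learns nothing and copies its input."""
    def __init__(self, config):
        super().__init__(config)               # parameter initialization
        self.max_length = config.get("max_length", 20)
    def calculate_loss(self, batch):
        # return a scalar loss; the library's trainer would optimize it
        return float(len(batch))
    def generate(self, batch):
        # produce output text from input text, truncated to max_length
        return [src[:self.max_length] for src in batch]
```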
To implement user-customized modules, one can reuse functions and classes inherited from our basic modules, or override original functions and add new ones.

Performance Evaluation
In order to evaluate TextBox, we implement various text generation models, and compare their performance on unconditional and conditional text generation tasks.

Unconditional Text Generation
Following previous work, we adopt the COCO (Lin et al., 2015) and EMNLP2017 WMT News (Chatterjee et al., 2017) datasets to compare the performance of five traditional and state-of-the-art models, i.e., LSTM-VAE, SeqGAN, RankGAN, MaliGAN, and GPT-2, on the unconditional text generation task.
In our experiments, we follow the parameter configurations described in the original papers. Note that the BLEU-n and Self-BLEU-n metrics in our library employ one-hot weights (e.g., (0, 0, 0, 1) for BLEU-4) instead of average weights, since we consider that one-hot weights reflect the overlapping n-grams more realistically. Besides, we adopt the NLL loss to measure how well the model fits the real data distribution, computed as

NLL = -\mathbb{E}_{s_{1:T} \sim P_{real}} \left[ \sum_{t=1}^{T} \log G(s_t \mid s_1, \ldots, s_{t-1}) \right],

where T denotes the length of the sequence, and P_{real} and G denote the distribution of real data and the text generation model, respectively.
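The effect of the weighting choice can be made concrete with a simplified BLEU (brevity penalty and count clipping omitted, which are part of standard BLEU but not needed to show the weighting): with one-hot weights, BLEU-4 reduces to the 4-gram precision alone, while average weights (1/4, 1/4, 1/4, 1/4) take the geometric mean of the 1- to 4-gram precisions. This sketch is not the library's actual metric implementation.

```python
# Simplified BLEU illustrating one-hot vs. average n-gram weights.
import math

def ngram_precision(candidate, reference, n):
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = [tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)]
    if not cand:
        return 0.0
    return sum(1 for g in cand if g in ref) / len(cand)

def bleu(candidate, reference, weights):
    """Weighted geometric mean of n-gram precisions (no brevity penalty)."""
    precisions = [ngram_precision(candidate, reference, n)
                  for n in range(1, len(weights) + 1)]
    if any(w > 0 and p == 0 for w, p in zip(weights, precisions)):
        return 0.0
    return math.exp(sum(w * math.log(p)
                        for w, p in zip(weights, precisions) if w > 0))

ONE_HOT_4 = (0, 0, 0, 1)              # BLEU-4 as reported in our tables
AVERAGE_4 = (0.25, 0.25, 0.25, 0.25)  # average-weight BLEU
```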
The results are shown in Table 3. The models implemented in our library achieve performance comparable to the results reported in the original papers. Moreover, the pre-trained language model, i.e., GPT-2, achieves consistently remarkable performance on both the COCO and EMNLP datasets. These results match our expectations.

Conditional Text Generation
In this section, we report performance on the test data for the conditional text generation task. Due to space limits, we select a typical conditional text generation task, i.e., machine translation, to compare the attention-based RNN model and the Transformer model using three generation strategies, i.e., top-k, greedy, and beam search. The greedy strategy takes the most probable token at each generation step; the top-k search strategy sorts tokens by probability, zeroes out the probabilities of all tokens below the k-th, and samples from the remainder; and the beam search (Vijayakumar et al., 2018) strategy selects the top-scoring B candidates from the set of all possible one-token extensions of its beams, where B is the beam size.
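The beam search procedure just described can be sketched as follows. The toy step function interface (returning a token-to-probability map for a prefix) is an assumption; a real decoder would score extensions with the model itself.

```python
# Simplified beam search: extend every beam by every token, keep the top
# B candidates by accumulated log-probability, repeat for a fixed number
# of steps.
import math

def beam_search(step_fn, vocab, steps, beam_size):
    """step_fn(prefix) -> {token: prob}; returns the best sequence found."""
    beams = [((), 0.0)]                        # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            probs = step_fn(prefix)
            for tok in vocab:
                p = probs.get(tok, 1e-12)
                candidates.append((prefix + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return list(beams[0][0])
```

Unlike greedy decoding, beam search can recover sequences whose first token is not individually the most probable, which is why it often improves generation quality.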
In our experiments, we adopt the benchmark machine translation dataset IWSLT2014 German-to-English (Cettolo et al., 2014). The RNN model is based on the GRU variant with two layers, and the Transformer model adopts the base configuration, i.e., six encoder and six decoder layers. The beam size for generation is set to 5. Besides BLEU-n, we also provide the average BLEU metric with average weights (1/4, 1/4, 1/4, 1/4) to compare these two sequence-to-sequence models. The results are presented in Table 4. As we can see from Table 4, the Transformer model outperforms the RNN model by a clear margin. Additionally, the beam search strategy brings considerable improvement to both text generation models compared with the simple greedy and top-k strategies.
The results of all implemented models on other datasets can be found on our GitHub page (https://github.com/RUCAIBox/TextBox).

Related Work
Text generation has received much attention from the research community. Several toolkits have been released, each focusing on one or a few specific tasks or techniques. For example, Tensor2Tensor (Vaswani et al., 2018), MarianNMT (Junczys-Dowmunt et al., 2018), and OpenNMT (Klein et al., 2017a) are designed for the machine translation task, while ParlAI (Miller et al., 2017a) and Plato (Papangelis et al., 2020) specialize in dialogue research. Two text generation libraries are closely related to ours: Texygen (Zhu et al., 2018) and Texar (Hu et al., 2019), focusing on GAN techniques and high modularization, respectively.
Compared with them, TextBox covers more text generation tasks and models, which is useful for reproducibility. Besides, we implement standardized evaluations to compare different models, and our library provides various common modules for convenience. It has a proper focus on the text generation field and provides a comprehensive set of modules and functionalities.

Conclusion
This paper presented a unified, modularized, and extensible text generation library, called TextBox. So far, we have implemented 15 text generation models, including VAE-based, GAN-based, RNN-based, Transformer-based, and pre-trained language models, together with 6 benchmark datasets for unconditional and conditional text generation tasks. Moreover, our library is modularized, making it easy to plug in or swap out components, and extensible to support seamless incorporation of external modules. In the future, we will add more models and datasets and consider more functions to cover more text generation tasks.