ARTIST: A Transformer-based Chinese Text-to-Image Synthesizer Digesting Linguistic and World Knowledge



Introduction
Text-to-Image Synthesis (TIS) is a popular multimodal task that aims to convert natural language texts into realistic images (Frolov et al., 2021). For accurate TIS, various methods have been proposed based on Generative Adversarial Networks (GANs) (Xu et al., 2018; Zhang et al., 2021a).
Despite the remarkable progress, we suggest that mainstream transformer-based TIS models have a few drawbacks. i) Most TIS models have billion-scale parameters (such as Ramesh et al. (2021); Ding et al. (2021)), making it challenging to fine-tune and deploy them in resource-constrained environments. This severely limits their use in real-world applications. ii) The encoder-decoder architecture of the transformer (Vaswani et al., 2017) does not explicitly model the semantics of key elements appearing in texts (i.e., entities or objects), and hence may lack the background knowledge needed for realistic image generation. iii) Most existing works are benchmarked on English datasets, e.g., MS-COCO (Lin et al., 2014). There is a lack of language-specific benchmarks for other languages (Chinese in our work), together with high-performing TIS models suitable for efficient domain-specific fine-tuning and online deployment. In addition, the multi-granularity of word segmentation in the Chinese language (Lai et al., 2021) makes it difficult for the underlying transformer to understand the true meanings of input texts, which also causes ambiguity when injecting knowledge into the models.
We present ARTIST, A tRansformer-based Chinese Text-to-Image SynThesizer with rich linguistic and world knowledge digested. It aims to generate high-quality images with a moderate parameter size. To enhance the understanding abilities of ARTIST, we model the multi-granularity of input texts with linguistic knowledge and attentively inject entity embeddings, derived from the massive relational facts in knowledge bases, into the encoder of the transformer model. The resulting images are then generated by VQGAN (Esser et al., 2021) from the "pseudo image tokens" produced by the transformer decoder.
For evaluation, we establish a large-scale Chinese TIS benchmark over multiple public multimodal datasets and reproduce the results of several popular transformer-based TIS models. The experimental results show that our ARTIST model outperforms previous approaches. In summary, we make the following major contributions in this work:
• We formally propose the ARTIST framework for knowledge-enhanced Chinese TIS.
• In ARTIST, the rich linguistic and relational knowledge facts are injected into the model for better Chinese language understanding.
• We establish a large-scale Chinese TIS benchmark with the reproduced results of state-of-the-art transformer-based TIS models. The experimental results also show that ARTIST outperforms previous approaches.
2 The ARTIST Model

Overview
Figure 1 shows the overview of our ARTIST framework. In the word lattice fusion layer, since Chinese word segmentation is inherently multi-granular, we obtain all possible segmentation results for an input text and generate the corresponding word lattice. The pre-trained entity embeddings are learned from a large-scale knowledge graph and are selectively injected into entity representations by our Entity Representation Interaction Module (ERIM). With the fused knowledge, the transformer auto-regressively generates "pseudo image tokens", whose codebook is obtained from a VQGAN model. Finally, the images are decoded using the same VQGAN model.

Word Lattice Generation
Previous works on TIS treat all tokens in input texts equally, whereas we suggest that the entities described in texts are critical guides for generating images strongly related to the specific objects. Hence, it is vital to identify the entities and integrate token embeddings with pre-trained entity embeddings during transformer training. For Chinese, different word segmentations can greatly change the meanings of sentences, which leads to error propagation and language ambiguity. Following Li et al. (2020), we obtain word lattices of the input sentences, containing all possible word segmentations and entities of the texts.
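To illustrate the idea, a word lattice can be built by matching every dictionary word over the sentence, so that all segmentation granularities coexist as overlapping spans. The following minimal Python sketch uses a tiny invented lexicon for illustration; it is not the lexicon or lattice builder used by ARTIST.

```python
def build_lattice(sentence, lexicon):
    """Return all (start, end, word) spans whose word appears in the lexicon.

    Single characters are always kept so the lattice stays connected.
    """
    spans = []
    n = len(sentence)
    for i in range(n):
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            # keep single characters unconditionally; longer spans must be words
            if j == i + 1 or word in lexicon:
                spans.append((i, j, word))
    return spans

# Toy example: this sentence famously segments multiple ways
# ("Nanjing mayor" vs. "Nanjing City / Yangtze River Bridge").
lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
lattice = build_lattice("南京市长江大桥", lexicon)
```

Overlapping spans such as "南京市" and "市长" are exactly the segmentation ambiguity the lattice preserves for the model to resolve.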

Entity Representation Interaction
The lattice structure represents all possible entities in the sentence, yet injecting too much knowledge may confuse the sentence meaning. Denote h^(k) = {h^(k)_1, ..., h^(k)_N} ∈ R^(N×d) as the token embeddings of layer k, where h^(k)_i is the i-th token embedding, N is the sequence length, and d is the dimension of the hidden representation. Let M be the collection of all possible entities of a given sentence appearing in the lattice. We further denote e_m as the pre-trained entity embedding of the m-th entity in M, and e_{i,m} as the entity embedding to be injected into the i-th token based on the knowledge of e_m. Clearly, if the i-th token overlaps with the m-th entity, we have e_{i,m} = e_m, and e_{i,m} = 0 otherwise. In our work, we obtain the pre-trained entity embeddings by TransE (Bordes et al., 2013), chosen to trade off performance and efficiency, from CN-DBPedia, a large-scale Chinese knowledge graph that contains more than 9 million entities and 67 million relational triples (Xu et al., 2017). The mutual knowledge injection process is then computed as h̄^(k)_i = h^(k)_i + Σ_m w_{i,m} e_{i,m}, where w_{i,m} is the weight of the m-th entity embedding for h^(k)_i, and h̄^(k)_i represents the knowledge-enhanced token embedding after the knowledge of multiple entity embeddings has been selectively injected. The entire sequence of enhanced embeddings is denoted h̄^(k). We then build up the transformer layers by applying attention (ATTN) and layer normalization (LN) over h̄^(k), with W_1 and W_2 as learnable projection parameters.
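The injection step can be sketched in NumPy as below. Since the exact form of the weights w_{i,m} is not spelled out here, the softmax scoring over projected dot products is an assumed stand-in, not the actual ERIM implementation; W1 and W2 play the role of the learnable projections.

```python
import numpy as np

def erim_inject(h, entity_mask, entity_embs, W1, W2):
    """Attentively inject entity embeddings into token embeddings.

    h           : (N, d) token embeddings of one transformer layer
    entity_mask : (N, M) binary, 1 where token i overlaps entity m
    entity_embs : (M, d) pre-trained entity embeddings (e.g. from TransE)
    W1, W2      : (d, d) projections (assumed form of the scoring function)
    """
    # score each (token, entity) pair; mask out non-overlapping entities
    scores = (h @ W1) @ (entity_embs @ W2).T          # (N, M)
    scores = np.where(entity_mask > 0, scores, -1e9)
    # softmax over candidate entities, restricted to overlapping ones
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w * entity_mask
    denom = w.sum(axis=1, keepdims=True)
    w = np.divide(w, denom, out=np.zeros_like(w), where=denom > 0)
    # weighted entity knowledge is added to the token embedding
    return h + w @ entity_embs
```

Tokens overlapping no entity keep their original embedding, matching the e_{i,m} = 0 convention above.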

Realistic Image Generation
For image generation, we employ the above auto-regressive transformer, which allows ARTIST to generate a sequence of "pseudo image tokens" based on the knowledge-enhanced text embeddings. Specifically, the "pseudo image tokens" are the codebook indices encoded by the pre-trained VQGAN model (Esser et al., 2021), denoted as v = {v_1, ..., v_G}, where G is the sequence length of image tokens. Given the text tokens w and the image tokens v, we model the conditional distribution auto-regressively as p(v|w) = Π_{g=1}^{G} p(v_g | v_{<g}, w). The loss function of the ARTIST model is the negative log-likelihood L(Θ) = -Σ_{g=1}^{G} log p(v_g | v_{<g}, w), where Θ is the collection of model parameters. Finally, the images are decoded from "pseudo image tokens" to image pixels. The parameters of VQGAN are fixed during model training.
3 Benchmark and Experimental Results

Benchmark
To our knowledge, there are no publicly available Chinese TIS benchmarks. Thus, we seek to construct one for the research community. The evaluation datasets include COCO-CN (Li et al., 2019), MUGE, Flickr8k-CN (Li et al., 2016) and Flickr30k-CN (Lan et al., 2017), containing a large number of high-quality Chinese text-image pairs. The detailed statistics of the data splits can be found in the appendix (Table 4). Following previous works on TIS, we employ Fréchet Inception Distance (FID) and Inception Score (IS) as metrics (Zhu et al., 2019). A higher IS and a lower FID indicate that the qualities of the generated images are better.
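For reference, FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images: ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^(1/2)). A minimal NumPy sketch of the formula (an eigendecomposition-based matrix square root, not an official implementation) is:

```python
import numpy as np

def fid_score(mu1, sigma1, mu2, sigma2):
    """FID between N(mu1, sigma1) and N(mu2, sigma2).

    The matrix square root of sigma1 @ sigma2 is taken via
    eigendecomposition; the product of two PSD covariance matrices
    has non-negative real eigenvalues.
    """
    diff = mu1 - mu2
    vals, vecs = np.linalg.eig(sigma1 @ sigma2)
    sqrt_prod = (vecs * np.sqrt(np.maximum(vals.real, 0.0))) @ np.linalg.inv(vecs)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * sqrt_prod).real)
```

In practice the means and covariances come from Inception-v3 pool features over thousands of images; identical feature statistics give a score of zero.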
For baselines, CogView (Ding et al., 2021) releases a Chinese model checkpoint that supports zero-shot learning. We also compare ARTIST against other methods with public codes and model checkpoints, including DALL-E (Ramesh et al., 2021) and OFA (Wang et al., 2022b). As the DALL-E and OFA models support English only, we use the readily available English translations of captions from COCO-CN, Flickr8k-CN and Flickr30k-CN, and translate those of MUGE into English through a commercial translation service. There exist some other recent works; however, their codes and checkpoints were not available at the time of writing.

Model Configurations of ARTIST
We have pre-trained and released two versions of ARTIST (base and large), with 202M and 433M parameters, respectively; details are shown in the appendix (Table 5). Both models are pre-trained over a subset of the Wukong corpus (Gu et al., 2022), which contains 100M Chinese text-image pairs collected from the Web. After that, the models are fine-tuned over the four datasets. During training, we fix the batch size and the learning rate to 16 and 4.5e-6, respectively. The sequence length is 288 (32 for text and 256 for image). The vocabulary size is 37,512, containing 21,128 text tokens and 16,384 image tokens. Other hyper-parameters are tuned on the development sets. To save computational resources, we also pre-train a Chinese CLIP model (Radford et al., 2021) over the same Wukong corpus to rank the 10 generated images and select the best one. We implement ARTIST in PyTorch and conduct experiments on a server with 8 Tesla V100 GPUs (32GB).
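The CLIP-based selection step can be sketched as below. Here `clip_score` is a placeholder for the pre-trained Chinese CLIP image-text similarity (higher means better alignment), not its real API.

```python
def pick_best_image(candidates, caption, clip_score, num_samples=10):
    """Rank generated candidate images by a CLIP-style similarity to the
    caption and return the best one.

    clip_score(image, caption) -> float similarity  (placeholder)
    """
    pool = candidates[:num_samples]
    # highest image-text similarity wins
    return max(pool, key=lambda image: clip_score(image, caption))
```

Generating a small pool (10 here) and re-ranking trades a modest amount of extra sampling for noticeably better final images.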

Overall Performance
Table 1 summarizes the TIS results on all benchmark datasets. As seen, ARTIST shows superiority over both zero-shot and fine-tuned methods. Overall, the qualities of images generated by zero-shot learning are not as good as those of fine-tuned models, regardless of the model scale. Compared with the fine-tuned DALL-E and OFA models, ARTIST achieves new state-of-the-art FID results on all four datasets at comparable model sizes, and has competitive IS results on all datasets other than Flickr30k-CN. In summary, effective knowledge injection gives the images generated by ARTIST a clear quality advantage.

Detailed Analysis
Knowledge Ablation. We further conduct an ablation study to verify the impact of knowledge injection. In Table 2, ablating either all knowledge or the word lattice leads to a substantial drop in performance. Injecting knowledge based on the word lattice and ERIM yields a more significant improvement. On average, directly injecting entity embeddings reduces FID by 1.1 (from 50.36 to 49.26) and increases IS by 0.1 (from 11.57 to 11.67) compared to no knowledge injection, while injecting knowledge based on our approach reduces FID by 4.58 (from 50.36 to 45.78) and improves IS by 1.43 (from 11.57 to 13.00).

Learning with Larger Models. We also increase the model size of ARTIST to 433M and compare it against OFA-large (470M). The ARTIST-large model further improves the performance compared to the base model. As shown in Table 3, on average, ARTIST-large reduces FID by 6.65 and improves IS by 1.05 compared to OFA-large.

CLIP. We generate multiple images for each query text and select the best one using the pre-trained CLIP model. As shown in Figure 2, the number of generated images has an impact on performance. We recommend generating 10 images with ARTIST to balance performance and efficiency. In the literature, DALL-E generates 512 images for each caption and selects the best one, and CogView generates 60. Our work is much more efficient, with better performance.

Conclusion and Future Work
In this paper, we present the ARTIST framework for knowledge-enhanced Chinese TIS. The rich linguistic and relational knowledge facts are injected into the model for better Chinese language understanding. For evaluation, we establish a large-scale Chinese TIS benchmark and show that the proposed ARTIST models outperform previous approaches. We will release our models and benchmark to the public and extend our work to other languages in the future.

Limitations
Our work focuses on transformer-based TIS models for the Chinese language, where the rich relational knowledge facts and the linguistic characteristics are fused into the models for better performance.
It is natural to extend our work to other languages (such as English) by considering the linguistic characteristics of those languages as well, which will be addressed in future work.

Ethical Considerations
Our contribution in this work is fully methodological, namely a new framework to train Chinese TIS models with rich knowledge injected. Hence, there are no direct negative social impacts of this contribution. However, as transformer-based models may have negative impacts, such as the machine generation of toxic content, the TIS models produced by our algorithms would unavoidably suffer from these issues and may generate inappropriate images. We suggest that users carefully handle these potential risks by filtering out such images when the TIS models are deployed online.

A Dataset Statistics
The statistics of the four benchmark datasets are summarized in Table 4.

B Configurations of ARTIST Models
The detailed configurations of the ARTIST-base and ARTIST-large models are summarized in Table 5.

C The Performance of VQGAN
During the training process of ARTIST, we fix the parameters of the VQGAN model. Hence, the qualities of the generated images are constrained by the VQGAN model, which can be viewed as an upper bound. We compare the results of VQGAN reconstruction and ARTIST-base in Table 6. As seen, our proposed ARTIST model is closer to the VQGAN reconstruction results than the other baselines in most cases, which shows the superiority of our method.

D Case Studies
Figure 3 and Figure 4 show some qualitative results of ARTIST and other open-sourced models in e-commerce and natural scenes, respectively.In general, our approach generates images with more vivid details in most cases.

Figure 1: The overall architecture of the ARTIST framework.

Figure 2: The performance trend of ARTIST as the number of generated images varies.

Table 1: The overall experimental results of baseline methods and ARTIST over four benchmark datasets. The best results are printed in bold, with the second best underlined.

Table 2: Results of knowledge ablation. "w/o. all knowledge" means no knowledge is injected, and "w/o. word lattice" means directly injecting entity embeddings at the corresponding locations without the word lattice and ERIM.

Table 3: The performance comparison of larger models (ARTIST-large and OFA-large) over two datasets.

Table 6: The reconstruction results of the VQGAN model, together with the performance of ARTIST-base.