A Multi-Modal Knowledge Graph for Classical Chinese Poetry



Introduction
Classical Chinese poetry played an important role in cultural transmission in ancient China, and it helps cultural understanding between the East and the West. Over thousands of years, the language of classical Chinese poetry has come to differ greatly from the modern language, in aspects such as the semantic meaning of words and the organization of sentences. For example, the word "wulai" (无赖) in modern Chinese only means rascally, but in ancient Chinese it also means cute, e.g., "I like my cute son most. He is lying in the grass beside the stream, peeling lotus pods" (最喜小儿无赖，溪头卧剥莲蓬), which increases the difficulty of understanding the language of classical Chinese poetry. To bridge this cultural gap, we propose to present classical Chinese poetry in an intuitive, visual way, making it easier for everyone to enjoy its beauty.
Existing studies in the classical Chinese poetry area (Liu et al., 2018; Zhipeng et al., 2019; Wu et al., 2021; Wei et al., 2020; Wang et al., 2021) mainly focus on the generation and analysis of poetry. Different from them, we make a preliminary attempt to translate classical Chinese poetry into a multi-modal form to help overcome the cultural barriers among different languages, and to make it enjoyable for everyone. To bridge the semantic gap between the two modalities (Liu et al., 2019a; Wang et al., 2019), we construct a multi-modal knowledge graph for classical Chinese poetry (PKG), which incorporates a wealth of information, such as the allusions and visual information of the words in poems. Specifically, PKG consists of text nodes and image nodes. Each text node is linked to its most similar images by computing the similarity of their representations, which are initialized by the pre-trained models BERT (Devlin et al., 2019) and BriVL (Huo et al., 2021), respectively, and then fine-tuned in our proposed multi-modal pre-training language model, PKG-Bert.
With the image information incorporated in PKG, PKG-Bert injects visual knowledge into the semantic learning of poetry words, making it possible to bridge the semantic gap between the modalities. The effectiveness of our method is verified on the poetry-image retrieval task. We invite five experts in the classical Chinese poetry area to screen out the best-matching poems (or images) corresponding to each image (or poem) query, and thereby construct an evaluation dataset for the classical Chinese poetry-image retrieval task. The contributions of our work are as follows:
• We make a preliminary attempt to construct a multi-modal pre-training language model, PKG-Bert, for classical Chinese poetry.
• We contribute a multi-modal knowledge graph, PKG, in the classical Chinese area, and a labeled evaluation dataset for the poetry-image retrieval task, which will promote research on ancient Chinese language understanding.

Related Work
In the field of classical Chinese poetry, there are some works on knowledge graphs of poetry (Liu et al., 2020b; Wei et al., 2020; Hong et al., 2020). Liu et al. (2019b) study new word detection in ancient Chinese corpora, and use a semi-automated method to construct a classical Chinese poetry knowledge graph (CCP-KG) (Liu et al., 2020b). They use BERT and GAT to encode the nodes in the text and the knowledge graph, respectively, to classify the sentiment and theme of classical Chinese poetry. KnowPoetry (Hong et al., 2020) semi-automatically constructs a knowledge graph of Tang poetry and poets, and uses inference rules from experts to support further study of Tang poetry, e.g., mining social relationships between poets.
The retrieval system is available at http://facdbe.com:8080/. There are also some works on multi-modal knowledge graphs in the modern language (Liu et al., 2019a; Wang et al., 2019). MMKG (Liu et al., 2019a) takes three existing knowledge graphs as the blueprint of the textual part, and uses search engines to find images highly related to each text node. Richpedia (Wang et al., 2019) obtains text and image entities from Wikipedia and search engines, and uses predefined rules to extract relations between text and image entities. However, due to the lack of data, existing approaches are hard to apply in the field of classical Chinese poetry. For example, it is difficult for mainstream search engines to recognize ancient Chinese words, and there is no large-scale knowledge base like Wikipedia in the field of classical Chinese poetry.

Methods
The overall architecture of the multi-modal knowledge graph based pre-training model PKG-Bert is shown in Fig. 1.

The Construction of PKG
A large-scale multi-modal knowledge graph of classical Chinese poetry, termed PKG, is constructed by the following steps:
(1) Poetry data collection. We crawl classical Chinese poetry from Gushiwen3 and Souyun4, which contain the content and appreciation of classical Chinese poems. We collect 905,675 poems spanning about three thousand years, from the pre-Qin period (c. 1000 BC) to the Qing dynasty (c. 1800 AD).
(2) Text node connection. We keep the words appearing more than 5 times in the poems as candidate words (145,800 words), and then crawl their definitions from HanDian5 and their sememe annotations from HowNet (Dong and Dong, 2003; Qi et al., 2019). Sememes are the smallest semantic units in linguistics and are well connected in HowNet. Based on the connections among sememes, we further add edges between each candidate word and its sememes. For a word without a sememe annotation, we use the Sememe Correspondence Pooling (SCorP) model (Du et al., 2020) to predict its sememes, and then build connections between the word and each sememe predicted with positive probability.
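The word-filtering and edge-building procedure above can be sketched as follows. This is a minimal illustration with hypothetical inputs: the HowNet annotations and SCorP predictions are stand-in dictionaries, and only the frequency threshold (more than 5 occurrences) and the positive-probability rule come from the text.

```python
from collections import Counter

def build_text_edges(poem_words, known_sememes, predicted_sememes, min_count=5):
    """Keep words seen more than min_count times and link each to its sememes.

    known_sememes: word -> list of sememes annotated in HowNet.
    predicted_sememes: word -> list of (sememe, probability) pairs from a
    sememe-prediction model (SCorP in the paper); only predictions with
    positive probability become edges.
    """
    counts = Counter(poem_words)
    candidates = {w for w, c in counts.items() if c > min_count}
    edges = []
    for word in candidates:
        if word in known_sememes:
            # Word is annotated in HowNet: link it to every annotated sememe.
            edges.extend((word, s) for s in known_sememes[word])
        else:
            # No annotation: fall back to predicted sememes with p > 0.
            edges.extend((word, s)
                         for s, p in predicted_sememes.get(word, []) if p > 0)
    return edges
```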
(3) Image node connection. To incorporate visual information, we add image nodes to PKG. We feed each text node (or its definition in HanDian) into the API6 of the multi-modal pre-training model BriVL (Huo et al., 2021), which returns the top 10 images most similar to the input text. We then connect the text node to these highly related image nodes.
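The linking step can be illustrated with a small sketch. In the paper the text and image embeddings come from the BriVL API; here both are stand-in random vectors, and similarity is assumed to be cosine similarity over the returned representations.

```python
import numpy as np

def link_images(text_vec, image_vecs, top_k=10):
    """Return indices of the top_k images most similar to one text node.

    text_vec: (H,) embedding of the text node (stand-in for a BriVL vector).
    image_vecs: (N, H) matrix of candidate image embeddings.
    Similarity is cosine, computed by normalizing both sides to unit length.
    """
    t = text_vec / np.linalg.norm(text_vec)
    m = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = m @ t                       # cosine similarity to every image
    return np.argsort(-sims)[:top_k].tolist()
```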
In total, the multi-modal knowledge graph PKG contains 111,514 text nodes (including the words in poems and the sememes in HowNet), 96,049 image nodes, and 1,369,632 edges.

PKG-Bert
To learn the representation of poetry, we propose a multi-modal pre-training language model for classical Chinese poetry, termed PKG-Bert. The architecture of PKG-Bert is shown in Fig. 1; it is divided into three parts: the input layer, the encode layer, and the target layer.
(1) Input layer. The input layer consists of three types of information: token, image entity, and position. Tokens include the words in the poem and the linked text entities in PKG. As shown in Fig. 1, given a poem T = {t_1, t_2, ..., t_n}, where t_i represents a word in the poem, we first insert the special tokens [CLS] and [SEP] to tag the start and end of the poem. Taking the word t_1 as an example, we then insert its neighbors in PKG (i.e., n_1 and n_2) behind it. The embedding of each token is defined as v_token ∈ R^H, where H is the hidden dimension. An image entity refers to an image node in PKG that is linked to a token. Given a token, we encode its linked image entities with BriVL (Huo et al., 2021), and then apply average pooling and a linear transformation to obtain its visual representation v_image ∈ R^H. Position records the order of words in a poem. Note that the inserted text entities from PKG and the next word in the poem are encoded with the same position; for example, n_1, n_2 and t_2 share the same position embedding v_pos ∈ R^H. Finally, the input representation of a token is defined as:

v_Input = v_token + v_image + v_pos. (1)

(2) Encode layer. The encode layer uses Mask-Transformer blocks (Liu et al., 2020a) to extract features from the input. Given the input representation v_Input, the output representation of the i-th transformer block v_i (with v_0 = v_Input) is defined as:

Q, K, V = v_{i-1} W_q, v_{i-1} W_k, v_{i-1} W_v,
v_i = softmax((Q K^T + VM) / sqrt(d_k)) V, (2)

where Q, K, V are the query, key and value vectors of the attention mechanism, W_q, W_k, W_v are trainable parameters, d_k is the dimension of Q and K, and VM is the visible matrix in the self-attention block that prevents the inserted entities from disturbing the inference of the words.
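The input-layer sum of Eq. (1) and one Mask-Transformer step can be sketched in NumPy. This is an illustrative single-head version under the assumption that the visible matrix VM holds 0 for allowed token pairs and a large negative value for blocked ones (as in K-BERT); the real model stacks 12 multi-head blocks.

```python
import numpy as np

def softmax(x):
    """Numerically stable row-wise softmax."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def input_representation(v_token, v_image, v_pos):
    """Eq. (1): the input embedding is the element-wise sum of the three parts."""
    return v_token + v_image + v_pos

def mask_self_attention(v, W_q, W_k, W_v, visible):
    """One single-head mask-self-attention step (Eq. (2), simplified).

    v: (n, H) input representations.
    visible: (n, n) matrix, 0 where token i may attend to token j and a large
    negative value where attention is blocked, so that inserted graph entities
    cannot disturb unrelated words.
    """
    Q, K, V = v @ W_q, v @ W_k, v @ W_v
    d_k = Q.shape[-1]
    scores = (Q @ K.T + visible) / np.sqrt(d_k)
    return softmax(scores) @ V
```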
(3) Target layer. We use the Masked Language Model (MLM) objective (Devlin et al., 2019) to train the language model. Words are randomly selected for masking with a probability of 15%; each selected word is replaced by [MASK], replaced by a random word, or left unchanged, with probabilities of 80%, 10%, and 10%, respectively.
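A minimal sketch of this masking scheme, with a hypothetical toy vocabulary:

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", p_select=0.15, seed=None):
    """BERT-style masking: select ~15% of positions; of those, 80% become
    [MASK], 10% become a random vocabulary word, and 10% stay unchanged.
    Returns the corrupted sequence and the positions to predict."""
    rng = random.Random(seed)
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < p_select:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = mask_token          # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(vocab)   # 10%: replace with a random word
            # else: 10%: keep the original token (still predicted)
    return out, targets
```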
To overcome the semantic gap between the textual and visual modalities, we minimize the distance between the poem representations learned by PKG-Bert and by BriVL (Huo et al., 2021), keeping the poem and its visual information in the same learning space. It is optimized by:

Loss_space = 1 − cos(V_PKG-Bert, V_BriVL), (3)

where V_PKG-Bert and V_BriVL are the vectors of a poem in PKG-Bert and BriVL, respectively.
The final optimization function is defined as:

Loss = Loss_MLM + Loss_space. (4)
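Equations (3) and (4) can be sketched directly; the vectors here are hypothetical stand-ins for the poem representations produced by PKG-Bert and BriVL.

```python
import numpy as np

def space_loss(v_pkgbert, v_brivl):
    """Eq. (3): one minus the cosine similarity of the two poem vectors."""
    cos = np.dot(v_pkgbert, v_brivl) / (
        np.linalg.norm(v_pkgbert) * np.linalg.norm(v_brivl))
    return 1.0 - cos

def total_loss(loss_mlm, v_pkgbert, v_brivl):
    """Eq. (4): the MLM loss plus the space-alignment loss."""
    return loss_mlm + space_loss(v_pkgbert, v_brivl)
```

Identical vectors give a space loss of 0, so the alignment term vanishes exactly when the two models agree on the poem representation.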

Experiments
We make a preliminary attempt to design a poetry-image retrieval task, which can be used to evaluate the effectiveness of PKG-Bert.

Experimental Settings
We collect classical Chinese poetry from Gushiwen and Souyun, and images from Unsplash7, a free community of photographers that provides a large number of high-definition images. The statistics of the dataset are shown in Table 2.
To evaluate model performance, we randomly select 50 images (or poems) as queries, take the top 10 poem (or image) candidates returned by each model (selected from 30,000 poems and 200,000 images), and pool them to construct the evaluation set. We invite five experts in the classical Chinese area to label a similarity score for each candidate of each query, decided mainly by the number of related objects in the candidate. We take the average score of the five experts as the final score of a candidate.

Baseline Methods and Metrics
We compare PKG-Bert with recent multi-modal pre-training models, including:
• BriVL (Huo et al., 2021): the largest Chinese multi-modal pre-training model, pre-trained on 30 million text-image pairs.
• ViLT (Kim et al., 2021): a multi-modal pre-training model with a rigorous inter-modal interaction scheme, which requires strong semantic correlation between the input text-image pairs (9 million pairs).
For a fair comparison, we translate the classical Chinese poems into English before feeding them into CLIP and ViLT. All the models above are evaluated on two tasks: poetry-to-image and image-to-poetry retrieval. We adopt the widely used metrics HR@K (Hit Ratio), NDCG (Normalized Discounted Cumulative Gain) (Järvelin and Kekäläinen, 2002), and MRR (Mean Reciprocal Rank) (Voorhees et al., 1999) to evaluate model performance. Implementation details are given in Appendix A.1.
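The three metrics can be sketched for a single query as follows; `ranked_relevance` is a hypothetical list of graded relevance scores (e.g., averaged expert scores) in the model's ranking order, with 0 meaning irrelevant.

```python
import math

def hit_ratio_at_k(ranked_relevance, k):
    """HR@K: 1 if any relevant item appears in the top k, else 0."""
    return int(any(r > 0 for r in ranked_relevance[:k]))

def mrr(ranked_relevance):
    """Reciprocal rank of the first relevant item (0 if none is relevant)."""
    for i, r in enumerate(ranked_relevance, start=1):
        if r > 0:
            return 1.0 / i
    return 0.0

def ndcg(ranked_relevance, k):
    """NDCG@K with graded relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(r / math.log2(i + 1)
              for i, r in enumerate(ranked_relevance[:k], start=1))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Per-query values are then averaged over all 50 queries to obtain the reported scores.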

Main Results
The performance of the different models is shown in Table 1. We can see that: (1) PKG-Bert achieves the best performance on both tasks. Even without pre-training on text-image pairs, PKG-Bert still performs the best, indicating the usefulness of incorporating the multi-modal knowledge graph PKG in the poetry-image retrieval task.
(2) PKG-Bert performs better than its variants -Visual and -PKG, which learn without the visual information in PKG and without the entire PKG, respectively. This indicates that both the visual information and the extra knowledge incorporated from the KG help to overcome the semantic gap between the two modalities.
(3) PKG-Bert obtains a greater improvement on the poetry-to-image task (an average improvement of 57%) than on the image-to-poetry task (12%). This may be because understanding the query is more important in retrieval tasks: in the poetry-to-image task, the semantics of the poem are learned much better by PKG-Bert than by the other multi-modal language models.
It is difficult to retrieve classical Chinese poems related to "the starry sky" (星空), since most multi-modal pre-training language models are based on the modern language, while the words describing the starry sky in classical Chinese poetry (e.g., "天汉", "星斗") rarely appear in modern Chinese, which increases the difficulty of retrieving the related poems. In PKG-Bert, the words in classical Chinese poems are linked to sememes, which are associated with objects in the modern language, helping to bridge the gap between modern and ancient Chinese. Besides, the linked visual information of poem words also helps the retrieval task. More details are shown in Appendix A.2.

Conclusions
We construct a multi-modal knowledge graph for classical Chinese poetry (PKG), and propose a multi-modal pre-training language model, PKG-Bert. It incorporates visual information into the semantic learning of classical Chinese poetry, which helps to promote research in classical Chinese language understanding. The poetry-image retrieval application also helps to cross cultural barriers between different countries, making classical Chinese poetry enjoyable for everyone.

Limitations
This paper uses multi-modal information to enhance a pre-training language model in the classical Chinese poetry area, which could be extended to general Chinese poetry in the future. In addition, the objects in a poem may not exactly match the images in the candidate image set; thus, collecting a larger-scale image set would further improve model performance.

A Appendix
A.1 Implementation Details

PKG-Bert is initialized from BERT (Devlin et al., 2019), and a grid search is applied to find the optimal settings. We use Adam (Kingma and Ba, 2015) as the optimizer with a weight decay rate of 0.01, and the learning rate is 2e-5 (selected among 2e-6, 2e-5, and 2e-4). The number of transformer blocks in PKG-Bert is 12, the number of attention heads is 12, the dimension of the vectors in the transformer (d_k) is 64, the activation function is GELU (Hendrycks and Gimpel, 2017), the hidden dimension of the embeddings (H) is 768, and the batch size is set to 16. The total number of parameters is 122M. Our model is implemented in PyTorch 1.6.0 on an NVIDIA GeForce RTX 2080 and is trained for 84 epochs; the train and validation sets contain 200,000 and 10,000 samples, respectively, and the inference time is 21 milliseconds.

Figure 1 :
Figure 1: The overall architecture of PKG-Bert. t_1, t_2, t_3 are the words in the input poem, n_1, n_2 are the text neighbors of t_1 in PKG, and triangles represent the image entities.

Figure 2 :
Figure 2: An example of image-to-poetry retrieval. Underlined words are highly related objects in the candidate poems.

Figure 3 :
Figure 3: The retrieved poems for the query of a photo of the starry sky. Underlined words are related objects in the candidate poems.

Figure 4 :
Figure 4: The retrieved poems for the query of a photo of an island covered with trees. Underlined words are related objects in the candidate poems.

Figure 5 :
Figure 5: The retrieved images for the query of a poem about a lake in fog. Underlined words are related objects in the candidate images.

Figure 6 :
Figure 6: The retrieved images for the query of a poem about the moon. Underlined words are related objects in the candidate images.

Table 1 :
The performance of different methods.

Table 2 :
The statistics of the dataset.