Chinese WPLC: A Chinese Dataset for Evaluating Pretrained Language Models on Word Prediction Given Long-Range Context

This paper presents a Chinese dataset for evaluating pretrained language models on Word Prediction given Long-term Context (Chinese WPLC). We propose both automatic and manual selection strategies tailored to Chinese to guarantee that target words in passages collected from over 69K novels can only be predicted with long-term context beyond the scope of the sentences containing the target words. Dataset analysis reveals that the types of target words range from common nouns to Chinese four-character idioms. We also observe that the linguistic relations between target words and long-range context exhibit diversity, including lexical match, synonym, summary and reasoning. Experiment results show that the Chinese pretrained language model PanGu-α is 45 points behind humans in terms of top-1 word prediction accuracy, indicating that Chinese WPLC is a challenging dataset. The dataset is publicly available at https://git.openi.org.cn/PCL-Platform.Intelligence/Chinese_WPLC.


Introduction
Predicting a target word from previous context, especially long-range context, is a long-standing challenge in natural language processing. A variety of large-scale datasets such as CNN/Daily Mail (Hermann et al., 2015), Who-did-What (Onishi et al., 2016) and CMRC-2017 (Cui et al., 2018) have been developed to examine the capability of machines in word prediction. However, the majority of such datasets have not undergone thorough manual testing of whether a target word can only be predicted from long-range dependencies, with the exception of LAMBADA (Paperno et al., 2016). That dataset provides a benchmark testbed where a target word can be easily predicted with long-range context but cannot be predicted from only the context words in the sentence where the target word is located.
Partially inspired by LAMBADA, we create Chinese WPLC, a dataset for evaluating powerful pretrained language models on word prediction with long-range context. The passages used in our dataset are carefully extracted from over 69K Chinese novels following a procedure that mixes automatic and manual selection. Significant differences from LAMBADA lie not only in language (English vs. Chinese), but also in the following two aspects: • LAMBADA filters out relatively easy passages with weak language models, e.g., RNN, 4-gram and feed-forward neural language models, which makes it an outdated dataset for current state-of-the-art pretrained language models, as target words in many of the remaining passages may be easily predicted by large-scale pretrained models. Additionally, the original raw data used by LAMBADA may appear in the training sets of current pretrained models (Brown et al., 2020). To tackle these problems, we use two typical large-scale pretrained models to filter out passages: NEZHA (a masked language model) and NEZHA-Gen (a causal language model) (Wei et al., 2019).
• In order to take language features and difficulty level into account, we use new strategies and methods in passage collection, language model filtering and crowdsourced passage selection, which are different from LAMBADA.
We carry out an in-depth analysis on the built dataset, finding that the relations between target words and previous context range from lexical match, synonym and summary to commonsense reasoning. We conduct experiments on the built dataset to evaluate a range of state-of-the-art Chinese pretrained models, including the Chinese pretrained model PanGu-α with up to 200 billion parameters (Zeng et al., 2021), which achieves a top-1 accuracy of 12.1%, 45.2 points behind human performance.

Related Work

CNN/Daily Mail (Hermann et al., 2015) uses an automatic method to create a large number of instances by replacing entities with placeholders in news articles. Children's Book Test (CBT) (Hill et al., 2016) removes words of four types for evaluated models to predict and provides candidate choices. LAMBADA (Paperno et al., 2016) masks the last word in a target sentence and evaluates the ability of models to predict the masked target words with broader context beyond target sentences in novels. Winograd Schema Challenge (WSC) (Levesque et al., 2012) and WinoGrande (Sakaguchi et al., 2020) define a word selection task that focuses on solving commonsense problems in the form of coreference resolution. Details on the differences of Chinese WPLC from previous related datasets are shown in Table 1. In Chinese, the People's Daily (PD) & Children's Fairy Tale (CFT) corpus (Cui et al., 2016) is the first cloze-style reading comprehension dataset in Chinese. ChID (Zheng et al., 2019) offers an interesting task where the words to be predicted are all idioms. CLUEWSC2020 (Xu et al., 2020), a Chinese version of the WSC dataset, aims to test the ability of coreference resolution via word prediction. Significantly different from these Chinese datasets, our dataset is specifically developed for evaluating word prediction from long-range context.

Passage Collection
To diversify topics and domains, we collect raw data for Chinese WPLC from 69,067 crawled novels with different topics (more details are shown in Table 2). Half of the crawled novels are used for training, while the other half is used for extracting passages to build the development and test sets. We automatically extract passages from the raw data according to the following four rules (a code sketch of this procedure follows the list below): • As raw Chinese texts are not word-segmented, we use three different state-of-the-art Chinese word segmenters, PKUSEG (Luo et al., 2019), Jieba and THULAC (Sun et al., 2016), to segment extracted passages. Only passages where the last word to be predicted is consistently identified by all three segmenters are kept.
• If the last word is a stop word, the penultimate word is taken as the target word, since stop words are usually easy to predict. If the penultimate word is also a stop word, the passage is discarded.
• We set the maximum length of a target word to 4 characters, so that the most difficult part of the task is to predict a four-character Chinese idiom.
• The maximum length of passages is limited to 400 characters as long passages make word prediction more difficult even for humans.
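To make these rules concrete, here is a minimal sketch of the selection logic, assuming the three segmenters are wrapped as callables mapping a string to a list of words; the wrapper, helper and stop-word names are ours for illustration, not code from the paper:

```python
from typing import Callable, List, Optional

# A segmenter maps a raw Chinese string to a list of words
# (e.g., thin wrappers around PKUSEG, Jieba and THULAC).
Segmenter = Callable[[str], List[str]]

def pick_target(passage: str,
                segmenters: List[Segmenter],
                stop_words: set) -> Optional[str]:
    """Return the target word for `passage`, or None if it is discarded."""
    if len(passage) > 400:                  # rule 4: maximum passage length
        return None
    seg_results = [seg(passage) for seg in segmenters]
    if any(len(words) < 2 for words in seg_results):
        return None
    # rule 1: all three segmenters must agree on the last word
    if len({words[-1] for words in seg_results}) != 1:
        return None
    words = seg_results[0]
    target = words[-1]
    if target in stop_words:                # rule 2: fall back past a stop word
        target = words[-2]
        if target in stop_words:
            return None
    if len(target) > 4:                     # rule 3: maximum target length
        return None
    return target
```

In the actual pipeline the stop-word fallback would presumably also require the segmenters to agree on the penultimate word; that check is omitted here for brevity.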

Passage Filtering
Similar to LAMBADA (Paperno et al., 2016), we also use language models to filter out passages where the target words (the last words) can be easily predicted by language models. Significantly different from LAMBADA, however, we use more powerful pretrained language models, instead of conventional or neural language models trained on relatively small data, to make our dataset challenging for state-of-the-art pretrained models. We fine-tune NEZHA and NEZHA-Gen (Wei et al., 2019) on the training data, which contain 8.7 billion words from 34,534 novels. We use two strategies to filter passages: (1) predicting the target word given a full passage (context + the target sentence that contains the target word) and (2) predicting the target word given only the target sentence. These strategies are not only more rigorous than those used in LAMBADA but also consistent with the succeeding crowdsourcing step. Different combinations of the two pretrained models and strategies are used to filter passages.
In LAMBADA, a passage is filtered out if the probability of the target word is greater than a preset threshold. Predefining an appropriate threshold is rather difficult and depends heavily on human experience. Thus, we use a different filtering method: any passage where the target word appears in the list of top-5 words predicted under either of the two filtering strategies is discarded (see the sketch below).
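As a hedged illustration of this top-5 filter for the single-character case, the following sketch uses a publicly available Chinese masked language model as a stand-in for the fine-tuned NEZHA; the model name and helper functions are assumptions, not the paper's actual code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Any Chinese whole-word-masking BERT serves as a stand-in for NEZHA here.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-bert-wwm")
model = AutoModelForMaskedLM.from_pretrained("hfl/chinese-bert-wwm")
model.eval()

def target_in_top5(prefix: str, target: str) -> bool:
    """True if `target` (a single character, for simplicity) is among the
    top-5 predictions for a mask appended to `prefix`."""
    inputs = tokenizer(prefix + tokenizer.mask_token, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    top5_ids = logits[0, mask_pos].topk(5).indices.squeeze(0)
    return target in tokenizer.convert_ids_to_tokens(top5_ids.tolist())

def keep_passage(context: str, sentence_without_target: str, target: str) -> bool:
    """Discard the passage if either strategy recovers the target word:
    (1) full passage, or (2) the target sentence alone."""
    return not (target_in_top5(context + sentence_without_target, target)
                or target_in_top5(sentence_without_target, target))
```

Multi-character targets would require one mask token per character and a joint decision over all masked positions.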
Another difference is that we compute the ratio of the target word probabilities estimated by NEZHA-Gen given the full passage and given only the target sentence:

r = P(w | c, s_\w) / P(w | s_\w)

where P(w | c, s_\w) is the probability of the target word w given the long-range context c plus the target sentence s excluding the target word w, while P(w | s_\w) is the probability of predicting w given only s_\w. Higher ratios indicate that the target word can be predicted more confidently given the long-range context than given only the short-term context in the target sentence. Preference is given to passages with a ratio greater than e, the base of the natural logarithm.
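The ratio can be computed from token-level log-probabilities under any causal language model. Below is a minimal sketch using a small public Chinese GPT-2 as a stand-in for the fine-tuned NEZHA-Gen; the model name and function names are our assumptions:

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Stand-in Chinese causal LM for illustration.
tok = AutoTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
lm = AutoModelForCausalLM.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
lm.eval()

def log_prob(prefix: str, target: str) -> float:
    """Sum of log-probabilities of the target tokens given a non-empty prefix."""
    prefix_ids = tok(prefix, add_special_tokens=False, return_tensors="pt").input_ids
    target_ids = tok(target, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, target_ids], dim=-1)
    with torch.no_grad():
        logp = torch.log_softmax(lm(ids).logits, dim=-1)
    total = 0.0
    for i in range(target_ids.size(1)):
        # logits at position p predict the token at position p + 1
        total += logp[0, prefix_ids.size(1) + i - 1, target_ids[0, i]].item()
    return total

def ratio(context: str, sentence_without_target: str, target: str) -> float:
    """Ratio of target probability with vs. without the long-range context;
    passages with a ratio above e are preferred."""
    return math.exp(log_prob(context + sentence_without_target, target)
                    - log_prob(sentence_without_target, target))
```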

Crowdsourced Passage Selection
We hire over 100 crowdsourced workers to manually select passages from those remaining after the automatic passage collection and filtering procedure. For crowdsourced manual passage selection, we take 3 steps, similar to LAMBADA, where in the first two steps crowdsourced workers are asked to guess the missing target word given the entire passage excluding the target word.
In the third step, three different crowdsourced workers are asked to guess at most 3 target words per worker given the short-term context in the target sentence. If none of the manually predicted words are the target word, the passage is added to Chinese WPLC.
Notably, in each step, workers are provided with the length of the target word to ease the guessing difficulty.
In total, we collect 9,301 passages, among which 4,827 passages from 17,266 novels are used as the development set, while the remaining 4,474 passages from 17,267 novels are used as the test set. Table 3 provides detailed statistics of the development and test sets with respect to target word length. Figure 1 shows the distribution of the types of target words in Chinese WPLC. The majority of target words are common nouns (60.5%), followed by verbs (19.9%). Different from LAMBADA, Chinese WPLC contains 3.4% Chinese idioms (see the third example in Appendix Table 6). Chinese idioms increase the difficulty of word prediction for machines, although they are widely used in human-written Chinese texts.

Linguistic Relations between Target Words and Long-Range Context

We investigate the linguistic relations between target words and long-term context in passages. We sample 100 examples from the development set and find four linguistic relations: lexical match, synonym, summary and reasoning, as shown in Appendix Table 6. Lexical match, indicating that the target word also occurs in the context, accounts for 64%. However, lexical match does not mean that the target word can be easily predicted: further statistics in Table 4 disclose that the distance between the target word and its first/last appearance in the context is very long, ranging from over 70 to 80 tokens. Synonym, where a word or phrase with a meaning similar to the target word occurs in the context, accounts for 15%. A more difficult phenomenon is having to summarize the given passage to predict the target word, which accounts for 8% of the sampled data. The remaining samples (13%) require reasoning over the context, where the target word is not explicitly mentioned in the context at all.

Experiments
We carried out experiments with a range of state-of-the-art pretrained language models on Chinese WPLC. As BERT-large and RoBERTa-large are currently not available for Chinese, results of these two models are not provided. Top-1 and top-3 accuracies are reported.

Baseline Models
In addition to BERT (Devlin et al., 2019), we also evaluated the following pre-trained language models on the dataset.
• ALBERT: ALBERT (Lan et al., 2020) is a lite version of BERT with fewer parameters but comparable or better performance.

Experimental Setup
All baselines were tested using their default hyper-parameters, including BERT, ALBERT, RoBERTa, MacBERT, CPM and PanGu-α. For causal language models, beam search was used to generate the top-3 words, and the number of generation steps was set to the length of the target word (a sketch of this decoding setup follows). For masked language models, we downloaded whole-word-masking versions and selected the top-3 words at the masked positions as predicted target words.
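For illustration, here is a minimal sketch of the causal-LM decoding described above, using the Hugging Face generate API with beam search; the stand-in model and helper names are our assumptions, not the paper's evaluation code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Stand-in Chinese causal LM with a character-level tokenizer.
tok = AutoTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
lm = AutoModelForCausalLM.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
lm.eval()

def top3_candidates(prefix: str, target_len: int) -> list:
    """Top-3 continuations via beam search, generating exactly one step
    per character of the target word."""
    ids = tok(prefix, add_special_tokens=False, return_tensors="pt").input_ids
    with torch.no_grad():
        outputs = lm.generate(
            ids,
            max_new_tokens=target_len,   # one step per target character
            num_beams=3,
            num_return_sequences=3,
            pad_token_id=tok.pad_token_id,
        )
    # Drop the prefix and the spaces the character-level tokenizer inserts.
    return [tok.decode(o[ids.size(1):]).replace(" ", "") for o in outputs]
```

A top-3 prediction counts as correct when the gold target word appears among the three returned candidates.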

Human Evaluation

In order to assess human performance on Chinese WPLC, we hired another 4 crowdsourced workers to perform word guessing on 1,000 samples randomly chosen from the development and test sets (500 each). Each worker was asked to guess 3 words, and the first word is considered the most probable word guessed by the worker.

Results

Table 5 presents the results of the models on the development and test data. Note that the scores of NEZHA and NEZHA-Gen are 0, since they were used to filter passages (see the Passage Filtering section).

Pretrained Models vs. Human: All state-of-the-art pretrained models perform much worse than humans on this task. PanGu-α achieves a top-1 accuracy of 12.1%, the highest prediction accuracy among all pretrained models, which, however, is 45.2 points behind human performance (57.3%). We also find that knowledge distillation helps CPM-large achieve a gain of 0.4 to 1.4 percentage points.

Masked Language Models (MLMs) vs. Causal Language Models (CLMs): MLMs (BERT-like) are slightly better than CLMs (next-token prediction) in Table 5. The reasons may be two-fold. First, since MLMs are bidirectional, they can use extra information after target words, such as stop words and punctuation, to predict target words. Second, we used the stronger NEZHA-Gen to filter out passages during dataset creation, which may make the remaining passages difficult for other CLMs.

Analysis on PanGu-α and Human Prediction
We analyzed 100 randomly sampled passages from the development set to compare PanGu-α with crowdsourced workers. One difference between humans and models on word prediction on Chinese WPLC is that human workers can use the length of a target word as auxiliary information, while current models cannot. We find that 14% of the words predicted by PanGu-α are completely correct and 22% are almost correct (see the first and second example in Appendix Table 7). There are also 11% of examples where the target words predicted by PanGu-α are similar to the ground-truth target words (see the third example in Appendix Table 7). We also analyzed 100 sampled passages with correct word predictions by human workers and PanGu-α. We find that 75% of the correct human predictions are of the lexical-match type and 7% are synonym; summary accounts for only 4% of these passages, while the remaining 14% are reasoning. For PanGu-α, 71% of correct predictions are lexical match, followed by reasoning, which accounts for 23%; synonym accounts for 4%, followed by summary with 2%. Lexical match is the easiest type for both humans and models. Even though the target words in reasoning-type word prediction are not explicitly mentioned in the context at all, we find that both humans and models do better on it than on the other two types (i.e., synonym and summary).

Conclusions
In this paper, we have presented Chinese WPLC, a Chinese word prediction dataset created from over 69K novels to examine the ability of pretrained language models in long-term context modeling. We employ both automatic and manual selection strategies to keep passages where target words can only be predicted from long-term context beyond the target sentences and are difficult for pretrained language models to predict. Experiments with a range of state-of-the-art pretrained language models and in-depth analyses demonstrate that the created dataset, covering a variety of linguistic phenomena (e.g., lexical match, synonym, summary and reasoning), is a very challenging testbed even for the very large Chinese pretrained model PanGu-α.

A Appendix
Table 6: Linguistic relations between target words and long-term context. Each "<mask>" represents a single Chinese character.

Lexical match (64%)
Passage: 在一小时的时间里他一直在睡觉。科伦巴的小机场非常潮湿，那儿聚集着一群等候去圣克鲁斯的玻利维亚人。他们个个带着大包小包的圣诞礼物。他叫的那位出租车司机不懂一句英语，但这没关系。内特指给他看旅游手册上的"皇宫饭店"几个字，他坐上这辆又旧又脏的出租车离开了<mask><mask>。
Translation: He had been sleeping during this hour. The small airport in Corumba was very humid, and there was a crowd of Bolivians waiting to head for Santa Cruz. They all carried bags of Christmas presents. The taxi driver he called didn't understand a word of English, but it didn't matter. Nate pointed out the words "Palace Hotel" in the travel brochure. He got in this old and dirty taxi and left the <mask>.
Target word: 机场 / airport

Synonym (15%)
Passage: 守殿的大太监名叫过业大，人称大公公。国藩与大公公打声招呼后，便端坐在养性殿候驾。一坐整整两个时辰，时至正午，尚不见召，国藩心中犯疑，请大公公打听。一会儿，大公公告诉他："皇上今天不来了，明天在养心殿<mask><mask>。"
Translation: The eunuch who guarded the hall was called Guoyeda, commonly known as "the Grand Eunuch". Having greeted the Grand Eunuch, Guofan sat in the hall waiting for the emperor's coming. Guofan sat for two full hours until noon but still was not called. Bewildered, he asked the Grand Eunuch to inquire about this. After a while, the Grand Eunuch told him: "The emperor is not coming today, but will <mask> you at the Hall of Mental Cultivation tomorrow."
Target word: 召见 / summon

Summary (8%)
Passage: 健康的红色会让他们的无限遐想通过努力逐渐转变成为现实，而遗憾的是那些没有自制力的红色却疏于行动，很多梦想最终堕落为空想。因此，与其说堂吉珂德是西班牙的最后一位骑士，莫如说他是超级富于幻想的红色代表人物。当然，如果红色不停地空想，再加上夸夸其谈，一不小心，变成"<mask><mask><mask><mask>"。
Translation: The healthy red can gradually turn their infinite daydreams into reality through effort. It is a pity, however, that those reds without self-control fail to take action, and many dreams eventually degenerate into fantasies. Thus, Don Quixote is not so much the last knight of Spain as a super fanciful representative of the red. Of course, if the red keeps indulging in fantasy, with some magniloquence on top, it will easily turn into "<mask><mask><mask><mask>".
Target word: 纸上谈兵 / an idea on paper

Reasoning (13%)
Passage: 孟飞酝酿了半天硬是没叫出爸和妈，苏蓝为孟飞解围说："他第一次见你们，一时半会还不习惯。"她妈妈非常宽容地说："小伙子第一次总是很难说出口的，结了婚就慢慢习惯了。"孟飞一听窃喜，这话表示她妈妈已经默许了他这位<mask><mask>。
Translation: Meng Fei had been working himself up for a long time but did not manage to call them "father" and "mother" in the end. Su Lan helped him out and said, "It's the first time he has met you, so he isn't quite used to it yet." Her mother said very tolerantly: "It has always been hard for a young man to say this for the first time, but you'll get used to it after you get married." Meng Fei was secretly pleased on hearing this, which indicated that her mother had acquiesced in him as her <mask><mask>.
Target word: 女婿 / son-in-law

Table 7: Examples with predicted target words from PanGu-α and humans.

Example 1
Passage (English translation): The old man is getting older and his body is not as strong as before, but he still has an unyielding personality. However, I have become more and more aware that he is no longer as enthusiastic about wealth as he used to be. No matter how many enterprises and assets the Lu's group owns, they are just paper symbols to him. When people are old, what they most look forward to is family reunion! When you have time, you could hint to Jiahuan that he should come back early. Not only is Lu waiting for him, but also <mask><mask>.
PanGu-α: 你老爷子 / your old man
Target word / Human: 老爷子 / old man

Example 2
Passage: 有时对方正急需，又不肯对你明言，或故意表示无此急需，你如得知情形，更应尽力帮忙，并且不能有丝毫得意的样子，一面使他感觉受之有愧，一面又使他有知己之感。寸金之遇，一饭之恩，可以使他终生铭记。日后如有所需，他必奋身图报。即使你无所需，他一朝否极泰来，也绝不会忘了你这个<mask><mask>！
Translation: Sometimes the other party is in desperate need but will not tell you clearly, or deliberately indicates that there is no urgent need. If you learn of the situation, you should try your best to help, without showing the slightest complacency, making him feel both ashamed of receiving the favor and as if he had found a confidant. The encounter of an inch of gold and the grace of a meal can be remembered for a lifetime. If you need anything later, he will go out of his way to repay you. Even if you need nothing, once his fortunes turn, he will never forget you, his <mask><mask>!
PanGu-α: 朋友 / friend
Target word / Human: 知己 / confidant