Coupling Context Modeling with Zero Pronoun Recovering for Document-Level Natural Language Generation

Natural language generation (NLG) tasks on pro-drop languages are known to suffer from zero pronoun (ZP) problems, and these problems remain challenging due to the scarcity of ZP-annotated NLG corpora. To this end, we propose a highly adaptive two-stage approach that couples context modeling with ZP recovering to mitigate the ZP problem in NLG tasks. Notably, we frame the recovery process in a task-supervised fashion, where the ZP representation recovering capability is learned during the NLG task learning process; thus, our method does not require NLG corpora annotated with ZPs. For further enhancement, we learn an adversarial bot to adjust our model outputs and alleviate the error propagation caused by mis-recovered ZPs. Experiments on three document-level NLG tasks, i.e., machine translation, question answering, and summarization, show that our approach substantially improves performance, with particularly strong gains on pronoun translation.


Introduction
Natural language generation (NLG) has long attracted attention for its importance in serving human life. In the literature, although various studies have been conducted to bridge the discrepancy between human and machine, document-level NLG (D-NLG) tasks still suffer from cohesion issues caused by zero pronouns (ZPs). As a discourse phenomenon where pronouns can be omitted when they are pragmatically or grammatically inferable from context (Li and Thompson, 1979), zero pronouns appear frequently in pro-drop languages such as Chinese and Spanish. Taking the Chinese TED corpus as an example, according to our statistics, sentences have an average length of 18 words and each sentence omits around 0.5 pronouns. Facing this problem, much attention has been paid to ZP resolution in the past decade (Zhao and Ng, 2007; Kong and Zhou, 2010; Yin et al., 2018; Zhang et al., 2019a; Song et al., 2020), and these studies have achieved certain success. Nevertheless, due to the lack of NLG corpora annotated with ZPs, the zero pronoun problem remains challenging in document-level NLG tasks.
Recently, more and more researchers have turned to ZP resolution in NLG tasks (Rao et al., 2015; Wang et al., 2016, 2018a,b, 2019). Among these studies, one line explicitly deals with the ZP problem by recovering dropped pronouns through either annotated corpora (Wang et al., 2016, 2018a,b, 2019; Zhang et al., 2019c) or pre-trained ZP resolution systems (Taira et al., 2012; Xiang et al., 2013). Another line indirectly deals with the ZP problem by producing better discourse cohesion through document context modeling (Miculicich et al., 2018; Tan et al., 2019; Wong et al., 2020). Although the above studies have made great progress, ZP resolution in NLG still faces the following possible bottlenecks: (i) ZP-annotated corpora tailored for NLG tasks are scarce, and existing ZP corpora are limited to certain domains and tasks; (ii) using pre-trained ZP resolution systems to recover pronoun labels for NLG tasks can lead to the notorious error propagation problem; (iii) although document context modeling can improve discourse-level cohesion to some extent, the context information is too broad to solve the ZP problem in a targeted manner.
To solve the ZP problem in D-NLG tasks while avoiding the above disadvantages, we introduce a highly adaptive two-stage approach that couples context modeling with ZP recovering. Specifically, our approach mainly consists of two phases. First, we pre-train a fault-tolerant ZP position detector on the downstream tasks' corpora to automatically detect ZP positions. Second, we perform document context modeling for both task-supervised ZP recovering and ZP-focused NLG task learning. Notably, the ZP recovering process and the NLG task learning process depend on and reinforce each other. On the one hand, instead of recovering specific pronoun labels, we learn ZP representations through the supervision of NLG tasks, and thus our method does not require large-scale ZP-annotated data. On the other hand, the obtained ZP representations are in turn fused into the previously modeled document context for ZP-focused NLG task learning. In this way, the recovered ZP representations and the supervision of specific tasks are well shared between the two processes for high-quality model integration.
To comprehensively investigate our proposed method, we conduct experiments on three D-NLG tasks: document-level neural machine translation (NMT), question answering (QA), and summarization. Experimental results show that our approach significantly improves performance on these tasks owing to the effective combination of ZP recovering and document context modeling. Furthermore, we use both APT (Miculicich Werlen and Popescu-Belis, 2017) and CRC (Jwalapuram et al., 2019) to evaluate our model's pronoun generation, and the results show that our approach achieves remarkable performance.

Related Work
As a fundamental task in natural language processing, ZP resolution aims at detecting pronoun chains and resolving missing pronouns to their antecedents. In the literature, previous work mainly resolved ZPs in three steps: zero pronoun detection, anaphoricity determination, and coreference linking. On this basis, various traditional rule-based or machine-learning methods were used for ZP resolution (Converse, 2006; Zhao and Ng, 2007; Kong and Zhou, 2010). Recently, some neural approaches (Liu et al., 2017; Yin et al., 2018; Zhang et al., 2019a,b; Song et al., 2020) were proposed and have achieved certain success due to their better objective representation and powerful neural architectures.
As a common language phenomenon in pro-drop languages, zero pronouns can result in poor discourse-level cohesion and thus seriously impact the performance of document-level NLG. To date, two lines of research have been conducted to alleviate this cohesion deficiency: (i) recovering dropped pronouns in specific NLG corpora for downstream tasks; (ii) using well-designed context-aware architectures for document-level cohesion modeling. First, some studies directly used manually (Yang et al., 2015; Zhang et al., 2019c) or automatically (Wang et al., 2016, 2018a) annotated ZP corpora for NLG tasks. Nevertheless, manual annotation is usually time-consuming, and automatic annotation is limited to specific tasks like machine translation, so these methods remain challenging when facing new corpus domains or tasks. Moreover, although some two-stage methods were proposed that use pre-trained ZP resolution systems for preprocessing (Taira et al., 2012; Xiang et al., 2013), these methods are known to face notorious error propagation problems. Second, some recent studies explored context-aware architectures for better document cohesion modeling (Miculicich et al., 2018; Maruf and Haffari, 2018; Maruf et al., 2019; Tan et al., 2019; Kang et al., 2020). For instance, Tan et al. (2019) proposed a hierarchical model to capture global context, which significantly improves pronoun translation in document-level NMT. Although the above studies can capture document-level cohesion to some extent, the context information is too broad to solve ZP problems in a targeted manner.
The difference between our method and previous ones is two-fold. First, our two-stage method combines the above two categories, which can mitigate both the corpus limitation and error propagation issues. Second, compared with previous ZP recovery methods, we focus on enhancing NLG with the recovered ZP representations rather than with specific pronoun labels. Therefore, our approach does not require ZP-annotated NLG corpora.

D-NLG with ZP Recovery
In this section, we introduce the proposed highly adaptive approach which consists of two stages, i.e., detecting ZP positions in the first stage (Section 3.1) and then coupling context modeling with ZP representation recovering in document-level NLG (Sections 3.2 and 3.3).

ZP Position Detection
Due to corpus limitations, previous two-stage methods usually employ pre-trained ZP detectors to automatically recover dropped pronouns for NLG tasks. However, these methods usually suffer from error propagation problems. In addition, referring to the human annotation of Yang et al. (2015), we find that many annotated pronouns are replaceable from the sentence-level perspective but irreplaceable from the document-level perspective. With this in mind, we argue that only detecting ZP positions in the first stage and then performing context-aware ZP recovery in the second stage can alleviate the above problems to a certain extent. Notably, considering that ZP position detection is less affected by domain differences, we only need small-scale out-of-domain ZP-annotated data, rather than data specific to the target task, to achieve it.
In this work, we cast ZP position detection as a sequence labelling task. Our statistics show that in 10% of the cases, a ZP position occurs at the end of a sentence (after the last word), which means that taking both the left and right sides of each word as candidate ZP positions is necessary. Therefore, different from previous work (Wang et al., 2016), we take both sides of a word into consideration for ZP position detection. Formally, given a sentence with n word units w_0, ..., w_{n-1}, the components of the detector can be described as follows.

Encoder. A two-layer bi-directional GRU (Cho et al., 2014) is used to map the word units into a set of hidden states u_0, ..., u_{n-1}.
Decoder. During decoding, we feed the previously obtained hidden states into a uni-directional GRU for ZP position prediction. Since both the left and right sides of each word could be ZP positions, we insert placeholders on both sides of each word as candidate ZP positions. Therefore, the decoder input can be formulated as (p, u_0, p, ..., p, u_{n-1}, p), where the sign p denotes a randomly initialized placeholder vector. It should be noted that although these placeholders share the same learnable vector p, the corresponding decoder outputs of these placeholders are context-aware and thus different from each other. Based on this model structure, we simply build a negative log-likelihood loss between the decoder outputs (after log-softmax) and the gold-standard ZP positions to train our ZP position detector. It is worth mentioning that we aim to build a fault-tolerant ZP position detector; in other words, we permit mis-predicted ZP positions to appear in this stage, which provides more possibilities for subsequent tasks (Section 3.2) to further determine the value of these positions.
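For concreteness, the decoder input construction described above can be sketched as follows. This is a toy Python illustration under our own naming (the paper learns a shared placeholder vector; we represent it here as a string token), not the actual implementation:

```python
# Illustrative sketch of the candidate-position sequence for the detector.
# The placeholder token name "<zp>" is our own assumption.
ZP_PLACEHOLDER = "<zp>"

def with_candidate_positions(words):
    """Insert a placeholder before every word and after the last one, so
    that every gap (including the sentence-final one) is a candidate ZP
    position for the sequence-labelling decoder."""
    seq = []
    for w in words:
        seq.append(ZP_PLACEHOLDER)
        seq.append(w)
    seq.append(ZP_PLACEHOLDER)
    return seq

sent = ["都", "用", "了", "fMRI", "技术"]
seq = with_candidate_positions(sent)  # n words -> 2n + 1 units
```

In this scheme a sentence with n words yields n + 1 candidate positions, so sentence-final dropped pronouns are covered as well.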

Context Modeling for ZP Recovering
In this work, we hold the view that recovering ZPs requires not only sentence-level semantics and intentions but also document-level context. With this in mind, based on the previously detected ZP positions, we explore leveraging document context for ZP representation recovery in this section.

Document-level Context Modeling. Hierarchical architecture has proven effective for document context modeling in many NLG tasks. Recently, Tan et al. (2019) demonstrated that global document context performs better than partial context in document-level NMT. Inspired by this, we also employ a hierarchical network to model global context for ZP recovery, as shown in Figure 1. Formally, given a document with N sentences, the context modeling process is formulated as:

s_i = ATT_self(h_{i,0}, ..., h_{i,n-1}),    H^d_i = ATT_self(s_0, ..., s_{N-1})_i,

where H^d_i denotes the extracted context information and ATT_self denotes the multi-head self-attention function mentioned in (Vaswani et al., 2017). Following Tan et al. (2019), we implement both the sentence- and document-level encoders with multi-head self-attention functions, and the model parameters are shared between the two encoders.
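The two-level scheme can be rendered as a toy sketch, with single-head scaled dot-product attention standing in for the paper's multi-head ATT_self; all dimensions and names here are our own illustration:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attend(vecs):
    """Single-head scaled dot-product self-attention over a list of
    equal-length vectors (a stand-in for multi-head ATT_self)."""
    d = len(vecs[0])
    out = []
    for q in vecs:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                     for k in vecs])
        out.append([sum(wi * v[j] for wi, v in zip(w, vecs))
                    for j in range(d)])
    return out

def mean_pool(vecs):
    d = len(vecs[0])
    return [sum(v[j] for v in vecs) / len(vecs) for j in range(d)]

# Sentence level: attend over word states, pool into a sentence vector s_i;
# document level: attend over the sentence vectors to get one global
# context H^d_i per sentence.
doc = [[[0.1, 0.2], [0.3, 0.1]],                  # sentence 0: 2 word states
       [[0.2, 0.4], [0.0, 0.1], [0.5, 0.2]]]      # sentence 1: 3 word states
sent_vecs = [mean_pool(self_attend(s)) for s in doc]
doc_ctx = self_attend(sent_vecs)
```

The mean-pooling step is an assumption for brevity; the actual system shares parameters between the two self-attention encoders rather than pooling.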
Context-Aware ZP Recovering. Usually, document context consists not only of relationships between sentences but also of dependencies among words. Each word has its specific surrounding context, even the ZP placeholders, so it is reasonable to utilize the surrounding context of ZP placeholders for ZP representation recovery. To achieve this, inspired by Tan et al. (2019), we distribute the previously obtained context information to each word unit as:

α_{i,j} = ATT_additive(h_{i,j}, H^d_i),    c_{i,j} = α_{i,j} · H^d_i,

where h_{i,j} is the hidden state of the j-th word in the i-th sentence, H^d_i denotes the extracted document context, ATT_additive is an additive attention function, and α_{i,j} and c_{i,j} denote the attention weight and the context information assigned to the word. The complete context modeling and distributing process is illustrated in Figure 1. Before introducing the ZP representation recovering process, an important question needs to be clarified: do downstream tasks require all the ZP positions to be recovered? Take machine translation as an example. Given the sentence "(他们) 都 用 了 fMRI 技术 也 就是 功能性 核磁共振 成像 技术 来 对 大脑 进行 造影 。" with the subject "他们 (They)" omitted, the reference translation is "Both used fMRI technology, functional magnetic resonance imaging technology, to image the brain." Notably, the dropped pronoun "他们 (They)" is not explicitly translated, in line with the language habits of the target side. This indicates that not all dropped pronouns need to be recovered, and determining whether a pronoun like "他们 (They)" should be recovered for better translation requires a good understanding of the context. Furthermore, since we perform fault-tolerant ZP position detection in the first stage, the detected positions will naturally contain some mis-predicted ones.
Taking the above circumstances into account, we add a non-ZP mark ε to the label space so that our model can determine, according to the document context, whether each detected ZP position should be filled with a specific pronoun label or not.
For ZP recovery, given the document context assigned to each ZP position, we build another additive attention function between the context information c_{i,j} assigned to each ZP placeholder and the pronoun label vectors P_emb, as shown in Figure 1. The attention scores are calculated as:

ρ = softmax(ATT_additive(c_{i,j}, P_emb)) ∈ R^K,

where K = 31 denotes the number of pronoun labels. As stated before, we propose to borrow the learning objectives from NLG tasks, rather than manually annotated ZP corpora, to guide the learning of ZP representation recovery. Specifically, for each placeholder, we first select the pronoun label vector with the highest attention score and then multiply the selected vector by its corresponding attention weight ρ_κ to obtain the recovered ZP representation:

ĉ_{i,j} = ρ_κ · P_emb[κ],    κ = argmax_k ρ_k.

In this way, as gradients are updated, both the ZP representations and the vector-style pronoun label selection process are learned automatically.
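The label-selection step above can be sketched as follows; for brevity this toy uses dot-product scoring in place of the paper's additive attention, and the non-ZP mark ε is simply one of the label vectors (all names are our own illustration):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def recover_zp(c, label_vecs):
    """Attend from a placeholder's context vector c over the pronoun label
    embeddings (the non-ZP mark is one of them), then return the
    top-scoring label vector scaled by its attention weight rho."""
    rho = softmax([sum(a * b for a, b in zip(c, p)) for p in label_vecs])
    k = max(range(len(rho)), key=rho.__getitem__)
    return [rho[k] * x for x in label_vecs[k]], k

# Toy label space: index 0 plays the role of the non-ZP mark.
P_emb = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
zp_rep, picked = recover_zp([2.0, 0.1], P_emb)
```

Scaling the selected label vector by its attention weight keeps the selection differentiable with respect to the attention scores, which is what lets the NLG task objective supervise the recovery.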

D-NLG with ZP Representation
In this subsection, we integrate the ZP recovery process into specific document-level NLG tasks. The integration mainly consists of two phases: first, we replace the original context information distributed to each placeholder, c_{i,j}, with the obtained ZP representation, ĉ_{i,j}, for ZP-focused context modeling; second, we combine the ZP-focused context with the sentence-level encoder outputs and feed the combinations into the decoding phase of each subsequent task for NLG task learning.
In detail, we apply our method to three recent NLG systems: document-level NMT (Tan et al., 2019), and QA and summarization (Xu et al., 2020). Among them, Tan et al. (2019) introduced a hierarchical structure to model global context from all sentences of an article and demonstrated the effectiveness of global context in machine translation. Xu et al. (2020) presented a straightforward yet effective model for summarization and QA based on their proposed MATINF dataset. We incorporate our ZP-focused context information into the three systems as follows:

• For document-level NMT, we first apply our ZP position detector to the NMT corpora for preprocessing. Then, based on the Transformer-based system of Tan et al. (2019), we input the sentences with ZP positions into their encoder for ZP representation recovering and global context refining; other settings remain the same as theirs.
• For QA and summarization, we also preprocess the MATINF corpus with ZP position detection. Similar to NMT, in QA and summarization we extract global context for each word unit for both ZP recovering and context refining. Since Xu et al. (2020) did not extract document context in their original system, we simply incorporate the obtained ZP-focused context information into their original word representations through a summation function; other settings remain the same as in Xu et al. (2020).
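Under our reading, the fusion step amounts to overwriting each placeholder's distributed context with its recovered ZP representation and then summing with the sentence-level hidden states before decoding. A minimal sketch, with all names and shapes being our own assumptions:

```python
def fuse_zp_context(hidden, context, zp_reps, zp_positions):
    """Replace the distributed context c_{i,j} at detected ZP positions
    with the recovered ZP representation, then sum with the sentence-level
    hidden states. `zp_reps` maps a position index to its vector."""
    fused = []
    for j, (h, c) in enumerate(zip(hidden, context)):
        cj = zp_reps[j] if j in zp_positions else c
        fused.append([a + b for a, b in zip(h, cj)])
    return fused

hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # sentence-level states
context = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]     # distributed context
fused = fuse_zp_context(hidden, context, {1: [0.5, 0.5]}, {1})
```

For the QA and summarization systems, which lack a document-context pathway, the same summation is applied directly to the original word representations, as described above.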

Model Learning
The overall model learning is composed of two parts: (i) learning from the NLG tasks' objectives for standard language generation; (ii) training our language generator adversarially to reduce the errors caused by mis-recovered ZPs. First, we train our model according to the NLG tasks' objectives to warm up the model parameters. That is, we maximize the log-likelihood of language generation over the parallel corpus C as:

L_nll = Σ_{(x, y) ∈ C} log P(y | x; θ).

After that, the adversarial nets participate in the model learning process. Notably, we build two CNN-based feature extractors for the model outputs o and the references r, respectively. The feature extractor is formulated as:

f_t = F_relu(W_c · [e_t; e_{t+1}; ...; e_{t+k}] + b_c),

where we set the window size to 4 (k = 3) and the stride size to 1, and F_relu refers to the Leaky ReLU activation function. In this way, the extracted features for o and r can be written as τ_o and τ_r, respectively. Our adversarial nets consist of two parts: (i) a generative net G(X, θ_g) that builds the mapping from model outputs to feature space to capture the data distribution p_g over the training data X.
(ii) A discriminative net D(τ, θ_d) that outputs a single scalar representing the probability that the extracted feature τ comes from the training data X rather than p_g. On this basis, we let G and D join the training process to play a two-player minimax game. Given the features of the generated samples, we train G to maximize the following objective:

J_G = E_{τ_o} [log D(τ_o)],

where θ_g refers to the parameters of our NLG model and the feature extractor for the model outputs. After that, we simultaneously train D to maximize the probability of assigning correct labels to both the gold-standard and the generated samples. Formally, we train D to minimize the following objective:

J_D = −( E_{τ_r} [log D(τ_r)] + E_{τ_o} [log(1 − D(τ_o))] ),

where θ_d refers to the parameters of the feature extractor for the references and a feedforward-network-based scorer (input: feature size f; hidden layer: f/2; output: 1) with a sigmoid function.
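A toy numeric sketch of these adversarial objectives follows; the CNN feature extractor is reduced to a 1-D convolution over scalars, and all shapes and names are our own assumptions rather than the paper's implementation:

```python
import math

def conv1d_leaky(seq, kernel, slope=0.01):
    """Stride-1 1-D convolution followed by Leaky ReLU; the window size is
    len(kernel) (the paper uses window 4, i.e. k = 3)."""
    k = len(kernel)
    feats = []
    for i in range(len(seq) - k + 1):
        s = sum(seq[i + j] * kernel[j] for j in range(k))
        feats.append(s if s > 0 else slope * s)
    return feats

def g_objective(d_fake):
    """Generator objective: G maximizes log D(tau_o) on generated features."""
    return math.log(d_fake)

def d_loss(d_real, d_fake):
    """Discriminator loss: negative log-likelihood of labelling real
    features as real and generated features as fake."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

feats = conv1d_leaky([1.0, -1.0, 2.0, 0.5, -0.5], [0.5, 0.5, 0.5, 0.5])
```

A confident discriminator (d_real near 1, d_fake near 0) drives d_loss toward zero, while the generator pushes d_fake up to increase its own objective, which is the two-player game described above.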

Experimentation
In this section, we conduct several experiments on document-level NMT, QA, and summarization to evaluate our proposed approach.

Model Settings. For document-level NMT, we apply our proposed approach to the Transformer model implemented in OpenNMT (Klein et al., 2017). For fair comparison, we keep our system settings the same as in previous work (Tan et al., 2019), and the detailed model configurations are shown in the Appendix. Following previous work, we also use the multi-bleu.perl script to compute case-insensitive BLEU scores for evaluation.

Experimental Settings
For the QA and summarization tasks, we apply our proposed approach to the sequence-to-sequence model MTF-S2S (the single-task version) (Xu et al., 2020), and keep the system settings consistent with theirs. Concretely, we use the beam search algorithm during decoding, set the hidden size of encoders and decoders to 200, and set the batch size to 64. We use Adam as our optimizer with the learning rate set to 0.001. Similarly, we use ROUGE (Lin and Hovy, 2003) to estimate the quality of the generated texts for performance evaluation.

Results on Document-level NMT
For document-level NMT, we compare our system with two recent context-aware systems, including Tan et al. (2019). Among them, one baseline models partial document context from previous sentences for better performance, while Tan et al. (2019) focus on global context modeling and have demonstrated the usefulness of global context in document-level NMT. Besides, we also present the results of the Transformer (Vaswani et al., 2017) for reference. In comparison with the three baseline systems, our system (line 4) significantly outperforms Tan et al. (2019) by 3.06 BLEU points and the partial-context baseline by 3.60 BLEU points on average. The superiority is even more significant when compared with the Transformer model. This indicates that coupling global context modeling with ZP recovering can significantly boost document-level translation performance. Moreover, the ablation study (the last two lines) shows that the adversarial method we use is indeed useful, although the performance improvement is not as significant.

Results on QA & Summarization Tasks
For QA and summarization, we directly borrow the systems of Xu et al. (2020) as our baselines, and we perform experiments on their single-task versions for clarity. Similar to Xu et al. (2020), we also report the results of related systems on the two tasks for reference; the results of these systems are directly taken from Xu et al. (2020).
QA. For question answering, in addition to Xu et al. (2020), we also compare with a retrieval-based baseline obtained by fine-tuning BERT-base (Devlin et al., 2019) for question matching on an external dataset, as well as two character-based generation baselines (Sutskever et al., 2014; Luong et al., 2015).
The overall results are shown in Table 2. Lines 4 and 5 show that our model outperforms the baseline of Xu et al. (2020) on all three indicators. The last two lines show that the adversarial learning strategy we use still improves the QA performance to a certain extent. Notably, our resulting method sets a new state-of-the-art performance on all three indicators when compared with previous studies.
Summarization. For the summarization task, we compare with two extractive methods (Mihalcea and Tarau, 2004; Erkan and Radev, 2004) and six abstractive methods (Sutskever et al., 2014; Luong et al., 2015; Lin et al., 2018; Liu and Lapata, 2019, among others). The overall results are presented in Table 3. Firstly, compared with the baseline of Xu et al. (2020), our system significantly outperforms theirs by 16.50% on "R-1", 20.46% on "R-2", and 14.14% on "R-L", which demonstrates the great effectiveness of coupling global context modeling with ZP recovering in text summarization. Secondly, compared with all previous studies, our system obtains superior or competitive performance in most cases, except for BertAbs (Liu and Lapata, 2019), which employs a well-trained BERT model for context-aware word representation. Similar to NMT and QA, the last two lines further demonstrate the usefulness of the adversarial learning strategy we utilize. On the whole, the results above demonstrate that our proposed method is useful. Through this highly adaptive two-stage method, we can effectively alleviate ZP problems in various downstream NLG tasks with only a slight dependence on a small-scale independent corpus annotated with ZP positions. In addition, the results also show that mining the model's potential from a specific perspective (i.e., zero pronoun) is of great significance to document-level NLG. Naturally, we can also extend the proposed approach to other discourse phenomena (e.g., lexical cohesion, ellipsis, etc.) for discourse-aware language generation, which is worthy of in-depth study.

Analysis and Discussion
In this section, we explore the potential value of our proposed approach. For clarity, we analyze document-level NMT as a representative case.

Contribution on Pronoun Translation
To comprehensively estimate the usefulness of our approach in pronoun translation, we employ two additional methods for evaluation, i.e., Accuracy of Pronoun Translation (APT) (Miculicich Werlen and Popescu-Belis, 2017) and Common Reference Context (CRC) (Jwalapuram et al., 2019). On the one hand, we follow previous work (Tan et al., 2019) in using the APT method to evaluate our pronoun generation performance, and we also report the results of the Transformer and Tan et al. (2019) for reference, as shown in Table 4. The results show that our model achieves remarkable performance, significantly outperforming the Transformer by 32.4% and the context-aware model of Tan et al. (2019) by 30.20%. This strongly suggests the effectiveness of our proposed approach in pronoun translation. On the other hand, we further employ the CRC method to evaluate the pronoun translation performance of our model, with the results also shown in Table 4. From the results, we find that our proposed approach still significantly outperforms the Transformer model by 24.24%, as well as the system of Tan et al. (2019), further indicating that the obtained ZP-focused context excels at pronoun generation in document-level NLG tasks.

Table 6: Performance of ZP position detection. "*": the reproduced performance on the data we use.

Different Strategies of Utilizing Global Document Context
As stated before, although Tan et al. (2019) have demonstrated the usefulness of global context, the context information is complicated and it is hard to figure out what type of information is at work. To understand the role of context information, we explore the effects of different ways of using global context in document-level NMT. Concretely, we carry out experiments with three system settings: "Ctx→words" means distributing global context to word units (Tan et al., 2019); "Ctx→pro" means leveraging global context only for ZP recovery and then using the recovered ZP representations to replace the sentence-level hidden states of the placeholders; and "Ctx→pro&words" means leveraging global context for ZP recovery and distributing the ZP-focused context to word units. The overall results are shown in Table 5. The first two lines of the table show that our recovered ZP representations (line 2) are effective, owing to the capability of our approach to extract more concise and effective features from the global context from a specific perspective (i.e., zero pronoun). Moreover, the last two lines show that distributing global context to word units can further improve performance. This indicates that the global context still contains other effective information worthy of further mining.

Performance of ZP Position Detection
Table 7: Comparison between example translation results of different methods. Here, "ZP-P" denotes an automatically predicted ZP position placeholder.

In the literature, Wang et al. (2016) have achieved certain success in ZP recovery in NMT. As stated before, we aim to improve their ZP position detector by considering both the left and right sides of each word as candidate ZP positions. To investigate the effect of our approach, we perform experiments on the cleaned tvsub corpus (Wang et al., 2018a), with the sentences containing no ZPs filtered out. For performance evaluation, we follow Wang et al. (2016) in using the micro-averaged F1-score to measure model performance. The results in Table 6 show that our improved ZP position detector achieves better results than the method of Wang et al. (2018a), which suggests the necessity of taking both sides of each word into consideration for ZP position detection. It is worth mentioning that since this work directly harnesses ZP representations in subsequent tasks to alleviate error propagation, it does not depend on ZP-annotated NLG data; therefore, evaluating ZP label recovery is infeasible.

Case study
Here, we present a translation example from our NMT system in Table 7 for discussion. In the example, "Source" denotes a source sentence with a detected ZP position; "Ref" denotes the reference translation; "Baseline" and "Ours" denote the translation results of the baseline system (Tan et al., 2019) and our approach, respectively. Referring to the "Ref" sentence, although the "Baseline" system leverages global context for better BLEU scores, its pronoun translation is still far from perfect. In contrast, our system accurately translates the pronoun "them", and the resulting sentence is more fluent and more in line with the norms of the target language.

Conclusion
In this paper, we introduced a highly adaptive two-stage method to mitigate the cohesion problem posed by the ZP phenomenon in document-level NLG tasks. To tackle both the error propagation and corpus limitation issues, we first pre-trained a fault-tolerant ZP position detector for automatic ZP position prediction, and then performed document context modeling for both task-supervised ZP recovering and ZP-focused NLG task learning. We also trained our model in an adversarial fashion to alleviate the language generation confusion caused by mis-recovered ZPs. Experiments on three D-NLG tasks show that our approach greatly improves performance, and its performance on pronoun translation is remarkable.