Japanese Zero Anaphora Resolution Can Benefit from Parallel Texts Through Neural Transfer Learning

Parallel texts of Japanese and a non-pro-drop language have the potential of improving the performance of Japanese zero anaphora resolution (ZAR) because pronouns dropped in the former are usually mentioned explicitly in the latter. However, rule-based cross-lingual transfer is hampered by error propagation in an NLP pipeline and the frequent lack of transparency in translation correspondences. In this paper, we propose implicit transfer by injecting machine translation (MT) as an intermediate task between pretraining and ZAR. We employ a pretrained BERT model to initialize the encoder part of the encoder-decoder model for MT, and eject the encoder part for fine-tuning on ZAR. The proposed framework empirically demonstrates that ZAR performance can be improved by transfer learning from MT. In addition, we find that incorporating masked language model training into MT leads to further gains.


Introduction
Figuring out who did what to whom is an essential part of natural language understanding. This is, however, especially challenging for so-called pro-drop languages like Japanese and Chinese because they usually omit pronouns that are inferable from context. The task of identifying the referent of such a dropped element, as illustrated in Figure 1(a), is referred to as zero anaphora resolution (ZAR). Although Japanese ZAR saw a performance boost with the introduction of BERT (Ueda et al., 2020; Konno et al., 2020), there is still a good amount of room for improvement.
A major barrier to improvement is the scarcity of training data. The number of annotated sentences is on the order of tens of thousands or less (Kawahara et al., 2002; Hangyo et al., 2012; Iida et al., 2017), and the considerable linguistic expertise required for annotation makes drastic corpus expansion impractical.
Previous attempts to overcome this limitation exploit orders-of-magnitude larger parallel texts of Japanese and English, a non-pro-drop language (Nakaiwa, 1999; Furukawa et al., 2017). The key idea is that Japanese zero pronouns can be recovered from parallel texts because they are usually mentioned explicitly in English, as in Figure 1(b). If the translation correspondences and the anaphoric relation in English are identified, we can identify the antecedent of the omitted argument in Japanese.
Their rule-based transfer from English to Japanese met with limited success, however. It is prone to error propagation due to its dependence on word alignment, parsing, and English coreference resolution. More importantly, the great linguistic differences between the two languages often lead to parallel sentences without transparent syntactic correspondences (Figure 1(c)).
In this paper, we propose neural transfer learning from machine translation (MT). By generating English translations, a neural MT model should be able to implicitly recover omitted Japanese pronouns, thanks to its expressiveness and large training data. We expect the knowledge gained during MT training to be transferred to ZAR. Given that state-of-the-art ZAR models are based on BERT (Ueda et al., 2020; Konno et al., 2020, 2021), it is a natural choice to explore intermediate task transfer learning (Phang et al., 2018; Wang et al., 2019a; Pruksachatkun et al., 2020; Vu et al., 2020): a pretrained BERT model is first trained on MT, and the resultant model is then fine-tuned on ZAR.

A key challenge to this approach is a mismatch in model architectures. While BERT is an encoder, the dominant paradigm of neural MT is the encoder-decoder. Although both share the Transformer architecture (Vaswani et al., 2017), it is non-trivial to combine the two distinct architectures with the goal of helping the former.

Figure 1: (a) An example of Japanese zero anaphora. The nominative argument of the underlined predicate is omitted. The goal of the task is to detect the omission and to identify its antecedent, "son". (b) The corresponding English text. The omitted argument in Japanese is present as a pronoun in English. (c) A Japanese-English pair (Nabeshima and Brooks, 2020, p. 74) whose correspondences are too obscure for rule-based transfer. Because Japanese generally avoids having inanimate agents with animate patients, the English inanimate-subject sentence corresponds to two animate-subject clauses in Japanese, with two exophoric references to the reader (i.e., you).
We use a pretrained BERT model to initialize the encoder part of the encoder-decoder model for MT. While this technique was previously used by Imamura and Sumita (2019) and Clinchant et al. (2019), they both aimed at improving MT performance. We show that by ejecting the encoder part for use in fine-tuning (Figure 2), we can achieve performance improvements in ZAR. We also demonstrate that further improvements can be achieved by incorporating encoder-side masked language model (MLM) training into the intermediate training on MT.
Related Work

Zero Anaphora Resolution (ZAR)

ZAR has been extensively studied in major East Asian languages, Chinese and Korean as well as Japanese, which not only omit contextually inferable pronouns but also show no verbal agreement for person, number, or gender (Park et al., 2015; Song et al., 2020; Kim et al., 2021). While supervised learning is the standard approach to ZAR (Iida et al., 2016; Ouchi et al., 2017; Shibata and Kurohashi, 2018), training data are so small that additional resources are clearly needed. Early studies worked on case frame construction from a large raw corpus (Sasano et al., 2008; Sasano and Kurohashi, 2011; Yamashiro et al., 2018), pseudo training data generation, and adversarial training (Kurita et al., 2018). These efforts are, however, overshadowed by the surprising effectiveness of BERT's pretraining (Ueda et al., 2020; Konno et al., 2020).
Adopting BERT, recent studies seek gains through multi-task learning (Ueda et al., 2020), data augmentation (Konno et al., 2020), and an intermediate task tailored to ZAR (Konno et al., 2021). The multi-task learning approach of Ueda et al. (2020) covers verbal predicate analysis (which subsumes ZAR), nominal predicate analysis, coreference resolution, and bridging anaphora resolution. Their method is used as a state-of-the-art baseline in our experiments. Konno et al. (2020) perform data augmentation by simply masking some tokens. They found that performance gains were achieved by selecting target tokens by part of speech. Konno et al. (2021) introduce a more elaborate masking strategy as a ZAR-specific intermediate task. They spot multiple occurrences of the same noun phrase, mask one of them, and force the model to identify the pseudo-antecedent.
Our use of parallel texts in ZAR is inspired by Nakaiwa (1999) and Furukawa et al. (2017), who identify a multi-hop link from a Japanese zero pronoun to its Japanese antecedent via English counterparts. Their rule-based methods suffer from accumulated errors and syntactically non-transparent correspondences. In addition, they do not handle inter-sentential anaphora, a non-negligible subtype of anaphora we cover in this paper.
While we exploit MT to improve the performance of ZAR, the exploitation in the reverse direction has also been studied. A line of research addresses Chinese zero pronoun prediction (ZPP) with the primary aim of improving Chinese-English translation (Wang et al., 2016, 2018, 2019b). ZPP differs from ZAR in that it does not identify antecedents. This is understandable given that recovering zero pronouns as overt ones suffices for MT. Although Wang et al. (2019b) showed that ZPP benefits MT, it remained an open question whether MT helps ZAR as well.

MT as an Intermediate Task
Inspired by the great success of the pretraining/fine-tuning paradigm on a broad range of tasks (Peters et al., 2018; Devlin et al., 2019), a line of research inserts an intermediate task between pretraining and fine-tuning on a target task (Phang et al., 2018; Wang et al., 2019a; Pruksachatkun et al., 2020). However, Wang et al. (2019a) found that MT used as an intermediate task led to performance degradation in various target tasks, such as natural language inference and sentiment classification. 2 They argue that the considerable difference between MLM pretraining and MT causes catastrophic forgetting (CF). Pruksachatkun et al. (2020) suggest injecting the MLM objective during intermediate training as a possible way to mitigate CF, which we empirically test in this paper.

Use of BERT in MT
Motivated by BERT's success in a wide range of applications, some studies incorporate BERT into MT models. A straightforward way to do this is to initialize the encoder part of the encoder-decoder with pretrained BERT, but it has had mixed success at best (Clinchant et al., 2019; Zhu et al., 2020). Abandoning this approach, others simply use BERT as a supplier of context-aware embeddings to their own encoder-decoder model. Similarly, Guo et al. (2020) stack adapter layers on top of two frozen BERT models to use them as the encoder and decoder of a non-autoregressive MT model. However, these methods cannot be adopted for our purpose because we want BERT itself to learn from MT.

2 We suspect that the poor performance resulted in part from their excessively simple decoder, a single-layer LSTM.
Imamura and Sumita (2019) manage to maintain the straightforward approach by adopting a two-stage training procedure: in the first stage, only the decoder is updated with the encoder frozen, while in the second stage, the entire model is updated. Although they offer some insights, it remains unclear how best to exploit BERT when MT is an intermediate task rather than the target task.

Proposed Method
We adopt a ZAR model of Ueda et al. (2020), which adds a thin layer on top of BERT during fine-tuning to solve ZAR and related tasks (Section 3.1). Instead of directly moving from MLM pretraining to fine-tuning on ZAR, we inject MT as an intermediate task (Section 3.2). In addition, we introduce the MLM training objective during the intermediate training (Section 3.3).

BERT-based Model for ZAR
ZAR as argument selection As illustrated in Figure 3, the basic idea behind BERT-based ZAR is that given the powerful neural encoder, the joint task of omission detection and antecedent identification can be formalized as argument selection (Shibata and Kurohashi, 2018;Kurita et al., 2018;Ueda et al., 2020). Omission detection concerns whether a given predicate has an argument for a given case (relation). If not, the model must point to the special token [NULL]. Otherwise the model must identify the antecedent of the zero pronoun by pointing either to a token in the given text or to a special token reserved for exophora. Note that by getting the entire document as the input, the model can handle inter-sentential anaphora as well as intra-sentential anaphora. In practice, the input length limitation of BERT forces us to implement a sliding window approach. Also note that in this formulation, ZAR is naturally subsumed into verbal predicate analysis (VPA), which also covers instances where the predicate and the argument have a dependency relation and only the case marker is absent.
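The sliding-window input mentioned above can be sketched as follows. This is an illustrative, token-level simplification (the function name, the 64-token step, and the ignoring of sentence boundaries are our assumptions, not the paper's exact implementation); each window fills its budget with the longest possible preceding context so that inter-sentential antecedents stay reachable.

```python
def sliding_windows(n_tokens, max_len=128, n_new=64):
    """Enumerate (start, end, new_start) windows over a document of
    n_tokens tokens. Each window holds at most max_len tokens; the
    tokens in [new_start, end) are processed for the first time, and
    [start, new_start) is the longest preceding context that fits.
    Sizes are illustrative sketch parameters."""
    windows, covered = [], 0
    while covered < n_tokens:
        # First window takes a full budget of new tokens; later windows
        # advance by n_new and back-fill the rest with context.
        end = min(n_tokens, covered + (max_len if covered == 0 else n_new))
        start = max(0, end - max_len)
        windows.append((start, end, covered))
        covered = end
    return windows
```

Every token is scored exactly once (as part of some window's "new" span) while still seeing up to 128 tokens of history.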
Formally, the probability of the token t_j being the argument of the predicate t_i for case c is computed as

    P(t_j | t_i, c) = softmax_j ( v^T tanh(W_c h_i + U_c h_j) ),    (1)

where h_i is the context-aware embedding of t_i provided by BERT, W_c and U_c are case-specific weight matrices, and v is a weight vector shared among cases. We output the t_j with the highest probability.
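A minimal pure-Python sketch of this case-specific argument scoring follows. The dimensions, values, and function names are toy assumptions (a real implementation would use batched tensors), but the scoring form — an additive combination of predicate and candidate embeddings through case-specific matrices, squashed and projected by a shared vector — matches the quantities named in the text.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a plain list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def score(v, W_c, U_c, h_pred, h_cand):
    """v . tanh(W_c h_pred + U_c h_cand), with matrices as lists of rows."""
    def matvec(M, x):
        return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]
    z = [math.tanh(a + b) for a, b in zip(matvec(W_c, h_pred), matvec(U_c, h_cand))]
    return sum(v_k * z_k for v_k, z_k in zip(v, z))

def select_argument(v, W_c, U_c, h_pred, candidates):
    """Return the index of the candidate with the highest probability.
    In the real model, candidates cover every token in the window plus
    [NULL] and the exophora tokens."""
    probs = softmax([score(v, W_c, U_c, h_pred, h) for h in candidates])
    return max(range(len(probs)), key=probs.__getitem__)
```

For each predicate, this selection is run once per case, with W_c and U_c swapped out while v is shared.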
For each predicate, we repeat this for the nominative (NOM), accusative (ACC), and dative (DAT) cases, and another nominative case for the double nominative construction (NOM2).

Input representations The special token [NA] is also supplied for the reason given in the next paragraph. As is usual for BERT, the special tokens [CLS] and [SEP] are inserted at the beginning and end of the sequence, respectively. If a predicate or argument candidate is split into two or more subwords, the initial subword is used for argument selection.
Multi-task learning Following Ueda et al. (2020), we use a single model to simultaneously perform verbal predicate analysis (VPA), nominal predicate analysis (NPA), bridging anaphora resolution (BAR), and coreference resolution (CR). NPA is a variant of VPA in which verb-like nouns serve as predicates taking arguments. BAR is a special kind of anaphora resolution in which the antecedent fills a semantic gap of the anaphor (e.g., "price" takes something priced as its argument). CR identifies antecedents and anaphors that refer to the same real-world entity, with the special token [NA] reserved for nouns without coreferent mentions. All four tasks can be formalized as argument selection as in Eq. (1). By sharing the BERT encoder, these interrelated tasks influence each other during training. In addition, case-specific weights are shared between VPA and NPA, while separate weights are used for BAR and CR. During training, we compute the losses equally for the four tasks.
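The weight sharing and the equal-loss combination can be made concrete with a small schematic. All names below are illustrative (they do not come from the authors' code); the point is only that VPA and NPA reuse the same case-specific selection weights while BAR and CR get their own, and that the four task losses are summed without weighting.

```python
def head_parameters():
    """Map each task to the argument-selection weights it uses.
    VPA and NPA share one set of case-specific weights; BAR and CR
    each have separate sets. Names are hypothetical placeholders."""
    shared = {"W": "W_case", "U": "U_case"}
    return {
        "VPA": shared,          # verbal predicate analysis (subsumes ZAR)
        "NPA": shared,          # nominal predicate analysis, shares with VPA
        "BAR": {"W": "W_bar", "U": "U_bar"},  # bridging anaphora resolution
        "CR":  {"W": "W_cr",  "U": "U_cr"},   # coreference resolution
    }

def multitask_loss(task_losses):
    # Losses are combined as an unweighted (equal) sum over the four tasks.
    return sum(task_losses[t] for t in ("VPA", "NPA", "BAR", "CR"))
```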

MT as an Intermediate Task
Our main proposal is to use MT as an intermediate task prior to fine-tuning on ZAR. Following Imamura and Sumita (2019) and Clinchant et al. (2019), we use a pretrained BERT to initialize the encoder part of the Transformer-based encoder-decoder model, while the decoder is randomly initialized. After the intermediate training on MT, we extract the encoder and move on to fine-tuning on ZAR and related tasks (Figure 2). Specifically, we test the following two procedures for intermediate training:

One-stage optimization The entire model is updated throughout the training.
Two-stage optimization In the first stage, the encoder is frozen and only the decoder is updated. In the second stage, the entire model is updated (Imamura and Sumita, 2019).
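The two procedures differ only in which modules receive gradient updates at a given epoch. The sketch below captures the schedule; the helper name is hypothetical, and the 15-epoch freeze mirrors the setting used in our experiments (in a PyTorch implementation, the boolean would translate to setting requires_grad on the encoder's parameters).

```python
def trainable_modules(epoch, freeze_epochs=15):
    """Two-stage optimization (after Imamura and Sumita, 2019):
    stage 1 freezes the BERT-initialized encoder while the randomly
    initialized decoder warms up; stage 2 updates the whole model.
    Setting freeze_epochs=0 recovers one-stage optimization."""
    if epoch < freeze_epochs:
        return {"encoder": False, "decoder": True}
    return {"encoder": True, "decoder": True}
```

The rationale is that early gradients from a random decoder would otherwise overwrite the pretrained encoder before the decoder produces useful signal.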

Incorporating MLM into MT
As discussed in Section 2.2, MT as an intermediate task reportedly harms target-task performance, probably because MT forces the model to forget what it has learned from MLM pretraining (catastrophic forgetting). To overcome this problem, we incorporate the MLM training objective into MT, as suggested by Pruksachatkun et al. (2020). Specifically, we mask some input tokens on the encoder side and add the MLM loss to the translation loss.

Experimental Settings

Datasets

ZAR We used two annotated corpora, a web corpus (Hangyo et al., 2012) and a newspaper corpus (Kawahara et al., 2002). Based on their genres, we refer to them as the Web and News, respectively. These corpora have been widely used in previous studies (Shibata and Kurohashi, 2018; Kurita et al., 2018; Ueda et al., 2020). They contain manual annotation for predicate-argument structures (including zero anaphora) as well as word segmentation, part-of-speech tags, dependency relations, and coreference chains. We split the datasets into training, validation, and test sets following the published setting, where the ratio was around 0.75:0.1:0.15. Key statistics are shown in Table 1.

Table 1: Statistics of the ZAR datasets.

                 Web     News
# of sentences   16,038  11,276
# of zeros       30,852  27,062
MT We used a Japanese-English parallel corpus of newspaper articles distributed by the Yomiuri Shimbun. It consisted of about 1.3 million sentence pairs with sentence alignment scores. We discarded pairs with scores of 0. Because the task of interest, ZAR, requires inter-sentential reasoning, consecutive sentences were concatenated into chunks, with the maximum number of tokens equal to that of ZAR. As a result, we obtained around 373,000, 21,000, and 21,000 chunks for the training, validation, and test data, respectively. Japanese sentences were split into words using the morphological analyzer MeCab with the Juman dictionary (Kudo et al., 2004). Both Japanese and English texts underwent subword tokenization: we used Subword-NMT (Sennrich et al., 2016) for Japanese and SentencePiece (Kudo and Richardson, 2018) for English, with separate vocabularies of around 32,000 and 16,000 subwords, respectively.
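The chunking step can be sketched as a greedy packing of consecutive sentence pairs under the token budget. This is an illustrative simplification under stated assumptions (the budget is checked on the Japanese side only, and the alignment-score filtering happens beforehand); the real pipeline may differ in details.

```python
def make_chunks(sent_pairs, max_tokens=128):
    """Greedily concatenate consecutive (ja_tokens, en_tokens) sentence
    pairs into chunks whose Japanese side stays within max_tokens, so
    that the MT input length matches the ZAR input length. A single
    over-long sentence still forms its own chunk."""
    chunks, cur_ja, cur_en = [], [], []
    for ja, en in sent_pairs:
        if cur_ja and len(cur_ja) + len(ja) > max_tokens:
            chunks.append((cur_ja, cur_en))
            cur_ja, cur_en = [], []
        cur_ja = cur_ja + ja
        cur_en = cur_en + en
    if cur_ja:
        chunks.append((cur_ja, cur_en))
    return chunks
```

Keeping consecutive sentences together is what lets the MT model observe the preceding-sentence context that inter-sentential zero pronouns depend on.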

Model Settings
BERT We employed a Japanese BERT model with BPE segmentation distributed by NICT. It had the same architecture as Google's BERT-Base (Devlin et al., 2019): 12 layers, 768 hidden units, and 12 attention heads. It was trained on the full text of Japanese Wikipedia for approximately 1 million steps.
MT We used the Transformer encoder-decoder architecture (Vaswani et al., 2017). The encoder was initialized with BERT, while the decoder was a randomly initialized six-layer Transformer. The numbers of hidden units and attention heads were set to the same as BERT's (i.e., 768 units and 12 heads). We adopted Adam (Kingma and Ba, 2017) as the optimizer and set the total number of epochs to 50. In two-stage optimization, the encoder was frozen during the first 15 epochs, and then the entire model was updated for the remaining 35 epochs. We set the mini-batch size to about 500. The details of the hyper-parameters are given in Appendix A.
ZAR For a fair comparison with Ueda et al. (2020), we used almost the same configuration as theirs. We dealt with all subtypes of ZAR: intra-sentential anaphora, inter-sentential anaphora, and exophora. For exophora, we targeted [author], [reader], and [unspecified person]. We set the maximum sequence length to 128. All documents from the Web met this limitation. In the News corpus, however, many documents exceeded the sequence length of 128. For such documents, we divided the document into multiple parts such that each part had the longest possible preceding context. The evaluation of ZAR was relaxed using gold coreference chains. The model was trained on the mixture of both corpora and evaluated on each corpus. We used almost the same hyper-parameters as Ueda et al. (2020), which are included in Appendix B. We decided to tune the number of training epochs for MT since we found that it slightly affected ZAR performance. We collected checkpoints at intervals of 5 epochs out of 45 epochs, in addition to the one with the lowest validation loss. They were all trained on ZAR, and we chose the one with the highest score on the validation set. We ran the model with 3 seeds on MT and with 3 seeds on ZAR, resulting in 9 seed combinations. We report the mean and the standard deviation of the 9 runs.

Results

Tables 3 and 4 provide more detailed results. For comparison, we performed additional pretraining with ordinary MLM on the Japanese part of the parallel corpus (denoted as +MLM), because the possibility remained that the model simply took advantage of additional data. The subsequent two blocks compare one-stage (unmarked) optimization with two-stage optimization. MT yielded gains in all settings, and the gains were consistent across anaphora categories. Although +MLM somewhat beat the baseline, it was outperformed by most models trained on MT, ruling out the possibility that the gains were solely attributable to extra data. We can conclude that Japanese ZAR benefits from parallel texts through neural transfer learning.
Two-stage optimization showed mixed results: it worked for the Web but not for the News. Worse, its combination with MLM led to performance degradation on both datasets.

Incorporating MLM achieved superior performance, working well in all settings. The gains were larger with one-stage optimization than with two-stage optimization (1.4 vs. 0.3 points on the Web).

Translation of Zero Pronouns
The experimental results demonstrate that MT helps ZAR, but why does it work? Unfortunately, conventional evaluation metrics for MT (e.g., BLEU) reveal little about a model's ability to handle zero anaphora. To address this problem, Shimazu et al. (2020) and Nagata and Morishita (2020) constructed Japanese-English parallel datasets designed to automatically evaluate MT models with regard to the translation of Japanese zero pronouns (ZPT). We used Shimazu et al.'s dataset for its larger size. To facilitate automatic evaluation of ZPT, this dataset pairs a correct English sentence with an incorrect one, where the only difference between the two involves the translation of a Japanese zero pronoun. To choose the correct one, the MT model must sometimes refer to preceding sentences. All we had to do was calculate the ratio of instances for which the model assigned the higher translation score to the correct candidate. As in intermediate training, multiple source sentences were fed to the model to generate multiple target sentences. We prepended as many preceding sentences as possible given the limit of 128 tokens.
In addition, this dataset recorded d, the sentence-level distance between the zero pronoun in question and its antecedent. The number of instances with d = 0 was 218, while the number of remaining instances was 506. We regarded the former as instances of intra-sentential anaphora and the latter as instances of inter-sentential anaphora.
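The contrastive ZPT evaluation thus reduces to a ranking check, sketched below over hypothetical (score_correct, score_incorrect, d) triples; the function name and data layout are our assumptions.

```python
def zpt_accuracy(instances):
    """Contrastive ZPT evaluation: the fraction of instances where the
    MT model assigns the higher translation score to the correct
    candidate, reported overall and split by the sentence-level
    antecedent distance d (intra-sentential: d == 0; inter: d > 0)."""
    def acc(subset):
        if not subset:
            return 0.0
        return sum(1 for correct, wrong, _ in subset if correct > wrong) / len(subset)
    intra = [x for x in instances if x[2] == 0]
    inter = [x for x in instances if x[2] > 0]
    return {"all": acc(instances), "intra": acc(intra), "inter": acc(inter)}
```

Because only ranking matters, the metric is insensitive to the absolute scale of the model's translation scores.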
We chose the model with the best performance (i.e., one-stage optimization with MLM). For each checkpoint we collected during intermediate training, we (1) measured the ZPT accuracy and (2) fine-tuned it to obtain the F1 score for ZAR.
Through the course of intermediate training, we observed an almost steady increase in ZPT accuracies and ZAR F1 scores until around the 30th epoch (see the four figures in Appendix D). Table 5 shows strong positive correlations between the two performance measures, especially a very strong correlation for inter-sentential anaphora. These results are in line with our speculation that the performance gains in ZAR stemmed from the model's increased ability to translate zero pronouns.

Why Is MLM so Effective?
The MLM objective during intermediate training on MT is shown to be very effective, but why? Pruksachatkun et al. (2020) conjecture that it would mitigate catastrophic forgetting (CF), but this is not the sole explanation. In fact, Konno et al. (2020) see token masking as a way to augment data.
To dig into this question, we conducted an ablation study by introducing a model with token masking but without the corresponding loss function (denoted as +MT w/ masking). We assume that this model was largely deprived of the power to mitigate CF, while token masking still acted as a data augmenter. Table 6 shows the results. Not surprisingly, +MT w/ masking was beaten by +MT w/ MLM by large margins. However, it did outperform +MT, and the gain was particularly large for the Web. The fact that the contribution of the loss function was larger than that of token masking indicates that the improvements were mainly attributable to CF mitigation, but the contribution of token masking alone should not be overlooked.
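The contrast between the two ablation conditions can be made concrete with a small sketch: the same encoder-side corruption is applied in both, but MLM labels (and hence the reconstruction loss) are produced only in the +MT w/ MLM condition. The function, the 15% rate, and the simple replace-with-[MASK] scheme are illustrative assumptions, not the exact implementation.

```python
import random

def mask_encoder_input(tokens, mask_prob=0.15, with_mlm_loss=True, seed=0):
    """Encoder-side masking for intermediate MT training.
    with_mlm_loss=True  -> '+MT w/ MLM': masked positions get labels,
                           so an MLM loss is added to the MT loss.
    with_mlm_loss=False -> '+MT w/ masking': tokens are corrupted
                           identically, but no labels are emitted, so
                           masking acts only as data augmentation."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok if with_mlm_loss else None)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

Because the random state is shared, the two conditions see exactly the same corrupted inputs and differ only in whether reconstruction is supervised.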

Case Studies
To gain further insights, we compared ZAR results with English translations automatically generated by the corresponding MT model. Figure 4 gives two examples. It is no great surprise that the translation quality was not satisfactory because we did not fully optimize the model for it.
In the example of Figure 4(a), MT seems to have helped ZAR. The omitted nominative argument of "あり" (is) was correctly translated as "the school", and the model successfully identified its antecedent "学校" (school) while the baseline failed. Figure 4(b) illustrates a limitation of the proposed approach. The omitted nominative argument of the predicate "で" (be) points to "定吉" (Sadakichi, the father of Jutaro). Although the model correctly translated the zero pronoun as "He", it failed in ZAR. This is probably because not only "定吉" (Sadakichi) but also "龍馬" (Ryoma) and "重太郎" (Jutaro) can be referred to as "He". When disambiguation is not required to generate an overt pronoun, MT is not very helpful.

Note on Other Pretrained Models
Due to space limitations, we have limited our focus to BERT, but for the sake of future practitioners, we would like to briefly note that we extensively tested BART and its variants before switching to BERT. Unlike BERT, BART is an encoder-decoder model pretrained on a monolingual corpus (original BART) or a non-parallel multilingual corpus (mBART). Because MT requires the encoder-decoder architecture, maintaining the model architecture between pretraining and intermediate training looked promising to us.
We specifically tested (1) the officially distributed mBART model, (2) a BART model we pretrained on Japanese Wikipedia, and (3) an mBART model we pretrained on Japanese and English texts. During fine-tuning, we added the ZAR argument selection layer on top of either the encoder or the decoder.
Unfortunately, gains from MT intermediate training were marginal for these models. A more serious problem was that they came close to, but rarely outperformed, the strong BERT baseline. We gave up identifying the cause of the poorer performance because it was extremely hard to establish comparable experimental conditions across large pretrained models.

Conclusion
In this paper, we proposed to exploit parallel texts for Japanese zero anaphora resolution (ZAR) by inserting machine translation (MT) as an intermediate task between masked language model (MLM) pretraining and fine-tuning on ZAR. Although previous studies reported negative results on the use of MT as an intermediate task, we demonstrated that it did work for Japanese ZAR. Our analysis suggests that the intermediate training on MT simultaneously improved the model's ability to translate Japanese zero pronouns and the ZAR performance.
We bridged the gap between BERT-based ZAR and the encoder-decoder architecture for MT by initializing the encoder part of the MT model with a pretrained BERT. Previous studies focusing on MT reported mixed results on this approach, but again, we demonstrated its considerable positive impact on ZAR. We found that incorporating the MLM objective into the intermediate training was particularly effective. Our experimental results were consistent with the speculation that MLM mitigated catastrophic forgetting during intermediate training.
With neural transfer learning, we successfully revived the old idea that Japanese ZAR can benefit from parallel texts (Nakaiwa, 1999). Thanks to the astonishing flexibility of neural networks, we expect that ZAR can be connected to other tasks through transfer learning as well.

Figure 4: Two examples of ZAR and MT. Green, blue, and orange dotted lines represent the output of the baseline model, that of ours, and the gold standard, respectively. English sentences are generated by the corresponding MT (encoder-decoder) model. (a) An example in which MT apparently helped ZAR. The nominative zero pronoun of "あり" (is) was correctly translated as "the school". The model also succeeded in identifying its antecedent "学校" (school). (b) An example in which MT was not helpful. The model successfully translated the nominative zero pronoun of the underlined predicate, "で" (be), as "He". It misidentified its antecedent, however.

B Hyper-parameters

Although we followed Ueda et al. (2020) with respect to hyper-parameter settings, there was one exception. Verbal predicate analysis is conventionally divided into three types: overt, covert, and zero. While Ueda et al. (2020) excluded the easiest overt type from training, we targeted all three types because we found slight performance improvements. The overt type covers situations where the predicate and the argument have a dependency relation and their relation is marked explicitly with a case marker.

11 https://github.com/huggingface/transformers/blob/v2.10.0/src/transformers/optimization.py#L47

C Results on Validation Sets
Tables 9 and 10 show the performance on the validation sets.

D Relationship between Zero Anaphora Resolution and Zero Pronoun Translation

Figure 5 shows the relationship between zero anaphora resolution (ZAR) and zero pronoun translation (ZPT) in the course of intermediate training on MT. We observed an almost steady increase in ZPT accuracies and ZAR F1 scores until around the 30th epoch.

Figure 5: Relationships between ZAR and ZPT. The panels include (b) the relationship for intra-sentential anaphora on the News test set and (d) the relationship for inter-sentential anaphora on the News test set.