Pseudo Zero Pronoun Resolution Improves Zero Anaphora Resolution

Masked language models (MLMs) have brought drastic performance improvements to zero anaphora resolution (ZAR). To get further gains from this approach, we make two proposals in this study. The first is a new pretraining task that trains MLMs on anaphoric relations with explicit supervision, and the second is a new finetuning method that remedies a notorious issue, the pretrain-finetune discrepancy. Our experiments on Japanese ZAR demonstrate that the two proposals boost the state-of-the-art performance, and our detailed analysis provides new insights into the remaining challenges.


Introduction
In pronoun-dropping languages such as Japanese and Chinese, the semantic arguments of predicates can be omitted from sentences. As shown in Figure 1, the semantic subject of the predicate used is omitted and represented by φ, which is called a zero pronoun; this pronoun refers to the criminal in the first sentence. The task of recognizing the antecedents of such zero pronouns is called zero anaphora resolution (ZAR). This study focuses on Japanese ZAR. ZAR is a challenging task because it requires reasoning with commonsense knowledge about the semantic associations between zero pronouns and the local contexts of their preceding antecedents. As shown in Figure 1, to identify the omitted semantic subject of used, the model needs to know the semantic relationship between the criminal's weapon and a hammer, namely, that a hammer is likely to be used as a weapon for murder and thus was used by the criminal. We hereinafter refer to such knowledge as anaphoric relational knowledge.
A conventional approach to acquiring anaphoric relational knowledge is to collect predicate-argument pairs from large-scale raw corpora and then use them as features (Sasano et al., 2008; Sasano and Kurohashi, 2011; Yamashiro et al., 2018) or as selectional preference probabilities (Shibata and Kurohashi, 2018) in machine learning models. A modern approach is to use masked language models (MLMs) (Devlin et al., 2019), which are effective in implicitly capturing anaphoric relational knowledge. In fact, recent studies used pretrained MLMs and achieved drastic performance improvements in tasks that require anaphoric relational knowledge, including Japanese ZAR (Joshi et al., 2019; Aloraini and Poesio, 2020; Song et al., 2020; Konno et al., 2020).

Figure 1: Example of zero anaphora: "The criminal's weapon was found in the victim's room. It seems that φ (criminal-GEN) used a hammer." The zero pronoun φ refers to the criminal.

*Work done while at Tohoku University.
To get more out of MLMs, in this paper, we propose a new training framework that pretrains and finetunes MLMs specialized for ZAR. The idea is two-fold.
First, we design a new pretraining task that trains MLMs with explicit supervision on anaphoric relations. Many current pretraining tasks adopt a form of the Cloze task, where an MLM is trained to predict the original token filling the [MASK] token. Although this task provides the MLM with no supervision on anaphoric relations, the MLM implicitly learns about them. In contrast, our new task, called pseudo zero pronoun resolution (PZERO), provides supervision on anaphoric relations. PZERO assumes that when the same noun phrase (NP) appears multiple times in a text, the occurrences are coreferent. Based on this assumption, we mask one of such multiply occurring NPs as a pseudo zero pronoun and consider the other NPs as its pseudo antecedents. 1 As shown in the example in Figure 2, the NP teachers appears twice. The second is masked as a pseudo zero pronoun, and the first is regarded as its pseudo antecedent. Then, given the masked zero pronoun, an MLM is trained to select its (pseudo) antecedent from the candidate tokens in the context. The explicit supervision on such pseudo anaphoric relations allows MLMs to learn anaphoric relational knowledge more effectively.

Figure 2: Example of our new pretraining task, PZERO, on the sentence "The university has surveyed teachers' means of transport and found that most teachers use the train." The second teachers is regarded as a pseudo zero pronoun and is masked, and the first teachers is its pseudo antecedent and should be selected to fill the mask.

Second, we address the issue called the pretrain-finetune discrepancy (Yang et al., 2019). Generally, some part of an MLM is changed for finetuning on a target task, e.g., discarding the pretrained parameters at the last layer or adding randomly initialized new parameters. Such changes in the architecture are known to be obstacles to the adaptation of pretrained MLMs to target tasks. To solve this issue, we design a new ZAR model that carries over all the pretrained parameters of an MLM to the ZAR task with minimal modification. This realizes a smoother adaptation of the anaphoric relational knowledge acquired during pretraining to ZAR.
Through experiments on Japanese ZAR, we verify the effectiveness of PZERO and the combination of PZERO and our new ZAR model. Our analysis also offers insights into the remaining challenges for Japanese ZAR. To sum up, our main contributions are as follows:
• We propose a new pretraining task, PZERO, that provides MLMs with explicit supervision on anaphoric relational knowledge;
• We design a new ZAR model 2 that makes full use of pretrained MLMs with minimal architectural modifications;
• Our empirical results show that both proposed methods improve ZAR performance and achieve state-of-the-art F1 scores;
• Our analysis reveals that arguments far from their predicates and the arguments of predicates in the passive voice are hard to predict.

1 In addition to antecedents, we deal with postcedents. We use the term "antecedents" to refer to both concepts for brevity.
2 Our code is publicly available: https://github.com/Ryuto10/pzero-improves-zar

Japanese Zero Anaphora Resolution
Japanese ZAR is often treated as a part of predicate-argument structure analysis, the task of identifying the semantic arguments of each predicate in a text. In the NAIST Text Corpus (NTC) (Iida et al., 2017), a standard benchmark dataset that we used in our experiments, each predicate is annotated with the arguments filling one of the three most common argument roles: the nominative (NOM), accusative (ACC), or dative (DAT) role. If an argument is a syntactic dependent of its predicate, we say that it is a syntactically dependent argument (DEP); such arguments are relatively easy to identify. If an argument of a predicate is omitted, in contrast, we say that the argument position is filled by a zero pronoun. This study focuses on recognizing such zero pronouns and identifying their antecedents.
The ZAR task is classified into the following three categories according to the position of the argument of a given predicate (i.e., the antecedent of a given zero pronoun). Intra-Sentential (intra): the argument is within the same sentence as the predicate. Inter-Sentential (inter): the argument is in a sentence preceding the predicate. Exophoric: the argument (entity) exists outside the text; such arguments are categorized into one of three types: author, reader, and general. The identification of inter-sentential and exophoric arguments is especially difficult (Shibata and Kurohashi, 2018). For inter-sentential arguments, a model has to search the whole document; for exophoric arguments, a model has to deal with entities outside the document. Because of this difficulty, many previous studies have exclusively focused on the intra-sentential task. In this paper, we treat not only the intra-sentential task but also the inter-sentential and exophoric tasks, under the same task formulation as in previous studies.

Motivation and Task Formulation
The proposed PZERO is a pretraining task for acquiring the anaphoric relational knowledge necessary for solving ZAR. PZERO is called pseudo because it assumes that all NPs with the same surface form have anaphoric relationships. This assumption lets us build a large-scale dataset from raw corpora. Although the assumption may seem too strong, our empirical evaluation confirms that the pretraining task is effective (Section 6). The task is defined as follows. Let X = (x_1, ..., x_T) be an input token sequence of length T, where one of the tokens is [MASK]. Here, x_t ∈ {0, 1}^|V| is a one-hot vector and V is the vocabulary set. The task is to select the token(s) corresponding to the original NP of [MASK] from the input tokens. All the NPs with the same surface form as the masked NP are answers to this task.
The most naive approach to masking an NP is to replace all the tokens in the NP with the same number of [MASK] tokens. However, this approach is not appropriate for acquiring anaphoric relational knowledge, because the model can simply use a superficial clue, namely the number of [MASK] tokens, to predict the original NP. Instead, we replace all the tokens in the NP with a single [MASK] token and formulate the task objective as predicting the last token of the original NP. This formulation is consistent with that of Japanese ZAR: when an argument consists of multiple tokens, the very last token is annotated as the actual argument.

Preparing Pseudo Data
To create training instances for PZERO, we first extract n consecutive sentences from raw text and split them into a subword sequence. We then insert [SEP] tokens as sentence separators (Devlin et al., 2019). Subsequently, we prune tokens from the beginning of the sequence and then prepend [CLS] at the beginning. As a result, the sequence consists of at most T max subword tokens, which is the maximum input size of our model, as shown in Section 3.3. Then, for each NP in the last sentence, we search for corresponding NPs with the same surface form in this sequence. Upon finding such NPs, we replace the selected NP in the last sentence with a single mask token and collect this sequence as a training instance.
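As a concrete illustration, the instance-creation procedure above can be sketched as follows. This is a minimal sketch, not the authors' released implementation: the function name, the pre-tokenized sentences, and the pre-identified NP spans are assumptions made for illustration, and pruning to T_max is omitted.

```python
def make_pzero_instance(sentences, np_spans, cls="[CLS]", sep="[SEP]", mask="[MASK]"):
    """Create one PZero instance (a sketch).
    sentences: list of token lists (oldest sentence first).
    np_spans: list of (sent_idx, start, end) half-open NP spans."""
    # Flatten with [CLS] at the front and [SEP] after each sentence.
    tokens, offsets = [cls], []
    for sent in sentences:
        offsets.append(len(tokens))
        tokens.extend(sent)
        tokens.append(sep)

    def flat(s, a, b):  # flat-sequence span of an NP
        return offsets[s] + a, offsets[s] + b

    def surface(s, a, b):
        return tuple(sentences[s][a:b])

    last = len(sentences) - 1
    for (s, a, b) in np_spans:
        if s != last:           # only NPs in the last sentence are masked
            continue
        start, end = flat(s, a, b)
        # Pseudo antecedents: earlier NPs with an identical surface form.
        # The answer position is the *last* token of each match, mirroring
        # the ZAR annotation convention described above.
        answers = [flat(s2, a2, b2)[1] - 1
                   for (s2, a2, b2) in np_spans
                   if flat(s2, a2, b2)[1] <= start
                   and surface(s2, a2, b2) == surface(s, a, b)]
        if answers:
            # A single [MASK] replaces the whole NP, hiding its length.
            return tokens[:start] + [mask] + tokens[end:], answers
    return None
```

Because matches strictly precede the masked span, their flat positions are unchanged by the replacement.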

Pretraining Model
Our model for PZERO closely resembles a transformer-based MLM (Devlin et al., 2019). Given an input sequence X, each token x_t ∈ {0, 1}^|V| is mapped to an embedding vector e_t ∈ R^D of size D as follows:

e_t = e_t^token + e_t^position. (1)

Here, the embedding vector e_t^token ∈ R^D is obtained by computing e_t^token = E^token x_t, where E^token ∈ R^{D×|V|} is a token embedding matrix. Similarly, the embedding vector e_t^position ∈ R^D is obtained from the position embedding matrix E^position ∈ R^{D×T_max} and a one-hot vector for position t. T_max denotes the predefined maximum input length of the model.
Then, the transformer layers encode the input embeddings e_1, ..., e_T into the final hidden states H = (h_1, ..., h_T). Given the hidden state h_t ∈ R^D of the t-th token, we calculate a score s_t ∈ R, which represents the likelihood that the token is a correct answer, by taking the dot product between transformed versions of the candidate token's hidden state h_t and the mask token's hidden state h_mask:

s_t = (W_1 h_t + b_1) · (W_2 h_mask + b_2), (2)

where W_1, W_2 ∈ R^{D×D} are parameter matrices and b_1, b_2 ∈ R^D are bias terms. We train the model to maximize the scores of the correct tokens. Specifically, we minimize the Kullback-Leibler (KL) divergence L = KL(y || softmax(s)), where s = (s_1, ..., s_T) and y ∈ R^T is a probability distribution over the positions of the correct tokens: the values at the positions of the n correct tokens are set to 1/n, and all other values are 0.

Figure 3: Input layer of AS and AS-PZERO. The differences are that (1) a query chunk exists only for AS-PZERO, and (2) the position of the target predicate is provided via different embedding types: E^predicate and E^position.
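The scoring and training objective described above can be sketched with NumPy. The bilinear-dot-product form of Equation 2 is inferred from the surrounding text (a dot product between linearly transformed hidden states, with W_1, W_2 ∈ R^{D×D} and b_1, b_2 ∈ R^D); the function names are illustrative, not from the released code.

```python
import numpy as np

def pzero_scores(H, mask_pos, W1, b1, W2, b2):
    """Candidate scores s_t = (W1 h_t + b1) . (W2 h_mask + b2).
    H: (T, D) final hidden states; W1, W2: (D, D); b1, b2: (D,)."""
    q = W2 @ H[mask_pos] + b2        # transformed mask representation
    return (H @ W1.T + b1) @ q       # dot product with every candidate token

def kl_loss(scores, answer_positions):
    """KL(y || softmax(s)) with y uniform (1/n) over the n correct positions."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    y = np.zeros_like(scores)
    y[answer_positions] = 1.0 / len(answer_positions)
    nz = y > 0                       # 0 * log 0 contributes nothing
    return float(np.sum(y[nz] * (np.log(y[nz]) - np.log(p[nz]))))
```

With a single correct position, the loss reduces to the cross-entropy of that position, so driving its score up directly lowers the loss.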

Argument Selection with Label Probability: AS
The argument selection model, hereinafter AS, is inspired by the model of Kurita et al. (2018). Following the now-standard practice of the pretrain-finetune paradigm (Devlin et al., 2019), we add a classification layer on top of the pretrained model. The model takes an input sequence X, created in a manner similar to that described in Section 3.1: X consists of multiple sentences, is pruned to contain at most T_max tokens, contains the target predicate in its last sentence, and includes the [CLS] and [SEP] tokens. The model also takes two natural numbers p_start and p_end as inputs, where 1 ≤ p_start ≤ p_end ≤ T; these represent the position of the target predicate.
The model selects a filler token for each argument slot according to a label assignment probability over X: argmax_t P(t | X, l, p_start, p_end), where l ∈ {NOM, ACC, DAT}. We regard [CLS] (i.e., x_1) as a dummy token representing the case in which the argument filler does not exist in the input sequence; the model selects the dummy token in such cases.
The operation of the input layer of the model is shown on the left-hand side of Figure 3. First, each token x_t ∈ {0, 1}^|V| in a given input sequence X is mapped to an embedding vector e_t ∈ R^D using the pretrained embedding matrices E^token and E^position and a new embedding matrix E^predicate ∈ R^{D×2}, as follows:

e_t = e_t^token + e_t^position + e_t^predicate, (3)

where e_t^token and e_t^position are the same as in Equation 1, and e_t^predicate is an embedding vector computed from E^predicate, p_start, and p_end. This vector represents whether the token at position t is part of the predicate (He et al., 2017).
Second, we apply the pretrained transformer to encode each embedding e_t into the final hidden state h_t ∈ R^D. The probability distribution o_l = (o_{l,1}, ..., o_{l,T}) ∈ R^T of assigning the label l over the input tokens is then obtained by a softmax layer:

o_l = softmax(u_l), where u_{l,t} = w_l · h_t + b_l, (4)

and w_l ∈ R^D and b_l ∈ R. Finally, from the probability distribution o_l, the model selects the token with the maximum probability as the argument of the target predicate. When the model selects the dummy token as an argument, we further classify the argument into one of four categories: z ∈ {author, reader, general, none}. Here, none indicates that there is no slot filler for this instance. The other three categories, author, reader, and general, indicate that a filler entity exists but does not appear in the context (exophoric). For this purpose, we compute a probability distribution o_l^exo = (o_{l,author}^exo, o_{l,reader}^exo, o_{l,general}^exo, o_{l,none}^exo) ∈ R^4 over the four categories by applying a softmax layer to the hidden state of the dummy token h_1:

o_l^exo = softmax(u_l^exo), where u_{l,z}^exo = w_{l,z} · h_1 + b_{l,z}, (5)

and w_{l,z} ∈ R^D and b_{l,z} ∈ R. The model then selects the category with the maximum probability.
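The selection head of AS can be sketched as follows, assuming plain NumPy arrays; the function names and the simplified shapes are illustrative assumptions, not the released code.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def as_select(H, w_l, b_l, W_exo, b_exo,
              exo_labels=("author", "reader", "general", "none")):
    """AS-style argument selection for one case label l (a sketch).
    H: (T, D) hidden states; position 0 is the [CLS] dummy token.
    w_l: (D,), b_l: scalar -- per-label token-scoring layer (Eq. 4).
    W_exo: (4, D), b_exo: (4,) -- exophora classifier over h_1 (Eq. 5)."""
    o_l = softmax(H @ w_l + b_l)           # label probability over tokens
    t = int(np.argmax(o_l))
    if t != 0:
        return ("token", t)                # in-context argument filler
    # Dummy token chosen: classify into the four exophoric/none categories.
    o_exo = softmax(W_exo @ H[0] + b_exo)
    return ("exophoric", exo_labels[int(np.argmax(o_exo))])
```

The two softmax layers here (w_l, b_l, W_exo, b_exo) are exactly the randomly initialized parameters that AS-PZERO later dispenses with.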
In the training step, we assign a gold label to the last token of an argument mention. If the context contains multiple correct answers related by coreference, we assign gold labels to all of these mentions. We prepare a probability distribution y ∈ R^T of gold labels over the input tokens in a manner similar to that in Section 3.3. The model is then trained to assign high probabilities to the gold arguments.

Argument Selection as Pseudo Zero Pronoun Resolution: AS-PZERO

One potential disadvantage of the AS model is that it may suffer from the pretrain-finetune discrepancy.
That is, AS does not use pretrained parameters such as W_1, W_2, b_1, and b_2 in Equation 2 but is instead finetuned with randomly initialized new parameters such as w_l and b_l in Equation 4. To make efficient use of the anaphoric relational knowledge acquired during pretraining, we resolve this discrepancy. Inspired by studies addressing such discrepancies (Gururangan et al., 2020; Yang et al., 2019), we propose a novel finetuning model: argument selection as pseudo zero pronoun resolution (AS-PZERO).
The underlying idea of AS-PZERO is to solve ZAR as PZERO. We use the network pretrained on PZERO as it is; thus, the parameters w_l and b_l are no longer required. To do this, we modify the input sequence X for ZAR and reformulate the ZAR task as PZERO. Specifically, we prepare a short sequence, called a query chunk, and append it to the end of the input sequence X. The query chunk represents a target predicate-argument slot whose filler is a single [MASK] token, so ZAR can be solved by selecting the antecedent of the [MASK] token.
Let X denote the modified input of AS-PZERO. The input layer of the model is shown on the right-hand side of Figure 3. The query chunk consists of a [MASK] token, a token representing the target argument label (i.e., NOM, ACC, or DAT), and the target predicate. Hence, with the number of tokens in the target predicate denoted as T_predicate = p_end − p_start + 1, the length of X is T + 2 + T_predicate, and the modified input sequence is represented as X = (x_1, ..., x_{T+2+T_predicate}). Given the modified input sequence X and the start and end positions p_start, p_end ∈ N of the target predicate, each input token x_t ∈ {0, 1}^|V| is mapped to an embedding e_t ∈ R^D as follows:

e_t = e_t^token + e_t^position + e_t^addposi,

where e_t^addposi is an additional position embedding that informs the model of the position of the target predicate. This information is intended to distinguish the target predicate from other predicates appearing with an identical surface form in the input sequence. Specifically, for the tokens of the target predicate in the query chunk, e_t^addposi equals the position embedding of the target predicate in the original sequence, i.e., e_t^addposi = e_{t′}^position with t′ = t − (T + 3) + p_start; otherwise, e_t^addposi is zero. For example, as shown in Figure 3, the position embeddings of the target predicate (con and ##firm) are added to those in the query chunk. Thus, we avoid using the extra embedding matrix E^predicate in Equation 3. We encode the embeddings with the transformer layers and then use Equation 2 for the remaining computation of AS-PZERO, filling the [MASK] token with the argument of the target predicate. If the score of the dummy token (x_1) is the highest, the model computes exophoric scores as described in Section 4.1 using Equation 5.
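The construction of the AS-PZERO input can be sketched as follows. This is an illustrative sketch of the query chunk and the additional position ids, not the authors' implementation; the function name and the 1-based indexing convention follow the description above.

```python
def build_aspzero_input(tokens, p_start, p_end, label, mask="[MASK]"):
    """Append the AS-PZero query chunk ([MASK], case label, target predicate)
    and compute the extra position ids used for e^addposi (a sketch).
    tokens: original sequence X of length T; p_start, p_end: 1-based
    positions of the target predicate; label: "NOM", "ACC", or "DAT"."""
    T = len(tokens)
    predicate = tokens[p_start - 1:p_end]       # target predicate tokens
    x = tokens + [mask, label] + predicate      # length T + 2 + T_predicate
    # addposi is zero everywhere except on the copied predicate, which
    # reuses the position of the original occurrence: t' = t - (T+3) + p_start.
    addposi = [0] * len(x)
    for t in range(T + 3, len(x) + 1):          # 1-based positions of the copy
        addposi[t - 1] = t - (T + 3) + p_start
    return x, addposi
```

The copied predicate thus shares position ids with its original occurrence, which is what lets the model tie the query chunk to the right predicate without E^predicate.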

Experimental Settings
PZERO Dataset The Japanese Wikipedia corpus (Wikipedia) is the source of the training data for PZERO. 5 All the NPs in the corpus are PZERO targets. To detect NPs, we parsed Wikipedia using the Japanese dependency parser CaboCha (Kudo and Matsumoto, 2002) and applied a heuristic rule based on part-of-speech tags. We used n = 4 consecutive sentences to construct each input sequence X. From 17.4M sentences in Wikipedia, we obtained 17.3M training instances for PZERO.

ZAR Dataset For the ZAR task, we used the NAIST Text Corpus 1.5 (NTC) (Iida et al., 2010, 2017), a standard benchmark dataset for this task (Ouchi et al., 2017; Matsubayashi and Inui, 2018; Omori and Komachi, 2019; Konno et al., 2020). We used the training, development, and test splits proposed by Taira et al. (2008). The numbers of intra-sentential, inter-sentential, and exophoric training/test instances are 18068/6159, 11175/4081, and 13676/3826, respectively. The NTC details are shown in Appendix A. The evaluation script corresponds to that of Matsubayashi and Inui (2018).

Model Our implementation is based on the Transformers library (Wolf et al., 2020). We used the pretrained parameters of the bert-base-japanese model as the initial parameters of our pretraining models. We trained our model using the Adam optimizer (Kingma and Ba, 2015) with warm-up steps. As a loss function, we used cross-entropy for the

Results and Analysis
This experiment has two distinct goals: to investigate the effectiveness of (1) pretraining on PZERO and (2) finetuning with AS-PZERO. To this end, we first compare our AS and AS-PZERO models with previous studies to confirm that our models are strong in the conventional experimental setting, i.e., the intra-sentential setting (Section 6.1). We then investigate (1) and (2) in the inter-sentential setting (Section 6.2).

Intra-sentential Experiment
In this setting, the input sequence consists of a single sentence, and only the intra-zero and DEP arguments are targets of the evaluation. As mentioned in Section 2, most of the previous studies on Japanese ZAR use this setting (Matsubayashi and Inui, 2018;Omori and Komachi, 2019;Konno et al., 2020). Thus, we can strictly compare our results with those of other studies in this setting.
We finetuned AS and AS-PZERO from a pretrained MLM. The results in Table 1 show that both models already outperform the previous state-of-the-art models on intra-zero and DEP (Konno et al., 2020) by large margins. This improvement stems from the difference in how the pretrained MLM is used: given a pretrained MLM, we finetune all of its parameters, whereas Konno et al. (2020) used it only to provide input features. In addition, our MLM was pretrained more extensively than theirs.

Inter-sentential Experiment
In this setting, the input sequence consists of multiple sentences: a sentence containing a target predicate and preceding sentences in the document. The intra-sentential, inter-sentential, exophoric, and DEP arguments are the evaluation targets.
We investigate the effectiveness of the proposed PZERO and AS-PZERO. For the experiment, we initialized the parameters of the transformer-based model with the pretrained MLM (pretrain 1) and further pretrained the model on Cloze and PZERO with the same number of updates. This resulted in having two pretrained models (pretrain 2 & 3). Then, we created models of all the possible combinations from {pretrain 1, 2, 3} and {AS, AS-PZERO}, resulting in the six models shown in Table 2.
(I) Do inter-sentential contexts help intra-sentential argument identification? We first investigate the impact of inter-sentential context on intra-zero and DEP performance by comparing models (f) and (g) in Table 2 with models (d) and (e) in Table 1. Note that the model architectures of (f) and (g) are identical to those of (d) and (e), respectively, and the evaluation instances of the intra-zero and DEP categories are the same for all four models. The differences are that models (f) and (g) receive broader contexts (inter-sentential contexts), i.e., multiple preceding sentences as inputs, and extra training signals from the inter-zero and exophoric instances. A comparison of these four models shows that (f) and (g) outperform (d) and (e) on intra-zero and DEP. This result indicates that inter-sentential contexts provide important clues even for identifying intra-sentential argument relations, and it is consistent with Guan et al. (2019) and Shibata and Kurohashi (2018), who discussed methods for utilizing inter-sentential contexts as clues for resolving semantic relations in target sentences.
(II) Does pretraining on PZero improve the performance of AS? As shown in Table 2, the comparison between the models pretrained on PZERO (j) and on Cloze (h) shows that PZERO outperforms Cloze, especially on inter-zero arguments (44.98 → 46.37). As discussed in Section 2, inter-zero is challenging because there are multiple answer candidates across sentences. The improvement on inter-zero implies that the model effectively learns anaphoric relational knowledge through pretraining on PZERO.
(III) Does pretraining on PZero improve the performance of AS-PZero? The performance comparison between models (j) and (k) demonstrates the effectiveness of combining PZERO and AS-PZERO. Model (k) achieved the best results in all categories except DEP. This indicates that AS-PZERO successfully addresses the pretrain-finetune discrepancy and effectively uses the anaphoric relational knowledge learned from PZERO. Table 3 shows the precision and recall of models (h)-(k) for the intra-zero and inter-zero arguments. Model (k) achieved the best recall in both categories, indicating that the proposed PZERO contributes mainly to an improvement in recall.

Analysis
We analyze the source of the improvement in recall observed in Table 3. Table 4 examines the intra/inter-sentential arguments from three aspects, I-III, comparing the detailed results of the baseline model (h) and our model (k).

(I) Number of gold antecedents in the input This number determines the difficulty of anaphora resolution, because the saliency of an entity is an important clue; that is, ZAR is difficult when the argument appears only once in the input. Our model improved the performance on such difficult instances by a large margin (65.87 → 69.12 in intra-zero and 35.96 → 39.57 in inter-zero).
(II) Position of the argument relative to the target predicate The distance between a predicate and its argument also determines the difficulty of ZAR. According to rows (3), (4), and (5), the performance on inter-sentential arguments decreased as the predicate became farther from the last surface-form appearance of its argument. Interestingly, the performance of the two models was comparable in (5), the case in which the argument is more than two sentences away from the target predicate; this indicates that the proposed method is not effective for these instances. An error analysis of these instances revealed that even though the argument did not appear explicitly, it was semantically present throughout the context as an omitted argument of multiple predicates, all pointing to the same entity. This suggests that combining our proposed model with a model that propagates ZAR results through relevant contexts (Shibata and Kurohashi, 2018) could further improve ZAR performance.

(III) Voice of the target predicate Under case alternations such as the passive voice, semantic subjects and objects appear in other syntactic positions. Table 4 shows that both models perform worse in (8) and (9) than in (7). Moreover, case alternation behaves differently for every predicate, so the model has to learn each behavior from the training data and the raw corpus; acquiring such information is not within the scope of PZERO.

Discussion on Pseudo Data Generation
In this study, we generated pseudo data for PZERO by exploiting the strong assumption that all NPs with the same surface form have anaphoric relationships (Section 3.1). The advantage of this method is its high scalability: we can obtain a large number of pseudo instances from raw corpora. Our empirical evaluation showed that the assumption is effective; however, more sophisticated methods could be considered. Our future work includes analyzing the noise in the pseudo data, i.e., NPs with the same surface form but no anaphoric relationship, and its effect on model performance.

Related Work
Anaphoric Relational Knowledge Our proposed pretraining task for acquiring anaphoric relational knowledge is related to script knowledge acquisition (Chambers and Jurafsky, 2009). Script knowledge models chains of typical events (predicates and their arguments). Between events, some arguments are shared and represented as variables, such as purchase X → acquire X, which can be regarded as a type of anaphoric relational knowledge. While script knowledge deals only with shared arguments as anaphoric (coreferring) phenomena, anaphoric relational knowledge is not limited to them. In the sentence in Figure 1, the word criminal is not an argument of the predicate and is thus ignored in script knowledge, whereas it is within the scope of this work. Thus, this work deals with broader anaphoric phenomena.
Zero Anaphora Resolution (ZAR) ZAR has been studied in multiple languages, such as Chinese (Yin et al., 2018), Japanese (Iida et al., 2016), Korean (Han, 2006), Italian (Iida and Poesio, 2011), and Spanish (Palomar et al., 2001). A major challenge for ZAR is the lack of labeled data, and the traditional way to overcome this is to use large-scale raw corpora. Several studies have employed such corpora as a source of knowledge for ZAR, e.g., for case-frame construction (Sasano et al., 2008; Sasano and Kurohashi, 2011; Yamashiro et al., 2018) and selectional preference probabilities (Shibata et al., 2016). Furthermore, semi-supervised learning approaches, such as pseudo data generation (Liu et al., 2017) and adversarial training (Kurita et al., 2018), have been proposed. However, the use of pretrained MLMs has been the most successful approach (Konno et al., 2020), and we sought to improve the pretraining task to better acquire anaphoric relational knowledge.

Pseudo Zero Pronoun Resolution (PZERO) Several studies have created training instances in a way similar to PZERO. For example, Liu et al. (2017) cast the ZAR problem as a reading comprehension problem in which the model chooses an appropriate word for the [MASK] from the vocabulary set. The difference is that, unlike their work, we fill the [MASK] by selecting a token from the given sentences. Kocijan et al. (2019) created similar training data for the Winograd Schema Challenge (Levesque, 2011). While we replace arbitrary NPs with [MASK], they exclusively replaced personal names. We expect our approach to be more suitable for ZAR because arguments are not necessarily personal names.

Pretrain-finetune Discrepancy Addressing the discrepancy between pretraining and finetuning is one of the successful approaches for improving the use of pretrained MLMs. For example, Gururangan et al. (2020) addressed the discrepancy with respect to the domain of the training dataset. Furthermore, Yang et al. (2019) pointed out that [MASK] appears during the pretraining of an MLM but never during finetuning, and improved the model architecture to mitigate this discrepancy. Inspired by these studies, we designed a finetuning model (AS-PZERO) that is suitable for a model pretrained on PZERO and demonstrated its effectiveness.

Prompt-based Learning Our use of a query chunk in AS-PZERO can be seen as a prompt-based learning approach (Radford et al., 2019; Brown et al., 2020), which has been actively studied (Liu et al., 2021). In typical prompt-based learning with a pretrained MLM, a model is trained to replace the masked token with a token from a predefined vocabulary (Schick et al., 2020; Schick and Schütze, 2021a,b; Gao et al., 2021). In contrast, our model is pretrained on PZERO, a task that selects a pseudo antecedent from the preceding context. Thus, we designed AS-PZERO to select the argument from the input sentences using a prompt-based approach, avoiding the pretrain-finetune discrepancy.

Conclusion
In this study, we proposed a new pretraining task, PZERO, which aims to explicitly teach the model the anaphoric relational knowledge necessary for ZAR. We also proposed a ZAR model that remedies the pretrain-finetune discrepancy. Both proposed methods improved the performance of Japanese ZAR, leading to a new state-of-the-art performance. Our analysis suggests that the hard subcategories of ZAR, namely distant arguments and passive predicates, remain challenging.

A Dataset Details and Hyperparameters

We used the NAIST Text Corpus 1.5 (NTC) (Iida et al., 2010, 2017) for the ZAR task. Table 5 shows the number of instances in NTC. Table 6 shows the complete list of hyperparameters used in this study. For both pretraining and finetuning, the maximum learning rate and the loss function were the targets of the hyperparameter search; all candidate learning rates and loss functions are listed in Table 7. We used Nvidia Tesla V100 GPUs for all experiments.

B Hyperparameter Search on Validation Set
Pretraining on Cloze For the hyperparameter search of the Cloze task, we adopted the hyperparameters that achieved the lowest perplexity; the resulting maximum learning rate was 1.0 × 10^-4. We used a development set that we created from Japanese Wikipedia.

Pretraining on PZERO For the hyperparameter search of PZERO, the parameters were determined by the validation performance on PZERO and ZAR. 6 We eventually employed the parameters with the highest F1 on inter arguments.

Finetuning on ZAR For the hyperparameter search of ZAR, we used the hyperparameters that achieved the highest overall ZAR F1. Here, we finetuned the pretrained MLM without any further pretraining. Table 8 shows the results of our search process.

C Heuristics for Extracting Noun Phrases from Raw Text
In order to extract noun phrases (NPs) from Japanese Wikipedia, we first parsed the corpus using the Japanese dependency parser CaboCha (Kudo and Matsumoto, 2002). The parser divides sentences into phrases (Japanese bunsetsu); each bunsetsu consists of a sequence of words. We then extracted the NPs as follows:
1. Choose a phrase that (1) contains at least one noun and (2) does not contain a verb.
2. Scan the phrase from the end, and keep eliminating words until a noun appears.
3. Scan the phrase from the beginning, and keep eliminating words until a word other than a symbol appears.
4. Regard the remaining words as a noun phrase. If the remaining words consist only of symbols, alphabetic characters, or numbers, they are not discarded.
Table 9 shows the statistics of Japanese Wikipedia and the number of PZERO instances generated by this process.

6 The task formulations of PZERO and AS-PZERO are quite similar; thus, we can evaluate a model pretrained on PZERO directly on ZAR without finetuning.
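Under the assumption of simplified coarse POS tags (CaboCha's actual tagset is richer), steps 1-3 of the heuristic can be sketched as follows; the final symbol/number filtering step and the function name are illustrative simplifications.

```python
def extract_np(bunsetsu):
    """Apply the NP-extraction heuristic to one bunsetsu, given as a list of
    (surface, coarse_pos) pairs. The pos values used here ("noun", "verb",
    "symbol") are simplified stand-ins for CaboCha's part-of-speech tags.
    Returns the NP token list, or None if the phrase yields no NP."""
    pos = [p for _, p in bunsetsu]
    # Step 1: the phrase must contain a noun and no verb.
    if "noun" not in pos or "verb" in pos:
        return None
    words = list(bunsetsu)
    # Step 2: from the end, drop words until a noun appears.
    while words and words[-1][1] != "noun":
        words.pop()
    # Step 3: from the beginning, drop leading symbols.
    while words and words[0][1] == "symbol":
        words.pop(0)
    return [w for w, _ in words] if words else None
```

For example, a bunsetsu like 「先生の (open quote + "teacher" + genitive particle) would be reduced to the single-token NP 先生.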

D Performance on Validation Set
We report the performance on the development set of NTC for models (d) to (k) in Table 10 and Table 11. The model IDs follow those of Table 1 and Table 2.

E Number of Parameters of each Model
We report the total number of parameters of AS and AS-PZERO in Table 12.