Unsupervised Pronoun Resolution via Masked Noun-Phrase Prediction

In this work, we propose Masked Noun-Phrase Prediction (MNPP), a pre-training strategy to tackle pronoun resolution in a fully unsupervised setting. First, we evaluate our pre-trained model on various pronoun resolution datasets without any finetuning. Our method outperforms all previous unsupervised methods on all datasets by large margins. Second, we proceed to a few-shot setting where we finetune our pre-trained model on WinoGrande-S and XS separately. Our method outperforms the RoBERTa-large baseline by large margins and, after further finetuning on the remaining three official splits of WinoGrande, achieves a higher AUC score.


Introduction
Co-reference resolution is an important NLP task that aims to find all expressions that refer to the same entity in a text. The resolution of an ambiguous pronoun, known as pronoun resolution, is a longstanding challenge for the NLU community and an essential step for various high-level NLP tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018), question answering (Rajpurkar et al., 2016), and relation extraction (Zhang et al., 2017).
The most successful approach to pronoun resolution is to first fine-tune a large pre-trained language model such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019) on a human-labeled pronoun resolution dataset such as the Definite Pronoun Resolution Dataset (DPR) (Rahman and Ng, 2012) or WinoGrande (WG) (Sakaguchi et al., 2020), and then either directly transfer to a smaller dataset such as the Winograd Schema Challenge (WSC) (Levesque et al., 2012) or Pronoun Disambiguation Problems (PDP) (Morgenstern et al., 2016), or further finetune on a downstream dataset such as SuperGLUE-WSC (Wang et al., 2019a). However, none of the pipelines above can avoid pre-training on a large human-labeled pronoun resolution dataset. Crowd-sourced "unbiased" labels that do not introduce annotation artifacts (Gururangan et al., 2018) are shown to be costly and challenging to collect, requiring a well-designed annotation interface and dedicated annotators. To this end, we propose the unsupervised Masked Noun-Phrase Prediction task to pre-train a language model without any pronoun resolution training signal and directly transfer the pre-trained model to downstream datasets such as WSC. 1 Two examples of WSC are listed in Table 1. Our work improves on all previous unsupervised methods by large margins and even outperforms several strong supervised methods on all datasets we study. We then proceed to the few-shot setting where we finetune our best zero-shot model on WinoGrande-S and XS respectively. MNPP gives a large margin of improvement over strong baselines including CSS (Klein and Nabi, 2020), RoBERTa-large (Sakaguchi et al., 2020), and UnifiedQA-BART-large (Khashabi et al., 2020). We further finetune on the remaining three data splits and achieve a higher AUC score on all five splits of WinoGrande over the RoBERTa-large baseline.

WSC Sentences | Candidate Choices
The trophy doesn't fit in the suitcase because it is too small. | A. the trophy  B. the suitcase
The trophy doesn't fit in the suitcase because it is too big. | A. the trophy  B. the suitcase

Table 1: Two examples of WSC. The goal is to resolve the bold pronoun "it" to "the suitcase" in the first sentence and to "the trophy" in the second sentence.
In summary, our main contributions in this work are threefold.
• First, we propose the MNPP pre-training task and study how different synthetic dataset properties affect zero-shot performance.
• Second, we show that MNPP outperforms all previous fully unsupervised methods and even several strong supervised baselines on all pronoun resolution datasets we study.
• Finally, we show that under few-shot settings, MNPP pre-training gives a significant performance boost on WinoGrande-S and XS, and furthermore achieves a higher AUC score over all five splits of WinoGrande.

Related Works
In this work, we mainly compare with unsupervised methods. 2 On WSC, Zhang and Song (2018) propose the first unsupervised model, in which they modify the Skip-Gram (Mikolov et al., 2013) objective to predict semantic dependencies and then use this additional information during testing. Wang et al. (2019b)

Masked Noun-Phrase Prediction
We treat MNPP as a binary classification task. Given the sentence "She put the cup on the chair, but he knocked over the chair, and the cup fell.", the underlined "the chair" is masked and a pair of replacement phrases for the masked position is given as {"the cup", "the chair"}. One of the candidates is the masked phrase, "the chair", and the other candidate is a different phrase in the sentence, "the cup", extracted from "She put the cup on the chair". The constraint we impose is that both the ground-truth noun-phrase and the alternative candidate must appear before the masked phrase location, which mimics the pronoun resolution task. We sample sentences following the above constraint to create our synthetic datasets for pre-training. We convert the sentence into the format {[CLS] first half option second half [SEP]}, where first half refers to "She put the cup on the chair but he knocked over " and second half refers to ", and the cup fell.". The option slot is filled with each candidate, "the cup" or "the chair". We compute P(the chair|sentence, θ) and P(the cup|sentence, θ) and optimize θ, the parameters of the model, with a cross-entropy loss. We use the final-layer [CLS] vector from the transformer-based language model and pass it through a single-layer feed-forward network to calculate the logits.
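To make the formulation concrete, the sketch below scores the two candidates for the example above with a transformer encoder and a single-layer feed-forward head over the final-layer [CLS] vector. It is a minimal illustration rather than the exact training code; the roberta-large backbone name is an assumption, and batching, optimization, and data loading are omitted.

```python
# Minimal sketch of MNPP candidate scoring as a binary classification task.
# Illustrative only: the backbone name is an assumption, not the paper's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")   # assumed backbone
encoder = AutoModel.from_pretrained("roberta-large")
classifier = torch.nn.Linear(encoder.config.hidden_size, 1)  # single-layer FFN head

first_half = "She put the cup on the chair but he knocked over "
second_half = ", and the cup fell."
candidates = ["the cup", "the chair"]  # ground truth is "the chair" (index 1)

logits = []
for option in candidates:
    # The tokenizer adds the special tokens ([CLS]/[SEP] equivalents) around
    # "first half + option + second half".
    inputs = tokenizer(first_half + option + second_half, return_tensors="pt")
    cls_vec = encoder(**inputs).last_hidden_state[:, 0]      # final-layer [CLS] vector
    logits.append(classifier(cls_vec))

logits = torch.cat(logits, dim=-1)                           # shape (1, 2)
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([1]))
probs = logits.softmax(dim=-1)  # P(the cup | sentence, θ), P(the chair | sentence, θ)
```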

Discussion
The intuition behind MNPP is that, given sufficient samples that mimic the pronoun resolution task, the model can learn rich knowledge to perform well on human-annotated pronoun resolution datasets. This idea is in line with recent advances in unsupervised QA (Banerjee et al., 2021), where synthetic QA datasets are created from unannotated corpora to perform unsupervised pre-training. Strictly speaking, MNPP is even more unsupervised, since our synthetic datasets are not created with true pronoun resolution signals, whereas the synthetic QA datasets in the works cited above contain true question-answer pairs.
As mentioned in Section 2, Kocijan et al. (2019b) study a similar pre-training strategy by constructing a synthetic dataset, called MaskedWiki, which is crawled from English Wikipedia. However, our work differs from theirs in the following ways. First, their pipeline requires further finetuning on another pronoun resolution task before transferring to downstream datasets, whereas our method can be directly evaluated on downstream datasets. Second, the size of MaskedWiki is 2.4 million samples, which is 15 times the size of our best-performing synthetic dataset. Third, we study how different properties of synthetic datasets affect zero-shot performance. Finally, they use a masked token prediction loss, whereas we model the task as a classification task. Kocijan et al. (2019a) also construct another synthetic dataset, called WikiCREM, following the same masking principle but with only personal names masked.

Synthetic Dataset
We study three properties of the synthetic dataset: source style, size, and difficulty level. The sources we choose cover various styles of text, including CNN stories (See et al., 2017), Wikipedia, and the PG-19 language modeling benchmark (Rae et al., 2020). We study 3 groups and a total of 10 different synthetic datasets. The first group contains two synthetic datasets collected from all sources, with and without the knowledge hunting strategy (Prakash et al., 2019). The second group contains five synthetic datasets collected only from PG-19 but with sizes varying from 10k to 500k. The third group contains three synthetic datasets collected from PG-19 with easy, medium, and hard samples, each of the same size of 33k. 3 The datasets' names are listed in the first column of Table 3, and statistics of the first group are described in Table 2.
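The difficulty split in the third group is based on the similarity between the two candidate phrases. As a hedged illustration (the exact similarity measure and thresholds are described in the appendix and may differ), samples could be bucketed as follows, treating more similar candidate pairs as harder:

```python
# Hypothetical difficulty bucketing for MNPP samples. The similarity measure
# (spaCy word-vector cosine) and the thresholds are assumptions, not the
# paper's exact procedure.
import spacy

nlp = spacy.load("en_core_web_md")  # medium English model ships with word vectors

def candidate_similarity(phrase_a: str, phrase_b: str) -> float:
    return nlp(phrase_a).similarity(nlp(phrase_b))

def difficulty_bucket(sample: dict, low: float = 0.4, high: float = 0.7) -> str:
    """More similar candidate pairs are assumed to be harder to distinguish."""
    sim = candidate_similarity(sample["answer"], sample["distractor"])
    if sim < low:
        return "easy"
    if sim < high:
        return "medium"
    return "hard"
```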

Unsupervised Pronoun Resolution
The downstream datasets we test on are the WinoGrande test set (17k instances), the DPR test set (564 instances), the KnowRef test set (12k instances), and the COPA validation set (101 instances). Although COPA (Wang et al., 2019a) is a cause-and-effect identification dataset, Sakaguchi et al. (2020) show that directly transferring a WinoGrande-finetuned RoBERTa-large model to COPA already achieves good performance, indicating that finetuning on WinoGrande can serve as a resource for common sense knowledge. We also investigate whether learning through MNPP can serve as such a resource. Note that we also provide an evaluation on the GAP dataset (Webster et al., 2018) in Table 5 for reference, although the authors of GAP explicitly urge the community not to treat GAP as a Winograd-style task but as a co-reference resolution task without gold mentions provided.

Results
We report our experimental results in Table 3 and Table 4. Compared with previous methods in Table 4, MNPP outperforms all unsupervised methods on all datasets and is comparable with several strong supervised methods. The current best unsupervised methods on WinoGrande are at or below random guess; MNPP outperforms all of them by a margin of at least 8%. Even compared with a supervised baseline where BERT is first finetuned on DPR, our method outperforms it by 8%. On WSC, MNPP also outperforms all SOTA unsupervised methods by more than 8% and outperforms most supervised methods by at least 4%, except RoBERTa-large finetuned on another pronoun resolution dataset. On DPR, our method outperforms the SOTA unsupervised baseline by over 3% and falls only 1% behind the strong supervised baseline that finetunes BERT on MaskedWiki and DPR sequentially or only on WinoGrande. On KnowRef, MNPP outperforms the only unsupervised baseline by nearly 15% and falls only 5% behind the SOTA supervised model. Finally, on COPA, we show that MNPP gives models better common sense knowledge than finetuning on WinoGrande. Meanwhile, we are not surprised that SOTA supervised methods still outperform unsupervised methods, including ours, considering the supervision itself and huge models with billions of parameters such as T5-11B.

Few-Shot Pronoun Resolution
We further proceed to the few-shot setting on WinoGrande-S and XS. We take the top three zero-shot models on the WinoGrande development set and finetune them on WinoGrande-XS (160 instances) and S (640 instances) separately. After the few-shot evaluation, we also finetune on the remaining three data splits, WinoGrande-M, L, and XL. Best performances on all five data splits are reported in Figure 1, and AUC scores are reported in the third column of the WinoGrande section of Table 4. As shown in Figure 1, MNPP outperforms CSS, UnifiedQA-BART-large, and RoBERTa-large on WinoGrande-S and XS by a large margin and, more importantly, achieves a higher AUC score as indicated in Table 4. It is clear that MNPP pre-training gives the model crucial additional information in the few-shot setting where only minimal data is available. We also notice in the AUC column of Table 3 that there is a negative correlation between zero-shot performance and AUC score, which means that higher zero-shot performance does not guarantee a higher AUC score after finetuning.
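For reference, the sketch below shows one way such a learning-curve AUC could be computed, assuming it denotes the area under accuracy versus log training-set size across the five official splits; the normalization is an assumption and the accuracy values are placeholders, not results from this work.

```python
# Hedged sketch of a learning-curve AUC over the five WinoGrande splits.
# Assumes AUC = area under accuracy vs. normalized log(training set size);
# the benchmark's exact formula may differ. Accuracies below are placeholders.
import numpy as np

split_sizes = np.array([160, 640, 2558, 10234, 40398])  # XS, S, M, L, XL (per Sakaguchi et al., 2020)
accuracies = np.array([0.62, 0.68, 0.74, 0.79, 0.83])   # illustrative values only

x = np.log(split_sizes)
x = (x - x.min()) / (x.max() - x.min())                  # map log sizes to [0, 1]
auc = np.trapz(accuracies, x)                            # trapezoidal integration
print(f"learning-curve AUC: {auc:.3f}")
```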

Conclusion
In this work, we propose MNPP pre-training to tackle unsupervised pronoun resolution and study how different properties of the synthetic pre-training dataset impact zero-shot performance on downstream datasets. Without finetuning on any pronoun resolution signal, MNPP outperforms all previous fully unsupervised methods on all tasks we study, as well as several strong supervised baselines. In the few-shot case, where we finetune the zero-shot transfer model on WinoGrande-S and XS respectively, our model outperforms the baselines by large margins and further achieves a higher AUC score.
This work shows the effectiveness of unsupervised task definitions for text-based pronoun resolution and common sense reasoning tasks. It would be interesting to design such tasks for multimodal common sense reasoning (Zellers et al., 2019; Fang et al., 2020).

B Synthetic Datasets Construction
For the first synthetic dataset in the first group, we choose 5000 stories from CNN stories, a small portion of Gutenberg books, and the whole training set of QUOREF (Dasigi et al., 2019), a reading comprehension dataset crawled from Wikipedia that requires resolving co-reference among entities; these sources result in a size of 160k. The second synthetic dataset in the first group comprises the same sources as above plus extra knowledge crawled via Google queries using the knowledge hunting strategy introduced in Prakash et al. (2019). Following their strategy, we scrape 6531 and 69462 knowledge sentences for WSC and WinoGrande respectively. We relax the filtering process to allow longer sentences than those in the first synthetic dataset, leading to 380k samples in total. We then fix the text style and study the influence of data size on pre-training. We use 2000 books from PG-19 as the source and create five synthetic datasets with sizes of 500k, 300k, 100k, 50k, and 10k as the second group. We further study how the difficulty level of samples affects downstream zero-shot performance. We select 100k samples from the PG-19 books described above and evenly split them into three synthetic datasets with low, medium, and high similarity scores between candidate choices as the third group. As a result, we create 3 groups of synthetic datasets with ten synthetic datasets in total. We use spaCy 4 to pre-process the raw text, including removing blank spaces, special characters, and sentences that are too short or too long, and to extract noun-phrases.
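As a rough illustration of this construction step, the sketch below extracts MNPP samples from raw text with spaCy, masking a noun-phrase that re-occurs in the sentence and pairing it with a distractor noun-phrase that appears before the masked position. The token-length thresholds and filtering rules are assumptions and do not reproduce the exact preprocessing described above.

```python
# Hedged sketch of MNPP synthetic sample construction with spaCy.
# Length thresholds and filtering are illustrative, not the exact pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")

def make_mnpp_samples(text: str, min_tokens: int = 8, max_tokens: int = 60) -> list:
    samples = []
    for sent in nlp(text).sents:
        if not (min_tokens <= len(sent) <= max_tokens):
            continue  # drop sentences that are too short or too long
        chunks = list(sent.noun_chunks)
        for i, target in enumerate(chunks):
            earlier = chunks[:i]
            earlier_texts = {c.text.lower() for c in earlier}
            # The masked phrase must also occur earlier in the sentence, and the
            # distractor must be a different noun-phrase appearing before the mask.
            if target.text.lower() not in earlier_texts:
                continue
            for distractor in earlier:
                if distractor.text.lower() == target.text.lower():
                    continue
                samples.append({
                    "first_half": text[sent.start_char:target.start_char],
                    "second_half": text[target.end_char:sent.end_char],
                    "answer": target.text,
                    "distractor": distractor.text,
                })
    return samples

print(make_mnpp_samples(
    "She put the cup on the chair, but he knocked over the chair, and the cup fell."))
```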

C Zero-shot Experiment Details
A recent study (Khot et al., 2020) has shown that finetuning from a RACE-finetuned (Lai et al., 2017) RoBERTa model is much more stable than finetuning a RoBERTa model from scratch; we follow the same strategy and start finetuning from a RACE-finetuned RoBERTa-large model on all synthetic datasets. We use Hugging Face Transformers 5 as our codebase. We use the Adam optimizer with an initial learning rate of 1e-5 and an epsilon of 1e-8, without weight decay, for all settings. For a synthetic dataset whose size is larger than or equal to 100k, we choose a batch size of 32 and train for 20 epochs; otherwise, we choose a batch size of 16 and train for 50 epochs. We checkpoint every X steps, with X in [50, 500].
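A hedged sketch of this configuration using Hugging Face Transformers is shown below; only the hyperparameters mirror the text, the checkpoint name is a stand-in for the RACE-finetuned RoBERTa-large warm start, and dataset loading is omitted.

```python
# Hedged sketch of the zero-shot pre-training configuration described above.
# Checkpoint and dataset are placeholders, not the exact experimental setup.
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

dataset_size = 300_000                       # example synthetic dataset size
large = dataset_size >= 100_000

args = TrainingArguments(
    output_dir="mnpp-pretrain",
    learning_rate=1e-5,                      # Adam, initial learning rate 1e-5
    adam_epsilon=1e-8,                       # Adam epsilon 1e-8
    weight_decay=0.0,                        # no weight decay
    per_device_train_batch_size=32 if large else 16,
    num_train_epochs=20 if large else 50,
    save_steps=500,                          # checkpoint every X steps, X in [50, 500]
)

# "roberta-large" stands in for a RACE-finetuned RoBERTa-large checkpoint;
# swap in that checkpoint to reproduce the warm start described above.
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

# trainer = Trainer(model=model, args=args, train_dataset=synthetic_train_set)
# trainer.train()
```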

D Few-shot Experiment Details
We use the Adam optimizer with an initial learning rate of 1e-5 and an epsilon of 1e-8, without weight decay, and a batch size between 16 and 32 for all settings.