Progressive Adversarial Learning for Bootstrapping: A Case Study on Entity Set Expansion

Bootstrapping has become the mainstream method for entity set expansion. Conventional bootstrapping methods mostly define the expansion boundary using seed-based distance metrics, which heavily depend on the quality of the selected seeds and are hard to adjust due to the extremely sparse supervision. In this paper, we propose BootstrapGAN, a new learning method for bootstrapping which jointly models the bootstrapping process and the boundary learning process in a GAN framework. Specifically, the expansion boundaries of different bootstrapping iterations are learned via different discriminator networks; the bootstrapping network is the generator that generates new positive entities, and the discriminator networks identify the expansion boundaries by trying to distinguish the generated entities from known positive entities. By iteratively performing the above adversarial learning, the generator and the discriminators reinforce each other and are progressively refined along the whole bootstrapping process. Experiments show that BootstrapGAN achieves the new state-of-the-art entity set expansion performance.


Introduction
Bootstrapping is a fundamental technique for entity set expansion (ESE). It starts from a few seed entities (e.g., {London, Beijing, Paris}) and iteratively extracts new entities of the target category (e.g., {Berlin, Moscow, Tokyo}) to expand the entity set, where new entities are often evaluated by their context similarities to seeds (e.g., sharing the same context pattern "* is an important city") (Riloff and Jones, 1999; Gupta and Manning, 2014; Yan et al., 2020a). During the above process, the core problem is to decide whether a new entity belongs to the target category (within the expansion boundary) or not (outside the expansion boundary) (Shi et al., 2014; Gupta and Manning, 2014). However, it is challenging to determine the expansion boundaries during the whole bootstrapping process, since only a few seeds are available as supervision at the beginning. First, a handful of positive entities is clearly not enough to define a good boundary. For example, as shown in Figure 1, when only a few positive entities are used to learn distance-based boundaries, the boundaries are usually far from optimal, which in turn degrades the quality of the following bootstrapping iterations. Therefore, it is critical to enhance boundary learning with more supervision signals or prior knowledge (Thelen and Riloff, 2002; Curran et al., 2007). Second, bootstrapping is a dynamic process containing multiple iterations. Therefore, the boundary needs to be adjusted synchronously with the bootstrapping model, i.e., a good boundary should precisely restrict the current bootstrapping model from expanding negative entities.
Currently, most bootstrapping methods define the expansion boundary using seed-based distance metrics, i.e., determining whether an entity should be expanded by comparing it with seeds. For instance, Riloff and Jones (1999); Gupta and Manning (2014); Batista et al. (2015) define the boundary using pattern-matching statistics or distributional similarities. Unfortunately, these heuristic metrics heavily depend on the selected seeds, making the boundary biased and unreliable (Curran et al., 2007; McIntosh and Curran, 2009). Although some studies extend them with extra constraints (Carlson et al., 2010) or manual participation (Berger et al., 2018), the requirement of expert knowledge makes them ad hoc and inflexible. Some studies try to learn the distance metrics (Zupon et al., 2019; Yan et al., 2020a), but they still suffer from weak supervision. Furthermore, because the bootstrapping model and the boundary are mostly learned separately, it is hard for these methods to adjust the boundary synchronously when the bootstrapping model updates.
To address the boundary learning problem, we propose a new learning method for bootstrapping-BootstrapGAN, which defines expansion boundaries via learnable discriminator networks, and jointly models the bootstrapping process and the boundary learning process in the generative adversarial networks (GANs) framework (Goodfellow et al., 2014): (1) Instead of using unified seed-based distance metrics, we define the expansion boundaries of different bootstrapping iterations using different learnable discriminator networks, each of which directly determines whether an entity belongs to the same category as the seeds at its iteration. By defining boundaries with discriminator networks, our method can flexibly adopt different classifiers and learn them with different algorithms.
(2) At each bootstrapping iteration, by modeling the bootstrapping network as the generator and adversarially learning it with a discriminator network, our method can effectively resolve the sparse supervision problem for boundary learning. Specifically, the generator is trained to select the most confusing entities; the discriminator learns to classify the selected entities as negative instances, and previously expanded entities and seeds as positive instances. In this way, the generator and the discriminator reinforce each other: the generator enhances the supervision signals for discriminator learning by selecting latent noisy entities, and the discriminator pushes the generator to select more indistinguishable entities. When the generator-discriminator equilibrium is reached, the discriminator has learned a good expansion boundary that accurately identifies new entities, and the bootstrapping network can expand new positive entities within the boundaries.
(3) By iteratively performing the above adversarial learning process, the bootstrapping network and the expansion boundaries are progressively refined along the bootstrapping iterations. Specifically, we use a discriminator sequence containing multiple discriminators to progressively learn the expansion boundaries for different bootstrapping iterations. The bootstrapping network is also refined and restricted along the whole bootstrapping process by the current discriminator and the previously learned discriminators.
We conduct experiments on two datasets, and our BootstrapGAN achieves the new state-of-the-art performance for entity set expansion.

Progressive Adversarial Learning for Bootstrapping
In this section, we introduce our boundary learning method for bootstrapping models-BootstrapGAN (see Figure 2), which contains a generator-the bootstrapping network that performs the bootstrapping process, and a set of discriminators that determine the expansion boundaries for different bootstrapping iterations. The bootstrapping network and the discriminator networks are progressively and adversarially trained during the bootstrapping process.

Generator: Bootstrapping Network
The generator is the bootstrapping model, which iteratively selects new entities to expand seed sets. We adopt the recently proposed end-to-end bootstrapping network-BootstrapNet (Yan et al., 2020a) as the generator, which follows the encoder-decoder architecture: Encoder The encoder is a multi-layer graph neural network (GNN) that encodes the context features around entities/patterns into their embeddings. The encoder takes an entity-pattern bipartite graph as input to efficiently capture global evidence (i.e., the direct and multi-hop co-occurrences between entities and patterns). The bipartite graph is constructed from the original datasets: entities and patterns are graph nodes; an entity and a pattern are linked if they co-occur.
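The graph construction above can be sketched in a few lines of Python (a simplified sketch; function and variable names are our own, not from the released code):

```python
from collections import defaultdict

def build_bipartite_graph(cooccurrences):
    """Build the entity-pattern bipartite graph: entities and patterns
    are nodes; an entity and a pattern are linked if they co-occur."""
    entity_nbrs = defaultdict(set)   # entity -> linked patterns
    pattern_nbrs = defaultdict(set)  # pattern -> linked entities
    for entity, pattern in cooccurrences:
        entity_nbrs[entity].add(pattern)
        pattern_nbrs[pattern].add(entity)
    return entity_nbrs, pattern_nbrs

def two_hop_entities(entity, entity_nbrs, pattern_nbrs):
    """Entities reachable in two hops (i.e., sharing a pattern with
    `entity`); this is the multi-hop co-occurrence evidence."""
    reached = set()
    for pattern in entity_nbrs[entity]:
        reached |= pattern_nbrs[pattern]
    reached.discard(entity)
    return reached
```

For example, entities sharing the pattern "* is an important city" with London would be returned by `two_hop_entities("London", ...)`.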
Based on the above bipartite graph, each GNN layer aggregates information from node neighbors as follows:

v_i^{l+1} = σ( f({ a_{i,j}^l W^l v_j^l | j ∈ N(i) }) )    (1)

where v_i^l is node i's embedding after layer l, N(i) is the set of i's neighbors, W^l is the parameter matrix, a_{i,j}^l is the attention-based weight, f is a linear sum function, and σ is the non-linear activation function.
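A minimal NumPy sketch of one such aggregation layer follows; the dot-product attention and the tanh activation are our assumptions for illustration, since the excerpt does not specify their exact forms:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def gnn_layer(V, neighbors, W):
    """One aggregation layer in the spirit of Eq. 1: for each node i,
    attention-weight its neighbors' transformed embeddings, sum them
    (the linear function f), and apply tanh (the non-linearity sigma).
    V: (num_nodes, d) embeddings; neighbors: dict node -> neighbor list;
    W: (d, d) parameter matrix."""
    out = np.zeros_like(V)
    for i, nbrs in neighbors.items():
        scores = np.array([V[i] @ V[j] for j in nbrs])   # attention logits (assumed dot product)
        a = softmax(scores)                              # a_{i,j}^l
        msgs = np.stack([W @ V[j] for j in nbrs])        # W^l v_j^l
        out[i] = np.tanh((a[:, None] * msgs).sum(axis=0))  # sigma(f(...))
    return out
```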
Decoder After encoding entities and patterns, the GRU-based decoder sequentially generates new entities as the expansions, where each GRU step corresponds to one bootstrapping iteration. Specifically, the hidden state of the decoder represents the semantics of the target category. At each GRU step, the last expanded entities are used as the inputs to update the hidden state, which models the process in which newly expanded entities are added to the current set and the set semantics are updated accordingly (the first step takes the seeds as inputs); then, the generating probability of a candidate entity j is calculated as:

p(j | h_k) = softmax_j(h_k^T M v_j)    (2)

where h_k is the hidden state at the k-th GRU step, v_j is entity j's embedding output by the encoder, and M is the parameter matrix. Top-N new entities are expanded at each step.
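The scoring part of one decoding step can be sketched as follows (a simplified sketch: the bilinear score h_k^T M v_j is an assumption consistent with the variables listed above, and the masking of already-expanded entities is added for illustration):

```python
import numpy as np

def decode_step(h_k, V, M, expanded, N=2):
    """Score every candidate entity j by h_k^T M v_j, mask entities that
    were already expanded, normalize with a softmax, and expand the top-N.
    h_k: (d,) decoder hidden state; V: (num_entities, d) entity embeddings;
    M: (d, d) parameter matrix; expanded: set of already-expanded indices."""
    scores = np.array([h_k @ M @ v_j for v_j in V])
    scores[list(expanded)] = -np.inf          # already-expanded entities get zero probability
    probs = np.exp(scores - np.max(scores))
    probs /= probs.sum()
    top_n = set(np.argsort(-probs)[:N].tolist())
    return probs, top_n
```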

Discriminator: Expansion Boundary
Given positive entities (i.e., seeds and expanded entities), the discriminator defines the expansion boundary of each bootstrapping iteration by identifying whether a new entity is positive (i.e., belonging to the same category as the positive entities) or negative (otherwise). Instead of using seed-based distance metrics (Riloff and Jones, 1999; Gupta and Manning, 2014), we take the different categories of seeds into consideration, and design the discriminators to directly predict which category a new entity belongs to. The motivation comes from two aspects: (1) By enforcing the discriminator to directly discriminate whether a new entity is positive for any category of seeds, the discriminator essentially captures the category boundary and can flexibly leverage supervision signals beyond the seeds; (2) According to the mutual exclusion assumption (Curran et al., 2007) (i.e., most entities usually belong to only one category), it is better to leverage different categories of seeds to alleviate noise and simultaneously learn their expansion boundaries.
Specifically, we design our discriminator as a multi-class classifier, which contains a GNN followed by an MLP layer: the GNN module takes the entity-pattern bipartite graph as input and encodes context features into entity embeddings as in Eq. 1; the MLP layer followed by a softmax function outputs the entity's category probabilities, where each category refers to one seed set. A new entity is regarded as positive only for the category with the highest probability. Besides, we use a 1-layer GNN module to avoid overfitting.
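A minimal sketch of the discriminator head (the 1-layer GNN embedding step is omitted here; the ReLU hidden layer and weight shapes are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def discriminator(v, W1, W2):
    """Multi-class discriminator head: given an entity embedding v
    (assumed to come from the 1-layer GNN), an MLP plus softmax yields
    category probabilities, one category per seed set. The entity is
    regarded as positive only for the argmax category."""
    h = np.maximum(W1 @ v, 0.0)          # MLP hidden layer (ReLU, assumed)
    logits = W2 @ h
    e = np.exp(logits - logits.max())
    probs = e / e.sum()                  # p_D(c | e)
    return probs, int(np.argmax(probs))  # (distribution, predicted category)
```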

Progressive Adversarial Learning
To learn the above generator and discriminators, we design the following progressive adversarial learning process: Before bootstrapping, we pre-train the generator for better convergence (Pre-training); At each bootstrapping iteration, a discriminator is used to learn the expansion boundaries of this iteration, and is adversarially trained with the generator so that they reinforce each other (Local adversarial learning); Along the whole bootstrapping process, we progressively refine the generator with multiple discriminators by iteratively performing the above local adversarial learning (Global progressive refining).

Pre-training
Many previous studies have suggested that pre-training is important for learning convergence in GANs (Li and Ye, 2018; Qin et al., 2018). This paper pre-trains the generator (i.e., the bootstrapping network) using the following two kinds of pre-training algorithms: (1) The multi-view learning algorithm (Yan et al., 2020a), where the generator is co-trained with an auxiliary network.
(2) Self-supervised and supervised pre-training using external resources (Yan et al., 2020b). Note that, since the external resources are not always accessible, we use the first algorithm as our default setting and set the second one as an alternative.

Local Adversarial Learning
At each bootstrapping iteration, the discriminator and the generator are learned using the following adversarial goals: the generator tries to generate new positive entities; the discriminator should distinguish new entities from current positive entities.
However, it is difficult to adopt standard GAN settings for our method: (1) The discriminator is a multi-class classifier rather than a binary classifier.
(2) The generator outputs discrete entities rather than continuous values. To address the above issues, we use a Shannon entropy-based objective that is consistent with the discriminator, and the policy gradient algorithm to optimize the generator.
Shannon entropy-based learning objective To make our GAN settings consistent with the multi-class discriminator, we modify the adversarial goals, inspired by Springenberg (2016): the generator tries to generate new entities that the discriminator confidently predicts to belong to the same categories as known positive entities; the discriminator tries not to be fooled, confidently assigning categories to the known positive entities while staying uncertain about the class assignment of newly generated entities.
Based on the new goals, we design a Shannon entropy-based learning objective, where the category assignment uncertainty is represented by the Shannon entropy. Formally, at bootstrapping iteration k, we use the following adversarial objective to learn the generator G and the discriminator D:

min_G max_D Σ_c [ −Σ_{e ∈ S^c ∪ G^c_{<k}} H(p_D(c|e)) − λ Σ_{e ∈ S^c ∪ G^c_{<k}} CE(c, p_D(·|e)) + Σ_{e ∈ G^c_k} H(p_D(c|e)) ]    (3)

where c is a target category, S^c is the corresponding seed set, G^c_{<k} is the set of expanded entities before iteration k, entities in S^c ∪ G^c_{<k} are regarded as positive entities, G^c_k is the set of newly expanded entities at step k, H(p_D(c|e)) is the discriminator's prediction entropy for e, CE(·) is the cross-entropy term that assigns the right classes to positive entities, and λ is a hyper-parameter (this paper sets λ = 1). The first two terms of Eq. 3 maximize the class assignment probabilities (i.e., minimize the uncertainty) of positive entities, and the third term maximizes the entropies (i.e., maximizes the uncertainty) of newly generated entities.
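The objective value can be sketched numerically as follows (a NumPy sketch of our reading of Eq. 3; the discriminator maximizes this quantity, and the generator minimizes the generated-entity term):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) of a category distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def adversarial_objective(pos_probs, pos_labels, gen_probs, lam=1.0):
    """Value of the Eq. 3 objective as reconstructed from the text.
    pos_probs: category distributions p_D(.|e) for positive entities;
    pos_labels: their gold category indices c;
    gen_probs: distributions for newly generated entities."""
    certainty = -sum(entropy(p) for p in pos_probs)     # minimize positive-entity uncertainty
    ce = -lam * sum(-np.log(max(p[c], 1e-12))           # assign right classes (cross-entropy term)
                    for p, c in zip(pos_probs, pos_labels))
    confusion = sum(entropy(p) for p in gen_probs)      # maximize generated-entity uncertainty
    return certainty + ce + confusion
```

As a sanity check, a discriminator that is confident and correct on positives while uncertain on generated entities scores higher than one that is uncertain everywhere.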
We sample the same number of newly generated entities as positive entities to balance the above adversarial training process (we still select top-N entities for inference, as in Section 2.1).
Policy gradient learning for generator To optimize the generator, which outputs discrete entities, we adopt the policy gradient algorithm. Specifically, we first rewrite the objective of the generator (the third term in Eq. 3) as maximizing the following function (denoted as L_G):

L_G = Σ_c Σ_e p_{G^c_k}(e) R(e)    (4)

where p_{G^c_k}(e) is the expansion probability for entity e at step k, and e is a sampled discrete entity. We adopt the REINFORCE algorithm (Williams, 1992) to directly calculate L_G's gradient ∇_θ L_G as:

∇_θ L_G = Σ_c Σ_e p_{G^c_k}(e) (R(e) − b) ∇_θ log p_{G^c_k}(e)    (5)

where p_D(c|e) is the probability of e belonging to category c returned by the discriminator, R(e) is the indistinguishability-based reward (derived from p_D(c|e)) for generator learning, and b is the baseline value (this paper sets b = 1/|C|, where |C| is the number of categories).
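A toy REINFORCE update for a softmax generator over candidate entities might look like this (the logit parameterization `theta` is illustrative, not the actual BootstrapNet decoder; rewards are assumed to come from the discriminator):

```python
import numpy as np

def reinforce_grad(theta, sampled, rewards, b):
    """REINFORCE estimate of grad L_G in the spirit of Eq. 5: the
    generator is a softmax over entities with logits theta; for each
    sampled entity e with reward R(e), accumulate
    (R(e) - b) * grad_theta log p_G(e), where b is the baseline."""
    e_ = np.exp(theta - theta.max())
    p = e_ / e_.sum()                # p_G(e) for every entity
    grad = np.zeros_like(theta)
    for e, r in zip(sampled, rewards):
        glogp = -p.copy()
        glogp[e] += 1.0              # gradient of log softmax at index e
        grad += (r - b) * glogp
    return grad
```

With uniform logits over 3 entities, sampling entity 0 with reward 1 and baseline b = 1/3 pushes entity 0's logit up and the others down.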

Global Progressive Refining
The local adversarial learning optimizes the generator and the discriminator at each bootstrapping iteration. This section describes how to refine them along the whole bootstrapping process, which we call global progressive refining. One naive refining method is to iteratively perform the above local adversarial learning using one generator and one discriminator. However, this setting is not suitable for the dynamic bootstrapping process. First, since the positive entities are iteratively expanded, the expansion boundaries at adjacent iterations should also be slightly different. Therefore, it is necessary to use different discriminators for different iterations. Second, for the end-to-end bootstrapping network (Yan et al., 2020a), restricting the outputs of the current iteration will influence the outputs of previous iterations, but the naive refining method cannot continuously restrict the expansions of previous iterations to the already learned boundaries.
Therefore, we propose a global progressive refining mechanism using a discriminator sequence containing multiple discriminators rather than a single one. Specifically: (1) For each bootstrapping iteration, we use a unique discriminator to learn its expansion boundaries. That means for a total of K bootstrapping iterations, the discriminator sequence contains K different discriminators.
(2) At the k-th iteration, discriminator D_k is initialized from the learned discriminator D_{k−1}; then D_k and the generator G are trained using the local adversarial learning until convergence; finally, D_k accurately defines the expansion boundaries of iteration k and remains fixed in the following iterations. Through the above process, we can progressively refine the expansion boundaries by iteratively fitting new discriminators from previously learned boundaries to new ones.
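Steps (1) and (2) can be sketched as a driver loop (a structural sketch only; `make_discriminator` and `train_adversarial` are placeholders for the model construction and the local adversarial learning described above):

```python
import copy

def progressive_refining(generator, make_discriminator, train_adversarial, K):
    """Global progressive refining schedule: for each of K iterations,
    D_k is initialized from D_{k-1}, trained jointly with the generator,
    and then frozen as the fixed boundary of iteration k."""
    frozen = []                      # {D_1, ..., D_{k-1}}: fixed boundaries
    D = make_discriminator()
    for k in range(1, K + 1):
        D_k = copy.deepcopy(D)       # initialize D_k from learned D_{k-1}
        train_adversarial(generator, D_k, frozen, step=k)  # until convergence
        frozen.append(D_k)           # boundary of iteration k stays fixed
        D = D_k
    return generator, frozen
```

The deep copy keeps each frozen discriminator intact while the next one continues training from its parameters.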
(3) At the k-th iteration, to restrict the generator's previous expansions to the learned boundaries (possessed by {D_1, D_2, ..., D_{k−1}}), we also use the learned discriminator D_i (i ≤ k) to assign prediction probabilities as rewards for the entities expanded at iteration i. Finally, we replace the generator's gradient calculated by Eq. 5 with:

∇_θ L_G = Σ_c Σ_{i ≤ k} Σ_e p_{G^c_i}(e) (R_i(e) − b) ∇_θ log p_{G^c_i}(e)    (6)

where R_i(e) is the reward assigned by D_i, D_k is the discriminator to be learned at iteration k, and {D_1, ..., D_{k−1}} are the already learned discriminators.

Experiments

For each category, 10 entities are used as the seeds, and all n-grams (n ≤ 4) around candidate entities are defined as the context patterns.
For BootstrapGAN, we report the results of its two versions: BootstrapGAN, which uses the multi-view learning algorithm for pre-training; and BootstrapGAN(ext), which uses external datasets for pre-training as in Yan et al. (2020b).
Evaluation Metrics Following Zupon et al. (2019), this paper uses precision-throughput curves to compare all methods. For a more precise evaluation, we also report precision@K values (P@K, i.e., the precision at expansion step K). We run our method for 10 repeated training runs and report the mean P@K values as well as the standard deviations.
Implementation We implement BootstrapGAN using PyTorch (Paszke et al., 2019) with the PyTorch Geometric extension (Fey and Lenssen, 2019), and run it on a single NVIDIA Titan RTX GPU. We use Adam (Kingma and Ba, 2015) and RMSprop (Tieleman and Hinton, 2012) to optimize the generator and the discriminators, respectively. Main hyperparameters are shown in Table 1. Our code is released at https://www.github.com/lingyongyan/BootstrapGAN.

Overall Evaluation Results
The precision-throughput curves of all methods are shown in Figure 3, and P@K values are shown in Table 2. We can observe that: (1) Adversarial learning can effectively learn good expansion boundaries for bootstrapping models. Compared with all baselines without external resource pre-training (i.e., Gupta, LTB, Emboot, and BootstrapNet), BootstrapGAN achieves significant improvements (all t-test p-values are less than 0.01), and its precision-throughput curves are the smoothest. That means more correct entities and fewer noisy entities are expanded at each iteration. This verifies that the expansion boundaries learned by BootstrapGAN contain fewer noisy entities than those of other methods, and are therefore better boundaries. Besides, the external resource pre-trained version, BootstrapGAN(ext), also outperforms the baseline model that uses external resources for pre-training (i.e., GBN).
(2) Progressive adversarial learning is complementary to self-supervised and supervised pre-training, and combining them achieves the new state-of-the-art performance. Compared with the original BootstrapGAN, BootstrapGAN(ext), which adds self-supervised and supervised pre-training, achieves further improvements: on CoNLL, the P@10 and P@20 values improve by 1.6% and 5.4%; on OntoNotes, the P@10 and P@20 values improve by 3.2% and 1.8%.
(3) The end-to-end bootstrapping paradigm outperforms other bootstrapping methods. Compared with the other methods, the end-to-end learning methods (i.e., BootstrapNet, GBN, BootstrapGAN, and BootstrapGAN(ext)) achieve markedly higher performance. Moreover, compared with BootstrapNet/GBN, BootstrapGAN/BootstrapGAN(ext) achieves further noticeable improvements, especially on the more complex dataset, OntoNotes.

Detail Analysis
Effect of pre-training strategies. To analyze the effects of pre-training, we compare the performance of BootstrapGAN under different pre-training settings (see Table 2): BootstrapGAN, and BootstrapGAN without pre-training (-pretrain). We can see that pre-training is an effective way to improve bootstrapping performance in some tasks. Without pre-training, BootstrapGAN's performance on OntoNotes substantially drops: all mean P@K values decrease by at least 4.9%. This may be because complex datasets (e.g., OntoNotes) usually contain massive numbers of entities, so the search space of the bootstrapping network is extremely large, which makes it hard to converge to the optimum without appropriate pre-training.
Effect of global progressive refining. To analyze the effects of global progressive refining, we conduct comparison experiments with different refining mechanisms (see Table 3): the original setting using global progressive refining (BootstrapGAN); local adversarial learning without refining, i.e., only seeds are taken as positive entities and all expanded entities from different iterations are taken as negative ones in Eq. 3 (-refining); and refining using the naive mechanism rather than our global progressive refining (-g-refining). From Table 3, we can see that: (1) Refining is useful when performing adversarial learning for bootstrapping. Without the refining mechanism, BootstrapGAN's performance sharply drops on both datasets (all P@K values decrease by at least 5.6%).
(2) Our global progressive refining mechanism is well suited to BootstrapGAN learning. When the global progressive refining is replaced with the naive mechanism, most BootstrapGAN results decrease, especially on P@5 and P@10. This verifies our observation that the expansions of previous iterations can be influenced when adversarially learning for later iterations. Our global progressive refining alleviates this influence well, and is therefore the better refining mechanism.
Stability of adversarial learning. To analyze the stability of our adversarial learning method, we report the P@K values of BootstrapGAN at different iterations (see Figure 4). We can see that: (1) Our adversarial learning method converges quickly. At around the 10th bootstrapping iteration, the performance of BootstrapGAN reaches a reasonable level. (2) Our adversarial learning method is stable. On both datasets, most P@K values steadily increase with more training iterations, and the standard deviations of most P@K values progressively decrease. These observations verify the stability of our learning algorithm (although some P@K values decrease slightly from iteration 10 to iteration 20, we still consider our algorithm stable since the differences are small enough to be negligible).

Examples for learned expansion boundaries.
To intuitively show the quality of the expansion boundaries learned by BootstrapGAN, we show a typical case of the different entities expanded for GPE (geo-political entities) on OntoNotes by BootstrapNet and BootstrapGAN (see Table 4). We can see that BootstrapGAN expands more correct entities, most of which are tightly related to the GPE semantics, while the expansion boundaries of BootstrapNet contain many noisy entities at the very beginning and tend to introduce more noise at later iterations. This further verifies the importance of expansion boundary learning and BootstrapGAN's effectiveness.

Related Work
Bootstrapping Bootstrapping is a widely used technique for information extraction (Riloff, 1996; Ravichandran and Hovy, 2002; Yoshida et al., 2010; Angeli et al., 2015; Saha et al., 2017), and also benefits many other NLP tasks, such as question answering (Ravichandran and Hovy, 2002), named entity translation (Lee and Hwang, 2013), knowledge base population (Angeli et al., 2015), etc. To address the expansion boundary problem, most early methods (Riloff, 1996; Riloff and Jones, 1999) heuristically decide boundaries using pattern-matching statistics, but often suffer a rapid quality degradation known as semantic drift (Curran et al., 2007). To reduce semantic drift, some studies leverage external resources or constraints, e.g., mutual exclusion constraints (Yangarber et al., 2002; Thelen and Riloff, 2002; Curran et al., 2007; Carlson et al., 2010), lexical and statistical features (Gupta and Manning, 2014), lookahead feedback (Yan et al., 2019), and manually defined patterns (Zhang et al., 2020). However, these heuristic constraints are usually not flexible due to their requirement of expert effort. In contrast, recent studies focus on learning distance metrics to determine boundaries using weak supervision (Berger et al., 2018; Zupon et al., 2019; Yan et al., 2020a). For example, Yan et al. (2020a) propose an end-to-end bootstrapping network learned by multi-view learning, and extend it with self-supervised and supervised pre-training (Yan et al., 2020b). However, these methods usually learn a loose boundary using sparse supervision. Furthermore, their boundary learning process and model learning process are usually performed separately and therefore cannot be adjusted synchronously.

Adversarial Learning in NLP Adversarial learning (Goodfellow et al., 2014) is widely applied in NLP. For example, in sequential generation tasks, GANs are mainly used to alleviate the problem of lacking explicitly defined criteria (Yu et al., 2017; Lin et al., 2017).
GAN has also been used in weakly supervised information extraction to identify informative instances and filter out noises (Qin et al., 2018;Wang et al., 2019), which inspires our method.

Conclusion
Due to the very sparse supervision and the dynamic nature of the process, one fundamental challenge of bootstrapping is how to learn precise expansion boundaries. In this paper, we propose an effective learning method for bootstrapping-BootstrapGAN, which defines expansion boundaries via learnable discriminator networks and jointly models the bootstrapping process and the boundary learning process in the GAN framework. Experimental results show that, by adversarially learning and progressively refining the bootstrapping network and the discriminator networks, our method achieves the new state-of-the-art performance. In the future, we plan to leverage extra knowledge (e.g., knowledge graphs) to improve bootstrapping learning.