RetroGAN: A Cyclic Post-Specialization System for Improving Out-of-Knowledge and Rare Word Representations

Retrofitting is a technique used to move word vectors closer together or further apart in their space to reflect their relationships in a Knowledge Base (KB). However, retrofitting only works on concepts that are present in that KB. RetroGAN uses a pair of Generative Adversarial Networks (GANs) to learn a one-to-one mapping between concepts and their retrofitted counterparts. It applies that mapping (post-specializes) to handle concepts that do not appear in the original KB, in a manner similar to how some natural language systems handle out-of-vocabulary entries. We test our system on three word-similarity benchmarks and a downstream sentence simplification task, and achieve the state of the art on one of them (CARD-660). Altogether, our results demonstrate our system's effectiveness for out-of-knowledge and rare-word generalization.


Introduction
Retrofitting word embeddings with a KB (Faruqui et al., 2015; Speer and Chin, 2016; Mrkšić et al., 2017) means taking a vector space of word embeddings and finding a mapping that moves some of these word vectors closer together and others further apart, such that the vectors' new positions in the space are in better agreement with the relationships between the same words (a.k.a. concepts) in a KB (Speer and Chin, 2016; Mrkšić et al., 2017). However, the retrofitting process can only operate on concepts that are actually present in the KB (a.k.a. constraints), which means that retrofitting improves performance on semantic tasks only for the vocabulary that overlaps between the KB and the word embeddings. Post-specialization (Ponti et al., 2018; Kamath et al., 2019) is a solution to this problem: a series of techniques that try to (1) learn the mapping that retrofitting establishes and (2) generalize that mapping to the rest of the embedding vocabulary.

* Work done while at the MIT Media Lab.
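To make the retrofitting idea concrete, here is a minimal sketch of the iterative update in the spirit of Faruqui et al. (2015), using toy 2-d vectors and a tiny synonym graph (illustrative data only; the real method runs over a full KB and equal unit weights are an assumption):

```python
# Minimal retrofitting sketch: each vector is pulled toward the average
# of its KB neighbors while staying anchored to its original position.
# Toy data and uniform weights (alpha = beta = 1) are assumptions.

def retrofit(embeddings, synonyms, iterations=10):
    """Iteratively blend each word's original vector with its neighbors'."""
    new = {w: list(v) for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, neighbors in synonyms.items():
            if word not in new or not neighbors:
                continue  # words without constraints are left untouched
            dim = len(new[word])
            updated = []
            for d in range(dim):
                # original vector counts once, each neighbor counts once
                total = embeddings[word][d]
                for n in neighbors:
                    total += new[n][d]
                updated.append(total / (1 + len(neighbors)))
            new[word] = updated
    return new

vectors = {"dog": [1.0, 0.0], "canine": [0.0, 1.0], "table": [5.0, 5.0]}
graph = {"dog": ["canine"], "canine": ["dog"], "table": []}
fitted = retrofit(vectors, graph)
# "dog" and "canine" move toward each other; "table", which has no
# constraints, is unchanged -- exactly the limitation post-specialization
# is designed to address.
```

Note how "table" is left where it was: a concept absent from the constraints receives no benefit, which is the gap post-specialization fills.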
We develop and present a post-specialization system called RetroGAN that builds upon the approach presented as AuxGAN (Ponti et al., 2018) by extending it to have a pair of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). A regular GAN minimizes the loss when learning the function for post-specialization. Our pair works in a cyclic manner to minimize the losses of both the post-specialization and the inverse to ensure that there is a one-to-one mapping between the two domains. This constrains the outputs for unseen data in both domains and leads to achieving higher performance for unseen concepts.

Related Work
Within the field of retrofitting, work has explored various ways of infusing constraints or KBs into word embeddings. The original work by Faruqui et al. (2015) used only synonymy relationships and not antonymy relationships, which meant that word embeddings with similar (synonymous) semantics in the KB would be pulled together, but word embeddings with dissimilar (antonymous) semantics would not be separated. The Attract-Repel work by Mrkšić et al. (2017) addressed this shortcoming by incorporating antonymy relationships into the retrofitting procedure: synonymous embeddings are attracted to each other, while antonymous embeddings are repelled from each other. This line of work was continued by Lexical Entailment Attract-Repel, which adds the asymmetric lexical entailment relationship to Attract-Repel. 1

Table 1: The 10 most similar embeddings for "dog" and "doggo" for FastText embeddings. The distributional neighbors are the closest embeddings in the original distributional space, and the retrofitted neighbors are the closest in the RetroGAN post-specialized space. "doggo" was near slang such as "bae" and "furbabies", but after post-specialization it moves closer to words that we regard as semantically similar to "dog." The one-to-one mapping that RetroGAN provides is key to incorporating useful semantic information into rare words like "doggo".

Building on these works, a series of techniques called post-specialization were developed. These techniques consist of utilizing neural models to learn retrofitting mappings, such as (Glavaš and Vulić, 2018) and (Ponti et al., 2018; Kamath et al., 2019), which use a deep feed-forward neural network and a Generative Adversarial Network (GAN), respectively. Given a static word embedding, post-specialization can generate its retrofitted counterpart on the fly with a trained system. A concrete example is in Table 1.
As it stands, attention has shifted to using contextual embeddings, such as those produced by BERT (Devlin et al., 2019), on downstream tasks. Only recently have there been efforts to incorporate external KB assertions into pre-trained transformer-based systems (e.g., KnowBERT (Peters et al., 2019), Align-mask-select (Ye et al., 2019), and LIBERT (Lauscher et al., 2019)). LIBERT bridges contextual and retrofitted embeddings by leveraging the knowledge in retrofitted embeddings to find lexical tuples that are fed into BERT to focus on their lexical information.
GANs have been utilized extensively in the image domain to create lifelike images. CycleGAN (Zhu et al., 2017) and other cyclic systems (Kim et al., 2017) have been used to perform style transfer (i.e., apply certain distinctive characteristics from one image domain to another). CycleGAN learns a, possibly unpaired, one-to-one mapping from one domain to another. To effectively utilize paired data, the work by Tripathy et al. (2018) modifies the CycleGAN architecture to include a conditional cyclic loss, in which new discriminators are conditioned to determine whether a generated sample is real or not based on a given input.
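The cycle-consistency idea at the core of these systems can be sketched numerically. Below, G and F are stand-in invertible functions (not trained generators), chosen only to show how the round trip x -> G(x) -> F(G(x)) is scored:

```python
# Toy sketch of cycle consistency. G maps "distributional" vectors to
# "retrofitted" ones and F maps back; the cycle loss penalizes any
# drift after a full round trip. The linear G and F here are
# illustrative stand-ins, not the paper's trained networks.

def G(x):
    # stand-in forward mapping: scale and shift
    return [2.0 * v + 1.0 for v in x]

def F(y):
    # stand-in inverse mapping
    return [(v - 1.0) / 2.0 for v in y]

def cycle_loss(batch):
    """Mean absolute error between each vector and its cycled version."""
    total, count = 0.0, 0
    for x in batch:
        cycled = F(G(x))
        for a, b in zip(x, cycled):
            total += abs(a - b)
            count += 1
    return total / count

batch = [[0.1, -0.4, 2.0], [1.5, 0.0, -0.3]]
loss = cycle_loss(batch)
# F is the exact inverse of G here, so the loss is (numerically) zero;
# during training, minimizing this loss pushes F toward inverting G.
```

When G and F are learned networks, driving this loss to zero is what enforces the one-to-one mapping between the two embedding domains.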

RetroGAN
RetroGAN is a system that builds on (Ponti et al., 2018) by utilizing a CycleGAN-like architecture (i.e., we use a pair of GANs cyclically, but our layers differ from the original CycleGAN's). We chose the CycleGAN-like architecture because, in our domain, the cycle-consistency constraints can enforce a one-to-one mapping from original embeddings to retrofitted embeddings. This mapping guarantees that unseen concepts will have their own, unique retrofitted counterparts. We use RetroGAN to learn the mapping of Attract-Repel (Mrkšić et al., 2017) retrofitting (with the synonymy and antonymy constraints from the Attract-Repel paper) on a subset of static word embeddings (i.e., FastText (Bojanowski et al., 2017) and Numberbatch (Speer et al., 2017)), and perform post-specialization on the entire set.

Model & Architecture
RetroGAN consists of two GANs that interplay to balance a combination of losses to transform a particular word embedding x_i ∈ X from its original domain X to its counterpart y_i ∈ Y in the retrofitted domain Y, and vice versa. In both GANs that we employ, the generator consists of an input layer, followed by 2 hidden dense layers with 2048 neurons each (each followed by a dropout layer with a rate of 0.2), and a final linear output layer with the same dimensionality as the input. The output of this layer, for the trained G : X → Y, produces the post-specialized embeddings (i.e., a batch of 32 FastText embeddings produces 32 post-specialized embeddings). The hidden layers employ the ReLU (Nair and Hinton, 2010) activation function. Our discriminators have a similar structure (an input layer and 2 hidden layers with dropout, but at a rate of 0.3); however, the second hidden layer is followed by a batch normalization layer, and the output is a single neuron with a sigmoid activation. The batch normalization layer serves to stabilize training. Following (Tripathy et al., 2018), we also utilize a third and fourth conditional discriminator to leverage the cyclic architecture on paired data.
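As a back-of-the-envelope check of the generator's size, the layer sizes above imply the following parameter count, assuming 300-dimensional input embeddings (FastText Common Crawl vectors are 300-d; dropout layers add no parameters):

```python
# Parameter count for the generator described above: input -> 2048 ->
# 2048 -> linear output. The 300-d input dimension is an assumption
# based on the FastText CC vectors used in the paper.

def dense_params(n_in, n_out):
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out

generator_params = (
    dense_params(300, 2048)     # hidden layer 1
    + dense_params(2048, 2048)  # hidden layer 2
    + dense_params(2048, 300)   # linear output layer
)
# 616,448 + 4,196,352 + 614,700 = 5,427,500 parameters
```

So each generator is a lightweight model of roughly 5.4M parameters, cheap to train relative to contextual-embedding alternatives.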
A novelty in RetroGAN is the combination of cyclic and non-cyclic optimization objectives: the regular adversarial loss for both GANs (L_GAN); the cyclic loss for both generators (L_CYC); the identity loss for both generators (L_ID); the max-margin loss, similar to (Weston et al., 2011; Ponti et al., 2018), for both generators and additionally for the cycle of generators (L_MM); and the conditional cycle-consistency loss (L_cCYC) introduced in (Tripathy et al., 2018). The combined objective has the following form:

L(G, F, D_X, D_Y, D_cX, D_cY) = L_GAN(G, D_Y) + L_GAN(F, D_X) + λ L_CYC(G, F) + γ L_ID(G, F) + L_MM(G, F) + ς L_cCYC(G, F, D_cX, D_cY)   (1)

where G : X → Y is the generator that maps the source domain X of plain word embeddings to the target domain Y of retrofitted word embeddings; F : Y → X is the generator that does the opposite; D_X and D_Y are the discriminators for the corresponding domains; and D_cX, D_cY are our cycle conditional discriminators. For brevity, we only go into detail on L_MM and L_cCYC. The other losses are the standard ones found in their respective works: L_GAN is the adversarial loss from (Goodfellow et al., 2014); L_CYC is the cycle-consistency loss from (Zhu et al., 2017), with a scaling factor of λ (which we set to 1); and L_ID is the identity loss from (Zhu et al., 2017), which we scale with γ (which we set to 0.01). L_ID serves as a check of whether the embedding is already in the correct domain. L_MM is the max-margin loss with random confounders as used by (Ponti et al., 2018), and, as a novel aspect, we add a cyclic max-margin loss:

L_MM^cyc(G, F) = Σ_i max(0, δ − cos(F(G(x_i)), x_i) + cos(F(G(x_i)), x_j))   (2)

where x_j is a randomly sampled confounder and δ is the margin. Intuitively, Equation 2 tries to make generated embeddings similar to their gold standard and different from confounders; RetroGAN further enforces this constraint across the cycle. Lastly, we have L_cCYC, the conditional cycle loss of (Tripathy et al., 2018) 2 , which we scale with ς (set to 1).
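A toy sketch of the one-way max-margin idea described above (illustrative vectors and margin value; not the paper's exact batched formulation):

```python
# Max-margin loss with a random confounder, in the spirit of L_MM: the
# generated embedding should be closer (by cosine) to its gold
# retrofitted target than to a confounder, by at least the margin.
# Vectors and margin here are toy assumptions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def max_margin(generated, gold, confounder, margin=1.0):
    """Hinge penalty: zero once gold is sufficiently closer than the confounder."""
    return max(0.0, margin - cosine(generated, gold) + cosine(generated, confounder))

gen = [1.0, 0.1]    # generator output
gold = [1.0, 0.0]   # gold retrofitted target (near the output)
conf = [-1.0, 0.0]  # random confounder (far from it)
loss = max_margin(gen, gold, conf)  # gold wins by a wide margin -> 0.0
```

The cyclic variant in Equation 2 applies the same hinge to the cycled vector F(G(x_i)), with the original x_i as the gold target.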

Experimental Setup
To train our system, we utilize the ADAM (Kingma and Ba, 2015) optimizer with a learning rate of 5e-5 for the generators and 1e-4 for the non-conditional discriminators. We do not train the discriminators used in the regular GAN loss, and instead train the ones in the conditional cycle-consistency loss. We also note that we did not perform explicit fine-tuning of the scaling parameters, but we will do so in future work through a grid search. We train for 312,500 mini-batches, which is equivalent to the AuxGAN training, using a batch size of 32.
In our tests, we use the English Common Crawl FastText embeddings with sub-word information (FT-CC) and Numberbatch 19.08 (NB), the latter to see how performance would be affected by using embeddings that were already retrofitted with a large KB. We ran the Attract-Repel (Mrkšić et al., 2017) 3 procedure on all these embeddings and then performed our post-specialization tests on learning the mapping from FT-CC to the resulting retrofitted embeddings.
We ran the following word-similarity benchmarks: SimLex (SL) (Hill et al., 2015), SimVerb (SV) (Gerz et al., 2016), and the Cambridge Rare Word dataset (C660) (Pilehvar et al., 2018). We utilize the Disjoint (evaluating words that were not seen in the constraints) and Full (evaluating words that were seen in the constraints) settings from (Ponti et al., 2018) for SL and SV, and evaluate C660 in the Full setting to test performance on rare words. The maximum values for the similarity benchmarks during training are listed in table 2.
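These benchmarks score a system by the Spearman correlation between human similarity ratings and the model's cosine similarities over the same word pairs. A minimal sketch of that protocol (toy scores only; the real benchmarks use hundreds of rated pairs):

```python
# Word-similarity evaluation sketch: rank the human ratings and the
# model's cosine similarities, then compute the Pearson correlation of
# the ranks (Spearman's rho). Scores below are illustrative toys.

def ranks(values):
    """1-based average ranks (ties get the mean of their positions)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(human, model):
    rh, rm = ranks(human), ranks(model)
    n = len(rh)
    mh, mm = sum(rh) / n, sum(rm) / n
    cov = sum((a - mh) * (b - mm) for a, b in zip(rh, rm))
    vh = sum((a - mh) ** 2 for a in rh) ** 0.5
    vm = sum((b - mm) ** 2 for b in rm) ** 0.5
    return cov / (vh * vm)

human_scores = [9.0, 7.5, 3.0, 1.0]  # gold SimLex-style ratings (toy)
model_scores = [0.8, 0.6, 0.2, 0.1]  # embedding cosine similarities (toy)
rho = spearman(human_scores, model_scores)  # perfectly monotone -> ~1.0
```

Only the rank order matters, which is why a post-specialized space can score well even when absolute distances shift.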
We trained the publicly available AuxGAN model for 10 epochs of 1M iterations each (where an iteration in AuxGAN is a single embedding pair rather than a batch of pairs), with both plain stochastic gradient descent (SGD) and ADAM (learning rate of 0.1), and selected the best-performing one (ADAM).

2 In future work we will additionally incorporate the paired conditional adversarial loss.
3 We use the default settings found in https://github.com/nmrksic/attract-repel

Results & Discussion
RetroGAN outperforms AuxGAN on the majority of similarity benchmarks. We note that RetroGAN sets the state of the art on the rare-words benchmark (C660); previously, to the best of our knowledge, the best results were 0.543 and 0.55 (Yang et al., 2019; Fukuda, 2020). In the similarity results for Full, we make the same observations that were noted for AuxGAN: there are some inconsistent gains and losses, which may be due to the combination of loss functions making the systems imprecise; although they spread the knowledge throughout the embeddings, they lose some precision compared with the original retrofitted embeddings. The results for lexical simplification (Light-LS) can be seen in table 4, where RetroGAN dominates.

We wanted to compare the out-of-knowledge (OOK) performance in more depth. To do this, we joined the words in SimLex (SL) and SimVerb (SV) and selected increasingly larger portions of them ({5, 10, 25, 50, 75, 100}%). We then selected the constraints that included these words, trained RetroGAN and AuxGAN with these constraints, and evaluated performance on SL, SV, and C660. Part of this can be seen in table 3. We see that RetroGAN's performance increases every time new constraints are added, whereas AuxGAN's performance begins to peak after 25% of the constraints, which may indicate more efficient knowledge distribution thanks to the cyclic system. At higher percentages, RetroGAN's performance kept increasing but remained below that of the base retrofitted embeddings, possibly because of the lack of precision from the combination of losses.

Lastly, we performed a small ablation study (Appendix A) on RetroGAN's losses. We note that the max-margin loss from (Ponti et al., 2018) is necessary for high performance in all the tests. We also notice that the cyclic losses (the cyclic max-margin and cycle conditional discriminator losses) are essential for improved performance on the OOK and rare-word similarity benchmarks.
We also see that removing the cyclic max-margin loss speeds up early learning, while adding it stabilizes later learning, which may indicate a need to balance the two effects. Future work will explore how to balance these losses; one possibility is a scheduler that enables the loss after the early peak. More details on the ablation study can be found in Appendix A.

Conclusion
This work presents an improvement on post-specialization work through the use of a CycleGAN-like system called RetroGAN. We show that RetroGAN gives improved performance in both the Full (words that were seen in the knowledge/constraints) and the Disjoint (words that were not seen in the constraints) evaluation settings for three benchmarks. It additionally performs better on a downstream lexical simplification task, further confirming its improved generalization ability. We conclude that RetroGAN is an improved system for post-specializing embeddings for rare and OOK concepts. 4

Acknowledgements
This work was made possible thanks to the Media Lab Consortium funding.

A Ablation Tests
We performed a small ablation study to examine how the multiple losses in RetroGAN affect performance. A one-by-one removal of these losses can be seen in figures 2(a), 2(b), 2(c), 2(j), 2(k), and 2(l). A toggle of each of the losses can be seen in figures 2(a), 2(b), 2(c), 2(j), 2(k), and 2(l). The difference between the toggle and the one-by-one removal is that in the toggle we simply turn off the specified loss and leave the others untouched, whereas in the one-by-one removal we turn off the losses one at a time; in this way we can see both the individual effects and the group effects. We evaluated FT-CC and the Attract-Repel retrofitted FT-CC in the same scenarios as the earlier evaluations (Disjoint and Full). We note that the Disjoint setting for CARD-660 includes some of the words in the constraints.

The max-margin loss utilized by (Ponti et al., 2018) (the one-way max-margin loss) is essential for high performance on the datasets. Without this loss, in all of the figures, the scores in all our tests fall by at least 0.1. This is seen in both the toggle and the one-by-one case. We can also see that the cyclic version of this loss (the cycle max-margin loss) slows down learning initially but stabilizes it in later iterations: removing it yields higher performance in earlier iterations, but performance decays as more iterations are given. This may be because the loss tries to enforce that the semantic components of the embeddings be similar after going through the cycle, which may be a hard objective to achieve. This loss is especially useful for the rare-word CARD-660 evaluation. Looking at the toggle ablation test for this loss, we can see that removing it can indeed lead to better early performance, but that performance decays with time.
The identity loss (id loss) helps to stabilize training in later iterations. Removing this loss significantly affects the Disjoint settings; the reason may be that it gives some indication of the important semantic components of the vectors being post-specialized, and in the Disjoint setting its removal leads to significant performance reductions in later iterations. Interestingly, toggling off only this loss leads to better performance, which suggests that, alongside all the other losses, it may carry redundant information that hinders performance; however, when the model must rely on it without the other losses, its information is useful.
The cycle conditional discriminator loss (cycle discriminator loss) also contributes to the stability and generalization of later learning. Removing this loss does not improve early learning, except on the CARD-660 dataset, and in most of the other tests there is no large noticeable difference. However, in the Disjoint setting we do see that performance decays in later iterations. We suspect the conditioning helps slightly with the stabilization and generalization of the system, but its effect is modest.
The cycle loss (cycle mae loss) also stabilizes and helps the generalization of our system. We can see, in the Disjoint settings in particular, that its removal hinders the model in later iterations. We suspect that when consistency is not enforced, the model does not effectively learn to preserve important, possibly non-semantic, parts of the distributional and retrofitted domains.
As a practical recommendation, we suggest removing the cyclic max-margin loss either completely (pausing training early at its peak, around 50k-100k iterations) or toggling it on after this initial training to get both the speedup and the generalization. Another practical recommendation is to disable the identity loss on its own. The other losses can be maintained as described in this work.

B Out-of-knowledge Scalability Tests
In table 5 we test the performance of post-specialization as more constraints are added into the retrofitting process. We note that AuxGAN's performance saturates after 50%, whereas RetroGAN keeps learning, albeit less accurately than the retrofitting system. These tests were run for 100k batches on RetroGAN and for 10M iterations (312,500 RetroGAN batches) on AuxGAN.
C Additional Embedding Pre-processing

Input and output vectors are divided by their Euclidean (L2) norm. This helps slightly with performance on the semantic-comparison benchmarks. No other pre-processing is done on the vectors.
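This pre-processing step amounts to the following minimal sketch:

```python
# Divide each vector by its Euclidean (L2) norm so all embeddings are
# unit length; zero vectors are passed through unchanged.
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

v = l2_normalize([3.0, 4.0])  # a 3-4-5 triangle normalizes to [0.6, 0.8]
```

After normalization, cosine similarity reduces to a plain dot product, which simplifies the similarity evaluations.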

D Architecture Details
In figure 1, we can see the architecture that RetroGAN uses. In a training step, the losses are calculated as follows. For the cyclic losses, the system samples embeddings from the distributional embeddings and their retrofitted counterparts, and these samples are passed to the generators (1, 5 in the figure). Then, each generator's output is passed to the counterpart generator (the distributional generator passes to the retrofitted generator and vice versa; 3 in the figure). The output of this is used to calculate the max-margin loss and is passed on to the subsequent discriminator to calculate the cycle discriminator loss (2, 4 in the figure). In addition, after going through the cycle of generators (1 or 5, then 3 in the figure), we train the conditional discriminators by conditioning on real inputs from the retrofitted or distributional embeddings, or by conditioning on fake inputs (6, 7 in the figure).
The number of parameters in each model, and the layers, can be found in table 6.

E Parameter Tuning
We performed parameter tuning using the Ray Tune library to try to generate a configuration that would be optimal for RetroGAN. We utilized the ASHA scheduler (Li et al., 2020) along with the following search-space configuration:
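As a purely illustrative sketch of what such a search space looks like (all parameter names and ranges below are hypothetical assumptions, not the paper's actual configuration; plain random sampling stands in for the Ray Tune API for brevity):

```python
# HYPOTHETICAL search space for a RetroGAN-like system. Names and
# ranges are assumptions for illustration only; Ray Tune would consume
# a similar dictionary of distributions and hand sampled configs to
# trials managed by the ASHA scheduler.
import random

SEARCH_SPACE = {
    "generator_lr":     lambda: 10 ** random.uniform(-5, -3),  # log-uniform
    "discriminator_lr": lambda: 10 ** random.uniform(-5, -3),
    "cycle_lambda":     lambda: random.uniform(0.1, 10.0),
    "identity_gamma":   lambda: random.uniform(0.001, 0.1),
    "batch_size":       lambda: random.choice([32, 64, 128]),
}

def sample_config(space):
    """Draw one trial configuration from the search space."""
    return {name: draw() for name, draw in space.items()}

config = sample_config(SEARCH_SPACE)
```

Under ASHA, many such sampled trials are launched and the poorly performing ones are halted early, concentrating compute on promising configurations.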