Counter-Contrastive Learning for Language GANs

Generative Adversarial Networks (GANs) have achieved great success in image synthesis, but have proven difficult to apply to natural language generation. Challenges arise from the uninformative learning signals passed from the discriminator: these poor signals limit the capacity to generate language with rich structure and semantics. In this paper, we propose a counter-contrastive learning (CCL) method to support the generator's training in language GANs. In contrast to standard GANs, which adopt a simple binary classifier to discriminate whether a sample is real or fake, we employ a counter-contrastive learning signal that advances the training of language synthesizers by (1) pulling the representations of generated and real samples together and (2) pushing apart representations of real samples, which competes with the discriminator and thus prevents it from being overtrained. We evaluate our method on both synthetic and real benchmarks and obtain competitive performance compared to previous GANs for adversarial sequence generation.


Introduction
Unsupervised text generation has achieved great success in a wide range of applications, from dialogue generation to machine translation (Wu et al., 2016). Common approaches train language models by maximizing the log-likelihood of the tokens of discrete sequences given historical observations. Nevertheless, language models trained with maximum likelihood estimation (MLE) suffer from the exposure bias issue (Bengio et al., 2015): a distributional shift between the input sequences seen during training and those encountered at inference.
Generative Adversarial Networks (GANs) hold promise for training language models as an alternative to MLE. GANs learn to sample during training and thus avoid the exposure bias issue; their aim is to train a language generator that fools a discriminator trained to distinguish fake data from real samples.
Previous innovations adopt various approaches to enhance the learning signals for generators, such as leaking information from the discriminator to the generator (Guo et al., 2017), directly matching the fake data distribution to that of real data (Chen et al., 2018), learning to rank samples out of a collection of curated samples (Lin et al., 2017; Zhou et al., 2020), and leveraging more powerful generator architectures to learn representations (Nie et al., 2019). However, the training of language GANs is far from fully solved. Inspired by the recent success of contrastive learning (Chen et al., 2020) in learning effective representations, we propose a counter-contrastive learning objective to aid the adversarial learning of sequence generators in language GANs. Conventional contrastive learning methods pull positive samples together and push positive samples away from negative ones. In contrast, our counter-contrastive learning (CCL) method (1) pulls generated samples and real samples together (to generate real-looking data) and (2) pushes real samples apart (to hinder the training of the discriminator). Empirical results on both synthetic and real datasets demonstrate competitive performance compared with previous language GANs and validate the effectiveness of our method.

Language GANs
Language GANs have attracted extensive interest due to their ability to mitigate the exposure bias issue. The objective of language GANs is to train a language generator G(z; θ^(G)) that can output real-looking text samples resembling those in the training data distribution p_data(x). In the game-theoretic metaphor, language GANs consist of a generator and a discriminator playing a two-player minimax game. The generator network decodes the randomly initialized starting token z into the language sequence G(z; θ^(G)), where the training signal is provided by the discriminator network D(x; φ^(D)), which is trained to distinguish between samples drawn from the real data distribution p_data and those produced by the generator. In this paper, we adopt the relativistic discriminator loss (Jolicoeur-Martineau, 2018) as the training objective:

L_D = −E_{x_r∼p_data, x_f∼G}[log σ(D(x_r) − D(x_f))],  (1)

where σ(·) is the sigmoid function, x_r is a real sample, and x_f is a generated sample.
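As a concrete illustration, the relativistic objective can be sketched in NumPy, assuming `d_real` and `d_fake` are the discriminator's raw (pre-sigmoid) scores on batches of real and generated samples; this is a minimal sketch, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relativistic_d_loss(d_real, d_fake):
    # Discriminator wants real samples to score higher than fake ones.
    return -np.mean(np.log(sigmoid(d_real - d_fake)))

def relativistic_g_loss(d_real, d_fake):
    # Generator optimizes the symmetric objective: fake scores above real.
    return -np.mean(np.log(sigmoid(d_fake - d_real)))
```

When the discriminator cannot separate real from fake (equal scores), both losses sit at log 2; as real scores pull ahead, the discriminator loss drops while the generator loss grows.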
A large variety of language GANs resort to reinforcement learning (RL) heuristics with Monte Carlo search to gather update rewards from the discriminator. The instability of RL training is further compounded by the reward sparsity problem. Existing work (Kusner and Hernández-Lobato, 2016; Nie et al., 2019) demonstrated that the Gumbel-Softmax relaxation (Maddison et al., 2014) is effective in language GANs, so we use the Gumbel-Softmax reparameterization instead of policy gradients in our experiments.

Contrastive Learning
Contrastive learning aims at learning informative representations by pulling together positive neighbors and pushing away non-neighbors (Hadsell et al., 2006). Assume a set of paired examples D = {(x_i, x_i^+)}_{i=1}^N, where x_i and x_i^+ form a positive pair. Letting h_i and h_i^+ denote the representations of x_i and x_i^+, the contrastive training objective for each pair is:

ℓ_i = −log [ exp(sim(h_i, h_i^+)/τ) / Σ_{j=1}^N exp(sim(h_i, h_j^+)/τ) ],  (2)

where τ is the temperature scalar and sim(·) is the cosine similarity operator.
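The in-batch objective, where the other positives in the mini-batch serve as negatives (as in Chen et al., 2020), can be sketched in NumPy; `h` and `h_pos` are assumed to be row-wise representation matrices of the paired examples:

```python
import numpy as np

def normalize(h):
    # Row-wise L2 normalization so dot products become cosine similarities.
    return h / np.linalg.norm(h, axis=1, keepdims=True)

def contrastive_loss(h, h_pos, tau=0.5):
    # sim[i, j] = cosine similarity between h_i and h_j^+, scaled by temperature.
    sim = normalize(h) @ normalize(h_pos).T / tau
    # Log-softmax over each row; matched pairs sit on the diagonal.
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))
```

Aligned pairs on the diagonal give a low loss; shuffling the positives against their anchors raises it, which is exactly the signal Eq. (2) optimizes.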

Counter-Contrastive Learning
In language GANs, the discriminator is prone to being overtrained, while the generator struggles to obtain sufficient information for its updates. To mitigate this issue, we propose a counter-contrastive learning (CCL) objective that not only provides comparative learning signals between real and fake samples but also prevents the discriminator from being trained too quickly. Constructing positive and negative samples is crucial to our method. For the positives, we build positive pairs by applying disparate dropout masks to real texts sampled from p_data. Specifically, for the same real sentence x_i, we obtain a pair of representations by feeding it into the discriminator twice with two different random dropout masks:

h_i = f(x_i; m_i),  h_i^+ = f(x_i; m_i^+),

where m denotes a dropout mask and f is the encoder of input sentences. In our experiments, we take the hidden representation of the last-but-one feed-forward layer in the discriminator as the representation h_i for each input sentence x_i.
With different dropout masks, we obtain the representations of positive pairs (h_i, h_i^+). For negative samples, we randomly select fake sentences generated by the generator network and feed them into the discriminator to obtain fake representations. We then pair a positive representation with a fake one to construct negative pairs (h_i, h_i^−). Given a mini-batch of size N, we formulate the counter-contrastive learning objective as:

L_CCL = −(1/N) Σ_{i=1}^N log [ exp(sim(h_i, h_i^−)/τ) / ( exp(sim(h_i, h_i^−)/τ) + exp(sim(h_i, h_i^+)/τ) ) ],  (3)

where τ is the constant temperature. Intuitively, this CCL objective (1) forces the fake representations to approach the real data (the numerator) and (2) prevents the discriminator from learning effective representations of positive pairs by pushing semantically close pairs apart (the right term in the denominator).
In contrast to contrastive learning, which pulls together positive neighbors, our CCL objective draws together fake and real samples (so that the generator imitates real sentences) and pushes real samples apart (to fool and hinder discriminator training, thereby preventing it from converging too fast).

Training Language GANs
When training the language GANs, we keep the training objective in Eq. (1) unchanged and additionally update the generator with Eq. (3) after its conventional update.
Algorithm 1 illustrates the overall training process of the proposed framework. The discriminator and the generator reach a Nash equilibrium when the generator can fool the discriminator into accepting its output as real. Since the discriminator network is prone to overtraining, we do not pretrain it; we only pretrain the generator using MLE for a few epochs.
Algorithm 1 Adversarial Training of CCL.
1: Require: generator G_θ; discriminator D_φ; samples of real data S; generator training step g; discriminator training step k; generator pretraining epochs l.
2: Pretrain G_θ using MLE on S for l epochs
3: repeat
4:    for g steps do: update G_θ with Eq. (1), then with the CCL objective in Eq. (3)
5:    for k steps do: update D_φ with Eq. (1)
6: until convergence

Experiments

Datasets We evaluate our method on the following datasets:

• Synthetic data, generated by an oracle single-layer LSTM. We use a randomly initialized single-layer LSTM as the oracle and generate 10,000 discrete sequences of length 20 and 40, respectively, as the training and test sets.
• Real data. We use the MS COCO Image Captions dataset (Chen et al., 2015) and the EMNLP2017 WMT News dataset.

Evaluation Metrics For synthetic data, we use NLL_oracle and NLL_gen to evaluate quality and diversity, respectively. Given the real data distribution p_data and the fake data distribution p_θ, NLL_oracle measures the negative log-likelihood (NLL) of generated samples y_{1:T} under the oracle distribution p_data, while NLL_gen calculates the NLL of real samples r_{1:T} under the generated data distribution p_θ.
For real data, it is infeasible to obtain an oracle to compute NLL_oracle. We instead apply BLEU scores (Papineni et al., 2002) to evaluate sample quality, with the test data serving as the reference. In addition, NLL_gen is adopted to evaluate the diversity of generated samples.
Model Architecture For the generator network, we apply the Relational Memory Core (Santoro et al., 2018) with a memory size of 256, 1 memory slot, and 2 attention heads. The input embedding dimension is set to 32. For the discriminator network, we use multi-channel convolutional networks with filters of various window sizes to extract distinct n-gram features. The input embedding dimension for the discriminator is set to 64. The filter sizes are {2, 3, 4, 5} with 300 channels each. A max-over-time pooling operation and a fully connected layer follow the convolutional layers.
Optimization We apply the Adam optimizer with β1 = 0.9 and β2 = 0.999. The initial learning rate is set to 1e-2 for generator pretraining, 1e-4 for the generator during adversarial training, and 1e-4 for the discriminator during adversarial training. Gradients are clipped so that their L2 norm does not exceed 5.
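The clipping step can be sketched as follows; `clip_gradients` is a hypothetical helper (not from the original implementation), and we assume global-norm clipping over the list of per-parameter gradients:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    # Rescale the full gradient list so its global L2 norm is at most max_norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]
```

Gradients already within the threshold pass through unchanged; larger ones are rescaled, not zeroed, which preserves their direction.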
Training Settings The following hyperparameters are tuned: batch size in {32, 64, 128} and the CCL temperature τ ∈ {0.2, 0.5, 1}. The training steps for the generator and discriminator are set to g = 1 and k = 5, respectively. We pretrain the generator for 150 epochs before adversarial training. The optimal batch size is 128 for both synthetic and real datasets. All experiments are conducted on an Nvidia Titan RTX GPU with 5 different random seeds.

Results

For synthetic data, we evaluate the generated sequences w.r.t. both quality and diversity. We use the oracle LSTM to evaluate the negative log-likelihood of our generated samples (denoted NLL_oracle) to measure quality, and the negative log-likelihood of the synthetic dataset measured by the generator during training (denoted NLL_gen). We also report the best NLL_oracle + NLL_gen to evaluate the trade-off between quality and diversity. Our model outperforms baseline models in terms of quality (measured by NLL_oracle) and the quality-diversity trade-off (measured by NLL_oracle + NLL_gen), and achieves or matches the competitive results of the baselines w.r.t. diversity (indicated by NLL_gen). On real data, our model improves sample quality (indicated by BLEU scores) while maintaining diversity (indicated by NLL_gen); Table 4 shows the same trend on the EMNLP2017 WMT News dataset.

Analysis
Ablation Test To further verify the benefits of our method, we conduct an ablation test by removing the CCL update on MS COCO image captions. The results show that removing the CCL update degrades performance.

Comparison between Generated Samples For a fair comparison, we select the generated sentences that contain the word "cat" from samples produced by models with and without the CCL method (see Table 6). We observe that GANs with CCL tend to produce sentences with better diversity. For example, given the structure "a cat is sitting on top of a car", models with CCL enrich it with different modifier words, whereas after removing CCL the model duplicates words such as "sitting" regardless of their repetitive usage.

Related Work
A variety of language GANs integrate the RL paradigm into GANs. SeqGAN (Yu et al., 2017) first formulates text generation as a Markov decision process and trains the language generator with the policy gradient algorithm. RankGAN (Lin et al., 2017) and SAL (Zhou et al., 2020) enrich the restrictive signals by ranking constructed pairs. LeakGAN (Guo et al., 2017) leaks features from the discriminator to promote generator training. Another line of previous work either approximates categorical sampling or optimizes over continuous representations, such as Gumbel-Softmax GAN (Kusner and Hernández-Lobato, 2016), TextGAN, FMGAN (Chen et al., 2018), and RelGAN (Nie et al., 2019).
Our work integrates the prevalent contrastive learning approach into generator training, in line with methods using comparative signals or ranking classifiers, such as RankGAN and SAL. From the perspective of feature matching, the counter-contrastive learning objective can be viewed as a contrastive signal that draws together fake and real sample representations.

Conclusion
In this paper, we introduced a counter-contrastive learning objective to advance the training of language GANs. It pulls the representations of generated and real samples together to promote generator training, and pushes real sample pairs apart to suppress discriminator training as a competitor. Future work includes extending the counter-contrastive learning method to other text generation tasks such as machine translation and dialogue generation.