MTAdam: Automatic Balancing of Multiple Training Loss Terms

When training neural models, it is common to combine multiple loss terms. The balancing of these terms requires considerable human effort and is computationally demanding. Moreover, the optimal trade-off between the loss terms can change as training progresses, e.g., for adversarial terms. In this work, we generalize the Adam optimization algorithm to handle multiple loss terms. The guiding principle is that for every layer, the gradient magnitude of the terms should be balanced. To this end, the Multi-Term Adam (MTAdam) computes the derivative of each loss term separately, infers the first and second moments per parameter and loss term, and calculates a first moment for the magnitude per layer of the gradients arising from each loss. This magnitude is used to continuously balance the gradients across all layers, in a manner that both varies from one layer to the next and dynamically changes over time. Our results show that training with the new method leads to fast recovery from suboptimal initial loss weighting and to training outcomes that match or improve conventional training with the prescribed hyperparameters of each method.


Introduction
In both supervised and unsupervised learning, adding loss terms often leads to improved performance. However, as more loss terms are added, the space of possible balancing weights increases exponentially, and more resources need to be allocated to identify good configurations that would justify the added terms. Another common challenge is that at different stages of the training process, the optimal balance may change (Mescheder et al., 2017). It is, therefore, necessary to have the balancing terms update dynamically during training, which further increases the hyperparameter space.
In this work, we introduce Multi-Term Adam (MTAdam), an optimization algorithm for multi-term loss functions. MTAdam extends Adam (Kingma and Ba, 2014) and allows an effective training of an unweighted multi-term loss objective. Thus, MTAdam can streamline the computationally demanding task of hyperparameter search, which is required for effectively weighting multi-term loss objectives.
At every training iteration, a dynamic weight is assigned to each of the loss terms, based on the magnitude of the gradient that each term entails. The weights are assigned in a way that balances these gradients, equating their magnitudes.
This, however, would be an ineffective balancing method without two crucial components: (i) the balancing needs to occur independently for each layer of the neural network and separately over the losses, since the relative contributions of the losses vary depending on the layer, and (ii) the update step needs to take into account the maximal variance among all losses, to support sufficient exploration in regions of the parameter space in which one of the losses becomes more sensitive.
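The per-layer balancing idea can be illustrated with a minimal numpy sketch. This is a simplification of component (i) above: it rescales each term's per-layer gradients instantaneously to the magnitude of the first term, whereas MTAdam (Sec. 4) uses moving averages of the magnitudes; the function name and data layout are ours.

```python
import numpy as np

def balance_per_layer(grads_per_term, eps=1e-12):
    """Scale each loss term's per-layer gradients so that, within every
    layer, all terms match the magnitude of the first (anchor) term.
    grads_per_term[i][l]: gradient of loss term i w.r.t. layer l."""
    anchor_norms = [np.linalg.norm(g) for g in grads_per_term[0]]
    balanced = []
    for grads in grads_per_term:
        balanced.append([
            g * (a / (np.linalg.norm(g) + eps))
            for g, a in zip(grads, anchor_norms)
        ])
    return balanced

# Two loss terms, two layers; term 1's gradients are 100x larger,
# and the imbalance differs from one layer to the next.
g = [[np.array([3.0, 4.0]), np.array([1.0, 0.0])],
     [np.array([300.0, 400.0]), np.array([0.0, 50.0])]]
b = balance_per_layer(g)
# After balancing, both terms have the anchor's magnitude in every layer.
```

Note that the rescaling preserves each term's gradient direction; only the per-layer magnitude is equalized.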
The main focus of our experiments is in the domains of natural language understanding and conditional adversarial image generation, in which the leading methods often utilize multiple loss terms. Our results show that MTAdam is able to recover from suboptimal starting points, in which the weight parameters are set inappropriately, while Adam and other baseline methods cannot.

Related Work
SGD with momentum (Nesterov) (Rumelhart et al., 1986) is an optimization algorithm that extends SGD by updating the network's parameters with a moving average of the gradients, rather than with the gradient at each step. Root Mean Square Propagation (RMSProp) (Tieleman and Hinton, 2012) extends SGD by dividing the learning rate, during the backward step, by the moving average of the second moment of each parameter's gradient.
Adam (Kingma and Ba, 2014) combines both principles: it maintains the first and second moments of the gradient of each learned parameter and applies them during the backward step. Adam has become a dominant optimizer that is applied across many applications and, in particular, is the de-facto standard in the field of adversarial training. It is known both for improving convergence and for working well with the default values of its hyperparameters.
Multiple methods have been suggested for selecting hyperparameters (Srinivas et al., 2009; Wang et al., 2016; Bergstra et al., 2011; Feurer et al., 2014). Hyperband (Li et al., 2017) casts hyperparameter search as an infinite-armed bandit problem, utilizing a predefined amount of resources while searching for the configuration of hyperparameters that maximizes the given success criterion. It can provide an order-of-magnitude speedup compared to Bayesian optimization (Bergstra et al., 2011; Snoek et al., 2012; Hutter et al., 2011).
In our experiments, we employ multiple methods from natural language understanding and the field of image generation. The methods vary in the type of supervision employed or the task being solved: (i) BERT (Devlin et al., 2019) is a pretrained transformer-based (Vaswani et al., 2017) language model that is fine-tuned on various language understanding tasks using supervision. (ii) RecoBERT (Malkiel et al., 2020a) is a text-similarity model that fine-tunes BERT on a given catalog by jointly optimizing a standard masked language modeling loss along with a unique title-description model. (iii) pix2pix (Isola et al., 2017) generates an image in domain B based on an input image in domain A, after observing matching pairs during training. (iv) CycleGAN (Zhu et al., 2017) performs the same task, while training in an unsupervised manner on unmatched images from the two domains. (v) SRGAN (Ledig et al., 2017a) generates high-resolution images from low-resolution ones and is trained in a supervised way.

Motivation: Multi-loss Dynamics
To examine the challenges that arise when the losses are not scaled correctly, we consider a simple regression problem that involves the L1 or the L2 loss. Specifically, we fit two coefficients to learn the polynomial f(x) = ax + bx^2. To this end, we randomly sample batches of x ∈ R from a normal distribution and train a network to minimize the loss: the network performs a dot product between (x, x^2) and the vector of parameters W ∈ R^2, and the objective compares the prediction W^T(x, x^2) with the target f(x). We track and analyze the loss and its gradients with respect to the two parameters. We run three Adam experiments with the values a = 1.52, b = −0.29 (picked at random; results are very similar for other values). Each time, we train the model with a different objective: L1, L2, and L1 where the gradients are scaled with the magnitudes of the L2 gradients. The latter utilizes the L1 loss and scales the gradient magnitudes of the coefficients to match the magnitudes created by the L2 loss, i.e., during the backward propagation step the L1 gradient is multiplied by ||∂L2/∂W||_2 / ||∂L1/∂W||_2. As shown in Fig. 1 (log scale), the L2 loss converges to a lower loss than L1, since the latter oscillates around 10^−2. The underlying reason is that the magnitude of the gradients of the two coefficients w.r.t. the L2 loss constantly decreases during training, while the magnitude of the gradients w.r.t. the L1 loss remains unchanged (see appendix for a plot): for a residual t, the L1 gradient ∂|t|/∂t = sign(t) has constant magnitude, whereas the L2 gradients decrease as the model improves, since ∂t^2/∂t = 2t. Fig. 1 also shows that when scaled by the L2 gradient magnitude, the L1 training converges to the optimal solution. We, therefore, observe that the gradient magnitude of a loss that converges well can be used to scale the gradients of a loss that converges to a larger error. This observation implies that instead of finding a training schedule for a new loss, we can rely on an existing loss for scaling.
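The scaled-L1 experiment above can be sketched in a few lines of numpy. This is a hedged reproduction under our own assumptions: we use plain gradient steps rather than Adam, a fixed learning rate, and a fixed batch size, but the scaling rule is the one described in the text (L1 direction, L2 magnitude).

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.52, -0.29          # target polynomial f(x) = a*x + b*x^2
W = np.zeros(2)             # learned coefficients
lr = 0.05

for _ in range(5000):
    x = rng.normal(size=32)
    feats = np.stack([x, x * x], axis=1)        # (x, x^2) per sample
    resid = feats @ W - (a * x + b * x * x)     # prediction error
    g_l1 = feats.T @ np.sign(resid) / len(x)    # L1 gradient
    g_l2 = feats.T @ (2 * resid) / len(x)       # L2 gradient
    # Take the L1 direction, rescaled to the L2 gradient magnitude
    scale = np.linalg.norm(g_l2) / (np.linalg.norm(g_l1) + 1e-12)
    W -= lr * scale * g_l1
```

Because the L2 magnitude shrinks as the fit improves, the rescaled L1 steps shrink as well, so W approaches (a, b) instead of oscillating at a fixed amplitude.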
We further explore this in a scenario with multiple loss terms. It is often the case that new loss terms are added during training. However, one would like to avoid tuning each one separately. As a computationally efficient example, we consider supervised image super-resolution using two out of the three loss terms of Ledig et al. (2017b): MSE, a GAN loss which verifies that the result is in the target domain, and a perceptual loss. We consider the case in which either the GAN loss or the perceptual loss is weighted equally to the MSE (a weight that is larger by two orders of magnitude than the optimal) and observe, in Fig. 2, that this leads to suboptimal results. However, using the gradient magnitude of a balanced term fixes this and allows the method to train well. Thus, this motivating experiment demonstrates that we can balance a loss term by using the gradient magnitude of another term. Experiments with the three terms together are shown in Sec. 5.

Figure 2: We train a super-resolution image mapping technique using two out of three loss terms: MSE, GAN, and perceptual loss. (a) The GAN loss is well-balanced, and we obtain effective training when combining it with the MSE loss (blue curve). However, the weight of the perceptual loss was naively set to 1, which is two orders of magnitude larger than the optimal value for combining it with the MSE loss. This leads to poor SSIM (Wang et al., 2004) scores (black curve). However, when scaling the perceptual loss gradients by the magnitude of the gradients of the GAN loss, the perceptual loss supports the MSE loss (red curve). (b) Same, where the roles of the perceptual loss and the GAN loss are reversed. Similar FID (Heusel et al., 2017) graphs can be found in the appendix.

Method
The Adam algorithm optimizes a single stochastic objective function f_t(θ) over the set of parameters θ, where t is the index of the current mini-batch of samples. In contrast, MTAdam optimizes a set of such terms f^1_t(θ), ..., f^I_t(θ). While Adam's task is to minimize the expected value E_t[f_t(θ)] w.r.t. the parameters θ, MTAdam minimizes a weighted average of the I terms. The weights of this mixture are all positive but otherwise unknown. The guiding principle for determining the weights at each iteration t is that the moving average of the magnitude of the gradient of each term be equal across terms. This magnitude is evaluated and balanced at every layer of the network.
In Adam, two moments are continuously updated using a moving average scheme: m_t is the first moment of the gradient ∇_θ f_t and v_t is the second moment. Both are vectors of the same size as θ. The moving averages are computed using the mixing coefficients β_1 and β_2 for the two moments.
MTAdam records such moments for each term i = 1, ..., I separately. In addition, it uses a mixing coefficient β_3 in order to maintain a moving average of the gradient magnitude for each layer l, denoted by n^i_{l,t}. Adam borrows from the SGD with momentum method (Nesterov) and updates the vector of parameters based on the weighted first moment of the gradient. In MTAdam, the first moment is computed based on a weighted gradient, in which the gradients of each layer l for every term i are weighted such that their magnitude is normalized by the factor n^i_{l,t}. This way, across all layers and at every time point, the I terms contribute equally to the gradient step.
The Adam algorithm is depicted in the left side of Alg. 1 and MTAdam on the right. In line 1, MTAdam initializes I pairs of first and second moment vectors. This is similar to Adam, except for initializing a pair of moments for each loss term. In line 2, and different from Adam, MTAdam initializes I first moments for the magnitude of the gradients, per layer. In line 3, both MTAdam and Adam iterate over the stochastic mini-batches, performing T training steps.
In line 4, MTAdam iterates over the loss terms.
Algorithm 1: Adam (left) and Multi-term Adam (right). All operations are element-wise.

Input: In addition to Adam's parameters: β_3, the decay rate for the first moment of the gradient's norm, and f^1_t(θ), ..., f^I_t(θ), the stochastic loss term functions. Output: θ_t, the resulting parameters.

For each term, MTAdam calculates its gradients over each one of the layers (line 5), analogously to the way Adam computes the (single) gradient vector ∇_θ f_t(θ_{t−1}). In lines 6-8, MTAdam iterates over the layers, updates the moving average of the gradient magnitude for each layer l and loss term i, and normalizes the gradients of the current layer and loss term by multiplying them with n^1_{l,t} / n^i_{l,t}. This multiplication normalizes the magnitude of the current gradients of layer l and loss term i using the moving average n^i_{l,t}, leading all gradient magnitudes to be similar to that of the first loss term. This assigns a unique role to the first term, as the primary loss to which all other losses are compared. By linking the magnitude to that of a concrete loss, and not to a static value (e.g., normalizing to have a unit norm), we maintain the relationship between the training progression and the learning rate.
While line 5 has an analog in Adam, in MTAdam the computation of the gradients of each specific loss term i w.r.t. θ is done per layer. For a layer index l, the gradient is denoted by g^i_l. We denote by g^i the concatenation of all per-layer gradients. Lines 6-8 do not have an Adam analog.
The normalization iterates over the loss terms, and for each gradient, the first moment of the magnitude of the gradient is updated. The gradient magnitude is then normalized by that of the first loss term.
Then, in lines 9-12, MTAdam updates the first and second moments for each parameter and each loss term and computes their bias correction. This is similar to Adam, except that the moments are calculated separately for each loss term. In lines 13-15, MTAdam iterates over the loss terms and calculates the step arising from each term. The steps are added to θ_{t−1}, and the result is assigned to θ_t.
In Adam, the update size is normalized by the second moment. In MTAdam, we divide by the maximal second moment among all loss terms. This division allows MTAdam to make smaller gradient steps when lower certainty is introduced by at least one of the loss terms. The motivation is that even if only one of the losses is in a high-sensitivity region, where small updates create rapid changes to that term, then the step, regardless of the term which led to it, should be small. The importance of this maximization is demonstrated in the ablation study in Sec. 5.5.
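The full update can be sketched in numpy as follows. This is our reading of Alg. 1, written as a simplified single-step function (the released PyTorch implementation is authoritative); the state layout and function names are ours, and gradients are passed in explicitly rather than computed by autograd.

```python
import numpy as np

def init_state(params, num_terms):
    """Per-term moment buffers: m, v per parameter, n per layer."""
    zeros = lambda: [np.zeros_like(p) for p in params]
    return {"t": 0,
            "m": [zeros() for _ in range(num_terms)],
            "v": [zeros() for _ in range(num_terms)],
            "n": [[0.0 for _ in params] for _ in range(num_terms)]}

def mtadam_step(params, grads_per_term, state, lr=0.01,
                b1=0.9, b2=0.999, b3=0.9, eps=1e-8):
    """One MTAdam-style update. grads_per_term[i][l] is the gradient
    of loss term i w.r.t. the parameters of layer l."""
    I, L = len(grads_per_term), len(params)
    state["t"] += 1
    t = state["t"]
    m, v, n = state["m"], state["v"], state["n"]
    # Lines 6-8: update per-layer magnitude averages, rescale each term
    scaled = [[None] * L for _ in range(I)]
    for i in range(I):
        for l in range(L):
            g = grads_per_term[i][l]
            n[i][l] = b3 * n[i][l] + (1 - b3) * np.linalg.norm(g)
            scaled[i][l] = g * (n[0][l] / (n[i][l] + eps))
    new_params = []
    for l in range(L):
        m_hat, v_hat = [], []
        # Lines 9-12: per-term first/second moments with bias correction
        for i in range(I):
            m[i][l] = b1 * m[i][l] + (1 - b1) * scaled[i][l]
            v[i][l] = b2 * v[i][l] + (1 - b2) * scaled[i][l] ** 2
            m_hat.append(m[i][l] / (1 - b1 ** t))
            v_hat.append(v[i][l] / (1 - b2 ** t))
        # Lines 13-15: sum the steps; divide by the MAXIMAL second moment
        denom = np.sqrt(np.maximum.reduce(v_hat)) + eps
        new_params.append(params[l] - lr * sum(m_hat) / denom)
    return new_params
```

As a usage sketch, optimizing two quadratic terms whose gradients differ in scale by three orders of magnitude still converges, since the per-layer normalization equalizes their contributions.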

Memory and Run Time Analysis
Adam utilizes a pair of first and second moments for each learned parameter. Given a network with parameters θ, it has a memory complexity of O(|θ|). MTAdam utilizes I different pairs of first and second moments for each parameter, and also I first moments of the gradient magnitude for each layer. These two extensions bring the memory complexity to O(I|θ|). The time complexity also depends on the number of loss terms, i.e., O(I|θ|), since a backward pass is computed once per term. The dependence on the number of layers L in Alg. 1 can be absorbed in |θ|.
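The bookkeeping can be made concrete with a small counting sketch (function name and example sizes are ours): the per-layer magnitude buffers contribute only I·L scalars, which is negligible next to the 2·I·|θ| per-parameter moments, so the overhead over Adam is roughly a factor of I.

```python
def moment_buffer_counts(num_params, num_layers, num_terms):
    """Scalar slots held by each optimizer's state.
    Adam: first + second moment per parameter.
    MTAdam: the same pair per parameter *per term*, plus one
    per-layer magnitude average per term."""
    adam = 2 * num_params
    mtadam = 2 * num_terms * num_params + num_terms * num_layers
    return adam, mtadam

# E.g., |θ| = 1e6 parameters, L = 50 layers, I = 3 loss terms:
adam, mtadam = moment_buffer_counts(10**6, 50, 3)
# mtadam / adam ≈ I = 3, since the per-layer part (I*L) is negligible.
```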

Experiments
We compare the results of MTAdam with five baselines: (1-3) Adam, RMSProp, and SGD with momentum, applied with suboptimal weightings; (4) Hyperband (Li et al., 2017), applied to perform a hyperparameter search for the weighting parameters (lambdas); and (5) the Adam optimizer applied with the prescribed weighting. We note that baseline (4) benefits from running training multiple times, and that baseline (5) employs the weights proposed for each method after a development process that is likely to have included a hyperparameter search, in which multiple runs were evaluated by the developers of each method. For each optimization method, we employ the default parameters in PyTorch. For all Adam and MTAdam experiments, β_1 = 0.9 and β_2 = 0.999. For MTAdam, β_3 = 0.9.

MNIST classification
In order to turn MNIST into an experiment with a suboptimal combination of multiple loss terms, we compute the loss for each of the digits separately, creating ten losses, each weighted by a random weight drawn uniformly between 1 and 1000. The test loss is unweighted, which causes classes associated with lower weights to suffer from underfitting.
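The construction of the ten weighted terms can be sketched as follows (a hedged illustration; the function name and the per-class averaging are our assumptions about the setup, not the paper's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.uniform(1, 1000, size=10)   # one random weight per digit

def per_class_weighted_loss(log_probs, labels, weights):
    """Split the cross-entropy into ten per-digit terms, weight each.
    log_probs: (N, 10) log-softmax outputs; labels: (N,) digit labels."""
    nll = -log_probs[np.arange(len(labels)), labels]   # per-sample loss
    terms = [nll[labels == d].sum() / max((labels == d).sum(), 1)
             for d in range(10)]
    return sum(w * t for w, t in zip(weights, terms)), terms
```

MTAdam would receive the ten `terms` as separate losses, whereas the suboptimal Adam baseline minimizes the weighted sum directly.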
The architecture of the official PyTorch MNIST example is used: two convolutional layers, followed by two fully connected layers. The experiment is repeated 100 times; to save computation, Hyperband is not tested. The results in Tab. 1 show a clear advantage for MTAdam over the other suboptimally weighted alternatives.

BERT Experiments
To demonstrate the effectiveness of MTAdam in recovering from non-optimal weighting when applied to Transformers (Vaswani et al., 2017), we fine-tune BERT (Devlin et al., 2019) with a suboptimal combination of terms. BERT is commonly trained with AdamW (Loshchilov and Hutter, 2017), which employs an L2 regularization on the network weights. Thus, the BERT objective can be formulated as a dual-loss objective consisting of a standard classification loss and an L2 regularization: L_BERT = λ_1 L_bce + λ_2 L_reg, where L_bce is the binary cross-entropy loss and L_reg is the L2 regularization loss. The approximately optimal λ_1 and λ_2 weights proposed for BERT (Devlin et al., 2019) are 1 and 0.01, respectively.
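The dual-term objective can be sketched in numpy as follows. This is a hedged illustration of the formula L_BERT = λ_1 L_bce + λ_2 L_reg only; the function name and toy inputs are ours, not BERT's actual training code.

```python
import numpy as np

def dual_term_objective(logits, targets, param_vectors,
                        lam1=1.0, lam2=0.01):
    """L = lam1 * L_bce + lam2 * L_reg for a binary task.
    logits, targets: (N,) arrays; param_vectors: list of weight arrays."""
    p = 1.0 / (1.0 + np.exp(-logits))          # sigmoid
    l_bce = -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    l_reg = sum(np.sum(w * w) for w in param_vectors)  # L2 on the weights
    return lam1 * l_bce + lam2 * l_reg, l_bce, l_reg
```

Setting `lam2=1.0` reproduces the equal-weighting condition studied below, where the regularizer overwhelms the classification loss unless the optimizer rebalances the terms.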
In our experiments, we fine-tune BERT with MTAdam, minimizing the above objective, where the regularization is applied with equal weighting, setting λ_2 = 1. We compare MTAdam with Adam applied to the same dual-term objective and with AdamW. Note that AdamW applies the L2 regularization directly on the model weights, i.e., the L2 gradients are not combined with the first and second moments of the main loss.
Performance is evaluated on the MRPC (Dolan and Brockett, 2005), RTE (Bentivogli et al., 2009), and STS-B (Cer et al., 2017) benchmarks, and reported over five experiments for each variant. The results in Tab. 2 show that MTAdam is the only method to overcome the equal weighting, matching the results obtained by Adam and AdamW with the prescribed weights. The runtime of MTAdam applied to the BERT architecture is comparable to that of the Adam and AdamW alternatives, with up to 20% additional compute time (measured on a single V100 GPU).

RecoBERT Experiments
We next apply MTAdam-based training to RecoBERT (Malkiel et al., 2020a,b), a recent state-of-the-art text-based item similarity model. In the RecoBERT setting, a catalog of textual items is given, where each item is composed of a title-description pair. Item titles are sentences and descriptions are paragraphs. The RecoBERT model is a BERT-based network trained with self-supervision to jointly optimize a standard masked language model (MLM) and a unique title-description model (TDM). During training, title-description pairs are tokenized and masked (similar to BERT) and then propagated through the model. The optimization goal is to reconstruct the masked words and to predict whether a given pair of title and description corresponds to the same item or not. The RecoBERT objective can be expressed as: L_RecoBERT = L_MLM + λ L_TDM, where L_MLM is a standard masked language model loss, λ was set to 1, and L_TDM is the TDM loss, a standard cross-entropy employed over the cosine between the average-pooled embeddings of the title and description tokens.
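A hedged sketch of a TDM-style loss is shown below. The cosine-to-probability mapping and the function name are our simplifications for illustration; the exact RecoBERT head may differ.

```python
import numpy as np

def tdm_loss(title_tok_emb, desc_tok_emb, is_match):
    """Binary cross-entropy on the cosine between average-pooled
    token embeddings of a title and a description.
    title_tok_emb: (T, d); desc_tok_emb: (D, d); is_match: 0 or 1."""
    t = title_tok_emb.mean(axis=0)
    d = desc_tok_emb.mean(axis=0)
    cos = t @ d / (np.linalg.norm(t) * np.linalg.norm(d) + 1e-12)
    p = (cos + 1) / 2          # map [-1, 1] to a (0, 1) "match" score
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(is_match * np.log(p) + (1 - is_match) * np.log(1 - p))
```

Identical title and description embeddings yield a near-zero loss for a matching pair, while orthogonal embeddings are maximally uncertain.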
In Malkiel et al. (2020a), the authors introduced a wine similarity task with a test set of items whose similarity annotations were crafted by a human expert (see Fig. 3).
We evaluate RecoBERT with MTAdam on the wine similarity task and compare its performance to the baselines mentioned above. The Hyperband experiments utilize the validation loss to perform a hyperparameter search over λ and incorporate 40 trials (i.e., 40 training processes, which gives it a great advantage); each trial randomly samples a different lambda on a logarithmic scale between 10^−4 and 10^4. We report Hyperband performance using the trial associated with the chosen lambda (λ = 17.4).
As can be seen in Tab. 3, among the methods trained with the suboptimal weighting, the best performance is obtained by MTAdam.

Figure 3: Samples from the wine recommendation task. Each item is associated with a review written by a professional wine reviewer. These two wines were also annotated as similar by a sommelier.

Image synthesis
We compare the performance of MTAdam with the other optimizers on three methods: pix2pix (Isola et al., 2017), CycleGAN (Zhu et al., 2017), and SRGAN (Ledig et al., 2017a). We used the learning rate as found in each method. Performance is evaluated using various metrics: L1, L2, PSNR, NMSE (normalized MSE), FID (Heusel et al., 2017), and SSIM (Wang et al., 2004). In each case, we follow the metrics used in the original work, with the addition of FID.
Pix2pix Experiments The objective function of the pix2pix generator has two terms: L_pix2pix = λ_1 L_1 + λ_2 L_GAN, where L_GAN is the GAN loss of the generator, L_1 is the pixel loss, and λ_1 and λ_2 are set to 100 and 1, respectively. In our study, we train pix2pix with equal weights, setting λ_2 to 100 (which implies a 1:1 ratio between the two terms).

Two datasets are used: matching facade images and their semantic labels (Tyleček and Šára, 2013), and aerial photographs and matching maps (Isola et al., 2017). Performance is reported on a holdout test set of each benchmark. Fig. 4 depicts the test performance of multiple models per epoch. The experiment's name contains 'equal-weights' for the case of equal loss terms and 'prescribed' when using the prescribed λ values. As can be seen, MTAdam yields a convergence similar to that of Adam-prescribed. Specifically, MTAdam converges substantially better than the Adam-equal-weights experiment, leading to improved L1 and FID scores.
In Tab. 4, MTAdam is compared with all baselines, applied for training pix2pix models. The top four models in each section utilize equal weighting, each applied with a different optimizer. The Hyperband experiment utilizes the FID metric to perform a hyperparameter search on λ 2 .
The results in the table clearly show the advantage of MTAdam over all baseline methods. In addition, they show a slight improvement in performance in comparison to the usage of Adam with the prescribed weights. Appendix Fig. 12 presents samples from the Facade test set: Adam-equal-weights introduces visual artifacts and suffers from mode collapse (it generates the same corrupted patch in the top right corner of many images), while MTAdam-equal-weights yields higher-quality images, similar to those of the original pix2pix with the prescribed weights.

CycleGAN Experiments The CycleGAN objective function is composed of six loss terms: L_CycleGAN = λ_1 (L_GAN^A + L_GAN^B) + λ_2 (L_cycle^A + L_cycle^B) + λ_3 (L_idt^A + L_idt^B), where the L_GAN, L_cycle, and L_idt terms are the GAN loss, the cycle-consistency loss, and the identity loss, for each one of the sides (A or B). λ_1, λ_2, and λ_3 are set to 1, 10, and 0.5, respectively. In our experiments, we employ CycleGAN with suboptimal weighting, setting λ_2 to 1000 and leaving λ_1 and λ_3 unchanged. Appendix Fig. 8 compares the convergence of three CycleGAN models: (1-2) MTAdam and Adam, both with suboptimal weighting, and (3) Adam with the prescribed weights. All models are applied to the horse2zebra dataset (Deng et al., 2009). As can be seen, MTAdam exhibits convergence competitive with the Adam experiment applied with the prescribed weights, which is much better than the performance of Adam with the suboptimal weights. Tab. 5 presents the performance of MTAdam applied to CycleGAN, compared to all five baselines, and evaluated on two datasets. MTAdam with the suboptimal weighting yields performance competitive with the Adam method that uses the prescribed hyperparameters.
Appendix Fig. 9 exhibits representative images from the CycleGAN models, showing that MTAdam, even when applied to a loss with suboptimal weights, matches or improves the results of Adam on the prescribed weights.
SRGAN Experiments The Super Resolution GAN (SRGAN) (Ledig et al., 2017a) objective function is: L_SRGAN = λ_1 L_MSE + λ_2 L_GAN + λ_3 L_perceptual, i.e., the total loss is a combination of an MSE loss, a GAN loss, and a perceptual loss. λ_1, λ_2, and λ_3 are set to 1, 0.001, and 0.006, respectively. We employ a suboptimal combination of the loss terms by setting λ_2 = λ_3 = 1.
We evaluate SRGAN trained with MTAdam and equal weights on two test sets, Set14 (Zeyde et al., 2010) and BSD100 (Martin et al., 2001). The results, listed in Tab. 6, demonstrate that MTAdam can effectively recover from suboptimal weights, while the other optimization methods suffer a degradation in performance.

Ablation Study
Tab. 7 presents an ablation study for suboptimal pix2pix on the facade images and for suboptimal CycleGAN on the zebra2horse dataset. The following variants are considered (the descriptions refer to Alg. 1): (i) treating all layers as one layer and eliminating lines 6-8 altogether; (ii) treating all layers at once and performing the normalization in line 8 once for the entire gradient, i.e., still normalizing by the magnitude of the gradient of the first term; (iii) scaling the gradients of each layer l and each term i in line 8 by (n^i_{l,t})^−1 but not by n^1_{l,t}; (iv) replacing the term max(v^1_t, ..., v^I_t) in line 14 with v^i_t, analogously to line 14 in Adam; and (v) replacing the same term with the mean I^−1 Σ_i v^i_t. The results, shown in Tab. 7, indicate that it is crucial to employ a per-layer analysis in the way it is done in MTAdam, that normalizing by the magnitude of the gradient of an anchor term is highly beneficial, and that the maximal variance is a better alternative than the other scaling terms.
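The difference between the denominator variants (iv), (v), and MTAdam's max rule can be seen on a toy pair of second moments (the numbers below are illustrative, not from the paper's experiments):

```python
import numpy as np

# Bias-corrected second moments of two loss terms for one parameter;
# term 2 is in a high-sensitivity (high-variance) region.
v_hat = [np.array([1e-6]), np.array([4.0])]
eps = 1e-8

denom_max  = np.sqrt(np.maximum.reduce(v_hat)) + eps   # MTAdam (line 14)
denom_mean = np.sqrt(sum(v_hat) / len(v_hat)) + eps    # ablation (v)
denom_own  = [np.sqrt(v) + eps for v in v_hat]         # ablation (iv)

# With the max rule, the shared step is damped by the most uncertain
# term (denominator 2.0); per-term normalization (iv) would instead let
# term 1 take a large step (denominator ~1e-3) in a sensitive region.
```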

Conclusions
MTAdam is shown to be a widely applicable optimizer, which can dynamically balance multiple loss terms in an effective way. It is a general algorithm that can find use in additional types of tasks requiring the optimization of multiple terms, such as domain adaptation and some forms of self-supervised learning. Our code is attached as supplementary material. MTAdam is implemented as a generic PyTorch optimizer, and applying it is almost as simple as applying Adam.

A Motivation
Following Sec. 3 of the main text, Fig. 5 presents the gradient magnitudes of the two learned parameters of the polynomial f(x) = ax + bx^2 w.r.t. the loss function. Fig. 6 presents the FID scores vs. epoch for the different SRGAN experiments described in the same section.
B Image synthesis

Fig. 7 depicts the test performance of multiple pix2pix models per epoch. The experiment's name contains 'equal-weights' for the case of equal loss terms and 'prescribed' when using the prescribed λ values. As can be seen, MTAdam yields a convergence similar to that of Adam-prescribed, as measured by the L1 scores, and converges substantially better than Adam-equal-weights (we observe a similar trend in the FID, as presented in the main text). The pix2pix Hyperband experiments incorporate 40 trials (i.e., 40 training processes, which gives it a great advantage); each trial randomly samples a different lambda from the range [10^−4, 10^4]. We report Hyperband performance using the trial associated with the chosen lambda. In the aerial experiment, Hyperband failed to choose a λ_2 value that is close to the original value of 1. In the facade experiment, Hyperband sampled at least one lambda value between 0.5 and 10, yet the retrieved best model utilizes a higher lambda value of 162.89, since this value showed a preferable FID score on the validation set. In the maps dataset, a value of 59.61 was selected.

Figure 5: Training dynamics in a two-neuron network learning the coefficients of a polynomial of the form f(x) = ax + bx^2, minimizing L1, L2, or L1 where the gradients are scaled with the magnitudes of the L2 gradients. The plot presents the gradient magnitudes of the two parameters w.r.t. the loss function, i.e., ||(∂L1/∂W1, ∂L1/∂W2)||_2 and ||(∂L2/∂W1, ∂L2/∂W2)||_2 for the L1 and L2 experiments, respectively.

Fig. 8 compares the convergence of three CycleGAN models: unbalanced MTAdam, unbalanced Adam, and balanced Adam. All models are applied on the horse2zebra dataset (Deng et al., 2009). As can be seen, MTAdam exhibits convergence competitive with the Adam experiment applied with balanced weighting, which is much better than the performance of Adam with the unbalanced weights.

Fig. 9 presents a few representative samples from the CycleGAN (Zhu et al., 2017) experiments, employed to transfer horse images to zebras. As can be seen, the unbalanced Adam training completely fails to generate zebra images and collapses to the identity mapping. This can be attributed to the domination of the cyclic loss in the unbalanced setting, which dictates the convergence, leaving the other loss terms ineffective. On the other hand, the unbalanced MTAdam model was able to successfully generate zebra images, at a quality similar to or better than that of the balanced Adam model (which uses the prescribed weights). In particular, in the first row, we can see that CycleGAN-Unbalanced-MTAdam was able to outperform the CycleGAN-Balanced-Adam experiment, as the latter fails to generate a zebra image for this particular sample, while MTAdam generates an image of fairly good quality.

Fig. 10 depicts a few samples from the same models described above, this time employed to generate horse images from zebra images. As can be seen, the unbalanced Adam again fails to generate horse images, collapses to the identity mapping, and this time also introduces visual artifacts in a few images (see the green artifact in the bottom row). In addition, MTAdam yields images of the same quality as the balanced Adam experiment.

Fig. 11 presents visual results for the SRGAN (Ledig et al., 2017a) experiments. The images were taken from Set14 (Zeyde et al., 2010). All models were trained to generate high-resolution images from low-resolution images, with a factor of 4x upscaling. As can be seen, the SRGAN model that employs unbalanced weights and the Adam optimizer yields images with visual artifacts and low fidelity. This can be attributed to the domination of the GAN and perceptual loss terms over the pixel-wise term.
In contrast, the SRGAN model that employs our MTAdam optimizer with unbalanced weights yields images of the same quality as the SRGAN that utilizes the prescribed weights (Ledig et al., 2017a) and Adam.
Figure 6: We train a super-resolution image mapping technique using two out of three loss terms: MSE, GAN, and perceptual loss. (a) The GAN loss is well balanced, and we obtain effective training when combining it with the MSE loss. However, the weight of the perceptual loss was naively set to 1, which is about two orders of magnitude larger than the optimal value for combining it with the MSE loss. This leads to poor SSIM scores (shown in the main text) and FID scores (shown in this figure). However, when scaling the perceptual loss gradients by the magnitude of the gradients of the GAN loss, the perceptual loss supports the MSE loss. (b) Same, where the roles of the perceptual loss and the GAN loss are reversed.

Figure 9: CycleGAN-Unbalanced-Adam runs with unbalanced initial weights and fails to generate zebra images. CycleGAN-Unbalanced-MTAdam (our method) starts with unbalanced weights but produces images of the same quality as CycleGAN-Balanced-Adam.

Figure 10: CycleGAN-Unbalanced-Adam runs with unbalanced initial weights, fails to generate horse images, collapses to the identity mapping, and introduces visual artifacts. CycleGAN-Unbalanced-MTAdam (our method) starts with unbalanced weights but produces images of the same quality as CycleGAN-Balanced-Adam.

Figure 11: SRGAN-Unbalanced-Adam employs SRGAN training with the unbalanced initial weights, as described in the main text. SRGAN-Balanced-Adam employs SRGAN with the prescribed weights. SRGAN-Unbalanced-MTAdam utilizes our proposed MTAdam optimizer, runs with unbalanced initial weights, but produces images of the same quality as SRGAN-Balanced-Adam.

Figure 12: Samples from the Facade test set. Adam-prescribed employs pix2pix with the prescribed weights (Isola et al., 2017). Adam-equal-weights runs with equal initial weights, which leads to visual artifacts and low fidelity (see the top right patch). MTAdam-equal-weights (our method) starts with equal weighting but produces images of the same quality as Adam-prescribed.