Keep Learning: Self-supervised Meta-learning for Learning from Inference

A common approach in many machine learning algorithms involves self-supervised learning on large unlabeled data before fine-tuning on downstream tasks to further improve performance. A recent approach for language modelling, called dynamic evaluation, further fine-tunes a trained model during inference using the trivially-present ground-truth labels, giving a large improvement in performance. However, this approach does not easily extend to classification tasks, where ground-truth labels are absent during inference. We propose to solve this issue by utilizing self-training: we back-propagate the loss from the model's own class-balanced predictions (pseudo-labels), adapting the Reptile algorithm from meta-learning, combined with an inductive bias towards the pre-trained weights to improve generalization. Our method improves the performance of standard backbones such as BERT, Electra, and ResNet-50 on a wide variety of tasks, including question answering on SQuAD and NewsQA, the SuperGLUE benchmark, conversation response selection on the Ubuntu Dialog corpus v2.0, and image classification on MNIST and ImageNet, without any changes to the underlying models. Our proposed method outperforms previous approaches, enables self-supervised fine-tuning during inference of any classifier model to better adapt to target domains, can be easily adapted to any model, and is also effective in online and transfer-learning settings.


Introduction
It is commonly accepted that the performance of machine learning algorithms improves with increasing data. However, due to the difficulty of obtaining large quantities of labelled data, many models (particularly in the Natural Language Processing domain), such as BERT (Devlin et al., 2019), GPT (Radford et al., 2018) and UniLM (Dong et al., 2019), rely on unsupervised pre-training on unlabelled data to learn useful features which are then fine-tuned for downstream tasks. While this approach yields large gains in performance, it creates a mismatch between a network's pre-training and final fine-tuning. Some approaches, such as pseudo-labelling (Lee, 2013), have proposed utilizing data augmentation of unlabelled data with the model's own predictions to better pre-train a model.
While these methods are limited to the training phase, Krause et al. (2018) proposed to continue training a language model (language modelling being the task of predicting the next token in a sequence) during the evaluation stage, achieving significant improvements as the model adapts to the inference data, without any modifications to the model architecture or any access to the training data. In language modelling, the ground-truth label is simply the next input token, which is trivially accessible to the model and facilitates this learning. However, this method does not easily generalize to standard classification tasks due to the unavailability of labels during inference. This is the setting we explore further in this paper: we are given a classification model already trained on training data, but no access to that training data, and the aim is to further improve the performance of the model by self-training on the inference data.
To solve the above problem, we propose a method to train any classifier model during inference, utilizing methods used in domain adaptation, noisy-label learning, and multi-task meta-learning. With ground truth labels being absent, we utilize the model's own predictions as the pseudo-labels for those samples and utilize Class Balanced Self Training (CBST) (Zou et al., 2018) to filter samples based on the model's confidence while retaining class balance. However, naive online learning or re-training on the inference data is not optimal due to the noise in the labels biasing the network, as well as the small size of the inference set. We solve this issue by leveraging the Reptile Meta Learning Algorithm (Nichol et al., 2018) to improve generalization, supplemented with an explicit inductive bias towards the model's pre-trained weights.
Our experimental results and ablation studies show that our method improves the performance of standard backbones such as BERT, Electra (Clark et al., 2020) and ResNet (He et al., 2016) on a wide variety of tasks, such as question answering on SQuAD (Rajpurkar et al., 2018) and NewsQA (Trischler et al., 2017), the SuperGLUE benchmark (Wang et al., 2019), and conversation re-ranking on the Ubuntu Dialog corpus v2.0 (Lowe et al., 2017) for NLP, as well as object classification on MNIST (Deng, 2012) and ImageNet (Deng et al., 2009), without any changes to the underlying models, while outperforming previous approaches. Our method can also be utilized for continual self-supervised fine-tuning of classifiers on target domains, as well as in transfer-learning settings, without any model-level modifications.

Proposed Method
Our proposed technique is the self-supervised training of a classifier model during inference, consisting of three parts: using confident predictions as pseudo-labels, utilizing the Reptile algorithm to improve generalization, and applying an explicit inductive bias to minimize the effect of noisy labels.

Class Balanced Pseudo-labels
We utilize our classifier's most likely predicted class during inference as a hard ground-truth label (pseudo-label). Hendrycks and Gimpel (2017) show that a model's own softmaxed probability, max_k p(y = k | x), where k ranges over the classes, x is the input, and y is the predicted class, is a reasonable proxy for its expected accuracy. To filter out samples with low maximum probabilities, one can simply threshold the output with some fixed value p_t. As proposed by Zou et al. (2018), a separate threshold p_t(k_max) for each class, where k_max is the class with the maximum predicted probability, works better by reducing skew in favour of easier classes.

In CBST, the thresholds p_t(k) are automatically selected for each class k such that a fixed fraction f of the examples of each predicted class are filtered out of the inference set; i.e., p_t(k) is the f-th quantile of the confidences max_j p(y = j | x) over the samples with k_max = k. These thresholds p_t(k) can be kept fixed based on the validation set, or maintained as a running estimate in an online setting. Unlike the original CBST, we do not further normalize the class probabilities with these thresholds, as that led to a drastic reduction in the accuracy of pseudo-label classification. The retained inputs X_t, with hard pseudo-labels Y_t, are used as a training set to further fine-tune the model, using the Reptile algorithm below. This approach is also unaffected by a lack of model calibration, as long as the model's accuracy on X_t is acceptably high.
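A minimal sketch of this class-balanced thresholding (assuming NumPy arrays of softmax outputs; the function and variable names are ours, not the paper's):

```python
import numpy as np

def class_balanced_thresholds(probs, f):
    """Per-class confidence thresholds in the spirit of CBST (sketch).

    probs: (N, K) array of softmax outputs on the inference set.
    f:     fraction of samples of each predicted class to filter out.
    Returns (thresholds, keep_mask)."""
    preds = probs.argmax(axis=1)   # pseudo-label = most likely class
    conf = probs.max(axis=1)       # per-sample confidence
    num_classes = probs.shape[1]
    thresholds = np.zeros(num_classes)
    for k in range(num_classes):
        conf_k = conf[preds == k]
        if conf_k.size == 0:
            continue
        # f-quantile of this class's confidences, so roughly a
        # fraction f of its samples fall below the threshold
        thresholds[k] = np.quantile(conf_k, f)
    keep = conf >= thresholds[preds]
    return thresholds, keep
```

The kept samples, paired with their hard pseudo-labels, would then form the fine-tuning set X_t, Y_t.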

Reptile Algorithm, but for Single Task
Naively using the confident inferred labels to fine-tune the model is not optimal: the test set is small compared to the train set and the labels are noisy, which lowers generalization and reduces the gains that can be achieved with the pseudo-labels. Since aligned gradients between samples improve a model's generalization, as shown in Chatterjee (2020) and Fort et al. (2019), we leverage the Reptile meta-learning algorithm to this end. The meta-gradient of the Reptile algorithm contains as a component the gradient for maximizing the inner product between the gradients of different mini-batches from the same task, as we prove in Section 3.
Algorithm 1: Reptile + l2-sp.

Figure 1: Overview of the Reptile with l2-sp update for 4 inner steps.

The Reptile algorithm is a batched first-order MAML (FO-MAML) algorithm, originally intended for multi-task meta-learning. We use this algorithm in a single-task setting, as shown in Algorithm 1. The Reptile algorithm consists of k > 1 inner steps of standard SGD updates with learning rate LR_inner. The difference between the original network weights θ_{i,0} and the final network weights θ_{i,k}, where i is the outer step, is used as a meta-gradient for SGD to update the network parameters with a learning rate LR_outer. The SGD optimizer can be replaced with any other, such as Adam. In this single-task setting the Reptile algorithm is first order, requiring little extra compute compared to standard optimization, and can be plugged into any model with ease. Some other multi-task algorithms with experience replay, such as Riemer et al. (2018), may exhibit better learning, but are computationally orders of magnitude more expensive and are hence infeasible for large datasets and models.
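As a concrete illustration, one outer step of the single-task Reptile update with l2-sp decay can be sketched in a few lines of NumPy. This is a minimal sketch under our own naming assumptions (`grad_fn`, the learning rates, and the decay rate are placeholders), not the paper's implementation:

```python
import numpy as np

def reptile_l2sp_step(theta, batches, grad_fn, pretrained_w,
                      lr_inner=0.1, lr_outer=0.01, beta=0.1):
    """One outer step of single-task Reptile with l2-sp decay (sketch).

    theta:        current weights (np.ndarray)
    batches:      the k mini-batches for the inner loop
    grad_fn:      grad_fn(theta, batch) -> gradient of that batch's loss
    pretrained_w: pre-trained weights toward which l2-sp decays
    """
    theta_0 = theta.copy()
    th = theta.copy()
    for b in batches:  # k inner SGD steps with l2-sp decay
        th = th - lr_inner * grad_fn(th, b) - beta * (th - pretrained_w)
    meta_grad = theta_0 - th                # displacement = meta-gradient
    return theta_0 - lr_outer * meta_grad   # outer SGD update
```

The outer SGD step on `meta_grad` can equally be replaced by Adam, as noted above.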

Explicit Inductive Bias
While all the models we use employ weight decay towards 0 in their training phase, given the usually smaller size of the inference set, we instead regularize the model by biasing the network towards its pre-trained weights. For this, we use l2-sp decay (Li et al., 2018), slowly decaying the model weights between updates towards the initial trained model weights. An example of the update steps involved for k = 4 is shown in Fig. 1. We conjecture that this also makes learning more stable to the noisy pseudo-labels.
Some recent works such as Goldblum et al. (2020) also show that standard l 2 weight decay towards 0 may not be ideal and recommend biasing weights towards some model-dependent non-zero norm value instead. l 2 sp can be seen as a generalization of the same, while simultaneously taking advantage of the pre-training.

Theoretical Analysis
In this section, we provide a theoretical analysis of the meta update of Reptile + l 2 sp. We generalize the Taylor expansion approach for Reptile as used in (Nichol et al., 2018) to accommodate l 2 sp, and show how our approach maximizes the inner product of gradients between different mini-batches.
We consider one set of k inner updates. For i ∈ [0, k], we define:

θ_i = network weights before the i-th step,
b_i = input batch for the i-th step,
L_i = loss function corresponding to b_i,
W = pre-trained network weights for l2-sp,
β = l2-sp weight decay rate,
α = LR_inner,
g_i = ∇L_i(θ_i) (gradient of the i-th batch),
ḡ_i = ∇L_i(θ_0) (gradient at the initial point),
H̄_i = ∇²L_i(θ_0) (Hessian at the initial point).

Then, our update rule is

θ_{i+1} = θ_i − α g_i − β (θ_i − W).   (1)

In the following analysis, to keep the analysis tractable, we assume both α and β are small and comparable, and ignore terms of third and higher order in α and β. Using the first-order Taylor expansion of g_i around θ_0, we get

g_i = ḡ_i + H̄_i (θ_i − θ_0) + O(‖θ_i − θ_0‖²).   (2)

The following equations can be proved by simple induction on Eq. (1) and (2):

θ_i − θ_0 = −α Σ_{j<i} ḡ_j − i β (θ_0 − W) + (higher-order terms),   (3)

g_i = ḡ_i − α H̄_i Σ_{j<i} ḡ_j − i β H̄_i (θ_0 − W) + (higher-order terms).   (4)

By summing up the displacements from all variable updates, the expectation of the meta-gradient of Reptile + l2-sp under mini-batch sampling is

E[g_meta] = E[θ_0 − θ_k] = Σ_{i=0}^{k−1} E[α g_i + β (θ_i − W)].

Expanding the terms above with Eq. (3) and (4) and simplifying, we get

E[g_meta] = c_1 E[ḡ_i] + c_2 (θ_0 − W) − c_3 E[∂(ḡ_i · ḡ_j)/∂θ_0] − c_4 E[H̄_i] (θ_0 − W),   (5)

where each c_i is a positive constant, dependent on k, α and β. The first term on the R.H.S. of Eq. (5) is the gradient which takes us to the minimum of the training problem. For the third term, note that for i ≠ j,

E[H̄_i ḡ_j] = E[H̄_j ḡ_i] = (1/2) E[∂(ḡ_i · ḡ_j)/∂θ_0].

Since the meta-gradient is subtracted in the outer update, the third term maximizes the dot product between the gradients of the batches for improved generalization, as in the original Reptile algorithm. For the second and fourth terms, note that (θ_0 − W) is the direction of the gradient of the l2-sp penalty, and hence they can be interpreted similarly to the first and third terms, but with training gradients replaced by this l2-sp gradient.
Hence, we have shown that the Reptile property of maximizing inner products of gradients to improve generalization carries over to our extension as well.
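The first-order induction result for the inner-loop displacement (θ_k − θ_0 ≈ −α Σ ḡ_j − kβ(θ_0 − W)) can be sanity-checked numerically on a toy quadratic model. The sketch below uses arbitrarily chosen quadratic losses L_i(θ) = ½(θ − b_i)ᵀA_i(θ − b_i) of our own invention, and checks that the exact inner loop matches the first-order prediction up to higher-order error:

```python
import numpy as np

alpha, beta, k = 1e-3, 2e-3, 4
theta0 = np.array([1.0, -2.0])
W = np.array([0.5, 0.0])  # stand-in pre-trained weights for l2-sp

# Per-batch quadratic losses: gradient of L_i at th is A_i @ (th - b_i)
A = [np.diag([1.0 + 0.1 * i, 2.0 - 0.1 * i]) for i in range(k)]
b = [np.array([2.0, 1.0]) * (i + 1) for i in range(k)]

# Exact inner loop with l2-sp decay (update rule Eq. (1))
th = theta0.copy()
gbar_sum = np.zeros(2)
for i in range(k):
    gbar_sum += A[i] @ (theta0 - b[i])  # gradient at the initial point
    th = th - alpha * (A[i] @ (th - b[i])) - beta * (th - W)

actual = th - theta0
predicted = -alpha * gbar_sum - k * beta * (theta0 - W)  # Eq. (3)
err = np.linalg.norm(actual - predicted)  # should be higher-order small
```

For these small α and β, the discrepancy `err` is two orders of magnitude smaller than the displacement itself, consistent with the expansion.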

Corpus | Task | |Train| | |Dev|
BoolQ | QA | 9427 | 3270
CB | NLI | 250 | 57
COPA | QA | 400 | 100
MultiRC | QA | 5100 | 953
ReCoRD | QA | 101K | 10K
RTE | NLI | 2500 | 278
WiC | WSD | 6000 | 638

Ubuntu Dialog Corpus v2.0 A large-scale corpus of multi-turn conversations mined from Ubuntu IRC chat logs; the task is to select the best response given a list of possible distractor responses.
NewsQA A span-style QA dataset, consisting of crowd-sourced questions and answers on CNN news articles, along with unanswerable questions.

MNIST An image classification dataset of 28x28 scans of handwritten digits. While the dataset has long been solved, it nevertheless serves as a useful dataset for comparing simpler architectures.

ImageNet A large-scale dataset for image classification, consisting of 1.2M training samples along with their corresponding class labels.

Models
BERT BERT is a transformer (Vaswani et al., 2017) model, and its derivative models are the backbone of most state-of-the-art models in NLP. We use the official implementation and pre-trained models of BERT-large-cased for SuperGLUE tasks, and BERT-base-uncased for Ubuntu Dialog Corpus, NewsQA, and for our ablation tests on SQuAD.
Electra Electra is a BERT-derived model with a discriminative pre-training task that achieves state-of-the-art results on many NLP tasks. We use the official pre-trained Electra-large model, and we implement our own classifier for SQuAD v2.0.
ResNet Residual blocks and their variants are the backbone of most image classification models today. We use Tensorflow Model Garden's implementation and pre-trained ResNet-50 for ImageNet.
MLP While models made of only simple multi-layer perceptrons have largely fallen out of favour, fully connected layers are often part of larger architectures. We use an MLP with 2 layers and 128 hidden units as the model for MNIST.

Implementation Details
Fine-tuning on inference data is extremely quick, as our method is first order: it takes less than 15 minutes on a V100 for all datasets except ReCoRD and Ubuntu Dialog, for which it takes a few hours. We use the Adam optimizer, and we disable the model's l2 weight decay, if any. Batch-norm variables, if any, are also kept fixed. For each dataset, we train one model on the training set, followed by five runs on the pseudo-labeled, thresholded inference set with varying seeds, and report the mean and standard deviation of the scores. As the test sets for SuperGLUE and SQuAD are hidden, we report results on the development sets instead.
All default/official model hyper-parameters were used for each model/dataset (see the official source code linked in the supplemental material), except that we use an LR of 1e-5 for Electra, as we observed divergence with the standard LR. We linearly decay the LR, except in the online case, where it is kept fixed. The hyper-parameters for Reptile and l2-sp are provided in the supplemental material. A reasonable set of hyper-parameters that works across the range of datasets and models we tested is 0.01 for LR_outer, 4 inner steps, and 0.1 for the l2-sp coefficient, while LR_inner depends on the original model's LR. RTE, BoolQ, and WiC use a filtering fraction f of 70%, while all other datasets use 50%.
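Collected as a configuration snippet, these defaults read as follows (a hypothetical layout of our own; the constants are the values reported above, while the dictionary and function names are illustrative assumptions):

```python
# Hyper-parameter defaults reported in the text. LR_inner is
# model-dependent (it tracks the original model's fine-tuning LR).
REPTILE_L2SP_DEFAULTS = {
    "lr_outer": 0.01,   # outer-loop SGD learning rate
    "inner_steps": 4,   # k, number of inner SGD updates
    "l2_sp": 0.1,       # decay rate toward the pre-trained weights
}

# Fraction f of each predicted class filtered out by CBST thresholding.
FILTER_FRACTION = {"RTE": 0.7, "BoolQ": 0.7, "WiC": 0.7}
DEFAULT_FILTER_FRACTION = 0.5

def filter_fraction(dataset):
    """Return the CBST filtering fraction f for a dataset."""
    return FILTER_FRACTION.get(dataset, DEFAULT_FILTER_FRACTION)
```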

Results on SuperGLUE benchmark
As shown in Table 4, our method consistently improves performance on all the tasks in SuperGLUE with very little extra compute, with up to a 1.8-point increase in accuracy. The gains tend to be larger on smaller datasets, but we observe significant improvement even on the largest task, ReCoRD, with over 100K examples.

Results in a Transfer Learning Setting
We also evaluate our approach in a transfer-learning setting on NewsQA, using a BERT-base-uncased model pre-trained on SQuAD v2.0: we self-train on the NewsQA train set, then evaluate on its test set. Our approach is especially effective in this setting, outperforming the original model by 4.57/6.23 F1/EM, as shown in Table 6. This experiment demonstrates that our approach is effective for unsupervised domain adaptation to a target domain even in the absence of source-domain data.

Results on Image Classification
To demonstrate that our method also works in non-NLP domains, we evaluate on image classification: on ImageNet with ResNet-50, we report an increase in accuracy of 0.16 points, and on MNIST, our simple MLP model improves by 0.27 points.

Comparison with Existing Methods
We compare our method with several existing approaches for self-training, such as Zou et al. (2018). As shown in Table 5, our method greatly outperforms the existing approaches, giving 4 to 5 times the relative improvement of other methods: 0.68/0.82 F1/EM, compared to 0.14/0.15 F1/EM for the best-performing existing approach.

Online Variant
Our approach can also be used, without any modification, in an online setting, where the model keeps learning continuously as inference data is fed to it. We use a trained model to make predictions on the incoming inference data and, at the same time, use those predictions to fine-tune the model. For this kind of learning, we use a constant learning rate, as the total size of the inference data is unknown. As a baseline, we use BERT + CBST (trained on SQuAD v2.0 data) with a constant learning rate. BERT + CBST + Reptile + l2-sp (Online) clearly outperforms BERT + CBST (Online) by 0.38/0.37 F1/EM, as shown in Table 7.
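The online variant's predict-then-update loop can be sketched on a toy logistic model (a hypothetical stand-in of our own devising, with a plain confidence cutoff in place of CBST; all names are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_self_training(theta, stream, lr=0.05, conf=0.75):
    """Online variant (toy sketch): a logistic model serves predictions
    on each incoming batch and is immediately fine-tuned on its own
    confident hard pseudo-labels, at a constant learning rate since
    the total stream length is unknown."""
    served = []
    for x in stream:                              # x: (n, d) batch
        p = sigmoid(x @ theta)
        served.append(p > 0.5)                    # predictions served
        keep = np.maximum(p, 1.0 - p) >= conf     # confidence filter
        if keep.any():
            y = (p[keep] > 0.5).astype(float)     # hard pseudo-labels
            grad = x[keep].T @ (p[keep] - y) / keep.sum()
            theta = theta - lr * grad             # one constant-LR step
    return theta, served
```

In the paper's setting the logistic model would be replaced by the full classifier, the cutoff by CBST thresholds, and the SGD step by the Reptile + l2-sp update.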
We also measure the performance of our method when running in online mode for a long time on the NewsQA dataset, as shown in Table 6. The improvement is not as large as with a decreasing LR, but is still a significant 2.70/1.63 F1/EM.

Ablation Studies
We conduct extensive ablation studies to test the effectiveness of all parts of our approach. We perform these ablations on SQuAD v2.0 with the BERT-base model.

Table 7: Ablation study of our method on the SQuAD v2.0 corpus, using the BERT-base-uncased model (fragment): BERT + CBST (Online) 76.20 ± 0.01 F1 / 73.28 ± 0.01 EM; BERT + CBST + Reptile + l2-sp (Online) 76.58 ± 0.02 F1 / 73.65 ± 0.01 EM.

Figure 2: The Y axis is F1 score; the X axis is the percentage of data left after thresholding.

Thresholding
In Table 7, we compare CBST thresholding of model outputs against using all the data to fine-tune the model. Using CBST + Reptile + l2-sp increases scores by 0.35/0.23 F1/EM compared to using all the pseudo-labels with Reptile + l2-sp. We further study the effect of the thresholding fraction f used to select the subset of confident data: we take the pre-trained BERT-base-uncased model, self-train it on the training set of NewsQA with pseudo-labels while varying f, and then evaluate on the dev set. As can be seen in Fig. 2, the optimal value for f is around 50%; performance decreases slowly as more data (but with less confident labels) is used, and decreases more sharply as the total amount of filtered data used decreases.

Reptile Algorithm
Compared to using just CBST, fine-tuning with the Reptile algorithm yields further gains of 0.58/0.37 F1/EM, as we can see in Table 7. This effect persists irrespective of whether l2-sp or the model's default weight decay towards 0 is used.

Figure 3: Effect of varying the total size of the inference data on our method on SQuAD v2.0. The Y axis is F1 score; the X axis is the total amount of inference data used.
This demonstrates that the increased generalization from Reptile's meta-gradients is indeed effective in increasing model performance and robustness. We also conduct an ablation study on the choice of the number of inner steps k. As shown in Table 8, the number of inner updates does not have a major impact on the results, but we advise keeping it at most 4, since more inner steps reduce the number of outer updates (as the total number of epochs is kept constant).

Inductive Bias towards pre-trained weights
We can also see in Table 7 that l2-sp is indeed effective: simply biasing the model towards the pre-trained weights achieves better results. This effect becomes more pronounced when the Reptile algorithm is used, with a 0.21/0.26 F1/EM improvement of CBST + Reptile + l2-sp over CBST + Reptile.
We also conduct an ablation study on the strength of this bias, by transfer learning on the NewsQA dataset using our method with a model trained on SQuAD, and measuring the performance on SQuAD thereafter. As shown in Table 9, l2-sp prevents the model from forgetting its performance on SQuAD. However, higher values prevent it from improving its performance on the original SQuAD by minimizing learning on NewsQA.

Effect of Inference Data Size
In Figure 3, we vary the amount of inference data available for our model to learn from, by training a BERT-base model on varying sizes of the pseudo-labelled SQuAD v2.0 dev set, while keeping f fixed at 50%. The largest increase occurs early in training. However, even when using the full dev set, performance keeps improving, giving an improvement in F1 of 0.68.

Pseudo-labeling
Lee (2013) proposed a simple and efficient method of semi-supervised learning for deep neural networks, in which the network is trained in a supervised fashion on labeled and unlabeled data simultaneously, using pseudo-labels created by taking the classes with the highest predicted probabilities as ground-truth labels for the unlabeled data. CBST (Zou et al., 2018) used different thresholds for pseudo-labels of different classes. Mutual Mean-Teaching (Ge et al., 2020) used a moving average of two separate classifiers to refine pseudo-labels. Zheng and Yang (2020) used the KL-divergence between two classifiers as a measure of classifier variance to filter incorrect pseudo-labels. Pseudo-labels and similar self-supervised techniques have grown increasingly popular, particularly in conjunction with extremely large unlabelled datasets, and were recently used by Noisy Student (Xie et al., 2019) to achieve state-of-the-art performance on image classification.

Dynamic Evaluation
Adaptive language modelling has a long history, going back to e.g. Kuhn (1988), and caching-based models have improved over the state of the art, such as Merity et al. (2018). Krause et al. (2018) proposed dynamic evaluation, adapting to recent history via a gradient-descent-based mechanism. However, their approach is limited to language modelling, where ground-truth labels are trivially available during inference, and does not generalize to the standard classification setting. Rahman et al. (2019) also used pseudo-labels during inference to learn, but differently from our paper, they primarily focus on transductive zero-shot detection, and do not use our proposed meta-learning and inductive bias. Kim et al. (2019) also proposed using pseudo-labels to learn during evaluation, but require changes to the model's training phase. Su et al. (2016) also used pseudo-labels on inference data to improve model performance, but their contributions primarily focus on adapting self-training to unbalanced classes. Dynamic evaluation can be considered a form of Fast Weights (Ba et al., 2016), which, unlike our approach, requires changes during the training phase.

Generic Methods for Noisy Labels
Loss correction methods such as Patrini et al. (2017) model the noise transition matrix. Other approaches try to directly correct the noisy labels, such as Veit et al. (2017), but require access to a clean set. Others directly modify the loss function to make it more stable to noisy labels, such as Generalized Cross Entropy (Zhang and Sabuncu, 2018). Other approaches, most related to our approach, refine the training strategy, such as Co-teaching (Han et al., 2018) or Mutual Mean-teaching, using two classifiers to select the data for each other.

Unsupervised Domain Adaptation
Unsupervised Domain Adaptation methods often use Adversarial Methods, such as Jiang et al. (2020), to distinguish between source and target domains. Distance based methods, such as Chen et al. (2019), aim to minimize the distribution discrepancy across different domains. Other methods such as Courty et al. (2017) rely on optimal transport between source and target domains. These methods often need access to source domain data, or modify the original model or training procedure.

Meta-Learning for Transfer Learning
Algorithms that rely on Fisher/Hessian matrices have been proposed to improve transfer learning, such as Kirkpatrick et al. (2016). Nichol et al. (2018) proposed using batched FO-MAML during training to learn better weight initialization values. Often these algorithms also use some form of Experience Replay, where saved/cached examples from previous tasks are replayed to prevent the model from forgetting. Riemer et al. (2018) proposed Meta-Experience Replay (MER), exploiting a trade-off between transfer and interference by enforcing gradient alignment across examples.

Conclusion
We propose a method for self-supervised learning of any classifier model during inference using the model's own predictions, adapting the Reptile algorithm from meta-learning together with an inductive bias to maintain generalization while improving performance. We demonstrate the effectiveness of our method on a wide range of tasks, including the SuperGLUE benchmark, question answering on SQuAD v2.0 and NewsQA, response selection on Ubuntu Dialog Corpus v2.0, and image classification on ImageNet and MNIST. Our approach consistently improves the performance of standard backbones such as BERT, Electra, and ResNet. Our method is effective for improving the performance of neural models without any changes to the underlying models, their training, or access to training data, requires minimal extra compute, and is also effective in online and transfer-learning settings.

For each dataset, we train one model on the training set, followed by five runs on the pseudo-labeled, thresholded test set with varying seeds, and report the mean and standard deviation of the scores. Smaller datasets in SuperGLUE are known to show significant variation between multiple runs when fine-tuning a BERT model; however, most of this variation comes from the random initialization of the classification layer. In our experiments, as the model has already been fine-tuned on the train set, the only variation between runs is the order of the input data. This results in an extremely small variation in score between different runs, much smaller than the performance gains observed, making the improvements statistically significant.

A.1 Significance tests
We provide below in Table 10 and Table 12 the p-values of one-sample t-tests for Table 6 and Table 7, with the null hypothesis that the scores of our results have the same mean as the baseline. Our results are significant at 99% confidence in all settings.
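For reference, the one-sample t statistic underlying these tests can be computed directly (a sketch with illustrative names; the reported p-values come from the full test, e.g. scipy.stats.ttest_1samp):

```python
import math

def one_sample_t(scores, mu0):
    """t statistic for H0: mean(scores) == mu0.

    With n = 5 runs (4 degrees of freedom), |t| > 4.604 implies
    significance at the 99% level for a two-sided test."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return (mean - mu0) / math.sqrt(var / n)
```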

B Improved Generalization
The NewsQA results in Table 6 are scores on the test set, while the model was self-trained on the train set. The scores indicate that the model does not over-fit while self-training, as our approach significantly improves the scores on the test set.
Table 10: P-values of the significance tests.

Corpus | p-value
SQuAD v2.0 Electra (F1) | 3e-5
Ubuntu Dialog v2.0 | 5e-5
NewsQA (F1) | 1e-9
ImageNet | 4e-7

As a further test of improved generalization, we split the SQuAD dev set into two equal halves, performed our self-training on one half, and evaluated on the other half. The scores in Table 11 show that self-training on one half improved generalization on the other.

For Ubuntu Dialog, we used the same pre-trained models, but we implemented our own classifier. For Electra, we used the pre-trained models from https://github.com/google-research/electra.

For ResNet-50, we used Tensorflow Model Garden's official implementation and pre-trained ImageNet model at https://github.com/tensorflow/models/tree/r1.13.0/official/resnet. For MNIST, we implemented our own MLP following https://www.tensorflow.org/datasets/keras_example. The Reptile + l2-sp optimizer is trivial to implement in all of the above models following the pseudo-code from the main paper, by modifying the Optimizer class used for each model.

F Hyper-parameters of our approach
The hyper-parameter search bounds were chosen based on heuristic manual estimates, primarily by considering the product of LR_inner and LR_outer relative to the model's native LR at the point where the fraction of training steps left equals the ratio of the size of the training set to the size of the filtered inference set. Each set of hyper-parameters was run three times, and the search was run on a grid. We list the hyper-parameters of our Reptile + l2-sp approach in the tables that follow.

G.1 SuperGLUE

COPA Choice of Plausible Alternatives, a dataset for classifying the cause/effect of a given premise from two alternatives, with fully handcrafted data.
ReCoRD Reading Comprehension with Commonsense Reasoning, a QA dataset consisting of articles and Cloze-style questions with a masked entity, scored on predicting the masked entity from the entities in the article, with data from CNN and Daily Mail. Scored with token-level F1 and EM.
RTE Recognizing Textual Entailment, as binary classification of entailment or not entailment, with data from Wikipedia and news.
WiC Word-in-Context, a word sense disambiguation (WSD) dataset, tasked with binary classification of sentence pairs based on the sense of a common polysemous word. Data is from WordNet and Wiktionary.
WSC Winograd Schema Challenge, a coreference resolution task on resolving pronouns to a list of noun phrases. As the models we tested only predicted the majority class, we omit this dataset.
G.2 SQuAD v2.0

The Stanford Question Answering Dataset v2.0 is a popular span-style QA dataset, consisting of passages from Wikipedia labelled by annotators with questions on the passages and corresponding answer spans, along with unanswerable questions. The dataset is evaluated with F1 and EM scores of predicted answer spans.

G.3 Ubuntu Dialog Corpus v2.0
The Ubuntu Dialog Corpus is a large-scale corpus of multi-turn real human conversations mined from Ubuntu IRC chat logs, with only two participants per conversation. Each conversation is annotated with the next utterance (response) following the conversation, and the task is to select the best response given a list of possible distractor responses. The dataset is evaluated with Recall score of picking the correct response out of 10 possible responses, Recall10@1.

G.4 NewsQA
NewsQA is a span-style QA dataset, consisting of crowd-sourced questions on CNN news articles and their corresponding answer spans, along with unanswerable questions. This dataset is evaluated with F1 and EM scores of predicted answer spans.

G.5 MNIST
MNIST is a popular image classification dataset, consisting of normalized and anti-aliased 28x28 scans of handwritten numerical digits. While the dataset has long been solved, it nevertheless serves as a useful dataset to compare simpler architectures.

G.6 ImageNet
ImageNet is a large-scale dataset for image classification, consisting of 1.2M training samples along with their corresponding class labels. It is often the de-facto dataset when comparing Image Classification models.
H Expected validation performance

We provide the expected validation performance for all the datasets on which we ran hyper-parameter searches, as described in (Dodge et al., 2019).