BERTAC: Enhancing Transformer-based Language Models with Adversarially Pretrained Convolutional Neural Networks

Transformer-based language models (TLMs), such as BERT, ALBERT and GPT-3, have shown strong performance in a wide range of NLP tasks and currently dominate the field of NLP. However, many researchers wonder whether these models can maintain their dominance forever. Of course, we do not have answers now, but, as an attempt to find better neural architectures and training schemes, we pretrain a simple CNN using a GAN-style learning scheme and Wikipedia data, and then integrate it with standard TLMs. We show that on the GLUE tasks, the combination of our pretrained CNN with ALBERT outperforms the original ALBERT and achieves a similar performance to that of SOTA. Furthermore, on open-domain QA (Quasar-T and SearchQA), the combination of the CNN with ALBERT or RoBERTa achieved stronger performance than SOTA and the original TLMs. We hope that this work provides a hint for developing a novel strong network architecture along with its training scheme. Our source code and models are available at https://github.com/nict-wisdom/bertac.


Introduction
Transformer-based language models (TLMs) such as BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), and GPT-3 (Brown et al., 2020) have shown that large-scale self-supervised pretraining leads to strong performance on various NLP tasks. Many researchers have used TLMs for various downstream tasks, possibly as subcomponents of their methods, and/or they have focused on scaling up TLMs or improving their pretraining schemes. As a result, other architectures like Recurrent Neural Networks (RNN) (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) and Convolutional Neural Networks (CNN) (LeCun et al., 1999) are fading away. In this work, we propose a method for improving TLMs by integrating a simple conventional CNN into them. We pretrained this CNN on Wikipedia using a Generative Adversarial Network (GAN) style training scheme (Goodfellow et al., 2014), and then combined it with TLMs. Oh et al. (2019) similarly used GAN-style training to improve a QA model using a CNN, but their training scheme was applicable only to QA-specific datasets. In contrast, like TLM pretraining, our proposed method for training the CNN is independent of specific tasks. We show that the combination of this CNN with TLMs can achieve higher performance than that of the original TLMs on publicly available datasets for several distinct tasks. We hope that this gives an insight into how to develop novel strong network architectures and training schemes.
We call our combination of a TLM and a CNN BERTAC (BERT-style TLM with an Adversarially pretrained Convolutional neural network). Its architecture is illustrated in Fig. 1. We do not impose any particular restriction on the TLM in BERTAC, so any TLM, ALBERT (Lan et al., 2020) or RoBERTa for example, can be used as a subcomponent of BERTAC. We used the CNN to compute representations of a slightly modified version of the input given to a TLM. To integrate these representations with those of the TLM, we stacked on top of the TLM several layers of Transformers for Integrating External Representation (TIERs), which are our modified version of normal transformers (Vaswani et al., 2017). A TIER has the same architecture as that of a normal transformer encoder except for its attention: we replace the transformer's self-attention with an attention based on the representation provided by the CNN. We expect that, by keeping the basic architecture of transformer encoders, the CNN's representations can be integrated more effectively with the TLM's original representations.

Figure 2: GAN-style pretraining of CNNs. The discriminator D takes either a real representation generated by R or a fake representation generated by F as its input and then it predicts whether the input is a real or fake representation.

Table 1: Example of an entity-masked sentence ($m_1$) and the original sentence ($s_1$).
$s_1$: [Suvarnabhumi Airport]$_{e_1}$ is Thailand's main international air hub.
$m_1$: [EM] is Thailand's main international air hub.
$e_1$: Suvarnabhumi Airport
We pretrained the CNN using a GAN-style training scheme in order to generate representations of sentences rather freely without the constraint of token embedding prediction in the masked language modeling used for TLMs, as we explain later. For the training, we used masked sentences autogenerated from Wikipedia. As in the masked language modeling, neither human intervention nor downstream task-specific hacking is required. As illustrated in Fig. 2, the GAN-style training requires three networks, namely, a discriminator D and two CNN-based generators R and F. Once the training is done, we use the generator F as the CNN in BERTAC. The training data consists of pairs of an entity mention and a sentence in which the entity mention is masked with a special token [EM]. For example, the entity-masked sentence $m_1$ in Table 1 is obtained by masking the entity mention $e_1$, "Suvarnabhumi Airport," in the original text $s_1$. The network F generates a vector representation of the masked sentence ($m_1$), while R produces a representation of the masked entity ($e_1$). The discriminator D takes representations generated by either R or F as the input, and it predicts which generator actually gave the representation.
In the original GAN, a generator learns to generate an artificial image from random noise so that the resulting artificial image is indistinguishable from given real images. By analogy, we used an entity-masked sentence as "random noise" and a masked entity as a "real image." In our GAN-style training, we regard the vector representation of a masked entity given by generator R as a real representation of the entity (or the representation of the "real image" in the above analogy). On the other hand, we regard the representation of the masked sentence, generated by F , as a fake representation of the entity (or the representation of the "artificial image" generated from the "random noise" in the above analogy). This representation is deemed fake because the entity is masked in the masked sentence, and F does not know what the entity is exactly. During the training, F should try to deceive the discriminator D by mimicking the real representation and generating a fake representation that is indistinguishable from the real representation of the entity generated by R. On the other hand, R and D, as a team, try to avoid being mimicked by F and also to make the mimic problem harder for F . If everything goes well, once the training is over, F should be able to generate a fake representation of the entity that is similar to its real representation.
An interesting point is that F's output can be interpreted in two ways: it is a representation of a masked sentence because it is computed from the sentence, and at the same time it is a representation of the masked entity because it is indistinguishable from R's representation of the entity. This duality suggests that F's output can be seen as a representation of the entire sentence.
We exploit F as the CNN in BERTAC as follows: first, we use F to compute a representation of a masked version of the sentence originally given as input to a TLM. The entity mention to be masked is chosen by simple rules and, if the input consists of multiple sentences, we generate a representation of each (masked) input sentence and concatenate these into a single representation. Then, this representation is integrated into the output of the TLM through multiple TIER layers.
Our GAN-style pretraining is conceptually similar to TLM pretraining with masked language modeling (predicting what a masked word in a sentence should be). However, it was designed to pretrain a model that is able to rather freely generate entity representations without strongly sticking to the prediction of token embeddings. Our hypothesis is that such freely generated representations may be useful for improving the performance of downstream tasks. Moreover, we assumed that using multiple text representations computed from different perspectives (i.e., predicting token embeddings and freely generating entity representations) would help to improve the performance of downstream tasks.
In our experiments, we show that for the GLUE tasks (Wang et al., 2018), BERTAC's average performance on the development set was 0.7% higher than that of ALBERT, which was used as a subcomponent of BERTAC, leading to a performance on the test set comparable to that of SOTA (90.3% vs 90.8% (SOTA)). It also outperformed the SOTA method of open-domain QA (Chen et al., 2017) on Quasar-T (Dhingra et al., 2017) and SearchQA (Dunn et al., 2017) using either ALBERT or RoBERTa. We also compared our method with alternative models using a CNN pretrained in a self-supervised (non GAN-style) manner to directly predict embeddings of the entity mentions. Consequently, we confirmed that our method worked better: only the CNN trained by our GAN-style pretraining gave significant performance improvement over base TLMs.
Note that the computational overhead of BERTAC is reasonably small. It took 20 hours with 16 GPUs to pretrain a single CNN model and 180 hours for the nine models tested with different parameter settings in this work (cf. 480 hours with 96 GPUs for pretraining DeBERTa (He et al., 2021), for example). Moreover, once pretrained, the CNN models can be re-used for various downstream tasks and combined with various TLMs, including potentially future ones. As for the number of parameters, BERTAC had just a 14% increase when ALBERT-xxlarge was used as its base TLM (268 M parameters for BERTAC vs. 235 M for ALBERT-xxlarge). We confirmed from these results that BERTAC could improve pretrained TLMs with reasonably small computational overhead.

Related Work
Pretraining TLMs with entity information: There have been attempts to explicitly learn entity representation from text corpora using TLMs (He et al., 2020; Peters et al., 2019; Sun et al., 2020; Wang et al., 2020a; Xiong et al., 2020; Zhang et al., 2019). Our proposed method is a complementary alternative to these existing methods in the sense that entity representations are integrated into TLMs via CNNs and not directly produced by the TLMs.

Fine-tuning TLMs with external resources or other NNs: Yang et al. (2019a), among others, have used knowledge graphs for augmenting TLMs with entity representations during fine-tuning. Unlike these approaches, BERTAC uses unstructured texts rather than clean structured knowledge, such as knowledge graphs, to adversarially train a CNN. Other previous works have proposed combining CNNs or RNNs with BERT for NLP tasks (Lu et al., 2020; Safaya et al., 2020; Shao et al., 2019; Zhang et al., 2020), but their use of CNNs/RNNs was task-specific, so their models were not directly applicable to other tasks.

Adversarial learning for improving TLMs: Oh et al. (2019) proposed a CNN-based answer representation generator for QA that can guess the vector representation of answers from given why-type questions and answer passages. The generator was trained in a GAN-style manner using QA datasets. We took inspiration from their adversarial training scheme to train task-independent representation generators from unsupervised texts (i.e., Wikipedia sentences in which an entity was masked in a cloze-test style).
ELECTRA (Clark et al., 2020) also employed an adversarial technique (not a GAN) to pretrain two TLMs: a generator was trained to perform masked language modeling and a discriminator was trained to distinguish tokens in the training data from tokens replaced by the generator. On downstream tasks, only the discriminator was fine-tuned. In BERTAC, the GAN-style pretraining was applied only to the CNN, thus reducing the training cost. Furthermore, the CNN can be combined easily with any available TLM, even potentially future ones, without having to re-do the pretraining. In this work, we show that BERTAC outperformed ELECTRA on the GLUE tasks. Vernikos et al. (2020) proposed a method that used an adversarial objective and an adversarial classifier for regularizing the fine-tuning process of TLMs, inspired by adversarial learning for domain adaptation (Ganin et al., 2016). Our work uses a GAN-style training scheme only for pretraining CNNs, not for fine-tuning TLMs.

Pretraining of CNNs
This section describes the training data and training algorithm for our CNN.

Training data
We pretrained our CNN with an entity-masked version of Wikipedia sentences. WikiExtractor was used to extract, from the English Wikipedia, sentences that have at least one entity mention, i.e., an entity with an internal Wikipedia link. Then we randomly selected one entity mention $e_i$ in each sentence and generated an entity-masked sentence $m_i$ by replacing the entire selected mention with [EM]. For example, we generated the masked sentence $m_1$, "[EM] is Thailand's main international air hub," (in Table 1) by replacing the entity mention $e_1$, Suvarnabhumi Airport, in the sentence $s_1$, "Suvarnabhumi Airport is Thailand's main international air hub," with [EM]. We obtained about 43.3 million pairs of an entity mention and a masked sentence ($\{(e_i, m_i)\}$) in this way and used 10% of them (randomly sampled) as the pretraining data for our CNN.
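The construction of one (entity mention, masked sentence) pair can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes entity mentions are already available as character spans (in the paper they come from internal Wikipedia links), and the function name `make_masked_pair` and the span format are hypothetical.

```python
import random

EM_TOKEN = "[EM]"

def make_masked_pair(sentence, mention_spans, rng=random):
    """Pick one entity mention at random and replace it with [EM].

    mention_spans: list of (start, end) character offsets of entity
    mentions in `sentence` (hypothetical input format).
    Returns (entity_mention, masked_sentence), or None if there is no mention.
    """
    if not mention_spans:
        return None
    start, end = rng.choice(mention_spans)
    entity = sentence[start:end]
    masked = sentence[:start] + EM_TOKEN + sentence[end:]
    return entity, masked

# Example from Table 1: a single linked mention at the start of the sentence.
s1 = "Suvarnabhumi Airport is Thailand's main international air hub."
pair = make_masked_pair(s1, [(0, len("Suvarnabhumi Airport"))])
```

With a single candidate span the random choice is deterministic, so `pair` holds the mention and the masked sentence of the Table 1 example.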

GAN-style pretraining
As illustrated in Fig. 2, the adversarial training is done using three subnetworks: R (real-entity-representation generator), F (fake-entity-representation generator), and D (discriminator). R and F are CNNs with average pooling and D is a feedforward neural network. Once the training is done, we use the generator F as the CNN in BERTAC. In the training, we regard the representation of a masked entity output by generator R as a real representation of the entity that the fake-entity-representation generator F should mimic. F is trained so that, taking an entity-masked sentence as its input, it can generate a representation of the masked entity mention (called a fake representation of the entity in this work) that D cannot distinguish from the real representation. The representation generated by F is fake in the sense that the entity mention is masked in the input sentence and F cannot know what it is exactly.
As mentioned in the Introduction, our GAN-style pretraining was designed to train a model capable of freely generating entity representations. We assumed that using multiple text representations computed from different perspectives (i.e., prediction of token embeddings in TLMs and generation of entity representations in our CNN) would help to improve the performance of downstream tasks.

Algorithm 1: GAN-style pretraining
Input: Training examples $\{(e, m)\}$, training epochs $t$, mini-batch steps $b$, mini-batch size $n$
Output: Real representation generator R, fake representation generator F, discriminator D
1 $j \leftarrow 1$
2 Initialize $\theta_R$, $\theta_F$, and $\theta_D$ (parameters of R, F, and D) with random weights
Update D and R by ascending their stochastic gradient: $\nabla_{\theta_D, \theta_R} \frac{1}{n} \sum_{i=1}^{n} \left[ \log D(R(\mathbf{e}_i)) + \log(1 - D(F(\mathbf{m}_i))) \right]$
Update F by descending its stochastic gradient: $\nabla_{\theta_F} \frac{1}{n} \sum_{i=1}^{n} \log(1 - D(F(\mathbf{m}_i)))$

For each pair of an entity mention ($e_i$) and an entity-masked sentence ($m_i$) in the training data, we first generate two matrices of word embeddings $\mathbf{e}_i$ and $\mathbf{m}_i$ using word embeddings pretrained on Wikipedia with fastText (Bojanowski et al., 2017). Then, R and F generate, respectively, a real entity representation from $\mathbf{e}_i$ and a fake entity representation from $\mathbf{m}_i$. Finally, they are given to D, which is a feed-forward network that judges whether F or R generated the representations, i.e., whether the representations are real or fake, using sigmoid outputs by the final logistic regression layer.
The pseudo code of the training scheme is given in Algorithm 1. The training proceeds as follows: R and D as a team try to avoid the possibility that D misjudges F's output (i.e., a fake entity representation) as a real entity representation. More precisely, R and D are trained so that D can correctly judge the representation $R(\mathbf{e}_i)$ given by generator R as real (i.e., $D(R(\mathbf{e}_i)) = 1$) and the representation $F(\mathbf{m}_i)$ given by generator F as fake (i.e., $D(F(\mathbf{m}_i)) = 0$). Therefore, their training is carried out with the objective of maximizing $\log D(R(\mathbf{e}_i)) + \log(1 - D(F(\mathbf{m}_i)))$. Conversely, F is trained to deceive D, i.e., with the objective of minimizing $\log(1 - D(F(\mathbf{m}_i)))$ (line 9 in Algorithm 1). This min-max game is iterated for the pre-specified $t$ training epochs.
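The two objectives of this min-max game can be illustrated numerically. The sketch below assumes the standard GAN objective of Goodfellow et al. (2014), which is consistent with the targets D(R(e_i)) = 1 and D(F(m_i)) = 0 described above; the function names are hypothetical, and the discriminator outputs are passed in directly as probabilities rather than computed by real networks.

```python
import math

def d_and_r_objective(d_real, d_fake):
    """Mini-batch objective that D and R try to maximize.

    d_real[i] = D(R(e_i)) and d_fake[i] = D(F(m_i)), each strictly in (0, 1).
    """
    n = len(d_real)
    return sum(math.log(dr) + math.log(1.0 - df)
               for dr, df in zip(d_real, d_fake)) / n

def f_objective(d_fake):
    """Mini-batch objective that F tries to minimize (assumed standard GAN form)."""
    return sum(math.log(1.0 - df) for df in d_fake) / len(d_fake)

# A discriminator that separates real from fake well scores higher ...
confident = d_and_r_objective([0.9, 0.8], [0.1, 0.2])
uncertain = d_and_r_objective([0.5, 0.5], [0.5, 0.5])
# ... while F lowers its own objective when D takes its output for real.
fooled = f_objective([0.9])
caught = f_objective([0.1])
```

In an actual implementation the gradients of these quantities with respect to the network parameters would drive the alternating updates; here only the scalar values are computed.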

Pretraining settings
We extracted 43.3 million pairs of an entity mention and a masked sentence from Wikipedia and randomly sampled 10% of them to use as training data (4.33 million pairs, around 700 MB in file size). We used word-embedding vectors in 300 dimensions (for 2.5 million words) pretrained on Wikipedia using fastText (Bojanowski et al., 2017). The embedding vectors were fixed during the training.
We set the training epochs to 200 (t = 200 in Algorithm 1) and did not use any early-stopping technique. We chose t = 200 from the results of our preliminary experiments in which we used 10% of the training data and set training epochs t to one of 100, 200, or 300; the loss robustly converged for t = 200 and t = 300, and thus the earliest point t = 200 was chosen. We used the RMSProp optimizer (Tieleman and Hinton, 2012) with a batch size of 4,000 (n = 4,000 and b = 1,084 in Algorithm 1) and a learning rate of 2e-4. We trained nine CNN models with all combinations of the filter's window sizes ∈ {"1,2,3", "2,3,4", "1,2,3,4"} and number of filters ∈ {100, 200, 300} for the generators F and R. All of the weights in the CNNs were initialized using He's method (He et al., 2015). We used a logistic regression layer with sigmoid outputs as discriminator D. The training of a single CNN model took around 20 hours using 16 Nvidia V100 GPUs with 32 GB of memory (180 hours in total for the nine models).
We tested all nine CNN models for BERTAC in our GLUE and open-domain QA experiments (Section 5). For each task, the parameters inside the CNNs (as well as the word-embedding vectors) were fixed during the fine-tuning of BERTAC.

BERTAC
As illustrated in Fig. 1, BERTAC (BERT-style TLM with an Adversarially pretrained Convolutional neural network) incorporates the representation provided by the adversarially pretrained CNN into the representation generated by a TLM. For the integration, we use several layers of TIERs (Transformers for Integrating External Representation) stacked on top of the TLM.

CNN in BERTAC
For simplicity, we describe how the CNN is integrated in BERTAC using the task of recognizing textual entailment (RTE) as an example. BERTAC for the RTE task takes two sentences $s_x$ and $s_y$ as input and predicts whether $s_x$ entails $s_y$. First, we explain how the adversarially pretrained CNN (generator F in Section 3.2) generates the representation of the two input sentences. We regard the longest common noun phrase of the two sentences as the entity mention to be masked and create entity-masked sentences $m_x$ and $m_y$ from $s_x$ and $s_y$ by masking the noun phrase with [EM] (we use $m_x = s_x$ and $m_y = s_y$ if no common noun phrase is found). Then each of the masked sentences $m_x$ and $m_y$ is given to the CNN. Our expectation here is that the CNN generates similar representations from the masked sentences if they have an entailment relation and that this helps to recognize the entailment relation.
Note that the CNN in BERTAC is connected to several TIER layers and that, as shown in Fig. 1, its input is iteratively updated so that it provides updated representations to the TIER layers. Let $\mathbf{m}^i_x \in \mathbb{R}^{|m_x| \times d_w}$ and $\mathbf{m}^i_y \in \mathbb{R}^{|m_y| \times d_w}$ be the matrices of word embeddings of $m_x$ and $m_y$ given to the CNN connected to the i-th TIER layer, where $d_w$ is the dimension of a word embedding. We denote by $CNN(\mathbf{m})$ the representation generated by the CNN when the matrix of word embeddings $\mathbf{m}$ is used as the input. The i-th TIER layer is given the concatenation of the two CNN representations of $m_x$ and $m_y$, $\mathbf{r}^i = [\mathbf{r}^i_x; \mathbf{r}^i_y] \in \mathbb{R}^{2 \times d_e}$, where $\mathbf{r}^i_x = CNN(\mathbf{m}^i_x)$, $\mathbf{r}^i_y = CNN(\mathbf{m}^i_y)$, and $d_e$ is the dimension of the CNN representation. Note that, for single-sentence tasks, $\mathbf{r}^i = \mathbf{r}^i_x$, the CNN representation of $m_x$, is given to the TIER layers.
The initial matrices of word embeddings $\mathbf{m}^1_x$ and $\mathbf{m}^1_y$ are obtained using the fastText word embeddings (Bojanowski et al., 2017), the same as those used in our adversarial learning. Then, the updated input matrices $\mathbf{m}^{i+1}_x$ and $\mathbf{m}^{i+1}_y$ for the (i+1)-th CNN are obtained from the i-th input matrices $\mathbf{m}^i_x$ and $\mathbf{m}^i_y$ as described below. For the word embedding $\mathbf{m}^i_{x,j}$ of the j-th word in $m_x$, we compute its bilinear score to $\mathbf{r}^i_x$ (Sutskever et al., 2009) and weight the embedding by it: $\tilde{\mathbf{m}}^i_{x,j} = \mathrm{softmax}_j(\mathbf{m}^i_x \mathbf{M}^i_x \mathbf{r}^i_x)\, \mathbf{m}^i_{x,j}$, where $\mathbf{M}^i_x \in \mathbb{R}^{d_w \times d_e}$ is a trainable matrix and $\mathrm{softmax}_j(\mathbf{v})$ denotes the j-th element of the softmaxed vector of $\mathbf{v}$. The bilinear score indicates how much the corresponding token should be highlighted as one associated with the CNN representation $\mathbf{r}^i_x$ during the update process. We expect that this allows the CNN in the next TIER layer to generate further refined representations with the updated embeddings.
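We then compute word embeddings $\mathbf{m}^{i+1}_x$ in a highway network manner (Srivastava et al., 2015) as follows: $\mathbf{m}^{i+1}_{x,j} = H(\tilde{\mathbf{m}}^i_{x,j}) \odot T(\tilde{\mathbf{m}}^i_{x,j}) + \mathbf{m}^i_{x,j} \odot (1 - T(\tilde{\mathbf{m}}^i_{x,j}))$, where $H(\tilde{\mathbf{m}}) = \mathbf{W}^i_h \tilde{\mathbf{m}} + \mathbf{b}^i_h$, $T(\tilde{\mathbf{m}}) = \sigma(\mathbf{W}^i_t \tilde{\mathbf{m}} + \mathbf{b}^i_t)$, $\sigma$ is the sigmoid function, $\odot$ represents the element-wise product, and $\mathbf{W}^i_h$, $\mathbf{W}^i_t$, $\mathbf{b}^i_h$, and $\mathbf{b}^i_t$ are layer-specific trainable parameters. $\mathbf{m}^{i+1}_y$ is also computed from $\mathbf{m}^i_y$ and $\mathbf{r}^i_y$ in the same way. During the fine-tuning of BERTAC for downstream tasks, we fix the parameters of the pretrained CNN but train the parameters used for updating the CNN's input alongside those of the TLMs and TIERs.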
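The layer-wise input update can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: it assumes that the softmax-normalized bilinear score scales each word embedding, and that the highway step uses a linear transform and a sigmoid gate in the style of Srivastava et al. (2015), with the original embedding as the carry path. All function and parameter names are hypothetical.

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [x / total for x in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def bilinear_weights(m, M, r):
    """Per-token weights: softmax over tokens of m_j^T M r.

    m: tokens x d_w word embeddings, M: d_w x d_e matrix, r: d_e CNN vector.
    """
    Mr = matvec(M, r)  # length d_w
    scores = [sum(a * b for a, b in zip(m_j, Mr)) for m_j in m]
    return softmax(scores)

def highway_update(m, weights, W_h, b_h, W_t, b_t):
    """Assumed highway-style update of each embedding from its weighted version."""
    updated = []
    for m_j, w in zip(m, weights):
        scaled = [w * x for x in m_j]  # embedding scaled by its bilinear weight
        h = [a + b for a, b in zip(matvec(W_h, scaled), b_h)]
        t = [1.0 / (1.0 + math.exp(-(a + b)))
             for a, b in zip(matvec(W_t, scaled), b_t)]
        updated.append([hi * ti + xi * (1.0 - ti)
                        for hi, ti, xi in zip(h, t, m_j)])
    return updated

# Toy example with d_w = d_e = 2 and two tokens.
m = [[1.0, 0.0], [0.0, 1.0]]
M = [[1.0, 0.0], [0.0, 1.0]]
r = [2.0, 0.0]
w = bilinear_weights(m, M, r)
zero = [[0.0, 0.0], [0.0, 0.0]]
# A strongly negative gate bias closes the gate, so embeddings pass through.
passthrough = highway_update(m, w, zero, [0.0, 0.0], zero, [-50.0, -50.0])
```

In the toy example the first token aligns with `r` and therefore receives the larger weight; with the gate closed the update leaves the embeddings essentially unchanged, which is the usual motivation for a highway connection.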

Transformers for integrating external representation (TIERs)
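As explained in the Introduction, the main difference between a TIER and a normal transformer lies in the attention mechanism. In the TIER attention mechanism, the query representation, which is one of the three inputs of the transformer's self-attention, is replaced with the representation given by the CNN. Fig. 3 shows the difference between the TIERs' attention computation and that of normal transformers. Attention in normal transformers is computed in the following way: $\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$, where $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are query, key, and value matrices in $\mathbb{R}^{l_k \times d_k}$, $l_k$ is the length of an input sequence, and $d_k$ is the dimension of keys. $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ all come from the same representation of the token sequence provided from the previous transformer layer. The attention should specify how much the corresponding tokens in $\mathbf{V}$ should be highlighted, so we designed ours in the same way. In TIERs, we use the following attention: $\left(\mathrm{softmax}\left(\frac{\mathbf{r}\mathbf{K}^T}{\sqrt{d_k}}\right)\right)^T \mathbf{J}_{u,d_k} \odot \mathbf{V}$. We basically replace the matrix $\mathbf{Q}$ with the CNN's representation $\mathbf{r} \in \mathbb{R}^{u \times d_k}$ while keeping the original $\mathbf{K}$ and $\mathbf{V}$, where $u$ is the number of sentences in the input of the model ($u \in \{1, 2\}$ in this paper) and $\mathbf{J}_{u,d_k}$ is the all-one matrix introduced below.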
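Since $\mathbf{r}$ is a matrix with a different size from $\mathbf{Q}$, we needed to adapt the attention computation. We first multiply $\mathbf{r}$ by $\mathbf{K}^T$, and then its softmaxed results are converted into an $l_k \times d_k$ dimensional matrix using the all-one matrix $\mathbf{J}_{u,d_k} \in \mathbb{R}^{u \times d_k}$. Let the resulting matrix be $\mathbf{A} = \left(\mathrm{softmax}\left(\frac{\mathbf{r}\mathbf{K}^T}{\sqrt{d_k}}\right)\right)^T \mathbf{J}_{u,d_k} \in \mathbb{R}^{l_k \times d_k}$. We apply the attention score to $\mathbf{V}$ by using the element-wise product between matrices: $\mathbf{A} \odot \mathbf{V}$.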
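In addition, the actual CNN representation $\mathbf{r}_{CNN} \in \mathbb{R}^{u \times d_e}$ given by our CNNs usually has a size that does not match the size requirement for $\mathbf{r}$. Thus, we convert it to $\mathbf{r} \in \mathbb{R}^{u \times d_k}$, a $d_k$-column matrix, as follows: $\mathbf{r} = \mathbf{r}_{CNN}\mathbf{W} + \mathbf{b}$, where $\mathbf{W} \in \mathbb{R}^{d_e \times d_k}$ and $\mathbf{b}$ are trainable.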
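The TIER attention described above can be sketched with plain Python lists. This is a minimal single-head illustration, assuming that the softmax is taken over the tokens for each sentence representation (the text does not state the axis explicitly); the function names are hypothetical and multi-head handling is omitted.

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [x / total for x in exps]

def tier_attention(r, K, V):
    """Compute A ⊙ V with A = softmax(r K^T / sqrt(d_k))^T J_{u,d_k}.

    r: u x d_k (projected CNN representation, one row per input sentence)
    K, V: l_k x d_k (keys and values from the previous layer)
    Returns an l_k x d_k matrix.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    # One distribution over the l_k tokens per sentence representation.
    rows = [softmax([sum(a * b for a, b in zip(r_s, k_j)) / scale for k_j in K])
            for r_s in r]
    # Multiplying the transposed scores by the all-one matrix J sums the
    # per-sentence weights of token j and copies the sum into every column.
    token_weight = [sum(row[j] for row in rows) for j in range(len(K))]
    # Element-wise product A ⊙ V: each value vector is scaled by its token weight.
    return [[token_weight[j] * v for v in V[j]] for j in range(len(V))]

# Toy example: one sentence (u = 1), two tokens, d_k = 2.
r = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 1.0], [1.0, 1.0]]
out = tier_attention(r, K, V)
```

Unlike standard attention, the output keeps one row per token and rescales each token's value vector instead of mixing value vectors across tokens; with u = 1 the token weights sum to 1, and with u = 2 they sum to 2.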

Experiments
We tested our model on GLUE and on open-domain QA. In this section, we report the results.

GLUE
GLUE (Wang et al., 2018) is a multi-task benchmark composed of nine tasks: two single-sentence tasks (CoLA and SST-2) and seven two-sentence tasks, namely three similarity/paraphrase tasks (MRPC, QQP, and STS-B) and four natural language inference tasks (MNLI, QNLI, RTE, and WNLI). Following the previous work of ALBERT (Lan et al., 2020), we performed single-task fine-tuning for each task under the following settings: single-model for the development set and ensemble for test set submissions. As in prior work, including Lan et al. (2020), we report the performance on the development set for each task by averaging over five runs with different random initialization seeds. As in Lan et al. (2020), for test set submissions, we fine-tuned the models for the RTE, STS-B, and MRPC tasks by initializing them with the fine-tuned MNLI single-task model, and we also used task-specific modification for CoLA and WNLI to improve scores (see Appendix A for details). We explored ensemble settings between 6 and 30 models per task for our test set submission.

Fine-tuning details of BERTAC for GLUE
We used ALBERT-xxlarge-v2 (Lan et al., 2020) as the pretrained TLM. As hyperparameters for BERTAC, for each task we tested learning rates ∈ {8e-6, 9e-6, 1e-5, 2e-5, 3e-5}, a linear warmup for the first 6% of steps followed by a linear decay to 0, a maximum sequence length of 128, and all nine CNNs pretrained with different filter settings. We set the batch size to 128 for MNLI and QQP and 16 for the other tasks. Furthermore, we trained our model with the following sets of training epochs: {1, 2, 3, 4, 5} for MNLI, QQP, and QNLI; {6, 7, 8, 9, 10} for CoLA, MRPC, and RTE; and {90, 95, 100, 105, 110} for WNLI. We set the number of TIER layers to 3 after preliminary experiments. See Table 9 in Appendix B for a summary of the hyperparameters tested in the GLUE experiments.
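The learning-rate schedule mentioned above (linear warmup for the first 6% of steps, then linear decay to 0) can be written as a small function. This is a sketch under assumptions: the function name is hypothetical, and the exact step-indexing convention used in the experiments is not given in the text.

```python
def linear_warmup_decay(step, total_steps, peak_lr, warmup_frac=0.06):
    """Learning rate at `step` (0-based) for linear warmup then linear decay to 0."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr over the warmup steps.
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr at the end of warmup to 0 at total_steps.
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

# Example: 1,000 update steps with a peak learning rate of 1e-5.
schedule = [linear_warmup_decay(s, 1000, 1e-5) for s in range(1001)]
```

In this example the rate starts at 0, reaches its peak after 60 steps, and returns to 0 at the final step.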
During the fine-tuning of BERTAC, the parameters inside the CNNs (as well as word embeddings of fastText) were fixed as explained in Section 3.3, while those used to update the input to the CNNs were optimized. For each task, we selected the pretrained CNN (out of nine) and the BERTAC hyperparameters that gave the best performance on the development data.

Table 2 shows the results of eight tasks on the GLUE development set: all of them are single-model results. BERTAC outperformed the previous TLM-based models on seven of the eight tasks (all except QQP) and, as a result, showed the best average performance on the development set. Crucially, our model improved the average performance by around 0.7% over ALBERT, the base TLM in our model. This indicates the effectiveness of adversarially trained CNNs and TIERs in BERTAC.

The test set results obtained from the GLUE leaderboard are summarized in Table 3. Our model showed comparable performance to SOTA, DeBERTa/TuringNLRv4, and achieved state-of-the-art results on 3 out of 9 tasks. It also showed better performance than ALBERT, our base TLM, in most tasks.

Results
To investigate whether our GAN-style pretraining of CNNs contributed to the performance improvement, we also tested the following alternative training schemes for the CNN used in BERTAC.
Self-supervised CNN: We pretrained the CNN to generate representations of a masked sentence in a self-supervised way as follows: For an entity mention e and an entity-masked sentence m in the training data (Section 3.1), the CNN generates a representation r from the masked sentence trying to minimize MSE (mean squared error) between r and the entity mention's representation e (average word embedding of all tokens in e).
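The regression target of this self-supervised baseline can be illustrated as follows. The sketch assumes that the MSE is averaged over vector dimensions and that the CNN output has the same dimension as the word embeddings; the helper names are hypothetical, and toy 2-dimensional vectors stand in for the 300-dimensional fastText embeddings.

```python
def average_embedding(token_vectors):
    """Target representation of an entity mention: mean of its token embeddings."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def mse(prediction, target):
    """Mean squared error between two vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)

# Toy embeddings for the two tokens of an entity mention.
entity_tokens = [[1.0, 3.0], [3.0, 5.0]]
target = average_embedding(entity_tokens)
loss_perfect = mse([2.0, 4.0], target)
loss_zero_vector = mse([0.0, 0.0], target)
```

Unlike the GAN-style scheme, this baseline ties the CNN output directly to a fixed embedding target.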

Randomly initialized CNN: We did not pretrain the CNNs, but trained them alongside the TLMs during the fine-tuning of BERTAC (the CNNs were randomly initialized).
We trained both the self-supervised and randomly initialized CNNs using the same hyperparameter settings as GAN-style CNNs (see Section 3.3). We confirm from the results in Table 4 that only the proposed method with our GAN-style CNNs showed a higher average score than ALBERT. This suggests the effectiveness of our GAN-style pretraining scheme of CNNs.

Note to Table 2: The results of the previous models, including XLNET (Yang et al., 2019b), ELECTRA (Clark et al., 2020), ALBERT (Lan et al., 2020), and DeBERTa (He et al., 2021), were taken from their papers. We omit the results of the WNLI task, since many previous works did not report the dev set results.

Open-domain QA
We also tested BERTAC on open-domain QA (Chen et al., 2017) with the publicly available datasets Quasar-T (Dhingra et al., 2017) and SearchQA (Dunn et al., 2017). We used the pre-processed version of the datasets provided by Lin et al. (2018), which contains passages retrieved for all questions, and followed their data split as described in Table 5.

BERTAC for open-domain QA
We implemented our QA model following the approach of Lin et al. (2018), which combines a passage selector to choose relevant passages from retrieved passages and an answer span selector to identify the answer span in the selected passages. For the given question $q$ and the set of retrieved passages $P = \{p_i\}$, we computed the probability $\Pr(a|q, P)$ of extracting answer span $a$ to question $q$ from $P$ in the following way, and then we extracted the answer span $\hat{a}$ with the highest probability: $\Pr(a|q, P) = \sum_i \Pr(a|q, p_i)\Pr(p_i|q, P)$, where $\Pr(p_i|q, P)$ is given by the passage selector and $\Pr(a|q, p_i)$ by the answer span selector. In our BERTAC passage selector, a representation in the top TIER layer is fed into a linear layer with a softmax, which computes the probability that the passage contains a correct answer to the question. Our BERTAC answer span selector identifies answer spans from passages by computing start and end probabilities of each token in passages, where we feed the representation of each token in the top layer of TIERs to two linear layers, each with a softmax for the probabilities (Devlin et al., 2019).
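The way the two selectors combine can be sketched as below. The sketch assumes that the answer probability is obtained by summing, over the retrieved passages, the product of the passage probability and the span probability, following the formulation of Lin et al. (2018); the function name and the dictionary-based input format are hypothetical.

```python
def best_answer(passage_probs, span_probs):
    """Combine passage-selector and span-selector probabilities.

    passage_probs[i]: Pr(p_i | q, P) for the i-th retrieved passage.
    span_probs[i]: dict mapping candidate answer strings to Pr(a | q, p_i).
    Returns (best_answer_string, dict of aggregated Pr(a | q, P)).
    """
    totals = {}
    for p_prob, spans in zip(passage_probs, span_probs):
        for answer, a_prob in spans.items():
            totals[answer] = totals.get(answer, 0.0) + a_prob * p_prob
    return max(totals, key=totals.get), totals

# Toy example with three retrieved passages.
passage_probs = [0.5, 0.3, 0.2]
span_probs = [
    {"Paris": 0.6, "Lyon": 0.4},
    {"Lyon": 0.9},
    {"Paris": 0.5},
]
answer, totals = best_answer(passage_probs, span_probs)
```

In the toy example "Lyon" wins even though the most probable passage prefers "Paris", because evidence is accumulated across passages.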

Training details for open-domain QA
We used all nine pretrained CNNs, as in the GLUE experiments. As pretrained TLMs, we used ALBERT-xxlarge-v2 (Lan et al., 2020) and RoBERTa-large. We set the learning rate to 1e-5, the number of epochs to 2, the maximum sequence length to 384, and the number of TIER layers to 3. We used a linear warmup for the first 6% of steps followed by a linear decay to 0 with a batch size of 48 for Quasar-T and 96 for SearchQA. We tested all of the pretrained CNNs and chose for each dataset the one that maximizes EM (the percentage of the predictions matching exactly one of the ground truth answers).

Table 6: Previous methods used for comparison.
Non-TLM-based methods
OPENQA (Lin et al., 2018): An RNN-based method that jointly learns passage selection and answer extraction.
OPENQA+ARG (Oh et al., 2019): An extension of OPENQA that additionally uses an answer representation generator (ARG) trained by adversarial learning.
TLM-based methods
WKLM (Xiong et al., 2020): This uses a TLM pretrained with a weakly supervised objective for learning Wikipedia entity information. BERT-base was used for the training.
MBERT (Wang et al., 2019): A BERT-based method that extracts answers using globally normalized answer scores across all the passages retrieved by the same question. BERT-large was used for the training.
CFORMER (Wang et al., 2020b): It uses a clustering-based sparse transformer for long-range dependency encoding. The method was trained using RoBERTa-large.

Results
We compared BERTAC with the previous works described in Table 6. Table 7 shows the performance of all of the methods. The subscripts of the TLM-based methods represent the type of pretrained TLM used by each method. All the methods were evaluated using EM and F1 score (average overlap between the prediction and gold answer). BERTAC ALBERT-xxlarge outperformed all of the baselines including the SOTA method (CFORMER) on both EM and F1. BERTAC RoBERTa-large in the same TLM setting as the SOTA method showed a better performance than SOTA except for F1 in Quasar-T. These results suggest that our framework is effective for QA tasks as well.
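The two evaluation metrics can be sketched as follows. The exact evaluation script is not described in the text, so this is an assumption-laden illustration: answers are normalized only by lowercasing and whitespace tokenization, and F1 is the usual token-overlap F1 between a prediction and one gold answer. The function names are hypothetical.

```python
from collections import Counter

def normalize(text):
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()

def exact_match(prediction, gold_answers):
    """1.0 if the prediction matches any gold answer after normalization."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gold) for gold in gold_answers))

def f1_score(prediction, gold):
    """Token-overlap F1 between a prediction and a single gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

em = exact_match("Suvarnabhumi Airport", ["suvarnabhumi airport"])
f1 = f1_score("the airport", "airport")
```

Dataset-level scores would then be obtained by averaging these per-question values over all questions.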
For ablation studies, we evaluated three variants of BERTAC ALBERT-xxlarge: "w/o CNN and TIER," which uses ALBERT-xxlarge alone without using our CNN and TIER; "w/o GAN-style CNN," which does not use our CNN pretrained by the GAN-style training scheme but uses self-supervised CNNs (the same as used in the GLUE experiments, see Table 4); and "w/o update," which does not perform the layer-wise update of the CNN inputs. The results in Table 8 suggest that all of the following contributed to the performance improvement: the combination of TLMs and GAN-style CNNs, our GAN-style training of CNNs, and the layer-wise update of the CNN inputs.

Conclusion
We proposed BERTAC (BERT-style TLM with an Adversarially pretrained Convolutional neural network), a combination of a TLM and a CNN, where the CNN was pretrained using a novel GAN-style training scheme and masked sentences obtained automatically from Wikipedia. Using this CNN, we improved the performance of standard TLMs. We confirmed that BERTAC could achieve comparable performance with the SOTA and outperformed the base TLM used as a subcomponent of BERTAC in the GLUE task. We also show that BERTAC outperformed the SOTA method of open-domain QA on Quasar-T and SearchQA.

A Task-specific Modification for GLUE Test-set Submission
We applied task-specific modifications to WNLI and CoLA in the GLUE tasks to achieve competitive GLUE leaderboard results, i.e., the test-set submission results presented in Table 3. For WNLI, we followed Raffel et al. (2020), while, for CoLA, we propose our own modification. Note that we did not apply these tricks when obtaining the development-set results shown in Table 2. In the following, we describe the tricks.

A.1 WNLI
WNLI is a coreference resolution task with a two-sentence input. The first sentence has an ambiguous pronoun and the second sentence is generated from the first sentence by replacing the pronoun with one of the possible referents (noun phrases) in the first sentence (Wang et al., 2018). In this task, we must predict whether the candidate referent in the second sentence is the correct referent of the pronoun. Since the format of WNLI is known to be difficult for a model to learn, many previous works, including those using ALBERT, RoBERTa, or T5 (Lan et al., 2020; Raffel et al., 2020), converted the data to a simpler format before training their WNLI model for GLUE test-set submission.
Following these approaches, we converted the data in the same way as Raffel et al. (2020). First, we extract candidate referents for an ambiguous pronoun as follows. Suppose that the following sentence pair (s1, s2) is from the WNLI task's data and has the label correct (meaning that Susan in s2 is the correct referent of the pronoun she in s1).
s1: Jane knocked on Susan's door but she did not get an answer.
s2: Susan did not get an answer.
We first find all of the pronouns in the first sentence ("she" in s1). For each pronoun, we find the longest sequence of words that precedes or follows the pronoun in the first sentence and that also appears in the second sentence ("did not get an answer" in both s1 and s2). We then choose the pronoun that precedes or follows the longest matching word sequence and obtain a candidate referent by deleting the matched sequence of words from the second sentence. In the example sentence pair (s1, s2), we choose the pronoun she from the first sentence (since there is a single pronoun) and obtain the candidate referent Susan from the second sentence through this process. Finally, we convert the original sentence pair into a pair of a masked sentence and a candidate referent by replacing the pronoun in the first sentence with [MASK] and replacing the second sentence with the extracted referent. The (s1, s2) pair is thus changed to the following (s1', s2'):
s1': Jane knocked on Susan's door but [MASK] did not get an answer.
s2': Susan
Since the format of this converted data is similar to that of the training data for the GAN-style training scheme of our CNN, we expect that by using this data conversion, BERTAC can more effectively predict whether the candidate referent for the masked pronoun is correct.
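The conversion procedure described above can be sketched as follows. This is a minimal illustration rather than the actual preprocessing code: the pronoun list, the whitespace tokenization, the case-insensitive matching, and all function names are assumptions made for the example.

```python
# Assumption: a small hand-written pronoun list; the real list is not given.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "its", "their"}


def find_sub(seq, sub):
    """Return the start index of `sub` as a contiguous run in `seq`, or -1."""
    n = len(sub)
    for i in range(len(seq) - n + 1):
        if seq[i:i + n] == sub:
            return i
    return -1


def longest_adjacent_match(words1, idx, words2):
    """Longest run directly before or after words1[idx] that also occurs in
    words2. Returns (length, start position in words2)."""
    low1 = [w.lower() for w in words1]
    low2 = [w.lower() for w in words2]
    best = (0, -1)
    # Runs that follow the pronoun, longest first.
    for end in range(len(low1), idx + 1, -1):
        pos = find_sub(low2, low1[idx + 1:end])
        if pos >= 0:
            best = max(best, (end - idx - 1, pos))
            break
    # Runs that precede the pronoun, longest first.
    for start in range(0, idx):
        pos = find_sub(low2, low1[start:idx])
        if pos >= 0:
            best = max(best, (idx - start, pos))
            break
    return best


def convert_wnli(s1, s2):
    """Turn (s1, s2) into (masked sentence, candidate referent), or None."""
    w1, w2 = s1.split(), s2.split()
    best_len, best_idx, best_pos = 0, -1, -1
    for i, w in enumerate(w1):
        if w.lower().strip(".,;:!?\"'") in PRONOUNS:
            length, pos = longest_adjacent_match(w1, i, w2)
            if length > best_len:
                best_len, best_idx, best_pos = length, i, pos
    if best_idx < 0:
        return None  # no pronoun with a matching word sequence was found
    # Replace the chosen pronoun with [MASK]; delete the match from s2.
    masked = w1[:best_idx] + ["[MASK]"] + w1[best_idx + 1:]
    referent = w2[:best_pos] + w2[best_pos + best_len:]
    return " ".join(masked), " ".join(referent)
```

Applied to the example pair, `convert_wnli` returns the masked first sentence and the referent "Susan". Punctuation attached to the pronoun itself would be lost in this simplified version, which a fuller implementation would need to handle.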

A.2 CoLA
In the CoLA task, we need to predict whether a given sentence is grammatically acceptable. For this task, we conducted a two-step fine-tuning. In the first step, we fine-tuned BERTAC with automatically generated pseudo-training data. This data was prepared as described below, and does not include the original CoLA training data. In the second step, we further fine-tuned the model obtained in the first step using the original CoLA training data. The BERTAC model obtained at this second step was used for the test-set submission.

[Table: hyperparameter settings; columns: MNLI, QNLI, QQP, RTE, SST-2, MRPC, CoLA, STS-B, WNLI]
Learning rate: {8e-6, 9e-6, 1e-5, 2e-5, 3e-5}
Batch size: 128, 16
Training epoch: {1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}, {90, 95, 100, 105, 110}
TIER layer: 3
Max sequence length: 128
Warmup step: linear warmup for the first 6% of steps
CNN: 9 models pretrained with different filter settings
To automatically generate pseudo-training data, we regarded all of the sentences in the training data of MNLI, QQP, and QNLI as grammatically acceptable and used them as positive examples in the pseudo-training data. After removing duplicate sentences, we generated one negative example for each positive example by modifying it, under the assumption that the modification makes the generated example grammatically unacceptable. As the modification, we randomly applied one of the following three operations (Brahma, 2018): permutation (of four randomly selected words), insertion (of two random words at random positions), and deletion (of two randomly selected words).
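The construction of the pseudo-training data can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, sentences are split on whitespace, and the words used for insertion are drawn from the vocabulary of the positive examples, since the source of the inserted words is not specified here.

```python
import random


def permute_words(words, rng, k=4):
    """Shuffle the words found at k randomly selected positions."""
    idx = rng.sample(range(len(words)), min(k, len(words)))
    shuffled = idx[:]
    rng.shuffle(shuffled)  # note: may occasionally leave the order unchanged
    out = words[:]
    for src, dst in zip(idx, shuffled):
        out[dst] = words[src]
    return out


def insert_words(words, rng, vocab, k=2):
    """Insert k random words from `vocab` at random positions."""
    out = words[:]
    for _ in range(k):
        out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
    return out


def delete_words(words, rng, k=2):
    """Delete k randomly selected words."""
    drop = set(rng.sample(range(len(words)), min(k, len(words))))
    return [w for i, w in enumerate(words) if i not in drop]


def make_negative(sentence, rng, vocab):
    """Apply one randomly chosen operation to create a negative example."""
    words = sentence.split()
    op = rng.choice(["permutation", "insertion", "deletion"])
    if op == "permutation":
        words = permute_words(words, rng)
    elif op == "insertion":
        words = insert_words(words, rng, vocab)
    else:
        words = delete_words(words, rng)
    return " ".join(words)


def build_pseudo_data(sentences, seed=0):
    """Return (sentence, label) pairs: label 1 = acceptable, 0 = unacceptable."""
    rng = random.Random(seed)
    unique = list(dict.fromkeys(sentences))  # remove duplicates, keep order
    # Assumption: inserted words come from the positive examples' vocabulary.
    vocab = sorted({w for s in unique for w in s.split()})
    data = []
    for s in unique:
        data.append((s, 1))
        data.append((make_negative(s, rng, vocab), 0))
    return data
```

Because each unique positive sentence yields exactly one negative, the resulting data is balanced by construction. A permutation can by chance reproduce the original word order, so a stricter version might resample in that case.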
We obtained about 2.14 million examples in this way, half of them positives and the other half negatives. We used all of these automatically generated training samples for the first-step fine-tuning of BERTAC, with a learning rate of 8e-6, a single training epoch, and a batch size of 128, while applying the same settings for the other hyperparameters as those used for the other tasks. The model obtained by the first-step fine-tuning was then used as the starting point for the second-step fine-tuning, which used the original CoLA training data and produced our final model for CoLA.