BinaryBERT: Pushing the Limit of BERT Quantization

The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper, we propose BinaryBERT, which pushes BERT quantization to the limit by weight binarization. We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape. Therefore, we propose ternary weight splitting, which initializes BinaryBERT by equivalently splitting from a half-sized ternary network. The binary model thus inherits the good performance of the ternary one, and can be further enhanced by fine-tuning the new architecture after splitting. Empirical results show that our BinaryBERT has only a slight performance drop compared with the full-precision model while being 24× smaller, achieving state-of-the-art compression results on the GLUE and SQuAD benchmarks. Code will be released.

Figure 1: Performance of quantized BERT with varying weight bit-widths and 8-bit activation. We report the mean results with standard deviations from 10 seeds on MRPC and 3 seeds on MNLI-m, respectively.

Among all these model compression approaches, quantization is a popular solution, as it does not require designing a smaller model architecture. Instead, it compresses the model by replacing each 32-bit floating-point parameter with a low-bit fixed-point representation. Existing attempts quantize pre-trained models (Zafrir et al., 2019; Fan et al., 2020) to as low as ternary values (2-bit) with minor performance drop (Zhang et al., 2020). However, none of them achieves binarization (1-bit). As the limit of quantization, weight binarization could bring at most a 32× reduction in model size and replace most floating-point multiplications with additions. Moreover, quantizing activations to 8-bit or 4-bit further replaces floating-point additions with int8 and int4 additions, decreasing the energy burden and the area usage on chips (Courbariaux et al., 2015).
In this paper, we explore binarizing BERT parameters with quantized activations, pushing BERT quantization to the limit. We find that directly training a binary network is rather challenging. According to Figure 1, there is a sharp performance drop when reducing the weight bit-width from 2-bit to 1-bit, compared to other bit configurations. To explore the challenges of binarization, we analyze the loss landscapes of models under different precisions both qualitatively and quantitatively. It is found that while the full-precision and ternary (2-bit) models enjoy relatively flat and smooth loss surfaces, the binary model suffers from a rather steep and complex landscape, which poses great challenges to the optimization.
Motivated by the above empirical observations, we propose ternary weight splitting, which takes the ternary model as a proxy to bridge the gap between the binary and full-precision models. Specifically, ternary weight splitting equivalently converts both the quantized and latent full-precision weights of a well-trained ternary model to initialize BinaryBERT. Therefore, BinaryBERT retains the good performance of the ternary model, and can be further refined on the new architecture. While neuron splitting has previously been studied (Chen et al., 2016; Wu et al., 2019) for full-precision networks, our ternary weight splitting is much more complex due to the additional equivalence requirement on the quantized weights. Furthermore, the proposed BinaryBERT also supports adaptive splitting. It can adaptively perform splitting on the most important ternary modules while leaving the rest binary, based on efficiency constraints such as model size or floating-point operations (FLOPs). Our approach therefore allows flexible sizes of binary models to meet the demands of various edge devices.
Empirical results show that a BinaryBERT split from a half-width ternary network is much better than a directly-trained binary model with the original width. On the GLUE and SQuAD benchmarks, our BinaryBERT has only a slight performance drop compared to the full-precision BERT-base model, while being 24× smaller. Moreover, BinaryBERT with the proposed importance-based adaptive splitting also outperforms other splitting criteria across a variety of model sizes.

Difficulty in Training Binary BERT
In this section, we show that it is challenging to train a binary BERT directly with conventional binarization approaches. Before diving into the details, we first review the necessary background.
We follow the standard quantization-aware training procedure (Zhou et al., 2016). Specifically, given the weight w ∈ R^n (a.k.a. latent full-precision weights), each forward propagation quantizes it to ŵ = Q(w) by some quantization function Q(·), and then computes the loss ℓ(ŵ) at ŵ. During back-propagation, we use ∇ℓ(ŵ) to update the latent full-precision weights w due to the non-differentiability of Q(·), which is known as the straight-through estimator (Courbariaux et al., 2015).
Recent TernaryBERT (Zhang et al., 2020) follows the Ternary-Weight-Network (TWN) (Li et al., 2016) to quantize the elements of w to three values {±α, 0}. To avoid confusion, we use superscripts t and b for the latent full-precision and quantized weights of the ternary and binary models, respectively. Specifically, TWN ternarizes each element w^t_i of the ternary weight w^t as

ŵ^t_i = α · sign(w^t_i) if |w^t_i| > Δ, and 0 otherwise,   (1)

where sign(·) is the sign function, Δ = 0.7/n · ‖w^t‖_1 is the ternarization threshold, and α is the average absolute value of the elements with |w^t_i| > Δ.

Binarization. Binarization was first proposed in (Courbariaux et al., 2015) and has been extensively studied in academia (Rastegari et al., 2016; Hubara et al., 2016; Liu et al., 2018). As a representative work, the Binary-Weight-Network (BWN) (Hubara et al., 2016) binarizes w^b element-wise with a scaling parameter α as follows:

ŵ^b_i = α · sign(w^b_i),   α = ‖w^b‖_1 / n.   (2)

Despite the appealing properties of network binarization, we show that it is non-trivial to obtain a binary BERT with these binarization approaches.
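The two quantizers above can be sketched in a few lines of NumPy (a minimal sketch; the function names are ours, while the threshold and scaling factors follow the formulas in the text):

```python
import numpy as np

def ternarize_twn(w):
    """TWN ternarization (Li et al., 2016): map each weight to {-alpha, 0, +alpha}.

    delta = 0.7/n * ||w||_1 is the threshold; alpha is the mean absolute
    value of the weights that survive the threshold."""
    n = w.size
    delta = 0.7 * np.abs(w).sum() / n
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

def binarize_bwn(w):
    """BWN binarization: map each weight to {-alpha, +alpha},
    with alpha = ||w||_1 / n."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)
```

In quantization-aware training, these would be applied in the forward pass while gradients flow to the latent full-precision weights through the straight-through estimator.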

Sharp Performance Drop with Weight Binarization
To study the performance drop of BERT quantization, we train the BERT model with full-precision, {8,4,3,2,1}-bit weight quantization and 8-bit activations on MRPC and MNLI-m from the GLUE benchmark.

Figure 3: Ratio of the top eigenvalues between the ternary/binary models and the full-precision model, for easy comparison. The error bars are estimated over all Transformer layers and different data mini-batches.
From Figure 1, the performance drops mildly from 32-bit down to 2-bit, i.e., around 0.6% ↓ on MRPC and 0.2% ↓ on MNLI-m. However, when reducing the bit-width to one, the performance drops sharply, i.e., ∼3.8% ↓ and ∼0.9% ↓ on the two tasks, respectively. Therefore, weight binarization may severely harm the performance, which may explain why most current approaches stop at 2-bit weight quantization (Zadeh and Moshovos, 2020; Zhang et al., 2020). To further push weight quantization to the limit, a first step is to study the potential reasons behind the sharp drop from ternarization to binarization.

Exploring the Quantized Loss Landscape
Visualization. To learn about the challenges behind binarization, we first visually compare the loss landscapes of the full-precision, ternary, and binary BERT models. Following (Nahshan et al., 2019), we extract parameters w_x, w_y from the value layers of multi-head attention in the first two Transformer layers (we also extract parameters from other parts of the Transformer in Appendix C.2, and the observations are similar), and assign the following perturbations to the parameters:

w̃_x = w_x + x · 1_x,   w̃_y = w_y + y · 1_y,   (3)

where x ∈ {±0.2·w̄_x, ±0.4·w̄_x, ..., ±1.0·w̄_x} are perturbation magnitudes based on the absolute mean value w̄_x of w_x, and similar rules hold for y. 1_x and 1_y are vectors with all elements being 1. For each pair (x, y), we evaluate the corresponding training loss and plot the surface in Figure 2.
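The perturbation scheme can be sketched as follows (a minimal sketch; the function name and signature are ours, and `loss_fn` is assumed to take the two perturbed parameter groups and return a scalar training loss):

```python
import numpy as np

def landscape_grid(loss_fn, w_x, w_y, steps=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Evaluate the loss on a 2-D perturbation grid around (w_x, w_y).

    Perturbation magnitudes are multiples of the absolute mean of each
    parameter group, following Nahshan et al. (2019)."""
    mx, my = np.abs(w_x).mean(), np.abs(w_y).mean()
    # Signed offsets plus the unperturbed point: 11 values per axis here.
    offsets = sorted([s for m in steps for s in (-m, m)] + [0.0])
    surface = {}
    for dx in offsets:
        for dy in offsets:
            surface[(dx, dy)] = loss_fn(w_x + dx * mx, w_y + dy * my)
    return surface
```

Plotting `surface` over the (dx, dy) grid yields landscapes like those in Figure 2.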
As can be seen, the full-precision model (Figure 2(a)) has the lowest overall training loss, and its loss landscape is flat and robust to perturbation. For the ternary model (Figure 2(b)), although the surface tilts up under larger perturbations, it looks locally convex and is thus easy to optimize. This may also explain why the BERT model can be ternarized without a severe accuracy drop (Zhang et al., 2020). However, the loss landscape of the binary model (Figure 2(c)) turns out to be both higher and more complex. Stacking the three landscapes together (Figure 2(d)), the loss surface of the binary BERT stands on top with a clear margin over the other two. The steep curvature of the loss surface reflects a higher sensitivity to binarization, which contributes to the training difficulty.
Figure 4: The overall workflow of training BinaryBERT. We first train a half-sized ternary BERT model, and then apply the ternary weight splitting operator (Equations (6) and (7)) to obtain the latent full-precision and quantized weights as the initialization of the full-sized BinaryBERT. We then fine-tune BinaryBERT for further refinement.

Steepness Measurement. To quantitatively measure the steepness of the loss landscape, we start from a local minimum w and apply a second-order approximation to the curvature. By Taylor's expansion, the loss increase induced by quantizing w can be approximately upper bounded by

ℓ(ŵ) − ℓ(w) ≈ (1/2) ε⊤Hε ≤ (1/2) λ_max ‖ε‖²,   (4)

where ε = w − ŵ is the quantization noise, and λ_max is the largest eigenvalue of the Hessian H at w. Note that the first-order term is skipped since ∇ℓ(w) = 0. Thus we take λ_max as a quantitative measure of the steepness of the loss surface. Following prior practice, we adopt the power method to compute λ_max. As it is computationally expensive to estimate H for all w in the network, we consider the following groups separately: (1) the query/key layers (MHA-QK), (2) the value layer (MHA-V), and (3) the output projection layer (MHA-O) of the multi-head attention, and (4) the intermediate layer (FFN-Mid) and (5) the output layer (FFN-Out) of the feed-forward network. Note that we group the key and query layers, as they are used together to calculate the attention scores. From Figure 3, the top-1 eigenvalues of the binary model are higher in both mean and standard deviation than those of the full-precision baseline and the ternary model. For instance, the top-1 eigenvalues of MHA-O in the binary model are ∼15× larger than the full-precision counterpart. Therefore, the quantization loss increases of the full-precision and ternary models are more tightly bounded in Equation (4) than that of the binary model. The highly complex and irregular landscape induced by binarization thus poses more challenges to the optimization.
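The power method for λ_max can be sketched as below (a minimal sketch; in practice `hvp` would be computed by double back-propagation rather than by materializing H, and the toy Hessian here is purely illustrative):

```python
import numpy as np

def top_eigenvalue(hvp, dim, iters=100, seed=0):
    """Estimate the largest-magnitude eigenvalue of a Hessian by power
    iteration. `hvp(v)` returns the Hessian-vector product H @ v."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(v)
        lam = float(v @ hv)                      # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)    # re-normalize the iterate
    return lam

# Toy check with an explicit diagonal Hessian whose top eigenvalue is 3.
H = np.diag([3.0, 1.0, 0.5])
lam_max = top_eigenvalue(lambda v: H @ v, dim=3)
```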

Ternary Weight Splitting
Given the challenging loss landscape of binary BERT, we propose ternary weight splitting (TWS), which exploits the flatness of the ternary loss landscape as an optimization proxy for the binary model. As shown in Figure 4, we first train a half-sized ternary BERT to convergence, and then split both the latent full-precision weights w^t and the quantized weights ŵ^t into their binary counterparts via the TWS operator. To inherit the performance of the ternary model after splitting, the TWS operator requires splitting equivalency (i.e., the same output given the same input):

ŵ^{b1} + ŵ^{b2} = ŵ^t,   w^{b1} + w^{b2} = w^t.   (5)

While the solution to Equation (5) is not unique, we constrain the latent full-precision weights after splitting by Equations (6) and (7), where a and b are the variables to solve. Solving the equivalence constraints yields the closed-form solution in Equation (8), where we denote I = {i | ŵ^t_i ≠ 0}, J = {j | ŵ^t_j = 0 and w^t_j > 0} and K = {k | ŵ^t_k = 0 and w^t_k < 0}, and |·| denotes the cardinality of a set. The detailed derivation of Equation (8) is in Appendix A.
Quantization Details. Following (Zhang et al., 2020), for each weight matrix in the Transformer layers, we use layer-wise ternarization (i.e., one scaling parameter for all elements in the weight matrix). For word embedding, we use row-wise ternarization (i.e., one scaling parameter for each row in the embedding). After splitting, each of the two split matrices has its own scaling factor.
Aside from weight binarization, we simultaneously quantize activations before all matrix multiplications, which could accelerate inference on specialized hardware (Zafrir et al., 2019). Following (Zafrir et al., 2019; Zhang et al., 2020), we skip the quantization of all layer-normalization (LN) layers, skip connections, and biases, as their calculations are negligible compared to matrix multiplication. The last classification layer is also not quantized to avoid a large accuracy drop.
Training with Knowledge Distillation. Knowledge distillation has been shown to benefit BERT quantization (Zhang et al., 2020). Following (Jiao et al., 2020; Zhang et al., 2020), we first perform intermediate-layer distillation from the full-precision teacher network's embedding E, layer-wise MHA outputs M_l and FFN outputs F_l to the quantized student counterparts Ê, M̂_l, F̂_l (l = 1, 2, ..., L), minimizing their mean squared errors. Thus the objective function is

ℓ_int = MSE(Ê, E) + Σ_{l=1}^{L} [ MSE(M̂_l, M_l) + MSE(F̂_l, F_l) ].

We then conduct prediction-layer distillation by minimizing the soft cross-entropy (SCE) between the quantized student logits ŷ and the teacher logits y, i.e., ℓ_pred = SCE(ŷ, y).
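The two distillation objectives can be sketched as follows (a minimal sketch; function names are ours, and the MSE-plus-soft-cross-entropy form follows the description above):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_cross_entropy(student_logits, teacher_logits):
    """Soft cross-entropy between student and teacher predictions."""
    p_t = softmax(teacher_logits)
    log_p_s = np.log(softmax(student_logits) + 1e-12)
    return float(-(p_t * log_p_s).sum(axis=-1).mean())

def distillation_loss(E_s, E_t, M_s, M_t, F_s, F_t, y_s, y_t):
    """Intermediate-layer MSE on embeddings, per-layer MHA and FFN
    outputs, plus prediction-layer soft cross-entropy on logits.
    M_s/M_t and F_s/F_t are lists of per-layer outputs."""
    mse = lambda a, b: float(((a - b) ** 2).mean())
    l_int = mse(E_s, E_t)
    l_int += sum(mse(ms, mt) for ms, mt in zip(M_s, M_t))
    l_int += sum(mse(fs, ft) for fs, ft in zip(F_s, F_t))
    l_pred = soft_cross_entropy(y_s, y_t)
    return l_int, l_pred
```

In the two-stage schedule, ℓ_int is minimized first and ℓ_pred afterwards.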
Further Fine-tuning. After splitting from the half-sized ternary model, the binary model inherits its performance on a new, full-width architecture. However, the original minimum of the ternary model may no longer hold in the new loss landscape after splitting. Thus we further fine-tune with prediction-layer distillation to look for a better solution. We dub the resulting model BinaryBERT.

Adaptive Splitting
Our proposed approach also supports adaptive splitting, which can flexibly adjust the width of BinaryBERT based on the parameter sensitivity to binarization and the resource constraints of edge devices. Specifically, given the resource constraints C (e.g., model size and computational FLOPs), we first train a mixed-precision model adaptively (with sensitive parts being ternary and the rest binary), and then split the ternary weights into binary ones. Adaptive splitting thus finally enjoys a consistent arithmetic precision (1-bit) for all weight matrices, which is usually easier to deploy than the mixed-precision counterpart.
Formulation. Intuitively, we assign ternary values to weight matrices that are more sensitive to quantization. The quantization sensitivity of a weight matrix is empirically measured by the performance gain of not quantizing it compared to the fully-quantized counterpart (details are in Appendix B.1). We denote by u ∈ R^Z_+ the sensitivity vector, where Z is the total number of splittable weight matrices in all Transformer layers, the word embedding layer and the pooler layer. The cost vector c ∈ R^Z_+ stores the additional increase in parameters or FLOPs of each ternary weight matrix over its binary counterpart. The splitting assignment can be represented as a binary vector s ∈ {0,1}^Z, where s_z = 1 means the z-th weight matrix is ternarized, and vice versa. The optimal assignment s* can thus be solved from the following combinatorial optimization problem:

max_s  u⊤s   s.t.  c⊤s ≤ C − C0,  s ∈ {0,1}^Z,   (11)

where C0 is the baseline efficiency of the half-sized binary network. Dynamic programming can be applied to solve Equation (11) efficiently despite its NP-hardness.
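The optimization above is a 0/1 knapsack and can be sketched with textbook dynamic programming (a minimal sketch; the function name is ours, and costs are assumed to be non-negative integers, e.g., parameter counts in some fixed unit):

```python
def adaptive_split(gains, costs, budget):
    """Solve the 0/1 knapsack in Equation (11): pick the set of modules to
    ternarize that maximizes total sensitivity gain subject to the extra
    cost budget C - C0. Returns (best gain, chosen module indices)."""
    Z = len(gains)
    # best[c] = (total gain, chosen set) achievable with cost at most c
    best = [(0.0, frozenset())] * (budget + 1)
    for z in range(Z):
        new = list(best)
        for c in range(costs[z], budget + 1):
            g, s = best[c - costs[z]]   # take item z on top of old state
            if g + gains[z] > new[c][0]:
                new[c] = (g + gains[z], s | {z})
        best = new                       # each item is used at most once
    return best[budget]
```

The returned index set corresponds to s_z = 1 in Equation (11).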

Experiments
In this section, we empirically verify our proposed approach on the GLUE and SQuAD (Rajpurkar et al., 2016, 2018) benchmarks. We report Matthews correlation for CoLA, Spearman correlation for STS-B, and accuracy for the rest of the tasks: RTE, MRPC, SST-2, QQP, MNLI-m (matched) and MNLI-mm (mismatched). For machine reading comprehension on SQuAD, we report the EM (exact match) and F1 scores. Aside from task performance, we also report the model size (MB) and computational FLOPs at inference. For quantized operations, we follow (Zhou et al., 2016; Liu et al., 2018; Li et al., 2020a) and count bit-wise operations, i.e., the multiplication between an m-bit number and an n-bit number approximately takes mn/64 FLOPs for a CPU with an instruction size of 64 bits.
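Under this convention, the FLOPs of quantized multiplications can be counted as follows (a minimal sketch; the function name is ours):

```python
def quantized_flops(m_bits, n_bits, num_mults, instruction_bits=64):
    """Approximate FLOPs of quantized multiplications: an m-bit by n-bit
    multiplication counts as m*n / instruction_bits FLOPs on a CPU with
    64-bit instructions (Zhou et al., 2016; Li et al., 2020a)."""
    return num_mults * m_bits * n_bits / instruction_bits
```

For example, a 1-bit-weight by 8-bit-activation product counts as 8/64 = 1/8 FLOP, versus 32·32/64 = 16 FLOPs for a full-precision product under the same convention.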
Implementation. We take DynaBERT (Hou et al., 2020) sub-networks as backbones, as they offer both half-sized and full-sized models for easy comparison. We start by training a ternary model of width 0.5× with the two-stage knowledge distillation introduced in Section 3.1. We then split it into a binary model with width 1.0×, and perform further fine-tuning with prediction-layer distillation. Each training stage takes the same number of training epochs. Following (Jiao et al., 2020; Hou et al., 2020; Zhang et al., 2020), we adopt data augmentation with one training epoch per stage on all GLUE tasks except MNLI and QQP. Aside from this default setting, we also remove data augmentation and perform vanilla training with 6 epochs on these tasks. On MNLI and QQP, we train 3 epochs per stage.
We verify our ternary weight splitting (TWS) against vanilla binary training (BWN), the latter of which doubles training epochs to match the overall training time in TWS for fair comparison. More training details are provided in Appendix B.
Activation Quantization. While BinaryBERT focuses on weight binarization, we also explore activation quantization in our implementation, which is beneficial for reducing the computation burden on specialized hardware (Hubara et al., 2016; Zhou et al., 2016; Zhang et al., 2020). Aside from the 8-bit uniform quantization used in past efforts (Zhang et al., 2020), we further study 4-bit activation quantization. We find that uniform quantization can hardly deal with outliers in the activations. Thus we use Learned Step-size Quantization (LSQ) (Esser et al., 2019) to directly learn the quantized values, which empirically achieves better quantization performance.
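The forward mapping of an LSQ-style quantizer can be sketched as below (a minimal sketch; the function name is ours, and only the forward pass is shown — in LSQ the step size is a trainable parameter updated through the straight-through estimator):

```python
import numpy as np

def lsq_quantize(x, step, n_bits=4, signed=True):
    """Forward pass of a learned-step-size quantizer (Esser et al., 2019):
    scale by the step size, round, clip to the n-bit integer range, and
    rescale back."""
    if signed:
        q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    else:
        q_min, q_max = 0, 2 ** n_bits - 1
    q = np.clip(np.round(x / step), q_min, q_max)
    return q * step
```

A learned `step` lets the quantizer trade off clipping large outliers against resolution for the bulk of the activations.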

Results on the GLUE Benchmark
The main results on the development set are shown in Table 1. For results without data augmentation (rows #2-5), our ternary weight splitting method outperforms BWN with a clear margin. For instance, on CoLA, ternary weight splitting achieves 6.7% ↑ and 9.6% ↑ with 8-bit and 4-bit activation quantization, respectively. While data augmentation (rows #6-9) mostly improves each entry, our approach still overtakes BWN consistently. Furthermore, 4-bit activation quantization empirically benefits more from ternary weight splitting (rows #4-5 and #8-9) than 8-bit activations (rows #2-3 and #6-7), demonstrating the potential of our approach for extremely low-bit quantized models.
In Table 2, we also provide the results on the test set of GLUE benchmark. Similar to the observation in Table 1, our approach achieves consistent improvement on both 8-bit and 4-bit activation quantization compared with BWN.

Results on SQuAD Benchmark
The results on the development sets of SQuAD v1.1 and v2.0 are shown in Table 3. Our proposed ternary weight splitting again outperforms BWN in both EM and F1 scores on both datasets. Similar to previous observations, 4-bit activation enjoys a larger performance gain from the splitting approach. For instance, our approach improves the EM score under 4-bit activation by 1.8% and 0.6% on SQuAD v1.1 and v2.0, respectively, both of which are higher than the gains under 8-bit activation.

Adaptive Splitting
The adaptive splitting in Section 3.2 supports mixed ternary and binary precisions for finer-grained configurations. To verify its advantages, we name our approach Maximal Gain according to Equation (11), and compare it with two baseline strategies: i) Random Gain, which randomly selects weight matrices to split; and ii) Minimal Gain, which splits the least important modules according to sensitivity. We report the average score over six tasks (QNLI, SST-2, CoLA, STS-B, MRPC and RTE) in Figure 5. The end-points of 9.8MB and 16.5MB correspond to the half-sized and full-sized BinaryBERT, respectively. As can be seen, adaptive splitting generally outperforms the other two baselines under varying model sizes, indicating the effectiveness of maximizing the gain in adaptive splitting. In Appendix C.4, we provide detailed performance on the six tasks, together with the architecture visualization of adaptive splitting.

As shown in Table 4, our proposed BinaryBERT has the smallest model size with the best performance among all quantization approaches. Compared with the full-precision model, our BinaryBERT retains competitive performance with a significant reduction of model size and computation. For example, we achieve more than a 24× compression ratio compared with BERT-base, with only a 0.4% ↓ drop on MNLI-m and 0.0%/0.2% ↓ on SQuAD v1.1, respectively.

Further Improvement after Splitting
We now demonstrate the performance gain of refining the binary model on the new architecture. We evaluate the gain of splitting from a half-width ternary model (TWN 0.5×) to the full-sized model (TWN 1.0×) on the development sets of SQuAD v1.1, MNLI-m, QNLI and MRPC. The results are shown in Table 5. As can be seen, further fine-tuning brings consistent improvement for both 8-bit and 4-bit activations.

Training Curves. Furthermore, we plot the training loss curves of BWN, TWN and our TWS on MRPC with data augmentation in Figures 6(a) and 6(b). Since TWS cannot inherit the previous optimizer due to the architecture change, we reset the optimizer and learning rate scheduler of BWN, TWN and TWS for a fair comparison, despite the slight increase of loss after splitting. We find that our TWS attains a much lower training loss than BWN, and also surpasses TWN, verifying the advantage of fine-tuning on the wider architecture.
Optimization Trajectory. We also follow (Li et al., 2018;Hao et al., 2019) to visualize the optimization trajectory after splitting in Figures 6(c) and 6(d). We calculate the first two principal components of parameters in the final BinaryBERT, which are the basis for the 2-D plane. The loss contour is thus obtained by evaluating each grid point in the plane. It is found that the binary models are heading towards the optimal solution for both 8/4-bit activation quantization on the loss contour.
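The trajectory projection can be sketched as below (a minimal sketch; the function name is ours, and `checkpoints` is assumed to be a (T, D) array of flattened parameter snapshots with the final row as the reference point, following Li et al. (2018)):

```python
import numpy as np

def trajectory_2d(checkpoints):
    """Project a training trajectory onto its first two principal
    directions, centered at the final parameters."""
    final = checkpoints[-1]
    diffs = checkpoints - final
    # SVD of the centered trajectory gives the principal directions.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    basis = vt[:2]                      # (2, D) principal directions
    return diffs @ basis.T              # (T, 2) coordinates in the plane
```

The loss contour is then evaluated on a grid in the plane spanned by `basis`, and the returned coordinates trace the optimization path on that contour.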

Exploring More Binarization Methods
We now study whether any improved binarization variants can directly bring better performance. Aside from BWN, we compare with LAB (Hou et al., 2017) and BiReal (Liu et al., 2018). Meanwhile, we compare with gradual quantization, i.e., BWN training initialized from a ternary model, denoted BWN†. Furthermore, we also try using the same scaling factor in BWN as in TWN to make the precision change smooth, dubbed BWN‡. From Table 6, we find that our TWS still outperforms these binarization approaches in most cases, suggesting the superiority of splitting over direct binary training in finding better minima.

Related Work
Network quantization has been a popular topic with vast literature in efficient deep learning. Below we give a brief overview for three research strands: network binarization, mixed-precision quantization and neuron splitting, all of which are related to our proposed approach.

Network Binarization
Network binarization achieves a remarkable size reduction and is widely explored in computer vision. Existing binarization approaches can be categorized into quantization error minimization (Rastegari et al., 2016; Hou et al., 2017; Zhang et al., 2018), improved training objectives, and the reduction of gradient mismatch (Bai et al., 2018; Liu et al., 2018). Despite the empirical success of these approaches in computer vision, there is little exploration of binarization in natural language processing. Previous works on BERT quantization (Zafrir et al., 2019; Zhang et al., 2020) push the bit-width down to as low as two, but none of them achieves binarization. Our work thus serves as the first attempt to binarize pre-trained language models.

Mixed-precision Quantization
Given the observation that neural network layers exhibit different sensitivities to quantization (Dong et al., 2019), mixed-precision quantization re-allocates layer-wise quantization bit-widths for a higher compression ratio. Inspired by neural architecture search (Wang et al., 2020), common approaches to mixed-precision quantization are primarily based on differentiable search (Wu et al., 2018a; Li et al., 2020b), reinforcement learning (Wu et al., 2018b), or simply loss curvatures (Dong et al., 2019). While mixed-precision quantized models usually demonstrate better performance than traditional methods under the same compression ratio, they are also harder to deploy (Habi et al., 2020). On the contrary, BinaryBERT with adaptive splitting enjoys both the good performance of mixed ternary and binary precision and easy deployment thanks to its consistent arithmetic precision. There are also works on binary neural architecture search (Kim et al., 2020), which have a similar purpose to mixed-precision quantization. Nonetheless, such methods are usually time-consuming to train and are prohibitive for large pre-trained language models.

Neuron Splitting
Neuron splitting was originally proposed to accelerate network training by progressively increasing the width of a network (Chen et al., 2016; Wu et al., 2019). The split network equivalently inherits the knowledge of its predecessor and is trained for further improvement. Recently, neuron splitting has also been studied in quantization (Zhao et al., 2019; Kim et al., 2019). By splitting neurons with large magnitudes, full-precision outliers are removed and the quantization error can thus be effectively reduced (Zhao et al., 2019). Kim et al. (2019) apply neuron splitting to decompose ternary activations into two binary activations based on bias shifting of the batch normalization layer. However, such a method cannot be applied to BERT, as it has no batch normalization layers. Besides, weight splitting is much more complex due to the equivalence constraint on both the quantized and latent full-precision weights.

Conclusion
In this paper, we propose BinaryBERT, pushing BERT quantization to the limit. Due to the steep and complex loss landscape, we find that directly training a BinaryBERT is hard, with a large performance drop. We thus propose ternary weight splitting, which splits a trained ternary BERT to initialize BinaryBERT, followed by fine-tuning for further refinement. Our approach also supports adaptive splitting, which can tailor the size of BinaryBERT to the constraints of edge devices. Empirical results show that our approach significantly outperforms vanilla binary training, achieving state-of-the-art performance on BERT compression.

A Derivation of Equation (8)

In this section, we show the derivation to obtain a and b, recalling the BWN quantizer introduced in Section 2. By assuming 0 < a < 1 and b > 0, the equivalence constraints can be further simplified, which gives a closed-form solution for a. We empirically find that the solution indeed satisfies 0 < a < 1.

The solution for b is obtained similarly and satisfies b > 0.
B Implementation Details

B.1 Detailed Procedure of Adaptive Splitting

As mentioned in Section 3.2, adaptive splitting first requires estimating the quantization sensitivity vector u. We study the sensitivity along two axes: Transformer parts and Transformer layers. For Transformer parts, we follow the weight categorization in Section 2.2: MHA-QK, MHA-V, MHA-O, FFN-Mid and FFN-Out. For each of them, we compare the performance gap between quantizing and not quantizing that part (e.g., MHA-V), while leaving the remaining parts quantized (e.g., MHA-QK, MHA-O, FFN-Mid and FFN-Out). Similarly, for each Transformer layer, we quantize all layers but leave the layer under investigation unquantized, and calculate the performance gain compared with the fully-quantized baseline. The performance gains of both Transformer parts and layers are shown in Figure 7. As can be seen, among the Transformer parts, FFN-Mid and MHA-QK rank first and second. In terms of Transformer layers, shallower layers are more sensitive to quantization than deeper ones. However, the absolute performance gain may not reflect the quantization sensitivity directly, since Transformer parts have different numbers of parameters. Therefore, we divide the performance gain by the number of parameters in that part or layer to obtain the parameter-wise performance gain. We then measure the quantization sensitivity of the i-th Transformer part in the j-th Transformer layer by summing their parameter-wise performance gains. We apply the same procedure to the word embedding and pooler layers to obtain their sensitivity scores.
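The sensitivity score of one module can be sketched as follows (a minimal sketch; the function name and the toy units in the example are ours):

```python
def module_sensitivity(part_gain, part_params, layer_gain, layer_params):
    """Sensitivity of Transformer part i inside layer j: the parameter-wise
    performance gain of the part plus that of the layer, so that modules
    with different parameter counts are comparable."""
    return part_gain / part_params + layer_gain / layer_params
```

Stacking these scores over all splittable modules yields the vector u used in Equation (11).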
We are now able to solve Equation (11) by dynamic programming. The combinatorial optimization can be viewed as a knapsack problem, where the constraint C − C 0 is the volume of the knapsack, and the sensitivity scores u are the item values.

B.2 Hyper-parameter Settings
We first perform the two-stage knowledge distillation, i.e., intermediate-layer and then prediction-layer distillation, to obtain the ternary model, and then perform ternary weight splitting followed by fine-tuning (Split Ft.) with only prediction-layer distillation after the splitting. The initial learning rate is set to 5 × 10^-5 for intermediate-layer distillation and 2 × 10^-5 for prediction-layer distillation, both of which decay linearly to 0 at the end of training. We conduct experiments on GLUE tasks both without and with data augmentation (DA), except for MNLI and QQP, due to their limited performance gain. The number of training epochs is set to 3 for MNLI and QQP, and to 6 for the remaining tasks without DA and 1 with DA. For the remaining hyper-parameters, we follow the default setting in (Devlin et al., 2019). The detailed hyper-parameters are summarized in Table 7.

C.1 Performance Drop by Binarization
Here we provide more empirical results on the sharp performance drop caused by binarization. We run multi-bit quantization of the BERT model on representative tasks of the GLUE benchmark, with activations quantized to both 8-bit and 4-bit. We run 10 independent experiments for each task, except for MNLI with 3 runs. We follow the same procedure as in Section 2.1 and the default experimental setup in Appendix B.2, without data augmentation or splitting. The results are shown in Figures 8 and 9, respectively. It can be found that while the performance drops slowly from full-precision to ternarization, there is a consistent sharp drop at binarization in each task, for both 8-bit and 4-bit activation quantization. This is similar to the findings in Figure 1.

C.2 More Visualizations of Loss Landscape
To comprehensively compare the loss curvature among the full-precision, ternary and binary models, we provide more landscape visualizations beyond the value layer in Figure 2. We extract parameters from MHA-K, MHA-O, FFN-Mid and FFN-Out in the first two Transformer layers, and the corresponding landscapes are shown in Figures 10, 11, 12 and 13, respectively. We omit MHA-Q due to the page limitation, and also because it is symmetric to MHA-K with similar landscape observations. It can be found that the binary model generally has a steep and irregular loss landscape with respect to different parameters of the model, and is thus hard to optimize directly.

C.3 Ablation of Knowledge Distillation
While knowledge distillation on BERT has been thoroughly investigated in (Jiao et al., 2020; Hou et al., 2020; Zhang et al., 2020), here we further conduct an ablation study of knowledge distillation on the proposed ternary weight splitting. We compare with no distillation ("N/A"), prediction-layer distillation only ("Pred") and our default setting ("Int.+Pred"). For "N/A" and "Pred", fine-tuning after splitting follows the same setting as their ternary training. "Int.+Pred" follows our default setting in Table 7. We do not adopt data augmentation, and the results are shown in Table 10. It can be found that "Int.+Pred." outperforms both "N/A" and "Pred." with a clear margin, which is consistent with the findings in (Zhang et al., 2020) that knowledge distillation helps BERT quantization.

C.4 Detailed Results of Adaptive Splitting
The detailed comparison of our adaptive splitting strategy against the random strategy (Rand.) and the minimal gain strategy (Min.) under different model sizes is shown in Tables 8 and 9. It can be found that for both 8-bit and 4-bit activation quantization, our strategy of splitting the most sensitive modules mostly performs best on average under various model sizes.

C.5 Architecture Visualization
We further visualize the architectures after adaptive splitting on MRPC in Figure 14. For clearer presentation, we merge all splittable parameters in each Transformer layer. As baselines, 9.8MB refers to no splitting, while 16.5MB refers to splitting all splittable parameters in the model. According to Figure 14, as the model size increases, shallower layers are preferred for splitting over deeper layers, which is consistent with the findings in Figure 7.