Combining Deep Generative Models and Multi-lingual Pretraining for Semi-supervised Document Classification

Semi-supervised learning with deep generative models and multi-lingual pretraining have each achieved remarkable success across different areas of NLP. Nonetheless, they have developed in isolation, while their combination could be effective for tackling task-specific shortages of labelled data. To bridge this gap, we combine semi-supervised deep generative models with multi-lingual pretraining to form a pipeline for the document classification task. Compared to strong supervised baselines, our semi-supervised classification framework is highly competitive and outperforms state-of-the-art counterparts in low-resource settings across several languages.


Introduction
Multi-lingual pretraining has been shown to effectively use unlabelled data by learning shared representations across languages that can be transferred to downstream tasks (Artetxe and Schwenk, 2019; Devlin et al., 2019; Wu and Dredze, 2019; Conneau and Lample, 2019). Nonetheless, the lack of labelled data still leads to inferior performance compared to the same model trained on languages with more labelled data, such as English (Zeman et al., 2018; Zhu et al., 2019).
Semi-supervised learning is another appealing paradigm that supplements labelled data with unlabelled data, which is easy to acquire (Blum and Mitchell, 1998; Zhou and Li, 2005; McClosky et al., 2006, inter alia). In particular, deep generative models (DGMs) such as the variational autoencoder (VAE; Kingma and Welling (2014)) are capable of capturing complex data distributions at scale with rich latent representations, and they have been used for semi-supervised learning (Kingma et al., 2014).

* Work done while at Microsoft Research Cambridge.
1 Code is available at https://github.com/cambridgeltl/mling_sdgms.
To leverage the benefits of both worlds, we propose a pipeline method that combines semi-supervised DGMs (SDGMs) based on the M1+M2 model (Kingma et al., 2014) with multi-lingual pretraining. The pretrained model serves as a multi-lingual encoder, and the SDGMs can operate on top of it independently of the encoding architecture. To highlight this independence, we experiment with two pretraining settings: (1) our LSTM-based cross-lingual VAE, and (2) the current state-of-the-art (SOTA) multi-lingual BERT (Devlin et al., 2019).
Our experiments on document classification in several languages show promising results via the SDGM framework with different encoders, outperforming the SOTA supervised counterparts. We also illustrate that the end-to-end training of M1+M2 that was previously considered too unstable to train (Maaløe et al., 2016) is possible with a reformulation of the objective function.
Semi-supervised Learning with DGMs

Variational Autoencoder. VAE consists of a stochastic neural encoder q_φ(z|x) that maps an input x to a latent representation z, and a neural decoder p_θ(x|z) that reconstructs x, jointly trained by maximising the evidence lower bound (ELBO) of the marginal likelihood of the data:

ELBO(x) = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)),  (1)

where the first term (reconstruction) maximises the expectation of the data likelihood under the posterior distribution of z, and the Kullback-Leibler (KL) divergence regularises the distance between the learned posterior and the prior of z.
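To make the ELBO of Eq. 1 concrete, here is a minimal numpy sketch for a diagonal-Gaussian posterior and a standard-normal prior (an illustrative sketch, not the paper's implementation; `log_px_given_z` stands in for a single-sample reconstruction term):

```python
import numpy as np

def gaussian_kl(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def elbo(log_px_given_z, mu, logvar):
    # Eq. 1: ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)); the expectation is
    # approximated with a single reconstruction sample.
    return log_px_given_z - gaussian_kl(mu, logvar)
```

When the posterior equals the prior (mu = 0, logvar = 0), the KL term vanishes and the ELBO reduces to the reconstruction term alone.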

M1+M2 Model. Our SDGM follows the two-layer M1+M2 model (Kingma et al., 2014). The first layer computes the posterior of z_1 as a diagonal Gaussian parameterised from an encoder representation h of the input:

q_φ(z_1|x) = N(z_1; µ_φ(h), diag(σ²_φ(h))).  (2)

As our SDGM is independent of the encoding architecture, we use different pretrained multi-lingual models to obtain h, µ_φ(h), and σ²_φ(h), described in §3. The second layer computes the posterior distribution of z_2, conditioned on z_1 sampled from q_φ(z_1|x) and a class variable y.
When we use labelled data, i.e. y is observed, q_φ(z_2|z_1, y) can be obtained directly. With unlabelled data, we calculate the posterior q_φ(z_2, y|z_1) = q_φ(y|z_1) q_φ(z_2|z_1, y) by inferring y with the classifier q_φ(y|z_1), and integrate over all possible values of y. Therefore, the ELBO for the labelled data S_l = {x, y} is

L(x, y) = E_{q_φ(z_1, z_2|x, y)}[log p_θ(x, y, z_1, z_2) − log q_φ(z_1, z_2|x, y)],

and for the unlabelled data S_u = {x} it is

U(x) = E_{q_φ(z_1, z_2, y|x)}[log p_θ(x, y, z_1, z_2) − log q_φ(z_1, z_2, y|x)],

where the generative part is p_θ(x, y, z_1, z_2) = p(y) p(z_2) p_θ(z_1|z_2, y) p_θ(x|z_1), p(y) is a uniform distribution as the prior of y, p(z_2) is a standard Gaussian distribution as the prior of z_2, and p_θ(x|z_1) is the decoder, which can have different architectures depending on the encoder (§4).
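For the unlabelled case, integrating over y amounts to computing the labelled ELBO for every possible class, weighting by the classifier q_φ(y|z_1), and adding the classifier's entropy. A hypothetical numpy sketch (not the paper's code; `labelled_elbos[k]` stands for L(x, y = k) at a sampled z_1):

```python
import numpy as np

def unlabelled_elbo(labelled_elbos, q_y):
    # U(x) = sum_y q(y|z1) * L(x, y) + H(q(y|z1)).
    # q_y is the classifier distribution q(y|z1) over the classes.
    entropy = -np.sum(q_y * np.log(q_y))
    return np.sum(q_y * labelled_elbos) + entropy
```

With a uniform classifier over two classes and equal per-class ELBOs L, this gives L + log 2, i.e. the shared ELBO plus the full entropy of the class distribution.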
The objective function maximises both the labelled and unlabelled ELBOs while also training the classifier directly on the labelled data:

J = Σ_{(x,y)∈S_l} L(x, y) + Σ_{x∈S_u} U(x) + α Σ_{(x,y)∈S_l} J_cls(x, y),

where J_cls(x, y) = E_{q_φ(z_1|x)}[q_φ(y|z_1)], and α is a hyperparameter to tune. Considering the factorisation of the model according to the graphical model, we can rewrite L(x, y) and U(x) as shown in Table 1 (left). The reconstruction term is the expected log-likelihood of the input sequence x, the same for both ELBOs. The KL term regularises the posterior distributions of z_1 and z_2 towards their priors. Additionally for U(x), as mentioned before, we first infer y and treat it as if it were observed, so we compute the expected KL term over q_φ(y|z_1), regularised by KL(q_φ(y|z_1) ‖ p(y)).
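The overall objective simply sums the two ELBOs with the weighted classification term; a trivial sketch under the same hypothetical naming:

```python
def semi_supervised_objective(labelled_elbos, unlabelled_elbos, cls_terms, alpha):
    # J = sum over labelled L(x, y) + sum over unlabelled U(x)
    #     + alpha * sum over labelled J_cls(x, y).
    return sum(labelled_elbos) + sum(unlabelled_elbos) + alpha * sum(cls_terms)
```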
Due to its training difficulty, M1+M2 is trained layer-wise in Kingma et al. (2014), where the first layer is trained according to Eq. 1 and fixed, before the second layer is trained on top. However, in our experiments ( §4.1) we found that M1+M2 is easier to train end-to-end. We attribute this to our mathematical reformulation of the objective functions, giving rise to a more stable optimisation schedule.

Multi-lingual Pretraining

LSTM-based Encoder with VAE Pretraining.
Our pretraining is based on the framework of Wei and Deng (2017), who pretrain a cross-lingual VAE with a parallel corpus as input. However, a parallel corpus is expensive to obtain, and only the resulting cross-lingual embeddings, rather than the whole encoder, could be reused due to the model's parallel-input requirement. To address these shortcomings, we propose the non-parallel cross-lingual VAE (NXVAE), which has the same graphical model as the vanilla VAE. Each language i is associated with its own word embedding matrix, and its input sequence x_i is processed via a two-layer BiLSTM (Hochreiter and Schmidhuber, 1997) shared across languages. We use the concatenation of the BiLSTM last hidden states as h, and compute q_φ(z|x_i) with Eq. 2, so that z becomes the joint cross-lingual semantic space. A language-specific bag-of-words decoder (BOW; Miao et al. (2016)) is then used to reconstruct the input sequence. Additionally, we optimise a language discriminator as an adversary (Lample et al., 2018a) to encourage the mixing of different language representations and keep the shared encoder language-agnostic. After pretraining NXVAE, we transfer the whole encoder, including µ_φ(h) and σ²_φ(h), directly into our SDGM framework and treat it as the q_φ(z_1|x) component of the model (§4.1).
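As a sketch of the encoder head (the weight names `W_mu`, `W_var` are hypothetical), the concatenated BiLSTM state h is projected to the posterior parameters of Eq. 2, and z is drawn with the reparameterisation trick:

```python
import numpy as np

def encode(h, W_mu, b_mu, W_var, b_var, rng):
    # Project the encoder state h to the Gaussian posterior parameters,
    # then sample z = mu + sigma * eps (reparameterisation trick).
    mu = h @ W_mu + b_mu
    logvar = h @ W_var + b_var
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    return z, mu, logvar
```

Driving the log-variance strongly negative makes the sample collapse onto the mean, which is a convenient sanity check for the sampling path.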
Multi-lingual BERT Encoder. To show that our SDGM is effective with other encoding architectures, we use the pretrained multi-lingual BERT (mBERT; Devlin et al. (2019)) 2 as our encoder. Given an input sequence, the pooled [CLS] representation is used as h to compute q φ (z 1 |x) (Eq. 2). Different from NXVAE, we initialise the parameters of µ φ (h) and σ 2 φ (h) randomly.

Experiments
We perform document classification on the class-balanced multilingual document classification corpus (MLDoc; Schwenk and Li (2018)). Each document is assigned to one of four news topic classes: corporate/industrial (C), economics (E), government/social (G), and markets (M). For pretraining, we preprocess the data with: tokenization, lowercasing, substituting digits with 0, and removing all punctuation, redundant spaces and empty lines. We randomly sample a small part of the parallel sentences to build a development set. For models which do not require parallel input, e.g. NXVAE, we mix the two datasets of a language pair together. To avoid KL collapse during pretraining, a weight α on the KL term in Eq. 1 is tuned and fixed to 0.1 (Higgins et al., 2017; Alemi et al., 2018). We run only one trial with a fixed random seed for both pretraining and document classification. Training details can be found in the Appendix.

As supervised baselines we compare with the following two groups: (I) NXVAE-based supervised models, which add a multi-layer perceptron classifier on top of the pretrained NXVAE encoder (denoted NXVAE-z_1 (q_φ(y|z_1)) or NXVAE-h (q_φ(y|h)) depending on the representation fed into the classifier), or NXVAE-z_1 models initialised with different pretrained embeddings: random initialisation (RAND), mono-lingual fastText (FT; Bojanowski et al. (2017)), unsupervised cross-lingual MUSE (Lample et al., 2018b), pretrained embeddings from Wei and Deng (2017) (PEMB), and the embeddings resulting from our pretrained NXVAE (NXEMB). (II) We also pretrain a word-based BERT (BERTW) with a parameter size akin to NXVAE on the same data, and fine-tune it directly.

For our semi-supervised experiments, we test our end-to-end M1+M2 with both BOW and GRU decoders on top of the pretrained encoder, as well as the original layer-wise M1+M2 (Kingma et al., 2014). 7 We also add a semi-supervised self-training method (McClosky et al., 2006) for BERTW to leverage the unlabelled data (BERTW+ST), where we iteratively add predicted unlabelled data whenever the model achieves a better dev. accuracy, until convergence.
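The BERTW+ST loop can be sketched generically; `model_fit`/`model_predict` are hypothetical stand-ins for fine-tuning and inference, not the paper's code:

```python
import numpy as np

def self_train(model_fit, model_predict, X_lab, y_lab, X_unlab, X_dev, y_dev,
               max_rounds=10):
    # Iterative self-training: retrain, and keep absorbing pseudo-labelled
    # unlabelled data as long as dev accuracy improves.
    best_acc, params = -1.0, None
    X_tr, y_tr = X_lab, y_lab
    for _ in range(max_rounds):
        params = model_fit(X_tr, y_tr)
        acc = float(np.mean(model_predict(params, X_dev) == y_dev))
        if acc <= best_acc:
            break
        best_acc = acc
        pseudo = model_predict(params, X_unlab)
        X_tr = np.concatenate([X_lab, X_unlab])
        y_tr = np.concatenate([y_lab, pseudo])
    return params, best_acc
```

As a toy usage, a one-dimensional nearest-centroid "model" suffices to exercise the loop: the pseudo-labelled points shift the centroids while dev accuracy gates each round.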
Qualitative Results. Table 3 illustrates the quality of the learned alignments in the cross-lingual space of NXVAE for EN-DE word pairs.
Classification Results. Table 4 (EN-DE) shows that, among the supervised models, NXVAE-z_1 substantially outperforms all other supervised baselines with the exception of BERTW. The fact that NXVAE-z_1 is significantly better than NXVAE-h suggests that pretraining has enabled z_1 to learn more general knowledge transferable to this task. Combined with SDGMs, our best pipeline outperforms all baselines across data sizes and languages, including BERTW+ST, with bigger gaps in scenarios with fewer labelled examples. We observe the same performance trend in both supervised and semi-supervised DGM settings on EN-FR and DE-FR. As for the decoder, BOW outperforms GRU, a finding in line with the results of Artetxe et al. (2019), which suggests that a few keywords seem to suffice for this task. The poor performance of the original M1+M2 implies a domain discrepancy between pretraining and task data, and highlights the impact of fine-tuning. In addition, our NXEMB, as a by-product of NXVAE, performs comparably to MUSE, and better than all other embedding models including its parallel counterpart PEMB.

7 We also compared this against the more complex Skip Deep Generative Model (Maaløe et al., 2016), but found that end-to-end M1+M2 performs better. Details in the Appendix.

Multi-lingual BERT Encoder
Experimental Setup. We use the cased mBERT, a 12-layer Transformer (Vaswani et al., 2017) trained on the Wikipedias of 104 languages with a 100k shared WordPiece vocabulary. The training corpus is larger than Europarl by orders of magnitude, and high-resource languages account for most of it. We use the best SDGM setup (M1+M2+BOW, §4.1) on top of the mBERT encoder, against the mBERT supervised model with a linear layer as classifier (SUP-h), in 5 representative languages (EN, DE, FR, RU, ZH). We report results over 5 runs due to the training instability of BERT (Dodge et al., 2020; Mosbach et al., 2020).
Classification Results. Figure 1 demonstrates that M1+M2+BOW outperforms the SOTA supervised mBERT (SUP-h) on average across all languages. This corroborates the effectiveness of our SDGM in leveraging unlabelled data in the smaller labelled-data regime, as well as its independence from the encoding architecture. As expected, the gap is generally larger with 8 and 16 labelled examples, but shrinks as the data size grows to 32. The variance shows a similar pattern, but with relatively large values because of the instability of mBERT. Interestingly, the performance difference seems more notable in high-resource languages with more pretraining data, whereas in languages with less pretraining text or vocabulary overlap, such as RU and ZH, the two models achieve closer results.

Conclusion
We bridged multi-lingual pretraining and deep generative models to form a semi-supervised learning framework for document classification. While outperforming SOTA supervised models in several settings, we showed that the benefits of SDGMs are orthogonal to the encoding architecture or pretraining procedure. This opens up a new avenue for SDGMs in low-resource NLP by incorporating unlabelled data, potentially from different domains and languages. Our preliminary results in the cross-lingual zero-shot setting with SDGMs+NXVAE are promising, and we will continue exploring this direction as future work.

A Derivations of semi-supervised ELBOs
We derive the full ELBOs of both labelled and unlabelled data for M1+M2 and the Auxiliary Skip Deep Generative Model (AUX; Maaløe et al. (2016)). 9 We first use (·) to represent the different conditional variables of the two models so that the derivations can be unified, and realise it with the model-specific conditions in the end. As written in the paper, the labelled ELBO for both models is

L(x, y) = E_{q_φ(z_1, z_2|x, y)}[log p_θ(x, y, z_1, z_2) − log q_φ(z_1, z_2|x, y)].

Expanding the ELBO with the inference factorisation q_φ(z_1, z_2|x, y) = q_φ(z_1|x) q_φ(z_2|·) and the generative factorisation p_θ(x, y, z_1, z_2) = p(y) p(z_2) p_θ(z_1|z_2, y) p_θ(x|·), we have

L(x, y) = log p(y) + E_{q_φ(z_1|x) q_φ(z_2|·)}[log p_θ(x|·) + log p_θ(z_1|z_2, y) − log q_φ(z_1|x)] − E_{q_φ(z_1|x)}[KL(q_φ(z_2|·) ‖ p(z_2))].

After realising (·), we obtain the labelled ELBOs of M1+M2 and AUX in the main paper. For the unlabelled ELBO, y is unobservable and is marginalised with the classifier q_φ(y|z_1):

U(x) = Σ_y q_φ(y|z_1) L(x, y) + H(q_φ(y|z_1)),

with z_1 sampled from q_φ(z_1|x). Similarly, realising (·) gives the unlabelled ELBOs of M1+M2 and AUX. In our experiments, we sample z_1 and z_2 once during inference, so both the labelled and unlabelled ELBOs are approximated by single-sample Monte Carlo estimates of the expectations above.

B Factorisation of M1+M2 and AUX
The two models have different factorisations. M1+M2 is written as

q_φ(z_1, z_2, y|x) = q_φ(z_1|x) q_φ(y|z_1) q_φ(z_2|z_1, y),
p_θ(x, y, z_1, z_2) = p(y) p(z_2) p_θ(z_1|z_2, y) p_θ(x|z_1),

and AUX is factorised with skip connections in the generative model, so that the decoder additionally conditions on z_2 and y (Maaløe et al., 2016):

q_φ(z_1, z_2, y|x) = q_φ(z_1|x) q_φ(y|z_1) q_φ(z_2|z_1, y),
p_θ(x, y, z_1, z_2) = p(y) p(z_2) p_θ(z_1|z_2, y) p_θ(x|z_1, z_2, y),

where q_φ(z_1|x), q_φ(z_2|·), and p_θ(z_1|z_2, y) are parameterised as diagonal Gaussians, and the other distributions are defined as

q_φ(y|z_1) = Cat(y; π_φ(z_1)),  p(y) = Cat(y; 1/|Y|),  p(z_2) = N(z_2; 0, I),  p_θ(x|·) = f(x, ·; θ),

where Cat(·) is a multinomial distribution and y is treated as a latent variable when it is unobserved in the unlabelled case. f(x, ·; θ) serves as the decoder and computes the likelihood of the input sequence x.

C Details on LSTM Encoder with VAE Pretraining

C.1 Data

For each language pair, the sentences on the same line of both datasets form a pair of parallel sentences. We apply the following preprocessing to each dataset: tokenization; lowercasing; substituting digits with 0; removing all punctuation; removing redundant spaces and empty lines. We then trim all four datasets to exactly the same number of sentences. We randomly split off a small part of the parallel sentences to build a dev. set, which leads to a training set of 189m lines and a dev. set of 13995 lines for each language. We then shuffle each dataset so that each language pair is no longer parallel (for both train and dev. sets).

Our goal is to merge the two datasets of each pair and scramble them to form a single dataset. In practice, we keep each dataset separate and sample each batch randomly from one of the two languages during pretraining, so that the data from both languages are mixed.
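A minimal sketch of this alternating batch sampling (function and parameter names are hypothetical):

```python
import random

def mixed_batches(data_a, data_b, batch_size, seed=0):
    # Each batch comes from a single, randomly chosen language, so the two
    # monolingual datasets are mixed without any parallel alignment.
    rng = random.Random(seed)
    pools = [list(data_a), list(data_b)]
    while any(pools):
        pool = rng.choice([p for p in pools if p])
        batch, pool[:] = pool[:batch_size], pool[batch_size:]
        yield batch
```

Every example of both languages is consumed exactly once per epoch, while each individual batch stays monolingual, which keeps language-specific components (embeddings, decoders) simple to dispatch.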

C.2 Model and training details
Instead of optimising the standard VAE objective, we optimise the following for NXVAE (Higgins et al., 2017; Alemi et al., 2018):

E_{q_φ(z|x)}[log p_θ(x|z)] − α KL(q_φ(z|x) ‖ p(z)),

where we manually tune the fixed hyperparameter α on the EN-DE data to empirically reach a good balance between the reconstruction and KL terms. We select α = 0.1 and apply it to the pretraining of the other language pairs as well. The model and training details of NXVAE are shown in Table 5 (left).

C.3 Pretraining other models
For MLDoc supervised document classification, we also pretrain other baseline models to compare with, ONLY for the EN-DE pair. Cross-lingual VAE with parallel input (PEMB; Wei and Deng (2017)): we run the original code of Wei and Deng (2017) directly on the same EN-DE Europarl data without changing any of the model architecture. Since the model requires parallel input, we take the preprocessed and split EN-DE data, but do not shuffle each dataset, instead feeding them as parallel input to the model, so that this model and our corresponding NXVAE use the same amount and content of data.
Subword-based non-parallel cross-lingual VAE (SNXVAE): instead of separate vocabularies and decoders for each language, we use a single vocabulary and decoder for SNXVAE. We build the vocabulary of size 10k with SentencePiece. 11 All other settings are the same as for NXVAE. Its model and training details can be found in Table 5 (right).
Word- and subword-based BERT models (BERTW/BERTSW): for BERTW, we change the vocabulary and model size to be comparable with NXVAE. Note that the vocabulary size of BERTW is the same as the intersected vocabulary size of the two languages in NXVAE. We only use the masked language model objective during pretraining, and discard the next sentence prediction objective. 12 For BERTSW, we use the same vocabulary as SNXVAE and set the model to a similar parameter size as SNXVAE. The model and training details of BERTW and BERTSW are shown in Table 6.

D More Results on Document Classification

D.1 LSTM Encoder with VAE Pretraining

Supervised Learning. Our base model is NXVAE-z_1, which adds an MLP classifier q_φ(y|z_1) on top of the encoder with the same architecture as NXVAE. The same applies to the subword-based model SNXVAE-z_1. NXVAE-h takes the deterministic h as the input to q_φ(y|h). All our baseline models with pretrained embeddings use the architecture of NXVAE-z_1. For fastText (FT), we train the embeddings of both languages on the same EN_{EN-DE} and DE_{EN-DE} data. For MUSE, we align the pretrained FT embeddings. For BERTW and BERTSW, we use the Transformers library 13 for classification, and initialise the models with the corresponding pretrained parameters. All model and training details can be found in Table 7. The comparison of word-based and subword-based models is shown in Table 8.
Semi-supervised learning with SDGMs. The main model (NXVAE) and training details are the same as in supervised learning. Besides M1+M2, we also compare with AUX (Maaløe et al., 2016) using the two decoder types. The training details are shown in Table 9. Regarding GRU decoding, all conditional latent variables of p_θ(x|·) are fed as extra input at each decoding step (Xu et al., 2017). We tune all semi-supervised models on EN_{EN-DE} with 32 labels in the semi-supervised setting, and then apply the chosen values to all other languages and data sizes. We tune only one hyperparameter: the scaling factor β in the weight α for the classification loss, as in the original SDGM paper (Maaløe et al., 2016):

α = β · (N_l + N_u) / N_l,

where N_l and N_u are the numbers of labelled and unlabelled data points. We tune β from {0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0}, pick the β with the best dev. performance for each model, and randomly select one when there is a tie. We then use this fixed β for all other experiments across training data sizes and languages.
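Assuming the SDGM weighting α = β · (N_l + N_u) / N_l from Maaløe et al. (2016), the classification-loss weight can be computed as:

```python
def classification_weight(beta, n_labelled, n_unlabelled):
    # alpha = beta * (N_l + N_u) / N_l: the weight grows with the total
    # number of training points relative to the labelled ones.
    return beta * (n_labelled + n_unlabelled) / n_labelled
```

For example, with β = 0.1, 32 labelled and 968 unlabelled examples, α = 0.1 · 1000 / 32 = 3.125, so the classification loss is up-weighted substantially in low-label regimes.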
The results of AUX can be seen in Table 10 along with M1+M2 results from the original paper. The parameter size of each model is shown in Table 11.

D.2 mBERT Encoder
The supervised model (SUP-h) adds a single linear transformation layer on the pooled [CLS] representation of mBERT, and M1+M2+BOW adds the corresponding SDGM framework on the same mBERT output. As mBERT uses a shared WordPiece vocabulary across languages, the parameter size of a given model is the same for every language. All model and training details, along with parameter sizes, can be found in Table 12.
For tuning the hyperparameter of M1+M2+BOW, different from the LSTM encoder with VAE pretraining, we set α = β directly. We tune β on EN with 8 labels in the semi-supervised setting over 5 trials, choosing from {0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0}, pick the β with the best average dev. performance, and then apply it to all other languages and data sizes. We report the mean and variance over 5 trials; the full results for both models can be seen in Table 13.

E Conditional document generation
Semi-supervised deep generative models can not only capture complex data distributions, but are also able to generate documents conditioned on latent codes, which is another advantage over other semi-supervised models. We follow Kingma et al. (2014) by varying the latent variable y for generation, fixing z_2 either sampled from the prior (Table 14) or obtained from the input through the inference model (Table 15), and generate sequence samples from the trained semi-supervised models M1+M2+BOW and M1+M2+GRU. 14 Overall, all models generate words or utterances directly related to the class: the class labels are among the top nouns generated by the BOW models, and the subjects/objects in sentences from GRU also pertain to the corresponding classes. However, we also observe that the utterances from GRU are not always coherent.

Table 14: Generated samples from M1+M2+GRU (BOW) for class C (Corporate/Industrial), E (Economics), G (Government/Social), and M (Markets). We randomly sample z_2 from the prior while varying y.
Example 1 (original document, class E): Fiat shares lost nearly two percent on Wednesday, slipping below the psychologically important 4,000 lire level in thin trading on a generally easier Milan Bourse, traders said. "The stock has gradually lost ground but without any major sell orders. At the moment there just isn't any interest in Fiat," one trader said. At 1439 GMT, Fiat was quoted 1.99 percent off at 3,980 lire, after touching a day's low of 3,970 lire, in volume of just under four million shares. The all-share Mibtel index posted a 0.47 percent fall. -Milan newsroom +392 66129589 (E)

Example 1 (real input): fiat shares lost nearly two percent on UNK slipping below the psychologically important UNK lire level in thin trading on a generally easier milan UNK traders UNK UNK stock has gradually lost ground but without any major sell UNK at the moment there just UNK any interest in UNK one trader UNK at UNK UNK fiat was quoted UNK percent off at UNK UNK after touching a UNK low of UNK UNK in volume of just under four million UNK the UNK UNK index posted a UNK percent UNK UNK milan UNK UNK UNK

Example 2 (original document, class G): The top prosecutor of Honduras said on Wednesday that his country is a haven for money laundering. "In Honduras it's easy to launder money, the system allows it," Edmundo Orellana told reporters. "It's permitted because there is no law in Honduras that obligates a Honduran to explain the origin of his wealth." Honduran authorities estimate that $300 million in illegal drug profits is laundered through the country each year. Money laundering is not classified as an offence in Honduras, although legislators have been working on a bill to outlaw it since last year. (G)

Example 2 (real input): the top prosecutor of honduras said on wednesday that his country is a haven for money UNK UNK honduras UNK easy to launder UNK the system allows UNK UNK UNK told UNK UNK permitted because there is no law in honduras that UNK a honduran to explain the origin of his UNK honduran authorities estimate that UNK million in illegal drug profits is laundered through the country each UNK money laundering is not classified as an offence in UNK although legislators have been working on a bill to outlaw it since last UNK

Generated for example 1 (GRU): the bank of the settlement following the following vocational meda of the deal was delay ... and the market ...

Generated for example 2 (BOW): UNK, market, ticket, bank, traders, anticipation, delay, procedure, trade, prices, immigrants, rate, government, money, meda, escalation, demands, exchange, points, reallocation

Generated for example 2 (GRU): the bank of the settlement following the following vocational value of the relative gains of ...

Table 15: Generated samples from M1+M2+GRU (BOW) by varying the class label y. We take z_2 from the input examples shown above. For each example, the first is the original document with the class label at the end, and the second is the real input to the system.