Fast Easy Unsupervised Domain Adaptation with Marginalized Structured Dropout

Unsupervised domain adaptation often relies on transforming the instance representation. However, most such approaches are designed for bag-of-words models, and ignore the structured features present in many problems in NLP. We propose a new technique called marginalized structured dropout, which exploits feature structure to obtain a remarkably simple and efficient feature projection. Applied to the task of fine-grained part-of-speech tagging on a dataset of historical Portuguese, marginalized structured dropout yields state-of-the-art accuracy while increasing speed by more than an order of magnitude over previous work.


Introduction
Unsupervised domain adaptation is a fundamental problem for natural language processing, as we hope to apply our systems to datasets unlike those for which we have annotations. This is particularly relevant as labeled datasets become stale in comparison with rapidly evolving social media writing styles (Eisenstein, 2013), and as there is increasing interest in natural language processing for historical texts (Piotrowski, 2012). While a number of different approaches for domain adaptation have been proposed (Pan and Yang, 2010; Søgaard, 2013), they tend to emphasize bag-of-words features for classification tasks such as sentiment analysis. Consequently, many approaches rely on each instance having a relatively large number of active features, and fail to exploit the structured feature spaces that characterize syntactic tasks such as sequence labeling and parsing (Smith, 2011).
As we will show, substantial efficiency improvements can be obtained by designing domain adaptation methods for learning in structured feature spaces. We build on work from the deep learning community, in which denoising autoencoders are trained to remove synthetic noise from the observed instances (Glorot et al., 2011a). By using the autoencoder to transform the original feature space, one may obtain a representation that is less dependent on any individual feature, and therefore more robust across domains. Chen et al. (2012) showed that such autoencoders can be learned even as the noising process is analytically marginalized; the idea is similar in spirit to feature noising (Wang et al., 2013). While the marginalized denoising autoencoder (mDA) is considerably faster than the original denoising autoencoder, it requires solving a system of equations that can grow very large, as realistic NLP tasks can involve $10^5$ or more features.
In this paper we investigate noising functions that are explicitly designed for structured feature spaces, which are common in NLP. For example, in part-of-speech tagging, Toutanova et al. (2003) define several feature "templates": the current word, the previous word, the suffix of the current word, and so on. For each feature template, there are thousands of binary features. To exploit this structure, we propose two alternative noising techniques: (1) feature scrambling, which randomly chooses a feature template and randomly selects an alternative value within the template, and (2) structured dropout, which randomly eliminates all but a single feature template. We show how it is possible to marginalize over both types of noise, and find that the solution for structured dropout is substantially simpler and more efficient than the mDA approach of Chen et al. (2012), which does not consider feature structure.
We apply these ideas to fine-grained part-of-speech tagging on a dataset of Portuguese texts from the years 1502 to 1836 (Galves and Faria, 2010), training on recent texts and evaluating on older documents. Both structure-aware domain adaptation algorithms perform as well as standard dropout, and better than the well-known structural correspondence learning (SCL) algorithm (Blitzer et al., 2007), but structured dropout is more than an order of magnitude faster. As a secondary contribution of this paper, we demonstrate the applicability of unsupervised domain adaptation to the syntactic analysis of historical texts.

Model
In this section we first briefly describe the denoising autoencoder (Glorot et al., 2011b), its application to domain adaptation, and the analytic marginalization of noise (Chen et al., 2012). Then we present three versions of marginalized denoising autoencoders (mDA) by incorporating different types of noise, including two new noising processes that are designed for structured features.

Denoising Autoencoders
Assume instances $x_1, \ldots, x_n$, which are drawn from both the source and target domains. We will "corrupt" these instances by adding different types of noise, and denote the corrupted version of $x_i$ by $\tilde{x}_i$. Single-layer denoising autoencoders reconstruct the corrupted inputs with a projection matrix $W : \mathbb{R}^d \to \mathbb{R}^d$, which is estimated by minimizing the squared reconstruction loss
$$\mathcal{L} = \frac{1}{2} \sum_{i=1}^{n} \| x_i - W \tilde{x}_i \|^2. \tag{1}$$
If we write $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, and we write its corrupted version $\tilde{X}$, then the loss in (1) can be written as
$$\mathcal{L}(W) = \frac{1}{2} \mathrm{tr}\left[ (X - W\tilde{X})^\top (X - W\tilde{X}) \right]. \tag{2}$$
In this case, we have the well-known closed-form solution for this ordinary least squares problem:
$$W = P Q^{-1},$$
where $Q = \tilde{X}\tilde{X}^\top$ and $P = X\tilde{X}^\top$. After obtaining the weight matrix $W$, we can insert nonlinearity into the output of the denoiser, such as $\tanh(WX)$. It is also possible to apply stacking, by passing this vector through another autoencoder (Chen et al., 2012). In pilot experiments, this slowed down estimation and had little effect on accuracy, so we did not include it.
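As an illustration of the closed-form estimate $W = PQ^{-1}$ with $P = X\tilde{X}^\top$ and $Q = \tilde{X}\tilde{X}^\top$, the following NumPy sketch fits a single-layer denoiser on one explicitly sampled dropout corruption of toy data. The function name `fit_denoiser`, the toy data, and the small `ridge` term (added only to keep $Q$ invertible) are assumptions made for this sketch and are not part of the method described here.

```python
import numpy as np

def fit_denoiser(X, X_tilde, ridge=1e-5):
    """Closed-form W = P Q^{-1} with P = X X~^T and Q = X~ X~^T.

    X, X_tilde: arrays of shape (d, n), one instance per column.
    ridge: small diagonal term (an assumption of this sketch) so that Q
    stays invertible if a feature is zero in every corrupted column.
    """
    P = X @ X_tilde.T
    Q = X_tilde @ X_tilde.T + ridge * np.eye(X.shape[0])
    # Q is symmetric, so W = P Q^{-1} is the transpose of the solution of Q Y = P^T.
    return np.linalg.solve(Q, P.T).T

rng = np.random.default_rng(0)
X = (rng.random((6, 200)) < 0.3).astype(float)  # toy binary features, d=6, n=200
keep = rng.random(X.shape) >= 0.5               # keep each entry with probability 0.5
X_tilde = X * keep                              # one dropout-corrupted copy of X
W = fit_denoiser(X, X_tilde)
features = np.tanh(W @ X)                       # nonlinear denoised representation
```

Solving the linear system rather than forming an explicit inverse is a common numerical choice; either route gives the same $W$ up to rounding.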
High-dimensional setting Structured prediction tasks often have many more features than simple bag-of-words representations, and performance relies on the rare features. In a naive implementation of the denoising approach, both $P$ and $Q$ will be dense matrices with dimensionality $d \times d$, which would be roughly $10^{11}$ elements in our experiments. To solve this problem, Chen et al. (2012) propose to use a set of pivot features, and train the autoencoder to reconstruct the pivots from the full set of features. Specifically, the corrupted input is divided into $S$ subsets, $\tilde{x}_i = [(\tilde{x}_i^1)^\top, \ldots, (\tilde{x}_i^S)^\top]^\top$. We obtain a projection matrix $W^s$ for each subset by reconstructing the pivot features from the features in this subset; we can then use the sum of all reconstructions as the new features, $\tanh(\sum_{s=1}^{S} W^s X^s)$.
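The pivot-based variant can be sketched as follows: each feature subset gets its own projection that reconstructs the pivot rows, and the reconstructions are summed before the nonlinearity. This is only a rough sketch; the name `pivot_projection`, the contiguous equal-sized split of the features, and the `ridge` term are assumptions, since the text does not say how the subsets are formed.

```python
import numpy as np

def pivot_projection(X, X_tilde, pivot_idx, n_subsets, ridge=1e-5):
    """Return tanh(sum_s W^s X^s), where W^s reconstructs the pivots from subset s.

    X, X_tilde: (d, n) clean and corrupted data, one instance per column.
    pivot_idx: indices of the r pivot features.
    n_subsets: number of feature subsets S (contiguous split, an assumption).
    """
    d, n = X.shape
    X_pivot = X[pivot_idx]                       # (r, n) reconstruction targets
    total = np.zeros((len(pivot_idx), n))
    for idx in np.array_split(np.arange(d), n_subsets):
        Xs_tilde = X_tilde[idx]                  # corrupted features in this subset
        P = X_pivot @ Xs_tilde.T                 # (r, |subset|)
        Q = Xs_tilde @ Xs_tilde.T + ridge * np.eye(len(idx))
        W_s = np.linalg.solve(Q, P.T).T          # W^s = P Q^{-1}
        total += W_s @ X[idx]                    # reconstruction from subset s
    return np.tanh(total)
```

Each linear system now has the size of one subset rather than of the full feature set, which is the point of the split.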

Marginalized Denoising Autoencoders
In the standard denoising autoencoder, we need to generate multiple versions of the corrupted data $\tilde{X}$ to reduce the variance of the solution (Glorot et al., 2011b). But Chen et al. (2012) show that it is possible to marginalize over the noise, analytically computing expectations of both $P$ and $Q$, and computing
$$W = E[P]\, E[Q]^{-1},$$
where $E[P] = \sum_{i=1}^{n} E[x_i \tilde{x}_i^\top]$ and $E[Q] = \sum_{i=1}^{n} E[\tilde{x}_i \tilde{x}_i^\top]$. This is equivalent to corrupting the data $m \to \infty$ times. The computation of these expectations depends on the type of noise.

Noise distributions
Chen et al. (2012) used dropout noise for domain adaptation, which we briefly review. We then describe two novel types of noise that are designed for structured feature spaces, and explain how they can be marginalized to efficiently compute W.
Dropout noise In dropout noise, each feature is set to zero with probability $p > 0$. If we define the scatter matrix of the uncorrupted input as $S = XX^\top$, the solutions under dropout noise are
$$E[Q]_{\alpha,\beta} = \begin{cases} (1-p)^2 S_{\alpha,\beta} & \text{if } \alpha \neq \beta \\ (1-p) S_{\alpha,\beta} & \text{if } \alpha = \beta \end{cases}$$
and
$$E[P]_{\alpha,\beta} = (1-p) S_{\alpha,\beta},$$
where $\alpha$ and $\beta$ index two features. The form of these solutions means that computing $W$ requires solving a system of equations whose size is equal to the number of features (in the naive implementation), or several smaller systems of equations (in the high-dimensional version). Note also that $p$ is a tunable parameter for this type of noise.
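A minimal NumPy sketch of marginalized dropout, assuming the expectations take the form $E[P] = (1-p)S$, with $E[Q]$ equal to $(1-p)^2 S$ off the diagonal and $(1-p)S$ on the diagonal. The function names and the `ridge` term are illustrative assumptions, not part of the original description.

```python
import numpy as np

def dropout_expectations(X, p):
    """E[P] and E[Q] when each feature is zeroed independently with probability p."""
    S = X @ X.T                                 # scatter matrix of the clean data
    keep = 1.0 - p
    EP = keep * S
    EQ = keep ** 2 * S                          # off-diagonal: both features survive
    np.fill_diagonal(EQ, keep * np.diag(S))     # diagonal: one feature survives
    return EP, EQ

def mda_dropout(X, p, ridge=1e-5):
    """W = E[P] E[Q]^{-1}; ridge (an assumption) keeps E[Q] invertible."""
    EP, EQ = dropout_expectations(X, p)
    EQ = EQ + ridge * np.eye(X.shape[0])
    return np.linalg.solve(EQ, EP.T).T          # E[Q] is symmetric
```

No corrupted copies of the data are sampled here; the cost is dominated by the dense linear solve, whose size grows with the number of features.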
Structured dropout noise In many NLP settings, we have several feature templates, such as previous-word, middle-word, next-word, etc., with only one feature per template firing on any token. We can exploit this structure by using an alternative dropout scheme: for each token, choose exactly one feature template to keep, and zero out all other features that consider this token (transition feature templates such as $y_t, y_{t-1}$ are not considered for dropout). Assuming we have $K$ feature templates, this noise leads to very simple solutions for the marginalized matrices $E[P]$ and $E[Q]$:
$$E[P] = \frac{1}{K} S, \qquad E[Q]_{\alpha,\beta} = \begin{cases} 0 & \text{if } \alpha \neq \beta \\ \frac{1}{K} S_{\alpha,\alpha} & \text{if } \alpha = \beta \end{cases}$$
For $E[P]$, we obtain a scaled version of the scatter matrix, because in each instance $\tilde{x}$, there is exactly a $1/K$ chance that each individual feature survives dropout. $E[Q]$ is diagonal, because for any off-diagonal entry $E[Q]_{\alpha,\beta}$, at least one of $\alpha$ and $\beta$ will drop out for every instance. We can therefore view the projection matrix $W$ as a row-normalized version of the scatter matrix $S$. Put another way, the contribution of $\beta$ to the reconstruction for $\alpha$ is equal to the co-occurrence count of $\alpha$ and $\beta$, divided by the count of $\beta$.
Unlike standard dropout, there are no free hyperparameters to tune for structured dropout. Since $E[Q]$ is a diagonal matrix, we eliminate the cost of matrix inversion (or of solving a system of linear equations). Moreover, to extend mDA for high-dimensional data, we no longer need to divide the corrupted input $\tilde{x}$ into several subsets. ($E[P]$ is an $r$ by $d$ matrix, where $r$ is the number of pivots.)
For intuition, consider standard feature dropout with $p = \frac{K-1}{K}$. This will look very similar to structured dropout: the matrix $E[P]$ is identical, and $E[Q]$ has off-diagonal elements which are scaled by $(1-p)^2$, which goes to zero as $K$ becomes large. However, by including these elements, standard dropout is considerably slower, as we show in our experiments.
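Because $E[Q]$ is diagonal under structured dropout, the projection reduces to an elementwise division: the entry for features $\alpha$ and $\beta$ is their co-occurrence count divided by the count of $\beta$, and the $1/K$ factors in $E[P]$ and $E[Q]$ cancel. The sketch below illustrates this on a toy matrix with two templates; the function name, the optional `pivot_idx` argument, and the guard for features that never fire are assumptions made for the example.

```python
import numpy as np

def structured_dropout_projection(X, pivot_idx=None):
    """W[a, b] = S[a, b] / S[b, b], where S = X X^T holds co-occurrence counts.

    X: (d, n) binary feature matrix with one active feature per template.
    pivot_idx: optional pivot rows to reconstruct, giving an (r, d) matrix.
    """
    counts = (X * X).sum(axis=1)                # diagonal of S: count of each feature
    rows = X if pivot_idx is None else X[pivot_idx]
    S = rows @ X.T                              # co-occurrence counts
    safe = np.where(counts > 0, counts, 1.0)    # guard for unseen features (assumption)
    return S / safe                             # broadcasting divides column b by count b

# Toy data: template A has features 0-1, template B has features 2-4.
X = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)
W = structured_dropout_projection(X)
features = np.tanh(W @ X)
```

No linear system is solved and the number of templates $K$ never appears, which matches the observation that this noise has no free hyperparameter.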
Scrambling noise A third alternative is to "scramble" the features by randomly selecting alternative features within each template. For a feature $\alpha$ belonging to a template $F$, with probability $p$ we will draw a noise feature $\beta$ also belonging to $F$, according to some distribution $q$. In this work, we use a uniform distribution, in which $q_\beta = \frac{1}{|F|}$. However, the solutions below will also hold for other scrambling distributions, such as mean-preserving distributions.
Again, it is possible to analytically marginalize over this noise. Recall that $E[Q] = \sum_{i=1}^{n} E[\tilde{x}_i \tilde{x}_i^\top]$. An off-diagonal entry in the matrix $\tilde{x}_i \tilde{x}_i^\top$ which involves features $\alpha$ and $\beta$ belonging to different templates ($F_\alpha \neq F_\beta$) can take four different values ($x_{i,\alpha}$ denotes feature $\alpha$ in $x_i$):
• $x_{i,\alpha} x_{i,\beta}$ if both features are unchanged, which happens with probability $(1-p)^2$.
• 1 if both features are chosen as noise features, which happens with probability $p^2 q_\alpha q_\beta$.
• $x_{i,\alpha}$ or $x_{i,\beta}$ if one feature is unchanged and the other one is chosen as the noise feature, which happens with probability $p(1-p)q_\beta$ or $p(1-p)q_\alpha$.
The diagonal entries take the first two values above, with probability $1-p$ and $p q_\alpha$ respectively. All other entries are zero, because only one feature belonging to the same template will fire in $x_i$. We can use similar reasoning to compute the expectation of $P$. With probability $(1-p)$, the original features are preserved, and we add the outer product $x_i x_i^\top$; with probability $p$, we add the outer product $x_i q^\top$. Therefore $E[P]$ can be computed as the sum of these terms.
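The case analysis for scrambling noise can be collected into a short NumPy sketch. It assumes binary features with exactly one active feature per template, a uniform $q$ within each template, and independent scrambling decisions across templates; under those assumptions the cross-template entries of $E[Q]$ factor into products of the per-feature means $(1-p)x_{i,\alpha} + p q_\alpha$. The function name and the `template_of` array are hypothetical.

```python
import numpy as np

def scrambling_expectations(X, template_of, p):
    """E[P] and E[Q] under scrambling noise with a uniform within-template q.

    X: (d, n) binary matrix, exactly one active feature per template per column.
    template_of: (d,) integer array giving the template id of each feature.
    p: probability of replacing a template's feature by a random draw.
    """
    sizes = np.bincount(template_of)
    q = 1.0 / sizes[template_of]                # q_beta = 1 / |F| for each feature
    M = (1.0 - p) * X + p * q[:, None]          # E[x~_i] for every instance, (d, n)
    EP = X @ M.T                                # sum_i x_i E[x~_i]^T
    EQ = M @ M.T                                # cross-template entries factorize
    same = template_of[:, None] == template_of[None, :]
    EQ[same] = 0.0                              # one feature per template ever fires
    np.fill_diagonal(EQ, M.sum(axis=1))         # binary: E[x~^2] = E[x~]
    return EP, EQ
```

Unlike structured dropout, $E[Q]$ here is dense across templates, so obtaining $W$ still requires solving a linear system.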

Experiments
We compare these methods on historical Portuguese part-of-speech tagging, creating domains over historical epochs.

Experiment setup
Datasets We use the Tycho Brahe corpus to evaluate our methods. The corpus contains a total of 1,480,528 manually tagged words. It uses a set of 383 tags and is composed of various texts from historical Portuguese, from 1502 to 1836. We divide the texts into fifty-year periods to create different domains. Table 1 presents some statistics of the datasets. We hold out 5% of data as development data to tune parameters. The two most recent domains (1800-1849 and 1750-1849) are treated as source domains, and the other domains are target domains. This scenario is motivated by training a tagger on a modern newstext corpus and applying it to historical documents.
The tagger is a CRF (Okazaki, 2007), trained with SGD optimization. Following the work of Nogueira Dos [...] on this dataset, we apply the feature set of Ratnaparkhi (1996). There are 16 feature templates and 372,902 features in total. Following Blitzer et al. (2006), we consider pivot features that appear more than 50 times in all the domains. This leads to a total of 1572 pivot features in our experiments.
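One possible reading of the pivot criterion is that a feature must occur more than 50 times in every domain. The sketch below implements that reading; the function name, the per-domain count matrix, and the counts themselves are made up for illustration and do not come from the corpus.

```python
import numpy as np

def select_pivots(counts, min_count=50):
    """Indices of features whose count exceeds min_count in every domain.

    counts: (n_domains, n_features) array of occurrence counts.
    """
    return np.flatnonzero((counts > min_count).all(axis=0))

# Made-up counts for 3 domains and 4 features (illustration only).
counts = np.array([[120, 60, 3, 51],
                   [ 80, 10, 0, 75],
                   [ 95, 70, 1, 50]])
pivots = select_pivots(counts)
```

If the criterion were instead meant as a total count across domains, the test would become `counts.sum(axis=0) > min_count`.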
Methods We compare mDA with three alternative approaches. We refer to baseline as training a CRF tagger on the source domain and testing on the target domain with only base features. We also include PCA to project the entire dataset onto a low-dimensional sub-space (while still including the original features). Finally, we compare against Structural Correspondence Learning (SCL; Blitzer et al., 2006), another feature learning algorithm. In all cases, we include the entire dataset to compute the feature projections; we also conducted experiments using only the test and training data for feature projections, with very similar results.
Parameters All hyperparameters are selected on the development data held out from the training set. We try values of the low dimension K from 10 to 2000 for PCA. Following Blitzer (2008), we perform feature centering/normalization, as well as rescaling for SCL. The best parameters for SCL are dimensionality K = 25 and rescale factor α = 5, which are the same as in the original paper. For mDA, the best corruption level is p = 0.9 for dropout noise, and p = 0.1 for scrambling noise. Structured dropout noise has no free hyperparameters. Table 2 presents results for different domain adaptation tasks. We also compute the transfer ratio, which is defined as $\frac{\text{adaptation accuracy}}{\text{baseline accuracy}}$, shown in Figure 1. The generally positive trend of these graphs indicates that adaptation becomes progressively more important as we select test sets that are more temporally remote from the training data.
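The transfer ratio is a simple quotient, with values above 1 indicating that adaptation helps relative to the baseline. A minimal sketch follows; the function name and the accuracies passed to it are placeholders, not results from these experiments.

```python
def transfer_ratio(adaptation_accuracy, baseline_accuracy):
    """Ratio of adapted accuracy to baseline accuracy on the same target domain."""
    if baseline_accuracy <= 0:
        raise ValueError("baseline accuracy must be positive")
    return adaptation_accuracy / baseline_accuracy

# Placeholder accuracies, for illustration only.
ratio = transfer_ratio(0.90, 0.88)
```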

Results
In general, mDA outperforms SCL and PCA, the latter of which shows little improvement over the base features. The various noising approaches for mDA give very similar results. However, structured dropout is orders of magnitude faster than the alternatives, as shown in [...]
[...] (Jiang and Zhai, 2007; Daumé III, 2007; Finkel and Manning, 2009). Our work focuses on unsupervised domain adaptation, where no labeled data is available in the target domain. Several representation learning methods have been proposed to solve this problem. In structural correspondence learning (SCL), the induced representation is based on the task of predicting the presence of pivot features. Autoencoders apply a similar idea, but use the denoised instances as the latent representation (Vincent et al., 2008; Glorot et al., 2011b; Chen et al., 2012). Within the context of denoising autoencoders, we have focused [...]
On the specific problem of sequence labeling, Xiao and Guo (2013) proposed a supervised domain adaptation method by using a log-bilinear language adaptation model. Dhillon et al. (2011) presented a spectral method to estimate low-dimensional context-specific word representations for sequence labeling. Huang and Yates (2009; 2012) used an HMM to learn latent representations, and then leveraged the Posterior Regularization framework to incorporate specific biases. Unlike these methods, our approach uses a standard CRF, but with transformed features.
Historical text Our evaluation concerns syntactic analysis of historical text, which is a topic of increasing interest for NLP (Piotrowski, 2012). Pennacchiotti and Zanzotto (2008) find that part-of-speech tagging degrades considerably when applied to a corpus of historical Italian. Moon and Baldridge (2007) tackle the challenging problem of tagging Middle English, using techniques for projecting syntactic annotations across languages. Prior work on the Tycho Brahe corpus applied supervised learning to a random split of test and training data (Kepler and Finger, 2006; [...]); they did not consider the domain adaptation problem of training on recent data and testing on older historical text.

Conclusion and Future Work
Denoising autoencoders provide an intuitive solution for domain adaptation: transform the features into a representation that is resistant to the noise that may characterize the domain adaptation process. The original implementation of this idea produced this noise directly (Glorot et al., 2011b); later work showed that dropout noise could be analytically marginalized (Chen et al., 2012). We take another step towards simplicity by showing that structured dropout can make marginalization even easier, obtaining dramatic speedups without sacrificing accuracy.