Bayesian Kernel Methods for Natural Language Processing

Kernel methods are heavily used in Natural Language Processing (NLP). Frequentist approaches like Support Vector Machines are the state-of-the-art in many tasks. However, these approaches lack efficient procedures for model selection, which hinders the use of more advanced kernels. In this work, we propose the use of a Bayesian approach for kernel methods, Gaussian Processes, which allow easy model fitting even for complex kernel combinations. Our goal is to employ this approach to improve results in a number of regression and classification tasks in NLP.


Introduction
In recent years, kernel methods have been successfully employed in many Natural Language Processing (NLP) tasks. These methods allow the building of non-parametric models which make fewer assumptions about the underlying patterns in the data. Another advantage of kernels is that they can be defined directly over arbitrary structures like strings or trees, which greatly reduces the need for careful feature engineering on these structures.
The properties cited above make kernel methods ideal for problems where we do not have much prior knowledge about how the data behaves. This is a common setting in NLP, where kernel methods have mostly been applied in the form of Support Vector Machines (SVMs). Systems based on SVMs have been the state-of-the-art in classification tasks like Text Categorization (Lodhi et al., 2002), Sentiment Analysis (Johansson and Moschitti, 2013; Pérez-Rosas and Mihalcea, 2013) and Question Classification (Moschitti, 2006; Croce et al., 2011). Recently, they have also been employed in regression settings like Machine Translation Quality Estimation (Specia and Farzindar, 2010; Bojar et al., 2013) and in structured prediction (Chang et al., 2013).
SVMs are a frequentist method: they aim to find an approximation to the exact latent function that explains the data. This is in contrast to Bayesian settings, which define a prior distribution over this function and perform inference by marginalizing over all its possible values. Although there is some discussion about which approach is better (Murphy, 2012, Sec. 6.6.4), Bayesian methods offer many useful theoretical properties. In fact, they have been used before in NLP, especially in grammar induction (Cohn et al., 2010) and word segmentation (Goldwater et al., 2009). However, only very recently have kernel methods been applied in NLP under a Bayesian approach.
Gaussian Processes (GPs) are the Bayesian counterpart of kernel methods and are widely considered the state-of-the-art for inference on functions (Hensman et al., 2013). They have a number of advantages which are very useful in NLP:
• Kernels in general can be combined and parameterized in many ways. This parameterization leads to the problem of model selection, which is difficult in frequentist approaches (mainly based on cross-validation). The Bayesian formulation of GPs lets them deal with model selection in a much more efficient and elegant way: by maximizing the marginal likelihood on the training data. This opens the door for the use of heavily parameterized kernel combinations, like multi-task kernels for example.
• Being a probabilistic framework, they are able to naturally encode uncertainty in the predictions, which can be propagated if the task is part of a larger system pipeline.
Besides these properties, GPs have also been applied successfully in many Machine Learning tasks. Examples include Robotics (Ko et al., 2007), Bioinformatics (Polajnar et al., 2011), Geolocation (Schwaighofer et al., 2004) and Computer Vision (Sinz et al., 2004). In NLP, GPs have been used only very recently, with a focus on regression tasks (Preotiuc-Pietro and Cohn, 2013). In this work, we propose to combine GPs with recent kernel developments to advance the state-of-the-art in a number of NLP tasks.

Gaussian Processes
In this Section, we follow closely the definition of Rasmussen and Williams (2006). Consider a machine learning setting, where we have a dataset X = {(x₁, y₁), (x₂, y₂), . . . , (xₙ, yₙ)} and our goal is to infer the underlying function f(x) that best explains the data. A GP model assumes a prior stochastic process over this function:

f(x) ∼ GP(µ(x), k(x, x′)),   (1)

where µ(x) is the mean function, which is usually the constant 0, and k(x, x′) is the kernel or covariance function. In this sense, GPs are analogous to Gaussian distributions, which are also defined in terms of a mean and a variance or, in the case of multivariate Gaussians, a mean vector and a covariance matrix. In fact, a GP can be interpreted as an infinite-dimensional multivariate Gaussian distribution. The full model uses Bayes' rule to define a posterior over f, combining the GP prior with the data likelihood:

p(f|X, y) = p(y|X, f) p(f) / p(y|X),   (2)

where X and y are the training inputs and outputs, respectively. The posterior is then used to predict the label for an unseen input x∗ by marginalizing over all possible latent functions:

p(y∗|x∗, X, y) = ∫ p(y∗|x∗, f) p(f|X, y) df,   (3)

where y∗ is the predicted output. The choice of likelihood distribution depends on whether the task is regression, classification or another prediction setting.
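To make the "infinite-dimensional Gaussian" interpretation concrete, the following sketch (our own illustration, not part of the original formulation) evaluates a zero-mean GP prior with an RBF covariance on a finite grid and draws samples from the resulting multivariate Gaussian. NumPy is assumed, and the rbf function defined here is reused in later snippets.

import numpy as np

def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    """RBF (squared exponential) covariance between two arrays of scalar inputs."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# On a finite grid the GP prior reduces to a multivariate Gaussian f ~ N(0, K),
# where K is the Gram matrix over the grid points.
x = np.linspace(-5, 5, 100)
K = rbf(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability
prior_samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)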

GP Regression
In a regression setting, we assume that the output values are noisy latent function evaluations, i.e., yᵢ = f(xᵢ) + η, where η ∼ N(0, σₙ²) is added white noise. We also usually assume a Gaussian likelihood, because this allows us to solve the integral in Equation 3 analytically. Substituting the likelihood and the prior in Equations 2 and 3 and manipulating the result, we obtain a posterior which is also a Gaussian distribution:

p(y∗|x∗, X, y) = N(k∗ᵀ(K + σₙ²I)⁻¹y, k(x∗, x∗) − k∗ᵀ(K + σₙ²I)⁻¹k∗ + σₙ²),   (4)

where K is the Gram matrix corresponding to the training inputs and k∗ = [k(x₁, x∗), k(x₂, x∗), . . . , k(xₙ, x∗)]ᵀ is the vector of kernel evaluations between the test input and each training input.
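The predictive equations above can be computed directly with a few matrix operations. The following sketch is our own illustration (the kernel argument is any function returning a Gram matrix, such as the rbf defined earlier; 1-D inputs are assumed for simplicity):

import numpy as np

def gp_regression_predict(X_train, y_train, X_test, kernel, noise_var=0.1):
    """Exact GP regression: predictive mean and variance from Equation 4."""
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_train, X_test)           # n x n_test kernel evaluations
    K_star_star = kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)        # (K + sigma_n^2 I)^{-1} y
    mean = K_star.T @ alpha
    cov = K_star_star - K_star.T @ np.linalg.solve(K, K_star)
    var = np.diag(cov) + noise_var             # predictive variance over y*
    return mean, var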

GP Classification
Consider binary classification using −1 and +1 as labels 1 . The model in this case uses the actual, noiseless latent function evaluations f and "squashes" them into the [−1, +1] interval to obtain the outputs. The posterior over the outputs is then defined as:

p(y∗|x∗, X, y) = ∫ σ(f∗) p(f∗|x∗, X, y) df∗,   (5)

where σ(f∗) is a squashing function. Two common choices are the logistic function and the probit function. The distribution over the latent values f∗ is obtained by integrating out the latent function:

p(f∗|x∗, X, y) = ∫ p(f∗|x∗, X, f) p(f|X, y) df.   (6)

Because the likelihood is not Gaussian, the resulting posterior integral is no longer analytically tractable. The most common solution to this problem is to approximate the posterior p(f|X, y) with a Gaussian q(f|X, y). Two such approximation algorithms are the Laplace approximation (Williams and Barber, 1998) and Expectation Propagation (Minka, 2001). Another option is to use Markov Chain Monte Carlo sampling on the true posterior (Neal, 1998).
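Once a Gaussian approximation N(µ∗, σ∗²) to the latent distribution in Equation 6 is available, the output integral in Equation 5 is cheap to evaluate. The snippet below is a sketch we add for illustration; it assumes the probit convention of Rasmussen and Williams (2006, Ch. 3), where the squashing function maps latent values to probabilities of the +1 class, and it is not part of the approximation algorithms cited above.

import numpy as np
from scipy.stats import norm

def predict_prob_positive(mu_star, var_star):
    """Probit squashing: the integral of Phi(f) against N(f | mu, var) has the
    closed form Phi(mu / sqrt(1 + var))."""
    return norm.cdf(mu_star / np.sqrt(1.0 + var_star))

def predict_prob_positive_mc(mu_star, var_star, n_samples=100000, seed=0):
    """Monte Carlo estimate of the same integral, useful for the logistic
    squashing function, which has no closed form."""
    rng = np.random.default_rng(seed)
    f = rng.normal(mu_star, np.sqrt(var_star), size=n_samples)
    return (1.0 / (1.0 + np.exp(-f))).mean()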

Hyperparameter Optimization
The GP priors used in the models described above usually have a number of hyperparameters. The most important ones are the kernel hyperparameters, but they can also include others, like the white noise variance σₙ² used in regression. A key property of GPs is their ability to easily fit these hyperparameters to the data by maximizing the marginal likelihood:

θ̂ = argmax_θ p(y|X, θ) = argmax_θ ∫ p(y|X, f, θ) p(f|θ) df,

where θ represents the full set of hyperparameters (which was suppressed from all conditionals until now for brevity). Optimization involves deriving the gradients of the marginal log likelihood w.r.t. the hyperparameters and then employing a gradient ascent procedure. Gradients can be found analytically for regression and by approximations for classification, using methods similar to the ones used for prediction.
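For GP regression the marginal likelihood is available in closed form, log p(y|X, θ) = −½ yᵀ(K + σₙ²I)⁻¹y − ½ log|K + σₙ²I| − (n/2) log 2π. The sketch below is our own illustration, reusing the rbf kernel from the earlier snippet and optimizing log-transformed hyperparameters (both assumptions on our part); it evaluates the negative of this objective and hands it to a generic optimizer.

import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y, kernel):
    """Negative log marginal likelihood of GP regression; hyperparameters are
    kept positive by optimizing them in log-space."""
    lengthscale, variance, noise_var = np.exp(log_params)
    K = kernel(X, X, lengthscale, variance) + (noise_var + 1e-6) * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))        # 0.5 * log|K + sigma_n^2 I|
            + 0.5 * len(X) * np.log(2 * np.pi))

# Toy 1-D example; rbf is the kernel function from the earlier snippet.
X_train = np.linspace(0, 10, 50)
y_train = np.sin(X_train) + 0.1 * np.random.randn(50)
result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3),
                  args=(X_train, y_train, rbf))
lengthscale, variance, noise_var = np.exp(result.x)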

Sparse Approximations for GPs
SVMs are naturally sparse models which use only a subset of the data points to make predictions. This results in important speed-ups, which is one of the reasons for their success. Canonical GPs, on the other hand, are not sparse: they make use of all data points. This results in a training complexity of O(n³) (due to the Gram matrix inversion) and O(n) for predictions. Sparse GPs tackle this problem by approximating the Gram matrix using only a subset of m inducing inputs. Without loss of generality, consider these m inputs to be the first ones in the training data and the remaining (n − m) the rest. Then we can partition the Gram matrix in the following way (Rasmussen and Williams, 2006, Sec. 8.1):

K = [ K_mm        K_m(n−m)     ]
    [ K_(n−m)m    K_(n−m)(n−m) ],

where each block corresponds to a matrix of kernel evaluations between two sets of inputs. For brevity, we refer to K_m(n−m) as K_mn and to its transpose as K_nm. The block structure of K forms the basis of the so-called Nyström approximation:

K ≈ K̃ = K_nm K_mm⁻¹ K_mn,

which results in the following predictive posterior:

p(y∗|x∗, X, y) = N(k_m∗ᵀ G⁻¹ K_mn y, σₙ² k_m∗ᵀ G⁻¹ k_m∗),

where G = σₙ² K_mm + K_mn K_nm and k_m∗ is the vector of kernel evaluations between the test input x∗ and the m inducing inputs. The resulting complexities for training and prediction are O(m²n) and O(m), respectively.
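A minimal sketch of this sparse predictive posterior (our illustration, assuming a generic kernel(A, B) function returning a Gram matrix and taking the first m training points as inducing inputs):

import numpy as np

def sparse_gp_predict(X_train, y_train, X_test, kernel, m=50, noise_var=0.1):
    """Nystrom (subset of regressors) GP prediction using m inducing inputs."""
    X_m = X_train[:m]                           # inducing inputs: first m points
    K_mm = kernel(X_m, X_m)
    K_mn = kernel(X_m, X_train)                 # m x n
    K_m_star = kernel(X_m, X_test)              # m x n_test
    G = noise_var * K_mm + K_mn @ K_mn.T        # G = sigma_n^2 K_mm + K_mn K_nm
    mean = K_m_star.T @ np.linalg.solve(G, K_mn @ y_train)
    var = noise_var * np.einsum('ij,ij->j', K_m_star,
                                np.linalg.solve(G, K_m_star))
    return mean, var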
The remaining question is how to choose the inducing inputs. Seeger et al. (2003) use an iterative method that starts with some random data points and adds new ones based on a greedy procedure, in an active learning fashion. Snelson and Ghahramani (2006) take a different approach: they define a fixed m a priori and use pseudo-inputs which can be optimized as regular hyperparameters. Later, Titsias (2009) also used pseudo-inputs but performed the optimization using a variational method instead. Recently, Hensman et al. (2013) modified this method to allow Stochastic Variational Inference (Hoffman et al., 2013), which reduces the training complexity to O(m³).

Kernels
The core of a GP model is the kernel function. A kernel k(x, x′) is a symmetric, positive semi-definite function which returns a similarity score between two inputs in some feature space (Shawe-Taylor and Cristianini, 2004). Probably the most widely used kernel is the Radial Basis Function (RBF) kernel, which is defined over two real-valued vectors. Our focus in this work is on two different types of kernels which can be applied in NLP settings and allow richer parameterizations.
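For reference (this standard form is not spelled out above), the RBF kernel is usually written as

k(x, x′) = σ² exp(−‖x − x′‖² / (2ℓ²)),

where the lengthscale ℓ and the variance σ² are its hyperparameters.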

Kernels for Discrete Structures
In NLP, discrete structures like strings or trees are common in training data. To apply a vectorial kernel like the RBF, one can always extract real-valued features from these structures. However, kernels can be defined directly on these structures, potentially reducing the need for feature engineering. The string and tree kernels we define here are based on the theory of Convolution kernels of Haussler (1999), which calculate the similarity between two structures based on the number of substructures they have in common. Other approaches include random walk kernels (Gärtner et al., 2003; Vishwanathan et al., 2010) and Fisher kernels (Jaakkola et al., 2000).

String Kernels
Consider a function φ_s(x) that counts the number of times a substring s appears in x. A string kernel is defined as:

k(x, x′) = Σ_{s ∈ Σ*} w_s φ_s(x) φ_s(x′),

where w_s is a non-negative weight for substring s and Σ* is the set of all possible strings over the symbol alphabet Σ. Usually in NLP each word is considered a symbol, although some previous work has also considered characters as symbols (Lodhi et al., 2002). If we restrict s to single words we end up with a bag-of-words (BOW) representation. Allowing longer substrings leads us to the Word Sequence Kernels of Cancedda et al. (2003), which also allow gaps between words.
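A minimal sketch of such a kernel, which we add purely for illustration (restricted to contiguous word n-grams with a length-based weight, so no gaps or soft matching):

from collections import Counter

def ngram_counts(tokens, max_n=3):
    """phi_s(x): counts of all contiguous word n-grams up to length max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def string_kernel(x, y, max_n=3, weight=lambda s: 0.5 ** len(s)):
    """k(x, x') = sum_s w_s * phi_s(x) * phi_s(x') over shared n-grams."""
    cx, cy = ngram_counts(x, max_n), ngram_counts(y, max_n)
    return sum(weight(s) * cx[s] * cy[s] for s in cx.keys() & cy.keys())

# Word-level example on two tokenized sentences.
print(string_kernel("the cat sat".split(), "the cat ran".split()))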
One extension of these kernels is to allow soft matching between substrings. This is done by defining a similarity matrix S, which encodes symbol similarities. This matrix can be defined using external resources, like WordNet, or be inferred from data using Latent Semantic Analysis (Deerwester et al., 1990), for example.

Tree Kernels
Collins and Duffy (2001) first introduced Tree Kernels, which measure the similarity between two trees by counting the number of fragments they share, in a very similar way to string kernels. Consider two trees T₁ and T₂, and let N₁ and N₂ be their respective sets of nodes. Consider also F, the full set of possible tree fragments (analogous to Σ* in the case of strings). We define I_i(n) as an indicator function that returns 1 if fragment f_i ∈ F has root n and 0 otherwise. A Tree Kernel can then be defined as:

k(T₁, T₂) = Σ_{n₁ ∈ N₁} Σ_{n₂ ∈ N₂} Δ(n₁, n₂),

where:

Δ(n₁, n₂) = Σ_{i=1}^{|F|} λ^{size(f_i)} I_i(n₁) I_i(n₂).

Here, 0 < λ < 1 is a decay factor that penalizes contributions from larger fragments compared to smaller ones.
Again, we can put restrictions on the type of tree fragment considered for comparison. Collins and Duffy (2001) defined Subtree Kernels, which consider only complete subtrees as fragments, and Subset Tree Kernels (SSTK), where fragments can also have non-terminals as leaves. Later, Moschitti (2006) introduced the Partial Tree Kernel (PTK), which allows fragments with partial rule expansions.
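In practice the double sum over F is not computed by enumerating fragments but with the recursion of Collins and Duffy (2001) over node pairs. Below is a simplified sketch of that recursion for the SSTK, which we add for illustration only; trees are assumed to be nested tuples of the form (label, child₁, ..., childₖ) with word leaves as plain strings.

def all_nodes(tree):
    """Collect every non-leaf node of a tree given as nested tuples."""
    if isinstance(tree, str):              # leaves (words) are plain strings
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes.extend(all_nodes(child))
    return nodes

def production(node):
    """The production at a node: its label plus the labels of its children."""
    label, *children = node
    return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

def delta(n1, n2, lam=0.5):
    """Collins and Duffy (2001) recursion: decayed count of shared fragments
    rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    kids1 = [c for c in n1[1:] if not isinstance(c, str)]
    kids2 = [c for c in n2[1:] if not isinstance(c, str)]
    if not kids1:                          # matching pre-terminal nodes
        return lam
    result = lam
    for c1, c2 in zip(kids1, kids2):
        result *= 1.0 + delta(c1, c2, lam)
    return result

def tree_kernel(t1, t2, lam=0.5):
    """k(T1, T2) = sum over all node pairs of delta(n1, n2)."""
    return sum(delta(n1, n2, lam) for n1 in all_nodes(t1) for n2 in all_nodes(t2))

# Example: two small constituency trees sharing an NP fragment.
t1 = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
t2 = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "sleeps")))
print(tree_kernel(t1, t2))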

Multi-task Kernels
Kernels can also be extended to deal with settings where we want to predict a vector of values (Álvarez et al., 2012). These settings are useful in multi-task and domain adaptation problems. Kernels for vector-valued functions are known as coregionalization kernels in the literature. Here we will refer to them as multi-task kernels.
One of the simplest ways to define a kernel for a multi-task setting is the Intrinsic Coregionalization Model (ICM):

K(x, x′) = B ⊗ k(x, x′),

where ⊗ denotes the Kronecker product and B is the coregionalization matrix, encoding task covariances. We denote the resulting kernel function as K(x, x′) to stress that its result is now a matrix instead of a scalar. Previous work used the ICM to model annotator bias in Quality Estimation datasets, parameterizing B in a number of different ways and obtaining significant improvements over single-task baselines, especially in post-editing time prediction. That work also pointed out that the well-known EasyAdapt method (Daumé III, 2007) for domain adaptation can be modeled by the ICM using B = 1 + I, i.e., a coregionalization matrix with its diagonal elements equal to 2 and the remaining elements equal to 1.
An extension of the ICM is the Linear Model of Coregionalization (LMC), which assumes a sum of kernels, each with a different coregionalization matrix:

K(x, x′) = Σ_{p ∈ P} B_p ⊗ k_p(x, x′),

where P is the set of different kernels employed. Álvarez et al. (2012) argue that the LMC is much more flexible than the ICM, because the latter assumes that each kernel contributes equally to the task covariances.
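To make the construction concrete, the following sketch (our own illustration, where the base kernel can be any function returning a Gram matrix, such as the rbf defined earlier) builds the full Gram matrices over all (task, input) pairs for the ICM and the LMC, and shows the B = 1 + I parameterization corresponding to EasyAdapt.

import numpy as np

def icm_gram(X, B, base_kernel):
    """Full ICM Gram matrix over all (task, input) pairs: Kronecker product of
    the coregionalization matrix B with the base Gram matrix."""
    return np.kron(B, base_kernel(X, X))

def lmc_gram(X, Bs, base_kernels):
    """LMC Gram matrix: a sum of ICM terms with different coregionalization
    matrices and base kernels."""
    return sum(np.kron(B, k(X, X)) for B, k in zip(Bs, base_kernels))

# EasyAdapt-style coregionalization matrix for T tasks: B = 1 + I.
T = 3
B_easyadapt = np.ones((T, T)) + np.eye(T)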

Planned Work
Our goal in this proposal is to employ GPs and the kernels introduced in Section 3 to advance the state-of-the-art in regression and classification NLP tasks. It would be unfeasible, though, at least for a single thesis, to address all possible tasks, so we focus on three tasks where kernel methods have already been successfully applied.

Quality Estimation
The purpose of Machine Translation Quality Estimation is to provide a quality prediction for new, unseen machine translated texts, without relying on reference translations (Blatz et al., 2004; Bojar et al., 2013). A common use of quality predictions is the decision between post-editing a given machine translated sentence and translating its source from scratch.
GP regression models have recently been employed successfully for post-editing time and HTER 2 prediction (Beck et al., 2013). These models used RBF kernels as the covariance function, so a natural extension is to apply the structured kernels of Section 3.1. This has already been done with tree kernels by Hardmeier (2011), in the context of SVMs.
Multi-task kernels can also be applied to Quality Estimation in several ways. The ICM-based annotator bias model mentioned above can be further extended to settings with dozens or even hundreds of annotators. This is a common setting in crowdsourcing platforms like Amazon's Mechanical Turk 3.
Another plan is to use multi-task kernels to combine different datasets. Quality annotation is usually expensive, requiring post-editing or subjective scoring. Possibilities include combining datasets from different language pairs or different machine translation systems. Available datasets include those used in the WMT12 and WMT13 QE shared tasks (Callison-Burch et al., 2012; Bojar et al., 2013) and others (Specia et al., 2009; Specia, 2011; Koponen et al., 2012).

Question Classification
A Question Classifier is a module that aims to restrict the answer hypotheses generated by a Question Answering system by assigning a label to the input question (Li and Roth, 2002; Li and Roth, 2005). This task can be seen as an instance of text classification, where the inputs are usually composed of only one sentence.
Much of the previous work in Question Classification used SVMs combined with structured kernels. Zhang and Lee (2003) compare String Kernels based on BOW and n-gram representations with the Subset Tree Kernel on constituency trees. Moschitti (2006) shows improved results by using the Partial Tree Kernel and dependency trees instead of constituency ones. Bloehdorn and Moschitti (2007) combine an SSTK with different soft matching approaches to encode lexical similarity on tree leaves. The same soft matching idea is used by Croce et al. (2011), but applied to PTKs instead and permitting soft matches between any nodes in each tree (which is sensible when using kernels on dependency trees).
Our work proposes to address this task by employing tree kernels and GPs. Unlike Quality Estimation, this is a classification setting, and our purpose is to find out whether this combination can also improve the state-of-the-art for tasks of this kind. We will use the TREC dataset provided by Li and Roth (2002), which annotates 6000 questions with both a coarse and a fine-grained label.

Multi-domain Sentiment Analysis
Sentiment Analysis is defined as "the computational treatment of opinion, sentiment and subjectivity in text" (Pang and Lee, 2008). In this proposal, we focus on the specific task of polarity detection, where the goal is to label a text as carrying positive or negative sentiment. State-of-the-art methods for this task use SVMs as the learning algorithm and differ mainly in the feature sets used.
Polarity predictions can be heavily biased by the text domain. Consider the example shown by Turney (2002): the word "unpredictable" usually has a positive meaning in a movie review but a negative one when applied to an automotive review (in a phrase like "unpredictable steering", for instance). One of the first methods to tackle this issue is the Structural Correspondence Learning of Blitzer et al. (2007). Their method uses pivot words shared between domains to find correspondences among words that are not shared.
A previous work that used structured kernels in Sentiment Analysis is the approach of Wu et al. (2009). Their method uses tree kernels on phrase dependency trees and outperforms bag-of-words and word dependency approaches. They also show good results in cross-domain experiments.
We propose to apply GPs with a combination of structured and multi-task kernels to this task. The results shown by Wu et al. (2009) suggest that tree kernels on dependency trees are a good approach, but we also plan to employ string kernels, since they have demonstrated promising results for text categorization in past work. Also, since model selection is easily handled by GPs, we can combine all those kernels in complex and heavily parameterized ways, an unfeasible setting for SVMs. We will use the Multi-Domain Sentiment Dataset (Blitzer et al., 2007), composed of Amazon product reviews in different categories.

Research Directions
In Section 2.3 we saw how the Bayesian formulation of GPs lets us perform model selection by maximizing the marginal likelihood. In fact, one of our main research directions in this proposal revolves around this crucial point: because we can easily fit hyperparameters to the data, we have much more freedom to use richer kernel parameterizations and kernel combinations. Multi-task kernels are one example where we usually have a large number of hyperparameters, because we need to fit all the elements of the coregionalization matrix. This number can get even larger in an LMC model, with multiple coregionalization matrices. Structured kernels can also be redefined in a richer way: tree kernels over constituency trees could have multiple decay hyperparameters, one for each POS tag. A more extreme example would be to treat all weights in a string kernel as hyperparameters. Thus, we plan to investigate these possibilities in the context of the three tasks detailed before.
As another research direction, we also want to address the issue of scalability. Although GPs have already shown promising results, they can be slow compared to other well-established methods like SVMs. Fortunately, there have been many advances in the field of sparse GPs in recent years, and we plan to employ them in our work. A key question is how to combine sparse GPs with the structured kernels presented before. Although it is perfectly possible to select inducing points using greedy methods, it would be much more interesting to use the pseudo-inputs approach. However, it is not clear how to do that in conjunction with non-vectorial inputs, like the ones we plan to use in structured kernels, and this is a key direction that we also plan to investigate.

GP Toolkits
Available toolkits for GP modelling include GPML 4 (Rasmussen and Williams, 2006) and GPstuff 5 (Vanhatalo et al., 2013), which are written in Matlab. Our experiments will mainly use GPy 6 , an open source toolkit written in Python. It implements models for regression and binary classification, including sparse approximations and many vectorial kernels. We plan to contribute to GPy by implementing the structured kernels of Section 3.1, effectively extending it to a GP framework for NLP.
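As a hedged illustration of the intended workflow (the data and kernel choice below are placeholders, not our experimental setup), GPy allows kernels to be combined and their hyperparameters fitted by maximizing the marginal likelihood:

import numpy as np
import GPy

# Toy data standing in for feature vectors extracted from text.
X = np.random.randn(200, 17)
y = np.sin(X[:, :1]) + 0.1 * np.random.randn(200, 1)

# Summing two kernels yields another valid kernel.
kernel = GPy.kern.RBF(input_dim=17) + GPy.kern.Linear(input_dim=17)

model = GPy.models.GPRegression(X, y, kernel)
model.optimize()                       # maximizes the log marginal likelihood
mean, variance = model.predict(np.random.randn(5, 17))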

Conclusions and Future Work
In this work we presented a proposal for advancing the state-of-the-art in a number of NLP tasks by combining Gaussian Processes with structured and multi-task kernels. Our hypothesis is that highly parameterized kernel combinations, allied with the model fitting methods provided by GPs, will result in better models for these tasks. We also detailed our future plans for experiments, including available datasets and toolkits.
Further research directions that can be explored within this proposal include the use of GPs in different learning settings. Models for ordinal regression and structured prediction (Altun et al., 2004; Bratières et al., 2013) have already been proposed in the GP literature, and a natural extension is to apply these models to their corresponding NLP tasks. Another extension is to employ other kinds of kernels. The literature on this subject is quite vast, with many approaches showing promising results.