Generative CCG Parsing with OOV Prediction

This paper presents our system for the CIPS-SIGHAN-2014 bakeoff task of Simplified Chinese Parsing (Task 3). The system combines a generative parsing model with an OOV prediction model. The former has a PCFG form, while the latter is a three-layer hierarchical Bayesian model. We report the final performance on the test corpus together with the performance of the OOV model.


Introduction
Statistical parsing is the process of discovering the syntactic relations in a sentence according to the rules of a formal grammar. There exists a body of parsers based on various linguistic formalisms, such as LFG, HPSG, TAG and CCG (Riezler et al., 2002; Sarkar and Joshi, 2003; Cahill et al., 2004; Miyao and Tsujii, 2005; Clark and Curran, 2007). Parsing techniques also range from generative to discriminative models. The former define a joint probability distribution over both the observations and the targets, while the latter model only the conditional probability of the targets given the observations (Hockenmaier, 2003a, 2003b; Clark and Curran, 2007).
The out-of-vocabulary (OOV) problem is far from solved in statistical parsing, especially in CCG. Because the set of categories is large, a parser is unlikely to have seen a given word with every category it can take. Clark proposed a supertagger that assigns several possible categories to each word and gives highly accurate and efficient results (Clark, 2002).
In this task we propose a three-layer hierarchical Bayesian model to predict the OOV, using the POS tag as the hidden layer. We estimate an OOV's category by summing over all possible POS tags, which means that we need to find the relations between OOVs and POS tags. To achieve this goal, we create a mapping between a CCG tree and a TCT tree, which is another kind of syntactic tree that follows the Tsinghua Chinese Treebank (TCT).
The final report has two parts: the evaluation performance on the test corpus, and the performance of OOV prediction.

Our System
Our system combines a generative model for parsing with an OOV prediction model. The former closely follows (Hockenmaier, 2003a) with slight modifications: a different definition of head nodes and a Dirichlet prior as the smoothing technique. The latter is a three-layer hierarchical Bayesian model: the input and output layers correspond to an OOV and its category, respectively, with a POS tag as the hidden layer.

Generative Model for Parsing
In this evaluation task, we adopt a generative model for CCG parsing. One advantage of the generative model is that it needs less human intervention than the discriminative model: given enough data and a proper generative model, the algorithm can learn from the data with competitive performance. The discriminative model, by contrast, needs many manual feature templates, so its features are designed by humans rather than learned by the computer itself.
Our generative model is based on (Hockenmaier, 2003a), which defines a generative model over CCG derivation trees. This model has a PCFG form and does not incorporate the notion of combinatory trees. Instead, it is a generative model over sub-trees. In contrast to Hockenmaier, we use a different definition of the head node: the head is the functor category (a category that accepts arguments). From a modelling point of view, isolating a head node from a non-head one merely makes the generative process more hierarchical; there is no statistically significant difference between a head node and a non-head node. The derivation of a CCG tree can be represented by top-down expansions. As mentioned in (Hockenmaier, 2003a), there are four kinds of nodes in a CCG tree, which correspond to four kinds of expansion (Table 1). Following this convention, we have the following generating process:
1. Expansion probability: Starting from the root, choose a type of expansion by P (exp|C) with exp ∈ {left, right, unary, leaf} and C ∈ 𝒞, the set of categories.
2. Lexical probability: If the node is a leaf, generate a word w with probability P (w|C, exp = leaf) and stop.
3. Head probability: Otherwise, choose a head node H with probability P (H|C, exp).
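The three steps above can be illustrated for a single node. The following is only a minimal sketch: the probability tables, the toy categories and words, and the function names `sample` and `expand_node` are hypothetical, and the generation of the non-head daughter is not covered because the steps listed here do not describe it.

```python
import random

# Hypothetical toy parameters, for illustration only (not estimated from data).
P_EXP = {"S": {"left": 0.1, "right": 0.9}, "NP": {"leaf": 1.0}}      # P(exp | C)
P_WORD = {"NP": {"我": 0.5, "书": 0.5}}                               # P(w | C, exp = leaf)
P_HEAD = {("S", "left"): {"S/NP": 1.0}, ("S", "right"): {"S\\NP": 1.0}}  # P(H | C, exp)


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def expand_node(cat, p_exp, p_word, p_head, rng):
    """Apply steps 1-3 to one node with category `cat`.

    Returns (exp, word) for a leaf and (exp, head_category) otherwise.
    """
    exp = sample(p_exp[cat], rng)                 # step 1: expansion type
    if exp == "leaf":
        return exp, sample(p_word[cat], rng)      # step 2: emit a word and stop
    return exp, sample(p_head[(cat, exp)], rng)   # step 3: choose the head


if __name__ == "__main__":
    rng = random.Random(0)
    print(expand_node("S", P_EXP, P_WORD, P_HEAD, rng))
    print(expand_node("NP", P_EXP, P_WORD, P_HEAD, rng))
```

A full sampler would call `expand_node` recursively on the daughters; that part is left out here on purpose.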

Inference and Learning
Parameter estimation is similar to that of a PCFG parser and is based on maximum likelihood estimation (MLE), but the estimates may become sparse because of the huge number of parameters. This may lead to overfitting. To avoid this, we can use a regularization term or a prior as the smoothing technique.
In this task, we prefer a Dirichlet prior to other smoothing methods, since it is easy to implement and is conjugate to the multinomial distribution. We put a Dirichlet prior Dir(α) on the lexical distribution P (w|C, exp = leaf). In the experiments we set α = (1, 1, . . . , 1), which makes the prior uniform.
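To make the effect of the prior concrete, here is a minimal sketch of a Dirichlet-smoothed lexical distribution. It assumes that the posterior mean is used as the point estimate, which for α = (1, . . . , 1) coincides with add-one smoothing; the text does not say which estimate the system actually uses. The function name `lexical_probs` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def lexical_probs(pairs, vocab, alpha=1.0):
    """Posterior-mean estimate of P(w | C, exp = leaf) under a symmetric Dir(alpha).

    pairs: iterable of (category, word) leaf observations.
    vocab: list of all word types; it must contain every observed word.
    """
    counts = defaultdict(Counter)
    for cat, word in pairs:
        counts[cat][word] += 1

    probs = {}
    for cat, word_counts in counts.items():
        # (n(C, w) + alpha) / (n(C) + alpha * |V|)
        total = sum(word_counts.values()) + alpha * len(vocab)
        probs[cat] = {w: (word_counts[w] + alpha) / total for w in vocab}
    return probs


if __name__ == "__main__":
    data = [("NP", "a"), ("NP", "a"), ("NP", "b")]
    print(lexical_probs(data, ["a", "b", "c"]))
```

With this estimate, a word never seen with a category still receives non-zero probability, which is the purpose of the smoothing.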
Decoding uses the well-known CKY algorithm. Efficiency is still a problem: since the number of categories is large, combining two adjacent cells in the chart takes more computing steps for a long sentence than in other lexicon-based parsers. Fortunately, Clark and Curran proposed a log-linear CCG parser that is efficient enough for large-scale NLP tasks (Clark and Curran, 2007).
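The cost of combining adjacent cells is visible in a small Viterbi CKY sketch. This is a simplified illustration, not the system's decoder: it assumes a non-empty sentence and a plain binary rule table with one probability per (left, right, parent) triple instead of the expansion and head probabilities of the model above, and the name `cky` and the toy lexicon are hypothetical.

```python
import math


def cky(words, lexicon, rules):
    """Viterbi CKY over binary rules.

    lexicon: {word: {category: prob}}
    rules:   {(left_cat, right_cat): {parent_cat: prob}}
    Returns {category: best log-probability} for the span covering the sentence.
    """
    n = len(words)
    chart = {}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = {c: math.log(p) for c, p in lexicon.get(w, {}).items()}

    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            cell = {}
            for j in range(i + 1, k):
                # Every pair of categories in the two adjacent cells is tried,
                # so the work per cell grows with the number of categories.
                for lc, lp in chart[(i, j)].items():
                    for rc, rp in chart[(j, k)].items():
                        for parent, p in rules.get((lc, rc), {}).items():
                            score = lp + rp + math.log(p)
                            if score > cell.get(parent, -math.inf):
                                cell[parent] = score
            chart[(i, k)] = cell
    return chart[(0, n)]


if __name__ == "__main__":
    lexicon = {"I": {"NP": 1.0}, "sleep": {"S\\NP": 0.5}}
    rules = {("NP", "S\\NP"): {"S": 1.0}}
    print(cky(["I", "sleep"], lexicon, rules))
```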

Estimating the OOV
The supertagger proposed by Clark uses a maximum entropy model to predict a word's categories. The idea is that, given a set of manual features, we look for the category distribution that satisfies the constraints imposed by these features and is otherwise as uniform as possible, and use it as a predictor for unknown words. This maximum entropy principle may not apply to OOV estimation, because OOVs are rare and statistically insignificant, and are therefore hard for a statistical model to capture.
Manual rules can give more accurate predictions than a statistical model, but such rules are inflexible, time-consuming and labor-intensive to write. To overcome this problem, we propose a mapping between a CCG tree and a TCT tree with the same terminal nodes.
To make this mapping possible, we first need to verify the existence, uniqueness and reversibility of such a mapping. Fortunately, such a mapping exists, since the CCG tree is generated from a TCT tree. For simplicity, we drop the conditions of uniqueness and reversibility. The problem now is: can we find such a mapping to help us predict the OOV?
The mapping is the relation between the syntactic symbols (POS) and the semantic symbols (category). If we can find an estimator of P (cat|pos), our problem is easily solved by marginalizing over the POS tag:

$$P(C \mid O) = \sum_{S} P(C, S \mid O) = \sum_{S} P(C \mid S)\, P(S \mid O)$$

In the above equations, O stands for the OOV, a random variable ranging over all possible OOVs, C indicates the category and S stands for the POS tag.
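As a minimal sketch of this prediction step, the category distribution of an OOV can be obtained by summing over its possible POS tags. The sketch assumes that the category depends on the word only through its POS tag, so that P (C|O) is the sum over S of P (C|S) P (S|O); how P (S|O) is obtained is not specified here, and the function name `predict_category` and the toy tables are hypothetical.

```python
def predict_category(p_pos_given_oov, p_cat_given_pos):
    """Return P(C | O) as a dict by marginalizing over the hidden POS tag S.

    p_pos_given_oov: {pos: P(S = pos | O)} for one OOV word.
    p_cat_given_pos: {pos: {category: P(C = category | S = pos)}}.
    """
    scores = {}
    for pos, p_s in p_pos_given_oov.items():
        for cat, p_c in p_cat_given_pos.get(pos, {}).items():
            scores[cat] = scores.get(cat, 0.0) + p_c * p_s
    return scores


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    p_pos = {"n": 0.75, "v": 0.25}
    p_cat = {"n": {"NP": 1.0}, "v": {"S\\NP": 0.5, "(S\\NP)/NP": 0.5}}
    dist = predict_category(p_pos, p_cat)
    print(max(dist, key=dist.get), dist)
```

The most probable category is then the argmax of the returned distribution.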
How do we create such a mapping matrix? We start from the root node and use a depth-first search to find the correspondence between nodes in the two trees. Note that the CCG tree is binary, while the TCT tree is not, so to find the correct map we first need to binarize the TCT tree. The set of all possible binarizations may become huge when a node has many children. Fortunately, we only need to expand the nodes in a single direction.
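Binarizing in a single direction can be sketched as follows. The direction is not stated here, so the sketch assumes right-branching; the tree encoding as `(label, children)` tuples with string leaves, the `*` suffix for intermediate nodes, and the function names are hypothetical.

```python
def binarize(tree):
    """Right-branching binarization of a (label, children) tree; leaves are strings."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize(child) for child in children]
    while len(children) > 2:
        # Fold the two rightmost children into one intermediate node.
        right = children.pop()
        left = children.pop()
        children.append((label + "*", [left, right]))
    return (label, children)


def leaves(tree):
    """Return the terminal nodes of a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [w for child in children for w in leaves(child)]


if __name__ == "__main__":
    flat = ("np", ["a", "b", "c", "d"])
    print(binarize(flat))
```

Fixing the direction yields exactly one binary tree per input tree, and the terminal nodes and their order are unchanged, so the binarized tree can still be aligned with the CCG tree over the same words.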
This model resembles the maximum entropy model in that both use context features. The difference is that the former conditions on more restricted contexts based on the tree structure, while the features in the latter are at the sentence level.

Datasets
The data used in the system consists of two parts: one for the parser and one for the OOV prediction model. The former comes from the sponsor (CCG bank), with 17558 parsed sentences and 984 categories, while the latter uses data from both the CCG bank and the TCT bank with 9034 sentences. To find mapped trees with the same leaf nodes, we extract such tree pairs from the two data sets. The resulting data set for the OOV prediction model has 5360 tree pairs.
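One simple way to extract tree pairs with the same leaf nodes is to index one treebank by its terminal sequence and look up each tree of the other. This is only a sketch under assumptions: trees are encoded as `(label, children)` tuples with string leaves, a repeated sentence keeps its first TCT tree, and the names `leaves` and `pair_trees` are hypothetical.

```python
def leaves(tree):
    """Return the terminal nodes of a (label, children) tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [w for child in children for w in leaves(child)]


def pair_trees(ccg_trees, tct_trees):
    """Pair each CCG tree with a TCT tree that has exactly the same leaf sequence."""
    by_yield = {}
    for tct in tct_trees:
        by_yield.setdefault(tuple(leaves(tct)), tct)

    pairs = []
    for ccg in ccg_trees:
        tct = by_yield.get(tuple(leaves(ccg)))
        if tct is not None:
            pairs.append((ccg, tct))
    return pairs


if __name__ == "__main__":
    ccg = [("S", [("NP", ["我"]), ("S\\NP", ["睡觉"])])]
    tct = [("dj", [("rN", ["我"]), ("v", ["睡觉"])]), ("np", ["书"])]
    print(pair_trees(ccg, tct))
```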

Experimental Results
There are two kinds of evaluation metrics: syntactic category metrics and parsing tree metrics. We report both, together with the performance of the OOV prediction model. Tables 2 and 3 give the performance of the parser on the test set under the syntactic category metrics and the parsing tree metrics, respectively.
The notations in Table 3 are explained as follows (Qiang Zhou, 2014):
• LDP CE stands for the lexical dependency pairs (LDPs) with complex event relations at the sentence level.
• LDP CC stands for the LDPs with concept compound relations at the chunk level.
• LDP PA stands for the LDPs with predicate-argument relations at the clause level, including head-complement and adjunct-head relations.
• LDP MO stands for the LDPs with other non-PA relations at the chunk and clause levels.
Table 4 shows the performance of the OOV estimation model. OOV-POS is the baseline model, in which a node's category is conditioned only on the corresponding POS tag. +head means that the category is conditioned not only on its POS tag but also on its parent node's POS tag. +sister has a similar meaning.

Table 4. Columns: Model, Precision.

Conclusion
This report has presented a generative CCG parser with an OOV prediction model. One contribution of this report is the development of a Bayesian model to predict the OOV with high accuracy. The techniques we use are easy to extend to more complicated systems.