A Type-Driven Tensor-Based Semantics for CCG

This paper shows how the tensor-based semantic framework of Coecke et al. can be seamlessly integrated with Combinatory Categorial Grammar (CCG). The integration follows from the observation that tensors are linear maps, and hence can be manipulated using the combinators of CCG, including type-raising and composition. Given the existence of robust, wide-coverage CCG parsers, this opens up the possibility of a practical, type-driven compositional semantics based on distributional representations.


Introduction
In this paper we show how tensor-based distributional semantics can be seamlessly integrated with Combinatory Categorial Grammar (CCG; Steedman, 2000), building on the theoretical discussion in Grefenstette (2013). Tensor-based distributional semantics represents the meanings of words with particular syntactic types as tensors whose semantic type matches that of the syntactic type (Coecke et al., 2010). For example, the meaning of a transitive verb with syntactic type (S\NP)/NP is a 3rd-order tensor from the tensor product space N ⊗ S ⊗ N. The seamless integration with CCG arises from the (somewhat trivial) observation that tensors are linear maps, a particular kind of function, and hence can be manipulated using CCG's combinatory rules.
Tensor-based semantics arises from the desire to enhance distributional semantics with some compositional structure, in order to make distributional semantics more of a complete semantic theory, and to increase its utility in NLP applications. There are a number of suggestions for how to add compositionality to a distributional semantics (Clarke, 2012; Pulman, 2013; Erk, 2012).
One approach is to assume that the meanings of all words are represented by context vectors, and then combine those vectors using some operation, such as vector addition, element-wise multiplication, or tensor product (Clark and Pulman, 2007; Mitchell and Lapata, 2008). A more sophisticated approach, which is the subject of this paper, is to adapt the compositional process from formal semantics (Dowty et al., 1981) and attempt to build a distributional representation in step with the syntactic derivation (Coecke et al., 2010). Finally, there is a third approach using neural networks, which perhaps lies in between the two described above (Socher et al., 2010; Socher et al., 2012). Here compositional distributed representations are built using matrices operating on vectors, with all parameters learnt through a supervised learning procedure intended to optimise performance on some NLP task, such as syntactic parsing or sentiment analysis. The approach of Hermann and Blunsom (2013) conditions the vector combination operation on the syntactic type of the combinands, moving it a little closer to the more formal semantics-inspired approaches.
The remainder of the Introduction gives a short summary of distributional semantics. The rest of the paper introduces some mathematical notation from multi-linear algebra, including Einstein notation, and then shows how the combinatory rules of CCG, including type-raising and composition, can be applied directly to tensor-based semantic representations. As well as describing a tensor-based semantics for CCG, a further goal of this paper is to present the compositional framework of Coecke et al. (2010), which is based on category theory, to a computational linguistics audience using only the mathematics of multi-linear algebra.

Distributional Semantics
We assume a basic knowledge of distributional semantics (Grefenstette, 1994; Schütze, 1998). A potentially useful distinction for this paper, and one not commonly made, is between distributional and distributed representations. Distributional representations are inherently contextual, and rely on the frequently quoted dictum from Firth that "you shall know a word from the company it keeps" (Firth, 1957; Pulman, 2013). This leads to the so-called distributional hypothesis that words that occur in similar contexts tend to have similar meanings, and to various proposals for how to implement this hypothesis (Curran, 2004), including alternative definitions of context; alternative weighting schemes which emphasize the importance of some contexts over others; alternative similarity measures; and various dimensionality reduction schemes such as the well-known LSA technique (Landauer and Dumais, 1997). An interesting conceptual question is whether a similar distributional hypothesis can be applied to phrases and larger units: is it the case that sentences, for example, have similar meanings if they occur in similar contexts? Work which does extend the distributional hypothesis to larger units includes Baroni and Zamparelli (2010) and Clarke (2012).
Distributed representations, on the other hand, can be thought of simply as vectors (or possibly higher-order tensors) of real numbers, where there is no a priori interpretation of the basis vectors. Neural networks can perhaps be categorised in this way, since the resulting vector representations are simply sequences of real numbers resulting from the optimisation of some training criterion on a training set (Collobert and Weston, 2008; Socher et al., 2010). Whether these distributed representations can be given a contextual interpretation depends on how they are trained.
One important point for this paper is that the tensor-based compositional process makes no assumptions about the interpretation of the tensors. Hence in the remainder of the paper we make no reference to how noun vectors or verb tensors, for example, can be acquired (which, for the case of the higher-order tensors, is a wide open research question). However, to help the reader who would prefer a more grounded discussion, one possibility is to obtain the noun vectors using standard distributional techniques (Curran, 2004), and learn the higher-order tensors using recent techniques from "recursive" neural networks (Socher et al., 2010). Another possibility is to extend the learning technique based on linear regression from Baroni and Zamparelli (2010), in which "gold-standard" distributional representations are assumed to be available for some phrases and larger units.

Mathematical Preliminaries
The tensor-based compositional process relies on taking dot (or inner) products between vectors and higher-order tensors. Dot products, and a number of other operations on vectors and tensors, can be conveniently written using Einstein notation (also referred to as the Einstein summation convention). In the rest of the paper we assume that the vector spaces are over the field of real numbers.

Einstein Notation
The squared amplitude of a vector $v \in \mathbb{R}^n$ is given by:
$$|v|^2 = \sum_{i=1}^{n} v_i v_i$$
Similarly, the dot product of two vectors $v, w \in \mathbb{R}^n$ is given by:
$$v \cdot w = \sum_{i=1}^{n} v_i w_i$$
Denote the components of an m × n real matrix A by $A_{ij}$ for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Then the matrix-vector product of A and $v \in \mathbb{R}^n$ gives a vector $Av \in \mathbb{R}^m$ with components:
$$(Av)_i = \sum_{j=1}^{n} A_{ij} v_j$$
We can also multiply an n × m matrix A and an m × o matrix B to produce an n × o matrix AB with components:
$$(AB)_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}$$
The previous examples are some of the most common operations in linear algebra, and they all involve sums over repeated indices. They can be simplified by introducing the Einstein summation convention: summation over the relevant range is implied on every component index that occurs twice. Pairs of indices that are summed over are known as contracted, while the remaining indices are known as free. Using this convention, the above operations can be written as:
$$|v|^2 = v_i v_i$$
$$v \cdot w = v_i w_i$$
$$(Av)_i = A_{ij} v_j$$
$$(AB)_{ij} = A_{ik} B_{kj}$$
Note how the number of free indices is always conserved between the left- and right-hand sides in these examples. For instance, while the last equation has two indices on the left and four on the right, the two extra indices on the right are contracted. Hence counting the number of free indices can be a quick way of determining what type of object is given by a certain mathematical expression in Einstein notation: no free indices means that an operation yields a scalar number, one free index means a vector, two a matrix, and so on.
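The four operations above map directly onto index strings in NumPy's `einsum`, which can serve as a quick check of the free-index counting rule. The following is a minimal sketch; the array shapes and the random values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(3)          # vector in R^3
w = rng.standard_normal(3)          # vector in R^3
A = rng.standard_normal((2, 3))     # 2 x 3 matrix
B = rng.standard_normal((3, 4))     # 3 x 4 matrix

# Repeated indices are contracted; indices after "->" are the free ones.
sq_norm = np.einsum("i,i->", v, v)       # v_i v_i: no free index, a scalar
dot = np.einsum("i,i->", v, w)           # v_i w_i: no free index, a scalar
Av = np.einsum("ij,j->i", A, v)          # A_ij v_j: one free index, a vector
AB = np.einsum("ik,kj->ij", A, B)        # A_ik B_kj: two free indices, a matrix

print(sq_norm.ndim, dot.ndim, Av.ndim, AB.ndim)
```

The number of dimensions of each result equals the number of free indices in the corresponding expression.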

Tensors
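Linear Functionals. Given a finite-dimensional vector space $\mathbb{R}^n$ over $\mathbb{R}$, a linear functional is a linear map $a : \mathbb{R}^n \to \mathbb{R}$.
Let a vector v have components $v_i$ in a fixed basis. Then the result of applying a linear functional a to v can be written as:
$$a(v) = a_1 v_1 + \cdots + a_n v_n = \begin{pmatrix} a_1 & \cdots & a_n \end{pmatrix} \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}$$
The numbers $a_i$ are the components of the linear functional, which can also be pictured as a row vector. Since there is a one-to-one correspondence between row and column vectors, the above equation is equivalent to:
$$v(a) = v_1 a_1 + \cdots + v_n a_n = \begin{pmatrix} v_1 & \cdots & v_n \end{pmatrix} \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix}$$
Using Einstein convention, the equations above can be written as:
$$a_i v_i = v_i a_i$$
Thus every finite-dimensional vector is a linear functional, and vice versa. Row and column vectors are examples of first-order tensors.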
Definition 1 (First-order tensor). Given a vector space V over the field R, a first-order tensor T can be defined as:
• an element of the vector space V,
• a linear map T : V → R,
• a |V|-dimensional array of numbers $T_i$, for 1 ≤ i ≤ |V|.
These three definitions are all equivalent. Given a first-order tensor described using one of these definitions, it is trivial to find the two other descriptions.
Matrices. An n × m matrix A over R can be represented by a two-dimensional array of real numbers $A_{ij}$, for 1 ≤ i ≤ n and 1 ≤ j ≤ m.
Via matrix-vector multiplication, the matrix A can be seen as a linear map $A : \mathbb{R}^m \to \mathbb{R}^n$, sending a vector $v \in \mathbb{R}^m$ to the vector with components
$$(Av)_i = A_{ij} v_j$$
We can also contract a vector with the first index of the matrix, which gives us a map $A : \mathbb{R}^n \to \mathbb{R}^m$. This corresponds to the operation $v^{\mathsf{T}} A$ for $v \in \mathbb{R}^n$, resulting in a vector with components
$$(v^{\mathsf{T}} A)_j = v_i A_{ij}$$
We can combine the two operations and see a matrix as a map $A : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$, defined by:
$$A(v, w) = v^{\mathsf{T}} A w = \sum_{i=1}^{n} \sum_{j=1}^{m} v_i A_{ij} w_j$$
In Einstein notation, this operation can be written as
$$v_i A_{ij} w_j$$
which yields a scalar (constant) value, consistent with the fact that all the indices are contracted. Finally, matrices can also be characterised in terms of Kronecker products. Given two vectors $v \in \mathbb{R}^n$ and $w \in \mathbb{R}^m$, their Kronecker product $v \otimes w$ is the n × m matrix with components
$$(v \otimes w)_{ij} = v_i w_j$$
It is a general result in linear algebra that any n × m matrix can be written as a finite sum of Kronecker products $\sum_k x^{(k)} \otimes y^{(k)}$ of a set of vectors $x^{(k)}$ and $y^{(k)}$. Note that the sum over k is written explicitly as it would not be implied by Einstein notation: this is because the index k does not range over vector/matrix/tensor components, but over a set of vectors, and hence that index appears in brackets.
An n × m matrix is an element of the tensor space $\mathbb{R}^n \otimes \mathbb{R}^m$, and it can also be seen as a linear map $A : \mathbb{R}^n \otimes \mathbb{R}^m \to \mathbb{R}$. This is because, given a matrix B with decomposition $\sum_k x^{(k)} \otimes y^{(k)}$, the matrix A can act as follows:
$$A(B) = \sum_k A(x^{(k)}, y^{(k)}) = \sum_k x^{(k)}_i A_{ij} y^{(k)}_j = A_{ij} B_{ij}$$
Again, counting the number of free indices in the final expression tells us that this operation yields a scalar.
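To make the last point concrete, the sketch below decomposes a matrix B into a sum of Kronecker (outer) products and checks that applying A term by term as a bilinear map agrees with the full contraction $A_{ij} B_{ij}$. The particular decomposition used (columns of B paired with standard basis vectors) is just one convenient choice assumed for illustration; the names `bilinear`, `xs` and `ys` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2
A = rng.standard_normal((n, m))
B = rng.standard_normal((n, m))

# One possible decomposition B = sum_k x^(k) (x) y^(k):
# x^(k) is the k-th column of B and y^(k) is the k-th standard basis vector of R^m.
xs = [B[:, k] for k in range(m)]
ys = [np.eye(m)[k] for k in range(m)]
B_rebuilt = sum(np.einsum("i,j->ij", x, y) for x, y in zip(xs, ys))

def bilinear(mat, x, y):
    """Apply a matrix as a bilinear map: x_i mat_ij y_j (a scalar)."""
    return np.einsum("i,ij,j->", x, mat, y)

# A acting on B via the decomposition, and via the direct contraction A_ij B_ij.
via_decomposition = sum(bilinear(A, x, y) for x, y in zip(xs, ys))
direct = np.einsum("ij,ij->", A, B)

print(np.allclose(B_rebuilt, B), np.isclose(via_decomposition, direct))
```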
Matrices are examples of second-order tensors.
Definition 2 (Second-order tensor). Given vector spaces V, W over the field R, a second-order tensor T can be defined as:
• an element of the vector space V ⊗ W,
• a |V| × |W|-dimensional array of numbers $T_{ij}$, for 1 ≤ i ≤ |V| and 1 ≤ j ≤ |W|,
• a (multi-)linear map T : V → W, T : W → V, T : V × W → R, or T : V ⊗ W → R.
Again, these definitions are all equivalent. Most importantly, the four types of maps given in the definition are isomorphic. Therefore specifying one map is enough to specify all the others.
Tensors. We can generalise these definitions to the more general concept of tensor.
Definition 3 (Tensor). Given vector spaces $V_1, \ldots, V_k$ over the field R, a kth-order tensor T is defined as:
• an element of the vector space $V_1 \otimes \cdots \otimes V_k$,
• a multi-linear map $T : V_1 \times \cdots \times V_k \to \mathbb{R}$.

Tensor-Based CCG Semantics
In this section we show how CCG's syntactic types can be given tensor-based meaning spaces, and how the combinators employed by CCG to combine syntactic categories carry over to those meaning spaces, maintaining what is often described as CCG's "transparent interface" between syntax and semantics. Here are some example syntactic types, and the corresponding tensor spaces containing the meanings of the words with those types (using the notation syntactic type : semantic type).
We first assume that all atomic types have meanings living in distinct vector spaces:
• noun phrases, NP : N
• sentences, S : S
The recipe for determining the meaning space of a complex syntactic type is to replace each atomic type with its corresponding vector space and the slashes with tensor product operators:
• intransitive verb, S\NP : S ⊗ N
• transitive verb, (S\NP)/NP : S ⊗ N ⊗ N
Hence the meaning of an intransitive verb, for example, is a matrix in the tensor product space S ⊗ N. The meaning of a transitive verb is a "cuboid", or 3rd-order tensor, in the tensor product space S ⊗ N ⊗ N. In the same way that the syntactic type of an intransitive verb can be thought of as a function, taking an NP and returning an S, the meaning of an intransitive verb is also a function (linear map), taking a vector in N and returning a vector in S. Another way to think of this function is that each element of the matrix specifies, for a pair of basis vectors (one from N and one from S), what the result is on the S basis vector given a value on the N basis vector. Now we describe how the combinatory rules carry over to the meaning spaces.
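The recipe for meaning spaces can be sketched as a small function that reads off the atomic types of a category from left to right. This is an illustration only: the function name `meaning_space` and the lookup table are hypothetical, and mapping the noun category N to the same space as NP is an assumption made here for simplicity.

```python
import re

# Assumed mapping from atomic syntactic types to vector-space names.
SPACE_OF_ATOM = {"NP": "N", "S": "S", "N": "N"}

def meaning_space(category: str) -> list[str]:
    """Return the factors of the tensor product space for a CCG category.

    Each atomic type is replaced by its vector space; slashes and brackets
    only determine the order of the factors, which is kept left to right.
    """
    atoms = re.findall(r"NP|S|N", category)   # "NP" is tried before "N"
    return [SPACE_OF_ATOM[atom] for atom in atoms]

# Raw strings are needed because "\N" is an escape sequence in Python.
print(" ⊗ ".join(meaning_space(r"S\NP")))
print(" ⊗ ".join(meaning_space(r"(S\NP)/NP")))
```

The order of the tensor is simply the length of the returned list, so a category with eight atomic types yields an 8th-order tensor.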

Application
The function application rules of CCG are forward (>) and backward (<) application:
X/Y  Y ⇒ X  (>)
Y  X\Y ⇒ X  (<)
In a traditional semantics for CCG, if function application is applied in the syntax, then function application applies also in the semantics (Steedman, 2000). This is also true of the tensor-based semantics. For example, the meaning of a subject NP combines with the meaning of an intransitive verb via matrix multiplication, which is equivalent to applying the linear map corresponding to the matrix to the vector representing the meaning of the NP. Applying (multi-)linear maps in (multi-)linear algebra is equivalent to applying tensor contraction to the combining tensors. Here is the case for an intransitive verb:
Pat : NP : N
walks : S\NP : S ⊗ N
Let Pat be assigned a vector P ∈ N and walks be assigned a second-order tensor W ∈ S ⊗ N. Using the backward application combinator corresponds to feeding P, an element of N, into W, seen as a function N → S. In terms of tensor contraction, this is the following operation:
$$W_{ij} P_j$$
Here we use the convention that the indices maintain the same order as the syntactic type. Therefore, in the tensor of an object of type X/Y, the first index corresponds to the type X and the second to the type Y. That is why, when performing the contraction corresponding to Pat walks, P ∈ N is contracted with the second index of W ∈ S ⊗ N, and not the first. The first index of W is then the only free index, telling us that the above operation yields a first-order tensor (vector). Since this index corresponds to S, we know that applying backward application to Pat walks yields a meaning vector in S.
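As a small numerical illustration of backward application, the contraction for Pat walks can be written with `einsum`, contracting P with the second index of W. The dimensions of N and S and the random entries are placeholder assumptions, since the framework does not fix how the vector and matrix are obtained.

```python
import numpy as np

rng = np.random.default_rng(2)
dim_S, dim_N = 3, 4                       # illustrative sizes for S and N

P = rng.standard_normal(dim_N)            # Pat:   NP,   a vector in N
W = rng.standard_normal((dim_S, dim_N))   # walks: S\NP, a tensor in S (x) N

# Backward application: W_ij P_j, contracting the N index of W with P.
pat_walks = np.einsum("ij,j->i", W, P)

print(pat_walks.shape)   # one free index, ranging over S
```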
Forward application is performed in the same manner. Consider the sentence Pat kisses Sandy, in which Pat and Sandy have type NP and kisses has type (S\NP)/NP, with corresponding tensors P ∈ N for Pat, K ∈ S ⊗ N ⊗ N for kisses and Y ∈ N for Sandy. The forward application deriving the type of kisses Sandy corresponds to
$$K_{ijk} Y_k$$
where Y is contracted with the third index of K because we have maintained the order defined by the type (S\NP)/NP: the third index then corresponds to an argument NP coming from the right.
Counting the number of free indices in the above expression tells us that it yields a second-order tensor. Looking at the types corresponding to the free indices tells us that this second-order tensor is of type S ⊗ N, which is the semantic type of a verb phrase (or intransitive verb), as we have already seen in the walks example.
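The two application steps for a transitive sentence can be sketched as follows: forward application contracts the object with the third index of the verb tensor, and backward application then contracts the subject with the remaining N index. Shapes and values are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
dim_S, dim_N = 3, 4

P = rng.standard_normal(dim_N)                   # Pat:    NP
K = rng.standard_normal((dim_S, dim_N, dim_N))   # kisses: (S\NP)/NP, in S (x) N (x) N
Y = rng.standard_normal(dim_N)                   # Sandy:  NP

# Forward application: K_ijk Y_k gives a verb phrase in S (x) N.
kisses_sandy = np.einsum("ijk,k->ij", K, Y)

# Backward application: (K_ijk Y_k) P_j gives a sentence vector in S.
sentence = np.einsum("ij,j->i", kisses_sandy, P)

print(kisses_sandy.shape, sentence.shape)
```

The intermediate result has two free indices (types S and N), matching the semantic type of an intransitive verb.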

Composition
The forward (>B) and backward (<B) composition rules are:
X/Y  Y/Z ⇒ X/Z  (>B)
Y\Z  X\Y ⇒ X\Z  (<B)
Composition in the semantics also reduces to a form of tensor contraction. Consider the following example, in which might can combine with kiss using forward composition:
might : (S\NP)/(S\NP)
kiss : (S\NP)/NP
might kiss : (S\NP)/NP  (>B)
with tensors M ∈ S ⊗ N ⊗ S ⊗ N for might and K ∈ S ⊗ N ⊗ N for kiss. Combining the meanings of might and kiss corresponds to the following operation:
$$M_{ijkl} K_{klm}$$
yielding a tensor in S ⊗ N ⊗ N, which is the correct semantic type for a phrase with syntactic type (S\NP)/NP. Backward composition is performed analogously.
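A brief sketch of forward composition as a contraction follows, together with a sanity check that composing first and then applying to an object gives the same result as applying kiss to the object first and then applying might. The object vector `Y` and all sizes are assumptions introduced for the check.

```python
import numpy as np

rng = np.random.default_rng(4)
dim_S, dim_N = 3, 4

M = rng.standard_normal((dim_S, dim_N, dim_S, dim_N))   # might: (S\NP)/(S\NP)
K = rng.standard_normal((dim_S, dim_N, dim_N))          # kiss:  (S\NP)/NP
Y = rng.standard_normal(dim_N)                          # a hypothetical object NP

# Forward composition: M_ijkl K_klm, a tensor in S (x) N (x) N.
might_kiss = np.einsum("ijkl,klm->ijm", M, K)

# Route 1: compose, then apply the composed verb to the object.
composed_then_applied = np.einsum("ijm,m->ij", might_kiss, Y)

# Route 2: apply kiss to the object, then apply might to the verb phrase.
kiss_obj = np.einsum("klm,m->kl", K, Y)
applied_twice = np.einsum("ijkl,kl->ij", M, kiss_obj)

print(might_kiss.shape, np.allclose(composed_then_applied, applied_twice))
```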

Backward-Crossed Composition
English also requires the use of backward-crossed composition (Steedman, 2000):
Y/Z  X\Y ⇒ X/Z  (<B×)
In tensor terms, this is the same as forward composition; we just need to make sure that the contraction matches up the correct parts of each tensor. Consider the following backward-crossed composition:
(S\NP)/NP  (S\NP)\(S\NP) ⇒ (S\NP)/NP  (<B×)
Let the two items on the left-hand side be represented by tensors A ∈ S ⊗ N ⊗ N and B ∈ S ⊗ N ⊗ S ⊗ N. Then, combining them with backward-crossed composition in tensor terms is
$$B_{ijkl} A_{klm}$$
resulting in a tensor in S ⊗ N ⊗ N (corresponding to the indices i, j and m). Note that we have reversed the order of tensors in the contraction to make the matching of the indices more transparent; however, tensor contraction is commutative (since it corresponds to a sum over products) so the order of the tensors does not affect the result.
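The claim that the order in which the tensors are written does not matter can be checked numerically: what fixes the result is which indices are contracted, not the order of the operands. The sketch below assumes illustrative dimensions and random tensors.

```python
import numpy as np

rng = np.random.default_rng(5)
dim_S, dim_N = 3, 4

A = rng.standard_normal((dim_S, dim_N, dim_N))          # (S\NP)/NP
B = rng.standard_normal((dim_S, dim_N, dim_S, dim_N))   # (S\NP)\(S\NP)

# Backward-crossed composition written as B_ijkl A_klm ...
b_then_a = np.einsum("ijkl,klm->ijm", B, A)

# ... and with the operands in surface order, A_klm B_ijkl.
a_then_b = np.einsum("klm,ijkl->ijm", A, B)

print(b_then_a.shape, np.allclose(b_then_a, a_then_b))
```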

Type-raising
The forward (>T) and backward (<T) type-raising rules are:
X ⇒ T/(T\X)  (>T)
X ⇒ T\(T/X)  (<T)
where T is a variable ranging over categories. Suppose we are given an item of atomic type Y, with corresponding vector A ∈ Y. If we apply forward type-raising to it, we get a new tensor A′ ∈ T ⊗ T ⊗ Y. Now suppose the item of type Y is followed by another item of type X\Y, with tensor B ∈ X ⊗ Y. A phrase consisting of two words with types Y and X\Y can be parsed in two different ways:
• Y  X\Y ⇒ X, by backward application;
• Y  X\Y ⇒ X/(X\Y)  X\Y, by forward type-raising (with T = X), and X/(X\Y)  X\Y ⇒ X, by forward application.
Both ways of parsing this sentence yield an item of type X, and crucially the meaning of the resulting item should be the same in both cases. This property of type-raising provides an avenue into determining what the tensor representation for the type-raised category should be, since the tensor representations must also be the same:
$$A_j B_{ij} = A'_{ijk} B_{jk}$$
Moreover, this equation must hold for all items, B. As a concrete example, the requirement says that a subject NP combining with a verb phrase S\NP must produce the same meaning for the two alternative derivations, irrespective of the verb phrase. This is equivalent to the requirement that
$$A_j B_{ij} = A'_{ijk} B_{jk} \quad \text{for all } B \in S \otimes N$$
So to arrive at the tensor representation, we simply have to solve the tensor equation above. We start by renaming the dummy index j on the left-hand side:
$$A_k B_{ik} = A'_{ijk} B_{jk}$$
We then insert a Kronecker delta ($\delta_{ij} = 1$ if i = j and 0 otherwise):
$$A_k B_{jk} \delta_{ij} = A'_{ijk} B_{jk}$$
Since the equation holds for all B, we are left with
$$A'_{ijk} = \delta_{ij} A_k$$
which gives us a recipe for performing type-raising in a tensor-based model. The recipe is particularly simple and elegant: it corresponds to inserting the vector being type-raised into the 3rd-order tensor at all places where the first two indices are equal (with the rest of the elements in the 3rd-order tensor being zero). For example, to type-raise a subject NP, its meaning vector in N is placed in the 3rd-order tensor S ⊗ S ⊗ N at all places where the indices of the two S dimensions are the same. Visually, the 3rd-order tensor corresponding to the meaning of the type-raised category is a cuboid in which the noun vector is repeated a number of times (once for each sentence index), resulting in a series of "steps" progressing diagonally from the bottom of the cuboid to the top (assuming a particular orientation).
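The type-raising recipe, placing the vector wherever the first two indices agree, amounts to an outer product of an identity matrix with the vector, i.e. $A'_{ijk} = \delta_{ij} A_k$. The sketch below builds the raised tensor for a subject NP and checks that forward application of the raised subject agrees with plain backward application; dimensions and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
dim_S, dim_N = 3, 4

A = rng.standard_normal(dim_N)            # subject NP, a vector in N
B = rng.standard_normal((dim_S, dim_N))   # verb phrase S\NP, in S (x) N

# Type-raising: A'_ijk = delta_ij A_k, a tensor in S (x) S (x) N.
A_raised = np.einsum("ij,k->ijk", np.eye(dim_S), A)

# Derivation 1: backward application, A_j B_ij.
via_backward = np.einsum("ij,j->i", B, A)

# Derivation 2: forward application of the raised subject, A'_ijk B_jk.
via_raised = np.einsum("ijk,jk->i", A_raised, B)

print(A_raised.shape, np.allclose(via_backward, via_raised))
```

Off-diagonal slices such as `A_raised[0, 1]` are all zero, while each diagonal slice `A_raised[i, i]` is a copy of the noun vector, which is the "steps" picture described above.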
The discussion so far has been somewhat abstract, so to finish this section we include some more examples with CCG categories, and show that the tensor contraction operation has an intuitive similarity with the "cancellation law" of categorial grammar which applies in the syntax.
First consider the example of a subject NP with meaning A, combining with a verb phrase S\NP with meaning B, resulting in a sentence with meaning C. In the syntax, the two NPs cancel. In the semantics, for each basis of the sentence space S we perform an inner product between two vectors in N:
$$C_i = A_j B_{ij}$$
Hence, inner products in the tensor space correspond to cancellation in the syntax.
This correspondence extends to complex arguments, and also to composition. Consider the subject type-raising case, in which a subject NP with type-raised category S/(S\NP) and meaning A in S ⊗ S ⊗ N combines with a verb phrase S\NP with meaning B, resulting in a sentence with meaning C. Again we perform inner product operations, but this time the inner product is between two matrices:
$$C_i = A_{ijk} B_{jk}$$
Note that two matrices are "cancelled" for each basis vector of the sentence space (i.e. for each index i in $C_i$).
As a final example, consider the forward composition from earlier, in which a modal verb with meaning A in S ⊗ N ⊗ S ⊗ N combines with a transitive verb with meaning B in S ⊗ N ⊗ N to give a transitive verb with meaning C in S ⊗ N ⊗ N. Again the cancellation in the syntax corresponds to inner products between matrices, but this time we need an inner product for each combination of 3 indices:
$$C_{ijk} = A_{ijlm} B_{lmk}$$
To be more precise, the two matrices can be thought of as vectors in the tensor space S ⊗ N and the inner product is between these vectors. Another way to think of this operation is to "linearize" the two matrices into vectors and then perform the inner product on these vectors. For each i, j, k, two matrices (corresponding to the l, m indices above) are "cancelled". This intuitive explanation extends to arguments with any number of slashes. For example, a composition where the cancelling categories are (N/N)/(N/N) would require inner products between 4th-order tensors in N ⊗ N ⊗ N ⊗ N.
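The "linearize, then take inner products" reading of composition can be sketched by flattening the pair of contracted indices into a single axis, so that the contraction $A_{ijlm} B_{lmk}$ becomes an ordinary matrix product. The reshaping scheme and dimensions below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
dim_S, dim_N = 3, 4

A = rng.standard_normal((dim_S, dim_N, dim_S, dim_N))   # modal verb
B = rng.standard_normal((dim_S, dim_N, dim_N))          # transitive verb

# Direct contraction: C_ijk = A_ijlm B_lmk.
C = np.einsum("ijlm,lmk->ijk", A, B)

# Linearized version: merge (i, j) and (l, m) into single axes, multiply, unmerge.
A_flat = A.reshape(dim_S * dim_N, dim_S * dim_N)   # rows (i, j), columns (l, m)
B_flat = B.reshape(dim_S * dim_N, dim_N)           # rows (l, m), columns k
C_linearized = (A_flat @ B_flat).reshape(dim_S, dim_N, dim_N)

# A single entry is the inner product of two flattened S x N matrices.
entry = np.dot(A[0, 1].ravel(), B[:, :, 2].ravel())

print(np.allclose(C, C_linearized), np.isclose(C[0, 1, 2], entry))
```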

Related Work
The tensor-based semantics presented in this paper is effectively an extension of the Coecke et al. (2010) framework to CCG, re-expressing in Einstein notation the existing categorical CCG extension in Grefenstette (2013), which itself builds on an earlier Lambek Grammar extension to the framework by Coecke et al. (2013).
This work also bears some similarity to the treatment of categorial grammars presented by Baroni et al., which it effectively encompasses by expressing the tensor contractions described by Baroni et al. as Einstein summations. However, this paper also covers CCG-specific operations not discussed by Baroni et al., such as type-raising and composition.
One difference between this paper and the original work by Coecke et al. (2010) is that they use pregroups as the syntactic formalism (Lambek, 2008), a context-free variant of categorial grammar. In pregroups, cancellation in the syntax is always between two atomic categories (or more precisely, between an atomic category and its "adjoint"), whereas in CCG the arguments in complex categories can be complex categories themselves. To what extent this difference is significant remains to be seen. For example, one area where this may have an impact is when non-linearities are added after contractions. Since the CCG contractions with complex arguments happen "in one go", whereas the corresponding pregroup cancellation in the semantics would be a series of contractions, many more non-linearities would be added in the pregroup case.
Krishnamurthy and Mitchell (2013) is based on a similar insight to this paper, namely that CCG provides combinators which can manipulate functions operating over vectors. Krishnamurthy and Mitchell consider the function application case, whereas we have shown how the type-raising and composition operators apply naturally in this setting also.

Conclusion
This paper provides a theoretical framework for the development of a compositional distributional semantics for CCG. Given the existence of robust, wide-coverage CCG parsers (Clark and Curran, 2007; Hockenmaier and Steedman, 2002), together with various techniques for learning the tensors, the opportunity exists for a practical implementation. However, there are significant engineering difficulties which need to be overcome.
Consider adapting the neural-network learning techniques of Socher et al. (2012) to this problem. In terms of the number of tensors, the lexicon would need to contain a tensor for every word-category pair; this is at least an order of magnitude more tensors than the number of matrices learnt in existing work (Socher et al., 2012; Hermann and Blunsom, 2013). Furthermore, the order of the tensors is now higher. Syntactic categories such as ((N/N)/(N/N))/((N/N)/(N/N)) are not uncommon in the wide-coverage grammar of Hockenmaier and Steedman (2007), which in this case would require an 8th-order tensor. This combination of many word-category pairs and higher-order tensors results in a huge number of parameters.
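To give a rough sense of scale, the number of entries in a dense tensor grows exponentially with its order. The figures below assume, purely hypothetically, that every vector space has 100 dimensions; the paper does not specify any dimensionality, and the function name is illustrative.

```python
def dense_parameter_count(order: int, dim: int) -> int:
    """Number of entries in a dense tensor of the given order,
    assuming every factor space has the same dimension `dim`."""
    return dim ** order

HYPOTHETICAL_DIM = 100   # assumed for illustration only

matrix_params = dense_parameter_count(2, HYPOTHETICAL_DIM)         # e.g. N/N
eighth_order_params = dense_parameter_count(8, HYPOTHETICAL_DIM)   # 8 atomic types

print(f"2nd-order tensor: {matrix_params:,} parameters")
print(f"8th-order tensor: {eighth_order_params:,} parameters")
```

Under this assumption a single 8th-order tensor already has 10^16 entries, which is why some form of parameter reduction appears unavoidable for such categories.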
As a solution to this problem, we are investigating ways to reduce the number of parameters, for example using tensor decomposition techniques (Kolda and Bader, 2009). It may also be possible to reduce the size of some of the complex categories in the grammar. Many challenges remain before a type-driven compositional distributional semantics can be realised, similar to the work of Bos for the model-theoretic case (Bos et al., 2004;Bos, 2005), but in this paper we have set out the theoretical framework for such an implementation.
Finally, we repeat a comment made earlier that the compositional framework makes no assumptions about the underlying vector spaces, or how they are to be interpreted. On the one hand, this flexibility is welcome, since it means the framework can encompass many techniques for building word vectors (and tensors). On the other hand, it means that a description of the framework is necessarily abstract, and it leaves open the question