Graph-based Syntactic Word Embeddings

We propose a simple and efficient framework to learn syntactic embeddings based on information derived from constituency parse trees. Using biased random walk methods, our embeddings not only encode syntactic information about words, but also capture contextual information. We also propose a method to train the embeddings on multiple constituency parse trees to ensure the encoding of a global syntactic representation. Quantitative evaluation of the embeddings shows competitive performance on the POS tagging task when compared to other types of embeddings, and qualitative evaluation reveals interesting facts about the syntactic typology learned by these embeddings.


Introduction
Distributional similarity methods have been the standard approach to representation learning in NLP. Word representation methods such as Word2vec, GloVe, and FastText [1,2,3] aim to create vector representations of words from other words or characters that appear in the same context. The underlying premise is that "a word can be defined by its company" [4]. For example, in the sentences "I eat an apple every day" and "I eat an orange every day", the words 'orange' and 'apple' are similar because they share similar contexts.
Recent approaches have proposed a syntax-based extension to distributional word embeddings that includes functional similarity in the word vectors by leveraging the power of dependency parsing [5,6]. Syntactic word embeddings have been shown to be advantageous in specific NLP tasks such as question type classification [7], semantic role labeling [8], part-of-speech tagging [6], biomedical event trigger identification [9], and predicting brain activation patterns [10]. One limitation of these methods is that they do not encode the hierarchical syntactic structure of which a word is a part, due to their reliance on non-constituency parsing such as dependency parsing. While the latter analyzes the grammatical structure of a sentence by establishing directed binary head-dependent relations among its words, constituency parsing analyzes the syntactic structure of a sentence according to a phrase structure grammar.
Syntactic hierarchy has advantages in tasks such as grammar checking, question answering, and information extraction [11]. It has also been encoded in neural models such as the Recursive Neural Tensor Network, where it has been shown to predict the compositional semantic effects of sentiment in language [12]. Moreover, it can uniquely disambiguate the functional role of some words and therefore the overall semantic meaning. Figure 1 shows the modal verb 'should' in the following sentences: (1) "Let me know should you have any question." and (2) "I should study harder for the next exam." Even though the word 'should' is a modal verb (MD) in both sentences, it exhibits two different grammatical functions: conditionality and necessity, respectively. Similarly, the word 'is' in (3) "The king is at home." and (4) "Is the king at home?" has a similar semantic meaning in both sentences, yet it exhibits two different syntactic roles (statement-forming and question-forming). Traditional word embedding methods give a contextual, semantic representation to words like 'is' and 'should', but they make no distinction between their grammatical functions because they lack information on syntactic hierarchy. Constituency parse trees, on the other hand, provide a syntactic representation that can easily capture such distinctions. Figures 1(a) and 1(b) show the constituency parse trees of sentences (1) and (2), respectively. The difference in the position of the modal verb 'should' in the two sentences indicates a difference in grammatical function, especially when it is compared to words with a similar grammatical function in other sentences, such as the one in Figure 1(c). Comparing Figures 1(a) and 1(c), we can note that 'should' holds the same sense of conditionality that the word 'if' has.
To this end, we propose a simple, graph-based framework to build syntactic word embeddings that can be flexibly customized to capture syntactic as well as contextual information by leveraging information derived from either manually or automatically constituency-parsed trees.
While recent transformer-based models such as BERT [13] have proved to be more sophisticated than word embeddings, the latter remain a popular choice due to their simplicity and efficiency. Thus, the contribution of this work is two-fold: (1) we bridge a research gap in the word embedding literature by introducing hierarchical syntactic embeddings based on constituency parsing, and (2) we propose a graph-theoretic training method that clusters words according to their syntactic and constituent roles without sacrificing the original context in which a word appears.

Related Work
The NLP literature is rich with studies that propose improvements to the original word embedding models by incorporating external semantic resources such as lexicons and ontologies [14,15,16,17,18].
However, very few studies have been dedicated to syntactic embeddings. One of the earliest methods is dependency-based word embeddings [5], which generalizes the Skip-gram algorithm to include arbitrary word contexts. Instead of using bag-of-words contexts, it uses contexts derived automatically from dependency parse trees. Specifically, for a word $w$ with modifiers $m_1, \dots, m_k$ and head $h$, the contexts are $(m_1, lbl_1), \dots, (m_k, lbl_k), (h, lbl_h^{-1})$, where $lbl$ is the type of dependency relation between the head and the modifier (e.g., nsubj, dobj) and $lbl^{-1}$ marks the inverse relation. For example, the contexts for the word 'scientist' in "Australian scientist discovers star with telescope" are Australian/amod and discovers/nsubj$^{-1}$.
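The context extraction described above can be illustrated with a minimal sketch. The dependency arcs below are written by hand for the example sentence (a real pipeline would take them from a dependency parser), the function name `dependency_contexts` is hypothetical, inverse relations are marked with a `^-1` suffix, and no further preprocessing of the arcs is attempted.

```python
from collections import defaultdict

# Hand-written (head, relation, modifier) arcs for the example sentence.
# These are an illustrative assumption, not the output of a parser.
ARCS = [
    ("scientist", "amod", "Australian"),
    ("discovers", "nsubj", "scientist"),
    ("discovers", "dobj", "star"),
    ("discovers", "prep", "with"),
    ("with", "pobj", "telescope"),
]

def dependency_contexts(arcs):
    """Map each word to its dependency contexts.

    A head receives modifier/label for each of its modifiers; a modifier
    receives head/label^-1, where ^-1 marks the inverse relation.
    """
    contexts = defaultdict(list)
    for head, label, modifier in arcs:
        contexts[head].append(f"{modifier}/{label}")
        contexts[modifier].append(f"{head}/{label}^-1")
    return dict(contexts)

contexts = dependency_contexts(ARCS)
print(contexts["scientist"])  # ['Australian/amod', 'discovers/nsubj^-1']
```

The word 'scientist' is the head of one arc and the modifier of another, so it receives one direct and one inverse context, matching the example in the text.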
Another modification to the word2vec model was proposed by [6] to improve word embeddings for syntax-based tasks by making them sensitive to the position of words, thereby accounting for the model's lack of order-dependence. The modification does not incorporate external parsing information; instead, it uses two output predictors for every word in the context window, each of which is dedicated to predicting a position-specific value. Results on syntax-based tasks such as POS tagging and parsing show an improvement over classic word2vec embeddings.
More recently, a new approach for learning dependency-based syntactic embeddings, named SynGCN, was introduced by [19]. SynGCN builds syntactic word representations using a Graph Convolution Network (GCN). Using a GCN allows SynGCN to capture global information from the graph on which it was trained while remaining efficient to train due to parallelization. Experiments show that SynGCN obtains improvements over state-of-the-art approaches when used with methods such as ELMo [20].
Most syntactic word embedding methods rely on dependency parsing, and to the best of our knowledge, our work is the first to use constituency parsing to build syntactic representations.

Method
Our goal is to learn word embeddings that capture not only the sentence-level syntactic hierarchy encoded by the constituency parse tree, but also a global (suprasentential) syntactic representation. Because a constituency parse tree only provides a sentence-level syntactic representation, we need a method to combine multiple constituency parse trees. We also need a flexible algorithm to learn the embeddings from the combined trees. In this section, we present a method of parse tree combination (namely graph unionization) as well as the Node2vec algorithm.
Graph Unionization. Given a training dataset of constituency parse trees, we compose one graph (henceforth supergraph, Figure 2) by unionizing all the sentence trees in the training dataset. Formally, let $G_i(V_i, E_i)$ be the graph of the $i$-th tree in the training corpus, where $V_i$ is its set of lexical and non-lexical vertices and $E_i$ is the set of edges between them, and let $H = G_1 \cup G_2 \cup \dots \cup G_n$, where $\cup$ is a non-disjoint union operator and $n$ is the number of sentences in the training corpus. The vertex and edge sets of the supergraph are $V_H = \bigcup_{i=1}^{n} V_i$ and $E_H = \bigcup_{i=1}^{n} E_i$, respectively [21,22].
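As a rough illustration of the unionization step, the sketch below builds a small supergraph from two hand-written parse trees using plain Python sets. It assumes that vertices are identified by their label (a phrase tag, a POS tag, or a word), so that equal labels in different trees collapse into one vertex, and that edges are undirected; the nested-tuple tree encoding and the function names are hypothetical choices made for this example.

```python
def tree_to_graph(tree):
    """Return (vertices, edges) for a tree encoded as (label, child, ...).

    A leaf is a plain string (a word). Vertices are identified by label,
    and each edge is stored as an unordered pair of labels.
    """
    vertices, edges = set(), set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            vertices.add(node)
            continue
        label, children = node[0], node[1:]
        vertices.add(label)
        for child in children:
            child_label = child if isinstance(child, str) else child[0]
            edges.add(frozenset((label, child_label)))
            stack.append(child)
    return vertices, edges

def unionize(trees):
    """Non-disjoint union: V_H and E_H are the unions of all V_i and E_i."""
    v_h, e_h = set(), set()
    for tree in trees:
        v_i, e_i = tree_to_graph(tree)
        v_h |= v_i
        e_h |= e_i
    return v_h, e_h

# Sentences (3) and (4) from the introduction, parsed by hand.
t1 = ("S", ("NP", ("DT", "the"), ("NN", "king")),
      ("VP", ("VBZ", "is"), ("PP", ("IN", "at"), ("NN", "home"))))
t2 = ("SQ", ("VBZ", "is"), ("NP", ("DT", "the"), ("NN", "king")),
      ("PP", ("IN", "at"), ("NN", "home")))

v_h, e_h = unionize([t1, t2])
print(len(v_h), len(e_h))
```

Because the two trees share most of their labels, the union adds only the SQ vertex and its three edges to the first tree, which is the sharing effect the supergraph relies on.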
Node2vec. For learning syntactic embeddings from the supergraph, we use a variant of the skip-gram algorithm called node2vec [23]. Node2vec adapts the Word2vec algorithm to graphs: a node is defined by an arbitrary set of other nodes in the same graph, sampled using a biased random walk. Using the tunable parameters p and q, the biased random walk interpolates between BFS-like and DFS-like search behavior, so that more diverse neighborhoods are explored and richer representations may be learned [23]. As shown in Figure 2, nodes tagged with certain labels, such as adjectives (JJ) or nouns (NN), are linked together in the supergraph while remaining children of a noun phrase (NP). Similarly, sentences with similar grammatical structures, such as interrogative sentences or questions (SQ), are clustered together in the supergraph. The supergraph can thus cluster words of similar syntactic function together while simultaneously preserving the global syntactic hierarchy of the training corpus. Combined with the biased sampling strategy of Node2vec, the supergraph offers the flexibility of learning customizable syntactic representations. A breadth-first search strategy, for instance, would favor the selection of words with similar POS tags and thereby yield word-class-specific representations, while a depth-first strategy would yield more hierarchical or contextual representations.
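The biased sampling can be sketched as follows. This is a simplified, unoptimized version of the second-order walk described in [23], not the implementation used in our experiments: the next node is drawn with unnormalized weight 1/p if it returns to the previous node, 1 if it is also a neighbor of the previous node, and 1/q otherwise. The toy adjacency dictionary and the function name `biased_walk` are assumptions made for illustration.

```python
import random

def biased_walk(adj, start, length, p, q, rng):
    """Sample one walk of up to `length` nodes over adjacency dict `adj`."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step: no previous node yet, so sample uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)  # move further away from prev
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Toy fragment of a supergraph: an NP with three tagged children.
adj = {
    "NP": {"DT", "JJ", "NN"},
    "DT": {"NP", "the"},
    "JJ": {"NP", "domestic"},
    "NN": {"NP", "plan"},
    "the": {"DT"},
    "domestic": {"JJ"},
    "plan": {"NN"},
}

walk = biased_walk(adj, "plan", 10, p=1.0, q=0.5, rng=random.Random(0))
print(walk)
```

The sampled walks would then be fed to a skip-gram model as if they were sentences, with each visited node playing the role of a word.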

Data and Experiment
To train the syntactic embeddings, we use the Penn Treebank corpus [24], which contains over 43,000 constituency parse trees of sentences collected from the Wall Street Journal (WSJ). Next, we unionize all the parse trees into one supergraph. The supergraph has 51071 vertices and 65895 edges, an average degree of 2.5805, and a density of $5.05 \times 10^{-5}$. We chose to unionize all the trees in the training corpus for simplicity, but we could have grouped the sentences into clusters of thematic or semantic identity prior to applying unionization. After that, we train the embeddings with the node2vec algorithm using SGD for 10,000 epochs with a learning rate of 0.025 and a weight decay of 0.005. In terms of node2vec hyperparameters, we chose a random walk length of 200 and a batch size of 100, and the return parameter p and the in-out parameter q are both set to $10^{-6}$. Lower p values keep the walk close to the starting node, and lower q values encourage the walk to behave in a DFS manner [23]. Training took 51 seconds on a single Tesla K80 GPU using the Graphvite Python package [25]. We initialize the word vectors randomly for simplicity, but initialization with other types of distributional embeddings such as word2vec or GloVe is possible. We leave the latter to future work.
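The reported graph statistics can be checked from the vertex and edge counts alone. The short computation below assumes the supergraph is a simple undirected graph, so that the average degree is 2|E|/|V| and the density is 2|E|/(|V|(|V|-1)); the function name `graph_stats` is a hypothetical helper.

```python
def graph_stats(num_vertices, num_edges):
    """Average degree and density of a simple undirected graph."""
    avg_degree = 2 * num_edges / num_vertices
    density = 2 * num_edges / (num_vertices * (num_vertices - 1))
    return avg_degree, density

avg_degree, density = graph_stats(51071, 65895)
print(f"{avg_degree:.4f}")  # 2.5805
print(f"{density:.2e}")     # 5.05e-05
```

Both values agree with the figures given in the text, which indicates that the supergraph is very sparse: on average, each vertex is connected to fewer than three others.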

Results
We conduct two types of evaluation: qualitative and quantitative. In the qualitative evaluation, we examine the extent to which the learned embeddings encode grammatical information about words using word analogies and word vector arithmetic. In the quantitative evaluation, we compare their performance against GloVe vectors [2] and SynGCN [19] on one downstream task, POS tagging.

Qualitative Evaluation
One common method to evaluate word embeddings is to examine the top k nearest neighbors of a word vector in the latent vector space. From Table 5.1, we observe that the top 3 neighbors of the words 'complicate', 'failed', and 'earthquakes' all belong to the same syntactic category as the query word: a present-tense verb attracts present-tense verbs, a plural noun attracts plural nouns, and so on. Similarly, the adjective 'responsible' and the adverb 'handsomely' stay very close to words of the same part of speech. In contrast, neither GloVe vectors nor SynGCN exhibit a similar neighborhood typology. This suggests that our constituency-based embeddings consistently preserve syntactic information about words.
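A nearest-neighbor query of this kind can be sketched in a few lines. The example below assumes cosine similarity as the distance measure, which is a common choice but is not specified above, and it uses made-up three-dimensional toy vectors; the neighbor words 'declined' and 'markets' are hypothetical and are not taken from our tables.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, vectors, k=3):
    """Return the k words most similar to `query`, excluding the query."""
    scored = [
        (cosine(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    return [word for _, word in sorted(scored, reverse=True)[:k]]

# Toy vectors: each axis loosely stands for one word class.
toy = {
    "failed": [0.9, 0.1, 0.0],
    "declined": [0.8, 0.2, 0.1],
    "earthquakes": [0.0, 0.9, 0.2],
    "markets": [0.1, 0.8, 0.3],
    "handsomely": [0.1, 0.1, 0.9],
}

print(nearest("failed", toy, k=1))
print(nearest("earthquakes", toy, k=1))
```

With real embeddings, the same query would be run over the full vocabulary, and the POS tags of the returned neighbors would be compared with the tag of the query word.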
Another way to evaluate word embeddings is to explain word analogies by means of word vector arithmetic [26]. The famous example used in [1] is "woman is to queen as man is to king", or $(w_k + w_w) - w_m \approx w_q$, where the subscripts k, w, m, and q denote king, woman, man, and queen. When we apply the same method to our constituency-based syntactic vectors, we observe that the vector arithmetic strongly matches syntactic analogies. For example, in Table 5.1, if we subtract the sum of the word vectors in the prepositional phrase (PP) "in an industrial" from that of the PP "of any clearly domestic", the top 3 nearest neighbors in our embeddings are all adverbs (ADV), compensating for the adverb that is missing from the first PP. This is not the case for the other types of embeddings, where the top nearest neighbors are affected by the words in the PPs. Similarly, applying the same arithmetic operations to the phrases "his state-of-the-art plan" and "her plan" results in adjective vectors, unlike the other embeddings.
Table 5.1 (caption, partially recovered): ... SynGCN [19], GloVe [2], and our embeddings. Words in bold belong to the same POS tag/grammatical category.
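The phrase-level arithmetic can be illustrated with a small sketch. It sums the word vectors of each phrase, subtracts the shorter phrase from the longer one, and ranks the remaining vocabulary by cosine similarity to the difference. The four-dimensional toy vectors are invented for this example (their axes loosely stand for preposition, determiner, adverb, and adjective), the extra vocabulary words are hypothetical, and excluding the phrase words from the ranking is an assumption rather than something stated above.

```python
import math

def phrase_vector(words, vectors):
    """Sum the vectors of all words in a phrase."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        total = [t + x for t, x in zip(total, vectors[word])]
    return total

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def residual_neighbors(longer, shorter, vectors, k=3):
    """Rank words by similarity to sum(longer) - sum(shorter)."""
    diff = [a - b for a, b in zip(phrase_vector(longer, vectors),
                                  phrase_vector(shorter, vectors))]
    exclude = set(longer) | set(shorter)  # assumption: skip the query words
    ranked = sorted(
        ((cosine(diff, vec), word) for word, vec in vectors.items()
         if word not in exclude),
        reverse=True,
    )
    return [word for _, word in ranked[:k]]

# Toy vectors; axes: preposition, determiner, adverb, adjective.
toy = {
    "in": [1.0, 0.0, 0.0, 0.0], "of": [0.9, 0.1, 0.0, 0.0],
    "an": [0.0, 1.0, 0.0, 0.0], "any": [0.1, 0.9, 0.0, 0.0],
    "industrial": [0.0, 0.0, 0.0, 1.0], "domestic": [0.0, 0.0, 0.1, 0.9],
    "clearly": [0.0, 0.0, 1.0, 0.0], "sharply": [0.0, 0.0, 0.9, 0.1],
    "the": [0.0, 1.0, 0.0, 0.1], "large": [0.0, 0.0, 0.1, 1.0],
}

top = residual_neighbors(["of", "any", "clearly", "domestic"],
                         ["in", "an", "industrial"], toy, k=1)
print(top)
```

In this toy setting the preposition, determiner, and adjective components cancel out, so the difference vector points along the adverb axis and the nearest remaining word is the adverb.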

Intrinsic Evaluation
We also test the performance of our constituency-based embeddings on a mainstream task, part-of-speech tagging. Our goal is not to achieve state-of-the-art results in POS tagging, but to demonstrate the grammatical potential of our embeddings. For this purpose, we treat POS tagging as an independent classification task (as opposed to a structured prediction one) in which a non-sequential classifier, a support vector machine (SVM), is used to predict a POS tag for a word acontextually. The use of a non-neural, non-sequential classifier ensures that the grammatical generalizability comes strictly from the embeddings and not from the neural network or the context. Nevertheless, for the purposes of comparison, we also treat POS tagging as a structured prediction task in which we use a sequential classifier, a conditional random field (CRF). Performance is also reported for two other word embeddings, GloVe and SynGCN, under the same settings. We test the performance using the trained vectors on the first 2000 sentences of the Brown corpus [27]. In Table 2, we report the mean F1 score of 5-fold cross-validation. We observe that our vectors are competitive with SynGCN and far better than GloVe when used with the SVM classifier. In addition, our embeddings outperform both of the competing embeddings when used with the CRF.

Embeddings     SVM F1 score   CRF F1 score
GloVe [2]      0.731          0.894
SynGCN [19]    0.892          0.898
Ours           0.881          0.910
Table 2: Evaluation on POS tagging using SVM and CRF classifiers. Scores represent a mean F1 score of 5-fold cross-validation.
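The acontextual evaluation protocol can be outlined with a dependency-free sketch. Two substitutions should be noted: a nearest-centroid classifier stands in for the SVM so that the example runs without external libraries, and the F1 score is macro-averaged, which is an assumption because the averaging scheme is not stated above. The toy two-dimensional vectors, tags, and function names are likewise hypothetical.

```python
import math
from collections import defaultdict

def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 scores."""
    scores = []
    for tag in set(gold) | set(pred):
        tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
        fp = sum(g != tag and p == tag for g, p in zip(gold, pred))
        fn = sum(g == tag and p != tag for g, p in zip(gold, pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)

def fit_centroids(X, y):
    """Stand-in classifier: one mean vector per tag."""
    sums, counts = {}, defaultdict(int)
    for vec, tag in zip(X, y):
        prev = sums.get(tag, [0.0] * len(vec))
        sums[tag] = [a + b for a, b in zip(prev, vec)]
        counts[tag] += 1
    return {tag: [s / counts[tag] for s in vec] for tag, vec in sums.items()}

def predict(centroids, vec):
    return min(centroids, key=lambda tag: math.dist(centroids[tag], vec))

def cross_validate(X, y, k=5):
    """Mean macro F1 over k folds; each word is classified on its own."""
    fold_scores = []
    for fold in range(k):
        test_idx = list(range(fold, len(X), k))
        held_out = set(test_idx)
        train = [(X[i], y[i]) for i in range(len(X)) if i not in held_out]
        centroids = fit_centroids(*zip(*train))
        gold = [y[i] for i in test_idx]
        pred = [predict(centroids, X[i]) for i in test_idx]
        fold_scores.append(macro_f1(gold, pred))
    return sum(fold_scores) / k

# Toy word vectors with their tags (cleanly separable by construction).
X = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9], [0.8, 0.0],
     [0.0, 0.8], [1.0, 0.2], [0.2, 1.0], [0.9, 0.0], [0.0, 0.9]]
y = ["NN", "VB"] * 5

mean_f1 = cross_validate(X, y, k=5)
print(mean_f1)
```

Since each word is classified from its vector alone, any score above chance in this setup has to come from information stored in the embeddings rather than from sentence context.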
Even though the performance of the constituency-based embeddings lags slightly behind SynGCN in the case of independent classification, the corpus on which our embeddings were trained (Penn Treebank, 1 million tokens) is much smaller than the one on which SynGCN was trained (Wikipedia, 1.1 billion tokens). In addition, the flexibility of learning customizable syntactic word embeddings, together with the training efficiency, makes constituency-based word embeddings a promising research direction that can be applied to other graph-based tasks.

Conclusion and Future Work
We presented a simple and efficient framework to learn syntactic embeddings from constituency parse trees by combining the unionization of multiple parse-tree graphs with biased random walks. Our framework can be flexibly customized to learn purely contextual and non-contextual syntactic embeddings, and it can also be used as a post-hoc method for other kinds of (distributional) word embeddings. In future studies, we would like to investigate training constituency-based vectors on a larger corpus and to examine the effect of different initializations on more mainstream tasks such as machine translation and automatic speech recognition.