Introducing Orthogonal Constraint in Structural Probes

With the recent success of pre-trained models in NLP, significant focus has been put on interpreting their representations. One of the most prominent approaches is structural probing (Hewitt and Manning, 2019), where a linear projection of word embeddings is performed in order to approximate the topology of dependency structures. In this work, we introduce a new type of structural probing, where the linear projection is decomposed into 1. isomorphic space rotation; 2. linear scaling that identifies and scales the most relevant dimensions. In addition to syntactic dependency, we evaluate our method on two novel tasks (lexical hypernymy and position in a sentence). We jointly train the probes for multiple tasks and experimentally show that lexical and syntactic information is separated in the representations. Moreover, the orthogonal constraint makes the Structural Probes less vulnerable to memorization.


Introduction
Latent representations of neural networks encode specific linguistic features. Recently, much attention has been devoted to interpreting these representations and analyzing the structures captured by deep models. One of the most popular analysis methods is probing (Belinkov et al., 2017; Blevins et al., 2018; Linzen et al., 2016; Liu et al., 2019). The pre-trained model's parameters are fixed (typically, models for language modeling or machine translation are analyzed), and its latent states or outputs are fed into a simple neural network optimized to solve an auxiliary task, e.g., semantic or syntactic parsing, anaphora resolution, morphosyntactic tagging, etc. The amount of linguistic information stored in the representations can be evaluated by measuring performance on the specific language task. Probing experiments usually involve classification tasks. Lately, Hewitt and Manning (2019) proposed Structural Probes, which use regression as the optimization objective. They train a linear projection layer to approximate: 1. dependency tree distances between words² by the Euclidean distance between transformed vectors; 2. the tree depth of a word by the norm of its vector.
In Figure 1, we visualize our Orthogonal Structural Probe. The linear transformation is replaced by an Orthogonal Transformation (a rotation of the embedding space) followed by element-wise multiplication of the rotated vectors by a Scaling Vector to obtain the final projections. Our motivation is to obtain an embedding space that is isomorphic with the original one, so that the impact of each dimension can be evaluated by analyzing the Scaling Vector's weights. We elaborate on the mathematical properties and training details in Section 3.
In addition to dependency trees used by Hewitt and Manning (2019), we introduce new structural tasks related to lexical hypernymy and word's position in the sentence. We also employ a control task, in which we evaluate the memorization of randomly generated trees. Orthogonal Structural Probes let us optimize for multiple objectives jointly by keeping a shared Orthogonal Transformation matrix and changing task-specific Scaling Vectors.
We will answer the following questions: 1. Do our Orthogonal Structural Probes achieve performance comparable to or better than the Structural Probes of Hewitt and Manning (2019)? 2. Can we find other phenomena, such as lexical hypernymy and a word's absolute position in a sentence, using Orthogonal Structural Probes? How vulnerable are the probes to memorizing random data?
3. Is it possible to effectively train Orthogonal Structural Probes jointly for multiple auxiliary objectives, i.e., depth and distance, or multiple types of structures mentioned in the previous question?
4. Can we identify particular dimensions of the embedding space that encode particular linguistic structures? Are there any superfluous dimensions?
5. If yes, what is the relationship between subspaces encoding distinct structures?

Related Work
Basic linguistic features can be easily extracted from contextual representations (Liu et al., 2019). Probing has been used intensively to investigate the representation of morphological information (mainly POS tags) in hidden states of machine translation systems and language models (Belinkov et al., 2017; Peters et al., 2018; Tenney et al., 2019b). Besides the work of Hewitt and Manning (2019), probing for dependency syntax was performed by Tenney et al. (2019a) and Blevins et al. (2018). They utilize a binary classifier to predict dependency edges. In work contemporary to ours, Ravichander et al. (2020) employ a softmax classifier to show that BERT can be successfully probed for hypernymy. There is an ongoing debate on which probe architectures offer a good insight into the underlying representations. Zhang and Bowman (2018) showed that a POS tagger on top of a frozen, randomly initialized LSTM model achieves unexpectedly high results. In the work of Hewitt and Liang (2019), multilayer perceptron probes display similar accuracy for predicting POS tags as for randomly assigned tags. These symptoms underscore how crucial it is to carefully consider the probe's architecture to avoid reaching spurious conclusions. It is good practice to monitor additional aspects of the probe beyond performance on a linguistic task, such as selectivity (Hewitt and Liang, 2019) or complexity (Pimentel et al., 2020). The recent state of knowledge is summarized in surveys on probing (Belinkov and Glass, 2019) and on the interpretation of BERT's representations (Rogers et al., 2020).
Orthogonality has been applied broadly in the field of deep learning, especially to cope with the exploding/vanishing gradient problem in recurrent neural networks (Arjovsky et al., 2016; Jing et al., 2017a; Wisdom et al., 2016). In this work, we use regularization to enforce the orthogonality of a dense layer. In the literature, such an approach is called a "soft constraint" (Bansal et al., 2018; Vorontsov et al., 2017). Alternatively, a "hard constraint" assumes a parameterization of the network such that the transformation of latent states is orthogonal by definition (Arjovsky et al., 2016; Jing et al., 2017b). There are a few examples of orthogonality applications in NLP: in an RNN language model (Dangovski et al., 2019) and in the Performer (Choromanski et al., 2020), a more efficient counterpart of the Transformer (Vaswani et al., 2017). To the best of our knowledge, we are the first to use an orthogonal transformation in probing.

Method
In this section, we first review the structural probing proposed by Hewitt and Manning (2019) and then introduce our Orthogonal Structural Probe.

Structural Probes
In the previous work, a linear transformation is optimized to transform the contextual word representations produced by a pre-trained neural model (e.g., BERT (Devlin et al., 2019), ELMo (Peters et al., 2018)). The squared L2 norm of the difference between transformed word vectors approximates the tree distance between them:

$$d_B(h_i, h_j)^2 = \left(B(h_i - h_j)\right)^T \left(B(h_i - h_j)\right) \quad (1)$$

where $B$ is the Linear Transformation matrix and $h_i$, $h_j$ are the vector representations of the words at positions $i$ and $j$.
The probe is optimized by gradient descent to approximate the distance between tokens in the dependency tree ($d_T$):

$$\min_B \sum_{\ell} \frac{1}{|s^\ell|^2} \sum_{i,j} \left| d_{T^\ell}(w_i^\ell, w_j^\ell) - d_B(h_i^\ell, h_j^\ell)^2 \right| \quad (2)$$

where $|s^\ell|$ is the length of sentence $\ell$.
Moreover, the same work introduced depth probes, where vectors are linearly transformed so that the squared L2 norm of the mapping approximates the token's depth in the dependency tree:

$$\|h_i\|_B^2 = (B h_i)^T (B h_i) \quad (3)$$

The gradient descent objective is analogous:

$$\min_B \sum_{\ell} \frac{1}{|s^\ell|} \sum_{i} \left| \mathrm{depth}_{T^\ell}(w_i^\ell) - \|h_i^\ell\|_B^2 \right| \quad (4)$$
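To make the setup concrete, a minimal PyTorch sketch of the distance and depth probes described above is given below. It is an illustration, not the authors' released code; names such as StructuralProbe, probe_rank, and distance_loss are ours.

```python
import torch
import torch.nn as nn

class StructuralProbe(nn.Module):
    """Hewitt-and-Manning-style probe: a learned linear map B applied to
    contextual embeddings; squared L2 norms of (differences of) projected
    vectors approximate tree distances and depths."""

    def __init__(self, embedding_dim=1024, probe_rank=1024):
        super().__init__()
        self.B = nn.Parameter(torch.randn(probe_rank, embedding_dim) * 0.01)

    def distances(self, h):
        # h: (seq_len, embedding_dim) -> pairwise squared distances (seq_len, seq_len)
        proj = h @ self.B.T                         # (seq_len, probe_rank)
        diff = proj.unsqueeze(1) - proj.unsqueeze(0)
        return (diff ** 2).sum(dim=-1)              # d_B(h_i, h_j)^2, cf. Eq. (1)

    def depths(self, h):
        # squared norm of each projected vector, cf. Eq. (3)
        proj = h @ self.B.T
        return (proj ** 2).sum(dim=-1)

def distance_loss(pred_dist, gold_dist, sent_len):
    # L1 loss normalized by the squared sentence length, cf. Eq. (2)
    return torch.abs(pred_dist - gold_dist).sum() / (sent_len ** 2)
```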

Orthogonal Structural Probes
We introduce orthogonality to structural probes. For that purpose, we consider the singular value decomposition of the matrix $B$:

$$B = U \cdot D \cdot V^T \quad (5)$$

where the matrices $U$ and $V$ are orthogonal and $D$ is diagonal. Notably, when we substitute $B$ with $U \cdot D \cdot V^T$ in Eq. (1), the matrix $U$ cancels out. This can be shown by rearranging the variables in the equation:³

$$\begin{aligned}
d_B(h_i, h_j)^2 &= \left(U D V^T (h_i - h_j)\right)^T \left(U D V^T (h_i - h_j)\right) \\
&= \left(D V^T (h_i - h_j)\right)^T U^T U \left(D V^T (h_i - h_j)\right) \\
&= \left(D V^T (h_i - h_j)\right)^T \left(D V^T (h_i - h_j)\right)
\end{aligned}$$

We can replace the diagonal matrix $D$ with a vector $\bar{d}$ and use the element-wise product (we will call $\bar{d}$ the Scaling Vector). Finally, we obtain the following equation for the Orthogonal Distance Probe:

$$d_{\bar{d},V}(h_i, h_j)^2 = \left(\bar{d} \odot V^T (h_i - h_j)\right)^T \left(\bar{d} \odot V^T (h_i - h_j)\right) \quad (6)$$

³ A complete derivation can be found in the appendix.
The same reasoning can be applied to Eq. (3) to obtain the Orthogonal Depth Probe:

$$\|h_i\|^2_{\bar{d},V} = \left(\bar{d} \odot V^T h_i\right)^T \left(\bar{d} \odot V^T h_i\right) \quad (7)$$

We have thus shown that the Orthogonal Structural Probe is mathematically equivalent to the standard Structural Probe.
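Below is a corresponding sketch of the Orthogonal Structural Probe forward pass, again illustrative rather than taken from the released implementation; the class name and the initialization of V are assumptions.

```python
import torch
import torch.nn as nn

class OrthogonalStructuralProbe(nn.Module):
    """Rotation by an (approximately) orthogonal matrix V followed by
    element-wise scaling with a Scaling Vector d."""

    def __init__(self, embedding_dim=1024):
        super().__init__()
        # V is kept close to orthogonal during training via the DSO penalty.
        self.V = nn.Parameter(torch.eye(embedding_dim)
                              + 0.01 * torch.randn(embedding_dim, embedding_dim))
        self.d = nn.Parameter(torch.ones(embedding_dim))

    def project(self, h):
        # d ⊙ (V^T h); for row vectors h_i this is (h @ V) scaled element-wise
        return (h @ self.V) * self.d

    def distances(self, h):
        proj = self.project(h)
        diff = proj.unsqueeze(1) - proj.unsqueeze(0)
        return (diff ** 2).sum(dim=-1)              # Orthogonal Distance Probe, Eq. (6)

    def depths(self, h):
        return (self.project(h) ** 2).sum(dim=-1)   # Orthogonal Depth Probe, Eq. (7)
```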

Multitask Training
The Orthogonal Structural Probe can be easily adapted to multitask probing for a set of objectives $O$. We use one shared Orthogonal Transformation and a different Scaling Vector for each task. Each batch contains data for a single objective, and we compute the loss only for that objective. For each batch (with objective $o \in O$), a forward pass consists of multiplication by the shared orthogonal matrix $V^T$ and element-wise multiplication by the designated Scaling Vector $\bar{d}_o$. All the batches are shuffled together within a training epoch.
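One way to organize the shared rotation and per-task Scaling Vectors is sketched below; the objective names in the dictionary are placeholders.

```python
import torch
import torch.nn as nn

class MultitaskOrthogonalProbe(nn.Module):
    """One shared rotation V; one Scaling Vector per probing objective."""

    def __init__(self, embedding_dim,
                 objectives=("dep_distance", "dep_depth", "lex_distance", "lex_depth")):
        super().__init__()
        self.V = nn.Parameter(torch.eye(embedding_dim)
                              + 0.01 * torch.randn(embedding_dim, embedding_dim))
        self.scaling = nn.ParameterDict(
            {o: nn.Parameter(torch.ones(embedding_dim)) for o in objectives})

    def project(self, h, objective):
        # forward pass for a batch belonging to a single objective
        return (h @ self.V) * self.scaling[objective]
```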

Orthogonality Regularization
We use Double Soft Orthogonality Regularization (DSO) proposed by Bansal et al. (2018) to coerce orthogonality of the matrix $V$ during training:

$$DSO(V) = \|V^T V - I\|_F^2 + \|V V^T - I\|_F^2 \quad (8)$$

where $\| \cdot \|_F$ stands for the Frobenius norm of a matrix.
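A direct implementation of the DSO penalty could look as follows (a sketch; the function name is ours):

```python
import torch

def dso_penalty(V):
    """Double Soft Orthogonality regularization (Bansal et al., 2018):
    squared Frobenius norms of V^T V - I and V V^T - I."""
    eye = torch.eye(V.shape[0], device=V.device)
    return ((V.T @ V - eye) ** 2).sum() + ((V @ V.T - eye) ** 2).sum()
```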

Sparsity Regularization
In further experiments, we investigate the effects of sparsity in Scaling Vector. For that purpose, we compute the L1 norm and add it to the training loss.

Training Objective
Altogether, the loss in the Orthogonal Distance Probe for objective $o \in O$ is the following:

$$L_{dist,o} = \frac{1}{|s|^2} \sum_{i,j} \left| d_{T_o}(w_i, w_j) - d_{\bar{d}_o,V}(h_i, h_j)^2 \right| + \lambda_O \cdot DSO(V) + \lambda_S \cdot \|\bar{d}_o\|_1 \quad (9)$$

And in the Orthogonal Depth Probe:

$$L_{depth,o} = \frac{1}{|s|} \sum_{i} \left| \mathrm{depth}_{T_o}(w_i) - \|h_i\|^2_{\bar{d}_o,V} \right| + \lambda_O \cdot DSO(V) + \lambda_S \cdot \|\bar{d}_o\|_1 \quad (10)$$

The loss is normalized by the number of predictions in a sentence and averaged across a batch.
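Putting the pieces together, the per-batch loss could be assembled roughly as below, reusing dso_penalty from the sketch above; the default hyperparameter values mirror those reported in Section 4, but the function itself is illustrative.

```python
import torch

def probe_loss(pred, gold, sent_len, V, d, lambda_o=0.05, lambda_s=0.0, distance=True):
    """L1 fit term normalized per sentence plus the orthogonality and sparsity
    penalties, following Eq. (9) for distances and Eq. (10) for depths."""
    norm = sent_len ** 2 if distance else sent_len
    fit = torch.abs(pred - gold).sum() / norm
    return fit + lambda_o * dso_penalty(V) + lambda_s * d.abs().sum()
```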

Experiments
We train probes on top of each of the 24 layers of the English BERT large cased model (Devlin et al., 2019) implemented by HuggingFace (Wolf et al., 2020). We optimize for the approximation of depth and distance in four types of structures: syntactic dependency, lexical hypernymy, absolute position in a sentence, and randomly generated trees. In the following subsections, we expand upon these structures.

Data and Objectives
In our experiments, we use training, evaluation, and test sentences from the Universal Dependencies English Web Treebank (Silveira et al., 2014). Depending on the objective, we reveal only the relevant part of the annotation from the dataset.

Dependency Syntax
We probe for syntactic structure in Universal Dependencies parse trees, as annotated in the English Web Treebank. We focus on distances between words in dependency trees and their depths, i.e., their distances from the syntactic root.

Lexical Hypernymy
We introduce probing for lexical information. We optimize probes to approximate the distance between pairs of words in the hypernymy tree and the depth of each word. For that purpose, we use the hypernymy tree from WordNet (Miller, 1995). We consider lexical distances between pairs of nouns and pairs of verbs in a sentence and lexical depth for each noun and verb. We provide gold POS information and look up synsets by the lemmatized form of a word to avoid ambiguity.
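For illustration, the WordNet depths and distances could be looked up with NLTK roughly as follows; this is a sketch that takes the first synset of the lemmatized form, and the exact synset selection in the paper's pipeline may differ.

```python
from nltk.corpus import wordnet as wn

def hypernymy_depth(lemma, pos="n"):
    """Depth of a word in the WordNet hypernymy tree (shortest hypernym path
    to a root); pos is 'n' for nouns or 'v' for verbs."""
    synsets = wn.synsets(lemma, pos=pos)
    return synsets[0].min_depth() if synsets else None

def hypernymy_distance(lemma_a, lemma_b, pos="n"):
    """Shortest path between two words' synsets in the hypernym/hyponym taxonomy."""
    sa, sb = wn.synsets(lemma_a, pos=pos), wn.synsets(lemma_b, pos=pos)
    if not sa or not sb:
        return None
    return sa[0].shortest_path_distance(sb[0])

# Example: hypernymy_distance("dog", "cat") returns a small path length,
# since both synsets share a close hypernym (carnivore).
```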

Position in a Sentence
We probe for a word's index in the sentence and for the positional difference between pairs of words.

Random Structures
We probe for randomly generated trees. When we jointly optimize for depth and distance, we keep the same randomly generated tree. This control task allows us to determine the extent to which our probes memorize the structures and thus over-fit to the training data.

Training
We use batches of size 12 and an initial learning rate of 0.02. We use learning rate decay and an early-stopping mechanism: if the validation loss does not reach a new minimum after an epoch, the learning rate is divided by 10. After three consecutive learning rate updates without a new minimum, training is stopped.
Orthogonality Regularization In our experiments, we set λ_O to 0.05.⁴ The regularization converged early during gradient optimization; hence, we can assume that the matrix V is orthogonal.
Sparsity Regularization By default, λ_S = 0. Only in the experiments described in Section 5.1 do we use sparsity regularization, setting λ_S to a positive value (0.005, 0.05, or 0.1) once DSO drops below 1.5 during training. This mechanism prevents weakening the orthogonality constraint in early epochs. Additional details of the training are described in the appendix. The code is available on GitHub: https://github.com/Tom556/OrthogonalTransformerProbing.
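The sparsity-activation schedule can be implemented, for example, as a small stateful helper (a sketch with illustrative names):

```python
class SparsitySchedule:
    """Keeps lambda_S at zero until the DSO penalty drops below a threshold,
    so the sparsity term cannot weaken the orthogonality constraint early on."""

    def __init__(self, target=0.05, threshold=1.5):
        self.target, self.threshold = target, threshold
        self.active = False

    def __call__(self, dso_value):
        if dso_value < self.threshold:
            self.active = True
        return self.target if self.active else 0.0
```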

Evaluation
We assess Spearman's rank correlation between gold and predicted values. We report the average correlations for sentences with lengths from 5 to 50, in the same way as Hewitt and Manning (2019).
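A simplified sketch of this evaluation is shown below; it macro-averages word-level Spearman correlations by sentence length, which approximates but may not exactly reproduce the original evaluation script.

```python
import numpy as np
from scipy.stats import spearmanr

def average_spearman(gold_by_sentence, pred_by_sentence, min_len=5, max_len=50):
    """Word-level Spearman correlations (one per row of a distance matrix, or
    one per sentence for depths), macro-averaged by sentence length over
    lengths 5-50."""
    by_length = {}
    for gold, pred in zip(gold_by_sentence, pred_by_sentence):
        gold, pred = np.atleast_2d(gold), np.atleast_2d(pred)
        length = gold.shape[-1]
        if not (min_len <= length <= max_len):
            continue
        rhos = [spearmanr(g, p).correlation for g, p in zip(gold, pred)]
        by_length.setdefault(length, []).extend(rhos)
    return np.mean([np.mean(v) for v in by_length.values()])
```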
Our Orthogonal Structural Probes are trained jointly for multiple objectives (Section 3.3). We evaluate the effect of multitasking by testing different configurations: A) separate probing for each objective; B) joint probing for distance and depth in the same structure type; C) joint probing for distance in all structures; D) joint probing for depths in all structures; E) probing for all objectives together. We compare the results with two baselines: I) optimizing only the Scaling Vector; II) Structural Probes.

Dimensionality of Scaling Vector
We hypothesize that the orthogonality regularization allows us to find embedding subspace capable of representing a particular linguistic structure. In Section 5.1, we examine the performance of lower-rank projections and ask whether further restrictions of dimensionality affect the results. In Section 5.2 we analyze interactions between subspaces related to a particular objective in a joint probing setting.

Results
We compare Spearman's correlations between predicted values and gold tree depths and distances in Table 1. The correlations obtained from Orthogonal Structural Probes are high for linguistic structures: from 0.803 for lexical distance to 0.882 for lexical depth. Predicted positional depths and distances nearly match the gold values. The correlation on training data for random structures is very weak, hinting that the probes do not memorize structures during training but extract them from the model's representations. The correlation for distances is higher than for depths; we hypothesize this is because the probes learn some basic tree properties.⁵ The results obtained by Orthogonal Structural Probes are close to those of Structural Probes. For dependency distance, the difference is not statistically significant. Notably, the correlations on the training set for randomly generated trees decreased, suggesting that Orthogonal Structural Probes are less vulnerable to memorization. In multitask probing, correlation decreases evenly across all tasks, while selectivity (the difference between the average correlation for dependency, lexical, and positional objectives and the correlation for random objectives) increases from 0.673 to 0.726. Optimizing only a Scaling Vector gives distinctly lower correlations. These results emphasize the necessity of changing the coordinate system to amplify the dimensions encoding linguistic information.
In Fig. 2 (upper), we observe that performance varies across the layers, confirming previous observations by Hewitt and Manning (2019) and Tenney et al. (2019a). The mid-upper layers tend to be more syntactic, and the mid-lower ones more lexical. Predicting word position is more accurate in the lower layers and drops significantly toward the last layers; this is because in BERT positional embeddings are added before the first layer. Random structure probes maintain steady results across all layers.

Dimensionality
We observe that the orthogonality constraint is quite effective in restricting the probe's rank. In most of our experiments, the majority of Scaling Vector parameters converged to zero, which allows selecting the subspaces encoding particular linguistic features. We want to answer whether such a subspace has enough capacity for each probing task. For that purpose, we zero out the dimensions whose corresponding Scaling Vector weights are closer to zero than ε = 10⁻⁴.⁶ Their elimination does not affect the results; the correlations in Table 2 and Table 1 column A are practically equal. The dimensionality reduction is strongest for the lexical and positional depth probes, where subspaces with ranks of 19 and 20, respectively, encode the structures as well as the whole embedding space with 1024 dimensions (Fig. 2, lower). The number of selected dimensions is the highest in probing for random structures, because a large capacity is required for memorization.
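The pruning step amounts to masking near-zero Scaling Vector weights; a sketch (with illustrative function names) is given below.

```python
import torch

def effective_dimensions(scaling_vector, eps=1e-4):
    """Indices of Scaling Vector dimensions that did not converge to zero."""
    return torch.nonzero(scaling_vector.abs() > eps).squeeze(-1)

def prune_probe(scaling_vector, eps=1e-4):
    """Zero out near-zero dimensions; the number of surviving dimensions is
    the rank of the subspace encoding the probed structure."""
    mask = (scaling_vector.abs() > eps).float()
    return scaling_vector * mask, int(mask.sum().item())
```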
Another question we pose is whether it would be adequate to shrink the subspace even further. For each objective, we choose and drop a random portion of parameters to examine how it affects the predictions. We conduct a procedure similar to cross-validation, i.e., we repeatedly drop disjoint and exhaustive sets of dimensions and average the results for each set at the end.⁷ Table 2 shows that dimension dropping has the largest impact on positional probes: −0.458 for depth; the decrease is low for lexical distance: only −0.083. This suggests that the information necessary for the latter objective is more dispersed than for the former.
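A rough sketch of this ablation procedure follows; the fold construction and the evaluation callback (eval_fn) are placeholders, and the exact splitting used for Table 2 may differ.

```python
import torch

def drop_dimensions_cv(scaling_vector, eval_fn, keep_fraction=0.75, seed=0):
    """Cross-validation-style ablation: split the surviving dimensions into
    disjoint folds, zero out one fold at a time, and average the metric.
    eval_fn maps a (masked) Scaling Vector to a correlation score."""
    torch.manual_seed(seed)
    dims = torch.nonzero(scaling_vector.abs() > 1e-4).squeeze(-1)
    perm = dims[torch.randperm(len(dims))]
    n_folds = max(1, round(1.0 / (1.0 - keep_fraction)))   # e.g. 4 folds when dropping 25%
    scores = []
    for fold in perm.chunk(n_folds):
        masked = scaling_vector.clone()
        masked[fold] = 0.0
        scores.append(eval_fn(masked))
    return sum(scores) / len(scores)
```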

Sparsity Regularization
We use sparsity regularization of the Scaling Vector to examine whether dimensionality can be reduced more intelligently. The strength of regularization is controlled by the value of λ_S ∈ {0.005, 0.05, 0.1}. We observe that for some objectives (dependency depth, positional depth, and positional distance), the relevant information is captured in a small number of dimensions. Remarkably, only one dimension of the embedding space can achieve a 0.822 correlation with dependency depths. We conjecture that if it is possible to achieve a high correlation with sparse subspaces, information on the phenomenon is focal in the model (concentrated in a few dimensions). For the objectives with focal information, results decrease sharply when random dimensions are dropped, because the probability of dropping important coordinates is high. On the other end of the spectrum, we can identify an objective for which information is spread: lexical distance. Dropping random dimensions only moderately decreases the correlation, as there are no especially essential coordinates. Probing with sparsity regularization produces subspaces of relatively large size.
Sparsity regularization also positively affects control objectives, decreasing correlations with distances and depths of randomly generated structures, indicating that regularized probes are less prone to memorization.
Notably, Torroba Hennigen et al. (2020) proposed a method for selecting embeddings' dimensions relevant to particular linguistic phenomena. In our setting, thanks to the Orthogonal Transformation, we are not constrained to analyzing the dimensions of just one coordinate system.

Separation of Information
Another outcome of joint training is the ability to examine relationships between the subspaces for each of the objectives. Figure 3 shows histograms of the dimensions selected in the lexical and dependency probes. Each bin of the histogram corresponds to 10 coordinates, and the height of a bar (in one color) represents how many of them were selected for a specific task. The dimensions on the x-axis are ordered by the weighted absolute values of the Scaling Vectors.⁸ We find that in layers 6 and 16 (which achieve the highest correlation for the lexical and dependency objectives, respectively), the histograms are disjoint, indicating that the layers' representations of dependency syntax and lexical hypernymy are orthogonal to each other in the embedding space. The orthogonality is less visible in the first layer and disappears almost entirely in the top one. In most layers, the depth subspace is included in the distance subspace for the same structural type. This behavior was expected, as distance probing is more complex and therefore requires more capacity.
In Fig. 4 we present histograms for additional tasks at the model's 16th layer. The positional subspace has a sizable intersection with the syntactic one, yet only a few common dimensions with the lexical subspace. The connection can be attributed to the fact that dependency edges can often be inferred from words' relative positions. Probing for random structures is interlinked with other objectives. The sizes of shared subspaces for each pair can be found in Table 3. Histograms and tables for other sets of tasks are presented in the appendix.

Discussion
The introduction of an orthogonal constraint is a core element of our analysis. The constraint assures that no dimension is enhanced or diminished in the transformation and allows interpreting the magnitude of values in the Scaling Vector as the relevance of each dimension for the objectives.
In an Orthogonal Structural Probe, the sufficient rank of the transformation is learned during optimization. The rank regularization is a prerequisite for disentangling the information encoded by the probe (Section 5.2). A natural question is whether such an analysis could be performed by reducing the rank of a Structural Probe with another regularizer and decomposing the linear transformation after optimization. We argue that this is not possible in either joint or separate probing:
• In joint probing for multiple tasks: one Scaling Vector would be shared for all the tasks, so it would not be possible to attribute the dimensions to a specific task.
• In separate probing for each task: the decomposition leads to different orthogonal matrices. Hence, the dimensions of distinct Scaling Vectors do not correspond to each other.

Limitations
We focus on syntax annotated in Universal Dependencies and lexical hypernymy encoded in WordNet. We do not claim that there is no correlation between syntactic and lexical information in BERT, only that the topologies of these two structures are encoded separately. It is entirely possible that we would find overlapping dimensions when probing for syntax and lexicon with differently annotated datasets.
In contrast to Structural Probes, our reformulation of the loss (Eq. (9) and Eq. (10)) is not convex. We thank one of the anonymous ACL reviewers for pointing it out. Nevertheless, we show that despite the non-convexity, our Orthogonal Structural Probes achieve results similar to Structural Probes and are more selective.

Conclusions
We have expanded structural probing to new types of auxiliary tasks and introduced a new setting, the Orthogonal Structural Probe, in which probes can be optimized jointly. We found that: 1. Orthogonal Structural Probes achieve results comparable to the Structural Probes of Hewitt and Manning (2019) while being less vulnerable to memorization. 2. In addition to syntactic dependencies, Orthogonal Structural Probes can be efficiently trained to approximate distance and depth in WordNet hypernymy trees and positional order.
3. Orthogonal Structural Probes can be trained jointly for multiple objectives. In most cases, the performance moderately drops, and selectivity increases. The number of parameters decreases in comparison to training many separate probes.
4. Usually, the information necessary for each objective is stored in a subspace of relatively low rank (19 to 263). We can further reduce the dimensionality by applying sparsity regularization. For a few objectives (e.g., positional depth, dependency depth), the information is highly focal, and the performance can fall markedly when just 25% of randomly selected dimensions are dropped.
5. We have found that in most of BERT's layers, the subspace encoding lexical hypernymy is separated from the subspace encoding dependency syntax and from the subspace encoding a word's position.

Further work
Our method can be adjusted for multitask and multilingual settings. Following the observation that an orthogonal transformation can map distributions of embeddings between typologically close languages (Mikolov et al., 2013; Vulić et al., 2020), we think that joint training for many languages may be possible by keeping the same Scaling Vector and adding a separate Orthogonal Transformation per language, fulfilling the role of the orthogonal mappings. Another line of research would be analyzing probes for other linguistic structures, for instance, derivation trees.

A Technical Details
The Orthogonal Structural Probe is trained to minimize the L1 loss between predicted and gold distances and depths. The loss is normalized by the number of predictions in a sentence and averaged across a batch of size 12. Optimization is conducted with Adam (Kingma and Ba, 2014) with an initial learning rate of 0.02 and parameters β_1 = 0.9, β_2 = 0.999, and ε = 10⁻⁸. We use learning rate decay and an early-stopping mechanism: if the validation loss does not reach a new minimum after an epoch, the learning rate is divided by 10. After three consecutive learning rate updates without a new minimum, training is stopped.
To alleviate sharp jumps in the training loss, which we observed mainly when training Depth Probes, we clip each gradient's norm at c = 1.5.
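For reference, a sketch of the optimizer setup, learning-rate decay, early stopping, and gradient clipping described above; the helper names and the state dictionary are ours.

```python
import torch

def make_optimizer(probe):
    # Adam with the hyperparameters reported above.
    return torch.optim.Adam(probe.parameters(), lr=0.02,
                            betas=(0.9, 0.999), eps=1e-8)

def end_of_epoch(optimizer, val_loss, state):
    """Divide the learning rate by 10 whenever validation loss fails to reach
    a new minimum; signal a stop after three such decays without improvement.
    `state` is a dict, e.g. {"best": float("inf"), "decays": 0}."""
    if val_loss < state["best"]:
        state["best"], state["decays"] = val_loss, 0
        return False                       # keep training
    state["decays"] += 1
    for group in optimizer.param_groups:
        group["lr"] /= 10.0
    return state["decays"] >= 3            # stop after three decays

# Gradient clipping applied before each optimizer step:
# torch.nn.utils.clip_grad_norm_(probe.parameters(), max_norm=1.5)
```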

A.1 Orthogonal Regularization
In order to coerce orthogonality of the matrix V, we add DSO to the loss. Bansal et al. (2018) showed that for convolutional neural networks applied to image processing, a simpler regularization (SO) is more powerful.
In our experiments, DSO led to faster convergence. Fig. 5 shows the values of the orthogonality penalty during training. Taking into account the properties of the Frobenius norm, we observe that the matrix V is close to orthogonal already after the initial epochs.

A.2 Sparsity Regularization
Fig. 6 presents the values of the sparsity penalty during training. The regularization is applied only after the orthogonality penalty drops below 1.5.

A.3 Number of Parameters
The number of the Orthogonal Structural Probe's parameters is given by the equation:

$$N_{params} = D_{emb}^2 + D_{emb} \cdot N_{obj}$$

where $D_{emb}$ is the dimensionality of the embeddings and $N_{obj}$ is the number of jointly probed objectives. Therefore, our biggest probes on top of BERT Large for all eight objectives have 1024² + 1024 · 8 = 1,056,768 parameters. This is more than in the Structural Probes of Hewitt and Manning (2019). Nevertheless, our probes have fewer degrees of freedom, because we use an Orthogonal Transformation instead of a Linear Transformation.

D Application in Dependency Parsing
We compute the UAS of dependency trees predicted from the dependency probes. We employ the algorithm for extracting directed dependency trees proposed by Kulmizev et al. (2020). Our innovation to the method is that we optimize the distance and depth probes jointly in one optimization. In line with previous studies, we show that Orthogonal Structural Probes can be employed for parsing. Table 4 presents the Unlabeled Attachment Scores achieved by different multi-task configurations. Joint probing for dependency distance and depth allows us to extract a directed dependency tree in just one optimization; to the best of our knowledge, this has not been tried before. Analogously to Spearman's correlation, UAS drops when more objectives are used in the optimization. However, even joint probing for all eight objectives is capable of producing trees with 75.66% UAS.
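A simplified sketch of such tree extraction is given below: an undirected minimum spanning tree is built over the predicted distances and then oriented with the predicted depths (the shallower word of each edge becomes the head). This follows the spirit of Kulmizev et al. (2020) but is not their exact algorithm; in particular, conflicting head assignments under noisy depths are resolved naively here.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def extract_tree(pred_distances, pred_depths):
    """Undirected MST over predicted pairwise distances, oriented with
    predicted depths. Returns one head index per word (the root keeps -1)."""
    n = len(pred_depths)
    mst = minimum_spanning_tree(np.asarray(pred_distances)).toarray()
    heads = [-1] * n
    edges = [(i, j) for i in range(n) for j in range(n) if mst[i, j] > 0]
    for i, j in edges:
        if pred_depths[i] < pred_depths[j]:
            heads[j] = i
        else:
            heads[i] = j
    return heads

def uas(pred_heads, gold_heads):
    # Unlabeled Attachment Score: percentage of words with the correct head.
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return 100.0 * correct / len(gold_heads)
```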

E Scaling Vector Properties
In this appendix, we elaborate on the properties of the Scaling Vector parameters in multi-task probing.

E.1 Parameters Distribution
The distribution of values in the Scaling Vector (Fig. 7) shows that the majority of parameters converge to zero; after training, they lie within the range 10⁻⁴⁰ to 10⁻³⁰. Therefore, the significant dimensions are clearly identifiable.

E.2 Separation of Information (Continued)
On the following pages, we present dimension overlap histograms and tables, as in Section 5.2, for the remaining pairs of objectives.