Document Representation via Generalized Coupled Tensor Chain with the Rotation Group Constraint

Continuous representations of linguistic structures are an important part of modern natural language processing systems. Despite their diversity, most existing log-multilinear embedding models are built on vector operations. However, these operations cannot precisely represent the compositionality of natural language because they lack order-preserving properties. In this work, we focus on a promising alternative based on embedding documents and words in the rotation group, obtained by generalizing the coupled tensor chain decomposition to the exponential family of probability distributions. In this model, documents and words are represented as matrices, and n-gram representations are composed from word representations by matrix multiplication. The proposed model is optimized via noise-contrastive estimation. We show empirically that capturing word order and higher-order word interactions allows our model to achieve the best results on several document classification benchmarks.


Introduction
The current progress in natural language processing is largely based on the success of representation learning for linguistic structures such as word, sentence, and document embeddings. The most promising and successful methods learn representations via two types of models: shallow log-multilinear models and deep neural networks. Despite the efficiency and interpretability of log-multilinear models, they cannot use higher-order linguistic features such as dependencies between subsequences of words. To avoid these disadvantages, nonlinear predictors are commonly used: recurrent, recursive, and convolutional neural networks, and more recently Transformers. Nevertheless, these methods achieve high performance at the cost of some interpretability and of increased computation time.
However, there are other ways to utilize higher-order interactions while preserving the efficiency of linear models. In this research, we focus on more data-oriented, i.e., more linguistically grounded and better interpretable, methods that can still achieve strong results on practical tasks. In particular, we investigate the matrix-space model of language, in which the semantic space consists of square real-valued matrices. The key idea behind this method is the realization of Frege's principle of compositionality through the order-preserving property of matrix multiplication. As shown by Rudolph and Giesbrecht (2010), this type of model can internally combine various properties of statistical and symbolic models of natural language and is therefore more flexible than vector space models.
Despite that, such models are usually hard to optimize on real data. To this end, Yessenalina and Cardie (2011) drew attention to the need for nontrivial initialization and proposed to learn the weights with a bag-of-words model. Asaadi and Rudolph (2017) used a complex multi-stage initialization based on unigram and bigram scoring. Both approaches address the sentiment analysis task. Recently, Mai et al. (2019) considered the problem of self-supervised continuous representation of words via matrix-space models. They optimized a modified word2vec objective function (Mikolov et al., 2013) and proposed a novel initialization that adds small isotropic Gaussian noise to the identity matrix.
In this paper, we use a similarity function between matrices similar to Mai et al. (2019), but instead of neural-network-style learning, we implement the model as a coupled tensor chain and impose the rotation group constraint. We focus on the document representation problem: given a document collection, we aim to find unsupervised document and word representations suitable for downstream linear classification tasks. To this end, we represent words, n-grams, and documents as matrices and train a self-supervised model. Our intuition is based on the observation that modeling only the interaction between single words and documents is insufficient for modeling relations between complex phrases and documents.
Contributions. The main contributions of this work can be summarized as follows:
• To the best of our knowledge, this is the first representation learning method based on the Riemannian geometry of matrix groups.
• We show that our approach to modeling compositionality and word order increases the quality of document embeddings on downstream tasks. Moreover, it is also more computationally efficient than neural network models.
• Our model achieves state-of-the-art performance on the task of representation learning for multiclass classification both on short and long document datasets.
Our implementation of the proposed model is available online.

Related work
Euclidean embedding models (Mikolov et al., 2013; Pennington et al., 2014) based on implicit factorization of the word-context co-occurrence matrix (Levy and Goldberg, 2014) are an important framework for current NLP. These models achieve relatively high performance in various NLP tasks such as text classification (Kim, 2014), named entity recognition (Lample et al., 2016), and machine translation (Cho et al., 2014).

Riemannian embedding models have shown promising results by extending embedding methods beyond Euclidean geometry. There are several models with negative sectional curvature, such as the Poincare (Dhingra et al., 2018; Nickel and Kiela, 2017) and Lorentz models (Nickel and Kiela, 2018). Furthermore, Meng et al. (2019) proposed a text embedding model based on spherical geometry.

Tensor decomposition models have been applied to many NLP tasks. In particular, Van de Cruys et al. (2013) proposed a Tucker model for decomposing the subject-verb-object co-occurrence tensor to compute compositionality. The setting closest to ours is the word embedding problem. In this direction, Sharan and Valiant (2017) explored the Canonical Polyadic Decomposition (CPD) of a word triplet tensor, Bailey and Aeron (2017) used a symmetric CPD of a pointwise mutual information tensor, and Frandsen and Ge (2019) extended the RAND-WALK model (Arora et al., 2016) to word triplets. The main drawback of the existing approaches is that they cannot properly preserve the word order information of long n-grams. For example, in the case of the CPD, we need separate parameters for each word depending on its position in the text; this restriction prevents an efficient use of the linguistic meaning of the tensor modes. The symmetric CPD completely loses word order information, and the Tucker model suffers from a parameter size that grows exponentially with the n-gram length. Our approach eliminates these disadvantages.

Problem and model description
In this section, we describe our model for the document representation task. We begin with a short introduction to multilinear algebra, then present the proposed document modeling framework as a coupled tensor decomposition, provide a detailed description of our model, and indicate the benefits and drawbacks related to the rotation group constraint.

Basic multilinear algebra
A tensor is a higher-order generalization of vectors and matrices to multiple indices. The order of a tensor is its number of dimensions, also known as modes or ways. An $N$-th order tensor is denoted $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, and its elements are denoted $x_{i_1 \ldots i_N}$. We can always represent a tensor $\mathcal{X}$ as a sum of rank-1 tensors, each defined as the outer product of $N$ vectors, i.e., $a^{(1)} \circ \cdots \circ a^{(N)}$ with $a^{(n)} \in \mathbb{R}^{I_n}$ for $n = 1, \ldots, N$. The minimal number of rank-1 tensors in such a sum defines the tensor rank.
In this research we focus on a particular type of tensor decomposition called the tensor chain (TC) (Perez-Garcia et al., 2007; Khoromskij, 2011; Zhao et al., 2019). It represents a tensor via the following sum of rank-1 tensors
$$\mathcal{X} = \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} a^{(1)}_{r_1 r_2} \circ a^{(2)}_{r_2 r_3} \circ \cdots \circ a^{(N)}_{r_N r_1},$$
where $R_1, \ldots, R_N$ are called the ranks of the TC model, and the element-wise form of this decomposition is given by
$$x_{i_1 \ldots i_N} = \mathrm{tr}\left(A^{(1)}_{i_1} A^{(2)}_{i_2} \cdots A^{(N)}_{i_N}\right),$$
where $A^{(n)}_{i_n} \in \mathbb{R}^{R_n \times R_{n+1}}$ is the $i_n$-th frontal slice of the core tensor $\mathcal{A}^{(n)} \in \mathbb{R}^{R_n \times I_n \times R_{n+1}}$ for $n = 1, \ldots, N$ with $R_{N+1} = R_1$, and $a^{(n)}_{r_n r_{n+1}} \in \mathbb{R}^{I_n}$ are the tubes of $\mathcal{A}^{(n)}$.
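The element-wise form above can be illustrated with a minimal NumPy sketch (not from the paper's implementation): each element of a tensor-chain tensor is the trace of a product of frontal slices of the cores. The helper `tc_element` and the tiny ranks/dimensions below are illustrative choices, and the dense reconstruction via `einsum` serves as a sanity check.

```python
import numpy as np

def tc_element(cores, idx):
    """Evaluate one element of a tensor in tensor-chain (tensor-ring) format.

    cores: list of N arrays, cores[n] has shape (R_n, I_n, R_{n+1}),
           with R_{N+1} = R_1 so the chain closes into a ring.
    idx:   tuple (i_1, ..., i_N) of the element to evaluate.

    Returns tr(A^(1)_{i_1} A^(2)_{i_2} ... A^(N)_{i_N}).
    """
    prod = cores[0][:, idx[0], :]          # frontal slice, shape (R_1, R_2)
    for core, i in zip(cores[1:], idx[1:]):
        prod = prod @ core[:, i, :]        # chain the slices by matrix product
    return np.trace(prod)                  # closing the ring = trace

# Sanity check against a dense reconstruction for a tiny random chain.
rng = np.random.default_rng(0)
ranks, dims = [2, 3, 2], [4, 5, 6]         # R_1, R_2, R_3 and I_1, I_2, I_3
cores = [rng.standard_normal((ranks[n], dims[n], ranks[(n + 1) % 3]))
         for n in range(3)]
# x_{ijk} = sum_{a,b,c} A1[a,i,b] A2[b,j,c] A3[c,k,a]
dense = np.einsum('aib,bjc,cka->ijk', cores[0], cores[1], cores[2])
```

Note that the full dense tensor has `I_1 * I_2 * I_3` entries, while the chain stores only the cores; the element-wise trace form is what the model optimizes against.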

Document modelling setting
In our research, we use the fact that the same text can be represented in different ways via different sets of n-grams with fixed lengths $\{W^n\}_{n=1}^N$, where $W^n = W \times \cdots \times W$ ($n$ times) and $W$ is the word set.
The main hypothesis is that the occurrence statistics of each of these n-gram sets contain some new information about the text that cannot be extracted from any other n-gram set. By combining information from all of these sets, we can achieve a better document embedding model.
Because the occurrence of a sequence of words depends on the occurrence of each word in that sequence, it is reasonable to treat the distribution of each fixed-length n-gram set separately. Otherwise, due to the dependence between the length of a word sequence and its frequency, short n-grams can downweight the importance of long n-grams, and the effect of higher-order interactions can become negligible. We also note that treating $p(w)$ as a distribution over unordered sets is a quite restrictive assumption on the structure of the model, given the importance of word order in language semantics. For all these reasons, we work with each n-gram distribution as a separate distribution of a single random variable, rather than defining a joint distribution over all n-gram sets and constructing the distribution of each n-gram set from it by marginalization.
Following this intuition, for each n-gram set $n$ we assign an appropriate joint distribution $p(w, d)$, where $w \in W^n$ and $d \in D$. We represent each co-occurrence of an n-gram $w = (w_1, \ldots, w_n)$ and a document $d \in D$ as the $(n+1)$th-order indicator tensor
$$e_{w_1} \circ \cdots \circ e_{w_n} \circ e_d \in \mathbb{R}^{|W| \times \cdots \times |W| \times |D|},$$
where $e_i$ denotes the $i$-th standard basis vector. Then we represent the probability $p(w, d)$ as the mean of these tensors,
$$\mathcal{X}^{(n)} = \frac{1}{M_n} \sum_{(w, d)} e_{w_1} \circ \cdots \circ e_{w_n} \circ e_d,$$
where the sum runs over all $M_n$ co-occurrences in the collection. Note that the co-occurrences of n-grams and documents define bipartite graphs between them, and $\mathcal{X}^{(n)}$ can be interpreted as the adjacency tensors of these graphs.
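As a concrete sketch of this construction (a toy illustration, not the paper's code), the empirical tensor for one n-gram length is just the table of normalized (n-gram, document) counts, since the mean of one-hot outer products places mass `count / total` at each observed index. The toy corpus and word ids below are invented for illustration.

```python
import numpy as np
from collections import Counter

# Toy corpus: documents as lists of word ids over a vocabulary of size W.
docs = [[0, 1, 2, 1], [2, 1, 0]]
W, n = 3, 2                                # vocabulary size, n-gram length

# Count (n-gram, document) co-occurrences with a sliding window.
counts = Counter()
for d, doc in enumerate(docs):
    for k in range(len(doc) - n + 1):
        counts[(tuple(doc[k:k + n]), d)] += 1

# Dense empirical tensor X^(n): mean of the one-hot outer products
# e_{w_1} o ... o e_{w_n} o e_d, i.e. normalized counts.
X = np.zeros((W,) * n + (len(docs),))
for (w, d), c in counts.items():
    X[w + (d,)] = c
X /= X.sum()                               # entries now estimate p(w, d)
```

In practice one would keep the sparse counter rather than materialize the dense tensor, which is exactly why the paper never constructs these tensors explicitly.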

Proposed model
Following the compositional matrix-space modelling approach, we represent each word $w \in W$ as a matrix $U_w \in \mathbb{R}^{R \times R}$, each n-gram $w \in W^n$ as $U_w = \prod_{k=1}^{n} U_{w_k}$, and each document $d \in D$ as a matrix $V_d \in \mathbb{R}^{R \times R}$. To measure the dependence between an n-gram $w$ and a document $d$, we use the Frobenius inner product, defined as $\langle U_w, V_d \rangle = \mathrm{tr}(U_w^\top V_d)$. We assume that embeddings organized according to this operation are suitable for linear classifiers.
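The composition and similarity can be sketched in a few lines of NumPy (an illustrative sketch; the words, the `random_rotation` helper, and the sampling scheme are our own choices, anticipating the rotation constraint discussed later): an n-gram matrix is the ordered product of its word matrices, and the score against a document is the Frobenius inner product.

```python
import numpy as np

rng = np.random.default_rng(1)
R = 4

def random_rotation(R, rng):
    """Draw a rotation matrix: Q factor of a Gaussian matrix, det fixed to +1."""
    q, r = np.linalg.qr(rng.standard_normal((R, R)))
    q *= np.sign(np.diag(r))               # make the QR factorization unique
    if np.linalg.det(q) < 0:
        q[:, [0, 1]] = q[:, [1, 0]]        # swap two columns: reflection -> rotation
    return q

# Word and document embeddings as R x R matrices.
U = {w: random_rotation(R, rng) for w in ['not', 'very', 'good']}
V_d = random_rotation(R, rng)

# An n-gram is the ordered matrix product of its word matrices, so
# 'not very good' and 'good very not' get different representations.
ngram = U['not'] @ U['very'] @ U['good']
score = np.sum(ngram * V_d)                # Frobenius inner product <U_w, V_d>
```

Because matrix multiplication is non-commutative, reordering the words changes the n-gram matrix, which is exactly the order-preserving property the vector sum lacks.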
The resulting model is the coupled tensor decomposition of the set of tensors $\{\mathcal{X}^{(n)}\}_{n=1}^N$, generalized to the exponential family of probability distributions (Collins et al., 2001; Yilmaz et al., 2011), by the corresponding set of tensor chain models with a restricted parameter set, leading to the following optimization problem:
$$\min_{\{U_w\}, \{V_d\}} \sum_{n=1}^{N} D_{\mathrm{KL}}\left(\mathcal{X}^{(n)} \,\big\|\, \hat{\mathcal{X}}^{(n)}\right),$$
where each Kullback-Leibler divergence can be expressed as
$$D_{\mathrm{KL}}\left(\mathcal{X}^{(n)} \,\big\|\, \hat{\mathcal{X}}^{(n)}\right) = \sum_{w, d} x^{(n)}_{wd} \log \frac{x^{(n)}_{wd}}{\hat{x}^{(n)}_{wd}}.$$
Our model represents $p(w, d)$ by using the following mean function
$$\hat{x}^{(n)}_{wd} = \frac{\exp\left(z^{(n)}_{wd}\right)}{\sum_{w', d'} \exp\left(z^{(n)}_{w'd'}\right)},$$
where $z^{(n)}_{wd}$ is one of the natural parameters, organized in the latent tensors $\mathcal{Z}^{(n)} \in \mathbb{R}^{|W| \times \cdots \times |W| \times |D|}$. These latent tensors contain the pointwise mutual information between n-grams and documents, and we assume that each of them has low tensor chain rank, i.e.,
$$z^{(n)}_{w_1 \ldots w_n d} = \mathrm{tr}\left(U_{w_1} \cdots U_{w_n} V_d^\top\right) = \langle U_{w_1} \cdots U_{w_n}, V_d \rangle.$$

Figure 1: Illustration of the representation of a document collection in a multi-view way as a collection of bipartite graphs. Each of these graphs represents the dependence (number of co-occurrences) between word strings of length $n$ and documents. The adjacency matrices $\tilde{X}^{(n)} \in \mathbb{R}^{|W|^n \times |D|}$ of these graphs can be appropriately tensorized into adjacency tensors $\mathcal{X}^{(n)} \in \mathbb{R}^{|W| \times \cdots \times |W| \times |D|}$, which are linked through a modified multinomial link function with latent tensors $\mathcal{Z}^{(n)} \in \mathbb{R}^{|W| \times \cdots \times |W| \times |D|}$. The latent tensors can be decomposed via the coupled tensor chain model. In our model, all core tensors $U$ ($V^\top$), which represent words (documents), are additionally restricted to the rotation group.

Intuition from geometric interpretation
If we avoid generative assumptions (Saunshi et al., 2019), our task can be interpreted as maximizing the similarity between a document $d$ and the n-grams from its document distribution $p(w|d)$ while simultaneously minimizing the similarity between this document and n-grams from the common n-gram distribution $p(w)$. As shown in previous works (Kumar and Tsvetkov, 2019; Meng et al., 2019), enforcing spherical geometry constraints is a promising choice for tasks focusing on directional similarity. To do so, it is reasonable to constrain our model to the orthogonal group. In this case, the Frobenius inner product becomes a proper similarity measure, and the sequential matrix product always preserves a fixed norm and the group structure (i.e., the invertibility of matrix multiplication). Due to the group structure, our model has the interesting property of uniquely determining each word in an n-gram from its left and right aggregated context matrices and the overall n-gram matrix. However, the orthogonal group is a disjoint union of two connected components: the set of rotations and the set of reflections $\{A \mid A^\top A = A A^\top = I, \det(A) = -1\}$. We constrain our model parameters to rotation matrices, since the product of any number of rotations is always a rotation, i.e., the rotation set forms a matrix group, whereas only the product of an even number of reflections is a rotation.
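Both properties — closure of the rotation set under multiplication and unique recovery of a word from its contexts — can be checked numerically. The sketch below (our own check, not the paper's code; the `rotation` helper is an illustrative sampler) recovers the middle word of a 5-gram from the full n-gram matrix and the left/right context products, using the fact that orthogonal matrices are inverted by transposition.

```python
import numpy as np

rng = np.random.default_rng(2)
R = 4

def rotation(R, rng):
    """Sample a rotation matrix via QR with a determinant sign fix."""
    q, r = np.linalg.qr(rng.standard_normal((R, R)))
    q *= np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, [0, 1]] = q[:, [1, 0]]
    return q

words = [rotation(R, rng) for _ in range(5)]
ngram = np.linalg.multi_dot(words)

# Closure: the product of rotations is again a rotation.
det = np.linalg.det(ngram)

# Group structure: recover the middle word from the left/right aggregated
# context matrices and the full n-gram matrix, inverting by transposition.
left = words[0] @ words[1]
right = words[3] @ words[4]
middle = left.T @ ngram @ right.T
```

This inversion-by-transposition is what fails for general (non-orthogonal) word matrices unless they happen to be well conditioned, which motivates the constraint.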

Noise-contrastive estimation
In practice we do not need to construct the set of tensors $\{\mathcal{X}^{(n)}\}_{n=1}^N$ explicitly. Instead, since each $\mathcal{X}^{(n)}$ represents a higher-order frequency table, we can optimize the sum of maximum likelihood tasks:
$$\max \sum_{n=1}^{N} \sum_{(w, d)} \log \hat{x}^{(n)}_{wd},$$
where the inner sum runs over the observed co-occurrences. Usually, we have a huge amount of data, which makes the problem of computing the partition function for each $\hat{x}^{(n)}_{wd}$ intractable on many current computing architectures. We can avoid this problem by using noise-contrastive estimation (Gutmann and Hyvärinen, 2012) for the conditional model (Ma and Collins, 2018). Similar to prior work, we construct negative samples within each batch by connecting non-linked n-grams and documents. Finally, for the parameter set we formulate the optimization problem as
$$\min_{\{U_w\}, \{V_d\}, \{\kappa^{(n)}\}} \sum_{n=1}^{N} \mathcal{R}^{(n)},$$
where each risk function is equal to
$$\mathcal{R}^{(n)} = -\mathbb{E}_{(w, d)}\left[\log \sigma\left(\kappa^{(n)} \langle U_w, V_d \rangle\right)\right] - \mathbb{E}_{(w, d')}\left[\log \sigma\left(-\kappa^{(n)} \langle U_w, V_{d'} \rangle\right)\right],$$
with $\sigma$ the logistic function and $d'$ drawn from the negative samples. We add the concentration parameters $\kappa^{(n)}$ to the loss function to overcome the problem of fixed scale; this makes the model more flexible in representing sharp distributions. Since each n-gram distribution has its own scale, it is reasonable to use a different $\kappa^{(n)}$ for each n-gram distribution.
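A minimal sketch of such an NCE risk for one n-gram set is given below. This is our own hedged reading of the objective, not the paper's released code: positives are (n-gram, own document) pairs, negatives pair the same n-gram with other documents from the batch, and `kappa` is the learned concentration parameter; the function name and batch layout are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_risk(ngram_mats, doc_mats, kappa, rng, num_neg=2):
    """Sketch of the NCE risk for one n-gram set.

    ngram_mats[i] is the matrix of the i-th n-gram in the batch,
    doc_mats[i] the matrix of the document it occurred in.
    Negatives pair the n-gram with other documents from the batch.
    """
    loss = 0.0
    B = len(ngram_mats)
    for i in range(B):
        pos = kappa * np.sum(ngram_mats[i] * doc_mats[i])    # <U_w, V_d>_F
        loss -= np.log(sigmoid(pos))
        others = [k for k in range(B) if k != i]
        for j in rng.choice(others, size=num_neg):
            neg = kappa * np.sum(ngram_mats[i] * doc_mats[j])
            loss -= np.log(sigmoid(-neg))                    # push negatives apart
    return loss / B

rng = np.random.default_rng(0)
mats = [rng.standard_normal((3, 3)) for _ in range(4)]
docs = [rng.standard_normal((3, 3)) for _ in range(4)]
loss = nce_risk(mats, docs, kappa=1.0, rng=rng)
```

Note how no partition function over all documents appears: only the batch's negative pairs are scored, which is the computational point of NCE here.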

Optimization setup

N-gram construction
We construct n-grams from the text corpora by sliding a window of length $n$ (from 1 to $N$) over each document.
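The sliding-window construction amounts to a few lines of Python (a straightforward sketch; the function name and toy sentence are our own):

```python
def extract_ngrams(tokens, max_n):
    """All n-grams of length 1..max_n via a sliding window over one document."""
    return {n: [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            for n in range(1, max_n + 1)}

grams = extract_ngrams(['to', 'be', 'or', 'not', 'to', 'be'], 3)
```

A document of length L thus contributes L - n + 1 n-grams of each length n, with repeated n-grams kept (their multiplicity carries the co-occurrence counts).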

Parameters initialization
Initialization from the uniform distribution on the Stiefel manifold (Saxe et al., 2014) is one of the promising ways to initialize deep neural networks. To initialize parameters only from the rotation component of the Stiefel manifold, we can swap two columns of a parameter matrix whenever its determinant equals $-1$. However, this initialization can be suboptimal: the resulting rotation matrices can be far from each other, and due to the nontrivial structure of the loss function on this manifold, we can get stuck in a local minimum. To overcome this problem, we fix a particular point on the manifold for all matrices and perform a small movement from this point in an arbitrary direction. Concretely, we initialize each parameter matrix as $\mathrm{qf}(I + \epsilon G)$, the Q factor of the QR decomposition of the identity matrix perturbed by small Gaussian noise $G$, applying the column swap if needed. We initialize all concentration parameters as $\kappa^{(n)} = u R$, where $u \sim \mathcal{U}(0.9, 1.1)$ and $n = 1, \ldots, N$.
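A sketch of this initialization (our reading of the strategy, with `eps` and the helper name as illustrative assumptions): perturb the identity, retract onto the manifold via QR, and fix the determinant sign if necessary.

```python
import numpy as np

def init_rotation_near_identity(R, eps, rng):
    """Initialize a parameter matrix as a rotation near the identity:
    perturb I with small Gaussian noise, retract back onto the manifold
    via QR, and fix the determinant sign by a column swap if needed."""
    q, r = np.linalg.qr(np.eye(R) + eps * rng.standard_normal((R, R)))
    q *= np.sign(np.diag(r))               # canonical QR: positive diagonal of R
    if np.linalg.det(q) < 0:
        q[:, [0, 1]] = q[:, [1, 0]]        # reflection -> rotation
    return q

rng = np.random.default_rng(3)
R = 5
U0 = init_rotation_near_identity(R, eps=0.01, rng=rng)

# Concentration parameters: kappa^(n) = u * R with u ~ U(0.9, 1.1).
kappa = rng.uniform(0.9, 1.1) * R
```

Scaling the initial concentration by R matches the scale of the Frobenius inner product, whose magnitude for two rotations is bounded by R.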

Riemannian optimization
We solve our problem on the product manifold of the rotation group and the nonnegative orthant by simultaneous optimization of the model parameters $\{U_w\}$, $\{V_d\}$, and $\{\kappa^{(n)}\}$. Let $\mathcal{M}$ be a real smooth manifold and $L : \mathcal{M} \to \mathbb{R}$ a smooth real-valued function over parameters $\theta \in \mathcal{M}$. Riemannian gradient descent (Gabay, 1982; Absil et al., 2008) is based on two sequential steps. First, we compute the Riemannian gradient by orthogonally projecting the Euclidean gradient onto the tangent space at the point where the Euclidean gradient is computed:
$$\xi_t = \mathrm{proj}_{\theta_t}\left(\nabla L(\theta_t)\right).$$
Then we move along the manifold via a specific curve called a retraction $R_{\theta_t}$:
$$\theta_{t+1} = R_{\theta_t}\left(-\alpha_t \xi_t\right).$$
For optimization on the rotation group, the orthogonal projector of a matrix $G \in \mathbb{R}^{R \times R}$ onto the tangent space at a point $A \in SO(R)$ is given by
$$\mathrm{proj}_A(G) = A \, \frac{A^\top G - G^\top A}{2},$$
and for the movement along the manifold in this direction we use the QR-based retraction
$$R_A(\xi) = \mathrm{qf}(A + \xi),$$
where $\mathrm{qf}(\cdot)$ denotes the Q factor of the QR decomposition. We choose the QR-based retraction because it allows Riemannian Adagrad to achieve the fastest convergence in our experiments in comparison with the Cayley retraction (first-order), the polar retraction (second-order), and the geodesic (matrix exponential).
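The two steps above can be sketched directly in NumPy (a minimal sketch of projection plus QR retraction on SO(R); the function names and step size are our own choices, and the full optimizer in the paper is Riemannian Adagrad rather than plain gradient descent):

```python
import numpy as np

def proj_so(A, G):
    """Project a Euclidean gradient G onto the tangent space of SO(R) at A:
    proj_A(G) = A (A^T G - G^T A) / 2."""
    return A @ (A.T @ G - G.T @ A) / 2.0

def qr_retract(A, xi):
    """QR-based retraction: map the tangent step A + xi back onto the manifold."""
    q, r = np.linalg.qr(A + xi)
    return q * np.sign(np.diag(r))         # sign fix keeps the map well defined

def riemannian_gd_step(A, egrad, lr):
    """One step of Riemannian gradient descent on SO(R)."""
    return qr_retract(A, -lr * proj_so(A, egrad))

rng = np.random.default_rng(4)
R = 4
A = np.eye(R)                              # current point on SO(R)
A_next = riemannian_gd_step(A, rng.standard_normal((R, R)), lr=0.1)
```

The key invariant is that the iterate never leaves the manifold: the projected step lies in the tangent space, and the retraction pulls the update back onto SO(R).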

Algorithm 1 Optimization algorithm for RDM
Input: Learning rates α and β, number of iterations T , maximum n-gram length N .

Computational efficiency
The computational complexity of our model depends on the $R^3$ cost of multiplying $R \times R$ matrices and the $\frac{4}{3} R^3$ cost of the QR decomposition (Layton and Sussman, 2020; Trefethen and Bau, 1997). Since these matrices have $d = R^2$ elements, we can express the complexity of our model in terms of the embedding dimension: the time complexity is $O(k d^{1.5})$, where $k$ is the size of the context window. As shown in Table (2), our model has computational benefits in comparison with the Transformer due to the linear dependence of its time complexity on the word sequence length. In comparison with Bi-LSTM models like ELMo, our model has lower complexity in the embedding dimension and can be computed in parallel using the associativity of matrix multiplication. Although our model has a higher theoretical time complexity than vector space models, the real gap between them is relatively small at an ordinary embedding dimension (~400).

Table 2: Time complexity of the embedding methods (columns: Method, Time).

Experimental setup

Datasets. We use the 20 Newsgroups dataset and the ArXiv dataset (He et al., 2019). We choose these datasets for our benchmarks because they differ significantly in average document length. This implies that the statistics of long n-grams also differ between these datasets: in the case of the ArXiv dataset, the n-gram statistics are significantly better converged than in the case of 20 Newsgroups. This allows us to hypothesize that matrix-space models should overfit less on the ArXiv dataset. We fix the document embeddings and optimize multinomial logistic regression with the SAGA optimizer and l2-norm regularization. Instead of a test set, we use nested 10-fold cross-validation to estimate statistical significance using the Wilcoxon signed-rank test (Japkowicz and Shah, 2011; Dror et al., 2018). For each fold, we estimate the l2-regularization hyperparameter on a 10-point logarithmic grid from 0.01 to 100 using an additional 10-fold cross-validation with the macro-averaged F1 score. For text preprocessing, we use CountVectorizer from the Scikit-learn package. Additionally, we remove words that occur in the NLTK stopword list or occur in only one document. In the case of the ArXiv dataset, we use half of the dataset and keep only documents in the range from 1000 to 5000 words (shorter documents are removed and longer documents are truncated to the first 5000 words).

Dataset         #cls   |W|      |D|     #w
20 Newsgroups   20     75752    18846   180
ArXiv           6      251108   16371   3829

Baselines. We compare our model with different vector space models: paragraph vectors (Le and Mikolov, 2014), weighted combinations of word2vec skipgram vectors (Mikolov et al., 2013) (average, TF-IDF, and SIF (Arora et al., 2017)), Doc2vecC (Chen, 2017), sent2vec (Pagliardini et al., 2018), and the recently proposed JoSe (Meng et al., 2019). The comparison with the last two of these models is more informative than with the others because of their similarity to ours (sent2vec can use n-gram information, and JoSe is also based on a spherical embedding geometry). Due to the large number of possible hyperparameter values for each model, we used the default values proposed by the authors of these models or in subsequent studies, as in the case of paragraph vectors (Lau and Baldwin, 2016). We modify only the min count to 1 and set the window size equal to the number of negative samples for the paragraph2vec and word2vec models, because this gives better results for these methods in our experiments. We set the n-gram number to 1 for sent2vec, because other values do not improve the results. To preserve fairness of comparison, we use a fixed embedding dimension, number of negative samples, and number of epochs for all models, including ours. These values mimic the usual values of these hyperparameters in practice. We compare our model not only with vector models but also with neural network models. We use the 5.5B ELMo (Peters et al., 2018) version, which is pre-trained on Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008-2012; the ELMo embedding dimension is 1024. We also use 768-dimensional embedding vectors from the BERT model "bert-base-uncased". Following Devlin et al. (2019), we take the last-layer hidden state corresponding to the [CLS] token as the aggregate document representation.
If a document is longer than 512 tokens, we split it into 512-token parts and average the representations of these parts. Finally, we add Sentence-BERT (Reimers and Gurevych, 2019) to the baseline models. This model is fine-tuned on the SNLI and MultiNLI datasets for sentence embedding generation. We use 768-dimensional embedding vectors from the model "bert-base-nli-mean-tokens".
We do not fine-tune the BERT and ELMo models on our datasets, because in our experiments this does not have any positive effect on their final performance. However, this is not true for Sentence-BERT: fine-tuning slightly improves its performance on 20 Newsgroups (50 epochs with a maximum-margin triplet loss).
Note that this experimental design gives some advantages to the neural network models over the log-multilinear models, but it is more consistent with the ordinary practical use of Transformer and RNN models. Nevertheless, the next experiment shows that log-multilinear models can still outperform pre-trained neural networks.
Ablation study. For the ablation study, we use different settings of our model. RDM denotes the rotation document model, i.e., our model. RDM-R denotes our model without the rotation group constraints. By (1) we denote the model that utilizes only unigram-document co-occurrence information; by (3) and (5) we denote the models that utilize information from (1, 2, 3)-grams and (1, 2, 3, 4, 5)-grams, respectively. For RDM, we use 1e-2 and 1e-3 as the learning rates of Radagrad and projected Adagrad, respectively, and we use λ = 15 for 20 Newsgroups, changing λ to 5 for the ArXiv dataset due to the smaller number of epochs in this experiment. For RDM-R, we use 1e-2 as the learning rate for Adagrad.

Experimental results and comparison of performance
Comparison to baselines. As one may observe in Table (4), our models yield results comparable to or outperforming the baseline methods, including the simpler log-multilinear models (e.g., skipgram) and more complex models featuring nonlinear transformations, such as recurrences (ELMo) and transformer blocks (BERT, Sentence-BERT). More specifically, on 20 Newsgroups, the RDM (5) model yields the best results, significantly outperforming all the listed baseline approaches. Interestingly, contrary to 20 Newsgroups, on the ArXiv dataset all RDM variants with any number of n-gram sets show strong results and significantly outperform the other models. Weighted combinations of skipgram vectors and the doc2vecC model achieve the results closest to ours. This confirms that neural models like ELMo and BERT are not the best choice for all datasets and that log-multilinear models can outperform them. We can see that the performance of nonlinear models increases on a large document dataset, and BERT can outperform some of the log-multilinear approaches (PV-DBOW, JoSe, and sent2vec), but its result is still not at the top level.

Table 4: Text classification performance on the 20 Newsgroups (short documents) and on the modified ArXiv (long documents) datasets. We fix the number of epochs to 50, the embedding dimension to 400, and the number of negative samples to 15 for all models on the 20 Newsgroups. On the ArXiv dataset, we use the same hyperparameters except for the number of epochs, which is equal to 5. We use macro-averaging for Precision, Recall, and F1.
Comparison between our models. Observing the results, we can see that our model increases classification performance when an n-gram set of greater length is added. Both the model with rotation group constraints and the model without them share this property. Nevertheless, the model without rotation constraints is less robust to the noisy statistics of the small document dataset and performs worse than the PV-DBOW model, while the rotation group model outperforms all other models. However, once we move to a dataset with a larger average document length (ArXiv), RDM-R performs better than all other models, including RDM. This confirms our hypothesis that the availability of good statistics for long n-grams is of critical value for matrix-space models. Due to the strong associativity between sets of n-grams, our model needs more parameters to approximate all the co-occurrences. RDM-R has a higher number of degrees of freedom than RDM ($R^2$ vs. $R(R-1)/2$), which we think allows RDM-R to outperform RDM in this experiment. This intuition is also confirmed by the fact that RDM (1) and RDM-R (1) have the same performance level, and only when we increase the number of n-gram sets from 1 to 3 does RDM-R achieve better performance. However, when we increase the number of n-gram sets from 3 to 5, both models stay at the same performance level. This is a sign that we need a bigger embedding dimension to achieve even better results.
In summary, for short documents it is better to use RDM. For long document datasets with a restriction on the embedding dimension, we suggest relaxing the rotation group constraints; this allows the model to use more degrees of freedom to fit the data precisely.

Conclusion
In this paper, we proposed a novel unsupervised representation learning method based on the generalized tensor chain with rotation group constraints, which can utilize higher-order word interactions while preserving most of the computational efficiency and interpretability of vector-based models. Our model achieves state-of-the-art results on the document classification benchmarks on the 20 Newsgroups and modified ArXiv datasets. Further research could focus on adding tensor kernel functions to the model to eliminate the dependence on the embedding dimension. It would also be interesting to augment this type of model with different loss functions based not only on n-gram-document interactions but also on word-word interactions from a knowledge graph or document-document interactions from the citation graph of the documents.