Hidden Schema Networks

Large, pretrained language models infer powerful representations that encode rich semantic and syntactic content, albeit implicitly. In this work we introduce a novel neural language model that enforces, via inductive biases, explicit relational structures which allow for compositionality onto the output representations of pretrained language models. Specifically, the model encodes sentences into sequences of symbols (composed representations), which correspond to the nodes visited by biased random walkers on a global latent graph, and infers the posterior distribution of the latter. We first demonstrate that the model is able to uncover ground-truth graphs from artificially generated datasets of random token sequences. Next, we leverage pretrained BERT and GPT-2 language models as encoder and decoder, respectively, to infer networks of symbols (schemata) from natural language datasets. Our experiments show that (i) the inferred symbols can be interpreted as encoding different aspects of language, such as topics or sentiments, and that (ii) GPT-2-like models can effectively be conditioned on symbolic representations. Finally, we explore training autoregressive, random walk "reasoning" models on schema networks inferred from commonsense knowledge databases, and using the sampled paths to enhance the performance of pretrained language models on commonsense If-Then reasoning tasks.


Introduction
Many developmental and causal theories of human cognition are predicated on relational structures of knowledge that naturally exhibit compositionality. Semantic content is intrinsically relational, as one is only able to explain a given unit of knowledge (such as a concept, word or perception) insofar as there are other units of knowledge which relate to it (Block, 1986). Thus we can partially construe a concept through its relationships to other concepts (as when we say "a dog is an animal that barks"), just as we can partially construe it through its relationships to our perceptions (when we say "that is a dog", whilst pointing to a dog on the street) or the words we use (when we use the word dog to refer to the concept dog). Likewise, we can partially construe words not only through their relationships to concepts or percepts, but also through their relationships to other words, as words that occur in the same context tend to have similar meanings (Harris, 1954; Firth, 1957). Note that it is precisely this contextual semantic content of words that we have explicit access to when processing our raw text datasets. On the other hand, generalization, reasoning and understanding seem to be inevitably tied to the compositional nature of knowledge. Indeed, the ability to compose a set of knowledge units (and their relations) into new, more complex relational units, which can be deployed to understand and reason about unseen data (a feature usually referred to as combinatorial generalization), is regarded as key to human-level intelligence (Fodor and Pylyshyn, 1988; Fodor and Lepore, 2002; Lake et al., 2017; Battaglia et al., 2018). Relational structures allowing for compositionality thus seem to comprise not a sufficient, but a necessary attribute of any representation scheme that strives for the generalization power of human cognition.

Figure 1 (caption fragment): [...] GPT-2 model of M layers, with a pseudo-self-attention mechanism to attend to the schema $e_{j_1:j_L}$. Please see the supplementary material for details. The "c" operation labels concatenation. Right: Encoder architecture as a BERT model, followed by a single Transformer block. In both the center and right figures, purple shaded blocks represent submodules with pretrained parameters. Pink shaded blocks represent submodules with randomly initialized parameters.
From the computational side, if one is to inform any modern machine learning model with such structural characteristics, one will initially encounter the problem of finding suitable primitives or data structures. In natural language processing (NLP), for example, it has become commonplace to leverage distributed continuous representations of words (Bengio et al., 2003) for different downstream tasks. Such representations are trained to encode average contextual semantics (precisely the kind of semantic content typical of the word co-occurrence relations we mentioned above) into a semantic space, which allows meaning to change continuously within it (Mikolov et al., 2013). Yet, despite earlier attempts (Mitchell and Lapata, 2008), it is unclear whether such representations can be meaningfully composed into representations of, say, unseen sentences and thus mimic the compositional character of natural language. More recently, contextualized continuous word representations inferred by deep learning architectures have shown spectacular results in many NLP tasks (Radford et al., 2018; Devlin et al., 2018; Radford et al., 2019; Brown et al., 2020). Their success stems from those models' ability to infer flexible representations through, inter alia, raw, massive datasets, data-scalable attention mechanisms and minimal inductive biases (Vaswani et al., 2017). These representations are known to not only contain rich contextual word semantics, but also consistently encode sentence-level grammar (Hewitt and Manning, 2019), and they seem to display some compositional properties too (Hupkes et al., 2020). Yet, these representations lack semantic interpretability, which renders them difficult to manipulate.
In this work we develop a generative language model, the Hidden Schema Network model (HSN), that enforces, via inductive biases, a discrete, relational structure for sentence representation which allows for compositionality, while exploiting the well-known advantages of attention models and contextualized, pretrained continuous representations. Specifically, we assume there is a set of global symbols whose relations are encoded into a latent graph. We then use the VAE framework (Kingma and Welling, 2013; Rezende et al., 2014) to encode sentences into sequences of symbols, which correspond to the nodes visited by biased random walkers on the latent graph, and further infer the posterior distribution of the latter. We leverage pretrained BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019) models as encoder and decoder, respectively, and train our model on language modelling tasks.
Our main contribution is then an exploration of a novel way to integrate discrete (symbols), relational (graphs) and continuous (neural representations) machine learning components into an end-to-end, differentiable representation learning algorithm for natural language modelling.

Related Work
In cognitive psychology, a schema is (roughly) defined as a large, complex unit of knowledge representing what is typical of a group of instances (Bartlett, 1932; Piaget, 1948; Rumelhart, 2017). Marvin Minsky's frames (Minsky, 1974, 1975) are similar in function to a schema, but perhaps more easily characterized in terms of data structures. We use these terms in a loose fashion, however, our aim being only to be suggestive of the general problem of knowledge representation (Thagard, 1984). We are in fact concerned with representation schemes for natural language processing. Within the context of linguistics, Jackendoff (1978) argues that there must be a level of representation (the so-called conceptual structures) at which information conveyed by language must be compatible with information coming from sensory systems. Conceptual structures must, he goes on, be able to represent all the conceptual distinctions made by natural language, and provide some degree of compositionality. Earlier computational models implementing (some kind of) conceptual structure rely on either hand-coded (semantic) network representations (Quillian, 1966; Collins and Quillian, 1969; Brachman, 1977) or hand-coded databases (McClelland and Rogers, 2003). Other works focus instead on learning semantic representations directly from text data via topic models (Griffiths et al., 2007b), and even infer latent concept graphs through nonparametric priors (Chambers et al., 2010). In sharp contrast with these works, modern, neural-based language models focus mainly on learning semantic representations of sentences or documents that are continuous, usually leaning on a VAE framework (Bowman et al., 2015). In fact, VAE models have quite recently been deployed to infer sentence-level latent representations from large, pretrained language models (Wang and Wan, 2019; Li et al., 2020).
We build on top of these ideas, while trying to connect back with models of conceptual structure, which necessarily involve discrete representations. The latter have also been successfully inferred from neural language models before (Zhao et al., 2018;Kaiser and Bengio, 2018;Kaiser et al., 2018).

Hidden Schema Networks
We address the problem of learning the joint probability distribution over sequences of words, while inferring interpretable representations capturing their semantics. Neural autoregressive language models approximate such distributions with a product over conditional probabilities, such that

$$p_\theta(x_{1:T}) = \prod_{i=1}^{T} p_\theta(x_i \mid x_{<i}), \tag{1}$$

where $x_{1:T} = (x_1, x_2, \dots, x_T)$ labels the sequence of words in question, and each conditional is given by (the pdf of) a categorical distribution over some vocabulary of size $V$. The class probabilities of these conditionals are generally computed as $\pi_i = \mathrm{softmax}(W \cdot h_\theta(x_{<i}))$, with $W \in \mathbb{R}^{V \times D}$ trainable, and $D$ the output dimension of $h_\theta$, a deep neural network model with parameter set $\theta$ (Bengio et al., 2003). Models of this form allow for tractable estimation of and sampling from either the joint distribution, or any product of the conditionals in Eq. 1. Indeed, their recent implementation in terms of large-capacity, self-attention architectures such as GPT-2 (Radford et al., 2019) has been shown to generate syntactically correct, diverse and fluent text. Yet, the output representations $(h_1, h_2, \dots, h_T)$ of these models lack both semantic and syntactic interpretability, which ultimately renders the generation of text difficult to control. In order to improve on both the interpretability and controllability of neural language models, Bowman et al. (2015) introduced a VAE language model framework, which conditions the joint distribution over word sequences on an additional latent, continuous representation. The Hidden Schema Network model instead infers sentence-level, latent, discrete representations which can, at least in principle, capture the relational and compositional features of semantic content.
Let us assume there is a set $E = \{e_1, e_2, \dots, e_K\}$ of $K$ symbols that encode some high-level, abstract semantic content of natural language. Let this set be the set of nodes of a hidden (semantic) graph $G$, with adjacency matrix $A$, so that adjacent (connected) symbols are semantically related. These symbols can generically be defined as learnable, dense vectors in $\mathbb{R}^S$, for some dimension $S$. Without loss of generality, however, we opt below for simple indicator ("one-hot") vectors of dimension $K$ instead. We define a schema $e_{j_1:j_L}$ as a sequence of $L \ll K$ symbols $(e_{j_1}, e_{j_2}, \dots, e_{j_L})$, where the indices $j_1, \dots, j_L$ label a subset of connected nodes in $G$. Accordingly, we refer to $G$ as a schema network. The symbols composing the schemata are chosen through an $L$-step stochastic process conditioned on $G$. Partially motivated by research on random walks and human memory search (Griffiths et al., 2007a; Abbott et al., 2012), as well as by the simplicity of their inference, we choose to compose the schemata via biased random walk processes on $G$, and leave exploring different schema processes for future work. Let us now specify the generative model in detail.

Generative Model
We write the joint probability over a sequence $x_{1:T}$ of $T$ words, together with the hidden graph $G$, as

$$p_\theta(x_{1:T}, z_{1:L}, A) = p_\theta(x_{1:T} \mid z_{1:L})\, p(z_{1:L} \mid A)\, p(A), \tag{2}$$

where $z_{1:L}$ labels the sequence $z_1, \dots, z_L$ of $K$-dimensional, one-hot vectors representing the node labels $j_1, \dots, j_L$ visited by a random walker on $G$, and $\theta$ denotes the trainable model parameters.
Note that we introduced the one-hot representation of $j_i$ for notational convenience, as shall become evident below. Next, we specify the different components of Eq. 2.
Prior over (global) graph. A prior on the adjacency matrix $p(A)$ allows us to control the topological properties of $G$. One can choose, for example, random graph models whose degree distribution asymptotically follows a power law (Barabási and Albert, 1999), or unbiased, maximum entropy graph models, with respect to some given constraints (Park and Newman, 2004). For the sake of simplicity we choose a Bernoulli (Erdös-Rényi) random graph model (Solomonoff and Rapoport, 1951; Erdös and Rényi, 1959), for which each link $a_{ij}$ is defined via an independent Bernoulli variable with some fixed, global probability $p \in [0, 1]$, so that

$$p(A) = \prod_{i,j} p^{a_{ij}} (1 - p)^{1 - a_{ij}}. \tag{3}$$

The probability $p$ will be a hyperparameter of our model.
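To make the Bernoulli graph prior concrete, the following is a minimal sketch of sampling an adjacency matrix and evaluating its log-probability. It assumes an undirected graph without self-loops, which the text does not state explicitly; the helper names `sample_adjacency` and `log_prior` are illustrative and not taken from the authors' code.

```python
import math
import random


def sample_adjacency(num_nodes, p, rng):
    """Sample a symmetric 0/1 adjacency matrix with independent Bernoulli(p) links."""
    a = [[0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            # Assumption: undirected graph, no self-loops.
            link = 1 if rng.random() < p else 0
            a[i][j] = a[j][i] = link
    return a


def log_prior(a, p):
    """Log-probability of the links, one Bernoulli factor per unordered node pair."""
    n = len(a)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.log(p) if a[i][j] else math.log(1.0 - p)
    return total


rng = random.Random(0)
A = sample_adjacency(6, 0.5, rng)
print(log_prior(A, 0.5))  # with p = 0.5 every pair contributes log(0.5)
```

With $p = 0.5$ all graphs on the same node set are equally likely, which is why this value acts as an uninformative choice of the hyperparameter.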
Prior over random walks. The probability $p(z_{1:L} \mid A)$ of a random walk over the nodes of $G$ can generally be written as

$$p(z_{1:L} \mid A) = p(z_1) \prod_{i=2}^{L} p(z_i \mid z_{i-1}, A), \tag{4}$$

where $p(z_1)$ labels the probability of selecting $j_1$ as the starting point of the walk, and it is given by (the pdf of) a categorical distribution over the nodes of $G$, with class probabilities $\rho = (\rho_1, \dots, \rho_K)$. Similarly, $p(z_i \mid z_{i-1}, A)$ labels the conditional probability of jumping from $j_{i-1}$ to $j_i$, which we define in terms of a $K \times K$ transition probability matrix $P$; in the one-hot notation, $p(z_i \mid z_{i-1}, A) = \prod_{k,j} (P_{kj})^{z_i^k z_{i-1}^j}$. Now, to allow for biased random walks, let each node $k$ on $G$ be given a positive weight $f_k$, so that the probability of jumping from $j$ to $k$ is proportional to $f_k A_{kj}$. We then write the transition probability matrix as

$$P_{kj} = \frac{f_k A_{kj}}{\sum_{m=1}^{K} f_m A_{mj}}, \tag{5}$$

so that the motion of the random walker is biased according to the node weights $f_k$. These weights should be understood as encoding aspects of the diffusion dynamics that are independent of the topology of the graph (Gómez-Gardenes and Latora, 2008; Lambiotte et al., 2011). Three comments are in order. First, note that one can also train the prior over walks by making the vectors $\rho$ and $f$ learnable. Second, setting the node weights $f = \mathbb{1}$ and the class probabilities $\rho = \frac{1}{K}\mathbb{1}$, with $\mathbb{1}$ the $K$-dimensional vector of ones, yields a uniform random walk over $G$, i.e. a process in which the walker has equal probability of jumping to any of its neighbors. Third, one can also allow for inhomogeneous random walks in which the probability matrix changes at each step of the random walk. Such processes can be parameterized with a sequence of weights $f^{[1]}, f^{[2]}, \dots, f^{[L-1]}$.
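The biased walk described above (jump from node $j$ to node $k$ with probability proportional to $f_k A_{kj}$) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the 4-node ring graph, the weights and the function names are assumptions chosen for the example.

```python
import random


def transition_matrix(adj, weights):
    """P[k][j]: probability of jumping from node j to node k, proportional to f_k * A_kj."""
    n = len(adj)
    P = [[0.0] * n for _ in range(n)]
    for j in range(n):
        norm = sum(weights[m] * adj[m][j] for m in range(n))
        if norm == 0:
            raise ValueError(f"node {j} has no neighbours to jump to")
        for k in range(n):
            P[k][j] = weights[k] * adj[k][j] / norm
    return P


def sample_walk(P, rho, length, rng):
    """Sample a walk: start from Categorical(rho), then follow the columns of P."""
    nodes = list(range(len(rho)))
    walk = [rng.choices(nodes, weights=rho)[0]]
    for _ in range(length - 1):
        j = walk[-1]
        column = [P[k][j] for k in nodes]
        walk.append(rng.choices(nodes, weights=column)[0])
    return walk


# Hypothetical example: ring graph 0-1-2-3-0, with node 1 weighted three times higher.
adj = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
f = [1.0, 3.0, 1.0, 1.0]
rho = [0.25, 0.25, 0.25, 0.25]
P = transition_matrix(adj, f)
walk = sample_walk(P, rho, 5, random.Random(0))
print(P[1][0], walk)  # from node 0 the walker prefers node 1 over node 3
```

Setting all weights to one recovers the uniform walk mentioned in the second comment, since each column of `P` then spreads its mass equally over the neighbours.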
Decoder and likelihood. Just as in Eq. 1, we define the joint probability over word sequences as a product of conditional probabilities, this time conditioned on the schema $e_{j_1:j_L}$ too, that is

$$p_\theta(x_{1:T} \mid z_{1:L}) = \prod_{i=1}^{T} p_\theta(x_i \mid x_{<i}, e_{j_1:j_L}), \qquad \pi_i = \mathrm{softmax}\big(W \cdot h^{\mathrm{dec}}_\theta(x_{<i}, e_{j_1:j_L})\big), \tag{6}$$

with $\pi_i$ the class probabilities of the $i$th conditional, $W \in \mathbb{R}^{V \times D}$ trainable, and $h^{\mathrm{dec}}_\theta$ a deep neural network model. We let $h^{\mathrm{dec}}_\theta$ be a pretrained GPT-2 language model, and modify it to also process the schema $e_{j_1:j_L}$, but remark that any other model for sequence processing (e.g. a recurrent neural net) could be used instead. In more detail, to condition GPT-2 on $e_{j_1:j_L}$ without perturbing its optimized weights too much, we use the pseudo-self-attention (PSA) mechanism introduced by Ziegler et al. (2019). In a nutshell, this mechanism augments the key and value matrices of GPT-2 in their first $L$ rows with projections of $e_{j_1:j_L}$. Figure 1 shows an illustration of the complete decoder model including the PSA mechanism. Please check the supplement for the explicit equations of the latter.

Inference Model
The generative model we presented above is hierarchical. The random graph is shared across all sentences and thus constitutes a global latent object. The random walks, in contrast, are local random variables. Our task is to infer the schema and graph posterior distributions that best describe the collection of word sequences in our dataset. To do this, we approximate the true posterior distribution of these variables with a variational posterior of the form

$$q_\phi(z_{1:L}, A \mid x_{1:T}) = q_\phi(z_{1:L} \mid x_{1:T}, A)\, q_\phi(A), \tag{7}$$

where $\phi$ labels the set of trainable parameters. Let us specify each of its components.
Posterior over (global) graph. We model the posterior over the graph by again assigning Bernoulli variables to its links, but we let the probability of observing each link depend on the global symbols,

$$q_\phi(A) = \prod_{i,j} p_\phi(e_i, e_j)^{a_{ij}} \big(1 - p_\phi(e_i, e_j)\big)^{1 - a_{ij}}, \tag{8}$$

with $g_\phi : E \times E \to \mathbb{R}$ a deep neural network, and $p_\phi(e_i, e_j) \in [0, 1]$, for all $e_i, e_j \in E$, the link probabilities computed from $g_\phi$. Our reasoning here is that the network $g_\phi$ should infer graphs connecting symbols which are semantically related via the encoded sentences.
Posterior over random walks (encoder model). Analogously to Eq. 4, we model the posterior probability over random walks on $G$ as

$$q_\phi(z_{1:L} \mid x_{1:T}, A) = q_\phi(z_1 \mid x_{1:T}) \prod_{i=2}^{L} q_\phi(z_i \mid z_{i-1}, x_{1:T}, A), \tag{9}$$

where instead of having a single transition probability matrix, we have a sequence of them, thereby allowing the posterior to capture inhomogeneous random walks. Note that we could have also chosen a mean-field decomposition along the steps of the random walk, simply by either ignoring the dependency on the graph, or making the graph fully connected. One can readily show that in this case one recovers the discrete VAE model of Zhao et al. (2018) (see the supplementary material for details). Going back to Eq. 9, we model the class probabilities $\rho(x_{1:T}, \phi)$ over the starting point of the random walks and the transition matrices $Q^{[i]}$ (Eqs. 10 and 11) in terms of $h^{\mathrm{enc}}_1, h^{\mathrm{enc}}_2, \dots, h^{\mathrm{enc}}_L \in \mathbb{R}^D$, the sequence of outputs of a deep neural network model $h^{\mathrm{enc}}_\phi(x_{1:T})$ processing the input sequence of $T$ words. The model $h^{\mathrm{enc}}_\phi(x_{1:T})$ must then map a sequence of $T$ vectors to a sequence of $L$ vectors. We define $h^{\mathrm{enc}}_\phi$ by a pretrained BERT model (Devlin et al., 2018), followed by a single, randomly initialized Transformer block. The Transformer block processes the $T$ ($D$-dimensional) outputs from BERT as keys and values, together with a set of $L$ learnable vectors $q_{1:L}$ as queries. The right-hand side of Figure 1 illustrates the complete encoder architecture.

Training Objective
To optimize the parameter sets $\{\theta, \phi\}$ of our latent variable model we would, as usual, maximize a variational lower bound on the logarithm of the marginal likelihood $p_\theta(x_{1:T})$ (Bishop, 2006). It is, however, well known that VAE models tend to encounter problems learning representations that encode information about the data (the so-called posterior collapse problem), especially when dealing with natural language (Bowman et al., 2015). To solve this issue, practitioners resort to maximizing the variational lower bound together with the mutual information between data and representations (Zhao et al., 2018; Fang et al., 2019; Zhao et al., 2019). We follow this same route and show (in the supplementary material) that maximizing the lower bound and the mutual information corresponds to maximizing the objective

$$\mathcal{L}[\theta, \phi] = \mathbb{E}_{p_D(x_{1:T})} \mathbb{E}_{q_\phi(A)} \mathbb{E}_{q_\phi(z_{1:L} \mid x_{1:T}, A)} \big[\log p_\theta(x_{1:T} \mid z_{1:L})\big] - \mathbb{E}_{q_\phi(A)} \mathrm{KL}\big[q^*_\phi(z_{1:L} \mid A);\, p(z_{1:L} \mid A)\big] - \mathrm{KL}\big[q_\phi(A);\, p(A)\big], \tag{12}$$

where KL labels the Kullback-Leibler divergence (Kullback and Leibler, 1951) between prior and posterior distributions, and $q^*_\phi(z_{1:L} \mid A)$ is the aggregated posterior distribution over random walks. The latter is defined as $\mathbb{E}_{p_D(x_{1:T})}[q_\phi(z_{1:L} \mid x_{1:T}, A)]$ and is in general intractable. In practice, we approximate it with an expression identical to Eq. 9, but with the class probabilities and transition matrices (Eqs. 10 and 11) replaced with their data-averaged counterparts. We refer the reader to the supplementary material for details on this, as well as for the explicit, closed-form expressions of the Kullback-Leibler terms in Eq. 12.

Proof of Concept: Inferring Ground-Truth Random Graphs
Before testing the behaviour of our methodology on natural language data, we evaluate the ability of the model to infer hidden graph structures from sequential data in a controlled experiment. To this end, we define a synthetic language model with an underlying, ground-truth graph G * as follows: Given a graph G * with K nodes, and a vocabulary of random tokens V of size V , we assign one random bag of tokens (i.e. one pdf over V) to each node of the graph. Let the K random bags be the K symbols {e 1 , e 2 , . . . , e K } of the synthetic language model. We then sample N uniform random walks of length L over G * , and sample one random token from each symbol (i.e. from each random bag) along the walks. The result is a set of random token sequences of the same length as that of the random walks. The supplement contains a more detailed description of the generation procedure.
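The generation procedure just described can be sketched in a few lines. This is a hedged illustration only: how the random bags are drawn is specified in the supplement, so the normalised uniform draws in `random_bag`, the ring graph, and all function names below are assumptions.

```python
import random


def random_bag(vocab_size, rng):
    """One 'bag of tokens': a random categorical distribution over the vocabulary."""
    raw = [rng.random() for _ in range(vocab_size)]
    total = sum(raw)
    return [r / total for r in raw]


def uniform_walk(adj, length, rng):
    """Uniform random walk: uniform start, then a uniformly chosen neighbour each step."""
    n = len(adj)
    node = rng.randrange(n)
    walk = [node]
    for _ in range(length - 1):
        neighbours = [k for k in range(n) if adj[k][node]]
        node = rng.choice(neighbours)
        walk.append(node)
    return walk


def generate_dataset(adj, bags, num_sequences, length, rng):
    """Sample walks on the ground-truth graph and emit one token per visited symbol."""
    vocab = list(range(len(bags[0])))
    data = []
    for _ in range(num_sequences):
        walk = uniform_walk(adj, length, rng)
        tokens = [rng.choices(vocab, weights=bags[node])[0] for node in walk]
        data.append((walk, tokens))
    return data


# Hypothetical toy setting: a 5-node ring as ground-truth graph, vocabulary of 8 tokens.
rng = random.Random(0)
n = 5
adj = [[1 if abs(i - j) in (1, n - 1) else 0 for j in range(n)] for i in range(n)]
bags = [random_bag(8, rng) for _ in range(n)]
data = generate_dataset(adj, bags, num_sequences=3, length=4, rng=rng)
for walk, tokens in data:
    print(walk, tokens)
```

Only the token sequences would be shown to the model; the walks are kept here so that the hidden structure can be inspected.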
Given this set of random token sequences, the task is to infer the hidden ground-truth graph G * .
Experimental settings. Following the procedure above we generated two datasets from two random graphs with different topologies: one sampled from the Barabási-Albert model (Barabási and Albert, 1999), the other from the Erdös-Rényi model (Erdös and Rényi, 1959). We set both graphs to have K = 100 symbols, and the token sequences to have length L = 10. Each dataset has a total of N = 100000 token sequences. Further details about the random graph model parameters and the dataset statistics can be found in the sup. material. The synthetic datasets are available in the source code.

Table 2 (fragment, HSN rows only):
HSN(100, 20)   17.72   20.28   19.18
HSN(100, 5)    17.79   20.10   19.05
HSN(50, 20)    16.88   19.59   19.01
HSN(50, 5)     17.41   20.06   18.95

Figure 2: Empirical degree distributions of inferred graphs from each corpus. Results correspond to HSN with L = 5, K = 50. We also show the distribution for random graphs with p = 0.5. The graphs are sampled 500 times.
A simple proof-of-concept. We consider a problem in which the set of symbols (random bags) $E$ is known, so that the ground-truth graph $G^*$ has a fixed labelling. This setting will allow for simple comparison between $G^*$ and our inferred graphs. To infer $G^*$ we used a simplified version of HSN, namely: we (i) replace BERT in Fig. 1 with a 2-block Transformer encoder (Vaswani et al., 2017); (ii) set the graph model $g_\phi$ (Eq. 8) to a single-layer, feed-forward network; and (iii) note that, since the symbols are known, the likelihood of the model is simply given by $\prod_{i=1}^{L} e_{j_i}(x_i)$, i.e. the probability that each visited bag assigns to the observed token, where, as before, $j_i$ denotes the index of the non-zero component of $z_i$. We train this model by maximizing Eq. 12 and refer to the sup. material for details on hyperparameters, training procedure and model sizes.
Results. Table 1 shows our results for our two synthetic datasets. Specifically, we compute the Area Under the Receiver Operating Characteristic Curve (ROC AUC) of our model $q_\phi(A)$ with respect to $G^*$, and the Frobenius norm of the difference between $q_\phi$ and two graphs: the ground-truth one $G^*$, and a second random graph $G_{\mathrm{rand}}$ sampled from the same random graph model as $G^*$. We train ten models in total and display the mean and standard deviation of our results. We also use a different $G_{\mathrm{rand}}$ for each calculation run. The first metric shows that $q_\phi$ correctly predicts the edges of $G^*$, whereas the other two metrics show that $G \sim q_\phi(A)$ is closer to $G^*$ than to any other random graph sampled from the same distribution. The last two columns in Table 1 show, however, that $q_\phi(A)$ tends to generate denser graphs than the target.
Having demonstrated that HSN can indeed infer hidden graph structure from sequential data in a simple setting, we now move to our main problem: language modelling.
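For readers who want to reproduce metrics of this kind, here is a minimal, standard-library sketch of ROC AUC (computed through its rank interpretation) and of a Frobenius distance between two matrices. Whether the paper compares link probabilities or sampled graphs is not specified in this section, so applying `frobenius_distance` to the probability matrix is an assumption, as are the toy matrices and function names.

```python
import math


def roc_auc(labels, scores):
    """AUC as the probability that a true link is scored above a non-link (ties count 1/2)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference between two matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


# Hypothetical 3-node example: ground-truth links and inferred link probabilities.
truth = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
probs = [[0.0, 0.9, 0.2], [0.9, 0.0, 0.7], [0.2, 0.7, 0.0]]
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)]
auc = roc_auc([truth[i][j] for i, j in pairs], [probs[i][j] for i, j in pairs])
print(auc, frobenius_distance(probs, truth))
```

An AUC close to one indicates that true links receive systematically higher posterior probability than absent links, independently of any threshold.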

Language Modelling and Representation Learning
Natural language modelling deals with the prediction of the next word in a sentence or document, given a sequence of previously observed words. A natural evaluation metric is therefore the perplexity per word of the model, which is defined as the exponential of the data-averaged, negative log-likelihood of the model, divided by the number of words in the sequence. One complication with this is that latent variable models can only approximately estimate the likelihood function. One can readily see, however, that Eq. 12 is also a lower bound on $\log p_\theta$ (Bishop, 2006), and so we estimate the perplexity of our models with $\exp(-\mathcal{L}/T)$, which upper-bounds the true perplexity.
Datasets and baselines. We consider three widely used public datasets, namely the Penn Treebank (PTB) (Marcus et al., 1993), Yahoo and Yelp (Yang et al., 2017) corpora. For completeness we include statistics of these datasets in the sup. material. We compare HSN against a pretrained GPT-2, fine-tuned both during a single epoch and until its objective function plateaus. We also compare against two VAE language models: iVAE$_{\mathrm{MI}}$ (Fang et al., 2019) and Optimus (Li et al., 2020). The former implements both encoder and decoder as one-layer LSTMs (Hochreiter and Schmidhuber, 1997). The latter uses pretrained BERT and GPT-2 as encoder and decoder, respectively.
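As a small worked example of this estimate, the sketch below turns per-sentence lower bounds into a per-word perplexity. It normalises by the total number of words in the corpus, which is one common convention and an assumption here; the function name is hypothetical.

```python
import math


def perplexity_from_bounds(lower_bounds, lengths):
    """Per-word perplexity estimate exp(-sum_n L_n / sum_n T_n) from per-sentence bounds."""
    return math.exp(-sum(lower_bounds) / sum(lengths))


# Sanity check: a model that is uniform over a 10-word vocabulary has perplexity 10.
lengths = [5, 8, 3]
uniform_log_liks = [-t * math.log(10) for t in lengths]
ppl = perplexity_from_bounds(uniform_log_liks, lengths)
print(ppl)

# A looser bound (each value one nat lower) can only increase the estimate.
looser = perplexity_from_bounds([l - 1.0 for l in uniform_log_liks], lengths)
print(looser > ppl)
```

Because the bound never exceeds the true log-likelihood, the reported number is a conservative (pessimistic) perplexity.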

[Figure 3 panel titles: Sports; Politics & Government]
Experimental settings. In all experiments we leverage pretrained BERT and GPT-2 models, both with 12 layers, 768 hidden dimensions (D) and 12 attention heads. Note that Optimus shares these settings. We use the public HuggingFace implementation of both these models (Wolf et al., 2020). The graph model is set to a 2-layer feed-forward network, each layer with hidden dimension 512, and we also train an inhomogeneous random walk prior model (Eq. 4) by making $\rho$ and the sequence of weights $f^{[1]}, \dots, f^{[L-1]}$ trainable. Furthermore, we explore HSNs with K = {50, 100} symbols and hidden random walks of L = {5, 20} steps. Let us label these configurations as HSN(K, L). Additional details on hyperparameters and training procedures can be found in the sup. material.
Language modelling performance. Table 2 shows the perplexity of our model, together with the baselines, evaluated on the test set of the three corpora. HSN achieves a much better performance than all baselines under this metric. Note in particular that HSNs with 50 symbols perform consistently better than their 100-symbol counterparts. As we discuss in the sup. material, 100-symbol HSNs tend to infer networks with many disconnected subgraphs, the largest of which usually has about 50 symbols. It appears then that (about) 50 symbols are enough to encode these corpora. We have additionally trained five 100-symbol HSNs with L = {5, 20} for each dataset, each with a different random initialization of the pink shaded blocks in Fig. 1. We find our mean perplexities to be better than the baselines even within error bars. The reader can find these results in the sup. material, together with the values of the KL divergences for both the local schemata and the global graph. Taken together, these results show that our schema representations help improve upon the performance of a stand-alone GPT-2. To get a deeper insight into the features of these representations we now explore the structure of the learned global graphs G, as well as the semantic content of the schemata.
[Table fragment, apparently from the schema interpolation in Table 4]
(3) what are the songs that you listen to over and over, after norris does a wierd story? break up between his wife and lover?
(4) does anybody remember a song by jude _UNK or the song that goes...... when you die...
(i watched the movie the squirrel on court).... " i'm going to tell you i will keep your breath "
what song has the lyrics " if the whiskey don't kill me i don't know what will "? _UNK i've been a _UNK for seventeen long years and i spent all my money on whiskey ...

Structure of hidden schema networks. We characterize the structure of G in terms of five statistical quantities: (i) the diameter D, which measures the maximum shortest-path length over all node pairs in G; (ii) the average distance l, which instead measures the average shortest path length between all node pairs; (iii) the clustering coefficient C, which represents the probability that two neighbors of a randomly chosen node are themselves neighbors; (iv) the number of connected components CC; and (v) the degree distribution P(k), which represents the probability that a randomly chosen node will have k neighbors. We report our results in Table 3 for HSN(50, 5). Tables containing the results for HSN(50, 20) and the 100-symbol HSNs are available in the sup. material. Before continuing, let us mention here once more that the 100-symbol HSNs infer networks with large CC and a single largest connected component each, which have very similar structure to those of HSN(50, 5). Similarly, longer random walks seem to also favor larger CC. Now, the first thing we notice in Table 3 is that the networks from each corpus tend to have smaller average distances l and much larger clustering coefficients C than any random graphs (with p = 0.5) of the same size. Let us remark that the combination of these two features defines the so-called small-world structure (Watts and Strogatz, 1998). Intuitively, a larger C implies that a random walker starting from a given node k will have a larger number of paths bringing it back to k. Put another way, many of the paths spreading out of any node become redundant and a random walker will require on average more steps to reach any two symbols. In this scenario, random walkers tend to cluster in neighborhoods around their starting point, a property that could help encode different semantic aspects in different regions of G.
Another consequence is that one could expect schemata composed of repeated symbols. Figure 2 shows the degree distributions of HSN(50, 5). Here we see another aspect on which the schema networks differ from a purely random graph. In particular, the former are more densely connected than the latter.
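The structural quantities used in this analysis can be computed with a short breadth-first search. The sketch below is an assumption-laden illustration: it treats the graph as undirected, averages distances only over node pairs that are actually connected, and uses the average local clustering coefficient; the paper does not state which conventions were used, and the toy graph is hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, linked in enumerate(adj[u]):
            if linked and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def graph_statistics(adj):
    n = len(adj)
    components, seen, finite = 0, set(), []
    for s in range(n):
        dist = bfs_distances(adj, s)
        if s not in seen:  # first time we meet this connected component
            components += 1
            seen.update(dist)
        finite.extend(d for t, d in dist.items() if t > s)  # each pair once
    local = []
    for u in range(n):
        nb = [v for v in range(n) if adj[u][v]]
        k = len(nb)
        if k < 2:
            local.append(0.0)
            continue
        links = sum(adj[a][b] for i, a in enumerate(nb) for b in nb[i + 1:])
        local.append(2.0 * links / (k * (k - 1)))
    return {
        "diameter": max(finite) if finite else 0,
        "average_distance": sum(finite) / len(finite) if finite else 0.0,
        "clustering": sum(local) / n,
        "components": components,
        "degrees": [sum(row) for row in adj],
    }


# Hypothetical graph: a triangle (0, 1, 2), a pendant node 3 attached to 2, isolated node 4.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
stats = graph_statistics(adj)
print(stats)
```

The empirical degree distribution P(k) follows by counting how often each value appears in `stats["degrees"]`.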
Schemata and semantics. To qualitatively grasp the semantic content of the learned schemata we take advantage of the labels available for both the Yahoo and Yelp corpora. Figure 3 displays the random walk distributions over the schema networks for four subsets of both Yahoo and Yelp, as inferred with HSN(50, 5). Similar plots for all subsets (labels) of both corpora, extracted with all our HSN configurations, can be found in the sup. material. Note how the "hot" symbols per category reside in different regions of the graphs (as suspected already from the large clustering coefficient of G) and yet the "Science & Math" schemata (both nodes and edges) of Yahoo are closer to the "Education & Reference" schemata than to the "Sports" schemata, where closer nodes in the figure indicate well-connected nodes in the underlying graph G. A similar picture holds for the four schemata of Yelp. Finally, we have also explored "schema interpolations": given two schemata $e_{j_1:j_L}$ and $e_{m_1:m_L}$, we find the shortest path (of length $l$) on G connecting the end of $e_{j_1:j_L}$ with the beginning of $e_{m_1:m_L}$. Our interpolation steps are the schemata $\{e_{j_{1+i}:j_{L+i}} : 0 \le i \le l + L\}$, that is, windows of L consecutive symbols slid along the first schema, the connecting path and the second schema. Table 4 shows one such interpolation between two schemata of Yahoo (see sup. material for many more interpolations from all corpora).
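The interpolation procedure can be illustrated as follows: find a shortest path between the end of the first schema and the start of the second, concatenate everything, and slide a window of L symbols along the result. This is a sketch under assumptions: the exact number of interpolation steps depends on how the path length is counted, and the line graph and function names are purely illustrative.

```python
from collections import deque


def shortest_path(adj, start, goal):
    """Breadth-first search returning the node sequence of one shortest path."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            break
        for v, linked in enumerate(adj[u]):
            if linked and v not in parent:
                parent[v] = u
                queue.append(v)
    if goal not in parent:
        raise ValueError("the two schemata lie in different connected components")
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def interpolate(adj, schema_a, schema_b):
    """Slide a window of len(schema_a) symbols from schema_a to schema_b along the graph."""
    bridge = shortest_path(adj, schema_a[-1], schema_b[0])
    # The bridge already contains the last symbol of schema_a and the first of schema_b.
    full = list(schema_a[:-1]) + bridge + list(schema_b[1:])
    size = len(schema_a)
    return [full[i:i + size] for i in range(len(full) - size + 1)]


# Hypothetical line graph 0-1-2-3-4-5 and two schemata of length L = 2.
n = 6
adj = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
steps = interpolate(adj, [0, 1], [4, 5])
print(steps)
```

Each window is itself a valid walk on the graph, so every interpolation step can be decoded into a sentence in the same way as an inferred schema.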

Conclusion
We introduced a novel representation learning algorithm for natural language modelling that infers discrete, relational representations which allow for compositionality. Experiments show our model (i) can infer hidden graph structure from random token sequences, (ii) outperforms state-of-the-art scores on VAE language modelling benchmarks and (iii) learns representations encoding high-level semantics of natural sentences, thereby adding some novel layers of interpretability to large, pretrained language models. Future work involves using our discrete representations for commonsense reasoning and transfer learning problems.

A Pseudo-self attention mechanism revisited
The attention mechanism of the original Transformers (Vaswani et al., 2017) is defined as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D}}\right) V, \tag{13}$$

where $Q$, $K$ and $V \in \mathbb{R}^{T \times D}$ are sets of queries, keys and values, respectively, given by a sequence of $T$, $D$-dimensional vectors, packed into matrices. In practice, these queries, keys and values are projected many times with different learnable, linear maps. The Attention operation (Eq. 13) is performed on these different projections in parallel, whose outputs are then concatenated and projected once more with a final, linear map. The complete operation is known as Multi-head Attention (Vaswani et al., 2017), and we use this notation in Fig. 1 of the main text. Now, the question is how to condition GPT-2 on the schema $e_{j_1:j_L}$. Given a sequence of input representations $u_{1:T}$, the self-attention mechanism in GPT-2 is obtained by choosing $Q = u_{1:T} \cdot W_Q$, $K = u_{1:T} \cdot W_K$ and $V = u_{1:T} \cdot W_V$, all in $\mathbb{R}^{T \times D}$, with $W_Q$, $W_K$ and $W_V \in \mathbb{R}^{D \times D}$ pretrained matrices. We leverage a pseudo-self-attention (PSA) mechanism (Ziegler et al., 2019) that augments the key and value matrices in their first $L$ rows with projections of $e_{j_1:j_L}$, so that

$$\tilde{K} = \begin{pmatrix} (e_{j_1:j_L} + p_{\mathrm{enc}})\, W^{e}_{K} \\ u_{1:T}\, W_K \end{pmatrix}, \qquad \tilde{V} = \begin{pmatrix} (e_{j_1:j_L} + p_{\mathrm{enc}})\, W^{e}_{V} \\ u_{1:T}\, W_V \end{pmatrix}, \tag{14}$$

where $p_{\mathrm{enc}}$ is a positional encoding, just as the one used in the original Transformer implementation (Vaswani et al., 2017). The latter informs GPT-2 about the ordering of the symbols in the schema, as selected by the random walk process. PSA is then simply given by Eq. 13 with the keys and values replaced with the augmented ones, $\tilde{K}$ and $\tilde{V}$. The $W^{e}_{K}$, $W^{e}_{V}$ here are randomly initialized, learnable parameters mapping the schemata onto the decoder self-attention, $D$-dimensional space, and we have as many of them as layers in GPT-2. Therefore this mechanism allows GPT-2 to attend to the projected schema at each of its layers, with a minimal addition of untrained parameters (Ziegler et al., 2019).
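A single-head, pure-Python sketch of the key and value augmentation may help to fix ideas. It is deliberately simplified and not the authors' implementation: the positional encoding, the causal mask and the multi-head split used by GPT-2 are omitted, all matrices are random placeholders, and the function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; also returns the attention weights."""
    d = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def pseudo_self_attention(u, schema, w_q, w_k, w_v, w_ek, w_ev):
    """Prepend projected schema rows to the keys and values of ordinary self-attention."""
    q = matmul(u, w_q)
    k = matmul(schema, w_ek) + matmul(u, w_k)  # list concatenation: L schema rows first
    v = matmul(schema, w_ev) + matmul(u, w_v)
    return attention(q, k, v)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


T, D, num_symbols, L = 3, 4, 5, 2
u = rand_matrix(T, D)                                  # decoder inputs u_{1:T}
schema = [[1.0 if k == j else 0.0 for k in range(num_symbols)] for j in (1, 3)]
w_q, w_k, w_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)
w_ek, w_ev = rand_matrix(num_symbols, D), rand_matrix(num_symbols, D)
out, weights = pseudo_self_attention(u, schema, w_q, w_k, w_v, w_ek, w_ev)
print(len(out), len(out[0]), len(weights[0]))  # T rows, D columns, L + T keys per query
```

The queries are untouched, so the output keeps the shape of ordinary self-attention while every position can attend to the L schema symbols.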

B Training objective
The Evidence Lower Bound (ELBO) of the Hidden Schema Network model reads where KL[·] denotes the Kullback-Leibler (KL) divergence.
Note that this is not the training objective of the main text. There we maximize the ELBO together with the mutual information between sentences and schemata. We give details about this modified objective in subsection B.3 below. Before getting into that, let us first calculate the explicit expressions for the two divergences above.

B.1 Kullback-Leibler between random walks
For notational convenience we will not write the explicit dependence on the graph A in what follows. Using the explicit product form of the probabilities over walks leads to KL[q φ (z 1:T |x (n) 1:T ); p(z 1:

Table 5: Inference on ground-truth random graphs. Here we use the notation HS(p) to denote Hidden Schema Network models with prior graph distributions whose edge probability is set to p.
where $\hat{q}_\phi(z_i | x^{(n)}_{1:T})$ is the aggregated probability over all walks until step $i$. Since the random walks are Markovian, $\hat{q}$ can be explicitly written as where the (posterior) class probabilities over the walks' starting points $\rho$, and the transition matrices $Q^{[i]}$, are defined in Eqs. 10 and 11 of the main text. Using the definitions in Eqs. 4 and 9 of the main text, we can write the argument of the expectation value in Eq. 16 above as which means we only need to compute the expectation of the product $z^k_i z^j_{i-1}$. This one can easily be shown to be where $\hat{\rho}$ is defined in Eq. 17. Finally, the second KL term in Eq. 16 can be directly evaluated as

$$\mathrm{KL}[q_\phi(z_1 | x^{(n)}_{1:T}); p(z_1)] = \sum_{j=1}^{K} \rho_j(x^{(n)}_{1:T}, \phi) \log \frac{\rho_j(x^{(n)}_{1:T}, \phi)}{\rho_j},$$

where $\rho_j(x^{(n)}_{1:T}, \phi)$ and $\rho_j$ are, respectively, the posterior and prior class probabilities for the random walks' starting points.
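The decomposition used in this subsection is an instance of the chain rule for the KL divergence between two Markov chains: the KL between the starting distributions, plus, for every later step, the KL between the transition rows averaged over the aggregated posterior of the previous step. The sketch below checks this numerically against a brute-force sum over all walks. The row-stochastic convention (row j holds the distribution of the next node given the current node j), the function names and the tiny sizes are assumptions made for illustration, and need not match the conventions of Eqs. 4 and 9 to 11 of the main text.

```python
import itertools
import math
import random


def random_stochastic(rng, k):
    """k x k matrix with strictly positive rows that each sum to one."""
    rows = []
    for _ in range(k):
        w = [rng.random() + 0.1 for _ in range(k)]
        rows.append([x / sum(w) for x in w])
    return rows


def kl_categorical(q, p):
    """KL divergence between two categorical distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def kl_walks_chain_rule(rho_q, trans_q, rho_p, trans_p):
    """KL between walk distributions via the Markov chain rule."""
    k = len(rho_q)
    total = kl_categorical(rho_q, rho_p)
    marginal = list(rho_q)  # aggregated posterior over nodes at the current step
    for q_step in trans_q:  # one posterior transition matrix per step
        total += sum(marginal[j] * kl_categorical(q_step[j], trans_p[j]) for j in range(k))
        marginal = [sum(marginal[j] * q_step[j][m] for j in range(k)) for m in range(k)]
    return total


def kl_walks_brute_force(rho_q, trans_q, rho_p, trans_p):
    """Same quantity by explicit enumeration of every possible walk."""
    k, length = len(rho_q), len(trans_q) + 1
    total = 0.0
    for path in itertools.product(range(k), repeat=length):
        q, p = rho_q[path[0]], rho_p[path[0]]
        for i in range(1, length):
            q *= trans_q[i - 1][path[i - 1]][path[i]]
            p *= trans_p[path[i - 1]][path[i]]
        total += q * math.log(q / p)
    return total


rng = random.Random(1)
K, WALK_LENGTH = 3, 4  # illustrative sizes only
rho_post = random_stochastic(rng, K)[0]
rho_prior = [1.0 / K] * K
post_steps = [random_stochastic(rng, K) for _ in range(WALK_LENGTH - 1)]
prior_step = random_stochastic(rng, K)

kl_fast = kl_walks_chain_rule(rho_post, post_steps, rho_prior, prior_step)
kl_slow = kl_walks_brute_force(rho_post, post_steps, rho_prior, prior_step)
```

The chain-rule version costs on the order of L K^2 operations, whereas the enumeration grows as K^L, which is why only the aggregated per-step probabilities are needed in practice.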

B.2 Kullback-Leibler between random graph models
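Since both prior and posterior graph models treat each edge in G as a Bernoulli random variable, we can write directly

$$\mathrm{KL}[q_\phi(A); p(A)] = \sum_{(i,j)} \left\{ p_\phi(e_i, e_j) \log \frac{p_\phi(e_i, e_j)}{p} + \big(1 - p_\phi(e_i, e_j)\big) \log \frac{1 - p_\phi(e_i, e_j)}{1 - p} \right\},$$

where the sum runs over all possible links $(i, j)$, $p_\phi(e_i, e_j)$ is the posterior link probability, which is conditioned on the symbols connected by the link, and $p$ is the global prior probability over all links, as defined in Eq. 3 of the main text.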
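As a small worked example of this edge-wise divergence, the sketch below sums Bernoulli KL terms over the upper triangle of a matrix of posterior link probabilities. Treating the graph as undirected and without self-loops is an assumption made here for illustration, as are the function names and the toy probabilities.

```python
import math


def bernoulli_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p), with 0 < p < 1."""
    kl = 0.0
    if q > 0.0:
        kl += q * math.log(q / p)
    if q < 1.0:
        kl += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return kl


def graph_kl(link_probs, prior_p):
    """Sum of edge-wise KL terms over all node pairs i < j (assumed undirected)."""
    k = len(link_probs)
    return sum(
        bernoulli_kl(link_probs[i][j], prior_p)
        for i in range(k)
        for j in range(i + 1, k)
    )


# Toy posterior link probabilities for K = 3 symbols.
link_probs = [
    [0.0, 0.9, 0.5],
    [0.9, 0.0, 0.1],
    [0.5, 0.1, 0.0],
]
kl_value = graph_kl(link_probs, prior_p=0.5)
kl_at_prior = graph_kl([[0.5] * 3 for _ in range(3)], prior_p=0.5)
```

Links whose posterior probability equals the prior contribute nothing, so the penalty only grows where the model commits to the presence or absence of an edge.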

B.3 Maximizing mutual information
We would like to maximize the mutual information between the word sequences in our dataset and the schema representations. We have argued that the training objective in the main text already includes such a mutual information term. To see that this is indeed the case we need to work out some identities.
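Let us, for simplicity of notation, consider two discrete variables $z$ and $x$, the last of which follows an unknown distribution $p_D(x)$. What follow are identities

$$\mathbb{E}_{p_D(x)} \mathrm{KL}[q(z|x); p(z)] = -H_q(z|x) - \sum_z q^*(z) \log p(z)$$
$$= H(z) - H_q(z|x) + \mathrm{KL}[q^*(z); p(z)]$$
$$= I(x, z) + \mathrm{KL}[q^*(z); p(z)], \qquad (23)$$

where

$$H_q(z|x) = -\sum_x p_D(x) \sum_z q(z|x) \log q(z|x)$$

is the conditional entropy with respect to distribution $q$ (see e.g. page 17 in (Cover and Thomas, 1991)) and

$$H(z) = -\sum_z q^*(z) \log q^*(z)$$

is the entropy of distribution $q^*(z)$, which we define as the marginal (data-aggregated) distribution

$$q^*(z) = \sum_x p_D(x)\, q(z|x).$$

Finally, we used the definition of mutual information

$$I(x, z) = H(z) - H_q(z|x).$$

See e.g. page 20 in (Cover and Thomas, 1991).

It follows from Eq. 23 that maximizing the ELBO (Eq. 15), together with the mutual information between word sequences and schemata, simply amounts to replacing the KL between the approximate posterior and prior random walk distributions with the KL between the aggregated posterior and prior random walk distributions. To wit

$$-\mathbb{E}_{p_D(x_{1:T})} \mathrm{KL}[q_\phi(z_{1:L} | x_{1:T}, A); p(z_{1:L} | A)] + I(x_{1:T}, z_{1:L}) = -\mathrm{KL}[q^*_\phi(z_{1:L} | A); p(z_{1:L} | A)],$$

where we introduced the aggregated posterior over random walks wrt the word sequence

$$q^*_\phi(z_{1:L} | A) = \mathbb{E}_{p_D(x_{1:T})} \big[ q_\phi(z_{1:L} | x_{1:T}, A) \big].$$

In practice we approximate this quantity with

$$q^*_\phi(z_{1:L} | A) \approx q^*_\phi(z_1) \prod_{i=2}^{L} q^*_\phi(z_i | z_{i-1}, A),$$

where $q^*_\phi(z_1)$ is a categorical distribution whose class probabilities $\rho^*_j(\phi)$ are the average of those from our approximate posterior (Eq. 10 in the main text) and the transition probabilities $q^*_\phi(z_i | z_{i-1}, A)$ have transition probability matrices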
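The identity that underlies this subsection, namely that the data-averaged KL between a conditional posterior and the prior equals the mutual information plus the KL between the aggregated posterior and the prior, can be checked numerically on a toy example. The sketch below does so for three data points and three latent states; all numbers and names are arbitrary assumptions chosen for illustration.

```python
import math


def kl(q, p):
    """KL divergence between two categorical distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


# Toy data distribution p_D(x), conditional posteriors q(z|x) and prior p(z).
p_data = [0.5, 0.3, 0.2]
posterior = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
prior = [0.2, 0.5, 0.3]

num_x, num_z = len(p_data), len(prior)

# Aggregated posterior q*(z) = sum_x p_D(x) q(z|x).
aggregated = [sum(p_data[x] * posterior[x][z] for x in range(num_x)) for z in range(num_z)]

# Left-hand side: the KL term that appears in the plain ELBO.
expected_kl = sum(p_data[x] * kl(posterior[x], prior) for x in range(num_x))

# Mutual information I(x, z) = E_x KL[q(z|x); q*(z)] = H(z) - H_q(z|x).
mutual_information = sum(p_data[x] * kl(posterior[x], aggregated) for x in range(num_x))

# Right-hand side: KL between the aggregated posterior and the prior, plus I(x, z).
aggregated_kl = kl(aggregated, prior)
gap = expected_kl - (aggregated_kl + mutual_information)
```

Because the mutual information is non-negative, the aggregated KL is never larger than the data-averaged one, so penalizing only the former leaves the model free to keep sentences and schemata informative about each other.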

B.4 Mean-field solution
Instead of modeling the posterior over random walks with Eq. 9 of the main text, we could consider a mean-field decomposition along the time component, by ignoring the dependency on the graph G,

$$q_\phi(z_{1:L} | x_{1:T}) = \prod_{i=1}^{L} q_\phi(z_i | x_{1:T}),$$

where at each step of the walk we have a step-dependent categorical distribution whose class probabilities live in the K-simplex. We could model the latter via where $h^{\mathrm{enc}}_1, \dots, h^{\mathrm{enc}}_L$ are the outputs of our encoder neural network model, shown in Figure 1 of the main text.
Replacing the mean-field approximation of 33 into 15 yields

B.5 Fully connected graph
We can replace the adjacency matrix A in the definition of the transition probability matrix of our posterior $Q(x_{1:T}, A, \phi)$ with that of a fully connected graph. The aggregated posterior over all walks up to step $i$ (Eq. 17 above) reduces in this case to which is equivalent to that of the mean-field approximation of section B.4 with $\hat{\rho}$

C On synthetic dataset experiments
In this section we give additional details of and results from our proof-of-concept experiments.

C.1 Synthetic Language Model
We generate our synthetic dataset as follows: first, we sample a single, fixed graph G * with K nodes from a predefined random graph model. Second, we define a set of random tokens V, of size V , to be our vocabulary. We create each token as a random 3-tuple from the Latin alphabet, and choose to have at least one order of magnitude more tokens than nodes in G * (that is, V ≫ K). Third, we assign a random bag of tokens to each node in G * . These random bags can simply be understood as probability distributions over V, and can be represented as V -dimensional vectors whose components live on the simplex. Note in particular that, by construction, tokens can be shared among the different nodes of G * . Finally, let us identify the K random bags with the K symbols {e 1 , e 2 , . . . , e K } of the synthetic language model.
To generate synthetic sentences we sample uniform, L-step random walks on G * , whose transition matrix is given by Eq. 4 in the main text, with f = I. Having obtained a set of random walks on G * , we sample one random token from each of the symbols (i.e. from each random bag) along the walks.
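The generative procedure of this subsection can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the code used for our experiments: it builds an Erdös-Rényi graph by hand instead of calling NetworkX, draws the starting node uniformly, lets a walker stay in place if it lands on an isolated node, and uses smaller sizes than those reported in the experimental settings. All function names are hypothetical.

```python
import itertools
import random
import string


def make_vocabulary(rng, size):
    """Draw `size` distinct random 3-letter tokens from the Latin alphabet."""
    all_tokens = ["".join(t) for t in itertools.product(string.ascii_lowercase, repeat=3)]
    return rng.sample(all_tokens, size)


def erdos_renyi_neighbours(rng, num_nodes, edge_prob):
    """Undirected random graph stored as a neighbour list per node."""
    neighbours = {i: [] for i in range(num_nodes)}
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            if rng.random() < edge_prob:
                neighbours[i].append(j)
                neighbours[j].append(i)
    return neighbours


def assign_bags(rng, vocabulary, num_nodes, bag_size=2):
    """One bag of equally likely tokens per node; bags may share tokens."""
    return [rng.sample(vocabulary, bag_size) for _ in range(num_nodes)]


def sample_sentence(rng, neighbours, bags, length):
    """Uniform random walk on the graph, emitting one token per visited node."""
    node = rng.randrange(len(bags))  # uniform starting node (assumption)
    walk = [node]
    for _ in range(length - 1):
        options = neighbours[node]
        if options:  # an isolated node keeps the walker in place (assumption)
            node = rng.choice(options)
        walk.append(node)
    tokens = [rng.choice(bags[n]) for n in walk]
    return walk, tokens


rng = random.Random(0)
K, V, L = 20, 200, 10  # illustrative sizes only
vocabulary = make_vocabulary(rng, V)
neighbours = erdos_renyi_neighbours(rng, K, edge_prob=0.5)
bags = assign_bags(rng, vocabulary, K)
dataset = [sample_sentence(rng, neighbours, bags, L) for _ in range(5)]
walk, tokens = dataset[0]
```

Each generated pair holds the hidden walk (the ground-truth schema) and the observed token sequence; only the latter would be shown to the model.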

C.2 Experimental settings
Here we give additional details for reproducibility.

Datasets
• Following the procedure above we generated two datasets from two random graphs with different topologies. One sampled from the Barabási-Albert model (Barabási and Albert, 1999), the other from the Erdös-Rényi model (Erdös and Rényi, 1959). We generate these graphs using NetworkX, a Python language software package for network structures (Hagberg et al., 2008). Specifically, we generate Barabási-Albert graphs by attaching 3 edges from each new node to old ones, and Erdös-Rényi graphs with an edge probability of 0.5. We set both graphs to have K = 100 symbols.
• We define each random bag of tokens in G * to have two tokens only (each with equal probability).
• We use a vocabulary of 1000 random tokens.
• Once the graph is fixed, we set the token sequence length to L = 10 (L = 11) for the Erdös (Barabási) datasets and generate a total of N = 100000 token sequences from each random graph.
Hidden Schema Network (HSN) settings
• We train randomly initialized embeddings of dimension 256, one for each token. We sample these from a normal distribution with zero mean and a standard deviation of 0.01.
• The posterior graph model is defined via a single feed-forward neural network with 256 hidden units.
• The prior graph model has the edge probability p as hyperparameter. We cross-validated it over the set p = {0.1, 0.2, 0.5, 0.6, 0.8} and found that HSN could fit the Barabási dataset only with small values {0.1, 0.2}. HSN could fit the Erdös dataset with larger values {0.5, 0.6}.
• The posterior random walk model is defined by replacing BERT with a 2-block Transformer encoder (Vaswani et al., 2017), each block with 2 heads, 256 hidden units and dropout probability of 0.2.
• The prior random walk model was set to a uniform random walk.
Training details

Table 5 displays the mean and standard deviation of some additional results on our proof-of-concept experiments. We trained ten models in total.
Table 9: Interpolation between two random instances from the PTB dataset.
with considerable irony the case also shows how completely japan has turned the tables on u.s. business
(1) in brief the chancellor of the exchequer nigel lawson's decisions were justified by their intended political and financial convenience and credit
(2) analysts said they expect the federal authority to be totally revamped giving japanese manufacturers more clear way to measure their exports.
(3) but others say inco commission has been inadequate
(4) in 1970 banco exterior an agency run by banco exterior <unk> de <unk> <unk> was attempting to reduce liabilities and raise the sale of certain works by the division
the amended filings also point out that under a new agreement <unk> has an <unk> obligation to sell farmers to axa upon an acquisition of b.a.t

We first trained a simple LSTM network to infer the correct symbol order in each random token sequence. We noticed that a network with 256 hidden units was enough to solve this task perfectly. Indeed, the negative log-likelihood (NLL) of these models corresponds to choosing the 2-token random bag sequence (i.e. the schema) that yields the correct token sequence without errors. The HSN performs equally well on the Barabási dataset, and slightly worse on the Erdös dataset. In fact, the Erdös dataset proved to be more challenging to learn with the HSN in all regards. See, for example, the AUC scores or the Frobenius norms of HSN on this dataset, as compared to the Barabási case. We think this might be due to the fact that Barabási graphs have more structure, simply because of their sparsity, which arguably makes them easier to infer with our inductive bias.
Note also how increasing the prior edge probability p affects the average number of edges of the inferred graphs.

D On language modelling experiments
In this section we give additional details of and results from our language modelling and representation learning experiments.

D.1 Experimental settings
Here we give additional details for reproducibility.

Datasets
• We consider three widely used public datasets, namely the Penn Treebank (PTB) (Marcus et al., 1993)

HSN settings
• In all experiments we leveraged pretrained BERT and GPT-2 models, both with 12 layers, 768 hidden dimensions (D) and 12 attention heads. We used the public HuggingFace implementation of both these models (Wolf et al., 2020).
• The posterior graph model is set to a 2-layer feed-forward network, each layer with hidden dimension 512.
• We cross-validated the prior edge probability over the set of values p = {0.1, 0.2, 0.5, 0.6} and found p = 0.5 (a maximum entropy prior) to yield the best results. All results we report correspond to this (p = 0.5) case.
• We also train an inhomogeneous random walk prior model by making $\rho$ and the sequence of weights $f^{[1]}, f^{[2]}, \dots, f^{[L-1]}$ trainable. We initialized them by sampling from a normal distribution with zero mean and standard deviation of 0.01.

Training details
• We used a batch size of 32 and trained with Adam (Kingma and Ba, 2014), with a learning rate of 0.00001, in all experiments.
• To sample both graph and random walk posterior models we used the Gumbel-Softmax trick (Jang et al., 2016), with a constant temperature of 1.0.
• We used a cyclical schedule to anneal both KL terms in our training objective from zero to one (Fu et al., 2019). When the annealing weight (usually called β in the literature) is finite, we used a KL threshold scheme, with a threshold value of 0.1.
• We trained the models for 100 epochs, although they usually needed only about 60 epochs to converge (in the NLL).
• We applied word dropout to the input of the decoder model with probability 0.3 in the following cases: (i) for all models trained on PTB, and (ii) for all models with L = 50, on all datasets.
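The KL annealing and thresholding described in the list above can be sketched as follows. This is an assumption-laden illustration and not our training code: the number of cycles, the fraction of each cycle spent ramping up, the reading of a "finite" weight as a non-zero one, and the form of the threshold (a floor on the KL value, in the spirit of free-bits schemes) are all choices made here for concreteness, and the function names are hypothetical.

```python
def cyclical_beta(step, total_steps, num_cycles=4, ramp_fraction=0.5):
    """Annealing weight that ramps linearly from 0 to 1 and then holds, once per cycle.

    num_cycles and ramp_fraction are hypothetical values, not reported settings.
    """
    cycle_length = total_steps / num_cycles
    position = (step % cycle_length) / cycle_length  # position inside the cycle, in [0, 1)
    return min(1.0, position / ramp_fraction)


def kl_penalty(kl_value, beta, threshold=0.1):
    """Weighted KL term with a floor: values below the threshold are not rewarded."""
    if beta == 0.0:
        return 0.0  # the KL term is switched off at the start of each cycle
    return beta * max(kl_value, threshold)


# Weight at a few steps of a 1000-step run with 4 cycles of 250 steps each.
schedule = [cyclical_beta(step, 1000) for step in (0, 50, 125, 200, 250)]

# Penalties for a KL below and above the threshold.
small_kl_penalty = kl_penalty(0.05, beta=1.0)
large_kl_penalty = kl_penalty(0.5, beta=0.5)
```

In a training loop the same weight would multiply both the random walk and the graph KL terms, so that the decoder is periodically allowed to rely on the latent schema before the regularization is switched back on.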
Interpolate: Very bad - bad
do not use this company!! they told me within one hour, then i called again they said the driver have 90 mins. 90 minutes later, they said the driver is in traffic and wait for 15 minutes, i checked google map no accident, all green on all freeway...
(1) i ordered for pick up as my daughter hadn't been told that or even ordered online. when i spoke to the young lady, who was _UNK, she carried on a conversation with not a manager. it's bad customer service and i wouldn't even bother with this place...
(2) place was clean... when i called to let them know i 'd get something else, the person that answering the phone wouldn't understand me... really? i gave this restaurant a b + for the cleanliness of the food and the friendliness of the staff
(3) i had the quesadilla and the carnitas tacos. i felt every bite of these were so rubbery and the potatoes were off. i feel like the service and the quality of food can do much better.
(4) somewhat disappointed. i did it once and loved it but today, today's water is bitter and salty... and the mint and cherry blossom _UNK'flavors just taste that way. the food quality doesn't match the place at all. i think it's ok for a pub but this place is supposed to be a nice place for professional lunches. i had the chicken flatbread and the chicken was more like subway chicken! with so many options around that area i won't pick this place for lunch.
Interpolate: Very bad - Very Good
skip it... there are much better options out there! the " hot " food was not hot, and the flavor was only mediocre at most.
(1) indifferent to locals. the kids size pizzas were a billion times worse than a pizza hut. the quality of food was just awful. i wouldn't recommend this to a significant other for what it is.
(2) this new mexican spot is ok, bordering on childish. i went with friends and ordered a carne asada burro... it wasn't off the hook ; what made this place great were the chips & salsa sucked. yuck! ...
(3) wow. _UNK you give so much frosting!! we were a groupon special for a cupcake for the princess of chocolate, and we were pretty stoked. they were _UNK and creative. they even suggested we try the coconut ... we 'll definitely be back soon.
(4) went for the first time during a recent trip to vegas. our server jeff made special recommendations for our friends and i.
it was fantastic most of the food was light and fresh... i would highly recommend this place! i had dinner at republic kitchen tonight for the first time and was very impressed with the service, the decor, the menu, and the food quality... i am going back sunday for their brunch and jazz!

D.2 Additional results
Here we report results complementing the conclusions of the main text.
Language modelling. Table 6 displays our perplexity results on all datasets, just as in the main text. In the last two rows we additionally report the mean and standard deviation we obtained when repeating the experiments with the 100-symbol HSN model five times, with different initializations. The conclusion of the main text, viz. that our results outperform all baselines, remains unaltered, even within error bars. We additionally report in Table 7 the mean values of the KL for these five 100-symbol HSN runs.
Graph statistics. Table 8 reports the statistics of our inferred graphs for all datasets, and all model configurations. We can see that increasing the random walk length from 5 to 20 increases the number of connected components of the graphs. As a consequence, subsets of word sequences are mapped onto smaller subgraphs, the largest of which has about 50 symbols. One could argue that, since longer random walk lengths imply a larger set of possible schema configurations, the number of symbols required to describe our three corpora can simply decrease. In other words, fewer symbols are needed by long schemata. Similarly, directly increasing the number of symbols also leads to a larger number of connected components. Indeed, even the short schemata in Yelp and Yahoo do not use all available symbols to model the corpora.
Representation learning. We can get a graphical picture of the feature we just discussed in Figures 4-6 below. Very importantly, we see that the schema distribution is different for each category of each corpus in all model configurations. Tables 9-11 show interpolations of random instances from all datasets. Note how the model successfully interpolates between categories in both Yelp and Yahoo.