Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation

Syntactic structures used to play a vital role in natural language processing (NLP), but since the deep learning revolution, NLP has been gradually dominated by neural models that do not consider syntactic structures in their design. One vastly successful class of neural models is transformers. When used as an encoder, a transformer produces contextual representations of the words in the input sentence. In this work, we propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective. Specifically, we design a conditional random field that models discrete latent representations of all words in a sentence as well as dependency arcs between them, and we use mean field variational inference for approximate inference. Strikingly, we find that the computation graph of our model resembles that of a transformer, with correspondences between dependencies and self-attention and between distributions over latent representations and contextual embeddings of words. Experiments show that our model performs competitively with transformers on small to medium sized datasets. We hope that our work can help bridge the gap between traditional syntactic and probabilistic approaches and cutting-edge neural approaches to NLP, and inspire more linguistically principled neural approaches in the future.


Introduction
Once upon a time, syntactic structures were deemed essential in natural language processing (NLP). Modeling and inference about syntactic structures was an indispensable component in many NLP systems. That has all changed since the deep learning revolution started a decade ago. Modern NLP predominantly employs various neural models, most of which do not consider syntactic structures in their design.
One particularly successful type of neural model is the transformer (Vaswani et al., 2017). Given an input text, a transformer produces a vector representation for each word that captures the meaning as well as other properties of the word in its context. Such contextual word representations can then be fed into downstream neural networks for solving various NLP tasks. The power of transformers in producing high-quality contextual word representations is further unleashed with large-scale pretraining (Devlin et al., 2019; Liu et al., 2020). Nowadays, the vast majority of NLP models and systems are built on top of contextual word representations produced by some variant of a pretrained transformer.
Like most other neural models, transformers were developed based on human insight and trial and error, without explicit design for incorporating syntactic structures. Nevertheless, there is evidence that contextual word representations produced by pretrained transformers encode certain syntactic structures (Hewitt and Manning, 2019; Tenney et al., 2019) and that attention heads in pretrained transformers may reflect syntactic dependencies (Clark et al., 2019; Htut et al., 2019; Ravishankar et al., 2021). Because of the heuristic nature of the transformer model design, exactly how transformers acquire such syntactic capability remains unclear.
In this paper, we propose probabilistic transformers, a very different approach to deriving contextual word representations that is based on classic nonneural probabilistic modeling with innate syntactic components. Specifically, we design a conditional random field that models discrete latent representations of all words as well as a syntactic dependency structure of the input sentence, and we define a potential function which evaluates the compatibility of the latent representations of any pair of words connected by a dependency arc. We use mean field variational inference for approximate inference, producing a marginal distribution for each latent word representation, the probability vector of which can then be used as a contextual vector representation of the word.
While we propose our model from a purely syntactic and probabilistic perspective that is unrelated to transformers, we show that there is a striking resemblance between the computation graph of the inference procedure of our model and that of a transformer, with our intermediate distributions over dependency heads corresponding to self-attention scores and our intermediate distributions over latent word representations corresponding to intermediate word embeddings in a transformer. In short, we start with a probabilistic syntactic model but reach the transformer! We empirically compare our model with transformers when trained with either masked language modeling or downstream tasks. Our experimental results show that our model performs competitively with transformers on small to medium sized datasets.
We hope that probabilistic transformers, instead of being a replacement of transformers, could benefit the analysis of the syntactic capability of transformers and at the same time inspire novel extensions of transformers. Furthermore, we hope our work would promote future research of neural models that are linguistically more principled, theoretically more well-founded, and empirically no less powerful than existing models.

Probabilistic Transformers
We will first introduce the basic model, a conditional random field (CRF) as illustrated in Figure 1, then show the inference procedure, and finally introduce some variants to the basic model.

The CRF Model
Given a sentence (a sequence of words), denote n as the sequence length. For the i-th word, we define Z i as a discrete latent label that represents the syntactic (and possibly semantic) property of the word in the sentence (i.e., it is a contextual representation) with a label set of size d. Such a discrete representation deviates from the common practice of representing a word with a continuous vector, but it is sufficient at least for syntactic processing (Kitaev et al., 2022) and it greatly simplifies our probabilistic model. For the i-th word, we also define H i ∈ {1, 2, · · · , n} representing the syntactic dependency head of the word. So the set of variables {H i } n i=1 specifies a dependency structure. We may also allow H i to point to a dummy root node, which will be discussed in Section 2.3.5. We follow the head-selection paradigm of dependency parsing and do not enforce the tree constraint, which again simplifies our model design.
Next, we define two types of potential functions. For the i-th word w_i, we define a unary potential function (corresponding to the unary factors in Figure 1) evaluating the compatibility of the word and its label Z_i:

ϕ_u(w_i, Z_i = a) = exp(S_{w_i, a})    (1)

where S ∈ R^{|V|×d} is a score matrix and |V| is the size of the vocabulary. For simplicity, we do not exploit any morphological or contextual features for computing the scores. For every pair of words w_i and w_j (i ≠ j), we define a ternary potential function (corresponding to the ternary factors in Figure 1) over Z_i, Z_j and H_i, which evaluates the compatibility between the labels of the two words if w_j is the dependency head of w_i:

ϕ_t(Z_i = a, Z_j = b, H_i) = exp(T_{a,b}) if H_i = j, and 1 otherwise    (2)

where T ∈ R^{d×d} is a score matrix. Inspired by the multi-head structure in transformers, we allow multiple dependency structures for the same sentence, which may represent different flavors of dependencies. Each dependency structure resides in a different channel with its own dependency head variables and ternary potential functions. For the c-th channel, we denote the set of dependency head variables by {H^{(c)}_i}_{i=1}^n and the score matrix of the ternary potential function by T^{(c)}. Let h denote the total number of channels. We may stack all the score matrices T^{(c)} for c = 1, · · · , h to form a score tensor T ∈ R^{d×d×h}. Note that all the channels share the same set of latent label variables {Z_i}_{i=1}^n.

Inference
Following Wang and Tu (2020), we use Mean Field Variational Inference (MFVI) to perform approximate inference. Different from the previous work, however, we need to run inference over latent labels in addition to dependency heads.
MFVI iteratively passes messages between random variables and computes an approximate posterior marginal distribution for each random variable (denoted by Q(·)). Let Q^{(t)}_i(·) denote the approximate marginal distribution of Z_i and Q^{(t)}_{ic}(·) denote that of H^{(c)}_i at time step t. Let F^{(t)}_{ic}(j) denote the message received by variable H^{(c)}_i and G^{(t)}_i(a) the message received by variable Z_i at time step t from the ternary factors:

F^{(t)}_{ic}(j) = Σ_{a,b} Q^{(t)}_i(a) Q^{(t)}_j(b) T^{(c)}_{a,b}    (3)

G^{(t)}_i(a) = Σ_c Σ_{j≠i} [ Q^{(t)}_{ic}(j) Σ_b Q^{(t)}_j(b) T^{(c)}_{a,b} + Q^{(t)}_{jc}(i) Σ_b Q^{(t)}_j(b) T^{(c)}_{b,a} ]    (4)

The distributions are then updated by

Q^{(t+1)}_i(a) ∝ exp( S_{w_i, a} + G^{(t)}_i(a) )    (5)

Q^{(t+1)}_{ic}(j) ∝ exp( F^{(t)}_{ic}(j) )    (6)

We initialize these distributions by

Q^{(0)}_i(a) ∝ exp( S_{w_i, a} )    (7)

Q^{(0)}_{ic}(j) = 1/(n−1)    (8)

After a fixed number of T > 0 iterations, we obtain the final posterior marginal distribution Q^{(T)}_i(Z_i) for i = 1, · · · , n. Resulting from interactions with all the words of the sentence, the distribution Q^{(T)}_i(Z_i) incorporates information about not only the i-th word but also its context. Therefore, we can treat the probability vector of this distribution as a contextual vector representation for the i-th word. In practice, we find that using unnormalized scores in log space as contextual word representations produces better results, i.e., we skip the exponentiation and normalization when computing Q^{(T)}_i(Z_i).

Since all the computation during MFVI is fully differentiable, we can regard the corresponding computation graph as a recurrent or graph neural network parameterized with score matrix S and tensor T. We can use the contextual word representations for downstream tasks by connecting the network to any downstream task-specific network, and we can update the model parameters using any task-specific learning objective through gradient descent. This is exactly the same as how transformers are used.

Extensions and Variants
We introduce a few extensions and variants to the basic model that are empirically beneficial. Additional variants are discussed in Appendix B.

Distance
Similar to the case of transformers, our probabilistic model is insensitive to the word order of the input sentence. To capture order information, we apply relative positional encoding to our model by using distance-sensitive ternary potential functions. Specifically, we use different ternary scores for different distances between the words denoted by the two Z variables of the potential function. The ternary potential function in Equation 2 becomes:

ϕ_t(Z_i = a, Z_j = b, H_i) = exp(T_{a,b,f(j−i)}) if H_i = j, and 1 otherwise    (9)

where f is a clip function with threshold γ:

f(x) = −γ if x < −γ;  x if −γ ≤ x ≤ γ;  γ if x > γ    (10)

Notice that x cannot be zero since the head of a word cannot be the word itself. We set γ = 3 by default.
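As a concrete illustration, the clip function above can be written as a short function (a sketch; the name `clip_distance` is ours, not from the paper):

```python
def clip_distance(x: int, gamma: int = 3) -> int:
    """Clip a signed head-child offset x (x != 0) to the range [-gamma, gamma],
    so the distance-sensitive ternary scores only need 2 * gamma buckets."""
    assert x != 0, "the head of a word cannot be the word itself"
    return max(-gamma, min(gamma, x))
```

With γ = 3, for example, the offsets 4, 5, 6, ... all share the same score bucket as offset 3.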

Asynchronous Update
During inference of the basic model, we iteratively update all variables in a synchronous manner. This can be problematic. Consider the first iteration: the messages passed to Z variables from H variables do not contain meaningful information because the initial distributions over H are uniform. Consequently, after one iteration, the distributions over all Z variables become almost identical.
To fix this problem, we use an asynchronous update strategy by default in this work. In each iteration, we first update the distributions over H variables, and then update the distributions over Z variables based on the updated distributions over H variables. Formally, we rewrite Formula 6 so that the update of the Z variables uses the updated distributions Q^{(t+1)}_{ic}(j) in place of Q^{(t)}_{ic}(j), and we eliminate Formula 8 because the distributions over H variables no longer need initialization.
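The asynchronous inference procedure can be sketched in NumPy as follows (a simplified sketch under our notation; function and variable names are ours, and the distance-sensitive scores, message weights, and other variants are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mfvi(S_w, T, n_iters=3):
    """Asynchronous MFVI. S_w: (n, d) unary scores; T: (h, d, d) ternary scores."""
    n, d = S_w.shape
    h = T.shape[0]
    off_diag = 1.0 - np.eye(n)
    Qz = softmax(S_w)                           # initialize Q(Z) from unary scores
    Qh = np.zeros((h, n, n))
    for _ in range(n_iters):
        # 1) update head distributions first (message F, Eq 3)
        for c in range(h):
            F = Qz @ T[c] @ Qz.T
            F = np.where(off_diag > 0, F, -np.inf)  # no self-heading
            Qh[c] = softmax(F, axis=-1)
        # 2) then update label distributions with the new heads (message G, Eq 4)
        G = np.zeros((n, d))
        for c in range(h):
            G += Qh[c] @ (Qz @ T[c].T)          # messages via head words
            G += Qh[c].T @ (Qz @ T[c])          # messages via child words
        Qz = softmax(S_w + G)
    return Qz, Qh
```

The returned rows of Qz are the probability vectors used as contextual word representations (or their log-space scores, as noted above).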

Message Weight
During inference, H variables receive messages from far fewer sources than Z variables. This often pushes H variables towards being uniformly distributed. To balance the magnitudes of the messages, we follow the Entropic Frank-Wolfe algorithm (Lê-Huu and Alahari, 2021), a generalization of MFVI, and introduce weights λ_Z > 0 and λ_H > 0 into Equations 5 and 6:

Q^{(t+1)}_i(a) ∝ exp( (S_{w_i, a} + G^{(t)}_i(a)) / λ_Z )    (11)

Q^{(t+1)}_{ic}(j) ∝ exp( F^{(t)}_{ic}(j) / λ_H )    (12)

We set λ_Z = 1 and λ_H = 1/d by default.²
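The effect of a message weight is simply a temperature inside the softmax; a minimal sketch (the function name is ours):

```python
import numpy as np

def weighted_softmax(scores, lam):
    """Update with message weight lam, i.e. softmax(scores / lam) as in Eq 11-12.
    A small lam (e.g. 1/d) sharpens the distribution, counteracting the small
    magnitude of the messages received by H variables."""
    z = np.asarray(scores, dtype=float) / lam
    e = np.exp(z - z.max())
    return e / e.sum()
```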

Tensor Decomposition
The ternary score T is a tensor of shape d × d × h. Since d is usually set to several hundred, such a tensor contains a huge number of parameters. To reduce the number of parameters, we apply the Kruskal form (which is closely related to tensor rank decomposition) to build the ternary scores from smaller tensors:

T_{a,b,c} = Σ_{l=1}^{r} U_{a,l} V_{b,l} W_{c,l}    (13)

where U, V ∈ R^{d×r} and W ∈ R^{h×r}. Since the number of channels h is relatively small, we may also choose only to decompose the first two dimensions:

T^{(c)}_{a,b} = Σ_{l=1}^{r} U_{a,c,l} V_{b,c,l}    (14)
where U, V ∈ R^{d×h×r}.

² We choose these weights in a similar way to choosing the scaling factor in scaled dot-product attention of transformers. See more details in Appendix A.5.
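The two decompositions can be sketched with einsum (a sketch; the dimensions below are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, r = 64, 8, 16

# Full Kruskal form (Eq 13): T[a, b, c] = sum_l U[a, l] * V[b, l] * W[c, l]
U = rng.normal(size=(d, r))
V = rng.normal(size=(d, r))
W = rng.normal(size=(h, r))
T_full = np.einsum('al,bl,cl->abc', U, V, W)

# Decomposing only the first two dimensions (Eq 14): T[:, :, c] = U2[:, c, :] @ V2[:, c, :].T
U2 = rng.normal(size=(d, h, r))
V2 = rng.normal(size=(d, h, r))
T_two = np.einsum('acr,bcr->abc', U2, V2)

params_dense = d * d * h        # the undecomposed tensor
params_full = (2 * d + h) * r   # Eq 13
params_two = 2 * d * h * r      # Eq 14
```

Both forms recover a d × d × h score tensor while storing far fewer (Eq 13) or moderately fewer (Eq 14) parameters than the dense tensor, provided r is small enough.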

Root Node
Dependency parsing assumes a dummy root node, which we may add to the CRF model. The root node is not associated with any word and instead can be seen as representing the entire sentence. Therefore, we assume that it has a different (and possibly larger) label set from words and hence requires a different ternary potential function. Specifically, we define Z_ROOT as a discrete latent label of the root node with a label set of size d_root. For i ∈ {1, 2, · · · , n} and c ∈ {1, 2, · · · , h}, we add a ternary potential function over Z_i, H^{(c)}_i and Z_ROOT:

ϕ_root(Z_i = a, H^{(c)}_i, Z_ROOT = z) = exp(T′_{a,z,c}) if H^{(c)}_i = ROOT, and 1 otherwise

where T′ ∈ R^{d×d_root×h} is the root score tensor. During inference, we initialize Q^{(0)}(Z_ROOT) with a uniform distribution. After inference, we can regard the posterior marginal distribution of Z_ROOT as a sentence representation.

Comparison with Transformers
Although our probabilistic transformer is derived as a probabilistic model of dependency structures over latent word labels, we find that its computational process bears many similarities to that of transformers. Below, we first re-formulate a probabilistic transformer in tensor form to facilitate its comparison with a transformer, and then discuss the similarities between the two models at three levels.

Probabilistic Transformers in Tensor Form
Consider a probabilistic transformer using distance-insensitive ternary potential functions and no dummy root node. We tensorize the update formulas in the inference process of probabilistic transformers. Suppose Q_z^{(t)} ∈ R^{n×d} is a tensor that represents the posterior distributions of all the Z variables, and Q_{h,c}^{(t)} ∈ R^{n×n} is a tensor that represents the posterior distributions of all the H variables in channel c (with a zero diagonal to rule out self-heading). We can rewrite Equations 3 and 4 as

F_c^{(t)} = Q_z^{(t)} T^{(c)} Q_z^{(t)⊤}    (15)

G^{(t)} = Σ_c ( Q_{h,c}^{(t)} Q_z^{(t)} T^{(c)⊤} + Q_{h,c}^{(t)⊤} Q_z^{(t)} T^{(c)} )    (16)

and the weighted updates in Equations 11 and 12 as

Q_z^{(t+1)} = σ( S + G^{(t)} )    (17)

Q_{h,c}^{(t+1)} = σ( F_c^{(t)} / λ_H )    (18)

where S ∈ R^{n×d} stacks the unary scores of the words and σ is the softmax function. We still set λ_Z to its default value 1 but regard λ_H as a hyperparameter.
With asynchronous update, Equation 18 becomes

Q_{h,c}^{(t+1)} = σ( F_c^{(t)} / λ_H )    (19)

with the updated head distributions Q_{h,c}^{(t+1)} then used in place of Q_{h,c}^{(t)} when computing G^{(t)} in Equation 16. We assume that T^{(c)} is symmetric for c = 1, · · · , h. This is the only assumption that we make in this section beyond the original definition from the previous section. Symmetric score matrices indicate that the ternary factors are insensitive to the head-child order, which is related to undirected dependency parsing (Sleator and Temperley, 1993).
Since T^{(c)} is symmetric, F_c^{(t)} is symmetric by Formula 15, and hence Q_{h,c}^{(t)} is also symmetric by Formula 19. Thus, we can simplify Equation 16 to

G^{(t)} = 2 Σ_c Q_{h,c}^{(t)} Q_z^{(t)} T^{(c)}    (20)

Suppose we decompose the ternary score tensor into two tensors U, V ∈ R^{d×h×r} according to Equation 14, which can be rewritten as

T^{(c)} = U^{(c)} V^{(c)⊤}    (21)

where U^{(c)}, V^{(c)} ∈ R^{d×r} are the c-th channels of tensors U and V respectively. Substituting Equation 21 into Equations 15 and 20, we have

F_c^{(t)} = Q_z^{(t)} U^{(c)} V^{(c)⊤} Q_z^{(t)⊤}    (22)

G^{(t)} = 2 Σ_c Q_{h,c}^{(t)} Q_z^{(t)} V^{(c)} U^{(c)⊤}    (23)

For time step t − 1, we define

Q_c = Q_z^{(t−1)} U^{(c)},  K_c = V_c = Q_z^{(t−1)} V^{(c)}

Applying these definitions together with Formulas 19, 22 and 23 to Equation 17, we have

Q_z^{(t)} = σ( S + 2 Σ_c f_c U^{(c)⊤} )    (28)

where

f_c = σ( Q_c K_c^⊤ / λ_H ) V_c    (29)

We call the computation of f_c a single-channel update for channel c. Now we have a tensorized formulation of the computation in probabilistic transformers and we are ready for its comparison with transformers at three different levels.
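Putting the pieces together, one tensorized iteration can be sketched as follows (a sketch under the symmetric-T assumption; names are ours and several variants are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def iteration(Qz, S, U, V, lam_h):
    """One tensorized iteration (cf. Eq 28-29).
    Qz: (n, d) label distributions; S: (n, d) unary scores;
    U, V: (h, d, r) channel factors with T_c = U_c @ V_c.T (symmetric case)."""
    G = np.zeros_like(S)
    for c in range(U.shape[0]):
        Qc = Qz @ U[c]                          # "queries"
        Kc = Vc = Qz @ V[c]                     # tied "keys" and "values"
        A = Qc @ Kc.T / lam_h
        np.fill_diagonal(A, -np.inf)            # a word never heads itself
        f_c = softmax(A, axis=-1) @ Vc          # single-channel update (Eq 29)
        G += 2.0 * (f_c @ U[c].T)               # output projection tied to U
    return softmax(S + G)                       # Eq 28
```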

Single-Channel Update vs. Scaled Dot-Product Attention
Scaled dot-product attention in transformers is formulated as

Attn(Q, K, V) = softmax( QK^⊤ / √d_k ) V

As we can see, our single-channel update in Equation 29 is almost identical to scaled dot-product attention in transformers. The only difference is that the diagonal of Q_c K_c^⊤ is masked out in our model, because the head of a word cannot be the word itself.
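For reference, the two operations can be placed side by side (a sketch; the only change is the diagonal mask):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask_diagonal=False):
    """Scaled dot-product attention; with mask_diagonal=True, the attention
    weight of each position on itself is zero, as in the single-channel update."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask_diagonal:
        np.fill_diagonal(scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights
```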

Multi-Channel Update vs. Multi-Head Attention
Multi-head attention in transformers is formulated as

MultiHead(Q, K, V) = Concat(head_1, · · · , head_h) W^O,  head_c = Attn(QW_c^Q, KW_c^K, VW_c^V)

which is equivalent to

MultiHead(Q, K, V) = Σ_c Attn(QW_c^Q, KW_c^K, VW_c^V) W_c^O

where W_c^O is the c-th block of rows of W^O. Our multi-channel update formula (the second term within the softmax function in Equation 28) is similar to multi-head attention in transformers, as shown in Figure 2. The main difference is that probabilistic transformers use the same parameters for W^K and W^V (both are V, shown in green in Figure 2b) and for W^Q and W^O (both are U, shown in orange in Figure 2b). Recall that U and V are obtained from tensor decomposition (Equation 14). Therefore, the correspondence between U, V and W^Q, W^K, W^O, W^V in transformers suggests that the latter can also be seen as derived from tensor decomposition. Previous work on transformers has made the same observation (Elhage et al., 2021).

Full Model Comparison
Figure 3 compares the full computation graphs of the two models, which have a similar overall structure that repeats a module recurrently until outputting contextual word representations. Within the module, we have also established the correspondence between multi-channel update and multi-head attention. On the other hand, there are a few interesting differences. First, our model does not have a feed-forward structure as in a transformer. However, we do propose a variant of our model that contains global variables representing topics (Appendix B.3), which may have similar functionality to the feed-forward structure. Second, our model does not have residual connections or layer norms. Instead, it adds the initial distributions (unary scores) to the updated messages at each iteration. This may replace the functionality of residual connections and may even make more sense when the downstream task strongly depends on the original word information.
Third, we have an additional softmax in each iteration. Note that we apply softmax before the first iteration (Equation 7) and also at the end of each iteration (Equation 28), but bypass it in the last iteration when producing the output word representations, so our model can be equivalently formulated as applying softmax before each iteration, which we show in Figure 3c. Applying softmax in this way is similar to the layer norm in pre-LN transformers (Xiong et al., 2020) (Figure 3b). Finally, our model shares parameters across all iterations. This is similar to some variants of transformers that share parameters between layers, such as Universal Transformer (Dehghani et al., 2019) and ALBERT (Lan et al., 2019).
One consequence of these differences is that probabilistic transformers have much fewer parameters than transformers with the same number of layers, heads and embedding dimensions, because of shared parameters between iterations, absence of a feed-forward structure, and tied parameter matrices in multi-channel updates.
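A back-of-the-envelope comparison illustrates the gap (an illustrative sketch with assumed dimensions; we count only the attention/update weight matrices, ignoring embeddings, biases, and layer norms):

```python
# Assumed illustrative configuration: dimension d, h heads/channels,
# decomposition rank r, and L layers/iterations.
d, h, r, L = 256, 8, 64, 6

# Transformer: per layer, 4*d*d for the Q/K/V/O projections plus 8*d*d for the
# feed-forward block (two d x 4d matrices); nothing is shared across layers.
transformer_params = L * (4 * d * d + 8 * d * d)

# Probabilistic transformer: U and V of shape (d, h, r), shared across all
# iterations; no feed-forward block; W_Q/W_O and W_K/W_V are tied.
prob_transformer_params = 2 * d * h * r

ratio = transformer_params / prob_transformer_params
```

With these assumed sizes the transformer has over an order of magnitude more parameters; the best configurations reported in the experiments differ per task, which is why the observed gap there is closer to two to five times.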

Experiments
We empirically compare probabilistic transformers with transformers on three tasks: masked language modeling, sequence labeling, and text classification. For each task, we use two different datasets. We also perform a syntactic test to evaluate the compositional generalization ability of our model.

Tasks and Datasets
Here we briefly introduce our tasks and datasets. A detailed description is presented in Appendix D.
Masked Language Modeling (MLM). We perform MLM tasks on two corpora: the Penn Treebank (PTB) (Marcus et al., 1993) and the Brown Laboratory for Linguistic Information Processing corpus (BLLIP) (Charniak et al., 2000). Following Shen et al. (2022), we randomly replace words with a mask token <mask> at a rate of 30%. The performance of MLM is evaluated by measuring perplexity (lower is better) on masked words.
We project the final word representation of each mask token to the vocabulary. For transformers, we tie the projection parameters to the initial word embeddings. We find that this trick improves the performance of transformers.
Sequence Labeling. We experiment with two sequence labeling tasks: part-of-speech (POS) tagging and named entity recognition (NER). We directly project the final representation of each word to the target tag set. For POS tagging, we evaluate the results by the accuracy of word-level predictions. For NER, we evaluate the results by the F1 score over named entities.
Text Classification. We use the Stanford Sentiment Treebank (SST) (Socher et al., 2013) as the dataset. It has two variants: binary classification (SST-2) and fine-grained classification (SST-5). For transformers, we add a <CLS> token at the front of the sentence and then project its representation to the tag set. For our model, we use the variant with a root node introduced in Section 2.3.5 and project the representation of the root node to the tag set.
Syntactic Test. To evaluate the compositional generalization abilities of our model, we perform a syntactic test on the COGS dataset (Kim and Linzen, 2020). We follow the settings in Ontanón et al. (2021), who cast the task as a sequence labeling task.
As in sequence labeling, we project word representations to tag sets. If all words in a sentence are correctly predicted, the sentence prediction will be counted as correct. We evaluate the results by the sentence-level accuracy of the predictions.
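The sentence-level metric can be sketched as follows (the function name is ours):

```python
def sentence_accuracy(gold_tags, pred_tags):
    """Fraction of sentences whose predicted tag sequence matches the gold
    sequence at every position (COGS-style sentence-level accuracy)."""
    assert len(gold_tags) == len(pred_tags)
    correct = sum(1 for g, p in zip(gold_tags, pred_tags) if list(g) == list(p))
    return correct / len(gold_tags)
```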

Settings
We tune transformers and our model separately for each task except the syntactic test. For the syntactic test, we find that both transformers and our model easily reach 100% accuracy on the validation set. This observation is consistent with Ontanón et al. (2021). Therefore, instead of tuning, we use the best-performing setting of transformers in Ontanón et al. (2021) and configure our model to match the corresponding parts of transformers based on the correspondence discussed in Section 3. For our model, we integrate all the variants mentioned in Section 2.3 except the root node variant, which we only use for text classification tasks. We tune the tensor decomposition strategy for each task. For MLM tasks, we add a small L2 regularization term to the ternary scores of our model, which we experimentally find beneficial. We optimize both models using the Adam optimizer (Kingma and Ba, 2015) with β_1 = 0.9, β_2 = 0.999.

Results
We report the average and standard deviation over 5 random runs in Table 1. In most tasks, probabilistic transformers perform competitively with transformers. It is worth noting that in these experiments, probabilistic transformers have far fewer parameters than transformers: for most tasks, the number of parameters of our best model is about one-fifth to one-half of that of the best transformer.
We also conduct case studies of the dependency structures inferred by our model after training on downstream tasks. Similar to the case of self-attention in transformers, the inferred dependency structures are only partially consistent with human intuition. See Appendix F for details.

Related Work
There have been several studies trying to incorporate syntactic structures into transformers, for example by constraining the self-attention mechanism with dependency structures or by using it to induce dependency and constituency structures. Our work deviates from all these previous studies in that we start from scratch with probabilistic modeling of word representations and dependencies, yet obtain a model that is strikingly similar to transformers.

Discussion
It is worth noting that in this work, our primary goal is not to propose and promote a new model to compete with transformers. Instead, it is our hope that our work could benefit the analysis and extension of transformers, as well as inspire future research of transformer-style models that are linguistically more principled, theoretically more well-founded, and empirically no less powerful than existing models. In the long run, we aim to bridge the gap between traditional statistical NLP and modern neural NLP, so that valuable ideas, techniques and insights developed over the past three decades in statistical NLP could find their place in modern NLP research and engineering.
The datasets used in our experiments have small to medium sizes (around 10k to 60k training sentences). Our preliminary experiments with MLM on larger data show that our models significantly underperform transformers, which suggests that our model may not be as scalable as transformers. One possible cause is the absence of a feed-forward structure in our model. Recent research shows that the feed-forward layers might serve as an important part of transformers (Dong et al., 2021). Further research is needed to analyze this problem.
Our model can be extended in a few directions. Instead of discrete labels, we may assume Z variables representing discrete vectors or even continuous vectors, which may lead to more complicated inference. We may model dependency labels by pairing every H variable with a dependency label variable. While we focus on contextual word representation (i.e., encoding) in this paper, we may extend our probabilistic model to include a decoder. Considering the similarity between our model and transformers, we speculate that some of these extensions may be used to inspire extensions of transformers as well.

Conclusion
We present probabilistic transformers, a type of syntax-aware probabilistic model for contextual word representation. A probabilistic transformer acquires discrete latent representations of all words in the input sentence by modeling a syntactic dependency structure of the input sentence. We use MFVI for approximate inference and find a striking resemblance between the computation graph of the inference procedure of our model and that of a transformer. Our experimental results demonstrate that our model performs competitively with transformers on small to medium sized datasets.

Limitations
Though we have found a tight connection between probabilistic transformers and transformers in Section 3, this does not mean that our model can be directly used to interpret or modify transformers. For instance, in Section 3.3, we find that W K and W V in transformers both correspond to U in probabilistic transformers. However, if we tie W K and W V in transformers, then we may observe a performance drop on some downstream tasks.
The performance of probabilistic transformers lags behind that of transformers on large datasets (>100k training sentences), which suggests that our model may not be as scalable as transformers. We have discussed this in Section 6.
Our positional encoding scheme for probabilistic transformers leads to slower training and inference. On masked language modeling tasks, our model is about 3 times slower than transformers with either absolute or relative positional encoding, even though it has far fewer parameters.

A Extended Entropic Frank-Wolfe
In Section 2.3.3, we add message weights to the update function of the posterior marginal distributions. It follows an extension of the Entropic Frank-Wolfe algorithm (Lê-Huu and Alahari, 2021), which is a generalization of MFVI. Below we briefly introduce the algorithm and our extension following most of the notations in their paper.

A.1 Entropic Frank-Wolfe
Suppose we want to minimize a continuous differentiable energy function E(·) over a feasible set X. Vanilla Frank-Wolfe solves the problem min_{x∈X} E(x) by starting from a feasible x^{(0)} ∈ X at time step 0 and iterating the following steps:

s^{(t)} = argmin_{s∈X} ⟨s, ∇E(x^{(t)})⟩

x^{(t+1)} = x^{(t)} + α_t (s^{(t)} − x^{(t)})

where α_t ∈ [0, 1] follows some stepsize scheme, and here we let x ∈ R^{n×d} be the concatenation of the distributions over the label sets of all variables in the CRF. Regularized Frank-Wolfe (Lê-Huu and Alahari, 2021) adds a regularization term r(·) to the objective. It solves the new objective E(x) + r(x) by iterating

s^{(t)} = argmin_{s∈X} { ⟨s, ∇E(x^{(t)})⟩ + r(s) }

x^{(t+1)} = x^{(t)} + α_t (s^{(t)} − x^{(t)})

It has been proved that regularized Frank-Wolfe achieves a sublinear convergence rate of O(1/√t) for suitable stepsize schemes.
Entropic Frank-Wolfe is a special case of regularized Frank-Wolfe which sets the regularization term to an entropy function r(x) = −λH(x), where H(x) = −Σ_{i∈V} Σ_{s∈S} x_{is} log x_{is}, S is the label set of the variables, and V is the set of indices of the variables. Entropic Frank-Wolfe has a closed-form solution for the update step:

s_i^{(t)} = softmax( −c_i^{(t)} / λ ),  where c^{(t)} = ∇E(x^{(t)})    (30)

When λ = 1 and α_t = 1 for all t ≥ 0, it is the same as the mean field algorithm.

A.2 Extended Entropic Frank-Wolfe
We extend the Entropic Frank-Wolfe algorithm by using a more general regularization term r(x) = −Σ_{i∈V} λ_i H(x_i), where λ_i > 0 is the regularization weight of the i-th variable and H(x_i) = −Σ_{s∈S} x_{is} log x_{is} is the entropy of x_i over the probability simplex ∆ = {x ∈ R^d : x ≥ 0, 1^⊤x = 1}. This allows us to assign different regularization weights to different variables. We claim that the update step can be written as

s_i^{(t)} = softmax( −c_i^{(t)} / λ_i ),  where c^{(t)} = ∇E(x^{(t)})    (31)

This extension is still a special case of the regularized Frank-Wolfe algorithm. As a result, it inherits all the convergence properties of regularized Frank-Wolfe mentioned in the previous section. On the other hand, it is also an extension of MFVI which allows adding a message weight to each variable during inference.

A.3 A Proof for Extended Entropic Frank-Wolfe
We give a simple proof of the closed-form solution of extended Entropic Frank-Wolfe in Equation 31.
Since the optimization reduces to n independent subproblems, one for each i ∈ V, we only need to give the closed-form solution to each subproblem:

Lemma 1. For a given vector c ∈ R^d and λ > 0, the optimal solution z* to

min_{z∈∆} ⟨c, z⟩ − λH(z)

is z* = softmax(−c/λ).

Proof. We can rewrite the problem as

min_z ⟨c, z⟩ + λ Σ_s z_s log z_s   s.t.  1^⊤z = 1,  z ≥ 0

The Lagrangian of the above problem is given by

L(z, µ, ν) = ⟨c, z⟩ + λ Σ_s z_s log z_s − µ^⊤z + ν(1^⊤z − 1)

where µ = (µ_1, µ_2, . . . , µ_d) ≥ 0 and ν ∈ R are the Lagrange multipliers.
Since the given problem is convex and there exists z ∈ R^d such that 1^⊤z = 1 and z > 0, Slater's constraint qualification holds. Thus, it suffices to solve the following Karush-Kuhn-Tucker (KKT) system to obtain the optimal solution:

c_s + λ(log z_s + 1) − µ_s + ν = 0  ∀1 ≤ s ≤ d

1^⊤z = 1,  z ≥ 0,  µ ≥ 0,  µ_s z_s = 0  ∀1 ≤ s ≤ d

The first equation implies z_s > 0 for all 1 ≤ s ≤ d, and thus in combination with the last condition we obtain µ_s = 0 for all 1 ≤ s ≤ d. Therefore, the first equation becomes c_s + λ(log z_s + 1) + ν = 0 for all 1 ≤ s ≤ d. Rewriting the equation gives

z_s = exp( −c_s/λ ) · exp( −ν/λ − 1 )

Summing up this result for all s, and taking into account the second equation, we have

exp( −ν/λ − 1 ) = 1 / Σ_s exp( −c_s/λ )

That is,

z_s = exp( −c_s/λ ) / Σ_{s′} exp( −c_{s′}/λ )

In other words, z = softmax(−c/λ). ∎
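The lemma is easy to check numerically (a sanity-check sketch, not part of the proof):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def objective(z, c, lam):
    """<c, z> - lam * H(z), i.e. <c, z> + lam * sum_s z_s log z_s."""
    return c @ z + lam * np.sum(z * np.log(z))

rng = np.random.default_rng(0)
c, lam = rng.normal(size=5), 0.7
z_star = softmax(-c / lam)     # claimed minimizer from Lemma 1
```

No random point on the simplex should attain a lower objective value than z_star.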

A.4 Inference in CRF
In this work, we apply the extended Entropic Frank-Wolfe algorithm to perform inference in the CRF. Let s = (Z_1, · · · , Z_n, H_1^{(1)}, · · · , H_n^{(1)}, · · · , H_1^{(h)}, · · · , H_n^{(h)}) denote an assignment to all the random variables. Our CRF encodes the joint distribution

p(s) = (1/Z) Π_i ϕ_u(w_i, Z_i) Π_c Π_{i≠j} ϕ_t(Z_i, Z_j, H_i^{(c)})

where Z is a normalization factor. The objective is to find an assignment s that maximizes the joint distribution p(s). To express it in the form of an energy function, let p(s) = (1/Z) exp(−e(s)); we have

e(s) = −Σ_i S_{w_i, Z_i} − Σ_c Σ_i Σ_{j≠i} 1_{H_i^{(c)}=j} T^{(c)}_{Z_i, Z_j}

where 1_{H_i^{(c)}=j} is an indicator function, which is equal to 1 if H_i^{(c)} = j and 0 otherwise. The objective can now be expressed as minimizing the energy function e(s).
In general, the problem of CRF inference is NP-hard (Shimony, 1994). In MFVI, we instead solve a continuous relaxation of the CRF problem, letting X be the product of probability simplices; that is, we allow a marginal distribution for each random variable. As in Section 2.2, let Q_i(·) be the approximate marginal distribution over Z_i and Q_{ic}(·) be the approximate marginal distribution over H_i^{(c)}. The relaxed energy function is then

E = −Σ_i Σ_a Q_i(a) S_{w_i, a} − Σ_c Σ_i Σ_{j≠i} Σ_{a,b} Q_{ic}(j) Q_i(a) Q_j(b) T^{(c)}_{a,b}

In MFVI, the update for each distribution is the softmax of the negative derivative (let λ = 1 and α_t = 1 for all t ≥ 0 in Equation 30). That is,

Q_i(a) ∝ exp( −∂E/∂Q_i(a) ),  Q_{ic}(j) ∝ exp( −∂E/∂Q_{ic}(j) )

Together with Equations 3 and 4, we have

∂E/∂Q_i(a) = −S_{w_i, a} − G_i(a),  ∂E/∂Q_{ic}(j) = −F_{ic}(j)

which directly leads us to Formulas 5 and 6. In the extended Entropic Frank-Wolfe algorithm, the update for each distribution is instead the regularized softmax of the derivative (Equation 31), which is equivalent to Formulas 11 and 12 with regularization weight λ_Z > 0 for Z variables and λ_H > 0 for H variables.

A.5 The Choice of Message Weights
In Section 2.3.3, we set λ Z = 1 and λ H = 1 d by default. This choice comes from a theoretical analysis similar to Vaswani et al. (2017), and we empirically find it helpful to improve the performance.
Assume that the ternary scores in T are independent random variables with mean 0 and variance σ². From Equation 3, F_{ic}^{(t)}(j) is a weighted sum of these random variables. Suppose the weights (products of probabilities from Q_i and Q_j) are uniform, i.e., each equal to 1/d². Then F_{ic}^{(t)}(j) has mean 0 and variance d² · (1/d²)² · σ² = σ²/d². Since d is usually set to several hundred, this results in a small variance in the messages received by H variables and thus leads to nearly uniformly distributed H variables. To balance this effect, we set λ_H = 1/d so that the variance of F_{ic}^{(t)}(j)/λ_H recovers the original variance σ². For Z variables, since the sentence length n varies, it is impossible to set a fixed λ_Z that always recovers the original variance σ²; compared with F_{ic}^{(t)}(j), however, the variance of G_i^{(t)}(a) shrinks much less severely, so we simply set λ_Z = 1.
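The σ²/d² shrinkage can be verified empirically (a quick Monte Carlo sketch with σ² = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 50, 2000
# With uniform Q_i and Q_j, F is the average of the d*d i.i.d. ternary scores:
# F = sum_{a,b} (1/d) * (1/d) * T[a, b], so Var(F) = d^2 * (1/d^2)^2 * sigma^2.
F_samples = np.array([rng.normal(size=(d, d)).mean() for _ in range(trials)])
empirical_var = F_samples.var()
predicted_var = 1.0 / d ** 2
```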

B More Extensions and Variants
We have introduced several extensions and variants that benefit model performance in Section 2.3. Some other variants do not bring significant empirical improvement in our experiments, but they may still be meaningful and have interesting correspondences to transformers.

B.1 Step Size
In our model, we can retain information between iterations and perform partial updates with a proper step size. Let $\bar{Q}_i^{(t)}$ and $\bar{Q}_{ic}^{(t)}$ denote the original posterior marginal distributions of the variables at time step $t$, computed exactly as in Formulas 5 and 6. The posterior distributions with step size are then convex combinations of the previous distributions and these new estimates, where $\alpha_Z, \alpha_H \in (0, 1]$ are the step sizes of each update. When $\alpha_Z = \alpha_H = 1$, this is equivalent to the original model. We initialize the distributions by Formulas 7 and 8.
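A minimal sketch of this partial update (variable names hypothetical): the new marginals are a convex combination of the previous marginals and the freshly computed ones.

```python
import numpy as np

def partial_update(q_prev, q_bar, alpha):
    """Partial update with step size alpha in (0, 1]; alpha = 1 recovers
    the original full update q_bar."""
    assert 0.0 < alpha <= 1.0
    q = (1.0 - alpha) * q_prev + alpha * q_bar
    return q / q.sum(axis=-1, keepdims=True)  # convex combination; renormalize for safety

q_prev = np.array([0.5, 0.3, 0.2])  # marginals from the previous iteration
q_bar = np.array([0.1, 0.8, 0.1])   # marginals freshly computed this iteration
```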

B.2 Damping
Similar to the step size in Appendix B.1, the damping approach also aims at retaining information between iterations. Instead of partially updating the posterior distributions, damping partially updates the messages.
We define the messages at time step $t$ as $F_{ic}^{(t)}(j)$ and $G_i^{(t)}$, so that Formulas 5 and 6 can be written in terms of them. We then add damping factors $\beta_Z$ and $\beta_H$, which restrict the message update between iterations, changing Equations 32 and 33 accordingly. We initialize the messages as $F_{ic}^{(0)}(j) = 0$. When $\beta_Z = \beta_H = 0$, there is no damping in the update process, and the model is equivalent to the original one. When $\beta_Z = 0.5$ and $\beta_H = 0$, the update is similar to the residual connection in transformers. When $\beta_Z = \beta_H = 0.5$, it is similar to the residual attention mechanism proposed in RealFormer (He et al., 2021).
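A sketch of the damped message update (names hypothetical):

```python
import numpy as np

def damped_message(m_prev, m_new, beta):
    """Damped update: keep a beta fraction of the previous message.
    beta = 0 disables damping and recovers the original model;
    beta = 0.5 resembles a transformer-style residual connection."""
    return beta * m_prev + (1.0 - beta) * m_new

m_prev = np.array([1.0, 2.0])  # message from the previous iteration
m_new = np.array([3.0, 4.0])   # freshly computed message
```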

B.3 Global Variables
As we mentioned in Section 3.4, probabilistic transformers do not have a feed-forward structure as in transformers. Feed-forward layers, however, constitute two-thirds of a transformer model's parameters. Recent research shows that the feed-forward layers may serve as an important part of transformers (Dong et al., 2021; Geva et al., 2021, 2022). Inspired by Sukhbaatar et al. (2019), who combine the feed-forward layer and the self-attention layer into a unified all-attention layer, we design a similar structure based on dependency relations. Intuitively, we add global variables that resemble the latent word representations ($Z$ variables), except that they represent global features that do not change with the input sentence. We introduce three different model designs below.

B.3.1 All-dep
Based on the intuition above, we add global variables to the CRF model. Define $F_i$ as the $i$-th discrete global feature variable with the same label set as the $Z$ variables, representing global features of the corpus. The total number of global feature variables is $m$. These variables are observed, and their distributions over the label set do not change during inference. The head of each word can either be another word or a global feature variable; that is, $H_i^{(c)} \in \{1, 2, \cdots, n, n+1, \cdots, n+m\}$. Then, for each word $w_i$ and global feature $F_j$ in channel $c$, we define a ternary potential function over $Z_i$, $H_i^{(c)}$ and $F_j$, which evaluates the compatibility between the label of the word and the global feature of the entire corpus in channel $c$.
An illustration of the CRF model is shown in Figure 4. We call this setting all-dep since the head of each word could either be another word or a dummy global feature variable. It follows the all-attn setting in Sukhbaatar et al. (2019).
Notice that $F_j$ is a variable that does not participate in inference; it can be seen as part of the model. Thus, we can design an equivalent model that contains no global feature variables but instead has a binary factor between $Z_i$ and $H_i^{(c)}$, where $P(F_i = g)$ is the probability that the $i$-th global variable has label $g$. It can be proved that the MFVI inference process is the same for the model with global feature variables and the model with binary factors. Moving the product inside the exponential term, the term inside the exponential becomes a weighted sum of ternary scores. We may therefore re-formulate this potential function with a simplified term, where $B^{(c)} \in \mathbb{R}^{m \times d}$ is a score matrix for channel $c$; the weighted sum of ternary scores can be regarded as a neural parameterization of the binary scores $B^{(c)}$. An illustration of the simplified CRF model is shown in Figure 5. Given the model above, we can derive iterative update equations for the posterior distributions. The initialization of the posterior marginal distributions $Q$ is still given by Formulas 7 and 8.

B.3.2 Dep-split
Following the dep-split setting in Sukhbaatar et al. (2019), we separate global features from syntactic dependencies: $H_i^{(c)}$ remains the variable representing the syntactic dependency head of the $i$-th word in the $c$-th channel, and we introduce $G_i^{(c)}$ as an additional variable representing its global head. Similar to the approaches in the all-dep setting, we define a simplified binary potential function for $Z_i$ and $G_i^{(c)}$. Figure 6 illustrates the CRF model of the dep-split setting.
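The collapse of observed global feature variables into binary scores can be sketched as follows (a hypothetical numpy illustration with made-up sizes; here $T$ holds the ternary scores between a word label and a global feature label for one channel):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, g = 8, 4, 8   # hypothetical: word labels, global variables, feature labels

T = rng.normal(size=(d, g))              # ternary scores t(z, g) for one channel
P_F = rng.dirichlet(np.ones(g), size=m)  # P(F_i = g): observed, fixed during inference

# Binary scores B[i, z]: a weighted sum of ternary scores, weighted by the
# (fixed) label distribution of each global feature variable.
B = P_F @ T.T                            # shape (m, d)
```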
We can derive the following iterative update equations of the posterior distributions, where $Q_i^{(t)}$ and $Q_{ic}^{(t)}$ are the approximate marginal distributions at time step $t$. We initialize these distributions by Formulas 7 and 8, together with the analogous initialization for the new global head variables.

B.3.3 Single-split
Following the single-split setting in Sukhbaatar et al. (2019), we design a CRF model that is similar to the dep-split model but allows only one global head for each word; we likewise call this setting single-split. Denote $G_i$ as the global head variable for the $i$-th word, with a label set of size $m$. We define a binary potential for $Z_i$ and $G_i$, where $B \in \mathbb{R}^{m \times d}$ is a score matrix. Figure 7 illustrates the CRF model of the single-split setting. We can then derive the following iterative update equations of the posterior distributions, where GFU (global feature update) can be regarded as an operator that updates the latent word representations from the global features. An illustration of the computation process is shown in Figure 8. From Figure 9, we can see that the feed-forward structure in transformers is very similar to the global feature update process in probabilistic transformers with global variables.

C Distance and Relative Positional Encoding (RPE)
In Section 3.2, we find that the single-channel update (Equation 29) in probabilistic transformers is almost identical to scaled dot-product attention in transformers. This observation rests on the assumption that probabilistic transformers and transformers share the same positional encoding method, but this is not the case.
In Section 2.3.1, we mention that to capture word order information, we use a clip function to select the ternary potential function based on the distance between two words (Equation 9). This is similar to relative positional encoding (RPE) in transformers. Shaw et al. (2018) propose a method that adds an additional component to keys and values based on the clipped distance. Specifically, the scaled dot-product attention with RPE can be rewritten with $\alpha_{ij} = \frac{\exp e_{ij}}{\sum_k \exp e_{ik}}$, where $x_i$ is the input representation of the $i$-th word and $z_i$ is its output representation; the additional component is a learnable parameter indexed by the clipped distance. For probabilistic transformers, we directly add the distance information to the ternary potential function. Combining Equations 9 and 29, we can rewrite the single-channel update with weights $\alpha_{ij} = \frac{\exp e_{ij}}{\sum_k \exp e_{ik}}$ based on the clip function $f$ in Equation 10. Notice that this way of positional encoding is quite parameter-inefficient; it also makes our training process much slower than that of transformers.
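For concreteness, here is a sketch of Shaw et al. (2018)-style RPE attention (all shapes and parameter names are hypothetical; the learned tables `a_k` and `a_v` hold one vector per clipped offset):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def clip_idx(rel, k):
    # map a relative offset to an index in [0, 2k], clipping at distance k
    return int(np.clip(rel, -k, k)) + k

def rpe_attention(X, Wq, Wk, Wv, a_k, a_v, k):
    """Scaled dot-product attention with relative positional encodings:
    learned offset vectors are added to the keys and values."""
    n, dk = X.shape[0], Wk.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    out = np.zeros_like(V)
    for i in range(n):
        e = np.array([Q[i] @ (K[j] + a_k[clip_idx(j - i, k)]) / np.sqrt(dk)
                      for j in range(n)])
        alpha = softmax(e)
        out[i] = sum(alpha[j] * (V[j] + a_v[clip_idx(j - i, k)]) for j in range(n))
    return out

rng = np.random.default_rng(0)
n, dm, k = 5, 8, 2
X = rng.normal(size=(n, dm))
Wq, Wk, Wv = (rng.normal(size=(dm, dm)) for _ in range(3))
a_k, a_v = rng.normal(size=(2 * k + 1, dm)), rng.normal(size=(2 * k + 1, dm))
Z = rpe_attention(X, Wq, Wk, Wv, a_k, a_v, k)
```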

D Details for Tasks and Datasets
In this section, we will introduce our tasks and datasets in detail. A brief introduction is shown in Section 4.1.

D.1 Masked Language Modeling
Masked language modeling (MLM) tasks generally evaluate the expressiveness of contextual word representations. We perform MLM on two corpora: the Penn Treebank (PTB) and the Brown Laboratory for Linguistic Information Processing (BLLIP) corpus. We randomly replace words with a mask token <mask> at a rate of 30%, and the model is required to predict the original word. Following Shen et al. (2022), we never mask <unk> tokens. MLM performance is evaluated by measuring perplexity (lower is better) on masked words.
BLLIP. We use a subset of BLLIP (2020) with around 40k sentences and 1M tokens as the train set. The validation set consists of the first section of each year, and the test set consists of the second section of each year. We remove all punctuation, replace numbers with the single character N, and lower-case all letters. The vocabulary contains words that appear more than 27 times in the entire BLLIP dataset; its size is 30,231, including <unk> and <mask>.
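The masking procedure can be sketched as follows (a hypothetical illustration, not the authors' code):

```python
import random

def mask_tokens(tokens, rate=0.3, mask="<mask>", unk="<unk>", seed=0):
    """Replace each token with <mask> at the given rate, never masking <unk>.
    Returns the masked sequence and the positions the model must predict."""
    rng = random.Random(seed)
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if tok != unk and rng.random() < rate:
            masked.append(mask)
            targets.append(i)
        else:
            masked.append(tok)
    return masked, targets

sent = ["the", "market", "<unk>", "fell", "N", "points"] * 10
masked, targets = mask_tokens(sent)
```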

D.2 Sequence Labeling
Sequence labeling tasks require models to predict the tag for each word in the sequence. For sequence labeling tasks, we perform part-of-speech (POS) tagging on two datasets: the Penn TreeBank (PTB) and the Universal Dependencies (UD). We also perform named entity recognition (NER) on CoNLL-2003.
PTB. As introduced in Appendix D.1, we also use the PTB dataset for POS tagging, but with a different setting. We use the most common split of this corpus for POS tagging, where sections 0-18 are used as the train set, sections 19-21 as the validation set, and sections 22-24 as the test set. All words in the train set compose the vocabulary.
UD. UD is a project that develops crosslinguistically consistent treebank annotation for many languages (De Marneffe et al., 2021). We test our model on the language-specific part-of-speech (XPOS) tags of the English EWT dataset with the standard splits. All words in the train set compose the vocabulary.
CoNLL-2003. This is a named entity recognition dataset released as part of the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003). We test our model on the English dataset. All words in the train set compose the vocabulary. We only project the final representation of each word to the tag set with the BIOES scheme, without using a CRF decoder.

D.3 Text Classification
Text classification tasks require classifying sentences into different classes. We use the Stanford Sentiment Treebank (SST) (Socher et al., 2013) as the dataset. It has two variants: binary classification (SST-2) and fine-grained classification (SST-5). The dataset comes from SentEval (Conneau and Kiela, 2018).
SST-2. SST-2 classifies each movie review into positive or negative classes. It contains 67k sentences in the train set.
SST-5. SST-5 classifies sentences into 5 classes: negative, somewhat negative, neutral, somewhat positive and positive. It contains 8.5k sentences in the train set.
In text classification, all words in the train set compose the vocabulary.

D.4 Syntactic Test
To evaluate the compositional generalization abilities of our model, we perform a syntactic test on the COGS (Kim and Linzen, 2020) dataset. COGS is a semantic parsing dataset that measures the compositional generalization abilities of models. We follow the settings in Ontanón et al. (2021), which turns the task from seq2seq into a sequence tagging task. The model needs to predict 5 tags for each input word: a parent word, the role of the relation between the word and its parent (if applicable), the category, the noun determiner (for nouns) and the verb name (for verbs). With these tags, one can reconstruct the original output deterministically.
For role, category, noun determiner and verb name, we directly project word representations to each tag set. For the parent tag, Ontanón et al. (2021) propose three types of prediction heads:
• Absolute uses a direct projection to predict the absolute index of the parent in the input sequence (-1 for no parent).
• Relative uses a direct projection to predict the relative offset of the parent token with respect to the current token, or self for no parent.
• Attention uses the attention weights from a new attention layer with a single head to predict the parent.
We empirically find that relative performs best in most settings for both transformers and probabilistic transformers. This is inconsistent with the observations of Ontanón et al. (2021), who find that attention outperforms the other settings. We nevertheless apply the relative setting in our experiments.
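Under the relative scheme, the parent tag for each token can be derived as follows (a small hypothetical sketch):

```python
def relative_parent_targets(parents):
    """Convert absolute parent indices (-1 for no parent) into the
    'relative' tagging scheme: the offset of the parent from the current
    token, or 'self' for tokens without a parent."""
    return ["self" if p == -1 else p - i for i, p in enumerate(parents)]

# Token 0 has no parent; tokens 1 and 2 attach to the token before them.
tags = relative_parent_targets([-1, 0, 1])
print(tags)  # ['self', -1, -1]
```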

E Hyperparameters and Implementation
We report our hyperparameters in Table 2 for probabilistic transformers and Table 3 for transformers. We tune the models for each task except the syntactic test through random search. We run all experiments on one NVIDIA GeForce RTX 2080 Ti, and each experiment finishes within one day. Our implementation is based on the flair framework (Akbik et al., 2019).

F Case Studies of Learned Dependency Structures
A probabilistic transformer infers marginal distributions over both Z and H variables, the latter of which can be used to extract a dependency structure. Since our model is trained on downstream tasks such as MLM without access to gold parse trees, it can be seen as performing unsupervised dependency parsing. We visualize the dependency structures learned by a probabilistic transformer by looking at the most probable head of each word in the sentence. Figure 10 illustrates the dependency structures extracted from a probabilistic transformer trained on the PTB dataset under the MLM task. The sentence comes from the test set of the PTB dataset.
We show the head of each word in all channels. The numbers on the dependency arcs represent probabilities estimated by the model. Since the model does not contain a root node, there is at least one cycle in the dependency graph.
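Extracting the visualized structure from the approximate marginals is straightforward (a hypothetical sketch; `Q_H[c, i, j]` denotes $Q(H_i^{(c)} = j)$):

```python
import numpy as np

def extract_heads(Q_H):
    """Return, for each channel and word, the most probable head index and
    its estimated probability."""
    return Q_H.argmax(axis=-1), Q_H.max(axis=-1)

# One channel, two words: word 0 attaches to word 1 and vice versa (a cycle).
Q_H = np.array([[[0.1, 0.9],
                 [0.8, 0.2]]])
heads, probs = extract_heads(Q_H)
print(heads)  # [[1 0]]
```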
From the figure, we can see that our model is very confident in its choices of dependency arcs, with all probabilities close to 1, which indicates strong compatibilities between the latent representations of connected word pairs. The predicted structure somewhat makes sense; for example, it puts 'she said' together. But generally, most of the dependency arcs are not consistent with human-designed dependency relations. (Figure 10 sentence: "this is not a major crash" she said.)

A2. Did you discuss any potential risks of your work?
We did not find potential risks in this work.
A3. Do the abstract and introduction summarize the paper's main claims?
Section 1
A4. Have you used AI writing assistants when working on this paper?
Left blank.
B Did you use or create scientific artifacts?
Section 4
B1. Did you cite the creators of artifacts you used?
Section 4, Appendix E
B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
We use commonly-used benchmarks and the license could be easily found on the Internet.
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
Appendix D
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
For a fair comparison, we use the datasets as is. We follow the preprocessing steps from previous work.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
Section 4.3, Appendix E
The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.