Recurrent Neural Language Models as Probabilistic Finite-state Automata

Studying language models (LMs) in terms of well-understood formalisms allows us to precisely characterize their abilities and limitations. Previous work has investigated the representational capacity of recurrent neural network (RNN) LMs in terms of their capacity to recognize unweighted formal languages. However, LMs do not describe unweighted formal languages -- rather, they define \emph{probability distributions} over strings. In this work, we study what classes of such probability distributions RNN LMs can represent, which allows us to make more direct statements about their capabilities. We show that simple RNNs are equivalent to a subclass of probabilistic finite-state automata, and can thus model a strict subset of probability distributions expressible by finite-state models. Furthermore, we study the space complexity of representing finite-state LMs with RNNs. We show that, to represent an arbitrary deterministic finite-state LM with $N$ states over an alphabet $\Sigma$, an RNN requires $\Omega\left(N |\Sigma|\right)$ neurons. These results present a first step towards characterizing the classes of distributions RNN LMs can represent and thus help us understand their capabilities and limitations.


Introduction
We start with a few definitions. An alphabet $\Sigma$ is a finite, non-empty set. A formal language is a subset of $\Sigma$'s Kleene closure $\Sigma^*$, and a language model (LM) $p$ is a probability distribution over $\Sigma^*$.
LMs have demonstrated utility in a variety of NLP tasks and have recently been proposed as a general model of computation for a wide variety of problems requiring (algorithmic) reasoning (Brown et al., 2020; Chen et al., 2021; Hoffmann et al., 2022; Chowdhery et al., 2022; Wei et al., 2022a,b; Kojima et al., 2023; Kim et al., 2023, inter alia). Our paper asks a simple question: How can we characterize the representational capacity of an LM based on a recurrent neural network (RNN)? In other words: What classes of probability distributions over strings can RNNs represent?

Figure 1: A graphical summary of the results. This paper establishes the equivalence between the bolded deterministic probabilistic FSAs and Heaviside Elman RNNs, which both define deterministic probabilistic finite-state languages.

Answering this question is essential whenever we require formal guarantees of the correctness of the outputs generated by an LM. For example, one might ask a language model to solve a mathematical problem based on a textual description (Shridhar et al., 2023) or ask it to find an optimal solution to an everyday optimization problem (Lin et al., 2021, Fig. 1). If such problems fall outside the representational capacity of the LM, we have no grounds to believe that the result provided by the model is correct in the general case. The question also follows a long line of work on the linguistic capabilities of LMs, as LMs must be able to implement mechanisms of recognizing specific syntactic structures to generate grammatical sequences (Linzen et al., 2016; Hewitt and Manning, 2019; Jawahar et al., 2019; Liu et al., 2019; Icard, 2020; Manning et al., 2020; Rogers et al., 2021; Belinkov, 2022, inter alia).
A natural way of quantifying the representational capacity of computational models is with the class of formal languages they can recognize (Deletang et al., 2023). Previous work has connected modern LM architectures such as RNNs (Elman, 1990; Hochreiter and Schmidhuber, 1997; Cho et al., 2014) and transformers (Vaswani et al., 2017) to formal models of computation such as finite-state automata, counter automata, and Turing machines (e.g., McCulloch and Pitts, 1943; Kleene, 1956; Siegelmann and Sontag, 1992; Hao et al., 2018; Korsky and Berwick, 2019; Merrill, 2019; Merrill et al., 2020; Hewitt et al., 2020; Merrill et al., 2022; Merrill and Tsilivis, 2022, inter alia). Through this, diverse formal properties of modern LM architectures have been shown, allowing us to draw conclusions on which phenomena of human language they can model and what types of algorithmic reasoning they can carry out (see §7 for a thorough discussion of related work). However, most existing work has focused on the representational capacity of LMs in terms of classical, unweighted, formal languages, which arguably ignores an integral part of an LM: the probabilities assigned to strings. In contrast, in this work, we propose to study LMs by directly characterizing the class of probability distributions they can represent.
Concretely, we study the relationship between RNN LMs with the Heaviside activation function $H(x) \overset{\text{def}}{=} \mathbb{1}\{x > 0\}$ and finite-state LMs, the class of probability distributions that can be represented by weighted finite-state automata (WFSAs). Finite-state LMs form one of the simplest classes of probability distributions over strings (Icard, 2020) and include some well-known instances such as $n$-gram LMs. We first prove the equivalence in the representational capacity of deterministic WFSAs and RNN LMs with the Heaviside activation function, where determinism here refers to the determinism in transitioning between states conditioned on the input symbol. To show the equivalence, we generalize the well-known construction of an RNN encoding an unweighted FSA due to Minsky (1954) to the weighted case, which enables us to talk about string probabilities. We then consider the space complexity of simulating WFSAs using RNNs. Minsky's construction encodes an FSA with $N$ states in space $O(|\Sigma| N)$, i.e., with an RNN with $O(|\Sigma| N)$ hidden units, where $\Sigma$ is the alphabet over which the WFSA is defined. Indyk (1995) showed that a general unweighted FSA with $N$ states can be simulated by an RNN with a hidden state of size $O(|\Sigma| \sqrt{N})$. We show that this compression does not generalize to the weighted case: Simulating a weighted FSA with an RNN requires $\Omega(N)$ space due to the independence of the individual conditional probability distributions defined by the states of the WFSA. Lastly, we also study the asymptotic space complexity with respect to the size of the alphabet, $|\Sigma|$. We again find that it generally scales linearly with $|\Sigma|$. However, we also identify classes of WFSAs, including $n$-gram LMs, where the space complexity scales logarithmically with $|\Sigma|$. These results are schematically presented in Fig. 1.

Finite-state Language Models
Most modern LMs define $p(y)$ as a product of conditional probability distributions $p$:
$$p(y) \overset{\text{def}}{=} p(\text{EOS} \mid y) \prod_{t=1}^{|y|} p(y_t \mid y_{<t}), \qquad (1)$$
where $\text{EOS} \notin \Sigma$ is a special end-of-sequence symbol. The EOS symbol enables us to define the probability of a string purely based on the conditional distributions. Such models are called locally normalized. We denote $\overline{\Sigma} \overset{\text{def}}{=} \Sigma \cup \{\text{EOS}\}$. Throughout this paper, we will assume $p$ defines a valid probability distribution over $\Sigma^*$, i.e., that $p$ is tight (Du et al., 2023, §4).

Definition 2.1. Two LMs $p$ and $p'$ over $\Sigma^*$ are weakly equivalent if $p(y) = p'(y)$ for all $y \in \Sigma^*$.

Finite-state automata are a tidy and well-understood formalism for describing formal languages.
Definition 2.2. A finite-state automaton (FSA) is a 5-tuple $(\Sigma, Q, I, F, \delta)$ where $\Sigma$ is an alphabet, $Q$ a finite set of states, $I, F \subseteq Q$ the sets of initial and final states, and $\delta \subseteq Q \times \Sigma \times Q$ a set of transitions.
We assume that states are identified by integers in $\mathbb{Z}_{|Q|} \overset{\text{def}}{=} \{0, \ldots, |Q| - 1\}$. We also adopt a more suggestive notation for transitions by denoting $(q, y, q') \in \delta$ as $q \xrightarrow{y} q'$. We define $\text{Par}(q, y) \overset{\text{def}}{=} \{q' \mid q' \xrightarrow{y} q \in \delta\}$, the set of $y$-parents of $q$, and the children of the state $q$ as the set $\{q' \mid \exists y \in \Sigma : q \xrightarrow{y} q' \in \delta\}$. FSAs are often augmented with weights.

Definition 2.3. A real-weighted finite-state automaton (WFSA) $\mathcal{A}$ is a 5-tuple $(\Sigma, Q, \delta, \lambda, \rho)$ where $\Sigma$ is an alphabet, $Q$ a finite set of states, $\delta \subseteq Q \times \Sigma \times \mathbb{R} \times Q$ a finite set of weighted transitions, and $\lambda, \rho : Q \to \mathbb{R}$ the initial and final weighting functions.
We denote $(q, y, w, q') \in \delta$ with $q \xrightarrow{y/w} q'$ and define $\omega(q \xrightarrow{y/w} q') \overset{\text{def}}{=} w$, where $\omega(q \xrightarrow{y/\circ} q') \overset{\text{def}}{=} 0$ if there are no $y$-transitions from $q$ to $q'$. The underlying FSA of a WFSA is the FSA obtained by removing the transition weights and setting $I = \{q \in Q \mid \lambda(q) \neq 0\}$ and $F = \{q \in Q \mid \rho(q) \neq 0\}$.

Definition 2.4. An FSA $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ is deterministic if $|I| = 1$ and, for every $(q, y) \in Q \times \Sigma$, there is at most one $q' \in Q$ such that $q \xrightarrow{y} q' \in \delta$. A WFSA is deterministic if its underlying FSA is deterministic.
In contrast to unweighted FSAs, not all nondeterministic WFSAs admit a weakly equivalent deterministic one, i.e., they are non-determinizable.

Definition 2.5. A path $\pi$ is a sequence of consecutive transitions $q_1 \xrightarrow{y_1/w_1} q_2, \ldots, q_N \xrightarrow{y_N/w_N} q_{N+1}$. Its length $|\pi|$ is the number of transitions in it and its scan $s(\pi)$ the concatenation of the symbols on them. We denote with $\Pi(\mathcal{A})$ the set of all paths in $\mathcal{A}$ and with $\Pi(\mathcal{A}, y)$ the set of all paths that scan $y \in \Sigma^*$.
The weights of the transitions along a path are multiplicatively combined to form the weight of the path. The weights of all the paths scanning the same string are combined additively to form the weight of that string.

Definition 2.6. The path weight of $\pi \in \Pi(\mathcal{A})$ is $w(\pi) \overset{\text{def}}{=} \lambda(q_1) \left( \prod_{t=1}^{|\pi|} w_t \right) \rho(q_{|\pi|+1})$, where $q_1$ and $q_{|\pi|+1}$ are the first and last states of $\pi$. The stringsum of $y \in \Sigma^*$ is $\mathcal{A}(y) \overset{\text{def}}{=} \sum_{\pi \in \Pi(\mathcal{A}, y)} w(\pi)$.

A class of WFSAs important for defining LMs is probabilistic WFSAs.

Definition 2.7. A WFSA $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ is probabilistic (a PFSA) if all transition, initial, and final weights are non-negative, $\sum_{q \in Q} \lambda(q) = 1$, and, for all $q \in Q$, $\sum_{q \xrightarrow{y/w} q' \in \delta} w + \rho(q) = 1$.
5 Throughout the text, we use $\circ$ as a placeholder for a free quantity, in this case, any weight $w \in \mathbb{R}$. In case there are multiple $\circ$'s in an expression, they are not tied in any way.
Figure 2: A PFSA defining an FSLM over $\Sigma = \{a, b\}$, with $p(ab^n ab^m) = 1 \cdot 0.6 \cdot 0.1^n \cdot 0.9 \cdot 0.7^m \cdot 0.3$ and $p(bb^m) = 1 \cdot 0.4 \cdot 0.7^m \cdot 0.3$.

The initial weights, and, for any $q \in Q$, the weights of its outgoing transitions and its final weight, form a probability distribution. The final weights in a PFSA play a role analogous to the EOS symbol: they represent the probability of ending a path in $q$; $\rho(q)$ corresponds to the probability of ending a string $y$, $p(\text{EOS} \mid y)$, where $q$ is a state arrived at by $\mathcal{A}$ after reading $y$. We will use the acronym DPFSA for the important special case of a deterministic PFSA.

Definition 2.8. A language model $p$ is finite-state (an FSLM) if it can be represented by a PFSA, i.e., if there exists a PFSA $\mathcal{A}$ such that, for every $y \in \Sigma^*$, $p(y) = \mathcal{A}(y)$.
See Fig. 2 for an example of a PFSA defining an FSLM over $\Sigma = \{a, b\}$. Its support consists of the strings $ab^n ab^m$ and $bb^m$ for $n, m \in \mathbb{N}_{\geq 0}$.
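To make Definitions 2.6 and 2.7 concrete, the following minimal Python sketch (with a hypothetical dict-based encoding, not taken from the paper) computes the stringsum of a small probabilistic WFSA by summing path weights with a forward-style recursion.

```python
# Minimal sketch of a PFSA and its stringsum A(y): sum over all paths scanning y of the
# initial weight, the transition weights along the path, and the final weight (Def. 2.6).
from collections import defaultdict

class PFSA:
    def __init__(self, init, final, transitions):
        self.init = init                      # dict: state -> lambda(q)
        self.final = final                    # dict: state -> rho(q)
        self.delta = defaultdict(list)        # (state, symbol) -> [(next_state, weight)]
        for q, y, w, q_next in transitions:
            self.delta[(q, y)].append((q_next, w))

    def stringsum(self, y):
        # Forward weights over states; summing over paths handles non-determinism.
        alpha = dict(self.init)
        for symbol in y:
            new_alpha = defaultdict(float)
            for q, w_q in alpha.items():
                for q_next, w in self.delta[(q, symbol)]:
                    new_alpha[q_next] += w_q * w
            alpha = new_alpha
        return sum(w_q * self.final.get(q, 0.0) for q, w_q in alpha.items())

# Hypothetical two-state PFSA over {a, b}: for each state, the outgoing weights and the
# final weight sum to 1, as required by Definition 2.7.
A = PFSA(init={0: 1.0}, final={1: 0.3},
         transitions=[(0, "a", 0.6, 1), (0, "b", 0.4, 1), (1, "b", 0.7, 1)])
print(A.stringsum("ab"))   # 1.0 * 0.6 * 0.7 * 0.3 = 0.126
```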
In general, there can be infinitely many PFSAs that express a given FSLM. However, in the deterministic case, there is a single minimal DPFSA.

Definition 2.9. Let $p$ be an FSLM. A PFSA $\mathcal{A}$ is a minimal DPFSA for $p$ if it defines the same probability distribution as $p$ and there is no weakly equivalent DPFSA with fewer states.
3 Recurrent Neural Language Models

RNN LMs are LMs whose conditional distributions are given by a recurrent neural network. Over the course of this paper, we will focus on Elman RNNs (Elman, 1990), as they are the easiest to analyze and are special cases of more common networks, e.g., those based on long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRUs; Cho et al., 2014).
Definition 3.1. An Elman RNN (ERNN) $\mathcal{R} = (\Sigma, \sigma, D, U, V, b, h_0)$ defines the following hidden state recurrence:
$$h_t = \sigma\left(U h_{t-1} + V r(y_t) + b\right), \qquad (2)$$
where $h_0$ is set to some vector in $\mathbb{R}^D$, $r : \Sigma \to \mathbb{R}^R$ is the symbol representation function, and $\sigma$ is an element-wise nonlinearity. $b \in \mathbb{R}^D$, $U \in \mathbb{R}^{D \times D}$, and $V \in \mathbb{R}^{D \times R}$. We refer to the dimensionality of the hidden state, $D$, as the size of the RNN.
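As a concrete illustration of the recurrence in Eq. (2), here is a minimal sketch assuming one-hot symbol representations $r$; the function names and toy parameters are illustrative only.

```python
# Minimal sketch of the Elman hidden-state recurrence h_t = sigma(U h_{t-1} + V r(y_t) + b).
import numpy as np

def elman_run(U, V, b, h0, symbols, alphabet, sigma):
    index = {y: i for i, y in enumerate(alphabet)}
    h = h0
    for y in symbols:
        r = np.zeros(len(alphabet))
        r[index[y]] = 1.0                      # one-hot r(y_t)
        h = sigma(U @ h + V @ r + b)           # Elman update, Eq. (2)
    return h

heaviside = lambda x: (x > 0).astype(float)    # H(x) = 1{x > 0}, used by HRNNs (Def. 3.4)

# Toy usage with a 2-dimensional hidden state over a hypothetical alphabet {a, b}.
U = np.eye(2)
V = np.eye(2)
b = np.array([-0.5, -0.5])
print(elman_run(U, V, b, np.zeros(2), "ab", ["a", "b"], heaviside))   # [1. 1.]
```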
An RNN $\mathcal{R}$ can be used to specify an LM by using the hidden states to define the conditional distributions for $y_t$ given $y_{<t}$.

Definition 3.2. Let $E \in \mathbb{R}^{|\overline{\Sigma}| \times D}$ and let $\mathcal{R}$ be an RNN. An RNN LM $(\mathcal{R}, E)$ is an LM whose conditional distributions are defined by projecting $E h_t$ onto the probability simplex $\Delta^{|\overline{\Sigma}| - 1}$ using some $f : \mathbb{R}^{|\overline{\Sigma}|} \to \Delta^{|\overline{\Sigma}| - 1}$:
$$p(y_{t+1} \mid y_{\leq t}) \overset{\text{def}}{=} f(E h_t)_{y_{t+1}}.$$
We term $E$ the output matrix.
The most common choice for $f$ is the softmax, defined for $x \in \mathbb{R}^D$ and $d \in \mathbb{Z}_D$ as
$$\mathrm{softmax}(x)_d \overset{\text{def}}{=} \frac{\exp(x_d)}{\sum_{j \in \mathbb{Z}_D} \exp(x_j)}.$$
An important limitation of the softmax is that it results in a distribution with full support for all $x \in \mathbb{R}^D$. However, one can achieve $0$ probabilities by including the extended real numbers $\overline{\mathbb{R}} \overset{\text{def}}{=} \mathbb{R} \cup \{-\infty, \infty\}$: Any element with $x_d = -\infty$ will result in $\mathrm{softmax}(x)_d = 0$.
Recently, a number of alternatives to the softmax have been proposed. This paper uses the sparsemax function (Martins and Astudillo, 2016), which can output sparse distributions:
$$\mathrm{sparsemax}(x) \overset{\text{def}}{=} \underset{p \in \Delta^{D-1}}{\mathrm{argmin}} \; \|p - x\|_2^2.$$
Importantly, $\mathrm{sparsemax}(x) = x$ for $x \in \Delta^{D-1}$.
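The two projection functions can be sketched as follows; the sparsemax implementation follows the closed-form solution of Martins and Astudillo (2016), and the extended-real softmax maps $-\infty$ entries to probability $0$.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (Martins & Astudillo, 2016)."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum          # coordinates that stay non-zero
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

def softmax_ext(z):
    """Softmax over the extended reals: entries equal to -inf get probability 0."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(sparsemax(np.array([0.2, 0.5, 0.3])))    # already on the simplex -> returned unchanged
print(softmax_ext(np.array([np.log(0.7), np.log(0.3), -np.inf])))   # [0.7, 0.3, 0.0]
```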
On determinism. Unlike WFSAs, Elman RNNs (and most other popular RNN architectures, such as the LSTM and GRU) implement inherently deterministic transitions between internal states. As we show shortly, certain types of Elman RNNs are at most as expressive as deterministic WFSAs, meaning that they cannot represent non-determinizable WFSAs.
Common choices for the nonlinear function $\sigma$ in Eq. (2) are the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$ and the ReLU $\sigma(x) = \max(0, x)$. However, the resulting nonlinear interactions of the parameters and the inputs make the analysis of RNN LMs challenging. One fruitful way to make the analysis tractable is to make a simplifying assumption about $\sigma$. We focus on a particularly useful simplification, namely the use of the Heaviside activation function.

Definition 3.3. The Heaviside function is defined as $H(x) \overset{\text{def}}{=} \mathbb{1}\{x > 0\}$.
See Fig. 3 for the graph of the Heaviside function and its continuous approximation, the sigmoid. For cleaner notation, we define the set $\mathbb{B} \overset{\text{def}}{=} \{0, 1\}$. Using the Heaviside function, we can define the Heaviside ERNN, the main object of study in the rest of the paper.

Definition 3.4. A Heaviside Elman RNN (HRNN) is an ERNN $\mathcal{R} = (\Sigma, \sigma, D, U, V, b, h_0)$ where $\sigma = H$.

Equivalence of HRNNs and FSLMs
The hidden states of an HRNN live in $\mathbb{B}^D$ and can thus take $2^D$ different values. This invites an interpretation of $h$ as the state of an underlying FSA that transitions between states based on the HRNN recurrence, specifying its local conditional distributions with the output matrix $E$. Similarly, one can also imagine designing an HRNN that simulates the transitions of a given FSA by appropriately specifying the parameters of the HRNN. We explore this connection formally in this section and present the main technical result of the paper. The central result, which characterizes the representational capacity of HRNNs, can be informally summarized by the following theorem.

Theorem 4.1 (Informal). HRNN LMs are weakly equivalent to deterministic PFSAs.

We split this result into the question of (i) how DPFSAs can simulate HRNN LMs and (ii) how HRNN LMs can simulate DPFSAs.
DPFSAs Can Simulate HRNNs

Lemma 4.1. For any HRNN LM, there exists a weakly equivalent DPFSA.

The proof closely follows the intuitive connection between the $2^D$ possible configurations of the RNN hidden state and the states of the weakly equivalent DPFSA. The outgoing transition weights of a state $q$ are simply the conditional probabilities of the transition symbols conditioned on the RNN hidden state represented by $q$; the full proof is presented in Appendix A. This implies that HRNNs are at most as expressive as DPFSAs and, as a consequence, strictly less expressive than non-deterministic PFSAs. We discuss the implications of this in §6.
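The construction sketched above can be illustrated as follows: every binary hidden-state configuration becomes a DPFSA state, and its outgoing weights are read off the projected output logits. This is a minimal, hypothetical sketch (it assumes, e.g., that the last row of $E$ corresponds to EOS) and enumerates all $2^D$ configurations, so it is only practical for tiny $D$.

```python
# Minimal sketch of turning an HRNN LM into a DPFSA by enumerating its binary hidden states.
import itertools
import numpy as np

def hrnn_to_dpfsa(U, V, b, E, h0, alphabet, f):
    D = len(h0)
    heaviside = lambda x: (x > 0).astype(float)
    one_hot = np.eye(len(alphabet))
    delta, rho = {}, {}
    for bits in itertools.product([0.0, 1.0], repeat=D):
        h = np.array(bits)
        probs = f(E @ h)                        # local distribution over Sigma and EOS
        rho[bits] = probs[-1]                   # assumption: last row of E is the EOS row
        for i, y in enumerate(alphabet):
            h_next = heaviside(U @ h + V @ one_hot[i] + b)
            delta[(bits, y)] = (tuple(h_next), probs[i])   # transition weight p(y | h)
    return tuple(h0), delta, rho                # initial state, transitions, final weights
```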

HRNNs Can Simulate DPFSAs
This section discusses the other direction of Theorem 4.1, showing that a general DPFSA can be simulated by an HRNN LM using a variant of the classic theorem originally due to Minsky (1954). We give the theorem a probabilistic twist, making it relevant to language modeling.

Lemma 4.2. Let $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ be a DPFSA. Then, there exists a weakly equivalent HRNN LM whose RNN is of size $|\Sigma||Q|$.
We describe the full construction of an HRNN LM simulating a given DPFSA in the next subsection. The full construction is described to showcase the mechanism with which the HRNN can simulate the transitions of a given FSA and to give intuition on why this might, in general, require a large number of parameters in the HRNN. Many principles and constraints of the simulation are also reused in the discussion of the lower bounds on the size of the HRNN required to simulate the DPFSA.

Weighted Minsky's Construction
For a DPFSA $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$, we construct an HRNN LM $(\mathcal{R}, E)$ with $\mathcal{R} = (\Sigma, \sigma, D, U, V, b, h_0)$ defining the same distribution over $\Sigma^*$. The idea is to simulate the transition function $\delta$ with the Elman recurrence by appropriately setting $U$, $V$, and $b$. The transition weights defining the stringsums are represented in $E$.
Let $n : Q \times \Sigma \to \mathbb{Z}_{|Q||\Sigma|}$ be a bijection, i.e., an ordering of $Q \times \Sigma$, $m : \Sigma \to \mathbb{Z}_{|\Sigma|}$ an ordering of $\Sigma$, and $\overline{m} : \overline{\Sigma} \to \mathbb{Z}_{|\overline{\Sigma}|}$ a bijection, i.e., an ordering of $\overline{\Sigma}$. We use $n$, $m$, and $\overline{m}$ to define the one-hot encodings $\llbracket \cdot \rrbracket$ of state-symbol pairs and of the symbols, i.e., we assume that $\llbracket q, y \rrbracket_d = \mathbb{1}\{d = n(q, y)\}$ and $\llbracket y \rrbracket_d = \mathbb{1}\{d = m(y)\}$ for $q \in Q$ and $y \in \Sigma$.
HRNN's hidden states. The hidden state $h_t$ of $\mathcal{R}$ will represent the one-hot encoding of the current state $q_t$ of $\mathcal{A}$ at time $t$, together with the symbol $y_t$ upon reading which $\mathcal{A}$ entered $q_t$. Formally,
$$h_t = \llbracket q_t, y_t \rrbracket \in \mathbb{B}^{|Q||\Sigma|}. \qquad (6)$$
There is a small caveat: How do we set the incoming symbol of $\mathcal{A}$'s initial state $q_\iota$? As we show later, the symbol $y_t$ in $h_t = \llbracket q_t, y_t \rrbracket$ does not affect the subsequent transitions; it is only needed to determine the target of the current transition. Therefore, we can set $h_0 = \llbracket q_\iota, y \rrbracket$ for any $y \in \Sigma$.
Encoding the transition function. The idea of defining $U$, $V$, and $b$ is for the Elman recurrence to perform, upon reading $y_{t+1}$, an element-wise conjunction between the representation of the children of $q_t$ and the representation of the states $\mathcal{A}$ can transition into after reading $y_{t+1}$ from any state. The former is encoded in the recurrence matrix $U$, which has access to the current hidden state encoding $q_t$, while the latter is encoded in the input matrix $V$, which has access to the one-hot representation of $y_{t+1}$. Conjoining the entries in those two representations will, due to the determinism of $\mathcal{A}$, result in a single non-zero entry: one representing the state which can be reached from $q_t$ (first component) using the symbol $y_{t+1}$ (second component); see Fig. 4.

More formally, the recurrence matrix $U$ lives in $\mathbb{B}^{|\Sigma||Q| \times |\Sigma||Q|}$. Each column $U_{:, n(q, y)}$ represents the children of the state $q$ in the sense that the column contains $1$'s at the indices corresponding to the state-symbol pairs $(q', y')$ such that $\mathcal{A}$ transitions from $q$ to $q'$ after reading in the symbol $y'$. That is, for $q, q' \in Q$ and $y, y' \in \Sigma$, we define
$$U_{n(q', y'), n(q, y)} \overset{\text{def}}{=} \mathbb{1}\left\{ q \xrightarrow{y'} q' \in \delta \right\}. \qquad (7)$$
Since $y$ is free, each column is repeated $|\Sigma|$ times, once for every $y \in \Sigma$; this is why, after entering the next state, the symbol used to enter it is not relevant for the determination of the subsequent transitions and, in the case of the initial state, any incoming symbol can be chosen to set $h_0$.
Figure 4: A high-level illustration of how the transition function of the FSA is simulated in Minsky's construction on a fragment of an FSA starting at $q$ (encoded in $h$) and reading the symbol $a$. The top path disjoins the representations of the children of $q$, whereas the bottom path disjoins the representations of states reachable by an $a$-transition. The Heaviside activation conjoins these two representations into $h'$ (rightmost fragment). Projecting $E h'$ results in the vector defining the same probability distribution as the outgoing arcs of $q$ (red box).
The input matrix $V$ lives in $\mathbb{B}^{|\Sigma||Q| \times |\Sigma|}$ and encodes the information about which states can be reached by which symbols (from any state). The non-zero entries in the column corresponding to $y' \in \Sigma$ correspond to the state-symbol pairs $(q', y')$ such that $q'$ is reachable with $y'$ from some state:
$$V_{n(q', y'), m(y')} \overset{\text{def}}{=} \mathbb{1}\left\{ \exists q \in Q : q \xrightarrow{y'} q' \in \delta \right\}. \qquad (8)$$
Lastly, we define the bias as $b \overset{\text{def}}{=} -\mathbf{1} \in \mathbb{R}^{|Q||\Sigma|}$, which allows the Heaviside function to perform the needed conjunction. The correctness of this process is proved in Appendix A (Lemma A.1).
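A minimal sketch of how the matrices $U$, $V$, and the bias $b$ could be assembled from a deterministic transition function; the dictionary-based encoding is an illustrative assumption, not the paper's implementation.

```python
# Minimal sketch of the transition part of the weighted Minsky construction: U encodes the
# children of each state, V encodes which (state, symbol) pairs are reachable by each symbol,
# and b = -1 lets the Heaviside activation perform the conjunction.
import numpy as np

def minsky_transition_matrices(states, alphabet, delta):
    """delta: dict (q, y) -> q_next of a deterministic (unweighted) transition function."""
    n = {(q, y): i for i, (q, y) in enumerate((q, y) for q in states for y in alphabet)}
    m = {y: i for i, y in enumerate(alphabet)}
    D = len(states) * len(alphabet)
    U = np.zeros((D, D))
    V = np.zeros((D, len(alphabet)))
    for (q, y_next), q_next in delta.items():
        for y in alphabet:                      # column n(q, y) lists the children of q (Eq. (7))
            U[n[(q_next, y_next)], n[(q, y)]] = 1.0
        V[n[(q_next, y_next)], m[y_next]] = 1.0  # (q_next, y_next) is reachable via y_next (Eq. (8))
    b = -np.ones(D)
    return U, V, b, n, m

# One simulation step then reads h_next = H(U @ h + V @ one_hot(y) + b), which has a single
# 1 at the index n(q_next, y) thanks to the conjunction performed by the Heaviside function.
```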
Encoding the transition probabilities. We now turn to the second part of the construction: encoding the string acceptance weights given by $\mathcal{A}$ into the probability distribution defined by $\mathcal{R}$. We present two ways of doing that: using the standard softmax formulation, where we make use of the extended real numbers, and with the sparsemax.
The conditional probabilities assigned by $\mathcal{R}$ are controlled by the $|\overline{\Sigma}| \times |Q||\Sigma|$-dimensional output matrix $E$. Since $h_t$ is a one-hot encoding of the state-symbol pair $(q_t, y_t)$, the matrix-vector product $E h_t$ simply looks up the values in the $n(q_t, y_t)$-th column. After being projected to $\Delta^{|\overline{\Sigma}| - 1}$, the entry in the projected vector corresponding to some $y_{t+1} \in \overline{\Sigma}$ should match the probability of $y_{t+1}$ given that $\mathcal{A}$ is in the state $q_t$, i.e., the weight on the transition $q_t \xrightarrow{y_{t+1}/\circ} \circ$ if $y_{t+1} \in \Sigma$ and $\rho(q_t)$ if $y_{t+1} = \text{EOS}$. This is easy to achieve by simply encoding the weights of the outgoing transitions into the $n(q_t, y_t)$-th column, depending on the projection function used.

This is especially simple in the case of the sparsemax formulation. By definition, in a PFSA, the weights of the outgoing transitions and the final weight of a state $q_t$ form a probability distribution over $\overline{\Sigma}$ for every $q_t \in Q$. Projecting those values to the probability simplex therefore leaves them intact. We can therefore define
$$E_{\overline{m}(y'), n(q, y)} \overset{\text{def}}{=} \begin{cases} \omega(q \xrightarrow{y'/\circ} \circ) & \text{if } y' \in \Sigma \\ \rho(q) & \text{if } y' = \text{EOS}. \end{cases} \qquad (9)$$
Projecting the resulting vector $E h_t$ therefore results in a vector whose entries represent the transition probabilities of the symbols in $\overline{\Sigma}$.
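The sparsemax variant of the output matrix $E$ can be sketched analogously; the encoding choices (dictionaries, an explicit EOS token) are assumptions for illustration.

```python
# Minimal sketch of the output matrix E in the sparsemax formulation: column n(q, y) stores
# the outgoing transition probabilities of state q, with the final weight rho(q) in the EOS row.
import numpy as np

EOS = "<eos>"

def minsky_output_matrix(states, alphabet, weighted_delta, rho):
    """weighted_delta: dict (q, y') -> (q_next, w) of a DPFSA; rho: dict q -> final weight."""
    n = {(q, y): i for i, (q, y) in enumerate((q, y) for q in states for y in alphabet)}
    m_bar = {y: i for i, y in enumerate(list(alphabet) + [EOS])}
    E = np.zeros((len(alphabet) + 1, len(states) * len(alphabet)))
    for q in states:
        for y in alphabet:                        # one copy per possible incoming symbol
            col = n[(q, y)]
            for (q_src, y_out), (_, w) in weighted_delta.items():
                if q_src == q:
                    E[m_bar[y_out], col] = w      # weight of q --y_out--> (Eq. (9))
            E[m_bar[EOS], col] = rho.get(q, 0.0)  # final weight rho(q) in the EOS row
    return E

# With h_t the one-hot vector of (q_t, y_t), E @ h_t is exactly the n(q_t, y_t)-th column,
# and sparsemax leaves it intact since it is already a probability distribution.
```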
In the more standard softmax formulation, we proceed similarly but take the logarithms of the non-zero transition weights. Defining $\log 0 \overset{\text{def}}{=} -\infty$, we set
$$E_{\overline{m}(y'), n(q, y)} \overset{\text{def}}{=} \begin{cases} \log \omega(q \xrightarrow{y'/\circ} \circ) & \text{if } y' \in \Sigma \\ \log \rho(q) & \text{if } y' = \text{EOS}. \end{cases} \qquad (10)$$
It is easy to see that the entries of the vector $\mathrm{softmax}(E h_t)$ form the same probability distribution as the original outgoing transitions out of $q$. Over the course of an entire input string, these weights are multiplied as the RNN transitions between different hidden states corresponding to the transitions in the original DPFSA $\mathcal{A}$. The proof can be found in Appendix A (Lemma A.2). This establishes the complete equivalence between HRNN LMs and FSLMs; the full discussion of this result is postponed to §6.

5 Lower Bound on the Space Complexity of Simulating PFSAs with RNNs

Lemma 4.2 shows that HRNN LMs are at least as expressive as DPFSAs. More precisely, it shows that any DPFSA $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ can be simulated by an HRNN LM of size $O(|Q||\Sigma|)$. In this section, we address the following question: How large does an HRNN LM have to be such that it can correctly simulate a DPFSA? We study the asymptotic bounds with respect to the size of the set of states, $|Q|$, as well as the number of symbols, $|\Sigma|$.

Asymptotic Bounds in |Q|
Intuitively, the $2^D$ configurations of a $D$-dimensional HRNN hidden state could represent $2^D$ states of a (DP)FSA. One could therefore hope to achieve exponential compression of a DPFSA by representing it as an HRNN LM; indeed, any DPFSA defined from an RNN as described in the proof of Lemma 4.1 can naturally be exponentially compressed by representing it with an HRNN. Interestingly, this is not possible in general: Extending work by Dewdney (1977), Indyk (1995) shows that there exist unweighted FSAs which require an HRNN of size $\Omega(|\Sigma| \sqrt{|Q|})$ to be simulated. This bound is tight: any FSA can be simulated by an HRNN of size $O(|\Sigma| \sqrt{|Q|})$; for completeness, we present the constructions by Dewdney (1977) and Indyk (1995) in our notation in Appendix E.

We now ask whether the same lower bound can also be achieved when simulating DPFSAs. We find that the answer is negative: There exist DPFSAs which require an HRNN LM of size $\Omega(|\Sigma||Q|)$ to faithfully represent their probability distribution. Since the transition function of the underlying FSA can be simulated more efficiently, the bottleneck comes from the requirement of weak equivalence. Indeed, as the proof of the following theorem shows (Theorem 5.1, proved in Appendix A), the issue intuitively arises from the fact that, unlike in an HRNN LM, the local probability distributions of the different states in a PFSA are completely arbitrary, whereas in an HRNN LM they are defined by shared parameters (the output matrix $E$).
Theorem 5.1. There exist FSLMs with minimal DPFSA $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ for which the size of any weakly equivalent HRNN LM must scale linearly with $|Q|$.
Note that the linear lower bound holds in the case that the transition matrix (which corresponds to the output matrix $E$ in the RNN LM) of the DPFSA is full-rank. If the transition matrix is low-rank, its decomposition into smaller matrices can be carried over to the output matrix of the RNN, reducing the size of the hidden state to the rank of the matrix. Indeed, there exist regular languages motivated by phenomena in human language that can be represented in space logarithmic in the number of states of their minimal FSA. For example, Hewitt et al. (2020) show that bounded Dyck languages of $k$ parenthesis types and of maximal depth $m$, which require an FSA with $k^m$ states to be recognized, can be represented by HRNN LMs of size $m \log k$, which is an exponential improvement over Indyk's lower bound.
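The following toy sketch illustrates the intuition behind such a compact encoding: the state of a bounded-depth Dyck language is a stack of at most $m$ open brackets drawn from $k$ types, which fits into roughly $m \log k$ bits. Hewitt et al.'s (2020) actual construction differs in its details.

```python
# Minimal sketch: a bounded-depth Dyck state (stack of <= m open brackets of k types) fits
# into m * ceil(log2(k + 1)) bits, exponentially fewer than the k^m FSA states would suggest.
import math

def encode_stack(stack, k, m):
    """Encode a stack (tuple of bracket types in 1..k, length <= m) as a bit string."""
    width = math.ceil(math.log2(k + 1))        # 0 marks an empty slot
    slots = list(stack) + [0] * (m - len(stack))
    return "".join(format(slot, f"0{width}b") for slot in slots)

print(encode_stack((2, 1), k=3, m=4))          # '10010000': 8 bits instead of 3^4 states
```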

Asymptotic Bounds in |Σ|
Since each of the input symbols can be encoded in $\log |\Sigma|$ bits, one could expect that the linear factor in the size of the alphabet from the constructions above could be reduced to $O(\log |\Sigma|)$. However, we again find that such a reduction is in general not possible: the set of FSAs presented in Appendix B is an example of a family that requires an HRNN whose size scales linearly with $|\Sigma|$ to be simulated correctly, which implies the following theorem.

Theorem 5.2. There exist FSLMs defined by DPFSAs $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ for which the size of any weakly equivalent HRNN LM must scale linearly with $|\Sigma|$.
Based on the challenges encountered in the example from Appendix B, we devise a simple sufficient condition for a logarithmic compression with respect to $|\Sigma|$ to be possible: namely, that for any pair of states $q, q' \in Q$, there is at most a single transition leading from $q$ to $q'$. Importantly, this condition is met by classical $n$-gram LMs and by the languages studied by Hewitt et al. (2020). This intuitive characterization can be formalized by a property we call $\log |\Sigma|$-separability.
Definition 5.1. An FSA $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ is $\log |\Sigma|$-separable if it is deterministic and, for any pair $q, q' \in Q$, there is at most one symbol $y \in \Sigma$ such that $q \xrightarrow{y} q' \in \delta$.
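Definition 5.1 is straightforward to check mechanically; the following sketch (with an assumed list-of-triples encoding of $\delta$) does so and illustrates that a bigram-like FSA, whose target state remembers the last symbol read, satisfies the condition.

```python
# Minimal sketch of checking log|Sigma|-separability: at most one symbol may label a
# transition between any ordered pair of states of a deterministic FSA.
from collections import defaultdict

def is_log_separable(delta):
    """delta: iterable of (q, y, q_next) transitions of a deterministic FSA."""
    symbols_between = defaultdict(set)
    for q, y, q_next in delta:
        symbols_between[(q, q_next)].add(y)
    return all(len(symbols) <= 1 for symbols in symbols_between.values())

# A bigram-like FSA whose target state remembers the last symbol read satisfies the condition.
print(is_log_separable([("q_a", "a", "q_a"), ("q_a", "b", "q_b"),
                        ("q_b", "a", "q_a"), ("q_b", "b", "q_b")]))   # True
```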
$\log |\Sigma|$-separability is a relatively restrictive condition. To amend that, we introduce a simple procedure which, at the expense of enlarging the state space by a factor of $|\Sigma|$, transforms a general deterministic (unweighted) FSA into a $\log |\Sigma|$-separable one. Since this procedure does not apply to weighted automata, it is presented in Appendix C.
6 Discussion

§4 and §5 provided the technical results behind the relationship between HRNN LMs and DPFSAs. To put those results in context, we now discuss some of their implications.

Equivalence of HRNN LMs and DPFSAs
The equivalence between HRNN LMs and DPFSAs based on Lemmas 4.1 and 4.2 allows us to establish a number of constraints on the probability distributions expressible by HRNN LMs. For example, this result shows that HRNNs are at most as expressive as deterministic PFSAs and, therefore, strictly less expressive than general, non-deterministic PFSAs, due to the well-known result that not all non-deterministic PFSAs have a deterministic equivalent (Mohri, 1997).12 An example of a simple non-determinizable PFSA, i.e., a PFSA whose distribution cannot be expressed by an HRNN LM, is shown in Fig. 5.13

Moreover, connecting HRNN LMs to DPFSAs allows us to draw on results from (weighted) formal language theory to manipulate and investigate HRNN LMs. For example, we can apply general results on the tightness of language models based on DPFSAs to HRNN LMs (Du et al., 2023).14 Even if the HRNN LM is not tight a priori, the fact that the normalizing constant can be computed means that it can always be re-normalized to form a probability distribution over $\Sigma^*$. Furthermore, we can draw on the various results on the minimization of DPFSAs to reduce the size of the HRNN implementing the LM.

12 General PFSAs are, in turn, equivalent to probabilistic regular grammars and discrete HMMs (Icard, 2020).
13 Even if a non-deterministic PFSA can be determinized, the number of states of the determinized machine can be exponential in the size of the non-deterministic one (Buchsbaum et al., 2000). In this sense, non-deterministic PFSAs can be seen as exponentially compressed representations of FSLMs. The compactness of this non-deterministic representation must be "undone" using determinization before it can be encoded by an HRNN.
14 Informally, the question of tightness concerns whether the LM forms a valid probability distribution over $\Sigma^*$, which is not necessarily the case for locally normalized LMs such as RNN LMs.

Figure 5: A non-determinizable PFSA. It assigns the string $ab^n c$ the probability $\mathcal{A}(ab^n c) = 0.5 \cdot 0.9^n \cdot 0.1 + 0.5 \cdot 0.1^n \cdot 0.9$, which cannot be expressed as a single term for arbitrary $n \in \mathbb{N}_{\geq 0}$.
While Lemma 4.1 focuses on HRNN LMs and shows that they are finite-state, a similar argument could be made for any RNN whose activation functions map onto a finite set. This is the case for any RNN running on a computer with finite-precision arithmetic; in that sense, all deployed RNN LMs are finite-state, albeit with a very large state space.
In other words, one can view RNNs as very compact representations of large DPFSAs whose transition functions are represented by the RNN's update function. Furthermore, since the topology and the weights of the implicit DPFSA are determined by the RNN's update function, the DPFSA can be learned very flexibly yet efficiently based on the training data. This is enabled by the sharing of parameters across the entire graph of the DPFSA instead of explicitly parametrizing every possible transition in the DPFSA or hard-coding the allowed transitions as in $n$-gram LMs.
A note on the use of the Heaviside function. Minsky's construction uses the Heaviside activation function to implement conjunction. Note that, conveniently, we could also use the more popular ReLU function: a closer look at Minsky's construction shows that the only action performed by the Heaviside function is clipping negative values to $0$, while non-negative values are left intact. Since ReLU behaves the same way on the relevant set of values, it could simply be swapped in for the Heaviside unit. This shows that the convenient binary structure of the Heaviside function does not enhance the representational capacity of the model in any way; as one would expect, ReLU-activated Elman RNN LMs are at least as expressive as Heaviside-activated ones.

6.2 Space Complexity of Simulating DPFSAs with HRNN LMs

Theorems 5.1 and 5.2 establish lower bounds on how efficiently HRNN LMs can represent FSLMs, which are, to the best of our knowledge, the first results characterizing this space complexity. They reveal how the flexible local distributions of individual states in a PFSA require a large number of parameters in the simulating RNN to be matched. This implies that the simple Minsky construction is in fact asymptotically optimal in the case of WFSAs, even though the transition function of the underlying FSA can be simulated more efficiently. Nonetheless, the fact that RNNs can represent some FSLMs compactly is interesting. The languages studied by Hewitt et al. (2020) and Bhattamishra et al. (2020) can be very compactly represented by an HRNN LM and have clear linguistic motivations. Investigating whether other linguistically motivated phenomena in human language can be efficiently represented by HRNN LMs is an interesting area of future work, as it would not only yield insights into the full representational capacity of these models but also reveal additional inductive biases they use that can be exploited for more efficient learning and modeling.

Related Work
To the best of our knowledge, the only existing connection between RNNs and a weighted formalism was made by Peng et al. (2018), where the authors connect recurrences analogous to Eq. (2) of different RNN variants to the process of computing the probability of a string under a general PFSA. With this, they are able to show that the hidden states of an RNN can be used to store the distribution over the current possible states and the probability of the read string, which can be used to upper-bound the representational capacity of specific RNN variants. Importantly, the interpretation of the hidden state is different from ours: rather than tracking the current state of the PFSA, Peng et al. (2018)'s construction stores the distribution over all possible states. While this suggests a way of simulating PFSAs, the translation of the probabilities captured in the hidden state to the probability under an RNN LM is not straightforward. Weiss et al. (2018), Merrill (2019), and Merrill et al. (2020) consider the representational capacity of so-called saturated RNNs, whose parameters take their limiting values $\pm\infty$ to make the updates to the hidden states discrete. In this sense, their formal model is similar to ours. However, rather than considering the probabilistic representational capacity, they consider the flexibility of the update mechanisms of the variants in the sense of their long-term dependencies and the number of values the hidden states can take as a function of the string length. Connecting the assumptions of saturated activations with the results of Peng et al. (2018), they establish a hierarchy of different RNN architectures based on whether their update step is finite-state and whether the hidden state can be used to store arbitrary amounts of information. Analogous to our results, they show that Elman RNNs are finite-state while some other variants such as LSTMs are provably more expressive.
In a different line of work, Weiss et al. (2019) study the ability to learn a concise DPFSA from a given RNN LM. This can be seen as a relaxed setting of the proof of Lemma 4.1, where multiple hidden states are merged into a single state of the learned DPFSA to keep the representation compact. The work also discusses the advantages of considering deterministic models due to their interpretability and computational efficiency, motivating the connection between LMs and DPFSAs.
Discussion of some additional (less) related work can be found in Appendix D.

Conclusion
We prove that Heaviside Elman RNNs define the same set of probability distributions over strings as the well-understood class of deterministic probabilistic finite-state automata. To do so, we extend Minsky's classical construction of an HRNN simulating an FSA to the probabilistic case. We show that Minsky's construction is in some sense also optimal: Any HRNN representing the same distribution as some DPFSA over strings from an alphabet $\Sigma$ will, in general, require hidden states of size at least $\Omega(|\Sigma||Q|)$, which is the space complexity of Minsky's construction.

Limitations
This paper aims to provide a first step at understanding modern LMs with weighted formal language theory and thus paints an incomplete picture of the entire landscape. While the formalization we choose here has been widely adopted in previous work (Minsky, 1954; Dewdney, 1977; Indyk, 1995), the assumptions we make about the models, e.g., binary activations and the simple recurrent steps, are overly restrictive to represent the models used in practice; see also §6 for a discussion on the applicability to more complex models. It is likely that different formalizations of the RNN LM, e.g., those with asymptotic weights (Weiss et al., 2018; Merrill et al., 2020; Merrill, 2019), would yield different theoretical results. Furthermore, any inclusion of infinite precision would bring RNN LMs much higher up on the Chomsky hierarchy (Siegelmann and Sontag, 1992). Studying more complex RNN models, such as LSTMs, could also yield different results, as LSTMs are known to be in some ways more expressive than simple RNNs (Weiss et al., 2018; Merrill et al., 2020).
Another important aspect of our analysis is the use of explicit constructions to show the representational capacity of various models. While such constructions show theoretical equivalence, it is unlikely that trained RNN LMs would learn the proposed mechanisms in practice, as they tend to rely on dense representations of the context (Devlin et al., 2019). This makes it more difficult to use the results to analyze trained models. Rather, our results aim to provide theoretical upper bounds on what could be learned.
Lastly, we touch upon the applicability of finite-state languages to the analysis of human language. Human language is famously thought not to be finite-state (Chomsky, 1957), and while large portions of it might be modellable by finite-state machines, such formalisms lack the structure and interpretability of some mechanisms higher on the Chomsky hierarchy. For example, the very simple examples of (bounded) nesting expressible with context-free grammars are relatively awkward to express with finite-state formalisms such as finite-state automata: while they are expressible with such formalisms, the implementations lack the conciseness (and thus the inductive biases) of the higher-level formalisms. On the other hand, some prior work suggests that finding finite-state mechanisms could nonetheless be useful for understanding the inner workings of LMs and human language (Hewitt et al., 2020).

A Proofs
A.1 Performing the Logical AND with an HRNN

Minsky's construction requires the RNN to perform the logical AND operation between specific entries of binary vectors $x \in \mathbb{B}^D$. The following fact shows how this can easily be performed by an HRNN with appropriately set parameters.
Fact A.1. Consider $m$ indices $i_1, \ldots, i_m \in \mathbb{Z}_D$ and vectors $x, v \in \mathbb{B}^D$ such that $v_i = \mathbb{1}\{i \in \{i_1, \ldots, i_m\}\}$, i.e., with entries $1$ at indices $i_1, \ldots, i_m$. Then, $H(v^\top x - (m - 1)) = 1$ if and only if $x_{i_k} = 1$ for all $k = 1, \ldots, m$. In other words,
$$H\left(v^\top x - (m - 1)\right) = \bigwedge_{k=1}^{m} x_{i_k}.$$

As a special case, $m = 2$ in Fact A.1 corresponds to the AND operation of two elements, which is used in Minsky's construction. There, the vector $v$ corresponds to the weights of a single neuron, while $-(m - 1)$ ($-1$ for $m = 2$) corresponds to its bias.
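Fact A.1 can be verified directly; the following minimal sketch implements the single AND-neuron for an arbitrary set of monitored indices.

```python
# Minimal sketch of Fact A.1: a single Heaviside neuron with weights v (ones at the monitored
# indices) and bias -(m - 1) computes the conjunction of the monitored bits.
import numpy as np

def and_neuron(x, indices):
    v = np.zeros(len(x))
    v[list(indices)] = 1.0
    m = len(indices)
    return float(v @ x - (m - 1) > 0)          # H(v^T x - (m - 1))

x = np.array([1.0, 0.0, 1.0, 1.0])
print(and_neuron(x, [0, 2]))    # 1.0: both monitored entries are 1
print(and_neuron(x, [0, 1]))    # 0.0: entry 1 is 0
```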
We now present the proofs of the lemmas establishing the equivalence of DPFSAs and HRNN LMs.
Proof.Let R " pΣ, σ, D, U, V, b, h 0 q be a HRNN defining the conditional probabilities p.We construct a deterministic PFSA A " pΣ, Q, δ, λ, ρq defining the same string probabilities.Let s : be a bijection.Now, for every state q def " sphq P Q def " Z 2 D , construct a transition q y{w Ý Ý Ñ q 1 where q 1 " spσ pUh `V y `bqq with the weight w " p py | hq " f pE hq y .We define the initial function as λ psphqq " 1 th " h 0 u and final function ρ with ρ pqq def " p pEOS | spqqq.It is easy to see that A defined this way is deterministic.We now prove that the weights assigned to strings by A and R are the same.Let y P Σ ˚with |y| " T and π " ˆsph 0 q y 1 {w 1 ÝÝÝÑ q 1 , . . ., q T ´1 y T {w T Ý ÝÝÝ Ñ q T ṫhe y-labeled path starting in sph 0 q (such a path exists since we the defined automaton is complete-all possible transitions are defined for all states).
Its weight is then
$$w(\pi) = \lambda(s(h_0)) \prod_{t=1}^{T} w_t \, \rho(q_T) = \prod_{t=1}^{T} p\left(y_t \mid s^{-1}(q_{t-1})\right) \cdot p\left(\text{EOS} \mid s^{-1}(q_T)\right) = p(y),$$
where $q_0 \overset{\text{def}}{=} s(h_0)$, which is exactly the weight assigned to $y$ by $\mathcal{R}$. Note that all paths not starting in $s(h_0)$ have weight $0$ due to the definition of the initial function. ■

Lemma A.1. Let $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ be a deterministic PFSA, $y = y_1 \ldots y_T \in \Sigma^*$, and $q_t$ the state arrived at by $\mathcal{A}$ upon reading the prefix $y_{\leq t}$. Let $\mathcal{R}$ be the HRNN specified by the Minsky construction for $\mathcal{A}$, $n$ the ordering defining the one-hot representations of state-symbol pairs by $\mathcal{R}$, and $h_t$ $\mathcal{R}$'s hidden state after reading $y_{\leq t}$. Then, it holds that $h_0 = \llbracket q_\iota, y \rrbracket$ for some $y \in \Sigma$, where $q_\iota$ is the initial state of $\mathcal{A}$, and $h_T = \llbracket q_T, y_T \rrbracket$.
Proof.Define sph " pq, yq q def " q.We can then restate the lemma as sph T q " q T for all y P Σ ˚, |y| " T .Let π be the y-labeled path in A. We prove the lemma by induction on the string length T .
Base case: T " 0. Holds by the construction of h 0 .
Inductive step: $T > 0$. Let $y \in \Sigma^*$ with $|y| = T$ and assume that $s(h_{T-1}) = q_{T-1}$.
We prove that the specifications of $U$, $V$, and $b$ ensure that $s(h_T) = q_T$. By definition of the recurrence matrix $U$ (cf. Eq. (7)), the vector $U h_{T-1}$ will contain a $1$ at the entries $n(q', y')$ for $q' \in Q$ and $y' \in \Sigma$ such that $q_{T-1} \xrightarrow{y'/\circ} q' \in \delta$. This can equivalently be written as $U h_{T-1} = \bigvee_{q_{T-1} \xrightarrow{y'/\circ} q' \in \delta} \llbracket q', y' \rrbracket$, where the disjunction is applied element-wise.
On the other hand, by definition of the input matrix $V$ (cf. Eq. (8)), the vector $V \llbracket y_T \rrbracket$ will contain a $1$ at the entries $n(q', y_T)$ for $q' \in Q$ such that $\circ \xrightarrow{y_T/\circ} q' \in \delta$. This can also be written as $V \llbracket y_T \rrbracket = \bigvee_{\circ \xrightarrow{y_T/\circ} q' \in \delta} \llbracket q', y_T \rrbracket$.
By Fact A.1, $H(U h_{T-1} + V \llbracket y_T \rrbracket + b)_{n(q', y')} = H(U h_{T-1} + V \llbracket y_T \rrbracket - \mathbf{1})_{n(q', y')} = 1$ holds if and only if $(U h_{T-1})_{n(q', y')} = 1$ and $(V \llbracket y_T \rrbracket)_{n(q', y')} = 1$. This happens if and only if $y' = y_T$ and $q_{T-1} \xrightarrow{y_T/\circ} q' \in \delta$, i.e., if and only if $\mathcal{A}$ transitions from $q_{T-1}$ to $q_T$ upon reading $y_T$ (it transitions only to $q_T$ due to determinism).
Since the string $y$ was arbitrary, this finishes the proof. ■

Lemma A.2. Let $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ be a deterministic PFSA, $y = y_1 \ldots y_T \in \Sigma^*$, and $q_t$ the state arrived at by $\mathcal{A}$ upon reading the prefix $y_{\leq t}$. Let $\mathcal{R}$ be the HRNN specified by the Minsky construction for $\mathcal{A}$, $E$ the output matrix specified by the generalized Minsky construction, $n$ the ordering defining the one-hot representations of state-symbol pairs by $\mathcal{R}$, and $h_t$ $\mathcal{R}$'s hidden state after reading $y_{\leq t}$. Then, it holds that $p(y) = \mathcal{A}(y)$.
Proof. Let $y \in \Sigma^*$, $|y| = T$, and let $\pi$ be the $y$-labeled path in $\mathcal{A}$. Again, let $\bar{p}(y) \overset{\text{def}}{=} \prod_{t=1}^{|y|} p(y_t \mid y_{<t})$. We prove $\bar{p}(y) = \prod_{t=1}^{T} w_t$ by induction on $T$.

Base case: $T = 0$. Both sides are empty products and thus equal $1$.
Inductive step: $T > 0$. Assume that $\bar{p}(y_1 \ldots y_{T-1}) = \prod_{t=1}^{T-1} w_t$. By Lemma A.1, we know that $s(h_{T-1}) = q_{T-1}$ and $s(h_T) = q_T$. By the definition of $E$ for the specific $f$, it holds that $f(E h_{T-1})_{\overline{m}(y_T)} = \omega(s(h_{T-1}) \xrightarrow{y_T/w_T} s(h_T)) = w_T$. This means that $\bar{p}(y_{\leq T}) = \prod_{t=1}^{T} w_t$, which is what we wanted to prove.
Clearly, p pyq " p pyq p pEOS | yq.By the definition of E (cf.Eq. ( 9)), pEh T q mpEOSq " ρ psph T qq, meaning that p pyq " p pyq p pEOS | yq " ś T t"1 w t ρ psph T qq " A pyq. Since y P Σ ˚was arbitrary, this finishes the proof.■ A note on strong equivalence.The purpose of Lemma A.2 and ?? was to show the existence of a weakly equivalent (cf.Definition 2.1) HRNN LM given a DPFSA defining a finite-state LM and vice versa.We keep the discussion in the main part of the paper restricted to weak equivalence for brevity.However, note that the proofs of the lemmas in fact establish the existence of a strongly equivalent DPFSA and HRNN LM, respectively.This can easily be seen from the one-to-one correspondence between path scanning a given string in the DPFSA and the sequence of hidden states generating the same string in the HRNN LM.In this sense, the connection between DPFSAs and HRNN LMs is even tighter than just defining the same probability distribution; however, we are mainly interested in the implications of the simpler weak equivalence.
Theorem 5.1. There exist FSLMs with minimal DPFSA $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ for which the size of any weakly equivalent HRNN LM must scale linearly with $|Q|$.

Proof. Without loss of generality, we work with $\mathbb{R}$-valued hidden states. Let $\mathcal{A}$ be a minimal deterministic PFSA and $\mathcal{R} = (\Sigma, \sigma, D, U, V, b, h_0)$ an HRNN with $p(y) = \mathcal{A}(y)$ for every $y \in \Sigma^*$. Let $y_{<T} \in \Sigma^*$ and $y_{\leq T} \overset{\text{def}}{=} y_{<T} y_T$ for some $y_T \in \Sigma$. Define $\bar{p}(y) \overset{\text{def}}{=} \prod_{t=1}^{|y|} p(y_t \mid y_{<t})$. It is easy to see that $\bar{p}(y_{<T} y_T) = \bar{p}(y_{<T})\, p(y_T \mid y_{<T})$. The probabilities in the conditional distribution $p(\cdot \mid y_{<T})$ are determined by the values in $E h_{T-1}$. By definition of the deterministic PFSA, there are $|Q|$ such conditional distributions. Moreover, these distributions (represented by vectors in $\Delta^{|\overline{\Sigma}| - 1}$) can in general be linearly independent. This means that, for any $q$, the probability distribution of the outgoing transitions cannot be expressed as a linear combination of the probability distributions of the other states. To express the probability vectors of all states, the columns of the output matrix $E$ therefore have to span a space of dimension at least $|Q|$, implying that $E$ must have at least $|Q|$ columns. This means that the total space complexity (and thus the size of the HRNN representing the same distribution as $\mathcal{A}$) is $\Omega(|Q|)$. ■

B Lower Space Bounds in |Σ| for Simulating Deterministic PFSAs with HRNNs
In this section, we provide a family of DPFSAs which require an HRNN LM whose size must scale linearly with the size of the alphabet. We also provide a sketch of the proof of why a compression in $|\Sigma|$ is not possible. Let $\mathcal{A}_N = (\Sigma_N, \{0, 1\}, \{0\}, \{1\}, \delta_N)$ be an FSA over the alphabet $\Sigma_N = \{y_1, \ldots, y_N\}$ with the transition set $\delta_N$ depicted in Fig. 6. Clearly, to be able to correctly represent all local distributions of the DPFSA, the HRNN LM must contain a representation of each possible state of the DPFSA in a unique hidden state. On the other hand, the only way that the HRNN can take into account the information about the current state $q_t$ of the simulated FSA $\mathcal{A}$ is through the hidden state $h_t$. The hidden state, in turn, only interacts with the recurrence matrix $U$, which does not have access to the current input symbol $y_{t+1}$. The only interaction between the current state and the input symbol is thus through the addition in $U h_t + V \llbracket y_{t+1} \rrbracket$. This means that, no matter how the information about $q_t$ is encoded in $h_t$, in order to take into account all possible transitions stemming from $q_t$ (before taking into account $y_{t+1}$), $U h_t$ must activate all possible next states, i.e., all children of $q_t$. On the other hand, since $V \llbracket y_{t+1} \rrbracket$ does not have precise information about $q_t$, it must activate all states which can be entered with a $y_{t+1}$-transition, just like in Minsky's construction.
In Minsky's construction, the recognition of the correct next state was done by keeping a separate entry (a one-dimensional sub-vector) for each possible pair $(q_{t+1}, y_{t+1})$. However, when working with compressed representations of states (e.g., in logarithmic space), a single common sub-vector of size $< |\Sigma|$ (e.g., $\log |\Sigma|$) has to be used for all possible symbols $y \in \Sigma$. Nonetheless, the interaction between $U h_t$ and $V \llbracket y_{t+1} \rrbracket$ must then ensure that only the correct state $q_{t+1}$ is activated. For example, in Minsky's construction, this was done by simply taking the conjunction between the entries corresponding to $(q, y)$ in $U h_t$ and the entries corresponding to $(q', y')$ in $V \llbracket y' \rrbracket$, which were all represented in individual entries of the vectors. In the case of the logarithmic encoding, on the other hand, this could intuitively be done by trying to match the $\log |\Sigma|$ ones in the representation $(\mathrm{b}(y) \mid \mathbf{1} - \mathrm{b}(y))$, where $\mathrm{b}(y)$ represents the binary encoding of $y$. If the $\log |\Sigma|$ ones match (which is checked simply, as it results in a large enough sum in the corresponding entry of the matrix-vector product), the correct transition could be chosen (to perform the conjunction from Fact A.1 correctly, the bias would simply be set to $\log |\Sigma| - 1$). However, an issue arises as soon as multiple dense representations of symbols in $V \llbracket y \rrbracket$ have to be activated against the same sub-vector in $U h_t$: the only way this can be achieved is if the sub-vector in $U h_t$ contains the disjunction of the representations of all the symbols which should be activated with it. If this sets too many entries in $U h_t$ to one, this can result in "false positives". This is explained in more detail for the DPFSAs in Fig. 6 next.
Let $r_n$ represent any dense encoding of $y_n$ in the alphabet of $\mathcal{A}_N$ (e.g., in the logarithmic case, that would be $(\mathrm{b}(n) \mid \mathbf{1} - \mathrm{b}(n))$). Following the intuition outlined above, in any HRNN simulating $\mathcal{A}_N$, the vector $U h_0$ must, among other things, contain a sub-vector corresponding to the states $1$ and $2$. The sub-vector corresponding to the state $2$ must activate (through the interaction in the Heaviside function) against any $y_n$ for $n = 2, \ldots, N$ in $\mathcal{A}_N$. This means it has to match all representations $r_n$ for $n = 2, \ldots, N$. The only way this can be done is if the pattern for recognizing that state $2$ is being entered with any $y_n$ for $n = 2, \ldots, N$ is of the form $r = \bigvee_{n=2}^{N} r_n$. However, for sufficiently large $N$, $r = \bigvee_{n=2}^{N} r_n$ will be a vector of all ones, including all entries active in $r_1$. This means that any encoding of a symbol will be activated against it, among others, $y_1$. Upon reading $y_1$ in state $1$, the network will therefore not be able to deterministically activate only the sub-vector corresponding to the correct state $1$. This means that the linear-size encoding of the symbols is, in general, optimal for representing DPFSAs with HRNN LMs.
C Transforming a General Deterministic FSA into a log |Σ|-separable FSA

$\log |\Sigma|$-separability is a relatively restrictive condition. To amend that, we introduce a simple procedure which, at the expense of enlarging the state space by a factor of $|\Sigma|$, transforms a general deterministic FSA into a $\log |\Sigma|$-separable one. We call this $\log |\Sigma|$-separation. Intuitively, it augments the state space by introducing a new state $(q, y)$ for every outgoing transition $q \xrightarrow{y} q'$ of every state $q \in Q$, such that $(q, y)$ simulates the only state the original state $q$ would transition to upon reading $y$. Due to the determinism of the original FSA, this results in a $\log |\Sigma|$-separable FSA with at most $|Q||\Sigma|$ states.
While the increase of the state space might seem like a step backward, recall that, using Indyk's construction, we can construct an HRNN simulating an FSA whose size scales with the square root of the number of states. And, since the resulting FSA is $\log |\Sigma|$-separable, we can reduce the space complexity with respect to $\Sigma$ to $\log |\Sigma|$. This is summarized in the following theorem, which characterizes how compactly general deterministic FSAs can be encoded by HRNNs. To our knowledge, this is the tightest bound on simulating general unweighted deterministic FSAs with HRNNs.

Theorem C.1. Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a minimal FSA recognizing the language $L$. Then, there exists an HRNN $\mathcal{R} = (\Sigma, \sigma, D, U, V, b, h_0)$ accepting $L$ with $D = O\left(\log |\Sigma| \sqrt{|\Sigma||Q|}\right)$.
The full $\log |\Sigma|$-separation procedure is presented in Algorithm 1. It follows the intuition of creating a separate "target" for each transition $q \xrightarrow{y} q'$ of every state $q \in Q$. To keep the resulting FSA deterministic, a new, artificial initial state with no incoming transitions is added and connected to the augmented versions of the children of the original initial state.
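One possible implementation of the separation, not the paper's listing, is sketched below; each new state $(q, y)$ records the current state together with the symbol used to enter it, so that any pair of states is connected by at most one symbol.

```python
# Minimal sketch of log|Sigma|-separation: states (q, y) track the current state q together
# with the symbol y used to enter it, which makes the result log|Sigma|-separable.
def separate(alphabet, delta, initial, final):
    """delta: dict (q, y) -> q_next of a deterministic FSA with initial state `initial`."""
    new_initial = "iota"                          # artificial initial state, no incoming arcs
    new_delta = {}
    for (q, y), q_next in delta.items():
        if q == initial:
            new_delta[(new_initial, y)] = (q_next, y)
        for y_prev in alphabet:                   # (q, y_prev) behaves like q for any y_prev
            new_delta[((q, y_prev), y)] = (q_next, y)
    new_final = {(q, y) for q in final for y in alphabet}
    if initial in final:
        new_final.add(new_initial)                # the empty string must still be accepted
    return new_delta, new_initial, new_final
```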
The following simple lemmata establish the formal correctness of the procedure and show that it results in a $\log |\Sigma|$-separable FSA, which we need for the compression in the size of the alphabet.
Lemma C.1. For any $y \in \Sigma$, $(q, y) \xrightarrow{y'} (q', y') \in \delta'$ if and only if $q \xrightarrow{y'} q' \in \delta$.

Proof. Ensured by the loop on Line 3. ■

Lemma C.2. $\log |\Sigma|$-separation results in an equivalent FSA.
Proof. We have to show that, for any $y \in \Sigma^*$, $y$ leads to a final state in $\mathcal{A}$ if and only if $y$ leads to a final state in $\mathcal{A}'$. For the string of length $0$, this is clear by Lines 13 and 14. For strings of length $\geq 1$, it follows from Lemma C.1 that $y$ leads to a state $q$ in $\mathcal{A}$ if and only if there exists $y' \in \Sigma$ such that $y$ leads to $(q, y')$ in $\mathcal{A}'$.
From Lines 11 and 12, $(q, y) \in F'$ if and only if $q \in F$, finishing the proof. ■

Lemma C.3. $\log |\Sigma|$-separation results in a $\log |\Sigma|$-separable FSA.
Proof. Since the state $(q', y')$ is the only state in $Q'$ transitioned to from $(q, y)$ after reading $y'$ (for any $y \in \Sigma$), it is easy to see that $\mathcal{A}'$ is indeed $\log |\Sigma|$-separable. ■

Algorithm 1 The $\log |\Sigma|$-separation procedure.
 1. def SEPARATE($\mathcal{A} = (\Sigma, Q, I, F, \delta)$):
    ▷ Connect the children of the original initial state $q_\iota$ with the new, artificial, initial state.
    ...
 7. for $q \in Q$, $y \in \Sigma$:
 8.     add $(q, y)$ ...
    ▷ Add all state-symbol pairs with a state from the original set of final states to the new set of final states.
11.     ...
12.     ...
    ▷ Corner case: If the original initial state $q_\iota$ is an initial state, make the artificial initial state $q_\iota'$ final.
13. if $q_\iota \in I$:
14.     ...
15. return $\mathcal{A}'$

D Additional Related Work
Our work characterizes the representational capacity of HRNN LMs in terms of DPFSAs. On the other end of the representational capacity spectrum, Chen et al. (2018) consider the connection between Elman RNNs with arbitrary precision (in stark contrast to our model) and Turing machines, first established by Siegelmann and Sontag (1992), and outline some implications the relationship has on the representational capacity of RNNs and the solvability of tasks such as finding the most probable string or deciding whether an RNN is tight. These tasks are shown to be undecidable. This is in contrast to the equivalence shown here, which, among other things, means that the decidability of these tasks on WFSAs can be carried over to RNN LMs. On a different note, Bhattamishra et al. (2020) and Deletang et al. (2023) provide an empirical survey of the unweighted representational capacity of different LM architectures. The former focuses on RNN variants and their ability to recognize context-free languages. The authors find that RNNs indeed struggle to learn the mechanisms required to recognize context-free languages, but find that hierarchical languages of finite depth, such as $D(k, m)$, can be learned reliably. This further motivates the connection between RNN LMs and finite-state models, as well as the specific construction by Hewitt et al. (2020). While the results from Deletang et al. (2023) can be connected to the theoretical insights provided by existing work, it is also clear that the probabilistic nature, as well as non-architectural aspects of LMs (such as the training regime), make establishing a clear hierarchy of models difficult.

E Upper Space Bounds for Simulating Deterministic PFSAs with HRNNs
Minsky's construction (§4.2) describes how to represent a DPFSA $\mathcal{A}$ with an HRNN of size linear in the number of $\mathcal{A}$'s states. Importantly, the encoding of the FSA transition function (taken from Minsky's original construction) is decoupled from the parameter defining the probability distribution, $E$. This section describes two asymptotically more space-efficient ways of constructing the component simulating the transition function. They originate in the work of Dewdney (1977), who showed that an unweighted FSA $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ can be represented by an HRNN of size $O(|\Sigma||Q|^{3/4})$. The optimal lower bounds were then more thoroughly studied by Alon et al. (1991). Using the same ideas, but with a specific trick to further compress the size of the processing layer of the RNN, Indyk (1995) later reduced this bound to $O(|\Sigma|\sqrt{|Q|})$, which, as discussed in §5, is asymptotically optimal. Naturally, as shown in §5, the space-efficiency gain cannot be carried over to the weighted case; that is, the space complexity is asymptotically dominated by the output matrix $E$. Nevertheless, for a more complete treatment of the subject, we cover the two compressed constructions of an HRNN simulating an unweighted FSA in this section in our notation. Importantly, given a DPFSA, we focus only on the underlying FSA, i.e., the unweighted transition function of the automaton, since, by Theorem 5.1, the compression can only be achieved in the components representing that part of the automaton.

E.1 Dewdney's Construction
This section describes the construction due to Dewdney (1977) in our notation. Since some of the parts are very similar to the construction due to Indyk (1995), those parts are reused in Appendix E.2 and introduced more generally.
Representing states of the FSA. Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a deterministic FSA. Recall that Minsky's construction encodes $\mathcal{A}$'s current state as a one-hot encoding of the state-symbol pair. The construction due to Dewdney (1977), on the other hand, represents the states separately from the symbols. It encodes the states with two-hot representations by using the coefficients of what we call a square-root state representation. This results in representations of states of size $O(\sqrt{|Q|})$. The input symbols are incorporated into the hidden state separately.

Definition E.1. Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be an FSA and $s \overset{\text{def}}{=} \lceil \sqrt{|Q|} \rceil$. We define the square-root state representation of $\mathcal{A}$'s states $q \in Q$ as
$$\phi_2(q) \overset{\text{def}}{=} (\lfloor q / s \rfloor, \; q \bmod s) \in \mathbb{Z}_s \times \mathbb{Z}_s.$$
We denote the inverse of $\phi_2$ with $\phi_2^{-1}$ and further define, for $k \in \mathbb{Z}_s$,
$$\phi_2^{-1}(k, \cdot) \overset{\text{def}}{=} \{q \in Q \mid \varphi_0 = k \text{ where } \varphi = \phi_2(q)\} \qquad (14)$$
and $\phi_2^{-1}(\cdot, k)$ analogously. Specifically, we will denote $\phi_2^{-1}(k, \cdot)$ and $\phi_2^{-1}(\cdot, k)$, with $k$ in the $j$-th position (with $j \in \mathbb{Z}_2$, $0$ for $\phi_2^{-1}(k, \cdot)$ and $1$ for $\phi_2^{-1}(\cdot, k)$), as $\Phi_{k, j}$.
We can think of the function $\phi_2$ as representing states of the FSA in a two-dimensional space $\mathbb{Z}_s \times \mathbb{Z}_s$. However, to efficiently simulate $\mathcal{A}$ with an HRNN, it is helpful to think of $\phi_2(q)$ in two different ways: as a vector $v \in \mathbb{N}_{\geq 0}^{2s}$, or as a matrix in $\mathbb{B}^{s \times s}$, in the following sense.
Definition E.2. Given a square-root state representation function $\phi_2$, we define the vector representation of the state $q \in Q$ as the vector $v(q) \in \mathbb{B}^{2s}$ with
$$v(q)_{\varphi_0} = 1 \quad \text{and} \quad v(q)_{s + \varphi_1} = 1,$$
where $\varphi = (\varphi_0, \varphi_1) = \phi_2(q)$, and all other entries $0$. Furthermore, we define the matrix representation of the state $q \in Q$ as the matrix $B(q) \in \mathbb{B}^{s \times s}$ with
$$B(q)_{\varphi_0, \varphi_1} = 1$$
and all other entries $0$.
Dewdney's construction also heavily relies on the representations of sets of states. We define those additively.

Definition E.3. Let $\mathcal{Q} \subseteq Q$ be a set of states. We define the vector representation of $\mathcal{Q}$ as the vector
$$v(\mathcal{Q}) \overset{\text{def}}{=} \sum_{q \in \mathcal{Q}} v(q).$$
Similarly, we define the matrix representation of $\mathcal{Q}$ as the matrix
$$B(\mathcal{Q}) \overset{\text{def}}{=} \sum_{q \in \mathcal{Q}} B(q).$$

To help understand the above definitions, we give an example of an FSA and the representations of its states.
Example E.1. Consider the FSA in Fig. 7, for which $s = \lceil \sqrt{|Q|} \rceil = \lceil \sqrt{3} \rceil = 2$, meaning that
$$\phi_2(0) = (0, 0), \quad \phi_2(1) = (0, 1), \quad \phi_2(2) = (1, 0),$$
resulting in the state-to-vector mapping
$$\mathbf{v}(0) = \left(1\; 0 \mid 1\; 0\right), \quad \mathbf{v}(1) = \left(1\; 0 \mid 0\; 1\right), \quad \mathbf{v}(2) = \left(0\; 1 \mid 1\; 0\right)$$
and the state-to-matrix mapping
$$\mathbf{B}(0) = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad \mathbf{B}(1) = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \quad \mathbf{B}(2) = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}.$$
The two components of the vector representations separated by "$\mid$" denote the two halves of the representation vectors, corresponding to the two components of $\phi_2(q)$.
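To make the mapping concrete, the following minimal sketch computes these representations in Python. It assumes the states are indexed $0, \ldots, |Q|-1$ and that $\phi_2(q) = (\lfloor q/s \rfloor, q \bmod s)$ as in Definition E.1; the particular indexing of Fig. 7's states is an assumption for illustration, not taken from the figure itself.

```python
import math
import numpy as np

Q = [0, 1, 2]                          # states of the example FSA, indexed 0..|Q|-1
s = math.ceil(math.sqrt(len(Q)))       # s = ceil(sqrt(|Q|)) = 2

def phi2(q):
    """Square-root state representation: a pair of coefficients in Z_s."""
    return (q // s, q % s)

def vec(q):
    """Two-hot vector representation v(q) in B^{2s}."""
    v = np.zeros(2 * s, dtype=int)
    p0, p1 = phi2(q)
    v[p0] = 1
    v[s + p1] = 1
    return v

def mat(q):
    """Matrix representation B(q) in B^{s x s} with a single 1 at (phi0, phi1)."""
    B = np.zeros((s, s), dtype=int)
    p0, p1 = phi2(q)
    B[p0, p1] = 1
    return B

for q in Q:
    print(q, phi2(q), vec(q))
# 0 (0, 0) [1 0 1 0]   (the "|" separator of Example E.1 is conceptual)
# 1 (0, 1) [1 0 0 1]
# 2 (1, 0) [0 1 1 0]
```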
High-level idea of Dewdney's construction. Given these definitions, the intuition behind Dewdney's construction of an HRNN simulating an FSA $\mathcal{A}$ is to simulate the transition function of the FSA by detecting preceding states. We elaborate on this point since it is the central part of the construction (later, we will see that Indyk (1995) uses the exact same idea for simulating $\delta$). Simulating the transition function $\delta$ reduces to detecting which state's $y_t$-parent is currently active, given the current input symbol $y_t$; naturally, that state should be the one active at time $t + 1$. Concretely, consider again the FSA $\mathcal{A}$ in Fig. 7. For example, the only $b$-parent of state $0$ is state $2$, while the parents of state $2$, indexed by the incoming symbol, are $\{a\colon 1,\; b\colon 0\}$. Suppose that at some time $t$, $\mathcal{A}$ is in state $0$ and is reading in the symbol $b$. Then, since state $0$ is the $b$-parent of state $2$, we know that at time $t + 1$, $\mathcal{A}$ will be in state $2$. This principle can be applied more generally: to determine the state of the FSA at time $t + 1$, we simply have to detect whose parent is active at time $t$, given the input symbol at time $t$.
The crux of Dewdney's construction is then the following (again, the same applies to Indyk (1995)): How do we, using only the Elman update rule, determine whose $y_t$-parent is active at time $t$? This can be done by detecting which parent matrices $\mathbf{B}(\mathrm{Par}(q, y_t))$ the representation of the current state $q_t$ is included in, in the sense that, with $\boldsymbol{\phi} = \phi_2(q_t)$, it holds that $\mathbf{B}(\mathrm{Par}(q, y_t))_{\phi_0 \phi_1} = 1$. To be able to formally talk about the detection of a representation in a set of parents, we define several notions of matrix detection.
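Before turning to matrix detection, the following sketch illustrates the parent-lookup idea itself. The transition function below is a hypothetical partial reading of the Fig. 7 fragment (only transitions explicitly mentioned in the text are included); it is not the HRNN mechanism, only the lookup the network has to emulate.

```python
import numpy as np

# Hypothetical partial transition function read off the Fig. 7 fragment:
# delta(0, b) = 2, delta(1, a) = 2, delta(2, b) = 0  (other transitions omitted).
delta = {(0, "b"): 2, (1, "a"): 2, (2, "b"): 0}
s = 2  # ceil(sqrt(3)), as in Example E.1

def phi2(q):
    return (q // s, q % s)

def parents(q, y):
    """Set of y-parents of q: all p with delta(p, y) = q."""
    return {p for (p, sym), tgt in delta.items() if sym == y and tgt == q}

def parent_matrix(q, y):
    """B(Par(q, y)): disjunction of the matrix representations of q's y-parents."""
    B = np.zeros((s, s), dtype=int)
    for p in parents(q, y):
        i, j = phi2(p)
        B[i, j] = 1
    return B

def step(q, y):
    """Find the next state: the q' whose parent matrix contains phi2(q)."""
    i, j = phi2(q)
    candidates = [qp for qp in range(3) if parent_matrix(qp, y)[i, j] == 1]
    assert len(candidates) == 1  # determinism
    return candidates[0]

print(step(0, "b"))  # 2, since state 0 is the b-parent of state 2
```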
Informally, we say that a matrix is easily detectable if the presence of its non-zero elements can be detected using a single neuron in the hidden layer of an HRNN.
Definition E.4. Let $\mathbf{B} \in \mathbb{B}^{D \times D}$ be a binary matrix. We say that $\mathbf{B}$ is easily detectable if there exist $\mathbf{w} \in \mathbb{Q}^{2D}$ and $b \in \mathbb{Q}$ (neuron coefficients) such that
$$\sigma\left(\langle \mathbf{e}_{ij}, \mathbf{w}\rangle + b\right) = 1 \iff \mathbf{B}_{ij} = 1,$$
where $\mathbf{e}_{ij} = \left(\mathbf{e}_i \mid \mathbf{e}_j\right)$ refers to the $2D$-dimensional vector with $1$'s at positions $i$ and $D + j$. In words, this means that the neuron defined by $\mathbf{w}, b$ fires on the input $\mathbf{e}_{ij}$ if and only if $\mathbf{B}_{ij} = 1$.
We define detectable matrices as the matrices which can be detected using a conjunction of two neurons.
Definition E.5. Let $\mathbf{B} \in \mathbb{B}^{D \times D}$ be a binary matrix. We say that $\mathbf{B}$ is detectable if there exist $\mathbf{w}_1, \mathbf{w}_2 \in \mathbb{Q}^{2D}$ and $b_1, b_2 \in \mathbb{Q}$ such that
$$\sigma\left(\langle \mathbf{e}_{ij}, \mathbf{w}_1\rangle + b_1\right) = 1 \;\wedge\; \sigma\left(\langle \mathbf{e}_{ij}, \mathbf{w}_2\rangle + b_2\right) = 1 \iff \mathbf{B}_{ij} = 1. \qquad (26)$$
Furthermore, we say that a matrix is (easily) permutation-detectable if there exist permutation matrices $\mathbf{P}$ and $\mathbf{Q}$ such that $\mathbf{PBQ}$ is (easily) detectable.
Intuitively, this means that one can effectively replace an easily detectable matrix $\mathbf{B}$ with a single neuron: instead of specifying the matrix explicitly, one can simply detect whether an entry $\mathbf{B}_{ij}$ of $\mathbf{B}$ is $1$ by passing $\mathbf{e}_{ij}$ through the neuron and seeing if it fires. This reduces the space complexity from $D^2$ to $2D$. Similarly, one can replace a detectable matrix with two neurons. As shown in Fact A.1, the required conjunction of the two resulting neurons can then easily be performed by a third (small) neuron, meaning that a detectable matrix is effectively represented by a two-layer MLP.
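For concreteness, the following sketch shows how conjunctions and disjunctions of binary neuron outputs can themselves be computed by single Heaviside neurons, in the spirit of Fact A.1. The convention $H(x) = 1$ iff $x > 0$ is an assumption of the sketch.

```python
import numpy as np

def H(x):
    """Heaviside activation; we assume the convention H(x) = 1 iff x > 0."""
    return (np.asarray(x) > 0).astype(int)

def AND(bits):
    """Conjunction of binary inputs with a single Heaviside neuron."""
    bits = np.asarray(bits)
    return int(H(bits.sum() - (len(bits) - 1)))

def OR(bits):
    """Disjunction of binary inputs with a single Heaviside neuron."""
    return int(H(np.asarray(bits).sum()))

assert AND([1, 1]) == 1 and AND([1, 0]) == 0
assert OR([0, 0, 1]) == 1 and OR([0, 0, 0]) == 0
```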
An example of easily detectable matrices is given by the so-called northwestern matrices.
Definition E.6. A matrix $\mathbf{B} \in \mathbb{B}^{D \times D}$ is northwestern if there exists a weakly decreasing vector $\boldsymbol{\alpha} \in \mathbb{Z}_{\geq 0}^{D}$ with
$$\mathbf{B}_{ij} = 1 \iff j \leq \alpha_i.$$
Intuitively, northwestern matrices contain all their ones contiguously in their upper left (northwest) corner. An example of a northwestern matrix for $\boldsymbol{\alpha} = \left(2\; 1\; 1\right)$ is
$$\begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}.$$

Lemma E.1. Northwestern matrices are easily detectable.
Proof. Let $\mathbf{w} = \left(\boldsymbol{\alpha} \mid D,\, D-1,\, \ldots,\, 1\right)$ and $b = -D$. It is easy to see that for any $\mathbf{e}_{ij}$ where $\mathbf{B}_{ij} = 1$, it holds that
$$\langle \mathbf{e}_{ij}, \mathbf{w}\rangle = \alpha_i + (D - j + 1) \geq j + D - j + 1 = D + 1 \implies H\left(\langle \mathbf{e}_{ij}, \mathbf{w}\rangle + b\right) = H\left(\langle \mathbf{e}_{ij}, \mathbf{w}\rangle - D\right) = 1.$$
On the other hand, for $\mathbf{B}_{ij} = 0$, we have
$$\langle \mathbf{e}_{ij}, \mathbf{w}\rangle = \alpha_i + (D - j + 1) \leq (j - 1) + D - j + 1 = D \implies H\left(\langle \mathbf{e}_{ij}, \mathbf{w}\rangle + b\right) = H\left(\langle \mathbf{e}_{ij}, \mathbf{w}\rangle - D\right) = 0.$$
■
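The following sketch instantiates the proof's weights for the example matrix from Definition E.6 and checks the detection exhaustively. It assumes the reconstructed reading $\mathbf{B}_{ij} = 1 \iff j \leq \alpha_i$ and the convention $H(x) = 1$ iff $x > 0$.

```python
import numpy as np

def H(x):
    return int(x > 0)          # Heaviside, assumed convention H(x) = 1 iff x > 0

alpha = [2, 1, 1]              # the example vector from Definition E.6
D = len(alpha)
# Northwestern matrix under the (reconstructed) reading B_ij = 1 iff j <= alpha_i
B = np.array([[1 if j + 1 <= alpha[i] else 0 for j in range(D)] for i in range(D)])

# Weights and bias from the proof of Lemma E.1: w = (alpha | D, D-1, ..., 1), b = -D
w = np.array(alpha + [D - j for j in range(D)])
b = -D

def e(i, j):
    """The 2D-dimensional indicator e_ij with ones at positions i and D + j."""
    v = np.zeros(2 * D, dtype=int)
    v[i] = 1
    v[D + j] = 1
    return v

for i in range(D):
    for j in range(D):
        assert H(w @ e(i, j) + b) == B[i, j]
```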
A more general and useful class of detectable matrices is the class of line matrices (Dewdney, 1977).
Definition E.7. A binary matrix $\mathbf{B} \in \mathbb{B}^{D \times D}$ is a line matrix if any of the following conditions hold:
1. All of $\mathbf{B}$'s ones lie either in the same row ($\mathbf{B}$ is a row matrix) or in the same column ($\mathbf{B}$ is a column matrix).
2. $\mathbf{B}$ is a transversal, i.e., a matrix in which there is at most one $1$ in any column and row.
Lemma E.2. Row and column matrices are easily permutation-detectable.
Proof. Let $i \in \mathbb{Z}_D$ and let $\mathbf{B}$ be a row matrix with $\mathbf{B}_{i j_n} = 1$ for $n \in \mathbb{Z}_N$, i.e., a row matrix with all of its $N$ ones in the $i$-th row. Define $\mathbf{P} \in \mathbb{B}^{D \times D}$ with $\mathbf{P}_{1i} = 1$ and $0$ elsewhere, and $\mathbf{Q} \in \mathbb{B}^{D \times D}$ with $\mathbf{Q}_{j_n n} = 1$ and $0$ elsewhere. Then $\mathbf{PBQ}$ contains all its ones in its northwestern corner (contiguously in the first row) and is thus easily detectable. Let $\mathbf{w}'$ and $b'$ be the coefficients obtained from the neuron detecting $\mathbf{PBQ}$ by permuting its entries according to $\mathbf{P}$ and $\mathbf{Q}$. It is easy to see that this "rearranges" the components of the neuron recognizing the northwestern matrix $\mathbf{PBQ}$ to make them recognize the original matrix, meaning that the neuron defined by $\mathbf{w}'$ and $b'$ recognizes the line matrix. The proof for a column matrix is analogous. ■
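A small sketch of the permutation argument for a concrete row matrix follows; the choice of $D$, the row $i$, and the occupied columns is arbitrary and only for illustration.

```python
import numpy as np

D = 4
B = np.zeros((D, D), dtype=int)
i, cols = 2, [1, 3]            # a row matrix: ones at (2, 1) and (2, 3)
B[i, cols] = 1

# P moves row i to the first row; Q moves the occupied columns to the front.
P = np.zeros((D, D), dtype=int)
P[0, i] = 1
for r, row in enumerate([r for r in range(D) if r != i], start=1):
    P[r, row] = 1

Q = np.zeros((D, D), dtype=int)
for c, col in enumerate(cols):
    Q[col, c] = 1
for c, col in enumerate([c for c in range(D) if c not in cols], start=len(cols)):
    Q[col, c] = 1

PBQ = P @ B @ Q
print(PBQ)
# [[1 1 0 0]
#  [0 0 0 0]
#  [0 0 0 0]
#  [0 0 0 0]]   -> all ones contiguously in the northwest corner
```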
Lemma E.3. Transversals are detectable.

Proof. The core idea of this proof is that every transversal can be permuted into a diagonal matrix, which can be written as a Hadamard product of a lower-triangular and an upper-triangular matrix.
Let $\mathbf{B}$ be a transversal. Pre-multiplying $\mathbf{B}$ with its transpose, $\mathbf{P} \coloneqq \mathbf{B}^\top$, results in a diagonal matrix. It is easy to see that $\mathbf{PB}$ can be written as a Hadamard product $\mathbf{H}_1 \odot \mathbf{H}_2$ of a lower-triangular matrix $\mathbf{H}_1$ and an upper-triangular matrix $\mathbf{H}_2$. Both are easily permutation-detectable. A conjunction of the neurons detecting $\mathbf{H}_1$ and $\mathbf{H}_2$ (again, performed by another neuron) detects the original matrix $\mathbf{B}$. In the following, we will refer to $\mathbf{H}_1$ and $\mathbf{H}_2$ as the factors of the transversal. ■

Crucially, any binary matrix $\mathbf{B} \in \mathbb{B}^{D \times D}$ can be decomposed into a set of line matrices $\mathcal{B}$ whose disjunction is $\mathbf{B}$: it is easy to see that $\mathbf{B}_{ij} = 1$ if and only if there exists $\mathbf{M} \in \mathcal{B}$ such that $\mathbf{M}_{ij} = 1$. This means that the non-zero entries of any $\mathbf{B} \in \mathbb{B}^{D \times D}$ decomposed into the set of line matrices $\mathcal{B}$ can be detected using an MLP in two steps:
1. Detect the non-zero entries of the individual line matrices from the decomposition $\mathcal{B}$ (which are, as shown above, detectable).
2. Take a disjunction of the detections of the individual line matrices to result in the activation of the original matrix.
The disjunction can again be performed by applying another two-layer MLP to the activations of the line matrices. An important consideration in both Dewdney's and Indyk's constructions will be how large $\mathcal{B}$ has to be.
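As a sanity check of the decomposition idea, the following sketch decomposes a random binary matrix into line matrices in the most naive way, one row matrix per non-zero row, and verifies that their disjunction recovers the matrix. Lemma E.5 below gives the much smaller cover, with at most $2N$ line matrices, that the construction actually relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
B = (rng.random((D, D)) < 0.3).astype(int)

# Naive decomposition: one row matrix per non-zero row of B.
decomposition = []
for i in range(D):
    if B[i].any():
        M = np.zeros_like(B)
        M[i] = B[i]
        decomposition.append(M)

# The disjunction of the line matrices recovers B.
recovered = np.zeros_like(B)
for M in decomposition:
    recovered |= M
assert (recovered == B).all()
```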
Using matrix decomposition and detection for simulating the transition function. We now describe how Dewdney's construction uses matrix detection, based on the decomposition of matrices into line matrices, to simulate an FSA with an HRNN. From a high level, the update steps of the HRNN will, just like in Minsky's construction, simulate the transition function of the simulated FSA. However, in contrast to the Minsky construction, in which each transition step of the FSA was implemented by a single application of the Elman update rule, here, a single transition of the FSA will be implemented using multiple applications of the Elman update rule, the end result of which is the activation of the two-hot representation of the appropriate next state. Nonetheless, there are, abstractly, two sub-steps of the update step, analogous to the Minsky construction (cf. Fig. 4):
1. Detect the activations of all possible next states, considering any possible input symbol (performed by the term $\mathbf{U}\mathbf{h}_t$ in Minsky's construction).
2. Filter the activations of the next states by choosing only the one transitioned into by a $y_t$-transition (performed by conjoining with the term $\mathbf{V}\mathbf{y}_t$ in Minsky's construction).
The novelty of Dewdney's construction lies in the first sub-step: How can the Elman update step be used to activate the two-hot representation of $q_t$'s children? As alluded to, this relies on the pre-computed parent matrices $\mathbf{B}(\mathrm{Par}(q, y))$ (cf. Definition E.2). The parent matrices of individual states are compressed (disjoined) into component-activating matrices: the representation matrices (cf. Definition E.3) of the parents of specific sets of states, defined through the function $\phi_2$ in the following sense.
Definition E.8. A component-activating matrix is the representation matrix $\mathbf{B}_{j,y,k} \coloneqq \mathbf{B}\left(\mathrm{Par}\left(\Phi_{k,j}, y\right)\right)$ for some $k \in \mathbb{Z}_s$, $y \in \Sigma$, and $j \in \mathbb{Z}_2$.
Intuitively, the component-activating matrix $\mathbf{B}_{j,y,k}$ is the result of the disjunction of the matrix representations of all $y$-parents $q$ of all states $q'$ whose $j$-th component of $\phi_2(q')$ equals $k$. This results in $2|\Sigma|s$ matrices. They can be pre-computed and naturally depend on the transition function $\delta$. The name component-activating matrix is inspired by the fact that each of the matrices "controls" the activation of one of the $2|\Sigma|s$ neurons in a specific sub-vector of the HRNN hidden state. That is, each component-activating matrix controls a particular dimension, indexed by the tuple $(j, y, k)$ for $j \in \mathbb{Z}_2$, $y \in \Sigma$, $k \in \mathbb{Z}_s$, of the data sub-vector of the HRNN hidden state. As we will see shortly, these matrices contain all the information required for simulating $\mathcal{A}$ with an HRNN.
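The following sketch shows how the component-activating matrices could be pre-computed for the same hypothetical toy reading of Fig. 7 used earlier (again an assumption for illustration, not the figure's actual transition function).

```python
import numpy as np

# Hypothetical partial transition function (same toy reading of Fig. 7 as above).
delta = {(0, "b"): 2, (1, "a"): 2, (2, "b"): 0}
states, alphabet, s = [0, 1, 2], ["a", "b"], 2

def phi2(q):
    return (q // s, q % s)

def parents(q, y):
    return {p for (p, sym), tgt in delta.items() if sym == y and tgt == q}

def component_activating_matrix(j, y, k):
    """B_{j,y,k}: disjunction of the matrix representations of the y-parents of
    all states whose j-th coefficient of phi2 equals k (the set Phi_{k,j})."""
    B = np.zeros((s, s), dtype=int)
    targets = [q for q in states if phi2(q)[j] == k]   # Phi_{k,j}
    for q in targets:
        for p in parents(q, y):
            i0, i1 = phi2(p)
            B[i0, i1] = 1
    return B

# One matrix per tuple (j, y, k): 2 * |Sigma| * s = 8 matrices for this toy FSA.
for j in (0, 1):
    for y in alphabet:
        for k in range(s):
            print((j, y, k), component_activating_matrix(j, y, k).tolist())
```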
To define the transition function of the HRNN simulating $\mathcal{A}$, all $2|\Sigma|s$ component-activating matrices are decomposed into permutation-detectable line matrices (cf. Definition E.7), whose activations are combined (disjoined) into the activations of the individual component-activating matrices. Analogously to above, we will denote the sets of line matrices decomposing the component-activating matrices by $\mathcal{B}_{j,y,k}$, i.e., $\mathbf{B}_{j,y,k} = \bigvee_{\mathbf{M} \in \mathcal{B}_{j,y,k}} \mathbf{M}$. The dimensions of the hidden state corresponding to the activations of the line matrices, before they are combined into the activations of the component-activating matrices, form the processing sub-vector of the HRNN hidden state, since they are required in the pre-processing steps of the update step to determine the activation of the actual hidden state. This is schematically drawn in Fig. 8a.
For any component-activating matrix $\mathbf{B}$ decomposed into the set of line matrices $\mathcal{B}$, we know by Lemmas E.2 and E.3 that all $\mathbf{M} \in \mathcal{B}$ are detectable by a single-layer MLP. By adding an additional layer to the MLP, we can disjoin the detections of $\mathbf{M} \in \mathcal{B}$ into the detection of $\mathbf{B}$. More abstractly, this MLP therefore detects the activation of one of the $2|\Sigma|s$ cells of the data sub-vector of the HRNN hidden state; all of them together then form the two-hot encoding of all possible next states of the FSA (before taking into account the input symbol). Designing $2|\Sigma|s$ such single-output MLPs therefore results in an MLP activating the two-hot representations of all possible next states of the simulated FSA. Conjoining these activations with the input symbol, analogously to how this is done in the Minsky construction, results in the activation of the two-hot representation of only the actual next state of the simulated FSA. This is illustrated in Fig. 8b.
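The following sketch mimics these two stages functionally; it is not a Heaviside network, only the Boolean function the data sub-vector computes. The candidate activations are indexed by $(j, y, k)$, and conjoining with the input symbol keeps only the block of the read symbol, which then holds the two-hot code of the next state. The extra transition $\delta(0, a) = 1$ is assumed purely so that the current state has children under both symbols.

```python
import numpy as np

# Toy reading of Fig. 7 with one assumed extra transition delta(0, a) = 1.
delta = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 2, (2, "b"): 0}
alphabet, s = ["a", "b"], 2

def phi2(q):
    return (q // s, q % s)

def candidate_activations(q):
    """Data sub-vector indexed by (j, y, k): component (j, y, k) fires iff the
    j-th coefficient of phi2(delta(q, y)) equals k, i.e. iff q is a y-parent of
    some state in Phi_{k,j}."""
    h = {}
    for j in (0, 1):
        for y in alphabet:
            for k in range(s):
                child = delta.get((q, y))
                h[(j, y, k)] = int(child is not None and phi2(child)[j] == k)
    return h

def conjoin_with_symbol(h, y_t):
    """Keep only the y_t block: the conjunction with V y_t zeroes out the
    candidates belonging to other symbols."""
    return {(j, k): v for (j, y, k), v in h.items() if y == y_t}

h = candidate_activations(0)
print(conjoin_with_symbol(h, "b"))
# {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}  -> the two-hot code of state 2
```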
High-level overview of simulating a transition. In summary, after decomposing all the component-activating matrices into the sets $\mathcal{B}_{j,y,k}$, the detection of all candidate next states (before considering the input symbol) in the update step of the HRNN is composed of the following sub-steps:
1. Detect the non-zero entries of the individual line matrices in the decompositions $\mathcal{B}_{j,y,k}$ (the corresponding neurons live in the processing sub-vector).
2. Disjoin these detections into the activations of the component-activating matrices, i.e., into the two-hot encodings of all children of $q_t$ (the corresponding neurons live in the data sub-vector).

Figure 8: (a) High-level overview of Dewdney's construction. The highlighted orange neuron in the representation of the state from the data sub-vector corresponds to the activation of one of the components of the red states (which have in common that the $0$-th component of their $\phi_2(q)$ is the same). The matrix corresponding to the disjunction of the representations of their $y$-parents (blue states) is decomposed into two line matrices: a transversal and a column matrix. The non-zero elements of the former can be detected by a conjunction of two neurons, while the non-zero elements of the latter can be detected directly by a single neuron. Those activations are then disjoined to result in the activation of the orange neuron. The purple neurons in the processing sub-vector are composed of the neurons in the networks implementing the detection of line matrices and their conjunctions and disjunctions (also shown in purple). (b) A high-level illustration of how the transition function of the FSA is implemented in Dewdney's construction, on an example of an FSA fragment where the simulated automaton is initially in the state $q$ and reads the symbol $a$, transitioning to $q'$. The components whose changes are relevant at a given step are highlighted. Starting in the state $q$, which is stored in the data sub-vector as $\mathbf{v}(q)$, in the first sub-step, the processing bits of the appropriate line matrices are activated ($\mathbf{p}_1$). Next, the activated line matrices are used to activate the representations of all of $q$'s children in the data sub-vector ($\mathbf{v}(\{q', q''\})$). Lastly, these representations are conjoined with the states reachable by the symbol $a$, resulting in the representation of the state $q'$ in the data sub-vector ($\mathbf{v}(q')$).
This results in the activation of the two-hot representations of all possible next states (i.e., all children of $q_t$). In the last sub-step of the HRNN update step, these are conjoined with the representation of the current input symbol. This step is very similar to the analogous stage in Minsky's construction, with the difference that here, the non-zero entries of the vector $\mathbf{V}\mathbf{y}_t$ must cover the two-hot representations of the states with an incoming $y_t$-transition. This conjunction then ensures that, among all the children of $q_t$, only the one reached by taking the $y_t$-transition will be encoded in $\mathbf{h}_{t+1}$. The construction just described can be summarized by the following lemma.

Lemma E.4. Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a deterministic FSA. Then, Dewdney's construction results in an HRNN correctly simulating $\mathcal{A}$'s transition function, i.e., $s(\mathbf{h}_t) = q_t$ for all $t$.
Proof. The proof follows the reasoning on the activation of appropriate matrices according to the transition function of the FSA outlined above. To formally prove the lemma, we would have to follow a similar set of steps to those used to prove the correctness of Minsky's construction (Lemma A.1). We omit this for conciseness. ■

This shows that Dewdney's construction correctly encodes the FSA in an HRNN. However, its space efficiency remains to be determined. As mentioned above, working with two-hot representations of the states means that the data sub-vector is of size $O\left(|\Sigma| \sqrt{|Q|}\right)$. However, the construction also requires a number of processing dimensions in the processing sub-vector. To understand the full complexity of the construction, we have to determine the maximal number of processing bits in the HRNN. The first step towards the answer is contained in the following lemma, which describes the number of line matrices required to cover an arbitrary binary matrix. It lies at the core of the efficiency of Dewdney's construction.

Lemma E.5. Let $\mathbf{B} \in \mathbb{B}^{D \times D}$ with $N^2$ elements equal to $1$. Then, there exists a decomposition $\mathcal{B}$ of $\mathbf{B}$ into at most $2N$ line matrices such that $\bigvee_{\mathbf{M} \in \mathcal{B}} \mathbf{M} = \mathbf{B}$.

Proof. Based on Dewdney (1977). Define the sequence of transversals $\mathbf{T}_1, \mathbf{T}_2, \ldots$, where $\mathbf{T}_i$ is the transversal containing the maximum number of ones in the matrix $\mathbf{B}_i \coloneqq \mathbf{B} - \bigvee_{j=1}^{i-1} \mathbf{T}_j$. The transversal containing the maximal number of ones can be found using a maximum-matching algorithm. Continue this sequence until there are no more ones in $\mathbf{B}_i$. The number of ones in the matrices $\mathbf{B}_i$, $\|\mathbf{B}_i\|_1$, forms a (weakly) decreasing sequence.
If there are at most $2N$ transversals in the sequence, the lemma holds. Otherwise, we compare the functions $f(i) \coloneqq \|\mathbf{T}_i\|_1$ and $g(i) \coloneqq 2N - i$.
• If $f(i) > g(i)$ for all $i = 1, \ldots, N$, then $\sum_{i=1}^{N} f(i) > \sum_{i=1}^{N} (2N - i) = 2N^2 - \frac{1}{2}N(N+1) \geq N^2$. However, the transversals in the decomposition cannot contain more ones than the original matrix, which contains only $N^2$ ones, a contradiction.
• We conclude that $f(i_0) \leq g(i_0)$ for some $i_0 \leq N$. Let $i_0$ be the first such index and $\mathcal{L}_1 \coloneqq \{\mathbf{T}_1, \ldots, \mathbf{T}_{i_0 - 1}\}$. The maximum number of independent ones (in the sense that at most one appears in any single row or column) in $\mathbf{B}_{i_0}$ is $\|\mathbf{T}_{i_0}\|_1 \leq 2N - i_0$ (those are exactly the ones chosen by the maximum transversal $\mathbf{T}_{i_0}$). By König's theorem (Szárnyas, 2020), there is a set $\mathcal{L}_2 \coloneqq \{\mathbf{L}_1, \ldots, \mathbf{L}_k\}$ of $k \leq 2N - i_0$ row or column matrices which cover $\mathbf{B}_{i_0}$. Therefore, $\mathcal{L} \coloneqq \mathcal{L}_1 \cup \mathcal{L}_2$ constitutes a valid cover of $\mathbf{B}$ with at most $(i_0 - 1) + (2N - i_0) < 2N = O(N)$ matrices. ■

E.2 Indyk's Construction

Indyk's construction follows the same template as Dewdney's, but encodes states with a four-hot representation $\phi_4$, with coefficients in $\mathbb{Z}_r$ for $r \coloneqq \lceil |Q|^{1/4} \rceil$, and decomposes the component-activating matrices into so-called non-decreasing matrices: matrices whose non-zero entries are described by a partial function $f$ mapping each column $j$ to the row $f(j)$ of its (at most one) non-zero entry, with $f$ non-decreasing in $j$. Such matrices can be detected as follows. For $j$ such that $f(j)$ is defined, it is easy to see that $\mathbf{B}_{ij} = 1 \iff i = f(j)$, meaning that, by defining the parameters $\mathbf{w}$ and $b$ with
$$w_{f(j)} \coloneqq r^2 - I(j) \qquad (37)$$
and all other elements $0$, we get that
$$\mathbf{B}_{ij} = 1 \iff i = f(j) \iff w_i + w_j + b = 0. \qquad (40)$$
Compared to earlier, where component-activating matrices were detected by testing an inequality, detecting a non-decreasing matrix requires testing an equality. Since all terms in the equality are integers, testing the equality can be performed with the Heaviside activation function by conjoining two neurons: one testing the inequality $w_i + w_j + b - 1 < 0$ and another testing the inequality $w_i + w_j + b + 1 > 0$. Both can individually be performed by a single neuron and then conjoined by an additional one. ■

With this, the high-level idea of Indyk's construction is outlined in Fig. 9. After constructing the component-activating matrices based on $\phi_4$ and decomposing them into non-decreasing matrices, the rest of Indyk's construction is very similar to Dewdney's construction, although the full update step of the HRNN requires some additional processing. To test the equality needed to detect non-decreasing matrices in the decomposition, Eq. (40), the four-hot representations are first converted into two-hot ones. This can be done by a simple conjunction of the first two and the last two components of the four-hot representation. Then, the activations of the non-decreasing matrices can be computed and disjoined into the representations of the component-activating matrices. These form the $4|\Sigma|r$ components of the data sub-vector of the HRNN hidden state. They contain the activations of all possible next states, i.e., the children of the current state of $\mathcal{A}$. These are then conjoined with the representation of the current input symbol in the same way as in Dewdney's construction, but adapted to the four-hot representations of the states. The process is thus very similar to the phases of Dewdney's construction illustrated in Fig. 8b.
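The integer-equality trick used to detect non-decreasing matrices above can be sketched in isolation: for an integer $z$, $z = 0$ holds iff $z - 1 < 0$ and $z + 1 > 0$, and each inequality is a single Heaviside neuron (again assuming the convention $H(x) = 1$ iff $x > 0$).

```python
def H(x):
    return int(x > 0)          # Heaviside, assumed convention H(x) = 1 iff x > 0

def equals_zero(z):
    """For integer z: z == 0 iff (z - 1 < 0) and (z + 1 > 0).
    Each test is a single Heaviside neuron; a third neuron conjoins them."""
    n1 = H(-(z - 1))           # fires iff z - 1 < 0, i.e. z <= 0
    n2 = H(z + 1)              # fires iff z + 1 > 0, i.e. z >= 0
    return H(n1 + n2 - 1)      # conjunction

assert [equals_zero(z) for z in (-2, -1, 0, 1, 2)] == [0, 0, 1, 0, 0]
```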
Indyk's construction can be summarized by the following lemma.
Lemma E.8. Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a deterministic FSA. Then, Indyk's construction results in an HRNN correctly simulating $\mathcal{A}$'s transition function, i.e., $s(\mathbf{h}_t) = q_t$ for all $t$.
Proof. Again, the proof follows the reasoning outlined above; to formally prove that it is correct, we would have to follow a similar set of steps to those used to prove the correctness of Minsky's construction (Lemma A.1). We omit this for conciseness. ■

The only remaining thing to show is that Indyk's construction achieves the theoretically optimal lower bound on the size of the HRNN simulating a deterministic FSA. All previous steps of the construction are valid no matter the chosen permutation $\pi$. The permutation, however, matters for space efficiency: intuitively, it determines how efficiently the resulting component-activating matrices (which depend on the permutation) can be decomposed into non-decreasing matrices, in the sense of how many non-decreasing matrices are required to cover them. Indyk therefore proved that a permutation for which the decomposition across all states is efficient enough to achieve the minimal number of neurons always exists (a randomly chosen permutation works with non-zero probability). This is formalized by the following lemma.
Lemma E.9. Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a deterministic FSA. There exists a permutation of $Q$ such that Indyk's construction results in an HRNN of size $O\left(|\Sigma| \sqrt{|Q|}\right)$.
Proof. The proof can be found in Indyk (1995, Lemma 6). ■

This concludes our presentation of Indyk's construction. All the results stated in this section can be summarized by the following theorem.
Theorem E.2. Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a deterministic FSA. There exists an HRNN of size $O\left(|\Sigma| \sqrt{|Q|}\right)$ correctly simulating $\mathcal{A}$.
Proof. Again, the proof follows from the fact that Indyk's construction correctly simulates the FSA (cf. Lemma E.8), while the space bound is guaranteed by Lemma E.9. ■

Figure 2: A weighted finite-state automaton defining a probability distribution over $\{a, b\}^*$.

Figure 3: The sigmoid and the Heaviside functions.

Figure 6: The FSA $\mathcal{A}_N$.

Figure 7: An example of a fragment of an FSA.

Figure 9: High-level overview of Indyk's construction. The highlighted orange neuron in the representation of the state from the data sub-vector corresponds to the activation of one of the components of the red states (which have in common that the $0$-th component of their $\phi_4(q)$ is the same). The matrix corresponding to the disjunction of the representations of their $y$-parents (blue states) is decomposed into two non-decreasing matrices. The non-zero elements of both can be detected by a conjunction of two neurons; here, $f_1 = \begin{pmatrix} 0 & 1 & 2 & 3 \\ \mathrm{H} & 0 & 0 & 0 \end{pmatrix}$ and $f_2 = \begin{pmatrix} 0 & 1 & 2 & 3 \\ \mathrm{H} & \mathrm{H} & 1 & 2 \end{pmatrix}$, meaning that $\mathbf{w}_1 = \left(3\; 0\; 0\; 0 \mid 0\; 1\; 1\; 1\right)$, $\mathbf{w}_2 = \left(0\; 3\; 2\; 0 \mid 0\; 0\; 1\; 2\right)$, and $b_1 = b_2 = 4$. Those activations are then disjoined to result in the activation of the orange neuron. The purple neurons in the processing sub-vector are composed of the neurons in the networks implementing the detection of line matrices and their conjunctions and disjunctions (also shown in purple). Note that even if the second matrix were not non-decreasing in itself (i.e., if the columns of the two ones were flipped), one could still transform it into a non-decreasing matrix by permuting the columns and permuting the corresponding neurons.