On the Representational Capacity of Recurrent Neural Language Models

This work investigates the computational expressivity of language models (LMs) based on recurrent neural networks (RNNs). Siegelmann and Sontag (1992) famously showed that RNNs with rational weights and hidden states and unbounded computation time are Turing complete. However, LMs define weightings over strings in addition to just (unweighted) language membership and the analysis of the computational power of RNN LMs (RLMs) should reflect this. We extend the Turing completeness result to the probabilistic case, showing how a rationally weighted RLM with unbounded computation time can simulate any deterministic probabilistic Turing machine (PTM) with rationally weighted transitions. Since, in practice, RLMs work in real-time, processing a symbol at every time step, we treat the above result as an upper bound on the expressivity of RLMs. We also provide a lower bound by showing that under the restriction to real-time computation, such models can simulate deterministic real-time rational PTMs.


Introduction
A language model (LM) is definitionally a semimeasure over strings (Icard, 2020). Recent advances in their capabilities, leading to the widespread adoption of LMs, have sparked interest in their theoretical properties and guarantees. Previous work has characterized modern architectures such as recurrent neural networks (RNNs; Elman, 1990; Hochreiter and Schmidhuber, 1997) in terms of the formal languages they can and cannot recognize (Kleene, 1956; Minsky, 1967; Siegelmann and Sontag, 1992; Merrill et al., 2020, inter alia). However, characterizing LMs as formal languages is, in some sense, a category error because LMs encode semimeasures over strings instead of deciding language membership (Chen et al., 2018). In this work, we thus offer another perspective on understanding RNN LMs (RLMs) by asking: What classes of semimeasures over strings can RLMs represent, i.e., what is their computational expressivity?
[Figure 1: Roadmap of the paper. A QPTM is a PTM with multiple rationally weighted transition functions. A 2PDA is a probabilistic two-tape pushdown automaton. A Σ−2PDA is a 2PDA that is deterministic in its output alphabet. An RLM is a simple RNN LM. An εRLM is an RLM augmented with an empty output symbol (ε). The prefix "RD-" denotes deterministic real-time machines.]
The empirical capabilities of trained language models have spurred a large field of work testing their reasoning and linguistic abilities. However, our theoretical understanding of what these models are inherently capable of is still lacking (Deletang et al., 2023). Connecting an LM architecture to well-understood models of computation can help us determine whether the architecture is able to perform the sequences of computations required to carry out an algorithmic task (Pérez et al., 2019). Furthermore, connecting it to linguistic models can tell us whether the architecture is capable of correctly modeling the linguistic structure of a sentence symbolically (Linzen et al., 2016). Finally, characterizing the types of semimeasures the architecture can represent allows us to make more concrete claims about the abilities and limitations of the architecture itself.
RLMs have set many important milestones in language modeling and still hold the state of the art in some important settings of natural language processing (Qiu et al., 2020; Orvieto et al., 2023). Moreover, despite the recent trend towards recurrence-free and, thus, parallelizable transformer-based LMs (Vaswani et al., 2017), elements of recurrence have found their way into recent language models, and RNNs themselves have recently even been proposed as alternatives or extensions to some high-performing models (Peng et al., 2023; Orvieto et al., 2023; Zhou et al., 2023). At a high level, RNNs work by maintaining a hidden state encoding the processed string, much like how formal models of computation such as Turing machines process and store information. This sequential nature has motivated the comparison of the computational power of RNNs to that of various formal models of computation, from simple models such as finite-state automata (Kleene, 1956; Merrill et al., 2020) and counter machines, all the way up to Turing machines and related models (Minsky, 1967; Siegelmann and Sontag, 1992; Weiss et al., 2018).
Precisely where RNNs end up on the hierarchy of formal models of computation depends on the specific formalization. In this work, we characterize the computational power of RNNs in their most permissive formalization, i.e., one that allows RNNs to process and produce rational-valued vectors and perform an unbounded number of computational steps per input symbol by allowing them to emit empty tokens, ε, in between words. Siegelmann and Sontag (1992) show that such RNNs can simulate any deterministic Turing machine and are, hence, Turing complete. While this sheds light on the processing power of RNNs, their result is not directly applicable to language modeling, as it does not take into account the probability assigned to the strings. By extending Siegelmann and Sontag's (1992) construction to the probabilistic case, we provide first steps towards understanding the expressive power of RLMs with rational arithmetic. We show that RLMs with rational weights and unbounded computation time can compute exactly the same semimeasures over strings as probabilistic Turing machines.
On the one hand, rational arithmetic offers a reasonably faithful formalization of real-world models in that computer scientists often analyze numerical algorithms using such an idealization. On the other hand, the assumption of unbounded computation time does represent a large departure from realistic models. In practice, RLMs perform a constant number of computational steps per symbol, operating in a real-time setting (Weiss et al., 2018). Therefore, we treat the above result as an upper bound on the computational power of RLMs. As a lower bound, we study a second type of RLMs, restricting the models to operate in real-time, which results in a more fine-grained hierarchy of specific Turing machine-like models equivalent to an RLM. We hence characterize the expressivity of RLMs in terms of classical computational models.
Our work offers a first step towards a comprehensive characterization of the expressivity of RLMs in terms of the classes of probability measures they can represent. In addition to providing insights into the computational capacity of RLMs, the work also follows the recent exploration of the measure-theoretic foundations of LMs (Welleck et al., 2020; Meister et al., 2023; Du et al., 2023), while focusing on a particular architecture. We conclude the paper by posing several open questions on the exact position of RLMs in the hierarchy of relevant computational models. Fig. 1 shows a roadmap of the paper, with the two types of RLMs of interest and their relation to different formal computational models.

Preliminaries
In this section, we build up the necessary definitions and vocabulary for the rest of the paper.

Recurrent Neural Language Models
A formal language $L$ is a subset of the Kleene closure $\Sigma^*$ of some finite non-empty set of symbols, i.e., an alphabet, $\Sigma$. An element $y$ of $\Sigma^*$ is called a string. Furthermore, ε denotes the empty string. We assume throughout that $\varepsilon \notin \Sigma$ and denote $\Sigma_\varepsilon \overset{\text{def}}{=} \Sigma \cup \{\varepsilon\}$. A semimeasure over $\Sigma^*$ is a function $\mu\colon \Sigma^* \to [0, 1]$ such that $\sum_{y \in \Sigma^*} \mu(y) \leq 1$ (Bauwens, 2013; Icard, 2020). If the semimeasure of all strings sums to one, i.e., $\sum_{y \in \Sigma^*} \mu(y) = 1$, then $\mu$ is called a probability measure. A language model (LM) $p$ is defined as a semimeasure over $\Sigma^*$. If $p$ is a probability measure, we call it a tight language model. Most modern LMs are autoregressive, meaning they define $p(y)$ through conditional semimeasures of the next symbol given the string produced so far and the measure of ending the string, i.e.,
$$p(y) \overset{\text{def}}{=} p(\mathrm{EOS} \mid y) \prod_{t=1}^{|y|} p(y_t \mid y_{<t}), \qquad (1)$$
where EOS denotes the special end-of-string symbol, which specifies that the generation of a string has halted. The inclusion of EOS allows (but does not guarantee) a $p$ defined autoregressively to define a probability measure over $\Sigma^*$ (Du et al., 2023). We will denote $\overline{\Sigma} \overset{\text{def}}{=} \Sigma \cup \{\mathrm{EOS}\}$. We will use the following definition of an RNN.
Definition 2.1. A simple RNN $R$ is an RNN with the following hidden state update rule:
$$h_t = f\left(U h_{t-1} + V r(y_t) + b\right),$$
where $h_t \in \mathbb{Q}^D$ is the hidden state at time step $t$, $r$ is the symbol representation function mapping each symbol to a vector in $\mathbb{Q}^R$, and $U \in \mathbb{Q}^{D \times D}$, $V \in \mathbb{Q}^{D \times R}$, and $b \in \mathbb{Q}^D$ are the parameters of the RNN. The function $f$ is the saturated sigmoid, applied element-wise and defined as:
$$f(x) = \begin{cases} 0 & \text{if } x \leq 0, \\ x & \text{if } 0 < x \leq 1, \\ 1 & \text{if } x > 1. \end{cases}$$
Due to their sequential nature, RNNs have been linked to formal models of computation such as finite-state automata, pushdown automata (PDA), and Turing machines under various formalizations with different implications on computational power (e.g., Siegelmann and Sontag, 1992; Hao et al., 2018; Korsky and Berwick, 2019; Merrill, 2019; Merrill et al., 2020; Hewitt et al., 2020, inter alia). For example, if, instead of using the saturated sigmoid, we assumed that $f$ is a function that maps to a finite set, this would result in RNNs that are at most as expressive as finite-state automata (Minsky, 1967; Svete and Cotterell, 2023). Merrill et al. (2020) study the computational power of saturated RNNs by investigating the effect of asymptotically large weights. Finally, Siegelmann and Sontag (1992) assume rational-valued arithmetic, which is the convention we follow in this work.
An RNN specifies an LM by defining a conditional probability measure over $y_t$ given $y_{<t}$. Let $E \in \mathbb{Q}^{|\overline{\Sigma}| \times D}$ be an output matrix and $R$ an RNN. An RLM is an LM whose conditional probability measures are defined by projecting $E h_t$ to the probability simplex $\Delta^{|\overline{\Sigma}| - 1}$ using a projection function $\pi\colon \mathbb{Q}^{|\overline{\Sigma}|} \to \Delta^{|\overline{\Sigma}| - 1}$. When generating from an RLM, we assume the next symbol is sampled according to the probabilities defined by $\pi(E h_t)$ and is then passed as the next input symbol back into the RNN until EOS is generated.
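The definitions above can be made concrete with a small sketch in exact rational arithmetic. Everything in it is an illustrative assumption rather than part of the paper's construction: the alphabet {a, b}, the toy parameters `U`, `V`, `B`, `E`, the initial state `H0`, and the choice of the identity as projection function (usable here only because the columns of `E` already lie on the probability simplex). The hidden state simply stores a one-hot encoding of the last symbol read.

```python
from fractions import Fraction as F

SYMBOLS = ["a", "b", "EOS"]  # assumed toy alphabet plus EOS


def sat_sigmoid(x):
    """Saturated sigmoid: clip x to the interval [0, 1]."""
    return min(max(x, F(0)), F(1))


def matvec(M, v):
    """Exact matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_step(U, V, b, h_prev, r_y):
    """One update h_t = f(U h_{t-1} + V r(y_t) + b), f applied element-wise."""
    Uh, Vr = matvec(U, h_prev), matvec(V, r_y)
    return [sat_sigmoid(u + v + c) for u, v, c in zip(Uh, Vr, b)]


# Hypothetical parameters: the hidden state is a one-hot code of the last symbol.
U = [[F(0), F(0)], [F(0), F(0)]]
V = [[F(1), F(0)], [F(0), F(1)]]
B = [F(0), F(0)]
R_EMB = {"a": [F(1), F(0)], "b": [F(0), F(1)]}
H0 = [F(1), F(0)]  # assumed initial state (behaves as if "a" had just been read)
# Rows follow SYMBOLS; each column is a distribution over the next symbol.
E = [[F(1, 2), F(1, 4)],
     [F(1, 4), F(1, 4)],
     [F(1, 4), F(1, 2)]]


def string_probability(y):
    """p(y) = p(EOS | y) * prod_t p(y_t | y_<t), with identity projection."""
    h, prob = H0, F(1)
    for sym in list(y) + ["EOS"]:
        dist = matvec(E, h)  # already on the simplex for these toy parameters
        prob *= dist[SYMBOLS.index(sym)]
        if sym != "EOS":
            h = rnn_step(U, V, B, h, R_EMB[sym])
    return prob


print(string_probability("ab"))
```

For the string "ab", the three factors are p(a | ε) = 1/2, p(b | a) = 1/4, and p(EOS | ab) = 1/2.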

Turing Machines
We use a reformulation of the classic definition of a probabilistic Turing machine similar to Weihrauch's (2000) Type-2 Turing machine.
Definition 2.2. A probabilistic Turing machine (PTM) is a two-tape machine specified by the 7-tuple $M = (Q, \Sigma, \Gamma, \delta_1, \delta_2, q_\iota, q_\varphi)$, where
• $Q$ is a finite set of states;
• $\Sigma$ and $\Gamma$ are the input and tape alphabets, and $\Gamma$ includes the blank symbol ⊔;
• $q_\iota, q_\varphi \in Q$ are initial and final states;
• $\delta_1, \delta_2\colon Q \times \Gamma \to Q \times \Gamma \times \Sigma_\varepsilon \times \{L, R, N\}$ are two transition functions, one of which is chosen at random at each computation step.
The Turing machine defined above has two tapes. The first is a working tape on which symbols from the tape alphabet $\Gamma$ can be read and written. The second is an append-only output tape on which $M$ writes symbols of the output alphabet $\Sigma$. In the beginning, both tapes are empty, i.e., the working tape has only blank symbols ⊔, and the output tape has only empty symbols ε. Starting in the initial state $q_\iota$, at any time step $t$, the machine samples one of the two transition functions at random, each with probability $\frac{1}{2}$, and applies it. A given transition can be written as $(q, \gamma) \xrightarrow{y,\, d} (q', \gamma')$. The semantics of such a transition is as follows: When in state $q$ and reading $\gamma$ on the working tape, go to state $q'$, write $\gamma'$ to the working tape, write $y \in \Sigma_\varepsilon$ to the output tape, and move the head on the working tape by one symbol along the tape in the direction $d$, that is, left ($L$), right ($R$), or stay in place ($N$). When $y = \varepsilon$, the machine simply does not write anything on the output tape. The machine halts once it reaches the final state $q_\varphi$. We call the sequence of symbols $y \in \Sigma^*$ on the output tape at that point the output of the machine.
Note that, once a transition function has been chosen, since it is a function, the next transition is uniquely determined by the current state $q$ and the current tape symbol $\gamma$ under the read-write head. In the following, we call a pair $(q, \gamma) \in Q \times \Gamma$ a configuration of $M$.
Remark 2.1. Given a probabilistic Turing machine $M$ as defined above, we can get the probability of $M$ halting and outputting a specific string $y$ by summing the probabilities of all halting paths through the machine that result in $y$ being written on the output tape (the probability of each path is $2^{-n}$, where $n$ is the number of computation steps).
Remark 2.1 induces a semimeasure over the possible sequences $y \in \Sigma^*$ that a PTM $M$ can output, which we will call $P_M$. That is, $P_M(y)$ is the probability that $M$ will halt with $y$ as its output.
Remark 2.2. The notion of halting probability as defined in Remark 2.1 has a counterpart in RLMs, namely, the probability mass placed on all finite strings generated (Icard, 2020). For details, see Appendix A.
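As an illustration of how the path sum in Remark 2.1 gives rise to a semimeasure, the sketch below enumerates all halting paths of a toy PTM up to a step bound and accumulates $2^{-n}$ per path. The machine itself is a hypothetical example chosen for simplicity (one transition function emits "a" and stays in `q0`, the other moves to the final state), and the dictionary encoding of the transition functions is an assumption of this sketch.

```python
from collections import defaultdict
from fractions import Fraction

BLANK, EPS = "_", ""
MOVES = {"L": -1, "R": 1, "N": 0}

# Hypothetical toy PTM: (state, read symbol) -> (next state, written symbol, output, direction).
DELTA_1 = {("q0", BLANK): ("q0", BLANK, "a", "N")}
DELTA_2 = {("q0", BLANK): ("qf", BLANK, EPS, "N")}


def halting_mass(delta_1, delta_2, q_init, q_final, max_steps):
    """Sum 2^{-n} over all halting paths with n <= max_steps, grouped by output string.

    Assumes both transition functions are defined for every reachable configuration.
    """
    mass = defaultdict(Fraction)
    frontier = [(q_init, {}, 0, "", Fraction(1))]  # state, tape, head, output, weight
    for _ in range(max_steps):
        next_frontier = []
        for state, tape, head, out, weight in frontier:
            read = tape.get(head, BLANK)
            for delta in (delta_1, delta_2):  # each function is chosen with probability 1/2
                q_next, written, y, d = delta[(state, read)]
                new_tape = dict(tape)
                new_tape[head] = written
                if q_next == q_final:
                    mass[out + y] += weight / 2
                else:
                    next_frontier.append((q_next, new_tape, head + MOVES[d], out + y, weight / 2))
        frontier = next_frontier
    return dict(mass)


masses = halting_mass(DELTA_1, DELTA_2, "q0", "qf", max_steps=5)
print(masses, sum(masses.values()))
```

With the step bound 5, the total is $1 - 2^{-5}$: the remaining mass belongs to paths that have not halted yet, which is why bounded enumeration yields a lower approximation of $P_M$.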

Pushdown Automata
We now move to another probabilistic computational model: The two-stack pushdown automaton.
Definition 2.3. A probabilistic two-stack pushdown automaton (2PDA) is a two-stack machine defined by the tuple $P = (Q, \Sigma, \Gamma, \delta, q_\iota, q_\varphi)$, where
• $Q$ is a finite set of states;
• $\Sigma$ and $\Gamma$ are the input and stack alphabets, and $\Gamma$ includes the bottom-of-stack symbol ⊥;
• $q_\iota, q_\varphi \in Q$ are the initial and final states;
• $\delta$ is a rational-valued weighting function over transitions.
To make the connection to Turing machines more straightforward, our definition of a 2PDA assumes that its transitions depend on its current state $q$ and the top stack symbol of only the first of the two stacks. We write transitions as $q \xrightarrow{y,\ \gamma_1 \to \gamma_3,\ \gamma_2 \to \gamma_4} q'$. Such a transition denotes that the 2PDA in the state $q$ with the symbol $\gamma$ on top of the first stack pops $\gamma_1$ and $\gamma_2$ from the first and second stack, pushes $\gamma_3$ and $\gamma_4$ onto the stacks, and moves to state $q'$. At the same time, the 2PDA consumes or emits (depending on the use-case) a symbol $y$ from $\Sigma_\varepsilon$.
We require the rational weighting function $\delta$ of a 2PDA to be locally normalized over configurations $(q, \gamma) \in Q \times \Gamma$, where $\gamma$ is the symbol currently on the top of the first stack:
$$\sum_{\tau \in \mathcal{T}(q, \gamma)} \delta(\tau) = 1,$$
where $\mathcal{T}(q, \gamma)$ denotes the set of transitions available in the configuration $(q, \gamma)$. A 2PDA starts at the initial state $q_\iota$ with both stacks empty (only containing the symbol ⊥) and then sequentially applies transitions according to their probability given by $\delta$. The automaton halts when reaching the final state, $q_\varphi$. The sequence of the symbols output by the automaton concatenated in the order of transitions taken constitutes the output string.
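The local normalization requirement and the way a run produces an output string with a weight can be sketched as follows. The dictionary layout, the toy automaton, and the helper names are assumptions made for illustration; stack operations are omitted from the transition labels because they do not affect normalization.

```python
from fractions import Fraction as F

# Hypothetical 2PDA fragment: DELTA[(q, gamma)] maps each available transition,
# abbreviated here as (emitted symbol, next state), to its rational weight.
DELTA = {
    ("q0", "BOT"): {("a", "q0"): F(1, 2), ("", "qf"): F(1, 2)},
    ("q0", "0"): {("a", "q0"): F(1, 4), ("b", "q0"): F(1, 4), ("", "qf"): F(1, 2)},
}


def is_locally_normalized(delta):
    """Check that, in every configuration, the weights lie in [0, 1] and sum to one."""
    return all(
        all(0 <= w <= 1 for w in outgoing.values()) and sum(outgoing.values()) == 1
        for outgoing in delta.values()
    )


def path_weight(delta, path):
    """Return (weight, output) of a run given as [((q, gamma), (y, q_next)), ...]."""
    weight, output = F(1), ""
    for config, transition in path:
        weight *= delta[config][transition]
        output += transition[0]  # the empty string plays the role of epsilon
    return weight, output


PATH = [
    (("q0", "BOT"), ("a", "q0")),
    (("q0", "0"), ("b", "q0")),
    (("q0", "0"), ("", "qf")),
]
print(is_locally_normalized(DELTA), path_weight(DELTA, PATH))
```

The weight of a run is the product of the weights of its transitions, and its output is the concatenation of the emitted symbols with ε removed.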
A note on variants of computational models. Our definition of a probabilistic Turing machine differs from the traditional definition of mere language acceptors in that our machines start from an initial state $q_\iota$ and then iteratively apply probabilistic transitions to generate outputs $y \in \Sigma^*$, where each specific $y$ has a corresponding probability of being produced. This is to simplify the comparison to 2PDA and to be able to interpret them as language models in their own right.
Next, we define what it means for two probabilistic models of computation to be equivalent.
Definition 2.4. We say that two probabilistic computational models $M_1$ and $M_2$ are weakly equivalent if, for any string $y \in \Sigma^*$, we have $P_{M_1}(y) = P_{M_2}(y)$. If, furthermore, there exists a weight-preserving, yield-preserving bijection between halting paths in the two models, they are called strongly equivalent.

An Upper Bound
In this section, we establish an upper bound on the expressive power of RLMs by extending Siegelmann and Sontag's (1992) result to the probabilistic case of language models. Because we want to upper bound the power of RLMs used in practice, we will start with a more unrealistic recurrent LM which can output empty symbols (ε), which we denote as εRLM. We first introduce a variant of probabilistic Turing machines that can have an arbitrary (finite) number of rationally valued transition functions and show that they are strongly equivalent to 2PDA (Section 3.1). We then review Siegelmann and Sontag's (1992) construction for the unweighted case (Section 3.2). Finally, we extend this construction to the probabilistic case by showing how to simulate a 2PDA with an εRLM. We conclude with the observation that this results in the equivalence of PTMs and εRLMs (Section 3.3).

Rationally Weighted PTMs
This paper considers the expressive power of RNNs with rational weights. To make the connection to PTMs easier, it is helpful to define a more general type of PTM which, instead of sampling between two equally probable transition functions, can have any number of possible transitions at a given computation step, each of which has a rational probability of being applied.
Definition 3.1. A rational-valued probabilistic Turing machine (QPTM) is a PTM whose transition weighting function is of the form:
$$\delta\colon Q \times \Gamma \times Q \times \Gamma \times \Sigma_\varepsilon \times \{L, R, N\} \to \mathbb{Q} \cap [0, 1].$$
In other words, for any current configuration, it assigns a rational-valued probability in the interval $[0, 1]$ to each available transition. We require that the probabilities are normalized over configurations, that is, for all $q \in Q$, $\gamma \in \Gamma$:
$$\sum_{q', \gamma', y, d} \delta(q, \gamma, q', \gamma', y, d) = 1.$$
The original construction by Siegelmann and Sontag (1992) uses unweighted 2PDA, which are equivalent to Turing machines (Hopcroft et al., 2001). We now want to show that we can simulate a PTM with an εRLM in the same way, that is, via probabilistic 2PDA as defined above. Therefore, we first show that PTMs and probabilistic 2PDA are also equivalent, in the following two propositions.
See Appendix C for the proof.
Proof. The proof that any PTM has a strongly equivalent 2PDA closely follows the proof that any (unweighted) deterministic TM can be simulated by a two-stack PDA (Thm. 8.13; Hopcroft et al., 2001). The idea is to use the two stacks in tandem to simulate the TM's infinite tape. The first stack contains the symbols to the right of the TM's head and the top symbol on the first stack is the tape symbol under the TM's head. The second stack contains the symbols to the left of the head. See Fig. 2 for a visualization. The extension to the probabilistic case, using the introduced definitions of QPTMs and 2PDA, is straightforward; see Appendix D. ■
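The two-stack view of a tape used in the proof can be sketched in a few lines. The class name and its methods are hypothetical and only illustrate the data layout: the first stack (`right`) holds the cell under the head on top followed by everything to its right, and the second stack (`left`) holds the cells to the left of the head.

```python
class TwoStackTape:
    """A two-way infinite tape simulated by two stacks (Python lists, top = last item)."""

    def __init__(self, blank="_"):
        self.blank = blank
        self.left = []         # second stack: cells left of the head, nearest cell on top
        self.right = [blank]   # first stack: cell under the head on top, then cells to its right

    def read(self):
        return self.right[-1]

    def write(self, symbol):
        self.right[-1] = symbol

    def move_right(self):
        # The cell under the head becomes the nearest cell on the left.
        self.left.append(self.right.pop())
        if not self.right:     # ran off the written part of the tape
            self.right.append(self.blank)

    def move_left(self):
        # The nearest cell on the left (or a fresh blank) moves under the head.
        self.right.append(self.left.pop() if self.left else self.blank)


tape = TwoStackTape()
tape.write("x")
tape.move_right()
tape.write("y")
tape.move_left()
print(tape.read())
```

Each head move transfers exactly one symbol between the stacks, so one TM step corresponds to a constant number of stack operations.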

Simulating Unweighted TMs
Before we introduce the equivalence of the models in the probabilistic case, we review the classical unweighted construction of an RNN simulating a TM, first introduced by Siegelmann and Sontag (1992) and simplified by Chung and Siegelmann (2021). Specifically, Siegelmann and Sontag (1992) show that a simple RNN can encode a TM by simulating a deterministic unweighted 2PDA. This 2PDA takes an input string $y$ and maps it to the output $M(y)$ given by the simulated Turing machine: Given a deterministic unweighted 2PDA, the construction defines an RNN that halts and stores the acceptance of $y$ by the 2PDA in a specific neuron of the RNN, or never halts if $M(y) = \text{undef}$.
The crux of the construction lies in encoding the content of a stack in a neuron. Importantly, the encoding must be such that (i) the top of the stacks can easily be read and (ii) the encoding of the stack can easily be updated upon popping off or pushing onto the stack. This can, for example, be achieved by mapping a (binary) stack $\gamma = \gamma_1 \cdots \gamma_N \in \{0, 1\}^N$ to the rational number $\eta(\gamma) = 0.d_N d_{N-1} \cdots d_1$ written in base 10, where $d_n = 1$ if $\gamma_n = 0$ and $d_n = 3$ if $\gamma_n = 1$. Notice the opposite orientation of the two encodings: The top of the stack in $\gamma$ is written on the right-hand side while it is the left-most digit in the numerical encoding, which enables easy updates to the encoding; with this, popping $\gamma_N = 0$ can, for example, be performed by computing $f(10 \cdot \eta(\gamma) - 1)$, and pushing $\gamma = 1$ by computing $f\left(\frac{1}{10} \cdot \eta(\gamma) + \frac{3}{10}\right)$. Similarly, the current state of a 2PDA is stored in a set of neurons keeping the one-hot encoding of the state, which is updated by simulating the transition function of the 2PDA. This can be done by intersecting the states reachable from the current configuration of the 2PDA and the states reachable by the currently read symbol, the same way as in the classical Minsky construction of a simple RNN simulating a finite-state automaton (Minsky, 1954). Because of the determinism of the transition function, this results in a single possible next state. The intersection can be implemented using conjunction, which is possible using the saturated sigmoid function.
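A minimal sketch of such a stack encoding in exact rational arithmetic is given below. It assumes that the stack symbols 0 and 1 are stored as the decimal digits 1 and 3 (consistent with the push update $f(\frac{1}{10} \cdot \eta(\gamma) + \frac{3}{10})$ quoted above); the readout `top` is one possible choice made for this illustration and is not taken from the original construction.

```python
from fractions import Fraction as F

DIGIT = {0: 1, 1: 3}  # assumed digits representing the stack symbols 0 and 1


def f(x):
    """Saturated sigmoid: clip x to [0, 1]."""
    return min(max(x, F(0)), F(1))


def encode(stack):
    """Encode a stack listed bottom-to-top; the top becomes the first decimal digit."""
    eta = F(0)
    for symbol in stack:
        eta = eta / 10 + F(DIGIT[symbol], 10)
    return eta


def push(eta, symbol):
    """Shift all digits one place to the right and prepend the new top digit."""
    return f(eta / 10 + F(DIGIT[symbol], 10))


def pop(eta, symbol):
    """Remove the top digit; `symbol` must be the current top of the stack."""
    return f(10 * eta - DIGIT[symbol])


def top(eta):
    """Illustrative readout: 1 if the top digit is 3, else 0 (also 0 for the empty stack)."""
    return f(10 * eta - 2)


eta = encode([1, 0])   # bottom symbol 1, top symbol 0
print(eta, top(eta), pop(eta, 0), push(eta, 1))
```

Every operation is an affine map followed by the saturated sigmoid, which is exactly the form of a simple RNN update. Because the readout returns 0 both for a top symbol 0 and for an empty stack, a full construction needs an additional signal for emptiness.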
With this, an RNN simulating a 2PDA can be constructed by keeping a hidden state vector divided into multiple sets of values, three of which will be relevant for our extension: (1) two stack neurons, each representing a stack; (2) two readout neurons, each encoding the symbols on top of one of the stacks; (3) $|Q|$ state neurons encoding the current state of the 2PDA. The readout neurons can be computed from the stack encodings $\eta(\gamma_1)$ and $\eta(\gamma_2)$ similarly to how the stack encodings are updated. See the top of Fig. 3 for an illustration of how these components can be used to determine the quantities relevant to determining the next action of the 2PDA. More details of the construction can be found in Chung and Siegelmann (2021, Thm. 1). Importantly, note that in this case, the RNN (and the 2PDA) can be fully deterministic due to the equivalence of deterministic and non-deterministic unweighted TMs. They also do not have to consider any ε-transitions or steps generating ε's since there is no generation in the sense of Section 2.3. These two aspects will, however, require more attention in the probabilistic case, which we discuss next.

Simulating PTMs
A TM can perform an unbounded number of computational steps per output symbol. To account for this with RLMs in the language modeling setting, we extend their definition to one that allows generating ε's, effectively allowing RNNs to perform computations without affecting the output string (εRLM).
Definition 3.2. An RLM with ε-transitions (εRLM) is an RLM that can output ε-symbols.
More precisely, an εRLM defines a symbol representation function $r\colon \Sigma_\varepsilon \to \mathbb{Q}^R$ and the output matrix $E \in \mathbb{Q}^{|\Sigma_\varepsilon| \times D}$, where $D$ and $R$ are parameters depending on the 2PDA (Chung and Siegelmann, 2021), and $\Sigma_\varepsilon = \Sigma \cup \{\varepsilon\}$. The ε-symbols represent empty substrings, so the final output of the εRLM is the output string with ε's removed. Effectively, this gives an εRLM the possibility to perform an arbitrary number of computations per symbol of the string. With this additional gadget, we are able to state our main result establishing a close connection between PTMs and εRLMs.
On determinism. The construction we describe in the following theorem requires that the next transition of the 2PDA is fully specified given the current state $(q, \gamma)$ and the (sampled) output symbol from $\Sigma_\varepsilon$. That is, the non-determinism of the simulated 2PDA is constrained to the sampling step of the RLM, meaning there can only be one possible transition in the 2PDA per output symbol. We call a 2PDA or QPTM that has this property Σ-deterministic. Note that this is still a non-deterministic automaton; see Def. 4.1.
Rough idea. Given a Σ-deterministic 2PDA $P$, we design an εRLM $R$ that simulates $P$ by executing its transitions and hence defining the same semimeasure over strings. We use the LM controller from Chung and Siegelmann (2021) (with the same definitions of the parameters $U$, $V$, and $b$), as it conveniently models the transitions of $P$ and exposes the parts of its configuration required to define the transition (and with it the string) probabilities. Note that the additional ε's do not change the construction. This leaves us with the task of appropriately defining the output matrix $E$. Exposing the symbols on the top of the stacks, $\mathrm{top}(\gamma_1)$ and $\mathrm{top}(\gamma_2)$, and the current state $q$ of $P$ in $h_t$ (cf. Fig. 3) allows us to easily access the appropriate probabilities encoded in the output matrix $E$. Note that due to the Σ-determinism of $P$, a single pair of the stack symbol and the current state $(\mathrm{top}(\gamma_1), q)$ determines the conditional probability measure over the next symbol.[16] More precisely, we define $E \in \mathbb{Q}^{|\Sigma_\varepsilon| \times |\Gamma_\varepsilon||Q|}$, which maps the one-hot encoding of the pair $(\gamma, q)$ to a $|\Sigma_\varepsilon|$-dimensional vector of probabilities over the next symbol.[17] To achieve that, simply let $E_{y, (\gamma, q)}$ correspond to the weight that $\delta$ assigns to the transition leaving state $q$ with $\gamma$ on top of the first stack and emitting $y$ (writing • for the remaining components of the transition), where we index the output matrix directly with the elements for cleaner notation.[18] Denoting with $[\![\gamma, q]\!]$ the one-hot encoding of the tuple $(\gamma, q)$, the vectors $E [\![\gamma, q]\!]$ represent semimeasures over $\Sigma_\varepsilon$, and $\pi$ can be set to the identity function.[19] Considering that $R$ directly simulates all possible paths of $P$, it is easy to see that $R$ generates a string $y$ with a sequence of actions if and only if $P$ generates it as well. Moreover, the encoding of the probabilities in $E$ means that the probabilities of the action sequences are always the same. ■
A formal statement of the theorem is given in Appendix E. We provide a proof where we show that the correspondence between the paths produced by Chung and Siegelmann's (2021) construction and the definition of $E$ as described above results in a trivial weight- and yield-preserving mapping between the paths of the 2PDA $P$ and the εRLM $R$ that simulates it. Together, this shows that the two machines are strongly equivalent. This construction is implemented in https://github.com/rycolab/rnn-turing-completeness.
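The role of the output matrix can be illustrated with a toy example. The alphabets, the state set, and the emission weights below are hypothetical, and the sketch indexes columns by pairs from Γ × Q only, a simplification of the dimension stated above. Multiplying $E$ by the one-hot encoding of a configuration selects the column holding that configuration's next-symbol distribution, so the identity can serve as the projection.

```python
from fractions import Fraction as F
from itertools import product

SIGMA_EPS = ["a", "b", "eps"]          # assumed output alphabet plus the empty symbol
GAMMAS = ["BOT", "0", "1"]             # assumed stack alphabet
STATES = ["q0", "q1"]                  # assumed state set
CONFIGS = list(product(GAMMAS, STATES))

# Hypothetical weights of a Sigma-deterministic 2PDA: for each (top symbol, state),
# the weight of the unique transition emitting each symbol.
WEIGHTS = {
    ("BOT", "q0"): {"a": F(1, 2), "b": F(1, 4), "eps": F(1, 4)},
    ("0", "q0"): {"a": F(1, 3), "eps": F(2, 3)},
}


def build_output_matrix(weights):
    """Rows are indexed by output symbols, columns by (gamma, q) pairs."""
    return [[weights.get(c, {}).get(y, F(0)) for c in CONFIGS] for y in SIGMA_EPS]


def one_hot(config):
    return [1 if c == config else 0 for c in CONFIGS]


def next_symbol_distribution(E, config):
    """Compute E times the one-hot vector of the configuration (identity projection)."""
    v = one_hot(config)
    return [sum(e * x for e, x in zip(row, v)) for row in E]


E = build_output_matrix(WEIGHTS)
print(next_symbol_distribution(E, ("BOT", "q0")))
```

Configurations without any listed transition receive an all-zero column, which corresponds to a semimeasure with zero mass rather than a probability distribution.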
Finally, we can show that the expressivity of εRLMs is bounded from above by that of a 2PDA.
Proposition 3.2. Every εRLM has a weakly equivalent 2PDA.
Proof. For the proof, see Appendix G. ■
[16] Recall that the target configuration of $P$ depends only on the top symbol of the first stack, $\gamma \overset{\text{def}}{=} \mathrm{top}(\gamma_1)$.
[17] One-hot encodings of the state-stack symbol pairs can be obtained by applying the RNN update (sub-)step in which the nonlinearity is used to implement conjunction. This adds another sub-step to the simulation of the full 2PDA update step.
[18] Again, due to the assumed determinism, the elements with • are irrelevant for the weights.
[19] The identity function is generally not a projection function onto the probability simplex. However, since its inputs in this case already lie on the probability simplex, its use is possible. More generally, we could use the sparsemax function (Martins and Astudillo, 2016), which acts like the identity function on the probability simplex. Alternatively, we could use the more popular softmax function and set the entries of $E$ to the logarithms of the original probabilities (defining log 0 …).

A Lower Bound
While Thm. 3.2 establishes a concrete result on the expressive power of εRLMs, the result follows from somewhat unrealistic assumptions, namely rationally weighted networks and unbounded computation time. We contend the first assumption is a reasonable approximation, since even for small neural networks the number of expressible states can be large; assuming double precision floating point numbers, an RNN can yield as many as $2^{64 \cdot D}$ different states, where $D$ is the number of neurons.[21] However, RLMs used in practice operate in real-time, outputting a symbol at every computation step. To make our analysis closer to this use case, in this section, we develop a lower bound on the expressivity of an RLM under the real-time restriction while still allowing rational arithmetic operations.

Real-time RLMs
Now, we switch back to studying the more common RLM with an RNN controller based on Def. 2.1. Firstly, note that the class of RLMs is a subset of the class of εRLMs:
Proposition 4.1. For every RLM there exists a strongly equivalent εRLM.
Proof. This result follows trivially from Def. 3.2: An RLM is simply an εRLM that always assigns probability 0 to outputting ε. ■

Real-time Deterministic 2PDA
The lack of ε-transitions requires the properties of the simulated model to change: As in Thm. 3.2, the RNN construction requires that there is only one transition for every output symbol and configuration. Previously, this was done by imposing Σ-determinism, where non-determinism over symbols at a given time step can be reintroduced by delaying transitions through the use of additional ε-transitions, which is not possible here.
In fact, the lack of ε's and binarization mean the resulting PDA has to be not just real-time, but also deterministic. We define such a 2PDA analogously to the single stack case:
Definition 4.1. A 2PDA is deterministic if:
• For any current state $q \in Q$, current top stack symbol $\gamma \in \Gamma$, and a given output symbol $y \in \Sigma_\varepsilon$, there is at most one transition with non-zero probability.
• If, in a given computation step, the weight of an ε-transition is non-zero, then its weight is 1 and the weight of all other transitions is 0.
[21] To use the state space with limited precision more effectively, one could add more stack-encoding neurons or choose more efficient encodings of the stack contents.
Proof. This follows directly from Thm. 3.2 since an RD−2PDA is just a special case of the 2PDA without ε-transitions, which is exactly the restriction imposed on the RLM. ■
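The two conditions for determinism, together with the real-time restriction (no ε-transitions with non-zero weight), can be checked mechanically on a transition table. The tuple layout `(q, gamma, y, weight, q_next)` and the function names are assumptions of this sketch; stack updates are left out because the conditions only concern the state, the top symbol of the first stack, and the output symbol.

```python
from collections import defaultdict
from fractions import Fraction as F

EPS = ""  # the empty output symbol


def is_deterministic(transitions):
    """Check both conditions on a list of (q, gamma, y, weight, q_next) tuples."""
    per_symbol = defaultdict(int)    # (q, gamma, y) -> number of non-zero transitions
    per_config = defaultdict(list)   # (q, gamma) -> [(y, weight)] with non-zero weight
    for q, gamma, y, weight, _q_next in transitions:
        if weight > 0:
            per_symbol[(q, gamma, y)] += 1
            per_config[(q, gamma)].append((y, weight))
    # Condition 1: at most one non-zero transition per configuration and output symbol.
    if any(count > 1 for count in per_symbol.values()):
        return False
    # Condition 2: a non-zero epsilon-transition has weight 1 and excludes all others.
    for outgoing in per_config.values():
        eps_weights = [w for y, w in outgoing if y == EPS]
        if eps_weights and (eps_weights != [1] or len(outgoing) > 1):
            return False
    return True


def is_real_time(transitions):
    """Real-time: no epsilon-transition carries non-zero weight."""
    return all(y != EPS or weight == 0 for _, _, y, weight, _ in transitions)


RD = [("q0", "BOT", "a", F(1, 2), "q0"), ("q0", "BOT", "b", F(1, 2), "q1")]
MIXED = [("q0", "BOT", EPS, F(1, 2), "q0"), ("q0", "BOT", "a", F(1, 2), "q1")]
print(is_deterministic(RD) and is_real_time(RD), is_deterministic(MIXED))
```

The table `MIXED` is rejected because its ε-transition shares a configuration with another non-zero transition, which is precisely the situation the second condition rules out.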

Real-time Deterministic QPTM
As before, we want to connect the RLM with the better-understood PTM. To do so, we introduce a new class of rationally weighted PTMs that are deterministic and operate in real-time.
Definition 4.2. A QPTM is deterministic if, for any configuration $(q, \gamma) \in Q \times \Gamma$ and any symbol $y \in \Sigma_\varepsilon$, there is at most one transition starting at that configuration and emitting $y$ with non-zero probability. Furthermore, if there is a transition starting in $(q, \gamma)$ outputting ε with non-zero probability, it must be the only possible transition in that configuration. If there are no ε-transitions with non-zero probability at all, then it is called real-time (RD−QPTM).
Proof. This directly follows from Thm. 3.1 because RD−QPTMs are a special case of general QPTMs. ■
The resulting Turing machine that our RLM can simulate is now strictly less expressive than the original PTM. See Appendix F for the proof. Hence, our lower bound is strictly less powerful than the upper bound.

Open Questions
This work establishes upper and lower bounds on the expressive power of RLMs. While this shows how powerful RLMs can be, the bounds do not completely and precisely characterize the models of interest. A natural question for follow-up work is, therefore, the following.
Open Question 5.1. What is the exact computational power of a rationally weighted RLM?
While we do not answer this question definitively, we hope that the steps and framework outlined here help follow-up work to establish more precise descriptions of LMs in general, be it in the form of RNNs or other architectures. Furthermore, in this work, we have introduced novel models of probabilistic computation (RD−2PDA, RD−QPTM) that prove useful for describing RLMs in a formal setting due to the close connection between their dynamics and those of RNNs. We also provide a preliminary analysis of the concrete computational power of the novel models. For example, in Appendix F, we provide an example of a language that can be generated by a QPTM but not by its real-time deterministic counterpart, thereby showing that the former is more powerful than the latter. However, we leave a more precise characterization of their expressive power to future work. Specifically:
Open Question 5.2. What is the relationship between deterministic QPTMs and non-deterministic devices lower on the hierarchy of computational models, e.g., non-deterministic probabilistic finite-state automata, which cannot be represented by deterministic finite-state automata (Mohri, 1997; Buchsbaum et al., 2000)?
Open Question 5.3. Are the εRLMs introduced in Def. 3.2 weakly equivalent to non-deterministic QPTMs without the need to introduce two different types of ε symbol to store the direction of the head in the outputs?

Discussion and Conclusion
The widespread deployment of LMs in more and more far-reaching applications motivates a precise theoretical understanding of their abilities and shortcomings. In this paper, we show that tools from formal language theory, namely, probabilistic Turing machines and their extensions, offer a fruitful means of investigating those abilities by allowing us to directly characterize the classes of (probabilistic) languages LMs can represent.
Concretely, we place two different formalizations of RLMs into the framework of probabilistic Turing machines, thus characterizing their computational power. To connect our results with the bigger picture of understanding RLMs, consider again Fig. 1. The upper part of Fig. 1 (left to right) expresses the equivalence of PTMs, their rationally valued equivalent, and probabilistic two-stack PDAs. These provide an upper bound for Σ-deterministic 2PDA, which are 2PDA that are deterministic in their output alphabet. These in turn can be simulated by RLMs that can output ε, allowing the model to perform an unbounded number of computations in between outputting output tokens. In Appendix G, we show that any εRLM is weakly equivalent to some 2PDA, meaning the expressivity of εRLMs is upper-bounded by that of 2PDA. The lower half of Fig. 1 shows the results on the more realistic real-time RLMs with rational weights. We show that such models match the expressive power of real-time probabilistic Turing machines with rational weights (lower left) through their correspondence to real-time deterministic probabilistic two-stack PDAs (lower right). These results provide a set of first insights into the modeling power of modern language models and hopefully provide a starting point for the investigation of other modern architectures, such as transformers (Vaswani et al., 2017).

Limitations
Here, we list several points of our analysis that we consider limiting. Similarly to Siegelmann and Sontag (1992), all our results assume the RLMs to have rationally valued weights and hidden states, which is not the case for RLMs implemented in practice. It remains to be shown whether the bounded precision of practical implementations is too restrictive for the LMs to learn to solve algorithmic problems. The upper bound result additionally assumes that computation time is unbounded, which is a departure from how RNNs function in practice. It is not clear how an RNN could be trained in a non-real-time manner, or whether that would actually lead to better results on any of the standard NLP tasks.
Importantly, note that the lower bound result is likely not tight, as we only show the ability of RLMs to simulate a specific computational model (namely, RD-QPTMs). There might be more expressive models that can also be simulated by RLMs. In general, the results presented are theoretical in nature, and the constructions are not necessarily a practically efficient way of simulating Turing machines. Moreover, we do not suggest that trained RNNs in practice actually implement such mechanisms, but only that they are theoretically capable of doing so. The construction thus serves the specific purpose of theoretically simulating PTMs and does not naturally extend to training and inference outside of problems specifically designed for Turing machines.

Ethics Statement
Our work sheds light on the theoretical capabilities of language models. To the best of the authors' knowledge, it does not pose any ethical issues.

A Halting Probability and RLM
As discussed in Remark 2.1, the halting probability of a PTM is defined as the sum of the probabilities of all halting paths, i.e., paths that end with $q_\varphi$. Note that EOS in an RLM and the final state $q_\varphi$ in a PTM are similar constructs, and so we can consider a similar notion for RLMs.
We first see how the corresponding notion of halting probability arises in the context of RLMs. While it is possible to define a probability measure over $\Sigma^*$ with the autoregressive parameterization as in Eq. (1), not all semimeasures defined by Eq. (1) are probability measures over $\Sigma^*$. For example, in the definition of an RLM (Section 2.1), if we pathologically choose a projection function $\pi$ such that it always places zero probability on EOS, we would end up with a semimeasure that places 0 probability on $\Sigma^*$.$^{23}$ In fact, Eq. (1) defines a probability measure over the set of finite and infinite strings, $\Sigma^* \cup \Sigma^\infty$. Under this formulation, the halting probability of an RLM is the probability mass placed on the set of finite strings, $\Sigma^*$.$^{24}$ We recognize that a similar situation exists in the case of a PTM, where the non-halting trajectories are infinite sequences that can be considered as elements of $\{0, 1\}^\infty$.$^{25}$
In trying to measure the probability mass placed on the set $\Sigma^*$ within $\Sigma^* \cup \Sigma^\infty$, we first need to define an appropriate probability measure over $\Sigma^* \cup \Sigma^\infty$. However, defining probability measures over uncountable sets such as $\Sigma^\infty$ or $\{0, 1\}^\infty$ raises nontrivial difficulties. As a simple illustration, consider an infinite sequence of fair coin tosses. The sample space is $\{H, T\}^\infty$. Clearly, each single infinite outcome (a binary string $\omega$) has probability $(\frac{1}{2})^\infty = 0$. However, treating uncountable sample spaces carelessly would result in the following paradox:
$$1 = \mathbb{P}\left(\{H, T\}^\infty\right) = \mathbb{P}\Big(\bigcup_{\omega \in \{H, T\}^\infty} \{\omega\}\Big) = \sum_{\omega \in \{H, T\}^\infty} \mathbb{P}(\{\omega\}) = \sum_{\omega \in \{H, T\}^\infty} 0 = 0.$$
For reasons like this, a rigorous discussion of PTMs (Roy, 2011) or RLMs (Du et al., 2023) involves a modicum of measure theory and typically starts with defining the appropriate σ-algebra. In this work, we find that introducing such technical machinery obscures our purposes, and we therefore intentionally omit it. For a rigorous discussion of the corresponding definition of halting probability in RLMs and more general autoregressive models, see Du et al. (2023).
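The notion of halting probability discussed above can be made concrete with a small numerical sketch. The example below is an illustration under stated assumptions, not part of the original analysis: it uses a one-symbol alphabet and a hypothetical EOS probability of $1/(t+2)^2$ after $t$ symbols (the function names `eos_probability` and `finite_string_mass` are ours). With this choice, the mass placed on finite strings of length below $T$ is $T / (2(T+1))$, which tends to $1/2$, so half of the probability mass sits on the infinite string.

```python
from fractions import Fraction

def eos_probability(t: int) -> Fraction:
    """Hypothetical EOS probability after t emitted symbols: 1 / (t + 2)^2."""
    return Fraction(1, (t + 2) ** 2)

def finite_string_mass(max_len: int) -> Fraction:
    """Probability mass on strings a^t with t < max_len (one-symbol alphabet)."""
    mass = Fraction(0)
    survive = Fraction(1)  # probability that EOS has not been emitted yet
    for t in range(max_len):
        p_eos = eos_probability(t)
        mass += survive * p_eos   # halt now, producing the finite string a^t
        survive *= 1 - p_eos      # otherwise emit another symbol and continue
    return mass

# The partial sums approach 1/2 from below, so this toy model is a
# semimeasure on finite strings whose halting probability is 1/2, not 1.
for T in (1, 10, 100, 1000):
    print(T, finite_string_mass(T), float(finite_string_mass(T)))
```

Exact rational arithmetic is used so that the closed form can be checked without floating-point error.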

B Versions of Probabilistic Two-stack Pushdown Automata
In our work, we use an adaptation of the traditional two-stack PDA whose transition function depends only on the top symbol of one of the stacks (and the current state), whereas usually the top symbols of both stacks are taken into account. This setup follows the proof by Hopcroft et al. (2001) but warrants additional justification when applied to the probabilistic case.
Proposition B.1. A 2PDA whose transition weighting function depends on the top symbols of both stacks can be simulated by a 2PDA whose transition function depends only on the top symbol of the first stack.
Proof. ($\Longrightarrow$) Let $P_1$ be a 2PDA whose transition function depends on the top symbols of both stacks. We now show that we can construct a 2PDA $P_2$ as defined in Def. 2.3 such that, for any transition in $P_1$, $P_2$ has a finite sequence of transitions resulting in the same state and stack configurations. Consider such a transition of $P_1$ into a state $q'$, where $\gamma$ is the top symbol on the first stack and $\gamma'$ is the top symbol on the second stack. We can simulate this transition in $P_2$ through a chain of transitions that passes through a new, transition-specific state $q''$.
($\Longleftarrow$) The converse direction, namely that any 2PDA $P_1$ whose transitions depend on the top symbols of both stacks can simulate a given 2PDA $P_2$ whose transitions depend only on the top symbol of the first stack, is trivial: For any transition in $P_2$, we can just create a transition with the same semantics and weight 1 in $P_1$ for each second-stack top symbol $\gamma'$. ■

C Rationally Weighted Probabilistic Turing Machines
Proof. ($\Longrightarrow$) The forward direction is trivial: Every PTM is a QPTM because $\frac{1}{2}$ is a rational number.
($\Longleftarrow$) We start by noting that we can transform any QPTM $M$ into one that has exactly two possible (rational-valued) transitions at any current state and tape symbol. We do this by repeatedly applying the following transformations. For any $(q, \gamma) \in Q \times \Gamma$ that has only one possible transition, its probability is 1, so we can split it into two new identical transitions with probability $\frac{1}{2}$ each. Any $(q, \gamma) \in Q \times \Gamma$ that allows exactly two possible transitions is already as required by the PTM (save for the probabilities, which we deal with in the next step). For any $(q, \gamma) \in Q \times \Gamma$ that allows $k > 2$ possible transitions, we repeatedly apply the following steps:
1. We choose one of the transitions, whose probability we denote by $p$, and leave it as it is;
2. We then create a new ε-transition with $d = N$ to a new state with probability $1 - p$, leaving $\gamma$ the same;
3. We then change the remaining $k - 1$ transitions to start at the new state and tape symbol.
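The three steps above can be sketched as a short routine that turns a $k$-way choice into a chain of binary choices. This is a minimal illustration under assumptions of our own: the function names are hypothetical, and we assume that the transitions moved to the new state are renormalized by the mass $1 - p$ that reaches it, which is what keeps each state locally normalized.

```python
from fractions import Fraction

def binarize(probs):
    """Turn a k-way choice into a chain of binary choices.

    Returns one (p_take, p_defer) pair per chain state: the chosen transition
    is taken with p_take, and the new epsilon-transition to the next chain
    state is taken with p_defer. At the last chain state, "deferring" stands
    for taking the final original transition.
    """
    chain = []
    remaining = Fraction(1)  # mass that reaches the current chain state
    for p in probs[:-1]:
        p_take = p / remaining  # assumed renormalization within the state
        chain.append((p_take, 1 - p_take))
        remaining -= p
    return chain

def path_probabilities(chain):
    """Recover the probability of each original transition from the chain."""
    out, reach = [], Fraction(1)
    for p_take, p_defer in chain:
        out.append(reach * p_take)
        reach *= p_defer
    out.append(reach)  # last transition: every earlier one was deferred
    return out

probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
chain = binarize(probs)
print(chain)                      # binary choices at each chain state
print(path_probabilities(chain))  # matches the original three probabilities
```

The check at the end confirms that the product of the weights along each chain reproduces the original transition probabilities, so string probabilities are unchanged.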
These transformations yield a QPTM with a completely binarized transition function. An example of this is shown in Fig. 4. Now note that any locally normalized pair of transitions with rational weights can be replaced by a sequence of transitions whose probabilities are $\frac{1}{2}$ each (Knuth and Yao, 1976; Icard, 2020).$^{26}$ This, in conjunction with the previous transformation, allows us to convert our QPTM into a PTM without changing the string probability. ■

D Proof of Thm. 3.1
Theorem 3.1. QPTMs and 2PDA are strongly equivalent.
Proof. ($\Longrightarrow$) We want to show that for every QPTM, we can construct a strongly equivalent 2PDA. We do this in two steps: (1) we construct a candidate 2PDA and (2) we outline a weight- and yield-preserving bijection between accepting paths, whose existence proves strong equivalence.
Construction of $P$. Let $M = (Q, \Sigma, \Gamma, \delta_M, q_\iota, q_\varphi)$ be a QPTM. The constructed 2PDA $P = (Q_P, \Sigma, \Gamma, \delta_P, q_\iota, q_\varphi)$ will inherit $M$'s alphabets. It will also inherit all of $M$'s states and add a number of additional ones, i.e., $Q_P = Q \cup Q'$, so the states of $P$ are a superset of those in $M$. As we showcase later, the subset $Q \subseteq Q_P$ will be crucial in the analysis of the relationship between the two models. For each transition in $M$, we define a transition in $P$ as follows, depending on the direction the head moves after writing to the tape:
(i) Transitions that leave the head in place, that is, transitions of the form $(q, \gamma) \xrightarrow{y/N} (q', \gamma')$. For each such transition in $M$, we add an equally weighted new transition $q \xrightarrow[\varepsilon \to \varepsilon]{y, \gamma, \gamma \to \gamma'} q'$ to $P$.
(ii) Transitions that move the head to the right, i.e., $(q, \gamma) \xrightarrow{y/R} (q', \gamma')$. For each such transition in $M$, we add an equally weighted new transition $q \xrightarrow[\varepsilon \to \gamma']{y, \gamma, \gamma \to \varepsilon} q'$ to $P$.
(iii) Transitions that move the head to the left, that is, transitions of the form $\tau = (q, \gamma) \xrightarrow{y/L} (q', \gamma')$. For any such transition $\tau$ in $M$, we add to $P$ an equally weighted transition from $q$ to $q_\tau$, followed by a transition from $q_\tau$ to $q'$ with weight 1. Here, $q_\tau$ is a new state unique to transition $\tau$, and $\gamma_2$ is the symbol at the top of the second stack before the transition.
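The three cases above rest on storing the tape of $M$ in the two stacks of $P$. The following sketch illustrates that idea under an assumed convention: the first stack holds the symbol under the head (on top) together with everything to its right, and the second stack holds everything to the left of the head. The class `TwoStackTape`, the blank symbol `"_"`, and the behavior at the left edge of the tape are hypothetical choices made for illustration only.

```python
class TwoStackTape:
    """Turing machine tape stored as two stacks (illustrative helper).

    Assumed convention: `first` holds the symbol under the head on top and
    everything to its right below it; `second` holds everything to the left
    of the head, with the nearest symbol on top.
    """
    BLANK = "_"

    def __init__(self, contents):
        self.first = list(reversed(contents)) or [self.BLANK]
        self.second = []

    def read(self):
        return self.first[-1]

    def step(self, write, direction):
        if direction == "N":    # case (i): replace the top of the first stack
            self.first[-1] = write
        elif direction == "R":  # case (ii): pop from first, push onto second
            self.first.pop()
            self.second.append(write)
            if not self.first:
                self.first.append(self.BLANK)
        elif direction == "L":  # case (iii): write, then move one symbol back
            self.first[-1] = write
            left = self.second.pop() if self.second else self.BLANK
            self.first.append(left)
        else:
            raise ValueError(f"unknown direction: {direction}")

    def contents(self):
        return "".join(self.second) + "".join(reversed(self.first))

tape = TwoStackTape("abc")
tape.step("x", "R")   # write x over a, head moves to b
tape.step("y", "L")   # write y over b, head moves back to x
tape.step("z", "N")   # write z over x, head stays
print(tape.contents(), tape.read())
```

Note that the left move is the only case that needs two stack operations in sequence, which mirrors why case (iii) introduces an intermediate state.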
A weight- and yield-preserving bijection. Given the construction outlined above, we now show the existence of a weight- and yield-preserving bijection between the paths of the two models. While we are only interested in halting (accepting) paths, i.e., paths going from $q_\iota$ to $q_\varphi$ with a non-zero weight,$^{27}$ we prove a stronger result: there exists a weight- and yield-preserving bijection between all $Q$-subpaths, a notion we define below.
Definition D.1. A $Q$-subpath in $P$ is a subpath whose last state corresponds to a state that also exists in the original $M$. Recall that $M$'s states are constructed to be a subset of those in $P$.
($\Rightarrow$) We define a weight- and yield-preserving mapping $\psi_1$ from the subpaths $\pi_M$ in the original $M$ to the $Q$-subpaths $\pi_P$ in $P$. Fix an arbitrary subpath $\pi_M$ in $M$. Our proof proceeds by structural induction on the subpath relation: $\sigma_M \prec \pi_M \iff \sigma_M$ is a strict subpath of $\pi_M$. Note that this is a well-founded ordering, making it suitable for induction.
Inductive Hypothesis. For all $\sigma_M \prec \pi_M$, the function $\psi_1$ is weight- and yield-preserving.
Base Case. The function $\psi_1$ is defined to map the empty subpath in $M$ to the empty subpath in $P$ with weight 1 and yield ε. This preserves the weight and yield.
Inductive Case. Let $\pi_M$ be an arbitrary non-empty subpath in $M$. As depicted in Fig. 5, let $\sigma_M$ be the (strict) subpath of $\pi_M$ that omits the last transition $\tau_M$. Then $\sigma_M$ is mapped as follows: $\sigma_M \mapsto \psi_1(\sigma_M)$. This mapping exists and is weight- and yield-preserving by the inductive hypothesis. Now, we seek to extend the function $\psi_1$ to $\pi_M = \sigma_M \bullet \tau_M$. By our construction of $P$, each of the transitions in $\pi_M$ is simulated either by a single transition in $P$ (case (i) and case (ii)) or by two consecutive transitions in $P$ (case (iii)). In the two-transition case, the sequence of two actions is unique because the second action has a transition-dependent state name, e.g., $q_{\tau_M}$. Thus, we can extend $\psi_1$ by appending to $\psi_1(\sigma_M)$ the transitions of $P$ that simulate $\tau_M$, depending on whether $\tau_M$ corresponds to one (Eq. (13a)) or two transitions (Eq. (13b)) in $P$. Note that the extension $\psi_1$ preserves the weight and yield of $\pi_M$. This is clear by inspecting the construction, as exactly one transition inherits $\tau_M$'s weight and yield. In the two-transition case, the second transition is given weight 1 and the yield ε.
Figure 5: Illustration of how $\psi_1$ maps a path in the original QPTM (top half) to a path in the 2PDA (bottom half). Note that those states with a prime do not correspond to states in the original QPTM.
($\Leftarrow$) Next, we define a weight- and yield-preserving mapping $\psi_2$ from the subpaths $\pi_P$ in the constructed 2PDA to the subpaths $\pi_M$ in $M$. Fix an arbitrary $Q$-subpath $\pi_P$ in $P$. Our proof proceeds by structural induction on the subpath relation: $\sigma_P \prec \pi_P \iff \sigma_P$ is a strict $Q$-subpath of $\pi_P$. Note that this is a well-founded ordering, making it suitable for induction.
Inductive Hypothesis. For all $\sigma_P \prec \pi_P$, the function $\psi_2$ is weight- and yield-preserving.
Base Case. The function $\psi_2$ is defined to map the empty subpath in $P$ to the empty subpath in $M$ with weight 1 and yield ε. This preserves the weight and yield.
Inductive Case. Let $\pi_P$ be an arbitrary non-empty $Q$-subpath in $P$. By construction, this path can be decomposed into a strict $Q$-subpath $\sigma_P \prec \pi_P$ and either one or two additional transitions as defined in (i)-(iii), as shown in Fig. 6. In the single-transition case, post-pended to $\sigma_P$ is a single transition of $P$ that simulates a transition in $M$ (case (i) or case (ii)). In the two-transition case, post-pended to $\sigma_P$ is a pair of consecutive transitions of $P$ that together simulate a single transition in $M$ (case (iii)), shown at the top of Fig. 6. In the first case, $\psi_2$ maps that action in $\pi_P$ to its (unique) corresponding action in $M$. In the latter case, note that due to the uniqueness of the added named state $q'_\tau$ for the transition of type (iii) in $M$, all non-zero-weighted paths containing $q'_\tau$ in $P$ will contain both transitions in $P$ defined in (iii), consecutively. Such pairs of transitions have a corresponding transition in the original QPTM, namely the transition for which they were added. Thus, we can extend $\psi_2$ by appending to $\psi_2(\sigma_P)$ that transition $\tau_M$, depending on whether $\tau_M$ corresponds to one (Eq. (14a)) or two transitions (Eq. (14b)) in $P$. Note that the extension $\psi_2$ preserves the weight and yield. This is clear by inspecting the construction, as exactly one transition inherits $\tau_M$'s weight and yield. In the two-transition case, the second transition is given weight 1 and the yield ε.
Figure 6: Illustration of how $\psi_2$ maps a path in the 2PDA (top half) to a path in the original QPTM (bottom half). Note that those states with a prime do not correspond to states in the original QPTM.
Wrapping up. Thus, we have defined a pair ($\psi_1$ and $\psi_2$) of weight-preserving, yield-preserving total functions that map arbitrary $Q$-subpaths$^{28}$ in $M$ to paths in $P$ and vice versa. It is easy to see that the two maps are inverses of each other; $\psi_2$ undoes the operations of $\psi_1$, and vice versa. This means that $\psi_1$ is a bijection. Finally, because all halting paths in $M$ are $Q$-subpaths, we conclude that $M$ and $P$ are strongly equivalent.
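The role of the two maps can be illustrated with a toy encoding of paths. This is a sketch under assumptions of our own, not the formal construction: a transition of $M$ is represented as a hypothetical tuple `(name, weight, symbol, direction)`, a transition of $P$ as `(name, part, weight, symbol)`, and the helper state of case (iii) is represented by `part == 1`.

```python
from fractions import Fraction
from math import prod

def psi1(path_m):
    """Map a path of M to a path of P, expanding left moves into two steps."""
    path_p = []
    for name, weight, symbol, direction in path_m:
        path_p.append((name, 0, weight, symbol))       # inherits weight and yield
        if direction == "L":                           # case (iii): helper step
            path_p.append((name, 1, Fraction(1), ""))  # weight 1, yield epsilon
    return path_p

def psi2(path_p, directions):
    """Inverse map: drop the weight-1 helper steps and restore M's transitions."""
    return [(name, weight, symbol, directions[name])
            for name, part, weight, symbol in path_p if part == 0]

def path_weight(path, idx):
    return prod((t[idx] for t in path), start=Fraction(1))

def path_yield(path, idx):
    return "".join(t[idx] for t in path)

path_m = [("t1", Fraction(1, 2), "a", "R"),
          ("t2", Fraction(1, 3), "b", "L"),
          ("t3", Fraction(3, 4), "", "N")]
path_p = psi1(path_m)
directions = {name: d for name, _, _, d in path_m}

print(path_weight(path_m, 1), path_weight(path_p, 2))  # equal weights
print(path_yield(path_m, 2), path_yield(path_p, 3))    # equal yields
print(psi2(path_p, directions) == path_m)              # the maps are inverses
```

Because the helper steps carry weight 1 and yield ε, removing or inserting them changes neither the product of weights nor the concatenated output.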
($\Longleftarrow$) To prove the backward direction, we want to show that any 2PDA $P$ has a strongly equivalent QPTM $M$. We proceed analogously: Given a 2PDA $P$, we construct a candidate QPTM $M$ and then sketch a path-level weight- and yield-preserving bijection, again in the form of two injective functions that are inverses of one another. This proves strong equivalence.
Construction of $M$. Let $P = (Q, \Sigma, \Gamma, \delta, q_\iota, q_\varphi)$ be an arbitrary probabilistic 2PDA. We define the QPTM $M = (Q_M, \Sigma, \Gamma, \delta_M, q_\iota, q_\varphi)$ to have the same alphabets $\Sigma, \Gamma$ as $P$. Furthermore, we let $M$ have a superset of the states of $P$, that is, we let $Q_M = Q \cup Q'$, where $Q'$ are some additional states. We define the transitions of $M$ by enumerating and distinguishing between all the possible transition types in $P$:
(i) Transitions that do not pop or push any symbols, i.e., $q \xrightarrow[\varepsilon \to \varepsilon]{y, \gamma, \varepsilon \to \varepsilon} q'$. For such transitions, we define the equally weighted stay-in-place operation ($N$) of the form $(q, \gamma) \xrightarrow{y/N} (q', \gamma)$.
(ii) Transitions that pop a symbol and push a symbol to the same stack, i.e., (a) $q \xrightarrow[\varepsilon \to \varepsilon]{y, \gamma_1, \gamma_1 \to \gamma_3} q'$ with $\gamma_1, \gamma_3 \neq \varepsilon$ and (b) $q \xrightarrow[\gamma_2 \to \gamma_4]{y, \gamma, \varepsilon \to \varepsilon} q'$ with $\gamma_2, \gamma_4 \neq \varepsilon$. Each transition of type (a) defines a single QPTM transition $(q, \gamma_1) \xrightarrow{y/N} (q', \gamma_3)$ with the same weight as the original transition. Each transition of type (b) defines two consecutive QPTM transitions, starting from $(q, \gamma)$ and writing $\gamma_4$, that pass through a new state $q_\tau$ unique to the given transition in $P$. The weight of the first of the two new transitions in $M$ equals the weight of the original transition in $P$, and the second one has weight 1 (thus being the only possible continuation from state $q_\tau$).
(iii) Transitions that pop from one stack and push to the other stack, i.e., (a) $q \xrightarrow[\varepsilon \to \gamma_4]{y, \gamma_1, \gamma_1 \to \varepsilon} q'$ with $\gamma_1, \gamma_4 \neq \varepsilon$ and (b) $q \xrightarrow[\gamma_2 \to \varepsilon]{y, \gamma, \varepsilon \to \gamma_3} q'$ with $\gamma_2, \gamma_3 \neq \varepsilon$. A transition of type (a) defines a transition in a QPTM that moves the head to the right (weighted equally to the original transition): $(q, \gamma_1) \xrightarrow{y/R} (q', \gamma_4)$. A transition of type (b) defines a QPTM transition moving to the left followed by a stay-in-place operation that changes the symbol below the head: $(q, \gamma) \xrightarrow{y/L} (q_\tau, \gamma)$, $(q_\tau, \gamma_2) \xrightarrow{\varepsilon/N} (q', \gamma_3)$. Again, the weight of the first new transition in $M$ is chosen to be equal to the weight of the original transition in $P$, and the latter has weight 1.
(iv) Transitions that push a symbol without popping one. These can be thought of as insertions that move all the symbols on the right side of the head to the right. Thus, they can be simulated by sequences of actions by the QPTM. For instance, a transition from $q$ to $q'$ that pushes a symbol $\gamma_3 \neq \varepsilon$ without popping one can be simulated in $M$ as follows: 1.) Change the current symbol under the head ($\gamma$) to a marker symbol not in the alphabet, e.g., ↓; 2.) Go to the end of the string and shift all the characters up to the marker one by one to the right, until back at the original position.
Therefore, the sequence of such transitions is added to $M$ for any transition of this form in the 2PDA. We preserve the weight of the original transition by defining the weight of the first transition in step 1.) to be the weight of the original transition in $P$ and the weights of all following transitions in steps 2.) and 3.) to be 1.
(v) Transitions that pop from one or both stacks without pushing equally many symbols. These can be thought of as deletions that remove symbols from $M$'s tape and move all the symbols on the right of the head to the left. For instance, a transition $q \xrightarrow[\varepsilon \to \varepsilon]{y, \gamma, \gamma_1 \to \varepsilon} q'$ where $\gamma_1 \neq \varepsilon$ can be modeled using the strategy from (iv), where step 2.) changes to shifting all the symbols to the left one by one until reaching the end of the string, then moving back to the marker. We define the sequence of such transitions in $M$ for any transition of this form in the 2PDA. Again, the first transition has the same weight as the original transition in $P$, and all following new transitions have weight 1.
(vi) Transitions that pop symbols from both stacks and push to both stacks, i.e., $q \xrightarrow[\gamma_2 \to \gamma_4]{y, \gamma_1, \gamma_1 \to \gamma_3} q'$ where $\gamma_1, \gamma_2, \gamma_3, \gamma_4 \neq \varepsilon$. Such transitions can be regarded as a composition of the two cases of (ii), performing first the simulation from (b) and then the one from (a). The chaining can be done by adding another intermediate state, $q_{\tau'}$. Such transitions in the 2PDA therefore define a sequence of transitions in the QPTM that starts at $(q, \gamma)$ and ends at $(q', \gamma_3)$. The weights are again chosen such that the weight of the first one corresponds to that of the original transition in $P$, and all following ones have weight 1.$^{29}$
A weight- and yield-preserving bijection. With the construction above, we again show that there is a weight- and yield-preserving bijection between halting paths of $M$ and $P$. The reasoning is analogous to the one in the forward direction. While we are only interested in halting (accepting) paths, i.e., paths going from $q_\iota$ to $q_\varphi$ with a non-zero weight, we again prove a stronger result: there exists a weight- and yield-preserving bijection between all $Q$-subpaths.
Definition D.2. A $Q$-subpath in $M$ is a subpath whose last state corresponds to a state that also exists in the original 2PDA $P$. Note that $P$'s states are constructed to be a subset of those in $M$.
($\Rightarrow$) We define a weight- and yield-preserving mapping $\psi_3$ from the subpaths $\pi_P$ in the original $P$ to the $Q$-subpaths $\pi_M$ in the constructed $M$. Fix an arbitrary subpath $\pi_P$ in $P$. Our proof proceeds by structural induction on the subpath relation: $\sigma_P \prec \pi_P \iff \sigma_P$ is a strict subpath of $\pi_P$. Note that this is a well-founded ordering, making it suitable for induction.
Inductive Hypothesis. For all $\sigma_P \prec \pi_P$, the function $\psi_3$ is weight- and yield-preserving.
Base Case. The function $\psi_3$ is defined to map the empty subpath to itself with weight 1 and yield ε. This preserves the weight and yield.
Figure 7: Illustration of how $\psi_3$ maps a path in the original 2PDA (top half) to a path in the QPTM (bottom half). Note that those states with a prime do not correspond to states in the original 2PDA.
Inductive Case. Let $\pi_P$ be an arbitrary non-empty subpath in $P$. As depicted in Fig. 7, let $\sigma_P$ be the (strict) subpath of $\pi_P$ that omits the last transition $\tau_P$. Then, $\sigma_P$ is mapped as follows: $\sigma_P \mapsto \psi_3(\sigma_P)$. This mapping exists and is weight- and yield-preserving by the inductive hypothesis. Now, we seek to extend the function $\psi_3$ to $\pi_P = \sigma_P \bullet \tau_P$. By our construction of $M$, each of the transitions in $\pi_P$ is simulated by a number of transitions in $M$. Importantly, the transitions corresponding to any transition $\tau_P$ in $P$ are unique to the particular $\tau_P$ due to the $P$-transition-dependent naming of the added states $Q'$ in $M$. Thus, we can extend $\psi_3$ in the following manner:
$$\psi_3(\pi_P) = \psi_3(\sigma_P) \bullet \tau_{M,1} \bullet \cdots \bullet \tau_{M,L_{\tau_P}},$$
where $\tau_{M,1}, \ldots, \tau_{M,L_{\tau_P}}$ are the $L_{\tau_P}$ transitions in $M$ that the transition $\tau_P$ corresponds to.$^{30}$ Note that the extension $\psi_3$ preserves the weight and yield of $\pi_P$. This is clear by inspecting the construction, as exactly one transition inherits $\tau_P$'s weight and yield, while the remaining transitions have weight 1 and the yield ε.
($\Leftarrow$) Next, we define a weight- and yield-preserving mapping $\psi_4$ from the subpaths $\pi_M$ in the constructed QPTM to the subpaths $\pi_P$ in $P$. Fix an arbitrary $Q$-subpath $\pi_M$ in $M$. Our proof proceeds by structural induction on the subpath relation: $\sigma_M \prec \pi_M \iff \sigma_M$ is a strict $Q$-subpath of $\pi_M$. Note that this is a well-founded ordering, making it suitable for induction.
Inductive Hypothesis. For all $\sigma_M \prec \pi_M$, the function $\psi_4$ is weight- and yield-preserving.
Base Case. The function $\psi_4$ is defined to map the empty subpath in $M$ to the empty subpath in $P$ with weight 1 and yield ε. This preserves the weight and yield.
Inductive Case. Let $\pi_M$ be an arbitrary non-empty $Q$-subpath in $M$. By construction, this path can be decomposed into a strict $Q$-subpath $\sigma_M \prec \pi_M$ and one or more additional transitions as defined in (i)-(vi), as shown in Fig. 8. In the single-transition case, post-pended to $\sigma_M$ is a single transition in $M$ (e.g., type (iia) or type (iiia)). In the multi-transition case, post-pended to $\sigma_M$ is a sequence of consecutive transitions of $M$ that together simulate a single transition in $P$, e.g., the two consecutive new transitions added for a $P$-transition of type (iib), as shown in the example at the top of Fig. 8. In the first case, $\psi_4$ maps that action in $\pi_M$ to its (unique) corresponding action in $P$. In the case of multiple transitions, note that due to the uniqueness of the added named states in $M$ ($q'_\tau$ for the transition of type (iib) in the example), all non-zero-weighted paths containing such transition-specific additional states in $M$ will contain all the transitions in $M$ that belong to the corresponding sequence, consecutively. Such sequences of transitions have a corresponding transition in the original $P$, namely the transition for which they were added. Thus, we can extend $\psi_4$ by appending to $\psi_4(\sigma_M)$ that transition $\tau_P$, depending on whether $\tau_P$ corresponds to one (Eq. (16a)) or multiple transitions (Eq. (16b)) in $M$. Here again, $\tau_{M,1}, \ldots, \tau_{M,L_{\tau_P}}$ are the $L_{\tau_P}$ transitions in $M$ that the transition $\tau_P$ corresponds to. Note that the extension $\psi_4$ preserves the weight and yield. This is clear by inspecting the construction, as exactly one transition inherits $\tau_P$'s weight and yield. In the multi-transition case, the remaining transitions are given weight 1 and the yield ε.
Figure 8: Illustration of how $\psi_4$ maps a path in the constructed QPTM (top half) to a path in the original 2PDA (bottom half). Note that those states with a prime do not correspond to states in the original 2PDA.
Wrapping up. Thus, we have defined a pair ($\psi_3$ and $\psi_4$) of weight-preserving, yield-preserving total functions that map arbitrary $Q$-subpaths$^{31}$ in $P$ to paths in $M$ and vice versa. It is easy to see that the two maps are inverses of each other; $\psi_3$ undoes the operations of $\psi_4$, and vice versa. This means that $\psi_3$ is a bijection. Finally, because all halting paths in $P$ are $Q$-subpaths, we conclude that $P$ and $M$ are strongly equivalent. We therefore conclude that the classes of 2PDA and QPTM are strongly equivalent. ■

E Proof of Thm. 3.2
The following theorem formalizes the informal claim made by Thm. 3.2, which says that every Σ-deterministic probabilistic 2PDA admits a strongly equivalent εRLM, establishing a lower bound on the expressivity of εRLMs. Due to the Σ-determinism, the proof is a simple extension of the weighted extension to Minsky's construction recently detailed in Svete and Cotterell (2023): it uses the correspondence between the paths in the 2PDA and the εRLM produced by Chung and Siegelmann's (2021) construction to define a natural weighting of the paths, resulting in a weight- and yield-preserving mapping.
Theorem E.1. Let $p$ be a language model defined by the Σ-deterministic probabilistic 2PDA $P$. Let $R$ be an RNN over $\Sigma_\varepsilon$ as defined by Chung and Siegelmann's (2021) construction that furthermore defines

Note that the definition above is more restrictive than that of a general PTM, but a superset of the class of deterministic Turing machines, Ms, which can be thought of as unweighted PTMs with the restriction that both transition functions are identical. It is a well-known result that Ms are computationally equivalent to PTMs, that is, they can recognize the same (unweighted) languages. However, in contrast to Ms, deterministic PTMs as defined above can still express a semimeasure over strings, albeit a trivial one (each string $y$ that can be generated has a probability of $2^{-|y|}$ of being generated).
Proposition F.1. A deterministic PTM can only express distributions where each finite string has a binary probability, that is, a probability of the form $2^{-n}$.
Proof. Let $y \in \Sigma^*$ be a string. Because the PTM is deterministic, there is a unique path in the PTM that accepts $y$. Let $n$ be the length of the accepting path. Then, $P_M(y) = 2^{-n}$. ■
Definition F.2. A PTM is real-time if it has no ε-transitions.
Definition F.3. A dyadic rational is a rational number whose numerator can be any integer and whose denominator is a power of 2. We denote the set of dyadic rationals with $\mathbb{Z}[2^{-n}]$.$^{32}$
Proposition F.2. A real-time PTM can express only dyadic measures, that is, the measure of each string is a dyadic rational.
Proof. Let $y \in \Sigma^*$ be a string, and let $n = |y|$. Because the PTM is real-time, every path accepting $y$ has length $n$. Moreover, since the PTM is real-time, it may only have a finite number of accepting paths for any string. If the PTM has $k$ accepting paths for the string $y$, then $P_M(y) = k 2^{-n}$, which is a dyadic rational. ■
A note on the language expressivity of real-time Turing machines. While deterministic and non-deterministic Turing machines can recognize or generate the same languages, it has been shown that all real-time languages are context-sensitive (Burkhard and Varaiya, 1971). In fact, there are real-time definable languages that are context-sensitive and not context-free. On the other hand, there are also context-free languages that are not real-time definable (Rosenberg, 1967). Real-time Turing machines with just one tape have been shown to only recognize regular languages (Tadaki et al., 2010). However, with just one more tape, the computational power increases dramatically (Rabin, 1963), allowing recognition of languages that are non-context-free (Rosenberg, 1967).
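Returning to Prop. F.2, the counting argument in its proof can be checked on a small example. The sketch below assumes a hypothetical real-time PTM with two transition functions chosen by a fair coin, where every transition emits a symbol; the table `DELTA`, the state names, and the convention that a path accepts when it ends in the state `"F"` are all illustrative choices of ours rather than the formal definition.

```python
from fractions import Fraction
from itertools import product

# Hypothetical real-time PTM: two transition functions, chosen by a fair coin.
# Each maps a state to (emitted symbol, next state); there are no epsilon moves.
DELTA = (
    {"q0": ("a", "q0"), "q1": ("a", "F"), "q2": ("a", "F")},   # coin = 0
    {"q0": ("a", "q1"), "q1": ("a", "q2"), "q2": ("a", "F")},  # coin = 1
)

def string_probability(y, start="q0", final="F"):
    """Return k * 2^-n, where k counts the accepting coin sequences of length n = |y|."""
    n = len(y)
    accepting = 0
    for coins in product((0, 1), repeat=n):
        state, ok = start, True
        for coin, symbol in zip(coins, y):
            if state not in DELTA[coin]:   # the final state has no outgoing moves
                ok = False
                break
            emitted, state = DELTA[coin][state]
            if emitted != symbol:
                ok = False
                break
        if ok and state == final:
            accepting += 1
    return Fraction(accepting, 2 ** n)

for y in ("a", "aa", "aaa"):
    print(y, string_probability(y))
```

Every probability produced this way has a power of two as its denominator, in line with the proposition, because each of the equally likely coin sequences of length $n$ contributes either $2^{-n}$ or nothing.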
We now turn to the investigation of rationally weighted PTMs, i.e., QPTMs. First, recall that an unrestricted QPTM is weakly equivalent to an unrestricted PTM (Prop. 3.1). Furthermore, Icard (2020, Thm. 3) showed that PTMs define exactly the enumerable semimeasures (see Appendix G for details). This means that any enumerable real-valued semimeasure over strings can be expressed by a PTM and, hence, a QPTM.
Proposition F.3. Real-time deterministic QPTMs are strictly less expressive than general QPTMs.
Proof. By Thm. G.1, PTMs, and hence QPTMs, can express real-valued semimeasures over strings. For instance, there exists a QPTM that generates the language $\Sigma^*$ for the one-symbol alphabet $\Sigma = \{a\}$ such that the probability of a string of a certain length is given by a Poisson measure: $p(a^k) = \mathrm{Pois}(\lambda, k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for some $\lambda \in \mathbb{R}_+$. Let us choose, e.g., $\lambda = 1$. Then the probability of each string in $\Sigma^*$ is an irrational number. However, an RD-QPTM has to output a symbol at each time step with a rational probability, and hence, there exists no RD-QPTM that can express the above language. ■
Note that similar arguments can be made to show that determinism and real-time operation are each, on their own, enough to restrict the distributions of such QPTMs to the rationals.
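The contrast used in the proof of Prop. F.3 can be illustrated numerically. The sketch below is only an illustration and does not prove irrationality: it assumes a toy real-time generator over $\Sigma = \{a\}$ that, after $t$ symbols, halts with a rational probability given by a hypothetical function `halt_prob`. The point is that any string probability it produces is a finite product of rationals and therefore rational, whereas the Poisson probabilities with $\lambda = 1$ equal $e^{-1}/k!$.

```python
from fractions import Fraction
from math import exp, factorial

def rational_machine_probability(k, halt_prob):
    """p(a^k) when the machine halts after t symbols with probability halt_prob(t)."""
    p = Fraction(1)
    for t in range(k):
        p *= 1 - halt_prob(t)   # emit one more 'a'
    return p * halt_prob(k)     # halt after exactly k symbols

def poisson(k, lam=1.0):
    """Poisson probability of the string a^k (floating-point approximation)."""
    return lam ** k * exp(-lam) / factorial(k)

def half(t):
    """Hypothetical halting weight: 1/2 at every step (a geometric distribution)."""
    return Fraction(1, 2)

for k in range(4):
    print(k, rational_machine_probability(k, half), poisson(k))
```

The first column of probabilities stays an exact `Fraction` for any choice of rational halting weights, which is the property the proof exploits.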

Figure 1: Roadmap through the paper showing relations between different models of computation. A PTM is a reformulation of the classic probabilistic Turing machine. A QPTM is a PTM with multiple rationally weighted transition functions. A 2PDA is a probabilistic two-stack pushdown automaton. A Σ-2PDA is a 2PDA that is deterministic in its output alphabet. An RLM is a simple RNN LM. An εRLM is an RLM augmented with an empty output symbol (ε). The prefix "RD-" denotes deterministic real-time machines.

Figure 3: A schematic illustration of how the model from Chung and Siegelmann (2021) stores the information about the configuration of the 2PDA and how it can be used to access the information needed for defining string probabilities. We denote with • the one-hot encoding function of the input arguments. $h_{d'} \ldots h_D$ refer to the rest of the hidden state not directly relevant for determining the configuration of the 2PDA.

Figure 9: A schematic illustration of the different types of Turing machines and 2PDA and their corresponding place in a hierarchy of distributions. The curves differentiate different types of distributions, where $2^{-n}$ means every string has a binary probability, $\mathbb{Z}[2^{-n}]$ refers to the dyadic distributions, and $\mathbb{Q}$ and $\mathbb{R}$ are the rational-valued and real-valued distributions, respectively. The colors of the boxes indicate which restrictions are placed on the automata, from real-time and deterministic, via Σ-deterministic, to unrestricted. Finally, the horizontal line divides formulations of machines that have just two uniformly distributed transition functions from those with finitely many rational-valued transition functions. Note that the Σ-deterministic automata are placed between the rational-valued and the real-valued distributions.