A Fast Algorithm for Computing Prefix Probabilities

Multiple algorithms are known for efficiently calculating the prefix probability of a string under a probabilistic context-free grammar (PCFG). Good algorithms for the problem run in time cubic in the length of the input string. However, some proposed algorithms are suboptimal with respect to the size of the grammar. This paper proposes a new speed-up of Jelinek and Lafferty's (1991) algorithm, which runs in O(N^3 |𝒩|^3 + |𝒩|^4), where N is the input length and |𝒩| is the number of non-terminals in the grammar. In contrast, our speed-up runs in O(N^2 |𝒩|^3 + N^3 |𝒩|^2).


Introduction
Probabilistic context-free grammars (PCFGs) are an important formalism in NLP (Eisenstein, 2019, Chapter 10). One common use of PCFGs is to construct a language model. For instance, PCFGs form the backbone of many neural language models, e.g., recurrent neural network grammars (RNNGs; Dyer et al., 2016; Dyer, 2017; Kim et al., 2019). However, in order to use a PCFG as a language model, one needs to be able to compute prefix probabilities, i.e., the probability that the yield of a derivation starts with the given string. In notation, given a string w = w_1 ··· w_N, we seek the probability p(S ⇒* w ···), where S is the distinguished start symbol of the grammar and ⇒* is the closure over applications of derivation rules of the grammar.¹ Our paper gives a more efficient algorithm for the simultaneous computation of the prefix probabilities of all prefixes of a string w under a PCFG.

¹Specifically, α ⇒* β means that there exists an n ≥ 0 such that α ⇒ ··· ⇒ β in n derivation steps, where ⇒ marks a single derivation step.
The authors are aware of two existing efficient algorithms to compute prefix probabilities under a PCFG. The first is Jelinek and Lafferty's (1991) algorithm, which is derived from CKY (Kasami, 1965; Younger, 1967; Cocke and Schwartz, 1970) and thus requires the grammar to be in Chomsky normal form (CNF). Jelinek-Lafferty runs in O(N^3 |𝒩|^3 + |𝒩|^4) time, where N is the length of the input and |𝒩| is the number of non-terminals of the grammar; when the number of non-terminals is taken into account, this is slower than the O(N^3 |𝒩|^3) required for parsing with CKY.
The second, due to Stolcke (1995), is derived from Earley parsing (Earley, 1970) and can parse arbitrary PCFGs,² with a runtime of O(N^3 |𝒩|^3). Many previous authors have improved the runtime of Earley's algorithm (Graham et al., 1980; Leermakers et al., 1992; Moore, 2000, inter alia), and Opedal et al. (2023) successfully applied these speed-ups to computing prefix probabilities, achieving a runtime of O(N^3 |G|), where |G| is the size of the grammar, that is, the sum of the number of symbols in all production rules.
Our paper provides a more efficient version of Jelinek and Lafferty (1991) for the computation of prefix probabilities under a PCFG in CNF. Specifically, we give an O(N^2 |𝒩|^3 + N^3 |𝒩|^2) time algorithm, which is the fastest attested in the literature for dense grammars in CNF,³ matching the complexity of CKY adapted for dense grammars by Eisner and Blatz (2007).⁴ We provide a full derivation and proof of correctness, as well as an open-source implementation on GitHub. We also briefly discuss how our improved algorithm can be extended to work for semiring-weighted CFGs.

Preliminaries
We start by introducing the necessary background on probabilistic context-free grammars.
²Note that Earley's and, by extension, Stolcke's algorithms also implicitly binarize the grammar during execution by using dotted rules as additional non-terminals.
³A PCFG in CNF is dense if for every X, Y, Z ∈ 𝒩, we have a production rule X → Y Z ∈ R.
⁴Note that there exist approximate parsing algorithms with lower complexity bounds (Cohen et al., 2013). Moreover, there are parsing algorithms that asymptotically run in subcubic time in the input length using fast matrix multiplication (Valiant, 1975; Benedí and Sánchez, 2007). However, they are of limited practical use (Lee, 1997).

Definition 1. A probabilistic context-free grammar (PCFG) is a five-tuple G = (𝒩, Σ, S, R, p), made up of:
• A finite set of non-terminal symbols 𝒩;
• An alphabet of terminal symbols Σ;
• A distinguished start symbol S ∈ 𝒩;
• A finite set of production rules R ⊂ 𝒩 × (𝒩 ∪ Σ)*, where each rule is written as X → α with X ∈ 𝒩 and α ∈ (𝒩 ∪ Σ)*; here, * denotes the Kleene closure;
• A weighting function p : R → [0, 1] assigning each rule r ∈ R a probability such that p is locally normalized, meaning that for all X ∈ 𝒩 that appear on the left-hand side of a rule,

    ∑_{X → α ∈ R} p(X → α) = 1.

Note that not every locally normalized PCFG constitutes a valid distribution over Σ*. Specifically, some may place probability mass on infinite trees (Chi and Geman, 1998). PCFGs that do constitute a valid distribution over Σ* are referred to as tight. Furthermore, if all non-terminals of the grammar can be reached from the start non-terminal via production rules, we say the PCFG is trim.
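To make Definition 1 concrete, the following is a minimal sketch of a locally normalized PCFG in Chomsky normal form (defined next), encoded with plain-Python dictionaries; the rule encoding and the `check_normalized` helper are our own illustration, not the paper's implementation:

```python
# A toy PCFG in CNF: each non-terminal maps to a list of
# (right-hand side, probability) pairs. Binary rules have a 2-tuple of
# non-terminals as the right-hand side; lexical rules have a terminal string.
RULES = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("Det", "N"), 0.7), ("they", 0.3)],
    "VP":  [(("V", "NP"), 0.6), ("sleep", 0.4)],
    "Det": [("the", 1.0)],
    "N":   [("dog", 0.5), ("cat", 0.5)],
    "V":   [("saw", 1.0)],
}

def check_normalized(rules, tol=1e-9):
    """Check local normalization: for each non-terminal X, the rule
    probabilities p(X -> alpha) must sum to one."""
    for lhs, productions in rules.items():
        total = sum(prob for _, prob in productions)
        assert abs(total - 1.0) < tol, f"rules for {lhs} sum to {total}"

check_normalized(RULES)  # passes: every non-terminal's rules sum to 1
```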
Definition 2. A PCFG G = (𝒩, Σ, S, R, p) is in Chomsky normal form (CNF) if each production rule in R is in one of the following forms:

    X → Y Z,    X → a,    S → ε,

where X, Y, Z ∈ 𝒩 are non-terminals, a ∈ Σ is a terminal symbol, and ε is the empty string.
Definition 3. A derivation step α ⇒ β is an application of the binary relation ⇒ ⊆ (𝒩 ∪ Σ)* × (𝒩 ∪ Σ)*, which rewrites the left-most non-terminal in α according to a rule in R, replacing the left-hand side of that rule with its right-hand side, resulting in β.
We represent derivations as trees whose structure corresponds to production rules, where any parent node is the non-terminal on the left-hand side of a rule and its children are the symbols from the right-hand side. The leaves of the tree, when read from left to right, form the yield. Such a tree, when rooted at S, is called a derivation tree. Otherwise, it is called a derivation subtree.

Definition 5. The probability of a derivation tree (or derivation subtree) τ is the product of the probabilities of all its corresponding production rules:

    p(τ) = ∏_{r ∈ τ} p(r).

Definition 6. We define T_X(w_i ··· w_k) as the set of all derivation subtrees τ rooted at X with yield w_i ··· w_k.

Definition 7. Given a PCFG G = (𝒩, Σ, S, R, p), a non-terminal X ∈ 𝒩, and a string w = w_1 ··· w_N ∈ Σ*, the inside probability of X between positions i and k is defined as

    β(i, k | X) = ∑_{τ ∈ T_X(w_i ··· w_k)} p(τ).

That is, it is the sum of the probabilities of all derivation subtrees τ starting at X that have yield w_i ··· w_k.

Definition 8. Given a PCFG G = (𝒩, Σ, S, R, p), a non-terminal X ∈ 𝒩, and a string w = w_1 ··· w_N ∈ Σ*, we define the prefix probability p_π, i.e., the probability of w being a prefix under G, to be:

    p_π(w | X) = ∑_{u ∈ Σ*} p(X ⇒* w u).

In words, p_π is the probability of deriving w with an arbitrary continuation from X, that is, the sum of probabilities of deriving wu from X over all possible suffixes u ∈ Σ*. In the following, we write the prefix probability of deriving prefix w = w_i ··· w_k from X as p_π(i, k | X).

Definition 9. Let G be a PCFG in CNF. Then for non-terminals X, Y, Z ∈ 𝒩, the left-corner expectations E_lc(Y | X) and E_lc(Y Z | X) are, respectively, defined as:

    E_lc(Y | X) = ∑_{k=0}^{∞} p_k(Y | X),    (8)

where p_k(Y | X) denotes the probability that Y is the left-most non-terminal after k derivation steps starting from X, and

    E_lc(Y Z | X) = ∑_{X' ∈ 𝒩} E_lc(X' | X) · p(X' → Y Z).    (9)

[Figure 1: Pseudocode for the CKY algorithm (left, Algorithm 1) and Jelinek-Lafferty (right, Algorithm 2); the pseudocode is not reproduced here.]

Jelinek and Lafferty (1991)
We now give a derivation of the Jelinek-Lafferty algorithm. The first step is to derive an expression for the prefix probability in PCFG terms.
Lemma 1. Given a tight, trim PCFG in CNF and a string w = w_1 ··· w_N, the prefix probability of a substring w_i ··· w_k of w can be defined recursively as follows, for k > i:

    p_π(i, k | X) = ∑_{j=i}^{k−1} ∑_{Y, Z ∈ 𝒩} E_lc(Y Z | X) β(i, j | Y) p_π(j+1, k | Z).    (10)

Proof. A proof of Lemma 1 is given in App. A.
The above formulation of the prefix probability is closely related to that of the inside probability from Baker's (1979) inside-outside algorithm, which can be efficiently computed using CKY; see Algorithm 1.
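Since the pseudocode of Algorithm 1 is not reproduced here, the following is a minimal Python sketch of the CKY-style computation of the inside chart β for a grammar in CNF; the data layout (`binary_rules`, `lexical_rules`) is our own illustrative encoding, not the paper's implementation:

```python
from collections import defaultdict

def inside_chart(tokens, binary_rules, lexical_rules):
    """Compute beta[(i, k, X)] = p(X =>* w_i ... w_k) for a PCFG in CNF.

    binary_rules:  dict mapping (X, Y, Z) -> p(X -> Y Z)
    lexical_rules: dict mapping (X, a)    -> p(X -> a)
    Positions are 1-based and spans are inclusive, matching the text.
    """
    N = len(tokens)
    beta = defaultdict(float)
    # Base case: spans of length 1 are covered by lexical rules X -> w_k.
    for k in range(1, N + 1):
        for (X, a), prob in lexical_rules.items():
            if a == tokens[k - 1]:
                beta[(k, k, X)] += prob
    # Recursive case: combine two adjacent sub-spans with a binary rule.
    for length in range(2, N + 1):          # span length l
        for i in range(1, N - length + 2):  # span start i
            k = i + length - 1              # span end k
            for (X, Y, Z), prob in binary_rules.items():
                for j in range(i, k):       # split point j
                    beta[(i, k, X)] += prob * beta[(i, j, Y)] * beta[(j + 1, k, Z)]
    return beta
```

Iterating over all O(N^2) spans, O(N) split points, and O(|𝒩|^3) binary rules yields the O(N^3 |𝒩|^3) bound of Proposition 1 below.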
Next, the left-corner expectations E_lc as defined by Eq. (8) can be computed efficiently as follows. Let P denote the square matrix of dimension |𝒩|, with rows and columns indexed by the non-terminals 𝒩 (in some fixed order), where the entry at the i-th row and the j-th column corresponds to p(X_i → X_j •), i.e., the probability of deriving X_j on the left corner from X_i in one step:

    P_ij = p(X_i → X_j •) = ∑_{Y ∈ 𝒩} p(X_i → X_j Y).    (11)

We can find the probability of getting to non-terminal X_j on the left corner after k derivation steps starting from X_i by multiplying P with itself k times:

    (P^k)_ij = p_k(X_j | X_i).    (12)

We can hence get the matrix P*, whose entries correspond to deriving X_j from X_i after any number of derivation steps, by summing over all the powers of the matrix P:⁵

    P* = ∑_{k=0}^{∞} P^k = (I − P)^{−1}.    (13)

Note that the entry at the i-th row and j-th column of P* is exactly the left-corner expectation E_lc(X_j | X_i). Finally, we can compute the left-corner expectations E_lc(Y Z | X) using Eq. (9). Similarly, we can compute the base case of the recursive Eq. (10), namely p_π(k, k | X), which is defined as follows.
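As a sketch, the left-corner expectation matrix can be obtained with ordinary linear algebra; here is a small NumPy illustration of Eq. (13), with variable names of our own choosing:

```python
import numpy as np

# Non-terminals in a fixed order; P[i, j] = p(X_i -> X_j .), i.e., the
# total probability of one-step left-corner derivations from X_i to X_j.
nonterminals = ["S", "NP", "VP", "Det", "N", "V"]
P = np.zeros((len(nonterminals), len(nonterminals)))
P[0, 1] = 1.0   # S  -> NP VP with probability 1.0 has left corner NP
P[1, 3] = 0.7   # NP -> Det N with probability 0.7 has left corner Det
P[2, 5] = 0.6   # VP -> V NP  with probability 0.6 has left corner V

# P* = sum_k P^k = (I - P)^{-1}; the geometric series converges because
# the grammar is assumed tight, so the spectral radius of P is below 1.
P_star = np.linalg.inv(np.eye(len(nonterminals)) - P)

# E_lc(X_j | X_i) is simply the (i, j) entry of P*.
print(P_star[0, 3])  # E_lc(Det | S) = 0.7 for this toy grammar
```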
Definition 10. The prefix probability of the token at position k being derived from X is defined as:

    p_π(k, k | X) = ∑_{Y ∈ 𝒩} E_lc(Y | X) · p(Y → w_k).    (14)

We can now combine the quantities derived above to obtain an efficient algorithm for the computation of prefix probabilities p_π(i, k | S). For the full algorithm, see Algorithm 2.
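For illustration, Eq. (14) translates directly into code; this sketch reuses the hypothetical `P_star` and `lexical_rules` encodings from the previous examples:

```python
def prefix_base_case(tokens, nonterminals, P_star, lexical_rules):
    """Compute p_pi[(k, X)] = p_pi(k, k | X) via Eq. (14):
    sum over left corners Y of E_lc(Y | X) * p(Y -> w_k)."""
    index = {X: i for i, X in enumerate(nonterminals)}
    p_pi = {}
    for k in range(1, len(tokens) + 1):
        for X in nonterminals:
            p_pi[(k, X)] = sum(
                P_star[index[X], index[Y]] * prob
                for (Y, a), prob in lexical_rules.items()
                if a == tokens[k - 1]
            )
    return p_pi
```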
Proposition 1. The time complexity of the CKY algorithm as presented in Algorithm 1 is O(N^3 |𝒩|^3).

Proof. Clearly, the computationally critical part is in lines 9-13, where we iterate over all indices of w for i, j, and k, as well as over the whole set of grammar rules, thus taking O(N^3 |R|). In a PCFG in CNF, with the size of the alphabet taken as constant, the number of rules |R| is O(|𝒩|^3), making the overall complexity of CKY O(N^3 |𝒩|^3). ■

Proposition 2. The total time complexity of Jelinek-Lafferty is O(N^3 |𝒩|^3 + |𝒩|^4).

Proof.

1. We begin by pre-computing all the inside probabilities β in line 2 of Algorithm 2, which takes O(N^3 |𝒩|^3) by Proposition 1.

2. Computing P* by Eq. (13) amounts to inverting an |𝒩| × |𝒩| matrix, which takes O(|𝒩|^3).

3. Computing E_lc(Y Z | X) for all X, Y, Z ∈ 𝒩 by Eq. (9) takes O(|𝒩|^4), since for each of the |𝒩|^3 triples we sum over all X' ∈ 𝒩.

4. Computing p_π(k, k | X) for all X ∈ 𝒩 by Eq. (14) in lines 8-10 takes O(N |𝒩|^2), as we iterate over all positions k ≤ N and over all Y ∈ 𝒩 for each X ∈ 𝒩.
5. Finally, computing the p_π chart in lines 11-14 takes O(N^3 |𝒩|^3), since we iterate over all ℓ, i, j ≤ N and X, Y, Z ∈ 𝒩.

Summing these terms yields a total complexity of O(N^3 |𝒩|^3 + |𝒩|^4). ■

Our Speed-up
We now turn to our development of a faster dynamic program to compute all prefix probabilities. The speed-up comes from a different way to factorize p_π(i, k | X), which allows additional memoization. Starting with the definition of the prefix probability in Eq. (15a), we first expand E_lc(Y Z | X) by Eq. (9), as seen in Eq. (15b). Then, we factor out all terms that depend on the left-corner non-terminal Y in Eq. (15c), which we store in a chart γ; see Eq. (15e). We then do the same for all terms depending on X', factoring them out in Eq. (15d) and storing them in another chart δ; see Eq. (15f). Our improved algorithm for computing all prefix probabilities is shown in Algorithm 3.
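The display equations (15a)-(15f) are not reproduced above; the factorization they describe can be written out as follows (a reconstruction consistent with the prose, with the chart definitions of Eqs. (15e) and (15f)):

```latex
\begin{align}
p_\pi(i, k \mid X)
  &= \sum_{j=i}^{k-1} \sum_{Y,Z \in \mathcal{N}} E_{\mathrm{lc}}(Y Z \mid X)\,
     \beta(i, j \mid Y)\, p_\pi(j{+}1, k \mid Z) \tag{15a} \\
  &= \sum_{j=i}^{k-1} \sum_{Y,Z \in \mathcal{N}} \sum_{X' \in \mathcal{N}}
     E_{\mathrm{lc}}(X' \mid X)\, p(X' \to Y Z)\,
     \beta(i, j \mid Y)\, p_\pi(j{+}1, k \mid Z) \tag{15b} \\
  &= \sum_{j=i}^{k-1} \sum_{Z \in \mathcal{N}} \sum_{X' \in \mathcal{N}}
     E_{\mathrm{lc}}(X' \mid X)\, \gamma(i, j \mid X', Z)\,
     p_\pi(j{+}1, k \mid Z) \tag{15c} \\
  &= \sum_{j=i}^{k-1} \sum_{Z \in \mathcal{N}}
     \delta(i, j \mid X, Z)\, p_\pi(j{+}1, k \mid Z) \tag{15d}
\end{align}
where
\begin{align}
\gamma(i, j \mid X', Z) &\overset{\text{def}}{=} \sum_{Y \in \mathcal{N}}
  p(X' \to Y Z)\, \beta(i, j \mid Y) \tag{15e} \\
\delta(i, j \mid X, Z) &\overset{\text{def}}{=} \sum_{X' \in \mathcal{N}}
  E_{\mathrm{lc}}(X' \mid X)\, \gamma(i, j \mid X', Z) \tag{15f}
\end{align}
```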
3. Pre-computing γ and δ in lines 5-9 takes O(N^2 |𝒩|^3), as we sum over |𝒩| non-terminals for each entry, and both charts have two dimensions ranging over positions in w and two ranging over 𝒩.
4. The loops computing p_π in lines 13-17 take O(N^3 |𝒩|^2), as we are now iterating only over X, Z ∈ 𝒩 and ℓ, i, j ≤ N.
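Under the same hypothetical encodings as the earlier sketches, the memoized recursion of Eqs. (15a)-(15f) might look as follows; this is an illustration of the factorization, not the paper's reference implementation:

```python
from collections import defaultdict

def prefix_probabilities(tokens, nonterminals, binary_rules, P_star,
                         beta, p_pi_base, index):
    """Compute p_pi[(i, k, X)] using the gamma/delta factorization.

    binary_rules: dict (X, Y, Z) -> p(X -> Y Z)
    beta:         inside chart (defaultdict), beta[(i, j, Y)]
    p_pi_base:    base-case chart, p_pi_base[(k, X)] = p_pi(k, k | X)
    index:        non-terminal -> row index into P_star
    """
    N = len(tokens)
    gamma = defaultdict(float)  # gamma[(i, j, Xp, Z)], Eq. (15e)
    delta = defaultdict(float)  # delta[(i, j, X, Z)],  Eq. (15f)
    for i in range(1, N + 1):
        for j in range(i, N + 1):
            for (Xp, Y, Z), prob in binary_rules.items():
                gamma[(i, j, Xp, Z)] += prob * beta[(i, j, Y)]
            for X in nonterminals:
                for Xp in nonterminals:
                    for Z in nonterminals:
                        delta[(i, j, X, Z)] += (
                            P_star[index[X], index[Xp]] * gamma[(i, j, Xp, Z)]
                        )
    p_pi = defaultdict(float)
    for k in range(1, N + 1):  # base case: single-token spans, Eq. (14)
        for X in nonterminals:
            p_pi[(k, k, X)] = p_pi_base[(k, X)]
    for length in range(2, N + 1):  # Eq. (15d): only X and Z remain in the loop
        for i in range(1, N - length + 2):
            k = i + length - 1
            for X in nonterminals:
                p_pi[(i, k, X)] = sum(
                    delta[(i, j, X, Z)] * p_pi[(j + 1, k, Z)]
                    for j in range(i, k)
                    for Z in nonterminals
                )
    return p_pi
```

Filling γ and δ touches O(N^2) span pairs times O(|𝒩|^3) rule or non-terminal triples, and the final loops run over O(N^3) index triples times O(|𝒩|^2) non-terminal pairs, matching the stated bounds.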

Generalization to Semirings
It turns out that Jelinek-Lafferty and, by extension, our improved algorithm can be generalized to work for semiring-weighted CFGs, with the same time complexity, under the condition that the weights are locally normalized and the semiring has a well-defined Kleene closure. This follows from the fact that the only operations used by the algorithm are addition and multiplication if we use Lehmann's (1977) algorithm for the computation of the left-corner expectations E_lc. The definitions, derivation, and proof of this statement can be found in App. B.
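As a sketch of what the generalization requires, a semiring can be represented by its operations and identity elements; the `Semiring` container and the two instances below are our own illustration (any complete semiring with a Kleene star would do):

```python
import math
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Semiring:
    """A semiring <A, plus, times, zero, one> with a Kleene star."""
    plus: Callable[[float, float], float]
    times: Callable[[float, float], float]
    zero: float
    one: float
    star: Callable[[float], float]  # a* = one (+) a (x) a*, assumed to exist

# The real (probability) semiring: a* is the geometric series 1/(1 - a).
REAL = Semiring(
    plus=lambda a, b: a + b,
    times=lambda a, b: a * b,
    zero=0.0,
    one=1.0,
    star=lambda a: 1.0 / (1.0 - a),
)

# The log semiring, useful for numerical stability: plus is log-add-exp,
# times is addition, and a* = -log(1 - exp(a)) mirrors the geometric series.
LOG = Semiring(
    plus=lambda a, b: (max(a, b) + math.log1p(math.exp(-abs(a - b)))
                       if a != float("-inf") and b != float("-inf")
                       else max(a, b)),
    times=lambda a, b: a + b,
    zero=float("-inf"),
    one=0.0,
    star=lambda a: -math.log1p(-math.exp(a)),
)
```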

Conclusion
In this paper, we have shown how to efficiently compute prefix probabilities for PCFGs in CNF, adapting Jelinek-Lafferty to use additional memoization, thereby reducing the time complexity from O(N^3 |𝒩|^3 + |𝒩|^4) to O(N^2 |𝒩|^3 + N^3 |𝒩|^2). We thereby addressed one of the main limitations of the original formulation, namely that it is slow for large grammars.
While we have improved the asymptotic running time of a classic algorithm with regard to grammar size, the time complexity of our algorithm is still cubic in the length of the input. Our result follows the tradition of dynamic programming algorithms that trade space for time by memoizing and reusing pre-computed intermediate results. The usefulness of this trade-off in practice depends on the specifics of the grammar: while the complexity is strictly better in terms of non-terminals, the improvement will be most noticeable for denser grammars with many non-terminals.

Ethics Statement
We do not foresee any ethical issues arising from this work.
A Proof of Lemma 1

Lemma 1. Given a tight, trim PCFG in CNF and a string w = w_1 ··· w_N, the prefix probability of a substring w_i ··· w_k of w can be defined recursively as follows, for k > i:

    p_π(i, k | X) = ∑_{j=i}^{k−1} ∑_{Y, Z ∈ 𝒩} E_lc(Y Z | X) β(i, j | Y) p_π(j+1, k | Z).

Proof. Given that the PCFG is in CNF and k > i, in order to derive the prefix w_i ··· w_k we must first apply some rule X → Y Z, where the first part of the substring is then derived from Y and the remainder (and potentially more) from Z:

    p_π(i, k | X) = ∑_{Y, Z ∈ 𝒩} p(X → Y Z) ( ∑_{j=i}^{k−1} β(i, j | Y) p_π(j+1, k | Z) + p_π(i, k | Y) ),    (16)

where the last term, p_π(i, k | Y), handles the case where the whole prefix is derived from Y alone. This term is itself recursively defined through Eq. (16), so we can substitute Eq. (16) into itself and unfold the recursion. After repeated substitutions ad infinitum, the rule probabilities p(X → Y Z) accumulated along the left spine collapse into the left-corner expectations E_lc(Y Z | X) of Eq. (9). Note that, in the last step, infinite derivations do not carry any probability mass, since we assumed the PCFG to be tight and trim. Hence, the final form of the equation is:

    p_π(i, k | X) = ∑_{j=i}^{k−1} ∑_{Y, Z ∈ 𝒩} E_lc(Y Z | X) β(i, j | Y) p_π(j+1, k | Z). ■

B Extension to semirings
In the following, we give the necessary background on semirings and then show how the algorithms introduced above can be framed in terms of semirings. We start by introducing the necessary definitions and notation.

Definition 11. A monoid is a 3-tuple ⟨A, •, 1⟩ where: (i) A is a non-empty set; (ii) • is a binary operation which is associative, that is, (a • b) • c = a • (b • c) for all a, b, c ∈ A; (iii) 1 is a left and right identity element, that is, 1 • a = a • 1 = a for all a ∈ A.

Definition 12. A semiring is a 5-tuple W = ⟨A, ⊕, ⊗, 0, 1⟩, where: (i) ⟨A, ⊕, 0⟩ is a commutative monoid over A with identity element 0 under the addition operation ⊕; (ii) ⟨A, ⊗, 1⟩ is a monoid over A with identity element 1 under the multiplication operation ⊗; (iii) multiplication is distributive over addition, that is, for all a, b, c ∈ A, a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c) and (b ⊕ c) ⊗ a = (b ⊗ a) ⊕ (c ⊗ a); (iv) 0 is an annihilator for A, that is, for all a ∈ A, 0 ⊗ a = a ⊗ 0 = 0.
Definition 13. A semiring W = ⟨A, ⊕, ⊗, 0, 1⟩ is complete if it is possible to extend the addition operator ⊕ to infinite sums, maintaining the properties of associativity, commutativity, and distributivity from the finite case (Rozenberg and Salomaa, 1997, Chapter 9). In this case, we can define the unary Kleene star operation, denoted by a superscript *, as the infinite sum over the powers of its operand, that is, for all a ∈ A:

    a* = ⊕_{k=0}^{∞} a^k,

where a^0 = 1 and a^{k+1} = a ⊗ a^k. Analogously to Eq. (13), it then follows that:

    a* = 1 ⊕ (a ⊗ a*)

and, similarly:

    a* = 1 ⊕ (a* ⊗ a).

We now discuss how complete semirings can be lifted to matrices. The definitions follow analogously to matrices over the reals.

Definition 14. We define semiring matrix addition as follows. Let A and B be d × d matrices whose entries are elements from a complete semiring W = ⟨A, ⊕, ⊗, 0, 1⟩. Then the sum ("+") of A and B is defined as:

    (A + B)_ij = A_ij ⊕ B_ij.

Definition 15. We define semiring matrix multiplication as follows. Let A and B be d × d matrices whose entries are elements from a complete semiring W = ⟨A, ⊕, ⊗, 0, 1⟩. Then the product ("·") of A and B is defined as:

    (A · B)_ij = ⊕_{k=1}^{d} A_ik ⊗ B_kj.

We also define the zero matrix O over the complete semiring W, such that all entries are 0, and the unit matrix I as (I)_ij = 1 if i = j and 0 otherwise, for all indices i, j ∈ 1, . . . , d. It is then straightforward to show that matrix addition is associative and commutative, while matrix multiplication is associative and distributive over matrix addition. Hence, ⟨W^{d×d}, +, ·, O, I⟩ is a semiring. Furthermore, by the element-wise definition of its addition operation, it is also complete.
We now consider a semiring-weighted CFG G = ⟨𝒩, Σ, S, R, p, W⟩, where 𝒩, Σ, S, R are defined as before, except that the (locally normalized) weighting function p is now semiring-valued:

    p : R → A.

As before, we define the matrix P as the square matrix of dimension |𝒩| whose rows and columns are indexed by the non-terminals 𝒩 in some fixed order, so that the entry P_ij corresponds to

    p(X_i → X_j •) = ⊕_{Y ∈ 𝒩} p(X_i → X_j Y).

We can then calculate the weight of deriving X_j from X_i as the left-most non-terminal after exactly k derivation steps as the entry (P^k)_ij, where the matrix power is taken with respect to semiring matrix multiplication. Note that this holds because the production rule weights are locally normalized, meaning that we only need to consider the left-most rule applications instead of having to explicitly calculate the full tree sum. Finally, to get the left-corner expectations, we then need to calculate the Kleene closure over the matrix P,⁶ that is, we want to find

    P* = ⊕_{k=0}^{∞} P^k.

To compute the Kleene closure over the transition matrix, we can use an efficient algorithm by Lehmann (1977), which is a generalization of the well-known shortest-path algorithm usually attributed to Floyd (1962) and Warshall (1962), but introduced previously by Roy (1959). The algorithm works under the condition that the Kleene closure of all individual matrix entries from the semiring W exists, which is true in our case since we assumed W to be complete. The algorithm is shown in Algorithm 4.
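Since Algorithm 4 is not reproduced here, the following is a minimal sketch of Lehmann's (1977) algorithm over the `Semiring` container introduced above; it computes P* in O(d^3) semiring operations, Floyd-Warshall style:

```python
def lehmann_star(P, sr):
    """Kleene closure P* of a d x d semiring matrix P (list of lists),
    following Lehmann (1977). Requires sr.star to be defined on entries."""
    d = len(P)
    M = [row[:] for row in P]  # M^(0) = P
    for k in range(d):
        s = sr.star(M[k][k])  # closure of the pivot entry (M^(k-1))_kk
        # M^(k)_ij = M^(k-1)_ij (+) M^(k-1)_ik (x) s (x) M^(k-1)_kj
        Mk = [row[:] for row in M]
        for i in range(d):
            for j in range(d):
                Mk[i][j] = sr.plus(
                    M[i][j], sr.times(M[i][k], sr.times(s, M[k][j]))
                )
        M = Mk
    # P* = I + M^(d): add the multiplicative identity on the diagonal.
    for i in range(d):
        M[i][i] = sr.plus(M[i][i], sr.one)
    return M

# Example: over the real semiring, the 1 x 1 case recovers the geometric
# series, since lehmann_star([[0.5]], REAL) yields [[2.0]] = [[1/(1-0.5)]].
```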