Dynamic Programming in Rank Space: Scaling Structured Inference with Low-Rank HMMs and PCFGs

Hidden Markov Models (HMMs) and Probabilistic Context-Free Grammars (PCFGs) are widely used structured models, both of which can be represented as factor graph grammars (FGGs), a powerful formalism capable of describing a wide range of models. Recent research has found it beneficial to use large state spaces for HMMs and PCFGs. However, inference with large state spaces is computationally demanding, especially for PCFGs. To tackle this challenge, we leverage tensor rank decomposition (a.k.a. canonical polyadic decomposition, CPD) to decrease the computational complexity of inference for a subset of FGGs subsuming HMMs and PCFGs. We apply CPD to the factors of an FGG and then construct a new FGG defined in the rank space. Inference with the new FGG produces the same result but has a lower time complexity when the rank size is smaller than the state size. We conduct experiments on HMM language modeling and unsupervised PCFG parsing, showing better performance than previous work. Our code is publicly available at https://github.com/VPeterV/RankSpace-Models.

i.e., all factor graphs generated by $G$. An FGG $G$ assigns a score $w_G(D, \xi)$ to each $D \in \mathcal{D}(G)$ along with each $\xi \in \Xi_D$. A factor graph $D \in \mathcal{D}(G)$ assigns a score $w_D(\xi)$ to each $\xi \in \Xi_D$ as the product of its factor values, and $w_G(D, \xi) = w_D(\xi)$. The inference problem is to compute the sum-product of $G$:
$$Z_G = \sum_{D \in \mathcal{D}(G)} \; \sum_{\xi \in \Xi_D} w_G(D, \xi).$$
($\Xi_X$ is defined as the set of assignments to the endpoints of an edge $e$ labeled $X$, so $\Xi_X = \Omega(\ell_1) \times \cdots \times \Omega(\ell_k)$, where $\mathrm{att}(e) = v_1 \cdots v_k$ and $\mathrm{lab}_V(v_i) = \ell_i$.)
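To make the sum-product concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code or the full FGG machinery): for a tiny chain-structured factor graph, brute-force enumeration over all assignments and stepwise marginalization yield the same $Z$. The factor names and domain size are illustrative assumptions.

```python
# Minimal sketch: sum-product of a tiny chain-structured factor graph,
# computed by brute force and by dynamic programming (marginalizing one
# node at a time). All names and sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
m = 3                                  # domain size of each node
F1 = rng.random((m, m))                # factor over (x1, x2)
F2 = rng.random((m, m))                # factor over (x2, x3)

# Brute force: enumerate every assignment and sum the factor products.
Z_brute = sum(F1[x1, x2] * F2[x2, x3]
              for x1 in range(m) for x2 in range(m) for x3 in range(m))

# Dynamic programming: marginalize x1, then x2, then x3 (matrix products).
Z_dp = F1.sum(axis=0) @ F2 @ np.ones(m)

assert np.isclose(Z_brute, Z_dp)
```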
A right-hand side $R$ consists of nonterminal- and terminal-labeled edges only, and $\tau_R(\xi)$ is given by
$$\tau_R(\xi) = \prod_{\substack{e \in E_R \\ \mathrm{lab}(e) \in N}} \psi_{\mathrm{lab}(e)}\big(\xi(\mathrm{att}(e))\big) \prod_{\substack{e \in E_R \\ \mathrm{lab}(e) \in T}} F_{\mathrm{lab}(e)}\big(\xi(\mathrm{att}(e))\big).$$
This defines a recursive formula for computing $\psi_S$, i.e., $Z_G$. Next, we show how Eq. 3-4 recover the well-known inside algorithm.

where $p$ denotes the FGG rule probability $p(N^1 \to N^2 N^3)$. It is easy to see that $\psi^X_{i,k}$ is exactly the inside score of span $[i,k)$ for nonterminal $X$ in the classic inside algorithm.

[Figure 2: Using CPD to decompose a factor can be seen as adding a new node.]

CPD decomposes an order-$k$ factor tensor $F_e \in \mathbb{R}^{N_1 \times \cdots \times N_k}$ into a sum of rank-one tensors:
$$F_e = \sum_{q=1}^{r} \lambda_q \, w^q_{e_1} \otimes \cdots \otimes w^q_{e_k},$$
where $r$ is the rank size; $w^q_{e_k} \in \mathbb{R}^{N_k}$; $\otimes$ is the outer product; and $\lambda_q$ is a weight, which can be absorbed into $\{w^q_{e_k}\}$ and is omitted throughout the paper. Contracting such a factor then costs time linear in $r$:
$$\sum_{\xi} F_e(\xi) \prod_{i=1}^{k} m_{e_i}(\xi_i) = \sum_{q=1}^{r} \prod_{i=1}^{k} (w^q_{e_i})^\top m_{e_i},$$
where $m_{e_i} \in \mathbb{R}^{N_i}$ is the factor-to-node message. Equivalently, CPD introduces a new node with domain size $r$, connected to each endpoint $v_i$ by a binary factor whose $r \times N_i$ matrix stacks $\{w^q_{e_i}\}$ and is represented as $W_{e_i}$. Fig. 2 illustrates this intuition.
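This equivalence is easy to check numerically. The sketch below (our illustration; the shapes and variable names are assumptions, not the paper's code) materializes an order-3 tensor from its CPD factors and verifies that the naive contraction against the full tensor matches the cheap rank-space computation through the $W_{e_i}$ matrices.

```python
# Minimal sketch: contracting a CPD-factorized order-3 factor with node
# messages in O(r * (N1 + N2 + N3)) instead of O(N1 * N2 * N3).
import numpy as np

rng = np.random.default_rng(0)
r, N1, N2, N3 = 4, 5, 6, 7
W1, W2, W3 = rng.random((r, N1)), rng.random((r, N2)), rng.random((r, N3))

# Materialize the tensor T = sum_q w1^q (outer) w2^q (outer) w3^q.
T = np.einsum('qa,qb,qc->abc', W1, W2, W3)

m1, m2, m3 = rng.random(N1), rng.random(N2), rng.random(N3)

# Naive contraction against the full tensor.
full = np.einsum('abc,a,b,c->', T, m1, m2, m3)

# Rank-space contraction: project each message through W_{e_i} onto the
# new rank node, then sum over its r values.
low_rank = ((W1 @ m1) * (W2 @ m2) * (W3 @ m3)).sum()

assert np.isclose(full, low_rank)
```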

We refer to $R$ as rank nodes and the others as state nodes. Eq. 9 can be derived automatically by combining Eq. 7 (or Fig. 3(a)) with Eq. 5-6.

Cohen et al. (2013) note that $U^\top$ does not depend on the split point and can thus be extracted to the front of the summation (equivalently, viewing the binary rule tensor as a matrix $T \in \mathbb{R}^{m \times m^2}$). With the binary rule probabilities decomposed by CPD into $U, V, W \in \mathbb{R}^{r \times m}$, their accelerated inside algorithm has the following recursive form:
$$\beta_{i,k} = U^\top \sum_{i<j<k} (V \beta_{i,j}) \odot (W \beta_{j,k}),$$
where $\beta_{i,k} \in \mathbb{R}^m$ is the state-space inside score of span $[i,k)$ and $\odot$ is the elementwise product.
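A minimal sketch of one step of this recursion (our paraphrase in NumPy, not Cohen et al.'s implementation; `beta` is assumed to hold the state-space inside vectors of already-completed spans):

```python
# One step of the low-rank inside recursion of Cohen et al. (2013):
# beta_{i,k} = U^T sum_j (V beta_{i,j}) * (W beta_{j,k}).
# Assumed shapes: U, V, W are r x m; beta[i][k] is an m-vector.
import numpy as np

def inside_step(beta, i, k, U, V, W):
    r = U.shape[0]
    acc = np.zeros(r)
    for j in range(i + 1, k):                        # sum over split points
        acc += (V @ beta[i][j]) * (W @ beta[j][k])   # one r-vector per split
    return U.T @ acc  # U^T extracted to the front of the summation
```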

Consider the B-FGG $G$ shown in Fig. 1(b) and replace the rhs of $\pi_6$ with Fig. 3(a), i.e., we use CPD to decompose the binary rule probability tensor.
Besides $U, V, W \in \mathbb{R}^{r \times m}$ defined in Sec. 3, we define the start rule probability vector as $s \in \mathbb{R}^{m \times 1}$ and the unary rule probability matrix as $E \in \mathbb{R}^{o \times m}$, where $o$ is the vocabulary size.

We can easily derive the inference (inside) algorithm of $G$ by following Eq. 3-4 and Fig. 4(b).

Let $\alpha_{i,j} \in \mathbb{R}^r$ denote the rank-space inside score for span $[i, j)$. When $j > i + 2$ (see Fig. 4),
$$\alpha_{i,j} = \sum_{i<k<j} (\tilde{V} \alpha_{i,k}) \odot (\tilde{W} \alpha_{k,j}),$$
where $\tilde{V} = V U^\top$ and $\tilde{W} = W U^\top$ are $r \times r$ matrices, and finally, $Z_G = L \alpha_{0,n}$ with $L = s^\top U^\top$. The resulting inference complexity is $O(n^3 r + n^2 r^2)$, which is lower than the $O(n^3 r + n^2 m r)$ of TD-PCFG when $r < m$, enabling the use of a large state space for PCFGs in the low-rank setting.
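The sketch below (our reconstruction under the assumptions above, not the released code) spells out the full rank-space inside pass. Projecting each span through the $r \times r$ matrices once, rather than once per split point, is what gives the $O(n^3 r + n^2 r^2)$ bound; the base cases for short spans are simplified here, and `alpha_lex` is a hypothetical stand-in for the terminal/unary computation.

```python
# Rank-space inside algorithm (simplified sketch). V_t = V @ U.T and
# W_t = W @ U.T are r x r; L is the length-r final projection vector.
import numpy as np

def rank_space_inside(alpha_lex, V_t, W_t, L):
    """alpha_lex[i]: assumed precomputed rank-space score of the width-1
    span [i, i+1). Returns Z_G."""
    n, r = alpha_lex.shape
    alpha, left, right = {}, {}, {}
    for i in range(n):                        # width-1 base case (simplified)
        a = alpha_lex[i]
        alpha[i, i + 1] = a
        left[i, i + 1], right[i, i + 1] = V_t @ a, W_t @ a
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            acc = np.zeros(r)
            for k in range(i + 1, j):         # O(r) work per split point
                acc += left[i, k] * right[k, j]
            alpha[i, j] = acc
            left[i, j] = V_t @ acc            # O(r^2), once per span
            right[i, j] = W_t @ acc
    return float(L @ alpha[0, n])             # Z_G = L alpha_{0,n}
```

With $m$ in the thousands and $r < m$, the $r^2$ per-span cost replaces the $mr$ cost of projecting back to the state space in TD-PCFG, which is exactly where the improvement in the stated complexity comes from.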

The key difference between the rank-space inside algorithm and that of TD-PCFG is that the former operates entirely in the rank space: the state-space inside scores $\beta_{i,j} = U^\top \alpha_{i,j}$ are never materialized.

We set the rank size to 4096. See Appendices D and E for more details.

Main result. Table 2 shows the perplexity on the PTB validation and test sets. As discussed earlier,