Higher-order Derivatives of Weighted Finite-state Machines

Weighted finite-state machines are a fundamental building block of NLP systems. They have withstood the test of time—from their early use in noisy channel models in the 1990s up to modern-day neurally parameterized conditional random fields. This work examines the computation of higher-order derivatives of the normalization constant of weighted finite-state machines. We provide a general algorithm for evaluating derivatives of all orders, which has not been previously described in the literature. In the case of second-order derivatives, our scheme runs in the optimal O(A^2 N^4) time, where A is the alphabet size and N is the number of states. Our algorithm is significantly faster than prior algorithms. Additionally, our approach leads to a significantly faster algorithm for computing second-order expectations, such as covariance matrices and gradients of first-order expectations.


Introduction
Weighted finite-state machines (WFSMs) have a storied role in NLP. They are a useful formalism for speech recognition (Mohri et al., 2002), machine transliteration (Knight and Graehl, 1998), morphology (Geyken and Hanneforth, 2005; Lindén et al., 2009), and phonology (Cotterell et al., 2015), inter alia. Indeed, WFSMs have been "neuralized" (Rastogi et al., 2016; Hannun et al., 2020; Schwartz et al., 2018) and are still of practical use to the NLP modeler. Moreover, many popular sequence models, e.g., conditional random fields for part-of-speech tagging (Lafferty et al., 2001), are naturally viewed as special cases of WFSMs. For this reason, we consider the study of algorithms for WFSMs to be of interest to the NLP community in its own right.

This paper considers inference algorithms for WFSMs. When WFSMs are acyclic, there exist simple linear-time dynamic programs for inference, e.g., the forward algorithm (Rabiner, 1989). However, in general, WFSMs may contain cycles, and such approaches are not applicable. Our work considers this general case and provides a method for the efficient computation of m-th-order derivatives over a cyclic WFSM. To the best of our knowledge, no algorithm for higher-order derivatives has been presented in the literature beyond a general-purpose method from automatic differentiation. In contrast to many presentations of WFSMs (Mohri, 1997), our work provides a purely linear-algebraic take on them. And, indeed, it is this connection that allows us to develop our general algorithm.
We provide a thorough analysis of the soundness, runtime, and space complexity of our algorithm. In the special case of second-order derivatives, our algorithm runs optimally in O(A^2 N^4) time and space, where A is the size of the alphabet and N is the number of states. In contrast, the second-order expectation semiring of Li and Eisner (2009) provides an O(A^2 N^7) solution, and automatic differentiation (Griewank, 1989) yields a slightly faster O(AN^5 + A^2 N^4) solution. Additionally, we provide a speed-up for the general family of second-order expectations. Indeed, we believe our algorithm is the fastest known for computing common quantities, e.g., a covariance matrix.

Weighted Finite-State Machines
In this section, we briefly provide important notation for WFSMs and a classic result that efficiently finds the normalization constant of the probability distribution of a WFSM.

Definition 1. A WFSM is given by an alphabet $\mathcal{A} = \{1, \dots, A\}$, a transition matrix $W^{(a)} \in \mathbb{R}_{\geq 0}^{N \times N}$ for each symbol $a \in \mathcal{A}$, where $N$ is the number of states, and $\alpha, \omega \in \mathbb{R}_{\geq 0}^{N}$, which are column vectors of start and end weights, respectively. We define the matrix $W \overset{\text{def}}{=} \sum_{a \in \mathcal{A}} W^{(a)}$.
Definition 2. A trajectory $\tau_{i \rightsquigarrow \ell}$ is an ordered sequence of transitions from state $i$ to state $\ell$. Visually, we can represent a trajectory by
$$\tau_{i \rightsquigarrow \ell} = i \xrightarrow{a_1} j_1 \xrightarrow{a_2} \cdots \xrightarrow{a_m} \ell.$$
The weight of a trajectory is
$$w(\tau_{i \rightsquigarrow \ell}) \overset{\text{def}}{=} \prod_{(j \xrightarrow{a} k) \in \tau_{i \rightsquigarrow \ell}} W^{(a)}_{jk}.$$
We denote the (possibly infinite) set of trajectories from $i$ to $\ell$ by $\mathcal{T}_{i \rightsquigarrow \ell}$ and the set of all trajectories by $\mathcal{T}$. Consequently, when we write $\tau_{i \rightsquigarrow \ell} \in \mathcal{T}$, we make $i$ and $\ell$ implicit arguments that specify which $\mathcal{T}_{i \rightsquigarrow \ell}$ we are accessing.
We define the probability of a trajectory $\tau_{i \rightsquigarrow \ell} \in \mathcal{T}$ as
$$p(\tau_{i \rightsquigarrow \ell}) \overset{\text{def}}{=} \frac{\alpha_i \, w(\tau_{i \rightsquigarrow \ell}) \, \omega_\ell}{Z}, \quad \text{where} \quad Z \overset{\text{def}}{=} \sum_{\tau_{i \rightsquigarrow \ell} \in \mathcal{T}} \alpha_i \, w(\tau_{i \rightsquigarrow \ell}) \, \omega_\ell = \sum_{k=0}^{\infty} \alpha^\top W^k \omega.$$
Of course, $p$ is only well-defined when $0 < Z < \infty$. Intuitively, $\alpha^\top W^k \omega$ is the total weight of all trajectories of length $k$. Thus, $Z$ is the total weight of all possible trajectories, as it sums the total weight for each possible trajectory length.
When the spectral radius of $W$ is less than one, the Neumann series converges and $W^* \overset{\text{def}}{=} \sum_{k=0}^{\infty} W^k = (I - W)^{-1}$. Thus, we can solve the infinite summation that defines $W^*$ by matrix inversion in O(N^3) time.

Corollary 1. $Z = \alpha^\top W^* \omega$.

By Corollary 1, we can find Z in O(N^3 + AN^2) time.
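As a concrete illustration, the following minimal sketch (a hypothetical toy machine; all numbers and variable names are ours, purely illustrative) computes Z by matrix inversion:

```python
import jax.numpy as jnp

N, A = 3, 2                               # number of states, alphabet size
# One nonnegative N x N matrix per symbol; entries are kept small so that
# the spectral radius of W is below one and the Neumann series converges.
W_a = jnp.full((A, N, N), 0.05)
alpha = jnp.array([1.0, 0.5, 0.0])        # start weights
omega = jnp.array([0.0, 0.5, 1.0])        # end weights

W = W_a.sum(axis=0)                       # W = sum_a W^(a), O(A N^2)
W_star = jnp.linalg.inv(jnp.eye(N) - W)   # W* = (I - W)^{-1}, O(N^3)
Z = alpha @ W_star @ omega                # Corollary 1: Z = alpha^T W* omega
```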
Strings versus Trajectories. Importantly, WFSMs can be regarded as weighted finite-state acceptors (WFSAs), which accept strings as their input. Each trajectory $\tau_{i \rightsquigarrow \ell} \in \mathcal{T}$ has a yield $\gamma(\tau_{i \rightsquigarrow \ell})$, which is the concatenation of the alphabet symbols of the trajectory. The yield of a trajectory ignores any $\varepsilon$ symbols; a discussion regarding the semantics of $\varepsilon$ is given in Hopcroft et al. (2001). As we focus on distributions over trajectories, we do not need special considerations for $\varepsilon$ transitions. We do not consider distributions over yields in this work, as such a distribution requires constructing a latent-variable model $p(\sigma) = \sum_{\tau_{i \rightsquigarrow \ell} \in \mathcal{T} \,:\, \gamma(\tau_{i \rightsquigarrow \ell}) = \sigma} p(\tau_{i \rightsquigarrow \ell})$, where $\sigma \in \mathcal{A}^*$ and $\gamma(\tau_{i \rightsquigarrow \ell})$ is the yield of the trajectory. While the marginal likelihood can be found efficiently, many quantities, such as the entropy of the distribution over yields, are intractable to compute (Cortes et al., 2008).

Computing the Hessian (and Beyond)
In this section, we explore algorithms for efficiently computing the Hessian matrix $\nabla^2 Z$. We briefly describe two inefficient algorithms, derived from forward-mode and reverse-mode automatic differentiation, respectively. Next, we propose an efficient algorithm based on a key differential identity.

An O(A^2 N^7) Algorithm with Forward-Mode Automatic Differentiation
One proposal for computing the Hessian comes from Li and Eisner (2009), who introduce a semiring-based method for computing a general family of quantities known as second-order expectations (defined formally in §4). When applied to the computation of the Hessian, their method reduces precisely to forward-mode automatic differentiation (AD; Griewank and Walther, 2008, Chapter 3.1). This approach requires that we "lift" the computation of Z to operate over a richer numeric representation known as dual numbers (Clifford, 1871; Pearlmutter and Siskind, 2007). Unfortunately, the second-order dual numbers that we require to compute the Hessian introduce a per-operation overhead of O(A^2 N^4), which results in an overall O(A^2 N^7) runtime.

An O(AN^5 + A^2 N^4) Algorithm with Reverse-Mode Automatic Differentiation

Another method for materializing the Hessian $\nabla^2 Z$ is through reverse-mode automatic differentiation (AD). Recall that we can compute Z in O(N^3 + AN^2), and can consequently find $\nabla Z$ in O(N^3 + AN^2) using one pass of reverse-mode AD (Griewank and Walther, 2008, Chapter 3.3). We can repeat differentiation to materialize $\nabla^2 Z$. Specifically, we run reverse-mode AD once for each element $i$ of $\nabla Z$. Taking the gradient of $(\nabla Z)_i$ gives a row of the Hessian matrix, $(\nabla^2 Z)_{i,:}$. Since each of these passes takes O(N^3 + AN^2) time (i.e., the same as the cost of computing Z), and $\nabla Z$ has size $AN^2$, the overall time is O(AN^5 + A^2 N^4).
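For reference, this AD baseline takes only a few lines in a modern AD framework. The sketch below is a minimal illustration on the toy machine from §2; note that jax.hessian composes forward- over reverse-mode passes rather than the pure repeated reverse-mode scheme described above, but it materializes the same tensor:

```python
import jax
import jax.numpy as jnp

N, A = 3, 2
W_a = jnp.full((A, N, N), 0.05)
alpha = jnp.array([1.0, 0.5, 0.0])
omega = jnp.array([0.0, 0.5, 1.0])

def Z_fn(W_a):
    W = W_a.sum(axis=0)
    W_star = jnp.linalg.inv(jnp.eye(N) - W)
    return alpha @ W_star @ omega

grad_Z = jax.grad(Z_fn)(W_a)      # shape (A, N, N): one reverse-mode pass
hess_Z = jax.hessian(Z_fn)(W_a)   # shape (A, N, N, A, N, N)
```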

Our Optimal O(A^2 N^4) Algorithm
In this section, we provide an O(A^2 N^4)-time and space algorithm for computing the Hessian.
Since the Hessian has size O(A^2 N^4), no algorithm can run faster than this bound; thus, our algorithm's time and space complexities are optimal. Our algorithm hinges on the following lemma, which shows that each of the partial derivatives of $W^*$ can be cheaply computed given $W^*$.
Lemma 1. For any transition $i \xrightarrow{a} j$ and any states $k$ and $\ell$,
$$\frac{\partial W^*_{k\ell}}{\partial W^{(a)}_{ij}} = W^*_{ki} \, W^*_{j\ell}.$$
Proof.
$$\frac{\partial W^*_{k\ell}}{\partial W^{(a)}_{ij}} = \left[\frac{\partial (I - W)^{-1}}{\partial W^{(a)}_{ij}}\right]_{k\ell} = \left[W^* \, J^{(ij)} \, W^*\right]_{k\ell} = W^*_{ki} \, W^*_{j\ell},$$
where $J^{(ij)}$ denotes the single-entry matrix with a one in position $(i, j)$ and zeros elsewhere. The second step uses Equation 40 of the Matrix Cookbook (Petersen and Pedersen, 2008), together with $\partial W / \partial W^{(a)}_{ij} = J^{(ij)}$.
We now extend Lemma 1 to express higher-order derivatives in terms of $W^*$. Note that, as in Lemma 1, all derivatives are taken with respect to individual transition weights $W^{(a)}_{ij}$.

Theorem 2. For $m \geq 1$ and an $m$-tuple of transitions $\tau = (i_1 \xrightarrow{a_1} j_1, \dots, i_m \xrightarrow{a_m} j_m)$,
$$\frac{\partial^m Z}{\partial W^{(a_1)}_{i_1 j_1} \cdots \partial W^{(a_m)}_{i_m j_m}} = \sum_{\tau' \in S_\tau} s_{i'_1} \, W^*_{j'_1 i'_2} \, W^*_{j'_2 i'_3} \cdots W^*_{j'_{m-1} i'_m} \, e_{j'_m},$$
where $\tau' = (i'_1 \xrightarrow{a'_1} j'_1, \dots, i'_m \xrightarrow{a'_m} j'_m)$, $s^\top = \alpha^\top W^*$, $e = W^* \omega$, and $S_\tau$ is the multi-set of permutations of $\tau$.

Corollary 2. For any pair of transitions $i \xrightarrow{a} j$ and $k \xrightarrow{b} \ell$,
$$\frac{\partial^2 Z}{\partial W^{(a)}_{ij} \, \partial W^{(b)}_{k\ell}} = s_i \, W^*_{jk} \, e_\ell + s_k \, W^*_{\ell i} \, e_j.$$
Proof. Application of Theorem 2 for the $m = 2$ case.
Theorem 2 shows that, if we have already computed $W^*$, each element of the m-th derivative can be found in O(m · m!) time: we must sum over O(m!) permutations, where each summand is the product of O(m) factors. Importantly, for the Hessian (m = 2), we can find each element in O(1) time using Corollary 2. Algorithm $D^m$ in Fig. 1 provides pseudocode for materializing the tensor containing the m-th derivatives of Z.
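To make the per-element cost concrete, a hypothetical helper (our own illustration, not the paper's pseudocode) that evaluates a single entry of the m-th derivative tensor by direct summation over permutations might look as follows:

```python
import itertools

def mth_derivative_entry(s, e, W_star, transitions):
    """Evaluate one entry of the m-th derivative tensor via Theorem 2.

    transitions: a list of (i, j) state pairs; the symbols a_k are irrelevant
    here because Z depends on the symbol matrices only through their sum W.
    Runs in O(m * m!) time: O(m!) permutations, O(m) factors each.
    """
    total = 0.0
    for perm in itertools.permutations(transitions):
        term = s[perm[0][0]]                      # s_{i'_1}
        for (_, j_prev), (i_next, _) in zip(perm, perm[1:]):
            term *= W_star[j_prev, i_next]        # W*_{j'_k, i'_{k+1}}
        total += term * e[perm[-1][1]]            # e_{j'_m}
    return total
```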
Correctness of algorithm $D^m$ follows from Theorem 2. The runtime and space bounds follow from the need to compute and store each combination of transitions. Each line of the algorithm is annotated with its running time.
Corollary 3. The Hessian $\nabla^2 Z$ can be materialized in O(A^2 N^4) time and O(A^2 N^4) space. Note that these bounds are optimal.
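A minimal dense sketch of this materialization (variable names ours; the toy machine matches the earlier sketch). Since Z depends on the symbol matrices only through their sum W, the O(N^4) core below is computed once and then shared across all A^2 symbol pairs:

```python
import jax.numpy as jnp

N, A = 3, 2
W_a = jnp.full((A, N, N), 0.05)
alpha = jnp.array([1.0, 0.5, 0.0])
omega = jnp.array([0.0, 0.5, 1.0])

W_star = jnp.linalg.inv(jnp.eye(N) - W_a.sum(axis=0))
s = alpha @ W_star                        # s^T = alpha^T W*
e = W_star @ omega                        # e = W* omega

# Corollary 2: core[i,j,k,l] = s_i W*_{jk} e_l + s_k W*_{li} e_j, O(N^4).
core = (jnp.einsum("i,jk,l->ijkl", s, W_star, e)
        + jnp.einsum("k,li,j->ijkl", s, W_star, e))

# The full (a, i, j, b, k, l) Hessian repeats the core for every symbol
# pair, giving the O(A^2 N^4)-sized tensor.
hess = jnp.broadcast_to(core, (A, A, N, N, N, N)).transpose(0, 2, 3, 1, 4, 5)
```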

[Fig. 1: Pseudocode for algorithms D^m (computing the tensor of m-th-order derivatives of a WFSM) and E^2, with each line annotated with its runtime; not reproduced here.]

Second-Order Expectations
In this section, we leverage the algorithms of the previous section to efficiently compute a family of expectations known as second-order expectations. To begin, we define an additively decomposable function $r : \mathcal{T} \to \mathbb{R}^R$ as any function expressed as
$$r(\tau) = \sum_{(j \xrightarrow{a} k) \in \tau} r^{(a)}_{jk},$$
where each $r^{(a)}_{jk}$ is an $R$-dimensional vector. Since many $r$ of interest are sparse, we analyze our algorithms in terms of $R$ and its maximum density $\widetilde{R} \overset{\text{def}}{=} \max_{j \xrightarrow{a} k} \lVert r^{(a)}_{jk} \rVert_0$. Previous work has considered expectations of such functions (Eisner, 2001) and of the product of two such functions (Li and Eisner, 2009), better known as second-order expectations. Formally, given two additively decomposable functions $r : \mathcal{T} \to \mathbb{R}^R$ and $t : \mathcal{T} \to \mathbb{R}^T$ (with density $\widetilde{T}$ defined analogously), a second-order expectation is
$$\mathbb{E}_{\tau \sim p}\left[r(\tau) \, t(\tau)^\top\right] = \sum_{\tau \in \mathcal{T}} p(\tau) \, r(\tau) \, t(\tau)^\top \in \mathbb{R}^{R \times T}.$$
Examples of second-order expectations include the Fisher information matrix and the gradients of first-order expectations (e.g., expected cost, entropy, and the Kullback-Leibler divergence).

Our algorithm is based on two fundamental concepts. Firstly, expectations under the trajectory distribution $p$ can be decomposed as expectations over transitions (Zmigrod et al., 2021). Secondly, the marginal probabilities of transitions are connected to derivatives of Z.

Lemma 2. For $m \geq 1$ and an $m$-tuple of transitions $\tau = (i_1 \xrightarrow{a_1} j_1, \dots, i_m \xrightarrow{a_m} j_m)$,
$$\sum_{\tau' \in \mathcal{T} \,:\, \tau \subseteq \tau'} p(\tau') = \frac{1}{Z} \left(\prod_{k=1}^{m} W^{(a_k)}_{i_k j_k}\right) \frac{\partial^m Z}{\partial W^{(a_1)}_{i_1 j_1} \cdots \partial W^{(a_m)}_{i_m j_m}}.$$

We formalize our algorithm as $E^2$ in Fig. 1. Note that we achieve an additional speed-up by exploiting associativity (see App. A.3).
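As a concrete (and deliberately naive) illustration of this decomposition, the sketch below forms the expectation from expected transition counts and expected products of counts, using the derivative formulas of §3. It is our own unoptimized rendering with hypothetical feature tensors r and t, not the algorithm E^2, which exploits associativity and sparsity to run faster:

```python
import jax.numpy as jnp

N, A, R, T = 3, 2, 4, 5
W_a = jnp.full((A, N, N), 0.05)
alpha = jnp.array([1.0, 0.5, 0.0])
omega = jnp.array([0.0, 0.5, 1.0])
r = jnp.ones((A, N, N, R))                # hypothetical r^(a)_{jk} features
t = jnp.ones((A, N, N, T))                # hypothetical t^(a)_{jk} features

W_star = jnp.linalg.inv(jnp.eye(N) - W_a.sum(axis=0))
s, e = alpha @ W_star, W_star @ omega
Z = alpha @ W_star @ omega

# Expected count of each transition: mu[a,i,j] = (1/Z) W^(a)_{ij} s_i e_j.
mu = W_a * jnp.einsum("i,j->ij", s, e) / Z

# Corollary 2 core, then expected products of counts for ordered pairs.
core = (jnp.einsum("i,jk,l->ijkl", s, W_star, e)
        + jnp.einsum("k,li,j->ijkl", s, W_star, e))
pair = jnp.einsum("aij,bkl,ijkl->aijbkl", W_a, W_a, core) / Z

# E[c_d c_d] needs an extra single-occurrence term on the diagonal.
D = A * N * N
counts = pair.reshape(D, D) + jnp.diag(mu.reshape(D))

# E[r t^T] = sum over transition pairs of E[c_d c_{d'}] r_d t_{d'}^T.
E_rt = jnp.einsum("de,dx,ey->xy", counts, r.reshape(D, R), t.reshape(D, T))
```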
Theorem 4. Algorithm $E^2$ computes the second-order expectation of additively decomposable functions $r : \mathcal{T} \to \mathbb{R}^R$ and $t : \mathcal{T} \to \mathbb{R}^T$ in the time and space bounds annotated in Fig. 1, where $\overline{R} = \min(N \widetilde{R}, R)$ and $\overline{T} = \min(N \widetilde{T}, T)$.

Proof. Correctness of algorithm $E^2$ is given in App. A.3. The runtime bounds are annotated on each line of the algorithm. We note that each $r^{(a)}_{jk}$ and $t^{(a)}_{jk}$ is $\widetilde{R}$- and $\widetilde{T}$-sparse, respectively. O(N^2) space is required to store $W^*$, O(RT) space is required to store the expectation, and O(N(R + T)) space is required to store the various intermediate $r$ and $t$ quantities.
Previous approaches for computing second-order expectations are significantly slower than $E^2$. Specifically, using Li and Eisner (2009)'s second-order expectation semiring requires augmenting the arc weights to be $R \times T$ matrices, and so runs in O(N^3 RT + AN^2 RT). Alternatively, we can use AD, as in §3.2, to materialize the Hessian and compute the pairwise transition marginals. This would result in a total runtime of O(AN^5 + A^2 N^4 $\overline{R}$ $\overline{T}$).

Conclusion
We have presented efficient methods that exploit properties of the derivative of a matrix inverse to find m-th-order derivatives of WFSMs. Additionally, we provided an explicit, novel algorithm for materializing the Hessian in its optimal complexity, O(A^2 N^4). We also showed how this can be utilized to efficiently compute second-order expectations of distributions under WFSMs, such as covariance matrices and the gradient of entropy. We hope that our paper encourages future research to use the Hessian and second-order expectations of WFSM systems, which have previously been disadvantaged by inefficient algorithms.

A Proofs
A.1

Theorem 2. For $m \geq 1$ and an $m$-tuple of transitions $\tau = (i_1 \xrightarrow{a_1} j_1, \dots, i_m \xrightarrow{a_m} j_m)$,
$$\frac{\partial^m Z}{\partial W^{(a_1)}_{i_1 j_1} \cdots \partial W^{(a_m)}_{i_m j_m}} = \sum_{\tau' \in S_\tau} s_{i'_1} \, W^*_{j'_1 i'_2} \cdots W^*_{j'_{m-1} i'_m} \, e_{j'_m},$$
where $s^\top = \alpha^\top W^*$, $e = W^* \omega$, and $S_\tau$ is the multi-set of permutations of $\tau$.

Proof. We prove this by induction on $m$.

Base case: $m = 1$ and $\tau = (i \xrightarrow{a} j)$. By Lemma 1,
$$\frac{\partial Z}{\partial W^{(a)}_{ij}} = \alpha^\top \frac{\partial W^*}{\partial W^{(a)}_{ij}} \, \omega = [\alpha^\top W^*]_i \, [W^* \omega]_j = s_i \, e_j.$$

Inductive step: Assume that the expression holds for $m$. Let $\tau = (i_1 \xrightarrow{a_1} j_1, \dots, i_m \xrightarrow{a_m} j_m)$ and consider the $(m{+}1)$-tuple $\bar{\tau}$, the concatenation of $(i \xrightarrow{a} j)$ and $\tau$.
Consider the derivative of each summand with respect to $W^{(a)}_{ij}$. By the product rule, we differentiate one factor of $s_{i'_1} W^*_{j'_1 i'_2} \cdots W^*_{j'_{m-1} i'_m} e_{j'_m}$ at a time, using $\partial s_k / \partial W^{(a)}_{ij} = s_i W^*_{jk}$, $\partial W^*_{k\ell} / \partial W^{(a)}_{ij} = W^*_{ki} W^*_{j\ell}$ (Lemma 1), and $\partial e_k / \partial W^{(a)}_{ij} = W^*_{ki} e_j$. The resulting expression is equal to inserting $i \xrightarrow{a} j$ in every spot of the induction hypothesis's permutation, thereby creating a permutation over $\bar{\tau}$. Reassembling with the expression for the derivative completes the induction. ∎

A.2

Lemma 2. For $m \geq 1$ and an $m$-tuple of transitions $\tau = (i_1 \xrightarrow{a_1} j_1, \dots, i_m \xrightarrow{a_m} j_m)$,
$$\sum_{\tau' \in \mathcal{T}_\tau} p(\tau') = \frac{1}{Z} \left(\prod_{k=1}^{m} W^{(a_k)}_{i_k j_k}\right) \frac{\partial^m Z}{\partial W^{(a_1)}_{i_1 j_1} \cdots \partial W^{(a_m)}_{i_m j_m}}.$$
Proof. Let $\mathcal{T}_\tau$ be the set of trajectories such that $\tau' \in \mathcal{T}_\tau \iff \tau \subseteq \tau'$. We prove the lemma by induction on $m$.

Base case: $m = 1$ and $\tau = (i_1 \xrightarrow{a_1} j_1)$. We have that
$$\sum_{\tau' \in \mathcal{T}_\tau} p(\tau') \overset{(a)}{=} \frac{1}{Z} \, W^{(a_1)}_{i_1 j_1} \frac{\partial Z}{\partial W^{(a_1)}_{i_1 j_1}}.$$
Step (a) holds because taking the derivative of Z with respect to $W^{(a_1)}_{i_1 j_1}$ yields the sum of the weights of all trajectories which include $i_1 \xrightarrow{a_1} j_1$, with one factor of $W^{(a_1)}_{i_1 j_1}$ removed from the computation of each weight; multiplying by $W^{(a_1)}_{i_1 j_1}$ restores it.

Inductive step: We can push the outer $W^{(a_1)}_{i_1 j_1}$ into the derivative and proceed through a chain of equalities (b)–(e). Step (b) pushes $\frac{1}{Z}$ and $\prod_{k=2}^{m} W^{(a_k)}_{i_k j_k}$ as constants into the derivative, and step (c) uses our induction hypothesis on the remaining $(m{-}1)$-tuple. Then, step (d) takes $\frac{1}{Z}$ out of the derivative, as we pushed it in as a constant. Finally, step (e) follows by the same reasoning as step (a) in the base case above. ∎
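As a quick numerical sanity check of Theorem 2 (our own addition, not part of the paper's proofs), one can compare the closed-form second derivative of Corollary 2 against automatic differentiation on a toy machine:

```python
import jax
import jax.numpy as jnp

N, A = 3, 2
W_a = jnp.full((A, N, N), 0.05)
alpha = jnp.array([1.0, 0.5, 0.0])
omega = jnp.array([0.0, 0.5, 1.0])

def Z_fn(W_a):
    return alpha @ jnp.linalg.inv(jnp.eye(N) - W_a.sum(axis=0)) @ omega

W_star = jnp.linalg.inv(jnp.eye(N) - W_a.sum(axis=0))
s, e = alpha @ W_star, W_star @ omega

# Closed form (Corollary 2) vs. AD, for the transition pair
# (a=0, i=0, j=1) and (b=1, k=1, l=2).
closed = s[0] * W_star[1, 1] * e[2] + s[1] * W_star[2, 0] * e[1]
auto = jax.hessian(Z_fn)(W_a)[0, 0, 1, 1, 1, 2]
print(jnp.allclose(closed, auto))  # expected: True
```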

A.3
Theorem 4. Algorithm $E^2$ computes the second-order expectation of additively decomposable functions $r : \mathcal{T} \to \mathbb{R}^R$ and $t : \mathcal{T} \to \mathbb{R}^T$ in the time and space bounds annotated in Fig. 1, where $\overline{R} = \min(N \widetilde{R}, R)$ and $\overline{T} = \min(N \widetilde{T}, T)$.

Proof. We provide a proof of correctness (the time and space bounds are discussed in the main paper). Zmigrod et al. (2021) show that we can find second-order expectations by finding the expectations over pairs of transitions. That is,
$$\mathbb{E}_{\tau \sim p}\left[r(\tau) \, t(\tau)^\top\right] = \sum_{\tau \in \mathcal{T}} p(\tau) \sum_{d \in \tau} \sum_{d' \in \tau} r_d \, t_{d'}^\top,$$
where $d$ and $d'$ range over the transitions of $\tau$. We can use Lemma 2 for the $m = 2$ case (together with the $m = 1$ case for the diagonal terms $d = d'$) to express each pairwise expectation in terms of $W^*$, $s$, and $e$ via Corollary 2. The resulting expression splits into two summands; the second summand can be rewritten as a sum over $i, j, k, \ell \in \{1, \dots, N\}$ and $a, b \in \mathcal{A}$, and the first summand is handled analogously. Exploiting associativity to regroup the factors and, finally, recomposing all the pieces together yields exactly the quantity computed by algorithm $E^2$. ∎