Neuralizing Regular Expressions for Slot Filling

Neural models and symbolic rules such as regular expressions have their respective merits and weaknesses. In this paper, we study the integration of the two approaches for the slot filling task by converting regular expressions into neural networks. Specifically, we first convert regular expressions into a special form of finite-state transducers, then unfold the approximate inference algorithm of the transducer as a bidirectional recurrent neural model that performs slot filling via sequence labeling. Experimental results show that our model has superior zero-shot and few-shot performance and stays competitive when there are sufficient training data.


Introduction
Neural approaches almost dominate recent natural language processing (NLP) research. The tremendous success of neural networks benefits from their strong capability of learning from a large amount of annotated data. On the other hand, systems based on symbolic rules, while being no longer the mainstream approach to NLP research, are still very widely used in practice because of their interpretability, trustworthiness and decent zero-shot performance in data-scarce scenarios. Therefore, how to integrate neural and symbolic approaches while retaining their respective advantages is becoming an active research direction.
In this paper, we try to integrate neural networks with regular expressions (RE), one of the most widely used forms of symbolic rules. We focus on the slot filling task (SF), a typical application scenario of REs that aims to identify words in a sentence that carry specific information. For example, given the sentence "show me the flights from New York to Dallas.", SF aims to tag the span "New York" as the content of slot fr.city (the departing city). An RE specifies text patterns of both contents and contexts and uses capturing groups to mark target contents.
Existing methods that integrate neural networks and symbolic rules either use rule outputs as pseudo-labels to distill knowledge into neural networks (Zhang et al., 2018; Hu et al., 2016), or use rule outputs to influence attention weights or output logits of neural networks (Luo et al., 2018; Wang et al., 2019; Li and Srikumar, 2019). These integration methods can outperform both rules and pure neural networks when there are sufficient training data. However, since they still require training, their performance is well below that of rules in zero-shot and low-resource settings. More recently, Jiang et al. (2020) proposed a novel method that converts REs into neural networks, which can match the performance of the original REs without training and can compete with previous integration methods when trained. Unfortunately, their method is designed for text classification and cannot be easily extended to the SF task because it does not model RE capturing groups and only computes sentence-level scores.
In this work, we propose to neuralize REs with capturing groups for the SF task. Our method is inspired by Jiang et al. (2020) but differs from their work in many aspects. Specifically, we treat SF as a sequence labeling task with the BIO tag scheme and convert REs with capturing groups into finite-state transducers (FST) with restricted output dependencies. We propose an approximate FST inference algorithm and unfold it as a neural model that resembles bidirectional RNN models for sequence labeling. Our model is approximately equivalent to the original REs and can be further improved by training on labeled data. We also propose several techniques to enhance the model without harming its initial approximation to the REs.
We conduct our experiments on three popular SF datasets involving two domains and two languages. Results show that our model has superior performance in zero-shot and low-resource settings compared with all the previous methods and remains competitive in rich-resource settings. We provide our source code as well as a handy toolkit for writing and converting REs in https://github.com/jeffchy/RE2NN-SEQ.

Regular Expression for Slot Filling
We describe a slot filling system based on word-level REs with capturing groups. Consider the following RE that captures the content of a sentence for slot fr.city (the departing city):

w* from <w*>[fr.city] to w*

w is the wildcard symbol for words and matches any word; * is the Kleene star, which means the preceding symbol or sub-expression can appear zero or more times; <sub-expression>[name] is a capturing group that captures the content matched by the sub-expression and tags it as name. An RE can have multiple capturing groups. To do slot filling, we write REs in which each capturing group is dedicated to a slot indicated by its name, and then we apply the REs one by one to capture the contents in the input sentences. For words captured by multiple groups with different names, we resolve the conflicts based on a pre-defined priority. For example, given the sentence "show me the flights from New York to Dallas", the RE above would identify "New York" for slot fr.city.

Finite-State Transducer
A finite-state transducer (FST) is a finite-state machine with input and output. At each time step, it accepts a word, transits from one state to another, and outputs a label. It is formally defined as a 6-tuple T = ⟨Q, Σ, Γ, Q_I, Q_F, Ω⟩.
• Q: a finite set of states, |Q| = K.
• Σ: a finite input vocabulary of words, |Σ| = V.
• Γ: a finite output vocabulary of labels, |Γ| = L.
• Q_I: a finite set of start states, a subset of Q.
• Q_F: a finite set of final states, a subset of Q.
• Ω: a finite set of transitions, a subset of Q × Σ × Γ × Q.
If we assign a weight to each transition and a weight to each state for being a start state or final state, then we get a weighted finite-state transducer (WFST). We can use a 4th-order tensor T_Ω ∈ R^{V×L×K×K} to represent the transition weights, and two K-dimensional vectors μ and ν to represent the weights of the start and final states. An FST can be viewed as a WFST with 0/1 weights. Specifically, T_Ω[m, n, i, j] = 1 iff T accepts input token σ_m, transits from state q_i to state q_j, and outputs γ_n, where σ_m ∈ Σ, γ_n ∈ Γ, and q_i, q_j ∈ Q.
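As a concrete toy illustration of this representation (a minimal sketch of ours; the vocabulary, labels, and weights are invented for illustration), a WFST with 0/1 weights can be stored as a transition tensor plus start/final weight vectors:

```python
import numpy as np

# Toy FST: states q0, q1; input vocabulary {"a", "b"}; output labels {"X", "Y"}.
V, L, K = 2, 2, 2
T_omega = np.zeros((V, L, K, K))   # T_omega[m, n, i, j] = 1 iff reading word m
T_omega[0, 0, 0, 1] = 1.0          # in state q_i emits label n and moves to q_j.
T_omega[1, 1, 1, 1] = 1.0          # Here: "a"/X takes q0 -> q1, "b"/Y loops on q1.

mu = np.array([1.0, 0.0])          # q0 is the only start state
nu = np.array([0.0, 1.0])          # q1 is the only final state
```

Only two entries of the tensor are nonzero, matching the two transitions of the toy machine.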
We define a path as a sequence of transitions. An FST accepts an input sequence if there exists at least one path that starts from a start state, matches the input sequence at each position, and ends at a final state. We call such a path an accepting path, which defines a mapping from the input sequence to an output sequence. The score of a path ω_1, · · · , ω_m with start state q_i and final state q_j can be computed as:

s(ω_1, · · · , ω_m) = μ_i · (∏_{t=1}^{m} w(ω_t)) · ν_j

where w(ω_t) is the weight of transition ω_t in T_Ω.

Figure 2: An example finite-state transducer. q_0 is the only start state and q_3 is the only final state. w is the wildcard for input words and l is the wildcard for output labels. Each arc represents a possible transition, and the slash on each arc separates the input (left) from the output (right). (Arcs: q_0→q_0 w/l, q_0→q_1 from/l, q_1→q_2 w/B-fr.city, q_2→q_2 w/I-fr.city, q_2→q_3 to/l, q_3→q_3 w/l.)

We show an example FST in Fig. 2. The input "flights from New York to Dallas" can be accepted by the FST, and the accepting path indicates the state sequence [q_0, q_0, q_1, q_2, q_2, q_3, q_3] and output sequence [l, l, B-fr.city, I-fr.city, l, l].
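The path score definition above can be sketched as a small function (a toy illustration with our own names, not the authors' implementation): the score multiplies the start weight, every transition weight along the path, and the final weight.

```python
import numpy as np

# Score of a path: start weight * product of transition weights * final weight.
# A path is a list of (word_idx, label_idx, src_state, tgt_state) transitions.
def path_score(T_omega, mu, nu, path):
    score = mu[path[0][2]]
    for m, n, i, j in path:
        score *= T_omega[m, n, i, j]
    return score * nu[path[-1][3]]

# Tiny 2-state machine: "a"/X takes q0 -> q1; q0 starts, q1 is final.
T = np.zeros((1, 1, 2, 2))
T[0, 0, 0, 1] = 1.0
mu, nu = np.array([1.0, 0.0]), np.array([0.0, 1.0])
s = path_score(T, mu, nu, [(0, 0, 0, 1)])   # an accepting path
```

A path that starts from a non-start state or uses a missing transition scores 0, i.e., it is not accepting.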

Method
In this section, we introduce the steps of neuralizing REs (as shown in Fig. 1) as well as the decoding and training methods.

Converting RE to FST

Sakuma et al. (2012) rigorously show that any RE with capturing groups can be converted into an FST. The equivalence of the RE and the FST is reflected in two properties: (1) the RE matches a sentence if and only if the corresponding FST has at least one accepting path; (2) the output sequence of the FST matches the content captured by the RE. Here we use the BIO tag scheme in the FST output to specify capturing groups. For example, we tag New and York as B-fr.city and I-fr.city to represent that the group New York is captured as fr.city. For words outside of any capturing group (e.g., 'from' and 'Dallas' in the example sentence in Sec. 2.1), however, we depart from the BIO scheme and assign a wildcard label l instead of the outside label 'O', which means the RE is totally unsure about which label these words should be assigned, because groups in other REs might capture them. For example, 'Dallas' can be captured by another RE as to.city (the arriving city) and hence shall not be assigned label 'O'.
To convert an RE to an FST, we view the FST as a finite-state automaton over the vocabulary Σ × Γ, use Thompson's construction (Thompson, 1968) to build the automaton from the RE, and further minimize the automaton using Hopcroft's algorithm (Hopcroft, 1971). In the resulting FST, the input vocabulary Σ is the RE vocabulary, and the output vocabulary Γ contains labels B-X and I-X for each slot X, as well as O and the wildcard label l. As an example, the RE in Sec. 2.1 can be converted into the FST shown in Fig. 2. We provide the complete conversion algorithm and prove its correctness in Appendix A. When there are multiple REs, we take the union of the REs with the 'or' operation (i.e., a|b) to form a big RE and turn it into an FST.

Algorithm 1: Inference in FST
1  Input: x = x_1, · · · , x_m; T = ⟨T_Ω, μ, ν⟩
2  Step 1: sum out the label dimension of T_Ω to get T′_Ω ∈ R^{V×K×K}. Let T′_Ω[x_t] denote the transition matrix of word x_t in T′_Ω.
3  Step 2: calculate forward scores. Let α_0 = μ^T.
4  for t ← 1 to m do
5    α_t = α_{t−1} · T′_Ω[x_t]
6  Step 3: calculate backward scores. Let β_m = ν^T.
7  for t ← m to 1 do
8    β_{t−1} = T′_Ω[x_t] · β_t
9  Step 4: get label scores c_t ∈ R^L at each position t:
10   (c_t)_k = (α_{t−1})_i (T_Ω[x_t])_{kij} (β_t)_j  (einsum notation, summing over i and j)

Inference in FST
Given sentence x = x 1 , · · · , x m , FST inference aims to find the output label sequence y = y 1 , · · · , y m . The score of y is defined as the weight sum of all the accepting paths matching input x and output y. As finding the highest scoring output given a sentence is proved to be NP-hard (Casacuberta and De La Higuera, 2000), we instead use an approximate inference that finds the highest scoring output label at each position independently of the labels at all the other positions. This can be done with the classic variable elimination algorithm, which involves a forward process summing out variables to the left and a backward process summing out variables to the right of each position. We show the algorithm of computing label scores simultaneously at all the positions in Algorithm 1.
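Algorithm 1 can be sketched in a few lines of numpy (a toy re-implementation for illustration; variable names follow the algorithm, not the authors' released code):

```python
import numpy as np

# Forward-backward label scoring for a WFST (Algorithm 1 sketch).
def fst_label_scores(T_omega, mu, nu, x):
    # Step 1: sum out the label dimension -> per-word K x K transition matrices.
    T_sum = T_omega.sum(axis=1)                      # (V, K, K)
    m = len(x)
    # Step 2: forward scores alpha_0 .. alpha_m.
    alphas = [mu]
    for t in range(m):
        alphas.append(alphas[-1] @ T_sum[x[t]])
    # Step 3: backward scores beta_m .. beta_0.
    betas = [None] * (m + 1)
    betas[m] = nu
    for t in range(m, 0, -1):
        betas[t - 1] = T_sum[x[t - 1]] @ betas[t]
    # Step 4: label scores c_t, contracting over source state i and target j.
    return [np.einsum("i,kij,j->k", alphas[t], T_omega[x[t]], betas[t + 1])
            for t in range(m)]
```

Each c_t scores every label at position t independently, which is exactly the approximate inference described above.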

Independent FST
The 4th-order tensor of the FST leads to high space and time complexity for both storage and inference.

FST to i-FST We address these concerns by transforming the FST such that each output label is independent of the input and the source state of the transition given the target state. We call the result an independent FST (i-FST). Fig. 3 shows an i-FST converted from the FST shown in Fig. 2. In the example i-FST, the output is B-fr.city whenever we reach q_2, but in the original FST, the output could be either B-fr.city or I-fr.city depending on the source state. We show the conversion algorithm in Appendix B. The algorithm runs in polynomial time O(LK^3) and adds at most O(LK) new states. As FSTs converted from REs naturally satisfy this independence condition in most cases, the algorithm runs much faster and adds no more than 50% new states in our experiments on the three datasets.

3rd-order representation Because of the independence in an i-FST, we can use a 3rd-order tensor T ∈ R^{V×K×K} to represent transitions between states given an input, and a matrix O ∈ R^{K×L} to represent the mapping from target states to output labels. We can use T and O to recover the 4th-order tensor of the i-FST:

T_Ω[m, n, i, j] = T[m, i, j] · O[j, n]

Inference in i-FST We can again apply variable elimination to compute label scores (Algorithm 2). Since the output label now depends only on one state, the time complexity is also reduced. We compare FST and i-FST in Table 1.

Algorithm 2: Inference in i-FST
1  Input: x = x_1, · · · , x_m; T = ⟨T, O, μ, ν⟩
2  Step 1: sum out the label dimension of O to get vector o ∈ R^K. Let • denote element-wise multiplication.
3  Step 2: calculate forward scores. Let α_0 = μ^T.
4  for t ← 1 to m do
5    α_t = (α_{t−1} · T[x_t]) • o^T
6  Step 3: calculate backward scores. Let β_m = ν^T.
7  for t ← m to 1 do
8    β_{t−1} = (T[x_t] · β_t) • o
9  Step 4: get label scores c_t ∈ R^L at each position t:
10   c_t = (α_t • β_t) · O
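The 3rd-order representation can be illustrated with a quick numeric check (a sketch with random weights and our own names): in an i-FST the output label depends only on the target state, so the 4th-order tensor factors into a transition tensor T and a state-to-label matrix O.

```python
import numpy as np

# Recover the 4th-order tensor of an i-FST from its factored form:
# T_omega[m, n, i, j] = T[m, i, j] * O[j, n].
V, K, L = 2, 3, 2
rng = np.random.default_rng(0)
T = rng.random((V, K, K))    # T[m, i, j]: weight of q_i -> q_j on word m
O = rng.random((K, L))       # O[j, n]: weight of emitting label n at state q_j
T_omega = np.einsum("mij,jn->mnij", T, O)
```

Storing T and O instead of T_omega reduces the parameter count from V·L·K·K to V·K·K + K·L.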
We find that i-FST inference resembles running a BiRNN model with a linear output layer for sequence labeling.

Table 1: Comparison of FST and i-FST. K_1 denotes the number of states of the i-FST; in practice, K_1 ≤ 1.5K. In the factor graphs, variables S and E represent the source and target states of a transition, and X, Y represent the input and output.

We recurrently update the forward score α_t and backward score β_t based on the input x_t and the previous scores at each position, so α_t and β_t resemble the forward and backward hidden states in a BiRNN. At position t, we aggregate α_t and β_t and map them to the label score vector c_t, which also resembles concatenating the bidirectional hidden states together and feeding them into the output layer of a BiRNN.

Parameter Tensor Decomposition: the last step towards FSTRNN
An i-FST is much more compact and faster, but it still: (1) has too many parameters (especially T ∈ R^{V×K×K}) compared with a traditional BiRNN; and (2) is unable to incorporate external word embeddings. To tackle these problems, we adopt the tensor-decomposition-based method proposed by Jiang et al. (2020) and modify the forward and backward score computation accordingly (Steps 2 and 3 in Algorithm 2).

CP Decomposition (CPD)
We apply CPD to decompose the 3rd-order tensor T into three matrices E_R ∈ R^{V×R}, D_S ∈ R^{K×R}, and D_E ∈ R^{K×R}:

T[m, i, j] ≈ Σ_{r=1}^{R} E_R[m, r] · D_S[i, r] · D_E[j, r]

where R is the rank, a hyper-parameter. A higher rank usually results in lower decomposition errors. We show the details of CPD in Appendix C. After tensor decomposition, we use E_R, D_S, D_E instead of T to compute forward and backward scores. Line 5 of Algorithm 2 becomes:

v_t = E_R[x_t],   g = (α_{t−1} · D_S) • v_t,   α_t = (g · D_E^T) • o^T   (1)

where E_R can be treated as a special word embedding matrix derived from the original RE, and v_t is the embedding of word x_t selected from E_R. If E_R, D_S, D_E reconstruct T perfectly, then Eq. 1 is equivalent to line 5 of Algorithm 2. Similarly, the backward score equation in line 8 is modified to:

g′ = (β_t · D_E) • v_t,   β_{t−1} = (g′ · D_S^T) • o   (2)

Incorporating External Word Embeddings As mentioned above, we treat E_R[x_t] as an R-dimensional word vector. We may also want to inject additional information about the word by incorporating externally pretrained word embeddings.
For static word embeddings such as GloVe, we again adopt the method of Jiang et al. (2020): the pretrained embedding of word x_t is projected into the R-dimensional space of E_R, giving a second R-dimensional vector for x_t. We then interpolate these two R-dimensional vectors with hyper-parameter η to get the new v_t, and use it to calculate the forward and backward scores in Eq. 1 and 2.
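The factorized forward update can be sanity-checked numerically: if the CP factors reconstruct T exactly, the rank-space update reproduces the dense update α_{t−1} · T[x_t]. A toy sketch (names and dimensions are ours):

```python
import numpy as np

# Build an exactly-decomposed 3rd-order tensor from random factor matrices,
# then compare the dense and the factorized forward updates.
K, V, R = 4, 3, 5
rng = np.random.default_rng(1)
E_R = rng.random((V, R))
D_S = rng.random((K, R))
D_E = rng.random((K, R))
T = np.einsum("mr,ir,jr->mij", E_R, D_S, D_E)   # T[m] = sum_r of rank-1 terms

alpha = rng.random(K)          # previous forward score
x_t = 2                        # current word index
v_t = E_R[x_t]                 # rank-space "word embedding"
g = (alpha @ D_S) * v_t        # project into rank space, weight by the word
alpha_lowrank = g @ D_E.T      # project back to state space
```

The low-rank path costs O(KR) per step instead of O(K^2) with a dense K × K transition matrix per word.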
To incorporate contextualized word embeddings such as BERT, we first use two methods to acquire an intermediate static word embedding matrix E_w ∈ R^{V×D}: (1) the Aggregate method proposed by Bommasani et al. (2020), which distills a static embedding for each word by pooling its contextualized representations; (2) random initialization (Random).

Enhancements
We enhance the model without harming its approximation to the FST with the following techniques.
Non-linearity We apply the tanh activation function to α_t and β_t. For example, the third formula of Eq. 1 becomes α_t = tanh((g · D_E^T) • o^T). Note that y = tanh(x) is close to y = x when x is small, so our model still approximates the FST very well. As will be shown later, when the model is trained with labeled data, applying tanh leads to better performance.
Dummy States The number of FST states K is decided by the REs. We may introduce additional dummy states to increase the capacity of the model. We achieve this by padding the parameter matrices. The padding values are so small that the new states can be seen as isolated, with no transitions, and hence do not interfere with the FST inference. However, transitions from and to these states will be automatically established during training, which often improves the model performance.
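Dummy-state padding can be sketched as follows (a minimal illustration of ours, assuming the parameters are the factor matrices and state-weight vectors; the padding constant is our choice):

```python
import numpy as np

# Pad every state-indexed parameter with `extra` near-zero rows/entries.
# The new states start out effectively isolated, so inference is unchanged,
# but training can later wire them into the transition structure.
def pad_states(D_S, D_E, O, mu, nu, extra, eps=1e-6):
    def pad(a):
        return np.concatenate([a, np.full((extra,) + a.shape[1:], eps)], axis=0)
    return pad(D_S), pad(D_E), pad(O), pad(mu), pad(nu)
```

Padding rows of D_S/D_E adds states in the factorized transition tensor; padding O, μ, ν keeps all state-indexed parameters aligned.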

Gated Variants
We follow Jiang et al. (2020) and add an update gate z_t and a reset gate r_t into the forward and backward score computation. We call this variant FSTGRU. We initially close the gates by setting a large bias term so that our model still approximates the FST. After training, the gates are utilized and lead to better model performance. We show the details in Appendix D.

Decoding
FST inference produces a label score vector c t at each position t. Here we show how to output labels from the score vectors.
Priority We first feed the label scores at each position into a priority layer to resolve conflicts when different groups capture the same word. The priority relations between slot labels are specified by human experts and can be represented as a set of logic rules. We turn the rules into soft logic computations that can be implemented with an MLP which takes c_t as input and outputs an updated score vector c′_t ∈ R^L. We show the details in Appendix E.

FSTRNN-Softmax
Before decoding from c_t, we introduce a fixed threshold τ ∈ (0, 1) (set to 0.1 by default) to handle the wildcard label l. Intuitively, l represents unsureness, so we only choose it when the scores of all the other labels are below τ. In that case, we deem that the word is not captured by any group and hence output the label O. In all the other cases, we output the highest-scoring label. We show the decoding step in Algorithm 3.
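The thresholded decoding rule can be sketched per position as follows (a toy illustration; the label set and the exact handling of τ are our reading of the rule):

```python
import numpy as np

# If every non-wildcard label scores below tau, the word is deemed uncaptured
# and we output "O"; otherwise we output the best non-wildcard label.
def decode_position(c_t, labels, wildcard="l", outside="O", tau=0.1):
    scores = {lab: s for lab, s in zip(labels, c_t) if lab != wildcard}
    best = max(scores, key=scores.get)
    return outside if scores[best] < tau else best

labels = ["B-fr.city", "I-fr.city", "l"]
y1 = decode_position(np.array([0.8, 0.1, 0.5]), labels)    # confident slot label
y2 = decode_position(np.array([0.05, 0.02, 0.9]), labels)  # only the wildcard fires
```

In the second call the wildcard dominates but all real labels fall below τ, so the output falls back to "O".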
FSTRNN-CRF The linear-chain CRF is widely used for sequence labeling decoding and can be easily integrated into our model. We first handle l with the threshold τ as above. We then regard the label scores as CRF emission scores and use the Viterbi algorithm to produce the output sequence. Finally, we again change l to O. We initialize the CRF transition probabilities p(y_t | y_{t−1}) to 1/L so that initially we obtain the same output sequence as FSTRNN-Softmax. We show the decoding step for FSTRNN-CRF in Algorithm 4.
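The claim that uniform transition initialization reproduces FSTRNN-Softmax decoding can be checked with a small Viterbi sketch (a toy log-space implementation of ours):

```python
import numpy as np

# Standard Viterbi decoding over log-space emissions and transitions.
def viterbi(emissions, log_trans):
    m, L = emissions.shape
    score, back = emissions[0].copy(), []
    for t in range(1, m):
        total = score[:, None] + log_trans + emissions[t][None, :]
        back.append(total.argmax(axis=0))   # best previous label per label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

L = 3
emissions = np.log(np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]))
uniform = np.full((L, L), np.log(1.0 / L))   # p(y_t | y_{t-1}) = 1/L
path = viterbi(emissions, uniform)
```

With uniform transitions, every path pays the same transition cost, so Viterbi reduces to the per-position argmax that FSTRNN-Softmax uses.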

Training Using Labeled Data
To approximate REs, we initialize the model parameters: E R , D S , D E , O, α 0 , β m from the corresponding i-FST. But unlike REs, our model can be trained using labeled data to further improve its performance. For FSTRNN-Softmax, we optimize the cross-entropy loss at each position. For FSTRNN-CRF, we optimize the classic CRF loss. We optimize the loss functions using Adam (Kingma and Ba, 2014). As l is a dummy label that does not appear in the training data, its score will be automatically reduced over training, which implies we learn to remove unsureness of the original RE.

Experimental Settings
Datasets and REs We perform our experiments on three SF datasets involving two languages and two domains: ATIS (Hemphill et al., 1990), ATIS-ZH (Mansour and Haider, 2021), and SNIPS (Coucke et al., 2018). ATIS contains English queries for airline services and ATIS-ZH is the Chinese version of ATIS. SNIPS is a collection of English queries to voice assistants; it has a large vocabulary and is more complex than ATIS. Both ATIS and SNIPS are widely adopted for evaluating SF models. For each dataset, we ask an RE expert to write RE rules. We show the statistics of the datasets and REs in Table 2 and leave more details of RE writing to Appendix F.

Baselines We use BiGRU/BiLSTM + (Softmax/CRF) as base models for experiments with static word embeddings, and use BERT-base-uncased + (Softmax/CRF) (Chen et al., 2019a) as base models for experiments with contextualized word embeddings. When using BERT, we represent each word using the last BERT hidden state of its first sub-token and feed these hidden states into a linear layer. These baseline methods are widely used for SF. We also compare previous methods of enhancing these base models with REs. Luo et al. (2018) use REs as additional input features (+i), use REs to adjust output logits (+o), or apply both (+io). We also compare two knowledge-distillation-based algorithms that treat RE results as a teacher to guide the learning of the base model: the classic knowledge distillation method proposed by Hinton et al. (2015) (+kd), and the posterior-regularization-style method of Hu et al. (2016) (+pr).

Training Settings When using static word embeddings, we use 100d GloVe (Pennington et al., 2014) embeddings for ATIS and SNIPS and 300d FastText (Bojanowski et al., 2017) embeddings for ATIS-ZH; we also fix the word embeddings E_w and E_R during training. When using contextualized word embeddings, we finetune BERT for both our methods and the baselines. Our methods have comparable numbers of trainable parameters to those of the baselines. We show the parameter counts and hyper-parameter tuning of our methods and the baselines in Appendix G and H.
We compare our methods and the baselines on zero-shot, 10-shot, 50-shot, 10%, and 100% of the training data. We report averaged results from four runs with different random seeds for data sampling and parameter initialization.

Evaluation
We report the micro F1 scores of our methods and the baselines under various training settings. We leave the BiLSTM results to Appendix J because BiGRU is slightly better, and do the same for +io (which is worse than +i).

With Static Word Embeddings
Zero-shot We show the zero-shot performance of models with non-contextualized word embeddings in Table 3. Our methods achieve comparable or even better results in comparison with the original REs without any training and perform much better than all the other baselines. Because +kd and +pr are knowledge-distillation-based methods, they are identical to +none without any training. The base models and their +i, +kd, +pr enhancements perform at the random guessing level, while +o is better but still significantly inferior to the original REs. FSTRNN/FSTGRU with Softmax/CRF perform almost the same because of our initialization strategies described in Sec. 3.5 and Sec. 3.6. They sometimes even outperform the original REs because of randomness and the incorporation of external word embeddings when η < 1. We give an analysis of the impact of η in Appendix I.
Few-shot Can our model maintain its lead over the baselines and be trained to outperform the original REs in low-resource settings? The 10-shot and 50-shot results in Table 5 give a YES answer. In the few-shot settings, our methods outperform all baselines (including REs). This is because, on one hand, our models have a better starting point derived from the REs in comparison with the neural baselines; and on the other hand, our models can learn to improve from labeled data in comparison with the RE baseline. The RE-enhanced baselines are also effective in low-resource settings and can improve the base models with RE information. CRF and gates do not help our methods or the baselines because of the lack of training data.
Rich-resource We show the results with 10% and 100% of the training data in Table 5. With static word embeddings, our methods still perform the best with 10% of the training data on ATIS and ATIS-ZH and remain competitive with 100% of the training data on all three datasets. The baselines perform reasonably well with sufficient training samples; however, no single baseline consistently performs the best.
Our methods underperform the baselines on SNIPS 10%, which might be caused by the complexity of the SNIPS dataset and the quality of the RE rules (which have 85% precision, compared to 90%+ for ATIS and ATIS-ZH). We also observe that CRF and gates significantly improve the performance of our methods and the baselines without BERT.

With BERT
We show the zero-shot, low-resource, and rich-resource performance of BERT-enhanced methods on the SNIPS dataset in Table 4. We choose between the two initialization strategies for E_w, Aggregate and Random, based on the development set performance.

Zero-shot
The experimental results show that our methods can still approximate the REs well, reaching 52.25 F1 (+0.27 compared to original REs) without any training data.
Low-resource In the low-resource setting (10 and 50 training samples), our methods have minor or even no improvement over the zero-shot setting. On the other hand, the BERT-enhanced baselines perform much better compared to the non-BERT baselines, but they are still far behind our methods.

Rich-resource
The experimental results on 10% and 100% of training data show that both our methods and the baselines have a large performance gain compared with the non-BERT setting (around +8% and +4%). Our methods are still competitive with the baselines, especially when using 100% of training data, which shows the ability of our methods to utilize pretrained contextual word embeddings such as BERT.
In addition to the SNIPS dataset, we also test our methods with BERT on the ATIS dataset. While our method again beats the baselines in the zero-shot and 10-shot settings and is competitive in the 100% setting, it falls behind the baselines in the 50-shot and 10% settings, especially when using CRF. We speculate that the ATIS REs may not be very compatible with BERT-enhanced models and hence our model may require more training data to move away from the RE-based initialization. We leave more detailed analyses for future investigation.

Analysis
Ablation Study We conduct an ablation study on the 100% training samples on the three datasets in Table 6. The ablation results show that our various enhancements indeed improve the model performance, especially the tanh nonlinearity and external word embeddings. The randomly initialized FSTGRU+CRF performs surprisingly well, which indicates that our model has a strong capacity to learn from data and does not rely much on rules with enough training data.
Utilizing Unlabeled Data We assume no unlabeled data is available in the previous experiments. However, with clean unlabeled data, we can use REs to produce pseudo-labels and use them to train the baselines. We compare the results of training BiGRU+(none/i/o)+CRF using different amounts of unlabeled data against FSTRNN without any training on SNIPS (Fig. 4). The results show that the pseudo-label baselines need a considerable amount of unlabeled data to approach the zero-shot performance of FSTRNN.

Related Work

Our work is most related to Jiang et al. (2020), who neuralize REs for text classification. Apart from the difference in the output forms, our method differs from theirs mainly in that: (1) we target the slot filling task and hence neuralize REs with capturing groups; (2) we convert REs into an FST that produces output sequences, while they convert REs into a finite-state automaton (FSA) that decides acceptance of strings; (3) the tensor parameter of an FST has a higher order than that of an FSA, which leads to a more complicated procedure for reducing the computational complexity; (4) our decoding algorithm invokes both forward and backward processes, while theirs requires only a forward process.

Conclusion and Future Work
In this work, we neuralize a symbolic RE system for slot filling into a trainable neural network model. The model approximates REs well initially and can be trained to improve itself with labeled data. Experiments in various settings show the advantages of our method. To the best of our knowledge, we are the first to neuralize REs into neural networks for the slot filling task.
For future work, we want to explore the possibility of converting an FSTRNN back into REs for better interpretability, and investigate more on utilizing BERT.
We believe our methodology can be extended to other tasks such as NER and QA. We also hope our methods can provide insights into theoretical analysis on the relations between neural models and regular languages.

Acknowledgement
This work was supported by the National Natural Science Foundation of China (61976139).

A Converting RE to FST
Step 1: Given an RE with k capturing groups named l_1, · · · , l_k, we first inject the empty string ε to separate the sub-expressions with and without capturing groups. For example, the RE

w* from <w*>[fr.city] to w*

becomes

w* from ε <w*>[fr.city] ε to w*
Step 2: We ignore the capturing groups and perform Thompson's construction on the whole RE to get the ε-NFA (Theorem A.2). During the construction, however, we mark the states and edges of the sub-NFA corresponding to each capturing group and denote them as S_1, · · · , S_k and E_1, · · · , E_k, where E_k = {(s_i, s_j) | s_i, s_j ∈ S_k}.
We also use S_0 and E_0 to denote the states and edges outside the capturing groups. As we separate the sub-expressions with ε, due to the concatenation rule of Thompson's construction, the state sets S_0, S_1, · · · , S_k are pairwise disjoint.

Step 3: Assign l as the output of the edges in E_0 whose input is not ε.
Step 4: Make sure that, for each sub-NFA, the start state has no outgoing ε-edges and no incoming edges. Then assign BIO tags to the new sub-NFA. We show the details in Algorithm 5.
Algorithm 5: Step 4
1   Input: sub-NFAs with states and edges S_1, E_1, · · · , S_k, E_k
2   for i ← 1 to k do
3     Denote the i-th sub-NFA by A_i.
4     Perform ε-elimination on A_i to get the new states and edges S′_i, E′_i.
5     Find the start state s_0 ∈ S′_i of the new sub-NFA.
6     Create a new state s′.
7     for each incoming edge (s, s_0) ∈ E′_i do
8       Remove the edge (s, s_0).
9       Add the edge (s, s′).
10    end
11    for each outgoing edge (s_0, s) ∈ E′_i do
12      Add the edge (s′, s).
13    end
14    Add s′ to S′_i.
15    Upon getting the updated sub-NFA, assign B-l_i to the outgoing edges of the start state and I-l_i to all other edges.
16  end

Step 5: We perform ε-elimination again to eliminate the ε-edges in E_0 and the ε's we added to separate the capturing groups at the beginning. Then we convert the NFA into a DFA (Theorem A.3) and minimize it using Hopcroft's algorithm (Hopcroft, 1971).
As most steps are either trivial or based on a provided theorem, we only prove the correctness of the following lemma.
Lemma A.5. Given an NFA, Algorithm 5 produces an equivalent NFA whose start state has no outgoing ε-edges and no incoming edges.
Proof. Given an NFA, the algorithm first performs ε-elimination (line 4); by Theorem A.4, the new NFA has no ε-edges and is equivalent to the original one. The later steps (lines 5 to 14) do not produce ε-edges. Therefore the start state has no outgoing ε-edges.
Then we prove that the algorithm from line 4 to line 14 produces an equivalent NFA whose start state has no incoming edges. We denote the original NFA by A and the converted NFA by A′, and let L(A) denote the language of A. We prove that A′ is equivalent to A by showing L(A) ⊆ L(A′) and L(A′) ⊆ L(A). (1) L(A) ⊆ L(A′). The paths that go through the original incoming edges of the start state now go through the new state, because we change the destinations of these incoming edges from the start state to the new state (lines 8 and 9). They can still reach the final states because we copy all outgoing edges of the start state to the new state (line 12). The paths that do not use these incoming edges do not go through the new state and reach the final states as before.
(2) L(A′) ⊆ L(A). Similarly, the paths that go through the new state correspond to paths through the incoming edges of the start state in the original NFA; as the outgoing edges of the start state and the new state are the same, these paths reach the final states in A. For the paths that do not go through the new state, we can remove the new state and its edges, which yields A with the incoming edges of the start state removed. This completes the proof.

B FST to i-FST
We present the conversion algorithm below (Algorithm 6).

C CP decomposition (CPD)
CPD is also known as tensor rank decomposition. An N-th-order tensor T ∈ R^{d_1×d_2×···×d_N} can be approximated as a sum of R rank-1 tensors:

T ≈ Σ_{r=1}^{R} a_r^(1) ⊗ a_r^(2) ⊗ · · · ⊗ a_r^(N),   i.e.,   T_(1) ≈ A^(1) · (A^(N) ⊙ · · · ⊙ A^(2))^T

where A^(n) = [a_1^(n), · · · , a_R^(n)] ∈ R^{d_n×R}, ⊗ denotes the outer product, ⊙ denotes the Khatri-Rao product, and T_(1) denotes the mode-1 unfolding of T. If the rank R is large enough (e.g., no smaller than the tensor rank of T), the decomposition can be exact.
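A toy numeric check of the CPD identity for a 3rd-order tensor (dimensions and names are arbitrary): the rank-R sum of outer products matches the mode-1 unfolding written with the Khatri-Rao product.

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, d3, R = 3, 4, 5, 2
A1, A2, A3 = rng.random((d1, R)), rng.random((d2, R)), rng.random((d3, R))

# Sum of R rank-1 tensors.
T = np.einsum("ir,jr,kr->ijk", A1, A2, A3)

# Column-wise Khatri-Rao product A3 ⊙ A2, then the unfolding identity
# T_(1) = A1 @ (A3 ⊙ A2)^T.
khatri_rao = np.einsum("kr,jr->kjr", A3, A2).reshape(d3 * d2, R)
T1 = A1 @ khatri_rao.T
```

The mode-1 unfolding here follows the convention in which the last mode varies slowest among the columns.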
We also apply the tricks mentioned in Jiang et al. (2020)'s Appendix B to speed up the decomposition and normalize the decomposed matrices.

D Gated Variants
We add an update gate z_t and a reset gate r_t into the forward and backward score computation of FSTRNN and get FSTGRU. Take the forward score as an example. We compute z_t and r_t from v_t and α_{t−1}:

z_t = σ(v_t · W_z + α_{t−1} · U_z + b_z),   r_t = σ(v_t · W_r + α_{t−1} · U_r + b_r)

where σ is the sigmoid activation and W_z, W_r, U_z, U_r, b_z, b_r are the trainable gate parameters. We apply these gates to the forward score computation, and Eq. 1 becomes:

g = ((r_t • α_{t−1}) · D_S) • v_t,   α̃_t = tanh((g · D_E^T) • o^T),   α_t = (1 − z_t) • α_{t−1} + z_t • α̃_t

We initialize W_z, W_r, U_z, U_r randomly using Xavier normal initialization (Glorot and Bengio, 2010) and the bias terms b_z, b_r to a large value (e.g., 10). Therefore, the gates are effectively closed at the beginning so that our model still approximates the FST; after training, the gating mechanism can further improve performance.
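The gated forward update can be sketched numerically (a GRU-style sketch under our reading of the equations; the parameter shapes, small random weights, and the bias value 10 are our choices): with large gate biases, both gates saturate near 1 and the update reduces to the ungated score computation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
R, K = 4, 3
# Small random gate weights, large biases: gates start out saturated.
W_z, W_r = rng.normal(size=(R, K)) * 0.01, rng.normal(size=(R, K)) * 0.01
U_z, U_r = rng.normal(size=(K, K)) * 0.01, rng.normal(size=(K, K)) * 0.01
b_z = b_r = 10.0
D_S, D_E = rng.random((K, R)), rng.random((K, R))
o = rng.random(K)

v_t, alpha_prev = rng.random(R), rng.random(K)
z_t = sigmoid(v_t @ W_z + alpha_prev @ U_z + b_z)   # update gate, ~1
r_t = sigmoid(v_t @ W_r + alpha_prev @ U_r + b_r)   # reset gate, ~1
g = ((r_t * alpha_prev) @ D_S) * v_t
alpha_tilde = np.tanh((g @ D_E.T) * o)
alpha_t = (1.0 - z_t) * alpha_prev + z_t * alpha_tilde
```

Because z_t ≈ 1 and r_t ≈ 1 at initialization, alpha_t is numerically close to the ungated tanh update, so the FST approximation is preserved.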

E Priority
We show how we resolve priority conflicts with an example. For the sentence "I love action movie", the following RE can capture "movie" and "action movie" at the same time:

w* <movie|action movie>[movie] w*

So the word "action" can be tagged as B-movie or O, and the word "movie" can be tagged as B-movie or I-movie. Assuming that we prefer longer captured text, we can define the following priorities: B-movie > O, I-movie > O, I-movie > B-movie, where the > operator means that the label on the left has higher priority than the label on the right. These priorities can be written as logic rules. For example, the priority B-movie > O can be expressed as:

MATCH(O) ∧ ¬MATCH(B-movie) ⇒ LABEL(O)

We can encode this logic into our model using soft logic. Let L_a and L_b be proposition symbols whose soft truth scores are a and b; the soft version of ¬L_a is 1 − a, and the soft version of L_a ∧ L_b is max(0, a + b − 1). So the soft version of the rule above is max(0, b − a), which can be easily implemented with an L × L matrix followed by a ReLU nonlinearity. The priority layer is optional, as conflicts seldom occur in our experiments.
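The soft-logic priority computation can be sketched as a matrix plus ReLU (a toy example; the label ordering and matrix entries are our illustration of the scheme, not the paper's learned layer):

```python
import numpy as np

# Labels: index 0 = B-movie, 1 = I-movie, 2 = O.
labels = ["B-movie", "I-movie", "O"]
c = np.array([0.8, 0.0, 0.6])       # B-movie and O both fire on the same word

# Each row computes the soft score of one label after applying the priorities.
# Row for O implements max(0, score(O) - score(B-movie) - score(I-movie)):
# O survives only if no higher-priority movie label fires.
M = np.array([[ 1.0,  0.0, 0.0],    # B-movie keeps its own score
              [ 0.0,  1.0, 0.0],    # I-movie keeps its own score
              [-1.0, -1.0, 1.0]])   # O loses to both B-movie and I-movie
c_priority = np.maximum(0.0, M @ c)
```

Here the O score (0.6) is suppressed to 0 because the higher-priority B-movie fires with score 0.8.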

F Writing REs
We try to mimic the RE annotating procedure in real applications. In industry, REs for slot filling applications are written by RE experts with domain knowledge (or with the help of a domain expert).
In our experiments, we ask an RE expert to write RE rules for these datasets. As the expert does not have domain knowledge and is not familiar with the data, we follow the method of Luo et al. (2018) to 'teach' the expert domain knowledge. More specifically, we sample 40 shots of training data and ask the expert to write REs that capture them. It takes the expert around 6 hours to write rules for ATIS, 8 hours for SNIPS, and 4 hours for ATIS-ZH. In reality, experts usually do have domain knowledge, so the writing process can be further accelerated and fewer or even no examples are required. We show examples of the written REs for each dataset in Table 7.

G Number of trainable parameters
We set the word embedding dim D = 100, the rank R = 150, the number of FST states K = 150, the number of slot labels L = 50, and the number of hidden states H = 100 to calculate the number of trainable parameters of our method and baselines. We show the results in Table 8.

H Hyper-parameter tuning
We tune the hyper-parameters of our methods and the baselines on the development set using grid search. We report the grids in Table 9.

I Analysis on η
We show how η influences the performance of our model on SNIPS in the zero-shot and rich-resource settings (Fig. 5). The best η in the zero-shot setting is 0.9 instead of 1, which means integrating some word information can improve the rules even without any training. After training with 100% of the training data, a small η (e.g., 0.1) performs best because word embeddings can help model learning when data is sufficient.

J Full results
We show the full results with BiLSTM and +io baselines, and standard deviations of three datasets in Table 10, Table 11 and Table 12.