Self-Training using Rules of Grammar for Few-Shot NLU

We tackle the problem of self-training networks for NLU in low-resource environments, where labeled data are few and unlabeled data are plentiful. The effectiveness of self-training comes from increasing the amount of training data during training. Yet it becomes less effective in low-resource settings because the labels predicted by the teacher model on unlabeled data are unreliable. Rules of grammar, which describe the grammatical structure of data, have been used in NLU for better explainability. We propose to use rules of grammar in self-training as a more reliable pseudo-labeling mechanism, especially when there are few labeled data. We design an effective algorithm that constructs and expands rules of grammar without human involvement. Then we integrate the constructed rules as a pseudo-labeling mechanism into self-training. There are two possible scenarios regarding the data distribution: it is either unknown or known prior to training. We empirically demonstrate that our approach substantially outperforms the state-of-the-art methods on three benchmark datasets in both scenarios.


Introduction
Deep learning for natural language understanding (NLU) achieves satisfactory performance in various tasks such as intent detection (ID) or slot filling (SF) (Liu and Lane, 2016; Goo et al., 2018; E et al., 2019; Wang et al., 2020a). Given a sentence $S = w_1 w_2 \cdots w_n$, ID is the task of finding the intent $I$ of $S$ among several possible intents, and SF is the task of identifying keywords in $S$ and tagging $S$ with the correct slot sequence $s_1 s_2 \cdots s_n$ in BIO format.
Recent studies have focused on the data scarcity problems for ID and SF tasks. In low-resource settings with few labeled data, the model performance often becomes poor if it is not properly trained to cope with the problem. Wang et al. (2020b) suggest several approaches for few-shot learning such as data augmentation and self-training (ST). Data augmentation has been popular in image processing and is now being used in NLP applications. However, it is still an open issue whether data augmentation is suitable for NLU, since the quality of augmented data might not be reliable even if it improves the overall performance (Hedderich et al., 2020; Cengiz and Yuret, 2020). On the other hand, ST (Ratsaby and Venkatesh, 1995; Ye et al., 2020) shows promising performance in low-resource settings (Cho et al., 2019). Nevertheless, classic ST is often unreliable since the pseudo-labeling mechanism depends heavily on its own teacher model. This motivates researchers to add more reliable mechanisms to the teacher model and to build better ST models (He et al., 2020; Paul et al., 2019).
* The first two authors contributed equally to this work.
Rules of grammar are one of the oldest techniques to represent knowledge in NLP (Rizos et al., 2019; Jiang et al., 2020), and recently have become popular again for few-shot learning due to the reliability and explainability of rules (Luo et al., 2018; Abro et al., 2020). On the other hand, naive rule extraction easily overfits the source corpus and cannot recognize data outside of the corpus. Generalizing the rules can solve this issue. We propose to use rules of grammar to enhance the teacher model in ST. Because we have formal grammars as rules, we can expand rules by grouping similar fragments and represent data more precisely with an easy grammar modification. We investigate the usefulness of rules of grammar from a quantitative perspective and study how good these rules are for ID and SF, especially in low-resource settings. It is well known that reflecting a real-world data distribution among different classes makes better models when such information is available prior to training the models (Yang et al., 2020). Thus, we consider two scenarios: when the distribution is unknown, and when it is known in advance.
Figure 1: An overview of our self-training procedure using rules of grammar for accurate ID and SF when there are few labeled data. The rules of grammar are expanded by edit-distance, n-gram, and slot-centric methods.
In summary, our research questions are:
RQ1: Are rules of grammar useful in ST for few-shot NLU?
RQ2: Which rule expansion gives the best performance?
RQ3: Does knowing the data distribution in advance make a difference for rules of grammar in ST?

Methodology
Grammar extraction: We replace each slot value in the initial corpus with a slot variable and construct the initial rules of grammar. We denote a rule with annotated intent and slot information. For example, the following is a rule for the intent ground_transport.
It parses the sentence "transportation on Monday" and identifies its intent as ground_transport.
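The extraction step can be illustrated with a minimal sketch, assuming each labeled sentence comes as a word list, a BIO tag list, and an intent. The function name `extract_rule`, the tuple representation of a rule, and the slot label `day_name` are illustrative assumptions rather than details given in the text.

```python
from typing import List, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (intent, rule body); hypothetical representation


def extract_rule(words: List[str], bio_tags: List[str], intent: str) -> Rule:
    """Turn one labeled sentence into an initial rule.

    Every slot span (a B-x tag followed by I-x tags) collapses into a single
    slot variable "$x"; words tagged "O" are kept verbatim.
    """
    if len(words) != len(bio_tags):
        raise ValueError("words and tags must have the same length")
    body: List[str] = []
    for word, tag in zip(words, bio_tags):
        if tag == "O":
            body.append(word)
        elif tag.startswith("B-"):
            body.append("$" + tag[2:])  # open a new slot variable
        elif tag.startswith("I-"):
            continue  # continuation of a span already covered by its variable
        else:
            raise ValueError(f"unexpected tag: {tag}")
    return intent, tuple(body)


# Illustrative input; the slot label "day_name" is an assumption.
rule = extract_rule(
    ["transportation", "on", "Monday"],
    ["O", "O", "B-day_name"],
    "ground_transport",
)
```

Under this representation, any sentence that differs from the source sentence only in the slot value maps to the same rule body, which is exactly the limitation that grammar expansion addresses.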
Grammar expansion: The initial rules can parse only sentences with the same structure and different keywords for slots. For example, a rule can parse the sentence "he leaves" but not "they leave". This problem requires more general rules for parsing diverse sentences. Grammar expansion enables such diversity by relaxing substructures of rules. In particular, we use three grammar expansion methods (edit-distance, n-gram, and slot-centric) and group similar substructures.
Edit-distance expansion: The edit-distance expansion groups the rules according to their structural similarity. The similarity is the edit distance normalized by the maximum length of the rules and is compared against a given threshold. Each group of rules with edit distances lower than the threshold produces a single rule by merging every rule contained in the group. When computing the distance, we prevent edits involving a slot, since slots are incompatible with words or with different slots.
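A minimal sketch of the distance computation described above: a token-level Levenshtein distance in which any edit touching a slot is forbidden, normalized by the longer rule. The greedy grouping strategy in `group_rules` is an assumption, since the text does not specify how groups are formed; the default threshold of 0.3 is the value used for E in the experiments.

```python
from typing import List, Sequence

INF = float("inf")


def is_slot(token: str) -> bool:
    return token.startswith("$")


def rule_edit_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Levenshtein distance over tokens; slots may only align with identical slots."""
    n, m = len(a), len(b)
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and not is_slot(a[i - 1]):  # delete a word from a
                dp[i][j] = min(dp[i][j], dp[i - 1][j] + 1)
            if j > 0 and not is_slot(b[j - 1]):  # insert a word from b
                dp[i][j] = min(dp[i][j], dp[i][j - 1] + 1)
            if i > 0 and j > 0:
                x, y = a[i - 1], b[j - 1]
                if x == y:  # identical word or identical slot
                    dp[i][j] = min(dp[i][j], dp[i - 1][j - 1])
                elif not is_slot(x) and not is_slot(y):  # word substitution
                    dp[i][j] = min(dp[i][j], dp[i - 1][j - 1] + 1)
    return dp[n][m]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    return rule_edit_distance(a, b) / max(len(a), len(b), 1)


def group_rules(rules: List[Sequence[str]], threshold: float = 0.3) -> List[List[Sequence[str]]]:
    """Greedy grouping (an assumption): a rule joins the first group in which
    it is closer than the threshold to every member."""
    groups: List[List[Sequence[str]]] = []
    for rule in rules:
        for group in groups:
            if all(normalized_distance(rule, other) < threshold for other in group):
                group.append(rule)
                break
        else:
            groups.append([rule])
    return groups


r1 = ["Describe", "fare", "code", "$fare_basis_code"]
r2 = ["What's", "fare", "code", "$fare_basis_code"]
r3 = ["What", "does", "code", "$fare_basis_code", "mean"]
groups = group_rules([r1, r2, r3])
```

Here `r1` and `r2` differ by a single word substitution, so they fall into one group, while `r3` is too far from both. Two rules whose slots differ receive an infinite distance and are never merged.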
n-gram expansion: The n-gram expansion aims to maintain the semantics of rules and enables substitution between similar word sequences. The expansion first extracts every 4-word-gram of each rule and counts its occurrences. These word-grams are grouped by their semantic similarity, and each group forms a kind of slot represented by a variable. The same procedure repeats for 3- and 2-word-grams.
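The counting stage of this expansion can be sketched as follows. The sketch assumes that a word-gram is a contiguous window containing no slot variable, which is one reading of the term; the semantic grouping stage is left out because the text does not name the similarity measure.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def word_ngrams(rule: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-token windows of a rule that contain words only."""
    windows = (tuple(rule[i:i + n]) for i in range(len(rule) - n + 1))
    return [w for w in windows if not any(tok.startswith("$") for tok in w)]


def count_ngrams(rules: Iterable[Sequence[str]], n: int) -> Counter:
    """Occurrence counts of word-only n-grams across all rules."""
    counts: Counter = Counter()
    for rule in rules:
        counts.update(word_ngrams(rule, n))
    return counts


rules = [
    ["What", "does", "code", "$fare_basis_code", "mean"],
    ["What", "does", "code", "$fare_basis_code", "stand", "for"],
]
# Processed from longest to shortest, mirroring the 4-, 3-, 2-gram order.
counts_by_n = {n: count_ngrams(rules, n) for n in (4, 3, 2)}
```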
Slot-centric expansion: The slot-centric expansion focuses on slots. It aims to extract every word-only sequence in all rules with common slots. The expansion groups the rules with common slots and replaces the remaining word sequences by a symbol.
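One way to picture the grouping is the sketch below. It assumes that rules are grouped by intent together with their ordered slot sequence, and that the word sequences found at each position around the slots are collected as the alternatives a symbol would stand for. Both choices, and all names, are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple

Rule = Tuple[str, Sequence[str]]  # (intent, rule body); hypothetical representation
Key = Tuple[str, Tuple[str, ...]]  # (intent, ordered slot variables)


def split_body(body: Sequence[str]) -> Tuple[Tuple[str, ...], List[Tuple[str, ...]]]:
    """Split a rule body into its slots and the word-only segments around them."""
    slots: List[str] = []
    segments: List[List[str]] = [[]]
    for tok in body:
        if tok.startswith("$"):
            slots.append(tok)
            segments.append([])  # start the segment following this slot
        else:
            segments[-1].append(tok)
    return tuple(slots), [tuple(seg) for seg in segments]


def slot_centric_groups(rules: Sequence[Rule]) -> Dict[Key, List[Set[Tuple[str, ...]]]]:
    """For each (intent, slots) key, collect the word sequences seen at each position."""
    groups: Dict[Key, List[Set[Tuple[str, ...]]]] = {}
    for intent, body in rules:
        slots, segments = split_body(body)
        positions = groups.setdefault((intent, slots), [set() for _ in segments])
        for index, segment in enumerate(segments):
            positions[index].add(segment)
    return groups


abbr_rules = [
    ("abbr", ["Describe", "fare", "code", "$fare_basis_code"]),
    ("abbr", ["What's", "fare", "code", "$fare_basis_code"]),
    ("abbr", ["What", "does", "code", "$fare_basis_code", "mean"]),
    ("abbr", ["What", "does", "code", "$fare_basis_code", "stand", "for"]),
]
groups = slot_centric_groups(abbr_rules)
```

With this input all four rules share the key `("abbr", ("$fare_basis_code",))`, so one expanded rule could cover every combination of a collected prefix and a collected suffix.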
Combining expansions: A combination of grammar expansions often gives rise to better performance than any individual expansion. For example, a mix of edit-distance and slot-centric methods expands the expressive power from both syntactic and semantic perspectives. See Figure 2 for an example.
Self-training: The ST mechanism uses a teacher model trained on the given labeled data to pseudo-label additional data for successive training. Therefore, the performance of the teacher model is crucial in ST. To make a better teacher model, we use rules of grammar, rather than a trained model alone, as a pseudo-labeling mechanism.
R_abbr → Describe fare code $fare_basis_code
R_abbr → What's fare code $fare_basis_code
R_abbr → What does code $fare_basis_code mean
R_abbr → What does code $fare_basis_code stand for
R_flight → What does flight code $airline_code mean
R_airline → What does the airline code $airline_code stand for
⇓ (slot-centric expansion)

Experiments and Analysis
We run experiments on two scenarios: 1) when the data distribution among classes is unknown and 2) when it is known in advance. We first evaluate how well rules of grammar perform in low-resource settings for RQ1. We then verify the effectiveness of various grammar expansions for RQ2, and examine how the data distribution affects rules of grammar for RQ3.
Dataset: We use three benchmark datasets: ATIS (Hemphill et al., 1990), Snips (Coucke et al., 2018), and Facebook (Schuster et al., 2019). Each sentence in the datasets has an intent and a sequence of slot labels. We split each dataset into train, validation, and evaluation sets with a ratio of 64%, 16%, and 20%, respectively. For the first scenario, when the data distribution among classes is unknown, we build an initial corpus of 10, 20, and 30 sentences for each intent, drawn at random from both the train and the validation sets. For the second scenario, when the distribution is known in advance, we take 1%, 5%, and 10% of the train and the validation sets according to the data distribution of the original corpus. We then erase the labels in the remaining train set, which becomes the unlabeled dataset for ST.
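The two sampling schemes can be sketched as below, assuming each example is a (sentence, intent) pair with slot labels omitted for brevity. Taking all sentences of an intent with fewer than n examples, and keeping at least one sentence per intent in the proportional case, are assumptions not stated in the text; the function names are illustrative.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (sentence, intent); hypothetical representation


def by_intent(data: Sequence[Example]) -> Dict[str, List[Example]]:
    buckets: Dict[str, List[Example]] = defaultdict(list)
    for example in data:
        buckets[example[1]].append(example)
    return buckets


def n_shot_sample(data: Sequence[Example], n: int, seed: int = 0) -> List[Example]:
    """Unknown distribution: draw up to n sentences uniformly for each intent."""
    rng = random.Random(seed)
    sample: List[Example] = []
    for _, items in sorted(by_intent(data).items()):
        sample.extend(rng.sample(items, min(n, len(items))))
    return sample


def proportional_sample(data: Sequence[Example], fraction: float, seed: int = 0) -> List[Example]:
    """Known distribution: draw the same fraction of every intent,
    keeping at least one sentence per intent (an assumption)."""
    rng = random.Random(seed)
    sample: List[Example] = []
    for _, items in sorted(by_intent(data).items()):
        k = max(1, round(fraction * len(items)))
        sample.extend(rng.sample(items, k))
    return sample
```

The n-shot sample gives every intent the same weight regardless of its frequency, whereas the proportional sample preserves the class ratio of the original corpus.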

Experiments
Baseline models: In our empirical tests, we have observed that recent text classification models for few-shot settings such as UST (Mukherjee and Awadallah, 2020) or Delta-training (Jo and Cinarel, 2019) show competitive performance for ID but extremely poor performance for SF due to the nature of the SF task. Thus, we compare our model with the following two baselines, which are designed for the ID/SF tasks explicitly.
1. E et al. (2019) propose a state-of-the-art model (SF-ID network) for ID and SF tasks. The model consists of two subnets (ID subnet and SF subnet) connected bi-directionally. The baseline model $B_{ST}$ is the SF-ID network with the ST mechanism in few-shot settings.
2. Luo et al. (2018) suggest another rule-assisted state-of-the-art model ($B_{RE}$) to solve ID and SF tasks in few-shot settings. $B_{RE}$ uses a bidirectional LSTM model and several components that inject external knowledge via regular expressions. We compare against $B_{RE}$ to show the effectiveness of our rule construction methods.
Experiment Setting: We incorporate our rules of grammar into $B_{ST}$ and evaluate the effect of our method in few-shot settings. We follow the same parameters for $B_{ST}$ as in E et al. (2019). The initial threshold $\Theta$ for differentiating the correct labels is 0.8 for pseudo-labeling, and the value is adjusted along the procedure. We also apply the same metrics as in E et al. (2019): accuracy (= #(correct intents) / #(sentences)) for ID, slot-unit $F_1$ score for SF, and sentence accuracy (SA = #(correct intents & slots) / #(sentences)) for the overall accuracy of model predictions.
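The two ratio metrics defined above can be computed as in the following minimal sketch, assuming each prediction is an (intent, slot sequence) pair aligned with its gold label. The slot-unit $F_1$ score is omitted because it requires span matching; the function names are illustrative.

```python
from typing import Sequence, Tuple

Labeled = Tuple[str, Sequence[str]]  # (intent, BIO slot sequence)


def intent_accuracy(gold: Sequence[Labeled], pred: Sequence[Labeled]) -> float:
    """#(correct intents) / #(sentences)."""
    correct = sum(g[0] == p[0] for g, p in zip(gold, pred))
    return correct / len(gold)


def sentence_accuracy(gold: Sequence[Labeled], pred: Sequence[Labeled]) -> float:
    """#(sentences with correct intent and all slots correct) / #(sentences)."""
    correct = sum(
        g[0] == p[0] and list(g[1]) == list(p[1]) for g, p in zip(gold, pred)
    )
    return correct / len(gold)


gold = [("abbr", ["O", "B-code"]), ("flight", ["O"])]
pred = [("abbr", ["O", "O"]), ("flight", ["O"])]
id_acc = intent_accuracy(gold, pred)  # both intents are correct
sa = sentence_accuracy(gold, pred)    # the first sentence has a slot error
```

SA is the strictest of the three metrics, since a single wrong slot tag makes the whole sentence count as incorrect.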

Results and Analysis
We run each experiment five times and report the average for each test. For the overall comparison, we report the average performance over the different n-shot and k%-sampling settings across the three benchmark datasets for ID and SF, respectively. All our methods achieve their best performance within 10 iterations, showing a high convergence rate.
Figure 4: Average k%-sampling performance for the known data distribution. Red indicates the best score.
RQ1: The experimental results in Figures 3 and 4 answer RQ1: using rules of grammar (G) is viable for ST on few labeled data. In all cases of ID, SF, and SA, we observe a large improvement over the baselines, showing that G is an effective method. Namely, our rule-based ST method without any expansions (G) outperforms $B_{ST}$ and $B_{RE}$. It is also noticeable that the experimental results for both scenarios are very close to SOTA performance on ID and SF tasks.
One concern is whether G still works on unseen data. As G depends heavily on the initial corpus, it is important to analyze how G performs on data that is not in the initial corpus. We observe that, on average, about 60% of the test sentences contain at least one slot value that does not appear in the initial train or validation datasets. G predicts these unseen slots with 57% accuracy on average. This implies that G is inadequate for data outside the initial corpus.
RQ2: While G is more effective than the baselines, the resulting grammar is a direct conversion to context-free grammars (CFGs) and cannot cope with data that vary slightly from the current rules. We resolve this problem by expanding grammars according to the data similarity. We test various grammar expansions: edit-distance with a threshold of 0.3 (E), n-gram (N), slot-centric (S), and their combinations. Among the various combinations, SE shows the best performance. Since S uses slots, which are keywords in sentences, S is appropriate for capturing semantic similarity. With the rules grouped by their semantics, E then uses the structural information of the rules for better expansion. This is why SE shows a substantial improvement.
On unseen data, SE achieves 75% accuracy on average, which is 18 percentage points higher than the accuracy of G. Our expansions aim to overcome the weakness of G, which depends heavily on the initial corpus. The result shows that the expansion SE develops the rules better than the naive extraction G.
RQ3: In practice, the distribution among classes may not be available in advance. Thus, for the few-shot problem, it is more realistic to have uniformly sampled data for each class, i.e., an n-shot test. The experimental result in Figure 3 shows that our method (SE) is effective in the n-shot test, and the performance gain is larger when n is smaller.
For the other case, when the data distribution is known in advance, one can sample labeled data according to the class ratio. Figure 4 shows that our method still outperforms the baselines, but unlike in Figure 3, the improvement of SE is relatively small. This is because of the total amount of labeled data. For instance, the 5% case has more data than the 30-shot case. In such cases, our grammar extraction G already works well, and the extra gain from SE is marginal once a reasonable amount of labeled data is available. On the other hand, in the extreme few-shot settings (e.g., 10-shot, 20-shot, or 1%), SE shows the strongest performance among all combinations.

Conclusions
Recently, rules of grammar have become popular again in deep learning due to their reliability and explainability. We have adapted rules of grammar to ST for NLU in few-shot settings. A major problem when using rules of grammar is that they might not be flexible enough to cope with exceptions or with data that vary from the current rules. We have resolved this problem by expanding rules based on the data/grammar similarity. This gives rise to more precise pseudo-labels in ST and better performance. We have demonstrated that the combination of slot-centric and edit-distance expansions shows the best performance. In both scenarios, whether or not the data distribution is known in advance, our algorithm achieves a substantial improvement. Note that rules of grammar do not depend on specific neural network models, and, therefore, learning with rules of grammar is a complementary solution for the few-shot problem in NLU.