Knowledge-Aware Meta-learning for Low-Resource Text Classification

Meta-learning has achieved great success in leveraging historically learned knowledge to facilitate the learning of new tasks. However, merely learning knowledge from historical tasks, as current meta-learning algorithms do, may not generalize well to testing tasks that are not well supported by training tasks. This paper studies a low-resource text classification problem and bridges the gap between meta-training and meta-testing tasks by leveraging external knowledge bases. Specifically, we propose KGML, which introduces an additional representation for each sentence learned from an extracted sentence-specific knowledge graph. Extensive experiments on three datasets demonstrate the effectiveness of KGML under both supervised and unsupervised adaptation settings.

Meta-learning has been shown to dominate self-supervised pretraining techniques such as masked language modeling (Devlin et al., 2018) when the training tasks are representative enough of the tasks encountered at test time (Bansal et al., 2019, 2020). However, in practice, it requires access to a very large number of training tasks (Al-Shedivat et al., 2021), and, especially in the natural language domain, mitigating the discrepancy between training and test tasks becomes non-trivial due to new concepts or entities that may be present only at test time.
In this paper, we propose to leverage external knowledge bases (KBs) in order to bridge the gap between the training and test tasks and enable more efficient meta-learning for low-resource text classification. Our key idea is to compute additional representations for each sentence by constructing and embedding sentence-specific knowledge graphs (KGs) of entities extracted from a knowledge base shared across all tasks (e.g., Fig. 1). These representations are computed using a graph neural network (GNN) which is meta-trained end-to-end jointly with the text classification model. Our approach is compatible with both supervised and unsupervised adaptation of predictive models.
Related work. In modern meta-learning, there are two broad categories of methods: (i) gradient-based (Finn et al., 2017; Nichol and Schulman, 2018; Li et al., 2017; Zintgraf et al., 2019; Lee and Choi, 2018; Yao et al., 2019, 2020) and (ii) metric-based (Vinyals et al., 2016; Snell et al., 2017; Yoon et al., 2019). The first category of methods represents the "meta-knowledge" (i.e., transferable knowledge shared across all tasks) in the form of an initialization of the base predictive model. Methods in the second category represent meta-knowledge in the form of a shared embedding function that allows constructing accurate non-parametric predictors for each task from just a few examples. Both classes of methods have been applied to NLP tasks (e.g., Han et al., 2018; Bansal et al., 2019; Gao et al., 2019); however, methods that can systematically leverage external knowledge sources, which are typically available in many practical settings, are only starting to emerge and so far focus on limited application scopes (e.g., Qu et al., 2020; Seo et al., 2020).

Contributions.
1. We investigate a new meta-learning setting where few-shot tasks are complemented with access to a shared knowledge base (KB).
2. We develop a new method (KGML) that can leverage an external KB and bridge the gap between the training and test tasks.
3. Our empirical study on three text classification datasets (Amazon Reviews, Huffpost, Twitter) demonstrates the effectiveness of our approach.

Preliminaries
We consider the standard meta-learning setting, where given a set of training tasks T_1, ..., T_n, we would like to learn a good parameter initialization θ for a predictive model f_θ such that it can be quickly adapted to new tasks given only a limited amount of data (i.e., the few-shot regime). Each task T_i has a support set D^s_i of labeled or unlabeled sentences and a query set D^q_i used for evaluation. In our text classification setup, we assume that the parameters θ are split into two subsets: (1) BERT (Devlin et al., 2018) parameters θ_B shared across tasks and (2) task-specific parameters θ^c that are adapted for each task. Below, we discuss two adaptation strategies: supervised and unsupervised.
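For concreteness, the episodic N-way K-shot setup described above can be sketched as follows. This is a minimal illustration under our own naming (`Task`, `sample_task` are not part of the paper's code): each episode draws N classes, K labeled support sentences per class, and a disjoint set of query sentences.

```python
import random
from collections import namedtuple

# An episode: a support set for adaptation and a query set for evaluation.
Task = namedtuple("Task", ["support", "query"])

def sample_task(data_by_class, n_way=5, k_shot=1, n_query=5, rng=random):
    """Sample one N-way K-shot episode from a dict {class_label: [sentences]}.

    Classes are relabeled 0..N-1 within the episode, as is standard in
    few-shot classification.
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for new_label, cls in enumerate(classes):
        examples = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, new_label) for x in examples[:k_shot]]
        query += [(x, new_label) for x in examples[k_shot:]]
    return Task(support=support, query=query)
```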

Supervised adaptation
Under the supervised adaptation scenario, we incorporate knowledge into both gradient-based and metric-based meta-learning, detailed as follows.

Gradient-based meta-learning. Following Finn et al. (2017), the task-specific parameters θ^c_i for each task T_i are adapted by finetuning them on the support set:

θ^c_i = θ^c − α ∇_{θ^c} L(f_{θ_B, θ^c}; D^s_i),   (1)

where L is the cross-entropy loss and α is the inner-loop learning rate. Then, using the query set D^q_i, we evaluate the post-finetuning model and optimize the model initialization:

min_θ Σ_{i=1}^n L(f_{θ_B, θ^c_i}; D^q_i).

At evaluation time, the initialization parameters θ are adapted to each test task T_t by finetuning on the corresponding support set D^s_t.

Metric-based meta-learning. Following the Prototypical Network (ProtoNet; Snell et al., 2017), the task-specific parameters θ^c_i take the form of a non-parametric classifier built upon the class prototypes

c_k = (1 / |D^s_{i,k}|) Σ_{(x^s_{i,j}, y^s_{i,j}) ∈ D^s_{i,k}} f_{θ_B}(x^s_{i,j}),

where D^s_{i,k} is the subset of support sentences belonging to class k. Then, for each sentence in the query set, the probability of assigning it to class k is calculated as:

p(y^q_{i,j} = k | x^q_{i,j}) = exp(−d(f_{θ_B}(x^q_{i,j}), c_k)) / Σ_r exp(−d(f_{θ_B}(x^q_{i,j}), c_r)),   (2)

where d is a distance measure. During the meta-training phase, ProtoNet learns a well-generalized embedding function θ_B. The meta-learned θ_B is then applied to each meta-testing task, where every query sentence is assigned to the class with the highest probability (i.e., ŷ^q_{t,j} = arg max_r p(y^q_{t,j} = r | x^q_{t,j})).
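The prototype computation and the ProtoNet classification rule of Eq. (2) can be sketched numerically. In this illustration, plain NumPy vectors stand in for the BERT embeddings f_{θ_B}(x), and squared Euclidean distance is one common choice for d; function names are ours.

```python
import numpy as np

def prototypes(support_emb, support_y, n_classes):
    """Class prototypes c_k: mean embedding of support sentences of class k."""
    return np.stack([support_emb[support_y == k].mean(axis=0)
                     for k in range(n_classes)])

def protonet_probs(query_emb, protos):
    """p(y = k | x): softmax over negative squared Euclidean distances (Eq. 2)."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

Each query sentence is then assigned to the class with the highest probability, i.e., `probs.argmax(axis=1)`.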

Unsupervised adaptation
When labeled support sets D^s_i are not available, we follow ARM-CML. For each task T_i, we use the shared BERT encoder to compute a representation of each query sentence x^q_{i,j}, which returns an embedding vector f_{θ_B}(x^q_{i,j}). Then, we compute the overall representation of the task by averaging these embedding vectors, c_i = (1 / |D^q_i|) Σ_j f_{θ_B}(x^q_{i,j}). This task representation is then used as an additional input to the sentence classifier, which is trained end-to-end. The meta-training process can be formally defined as minimizing min_θ Σ_{i=1}^n L(f_θ(·; c_i); D^q_i). Note that, to enable unsupervised adaptation, ARM-CML learns to compute accurate task embeddings c_i from unlabeled data instead of relying on finetuning.
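The ARM-CML context computation above reduces to a simple average of the query embeddings, which is then fed to the classifier alongside each sentence representation. A minimal sketch, with NumPy arrays standing in for BERT embeddings and concatenation as one plausible way of conditioning the classifier on the context (function names are illustrative):

```python
import numpy as np

def arm_cml_context(query_embs):
    """Task context c_i: the average of the query-sentence embeddings of T_i."""
    return np.mean(query_embs, axis=0)

def fuse_with_context(query_embs, context):
    """Attach the shared task context to every query embedding, mirroring
    how ARM-CML provides the context as an additional classifier input."""
    tiled = np.broadcast_to(context, query_embs.shape)
    return np.concatenate([query_embs, tiled], axis=1)
```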

Approach
In this section, we present the proposed KGML framework (Fig. 2), which enhances the supervised and unsupervised adaptation methods described in the previous section with external knowledge extracted from a shared KB. In the following subsections, we elaborate on the key components of KGML: (1) extraction and representation of sentence-specific knowledge graphs (KGs) and (2) knowledge fusion.

KG Extraction and Representation
For each sentence x_{i,j}, we propose to extract a KG G_{i,j} = (N_{i,j}, E_{i,j}). The nodes N_{i,j} of the graph correspond to entities in the sentence x_{i,j}, and the edges E_{i,j} correspond to relations between these entities. The relations are extracted from the KB shared across all tasks. Notice that some entities are not directly related to each other in the KB. To increase the density of the graphs, we therefore "densify" the extracted KG with additional edges by constructing a k-nearest-neighbor graph (k-NNG) based on the node embeddings. More details of the KG construction algorithm are provided in Appendix A.
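The k-NNG densification step can be sketched as follows. This is an illustrative implementation under our own naming, using squared Euclidean distance between node embeddings; edges are stored as sorted (undirected) pairs.

```python
import numpy as np

def knn_densify(node_emb, edges, k=2):
    """Augment a sparse edge set with k-nearest-neighbor edges computed
    from node embeddings, returning the densified undirected edge set
    (conceptually, G = G_base ∪ G_knn)."""
    n = len(node_emb)
    dense = set(edges)
    # Pairwise squared Euclidean distances between node embeddings.
    d2 = ((node_emb[:, None, :] - node_emb[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # exclude self-loops
    for v in range(n):
        for u in np.argsort(d2[v])[:k]:
            dense.add((min(v, int(u)), max(v, int(u))))
    return dense
```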
To compute representations of the sentence-specific KGs, we use graph neural networks (GNNs) (Kipf and Welling, 2016; Zonghan Wu, 2019). In particular, we use GraphSAGE (Hamilton et al., 2017) as the forward propagation algorithm:

h^k_{N(v)} = AGG_k({h^{k−1}_u, ∀u ∈ N(v)}),
h^k_v = σ(W^k · CONCAT(h^{k−1}_v, h^k_{N(v)})),   (3)

where W^k (∀k ∈ {1, ..., K}) are the weight matrices of the GNN, N(v) is the neighborhood set of node v, and h^k_v denotes the representation of node v in the k-th convolutional layer (h^0_v is the input feature). σ and AGG_k denote the non-linearity and the aggregator function, respectively.
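As an illustration, a common mean-aggregator variant of a GraphSAGE layer can be written with separate self and neighbor projections (equivalent to one projection over the concatenation). This is a sketch, not the exact configuration used in the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def graphsage_layer(h, neighbors, W_self, W_neigh):
    """One GraphSAGE layer with a mean aggregator:
    h_v^k = ReLU(W_self · h_v^{k-1} + W_neigh · mean_{u ∈ N(v)} h_u^{k-1})."""
    out = []
    for v in range(len(h)):
        nbrs = neighbors[v]
        agg = np.mean(h[nbrs], axis=0) if nbrs else np.zeros_like(h[v])
        out.append(relu(W_self @ h[v] + W_neigh @ agg))
    return np.stack(out)

def graph_readout(h):
    """Mean-pool the final node representations into the graph embedding g_{i,j}."""
    return h.mean(axis=0)
```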
After passing each graph G_{i,j} through the graph neural network, we aggregate all node representations {h^K_v | v ∈ N_{i,j}} and output the graph embedding g_{i,j} as the holistic representation of the knowledge graph.

Knowledge Fusion
To bridge the distribution gap between the meta-training and meta-testing stages, we integrate the information extracted from the knowledge graph into the meta-learning framework. Let f_{θ_B}(x_{i,j}) denote the sentence representation. For each sentence, we design another aggregator, AGG_kf, to aggregate the information captured by the sentence representation f_{θ_B}(x_{i,j}) and the corresponding knowledge graph representation g_{i,j}.
Specifically, the aggregator is formulated as:

f̃_{θ_B}(x_{i,j}) = AGG_kf(f_{θ_B}(x_{i,j}), g_{i,j}).

There are various choices of aggregators (e.g., fully connected layers, recurrent neural networks); we detail the selection of aggregators in Appendix D. Then, we replace the sentence representation f_{θ_B}(x_{i,j}) by f̃_{θ_B}(x_{i,j}) in the meta-learning framework. We denote all parameters related to knowledge graph extraction and knowledge fusion as φ. Notice that φ is globally shared across all tasks in MAML, since we aim to connect the knowledge among them. In Alg. 1 and Alg. 3 (Appendix B), we show the meta-learning procedure of the proposed model under supervised and unsupervised adaptation, respectively.
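With a single fully connected layer as AGG_kf (the configuration reported in Appendix D), the fusion step can be sketched as follows; the weights `W`, `b` and the tanh non-linearity here are illustrative stand-ins:

```python
import numpy as np

def fuse(sent_emb, graph_emb, W, b):
    """Illustrative AGG_kf: one fully connected layer over the concatenated
    sentence and graph embeddings, yielding the fused representation that
    replaces the plain sentence embedding in the meta-learning framework."""
    z = np.concatenate([sent_emb, graph_emb])
    return np.tanh(W @ z + b)
```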

Experiments
In this section, we show the effectiveness of the proposed KGML on three datasets and conduct related analytical studies.

Dataset Description
Under supervised adaptation, we use two text classification datasets. The first is Amazon Review (Ni et al., 2019), where the goal is to classify the category of each review. The second is a headline category classification dataset, Huffpost (Misra, 2018), where the goal is to classify news headlines. We adopt the traditional N-way K-shot few-shot learning setting (Finn et al., 2017) on these datasets (N=5 for both Huffpost and Amazon Review). For unsupervised adaptation, we use a federated sentiment classification dataset, Twitter (Caldas et al., 2018), to evaluate the performance of KGML. Each task in Twitter consists of the sentences of one user. Detailed data descriptions are given in Appendix C.
Under the unsupervised adaptation scenario, KGML is compared with the following four baselines: empirical risk minimization (ERM), upweighting (UW), domain adversarial neural network (DANN) (Ganin and Lempitsky, 2015), and adaptive risk minimization (ARM). Here, we report the performance with all users and with 60% of users for meta-training.
In both scenarios, accuracy is used as the evaluation metric, and all baselines use ALBERT (Lan et al., 2019) as the encoder. WordNet (Miller, 1995) is used as the knowledge base. All other hyperparameters are reported in Appendix D.

Overall Performance
The overall performance of all baselines and KGML is reported in Table 1. The results indicate that KGML achieves the best performance in all scenarios by using knowledge bases to bridge the gap between the meta-training and meta-testing tasks. Additionally, under the supervised adaptation scenario, the improvements on Amazon Review are larger than those on Huffpost under the 1-shot setting, indicating that the former has a larger gap between meta-training and meta-testing tasks. One potential reason is that Amazon reviews contain more entities than Huffpost headlines, resulting in more comprehensive knowledge graphs. Another interesting finding is that ARM hurts performance under unsupervised adaptation. However, with the help of the knowledge graph, KGML achieves the best performance, corroborating its effectiveness in learning more transferable representations and further enabling efficient unsupervised adaptation.

Ablation Study
We conduct ablation studies to investigate the contribution of each component of KGML. Two ablation models are considered: I. replacing the aggregator AGG_kf with simple feature concatenation; II. removing the extra edges in the KG introduced by the k-nearest-neighbor graph. The performance of each ablation model and of KGML on Amazon and Huffpost is reported in Table 2. We observe that (1) KGML outperforms model I, demonstrating the effectiveness of the designed aggregator; (2) comparing KGML with model II, the results show that k-NN densification boosts performance. One potential reason is that the k-NN graph densifies the whole network according to the entities' semantic embeddings learned from the original WordNet, which explicitly enriches the semantic information in the neighborhood of each entity. This further benefits the representation learning process and improves performance.

Robustness Analysis
In this subsection, we analyze the robustness of KGML under different settings. Specifically, under supervised adaptation, we vary the number of shots on Huffpost; under unsupervised adaptation, we reduce the number of training users on Twitter. The performance is illustrated in Figures 3a and 3b, respectively (see the comparison between Huffpost-ProtoNet and ProtoNet in Appendix E). From these figures, we observe that KGML consistently improves the performance in all settings, verifying its effectiveness in improving generalization ability.

Discussion of Computational Complexity
We further analyze the computational complexity and report the meta-training time per task in Table 3, where the supervised adaptation results are obtained under the Huffpost 5-shot setting. Though KGML increases the meta-training time to some extent, the whole training process finishes within 1-2 hours. Thus, the additional computational cost is a reasonable trade-off for accuracy.

Conclusion
In this paper, we investigated the problem of meta-learning for low-resource text classification and proposed a new method, KGML. Specifically, by learning representations from extracted sentence-specific knowledge graphs, KGML bridges the gap between meta-training and meta-testing tasks, which further improves the generalization ability of meta-learning. KGML is compatible with both supervised and unsupervised adaptation, and empirical experiments on three datasets demonstrate its effectiveness over state-of-the-art methods.

A Detailed Descriptions of Sentence-specific KG Construction
To construct a holistic knowledge graph with all the entities and relations, we first use an existing knowledge graph (i.e., WordNet (Miller, 1995)) as the sparse knowledge base G_base. Some entities may be nodes with few interactions or may even be isolated. Thus, to connect all entities, the node embeddings in the knowledge base are used to construct a k-NN graph G_knn, which is then combined with the base knowledge graph, yielding the dense knowledge graph G = G_knn ∪ G_base. For each sentence x_{i,j}, we use its entities to query the knowledge graph G, which returns the entity embeddings N_{i,j} and an adjacency matrix A_{i,j}. Each element of A_{i,j} represents the shortest distance between the corresponding entities. Inspired by the Occam's Razor criterion, we compute the Minimum Spanning Tree (MST) (Wikipedia, 2021) w.r.t. all the target entities (other entities and relations on the chosen paths are included) as a concise and informative graphical representation of the sentence. Alg. 2 illustrates the whole knowledge graph extraction process.
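The MST computation over the pairwise shortest-distance matrix A_{i,j} can be sketched with Prim's algorithm. This is a self-contained illustration under our own naming; the actual implementation in the paper may differ.

```python
def mst_edges(A):
    """Prim's algorithm over a symmetric pairwise-distance matrix A,
    returning the n-1 edges of a Minimum Spanning Tree over the entities."""
    n = len(A)
    in_tree = {0}          # start the tree from an arbitrary node
    edges = []
    while len(in_tree) < n:
        best = None
        # Pick the cheapest edge crossing the cut (tree vs. non-tree nodes).
        for v in in_tree:
            for u in range(n):
                if u not in in_tree and (best is None or A[v][u] < A[best[0]][best[1]]):
                    best = (v, u)
        edges.append(best)
        in_tree.add(best[1])
    return edges
```

Each MST edge (u, v) is then expanded into the entities and relations along ShortestPath(u, v) in G, as in Alg. 2.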

B Pseudocodes of KGML
In this section, we add the pseudocode for unsupervised adaptation in Alg. 3.

Algorithm 2 Knowledge Graph Extraction
Require: Dense knowledge graph G
1: for each sentence x_{i,j} do
2:   Use the entities of x_{i,j} to query G and obtain entity embeddings N_{i,j} and adjacency matrix A_{i,j}
3:   Apply the MST algorithm to A_{i,j}, which returns T, the minimum spanning tree w.r.t. the entities in x_{i,j}
4:   Construct the knowledge graph G_{i,j} by including the selected nodes and edges on the paths of T, i.e., G_{i,j} = {(r, s) | ∃(u, v) ∈ T, (r, s) ∈ ShortestPath(u, v)}
5: end for

C Data Statistics
For supervised adaptation, we use Amazon Review and Huffpost to evaluate the performance. Amazon Review contains 28 classes, with 15, 5, and 8 classes for meta-training, meta-validation, and meta-testing, respectively. The Huffpost dataset includes 41 classes in total, of which we use 25, 6, and 10 for meta-training, meta-validation, and meta-testing, respectively. For unsupervised adaptation, the numbers of Twitter users for meta-training, meta-validation, and meta-testing are 741, 92, and 94, respectively.

D Hyperparameter Settings
For all supervised and unsupervised adaptation experiments, we use ALBERT (Lan et al., 2019) as the sentence encoder and GraphSAGE (Hamilton et al., 2017) as the graph encoder. All hyperparameters are selected based on performance on the validation set.

D.1 Supervised Adaptation
The GNN contains two layers, with 64 and 16 neurons, respectively. For the adaptation layers, we adopt two fully connected layers with ReLU activations, each with 64 neurons. The aggregator AGG_kf is one fully connected layer. We set the inner-loop learning rate α and the outer-loop learning rate β to 0.01 and 2e-5, respectively. The number of inner-loop steps is set to 5. We use Adam (Kingma and Ba, 2014) for outer-loop optimization. The maximum number of epochs is 10,000 for Huffpost and 4,000 for Amazon Review.

Algorithm 3 KGML for Unsupervised Adaptation
Require: Task distribution p(T); stepsize β; knowledge base
1: Randomly initialize parameters θ_0, φ
2: while not converged do
   ...
6:   Learn the sentence embeddings f_{θ_B}(x^q_{i,j})
7:   Extract the knowledge graph G^q_{i,j}
8:   For each graph, use the GNN to learn the graph embedding g^q_{i,j} via Eqn. (3)
9:   Fuse the sentence and graph embeddings to obtain the final embedding
   ...

D.2 Unsupervised Adaptation
For the sentence encoder, the number of output dimensions is set to 240. The GNN is composed of two convolutional layers with 64 neurons each, and AGG_k is a mean-pooling operation. We use one fully connected layer for the final aggregation AGG_kf. In the training phase, the learning rate β is set to 1e-4, and we use the Adam (Kingma and Ba, 2014) optimizer with weight decay 1e-5. The contextual support size and the meta batch size are 50 and 2, respectively.

E Additional Results of Robustness Analysis
In Figure 4, we show the comparison between Huffpost-ProtoNet and ProtoNet w.r.t. the number of shots. The results further demonstrate the effectiveness of KGML.