Meta Learning and Its Applications to Natural Language Processing

Deep learning based natural language processing (NLP) has become the mainstream of research in recent years and significantly outperforms conventional methods. However, deep learning models are notorious for being data and computation hungry. These downsides limit the deployment of such models to new domains, languages, countries, or styles, since collecting in-genre data and training models from scratch are costly. The long-tail nature of human language makes these challenges even more significant. Meta-learning, or ‘Learning to Learn’, aims to learn better learning algorithms, including better parameter initialization, optimization strategies, network architectures, distance metrics, and beyond. Meta-learning has been shown to enable faster fine-tuning, converge to better performance, and achieve impressive results for few-shot learning in many applications, and it is one of the most important recent techniques in machine learning. There is a related tutorial at ICML 2019 and a related course at Stanford, but most of the example applications in these materials concern image processing. Meta-learning has great potential in NLP, and some works with notable achievements have been proposed for several relevant problems, e.g., relation extraction, machine translation, and dialogue generation and state tracking. However, it has not attracted the same level of attention as in the image processing community. In this tutorial, we will first introduce meta-learning approaches and the theory behind them, and then review work applying this technology to NLP problems. This tutorial is intended to help researchers in the NLP community better understand this new technology and to promote more research using it.


Brief Description
Deep learning based natural language processing (NLP) has become the mainstream of research in recent years and significantly outperforms conventional methods. However, deep learning models are notorious for being data and computation hungry. These downsides limit the deployment of such models to new domains, languages, countries, or styles, since collecting in-genre data and training models from scratch are costly. The long-tail nature of human language makes these challenges even more significant.
Meta-learning, or 'Learning to Learn', aims to learn better learning algorithms, including better parameter initialization, optimization strategies, network architectures, distance metrics, and beyond. Meta-learning has been shown to enable faster fine-tuning, converge to better performance, and achieve outstanding results for few-shot learning in many applications, and it is one of the most important recent techniques in machine learning. There is a related tutorial at ICML 2019 and a related course at Stanford, but most of the example applications in these materials concern image processing. Meta-learning has excellent potential to be applied in NLP, and some works with notable achievements have been proposed for several relevant problems, e.g., relation extraction, machine translation, and dialogue generation and state tracking. However, it has not attracted the same level of attention as in the image processing community.
In this tutorial, we will first introduce meta-learning approaches and the theory behind them, and then review work applying this technology to NLP problems.

Tutorial Structure and Content
A typical machine learning algorithm, e.g., deep learning, can be considered a sophisticated function: it takes training data as input and outputs a trained model. Today's learning algorithms are mostly human-designed. These algorithms have driven significant progress towards artificial intelligence, but they are still far from optimal; usually, they are designed for one specific task and need a lot of labeled training data. One possible way to overcome these challenges is meta-learning, also known as 'Learning to Learn', which aims to learn the learning algorithm itself. In the image processing research community, meta-learning has been shown to be successful, especially for few-shot learning. It has recently also been adopted in a wide range of NLP applications, which usually suffer from data scarcity. This tutorial has two parts. In part I, we will introduce several meta-learning approaches (estimated 1.5 hours). In part II, we will highlight the applications of meta-learning methods to NLP (estimated 1.5 hours).
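
To make this framing concrete, the short Python sketch below types out the view of a learning algorithm as a function; all names in it (Example, Model, LearningAlgorithm, meta_learn) are purely illustrative and not from the tutorial.

from typing import Callable, List, Tuple

Features = List[float]                         # illustrative input type
Example = Tuple[Features, int]                 # an (input, label) pair
Model = Callable[[Features], int]              # trained model: input -> label
# A learning algorithm maps a training set to a trained model ...
LearningAlgorithm = Callable[[List[Example]], Model]

def meta_learn(tasks: List[List[Example]]) -> LearningAlgorithm:
    # ... and meta-learning searches for a good learning algorithm
    # using many training tasks (body intentionally elided).
    ...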

Part I - Introduction to Meta Learning
We will start with the problem definition of meta-learning, and then introduce the most well-known meta-learning approaches below.

Learning to Initialize
Gradient descent is the core learning algorithm of deep learning, and most of its components are handcrafted. First, we have to determine how to initialize the network parameters. Then the gradient is computed to update the parameters, and the learning rates are set heuristically. Determining these components usually requires experience, intuition, and trial and error. With meta-learning, such hyperparameters can be learned from data automatically. Within this family of approaches, learning a set of parameters for initializing gradient descent, or learning to initialize, is already widely studied. Column (A) of Table 1 lists the NLP papers using learning to initialize; it is the most widely applied meta-learning approach in NLP today. The idea probably spread quickly in NLP because the search for better initialization was already widespread before the development of meta-learning: NLP researchers have applied many transfer learning techniques to find good initialization parameters for a specific task from its related tasks. Here we will not only introduce learning to initialize but also contrast it with typical transfer learning.
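
To ground the idea, here is a minimal first-order sketch of MAML (Finn et al., 2017), the canonical learning-to-initialize method, on a toy family of sine-regression tasks. The two-layer network, the task sampler, and all hyperparameters are illustrative assumptions, not taken from any paper covered in the tutorial.

import torch

def forward(params, x):
    # Two-layer MLP applied with an explicit parameter list, so the same
    # architecture can be run with either the meta-initialization or the
    # task-adapted "fast" weights.
    w1, b1, w2, b2 = params
    return torch.tanh(x @ w1 + b1) @ w2 + b2

def sample_task():
    # Each task is a sine curve with its own amplitude and phase.
    amp, phase = torch.rand(1) * 4 + 0.1, torch.rand(1) * 3.14
    def draw(n=10):
        x = torch.rand(n, 1) * 10 - 5
        return x, amp * torch.sin(x + phase)
    return draw

def inner_adapt(params, x, y, lr=0.01, steps=1):
    # Task-specific gradient steps starting from the shared initialization.
    # Leaving create_graph at False gives the cheaper first-order variant.
    for _ in range(steps):
        loss = ((forward(params, x) - y) ** 2).mean()
        grads = torch.autograd.grad(loss, params)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

# The meta-parameters: the initialization we are learning.
meta_params = [0.1 * torch.randn(1, 40), torch.zeros(40),
               0.1 * torch.randn(40, 1), torch.zeros(1)]
for p in meta_params:
    p.requires_grad_()
meta_opt = torch.optim.Adam(meta_params, lr=1e-3)

for step in range(1000):
    meta_opt.zero_grad()
    for _ in range(4):  # a batch of tasks per meta-update
        draw = sample_task()
        sx, sy = draw()  # support set: used for inner adaptation
        qx, qy = draw()  # query set: measures post-adaptation performance
        fast = inner_adapt([p.clone() for p in meta_params], sx, sy)
        loss = ((forward(fast, qx) - qy) ** 2).mean() / 4
        loss.backward()  # gradient flows back to the initialization
    meta_opt.step()

After meta-training, a few gradient steps on a new task's small support set already fit it well; the full MAML objective would additionally differentiate through the inner updates (create_graph=True), which this sketch avoids for simplicity.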

Learning to Compare
Besides gradient descent-based learning algorithms, there are learning algorithms in which a test example's label is determined by its similarity to the training examples. In this category, how the distance between two data points is computed is crucial. Therefore, a series of approaches have been proposed to learn the distance measures used by such learning algorithms. This category of approaches is also known as metric-based approaches.
Column (B) of Table 1 lists the NLP papers using learning to compare. Natural language is intrinsically represented as sophisticated sequences. Comparing the similarity of two sequences is not trivial, and widely used handcrafted measures, such as Euclidean distance, cannot be applied directly, which motivates the research on learning to compare in NLP.
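
As a concrete instance, below is a minimal sketch of a prototypical network (Snell et al., 2017) for few-shot text classification: the metric is squared Euclidean distance in an embedding space that is itself learned. The bag-of-embeddings encoder and all sizes are illustrative assumptions; practical systems use stronger sentence encoders (e.g., recurrent or pre-trained Transformer models).

import torch
import torch.nn.functional as F

class BagEncoder(torch.nn.Module):
    # Mean-pooled word embeddings as a stand-in sentence encoder.
    def __init__(self, vocab_size=10000, dim=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, ids):  # ids: LongTensor [batch, seq_len], 0 = padding
        return self.emb(ids).mean(dim=1)

def episode_loss(encoder, support_ids, support_y, query_ids, query_y, n_classes):
    # Embed support and query sentences into the learned metric space.
    s, q = encoder(support_ids), encoder(query_ids)
    # Prototype = mean embedding of each class's support examples
    # (support_y: LongTensor with values in 0 .. n_classes-1).
    protos = torch.stack([s[support_y == c].mean(dim=0) for c in range(n_classes)])
    # Classify queries by squared Euclidean distance to each prototype;
    # training the encoder end-to-end is what "learns" the metric.
    dists = torch.cdist(q, protos) ** 2  # [n_query, n_classes]
    return F.cross_entropy(-dists, query_y)

Meta-training loops over episodes: sample n_classes classes with a few support and query sentences each, then minimize episode_loss with an ordinary optimizer over the encoder's parameters.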

Other Methods
Although the above two methods dominate the NLP field at the moment, other meta-learning approaches have also shown their potential. For example, besides parameter initialization, other components of gradient descent, such as learning rates and network architectures, can also be learned. Beyond learning components of an existing learning algorithm, some attempts even have the machine invent an entirely new learning algorithm that goes beyond gradient descent, and there is already some effort towards learning a function that directly takes training data as input and outputs network parameters for the target task. Column (C) of Table 1 lists these methods. The sketch after this paragraph gives one small example of learning a component other than the initialization.
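
The sketch below makes the inner-loop learning rates themselves learnable, in the spirit of Meta-SGD (Li et al., 2017); that reference is outside the tutorial's reading list, and the code reuses forward, sample_task, and meta_params from the earlier MAML sketch, so it is an illustrative assumption rather than a method from the tutorial.

import torch

# One learnable step size per parameter element, trained by the same
# outer loop that trains the initialization.
meta_lrs = [torch.full_like(p, 0.01, requires_grad=True) for p in meta_params]
meta_opt = torch.optim.Adam(meta_params + meta_lrs, lr=1e-3)

def inner_adapt_learned_lr(params, lrs, x, y):
    loss = ((forward(params, x) - y) ** 2).mean()
    grads = torch.autograd.grad(loss, params)
    # Element-wise learned step sizes replace the hand-picked scalar lr;
    # the outer query loss backpropagates into both params and lrs.
    return [p - lr * g for p, lr, g in zip(params, lrs, grads)]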

Part II - Applications to NLP Tasks
There is a growing number of studies applying meta-learning techniques to NLP applications and achieving excellent results. In the second part of the tutorial, we will review these studies, summarized here by application. Please refer to Table 1 for the detailed list of studies we plan to cover in the tutorial.

Text Classification
Text classification has a vast spectrum of applications, such as sentiment classification and intent classification. The meta-learning algorithms developed for image classification can be applied to text classification with slight modification to incorporate domain knowledge in each application (Tan et al., 2019; Geng et al., 2019; Dou et al., 2019; Bansal et al., 2019).

Sequence Labeling
Using a meta-learning algorithm to make a model adapt quickly to new languages or domains is also useful for sequence labeling tasks such as named-entity recognition (NER) (Wu et al., 2020) and slot tagging (Hou et al., 2020). However, typical meta-learning methods developed on image classification may not be optimal for sequence labeling, because sequence labeling benefits from modeling the dependencies between labels, which typical meta-learning methods do not leverage. Techniques such as collapsed dependency transfer have been proposed to tailor meta-learning to the sequence labeling problem (Hou et al., 2020).

Automatic Speech Recognition and Neural Machine Translation
Automatic speech recognition (ASR), neural machine translation (NMT), and speech translation require large amounts of labeled training data, and collecting such data is cost-prohibitive. To facilitate the expansion of such systems to new use cases, meta-learning has been applied for fast adaptation to new languages in NMT (Gu et al., 2018) and ASR (Hsu et al., 2020; Chen et al., 2020b), and for fast adaptation to new accents (Winata et al., 2020b), new speakers (Klejch et al., 2018, 2019), and code-switching (Winata et al., 2020a) in ASR.

Relation Classification and Knowledge Graph Completion
The typical supervised learning approaches to relation classification and to link prediction for knowledge graph completion require a large number of training instances for each relation. However, about 10% of the relations in Wikidata have no more than ten triples (Vrandečić and Krötzsch, 2014), so many long-tail relations suffer from data sparsity. Therefore, meta-learning has been applied to relation classification and knowledge graph completion to improve performance on relations with limited data.

Task-oriented Dialogue and Chatbot
Domain adaptation is an essential task in dialog system building because modern personal assistants, such as Alexa and Siri, are composed of thousands of single-domain task-oriented dialog systems. However, training a learnable model for a task requires a large amount of labeled in-domain data, and collecting and annotating training data for these tasks is costly since it involves real user interactions. Therefore, researchers apply meta-learning to learn from multiple rich-resource tasks and adapt the meta-learned models to new domains with minimal training samples, for dialog response generation (Qian and Yu, 2019) and dialogue state tracking (DST) (Huang et al., 2020). Also, training a personalized chatbot that can mimic speakers with different personas is useful but challenging: collecting many dialogs involving a specific persona is expensive, while capturing a persona from only a few conversations is hard. Thus, meta-learning comes into play for learning a persona from few-shot example conversations (Madotto et al., 2019).

Because meta-learning learns metrics, architectures, or initializations such that the meta-trained model can generalize well to new tasks with limited data, the approach is often used for efficient knowledge transfer between domains and languages, and it has produced many promising results. Meta-learning has the potential to democratize the progress of machine learning and NLP for different domains, languages, and countries in a scalable way.

Prerequisites for the attendees
Attendees should understand derivatives, as covered in introductory calculus, and basic machine learning concepts such as classification, model optimization, and gradient descent.

Reading list
We encourage the audience to read the following papers on well-known meta-learning techniques before the tutorial:
• Learning to Initialize (Finn et al., 2017)
• Learning to Compare (Snell et al., 2017; Vinyals et al., 2016)
• Other Methods (Ravi and Larochelle, 2017; Andrychowicz et al., 2016)

Biographies of Presenters
Ngoc Thang Vu received his Ph.D. (2014) in computer science from Karlsruhe Institute of Technology, Germany. From 2014 to 2015, he worked at Nuance Communications as a senior research scientist and at Ludwig-Maximilian University Munich as an acting professor in computational linguistics. In 2015, he was appointed assistant professor at the University of Stuttgart, Germany. Since 2018, he has been a full professor at the Institute for Natural Language Processing in Stuttgart. His main research interests are natural language processing (esp. speech recognition and dialog systems) and machine learning (esp. deep learning) for low-resource settings.
Shang-Wen Li is a Senior Applied Scientist at Amazon AI. His research focuses on spoken language understanding, dialog management, and natural language generation. His recent interest is transfer learning for low-resource conversational bots. He earned his Ph.D. from the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) in 2016, and received his M.S. and B.S. from National Taiwan University. Before joining Amazon, he worked at Apple Siri researching conversational AI.

Open access
We will allow the publication of our slides and video recording of the tutorial in the ACL Anthology.