Multi-Domain Multilingual Question Answering

Question answering (QA) is one of the most challenging and impactful tasks in natural language processing. Most research in QA, however, has focused on the open-domain or monolingual setting, while most real-world applications deal with specific domains or languages. In this tutorial, we attempt to bridge this gap. First, we introduce standard benchmarks in multi-domain and multilingual QA. In both scenarios, we discuss state-of-the-art approaches that achieve impressive performance, ranging from zero-shot transfer learning to out-of-the-box training of open-domain QA systems. Finally, we will present open research problems that this new research agenda poses, such as multi-task learning, cross-lingual transfer learning, domain adaptation, and training large-scale pre-trained multilingual language models.


1 Overall
Question answering (QA) has emerged as one of the most popular areas in natural language processing (NLP). Established benchmarks such as the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016) are used as a standard testing ground for new models, while open-domain QA benchmarks such as Natural Questions (Kwiatkowski et al., 2019) represent the frontier of what is possible with current NLP technology (Zaheer et al., 2020).
In this tutorial, we will review recent advances in open-domain QA but focus on an area that has received less attention both in research and in past tutorials: multi-domain and multilingual QA. Open-domain QA is of interest for building general-purpose assistants that can answer questions about any topic (Adiwardana et al., 2020). The tutorial materials are available at https://github.com/sebastianruder/emnlp2021-multiqa-tutorial.
Most real-world applications of QA, however, deal with the needs of specific domains. Multi-domain QA is particularly promising as it allows us to adapt models to new domains that are of practical importance, such as answering questions about COVID-19 (Tang et al., 2020).
At the same time, over the course of the last year we have seen the emergence of the first benchmarks for multilingual QA (Lewis et al., 2020; Artetxe et al., 2020; Clark et al., 2020). These benchmarks are a step towards enabling access to technology beyond English and building question answering systems that serve all of the world's approximately 6,900 languages. In addition to introducing standard datasets for multilingual QA, we will discuss advances in cross-lingual learning that made such benchmarks viable for the first time.
We generally aim to highlight methods and techniques that can be applied to adapt to many domains and languages, in order to be useful to the majority of the audience. While multi-domain and multilingual data differ in many ways, both can be formulated as transfer learning problems and approached using a similar set of fundamental tools and principles, which we aim to convey to our audience.
As one example of such a tool, we will cover training procedures for large pre-trained language models (LMs). For multi-domain QA, we will discuss the adaptation of LMs such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019). For multilingual QA, we will teach methods for training LMs on large multilingual supervised and unsupervised data, e.g. XLM-RoBERTa (Conneau et al., 2020) and M4 (Arivazhagan et al., 2019). Notably, our tutorial will highlight the challenges of applying such methods to specific domains and languages. Overall, we aim to provide a set of best practices that will enable researchers and practitioners to train models for their domain and language of interest, covering everything from the nature of the training data to model architectures and hyper-parameter settings.
Type of the tutorial: Cutting-edge.
Prior QA tutorials at ACL: The broader area of question answering has been a staple of tutorials at NLP conferences, e.g. ACL 2018 and ACL 2020. In general, we will demonstrate that techniques from open-domain QA cannot be directly applied to perform QA on new, unseen domains (Tang et al., 2020; Castelli et al., 2020) and emphasize that domain-specific training is necessary. This is the first tutorial to focus specifically on multi-domain and multilingual question answering.
Breadth: The tutorial will cover 90% of work from the QA, machine reading comprehension, domain adaptation, and multilingual literature, and 10% of the presenters' own work.
Diversity: The tutorial will cover multilingual work, including discussions of large multilingual pre-trained language models and QA examples in different languages. We will also discuss how methods scale to different languages and domains, including how much training data is necessary to achieve a certain performance.
Prerequisites: Familiarity with Transformer models and pre-trained language models such as BERT.
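As a concrete reference point for these prerequisites, the sketch below shows how a pre-trained multilingual LM is typically loaded with a span-extraction QA head before fine-tuning. It is a minimal sketch assuming the HuggingFace transformers library; the xlm-roberta-base checkpoint is an illustrative choice, and the QA head is randomly initialized until the model is fine-tuned on labeled QA data.

```python
# Minimal sketch: load a pre-trained multilingual LM with a span-extraction QA head.
# Assumes the HuggingFace `transformers` library; the checkpoint is an illustrative
# choice. The QA head is randomly initialized here and only becomes useful after
# fine-tuning on labeled QA data.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForQuestionAnswering.from_pretrained("xlm-roberta-base")

# Question and context are packed into one input sequence; the model scores each
# token as a potential start or end of the answer span within the context.
inputs = tokenizer(
    "¿Dónde vive ella?",     # question (Spanish)
    "Ella vive en Madrid.",  # context
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.start_logits.shape, outputs.end_logits.shape)
```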

2 Brief Tutorial Outline
This is a 3-hour tutorial; hence, we will divide our time between the following novel topics.

2.1 First half: Multi-Domain QA

1. Open-domain monolingual QA and its limitations [20 mins]: We will begin our tutorial by introducing our audience to existing work on open-domain QA (also known as reading comprehension) and its recent progress on benchmark tasks such as SQuAD (Rajpurkar et al., 2016, 2018) and Natural Questions (Kwiatkowski et al., 2019). We will then survey work on monolingual QA: giving a brief historical background, discussing the basic setup and core technical challenges of the research problem, and describing modern datasets together with the common evaluation metrics and benchmarks. Finally, we will discuss their limitations when applied to unseen closed domains, e.g. movies, information technology (IT), or biomedical questions, and motivate the next section.
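Since exact match (EM) and token-level F1 are the standard metrics for SQuAD-style benchmarks, a minimal sketch of extractive QA inference and scoring is given below. It assumes the HuggingFace transformers library and a SQuAD-fine-tuned checkpoint (the model name is an illustrative choice); the answer normalization is simplified compared to the official SQuAD evaluation script, which additionally strips articles and punctuation.

```python
# Minimal sketch: extract an answer with a SQuAD-fine-tuned model, then score it
# with exact match (EM) and token-level F1. Normalization is simplified relative
# to the official SQuAD evaluation script.
from collections import Counter
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = "The Stanford Question Answering Dataset (SQuAD) was released in 2016."
pred = qa(question="When was SQuAD released?", context=context)["answer"]
gold = "2016"

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print("EM:", int(pred.lower() == gold.lower()), "F1:", round(token_f1(pred, gold), 3))
```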

2. Introduce Multi-domain QA [20 mins]:
We will focus on several recent benchmark datasets, e.g. TechQA (Castelli et al., 2020) and DoQA (Campos et al., 2020), which introduce more realistic QA scenarios. The former introduces a dataset and leaderboard for the IT domain that comes with only a limited amount of training data. The latter requires strong domain adaptation, as QA systems are trained on the "cooking" domain and tested by answering questions about movies and travel. DoQA is rather challenging, as QA systems need to take narrative context into consideration, which most reading comprehension systems do not. We will furthermore discuss recent datasets such as CovidQA (Tang et al., 2020), which focus on emerging domains of practical importance.

3. Modeling and Evaluation [30 mins]:
We will focus on various initial baselines that can be adopted to achieve impressive results via transfer learning on top of large pre-trained language models such as BERT (Devlin et al., 2019). We will also discuss the evaluation methodology, including the various metrics that measure document retrieval and QA performance. Finally, we give an overview of many practical ways to adapt to another domain, such as in-domain pretraining and task-adaptive pretraining, which improves performance by adapting to a task's unlabeled data (Gururangan et al., 2020); see the code sketch below.

2.2 Second half: Multilingual QA

Multilingual QA datasets: We will cover datasets such as DuReader (He et al., 2018) and DRCD (Shao et al., 2018) in Chinese, ARCD (Mozannar et al., 2019) in Arabic, multi-domain QA (Gupta et al., 2018) in Hindi-English, and visual QA (Gao et al., 2016) in Chinese-English. We distinguish between datasets that have been created by obtaining naturally occurring data in a language and those created via translation from SQuAD, e.g. into Korean (Lee et al., 2018; Li et al., 2018), French and Japanese (Asai et al., 2018), and Italian (Croce et al., 2019). Recent datasets such as XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020) cover more languages, while the recently introduced TyDiQA (Clark et al., 2020) and MKQA (Longpre et al., 2020) can be seen as multilingual counterparts to Natural Questions. Three of these datasets are part of XTREME (Hu et al., 2020), a massively multilingual benchmark for testing the cross-lingual generalization ability of state-of-the-art methods. While state-of-the-art models have matched or surpassed human performance on general-purpose monolingual benchmarks such as GLUE (Wang et al., 2019), current methods still fall short of human performance on multilingual benchmarks, despite recent gains (Chi et al., 2020). Multilingual question answering is consequently at the frontier of such cross-lingual generalization. We will generally aim to highlight the settings where current methods fail, showing validation examples in different languages, and highlight best practices for how methods can be adapted to better deal with them.
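To make the domain-adaptation recipe from the Modeling and Evaluation section above concrete, the following minimal sketch continues masked-language-model training on unlabeled in-domain text, in the spirit of Gururangan et al. (2020). It assumes the HuggingFace transformers and datasets libraries; the checkpoint name, file path, and hyper-parameters are illustrative placeholders rather than settings from the works cited above.

```python
# Minimal sketch of domain-/task-adaptive pretraining (Gururangan et al., 2020):
# continue masked-language-model training on unlabeled target-domain text, then
# fine-tune the adapted encoder on labeled QA data.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Unlabeled in-domain passages, one per line (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "in_domain_passages.txt"})
tokenized = corpus["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tapt-roberta", num_train_epochs=1),
    train_dataset=tokenized,
    # Randomly masks 15% of tokens on the fly, as in standard MLM pretraining.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```

After this step, the adapted weights are loaded with a QA head and fine-tuned on the labeled in-domain QA data.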
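The zero-shot cross-lingual transfer setting that underlies XQuAD, MLQA, and XTREME can likewise be sketched in a few lines: a multilingual encoder fine-tuned only on English QA data is evaluated directly on questions in other languages. The sketch assumes the HuggingFace transformers and datasets libraries; the checkpoint, an XLM-R model fine-tuned on English SQuAD 2.0, is an illustrative choice.

```python
# Minimal sketch of zero-shot cross-lingual transfer: a multilingual model
# fine-tuned only on English QA data is applied directly to other XQuAD languages.
from datasets import load_dataset
from transformers import pipeline

# Illustrative checkpoint: XLM-R fine-tuned on English SQuAD 2.0.
qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

for lang in ["es", "hi", "th"]:  # a subset of XQuAD's languages
    xquad = load_dataset("xquad", f"xquad.{lang}", split="validation")
    example = xquad[0]
    pred = qa(question=example["question"], context=example["context"])
    print(lang, "| predicted:", pred["answer"],
          "| gold:", example["answers"]["text"][0])
```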

Open research problems [25 mins]:
Finally, we will discuss challenges and promising research directions for multi-domain and multilingual question answering.
3 Goals
3.1 What are the objectives of the tutorial?
First, we aim to familiarize the audience with the task of monolingual question answering and the latest benchmarks in open-domain QA. We furthermore aim to raise awareness of the challenges of QA across multiple domains and languages, to demonstrate the usefulness of adapting models to such settings, and to teach best practices for different adaptation scenarios.

3.2 Why is this tutorial important to include at ACL?
Multi-domain and multilingual question answering is a key technology for dealing with emerging topics and challenges around the world, such as COVID-19 (Tang et al., 2020). We expect that familiarity with and access to the toolkit of multi-domain, multilingual QA will both enable researchers to make progress on fundamental challenges and allow practitioners to leverage research advances in real-world applications. In addition, highlighting challenges and introducing the audience to techniques for adapting QA models to other languages may contribute to a broader, less English-centric research landscape.

4 Presenters
• Name: Sebastian Ruder Affiliation: DeepMind Email: sebastian@ruder.io Website: http://ruder.io
Sebastian is a research scientist at DeepMind, where he works on transfer learning and multilingual natural language processing. He has been an area chair in machine learning and multilinguality for major NLP conferences, including ACL and EMNLP, and has published papers on multilingual question answering (Artetxe et al., 2020; Hu et al., 2020). He was Co-Program Chair for EurNLP 2019 and has co-organized the 4th Workshop on Representation Learning for NLP at ACL 2019. He has taught tutorials on "Transfer Learning in Natural Language Processing" and "Unsupervised Cross-lingual Representation Learning" at NAACL 2019 and ACL 2019, respectively. He has also co-organized and taught at the NLP Session at the Deep Learning Indaba 2018 and 2019.
Section: Sebastian will teach Multilingual QA during this tutorial (second 1 1/2 hrs).
• Name: Avi Sil Affiliation: IBM Research
His team's system, 'GAAMA', has obtained the top scores on public benchmark datasets (Kwiatkowski et al., 2019), and he has published several papers on question answering (Chakravarti et al., 2019; Castelli et al., 2020; Glass et al., 2020). He is also the Chair of the NLP professional community of IBM. Avi is a Senior Program Committee Member and Area Chair in Question Answering for major NLP conferences, e.g. ACL, EMNLP, and NAACL. He has taught a tutorial at ACL 2018 on "Entity Discovery and Linking" and has organized the workshop on the "Relevance of Linguistic Structure in Neural NLP" at ACL 2018. He is also the track coordinator for the Entity Discovery and Linking track at the Text Analysis Conference.
Section: Avi will teach Multi-Domain QA during this tutorial (first 1 1/2 hrs).