Technical Question Answering across Tasks and Domains

Building an automatic technical support system is an important yet challenging task. Conceptually, to answer a user question on a technical forum, a human expert has to first retrieve relevant documents, and then read them carefully to identify the answer snippet. Despite the huge success researchers have achieved in general-domain question answering (QA), much less attention has been paid to technical QA. Existing methods suffer from several unique challenges: (i) the question and answer rarely overlap substantially, and (ii) the data size is very limited. In this paper, we propose a novel deep transfer learning framework to effectively address technical QA across tasks and domains. To this end, we present an adjustable joint learning approach for the document retrieval and reading comprehension tasks. Our experiments on the TechQA dataset demonstrate superior performance compared with state-of-the-art methods.


Introduction
Recent years have seen a surge of interest in building automatic technical support systems, partially due to the high cost of training and maintaining human experts and the significant difficulty of providing timely responses during peak seasons. Huge successes have been achieved in open-domain QA tasks (Chen and Yih, 2020), especially with the advancement of large pre-trained language models (Devlin et al., 2019). Among them, the two-stage retrieve-then-read framework is the mainstream way to solve open-domain QA tasks, pioneered by Chen et al. (2017): a retriever component finds a document that might contain an answer from a large collection of documents, followed by a reader component that finds the answer snippet in a given paragraph or document. Recently, various pre-trained language models (e.g., BERT) have dominated the encoder design for different open-domain QA tasks (Karpukhin et al., 2020).
However, technical QA poses unique challenges. First, the question and answer can hardly overlap substantially, because the answer typically fills in missing information and actionable solutions to the question, such as steps for installing a software package or configuring an application. Different from factoid questions, which are typically aligned with a span of text in a document (Rajpurkar et al., 2016, 2018), semantic similarities between such non-factoid QA pairs can have a large gap, as shown in Fig. 1. Therefore, the retrieval module in the retrieve-then-read framework might find documents that do not contain correct answers due to the semantic gap in non-factoid QAs (Karpukhin et al., 2020; Lee et al., 2019; Yu et al., 2020b). Second, compared to SQuAD (with more than 100,000 QA pairs), technical-domain datasets typically have a much smaller number of labelled QA pairs (e.g., about 1,400 in TechQA), partially due to the prohibitive cost of creating labelled data. In addition, there are limited real user questions and technical support documents, especially for new tech products and communities. Since pre-trained language models are mainly trained on general-domain corpora, directly fine-tuning them may lead to unsatisfying performance due to the large discrepancy between source tasks (general domains) and target tasks (technical domains) (Chang et al., 2020; Gururangan et al., 2020).
To address the aforementioned challenges, we propose a novel deep transfer learning framework that explores knowledge transfer across tasks and domains (TransTD). TransTD consists of two components: TransT (knowledge transfer across tasks) and TransD (knowledge transfer across domains). TransTD jointly learns the snippet prediction (reading comprehension) task and the matching prediction (document retrieval) task, applying this joint learning on both general-domain QA and target-domain QA.
To address the first challenge of non-factoid QAs, TransT leverages a joint learning model that directly ranks all predicted snippets by reading each pair of query and candidate document. It optimizes matching prediction and snippet prediction in parallel. Compared to two-stage retrieve-then-read methods that only read the most semantically related documents, TransT considers potential snippets in every candidate document. When jointly training these two tasks, snippet prediction pays attention to local correspondence and matching prediction helps understand the semantic relationship from a global perspective, allowing the multi-head attentions in BERT-based encoders to jointly attend to information from different representation subspaces at different positions. Besides, the weights of the two training objectives can be dynamically learned to pay more attention to the more difficult task when training on different data samples.
To address the second challenge of learning with limited data, TransD leverages a deep transfer learning model to transfer knowledge from general-domain QAs to technical-domain QAs. A general-domain QA dataset like SQuAD has a much larger data size and a similar task setting (i.e., snippet prediction). Although the knowledge differs between the two domains, by learning the ability to answer questions in general domains, the model can quickly adapt and learn efficiently when moved to a new domain, reflected in faster convergence and better performance. Transfer learning also helps avoid overfitting on technical QAs with limited data. Specifically, our model first applies multi-task joint learning to general-domain QAs (SQuAD), then transfers the model parameters to initialize training on the target-domain QAs (TechQA), realizing knowledge transfer across domains to address data limitation.
We conducted extensive experiments on the TechQA dataset, using BERT as the base model. Experiments show that TransTD provides superior performance compared with models without knowledge transfer and with other state-of-the-art methods.

Related Work
Open-Domain QA Open-domain textual question answering is a task that requires a system to answer factoid questions using a large collection of documents as the information source, without pre-specifying topics or domains (Chen and Yih, 2020). The two-stage retriever-reader framework is the mainstream way to solve open-domain QA, pioneered by Chen et al. (2017). Recent work has improved this two-stage open-domain QA from different perspectives, such as novel pre-training methods (Lee et al., 2019; Guu et al., 2020), semantic alignment between question and passage (Lee et al., 2019; Karpukhin et al., 2020; Wu et al., 2018), cross-attention-based BERT retrievers (Yang et al., 2019; Gardner et al., 2019), and global normalization across multiple passages (Wang et al., 2019).
Transfer Learning Transfer learning studies how to transfer knowledge from auxiliary domains to a target domain (Pan and Yang, 2009; Jiang et al., 2015; Yao et al., 2019). Recent advances in deep learning combined with transfer learning have achieved great success in a variety of NLP tasks (Ruder et al., 2019). Several research works in this area greatly enrich the applications and techniques of transfer learning for question answering from different perspectives (Min et al., 2017; Deng et al., 2018; Castelli et al., 2020; Yu et al., 2020a). Although transfer learning has been successfully applied to various QA applications, its applicability to technical QA has yet to be investigated. In this work, we focus on leveraging transfer learning to enhance QA in the technical domain.

Research Problem
In the technical support domain, suppose we have a set of questions Q and a large collection of documents D. For each question Q ∈ Q, we aim at finding a relevant document D ∈ D and extracting the answer snippet S = (D_start, D_end) in the document D. Note that the answer may not exist, in which case the relevant document does not exist either. All predicted snippets are ranked by a specific span score, and (usually) the top-1 answer span is chosen to answer the given question.
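The span-extraction part of this problem can be illustrated with a toy sketch in plain Python, assuming per-token start/end scores are already available (the function name and scores below are illustrative, not the paper's implementation); a (0, 0) span encodes "no answer", following the convention described later in the paper.

```python
def best_span(start_scores, end_scores, max_len=50):
    """Return the highest-scoring (start, end) span with start <= end.

    A (0, 0) prediction is read as "no answer". Scores are toy logits,
    standing in for the outputs of a reader model.
    """
    best, best_score = (0, 0), start_scores[0] + end_scores[0]
    for i in range(len(start_scores)):
        for j in range(i, min(i + max_len, len(start_scores))):
            s = start_scores[i] + end_scores[j]
            if s > best_score:
                best_score, best = s, (i, j)
    return best, best_score

# Toy example: token 2 looks like a start, token 4 like an end.
start = [0.1, 0.0, 2.0, 0.0, 0.0]
end = [0.1, 0.0, 0.0, 0.5, 1.5]
span, score = best_span(start, end)
```

The quadratic scan over (i, j) pairs is standard for extractive QA; real implementations restrict `max_len` to keep candidate spans plausible.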

Proposed Framework
In this section, we present our proposed framework for technical QA. Given a query, we first obtain 50 Technotes by issuing the query to the Elasticsearch search engine. Instead of using a document retriever based on semantic similarity between the query and each document, our proposed TransTD jointly optimizes snippet prediction and matching prediction in parallel. Figure 2 illustrates the design of the framework. It uses a multi-task learning method to transfer knowledge across the snippet prediction (reading comprehension) and matching prediction (document retrieval) tasks. This method is further applied to pre-train the model on auxiliary-domain QAs. Furthermore, the weights of the two training objectives are dynamically adjusted by calculating the difference between the real answer snippet and the predicted snippet, so that the model can focus on optimizing the more difficult task when training on different data samples. Lastly, our model has a novel snippet ranking function that uses snippet prediction to obtain an alignment score and linearly combines it with the matching prediction score.
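The overall flow can be sketched in plain Python with hypothetical interfaces (`search_engine` stands in for Elasticsearch, `joint_model` for the BERT-based joint scorer; none of these names come from the paper): every candidate is read and jointly scored, rather than reading only the most semantically similar document.

```python
def technical_qa_pipeline(question, search_engine, joint_model, top_k=50):
    """Sketch of the framework: retrieve candidates, then jointly score
    snippet prediction and matching prediction for each one."""
    candidates = search_engine(question)[:top_k]
    ranked = []
    for doc in candidates:
        span, reading_score, matching_score = joint_model(question, doc)
        ranked.append((matching_score + reading_score, doc, span))
    ranked.sort(key=lambda t: t[0], reverse=True)
    return ranked[0]  # best (score, document, answer span)

# Toy stand-ins for Elasticsearch and the joint model.
fake_search = lambda q: ["doc-a", "doc-b"]
fake_model = lambda q, d: ((0, 3), 0.9, 0.8) if d == "doc-b" else ((0, 0), 0.1, 0.2)
best = technical_qa_pipeline("how to uninstall Data Studio?", fake_search, fake_model)
```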

Knowledge Transfer across Tasks
We build our model upon BERT (Devlin et al., 2019) to jointly optimize the RC and DR tasks. Let Θ denote the BERT encoder parameters. When we apply domain knowledge transfer, introduced in the following section, we initialize Θ with the parameters Θ^(aux) trained on the auxiliary domain; otherwise, we initialize it with the original pre-trained BERT parameters. We have two multi-layer perceptron (MLP) classifiers for the two tasks, whose parameters are denoted by θ_RC and θ_DR, respectively. Both classifiers are randomly initialized. More specifically, the RC classifier predicts answer snippets, and the DR classifier predicts document matching. The joint loss is:

L(Θ, θ_RC, θ_DR) = L_RC(Θ, θ_RC) + λ · L_DR(Θ, θ_DR),    (1)

where λ is a hyper-parameter weighting the DR task over the RC task.
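Eq. (1) can be sketched with toy stand-ins in plain Python (the loss functions below are the standard choices for span extraction and binary matching, used here for illustration; the paper's exact loss heads are not reproduced):

```python
import math

def rc_loss(p_start, p_end, true_start, true_end):
    """Snippet prediction (RC) loss: negative log-likelihood of the true
    start and end positions, given normalized position probabilities."""
    return -(math.log(p_start[true_start]) + math.log(p_end[true_end]))

def dr_loss(p_match, is_match):
    """Matching prediction (DR) loss: binary cross-entropy."""
    return -math.log(p_match) if is_match else -math.log(1.0 - p_match)

def joint_loss(p_start, p_end, true_span, p_match, is_match, lam=4.0):
    """Eq. (1): L = L_RC + lambda * L_DR.

    lambda = 4.0 is the value reported best in the parameter analysis.
    """
    return rc_loss(p_start, p_end, *true_span) + lam * dr_loss(p_match, is_match)
```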
Calculate adjustment factor As shown in Eq. (1), the weights between the two training objectives are only adjusted by a pre-determined hyper-parameter λ. However, for different samples in the dataset, the difficulty of learning snippet prediction and matching prediction differs. The weight of the two training objectives should be dynamically adjusted so that the model can focus on optimizing the more difficult task when training on different data samples. Since non-factoid questions are open-ended questions that often require complex answers, which are mostly sentence-level texts, the positional relationships between the start token and end token of answer snippets fluctuate more than those of factoid answers. Therefore, we take the difference between the real answer snippet and the predicted snippet to measure the difficulty of snippet prediction. Intuitively, when the predicted answer snippet differs significantly from the actual answer snippet (much larger or much smaller), snippet prediction is difficult for the current data sample, so the model should focus on optimizing the reading comprehension part. On the contrary, the model should focus on optimizing the document retrieval part. Formally, the weight-adjustable joint learning loss replaces the fixed λ in Eq. (1) with a weight that depends on this difference.
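Since the exact adjustment formula is not reproduced above, the following is only a plausible sketch of the idea in plain Python: the weight on the RC objective grows with the gap between the predicted and true snippet lengths, and the DR weight shrinks accordingly (the function name and the specific formula are illustrative assumptions, not the paper's).

```python
def adjusted_weights(pred_span, true_span, base_lambda=4.0):
    """Hypothetical sample-dependent weighting: a large length gap
    between predicted and true snippets signals that snippet prediction
    is hard, so more weight moves to the RC loss."""
    pred_len = pred_span[1] - pred_span[0]
    true_len = true_span[1] - true_span[0]
    gap = abs(pred_len - true_len) / max(true_len, 1)
    w_rc = 1.0 + gap                    # harder snippet -> focus on RC
    w_dr = base_lambda / (1.0 + gap)    # easier snippet -> focus on DR
    return w_rc, w_dr
```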

Knowledge Transfer across Domains
Besides transferring across tasks, our framework employs knowledge transfer across domains. We identify a dataset from an auxiliary domain (not the technical support domain), such as SQuAD, and apply the multi-task learning to it. The goal is to learn the BERT encoder parameters Θ^(aux) and the two MLP classifiers θ_RC^(aux) and θ_DR^(aux) by minimizing the joint loss on the auxiliary domain (Eq. 4). Here the encoder is initialized with the original pre-trained BERT parameters. We then initialize the BERT encoder in the target domain, Θ, with Θ^(aux) (used in TransTD-Mean and TransTD-CLS). When λ^(aux) = 0, we apply the single RC task on the auxiliary domain (used in TransTD-single).
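The initialization step can be sketched with a hypothetical dict-based parameter store (real code would use a deep-learning framework's checkpoint loading): the encoder is copied from the auxiliary-domain checkpoint, while both task heads are re-initialized randomly.

```python
import random

def init_classifier(dim=4):
    """Toy random initialization for an MLP head."""
    return [random.gauss(0.0, 0.02) for _ in range(dim)]

def transfer_parameters(aux_checkpoint):
    """Sketch of TransD initialization: Theta <- Theta_aux learned on the
    auxiliary domain (e.g., SQuAD), theta_RC and theta_DR re-initialized."""
    return {
        "encoder": dict(aux_checkpoint["encoder"]),  # transferred
        "rc_head": init_classifier(),                # random
        "dr_head": init_classifier(),                # random
    }

aux = {"encoder": {"layer.0.weight": [0.5, -0.5]}, "rc_head": [1.0], "dr_head": [1.0]}
target = transfer_parameters(aux)
```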

Framework Components
Question and Document Encoder Given a pair of question Q and document D, we first build the concatenated sequence

q = [CLS] Q [SEP] D [SEP],

where [CLS] stands for the classification token and [SEP] separates components in the sequence. The BERT encoder generates a contextualized representation of every token X in the input sequence q, denoted by BERT_Θ(q)[X] ∈ R^d, where d = 1024. So we have a matrix of token representations H ∈ R^{m×d}, where H(k) = BERT_Θ(q)[q[k]] (k is the index of the token).
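The input construction above amounts to a simple concatenation, shown here on plain token lists for illustration (a real tokenizer would also produce segment and attention masks):

```python
def build_input(question_tokens, document_tokens):
    """Builds the sequence q = [CLS] Q [SEP] D [SEP] fed to the encoder."""
    return ["[CLS]"] + question_tokens + ["[SEP]"] + document_tokens + ["[SEP]"]

q = build_input(["how", "to", "uninstall"], ["use", "IM", "to", "uninstall"])
```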
Reader MLP This classifier reads the representation matrix H and computes the score of each token being the start token of the answer snippet, p_start ∈ R^m, and the score of each token being the end token, p_end ∈ R^m.

Matching MLP Suppose we have a representation of the sequence q, denoted by h ∈ R^d. The classifier predicts whether the question Q and document D are aligned, which is a binary variable projected from h:

p_DR = σ(w_DR · h),

where σ is the sigmoid function and w_DR ∈ R^d are trainable parameters. We have two options to produce h from the input sequence q. The first option is to apply mean pooling to the representations of all tokens (used in TransTD-Mean). The second option is to use the representation of the classification token [CLS] (used in TransTD-CLS).

Joint Inference The reader MLP takes a question and document pair and predicts a reading score

S_RC = p_start + p_end − p_start[0] − p_end[0],

where p_(·)[0] denotes the probability of taking the first token of the sequence as the start or end position of the snippet. The joint ranking score of a (Q, D) pair is a linear combination of the reading score and the matching score:

S = α · p_DR + S_RC.

It should be noted that, different from previous work that only leverages the first term in the reading score, i.e., p_start + p_end (Qu et al., 2020), our added second term improves inference performance. This is because during training, the span label of a document that does not contain an answer is set to (0, 0), and such negative documents are the majority. Therefore, (p_start[0] + p_end[0]) reflects the probability that Q and D are not aligned. See Table 4 for experimental comparisons.
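The joint inference score can be sketched numerically in plain Python (toy probabilities; `joint_score` is an illustrative name). The second example shows why the correction term matters: a negative document, whose probability mass sits on position 0, is ranked below a real match.

```python
def joint_score(p_dr, p_start, p_end, span, alpha=1.0):
    """Joint ranking score S = alpha * p_DR + S_RC, where S_RC subtracts
    the position-0 probabilities that act as a "no answer here" signal."""
    i, j = span
    s_rc = p_start[i] + p_end[j] - p_start[0] - p_end[0]
    return alpha * p_dr + s_rc

# A matched document with a confident span vs. a negative document
# whose start/end mass sits on position 0 (the (0, 0) negative label).
pos = joint_score(0.9, [0.05, 0.85, 0.10], [0.05, 0.10, 0.85], (1, 2))
neg = joint_score(0.2, [0.90, 0.05, 0.05], [0.90, 0.05, 0.05], (0, 0))
```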

TechQA Dataset
The TechQA dataset (Castelli et al., 2020) contains actual questions posed by users on the IBM DeveloperWorks forums. TechQA is designed for machine reading comprehension tasks.

Figure 4: The more layers are fine-tuned in the target domain, the better the performance; however, this pattern does not strictly hold in the middle of the range.

Evaluation methods
The accuracy of the extracted snippets is evaluated by Ma-F1 and HA_F1@K. Ma-F1 is the macro average of the F1 scores computed on the first of the K answers provided by the system for each given question. F1@K computes F1 scores for the top-K answer snippets, selects the maximum F1 score, and averages it over all questions. HA_F1@K calculates the macro-average F1 score over all answerable questions. Besides, models are evaluated on retrieving and ranking documents by mean reciprocal rank (MRR) and recall at K (R@K). R@K is the percentage of correct answers ranked in the top K out of all relevant answers. MRR is the average of the reciprocal ranks of the results for a set of queries.
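The retrieval metrics are straightforward to compute; a minimal sketch, assuming each query is represented by the 1-based rank of its first correct document (`None` when no correct document is retrieved):

```python
def mrr(first_correct_ranks):
    """Mean reciprocal rank over all queries; unretrieved queries
    contribute a reciprocal rank of 0."""
    total = sum(1.0 / r for r in first_correct_ranks if r is not None)
    return total / len(first_correct_ranks)

def recall_at_k(first_correct_ranks, k):
    """Fraction of queries whose correct document appears in the top k."""
    hits = sum(1 for r in first_correct_ranks if r is not None and r <= k)
    return hits / len(first_correct_ranks)
```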

Ablation Study
TransT transfers knowledge across tasks on the target domain, with the multi-task RC and DR objectives.
TransD transfers knowledge from source-domain RC to target-domain RC without multi-task learning.
TransTD transfers knowledge across both tasks and domains. TransTD+ is further improved by the adjustable weight. (To avoid confusion between F1, used on the TechQA leaderboard, and F1@K, we use Ma-F1 instead of F1.)

Knowledge transfer across domains
In Table 2, the model that is first fine-tuned on the source-domain QA (SQuAD) and then further fine-tuned on the target-domain QA (TechQA) achieves superior performance over the model fine-tuned only on the target-domain QA. This indicates that knowledge transfer from general-domain QA is crucial for technical QA.

Knowledge transfer across tasks
In Table 2, transferring knowledge across tasks better captures the local correspondence and the global semantic relationship between the question and document. Compared with BERT_RC, TransT improves Ma-F1 by +0.94% and HA_F1@1 by +1.91%.

Across both tasks and domains
In Table 2, transferring knowledge across both tasks and domains further improves model performance. TransTD fine-tunes on SQuAD, then further fine-tunes on TechQA with both the RC and DR tasks. It performs better than TransD and TransT. TransTD+ uses adjustable joint learning, which brings further improvements of +1.7% on Ma-F1 and +2.32% on HA_F1@1 compared to TransTD.

Comparison with retrieve-then-read (two-stage) methods
Using semantic similarity to predict the alignment between query and document is an efficient and accurate method in open-domain QA. It can be statistical (e.g., BM25) (Yang et al., 2019) or neural, jointly optimized with snippet prediction (Karpukhin et al., 2020; Lee et al., 2019). However, as shown in Table 3, with the same encoder (i.e., BERT), our proposed TransTD with its novel snippet ranking function identifies answers more accurately than the above methods. This means our method is more effective for non-factoid QAs, whose query and document semantics are not aligned.
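For reference, the statistical baseline mentioned above can be sketched as a minimal BM25 scorer in plain Python (standard formula with the usual k1 and b defaults; this is the generic algorithm, not the paper's retrieval code). Its purely lexical scoring is exactly what fails when non-factoid questions and answers share few words.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Minimal BM25: sum of IDF-weighted, length-normalized term scores.
    `corpus` is a list of tokenized documents, used for IDF and lengths."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for t in set(query_terms):
        df = sum(1 for d in corpus if t in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        f = tf[t]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["install", "db2", "on", "windows"],
          ["uninstall", "data", "studio", "using", "installation", "manager"]]
s_rel = bm25_score(["uninstall", "studio"], corpus[1], corpus)
s_irr = bm25_score(["uninstall", "studio"], corpus[0], corpus)
```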

Parameter Analysis
Loss ratio In Figure 3, we compare performance across values of the loss ratio between the RC and DR tasks, λ in Eq. (1). We observe that when λ = 4.0, TransTD achieves the best performance for both the RC and DR tasks. If the loss ratio exceeds 4.0, performance decreases significantly. This is because RC helps DR more than DR helps RC, which is consistent with the results in Table 2.

Number of fine-tuning layers As shown in Figure 4, we compare performance across different numbers of fine-tuned layers. Fine-tuning all 24 layers yields the best performance. However, the relationship between model performance and the number of fine-tuned layers is not strictly monotonic. For example, fine-tuning only 12 to 14 layers achieves better performance than fine-tuning 16 or 18 layers, a useful reference for training with limited GPU memory.
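Partial fine-tuning is usually implemented by freezing the lower encoder layers; a minimal sketch, assuming layers are indexed 0 (bottom) to 23 (top) as in a 24-layer encoder (index bookkeeping only; a real implementation would toggle gradient flags on the corresponding parameter groups):

```python
def select_trainable_layers(num_layers=24, num_finetuned=12):
    """Split layer indices into a frozen lower block and a trainable
    upper block of size `num_finetuned`."""
    frozen = list(range(num_layers - num_finetuned))
    trainable = list(range(num_layers - num_finetuned, num_layers))
    return frozen, trainable

frozen, trainable = select_trainable_layers(24, 12)
```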

Error Analysis
As shown in Figure 5, we manually categorized the predictions for the 160 answerable question instances in the development set. First, 107 (64.4%) questions are correctly matched with their corresponding documents through the joint inference of Eq. (12); the other 53 (35.6%) questions are mismatched with documents that do not contain the desired answers. Additionally, among the 107 correct document matches, only 39 (36.4%) are given the correct answer snippet in the best-matching document. Among the 68 wrong snippet predictions, 32 (47.1%) are mismatched with the answer span. Besides, 16 (23.5%) are given a smaller answer span than the actual span, for which the average length of the true answer snippet is 44 words. On the contrary, 20 (29.4%) are given a larger answer span than the actual span, for which the average length is 16 words. We observe that the TechQA dataset offers a challenging yet interesting problem, where answer lengths vary widely and some long answers span multiple sentences.
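The error categories above can be made precise with a small classifier over (start, end) index pairs, following the strict definitions given in the Figure 5 caption ("too small" means strictly inside the true span; "too large" means strictly containing it):

```python
def categorize_span(pred, truth):
    """Categorize a predicted span against the ground truth, following
    the error-analysis taxonomy: correct / too small / too large /
    mismatched."""
    (ps, pe), (ts, te) = pred, truth
    if (ps, pe) == (ts, te):
        return "correct"
    if ps > ts and pe < te:
        return "too small"
    if ps < ts and pe > te:
        return "too large"
    return "mismatched"
```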

Conclusion
In this paper, we studied QA in the technical domain, which has not been well investigated. Technical QA faces two unique challenges: (i) the question and answer rarely overlap substantially (non-factoid questions) and (ii) the data size is very limited. To address these challenges, we proposed a novel deep transfer learning framework that effectively addresses technical QA across tasks and domains. To this end, we presented an adjustable joint learning approach for the document retrieval and reading comprehension tasks. Our experiments on the TechQA dataset demonstrate superior performance compared with non-transfer-learning state-of-the-art methods.
[Figure 1 content: (a) a factoid QA example from SQuAD, asking how many provinces the Ottoman Empire contained in the 17th century, whose answer ("32 provinces") appears verbatim in the passage; (b) a non-factoid QA example from TechQA, asking how to uninstall Data Studio 3.1.1 when the Control Panel uninstall raises a JVM "Could not find the main class" error, whose answer is a multi-step workaround (uninstall all products including Installation Manager, manually clean up remaining packages, directories, and registry entries, then reinstall IM and Data Studio 4.1.2) that shares few words with the question.]

Figure 1 :
Figure 1: Factoid QA is semantically aligned, but non-factoid QA has few overlapping words. Semantic similarity between such non-factoid QA pairs is not indicative.

Figure 2 :
Figure 2: Our framework performs knowledge transfer across tasks and domains. It explores the mutual enhancement between snippet prediction (reading comprehension) and matching prediction (document retrieval), applying multi-task learning to the BERT models on both the auxiliary domain (SQuAD) and the target domain (TechQA).

Figure 3 :
Figure 3: λ is the weight of the DR task loss over the RC task loss. When λ = 4.0, TransTD achieves the best performance for both the RC (left two) and DR (right two) tasks.

Figure 5 :
Figure 5: Error analysis. The left figure shows the proportions of correct and wrong predictions on DR. The right figure shows the proportions of RC results when the retrieval phase already predicts the correct document. (Here, "too small" means that for a prediction S_RC = (D_start^(pred), D_end^(pred)) and truth S = (D_start, D_end), we have D_start^(pred) > D_start and D_end^(pred) < D_end; on the contrary, "too large" means D_start^(pred) < D_start and D_end^(pred) > D_end.)

Table 1 :
Statistics of TechQA. The test set is not publicly available; models must be submitted for evaluation. TechNotes are much longer than the question and answer texts.

Table 2 :
Ablation study on knowledge transfer across tasks and across domains on TechQA. TransTD transfers knowledge across both tasks and domains, and TransTD+ is further improved by the adjustable weight.

Table 3 :
TransTD outperforms two-stage retrieve-then-read methods that retrieve documents based on semantic alignment. k is the number of retrieved documents.

Table 4 :
Our proposed snippet ranking function brings additional improvements. The compared variants include MP-BERT's S = p_DR · p_s · p_e, S = α · p_DR + p_s + p_e (without the correction term), S = p_s + p_e − p_s[0] − p_e[0] (without the matching score), and ours, S = α · p_DR + p_s + p_e − p_s[0] − p_e[0].

Each question in TechQA is associated with a candidate list of 50 Technotes obtained by issuing a query to the Elasticsearch search engine. A question is answerable if an answer snippet exists in the 50 Technotes, and unanswerable otherwise. Data statistics are given in Table 1. The training set has 600 questions, of which 450 are answerable; the validation set has 310 questions, of which 160 are answerable; the test set has 490 questions. The Technotes are usually much longer than the question and answer texts.