Improving Complex Knowledge Base Question Answering via Question-to-Action and Question-to-Question Alignment

Complex knowledge base question answering can be achieved by converting questions into sequences of predefined actions. However, there is a significant semantic and structural gap between natural language and action sequences, which makes this conversion difficult. In this paper, we introduce an alignment-enhanced complex question answering framework, called ALCQA, which mitigates this gap through question-to-action alignment and question-to-question alignment. We train a question rewriting model to align the question with each action, and utilize a pretrained language model to implicitly align the question with KG artifacts. Moreover, since similar questions correspond to similar action sequences, at the inference stage we retrieve the top-k similar question-answer pairs through question-to-question alignment and propose a novel reward-guided strategy to select from the candidate action sequences. We conduct experiments on the CQA and WQSP datasets, and the results show that our approach outperforms state-of-the-art methods, obtaining a 9.88% improvement in the F1 metric on the CQA dataset. Our source code is available at https://github.com/TTTTTTTTy/ALCQA.


Introduction
Complex knowledge base question answering (CQA) aims to answer various natural language questions over a large-scale knowledge graph. Compared to simple questions involving a single relation or a multi-hop chain of relations, complex questions cover more answer types, such as numeric or boolean answers, and require more kinds of aggregation operations, such as min/max or intersection/union, to yield answers. Semantic parsing approaches typically map questions to intermediate logical forms such as query graphs (Yih et al., 2015; Bao et al., 2016; Bhutani et al., 2019; Maheshwari et al., 2019; Lan and Jiang, 2020; Qin et al., 2021), and further transform them into queries in a language such as SPARQL. Recently, many works (Liang et al., 2017; Saha et al., 2019; Ansari et al., 2019; Hua et al., 2020a,b,c) predefine a collection of functions with constrained argument types and represent the intermediate logical form as a sequence of actions that can be generated by a seq2seq model. Sequence-based methods can accommodate more complex operations simply by expanding the function set, making some logically complex questions answerable that are difficult to handle with query graphs.
The seq2seq model has been widely used and has achieved good results on many text generation tasks, such as machine translation, text summarization, and style transfer. In these tasks, the source and target sequences are both natural language texts and thus share some low-level features. However, semantic parsing aims to transform unstructured text into structured logical forms, which requires a difficult alignment between the two. This problem becomes more serious as the complexity of the question rises. Some works propose to address it by modelling the hierarchical structure of logical forms. Dong and Lapata (2016) introduce a sequence-to-tree model with an attention mechanism. Dong and Lapata (2018) propose to first decode a sketch of the logical form containing a set of functions, and then decode low-level details such as arguments. Guo et al. (2021) iteratively segment a span from the question with a segmentation model and parse it with a base parser until the whole query is parsed. Li et al. (2021) use a shift-reduce algorithm to obtain token sequences instead of predicting the start and end positions of each span. However, most of these works require intermediate logical forms or sub-questions to train models, which are usually difficult to obtain. Guo et al. (2021) and Li et al. (2021) propose to first pretrain a base parser and then search for good segments whose predicted sub-logical forms are part of, or can be composed into, the gold meaning representation. They do not strictly require training pairs, but have the limitation that decomposed utterances must be contiguous segments of the original question.
In this paper, we propose a novel framework to improve the alignment between unstructured text and structured logical forms. We decompose the semantic parsing task into three stages: question rewriting, candidate action sequence generation, and action sequence selection. In the question rewriting stage, we use a question rewriting model to explicitly transform a query into a set of utterances, each corresponding to a single action, thus reducing the complexity of the question. We propose a two-phase method to train the rewriting model in the absence of training pairs. In the candidate generation stage, we build a seq2seq model that generates logical forms with a beam search algorithm, treating KG artifacts such as entities as candidate vocabulary during decoding. To further align the question and action sequence, we concatenate a question and a KG artifact as input and encode the pair with a pretrained language model (PLM) such as BERT (Devlin et al., 2018). The cross-attention mechanism of the PLM implicitly aligns the question with KG artifacts, which makes decoding easier. Moreover, we propose to improve complex knowledge base question answering via question-to-question alignment. Motivated by the observation that the more similar two questions are, the more similar their corresponding action sequences will be, we build a memory of question-answer pairs and, in the action sequence selection stage, retrieve a set of pairs as a support set based on their similarity to the current question. We then propose a reward-guided selection strategy that scores each candidate action sequence according to the support set.
Our main contributions are as follows:
• We propose a novel framework that mitigates the gap between natural language questions and structured logical forms through question-to-action alignment and question-to-question alignment.
• We propose a novel question rewriting mechanism that rewrites a question into a more structured form without requiring a dataset or adding any constraints, and employ a reward-guided action sequence selection strategy that utilizes similar question-answer pairs to score candidate action sequences.
• We conduct experiments on several datasets. The results show that our approach is comparable to the state of the art on the WQSP dataset and obtains a 9.88% improvement in the F1 metric on the CQA dataset.

Overview
In this task, given a training set T = {(q_1, a_1), ..., (q_s, a_s)}, where (q_i, a_i) is a question-answer pair, the objective is to transform complex questions into logical forms, which can be further compiled into KG queries that retrieve the answers. We define a logical form as a sequence of actions, each consisting of a function and its arguments. Following NS-CQA (Hua et al., 2020c), we design 16 functions whose arguments comprise numerical values and KG artifacts, including entities, relations, and entity types. We recognize these arguments in a preprocessing step. Denoting the input question as q, the set of predefined functions as F, the question-related numerical values and KG artifacts as the argument set G, and the model parameters as θ, our goal is to maximize the probability P(L | q; θ), where L is the action sequence that produces the correct answers and each token in L belongs to F or G.
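As a toy illustration of this formalism, the sketch below checks an action sequence token by token against F and G. The function names follow the paper's running example, but the argument identifiers are invented placeholders, not real KG identifiers:

```python
# Toy sketch of the action-sequence formalism: each action applies one of the
# predefined functions to arguments drawn from the question-specific set G.
# SelectAll / LessThan / Count appear in the paper's running example; the
# argument names below are illustrative placeholders.

FUNCTIONS = {"SelectAll", "LessThan", "Count"}  # subset of the 16 functions

def is_valid_sequence(actions, argument_set):
    """Every token of L must belong to F (functions) or G (arguments)."""
    return all(
        func in FUNCTIONS and all(arg in argument_set for arg in args)
        for func, args in actions
    )

# Arguments recognized in preprocessing for the example question
# "how many musical instruments can lesser number of people perform with
#  than glockenspiel": an entity type, a relation, and an entity.
G = {"T_musical_instrument", "R_performed_by", "E_glockenspiel"}
L = [
    ("SelectAll", ["T_musical_instrument", "R_performed_by"]),
    ("LessThan", ["E_glockenspiel"]),
    ("Count", []),
]
print(is_valid_sequence(L, G))  # True
```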
As shown in Figure 1, our framework consists of three stages: question rewriting, candidate action sequence generation, and action sequence selection. In the first stage, we rewrite a complex query into a more structured form with a seq2seq model; the details of training this model are described in Section 2.2. The rewritten query is then combined with the original question as input, and a second seq2seq model generates multiple candidate action sequences. Finally, we retrieve the k question-answer pairs most similar to the current question from a pre-constructed memory. Each candidate is then adapted to the KG artifacts of these k questions and scored by comparing the execution results with the respective answers.

Question Rewriting
An action sequence consists of multiple consecutive actions, and it is difficult for the seq2seq model to decide which part of the question to focus on when generating each action. We therefore train a question rewriting model that transforms a query into a set of utterances concatenated by the symbol "#", where each utterance corresponds to a single action. With the rewritten question, the model can focus on a specific part of the question when generating each action in the sequence, thus reducing the difficulty of decoding.
To train the rewriting model, we require an adequate training corpus, which is difficult to obtain. In the absence of a gold dataset, we propose a two-phase approach that converts queries into rewritten questions and uses them to train the rewriting model, as shown in Module 1.

Module 1: Question Rewriting Training
Input: T = {(q_1, a_1), ..., (q_n, a_n)}
Output: M_r, the trained model for rewriting questions
1  Search pseudo action sequences and obtain T' = {(q_1, a_1, L_1), ..., (q_n, a_n, L_n)}, where L_i = {f_1; f_2; ...; f_k} is the pseudo action sequence of (q_i, a_i);
2  Train M_q, which translates action sequences into questions, using T';
3  Q ← ∅;
4  foreach (q_i, a_i, L_i) ∈ T' do
5      q_ori ← q_i;
6      for j ← k down to 1 do
7          Delete f_j from L_i;
8          q_del ← M_q(L_i);
9          q_ij ← Compare(q_ori, q_del);
10         q_ori ← q_del;
11     end
12     Q ← Q ∪ {q_i, {q_i1; q_i2; ...; q_ik}};
13 end
14 Train M_r using Q;

In the first phase (lines 1-2), we employ a breadth-first search algorithm to find pseudo action sequences for some questions, and then train a seq2seq model that translates an action sequence into a query. In the second phase (lines 3-13), we construct a training corpus for question rewriting based on the searched question-logical form pairs and the model trained in the first phase. Specifically, given an action sequence L = {f_1; f_2; ...; f_k}, we delete the last action f_k, back-translate the shorter action sequence into a new query, and compare it with the original question. The tokens that appear in the original question but not in the generated question are the ones the model should focus on most when generating the deleted action. For example, the left part of Figure 1 illustrates the process of decomposing the question "how many musical instruments can lesser number of people perform with than glockenspiel". We first delete the last action "Count()", and the seq2seq model then translates the shortened sequence "SelectAll(...) LessThan(...)" into the query "which musical instruments can lesser number of people perform with than glockenspiel". The words "how many" should receive more attention because they do not appear in the generated question. We iteratively perform the delete, back-translate, and compare operations until the action sequence is empty, and concatenate the comparison results of each step with the symbol "#".
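The delete / back-translate / compare loop can be sketched as follows. `back_translate` stands in for the trained back-translation model M_q; here it is stubbed with a lookup table covering only the running example, whereas the real model is a finetuned seq2seq model:

```python
# Sketch of the second phase of Module 1. back_translate is a stand-in for
# the action-sequence-to-question model M_q; stubbed with a lookup table.

def compare(original, generated):
    """Tokens appearing in the original question but not in the generated one."""
    generated_tokens = set(generated.split())
    return " ".join(t for t in original.split() if t not in generated_tokens)

def rewrite(question, actions, back_translate):
    """Iteratively delete the last action, back-translate the shorter
    sequence, and compare; one utterance is produced per deleted action."""
    utterances = []
    current = question
    while actions:
        actions = actions[:-1]                    # delete the last action
        shorter = back_translate(tuple(actions))  # back-translate the rest
        utterances.append(compare(current, shorter))
        current = shorter
    return " # ".join(utterances)                 # concatenated with '#'

# Stubbed M_q outputs for the example question (illustrative wording).
TABLE = {
    ("SelectAll", "LessThan"):
        "which musical instruments can lesser number of people perform with than glockenspiel",
    ("SelectAll",):
        "musical instruments performed by people",
    ():
        "",
}

q = ("how many musical instruments can lesser number of people "
     "perform with than glockenspiel")
rewritten = rewrite(q, ["SelectAll", "LessThan", "Count"], TABLE.get)
print(rewritten.split(" # ")[0])  # how many
```

As in the paper's example, deleting "Count()" and back-translating leaves exactly the tokens "how many" as the focus of the deleted action.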
Thus, we can construct the question rewriting dataset Q and train the question rewriting model M_r.
To make the rewriting model learn to output KG artifacts in the rewritten query, we concatenate the original question and its KG artifacts as input, and wrap KG artifacts with symbols such as <entity> and </entity>. We initialize the models in both phases with BART (Lewis et al., 2020), an outstanding pretrained seq2seq model that demonstrates high performance on a wide range of generation tasks, and finetune them on the constructed datasets.

Encoder-decoder Architecture
We use BERT and a BiLSTM (Hochreiter and Schmidhuber, 1997) to construct the encoder. Given a question q with n tokens and the argument set G = {g_1, ..., g_m}, where m is the size of the argument set with respect to q and g_i = {g_i1, ..., g_il} is a KG artifact or numerical value with l tokens, we concatenate the question and each argument separately, using [SEP] as the delimiter, to construct the BERT input sequences. From these we obtain the question embedding E_q ∈ R^(n×d_e), and each argument embedding g_i ∈ R^(d_e) by mean pooling over E_{g_i}. We then stack the argument embeddings into a matrix E_G ∈ R^(m×d_e) and feed E_q into a BiLSTM encoder to obtain the final question representation H ∈ R^(n×d_h).

Decoding is implemented with an LSTM. At each time step, the current hidden state s_t ∈ R^(d_s) is updated from the hidden state and output of the previous time step:

s_t = LSTM(s_{t-1}, [o_{t-1}; τ_{t-1}; c_t])

where [;] denotes vector concatenation. o_{t-1} is the embedding of the previous output, obtained from the learnable embedding matrix W_func if the output is a function, or from E_G if it is an argument. τ_{t-1} is a vector obtained from the learnable embedding matrix W_type according to the type of the previous output. c_t is the context vector computed as the attention-weighted summation of h_i, the i-th row of the question representation H, where the attention scores are computed from the decoder state through the projection matrix W_a ∈ R^(d_s×d_h).
We then calculate the vocabulary distribution based on the hidden state s_t. Our vocabulary consists of two parts: a fixed vocabulary containing the collection of predefined functions, and a dynamic vocabulary consisting of arguments, i.e., numerical values and KG artifacts related to the question. We feed s_t through a linear layer W_o and a softmax function to compute the probability of each word in the fixed vocabulary. To obtain the probabilities of words in the dynamic vocabulary, we project the hidden vector s_t to the argument embedding dimension through a projection matrix W_p ∈ R^(d_s×d_e) and compute the similarity with each word by taking the dot product.
Next, we compute the probability P_t of generating from the fixed vocabulary at the current time step through a linear layer followed by an activation function, and combine the two vocabulary distributions using P_t:

P(w) = P_t · P_fix(w) + (1 − P_t) · P_dyn(w)

Note that if w is a word in the fixed vocabulary, then P_dyn(w) is zero; similarly, P_fix(w) is zero when w is in the dynamic vocabulary.
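The two-part vocabulary mixing can be sketched directly from the formula above; the probabilities below are illustrative numbers, not real model outputs:

```python
# Combining the fixed (function) and dynamic (argument) vocabulary
# distributions with the gate probability P_t. Each word lives in exactly
# one vocabulary, so the other distribution assigns it probability zero.

def combine(p_t, p_fix, p_dyn):
    """P(w) = P_t * P_fix(w) + (1 - P_t) * P_dyn(w)."""
    vocab = set(p_fix) | set(p_dyn)
    return {w: p_t * p_fix.get(w, 0.0) + (1 - p_t) * p_dyn.get(w, 0.0)
            for w in vocab}

p_fix = {"SelectAll": 0.7, "LessThan": 0.2, "Count": 0.1}     # functions
p_dyn = {"E_glockenspiel": 0.9, "T_musical_instrument": 0.1}  # arguments

p = combine(0.8, p_fix, p_dyn)
print(round(p["SelectAll"], 2))   # 0.56
print(round(sum(p.values()), 2))  # 1.0
```

Because both input distributions sum to 1 and the gate is a convex weight, the combined distribution also sums to 1.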

Reward-guided Action Sequence Selection Strategy
To improve accuracy, we generate multiple candidate action sequences with a beam search algorithm and design a reward-guided action sequence selection strategy. In general, the more similar the structure and semantics of two questions are, the more similar their corresponding action sequences will be. Therefore, we propose to use similar questions to help select the correct action sequence. Specifically, we build a memory consisting of the question-answer pairs in the training set. Note that we do not require gold logical forms for these questions.
To retrieve similar questions with answers from the memory, we use edit distance to calculate the similarity between two questions. To improve generalization, we replace the entity mentions, type mentions, and numerical values in the questions with the symbols [ENTITY], [TYPE], and [CONSTANT], respectively. We do not mask relations because relation mentions are hard to recognize reliably. In addition, the presence of antonyms, such as atmost and atleast or less and greater, can give questions with similar surface forms exactly opposite semantics. Therefore, we construct a set of antonym pairs and set the similarity to 0 when the two questions contain an antonym pair. We retrieve the k question-answer pairs with the highest similarity to form the support set {(q_j, a_j, d_j)}_{j=1}^k, where d_j is the similarity between q_j and the current question.

We then propose a reward-guided action sequence selection strategy that scores each candidate action sequence according to its fitness to the retrieved support set. Specifically, given a candidate A_i and an item (q_j, a_j, d_j) in the support set, we adjust the arguments in A_i to the arguments of q_j according to their positions in the text, as shown in Figure 2, and then, lacking gold action sequences, score the candidate by computing the F1 score between a_j and the execution results of the modified sequences. Because the positions of the relations are unknown, we enumerate all possible orders of relations and generate multiple modified action sequences. We take the highest F1 as the score of (q_j, a_j, d_j) for A_i and denote it r_i^j. The overall score of A_i is then:

score(A_i) = (Σ_{j=1}^k d_j · r_i^j) / (Σ_{j=1}^k d_j)

where Σ_{j=1}^k d_j is a normalization term. In the inference stage, we take the candidate action sequence with the highest score as the output.
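A minimal sketch of the retrieval similarity and the reward-guided score follows. The antonym list is an illustrative subset, and in the full system the rewards r_i^j come from executing the adjusted candidates against the KG; here they are plain numbers:

```python
# Edit-distance similarity over masked questions, zeroed when an antonym
# pair appears, plus the similarity-weighted candidate score.

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n]

ANTONYMS = [("atmost", "atleast"), ("less", "greater")]  # illustrative subset

def similarity(q1, q2):
    """Normalized edit similarity; 0 if the questions contain an antonym pair."""
    for w1, w2 in ANTONYMS:
        if (w1 in q1 and w2 in q2) or (w2 in q1 and w1 in q2):
            return 0.0
    return 1.0 - edit_distance(q1, q2) / max(len(q1), len(q2))

def score(rewards, sims):
    """Similarity-weighted average of the per-support-pair rewards r_i^j."""
    return sum(r * d for r, d in zip(rewards, sims)) / sum(sims)

q = "how many [TYPE] have less than [CONSTANT] [TYPE]"
print(similarity(q, "how many [TYPE] have greater than [CONSTANT] [TYPE]"))  # 0.0
```

The antonym rule fires on the less/greater pair even though the two masked questions differ by only one word, which is exactly the failure mode the rule is designed to catch.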

Training
We use the REINFORCE (Williams, 1992) algorithm to train our model. We treat the F1 score of the answers produced by the predicted action sequence, with respect to the ground-truth answers, as the original reward. To improve training stability, we use the adaptive reward function of Hua et al. (2020c) to adjust rewards. Moreover, we run a breadth-first search on a subset of the data to obtain pseudo action sequences and pretrain the model with them to avoid the cold-start problem.

Experimental Setup
Our method aims to solve various complex questions, and we mainly evaluate it on the ComplexQuestionAnswering (CQA) dataset (Saha et al., 2018), a large-scale KBQA dataset containing seven types of complex questions, as shown in Table 1. We give details and examples of this dataset in Appendix A. We also conduct experiments on WebQuestionsSP (WQSP) (Yih et al., 2015), which contains 4,737 simple questions. The results show that our method also works well on simple datasets.
We employ the standard F1 measure between the predicted entity set and the ground-truth answers as the evaluation metric. For CQA categories whose answers are boolean values or numbers, we treat answers as single-value sets and compute the corresponding F1 scores. Training details and model parameters can be found in Appendix B.
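For concreteness, the set-based F1 used here can be sketched as follows; boolean or numeric answers are wrapped in single-value sets, as described above:

```python
# F1 between the predicted answer set and the gold answer set. Boolean or
# numeric answers are treated as single-value sets.

def f1_score(pred, gold):
    pred, gold = set(pred), set(gold)
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

print(f1_score({"e1", "e2"}, {"e2", "e3"}))  # 0.5
print(f1_score({True}, {True}))              # 1.0
```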

Baselines
We compare our framework with seq2seq-based methods. KVmem (Saha et al., 2018) presents a model consisting of a hierarchical encoder and a key-value memory network. CIPITR (Saha et al., 2019) mitigates reward sparsity with auxiliary rewards and restricts the program space to semantically correct programs; it has two training variants, one training a single model for all question categories and the other training a separate model for each category. CBR-KBQA (Das et al., 2021) generates complex logical forms conditioned on similar retrieved questions and their logical forms to generalize to unseen relations. We also compare our method with graph-based methods on the WQSP dataset. STAGG (Yih et al., 2015) proposes a staged query graph generation framework and leverages the knowledge base at an early stage to prune the search space. TEXTRAY (Bhutani et al., 2019) answers complex questions with a novel decompose-execute-join approach. QGG (Lan and Jiang, 2020) modifies STAGG with more flexible ways of handling constraints and multi-hop relations. OQGG (Qin et al., 2021) starts with the entire knowledge base and gradually shrinks it to the desired query graph.

Overall Performances
The overall performance of our proposed framework against KBQA baselines is shown in Tables 1 and 2. Our framework significantly outperforms the state-of-the-art model on the CQA dataset while staying competitive on the WQSP dataset. On the CQA dataset, our method achieves the best overall performance of 80.89% macro F1 and 85.31% micro F1, improvements of 9.88% and 4.51%, respectively. Moreover, our method achieves the best result on six of the seven question categories. On Logical Reasoning and Verification (Boolean), which are relatively simple, our model obtains improvements of 3.40% and 2.35% in macro F1, respectively. On Quantitative Reasoning, Comparative Reasoning, Quantitative (Count), and Comparative (Count), whose questions are complex and hard to parse, our model obtains considerable improvements: the macro F1 scores increase by 20.02%, 17.22%, 3.45%, and 17.55%, respectively. Our method does not outperform CIP-Sep on Simple Question, which trains a separate model for this category, but still achieves a result comparable to the second-best baseline. On the WQSP dataset, our method outperforms all the sequence-based methods and stays competitive with the best graph-based method. Our method does not gain much here because most questions in this dataset are one-hop and simple, while our framework aims to handle diverse question categories. We do not compare with graph-based methods on the CQA dataset because they always start from a topic entity and interact with the KG to add relations to the query graph step by step, which cannot solve most question types in this dataset, such as Quantitative Reasoning and Comparative Reasoning.
The experimental results demonstrate the ability of our method to parse complex questions and generate correct action sequences. The improvement comes mainly from two aspects. On the one hand, we employ a rewriting model to decompose a complex question into several utterances, allowing the decoder to focus on a shorter span when decoding each action. On the other hand, we make full use of existing question-answer pairs and determine the structure of action sequences indirectly through question-to-question alignment.

Ablation Studies
We conduct a series of ablation studies on the CQA dataset to demonstrate the effectiveness of the main modules in our framework. To explore the impact of the question rewriting module, we remove it and use only the original question as input to the seq2seq model; the performance drops by 1.79% in macro F1, as shown in Table 3. To evaluate the action sequence selection module, we generate candidate action sequences with beam search and directly take the sequence with the highest probability as the output instead of selecting with the module; macro F1 drops by 2.49%. To verify that the cross-attention mechanism in BERT aligns the question with KG artifacts and thereby improves generation, we encode the question and KG artifacts separately and find that performance drops by 0.97%. These results show that every main module in our framework plays an important role.

To explore the impact of different underlying embeddings, we conduct experiments in two settings: initializing an embedding matrix randomly, and encoding with BERT. We finetune the embedding matrix during training in the first setting, while freezing the parameters of the BERT model in the second. As shown in Table 4, BERT embeddings achieve the best result, improving over random embeddings by 4.40%. This is reasonable because BERT is pretrained on a large corpus to represent rich semantics and uses cross-attention to better align the question with KG artifacts. Note that our method still outperforms state-of-the-art methods even without BERT.

To investigate the effect of the number of candidate action sequences and the size of the support set on action sequence selection, we conduct experiments and plot the results in Figure 3. The macro F1 score initially increases with the size of the support set, regardless of the number
of candidates. This trend gradually slows, and the macro F1 score peaks when the size is about 6. As the support set continues to grow, the macro F1 score then decreases slightly. This is mainly caused by the simple and rough method we use to calculate question similarity, under which the assumption that similar questions have similar action sequence structures does not always hold. A moderate number of similar questions alleviates this problem and improves performance; however, beyond a certain point, newly added questions become less similar to the original question and instead introduce noise. In addition, increasing the number of generated candidates also improves performance, but if the number is too high, the boost becomes less apparent or even negative because of the lower quality of the newly added candidates.

Case Study
We show some examples to illustrate the ability of our modules. Table 5 shows a complex question of the category Quantitative (Count). Without the rewriting module, the model wrongly predicts the third action, but it generates the sequence correctly with the help of the rewritten utterances. This is reasonable because the seq2seq model learns to focus on "approximately 5 people" when predicting the third action. Table 6 shows a query of the category Verification (Boolean). It is confusing for the model to decide which entity to output, and the correct action sequence is assigned a lower probability. However, the choice is much easier with the action sequence selection module: as shown, the wrong logical form produces an incorrect result in the majority of cases and thus receives a lower selection score.

Related Work
Semantic parsing is the task of translating natural language utterances into executable meaning representations. Recent semantic-parsing-based KBQA methods can be categorized as graph-based (Yih et al., 2015; Bao et al., 2016; Bhutani et al., 2019; Lan and Jiang, 2020; Qin et al., 2021) and sequence-based (Liang et al., 2017; Saha et al., 2019; Ansari et al., 2019; Hua et al., 2020a,b,c; Das et al., 2021). Graph-based methods build a query graph step by step through interaction with the knowledge base. Compared to graph-based methods, sequence-based methods generate logical forms directly with a seq2seq model, which is easier to implement and can handle more question categories by simply expanding the set of action functions. However, the semantic and structural gap between natural language utterances and action sequences leads to poor translation performance.

Conclusion
In this paper, we propose an alignment-enhanced complex question answering framework that reduces the semantic and structural gap between questions and action sequences through question-to-action and question-to-question alignment. We train a question rewriting model to align the question with sub-action sequences in the absence of training data, and employ a pretrained language model to implicitly align the question with action arguments. Moreover, we utilize similar questions to help select the correct action sequence from multiple candidates. Experiments show that our framework achieves state-of-the-art results on the CQA dataset and performs well on various complex question categories. In future work, we will consider how to better align questions with logical forms.

Limitations
In our method, we treat KG artifacts as tokens and generate logical forms with a seq2seq model, which can handle more types of complex questions, e.g., superlative questions without topic entities. However, for single- and multi-hop questions, graph-based methods may perform better, because they start from a topic entity and interact with the KG to add relations to the query graph step by step, which prunes the search space more effectively. Moreover, we control the vocabulary size through entity and relation recognition, which makes the preprocessing step more complex.

A CQA Dataset
The ComplexQuestionAnswering (CQA) dataset contains the subset of QA pairs from the Complex Sequential Question Answering (CSQA) dataset whose questions are answerable without the previous dialog context. There are 944K, 100K, and 156K question-answer pairs in the training, validation, and test sets, respectively. The dataset has seven types of complex questions, making it difficult for a model to answer correctly. We show examples of each question category in Table 7. For simple questions, the corresponding action sequence contains only one action, while for some complex questions the length of the action sequence may be up to 4.

B Training Details
To compare with previous works and reduce training time, we randomly select two small subsets (about 1% each) from the training set to train our models. We use the BFS algorithm to search pseudo action sequences for the first subset, which we use to train the question rewriting model as introduced in Section 2.2 and to pretrain the action sequence generation model. We use the second subset for the subsequent reinforcement learning of the action sequence generation model.
We evaluate our trained model on the whole test set.
We initialize the two models in the question rewriting stage with the base version of BART and finetune them using the Adam optimizer with a learning rate of 1e-5. For the action sequence generation model, we adopt the uncased base version of BERT for the underlying embeddings and freeze its parameters to improve training stability. We set the dimension of the type embedding to 100 and the hidden sizes of the one-layer BiLSTM encoder and LSTM decoder to 300. We train the model for 100 epochs in the pretraining stage and 50 epochs in the reinforcement learning stage, using Adam with learning rates of 1e-4 and 1e-5, respectively, and choose the checkpoint with the highest reward on the development set. We generate 5 candidate action sequences with a beam size of 10, and retrieve 3 questions with a similarity above the threshold of 0.6 as the support set. If no similar question meets the condition, we

Figure 1 :
Figure 1: An overview of the proposed approach. The question is first converted into a more structured form, then multiple candidate action sequences are generated by the seq2seq model, and finally the candidate action sequences are scored based on similar question-answer pairs.

Figure 2 :
Figure 2: An example of adjusting candidate action sequences. The upper and lower parts of (a) are the original question and a question in the support set, respectively. We first obtain a relation-masked action sequence (the second line of (b)) based on the alignment of entities and types between the two questions, as shown in (a), and then output multiple action sequences according to all possible combinations of relations.

Figure 3 :
Figure 3: Trends of macro F1 as the size of the support set increases.

Table 2 :
The overall performance on the WQSP dataset. † denotes supervised training.

Table 3 :
Ablation studies on main components.

Table 4 :
Ablation studies for different underlying embeddings.

Table 5 :
Test case on the question rewriting module.

Table 6 :
Test case on the action sequence selection module.