Recursive Tree-Structured Self-Attention for Answer Sentence Selection

Syntactic structure is an important component of natural language text. Recent top-performing models in Answer Sentence Selection (AS2) use self-attention and transfer learning, but not syntactic structure. Tree structures have shown strong performance in tasks with sentence pair input like semantic relatedness. We investigate whether tree structures can boost performance in AS2. We introduce the Tree Aggregation Transformer: a novel recursive, tree-structured self-attention model for AS2. The recursive nature of our model is able to represent all levels of syntactic parse trees with only one additional self-attention layer. Without transfer learning, we establish a new state of the art on the popular TrecQA and WikiQA benchmark datasets. Additionally, we evaluate our method on four Community Question Answering datasets, and find that tree-structured representations have limitations with noisy user-generated text. We conduct probing experiments to evaluate how our models leverage tree structures across datasets. Our findings show that the ability of tree-structured models to successfully absorb syntactic information is strongly correlated with a higher performance in AS2.


Introduction
Motivation. Natural language text is characterized by structure. For instance, syntactic parse trees decompose a sentence into syntactic groups, which in turn are decomposed recursively until we get to single-word spans. Therefore, syntactic parse trees have a varying number of levels that can be accurately represented by recursive model architectures.
Tree-structured LSTM networks (Tai et al., 2015) are the recursive extension of LSTM networks (Hochreiter and Schmidhuber, 1997), and allow for syntactic trees to be represented hierarchically. Tree-LSTMs and bidirectional Tree-LSTMs (Teng and Zhang, 2017) do not represent sequence position information, whereas the hybrid neural inference networks (Chen et al., 2017a) represent sequence position information separately from tree-structured hierarchical information.
Figure 1: Embedding a sentence with our proposed recursive tree-structured self-attention using the corresponding constituency parse tree. There is only one set of parameters for the recursive self-attention.
Tree-structured models have been applied to the tasks of natural language inference (Chen et al., 2017a), sentence pair similarity (Tai et al., 2015), dependency parsing (Kiperwasser and Goldberg, 2016), and text embeddings (Mrini et al., 2019). In this paper, we consider the problem of Answer Sentence Selection (AS2), where the goal is to predict for a question-sentence pair whether the sentence contains an answer to the question. Given that tree-structured models have performed strongly on sentence pair similarity, a task that takes a sentence pair as input, we hypothesize that tree structures can help in AS2, another sentence pair task.
The most recent top-performing model architectures for Answer Sentence Selection have been based on the self-attention transformer architecture (Vaswani et al., 2017). Three of them (Lai et al., 2019; Garg et al., 2019; Tran et al., 2020) use transfer learning on large AS2 datasets; another one (Laskar et al., 2020) uses direct fine-tuning on pre-trained transformer-based language encoders, whereas all three use pre-trained BERT and/or RoBERTa embeddings (Liu et al., 2019).
Contribution. We investigate whether tree structures are useful for AS2. We introduce the Tree Aggregation Transformer: a novel recursive and tree-structured self-attention model for Answer Sentence Selection. We use the syntactic parse trees of questions and candidate answer sentences to model them in a tree-structured way. We then form representations for questions and candidate answers using one additional self-attention layer in a recursive, bottom-up fashion, as shown in Figure 1. We learn syntactic embeddings to represent hierarchical order and phrase-level syntactic information. We find in an ablation study that our learned syntactic embeddings improve performance.
Without using AS2 datasets for transfer learning, our model establishes a new state of the art for the clean versions of TrecQA and WikiQA, two widely used benchmark datasets in question answering and AS2. Our tree-structured self-attention matches or exceeds the state of the art (fine-tuning on RoBERTa) on 2 out of 4 Community Question Answering (CQA) datasets. We conduct experiments for 3 probing tasks to establish what information our models leverage to increase performance, and likewise what they fail to leverage when they do not exceed baselines. We find that tree-structured representations that successfully absorb the provided syntactic information consistently perform better than baselines. Our probing task results suggest that there is more work to be done for tree structures to adapt to noisy user-generated text.

Related Work
Tree-structured Transformers. To the best of our knowledge, our method is the first to introduce tree self-attention to Answer Sentence Selection. There is a growing body of work incorporating tree structures in self-attention for a range of other NLP tasks. Nguyen et al. (2019) introduce a transformer-based encoder-decoder that incorporates tree-structured attention. The tree-structured attention is accumulated hierarchically. A token in the tree has as many representations as overall children, therefore it is first accumulated in a bottom-up fashion (vertically), and then horizontally to compute a token's representation. Their model is not recursive and uses different parameters for each level. The authors evaluate their model in machine translation and text classification. Sun et al. (2020) develop a tree-structured transformer encoder-decoder architecture for code generation. Here, the tree structure is based on the code syntax. The model uses character-level embeddings as input. Harer et al. (2019) introduce Tree-Transformer: a model with a tree convolution block for correction of code and grammar. Wang et al. (2019) propose a model of the same name, where the model learns syntactic parse trees in an unsupervised manner. The model uses up to 12 layers of non-recursive self-attention on top of a pre-trained BERT. Ahmed et al. (2019) introduce Constituency and Dependency Tree Transformer models, largely inspired by the Constituency and Dependency Tree-LSTM models (Tai et al., 2015) and RvNN models (Socher et al., 2011, 2012, 2013). On 4 datasets of semantic relatedness, natural language inference and paraphrase identification, their transformer models achieve performance on par with Tree-LSTM models, and do not set a new state of the art. The authors use two convolution layers to form a parent representation from the corresponding children. Their model does not learn an explicit syntactic representation, and the authors do not analyze the fluctuating results.
Answer Sentence Selection (AS2). The recent state-of-the-art models in the AS2 task all use transfer learning from large-scale datasets, and do not incorporate syntactic information. All of them use a standard linear (or sequential) input format, where the first input sentence is the question and the second is the candidate answer. Lai et al. (2019) introduce the Gated Self-Attention Memory Network (GSAMN). It combines gated attention (Dhingra et al., 2017; Tran et al., 2017), memory networks (Sukhbaatar et al., 2015) and self-attention (Vaswani et al., 2017) in one model. The authors use transfer learning with their Stack Exchange QA dataset. Garg et al. (2019) propose the TandA (Transfer and Adapt) method, which fine-tunes a pre-trained BERT or RoBERTa model in two steps. The transfer step fine-tunes a large pre-trained BERT or RoBERTa model on ASNQ, a large-scale answer sentence selection dataset extracted from Google's Natural Questions (Kwiatkowski et al., 2019). The adapt step then fine-tunes the resulting model on the smaller target benchmarks, TrecQA and WikiQA. Tran et al. (2020) build upon the work of Lai et al. (2019). They propose to use a neural Turing machine (Graves et al., 2014) as a controller for the memory network, instead of the gated attention that Lai et al. (2019) use. Like Garg et al. (2019), they use the ASNQ dataset for transfer learning.
Laskar et al. (2020) achieve state-of-the-art results on a wide range of QA and CQA datasets by directly fine-tuning on the target datasets, without transfer learning from an external large-scale dataset. They show results for two methods: the first trains a self-attention layer while freezing pretrained language model layers, and the second directly fine-tunes on the language model.

Tree Aggregation Transformer for Answer Sentence Selection
In the AS2 task, the input is a pair of sentences, where the first one is the question and the second is a candidate answer. This is a binary classification problem on whether or not the candidate answer sentence contains an answer to the question. We therefore design our model to form a representation of the question and a representation of the candidate answer, in a bottom-up tree aggregation fashion.
Semantic and Syntactic Representation. We define a token embedding in our input representation as the concatenation of a semantic embedding and a syntactic embedding. The semantic embedding is a projection of the token embedding from a given pre-trained language model, whereas the syntactic embedding contains information from part-of-speech tags, syntactic categories, and the level within the syntactic parse tree.
The syntactic embedding is the sum of three learned embeddings. The first embedding represents the token's tag: a part-of-speech tag if the token is a word, or a syntactic category if the token is a classification or separator token. The second embedding represents the token's level within the tree, inherited from the head of the token's constituent span. Our recursive model can represent sentences with as many tree levels as the corresponding syntax tree has. The third embedding represents the position of a token within the constituent span, as seen in the example in Figure 2. This position embedding puts the token within its span context, whereas the position embedding of the semantic (language model) embedding puts the token within the context of the question-sentence pair.
More formally, given a token $t$, its language model embedding $x_t$, its position index $p_t$, its part-of-speech tag or syntactic category $s_t$, and its tree level $l_t$, the token's semantic embedding $e_t$ and syntactic embedding $n_t$ are as follows:
$$e_t = W_1 x_t + b_1$$
$$n_t = W_2 \left( E_s(s_t) + E_p(p_t) + E_l(l_t) \right) + b_2$$
where $W_1$, $W_2$, $b_1$, $b_2$ are learned, and $E_s$, $E_p$ and $E_l$ are learned embedding layers, respectively for the part-of-speech tag or syntactic category, the position index, and the tree level.
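The input representation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the class name `InputRepresentation`, the helper functions, the random initialization, and the placement of `W2`/`b2` as a projection of the summed syntactic embeddings are all assumptions made for the example.

```python
import random


def linear(W, x, b):
    """Return W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class InputRepresentation:
    """Hypothetical sketch: semantic projection plus summed syntactic embeddings."""

    def __init__(self, lm_dim, sem_dim, syn_dim, n_tags, max_pos, max_level, seed=0):
        rng = random.Random(seed)
        self.W1 = random_matrix(sem_dim, lm_dim, rng)    # semantic projection
        self.b1 = [0.0] * sem_dim
        self.W2 = random_matrix(syn_dim, syn_dim, rng)   # assumed syntactic projection
        self.b2 = [0.0] * syn_dim
        self.E_s = random_matrix(n_tags, syn_dim, rng)   # tag / syntactic category table
        self.E_p = random_matrix(max_pos, syn_dim, rng)  # position-within-span table
        self.E_l = random_matrix(max_level, syn_dim, rng)  # tree-level table

    def __call__(self, x_t, s_t, p_t, l_t):
        e_t = linear(self.W1, x_t, self.b1)
        summed = [a + b + c for a, b, c in
                  zip(self.E_s[s_t], self.E_p[p_t], self.E_l[l_t])]
        n_t = linear(self.W2, summed, self.b2)
        # List concatenation: the token embedding is [e_t ; n_t].
        return e_t, n_t, e_t + n_t
```

Keeping the two halves separate until the final concatenation mirrors the text's distinction between semantic and syntactic information, which the later layers also treat separately.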
Recursive Self-Attention. We add one recursive self-attention layer on top of the language model layers. The recursive self-attention layer has separate attention distributions $a^e_t$ and $a^n_t$ for the semantic embedding $e_t$ and syntactic embedding $n_t$:
$$a^e_t = \mathrm{softmax}\left( \frac{q^e_t K_e^\top}{\sqrt{d_e}} \right), \qquad a^n_t = \mathrm{softmax}\left( \frac{q^n_t K_n^\top}{\sqrt{d_n}} \right)$$
where $d_e$ and $d_n$ are the dimensions of the query and key vectors for the semantic and syntactic embeddings respectively, $K_e$ and $K_n$ are the learned matrices of key vectors of input tokens, and $q^e_t$ and $q^n_t$ are the query vectors for the token $t$. The output vectors for the token $t$ are then:
$$o^e_t = W_{O,e} \left( a^e_t V_e \right) + b_{O,e}, \qquad o^n_t = W_{O,n} \left( a^n_t V_n \right) + b_{O,n}$$
where $V_e$ and $V_n$ are the value vectors for the input tokens, and $W_{O,e}$, $W_{O,n}$, $b_{O,e}$, $b_{O,n}$ are learned. Finally, we apply separate position-wise feed-forward layers to these output vectors.
Usually, self-attention includes residual dropout over the attention-weighted value vectors. We found in preliminary experiments that the performance on the dev set improved when we omitted dropout regularization. We therefore omit dropout in both the self-attention and the position-wise feed-forward layers.
The recursiveness of the self-attention allows the model to re-use the same sets of parameters across each tree level, instead of training new ones as in previous work (Nguyen et al., 2019;Wang et al., 2019).
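To make the two separate attention distributions concrete, the following sketch applies standard scaled dot-product attention independently to a semantic stream and a syntactic stream. It is an illustration under assumptions: it uses a single head, takes already-computed query, key and value vectors as input, and omits the learned output projections and feed-forward layers; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_stream(query, keys, values):
    """Scaled dot-product attention for one query vector.

    keys and values hold one vector per input token.
    Returns the attention distribution and the weighted sum of values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    output = [sum(w * value[j] for w, value in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output


def recursive_attention_step(semantic, syntactic):
    """Attend separately over the semantic and syntactic streams.

    Each argument is a (query, keys, values) triple. The same function
    (hence the same parameters, in a full model) is reused at every tree level.
    """
    _, out_e = attention_stream(*semantic)
    _, out_n = attention_stream(*syntactic)
    return out_e, out_n
```

Because the two streams never share a softmax, a span can attend to different tokens for semantic and for syntactic purposes, which is the property the separate distributions are meant to provide.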
Constituent Span Embedding. Each input sentence is represented in a tree-structured fashion using its constituency parse tree. We use a pre-trained parser, whose parameters are fixed, to produce the trees before training time.
The constituent span is fed to the recursive self-attention as a matrix of token vectors. This matrix includes the embeddings of the words of the constituent span, preceded by a first, start-of-sentence embedding, and followed by an end-of-sentence embedding. The start-of-sentence token is the classification token if the span is part of the question, or a separator token if the span is part of the candidate sentence. Figure 3 shows how we compose a constituent span embedding for RoBERTa models.
The constituent span embedding is the output embedding of the first token. Through the recursive self-attention, the first token embedding receives an attention-weighted sum of all of the span's token embeddings. This creates a span-specific embedding that is aware of the entire question-sentence pair input as a result of the language model layers, but focused on the tokens of a span as a result of the recursive self-attention.
In using only one layer of recursive self-attention, the first token embedding gets an attention-weighted sum of value vectors that contains token embeddings that did not go through a layer of self-attention, and syntactic embeddings that came directly out of the embedding layers.
Efficient Tree Aggregation. To obtain an aggregate sentence embedding, we proceed by embedding from the deepest level of the tree (the leaves) to the root, as shown in Figure 3. The computations are done on the same two sets of self-attention parameters.
To reduce training time, we compute the constituent span embeddings one level at a time. For instance, in Figure 2, we compute the NP, VP and PP groups at once when computing the span embeddings at tree level 2.
We efficiently compute all span embeddings only once, and keep all computed span embeddings, as they will be used in the next level.
The sentence embedding is obtained from the first token output of the computation at the root of the tree, as shown in Figure 1.
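The level-by-level, bottom-up aggregation described above can be sketched as follows. This is a simplified illustration, not the authors' code: `Node`, `group_by_level` and `aggregate` are hypothetical names, levels are counted from the root, and a mean over child vectors (`mean_pool`) stands in for the shared recursive self-attention layer.

```python
from collections import defaultdict


class Node:
    """A parse-tree node; leaves carry a vector, spans get one during aggregation."""

    def __init__(self, label, children=None, vector=None):
        self.label = label
        self.children = children or []
        self.vector = vector


def group_by_level(root):
    """Map each depth (root = 0) to the list of nodes at that depth."""
    levels = defaultdict(list)
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        levels[depth].append(node)
        for child in node.children:
            stack.append((child, depth + 1))
    return levels


def mean_pool(vectors):
    """Stand-in span encoder: element-wise mean of the child vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def aggregate(root, span_encoder=mean_pool):
    """Embed spans from the deepest level up to the root, one level at a time.

    Each span embedding is computed once and cached on the node, so the
    level above can reuse it. The same span_encoder is used at every level.
    """
    levels = group_by_level(root)
    for depth in sorted(levels, reverse=True):
        for node in levels[depth]:
            if node.children:  # children are deeper, so already embedded
                node.vector = span_encoder([c.vector for c in node.children])
    return root.vector


# "Averroes died in Marrakesh" with toy 2-dimensional word vectors.
tree = Node("S", [
    Node("NP", [Node("Averroes", vector=[1.0, 0.0])]),
    Node("VP", [
        Node("died", vector=[0.0, 1.0]),
        Node("PP", [
            Node("in", vector=[1.0, 1.0]),
            Node("NP", [Node("Marrakesh", vector=[3.0, 1.0])]),
        ]),
    ]),
])
sentence_embedding = aggregate(tree)  # [1.0, 0.5] with mean pooling
```

In a batched implementation, all spans in `levels[depth]` would be encoded in one forward pass, which is where the training-time saving comes from.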
Prediction. Finally, we concatenate the aggregate embeddings for the question-sentence input pair. Given the question's aggregate semantic embedding $w^e_q$ and aggregate syntactic embedding $w^n_q$, and the sentence's aggregate semantic embedding $w^e_s$ and aggregate syntactic embedding $w^n_s$, we obtain the prediction values as follows:
$$p(s \mid q) = \mathrm{softmax}\left( W \tanh\left( [w^e_q ; w^n_q ; w^e_s ; w^n_s] \right) + b \right)$$
where $W$ and $b$ are learned. We use binary cross-entropy as our loss function.
Our model can optionally include a residual connection, by adding the classification token embedding output of the language model to the beginning of the question-sentence pair vector. This residual connection does not contain syntactic information, and the classification token embedding is not projected in this case.
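As a rough illustration of the prediction step, the sketch below concatenates the four aggregate embeddings, applies tanh, and then a linear layer followed by a softmax over two classes. The function name `predict` and the two-row layout of `W` are assumptions for the example, and the optional residual connection is left out.

```python
import math


def predict(w_e_q, w_n_q, w_e_s, w_n_s, W, b):
    """Return class probabilities for a question-sentence pair.

    W has one row per class; each row spans the concatenated embedding.
    """
    # List concatenation of the four aggregate embeddings, then tanh.
    features = [math.tanh(v) for v in w_e_q + w_n_q + w_e_s + w_n_s]
    logits = [sum(w * f for w, f in zip(row, features)) + bias
              for row, bias in zip(W, b)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The probability assigned to the "relevant" class can then serve as the relevance score used to rank candidate answers for a question.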

Datasets
We evaluate our proposed Tree Aggregation Transformer on six English-language benchmark datasets for answer sentence selection. The first two, TrecQA and WikiQA, are widely used benchmarks in Question Answering (QA). The other four (YahooCQA and SemEval 2015, 2016 and 2017) are all from the Community Question Answering (CQA) domain. We show the statistics of these six datasets in Table 1.
TrecQA (Wang et al., 2007) is collected from labeled sentences of the QA track of the Text REtrieval Conference (TREC). Over time, the dataset has evolved into two versions: the raw version includes all question-sentence pairs, whereas the clean version excludes questions with only non-relevant or only relevant candidate answers.
WikiQA (Yang et al., 2015) contains questions originally sampled from Bing query logs, and matched with candidate answer sentences from the first paragraph of relevant Wikipedia articles. Like TrecQA, it has a raw and a clean version.
YahooCQA (Tay et al., 2017) is a filtered and pre-processed subset of the large-scale Yahoo! Answers Manner Questions dataset (Surdeanu et al., 2008). The latter is based on the Yahoo! Answers online forum.
SemEval 2015 CQA (Nakov et al., 2015) is the challenge dataset of Subtask A of Task 3 of SemEval 2015. It is based on the Qatar Living online forum, and the goal is to predict the relevance of comments to a question.
SemEval 2016 CQA (Nakov et al., 2016) likewise corresponds to Subtask A of Task 3 of SemEval 2016, about question-comment similarity. It is a new dataset also based on the Qatar Living online forum. The training set includes the training, development and testing sets of the SemEval 2015 CQA, and two new training sets. The authors of the dataset have described the first one as highly reliable, and the second one as noisier.
SemEval 2017 CQA (Nakov et al., 2017) is the latest version of the community question answering task. The training and development sets are the same as the 2016 version, but the testing set is different.
In Figure 2, we show an example of question-sentence pairs for a QA dataset and a CQA dataset. The aim is to illustrate the difference in style and length between formal (QA) and informal (CQA) text.

Setup
The standard evaluation metrics in answer sentence selection are Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR). Both metrics are widely used in Information Retrieval (IR) and are averaged per query, in this case per question. Our model produces relevance scores going from 0 (irrelevant) to 1 (relevant) for each candidate answer, and therefore produces a list of candidate answers that can be ranked by relevance. Whereas MRR scores how early a first relevant answer appears in that candidate list, MAP scores the order in which all candidate answers are listed for each question.
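For readers less familiar with these metrics, here is a small self-contained sketch of how MAP and MRR can be computed from per-question relevance scores and binary labels. The function names are illustrative, and the handling of questions with no relevant candidate (scored as 0 here) is an assumption; evaluation scripts for specific benchmarks may instead exclude such questions.

```python
def rank_labels(scores, labels):
    """Return the labels reordered by descending relevance score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def average_precision(ranked_labels):
    """Mean of precision@k over the ranks k that hold a relevant answer."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant answer, or 0 if there is none."""
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def map_mrr(questions):
    """questions: list of (scores, labels) pairs, one pair per question."""
    ranked = [rank_labels(scores, labels) for scores, labels in questions]
    n = len(ranked)
    mean_ap = sum(average_precision(r) for r in ranked) / n
    mean_rr = sum(reciprocal_rank(r) for r in ranked) / n
    return mean_ap, mean_rr


# Toy example with two questions (illustrative numbers only).
toy = [([0.9, 0.2, 0.6], [0, 1, 1]), ([0.8, 0.1], [1, 0])]
toy_map, toy_mrr = map_mrr(toy)  # MRR is 0.75: first hits at ranks 2 and 1
```

In the toy example the first question ranks an irrelevant answer on top, which halves its reciprocal rank while its average precision also reflects where the second relevant answer lands.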
To produce parse trees, we use the NLTK part-of-speech tagger (Loper and Bird, 2002) trained on the part-of-speech tagset of the English Penn Treebank (PTB) (Marcus et al., 1994), and the English-language parser of Mrini et al. (2020), which is the state of the art on the parse trees of the PTB.

Training Parameters
We use 1 layer of recursive self-attention for all datasets. We use the residual connection described in §3 for TrecQA only. For all our models, we use either BERT large or RoBERTa large, so as to match our baselines. Our recursive self-attention layers have: 16 attention heads, a feed-forward dimension of 4096, and a hidden dimension of 2048. We use half of the dimensions to encode semantic information, and the rest to encode syntactic information.

Ablation Study on Syntactic Embeddings
We perform an ablation study by removing the syntactic embedding part of the input representation. In this experiment, we are quantifying the added value of the learned syntactic embeddings for span position, part-of-speech tags and syntactic categories, and tree levels.
Our results on the dev sets are in Table 3. SemEval 2016 and 2017 results are the same since both have the same dev set. Across all AS2 datasets, we notice that there is an advantage to learning syntactic embeddings, as the sum of MRR and MAP scores is higher for the variant that includes learned syntactic embeddings. The advantage is clearer for QA datasets, suggesting that formal language tends to benefit more from learned syntactic information. We use syntactic embeddings in our next experiments.

Baselines
We consider five strong baselines, described in §2. Baselines 1, 2, and 5 are available only on TrecQA and/or WikiQA, whereas baselines 3 and 4 use the exact same datasets as we do.

Results and Discussion
The results of our experiments with the QA datasets are in Table 4, and the results of our experiments with CQA datasets are in Table 5.

State of the Art in QA datasets
Our results in Table 4 establish a new state of the art in TrecQA and WikiQA, two widely used benchmark datasets in answer sentence selection.
In TrecQA, the average of the MAP and MRR scores of our BERT-based model matches that of BERT-based TandA (Garg et al., 2019), without any transfer learning on a large dataset. This shows that our model is able to leverage the tree structure to increase performance on relatively small datasets.
For the RoBERTa results in WikiQA, the gain of our recursive self-attention over direct fine-tuning confirms that our model is beneficial for formally written text, such as that found in Wikipedia.
The increase in performance compared to the Evidence Memory models (Tran et al., 2020) when we add our tree representation shows that our tree aggregation method brings about a consistent and robust added value for the QA datasets.

Limitations in CQA datasets
As shown in Table 5, our Tree Aggregation Transformer is able to establish a new state of the art in SemEval 2015, and our BERT-based version exceeds other BERT-based baselines. However, our method scores below the state of the art in YahooCQA and SemEval 2016, and only manages to match the MRR, but not the MAP, of the state of the art in SemEval 2017.
Therefore, there is a contrast in the performance of our recursive tree-structured self-attention between the QA and the CQA datasets. The difference lies in the style of the datasets, as questions and sentences can be much longer in CQA datasets than in QA datasets. On average, a training set pair in QA has 32 words for WikiQA, and 39 words for TrecQA, whereas a training set pair in CQA has 78 words for SemEval 2015, 85 words for SemEval 2016-2017, and 40 words for YahooCQA. As shown in the example, CQA pairs may also have spelling mistakes or lack coherent structure. Thus, the informal writing style and larger text length of CQA datasets may be decreasing the ability of our model to leverage tree structures. Accordingly, we see that our model achieves very competitive scores for YahooCQA, whose text length is very close to that of the QA datasets. The SemEval 2015 exception could be explained by the fact that the 2015 training dataset is less noisy than the 2016-2017 training dataset, as pointed out by the authors of the SemEval CQA datasets.

Do Tree Structures Improve Performance?
We investigate how tree structures are leveraged in the Answer Sentence Selection task across the different datasets. We evaluate our tree-structured representations and compare them with the corresponding sequential representations, using three probing tasks from Conneau et al. (2018).

Probing Tasks
The three probing tasks are as follows:
(1) Top Constituent Prediction. This task looks to predict the top constituent sequence of the question-sentence pair: the sequence of syntactic categories immediately below the S (Sentence) syntactic category. Following Conneau et al. (2018), we define this task as a 20-way classification problem, where the first 19 classes are the 19 most popular top constituent sequences, and the last category is for all the remaining top constituent sequences.
(2) Tree Depth Prediction. The tree depth is the number of hops from the root node of the syntactic tree to the lowest-level leaf nodes.
(3) Input Length Regression. This task investigates whether the embedding is aware of how many words it contains. The length of the question-sentence pair input is defined as the number of its tokens (full words and punctuation symbols).
The first two tasks are syntactic, and investigate whether our tree-structured representations absorbed the syntactic information that we fed them (syntactic categories and tree levels, respectively), and whether that information was already present in the sequential representations.
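The three probing targets can be read directly off a constituency parse. The sketch below derives them from a bracketed parse string; it is an illustration only, with hypothetical function names, and it assumes that part-of-speech nodes count as one hop when measuring depth (a convention the text does not specify).

```python
def parse_brackets(text):
    """Parse a bracketed tree into (label, children) tuples; leaves are word strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        position += 1                      # skip "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])
                position += 1
        position += 1                      # skip ")"
        return (label, children)

    return read_node()


def tree_depth(node):
    """Number of hops from this node down to its lowest leaf."""
    if isinstance(node, str):
        return 0
    return 1 + max(tree_depth(child) for child in node[1])


def top_constituents(tree):
    """Syntactic categories immediately below the root."""
    return [child[0] for child in tree[1] if not isinstance(child, str)]


def input_length(node):
    """Number of leaf tokens (words and punctuation symbols)."""
    if isinstance(node, str):
        return 1
    return sum(input_length(child) for child in node[1])


PARSE = "(S (NP (NNP Averroes)) (VP (VBD died) (PP (IN in) (NP (NNP Marrakesh)))) (. .))"
example = parse_brackets(PARSE)
labels = (top_constituents(example), tree_depth(example), input_length(example))
# labels == (["NP", "VP", "."], 5, 5)
```

For the classification task, the top constituent sequence would additionally be mapped to one of 20 classes, with every sequence outside the 19 most frequent ones falling into the last class.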

Probing Experiment Setup
In our probing experiments, we consider all six datasets used both in our work and in Laskar et al. (2020). We consider the sequential representation of a question-answer pair to be the classification token embedding used for prediction in the RoBERTa-based models of Laskar et al. (2020). We take our own RoBERTa-based tree-structured models (without evidence memory), where we consider the tree-structured representation to be the classification token embedding fed to the prediction layer. The tree-structured and sequential representations have the same number of dimensions.
Table 6: Results for three probing tasks comparing sequential (Laskar et al., 2020) and tree-structured (ours) representations. In the last two columns, we show the Spearman correlation of the probing task and the AS2 performance differences between the tree-structured and sequential representations.
The probing model architecture is a simple MLP with a layer of the same size as the input embeddings, a ReLU activation, and a prediction layer. We train 36 probing models, one for each of the 36 combinations of a probing task, a dataset and a representation type (3 tasks × 6 datasets × 2 representation types). The input embeddings are frozen, so that the training does not change the weights of the pre-trained AS2 models. All experiments are trained for the same number of epochs, and use the same train/dev/test splits as the AS2 experiments.
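The experimental grid and the shape of the probing model can be sketched as follows. The identifiers below (task and dataset strings, `probing_runs`, `probe_forward`) are hypothetical labels chosen for this example, and the forward pass only illustrates the hidden-layer, ReLU, prediction-layer structure; training is not shown.

```python
from itertools import product

TASKS = ["top_constituent", "tree_depth", "input_length"]
DATASETS = ["TrecQA", "WikiQA", "YahooCQA",
            "SemEval2015", "SemEval2016", "SemEval2017"]
REPRESENTATIONS = ["sequential", "tree"]


def probing_runs():
    """One (task, dataset, representation) triple per probing model."""
    return list(product(TASKS, DATASETS, REPRESENTATIONS))


def probe_forward(embedding, W_hidden, b_hidden, W_out, b_out):
    """MLP probe: hidden layer as wide as the frozen input embedding,
    a ReLU activation, then a prediction layer."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, embedding)) + bias)
              for row, bias in zip(W_hidden, b_hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) + bias
            for row, bias in zip(W_out, b_out)]
```

Since the input embedding is only read and never updated, whatever the probe learns reflects information already present in the frozen representation.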

Probing Results and Discussion
Our probing experiment results are shown in Table 6. We compute the Spearman correlations of the added values of the tree-structured representations compared to the sequential representations in each probing task with the same added value in the AS2 task. We compute the added value of the tree representation in a given task by subtracting the performance of the sequential representations (Laskar et al., 2020) from the performance of the tree-structured representations (ours).
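The correlation analysis described above can be sketched with a small pure-Python Spearman implementation: compute the added value (tree minus sequential) per dataset for a probing task and for the AS2 metric, then correlate the two lists of differences. The helper names are illustrative, ties are handled by average ranks, and the numbers in the usage note are made up for demonstration rather than taken from Table 6.

```python
import math


def added_value(tree_scores, sequential_scores):
    """Per-dataset difference: tree-structured minus sequential."""
    return [t - s for t, s in zip(tree_scores, sequential_scores)]


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

For example, with made-up probing gains `[0.30, 0.25, 0.02, 0.28]` and made-up AS2 gains `[0.010, 0.006, -0.004, 0.008]`, the two lists have the same ordering, so `spearman` returns a value of 1 up to floating-point error.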
For the syntactic probing tasks (the first two), the tree-structured representation gets an F1 score about 3 to 4 times higher than that obtained by the sequential representation in 4 datasets: TrecQA, WikiQA, and SemEval 2015 and 2017. These 4 datasets correspond to the ones in which our tree-structured AS2 models set a new state of the art or matched the performance of the fine-tuning baseline of Laskar et al. (2020). In the other datasets, the tree-structured representation's F1 score is just slightly higher than the sequential representation's F1 score, if not about the same. This shows that when the tree-structured representations successfully absorb the syntactic information we fed them, there is a consistent increase in performance in the answer sentence selection task. The high correlation values for both MAP and MRR confirm that successfully absorbing syntactic information is associated with higher performance in AS2. The weakness of tree-structured representations in certain datasets may be due to the lack of generalization of syntactic parsers trained on the Penn Treebank.
In the input length probing experiment, we observe that the mean-squared error (MSE) of the tree-structured representations is consistently and significantly lower than that of the sequential representations, except for YahooCQA. This shows that the recursion of our tree-structured AS2 model makes representations aware of the length of their question-sentence pair, but the correlation values show that this information does not necessarily help in the AS2 task.

Conclusions
We introduce the Tree Aggregation Transformer: a novel, recursive and tree-structured self-attention model for AS2. Our method embeds sentences by aggregating word representations following the corresponding parse tree. We show that our model leverages tree structure and, through an ablation study, that its learned syntactic embeddings increase performance. Our method establishes a new state of the art in the TrecQA and WikiQA benchmark datasets with only one additional self-attention layer. Our tree-structured self-attention exceeds or matches the state of the art in 2 out of 4 CQA datasets, where text is informal and longer. To investigate this mixed performance, we devise 3 probing tasks to examine what our tree-structured representations learn compared to their sequential counterparts. We find that there is a strong correlation between a tree-structured model's ability to absorb syntactic information and its ability to increase performance in the AS2 task compared to baselines. Our findings suggest that there is more work to be done for tree-structured representations to adapt to noisy user-generated text.