Red Dragon AI at TextGraphs 2020 Shared Task: LIT: LSTM-Interleaved Transformer for Multi-Hop Explanation Ranking

Explainable question answering for science questions is a challenging task that requires multi-hop inference over a large set of fact sentences. To counter the limitations of methods that view each query-document pair in isolation, we propose the LSTM-Interleaved Transformer which incorporates cross-document interactions for improved multi-hop ranking. The LIT architecture can leverage prior ranking positions in the re-ranking setting. Our model is competitive on the current leaderboard for the TextGraphs 2020 shared task, achieving a test-set MAP of 0.5607, and would have gained third place had we submitted before the competition deadline. Our code implementation is made available at https://github.com/mdda/worldtree_corpus/tree/textgraphs_2020


Introduction
Complex question answering often requires reasoning over many evidence documents, which is known as multi-hop inference. Existing datasets such as Wikihop (Welbl et al., 2018), OpenBookQA (Mihaylov et al., 2018), and QASC (Khot et al., 2020) are limited by their artificial questions and short aggregation, requiring fewer than 3 facts. In comparison, TextGraphs (Jansen and Ustalov, 2020) uses WorldTree V2 (Xie et al., 2020), the largest available dataset, which requires combining an average of 6 and up to 16 facts in order to generate an explanation for complex science questions. The dataset contains 5k questions that require knowledge of core science as well as common sense. Figure 1 shows an example question from the WorldTree dataset. Evaluation for this dataset is framed as a ranking objective over a large set of 9k science facts, and models are scored with the MAP metric over the predicted rank ordering. Multi-hop inference must contend with significant noise from "distraction" documents, a challenge known as semantic drift (Fried et al., 2015). Compared to WorldTree V1 (Jansen et al., 2018), WorldTree V2 has more examples but is also more challenging, as the larger pool of science facts presents a greater risk of semantic drift.
Neural information retrieval models such as DPR (Karpukhin et al., 2020), RAG (Lewis et al., 2020), and ColBERT (Khattab and Zaharia, 2020) assume query-document independence and use a language model to generate sentence representations for the query and document separately. The advantage of this late-interaction approach is efficient inference, as the sentence representations can be computed beforehand and optimized lookup methods such as FAISS (Johnson et al., 2017) exist for this purpose. However, late interaction compromises the deeper semantic understanding possible with language models. Early-interaction approaches such as TFR-BERT (Han et al., 2020) instead concatenate the query and document before generating a unified sentence representation. This approach is more computationally expensive but is attractive for re-ranking over a limited number of documents. However, both approaches consider each query-document pair in isolation, forgoing any cross-document interaction that could leverage additional knowledge sources or benefit the ranking objective. Other work (Pasumarthi et al., 2019; Pobrotyn et al., 2020; Sun and Duh, 2020) facilitates cross-document interactions through self-attention mechanisms, but applies them only after the feature extraction step, and so cannot leverage the language understanding potential in the earlier language model layers.
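The difference between the two interaction styles can be sketched with a toy bag-of-words encoder standing in for the language model; the vocabulary, documents, and scoring below are purely illustrative, not the actual DPR or TFR-BERT computations:

```python
import numpy as np

VOCAB = ["animal", "eats", "plants", "herbivore", "sun", "energy"]

def embed(text):
    """Toy bag-of-words 'encoder' standing in for a language model."""
    v = np.array([text.lower().split().count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

# Late interaction: document vectors are computed once offline, so each
# query costs only a cheap similarity lookup against the cached vectors.
docs = ["an animal eats plants", "the sun provides energy"]
doc_vecs = np.stack([embed(d) for d in docs])

def late_scores(query):
    return doc_vecs @ embed(query)  # one dot product per document

# Early interaction: query and document are concatenated and encoded
# together, so nothing can be cached and every pair needs a full encode.
def early_repr(query, doc):
    return embed(query + " " + doc)
```

With a real language model, the early-interaction representation lets query and document tokens attend to each other, which is exactly what the cached late-interaction vectors give up.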
The most straightforward loss for the document ranking objective is Binary Crossentropy, where each document is ranked according to its binary classification probability of being within the gold explanation set. However, there has been recent progress on differentiable losses that optimize directly for the ranking objective (Wang et al., 2018; Revaud et al., 2019; Engilberge et al., 2019). In this work, we also compare the benefits of each loss for multi-hop ranking.
The main contributions of this work are: 1. We show that conventional information retrieval-based methods are still a strong baseline and propose I-BM25, an iterative retrieval method that improves inference speed and recall by emulating multi-hop retrieval.
2. We propose a hierarchical LSTM-interleaved transformer (LIT) architecture that maximizes early cross-document interactions for improved multi-hop re-ranking.

3. We provide empirical comparisons of training with different loss functions and show that Binary Crossentropy loss is simple yet may outperform differentiable ranking losses.

Models
Three different system architectures are described here, with overall schemes illustrated in Figure 3 for comparison.

Iterative BM25 Retrieval
Chia et al. (2019) showed that conventional information retrieval methods can be a strong baseline when modified to suit the multi-hop inference objective. However, this method is limited by computationally expensive inference and sensitivity to noise and semantic drift. We propose an iterative retrieval method, 'I-BM25', that performs inference in a fraction of the time and reduces semantic drift, resulting in an even stronger baseline retrieval method. For preprocessing, we use spaCy (Honnibal and Montani, 2017). The method proceeds as follows:

1. Vector representations are precomputed for each question and each explanation candidate.

2. For each question, the closest n explanation candidates by cosine proximity are selected and their vectors are aggregated by a max operation. The aggregated vector is down-scaled and used to update the query vector through a max operation.
3. The previous step is repeated for increasing values of n until there are no candidate explanations remaining.
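The loop above can be sketched as follows; the vector construction, the schedule for n, and the down-scaling factor here are illustrative assumptions rather than the exact settings used in our system:

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def i_bm25_rank(query_vec, fact_vecs, scale=0.5):
    """Sketch of the I-BM25 loop: iteratively select the n nearest facts,
    fold them back into the query, and widen n each round."""
    remaining = set(range(len(fact_vecs)))
    ranking, query, n = [], query_vec.astype(float).copy(), 2
    while remaining:
        # Cosine proximity of the (updated) query to remaining candidates.
        sims = {i: cosine(query, fact_vecs[i]) for i in remaining}
        top = sorted(sims, key=sims.get, reverse=True)[:n]
        ranking.extend(top)
        remaining.difference_update(top)
        # Max-aggregate the newly selected facts, down-scale, and merge
        # into the query by an element-wise max (emulating one "hop").
        agg = fact_vecs[top].max(axis=0) * scale
        query = np.maximum(query, agg)
        n *= 2  # increase n each iteration until no candidates remain
    return ranking
```

Because the query vector only grows via element-wise max, later hops can reach facts that share no terms with the original question while the original question terms are never washed out, which is what limits semantic drift.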

LSTM-After Transformer for Re-Ranking
BERT is a pre-trained language model that is widely adopted and fine-tuned for many downstream NLP tasks. Due to computational constraints, we use DistilBERT (Sanh et al., 2020), which has 40% fewer parameters and comparable performance. In sequence-level tasks such as text classification, a [CLS] token is a special token inserted at the front of the sequence, and the latent representation of this token is passed to a feed-forward network for prediction. We append an LSTM (Hochreiter and Schmidhuber, 1997) after the final transformer layer. This hierarchical structure allows the transformer to perform cross-document reasoning and knowledge referencing. The LSTM layers enable the model to be rank-aware when used in the re-ranking setting. For re-ranking, the top 128 predictions from I-BM25 are passed to the LSTM-After Transformer, which performs binary classification for each document.
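A minimal sketch of this re-ranking flow, with random vectors standing in for the DistilBERT [CLS] representations of each (question, fact) pair and a plain RNN cell standing in for the LSTM; all sizes and weights below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 4  # toy [CLS] width and recurrent state width

# Stand-ins: in the actual model these come from DistilBERT run over each
# concatenated (question, fact) pair, listed in I-BM25 rank order.
cls_vecs = rng.normal(size=(128, D))

W_rec = rng.normal(size=(H, H + D)) * 0.1  # simple RNN cell in place of the LSTM
w_out = rng.normal(size=H) * 0.1           # binary classification head

def rerank_probs(cls_vecs):
    """Run a recurrent layer over the ranked [CLS] vectors, then score
    each position with a binary classification head."""
    h, probs = np.zeros(H), []
    for x in cls_vecs:  # consuming them in rank order makes the model rank-aware
        h = np.tanh(W_rec @ np.concatenate([h, x]))
        probs.append(1.0 / (1.0 + np.exp(-(w_out @ h))))  # sigmoid probability
    return np.array(probs)

probs = rerank_probs(cls_vecs)
reranked = np.argsort(-probs)  # final ordering by classification probability
```

The key design point is that the recurrent state carries information across the 128 candidates, so each fact's score can depend on its BM25 position and on the facts ranked before it.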

LSTM-Interleaved Transformer for Re-Ranking
TextGraphs is a challenging task that requires complex multi-hop reasoning, yet information retrieval methods are surprisingly strong baselines. To enhance cross-document interaction and leverage language representations in the earlier transformer layers, we interleave adapters (Houlsby et al., 2019) into the architecture that are recurrent rather than merely feed-forward. The LSTM-adapter modules in Figure 2 operate on the latent representation at the [CLS] position of each document at each layer of the transformer. After each transformer layer, the [CLS] latent representations for each input document are first down-projected, passed to the LSTM layers, and finally up-projected and fed into the next transformer layer. Compared to Houlsby et al. (2019), the LIT architecture is fully trainable and makes the transformer more expressive by enabling cross-document reasoning that was previously not possible. Apart from the LSTM, we also tested GCN (Kipf and Welling, 2017) and Self-Attention (Parikh et al., 2016) layers, but had limited success in achieving competitive performance with them.

Table 1 shows that I-BM25 is a strong information retrieval method that can be a drop-in replacement for previous information retrieval methods. The results also show the advantage of the LIT architecture in interleaving LSTM layers between transformer layers, rather than placing them after the last transformer layer.
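The interleaving pattern can be sketched as follows; a toy function stands in for each transformer layer, a plain RNN cell stands in for the LSTM, and all sizes and weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_DOCS, D, B = 16, 8, 4  # documents, hidden width, adapter bottleneck (toy sizes)

W_down = rng.normal(size=(B, D)) * 0.1  # down-projection into the bottleneck
W_up = rng.normal(size=(D, B)) * 0.1    # up-projection back to hidden width
W_rec = rng.normal(size=(B, 2 * B)) * 0.1

def lstm_adapter(cls_states):
    """Mix the [CLS] states of all documents between transformer layers.
    A plain RNN cell stands in for the LSTM here."""
    h, mixed = np.zeros(B), []
    for x in cls_states @ W_down.T:  # down-project each document's [CLS] state
        h = np.tanh(W_rec @ np.concatenate([h, x]))  # cross-document recurrence
        mixed.append(h)
    return cls_states + np.stack(mixed) @ W_up.T  # up-project with a residual add

def toy_transformer_layer(cls_states):
    return np.tanh(cls_states)  # stand-in for a real transformer layer

cls_states = rng.normal(size=(NUM_DOCS, D))
for _ in range(6):  # e.g. DistilBERT's six layers
    cls_states = toy_transformer_layer(cls_states)
    cls_states = lstm_adapter(cls_states)  # interleaved after every layer
```

Because the adapter runs after every layer rather than only at the end, cross-document evidence can influence how each document is encoded by all subsequent transformer layers.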

Loss Function
The results of optimization using three different loss objectives are shown in Table 2:

Loss Function         Dev MAP
LambdaLoss            0.4970
APLoss                0.5187
Binary Crossentropy   0.5680

Surprisingly, the direct ranking-oriented objectives were less effective at improving the final evaluation MAP score, potentially because the bucketisation approximation used in the APLoss calculations was not appropriately pre-scaled in our experiments; in that case, training may require different hyperparameters to converge optimally. Another potential explanation is that these ranking losses may be sub-optimal as training objectives when many documents have very similar underlying scores, as is the case here.
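As a concrete sketch, the Binary Crossentropy objective treats each candidate fact as an independent binary example, and the resulting probabilities double as the ranking scores; the scores and gold labels below are illustrative only:

```python
import numpy as np

def bce_loss(scores, gold_mask):
    """Binary cross-entropy over documents: a fact is a positive example
    if it belongs to the gold explanation set, a negative otherwise."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))  # sigmoid
    y = np.asarray(gold_mask, dtype=float)
    eps = 1e-12  # numerical guard against log(0)
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

# At inference time the same per-document probabilities induce the ranking.
scores = np.array([2.3, -1.1, 0.7, -3.0])  # model logits for 4 candidate facts
gold = np.array([1, 0, 1, 0])              # gold explanation membership
ranking = np.argsort(-scores)              # positives should float to the top
```

Unlike listwise losses such as LambdaLoss or APLoss, this objective never compares documents against each other during training, which is what makes it simple yet, per Table 2, surprisingly competitive.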

Notes
Further to our experience last year, we included preprocessing steps to isolate the branching 'combo' statements (which essentially contain OR clauses between different noun phrases, for instance). This step remains in our codebase, but we did not exploit it fully, since a full treatment would require the isolation of which 'combo branch' is taken by each gold statement in the training set.

Discussion
Other architectures that we explored included graph neural network (GNN) methods; however, we had insufficient time to tune these for the multi-hop explanation task addressed here. Surprisingly, our simple LSTM methods (which can be viewed as a linear graph performing message-passing along the list of results ordered by the I-BM25 method) already provided a competitive approach. We estimate that next year's competition will require the use of graph-based methods, due to their greater expressive power.

Conclusion
The LIT architecture is a simple yet powerful adaptation of the Transformer architecture to learn better cross-document interactions for multi-hop ranking. The structure can be easily integrated with any transformer language model to enable cross-referencing of knowledge statements and improved ranking performance. For example, LIT can be a drop-in encoder for other multi-hop question answering datasets such as HotPotQA (Yang et al., 2018). When applied to the challenging WorldTree V2 dataset, LIT achieves competitive performance with current state-of-the-art models despite a smaller footprint. We envision that this architecture can be beneficial to many NLP tasks which require multi-hop reasoning over documents.