ChiSquareX at TextGraphs 2020 Shared Task: Leveraging Pretrained Language Models for Explanation Regeneration

In this work, we describe the system developed by a group of undergraduates from the Indian Institutes of Technology for the Shared Task at TextGraphs-14 on Multi-Hop Inference Explanation Regeneration (Jansen and Ustalov, 2020). The shared task required participants to develop methods to reconstruct gold explanations for elementary science questions from the WorldTree Corpus (Xie et al., 2020). Although our research was not funded by any organization and all models were trained on freely available tools such as Google Colab, which restricted our computational capabilities, we achieved noteworthy results, placing 4th with a MAP score of 0.49021 on the evaluation leaderboard and reaching a MAP score of 0.5062 on the post-evaluation-phase leaderboard using RoBERTa. We incorporated some of the methods proposed in the previous edition, TextGraphs-13 (Chia et al., 2019), which proved to be very effective, improved upon them, and built a model on top of them using powerful state-of-the-art pre-trained language models such as RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020), and SciBERT (Beltagy et al., 2019), among others. Our work could be optimized further given better computational resources.


Introduction
The Shared Task is aimed at Multi-hop Inference for Explanation Regeneration. Participants are required to develop new and improve existing methods to reconstruct gold explanations for the WorldTree Corpus (Xie et al., 2020) of elementary science questions, their answers, and explanations.
Question: Which of the following is an example of an organism taking in nutrients?
(A) a dog burying a bone
(B) a girl eating an apple
(C) an insect crawling on a leaf
(D) a boy planting tomatoes
Answer: (B) a girl eating an apple
Gold Explanation Facts:
1) A girl means a human girl: Grounding
2) Humans are living organisms: Grounding
3) Eating is when an organism takes in nutrients in the form of food: Central
4) Fruits are kinds of foods: Grounding
5) An apple is a kind of fruit: Grounding
Irrelevant Explanation Facts:
1) Some flowers become fruits.
2) Fruit contains seeds.
3) living things live in their habitat.
4) Consumers eat other organisms

The example highlights an instance for this task, where systems need to perform multi-hop inference to combine diverse information and identify relevant explanation sentences required to answer the specific question. The task provides a new and more challenging corpus of 9029 explanations and a set of gold explanations for each question and correct answer pair.

              TG 2019   TG 2020
Questions     1680      4367
Explanations  4950      9029
Tables        62        81
Table 3: Dataset Comparison

Dataset
The dataset is the WorldTree Corpus V2.1 (Xie et al., 2020) of Explanation Graphs and Inference Patterns supporting Multi-hop Inference (February 2020 snapshot). It is a newer version of the dataset used in the TextGraphs-2019 shared task (Jansen and Ustalov, 2019). The two datasets are compared in Table 3.

Problem Review
The problem statement requires participants to build a system that, given a question and its answer choices, identifies the sentences that explain the correct answer. This is challenging because the corpus contains many irrelevant sentences whose lexical and semantic overlap with the question is as significant as that of the correct ones (Fried et al., 2015). A more classical graph-theoretic approach based on the semantic overlap of explanations and questions leads to the problem of semantic drift (Jansen, 2018). Such graph methods were attempted by Kwon et al. (2018), who analyzed the challenge of semantic drift in multi-hop inference and demonstrated the effectiveness of information extraction methods. Approaching the problem as a language generation task is also not effective, since current state-of-the-art models (Dušek et al., 2020) are not capable of generating the exact explanations required by this task. The task can, however, be cast naturally as a sentence ranking problem in which the relevant facts must be ranked above all other facts in the corpus. The evaluation metric used for the task is mean average precision (MAP), a widely used and robust ranking metric. We describe our initial experiments in Section 4.1, followed by the pre-processing methods in Section 4.2. We then discuss our models in Sections 4.3 through 4.6. Finally, we present our results and discussion in Section 5, followed by the conclusion and acknowledgments.
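To make the evaluation metric concrete, the following is a minimal sketch of mean average precision over ranked explanation lists. The function names and the use of string fact IDs are assumptions made for illustration; the official task scorer may differ in details such as how missing or duplicated facts are handled.

```python
from typing import List, Set


def average_precision(ranked_ids: List[str], gold_ids: Set[str]) -> float:
    """Average precision of one ranked list against its gold explanation IDs."""
    if not gold_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, fact_id in enumerate(ranked_ids, start=1):
        if fact_id in gold_ids:
            hits += 1
            # Precision at the rank where a gold fact is found.
            precision_sum += hits / rank
    return precision_sum / len(gold_ids)


def mean_average_precision(rankings: List[List[str]], golds: List[Set[str]]) -> float:
    """Mean of the per-question average precision values."""
    scores = [average_precision(r, g) for r, g in zip(rankings, golds)]
    return sum(scores) / len(scores) if scores else 0.0
```

A ranking that places every gold fact ahead of every other fact scores 1.0 for that question, so the metric rewards pushing relevant facts towards the top of the list.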

Initial Experiments
We used the plain-text form of each explanation, question, and correct answer rather than the semi-structured form given in the column-oriented files provided with the dataset. Initially, we only shortened the original question text, which included all the answer choices, by removing the incorrect answers; this improved performance, as was also seen in the previous edition of the task. The TFIDF baseline with this basic pre-processing gave a MAP score of 0.3065 on the hidden test set. Taking this as the starting point, we built a SentenceBERT model in which we converted all questions and explanations into contextual word embedding vectors and ranked the explanations in descending order of cosine similarity between the embedded vectors. The MAP score dropped to 0.2427 on the test dataset, which is worse than the simple TFIDF ranker. On further inspection, we realized that the semantic overlap between the question and the irrelevant explanations caused this unexpected drop. We therefore decided not to use contextual word embeddings and instead to improve the simple but effective TFIDF information retrieval technique for the ranker. We then tried Sublinear TFIDF and Binary TFIDF. The optimized Sublinear TFIDF vectorizer improved the score to 0.3254 MAP.
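The difference between the raw, sublinear, and binary TFIDF variants can be illustrated with a small standard-library sketch. This is not the vectorizer used in our experiments: the smoothed IDF formula, the L2 normalization, and all function names are assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_table(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens: List[str], idf: Dict[str, float], mode: str = "raw") -> Dict[str, float]:
    """L2-normalized TFIDF vector; mode is 'raw', 'sublinear', or 'binary'."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {}
    for term, tf in counts.items():
        if mode == "sublinear":
            weight = 1.0 + math.log(tf)   # dampens repeated terms
        elif mode == "binary":
            weight = 1.0                  # only presence matters
        else:
            weight = float(tf)            # raw term count
        vec[term] = weight * idf[term]
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def rank(question: List[str], explanations: List[List[str]], mode: str = "raw") -> List[int]:
    """Indices of explanations sorted by descending similarity to the question."""
    idf = idf_table(explanations)
    q = tfidf_vector(question, idf, mode)
    scores = [cosine(q, tfidf_vector(e, idf, mode)) for e in explanations]
    return sorted(range(len(explanations)), key=lambda i: scores[i], reverse=True)
```

For short texts such as these explanations, most terms occur once, so the three variants differ mainly on the few texts that repeat a keyword.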

Preprocessing
The TFIDF algorithm proved very sensitive to keywords, so we applied the pre-processing and optimization techniques described by Chia et al. (2019). For each text, we performed Penn Treebank tokenization using NLTK, followed by lemmatization using the lemmatization files provided with the dataset. Lemmatization reduces the vocabulary size by merging the different forms of the same keyword. We also removed stopwords, which removed noise from the texts. A simple TFIDF-based ranker with the above pre-processing returned a MAP score of 0.3850. Substituting Sublinear TFIDF increased the score to 0.4080 MAP. With some experimentation, we further improved the MAP score to 0.426 using Binary TFIDF. Finally, we applied Recursive TFIDF as proposed by Chia et al. (2019), in which the TFIDF vector is treated as a representation of the current chain of reasoning and each successive iteration builds on this representation to accumulate a sequence of explanations. We tuned the remaining variables: normalization, maxlen, hops, and scale. We found the MAP to be completely independent of the normalization used. For maxlen = {128, 125, 144}, we found maxlen = 128 to be the most efficient. For the number of hops = {1, 2, 3}, we found 1 to be best. This may be because semantic drift creeps in as we explore nodes further away from the current node. A scaling factor is applied to each successive explanation as it is added to the TFIDF vector. For downscaling factors = {1.25, 1.3, 1.35}, we found 1.25 to be optimal. Used together with Binary TFIDF, this gave a slight improvement, to 0.4430 MAP. All of these steps were carried out as part of the pre-processing stage.
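The recursive scheme can be sketched as follows. This is one plausible reading of the method of Chia et al. (2019), not a reproduction of their code: the element-wise max used to fold a selected explanation into the query, the exact form of the downscaling, and the function names are all assumptions. Vectors are sparse dictionaries that are assumed to be already TFIDF-weighted and normalized.

```python
from typing import Dict, List

Vector = Dict[str, float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def recursive_rank(question: Vector, facts: List[Vector],
                   hops: int = 1, scale: float = 1.25) -> List[int]:
    """Rank facts, folding the best fact of each hop back into the query."""
    query = dict(question)
    chosen: List[int] = []
    remaining = set(range(len(facts)))
    for hop in range(hops):
        if not remaining:
            break
        # Nearest fact to the current chain-of-reasoning vector.
        best = max(sorted(remaining), key=lambda i: dot(query, facts[i]))
        chosen.append(best)
        remaining.discard(best)
        # Assumed aggregation: element-wise max with the downscaled fact vector.
        weight = 1.0 / (scale ** (hop + 1))
        for term, w in facts[best].items():
            query[term] = max(query.get(term, 0.0), w * weight)
    # Remaining facts are ranked against the final, expanded query.
    rest = sorted(sorted(remaining), key=lambda i: dot(query, facts[i]), reverse=True)
    return chosen + rest
```

With a single hop, a fact that shares no term with the question can still be ranked highly if it overlaps with the first selected explanation; the downscaling keeps such second-order matches below direct matches.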

Pure Language Model approach
After completing all the pre-processing steps, we tried a simple, pure language model based approach of the kind that has shown good performance in text classification tasks. We took each processed question and concatenated each of the 9029 explanations to it, one by one. Then, for each of these question + explanation pairs, we used a simple language model based BERT classifier (BERTForSequenceClassification) to predict whether the explanation was one of the gold explanations for that question. This gave 0.4116 MAP, which was lower than we expected. We identified two major problems with this simplistic approach.
• Class imbalance: Most of the question-explanations would be labeled 0 (False), since out of the 9029 total explanations only a few would actually be gold explanations for the question. This causes the classifier to output the 0 label almost all the time and prevents it from learning the true relations between the gold explanations and the question. It is possible that the class imbalance could be mitigated by searching for better hyperparameter values; however, we weren't able to do that with the available resources, so we applied a different technique to address this.
• Non-scalability: This approach would require inferences equal to the number of explanations in the corpus for every question. While it's possible to do this for this relatively small corpus of 9029 explanations, as the number of explanations becomes larger, this approach would no longer be feasible; requiring too much time for training and, more importantly, for inference.

Using TFIDF to retrieve relevant explanations
To address the above problems, we applied the best TFIDF vectorizer obtained in the pre-processing step (binary + recursive TFIDF) to first retrieve the most relevant explanations for a given question based on the lexical overlap between the question and the explanations. The number of explanations retrieved by this initial ranker (top k) was a tuned parameter. This technique was very effective at retrieving the gold explanations for a question. Table 5 shows the fraction of gold explanations retrieved for different values of top k. Almost 87% of the gold explanations are retrieved when we take the top 100 explanations from TFIDF, and almost 99% are retrieved when we take the top 500. This reduces computation, since the model now has to handle at most 500 explanations per question instead of 9029. The top k retrieved explanations are then concatenated to the question as in the previous approach, and the classifier is trained to predict whether a given explanation among the top k is a correct explanation for the question. The score using the BERTForSequenceClassification model was 0.4365 MAP. We inferred that the score was low because the model was still predicting the 0 label for almost all inputs, since a significant imbalance remained in the training dataset (though much less than before).
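The retrieval coverage discussed above can be measured with a short helper like the one below. It is a sketch under assumptions: the function name is hypothetical, and it pools all gold facts across questions (a micro-average), whereas the reported figures may have been computed per question.

```python
from typing import List, Set


def gold_recall_at_k(rankings: List[List[str]], golds: List[Set[str]], k: int) -> float:
    """Fraction of all gold explanations that appear in the top-k retrieved lists."""
    retrieved = 0
    total = 0
    for ranked, gold in zip(rankings, golds):
        top_k = set(ranked[:k])
        retrieved += len(top_k & gold)   # gold facts that survive the cut-off
        total += len(gold)
    return retrieved / total if total else 0.0
```

Any gold fact that falls outside the top k can never be recovered by the re-ranker, so this value is an upper bound on what the second stage can achieve for a given k.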

Addressing class imbalance
To address the class imbalance in the question-explanation pairs, we applied a simple approach: oversampling the minority class (the positive, or '1', label). We repeated the gold explanations during training so that, for each question, the numbers of positively and negatively labeled explanations were equal (each equal to top k/2). Hence the training explanations for a given question were the top k/2 negatively labeled explanations plus the positively labeled explanations retrieved by TFIDF, repeated until they also numbered top k/2. This was applied only during training, not during inference on the validation and test datasets. With this simple technique, we obtained a significant boost in performance for the BERT baseline, reaching a MAP score of 0.4506.
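The oversampling step can be sketched as below. The function name, the tuple layout of the training pairs, and the choice to cycle through the retrieved gold explanations in order are assumptions for illustration; only the balancing idea itself comes from the description above.

```python
from itertools import cycle, islice
from typing import List, Set, Tuple


def balanced_pairs(question: str, candidates: List[str],
                   gold_ids: Set[str], top_k: int) -> List[Tuple[str, str, int]]:
    """Build (question, explanation, label) training pairs with balanced labels.

    `candidates` is the TFIDF-ranked list of retrieved explanations.
    """
    half = top_k // 2
    positives = [c for c in candidates if c in gold_ids]
    negatives = [c for c in candidates if c not in gold_ids][:half]
    if not positives:
        # No gold explanation was retrieved, so there is nothing to oversample.
        return [(question, c, 0) for c in negatives]
    # Repeat the retrieved gold explanations until they match the negatives.
    repeated = list(islice(cycle(positives), len(negatives)))
    return ([(question, c, 1) for c in repeated]
            + [(question, c, 0) for c in negatives])
```

Shuffling the resulting pairs before batching is advisable so that each batch mixes both labels.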

Pre-trained Language Models
We tried out all pre-trained language models available for sequence classification. We optimized the following hyperparameters: top k, num train epochs, batch size, learning rate, epsilon, gradient accumulation steps, max grad norm, and weight decay. The batch size depends on the available GPU RAM, while top k and num train epochs were chosen according to the training time we could afford.
Since we needed to limit the training time, we first trained all models with top k set to 100 for 3 epochs to get a preliminary estimate of model performance. We then took the best models and trained them with a higher top k (500 or 300, whichever was feasible) to get a boost in score. Our best-performing model took close to 8 hours to train. Further details of the models are given in the supplementary material.

Results and discussion
We present the scores in the table given below. Our best performance came from RoBERTa. There is only a slight variation in the final scores of most pre-trained language models. We also observe that the models overfit the given data. Due to computational constraints, we could not perform a grid search over all parameters and had to search for the best hyperparameters manually, so the performance of any given model may not be optimal. Further, we trained only RoBERTa and BART with top k = 500 explanations; the other models had too long a training time or too high a RAM requirement.

Conclusion
We have given a system description of our team, ChiSquareX, which placed 4th on the evaluation phase leaderboard with a MAP score of 0.4902. We have presented a system with optimized pre-processing of the dataset, followed by an optimized TFIDF information retrieval scheme to obtain initial ranks and then a pre-trained language model based re-ranker to produce the final ranking of explanations. Despite the computational constraints, just by leveraging Google Colab and other open-source tools, we managed to fine-tune state-of-the-art pre-trained language models such as RoBERTa, BART, and ELECTRA on the WorldTree Corpus (Xie et al., 2020) and achieve a reasonable MAP score.