Context-Aware Transformer Pre-Training for Answer Sentence Selection

Answer Sentence Selection (AS2) is a core component for building an accurate Question Answering pipeline. AS2 models rank a set of candidate sentences based on how likely they answer a given question. The state of the art in AS2 exploits pre-trained transformers by transferring them on large annotated datasets, while using local contextual information around the candidate sentence. In this paper, we propose three pre-training objectives designed to mimic the downstream fine-tuning task of contextual AS2. This allows for specializing LMs when fine-tuning for contextual AS2. Our experiments on three public and two large-scale industrial datasets show that our pre-training approaches (applied to RoBERTa and ELECTRA) can improve baseline contextual AS2 accuracy by up to 8% on some datasets.


Introduction
Answer Sentence Selection (AS2) is a fundamental task in QA, which consists of re-ranking a set of answer sentence candidates according to how correctly they answer a given question. From a practical standpoint, AS2-based QA systems can operate under much lower latency constraints than corresponding Machine Reading (MR) based QA systems. Nowadays, latency is of particular importance because sources of information such as Knowledge Bases or Web Indexes may contain millions or billions of documents. In AS2, latency can be minimized because systems process several sentences/documents in parallel, while MR systems parse the entire document/passage in a sliding-window fashion before finding the answer span (Garg and Moschitti, 2021; Gabburo et al., 2022).
Modern AS2 systems (Garg et al., 2020; Laskar et al., 2020) use transformers to cross-encode question and answer candidates together. Recently, Lauriola and Moschitti (2021) showed that performing answer ranking using only the candidate sentence is sub-optimal: e.g., the answer sentence may contain unresolved coreferences with entities, or may lack specific context for answering the question. Several works (Ghosh et al., 2016; Tan et al., 2018; Han et al., 2021) have explored performing AS2 using context around answer candidates (for example, adjacent sentences) to improve performance. Local contextual information, i.e., the previous and next sentences of the answer candidates, can help coreference disambiguation and provide additional knowledge to the model. This helps to rank the best answer at the top, with minimal increase in compute requirements and latency.

* Work done as an intern at Amazon Alexa AI.
Previous research works (Lauriola and Moschitti, 2021; Han et al., 2021) have directly used existing pre-trained transformer encoders for contextual AS2, fine-tuning them on an input comprising multiple sentences with different roles, i.e., the question, the answer candidate, and the context (the previous and following sentences around the candidate). This structured input creates practical challenges during fine-tuning, as standard pre-training approaches do not align well with the downstream contextual AS2 task: e.g., the language model does not know the role of each of these multiple sentences in the input. In other words, the extended sentence-level embeddings have to be learnt directly during fine-tuning, which empirically causes under-performance. This effect is amplified when the downstream fine-tuning data is small, indicating that models struggle to exploit the context.
In this paper, we tackle the aforementioned issues by designing three pre-training objectives that structurally align with the final contextual AS2 task and can help improve the performance of language models when fine-tuned for AS2. Our pre-training objectives exploit information in the structure of paragraphs and documents to pre-train the context slots in the transformer text input. We evaluate our strategies on two popular pre-trained transformers over five datasets. The results show that our approaches using structural pre-training can effectively adapt transformers to process contextualized input, improving accuracy by up to 8% over the baselines on some datasets.
Related Work

Answer Sentence Selection TANDA (Garg et al., 2020) established the SOTA for AS2 by using a large dataset (ASNQ) for transfer learning. Other approaches for AS2 include separate encoders for questions and answers (Bonadiman and Moschitti, 2020), and compare-aggregate and clustering to improve answer relevance ranking (Yoon et al., 2019).

Pre-training Objectives Sentence-level objectives such as NSP (Devlin et al., 2019) and SOP (Lan et al., 2020) have been widely explored for transformers to improve accuracy on downstream classification tasks. However, the majority of these objectives are agnostic of the final tasks. End task-aware pre-training has been studied for summarization (Rothe et al., 2021), dialogue (Li et al., 2020), passage retrieval (Gao and Callan, 2021), MR (Ram et al., 2021) and multi-task learning (Dery et al., 2021). Lee et al. (2019), Chang et al. (2020) and Sachan et al. (2021) use the Inverse Cloze task to improve retrieval performance for bi-encoders, by exploiting paragraph structure via self-supervised objectives. For AS2, Di Liello et al. (2022a) recently proposed paragraph-aware pre-training for joint classification of multiple candidates, and Di Liello et al. (2022b) propose a sentence-level pre-training paradigm for AS2 by exploiting document and paragraph structure. However, these works do not consider the structure of the downstream task (specifically, contextual AS2). To the best of our knowledge, ours is the first work to study transformer pre-training strategies for AS2 augmented with context using cross-encoders.

Contextual AS2
AS2 Given a question q and a set of answer candidates S = {s_1, ..., s_n}, the goal is to find the best s_k that answers q. This is typically done by learning a binary classifier C of answer correctness, independently feeding the pairs (q, s_i), i ∈ {1, ..., n} as input to C, and making C predict whether s_i correctly answers q or not. At inference time, we find the best answer for q by selecting the candidate s_k with the highest probability of correctness: k = argmax_i C(q, s_i).
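This pairwise inference scheme can be sketched in a few lines. `toy_score` below is a hypothetical stand-in (simple word overlap) for the trained classifier C, used only to make the sketch runnable:

```python
def select_answer(question, candidates, score):
    """Rank candidates by P(correct | question, candidate) and return the best.

    `score` stands in for the trained binary classifier C: it maps a
    (question, candidate) pair to a correctness probability.
    """
    probs = [score(question, s) for s in candidates]
    best_index = max(range(len(candidates)), key=lambda i: probs[i])
    return best_index, candidates[best_index]


def toy_score(question, sentence):
    """Hypothetical scorer: word-overlap ratio, NOT a transformer."""
    q_words = set(question.lower().replace(".", "").split())
    s_words = set(sentence.lower().replace(".", "").split())
    return len(q_words & s_words) / max(len(q_words), 1)
```

In a real system, `score` would be a cross-encoder forward pass; the argmax over candidate probabilities is the same.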
Contextual AS2 Contextual models for AS2 exploit additional context to improve the final accuracy. This has been shown to be effective (Lauriola and Moschitti, 2021) for overcoming coreference ambiguity and the lack of sufficient information to rank the best answer at the top. Unlike the pairwise case above, contextual AS2 models receive as input a tuple (q, s_i, c_i), where c_i is the additional context, usually the sentences immediately before and after the answer candidate.

Context-aware Pre-training Objectives
We design a transformer pre-training task that aligns well with fine-tuning contextual AS2 models, both structurally and semantically. We exploit the division of large corpora into documents and the subdivision of documents into paragraphs as a source of supervision. During pre-training we provide triplets of text spans (a, b, c) as model inputs, emulating the structure of (q, s_i, c_i) for contextual AS2 models: a, b and c play the analogous roles of the question, the candidate sentence (that needs to be classified), and the context (which helps in predicting the correctness of (a, b)), respectively. Formally, given a document D from the pre-training corpus, the task is to infer whether a and b are two sentences extracted from the same paragraph P ∈ D. Following Di Liello et al. (2022b), we term this task "Sentences in Same Paragraph" (SSP).

Intuition for SSP Consider an example of a Wikipedia paragraph composed of three sentences:
s_1: Lovato was brought up in Dallas, Texas; she began playing the piano at age seven and guitar at ten, when she began dancing and acting classes.
s_2: In 2002, Lovato began her acting career on the children's television series Barney & Friends, portraying the role of Angela.
s_3: She appeared on Prison Break in 2006 and on Just Jordan the following year.
Given a question of the type "What are the acting roles of X", a standard LM can easily learn to select answers of the type "X acted/played in Y", by matching the subject argument of the question with the object argument of the answer, for the same predicate acting/playing. However, the same LM would have a harder time selecting answers of the type "X appeared in Y", because this requires learning the relation between the entire predicate-argument structure of acting vs. that of appearing. An LM pre-trained using the SSP task can learn these implications, as it reasons about concepts from s_3, e.g., "appearing in Prison Break and Just Jordan" (which are TV series), being related to concepts from s_2, e.g., "having an acting career", as the sentences belong to the same paragraph.
The semantics learned by connecting sentences in the same paragraph transfer well downstream, as the model can re-use previously learned relations between entities and concepts, and apply them between questions and answer candidates. Relations in one sentence may be used to formulate questions that can be answered in the other sentence, which is most likely to happen for sentences in the same paragraph, since every paragraph describes the same general topic from a different perspective.
We design three ways of choosing the appropriate contextual information c for SSP. We present below the details of how we sample the spans a, b and c from the pre-training documents.
Static Document-level Context (SDC) Here, we choose the context c to be the first paragraph P_0 of D = {P_0, ..., P_n}, the document from which b is extracted. This is based on the intuition that the first paragraph acts as a summary of a document's content (Chang et al., 2020): this strong context can help the model identify whether b is extracted from the same paragraph as a. We call this static document-level context since the contextual information c is constant for any b extracted from the same document D. Specifically, positive examples are created by sampling a and b from a single random paragraph P_i ∈ D, i > 0. For the previously chosen a, we create hard negatives by randomly sampling a sentence b from a different paragraph P_j ∈ D, j ≠ i ∧ j > 0. We set c = P_0 for this negative example as well, since b still belongs to D. We create easy negatives for a chosen a by sampling b from a random paragraph P'_i in another document D' ≠ D. In this case, c is chosen as the first paragraph P'_0 of D', since the context in the downstream AS2 task is associated with the answer candidate, and not with the question.
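The SDC sampling procedure can be sketched as follows. This is an illustrative simplification, not the authors' implementation; it assumes every document has at least three paragraphs and every paragraph at least two sentences:

```python
import random


def sdc_examples(documents, seed=0):
    """Sample SSP triples (a, b, c, label) with Static Document-level Context.

    `documents` is a list of documents; each document is a list of paragraphs;
    each paragraph is a list of sentence strings. The context c is always the
    first paragraph P_0 of the document the candidate b comes from.
    """
    rng = random.Random(seed)
    examples = []
    for d, doc in enumerate(documents):
        c = " ".join(doc[0])                     # static context: first paragraph P_0
        i = rng.randrange(1, len(doc))           # positive paragraph P_i, i > 0
        a, b = rng.sample(doc[i], 2)
        examples.append((a, b, c, 1))            # positive: a, b from the same paragraph
        j = rng.choice([k for k in range(1, len(doc)) if k != i])
        examples.append((a, rng.choice(doc[j]), c, 0))  # hard negative: same document
        d2 = rng.choice([k for k in range(len(documents)) if k != d])
        other = documents[d2]
        easy_b = rng.choice(rng.choice(other[1:]))
        # easy negative: b from another document; the context follows b's document
        examples.append((a, easy_b, " ".join(other[0]), 0))
    return examples
```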

Dynamic Paragraph-level Context (DPC)
We dynamically select the context c to be the paragraph from which the sentence b is extracted. We create positive examples by sampling a and b from a single random paragraph P_i ∈ D, and we set the context to the remaining sentences in P_i, i.e., c = P_i \ {a, b}. Note that leaving a and b in P_i would make the task trivial. For the previously chosen a, we create hard negatives by sampling b from another random paragraph P_j ∈ D, j ≠ i, and setting c = P_j \ {b}. We create easy negatives for a chosen a by sampling b from a random paragraph P'_i in another document D' ≠ D, and setting c = P'_i \ {b}.
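A simplified sketch of the DPC sampling (again illustrative, not the authors' code; it assumes at least two paragraphs per document and three sentences per paragraph, so the context is never empty):

```python
import random


def dpc_triples(documents, seed=0):
    """Sample SSP triples with Dynamic Paragraph-level Context.

    The context c is the candidate's own paragraph with the sampled
    sentences removed, so b never leaks into its context.
    """
    rng = random.Random(seed)
    out = []
    for d, doc in enumerate(documents):
        i = rng.randrange(len(doc))
        a, b = rng.sample(doc[i], 2)
        c = " ".join(s for s in doc[i] if s not in (a, b))  # c = P_i \ {a, b}
        out.append((a, b, c, 1))
        j = rng.choice([k for k in range(len(doc)) if k != i])
        hb = rng.choice(doc[j])
        out.append((a, hb, " ".join(s for s in doc[j] if s != hb), 0))  # c = P_j \ {b}
        d2 = rng.choice([k for k in range(len(documents)) if k != d])
        pj = rng.choice(documents[d2])
        eb = rng.choice(pj)
        out.append((a, eb, " ".join(s for s in pj if s != eb), 0))      # c = P'_i \ {b}
    return out
```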

Dynamic Sentence-level Local Context (DSLC)
We choose c to be the local context around the sentence b, i.e., the concatenation of the sentences immediately before and after b in P ∈ D. To deal with corner cases, we require at least one of the previous or next sentences of b to exist (e.g., the next sentence does not exist if b is the last sentence of the paragraph P). We term this DSLC as the contextual information c is specified at the sentence level and changes with every sentence b extracted from D. We create positive pairs similarly to SDC and DPC, by sampling a and b from the same paragraph P_i ∈ D, with c being the local context around b in P_i (and a ∉ c). We automatically discard paragraphs that are not long enough to ensure the creation of a positive example. We generate hard negatives by sampling b from another P_j ∈ D, j ≠ i, while for easy negatives we sample b from a P'_i ∈ D', D' ≠ D (in both cases, c is set to the local context around b).
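The corner-case handling for the local context can be made concrete with a small helper (a sketch under the stated requirement that at least one neighbouring sentence exists):

```python
def local_context(paragraph, b_index):
    """DSLC context: concatenate the sentences immediately before and after
    position b_index in the paragraph; at least one neighbour must exist."""
    neighbours = []
    if b_index > 0:
        neighbours.append(paragraph[b_index - 1])   # previous sentence, if any
    if b_index < len(paragraph) - 1:
        neighbours.append(paragraph[b_index + 1])   # next sentence, if any
    assert neighbours, "b must have at least one adjacent sentence"
    return " ".join(neighbours)
```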

Datasets
Pre-Training To perform a fair comparison and avoid any improvement stemming from additional pre-training data, we use the same corpora as RoBERTa (Liu et al., 2019). These include the English Wikipedia, the BookCorpus (Zhu et al., 2015), OpenWebText (Gokaslan and Cohen, 2019) and CC-News. We pre-process each dataset by filtering away: (i) sentences shorter than 20 characters, (ii) paragraphs shorter than 60 characters and (iii) documents shorter than 200 characters. We split paragraphs into sequences of sentences using the NLTK tokenizer (Loper and Bird, 2002) and create the SSP pre-training datasets following Section 4 (see Appendix A.1 for more details).

Contextual AS2 We evaluate our pre-trained models on three public and two industrial datasets for contextual AS2. For all datasets, we use the standard "clean" setting, removing questions in the dev. and test sets which have only positive or only negative answer candidates, following standard practice in AS2 (Garg et al., 2020). We measure performance using Precision-at-1 (P@1) and Mean Average Precision (MAP).
• ASNQ is a large-scale AS2 dataset (Garg et al., 2020) derived from NQ (Kwiatkowski et al., 2019). The questions are user queries from Google search, and answers are extracted from Wikipedia.
• WikiQA is a small dataset (Yang et al., 2015) for AS2 with questions extracted from the Bing search engine and answer candidates retrieved from the first paragraph of Wikipedia articles.
• NewsAS2 is an AS2 dataset derived from NewsQA, with answer candidates sourced from news articles.
• IQAD is a large-scale industrial dataset containing de-identified questions asked by users to the Alexa virtual assistant. IQAD contains ∼220k questions, where answers are retrieved from a large web index (∼1B web pages) using Elasticsearch. We use two different evaluation benchmarks for IQAD: Bench 1 and Bench 2 (see Table 2).

Fine-Tuning We fine-tune each continuously pre-trained model on all the AS2 datasets. As baselines, we consider (i) standard pairwise fine-tuned AS2 models, using only the question and the answer candidate, and (ii) contextual fine-tuned AS2 models from Lauriola and Moschitti (2021), which use the question, answer candidate and local context.

Results
Table 1 summarizes the results of our experiments, averaged across 5 runs; we also report standard deviations and mark statistically significant improvements over the baselines.
Public datasets On ASNQ, our pre-trained models obtain a 3.8-5.5% improvement in P@1 over the baseline using only the question and answer. Our models also outperform the stronger contextual AS2 baselines (by 1.6% with RoBERTa and 2.4% with ELECTRA), indicating that our task-aware pre-training can improve downstream fine-tuning performance. On NewsAS2, we observe a similar trend, where all our models (except one) outperform both the standard and contextual baselines. On WikiQA, a smaller dataset, the contextual baselines under-perform the non-contextual baselines, highlighting that with few samples the model struggles to adapt and reason over three text spans. For this reason, our pre-training approaches provide the largest accuracy improvement on WikiQA (up to 8-9.1% over the non-contextual and contextual baselines).
Industrial datasets On IQAD, we observe that the contextual baseline performs on par with or lower than the non-contextual baseline, indicating that off-the-shelf transformers cannot effectively exploit the context available for this dataset. The answer candidates and context for IQAD are extracted from millions of web documents. Thus, learning from the context in IQAD is a harder task than on ASNQ, where the context belongs to a single Wikipedia document. Our pre-trained models help to process the diverse and possibly noisy context of IQAD, and produce a significant improvement in P@1 over the contextual baseline.
Combining the 3 SSP objectives We observe that combining all the objectives together does not always outperform the individual objectives, probably due to the misalignment between the different approaches for sampling context in our pre-training strategies. Note that we used a single classification head for all three tasks, indirectly asking the model to also recognize which of SDC, DPC or DSLC it has to solve. Experiments with separate classification heads (one per task) led to worse results in early experiments.
Choosing the optimal SSP objective Our fine-tuning datasets have significantly different structures: ASNQ, NewsAS2 and WikiQA have answer candidates sourced from a single document (Wikipedia for ASNQ and WikiQA, and CNN Daily Mail articles for NewsQA), while IQAD has answer candidates extracted from multiple documents. This also means that the context for the former is more homogeneous (the context for all candidates for a question is extracted from the same document), while for the latter the context is more heterogeneous (extracted from multiple documents for different answer candidates).
Our DPC and DSLC pre-training approaches are well aligned in terms of the context used to help the SSP predictions. The former uses the remainder of the paragraph P as context (after removing a and b), while the latter uses the sentences immediately before and after b in P. We observe empirically that the contexts for DPC and DSLC often overlap partially, and are sometimes even identical (considering that the average paragraph length in the pre-training corpora is 4 sentences). This explains why models pre-trained using these two approaches perform comparably in Table 1 (with only a very small gap in P@1).
On IQAD, we observe that the SDC approach of providing context for SSP outperforms DPC and DSLC. In SDC, the context c can potentially be very different from a and b (as it corresponds to the first paragraph of the document), and this can help exploit information and effectively rank answer candidates drawn from multiple documents (possibly from different domains), as in IQAD. For these reasons, we recommend using DPC and DSLC when answer candidates are extracted from the same document, and SDC when candidates are extracted from multiple sources.

Conclusion and Future Work
In this paper, we have proposed three pre-training strategies for transformers, which (i) are aware of the downstream task of contextual AS2, and (ii) use document and paragraph structure to define effective objectives. Our experiments on three public and two industrial datasets using two transformer models show that our pre-training strategies provide significant improvements over contextual AS2 models.
In addition to the local context around answer candidates (the previous and successive sentences), other contextual signals can also be incorporated to improve the relevance ranking of answer candidates. Meta-information such as the document title, abstract/first paragraph, domain name, etc. of the document containing the answer candidate can help answer ranking. These signals differ from the previously mentioned local answer context as they provide "global" contextual information pertaining to the documents for AS2. Our SDC objective, which uses the first paragraph of the document for the context input slot, captures global information pertaining to the document, and we hypothesise that downstream performance may improve further by using other global contextual signals in addition to the local answer context.

Limitations
Our proposed pre-training approaches require access to large GPU resources (pre-training is performed on 350M training samples for large language models containing hundreds of millions of parameters). Even using 10% of the original pre-training compute, the additional pre-training takes a long time to finish (several days even on 8 NVIDIA A100 GPUs). This highlights that the procedure cannot easily be re-done as newer data becomes available in an online setting. However, the benefit of our approach is that once the pre-training is complete, our released model checkpoints can be directly fine-tuned (even on smaller target datasets) for the downstream contextual AS2 task. For the experiments in this paper, we only consider English-language datasets; however, we conjecture that our techniques should work similarly for other languages with limited morphology. Finally, we believe that the three proposed objectives could be better combined in a multi-task training scenario where the model has to jointly predict the task and the label. At the moment, we only tried using different classification heads for this, but the results were worse.

C Details of Pre-Training

We use a batch size of 4096 examples and a triangular learning rate with a peak value of 10^-4 and 10K steps of warm-up. In order to save resources, we found it beneficial to reduce the maximum sequence length to 128 tokens. In this setting, our models each see ∼210B additional tokens, which is 10% of what is used in the original RoBERTa pre-training. Our objectives are also more efficient, because the attention computational complexity grows quadratically with the sequence length, which in our case is 4 times smaller than in the original RoBERTa model.
We use cross-entropy as the loss function for all our pre-training and fine-tuning experiments. Specifically, for RoBERTa pre-training we add the MLM loss to our proposed binary classification loss, using equal weights (1.0) for both terms. For ELECTRA pre-training, we sum three loss terms: the MLM loss with a weight of 1.0, the Token Detection loss with a weight of 50.0, and our proposed binary classification loss with a weight of 1.0.
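The weighted combination described above amounts to a simple sum. A sketch with scalar stand-ins (in practice these would be tensors produced by the model):

```python
def total_pretraining_loss(mlm_loss, clf_loss, token_detection_loss=None):
    """Weighted sum of pre-training losses: MLM + SSP classification with
    equal weights (1.0 each); for ELECTRA, the token-detection term is
    added with weight 50.0."""
    loss = 1.0 * mlm_loss + 1.0 * clf_loss
    if token_detection_loss is not None:        # ELECTRA only
        loss += 50.0 * token_detection_loss
    return loss
```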
During continuous pre-training, we feed the text tuples (a, b, c) (as described in Section 4) as input to the model in the following format: '[CLS]a[SEP]b[SEP]c[SEP]'. To provide independent sentence/segment ids to each of the inputs a, b and c, we initialize the sentence embedding layers of RoBERTa and ELECTRA from scratch and extend them to 3 segment types.
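A minimal illustration of this input packing, with whitespace tokenization standing in for the model's subword tokenizer (a sketch, not the actual tokenization pipeline):

```python
def pack_contextual_input(a, b, c):
    """Build the '[CLS]a[SEP]b[SEP]c[SEP]' input with one segment id per
    slot: 0 for a, 1 for b, 2 for c (each [SEP] takes its slot's id)."""
    tokens, segment_ids = ["[CLS]"], [0]
    for seg_id, text in enumerate((a, b, c)):
        words = text.split()
        tokens += words + ["[SEP]"]
        segment_ids += [seg_id] * (len(words) + 1)
    return tokens, segment_ids
```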
The pre-training of each model obtained by combining the ELECTRA and RoBERTa architectures with our contextual pre-training objectives took around 3.5 days on the machine configuration described in Appendix B. The dataset preparation required 10 hours over 64 CPU cores.

D Details of Fine-Tuning
The most common paradigm for AS2 fine-tuning is to take publicly available pre-trained transformer checkpoints (pre-trained on large amounts of raw data) and fine-tune them on the AS2 datasets. With our proposed pre-training objectives, we provide stronger model checkpoints that improve over the standard public checkpoints and can be used as the initialization for downstream fine-tuning for contextual AS2.
To fine-tune our models on the downstream AS2 datasets, we found it beneficial to use a very large batch size for ASNQ and a smaller one for IQAD, NewsAS2 and WikiQA. Moreover, for every experiment we used a triangular learning rate scheduler and performed early stopping when the MAP on the development set did not improve for 5 consecutive evaluations. We fixed the maximum sequence length to 256 tokens in every run, and we repeated each experiment 5 times with different initial random seeds. We did not use weight decay, but we clipped gradients larger than 1.0 in absolute value. More specifically, for the learning rate we tried all values in {5 × 10^-6, 10^-5, 2 × 10^-5} for RoBERTa and in {10^-5, 2 × 10^-5, 5 × 10^-5} for ELECTRA. Regarding the batch size, we tried all values in {512, 1024, 2048, 4096} for ASNQ, in {64, 128, 256, 512} for IQAD and NewsAS2, and in {16, 32, 64, 128} for WikiQA. More details about the final settings are given in Table 3.
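The early-stopping rule can be sketched as follows (one simple reading of "did not improve for 5 evaluations in a row"; the authors' exact criterion may differ):

```python
def early_stop(map_history, patience=5):
    """Return True if dev-set MAP has not improved for `patience`
    consecutive evaluations. `map_history` lists MAP after each
    evaluation, oldest first."""
    if len(map_history) <= patience:
        return False
    best_before = max(map_history[:-patience])
    # Stop only if none of the last `patience` evaluations beat the best
    # MAP seen before that window.
    return all(m <= best_before for m in map_history[-patience:])
```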
For the pairwise models, we format inputs as '[CLS]q[SEP]s_i[SEP]', while for contextual models we build inputs of the form '[CLS]q[SEP]s_i[SEP]c_i[SEP]'. We do not use extended sentence/segment ids for the non-contextual baselines and retain the original model design: (i) disabled segment ids for RoBERTa and (ii) only 2 different sentence/segment ids for ELECTRA. For the fine-tuning of our continuously pre-trained models, as well as the contextual baseline, we use three different sentence ids corresponding to q, s and c, for both RoBERTa and ELECTRA. Finally, differently from pre-training, in fine-tuning we always provide the previous and the next sentence as context for a given candidate.
The contextual fine-tuning of each model on ASNQ required 6 hours per run on the machine configuration described in Appendix B. For the other fine-tuning datasets, we used a single GPU for every experiment, and runs took less than 2 hours.

E Qualitative Examples
In Table 4 we show a comparison of the ranking produced by our models and that by the contextual baselines on some questions selected from the ASNQ test set.
ELECTRA

Q: how many games does a team have to win for the world series
A1: Seven games were played, with the Astros victorious after game seven, played in Los Angeles.
A2: In 1985, the format changed to best-of-seven.
A3: Since then, the 2011, 2014, and 2016 World Series have gone the full seven games.
A4: The winner of the World Series championship is determined through a best-of-seven playoff, and the winning team is awarded the Commissioner's Trophy.
A5: The Houston Astros won the 2017 World Series in 7 games against the Los Angeles Dodgers on November 1st, 2017, winning their first World Series since their creation in 1962.

Q: where are trigger points located in the body
A1: Myofascial pain is associated with muscle tenderness that arises from trigger points, focal points of tenderness, a few millimeters in diameter, found at multiple sites in a muscle and the fascia of muscle tissue.
A2: Myofascial trigger points, also known as trigger points, are described as hyperirritable spots in the fascia surrounding skeletal muscle.
A3: Trigger points form only in muscles.
A4: These in turn can pull on tendons and ligaments associated with the muscle and can cause pain deep within a joint where there are no muscles.
A5: They form as a local contraction in a small number of muscle fibers in a larger muscle or muscle bundle.
Ghosh et al. (2016) use LSTMs for answers and topics, improving accuracy for next sentence selection. Tan et al. (2018) use GRUs to model answers and local context, improving performance on two AS2 datasets. Lauriola and Moschitti (2021) propose a transformer encoder that uses context to better disambiguate between answer candidates. Han et al. (2021) use unsupervised similarity matching techniques to extract relevant context for answer candidates from documents.

Table 1: Performance of our pre-trained models and the baselines on the AS2 datasets. Models with ♣ are from (Lauriola and Moschitti, 2021). ✓ and ✗ denote whether local contextual information was used in fine-tuning. SDC, DPC and DSLC indicate the pre-training variants of the SSP task that we propose. Best results are in bold; we underline statistically significant improvements over the two contextual baselines (♣) using a Student's t-test with a 95% confidence level.

Table 2: Number of unique questions and question-answer pairs in the fine-tuning datasets. IQAD Bench 1 and Bench 2 sizes are given in the Test set column corresponding to IQAD.

Table 3: Hyper-parameters used to fine-tune RoBERTa and ELECTRA on the AS2 datasets. The best hyper-parameters were chosen based on the MAP results on the validation set.

Table 4: Qualitative examples from the ASNQ test set where our ELECTRA and RoBERTa models with DSLC contextual continuous pre-training rank the correct candidate in the top position, while the contextual baselines fail. The answer candidates are shown in the ranking order produced by the contextual baselines. Other positive answer candidates are colored in light green.