LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction

Transformer encoders contextualize token representations by attending to all other tokens at each layer, leading to a quadratic increase in compute effort with the input length. In practice, however, the input text of many NLP tasks can be seen as a sequence of related segments (e.g., the sequence of sentences within a passage, or the hypothesis and premise in NLI). While attending across these segments is highly beneficial for many tasks, we hypothesize that this interaction can be delayed until later encoding stages. To this end, we introduce Layer-Adjustable Interactions in Transformers (LAIT). Within LAIT, segmented inputs are first encoded independently, and then jointly. This partial two-tower architecture bridges the gap between a Dual Encoder’s ability to pre-compute representations for segments and a fully self-attentive Transformer’s capacity to model cross-segment attention. The LAIT framework effectively leverages existing pretrained Transformers and converts them into a hybrid of the two aforementioned architectures, allowing for easy and intuitive control over the performance-efficiency tradeoff. Experimenting on a wide range of NLP tasks, we find LAIT able to reduce 30-50% of the attention FLOPs on many tasks, while preserving high accuracy; in some practical settings, LAIT could reduce actual latency by orders of magnitude.


Introduction
Although the meaning of a sentence may depend on the context in which it appears, sentences still have meaning per se. However, in tasks involving reasoning across multiple sentences or text segments, such as natural language inference (NLI), fact verification, question answering (QA), and semantic textual similarity (STS), the common setting is to concatenate and jointly process all tokenized segments as input to a neural model, most often some form of bidirectional Transformer-based architecture (Vaswani et al., 2017). In this setting, the self-attention blocks of the Transformer layers contextualize the per-token representations against all other input tokens, including those of different input segments. The potential for independent sentence-level semantics is largely ignored.
While this practice has been shown to achieve high accuracy, it is computationally expensive due to the quadratic increase in cost with the input length. And in practical settings, such as large-scale citation retrieval (Petroni et al., 2022a) or document-level NLI (Koreeda and Manning, 2021), where a given segment may occur multiple times, the full Cartesian product of the sets of text segments must be processed; e.g., Schuster et al. (2022a) process all sentence pairs from two Wikipedia articles about one subject but in two different languages to identify potential discrepancies. This leads to yet another quadratic increase in cost. Our goal is to reduce both of these computational burdens, rendering Transformer architectures more efficient for large-scale multi-segment reasoning.
In this paper, we present LAIT (/leIt/), a late interaction Transformer model with easy-to-implement Layer-Adjustable Interactions. LAIT includes encoder layers that process each segment locally and independently of the other segments, followed by traditional Transformer layers, in a simple but effective way. Unlike the late interaction components of other models, such as ColBERT (Khattab and Zaharia, 2020), which are specifically geared toward measuring a similarity score between two text segments, LAIT generally supports any sequence-to-sequence task and any number of input segments.
LAIT enables several desirable properties for an efficient encoder: it (1) is easy to train on top of existing pretrained language models; (2) readily supports any seq-2-seq task and any segmentation of the input; (3) improves encoding efficiency by skipping a large number of attention computations; (4) disentangles independent segment representations from joint processing to allow caching of intermediate segment representations for repeated computations; and (5) provides an easy-to-tune hyperparameter for controlling the efficiency-performance tradeoff.

Dual Encoders
A key strength of a fully self-attentive (FSA) architecture, such as BERT or T5 (Devlin et al., 2019; Raffel et al., 2020), is the ability of each token in the input to interact with each other token in the input throughout all layers of the model. Although expensive, this type of architecture has shown impressive performance across a wide variety of NLP tasks, such as those in the GLUE and SuperGLUE benchmarks (Wang et al., 2019b,a). A common alternative to FSA is the dual encoder (DE) framework (Gillick et al., 2018). With DE, two text segments are embedded independently, either by separate networks or by two networks that share parameters. A DE typically involves two encoders, Enc_q(·) and Enc_d(·), and a comparison function Comp(·); for a given pair of input segments q, d: score = Comp(Enc_q(q), Enc_d(d)). In practice, the two encoders can share parameters.
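As a concrete (and purely illustrative) sketch of this scoring scheme, the following NumPy snippet uses a toy mean-pooled embedding lookup in place of a real learned encoder, with a dot product as the comparison function; the embedding table and dimensions are made up:

```python
import numpy as np

def encode(tokens, dim=8, seed=0):
    """Stand-in for a learned encoder: maps a token-id sequence to a
    mean-pooled fixed-size vector. A real DE would run a Transformer here."""
    rng = np.random.default_rng(seed)
    emb = rng.standard_normal((100, dim))  # toy embedding table
    return emb[np.asarray(tokens) % 100].mean(axis=0)

def de_score(q_tokens, d_tokens):
    """score = Comp(Enc_q(q), Enc_d(d)): here both encoders share
    parameters (same `encode`) and Comp is the dot product."""
    return float(encode(q_tokens) @ encode(d_tokens))
```

With a shared encoder and a symmetric comparison function, the score is symmetric in its arguments, which is why this setup suits similarity tasks but not arbitrary classification.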
DE is typically trained with a contrastive loss over a set of positive (q, d) pairs, with the goal of scoring positive pairs higher than negatives. Therefore, DE is best suited for similarity tasks such as information retrieval.
A specific advantage of the DE architecture for retrieval tasks is its ability to independently encode the two input segments. In practice, this allows encoding and storing many documents' representations in advance, in parallel. Then, only new queries need to be encoded into a vector that can be used for retrieving the most similar documents from the pre-encoded corpus using efficient methods such as maximum inner product search (MIPS).
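For illustration, exact MIPS over a small pre-encoded corpus reduces to a matrix-vector product followed by a top-k selection; large-scale systems replace this with approximate search libraries, and the tiny corpus below is made up:

```python
import numpy as np

def mips_top_k(query_vec, doc_matrix, k=2):
    """Exact maximum inner product search: score every pre-encoded
    document vector against the query and return the top-k indices."""
    scores = doc_matrix @ query_vec          # one inner product per doc
    top = np.argsort(-scores)[:k]            # indices of highest scores
    return top, scores[top]
```

The key efficiency point is that `doc_matrix` is computed once, offline; only `query_vec` must be encoded at query time.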
The method above, however, only supports similarity tasks or binary classification tasks over input pairs. To expand this setting to multi-class tasks, prior approaches like Casanueva et al. (2020) and Ni et al. (2022) add a classification head with optional non-linear layers on top of the two encoded representations. Since the classifier requires a fixed-size input, the segment representations are aggregated (e.g., by taking the average over tokens, or by selecting a predefined special token). While conceptually enabling any classification task, the performance of such models usually falls far behind the state of the art (see Section 5).

Layer-Adjustable Interactions
We argue that both FSA and DE Transformer models can be seen as special cases of a general architecture with adjustable layer depths for both segment-independence and segment-interaction, which we will call a "Layer-Adjustable Interaction Transformer" (LAIT).
For a Transformer with L layers and an input with n segments, LAIT is a set of n independent stacks of P layers each, followed by L − P fully self-attentive encoder layers. Any function can be used after the encoder. Thus a typical fully self-attentive Encoder-Decoder Transformer is a LAIT where P = 0, and a shared-parameter dual encoder is a LAIT where P = L and n = 2. In the fully self-attentive Transformer, each token in each segment interacts with each token in each other segment throughout the entire depth of the encoder; in a Dual Encoder, each segment is treated independently throughout the encoder. The LAIT framework allows us to make the core questions of this work precise: (1) to what extent are interactions across multiple input text segments necessary? And (2) if they are not always necessary, how can we take advantage of this fact to perform multi-segment modeling efficiently at scale? Specifically, given an input X with m tokens that is split into n segments s_1, ..., s_n of possibly different lengths, the LAIT encoder is defined as:

LAIT(s_1, s_2, ..., s_n) = Enc_{L−P}([Enc_P(s_1); Enc_P(s_2); ...; Enc_P(s_n)]),

where [x; y] denotes concatenating vectors x and y, and Enc_K(·) denotes a Transformer encoder with K layers.
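To make the composition concrete, here is a minimal NumPy sketch of this structure (not the actual implementation): `attention_layer` is a toy stand-in with no projections, heads, feed-forward blocks, or residuals, and all shapes and layer counts are illustrative:

```python
import numpy as np

def attention_layer(x):
    """Toy stand-in for one encoder layer: softmax(x x^T) x."""
    scores = x @ x.T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def lait_encode(segments, L=12, P=8):
    """Enc_{L-P}([Enc_P(s_1); ...; Enc_P(s_n)]): P independent layers
    per segment, then L-P joint layers over the concatenation."""
    segs = [np.asarray(s, dtype=float) for s in segments]
    for _ in range(P):                      # parallel (independent) layers
        segs = [attention_layer(s) for s in segs]
    x = np.concatenate(segs, axis=0)        # concatenate along the token axis
    for _ in range(L - P):                  # joint (fully attentive) layers
        x = attention_layer(x)
    return x
```

Setting P = 0 recovers a fully self-attentive encoder, and P = L recovers fully independent segment encoders, matching the two special cases described above.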
The rule for splitting the input into segments, R(x_1, ..., x_m) → (s_1, ..., s_n), is predefined for each task, based either on prior knowledge of the input structure or on a simple segmentation function. For example, in NLI we can simply use the hypothesis and premise as two segments. In passage-level QA, we can use the question as one segment and the passage as another. However, splitting the passage into multiple shorter segments could further reduce compute. For instance, we can split the passage by sentences into k segments, leading to a total of k + 1 segments.
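A segmentation rule R of this kind can be as simple as the following sketch for the passage-level QA case; the naive period-based sentence splitter is purely illustrative (a real system would use a proper sentence segmenter and operate on token ids):

```python
def split_passage_qa(question, passage):
    """R(x) -> (s_1, ..., s_{k+1}): the question as one segment, plus
    each (naively split) sentence of the passage as a further segment."""
    sentences = [s.strip() + "." for s in passage.split(".") if s.strip()]
    return [question] + sentences
```

The returned list of segments can then be fed directly to the parallel layers of the encoder.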
For P ∈ [0, L], LAIT interpolates between an N-encoder model and a fully self-attentive Transformer. Because interaction between segments is delayed, representations computed at layer P of the model are generated independently and can be stored or cached for later reuse. Figure 2 demonstrates the basic LAIT architecture, as well as possibilities for partial caching (for instance, multiple unique questions about the same passage) or full caching (for instance, NLI-based cross-document reasoning (Schuster et al., 2022a)).
Similar to general text-to-text models, the outputs of the LAIT encoder, consisting of m contextualized representations for m tokens, are passed to the Transformer decoder for generating the output sequence. Alternatively, the decoder may be replaced with a classification head, or any other module.

Attention Complexity
By first processing text independently, and then processing the intermediate representations jointly, LAIT reduces the attention complexity within a Transformer in accordance with both the degree of independence (i.e., P) and the balance of length across segment inputs. We can calculate the number of attention operations, O, for a given input to LAIT with the formula:

O = P · Σ_i |s_i|² + (L − P) · (Σ_i |s_i|)²,  (1)

where |s_i| denotes the length of segment i out of n total segments for a given input. Ultimately, the number of FLOPs to process a single example will depend on the lengths of the input segments, the Transformer architecture used, and the degree of independence P. We discuss these practical details in Section 4.2 and Table 4.
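This accounting can be sketched as follows (the layer count and segment lengths in the test are made up, and "operations" here counts query-key pairs, not full FLOPs):

```python
def lait_attention_ops(seg_lens, L=12, P=8):
    """O = P * sum_i |s_i|^2 + (L - P) * (sum_i |s_i|)^2:
    parallel layers pay quadratic cost per segment, joint layers pay
    quadratic cost over the full concatenated length."""
    m = sum(seg_lens)
    return P * sum(s * s for s in seg_lens) + (L - P) * m * m

def fsa_attention_ops(seg_lens, L=12):
    """Fully self-attentive baseline: every layer attends over all m tokens."""
    m = sum(seg_lens)
    return L * m * m
```

For two equal-length segments, each fully parallel layer costs half of a joint layer, so savings grow both with P and with how evenly the input splits into segments.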

Training LAIT
Because LAIT does not add any new parameters to the Transformer architecture, we can easily convert an existing Transformer to the LAIT framework and train it end-to-end with any objective. In this work, we focus on the T5 (Raffel et al., 2020) model, since it is a general text-to-text Transformer, and apply LAIT to the encoder stack. In our experiments here, since we focus on classification tasks, we keep only a single decoding layer.
Given an input with n text segments, LAIT first encodes and concatenates the segments. During encoding, a block-diagonal attention mask restricts attention between different text segments for the early layers of the model (denoted "parallel layers"), and allows cross-segment attention for the later layers of the model ("joint layers"). Figure 3 illustrates the block-diagonal attention mask used for parallel layers. This approach allows for parameter sharing while independently encoding the segments, as well as flexibility for tasks with different numbers of input segments without needing to initialize additional models.
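The parallel-layer masking can be sketched as follows (a minimal illustration, not the T5x implementation):

```python
import numpy as np

def block_diagonal_mask(seg_lens):
    """Boolean attention mask for the parallel layers: token i may attend
    to token j only when both fall inside the same segment."""
    m = sum(seg_lens)
    mask = np.zeros((m, m), dtype=bool)
    start = 0
    for n in seg_lens:
        mask[start:start + n, start:start + n] = True  # one block per segment
        start += n
    return mask
```

Switching from a parallel layer to a joint layer then amounts to replacing this mask with an all-True mask, with no change to the layer parameters themselves.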

Experimental Setting
Below, we describe our evaluation setting: tasks, metrics, and baselines.

Implementation details
We implement LAIT on top of the T5 model (Raffel et al., 2020) using Google's T5x library (Roberts et al., 2022). In all experiments, we use T5-base, which has a total of 12 encoder layers and 220M parameters. To reduce compute effort, we use only a single decoder layer for LAIT (see Appendix B.1 for larger models). We load the parameters from the public pretrained checkpoint, and finetune on the target task for up to 100K steps with different LAIT configurations (values of P). We train LAIT on 16 TPUv3 chips, taking about 4 hours per run. We run a small grid search over learning rate and batch size configurations, and pick the top-performing checkpoint based on validation performance.

Tasks and metrics
We experiment with LAIT on a diverse set of common tasks and datasets. For each task, we must determine which fields of the dataset to use as input segments for LAIT. We evaluate each task using its typical quality metric. In addition, to measure the efficiency gains of different LAIT configurations, we compute the average self-attention FLOPs. We use Equation (1) and the precise configuration of the T5-base model we implement LAIT within, which has 768-dimensional embeddings and 12 64-dimensional attention heads.
The evaluated tasks are described below. Many of these tasks are from the popular GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) benchmarks, and all are in English. The number of segments used and average segment lengths per task are summarized in Table 1. Pre-processing and concatenation strategies are described in Appendix A.

MNLI (Williams et al., 2018): A dataset for natural language inference across diverse categories. We use the hypothesis and premise as separate segments, and predict one of three labels: "entailment", "contradiction", and "neutral". We report accuracy on the "matched" eval set.

RTE: The Recognizing Textual Entailment dataset combines the data from a series of annual textual entailment challenges (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009). We use the hypothesis and the premise as separate segments, predict "entailment" vs. "non-entailment", and measure accuracy.

QQP (Iyer et al., 2017): The Quora Question Pairs dataset is a collection of question pairs from Quora, where the task is to determine whether a pair of questions have the same meaning. For LAIT, we treat each question as a segment, predict "duplicate" or "not_duplicate", and measure accuracy.

STSB (Cer et al., 2017): The Semantic Textual Similarity Benchmark, a task for estimating the similarity of a pair of sentences. We use each sentence as a separate segment, and predict a score in [0, 5], represented as a string rounded to 2 decimal places. We measure Spearman correlation.

AE (Bulian et al., 2022): Answer Equivalence requires determining whether a "candidate" answer is semantically equivalent to a "reference" answer, given a question. We use the question and each of the answers as independent text segments, make a binary prediction "true" or "false", and measure accuracy.
BoolQ (Clark et al., 2019): Boolean Questions is a binary question answering task with passages and questions. We use the provided text passage and the question as text segments, make a binary prediction "true" or "false", and measure accuracy.

BoolQ-Split: A modification of BoolQ, where each passage is split into 5 sub-passages, treated as independent input segments. The sub-passages are formed by greedily merging the passage's sentences, smallest merge first.
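One plausible reading of this greedy procedure (our interpretation; the paper's exact merging rule may differ) is to repeatedly merge the adjacent pair of sentences whose combined length is smallest, until the target number of sub-passages remains:

```python
def greedy_merge(sentences, k=5):
    """Merge adjacent sentences into at most k sub-passages, always
    merging the adjacent pair with the smallest combined length first.
    This is one interpretation of 'smallest merge first'."""
    segs = list(sentences)
    while len(segs) > k:
        # index of the adjacent pair with the smallest combined length
        i = min(range(len(segs) - 1),
                key=lambda j: len(segs[j]) + len(segs[j + 1]))
        segs[i:i + 2] = [segs[i] + " " + segs[i + 1]]
    return segs
```

Merging adjacent (rather than arbitrary) sentences preserves the original passage order, and favoring small merges keeps the resulting sub-passage lengths balanced, which matters for the attention savings in Equation (1).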
WiC (Pilehvar and Camacho-Collados, 2019): Words in Context is a task for evaluating contextual word meanings: given a word and two sentences in which it occurs, determine whether the word has the same meaning in each sentence. For LAIT, we prefix each sentence with the specified word and treat the newly-prefixed sentences as our text segments. We then predict "true" or "false", corresponding to whether the word has the same in-context meaning in both sentences. Evaluation is by accuracy.

FEVER (Thorne et al., 2018): A dataset for fact verification with claims and corresponding evidence. Each claim-evidence pair is labeled as "supported", "refuted", or "NotEnoughInfo". For LAIT, we treat the claim and the evidence as our separate text segments, and aim to predict the correct label.
Evaluation is done by accuracy.
VitaminC (Schuster et al., 2021): A challenging dataset for fact verification which includes "contrastive evidence", i.e., claim-evidence pairs that differ only slightly (in either the text of the claim or that of the evidence) from another claim-evidence pair, but have a different label. We treat the claim and evidence as independent text segments, and evaluate by accuracy.
MultiRC (Khashabi et al., 2018): The Multi-Sentence Reading Comprehension dataset is a question answering dataset, where each example contains a passage, a question, and an answer. For LAIT, we use the passage, the question, and the answer as the segments. The label is either "True" or "False", indicating whether the answer is correct. Evaluation is done by computing the F1 score over all answers.

Baselines
We compare LAIT against two groups of baselines: Dual Encoder models and fully self-attentive models. For the Dual Encoder, we use the SentenceT5-base (Ni et al., 2022) shared-parameter Dual Encoder, which outputs the concatenation of the average of the per-token output representations from the two encoders, together with their difference and dot product, followed by a classifier. We experiment with two depths of classifier: one with a single non-linear layer, and one with 2 additional hidden layers (d = 768 for all layers). As fully self-attentive baselines, we consider T5-base and T5-small (Raffel et al., 2020).

Results
To study the performance-efficiency tradeoff, we consider multiple configurations of LAIT to fully interpolate between a Dual Encoder and a fully self-attentive Transformer. As T5-base has a 12-layer encoder, we consider all LAIT-p, for p ∈ [0, 12], where p is the number of layers of independent segment processing before the fully self-attentive component. Note that LAIT-0 is roughly equivalent to T5-base, though it uses a 1-layer decoder vs. the 12-layer decoder of T5-base.
As can be seen in Tables 2 and 3, which compare best validation-set performance across models, LAIT either matches, nearly matches, or outperforms the T5-base baseline for every task. This holds even in configurations where cross-segment interaction is delayed until the last few layers of the encoder. As long as there are a few cross-segment interactions late in the model, performance remains relatively stable even as the architecture becomes increasingly efficient; crucially, LAIT can delay cross-segment interaction by 8-10 layers without a notable decrease in performance. We specifically focus on the most efficient LAIT models that: (1) achieve within 99% of LAIT-0 performance, which we call LAIT-99%; (2) achieve within 95% of LAIT-0 performance, called LAIT-95%; and (3) achieve within the 95% confidence interval of LAIT-0 performance, called LAIT. To select these models with higher validity, we perform a synthetic dev/test split of the validation sets and report the held-out validation performance of the LAIT models with the highest performance on the synthetic dev set, as reported in Appendix B.
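The selection rule for LAIT-99% and LAIT-95% can be sketched as follows (a simplified illustration; the scores in the test are invented, and the paper additionally uses a synthetic dev/test split and a confidence-interval criterion not modeled here):

```python
def most_efficient_lait(scores, frac=0.95):
    """Given validation scores keyed by p (parallel-layer count), return
    the largest p whose score is within `frac` of the LAIT-0 score.
    Larger p means more independent layers, hence more efficiency."""
    baseline = scores[0]
    candidates = [p for p, s in scores.items() if s >= frac * baseline]
    return max(candidates)
```

Since p = 0 always qualifies, the function is well-defined for any frac ≤ 1.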
These results also suggest differences in the proportion of cross-segment processing necessary for different tasks. Sentence and word representation tasks (i.e., Answer Equivalence, STSB, and WiC) admit much better LAIT models than reasoning-intensive tasks, such as MNLI, BoolQ, and VitaminC. We note that FEVER appears to be easier for LAIT than other "reasoning" tasks, which we explore further in Section 5.3. We also note that some degree of cross-segment processing is necessary for all tasks, as evidenced by the steep drop in performance as p approaches 12 (see Figure 4).

Scalability
By deferring the expensive cross-segment attention to later stages of the model, LAIT both reduces the attention complexity of the model, and enables the caching and reuse of partial representations computed before the cross-segment attention layers.
Table 4 shows improvements in attention FLOPs for LAIT, both with and without caching of the intermediate representations, when using the LAIT-95% model.

Caching and Reusing Representations
A key advantage of the delayed cross-segment interaction in LAIT is the ability to cache and reuse intermediate representations of text segments. Unlike in benchmarks, real-world settings almost never process a set of segments in isolation; it is much more likely that the processing of a set of text segments occurs as part of a larger task such as document comparison, document analysis, or claim verification.
Recently, a number of datasets (Schuster et al., 2022a; Koreeda and Manning, 2021; Petroni et al., 2022b) have suggested the usefulness of natural language inference in large-scale real-world reasoning tasks. In one such dataset, ContractNLI (Koreeda and Manning, 2021), a fixed set of 17 claims is evaluated against different legal contracts. In other scenarios (Schuster et al., 2022a; Gu et al., 2020), the contents of multiple documents within a cluster of related documents must be compared.
In both scenarios, a typical approach would require comparing each sentence within a document with each other sentence, leading to a complexity that scales quadratically with the size of the document cluster, the size of the documents, and the length of the sentences. But with LAIT, the bulk of the work is performed only once. Because each document or claim can be encoded independently for most of the layers of the model, the latency improvement offered by LAIT in these settings is related to the overall redundancy and duplication of text segments within the task. Table 5 demonstrates the savings possible for both popular academic tasks and two realistic settings: ContractNLI (Koreeda and Manning, 2021) and WikiClusters (Schuster et al., 2022a). For MNLI and BoolQ, we measure the time to encode the entire dataset. For WikiClusters and ContractNLI, we measure both the time to encode the entire dataset and the time to encode a single document (in the case of ContractNLI) or cluster (in the case of WikiClusters). We compare a standard fully self-attentive model (T5), a sparse model (LongT5 with local attention), and LAIT. For MNLI and BoolQ, we estimate the latency of the LAIT-95% model for that task as a weighted average of FSA and LAIT layers.
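The effect of caching on the attention-operation count of Equation (1) can be sketched as follows (an illustrative cost model, not the paper's benchmark code; string length stands in for token count, and the example pairs are made up):

```python
def encode_cost_with_cache(pairs, L=12, P=8):
    """Attention ops to process (segment_a, segment_b) pairs when the
    layer-P representations are cached: each unique segment pays the P
    parallel layers once, while every pair still pays the L-P joint
    layers over the concatenated length."""
    unique = {s for pair in pairs for s in pair}
    parallel = P * sum(len(s) ** 2 for s in unique)
    joint = (L - P) * sum((len(a) + len(b)) ** 2 for a, b in pairs)
    return parallel + joint
```

When the same claim or document recurs across many pairs, the `parallel` term is amortized over all of them, which is exactly why the savings in these settings scale with segment redundancy.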
Even without a custom kernel, LAIT's independent processing of input segments enables significant speedups for processing real-world data. Interestingly, the sparse transformer demonstrates slightly increased latency, likely because the input sizes are relatively short. However, even when enabled by a sparse transformer, processing larger chunks of data, such as an entire ContractNLI contract alongside each of the 17 claims, will not fully alleviate the problem, as the contracts must still be processed 17 times, rather than just once as in LAIT. In these situations, LAIT may be able to complement a sparse transformer; this would require further study.

Robustness
A potential concern with an approach like LAIT is that it may be more susceptible to reported biases in sentence-level models (Poliak et al., 2018; Schuster et al., 2021), such as over-relying on clues in one of the segments instead of performing cross-segment reasoning. We test LAIT's effect on the model's robustness to domain shifts and to such biases in the training data. Schuster et al. (2021) found that in FEVER, when evidence text in a claim-evidence pair was revised in a way that would reverse the semantic relationship (e.g., f_revision(Claim, Evidence, REFUTES) → (Claim, Evidence′, SUPPORTS)), models trained on FEVER would make the correct prediction only 56% of the time. Table 6 summarizes our robustness experiments using zero-shot transfer from FEVER and VitaminC.
We find that when LAIT is trained on FEVER, the transfer performance drops faster than the in-domain performance as independence is increased. However, when training on VitaminC, the decrease in accuracy as a function of P is more correlated with the in-domain trend. This suggests that LAIT models can be robust against domain shifts and contrastive adversarial examples when trained with appropriate data.

Related Work
Sentence encoders. Modern representation learning systems at the sentence level have rapidly risen in popularity, starting with InferSent (Conneau et al., 2017), ESIM (Chen et al., 2017), and USE (Cer et al., 2018). Following the inception of the Transformer (Vaswani et al., 2017), new sentence encoders (see e.g., Gao et al., 2021; Ni et al., 2022; Reimers and Gurevych, 2019) demonstrated improved performance on many sentence-pair benchmarks. Other work extended this approach to document encoders by hierarchically encoding sentences independently before combining them into a pooled document embedding (Wu et al., 2021; Yang et al., 2020). Yet, unlike previous work, LAIT effectively breaks a pretrained Transformer into a hybrid of multiple parallel segment encoders and powerful fully-attentive layers to match state-of-the-art performance across many NLP tasks.

Efficient text classifiers. Dual encoder architectures, originally dating back to the Siamese architecture of Bromley et al. (1993), were proposed for efficient retrieval by Gillick et al. (2018). Ni et al. (2021) and Menon et al. (2022) significantly broaden the range of tasks efficiently served by dual encoders.
Building on the Transformer architecture, LAIT can also readily leverage many other known efficiency solutions (Tay et al., 2022) such as distillation (Sanh et al., 2019; Jiao et al., 2020), quantization (Shen et al., 2020; Zafrir et al., 2019), and early exiting (Schuster et al., 2022b; Xin et al., 2020).

Sparse attention. Sparse attention architectures have demonstrated that not all attention connections within a Transformer are necessary, and that impressive performance can be achieved even when removing a large number of the cross-token attention connections. Examples such as BigBird, Longformer, and LongT5 (Zaheer et al., 2020; Beltagy et al., 2020; Guo et al., 2021) use local attention windows and some form of global attention to reduce the attention complexity. Other approaches dynamically skip certain computations (Tay et al., 2020). Unlike these approaches, here we impose the sparsity on top of known input segments, which preserves segment-level semantics and supports parallel computation and caching of segments. Despite their benefits, sparse transformers still include cross-segment attention at every layer of the model, and as such they cannot encode segments independently.

Late interaction. Some recent work has considered precomputing full-token representations of some, but not all, text segments, as well as late interaction between queries and documents (Lu et al., 2020; Xiong et al., 2017). ColBERT (Khattab and Zaharia, 2020; Santhanam et al., 2022) uses precomputed token representations as part of a DE retrieval framework. These architectures, however, are tailored for retrieval tasks that use embedding similarity scores, and generally under-perform in classification tasks like NLI. The fully-attentive layers in LAIT allow bridging this performance gap while still providing efficiency gains. Our caching variant also relates to other recent parallel work on precomputing and reusing representations of repeated passages to speed up computation (Saad-Falcon et al., 2023; de Jong et al., 2023; Li et al., 2022). Hui et al. (2022) develop a fully parallel encoder for documents and queries, where both encodings are fed to a joint decoder for reranking. Most similar to our work is MacAvaney et al. (2020), who study a hybrid Transformer architecture for ranking. In this work, we focus on general NLP tasks with an arbitrary number of segments, and an unconstrained output space.

Conclusion
We present Layer-Adjustable Interactions in Transformers (LAIT) to allow simple but effective efficiency gains over a wide range of NLP tasks. The LAIT framework leverages existing pretrained Transformers such as T5, and converts them during finetuning into a hybrid model that combines parallel, independent encoding of multiple segments with subsequent fully-attentive layers that allow cross-segment reasoning.
We evaluate LAIT on a large set of 10 well-known datasets covering different examined capabilities, numbers of segments, input lengths, output spaces, and difficulty levels. We find LAIT to consistently provide a significant reduction in encoder attention complexity while preserving high accuracy. Furthermore, we show that the parallel, independent segment encoding of LAIT enables additional inference-time compute savings by caching representations of repeated segments in large-scale real-world settings.
LAIT demonstrates that Transformers can achieve high performance even without cross-segment interaction at every layer; essentially, that sentences can be encoded just as effectively if first processed separately, and then processed jointly.

Limitations
While the LAIT framework can significantly reduce the computation required for large-scale sentence-level reasoning and classification tasks, we do foresee some limitations in its use. Caching per-token representations for large numbers of text segments leads to a dramatic increase in memory requirements, which could be prohibitive for extremely low-compute end users. We also note that LAIT can further exacerbate segment-level bias in datasets. While we believe that careful data curation approaches can ameliorate this issue, the risk of bias is not always known to downstream users, and as such corrective datasets may not always be available. Finally, LAIT can increase the cost of training because the optimal degree of independence is not known until all LAIT-p models are evaluated, though in practical settings (1) it is possible to perform a binary search over LAIT configurations because performance generally decreases monotonically as p increases; (2) even a naive rule of setting p to a quarter of the model's depth seems to provide some immediate gains while preserving 99% of the accuracy on all our evaluated tasks; and (3) inference-time cost improvements will far outweigh training costs.
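The binary-search idea in point (1) can be sketched as follows, assuming (as the text notes, only approximately) that performance is non-increasing in p; `perf` is a callable standing in for one full train-and-evaluate run, and the threshold values are illustrative:

```python
def largest_acceptable_p(perf, threshold, max_p=12):
    """Binary search for the largest p with perf(p) >= threshold,
    assuming perf is (near-)monotonically non-increasing in p.
    Returns None if even p = 0 misses the threshold."""
    lo, hi, best = 0, max_p, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if perf(mid) >= threshold:   # mid is acceptable: try larger p
            best = mid
            lo = mid + 1
        else:                        # mid too aggressive: try smaller p
            hi = mid - 1
    return best
```

This reduces the number of trained configurations from O(L) to O(log L), at the risk of a suboptimal answer when the monotonicity assumption is violated.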

B.1 Full Decoder and T5-Large Models
For our experiments in the main paper, we used a T5-base model with only a single decoder layer. Using only one decoder layer is faster at inference time and forces the model to rely more heavily on the encoder stack; the strong results of LAIT in that setting are therefore even more encouraging. We also experiment with LAIT on top of a T5-Base with all 12 decoder layers, and with a larger T5-Large that has 24 layers in both the encoder and decoder stacks.
Table 7 and Table 8 present the results for T5-Base and T5-Large, respectively. LAIT shows similar trends for these different configurations, indicating that our approach is general and translates to different model configurations. Also, as expected, a larger decoder allows LAIT to further postpone the cross-segment interactions (larger P) without losing accuracy.

B.2 Generalization of LAIT configuration
Here, we report additional results using our split of the existing validation sets into a synthetic validation set and a held-out test set.
Figure 5 reports the decrease in model performance as the number of parallel encoder layers increases. Table 9 reports the held-out test results for the LAIT models with the best synthetic validation performance. Table 11 includes the tasks with more than two segments. Table 10 reports the cost of both full encoding and partially-cached encoding for the LAIT models identified from Tables 9 and 11.

Figure 1 :
Figure 1: A comparison of three approaches to multi-segment modeling for an arbitrary claim verification task. a) Fully self-attentive architecture, with each token attending to each other token over L layers. b) Generalized dual encoder, with each segment encoded separately by an L-layer Transformer and the representations concatenated. c) Layer-adjustable interactions (ours), with P layers of independent segment encoding and L − P layers of fully self-attentive segment encoding.

Figure 3 :
Figure 3: In the parallel layers of LAIT, segments are concatenated, but a block-diagonal attention mask maintains independent encoding of each segment. Figure design adapted from Guo et al. (2021).

Figure 4 :
Figure 4: Relative performance of LAIT vs. the fully self-attentive T5 baseline on a variety of multi-segment natural language processing tasks. For all tasks, we report performance on the validation set. Performance only degrades after 8-10 layers of independent segment processing. 95% confidence intervals are computed via bootstrapping on the evaluation data.

Figure 5 :
Figure 5: Relative performance of LAIT vs. the fully self-attentive T5 baseline on a variety of multi-segment natural language processing tasks. For all tasks, we report performance on a held-out portion of the validation set. Performance only degrades after 8-10 layers of independent segment processing. 95% confidence intervals are computed via bootstrapping on the evaluation data.

Table 1 :
Summary of the evaluated tasks: number of segments (n) and average token length of each segment. Measured on training sets.

Table 2 :
Results comparing LAIT configurations with Dual Encoder and Transformer baselines across a variety of sentence-level reasoning tasks. To make comparison with other work easier, we report the best score on the validation set. See Table 9 for a synthetic test-set comparison of LAIT configurations. The most efficient LAIT model within 99% of LAIT-0 performance is in bold, the most efficient LAIT model within 95% of LAIT-0 performance is underlined, and the most efficient LAIT model whose validation score is within the bootstrapped 95% confidence interval of LAIT-0 is boxed.

Table 3 :
Results for tasks with more than two segments. Bold, underline, and box indicate model performance as in Table 2.
Table 10 contains results for LAIT. As we would expect from Equation (1), datasets with text segments of similar size benefit the most in the typical setting. However, fully realizing this benefit for single forward passes would require a custom kernel, such as those implemented in work on sparse transformers.

Table 4 :
Percent of encoder attention FLOPs (compared to T5-base) when using the LAIT-95% model for each task to process the entire validation set (lower is better). LAIT-95% selection is based on results in Tables 9 and 11 in the Appendix.

Table 6 :
Accuracy of FEVER- and VitaminC-trained LAIT models on FEVER, VitaminC, and MNLI.

Table 7 :
Results for different numbers of parallel layers P of LAIT using the same setting as Table 2, but with 12 decoder layers instead of a single decoder layer. Hence, P = 0 is similar to the T5-base setting from Table 2 (numbers are not identical due to different training runs). The extra decoder layers allow further increasing P compared to the single-decoder-layer setting while maintaining similar performance. The relative column shows the accuracy or F1 change compared to P = 0.

Table 8 :
Results for different numbers of parallel layers P of LAIT with a T5-Large backbone model, using all 24 decoder layers. The relative column shows the accuracy or F1 change compared to P = 0.