TAPIR: Learning Adaptive Revision for Incremental Natural Language Understanding with a Two-Pass Model

Language is by its very nature incremental in how it is produced and processed. This property can be exploited by NLP systems to produce fast responses, which has been shown to be beneficial for real-time interactive applications. Recent neural network-based approaches for incremental processing mainly use RNNs or Transformers. RNNs are fast but monotonic (cannot correct earlier output, which can be necessary in incremental processing). Transformers, on the other hand, consume whole sequences, and hence are by nature non-incremental. A restart-incremental interface that repeatedly passes longer input prefixes can be used to obtain partial outputs, while providing the ability to revise. However, this method becomes costly as the sentence grows longer. In this work, we propose the Two-pass model for AdaPtIve Revision (TAPIR) and introduce a method to obtain an incremental supervision signal for learning an adaptive revision policy. Experimental results on sequence labelling show that our model has better incremental performance and faster inference speed compared to restart-incremental Transformers, while showing little degradation on full sequences.


Introduction
Incrementality is an inseparable aspect of language use. Human speakers can produce utterances based on an incomplete message formed in their minds, while simultaneously continuing to refine its content for subsequent speech production (Kempen and Hoenkamp, 1982, 1987). They also comprehend language on (approximately) a word-by-word basis and do not need to wait until the utterance finishes to grasp its meaning (Kamide, 2008).
As observed by Madureira and Schlangen (2020), a natural option for neural network-based incremental processing would be RNNs (Rumelhart et al., 1986), as they have essential properties required in incremental scenarios: they keep a recurrent state, are sensitive to the notion of order, and are able to accept partial input and produce an output at each time step. Ideally, an incremental processor should also be able to revise its previous incorrect hypotheses based on new input prefixes (Schlangen and Skantze, 2009). However, RNNs are unable to do so, as their output is monotonic. The Transformer architecture (Vaswani et al., 2017) has been the de facto standard for many NLP tasks since its inception. Nevertheless, it is not designed for incremental processing, as input sequences are assumed to be complete and are processed as a whole. A restart-incremental interface (Beuck et al., 2011; Schlangen and Skantze, 2011) can be applied to adapt Transformers for incremental processing (Madureira and Schlangen, 2020): the available input prefix is recomputed at each time step to produce partial outputs. Such an interface also provides the capability to revise existing outputs through its non-monotonic nature. Although feasible, this method does not scale well for long sequences, since the number of required forward passes grows with the sequence length. The revision process is also not efficient, as it occurs at every time step, even when it is unnecessary.
Revision is crucial in incremental processing, as it is not always possible for a model to be correct at the first attempt, either because the linguistic input is provided in its inherent piecemeal fashion (as shown in Figure 1) or because of mistakes due to poor approximation. One way to improve the output quality is the delay strategy (Beuck et al., 2011; Baumann et al., 2011), where tokens within a lookahead window are used to disambiguate the currently processed input. However, it can neither fix past hypotheses nor capture long-range influences, e.g. in garden path sentences.
In this work, we propose the Two-pass model for AdaPtIve Revision (TAPIR), which is capable of adaptive revision while also being fast in incremental scenarios. This is achieved by using a revision policy to decide whether to WRITE (produce a new output) or REVISE (refine existing outputs based on new evidence), a mechanism described in §3. Learning this policy requires a supervision signal which is usually not present in non-incremental datasets (Köhn, 2018). In §4, we tackle this issue by introducing a method for obtaining action sequences using the Linear Transformer (LT) (Katharopoulos et al., 2020). As silver labels, these action sequences allow us to view policy learning as a supervised problem.
Experiments on four NLU tasks in English, framed as sequence labelling, show that, compared to a restart-incremental Transformer encoder, our model is considerably faster for incremental inference, with better incremental performance, while being comparable when processing full sequences. Our in-depth analysis inspects TAPIR's incremental behaviour, showing its effectiveness at avoiding ill-timed revisions of correct prefixes.

Related Work
There has been increasing interest in exploring neural network-based incremental processing. Žilka and Jurčíček (2015) proposed a dialogue state tracker using an LSTM (Hochreiter and Schmidhuber, 1997) to incrementally predict each component of the dialogue state. Liu et al. (2019) introduced an incremental anaphora resolution model composed of a memory unit for entity tracking and a recurrent unit as the memory controller. RNNs still fall short on non-incremental metrics due to their strict left-to-right processing. Some works have attempted to address this issue by adapting BiLSTMs or Transformers for incremental processing and applying them to sequence labelling and classification tasks (Madureira and Schlangen, 2020; Kahardipraja et al., 2021) and disfluency detection (Rohanian and Hough, 2021; Chen et al., 2022).
Our revision policy is closely related to the concept of a policy in simultaneous translation, which decides whether to wait for another source token (READ action) or to emit a target token (WRITE action). Simultaneous translation policies can be categorised into fixed and adaptive. An example of a fixed policy is the wait-k policy (Ma et al., 2019), which waits for the first k source tokens before alternating between writing and reading a token. An adaptive policy, on the other hand, decides to read or write depending on the available context and can be learned by using reinforcement learning techniques (Grissom II et al., 2014; Gu et al., 2017) or by applying monotonic attention (Raffel et al., 2017; Chiu and Raffel, 2018; Arivazhagan et al., 2019; Ma et al., 2020).
The memory mechanism is a key component for revision policy learning, as it stores representations which, for instance, can be used to ensure that the action is correct (Guo et al., 2022). It also absorbs asynchronies that may arise when the components in an incremental system have different processing speeds (Levelt, 1993). The memory can be internal, as in RNNs, or external, as in memory networks (Weston et al., 2015; Sukhbaatar et al., 2015) and the Neural Turing Machine (Graves et al., 2014).
Revision in incremental systems has been previously explored. In simultaneous spoken language translation, Niehues et al. (2016) proposed a scheme that allows re-translation when an ASR component recognises a new word. Arivazhagan et al. (2020) evaluated streaming translation against re-translation models that translate from scratch for each incoming token and found that re-translation yields results comparable to streaming systems. Zheng et al. (2020) proposed a decoding method for simultaneous translation that overgenerates target words at each step, which are subsequently revised. One way to achieve revision is to employ a two-pass strategy. Xia et al. (2017) proposed the deliberation network for machine translation, an encoder-decoder architecture with an additional second-pass decoder to refine the generated target sentence. In dialogue domains, this strategy has also been used to improve the contextual coherence and correctness of responses (Li et al., 2019) and to refine the output of retrieval-based dialogue systems (Song et al., 2018; Weston et al., 2018). Furthermore, the two-pass approach is commonly utilised in streaming ASR to improve the initial hypothesis (Sainath et al., 2019; Hu et al., 2020; Wang et al., 2022, inter alia).
The aforementioned works share a common trait: they use a fixed policy and perform revision either for each incoming input or only when the input is already complete. Our approach differs in that our model learns an adaptive policy that results in more timely revisions. Contemporaneous to our work, Kaushal et al. (2023) proposed a cascaded uni- and bidirectional architecture with an additional module to predict when to restart. The module is trained with a supervision signal obtained from comparing the model's predictions against the ground truth. Their approach is effective in reducing the required computational budget.

Model
To address the weaknesses of RNN-only and Transformer-only architectures for incremental processing (§1), we introduce the Two-pass model for AdaPtIve Revision (TAPIR), which integrates advantages of both models and is based on the deliberation network (Xia et al., 2017). Its architecture, depicted in Figure 2, consists of the following four components:
1. Incremental processor: an RNN that performs the first pass, producing a candidate output for each incoming token. We use RNNs as they are computationally cheap, offering a considerable speed-up in incremental settings.
2. Reviser: a Transformer encoder that performs the second pass, recomputing all existing outputs when a revision is deemed necessary.
3. Memory: a set of caches that store representations of processed inputs and their corresponding outputs.
4. Controller: a neural network that parameterises the revision policy. We choose a recurrent controller following Graves et al. (2014), as its internal memory complements the memory module and is also suitable for incremental scenarios. We use a modified LSTMN (Cheng et al., 2016) for this component.

Figure 2: TAPIR computes a candidate output using an RNN at each time step. Then the controller decides whether to WRITE, adding the new output to the output buffer, or to take a REVISE action, which can edit the output buffer after observing the effect of the new input on past outputs with the help of the memory.
During incremental inference, TAPIR computes a candidate output y_t for the most recent input x_t as the first pass. Then, based on x_t and the memory, it decides whether to take a WRITE (add y_t to the output buffer) or REVISE (perform a second pass to recompute all existing outputs) action. The action is defined by a revision policy π_θ, which models the effect of new input on past outputs. At each time step t, π_θ makes use of the processed inputs x_{≤t} and past outputs y_{<t} to select a suitable action a_t. It is parameterised by the controller hidden state k_t with a non-linear function g:

$$a_t = \pi_\theta(x_{\le t}, y_{<t}) = g(k_t) \quad (1)$$

Revision Policy
In restart-incremental models, revisions can occur as a result of recomputations, which are costly since they happen at every time step, even when no revisions occur. TAPIR revises by selectively deciding when to recompute, which enables it to revisit previous outputs at different points in time while reducing the number of recomputations.

Memory Content. The memory in TAPIR contains information pertaining to processed inputs and their corresponding outputs, which is crucial for our approach: it enables our model to perform relational learning between an incoming input and past outputs, using past inputs as an additional cue. Here, we use three caches Γ. Γ^h stores the hidden state h of the incremental processor, representing the current input prefix; Γ^z stores the projected output vector z, which represents the output; and Γ^p stores the input-output representation φ, which is computed from h and z. The i-th slot of the caches contains γ^h_i, γ^z_i, γ^p_i, all of them computed at the same time step. The representations z and φ are computed as follows:

$$z_t = W_{\tilde{y}}\,\tilde{y}_t + b_z, \qquad \varphi_t = W_{in}\, h_t + W_{out}\, z_t + b_\varphi$$

where ỹ is the output logits from the incremental processor; W_ỹ, W_in, and W_out are parameters, while b_z and b_φ are bias terms. The dimension of z and h is the same. We keep the cache size N small, as we later perform soft attention over Γ^p; the attention computation for large cache sizes is costly and not suitable for incremental scenarios. Due to this limitation, the oldest cache element is discarded when the cache is full and new partial input arrives.

Modelling Actions. To model possible changes in past outputs as an effect of a new input, we use an LSTMN controller, due to its ability to induce relations among tokens. It computes the relation between h_t and each cache element γ^p_i via an attention mechanism:

$$u_i^t = v^\top \tanh\!\left(W_c\, \gamma^p_i + W_h\, h_t + W_k\, \tilde{k}_{t-1} + b_u\right), \qquad s_i^t = \operatorname{softmax}(u_i^t)$$

which yields a probability distribution over Γ^p. Here, k̃_{t−1} is the previous summary vector of the controller hidden state; W_c, W_h, W_k, and v are parameters, and b_u is a bias term. We can then compute adaptive summary vectors k̃_t and c̃_t as a weighted sum of the cache Γ^p and the controller memory tape C_{t−1}:

$$\begin{bmatrix} \tilde{k}_t \\ \tilde{c}_t \end{bmatrix} = \sum_{i=1}^{N} s_i^t \cdot \begin{bmatrix} \gamma^p_i \\ c_{i+\max(0,\,t-N-1)} \end{bmatrix}$$

where c_{i+max(0, t−N−1)} is the controller memory cell for the corresponding cache element γ^p_i. The attention can be partially viewed as local (Luong et al., 2015), since older cache elements are incorporated through k̃_{t−1}. These summary vectors are used to compute the recurrent update, following the LSTMN formulation:

$$i_t = \sigma(W_i[\tilde{k}_t; h_t] + b_i), \quad f_t = \sigma(W_f[\tilde{k}_t; h_t] + b_f), \quad o_t = \sigma(W_o[\tilde{k}_t; h_t] + b_o)$$
$$\hat{c}_t = \tanh(W_{\hat{c}}[\tilde{k}_t; h_t] + b_{\hat{c}}), \quad c_t = f_t \odot \tilde{c}_t + i_t \odot \hat{c}_t, \quad k_t = o_t \odot \tanh(c_t)$$

Lastly, k_t is used by the revision policy to compute the action a_t:

$$a_t = \begin{cases} \text{REVISE} & \text{if } \sigma(\theta^\top k_t + b_k) \ge \tau \\ \text{WRITE} & \text{otherwise} \end{cases} \quad (11)$$

where σ(·) is the sigmoid function, θ is a parameter vector, b_k is the bias, and τ ∈ [0, 1] is a decision threshold. According to equation (11), a REVISE action is selected only if the policy value is greater than or equal to τ; otherwise, a WRITE action is chosen. This threshold can be adjusted to encourage or discourage recomputation without the need to retrain the policy. Our model is equal to an RNN when τ = 1 (never recompute), and becomes a restart-incremental Transformer when τ = 0 (always recompute).
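To make the decision rule of equation (11) concrete, here is a minimal sketch in PyTorch-style Python (the names are ours, not from the released code):

```python
import torch

def select_action(k_t: torch.Tensor, theta: torch.Tensor, b_k: float,
                  tau: float = 0.5) -> str:
    """Sketch of equation (11): REVISE iff sigmoid(theta^T k_t + b_k) >= tau.

    k_t:   controller hidden state at time t, shape (d,)
    theta: policy parameter vector, shape (d,)
    b_k:   scalar bias; tau: decision threshold in [0, 1]
    """
    policy_value = torch.sigmoid(theta @ k_t + b_k)
    return "REVISE" if policy_value >= tau else "WRITE"
```

Consistent with the text above, tau = 1.0 effectively disables recomputation (RNN behaviour), while tau = 0.0 triggers a recomputation at every step (restart-incremental behaviour).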

Incremental Inference Mechanism
Using the policy, TAPIR predicts when to perform a recomputation. Assume that an input token x_t is fed to the RNN component to obtain y_t. The controller then reads x_t, h_t, and Γ^p to compute a_t. If a REVISE action is emitted, the input buffer (containing all inputs available so far) is passed to the reviser to yield the recomputed outputs. When this happens, both z and φ stored in the caches also need to be updated to reflect the effect of the recomputation: the past z and φ are recomputed simultaneously with the computation of z and φ for the current time step, updating Γ^z and Γ^p using the recomputed outputs. If a WRITE action is emitted, we take y_t to be the current output and continue to process the next token; the contents of Γ^z and Γ^p are likewise updated for the current step. The cache Γ^h is always updated, regardless of which action the policy takes. See Algorithm 1 in the Appendix.
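The following sketch outlines this mechanism as a processing loop. It is a simplified sketch: the component interfaces (incremental_processor.step, controller.score, reviser, and the cache helpers) are placeholders for the modules described above, and Algorithm 1 in the Appendix is the authoritative version.

```python
def tapir_infer(tokens, incremental_processor, controller, reviser,
                caches, tau=0.5):
    """Sketch of TAPIR's incremental inference loop (assumed interfaces)."""
    input_buffer, output_buffer = [], []
    for x_t in tokens:
        input_buffer.append(x_t)
        # First pass: the RNN consumes the new token and proposes a label.
        h_t, y_t = incremental_processor.step(x_t)
        # The controller reads x_t, h_t and the memory to score a revision.
        if controller.score(x_t, h_t, caches) >= tau:
            # REVISE: a second pass recomputes all outputs for the prefix;
            # the cached z and phi are refreshed from the new outputs.
            output_buffer = list(reviser(input_buffer))
            caches.recompute(output_buffer)
        else:
            # WRITE: keep the candidate label and extend the output buffer.
            output_buffer.append(y_t)
            caches.update(h_t, y_t)
        # Gamma^h is always updated with h_t, regardless of the action.
        yield list(output_buffer)  # partial hypothesis at this time step
```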
Let us use Figure 1 and τ = 0.5 as a constructed example. At t = 1, the incremental processor consumes the token the, updates its hidden state and predicts the POS tag det. The controller predicts that the probability for recomputation is, e.g., 0.3. Since it is lower than τ, det gets written to the output buffer, the memory is updated, and the current step is finished. A similar decision happens at t = 2, where alert is classified as noun. At t = 3, however, the controller predicts that a REVISE action should occur after the input citizens. That triggers the reviser, which takes the alert citizens as input and returns det adj noun. The output buffer gets overwritten with this new hypothesis and the caches are recomputed to accommodate the new state. This dynamic continues until the end of the sentence.

Training
Jointly training all components of such a two-pass model from scratch can be unstable (Sainath et al., 2019), so we opt for a two-step training process:
1. Train only the reviser, using a cross entropy loss.
2. Train the incremental processor and the controller together with a combined loss:

$$\mathcal{L} = \mathcal{L}_{label}(y, y^{gold}) + \mathcal{L}_{action}(a, a^{LT}) \quad (12)$$

where y^gold is the expected output and a^LT is the expected action.
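As an illustration, equation (12) could be implemented along these lines (a sketch under the assumption of an unweighted sum of a cross entropy label term and a binary cross entropy action term):

```python
import torch.nn.functional as F

def combined_loss(label_logits, y_gold, action_logits, a_lt):
    """Sketch of equation (12): label loss plus action loss.

    label_logits:  (T, num_labels) logits from the incremental processor
    y_gold:        (T,) gold label indices
    action_logits: (T,) raw policy scores from the controller
    a_lt:          (T,) silver actions from the LT (1.0 = REVISE, 0.0 = WRITE)
    """
    label_loss = F.cross_entropy(label_logits, y_gold)
    action_loss = F.binary_cross_entropy_with_logits(action_logits, a_lt)
    return label_loss + action_loss
```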

Supervision Signal for Revision
During incremental sentence comprehension, a revision or reanalysis occurs when disambiguating material rules out the current interpretation of the sentence. In Figure 1, noun is a valid label for suspect at t = 6, but person at t = 7 rules that analysis out, forcing a reanalysis to adj instead. Training TAPIR's controller requires a sequence of WRITE/REVISE actions as the supervision signal a^LT in equation (12), capturing when revision happens. This signal then allows us to frame policy learning as a supervised learning task (as in the work of Zheng et al. (2019)).
If we have the sequence of output prefix hypotheses at each step, as shown in Figure 1, we know that the steps at which revisions occurred are t ∈ {3, 7}. We can then construct the sequence of actions we need. The first action is always WRITE, as there is no past output to revise at this step. For t > 1, the action can be determined by comparing the partial outputs at time step t (excluding y_t) against the partial outputs at time step t − 1. If no edits occur, the partial outputs after processing x_t should not change, and a WRITE action is appended to the sequence. If any edits occur, we append a REVISE action instead.
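This comparison is straightforward to implement; a minimal sketch, assuming prefixes[t] holds the list of labels produced after processing the (t+1)-th token:

```python
def derive_actions(prefixes):
    """Derive silver WRITE/REVISE actions from logged partial outputs."""
    actions = ["WRITE"]  # the first action is always WRITE
    for t in range(1, len(prefixes)):
        # Compare the current partial output, excluding its newest label,
        # against the previous partial output.
        if prefixes[t][:-1] == prefixes[t - 1]:
            actions.append("WRITE")   # no edits: past labels unchanged
        else:
            actions.append("REVISE")  # at least one past label was edited
    return actions
```

For the prefixes in Figure 1, this yields REVISE at t = 3 and t = 7 and WRITE everywhere else.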
Intermediate human judgements about when to revise are not available, so we need to retrieve this signal from a model. It is possible to obtain it from a restart-incremental Transformer, by comparing how the prefix output at t differs from the one at t − 1. However, as shown by Kahardipraja et al. (2021), the signal captured with this approach may lack incremental quality due to the missing recurrence mechanism. Using a recurrent model is advisable here, as it can capture order and hierarchical structure in sentences, which is apparently hard for Transformers (Tran et al., 2018; Hahn, 2020; Sun and Lu, 2022). But it is difficult to retrieve this signal using vanilla RNNs, because their recurrence only allows a unidirectional information flow, which prevents a backward update of past outputs.
Therefore, we opt for the Linear Transformer (LT) (Katharopoulos et al., 2020), which can be viewed both as a Transformer and as an RNN. To generate the action sequences, we first train the action generator LT with a causal mask, to mimic RNN training. Afterwards, it is deployed under restart-incrementality on the same set used for training, with the mask removed. We collect the sequence of partial prefixes for all sentences and use it to derive the action sequences.

Experiments
Datasets. We evaluate TAPIR on four tasks in English, for NLU and task-oriented dialogue, using seven sequence labelling datasets, among them Alarm, Reminder & Weather (Schuster et al., 2019) and MIT Movie (Liu et al., 2013); the tasks include chunking (CoNLL-2003).
Table 1 shows the distribution of generated actions in the final training set for each task. Further details regarding the datasets and generated action sequences are available in the Appendix.

Evaluation. An ideal incremental model deployed in real-time settings should (i) exhibit good incremental behaviour, i.e. produce correct and stable partial hypotheses and recover from its mistakes in a timely manner; (ii) be efficient for inference, delivering responses without wasting computational resources; and (iii) produce correct final outputs, i.e. not come at the cost of a negative impact on the non-incremental performance. Achieving all three at the same time may be hard, so trade-offs can be necessary. We evaluate TAPIR on these three relevant dimensions. For (i), we use the similarity and diachronic metrics proposed by Baumann et al. (2011) and adapted in Madureira and Schlangen (2020): edit overhead (EO, the proportion of unnecessary edits over all edits), correction time score (CT, the average proportion of time steps required for an output increment to settle down), and relative correctness (RC, the proportion of output prefixes that match the final output); a sketch of EO and RC appears at the end of this section. Aspect (ii) is analysed by benchmarking the incremental inference speed. For (iii), we use the F1 score adapted for the IOB sequence labelling scheme, except for PoS tagging, which is evaluated by measuring accuracy. Rather than trying to beat state-of-the-art results, we focus on analysing the incremental abilities of models whose performances are high enough for our purposes. As a reference model, we use a Transformer encoder applied in a restart-incremental fashion, which implicitly performs revision at every step. We follow Baumann et al. (2011) and Madureira and Schlangen (2020) in evaluating partial outputs with respect to the final output, to separate incremental from non-incremental performance.

Delay strategy. To inspect the effect of right context on the model's performance, we use the delay strategy (Baumann et al., 2011) with a lookahead window of size 1 and 2, computing a delayed version of EO and RC (Madureira and Schlangen, 2020). The output for the reference model is delayed only during inference, as in Madureira and Schlangen (2020). For TAPIR, the same treatment would not be possible, as it contains an RNN that must be able to recognise the output delay. Thus, we follow the approach of Turek et al. (2020): during training and inference, the label for input x_t is expected at time step t + d, where d is the delay.

Implementation. For the reviser component, we choose Transformer (Trf) and Linear Transformer (LT) encoders trained with full attention. The reference model is trained with a cross entropy loss, similar to the reviser. All models are trained with the AdamW optimiser (Loshchilov and Hutter, 2019). We use 300-D GloVe embeddings (Pennington et al., 2014), which, for the reference model and the reviser, are passed through an additional linear projection layer. The probability threshold τ is set to 0.5. We report results for a single run with the best hyperparameter configuration. See the Appendix for details about the set-up and experiments.
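For reference, the two similarity metrics mentioned above can be sketched as follows over a log of partial outputs (a simplified sketch following the definitions given; the original implementations may differ in detail):

```python
def edit_overhead(prefixes):
    """EO: proportion of unnecessary edits over all edits."""
    edits = len(prefixes[0])  # labels written at the first step
    for t in range(1, len(prefixes)):
        prev, curr = prefixes[t - 1], prefixes[t]
        # Count every added or changed label relative to the previous step.
        edits += sum(1 for i, label in enumerate(curr)
                     if i >= len(prev) or prev[i] != label)
    necessary = len(prefixes[-1])  # each final label must be written once
    return (edits - necessary) / edits if edits else 0.0

def relative_correctness(prefixes):
    """RC: proportion of partial outputs that are prefixes of the final one."""
    final = prefixes[-1]
    matches = sum(1 for p in prefixes if p == final[:len(p)])
    return matches / len(prefixes)
```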

Results and Analysis
Incremental. Figure 3 depicts the incremental evaluation results. For the no-delay case, TAPIR performs better than the reference model. We also observe that the delay strategy helps improve the metrics. It improves the results for TAPIR in general, but a longer delay does not always yield better incremental performance. We suspect this happens for two possible reasons. First, with a delay of 1, TAPIR has already achieved relatively low EO (< 0.1) and high RC (> 0.85); this, combined with its non-monotonic behaviour, might make it harder to further improve on both incremental metrics, even if a longer delay is allowed. Second, a longer delay means that our model needs to wait longer before producing an output; in the meantime, it still has to process incoming tokens, which might cause some difficulty in learning the relation between the input and its corresponding delayed output. As a consequence, we obtain mixed results when comparing EO and RC for the delayed versions of the reference model and TAPIR. Their differences are, however, very small. TAPIR achieves low EO and CT scores, which indicates that the partial output is stable and settles down quickly. RC is also high, which shows that, most of the time, the partial outputs are correct prefixes of the final, non-incremental output and would be useful for downstream processing.

Speed. Table 2 shows that TAPIR is considerably faster than the reference model in incremental settings, offering, on average, a ∼4.5× speed-up in terms of sequences per second.

Non-Incremental. The performance of the restart-incremental reference model and our model on full sentences is shown in Table 3. The results of TAPIR, in particular with the Transformer reviser (TAPIR-Trf), are roughly comparable to the reference model, with only modest differences (0.96%–4.12%). TAPIR-Trf performs slightly better than TAPIR-LT, possibly due to the approximation of softmax attention in LT, which leads to some degradation in output quality. Furthermore, we see that a delay of 1 or 2 tokens is generally beneficial for TAPIR. Note that we do not force a REVISE action at the final time step, so as to examine the effect of the learned policy on TAPIR's performance, although that would be a strategy to achieve the same non-incremental performance as the reference model.

Table 3: Non-incremental performance of the models on test sets (first group is F1, second group is accuracy). D = delay. The performance of TAPIR is roughly comparable to the reference model.

Detailed Analysis
In the next paragraphs, we assess TAPIR-Trf on aspects beyond the basic evaluation metrics.

Policy Effectiveness. Figure 4 shows the distributions of actions and states of the output prefixes. Here, a prefix is considered correct if all its labels match the final output, and incorrect otherwise. We start by noticing that most of the actions are WRITE, and that, among them, very few occur when the prefix is incorrect. TAPIR is thus good at recognising states where recomputation is not required, supporting its speed advantage. A good model should avoid revising prefixes that are already correct. We see that, for all datasets, the vast majority of the correct prefixes indeed do not get revised. A utopian model would never make mistakes (and thus never need to revise) or would immediately revise incorrect prefixes. In reality, this cannot be achieved, given the incremental nature of language and long-distance dependencies. As a result, incorrect prefixes are expected to have a mixed distribution over actions, as the model needs to wait for the edit-triggering input, and our results corroborate that. Finally, among the REVISE actions (i.e. the lighter bars in the bottom area of Figure 4), there is still a considerable relative number of unnecessary revisions occurring for correct prefixes. We see room for further refinement of the policy in that sense, but, in absolute numbers, the occurrence of recomputations is much lower than in the restart-incrementality paradigm, where every step requires a recomputation.

Qualitative Analysis. Figure 5 shows two examples of how TAPIR behaves in incremental slot filling (more examples in the Appendix), demonstrating that it performs critical revisions that would not be possible with a monotonic model. At the top, the model must produce labels for unknown tokens, which is harder to do correctly. The first UNK token is initially interpreted as a city at t = 6, which is probably deemed correct considering the available left context. The controller agrees, producing a WRITE action. However, when heritage and the second UNK token have been consumed at t = 8, the incremental processor labels them as parts of a geographic point of interest. The controller is able to notice the output inconsistency, as I-geographic_poi should be preceded by B-geographic_poi (following the IOB scheme), and emits a REVISE action. As a result, the label B-city is correctly replaced.
In the second example, TAPIR produces interesting interpretations. It initially considers woods to be an actor name at t = 4. When it reads have, the reanalysis triggered by the controller interprets woods as part of a title, the woods. The model revises its hypothesis again at t = 6 and decides that the complete title should be the woods have eyes. It still makes a mistake at the last time step, opting for a (wrong) revision of O to B-RATING for rated, when a revision should be unnecessary.

Effect of Threshold. Figure 6 portrays the effect of the probability threshold τ on incremental and non-incremental metrics. As τ increases, the incremental performance improves while the non-incremental performance deteriorates. This happens because a higher τ discourages recomputation and makes the model closer to an RNN; in return, it becomes harder for the model to revisit its past decisions.

Conclusion
We proposed TAPIR, a two-pass model capable of performing adaptive revision in incremental scenarios, e.g. for dialogue and interactive systems.
We also demonstrated that it is possible to obtain an incremental supervision signal using the Linear Transformer (LT), in the form of WRITE/REVISE action sequences, to guide the policy learning for adaptive revision. Results on sequence labelling tasks showed that TAPIR has better incremental performance than a restart-incremental Transformer in general, while being roughly comparable to it on full sentences. The delay strategy helps improve incremental and non-incremental metrics, although a longer delay does not always yield better results.
The ability to revise adaptively provides our model with substantial advantages over RNNs and restart-incremental Transformers. It can fix incorrect past outputs after observing incoming inputs, which is not possible for RNNs. From the perspective of efficiency, our model is also better than restart-incremental Transformers, as recomputation is only performed when the need for it is detected. TAPIR is consequently faster in terms of inference speed.

Limitations
In this section, we discuss some of the known limitations of our set-up, data and models.
To handle unknown words in the test sets, we replace them with a special UNK token, which is also used to mask some tokens in the training set. The UNK token provides little information regarding the actual input, and TAPIR might be unable to fully utilise it to refine its interpretation of the past output. This has a direct influence on the incremental metrics, as the model can exploit this property by using the UNK token as a cue to emit a REVISE action. This strategy also introduces an extra hyperparameter: the proportion of tokens to mask.
We put effort into achieving a diverse selection of datasets across various tasks, but our analysis is limited to English. We report results on the datasets for which the non-incremental versions of the model could achieve a performance high enough to allow a meaningful evaluation of their incremental performance. Tuning is still required to extend the analysis to other datasets.
Related to these two issues, we decided to use tokens as the incremental unit of processing, following the tokenization given by the sequence labelling datasets we use. Extending the analysis to other languages thus requires a good tokenizer and annotated data, which may not exist. We may also inherit limitations from the datasets that we use. Although we do not include an in-depth analysis of the datasets, as our focus is on the model and not on solving the tasks themselves, they are widely used by the community and details are available in their corresponding publications.
The method we propose to retrieve the action sequences depends on the chosen model, and the grounding of the action sequences in the actual prefix outputs has a direct influence on training the controller. Therefore, the decisions made by TAPIR rely on the quality of the underlying generated action sequences. In order to ensure that the internal representations of the action generator LT do not depend on right context, we had to restrict ourselves to a single-layer variant of this model when generating the sequences of actions. It is possible that its behaviour would differ with more layers, but that would invalidate the assumptions needed for an incremental processor.
When it comes to the TAPIR architecture, the attention scores for the controller are computed independently of temporal order, and we do not explicitly model relations between cache elements. The limited cache size also means that some past information has to be discarded to accommodate incoming inputs. Although we have made efforts to incorporate older elements through the summary vector, this might not be ideal due to the information bottleneck.

Ethics Statement
We do not see any immediate ethical issues arising from this work, beyond those inherent to NLP which are under discussion by the community.

A Appendix
In this section, we provide information regarding the hyperparameters, implementation, and additional details needed to reproduce this work. We also present supplementary materials to accompany the main text (the proof for §4, Algorithm 1, and Figures 7 and 8). For all of our experiments, the seed is set to 42119392. We re-implement the Transformer and the LSTMN used in this work, while for the Linear Transformer (LT) we use the official implementation (https://linear-transformers.com/). Further information regarding dependencies and versions is available in the repository.

Datasets
Tables 6 and 7 summarise the datasets. For SNIPS, we use the preprocessed data and splits provided by E et al. (2019). As the MIT Movie dataset does not have an official validation set, we randomly select 10% of the training data as the validation set. We also remove sentences longer than 200 words. While we use the validation set to tune the hyperparameters of our models, the results on test sets are obtained with models trained on the combination of training and validation sets.

Action Sequence Generation
For the action sequence generation, we train a single-layer LT for 20 epochs with linear learning rate warm-up over the first 5 epochs. We use the AdamW optimiser (Loshchilov and Hutter, 2019) with β_1 = 0.9 and β_2 = 0.98. Xavier initialisation (Glorot and Bengio, 2010) is applied to all parameters. The learning rate is set to 1e-4, with gradient clipping of 1, dropout of 0.1, and a batch size of 128. We set the FFNN dimension to 2048 and the self-attention dimension to 512, with 8 attention heads. The same hyperparameters are used for all datasets. Action sequences for training the final models are obtained using single-layer LTs that are trained on the combination of training and validation sets.

Implementation and training details
Our reference model and TAPIR are trained for 50 epochs with dropout of 0.1 and early stopping with a patience of 10. For AdamW, we use β_1 = 0.9 and β_2 = 0.98. We also apply Xavier initialisation to all parameters. To train the reference model and the reviser, we use linear learning rate warm-up over the first 5 epochs. The learning rate is decayed by 0.5 after 30, 40, and 45 epochs for all models. The number of attention heads for the Transformer and LT encoders is set to 8, where each head has dimension d_model/8 and d_model is the self-attention dimension. The embedding projection layer is of size d_model.

For OOV words, we follow Žilka and Jurčíček (2015) by randomly replacing tokens with an UNK token during training, with a probability that we set to 0.02, and then use this token whenever we encounter unknown words during inference (see the sketch at the end of this subsection). Hyperparameter search is performed using Optuna (Akiba et al., 2019) by maximising the corresponding non-incremental metric on the validation set. We limit the hyperparameter search to 25 trials for all of our experiments.

Different from the two-pass model of Sainath et al. (2019), during training we do not take the reviser trained in step (1), freeze its weights, and use it for training step (2). This is because, when recomputation occurs, we use output logits from the reviser to recompute z and φ, and freezing would mean that the error from the previous z and φ could not be backpropagated. We also experimented with using unit logits (ỹ/∥ỹ∥) to compute z, as the logit values from the incremental processor and the reviser might differ in magnitude, but using raw logits proved more effective. All experiments were conducted on a GeForce GTX 1080 Ti and took ∼2 weeks to complete.
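As an illustration of the OOV handling mentioned above, the training-time masking could look like this (a minimal sketch; the token name is ours and the probability is the value reported above):

```python
import random

def mask_oov(tokens, p_unk=0.02, unk_token="<UNK>"):
    """Randomly replace training tokens with UNK so the model learns to
    handle unknown words that are mapped to UNK at inference time."""
    return [unk_token if random.random() < p_unk else t for t in tokens]
```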

Overview of the Linear Transformer
The Linear Transformer (LT) (Katharopoulos et al., 2020) uses a kernel-based formulation and the associative property of matrix products to approximate the softmax attention in conventional Transformers, which is a special case of self-attention. In LT, the self-attention for the i-th position is expressed as:

$$V'_i = \frac{\phi(Q_i)^\top \sum_{j=1}^{p} \phi(K_j) V_j^\top}{\phi(Q_i)^\top \sum_{j=1}^{p} \phi(K_j)}$$

For unmasked attention with a sequence length of N, p = N, whereas p = i for causal attention. The feature map ϕ is based on the exponential linear unit (elu) (Clevert et al., 2016), specifically ϕ(x) = elu(x) + 1. LT can be viewed as an RNN with hidden states S and Z that are updated as follows:

$$S_i = S_{i-1} + \phi(K_i) V_i^\top, \qquad Z_i = Z_{i-1} + \phi(K_i), \qquad V'_i = \frac{\phi(Q_i)^\top S_i}{\phi(Q_i)^\top Z_i}$$

with initial states S_0 = Z_0 = 0.
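A minimal sketch of this recurrent view (single attention head, feature map ϕ(x) = elu(x) + 1; variable names are ours):

```python
import torch
import torch.nn.functional as F

def phi(x):
    # Feature map from Katharopoulos et al. (2020): phi(x) = elu(x) + 1
    return F.elu(x) + 1.0

def linear_attention_as_rnn(Q, K, V):
    """Causal linear attention computed recurrently, one step at a time.

    Q, K: (T, d) query/key matrices; V: (T, d_v) values. Returns (T, d_v).
    """
    d, d_v = K.shape[1], V.shape[1]
    S = torch.zeros(d, d_v)  # running sum of phi(K_j) V_j^T
    Z = torch.zeros(d)       # running sum of phi(K_j)
    outputs = []
    for i in range(K.shape[0]):
        S = S + torch.outer(phi(K[i]), V[i])
        Z = Z + phi(K[i])
        q = phi(Q[i])
        outputs.append((q @ S) / (q @ Z + 1e-6))  # output at position i
    return torch.stack(outputs)
```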
Proof: Duality of the Linear Transformer

Ideally, the information regarding when to revise should be obtained with RNNs, as they have properties that are crucial for incremental processing and can therefore capture a high-quality supervision signal. In practice, this is difficult, because an RNN cannot perform revision and its recurrence only allows a unidirectional information flow, which prevents a backward connection to any past outputs. For example, creating a link between the input x_t and any past outputs requires computing past hidden states from h_t, which is non-trivial. One technique to achieve this is to use reversible RNNs (MacKay et al., 2018) to reverse the hidden state transition, but this is only possible during training. Another approach involves using neural ODEs (Chen et al., 2018) to solve the initial value problem from h_0, which yields h_t for any time step t as the solution, but this would be just an approximation of the true hidden state.

Let us consider an RNN in an incremental scenario, keeping a hidden state h_j. How does x_t affect the earlier output y_j for 1 ≤ j < t? We want an answer that satisfies the following conditions for incremental processing:

1. The converse hidden state for time step j computed at time step t, ḧ_j, is a function of x_t.
2. The computation of h_t is a function of h_{t−1}, and not of ḧ_{t−1}. This is consistent with how RNNs work.
3. The computation of h_{t−1} is valid iff it involves hidden states h_0, …, h_{t−2} that agree with condition (2) at their corresponding steps.
In other words, we want a way to compute converse states ḧ_j as a function of x_t, but without affecting h_t, which is only supposed to be computed from past hidden states built from left to right. We are able to satisfy the conditions above and resolve the conflicting hidden state computation by using the Linear Transformer (LT) (Katharopoulos et al., 2020), which can be viewed both as a Transformer and as an RNN. This allows us to obtain the supervision signal that determines when revision should happen through restart-incremental computation, while still observing how x_t affects all past outputs from the perspective of an RNN.
Let us consider the self-attention computation at time step t for the current and past positions n, n − 1, n − 2 (with n = t), obtained with an LT under restart-incrementality. With the causal mask removed, every position attends over all n processed inputs:

$$V'_n = \frac{\phi(Q_n)^\top S_n}{\phi(Q_n)^\top Z_n}, \qquad S_n = \sum_{j=1}^{n} \phi(K_j) V_j^\top \quad (17)$$

$$\ddot{V}'_{n-1} = \frac{\phi(Q_{n-1})^\top \ddot{S}_{n-1}}{\phi(Q_{n-1})^\top \ddot{Z}_{n-1}}, \qquad \ddot{S}_{n-1} = \sum_{j=1}^{n} \phi(K_j) V_j^\top \quad (18)$$

$$\ddot{V}'_{n-2} = \frac{\phi(Q_{n-2})^\top \ddot{S}_{n-2}}{\phi(Q_{n-2})^\top \ddot{Z}_{n-2}}, \qquad \ddot{S}_{n-2} = \sum_{j=1}^{n} \phi(K_j) V_j^\top \quad (19)$$

From equations (18) and (19) we can see that the hidden states S̈ for computing the representations at positions n − 1 and n − 2 are functions of x_n, which satisfies condition (1). Furthermore, they are equal to each other, i.e., S̈_{n−2} = S̈_{n−1} = S_n = S_t. Note that we only consider S; however, the proof also holds for Z. To satisfy condition (2), consider the self-attention at time step t − 1 for position n − 1:

$$V'_{n-1} = \frac{\phi(Q_{n-1})^\top S_{t-1}}{\phi(Q_{n-1})^\top Z_{t-1}}, \qquad S_{t-1} = \sum_{j=1}^{n-1} \phi(K_j) V_j^\top \quad (20)$$

We also know that S_t = S̈_{n−1}, which means that condition (2) is not completely fulfilled. However, the last clause can be relaxed, as it only exists to ensure that the incremental assumption during the computation of S_t is met.

Algorithm 1 TAPIR. Require: incremental processor ψ, reviser η, caches Γ^h, Γ^z, Γ^p, controller ξ, policy π_θ, input X, input buffer X_buf, output buffer Y_buf.
Figure 1: Illustrative example of how a monotonic incremental POS tagger would not recover from wrong hypotheses. A policy for adaptive revision, here parameterised by a controller, can enable reanalyses to be performed when necessary (here at time steps 3 and 7).

Figure 3: Incremental evaluation of the models on test sets. Edit Overhead, Correction Time Score and Relative Correctness ∈ [0, 1]. Lower is better for EO and CT, while higher is better for RC. TAPIR is better than the reference model for the non-delayed case (output prefixes are often correct and stable). The delay strategy of one lookahead token is beneficial.

Figure 4: Distribution of actions and output prefixes by dataset. Most of the actions are WRITE, and most of the partial prefixes which are correct do not get unnecessarily revised. Incorrect prefixes cannot always be immediately detected, as expected. Part of the REVISE actions are dispensable, but at a much lower frequency than in the restart-incremental paradigm.

Figure 5: Examples of incremental inference (from SNIPS and Movie) for TAPIR-Trf. Edited labels are marked by a diamond symbol, with the immediate past output at the top right corner for right-frontier edits. Red labels are incorrect with respect to the final output.

Figure 6: Effect of the probability threshold τ on incremental and non-incremental metrics, using TAPIR-Trf. Increasing τ leads to improvement of incremental metrics at the cost of non-incremental performance.

Figure 7: Additional inference examples from SF-SNIPS obtained with TAPIR-Trf. Edited labels are marked by a diamond symbol, with the immediate past output at the top right corner for right-frontier edits. Red labels are incorrect with respect to the final output. In the first example, how does it is interpreted as an object name at t = {4, 5}, but is revised to part of an album name when TAPIR reads by. It still makes a mistake at the last step, as it edits the label for how from B-album to B-track when it is unnecessary. In the second example, TAPIR initially labels this in rate this as B-object_select, which probably suits the available evidence at t = 2. When it encounters the UNK token, B-object_select is revised to O.

Table 2: Comparison of incremental inference speed on test sets. TAPIR is ∼4.5× faster than the reference model. All results are in sentences/sec.
Lyn Frazier and Keith Rayner. 1982. Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14(2):178-210.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1053-1062, Valencia, Spain. Association for Computational Linguistics.

Patrick Kahardipraja, Brielen Madureira, and David Schlangen. 2021. Towards incremental transformers: An empirical analysis of transformer models for incremental NLU. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1178-1189, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Table 4: Hyperparameter search space for the reference model and TAPIR. The reference model and the reviser share the same search space.

Table 5: Hyperparameters for our experiments. We use the same hyperparameters for the delayed variants.

Table 6: Details about each dataset.

Table 7: Descriptive statistics of the datasets. The vocabulary size is computed from the training and validation sets.

Table 8: Mean of WRITE and REVISE action ratios per sentence, for training sets and the combination of training and validation sets. Most of the time, the mean of the WRITE action ratio is higher than that of the REVISE action ratio.

Table 9: Distribution of examples in each dataset by their REVISE action ratio, for training sets and the combination of training and validation sets. Most of the examples in the datasets have a considerably low REVISE action ratio (< 0.6).

Table 10: Non-incremental performance of the models on validation sets (F1 for the first group, accuracy for the second group).

Table 11: Number of parameters for each model, in millions.

Table 13: Overall distribution of actions and prefixes on test sets using TAPIR-Trf. W represents WRITE and R represents REVISE. C and I denote correct and incorrect output prefixes, respectively.