Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU

Incremental processing allows interactive systems to respond based on partial inputs, which is a desirable property e.g. in dialogue agents. The currently popular Transformer architecture inherently processes sequences as a whole, abstracting away the notion of time. Recent work attempts to apply Transformers incrementally via restart-incrementality: repeatedly feeding increasingly longer input prefixes to an unchanged model to produce partial outputs. However, this approach is computationally costly and does not scale efficiently for long sequences. In parallel, we witness efforts to make Transformers more efficient, e.g. the Linear Transformer (LT) with a recurrence mechanism. In this work, we examine the feasibility of LT for incremental NLU in English. Our results show that the recurrent LT model has better incremental performance and faster inference speed compared to the standard Transformer and the LT with restart-incrementality, at the cost of some of the non-incremental (full-sequence) quality. We show that the performance drop can be mitigated by training the model to wait for right context before committing to an output, and that training with input prefixes is beneficial for delivering correct partial outputs.


Introduction
One fundamental property of human language processing is incrementality (Keller, 2010). Humans process language on a word-by-word basis, maintaining a partial representation of the sentence meaning, at a fast pace and with great accuracy (Marslen-Wilson, 1973). The garden path effect, for example, shows that language comprehension proceeds incrementally, with interpretations committed to before a complete syntactic analysis is available (Frazier and Rayner, 1982; Altmann and Steedman, 1988; Trueswell et al., 1994, inter alia).
The notion of order along the time axis during computation is a key aspect of incremental processing and thus a desirable property both of cognitively plausible language encoders and of applications such as interactive systems (Skantze and Schlangen, 2009). RNNs, for example, are inherently able to process words sequentially while updating a recurrent state representation. However, the Transformer architecture (Vaswani et al., 2017), which has brought significant improvements on several NLP tasks, processes the input sequence as a whole, prioritising parallelisation to the detriment of the notion of linear order.
One way to employ non-incremental models in incremental settings is to resort to an incremental interface, as in Beuck et al. (2011), where the available partial input is completely recomputed at each time step to deliver partial output. Madureira and Schlangen (2020) examined the output stability of non-incremental encoders used in this restart-incremental fashion. While qualitatively feasible, this procedure is computationally costly, especially for long sequences, since it requires as many forward passes as there are input tokens.
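To make the procedure concrete, here is a minimal sketch of restart-incrementality, assuming a hypothetical full-sequence tagger `model` that maps a token list to one label per token (the name and signature are illustrative, not the actual system):

```python
def restart_incremental(model, tokens):
    """Run a non-incremental model in restart-incremental mode:
    recompute the output from scratch for every input prefix."""
    partial_outputs = []
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]             # input available at time step t
        labels = model(prefix)          # full forward pass over the prefix
        partial_outputs.append(labels)  # earlier labels may be revised
    return partial_outputs              # n forward passes for n tokens
```

Since the t-th step re-encodes all t tokens seen so far, the total computation grows quadratically (or worse) with sequence length.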
In parallel, there is ongoing research on ways to make Transformers more efficient, e.g. the Linear Transformer (LT) introduced by Katharopoulos et al. (2020). Besides being more efficient, LTs can be employed with a recurrence mechanism, based on causal masking, that turns them into models similar to RNNs. In this work, we examine the suitability of LTs for incremental processing in sequence tagging and classification in English. We also inspect the use of the delay strategy (Baumann et al., 2011; Oda et al., 2015; Ma et al., 2019) to examine the effect of right-context availability on the model's incremental performance. Our hypothesis is that recurrence will make LTs better at incremental processing, as it captures sequence order. As LTs use an approximation of softmax attention, we also expect a performance drop compared to the standard Transformer, alongside faster inference in the incremental setting due to their linear-time attention.

Related Work
In recent years, neural approaches, including Transformer-based architectures (Vaswani et al., 2017), have become more popular for incremental processing. Given that Transformer models are not inherently incremental, employing them for incremental processing demands adaptation.
In simultaneous translation, for instance, Ma et al. (2019) proposed an incremental encoder that limits each source word to attending to its predecessors, recomputing the representations of previous source words when new input arrives. Zhang et al. (2021) introduced an average embedding layer to avoid such recomputation while exploiting right context through knowledge distillation. An investigation of the use of non-incremental encoders for incremental NLU in interactive systems was conducted by Madureira and Schlangen (2020). The authors employed BERT (Devlin et al., 2019) for sequence tagging and classification using restart-incrementality, a procedure with high computational cost.
The computational cost of a restart-incremental Transformer can be reduced with more efficient models, or avoided entirely if an inherently incremental Transformer architecture existed. Recent work has proposed modifications that could help achieve this, for instance by approximating the softmax attention with a recurrent state (Katharopoulos et al., 2020; Choromanski et al., 2021; Peng et al., 2021). The Linear Transformer model (Katharopoulos et al., 2020, LT henceforth) can be viewed as an RNN when the attention is causal (see also, very recently, Kasai et al., 2021).

Overview of the Linear Transformer
In LTs, the similarity score between a query and a key is computed using a kernel function. For the $i$-th position, the causal attention can be written as

$$\mathrm{Attn}(Q, K, V)_i \;=\; \frac{\phi(Q_i)^\top \sum_{j=1}^{i} \phi(K_j)\, V_j^\top}{\phi(Q_i)^\top \sum_{j=1}^{i} \phi(K_j)},$$

with feature map $\phi(x) = \mathrm{elu}(x) + 1$, where $\mathrm{elu}(\cdot)$ denotes the exponential linear unit (Clevert et al., 2016). Hence, $S_i$ and $Z_i$ can be viewed as a recurrent state:

$$S_i = S_{i-1} + \phi(K_i)\, V_i^\top, \qquad Z_i = Z_{i-1} + \phi(K_i),$$

with $S_0 = Z_0 = 0$, so that the attention output at position $i$ is $\phi(Q_i)^\top S_i \,/\, \phi(Q_i)^\top Z_i$. As an RNN, the run-time complexity is linear with respect to the sequence length and constant for each added token, which promises faster inference compared to the restart-incremental approach.
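As a minimal illustration, the recurrence above can be unrolled one token at a time with constant work per step. This is a NumPy sketch of the computation, not the authors' or the original LT implementation:

```python
import numpy as np

def phi(x):
    # Feature map phi(x) = elu(x) + 1 (positive-valued),
    # per Katharopoulos et al. (2020); clamped to avoid overflow in exp.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention(Q, K, V, eps=1e-6):
    """Causal linear attention unrolled as an RNN.

    Q, K: (n, d_k) queries/keys; V: (n, d_v) values.
    Returns (n, d_v) outputs; each step does O(d_k * d_v) work.
    """
    n, d_v = V.shape
    d_k = K.shape[1]
    S = np.zeros((d_k, d_v))   # S_i = S_{i-1} + phi(K_i) V_i^T
    Z = np.zeros(d_k)          # Z_i = Z_{i-1} + phi(K_i)
    out = np.zeros((n, d_v))
    for i in range(n):
        q = phi(Q[i])
        S += np.outer(phi(K[i]), V[i])
        Z += phi(K[i])
        out[i] = (q @ S) / (q @ Z + eps)
    return out
```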

Models
We examine the behaviour of Transformer models used as incremental processors at the token level, in five configurations (Table 1): 1. Baseline: the standard Transformer encoder incrementalised via restart-incrementality, trained with access to full sequences.
2. LT: the LT encoder incrementalised via restart-incrementality, trained with access to full sequences.
3. LT+R: the LT encoder trained as in (2), but at test time we use its recurrent state vector to predict the label at each time step, as in an RNN.
4. LT+R+CM: the LT encoder trained with causal masking to ensure each token representation can only attend to previous tokens.
During inference, we convert the model to an RNN as in (3). Training with input prefixes aims at encouraging the learning of intermediate structures (Köhn and Menzel, 2014) and the anticipation of future output (Ma et al., 2019).
5. LT+R+CM+D: similar to (4), but, during training, the output for the input token x_t is obtained at time t + d, where d ∈ {1, 2} is the delay, following the approach in Turek et al. (2020); a rough sketch of this target shifting is given after this list. There is evidence that additional right context improves a model's incremental performance (Baumann et al., 2011; Ma et al., 2019; Madureira and Schlangen, 2020), resulting in a trade-off between providing timely output and waiting for more context to deliver more stable output.
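A minimal sketch of how such delayed targets could be constructed, under our reading of the strategy (`PAD` and the end-of-sequence padding are illustrative assumptions, not the paper's code):

```python
PAD = -100  # a label index ignored by the loss (e.g. by PyTorch cross-entropy)

def delay_example(tokens, labels, d, eos="<eos>"):
    """Shift targets by d steps: the prediction made after reading
    token x_{t+d} is scored against the gold label of token x_t."""
    inputs = tokens + [eos] * d    # d extra steps to label the last d tokens
    targets = [PAD] * d + labels   # first d predictions have no target yet
    return inputs, targets
```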
We also delay the output by 1 and 2 time steps for the baseline and LT, following Madureira and Schlangen (2020), to provide a fair comparison on the incremental metrics. Note that the outputs of both (1) and (2) are non-monotonic, as labels can be reassigned when a new input token is observed. The other models, being RNNs, deliver monotonic output for sequence tagging. A slight modification is needed for sequence classification, as each sequence is mapped to a single label: we average the hidden representations at the last layer and project the result linearly, followed by a softmax, to obtain the sequence label ŷ_t based on the input consumed until time t. For the LT+ models, we use incremental averaging to avoid recomputation, as sketched below. In this way, sequence classification is performed similarly for all models.
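The incremental average can be maintained as a running mean, so no past hidden state needs to be revisited; a minimal sketch (names are illustrative):

```python
import numpy as np

def incremental_average(prev_mean, h_t, t):
    """Update the mean of hidden states h_1..h_t from the mean of
    h_1..h_{t-1} in O(d) time, without storing past states (t >= 1)."""
    return prev_mean + (h_t - prev_mean) / t

# At each step, the partial sequence label is read off the running mean,
# e.g. y_t = softmax(W @ mean_t + b) for a linear classification head.
```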

Experimental Setup

Datasets
We evaluate our models on 9 datasets in English, which were also used in Madureira and Schlangen (2020). The tasks comprise sequence tagging, e.g. slot filling (ATIS, Hemphill et al. (1990); Dahl et al. (1994)), and sequence classification, e.g. sentiment classification (Pros/Cons, Ganapathibhotla and Liu (2008)). More details are available in the Appendix.

Evaluation
The overall performance of the models is measured with accuracy and F1 Score, according to the task.For the incremental evaluation, we report the diachronic metrics proposed by Baumann et al. (2011) and adapted in Madureira and Schlangen (2020): edit overhead (EO, the proportion of unnecessary edits over all edits), correction time score (CT, the average proportion of time steps necessary to reach a final decision), and relative correctness (RC, the proportion of time steps in which the output is a correct prefix of the final, non-incremental output).
To focus on the incremental quality of the models and allow a clear separation between incremental and non-incremental evaluation, we follow the approach of Baumann et al. (2011) and Madureira and Schlangen (2020), evaluating incremental outputs with respect to the final output produced by the models. While the final output may differ from the gold standard, it serves as the target for the incremental output, as the non-incremental performance is an upper bound for incremental processing (Baumann et al., 2011).
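For concreteness, here is a self-contained sketch of how EO and RC could be computed for one sequence-tagging example, under our reading of the definitions above (the exact formulations in the cited papers may differ in details; CT, which tracks per-token decision times, is omitted here):

```python
def edit_overhead_and_rc(partial_outputs):
    """EO and RC for one tagging example, where partial_outputs[t] is
    the label list emitted after consuming token t+1 (last = final)."""
    final = partial_outputs[-1]
    n = len(final)

    # Count edits: each label addition or substitution is one edit;
    # exactly n additions are necessary, the remainder are unnecessary.
    edits = 0
    prev = []
    for curr in partial_outputs:
        edits += sum(1 for i, lab in enumerate(curr)
                     if i >= len(prev) or prev[i] != lab)
        prev = curr
    eo = (edits - n) / edits if edits else 0.0

    # RC: share of time steps whose output is a correct prefix of the final.
    rc = sum(out == final[:len(out)] for out in partial_outputs) \
        / len(partial_outputs)
    return eo, rc
```

A strictly monotonic tagger (one new label per step, never revised) yields EO = 0 and RC = 1 by construction.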

Implementation
We re-implement the Transformer and use the original implementation of the LT. All models are trained to minimise cross-entropy with the AdamW optimiser (Loshchilov and Hutter, 2019). We use 300-D GloVe embeddings (Pennington et al., 2014), which are passed through a linear projection layer of size d_model. All experiments were performed on a GeForce GTX 1080 Ti GPU. Details on the implementation, hyperparameters and reproducibility are available in the Appendix. Our implementation is publicly available.

Results
Ultimately, what matters is the quality of the output once all input has been seen; hence, we first look at the non-incremental, full-sequence performance in Table 2. We can see that LTs do not outperform the baseline here, although they have the advantage of being faster (Table 3). We see two possible reasons for this. First, as the LT variants strictly go left-to-right through the sequence, they have less information available for each token when making a decision. This can be alleviated by allowing LT+R+CM to wait for 1 or 2 tokens before producing partial output, and indeed we see an overall performance increase of 0.1% to 16.5% for the +D variants. Second, we suspect that the feature map chosen in LTs to approximate the softmax attention is sub-optimal, and a further gating mechanism could yield better performance (Peng et al., 2021).
Comparing LT+R+CM against LT+R, we observe that training on prefixes yields better results at test time, as LT+R+CM may learn anticipation as a by-product, in line with the work of Ma et al. (2019). LT+R+CM+D performs competitively with LT, outperforming the latter on 4 out of 9 datasets. This is likely due to the delayed network's better capability of modelling both non-linear and acausal functions that appear in some of the tasks (Turek et al., 2020).
Figure 1 depicts the incremental metrics for all models. The EO and CT scores for sequence tagging are low for all models, indicating that the models are generally capable of producing stable and accurate partial outputs. Note that the LT+ models are unable to revise their output in sequence tagging. For sequence classification, the models have higher EO and CT scores, since the label is a function of the whole sequence and the model may be unable to reach an early decision without enough right context. LT+R+CM performs better on the incremental metrics than the baseline and LT in sequence classification. This is evidence that the notion of order is important for incremental processing, as the recurrent state in the LT allows partial representation updates along the time axis when processing partial input. Recall that sequence classification is treated in the same fashion for all models, using the average of the hidden representations in the last layer.
All the models generally have high RC scores for both sequence tagging and classification. This means that most partial outputs are a correct prefix of the final ("non-incremental") output and could fruitfully be used as input to subsequent processors in an incremental pipeline. For RC, LT+R+CM also outperforms both the baseline and LT on all tasks. A delay of 1 or 2 tokens before committing to an output also helps to improve the incremental performance across all models. In terms of incremental inference speed, the recurrent mode is more than 10 times faster than restart-incrementality (Table 3).
To understand the models' behaviour better, especially pertaining to their potential for real-time applications, we also examine their incremental inference speed for different sequence lengths, as shown in Figure 2. As expected, LT+R+CM scales linearly and outperforms the baseline and LT considerably as sequences become longer. The run-time performance of LT is slightly better than that of the baseline because of its linear-time attention; however, it is still slower than LT+R+CM, as it is restart-incremental.

Ablations
We examine the importance of word and positional embeddings for the baseline and LT+R+CM on the non-incremental metrics (Table 4). We find that using pre-trained GloVe embeddings (Pennington et al., 2014) is beneficial for the models' performance: on average, they contribute 2.74 points in accuracy and 5.16 in F1 for the baseline, and improve LT+R+CM by 1.6 points in accuracy and 2.55 in F1. On the other hand, we observe that positional embeddings play a less significant role for LT+R+CM than for the baseline. Without them, LT+R+CM's average accuracy improves by 0.15 points while its F1 score degrades by 1.35; the baseline, however, degrades by 1.79 accuracy points and 18.46 F1 points on average. The recurrence mechanism may be the reason why the effect of positional embeddings is less pronounced in LT+R+CM.

Conclusion
We studied the use of Transformer encoders for incremental processing and concluded that it is possible to deploy them as incremental processors, with certain trade-offs. With recurrent computation, the Linear Transformer (LT) has inferior non-incremental performance compared to the standard Transformer and to the LT with restart-incrementality. However, it has the great advantage of being much more efficient for incremental processing, since recomputation at each time step is avoided. The output of the recurrent LT is generally more stable for sequence classification and monotonic for tagging. Its non-incremental performance drop can be mitigated by introducing delay, which also improves the incremental metrics. It is also beneficial to train such models with input prefixes, allowing them to learn more robust predictions.

A Reproducibility
We describe in more detail the hyperparameters and implementation of our models.
We use only the WSJ section of OntoNotes, with splits following Pradhan et al. (2013). For the Pos/Neg and Pros/Cons datasets, no official split is available, so we split them randomly into 70% train, 10% validation, and 20% test. We removed sentences longer than 200 words, as they were computationally infeasible to process. We use the preprocessed data and splits for SNIPS and ATIS made available by E et al. (2019).

Training details
Our models are trained for 50 epochs, using early stopping with a patience of 10 and dropout of 0.1. For AdamW (Loshchilov and Hutter, 2019), we use β₁ = 0.9 and β₂ = 0.98. The learning rate is increased for the first 5 epochs; after 30, 40, and 45 epochs, it is decayed by a factor of 0.5. Xavier initialisation (Glorot and Bengio, 2010) is applied to all parameters. The number of attention heads is set to 8, each with dimension d/8, where d is the self-attention dimension. We also apply label smoothing (Szegedy et al., 2016) with ε = 0.1 for sequence classification, to make the model more robust for incremental processing. To handle OOV words, we randomly replace tokens with an "UNK" token with p = 0.02 during training and use it at test time (Žilka and Jurčíček, 2015). We perform hyperparameter search using Comet's Bayesian search algorithm, maximising F1 score for sequence tagging and accuracy for sequence classification on the validation set. Hyperparameter search is limited to 20 trials for all experiments. The hyperparameters found for LT were also used for LT+R, and likewise for LT+R+CM and LT+R+CM+D. We set the seed to 42119392 for all of our experiments.
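As an illustration of the schedule (the exact warm-up shape is our assumption; only the milestones and the decay factor are stated above):

```python
def lr_at_epoch(epoch, base_lr):
    """Learning rate for a given 1-based epoch: warm-up over the first
    5 epochs (assumed linear), then halve after epochs 30, 40 and 45."""
    lr = base_lr * min(1.0, epoch / 5)
    for milestone in (30, 40, 45):
        if epoch > milestone:
            lr *= 0.5
    return lr

# e.g. base_lr = 1e-4: epoch 1 -> 2e-5, epoch 5 -> 1e-4,
# epoch 31 -> 5e-5, epoch 41 -> 2.5e-5, epoch 46 -> 1.25e-5
```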

Figure 1:
Incremental evaluation on the test sets. EO, CT and RC ∈ [0, 1]; y-axes are clipped to improve readability. Lower is better for EO and CT, higher for RC. For EO, the lines on the bars refer to original, delay=1 and delay=2, from top to bottom, and vice versa for RC, showing that delay improves the results. LT+R+CM performs better than the baseline and LT.

Figure 2:
Incremental inference speed of the models from Table 3 with increasing sequence length. LT+R+CM scales linearly with sequence length, unlike the baseline and LT. Note that the incremental inference speed of LT+R+CM is similar to that of LT+R.

Table 1:
Overview of the Transformer models.* means we perform further comparisons with a delayed variant.

Table 2:
Non-incremental performance of our models on test sets (first group, F1; second group, accuracy). Here, the baseline performs generally better than the LT variants.

Table 3:
Comparison of incremental inference speed on test sets, measured in sequences/sec.All the models have similar size with 4 layers, feed-forward dimension of 2048 and self-attention dimension of 512.

Table 4:
Ablation of GloVe and positional embeddings on the baseline and LT+R+CM for non-incremental metrics.

Table 5:
Hyperparameter search space.We use the same search space for all of our models.

Table 6:
Average sequence length on test sets for each task.

Table 7:
Hyperparameters used for our experiments.The best configuration for LT was also used for LT+R, while the best configuration for LT+R+CM was also used for LT+R+CM+D1 and LT+R+CM+D2.

Table 8:
Number of parameters for each model.

Table 9:
Tasks, datasets and their size.