Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification

Transformer-based models have achieved dominant performance in numerous NLP tasks. Despite their remarkable successes, pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism that interacts with all tokens, including ones unfavorable to classification performance. To overcome these challenges, we propose integrating two strategies: token pruning and token combining. Token pruning eliminates less important tokens in the attention mechanism's key and value as the sequence passes through the layers. Additionally, we adopt fuzzy logic to handle uncertainty and alleviate the mispruning risk that arises from an imbalanced distribution of token importance. Token combining, on the other hand, condenses input sequences into smaller sizes to further compress the model. By integrating these two approaches, we not only improve the model's performance but also reduce its computational demands. Experiments on various datasets demonstrate superior performance compared to baseline models, with the largest improvement over the existing BERT model reaching +5%p in accuracy and +5.6%p in F1 score. Additionally, memory cost is reduced to 0.61x, and a speedup of 1.64x is achieved.


Introduction
Transformer-based deep learning architectures have achieved dominant performance in numerous areas of natural language processing (NLP) (Devlin et al., 2018; Lewis et al., 2019; Brown et al., 2020; Yang et al., 2019). In particular, pre-trained transformer-based language models like BERT (Devlin et al., 2018) and its variants (Yasunaga et al., 2022; He et al., 2020; Guo et al., 2020) have demonstrated state-of-the-art performance on many NLP tasks. The self-attention mechanism, a key element in transformers, allows for interactions between every pair of tokens in a sequence, effectively capturing contextual information across the entire sequence. This mechanism has proven particularly beneficial for text classification tasks (Yang et al., 2020; Karl and Scherp, 2022; Munikar et al., 2019).
Despite their effectiveness, BERT and similar models still face major challenges. BERT can be destructive in the sense that not all tokens contribute to the final classification prediction (Guan et al., 2022). Not all tokens are attended to in multi-head self-attention, and uninformative or semantically meaningless parts of the input may not have a positive impact on the prediction (Liang et al., 2022). Further, the self-attention mechanism, which involves interaction among all tokens, suffers from substantial computational costs. Its quadratic complexity with respect to the length of the input sequence results in high time and memory costs, making training impractical, especially for document classification (Lee et al., 2022; Pan et al., 2022). In response to these challenges, many recent studies have attempted to address computational inefficiency and improve model ability by focusing on a few core tokens, thereby reducing the number of tokens that need to be processed. Their intuition mirrors human reading comprehension, which is achieved by paying closer attention to important and interesting words (Guan et al., 2022).
One approach is a pruning method that removes redundant tokens. Studies have shown an acceptable trade-off between performance and cost by simply removing tokens from the entire sequence to reduce computational demands (Ma et al., 2022; Goyal et al., 2020; Kim and Cho, 2020). However, this method causes information loss, which degrades the performance of the model (Wei et al., 2023). Unlike previous studies, we apply pruning to remove tokens from the keys and values of the attention mechanism, preventing this information loss while still reducing cost. In our method, less important tokens are removed, and the number of tokens gradually decreases by a certain ratio as the sequence passes through the layers. However, there is still a risk of mispruning when the distribution of importance is imbalanced, since ratio-based pruning does not take the importance distribution into account (Zhao et al., 2019). To address this issue, we adopt fuzzy logic, utilizing fuzzy membership functions to reflect the uncertainty and support token pruning.
However, the trade-off between performance and cost limits the number of tokens that pruning can remove; hence, self-attention operations may still require substantial time and memory resources. For further model compression, we propose a token combining approach. Another line of prior work (Pan et al., 2022; Chen et al., 2023; Bolya et al., 2022; Zeng et al., 2022) has demonstrated that combining tokens can reduce computational costs and improve performance in various computer vision tasks, including image classification, object detection, and segmentation. Motivated by these studies, we aim to compress text sequence tokens. Since text, unlike images, lacks spatial locality, we explore Slot Attention (Locatello et al., 2020), which can bind any object in the input. Instead of discarding tokens from the input sequence, we combine input sequences into a smaller number of tokens by adapting the Slot Attention mechanism. By doing so, we decrease the memory and time required for training while also minimizing the loss of information.
In this work, we propose to integrate token pruning and token combining to reduce the computational cost while improving document classification capabilities. During the token pruning stage, less significant tokens are gradually eliminated as the sequence passes through the layers. We implement pruning to reduce the size of the key and value of attention. Subsequently, in the token combining stage, tokens are merged into combined tokens. This process results in increased compression and enhanced computational efficiency.
We conduct experiments on document classification datasets from various domains, employing efficient transformer-based baseline models. Compared to the existing BERT model, the most significant improvements show an increase of 5%p in accuracy and 5.6%p in F1 score. Additionally, memory cost is reduced to 0.61x and a 1.64x speedup is achieved, accelerating training. We demonstrate that our integration yields a synergistic effect, not only improving performance but also reducing memory usage and time costs.
Our main contributions are as follows:
• We introduce a model that integrates token pruning and token combining to alleviate the expensive and destructive issues of self-attention-based models like BERT. Unlike previous works, our token pruning approach removes tokens from the attention's key and value, thereby reducing information loss. Furthermore, we use fuzzy membership functions to support more stable pruning.
• To our knowledge, our token combining approach is the first attempt to apply Slot Attention, originally used for object localization, for lightweight purposes in NLP. Our novel application not only significantly reduces computational load but also improves classification performance.
• Our experiments demonstrate the efficiency of our proposed model, as it improves classification performance while reducing time and memory costs. Furthermore, we highlight the synergy between token pruning and combining: integrating them enhances performance and reduces overall costs more effectively than using either method independently.
Related Works

Sparse Attention
In an effort to decrease the quadratic time and space complexity of attention mechanisms, sparse attention sparsifies the full attention operation, whose complexity is O(n^2), where n is the sequence length. Numerous studies have pursued sparse attention to help transformers process long sequences effectively, demonstrating strong performance, especially in document classification. Sparse Transformer (Child et al., 2019) introduces sparse factorizations of the attention matrix using a dilated sliding window, which reduces the complexity to O(n√n). Reformer (Kitaev et al., 2020) reduces the complexity to O(n log n) by using locality-sensitive hashing attention to compute the nearest neighbors. Longformer (Beltagy et al., 2020) scales complexity to O(n) by combining local window attention with task-motivated global attention, making it easy to process long documents. Linformer (Wang et al., 2020) reduces complexity to O(n) by approximating self-attention with a low-rank projection; related work likewise targets applying BERT to long texts (Ding et al., 2020).

Token Pruning and Combining
Numerous studies have explored token pruning methods that eliminate less informative and redundant tokens, yielding significant computational reductions in both NLP (Kim et al., 2022; Kim and Cho, 2020; Wang et al., 2021) and vision tasks (Chen et al., 2022; Kong et al., 2021; Fayyaz et al., 2021; Meng et al., 2022). Attention is one of the active methods used to determine the importance of tokens. For example, PPT (Ma et al., 2022) uses attention maps to identify human body tokens and remove background tokens, thereby speeding up the entire network without compromising the accuracy of pose estimation. The model whose method for determining token importance is most similar to ours is LTP (Kim et al., 2022). LTP applies token pruning to input sequences to remove less significant tokens, calculating the importance of each token from the attention score. On the other hand, DynamicViT (Rao et al., 2021) proposes a learned token selector module to estimate the importance score of each token and prune less informative tokens. Transkimmer (Guan et al., 2022) leverages a skim predictor module to dynamically prune the sequence of token hidden state embeddings. Our work can also be interpreted as a form of sparse attention that reduces the computational load of attention by pruning tokens. However, pruning mechanisms have the limitation that the removal of tokens can result in a substantial loss of information (Kong et al., 2021).
To address this challenge, several studies have explored methods for replacing token pruning. ToMe (Bolya et al., 2022) gradually combines tokens based on their similarity instead of removing redundant ones. TokenLearner (Ryoo et al., 2021) extracts important tokens from visual data and combines them using an MLP to decrease the number of tokens. F-TFM (Dai et al., 2020) gradually compresses the sequence of hidden states while preserving the ability to generate token-level representations. Slot Attention (Locatello et al., 2020) learns a set of task-dependent abstract representations, called "slots", to bind the objects in the input through self-supervision. Similar to Slot Attention, GroupViT (Xu et al., 2022) groups tokens that belong to similar semantic regions using cross-attention for semantic segmentation with weak text supervision; in contrast to GroupViT, Slot Attention extracts object-centric representations from perceptual input. Our work is fundamentally inspired by Slot Attention. To apply Slot Attention, which uses a CNN backbone, to our transformer-based model, we propose a combining module that functions similarly to the grouping block of GroupViT. TPS (Wei et al., 2023) introduces an aggressive token pruning method that divides tokens into reserved and pruned sets; instead of removing the pruned set, it is squeezed to reduce its size. TPS shares similarities with our work in that both pruning and squeezing are applied. However, while TPS integrates the squeezing process with pruning by extracting information from pruned tokens, our model performs combining and pruning independently.

Methods
In this section, we first describe the overall architecture of our proposed model, which integrates token pruning and token combining. We then introduce each compression stage in detail, including the token pruning strategy in Section 3.2 and the token combining module in Section 3.3.

Overall Architecture
Our proposed model architecture is illustrated in Figure 1. The existing BERT model (Devlin et al., 2018) consists of stacked attention blocks. We modify the vanilla self-attention mechanism by applying a fuzzy-based token pruning strategy. Subsequently, we replace one of the token-pruned attention blocks with a token combining module. Replacing a token-pruned attention block, rather than inserting an additional module, not only enhances model performance but also avoids the computational overhead of additional dot product operations. First, suppose X = {x_i}_{i=1}^{n} is a sequence of tokens from an input text with sequence length n. Given X, let E = {e_i}_{i=1}^{n} be the embedded tokens after passing through the embedding layer, where each embedded token e_i corresponds to the sequence token x_i. Additionally, we add learnable combination tokens. Suppose C = {c_i}_{i=1}^{m} is a set of learnable combination tokens, where m is the number of combination tokens. These combination tokens bind other embedded tokens through the token combining module. We simplify {e_i}_{i=1}^{n} to {e_i} and {c_i}_{i=1}^{m} to {c_i}. We concatenate {e_i} and {c_i} and use the result as input to the token-pruned attention blocks. We denote fuzzy-based token pruning self-attention by FTP_Attn, the feed-forward layers by FF, and layer normalization by LN. Letting H^l = [{e_i^l}; {c_i^l}] denote the concatenated input to the l-th layer, the operations within the token-pruned attention block follow the standard residual structure:

Ĥ^l = LN(FTP_Attn(H^l) + H^l), [{ê_i^l}; {ĉ_i^l}] = LN(FF(Ĥ^l) + Ĥ^l).

The token combining module receives {ê_i^l} and {ĉ_i^l} as input and merges {ê_i^l} into {ĉ_i^l} to output the combined tokens {c_i^{l+1}}. After the token combining module, subsequent attention blocks do not perform pruning. Finally, we obtain the sequence representation by aggregating the output tokens {r_i}; our method averages them.

Fuzzy-based Token Pruning Self-attention
We modify vanilla self-attention by implementing token pruning. Our token pruning attention mechanism gradually reduces the size of the key and value matrices by eliminating relatively unimportant embedded tokens, except for the combination tokens.
Importance Score. We measure the significance of tokens based on their importance score. For each layer and head, the attention probability Attn_prob is defined as:

Attn_prob^{l,h} = softmax(Q_p^{l,h} (K_p^{l,h})^T / sqrt(d/h)),

where l is the layer index, h is the head index, d is the feature dimension, and Q_p^{l,h}, K_p^{l,h} ∈ R^{n×(d/h)} indicate the query and key, respectively. Attn_prob(i, j) is interpreted as a similarity between the i-th token e_i and the j-th token e_j, with row index i ∈ [1, n] and column index j ∈ [1, n]. As the similarity increases, a larger weight is assigned to the value corresponding to e_j. The j-th column of Attn_prob represents the amount of attention token e_j receives from the other tokens e_i (Wang et al., 2021). Therefore, e_j is considered relatively important when it is attended to by more tokens. We define the importance score S(e_j) in layer l and head h as:

S(e_j)^{l,h} = Σ_{i=1}^{n} Attn_prob^{l,h}(i, j).

Token Preservation Ratio. After calculating the importance scores using Q_p and K_p in the l-th layer, we select the t_{l+1} embedded tokens with the highest scores. These t_{l+1} embedded tokens are then indexed into K_p and V_p in the (l+1)-th layer.
The other embedded tokens, with relatively low importance scores, are pruned as a result. We define the number of tokens that remain after token pruning in the (l+1)-th layer as:

t_{l+1} = ⌈p · t_l⌉,

where p is a hyperparameter indicating the token preservation ratio of t_{l+1} to t_l. This preservation ratio represents the proportion of tokens retained after pruning, relative to the number of tokens before pruning. As token pruning is not performed in the first layer, t_1 = n, and the attention uses all tokens in Q_p, K_p, and V_p. In the (l+1)-th layer, tokens are pruned based on S(e_j)^{l,h} with Q_p^{l,h} ∈ R^{n×(d/h)} and K_p^{l,h} ∈ R^{t_l×(d/h)}, where t_{l+1} ≤ t_l. In subsequent layers, the dimensions of K_p and V_p gradually decrease.
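The importance-score computation and ratio-based key/value pruning described above can be sketched as follows. This is an illustrative single-head NumPy re-implementation, not the authors' code; the function name prune_kv and the use of ceiling rounding for the preservation ratio are our own assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prune_kv(Q, K, V, p):
    """Prune the key/value tokens of one attention head.

    The importance score of token j is the column sum of the attention
    probability matrix, i.e. how much the other tokens attend to it.
    The top ceil(p * t) tokens are kept for the next layer's K and V.
    """
    d_h = Q.shape[-1]                              # per-head dimension d/h
    attn_prob = softmax(Q @ K.T / np.sqrt(d_h))    # (n, t)
    importance = attn_prob.sum(axis=0)             # (t,) score S(e_j)
    t_next = int(np.ceil(p * K.shape[0]))          # preservation ratio p
    keep = np.sort(np.argsort(-importance)[:t_next])
    return K[keep], V[keep]
```

For example, with p = 0.9 a 512-token key matrix shrinks to 461 rows after one layer, and the kept rows are returned in their original order.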
Fuzzy-based Token Pruning. However, simply discarding a fixed proportion of tokens based on an importance score can lead to mispruning. Especially under imbalanced distributions, this pruning strategy may remove crucial tokens while retaining unimportant ones, thereby decreasing model accuracy (Zhao et al., 2019). Insufficient training in the initial layers of the model can yield uncertain importance scores, increasing the risk of mistakenly pruning essential tokens. Furthermore, the importance score of a token is relative, and the boundary between the degrees of importance and unimportance may be unclear and uncertain. To address this challenge, we exploit fuzzy theory, which can better capture uncertainty. We employ two fuzzy membership functions to evaluate the degrees of importance and unimportance together. Inspired by previous work on fuzzy-based filter pruning in CNNs (Zhao et al., 2019), we design fuzzy membership functions Importance(S(e)) and Unimportance(S(e)) over the importance scores, where we simplify S(e_j)^{l,h} to S(e). Unlike the previous work (Zhao et al., 2019), which uses fixed constants as hyperparameters, our approach adopts the quantile functions Q_{S(e)}(0.25) and Q_{S(e)}(0.75) for a and b, respectively, to ensure robustness. We compute the quantile function over all importance scores, capturing the complete spectrum of head information. The importance set I and the unimportance set U are defined using the α-cut, commonly written A_α = {x | A(x) ≥ α} in fuzzy theory. To mitigate information loss due to imbalanced distributions, we apply token pruning with preservation ratio p only to the tokens that fall within the set (I − U)^c. In the initial layers, where attention might not be adequately trained, there is a risk of erroneously removing crucial tokens. To counteract this, we set the α for I to a minimal value of 0.01, while the α for U is empirically set to 0.9. Finally, our fuzzy-based token pruning self-attention FTP_Attn is defined as:

FTP_Attn(Q_p, K_p, V_p) = softmax(Q_p K_p^T / sqrt(d/h)) V_p,

where K_p and V_p contain only the tokens retained by the pruning procedure above.
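To illustrate the α-cut mechanism, the sketch below protects tokens in I − U from ratio-based pruning and applies the preservation ratio only to the complement (I − U)^c. The paper's exact membership functions are not reproduced here; we assume simple piecewise-linear memberships anchored at the quartiles a = Q(0.25) and b = Q(0.75), so the function shapes (and the name fuzzy_prune_mask) are our own.

```python
import numpy as np

def fuzzy_prune_mask(scores, p, alpha_imp=0.01, alpha_unimp=0.9):
    """Return a boolean keep-mask over tokens.

    Tokens in the alpha-cut set I - U are always preserved; the
    preservation ratio p is applied only to the remaining tokens,
    i.e. those in (I - U)^c. The membership functions here are an
    assumed piecewise-linear form anchored at the quartiles.
    """
    a, b = np.quantile(scores, 0.25), np.quantile(scores, 0.75)
    importance = np.clip((scores - a) / max(b - a, 1e-9), 0.0, 1.0)
    unimportance = 1.0 - importance
    I = importance >= alpha_imp            # alpha-cut of importance set
    U = unimportance >= alpha_unimp        # alpha-cut of unimportance set
    protected = I & ~U                     # tokens in I - U: never pruned
    keep = protected.copy()
    candidates = np.flatnonzero(~protected)
    n_keep = int(np.ceil(p * candidates.size))
    keep[candidates[np.argsort(-scores[candidates])[:n_keep]]] = True
    return keep
```

With p = 0.9, only a small fraction of the least important tokens (those outside I − U with the lowest scores) end up pruned in any one layer.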

Token Combining Module
The token combining module takes the token-pruned attention block's output representations {ê_i^l} and {ĉ_i^l} as inputs. The combination tokens, which are concatenated with the embedded tokens, pass through the token-pruned attention blocks to incorporate global information from the input sequence. The combination tokens then integrate the embedded tokens based on their similarity in the embedding space. Similar to GroupViT (Xu et al., 2022), our token combining module uses Gumbel-Softmax (Jang et al., 2016) to perform cross-attention between combination tokens and embedded tokens. We define the similarity matrix Sim as:

Sim_{i,j} = softmax_i((W_q LN(ĉ_i)) · (W_k LN(ê_j)) + g_i),

where the softmax is taken over the combination-token axis, LN is layer normalization, W_q and W_k are the weights of the projection matrices for the combination tokens and embedded tokens, respectively, and {g_i} are i.i.d. samples drawn from the Gumbel(0, 1) distribution. Subsequently, we implement the hard assignment technique (Xu et al., 2022), which employs a one-hot operation to determine the specific combination token to which each embedded token belongs. We define the hard assignment HA as:

HA = one-hot(argmax_i Sim) + Sim − sg(Sim),

where sg is the stop-gradient operator, which stops the accumulated gradient of its inputs. We update each combination token by computing the weighted sum of the embedded tokens assigned to it. The output of the token combining block is calculated as follows:

c_i^{out} = ĉ_i + W_o (Σ_j HA_{i,j} W_v ê_j / Σ_j HA_{i,j}),

where W_v and W_o are the weights of the projection matrices. We adopt the grouping mechanism described in GroupViT, which learns semantic segmentation by grouping output segment tokens into object classes through several grouping stages.
Our method, on the other hand, replaces one layer with a token combining module and compresses the embedded tokens into a few informative combined tokens. Empirically, we find that this approach reduces the training memory and time of the model while increasing performance.
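The combining module's forward pass can be sketched as below, modeled on GroupViT's grouping block. This is our own NumPy sketch, not the authors' code; since there are no gradients here, the straight-through form HA = one-hot + Sim − sg(Sim) reduces to a plain one-hot assignment.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def combine_tokens(C, E, Wq, Wk, Wv, Wo, rng):
    """Merge embedded tokens E (n, d) into combination tokens C (m, d).

    Cross-attention with Gumbel noise assigns each embedded token to
    exactly one combination token (hard assignment); each combination
    token is updated with the normalized weighted sum of its members.
    """
    m, n = C.shape[0], E.shape[0]
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=(m, n))))
    sim = softmax((layer_norm(C) @ Wq) @ (layer_norm(E) @ Wk).T + gumbel,
                  axis=0)                        # softmax over combination axis
    ha = np.zeros_like(sim)                      # one-hot hard assignment:
    ha[sim.argmax(axis=0), np.arange(n)] = 1.0   # one combination token per e_j
    pooled = (ha @ (E @ Wv)) / np.maximum(ha.sum(axis=1, keepdims=True), 1e-9)
    return C + pooled @ Wo                       # residual update of C
```

Note the guard on the normalizer: a combination token to which no embedded token is assigned simply keeps its residual value.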

Experiments
This section validates the effectiveness of our proposed model. First, we evaluate the document classification performance of our proposed model against the baseline models. Second, we investigate the time and memory costs of our proposed model and evaluate its efficiency. Lastly, through an ablation study, we compare the effects of different preservation ratios on fuzzy-based token pruning self-attention, and we analyze the impact of the position of the token combining module and the number of combination tokens.

Dataset
We evaluate our proposed model using six datasets across different domains with varying numbers of classes. SST-2 (Socher et al., 2013) and IMDB (Maas et al., 2011) are datasets for sentiment classification of movie reviews. BBC News (Greene and Cunningham, 2006) and 20 NewsGroups (Lang, 1995) comprise collections of public news articles on various topics. LEDGAR (Tuggener et al., 2020) includes a corpus of legal provisions in contracts and is part of the LexGLUE (Chalkidis et al., 2021) benchmark for evaluating capabilities on legal text. arXiv is a digital archive of scholarly articles from a wide range of fields, such as mathematics, computer science, and physics. We use the arXiv dataset employed by He et al. (2019) and perform classification using the abstract of each paper as input. We present more detailed statistics of the datasets in Table 1.

Experimental Setup and Baselines
The primary aim of this work is to address the issues of BERT, which can be expensive and destructive. To evaluate the effectiveness of our proposed model, we conduct experiments comparing it with the existing BERT model (bert-base-uncased). For a fair comparison, both BERT and our proposed model follow the same settings, and ours is warm-started from bert-base-uncased. Our model has the same number of layers and heads, embedding and hidden sizes, and dropout ratio as BERT.
We also compare our method to other baselines, including BigBird (Zaheer et al., 2020) and Longformer (Beltagy et al., 2020), which employ sparse attention, as well as F-TFM (Dai et al., 2020) and Transkimmer (Guan et al., 2022), which utilize token compression and token pruning, respectively. For IMDB, BBC News, and 20 NewsGroups, 20% of the training data is randomly selected for validation. During training, all model parameters are fine-tuned using the Adam optimizer (Kingma and Ba, 2014). The first 512 tokens of each input sequence are processed. The learning rate is set to 2e-5, except for the LEDGAR dataset, where we use 3e-5. We also use a linear warm-up learning rate scheduler. In all experiments, we use a batch size of 16. We choose the model with the lowest validation loss during training as the best model. We set the token preservation ratio p to 0.9.
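The fine-tuning setup above can be collected into a single configuration for reproduction; the values are taken from this section, while the dictionary key names are our own and not from the paper.

```python
# Fine-tuning configuration as described in "Experimental Setup and
# Baselines"; key names are illustrative, values are from the paper.
train_config = {
    "backbone": "bert-base-uncased",
    "max_seq_len": 512,            # first 512 tokens are processed
    "batch_size": 16,
    "optimizer": "Adam",
    "learning_rate": 2e-5,         # 3e-5 for the LEDGAR dataset
    "lr_scheduler": "linear_warmup",
    "validation_split": 0.2,       # IMDB, BBC News, 20 NewsGroups
    "model_selection": "lowest_validation_loss",
    "token_preservation_ratio": 0.9,
}

def lr_for(dataset: str) -> float:
    """Return the learning rate used for a given dataset."""
    return 3e-5 if dataset == "LEDGAR" else train_config["learning_rate"]
```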

Main Result
To evaluate the performance and efficiency of each strategy, we compare our proposed model (ours-PFC) with five baselines and three other variants: one that uses only token pruning (ours-P), one that applies fuzzy membership functions for token pruning (ours-PF), and one that uses only a token combining module (ours-C), as shown in Table 2 and Table 3.
Compared to BERT, ours-P consistently outperforms it across all datasets with higher accuracy and F1 scores, achieving approximately a 1.33x speedup with memory reduced to 0.88x. More importantly, the performance of ours-PF significantly surpasses that of ours-P, with up to 5.0%p higher accuracy, 4.4%p higher F1 (macro), and 3.8%p higher F1 (micro), at the same FLOPs and comparable memory cost. To evaluate the performance of ours-C and ours-PFC, we incorporate the combining module at the 11-th layer, which yields the optimal performance. A comprehensive discussion of performance fluctuations in relation to the location of the combining module is presented in Section 4.4. Excluding the IMDB dataset, ours-C not only achieves higher accuracy and F1 scores than BERT but also reduces memory cost to 0.89x and achieves a 1.26x speedup while reducing FLOPs. Across all datasets, our models (ours-PF or ours-PFC) consistently outperform all efficient transformer-based baseline models. Furthermore, ours-PFC outperforms BERT with up to 5.0%p higher accuracy, 5.6%p higher macro F1, and 4.8%p higher micro F1. Additionally, ours-PFC exhibits the best performance with the least time and memory required, compared to models that use the pruning or combining methodologies individually. These findings highlight the effectiveness of integrating token pruning and token combining for BERT's document classification, from both the performance and efficiency perspectives.
Subsequently, we evaluate the potential effectiveness of ours-C and ours-PFC by placing the combining module at the 7-th layer. As shown in Table 5 of Section 4.4, applying the combining module at the 7-th layer leads to further time and memory savings while mitigating the potential decrease in accuracy. Compared to BERT, it shows only a minimal decrease in accuracy (at most 0.8%p). Moreover, it reduces FLOPs and memory costs to 0.61x while achieving a 1.64x speedup. In our experiments, we find that our proposed model effectively improves document classification performance, outperforming all baselines. Even when the combining module is applied at the 7-th layer, it maintains performance similar to BERT while further reducing FLOPs, lowering memory usage, and improving speed.

Ablation study
Token Preservation Ratio. We evaluate different token preservation ratios p on the BBC News dataset, as shown in Table 4. Our findings indicate that the highest accuracy, 98.1%, is achieved when p is 0.9. Moreover, our fuzzy-based pruning strategy yields a 1.17x reduction in memory cost and a 1.12x speedup compared to the vanilla self-attention mechanism. As p decreases, we observe that performance deteriorates while time and memory costs also decrease. A smaller p removes more tokens in the fuzzy-based token pruning self-attention; while this leads to a higher degree of information loss in attention, removing more tokens results in time and memory savings.

Token Combining Module. In Table 5, we compare different positions of the token combining module on the BBC News dataset. We observe that placing the token combining module earlier within the layers results in greater speedup and memory savings. Since fewer combined tokens proceed through subsequent layers, the earlier this process begins, the greater the computational reduction that can be achieved. However, the reduced interaction between the combination tokens and the embedded tokens hinders the combination tokens from learning global information, potentially degrading model performance. Our proposed model shows the highest performance when the token combining module is placed at the 11-th layer, achieving an accuracy of 98.1%.

Combination Tokens. In Table 6, we compare the accuracy achieved with different numbers of combination tokens. We use four datasets to analyze the impact of the number of classes on the combination tokens. We hypothesized that a larger number of classes would require more combination tokens to encompass a greater range of information. However, we observe that the highest accuracy is achieved with eight combination tokens across all four datasets; with more combination tokens, performance gradually degrades. These results indicate that when the number of combination tokens is fewer than the number of classes, each combination token can represent more information as a feature vector in a 768-dimensional embedding space, similar to findings in GroupViT. Through this experiment, we find that the optimal number of combination tokens is 8. We show that our proposed model performs well in multi-class classification without adding computation and memory costs.

Conclusion
In this paper, we introduce an approach that integrates token pruning and token combining to improve document classification by addressing the expensive and destructive problems of self-attention in the existing BERT model. Our approach consists of fuzzy-based token pruned attention and a token combining module. Our pruning strategy gradually removes unimportant tokens from the key and value in attention. Moreover, we enhance the robustness of the model by incorporating fuzzy membership functions. For further compression, our token combining module reduces the time and memory costs of the model by merging the tokens in the input sequence into a smaller number of combination tokens. Experimental results show that our proposed model enhances document classification performance while reducing computational requirements by focusing on the more significant tokens.
Our findings also demonstrate a synergistic effect from integrating token pruning and token combining, techniques commonly used in object detection and semantic segmentation. Ultimately, our research provides a novel way to use pre-trained transformer models more flexibly and effectively, boosting performance and efficiency in a myriad of applications involving text data processing.

Limitations
In this paper, our goal is to address the fundamental challenges of the BERT model, including high cost and performance degradation, that hinder its application to document classification. We demonstrate the effectiveness of our proposed method, which integrates token pruning and token combining, by improving the existing BERT model. However, our model, being based on BERT, has an inherent limitation in that it can only handle input sequences with a maximum length of 512 tokens; it is therefore not suitable for processing datasets that exceed this limit. The problems arising from the quadratic computation of self-attention and the existence of redundant and uninformative tokens are not specific to BERT and are expected to intensify when processing longer input sequences. Thus, in future work we will improve other transformer-based models that can handle long sequence datasets, such as LexGLUE, and that are proficient at natural language inference tasks.

Ethics Statement
Our research adheres to ethical standards of practice. The datasets used to fine-tune our model are publicly available and do not contain any sensitive or private information. The use of open-source data ensures that our research maintains transparency. Additionally, our proposed model is built upon a pre-trained model that has been publicly released.
Our research goal aligns with an ethical commitment to conserve resources and promote accessibility. By developing a model that minimizes hardware resource requirements and time costs, we are making a valuable contribution towards a more accessible and inclusive AI landscape. We aim to make advanced AI techniques, including our proposed model, accessible and practical for researchers with diverse resource capacities, ultimately promoting equity in the field.

Figure 1 :
Figure 1: Overall architecture of our proposed model. The model architecture is composed of several token-pruned attention blocks, a token combining module, and attention blocks. (Left) Fuzzy-based Token Pruning Self-attention: in each layer, the fuzzy-based pruning method removes tokens using the importance score and fuzzy membership functions. (Right) Token Combining Module: this module apportions embedded tokens to each of the combination tokens using a similarity matrix between them.

Table 1 :
Statistics of the datasets. C denotes the number of classes in the dataset, I the number of instances, and T the average number of tokens calculated using the BERT (bert-base-uncased) tokenizer.

Table 2 :
Performance comparison on document classification. Among our proposed models, ours-P applies token pruning, ours-PF applies fuzzy-based token pruning, ours-C applies the token combining module, and ours-PFC applies both fuzzy-based token pruning and the token combining module. The best performance is highlighted in bold.

Table 3 :
Efficiency comparison on document classification.

Table 4 :
Comparisons of different token preservation ratios

Table 6 :
Performance comparison of different numbers of combination tokens. C denotes the number of classes in the dataset.