Adapter Pruning using Tropical Characterization

Adapters are widely popular parameter-efficient transfer learning approaches in natural language processing that insert trainable modules in between layers of a pre-trained language model. Apart from several heuristics, however, there has been a lack of studies analyzing the optimal number of adapter parameters needed for downstream applications. In this paper, we propose an adapter pruning approach by studying the tropical characteristics of trainable modules. We cast it as an optimization problem that aims to prune parameters from the adapter layers without changing the orientation of underlying tropical hypersurfaces. Our experiments on five NLP datasets show that tropical geometry tends to identify more relevant parameters to prune when compared with the magnitude-based baseline, while a combined approach works best across the tasks.


Introduction
With the increase in network sizes, the memory and computational demands of the models needed to solve a given task are ever-increasing.
To tackle this, model compression techniques (Cheng et al., 2017) are becoming increasingly popular; they retain the most important learning from the full model while reducing the size of the network, either by pruning or by distillation.
Transfer learning approaches, such as adapters (Houlsby et al., 2019), are a parameter-efficient alternative to full model fine-tuning which obviates the need to maintain a task-specific copy of the base language model (LM). Adapters insert simple modules in between layers of an LM to adapt the pre-trained representation for a given downstream NLP task. However, there is a lack of research on pruning adapter modules to further enhance their parameter efficiency. We hypothesize that adapter weights can be pruned significantly without compromising the performance observed in the unpruned state; this motivates the proposed approach.
In this work, we propose a novel approach to pruning adapter layers without any iterative fine-tuning of the model parameters on downstream tasks. Using tropical algebra, we study the (duals of) hypersurfaces generated by adapter modules in the high-dimensional space. As a pruning objective, we aim to minimize the magnitude of adapter weights while constraining the change in hypersurface geometry to be small. Related work includes adapter pruning using the lottery ticket hypothesis (Wu et al., 2022; Frankle and Carbin, 2018), which performs iterative pruning: a few gradient steps, prune, and reset the parameters to initial weights. Rücklé et al. (2020) drop adapters from the lower transformer layers. While these works are interesting, we provide a more concrete angle to prune adapter layers: prune by preserving the hypersurface geometry. We extend an insightful analysis of the tropical geometry of neural networks (Zhang et al., 2018; Alfarra et al., 2022) to adapters.

Background
Adapter Operations. We use the adapter setup proposed by Pfeiffer et al. (2020) that inserts small modules after the FFN add and layer norm sub-layer. The adapter transforms a hidden state h ∈ R^d as h + W_u max{W_d h, 0}, where W_d ∈ R^{r×d} and W_u ∈ R^{d×r} are the down- and up-projection matrices, and typically r < d.
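As a concrete sketch of the adapter computation above (a minimal illustration; the dimension values are arbitrary, and the residual connection follows the standard bottleneck setup):

```python
import numpy as np

def adapter_forward(h, W_d, W_u):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add.

    h   : (d,) hidden state from the add & layer norm sub-layer
    W_d : (r, d) down-projection; W_u : (d, r) up-projection; r < d.
    """
    z = np.maximum(W_d @ h, 0.0)   # ReLU bottleneck of width r
    return h + W_u @ z             # residual connection back to width d

rng = np.random.default_rng(0)
d, r = 8, 2                        # typically r is much smaller than d
h = rng.standard_normal(d)
W_d = rng.standard_normal((r, d))
W_u = rng.standard_normal((d, r))
out = adapter_forward(h, W_d, W_u)
assert out.shape == (d,)
```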
Tropical Arithmetic. Tropical algebra is a variant of classical algebra where the basic arithmetic operations are redefined. The tropical sum ⊕ of two numbers is their maximum, and the tropical product ⊙ is classical addition: x ⊕ y = max{x, y}, x ⊙ y = x + y. For example, 2 ⊕ 5 = 5 and 2 ⊙ 5 = 7. Axioms and the order of arithmetic operations in tropical algebra follow the classical ones; thus addition is commutative and multiplication is distributive over addition. We relegate detailed discussions of tropical algebra, polynomials, and hypersurfaces to Appendix A.
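The two operations can be sketched directly in code (a minimal illustration):

```python
def trop_add(x, y):   # tropical sum: the maximum
    return max(x, y)

def trop_mul(x, y):   # tropical product: classical addition
    return x + y

assert trop_add(2, 5) == 5   # 2 ⊕ 5 = 5
assert trop_mul(2, 5) == 7   # 2 ⊙ 5 = 7
# distributivity: x ⊙ (y ⊕ z) = (x ⊙ y) ⊕ (x ⊙ z)
x, y, z = 1.5, -2.0, 4.0
assert trop_mul(x, trop_add(y, z)) == trop_add(trop_mul(x, y), trop_mul(x, z))
```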
Notation. Henceforth, we denote W_d, W_u, and h by A, B, and x, respectively.

Tropical Adapter Pruning
Given a frozen language model adapted to a specific task using adapter layers, we divide our approach into two steps: 1) finding the adapter weights P_T that are crucial to preserving the tropical adapter hypersurface, by solving a simple optimization problem; 2) pruning the adapter weights with the least magnitudes that do not lie in P_T. Next, we describe step 1, which is core to the pruning method. A bottleneck adapter block can be expressed by f(x) = B max{Ax, 0}. Since f(x) in itself is not a tropical polynomial and thus does not form a tropical surface, we rewrite it as the difference between two tropical polynomials, f(x) = H(x) − Q(x), following the analysis of tropical rational functions by Alfarra et al. (2022). Thus we focus on a relatively lenient problem, i.e., identifying weights that preserve the tropical hypersurfaces defined by H(x) and Q(x). Let H(x) and Q(x) be the respective hypersurfaces; one can choose a sparse set of Â, B̂ that belongs to the set of matrices obtained by solving the following optimization problem:

min_{Â, B̂} d(H(x), Ĥ(x)) + d(Q(x), Q̂(x)),    (1)

where d(•) defines the distance between two geometric objects, and Ĥ and Q̂ are the hypersurfaces obtained by substituting A and B with Â and B̂ in f(x).
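The difference-of-tropical-polynomials view can be checked numerically. One simple illustrative split takes H and Q from the positive and negative parts of B, with B = B⁺ − B⁻, which makes both pieces convex piecewise-linear; this is a sketch, and the paper's exact decomposition follows Alfarra et al. (2022):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 6, 3
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))
B_pos, B_neg = np.maximum(B, 0), np.maximum(-B, 0)   # B = B_pos - B_neg

def f(x):  return B @ np.maximum(A @ x, 0)           # adapter body B max{Ax, 0}
def H(x):  return B_pos @ np.maximum(A @ x, 0)       # convex piecewise-linear part
def Q(x):  return B_neg @ np.maximum(A @ x, 0)       # convex piecewise-linear part

x = rng.standard_normal(d)
assert np.allclose(f(x), H(x) - Q(x))                # f = H - Q holds pointwise
```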
In place of preserving the orientation of H(x) and Q(x), we aim to preserve the orientation of their respective dual objects, denoted by δ(H(x)) and δ(Q(x)). Without loss of generality, we assume the down-projection is bias-free; δ(•) can then be expressed in terms of the generator matrices G of the zonotopes obtained from A and B. To find sparse Â, B̂, we introduce sparsity regularization terms in the optimization function. Thus, finding adapter weights that preserve the hypersurface geometry can be cast as the following optimization problem:

min_{Â, B̂} d(δ(H(x)), δ(Ĥ(x))) + d(δ(Q(x)), δ(Q̂(x))) + λ1 ‖Â‖1 + λ2 ‖B̂‖1,    (2)

where the distances are expressed in terms of the zonotope generators G1 and G2. We provide a derivation of the above function in Appendix B. It is important to note that in the pruning phase, we do not iteratively fine-tune adapter or LM parameters on the downstream task.
Algorithm 1: Tropical Adapter Pruning

Given an adapter module, Algorithm 1 finds the minimizers Â and B̂ by performing gradient-descent-based updates over two loss terms expressed in terms of the generators G1 and G2; this gradient descent solves the optimization problem in Equation (2) and is not to be confused with the gradient descent used to learn model parameters. T and r denote the maximum number of gradient steps and the number of rows in A (equivalently, columns in B), respectively. η ∈ R+ is the step size, and λ1, λ2 ∈ R+ indicate the importance of pruning relative to the shift in generators. We employ layer-wise pruning of the network without any iterative fine-tuning on downstream tasks. We find the p% parameters with the smallest magnitudes in {A, B} and {Â, B̂} separately, denoted by P_S and P_T, respectively. We denote by Tropical the setting that prunes only those parameters in P_T which are also present in P_S; the final percentage of pruned parameters is therefore at most p%. We compare this approach with the baseline that prunes the p% smallest-magnitude parameters from the layer, denoted by Standard. Combined chooses whichever of Tropical or Standard gives better results on the development set. We omit a comparison with the AdapterDrop method since, even at 50% pruning, it shows a significant drop in performance. Standard inherently tests the validity of magnitude-based pruning via the lottery ticket hypothesis (Wu et al., 2022), but without iterative retraining of adapter parameters; we do not iteratively fine-tune adapter parameters on the downstream task. The proposed method is agnostic to downstream tasks, models, and the learning algorithm used to train them. Thus, the framework is related to, but not directly comparable with, model L0 sparsification (Louizos et al., 2017) and low-rank compression (Idelbayev and Carreira-Perpinán, 2020).
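For illustration, a minimal NumPy sketch of the pruning procedure, assuming a squared-distance objective between zonotope generators (rows of A weighted by the positive/negative parts of B) plus an l1 sparsity term. To keep the sketch short we hold B̂ fixed at B and update only Â with subgradient steps, and we prune only A; these simplifications, and all names here, are illustrative rather than the paper's exact procedure:

```python
import numpy as np

def tropical_prune(A, B, p=0.5, steps=300, eta=5e-3, lam=1e-2):
    """Sketch of Algorithm 1: find a sparse A_hat whose zonotope generators
    stay close to those of A, then prune the intersection of the
    smallest-magnitude sets P_S (from A) and P_T (from A_hat)."""
    Bp, Bn = np.maximum(B, 0), np.maximum(-B, 0)
    w = (Bp**2 + Bn**2).sum(axis=0)                 # per-generator weights
    A_hat = A.copy()
    for _ in range(steps):
        grad = 2 * w[:, None] * (A_hat - A)         # generator-shift term
        grad += lam * np.sign(A_hat)                # l1 sparsity subgradient
        A_hat -= eta * grad

    def smallest_mask(M, frac):                     # True where |entry| is among
        k = int(frac * M.size)                      # the frac smallest magnitudes
        thresh = np.sort(np.abs(M).ravel())[k - 1] if k > 0 else -np.inf
        return np.abs(M) <= thresh

    P_S = smallest_mask(A, p)                       # standard magnitude set
    P_T = smallest_mask(A_hat, p)                   # tropical set
    A_pruned = A.copy()
    A_pruned[P_S & P_T] = 0.0                       # prune only the intersection
    return A_pruned, A_hat

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 16))                    # down-projection (r x d)
B = rng.standard_normal((16, 4))                    # up-projection (d x r)
A_pruned, A_hat = tropical_prune(A, B, p=0.5)
```

Because only the intersection P_S ∩ P_T is zeroed, the pruned fraction never exceeds the target p.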

Experiments
We set up a RoBERTa-base (Liu et al., 2019) with one adapter module inserted in each layer after the add and layer norm sub-layer. We follow the adapter configuration from Pfeiffer et al. (2020). For the pruning analysis, we consider three tasks: Emotion Recognition in Conversations (ERC), Natural Language Inference (NLI), and Text Classification (TC). For ERC, we use MELD; the task is to classify the emotion of an utterance given past utterances. Keeping the current utterance first, we append the past seven utterances in reverse order (Bhardwaj et al., 2022b). For NLI, we use the SNLI dataset (Bowman et al., 2015); we append the premise and hypothesis separated by the special token <s>. For the TC task, we use three datasets: IMDB (Maas et al., 2011), Rotten Tomatoes RT (Pang and Lee, 2005), and TREC (Li and Roth, 2002). Separately, we pre-train adapters on the downstream tasks with batch size 32, a learning rate of 0.001, and 1000 steps, with evaluation every 100 steps using a development set. The evaluation metric is macro F1 for ERC and accuracy for all other tasks. We set the pruning percentage p ∈ {98%, 96%, ..., 2%}. Table 1 shows the test performance of the networks against the percentage of adapter parameters retained, i.e., (100 − p)%, represented in black-bold fonts. We observe that both Standard and Tropical can prune more than 60% of the adapter parameters with only a small drop in performance with respect to the full-module performance (FM). Moreover, we notice Tropical outperforms Standard in eight out of nine pruned model states on MELD, six out of nine on SNLI, eight out of nine on RT and IMDB, and six out of nine on TREC. Across the 45 combinations of tasks and pruning fractions, except for two settings, we observe the tropical-geometry-based Combined approach outperforms the other two, denoted in red font.
Next, we study tropical pruning in different scenarios: class-blind, class-uniform, and node-wise (See et al., 2016). In class-blind (CB), all the parameters of the adapters are pooled together and the p% smallest-magnitude weights and biases are pruned. In class-uniform (CU), we prune the p% smallest-magnitude parameters of each adapter layer separately; we also refer to this as layer-wise pruning. In node-wise (CN) pruning, we prune the p% smallest-magnitude parameters of each node (considering both weights and biases).
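The three magnitude-pruning scopes can be sketched as keep-masks over a list of layer weights (a minimal illustration; here a "node" is interpreted as a row of the weight matrix, and the layer shapes are arbitrary):

```python
import numpy as np

def prune_masks(layers, p):
    """Return keep-masks for class-blind (CB), class-uniform (CU), and
    node-wise (CN) magnitude pruning of fraction p."""
    # CB: pool all parameters and prune the globally smallest p%
    pooled = np.concatenate([np.abs(W).ravel() for W in layers])
    t_cb = np.quantile(pooled, p)
    cb = [np.abs(W) > t_cb for W in layers]
    # CU: prune the smallest p% within each layer separately
    cu = [np.abs(W) > np.quantile(np.abs(W), p) for W in layers]
    # CN: prune the smallest p% within each row (node) of each layer
    cn = [np.abs(W) > np.quantile(np.abs(W), p, axis=1, keepdims=True)
          for W in layers]
    return cb, cu, cn

rng = np.random.default_rng(0)
layers = [rng.standard_normal((4, 8)), rng.standard_normal((4, 8)) * 10]
cb, cu, cn = prune_masks(layers, p=0.5)
```

Note how CB concentrates pruning on the layer with smaller weight scale, while CU and CN prune the same fraction everywhere.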
Table 2 compares the three pruning settings on the RT task. Figure 1 shows that the objective function in Equation (2) quickly converges to the minimum; this observation corroborates the claim of convexity by Alfarra et al. (2022). The plot in Figure 2 shows the change in zonotope structure before and after optimization on SNLI. The black polytope is obtained from the generators A, B, and the red polytope is the one obtained after optimization, i.e., the zonotope obtained from Â, B̂. We observe that the optimization preserves the geometry of the zonotopes while enforcing the rows of the down-projection matrix to be as sparse as possible, i.e., many points in the zonotope come close to zero, keeping the necessary boundary points that preserve the geometry. These zonotopes are dual to the adapter hypersurfaces; thus, preserving one structure enforces the other's orientation to remain preserved. Hence, one can prune adapters yet maintain their characteristic properties.

Conclusion
We proposed a novel approach for adapter pruning by studying the tropical characteristics of adapters. We formulated it as an optimization problem that aims to identify row-sparse projection matrices while minimizing the distance between the tropical hypersurfaces before and after pruning. We demonstrated the advantages of tropical characterization on five NLP datasets reformulated as classification tasks.

Limitations
As our focus is on adapter-based architectures, the proposed approach cannot be directly adapted to other parameter-efficient approaches such as soft prompt tuning (Lester et al., 2021; Bhardwaj et al., 2022a), which do not have explicit dense connections and activations. Another limitation comes from the ReLU activation function. Since it fits the min-max (tropical) algebra, we could reformulate the problem in terms of tropical polynomials. However, for other non-linear activation functions such as tanh, one has to reformulate and likely resort to approximations, as there is no straightforward way to cast them as tropical algebraic expressions.

A Tropical Algebra and Geometry
To motivate our approach we first provide background on tropical algebra and geometry.
Tropical Arithmetic. Tropical algebra is a variant of classical algebra where the basic arithmetic operations are redefined. The tropical sum ⊕ of two numbers is their maximum and the tropical product ⊙ is classical addition: x ⊕ y = max{x, y}, x ⊙ y = x + y. For instance, 2 ⊕ 5 = 5 and 2 ⊙ 5 = 7. Axioms and the order of arithmetic operations in tropical algebra follow the classical ones; thus addition is commutative and multiplication is distributive over addition. From these properties, it can be inferred that −∞ is the additive identity, as −∞ ⊕ x = x, and 0 is the multiplicative identity, as 0 ⊙ x = x. The real numbers (together with −∞) under tropical arithmetic form a semiring T, denoted by the triplet (R ∪ {−∞}, ⊕, ⊙).
Tropical Power and Monomial. For any variable x ∈ T, the tropical power can be defined as x^⊙a = a·x, where a ∈ N (a natural number). For simplicity of notation, we write x^a in place of x^⊙a. A tropical monomial is expressed in the form c ⊙ x1^a1 ⊙ x2^a2 ⊙ ... ⊙ xd^ad, where c ∈ R ∪ {−∞} and a_i ∈ N. For convenience, we write a tropical monomial as c x^α, where x = (x1, ..., xd) ∈ T^d and α = (a1, ..., ad) ∈ N^d.

Tropical Polynomial. A d-variable tropical polynomial f(x) can be represented as a finite tropical sum of tropical monomials, f(x) = c1 x^α1 ⊕ ... ⊕ cn x^αn, where α_i ≠ α_j when i ≠ j, the coefficients c_i ∈ R ∪ {−∞}, and the exponents α_i = (a_i1, a_i2, ..., a_id) ∈ N^d are integer vectors. Ignoring −∞ for ease, note that f defines a map R^d → R; both x and α are d-dimensional vectors.
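As a quick sketch, a tropical polynomial evaluates classically as a maximum over affine terms c_i + ⟨α_i, x⟩ (the coefficients and exponents below are illustrative):

```python
import numpy as np

def trop_poly(coeffs, exponents):
    """Evaluate the tropical polynomial p(x) = c1 x^α1 ⊕ ... ⊕ cn x^αn,
    i.e., max_i (c_i + <α_i, x>) in classical arithmetic."""
    C = np.asarray(coeffs, dtype=float)
    E = np.asarray(exponents, dtype=float)
    return lambda x: float(np.max(C + E @ np.asarray(x, dtype=float)))

# p(x1, x2) = 0 ⊕ x1 ⊕ x2, classically max(0, x1, x2)
p = trop_poly([0.0, 0.0, 0.0], [[0, 0], [1, 0], [0, 1]])
assert p([2.0, -1.0]) == 2.0   # the monomial x1 attains the maximum
assert p([-3.0, -5.0]) == 0.0  # the constant monomial attains the maximum
# at x1 = x2 = 1 two monomials tie, so this x lies on the tropical hypersurface
terms = np.array([0.0, 0.0, 0.0]) + np.array([[0, 0], [1, 0], [0, 1]]) @ np.array([1.0, 1.0])
assert np.sum(terms == terms.max()) >= 2
```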
Tropical powers, monomials and polynomials are basic building blocks of the algorithm we propose for adapter pruning.
Tropical Hypersurfaces. Tropical hypersurfaces are analogues of classical algebraic surfaces and the key objects for us to study for adapter pruning. Given a tropical polynomial f(x) = c1 x^α1 ⊕ ... ⊕ cn x^αn, its tropical hypersurface is the set of points where the maximum in f is attained by two or more of its constituent monomials:

F(f) = {x ∈ R^d : c_i x^α_i = c_j x^α_j = f(x), for some α_i ≠ α_j}.

Here we mention a few provable facts: F(f) divides the domain of f into convex regions (or cells), and f is non-linear at x if and only if x lies on F(f). Similar to algebraic polynomials, we can identify Newton polytopes associated with tropical polynomials.
Newton Polytopes. For a given tropical polynomial f(x) = c1 x^α1 ⊕ ... ⊕ cn x^αn, its Newton polytope ∆(f) is defined as the convex hull of the exponents α_i ∈ N^d. The points α_i and the polytope lie in a d-dimensional space (R^d). The tropical polynomial f determines a dual subdivision δ(f) of its Newton polytope (Zhang et al., 2018; Alfarra et al., 2022). Figure 3 provides an illustration of F(f) and δ(f) for a specific f.
Zonotopes. A zonotope is the Minkowski sum of a finite set of line segments; the vectors v_1, ..., v_m defining these segments are called its generators. Given two sets P1 and P2 in R^d, their Minkowski sum is defined as P1 + P2 = {p1 + p2 : p1 ∈ P1, p2 ∈ P2}. Property 1: the Minkowski sum of two polytopes is the convex hull of the Minkowski sum of their vertex sets. Letting V(P) denote the vertex set of a polytope P, we have P1 + P2 = ConvHull(V(P1) + V(P2)). The dual subdivision of h_i is a weighted Minkowski sum of the segments ConvHull(a_j, 0); similarly, we compute the dual subdivision of q_i as δ(q_i) = Σ_{j=1}^{r} b_ij^− ConvHull(a_j, 0) + shift. Note that the convex hull of a_j and 0 is a line segment; thus δ(h_i) defines a Minkowski sum over line segments, which is a zonotope. Following Alfarra et al. (2022), and ignoring the shifts, one can straightaway obtain the zonotope generators G1 and G2 for δ(H) and δ(Q), respectively.
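As a small illustration of the zonotope construction (generator values are arbitrary): summing over all subsets of generators enumerates the candidate vertices of the Minkowski sum of the segments [0, g_j]:

```python
import numpy as np
from itertools import product

def zonotope_points(G):
    """Candidate vertices of the zonotope generated by the rows of G:
    all sums of subsets of generators, i.e., the Minkowski sum of the
    line segments [0, g_j]."""
    return np.array([np.asarray(eps) @ G
                     for eps in product([0.0, 1.0], repeat=len(G))])

G = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
pts = zonotope_points(G)
assert pts.shape == (8, 2)                               # 2^3 subset sums
assert any(np.allclose(q, [2.0, 2.0]) for q in pts)      # sum of all generators
```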
As shown in Table 2, in the Standard settings S-CN/S-CU/S-CB, we observe that layer-wise S-CU pruning works best in four out of six fractions of parameters retained. In the Tropical settings T-CN/T-CU/T-CB, layer-wise T-CU pruning performs best across all the considered pruning fractions. Moreover, T-CU works best within each pruning-fraction category.

Figure 2: Zonotope defined by adapters before (red) and after the pruning (blue) via Algorithm 1.


Table 2: Accuracy scores on the RT task, comparing node-wise (CN), layer-wise (CU), and pruning all modules together (CB). S and T denote Standard and Tropical, respectively.
Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL.
The tropical hypersurface F(f) is the dual graph of δ(f), i.e., vertices of F(f) are regions of δ(f) and edges represent two adjacent regions in δ(f). Each vertex in δ(f) corresponds to one "cell" in R^d in which f is linear. Since δ(f) is in one-to-one correspondence with the tropical hypersurface, we study the adapter characteristics, i.e., the underlying hypersurfaces F(f), by studying the orientation of the primal graph δ(f). To determine δ(f), we use the fact that, when the model is bias-free, δ(f) = ∆(f) (Zhang et al., 2018; Alfarra et al., 2022).