On the Cost-Effectiveness of Stacking of Neural and Non-Neural Methods for Text Classification: Scenarios and Performance Prediction

Nowadays, neural network algorithms, such as those based on Attention and Transformers, have excelled at Automatic Text Classification (ATC). However, such enhanced performance comes at high computational costs. Stacking of simpler classifiers that exploit algorithmic and representational complementarity has also been shown to produce superior performance in ATC, enjoying high effectiveness and potentially lower computational costs than complex neural networks. In this master's thesis, we present the first and largest comparative study to exploit the cost-effectiveness of Stacking in ATC, considering ensembles of Transformers and non-neural algorithms. In particular, we are interested in answering the following research question: Is it possible to obtain an effective ensemble with significantly less computational cost than the best learning model for a given dataset? Besides answering that question, another main contribution of this thesis is the proposal of a low-cost oracle-based method that can predict the best ensemble in each scenario using only a fraction of the training data.


Introduction
Natural Language Processing, Machine Learning and Data Mining techniques work together to automate the fundamental task of Automatic Text Classification (ATC). ATC automatically associates documents with classes, providing means to organize information and allowing better comprehension and interpretation of the data. Algorithms based on neural networks (e.g., BERT (Devlin et al., 2018), XLNet (Yang et al., 2019)) have become the highlight of the area, where they are used both to learn features for text representation and as classification algorithms. The main problem of such methods is the very high computational cost needed for learning the model parameters (Sun et al., 2019; Cunha et al., 2021).
Ensemble approaches, such as stacking, which combine the outputs of several base classification models to form an integrated output, have also been shown to excel in ATC (Džeroski and Ženko, 2004; Ding and Wu, 2020), enjoying high effectiveness and computational costs that depend on the learning methods selected for the ensemble. They are motivated by the fact that distinct learning models or text representations may complement each other, uncovering specific structures that underlie the input/output relationship of the data. Early works (Larkey and Croft, 1996) showed that combinations of different classification algorithms can produce better effectiveness than any single type of classifier.
However, the benefits of ensemble techniques over a strong classifier are not always clear (Dong and Han, 2004), in part due to the excellent generalization power of the best classifiers. In fact, previous ensemble works mostly focus on improving overall classification effectiveness using the results of traditional classification algorithms (Campos et al., 2017; Ding and Wu, 2020), paying little or no attention to practical issues such as execution time or which combination of efficient base algorithms can bring effective results at a lower cost.
Accordingly, our first contribution in this paper is a thorough study of the cost-effectiveness tradeoff of stacking techniques for text classification tasks. Rather than just evaluating the effectiveness of an ensemble of various recent and effective methods, including those based on transformers and attention models, we focus on the study of stackers capable of achieving a better compromise between low cost (or high efficiency) and high effectiveness when compared to a single base model (i.e., the most effective single classifier in a given dataset). We conduct a wide range of comparative experiments with stacked ensemble models and state-of-the-art base algorithms on six datasets widely used in text classification. We seek answers based on empirical evidence to the following questions, considering the best learning model for each given dataset: (RQ1): Is it possible to obtain an effective ensemble with significantly less computational time than the best learning model? (RQ2): Is it possible to improve the effectiveness of the best learning model using an ensemble without increasing the computational time? (RQ3): Disregarding the computational time, can an ensemble improve the effectiveness of the best learning model? As far as we know, we are the first to investigate the cost-effectiveness trade-offs (Cunha et al., 2021) of stacking of neural and non-neural text classifiers from the described perspectives.
A second main contribution of our work is the proposal of a low-cost oracle-based method that can predict the best ensemble in each scenario (with and without computational cost limitations) using only a fraction of the available training data. Our "Oracle" first estimates the best base algorithm (which can be seen as a baseline for effectiveness) and then performs an efficient greedy search of ensembles guided by both their effectiveness and their efficiency relative to the best base algorithm. Particularly, the Oracle predicts effective ensembles by successively including base algorithms that improve the combined majority voting effectiveness. Moreover, our method avoids the inclusion of base algorithms that are expensive relative to the best base algorithm, guaranteeing ensemble efficiency. Our proposed Oracle is the first known strategy to provide an efficient prediction of effective ensembles capable of tackling practical efficiency issues related to our research questions. In more detail, our proposal aims at predicting three ensembles corresponding to the time restrictions of RQ1, RQ2 and RQ3, respectively, while avoiding the potentially high computational cost of evaluating expensive base models and their ensembles, especially on large datasets.
Our experimental results show affirmative answers to our three research questions in most experiments. In most datasets, it is possible to obtain an ensemble of base algorithms that is as good as or better than the best base algorithm, at a lower cost. In 5 out of 6 datasets it is possible to obtain an ensemble with statistically significant gains over the best algorithm with no increase in cost. Similarly, in 5 out of 6 datasets, our oracle provides results as good as or better than the best base algorithm with no increase in cost, providing empirical evidence of the practical benefits of the proposed oracle.
Background and Related Work

Text Classification Strategies
Early efforts in ATC focused on improving machine learning algorithms such as Naïve Bayes, kNN, Logistic Regression and SVM (Howard and Ruder, 2020) using a simple bag-of-words (TFIDF-weighted) representation. Even with such a simple document representation, methods such as LinearSVM (Fan et al., 2008a) and XGBoost (Chen and Guestrin, 2016a) produced high effectiveness and efficient convergence for large datasets (Fan et al., 2008a).
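As a concrete (and purely illustrative) sketch of this classic setup, a TFIDF bag-of-words pipeline with a linear SVM takes only a few lines of scikit-learn; the corpus and labels below are toy placeholders, not the datasets used in this work:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy corpus standing in for a real ATC dataset
docs = ["cheap flights and hotel deals", "new transformer model released",
        "weekend flight discounts to rome", "gpu benchmarks for deep learning"]
labels = ["travel", "tech", "travel", "tech"]

# bag of TFIDF-weighted words + LinearSVM: simple, fast, still a strong baseline
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["cheap flight deals"]))
```

Despite its simplicity, this kind of pipeline trains orders of magnitude faster than fine-tuning a transformer, which is precisely the efficiency dimension explored in this study.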
More recent strategies, such as metafeatures (Canuto et al., 2019b) and neural networks (NNs) (Tang et al., 2015a), exploit the training data to build more informative document representations. Particularly, strategies based on metafeatures (Canuto et al., 2019b) extract information from more basic (bag-of-words) features to enhance the feature space by smartly exploiting a document's neighborhood. Strategies based on NNs enhance word representations (and thus document representations), also exploiting the training data. FastText (Joulin et al., 2016) and PTE (Tang et al., 2015a), for instance, presented high effectiveness in comparison to (costly) deep learning approaches.
Considerable advances in deep learning for ATC were achieved by using pre-trained language models with fine-tuning (Howard and Ruder, 2018), mainly when combined with attention mechanisms (Kokkinos and Potamianos, 2017; Yang et al., 2016) and the parallelization benefits of transformers, best exemplified by BERT (Devlin et al., 2018). Following BERT's success, the recent XLNet network (Yang et al., 2019) proposes a new autoregressive formulation to improve the exploitation of contextual information. Though effective, the fine-tuning process of methods such as BERT and XLNet still takes substantial computational time, requiring powerful hardware (GPUs) (Sun et al., 2019). Such requirements might bring practical limitations for these solutions.

Stacking
Stacking (Wolpert, 1992) is a widely known ensemble technique that combines the predictions of heterogeneous algorithms (i.e., base algorithms) to improve effectiveness relative to these base algorithms. To implement stacking, we first train each base algorithm. With the trained models, we make predictions on a separate validation set, which was not used for training. With the saved models and the predictions on the validation set, a meta-layer (another learning algorithm) is used to learn how to combine the base predictions. Recent work reported high effectiveness with stacking for multiple ATC tasks, such as topic classification (Campos et al., 2017; Abuhaiba and Dawoud, 2017), sentiment analysis (Carvalho and Plastino, 2020; Onan et al., 2016) and multi-label classification (Xia et al., 2020; Weng et al., 2019). Particularly, stacking provided substantial effectiveness improvements on recently proposed decision-tree-based algorithms (Campos et al., 2017) and with methods trained on different representations (including word embeddings) (Carvalho and Plastino, 2020; Pelle et al., 2018; Onan et al., 2016).
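The scheme described above can be sketched as follows. This is a minimal, hypothetical illustration (synthetic data and an arbitrary choice of base learners and meta-layer, not the configuration evaluated in this work):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
# three disjoint partitions: base training, validation (meta-layer), test
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            random_state=0)

# step 1: train heterogeneous base algorithms
bases = [CalibratedClassifierCV(LinearSVC()),  # wrap the SVM to get probabilities
         KNeighborsClassifier(5),
         LogisticRegression(max_iter=1000)]
for b in bases:
    b.fit(X_tr, y_tr)

# step 2: base predictions on the held-out validation set become meta-features
meta_X_val = np.hstack([b.predict_proba(X_val) for b in bases])
meta = LogisticRegression(max_iter=1000).fit(meta_X_val, y_val)

# step 3: at test time, stack the base probabilities and let the meta-layer decide
meta_X_te = np.hstack([b.predict_proba(X_te) for b in bases])
acc = meta.score(meta_X_te, y_te)
print("stacking accuracy:", acc)
```

Note that predicting on a validation partition disjoint from the base-training partition is essential: training the meta-layer on in-sample base predictions would feed it overly optimistic inputs.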
A careful choice of base algorithms is necessary due to the potential degradation of stacking effectiveness and efficiency. The literature reports low stacking effectiveness caused by overfitting with multiple base algorithms (Reid and Grudic, 2009; Ledezma et al., 2010). Previous works that optimize the choice of a subset of base algorithms (Ledezma et al., 2010; Gupta and Thakkar, 2014) focused on maximizing ensemble effectiveness with no concern for efficiency. Stacking efficiency is usually bounded by the most expensive base method. In fact, Hou et al. (2021) avoid the cost of expensive deep learning methods by including gradient boosting base algorithms with effectiveness comparable to convolutional NNs.
In this work, we provide a thorough evaluation of the effectiveness and efficiency tradeoffs of stacking, i.e., we investigate whether there are combinations of algorithms that overcome (in both efficiency and effectiveness) the best base ones in a given dataset. Our proposed oracle, in turn, is the first method to explicitly tackle a time-constrained stacking prediction goal by explicitly and efficiently exploiting the relationships between stackings and the best base algorithms.

Time-Constrained Stacking
We aim to answer the following research questions: (RQ1): Is it possible to obtain an effective ensemble with significantly less computational time than the best learning model? (RQ2): Is it possible to improve the effectiveness of the best learning model using an ensemble without increasing the computational time? (RQ3): Disregarding the computational time, can an ensemble improve the effectiveness of the best learning model?
With RQ1 we aim to identify whether it is possible to obtain a stacking of (a subset of) base algorithms that is as effective as or better than the best (i.e., most effective) base algorithm while taking strictly less computational time than the best base algorithm. Favorable evidence towards a positive answer is important to indicate the existence of cost-effective stacking solutions, especially if the best base algorithm is a costly baseline with strong generalization power. RQ2 keeps the same effectiveness demands of RQ1, but considers the following relaxation of the time constraint: the parallel execution of the base models can take at most the same execution time as the best base algorithm. This time constraint allows the best base algorithm to be included in the stacking. With this, we intend to evaluate whether effectiveness improvements are possible with the (time) cost of the best base algorithm as an upper limit. In RQ3 we remove all time constraints to obtain the best possible stacking regardless of cost. With RQ3 we want to evaluate the potential effectiveness improvements of a stacking over the best base algorithm, in exchange for additional (time) cost.

Oracle-Based Prediction of Stacking Performance
The proposed strategy is implemented as follows: (i) each base algorithm is trained with a reduced amount of the training set (e.g., 30%); (ii) we run an algorithm, called "Oracle" (Algorithm 1), which aims at finding the best combination of base algorithms trained with less data via a greedy strategy. First, we select the best base algorithm obtained with the reduced training to start the combination, where A is the set of all base algorithms executed with less training. For this, we use the Best(A) function, which simply returns the best algorithm based on the validation set. In each iteration, the next best algorithm, as estimated on a validation set, is added and we verify whether the combined result presents a statistically significant improvement (α = 0.05) in relation to the previous iteration. If so, it is permanently included in the combination. The process continues until all base algorithms have been considered. The strategy is greedy since it makes the best choice in the current iteration.
To perform the comparison and statistical tests in each iteration, we use a separate partition of the training set (validation) that is not contained in the smaller part used for training. As meta-layer, we use a simple average, i.e., we sum the predicted probabilities and divide them by the number of base algorithms. The meta-layer average is represented by the function Avg(E) in the pseudocode, where E ⊂ A. As it is a simple meta-layer and not a learning algorithm, its cost can be considered insignificant in the choice process.
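The greedy search can be sketched as follows. This is a simplified, hypothetical rendering of Algorithm 1: validation accuracy stands in for the effectiveness measure, and the paired significance test (α = 0.05) is abstracted as a callable so the example stays self-contained:

```python
import numpy as np

def greedy_oracle(val_probas, y_val, significantly_better):
    """val_probas: dict mapping algorithm name -> (n_val, n_classes) probability
    matrix predicted on the validation set by models trained on reduced data.
    significantly_better(new, old) stands in for the paired test at alpha = 0.05."""
    def effectiveness(members):
        # Avg(E) meta-layer: average the probabilities of the members of E
        avg = np.mean([val_probas[m] for m in members], axis=0)
        return float(np.mean(avg.argmax(axis=1) == y_val))

    # Best(A): start the combination from the strongest base algorithm
    ranked = sorted(val_probas, key=lambda m: effectiveness([m]), reverse=True)
    ensemble, best = [ranked[0]], effectiveness([ranked[0]])
    for cand in ranked[1:]:                     # greedily consider the rest
        score = effectiveness(ensemble + [cand])
        if significantly_better(score, best):   # keep only significant gains
            ensemble, best = ensemble + [cand], score
    return ensemble, best

# toy example: a strong model and a weak one on a 4-example validation set
y_val = np.array([0, 1, 0, 1])
probas = {"strong": np.array([[.9, .1], [.1, .9], [.9, .1], [.1, .9]]),
          "weak":   np.array([[.4, .6], [.4, .6], [.4, .6], [.4, .6]])}
print(greedy_oracle(probas, y_val, lambda new, old: new > old))
```

In the toy run, the weak model does not improve the averaged prediction, so the oracle keeps only the strong model, mirroring how the real Oracle discards base algorithms whose inclusion brings no significant gain.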
With the oracle defined, we raise the following research questions: (ORQ1): Can we predict, using a fraction of the training data, an effective stacking that will tie with or outperform the best learning model when trained with all the available training data, at a smaller cost than that of the best model? (ORQ2): Can we make a prediction similar to ORQ1, but now with cost at most equal to that of the best model when trained with all training data? (ORQ3): With no time constraints, can we predict a combination that will be better than the best learning algorithm in a dataset?

Experimental Setup
We consider the effectiveness and efficiency of the models on two large-scale ATC datasets (Zhang et al., 2015) (more than 100,000 documents) - AG's News (AGNews) and IMdB Reviews - and four mid-sized datasets well known in the ATC community - 20 Newsgroups (20NG), WebKB (WebKB), Reuters (Reut) and ACM Digital Library (ACM). Table 1 shows the details of the datasets.
In terms of classification (base) algorithms, we consider LinearSVM (Fan et al., 2008b), kNN (Altman, 1992), LogisticRegression (Fan et al., 2008b), XGBoost (Chen and Guestrin, 2016b), XLNet (Yang et al., 2019) and BERT (Devlin et al., 2018). In terms of representations, beyond the traditional term-weighting alternatives (TFIDF), we consider distributional and other types of word embeddings, such as FastText (Joulin et al., 2016) and PTE (Tang et al., 2015b), as well as recent representations based on MetaFeatures that have obtained state-of-the-art (SOTA) effectiveness in some of the experimented datasets (Canuto et al., 2016, 2019a; Cunha et al., 2020, 2021). We run the stacking process with the following variants: all combinations of the same base algorithm with different representations, all combinations of different base algorithms with their best representations, and a combination that includes all the base algorithms. For example, we perform all possible combinations of LinearSVM with FastText, PTE, TFIDF and MetaFeatures, resulting in a total of C(4,2) + C(4,3) + C(4,4) = 6 + 4 + 1 = 11 combinations. We limit the combinations for one main reason: all combinations of all algorithms and representations (18 in our case) would result in an impracticable number of experiments: C(18,2) + C(18,3) + ... + C(18,18) = 262,125. An important observation is that we assume that the base algorithms can be run in parallel (e.g., on different machines). Thus, a stacking or oracle combination has its execution time limited by the most costly base algorithm in the respective combination. Even if this assumption does not hold and it is necessary to execute the base algorithms and combinations on a single machine, this would only aggravate the cost problem and allow an unfair comparison in our favor. Therefore, to avoid this unfair comparison, we maintain the assumption of parallel execution.
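The combination counts above follow directly from binomial coefficients, as this small (purely illustrative) check confirms:

```python
from math import comb

# subsets of size >= 2 of one algorithm's 4 representations
per_algorithm = sum(comb(4, k) for k in range(2, 5))
print(per_algorithm)   # 6 + 4 + 1 = 11

# subsets of size >= 2 of all 18 algorithm/representation pairs
exhaustive = sum(comb(18, k) for k in range(2, 19))
print(exhaustive)      # 262,125 -- impracticable to run exhaustively
```

Equivalently, the exhaustive count is 2^18 minus the empty set and the 18 singletons, which is why the search space explodes so quickly with the pool size.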
The experiments on the smaller datasets were executed using a 10-fold cross-validation procedure, while on the larger ones we used 5 folds due to the cost of the procedure. The algorithms' parameters were tuned using the Bayesian Optimization (Bergstra et al., 2015) approach with 10 iterations, with a 5-fold stratified strategy on the training set (nested cross-validation). Table 14 in the Appendix lists the values of each parameter that we optimize in the non-neural base algorithms. The parameters and pre-trained models for BERT and XLNet are also shown in the Appendix (Table 12). For the neural networks, we adopted the same parameters defined by the authors of the respective methods (Devlin et al., 2018; Yang et al., 2019). In our experiments, we adopt AWS EC2 instances to run and measure the execution time of both neural and non-neural algorithms. For the non-neural algorithms, we use the instance model c5a.12xlarge, which has 48 CPUs and 96GB of RAM (without GPU). For the neural algorithms, we use the instance model p2.xlarge, which has one NVIDIA K80 GPU (12 GB of memory), 4 CPUs and 61 GB of RAM.
We evaluate all methods, combined with different representations, with respect to classification effectiveness and training time. We assess classification effectiveness on the test partitions using MicroF1 and MacroF1 (Sokolova and Lapalme, 2009). While MicroF1 measures the classification effectiveness over all decisions, MacroF1 measures the classification effectiveness for each individual class and averages them, which is very important for skewed datasets. In addition to effectiveness, we also assess the cost of each method in terms of training execution time, aiming at analyzing the cost-effectiveness trade-offs for all methods. The metric is the overall time in seconds (averaged over folds). To compare the average test results in our cross-validation experiments, we assess statistical significance employing the paired t-test with 95% confidence, which is strongly recommended over signed-rank tests for hypothesis testing on mean effectiveness and arguably robust to potential violations of the normality assumption in this context (Urbano et al., 2019; Hull, 1993).
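The difference between the two metrics on skewed data can be seen in a small (illustrative) example using scikit-learn:

```python
from sklearn.metrics import f1_score

# skewed toy ground truth: class 0 dominates, class 2 is rare
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]  # the rare class is entirely missed

micro = f1_score(y_true, y_pred, average="micro")  # dominated by the majority class
macro = f1_score(y_true, y_pred, average="macro")  # penalizes the missed rare class
print(micro, macro)
```

Here MicroF1 is 0.8 even though class 2 is never predicted, while MacroF1 drops to about 0.62 because the rare class contributes an F1 of zero to the per-class average.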

Stacking Results
Effectiveness and time results for the base algorithms on each dataset are shown in Table 3. The results of the best base algorithms are considered in the next analyses. Results for RQ1, RQ2 and RQ3, in terms of MacroF1 for each dataset, are shown in Tables 4, 5, and 6, respectively, while Figure 1 shows the analysis of the cost (time). For each dataset, the tables show the effectiveness (MacroF1) of the best base algorithm along with the stacking combination that best answered the respective research question (if any), the respective combination of methods (the letters refer to the indices of the algorithms described in Table 2), and finally, in the last column (Most Costly), the most costly algorithm that entered the combination, according to the constraints imposed by the question. We present only MacroF1 results due to space constraints and the fact that it is harder to improve them in the highly skewed scenario that occurs in most of the experimented datasets. However, we also consider MicroF1, whose results are summarized in Table 7.
In Table 4, which focuses on RQ1 with its strong constraint in terms of cost (time), we can see that in 4 out of 6 datasets, it is possible to obtain a combination of classifiers (stacking) that is as good as (statistical tie) or better (see the ACM case, with statistically significant gains of 3.1%) than the best base algorithm, at a lower cost. In fact, the gains in terms of cost (time) are very significant (see Figure 1). Regarding RQ2 (Table 5), in 5 out of 6 datasets it is possible to obtain effectiveness gains with no increase in cost (recall that in this scenario the cost is limited by that of the best base algorithm). Effectiveness gains range from 0.4% in AGNews, through 1.15% in 20NG, 3.1% in ACM and 5.4% in IMdB, up to 9% in WebKB. Reuters is only considered a tie because of the high variability of the results across folds in this dataset (due to the large number of classes and very high skewness), which generates large standard deviations/confidence intervals. In absolute terms, there was a positive (non-statistically significant) variation of more than 9.7%. Indeed, the MicroF1 stacking results confirm statistically significant gains in Reuters (see Table 7). As expected, to obtain gains in this scenario it is necessary to include the best base algorithm in the combination in most datasets, inserting diversity/complementarity into the combination. Only in ACM is the best base algorithm not part of the combination. Notice also that, due to the time constraints, the gains are somewhat limited by the restricted number of classifiers that can be combined. This has some impact on the results: for instance, in IMdB only two algorithms belong to the best combination, while the combination in ACM has only three classifiers. Only in WebKB does the combination include all 18 classifiers, as the best base algorithm is also the most expensive one.
Another interesting aspect of the combinations is that in all datasets, a classifier using MetaFeatures was included (e.g., M and Q). Finally, in the scenario with no time constraint (RQ3), further gains can be obtained with the inclusion of more costly classifiers. There are further gains in AGNews (0.94%), 20NG (2.06%), IMdB (5.8%) and ACM (6.32%). Notice that in this scenario, there is a tendency to include most algorithms in the combinations, as in ACM, WebKB and AGNews, to obtain further improvements. This means that most algorithms have complementary information that tends to contribute to the final results. Another interesting aspect to notice is that in some cases, such as 20NG, a combination completely different from that chosen in scenario RQ2 was picked. This combination exploits the most effective and complementary algorithms, and may not even include the best base classifier. In other cases, such as IMdB, a combination of a few of the most effective (and costly) algorithms suffices to obtain larger gains. This means that the meta-layer is really doing a good job of learning about the individual performance of the algorithms and their complementarity. Finally, these additional effectiveness gains come with potentially high increases in cost, clearly seen in Figure 1 for the cases of 20NG, ACM and AGNews. In those datasets, the costs tripled (AGNews), quadrupled (ACM and IMdB) or became up to 8x more expensive. It is up to the application designer to decide whether this cost-effectiveness tradeoff is worth it. Table 7 summarizes the effectiveness results. For RQ1, there are 8 wins/ties out of 12 possibilities (6 datasets, two metrics). Recall that in this scenario ties are considered a good result due to the reduction in costs. For RQ2 and RQ3, there are 11 wins, only 1 tie (in MacroF1 on Reuters) and no losses at all.
In terms of cost (Figure 1), significant reductions in scenario 1 (RQ1) can be obtained in all 6 datasets, with almost no loss (or minimal losses) in terms of effectiveness. For scenario 2 (RQ2), effectiveness gains can be obtained in almost all cases with no additional cost when compared to the cost of the best base classifier. For scenario 3 (RQ3), additional effectiveness gains can be obtained, but sometimes with a very high increase in cost.

(Table 7: win/tie/loss counts per RQ, for MicroF1 and MacroF1.)

Oracle Results
MacroF1 results of the greedy Oracle predictor are shown in Tables 8, 9 and 10. These results correspond to an Oracle that uses the results of the base algorithms trained with 30% of the training data and predicts on a different training data portion in a nested cross-validation procedure. We start by answering ORQ1. Table 8 shows that in half of the cases we can make a good prediction, i.e., one that predicts a combination of methods that will tie with or outperform the best base algorithm when trained with all the available training data (100%). It is very important to stress that in a real situation we do not really know which will be the best algorithm when using all the training data, nor its effectiveness. Indeed, with more data, there is a tendency for some algorithms, such as the transformers, to improve their effectiveness, but their good performance may not be predictable with little training data. Recall also that this is a very strict scenario: even if we can predict which will be the best base algorithm, we cannot use it in the combination given the time constraints of ORQ1. Given all these limitations, mainly that only the algorithms with a cost lower than that of the best base algorithm (with 30% of training) can be considered, it is impressive that we can make a prediction that surpasses in effectiveness the best base algorithm using 100% of training in 20NG and ACM, and ties with it, being cheaper, in WebKB. Even in the cases in which there were losses, some were minimal, as in AGNews with a loss of only 2.5% and potential gains in training time. Only in Reuters and IMdB were there significant MacroF1 losses, mainly due to the failure to predict which would be the best base algorithm and the impossibility of including the predicted best base algorithm in the combination.
When we are allowed to include the best-predicted algorithm in the stacking (scenario for ORQ2), results are even better: we can make a good prediction in 5 out of 6 cases (2 wins and 3 ties). Notice that in this scenario we consider a tie as a good result. We regard being able to predict a combination that will tie with the best algorithm with 100% of training in a dataset, without knowing in advance which algorithm that will be, at a very low cost (Figure 2), as an excellent result. Notice that the best results in this scenario (i.e., 20NG and ACM) are obtained when we can in fact predict what will be the best base algorithm with 100% of training. But even when we cannot predict it, as in the case of WebKB and AGNews, we can find a combination of simpler (and potentially less expensive) algorithms that ties with the best. Again, IMdB was the only case in which we could not make a good prediction, precisely because of the failure to predict, with 30% of training, that BERT would be the best algorithm when all the training data is used. Finally, when no time constraints are imposed, the oracle's prediction results are excellent: 4 wins, 1 tie and only one loss (in IMdB). This last loss is explained by the same reasons as in the previous scenario: the failure to predict BERT as the future best algorithm. But even in this case, the prediction of algorithm K (LogisticRegression with PTE) as the sole combination (an unusual prediction) produced minimal losses: only 1.05%, at a cost much smaller than using BERT. And in the case of Reuters, we obtain an absolute increase in MacroF1 (6%), though not statistically significant due to the high variability.
When looking at the costs of making the predictions in each scenario (ORQ1, ORQ2, and ORQ3), shown in Figure 2, we can see that in all cases (except 20NG for ORQ3), the oracle's prediction times are much smaller, in many cases negligible, when compared to the time to run the best base algorithm with 100% of training. Given the time constraints imposed by ORQ1 and ORQ2, and the fact that even in the scenario for ORQ3 only a portion of the 18 available algorithms needed to be stacked (in most cases) to produce effectiveness gains, the advantages of the oracle are evident. Table 11 summarizes the results in terms of MicroF1 and MacroF1: considering all 36 results (three ORQs, 6 datasets, 2 metrics), the oracle predicted 17 wins, 10 ties (most of them (8) in scenarios ORQ1 and ORQ2, which can be considered good results) and only 9 losses, six of them in a single dataset (IMdB), for the simple reason that we failed to predict a neural network winner with less data. This is certainly a point to be improved in our methodology. One idea is to look not only at the absolute effectiveness values at a single training point (30%) but also at the growth tendency considering several points (5%, 10%, ...).

Conclusion and Future Work
We presented two important contributions to the application of Stacking in ATC: a thorough study of cost-effectiveness trade-offs and the proposal of a new oracle method to predict the best ensemble combination for a dataset at a low cost. Our extensive experiments, comprising 4 textual representation methods, 6 datasets, 4 non-neural algorithms and 2 neural algorithms, provided us with answers to questions that had not yet been explored in the literature. By performing stacking with different time constraints, we showed that it was possible to obtain combinations that positively answered the posed questions regarding time-constrained stacking and the oracle predictions in terms of both effectiveness and efficiency. We highlight general and practical guidelines based on our extensive experiments. First, we notice the consistent appearance of recent metafeatures in the best combinations of base learners obtained for each evaluated research question (Tables 4-6). In fact, due to the focus of metafeatures on summarizing relevant distance-based information from the original features, we strongly suggest their exploitation in ensemble combinations. Moreover, the largest datasets benefit from additional data to fine-tune BERT for the classification task. Therefore, combinations including both of these recent and distinct paradigms (metafeatures and BERT) for stacking were able to produce very effective results on most datasets (as shown in Table 10). We suggest that stacking methods should start by exploiting these two paradigms in conjunction. Finally, our experiments show the need for specific stacking solutions for different scenarios/datasets. Our proposed Oracle efficiently predicts effective base model combinations in time-constrained scenarios, allowing adaptable solutions that automatically optimize the choice of base learners for each specific dataset. We suggest exploiting the Oracle in all these situations.
In the future, we will explore different Oracle configurations, explore multi-objective feature selection in the stacking meta-layer (Viegas et al., 2018), study other types of constraints (e.g., labeling effort) and apply the Oracle in fields such as recommender systems.