Efficient Sentiment Analysis: A Resource-Aware Evaluation of Feature Extraction Techniques, Ensembling, and Deep Learning Models

In the pursuit of NLP systems that maximize accuracy, other important metrics of system performance are often overlooked. Prior models are easily forgotten despite their possible suitability in settings where large computing resources are unavailable or relatively more costly. In this paper, we perform a broad comparative evaluation of document-level sentiment analysis models with a focus on resource costs that are important for the feasibility of model deployment and for general climate consciousness. Our experiments consider different feature extraction techniques, the effect of ensembling, task-specific deep learning modeling, and domain-independent large language models (LLMs). We find that while a fine-tuned LLM achieves the best accuracy, some alternate configurations provide huge (up to 24,283×) resource savings for a marginal (<1%) loss in accuracy. Furthermore, we find that for smaller datasets, the differences in accuracy shrink while the differences in resource consumption grow further.


Introduction
The wider NLP community has in recent years seen a trend of growing model and data sizes to improve performance. This trend is most clearly demonstrated by the progression of large language models over the past few years, from ELMo (Peters et al., 2018), followed by BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT-3 (Brown et al., 2020), and LaMDA (Thoppilan et al., 2022), to name several notable examples, though this is by no means the only area of NLP where this is happening. This trend leaves resource efficiency and sustainability as an afterthought, a luxury that social media platforms may not have due to the sheer amount of data produced on social networks. Resource-intensive models not only cost more but also have a greater carbon impact, a consideration that is increasingly important in a climate-conscious society.

Figure 1: Change in accuracy and estimated grams of CO2 emitted for various feature extraction models and the standalone fine-tuned RoBERTa model on the IMDB dataset (Maas et al., 2011). Plotted values are averages across all considered classifiers. The changes in values (ΔCO2 / accuracy) were computed relative to FastText, which had the lowest CO2 of all feature extractors we considered. For full results, see Table 1.
Ensemble methods are another common, though more old-fashioned, technique for attaining performance improvements. Ensemble learning combines the results of several models and frequently achieves state-of-the-art (SOTA) results on benchmarks across NLP domains. These models are often only outperformed by large language models that have been pretrained on vast amounts of data. Notable examples in NLP include the Graphene algorithm for AMR parsing (Hoang et al., 2021) and the SQuAD 2.0 question answering task (Rajpurkar et al., 2018), where ensemble models consistently beat single models of comparable complexity. In these performance-topping scenarios, ensemble learning leads to increased computing costs and associated greenhouse gas emissions due to the multiple models that make up the ensemble.
In this paper, we investigate the relationships between system performance, computing costs, and environmental costs, considering both large pretrained models and ensembling methods, to form an accurate view of the trade-offs we make when we select one model over another. Performance analysis in the literature often limits its view to measuring only the quality of the output (e.g., accuracy), ignoring both real-world resource constraints, such as time and memory requirements, and the relative environmental impact of using one model over another.
We use the task of document-level sentiment analysis as a case study. This keeps the experiments within the scope of a single paper and avoids an explosion of experimental settings. Social networks produce huge amounts of sentiment-rich data in various forms, such as comments, reviews, and customer service chats, which are difficult to parse manually. Sentiment analysis is a critical tool for providing a global view of sentiment on a social media platform.
In our investigation, we use three different review datasets and consider nine feature extraction methods, two ensembling methods, and three standalone models. The wide range of datasets and methods ensures that we can identify broad trends and draw replicable conclusions. This paper's contributions are the following.
• We investigate the relationship between performance, resource use, and environmental impact in sentiment analysis by measuring accuracy, end-to-end runtime, memory usage, energy expenditure, and estimated CO2 emissions for a broad range of models that previous literature has shown to be appropriate for this task.
• We find that although modern neural systems can outperform other methods in accuracy, this comes at a considerable cost in other resources (time, memory, and energy). The performance advantage of large neural models over other methods shrinks on smaller datasets, making alternatives more appealing.
• For most of the scenarios that we tested, we find that FastText with a Support Vector Machine (SVM) classifier, or a frozen RoBERTa model (no fine-tuning) alongside FastText and an SVM classifier, achieves strong performance while greatly reducing runtime and energy expenditure.


Related Work

Goularas and Kamis (2019) considered GloVe and word2vec embeddings with multiple CNN and RNN architectures. They proposed a method of combining multiple CNNs with a biLSTM using GloVe embeddings as input, which they showed to outperform the other configurations in terms of accuracy.

Methodology
We separate our models into two broad categories. One category contains models based on an arbitrary combination of a feature extractor and a classifier. The other contains standalone models where at least one of these two components is inherent to the model design. The input for all models is preprocessed in the same manner, and all models are trained to produce a binary sentiment polarity classification. Figure 2 illustrates the high-level data flow from the review dataset to the output for the two types of models. For data preprocessing details, see Appendix A. See Appendix B for a discussion of feature extractors, classifiers, and standalone models that were omitted from our final evaluation. The remainder of this section provides brief introductions to each of the feature extractors, classifiers, and standalone models that we consider in our experiments.

Feature Extraction Methods
Bag of Words (BOW). BOW is a classic, simple approach to converting text data into a vector that can be input to an ML model. Each document is represented by a V-dimensional vector, where V is the vocabulary size. The value of each dimension is the number of times the corresponding word appears in the text.
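The BOW mapping can be sketched in a few lines of pure Python (the vocabulary and document below are illustrative, not drawn from our datasets):

```python
from collections import Counter

def bow_vectorize(doc_tokens, vocab):
    """Map a token list to a |V|-dimensional vector of word counts."""
    counts = Counter(doc_tokens)
    return [counts[word] for word in vocab]

vocab = ["good", "bad", "movie", "plot"]
vec = bow_vectorize(["good", "movie", "good", "plot"], vocab)
# vec == [2, 0, 1, 1]: "good" appears twice, "bad" never, etc.
```

In practice the vocabulary is built from the training split, and out-of-vocabulary words are simply dropped.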
Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF combines the term frequency and the inverse document frequency to produce a value representing the relevance of a word to a document (Salton and Buckley, 1988). The term frequency is the number of times the word appears in the text. The inverse document frequency is the inverse of the number of documents in which the word appears. This ensures that words that are common across texts, such as function words, are not over-valued. TF-IDF can, however, become computationally expensive for large vocabularies.
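A minimal sketch of one common TF-IDF variant (tf = raw count, idf = log(N/df)); library implementations such as scikit-learn's add smoothing and normalization on top of this:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.
    tf = raw count; idf = log(N / df), one common variant among many."""
    n = len(docs)
    # document frequency: in how many documents does each word occur?
    df = Counter(w for doc in docs for w in set(doc))
    return [
        {w: c * math.log(n / df[w]) for w, c in Counter(doc).items()}
        for doc in docs
    ]

docs = [["the", "film", "was", "great"],
        ["the", "film", "was", "dull"],
        ["the", "acting", "was", "great"]]
weights = tfidf(docs)
# "the" and "was" occur in every document, so their idf (and weight) is 0
```

This makes the down-weighting of function words concrete: a word appearing in every document gets idf = log(N/N) = 0.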

Continuous Bag of Words (CBOW). The CBOW model learns word vectors by predicting a target word from the several words surrounding it (Mikolov et al., 2013). Because CBOW can make use of more of the input data than the Skip-gram model (Mikolov et al., 2013), it learns word vectors efficiently and hence tends to perform better than Skip-gram. We trained our CBOW model using the word2vec module in the Gensim library (Řehůřek and Sojka, 2010).
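The training signal CBOW uses can be illustrated by generating its (context, target) pairs; this pure-Python sketch shows only the pair extraction, which Gensim performs internally alongside the actual vector training:

```python
def cbow_pairs(tokens, window=2):
    """Generate (context, target) training pairs for CBOW:
    the model predicts each target word from its surrounding context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

pairs = cbow_pairs(["the", "food", "was", "really", "good"], window=2)
# pairs[2] == (["the", "food", "really", "good"], "was")
```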
FastText. FastText (Bojanowski et al., 2016; Joulin et al., 2017) is a highly engineered library from Facebook for representation learning and classification that exploits subword information and quantization-based memory optimizations. We use its supervised learning algorithm to train FastText on our training data, which allows us to learn word embeddings from each dataset. Further details of this system are beyond the scope of this paper.
Double Word Embedding (DWE). DWEs were introduced by Zhou et al. (2022) as a method to combine the benefits of two different word embedding schemes using a convolution layer with piece-wise max pooling. Based on preliminary experiments on all combinations of the three word embedding models, we use DWEs combining FastText and GloVe, whereas Zhou et al. (2022) used GloVe and word2vec. Word-level GloVe vectors are mean-pooled to obtain document-level vectors. FastText vectors are obtained with the same approach as described in the FastText section.
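As a simplified illustration of the document-level combination step, the sketch below mean-pools one set of word vectors and concatenates the result with a second document vector. It omits the convolution and piece-wise max-pooling stage of the full DWE, and all vector values are toy numbers:

```python
def mean_pool(word_vectors):
    """Average word-level vectors into one document-level vector."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors)
            for i in range(dim)]

def combine_embeddings(glove_word_vecs, fasttext_doc_vec):
    """Mean-pool word-level GloVe vectors, then concatenate with a
    FastText document vector (a sketch of the combination only)."""
    return mean_pool(glove_word_vecs) + fasttext_doc_vec

doc = combine_embeddings([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5, 0.5])
# doc == [2.0, 3.0, 0.5, 0.5, 0.5]
```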
CNN. Our convolutional neural network (CNN) feature extractor is structured as follows. Each convolution layer is followed by ReLU activation, max pooling, and dropout. For training, after the final convolution layer (and the aforementioned post-convolution steps), the data is fed through two dense layers and a sigmoid. When used as a feature extractor, the feature embeddings are pulled from the last convolution layer, after ReLU activation but before the max pooling stage, and fed into the classifiers. For the IMDB dataset, we use two convolution layers. For the smaller restaurant and product review datasets, we use a single convolution layer.

LSTM. Long short-term memory networks (LSTMs) are designed to learn long-term dependencies in the input, such as those seen in text data. Our LSTM design is similar to our CNN feature extractor. Each LSTM layer is followed by a dropout layer. During training, prediction is done after two additional dense layers and a sigmoid activation. When used as a feature extractor, features are pulled from the last LSTM layer before dropout and fed into the classifiers. We use two LSTM layers for the IMDB dataset and one layer for the two smaller datasets.
RoBERTa. We use the 12th hidden layer's representation of the [CLS] token of the RoBERTa-base model as the feature vector. When used as a feature extractor, RoBERTa is fully frozen and never trained.
RoBERTa + FastText. The FastText and RoBERTa feature vectors, as described in the FastText and RoBERTa sections above, are concatenated to create a joint representation of the input text.

Classification Models
SVM. Support vector machines (SVMs) select the hyperplane that divides the data points into labeled classes while maintaining a margin between the hyperplane and the nearest data point(s) (Hearst et al., 1998). For binary classification, this hyperplane is positioned to achieve the maximum separation between the two classes, resulting in the largest possible margin. SVMs are popular classifiers, especially in applied machine learning domains, due to their robustness to small dataset sizes.
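As a toy illustration of margin-based training (not the scikit-learn SVM used in our experiments), the following pure-Python sketch minimizes the regularized hinge loss by sub-gradient descent on a hand-made 2D dataset:

```python
def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=500):
    """Toy sub-gradient descent on the regularized hinge loss for a
    linear SVM. Labels must be +1 / -1. Illustration only."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin: hinge loss is active
                w = [wj + lr * (yi * xj - lam * wj)
                     for wj, xj in zip(w, xi)]
                b += lr * yi
            else:           # outside the margin: only regularization acts
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

X = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0]]
y = [-1, -1, 1, 1]
w, b = train_linear_svm(X, y)
```

Prediction is then the sign of w·x + b; on this separable toy data the learned hyperplane classifies all four training points correctly.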
Voting Ensemble. Voting ensembles take the predictions ("votes") from member classifiers and select the winning vote. We use the soft-voting method described by Witten et al. (2011), where each member classifier produces a probability distribution over the possible classes, and the class with the highest average probability wins. We use SVM, Random Forest (RF), and Multinomial Naïve Bayes (MNB) as the member classifiers in our experiments.
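The soft-voting rule can be sketched directly (the member probabilities below are illustrative; our experiments use scikit-learn's implementation):

```python
def soft_vote(member_probs):
    """Average each member's class-probability distribution and
    return the index of the class with the highest mean probability."""
    n_classes = len(member_probs[0])
    avg = [sum(p[c] for p in member_probs) / len(member_probs)
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Three members (e.g. SVM, RF, MNB) over classes [negative, positive]:
pred = soft_vote([[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]])
# pred == 0: soft voting picks class 0 (mean probability 0.583) even
# though two of the three members individually prefer class 1
```

This also shows how soft voting differs from a hard majority vote, which would pick class 1 here.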
Boosting Ensemble. Boosting is a sequential ensemble process that iteratively reweights observations based on previous classifications. When an observation is wrongly labeled, its weight rises, so that subsequent models focus on the hardest examples. This reduces bias and produces robust statistical models. The boosting algorithm also assigns a weight to each of the resulting models. For boosting, we use XGBoost (eXtreme Gradient Boosting) (Chen and Guestrin, 2016).
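The reweighting idea can be illustrated with a single AdaBoost-style step; this is a sketch of the general boosting principle only, as XGBoost itself performs gradient boosting over tree ensembles:

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost-style reweighting step: wrongly classified
    observations gain weight so the next model focuses on them."""
    eps = sum(w for w, c in zip(weights, correct) if not c)  # weighted error
    alpha = 0.5 * math.log((1 - eps) / eps)                  # model weight
    new = [w * math.exp(alpha if not c else -alpha)
           for w, c in zip(weights, correct)]
    z = sum(new)                                             # renormalize
    return [w / z for w in new], alpha

new_w, alpha = adaboost_reweight([0.25] * 4, [True, True, True, False])
# the single misclassified observation now carries half the total weight
```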

Standalone Models
In initial experiments, we found that changing the classification head led to a consistent decrease in performance for certain neural models. This suggests that the feature extraction stages of these models are interwoven with the classification method over which they are trained. In these cases, we keep the feature extraction and classification components together rather than swapping out classification heads.
CNN-LSTM. Following Goularas and Kamis (2019), we use a combination of CNN and LSTM layers to extract the local and global characteristics of the input text. Each CNN layer is followed by ReLU activation, max pooling, and dropout. The first CNN layer takes randomly initialized vocabulary vectors as input. The LSTM layers then run on top of this. Each LSTM layer is followed by dropout.
Finally, two dense layers and a sigmoid activation are used for prediction. For the IMDB dataset, we use two CNN layers and two LSTM layers. For the restaurant and product review datasets, we use a single layer of each.
RoBERTa-base. When used as a standalone model, we jointly fine-tune all layers of the RoBERTa-base model and train the classifier on each dataset. We obtain the RoBERTa-base classifier using the Hugging Face (Wolf et al., 2020) AutoModel feature for sequence classification.
DistilBERT-base-uncased. We use the DistilBERT base uncased model as a standalone model in our experiments. This model is fine-tuned with an added classification layer in the same manner as the RoBERTa-base model.

Experimental Setup
All experiments were conducted on a single workstation in Florida, USA. For more details of the workstation configuration, see Appendix C.
Datasets. We evaluate our model configurations on three binary sentiment classification datasets: the IMDB movie review dataset (Maas et al., 2011), the "Restaurant Reviews in Dhaka, Bangladesh" dataset (Adhikary, 2019), and the "Grammar and Online Product Review" dataset (Datafiniti, 2018). Following Zhang et al. (2015) and Agarwal et al. (2011), we construct a binary classification task from the restaurant and product review datasets by labeling 1-star samples as negative-polarity and 5-star samples as positive-polarity examples. For more details on the datasets, see Appendix D. In all of our experiments, 80% of the data is used for training and 20% for testing.
Model Parameters. The context window size of the CBOW model is set to 15, and the min-count is set to 1 so that all words are considered. For SVM, MNB, and RF, we use the default parameters provided by the scikit-learn library (Pedregosa et al., 2011). For the XGBoost classifier, we use the default parameters of the xgboost library (Chen and Guestrin, 2016). We adjust the non-pretrained neural models to the different sizes of the datasets. For more details about these parameters, see Appendix E.
Training Parameters. For the CNN-LSTM models, we choose the number of epochs between 2 and 5 depending on the dataset size. For the restaurant and product review datasets, the batch size is 32; for IMDB, the batch size is 16. Adaptive Moment Estimation (Adam) is used as the optimizer with binary cross-entropy loss. The CNN and LSTM feature extractors are trained for only 1 epoch each. The fine-tuned RoBERTa-base model is trained with a learning rate of 1e-1, gradient accumulation steps of 2, and a batch size of 8. We train it for 3 epochs using the AdamW optimizer. DistilBERT is trained with a batch size of 16; otherwise, its parameters are the same as RoBERTa's.
Evaluation Metrics. To evaluate our models, we measure accuracy; end-to-end execution time (total time); memory usage, which includes both the model itself (model size) and the memory consumed during the forward pass or inference; electricity consumption; and CO2 emissions. We use codecarbon (Lottick et al., 2019) to estimate the carbon emissions.
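As an illustration of the runtime and memory portion of these measurements, the stdlib-only sketch below times a call and records peak Python heap usage; it is not the paper's exact harness, and energy and CO2 are estimated separately with codecarbon:

```python
import time
import tracemalloc

def measure(fn, *args):
    """Run fn(*args) and report wall-clock seconds and peak Python
    heap usage in megabytes (a simplified measurement sketch)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / 1e6

# e.g. measuring a stand-in workload:
result, seconds, peak_mb = measure(sorted, list(range(100_000)))
```

Note that tracemalloc only tracks Python-level allocations; native memory used by ML libraries requires OS-level measurement.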

Results & Analysis
The full results on the IMDB, restaurant, and product review datasets for the various feature extraction and classifier combinations are provided in Table 1, Table 2, and Table 3, respectively, alongside the standalone RoBERTa model. In every table, CO2 emissions are measured in grams, energy consumption in watt-hours (Wh), time in minutes, and memory in megabytes (MB). Figure 3 plots the average accuracy, CO2, time, and memory usage for each feature extraction technique on the IMDB dataset. The same plot for the restaurant review dataset (Figure 4) is provided in Appendix F. Looking only at the accuracy numbers, we find that the fine-tuned RoBERTa model outperforms every other configuration on every dataset. However, upon closer inspection, we find that the story is not so simple.
Feature Extractors and Classifiers. In Table 1, we see that the standalone RoBERTa model achieves the best accuracy of 91.02%, followed by the TF-IDF, FastText, and RoBERTa + FastText feature extractors with the SVM classifier, which achieve 90.11%, 88.64%, and 89.51%, respectively. If we look at the carbon emissions, however, these configurations do considerably better than the standalone RoBERTa model; DWE with boosting, for instance, produces 306 times less carbon emission at the cost of a 2.29% accuracy reduction. When taking these other resources into consideration, FastText may be a better option than the fine-tuned RoBERTa model. The CNN feature extractor achieves competitive accuracy but has considerable memory requirements.
It is also worth noting that the end-to-end runtime and CO2 for CNN as a feature extractor with SVM and voting are higher than for RoBERTa as a feature extractor, because the feature layer of the CNN is computationally intensive given its number of filters and kernels.
Table 2 shows similar results for the restaurant review dataset. RoBERTa achieves the highest accuracy, while the FastText, DWE, and RoBERTa + FastText feature extractors with the SVM classifier achieve nearly the same performance with major reductions in carbon emissions. DWE with SVM produces 78 times less carbon with only a 0.59% reduction in accuracy when compared to fine-tuned RoBERTa. Similarly, RoBERTa + FastText with SVM and FastText with SVM produce 27 times and 24,283 times less carbon emissions, while losing only 0.16% and 0.68% in accuracy, respectively.
The results from the product review dataset (Table 3) further solidify our observations. DWE with boosting, RoBERTa + FastText with SVM, and FastText with SVM result in 103 times, 34 times, and 19,409 times less carbon emissions with only 0.84%, 0.52%, and 1.22% accuracy reductions, respectively, when compared to fine-tuned RoBERTa.
Standalone Models. The results of our standalone models are presented in Table 4. On the IMDB reviews, RoBERTa requires almost 17 hours to complete the task, consuming 201.05 Wh of electricity and producing 90.554 g of CO2. DistilBERT takes almost 9 hours and emits 45.207 g of CO2, which is comparatively much lower than the RoBERTa model.
CNN vs. CNN-LSTM. We also find that using CNN as a feature extractor produces slightly better results than the CNN-LSTM model, though at the cost of higher memory use. For example, on the IMDB review dataset, CNN with the voting ensemble attains 89.65% accuracy, whereas the CNN-LSTM standalone model achieves 88.41% accuracy. (We also tried CNN and LSTM as standalone models, but the results were worse than the CNN-LSTM model.)

Conclusion & Future Work
We contextualized the accuracy of sentiment analysis systems within their computing resource requirements: runtime, maximum memory use, energy expenditure, and estimated CO2 emissions. As expected, a fine-tuned LLM achieves the highest accuracy on all three datasets. However, this comes at tens to thousands of times the cost in other resources (time, memory, energy, and CO2) when compared to configurations whose accuracy is less than 1% lower. We find that the FastText feature extractor with an SVM classifier, in particular, achieves good accuracy scores while having minimal resource requirements.
As such, for the vast majority of use cases where one must balance resource costs with output quality, we recommend concatenating frozen RoBERTa embeddings with FastText embeddings (RoBERTa + FastText) alongside an SVM classifier. This configuration provides strong accuracy scores without incurring the extreme time and energy costs of fine-tuning RoBERTa. Although our experiments considered only the task of sentiment analysis, we expect the major patterns to carry over to other classification tasks to the degree that they share similar dataset features (e.g., text length, dataset size, difficulty). We encourage others to test this hypothesis and provide insights into which techniques best balance output quality with resource usage in other NLP domains.
Due to our own resource constraints, our experiments do not include the most cutting-edge LLMs (e.g., LLaMA). The use and deployment of these models are qualitatively different (e.g., prompting, RLHF) from those we considered and are thus outside the scope of this paper. We leave this extension of our experiments for future work.

A Data Preprocessing
For data preprocessing, we normalize all texts by converting them to lowercase; removing URLs, HTML tags, email addresses, non-alphabetic characters, special characters, and stop words; tokenizing the text; and lemmatizing the tokens. We also expand contractions to their full forms to maintain uniformity.
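A minimal sketch of this normalization pipeline is shown below. The stop-word list is illustrative, and lemmatization and contraction expansion, which require external resources such as NLTK, are omitted:

```python
import re

def normalize(text, stop_words=frozenset({"the", "a", "is"})):
    """Sketch of the normalization pipeline: lowercase, strip URLs,
    HTML tags, email addresses, and non-alphabetic characters,
    then tokenize and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"\S+@\S+", " ", text)        # email addresses
    text = re.sub(r"[^a-z\s]", " ", text)       # non-alphabetic chars
    return [tok for tok in text.split() if tok not in stop_words]

tokens = normalize("The movie <b>WAS</b> great!!! See http://example.com")
# tokens == ["movie", "was", "great", "see"]
```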

B Abandoned Methods
Early in our experiments, we also considered (i) GloVe, (ii) Skip-gram, and (iii) lower (non-12th) layers of RoBERTa-base as feature extractors, bagging as a classifier, and ALBERT as a standalone model. However, we quickly found that these methods consistently gave poor accuracy results, or results that were not qualitatively different from a feature extractor already in our experiments.

D Dataset Details
The IMDB movie review dataset is balanced, with 25k samples for each of the positive and negative labels.
The restaurant review dataset is collected from 338 restaurants in Dhaka, Bangladesh, and the product review dataset is procured from 1,000 different products. Both original datasets use 5-star rating systems; we use 1-star samples as negative-polarity and 5-star samples as positive-polarity examples. This yields 8,621 positive and 2,241 negative reviews in the restaurant review dataset. As the product review dataset is highly imbalanced and we seek to investigate the effect of dataset size in our experiments, we further subsample the positive reviews in the product review dataset down to 6,981, while keeping all 3,701 negative reviews.

E Model Parameters Details
Regarding the non-pretrained neural models, we use a dropout rate of 0.2, a maximum sequence length of 512, and LSTM layers with 128 units.
For the restaurant and product review datasets, the vocabulary size of the embedding layers is 10,000, and the CNN layers have a kernel size of 3 with 128 filters. For the IMDB review dataset, the kernel size and number of filters in the first convolution layer are 5 and 300, respectively, while those of the second convolution layer are 3 and 100; the vocabulary size is 30,000. Additionally, we tried non-default parameters for SVM, XGBoost, MNB, and RF, but the defaults gave the best results. For example, when using RF in our voting ensemble method, we also tried the number of estimators=200, random state=1200, and criterion='entropy'.
The RoBERTa base model comes pretrained with 12 transformer layers, 12 attention heads, and a hidden layer size of 768.

Figure 1
Figure 1 demonstrates all three points for feature extractors on the IMDB review dataset. It shows how accuracy and carbon emissions change across feature extractors, including the extreme carbon emissions cost of the fine-tuned RoBERTa model and the balanced accuracy and emissions of the RoBERTa + FastText feature extractor.

Figure 2 :
Figure 2: High-level data flow from the review dataset to the output for models composed of feature extractors with classifiers and standalone models.

Figure 3 :
Figure 3: Feature extraction results for the IMDB dataset, averaged across classifiers.

Figure 4 :
Figure 4: Average results for the restaurant review dataset.

Table 1 :
Results on the IMDB review dataset for various combinations of feature extraction methods and classifiers.

Table 2 :
Results on the restaurant review dataset for different feature extraction methods.

Table 3 :
Results on the product review dataset for different feature extraction methods.

Table 4 :
Results of standalone models on the three datasets.

Table 5 :
Tested hyperparameters for the LLMs other than those specified above.

The DistilBERT base uncased model has 6 layers with 12 attention heads. Other tested hyperparameters for the LLMs are presented in Table 5.

F Average Result Diagram for Restaurant Review Dataset