1. Introduction
In the field of natural language processing (NLP), sentiment analysis encompasses methods designed to model the emotions conveyed in textual opinions across various domains, including topics, products, subjects, and services. The goal is to determine whether a given text expresses positive, negative, or neutral sentiment [1,2]. Furthermore, sentiment analysis can be extended to detect specific emotions such as anger, fear, joy, sadness, or frustration [3].
Sentiment analysis is widely applied in numerous fields, including e-commerce, where it is used to assess customer satisfaction based on product reviews [4,5]. It also plays a crucial role in market research [6] and in the analysis of cryptocurrency fluctuations [7,8]. Additionally, it is instrumental in tracking public trends [9,10] and in conducting political sentiment analysis [11,12].
In the context of sentiment analysis, polarity refers to the orientation of a piece of text, indicating whether it expresses a positive, negative, or sometimes neutral sentiment. The concept originates from linguistics and psychology, where “polarity” describes the opposing ends of an emotional spectrum. It provides a straightforward way to categorize the emotional tone of textual content using a simple scale.
Polarity can be estimated using either a lexicon-based method or a deep learning-based approach. Lexicon-based methods rely on predefined sentiment dictionaries, such as VADER [13] and SentiWordNet [14], to compute a polarity score based on the presence and strength of sentiment-bearing words. Alternatively, deep learning-based approaches [15] use supervised learning models trained on labeled sentiment data. These include architectures like Long Short-Term Memory (LSTM) networks [16] and transformer-based models, such as Bidirectional Encoder Representations from Transformers (BERT) [17] and the Robustly Optimized BERT Pretraining Approach (RoBERTa) [18], which are capable of capturing complex contextual dependencies for more nuanced sentiment detection.
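As a brief illustration of the lexicon-based route, the sketch below scores a short review with VADER, assuming the vaderSentiment Python package is installed; the ±0.05 threshold on the compound score is the commonly used convention for mapping it to negative, neutral, or positive polarity.

```python
# Minimal sketch of lexicon-based polarity scoring with VADER.
# Assumes the third-party "vaderSentiment" package is installed (pip install vaderSentiment).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def polarity(text: str, threshold: float = 0.05) -> str:
    """Map VADER's compound score to a three-way polarity label."""
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(polarity("The staff was friendly and the food was excellent!"))  # expected: positive
print(polarity("Average place, nothing special."))                     # likely neutral
```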
Sentiment analysis remains challenging, particularly on short-text platforms like Twitter, where limited contextual cues hinder accurate classification [19]. Poria et al. [19] emphasized the difficulty of extracting meaningful sentiment from short texts due to the lack of sufficient context. In contrast, Barbieri et al. [20] analyzed seven heterogeneous Twitter-specific classification tasks, including sentiment analysis, and demonstrated that leveraging existing pretrained generic language models and fine-tuning them on domain-specific corpora, such as Twitter data, improves classification performance. While Poria et al. highlighted the fundamental challenges posed by short texts, Barbieri et al. proposed a concrete method to enhance performance through additional training on specialized datasets.
BERT is a deep learning model for NLP introduced by Google in 2018. Based on the transformer architecture, BERT processes input text bidirectionally—considering both left-to-right and right-to-left contexts simultaneously—which improves its ability to understand the semantics of language in context. It was pretrained on two large text corpora, Wikipedia and BookCorpus, using masked language modeling and next-sentence prediction objectives [17]. BERT has significantly advanced performance in a wide range of NLP tasks, including question answering, sentiment analysis, named entity recognition, and text classification. In the context of sentiment analysis, BERT can be applied in two main ways. First, it can be fine-tuned on a labeled sentiment dataset, allowing the model to adapt its general language understanding to the specific classification task. Second, BERT can serve as a teacher model in a process known as knowledge distillation, where its outputs are used to train a smaller, more efficient student model that mimics its behavior, making deployment in resource-constrained environments more feasible.
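As a hedged sketch of the first option, the snippet below fine-tunes a pretrained BERT encoder with a three-class classification head using the Hugging Face transformers library; the file name, column names, and training hyperparameters are illustrative placeholders, not the exact configuration used in this study.

```python
# Illustrative BERT fine-tuning sketch (not the exact setup used in this study).
# Assumes the "transformers" and "datasets" libraries are installed and that
# "reviews.csv" (hypothetical) has "text" and "label" columns with labels in {0, 1, 2}.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # negative / neutral / positive

dataset = load_dataset("csv", data_files="reviews.csv")["train"].train_test_split(0.1)
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-sentiment", num_train_epochs=2,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```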
Areshey et al. [21] fine-tuned a BERT model for helpfulness classification tasks, which evaluate how popular or useful a review is perceived to be by other users. They investigated the influence of batch size and sequence length on model performance, concluding that a sequence length above 128 words is generally sufficient for this task, as longer input lengths yield only marginal improvements. In a different context, Mutinda et al. [22] proposed a modified BERT architecture for binary polarity classification on Yelp reviews. However, their evaluation was limited, as the results were reported only for the positive sentiment class. Moreover, the absence of information about the dataset—such as its size, composition, and the definitions of polarity—makes it difficult to assess the validity and generalization of their findings.
RoBERTa (Robustly Optimized BERT Pretraining Approach [18]) models share the same architecture as BERT but differ in their training procedure. The main difference is that RoBERTa removes next-sentence prediction from its objectives—as it is unnecessary for downstream tasks—and dynamically masks tokens instead of fixing them once per epoch. RoBERTa also uses a larger batch size and more training steps, resulting in longer pretraining. For instance, the pretraining procedure of the original BERTweet model [23] follows RoBERTa to achieve more robust performance.
Hugging Face is a widely adopted collaborative platform that enables the sharing and distribution of open-source machine learning models and applications. It offers high-level abstractions through its “pipelines” API, which simplifies the implementation of tasks, such as sentiment analysis, using transformer-based models. These pipelines abstract away much of the underlying complexity, allowing users to easily apply state-of-the-art models to a variety of natural language processing tasks. The platform hosts numerous fine-tuned BERT variants, making them readily accessible for testing, comparison, and integration into applications. Several pipelines are available for text classification, and some of them are specifically tailored for polarity detection, such as DistilBERT [24], the Pysentimiento [25] variant of BERTweet, and RoBERTa [26].
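For illustration, a pipeline of this kind can be invoked in a few lines; the model identifier below is a publicly available tri-polar checkpoint chosen as an example and is not necessarily one of the exact pipeline versions benchmarked later.

```python
# Minimal sketch of a Hugging Face sentiment-analysis pipeline.
# The model identifier is an example of a publicly available tri-polar checkpoint;
# any compatible text-classification model can be substituted.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

reviews = [
    "The service was slow and the food arrived cold.",
    "Decent place, nothing memorable.",
    "Absolutely loved it, will come back!",
]
for review, prediction in zip(reviews, classifier(reviews)):
    print(f"{prediction['label']:>8}  ({prediction['score']:.2f})  {review}")
```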
Recurrent Neural Networks (RNNs) were among the earliest deep learning architectures applied to NLP tasks due to their ability to model sequential information, effectively incorporating prior word context through a memory mechanism. However, traditional RNNs suffer from vanishing and exploding gradient problems, limiting their capacity to learn long-range dependencies. To address this limitation, Long Short-Term Memory (LSTM) networks were introduced. These models incorporate gating mechanisms to retain and control information over longer sequences. Bidirectional LSTMs (Bi-LSTMs) extend this architecture by processing input sequences in both the forward and backward directions, thus capturing both the preceding and following contexts for each token [27]. Despite their theoretical advantages, relatively few studies have systematically evaluated the actual performance gains (or potential trade-offs) of Bi-LSTMs compared to standard LSTMs in practical classification tasks.
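To make the architectural difference concrete, the following sketch builds a plain LSTM classifier and its bidirectional counterpart in Keras; the vocabulary size, embedding dimension, and layer width are illustrative defaults rather than the configuration evaluated in this study.

```python
# Illustrative LSTM vs. Bi-LSTM classifiers in Keras (hyperparameters are placeholders).
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 20_000, 100, 3

def build_classifier(bidirectional: bool) -> models.Model:
    recurrent = layers.LSTM(64)
    if bidirectional:
        # Wraps the LSTM so each sequence is read both left-to-right and right-to-left.
        recurrent = layers.Bidirectional(recurrent)
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        recurrent,
        layers.Dense(NUM_CLASSES, activation="softmax"),  # negative / neutral / positive
    ])

lstm_model = build_classifier(bidirectional=False)
bilstm_model = build_classifier(bidirectional=True)
lstm_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```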
Basiri et al. [28] evaluated various machine learning approaches, including LSTM and Bi-LSTM, on a small Persian dataset of online doctor reviews comprising only 700 samples. While their results provided insight into sentiment classification techniques, the limited dataset size restricts the applicability of their findings. Similarly, Wang et al. [29] conducted a comparative study between LSTM and traditional machine learning methods such as Naive Bayes, Maximum Entropy, and Support Vector Machines (SVMs), as well as Convolutional Neural Networks (CNNs). However, their evaluation was based on a labeled dataset of just 177 negative and 182 positive Tweets, and model performance was assessed solely using accuracy—a metric insufficient for capturing the full effectiveness of classification models, especially in imbalanced settings. Kapali et al. [30] also performed a brief comparison of LSTM and Bi-LSTM models using a corpus of only 800 Bengali-language samples. They reported similar performance between the two architectures; however, the LSTM model employed a single-layer configuration, while the Bi-LSTM used a deeper, two-layer architecture. As with the previous studies, the small size and linguistic specificity of the dataset limit the robustness and applicability of the results.
Using a substantially larger dataset, Sadikin and Fauzan [31] compared the performance of Long Short-Term Memory (LSTM) networks with that of Multi-Layer Perceptrons (MLPs) for binary sentiment classification, using a test set of 20,000 Yelp reviews. Their study examined the influence of batch size on model performance. The LSTM model achieved a significantly higher accuracy of 91% compared to 75% for the MLP. However, the evaluation was limited to accuracy alone, with no reporting of complementary metrics such as recall, precision, or F1 score, which are essential for a more comprehensive assessment of classification effectiveness.
Chandra et al. [32] conducted a comparative study of LSTM, Bi-LSTM, and BERT models on vaccine-related Tweets, focusing on a multi-label sentiment classification task with 11 possible categories (e.g., optimistic, pessimistic, and denial). This type of classification differs fundamentally from polarity detection, as each text instance can be assigned multiple sentiment labels simultaneously. As such, their findings are not directly transferable to polarity classification tasks. Furthermore, the authors did not report any details regarding the architectural configurations of the LSTM and Bi-LSTM models, limiting the interpretability and reproducibility of their results.
Chandra and Saini [33] proposed a framework for modeling the outcome of the 2020 U.S. general elections based on sentiment analysis of Twitter data, using LSTM and BERT language models. Both models were trained or fine-tuned on the Internet Movie Database (IMDB) movie review dataset, a widely used benchmark in sentiment analysis containing labeled positive and negative film reviews, to evaluate their ability to capture sentiment. The authors reported that BERT and LSTM performed similarly in terms of training accuracy and F1 score. However, these results were limited to the IMDB dataset and were not the primary focus of the study. The main objective was to estimate state-level election results by computing the average polarity scores per state based on model predictions. In this context, BERT served as a general-purpose tool for modeling voting outcomes and suggested a victory for Biden. The LSTM model, while offering valuable insights for identifying swing states, struggled to capture voting dynamics effectively, largely due to its tendency to classify a large number of Tweets as neutral.
Bello et al. [34] explored the integration of BERT with CNN, RNN, and Bi-LSTM architectures for three-class sentiment classification using a large dataset of over 42,000 Tweets sourced from Kaggle. Their results indicated that combining BERT with these neural architectures leads to superior performance—generally exceeding 91% in terms of accuracy, recall, precision, and F1 score—compared to using CNNs or Bi-LSTMs alone. However, several methodological concerns limit the interpretability of their findings. The structure of the Bi-LSTM model was not thoroughly described, other than the mention of a single layer. Moreover, the performance of the standalone BERT-base model was not reported, which is a notable omission given its role as a baseline. A further concern relates to data quality: Tweets typically lack explicit user ratings, and the Kaggle dataset does not document how the sentiment labels were assigned. This absence of transparency raises questions about whether the Tweets were annotated manually or through automated heuristics, thereby casting doubt on the reliability of the labels and the validity of the conclusions drawn from the study.
While prior studies have provided valuable insights into sentiment classification using LSTM- and BERT-based architectures, they often differ in scope, dataset size, evaluation metrics, or the level of detail provided regarding model design. Consequently, there remains a gap in the literature for a clear, systematic comparison between these two families of models under consistent conditions. This study aims to address this gap by offering a comprehensive comparison of LSTM- and BERT-based approaches for sentiment polarity classification. It does so using a large and well-balanced dataset of Yelp reviews and reports results across standard evaluation metrics commonly used in the NLP community: accuracy, precision, recall, and F1 score. The architectures employed are described with clarity and rigor to support reproducibility, and the experimental setup is designed to provide a fair and transparent benchmark.
The remainder of this article is organized as follows. The next section describes the dataset and experimental protocol, including the preprocessing steps and model configurations.
Section 3 presents the evaluation metrics and results obtained for each model.
Section 4 discusses the findings in light of prior research, highlighting strengths and limitations. Finally, the last section concludes this study and outlines directions for future work.
3. Results
3.1. Performance Indicators
Accuracy is the ratio of correctly predicted labels to the total number of reviews. Precision is the ratio of correctly predicted positive labels to the total number of predicted positive labels. Recall is the ratio of correctly predicted positive labels to the total number of actual positive labels. A more balanced metric is the F1 score, which is the harmonic mean of precision and recall. The F1 score ranges from 0 to 1, with 1 being the best possible score. A confusion matrix is a more thorough way to visualize and analyze the performance of a sentiment analysis model: it is a table that shows the distribution of predicted labels versus actual labels for each class. It is used to identify precisely where the model is making mistakes and which classes are more difficult to classify.
Accuracy measures the overall correctness:
$$\text{Accuracy} = \frac{\text{Number of correctly classified reviews}}{\text{Total number of reviews}}$$
Precision measures how many predicted positive sentiments are actually correct:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall (sensitivity) measures the proportion of actual positive sentiments correctly identified:
$$\text{Recall} = \frac{TP}{TP + FN}$$
The F1 score combines precision and recall, providing a balanced assessment of the model’s performance in predicting each sentiment category:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Except for accuracy, these metrics require defining the notions of true positives (TPs), false positives (FPs), and false negatives (FNs); these notions are related to the target sentiment considered. For instance, for the negative sentiment class, a true positive is a review labeled as negative and correctly predicted as negative.
When evaluating a sentiment analysis model with two outputs—negative and positive—it is standard practice to treat it as a binary classification problem. In such cases, the F1 score, precision, and recall are typically calculated separately for each class, meaning one has precision and recall for the positive class and precision and recall for the negative class. This approach gives a clearer picture of how well the model performs in each sentiment category, although some works [40] only report the results for the positive class, which can be impractical if the dataset is imbalanced (e.g., more positive than negative samples) or if one class is more important to classify correctly than the other.
For a three-class classification problem, with negative, neutral, or positive as possible outputs, the TPs, FPs, and FNs are also defined for each class separately. For instance, for the positive sentiment class, a TP is a sample predicted by the model as positive that is actually positive, an FP is a sample predicted by the model as positive that is actually negative or neutral, and an FN is a sample predicted by the model as negative or neutral that is actually positive. For the positive class, recall measures how many of the actual positive sentiments the model correctly identified, out of all actual positive sentiments, and precision measures how many of the reviews predicted as positive are actually positive.
Therefore, the performance in each class is measured using three metrics—precision, recall, and F1 score—each supported by a certain number of test samples. We can globally assess the model by its accuracy and the macro-averages of precision, recall, and F1 score.
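As a short illustration, scikit-learn's classification_report produces exactly these per-class and macro-averaged indicators, together with the per-class support; the label vectors below are toy values, not model outputs.

```python
# Per-class and macro-averaged precision, recall, and F1 from predictions.
# The label vectors are toy examples, not outputs of the models evaluated here.
from sklearn.metrics import classification_report, confusion_matrix

labels = ["negative", "neutral", "positive"]
y_true = ["negative", "neutral", "positive", "positive", "neutral", "negative"]
y_pred = ["negative", "positive", "positive", "positive", "neutral", "neutral"]

print(classification_report(y_true, y_pred, labels=labels, digits=2))
print(confusion_matrix(y_true, y_pred, labels=labels))  # rows: actual, columns: predicted
```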
3.2. Experimental Results on Yelp Reviews
Table 4 reports the complete results obtained for each class on the test set containing 10,795 Yelp reviews. LSTM, which was trained on substantially less data than the BERT pipeline methods, attained the best overall results, with an accuracy of 77%, followed closely by the Bi-LSTM method (75%). For all sentiment labels, LSTM achieved the best F1 scores with 80% for negative, 68% for neutral, and 81% for positive sentiments. Bi-LSTM achieved an F1 score performance close to that of LSTM: the bidirectionality of the method did not bring much improvement to the LSTM approach.
Positive and negative classes were well modeled, with all performance metrics in the range of 80–84% for the LSTM and 73–84% for the Bi-LSTM. For the BERT methods, the ranges were larger, with precision, recall, and F1 score values between 40% and 77% for BERTweet, between 55% and 91% for RoBERTa, and between 62% and 88% for BERT.
Neutral sentiment is harder to capture, even for a human, as this sentiment class can be very subjective. Algorithmic approaches for classifying text as neutral struggled to grasp the correct modeling for this target label, although the LSTM-based approaches performed relatively well, with a recall of about 70% on neutral sentiments, meaning that out of ten neutral reviews, seven were correctly labeled by these methods. However, for the LSTM and Bi-LSTM methods, we observed a decrease of about 10 percentage points in the F1 score for the neutral class compared to the positive and negative classes. This drop in performance was far worse for the BERT-based approaches, with F1 scores ranging from 26% for RoBERTa to 46% for BERT.
Although overall RoBERTa performed worse than LSTM, it achieved an impressive recall of 91% on positive sentiments but an F1 score of only 68%, which can be explained by the bias of the method toward positive outputs. On the other hand, BERT achieved an impressive recall of 88% on negative sentiments but with a precision of 62%, reflecting its tendency to classify text input as negative.
Figure 5 displays the distribution of predictions for each model on the Yelp test set, comprising 10,795 reviews equally distributed among the three polarities. The results highlight significant differences in the way these models classify reviews as negative, neutral, or positive. LSTM exhibited the most balanced distribution, classifying reviews almost evenly as negative, neutral, or positive. It did not strongly favor any polarity compared to transformer-based models. BERT leaned toward negative predictions (48% of its predictions), indicating a bias in how it interprets sentiment, possibly being more sensitive to negative cues. BERTweet and RoBERTa favored positive polarity, with 50% and 56% of their predictions, respectively, recognizing positive sentiment.
The confusion matrix of the LSTM model, as displayed in Figure 6, indicates that 8265 reviews out of 10,795 were correctly classified; 2915 out of 3652 negative sentiments were correctly classified, 2529 out of 3586 neutral sentiments were correctly classified, and 2821 out of 3557 positive sentiments were correctly classified. Accuracy is reflected by the diagonal elements, precision corresponds to the columns, and recall corresponds to the rows. Accuracy is indeed the sum of diagonal elements divided by the sum of all elements of the matrix: the darker the diagonal, the better the classifier. For the negative target label, for instance, precision is the number in the first cell divided by the sum of elements in the first column, while recall is the number in the first cell divided by the sum of the elements in the first row.
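This column/row reading can be checked numerically. The sketch below derives per-class precision and recall, as well as accuracy, from a small confusion matrix whose counts are purely illustrative (they are not the values of Figure 6).

```python
# Deriving per-class precision and recall from a confusion matrix
# (rows: actual class, columns: predicted class). The counts are illustrative.
import numpy as np

labels = ["negative", "neutral", "positive"]
cm = np.array([
    [80, 12,  8],   # actual negative
    [15, 70, 15],   # actual neutral
    [ 6, 13, 81],   # actual positive
])

for i, label in enumerate(labels):
    precision = cm[i, i] / cm[:, i].sum()   # diagonal cell over its column
    recall = cm[i, i] / cm[i, :].sum()      # diagonal cell over its row
    print(f"{label:>8}: precision={precision:.2f}, recall={recall:.2f}")

accuracy = np.trace(cm) / cm.sum()          # sum of the diagonal over all cells
print(f"accuracy={accuracy:.2f}")
```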
Figure 7a depicts the confusion matrix of Bi-LSTM computed on the test dataset. It indicates that 8094 out of 10,795 reviews were correctly classified overall; 2680 out of 3652 negative sentiments were correctly classified, 2523 out of 3586 neutral sentiments were correctly classified, and 2891 out of 3557 positive sentiments were correctly classified.
The confusion matrix of the BERT pipeline shows that 7199 out of 10,795 reviews were correctly classified, as illustrated in Figure 7b. Specifically, 3208 out of 3652 negative sentiments were correctly classified, 1231 out of 3586 neutral sentiments were correctly classified, and 2760 out of 3557 positive sentiments were correctly classified. The BERT pipeline performed better than the other BERT-based pipelines and the LSTM-based models on negative sentiments, better than RoBERTa on neutral and negative sentiments, and better than BERTweet on negative sentiments.
The confusion matrix of the BERTweet pipeline shows that 6526 out of 10,795 reviews were correctly classified, as illustrated in Figure 7c. A total of 2133 out of 3652 negative sentiments were correctly classified, 1246 out of 3586 neutral sentiments were correctly classified, and 3147 out of 3557 positive sentiments were correctly classified. The BERTweet pipeline exhibited the worst performance among all the methods investigated. It achieved a recall of 88% on positive sentiments, meaning that many positive reviews were correctly predicted, but with a precision of only 59%, meaning that many reviews from other classes were also classified as positive.
Figure 7d represents the confusion matrix of the RoBERTa pipeline on the test dataset. It indicates that, overall, 6421 out of 10,795 reviews were correctly classified; 2532 out of 3652 negative sentiments were correctly classified, 649 out of 3586 neutral sentiments were correctly classified, and 3240 out of 3557 positive sentiments were correctly classified. The RoBERTa pipeline clearly performed better on positive sentiments and worse on neutral sentiments; neutral reviews tended to be classified as positive, amounting to 2150 of them (60%). Compared to the other methods, it correctly classified more positive reviews, with a recall of 91% on positive sentiments, but with a precision of only 55%, meaning that many other Yelp reviews were also incorrectly classified as positive.
It appears that RoBERTa had an overall tendency to classify text as positive, a bias it shares with BERTweet, as shown in Figure 5, where more than 50% of their predictions had a positive polarity on the Yelp dataset.
We computed the mean of the precision, recall, and F1 score across all classes to obtain the final macro-averaged precision, recall, and F1 score, as shown in Figure 8. The macro-performance indicators are represented on the x-axis, and the performance achieved on the dataset is represented on the y-axis. LSTM achieved accuracy, macro-averaged precision, recall, and F1 score values of 77%. Bi-LSTM achieved accuracy, macro-averaged precision, recall, and F1 score values of 75%, 76%, 75%, and 77%, respectively. BERT achieved accuracy, macro-averaged precision, recall, and F1 score values of 67%, 68%, 67%, and 64%, respectively. BERTweet achieved accuracy, macro-averaged precision, recall, and F1 score values of 60%, 61%, 61%, and 59%, respectively. RoBERTa achieved accuracy, macro-averaged precision, recall, and F1 score values of 59%, 58%, 60%, and 55%, respectively.
Since the target classes were balanced (negative, neutral, and positive reviews were evenly distributed), this graphic of macro-indicators can be used to globally compare the performance of the methods. The LSTM-based methods performed better on all counts, with the BERT pipeline achieving superior results compared to RoBERTa and BERTweet. BERTweet and RoBERTa exhibited similar performance, but BERTweet achieved better macro-precision.
Regarding a binary polarity classification problem with only negative and positive target sentiments, we can compare the DistilBERT method with LSTM by regrouping DistilBERT’s negative and neutral output classes; the test set consisted only of 3652 negative and 3557 positive samples, as specified in Table 1.
Figure 9 illustrates the results achieved by both methods. The performance of DistilBERT and LSTM was similar, with LSTM having the advantage in terms of accuracy, precision, and recall for both classes. We can see that LSTM achieved significantly better recall on negative labels and better precision on positive labels, predicting fewer negative reviews as positive compared to DistilBERT.
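A regrouping of this kind reduces to a simple label mapping applied before scoring. The sketch below merges neutral predictions into the negative class and then computes binary metrics; the label vectors are toy examples, not the predictions behind Figure 9.

```python
# Collapse three-way predictions into the binary negative/positive scheme by
# merging the neutral class into the negative one, then score both classes.
from sklearn.metrics import classification_report

def to_binary(label: str) -> str:
    return "negative" if label in ("negative", "neutral") else "positive"

# Toy predictions and ground truth (the real binary test set contains no neutral samples).
y_true = ["negative", "positive", "positive", "negative"]
y_pred_3way = ["neutral", "positive", "negative", "negative"]

y_pred = [to_binary(label) for label in y_pred_3way]
print(classification_report(y_true, y_pred, digits=2))
```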
3.3. Experimental Results on Other Databases
The results presented in the previous section were obtained solely on the Yelp test dataset, which may lead to domain-specific limitations. It is important to further validate the models’ effectiveness on other datasets with more diversity. We chose three datasets of a more diverse nature: Google Maps restaurant reviews, Amazon product reviews, and the SemEval 2017 Task 4 dataset. The first dataset contains comments about places, the second relates to products purchased online, and the third comprises Tweets related to more general and diverse topics.
The Google Maps restaurant reviews were collected by web scraping: 15k reviews, along with the star ratings given by customers of restaurants in the city of Paris, France, were extracted. Stars were converted to labels: 1–2 stars for negative sentiments, 3 stars for neutral sentiments, and 4–5 stars for positive sentiments. We produced a balanced dataset with 5k reviews per sentiment, as reported in Table 5.
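The star-to-label conversion and the class balancing can be sketched as follows; the file name and column names are assumptions, and the scraping step itself is omitted.

```python
# Convert star ratings to polarity labels and balance the classes.
# Assumes a CSV with "text" and "stars" columns; names are illustrative.
import pandas as pd

def stars_to_label(stars: int) -> str:
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

reviews = pd.read_csv("google_maps_reviews.csv")           # hypothetical scraped file
reviews["label"] = reviews["stars"].apply(stars_to_label)

# Keep 5k reviews per class to obtain a balanced evaluation set.
balanced = reviews.groupby("label").sample(n=5000, random_state=0)
print(balanced["label"].value_counts())
```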
The Amazon product review dataset was extracted from the Amazon Reviews’23 dataset [41]. We used the subset “Subscription Boxes” (https://amazon-reviews-2023.github.io/ (accessed on 23 April 2025)), comprising 16k reviews along with their ratings and metadata. The dataset is imbalanced, with two-thirds of the reviews labeled positive and one-quarter labeled negative.
The last dataset was collected from the International Workshop on Semantic Evaluation. We used the test set of the SemEval 2017 Task 4 subtask A dataset [42], containing 12k hand-labeled Tweets, including the overall sentiments of the Tweets. It is imbalanced, with half of the Tweets labeled neutral and one-third labeled negative, as reported in Table 5.
Figure 10 presents a comparative performance analysis of four sentiment analysis methods—LSTM (red), BERT (deep blue), BERTweet (light blue), and RoBERTa (lighter blue)—across four key metrics: accuracy, macro-precision, macro-recall, and macro-F1 score. The dataset used is the Google Maps restaurant reviews, containing 15.2k reviews distributed equally across sentiments.
Overall, LSTM and BERT performed the best across all metrics, with values slightly above 70%. RoBERTa and BERTweet exhibited poorer performance, with values under 65%. This suggests that traditional deep learning models (LSTM) and established transformer models (BERT) are more effective for sentiment analysis compared to more specialized models such as RoBERTa and BERTweet.
LSTM and BERT maintained a strong balance between precision and recall, ensuring both accurate classification and comprehensive coverage of sentiment categories. BERTweet and RoBERTa lagged behind, indicating potential challenges in capturing nuanced variations in sentiments. Since the F1 score balances precision and recall, its trend followed the same pattern: LSTM and BERT exhibited the best performance, while RoBERTa and BERTweet exhibited weaker performance.
Figure 11 reports the performance results on the Amazon product review dataset. BERT achieved the best performance across all metrics, making it the most effective model in this comparison. This was expected since this BERT model was fine-tuned on a dataset of product reviews. BERTweet and RoBERTa performed similarly, slightly below BERT, suggesting that they are competitive but may require further optimization. LSTM trailed behind the transformer-based models, achieving lower accuracy and macro-F1 score values.
The three BERT-based methods achieved markedly higher accuracy values, outperforming LSTM by a noticeable margin. Since LSTM relies on sequential learning, it struggled with contextual understanding compared to the transformer-based models. Moreover, LSTM was trained on Yelp reviews, which rate businesses, whereas the BERT-based methods were fine-tuned on product reviews or Tweets, which are closer in nature to Amazon product ratings; Google Maps restaurant reviews, by contrast, are more similar to Yelp reviews. The fact that LSTM’s precision and recall values were similar to those achieved by the other methods stemmed from its better classification of the neutral class.
Figure 12 shows the performance results on the SemEval 2017 database. Both BERTweet and RoBERTa achieved consistently higher values across all metrics, with accuracy, macro-precision, macro-recall, and macro-F1 score values of around 70%. These models outperformed LSTM and BERT, highlighting the effectiveness of the domain-specific pretraining and fine-tuning of BERTweet and RoBERTa, which are based on a RoBERTa-base model trained on millions of Tweets and fine-tuned for sentiment analysis on 40 to 45k Tweets. Just like the LSTM trained on Yelp reviews performed better on a Yelp test set, these two methods are better suited for classifying Tweets.
The BERT model was trained on a more general corpus (Wikipedia and BookCorpus) and fine-tuned on a set of product reviews. It was noticeably weaker than the Tweet-trained BERTweet and RoBERTa, as was the LSTM, which lagged behind, especially in the macro-precision and macro-F1 score metrics, which fell below 50%. This reflects the impact of fine-tuning: training BERT-based models on social media texts provides a clear advantage for sentiment detection on Tweets.
4. Discussion
4.1. Findings Inferred from the Results
Overall, we can see that using a large BERT pipeline model—with hundreds of millions of parameters and trained through a time-consuming process involving a very large corpus of data—does not ensure the best performance on the Yelp reviews dataset. Training a deep neural network with two LSTM layers achieved better performance on the Yelp test set, similar to what was observed on the Google Maps restaurant reviews dataset.
Across a range of test sets, transformer models generally outperformed traditional RNN-based models, especially when applying a single model to various text analysis tasks, from general comments, such as Tweets, to movie or product reviews. Nevertheless, LSTM remains competitive: depending on the dataset’s characteristics, training an LSTM can still be a viable option when computational efficiency is a priority. As shown in Table 6, an LSTM model can process a Tweet 20 times faster than a BERT pipeline. Memory usage refers to the peak RAM required to process the 12k Tweets of the SemEval 2017 dataset, although this metric is less critical now that modern hardware routinely provides several gigabytes of RAM.
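Throughput and memory figures of this kind can be reproduced with a small harness such as the one below; it is an illustrative probe rather than the exact measurement protocol behind Table 6, it assumes the psutil package is available, and the resulting numbers naturally depend on hardware.

```python
# Rough throughput/memory probe for a sentiment classifier (illustrative harness,
# not the exact measurement protocol behind Table 6). Assumes psutil is installed.
import time

import psutil
from transformers import pipeline

texts = ["Great phone, terrible battery life."] * 1000  # stand-in for the Tweet set

classifier = pipeline("sentiment-analysis",
                      model="cardiffnlp/twitter-roberta-base-sentiment-latest")

process = psutil.Process()
start = time.perf_counter()
classifier(texts, batch_size=32)
elapsed = time.perf_counter() - start

print(f"{len(texts) / elapsed:.1f} texts/s, "
      f"resident memory ~ {process.memory_info().rss / 1e6:.0f} MB")
```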
Beyond architecture and data considerations, our analysis on the Yelp dataset revealed interesting behavioral differences among models. RoBERTa, for instance, demonstrated a strong bias toward predicting positive sentiments and performed poorly on neutral sentiments. BERTweet exhibited a similar tendency. Despite these shortcomings, RoBERTa achieved an impressive recall of 91% on positive sentiments, although this was accompanied by a relatively low F1 score of 68%, reflecting its over-prediction of the positive class. Conversely, BERT achieved a recall of 88% on negative sentiments but with only 62% precision, indicating a tendency to classify inputs as negative.
These patterns highlight the trade-off between class-specific performance and overall balance. While transformer-based models excel in detecting particular classes, their predictions may be skewed without fine-tuning or calibration. This reinforces the value of interpretable and customizable models like LSTM, especially when balanced performance across classes is required.
Finally, we note that although BERT-based pipelines are easier to deploy and require minimal coding skills—making them appealing for non-specialist users—training an LSTM model requires collecting domain-specific data, understanding deep learning concepts, and implementing training workflows. Thus, the choice of model should take into account not only predictive performance but also deployment constraints, technical expertise, and resource availability.
4.2. Comparison with Related Works
Bi-LSTM models are theoretically advantageous over LSTMs due to their ability to capture contextual information from both past and future tokens. However, our results do not fully support this assumption. One possible explanation is that, in the context of sentiment analysis on datasets such as Yelp reviews, sentiment is often expressed through specific keywords or phrases that are already sufficiently informative when processed in a unidirectional manner. Our findings are consistent with those reported by Hussain and Naseer [43], who conducted a comparative analysis of Logistic Regression, LSTM, and Bi-LSTM models on the IMDB movie reviews dataset. Their results showed that while Bi-LSTM achieved slightly higher accuracy than LSTM (87% vs. 86%), the improvement was marginal.
Loureiro et al. [26] reported that, on the TweetEval dataset containing 11k test Tweets (SemEval 2017 Task 4), BERTweet and the original RoBERTa-base (not fine-tuned) attained macro-recall values of 73%, while Bi-LSTM achieved only 58%. In contrast, our results (Figure 8) showed macro-recall values of 74% for Bi-LSTM and 60% for RoBERTa and BERTweet. The Bi-LSTM architecture and training details were not reported in [26], so it is difficult to interpret the difference. A possible explanation for BERTweet’s good results on TweetEval is that BERTweet was fine-tuned on a dataset of Tweets, so its performance was better on textual data of the same type and worse on Yelp reviews, as they are not the same type of comments.
Perez et al. [25] tested the BERTweet pipeline on the SemEval 2017 Task 4 dataset and reported performance using macro-F1 scores. In sentiment analysis, they reported similar macro-F1 scores of around 70% for BERT and RoBERTa, and around 72% for BERTweet. On our Yelp dataset, the macro-F1 score for BERTweet was closer to 60%. According to Perez et al. [25], the performance of BERTweet varied depending on the test dataset. For instance, on the Amazon product reviews dataset, the macro-F1 score was only 63%, similar to what we achieved on the Yelp dataset. This indicates that BERTweet is more suitable for Tweet-like textual polarity detection.
Iqbal et al. [40] compared different LSTM models for polarity detection of consumer reviews on a Yelp dataset (the authors did not report the number of samples tested). One of the models tested contained two LSTM layers (Model 3), similar to the one we built. Model 3 achieved a precision of 71%, a recall of 60%, an F1 score of 61%, and an accuracy of 81%. Our results cannot be directly compared for three reasons: first, their task was a binary problem with negative and positive target classes; second, they did not document the Yelp reviews they used; and third, they did not stipulate which target class these metrics related to. Since they did not mention that they used macro-metrics, one can assume that the results relate to the positive sentiment target class. For this class, our LSTM model achieved a precision of 84%, an accuracy of 79%, and an F1 score of 68%, a similar level of accuracy. The difference was probably due to the architecture: Iqbal et al. chose two identical consecutive layers, LSTM(100)–LSTM(100), each with 100-dimensional hidden states, whereas our architecture follows the common deep learning practice of progressively reducing layer width, with 128 units in the first LSTM layer and half of that in the second, LSTM(128)–LSTM(64), which allows for better feature extraction.
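In Keras terms, the stacked two-layer design with decreasing width described above looks roughly as follows; the vocabulary size, embedding dimension, and dense head are illustrative choices rather than the full training configuration.

```python
# Sketch of a stacked LSTM(128)–LSTM(64) classifier with decreasing layer width.
# Vocabulary size, embedding dimension, and the dense head are illustrative choices.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=20_000, output_dim=100),
    layers.LSTM(128, return_sequences=True),  # first layer passes the full sequence on
    layers.LSTM(64),                          # second, narrower layer summarizes it
    layers.Dense(3, activation="softmax"),    # negative / neutral / positive
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```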
4.3. Methodological Limitations
One limitation of this study is that all reviews used to train the LSTM-based methods were sourced from the Yelp Open Dataset. Although the corpus is rich and diverse, it is domain-specific. To confirm the results, further evaluations should be conducted using a training dataset that combines samples from other review datasets from different platforms or domains.
Additionally, the architecture of the LSTM used here was arbitrarily fixed to two layers. Only the training hyperparameters were optimized using grid search. In future work, we aim to explore architectural variations such as the number of layers and the size of the hidden state. The impact of the sequence length and embedding dimensions—parameters often set arbitrarily in practice—also warrants more systematic investigation.
The size of the training dataset is another important factor to consider. While we used a balanced dataset containing over 100,000 reviews, it remains to be seen how performance scales with more or fewer samples. Understanding the minimal dataset size needed to achieve satisfactory performance could help guide applications in resource-constrained scenarios.
4.4. Recent Developments in Transformer-Based Models and LLMs
Very recent advances in transformer-based architectures have led to the development of more efficient and robust models, such as ModernBERT [44], which introduces improvements over the original BERT in terms of training stability, representation quality, and scalability. However, despite these technical advances, no tri-polar sentiment classification pipeline based on ModernBERT was available at the time of our experiments. While some variants of ModernBERT have been fine-tuned for multilingual sentiment analysis or emotion detection, these models are not directly applicable to our setting, which focuses on polarity-based classification (positive, neutral, and negative) over structured review data such as Yelp. In contrast, the encoder-only transformer models we selected—BERT, RoBERTa, DistilBERT, and BERTweet—are all supported by readily available and reproducible pipelines tailored to tri-polar sentiment classification. This practical consideration, combined with their proven performance and efficient integration into large-scale workflows, motivated our choice to focus on these architectures rather than exploring recent alternatives that, although promising, were not yet mature or aligned with our specific task.
The field of sentiment analysis has undergone a significant transformation with the rise of large language models (LLMs), such as LLaMA, Mistral, and GPT-4. These models have set new benchmarks for language understanding, surpassing traditional encoder-based architectures like BERT and RoBERTa in various sentiment classification tasks. One particularly notable advancement is LLM2vec [45], an approach that leverages the embeddings from these large models to enhance sentiment analysis performance across multiple domains.
Recent studies have demonstrated that LLMs can achieve state-of-the-art results in sentiment classification, even in zero-shot and few-shot learning settings [46]. Despite their impressive capabilities, LLMs introduce new challenges in sentiment analysis. Their reliance on massive computational resources and specialized fine-tuning techniques makes them less accessible to practitioners who require ready-to-use NLP solutions.
In contrast to LLMs, Hugging Face pipelines provide a more user-friendly approach to sentiment analysis. These pipelines allow researchers and practitioners to apply pretrained transformer models with minimal setup, making them ideal for applications where ease of use is prioritized over cutting-edge performance. While LLMs offer superior accuracy in complex sentiment tasks, Hugging Face pipelines remain a practical choice for those seeking efficient, accessible, and interpretable NLP solutions.
5. Conclusions
This paper presents a comparative study of LSTM-based models and BERT-based pipeline models for sentiment analysis using the Yelp Open Dataset. We evaluated the performance of LSTM, Bi-LSTM, and some of the most popular transformer-based pipelines available on Hugging Face—DistilBERT, BERT-base, RoBERTa, and BERTweet—using standard metrics, including accuracy, precision, recall, and F1 score.
Our results on the Yelp dataset showed that a domain-specific trained LSTM model outperformed all other models in overall performance, achieving 77% across all macro-level metrics, despite having significantly fewer parameters than the BERT-based models. The Bi-LSTM architecture yielded results comparable to the LSTM, indicating that bidirectional processing did not provide substantial added value in this context. It appeared that RoBERTa had an overall tendency to classify text as positive, a bias it shares with BERTweet. RoBERTa also performed poorly in classifying neutral Yelp reviews. Although overall RoBERTa performed worse than LSTM, it achieved an impressive recall of 91% on positive sentiments but an F1 score of only 68%, which can be explained by the bias of the method toward positive outputs. On the other hand, BERT achieved an impressive 88% recall on negative sentiments but with a precision of 62%, reflecting its tendency to classify text inputs as negative.
To further assess the robustness and transferability of the LSTM model, we evaluated its performance on three additional datasets of varying nature: Google Maps restaurant reviews, Amazon product reviews, and the SemEval 2017 Task 4 dataset containing manually annotated Tweets. On the Google Maps restaurant reviews dataset, which shares a similar domain and structure with the Yelp reviews dataset, the LSTM achieved performance levels comparable to BERT, with the macro-level metrics slightly above 70%, outperforming RoBERTa and BERTweet, which remained below 65%. However, on the Amazon product reviews dataset, BERT clearly outperformed all other models, including LSTM, which trailed behind with lower accuracy and macro-F1 scores, especially due to the domain shift and imbalanced class distribution. On the SemEval Tweet dataset, transformer models fine-tuned on social media text—namely BERTweet and RoBERTa—demonstrated superior performance, achieving macro-scores of around 70%, while LSTM lagged behind with macro-precision and F1 score values below 50%.
These findings suggest that well-tuned LSTM architectures offer a lightweight and effective alternative to large pretrained transformer models, particularly for structured and domain-specific datasets like Yelp reviews. Nonetheless, BERT-based pipelines remain attractive due to their ease of use, rapid implementation, and generalization capabilities. In particular, when a single model must handle texts of diverse natures, BERT-based models are better suited, since LSTM performance decreases when the test set differs in nature from the training data. LSTMs are simpler and faster but offer weaker generalization capabilities.
Future work will expand the scope of evaluation to explore architectural variations in the LSTM network (e.g., number of layers and hidden size) and assess the impact of embedding dimensions and sequence length. Investigating the relationship between dataset size and model performance will also be critical in determining the optimal training strategies and the scalability of these models for real-world sentiment classification tasks.