Hybrid Ensemble of Large Language Models and Fractional Derivative Features for Domain-Specific Engineering Sentiment Analysis

Karim, Abdul; Triandini, Evi; Lee, Seoyeong; Jeong, In cheol

doi:10.3390/app16094266

Open Access Article

¹ Cerebrovascular Disease Research Center, Hallym University, Chuncheon 24252, Republic of Korea
² Department of Artificial Intelligence Convergence, Hallym University, Chuncheon 24252, Republic of Korea
³ Department of Information Systems, Institute Technology and Business STIKOM, Bali 80223, Indonesia
⁴ Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(9), 4266; https://doi.org/10.3390/app16094266

Submission received: 19 March 2026 / Revised: 3 April 2026 / Accepted: 6 April 2026 / Published: 27 April 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

This study addresses the need for applied sentiment analysis in engineering decision-support systems by presenting a hybrid sentiment classification framework for domain-specific engineering text that integrates transformer-based semantic embeddings with fractional-order feature modeling. The proposed BERTLR framework combines BERT and RoBERTa representations with Grünwald–Letnikov fractional derivative-enhanced TF-IDF features and logistic regression within a unified soft-voting architecture. Unlike conventional ensemble sentiment models that merely aggregate embeddings and handcrafted features, the proposed method introduces a fractional-order feature transformation to capture non-local dependency patterns and memory-aware lexical variations that are often overlooked in technical review text. This design provides a structured fusion of contextual semantic information and fractional statistical representations, supported by SHAP-based explainability and ablation analysis. Experiments on user reviews from six real-world application domains show consistent improvements over conventional TF-IDF models, LSTM baselines, and non-fractional transformer variants. The framework achieves up to 91% accuracy, together with strong precision, recall, and F1-score. These results demonstrate that fractional-order feature augmentation can provide a meaningful complementary signal to transformer embeddings, offering an interpretable and effective sentiment analysis solution for engineering and industrial decision-support applications.

Keywords:

knowledge engineering; hybrid intelligent systems; large language models; fractional calculus; domain-specific sentiment analysis; engineering applications

1. Introduction

Deep learning has rapidly transformed a wide range of real-world applications, resulting in demonstrable gains in industries such as telecommunications, finance, healthcare, natural language processing (NLP), and intelligent robotics. Large language models (LLMs), particularly Bidirectional Encoder Representations from Transformers (BERT), are now crucial tools in NLP, allowing context-aware and semantically rich text processing [1,2,3]. Traditional sentiment analysis approaches frequently encounter limitations in interpreting nuanced language, often resulting in misclassification. To address such limitations, models like BERT adopt deep bidirectional architectures, offering improved accuracy over conventional statistical or rule-based methods [4,5]. These developments underscore the growing need for advanced text-mining solutions capable of managing large-scale, user-generated datasets such as customer reviews and online feedback [6]. From an applied perspective, robust sentiment classification of technical review data is increasingly important for engineering monitoring, product improvement, customer feedback analysis, and industrial decision-support systems. This practical relevance aligns well with the growing use of applied artificial intelligence methods for domain-specific text analytics in real-world engineering environments.

As digital platforms generate increasing volumes of user content, the need for more adaptable sentiment analysis tools continues to grow. Existing techniques frequently fail to capture complex linguistic structures, implicit sentiment cues, or dynamic sentiment patterns in specialized text. Moreover, limited attention has been given to integrating fractional mathematical modeling with transformer-based embeddings for sentiment classification in technical and domain-specific settings, where sentiment is often expressed through sparse, context-dependent, and terminology-rich language. In this context, we propose a hybrid framework that combines fractional derivative-based feature extraction, transformer-based models, and ensemble learning. Specifically, we extend our previous BERT + Logistic Regression (BERTLR) model by incorporating a RoBERTa-based variant and a unified soft-voting ensemble. This framework combines the semantic richness of LLMs with the mathematical depth of fractional calculus, offering improved expressiveness for classification tasks [7,8].

Although transformer models such as BERT and RoBERTa excel in modeling contextual semantics via self-attention, they may not fully capture long-memory effects or non-linear lexical variations inherent in domain-specific language. This limitation is particularly evident when analyzing sentiment-rich engineering feedback, short technical reviews, or ambiguous expressions in which polarity depends not only on semantics but also on subtle term interactions. Fractional calculus offers a rigorous mathematical tool to address this gap by modeling memory-dependent behaviors and capturing long-range dependencies. By transforming conventional TF and TF-IDF features using fractional derivatives, we enrich the representation space before classification. This dual-channel approach, combining deep contextual embeddings with fractional statistical features, enables the model to preserve semantic information while also incorporating non-local lexical dynamics that are often overlooked in standard transformer-only sentiment pipelines.

Ensemble learning has been shown to be an effective approach for increasing prediction robustness by aggregating multiple models. Methods such as bagging, boosting, AdaBoost, and Random Forest improve sentiment classification accuracy by reducing overfitting and generalization errors [9,10]. Nevertheless, simply combining several existing components does not automatically yield a meaningful methodological contribution. The key challenge is to design a fusion mechanism in which the components provide complementary information rather than redundant predictions. Building on this perspective, our proposed system employs both hard and soft voting mechanisms to unify the outputs from BERTLR and RoBERTaLR, while the fractional derivative-enhanced channel introduces an additional feature representation that complements transformer embeddings instead of merely duplicating their role. In this sense, the proposed framework is not only an ensemble of classifiers, but a structured hybrid representation model for engineering sentiment analysis.

Sentiment analysis is an important technique for organizations seeking actionable insights from consumer feedback. Machine learning (ML) and deep learning (DL) models have emerged as effective methods for automated sentiment recognition and trend forecasting. However, identifying the most effective approach remains challenging due to the diversity of available algorithms, each with varying levels of interpretability and performance across domains [11,12]. Comparative research highlights the need for hybrid methods that can adapt to dataset-specific nuances while balancing generalization and accuracy [7,11,13]. In engineering review analysis, these challenges are further amplified by highly specific terminology, irregular sentiment markers, and domain-dependent expressions, motivating the need for models that combine semantic understanding with mathematically grounded feature enrichment. As shown in Figure 1, the proposed sentiment classification framework integrates BERTLR and RoBERTaLR models.

Recent breakthroughs in hybrid modeling, which combine machine learning and deep learning approaches, have produced promising results in sentiment classification problems. Support vector machines (SVM), random forests (RF), artificial neural networks (ANN), and gradient-boosted regression trees (GBRT) have shown effectiveness in high-dimensional textual domains [9,11,14]. Integrating fractional derivative-based transformations into these frameworks adds granularity to feature extraction and may further improve classification accuracy.

In addition, recent research has explored sentiment analysis within engineering design and product development contexts, where fine-grained sentiment information is used to guide optimization and redesign processes. For example, ontology-based sentiment analysis combined with evolutionary design strategies has been applied to extract aesthetic and functional preferences from user feedback, enabling intelligent product redesign [15]. While such approaches demonstrate the importance of domain-specific sentiment modeling in engineering applications, they primarily rely on structured semantic representations and optimization-driven frameworks, without explicitly capturing memory-dependent lexical interactions.

Despite this promise, prior studies have rarely examined whether fractional-order feature modeling can provide a complementary signal to transformer embeddings in a unified and interpretable sentiment analysis framework. This unresolved issue forms the central motivation of the present work. Accordingly, this study is positioned not only as a methodological contribution in hybrid AI-based sentiment modeling but also as an applied framework for engineering-oriented text intelligence and decision support.

In this study, we analyze and compare the performance of different ML and DL models on a public engineering review dataset, with the extended ensemble, which includes both BERT and RoBERTa models, serving as the primary architecture. We evaluate classification performance using common metrics including precision, recall, F1-score, and accuracy. The key contributions of this study are summarized below:

A hybrid semantic–fractional framework that combines BERT and RoBERTa embeddings with logistic regression and fractional derivative-based feature transformation, enabling the integration of contextual semantic information and memory-aware statistical text representations.
A mathematically motivated feature enrichment strategy in which fractional derivative-enhanced TF-IDF is used to model non-local and long-range lexical variations that are not explicitly captured by standard TF-based or embedding-only sentiment pipelines.
An extended ensemble design that includes BERTLR, RoBERTaLR, and soft-voting integration, allowing us to examine both individual and complementary contributions of transformer-based variants within a unified framework.
Empirical validation across six application categories of technically oriented reviews, showing that fractional derivative-enhanced TF-IDF consistently outperforms standard TF representations and provides a useful complementary signal to transformer embeddings.
An interpretable evaluation setting for industrial-scale opinion mining and decision support, supported by comparative experiments, ablation-style analysis, and explainability-oriented assessment of the proposed framework.

This paper is organized as follows. Section 2 examines the current state of the art. Section 3 outlines the methodology and implementation details. Section 4 analyses the experimental results. Section 5 concludes with future directions.

2. Related Work

The growing complexity of user-generated content and large-scale online feedback has motivated the development of advanced classification frameworks capable of extracting reliable and actionable information from text. A recent study introduced an ensemble detection architecture utilizing a voting classifier that incorporates 11 machine learning algorithms, such as Naïve Bayes and K-Nearest Neighbours (K-NN). Following cross-validation, the three most effective classifiers were integrated to create a robust ensemble model. The framework attained a classification accuracy of 94.5% and demonstrated robust performance as indicated by ROC curves, recall, and F1-score [15]. Although originally developed for fake news detection, this study demonstrates the broader effectiveness of ensemble-based classification for complex text mining and AI-driven decision-support systems.

Misinformation and deceptive user content continue to pose significant challenges across online platforms, with broad socio-political and economic implications. Ensuring the credibility of digital content has therefore become essential, while traditional classifiers often struggle to adapt to evolving text patterns [16]. In sentiment analysis, particularly for customer reviews, this challenge is equally important because unreliable or noisy language can distort sentiment models and reduce classification quality. Basic bag-of-words (BoW) models provide sparse representations and often fail to capture semantic richness, even when extended with neural embeddings [17]. This limitation has encouraged the design of more expressive feature extraction and model fusion strategies.

To address these shortcomings, multi-view learning frameworks that combine bag-of-n-grams with parallel convolutional neural networks (CNNs) have been developed. These models incorporate embedding layers with small-kernel convolutions, improving their ability to capture local semantic dependencies and increasing robustness in text classification tasks. In the context of fraudulent review detection, the integration of textual and behavioral features has produced F1-scores of up to 92% on datasets such as Yelp Filtered Reviews [18]. These findings suggest that hybrid models can benefit from combining heterogeneous signals rather than relying on a single feature family.

Cognitive computing and computational modeling are increasingly applied in sentiment-driven applications, particularly in domains such as finance, telecommunications, and business intelligence. The quality and structure of textual data play a central role in model performance. One study found that high readability and concise review length significantly improved the accuracy of models such as SRN, CNN, and LSTM across benchmark sentiment datasets [19]. Regression-based analyses further confirmed that readability-related variables correlate with improved classification outcomes, while time-sensitive sentiment evaluation strategies often outperform simple preprocessing-only pipelines [20]. These studies indicate that sentiment classification benefits not only from stronger classifiers but also from richer and more informative representations of text.

Beyond binary polarity prediction, recent work has focused on improving the granularity and robustness of sentiment classification through hybrid modeling. To enhance classification effectiveness, ensemble systems have been developed by combining deep learning and classical machine learning [21]. Among ensemble strategies, boosting-based approaches have often outperformed bagging, with optimized combinations yielding superior results [22]. Similarly, hybrid models that fuse machine learning and deep learning components, particularly those incorporating CNNs, have demonstrated robust performance in sentiment and misinformation detection tasks. These results support the view that ensemble learning can improve sentiment analysis when the constituent models contribute complementary information.

Recent ensemble classifiers have combined multiple base learners, including SVC, Random Forest, Decision Tree, and MLP, within unified architectures. The addition of deep learning components has further strengthened model resilience on large-scale datasets, improving precision, recall, and F1-score [23]. However, the dynamic and informal nature of user-generated language remains a major challenge. Research targeting platforms such as Twitter and Facebook emphasizes the need for architectures that can better manage linguistic variability, implicit sentiment, and domain-dependent expressions [24]. This challenge becomes even more pronounced in engineering review corpora, where sentiment is often embedded in technical terminology and short context-dependent statements.

Advanced feature extraction techniques, including TF-IDF, BoW, Word2Vec, and BERT embeddings, remain central to modern sentiment classification. When integrated with ensemble classifiers such as LR, NB, SVM, XGBoost, and hybrid neural models, these representations can substantially improve predictive performance. Comparative investigations involving TF-IDF, LSA, and LDA found that TF-IDF is particularly effective for large datasets, whereas smaller datasets require additional semantic modeling to achieve competitive performance [25]. Ensemble deep learning frameworks have also outperformed traditional pipelines in multiple settings. For instance, random forests achieved 87.1% accuracy, while SVM classifiers showed stronger precision and F1-score in some tasks, highlighting the advantages of hybrid AI approaches [26]. Nevertheless, most of these studies still rely on standard lexical or embedding-based representations and do not explicitly model non-local memory effects in text.

Modern sentiment analysis increasingly underpins automated decision-making across industries. Traditional methods often suffer from manual feature engineering bias and limited adaptability, whereas the integration of ML and DL techniques offers scalable alternatives for customer feedback analytics and business intelligence systems [27]. In addition, computational intelligence approaches, including machine learning-enabled liquid democracy models, illustrate the broader potential of AI-assisted decision frameworks in adaptive classification settings [28]. These developments reinforce the importance of sentiment models that are not only accurate but also sufficiently flexible and interpretable for real-world deployment.

With the emergence of LLMs, BERT has reshaped NLP tasks through pre-trained contextual embeddings and task-specific fine-tuning. Recent studies have further improved sentiment classification by leveraging sentence pairing and transformer-based contextualization [29]. Compared with earlier shallow or CNN-based sentiment models, transformer architectures provide substantially richer semantic representations. However, transformer embeddings alone may not fully characterize non-linear lexical variation or long-memory effects in domain-specific review text. Our approach therefore combines deep learning and classical machine learning within a hybrid ensemble architecture and introduces a soft-voting integration strategy to enhance sentiment classification performance [30]. In parallel, fractional calculus, as an extension of classical differentiation, has recently been explored for AI-based modeling. Numerical computation of fractional derivatives has shown promise in engineering applications, especially for extracting richer and more non-local features in AI-based text classification settings [31,32].

Despite these advances, three important gaps remain. First, prior sentiment analysis studies rarely integrate transformer-based contextual embeddings with fractional-order lexical feature transformation in a unified framework. Second, many hybrid sentiment models combine components heuristically without clearly exploiting complementary semantic and memory-aware statistical representations. Third, limited work has examined this problem in technical or engineering review domains, where sentiment is often expressed through specialized terminology, implicit evaluation, and context-sensitive wording. These gaps motivate the proposed framework, which integrates BERT/RoBERTa representations with fractional derivative-enhanced TF-IDF features to construct a more expressive and interpretable sentiment classification pipeline.

3. Materials and Methods

The dataset was analyzed and classified using machine learning (ML) and deep learning (DL) approaches, considering key performance metrics such as precision, F1-score, recall, and accuracy. A comparative analysis was conducted between individual ML/DL models and ensemble learning classifiers, incorporating fractional derivative-based feature extraction, along with hard and soft voting strategies. In addition, new experiments were introduced using both BERT and RoBERTa, as well as a BERT–RoBERTa ensemble, to improve the robustness of sentiment prediction.

Various classification and regression models from ML and DL domains were explored to identify optimal methods for sentiment prediction. This section provides both a theoretical and technical overview of the research methodology, detailing the analytical framework used to process Google Play Store reviews.

The dataset was originally collected using an application programming interface (API), enabling programmatic extraction of user-generated reviews and ratings. Because of when the data were collected, the exact API version is no longer available. The collected data were subsequently preprocessed and analyzed using Python (version 3.9).

3.1. Dataset

The study uses a large-scale dataset of user reviews and ratings from multiple applications across diverse categories, as depicted in Figure 2. The dataset comprises 322,528 reviews spanning six distinct categories (Action, Casual, Entertainment, Music & Audio, Photography, and Card), ensuring a comprehensive representation of user sentiment across various application domains. The data were systematically processed and visualized to enhance their interpretability, facilitating a deeper understanding of sentiment trends and linguistic patterns. Figure 3 illustrates the rating system employed to assess the performance of the selected classifiers. To ensure data integrity and compliance, the dataset was drawn from publicly available, ethically obtained user-generated content. Although the dataset is derived from Google Play Store reviews rather than a traditional engineering benchmark, it is suitable for evaluating engineering-oriented sentiment analysis because many application reviews contain functional, technical, and performance-related feedback. Users frequently comment on reliability, usability, efficiency, interface behavior, compatibility, and system responsiveness, which are closely related to engineering and product quality assessment. Therefore, the dataset provides a realistic applied setting for studying sentiment classification in technically oriented review text. In this context, the term “engineering sentiment analysis” is used in an applied sense, referring to sentiment evaluation of technically oriented user feedback rather than strictly domain-specific engineering corpora.

3.2. Methodology

This study proposes a hybrid sentiment classification methodology that integrates logistic regression (LR) with transformer-based language representations and fractional-order feature modeling. In particular, Bidirectional Encoder Representations from Transformers (BERT) and its optimized variant RoBERTa are used to capture contextual semantic representations of review text, while fractional derivative-enhanced lexical features are employed to model statistical and memory-dependent patterns in textual data. Unlike conventional sentiment pipelines that rely solely on transformer embeddings or traditional TF-IDF representations, the proposed framework combines semantic embeddings and fractional-order feature transformations within a unified ensemble architecture. This design allows the model to leverage both contextual language understanding and mathematically enriched lexical representations.

To evaluate the effectiveness of this approach, several classifiers and feature extraction techniques were examined. The evaluated models include Support Vector Machine Classifier (SVMC), Decision Tree Classifier (DTC), Random Forest Classifier (RFC), Naïve Bayes Classifier (NBC), Logistic Regression Classifier (LRC), AdaBoost Classifier (ABC), and transformer-based architectures such as BERT. In addition to these individual models, the proposed framework introduces extended variants based on BERTLR and RoBERTaLR as well as a hybrid ensemble combining both representations. These models were evaluated using both term frequency (TF) and term frequency-inverse document frequency (TF-IDF) features, together with ensemble decision mechanisms including soft voting and hard voting. The overall workflow of the proposed classification framework is illustrated in Figure 4.
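The TF versus TF-IDF comparison across classical classifiers can be sketched with scikit-learn pipelines. This is a minimal illustration, not the authors' code: the source does not report hyperparameters, so near-default settings are assumed, `LinearSVC` stands in for SVMC (the kernel is not stated), and the toy reviews and the `compare` helper are hypothetical.

```python
from sklearn.base import clone
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Near-default hyperparameters; the source does not report its settings.
CLASSIFIERS = {
    "SVMC": LinearSVC(),  # linear kernel is an assumption
    "DTC": DecisionTreeClassifier(random_state=0),
    "RFC": RandomForestClassifier(random_state=0),
    "NBC": MultinomialNB(),
    "LRC": LogisticRegression(max_iter=1000),
    "ABC": AdaBoostClassifier(random_state=0),
}
VECTORIZERS = {"TF": CountVectorizer, "TF-IDF": TfidfVectorizer}


def compare(train_texts, y_train, test_texts, y_test):
    """Held-out accuracy for every (feature type, classifier) combination."""
    scores = {}
    for feat_name, make_vec in VECTORIZERS.items():
        for clf_name, clf in CLASSIFIERS.items():
            model = make_pipeline(make_vec(), clone(clf))  # fresh, unfitted copy
            model.fit(train_texts, y_train)
            scores[(feat_name, clf_name)] = model.score(test_texts, y_test)
    return scores


# Hypothetical toy data: 0 = negative, 1 = positive.
train_texts = ["app crashes constantly", "great smooth fast app", "terrible battery drain",
               "love this reliable app", "crashes after every update", "fast and great design"]
y_train = [0, 1, 0, 1, 0, 1]
scores = compare(train_texts, y_train, ["great fast app", "terrible crashes"], [1, 0])
for key, acc in sorted(scores.items()):
    print(key, acc)
```

Cloning each estimator inside the loop keeps the TF and TF-IDF runs independent, so no fitted state leaks between feature types.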

The first phase of this study involved preprocessing the dataset. Application reviews were collected using API-based access and publicly available data sources from the Google Play Store. The collected reviews were then preprocessed to clean and normalize the text by removing unwanted characters, trimming leading spaces, eliminating excessive spaces, and converting text to lowercase. The following figure illustrates the classification mechanism for Google Play Store reviews:

To further improve the quality of textual representations, stop-word removal and stemming techniques were applied during preprocessing. After refining the text, a Bag-of-Words (BoW) representation was initially constructed, followed by term frequency (TF) analysis using Python (version 3.9). The widely used term frequency-inverse document frequency (TF-IDF) approach was then employed to produce more informative lexical representations. TF-IDF provides a weighting mechanism that highlights informative terms while reducing the impact of common or non-discriminative words. Figure 4 illustrates the TF-IDF-based representation used prior to the feature extraction stage.
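The cleaning and weighting steps above can be sketched as follows. This is an illustrative sketch only: the regular expressions, the tiny stop-word list, and the sample reviews are assumptions, stemming is omitted to keep the example dependency-light (a stemmer such as NLTK's `PorterStemmer` would normally be applied to each token), and scikit-learn's `CountVectorizer` and `TfidfVectorizer` are assumed as the TF and TF-IDF implementations.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny, hypothetical stop-word list; a real pipeline would use a full list.
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "to", "of", "this", "after"}


def clean_review(text: str) -> str:
    """Lowercase, drop unwanted characters, collapse spaces, remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation and symbols -> space
    text = re.sub(r"\s+", " ", text).strip()  # trim and collapse whitespace
    return " ".join(t for t in text.split() if t not in STOP_WORDS)


reviews = [
    "  The app CRASHES after the update!!  ",
    "Fast, smooth, and reliable. Love it.",
    "Battery drain is terrible; the app crashes often.",
]
cleaned = [clean_review(r) for r in reviews]

tf = CountVectorizer()  # raw term counts (bag-of-words / TF)
X_tf = tf.fit_transform(cleaned)

tfidf = TfidfVectorizer()  # TF reweighted by (smoothed) inverse document frequency
X_tfidf = tfidf.fit_transform(cleaned)

print(cleaned[0])  # app crashes update
```

Because "update" occurs in one review and "app" in two, TF-IDF assigns "update" the larger IDF weight, which is the down-weighting of common terms described above.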

In addition to conventional TF and TF-IDF representations, this study incorporates fractional derivative-based transformations to enrich lexical features. Fractional calculus extends classical differentiation to non-integer orders and has been shown to capture long-memory effects and non-local relationships in data representations. By applying fractional derivatives to TF and TF-IDF representations, the proposed framework enhances the sensitivity of feature vectors to subtle lexical variations and term dependencies. This transformation produces a richer representation space that complements the contextual semantic embeddings produced by transformer models.
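A truncated Grünwald–Letnikov (GL) fractional difference, the operator named in Algorithm 1, can be applied to feature rows as sketched below. The source does not state along which axis the operator runs, how many lags are kept, or whether the output is re-normalized; here the vocabulary (column) order is assumed to be the sequence axis and `window` (a hypothetical parameter) caps the memory length. The helper names are illustrative.

```python
import numpy as np


def gl_weights(d: float, n_terms: int) -> np.ndarray:
    """Grünwald–Letnikov weights w_k = (-1)^k * binom(d, k).

    Computed with the recursion w_0 = 1, w_k = w_{k-1} * (k - 1 - d) / k.
    """
    w = np.empty(n_terms)
    w[0] = 1.0
    for k in range(1, n_terms):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w


def gl_frac_diff(X, d: float, window: int = 10) -> np.ndarray:
    """Truncated GL fractional difference of order d along the last axis.

    Assumption: columns are treated as an ordered sequence and only
    `window` lagged columns contribute (finite memory).
    """
    X = np.asarray(X, dtype=float)
    w = gl_weights(d, window)
    out = w[0] * X
    for k in range(1, min(window, X.shape[-1])):
        # column j receives w_k * X[..., j - k]; columns j < k have no lag k
        out[..., k:] += w[k] * X[..., :-k]
    return out


row = np.array([[1.0, 3.0, 6.0, 10.0]])
print(gl_frac_diff(row, d=1.0))  # ordinary first difference
print(gl_frac_diff(row, d=0.5))  # every earlier column contributes with decaying weight
```

With d = 1 the weights reduce to (1, -1, 0, ...) and the operator is the ordinary first difference; with d = 0 it is the identity. Intermediate orders keep a decaying contribution from all earlier positions, which is the "memory" the text refers to. For a sparse TF-IDF matrix the input would first need densifying (or a sparse-aware implementation).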

Ensemble learning, a powerful machine learning strategy, was employed to improve predictive performance by integrating multiple classifiers. Instead of relying on a single model, the ensemble framework aggregates the predictions of multiple models that capture different aspects of the input data. In the proposed framework, transformer-based models capture semantic context, while fractional feature transformations enhance lexical representation. Their integration within an ensemble architecture enables complementary learning signals to contribute to the final prediction.

Two ensemble decision strategies were implemented:

Hard voting: Each classifier in the ensemble casts a vote for a class, and the class receiving the majority of votes is selected as the final prediction.
Soft voting: Each classifier generates a probability distribution over classes, and the class with the highest cumulative probability across classifiers is selected as the final prediction.
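The difference between the two strategies shows up when a minority of models is highly confident. The sketch below implements both rules over per-model probability matrices; the probabilities are made-up numbers, and breaking hard-voting ties toward the lowest class index is an arbitrary choice made here, not something the source specifies. scikit-learn's `VotingClassifier` (with `voting="hard"` or `voting="soft"`) offers the same rules for fitted estimators.

```python
import numpy as np


def hard_vote(probas):
    """Majority vote over each model's arg-max label.

    Ties go to the lowest class index (an arbitrary choice made here).
    """
    votes = np.stack([p.argmax(axis=1) for p in probas])  # (n_models, n_samples)
    n_classes = probas[0].shape[1]
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)


def soft_vote(probas, weights=None):
    """Arg-max of the (optionally weighted) mean of class probabilities."""
    mean = np.average(np.stack(probas), axis=0, weights=weights)
    return mean.argmax(axis=1)


# One review, three hypothetical models, classes [negative, positive].
p1 = np.array([[0.60, 0.40]])
p2 = np.array([[0.55, 0.45]])
p3 = np.array([[0.10, 0.90]])
print(hard_vote([p1, p2, p3]))  # [0]: two of three models lean negative
print(soft_vote([p1, p2, p3]))  # [1]: the confident third model dominates the mean
```

With only two members, as in a BERTLR plus RoBERTaLR ensemble, hard voting ties whenever the members disagree, so the outcome depends on the tie rule; soft voting resolves such cases by confidence instead.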

By combining fractional derivative-based lexical representations with transformer-based contextual embeddings, the proposed BERTLR framework provides a hybrid semantic–statistical learning architecture. The fractional feature transformation acts as a complementary signal to the contextual embeddings learned by BERT and RoBERTa, enabling the system to capture both semantic meaning and long-range lexical patterns within review text. This integration allows the framework to better interpret subtle sentiment cues often present in technical review datasets.

The ensemble learning framework was implemented in Python using widely adopted machine learning libraries including scikit-learn, TensorFlow, and the Hugging Face Transformers library. Logistic regression (LR), BERT, and RoBERTa models were integrated within both hard and soft voting ensemble structures. Fractional derivative-based feature extraction was applied during the preprocessing stage to enrich the textual feature space prior to classification. Model performance was evaluated using standard metrics including accuracy, precision, recall, and F1-score, enabling a comprehensive comparison between individual classifiers and the proposed hybrid ensemble framework.
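The four reported metrics can be computed with scikit-learn as sketched below. The source does not say how precision, recall, and F1-score are averaged across sentiment classes; macro averaging is assumed here, and the label vectors are illustrative only.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def evaluate(y_true, y_pred, average="macro"):
    """Accuracy plus class-averaged precision, recall, and F1-score."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative labels: one positive review misclassified as negative.
metrics = evaluate([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1])
# macro averages here: precision 5/6, recall 0.875, F1 29/35 (about 0.829)
print(metrics)
```

Macro F1 in scikit-learn is the mean of per-class F1 scores, not the F1 of the averaged precision and recall, so it can differ from a value recomputed from the other two columns.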

3.3. Ensemble Learning

Ensemble learning is a robust technique that improves machine learning performance by aggregating predictions from multiple models. In traditional sentiment classification pipelines, ensemble models typically combine several classifiers to improve predictive stability and reduce variance. However, such ensembles often rely on homogeneous representations, where each model processes the same underlying feature space. In contrast, the proposed BERTLR framework introduces a hybrid representation strategy in which complementary feature families are integrated within the ensemble architecture. Specifically, transformer-based contextual embeddings capture semantic meaning in review text, while fractional derivative-enhanced lexical features introduce a mathematically enriched representation capable of modeling long-range dependencies and non-local lexical relationships.

The proposed BERTLR method therefore extends conventional ensemble learning by integrating fractional calculus-based feature transformation with transformer embeddings. Instead of simply combining multiple classifiers, the framework fuses heterogeneous representations that encode different linguistic properties. Transformer embeddings capture contextual semantic information, whereas fractional-order transformations enrich TF-IDF features by modeling memory-aware lexical relationships. This dual-channel representation allows the ensemble model to exploit complementary signals that are not captured when using transformer embeddings or statistical features independently.

Mathematical Formulation: The fractional derivative, denoted as $D^{\alpha} f(t)$, generalizes the ordinary derivative to non-integer orders, which allows it to model long-term dependencies and subtle variations in review data. The fractional derivative-based enhancement applied to BERTLR is based on the following (Riemann–Liouville) definition:

$$D^{\alpha} f(t) = \frac{1}{\Gamma(1 - \alpha)} \frac{d}{dt} \int_{0}^{t} \frac{f(\tau)}{(t - \tau)^{\alpha}} \, d\tau, \qquad 0 < \alpha < 1 \qquad (1)$$

where:

$\alpha$ is the non-integer order of the derivative;
$\Gamma$ is the Gamma function, which generalizes the factorial function;
$f(t)$ represents the feature extracted from the review data;
$\tau$ is the integration variable.

This formulation enables fractional calculus to capture long-memory characteristics that cannot be represented using classical integer-order differentiation. In the context of text classification, fractional operators introduce a transformation that emphasizes subtle term variations and contextual dependencies within TF-IDF representations. As a result, the feature space becomes more expressive, allowing the classification model to better discriminate sentiment signals embedded in domain-specific language.
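Equation (1) is a continuous definition, whereas Algorithm 1 applies Grünwald–Letnikov (GL) differencing to discrete feature vectors. For sufficiently smooth $f$, the GL definition coincides with Equation (1) and discretizes directly. The worked form below takes the order $d$ of Algorithm 1 as $\alpha$ and assumes a unit step $h = 1$ with a finite memory of $K$ lags; these discretization choices are assumptions, as the source does not state them.

$$D^{\alpha} f(t) = \lim_{h \to 0} h^{-\alpha} \sum_{k=0}^{\lfloor t/h \rfloor} (-1)^{k} \binom{\alpha}{k} f(t - kh)$$

$$\Delta_{\mathrm{GL}}^{\alpha} x_{n} = \sum_{k=0}^{K} w_{k} \, x_{n-k}, \qquad w_{k} = (-1)^{k} \binom{\alpha}{k}, \qquad w_{0} = 1, \quad w_{k} = w_{k-1} \, \frac{k - 1 - \alpha}{k}$$

The recursion follows from $\binom{\alpha}{k} = \binom{\alpha}{k-1} \frac{\alpha - k + 1}{k}$. For $\alpha = 1$ the weights are $(1, -1, 0, 0, \dots)$, the ordinary first difference; for $\alpha = 0.5$ they are $(1, -0.5, -0.125, -0.0625, -0.0390625, \dots)$, which decay as a power law in $k$ instead of vanishing after one step. This slow decay is what lets the transformed features retain information from distant positions.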

Algorithm 1 summarises the full BERTLR pipeline, including preprocessing, fractional GL differencing, BERT CLS embeddings, concatenation, LR training, and optional RoBERTa ensemble voting.

Integration into BERTLR: The BERTLR model employs fractional derivative transformations on TF-IDF features to enhance the model’s capacity to capture nuanced patterns in review data. This transformation enriches the lexical feature representation before it is combined with contextual embeddings generated by transformer models. The integration of these two feature channels enables the system to simultaneously model semantic context and memory-aware lexical dynamics. Consequently, the ensemble does not simply aggregate predictions from multiple models but integrates complementary feature representations that contribute to improved discrimination performance.

Algorithm 1 BERTLR pipeline: preprocessing, fractional features, embeddings, and ensemble prediction

Require: Corpus $D = \{(x_{i}, y_{i})\}$; FD order $d \in (0, 1)$

1: for each review $x_{i}$ in $D$ do
2: $x_{i} \leftarrow$ lowercase, remove punctuation, remove stopwords, stem/lemmatize
3: end for
4: $X_{tfidf} \leftarrow \mathrm{TFIDF}(\{x_{i}\})$
5: $X_{FD} \leftarrow Δ_{GL}^{d} (X_{tfidf})$ ▹ Grünwald–Letnikov fractional differencing
6: $E_{BERT} \leftarrow \mathrm{BERT}_{CLS}(\{x_{i}\})$
7: $E_{RoBERTa} \leftarrow \mathrm{RoBERTa}_{CLS}(\{x_{i}\})$ ▹ optional
8: $Z \leftarrow [E_{BERT} ∥ X_{FD}]$ ▹ concatenate
9: Train LR on $(Z, y)$ to obtain $h_{BERTLR}$
10: if ensemble then
11: $Z_{rob} \leftarrow [E_{RoBERTa} ∥ X_{FD}]$
12: Train LR on $(Z_{rob}, y)$ to obtain $h_{RoBERTaLR}$
13: $\hat{y} \leftarrow$ soft/hard voting of $\{h_{BERTLR}, h_{RoBERTaLR}\}$
14: else
15: $\hat{y} \leftarrow h_{BERTLR}(x)$
16: end if
17: return predictions $\hat{y}$ and metrics (Accuracy, Precision, Recall, F1)
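The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses scikit-learn's TfidfVectorizer and LogisticRegression, omits stemming/lemmatization, and replaces the BERT and RoBERTa [CLS] embeddings with a hypothetical placeholder_cls_embeddings function that returns fixed random vectors so that the sketch runs offline. The comments map each stage to the numbered lines of Algorithm 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def gl_coefficients(d: float, n: int) -> np.ndarray:
    """Grünwald–Letnikov weights (-1)^k * C(d, k), k = 0..n-1, via the recursion
    w_k = w_{k-1} * (k - 1 - d) / k."""
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w


def gl_transform(X: np.ndarray, d: float) -> np.ndarray:
    """Line 5: causal GL differencing along the feature axis of every row."""
    n_features = X.shape[1]
    w = gl_coefficients(d, n_features)
    # The first n_features entries of the full convolution equal
    # sum_{k<=n} w_k * x_{n-k} at every position n.
    return np.vstack([np.convolve(row, w)[:n_features] for row in X])


def placeholder_cls_embeddings(texts, dim=16, seed=0):
    """HYPOTHETICAL stand-in for BERT/RoBERTa [CLS] vectors (lines 6-7).

    A real run would use the [CLS] hidden state of a pre-trained encoder;
    here fixed random vectors keep the sketch self-contained.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(texts), dim))


def bertlr_pipeline(texts, y, d=0.5, ensemble=True):
    # Lines 1-4: lowercasing and English stop-word removal, then TF-IDF.
    vec = TfidfVectorizer(lowercase=True, stop_words="english")
    X_tfidf = vec.fit_transform(texts).toarray()
    X_fd = gl_transform(X_tfidf, d)                        # line 5
    e_bert = placeholder_cls_embeddings(texts, seed=1)     # line 6
    z = np.hstack([e_bert, X_fd])                          # line 8
    h_bert = LogisticRegression(max_iter=1000).fit(z, y)   # line 9
    proba = h_bert.predict_proba(z)
    if ensemble:                                           # lines 10-13
        e_rob = placeholder_cls_embeddings(texts, seed=2)  # line 7
        z_rob = np.hstack([e_rob, X_fd])                   # line 11
        h_rob = LogisticRegression(max_iter=1000).fit(z_rob, y)
        proba = (proba + h_rob.predict_proba(z_rob)) / 2   # soft voting
    return h_bert.classes_[proba.argmax(axis=1)]           # lines 14-17


reviews = [
    "The pump works great and is very reliable",
    "Terrible sensor, it failed after one week",
    "Excellent build quality and reliable firmware",
    "Poor firmware that crashes constantly",
]
labels = [1, 0, 1, 0]
print(bertlr_pipeline(reviews, labels))
```

For brevity the sketch fits and predicts on the same toy corpus; in the reported experiments the models are trained on the training subset and evaluated on held-out test data. Replacing the placeholder with the [CLS] hidden state of a pre-trained encoder would recover the structure described in the algorithm.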

Experimental evaluations demonstrate the effectiveness of the proposed BERTLR framework and highlight the benefits of combining fractional-order feature transformations with transformer embeddings in sentiment classification tasks. The hybrid architecture allows the ensemble to exploit complementary signals from contextual semantic representations and mathematically enriched lexical features.

In ensemble learning, the final prediction is usually derived by aggregating the outputs of multiple models. The most frequent approaches for this aggregation are voting and averaging. The mathematical formulation for integrating machine learning and deep learning algorithms into a voting-based ensemble classifier is expressed as follows:

\mathrm{Final\_Output} = w_{1} \cdot \mathrm{Model}_{1} + w_{2} \cdot \mathrm{Model}_{2} + \dots + w_{n} \cdot \mathrm{Model}_{n}

(2)

Each model is assigned a weight $w_{i}$ (where $i = 1, 2, \dots, n$), and $\mathrm{Model}_{i}$ denotes the output prediction of the corresponding model. These weights determine each model's contribution to the overall prediction. Their values can be derived through methods such as cross-validation, boosting, and averaging.
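Eq. (2) can be sketched as a weighted sum over model outputs. As an assumption for illustration, the hypothetical weights_from_validation helper sets each $w_{i}$ proportional to a model's validation score and normalizes the weights to sum to 1; this is only one simple heuristic among the cross-validation-based options mentioned above, and each $\mathrm{Model}_{i}$ is taken to be a matrix of class probabilities.

```python
import numpy as np


def weights_from_validation(scores):
    """Hypothetical heuristic: weights proportional to validation scores, summing to 1."""
    s = np.asarray(scores, dtype=float)
    return s / s.sum()


def weighted_output(model_outputs, weights):
    """Eq. (2): Final_Output = sum_i w_i * Model_i.

    model_outputs has shape (n_models, n_samples, n_classes) when each
    Model_i is a matrix of predicted class probabilities.
    """
    outputs = np.asarray(model_outputs, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w, outputs, axes=1)  # contracts the model axis


# Hypothetical validation accuracies and class probabilities for two samples.
w = weights_from_validation([0.90, 0.85])
probs_bertlr = [[0.7, 0.3], [0.4, 0.6]]
probs_robertalr = [[0.6, 0.4], [0.55, 0.45]]
final = weighted_output([probs_bertlr, probs_robertalr], w)
print(final.argmax(axis=1))  # predicted class per sample
```

When all weights are equal, this reduces to the plain probability averaging used in soft voting.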

In the BERTLR method, fractional derivative-based enhancement was incorporated into the ensemble learning framework to improve the discrimination of reviews. To define this enhancement, let $X$ denote the input review data (in practice, its TF-IDF feature representation) and let $F(X)$ denote the fractional derivative-based enhancement applied to $X$. The enhancement in BERTLR can then be represented as:

F (X) = D^{α} X

(3)

The operator $D^{α}$ denotes the fractional derivative of order $α$. Unlike integer-order derivatives, $D^{α}$ extends the concept of differentiation to non-integer orders, enabling the modeling of more complex, memory-dependent behavior. By applying this enhancement $F(X)$ to the input data, BERTLR captures intricate patterns and lexical dependencies that are not easily detected by conventional feature extraction methods.

The transformed data, $F(X)$, are then utilized within the ensemble learning framework, where predictions from multiple models are aggregated using voting-based strategies. This process allows the proposed framework to combine contextual semantic embeddings from transformer models with fractional-order lexical features, producing a richer and more expressive representation for sentiment classification.

The exact form and implementation of the fractional derivative operator $D^{α}$ may vary with the characteristics of the dataset and the sentiment classification task. Specifying how the operator is selected and parameterized is therefore essential for the interpretability and reproducibility of the proposed framework; Section 3.6.1 describes the Grünwald–Letnikov operator and the fractional orders examined in this study.

3.4. Voting Techniques

A hybrid classification method combines several ML and DL algorithms, and its performance is evaluated using key metrics such as accuracy, precision, recall, and F1-score. During this phase, various voting strategies were employed. After statistical analysis and testing, in which the accuracy of each classifier was assessed individually and compared, the BERTLR model was selected on the basis of its superior accuracy. Its two component algorithms, BERT and LR, were then combined in a hybrid approach to maximize overall accuracy. This study focused on two voting strategies:

Soft voting
Hard voting

The following subsections describe each voting technique.

3.4.1. Soft Voting

The soft voting classifier assigns a class to the input data on the basis of the class probabilities predicted by the individual classifiers. Soft voting is therefore only possible if all of the classifiers can estimate the probabilities of their outcomes. The ensemble averages the probabilities produced by the classification models and selects the class with the highest average probability. Whether a classifier supports soft voting can be confirmed by checking that it provides a probability-prediction method (for example, predict_proba in scikit-learn).
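A minimal soft-voting sketch, assuming scikit-learn-style classifiers: the hypothetical soft_vote helper first checks that every model exposes predict_proba, then averages the probability matrices and picks the class with the highest mean. FixedProba is a hypothetical stand-in classifier that returns preset probabilities so that the example is self-contained.

```python
import numpy as np


def soft_vote(classifiers, X):
    """Average predicted class probabilities and return (class indices, averages).

    Every classifier must expose a predict_proba(X) method; soft voting is
    impossible otherwise, so the function fails early.
    """
    for clf in classifiers:
        if not hasattr(clf, "predict_proba"):
            raise TypeError(f"{type(clf).__name__} has no predict_proba; use hard voting")
    avg = np.mean([clf.predict_proba(X) for clf in classifiers], axis=0)
    return avg.argmax(axis=1), avg


class FixedProba:
    """Hypothetical stand-in classifier that returns preset probabilities."""

    def __init__(self, proba):
        self.proba = np.asarray(proba, dtype=float)

    def predict_proba(self, X):
        return self.proba[: len(X)]


clf_a = FixedProba([[0.9, 0.1], [0.45, 0.55]])
clf_b = FixedProba([[0.6, 0.4], [0.30, 0.70]])
labels, avg = soft_vote([clf_a, clf_b], X=[None, None])
# avg ≈ [[0.75, 0.25], [0.375, 0.625]], so labels == [0, 1]
```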

3.4.2. Hard Voting

In hard voting, the input data are classified using the mode of the class labels predicted by the different classifiers; that is, the final prediction is the class chosen by a simple majority of the ensemble members. Majority voting is processed differently depending on whether the classifier weights are equal or unequal; with equal weights, the mode of the predicted labels is used directly. For example, consider three classifiers, clf1, clf2, and clf3, that predict [1, 1, 0] for a given record. With equal weights, the mode of [1, 1, 0] is 1, so the record is assigned to class 1 [1,26,33].
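The hard-voting rule, including the [1, 1, 0] example above, can be sketched with the standard library. The hard_vote helper is hypothetical: with weights=None it returns the mode of the predicted labels, and with unequal weights it returns the label with the largest total weight. Breaking ties by the first label encountered is our assumption, since the text does not specify a tie rule.

```python
from collections import Counter


def hard_vote(predictions, weights=None):
    """(Weighted) majority vote over the labels predicted for one record.

    With equal weights this is the mode of the predicted labels. Ties go to
    the label that appears first, because max() returns the first maximum
    in insertion order (an assumption, not a rule from the text).
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    tally = Counter()
    for label, w in zip(predictions, weights):
        tally[label] += w
    return max(tally, key=tally.get)


print(hard_vote([1, 1, 0]))                             # 1, the mode
print(hard_vote([1, 1, 0], weights=[0.2, 0.2, 0.7]))    # 0, since 0.7 > 0.4
print(hard_vote(["C1"] * 4 + ["C2"] * 3))               # C1, four of seven votes
```

The second call shows why equal and unequal weights must be handled differently: the same three predictions yield a different class once one classifier is trusted more.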

3.5. Experimental Configurations

To validate the contribution of fractional derivative-based feature transformation and the role of transformer-based hybrid integration, we evaluated five configurations of the proposed framework. These configurations were designed not only to compare alternative models, but also to isolate the contribution of each major component in the pipeline. In particular, they allow us to examine whether the observed performance gains arise from transformer embeddings alone, fractional-order feature enrichment alone, or the structured combination of both within the proposed ensemble framework.

BERTLR (Proposed): The primary model, which combines BERT embeddings with fractional derivative-enhanced TF-IDF features and Logistic Regression. This configuration represents the main hybrid semantic–fractional architecture proposed in this study.
RoBERTaLR: An extension that replaces BERT embeddings with RoBERTa embeddings while retaining the same fractional derivative-enhanced TF-IDF features and Logistic Regression classifier. This configuration is used to evaluate whether the proposed fractional feature integration remains effective across alternative transformer backbones.
BERT-NoFD (Ablation): A baseline setup that removes the fractional derivative transformation and retains only standard TF-IDF features together with BERT embeddings. This configuration is intended to isolate the contribution of fractional-order feature enhancement in the proposed BERT-based framework.
RoBERTa-NoFD (Optional Ablation): A variant that uses RoBERTa embeddings without fractional enhancement. This configuration provides an additional ablation setting to examine whether the benefit of fractional transformation is consistent when the contextual encoder is changed.
Hybrid Ensemble (Voting): A combined configuration that integrates BERTLR and RoBERTaLR predictions through both soft and hard voting strategies. This setting evaluates whether the two transformer-based variants provide complementary predictive information that can be further exploited through ensemble decision fusion.

These configurations enable a structured comparative analysis in which the role of fractional derivatives, transformer backbones, and voting-based integration can be examined independently and jointly. As a result, the experimental design supports both performance comparison and methodological validation, providing clearer evidence for the contribution of the proposed hybrid sentiment classification framework.

3.6. Feature Extraction Methods

Feature extraction is one of the most important processes in DL/ML-based classification because the quality of the learned representation strongly influences downstream performance. In the present study, the collected Google Play Store review data were processed to construct both conventional lexical features and fractional-order enhanced feature channels. Rather than treating fractional derivatives as an auxiliary post-processing step, we position them as a core mechanism for enriching text representations before classification. The motivation is that conventional TF and TF-IDF statistics capture local term salience, whereas fractional-order transformations can additionally encode long-range and memory-dependent lexical interactions. This makes the resulting representation more expressive for sentiment classification in technical and domain-specific review text.

3.6.1. Extended Fractional Operators (Caputo vs. Grünwald–Letnikov)

We complement the Grünwald–Letnikov (GL) fractional differencing used to construct TF-IDF channels with a theoretical comparison to the Caputo operator. The GL form defines a discrete fractional difference via binomially weighted historical terms:

Δ_{GL}^{d} x[n] = \sum_{k = 0}^{n} (-1)^{k} \binom{d}{k} x[n - k], \quad \binom{d}{k} = \frac{Γ(d + 1)}{Γ(k + 1) \, Γ(d - k + 1)},

(4)

which we implement with causal convolution and padding for stability.

To make the transformation process explicit, let the TF-IDF representation of a review be denoted by the feature vector $x = [x_{0}, x_{1}, \dots, x_{N - 1}]$, where each $x_{n}$ is the TF-IDF weight of the nth feature. The Grünwald–Letnikov fractional transformation of order d is then applied along the ordered feature sequence: for each position n, the transformed feature is a weighted sum of the current and preceding feature values, obtained as

\tilde{x}_{n} = \sum_{k = 0}^{n} (-1)^{k} \binom{d}{k} x_{n - k}, \quad n = 0, 1, \dots, N - 1,

(5)

where $\tilde{x}_{n}$ denotes the GL-transformed TF-IDF coefficient. In practice, the transformation is carried out through the following steps: (1) compute the conventional TF-IDF vector $x$ for each review; (2) choose the fractional order d; (3) compute the Grünwald–Letnikov coefficients $(-1)^{k} \binom{d}{k}$; (4) apply the weighted historical accumulation in Eq. (5) to obtain the transformed vector $\tilde{x}$; and (5) use $\tilde{x}$ as the fractional lexical feature channel for fusion with transformer embeddings. In this way, each transformed coefficient depends not only on its original TF-IDF value but also on a weighted history of preceding feature components, which introduces the desired memory-aware behavior into the lexical representation.
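The five steps above can be traced on a single TF-IDF row. This sketch evaluates the coefficients directly from the Gamma-function form in Eq. (4) and applies Eq. (5) as an explicit double loop, which makes the result easy to check by hand. The five-element TF-IDF vector is made up for illustration, and gl_coeffs_gamma and gl_transform_literal are hypothetical helper names. The sketch assumes a non-integer order d, for which $Γ(d - k + 1)$ has no poles.

```python
import math

import numpy as np


def gl_coeffs_gamma(d: float, n: int) -> np.ndarray:
    """Step (3): (-1)^k * binom(d, k), binom(d, k) = Γ(d+1) / (Γ(k+1) Γ(d-k+1)), Eq. (4).

    Assumes non-integer d. math.gamma(k + 1) overflows once k exceeds about 170,
    so this direct form only suits short feature vectors.
    """
    return np.array([
        (-1) ** k * math.gamma(d + 1) / (math.gamma(k + 1) * math.gamma(d - k + 1))
        for k in range(n)
    ])


def gl_transform_literal(x: np.ndarray, d: float) -> np.ndarray:
    """Step (4): Eq. (5), x_tilde[n] = sum_{k=0}^{n} w_k * x[n - k], as a double loop."""
    n_feat = len(x)
    w = gl_coeffs_gamma(d, n_feat)
    x_tilde = np.zeros(n_feat)
    for n in range(n_feat):
        for k in range(n + 1):
            x_tilde[n] += w[k] * x[n - k]
    return x_tilde


x = np.array([0.0, 0.4, 0.0, 0.7, 0.2])   # step (1): hypothetical TF-IDF row
d = 0.5                                   # step (2): fractional order
x_tilde = gl_transform_literal(x, d)      # steps (3)-(4)
# Coefficients: [1, -0.5, -0.125, -0.0625, -0.0390625]
# x_tilde ≈ [0.0, 0.4, -0.2, 0.65, -0.175]; step (5) would concatenate it
# with the transformer embedding of the same review.
```

The same result is obtained as the first N entries of a full causal convolution, np.convolve(x, w)[:N], which corresponds to the causal-convolution implementation mentioned after Eq. (4). For long vocabularies, the recursion $w_{k} = w_{k-1}(k - 1 - d)/k$ avoids the overflow of math.gamma for large k.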

The notion of memory in fractional calculus provides a natural analogy to long-range dependencies in textual data. In conventional TF-IDF representations, each feature is treated independently, and the contribution of a term is determined only by its local occurrence statistics. In contrast, the fractional Grünwald–Letnikov operator introduces a history-dependent transformation in which each feature coefficient is influenced by preceding feature values through a weighted accumulation process across the feature sequence.

From a linguistic perspective, sentiment expressed in text often depends on contextual interactions between words that may not be adjacent in the feature space. For example, the polarity of a term can be influenced by earlier descriptive or modifying words in the review. By incorporating fractional-order differencing, the transformed feature representation implicitly captures such non-local dependencies, allowing the model to encode memory-aware lexical relationships. This establishes a direct correspondence between the mathematical memory effect of fractional operators and the modeling of long-range dependencies in sentiment analysis.

As a result, the fractional transformation enhances the discriminative power of TF-IDF features by incorporating contextual influence across feature components, thereby improving sentiment classification performance compared to standard TF-IDF representations that rely solely on local term statistics.

In contrast, the Caputo derivative acts on continuously differentiable signals and is defined (for $0 < d < 1$) as

{}^{C}D^{d} x (t) = \frac{1}{Γ (1 - d)} \int_{0}^{t} x^{'} (τ) {(t - τ)}^{- d} d τ,

(6)

emphasizing recent gradients through a power-law memory kernel. While GL is natural for discrete text features, both forms share the key property of long-range memory controlled by the fractional order d. Empirically, varying $d \in \{0.3, 0.5, 0.7\}$ shows performance improving monotonically up to $d \approx 0.5$ and then saturating, suggesting that moderate memory depth best complements contextual embeddings. This supports our design choice and provides a principled parameter for trading off smoothness against sensitivity in the fractional TF-IDF channels.

The inclusion of both operators is important because it clarifies the theoretical basis of the proposed feature transformation. The Caputo form provides an interpretable continuous-time reference for memory-aware differentiation, whereas the GL form is more suitable for discrete text representations and direct implementation on token-weight vectors. In this work, GL is adopted as the operational mechanism because TF-IDF features are discrete and finite-dimensional. Therefore, the proposed methodology does not simply borrow fractional calculus conceptually; it selects a specific fractional operator whose discrete structure aligns naturally with text-based feature engineering.
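The agreement between the two operators can be checked numerically. The sketch below, an illustration rather than part of the reported pipeline, evaluates the GL sum $h^{-d} \sum_{k} (-1)^{k} \binom{d}{k} x(t - kh)$ with lower terminal 0 for $x(t) = t$ and compares it with the closed-form Caputo derivative from Eq. (6), $t^{1 - d}/Γ(2 - d)$. Because $x(0) = 0$, the Riemann–Liouville and Caputo derivatives coincide here, so the GL sum should approach the Caputo value as the step h shrinks. The function names and the step size h = 1e-3 are our assumptions.

```python
import math

import numpy as np


def gl_derivative(f, t, d, h=1e-3):
    """GL approximation h^{-d} * sum_k (-1)^k C(d, k) f(t - k h), lower terminal 0."""
    n = int(round(t / h))
    w = np.empty(n + 1)
    w[0] = 1.0
    for k in range(1, n + 1):
        w[k] = w[k - 1] * (k - 1 - d) / k  # recursive form of (-1)^k C(d, k)
    samples = f(t - h * np.arange(n + 1))
    return h ** (-d) * np.dot(w, samples)


def caputo_of_identity(t, d):
    """Eq. (6) for x(t) = t: x'(τ) = 1, so the integral gives t^{1-d} / Γ(2 - d)."""
    return t ** (1 - d) / math.gamma(2 - d)


for d in (0.3, 0.5, 0.7):
    approx = gl_derivative(lambda s: s, t=1.0, d=d)
    exact = caputo_of_identity(1.0, d)
    print(f"d={d}: GL={approx:.4f}  Caputo={exact:.4f}")
```

The discrete GL sum used for the TF-IDF channels is thus a first-order approximation of the same continuous memory operator, which is the sense in which the two forms "share" long-range memory.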

3.6.2. Fractional Derivatives: An Overview and Their Application in Large Language Models

Fractional derivatives provide a sophisticated method for analyzing the rate of change in functions that may not exhibit typical linear behavior. Unlike traditional derivatives, which measure changes at specific integer orders (such as first- or second-order derivatives), fractional derivatives extend this concept to non-integer orders such as 0.5 or 1.5. This extended range allows a more detailed examination of systems that do not follow simple, predictable patterns, offering a useful mathematical tool to characterize processes that deviate from standard local behavior.

The primary advantage of fractional derivatives is their capacity to describe systems with memory and hereditary traits. In simpler terms, fractional derivatives allow both the current and past states of a system to influence the present representation. For example, in physics, fractional derivatives can describe anomalous diffusion, where future behavior depends not only on the present state but also on a history of previous states. In engineering, they are useful for modeling materials with memory, such as viscoelastic materials, where stress depends on both current and past strains [34].

In computational disciplines, particularly in artificial intelligence, fractional derivatives help model data with long-term dependencies. This is crucial for sequential and language-related data, where the significance of a term or phrase often depends on previous lexical context. Unlike traditional algorithms that treat each feature independently, fractional derivatives enable the model to encode dependencies across multiple components of the representation. This property makes them attractive for NLP settings in which sentiment is influenced by non-local contextual interactions. Accordingly, fractional derivatives can improve predictive performance and make feature representations more sensitive to subtle textual cues [31,35].

The integration of fractional derivatives allows the model to account for long-term dependencies within text data. This is especially useful for sentiment analysis tasks in which previous contexts influence the meaning of words and sentences. Such enrichment helps the model classify text more accurately by providing richer feature representations, which are essential for predicting sentiment in nuanced reviews. In this study, we employ fractional derivatives to strengthen the feature extraction stage of our hybrid framework, which includes BERT and RoBERTa. This integration supports a more detailed analysis of sentiment in text data and captures subtle linguistic characteristics that might be overlooked by purely embedding-based or purely lexical models.

Subsequent to preprocessing, the corpus was partitioned into two subsets with a 3:1 ratio, with one subset designated for training and the other for testing. The methodology employed for feature extraction is illustrated in Figure 4, demonstrating the use of extraction techniques such as TF and TF-IDF on both training and testing datasets. The models were trained on the training subset, and their classification performance was assessed on the test data.

TF-IDF, commonly utilised in information retrieval (IR) and summarisation tasks, evaluates the importance of terms based on their frequency within a document and across the corpus. The TF and IDF components are essential in constructing TF-IDF representations. The IDF component assigns higher weights to rare terms across the dataset, thereby increasing their contribution to the overall representation.

When calculating the IDF, the following formula is applied:

IDF(t) = \log\left(\frac{N}{df_{t}}\right)

(7)

where N denotes the total number of documents and $df_{t}$ represents the document frequency of term t, that is, the number of documents containing t. The term frequency $TF(t, d)$ is defined as the frequency of term t in document d. Consequently, the total weight of a token in a document under TF-IDF is given by:

\mathrm{TF\text{-}IDF}(t, d) = TF(t, d) \times IDF(t)

(8)
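Eqs. (7) and (8) can be sketched directly in Python. The hypothetical tfidf helper below follows the formulas literally: raw term counts multiplied by $\log(N/df_{t})$, using the natural logarithm (the base is our assumption; the text does not specify it). Library implementations differ. For instance, scikit-learn's TfidfVectorizer by default uses the smoothed IDF $\ln((1 + N)/(1 + df_{t})) + 1$ and L2-normalizes each row, so its values will not match this sketch.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF per Eqs. (7)-(8): raw term count times log(N / df_t).

    docs is a list of token lists; returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # documents containing t
    idf = {t: math.log(n_docs / df_t) for t, df_t in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


docs = [["pump", "reliable", "pump"], ["sensor", "failed"], ["pump", "failed"]]
weights = tfidf(docs)
# weights[0]["pump"] = 2 * log(3/2) ≈ 0.811; weights[0]["reliable"] = log(3) ≈ 1.099
```

A term that occurs in every document receives weight 0, which is why IDF raises the relative contribution of rarer terms.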

Within the proposed framework, this conventional TF-IDF representation is not used as the final lexical descriptor. Instead, it serves as the input to a fractional-order transformation stage, where the feature vector is enriched through GL-based differencing. This design is important because it links the theoretical discussion of fractional operators directly to the actual model pipeline: TF-IDF provides a weighted lexical structure, and the fractional transformation reshapes that structure to encode non-local and memory-aware dependencies before fusion with transformer embeddings.

3.6.3. Fractional Derivative Feature

A fractional calculus-based feature extraction technique was developed to mitigate the adverse effects encountered during feature processing. The derivative allows a more refined feature representation by calculating the rate of change of the features. In particular, the fractional derivative is employed to reduce these negative effects and enhance the weighted numerical values generated by TF-IDF or TF feature engineering methods. The integration of this technique with TF-IDF and TF improves the predictive accuracy of the proposed BERTLR model [36].

The integration of fractional derivatives in feature extraction facilitates the identification of intricate, non-linear patterns in text data. This indicates that utilising models such as BERTLR and RoBERTa-based variations on the data enhances the model’s capacity to discern complex connections between terms, resulting in improved sentiment comprehension. More importantly, the role of the fractional operator in our framework is not merely to perturb the input features, but to produce a transformed lexical channel that is complementary to transformer embeddings. While BERT and RoBERTa capture contextual semantics through self-attention, the fractional feature mechanism emphasizes graded lexical interactions and memory-aware term behavior that are not explicitly modeled by embedding-only architectures.

This approach allows a richer, non-linear understanding of the data. When applied to feature extraction in our sentiment analysis model, fractional derivatives facilitate the identification of complex relationships and patterns in text that would otherwise be missed. The refined representation of features through fractional derivatives significantly improves the performance of the BERTLR and RoBERTa-based models, as it allows the model to process nuanced and context-dependent features more effectively. Consequently, the proposed framework should be interpreted as a structured hybrid representation model, in which fractional-order lexical transformation and contextual semantic embedding play different but complementary roles.

By modifying the fractional derivative (FD) approach, a new set of features can be derived. The FD of a function $g(x)$ of order k, where k may be any real number rather than only an integer, is defined as follows:

g^{(k)} (x) = \lim_{h \to 0} \frac{g (x) - k \, g (x - h) + \frac{k (k - 1)}{2} \, g (x - 2 h) - \dots}{h^{k}} = \lim_{h \to 0} \frac{1}{h^{k}} \sum_{j = 0}^{\infty} (-1)^{j} \binom{k}{j} g (x - j h)

(9)

Fractional derivatives used with text-based feature extraction techniques have been shown to improve classification model performance. By capturing the non-integer order dynamics within the text data, the fractional derivative-based mechanism offers a deeper and more refined understanding of the relationships between the features. This approach provides a more comprehensive representation of text data when combined with traditional techniques such as TF and TF-IDF. This fusion has been shown to significantly boost the prediction accuracy of classification models, outperforming the results of traditional methods alone. The enhanced representation of text data enables classification models to grasp the underlying relationships more effectively, leading to more accurate decision-making and improved outcomes.

In the context of the proposed BERTLR framework, this fractional derivative feature channel is central to the methodological contribution of the study. It provides the mathematical mechanism that distinguishes the framework from standard transformer-only sentiment classifiers and from conventional lexical-statistical pipelines. By explicitly connecting fractional-order feature modeling with transformer-based semantic embeddings, the proposed method offers a principled and interpretable pathway for hybrid sentiment analysis in technical domains.

3.7. Experimental Setup

The experimental setup was designed to ensure reproducibility and fair comparison across models. The dataset was divided into training and testing subsets using a standard split ratio (80% training and 20% testing), ensuring balanced representation across all categories. Stratified sampling was applied to preserve class distribution.

For transformer-based components, pre-trained BERT and RoBERTa models were employed as feature extractors without fine-tuning. The embeddings generated by these models were combined with fractional derivative-enhanced TF-IDF features within a soft-voting ensemble framework.

The fractional order d was empirically selected from the range $[0.3, 0.7]$, with optimal performance observed near $d = 0.5$. For the LSTM baseline, standard training procedures were followed, including the use of the Adam optimizer, ReLU activation, and a dropout rate of 0.5 to prevent overfitting.

All experiments were conducted using Python-based deep learning libraries (e.g., TensorFlow/Keras), and model training was performed on a standard computational environment with GPU acceleration.

3.8. Classifiers Used for Review Classification

In this section, various ML and DL algorithms employed in the study are described.

3.8.1. Support Vector Regression

The SVM model represents instances as points in space, mapped so that the examples of different categories are separated by a gap that is as wide as possible. Support vector machines (SVMs) are widely used as dependable and scalable supervised machine learning methods for regression and classification, although they are more typically applied to classification. First adopted in the 1960s, they were refined and improved in the 1990s. Compared with other machine learning algorithms, SVMs represent data in a distinctive way, and owing to their ability to handle both continuous and categorical instances, they have become common in recent years. SVM classifiers offer high precision and can handle high-dimensional spaces [37]. Because their decision function uses only a subset of the training points (the support vectors), SVMs are also memory-efficient. An SVM separates classes by finding a hyperplane that divides them in a high-dimensional feature space; in general, the larger the margin between this hyperplane and the nearest training points, the lower the generalization error of the classifier. New examples are then mapped into the same space and assigned to a category according to the side of the hyperplane on which they fall. More generally, an SVM constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks such as outlier detection. The hyperplanes in this higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant [38].

3.8.2. Random Forest Classifier

Random Forest (RF) is an ensemble learning technique built from many decision trees. In ensemble learning, multiple models are trained and their outputs are combined; a random forest applies this idea by training many decision trees on different random samples of the data and features, rather than by combining different algorithms. Random forests can be used for both classification and regression (function estimation). Random forest has also been used together with the local outlier factor algorithm to determine the fraud percentage in a dataset [39].

Random forest is a supervised learning technique that can also be used to make regression predictions, although its major application is in classification problems. Its performance generally improves as the number of trees increases. The technique generates numerous decision trees from data samples and then aggregates their predictions (by majority vote for classification or by averaging for regression). Unlike a single decision tree, the random forest strategy reduces the risk of overfitting by aggregating the results of many trees, which lowers variance and improves flexibility and reliability. Unlike models that may require feature scaling or large datasets, the random forest algorithm can deliver high precision even with smaller or incomplete datasets, maintaining its robustness and accuracy [40].

Random forest was employed to predict outcomes based on an ensemble of tree-based models. Each tree is trained on a bootstrap sample of the data with a random subset of features and produces its prediction independently. This randomized training helps to reduce the overfitting that is common in individual decision trees. Random forests improve model resilience by selecting the split predictor from a random subset of features at each node, whereas a typical decision tree splits on the best feature among all available variables [41].

3.8.3. Logistic Regression Algorithm

Logistic regression is a popular machine-learning algorithm that is well suited to binary classification. It models the probability of each possible outcome: a linear combination of the input features is mapped through the logistic (sigmoid) function to a value between 0 and 1, which makes the method useful when the dependent variable is categorical. The overall structure of the logistic regression model is specified as follows:

P(Y = 1 \mid X, w) = \frac{1}{1 + \exp\left(-w_{0} - \sum_{i} w_{i} X_{i}\right)}

(10)
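The model in Eq. (10) can be evaluated with a few lines of NumPy; this is a minimal sketch, and lr_probability is a hypothetical helper name. It uses the identity $1/(1 + e^{-z}) = \tfrac{1}{2}(1 + \tanh(z/2))$, which avoids overflow in exp for large negative z. Thresholding at 0.5 is the usual convention, assumed here.

```python
import numpy as np


def lr_probability(X, w0, w):
    """Eq. (10): P(Y = 1 | X, w) = 1 / (1 + exp(-w0 - sum_i w_i X_i)), row-wise.

    Computed as 0.5 * (1 + tanh(z / 2)), which equals the logistic function
    but cannot overflow for large |z|.
    """
    z = w0 + np.asarray(X, dtype=float) @ np.asarray(w, dtype=float)
    return 0.5 * (1.0 + np.tanh(z / 2.0))


X = np.array([[2.0, 1.0], [0.0, 0.0]])
p = lr_probability(X, w0=0.0, w=[1.0, -1.0])  # z = [1, 0]
labels = (p >= 0.5).astype(int)               # decision threshold at 0.5
# p ≈ [0.731, 0.5], labels == [1, 1]
```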

3.8.4. Decision Tree Classifier

Decision trees classify instances by sorting them according to their feature values. Each internal node in a decision tree represents a feature to be tested, and each branch represents a value that the feature can take. The Quinlan ID3 algorithm was extended to include the Kotsiantis decision-tree technique. The Q-decision tree method was used in this study, and bagging is another algorithm used with decision trees. CHAID (Chi-square Automatic Interaction Detector) [42] is another decision-tree algorithm.

3.8.5. Long Short-Term Memory

Long short-term memory (LSTM) is a sophisticated type of recurrent neural network (RNN) architecture that is widely used in deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections, so it can process entire sequences of data rather than only single data points. LSTM is particularly effective for tasks such as network traffic monitoring or intrusion detection, where traditional recurrent neural networks, Markov models, and other sequence-learning techniques struggle with varying sequence lengths. An LSTM unit consists of a memory cell and three gates (input, output, and forget) that regulate the flow of information into and out of the cell. This structure allows LSTMs to capture, analyze, and forecast time-series data by handling unknown delays between significant events. The LSTM design was developed to overcome the limitations of standard RNNs, giving it a competitive edge in time-series prediction tasks compared to other AI techniques [43].

3.8.6. Bidirectional Encoder Representations from Transformers (BERT)

BERT, developed by Google, is a transformer-based language representation model pre-trained for natural language processing (NLP). Because its transformer encoder attends to the context on both sides of each word, BERT captures bidirectional context, enhancing its understanding of linguistic relationships. It was proposed by Jacob Devlin and colleagues at Google in 2018. Since 2019, Google has integrated BERT into its search algorithm to improve the interpretation of user queries, thereby enhancing the relevance and accuracy of search results.

3.8.7. AdaBoost

AdaBoost, short for Adaptive Boosting, is a meta-algorithm developed by Freund and Schapire and originally formulated for binary classification. It improves the performance of learning algorithms by assigning higher weights to misclassified instances, so that subsequent models concentrate on these errors. The final output is a weighted sum of the predictions of the individual learners. This adaptive nature allows AdaBoost to refine the model iteratively by learning from earlier mistakes, and in certain cases it is more resistant to overfitting than other learning methods. Although the individual learners may be weak, AdaBoost produces a strong overall model as long as each learner performs better than random guessing [44].

3.8.8. Proposed Hybrid Classifier (BERT + LR)

A hybrid voting classifier, also known as the ensemble learning approach, is shown in Figure 5. It integrates several ML and DL models to obtain the final classification results. In this study, the BERT and LR predictive algorithms were integrated to solve the prediction problem. The ensemble learning technique uses the training data to train each model in the ensemble. Once the training phase was complete, the testing data were fed to each model, which predicted a class label for every sample. The models were trained and evaluated in this way using real-world data. In the subsequent phase, a voting mechanism was applied to generate the final prediction for each individual sample, using either hard or soft voting depending on the situation. In hard voting, the ensemble assigns a class label to a sample based on the majority vote. For example, if four of seven models classify the sample $X_{k}$ as belonging to Class C1 while the other three classify it as Class C2, the majority selects Class C1, and that class is assigned to the sample.

In contrast to hard voting, soft voting averages the predicted class probabilities of all models (rather than their class labels) and assigns the sample to the class with the highest average probability.
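The two voting rules can disagree when a minority of models is highly confident. The sketch below uses three hypothetical models and made-up probabilities for a single sample; it is not the paper's ensemble, and `hard_vote` and `soft_vote` are illustrative helpers (scikit-learn's `VotingClassifier` offers comparable `voting='hard'` and `voting='soft'` modes).

```python
import numpy as np

def hard_vote(probas: np.ndarray) -> np.ndarray:
    """probas has shape (n_models, n_samples, n_classes).
    Each model votes for its arg-max class; the class with most votes wins
    (ties resolve to the lowest class index, as np.argmax does)."""
    n_models, n_samples, n_classes = probas.shape
    votes = probas.argmax(axis=2)                        # (n_models, n_samples)
    counts = np.zeros((n_samples, n_classes), dtype=int)
    for m in range(n_models):
        counts[np.arange(n_samples), votes[m]] += 1
    return counts.argmax(axis=1)

def soft_vote(probas: np.ndarray, weights=None) -> np.ndarray:
    """Average (optionally weighted) class probabilities, then take the arg-max."""
    avg = np.average(probas, axis=0, weights=weights)    # (n_samples, n_classes)
    return avg.argmax(axis=1)

# Three hypothetical models scoring one sample over the classes (C1, C2).
probas = np.array([
    [[0.55, 0.45]],   # model 1: weakly prefers C1
    [[0.51, 0.49]],   # model 2: weakly prefers C1
    [[0.10, 0.90]],   # model 3: strongly prefers C2
])
print(hard_vote(probas))  # [0] -> C1 wins two votes to one
print(soft_vote(probas))  # [1] -> mean P(C2) = (0.45 + 0.49 + 0.90) / 3 ≈ 0.613
```

Hard voting picks C1 on the majority count, while soft voting picks C2 because the third model's high confidence dominates the averaged probabilities.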

3.9. Additional Experimental Variants with RoBERTa and Ablation

To further examine the robustness and originality of the proposed framework, additional configurations were analysed; their results are reported in Section 4.5. These comprise:

RoBERTa-FD: Replaces BERT with RoBERTa embeddings while keeping the fractional derivative-enhanced features and Logistic Regression classifier.
BERT-NoFD and RoBERTa-NoFD: Ablation models that exclude the fractional derivative feature enhancement to assess its individual contribution.
Ensemble Voting: Combines the BERT-FD and RoBERTa-FD predictions using soft and hard voting strategies to improve overall performance.
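A minimal sketch of how an FD-style input could be formed is shown below, assuming a truncated Grünwald–Letnikov fractional difference with unit step applied along the ordered TF-IDF columns and then concatenated with a transformer [CLS] embedding (the concatenation follows the description in Section 4.6). The functions `gl_weights` and `gl_fractional_diff`, the fractional order alpha = 0.5, and the random stand-in inputs are hypothetical; the paper's exact operator settings may differ. Under this reading, a NoFD variant would skip the `gl_fractional_diff` step.

```python
import numpy as np

def gl_weights(alpha: float, n: int) -> np.ndarray:
    """Grünwald–Letnikov coefficients w_k = (-1)^k * binom(alpha, k) for k = 0..n-1,
    computed with the recurrence w_k = w_{k-1} * (1 - (alpha + 1) / k)."""
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    return w

def gl_fractional_diff(features: np.ndarray, alpha: float) -> np.ndarray:
    """Truncated GL fractional difference (unit step) along the last axis:
    out[..., i] = sum_{k=0}^{i} w_k * features[..., i - k].
    ASSUMPTION: the TF-IDF columns are taken in a fixed vocabulary order."""
    n = features.shape[-1]
    w = gl_weights(alpha, n)
    out = np.zeros(features.shape, dtype=float)
    for i in range(n):
        # features[..., i::-1] lists f_i, f_{i-1}, ..., f_0, matching w_0, w_1, ..., w_i.
        out[..., i] = features[..., i::-1] @ w[: i + 1]
    return out

rng = np.random.default_rng(0)
cls_emb = rng.normal(size=(2, 768))   # stand-in for BERT-base [CLS] vectors (hidden size 768)
tfidf = rng.random((2, 5))            # stand-in for TF-IDF rows over a 5-term vocabulary
x_fd = np.hstack([cls_emb, gl_fractional_diff(tfidf, alpha=0.5)])  # FD-style input
print(x_fd.shape)  # (2, 773)
```

Two sanity checks follow from the recurrence: alpha = 0 leaves the features unchanged, and alpha = 1 reduces to an ordinary first difference.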

These experiments were executed with the same training pipeline and evaluated with the conventional metrics (accuracy, precision, recall, and F1-score). Comprehensive comparative results are presented in Section 4.

3.10. Evaluation Metrics

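To evaluate and compare the ML/DL models, we used the performance metrics detailed below [45,46,47], where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively: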

Precision is defined as the proportion of correctly predicted positive samples among all predicted positive samples:

\text{Precision} = \frac{TP}{TP + FP}

(11)

The F1-score, the harmonic mean of precision and recall (with recall as defined in Equation (14)), is given by:

F_1\text{-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

(12)

Accuracy is defined as the proportion of correctly predicted samples among all samples:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(13)

Recall is defined as the proportion of correctly predicted positive samples among all actual positive samples:

\text{Recall} = \frac{TP}{TP + FN}

(14)
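Equations (11)–(14) can be checked with a short computation from confusion-matrix counts. The counts below are hypothetical and are not taken from the paper's experiments, and the zero-denominator convention is our own choice. In the multi-category setting, these per-class values would additionally need to be averaged (for example, macro-averaged), which the sketch does not show.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, F1-score, and accuracy from binary confusion-matrix counts.
    Any ratio whose denominator is zero is reported as 0.0 (a convention adopted here)."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)                        # Equation (11)
    recall = ratio(tp, tp + fn)                           # Equation (14)
    f1 = ratio(2 * precision * recall, precision + recall)  # Equation (12)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)          # Equation (13)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for 100 test reviews.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(m)  # precision 0.8, recall ≈ 0.889, F1 ≈ 0.842, accuracy 0.85
```

The F1 value also equals 2TP / (2TP + FP + FN) = 80 / 95, an equivalent form of Equation (12).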

4. Results and Discussion

This section evaluates the predictive performance of the proposed framework and compares it with multiple machine learning and deep learning baselines under different preprocessing and feature extraction settings. The objective is not only to report classification accuracy, but also to clarify how the proposed semantic–fractional representation affects performance across engineering review categories. In particular, the experiments are designed to examine the effect of conventional lexical representations (TF and TF-IDF), fractional derivative-enhanced features, and transformer-based hybrid integration. The discussion therefore focuses on both comparative performance and the methodological role of the proposed feature fusion strategy.

4.1. Results with Pre-Processing

This study evaluated a combination of ML and DL classifiers, each with its own hyperparameters, which were determined empirically to achieve high classification performance. In the conventional ML setting, classifiers such as DTC, RFC, NBC, and LRC were evaluated with standard lexical features, whereas the transformer-based configurations incorporated contextual semantic representations through BERT. The purpose of this comparison is to establish a meaningful baseline before introducing fractional feature enhancement and hybrid ensemble integration. Table 1 presents the category-wise accuracy of all classifiers when TF-IDF is applied. Compared with simpler statistical classifiers, transformer-based models benefit from richer contextual representations, while the proposed BERTLR model further improves performance by combining these semantic embeddings with enhanced lexical signals.

As shown in Figure 6, DTC exhibited the lowest accuracy when combined with the TF-IDF feature extraction technique, particularly in the Action and Casual categories. This indicates that shallow tree-based models are less effective at exploiting high-dimensional sparse textual representations. In contrast, both BERT and the proposed BERTLR model consistently achieved high accuracy across all categories, demonstrating the value of contextual semantic information in engineering sentiment classification. More importantly, BERTLR systematically outperformed standalone BERT, suggesting that the gains do not arise solely from transformer embeddings but also from the additional contribution of enhanced lexical representations. The performance of RFC and SVM also varied across categories, especially in Music & Audio, indicating that certain domains present more difficult feature interactions. Overall, these results show that while TF-IDF provides a stronger lexical baseline than TF, its full advantage is realized when it is combined with more expressive representation learning and hybrid classification strategies.

Table 2 summarizes the classification accuracies of various machine learning and deep learning classifiers when the TF feature extraction approach is used. The results indicate that LRC and BERT remain relatively competitive under TF-based representation, while BERTLR again achieves the strongest overall performance. This comparison is important because it shows that the proposed framework does not depend exclusively on TF-IDF superiority; rather, it remains strong even when the initial lexical representation is simpler. At the same time, the gap between TF and TF-IDF settings reveals that richer lexical weighting remains beneficial before hybrid semantic integration.
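To illustrate the difference between the two lexical representations, the sketch below computes raw term frequencies and a plain TF-IDF weighting, idf(t) = log(N / df(t)), on three made-up app reviews. The helper names and reviews are hypothetical; libraries such as scikit-learn's `TfidfVectorizer` use smoothed and normalized variants of this idf by default, and the paper's exact settings are not restated here.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Raw term frequency: count of each whitespace-separated term per document."""
    return [Counter(doc.split()) for doc in docs]

def tfidf_vectors(docs):
    """TF-IDF with idf(t) = log(N / df(t)); a term present in every document gets weight 0."""
    tfs = tf_vectors(docs)
    n_docs = len(docs)
    df = Counter(term for tf in tfs for term in tf)  # number of documents containing each term
    return [{t: c * math.log(n_docs / df[t]) for t, c in tf.items()} for tf in tfs]

reviews = [
    "app crash crash after update",
    "app smooth and fast",
    "app useless after update",
]
print(tf_vectors(reviews)[0])     # "crash" counted twice, "app" once
print(tfidf_vectors(reviews)[0])  # "app" (in every review) drops to 0.0
```

Under TF, the uninformative word "app" carries as much weight as "after"; TF-IDF suppresses it and emphasizes the rarer, sentiment-bearing "crash", which is one reason TF-IDF tends to give more discriminative lexical features.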

As shown in Figure 7, LRC performed strongly across categories under TF-based representation, especially in Entertainment and Photography. However, the proposed BERTLR model remained consistently stronger overall, suggesting that the hybrid semantic–fractional architecture benefits not only from improved lexical representation but also from the complementary contribution of transformer-based semantics. Although the numerical difference between TF and TF-IDF is smaller for some conventional models, the proposed framework preserves a consistent advantage under both representations. This stability is important because it indicates that the model improvement is not tied to a single handcrafted representation, but rather emerges from the structured integration of contextual and fractional feature channels.

Table 3 summarizes the precision, recall, and F1-scores for the different review categories investigated in this study. Under TF representation, the Logistic Regression Classifier (LRC) outperformed several classical baselines. Nevertheless, the hybrid BERTLR framework achieved stronger and more balanced performance across the categories, indicating that the integration of contextual semantic embeddings and enhanced lexical features yields a more robust classification mechanism than conventional single-representation models.

Logistic regression models the probability of a binary outcome as a function of a set of explanatory variables [48]. With TF-IDF features, LRC outperformed the other conventional classifiers in accuracy and also achieved comparatively strong recall and F1 values. While BERT performed well on its own, the proposed BERTLR model surpassed all other classifiers in precision, recall, and F1-score. Table 4 therefore provides more than a simple accuracy comparison; it shows that the proposed hybrid model produces stronger and more balanced classification quality across multiple evaluation metrics.

Table 5 summarizes the performance metrics of classifiers utilizing the TF feature extraction method together with fractional derivative enhancement. The results show that while several conventional classifiers benefit from fractional augmentation, the strongest gains remain associated with the proposed BERTLR configuration. This suggests that fractional-order features are most effective when they are used as a complementary lexical signal alongside contextual semantic embeddings rather than as an isolated enhancement to shallow classifiers.

Table 6 compares the accuracies of the classifiers under different feature extraction methods. The results demonstrate that TF features generally lead to weaker performance than TF-IDF-based representations. Under TF-IDF with fractional enhancement, the proposed BERTLR framework achieves the highest accuracy. Overall, these observations confirm that the proposed method benefits from both the lexical discriminability of TF-IDF and the additional representational richness introduced by fractional-order transformation.

Figure 8 compares the average accuracy of all classifiers using TF and TF-IDF feature extraction techniques. The experimental results show only a marginal variation in performance between the two methods for some conventional classifiers, whereas TF-IDF consistently outperforms TF in terms of accuracy, precision, and related metrics across the stronger models. This finding emphasizes the effectiveness of TF-IDF in capturing more informative features from the dataset and further supports the use of TF-IDF as the base lexical representation for fractional enhancement. More importantly, the comparative analysis indicates that the proposed hybrid framework preserves a consistent advantage over alternative classifiers across both representation settings.

4.2. Results with Fractional Derivative

Table 5 presents the classification accuracy achieved when fractional derivative enhancement is applied together with the TF feature extraction method. These results are important because they isolate the effect of fractional-order lexical enrichment under a relatively simple base representation. Although several conventional classifiers benefit from the fractional transformation, the gains are not uniform across models. In particular, classifiers such as RFC and DTC remain comparatively limited, suggesting that fractional enhancement alone is not sufficient to overcome the inherent representational constraints of shallow classifiers. By contrast, stronger improvements are observed for models that can better exploit enriched features, especially the proposed BERTLR configuration.

The results further indicate that the value of fractional derivatives is not merely to increase numerical feature complexity, but to reshape the lexical feature space in a way that becomes more informative for downstream classification. Under TF-based representation, the proposed BERTLR model consistently achieves the highest category-wise accuracy, showing that fractional-order enhancement provides a useful complementary signal even when the underlying lexical weighting is relatively simple. This observation supports the argument that the proposed framework benefits from the interaction between semantic embeddings and fractional lexical dynamics rather than from ensemble aggregation alone.
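For reference, one standard discrete form of the Grünwald–Letnikov operator (unit step, truncated at the first element) is written out below, together with its coefficients for two orders. We assume it is applied along the ordered lexical feature index; the exact formulation used in Section 3 may differ. The coefficients make the "memory" interpretation concrete: for the integer order α = 1 only the immediately preceding element contributes (for i ≥ 1), whereas for a fractional order every earlier element receives a nonzero, slowly decaying weight.

```latex
D^{\alpha} f_i = \sum_{k=0}^{i} w_k^{(\alpha)} f_{i-k},
\qquad w_k^{(\alpha)} = (-1)^k \binom{\alpha}{k},
\qquad w_0^{(\alpha)} = 1, \quad w_k^{(\alpha)} = w_{k-1}^{(\alpha)} \left(1 - \frac{\alpha + 1}{k}\right)

\alpha = 1: \quad w^{(1)} = (1,\, -1,\, 0,\, 0,\, \dots) \;\Rightarrow\; D^{1} f_i = f_i - f_{i-1}

\alpha = 0.5: \quad w^{(0.5)} = (1,\, -0.5,\, -0.125,\, -0.0625,\, -0.0390625,\, \dots)
```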

As shown in Figure 9, the TF feature extraction technique combined with fractional enhancement resulted in the lowest accuracy for the RFC classifier, indicating its limited effectiveness in exploiting this enriched representation. In contrast, the stronger performance of BERTLR demonstrates that the benefit of fractional derivatives is most pronounced when the enhanced lexical channel is integrated with contextual semantic representations. This finding is significant because it suggests that fractional-order features are not universally advantageous across all classifiers but become substantially more effective when incorporated into a structured hybrid architecture.

To further examine the effect of fractional derivatives under a stronger lexical representation, Table 6 summarizes the corresponding category-wise accuracies when TF-IDF is used instead of TF. The comparison between TF and TF-IDF is particularly important because it clarifies whether fractional-order enhancement remains beneficial when applied to a more informative base representation. The results show that TF-IDF consistently outperforms TF, indicating that fractional derivatives are most effective when they operate on a lexically discriminative feature space.

Under TF-IDF with fractional enhancement, the proposed BERTLR model again achieves the strongest performance across all six categories, reaching the highest values in Action, Casual, Entertainment, Music & Audio, Photography, and Card. This is an important result because it shows that the improvement is not confined to a single category or a single feature setting. Instead, the gains are systematic, which strengthens the argument that fractional-order transformation contributes meaningful additional information when fused with transformer-based embeddings.

Another important observation is that the performance gap between conventional classifiers and BERTLR becomes more pronounced in the TF-IDF setting than in the TF setting. This suggests that fractional derivatives do not act as a generic performance booster for all models. Rather, their strongest effect emerges when they are integrated into a hybrid architecture that can jointly exploit contextual semantics and enriched lexical statistics. In this sense, the proposed model benefits from a principled interaction between transformer embeddings and fractional feature engineering, which helps explain why its gains remain consistent across categories.

Overall, the results in this subsection indicate that fractional derivative enhancement plays a meaningful methodological role in the proposed framework. The strongest performance is obtained not by voting alone and not by lexical transformation alone, but by the combination of fractional-order feature enrichment with transformer-based semantic modeling. This provides empirical support for the central hypothesis of this study: that memory-aware lexical features can act as a complementary signal to contextual embeddings in engineering sentiment classification.

4.3. Comparison to State-of-the-Art Methods

In this work, we assess the performance of the proposed BERTLR framework with fractional derivative-enhanced features and compare it with a range of previously reported machine learning and deep learning approaches used in text classification and sentiment analysis. The comparative results in Table 7 are intended to position the proposed framework relative to representative methods in the literature, while acknowledging that differences in datasets, class distributions, and problem settings can affect direct numerical comparability. Therefore, this comparison is interpreted as a contextual benchmark rather than as a strict one-to-one ranking.

The proposed BERTLR framework achieves strong performance in comparison with multiple baseline and hybrid methods reported in the literature. For example, LSTM-based models [49] reported an accuracy of 85%, while Particle Swarm Optimization combined with LinearSVC [50] achieved a precision of 63.6% and recall of 55.0%. Convolutional neural networks (CNNs) [51] achieved high precision and recall on a substantially larger review corpus. In this context, the proposed BERTLR framework demonstrates competitive performance, achieving up to 91% accuracy on the engineering review classification task. More importantly, the proposed model combines strong predictive performance with a structured hybrid representation strategy based on contextual transformer embeddings and fractional-order lexical feature enrichment.

Unlike many existing methods that rely either on shallow lexical representations or on end-to-end neural architectures alone, the proposed framework explicitly integrates two complementary information sources: contextual semantic embeddings and fractional derivative-enhanced TF-IDF features. This distinction is important because the model does not seek to replace transformer-based understanding with handcrafted representations but rather to introduce a mathematically motivated lexical signal that complements the semantic information learned by BERT and RoBERTa. As a result, the observed gains should be interpreted not simply as an ensemble effect, but as evidence that the hybrid semantic–fractional representation is useful for engineering sentiment analysis.

As shown in Figure 10, LRC demonstrated strong performance among the conventional classifiers when TF-IDF feature extraction was used. However, the proposed BERTLR framework consistently exceeded the performance of such single-representation models. This observation is important because it shows that strong TF-IDF-based lexical classification alone is not sufficient to match the hybrid representation obtained through the integration of contextual embeddings and fractional-order feature enhancement. In other words, the improvement achieved by BERTLR cannot be attributed only to the use of TF-IDF or only to the use of logistic regression; rather, it emerges from the structured interaction between semantic and fractional lexical representations.

A deep learning method, specifically LSTM, was also used to assess accuracy on the Google Play Store dataset. The architecture of the LSTM network used in this study is depicted in Figure 11, where an embedding layer placed between the input and LSTM layers transforms the input word indices into dense word embeddings. A rectified linear unit (ReLU) activation function was selected because of its effective performance with text data [49,58]. A dropout layer with a rate of 0.5 was employed for regularization, and a sigmoid function in the final layer generates the probability of each class [59]. The Adam optimizer was used because it has been shown to perform better with noisy data [60]. LSTM achieved an accuracy of approximately 0.85, confirming that sequence-based neural models are effective for sentiment analysis; however, the proposed BERTLR framework still performs better on the present task.

It is important to note that, in this study, transformer models such as BERT and RoBERTa are employed as feature extractors rather than as fully fine-tuned end-to-end classifiers. While fine-tuned transformer models can provide strong baseline performance, the objective of this work is to investigate the complementary role of fractional derivative-enhanced lexical features when combined with contextual embeddings within a hybrid framework. The use of pre-trained embeddings allows a controlled comparison between conventional lexical features and fractional-order transformations under a unified classification setting.
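To make the LSTM baseline described above concrete, the following inference-only NumPy sketch runs one sequence of token IDs through an embedding layer, a single LSTM layer, a ReLU dense layer, dropout with rate 0.5, and a sigmoid output. It is not the authors' TensorFlow model: the class name `TinyLSTMClassifier`, the layer sizes, the placement of the ReLU layer after the LSTM, the binary single-sigmoid output, and the random (untrained) weights are all assumptions, and training with Adam is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyLSTMClassifier:
    """Inference-only sketch: embedding -> LSTM -> dense(ReLU) -> dropout -> dense(sigmoid).

    ASSUMPTIONS: layer sizes, the ReLU layer's position, and the random (untrained)
    weights are illustrative; training with the Adam optimizer is not shown.
    """

    def __init__(self, vocab_size, emb_dim=16, hidden=32, dense=16, dropout=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0.0, 0.1, (vocab_size, emb_dim))        # embedding table
        # One stacked matrix for the input, forget, candidate, and output gates.
        self.W = rng.normal(0.0, 0.1, (emb_dim + hidden, 4 * hidden))
        self.b = np.zeros(4 * hidden)
        self.W1 = rng.normal(0.0, 0.1, (hidden, dense))
        self.b1 = np.zeros(dense)
        self.W2 = rng.normal(0.0, 0.1, (dense, 1))
        self.b2 = np.zeros(1)
        self.hidden = hidden
        self.dropout = dropout
        self.rng = rng

    def forward(self, token_ids, training=False):
        H = self.hidden
        h = np.zeros(H)
        c = np.zeros(H)
        for t in token_ids:
            x = self.E[t]                                    # embedding lookup
            z = np.concatenate([x, h]) @ self.W + self.b
            i_gate = sigmoid(z[:H])
            f_gate = sigmoid(z[H:2 * H])
            g = np.tanh(z[2 * H:3 * H])
            o_gate = sigmoid(z[3 * H:])
            c = f_gate * c + i_gate * g                      # cell state update
            h = o_gate * np.tanh(c)                          # hidden state update
        a = np.maximum(0.0, h @ self.W1 + self.b1)           # ReLU dense layer
        if training:
            # Inverted dropout: zero units with probability `dropout`, rescale the rest.
            keep = self.rng.random(a.shape) >= self.dropout
            a = a * keep / (1.0 - self.dropout)
        return float(sigmoid(a @ self.W2 + self.b2)[0])      # P(positive class)

model = TinyLSTMClassifier(vocab_size=100)
p = model.forward([4, 17, 42, 8])  # hypothetical token IDs for one review
print(p)
```

Dropout is active only when `training=True`; at inference the layer is the identity, which is the usual behavior of a dropout regularizer.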

A direct comparison with fully fine-tuned transformer models constitutes an important direction for future work. Such experiments would enable further evaluation of whether fractional-order feature modeling provides additional benefits beyond end-to-end transformer architectures.

The performance improvements of the proposed BERTLR framework can be attributed to the complementary interaction between contextual semantic embeddings and fractional lexical feature enhancement. While conventional models rely either on local lexical statistics (e.g., TF-IDF) or sequential modeling (e.g., LSTM), the hybrid architecture combines semantic understanding with memory-aware feature transformation, enabling richer representation of sentiment signals.

However, it is also observed that the performance gains over strong baselines such as BERT and LSTM are relatively moderate in certain categories. This can be explained by the fact that transformer-based embeddings already capture substantial contextual information, leaving limited room for additional improvement. In such cases, the contribution of fractional-order features acts as a complementary refinement rather than a dominant factor. These observations highlight that the effectiveness of the proposed hybrid approach depends on the complexity and variability of the underlying dataset.

Several studies have shown mixed results when deep learning is applied to smaller datasets. Wang et al. [61] noted that deep learning-based approaches can perform poorly on small datasets. However, Zampieri et al. [50] examined foul language detection on social media using SVM, CNN, and bidirectional LSTM (BiLSTM), finding that BiLSTM and CNN outperformed classical machine learning SVM on relatively small datasets. These findings suggest that architecture suitability depends strongly on dataset characteristics [62]. In the present work, the proposed hybrid semantic–fractional architecture appears to be particularly well suited to engineering review data, where both contextual meaning and lexical nuance contribute to sentiment expression.

Table 7 presents a comparative analysis of various machine learning and deep learning methodologies applied to classification tasks across different datasets. The literature covers models ranging from optimization-enhanced ensembles such as Particle Swarm Optimization (PSO) combined with LinearSVC [50] to more complex neural network models such as CNNs [51], hyperparameter-optimized machine learning models [52], SVM-based systems [53], big data deep learning approaches [54], gradient boosting sentiment models [55], logistic regression baselines [56], and ensemble methods based on random subspace learning [57]. Taken together, these results show that no single model is universally dominant across all tasks and datasets. This reinforces the importance of designing representations that are well matched to the target domain.

From this perspective, the contribution of the proposed framework lies not only in its accuracy but also in its representation strategy. By introducing a fractional derivative-based enhancement layer into a transformer-driven sentiment analysis pipeline, the proposed method provides a more expressive hybrid architecture for engineering review classification. The results therefore support the view that the combination of contextual transformer embeddings and memory-aware fractional lexical features offers a meaningful methodological advantage for this domain, beyond the effect of conventional model aggregation alone.

4.4. Discussion on Limitations

While the hybrid BERTLR model combined with fractional derivatives has demonstrated strong performance, it is important to acknowledge certain limitations. For instance, the model may struggle with reviews that contain very ambiguous or contradictory sentiments, where the context is unclear or overly complex. In such cases, the model may fail to accurately classify sentiments, which could lead to misinterpretation of user opinions. Future work could focus on developing advanced techniques for handling such ambiguous sentiment reviews, perhaps by integrating context-aware mechanisms or fine-tuning the model to better detect conflicting sentiments.

Additionally, the model’s reliance on the availability of large labeled datasets means that its performance might degrade when applied to smaller, less diverse datasets. This limitation is especially evident in scenarios where only limited training data is available, which could lead to overfitting or inaccurate results. Future research could address this issue by exploring transfer learning approaches or semi-supervised learning to improve the model’s performance on smaller datasets.

Although the BERTLR model performs well across a broad range of review datasets, future work should investigate its adaptability to niche domains, such as specialized product categories or industry-specific reviews. Such exploration may reveal whether domain-specific training data can further improve classification accuracy and generalization.

Moreover, the computational cost associated with deep learning models—particularly when combined with fractional derivative-based feature extraction—can present challenges for real-time deployment or use in resource-constrained environments. Future research could explore architectural optimizations or lightweight model variants to enhance computational efficiency, making the framework more suitable for deployment on edge devices or in latency-sensitive systems.

Implementation Details: All experiments were conducted on a system equipped with an Intel Core i7 processor, 32 GB RAM, and an NVIDIA RTX GPU. The hybrid model was implemented in Python 3.9 using the TensorFlow and Scikit-learn libraries.

4.5. Additional Results from RoBERTa, Ablation Studies, and Ensemble Voting

To evaluate the novelty and robustness of the proposed hybrid sentiment analysis framework, we conducted additional experiments using the RoBERTa transformer model, fractional derivative (FD) feature variants, and ensemble voting. Table 8 presents a comparative analysis between BERTLR (our proposed model) and RoBERTa variants with and without fractional features, along with a combined ensemble configuration using soft voting. These extended variants help clarify the individual and joint contributions of transformer embeddings and fractional calculus.

As shown in Table 8, the proposed BERTLR model with fractional derivative features consistently achieved the highest accuracy across all categories, confirming its strength in engineering sentiment classification. The RoBERTa + FD variant also performed strongly, though slightly below BERTLR. Meanwhile, the ensemble model using soft voting provided balanced and reliable results, benefiting from the complementary characteristics of both transformer-based architectures. These findings reaffirm the contribution of fractional feature augmentation and model integration strategies.

Figure 12 illustrates the accuracy distribution across different model configurations. Notably, BERTLR emerged as the best-performing individual model, while the RoBERTa-FD and ensemble methods demonstrated competitive accuracy, further validating the robustness and flexibility of the proposed framework for sentiment classification in engineering domains.

Table 9 presents the ablation study. The results show that FD features alone fail to provide useful predictive power, while BERT+LR achieves strong performance. The proposed BERTLR (BERT + FD + LR) maintains high accuracy with balanced precision, recall, and F1, confirming the complementary role of fractional derivatives.

4.6. Explainability and Interpretation

To increase transparency, we analyzed the proposed BERTLR classifier using SHAP (SHapley Additive exPlanations). We employed a linear SHAP explainer with a background subset of the training instances to estimate feature contributions for the combined representation (BERT CLS embeddings concatenated with fractional-differenced TF-IDF features). Figure 13 reports the global importance ranking: the model relies on a small subset of latent embedding dimensions together with several fractional TF-IDF channels, indicating that the fractional operator contributes non-redundant information beyond the contextual embeddings. For interpretability at the lexical level, we additionally trained a TF-IDF + LR surrogate and computed word-level SHAP values (Figure 14), which highlight sentiment-bearing tokens as the most influential (e.g., “excellent”, “crash”, “useless”, “smooth”). These analyses support the validity of the hybrid design and clarify why the combined features improve predictive performance.
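For a linear classifier such as LR operating on the concatenated features, SHAP values in log-odds space have a closed form when features are treated as independent: phi_j = w_j (x_j - E[x_j]), with the expectation estimated from the background sample. The sketch below uses random stand-in data and coefficients (all hypothetical, with toy dimensions) to show the additivity property and how a global ranking like the one in Figure 13 can be aggregated from per-instance values and split between embedding and fractional TF-IDF dimensions.

```python
import numpy as np

def linear_shap(w: np.ndarray, x: np.ndarray, background: np.ndarray) -> np.ndarray:
    """SHAP values of the linear score f(x) = w @ x + b (log-odds for LR), treating
    features as independent and using `background` to estimate E[x]:
    phi_j = w_j * (x_j - E[x_j])."""
    return w * (x - background.mean(axis=0))

def logit(v: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    return v @ w + b

rng = np.random.default_rng(0)
d_emb, d_fd = 4, 3                                   # toy sizes, not the paper's dimensions
background = rng.normal(size=(50, d_emb + d_fd))     # stand-in for the training subset
w = rng.normal(size=d_emb + d_fd)                    # stand-in for fitted LR coefficients
b = 0.1
x = rng.normal(size=d_emb + d_fd)                    # one instance to explain

phi = linear_shap(w, x, background)
# Local accuracy: contributions sum to f(x) minus the average score on the background.
print(np.isclose(phi.sum(), logit(x, w, b) - logit(background, w, b).mean()))  # True

# Global importance: mean |phi| over instances, split into embedding vs. fractional TF-IDF blocks.
phis = np.array([linear_shap(w, row, background) for row in background])
importance = np.abs(phis).mean(axis=0)
print(importance[:d_emb].sum(), importance[d_emb:].sum())
```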

5. Conclusions

This study presented a hybrid AI framework that integrates Bidirectional Encoder Representations from Transformers (BERT) with Logistic Regression (LR), augmented by fractional derivative-based feature engineering for enhanced sentiment classification. By combining deep contextual embeddings with mathematically enriched lexical transformations, the proposed framework provides a hybrid semantic–fractional representation for domain-specific engineering sentiment analysis. This design goes beyond the use of conventional ensemble aggregation by explicitly integrating contextual semantic information with memory-aware fractional-order feature modeling.

Empirical evaluations demonstrated that TF-IDF consistently outperformed TF, while the introduction of fractional derivative-based features further improved representational granularity and predictive performance. The proposed BERTLR framework achieved accuracy levels of up to 0.91 across the evaluated categories, indicating that fractional-order lexical enhancement can provide a meaningful complementary signal to transformer-based embeddings. In addition, SHAP-based explainability analyses confirmed that fractional TF-IDF features contribute non-redundant information alongside contextual embeddings, helping to address the interpretability limitations often associated with deep language models.

The SHAP-based explainability analysis further highlights how the proposed hybrid framework improves interpretability by identifying the relative contribution of individual features. The results suggest that fractional derivative-enhanced TF-IDF features capture complementary lexical information that is not fully represented by transformer-based embeddings alone.

Moreover, the SHAP values indicate that the model integrates both semantic context and memory-aware lexical patterns when making predictions. This supports the effectiveness of the proposed hybrid architecture, demonstrating that the combination of fractional feature modeling and transformer embeddings leads to more informative and reliable sentiment classification decisions.

Additional experiments using RoBERTa embeddings, with and without fractional derivative augmentation, further supported the generality of the proposed approach. Although RoBERTa+FD achieved competitive performance, BERTLR remained the strongest individual model. The ensemble configuration offered stable and complementary improvements, suggesting that the proposed semantic–fractional integration is robust across transformer backbones. While an LSTM baseline was also explored, it did not yield significant gains. Nevertheless, prior work [58] has demonstrated that bidirectional LSTM and convolutional neural network (CNN) models can outperform traditional ML classifiers in related sentiment classification settings. This contrast highlights that classification performance depends not only on model complexity but also on the suitability of the representation strategy for the target domain.

Overall, this work demonstrates the value of integrating fractional calculus with large language models for sentiment analysis in technical domains. The main contribution of the study lies in showing that fractional-order lexical modeling can be used as a principled and interpretable complement to transformer-based semantic embeddings, thereby improving sentiment discrimination in engineering review text. In addition to its methodological contribution, the framework has practical relevance for applied engineering analytics, where interpretable sentiment modeling can support product assessment, service optimization, and data-driven industrial decision-making. The methodology is extensible to other domains where nuanced text interpretation is critical, offering a transparent and mathematically motivated framework for future AI-driven decision-support systems.

However, transferring the proposed fractional feature extraction mechanism to domains with substantially different vocabulary distributions may present challenges. In particular, domain shifts involving highly specialized terminology, sparse lexical patterns, or different semantic structures may affect the stability and effectiveness of fractional-order transformations. The weighting behavior of the Grünwald–Letnikov operator depends on the structure and ordering of features, which may differ significantly across domains. Therefore, domain adaptation strategies, parameter tuning of the fractional order, or integration with domain-specific embeddings may be required to maintain performance when applying the framework beyond engineering review data.

Future Work

Future research may address ambiguous or mixed-sentiment inputs by incorporating attention mechanisms or reinforcement learning to better capture contextual nuances. Extending the framework to multilingual datasets and to other domain-specific corpora, such as financial or clinical texts, would further test its generalizability.

Transfer learning using larger multilingual models and semi-supervised learning on limited labeled data may enhance cross-lingual performance. The RoBERTa experiments suggest that exploring additional transformer variants, such as XLNet or DeBERTa, could yield further insights into adaptability. Deeper ablations, including varying the order of the fractional derivative or integrating domain-specific embeddings, may also advance the theoretical understanding of feature transformation benefits.
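For reference, the fractional order α enters the transformation through the Grünwald–Letnikov coefficients. The block below uses common textbook notation (step h, truncation length N), which may differ from the symbols defined in Section 3.6; the Gamma-function form and the asymptotic decay hold for non-integer α.

```latex
\begin{aligned}
D^{\alpha} f(t) &\approx h^{-\alpha} \sum_{k=0}^{N} w_k^{(\alpha)} \, f(t - kh),
\qquad w_k^{(\alpha)} = (-1)^k \binom{\alpha}{k}
  = \frac{\Gamma(k-\alpha)}{\Gamma(-\alpha)\,\Gamma(k+1)}, \\
w_0^{(\alpha)} &= 1, \qquad
w_k^{(\alpha)} = \Bigl(1 - \frac{\alpha+1}{k}\Bigr) w_{k-1}^{(\alpha)} \quad (k \ge 1), \\
w_k^{(\alpha)} &\sim \frac{k^{-(1+\alpha)}}{\Gamma(-\alpha)} \quad (k \to \infty).
\end{aligned}
```

At α = 0 every coefficient after w_0 vanishes (the identity map), and at α = 1 the sum reduces to the backward difference (f(t) - f(t - h))/h. For 0 < α < 1 the coefficients after w_0 are negative and decay as a power law, so an ablation over α sweeps between these two regimes and controls how strongly preceding features contribute.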

Finally, future work should optimize computational efficiency through pruning, quantization, or knowledge distillation, enabling lightweight deployment on edge devices. Real-world applications such as social media monitoring, medical sentiment analysis, and customer feedback prediction stand to benefit from the proposed hybrid framework, particularly in contexts where accuracy, interpretability, and efficiency are equally essential.

Author Contributions

Conceptualization, A.K.; methodology, A.K.; formal analysis, A.K.; visualization, S.L.; writing—original draft preparation, A.K. and E.T.; writing—review and editing, A.K., I.c.J. and S.L.; supervision, I.c.J.; project administration, I.c.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2022-NR070859).

Institutional Review Board Statement

Not applicable. This study did not involve human participants or animals and required no institutional ethical approval.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw review texts were collected from the publicly accessible Google Play Store interface. To respect platform terms and user privacy, we share derived features and analysis code rather than raw text. Materials are available from the corresponding author upon reasonable request.

Acknowledgments

The authors did not use generative AI or AI-assisted tools for writing, data analysis, figure creation, or manuscript preparation. All content is the authors’ own work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. Statistical and machine learning forecasting methods: Concerns and ways forward. PLoS ONE 2018, 13, e0194889.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
Guan, Y.; Li, C.T.; Roli, F. On Reducing the Effect of Covariate Factors in Gait Recognition: A Classifier Ensemble Method. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1521–1528.
Karim, A.; Azhari, A.; Belhaouri, S.B.; Qureshi, A.A.; Ahmad, M. Methodology for analyzing the traditional algorithms performance of user reviews using machine learning techniques. Algorithms 2020, 13, 202.
LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
Rustam, F.; Ashraf, I.; Mehmood, A.; Ullah, S.; Choi, G.S. Tweets classification on the base of sentiments for US airline companies. Entropy 2019, 21, 1078.
Awan, F.M.; Saleem, Y.; Minerva, R.; Crespi, N. A comparative analysis of machine/deep learning models for parking space availability prediction. Sensors 2020, 20, 322.
Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
Hazar, M.A.; Odabasioglu, N.; Ensari, T.; Kavurucu, Y.; Sayan, O.F. Performance analysis and improvement of machine learning algorithms for automatic modulation recognition over Rayleigh fading channels. Neural Comput. Appl. 2018, 29, 351–360.
Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
Narayanan, B.N.; Djaneye-Boundjou, O.; Kebede, T.M. Performance analysis of machine learning and pattern recognition algorithms for malware classification. In Proceedings of the 2016 IEEE National Aerospace and Electronics Conference (NAECON) and Ohio Innovation Summit (OIS), Dayton, OH, USA, 25–29 July 2016; pp. 338–342.
Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; pp. 278–282.
Rajendran, S.; Meert, W.; Giustiniano, D.; Lenders, V.; Pollin, S. Deep learning models for wireless signal classification with distributed low-cost spectrum sensors. IEEE Trans. Cogn. Commun. Netw. 2018, 4, 433–445.
Hannun, A.Y.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A.; et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv 2014, arXiv:1412.5567.
Zhu, S.; Qi, J.; Hu, J.; Huang, H. Intelligent product redesign strategy with ontology-based fine-grained sentiment analysis—An evolutionary form design method based on aesthetic dimension selection and NSGA-II. Artif. Intell. Eng. Des. Anal. Manuf. 2021, 35, 295–315.
Mahabub, A. A robust technique of fake news detection using Ensemble Voting Classifier and comparison with other classifiers. SN Appl. Sci. 2020, 2, 525.
Javed, M.S.; Majeed, H.; Mujtaba, H.; Beg, M.O. Fake reviews classification using deep learning ensemble of shallow convolutions. J. Comput. Soc. Sci. 2021, 4, 883–902.
Budhi, G.S.; Chiong, R.; Pranata, I.; Hu, Z. Using Machine Learning to Predict the Sentiment of Online Reviews: A New Framework for Comparative Analysis. Arch. Comput. Methods Eng. 2021, 28, 2543–2566.
Li, L.; Goh, T.T.; Jin, D. How textual quality of online reviews affect classification performance: A case of deep learning sentiment analysis. Neural Comput. Appl. 2020, 32, 4387–4415.
Ismail Fawaz, H.; Forestier, G.; Weber, J.; Idoumghar, L.; Muller, P.A. Deep learning for time series classification: A review. Data Min. Knowl. Discov. 2019, 33, 917–963.
Chu, L.; Qiu, R.; Liu, H.; Ling, Z.; Zhang, T.; Wang, J. Individual Recognition in Schizophrenia using Deep Learning Methods with Random Forest and Voting Classifiers: Insights from Resting State EEG Streams. arXiv 2017, arXiv:1707.03467.
Prabhakar, E.; Kumar, K.N.; Karthikeyan, S.; Kumar, A.N.; Kavin, P. Smart online voting and enhanced deep learning to identify voting patterns. Int. Res. J. Mod. Eng. Technol. Sci. 2021, 3, 162–165.
Banfield, R.E.; Hall, L.O.; Bowyer, K.W.; Kegelmeyer, W.P. A comparison of decision tree ensemble creation techniques. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 173–180.
Damdoo, R.; Kalyani, K. Multilevel voter identity protocol for secure online voting. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 3741–3745.
Salminen, J.; Hopf, M.; Chowdhury, S.A.; Jung, S.-G.; Almerekhi, H.; Jansen, B.J. Developing an online hate classifier for multiple social media platforms. Hum.-Centric Comput. Inf. Sci. 2020, 10, 1.
Dzisevic, R.; Sesok, D. Text Classification using Different Feature Extraction Approaches. In Proceedings of the 2019 Open Conference of Electrical, Electronic and Information Sciences (eStream), Vilnius, Lithuania, 25 April 2019; pp. 2019–2022.
Rahman, M.M.; Rahman, S.S.M.M.; Allayear, S.M.; Patwary, M.F.K.; Munna, M.T.A. A Sentiment Analysis Based Approach for Understanding the User Satisfaction on Android Application. In Data Engineering and Communication Technology; Springer: Singapore, 2020; pp. 397–407.
Zhao, Y.; Li, Z.; Pan, Y.; Wang, J.; Wang, Y. LB-KBQA: Large-language-model and BERT based Knowledge-Based Question and Answering System. arXiv 2024, arXiv:2402.05130.
Vanak, J. Artificial Intelligence and Medicine. Sci. Insights 2022, 41, 567–575.
Bhuvanapriya, R.; Rozil Banu, S.; Sivapriya, P.; Kalaiselvi, V.K.G. Smart voting. In Proceedings of the 2017 2nd International Conference on Computing and Communications Technologies (ICCCT), Chennai, India, 23–24 February 2017; pp. 143–147.
Colley, R.; Grandi, U.; Novaro, A. Smart voting. IJCAI Int. Jt. Conf. Artif. Intell. 2020, 2021, 1734–1740.
Yu, S.; Su, J.; Luo, D. Improving BERT-Based Text Classification with Auxiliary Sentence and Domain Knowledge. IEEE Access 2019, 7, 176600–176612.
Adams, M. differint: A Python package for numerical fractional calculus. arXiv 2019, arXiv:1912.05303.
Karim, A.; Azhari, A.; Shahroz, M.; Brahim Belhaouri, S.; Mustofa, K. LDSVM: Leukemia Cancer Classification Using Machine Learning. Comput. Mater. Contin. 2022, 71, 3887–3903.
Sahloul, H.; Shirafuji, S.; Ota, J. An Accurate and Efficient Voting Scheme for a Maximally All-Inlier 3D Correspondence Set. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2287–2298.
Pothina, C.K.; Reddy, A.I.; Cv, R. Smart Voting System using Facial Detection. Int. J. Innov. Technol. Explor. Eng. 2020, 9, 2208–2213.
Islam, M.S.; Qaraqe, M.K.; Belhaouari, S.B.; Petrovski, G. Long Term HbA1c Prediction Using Multi-Stage CGM Data Analysis. IEEE Sens. J. 2021, 21, 15237–15247.
Panahi, M.; Dodangeh, E.; Rezaie, F.; Khosravi, K.; Van Le, H.; Lee, M.-J.; Lee, S.; Pham, B.T. Flood spatial prediction modeling using a hybrid of meta-optimization and support vector regression modeling. Catena 2021, 199, 105114.
Cheng, K.; Lu, Z. Adaptive Bayesian support vector regression model for structural reliability analysis. Reliab. Eng. Syst. Saf. 2019, 206, 107286.
Alhuzali, H.; Zhang, T.; Ananiadou, S. Predicting sign of depression via using frozen pre-trained models and random forest classifier. CEUR Workshop Proc. 2021, 2936, 888–896.
Jiao, S.; Xu, L.; Ju, Y. CWLy-RF: A novel approach for identifying cell wall lyases based on random forest classifier. Genomics 2021, 113, 2919–2924.
Meshram, S.G.; Safari, M.J.S.; Khosravi, K.; Meshram, C. Iterative classifier optimizer-based pace regression and random forest hybrid models for suspended sediment load prediction. Environ. Sci. Pollut. Res. 2021, 28, 11637–11649.
Wang, F.; Wang, Q.; Nie, F.; Li, Z.; Yu, W.; Ren, F. A linear multivariate binary decision tree classifier based on K-means splitting. Pattern Recognit. 2020, 107, 107521.
Van Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955.
Shahraki, A.; Abbasi, M.; Haugen, Ø. Boosting algorithms for network intrusion detection: A comparative evaluation of Real AdaBoost, Gentle AdaBoost and Modest AdaBoost. Eng. Appl. Artif. Intell. 2020, 94, 103770.
Lipton, Z.C.; Elkan, C.; Narayanaswamy, B. Optimal thresholding of classifiers to maximize F1 measure. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8725, pp. 225–239.
Ashraf, I.; Hur, S.; Park, Y. BLocate: A building identification scheme in GPS denied environments using smartphone sensors. Sensors 2018, 18, 3862.
Ashraf, I.; Hur, S.; Park, Y. MagIO: Magnetic field strength based indoor-outdoor detection with a commercial smartphone. Micromachines 2018, 9, 534.
Klang, E.; Levin, M.A.; Soffer, S.; Zebrowski, A.; Glicksberg, B.S. A simple free-text-like method for extracting semi-structured data from electronic health records: Exemplified in prediction of in-hospital mortality. Big Data Cogn. Comput. 2021, 5, 40.
Zampieri, M.; Malmasi, S.; Nakov, P.; Rosenthal, S.; Farra, N.; Kumar, R. Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; Volume 1, pp. 1415–1420.
Bhatia, M.P.S.; Kumar, A.; Beniwal, R. An Optimized Classification of Apps Reviews for Improving Requirement Engineering. Recent Adv. Comput. Sci. Commun. 2019, 14, 1390–1399.
Aslam, N.; Ramay, W.Y.; Xia, K.; Sarwar, N. Convolutional neural network based classification of app reviews. IEEE Access 2020, 8, 185619–185628.
Aldabbas, H.; Bajahzar, A.; Alruily, M.; Qureshi, A.A.; Amir Latif, R.M.; Farhan, M. Google Play Content Scraping and Knowledge Engineering using Natural Language Processing Techniques with the Analysis of User Reviews. J. Intell. Syst. 2020, 30, 192–208.
Magar, B.T.; Mali, S.; Abdelfattah, E. App Success Classification Using Machine Learning Models. In Proceedings of the 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC), Virtual, 27–30 January 2021; pp. 642–647.
Ruhela, S.; Saini, H. Google Playstore Application Analysis and Prediction; Jaypee University of Information Technology: Solan, HP, India, 2019; Available online: https://ir.juit.ac.in/ (accessed on 3 April 2026).
Suleman, M.; Malik, A.; Hussain, S.S. Google Play Store App Ranking Prediction Using Machine Learning Algorithm. 2019. Available online: https://www.researchgate.net/ (accessed on 3 April 2026).
Adeli, E.; Li, X.; Kwon, D.; Zhang, Y.; Pohl, K.M. Logistic Regression Confined by Cardinality-Constrained Sample and Feature Selection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 1713–1728.
Silva, J.; Praça, I.; Pinto, T.; Vale, Z. Energy consumption forecasting using ensemble learning algorithms. In Distributed Computing and Artificial Intelligence; Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2020; Volume 1004, pp. 5–13.
Huang, C.; Zhu, J.; Liang, Y.; Yang, M.; Fung, G.P.C.; Luo, J. An efficient automatic multiple objectives optimization feature selection strategy for internet text classification. Int. J. Mach. Learn. Cybern. 2019, 10, 1151–1163.
Agnihotri, D.; Verma, K.; Tripathi, P.; Singh, B.K. Soft voting technique to improve the performance of global filter based feature selection in text corpus. Appl. Intell. 2019, 49, 1597–1619.
Wang, D.; Gong, J.; Song, Y. W-RNN: News text classification based on a Weighted RNN. arXiv 2019, arXiv:1909.13077.
Feng, S.; Zhou, H.; Dong, H. Using deep neural network with small dataset to predict material defects. Mater. Des. 2019, 162, 300–310.

Figure 1. Sentiment classification framework based on the proposed hybrid ensemble combining BERTLR and RoBERTaLR.

Figure 2. Data collection process using API-based access and dataset sampling.

Figure 3. Distribution of reviews by rating (1 to 5).

Figure 4. Classification of Google Play Store application reviews.

Figure 5. Ensemble learning (voting classifier) architecture.

Figure 6. Classifier accuracy with TF-IDF.

Figure 7. Classifier accuracy with TF.

Figure 8. Comparison of classifier accuracy with TF and TF-IDF feature extraction.

Figure 9. Classifier accuracy with TF using fractional derivatives.

Figure 10. Accuracy of classifiers with TF-IDF using fractional derivatives.

Figure 11. LSTM model architecture.

Figure 12. Comparison of Accuracy across BERTLR, RoBERTa, and Ensemble Variants.

Figure 13. SHAP summary (bar) for BERTLR on the test subset, showing top-20 feature contributions across concatenated BERT and fractional TF-IDF features.

Figure 14. Word-level SHAP importance from a TF-IDF + LR surrogate model, illustrating the most influential terms.

Table 1. Category-wise accuracy comparison of classifiers based on TF-IDF.

Classifier	Action	Casual	Entertainment	Music & Audio	Photography	Card
SVM	0.69	0.75	0.72	0.71	0.71	0.68
DTC	0.65	0.71	0.68	0.67	0.67	0.65
RFC	0.67	0.74	0.67	0.67	0.68	0.66
NBC	0.69	0.74	0.72	0.71	0.71	0.68
LRC	0.75	0.79	0.77	0.76	0.75	0.75
ABC	0.70	0.75	0.73	0.72	0.71	0.69
CC	0.70	0.75	0.73	0.72	0.71	0.69
BERT	0.84	0.89	0.87	0.82	0.87	0.86
BERTLR	0.88	0.91	0.88	0.89	0.88	0.90
Hard Voting	0.71	0.76	0.74	0.73	0.72	0.70
Soft Voting	0.70	0.75	0.72	0.72	0.71	0.69
LSTM	0.85	0.85	0.85	0.85	0.84	0.84

Table 2. Category-wise accuracy comparison of classifiers based on TF.

Classifier	Action	Casual	Entertainment	Music & Audio	Photography	Card
SVM	0.68	0.74	0.707	0.701	0.706	0.682
DTC	0.65	0.71	0.691	0.671	0.681	0.654
RFC	0.67	0.74	0.673	0.678	0.682	0.667
NBC	0.70	0.75	0.742	0.733	0.726	0.710
LRC	0.74	0.77	0.793	0.775	0.794	0.741
ABC	0.71	0.757	0.730	0.729	0.719	0.695
CC	0.71	0.757	0.730	0.729	0.719	0.695
BERT	0.84	0.89	0.87	0.82	0.87	0.86
BERTLR	0.872	0.89	0.88	0.894	0.90	0.874
Hard Voting	0.71	0.757	0.740	0.728	0.726	0.706
Soft Voting	0.70	0.753	0.737	0.724	0.721	0.702
LSTM	0.85	0.850	0.854	0.852	0.846	0.848

Table 3. Category-wise precision, recall, and F1-score of classifiers based on TF.

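Classifier	Action Prec.	Action Recall	Action F1	Casual Prec.	Casual Recall	Casual F1	Entertainment Prec.	Entertainment Recall	Entertainment F1	Music & Audio Prec.	Music & Audio Recall	Music & Audio F1	Photography Prec.	Photography Recall	Photography F1	Card Prec.	Card Recall	Card F1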
SVM	0.586	0.689	0.600	0.640	0.744	0.665	0.624	0.707	0.634	0.619	0.701	0.624	0.622	0.706	0.625	0.598	0.682	0.591
DTC	0.610	0.658	0.631	0.670	0.711	0.688	0.638	0.691	0.660	0.623	0.671	0.644	0.624	0.681	0.646	0.603	0.654	0.623
RFC	0.563	0.670	0.538	0.557	0.746	0.638	0.453	0.673	0.542	0.565	0.678	0.548	0.506	0.682	0.553	0.445	0.667	0.533
NBC	0.604	0.709	0.631	0.674	0.756	0.701	0.661	0.742	0.676	0.649	0.733	0.668	0.643	0.726	0.660	0.622	0.710	0.637
LRC	0.654	0.747	0.664	0.721	0.774	0.711	0.721	0.776	0.714	0.679	0.755	0.698	0.657	0.754	0.665	0.662	0.741	0.675
ABC	0.616	0.712	0.628	0.661	0.757	0.681	0.644	0.730	0.654	0.650	0.729	0.653	0.625	0.719	0.643	0.585	0.695	0.609
BERT	0.83	0.84	0.84	0.83	0.89	0.86	0.85	0.87	0.86	0.86	0.88	0.87	0.85	0.87	0.86	0.86	0.88	0.87
BERTLR	0.88	0.87	0.86	0.88	0.87	0.88	0.89	0.85	0.89	0.87	0.87	0.89	0.88	0.87	0.87	0.86	0.88	0.89
Hard Voting	0.615	0.713	0.635	0.667	0.757	0.687	0.658	0.740	0.669	0.642	0.728	0.654	0.646	0.726	0.649	0.631	0.706	0.623
Soft Voting	0.617	0.706	0.645	0.670	0.753	0.697	0.656	0.737	0.680	0.644	0.724	0.668	0.639	0.721	0.659	0.621	0.702	0.641
LSTM	0.859	0.849	0.854	0.849	0.853	0.851	0.862	0.845	0.853	0.853	0.853	0.853	0.848	0.845	0.846	0.846	0.853	0.849

Table 4. Category-wise precision, recall, and F1-score of classifiers based on TF-IDF.

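Classifier	Action Prec.	Action Recall	Action F1	Casual Prec.	Casual Recall	Casual F1	Entertainment Prec.	Entertainment Recall	Entertainment F1	Music & Audio Prec.	Music & Audio Recall	Music & Audio F1	Photography Prec.	Photography Recall	Photography F1	Card Prec.	Card Recall	Card F1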
SVM	0.621	0.699	0.606	0.687	0.753	0.666	0.625	0.721	0.636	0.627	0.711	0.620	0.629	0.710	0.621	0.630	0.687	0.586
DTC	0.604	0.658	0.627	0.667	0.716	0.688	0.631	0.688	0.655	0.624	0.673	0.645	0.619	0.678	0.642	0.595	0.650	0.617
RFC	0.449	0.670	0.538	0.557	0.746	0.638	0.453	0.673	0.541	0.565	0.678	0.548	0.546	0.682	0.553	0.445	0.667	0.533
NBC	0.602	0.693	0.587	0.628	0.748	0.643	0.620	0.729	0.639	0.587	0.712	0.613	0.607	0.717	0.621	0.595	0.688	0.578
LRC	0.685	0.789	0.746	0.785	0.796	0.793	0.768	0.798	0.783	0.685	0.778	0.769	0.753	0.761	0.760	0.723	0.740	0.674
ABC	0.592	0.702	0.618	0.662	0.757	0.684	0.639	0.730	0.654	0.641	0.725	0.649	0.612	0.715	0.636	0.586	0.691	0.607
CC	0.592	0.702	0.618	0.662	0.757	0.684	0.639	0.730	0.654	0.641	0.725	0.649	0.612	0.715	0.636	0.586	0.691	0.607
BERT	0.83	0.84	0.84	0.83	0.89	0.86	0.85	0.87	0.86	0.86	0.88	0.87	0.85	0.87	0.86	0.86	0.88	0.87
BERTLR	0.88	0.90	0.896	0.870	0.910	0.872	0.886	0.895	0.879	0.88	0.89	0.879	0.89	0.87	0.88	0.86	0.872	0.87
Hard Voting	0.632	0.715	0.629	0.670	0.763	0.687	0.665	0.742	0.665	0.654	0.732	0.651	0.641	0.725	0.643	0.620	0.704	0.614
Soft Voting	0.615	0.702	0.643	0.675	0.754	0.701	0.650	0.729	0.675	0.642	0.720	0.666	0.632	0.715	0.655	0.613	0.694	0.633
LSTM	0.859	0.849	0.854	0.849	0.853	0.851	0.862	0.845	0.853	0.853	0.853	0.853	0.848	0.845	0.846	0.846	0.853	0.849

Table 5. Six-category accuracy comparison of classifiers based on TF with fractional derivative.

Classifier	Action	Casual	Entertainment	Music & Audio	Photography	Card
SVM	0.699	0.752	0.716	0.713	0.717	0.694
DTC	0.669	0.724	0.711	0.680	0.693	0.666
RFC	0.681	0.756	0.682	0.689	0.695	0.679
NBC	0.710	0.758	0.753	0.745	0.737	0.723
LRC	0.729	0.775	0.755	0.743	0.735	0.725
ABC	0.714	0.766	0.742	0.738	0.728	0.705
CC	0.722	0.763	0.744	0.737	0.727	0.716
BERTLR	0.884	0.89	0.88	0.881	0.876	0.868
Hard Voting	0.733	0.768	0.753	0.739	0.737	0.717
Soft Voting	0.715	0.762	0.747	0.735	0.734	0.714

Table 6. Six-category accuracy comparison of classifiers based on TF-IDF with fractional derivative.

Classifier	Action	Casual	Entertainment	Music & Audio	Photography	Card
SVM	0.713	0.776	0.746	0.736	0.734	0.698
DTC	0.678	0.737	0.703	0.697	0.699	0.678
RFC	0.693	0.768	0.696	0.699	0.698	0.689
NBC	0.716	0.768	0.747	0.737	0.737	0.699
LRC	0.739	0.789	0.769	0.759	0.757	0.737
ABC	0.735	0.776	0.755	0.747	0.738	0.702
CC	0.737	0.778	0.759	0.749	0.739	0.707
BERTLR	0.90	0.91	0.89	0.89	0.89	0.896
Hard Voting	0.737	0.785	0.766	0.756	0.746	0.728
Soft Voting	0.736	0.776	0.746	0.758	0.739	0.706

Table 7. Comparison with results reported in the literature.

Literature	Year	Technology	Dataset	Results
[49]	2020	Energy consumption forecasting using ensemble learning algorithms	N/A	Achieved accuracy of 85%
[50]	2021	Optimization of automatic classification of reviews based on Particle Swarm Optimization (PSO) ensemble with LinearSVC	Dataset extracted from Kaggle, consisting of Google Play Store reviews	Achieved precision 63.6%, recall 55.0%
[51]	2020	A multi-class classification task was performed using a convolutional neural network (CNN)-based approach	The dataset contains 1,126,453 reviews from 1100 Apple Store apps and 146,057 reviews of 80 Google Store apps	Achieved precision 95.49%, recall 93.94%, and F-measure 94.71%
[52]	2020	Machine learning-based models with hyperparameter tuning	Scraped 506,259 user reviews and application ratings from Google Play Store across 14 different categories	Achieved accuracy of 83.23% using bigrams and a Logistic Regression model
[53]	2021	Support Vector Machine (SVM)	Google Play Store reviews dataset with 23 features and 600,000 records, extracted from Kaggle	Achieved precision, recall, F1-score, and accuracy between 0.78 and 0.80 for SVM models
[54]	2019	Big data techniques based on deep learning models	Analyzed relationships between various attributes in the dataset extracted from the Kaggle repository	Achieved 84.3% accuracy
[15]	2020	Machine learning-based approach using 11 algorithms to detect fraudulent profiles and fake communications	Dataset contains 6500 records, with 3252 labeled as fake and 3259 as real	Achieved an accuracy rate of 94.5%
[55]	2019	Gradient Boosting Machine (GBM)-based sentiment analysis using text classification on the Bangla dataset	Data crawled from Google Play Store	Achieved best accuracy score of 76.95%
[56]	2020	Logistic Regression model	MNIST dataset	Achieved accuracy up to 77.9%
[57]	2015	Ensemble method based on the Random Subspace Method (RSM) and Majority Voting (MV)	The USF dataset comprises an extensive outdoor gait database with a total of 122 subjects	Achieved a maximum accuracy of 81.15%

Table 8. Accuracy comparison of BERTLR and RoBERTa variants with and without Fractional Derivative (FD), including ensemble.

Model	Action	Casual	Entertainment	Music & Audio	Photography	Card
BERT (no FD)	0.81	0.84	0.83	0.80	0.82	0.81
RoBERTa (no FD)	0.83	0.86	0.85	0.83	0.84	0.84
RoBERTa + FD	0.88	0.90	0.88	0.89	0.89	0.90
BERTLR (Proposed)	0.89	0.91	0.89	0.90	0.90	0.91
BERTLR + RoBERTa (Soft Voting)	0.87	0.90	0.88	0.88	0.88	0.89

Table 9. Ablation study of model components on the combined 6-domain dataset.

Model Variant	Accuracy	Precision	Recall	F1
BERT + LR (no FD)	0.8811	0.9172	0.8803	0.9276
FD (GL) + LR (no BERT)	0.1873	0.2000	0.0000	0.0000
BERT + FD + LR (BERTLR, Proposed)	0.8811	0.9172	0.8811	0.8919

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Karim, A.; Triandini, E.; Lee, S.; Jeong, I.c. Hybrid Ensemble of Large Language Models and Fractional Derivative Features for Domain-Specific Engineering Sentiment Analysis. Appl. Sci. 2026, 16, 4266. https://doi.org/10.3390/app16094266

AMA Style

Karim A, Triandini E, Lee S, Jeong Ic. Hybrid Ensemble of Large Language Models and Fractional Derivative Features for Domain-Specific Engineering Sentiment Analysis. Applied Sciences. 2026; 16(9):4266. https://doi.org/10.3390/app16094266

Chicago/Turabian Style

Karim, Abdul, Evi Triandini, Seoyeong Lee, and In cheol Jeong. 2026. "Hybrid Ensemble of Large Language Models and Fractional Derivative Features for Domain-Specific Engineering Sentiment Analysis" Applied Sciences 16, no. 9: 4266. https://doi.org/10.3390/app16094266

APA Style

Karim, A., Triandini, E., Lee, S., & Jeong, I. c. (2026). Hybrid Ensemble of Large Language Models and Fractional Derivative Features for Domain-Specific Engineering Sentiment Analysis. Applied Sciences, 16(9), 4266. https://doi.org/10.3390/app16094266

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
