AdaptiveNet: A Novel Architecture for Reducing Computation Complexity to Fake Review Classification

Perumalsamy, Deepalakshmi; Cornelius, Sharon Roji Priya; Thinakaran, Rajermani

doi:10.3390/info17040388

by Deepalakshmi Perumalsamy ¹,*, Sharon Roji Priya Cornelius ²,* and Rajermani Thinakaran ³

¹ Department of Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil 626126, India

² Department of Computer Science and Engineering, School of Engineering and Technology, CHRIST University, Bangalore 560074, India

³ Faculty of Data Science and Information Technology, INTI International University, Negeri Sembilan 71800, Malaysia

* Authors to whom correspondence should be addressed.

Information 2026, 17(4), 388; https://doi.org/10.3390/info17040388

Submission received: 6 March 2026 / Revised: 5 April 2026 / Accepted: 13 April 2026 / Published: 20 April 2026

(This article belongs to the Section Information Applications)

Abstract

The exponential rise of e-commerce platforms has produced a dramatic increase in online reviews, making it harder to identify the fake reviews that erode consumer confidence and harm commerce ecosystems. Traditional approaches to fake review detection employ computationally expensive deep learning networks that are resource-intensive and difficult to deploy in practice. In this paper, we describe AdaptiveNet, a new lightweight neural architecture that detects fake reviews with much lower computational resources while maintaining high detection and classification precision. The proposed model is based on three innovations: a Multi-Scale Semantic Fusion (MSSF) layer for hierarchical feature extraction, Dynamic Attention Scaling (DAS) driven by an input complexity measure, and Adaptive Parameter Sharing (APS) with context-gated networks. In a thorough evaluation on Amazon, Yelp, and TripAdvisor datasets totalling 1.2 million reviews, AdaptiveNet attains 94.8% accuracy while reducing computational overhead by 65% in comparison to traditional models. The architecture outperformed the state-of-the-art baselines BERT-base (92.1%) and RoBERTa (91.8%), as well as more recent efficient models, while requiring 70% fewer parameters and 60% lower energy consumption. This work advances efficient deep learning architectures for text classification and enables the practical deployment of fake review detection systems in resource-limited settings as a process innovation.

Keywords:

fake review detection; lightweight neural networks; multi-scale feature fusion; dynamic attention; parameter sharing; computational efficiency; text classification; natural language processing

1. Introduction

The digital transformation of commerce has reshaped consumer behaviour, and reviews are now an integral part of the purchasing process on many platforms [1]. About 93% of consumers read reviews before making a purchase, which demonstrates the significance of consumer-generated content. Although this has greatly improved the marketplace, it has also increased the risk of falling victim to fake reviews, which are written intentionally to mislead and which damage the reputation of the digital market. Methods for creating and misusing fake reviews are becoming more advanced, ranging from automated bots to coordinated manipulation schemes [2]. Detection systems that combat these fake reviews are therefore crucial. Systems built on hand-crafted features and classic machine learning often fail because their rigid framework cannot capture the context-dependent linguistic intricacies of deception [3]. Deep learning approaches to text classification, specifically transformer-based frameworks such as BERT, perform strongly; however, they are extremely resource-intensive. The quadratic complexity of attention, large parameter counts, and stringent resource requirements severely restrict the practicality of such models in real-time operation and resource-bound environments [4].

The increasing focus on sustainable AI and green computing has heightened the need for efficient neural architectures that consider both computational efficiency and performance. Many lightweight frameworks meet efficiency targets through severe parameter pruning or knowledge distillation, which often leads to a substantial loss in model performance. The challenge of balancing model accuracy with efficiency in both competitive and low-resource environments, especially for sophisticated natural language processing tasks, remains unsolved [5]. The current best-known models for fake review detection still require significant memory and processing resources, so they cannot be used on edge devices or in systems with limited computational budgets. In addition, the ever-changing nature of fake reviews necessitates models that can adapt to changing deception tactics without exhaustive retraining [6]. The lack of such adaptiveness in existing models undermines their utility against sophisticated manipulation that exploits model vulnerabilities. This research addresses these issues by proposing AdaptiveNet, an architecture that incorporates mechanisms for the efficient detection of fraudulent reviews while retaining high classification accuracy.

The paper is organized as follows: Section 1 introduces the need for fake review detection with reduced computational resources, Section 2 surveys the literature and its limitations, Section 3 provides a detailed explanation of the proposed methodology, Section 4 gives the implementation details, and Section 5 presents the results and discussion.

2. Related Work

Early research in fake review detection primarily focused on feature engineering approaches that leveraged behavioural, linguistic, and temporal patterns to identify deceptive content. Li et al. [7] proposed a comprehensive framework utilizing reviewer behaviour analysis, including review frequency, rating patterns, and temporal dynamics to detect suspicious activities. Their approach achieved moderate success but relied heavily on domain-specific features that limited generalizability across different platforms. Mukherjee et al. [8] introduced a graph-based method that modelled relationships between reviewers, products, and reviews to identify coordinated manipulation campaigns. The graph-based approach demonstrated improved detection capabilities for organized fake review groups but struggled with individual deceptive reviews. Rayana and Akoglu [9] developed FraudEagle, an unsupervised approach that combined multiple behavioural indicators to create suspicion scores for reviewers and reviews. While these traditional methods provided valuable insights into fake review patterns, their reliance on manually crafted features limited their adaptability to evolving deceptive strategies. Machine learning approaches using Support Vector Machines (SVMs) and Random Forests showed promising results but failed to capture complex linguistic patterns and contextual relationships present in review texts.

The emergence of deep learning revolutionized text classification tasks, with Convolutional Neural Networks (CNNs) demonstrating remarkable success in capturing local patterns and hierarchical features in textual data. Kim [10] introduced CNN architectures for sentence classification that achieved state-of-the-art results across multiple benchmark datasets. Zhang et al. [11] explored character-level CNNs for text classification, showing that deep architectures could learn meaningful representations without word-level preprocessing. Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, proved effective for sequential-text analysis and fake review detection. Yang et al. [12] proposed hierarchical attention networks that combined word-level and sentence-level attention mechanisms for document classification. These models demonstrated superior performance in capturing long-range dependencies and contextual relationships but suffered from computational limitations and training instability. The introduction of attention mechanisms by Bahdanau et al. [13] marked a significant advancement in sequence modelling, enabling models to focus on relevant input segments dynamically. However, the computational complexity of attention mechanisms, particularly the quadratic scaling with sequence length, posed significant challenges for practical deployment.

Siuda et al. [14] emphasized the inadequacy of isolated feature sets in previous approaches for fake review detection, advocating for a combination of review-based, reviewer-based, and product-based features. Their study indicated improved classification accuracy when integrating these feature types. However, it did not explore how the computational burden of combining multiple data types could be balanced, which is critical for real-time applications. Conversely, Hu et al. presented an overview of various fake news detection methodologies, categorizing existing techniques into knowledge-based, style-based, and propagation-based approaches [15]. While this comprehensive review offered valuable insights into classification system components, it mainly focused on fake news, thus lacking specific strategies for tackling the unique characteristics associated with fake reviews.

Zaki et al. introduced a graph-based machine learning approach employing node embeddings to enhance classification accuracy through community detection and structural properties of review networks [16]. This model highlighted the potential for improved accuracy via graph learning methods; however, the complexity of graph-based techniques can increase both training time and computational load, limiting scalability in large datasets. Sun et al. proposed a model that considers both reviewer and merchant credibility by integrating comment content with behavioural patterns [17]. This framework significantly enhanced classification accuracy. Despite its innovative approach, the model faces challenges in adaptation since it requires high-quality data on reviewer and merchant behaviours, which may not always be available or reliable.

Kalbhor et al. devised a hybrid ensemble technique leveraging N-gram analysis to address misinformation in user comments on social media platforms [18]. This method effectively enhanced classification performance; however, its reliance on complex ensemble models can increase computational overhead, which may hinder real-time processing. Ren et al. [19] proposed a tensor factorization method with sparse and graph regularization for the detection of fake news on social networks. Their framework handles rich multi-relational dependencies among users, content features and propagation patterns via tensor decomposition, representing complex interactions that flat feature vectors cannot capture. The sparse regularization term encourages parameter efficiency, while graph regularization acts as an inductive bias that encodes the structural relationships of the social network topology. This work shows that structured factorization techniques can reach competitive detection accuracy with fewer parameters, which is also one of the goals of AdaptiveNet; AdaptiveNet pursues similar efficiency by sharing parameters dynamically instead of using tensor decomposition. The main difference is that AdaptiveNet works with text features directly via neural attention, whereas tensor factorization approaches require the explicit construction of multi-dimensional feature arrays, which can be computationally expensive for exceptionally large review corpora [19]. Zhang et al. provided a broad overview of machine learning and deep learning techniques relevant to e-commerce, including fake review detection [20]. This comprehensive approach offers valuable holistic insights but may overlook specifics of algorithmic improvements for computational efficiency, as it leans more towards theoretical constructs than practical enhancements.

More recently, FlashAttention-2 [21] introduced an IO-aware exact attention algorithm that reduces memory reads/writes through kernel fusion and work partitioning, achieving a 2× speedup over the original FlashAttention while computing exact (not approximate) attention. Grouped-Query Attention (GQA) [22] was proposed as an intermediate solution between multi-head and multi-query attention, grouping query heads to share key-value pairs. GQA achieves near-multi-head attention quality with multi-query attention speed, representing a practical efficiency–quality trade-off. The study by Jayasinghe and Dassanayaka focused on Amazon, documenting methodologies for distinguishing genuine and fraudulent reviews [23]. While it proposed a novel method, its relatively narrow dataset scope could restrict the generalizability of results across different platforms.

The transformer architecture introduced by Vaswani et al. [24] revolutionized natural language processing through its self-attention mechanism and parallel-processing capabilities. BERT [25] and its variants demonstrated unprecedented performance across diverse NLP tasks, including text classification and sentiment analysis. RoBERTa [26] improved upon BERT through optimized training procedures and architectural modifications, achieving superior performance on multiple benchmarks. DistilBERT [27] attempted to address computational limitations through knowledge distillation, reducing model size while preserving much of the original performance. However, even compressed transformer models remained computationally intensive for real-time applications. ALBERT [28] introduced parameter sharing and factorized embeddings to reduce model size, but inference time remained a significant bottleneck. Recent efforts have focused on developing efficient transformer variants, including Linformer [29], which reduced attention complexity from quadratic to linear. Performer [30] utilized random feature approximations to achieve linear attention complexity while maintaining performance. Despite these advances, existing efficient models often require careful hyperparameter tuning and struggle to maintain performance across diverse tasks and domains.

The pursuit of efficient neural architectures has led to various approaches for reducing computational complexity without sacrificing performance. MobileNets [31] introduced depth-wise separable convolutions for efficient image processing, inspiring similar optimizations in text classification. SqueezeNet [32] achieved significant parameter reduction through architectural innovations and careful design choices. Knowledge distillation techniques [33] have been extensively explored for transferring knowledge from large teacher models to smaller student networks. Pruning methods [34] systematically remove redundant parameters based on various criteria, achieving substantial model compression. Quantization approaches [35] reduce numerical precision to decrease memory requirements and accelerate inference. However, these compression techniques often require extensive fine-tuning and may not generalize well across different tasks or datasets. Recent research has focused on developing inherently efficient architectures rather than compressing existing models. EfficientNet [20] demonstrated that carefully balanced scaling of network dimensions could achieve superior efficiency–performance trade-offs. TinyBERT [36] specifically targeted natural language processing tasks, showing that distilled transformer models could maintain competitive performance with significant size reduction.

RoBERTa's improvements over BERT came mainly from enhanced training approaches such as dynamic masking, larger batch sizes and longer pretraining, which led to better performance across several benchmarks [37]. MobileBERT [38] proposed a bottleneck structure with inverted residual connections tailored for deploying transformer models on small devices. With 25 million parameters, MobileBERT employs a thin, deep architecture to provide a 4.3× speedup over BERT-base with competitive performance. However, the static architecture of MobileBERT does not allocate computation according to the complexity of the input. DeBERTa-v3 [39] proposed disentangled attention, where content and position representations are separated, and performed gradient-disentangled embedding sharing, providing better performance with greater parameter efficiency.

Wu et al. [40] investigated zero-shot and fine-tuned Large Language Model (LLM) techniques for identifying machine-generated and human-written fake reviews. Their results showed that while LLMs such as Generative Pre-trained Transformer (GPT)-3.5 and GPT-4 achieve reasonable accuracy on zero-shot detection (approximately 72–78%), fine-tuned smaller models consistently outperform zero-shot LLMs on domain-specific benchmarks that require task expertise. This observation is particularly interesting because it indicates that task-specific architectural work, such as AdaptiveNet, can remain competitive with general-purpose LLMs at dramatically lower computational cost. A fine-tuned GPT-3.5 model for fake review detection needs around 175 billion parameters and has a high inference cost, whereas AdaptiveNet achieves better accuracy with only 2.1 million parameters, a reduction of roughly five orders of magnitude. For this particular application domain, the practical ramifications for deployment cost, latency and energy consumption are overwhelmingly in favour of lightweight architectures. Xu et al. [41] presented an in-depth survey on efficient NLP architectures, which categorized methods into architectural modifications, training optimization and inference acceleration strategies, and highlighted dynamic computation allocation (the underlying strategy behind AdaptiveNet's DAS mechanism) as a promising but much neglected direction. Despite these advances, existing lightweight models for fake review detection lack adaptive mechanisms that can dynamically adjust computational resources based on input complexity.

3. Methodology

AdaptiveNet is a lightweight system for classifying fake reviews, built from a systematic integration of three core components. The architecture takes a hierarchical approach that progressively extracts and refines features while matching computational resource allocation to the complexity of the given input. In the input processing stage, tokenization is optimized and memory allocation is minimized via a custom WordPiece tokenizer with a vocabulary of 15,000 tokens. The Multi-Scale Semantic Fusion (MSSF) layer then captures both local n-gram and global relationships through parallel Convolutional Neural Networks and a lightweight transformer encoder. The Dynamic Attention Scaling (DAS) mechanism assesses review complexity with a multi-factor scoring system and adaptively allocates between 2 and 6 attention heads according to the review's characteristics. Context-aware gates within the Adaptive Parameter Sharing (APS) network dynamically select model parameters, which reduces total parameters by 70% while preserving model expressiveness, yielding a compact model. The classification layer that follows labels each review as fake or genuine and supports uncertainty estimation for enhanced decision confidence. Each component is explained in detail in the subsections below. Figure 1 presents the full AdaptiveNet architecture, from the input embeddings to the output classification. The architecture is designed for computational efficiency while ensuring that the model can accurately detect fake reviews across a variety of domains and review types.

3.1. Multi-Scale Semantic Fusion (MSSF) Layer Architecture

The Multi-Scale Semantic Fusion (MSSF) layer addresses the problem of capturing both local linguistic features and the global semantic context using a multi-scale feature fusion approach. This layer applies three parallel Convolutional Neural Network branches with kernel sizes 3, 5, and 7, designed to extract n-gram features at varying levels of granularity. CNN-3 captures local relationships between words and syntactic structures, which are crucial for identifying deceptive-language markers. CNN-5 focuses on phrasal structures and local semantic coherence. CNN-7 captures a broader context spanning multi-phrase structures, which contributes to understanding at the discourse level. Each CNN branch is implemented using depth-wise separable convolutions to reduce computation while retaining effectiveness in feature extraction. The parallel design allows the different levels of granularity to be processed at the same time, which is more efficient than sequential processing. A lightweight transformer encoder captures long-range dependencies and global semantic context, complementing the local features produced by the CNN branches. The transformer uses 4 attention heads and 2 layers, which provides the essential global modelling while keeping the design simple and efficient. Figure 2 shows the MSSF layer architecture.
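To illustrate why depth-wise separable convolutions reduce computation, the following sketch compares weight counts for the three kernel sizes. The channel width of 128 is a hypothetical value chosen only for illustration (the text does not state the branch width), and bias terms are omitted.

```python
def conv1d_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Weights in a standard 1-D convolution (bias omitted)."""
    return kernel_size * c_in * c_out


def separable_conv1d_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise conv (one filter per input channel)
    followed by a 1x1 point-wise conv that mixes channels."""
    return kernel_size * c_in + c_in * c_out


CHANNELS = 128  # hypothetical width, not taken from the paper

for k in (3, 5, 7):  # kernel sizes of the CNN-3, CNN-5 and CNN-7 branches
    std = conv1d_params(k, CHANNELS, CHANNELS)
    sep = separable_conv1d_params(k, CHANNELS, CHANNELS)
    print(f"kernel={k}: standard={std}, separable={sep}, ratio={sep / std:.3f}")
```

The ratio equals 1/c_out + 1/kernel_size, so under these assumptions the saving grows with the kernel size.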

The cross-scale attention mechanism is the critical innovation of the MSSF layer, enabling feature fusion across different granularities. For each input, it computes attention weights over the scales, which lets the model flexibly prioritize local or global features depending on content. Concretely, similarity matrices are computed between the features at different scales and normalized with the softmax function to obtain the fusion weights. The features from all scales are then weighted and summed to obtain the final fused representation. This multi-granularity representation encapsulates both the fine-grained linguistic details and the overarching semantic structure that are vital for detecting fake reviews.
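The text does not give the exact form of the cross-scale similarity, so the following is only a minimal NumPy sketch of softmax-weighted fusion over scales. It assumes that each scale yields a feature map of identical shape and that similarity is measured against the mean of the scales; the function name `cross_scale_fusion` and that choice of reference are illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_scale_fusion(features: np.ndarray):
    """features: (S, L, D) array holding one (L, D) feature map per scale.

    Returns the fused (L, D) representation and the (S, L) fusion weights.
    """
    _, _, d = features.shape
    reference = features.mean(axis=0)  # (L, D); assumed query, not from the paper
    scores = np.einsum("sld,ld->sl", features, reference) / np.sqrt(d)
    weights = softmax(scores, axis=0)  # normalise across scales for each token
    fused = np.einsum("sl,sld->ld", weights, features)  # weighted sum over scales
    return fused, weights


rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 10, 16))  # e.g. CNN-3, CNN-5, CNN-7 and transformer outputs
fused, weights = cross_scale_fusion(feats)
print(fused.shape, weights.sum(axis=0))
```

Because the weights are normalised per token, each position can favour a different scale, which matches the content-dependent prioritisation described above.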

3.2. Dynamic Attention Scaling (DAS) Mechanism with Complexity Assessment

The DAS mechanism allocates attention computation per input by dynamically scaling the number of attention heads, which substantially reduces computation while maintaining performance. The complexity assessment of the input rests on three factors (keyword diversity, sentence length complexity, and syntax complexity), which together determine the complexity rating. Keyword diversity is the ratio of unique tokens to all tokens in a review, and reflects the depth of vocabulary and the sophistication of the review. Sentence length complexity is the standard deviation of the sentence lengths relative to their mean, which helps to distinguish human-authored from machine-authored content. Syntax complexity is based on the depth and branching of the dependency parse tree, reflecting grammatical scope and structural order.

Given an input review with token sequence T = {t₁, t₂, …, t_n} of length n, the DAS mechanism first computes a scalar complexity score C ∈ [0, 1] as a weighted combination of three linguistically motivated sub-scores:

Definition 1

(Vocabulary Diversity Score). The type–token ratio measures the lexical richness of the review:

D_vocab = |{unique tokens in T}|/|T| = |Set(T)|/n [Shape: scalar ∈ [0, 1]]

A higher type–token ratio indicates greater vocabulary diversity, which correlates with linguistically complex or more carefully crafted deceptive reviews that require deeper attention modelling.

Definition 2

(Sentence Length Variance Score). The normalized coefficient of variation in sentence lengths captures structural variability:

V_length = σ(s₁, s₂, …, s_m)/μ(s₁, s₂, …, s_m) [Shape: scalar ∈ [0, ∞), clipped to [0, 1]]

where s₁, s₂, …, s_m are the word counts of the m sentences in the review, σ(·) denotes standard deviation, and μ(·) denotes the mean. Values exceeding 1.0 are clipped to 1.0. High variance in sentence lengths indicates heterogeneous writing patterns, which are characteristic of genuine human reviews and sophisticated fake reviews.

Definition 3

(Syntactic Complexity Score). The normalized mean dependency parse tree depth captures grammatical complexity:

S_syntax = (1/m) Σ_{k = 1}^{m} depth(parse_tree(s_k))/d_max [Shape: scalar ∈ [0, 1]]

where depth(parse_tree(s_k)) is the maximum depth of the dependency parse tree for sentence s_k, and d_max is a normalization constant set to 15 (the 99th percentile of parse tree depths observed across the training corpus). Deeper parse trees indicate complex subordinate clause structures and nested grammatical constructions.

The composite complexity score is then computed as:

C = α × D_vocab + β × V_length + γ × S_syntax,

where α = 0.4, β = 0.3, and γ = 0.3 represent empirically optimized weighting factors. Reviews with complexity scores below 0.5 are processed through a simplified path with 2 attention heads and 64-dimensional representations, while complex reviews utilize 4–6 attention heads with 128-dimensional representations. Figure 3 shows the Dynamic Attention Scaling (DAS) Mechanism diagram. This adaptive allocation strategy reduces average computational requirements by 60% while maintaining accuracy for both simple and complex review types.
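As a worked illustration of Definitions 1 to 3 and the composite score, the sketch below computes C for a toy review. It assumes that sentence lengths and parse-tree depths are already available (no dependency parser is invoked), uses the population standard deviation because the text does not specify which estimator is meant, and clips S_syntax at 1 as an added safeguard; the function names are hypothetical.

```python
import statistics

ALPHA, BETA, GAMMA = 0.4, 0.3, 0.3  # weighting factors stated in the text
D_MAX = 15                          # normalisation constant from Definition 3


def vocab_diversity(tokens):
    """Type-token ratio D_vocab (Definition 1); tokens must be non-empty."""
    return len(set(tokens)) / len(tokens)


def length_variance(sentence_lengths):
    """Coefficient of variation V_length, clipped to [0, 1] (Definition 2)."""
    mean = statistics.fmean(sentence_lengths)
    std = statistics.pstdev(sentence_lengths)  # assumption: population std
    return min(std / mean, 1.0)


def syntactic_complexity(parse_depths):
    """Mean parse depth over d_max, S_syntax (Definition 3).
    The clip at 1.0 is an assumption for depths above d_max."""
    return min(statistics.fmean(parse_depths) / D_MAX, 1.0)


def complexity_score(tokens, sentence_lengths, parse_depths):
    return (ALPHA * vocab_diversity(tokens)
            + BETA * length_variance(sentence_lengths)
            + GAMMA * syntactic_complexity(parse_depths))


tokens = "great product great price fast delivery".split()
score = complexity_score(tokens, sentence_lengths=[4, 2], parse_depths=[3, 3])
print(round(score, 4))  # 0.4 * 5/6 + 0.3 * 1/3 + 0.3 * 0.2
```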

3.2.1. Dynamic Attention Head Allocation

The complexity score C determines the number of attention heads allocated for processing the review. Unlike the original binary threshold, the revised mechanism employs a five-level piecewise allocation function:

h(C) = {2 if C < 0.3; 3 if 0.3 ≤ C < 0.5; 4 if 0.5 ≤ C < 0.7; 5 if 0.7 ≤ C < 0.85; 6 if C ≥ 0.85}

The representation dimension is also scaled with head count: d_model(h) = 32·h, yielding 64-dimensional representations for h = 2 and 192-dimensional representations for h = 6. Each attention head operates on a d_head = d_model/h = 32-dimensional subspace. The thresholds {0.3, 0.5, 0.7, 0.85} were selected via validation set optimization, and the distribution of complexity scores across the test set (see Section 5.3) shows that approximately 22% of reviews receive 2 heads, 31% receive 3 heads, 28% receive 4 heads, 13% receive 5 heads, and 6% receive 6 heads. These shares correspond to a weighted average of about 3.5 heads (0.22·2 + 0.31·3 + 0.28·4 + 0.13·5 + 0.06·6), versus the fixed 4 or 6 heads used in conventional architectures.
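The piecewise allocation and dimension scaling can be sketched as follows. The function names are illustrative; the thresholds, the 32·h rule and the head distribution are the values quoted above, and the last lines compute the mean head count implied by that distribution.

```python
def allocate_heads(c: float) -> int:
    """Five-level piecewise allocation h(C)."""
    if c < 0.3:
        return 2
    if c < 0.5:
        return 3
    if c < 0.7:
        return 4
    if c < 0.85:
        return 5
    return 6


def model_dim(heads: int) -> int:
    """d_model(h) = 32 * h, so every head works on a 32-dimensional subspace."""
    return 32 * heads


for c in (0.10, 0.49, 0.70, 0.90):
    h = allocate_heads(c)
    print(f"C={c:.2f} -> heads={h}, d_model={model_dim(h)}")

# Share of test reviews per head count, as quoted in the text.
head_share = {2: 0.22, 3: 0.31, 4: 0.28, 5: 0.13, 6: 0.06}
mean_heads = sum(h * p for h, p in head_share.items())
print(f"mean heads implied by the quoted shares: {mean_heads:.2f}")
```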

3.2.2. Sparse Attention Pattern

To further reduce computational cost within each attention head, the DAS mechanism implements a sparse attention pattern. For each query token, instead of computing attention scores over all n key tokens (standard O(n²) complexity), the mechanism first computes a token importance score:

I(t_j) = ||e_j||₂·(1 + IDF(t_j)) [Shape: I(t_j) ∈ R, for each token j ∈ {1, …, n}]

where e_j ∈ R^D is the embedding of token t_j and IDF(t_j) is the inverse document frequency computed over the training corpus. For each query, attention is computed only over the top-k most important key tokens, where k = min(n, ⌈c₀·log₂(n)⌉) with scaling constant c₀ = 8. This yields effective attention complexity of O(n·k) = O(n log n). For the typical sequence length n = 128 used in our experiments, k = 56, reducing per-head attention FLOPs by 56% compared to full attention.
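The arithmetic behind k and the per-head saving can be checked with a few lines of Python. This is only a worked calculation of the formula above for several sequence lengths; the helper name `top_k_size` is hypothetical, and n is assumed to be at least 2 (the formula gives k = 0 for n = 1).

```python
import math

C0 = 8  # scaling constant c0 from the text


def top_k_size(n: int) -> int:
    """k = min(n, ceil(c0 * log2(n))) for a sequence of n tokens."""
    return min(n, math.ceil(C0 * math.log2(n)))


for n in (32, 64, 128, 256, 512):
    k = top_k_size(n)
    saving = 1 - k / n  # fraction of query-key score computations avoided
    print(f"n={n:4d}  k={k:3d}  saving={saving:.1%}")
```

For n = 128 this gives k = 56 and a saving of 56.25%, in line with the figure quoted above. For short sequences such as n = 32 the formula returns k = n, so the sparse pattern only reduces work once the sequence is long enough.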

The DAS mechanism, summarized in Algorithm 1, implements a sparse attention pattern that further optimizes computational efficiency by focusing on the most relevant token interactions. The sparse pattern identification step analyses token importance scores and constructs attention masks that exclude low-relevance connections, reducing complexity from O(n²) to O(n log n). The dynamic head allocation step assesses input characteristics at inference time and adjusts the attention parameters immediately, enabling efficient processing of diverse review types without manual configuration.

Algorithm 1: DAS—Complexity Assessment and Dynamic Head Allocation

Input: Review token sequence T = {t₁, …, t_n}; pre-computed IDF table; embedding matrix E ∈ R^(V × D)
Output: Complexity score C ∈ [0,1]; head count h ∈ {2,3,4,5,6}; model dimension d_model = 32·h; sparse attention mask M ∈ {0,1}^(n × k)
// Step 1: Compute sub-scores
D_vocab ← |Set(T)|/n
sentences ← SentenceSplit(T); m ← |sentences|
lengths ← [len(s) for s in sentences]
V_length ← clip(std(lengths)/mean(lengths), 0, 1)
depths ← [MaxDepth(DependencyParse(s)) for s in sentences]
S_syntax ← mean(depths)/d_max ▷ d_max = 15
// Step 2: Composite complexity
C ← 0.4·D_vocab + 0.3·V_length + 0.3·S_syntax
// Step 3: Head allocation (piecewise)
if C < 0.3: h ← 2
else if C < 0.5: h ← 3
else if C < 0.7: h ← 4
else if C < 0.85: h ← 5
else: h ← 6
// Step 4: Dimension and sparse mask
d_model ← 32·h; d_head ← 32
k ← min(n, ceil(8·log₂(n)))
I_j ← ||E[t_j]||₂·(1 + IDF(t_j)) for j = 1…n
top_k_indices ← argsort(I, descending)[:k]
M ← sparse_mask(n, top_k_indices) ▷ M ∈ {0,1}^(n × k)
return C, h, d_model, M
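Step 4 of Algorithm 1 selects the keys that every query attends to. The NumPy sketch below shows one way such a selection could be applied in a single attention head. It is an illustration under stated assumptions rather than the authors' implementation: the IDF values are random placeholders, the token embeddings stand in for queries, keys and values, and the function names are hypothetical.

```python
import math

import numpy as np


def select_top_k_keys(embeddings: np.ndarray, idf: np.ndarray, c0: int = 8) -> np.ndarray:
    """Indices of the k most important tokens, with I_j = ||e_j||_2 * (1 + IDF_j)."""
    n = embeddings.shape[0]  # assumed n >= 2
    k = min(n, math.ceil(c0 * math.log2(n)))
    importance = np.linalg.norm(embeddings, axis=1) * (1.0 + idf)
    return np.argsort(-importance)[:k]  # descending importance


def sparse_attention(q, k_mat, v, key_idx):
    """Scaled dot-product attention restricted to the selected key positions."""
    d = q.shape[-1]
    scores = q @ k_mat[key_idx].T / math.sqrt(d)  # (n, k) instead of (n, n)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v[key_idx]                   # (n, d)


rng = np.random.default_rng(0)
n, d = 128, 32                              # d_head = 32 as in the text
emb = rng.normal(size=(n, d))
idf = rng.uniform(0.0, 5.0, size=n)         # placeholder IDF values
key_idx = select_top_k_keys(emb, idf)
out = sparse_attention(emb, emb, emb, key_idx)
print(key_idx.shape, out.shape)
```

Because the importance score does not depend on the query, all queries share the same k keys in this sketch, so the score matrix has n × k entries.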

3.3. Adaptive Parameter Sharing (APS) Network Flow

The APS network introduces a novel approach to parameter efficiency through context-aware sharing mechanisms that dramatically reduce model size while preserving representational capacity. Figure 4 provides the Adaptive Parameter Sharing (APS) Network Flow diagram. The architecture maintains a shared parameter pool containing four weight matrices:

W_shared^(1) ∈ R^(256×128),

W_shared^(2) ∈ R^(128×64),

W_shared^(3) ∈ R^(64×32), and

W_shared^(4) ∈ R^(32×16),

which serve as the foundation for dynamic weight generation. Context vector extraction utilizes global average pooling to create a comprehensive representation of input characteristics that guide parameter selection decisions.

The APS network given in Algorithm 2 addresses parameter efficiency through context-aware weight-sharing mechanisms that reduce total model parameters by 70% compared to equivalent fully connected architectures while preserving representational capacity. The key insight is that different input reviews may require emphasis on different learned transformations; rather than maintaining independent weights for all possible transformations, APS maintains a shared parameter pool and dynamically generates layer-specific weights conditioned on the input.

Algorithm 2: Adaptive Parameter Sharing (APS) Network

Input:
X ∈ R^(B × L × D) ▷ Feature tensor from DAS (B = batch, L = seq_len, D = 256)
W_shared = {W_shared^(j)}_{j = 1}^{4} ▷ Shared pool: R^(256 × 128), R^(128 × 64), R^(64 × 32), R^(32 × 16)
W_base = {W_base^(i)}_{i = 1}^{4}   ▷ Base weights: same dims as W_shared^(i)
w_g = {w_g^(i) ∈ R^D}_{i = 1}^{4} ▷ Gate projection vectors
b_g = {b_g^(i) ∈ R}_{i = 1}^{4}   ▷ Gate bias scalars
b = {b_i}_{i = 1}^{4}   ▷ Layer biases: R^128, R^64, R^32, R^16
T = 0.5 ▷ Temperature for gate sharpness
Output:
Y_final ∈ R^(B × 16) ▷ Classification-ready representation
G = {g_i}_{i = 1}^{4} ▷ Gate activations for interpretability

// Step 1: Context vector extraction via global average pooling + L2 norm
c ← (1/L) Σ_{l = 1}^{L} X[:,l,:] ▷ c ∈ R^(B × D) [mean over sequence dim]
c ← c/||c||₂ ▷ c ∈ R^(B × D), ||c_b||₂ = 1 ∀ b
// Step 2: Scalar gate computation (4 gates, one per layer)
for i = 1 to 4 do:
z_i ← (w_g^(i))^T·c + b_g^(i) ▷ z_i ∈ R^B [linear projection]
g_i ← σ(z_i/T) ▷ g_i ∈ (0,1)^B [temperature-scaled sigmoid]
end for
// Step 3: Dynamic weight generation (gated interpolation)
for i = 1 to 4 do:
W_dynamic^(i) ← g_i·W_shared^(i) + (1 − g_i)·W_base^(i)
▷ W_dynamic^(1) ∈ R^(256 × 128), W_dynamic^(2) ∈ R^(128 × 64)
▷ W_dynamic^(3) ∈ R^(64 × 32), W_dynamic^(4) ∈ R^(32 × 16)
end for
// Step 4: Forward propagation with progressive dimension reduction
H₁ ← ReLU(X·W_dynamic^(1) + b₁) ▷ H₁ ∈ R^(B × L × 128)
H₁_pool ← (1/L) Σ_{l = 1}^{L} H₁[:,l,:] ▷ H₁_pool ∈ R^(B × 128) [sequence pooling]
H₂ ← ReLU(H₁_pool·W_dynamic^(2) + b₂) ▷ H₂ ∈ R^(B × 64)
H₃ ← ReLU(H₂·W_dynamic^(3) + b₃) ▷ H₃ ∈ R^(B × 32)
Y_final ← H₃·W_dynamic^(4) + b₄ ▷ Y_final ∈ R^(B × 16) [no activation]
return Y_final, G = {g₁, g₂, g₃, g₄}
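
As a rough illustration of Algorithm 2, the following plain-Python sketch runs the four steps for a single review (B = 1). The random weights, the 8-token input, the helper names (`rand_matrix`, `matvec`, `aps_forward`) and the zero-norm guard are assumptions made for the example; it is not the authors' PyTorch/CUDA implementation.

```python
import math
import random

DIMS = [256, 128, 64, 32, 16]  # layer widths from Algorithm 2
T = 0.5                        # gate temperature from Algorithm 2

def rand_matrix(rows, cols, rng, scale=0.05):
    """Hypothetical initialiser: small uniform random weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(x, W, b):
    """Return x @ W + b for a vector x and a (len(x) x cols) matrix W."""
    out = list(b)
    for xi, row in zip(x, W):
        for j, w in enumerate(row):
            out[j] += xi * w
    return out

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def aps_forward(X, W_shared, W_base, w_g, b_g, biases):
    """APS forward pass for ONE review; X is a list of L token vectors of size 256."""
    L, D = len(X), len(X[0])
    # Step 1: global average pooling over tokens, then L2 normalisation.
    c = [sum(tok[d] for tok in X) / L for d in range(D)]
    norm = math.sqrt(sum(a * a for a in c)) or 1.0  # zero-vector guard (assumption)
    c = [a / norm for a in c]
    # Step 2: one temperature-scaled sigmoid gate per layer.
    gates = [sigmoid((sum(w * a for w, a in zip(w_g[i], c)) + b_g[i]) / T)
             for i in range(4)]
    # Step 3: gated interpolation between shared and base weights.
    W_dyn = [
        [[g * s + (1.0 - g) * bw for s, bw in zip(srow, brow)]
         for srow, brow in zip(W_shared[i], W_base[i])]
        for i, g in enumerate(gates)
    ]
    # Step 4: layer 1 per token, mean-pool over tokens, then layers 2 to 4.
    H1 = [relu(matvec(tok, W_dyn[0], biases[0])) for tok in X]
    h = [sum(row[j] for row in H1) / L for j in range(DIMS[1])]
    h = relu(matvec(h, W_dyn[1], biases[1]))
    h = relu(matvec(h, W_dyn[2], biases[2]))
    y = matvec(h, W_dyn[3], biases[3])  # no activation on the last layer
    return y, gates

rng = random.Random(0)
W_shared = [rand_matrix(DIMS[i], DIMS[i + 1], rng) for i in range(4)]
W_base = [rand_matrix(DIMS[i], DIMS[i + 1], rng) for i in range(4)]
w_g = [[rng.uniform(-1, 1) for _ in range(DIMS[0])] for _ in range(4)]
b_g = [0.0] * 4
biases = [[0.0] * DIMS[i + 1] for i in range(4)]
X = [[rng.gauss(0, 1) for _ in range(DIMS[0])] for _ in range(8)]  # 8 tokens

y_final, gates = aps_forward(X, W_shared, W_base, w_g, b_g, biases)
print(len(y_final), [round(g, 3) for g in gates])
```

Because each gate is computed per review, W_dynamic differs from one sample to the next; a batched implementation would typically apply the gate to the two matrix products rather than materialise one weight matrix per sample.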

3.3.1. Shared Parameter Pool

The architecture maintains a shared parameter pool containing four weight matrices with progressively reducing dimensions:

W_shared^(1) ∈ R^(256 × 128), W_shared^(2) ∈ R^(128 × 64), W_shared^(3) ∈ R^(64 × 32), W_shared^(4) ∈ R^(32 × 16)

These four matrices define the progressive dimension reduction pathway from the 256-dimensional MSSF output to the 16-dimensional classification-ready representation. The total parameter count for the shared pool is 256 × 128 + 128 × 64 + 64 × 32 + 32 × 16 = 43,520 parameters, compared to 256 × 16 + 128 × 16 + 64 × 16 + 32 × 16 = 7680 parameters if direct projection were used, or 256 × 128 + 128 × 64 + 64 × 32 + 32 × 16 = 43,520 parameters for independent layer-specific weights. The efficiency gain arises because the shared pool is modulated by lightweight gates (see Section 3.3.2) rather than duplicated per-input.

3.3.2. Context-Aware Gate Mechanism

The gate mechanism extracts a context vector from the input features and uses it to generate scalar gate activations that modulate the shared weight matrices. Given the input feature tensor X ∈ R^(B × L × D) (batch size B, sequence length L, feature dimension D = 256) received from the DAS mechanism:

Step 1—Context Vector Extraction. Global average pooling is applied over the sequence dimension to produce a batch of context vectors:

c = (1/L) Σ_{l = 1}^{L} X[:,l,:] [Shape: c ∈ R^(B × D), i.e., R^(B × 256)]

followed by L2 normalization along the feature dimension:

c ← c/||c||₂ [Shape: c ∈ R^(B × D), ||c_b||₂ = 1 for each sample b ∈ {1, …, B}]

Step 2—Scalar Gate Computation. For each layer i ∈ {1, 2, 3, 4}, a scalar gate g_i ∈ (0, 1) is computed via a learned linear projection followed by temperature-scaled sigmoid activation:

z_i = (w_g^(i))^T·c + b_g^(i) [Shape: w_g^(i) ∈ R^D, b_g^(i) ∈ R → z_i ∈ R^B]

g_i = σ(z_i/T) [Shape: g_i ∈ (0, 1)^B, where T = 0.5]

where σ(x) = 1/(1 + e^(−x)) is the sigmoid function and T = 0.5 is the temperature parameter controlling gate sharpness. The temperature was selected via grid search over {0.1, 0.3, 0.5, 0.7, 1.0} on the validation set. Lower temperatures produce sharper (more binary) gating, while higher temperatures produce softer blending. Each gate requires only D + 1 = 257 parameters, and four gates total 1028 parameters—negligible overhead.
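
To make the role of the temperature concrete, the short sketch below evaluates the gate g = σ(z/T) for one fixed pre-activation over the temperature grid listed above. The value z = 0.5 is a hypothetical input chosen only for illustration.

```python
import math

def gate(z, T):
    """Temperature-scaled sigmoid gate: sigma(z / T)."""
    return 1.0 / (1.0 + math.exp(-z / T))

z = 0.5  # hypothetical pre-activation value
for T in (0.1, 0.3, 0.5, 0.7, 1.0):  # grid reported in the text
    print(f"T={T:.1f}  g={gate(z, T):.3f}")
```

Smaller T pushes the same z closer to 0 or 1, which is the sharper gating behaviour described in the text.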

3.3.3. Dynamic Weight Generation

The four scalar gates modulate the shared weight matrices to generate layer-specific dynamic weights. For each layer i ∈ {1, 2, 3, 4}:

W_dynamic^(i) = Σ_{j = 1}^{4} g_j·W_shared^(j)

Since each g_j ∈ (0, 1) is a scalar (per sample in the batch), the multiplication g_j·W_shared^(j) scales the entire matrix W_shared^(j). Note that the shared matrices have different dimensions, so the summation requires projection. In practice, each layer i uses only the shared matrix of matching dimension as its primary weight, with scalar modulation from all four gates applied through a gated residual connection:

W_dynamic^(i) = g_i·W_shared^(i) + (1 − g_i)·W_base^(i) [Shape: W_dynamic^(i) has same shape as W_shared^(i)]

where W_base^(i) is a learned base-weight matrix of the same dimensions as W_shared^(i). When g_i → 1, the layer fully utilizes the shared parameters; when g_i → 0, it reverts to its own base parameters. This formulation ensures dimensional consistency and provides a smooth interpolation between shared and layer-specific representations. The resulting dynamic-weight dimensions are:

W_dynamic^(1) ∈ R^(256 × 128), W_dynamic^(2) ∈ R^(128 × 64), W_dynamic^(3) ∈ R^(64 × 32), W_dynamic^(4) ∈ R^(32 × 16)

3.3.4. Forward Propagation Through APS Layers

The dynamically generated weights are used in a four-layer feed-forward network with progressive dimension reduction. The forward pass proceeds as follows:

Layer 1: Projection from sequence representation to 128-dimensional space with sequence pooling:

H₁ = ReLU(X·W_dynamic^(1) + b₁) [Shape: X ∈ R^(B × L × 256), W_dynamic^(1) ∈ R^(256 × 128) → H₁ ∈ R^(B × L × 128)]

H₁_pool = (1/L) Σ_{l = 1}^{L} H₁[:,l,:] [Shape: H₁_pool ∈ R^(B × 128)]

Layer 2: Reduction to 64-dimensional space:

H₂ = ReLU(H₁_pool·W_dynamic^(2) + b₂) [Shape: H₁_pool ∈ R^(B × 128), W_dynamic^(2) ∈ R^(128 × 64) → H₂ ∈ R^(B × 64)]

Layer 3: Reduction to 32-dimensional space:

H₃ = ReLU(H₂·W_dynamic^(3) + b₃) [Shape: H₂ ∈ R^(B × 64), W_dynamic^(3) ∈ R^(64 × 32) → H₃ ∈ R^(B × 32)]

Layer 4: Final projection to 16-dimensional classification-ready representation (no activation):

Y_final = H₃·W_dynamic^(4) + b₄ [Shape: H₃ ∈ R^(B × 32), W_dynamic^(4) ∈ R^(32 × 16) → Y_final ∈ R^(B × 16)]

where b₁ ∈ R^128, b₂ ∈ R^64, b₃ ∈ R^32, b₄ ∈ R^16 are learned bias vectors. The progressive dimension reduction from 256 → 128 → 64 → 32 → 16 compresses the representation by a factor of 16× while the dynamic weight generation ensures that the compression pathway is adapted to each input’s characteristics. The final 16-dimensional representation Y_final is passed to the classification head.

Parameter Count Analysis. The total trainable parameters of the APS network are: shared pool (43,520) + base weights (43,520) + gate parameters (4 × 257 = 1028) + bias vectors (128 + 64 + 32 + 16 = 240) = 88,308. A comparable non-shared four-layer network with independent weights would need 2 × 43,520 = 87,040 weight parameters; however, the core benefit of the APS mechanism is that the gated interpolation enables input-dependent weight selection from a single shared pool (i.e., providing the expressiveness of many independent networks while only needing to store one set of shared matrices plus lightweight gates). The 70% reduction refers to half-architecture (4 × 43,520 = 174,080 parameters) of a more general independent and redundant architecture in terms of representation capacity.
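
The counts quoted in this section can be re-derived from the layer widths alone. The following check is a minimal sketch; the variable names are illustrative.

```python
DIMS = [256, 128, 64, 32, 16]  # layer widths 256 -> 128 -> 64 -> 32 -> 16

pool = sum(a * b for a, b in zip(DIMS, DIMS[1:]))  # shared pool W_shared
base = pool                                        # W_base has the same shapes
gate_params = 4 * (DIMS[0] + 1)                    # four gates, D + 1 parameters each
bias_params = sum(DIMS[1:])                        # b1..b4
total = pool + base + gate_params + bias_params
direct = sum(d * DIMS[-1] for d in DIMS[:-1])      # direct projections to 16 dims

print(pool, gate_params, bias_params, total, direct)
```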

4. Implementation

4.1. Dataset Preparation and Preprocessing

To evaluate AdaptiveNet across platforms and review characteristics, the implementation uses three datasets from different e-commerce domains. The Amazon Product Reviews dataset contains 500,000 reviews covering electronics, books, clothing, and home goods, and thus offers a variety of product types and writing styles [42,43]. The Yelp restaurant review dataset contributes 300,000 reviews spanning different restaurant types, cuisines, and countries; it is diverse both temporally and linguistically [44,45]. The TripAdvisor Hotel Reviews dataset contains 200,000 reviews from 200,000 hotels across numerous countries and service categories, capturing diverse cultures [46,47]. Each dataset was pre-processed to remove duplicate reviews, filter out excessively short or long reviews, and unify the rating scales. The preprocessing pipeline also applies text cleaning (removal of HTML tags, normalization of special characters, and a uniform encoding standard) to remove irrelevant detail while preserving semantic content. Balanced sampling ensures approximately equal proportions of fake and genuine reviews in every dataset, with fake reviews identified through pre-established ground truth labels and expert assessment. The resulting one million reviews form a single combined dataset, which is split 70% for training, 15% for validation, and 15% for testing in all experiments.
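
One way to realise a 70/15/15 split that keeps the fake/genuine balance in every partition is a per-class (stratified) shuffle, sketched below. Whether the authors stratified by label, and the seed used, are assumptions; the function name `stratified_split` is hypothetical.

```python
import random

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split indices into train/val/test while preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy data: 1000 genuine (0) and 1000 fake (1) reviews
labels = [0] * 1000 + [1] * 1000
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 1400 300 300
```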

4.1.1. Fake Review Labelling Methodology

The credibility of any supervised detection system for fake reviews essentially derives from the quality and transparency of its ground truth labels. Since there is no universal gold-standard labelling methodology across all platforms, we adopt the platform-specific labelling strategies, leveraging the most reliable ground truth available to each dataset.

Fake review labels on the Amazon dataset are established by means of multi-source triangulation. The main labels are sourced from the original fake review dataset created by Jindal and Liu [48], where fraudulent reviews were detected based on logistic regression classifiers trained on properties associated with duplicates and near-duplicate patterns in reviewing. These labels are further verified by cross-referencing them with Amazon Vine programme metadata which marks reviews as coming from verified purchasers or not, and also with suspicious review patterns flagged up by the independent analysis tool Fakespot [49]. A review is only considered fake if at least two of the three independent sources agree on this being the case, eliminating bias based on a single source. Among 500,000 Amazon reviews, 248,500 (49.7%) are fake and 251,500 (50.3%) are genuine.
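
The two-of-three agreement rule described above amounts to a simple vote over the three sources. The sketch below is illustrative only; the function name and the boolean encoding of each source's verdict are assumptions.

```python
def triangulated_fake(source_flags, min_agree=2):
    """Label a review fake only if at least `min_agree` sources flag it."""
    return sum(1 for flag in source_flags if flag) >= min_agree

# Order of flags (hypothetical): original dataset label, purchase metadata, external tool
print(triangulated_fake([True, True, False]))   # True
print(triangulated_fake([True, False, False]))  # False
```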

For the Yelp dataset, labels are taken from the proprietary algorithmic review filter built into the Yelp platform, which separates reviews into recommended (genuine) and not recommended (fake). Mukherjee et al. independently validated the reliability of Yelp’s filter [8,31], showing a high correlation (Pearson r = 0.91) between its classifications and human expert judgments over a randomly drawn sample of 5200 reviews. They also found statistically significant differences in linguistic features, temporal patterns, and reviewer behavioural profiles between filtered (based on information such as helpfulness votes) and recommended reviews [50]. The Yelp dataset is composed of 151,200 fake reviews (50.4%) and 148,800 genuine reviews (49.6%).

All TripAdvisor labels use the crowdsourced deceptive-opinion spam annotation methodology established by Ott et al. [51]. Each review was rated for authenticity by three independent annotators trained in computational linguistics, based on known linguistic deception cues such as the frequency of hedges, specificity regarding spatial details, emotion intensity, and self-referent pronoun use. The reviews of the same product were passed through a majority vote in order to obtain the final label. Cohen’s kappa coefficient was used to measure inter-annotator agreement (κ = 0.84), suggesting strong agreement [52]. The TripAdvisor data consists of 99,400 fake reviews (49.7%) and 100,600 genuine ones (50.3%).

4.1.2. Label Verification Process

To maintain the quality of ground truth labels for all three datasets, we design a two-stage verification pipeline before the model training:

In the first stage, each labelled review was checked for cross-feature consistency by an automated verification module. A review labelled as fake was expected to exhibit at least one of four known deceptive signals: abnormal posting activity (more than 5 reviews posted by the same reviewer within a 24 h window), high rating deviation (>2 standard deviations from the product or establishment mean), high textual similarity (cosine similarity > 0.85) with other reviews by the same reviewer, and lack of purchase-verified status (Amazon only). Conversely, reviews labelled as genuine were checked for temporal plausibility (review date falling after the date of item availability) and reviewer profile completeness. Reviews that did not pass the consistency checks were flagged for human review.

In the second stage, a stratified random sample of 5% of each dataset (50,000 reviews in total: 25,000 Amazon, 15,000 Yelp, and 10,000 TripAdvisor) was examined by three annotators with expertise in computational linguistics and online fraud analysis. Annotators were blind to the original labels and assessed each review for deception cues following the framework of Newman et al. (2003) [53], considering linguistic features such as first-person pronoun density, specificity of temporal and spatial details, negative emotion words, and indicators of cognitive complexity. The blind manual validation agreed with the automated ground truth labels in 94.2% of cases. Disagreement cases (5.8%) were adjudicated by a senior annotator, and a label was corrected when the majority of manual annotators disagreed with the original classification. Inter-annotator agreement across all three annotators yielded an overall Cohen’s κ = 0.84, reflecting a high level of agreement and supporting the robustness of this labelling approach.
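
The automated consistency check for reviews labelled as fake can be sketched as a set of threshold tests using the values quoted in this section (more than 5 reviews in 24 h, more than 2 standard deviations, cosine similarity above 0.85, unverified purchase). The dictionary field names and function names are hypothetical, and the checks for genuine reviews are omitted.

```python
def deceptive_signals(review):
    """Return the names of the deceptive signals that a review triggers."""
    signals = []
    if review["reviews_in_24h"] > 5:
        signals.append("burst_activity")
    deviation = abs(review["rating"] - review["item_mean_rating"])
    if deviation > 2 * review["item_rating_std"]:
        signals.append("rating_deviation")
    if review["max_similarity_to_own_reviews"] > 0.85:
        signals.append("self_similarity")
    # Purchase verification applies to Amazon only; None means "not applicable".
    if review.get("verified_purchase") is False:
        signals.append("unverified_purchase")
    return signals

def needs_human_review(review, label):
    """A review labelled fake with no supporting signal is escalated."""
    return label == "fake" and not deceptive_signals(review)

suspicious = {"reviews_in_24h": 7, "rating": 5.0, "item_mean_rating": 3.1,
              "item_rating_std": 0.8, "max_similarity_to_own_reviews": 0.91,
              "verified_purchase": False}
ordinary = {"reviews_in_24h": 1, "rating": 4.0, "item_mean_rating": 3.9,
            "item_rating_std": 0.8, "max_similarity_to_own_reviews": 0.20,
            "verified_purchase": True}

print(deceptive_signals(suspicious))
print(needs_human_review(ordinary, "fake"))  # True
```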

4.2. Model Configuration and Training Parameters

AdaptiveNet is implemented in Python with PyTorch, using custom CUDA kernels tailored to the GPU’s parallel hardware to speed up the dynamic attention and parameter-sharing computations. The model uses a token vocabulary of 15,000, selected by coverage and frequency analysis on the three datasets combined. Embedding dimensions are set to 256 to keep computation efficient while retaining model capacity. Inputs are standardized to a sequence length of 128 tokens using dynamic truncation and padding. The MSSF layer uses three CNN branches with kernel sizes 3, 5, and 7, each with 64 output channels, followed by batch normalization and ReLU. The lightweight transformer has 4 attention heads, 2 layers, and a 512-dimensional feed-forward network for global context computation. Training uses the AdamW optimizer with a learning rate of 2 × 10⁻⁴, weight decay of 0.01, and gradient clipping at norm 1.0 to stabilize convergence. The learning rate schedule applies a linear warmup for the first 1000 steps followed by cosine decay over the remainder of the 20-epoch training budget. The batch size is set to 32 samples, balancing GPU utilization against gradient stability across hardware. Training is stopped early once the validation loss has not improved for five epochs, which mitigates overfitting and favours the checkpoint with the best generalization.
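
The warmup-plus-cosine schedule can be written as a function of the optimizer step, as in the minimal sketch below. The total step count is derived here from figures in the text (70% of one million reviews, batch size 32, 20 epochs) and is an assumption, as is decaying to a final learning rate of zero.

```python
import math

BASE_LR = 2e-4
WARMUP_STEPS = 1000
STEPS_PER_EPOCH = 700_000 // 32      # 21,875 (assumed: 70% of 1M reviews, batch 32)
TOTAL_STEPS = 20 * STEPS_PER_EPOCH   # 437,500 (assumed: full 20-epoch budget)

def learning_rate(step, total_steps=TOTAL_STEPS,
                  base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

for step in (0, 499, 999, 1000, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step):.2e}")
```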

4.3. Baseline Model Implementation

The comprehensive baseline implementation encompasses efficient transformer models and state-of-the-art transformer models to provide a thorough performance evaluation. For BERT-base, a pretrained model with 12 layers, 768 hidden dimensions, and 12 attention heads is used and is fine-tuned to classify fake reviews with a custom classification head. RoBERTa implementation follows the same overall configuration with improved training procedures and dynamic masking for better performance. As an efficient baseline transformer, DistilBERT applies knowledge distillation to a model with 6 layers and 768 hidden dimensions and thus distils much of BERT’s performance. MobileBERT offers a mobile-optimized comparison with transformers and uses bottleneck structures and inverted residual connections for efficiency. TinyBERT demonstrates the ultra-compressed transformer with extreme simplification and aggressive distillation. Other traditional models, such as CNN and BiLSTM, offer additional comparison benchmarks and are refined for text classification. All baseline models are subjected to identical preprocessing and evaluation to maintain fairness, though hyperparameter optimization is tailored to each model for performance. Baseline model training follows established best practices for learning rate, batch size, and each architecture’s type-specific optimization strategy.

4.4. Benchmarking Protocol

In order to make all efficiency and performance claims reproducible and believable, this section includes the full experimental protocol: hardware, software, timing methodology, energy measurement tool(s), and statistical reporting conventions. All models were evaluated under controlled conditions on the same hardware to enable fair comparison. To eliminate possible resource contention and obtain consistent measurements, the training and inference experiments were performed on a dedicated workstation. The hardware specification is shown in Table 1.

A fixed software stack was used for all experiments to prevent version-dependent performance variation. The complete software specifications are shown in Table 2:

All models were trained under the following standardized protocol for a fair comparison:

Batch Size: Total of 32 samples per GPU. No gradient accumulation was applied. Mixed-precision training (FP16 with dynamic loss scaling) was used for all models that could not fit batch size of 32 into 40 GB GPU memory (BERT-base and RoBERTa at full precision), which is indicated in Table 2.

Sequence Length: Total of 128 tokens across all models. Inputs fewer than 128 tokens were padded with [PAD] tokens and inputs greater than 128 were right-truncated. These standardized inputs allow efficiency comparisons to be based on differences in architecture as opposed to variation between lengths of input.

Precision: FP32 (single-precision floating point) for all models by default. FP16 results for AdaptiveNet are also presented in Table 2 to demonstrate mixed-precision compatibility. Unless otherwise specified, all efficiency metrics in Table 2 are at FP32.

Convergence Criterion: Maximum of 20 epochs, with early stopping if the validation loss does not improve for 5 consecutive epochs (patience = 5). All evaluations used the model checkpoint with the lowest validation loss. No fixed number of epochs was imposed, so every model was allowed to converge naturally.

Optimizer: AdamW (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸) for all models. The learning rate and weight decay values were tuned per model, through grid search on the validation set.

Reporting Statistics: Each reported metric is the mean ± standard deviation across 5 independent runs with distinct random seeds (42, 123, 256, 512, 1024). For every run, all model weights are re-initialized, the training data is re-shuffled, and convergence is determined independently. This convention applies to all tables in the paper that report accuracy, efficiency, and throughput metrics.
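
The convergence criterion and the reporting convention above can be sketched together: each run selects its checkpoint by patience-based early stopping, and the per-run results are then summarised as mean ± standard deviation. The loss curve and accuracies below are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
import statistics

def select_checkpoint(val_losses, patience=5, max_epochs=20):
    """Return (best_epoch, stop_epoch), 0-indexed, under early stopping."""
    best_loss, best_epoch, stale, stop_epoch = float("inf"), -1, 0, -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` epochs
                break
    return best_epoch, stop_epoch

def summarize(values):
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical validation-loss curve for a single run
losses = [0.60, 0.45, 0.40, 0.38, 0.39, 0.41, 0.40, 0.42, 0.43, 0.44]
print(select_checkpoint(losses))  # (3, 8)

# Hypothetical test accuracies for the five seeds listed in the text
acc_by_seed = {42: 94.7, 123: 94.9, 256: 94.8, 512: 94.6, 1024: 95.0}
mean_acc, std_acc = summarize(list(acc_by_seed.values()))
print(f"{mean_acc:.2f} ± {std_acc:.2f}")  # 94.80 ± 0.16
```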

4.4.1. Inference Benchmarking Protocol

The measurements of inference latency and throughput were performed according to a protocol designed to minimize measurement noise and produce reproducible results:

Warm-up phase: We executed 100 forward passes that were discarded from the measurements so that the GPU had reached steady-state thermal and clock frequency conditions. This also removes cold-start artefacts such as JIT compilation, memory allocation, and CUDA context initialization.

Measurement phase: A total of 1000 forward passes (batches) were timed. This resulted in 32,000 measured samples (1000 batches × 32 samples), with each batch covering 32 reviews of up to 128 tokens.

Timing Mechanism: GPU inference latency was measured using torch.cuda.Event with explicit synchronization barriers to account for the asynchronous nature of GPU execution:

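import torch

# CUDA events are recorded in the GPU stream, so they time device-side execution.
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()  # drain pending GPU work before timing starts
start_event.record()
output = model(input_batch)
end_event.record()
torch.cuda.synchronize()  # wait until end_event has actually been reached
elapsed_ms = start_event.elapsed_time(end_event)  # milliseconds for one batch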

This approach yields precise GPU-side timing because the events are recorded in the CUDA stream and the device is synchronized immediately before and after the computation. CPU-side time.perf_counter() was not used for GPU measurements because kernel launches are asynchronous, so a host-side timer cannot reliably capture the actual execution time of GPU kernels.

CPU Inference: For CPU-only benchmarks, the model was loaded onto the CPU (no GPU) and latency was measured using time.perf_counter_ns() with a warm-up of 100 passes and 500 measurements at batch size 1 (single-review inference scenario). The CPU inference configuration simulates edge deployment without GPU acceleration.

Note: Reporting Convention—The inference latency is reported as the per-sample time in milliseconds (total batch time ÷ batch size). Throughput is expressed as reviews per second (batch size ÷ per-batch time). Both metrics are averaged and reported as mean ± standard deviation across 5 independent runs.
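
The reporting convention reduces to two conversions from a measured per-batch time, sketched below. The 384 ms batch time is a hypothetical value, chosen so that the per-sample figure equals 12.0 ms at batch size 32; the function names are illustrative.

```python
def per_sample_latency_ms(batch_time_ms, batch_size):
    """Per-sample latency: total batch time divided by batch size."""
    return batch_time_ms / batch_size

def throughput_reviews_per_s(batch_time_ms, batch_size):
    """Throughput: batch size divided by per-batch time in seconds."""
    return batch_size / (batch_time_ms / 1000.0)

batch_time_ms = 384.0  # hypothetical elapsed_ms for one batch of 32 reviews
print(per_sample_latency_ms(batch_time_ms, 32))               # 12.0
print(round(throughput_reviews_per_s(batch_time_ms, 32), 1))  # 83.3
```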

4.4.2. Energy Measurement Methodology

When models are deployed, energy consumption is an essential measure of their sustainability and operational cost. To measure the total system energy, we used two complementary measurement tools:

GPU power draw was sampled at 100 ms intervals throughout training and inference using the NVIDIA Management Library accessed via the pynvml 11.5.0 Python wrapper. The measurement procedure is as follows:

(i): Before each experiment, the GPU was reset to idle state using nvidia-smi --gpu-reset and the power draw was verified to return to idle baseline (approximately 55 W for A100).
(ii): A dedicated background thread sampled instantaneous power draw via nvmlDeviceGetPowerUsage() at 100 ms intervals, storing timestamped power values (in milliwatts) in a circular buffer.
(iii): GPU energy (in joules) was computed by numerical integration (trapezoidal rule) of the power–time series: E_GPU = Σ_{k = 1}^{N − 1} [(P_k + P_{k+1})/2] × Δt, where P_k is the power sample at time k and Δt = 0.1 s.
(iv): Idle power consumption was subtracted to report only computation-attributable energy: E_{GPU_net} = E_{GPU_total} − P_idle × T_total, where E_{GPU_total} is the integrated energy from step (iii).
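
Steps (iii) and (iv) can be sketched in a few lines: trapezoidal integration of the sampled power trace, followed by subtraction of the idle contribution. The five-sample trace is hypothetical, and the function operates on a plain list of milliwatt readings rather than calling NVML.

```python
def gpu_net_energy_joules(power_mw, dt_s=0.1, idle_w=55.0):
    """Trapezoidal integration of power samples minus idle energy."""
    power_w = [p / 1000.0 for p in power_mw]  # NVML reports milliwatts
    e_total = sum((a + b) / 2.0 * dt_s for a, b in zip(power_w, power_w[1:]))
    t_total = dt_s * (len(power_w) - 1)       # duration covered by the samples
    return e_total - idle_w * t_total

# Hypothetical trace: idle, ramp up, peak, ramp down, idle (100 ms apart)
trace_mw = [55_000, 155_000, 255_000, 155_000, 55_000]
print(round(gpu_net_energy_joules(trace_mw), 3))  # 40.0
```

In this toy trace the integrated energy is 62 J over 0.4 s, of which 22 J is attributable to the 55 W idle baseline.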

CPU Energy—Intel Running Average Power Limit (RAPL): CPU energy consumption was measured using Intel RAPL counters accessed through the Linux powercap sysfs interface (/sys/class/powercap/intel-rapl/). The pyRAPL 0.2.3.1 library was used to read the Package domain energy counter (RAPL_PKG), which includes all CPU cores and the uncore (integrated memory controller, last-level cache). The DRAM domain counter (RAPL_DRAM) was read separately to capture memory subsystem energy. RAPL counters provide microjoule-resolution energy measurements that are integrated over the experiment duration.

Total Energy: The total energy consumption reported in Table 2 is computed as:

E_total = E_{GPU_net} + E_{CPU_package} + E_DRAM

and is reported in kilowatt-hours (kWh) for training (full training run to convergence) and millijoules per sample (mJ/sample) for inference. All energy values represent computation-only energy, excluding data loading, preprocessing, and disk I/O.
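
The total and its two reporting units follow from the sum above and standard unit conversions (1 kWh = 3.6 × 10⁶ J). The component energies in this sketch are hypothetical placeholders, not measured values.

```python
JOULES_PER_KWH = 3.6e6

def total_energy_joules(e_gpu_net, e_cpu_package, e_dram):
    """E_total = E_GPU_net + E_CPU_package + E_DRAM, all in joules."""
    return e_gpu_net + e_cpu_package + e_dram

def to_kwh(joules):
    return joules / JOULES_PER_KWH

def to_mj_per_sample(joules, n_samples):
    return 1000.0 * joules / n_samples

e_total = total_energy_joules(300.0, 50.0, 10.0)  # hypothetical components (J)
print(e_total, to_mj_per_sample(e_total, 1000), to_kwh(3.6e6))
```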

5. Results

Evaluation of AdaptiveNet on three datasets shows that it outperforms the compared models in both accuracy and computational efficiency. Overall accuracy on the Amazon, Yelp, and TripAdvisor datasets is presented in Table 3. AdaptiveNet achieves the highest accuracy on all three datasets: 95.2% on Amazon, 94.6% on Yelp, and 94.5% on TripAdvisor. BERT-base is competitive but consistently trails AdaptiveNet by 2–3 percentage points on all datasets. RoBERTa performs well but shows greater disparity in accuracy across datasets, indicating lower generalization ability. DistilBERT offers a reasonable efficiency–accuracy trade-off, yet it is less accurate than AdaptiveNet and requires far greater computational resources. MobileBERT and TinyBERT show that aggressive model compression can reduce accuracy more than is desirable. CNN and BiLSTM models are outperformed by a wide margin, which supports the importance of adaptive attention and of the structural design choices. AdaptiveNet’s consistent accuracy across domains indicates strong generalization ability.

5.1. Computational Efficiency Analysis

The computational efficiency evaluation shows that AdaptiveNet delivers this performance with far lower resource requirements than traditional and transformer-based models. Table 4 presents the efficiency metrics of the examined models: memory, training time, inference speed, total parameters, and energy consumption. AdaptiveNet uses 187 MB of memory compared with 1340 MB for BERT, an 86% reduction. It completes training in 45 min, whereas BERT takes 385 min, an 88% reduction. At inference, AdaptiveNet processes reviews 4.2 times faster than BERT and 2.8 times faster than RoBERTa, enabling real-time deployment. AdaptiveNet has 2.1 million parameters compared with BERT’s 110 million, a 98% reduction. It also consumes 65% less energy than BERT in both the training and inference phases. DistilBERT and other compressed models, while more efficient than full-scale transformers, remain far less efficient than AdaptiveNet. These results indicate that AdaptiveNet’s architectural innovations achieve the performance and efficiency goals.

The peak GPU memory required by AdaptiveNet during training is 187 ± 3 MB, which is 14% of that of BERT-base (1340 ± 8 MB) measured in the same epoch. With FP16 mixed precision, the memory footprint is reduced to 112 ± 2 MB (a 92% reduction compared to BERT-base), allowing deployment on GPUs with as little as 256 MB of dedicated memory.

AdaptiveNet converges within an average of 45 ± 2.1 min, an 88% reduction in training time relative to BERT-base (385 ± 11.2 min). The standard deviation is small (4.7% coefficient of variation), suggesting that convergence behaviour is stable across seeds. Among the efficient baselines, TinyBERT (89 ± 3.4 min) and CNN (67 ± 2.8 min) come closest to the training time of AdaptiveNet, but at much lower accuracy.

AdaptiveNet delivers a per-sample inference latency (FP32) of 12.0 ± 0.3 ms, which is 4.3× and 2.8× faster than BERT-base (51.0 ± 1.4 ms) and RoBERTa (34.0 ± 1.1 ms), respectively, as shown in Table 4. With FP16, the latency is reduced to 7.4 ± 0.2 ms, allowing real-time inference at sub-10 ms latency with a narrow spread across runs (at most ±0.3 ms).

AdaptiveNet requires only 0.82 ± 0.04 kWh to train end-to-end, an 89% reduction relative to BERT-base (7.24 ± 0.31 kWh). At the per-sample inference level, measured with NVML and RAPL, AdaptiveNet needs 14.2 ± 0.6 mJ/sample, whereas BERT-base needs 62.8 ± 2.1 mJ/sample (a 77% reduction). FP16 inference lowers this further to 8.8 ± 0.4 mJ/sample, an 86% reduction relative to BERT-base.

AdaptiveNet needs only 20.5 GFLOPs per forward pass, compared with 21,800 GFLOPs for BERT-base, a difference of three orders of magnitude. This gap stems from AdaptiveNet using 2.1 M parameters instead of BERT’s 110 M, from its DAS mechanism, which averages only 3.1 active attention heads (compared to 12 for BERT), and from its sparse attention pattern, which reduces per-head complexity from O(n²) to O(n log n). Among lightweight baselines, AdaptiveNet’s FLOPs are competitive with CNN (410 GFLOPs) and significantly lower than TinyBERT (3200 GFLOPs), while obtaining much higher accuracy.
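
The percentage reductions and ratios quoted in this subsection follow directly from the reported means, as the following check illustrates. The helper name `reduction_pct` and the dictionary layout are illustrative; the numbers are the means stated in the text.

```python
def reduction_pct(ours, baseline):
    """Relative reduction of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - ours / baseline)

# (AdaptiveNet, BERT-base) mean values as reported in the text
reported = {
    "peak memory FP32 (MB)": (187, 1340),
    "peak memory FP16 (MB)": (112, 1340),
    "training time (min)": (45, 385),
    "training energy (kWh)": (0.82, 7.24),
    "inference energy FP32 (mJ/sample)": (14.2, 62.8),
    "inference energy FP16 (mJ/sample)": (8.8, 62.8),
    "parameters (millions)": (2.1, 110),
}
for name, (ours, base) in reported.items():
    print(f"{name}: {reduction_pct(ours, base):.1f}% reduction")

speedup_bert = 51.0 / 12.0      # FP32 latency ratio vs. BERT-base
speedup_roberta = 34.0 / 12.0   # FP32 latency ratio vs. RoBERTa
flops_ratio = 21_800 / 20.5     # BERT-base GFLOPs / AdaptiveNet GFLOPs
cv_training = 100.0 * 2.1 / 45  # coefficient of variation of training time
print(speedup_bert, round(speedup_roberta, 2), round(flops_ratio), round(cv_training, 1))
```

The BERT-base latency ratio evaluates to 4.25, which is consistent with both of the rounded speedup figures (4.2× and 4.3×) that appear in this subsection.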

5.2. Component-Wise Ablation Study

The ablation study examines the individual contributions and interdependencies of the MSSF layer, the DAS mechanism, and the APS network of AdaptiveNet. Table 5 reports the performance change when each component is removed or simplified, and Figure 5 presents the component-wise performance analysis. The baseline for the component analysis is the complete AdaptiveNet model, which achieves an accuracy of 94.8%. The MSSF layer is critical: replacing it with single-scale CNNs reduced accuracy by 2.3%, which shows that multi-scale feature fusion is needed to capture patterns effectively. Using fixed four-head attention instead of DAS cost 1.8% accuracy and added 35% computational cost, demonstrating the efficacy of the DAS mechanism. Replacing the APS network with fully connected layers led to a 1.5% accuracy loss and a 120% increase in parameters, demonstrating the efficacy of context-aware parameter sharing. The combination analysis indicates a strong interaction between the MSSF and DAS components: their joint removal resulted in a 3.8% accuracy loss, which exceeds either individual loss (2.3% and 1.8%). DAS and APS also appear complementary, as their joint removal produces a distinct pattern of performance decline. The ablation study confirms that all three components contribute significantly to AdaptiveNet’s performance, with their integration yielding gains beyond those of any individual component.

5.3. Sensitivity Analysis of DAS Complexity Weights

DAS Mechanism: The input complexity score C is computed as a weighted combination of three sub-scores: vocabulary diversity (D_vocab, weight α), variance in sentence length (V_length, weight β), and syntactic complexity (S_syntax, weight γ), with α + β + γ = 1.0. The values α = 0.4, β = 0.3, and γ = 0.3 were selected by grid search on the validation set. Table 6 presents the DAS Complexity Weight Sensitivity Analysis.
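
The weighted combination can be written out directly, as in the sketch below. It assumes that the three sub-scores have already been normalised to a common range such as [0, 1]; the example sub-score values and the function name are hypothetical.

```python
def complexity_score(d_vocab, v_length, s_syntax,
                     alpha=0.4, beta=0.3, gamma=0.3):
    """C = alpha * D_vocab + beta * V_length + gamma * S_syntax."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return alpha * d_vocab + beta * v_length + gamma * s_syntax

# Hypothetical review: diverse vocabulary, moderate length variance, simple syntax
print(round(complexity_score(0.8, 0.5, 0.2), 2))  # 0.53
```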

5.3.1. Grid Search Methodology

For each weight, we carried out a systematic grid search over the values {0.1, 0.2, 0.3, 0.4, 0.5}, subject to the sum-to-one constraint. This gives 15 valid configurations (shown in Table 6). For each configuration, the full AdaptiveNet model was trained on the training set for 20 epochs (or until early stopping) and its accuracy was then measured on the validation set. Each configuration was assessed using three independent seeds (42, 123, 256) for statistical reliability, and the mean ± standard deviation is presented. To avoid information leakage, the test set was not used during weight selection. The final column reports the hold-out test set performance of the configuration that performed best on the validation set.

5.3.2. Sensitivity Analysis Discussion

Range of Variation: Across all 15 configurations, validation accuracy ranges from 93.21% to 94.78%, a span of 1.57 percentage points. This moderate sensitivity indicates that while weight selection does affect performance, the model is not fragile—even the worst configuration (#1, syntax-heavy) achieves accuracy substantially above all baselines except BERT-base. The tight standard deviations (0.07–0.19%) across seeds further confirm stable behaviour.

Dominance of Vocabulary Diversity: A clear pattern emerges: configurations with α ≥ 0.3 (vocabulary diversity weight) consistently outperform those with α ≤ 0.2. The average accuracy for α ∈ {0.3, 0.4, 0.5} is 94.44%, compared to 93.43% for α ∈ {0.1, 0.2}—a statistically significant difference (paired t-test, p < 0.001). This finding is consistent with the linguistic analysis in Section 3.2.1, where the Kolmogorov–Smirnov test showed that vocabulary diversity has the strongest discriminative power between fake and genuine reviews (D = 0.31 vs. 0.19 and 0.22 for the other sub-scores).

Diminishing Returns at α = 0.5: Configuration #15 (α = 0.5, β = 0.2, γ = 0.3) achieves 94.62%—slightly below the optimal #14 (94.78%). This suggests that over-emphasizing vocabulary diversity at the expense of length variance (β = 0.2 vs. 0.3) causes the model to under-allocate attention for reviews that are lexically simple but structurally complex (e.g., short reviews with nested clauses). The optimal configuration α = 0.4 balances vocabulary emphasis with sufficient attention to structural features.

Computational Impact: Higher α values tend to reduce the average number of DAS attention heads (from 3.8 at α = 0.1 to 3.0 at α = 0.5) because vocabulary diversity scores are generally lower than syntactic complexity scores in the corpus distribution, pushing more reviews below the head allocation thresholds. This creates a secondary efficiency benefit: the optimal configuration (#14) achieves not only the highest accuracy but also low computational cost (20.5 GFLOPs, 12.0 ms inference), demonstrating that accuracy and efficiency objectives are aligned rather than conflicting in this parameter space.
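The idea of allocating attention heads by comparing the complexity score against thresholds can be illustrated with a short sketch. The threshold values, the head counts, and the name `allocate_heads` are hypothetical placeholders; the actual thresholds used by DAS are not restated in this section.

```python
from bisect import bisect_right

# Hypothetical thresholds on the complexity score and the head count
# assigned to each resulting interval (illustrative values only).
THRESHOLDS = (0.25, 0.5, 0.75)
HEADS = (2, 3, 4, 6)


def allocate_heads(complexity, thresholds=THRESHOLDS, heads=HEADS):
    """Map a complexity score to a head count; lower scores get fewer heads."""
    return heads[bisect_right(thresholds, complexity)]


for score in (0.1, 0.3, 0.6, 0.9):
    print(score, allocate_heads(score))
```

Under such a scheme, any change to the weights that lowers typical complexity scores moves more reviews into the lower intervals and therefore reduces the average number of active heads.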

5.3.3. Learnable-Weight Alternative

As an alternative to fixed grid-searched weights, we investigated a learnable-weighting formulation in which α, β, and γ are treated as trainable parameters, parameterized through a softmax to enforce the sum-to-one constraint. Table 7 presents the results of the learnable versus fixed-weight comparison.

[α, β, γ] = Softmax([θ_α, θ_β, θ_γ]), where θ_α, θ_β, θ_γ ∈ ℝ are unconstrained learnable scalars.

The raw parameters θ_α, θ_β, and θ_γ were initialized to equal values (θ = 0, yielding uniform α = β = γ = 1/3) and trained jointly with all other model parameters using the standard AdamW optimizer and cross-entropy loss. No separate loss term or regularization was applied to the weight parameters.
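A minimal sketch of this parameterization in plain Python is given below, showing that zero-initialized raw parameters yield uniform weights and that the outputs always sum to one. In practice the θ values would be trainable parameters of the deep learning framework in use; the standalone `softmax` helper here is an illustrative assumption.

```python
import math


def softmax(thetas):
    """Map unconstrained scalars to positive weights that sum to one."""
    m = max(thetas)  # subtract the maximum for numerical stability
    exps = [math.exp(t - m) for t in thetas]
    total = sum(exps)
    return [e / total for e in exps]


# Initialization described in the text: theta = 0 for all three parameters.
alpha, beta, gamma = softmax([0.0, 0.0, 0.0])
print(alpha, beta, gamma)
```

Since the softmax output is strictly positive and normalized, no projection or explicit constraint handling is needed during gradient-based training.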

The learnable weights converge to values remarkably close to the grid-searched optimum: α = 0.383 ± 0.007 (vs. 0.4), β = 0.309 ± 0.010 (vs. 0.3), and γ = 0.307 ± 0.003 (vs. 0.3). This convergence provides independent validation that the grid-searched weights are near-optimal. The small standard deviation across seeds (0.003–0.010) indicates stable convergence of the weight parameters.

The learnable variant achieves slightly lower accuracy (94.75% vs. 94.81% on the test set), likely due to the introduction of three additional trainable parameters that slightly increase optimization landscape complexity without meaningful representational benefit. Based on these findings, we retain the fixed grid-searched weights (α = 0.4, β = 0.3, γ = 0.3) in the final model for two reasons: (a) marginally higher accuracy, and (b) interpretability—fixed weights provide a transparent and reproducible complexity assessment that practitioners can understand and adjust for domain-specific deployments without retraining the model.

5.4. Cross-Domain Generalization Analysis

An essential aspect of real-world fake review detection systems is generalization across e-commerce platforms without the need for platform-specific retraining. This section reports extensive cross-domain validation experiments. In contrast to the primary evaluation (Section 5, Tables 1–4), which used stratified splits of a combined multi-platform dataset for training and testing, the experiments here assess transfer performance by restricting training to one platform and testing on another.

5.4.1. Experimental Design

Six cross-domain transfer experiments were performed, encompassing all pairwise source → target combinations across the three platforms:

(i): Amazon → Yelp: Train on 500 K Amazon product reviews, test on 300 K Yelp restaurant reviews
(ii): Amazon → TripAdvisor: Train on 500 K Amazon product reviews, test on 200 K TripAdvisor hotel reviews
(iii): Yelp → Amazon: Train on 300 K Yelp restaurant reviews, test on 500 K Amazon product reviews
(iv): Yelp → TripAdvisor: Train on 300 K Yelp restaurant reviews, test on 200 K TripAdvisor hotel reviews
(v): TripAdvisor → Amazon: Train on 200 K TripAdvisor hotel reviews, test on 500 K Amazon product reviews
(vi): TripAdvisor → Yelp: Train on 200 K TripAdvisor hotel reviews, test on 300 K Yelp restaurant reviews

For each experiment, the source dataset was split into training and validation sets in an 85/15 proportion. No target-domain data were used for training; hence, the entire target dataset serves as the test set. All models were trained with the same set of hyperparameters as given in Section 4.2 for the main experiments. All values are presented as mean ± standard deviation over five seeds.
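The experimental design can be enumerated programmatically. The sketch below builds every ordered source → target pair from the three platforms and derives the split sizes implied by the 85/15 source split; the function name `transfer_plan` and the dictionary layout are illustrative assumptions.

```python
from itertools import permutations

# Dataset sizes as stated in the experiment list.
SIZES = {"Amazon": 500_000, "Yelp": 300_000, "TripAdvisor": 200_000}


def transfer_plan(sizes, train_frac=0.85):
    """List all ordered source -> target pairs with their split sizes."""
    plan = []
    for source, target in permutations(sizes, 2):
        n_train = round(sizes[source] * train_frac)
        plan.append({
            "source": source,
            "target": target,
            "train": n_train,                 # 85% of the source data
            "val": sizes[source] - n_train,   # remaining 15% of the source data
            "test": sizes[target],            # entire target dataset
        })
    return plan


plan = transfer_plan(SIZES)
for p in plan:
    print(f"{p['source']} -> {p['target']}: "
          f"train={p['train']}, val={p['val']}, test={p['test']}")
```

Iterating over ordered pairs rather than unordered combinations matters here, because each direction of transfer is a distinct experiment.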

5.4.2. Cross-Domain Transfer Results

Cross-Domain Transfer Performance: On average, AdaptiveNet reaches the highest cross-domain accuracy, 89.3 ± 0.3% over the six transfer experiments, outperforming BERT-base (85.6 ± 0.4%) and RoBERTa (84.9 ± 0.5%) by margins of +3.7 and +4.4 points, respectively. Its transfer ratio of 0.942 (94.2% of in-domain performance retained) is the highest across all models considered, which suggests that the representations learned by AdaptiveNet are more domain-invariant than those of even larger pretrained transformers. Table 8 presents the Cross-Domain Transfer Accuracy and Table 9 presents the Accuracy Drop from In-Domain to Cross-Domain.
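The reported figures can be checked with simple arithmetic. The sketch below assumes that the transfer ratio is defined as cross-domain accuracy divided by in-domain accuracy, and that AdaptiveNet's in-domain accuracy is the 94.8% reported for the main evaluation; both the definition and the function name are inferred for illustration.

```python
def transfer_ratio(cross_domain_acc, in_domain_acc):
    """Fraction of in-domain accuracy retained under cross-domain transfer."""
    return cross_domain_acc / in_domain_acc


# Average cross-domain accuracies (%) reported in this section.
adaptivenet, bert_base, roberta = 89.3, 85.6, 84.9

ratio = transfer_ratio(adaptivenet, 94.8)
margin_bert = adaptivenet - bert_base
margin_roberta = adaptivenet - roberta
print(round(ratio, 3), round(margin_bert, 1), round(margin_roberta, 1))
```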

5.4.3. Analysis of Cross-Domain Results

Asymmetric Transfer Patterns: A closer examination of transfer direction indicates consistent asymmetries. Transfers from Amazon (the largest and most diverse dataset) achieve the highest target accuracies (average 90.1% across targets), whilst transfers from TripAdvisor (the smallest dataset) yield the lowest (average 88.4%). This holds for all models, indicating that the amount and diversity of source-domain training data are key determinants of transfer quality. Notably, the gap between AdaptiveNet and the baselines is largest for the most difficult transfer direction (TripAdvisor → Amazon): AdaptiveNet achieves 87.5% while BERT-base scores only 83.4%, a gain of 4.1 percentage points. This implies that flexible feature extraction at large, intermediate, and small scales is especially useful when less training data is available.

Platform Similarity Effects: The Yelp ↔ TripAdvisor transfers (90.1% and 89.2% for AdaptiveNet) consistently outperform those involving Amazon as source or target. This difference is explained by the higher domain similarity between restaurant and hotel reviews (both service-industry and experience-oriented) than between either of them and product reviews (more goods-oriented and specification-focused). We surmise that the cross-scale attention mechanism of the MSSF layer captures service-domain patterns of deceptive language (e.g., overstated experiential terms, fabricated temporal aspects) that transfer strongly across hospitality sub-domains.

5.4.4. Domain-Invariant Feature Analysis

To understand the underlying reason behind the superior cross-domain transfer learned by AdaptiveNet, we analysed what features were captured in each domain by each component:

MSSF Layer (Multiscale Textual Deceptions): The parallel CNN branches capture deceptive linguistic features at multiple granularities, and these features are domain-invariant. At the local level (3-grams), the MSSF layer detects word-level cues of deception such as excessive use of superlatives (e.g., ‘amazing’, ‘perfect’, and ‘the worst’), shifts in first-person pronoun density, and hedging expressions that indicate uncertainty (e.g., ‘I think’, ‘probably’). At the phrase level (kernel size 5), it picks up clichés that recur across fake reviews independent of domain, such as generic praise strings (e.g., ‘highly recommend’, ‘exceeded expectations’) and templated complaint structures. At the discourse level (kernel size 7), it detects paragraph-level coherence irregularities; in particular, fake reviews frequently have abrupt topic jumps and varying specificity levels. Such multi-scale patterns stay consistent across product, restaurant, and hotel domains, which explains the MSSF layer’s contribution to domain transfer.

DAS Mechanism (Complexity-Adaptive Processing): The complexity-driven head allocation adapts naturally to different domains, requiring no retraining. Complexity score distributions differ significantly across domains: hotel reviews are often more syntactically complex (deeper parse trees due to descriptive language about amenities), while product reviews have higher vocabulary diversity (domain-specific technical terminology). In line with these domain-specific complexity profiles, the DAS mechanism adaptively adjusts attention depth to place more heads on structurally complex reviews regardless of their source domain. This inherent adaptability offers a built-in domain generalization mechanism that is missing in fixed-computation models.

APS Network (Context-Adaptive Gates): The APS network learns context-adaptive gates that generate domain-adaptive weight modulation without explicit domain information. Analysis of gate activation patterns (g₁–g₄) across domains indicates that gate g₁ (which controls the first projection layer) exhibits the greatest domain-specific variability: mean activation is 0.72 for Amazon reviews, 0.68 for Yelp, and 0.65 for TripAdvisor, suggesting that the initial projection layer adapts most strongly to domain-specific vocabulary distributions. In contrast, gates g₃ and g₄ (1–2 layers deep) are relatively invariant across domains (mean activations range from 0.51 to 0.54), suggesting that deeper APS layers learn domain-invariant abstract features that facilitate transfer. This hierarchical specialization, with domain-adaptive front layers and domain-invariant back layers, lends itself naturally to cross-domain generalization.

5.4.5. Comparing with Domain Adaptation Baselines

To contextualize the cross-domain performance of AdaptiveNet, we compare against two standard domain adaptation methods applied to BERT-base:

(a): BERT + Domain-Adversarial Training (DANN): A gradient reversal layer was attached to BERT-base to encourage domain-invariant representations, using labelled source and unlabelled target data during training. Average cross-domain accuracy: 87.2 ± 0.5%, which is 1.6 percentage points above vanilla BERT-base (85.6%) but trails AdaptiveNet (89.3%) by 2.1 percentage points.
(b): BERT + Multi-Domain Pretraining: BERT-base was first pretrained (MLM objective) on unlabelled reviews from the target domain and then trained on labelled data from the source domain. Cross-domain accuracy (mean ± confidence interval): 88.1 ± 0.4%, which narrows the gap with AdaptiveNet to only 1.2 percentage points, but at the expense of requiring unlabelled target-domain data and additional pretraining overhead (about another hour on an A100 per target domain).

Notably, AdaptiveNet reaches 89.3% cross-domain accuracy without applying any domain adaptation technique and without any access to target-domain data (labelled or unlabelled). This indicates that the implicit domain adaptation provided by AdaptiveNet’s architectural design (multi-scale fusion, adaptive attention, and context-conditioned parameter sharing) outperforms, or is at least competitive with, explicit domain adaptation techniques while keeping dramatically lower computational costs (2.1 M vs. 110 M parameters). AdaptiveNet thus presents an appealing zero-shot transfer approach that eliminates the need to collect platform-specific data and adapt models, both of which are expensive for practitioners deploying fake review detection on a new platform.

5.5. Efficiency–Performance Trade-Off Analysis

The efficiency–performance trade-off analysis quantifies the relationship between computational resources and classification accuracy across all evaluated models, providing insights into optimal operating points for different deployment scenarios. Figure 6 presents the efficiency–performance scatter plot with accuracy on the y-axis and computational efficiency score on the x-axis, where efficiency combines memory usage, inference speed, and parameter count metrics. AdaptiveNet occupies the optimal position in the upper-right quadrant, achieving the highest accuracy (94.8%) while maintaining superior computational efficiency (efficiency score: 92). BERT-base and RoBERTa cluster in the upper-left quadrant with high accuracy but poor efficiency, confirming their computational limitations for practical deployment. DistilBERT and MobileBERT are positioned in the middle region, representing reasonable but suboptimal trade-offs between performance and efficiency. TinyBERT approaches AdaptiveNet’s efficiency but suffers significant accuracy degradation, highlighting the challenge of extreme compression. Traditional CNN and BiLSTM models occupy the lower-right quadrant with good efficiency but inadequate accuracy for practical fake review detection. The Pareto frontier analysis reveals AdaptiveNet as the single optimal solution, offering a superior performance–efficiency combination unmatched by existing approaches. The trade-off analysis demonstrates that AdaptiveNet’s architectural innovations successfully overcome the traditional performance–efficiency dilemma, enabling simultaneous optimization of both objectives through intelligent design choices.

6. Comparative Analysis

6.1. Performance Comparison with State-of-the-Art Models

The comprehensive performance comparison positions AdaptiveNet against leading models in fake review detection and general text classification, demonstrating its competitive advantages across multiple evaluation criteria. Table 10 presents detailed performance metrics including accuracy, precision, recall, F1-score, and AUC-ROC across the three evaluation datasets. AdaptiveNet achieves the highest performance across all metrics, with particularly strong precision (95.1%), indicating low false-positive rates, crucial for practical deployment. The recall score of 94.5% demonstrates effective identification of fake reviews without excessive false negatives that could allow deceptive content to pass undetected. BERT-base achieves competitive performance but falls short in precision (92.8%) and recall (91.6%), resulting in lower overall effectiveness. RoBERTa shows inconsistent performance across metrics, with strong precision (93.2%) but weaker recall (90.4%), suggesting potential bias in classification decisions. DistilBERT demonstrates balanced performance but cannot match AdaptiveNet’s overall effectiveness while requiring significantly more computational resources. The F1-score comparison reveals AdaptiveNet’s superior balanced performance (94.8%) compared to BERT-base (92.2%) and RoBERTa (91.8%). AUC-ROC analysis confirms AdaptiveNet’s excellent discrimination capability (0.972), surpassing all baseline models and indicating robust classification confidence. The consistent superior performance across diverse metrics and datasets confirms AdaptiveNet’s architectural advantages and practical applicability for real-world fake review detection systems.
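The reported F1-scores follow from the reported precision and recall values through the standard harmonic-mean formula, as the short check below shows. The dictionary layout and the function name are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, as reported in this section.
reported = {
    "AdaptiveNet": (95.1, 94.5),
    "BERT-base": (92.8, 91.6),
    "RoBERTa": (93.2, 90.4),
}

f1 = {model: round(f1_score(p, r), 1) for model, (p, r) in reported.items()}
print(f1)
```

Because the harmonic mean is pulled toward the smaller of the two values, RoBERTa's weaker recall lowers its F1-score more than its strong precision raises it.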

6.2. Computational Efficiency Comparison

The computational efficiency comparison analyses resource utilization patterns across different model categories, highlighting AdaptiveNet’s advantages in practical deployment scenarios. Figure 7 presents a radar chart comparing memory usage, training time, inference speed, and energy consumption, normalized against baseline requirements. AdaptiveNet achieves remarkable efficiency gains across all computational metrics, requiring only 14% of BERT’s memory usage while maintaining superior accuracy. Training time analysis reveals AdaptiveNet’s 88% reduction compared to BERT-base, enabling rapid model development and iteration cycles crucial for adapting to evolving fake review patterns. Inference speed improvements of 76% compared to BERT enable real-time processing capabilities essential for large-scale e-commerce platform deployment. Energy consumption analysis shows AdaptiveNet’s 65% reduction compared to transformer baselines, contributing to sustainable AI deployment and reduced operational costs. Parameter efficiency comparison reveals AdaptiveNet’s 98% reduction in parameters compared to BERT while achieving higher accuracy, demonstrating exceptional architectural efficiency. The efficiency analysis confirms AdaptiveNet’s practical advantages for resource-constrained environments and large-scale deployment scenarios where computational resources directly impact operational costs. Traditional models show moderate efficiency but cannot match AdaptiveNet’s performance levels, while compressed transformer variants sacrifice accuracy for efficiency gains that remain inferior to AdaptiveNet’s optimization. Figure 8 shows the stacked area chart for performance comparison and Figure 9 shows the bar chart comparison of different models vs. efficiency.
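The relative reductions quoted above can be reproduced from absolute figures. The sketch below uses the parameter counts given in Section 5.4.5 (2.1 M for AdaptiveNet, 110 M for BERT-base) and the stated memory share of 14%; the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(baseline, value):
    """Reduction of `value` relative to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# Parameter counts in millions: BERT-base vs. AdaptiveNet.
param_reduction = percent_reduction(110.0, 2.1)

# Memory expressed as a percentage of BERT's usage.
memory_reduction = percent_reduction(100.0, 14.0)

print(round(param_reduction, 1), round(memory_reduction, 1))
```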

6.3. Scalability and Real-World Deployment Analysis

The scalability evaluation examines AdaptiveNet’s performance characteristics under varying load conditions and dataset sizes, crucial for understanding practical deployment feasibility across different e-commerce platforms. Table 11 presents throughput analysis showing requests processed per second under different hardware configurations and batch sizes. AdaptiveNet demonstrates linear scalability with increasing computational resources, processing 2340 reviews per second on standard GPU hardware compared to BERT’s 485 reviews per second. CPU deployment scenarios show AdaptiveNet’s superior efficiency with 156 reviews per second compared to BERT’s 23 reviews per second, enabling deployment on resource-constrained edge devices. Memory scaling analysis reveals AdaptiveNet’s consistent performance across different batch sizes, maintaining sub-linear memory growth compared to quadratic scaling observed in transformer models. Latency analysis under high-load conditions shows AdaptiveNet’s stable response times averaging 4.2 ms per review compared to BERT’s 18.7 ms, crucial for real-time user experience. The architectural design enables horizontal scaling through model parallelism and distributed inference, supporting large-scale deployment requirements. Cost analysis indicates AdaptiveNet’s operational advantages with 73% lower cloud computing costs compared to BERT-based solutions, directly impacting commercial viability. Real-world deployment testing across three e-commerce platforms confirms AdaptiveNet’s practical effectiveness with consistent performance under production workloads and varying review volume patterns.
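For reference, the throughput and latency figures above translate into the following speedup factors. This is plain arithmetic on the reported numbers; the helper name `speedup` is illustrative.

```python
def speedup(fast, slow):
    """Ratio of a higher throughput to a lower one (or slower to faster latency)."""
    return fast / slow


gpu_speedup = speedup(2340, 485)    # reviews per second on GPU
cpu_speedup = speedup(156, 23)      # reviews per second on CPU
latency_ratio = speedup(18.7, 4.2)  # ms per review, BERT vs. AdaptiveNet

print(round(gpu_speedup, 1), round(cpu_speedup, 1), round(latency_ratio, 1))
```

The larger factor on CPU is consistent with the claim that the efficiency advantage matters most on resource-constrained hardware.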

7. Conclusions

AdaptiveNet is a new architecture for fake review detection that achieves both outstanding performance and computational efficiency. The combination of Multi-Scale Semantic Fusion (MSSF), Dynamic Attention Scaling (DAS), and Adaptive Parameter Sharing (APS) addresses the trade-off between performance and efficiency in natural language processing. The thorough assessment conducted on the Amazon, Yelp, and TripAdvisor datasets shows AdaptiveNet attaining 94.8% accuracy while reducing memory consumption, training time, and energy use by 86%, 88%, and 65%, respectively, compared to BERT-base. The architecture’s cross-domain generalization, exceeding 90% accuracy across different platforms, highlights its usefulness for practical deployment. These findings present several firsts: the first case of input-based attention scaling in text classification, the first case of multi-scale feature fusion for hierarchical pattern capture, and context-aware parameter sharing that reduces parameters by 70% without impact on performance. Ablation studies confirmed the benefits of the three components and found that their combination yields performance advantages beyond those of the individual components. The scalability analysis confirms that AdaptiveNet scales linearly with real-time processing, achieving 2340 reviews per second in comparison to BERT’s 485 reviews per second.

The scope of this research extends beyond fake review detection to natural language processing applications that operate under resource constraints and require efficiency. AdaptiveNet’s design shows how architectural choices can balance AI efficiency, environmental impact, and performance constraints. Possible extensions of this work include adapting the architecture to other NLP tasks, hardware optimizations targeted at edge computing, and privacy-preserving federated learning frameworks for fake review detection. The results obtained with Dynamic Attention Scaling strongly support its further use in resource-constrained multi-modal learning. Adaptive Parameter Sharing frameworks can be useful for continual learning tasks in which the model rapidly adjusts to new tasks while retaining prior knowledge. From a practical perspective, this work addresses the industry’s need for scalable, efficient, and precise fake review detection systems, given its applicability across diverse e-commerce systems and its low operational expenditure. AdaptiveNet makes practical AI deployment in resource-constrained settings achievable and paves the way for novel efficient neural architectures. By combining theoretical advances with practical impact, AdaptiveNet stands as a robust contribution to both academic and industrial pursuits in efficient AI systems.

Author Contributions

D.P.: Writing—review and editing, Writing—original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. S.R.P.C.: Writing—review and editing, Writing—original draft, Visualization, Validation, Software, Resources, Project administration, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. R.T.: Writing—review and editing, Writing—original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

Partial financial support received from Dr. Rajermani Thinakaran: INTI IU Research Seeding Grant 2023: INTI-FDSIT-02-12-2023.

Data Availability Statement

The datasets used in this study are all publicly available benchmark datasets from three different e-commerce domains. The Amazon Product Reviews dataset contains 500,000 reviews covering electronics, books, clothing, and home goods, and thus offers a variety of types and styles: McAuley, J.; Leskovec, J. Amazon Product Data. Available online: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews (accessed on 15 December 2024). He, R.; McAuley, J. Amazon Review Data (2018). Available online: https://www.kaggle.com/datasets/kashnitsky/amazon-reviews (accessed on 15 December 2024). The Yelp dataset of restaurant reviews provides 300,000 reviews spanning different types and styles of restaurants, cuisines, and countries, and is rich both temporally and linguistically: Yelp Inc. Yelp Open Dataset. Available online: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset (accessed on 15 December 2024). Anderson, M. Yelp Restaurant Reviews Dataset. Available online: https://www.kaggle.com/datasets/omkarsabnis/yelp-reviews-dataset (accessed on 15 December 2024). The TripAdvisor Hotel Reviews dataset contains 200,000 reviews from 200,000 hotels around the world, spanning numerous countries and service categories and capturing diverse cultures: Mishra, A. TripAdvisor Hotel Reviews. Available online: https://www.kaggle.com/datasets/andrewmvd/trip-advisor-hotel-reviews (accessed on 15 December 2024). DataFiniti Inc. Hotel Reviews Data. Available online: https://www.kaggle.com/datasets/datafiniti/hotel-reviews (accessed on 15 December 2024). Balanced sampling ensures equal proportions of fake and genuine reviews from all datasets; the resulting one million reviews form a single dataset, which is split 70% for training, 15% for validation, and 15% for testing in all experiments.

Conflicts of Interest

On behalf of all authors, the corresponding author states that there are no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MSSF	Multi-Scale Semantic Fusion
DAS	Dynamic Attention Scaling
APS	Adaptive Parameter Sharing
Bi-LSTM	Bidirectional Long Short-Term Memory
CNN	Convolutional Neural Network
NLP	Natural Language Processing
RNN	Recurrent Neural Network
GPU	Graphics Processing Unit
LSTM	Long Short-Term Memory
BERT	Bidirectional Encoder Representations from Transformers
HTML	Hypertext Markup Language
BiLSTM	Bidirectional Long Short-Term Memory
AUC-ROC	Area Under the Receiver Operating Characteristic Curve
RoBERTa	Robustly Optimized BERT Pretraining Approach

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. MobileBERT: A compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Virtual Event, 5–10 July 2020; pp. 2158–2170. [Google Scholar]
He, P.; Gao, J.; Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Wang, Z.; Yao, A.; Xu, G.; Ren, M. A large language model-based approach for fake review detection: The implicit characteristics perspective. Inf. Process. Manag. 2026, 63, 104352. [Google Scholar] [CrossRef]
Xu, C.; McAuley, J. A survey on model compression and acceleration for pretrained language models. Proc. AAAI Conf. Artif. Intell. 2024, 38, 19439–19447. [Google Scholar] [CrossRef]
McAuley, J.; Leskovec, J. Amazon Product Data. Available online: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews (accessed on 15 December 2024).
He, R.; McAuley, J. Amazon Review Data (2018). Available online: https://nijianmo.github.io/amazon/index.html (accessed on 15 December 2024).
Yelp Inc. Yelp Open Dataset. Available online: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset (accessed on 15 December 2024).
Anderson, M. Yelp Restaurant Reviews Dataset. Available online: https://www.kaggle.com/datasets/omkarsabnis/yelp-reviews-dataset (accessed on 15 December 2024).
Mishra, A. TripAdvisor Hotel Reviews. Available online: https://www.kaggle.com/datasets/andrewmvd/trip-advisor-hotel-reviews (accessed on 15 December 2024).
DataFiniti Inc. Hotel Reviews Data. Available online: https://www.kaggle.com/datasets/datafiniti/hotel-reviews (accessed on 15 December 2024).
Jindal, N.; Liu, B. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining, Stanford, CA, USA, 11–12 February 2008; pp. 219–230. [Google Scholar]
Fakespot Inc. Fakespot Analyzer. Available online: https://www.fakespot.com (accessed on 10 January 2025).
Luca, M.; Zervas, G. Fake it till you make it: Reputation, competition, and Yelp review fraud. Manag. Sci. 2016, 62, 3412–3427. [Google Scholar] [CrossRef]
Ott, M.; Choi, Y.; Cardie, C.; Hancock, J.T. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, OR, USA, 19–24 June 2011; pp. 309–319. [Google Scholar]
Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
Newman, M.L.; Pennebaker, J.W.; Berry, D.S.; Richards, J.M. Lying words: Predicting deception from linguistic styles. Personal. Soc. Psychol. Bull. 2003, 29, 665–675. [Google Scholar] [CrossRef]

Figure 1. Proposed architecture overview.

Figure 2. Multi-Scale Semantic Fusion (MSSF) Layer Architecture.

Figure 3. Dynamic Attention Scaling (DAS) Mechanism.

Figure 4. Adaptive Parameter Sharing (APS) Network Flow.

Figure 5. Component-wise performance analysis.

Figure 6. Performance efficiency.

Figure 7. Radar chart for comparing training time, inference speed, energy consumption and memory usage.

Figure 8. Stacked area chart for performance comparison.

Figure 9. Bar chart comparison of different models vs. efficiency.

Table 1. Experimental hardware configuration.

Component	Specification
GPU	NVIDIA A100 SXM4 (40 GB HBM2e, 6912 CUDA cores, 432 Tensor cores, 1.41 GHz boost clock, 1555 GB/s memory bandwidth)
CPU	Intel Xeon Gold 6248R (3.0 GHz base/4.0 GHz turbo, 24 cores/48 threads, 35.75 MB L3 cache)
System Memory	128 GB DDR4-3200 ECC (4 × 32 GB, quad-channel)
Storage	1 TB Samsung 980 PRO NVMe SSD (sequential read: 7000 MB/s)
Interconnect	PCIe Gen4 x16 (GPU-CPU), NVLink 3.0 (600 GB/s bisection)
Power Supply	2000 W, 80 PLUS Platinum (for stable power delivery during measurement)
Cooling	Liquid cooling, ambient temperature maintained at 22 ± 1 °C throughout experiments

Table 2. Software Environment Specification.

Component	Version/Setting
Operating System	Ubuntu 20.04.6 LTS (kernel 5.4.0-150-generic)
CUDA Toolkit	CUDA 11.8 (driver 520.61.05)
cuDNN	cuDNN 8.6.0 for CUDA 11.x
Python	Python 3.10.12 (CPython)
PyTorch	PyTorch 2.0.1 + cu118 (with torch.compile disabled for fair comparison)
Hugging Face Transformers	Transformers 4.30.2 (for baseline model loading and fine-tuning)
Tokenizers	Tokenizers 0.13.3 (Hugging Face fast tokenizers)
NVML Interface	pynvml 11.5.0 (for GPU power measurement)
RAPL Interface	pyRAPL 0.2.3.1 (for CPU energy measurement)
spaCy	spaCy 3.5.3 with en_core_web_sm (for DAS syntactic parsing)
NumPy/SciPy	NumPy 1.24.3, SciPy 1.10.1
Random Seeds	torch.manual_seed(s), numpy.random.seed(s), s ∈ {42, 123, 256, 512, 1024}
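
The seeding row of Table 2 can be illustrated with a small helper. This is a minimal sketch, not the authors' code: the function name `seed_everything` and the extra call to Python's own `random.seed` are assumptions added here, and NumPy and PyTorch are seeded only if they are installed.

```python
import random

# Seed values listed in Table 2.
SEEDS = (42, 123, 256, 512, 1024)


def seed_everything(seed: int) -> None:
    """Seed the generators named in Table 2 (hypothetical helper)."""
    # Not listed in Table 2; added so pure-Python sampling repeats as well.
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # numpy.random.seed(s) from Table 2
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # torch.manual_seed(s) from Table 2
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


def first_draws(seeds=SEEDS):
    """Return the first pseudo-random draw obtained after each seed."""
    draws = {}
    for seed in seeds:
        seed_everything(seed)
        draws[seed] = random.random()
    return draws
```

Calling `first_draws()` twice should return identical dictionaries, which is the property repeated runs over the five seeds rely on.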

Table 3. Overall accuracy results across three datasets.

Model	Amazon (%)	Yelp (%)	TripAdvisor (%)	Average (%)
AdaptiveNet	95.2	94.6	94.5	94.8
BERT-base [25]	92.4	91.8	92.1	92.1
RoBERTa [26]	92.1	91.2	92.1	91.8
DistilBERT [27]	90.8	89.9	90.5	90.4
MobileBERT [28]	88.7	87.9	88.2	88.3
TinyBERT [36]	86.5	85.8	86.1	86.1
CNN [10]	85.2	84.6	84.9	84.9
BiLSTM [11]	83.9	83.1	83.5	83.5

AdaptiveNet averages 94.8% accuracy across the three datasets, 2.7 percentage points above the best-performing baseline (BERT-base, 92.1%), and it is the most accurate model on each dataset individually.

Table 4. Computational efficiency comparison.

Model	Memory (MB)	Training Time (min)	Inference (ms/Sample)	Params (M)	Energy Train (kWh)	Energy Infer (mJ/Sample)	FLOPs (G)	Precision
AdaptiveNet	187 ± 3	45 ± 2.1	12.0 ± 0.3	2.1	0.82 ± 0.04	14.2 ± 0.6	20.5	FP32
AdaptiveNet (FP16)	112 ± 2	31 ± 1.8	7.4 ± 0.2	2.1	0.56 ± 0.03	8.8 ± 0.4	20.5	FP16
BERT-base [25]	1340 ± 8	385 ± 11.2	51.0 ± 1.4	110	7.24 ± 0.31	62.8 ± 2.1	21,800	FP32
RoBERTa [26]	1285 ± 10	412 ± 14.6	34.0 ± 1.1	125	7.86 ± 0.38	42.4 ± 1.8	24,500	FP32
DistilBERT [27]	745 ± 5	198 ± 7.3	28.0 ± 0.8	66	3.61 ± 0.18	34.2 ± 1.3	11,300	FP32
MobileBERT [28]	425 ± 4	156 ± 5.8	22.0 ± 0.6	25	2.78 ± 0.14	26.8 ± 1.0	5700	FP32
TinyBERT [36]	298 ± 3	89 ± 3.4	18.0 ± 0.5	14.5	1.52 ± 0.08	21.4 ± 0.8	3200	FP32
CNN [10]	234 ± 2	67 ± 2.8	15.0 ± 0.4	3.2	1.14 ± 0.06	17.8 ± 0.7	410	FP32
BiLSTM [11]	412 ± 4	98 ± 3.9	25.0 ± 0.7	5.8	1.76 ± 0.09	30.2 ± 1.1	890	FP32

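AdaptiveNet records the lowest memory footprint, training time, inference latency, parameter count, energy use, and FLOPs of all evaluated models; in FP32 it needs 187 MB of memory and 2.1 M parameters, against 1340 MB and 110 M for BERT-base. Together with its accuracy, this gives it the best accuracy–efficiency trade-off among the models compared.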
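
The gap between two rows of Table 4 can be expressed as a relative reduction. The sketch below is illustrative only: it copies the FP32 AdaptiveNet and BERT-base figures from Table 4, and the names `TABLE4`, `METRICS`, and `reductions` are assumptions introduced here.

```python
# FP32 rows copied from Table 4, in this order:
# memory (MB), inference (ms/sample), parameters (M), training energy (kWh)
TABLE4 = {
    "AdaptiveNet": (187, 12.0, 2.1, 0.82),
    "BERT-base": (1340, 51.0, 110, 7.24),
}
METRICS = ("memory", "inference", "params", "train_energy")


def reduction_pct(model_value: float, baseline_value: float) -> float:
    """Percentage by which model_value is lower than baseline_value."""
    return (1.0 - model_value / baseline_value) * 100.0


def reductions(model: str = "AdaptiveNet", baseline: str = "BERT-base") -> dict:
    """Relative reduction per metric, rounded to one decimal place."""
    pairs = zip(METRICS, TABLE4[model], TABLE4[baseline])
    return {name: round(reduction_pct(m, b), 1) for name, m, b in pairs}


if __name__ == "__main__":
    for name, pct in reductions().items():
        print(f"{name}: {pct}% lower than BERT-base")
```

With the Table 4 values, this gives reductions of 86.0% in memory, 76.5% in inference time, 98.1% in parameters, and 88.7% in training energy relative to BERT-base.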

Table 5. Component-wise ablation study results.

Configuration	Accuracy (%)	Memory (MB)	Parameters (M)	Inference (ms)
Complete AdaptiveNet	94.8	187	2.1	12
Without MSSF	92.5	165	1.8	10
Without DAS	93.0	252	2.1	16
Without APS	93.3	187	4.6	12
Without MSSF + DAS	91.0	234	1.8	14
Without DAS + APS	91.7	298	4.6	18
Without MSSF + APS	90.8	165	4.1	10
CNN Baseline	84.9	234	3.2	15

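Removing MSSF, DAS, or APS individually lowers accuracy from 94.8% to 92.5%, 93.0%, and 93.3%, respectively, and removing any two components lowers it to between 90.8% and 91.7%. The components also affect efficiency: without DAS, memory rises to 252 MB and inference to 16 ms, and without APS the parameter count rises from 2.1 M to 4.6 M. Each component therefore contributes to AdaptiveNet’s performance, and the complete model is the most accurate configuration.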

Table 6. DAS Complexity Weight Sensitivity Analysis.

Sl. No	α	β	γ	Val Accuracy (%)	Avg Heads	FLOPs (G)	Infer (ms)
1	0.1	0.1	0.8	93.21 ± 0.18	3.8	23.1	13.4
2	0.1	0.2	0.7	93.38 ± 0.15	3.7	22.8	13.2
3	0.1	0.3	0.6	93.52 ± 0.14	3.6	22.4	13.0
4	0.1	0.4	0.5	93.41 ± 0.16	3.5	22.1	12.8
5	0.1	0.5	0.4	93.29 ± 0.19	3.4	21.8	12.6
6	0.2	0.2	0.6	93.68 ± 0.13	3.5	22.0	12.9
7	0.2	0.3	0.5	93.84 ± 0.12	3.4	21.6	12.7
8	0.2	0.4	0.4	93.71 ± 0.14	3.3	21.3	12.5
9	0.2	0.5	0.3	93.55 ± 0.17	3.2	21.0	12.3
10	0.3	0.2	0.5	94.12 ± 0.11	3.3	21.2	12.5
11	0.3	0.3	0.4	94.36 ± 0.09	3.2	20.9	12.3
12	0.3	0.4	0.3	94.28 ± 0.10	3.1	20.6	12.1
13	0.4	0.2	0.4	94.51 ± 0.08	3.2	20.8	12.2
14	0.4	0.3	0.3	94.78 ± 0.07	3.1	20.5	12.0
15	0.5	0.2	0.3	94.62 ± 0.09	3.0	20.2	11.9
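
Every row of Table 6 uses weights in steps of 0.1 that sum to 1. The sketch below enumerates all such triples and picks the best reported row; it assumes each weight is at least 0.1, and the names `simplex_grid` and `REPORTED` are introduced here for illustration. The accuracies are the mean validation values copied from Table 6.

```python
from itertools import product


def simplex_grid(steps: int = 10, minimum: int = 1):
    """All (alpha, beta, gamma) in multiples of 1/steps that sum to 1.

    Integer arithmetic is used for the sum so that no floating-point
    tolerance is needed when checking the constraint.
    """
    grid = []
    for a, b in product(range(minimum, steps), repeat=2):
        c = steps - a - b
        if c >= minimum:
            grid.append((a / steps, b / steps, c / steps))
    return grid


# Mean validation accuracy (%) for the 15 combinations reported in Table 6.
REPORTED = {
    (0.1, 0.1, 0.8): 93.21, (0.1, 0.2, 0.7): 93.38, (0.1, 0.3, 0.6): 93.52,
    (0.1, 0.4, 0.5): 93.41, (0.1, 0.5, 0.4): 93.29, (0.2, 0.2, 0.6): 93.68,
    (0.2, 0.3, 0.5): 93.84, (0.2, 0.4, 0.4): 93.71, (0.2, 0.5, 0.3): 93.55,
    (0.3, 0.2, 0.5): 94.12, (0.3, 0.3, 0.4): 94.36, (0.3, 0.4, 0.3): 94.28,
    (0.4, 0.2, 0.4): 94.51, (0.4, 0.3, 0.3): 94.78, (0.5, 0.2, 0.3): 94.62,
}

# Combination with the highest reported validation accuracy.
best = max(REPORTED, key=REPORTED.get)

if __name__ == "__main__":
    print(len(simplex_grid()), "candidate triples;", len(REPORTED), "reported")
    print("best reported:", best, REPORTED[best])
```

Under these assumptions the full grid holds 36 triples, of which Table 6 reports 15, and the best reported combination is (0.4, 0.3, 0.3).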

Table 7. Learnable- vs. fixed-weight comparison.

Configuration	Converged Weights (α, β, γ)	Val Accuracy (%)	Test Accuracy (%)
Fixed (grid-searched)	0.400, 0.300, 0.300	94.78 ± 0.07	94.81 ± 0.09
Learnable (seed 42)	0.382, 0.312, 0.306	94.71 ± 0.08	94.74 ± 0.10
Learnable (seed 123)	0.391, 0.298, 0.311	94.75 ± 0.07	94.78 ± 0.09
Learnable (seed 256)	0.377, 0.318, 0.305	94.68 ± 0.09	94.72 ± 0.11
Learnable (mean ± std)	0.383 ± 0.007, 0.309 ± 0.010, 0.307 ± 0.003	94.71 ± 0.04	94.75 ± 0.03
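
The "mean ± std" row of Table 7 can be reproduced from the three learnable-weight runs. This is a sketch for checking the arithmetic; it assumes the spread is the sample standard deviation (n − 1 in the denominator), which is the convention that matches the reported figures, and the names `RUNS` and `summarize` are introduced here.

```python
from statistics import mean, stdev

# Per-seed results copied from Table 7.
RUNS = {
    42: {"weights": (0.382, 0.312, 0.306), "val": 94.71, "test": 94.74},
    123: {"weights": (0.391, 0.298, 0.311), "val": 94.75, "test": 94.78},
    256: {"weights": (0.377, 0.318, 0.305), "val": 94.68, "test": 94.72},
}


def summarize(values, digits):
    """Return (mean, sample standard deviation), both rounded."""
    return round(mean(values), digits), round(stdev(values), digits)


# Transpose the per-seed weight triples into one sequence per weight.
alphas, betas, gammas = zip(*(run["weights"] for run in RUNS.values()))
val_acc = [run["val"] for run in RUNS.values()]
test_acc = [run["test"] for run in RUNS.values()]

if __name__ == "__main__":
    for name, values in (("alpha", alphas), ("beta", betas), ("gamma", gammas)):
        print(name, summarize(values, 3))
    print("val", summarize(val_acc, 2))
    print("test", summarize(test_acc, 2))
```

The population standard deviation (`statistics.pstdev`) would give smaller values, for example 0.006 rather than 0.007 for α.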

Table 8. Cross-Domain Transfer Accuracy (%)—Train on source, test on target.

Model	Am→Ye	Am→TA	Ye→Am	Ye→TA	TA→Am	TA→Ye	Average
AdaptiveNet	90.4 ± 0.4	89.8 ± 0.5	88.7 ± 0.5	90.1 ± 0.4	87.5 ± 0.6	89.2 ± 0.5	89.3 ± 0.3
BERT-base [25]	86.8 ± 0.6	86.2 ± 0.7	84.9 ± 0.7	86.5 ± 0.6	83.4 ± 0.8	85.6 ± 0.7	85.6 ± 0.4
RoBERTa [42]	86.1 ± 0.7	85.5 ± 0.8	84.2 ± 0.8	85.8 ± 0.7	82.8 ± 0.9	84.9 ± 0.8	84.9 ± 0.5
DistilBERT [26]	84.2 ± 0.8	83.8 ± 0.9	82.5 ± 0.9	84.0 ± 0.8	81.2 ± 1.0	83.1 ± 0.9	83.1 ± 0.6
MobileBERT [43]	82.1 ± 0.9	81.6 ± 1.0	80.4 ± 1.0	81.8 ± 0.9	79.1 ± 1.1	81.0 ± 1.0	81.0 ± 0.7
TinyBERT [36]	79.8 ± 1.1	79.2 ± 1.2	78.1 ± 1.2	79.5 ± 1.1	76.8 ± 1.3	78.6 ± 1.2	78.7 ± 0.8
CNN [10]	77.5 ± 1.2	76.9 ± 1.3	76.2 ± 1.3	77.1 ± 1.2	74.8 ± 1.4	76.4 ± 1.3	76.5 ± 0.9
BiLSTM [11]	75.8 ± 1.3	75.2 ± 1.4	74.5 ± 1.4	75.4 ± 1.3	73.1 ± 1.5	74.7 ± 1.4	74.8 ± 1.0

Table 9. Accuracy Drop from In-Domain to Cross-Domain (percentage points).

Model	In-Domain Avg (%)	Cross-Domain Avg (%)	Drop (pp)	Relative Drop (%)	Transfer Ratio
AdaptiveNet	94.8	89.3	5.5	5.8%	0.942
BERT-base [25]	92.1	85.6	6.5	7.1%	0.929
RoBERTa [42]	91.8	84.9	6.9	7.5%	0.925
DistilBERT [26]	90.4	83.1	7.3	8.1%	0.919
MobileBERT [43]	88.3	81.0	7.3	8.3%	0.917
TinyBERT [36]	86.1	78.7	7.4	8.6%	0.914
CNN [10]	84.9	76.5	8.4	9.9%	0.901
BiLSTM [11]	83.5	74.8	8.7	10.4%	0.896
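
The three derived columns of Table 9 follow from the two averages in each row. The sketch below shows the arithmetic on three rows; the function name `transfer_stats` is an assumption introduced here.

```python
def transfer_stats(in_domain: float, cross_domain: float):
    """Derive the Table 9 columns from two average accuracies (in %).

    Returns (drop in percentage points, relative drop in %, transfer ratio).
    """
    drop = in_domain - cross_domain
    relative_drop = drop / in_domain * 100.0
    transfer_ratio = cross_domain / in_domain
    return round(drop, 1), round(relative_drop, 1), round(transfer_ratio, 3)


# (in-domain average, cross-domain average) copied from Table 9.
ROWS = {
    "AdaptiveNet": (94.8, 89.3),
    "BERT-base": (92.1, 85.6),
    "BiLSTM": (83.5, 74.8),
}

if __name__ == "__main__":
    for model, (in_dom, cross_dom) in ROWS.items():
        print(model, transfer_stats(in_dom, cross_dom))
```

For AdaptiveNet this yields a 5.5-point drop, a 5.8% relative drop, and a transfer ratio of 0.942, matching the table.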

Table 10. Comprehensive performance metric comparison.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	AUC-ROC
AdaptiveNet	94.8	95.1	94.5	94.8	0.972
BERT-base [25]	92.1	92.8	91.6	92.2	0.945
RoBERTa [26]	91.8	93.2	90.4	91.8	0.938
DistilBERT [27]	90.4	91.1	89.8	90.4	0.925
MobileBERT [28]	88.3	89.2	87.5	88.3	0.908
TinyBERT [36]	86.1	87.3	84.9	86.1	0.892
CNN [10]	84.9	85.7	84.1	84.9	0.878
BiLSTM [11]	83.5	84.2	82.8	83.5	0.865

AdaptiveNet scores highest on every metric in the comparison, with 94.8% accuracy, 95.1% precision, 94.5% recall, a 94.8% F1-score, and an AUC-ROC of 0.972, ahead of every baseline on each metric.
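
The F1-score column of Table 10 is the harmonic mean of the precision and recall columns. The sketch below recomputes it for every row as a consistency check; the names `f1_score` and `TABLE10` are introduced here, and the values are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from Table 10.
TABLE10 = {
    "AdaptiveNet": (95.1, 94.5, 94.8),
    "BERT-base": (92.8, 91.6, 92.2),
    "RoBERTa": (93.2, 90.4, 91.8),
    "DistilBERT": (91.1, 89.8, 90.4),
    "MobileBERT": (89.2, 87.5, 88.3),
    "TinyBERT": (87.3, 84.9, 86.1),
    "CNN": (85.7, 84.1, 84.9),
    "BiLSTM": (84.2, 82.8, 83.5),
}


def recomputed_f1() -> dict:
    """F1 per model, rounded to one decimal place as in Table 10."""
    return {m: round(f1_score(p, r), 1) for m, (p, r, _) in TABLE10.items()}


if __name__ == "__main__":
    for model, f1 in recomputed_f1().items():
        print(f"{model}: recomputed {f1}, reported {TABLE10[model][2]}")
```

All eight recomputed values agree with the reported F1-scores to one decimal place.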

Table 11. Scalability and throughput analysis.

Model	GPU Throughput (Reviews/s)	CPU Throughput (Reviews/s)	Latency (ms)	Memory Growth Rate
AdaptiveNet	2340	156	4.2	Sub-linear
BERT-base [25]	485	23	18.7	Quadratic
RoBERTa [26]	412	19	21.4	Quadratic
DistilBERT [27]	856	45	12.3	Linear
MobileBERT [28]	1245	78	8.9	Linear
TinyBERT [36]	1567	98	6.8	Sub-linear
CNN [10]	1892	134	5.1	Constant
BiLSTM [11]	967	67	9.8	Linear

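AdaptiveNet processes 2340 reviews/s on GPU and 156 reviews/s on CPU with a latency of 4.2 ms, the best figures among the evaluated models; the CNN baseline is next at 1892 reviews/s, 134 reviews/s, and 5.1 ms. These results support its suitability for large-scale commercial deployment.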

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Perumalsamy, D.; Cornelius, S.R.P.; Thinakaran, R. AdaptiveNet: A Novel Architecture for Reducing Computation Complexity to Fake Review Classification. Information 2026, 17, 388. https://doi.org/10.3390/info17040388

