1. Introduction
The exponential growth of e-commerce and online review platforms has fundamentally transformed consumer decision-making processes, with over 93% of consumers consulting online reviews before making purchase decisions [1]. However, this democratization of consumer opinions has been paralleled by an alarming rise in fraudulent review activities. Recent industry reports suggest that a substantial portion of online reviews may be fake or manipulated, with some studies estimating figures as high as 30–40% [2]. That said, estimates vary considerably across studies and platforms; under stricter definitions, roughly 4% of reviews are identified as clearly fake, while higher percentages are reported when broader forms of manipulation (e.g., incentivized or selectively filtered reviews) are considered. The prevalence also varies with the platform, product category, and statistical methodology used. Fake reviews are estimated to directly influence approximately $152 billion in global online spending by misleading consumers into purchasing lower-quality products or services [3]. Beyond this direct spending impact, fake reviews distort fair competition among businesses, reduce consumer welfare, and fundamentally erode consumer trust in digital commerce ecosystems.
Traditional approaches to fake review detection have primarily relied on shallow machine learning methods, including Support Vector Machines (SVMs) [4], Random Forests [5], and neural architectures such as Long Short-Term Memory (LSTM) networks [6] and BERT-based classifiers [7]. While these methods have demonstrated reasonable performance, they suffer from three critical limitations. First, they lack interpretability, providing binary classifications without explaining why a review is deemed fraudulent, which is crucial for regulatory compliance and user trust [8]. Second, they fail to effectively leverage the rich contextual knowledge and reasoning capabilities that have emerged from recent advances in large language models (LLMs) [9,10]. Third, traditional methods typically analyze text in isolation, neglecting multi-sensor behavioral signals such as temporal posting patterns, user metadata, and social network dynamics that often characterize fraudulent activities, similar to anomaly detection challenges in distributed sensor networks [11].
The advent of large language models has revolutionized natural language understanding tasks, with models such as GPT-3 [9], LLaMA [10], and their successors demonstrating unprecedented capabilities in contextual reasoning and knowledge integration. However, directly applying these models to fraud detection faces significant challenges: (1) computational cost: full fine-tuning of LLMs with billions of parameters is prohibitively expensive [12]; (2) domain adaptation: pre-trained LLMs lack specialized knowledge about deceptive patterns in review ecosystems; and (3) explainability gap: despite their reasoning capabilities, LLMs often produce opaque predictions without transparent decision pathways [13].
Recent advances in parameter-efficient fine-tuning, particularly Low-Rank Adaptation (LoRA) [12], offer a promising solution to computational constraints by freezing the majority of pre-trained parameters and introducing trainable low-rank decomposition matrices. Concurrently, Chain-of-Thought (CoT) prompting techniques [14] have demonstrated remarkable success in eliciting step-by-step reasoning from LLMs, enabling transparent decision-making processes. Furthermore, the integration of multimodal information, combining textual content with behavioral metadata, has shown substantial improvements in fraud detection across various domains [15,16].
Building upon these insights, we propose KE-MLLM (Knowledge-Enhanced Multimodal Large Language Model), an integrated framework for fake review detection that combines parameter-efficient LLM adaptation, knowledge-enhanced prompting, multimodal feature fusion, and explicit explainability mechanisms. Rather than introducing an entirely new model architecture, our contribution lies in integrating these complementary elements into a unified and explainable framework for fake review detection. To clarify the methodological novelty of our approach, KE-MLLM differs from standard LoRA fine-tuning, conventional multimodal fusion, and existing LLM-based fake review classifiers in three important aspects. First, rather than using LoRA solely as a lightweight adaptation technique, we combine LoRA-based domain adaptation with structured domain knowledge injection to guide the model toward deception-relevant linguistic and behavioral cues. Second, instead of applying a generic multimodal fusion pipeline, we design a task-specific fusion mechanism that integrates review text with behavioral and user-level signals for fake review detection. Third, unlike existing LLM-based classifiers that mainly focus on label prediction, our framework incorporates an explicit Chain-of-Thought reasoning module to produce interpretable decision rationales. The main contributions of this work are summarized as follows:
Knowledge-enhanced parameter-efficient LLM adaptation: We develop a LoRA-based adaptation scheme for LLaMA-3-8B guided by domain-specific knowledge about deceptive linguistic patterns and behavioral red flags. Compared with standard LoRA fine-tuning, this design is more explicitly tailored to the fake review detection task through knowledge-enhanced prompting.
Task-specific multimodal fusion for fake review detection: We design a multimodal fusion mechanism that integrates textual review content with auxiliary behavioral and user-level signals, including temporal activity patterns and review-history statistics. Our fusion strategy is designed to capture cross-modal inconsistencies that are often associated with deceptive reviews.
Explainable LLM-based classification via Chain-of-Thought reasoning: Beyond predicting whether a review is fake, our framework generates structured reasoning paths that expose the textual and behavioral evidence supporting each decision. This differentiates our framework from many LLM-based classifiers that mainly focus on prediction performance without explicitly generating interpretable rationales.
Comprehensive evaluation on benchmark fake review datasets: We validate the proposed framework on YelpChi and Amazon Reviews from the DGL Fraud Dataset [17], showing that KE-MLLM achieves consistently better results than the selected strong baselines in our experiments. Ablation studies further confirm the contribution of knowledge enhancement, multimodal fusion, and reasoning-based explanation to the overall performance. Furthermore, human evaluation involving domain experts shows that the generated explanations achieve 89.5% consistency with human annotations, substantially outperforming attention-based visualization methods [18].
To further clarify the novelty of KE-MLLM, Table 1 provides a structured comparison between our framework and representative prior approaches in fake review detection, including traditional classifiers, LLM-based methods, and multimodal models. Beyond the immediate fake review detection task, this work provides a useful case study of how parameter-efficient LLM adaptation, multimodal behavioral signals, and reasoning-based explanation can be combined within a single framework for trustworthy AI applications. The explainable nature of our framework makes it particularly suitable for regulatory environments where algorithmic decisions must be auditable and contestable [19]. Furthermore, our multi-sensor fusion approach provides a blueprint for integrating diverse information sources in fraud detection systems, applicable to domains beyond reviews, including social media bot detection [20], financial fraud identification [21], and misinformation classification [22], as well as anomaly detection in smart city sensor networks and IoT security monitoring.
The remainder of this paper is organized as follows:
Section 2 reviews related work in fake review detection, explainable AI, and parameter-efficient LLM fine-tuning.
Section 3 introduces necessary background on large language models and the problem formulation.
Section 4 presents our KE-MLLM framework in detail, including the LoRA adaptation strategy, knowledge enhancement mechanism, multimodal fusion architecture, and CoT explainability module.
Section 5 reports comprehensive experimental results on benchmark datasets, including comparisons with baselines, ablation studies, and human evaluation of explanations. Finally,
Section 6 concludes the paper with discussions on limitations and promising directions for future work.
4. Methodology
In this section, we present the KE-MLLM framework for explainable fake review detection. Our approach synergistically integrates knowledge-enhanced prompting, parameter-efficient LoRA adaptation, multimodal feature fusion, and chain-of-thought reasoning to achieve both high detection accuracy and transparent decision-making.
Figure 1 illustrates the overall architecture of our proposed system. From a methodological perspective, KE-MLLM can be viewed as a structured integration of three complementary components: parameter-efficient LLM adaptation, knowledge-guided reasoning, and multimodal behavioral feature fusion. The framework is designed to leverage the strengths of each component while mitigating their individual limitations in fake review detection tasks.
4.1. Framework Overview
The KE-MLLM framework operates through a four-stage pipeline that transforms raw review data into explainable predictions. Given a review instance , the system first constructs a knowledge-enhanced prompt that incorporates domain-specific fraud indicators alongside the review text. This enriched input is then processed by a LoRA-adapted LLaMA-3-8B model, which generates an initial classification decision. Simultaneously, a multimodal fusion module encodes metadata and behavioral features into dense representations, which are integrated with the textual embeddings through a cross-attention mechanism. Finally, a chain-of-thought reasoning module explicitly generates structured explanations that highlight linguistic anomalies, sentiment inconsistencies, and behavioral red flags. The entire framework is optimized end-to-end through a composite loss function that balances classification accuracy with explanation quality.
4.2. Knowledge-Enhanced Prompting Strategy
Traditional prompting approaches for LLMs typically concatenate task instructions with raw input text, relying on the model’s pre-trained knowledge to identify relevant patterns. However, fake review detection requires specialized understanding of deceptive linguistic strategies and fraud tactics that may not be sufficiently represented in general-purpose pre-training corpora. To address this limitation, we develop a knowledge base that encodes domain-specific expertise about fraudulent review characteristics.
The knowledge base consists of three components: linguistic fraud indicators , sentiment anomaly patterns , and behavioral red flags . Linguistic indicators capture patterns such as excessive use of superlatives (e.g., “absolutely perfect,” “totally amazing”), generic descriptions lacking specific product details, and abnormal syntactic structures. Sentiment anomalies include rating-text mismatches where highly positive textual sentiment accompanies low numerical ratings, or vice versa. Behavioral red flags encompass temporal patterns like review bursts within short time windows and suspicious account characteristics such as newly created accounts with immediate high-volume posting.
The fraud-indicator knowledge base was constructed through a hybrid process combining manual curation and automatic candidate extraction. First, an initial pool of candidate knowledge items was collected from prior studies on deceptive review detection and from commonly reported fraud patterns in online review platforms, covering linguistic cues, sentiment inconsistencies, and behavioral anomalies. Second, additional candidate indicators were identified from the training data by examining recurrent patterns associated with labeled fake reviews, including frequent lexical expressions, rating–text inconsistencies, and abnormal temporal or user-level behaviors. To improve reliability, the candidate knowledge items were manually reviewed and refined by multiple annotators with experience in fake review analysis. During this process, redundant, overly generic, or weakly informative items were removed, and semantically overlapping indicators were merged into concise knowledge statements. Only knowledge items that were consistently judged to be relevant to deceptive review behavior were retained in the final knowledge base. Representative examples of the retained knowledge items are provided in
Appendix A.
For each review instance, we construct a knowledge-enhanced prompt by retrieving the most relevant knowledge items from the knowledge base and incorporating them into the prompt structure. The retrieval process computes semantic similarity between the review text r and the knowledge items using a lightweight encoder. Specifically, we encode the review into an embedding vector e_r and compute cosine similarity with each knowledge item embedding e_{k_i}:

sim(r, k_i) = (e_r · e_{k_i}) / (‖e_r‖ ‖e_{k_i}‖).

The top-k knowledge items with the highest similarity scores are selected and formatted into natural language descriptions that guide the LLM’s attention toward potentially fraudulent patterns. For instance, if a review contains excessive superlatives, the retrieved knowledge might state: “Pay attention to the presence of extreme positive language that may indicate exaggerated claims typical of fake reviews.” This knowledge injection serves as a soft constraint that steers the model’s reasoning process without rigidly constraining its predictions.
The complete prompt structure follows a systematic format that first provides the task instruction, then presents the retrieved knowledge context, followed by the review metadata and text, and finally requests both a classification decision and an explanation. This structured approach ensures the LLM processes information in a logical sequence that mimics expert human analysis of potentially fraudulent reviews.
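To make the retrieval and prompt-assembly steps concrete, the following is a minimal sketch with toy embeddings; the encoder, prompt wording, and knowledge items are illustrative placeholders, not the exact ones used in KE-MLLM.

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two embedding vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_top_k(review_emb, knowledge_embs, knowledge_texts, k=2):
    # rank knowledge items by similarity to the review and keep the top-k
    scores = [cosine_sim(review_emb, e) for e in knowledge_embs]
    order = np.argsort(scores)[::-1][:k]
    return [knowledge_texts[i] for i in order]

def build_prompt(review_text, metadata, knowledge_items):
    # instruction -> retrieved knowledge -> metadata -> review -> request,
    # mirroring the structured prompt order described in the text
    knowledge_block = "\n".join(f"- {item}" for item in knowledge_items)
    return (
        "Task: decide whether the review below is FAKE or GENUINE.\n"
        f"Relevant fraud indicators:\n{knowledge_block}\n"
        f"Metadata: {metadata}\n"
        f"Review: {review_text}\n"
        "Provide a label and a step-by-step explanation."
    )
```

In practice, the knowledge item embeddings can be precomputed once, so retrieval per review reduces to a single similarity scan.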
4.3. LoRA-Based Model Adaptation
While the knowledge-enhanced prompting provides task-specific guidance, adapting the base LLM to the fake review detection domain requires learning from labeled training data. Full fine-tuning of the entire LLaMA-3-8B model with approximately 8 billion parameters would be computationally expensive and prone to overfitting on modestly-sized fraud detection datasets. We therefore employ Low-Rank Adaptation to efficiently adapt the model while preserving its general language understanding capabilities.
We apply LoRA to the query and value projection matrices in the multi-head attention layers, as these components have been shown to be most critical for task adaptation [12]. For each attention head in layer ℓ, the original query projection W_q^(ℓ) is augmented with low-rank matrices:

W_q^(ℓ) ← W_q^(ℓ) + (α/r) B^(ℓ) A^(ℓ),

where W_q^(ℓ) remains frozen and B^(ℓ) ∈ R^{d×r}, A^(ℓ) ∈ R^{r×d} are trainable with rank r ≪ d. The same decomposition is applied to the value projection matrices W_v^(ℓ). During forward propagation, the hidden state h^(ℓ) at layer ℓ is transformed as:

q^(ℓ) = W_q^(ℓ) h^(ℓ) + (α/r) B^(ℓ) A^(ℓ) h^(ℓ).

The scaling factor α/r ensures that the contribution of the low-rank adaptation is appropriately balanced relative to the frozen pre-trained weights. This formulation allows the model to learn task-specific attention patterns while maintaining the broad language understanding encoded in the pre-trained parameters.
The choice of rank r represents a trade-off between expressiveness and efficiency. Lower ranks reduce trainable parameters and training time but may limit the model’s capacity to capture complex task-specific patterns. Through preliminary experiments, we observe that intermediate rank values provide optimal balance, enabling effective adaptation without excessive computational overhead. The LoRA adaptation focuses the model’s attention on fraud-indicative linguistic patterns while avoiding the catastrophic forgetting that can occur with full fine-tuning on small specialized datasets.
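For concreteness, the low-rank update can be sketched in a few lines. This is a toy numpy sketch rather than the actual LLaMA-3-8B implementation; the dimensions are illustrative, and α = 32 follows the configuration reported in Section 5.4.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 32                    # hidden size, LoRA rank, scaling factor

W0 = rng.normal(size=(d, d))              # frozen pre-trained projection
A = rng.normal(scale=0.01, size=(r, d))   # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero-initialized

def lora_forward(h):
    # adapted projection: W0 h + (alpha / r) * B A h; only A and B are trained
    return W0 @ h + (alpha / r) * (B @ (A @ h))

h = rng.normal(size=d)
# with B zero-initialized, the adapted layer reproduces the frozen layer exactly,
# so training starts from the pre-trained model's behavior
assert np.allclose(lora_forward(h), W0 @ h)
```

Only d×r + r×d parameters per adapted matrix are trained, versus d×d for full fine-tuning, which is the source of the efficiency gain discussed above.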
4.4. Multi-Sensor Feature Fusion
Textual content alone provides insufficient information for robust fake review detection, as sophisticated fraudsters can craft linguistically plausible reviews that nonetheless exhibit suspicious behavioral patterns in sensor-monitored activities. To leverage complementary signals from multiple sensing modalities, we design a cross-attention fusion mechanism that dynamically integrates metadata and behavioral features with textual representations, analogous to adaptive multi-sensor fusion in distributed monitoring systems.
The metadata features x_meta and behavioral features x_beh are first encoded through separate feed-forward networks into dense representations:

z_meta = ReLU(W_meta x_meta + b_meta),   z_beh = ReLU(W_beh x_beh + b_beh),

where W_meta, W_beh are weight matrices and b_meta, b_beh are bias vectors. These encoded features are concatenated and projected into the same dimensional space as the LLM’s hidden states:

z = W_p [z_meta ; z_beh] + b_p.
To enable the textual representation to selectively attend to relevant multimodal features, we employ a cross-attention mechanism. Let H ∈ R^{L×d} denote the sequence of hidden states from the final layer of the LoRA-adapted LLM, where L is the sequence length and d is the hidden dimension. We compute cross-attention between the text and the multimodal features z:

H̃ = H + softmax( (H W_Q)(z W_K)^T / √d ) (z W_V),

where W_Q, W_K, W_V are learned projection matrices for query, key, and value transformations, and the residual connection preserves the original textual information. This cross-attention mechanism allows the model to dynamically weight the importance of multi-sensor features based on the textual context, similar to adaptive weighting in sensor fusion algorithms. For instance, when the text appears linguistically normal, the model may rely more heavily on behavioral sensor signals such as review burstiness or rating deviation to make its decision.
The fused representation captures complementary information from all modalities, enabling the model to identify fraudulent reviews that may appear legitimate in one modality but exhibit suspicious patterns in others. This holistic view is particularly valuable for detecting sophisticated fraud campaigns where attackers carefully craft realistic-seeming text but inadvertently leave traces in their posting behaviors or metadata patterns.
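The fusion step can be sketched as follows. This is a single-head numpy sketch under simplifying assumptions; the deployed module uses 4 attention heads and learned projections inside the LoRA-adapted model.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(H, M, Wq, Wk, Wv):
    # H: (L, d) text hidden states (queries); M: (n, d) encoded multimodal features
    # (keys/values). Returns fused states with a residual connection to H.
    Q, K, V = H @ Wq, M @ Wk, M @ Wv
    attn = softmax(Q @ K.T / np.sqrt(H.shape[1]), axis=-1)   # (L, n) weights
    return H + attn @ V                                      # residual keeps text info
```

The residual form means the multimodal pathway can only add information on top of the textual representation, which matches the design goal of preserving the original textual signal.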
4.5. Chain-of-Thought Reasoning for Explainability
A critical limitation of existing fake review detection systems is their opacity—they provide binary classifications without revealing the reasoning behind their decisions. This lack of transparency hinders human oversight, reduces user trust, and complicates model debugging. To address this challenge, we implement a Chain-of-Thought reasoning module that generates structured natural language explanations alongside predictions.
The CoT module extends the LLM’s output to produce not only a classification label but also a step-by-step reasoning process. We achieve this by appending a specialized generation prompt to the knowledge-enhanced input that explicitly instructs the model to articulate its reasoning:
This structured prompt encourages the model to decompose its analysis into interpretable components. During generation, the model produces a sequence of tokens that first describes observed linguistic characteristics, then evaluates sentiment-rating alignment, subsequently considers behavioral features from the multimodal input, and finally synthesizes these observations into a classification decision.
The generated explanation e follows a template structure but with content dynamically determined by the review characteristics:

e = LLM^LoRA(prompt, z),

where the superscript LoRA indicates the adapted model with trainable low-rank matrices. The explanation generation is conditioned on both the textual input and the encoded multimodal features z, ensuring that behavioral signals are reflected in the reasoning process.
To enhance the faithfulness of generated explanations, ensuring they accurately reflect the model’s actual decision process rather than post hoc rationalizations, we employ a two-stage training approach. In the first stage, the model is trained solely on classification with standard cross-entropy loss. In the second stage, we introduce an explanation supervision loss that encourages the model to generate explanations consistent with human-annotated rationales in a subset of the training data:

L_exp = − Σ_j log p(e_j | e_<j, input),

where e_j denotes the j-th token in the explanation and e_<j represents all preceding tokens. This supervision signal guides the model toward generating explanations that align with expert human reasoning patterns.
The final explanation highlights specific evidence from multiple modalities. For example, it might note: “The review contains excessive superlatives (‘absolutely perfect’) without specific product details (linguistic anomaly). The 5-star rating contradicts negative phrases in the text (sentiment inconsistency). The reviewer’s account was created 2 days ago with 15 reviews posted within 24 h (behavioral red flag). Classification: Fake review.” This multifaceted reasoning provides actionable insights for human moderators and enables users to understand and potentially contest the automated decision.
4.6. Knowledge Distillation for Enhanced Performance
To further boost the detection accuracy of our relatively compact LLaMA-3-8B student model, we employ knowledge distillation from a larger teacher model (LLaMA-3-70B). The teacher model is first fine-tuned on the fake review detection task using the same LoRA adaptation strategy, albeit with greater capacity to learn nuanced patterns. The student model then learns to mimic both the teacher’s predictions and its internal representations.
The distillation process optimizes a combined objective that balances three components: task-specific classification loss, soft label distillation loss, and feature-level distillation loss. Let p^T and p^S denote the predicted probability distributions from the teacher and student models, respectively. The soft label distillation loss employs Kullback–Leibler divergence with temperature scaling:

L_KD = T² · KL( softmax(z^T / T) ‖ softmax(z^S / T) ),

where z^T and z^S are the teacher and student logits and T is the temperature parameter that softens the probability distributions, allowing the student to learn from the teacher’s uncertainty. Higher temperatures produce softer distributions that reveal more information about the teacher’s confidence across different classes.
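The temperature-scaled loss can be written compactly as below. This is a numpy sketch for a single example; the T² factor, standard in distillation to keep gradient magnitudes comparable across temperatures, is an assumption about the exact scaling used.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D logit vector
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_label_kd_loss(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions
    p = softmax(np.asarray(teacher_logits, dtype=float) / T)
    q = softmax(np.asarray(student_logits, dtype=float) / T)
    return float(T * T * np.sum(p * np.log(p / q)))
```

When the student matches the teacher exactly the loss is zero; larger temperatures flatten both distributions, exposing the teacher's relative confidence between the fake and genuine classes.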
Additionally, we minimize the distance between the student’s and teacher’s intermediate representations to transfer the teacher’s learned feature space:

L_feat = ‖ W_proj h^S − h^T ‖²,

where h^S and h^T are hidden representations from the student and teacher models, and W_proj is a learned linear transformation that accounts for potential dimensional differences.
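Given the stated role of the learned transformation (mapping the student's hidden size to the teacher's), the feature-level term reduces to a projected squared distance; a sketch with our own illustrative names (feature_distill_loss, W_proj):

```python
import numpy as np

def feature_distill_loss(h_student, h_teacher, W_proj):
    # squared L2 distance between projected student features and teacher features;
    # W_proj maps the student's hidden dimension to the teacher's hidden dimension
    diff = W_proj @ h_student - h_teacher
    return float(diff @ diff)
```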
The complete training objective combines these components with the standard classification loss:

L_total = λ₁ L_cls + λ₂ L_KD + λ₃ L_feat,

where L_cls is the cross-entropy loss on ground-truth labels, and λ₁, λ₂, λ₃ are weighting hyperparameters that control the relative importance of each objective component.
This distillation strategy enables the student model to inherit the teacher’s superior discrimination capability while maintaining the computational efficiency required for real-time inference in production environments. The soft labels from the teacher provide richer training signals than hard binary labels alone, particularly for ambiguous cases near the decision boundary.
4.7. Overall Training and Inference Pipeline
Algorithm 1 summarizes the complete training and inference procedure of the KE-MLLM framework. The training process consists of three sequential stages: teacher model pre-training with LoRA adaptation, student model distillation with multi-objective optimization, and explanation fine-tuning with human-annotated rationales. During inference, the knowledge retrieval and multimodal fusion operate in parallel to minimize latency, with the final prediction and explanation generated in a single forward pass through the LoRA-adapted student model.
Algorithm 1 KE-MLLM Training and Inference

1: Input: Training data, knowledge base, pre-trained LLMs
2: Output: Trained model parameters, prediction, explanation e
3:
4: // Stage 1: Teacher Model Training
5: for each batch in the training data do
6:     Fine-tune LLaMA-3-70B with LoRA on the classification task
7:     Update the teacher parameters via the classification loss
8: end for
9:
10: // Stage 2: Student Model Distillation
11: for each batch in the training data do
12:     Retrieve knowledge items for each review
13:     Construct knowledge-enhanced prompts
14:     Encode multimodal features
15:     Compute teacher predictions
16:     Compute student predictions
17:     Fuse features via cross-attention
18:     Update the student via the combined loss (Equation (16))
19: end for
20:
21: // Stage 3: Explanation Fine-tuning
22: for each annotated sample do
23:     Generate the explanation
24:     Update via the explanation loss (Equation (13))
25: end for
26:
27: // Inference
28: Input: new review
29: Retrieve knowledge and construct the prompt
30: Encode and fuse multimodal features
31: Generate the prediction
32: Generate the explanation e
33: Return the prediction and explanation e
The modular design of KE-MLLM enables flexible deployment scenarios. The knowledge base can be dynamically updated as new fraud patterns emerge without retraining the entire model. The LoRA adapters can be efficiently swapped to specialize the model for different product categories or platforms while sharing the same frozen base model. The multimodal fusion module can incorporate additional feature types as they become available, such as image content in reviews with photos or network features from social graphs.
5. Experiments
In this section, we present comprehensive experimental evaluations of the proposed KE-MLLM framework. We begin by describing the datasets and evaluation metrics, followed by detailed comparisons with state-of-the-art baselines. We then conduct ablation studies to analyze the contribution of each component and present qualitative analysis of the generated explanations through human evaluation.
5.1. Datasets
We conduct experiments on two widely-adopted benchmark datasets for fake review detection, both available through the Deep Graph Library (DGL) Fraud Dataset repository [17].
YelpChi Dataset: This dataset contains restaurant reviews from the Yelp platform in the Chicago area, comprising 67,395 reviews from 38,063 users on 14,082 restaurants. Each review is labeled as genuine (YR) or fake (YF) based on Yelp’s proprietary filtering algorithm, which has been validated through manual inspection studies. The dataset includes rich metadata such as user registration date, review count, average rating, and temporal posting patterns. Importantly, it also contains social network information capturing friendship relations among users, though our primary focus is on textual and behavioral features. The class distribution is imbalanced, with approximately 13.2% of reviews labeled as fake, reflecting real-world fraud rates.
Amazon Reviews Dataset: This dataset aggregates product reviews from multiple Amazon categories, including Electronics, Home and Kitchen, and Books. It contains 11,944 reviews from 7818 users on 4194 products. Reviews are labeled based on a combination of Amazon’s spam detection system and crowd-sourced annotations from the Amazon Mechanical Turk platform. The dataset exhibits different characteristics compared with YelpChi, with longer average review length (156 tokens vs. 89 tokens) and more diverse product categories. The fake review rate is approximately 19.7%, slightly higher than YelpChi, potentially reflecting different fraud patterns across e-commerce versus local business reviews.
For both datasets, we extract multi-sensor features including metadata (rating, account age, review count, review length) and behavioral sensing signals (review burstiness computed as coefficient of variation of inter-review intervals, rating deviation from product average, account activity patterns) that are continuously monitored as time-series sensor data. We randomly split each dataset into training (70%), validation (15%), and test (15%) sets, ensuring stratified sampling to maintain class distribution across splits. All reported results are averaged over three independent runs with different random seeds to account for training variability.
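For instance, the review-burstiness feature described above can be computed directly from a user's posting timestamps; a sketch (timestamps in any consistent unit, e.g., hours):

```python
import numpy as np

def review_burstiness(timestamps):
    # coefficient of variation (std / mean) of inter-review intervals:
    # near 0 for evenly spaced reviews, large for bursty posting patterns
    gaps = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    if len(gaps) == 0 or gaps.mean() == 0:
        return 0.0
    return float(gaps.std() / gaps.mean())
```

A user posting at a steady pace yields a value near zero, whereas a burst of reviews followed by silence, a pattern associated with paid review campaigns, pushes the value well above one.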
5.2. Baseline Methods
We compare KE-MLLM against a comprehensive set of baseline methods representing different paradigms in fake review detection:
Traditional Machine Learning Methods: We implement SVM with TF-IDF features [4] and Random Forest with hand-crafted linguistic and behavioral features [5] as classical baselines. These methods serve as lower bounds on performance and demonstrate the value of deep learning approaches.
Neural Network Baselines: We include BiLSTM with an attention mechanism [6], which captures sequential dependencies in review text, and a CNN-BiLSTM hybrid architecture that combines local pattern extraction with sequential modeling. These represent strong neural baselines from the pre-transformer era.
Pre-trained Language Model Baselines: We compare against BERT-base fine-tuned on review classification [24], RoBERTa-base with adversarial training [25], and DistilBERT as a lightweight alternative. These methods represent the current mainstream approach of fine-tuning pre-trained transformers.
Graph-Based Methods: We include CARE-GNN [17], a state-of-the-art graph neural network designed specifically for fraud detection that leverages user-product-review heterogeneous graphs. This provides a comparison with methods that explicitly model relational structures.
LLM-Based Methods: We implement several LLM baselines including: (1) GPT-3.5-turbo with few-shot prompting (5 examples), (2) LLaMA-3-8B with full fine-tuning, (3) LLaMA-3-8B with standard LoRA (without knowledge enhancement or multimodal fusion), and (4) LLaMA-3-8B with prompt-tuning. These comparisons isolate the contributions of our proposed components.
All baselines are implemented using their official codebases or reproduced following the original papers. Hyperparameters are tuned on the validation set using grid search to ensure fair comparison. For neural and LLM-based methods, we use the same computational budget in terms of training iterations and hardware resources.
5.3. Evaluation Metrics
Given the class imbalance in fake review detection datasets, we employ multiple evaluation metrics that provide complementary perspectives on model performance:
Accuracy: The proportion of correct predictions among all samples, providing an overall performance measure but potentially misleading under class imbalance.
Precision, Recall, and F1-Score: For the positive class (fake reviews), precision measures the proportion of predicted fake reviews that are truly fake, recall measures the proportion of actual fake reviews correctly identified, and F1-score provides their harmonic mean. Given the critical importance of catching fake reviews while minimizing false accusations, we consider F1-score as our primary metric.
AUC-ROC (Area Under Receiver Operating Characteristic Curve): This metric evaluates the model’s ability to discriminate between classes across all classification thresholds, providing a threshold-independent performance measure.
AUC-PR (Area Under Precision-Recall Curve): Under class imbalance, AUC-PR is often more informative than AUC-ROC as it focuses on performance on the minority (positive) class.
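The threshold-based metrics above reduce to simple functions of the confusion counts; for reference (treating fake as the positive class):

```python
def precision_recall_f1(tp, fp, fn):
    # precision, recall, and F1 for the positive (fake) class,
    # computed from true-positive, false-positive, and false-negative counts
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```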
For explainability evaluation, we additionally compute:
Explanation Consistency: The proportion of generated explanations that align with expert human annotations, measured through semantic similarity and overlap of highlighted evidence.
Faithfulness: The degree to which explanations reflect the model’s actual decision-making process, assessed through perturbation tests where evidence mentioned in explanations is removed and prediction change is measured.
5.4. Implementation Details
Our KE-MLLM framework is implemented in PyTorch with the Hugging Face Transformers library. We use LLaMA-3-8B as the base model with LoRA rank applied to query and value projection matrices in all 32 transformer layers. The LoRA scaling factor is set to 32. The knowledge base contains 150 curated fraud indicators across linguistic, sentiment, and behavioral categories, with top- items retrieved per review. Multimodal feature encoders consist of 2-layer feed-forward networks with 256 hidden dimensions and ReLU activation. The cross-attention fusion module uses 4 attention heads with dimension 512.
Training employs the AdamW optimizer with separate learning rates for the LoRA parameters and the other components. We use linear warmup for 500 steps followed by cosine decay. The batch size is set to 16 with gradient accumulation over 4 steps, yielding an effective batch size of 64. The loss weights balance the classification, explanation-generation, and knowledge distillation objectives. Training converges within 10 epochs on both datasets. For the teacher model (LLaMA-3-70B), we use the same LoRA configuration but with a larger rank for increased capacity.
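The learning-rate schedule (linear warmup for 500 steps followed by cosine decay) can be written as a multiplicative factor suitable for, e.g., torch.optim.lr_scheduler.LambdaLR. The warmup length follows the text; the total-step count below is illustrative:

```python
import math

WARMUP_STEPS = 500  # from the text

def lr_factor(step, total_steps, warmup=WARMUP_STEPS):
    """Linear warmup to 1.0, then cosine decay to 0.0."""
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Illustrative: a 10,000-step run.
factors = [lr_factor(s, 10_000) for s in (0, 250, 500, 10_000)]
```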
All experiments are conducted on NVIDIA A100 GPUs with 40GB memory. Training the complete KE-MLLM framework (including teacher model) takes approximately 8 h on YelpChi and 3 h on Amazon Reviews. Inference time per review is 0.15 s on average, making the system suitable for real-time deployment. To assess performance stability, all experiments are repeated five times with different random seeds, and the reported results correspond to the mean performance across runs.
5.5. Main Results
Table 2 presents the performance comparison between KE-MLLM and the baseline methods on both datasets. Our proposed framework achieves substantial improvements over all baselines across multiple metrics. On the YelpChi dataset, KE-MLLM achieves a 94.3% F1-score, outperforming the strongest baseline (standard LoRA fine-tuning) by 10.0 percentage points. The improvement is particularly pronounced in recall (92.1% vs. 83.1%), indicating that our framework identifies a larger proportion of actual fake reviews while maintaining high precision. This matters for practical deployment, where missing fraudulent content reduces moderation effectiveness. The accuracy improvement of 2.9% over the best baseline demonstrates robust overall performance despite the class imbalance. Similarly, on the Amazon Reviews dataset, KE-MLLM achieves a 92.8% F1-score, a 6.2 percentage point improvement over standard LoRA. The consistent gains across both datasets suggest that the proposed framework generalizes across review domains. In addition, to assess the statistical reliability of the improvements, we conduct paired t-tests between KE-MLLM and the strongest baseline across multiple runs; the performance gains achieved by KE-MLLM are statistically significant (p < 0.05).
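The significance test can be sketched as a paired t-test over per-run F1-scores. The scores below are illustrative, not the actual per-run results; with n = 5 runs (df = 4), the two-tailed 5% critical value is approximately 2.776:

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic for a paired t-test on per-run metric scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Illustrative per-run F1-scores: KE-MLLM vs. the strongest baseline.
t = paired_t_statistic([94.1, 94.5, 94.3, 94.2, 94.4],
                       [84.2, 84.5, 84.1, 84.4, 84.3])
# |t| is then compared against the critical value t_{0.025, df=4} ≈ 2.776.
```

In practice, scipy.stats.ttest_rel would return both the statistic and the exact p-value.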
Table 3 presents AUC-ROC and AUC-PR scores, which further confirm the effectiveness of KE-MLLM across different operating points. The AUC-ROC scores of 96.7% (YelpChi) and 94.9% (Amazon) indicate excellent discrimination capability across all threshold settings. More importantly, the AUC-PR improvements of 4.5 and 4.6 percentage points over the strongest baseline demonstrate that KE-MLLM maintains high precision even as recall increases, which is particularly valuable given the class imbalance.
Several key observations emerge from these results. First, graph-based methods such as CARE-GNN perform well but are limited by the availability and quality of network information, which may be sparse or noisy in real-world scenarios. Second, standard LoRA adaptation of LLaMA-3-8B already outperforms traditional pre-trained models such as BERT and RoBERTa, validating the value of larger language models. Third, few-shot prompting of GPT-3.5 underperforms fine-tuned models, suggesting that domain adaptation through parameter updates is essential for this specialized task. Finally, the substantial gains of KE-MLLM over standard LoRA highlight the importance of our knowledge enhancement, multimodal fusion, and distillation strategies. We also conduct a detailed error analysis, presented in Section 5.11, to further understand the limitations of the proposed framework.
5.6. Ablation Studies
To understand the contribution of each component in KE-MLLM, we conduct comprehensive ablation studies by systematically removing individual modules.
Table 4 presents the results.
Knowledge Enhancement: Removing the knowledge-enhanced prompting strategy results in a 3.2 percentage point drop in F1-score (from 94.3% to 91.1%). This significant degradation confirms that domain-specific knowledge about fraud patterns substantially improves the model’s discrimination capability. Without this guidance, the model must rely solely on patterns learned from training data, which may be insufficient for capturing sophisticated fraud tactics.
Multimodal Fusion: Eliminating behavioral and metadata features reduces F1-score by 2.4 points (to 91.9%). This demonstrates that textual analysis alone, while powerful, misses important fraud signals that manifest in posting behaviors and account characteristics. The relatively smaller impact compared with knowledge enhancement suggests that the LLM’s textual understanding is quite strong, but multimodal signals still provide valuable complementary information.
Knowledge Distillation: Removing the teacher–student distillation framework decreases F1-score by 1.8 points (to 92.5%). This indicates that the larger teacher model successfully captures nuanced patterns that benefit the student model. The moderate impact suggests that the 8B parameter student has sufficient capacity for this task but benefits from the teacher’s guidance, particularly for ambiguous cases near the decision boundary.
CoT Reasoning: Interestingly, removing the explicit CoT generation has minimal impact on quantitative metrics (0.5 point drop to 93.8%), suggesting that the classification performance is primarily driven by other components. However, as shown in
Section 5.8, CoT reasoning is crucial for generating interpretable explanations, which is a primary contribution of our work. The small quantitative impact may actually be beneficial, as it indicates that the explanation generation does not compromise classification accuracy.
The cumulative effect of removing all enhancements (reverting to the LoRA baseline) results in a dramatic 10.0 point F1-score drop, confirming that the synergistic combination of our proposed components drives the superior performance.
5.7. Hyperparameter Sensitivity Analysis
We analyze the sensitivity of KE-MLLM to key hyperparameters, focusing on LoRA rank
r and knowledge retrieval size
k.
Figure 2 shows the results.
LoRA Rank: We vary the rank r while keeping other parameters fixed. Performance improves as the rank increases from 4 to 16, then plateaus at higher ranks. Very low ranks constrain the model's expressiveness, while very high ranks approach full fine-tuning and risk overfitting. The optimal rank of r = 16 balances adaptation capability with parameter efficiency, consistent with findings in other LoRA applications [12].
Knowledge Retrieval Size: We test different numbers of retrieved knowledge items k. Performance peaks at an intermediate value of k, with degradation at both extremes. Too few items provide insufficient guidance, while too many introduce noise and dilute the relevance of the retrieved knowledge. The optimal value of k provides diverse fraud indicators without overwhelming the context window.
These analyses indicate that KE-MLLM remains relatively stable across reasonable hyperparameter ranges, with clear empirical optima on the evaluated datasets.
5.8. Qualitative Analysis and Human Evaluation
Beyond quantitative metrics, we conduct qualitative analysis of the generated explanations through human evaluation studies. We randomly sample 200 reviews (100 genuine, 100 fake) from the test set and collect explanations from three methods: KE-MLLM, attention-based visualization from BERT, and LIME explanations from RoBERTa.
Three domain experts with prior experience in fraud detection, online review analysis, and natural language processing independently evaluated each explanation. Before the formal evaluation, the evaluators were provided with a short guideline describing the three scoring dimensions and representative examples of high- and low-quality explanations: (1)
Completeness: Does the explanation cover all relevant evidence? (2)
Accuracy: Is the highlighted evidence truly indicative of the predicted label? (3)
Clarity: Is the explanation understandable to non-experts? For each dimension, a score of 1 indicates very poor quality, 2 indicates limited quality, 3 indicates moderate quality, 4 indicates good quality, and 5 indicates excellent quality. Specifically, for
Completeness, higher scores indicate that the explanation covers a larger proportion of the relevant evidence; for
Accuracy, higher scores indicate that the cited evidence is more strongly aligned with the predicted label; for
Clarity, higher scores indicate that the explanation is more understandable and actionable for non-expert users.
Table 5 presents the aggregated results.
KE-MLLM substantially outperforms baseline explanation methods across all dimensions, achieving near-expert level ratings (4.6–4.8 out of 5). Attention-based visualizations receive the lowest scores, as they merely highlight words without explaining why those words are indicative of fraud. LIME provides more structured explanations through feature importance but lacks the natural language reasoning that domain experts and end-users prefer. To assess annotation consistency, we computed Krippendorff’s alpha across the three evaluators. The resulting agreement coefficients were 0.81 for Completeness, 0.84 for Accuracy, and 0.79 for Clarity, indicating substantial agreement.
To quantitatively measure explanation faithfulness, we conduct a controlled perturbation analysis in which evidence explicitly cited in the generated explanation is systematically removed from the input while keeping all other content unchanged. If an explanation is faithful, removing the cited evidence should significantly reduce the model's confidence in its original prediction. We define the faithfulness score as
\text{Faithfulness} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\left| f(x_i) - f(\tilde{x}_i) \right| > \tau \right],
where \tilde{x}_i denotes the input with explanation-cited evidence removed, f(\cdot) is the model's predicted probability for the original label, and \tau is a threshold for significant prediction change. KE-MLLM achieves a faithfulness score of 89.5%, compared with 62.3% for attention-based methods and 71.8% for LIME, confirming that our generated explanations genuinely reflect the model's decision-making process.
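The faithfulness score reduces to a simple indicator average over prediction-confidence changes; a minimal sketch with illustrative probabilities and threshold:

```python
def faithfulness_score(p_original, p_perturbed, tau):
    """Fraction of samples whose predicted probability for the original label
    changes by more than tau after explanation-cited evidence is removed."""
    changed = [abs(po - pp) > tau for po, pp in zip(p_original, p_perturbed)]
    return sum(changed) / len(changed)

# Illustrative: model confidence before vs. after removing cited evidence.
score = faithfulness_score([0.95, 0.90, 0.85, 0.60],
                           [0.40, 0.35, 0.80, 0.55],
                           tau=0.2)
```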
Figure 3 presents representative examples of KE-MLLM explanations for both fake and genuine reviews, demonstrating the system’s ability to articulate multifaceted reasoning.
These examples illustrate how KE-MLLM synthesizes evidence across multiple modalities into coherent natural language explanations that are accessible to both technical and non-technical stakeholders.
5.9. Computational Efficiency Analysis
A critical consideration for real-world deployment in edge sensing devices is computational efficiency.
Table 6 compares training and inference costs across different methods.
Despite incorporating multiple sophisticated components, KE-MLLM requires only 42 M trainable parameters (0.5% of the full model) and completes training in 8 h, which is dramatically more efficient than full fine-tuning (36 h). During inference, the model requires approximately 15 GB of GPU memory, comparable to other LoRA-based LLM adaptations and significantly lower than full fine-tuning of the base model. The inference latency of 150 ms per review is reasonable for many near-real-time moderation scenarios. The additional 15 ms overhead compared with standard LoRA comes primarily from knowledge retrieval and multimodal encoding, which could be further optimized through caching and batching strategies. We also evaluated CPU-based inference on a 16-core server processor; the average latency is approximately 880 ms per review, which remains practical for batch-based moderation workflows. Moreover, although the current implementation targets GPU-based inference, the parameter-efficient design of KE-MLLM may facilitate deployment on resource-constrained servers, though system-level optimization would still be required in practice. In particular, LoRA adaptation reduces the number of trainable parameters while maintaining strong performance, enabling further optimization through model quantization or distillation. With 8-bit or 4-bit quantization, the memory footprint could be reduced further, making the framework suitable for deployment on edge AI accelerators or resource-constrained moderation systems.
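For the quantized-deployment option mentioned above, 4-bit loading of the base model can be configured with the transformers BitsAndBytesConfig. This is a sketch of one standard setup, not a configuration evaluated in the paper; the NF4 quantization type and compute dtype are illustrative choices:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: load the base model with 4-bit NF4 quantization to shrink the
# inference-time memory footprint (requires the bitsandbytes package).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```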
5.10. Cross-Domain Generalization
To assess the generalizability of KE-MLLM, we conduct cross-domain experiments where the model is trained on one dataset and evaluated on the other without additional fine-tuning.
Table 7 presents the results. KE-MLLM demonstrates superior cross-domain performance, maintaining 79.8% and 82.6% F1-scores when transferring between domains. This represents approximately 15% degradation compared with in-domain performance, substantially better than the 25–30% degradation observed in baseline methods. The strong generalization capability stems from the knowledge-enhanced prompting strategy, which encodes domain-agnostic fraud principles that transfer across different review contexts.
5.11. Error Analysis
To understand the limitations of KE-MLLM, we analyze the 5.7% of test cases where the model makes incorrect predictions on YelpChi. The errors fall into three primary categories:
Sophisticated Camouflaged Reviews (48%): These are fake reviews crafted with genuine-seeming specific details and natural language, posted by aged accounts with normal activity patterns. The model’s textual and behavioral signals fail to detect the subtle manipulation, highlighting the ongoing arms race between fraud detection and increasingly sophisticated fraudsters.
Ambiguous Genuine Reviews (31%): Some authentic reviews are misclassified as fake due to unusual characteristics such as extremely positive language from enthusiastic customers or burst posting from legitimate users during vacation trips. These false positives suggest that the model may be overly sensitive to certain patterns.
Data Quality Issues (21%): A portion of errors trace back to noisy labels in the dataset, where reviews labeled by the filtering algorithm may not reflect ground truth. Manual inspection by our expert evaluators suggests that in approximately 12% of error cases, the model’s prediction may actually be more accurate than the provided label.
This error analysis reveals that while KE-MLLM achieves strong overall performance, handling highly sophisticated fraud and reducing false positives on unusual but genuine content remain open challenges for future work.
5.12. Further Discussions
5.12.1. Potential Label Bias and Generalizability
It is important to note that the ground-truth labels in datasets such as YelpChi and Amazon Reviews are derived from platform-level filtering mechanisms rather than fully verified human annotations. As a result, some labels may reflect heuristic detection rules or platform moderation policies, which may introduce bias into the dataset. This limitation could affect the generalizability of fake review detection models trained on these datasets. In particular, models may learn patterns that correlate with platform-specific detection heuristics rather than universally applicable fraud signals. Despite this limitation, these datasets remain widely used benchmarks in the literature, allowing fair comparison with prior work. Future research could further improve robustness by incorporating manually verified datasets or cross-platform evaluation.
5.12.2. Scope of Cross-Domain Evaluation
We note that the current empirical validation is conducted on two benchmark review datasets, namely YelpChi and Amazon Reviews. While these datasets are widely adopted in prior fake review detection research, they do not cover the full diversity of review platforms, product categories, or review-writing styles encountered in real-world environments. Therefore, the reported cross-domain results should be interpreted as preliminary evidence rather than a comprehensive validation of robustness. Future work should extend the evaluation to additional review platforms, broader review genres, and more recent LLM baselines and open models.
5.12.3. Dataset Scope and Emerging LLM-Generated Reviews
It is important to note that the YelpChi and Amazon Reviews datasets were collected before the widespread availability of modern large language models. As a result, they primarily contain traditional forms of fraudulent reviews rather than sophisticated LLM-generated content. We nevertheless use these datasets because they are widely adopted benchmarks in the fake review detection literature, enabling direct comparison with prior methods. Evaluating detection approaches against modern LLM-generated fake reviews remains an important direction for future work.
6. Conclusions and Future Work
In this paper, we presented KE-MLLM, a promising integrated framework for fake review detection that combines knowledge-enhanced prompting, parameter-efficient LoRA adaptation, multimodal feature fusion, and chain-of-thought explainability. Experimental results on the YelpChi and Amazon Reviews benchmarks show that the proposed approach achieves strong detection performance while also producing human-interpretable explanations. In particular, KE-MLLM attains F1-scores of 94.3% and 92.8%, respectively, and the generated explanations show high agreement with expert assessment. These findings suggest that integrating LLM-based reasoning with behavioral and contextual signals is a useful direction for improving both prediction quality and interpretability in fake review detection. At the same time, the current results are based on benchmark datasets, and further validation is required before drawing conclusions about real-world deployment readiness or broad cross-platform applicability.
Despite these encouraging results, several limitations warrant further investigation. First, our evaluation on sophisticated camouflaged reviews reveals that a substantial portion of errors arise from advanced fraud tactics in which attackers produce linguistically plausible content while maintaining relatively normal behavioral patterns. This suggests that robustness to evolving and adaptive fraud strategies remains an open challenge. Second, the false positive rate on unusual but genuine reviews indicates that the model may still be overly sensitive to certain linguistic or behavioral cues, highlighting the need for improved uncertainty estimation and more calibrated decision-making. Third, although the framework shows encouraging cross-domain transfer performance, a noticeable gap remains between in-domain and cross-domain performance, indicating that broader validation across platforms and review ecosystems is still necessary. Fourth, the current knowledge base relies in part on manually curated fraud indicators, which may limit scalability as fraud tactics evolve; future work could explore more automated and continuously updated knowledge construction strategies. Fifth, while the proposed framework improves explanation quality, further study is needed to examine fairness, robustness, and faithfulness under more diverse real-world conditions.
Overall, KE-MLLM should be viewed as a promising step toward more accurate and interpretable fake review detection rather than a definitive deployment-ready solution. Future work will focus on robustness against adversarial manipulation, fairness across different user populations and platforms, improved generalization under dataset shift, and broader real-world evaluation in practical online review environments.