1. Introduction
The exponential growth of e-commerce and online review platforms has fundamentally transformed consumer decision-making processes, with over 93% of consumers consulting online reviews before making purchase decisions [1]. However, this democratization of consumer opinions has been paralleled by an alarming rise in fraudulent review activities. Recent industry reports suggest that a substantial portion of online reviews may be fake or manipulated, with some studies estimating figures as high as 30–40% [2]. That said, estimates vary considerably across studies and platforms; under stricter definitions, roughly 4% of reviews are identified as clearly fake, while higher percentages are reported when broader forms of manipulation (e.g., incentivized or selectively filtered reviews) are considered. The prevalence also varies with the platform, product category, and statistical methodology used. Fake reviews are estimated to directly influence approximately $152 billion in global online spending by misleading consumers into purchasing lower-quality products or services [3]. Beyond this direct spending impact, fake reviews distort fair competition among businesses, reduce consumer welfare, and fundamentally erode consumer trust in digital commerce ecosystems.
Traditional approaches to fake review detection have primarily relied on shallow machine learning methods, including Support Vector Machines (SVMs) [4], Random Forests [5], and neural architectures such as Long Short-Term Memory (LSTM) networks [6] and BERT-based classifiers [7]. While these methods have demonstrated reasonable performance, they suffer from three critical limitations. First, they lack interpretability, providing binary classifications without explaining why a review is deemed fraudulent, which is crucial for regulatory compliance and user trust [8]. Second, they fail to effectively leverage the rich contextual knowledge and reasoning capabilities that have emerged from recent advances in large language models (LLMs) [9,10]. Third, traditional methods typically analyze text in isolation, neglecting multi-sensor behavioral signals such as temporal posting patterns, user metadata, and social network dynamics that often characterize fraudulent activities, similar to anomaly detection challenges in distributed sensor networks [11].
The advent of large language models has revolutionized natural language understanding tasks, with models such as GPT-3 [9], LLaMA [10], and their successors demonstrating unprecedented capabilities in contextual reasoning and knowledge integration. However, directly applying these models to fraud detection faces significant challenges: (1) computational cost: full fine-tuning of LLMs with billions of parameters is prohibitively expensive [12]; (2) domain adaptation: pre-trained LLMs lack specialized knowledge about deceptive patterns in review ecosystems; and (3) explainability gap: despite their reasoning capabilities, LLMs often produce opaque predictions without transparent decision pathways [13].
Recent advances in parameter-efficient fine-tuning, particularly Low-Rank Adaptation (LoRA) [12], offer a promising solution to computational constraints by freezing the majority of pre-trained parameters and introducing trainable low-rank decomposition matrices. Concurrently, Chain-of-Thought (CoT) prompting techniques [14] have demonstrated remarkable success in eliciting step-by-step reasoning from LLMs, enabling transparent decision-making processes. Furthermore, the integration of multimodal information, combining textual content with behavioral metadata, has shown substantial improvements in fraud detection across various domains [15,16].
Building upon these insights, we propose KE-MLLM (Knowledge-Enhanced Multimodal Large Language Model), an integrated framework for fake review detection that combines parameter-efficient LLM adaptation, knowledge-enhanced prompting, multimodal feature fusion, and explicit explainability mechanisms. Rather than introducing an entirely new model architecture, our contribution lies in integrating these complementary elements into a unified and explainable framework for fake review detection. To clarify the methodological novelty of our approach, KE-MLLM differs from standard LoRA fine-tuning, conventional multimodal fusion, and existing LLM-based fake review classifiers in three important aspects. First, rather than using LoRA solely as a lightweight adaptation technique, we combine LoRA-based domain adaptation with structured domain knowledge injection to guide the model toward deception-relevant linguistic and behavioral cues. Second, instead of applying a generic multimodal fusion pipeline, we design a task-specific fusion mechanism that integrates review text with behavioral and user-level signals for fake review detection. Third, unlike existing LLM-based classifiers that mainly focus on label prediction, our framework incorporates an explicit Chain-of-Thought reasoning module to produce interpretable decision rationales. The main contributions of this work are summarized as follows:
Knowledge-enhanced parameter-efficient LLM adaptation: We develop a LoRA-based adaptation scheme for LLaMA-3-8B guided by domain-specific knowledge about deceptive linguistic patterns and behavioral red flags. Compared with standard LoRA fine-tuning, this design is more explicitly tailored to the fake review detection task through knowledge-enhanced prompting.
Task-specific multimodal fusion for fake review detection: We design a multimodal fusion mechanism that integrates textual review content with auxiliary behavioral and user-level signals, including temporal activity patterns and review-history statistics. Our fusion strategy is designed to capture cross-modal inconsistencies that are often associated with deceptive reviews.
Explainable LLM-based classification via Chain-of-Thought reasoning: Beyond predicting whether a review is fake, our framework generates structured reasoning paths that expose the textual and behavioral evidence supporting each decision. This differentiates our framework from many LLM-based classifiers that mainly focus on prediction performance without explicitly generating interpretable rationales.
Comprehensive evaluation on benchmark fake review datasets: We validate the proposed framework on YelpChi and Amazon Reviews from the DGL Fraud Dataset [17], showing that KE-MLLM achieves consistently better results than the selected strong baselines in our experiments. Ablation studies further confirm the contribution of knowledge enhancement, multimodal fusion, and reasoning-based explanation to the overall performance. Furthermore, human evaluation involving domain experts shows that the generated explanations achieve 89.5% consistency with human annotations, substantially outperforming attention-based visualization methods [18].
To further clarify the novelty of KE-MLLM, Table 1 provides a structured comparison between our framework and representative prior approaches in fake review detection, including traditional classifiers, LLM-based methods, and multimodal models. Beyond the immediate fake review detection task, this work provides a useful case study of how parameter-efficient LLM adaptation, multimodal behavioral signals, and reasoning-based explanation can be combined within a single framework for trustworthy AI applications. The explainable nature of our framework makes it particularly suitable for regulatory environments where algorithmic decisions must be auditable and contestable [19]. Furthermore, our multi-sensor fusion approach provides a blueprint for integrating diverse information sources in fraud detection systems, applicable to domains beyond reviews, including social media bot detection [20], financial fraud identification [21], and misinformation classification [22], as well as anomaly detection in smart city sensor networks and IoT security monitoring.
The remainder of this paper is organized as follows:
Section 2 reviews related work in fake review detection, explainable AI, and parameter-efficient LLM fine-tuning.
Section 3 introduces necessary background on large language models and the problem formulation.
Section 4 presents our KE-MLLM framework in detail, including the LoRA adaptation strategy, knowledge enhancement mechanism, multimodal fusion architecture, and CoT explainability module.
Section 5 reports comprehensive experimental results on benchmark datasets, including comparisons with baselines, ablation studies, and human evaluation of explanations. Finally,
Section 6 concludes the paper with discussions on limitations and promising directions for future work.
4. Methodology
In this section, we present the KE-MLLM framework for explainable fake review detection. Our approach synergistically integrates knowledge-enhanced prompting, parameter-efficient LoRA adaptation, multimodal feature fusion, and chain-of-thought reasoning to achieve both high detection accuracy and transparent decision-making.
Figure 1 illustrates the overall architecture of our proposed system. From a methodological perspective, KE-MLLM can be viewed as a structured integration of three complementary components: parameter-efficient LLM adaptation, knowledge-guided reasoning, and multimodal behavioral feature fusion. The framework is designed to leverage the strengths of each component while mitigating their individual limitations in fake review detection tasks.
4.1. Framework Overview
The KE-MLLM framework operates through a four-stage pipeline that transforms raw review data into explainable predictions. Given a review instance , the system first constructs a knowledge-enhanced prompt that incorporates domain-specific fraud indicators alongside the review text. This enriched input is then processed by a LoRA-adapted LLaMA-3-8B model, which generates an initial classification decision. Simultaneously, a multimodal fusion module encodes metadata and behavioral features into dense representations, which are integrated with the textual embeddings through a cross-attention mechanism. Finally, a chain-of-thought reasoning module explicitly generates structured explanations that highlight linguistic anomalies, sentiment inconsistencies, and behavioral red flags. The entire framework is optimized end-to-end through a composite loss function that balances classification accuracy with explanation quality.
4.2. Knowledge-Enhanced Prompting Strategy
Traditional prompting approaches for LLMs typically concatenate task instructions with raw input text, relying on the model’s pre-trained knowledge to identify relevant patterns. However, fake review detection requires specialized understanding of deceptive linguistic strategies and fraud tactics that may not be sufficiently represented in general-purpose pre-training corpora. To address this limitation, we develop a knowledge base that encodes domain-specific expertise about fraudulent review characteristics.
The knowledge base consists of three components: linguistic fraud indicators , sentiment anomaly patterns , and behavioral red flags . Linguistic indicators capture patterns such as excessive use of superlatives (e.g., “absolutely perfect,” “totally amazing”), generic descriptions lacking specific product details, and abnormal syntactic structures. Sentiment anomalies include rating-text mismatches where highly positive textual sentiment accompanies low numerical ratings, or vice versa. Behavioral red flags encompass temporal patterns like review bursts within short time windows and suspicious account characteristics such as newly created accounts with immediate high-volume posting.
The fraud-indicator knowledge base was constructed through a hybrid process combining manual curation and automatic candidate extraction. First, an initial pool of candidate knowledge items was collected from prior studies on deceptive review detection and from commonly reported fraud patterns in online review platforms, covering linguistic cues, sentiment inconsistencies, and behavioral anomalies. Second, additional candidate indicators were identified from the training data by examining recurrent patterns associated with labeled fake reviews, including frequent lexical expressions, rating–text inconsistencies, and abnormal temporal or user-level behaviors. To improve reliability, the candidate knowledge items were manually reviewed and refined by multiple annotators with experience in fake review analysis. During this process, redundant, overly generic, or weakly informative items were removed, and semantically overlapping indicators were merged into concise knowledge statements. Only knowledge items that were consistently judged to be relevant to deceptive review behavior were retained in the final knowledge base. Representative examples of the retained knowledge items are provided in
Appendix A.
For each review instance, we construct a knowledge-enhanced prompt by retrieving the most relevant knowledge items from the knowledge base and incorporating them into the prompt structure. The retrieval process computes semantic similarity between the review text r and the knowledge items using a lightweight encoder. Specifically, we encode the review into an embedding vector e_r and compute cosine similarity with each knowledge item embedding e_{k_i}:

sim(r, k_i) = (e_r · e_{k_i}) / (‖e_r‖ ‖e_{k_i}‖).

The top-k knowledge items with the highest similarity scores are selected and formatted into natural language descriptions that guide the LLM’s attention toward potentially fraudulent patterns. For instance, if a review contains excessive superlatives, the retrieved knowledge might state: “Pay attention to the presence of extreme positive language that may indicate exaggerated claims typical of fake reviews.” This knowledge injection serves as a soft constraint that steers the model’s reasoning process without rigidly constraining its predictions.
The complete prompt structure follows a systematic format that first provides the task instruction, then presents the retrieved knowledge context, followed by the review metadata and text, and finally requests both a classification decision and an explanation. This structured approach ensures the LLM processes information in a logical sequence that mimics expert human analysis of potentially fraudulent reviews.
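To make the retrieval and prompt-assembly steps concrete, the following is a minimal sketch with toy embeddings; the encoder, prompt wording, and knowledge items are illustrative placeholders, not the exact ones used in KE-MLLM.

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two embedding vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_top_k(review_emb, knowledge_embs, knowledge_texts, k=2):
    # rank knowledge items by similarity to the review and keep the top-k
    scores = [cosine_sim(review_emb, e) for e in knowledge_embs]
    order = np.argsort(scores)[::-1][:k]
    return [knowledge_texts[i] for i in order]

def build_prompt(review_text, metadata, knowledge_items):
    # instruction -> retrieved knowledge -> metadata -> review -> request,
    # mirroring the structured prompt order described in the text
    knowledge_block = "\n".join(f"- {item}" for item in knowledge_items)
    return (
        "Task: decide whether the review below is FAKE or GENUINE.\n"
        f"Relevant fraud indicators:\n{knowledge_block}\n"
        f"Metadata: {metadata}\n"
        f"Review: {review_text}\n"
        "Provide a label and a step-by-step explanation."
    )
```

In practice, the knowledge item embeddings can be precomputed once, so retrieval per review reduces to a single similarity scan.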
4.3. LoRA-Based Model Adaptation
While the knowledge-enhanced prompting provides task-specific guidance, adapting the base LLM to the fake review detection domain requires learning from labeled training data. Full fine-tuning of the entire LLaMA-3-8B model with approximately 8 billion parameters would be computationally expensive and prone to overfitting on modestly-sized fraud detection datasets. We therefore employ Low-Rank Adaptation to efficiently adapt the model while preserving its general language understanding capabilities.
We apply LoRA to the query and value projection matrices in the multi-head attention layers, as these components have been shown to be most critical for task adaptation [12]. For each attention head in layer ℓ, the original query projection W_q^(ℓ) is augmented with low-rank matrices:

W_q^(ℓ) ← W_q^(ℓ) + (α/r) B^(ℓ) A^(ℓ),

where W_q^(ℓ) remains frozen and B^(ℓ) ∈ R^{d×r}, A^(ℓ) ∈ R^{r×d} are trainable with rank r ≪ d. The same decomposition is applied to the value projection matrices W_v^(ℓ). During forward propagation, the hidden state h^(ℓ) at layer ℓ is transformed as:

q^(ℓ) = W_q^(ℓ) h^(ℓ) + (α/r) B^(ℓ) A^(ℓ) h^(ℓ).

The scaling factor α/r ensures that the contribution of the low-rank adaptation is appropriately balanced relative to the frozen pre-trained weights. This formulation allows the model to learn task-specific attention patterns while maintaining the broad language understanding encoded in the pre-trained parameters.
The choice of rank r represents a trade-off between expressiveness and efficiency. Lower ranks reduce trainable parameters and training time but may limit the model’s capacity to capture complex task-specific patterns. Through preliminary experiments, we observe that intermediate rank values provide optimal balance, enabling effective adaptation without excessive computational overhead. The LoRA adaptation focuses the model’s attention on fraud-indicative linguistic patterns while avoiding the catastrophic forgetting that can occur with full fine-tuning on small specialized datasets.
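For concreteness, the low-rank update can be sketched in a few lines. This is a toy numpy sketch rather than the actual LLaMA-3-8B implementation; the dimensions are illustrative, and α = 32 follows the configuration reported in Section 5.4.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 32                    # hidden size, LoRA rank, scaling factor

W0 = rng.normal(size=(d, d))              # frozen pre-trained projection
A = rng.normal(scale=0.01, size=(r, d))   # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero-initialized

def lora_forward(h):
    # adapted projection: W0 h + (alpha / r) * B A h; only A and B are trained
    return W0 @ h + (alpha / r) * (B @ (A @ h))

h = rng.normal(size=d)
# with B zero-initialized, the adapted layer reproduces the frozen layer exactly,
# so training starts from the pre-trained model's behavior
assert np.allclose(lora_forward(h), W0 @ h)
```

Only d×r + r×d parameters per adapted matrix are trained, versus d×d for full fine-tuning, which is the source of the efficiency gain discussed above.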
4.4. Multi-Sensor Feature Fusion
Textual content alone provides insufficient information for robust fake review detection, as sophisticated fraudsters can craft linguistically plausible reviews that nonetheless exhibit suspicious behavioral patterns in sensor-monitored activities. To leverage complementary signals from multiple sensing modalities, we design a cross-attention fusion mechanism that dynamically integrates metadata and behavioral features with textual representations, analogous to adaptive multi-sensor fusion in distributed monitoring systems.
The metadata features x_meta and behavioral features x_beh are first encoded through separate feed-forward networks into dense representations:

z_meta = ReLU(W_meta x_meta + b_meta),   z_beh = ReLU(W_beh x_beh + b_beh),

where W_meta, W_beh are weight matrices and b_meta, b_beh are bias vectors. These encoded features are concatenated and projected into the same dimensional space as the LLM’s hidden states:

z = W_p [z_meta ; z_beh] + b_p.
To enable the textual representation to selectively attend to relevant multimodal features, we employ a cross-attention mechanism. Let H ∈ R^{L×d} denote the sequence of hidden states from the final layer of the LoRA-adapted LLM, where L is the sequence length and d is the hidden dimension. We compute cross-attention between the text and the multimodal features z:

H̃ = H + softmax( (H W_Q)(z W_K)^T / √d ) (z W_V),

where W_Q, W_K, W_V are learned projection matrices for query, key, and value transformations, and the residual connection preserves the original textual information. This cross-attention mechanism allows the model to dynamically weight the importance of multi-sensor features based on the textual context, similar to adaptive weighting in sensor fusion algorithms. For instance, when the text appears linguistically normal, the model may rely more heavily on behavioral sensor signals such as review burstiness or rating deviation to make its decision.
The fused representation captures complementary information from all modalities, enabling the model to identify fraudulent reviews that may appear legitimate in one modality but exhibit suspicious patterns in others. This holistic view is particularly valuable for detecting sophisticated fraud campaigns where attackers carefully craft realistic-seeming text but inadvertently leave traces in their posting behaviors or metadata patterns.
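The fusion step can be sketched as follows. This is a single-head numpy sketch under simplifying assumptions; the deployed module uses 4 attention heads and learned projections inside the LoRA-adapted model.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(H, M, Wq, Wk, Wv):
    # H: (L, d) text hidden states (queries); M: (n, d) encoded multimodal features
    # (keys/values). Returns fused states with a residual connection to H.
    Q, K, V = H @ Wq, M @ Wk, M @ Wv
    attn = softmax(Q @ K.T / np.sqrt(H.shape[1]), axis=-1)   # (L, n) weights
    return H + attn @ V                                      # residual keeps text info
```

The residual form means the multimodal pathway can only add information on top of the textual representation, which matches the design goal of preserving the original textual signal.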
4.5. Chain-of-Thought Reasoning for Explainability
A critical limitation of existing fake review detection systems is their opacity—they provide binary classifications without revealing the reasoning behind their decisions. This lack of transparency hinders human oversight, reduces user trust, and complicates model debugging. To address this challenge, we implement a Chain-of-Thought reasoning module that generates structured natural language explanations alongside predictions.
The CoT module extends the LLM’s output to produce not only a classification label but also a step-by-step reasoning process. We achieve this by appending a specialized generation prompt to the knowledge-enhanced input that explicitly instructs the model to articulate its reasoning:
This structured prompt encourages the model to decompose its analysis into interpretable components. During generation, the model produces a sequence of tokens that first describes observed linguistic characteristics, then evaluates sentiment-rating alignment, subsequently considers behavioral features from the multimodal input, and finally synthesizes these observations into a classification decision.
The generated explanation e follows a template structure but with content dynamically determined by the review characteristics:

e = LLM^LoRA(prompt, z),

where the superscript LoRA indicates the adapted model with trainable low-rank matrices. The explanation generation is conditioned on both the textual input and the encoded multimodal features z, ensuring that behavioral signals are reflected in the reasoning process.
To enhance the faithfulness of generated explanations, ensuring they accurately reflect the model’s actual decision process rather than post hoc rationalizations, we employ a two-stage training approach. In the first stage, the model is trained solely on classification with standard cross-entropy loss. In the second stage, we introduce an explanation supervision loss that encourages the model to generate explanations consistent with human-annotated rationales in a subset of the training data:

L_exp = − Σ_j log p(e_j | e_<j, input),

where e_j denotes the j-th token in the explanation and e_<j represents all preceding tokens. This supervision signal guides the model toward generating explanations that align with expert human reasoning patterns.
The final explanation highlights specific evidence from multiple modalities. For example, it might note: “The review contains excessive superlatives (‘absolutely perfect’) without specific product details (linguistic anomaly). The 5-star rating contradicts negative phrases in the text (sentiment inconsistency). The reviewer’s account was created 2 days ago with 15 reviews posted within 24 h (behavioral red flag). Classification: Fake review.” This multifaceted reasoning provides actionable insights for human moderators and enables users to understand and potentially contest the automated decision.
4.6. Knowledge Distillation for Enhanced Performance
To further boost the detection accuracy of our relatively compact LLaMA-3-8B student model, we employ knowledge distillation from a larger teacher model (LLaMA-3-70B). The teacher model is first fine-tuned on the fake review detection task using the same LoRA adaptation strategy, albeit with greater capacity to learn nuanced patterns. The student model then learns to mimic both the teacher’s predictions and its internal representations.
The distillation process optimizes a combined objective that balances three components: task-specific classification loss, soft label distillation loss, and feature-level distillation loss. Let p^T and p^S denote the predicted probability distributions from the teacher and student models, respectively. The soft label distillation loss employs Kullback–Leibler divergence with temperature scaling:

L_KD = T² · KL( softmax(z^T / T) ‖ softmax(z^S / T) ),

where z^T and z^S are the teacher and student logits and T is the temperature parameter that softens the probability distributions, allowing the student to learn from the teacher’s uncertainty. Higher temperatures produce softer distributions that reveal more information about the teacher’s confidence across different classes.
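The temperature-scaled loss can be written compactly as below. This is a numpy sketch for a single example; the T² factor, standard in distillation to keep gradient magnitudes comparable across temperatures, is an assumption about the exact scaling used.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D logit vector
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_label_kd_loss(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions
    p = softmax(np.asarray(teacher_logits, dtype=float) / T)
    q = softmax(np.asarray(student_logits, dtype=float) / T)
    return float(T * T * np.sum(p * np.log(p / q)))
```

When the student matches the teacher exactly the loss is zero; larger temperatures flatten both distributions, exposing the teacher's relative confidence between the fake and genuine classes.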
Additionally, we minimize the distance between the student’s and teacher’s intermediate representations to transfer the teacher’s learned feature space:

L_feat = ‖ W_proj h^S − h^T ‖²,

where h^S and h^T are hidden representations from the student and teacher models, and W_proj is a learned linear transformation that accounts for potential dimensional differences.
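Given the stated role of the learned transformation (mapping the student's hidden size to the teacher's), the feature-level term reduces to a projected squared distance; a sketch with our own illustrative names (feature_distill_loss, W_proj):

```python
import numpy as np

def feature_distill_loss(h_student, h_teacher, W_proj):
    # squared L2 distance between projected student features and teacher features;
    # W_proj maps the student's hidden dimension to the teacher's hidden dimension
    diff = W_proj @ h_student - h_teacher
    return float(diff @ diff)
```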
The complete training objective combines these components with the standard classification loss:

L_total = λ₁ L_cls + λ₂ L_KD + λ₃ L_feat,

where L_cls is the cross-entropy loss on ground-truth labels, and λ₁, λ₂, λ₃ are weighting hyperparameters that control the relative importance of each objective component.
This distillation strategy enables the student model to inherit the teacher’s superior discrimination capability while maintaining the computational efficiency required for real-time inference in production environments. The soft labels from the teacher provide richer training signals than hard binary labels alone, particularly for ambiguous cases near the decision boundary.
4.7. Overall Training and Inference Pipeline
Algorithm 1 summarizes the complete training and inference procedure of the KE-MLLM framework. The training process consists of three sequential stages: teacher model pre-training with LoRA adaptation, student model distillation with multi-objective optimization, and explanation fine-tuning with human-annotated rationales. During inference, the knowledge retrieval and multimodal fusion operate in parallel to minimize latency, with the final prediction and explanation generated in a single forward pass through the LoRA-adapted student model.
Algorithm 1 KE-MLLM Training and Inference

1: Input: Training data, knowledge base, pre-trained LLMs
2: Output: Trained model parameters, prediction, explanation e
3:
4: // Stage 1: Teacher Model Training
5: for each batch in the training data do
6:     Fine-tune LLaMA-3-70B with LoRA on the classification task
7:     Update the teacher parameters via the classification loss
8: end for
9:
10: // Stage 2: Student Model Distillation
11: for each batch in the training data do
12:     Retrieve knowledge items for each review
13:     Construct knowledge-enhanced prompts
14:     Encode multimodal features
15:     Compute teacher predictions
16:     Compute student predictions
17:     Fuse features via cross-attention
18:     Update the student via the combined loss (Equation (16))
19: end for
20:
21: // Stage 3: Explanation Fine-tuning
22: for each annotated sample do
23:     Generate the explanation
24:     Update via the explanation loss (Equation (13))
25: end for
26:
27: // Inference
28: Input: new review
29: Retrieve knowledge and construct the prompt
30: Encode and fuse multimodal features
31: Generate the prediction
32: Generate the explanation e
33: Return the prediction and explanation e
The modular design of KE-MLLM enables flexible deployment scenarios. The knowledge base can be dynamically updated as new fraud patterns emerge without retraining the entire model. The LoRA adapters can be efficiently swapped to specialize the model for different product categories or platforms while sharing the same frozen base model. The multimodal fusion module can incorporate additional feature types as they become available, such as image content in reviews with photos or network features from social graphs.
5. Experiments
In this section, we present comprehensive experimental evaluations of the proposed KE-MLLM framework. We begin by describing the datasets and evaluation metrics, followed by detailed comparisons with state-of-the-art baselines. We then conduct ablation studies to analyze the contribution of each component and present qualitative analysis of the generated explanations through human evaluation.
5.1. Datasets
We conduct experiments on two widely-adopted benchmark datasets for fake review detection, both available through the Deep Graph Library (DGL) Fraud Dataset repository [17].
YelpChi Dataset: This dataset contains restaurant reviews from the Yelp platform in the Chicago area, comprising 67,395 reviews from 38,063 users on 14,082 restaurants. Each review is labeled as genuine (YR) or fake (YF) based on Yelp’s proprietary filtering algorithm, which has been validated through manual inspection studies. The dataset includes rich metadata such as user registration date, review count, average rating, and temporal posting patterns. Importantly, it also contains social network information capturing friendship relations among users, though our primary focus is on textual and behavioral features. The class distribution is imbalanced, with approximately 13.2% of reviews labeled as fake, reflecting real-world fraud rates.
Amazon Reviews Dataset: This dataset aggregates product reviews from multiple Amazon categories, including Electronics, Home and Kitchen, and Books. It contains 11,944 reviews from 7818 users on 4194 products. Reviews are labeled based on a combination of Amazon’s spam detection system and crowd-sourced annotations from the Amazon Mechanical Turk platform. The dataset exhibits different characteristics compared with YelpChi, with longer average review length (156 tokens vs. 89 tokens) and more diverse product categories. The fake review rate is approximately 19.7%, slightly higher than YelpChi, potentially reflecting different fraud patterns across e-commerce versus local business reviews.
For both datasets, we extract multi-sensor features including metadata (rating, account age, review count, review length) and behavioral sensing signals (review burstiness computed as coefficient of variation of inter-review intervals, rating deviation from product average, account activity patterns) that are continuously monitored as time-series sensor data. We randomly split each dataset into training (70%), validation (15%), and test (15%) sets, ensuring stratified sampling to maintain class distribution across splits. All reported results are averaged over three independent runs with different random seeds to account for training variability.
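For instance, the review-burstiness feature described above can be computed directly from a user's posting timestamps; a sketch (timestamps in any consistent unit, e.g., hours):

```python
import numpy as np

def review_burstiness(timestamps):
    # coefficient of variation (std / mean) of inter-review intervals:
    # near 0 for evenly spaced reviews, large for bursty posting patterns
    gaps = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    if len(gaps) == 0 or gaps.mean() == 0:
        return 0.0
    return float(gaps.std() / gaps.mean())
```

A user posting at a steady pace yields a value near zero, whereas a burst of reviews followed by silence, a pattern associated with paid review campaigns, pushes the value well above one.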
5.2. Baseline Methods
We compare KE-MLLM against a comprehensive set of baseline methods representing different paradigms in fake review detection:
Traditional Machine Learning Methods: We implement SVM with TF-IDF features [4] and Random Forest with hand-crafted linguistic and behavioral features [5] as classical baselines. These methods serve as lower bounds on performance and demonstrate the value of deep learning approaches.
Neural Network Baselines: We include BiLSTM with an attention mechanism [6], which captures sequential dependencies in review text, and a CNN-BiLSTM hybrid architecture that combines local pattern extraction with sequential modeling. These represent strong neural baselines from the pre-transformer era.
Pre-trained Language Model Baselines: We compare against BERT-base fine-tuned on review classification [24], RoBERTa-base with adversarial training [25], and DistilBERT as a lightweight alternative. These methods represent the current mainstream approach of fine-tuning pre-trained transformers.
Graph-Based Methods: We include CARE-GNN [17], a state-of-the-art graph neural network designed specifically for fraud detection that leverages user-product-review heterogeneous graphs. This provides a comparison with methods that explicitly model relational structures.
LLM-Based Methods: We implement several LLM baselines including: (1) GPT-3.5-turbo with few-shot prompting (5 examples), (2) LLaMA-3-8B with full fine-tuning, (3) LLaMA-3-8B with standard LoRA (without knowledge enhancement or multimodal fusion), and (4) LLaMA-3-8B with prompt-tuning. These comparisons isolate the contributions of our proposed components.
All baselines are implemented using their official codebases or reproduced following the original papers. Hyperparameters are tuned on the validation set using grid search to ensure fair comparison. For neural and LLM-based methods, we use the same computational budget in terms of training iterations and hardware resources.
5.3. Evaluation Metrics
Given the class imbalance in fake review detection datasets, we employ multiple evaluation metrics that provide complementary perspectives on model performance:
Accuracy: The proportion of correct predictions among all samples, providing an overall performance measure but potentially misleading under class imbalance.
Precision, Recall, and F1-Score: For the positive class (fake reviews), precision measures the proportion of predicted fake reviews that are truly fake, recall measures the proportion of actual fake reviews correctly identified, and F1-score provides their harmonic mean. Given the critical importance of catching fake reviews while minimizing false accusations, we consider F1-score as our primary metric.
AUC-ROC (Area Under Receiver Operating Characteristic Curve): This metric evaluates the model’s ability to discriminate between classes across all classification thresholds, providing a threshold-independent performance measure.
AUC-PR (Area Under Precision-Recall Curve): Under class imbalance, AUC-PR is often more informative than AUC-ROC as it focuses on performance on the minority (positive) class.
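The threshold-based metrics above reduce to simple functions of the confusion counts; for reference (treating fake as the positive class):

```python
def precision_recall_f1(tp, fp, fn):
    # precision, recall, and F1 for the positive (fake) class,
    # computed from true-positive, false-positive, and false-negative counts
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```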
For explainability evaluation, we additionally compute:
Explanation Consistency: The proportion of generated explanations that align with expert human annotations, measured through semantic similarity and overlap of highlighted evidence.
Faithfulness: The degree to which explanations reflect the model’s actual decision-making process, assessed through perturbation tests where evidence mentioned in explanations is removed and prediction change is measured.
5.4. Implementation Details
Our KE-MLLM framework is implemented in PyTorch with the Hugging Face Transformers library. We use LLaMA-3-8B as the base model with LoRA rank applied to query and value projection matrices in all 32 transformer layers. The LoRA scaling factor is set to 32. The knowledge base contains 150 curated fraud indicators across linguistic, sentiment, and behavioral categories, with top- items retrieved per review. Multimodal feature encoders consist of 2-layer feed-forward networks with 256 hidden dimensions and ReLU activation. The cross-attention fusion module uses 4 attention heads with dimension 512.
Training employs the AdamW optimizer with separate learning rates for the LoRA parameters and the other components. We use linear warmup for 500 steps followed by cosine decay. The batch size is set to 16 with gradient accumulation over 4 steps, yielding an effective batch size of 64. The loss weights balance the classification, explanation-generation, and knowledge distillation objectives. Training converges within 10 epochs on both datasets. For the teacher model (LLaMA-3-70B), we use the same LoRA configuration but with a larger rank for increased capacity.
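The learning-rate schedule (linear warmup for 500 steps followed by cosine decay) can be written as a multiplicative factor suitable for, e.g., torch.optim.lr_scheduler.LambdaLR. The warmup length follows the text; the total-step count below is illustrative:

```python
import math

WARMUP_STEPS = 500  # from the text

def lr_factor(step, total_steps, warmup=WARMUP_STEPS):
    """Linear warmup to 1.0, then cosine decay to 0.0."""
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Illustrative: a 10,000-step run.
factors = [lr_factor(s, 10_000) for s in (0, 250, 500, 10_000)]
```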
All experiments are conducted on NVIDIA A100 GPUs with 40GB memory. Training the complete KE-MLLM framework (including teacher model) takes approximately 8 h on YelpChi and 3 h on Amazon Reviews. Inference time per review is 0.15 s on average, making the system suitable for real-time deployment. To assess performance stability, all experiments are repeated five times with different random seeds, and the reported results correspond to the mean performance across runs.
5.5. Main Results
Table 2 presents the performance comparison between KE-MLLM and the baseline methods on both datasets. Our proposed framework achieves substantial improvements over all baselines across multiple metrics. On the YelpChi dataset, KE-MLLM achieves a 94.3% F1-score, outperforming the strongest baseline (standard LoRA fine-tuning) by 10.0 percentage points. The improvement is particularly pronounced in recall (92.1% vs. 83.1%), indicating that our framework identifies a larger proportion of actual fake reviews while maintaining high precision. This matters for practical deployment, where missing fraudulent content reduces moderation effectiveness. The accuracy improvement of 2.9% over the best baseline demonstrates robust overall performance despite the class imbalance. Similarly, on the Amazon Reviews dataset, KE-MLLM achieves a 92.8% F1-score, a 6.2 percentage point improvement over standard LoRA. The consistent gains across both datasets suggest that the proposed framework generalizes across review domains. In addition, to assess the statistical reliability of the improvements, we conduct paired t-tests between KE-MLLM and the strongest baseline across multiple runs; the performance gains achieved by KE-MLLM are statistically significant (p < 0.05).
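The significance test can be sketched as a paired t-test over per-run F1-scores. The scores below are illustrative, not the actual per-run results; with n = 5 runs (df = 4), the two-tailed 5% critical value is approximately 2.776:

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic for a paired t-test on per-run metric scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Illustrative per-run F1-scores: KE-MLLM vs. the strongest baseline.
t = paired_t_statistic([94.1, 94.5, 94.3, 94.2, 94.4],
                       [84.2, 84.5, 84.1, 84.4, 84.3])
# |t| is then compared against the critical value t_{0.025, df=4} ≈ 2.776.
```

In practice, scipy.stats.ttest_rel would return both the statistic and the exact p-value.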
Table 3 presents AUC-ROC and AUC-PR scores, which further confirm the effectiveness of KE-MLLM across different operating points. The AUC-ROC scores of 96.7% (YelpChi) and 94.9% (Amazon) indicate excellent discrimination capability across all threshold settings. More importantly, the AUC-PR improvements of 4.5 and 4.6 percentage points over the strongest baseline demonstrate that KE-MLLM maintains high precision even as recall increases, which is particularly valuable given the class imbalance.
Several key observations emerge from these results. First, graph-based methods such as CARE-GNN perform well but are limited by the availability and quality of network information, which may be sparse or noisy in real-world scenarios. Second, standard LoRA adaptation of LLaMA-3-8B already outperforms traditional pre-trained models such as BERT and RoBERTa, validating the value of larger language models. Third, few-shot prompting of GPT-3.5 underperforms fine-tuned models, suggesting that domain adaptation through parameter updates is essential for this specialized task. Finally, the substantial gains of KE-MLLM over standard LoRA highlight the importance of our knowledge enhancement, multimodal fusion, and distillation strategies. We also conduct a detailed error analysis, presented in Section 5.11, to further understand the limitations of the proposed framework.
5.6. Ablation Studies
To understand the contribution of each component in KE-MLLM, we conduct comprehensive ablation studies by systematically removing individual modules.
Table 4 presents the results.
Knowledge Enhancement: Removing the knowledge-enhanced prompting strategy results in a 3.2 percentage point drop in F1-score (from 94.3% to 91.1%). This significant degradation confirms that domain-specific knowledge about fraud patterns substantially improves the model’s discrimination capability. Without this guidance, the model must rely solely on patterns learned from training data, which may be insufficient for capturing sophisticated fraud tactics.
Multimodal Fusion: Eliminating behavioral and metadata features reduces F1-score by 2.4 points (to 91.9%). This demonstrates that textual analysis alone, while powerful, misses important fraud signals that manifest in posting behaviors and account characteristics. The relatively smaller impact compared with knowledge enhancement suggests that the LLM’s textual understanding is quite strong, but multimodal signals still provide valuable complementary information.
Knowledge Distillation: Removing the teacher–student distillation framework decreases F1-score by 1.8 points (to 92.5%). This indicates that the larger teacher model successfully captures nuanced patterns that benefit the student model. The moderate impact suggests that the 8B parameter student has sufficient capacity for this task but benefits from the teacher’s guidance, particularly for ambiguous cases near the decision boundary.
CoT Reasoning: Interestingly, removing the explicit CoT generation has minimal impact on quantitative metrics (0.5 point drop to 93.8%), suggesting that the classification performance is primarily driven by other components. However, as shown in
Section 5.8, CoT reasoning is crucial for generating interpretable explanations, which is a primary contribution of our work. The small quantitative impact may actually be beneficial, as it indicates that the explanation generation does not compromise classification accuracy.
The cumulative effect of removing all enhancements (reverting to the LoRA baseline) results in a dramatic 10.0 point F1-score drop, confirming that the synergistic combination of our proposed components drives the superior performance.
5.7. Hyperparameter Sensitivity Analysis
We analyze the sensitivity of KE-MLLM to key hyperparameters, focusing on LoRA rank
r and knowledge retrieval size
k.
Figure 2 shows the results.
LoRA Rank: We vary the rank r while keeping other parameters fixed. Performance improves as the rank increases from 4 to 16, then plateaus at higher ranks. Very low ranks constrain the model's expressiveness, while very high ranks approach full fine-tuning and risk overfitting. The optimal rank of r = 16 balances adaptation capability with parameter efficiency, consistent with findings in other LoRA applications [12].
Knowledge Retrieval Size: We test different numbers of retrieved knowledge items k. Performance peaks at an intermediate value of k, with degradation at both extremes. Too few items provide insufficient guidance, while too many introduce noise and dilute the relevance of the retrieved knowledge. The optimal value of k provides diverse fraud indicators without overwhelming the context window.
These analyses indicate that KE-MLLM remains relatively stable across reasonable hyperparameter ranges, with clear empirical optima on the evaluated datasets.
5.8. Qualitative Analysis and Human Evaluation
Beyond quantitative metrics, we conduct qualitative analysis of the generated explanations through human evaluation studies. We randomly sample 200 reviews (100 genuine, 100 fake) from the test set and collect explanations from three methods: KE-MLLM, attention-based visualization from BERT, and LIME explanations from RoBERTa.
Three domain experts with prior experience in fraud detection, online review analysis, and natural language processing independently evaluated each explanation. Before the formal evaluation, the evaluators were provided with a short guideline describing the three scoring dimensions and representative examples of high- and low-quality explanations: (1)
Completeness: Does the explanation cover all relevant evidence? (2)
Accuracy: Is the highlighted evidence truly indicative of the predicted label? (3)
Clarity: Is the explanation understandable to non-experts? For each dimension, a score of 1 indicates very poor quality, 2 indicates limited quality, 3 indicates moderate quality, 4 indicates good quality, and 5 indicates excellent quality. Specifically, for
Completeness, higher scores indicate that the explanation covers a larger proportion of the relevant evidence; for
Accuracy, higher scores indicate that the cited evidence is more strongly aligned with the predicted label; for
Clarity, higher scores indicate that the explanation is more understandable and actionable for non-expert users.
Table 5 presents the aggregated results.
KE-MLLM substantially outperforms baseline explanation methods across all dimensions, achieving near-expert level ratings (4.6–4.8 out of 5). Attention-based visualizations receive the lowest scores, as they merely highlight words without explaining why those words are indicative of fraud. LIME provides more structured explanations through feature importance but lacks the natural language reasoning that domain experts and end-users prefer. To assess annotation consistency, we computed Krippendorff’s alpha across the three evaluators. The resulting agreement coefficients were 0.81 for Completeness, 0.84 for Accuracy, and 0.79 for Clarity, indicating substantial agreement.
To quantitatively measure explanation faithfulness, we conduct a controlled perturbation analysis in which evidence explicitly cited in the generated explanation is systematically removed from the input while keeping all other content unchanged. If an explanation is faithful, removing the cited evidence should significantly reduce the model's confidence in its original prediction. We define the faithfulness score as
\text{Faithfulness} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\left| f(x_i) - f(\tilde{x}_i) \right| > \tau \right],
where \tilde{x}_i denotes the input with explanation-cited evidence removed, f(\cdot) is the model's predicted probability for the original label, and \tau is a threshold for significant prediction change. KE-MLLM achieves a faithfulness score of 89.5%, compared with 62.3% for attention-based methods and 71.8% for LIME, confirming that our generated explanations genuinely reflect the model's decision-making process.
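The faithfulness score reduces to a simple indicator average over prediction-confidence changes; a minimal sketch with illustrative probabilities and threshold:

```python
def faithfulness_score(p_original, p_perturbed, tau):
    """Fraction of samples whose predicted probability for the original label
    changes by more than tau after explanation-cited evidence is removed."""
    changed = [abs(po - pp) > tau for po, pp in zip(p_original, p_perturbed)]
    return sum(changed) / len(changed)

# Illustrative: model confidence before vs. after removing cited evidence.
score = faithfulness_score([0.95, 0.90, 0.85, 0.60],
                           [0.40, 0.35, 0.80, 0.55],
                           tau=0.2)
```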
Figure 3 presents representative examples of KE-MLLM explanations for both fake and genuine reviews, demonstrating the system’s ability to articulate multifaceted reasoning.
These examples illustrate how KE-MLLM synthesizes evidence across multiple modalities into coherent natural language explanations that are accessible to both technical and non-technical stakeholders.
5.9. Computational Efficiency Analysis
A critical consideration for real-world deployment in edge sensing devices is computational efficiency.
Table 6 compares training and inference costs across different methods.
Despite incorporating multiple sophisticated components, KE-MLLM requires only 42 M trainable parameters (0.5% of the full model) and completes training in 8 h, which is dramatically more efficient than full fine-tuning (36 h). During inference, the model requires approximately 15 GB of GPU memory, comparable to other LoRA-based LLM adaptations and significantly lower than full fine-tuning of the base model. The inference latency of 150 ms per review is reasonable for many near-real-time moderation scenarios. The additional 15 ms overhead compared with standard LoRA comes primarily from knowledge retrieval and multimodal encoding, which could be further optimized through caching and batching strategies. We also evaluated CPU-based inference on a 16-core server processor; the average latency is approximately 880 ms per review, which remains practical for batch-based moderation workflows. Moreover, although the current implementation targets GPU-based inference, the parameter-efficient design of KE-MLLM may facilitate deployment on resource-constrained servers, though system-level optimization would still be required in practice. In particular, LoRA adaptation reduces the number of trainable parameters while maintaining strong performance, enabling further optimization through model quantization or distillation. With 8-bit or 4-bit quantization, the memory footprint could be reduced further, making the framework suitable for deployment on edge AI accelerators or resource-constrained moderation systems.
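For the quantized-deployment option mentioned above, 4-bit loading of the base model can be configured with the transformers BitsAndBytesConfig. This is a sketch of one standard setup, not a configuration evaluated in the paper; the NF4 quantization type and compute dtype are illustrative choices:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: load the base model with 4-bit NF4 quantization to shrink the
# inference-time memory footprint (requires the bitsandbytes package).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```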
5.10. Cross-Domain Generalization
To assess the generalizability of KE-MLLM, we conduct cross-domain experiments where the model is trained on one dataset and evaluated on the other without additional fine-tuning.
Table 7 presents the results. KE-MLLM demonstrates superior cross-domain performance, maintaining 79.8% and 82.6% F1-scores when transferring between domains. This represents approximately 15% degradation compared with in-domain performance, substantially better than the 25–30% degradation observed in baseline methods. The strong generalization capability stems from the knowledge-enhanced prompting strategy, which encodes domain-agnostic fraud principles that transfer across different review contexts.
5.11. Error Analysis
To understand the limitations of KE-MLLM, we analyze the 5.7% of test cases where the model makes incorrect predictions on YelpChi. The errors fall into three primary categories:
Sophisticated Camouflaged Reviews (48%): These are fake reviews crafted with genuine-seeming specific details and natural language, posted by aged accounts with normal activity patterns. The model’s textual and behavioral signals fail to detect the subtle manipulation, highlighting the ongoing arms race between fraud detection and increasingly sophisticated fraudsters.
Ambiguous Genuine Reviews (31%): Some authentic reviews are misclassified as fake due to unusual characteristics such as extremely positive language from enthusiastic customers or burst posting from legitimate users during vacation trips. These false positives suggest that the model may be overly sensitive to certain patterns.
Data Quality Issues (21%): A portion of errors trace back to noisy labels in the dataset, where reviews labeled by the filtering algorithm may not reflect ground truth. Manual inspection by our expert evaluators suggests that in approximately 12% of error cases, the model’s prediction may actually be more accurate than the provided label.
This error analysis reveals that while KE-MLLM achieves strong overall performance, handling highly sophisticated fraud and reducing false positives on unusual but genuine content remain open challenges for future work.
5.12. Further Discussions
5.12.1. Potential Label Bias and Generalizability
It is important to note that the ground-truth labels in datasets such as YelpChi and Amazon Reviews are derived from platform-level filtering mechanisms rather than fully verified human annotations. As a result, some labels may reflect heuristic detection rules or platform moderation policies, which may introduce bias into the dataset. This limitation could affect the generalizability of fake review detection models trained on these datasets. In particular, models may learn patterns that correlate with platform-specific detection heuristics rather than universally applicable fraud signals. Despite this limitation, these datasets remain widely used benchmarks in the literature, allowing fair comparison with prior work. Future research could further improve robustness by incorporating manually verified datasets or cross-platform evaluation.
5.12.2. Scope of Cross-Domain Evaluation
We note that the current empirical validation is conducted on two benchmark review datasets, namely YelpChi and Amazon Reviews. While these datasets are widely adopted in prior fake review detection research, they do not cover the full diversity of review platforms, product categories, or review-writing styles encountered in real-world environments. Therefore, the reported cross-domain results should be interpreted as preliminary evidence rather than a comprehensive validation of robustness. Future work should extend the evaluation to additional review platforms, broader review genres, and more recent LLM baselines and open models.
5.12.3. Dataset Scope and Emerging LLM-Generated Reviews
It is important to note that the YelpChi and Amazon Reviews datasets were collected before the widespread availability of modern large language models. As a result, they primarily contain traditional forms of fraudulent reviews rather than sophisticated LLM-generated content. We nevertheless use these datasets because they are widely adopted benchmarks in the fake review detection literature, enabling direct comparison with prior methods. Evaluating detection approaches against modern LLM-generated fake reviews remains an important direction for future work.
6. Conclusions and Future Work
In this paper, we presented KE-MLLM, a promising integrated framework for fake review detection that combines knowledge-enhanced prompting, parameter-efficient LoRA adaptation, multimodal feature fusion, and chain-of-thought explainability. Experimental results on the YelpChi and Amazon Reviews benchmarks show that the proposed approach achieves strong detection performance while also producing human-interpretable explanations. In particular, KE-MLLM attains F1-scores of 94.3% and 92.8%, respectively, and the generated explanations show high agreement with expert assessment. These findings suggest that integrating LLM-based reasoning with behavioral and contextual signals is a useful direction for improving both prediction quality and interpretability in fake review detection. At the same time, the current results are based on benchmark datasets, and further validation is required before drawing conclusions about real-world deployment readiness or broad cross-platform applicability.
Despite these encouraging results, several limitations warrant further investigation. First, our evaluation on sophisticated camouflaged reviews reveals that a substantial portion of errors arise from advanced fraud tactics in which attackers produce linguistically plausible content while maintaining relatively normal behavioral patterns. This suggests that robustness to evolving and adaptive fraud strategies remains an open challenge. Second, the false positive rate on unusual but genuine reviews indicates that the model may still be overly sensitive to certain linguistic or behavioral cues, highlighting the need for improved uncertainty estimation and more calibrated decision-making. Third, although the framework shows encouraging cross-domain transfer performance, a noticeable gap remains between in-domain and cross-domain performance, indicating that broader validation across platforms and review ecosystems is still necessary. Fourth, the current knowledge base relies in part on manually curated fraud indicators, which may limit scalability as fraud tactics evolve; future work could explore more automated and continuously updated knowledge construction strategies. Fifth, while the proposed framework improves explanation quality, further study is needed to examine fairness, robustness, and faithfulness under more diverse real-world conditions.
Overall, KE-MLLM should be viewed as a promising step toward more accurate and interpretable fake review detection rather than a definitive deployment-ready solution. Future work will focus on robustness against adversarial manipulation, fairness across different user populations and platforms, improved generalization under dataset shift, and broader real-world evaluation in practical online review environments.