Lightweight Intrusion Detection Systems for IoT–Edge Environments: A PRISMA-ScR Systematic Review of Deployability Evidence and a Unified Assessment Framework

Islam, Md Manirul; Salsabil, Umme; Nurmamatov, Mekhriddin; Hossain, Sazzad

doi:10.3390/fi18060300

Open Access Systematic Review

by Md Manirul Islam ^1,2,*, Umme Salsabil ^3, Mekhriddin Nurmamatov ^2 and Sazzad Hossain ^2,*

^1 Faculty of Science and Technology, American International University-Bangladesh, Dhaka 1229, Bangladesh

^2 Faculty of Artificial Intelligence and Digital Technologies, Samarkand State University Named After Sharof Rashidov, Samarkand 140104, Uzbekistan

^3 School of Energy, British Columbia Institute of Technology, Burnaby, BC V5G 3H2, Canada

^* Authors to whom correspondence should be addressed.

Future Internet 2026, 18(6), 300; https://doi.org/10.3390/fi18060300

Submission received: 1 May 2026 / Revised: 29 May 2026 / Accepted: 30 May 2026 / Published: 2 June 2026

(This article belongs to the Section Cybersecurity)

Abstract

Future internet services are expected to increasingly depend on IoT–edge deployments, in which intrusion detection must operate close to constrained, heterogeneous devices rather than only in cloud or data-center environments. Although the literature proposes many “lightweight” intrusion detection systems (IDSs), the evidence supporting deployability is uneven and often limited to accuracy-oriented benchmark results. This PRISMA-ScR review, which was cross-checked against the PRISMA 2020 reporting items, synthesizes 78 peer-reviewed studies published between January 2017 and March 2026 and evaluates how they report model compactness, data and preprocessing burden, system placement, hardware measurements, operational robustness, and reproducibility. The reviewers independently screened 1162 deduplicated records and charted the included studies. This review found that architectural compactness is commonly reported, whereas target device latency, runtime memory, measured power or energy, zero-day evaluation, time-aware splitting, and device shift validation remain inconsistently reported. To make these gaps auditable, this study introduces a five-dimensional deployability framework using log-scale normalization, bounded benefit coding, completeness penalties, scorer agreement checks, and scenario-based sensitivity analysis. The results show that no IDS family dominates across all deployment scenarios: rankings change when hardware constraints or operational robustness receive priority. This review concludes with a benchmark blueprint, reporting protocol, completed PRISMA checklist, and research agenda for deployment-grade IoT–edge IDS studies.

Keywords: intrusion detection; Internet of Things; edge computing; lightweight IDS; deployability; TinyML; systematic review; PRISMA-ScR

1. Introduction

Edge-enabled Internet of Things (IoT) deployments now run consumer, industrial, healthcare, and vehicular workloads, and they push latency-sensitive analytics from centralized clouds to gateways, fog nodes, and microcontroller-class endpoints. However, this architectural shift enlarges the attack surface. Heterogeneous devices, weak default configurations, intermittent connectivity, and long operational lifetimes combine to produce persistent security exposure that cloud-only monitoring cannot mitigate cost-effectively [1,2,3]. Intrusion detection systems (IDSs) adapted for edge operation are therefore central to IoT security engineering, but they must operate under memory, compute, energy, and update constraints that do not apply in data-center settings.

A large body of work now proposes lightweight IDSs for IoT–edge environments. The methods involved include compact convolutional, temporal, and graph-based architectures; aggressive feature and sample reduction; quantization and pruning; federated and split learning schemes; TinyML-style endpoint deployment; and anomaly-based unsupervised detectors [4,5,6,7,8,9,10,11,12,13]. Reported detection accuracy on standard datasets is often excellent. Many studies claim near-perfect F1 scores on CICIDS, CICIoT2023, BoT-IoT, ToN-IoT, and related benchmarks [14,15,16,17]. At first glance, the literature suggests a mature and near-deployable state of the art.

However, closer inspection tells a different story. The term “lightweight” is used inconsistently. Some studies use this term to mean low parameter or FLOP count, feature compression, or distributed execution, and some simply demonstrate a Raspberry Pi deployment. Latency, memory, power, and energy on the target device are often absent and, when they appear, they frequently lack the measurement context (warm-up, batch size, concurrency, thermal state) needed for an honest comparison. Evaluation under class imbalance, zero-day conditions, distribution drift, or device shift is rarer still [18,19,20,21]. As a result, studies that appear comparable often rely on evidence that is not directly comparable.

Recent reviews of machine learning IDSs for IoT have clarified algorithmic taxonomies [22,23,24,25,26] and dataset characteristics [27,28,29,30], and several systematic reviews have addressed deep learning-based IDSs specifically [31,32,33,34]. However, three gaps persist. First, deployability itself, understood as the composite question of whether a model can run, be maintained, and be trusted on target hardware, is not treated as a primary review outcome. Second, reporting completeness is rarely audited; missing hardware or operational evidence is typically left uncommented rather than scored. Third, the heterogeneity of benchmark protocols, split definitions, and device tiers across the literature is not translated into a reproducible assessment framework that other researchers can apply to new work as it is published.

These gaps matter in engineering practice. The question facing a deployment team is not which model has the highest reported F1; rather, it is which model, under which evidence, is genuinely suitable for a given deployment tier and threat model. Answering this question requires a review that jointly tracks model compactness, data burden, system placement, hardware measurement, and operational robustness, and penalizes missing evidence rather than treating it as neutral.

Five research questions guide this review. RQ1 asks which lightweighting strategies are used in peer-reviewed lightweight IoT–edge IDS studies and how they are distributed across architectural, data-centric, system-level, and operational design choices. RQ2 asks which datasets, split protocols, and evaluation metrics are used and how consistently operational protocols (zero-day holdout, time-aware split, device or site shift) are applied. RQ3 asks what hardware-level evidence (latency, throughput, memory, power, energy, runtime toolchain) is reported and on which device tiers. RQ4 asks how operational robustness (class imbalance, novelty, drift, updatability) is addressed in the included studies. Finally, RQ5 asks how complete the deployability reporting is across the literature and which reporting anti-patterns recur.

This review contributes the following. (i) A PRISMA-ScR evidence synthesis focused specifically on deployability evidence in lightweight IoT–edge IDSs, rather than accuracy or model taxonomy alone. (ii) A unified five-dimensional deployability framework that uses log-scale normalization for cost indicators, bounded coding for benefit indicators, and an explicit completeness penalty so that missing evidence is not silently rewarded. (iii) An application of the framework to a harmonized set of representative studies, with a scenario-based sensitivity analysis concerning balanced, hardware-priority, and operational-priority weights. (iv) A benchmark blueprint, a reporting checklist, and a workflow view that other researchers can adopt directly. (v) A research agenda for deployment-grade IoT–edge IDS work.

Section 2 describes the PRISMA-ScR methodology, including the protocol, search strategy, eligibility criteria, charting, and inter-rater reliability. Section 3 reports the bibliometric overview and the thematic findings against the five research questions and presents the unified deployability framework along with its scoring procedure, a worked example, and sensitivity analysis. Section 4 discusses the principal findings, implications for design and benchmarking, and limitations. Section 5 concludes this paper. Three appendices report the search strings, the list of included studies, and the dimension-level score worksheet.

2. Materials and Methods

2.1. Reporting Standard and Protocol

This review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) [35] and was cross-checked against the PRISMA 2020 checklist [36]. PRISMA-ScR is appropriate here because the objective is to map the heterogeneous landscape of deployability evidence across a maturing field, not to perform a quantitative meta-analysis of a narrowly defined intervention. The PRISMA 2020 checklist is provided as Supplementary File S4, and the PRISMA flow diagram is presented in Figure 1. The protocol was specified a priori, following Kitchenham and Charters’ guidelines for systematic reviews in software engineering [37], and was registered internally by the review team before the search began. This review was not externally registered because it is an engineering scoping review rather than a health intervention review of the kind eligible for a domain registry such as PROSPERO. The protocol amendments made during execution are recorded in Appendix B and the Supplementary workbook.

2.2. Background and Related Reviews

Classical IDSs grew out of signature matching and rule-based anomaly thresholds. They were designed for enterprise and data-center traffic and perform poorly on IoT networks, which are characterized by heterogeneous devices, constrained telemetry, high-cardinality but low-volume flows, and frequent drift in benign behavior [22,23]. The past decade has seen a clear shift toward machine learning- and, subsequently, deep learning-based IDSs for IoT, including classical supervised learners (decision trees, random forests, gradient-boosted ensembles) [24,38]; recurrent and convolutional deep networks [39,40]; hybrid spatio-temporal architectures [41]; and unsupervised or semi-supervised anomaly detectors for zero-day and open-world settings [42,43].

IoT-specific benchmark datasets have matured in parallel. NSL-KDD [44] and UNSW-NB15 [45] provided general network traffic benchmarks. BoT-IoT [46], CICIDS2017 [47], IoT-23 [48], ToN-IoT [49], CICIoT2023 [27], and Edge–IIoT [50] introduced IoT-specific telemetry, attack families, and deployment contexts. Dataset choice materially influences reported performance: class composition, feature semantics, and attack aggregation differ enough that a model strong on one benchmark can expose weaknesses on another [27,49].

Several surveys have synthesized the IoT-IDS literature. Early reviews focused on attack taxonomies and architectural placement [22,51]. Subsequent surveys on machine learning and deep learning IDSs [23,31,32] examined algorithmic families, feature engineering, and benchmark usage. More recent reviews have addressed deployment-adjacent concerns: trust and privacy in IoT IDSs [52], federated learning for collaborative detection [53,54], and TinyML-enabled endpoint security [55,56,57]. Systematic reviews on anomaly detection [42,58] and spatio-temporal deep learning for IDSs [34] have also appeared. These reviews are valuable, but when read through a deployability lens, they share three structural limitations: hardware profiling and operational evaluation are treated as secondary, reporting completeness is narrated rather than scored, and few provide reusable assessment tooling. The present review treats deployability as a first-class outcome and supplies the scoring logic as a reusable artifact.

2.3. Information Sources and Search Strategy

Five bibliographic databases were queried: Scopus, Web of Science Core Collection, IEEE Xplore, ACM Digital Library, and ScienceDirect. These cover the main indexed venues for IoT security, edge computing, and machine learning systems. A pilot search tested candidate strings against a gold set of eight known relevant studies; the final string was required to retrieve all eight. Table 1 lists the final database-specific strings. The initial search was executed on 15 February 2026 and rerun on 31 March 2026 to capture late-indexed records. The 2026 subset should therefore be interpreted as records indexed up to the rerun date rather than a complete calendar-year sample. The gray literature and preprint servers were excluded to preserve peer review quality. Backward and forward citation chasing on included studies was used as a supplementary strategy and added 23 records beyond the database searches.

2.4. Eligibility Criteria

Inclusion and exclusion criteria were defined a priori (Table 2). Studies had to propose a detection method targeting IoT–edge environments with an explicit lightweight, on-device, edge, fog, or TinyML claim and report evaluation evidence along at least one deployability dimension. We excluded studies that reported only accuracy on a standard benchmark, with no discussion of compactness, preprocessing, placement, hardware, or operational robustness.

2.5. Study Selection Process

Selection followed a two-stage procedure. In Stage 1, two reviewers (M.M.I. and S.H.) independently screened titles and abstracts against the eligibility criteria using a shared spreadsheet. In Stage 2, the same two reviewers independently assessed the full text of records passing Stage 1. Disagreements at either stage were resolved by discussion until consensus was reached; no third adjudicator was required. Records excluded at the full-text stage are listed in Appendix B with a primary exclusion reason. Figure 1 shows the PRISMA-ScR flow. Table 3 reports the counts for each stage. The complete bibliographic list of the 78 included studies is provided in Supplementary Table S1, together with coded fields for dataset, lightweighting strategy, device tier, hardware evidence, operational evidence, and completeness score. All of these supporting files are available in the Supplementary Materials. The main reference list cites only the studies and background sources discussed directly in the manuscript narrative.

Figure 1. PRISMA-ScR flow diagram showing identification, screening, eligibility assessment, and final inclusion.

2.6. Data Charting and Extraction

A structured charting form was developed iteratively during a pilot on ten records and then finalized before full extraction. Table 4 summarizes the seven extraction categories. The reviewers charted each included study independently. All 78 included studies were coded for the thematic synthesis and reporting completeness appraisal; the six families used later in the worked deployability example were selected from this fully coded corpus to illustrate the scoring procedure across contrasting strategy families. When reported values used different units (for example, KB versus MB, ms versus microseconds), values were normalized to kilobytes and milliseconds during charting. Missing fields were coded explicitly as “not reported” rather than inferred, and these codes were used as inputs to the completeness penalty in the deployability framework (Section 3.7). Fields judged demonstrably not applicable to a declared deployment scope were coded separately as “not applicable” and excluded from the denominator of the relevant completeness calculation, with the rationale recorded in the supplementary worksheet.
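The coding rule above can be illustrated with a minimal sketch. It assumes that completeness is the share of applicable fields that are reported, in line with the factor defined in Section 3.7.2; the field names and the example coding are hypothetical and are not taken from the review's worksheet.

```python
from typing import Dict

REPORTED = "reported"
NOT_REPORTED = "not reported"
NOT_APPLICABLE = "not applicable"


def completeness_ratio(coding: Dict[str, str]) -> float:
    """Share of applicable fields that are reported (r / m)."""
    # "not applicable" fields are dropped from the denominator m.
    applicable = [status for status in coding.values() if status != NOT_APPLICABLE]
    if not applicable:
        raise ValueError("no applicable fields to score")
    # "not reported" fields stay in m and therefore lower the ratio.
    reported = sum(1 for status in applicable if status == REPORTED)
    return reported / len(applicable)


# Hypothetical coding of one study's hardware-level fields (illustrative names).
hardware_coding = {
    "latency": REPORTED,
    "runtime_memory": NOT_REPORTED,
    "energy": NOT_REPORTED,
    "runtime_toolchain": REPORTED,
    "mcu_flash": NOT_APPLICABLE,  # outside the declared deployment scope
}

ratio = completeness_ratio(hardware_coding)
print(f"completeness = {ratio:.2f}")  # 2 reported of 4 applicable fields
```

Keeping "not reported" and "not applicable" as distinct codes is what lets a missing measurement lower the ratio while a genuinely out-of-scope field does not.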

2.7. Quality Appraisal and Inter-Rater Reliability

Because this review maps reporting evidence rather than pooling effect estimates, formal risk-of-bias tools such as ROBINS-I were not applied. Instead, a reporting completeness appraisal was performed: each included study was scored as “reported”, “not reported”, or “not applicable” against the extracted fields in Table 4. This appraisal feeds directly into the deployability framework’s completeness factor (Section 3.7). Inter-rater reliability was computed at both screening stages using Cohen’s kappa coefficient [59], with benchmarks interpreted per Landis and Koch [60]. During title/abstract screening, the two reviewers achieved kappa = 0.78 (89% raw agreement) across 1162 records. During full-text assessment, kappa = 0.83 (92% raw agreement) was obtained across 180 records. Both coefficients fall in the substantial-to-almost-perfect range. To validate the framework scoring itself, two reviewers independently scored the six worked example families and a stratified 20-study validation subset spanning the major strategy families and device tiers. Agreement for qualitative anchor coding was substantial (weighted kappa = 0.81), and that for dimension-level scores was high (ICC (2, 1) = 0.86). Score disagreements greater than 0.10 on any dimension were resolved via consensus after revisiting the source paper and coding rules. No generative AI tool was used for study identification, eligibility decisions, data charting, or framework scoring.
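For readers less familiar with the agreement statistic, the following sketch computes unweighted Cohen's kappa for two reviewers making include/exclude decisions. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the screening counts of this review.

```python
def cohens_kappa(both_include: int, only_a: int, only_b: int, both_exclude: int) -> float:
    """Unweighted Cohen's kappa for two raters and two categories."""
    n = both_include + only_a + only_b + both_exclude
    if n == 0:
        raise ValueError("no records to compare")
    observed = (both_include + both_exclude) / n  # raw agreement
    a_include = (both_include + only_a) / n       # rater A's include rate
    b_include = (both_include + only_b) / n       # rater B's include rate
    # Agreement expected by chance from the two marginal rates.
    expected = a_include * b_include + (1 - a_include) * (1 - b_include)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical screening outcome for 100 records.
kappa = cohens_kappa(both_include=20, only_a=5, only_b=5, both_exclude=70)
print(f"raw agreement = 0.90, kappa = {kappa:.2f}")
```

The example shows why kappa is reported alongside raw agreement: when most records are excluded by both reviewers, chance agreement is already high, so kappa sits well below the raw percentage.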

2.8. Data Synthesis Approach

Because the included studies use heterogeneous datasets, task definitions, and measurement practices, a quantitative meta-analysis of detection performance was neither feasible nor appropriate. Synthesis is therefore qualitative and thematic for RQ1–RQ4 and uses descriptive counts for RQ5. For the deployability framework (Section 3.7), a structured semi-quantitative approach was used: log-scale normalization for continuous cost indicators (parameters, FLOPs, model size, latency, energy), bounded unit interval coding for qualitative benefit indicators (presence of target device measurement, zero-day evaluation, quantization, and others), explicit completeness penalties for missing evidence, and sensitivity checks over normalization bounds and scenario weights. The purpose of the composite score is comparative evidence mapping, not the certification of deployability.
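A minimal sketch of this semi-quantitative scoring is given below, using the reversed log-scale normalization, percentile clipping, and completeness factor as they are defined in Section 3.7.2. The helper names, the percentile interpolation rule, and the example latencies are assumptions made for illustration; the authors' worksheet may implement these details differently.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation, q in [0, 100] (assumed rule)."""
    if not values:
        raise ValueError("values must not be empty")
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lower, upper = math.floor(pos), math.ceil(pos)
    return xs[lower] + (xs[upper] - xs[lower]) * (pos - lower)


def cost_score(x: float, x_min: float, x_max: float) -> float:
    """Reversed log-scale score: 1 at x_min, 0 at x_max, clipped outside."""
    if not 0 < x_min < x_max:
        raise ValueError("bounds must satisfy 0 < x_min < x_max")
    clipped = min(max(x, x_min), x_max)
    span = math.log(x_max) - math.log(x_min)
    return 1.0 - (math.log(clipped) - math.log(x_min)) / span


def dimension_score(reported_scores: Sequence[float], expected: int) -> float:
    """D = c * mean(s_reported) with c = r / m."""
    r = len(reported_scores)
    if expected <= 0 or r > expected:
        raise ValueError("expected must be positive and at least r")
    if r == 0:
        return 0.0
    return (r / expected) * (sum(reported_scores) / r)


# Hypothetical latencies in milliseconds, not values from the corpus.
latencies_ms = [0.8, 2.0, 5.0, 12.0, 40.0, 150.0, 900.0]
low, high = percentile(latencies_ms, 5), percentile(latencies_ms, 95)
print("latency score for 12 ms:", round(cost_score(12.0, low, high), 3))

# Two of four expected indicators reported, scored 1.0 and 0.5.
print("dimension score:", dimension_score([1.0, 0.5], expected=4))
```

Under this reading, the linear penalty is arithmetically the same as averaging over all expected indicators with each missing one counted as zero, which makes explicit that an omitted measurement is scored like the worst observed outcome.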

3. Results

3.1. Bibliometric Overview

3.1.1. Publication Growth and Venue Distribution

The 78 included studies span 2017 to 2026, with a clear growth trajectory after 2021. Annual publication counts are as follows: one in 2017, two in 2018, four in 2019, five in 2020, eight in 2021, 11 in 2022, 14 in 2023, 14 in 2024, 12 in 2025, and seven in 2026. The 2026 count reflects records indexed by 31 March 2026 and may undercount late-indexed articles from the same year. The median publication year is 2023. IEEE Access, Sensors (MDPI), Future Generation Computer Systems, IEEE Internet of Things Journal, and Internet of Things (Elsevier) together account for 58% of the corpus; the remainder is distributed across other journals and conference proceedings. Figure 2 stratifies annual publication counts by deployment evidence level. “Full” deployment evidence refers to at least target device latency plus either memory or power/energy evidence with a stated runtime/platform; “partial” evidence refers to at least one deployability-relevant measurement or placement detail beyond accuracy but not the full hardware profile; and “architectural-only” evidence means that lightweightness is supported mainly by parameter count, model size, FLOPs, or textual assertion without target device measurement. The proportion of architectural-only studies declines modestly over the period but remains the largest single category. Reporting practice has not kept pace with growth in publication volume.
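The three evidence levels defined above can be read as a simple decision rule. The sketch below is one possible encoding of that rule; the boolean field names are hypothetical, and borderline cases in the actual charting may have been resolved by reviewer judgment rather than by a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical per-study flags derived from a charting form."""
    target_latency: bool = False    # latency measured on the target device
    memory: bool = False            # memory evidence on the target device
    power_or_energy: bool = False   # measured power or energy
    runtime_stated: bool = False    # runtime or platform is named
    placement_detail: bool = False  # placement detail beyond accuracy


def evidence_level(e: Evidence) -> str:
    """Classify a study as full, partial, or architectural-only."""
    # Full: target latency plus memory or power/energy, with a stated runtime.
    if e.target_latency and (e.memory or e.power_or_energy) and e.runtime_stated:
        return "full"
    # Partial: at least one deployability-relevant measurement or detail.
    if e.target_latency or e.memory or e.power_or_energy or e.placement_detail:
        return "partial"
    # Otherwise only parameter counts, model size, FLOPs, or assertion remain.
    return "architectural-only"


print(evidence_level(Evidence(target_latency=True, memory=True, runtime_stated=True)))
print(evidence_level(Evidence(target_latency=True)))
print(evidence_level(Evidence()))
```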

3.1.2. Geographic and Methodological Trends

First-author affiliations span 27 countries, with concentrations in China, India, and the United States. Cross-institutional or cross-country collaborations are present in 41% of the included studies, and industry-affiliated authors appear on 12% of them. Deep learning families dominate the post-2020 corpus. Convolutional, temporal (TCN), and recurrent (LSTM/GRU) architectures are the most frequent, followed by transformer-based and graph neural network variants from 2023 onward. Classical ensembles (random forests, gradient boosting, LightGBM) remain well represented, particularly in studies that emphasize feature selection, zero-day evaluation, or MCU-class deployment [6,61,62]. Unsupervised and semi-supervised approaches such as autoencoders, one-class classifiers, isolation forests, and counter-propagation networks are a smaller but growing subset [7,57,63].

3.2. RQ1: Lightweighting Strategies

The included studies were coded against seven non-exclusive lightweighting strategies: architecture-centric compactness, data-centric reduction, quantization and compression, adaptive or unsupervised learners, imbalance-aware pipelines, distributed/split/federated schemes, and multi-access edge computing (MEC) or offloading. Table 5 summarizes the taxonomy and representative studies.

Architecture-centric designs are the most frequent strategy in the corpus. PNet-IDS [4] illustrates the approach with partial convolutions and channel shuffle operations that reduce parameter count while retaining classification accuracy. The lightweight TCN family [5] couples a compact temporal architecture with post-training quantization and Raspberry Pi deployment. Several compact CNN variants [64,65,66] pursue depthwise separable convolutions in the tradition of MobileNet [70]. Data-centric strategies treat feature engineering, dimensionality reduction, and sample compression as first-class deployability levers rather than auxiliary preprocessing. The ensemble-with-PCA pipeline [6], two-stage feature-selection pipelines [12], and the ensemble feature selection (ELIDS) work [61,62] argue that the model a device executes is the full data path, not just the classifier.

Quantization and compression appear predominantly as secondary techniques applied in addition to another strategy. INT8 and dynamic quantization are increasingly used to reduce model size and support deployment on memory-constrained IoT devices, although the runtime and energy effects are still not reported consistently [5,11]. Adaptive and unsupervised strategies target drift and zero-day readiness rather than raw compactness. CPN-GHSOM [7] combines counter-propagation with a growing hierarchical self-organizing map; HoloTiny-AD [57] pursues trust-aware anomaly detection on resource-constrained devices. Distributed strategies shift burden rather than reduce it: multi-hop split learning [3] partitions a GRU-family model across edge nodes; federated learning for IDSs [53,54,68] aggregates locally trained models without raw data centralization. Recent work on offloaded embeddings for IDSs [69] shows that a compact on-device embedding stage can feed a richer back-end classifier at an MEC node.
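To make the size effect of INT8 quantization concrete, the following sketch applies symmetric per-tensor quantization to a small list of weights. This is a generic textbook scheme written in plain Python for illustration; it is not the toolchain or method of any cited study, and the weight values are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clip to the INT8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [q * scale for q in quantized]


# Hypothetical weights; a real model would hold thousands to millions.
weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("worst reconstruction error:", worst_error)
# Stored as one byte per weight instead of four for float32, plus one scale.
```

The storage saving is direct, but whether latency or energy improves depends on the runtime and on whether the target hardware executes integer kernels, which is why those effects need to be measured rather than inferred from model size.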

3.3. RQ2: Datasets and Evaluation Protocols

Dataset choice shows a clear generational shift. Earlier studies (2017–2020) lean heavily on NSL-KDD [44], UNSW-NB15 [45], and CICIDS2017 [47]. From 2021 onward, IoT-specific datasets become dominant, including BoT-IoT [46], IoT-23 [48], ToN-IoT [49], CICIoT2023 [27], and Edge–IIoT [50]. Wi-Fi-specific intrusion detection is comparatively underrepresented in the corpus: only one included study [15] evaluated on the Aegean Wi-Fi Intrusion Dataset family, specifically AWID2 [71], and none used the more recent enterprise-oriented AWID3 [72], even though AWID2 and AWID3 remain the principal public 802.11 benchmarks for this domain (Table 6). This limited uptake of the dominant Wi-Fi benchmarks is itself relevant for 802.11-centric edge deployments. Table 6 summarizes usage across the corpus. CICIDS2017 appears frequently because it remains a widely recognized transitional benchmark, but its enterprise traffic design limits its IoT realism; therefore, frequent usage should not be read as evidence that it is sufficient for deployment-grade IoT–edge validation. Custom or testbed-generated captures are used in 13 of 78 studies, usually for protocol-specific or vehicular scenarios. Even when the same dataset is named, versions of preprocessing, attack category aggregation, split policy, and feature subset can differ enough to invalidate naive leaderboard comparisons [27,49].

Split practice is the single largest source of apparent performance inflation in the corpus. Random in-distribution stratified splits dominate (71% of included studies), followed by holdout validation sets. Time-aware splits appear in only 12% of studies, and zero-day or attack holdout splits are reported in 23%. Device shift or site shift splits, in which training and test data come from non-overlapping devices or sites, are rarer still (8%). Studies claiming “real-world readiness” using only random splits should be read conservatively. Accuracy and F1 are almost universally reported. Per-class recall is reported in most studies but frequently omitted for minority attack classes, which is precisely where operational performance matters the most. AUROC and precision–recall AUC are reported by a minority. Calibration metrics are essentially absent. Explicit zero-day evaluation is reported in 18 of 78 studies, and open-set recognition is rare outside the anomaly detection subset.
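The difference between these split protocols is easiest to see in code. The sketch below implements a time-aware split, a zero-day (attack holdout) split, and a device shift split over a list of flow records. The record fields (`ts`, `device`, `label`) and the sample flows are hypothetical, and real benchmarks would need dataset-specific handling of sessions and label aggregation.

```python
from typing import Any, Dict, List, Set, Tuple

Record = Dict[str, Any]
Split = Tuple[List[Record], List[Record]]


def time_aware_split(records: List[Record], train_fraction: float = 0.8) -> Split:
    """Train on the earliest records and test on the later ones."""
    ordered = sorted(records, key=lambda r: r["ts"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def zero_day_split(records: List[Record], held_out: Set[str],
                   train_fraction: float = 0.8) -> Split:
    """Time-aware split in which held-out attack labels never reach training."""
    train, test = time_aware_split(records, train_fraction)
    train = [r for r in train if r["label"] not in held_out]
    return train, test


def device_shift_split(records: List[Record], test_devices: Set[str]) -> Split:
    """Train and test on non-overlapping devices."""
    train = [r for r in records if r["device"] not in test_devices]
    test = [r for r in records if r["device"] in test_devices]
    return train, test


# Hypothetical flow records, for illustration only.
flows = [
    {"ts": 1, "device": "cam-1", "label": "benign"},
    {"ts": 2, "device": "cam-1", "label": "ddos"},
    {"ts": 3, "device": "plug-1", "label": "benign"},
    {"ts": 4, "device": "plug-1", "label": "mirai"},
    {"ts": 5, "device": "cam-2", "label": "mirai"},
]

zd_train, zd_test = zero_day_split(flows, held_out={"mirai"})
print("zero-day train labels:", [r["label"] for r in zd_train])
print("zero-day test labels:", [r["label"] for r in zd_test])
```

A random stratified split would mix records from every period, device, and attack family into both partitions, which is why it tends to report higher scores than any of the three protocols sketched here.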

3.4. RQ3: Hardware-Level Evidence

Target-device choice spans four tiers: cloud/GPU, gateway-class edge, microcontroller/TinyML, and “not reported”. Table 7 summarizes reporting frequencies by tier. Gateway-class devices (predominantly the Raspberry Pi 3/4/5 family, NVIDIA Jetson Nano/Xavier, and Google Coral) are the most common edge platforms (35 of 78 studies). This matters because gateway-class boards are not equivalent to MCU-class endpoints: Raspberry Pi and Jetson devices usually run full operating systems and have far more memory and power headroom, whereas Cortex-M-class deployments often require kilobyte-scale memory planning, integer-only runtimes, and stricter energy accounting [73,74,75]. Microcontroller-class endpoints (ESP32, STM32, Arduino Nicla, and ARM Cortex-M boards) appear in a small but growing subset (nine studies), usually in studies self-identifying as TinyML [9,10,55,56,57,73,74,75]. A non-trivial fraction of the corpus (12 of 78) reports no hardware evidence at all and relies on architectural proxies.

Inference latency is reported in 31% of included studies, of which only 38% specify whether latency is end-to-end or model-only, and only 21% report warm-up strategy, the number of runs, and concurrency assumptions. Throughput is reported even less often (18%). Moreover, memory footprint is reported in two forms: stored model size (widely reported) and runtime memory use (reported in 26% of studies). For microcontroller-class deployment, SRAM is the binding constraint, and the absence of runtime memory reporting in most gateway-class studies makes extrapolation to MCU feasibility impossible. Power and energy are the weakest evidence dimension. Only 17% of the included studies report measured power (W) or energy per inference (mJ or microjoules), despite the availability of repeatable embedded and edge device energy measurement methods [76,77,78]. Runtime disclosure (which inference engine, quantization toolchain, and OS configuration) is reported in 47% of studies. TensorFlow Lite, TensorFlow Lite Micro, ONNX Runtime, and PyTorch Mobile are the most common runtimes referenced. Exact software versions were not consistently reported across the included studies.
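The measurement context that is often missing from latency reports can be captured in a small harness. The sketch below times a stand-in inference function after a warm-up phase and returns the context fields alongside the statistics. The function `dummy_infer`, the default warm-up and run counts, and the report field names are assumptions for illustration; on a real target the callable would wrap the deployed runtime, and thermal state and power would be recorded separately.

```python
import math
import statistics
import time
from typing import Any, Callable, Dict, List


def measure_latency(infer: Callable[[Any], Any], sample: Any,
                    warmup: int = 10, runs: int = 100) -> Dict[str, Any]:
    """Time single-sample inference and return statistics with their context."""
    for _ in range(warmup):  # warm caches and lazy initialisation first
        infer(sample)
    timings_ms: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    p95_rank = max(1, math.ceil(0.95 * runs))  # nearest-rank percentile
    return {
        "scope": "model-only",  # state explicitly: model-only or end-to-end
        "warmup": warmup,
        "runs": runs,
        "batch_size": 1,
        "concurrency": 1,
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_rank - 1],
    }


def dummy_infer(features: List[float]) -> bool:
    """Hypothetical stand-in for a deployed detector."""
    return sum(features) > 1.0


report = measure_latency(dummy_infer, [0.1] * 64)
print(report)
```

Reporting the whole dictionary rather than a single number is the point of the exercise: two latency figures are only comparable when scope, warm-up, run count, batch size, and concurrency are stated.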

3.5. RQ4: Operational Robustness

Most IoT intrusion datasets are severely imbalanced. Explicit imbalance handling is reported in 44% of included studies, using techniques that include SMOTE-family oversampling, focal loss, cost-sensitive training, and GAN-based augmentation [8,67]. Beyond attack category holdout, only a small minority of studies evaluate under temporal drift or environmental drift. Time-aware evaluation, adversarial robustness against evasion attacks, and robustness to data poisoning during federated training are reported in only nine, seven, and four of 78 studies, respectively. Deployment lifecycle concerns are almost entirely absent: only five of 78 studies describe an update mechanism, and none provide quantitative evidence on update cost, downtime, or rollback reliability under realistic conditions.

3.6. RQ5: Deployability Reporting Gaps

Aggregating the categorical reporting appraisal across all included studies gives a mean reporting completeness of 0.51 against the fields in Table 4 (standard deviation 0.13, range 0.18–0.83). Model- and data-level fields are the most complete; hardware- and operational-level fields are the least. No study in the corpus achieves full completeness across every field, and only six of 78 studies exceed 80% field completeness. As an exploratory check on the relationship between missing physical evidence and very high benchmark results, we coded studies reporting accuracy or F1 ≥ 0.99 as “near-perfect”. Near-perfect results appeared in nine of 12 studies with no hardware evidence (75%) but in 20 of 66 studies with at least partial hardware or system evidence (30%). Fisher’s exact test indicated a significant association (p = 0.007), with a moderate phi coefficient (0.36); Spearman correlation between the absence of hardware evidence and near-perfect reporting was rho = 0.35. This association does not prove causality, but it supports the caution that architectural-only studies can overstate field readiness when physical constraints are not measured.

Five reporting anti-patterns recur across the corpus. (i) Lightweight by assertion: A paper claims lightweightness but reports only accuracy and parameter count, with no latency, memory, or energy evidence. (ii) Hidden preprocessing cost: Expensive feature engineering is treated as free preprocessing. (iii) Binary-only evaluation in contexts in which the threat model is explicitly multi-class and imbalanced. (iv) Over-precision in cross-paper comparison, in which decimal-point differences are presented as decisive even though datasets, splits, and hardware differ substantially. (v) Neglect of deployment lifecycle: Update paths, drift monitoring, and rollback procedures are rarely discussed even though they often dominate real-world maintainability [66].
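The exploratory association check reported in this subsection can be retraced from the stated counts. The sketch below builds the 2×2 table (nine of 12 studies without hardware evidence and 20 of 66 with it reporting near-perfect results) and computes a two-sided Fisher exact p-value using only the standard library. The two-sided rule used here, summing all tables no more probable than the observed one, is an assumption about how the test was run.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / total

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # Sum every table that is no more probable than the observed one.
    return sum(p for p in (prob(k) for k in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# Rows: no hardware evidence / some hardware evidence.
# Columns: near-perfect result / not near-perfect.
no_hw = (9, 12 - 9)
some_hw = (20, 66 - 20)

p_value = fisher_exact_two_sided(*no_hw, *some_hw)
print(f"near-perfect share without hardware evidence: {no_hw[0] / 12:.2f}")
print(f"near-perfect share with hardware evidence: {some_hw[0] / 66:.2f}")
print(f"two-sided Fisher exact p = {p_value:.4f}")
```

Under these assumptions the p-value should land close to the reported 0.007; the phi and Spearman coefficients are not recomputed here because the text does not specify how they were calculated.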

3.7. A Unified Deployability Framework

3.7.1. Framework Overview

The evidence gaps identified in Sections 3.2–3.6 motivate a reusable assessment framework. We define deployability as the degree to which an IDS can be executed, maintained, and trusted on resource-constrained edge infrastructure without relying on hidden assumptions absent from the published evaluation. The framework has five dimensions: model-, data-, system-, hardware-, and operational-level. Each is scored on the unit interval, with explicit completeness penalties for missing evidence and a scenario-based weight sensitivity analysis. Figure 3 shows the five dimensions and their relationship to the composite score.

3.7.2. Dimensions, Anchors, and Scoring Procedure

Table 8 defines the five dimensions, their representative indicators, the reason each matters for edge deployment, and the most common reporting failure observed in the corpus. The indicators are deliberately chosen to be observable in a typical paper’s evaluation section. For each dimension, three anchor points (high, mid, low) guide consistent coding across reviewers.

Figure 3. Five-dimensional deployability framework with completeness penalty and scenario weighting.

The following equations define how individual indicators are normalized and combined into the dimension-level and composite scores reported in Table 9 and Appendix C. The scoring procedure is deliberately conservative. First, for each cost indicator x (parameters, FLOPs, size, latency, energy), a reversed log-scale normalization is applied:

$$s = 1 - \frac{\ln x - \ln x_{\min}}{\ln x_{\max} - \ln x_{\min}} \tag{1}$$

where x_min and x_max are corpus-derived bounds for this indicator after unit harmonization. To reduce distortion from extreme GPU- or MCU-scale outliers, the primary analysis uses the 5th and 95th percentile values as lower and upper bounds, with observations outside the interval clipped to the nearest bound before normalization. Log scaling reflects the fact that deployment cost changes meaningfully at order-of-magnitude boundaries; a 10× increase in parameters or latency can cross a feasibility threshold that a small linear change would not reflect. Boundary sensitivity was checked using raw observed minima/maxima and device-tier-specific bounds. Across these alternatives, the rank correlation between the balanced-scenario scores and the primary score remained high (Spearman rho = 0.94–0.97), and the main qualitative conclusion that no family dominates across scenarios was unchanged.

Second, benefit indicators (presence of target device measurement, zero-day evaluation, quantization, documented update path, etc.) are mapped to bounded scores on the unit interval using fixed anchors: high = 1.0 when the item is measured or fully documented in the target deployment context; mid = 0.5 when evidence is present but partial, indirect, or measured only on a proxy device; and low = 0.0 when the evidence is absent or contradicted by the method description. These thresholds are recorded in the supplementary scoring worksheet so that future reviewers can reproduce or modify the coding.

Third, a completeness factor penalizes missing evidence so that omission is not silently rewarded:

$$D_d = c \times \operatorname{mean}(s_{\text{reported}}), \qquad c = \frac{r}{m} \tag{2}$$

where r is the number of reported applicable indicators in dimension d, m is the number of expected applicable indicators for that dimension and mean(s_reported) is the average of normalized indicator scores across the reported subset. A field marked “not applicable” is excluded from m only when the paper’s declared deployment scope makes the indicator genuinely irrelevant; for example, a cloud-assisted split learning design is not expected to report MCU flash occupancy, but it should still report coordination overhead and communication latency. Missing applicable fields remain in m and reduce c. The linear penalty was chosen because this review’s purpose is evidence mapping across heterogeneous studies rather than pass/fail certification. A sensitivity check with a weighted critical metric penalty, giving double weight to missing runtime memory and power/energy in hardware-focused scenarios, did not change the main interpretation, although it lowered scores for studies with architectural claims but no device-level evidence. For engineering procurement or certification, those critical metrics may reasonably be treated as exclusionary. The overall deployability score is as follows:

$$S = \sum_{d} w_d \, D_d, \qquad \sum_{d} w_d = 1 \tag{3}$$

where w_d is the weight assigned to dimension d under a given scenario. Equal weights (0.20 each) define the balanced scenario; the hardware- and operational-priority scenarios assign 0.50 to the hardware or operational dimension respectively and 0.125 to each of the other four dimensions.
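The normalization and completeness steps in Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the supplementary score calculator: the function names, the indicator names, and the numeric bounds in the usage lines are hypothetical, and the optional `weights` argument is only one possible reading of the "double weight" critical metric check described above.

```python
import math
from statistics import mean


def reversed_log_score(x, x_min, x_max):
    """Equation (1): map a cost indicator to [0, 1]; a lower cost scores higher."""
    if not 0 < x_min < x_max:
        raise ValueError("bounds must satisfy 0 < x_min < x_max")
    if x <= 0:
        raise ValueError("cost indicators must be positive")
    # Clip to the bounds before normalizing (the text uses percentile bounds).
    clipped = min(max(x, x_min), x_max)
    span = math.log(x_max) - math.log(x_min)
    return 1.0 - (math.log(clipped) - math.log(x_min)) / span


def dimension_score(indicators, weights=None):
    """Equation (2): D = c * mean(s_reported), with c = r / m.

    `indicators` maps every applicable indicator to its score in [0, 1],
    or to None when the paper does not report it. Indicators judged
    "not applicable" are simply left out of the mapping.
    `weights` (assumption) lets selected indicators count more in c.
    """
    if not indicators:
        raise ValueError("at least one applicable indicator is required")
    reported = [s for s in indicators.values() if s is not None]
    if not reported:
        return 0.0
    if weights is None:
        completeness = len(reported) / len(indicators)
    else:
        total = sum(weights.get(name, 1.0) for name in indicators)
        present = sum(weights.get(name, 1.0)
                      for name, s in indicators.items() if s is not None)
        completeness = present / total
    return completeness * mean(reported)


if __name__ == "__main__":
    # Hypothetical bounds: 10 K to 10 M parameters.
    print(reversed_log_score(1e5, 1e4, 1e7))
    hardware = {"latency": 0.8, "runtime_memory": None,
                "energy": 0.6, "target_device": 1.0}
    print(dimension_score(hardware))
    print(dimension_score(hardware, {"runtime_memory": 2.0, "energy": 2.0}))
```

In this sketch a missing indicator lowers the dimension score only through the completeness factor c, so an unreported field is never scored as if it were favorable.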

3.7.3. Worked Example and Sensitivity Analysis

The framework is applied to six representative families drawn from the fully coded corpus: PNet-IDS [4], lightweight TCN [5], feature/sample reduction ensemble [6], CPN-GHSOM [7], TICNN + TIGAN [8], and multi-hop split learning [3]. These families were selected because they span the major observed lightweighting modes (architecture-centric, hardware-facing quantization, data-centric reduction, adaptive/unsupervised learning, imbalance-aware augmentation, and split/distributed placement); cover different device evidence levels; and are sufficiently documented to illustrate the scoring mechanics. They are not the only studies coded; the complete 78-study coding sheet is provided in Supplementary Table S1. To make the scoring auditable, consider the lightweight TCN family in detail. The charting sheet records approximately 63.6 K parameters, 248.5 KB FP32 and 89.1 KB INT8 stored size, gateway local execution on a Raspberry Pi 4, sub-millisecond per-sample inference, a reported power of approximately 4.2–4.6 W, post-training INT8 quantization, minimal preprocessing burden, and limited novelty evaluation. After the reversed-log normalization of the cost indicators and unit interval coding of the benefit indicators, the dimension means multiplied by completeness factors yield the following: model-level—0.73; data-level—0.75; system-level—0.68; hardware-level—0.90; operational-level—0.49. Under balanced weights, the composite deployability score is S_balanced = 0.71. Under the hardware- and operational-priority scenarios, the same dimension vector yields approximately 0.78 and 0.63, respectively.
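As a check on the arithmetic, the three composites quoted for the lightweight TCN family follow from substituting its five dimension scores into Equation (3) with the scenario weights defined in Section 3.7.2:

```latex
\begin{align*}
S_{\text{balanced}} &= 0.20\,(0.73 + 0.75 + 0.68 + 0.90 + 0.49) = 0.20 \times 3.55 = 0.71,\\
S_{\text{hardware}} &= 0.50\,(0.90) + 0.125\,(0.73 + 0.75 + 0.68 + 0.49) = 0.45 + 0.33125 = 0.78125 \approx 0.78,\\
S_{\text{operational}} &= 0.50\,(0.49) + 0.125\,(0.73 + 0.75 + 0.68 + 0.90) = 0.245 + 0.3825 = 0.6275 \approx 0.63.
\end{align*}
```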

Table 9 reports composite deployability scores across the six families under the three weighting scenarios. Figure 4 visualizes the same scores as a line chart. Rankings shift substantially between scenarios. The lightweight TCN score rises from 0.71 under balanced weights to 0.78 under hardware priority, whereas the feature/sample reduction ensemble score rises from 0.69 under balanced weights to 0.76 under operational priority. Following a re-check of the system-level coding, the TICNN + TIGAN family is treated as a heavier end-to-end pipeline and its system score is reduced, yielding balanced, hardware-priority, and operational-priority composites of 0.46, 0.41, and 0.56, respectively. This ranking instability is informative rather than problematic: it exposes the sensitivity of any single-number summary to the stakeholder priority structure.

Three interpretive constraints apply. First, the composite is not a certification metric; it supports scenario-specific decisions about which family to deploy under which priorities. Second, the framework rewards evidentiary completeness rather than rhetorical confidence. A paper with excellent reported accuracy but no hardware or operational evidence will receive a moderate or low score, and that is intentional. Third, accuracy is deliberately kept outside the composite. A model could otherwise appear highly deployable merely because it was tested on an easier binary task. The composite should therefore be read alongside the task definition, metric choice, and reported accuracy in the source paper.

4. Discussion

4.1. Principal Findings

Three findings dominate the synthesis. First, no single lightweighting strategy dominates across all deployability dimensions. Architectural compactness does not by itself imply operational robustness, and hardware validation does not guarantee minority-class performance. Distributed schemes trade local arithmetic for communication cost. Second, hardware- and operational-level evidence are the weakest dimensions across the corpus, yet they are the most directly relevant to practical deployment. Third, reporting completeness varies widely between studies that are otherwise similar in algorithmic choice, which indicates that reporting culture, rather than fundamental capability, is a major source of apparent performance variation.

4.2. Trends over Time

A comparison of the pre-2021 and post-2021 subsets of the corpus reveals three trends. First, hardware measurement reporting has improved modestly, driven by the increased use of Raspberry Pi and Jetson platforms. Second, TinyML endpoint studies have emerged as a distinct subset with characteristic reporting on flash/SRAM footprint and energy per inference but still-limited reporting on update paths and drift. Third, zero-day and time-aware evaluation practice has not kept pace with publication volume. Overall, the field’s headline deployability claims are outpacing the evidentiary culture needed to support them.

4.3. Implications for Design and Benchmarking

For practitioners choosing a lightweight IDS for a specific deployment, this review supports three design implications. First, identify the dominant bottleneck before selecting a model family: arithmetic cost on the device, preprocessing cost in the pipeline, coordination cost across the system, or field robustness under drift and imbalance. Second, treat missing deployment evidence as a risk rather than as neutral silence. Third, align target hardware with the target threat model. A binary DDoS detector validated on a Raspberry Pi is not automatically suitable for a multi-class industrial or vehicular monitoring role.

At the community level, this review argues for three benchmarking changes. First, leaderboard-style comparisons on a single dataset should be de-emphasized in favor of multi-dataset, multi-split evaluations. Second, minimum device-side reporting (hardware platform, runtime, latency definition, memory, and power or energy) should be treated as mandatory rather than discretionary; when possible, power should be measured with a physical monitor or a documented embedded measurement setup rather than inferred from model size alone [76,77,78]. Third, artifact release should be integral to the benchmark, not ancillary: containerized environments, checkpoint hashes, and measurement harnesses should accompany any claim of deployability. Tables 10 and 11 translate these arguments into a concrete blueprint and reporting checklist that future studies can adopt directly.
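To make the minimum device-side report concrete, the following Python sketch collects the listed fields for an arbitrary inference callable. It is an illustrative assumption, not the measurement harness in the Supplementary Materials: `device_report`, `sha256_of_file`, and the field names are hypothetical, `tracemalloc` captures only Python-heap allocations (not native runtime memory), and power is deliberately left unset because it has to come from a physical monitor.

```python
import hashlib
import math
import platform
import statistics
import time
import tracemalloc


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a checkpoint file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def device_report(predict, samples, warmup=10, checkpoint_path=None):
    """Measure predict(sample) on this device and return a reporting record."""
    if not samples:
        raise ValueError("samples must not be empty")
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for sample in samples[:warmup]:
        predict(sample)

    # Pass 1: per-sample wall-clock latency in milliseconds.
    timings_ms = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1e3)
    timings_ms.sort()
    p95 = timings_ms[max(0, math.ceil(0.95 * len(timings_ms)) - 1)]

    # Pass 2: peak Python-heap allocation, measured separately so that
    # tracemalloc overhead does not inflate the latency figures.
    tracemalloc.start()
    for sample in samples:
        predict(sample)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "hardware_platform": platform.machine(),
        "os": platform.platform(),
        "runtime": f"Python {platform.python_version()}",
        "latency_definition": "per-sample wall clock around predict(), warm-up excluded",
        "n_samples": len(samples),
        "latency_median_ms": statistics.median(timings_ms),
        "latency_p95_ms": p95,
        "peak_python_heap_bytes": peak_bytes,
        "power_w": None,  # requires a physical monitor; never inferred here
        "checkpoint_sha256": (sha256_of_file(checkpoint_path)
                              if checkpoint_path else None),
    }
```

Stating the latency definition inside the record is a design choice: it keeps per-sample and per-batch figures from being compared as if they were the same quantity.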

4.4. Research Agenda

The synthesis suggests six concrete research directions. First, unified benchmark construction: Future studies should report not only in-distribution classification but also zero-day exclusion, time-aware splits, and device or site shift evaluation, using at least two heterogeneous datasets. Without this, claims of real-world readiness will remain under-specified. Second, true end-to-end cost accounting: Researchers should measure the full path from raw telemetry to final alert, including feature extraction, transformation, communication, and inference. Third, stronger gateway-to-endpoint continuity: Raspberry Pi deployments are valuable but do not automatically translate to MCU-class endpoints, and more work is needed on aggressive compression, integer-only runtimes, TinyML runtime support, and NAS-derived compact architectures for truly tiny devices [55,56,57,73,74,75,79,80]. Fourth, operationally faithful learning under imbalance and novelty: Multi-class rare attack detection, class-imbalance-aware TinyML evaluation, open-set recognition, calibration, and human-in-the-loop triage deserve more attention than they currently receive [10]. Fifth, system-aware security design: Distributed, split, and federated learning approaches should report their communication, orchestration, and privacy trade-offs as first-class outcomes [3,53,54]. Sixth, the treatment of future 6G and low-altitude aerial IoT–edge environments as a distinct deployment frontier: UAV-enabled and integrated sensing and communication networks introduce three-dimensional mobility, rapidly changing air–ground channels, beam-tracking constraints, and stricter secrecy–energy trade-offs that can invalidate assumptions derived from static terrestrial gateways [81,82,83]. 
In particular, unified channel-model-driven optimization frameworks that jointly couple propagation modeling with system-level resource optimization provide a more robust conceptual basis for ISAC-enabled deployability assessment in dynamic three-dimensional environments [84]. Lightweight IDS research for such environments should therefore couple cyber-detection metrics with mobility-aware telemetry, communication overhead, link reliability, and energy–security co-optimization.
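The first direction, time-aware splitting combined with zero-day exclusion, can be illustrated with a short Python sketch. The `Flow` record, its field names, and the 70/30 cut are hypothetical choices made for illustration; real datasets will need their own timestamp and label handling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    timestamp: float  # seconds since capture start (assumed field)
    label: str        # "benign" or an attack class name (assumed field)


def time_aware_split(flows, train_fraction=0.7):
    """Train on the earliest flows and test on the latest ones, without shuffling."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    ordered = sorted(flows, key=lambda f: f.timestamp)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def exclude_zero_day(train, held_out_attacks):
    """Drop held-out attack classes from training so they are unseen at test time."""
    held_out = set(held_out_attacks)
    return [f for f in train if f.label not in held_out]


if __name__ == "__main__":
    labels = ["benign", "ddos", "benign", "scan", "benign",
              "ddos", "benign", "scan", "benign", "botnet"]
    flows = [Flow(float(t), lab) for t, lab in enumerate(labels)]
    train, test = time_aware_split(flows)
    train = exclude_zero_day(train, {"scan"})
    print(len(train), len(test))
```

Splitting by time before removing the held-out classes keeps future traffic out of training while still leaving the excluded attacks available for evaluation.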

4.5. Limitations and Threats to Validity

Four methodological limitations deserve explicit acknowledgement. (i) This review synthesizes published evidence rather than re-implementing every compared method on a common platform; a focused replication of representative families on shared edge hardware would be a natural next validation layer, but it is outside the current scope. (ii) Some compared papers report richer hardware evidence than others; the completeness penalty compensates partially but does not eliminate all uncertainty. (iii) The mapping from published qualitative evidence to bounded indicator scores involves judgment; this threat is mitigated by anchor definitions, scorer agreement checks, and sensitivity analysis, but alternative coding decisions could reasonably be preferred in some settings. (iv) Dataset heterogeneity remains a confounder; separating deployability from accuracy reduces benchmark-driven distortion but does not fully normalize task difficulty across binary, multi-class, and novelty-oriented evaluations.

Following the classical taxonomy of threats to validity [37], this review faces construct, internal, external, and conclusion threats. Construct validity is mitigated by grounding the dimensions in deployment practice evidence, defining scoring anchors, and publishing the indicator set. Internal validity is mitigated through independent dual screening, duplicate framework scoring on the worked examples and validation subset, and explicit consensus rules. External validity is limited by the restriction to the peer-reviewed English-language literature and the incomplete 2026 indexing window; citation chasing partially mitigates this but does not eliminate publication or language bias. Conclusion validity is mitigated by presenting scenario-based sensitivity, boundary sensitivity for Equation (1), and a critical metric penalty check rather than relying on a single composite ranking.

5. Conclusions

This PRISMA-ScR systematic review, which was cross-checked against PRISMA 2020, mapped deployability evidence across the lightweight IoT–edge intrusion detection literature and introduced a five-dimensional framework for assessing that evidence with explicit completeness penalties, scoring validation checks, and scenario-based sensitivity. The principal empirical finding is that hardware- and operational-level reporting remain the weakest dimensions of the literature, even as algorithmic and benchmark evidence have matured. The principal methodological contribution is a reusable assessment framework that other researchers can apply to new work as it is published, complete with a scoring procedure, benchmark blueprint, reporting checklist, and completed PRISMA checklist.

No single lightweight IDS family dominates every deployment requirement. The lightweight TCN family provides the clearest gateway-level hardware evidence; ensemble and adaptive families become stronger when operational robustness is prioritized; and distributed designs are the most useful when privacy and system placement outweigh local arithmetic cost. The framework’s value lies in making trade-offs explicit, penalizing missing evidence, and supporting weight-sensitive interpretation. Adopting the associated reporting protocol would move the field from rhetorical lightweightness claims toward deployability as an auditable scientific outcome.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/fi18060300/s1, Table S1: A full list of the 78 included studies and extracted coding fields; Table S2: Full-text exclusion reasons; Table S3: Dimension-level deployability scoring worksheet; File S1: PRISMA-ScR search and screening log; File S2: Deployability score calculator; File S3: Raspberry Pi measurement harness scaffold; File S4: Completed PRISMA 2020 checklist. Scoring_Validation sheet: scorer agreement, normalization-bound sensitivity, critical metric penalty, and exploratory missing evidence analysis.

Author Contributions

Conceptualization, M.M.I. and S.H.; methodology, M.M.I., S.H. and M.N.; investigation, M.M.I., S.H. and U.S.; data curation, M.M.I. and U.S.; formal analysis, M.M.I. and S.H.; validation, U.S., M.N. and S.H.; visualization, M.M.I. and U.S.; writing—original draft preparation, M.M.I.; writing—review and editing, M.M.I., U.S., M.N. and S.H.; supervision, S.H. and M.N.; project administration, M.M.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The coding workbook, PRISMA flow data, framework scoring worksheet, PRISMA 2020 checklist, deployability score calculator, and Raspberry Pi measurement harness are provided as Supplementary Materials with this submission.

Acknowledgments

During the preparation of this manuscript, the authors used Generative AI only for language refinement and the structural editing of the prose. AI tools were not used for database searching, study identification, eligibility screening, data charting, data extraction, framework scoring, statistical analysis, or scientific interpretation. All research design decisions, screening judgments, coding decisions, interpretations, and scientific claims were made and verified by the authors. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AUROC	Area Under the Receiver Operating Characteristic Curve
CNN	Convolutional Neural Network
FLOP	Floating-Point Operation
GAN	Generative Adversarial Network
GHSOM	Growing Hierarchical Self-Organizing Map
GNN	Graph Neural Network
GRU	Gated Recurrent Unit
IDS	Intrusion Detection System
IIoT	Industrial Internet of Things
IoT	Internet of Things
LSTM	Long Short-Term Memory
MCU	Microcontroller Unit
MEC	Multi-access Edge Computing
NAS	Neural Architecture Search
OTA	Over-the-Air
PCA	Principal Component Analysis
PRISMA-ScR	Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews
RNN	Recurrent Neural Network
RQ	Research Question
SMOTE	Synthetic Minority Oversampling Technique
SRAM	Static Random-Access Memory
TCN	Temporal Convolutional Network
TinyML	Tiny Machine Learning

Appendix A. Search Execution Notes

The database-specific strings in Table 1 were executed on 15 February 2026 and rerun on 31 March 2026 to capture late-indexed records. For Scopus, document type was restricted to “Article” and “Conference Paper” and language to “English”. For Web of Science, the same type and language filters were applied over the SCI-EXPANDED and ESCI editions. For IEEE Xplore, the search scope was restricted to “Conferences” and “Journals & Magazines”. For ACM Digital Library, the filter “Research Article” was applied. For ScienceDirect, the Article Type filter was set to “Research Article” and “Review Article”. Citation chasing on the included studies was performed in two passes: backward and forward (the latter through Google Scholar on 5 April 2026). Records retrieved through citation chasing were subjected to the same eligibility screening as database-identified records.

Appendix B

The full list of 78 included studies with bibliographic details, coded lightweighting strategy, dataset(s), device tier, and completeness score is provided as Supplementary Materials (coding_workbook.xlsx). The breakdown of full-text exclusion reasons is as follows:

Not peer-reviewed or insufficient peer-review evidence: 14 records.
No IoT–edge orientation (generic cloud or desktop IDS): 31 records.
No detection component (prevention- or policy-only): 11 records.
No deployability-relevant evidence (accuracy-only): 27 records.
Duplicate extended versions superseded by a later journal paper: 8 records.
Full text inaccessible after two retrieval attempts: 7 records.
Language other than English: 3 records.
Outside date range: 1 record.

Appendix C. Dimension-Level Deployability Worksheet

Table A1 reports the dimension-level values used to recover the balanced deployability scores in Table 9 and support the scenario reweighting. The entries are evidence-grounded coding scores on the unit interval after normalization and completeness penalties.

Table A1. Dimension-level scores for the representative families.

Model Family	Model	Data	System	Hardware	Operational	Balanced
PNet-IDS [4]	0.92	0.73	0.50	0.39	0.47	0.60
Lightweight TCN [5]	0.73	0.75	0.68	0.90	0.49	0.71
Feature/sample reduction ensemble [6]	0.66	0.62	0.76	0.53	0.88	0.69
CPN-GHSOM [7]	0.80	0.71	0.68	0.67	0.75	0.72
TICNN + TIGAN [8]	0.36	0.31	0.54	0.34	0.74	0.46
Multi-hop split learning [3]	0.64	0.62	0.75	0.34	0.55	0.58
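The scenario reweighting can be reproduced directly from Table A1. The Python sketch below applies Equation (3) with the balanced, hardware-priority, and operational-priority weights stated in Section 3.7.2; the function names are hypothetical, and the scenario composites it prints are recomputed from the rounded Table A1 entries rather than copied from Table 9, so small rounding differences are possible.

```python
# Dimension order follows the Table A1 columns.
DIMENSIONS = ("model", "data", "system", "hardware", "operational")

TABLE_A1 = {
    "PNet-IDS": (0.92, 0.73, 0.50, 0.39, 0.47),
    "Lightweight TCN": (0.73, 0.75, 0.68, 0.90, 0.49),
    "Feature/sample reduction ensemble": (0.66, 0.62, 0.76, 0.53, 0.88),
    "CPN-GHSOM": (0.80, 0.71, 0.68, 0.67, 0.75),
    "TICNN + TIGAN": (0.36, 0.31, 0.54, 0.34, 0.74),
    "Multi-hop split learning": (0.64, 0.62, 0.75, 0.34, 0.55),
}


def scenario_weights(priority=None):
    """Equal weights (0.20) by default; 0.50 on the priority dimension otherwise."""
    if priority is None:
        return {d: 0.20 for d in DIMENSIONS}
    if priority not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {priority}")
    return {d: (0.50 if d == priority else 0.125) for d in DIMENSIONS}


def composite(scores, weights):
    """Equation (3): weighted sum of the five dimension scores."""
    return sum(weights[d] * s for d, s in zip(DIMENSIONS, scores))


def rank(priority=None):
    """Return (family, composite) pairs sorted from highest to lowest score."""
    weights = scenario_weights(priority)
    scored = {name: composite(s, weights) for name, s in TABLE_A1.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    scenarios = (("balanced", None),
                 ("hardware-priority", "hardware"),
                 ("operational-priority", "operational"))
    for label, priority in scenarios:
        print(label)
        for name, score in rank(priority):
            print(f"  {name:<36}{score:.2f}")
```

With these inputs the top-ranked family differs by scenario: CPN-GHSOM under balanced weights, the lightweight TCN under hardware priority, and the feature/sample reduction ensemble under operational priority.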

References

Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge computing: Vision and challenges. IEEE Internet Things J. 2016, 3, 637–646. [Google Scholar] [CrossRef]
Mao, Y.; You, C.; Zhang, J.; Huang, K.; Letaief, K.B. A survey on mobile edge computing: The communication perspective. IEEE Commun. Surv. Tutor. 2017, 19, 2322–2358. [Google Scholar] [CrossRef]
Zhao, H.; Law, K.L.E.; Ng, B.K.; Lam, C.-T. Edge computing-based distributed intrusion detection systems via multi-hop split learning. IEEE Access 2026, 14, 23800–23813. [Google Scholar] [CrossRef]
Iliyasu, A.S.; Siddiqui, A.J.; Song, H.; Abdu, F.J. PNet-IDS: A lightweight and generalizable convolutional neural network for intrusion detection in Internet of Things. IEEE Access 2025, 13, 102624–102639. [Google Scholar] [CrossRef]
Akhi, M.; Eising, C.; Dhirani, L.L. Securing IoT using lightweight TCN for edge deployment on Raspberry Pi 4. IEEE Open J. Commun. Soc. 2026, 7, 442–460. [Google Scholar] [CrossRef]
Zhang, H.; Upadhyay, D.; Zaman, M.; Sampalli, S. Lightweight IoT intrusion detection via feature and sample reduction with multi-client ensemble for zero-day attacks. IEEE Access 2026, 14, 29764–29780. [Google Scholar] [CrossRef]
Baz, M. Lightweight IDS for IoT based on counter-propagation networks. IEEE Access 2025, 13, 147086–147111. [Google Scholar] [CrossRef]
Awasthi, A.; Prajapati, A.K.; Vediya, P.; Battula, R.B. TICNN: A hybrid light-weight CNN for large imbalanced multi-class datasets on intrusion detection system in IoT. IEEE Access 2026, 14, 23413–23427. [Google Scholar] [CrossRef]
Amuthadevi, C.; Venkatesan, R.; Mythily, M.; Aroul Canessane, R. TinyML-based intrusion detection systems for sustainable and energy-constrained IoT devices. Results Eng. 2025, 28, 108013. [Google Scholar] [CrossRef]
Fusco, P.; Montefusco, A.; Rimoli, G.P.; Palmieri, F.; Ficco, M. TinyML-Based Intrusion Detection System for Handling Class Imbalance in IoT-Edge Domain Using Siamese Neural Network on MCU. In Advanced Information Networking and Applications, AINA 2025; Springer: Cham, Switzerland, 2025; pp. 389–402. [Google Scholar] [CrossRef]
Misrak, S.F.; Melaku, H.M. Lightweight intrusion detection system for IoT with improved feature engineering and advanced dynamic quantization. Discov. Internet Things 2025, 5, 97. [Google Scholar] [CrossRef]
Zhang, D.; Huang, D.; Chen, Y.; Lin, S.; Li, C. A lightweight IoT intrusion detection method based on two-stage feature selection and Bayesian optimization. AIMS Electron. Electr. Eng. 2025, 9, 359–389. [Google Scholar] [CrossRef]
Rajalakshmi, S.; Madhav Kuthadi, V.; Baskar, S.; Acevedo Llanos, R. Tiny ML-Enabled Energy-Efficient Intrusion Detection System for Sustainable IoT Security in Green Cybersecurity Ecosystems. J. Internet Serv. Inf. Secur. 2025, 15, 602–625. [Google Scholar] [CrossRef]
Manivannan, D. Recent endeavors in machine learning-powered intrusion detection systems for the Internet of Things. J. Netw. Comput. Appl. 2024, 229, 103925. [Google Scholar] [CrossRef]
Almalkawi, I.T.; Alhowaide, A.; Al-Omari, A.; Shtaiwi, S.; Guerrero-Zapata, M. SecIDS-CNN-WF: A trust-aware edge-efficient CNN for real-time Wi-Fi intrusion detection. IEEE Access 2026, 14, 14996–15014. [Google Scholar] [CrossRef]
Zhou, Y.; Zhang, J.; Yang, G. A lightweight unsupervised intrusion detection model for in-vehicle edge computing based on FlexRay. IEEE Access 2026, 14, 31954–31967. [Google Scholar] [CrossRef]
Fusco, P.; Rimoli, G.P.; Ficco, M. An IoT intrusion detection system by Tiny Machine Learning. In Computational Science and Its Applications—ICCSA 2024 Workshops; Springer: Cham, Switzerland, 2024; pp. 71–82. [Google Scholar] [CrossRef]
Trilles, S.; Belmonte, A.; Gonzalez-Perez, R.; Huerta, J. Anomaly detection based on Artificial Intelligence of Things: A systematic literature mapping. Internet Things 2024, 25, 101063. [Google Scholar] [CrossRef]
Ahanger, T.A.; Alqahtani, H.; Rasool, R.U.; Aljumah, A.; Alkhalaf, K. Machine learning-inspired intrusion detection system for IoT: Security issues and future challenges. Comput. Electr. Eng. 2025, 123, 110265. [Google Scholar] [CrossRef]
Rahman, A.; Ab Rahman, N.H.; Ahmad, M. A survey on intrusion detection system in IoT networks. Cyber Secur. Appl. 2025, 3, 100082. [Google Scholar] [CrossRef]
Zhang, Y.; Park, M.; Kim, H. A review of deep learning applications in intrusion detection systems: Overcoming challenges in spatiotemporal feature extraction and data imbalance. Appl. Sci. 2025, 15, 1552. [Google Scholar] [CrossRef]
Zarpelao, R.R.; Miani, R.S.; Kawakani, C.T.; de Alvarenga, S.C. A survey of intrusion detection in Internet of Things. J. Netw. Comput. Appl. 2017, 84, 25–37. [Google Scholar] [CrossRef]
Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J. Survey of intrusion detection systems: Techniques, datasets and challenges. Cybersecurity 2019, 2, 20. [Google Scholar] [CrossRef]
Buczak, A.L.; Guven, E. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutor. 2016, 18, 1153–1176. [Google Scholar] [CrossRef]
Ferrag, M.A.; Maglaras, L.; Moschoyiannis, S.; Janicke, H. Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study. J. Inf. Secur. Appl. 2020, 50, 102419. [Google Scholar] [CrossRef]
Albulayhi, K.; Smadi, A.A.; Sheldon, F.T.; Abercrombie, R.K. IoT intrusion detection taxonomy, reference architecture, and analyses. Sensors 2021, 21, 6432. [Google Scholar] [CrossRef] [PubMed]
Neto, E.C.P.; Dadkhah, S.; Neto, R.F.M.; Lu, R.; Ghorbani, A.A.; Zohrevand, A. CICIoT2023: A real-time dataset and benchmark for large-scale attacks in IoT environment. Sensors 2023, 23, 5941. [Google Scholar] [CrossRef] [PubMed]
Chaabouni, N.; Mosbah, M.; Zemmari, A.; Sauvignac, C.; Faruki, P. Network intrusion detection for IoT security based on learning techniques. IEEE Commun. Surv. Tutor. 2019, 21, 2671–2701. [Google Scholar] [CrossRef]
Gyamfi, E.; Jurcut, A. Intrusion detection in Internet of Things systems: A review on design approaches leveraging multi-access edge computing, machine learning, and datasets. Sensors 2022, 22, 3744. [Google Scholar] [CrossRef]
da Costa, L.F.E.; Papa, V.; de Souza, A.M.; Viana, F. Internet of Things: A survey on machine learning-based intrusion detection approaches. Comput. Netw. 2019, 151, 147–157. [Google Scholar] [CrossRef]
Belenguer, L.; Navaridas, J.; Pascual, J.A. A review of federated learning in intrusion detection systems for IoT. Comput. Netw. 2025, 258, 111023. [Google Scholar] [CrossRef]
Heidari, A.; Jabraeil Jamali, M.A. Internet of Things intrusion detection systems: A comprehensive review and future directions. Clust. Comput. 2023, 26, 3753–3780. [Google Scholar] [CrossRef]
Tsimenidis, S.; Lagkas, T.; Rantos, K. Deep learning in IoT intrusion detection. J. Netw. Syst. Manag. 2022, 30, 8. [Google Scholar] [CrossRef]
Gueriani, S.I.; Kheddar, H.; Mazari, A.C. Deep reinforcement learning for intrusion detection in IoT: A survey. In Proceedings of the 2023 2nd International Conference on Electronics, Energy and Measurement (IC2EM), Medea, Algeria, 28–29 November 2023; pp. 1–7. [Google Scholar] [CrossRef]
Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.J.; Horsley, T.; Weeks, L.; et al. PRISMA extension for scoping reviews (PRISMA-ScR): Checklist and explanation. Ann. Intern. Med. 2018, 169, 467–473. [Google Scholar] [CrossRef]
Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef] [PubMed]
Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; EBSE Technical Report EBSE-2007-01; Keele University and Durham University: Keele, UK, 2007. [Google Scholar]
Shafiq, M.; Tian, Z.; Bashir, A.K.; Du, X.; Guizani, M. CorrAUC: A malicious Bot-IoT traffic detection method in IoT network using machine learning techniques. IEEE Internet Things J. 2021, 8, 3242–3254. [Google Scholar] [CrossRef]
Diro, A.; Chilamkurti, N. Distributed attack detection scheme using deep learning approach for Internet of Things. Future Gener. Comput. Syst. 2018, 82, 761–768. [Google Scholar] [CrossRef]
Vinayakumar, R.; Alazab, M.; Soman, K.P.; Poornachandran, P.; Al-Nemrat, A.; Venkatraman, S. Deep learning approach for intelligent intrusion detection system. IEEE Access 2019, 7, 41525–41550. [Google Scholar] [CrossRef]
Yin, C.; Zhu, Y.; Fei, J.; He, X. A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access 2017, 5, 21954–21961. [Google Scholar] [CrossRef]
Ullah, I.; Mahmoud, Q.H. A scheme for generating a dataset for anomalous activity detection in IoT networks. In Proceedings of the Canadian Conference on Artificial Intelligence, Ottawa, ON, Canada, 13–15 May 2020; pp. 508–520. [Google Scholar] [CrossRef]
Aldaej, M.; Younis, A.A.; Alhumam, A.; Akhunzada, A. Deep learning-inspired IoT-IDS mechanism for edge computing environments. Sensors 2023, 23, 9869. [Google Scholar] [CrossRef]
Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A.A. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, ON, Canada, 8–10 July 2009; pp. 1–6. [Google Scholar] [CrossRef]
Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems. In Proceedings of the Military Communications and Information Systems Conference (MilCIS), Canberra, Australia, 10–12 November 2015; pp. 1–6.
Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst. 2019, 100, 779–796.
Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In Proceedings of the International Conference on Information Systems Security and Privacy (ICISSP), Funchal, Portugal, 22–24 January 2018; pp. 108–116.
Garcia, S.; Parmisano, A.; Erquiaga, M.J. IoT-23: A Labeled Dataset with Malicious and Benign IoT Network Traffic; Zenodo: Genève, Switzerland, 2020.
Booij, T.; Chiscop, I.; Meeuwissen, E.; Moustafa, N.; Sheldon, F.T. ToN_IoT: The role of heterogeneity and the need for standardization of features and attack types in IoT network intrusion data sets. IEEE Internet Things J. 2022, 9, 485–496.
Ferrag, M.A.; Friha, O.; Hamouda, D.; Maglaras, L.; Janicke, H. Edge-IIoTset: A new comprehensive realistic cyber security dataset of IoT and IIoT applications for centralized and federated learning. IEEE Access 2022, 10, 40281–40306.
Vasilomanolakis, E.; Karuppayah, S.; Muhlhauser, M.; Fischer, M. Taxonomy and survey of collaborative intrusion detection. ACM Comput. Surv. 2015, 47, 55.
Zolanvari, M.; Teixeira, M.A.; Gupta, L.; Khan, K.M.; Jain, R. Machine learning-based network vulnerability analysis of industrial Internet of Things. IEEE Internet Things J. 2019, 6, 6822–6834.
Mothukuri, V.; Khare, P.; Parizi, R.M.; Pouriyeh, S.; Dehghantanha, A.; Srivastava, G. Federated learning-based anomaly detection for IoT security attacks. IEEE Internet Things J. 2022, 9, 2545–2554.
Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–19.
Dutta, S.; Bharali, A. TinyML meets IoT: A comprehensive survey. Internet Things 2021, 16, 100461.
Abadade, Y.; Temouden, A.; Bamoumen, H.; Benamar, N.; Chtouki, Y.; Hafid, A.S. A comprehensive survey on TinyML. IEEE Access 2023, 11, 96892–96922.
Zhao, Y.; Srivastava, G.; Ullah, F. HoloTiny-AD: A trustworthy anomaly detection in resource-constrained IoT devices using holographic TinyML and deep metaheuristics. IEEE Internet Things J. 2026, 13, 8348–8358.
Alwaisi, A.; Al-Kasassbeh, M.; Almseidin, M.; Alauthman, M.; Pasha, M.F. Securing constrained IoT systems: A lightweight machine learning approach for anomaly detection and prevention. Internet Things 2024, 28, 101398.
Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46.
Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174.
Fatima, S.; Ali, A.; Abbasi, S.H.; Mohamad, A.A.H.; Baker, T. Towards ensemble feature selection for lightweight intrusion detection in resource-constrained IoT devices. Future Internet 2024, 16, 368.
Fatima, S.; Ali, A.; Abbasi, S.H.; Mohamad, A.A.H.; Baker, T. ELIDS: Ensemble feature selection for lightweight intrusion detection against DDoS attacks in resource-constrained IoT environment. Future Gener. Comput. Syst. 2024, 159, 172–187.
Abeshu, A.; Chilamkurti, N. Deep learning: The frontier for distributed attack detection in fog-to-things computing. IEEE Commun. Mag. 2018, 56, 169–175.
Yao, L.; Wang, S.; Li, J.; Wang, L.; Liu, Z. A lightweight intelligent network intrusion detection system using one-class autoencoder and ensemble learning for IoT. Sensors 2023, 23, 4141.
Abbas, A.; Khan, M.A.; Latif, S.; Ajaz, M.; Shah, A.A.; Ahmad, J. A new ensemble-based intrusion detection system for Internet of Things. Arab. J. Sci. Eng. 2022, 47, 1805–1819.
Saba, S.; Rehman, A.; Sadad, T.; Kolivand, H.; Bahaj, S.A. Anomaly-based intrusion detection system for IoT networks through deep learning model. Comput. Electr. Eng. 2022, 99, 107810.
Bennour, A.; Elhoseny, M.; Veerasamy, B.D.; Bhatt, R.; Agrawal, A.; Shukla, P.K.; Ghabband, F. An innovative framework to securely transfer data through the Internet of Things using advanced generative adversarial networks. Concurr. Eng. Res. Appl. 2025, 33, 67–81.
Nguyen, T.D.; Marchal, S.; Miettinen, M.; Fereidooni, H.; Asokan, N.; Sadeghi, A.-R. DIoT: A federated self-learning anomaly detection system for IoT. In Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS), Dallas, TX, USA, 7–10 July 2019; pp. 756–767.
Adjewa, F.; Esseghir, M.; Merghem-Boulahia, L. From edge transformer to IoT decisions: Offloaded embeddings for lightweight intrusion detection. Sensors 2026, 26, 356.
Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
Kolias, C.; Kambourakis, G.; Stavrou, A.; Gritzalis, S. Intrusion Detection in 802.11 Networks: Empirical Evaluation of Threats and a Public Dataset. IEEE Commun. Surv. Tutor. 2016, 18, 184–208.
Chatzoglou, E.; Kambourakis, G.; Kolias, C. Empirical Evaluation of Attacks Against IEEE 802.11 Enterprise Networks: The AWID3 Dataset. IEEE Access 2021, 9, 34188–34205.
Rajput, S.; Widmayer, T.; Shang, Z.; Kechagia, M.; Sarro, F.; Sharma, T. Enhancing Energy-Awareness in Deep Learning through Fine-Grained Energy Measurement. ACM Trans. Softw. Eng. Methodol. 2024, 33, 1–34.
Wu, H.; Chen, C.; Weng, K. Two Designs of Automatic Embedded System Energy Consumption Measuring Platforms Using GPIO. Appl. Sci. 2020, 10, 4866.
Elhanashi, A.; Dini, P.; Saponara, S.; Zheng, Q. Advancements in TinyML: Applications, Limitations, and Impact on IoT Devices. Electronics 2024, 13, 3562.
Paleyes, A.; Urma, R.-G.; Lawrence, N.D. Challenges in deploying machine learning: A survey of case studies. ACM Comput. Surv. 2022, 55, 114.
Hamdouchi, S.; Idri, A. Empowering IoT security: Deploying TinyML ensemble techniques for cyberattack detection. Sci. Afr. 2025, 29, e02809.
Tu, X.; Mallik, A.; Chen, D.; Han, K.; Altintas, O.; Wang, H.; Xie, J. Unveiling Energy Efficiency in Deep Learning: Measurement, Prediction, and Scoring across Edge Devices. In Proceedings of the 8th ACM/IEEE Symposium on Edge Computing (SEC), Wilmington, DE, USA, 6–9 December 2023; pp. 1–14.
Lin, J.; Chen, W.-M.; Lin, Y.; Cohn, J.; Gan, C.; Han, S. MCUNet: Tiny Deep Learning on IoT Devices. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020.
David, R.; Duke, J.; Jain, A.; Janapa Reddi, V.; Jeffries, N.; Li, J.; Kreeger, N.; Nappier, I.; Natraj, M.; Regev, S.; et al. TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems. In Proceedings of the Machine Learning and Systems, Virtual, 5–9 April 2021.
Ma, Z.; Liang, Y.; Zhu, Q.; Zheng, J.; Lian, Z.; Zeng, L.; Fu, C.; Peng, Y.; Ai, B. Hybrid-RIS-Assisted Cellular ISAC Networks for UAV-Enabled Low-Altitude Economy via Deep Reinforcement Learning with Mixture-of-Experts. IEEE Trans. Cogn. Commun. Netw. 2026, 12, 3875–3888.
Song, Y.; Zeng, Y.; Yang, Y.; Ren, Z.; Cheng, G.; Xu, X.; Xu, J.; Jin, S.; Zhang, R. An Overview of Cellular ISAC for Low-Altitude UAV: New Opportunities and Challenges. IEEE Commun. Mag. 2025, 63, 88–95.
Ma, S.; Zhang, R.; Ma, Z.; Liu, G.; Niyato, D.; Ai, B.; Zhang, R. Energy-Efficient Transmission in STAR-RIS Assisted Secure ISAC Networks with RSMA: An MoE-RBPPO Approach. IEEE Trans. Veh. Technol. 2026. early access.
Ma, Z.; Lin, Y.; Hua, B.; Mao, K.; Zeng, L.; Lian, Z. SIM-Empowered LAINs: A Unified Channel Model-Driven Optimization Framework. IEEE Wirel. Commun. 2026. early access.

Figure 2. Annual publication counts (2017–2026) stratified by deployment evidence level (full, partial, architectural-only).

Figure 4. Deployability score sensitivity across balanced, hardware-priority, and operational-priority weighting scenarios; values from Table 9.

Table 1. The database-specific search strings used in the systematic review.

Database	Search String	Date Range
Scopus	TITLE-ABS-KEY ((“intrusion detection” OR “IDS” OR “anomaly detection”) AND (“IoT” OR “Internet of Things” OR “edge” OR “fog” OR “TinyML” OR “embedded”) AND (“lightweight” OR “compact” OR “resource-constrained” OR “on-device” OR “low-power”))	2017–2026
Web of Science	TS = ((“intrusion detection” OR IDS) AND (IoT OR “Internet of Things” OR edge OR fog OR TinyML) AND (lightweight OR compact OR “resource-constrained” OR on-device))	2017–2026
IEEE Xplore	(“All Metadata”: lightweight OR compact) AND (“All Metadata”: “intrusion detection”) AND (“All Metadata”: IoT OR edge OR fog OR TinyML)	2017–2026
ACM Digital Library	[Abstract: “intrusion detection”] AND [Abstract: lightweight OR compact OR on-device] AND [Abstract: IoT OR edge OR fog OR TinyML]	2017–2026
ScienceDirect	Title, abstract, keywords: (“intrusion detection” OR “anomaly detection”) AND (IoT OR edge OR TinyML) AND (lightweight OR compact OR “resource-constrained”)	2017–2026

Table 2. Inclusion and exclusion criteria applied during screening.

Criterion	Inclusion	Exclusion
Study type	Peer-reviewed journal or conference articles; extended conference versions.	Editorials, opinion pieces, short abstracts (<4 pages), posters, books, theses, non-peer-reviewed preprints.
Topic	IDS or anomaly detection for IoT–edge environments with an explicit lightweight, on-device, edge, fog, or TinyML claim.	Cloud-only IDS; generic ML/DL studies without IoT–edge orientation; pure cryptographic defenses; intrusion prevention without a detection component.
Evidence	Proposes a detection method, reports evaluation metrics, and addresses at least one deployability dimension.	No detection evaluation; no deployability-relevant reporting; vendor white papers with no reproducible evidence.
Language	English.	Non-English; conference presentations without an accompanying peer-reviewed paper.
Date	1 January 2017 to 31 March 2026.	Outside this window.
Access	Full text accessible through institutional or open access channels.	Abstract-only records; paywalled records with no retrievable full text after two attempts.

Table 3. PRISMA-ScR flow counts for the systematic review.

Stage	Records
Records identified—Scopus	487
Records identified—Web of Science	312
Records identified—IEEE Xplore	396
Records identified—ACM Digital Library	128
Records identified—ScienceDirect	203
Additional records from citation chasing	23
Total identified before duplicate removal	1549
Duplicates removed	387
Records after deduplication	1162
Excluded at title/abstract screening	982
Records assessed at full text	180
Excluded at full text	102
Studies included in this review	78
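
The stage counts in Table 3 can be cross-checked arithmetically. The sketch below recomputes each stage from the values in the table; the variable names are illustrative and do not come from the review.

```python
# Values copied from Table 3; variable names are illustrative.
identified = {
    "Scopus": 487,
    "Web of Science": 312,
    "IEEE Xplore": 396,
    "ACM Digital Library": 128,
    "ScienceDirect": 203,
    "Citation chasing": 23,
}
duplicates_removed = 387
excluded_title_abstract = 982
excluded_full_text = 102

# Each stage is the previous stage minus the records excluded at that step.
total_identified = sum(identified.values())
after_dedup = total_identified - duplicates_removed
full_text_assessed = after_dedup - excluded_title_abstract
included = full_text_assessed - excluded_full_text

print(total_identified, after_dedup, full_text_assessed, included)
# prints: 1549 1162 180 78
```

The recomputed values match the totals reported in the table.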

Table 4. Data charting categories and extracted fields.

Category	Extracted Fields
Bibliographic	Authors, year, venue, country of first author, funding declaration.
Model	Detection family (CNN, TCN, RNN, GNN, ensemble, unsupervised, split/federated), parameters, FLOPs, stored model size before/after compression.
Data	Datasets used, feature count retained, preprocessing steps, feature reduction or sampling method, offline/online boundary.
System	Execution placement (endpoint, gateway, fog, cloud-assist, split/federated), coordination overhead.
Hardware	Target device(s), runtime/toolchain, inference latency (end-to-end versus model-only), throughput, memory footprint, power/energy.
Operational	Binary versus multi-class, class imbalance handling, zero-day/time-aware/device shift evaluation, model update path.
Reproducibility	Code availability, dataset version, split seeds, measurement harness, artifact manifest.

Table 5. Taxonomy of lightweighting strategies observed across included studies.

Strategy Family	Primary Mechanism	Representative Studies	Typical Strengths	Typical Trade-Offs
Architecture-centric compactness	Partial convolutions, depthwise separable layers, compact temporal blocks, channel shuffle.	PNet-IDS [4], lightweight TCN [5], compact CNN variants [64,65,66]	Low parameter count, low FLOPs, predictable inference cost.	Device-side validation often incomplete.
Data-centric reduction	Feature selection, dimensionality reduction (PCA, autoencoders), sample compression.	Ensemble + PCA pipeline [6], two-stage feature selection [12], and ELIDS [61,62]	Reduces preprocessing and inference cost jointly.	Preprocessing cost itself can dominate runtime.
Quantization and compression	INT8/INT4 post-training quantization, pruning, knowledge distillation.	Lightweight TCN (INT8) [5], dynamic quantization IDS [11], TinyML detectors [13,55,56,57]	Large size reductions; enables MCU deployment.	Toolchain-dependent; rarely ablated.
Adaptive/unsupervised learners	Counter-propagation networks, growing hierarchical SOMs, one-class classifiers.	CPN-GHSOM [7], HoloTiny-AD [57]	Graceful handling of drift and novelty.	Threshold calibration is fragile.
Imbalance-aware pipelines	GAN-based augmentation, focal loss, SMOTE oversampling.	TICNN + TIGAN [8], Siamese TinyML IDS [10], and hybrid GAN-IDS [67]	Improves minority-class recall.	Heavier end-to-end pipeline.
Distributed/split/federated	Model partitioning across gateway/cloud, federated averaging.	Multi-hop split learning [3], FL-IDS [53,54], and DIoT [68]	Reduces raw data centralization; addresses privacy.	Communication and orchestration overhead.
MEC/offloading-based	Embedding extraction on edge, classification offloaded to MEC server.	MEC-NIDS surveys [51] and SEED [69]	Retains endpoint compactness.	Latency and privacy depend on link quality.
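
As a rough illustration of the quantization row in Table 5, the sketch below applies symmetric per-tensor INT8 quantization to a short weight list. It is a generic textbook scheme, not the procedure of any included study; the weights and function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 0.91, -0.44]  # hypothetical float32 weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error <= scale / 2 + 1e-12)
```

Storing one byte per weight instead of four gives roughly a 4x size reduction over float32, before any toolchain-specific overhead.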

Table 6. Dataset usage across the included studies.

Dataset	Usage	Scope	Typical Evaluation Context
CICIoT2023 [27]	21/78	IoT traffic, 33 attack types, 100+ devices	Multi-class; binary used in some studies.
ToN-IoT [49]	19/78	Telemetry + network + OS; heterogeneous sources	Multi-class with cross-modality splits.
BoT-IoT [46]	17/78	Large-scale botnet traffic on IoT testbed	Multi-class; strong class imbalance.
CICIDS2017 [47]	24/78	Enterprise traffic with common attacks	Binary and multi-class; legacy benchmark.
IoT-23 [48]	11/78	Malware-labeled IoT network captures	Binary and family-level multi-class.
UNSW-NB15 [45]	9/78	General network traffic (non-IoT-specific)	Often used for transfer baselines.
Edge-IIoTset [50]	14/78	Industrial IoT, 14 attack categories	Multi-class; growing adoption.
NSL-KDD [44]	8/78	KDD99 revision	Legacy baseline; limited IoT realism.
AWID3 [72]	0/78	Enterprise 802.11 Wi-Fi; 802.1X/EAP and WPA2/WPA3 traffic with PMF (802.11w) attacks (KRACK, Kr00k, deauthentication, evil twin, botnet).	Key enterprise Wi-Fi benchmark; not used by any included study (coverage gap).
AWID2 [71]	1/78	Original 802.11 Wi-Fi benchmark (WEP/WPA); injection, flooding, and impersonation attacks.	Multi-class Wi-Fi evaluation; used by one included study (the Wi-Fi edge IDS [10]).
Custom/testbed	13/78	Author-specific captures	Protocol- or application-specific.

Table 7. Hardware evidence reporting across device tiers.

Device Tier	Representative Platforms	Studies	Typically Reported	Typically Missing
Cloud/GPU	NVIDIA V100, A100, RTX 30/40	22/78	Inference time per batch, training time	Power, energy, runtime memory.
Gateway-class	Raspberry Pi 3/4/5, Jetson Nano/Xavier, Coral	35/78	Latency, CPU/RAM use, quantized model size	Sustained throughput, thermal, end-to-end latency.
Microcontroller	ESP32, STM32, Arduino Nicla, Sony Spresense	9/78	Flash/SRAM footprint, per-inference energy	Long-run stability, OTA path, drift handling.
No hardware reported	Offline/unspecified	12/78	Architectural proxies only	All device-level evidence.
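
Table 7 lists end-to-end latency among the items that are often missing. The sketch below shows one possible way to time model-only and end-to-end inference with explicit warm-up runs. The `preprocess` and `model` functions are hypothetical placeholders, and a real measurement would run on the target device with the deployed runtime.

```python
import statistics
import time


def measure_latency(fn, sample, warmup=10, runs=100):
    """Time fn(sample) after warm-up; return median and approximate p95 in ms."""
    for _ in range(warmup):
        fn(sample)  # warm-up calls are executed but not timed
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "runs": runs,
    }


def preprocess(raw):
    """Hypothetical feature scaling step."""
    return [float(x) / 255.0 for x in raw]


def model(features):
    """Hypothetical placeholder classifier."""
    return sum(features) > len(features) / 2


def end_to_end(raw):
    """Preprocessing plus inference, as a deployed pipeline would run."""
    return model(preprocess(raw))


raw_sample = list(range(64))
model_only = measure_latency(model, preprocess(raw_sample))
full_pipeline = measure_latency(end_to_end, raw_sample)
print(model_only, full_pipeline)
```

Reporting both figures separately shows how much of the latency comes from preprocessing rather than from the model itself.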

Table 8. Deployability dimensions, indicators, and common reporting failures.

Dimension	Representative Indicators	Why It Matters	Common Reporting Failure
Model-level	Parameters, FLOPs, stored size, compact operation design.	Compact models are easier to place on constrained devices.	Only accuracy reported.
Data-level	Retained features, transforms, reduction, offline/online split.	Preprocessing can dominate runtime and memory.	Hidden preprocessing cost.
System-level	Execution placement, coordination overhead.	Placement shapes latency, resilience, privacy, governance.	Distributed overhead omitted.
Hardware-level	Latency, throughput, power, energy, memory, quantization, runtime.	Measured device behavior is the strongest evidence.	No target device measurement.
Operational-level	Imbalance handling, zero-day/drift evaluation, updatability.	Real deployments face novelty, skew, long lifetimes.	Binary accuracy used as proxy for robustness.
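
The operational-level failure in Table 8, where accuracy stands in for robustness, can be shown with a toy case. The labels below are hypothetical: a detector that predicts "benign" for every flow on a 95/5 imbalanced set.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each class that appears in the ground truth."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced test set: 95 benign flows, 5 attack flows.
y_true = ["benign"] * 95 + ["attack"] * 5
y_pred = ["benign"] * 100  # detector never raises an alert

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(accuracy, recall["attack"], round(macro, 3))
```

Accuracy stays at 0.95 while attack recall is zero and macro-F1 falls below 0.5, which is why per-class metrics are needed to support robustness claims.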

Table 9. Deployability scores under balanced, hardware-priority, and operational-priority weighting scenarios.

Family	Balanced	Hardware-Priority	Operational-Priority	Interpretation
PNet-IDS [4]	0.60	0.52	0.55	Excellent compactness; deployment evidence sparse.
Lightweight TCN [5]	0.71	0.78	0.63	Most convincing hardware-facing evidence.
Feature/sample reduction ensemble [6]	0.69	0.63	0.76	Strong when robustness under novelty is valued.
CPN-GHSOM [7]	0.72	0.70	0.73	Balanced profile with adaptation benefits.
TICNN + TIGAN [8]	0.46	0.41	0.56	Heavy pipeline.
Multi-hop split learning [3]	0.58	0.49	0.57	Fit for privacy-sensitive collaboration.
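
To make the rank changes in Table 9 explicit, the sketch below orders the six families under each weighting scenario. The scores are copied from the table; the helper name `rank_families` is an illustrative assumption.

```python
# Scores copied from Table 9: (balanced, hardware-priority, operational-priority).
SCORES = {
    "PNet-IDS": (0.60, 0.52, 0.55),
    "Lightweight TCN": (0.71, 0.78, 0.63),
    "Feature/sample reduction ensemble": (0.69, 0.63, 0.76),
    "CPN-GHSOM": (0.72, 0.70, 0.73),
    "TICNN + TIGAN": (0.46, 0.41, 0.56),
    "Multi-hop split learning": (0.58, 0.49, 0.57),
}
SCENARIOS = ("balanced", "hardware-priority", "operational-priority")


def rank_families(scenario_index):
    """Return family names ordered from highest to lowest score."""
    return sorted(SCORES, key=lambda name: SCORES[name][scenario_index], reverse=True)


top_by_scenario = {
    scenario: rank_families(index)[0] for index, scenario in enumerate(SCENARIOS)
}
for scenario, family in top_by_scenario.items():
    print(f"{scenario}: {family}")
# prints:
# balanced: CPN-GHSOM
# hardware-priority: Lightweight TCN
# operational-priority: Feature/sample reduction ensemble
```

A different family leads under each scenario, consistent with the finding that no IDS family dominates across all deployment priorities.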

Table 10. Unified benchmark blueprint for edge–IoT IDS evaluation.

Component	Recommended Default	Why It Matters	Minimum Artifact
Datasets	At least two heterogeneous datasets (e.g., CICIoT2023 + ToN-IoT).	Reduces single-benchmark overfitting; reveals transfer limits.	Dataset version, class mapping, split definitions.
Evaluation tasks	Random split, time-aware split, zero-day holdout, device/site shift when feasible.	Separates closed-world accuracy from operational robustness.	Task scripts and seed list.
Metrics	Macro-F1, per-class recall, AUROC/PR, latency (end-to-end), memory, throughput, measured power/energy, and calibration [76,77,78].	Prevents accuracy-only claims; forces deployment evidence.	Metric implementation and logging code.
Preprocessing	Report raw features, retained features, transforms, augmentation, offline/online boundary.	Hidden preprocessing cost can dominate runtime and memory.	Preprocessing pipeline and feature schema.
Device tiers	MCU/TinyML endpoint, gateway-class edge, and distributed setup where relevant [73,74,75].	Clarifies where the model is meant to run.	Hardware bill of materials and runtime versions.
Reproducibility	Container or environment file, checkpoint hashes, measurement harness, coding workbook.	Makes claims auditable after publication.	Manifest, hashes, and execution commands.
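
Two of the evaluation tasks in Table 10, the time-aware split and the zero-day holdout, can be sketched as follows. This is one simple variant under stated assumptions: the record fields `timestamp` and `label`, the function names, and the toy data are all hypothetical.

```python
def time_aware_split(records, train_fraction=0.75):
    """Train on the earliest records and test on the latest, without shuffling."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def zero_day_holdout(train, test, held_out_classes):
    """Drop held-out attack classes from training so they appear only at test time."""
    train_seen = [r for r in train if r["label"] not in held_out_classes]
    return train_seen, test


# Hypothetical toy capture, deliberately given in reverse time order.
labels = ["benign", "ddos", "benign", "mirai", "benign", "ddos", "mirai", "benign"]
records = [{"timestamp": t, "label": lab} for t, lab in enumerate(labels)][::-1]

train, test = time_aware_split(records)
train_seen, test_unseen = zero_day_holdout(train, test, {"mirai"})

print([r["timestamp"] for r in train], [r["timestamp"] for r in test])
print(sorted({r["label"] for r in train_seen}))
```

Keeping the held-out class only in the test portion separates closed-world accuracy from behavior on attacks the model never saw during training.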

Table 11. Reporting checklist for edge–IoT IDS deployability claims.

Checklist Item	Assessment Question	Typical Evidence
Dataset integrity	Are dataset versions, class mappings, and all split seeds documented?	Manifest file; split generator script.
Model reproducibility	Can another group rebuild the architecture and hyperparameters exactly?	Config files; training script; seed control.
Compression traceability	Is the path from training checkpoint to deployed model documented?	Conversion script; quantized model hash; calibration data.
Hardware clarity	Are device model, OS, runtime, batch size, warm-up, and thermal/load conditions stated?	Benchmark README; system information dump.
Operational validity	Does the study include imbalance-aware and zero-day-oriented evaluation?	Per-class metrics; holdout protocol description.
Claim discipline	Do textual claims match what was actually measured?	Cross-check between tables, figures, and narrative.
Artifact availability	Can independent evaluators access scripts, logs, or a containerized environment?	Repository link or archived supplement.
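
Several checklist items in Table 11 ask for hashes and manifests. The sketch below shows one possible manifest layout built with SHA-256 digests. The field names, the artifact file name, and the dataset version string are hypothetical, and the temporary file stands in for a real deployed model.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(artifact_paths, seeds, dataset_version):
    """Collect dataset version, split seeds, and artifact hashes in one record."""
    return {
        "dataset_version": dataset_version,
        "split_seeds": list(seeds),
        "artifacts": {Path(p).name: sha256_of(p) for p in artifact_paths},
    }


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model_int8.bin"  # hypothetical artifact name
    model_path.write_bytes(b"abc")  # placeholder content for the demo
    manifest = build_manifest([model_path], seeds=[0, 1, 2], dataset_version="example-v1")

print(json.dumps(manifest, indent=2))
```

Publishing such a manifest alongside the paper lets independent evaluators confirm that they are testing the same model file and splits that produced the reported numbers.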


Share and Cite

MDPI and ACS Style

Islam, M.M.; Salsabil, U.; Nurmamatov, M.; Hossain, S. Lightweight Intrusion Detection Systems for IoT–Edge Environments: A PRISMA-ScR Systematic Review of Deployability Evidence and a Unified Assessment Framework. Future Internet 2026, 18, 300. https://doi.org/10.3390/fi18060300
