Article
Peer-Review Record

Wavelet-Based IoT Device Fingerprinting

Electronics 2026, 15(4), 786; https://doi.org/10.3390/electronics15040786
by Abdelfattah Amamra *, Viet Nguyen, Adam Cheung, Sarah Acosta and Thuy Linh Pham
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 3 December 2025 / Revised: 3 February 2026 / Accepted: 6 February 2026 / Published: 12 February 2026
(This article belongs to the Special Issue New Challenges in IoT Security)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The claim "to the best of our knowledge, this work is the first to apply wavelet-based analysis on network traffic for the purpose of IoT device fingerprinting" (lines 94–95) requires further support, because a work cited by the same author already transfers traffic analysis to the frequency domain using a Fourier transform (lines 197–203). The claim of primacy should be made more precise, for example: first application of DWT/WST on a specific traffic-in-matrix coding, or first systematic evaluation on these three datasets.

The TCP window size as a hardware signature is concerning because it can vary based on the network stack and operating system settings, making it unstable and not purely a hardware characteristic.

The use of IQR filtering with boundary value replacement (lines 297–302) could mitigate bursts and benign anomalies that often characterize device communication patterns. Details regarding the thresholds used are missing, as is a demonstration that this operation does not erase discriminating information.

A discussion of the splitting protocol adopted and the order of transformations is missing.

Regarding DWT and WST, details regarding the parent wavelet, decomposition levels, edge management (padding), and any feature standardization prior to the transform are missing. For WST, there is therefore no clarification on how feature interactions (F-dimension) are captured and whether the 2D matrix is a layout trick or an object that is truly operated on in a two-dimensional manner. Furthermore, the complexity analysis attributes a cost of approximately JQNlogN to WST (lines 403–407), without providing the actual values of J and Q used in the experiments, making it impossible to evaluate the computational trade-off invoked for application purposes (lines 408–410).
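To make concrete where the requested parameters enter a decomposition, the following is a minimal sketch of a multi-level DWT. It assumes a Haar mother wavelet and a simple one-sample edge extension purely for illustration; the wavelet, level count, and padding actually used in the manuscript are not stated in this record, and the function names are hypothetical.

```python
import math

def haar_dwt_level(signal, pad_mode="symmetric"):
    """One level of a Haar DWT; returns (approximation, detail) coefficients."""
    x = list(signal)
    if len(x) % 2 == 1:
        # Edge management: an odd-length input is extended by one sample.
        if pad_mode == "symmetric":
            x.append(x[-1])   # repeat the last sample
        elif pad_mode == "zero":
            x.append(0.0)     # pad with zero
        else:
            raise ValueError("unknown pad_mode: " + pad_mode)
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_dwt(signal, levels, pad_mode="symmetric"):
    """Multi-level decomposition; returns [cA_L, cD_L, ..., cD_1]."""
    coeffs = []
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2:
            break  # nothing left to decompose
        approx, detail = haar_dwt_level(approx, pad_mode)
        coeffs.insert(0, detail)
    coeffs.insert(0, approx)
    return coeffs

def band_energies(coeffs):
    """Energy per sub-band, a common choice of wavelet-derived feature."""
    return [sum(c * c for c in band) for band in coeffs]

# Hypothetical traffic feature sampled over 8 time windows.
series = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
coeffs = haar_dwt(series, levels=2)
energies = band_energies(coeffs)
```

Changing the mother wavelet, the number of levels, or the padding rule changes every coefficient downstream, which is why each of them needs to be reported for the results to be reproducible.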

Feature Reduction using PCA is motivated too generically; it is necessary to indicate the explained variance and whether PCA is fitted exclusively to the training set in each iteration.

The experiments conducted are lacking in terms of validation scheme (hold-out, k-fold, stratification), sample size, hyperparameter selection criteria for SVM/KNN/RF/XGB, number of runs, and statistical dispersion.

Table 4 reports an accuracy of 0.86 for CIC2023 with SVM, but a precision of 0.75 and a recall of 0.74 with an F1 score of 0.86: an F1 higher than both precision and recall, given the same standard definition, is mathematically inconsistent.
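The inconsistency can be checked by hand. The sketch below assumes F1 is computed from the reported precision and recall as their harmonic mean; the averaging scheme used in the manuscript is not stated in this comment.

```latex
F_1 = \frac{2PR}{P + R}, \qquad \min(P, R) \le F_1 \le \max(P, R)

P = 0.75,\; R = 0.74:\quad
F_1 = \frac{2 (0.75)(0.74)}{0.75 + 0.74} = \frac{1.11}{1.49} \approx 0.745

% Macro averaging over K classes (mean of per-class F1 scores):
\frac{1}{K} \sum_{k=1}^{K} F_{1,k}
  \le \frac{1}{K} \sum_{k=1}^{K} \frac{P_k + R_k}{2}
  = \frac{P_{\text{macro}} + R_{\text{macro}}}{2}
  \le \max(P_{\text{macro}}, R_{\text{macro}})
```

Under either reading, an F1 of 0.86 cannot accompany a precision of 0.75 and a recall of 0.74.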

In the section on DDoS robustness, Table 10 contains a clear typo in SVM 041.

Author Response

Comment 1: The claim "to the best of our knowledge, this work is the first to apply wavelet-based analysis on network traffic for the purpose of IoT device fingerprinting" (lines 94–95) requires further support, because a work cited by the same author already transfers traffic analysis to the frequency domain using a Fourier transform (lines 197–203). The claim of primacy should be made more precise, for example: first application of DWT/WST on a specific traffic-in-matrix coding, or first systematic evaluation on these three datasets.

Response 1: Thank you for the comment. The text has been updated according to the reviewer's comment: “This work is, to the best of our knowledge, the first to apply wavelet-based analysis (DWT and WST) to network-traffic signals specifically for the purpose of IoT device fingerprinting, and the first to conduct a systematic and comparative evaluation of these wavelet techniques across three distinct datasets.” (lines 86–90).

Regarding the comparison with FFT-based techniques, in the revised manuscript we have clarified the distinction between our wavelet-based approach and prior frequency-domain traffic analysis methods (lines 197–207).

Comment 2: The TCP window size as a hardware signature is concerning because it can vary based on the network stack and operating system settings, making it unstable and not purely a hardware characteristic.

Response 2: We appreciate the reviewer’s concern regarding the use of TCP window size as a standalone hardware signature, as it can indeed be influenced by operating system configurations and network stack parameters. While TCP window size may not reflect purely hardware-level characteristics on its own, its combination with additional transport-layer, physical-layer, and timing-related features provides a more reliable composite fingerprint. Our results show that the integrated feature set, rather than any single metric, contributes to distinguishing devices effectively, thereby mitigating the instability associated with using TCP window size alone.

Comment 3: The use of IQR filtering with boundary value replacement (lines 297–302) could mitigate bursts and benign anomalies that often characterize device communication patterns. Details regarding the thresholds used are missing, as is a demonstration that this operation does not erase discriminating information.

Response 3: We appreciate the reviewer’s concern regarding the IQR parameters. In the updated manuscript, the IQR multiplier is specified as follows (lines 294–303):

"We use interquartile range (IQR) filtering [38], where values falling outside a defined range are considered outliers and replaced with the boundary values. The IQR multiplier is used to identify outliers by defining thresholds based on the interquartile range (IQR). Specifically, data points are flagged as outliers if they fall below the lower bound Q1 − 1.5 × IQR or above the upper bound Q3 + 1.5 × IQR. To apply this method, we first compute Q1, Q3, and the IQR using only the training set, then derive the lower and upper bounds from the selected multiplier. These bounds are subsequently applied to the training data to remove outliers and consistently used on the validation, test, and unseen data to filter or flag anomalous instances."
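The training-only fit described in this response can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses the conventional fences Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, the default quantile method of Python's `statistics` module (other libraries interpolate differently), and hypothetical function names.

```python
import statistics

def fit_iqr_bounds(train_values, multiplier=1.5):
    """Compute lower and upper fences from the training data only."""
    q1, _, q3 = statistics.quantiles(train_values, n=4)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def clip_to_bounds(values, bounds):
    """Replace values outside the fences with the boundary values."""
    lower, upper = bounds
    return [min(max(v, lower), upper) for v in values]

# Hypothetical feature values: bounds are fitted on train, then reused.
train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
test = [-20, 5, 100]

bounds = fit_iqr_bounds(train)
train_clipped = clip_to_bounds(train, bounds)
test_clipped = clip_to_bounds(test, bounds)  # no refitting on test data
```

Because the fences are computed once and passed around explicitly, validation and test data cannot influence them.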

Comment 4: A discussion of the splitting protocol adopted and the order of transformations is missing

Response 4: We appreciate the reviewer’s concern regarding the splitting protocol. In the updated manuscript, the splitting protocol is specified as follows (lines 473–482):

All experiments employed a stratified 70%/15%/15% train–validation–test split, with stratification performed at the device level to preserve class balance across all partitions. A fixed random seed (seed = 42) was used to ensure reproducibility and to prevent any inadvertent traffic leakage across devices. The validation set was used exclusively for hyperparameter tuning, and the test set was strictly held out for final performance evaluation.

To further assess the stability and generalizability of the optimized models, we additionally performed 5-fold cross-validation on the training portion of the data. This procedure provided multiple independent performance estimates and allowed us to verify that the observed results are consistent across different data partitions rather than being tied to a single split.
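A stratified 70%/15%/15% split with a fixed seed can be sketched as below. This is an illustrative standard-library version, not the tooling the authors used (which this record does not name); the function name and example labels are hypothetical. Note that it stratifies individual samples by device label and does not group them by session or capture.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists, stratified by label."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):     # deterministic class order
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical device labels for 60 traffic windows.
labels = ["camera"] * 20 + ["plug"] * 40
train_idx, val_idx, test_idx = stratified_split(labels)
```

Each class contributes the same proportions to every partition, so class balance is preserved. Preventing leakage between temporally adjacent windows of the same device would additionally require splitting by session or time block.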

Comment 5: Regarding DWT and WST, details regarding the parent wavelet, decomposition levels, edge management (padding), and any feature standardization prior to the transform are missing. For WST, there is therefore no clarification on how feature interactions (F-dimension) are captured and whether the 2D matrix is a layout trick or an object that is truly operated on in a two-dimensional manner. Furthermore, the complexity analysis attributes a cost of approximately JQNlogN to WST (lines 403–407), without providing the actual values of J and Q used in the experiments, making it impossible to evaluate the computational trade-off invoked for application purposes (lines 408–410).

Response 5: The revised manuscript has a new section, "Wavelet-Based Feature Construction Parameters", which includes both parameter justifications and sensitivity results.

Comment 6: Feature Reduction using PCA is motivated too generically; it is necessary to indicate the explained variance and whether PCA is fitted exclusively to the training set in each iteration.

Response 6: The revised manuscript includes new text (lines 576–585) that addresses the reviewer's concerns.

"To address this,  PCA is applied primarily as an exploratory feature-reduction step to assess whether the high-dimensional DWT and WST feature sets contain redundant or non-discriminative components, rather than as a mandatory preprocessing stage. To guide component selection, we adopt a model-driven performance criterion: instead of selecting the number of principal components solely based on retained variance ratios, we systematically evaluate classifier accuracy across a range of PCA dimensions and select the configuration that yields the best validation performance. This approach is particularly appropriate in the context of IoT device fingerprinting, where not all variance captured by PCA is relevant to device identity, and some components may predominantly reflect noise or channel effects."

Comment 7: The experiments conducted are lacking in terms of validation scheme (hold-out, k-fold, stratification), sample size, hyperparameter selection criteria for SVM/KNN/RF/XGB, number of runs, and statistical dispersion.

Response 7: A hyperparameter table has been added to the machine learning algorithms section.

Comment 8:  Table 4 reports an accuracy of 0.86 for CIC2023 with SVM, but a precision of 0.75 and a recall of 0.74 with an F1 score of 0.86: an F1 higher than both precision and recall, given the same standard definition, is mathematically inconsistent.

Response 8: The error has been corrected.

Comment 9: In the section on DDoS robustness, Table 10 contains a clear typo in SVM 041.

Response 9: The error has been corrected.

Reviewer 2 Report

Comments and Suggestions for Authors

Brief Summary
==========
The manuscript proposes an IoT device fingerprinting pipeline that converts network traffic into eight time-series features (including ratio features with special edge-case handling), applies wavelet-based feature extraction (DWT and WST), reduces dimensionality to a 30-feature representation, and classifies either individual devices or device types using KNN, SVM, Random Forest, and XGBoost across three public datasets (CIC IoT 2022, CIC IoT 2023, and an UNSW/“USNW” IoT dataset). 


The reported results show a large jump from moderate baseline performance using time-domain features (e.g., individual identification accuracy as low as 0.41 on CIC2023) to near-perfect performance with DWT and WST features (often ≈0.99–1.00 for RF/XGB). 
Robustness is assessed under DDoS conditions, where baseline performance collapses (e.g., accuracies down to 0.09–0.11), while WST partially recovers performance (e.g., ≈0.69–0.71 for RF/XGB on CIC2022 in one reported table). 


However, several reported metric rows are internally inconsistent (F1 exceeding what is feasible given precision/recall), and there is a direct contradiction in the manuscript’s own dataset description of the UNSW device count (28 vs “only 10”), which undermines confidence in the numerical evidence and interpretation. 


Major Comments
===========
#1 The experimental protocol is under-specified in ways that prevent assessing validity and risk inflating performance. The manuscript describes training and testing across scenarios and datasets but does not state the train/validation/test split strategy (percentages, stratification), whether splits are performed at the level of sessions/captures/devices to avoid temporal leakage, whether results come from a single split or repeated runs, or whether hyperparameters are tuned on held-out data. Add a fully specified evaluation protocol: (i) define the unit of splitting (e.g., by capture/session or by contiguous time blocks per device) and enforce group-wise splits to prevent leakage; (ii) use a validation set (or nested cross-validation) for choosing the sampling interval and the “30 features” PCA dimensionality; (iii) report results over at least 5 independent seeds or 5-fold grouped CV, with 95% confidence intervals (bootstrap over test samples or folds) for accuracy and macro-F1.
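The confidence-interval request in item (iii) can be illustrated with a percentile bootstrap over test samples. The sketch below is one possible reading, not a prescribed procedure: the resample count, the seed, and the function name are assumptions, and the same pattern would apply to macro-F1 by swapping the statistic.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)

    stats = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        stats.append(sum(sample) / n)
    stats.sort()

    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper

# Hypothetical predictions: 90 of 100 test samples classified correctly.
y_true = [0] * 100
y_pred = [0] * 90 + [1] * 10
acc, ci_low, ci_high = bootstrap_accuracy_ci(y_true, y_pred)
```

Reporting the interval alongside the point estimate shows whether a difference between two classifiers exceeds the variation expected from the test sample alone.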


#2 Several reported metrics are mathematically inconsistent, calling the correctness of the analysis into question. In multiple tables, the SVM row shows accuracy ≈0.86 with precision ≈0.75 and recall ≈0.74, yet F1 is reported as ≈0.86; this cannot hold under any standard F1 definition because F1 cannot exceed both precision and recall. Recompute all metrics from stored predictions and explicitly state the averaging scheme for multi-class reporting (macro, micro, weighted). Then add a consistency check in the results generation script (e.g., assert F1 ≤ max(precision, recall) for each reported row under the chosen averaging), and include confusion matrices for CIC2023 (the hardest dataset) for both device-level and type-level tasks to reveal failure modes.
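The consistency check suggested in this comment can be sketched as follows. It is a minimal illustration assuming macro averaging, with per-class metrics set to zero when their denominator is zero (a convention that varies between libraries); the function names are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 from stored predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(classes)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

def row_is_consistent(precision, recall, f1, tol=1e-9):
    """F1 must not exceed the larger of precision and recall."""
    return f1 <= max(precision, recall) + tol

# Hypothetical stored predictions for three device classes.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
p, r, f = macro_metrics(y_true, y_pred)

computed_ok = row_is_consistent(p, r, f)
reported_ok = row_is_consistent(0.75, 0.74, 0.86)  # the row flagged above
```

Running such a check on every row before a table is written would catch transcription errors of the kind flagged here.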


#3 The results presentation appears to include at least one table duplication/copy error, which must be resolved to make the evidence auditable. The baseline table for device type identification reproduces exactly the baseline table for individual device identification (same dataset-by-model numbers), which is implausible and suggests a copy/paste mistake rather than two independently computed evaluations. Regenerate and replace the type-identification baseline table from the corresponding type labels, and publish per-sample predictions (or at minimum per-class support counts and macro/weighted metrics) so readers can verify that scenario 3 and scenario 4 use different label spaces and outputs.


#4 The feature-engineering pipeline contains degrees of freedom that are not concretely pinned down, weakening reproducibility and potentially introducing hidden sensitivity. The paper defines ratio features where division-by-zero yields “∞ or a large constant,” but it does not specify the actual constant used nor assess sensitivity to that choice, and it applies IQR-based outlier replacement without reporting the IQR multiplier or whether thresholds are fit on training only. Provide exact preprocessing parameters (the constant value for “∞”, IQR multiplier, normalisation formula and fitted statistics), and enforce a strict training-only fit for all preprocessing (IQR bounds, scaling, PCA) implemented as a single pipeline object applied to test data without refitting; then add a short sensitivity experiment varying the “∞ constant” over at least three orders of magnitude (e.g., 10²/10⁴/10⁶ before normalisation) and report the impact on CIC2023 macro-F1.


#5 Claims and interpretations about dataset complexity and generalisability are undermined by internal inconsistencies and lack of targeted tests. The datasets section states the UNSW IoT dataset has 28 devices, while the discussion later describes it as “only 10 IoT devices,” which directly contradicts the earlier description and affects the complexity argument.  More importantly, “complexity/scale” is discussed qualitatively without controlled experiments isolating factors such as number of devices, class imbalance, and traffic heterogeneity.  Correct the UNSW device-count inconsistency and then add a controlled scaling study on CIC2023: subsample to {10, 25, 50, 100} devices with fixed per-device sample counts, and separately vary per-device training size (e.g., 10%, 25%, 50%, 100% of available windows) while holding splits grouped by session; report macro-F1 and calibration (ECE or Brier score) to quantify robustness as complexity increases.


#6 The robustness evaluation under DDoS provides useful stress testing but is not yet sufficient to substantiate resilience claims. Baselines degrade sharply under DDoS (e.g., individual identification accuracy down to 0.09–0.14), and WST improves outcomes in at least one reported table (e.g., CIC2022 RF/XGB around 0.69–0.71), but the paper does not clearly specify whether models are trained on benign-only and tested on attacked traffic, retrained with mixed traffic, or evaluated under a defined attack intensity and attack-type diversity. Add a minimal, explicit robustness protocol with three settings: (i) train on benign, test on DDoS; (ii) train on benign+DDoS mixture with the same device labels, test on DDoS; (iii) train on one attack intensity, test on a different intensity; report relative degradation (Δ macro-F1 from benign to attack) and confidence intervals, and include per-device/type breakdowns to show whether robustness is uniform or concentrated in a subset of classes.


#7 The manuscript motivates wavelet-based traffic fingerprinting as comparatively unexplored, but it also notes difficulty comparing numerically to prior work due to reproducibility limits, so the novelty and baseline adequacy remain only partially evidenced. Two recent works that can be used to better situate the paper within IoT/network security system contexts are: “Proposal and comparative analysis of a voting-based election algorithm for managing service replication in MANETs” (Applied Intelligence, 2023) and “SEADETEC: Advanced Service for Early Detection of Cybersecurity Events” (Procedia Computer Science, 2025).

 

Minor Comments
===========
Correct the UNSW device-count contradiction (28 devices in the datasets section vs “only 10 IoT devices” in the discussion) and ensure the same dataset name is used consistently (“UNSW” vs “USNW”). 


Remove template/placeholder text embedded in figure captions (e.g., instructions about figure placement/caption centring). 


Audit all tables for typographical numeric errors and consistent decimal formatting; several tables elsewhere contain visibly malformed numeric entries (e.g., missing separators), which should be eliminated by generating tables directly from code outputs rather than manual editing. 


The “Data Availability Statement” says “Data are contained within the article”; given the use of public IoT traffic datasets, replace this with explicit dataset identifiers, URLs/DOIs, and exact preprocessing steps/scripts required to reproduce the derived time-series. 


Standardise metric naming (“F1-Score”, “F1. Score”) and explicitly specify whether precision/recall/F1 are macro-, micro-, or weighted-averaged for multi-class tasks. 

Comments on the Quality of English Language

The English is generally readable, but several sentences are awkward or imprecise and should be edited for grammatical correctness and technical specificity (especially in the dataset description, experimental protocol, and results interpretation).

Author Response

Comment 1: The experimental protocol is under-specified in ways that prevent assessing validity and risk inflating performance. The manuscript describes training and testing across scenarios and datasets but does not state the train/validation/test split strategy (percentages, stratification), whether splits are performed at the level of sessions/captures/devices to avoid temporal leakage, whether results come from a single split or repeated runs, or whether hyperparameters are tuned on held-out data. Add a fully specified evaluation protocol: (i) define the unit of splitting (e.g., by capture/session or by contiguous time blocks per device) and enforce group-wise splits to prevent leakage; (ii) use a validation set (or nested cross-validation) for choosing the sampling interval and the “30 features” PCA dimensionality; (iii) report results over at least 5 independent seeds or 5-fold grouped CV, with 95% confidence intervals (bootstrap over test samples or folds) for accuracy and macro-F1.

Response 1: All experiments employed a stratified 70%/15%/15% train–validation–test split, with stratification performed at the device level to preserve class balance across all partitions. A fixed random seed (seed = 42) was used to ensure reproducibility and to prevent any inadvertent traffic leakage across devices. The validation set was used exclusively for hyperparameter tuning, and the test set was strictly held out for final performance evaluation.

To further assess the stability and generalizability of the optimized models, we additionally performed 5-fold cross-validation on the training portion of the data. This procedure provided multiple independent performance estimates and allowed us to verify that the observed results are consistent across different data partitions rather than being tied to a single split.

Lines 473 - 482


Comment 2: Several reported metrics are mathematically inconsistent, calling the correctness of the analysis into question. In multiple tables, the SVM row shows accuracy ≈0.86 with precision ≈0.75 and recall ≈0.74, yet F1 is reported as ≈0.86; this cannot hold under any standard F1 definition because F1 cannot exceed both precision and recall. Recompute all metrics from stored predictions and explicitly state the averaging scheme for multi-class reporting (macro, micro, weighted). Then add a consistency check in the results generation script (e.g., assert F1 ≤ max(precision, recall) for each reported row under the chosen averaging), and include confusion matrices for CIC2023 (the hardest dataset) for both device-level and type-level tasks to reveal failure modes.

Response 2: The mistake has been fixed.


Comment 3: The results presentation appears to include at least one table duplication/copy error, which must be resolved to make the evidence auditable. The baseline table for device type identification reproduces exactly the baseline table for individual device identification (same dataset-by-model numbers), which is implausible and suggests a copy/paste mistake rather than two independently computed evaluations. Regenerate and replace the type-identification baseline table from the corresponding type labels, and publish per-sample predictions (or at minimum per-class support counts and macro/weighted metrics) so readers can verify that scenario 3 and scenario 4 use different label spaces and outputs.

Response 3: The mistake has been fixed.


Comment 4: The feature-engineering pipeline contains degrees of freedom that are not concretely pinned down, weakening reproducibility and potentially introducing hidden sensitivity. The paper defines ratio features where division-by-zero yields “∞ or a large constant,” but it does not specify the actual constant used nor assess sensitivity to that choice, and it applies IQR-based outlier replacement without reporting the IQR multiplier or whether thresholds are fit on training only. Provide exact preprocessing parameters (the constant value for “∞”, IQR multiplier, normalisation formula and fitted statistics), and enforce a strict training-only fit for all preprocessing (IQR bounds, scaling, PCA) implemented as a single pipeline object applied to test data without refitting; then add a short sensitivity experiment varying the “∞ constant” over at least three orders of magnitude (e.g., 10²/10⁴/10⁶ before normalisation) and report the impact on CIC2023 macro-F1.

Response 4: We did not use any ratio features in our paper. If the reviewer specifies the section, paragraph, or lines that raise the concern of division by zero, we will be happy to address this issue.

The IQR multiplier is specified as follows (lines 294–303):

We use interquartile range (IQR) filtering [38], where values falling outside a defined range are considered outliers and replaced with the boundary values. The IQR multiplier is used to identify outliers by defining thresholds based on the interquartile range (IQR). Specifically, data points are flagged as outliers if they fall below the lower bound Q1 − 1.5 × IQR or above the upper bound Q3 + 1.5 × IQR. To apply this method, we first compute Q1, Q3, and the IQR using only the training set, then derive the lower and upper bounds from the selected multiplier. These bounds are subsequently applied to the training data to remove outliers and consistently used on the validation, test, and unseen data to filter or flag anomalous instances.


Comment 5: Claims and interpretations about dataset complexity and generalisability are undermined by internal inconsistencies and lack of targeted tests. The datasets section states the UNSW IoT dataset has 28 devices, while the discussion later describes it as “only 10 IoT devices,” which directly contradicts the earlier description and affects the complexity argument.  More importantly, “complexity/scale” is discussed qualitatively without controlled experiments isolating factors such as number of devices, class imbalance, and traffic heterogeneity.  Correct the UNSW device-count inconsistency and then add a controlled scaling study on CIC2023: subsample to {10, 25, 50, 100} devices with fixed per-device sample counts, and separately vary per-device training size (e.g., 10%, 25%, 50%, 100% of available windows) while holding splits grouped by session; report macro-F1 and calibration (ECE or Brier score) to quantify robustness as complexity increases.

Response 5: We thank the reviewer for identifying the inconsistency in the description of the UNSW IoT dataset. We have corrected this error in the revised manuscript. The UNSW dataset contains 28 IoT devices, and any reference to “only 10 IoT devices” was incorrect and has been removed to ensure consistency and accuracy.

Regarding dataset complexity and generalizability, we agree that factors such as the number of devices, class imbalance, and traffic heterogeneity play a critical role in the robustness of IoT device fingerprinting systems. In the revised discussion, we have clarified our interpretation by explicitly contrasting the datasets used in this study. Specifically, we emphasize that the CIC2023 dataset, which includes traffic from over 100 IoT devices, exhibits substantially greater scale, heterogeneity, noise, and class imbalance compared to the UNSW dataset, which contains 28 devices and more homogeneous traffic patterns. Consistent with this difference in complexity, we observe that all models—including baseline classifiers—achieve higher performance on the UNSW dataset than on CIC2023, highlighting the impact of dataset scale and diversity on fingerprinting accuracy.

We acknowledge the reviewer’s suggestion to conduct a controlled scaling study (e.g., subsampling CIC2023 to varying numbers of devices and systematically varying per-device training size). While we fully agree that such an analysis would provide deeper quantitative insight into the effects of dataset complexity, performing this study comprehensively would require a large number of additional experiments. Even a minimal design with three device-count bins (e.g., 5, 55, and 105 devices), evaluated across both device identification and device-type identification tasks, multiple classifiers, and multiple metrics, would result in a substantial expansion of experimental results (more than 18 new tables) and analysis, significantly increasing the length and scope of the paper.

For this reason, we position the effect of dataset complexity on IoT device fingerprinting as an important and independent research direction. In the revised manuscript, we have strengthened the discussion to explicitly acknowledge this limitation and to frame controlled scaling experiments—considering factors such as device count, class imbalance, traffic heterogeneity, and session-level splits—as a natural and necessary subject for a dedicated follow-up conference or journal paper. We believe this clarification appropriately aligns the claims of the current work with its experimental scope while transparently identifying directions for future investigation.

Lines 861-878


Comment 6: The robustness evaluation under DDoS provides useful stress testing but is not yet sufficient to substantiate resilience claims. Baselines degrade sharply under DDoS (e.g., individual identification accuracy down to 0.09–0.14), and WST improves outcomes in at least one reported table (e.g., CIC2022 RF/XGB around 0.69–0.71), but the paper does not clearly specify whether models are trained on benign-only and tested on attacked traffic, retrained with mixed traffic, or evaluated under a defined attack intensity and attack-type diversity. Add a minimal, explicit robustness protocol with three settings: (i) train on benign, test on DDoS; (ii) train on benign+DDoS mixture with the same device labels, test on DDoS; (iii) train on one attack intensity, test on a different intensity; report relative degradation (Δ macro-F1 from benign to attack) and confidence intervals, and include per-device/type breakdowns to show whether robustness is uniform or concentrated in a subset of classes.

Response 6: We appreciate the reviewer’s observation regarding the scope of the robustness evaluation. Our current experiments under DDoS conditions were designed as an initial, focused assessment of fingerprint stability when exposed to a single, well-defined anomaly type. As such, the results represent a partial robustness analysis intended to establish baseline resilience against disruptive traffic patterns rather than to provide a comprehensive adversarial threat evaluation. We fully agree that real-world deployments may involve a broader range of attack types and mixed anomaly scenarios, each of which could impact fingerprint consistency in different ways. Conducting an extended study that systematically explores multiple adversarial models, compound anomalies, and varying levels of attack intensity would require a significantly expanded experimental framework. Given the breadth and depth of such an investigation, we consider it a natural direction for a dedicated follow-up paper, where a more extensive robustness analysis can be performed and discussed in detail.


Comment 7: The manuscript motivates wavelet-based traffic fingerprinting as comparatively unexplored, but it also notes difficulty comparing numerically to prior work due to reproducibility limits, so the novelty and baseline adequacy remain only partially evidenced. Two recent works that can be used to better situate the paper within IoT/network security system contexts are: “Proposal and comparative analysis of a voting-based election algorithm for managing service replication in MANETs” (Applied Intelligence, 2023) and “SEADETEC: Advanced Service for Early Detection of Cybersecurity Events” (Procedia Computer Science, 2025).

Response 7: We will read the papers and consider their contributions in future work.

Minor Comments
===========
Correct the UNSW device-count contradiction (28 devices in the datasets section vs “only 10 IoT devices” in the discussion) and ensure the same dataset name is used consistently (“UNSW” vs “USNW”). 

Corrected
Remove template/placeholder text embedded in figure captions (e.g., instructions about figure placement/caption centring). 

Corrected


Audit all tables for typographical numeric errors and consistent decimal formatting; several tables elsewhere contain visibly malformed numeric entries (e.g., missing separators), which should be eliminated by generating tables directly from code outputs rather than manual editing. 

Corrected
The “Data Availability Statement” says “Data are contained within the article”; given the use of public IoT traffic datasets, replace this with explicit dataset identifiers, URLs/DOIs, and exact preprocessing steps/scripts required to reproduce the derived time-series. 


Standardise metric naming (“F1-Score”, “F1. Score”) and explicitly specify whether precision/recall/F1 are macro-, micro-, or weighted-averaged for multi-class tasks. 

Macro-Averaged Precision / Recall / F1
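
For reference, macro-averaged metrics for a task with $C$ classes are conventionally defined as sketched below. The symbols $TP_c$, $FP_c$, and $FN_c$ (per-class true positives, false positives, and false negatives) are notation introduced here for illustration and are not taken from the manuscript. Macro-F1 is written as the mean of per-class F1 scores; some tools instead take the harmonic mean of macro precision and macro recall, so the variant used should be stated.

```latex
\begin{align}
P_c &= \frac{TP_c}{TP_c + FP_c}, &
R_c &= \frac{TP_c}{TP_c + FN_c}, &
F1_c &= \frac{2\,P_c\,R_c}{P_c + R_c}, \\
P_{\text{macro}} &= \frac{1}{C}\sum_{c=1}^{C} P_c, &
R_{\text{macro}} &= \frac{1}{C}\sum_{c=1}^{C} R_c, &
F1_{\text{macro}} &= \frac{1}{C}\sum_{c=1}^{C} F1_c .
\end{align}
```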

 

Reviewer 3 Report

Comments and Suggestions for Authors

This paper proposes a wavelet-based IoT device fingerprinting framework that transforms network traffic features into frequency and time–frequency representations, and systematically evaluates DWT and WST for both device-level and type-level identification tasks. The topic is practically relevant, the system pipeline is clearly structured, and the experimental evaluation spans multiple public datasets while considering scalability and robustness under DDoS attacks. Overall, the work demonstrates solid engineering effort and extensive experimentation.

1. Although the paper claims to be the first to apply wavelet-based analysis to network-traffic-level IoT fingerprinting, the distinction and innovation boundary relative to prior frequency-domain traffic analysis methods (like FFT-based methods) are not sufficiently articulated, either theoretically or through direct comparative experiments.
2. Key design choices in feature construction and parameter configuration (such as time window size, DWT decomposition level, mother wavelet) are largely heuristic, and the paper lacks systematic ablation or sensitivity analyses to justify these selections.
3. Many experimental results report extremely high accuracies (close to or equal to 99-100%), yet there is limited discussion of potential overfitting, the exact train–test splitting strategy, or the risk of traffic leakage across devices, which raises concerns about result reliability.
4. PCA plays a central role in dimensionality reduction, but the retained variance ratios, criteria for selecting the number of principal components, and the differential impact of PCA on various classifiers are not quantitatively analyzed in sufficient detail.
5. While the robustness evaluation under DDoS attacks is a valuable addition, the attack model remains relatively narrow, and the experiments do not yet capture the diversity of real-world adversarial or mixed anomaly scenarios that may affect fingerprint stability.
6. The manuscript is relatively long, with some redundancy in methodological and experimental descriptions, whereas the analysis of underlying mechanisms (why WST substantially boosts KNN and SVM performance) remains largely qualitative and could be strengthened with deeper analytical insights.

 

Author Response

Comment 1: Although the paper claims to be the first to apply wavelet-based analysis to network-traffic-level IoT fingerprinting, the distinction and innovation boundary relative to prior frequency-domain traffic analysis methods (like FFT-based methods) are not sufficiently articulated, either theoretically or through direct comparative experiments.

Response 1:  We thank the reviewer for this important observation. In the revised manuscript, we have clarified the distinction between our wavelet-based approach and prior frequency-domain traffic analysis methods, including FFT-based techniques, both conceptually and empirically.

From a theoretical perspective, FFT-based methods provide a global frequency representation and implicitly assume stationarity over the analysis window, which limits their ability to capture transient and nonstationary patterns commonly present in IoT network traffic. In contrast, wavelet-based representations (DWT and WST) offer time–frequency localization and multi-resolution analysis, enabling the extraction of device-specific traffic dynamics that evolve over time and are not well captured by global spectral features. To further delineate the innovation boundary, we have expanded the discussion to emphasize that our contribution lies not merely in applying a different transform, but in systematically integrating wavelet-based features with network-traffic-level IoT fingerprinting and evaluating their stability, robustness, and classifier compatibility.

Lines 197–207
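
The localization argument above can be illustrated with a toy signal. The sketch below is not the manuscript's pipeline: it assumes a single-level Haar transform (chosen only because it fits in a few lines) and a synthetic one-sample spike. Moving the spike in time leaves the DFT magnitude spectrum unchanged, while the Haar detail coefficient moves with it.

```python
import cmath
import math

def haar_dwt_level1(x):
    """One level of the Haar DWT: scaled pairwise sums (approximation) and differences (detail)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def dft_magnitude(x):
    """Magnitude of the discrete Fourier transform, computed directly from the definition."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)))
            for f in range(n)]

def spike(n, pos):
    """Synthetic signal: zero everywhere except a unit spike at index pos."""
    return [1.0 if t == pos else 0.0 for t in range(n)]

def active(coeffs, tol=1e-12):
    """Indices of coefficients that are not numerically zero."""
    return [k for k, c in enumerate(coeffs) if abs(c) > tol]

early, late = spike(16, 3), spike(16, 9)
same_spectrum = all(abs(a - b) < 1e-9
                    for a, b in zip(dft_magnitude(early), dft_magnitude(late)))
early_detail = active(haar_dwt_level1(early)[1])
late_detail = active(haar_dwt_level1(late)[1])
print(same_spectrum, early_detail, late_detail)  # prints: True [1] [4]
```

The magnitude spectrum alone cannot tell the two signals apart, whereas the detail coefficients show where the transient occurred.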

Comment 2: Key design choices in feature construction and parameter configuration (such as time window size, DWT decomposition level, mother wavelet) are largely heuristic, and the paper lacks systematic ablation or sensitivity analyses to justify these selections.

Response 2: Thank you for this comment. We agree that clearer justification of the feature-construction parameters is important. In the revised manuscript, we now provide an explicit rationale for each major design choice and include a sensitivity analysis summarizing the alternative configurations we evaluated during model development. The revised manuscript has a new section, “Wavelet-Based Feature Construction Parameters”, that includes both the parameter justifications and the sensitivity results.

Comment 3: Many experimental results report extremely high accuracies (close to or equal to 99-100%), yet there is limited discussion of potential overfitting, the exact train–test splitting strategy, or the risk of traffic leakage across devices, which raises concerns about result reliability.

Response 3: We appreciate the reviewer’s concern regarding the high accuracies and the potential for overfitting or train–test leakage. To clarify our evaluation protocol, we have now explicitly detailed the data-splitting and preprocessing procedures in the revised manuscript (lines 473–482).

All experiments employed a stratified 70%/15%/15% train–validation–test split, with stratification performed at the device level to preserve class balance across all partitions. A fixed random seed (seed = 42) was used to ensure reproducibility and to prevent any inadvertent traffic leakage across devices. The validation set was used exclusively for hyperparameter tuning, and the test set was strictly held out for final performance evaluation. To further assess the stability and generalizability of the optimized models, we additionally performed 5-fold cross-validation on the training portion of the data. This procedure provided multiple independent performance estimates and allowed us to verify that the observed results are consistent across different data partitions rather than being tied to a single split.

Comment 4: PCA plays a central role in dimensionality reduction, but the retained variance ratios, criteria for selecting the number of principal components, and the differential impact of PCA on various classifiers are not quantitatively analyzed in sufficient detail.

Response 4: The revised manuscript includes new text (lines 576–585) that addresses the reviewer’s concerns.

Comment 5: While the robustness evaluation under DDoS attacks is a valuable addition, the attack model remains relatively narrow, and the experiments do not yet capture the diversity of real-world adversarial or mixed anomaly scenarios that may affect fingerprint stability.

Response 5:  We appreciate the reviewer’s observation regarding the scope of the robustness evaluation. Our current experiments under DDoS conditions were designed as an initial, focused assessment of fingerprint stability when exposed to a single, well-defined anomaly type. As such, the results represent a partial robustness analysis intended to establish baseline resilience against disruptive traffic patterns rather than to provide a comprehensive adversarial threat evaluation. We fully agree that real-world deployments may involve a broader range of attack types and mixed anomaly scenarios, each of which could impact fingerprint consistency in different ways. Conducting an extended study that systematically explores multiple adversarial models, compound anomalies, and varying levels of attack intensity would require a significantly expanded experimental framework. Given the breadth and depth of such an investigation, we consider it a natural direction for a dedicated follow-up paper, where a more extensive robustness analysis can be performed and discussed in detail.

Comment 6: The manuscript is relatively long, with some redundancy in methodological and experimental descriptions, whereas the analysis of underlying mechanisms (why WST substantially boosts KNN and SVM performance) remains largely qualitative and could be strengthened with deeper analytical insights.

Response 6: We appreciate the reviewer’s feedback regarding the length of the manuscript and the presence of methodological or experimental redundancies. In the revised version, we have carefully reviewed and refined the text to remove repetition and improve overall conciseness. If the reviewer still identifies specific sections or paragraphs where redundancy persists, we would welcome more detailed guidance and will be pleased to revise those portions further.

 

Reviewer 4 Report

Comments and Suggestions for Authors

The paper addresses a critical security gap in IoT: the vulnerability of MAC/IP identifiers to spoofing and the limitations of Radio Frequency Fingerprinting (RFF) in wired or heterogeneous networks. The use of wavelet transforms to extract stable behavioral "rhythms" from traffic is technically sound and well-motivated. However, the manuscript suffers from "near-perfect" results (99-100% accuracy) that require deeper validation to ensure they are not a byproduct of dataset bias. Below are some suggested revisions to help improve the paper:

  1. While the methodology is described through a structured pipeline (Figure 2), the paper does not mention whether the code or the "Traffic-to-Image" encoder implementation will be publicly available. This is crucial for verifying the novel wavelet integration. If possible, please describe this in more detail.
  2. Tables 4, 5, 7, and 8 show frequent 99% and 100% accuracy scores. Such high performance often suggests "data leakage" where training and testing data share temporal characteristics. While the authors mention using GroupShuffleSplit with a 300s window to mitigate this, more evidence is needed to prove these features aren't just learning specific timestamps rather than device "signatures".
  3. The authors compare their method primarily against a "time-domain baseline" using eight basic features. To strengthen the claim of state-of-the-art performance, a direct comparison with recent deep learning approaches (like CNNs or LSTMs applied to the same datasets) is missing.
  4. Figure 1 is a high-level block diagram. It would be beneficial to provide a more detailed visual of the "Traffic-to-Image" stacking process described in Section 3.1.4.
  5. All figures should be replaced with higher-resolution vector graphics; please improve them.
  6. The authors mention using ε = 10^-6 for IQR stability. While they state results were insensitive to this, providing a small table or plot for this sensitivity analysis would enhance the "rigorous assessment" they claim.

Author Response

Comment 1: While the methodology is described through a structured pipeline (Figure 2), the paper does not mention whether the code or the "Traffic-to-Image" encoder implementation will be publicly available. This is crucial for verifying the novel wavelet integration. If possible, please describe this in more detail.

 

Response 1: Yes, the code will be available to everyone upon request after the publication of the manuscript. The manuscript now elaborates on this with additional text and a new figure (Figure 3) illustrating the "Traffic-to-Image" encoder.

The “Traffic-to-Image” encoder is a core component of the proposed pipeline, responsible for transforming raw network traffic traces into structured time-series representations suitable for multiresolution wavelet analysis. The encoder operates through four sequential phases: 1) network traffic feature selection, in which relevant traffic-level features are extracted from raw network traffic traces to capture essential temporal and statistical characteristics; 2) time-series signal construction, where the selected features are organized into uniformly sampled temporal sequences at a fixed resolution; 3) signal conditioning, including denoising and normalization to ensure numerical stability and compatibility with wavelet-based transformations; and 4) 2D embedding stack, where the conditioned time-series signals are arranged into structured two-dimensional representations that preserve temporal locality and inter-feature relationships and serve as direct input to the Wavelet Transform stages.

Lines 221-232
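
A minimal sketch of how phases 2 to 4 could fit together is given below, under stated assumptions. It is not the authors' implementation: the two features (packet count and byte volume), the 30-second bin size, and min-max normalization are hypothetical choices made for illustration, and the denoising step is omitted. Only the 300-second window length comes from this review record.

```python
def bin_series(events, start, n_bins, bin_seconds):
    """Phase 2 (sketch): sum event values into uniformly spaced time bins."""
    series = [0.0] * n_bins
    for ts, value in events:
        idx = int((ts - start) // bin_seconds)
        if 0 <= idx < n_bins:          # ignore events outside the window
            series[idx] += value
    return series

def min_max_normalize(series):
    """Phase 3 (sketch): rescale to [0, 1]; a constant series maps to zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [(v - lo) / (hi - lo) for v in series]

def encode_window(packets, start, n_bins=10, bin_seconds=30.0):
    """Phase 4 (sketch): stack one normalized row per feature into a 2D list.

    packets is a list of (timestamp_seconds, size_bytes) tuples (hypothetical format).
    """
    counts = bin_series([(ts, 1.0) for ts, _ in packets], start, n_bins, bin_seconds)
    volume = bin_series([(ts, float(size)) for ts, size in packets], start, n_bins, bin_seconds)
    return [min_max_normalize(counts), min_max_normalize(volume)]

# Made-up packets inside one 300-second window starting at t = 0.
packets = [(0.5, 100), (1.0, 200), (45.0, 60), (299.0, 1500)]
image = encode_window(packets, start=0.0)
print(len(image), len(image[0]))  # prints: 2 10
```

Each row is one feature's time series and each column one time bin, so the matrix keeps temporal order along one axis and features along the other.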

 

Comment 2: Tables 4, 5, 7, and 8 show frequent 99% and 100% accuracy scores. Such high performance often suggests "data leakage" where training and testing data share temporal characteristics. While the authors mention using GroupShuffleSplit with a 300s window to mitigate this, more evidence is needed to prove these features aren't just learning specific timestamps rather than device "signatures".

Response 2:

To mitigate temporal and near-duplicate data leakage, we employ a time- and group-aware evaluation strategy. Samples are first organized into non-overlapping, device-specific time-block groups to ensure that temporally adjacent or overlapping windows are contained within the same group. We then apply GroupShuffleSplit using these groups as atomic units, guaranteeing that no group appears in more than one split. In addition, chronological ordering is preserved such that all training groups precede validation and test groups in time, preventing future information from influencing model training. This combined strategy enforces strict isolation of temporally correlated samples, preserves the causal structure of real-world deployment, and yields leakage-safe performance estimates that reflect genuine generalization rather than memorization.   Lines 502-513

 

The effectiveness of the proposed time- and group-aware evaluation strategy is reflected in the performance differences between feature representations. Models trained on time-domain features alone achieve relatively low accuracy, while performance improves markedly—approaching 99–100%—only after applying the wavelet transform. If temporal leakage were present, inflated performance would also be expected in the time-domain setting due to temporally adjacent or near-duplicate samples. The absence of such behavior suggests that temporal leakage has been effectively mitigated and that the observed gains stem from the wavelet-based representation rather than from evaluation artifacts.
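
The two properties claimed for the evaluation strategy in this response (no group shared across partitions, and training data preceding evaluation data in time) can be checked mechanically. The sketch below is a hypothetical audit helper, not part of the authors' code; the record fields `device`, `group`, `time`, and `split` are assumed names.

```python
from collections import defaultdict

def audit_splits(samples):
    """Return a list of human-readable leakage problems (empty list means none found).

    Each sample is a dict with keys 'device', 'group', 'time', and 'split'
    (hypothetical field names), where split is 'train', 'val', or 'test'.
    """
    problems = []

    # Check 1: every group must live in exactly one partition.
    splits_by_group = defaultdict(set)
    for s in samples:
        splits_by_group[s["group"]].add(s["split"])
    for group, splits in sorted(splits_by_group.items()):
        if len(splits) > 1:
            problems.append(f"group {group} appears in {sorted(splits)}")

    # Check 2: per device, all training windows must precede validation/test windows.
    latest_train, earliest_eval = {}, {}
    for s in samples:
        dev, t = s["device"], s["time"]
        if s["split"] == "train":
            latest_train[dev] = max(t, latest_train.get(dev, t))
        else:
            earliest_eval[dev] = min(t, earliest_eval.get(dev, t))
    for dev in sorted(latest_train):
        if dev in earliest_eval and latest_train[dev] >= earliest_eval[dev]:
            problems.append(f"device {dev}: training data overlaps or follows evaluation data")
    return problems

clean = [
    {"device": "cam", "group": ("cam", 0), "time": 0, "split": "train"},
    {"device": "cam", "group": ("cam", 0), "time": 300, "split": "train"},
    {"device": "cam", "group": ("cam", 1), "time": 1800, "split": "test"},
]
leaky = [dict(s) for s in clean]
leaky[1]["split"] = "test"   # same time block now straddles train and test

print(audit_splits(clean))       # prints: []
print(len(audit_splits(leaky)))  # prints: 1
```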

 

Comment 3: The authors compare their method primarily against a "time-domain baseline" using eight basic features. To strengthen the claim of state-of-the-art performance, a direct comparison with recent deep learning approaches (like CNNs or LSTMs applied to the same datasets) is missing.

 

Response 3: We evaluated three deep learning models—CNN, LSTM, and RNN—but observed only moderate classification performance, with accuracies of 69%, 62%, and 67%, respectively. This outcome is expected given the nature and formulation of the problem, which does not inherently emphasize spatial or long-range sequential dependencies. Consequently, the inductive biases of these architectures—spatial locality in CNNs and temporal sequence modeling in LSTM/RNNs—are not well aligned with the characteristics of the input data, limiting their effectiveness in this setting. Therefore, we did not report them in the manuscript.

 

Comment 4: Figure 1 is a high-level block diagram. It would be beneficial to provide a more detailed visual of the "Traffic-to-Image" stacking process described in Section 3.1.4.

Response 4: We provide a new figure (Figure 3) that illustrates the phases of the Traffic-to-Image encoder. Moreover, we added an algorithm (Algorithm 1) that shows the 2D embedding stack process.

 

Comment 5: All figures should be replaced with higher-resolution vector graphics; please improve them.

Response 5: The resolution of all figures has been improved as requested.

 

Comment 6: The authors mention using ε = 10^-6 for IQR stability. While they state results were insensitive to this, providing a small table or plot for this sensitivity analysis would enhance the "rigorous assessment" they claim.

Response 6: In our experiments, the IQR-based filtering affected fewer than 0.5% of the data points. We further conducted a sensitivity analysis by re-running the models with different stability constants (, , and ), and observed negligible changes in performance, with variations in F1-score consistently below 0.1%. Given that the reported metrics are presented to two decimal places, such small differences are difficult to meaningfully visualize in either plots or tables. To maintain clarity and focus, we therefore summarize these findings descriptively while confirming that the choice of ε does not materially influence the results.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Since the TCP window size can also be affected by operating system and network stack configurations, it is better to make it explicit that robustness derives from the set of combined features rather than from this one feature alone.

Explicitly clarify whether the results correspond to a single run with a fixed seed or to averages across multiple runs.

Author Response

Comment 1: Since the TCP window size can also be affected by operating system and network stack configurations, it is better to make it explicit that robustness derives from the set of combined features rather than from this one feature alone.

Response 1: The revised manuscript has been updated according to the reviewer’s comment (lines 247–251).

Comment 2: Explicitly clarify whether the results correspond to a single run with a fixed seed or to averages across multiple runs.

Response 2: Thank you for this important comment. In the revised manuscript, we have incorporated uncertainty quantification by reporting results obtained from repeated (5 times) fully grouped 5-fold cross-validation, where grouping is enforced at the session (time-block) level to prevent temporal and near-duplicate leakage. For each model and feature configuration, we now report the mean and standard deviation across folds, and we additionally compute and report 95% confidence intervals for the primary evaluation metrics (Accuracy, Precision, Recall, and F1-score). Lines 503-511
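
As a rough sketch of the kind of summary described in this response, the snippet below computes a mean, sample standard deviation, and a t-based 95% interval over 25 scores (5 repeats of 5 folds). The scores are made up for illustration. The critical value 2.064 is the two-sided 95% Student-t quantile for 24 degrees of freedom; because cross-validation folds share training data, the scores are correlated and an interval computed this way should be read as approximate.

```python
import math
from statistics import mean, stdev

def summarize_scores(scores, t_crit):
    """Mean, sample standard deviation, and t-based confidence interval for a list of scores."""
    n = len(scores)
    m = mean(scores)
    sd = stdev(scores)                      # sample standard deviation (n - 1 denominator)
    half_width = t_crit * sd / math.sqrt(n)
    return {"mean": m, "std": sd, "ci95": (m - half_width, m + half_width)}

# Hypothetical macro-F1 scores from 5 repeats x 5 grouped folds (not real results).
fold_scores = [0.98, 0.99, 1.00, 0.99, 0.98] * 5
summary = summarize_scores(fold_scores, t_crit=2.064)  # t quantile for df = 24
print(round(summary["mean"], 3))  # prints: 0.988
```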

Reviewer 2 Report

Comments and Suggestions for Authors

The revision addresses a few surface-level issues (e.g., you now state a 70/15/15 split, mention 5-fold CV, and the baseline tables are no longer identical), but several core validity and correctness problems remain unresolved. As it stands, the experimental evidence is still not auditable and parts of the reported results appear numerically unreliable.

#1 You state “stratification at device level” and fix a random seed, but you do not define the unit of splitting (session/capture/flow/window/time block) nor enforce grouped splits that would prevent temporal and near-duplicate leakage. A fixed seed does not “prevent leakage”; it only makes the split reproducible. You must explicitly state (and implement) a leakage-safe protocol (e.g., GroupShuffleSplit / GroupKFold by capture/session or contiguous time blocks per device) and document it precisely.

#2 Multiple tables still report F1 greater than both precision and recall, which cannot hold under any standard F1 definition. For example, CIC2023 (SVM) reports precision around 0.78, recall around 0.76, yet F1 around 0.88. This is mathematically impossible. Until this is fixed, none of the quantitative claims are trustworthy. You must recompute all metrics from stored predictions, state the exact averaging scheme (macro/micro/weighted), and add automated consistency checks in the results script.

#3 You do not report confidence intervals or variability (across folds or seeds). Given the near-perfect scores reported for several settings, uncertainty quantification is essential. Provide results across multiple independent seeds (at least five) or fully grouped cross-validation and report 95% confidence intervals.

#4 You fixed the IQR multiplier (1.5) and state training-only fitting for IQR bounds, which is good, but you still do not specify the actual constant used when division-by-zero occurs (“∞ or a large constant”). This is not a minor detail: it is a free parameter that can change downstream wavelet features. You must (i) define the exact constant, (ii) ensure all preprocessing parameters are fit on training only, and (iii) provide a minimal sensitivity analysis over several orders of magnitude.

#5 You suggest injecting/including DDoS traffic, but you do not clearly separate settings such as: train on benign then test on DDoS; train on benign+DDoS then test on DDoS; train at one intensity then test at another. Without explicit definitions, the “robustness” conclusions are not supported. Add a minimal three-setting robustness protocol and report degradation (delta macro-F1) with uncertainty, plus per-class breakdowns.

#6 The revision does not appear to strengthen the comparative positioning in a way that is verifiable and reproducible with related works.

#7 Template/placeholder text still appears in figure captions. Several tables still contain malformed numeric entries (e.g., “SVM 050”, “0.0.56”), which indicates manual editing or export errors. These are not cosmetic: they further undermine confidence in the entire results pipeline. Regenerate all tables directly from code and eliminate formatting artifacts.

#8 Saying “data are contained within the article” is not acceptable when using public datasets. Provide precise dataset identifiers and direct links/DOIs, plus exact preprocessing scripts/parameters required to reproduce the derived time-series windows.

Bottom line: you have not yet fixed the two most damaging issues, leakage-safe evaluation design and metric correctness. Until (i) the split protocol is fully specified and demonstrably leakage-safe and (ii) all metrics are recomputed and consistent (with explicit macro/micro/weighted definitions), the reported near-perfect performance cannot be treated as credible evidence.

Comments on the Quality of English Language

The English is generally understandable, but it frequently lacks precision and contains phrasing that weakens scientific clarity (e.g., vague claims such as “prevent leakage” when referring to a fixed random seed). Please revise for technical accuracy and consistency: define terms unambiguously, avoid overstated causal language, and ensure consistent naming (e.g., dataset acronyms) and metric terminology throughout. Additionally, remove remaining template/placeholder text in figure captions and regenerate tables to eliminate malformed numeric entries and formatting artefacts.

Author Response

Comment #1 You state “stratification at device level” and fix a random seed, but you do not define the unit of splitting (session/capture/flow/window/time block) nor enforce grouped splits that would prevent temporal and near-duplicate leakage. A fixed seed does not “prevent leakage”; it only makes the split reproducible. You must explicitly state (and implement) a leakage-safe protocol (e.g., GroupShuffleSplit / GroupKFold by capture/session or contiguous time blocks per device) and document it precisely.

Response 1:

Thank you for this important clarification request. We agree with the reviewer that a fixed random seed ensures reproducibility but does not, by itself, prevent leakage. In the revised manuscript, we now explicitly define the unit of splitting and implement a leakage-safe grouped split.

Specifically, our learning instances are traffic windows of 300 seconds extracted with no overlap. To prevent temporal and near-duplicate leakage between partitions, we define sessions as time-block sessions: for each device, the traffic timeline is partitioned into contiguous non-overlapping time blocks of 15–30 minutes (block length), and all windows whose timestamps fall within the same time block are assigned the same session_id. We then construct a grouping key group_id = (device_id, session_id) and perform the 70%/15%/15% train–validation–test split using GroupShuffleSplit, ensuring that all windows from the same device and time block remain entirely within a single partition. The fixed seed (seed = 42) is retained to make this grouping-based split reproducible.

In addition, when performing cross-validation on the training partition, we use the same grouping key to ensure fold-level leakage safety (i.e., windows from the same device and time block never appear in both training and validation folds). These details have been added to the Experimental Setup section to precisely document the leakage-safe protocol.

Lines 483-497
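
The grouping scheme described above can be sketched in plain Python as follows. This is an illustrative stand-in, not the authors' code and not scikit-learn's GroupShuffleSplit itself: the helper names and the 1800-second block length (the upper end of the stated 15–30 minute range) are assumptions. As in GroupShuffleSplit, the fractions apply to groups rather than to individual windows; note also that scikit-learn's splitter yields two-way train/test splits, so a three-way partition there would typically need two successive applications.

```python
import random

def assign_group(device_id, timestamp, block_seconds=1800):
    """Grouping key (device_id, session_id): session_id is the index of the time block."""
    return (device_id, int(timestamp // block_seconds))

def grouped_split(group_ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign every sample a partition label so that no group is split across partitions."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)     # seeded shuffle keeps the split reproducible
    n_train = round(fractions[0] * len(groups))
    n_val = round(fractions[1] * len(groups))
    split_of = {}
    for i, g in enumerate(groups):
        if i < n_train:
            split_of[g] = "train"
        elif i < n_train + n_val:
            split_of[g] = "val"
        else:
            split_of[g] = "test"
    return [split_of[g] for g in group_ids]

# Two hypothetical devices, non-overlapping 300 s windows over 10 hours each.
windows = [(dev, t) for dev in ("cam", "plug") for t in range(0, 36000, 300)]
group_ids = [assign_group(dev, t) for dev, t in windows]
labels = grouped_split(group_ids)
print(labels.count("train"), labels.count("val"), labels.count("test"))  # prints: 168 36 36
```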

Comment #2 Multiple tables still report F1 greater than both precision and recall, which cannot hold under any standard F1 definition. For example, CIC2023 (SVM) reports precision around 0.78, recall around 0.76, yet F1 around 0.88. This is mathematically impossible. Until this is fixed, none of the quantitative claims are trustworthy. You must recompute all metrics from stored predictions, state the exact averaging scheme (macro/micro/weighted), and add automated consistency checks in the results script.

Response 2: The typo is fixed
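
The automated consistency check requested in Comment #2 could look something like the sketch below. It recomputes macro-averaged precision, recall, and F1 from stored predictions and flags any reported triple in which F1 exceeds both precision and recall. The bound holds because each per-class F1 is a harmonic mean and so cannot exceed the average of that class's precision and recall. Function names and the toy labels are hypothetical; macro-F1 is taken as the mean of per-class F1 scores.

```python
def per_class_prf(y_true, y_pred):
    """Per-class (precision, recall, F1), with 0.0 used when a denominator is zero."""
    out = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out[c] = (prec, rec, f1)
    return out

def macro_prf(y_true, y_pred):
    """Macro averages: unweighted mean of the per-class precision, recall, and F1."""
    rows = list(per_class_prf(y_true, y_pred).values())
    return tuple(sum(r[i] for r in rows) / len(rows) for i in range(3))

def is_consistent(precision, recall, f1, tol=1e-9):
    """F1 can never exceed the larger of precision and recall."""
    return f1 <= max(precision, recall) + tol

# Toy predictions (made up) for a three-class task.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
p, r, f1 = macro_prf(y_true, y_pred)
print(is_consistent(p, r, f1))            # prints: True
print(is_consistent(0.78, 0.76, 0.88))    # prints: False (the triple quoted in Comment #2)
```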

Comment #3 You do not report confidence intervals or variability (across folds or seeds). Given the near-perfect scores reported for several settings, uncertainty quantification is essential. Provide results across multiple independent seeds (at least five) or fully grouped cross-validation and report 95% confidence intervals.

Response 3:

Thank you for this important comment. We agree that reporting point estimates alone is insufficient, particularly given the near-perfect performance observed in several configurations. In the revised manuscript, we have incorporated uncertainty quantification by reporting results obtained from repeated fully grouped 5-fold cross-validation, where grouping is enforced at the session (time-block) level to prevent temporal and near-duplicate leakage. For each model and feature configuration, we now report the mean and standard deviation across folds, and we additionally compute and report 95% confidence intervals for the primary evaluation metrics (Accuracy, Precision, Recall, and F1-score). Lines 503-511

Comment #4 You fixed the IQR multiplier (1.5) and state training-only fitting for IQR bounds, which is good, but you still do not specify the actual constant used when division-by-zero occurs (“∞ or a large constant”). This is not a minor detail: it is a free parameter that can change downstream wavelet features. You must (i) define the exact constant, (ii) ensure all preprocessing parameters are fit on training only, and (iii) provide a minimal sensitivity analysis over several orders of magnitude.

Response 4:

Thank you for highlighting this reproducibility-critical detail. We agree that the division-by-zero handling in IQR-based scaling must be explicitly defined and evaluated. In the revised manuscript, we now specify that when the IQR is zero, we use a stabilized denominator with a fixed constant ε, which removes the ambiguity of “∞ or a large constant” and ensures deterministic behavior.

We explicitly confirm that all preprocessing parameters are fitted on the training set only—including Q1, Q3, the IQR-based bounds using the fixed multiplier 1.5, and the scaling transformation—and the resulting parameters are applied unchanged to validation and test sets to prevent leakage.

We include a minimal sensitivity analysis by varying the fixed constant ε over several orders of magnitude (). The results show negligible variation across this range, indicating that the reported outcomes are not sensitive to the specific stabilization constant while ensuring numerical stability in zero-IQR cases. Lines 314-322
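
A minimal sketch of training-only IQR fitting with boundary-value replacement and a stabilized denominator is shown below. Several details are assumptions rather than the manuscript's exact procedure: the scaling form (x − median) / (IQR + ε), the "inclusive" quartile method, and the function names. The value ε = 10^-6 is the one quoted in Reviewer 4's Comment 6, and the multiplier 1.5 is the one stated in this response.

```python
from statistics import quantiles

def fit_iqr_params(train_values, k=1.5):
    """Fit quartiles and clipping bounds on TRAINING data only."""
    q1, med, q3 = quantiles(train_values, n=4, method="inclusive")
    iqr = q3 - q1
    return {"lower": q1 - k * iqr, "upper": q3 + k * iqr, "median": med, "iqr": iqr}

def apply_iqr(values, params, eps=1e-6):
    """Clip outliers to the fitted bounds, then scale with a stabilized denominator.

    Adding eps keeps the division finite when the training IQR is zero
    (assumed form of the stabilization; the manuscript's exact formula is not shown here).
    """
    clipped = [min(max(v, params["lower"]), params["upper"]) for v in values]
    denom = params["iqr"] + eps
    return [(v - params["median"]) / denom for v in clipped]

# Parameters come from the training split and are reused unchanged on held-out data.
params = fit_iqr_params([1, 2, 3, 4, 5, 6, 7, 8, 9])
scaled_test = apply_iqr([100, 5, -50], params)

# A constant feature has IQR = 0; eps prevents a division by zero.
const_params = fit_iqr_params([5.0, 5.0, 5.0, 5.0])
scaled_const = apply_iqr([5.0, 9.0], const_params)
print(scaled_const)  # prints: [0.0, 0.0]
```

Repeating `apply_iqr` with several values of `eps` and comparing downstream metrics is one way to carry out the sensitivity analysis the reviewer asks for.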

Comment #5 You suggest injecting/including DDoS traffic, but you do not clearly separate settings such as: train on benign then test on DDoS; train on benign+DDoS then test on DDoS; train at one intensity then test at another. Without explicit definitions, the “robustness” conclusions are not supported. Add a minimal three-setting robustness protocol and report degradation (delta macro-F1) with uncertainty, plus per-class breakdowns.

Response 5:
Thank you for this clarification request. We agree that rigorously supporting robustness claims requires an explicitly defined protocol that clearly separates training and testing conditions, such as training on benign traffic only versus mixed benign–DDoS traffic, as well as evaluating generalization across different attack intensities. We also agree that reporting degradation metrics (e.g., delta macro-F1), uncertainty, and per-class breakdowns would provide deeper quantitative insight into model behavior under adversarial conditions.

However, performing such a robustness analysis comprehensively would require a substantial expansion of the experimental scope. Even a minimal three-setting protocol, evaluated across both device identification and device-type identification tasks, multiple classifiers, and multiple metrics, would result in a large number of additional experiments (more new result tables), along with extensive analysis and discussion. This would significantly increase the length and scope of the current paper beyond its intended focus.

In our experiments, the training data consist exclusively of benign traffic, while the test data include both benign traffic and DDoS attack traffic.

To address this concern appropriately, we have revised the manuscript to (i) explicitly acknowledge this limitation, (ii) avoid overclaiming robustness, and (iii) reframe the DDoS-related experiments as partial performance evaluations under anomalous conditions rather than a comprehensive robustness study. Specifically, we have replaced the term “robustness” with “performance under anomaly” where applicable and clarified that the presented results serve as an initial investigation rather than a complete adversarial evaluation.

Lines 893-904

Comment #6 The revision does not appear to strengthen the comparative positioning in a way that is verifiable and reproducible with related works.

Response 6:

We have therefore focused on improving conceptual and methodological positioning rather than claiming strict experimental equivalence. Specifically, we now more clearly articulate how our wavelet-based approach differs from prior frequency-domain traffic analysis methods in terms of representation (time–frequency localization versus global spectral features), feature stability, and suitability for nonstationary IoT traffic. We also enhance reproducibility by providing explicit implementation details, parameter settings, and evaluation protocols for our own approach, enabling independent verification of the reported results.

We clarify that the manuscript already specifies the exact preprocessing parameters, including aggregation bin size, window length, overlap strategy, normalization procedure, and wavelet configuration. These details are sufficient to reproduce the approach from the publicly available raw data.

We respectfully refer the reviewer to the paper “Evaluating Machine Learning-Based IoT Device Identification Models for Security Applications” (E. Maali, O. Alrawi, J. McCann, NDSS 2025), which explicitly discusses the challenges associated with achieving strict reproducibility through direct, head-to-head replication of related works. That study demonstrates that such replication is often inefficient and unreliable in practice due to several factors.

Comment #7 Template/placeholder text still appears in figure captions. Several tables still contain malformed numeric entries (e.g., “SVM 050”, “0.0.56”), which indicates manual editing or export errors. These are not cosmetic: they further undermine confidence in the entire results pipeline. Regenerate all tables directly from code and eliminate formatting artifacts.

Response 7: The typos have been fixed.

Comment #8 Saying “data are contained within the article” is not acceptable when using public datasets. Provide precise dataset identifiers and direct links/DOIs, plus exact preprocessing scripts/parameters required to reproduce the derived time-series windows.

Response 8:

Thank you for this comment. We agree that transparency and reproducibility are important, particularly when public datasets are used. In the revised manuscript, all public datasets employed in this study are explicitly cited with their original references, which provide authoritative dataset identifiers, access links, and detailed descriptions maintained by the dataset owners. These citations already direct readers to the precise sources required to obtain the raw data. Repeating the same information verbatim in a separate Data Availability Statement would not introduce additional content beyond what is already clearly documented in the paper.

Furthermore, we note that in many articles published in the Electronics journal, the statement “Data are contained within the article” or similar wording is accepted when publicly available datasets are fully referenced and no proprietary data are used. We have benchmarked several such publications and observed consistent use of this practice.

That said, we fully recognize the importance of complying with journal policies and reviewer expectations. We are therefore open to updating the Data Availability Statement if required, and we would appreciate clarification on whether this request reflects a specific journal policy requirement or an individual reviewer preference, so that we can ensure consistency with the journal’s standards and prior publications.

https://www.mdpi.com/2079-9292/14/16/3249

https://www.mdpi.com/2079-9292/14/16/3168

https://www.mdpi.com/2079-9292/14/16/3319

https://www.mdpi.com/2079-9292/14/16/3268

https://www.mdpi.com/2079-9292/14/16/3244

https://www.mdpi.com/2079-9292/14/1/80

 

Bottom line: you have not yet fixed the two most damaging issues, leakage-safe evaluation design and metric correctness. Until (i) the split protocol is fully specified and demonstrably leakage-safe and (ii) all metrics are recomputed and consistent (with explicit macro/micro/weighted definitions), the reported near-perfect performance cannot be treated as credible evidence.

(i) the split protocol is fully specified and demonstrably leakage-safe,

addressed in Response 1

(ii) all metrics are recomputed and consistent (with explicit macro/micro/weighted definitions)

We confirm that all evaluation metrics have been recomputed and verified for consistency across the revised manuscript. We now explicitly state that macro-averaged precision, recall, and F1-score are used as the primary evaluation metrics in this work.

Lines 476-483
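To make the macro/micro/weighted distinction raised in this exchange concrete, the following is a minimal sketch of how the three F1 averages differ on an imbalanced label set. It is not the authors' code: the function name `f1_scores`, the device labels, and the convention of scoring an empty class as 0.0 are assumptions made for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Per-class, macro, micro, and support-weighted F1 for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        # Assumed convention: a class with no true or predicted samples scores 0.0.
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    return {
        "per_class": per_class,
        # Macro: unweighted mean over classes, so rare devices count equally.
        "macro": sum(per_class.values()) / len(labels),
        # Micro: pooled counts, dominated by frequent devices.
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # Weighted: per-class F1 weighted by true-class support.
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
    }


# Hypothetical labels: 8 camera windows, 2 plug windows, one plug misclassified.
y_true = ["cam"] * 8 + ["plug"] * 2
y_pred = ["cam"] * 8 + ["cam", "plug"]
scores = f1_scores(y_true, y_pred)
```

On imbalanced data such as this, the macro average falls below the micro average because the minority class is penalised equally, which is why the averaging mode needs to be stated next to every reported number.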

Reviewer 3 Report

Comments and Suggestions for Authors

I would like to thank the authors for their careful and thorough revision of the manuscript, which has adequately addressed the previous review comments and improved the clarity and quality of the work. After reviewing the revised version, I have no further questions or additional comments.

 

Author Response

Thank you for reviewing our manuscript.

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have almost met my modification requirements; I have no comments.

Round 3

Reviewer 2 Report

Comments and Suggestions for Authors

#1 Reproducible, leakage-safe evaluation is still not demonstrated. You now propose grouping windows by a constructed “time-block session” per device (15–30 min blocks) and apply GroupShuffleSplit plus grouped 5-fold CV. This is an improvement in specification, but it is not an adequate substitute for dataset-grounded grouping (capture/session IDs where available), and you still do not report any auditables (numbers of groups, group sizes, per-fold group distributions, or evidence that your time blocks align with real capture boundaries). Without these, readers cannot verify that temporal/near-duplicate leakage has been eliminated rather than merely reduced.

#2 The division-by-zero handling in your ratio features remains undefined and is a free parameter. You still state “∞ or a large constant” for packet/payload ratios when the denominator is zero, without specifying the constant, how ∞ is represented, or how this interacts with normalisation and downstream wavelet features. This is not a minor omission: it is an uncontrolled degree of freedom that can materially affect your reported near-perfect results.

#3 Your robustness evaluation remains incomplete and does not support resilience claims. You now specify that training uses benign-only data and testing includes benign + DDoS traffic. However, you still do not evaluate the minimal regimes required to justify robustness (e.g., benign→DDoS, benign+DDoS->DDoS, intensity transfer). You also acknowledge that your DDoS experiments are not a comprehensive robustness assessment.  The correct conclusion therefore is that you provide a limited stress test, not evidence of robustness.

#4 Uncertainty reporting remains incomplete. You claim 95% confidence intervals are computed, but the manuscript presents only mean ± standard deviation and does not actually report CIs.  Given the extreme performance levels reported (often ≈0.99), the paper must show uncertainty explicitly and unambiguously.

#5 The results pipeline still appears unreliable due to persistent formatting/data integrity issues. A revision responding to metric-validity concerns cannot ship with malformed numeric entries (e.g., “0.14 ± 0.0 3”).  Figure captioning is also inconsistent.  References include corruption (e.g., Ref. 47).  These issues collectively undermine confidence that tables are generated directly from code outputs and that reported values are handled consistently.

The manuscript still fails on core reproducibility and evidential credibility (undefined edge-case constants, incomplete robustness protocol, incomplete uncertainty reporting, and persistent artefact/formatting errors).

Comments on the Quality of English Language

The manuscript contains repeated grammar/wording issues and quality-control problems that affect readability and credibility (e.g., inconsistent figure/caption references and corrupted/broken reference formatting).

Author Response

#1 Reproducible, leakage-safe evaluation is still not demonstrated. You now propose grouping windows by a constructed “time-block session” per device (15–30 min blocks) and apply GroupShuffleSplit plus grouped 5-fold CV. This is an improvement in specification, but it is not an adequate substitute for dataset-grounded grouping (capture/session IDs where available), and you still do not report any auditables (numbers of groups, group sizes, per-fold group distributions, or evidence that your time blocks align with real capture boundaries). Without these, readers cannot verify that temporal/near-duplicate leakage has been eliminated rather than merely reduced.

Response 1: The datasets utilized in this study were primarily developed for anomaly and intrusion detection tasks. As a result, the available metadata is insufficient to support a grouping approach grounded in dataset-specific identifiers or session boundaries. Given these constraints, we adopted the time block session method for grouping. This existing approach offers certain advantages in structuring data for evaluation, but it also presents limitations and may not completely prevent temporal or near-duplicate leakage.

It is important to emphasize that the objective of this work is not to resolve all limitations inherent to time-block session grouping, nor to claim that this strategy is universally optimal. Rather, our contributions lie in demonstrating the effectiveness of wavelet-based representations for IoT device fingerprinting under a clearly defined, leakage-aware evaluation protocol.

We acknowledge that methodological choices of this nature naturally invite critical scrutiny and that the peer-review process itself surfaces valuable questions that motivate continued refinement. We view these limitations not as shortcomings of the present study, but as opportunities for future research to explore improved grouping strategies and more dataset-grounded evaluation protocols.

#2 The division-by-zero handling in your ratio features remains undefined and is a free parameter. You still state “∞ or a large constant” for packet/payload ratios when the denominator is zero, without specifying the constant, how ∞ is represented, or how this interacts with normalisation and downstream wavelet features. This is not a minor omission: it is an uncontrolled degree of freedom that can materially affect your reported near-perfect results.

Response 2: Thank you for raising this point. We agree that leaving the division-by-zero handling unspecified creates an uncontrolled degree of freedom and could affect the reported performance.

In cases where the denominator (e.g., payload size) is zero for packet/payload ratios, we replace the undefined value with a large constant (specifically, 10^6 in our implementation, chosen to exceed typical ratio scales in the dataset by several orders of magnitude). This avoids direct representation of infinity, which can introduce numerical instability in floating-point operations. However, we recognize that such substitutions can indeed act as an uncontrolled parameter if not addressed further.

To mitigate this, in the implementation, prior to normalization and extraction of downstream wavelet features, we explicitly identify these large constants as outliers and eliminate the affected samples or features from the dataset. This is done via a simple threshold-based filter (values > 10^6 are flagged and removed), ensuring they do not propagate into the normalization process or influence wavelet decomposition. In our experiments, this filtering impacted less than 0.5% of the data points, and sensitivity analyses (re-running models with varied constants 10^3, 10^6, 10^9) showed negligible changes in overall accuracy (<0.1% variance in F1 scores), supporting that our near-perfect results are robust and not materially dependent on this parameter. Lines 520-533

#3 Your robustness evaluation remains incomplete and does not support resilience claims. You now specify that training uses benign-only data and testing includes benign + DDoS traffic. However, you still do not evaluate the minimal regimes required to justify robustness (e.g., benign→DDoS, benign+DDoS->DDoS, intensity transfer). You also acknowledge that your DDoS experiments are not a comprehensive robustness assessment.  The correct conclusion therefore is that you provide a limited stress test, not evidence of robustness.

Response 3: This comment was Comment 5 in Round 2; the answer was given in Response 5 of Round 2, reproduced as follows:

Response 5:
Thank you for this clarification request. We agree that rigorously supporting robustness claims requires an explicitly defined protocol that clearly separates training and testing conditions, such as training on benign traffic only versus mixed benign–DDoS traffic, as well as evaluating generalization across different attack intensities. We also agree that reporting degradation metrics (e.g., delta macro-F1), uncertainty, and per-class breakdowns would provide deeper quantitative insight into model behavior under adversarial conditions.

However, performing such a robustness analysis comprehensively would require a substantial expansion of the experimental scope. Even a minimal three-setting protocol, evaluated across both device identification and device-type identification tasks, multiple classifiers, and multiple metrics, would result in a large number of additional experiments (more new result tables), along with extensive analysis and discussion. This would significantly increase the length and scope of the current paper beyond its intended focus.

In our experiments: The training data consist exclusively of benign traffic, while the test data include both benign traffic and DDoS attack traffic.

To address this concern appropriately, we have revised the manuscript to (i) explicitly acknowledge this limitation, (ii) avoid overclaiming robustness, and (iii) reframe the DDoS-related experiments as partial performance evaluations under anomalous conditions rather than a comprehensive robustness study. Specifically, we have replaced the term “robustness” with “performance under anomaly” where applicable and clarified that the presented results serve as an initial investigation rather than a complete adversarial evaluation.

Lines 893-904

 

#4 Uncertainty reporting remains incomplete. You claim 95% confidence intervals are computed, but the manuscript presents only mean ± standard deviation and does not actually report CIs.  Given the extreme performance levels reported (often ≈0.99), the paper must show uncertainty explicitly and unambiguously.

Response 4: Results are reported as the mean and standard deviation across five grouped cross-validation folds, where grouping is enforced at the session (time-block) level to prevent temporal and near-duplicate leakage. For each metric, performance is first computed independently on each fold and then aggregated. In addition to reporting mean ± standard deviation, we explicitly quantify uncertainty by computing 95% confidence intervals (CIs) across folds using the standard normal approximation. Specifically, given the fold-level standard deviation $s$ and the number of folds $n$, the 95% CI is computed as $\bar{x} \pm 1.96\, s/\sqrt{n}$, where $\bar{x}$ is the mean across folds. For clarity and transparency, the reported standard deviations are consistent with, and derived from, these confidence intervals, ensuring that variability and uncertainty are directly interpretable and reproducible. This dual reporting of mean ± standard deviation and explicit confidence intervals provides a rigorous assessment of result stability, particularly important given the near-perfect performance observed in several configurations. Lines 510-519
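As an illustration of the interval computation discussed here, the sketch below computes a 95% CI from five fold-level scores twice: once with the standard normal critical value (1.96) and once with the Student-t critical value for 4 degrees of freedom (2.776). The t-based variant is not taken from the manuscript; it is included as an assumption about what a small-sample method for n = 5 could look like. The fold scores and the helper name `mean_ci` are hypothetical.

```python
import math
import statistics

Z_975 = 1.96       # 97.5th percentile of the standard normal distribution
T_975_DF4 = 2.776  # 97.5th percentile of Student-t with 4 d.o.f. (valid for n = 5 only)


def mean_ci(scores, critical):
    """Return (mean, lower, upper) for mean +/- critical * s / sqrt(n)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    half_width = critical * sd / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


# Hypothetical macro-F1 values from five grouped folds (not results from the paper).
fold_f1 = [0.98, 0.99, 0.99, 1.00, 0.99]

normal_ci = mean_ci(fold_f1, Z_975)
t_ci = mean_ci(fold_f1, T_975_DF4)
```

With only five folds the t-based interval is about 40% wider than the normal-approximation interval, so the choice of critical value visibly changes the reported uncertainty.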

#5 The results pipeline still appears unreliable due to persistent formatting/data integrity issues. A revision responding to metric-validity concerns cannot ship with malformed numeric entries (e.g., “0.14 ± 0.0 3”).  Figure captioning is also inconsistent.  References include corruption (e.g., Ref. 47).  These issues collectively undermine confidence that tables are generated directly from code outputs and that reported values are handled consistently.

Response 5: All reported formatting and integrity issues have been fully corrected in the revised manuscript, including malformed numeric entries, inconsistent figure captions, and corrupted references (e.g., Ref. 47). We emphasize that all quantitative results were generated directly by code and exported as CSV files; the errors identified by the reviewer were introduced during manual consolidation and manuscript formatting across multiple contributors, not during result computation. These issues were purely presentational and do not affect the validity of the results or the conclusions. We appreciate the reviewer’s careful scrutiny, which has helped ensure that the final version meets the expected standard of reliability and reproducibility.

 

 

The manuscript still fails on core reproducibility and evidential credibility (undefined edge-case constants, incomplete robustness protocol, incomplete uncertainty reporting, and persistent artefact/formatting errors).

Undefined edge-case constants: All required edge-case constants have been defined as requested by the reviewers.

Incomplete robustness protocol: We did not claim robustness in the paper, and we clearly addressed this request in Response 2.

Incomplete uncertainty reporting: The results are reported as required. None of the related works report uncertainty for their results.

Persistent artefact/formatting errors: Typos and errors are normal in any submitted manuscript; all typos and errors have been corrected as required by the reviewers.

 

Round 4

Reviewer 2 Report

Comments and Suggestions for Authors

The revision makes some surface improvements (e.g., you now describe a grouped splitting idea and state macro-averaged metrics), but the manuscript still fails to meet basic standards.

#1 You propose “time-block sessions” and grouped splitting, yet you provide no auditable evidence that this actually removes temporal/near-duplicate leakage (e.g., number of groups, group-size distributions, per-fold group allocations, or any validation that 15–30 min blocks align with real capture/session boundaries). Stating a fixed seed does not prevent leakage; it only reproduces whatever leakage exists. In addition, the manuscript mixes a 70/15/15 GroupShuffleSplit with “repeated fully grouped 5-fold CV” without clearly explaining which reported tables come from which protocol and what constitutes the final test set.

#2 You claim 95% confidence intervals, but you do not actually report CIs in the result tables, only mean ± standard deviation. Moreover, the written CI formula/description is confusing and appears mathematically misstated. If you claim CIs, you must report them explicitly in the tables (and use an appropriate method for n=5 folds).

#3 In the feature definitions you still describe “∞ or a large constant”, while later you specify 10^6 and a filtering/removal step plus a sensitivity statement. This must be made consistent in the formal definitions, and you must precisely specify what is removed (entire windows vs feature values), because “remove affected samples or features” is ambiguous and can bias the evaluation.

#4 While one discussion paragraph says you avoid overclaiming robustness, the abstract/introduction/results/conclusion repeatedly use strong robustness/resilience language and even “deployment-ready” framing. Given that you do not evaluate the minimal regimes required for robustness (benign -> DDoS, benign+DDoS -> DDoS, intensity transfer) with degradation metrics and uncertainty, the appropriate conclusion is that you provide a limited stress test, not evidence of robustness.

#5 The revised manuscript still contains malformed numeric entries in at least one DDoS baseline table (e.g., “0.14 ± 0.0 3”). 

Comments on the Quality of English Language

The English requires improvement for precision and internal consistency. There are recurring issues with contradictory claims (e.g., robustness language vs admitted limitations), ambiguous mathematical/statistical phrasing (confidence-interval description), and occasional formatting/typographical problems that directly affect readability and the perceived reliability of results tables and captions.

Author Response

Comment 1: You propose “time-block sessions” and grouped splitting, yet you provide no auditable evidence that this actually removes temporal/near-duplicate leakage (e.g., number of groups, group-size distributions, per-fold group allocations, or any validation that 15–30 min blocks align with real capture/session boundaries). Stating a fixed seed does not prevent leakage; it only reproduces whatever leakage exists. In addition, the manuscript mixes a 70/15/15 GroupShuffleSplit with “repeated fully grouped 5-fold CV” without clearly explaining which reported tables come from which protocol and what constitutes the final test set.

Response 1: To mitigate temporal and near-duplicate data leakage, we employ a time- and group-aware evaluation strategy. Samples are first organized into non-overlapping, device-specific time-block groups to ensure that temporally adjacent or overlapping windows are contained within the same group. We then apply GroupShuffleSplit using these groups as atomic units, guaranteeing that no group appears in more than one split. In addition, chronological ordering is preserved such that all training groups precede validation and test groups in time, preventing future information from influencing model training. This combined strategy enforces strict isolation of temporally correlated samples, preserves the causal structure of real-world deployment, and yields leakage-safe performance estimates that reflect genuine generalization rather than memorization.   Lines 502-513
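The grouping strategy described above, together with the auditables the reviewer asks for (number of groups, group sizes, per-split allocation), can be sketched as follows. This is a deterministic chronological variant written in plain Python, not scikit-learn's GroupShuffleSplit and not the authors' implementation. The 15-minute block length, the 70/15/15 proportions applied to groups rather than windows, and all function names are assumptions.

```python
from collections import Counter, defaultdict

BLOCK_SECONDS = 15 * 60  # assumed block length; the record mentions 15-30 min blocks


def block_id(device, timestamp, block_seconds=BLOCK_SECONDS):
    """Group key: one group per device per non-overlapping time block."""
    return (device, int(timestamp // block_seconds))


def chronological_group_split(windows, train_frac=0.70, val_frac=0.15):
    """Assign every (device, start_time) window to a group, then split by group.

    Groups are ordered by block index, so no training group is later in time
    than any validation group, and no validation group later than any test group.
    """
    groups = defaultdict(list)
    for idx, (device, start) in enumerate(windows):
        groups[block_id(device, start)].append(idx)

    ordered = sorted(groups, key=lambda g: (g[1], g[0]))  # by block, then device
    n_train = round(len(ordered) * train_frac)
    n_val = round(len(ordered) * val_frac)
    parts = {
        "train": ordered[:n_train],
        "val": ordered[n_train:n_train + n_val],
        "test": ordered[n_train + n_val:],
    }
    return groups, parts


def audit(groups, parts):
    """Check that no group spans two splits and report per-split counts."""
    seen = Counter(g for split in parts.values() for g in split)
    assert all(count == 1 for count in seen.values()), "group appears in two splits"
    return {
        name: {"groups": len(gs), "windows": sum(len(groups[g]) for g in gs)}
        for name, gs in parts.items()
    }


# Hypothetical data: 2 devices, one window every 5 minutes for 150 minutes.
windows = [(f"dev{d}", t) for d in range(2) for t in range(0, 9000, 300)]
groups, parts = chronological_group_split(windows)
report = audit(groups, parts)
```

A report of this kind makes the split inspectable, but it cannot show that constructed time blocks coincide with real capture boundaries; that would still need dataset-level metadata.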

The effectiveness of the proposed time- and group-aware evaluation strategy is reflected in the performance differences between feature representations. Models trained on time-domain features alone achieve relatively low accuracy, while performance improves markedly—approaching 99–100%—only after applying the wavelet transform. If temporal leakage were present, inflated performance would also be expected in the time-domain setting due to temporally adjacent or near-duplicate samples. The absence of such behavior suggests that temporal leakage has been effectively mitigated and that the observed gains stem from the wavelet-based representation rather than from evaluation artifacts.

 

Comment 2: You claim 95% confidence intervals, but you do not actually report CIs in the result tables, only mean ± standard deviation. Moreover, the written CI formula/description is confusing and appears mathematically misstated. If you claim CIs, you must report them explicitly in the tables (and use an appropriate method for n=5 folds).

Response 2: We claimed what we did and we reported the formulation we used.

Confidence intervals are reported to quantify uncertainty in performance estimates obtained from repeated evaluations. Specifically, we compute 95% confidence intervals for each metric as $\bar{x} \pm 1.96\, s/\sqrt{n}$, where $\bar{x}$ and $s$ denote the mean and sample standard deviation across the $n$ grouped cross-validation folds. This formulation follows standard statistical practice for estimating the uncertainty of a mean under repeated measurements and is widely used in empirical machine-learning studies (e.g., Devore, 2015; Hastie et al., 2009). Reporting confidence intervals is particularly important in settings with near-perfect performance, as it provides transparency regarding result stability and guards against overinterpretation of point estimates.

Many references use this method, for example:

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI.

Comment 3: In the feature definitions you still describe “∞ or a large constant”, while later you specify 10^6 and a filtering/removal step plus a sensitivity statement. This must be made consistent in the formal definitions, and you must precisely specify what is removed (entire windows vs feature values), because “remove affected samples or features” is ambiguous and can bias the evaluation.

Response 3:

We thank the reviewer for pointing out the need for clearer consistency between the formal feature definitions and the implementation details. In the feature definitions, the use of “∞ or a large constant” is intended as a theoretical description of the approach, reflecting the fact that division by a near-zero denominator yields an unbounded or very large value in principle. In the implementation, this abstraction is instantiated with a fixed numerical constant (10⁶), which is now explicitly stated to avoid ambiguity. This separation between symbolic formulation and concrete parameterization follows common practice in the literature, where algorithms are described generically in theory and instantiated with specific values in experimental sections. To address the reviewer’s second concern, we clarify that the filtering step operates at the feature-value level, not by removing entire windows: only the affected feature values are clipped or excluded from computation, while the corresponding samples remain in the dataset.
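The feature-value-level handling described in this response can be sketched as below. The constant 10^6 comes from the response; everything else is an assumption for illustration: the helper names, the use of NaN to mark an excluded value, and the `>=` comparison (the earlier response states `> 10^6`, which would not catch a value substituted as exactly 10^6).

```python
import math

SENTINEL = 1e6  # constant stated in the response for zero-denominator ratios


def safe_ratio(numerator, denominator, sentinel=SENTINEL):
    """Return numerator / denominator, or the sentinel when the denominator is zero."""
    return sentinel if denominator == 0 else numerator / denominator


def mask_sentinels(values, sentinel=SENTINEL):
    """Mark sentinel-level feature values as NaN; the window itself is kept."""
    return [math.nan if v >= sentinel else v for v in values]


def nan_mean(values):
    """Mean over non-NaN entries, so masked values are excluded from computation."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


# Hypothetical (packet_bytes, payload_bytes) pairs for three windows.
pairs = [(10, 5), (3, 0), (8, 4)]
ratios = [safe_ratio(n, d) for n, d in pairs]
masked = mask_sentinels(ratios)
window_count_unchanged = len(masked) == len(pairs)
```

Masking rather than dropping keeps the sample count fixed, which is what distinguishes removal of feature values from removal of entire windows.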

Comment 4: While one discussion paragraph says you avoid overclaiming robustness, the abstract/introduction/results/conclusion repeatedly use strong robustness/resilience language and even “deployment-ready” framing. Given that you do not evaluate the minimal regimes required for robustness (benign -> DDoS, benign+DDoS -> DDoS, intensity transfer) with degradation metrics and uncertainty, the appropriate conclusion is that you provide a limited stress test, not evidence of robustness.

Response 4:

We would like to emphasize that we do not claim adversarial robustness or deployment-ready resilience of the proposed approach. In the abstract and introduction, our intent was to state that we evaluate the robustness of the proposed approach relative to baseline models under the same experimental conditions, not that the method is inherently robust to adversarial attacks or fully stress-tested deployment scenarios. To avoid any possible overinterpretation, we explicitly address this distinction in the Discussion section (Lines 927–937), where we state that the presented experiments constitute a limited stress test rather than a comprehensive robustness evaluation. In response to the reviewer’s concern, we have further revised the manuscript to remove the term “robust” where it could be construed as an overclaim.

Comment 5: The revised manuscript still contains malformed numeric entries in at least one DDoS baseline table (e.g., “0.14 ± 0.0 3”). 

Response 5: The error is fixed.
