Article
Peer-Review Record

E2E-MDC: End-to-End Multi-Modal Darknet Traffic Classification with Conditional Hierarchical Mechanism

Electronics 2025, 14(22), 4457; https://doi.org/10.3390/electronics14224457
by Junyuan Zhang, Yang Chen, Qingbing Ji *, Wei Yu, Lulin Ni, Chengpeng Dai, Lu Kang and Jie Luo
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 4: Anonymous
Submission received: 13 October 2025 / Revised: 12 November 2025 / Accepted: 13 November 2025 / Published: 15 November 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes an end-to-end multi-modal deep learning framework (E2E-MDC) for three-level hierarchical classification of Darknet traffic.

I have some comments.

  1. In Section 1, many viewpoints and background claims are not supported by detailed references.
  2. There is a citation error in Section 2.
  3. Was the paper polished with AI tools? Some sentences show obvious traces of AI polishing.
  4. All figures must be adjusted.
  5. The authors provide fairly extensive experimental results, but can these results be compared with closely related research? The manuscript does not clearly specify which works were used for comparison.
  6. The Conclusion is too long, which hinders readability.
  7. The article cites relatively few references, and many viewpoints and descriptions of previous work are insufficiently detailed; coverage of the latest frontiers is also sparse. This is inappropriate. There are many recent works on MDPI that the authors could consider, such as

Cong, X.; Zhu, H.; Cui, W.; Zhao, G.; Yu, Z. Critical Observability of Stochastic Discrete Event Systems Under Intermittent Loss of Observations. Mathematics 2025, 13, 1426. https://doi.org/10.3390/math13091426

  8. Basically, there are many areas of the writing that need improvement. I suggest the authors pay attention to sentences that are too long and difficult for most readers to read and understand.
Comments on the Quality of English Language

The written English should be further improved.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This research focuses on the problem of Darknet traffic classification and proposes E2E-MDC, a deep learning model that combines several feature extraction modules with a conditional hierarchical classification approach. The incorporation of CNN, TCN, BiLSTM and Transformer modules allows traffic data to be analysed from different perspectives, while the hierarchical classification handles dependencies across three levels of traffic labels. The model is evaluated on a custom Tor dataset and the public Darknet-2020 dataset, and demonstrates higher accuracy and fewer hierarchy violations than previous works. The model appears to be backed by sound methodology and could enable more structured and accurate classification of encrypted Darknet traffic; however, additional validation on a wider range of datasets would be useful.

However, the following elements need to be addressed in the manuscript:

  • An explicit novelty statement is missing from the Introduction. A clear and concise novelty statement would help readers understand the novel elements of this research study.
  • How does the system guarantee that the learnable positional encoding in the Transformer module can generalize across various darknet networks with various traffic patterns?
  • Is the 25% overlap in the sliding window mechanism capable of creating correlated inputs that may overestimate the accuracy of hierarchical classification, and what are the mitigation strategies?
  • How does the gated fusion mechanism within the bidirectional LSTM module quantitatively balance the proportional reliance of LSTM temporal features with attention-derived global features across classes with heterogeneous traffic types?
  • What techniques are used to ensure cross-modal attention does not overfit to dominant feature modalities in the context of highly skewed class distributions in darknet datasets?
  • How is the hierarchical consistency loss defined mathematically, and what is its role in the overall gradient flow during end-to-end training? (A generic sketch of such a loss and the accompanying soft-conditioning head is given after this list.)
  • In the TCN module, how is the receptive field effectively expanded to capture both short-lived and long-range dependencies in the data without undue computational cost?
  • How is error propagation avoided in the conditional dependence design, where Level 2 predictions are conditioned on Level 1 predictions that may be uncertain?
  • How does the system avoid biasing predictions toward more frequently observed behaviors when Level 3 behaviors are underrepresented or absent in the training data?
  • In what manner does the adaptive weighting mechanism adapt dynamically to encrypted traffic, where byte-level features may not have significant classification power, and how is this validated?
  • How are segment embeddings in the Transformer module initialized, and learned, and how do they contribute to the model being more sensitive to particular phases of darknet traffic?
  • What is the theoretical justification for blending average pooling and last-step hidden states in the bidirectional LSTM module, and how does this benefit the robustness of the feature representation?
  • How is the zero-padding strategy used to complete shorter window fragments guaranteed not to create signal artifacts that could bias classification, particularly for short-lived or bursty traffic patterns?
  • How does the model avoid redundant learning across the four feature extractors, while still allowing genuinely complementary feature fusion in the attention mechanism?
  • How is the validation of the conditional hierarchical classifier conducted so that φ21 and φ32 mapping constraints are strictly adhered to, and possible surreptitious violations are avoided during inference on unseen data?
  • How are hyperparameters such as convolution kernel sizes (f), dilation rates (s) and attention head sizes (h) empirically justified, given that they must balance usable classification accuracy with computational efficiency in time-sensitive darknet environments?
  • In the Darknet-2020 dataset, E2E-MDC obtains a cascade accuracy of 92.65% against the corresponding DIDarknet cascade accuracy of 87.68%; what particular advantages in modelling hierarchical dependency do the authors think account for this ~5% increase in cascade accuracy despite similar Level 2 accuracy between the two models?
  • DarkDetect obtains essentially perfect accuracy (99.87%) at Level 1, but significantly reduced cascade accuracy (86.32%). What can the authors quantitatively say regarding inter-level prediction inconsistencies?
  • In relation to inference times, on the Darknet-2020 dataset DIDarknet requires 17-18 ms, which could be considered prohibitive relative to its overall Level 2 accuracy; how does E2E-MDC balance computational efficiency with multi-level classification performance?
  • Looking at Table 3, E2E-MDC achieves a 97.51% L3|L1,L2 conditional accuracy on the self-collected Tor dataset with the proposed refinement; what does this imply for error propagation control relative to traditional independent classifiers?
  • When comparing the L3|L1,L2 conditional accuracy of DIDarknet and E2E-MDC on the self-collected dataset, what insights do the authors gain regarding the efficacy of the two models' conditional hierarchical modelling mechanisms on a single platform?
  • How does the soft conditioning of E2E-MDC maintain >85% Level 2 accuracy on low-confidence (<0.7) Level 1 samples, compared to baseline methods that drop to 60-70%?
  • In reference to Tables 4 and 5, what misclassifications occur in Level 3 behavior categories and what potential features or traffic characteristics contribute to the confusion?
  • In the self-collected Tor dataset, the F1-score for browsing traffic increased from 83.9% to 95.2%. Why?
  • Describe how multi-modal fusion reduces confusion between the "real-time streaming media" group (audio, video, VoIP) and the "interactive application" group (browsing, chat, email). Which modules contribute the most for each?
  • Removing PacketTCN causes the greatest cascade accuracy drop (−3.74% to −4.37%) compared to all other modules; could the authors explain why packet sequence information is more important than either raw bytes or Transformer features in classifying darknet traffic?
  • Please explain why ByteCNN alone has the lowest cascade accuracy while still being part of the full model. How does multi-modal complementarity help alleviate the weaknesses of the individual modules?
  • In comparing the pairs ByteCNN + PacketTCN and LSTM + Transformer, can pairs of modules achieve the performance of the full four-module model? Please describe the implications for temporal feature scales and adaptive fusion.
  • Why is Level 1 trivial in the self-collected Tor dataset, and what does this imply for the interpretation of cascade accuracy increases?
  • Discuss how the dataset characteristics (multi-platform vs. single-platform) affect F1-score variance across Level 3 categories of the multi-modal model. How do these observations relate to model generalization and real-world darknet applications?
  • Based on Figure 7, what are the trade-offs between cascade accuracy and inference time of the different methods, and how does E2E-MDC situate itself in the “perfect region” from a practical deployment perspective?
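
Several of the questions above (the hierarchical consistency loss, the conditioning of Level 2 on Level 1 predictions, and the φ21/φ32 mapping constraints) concern the same mechanism. For concreteness, the following is a minimal generic sketch of a soft-conditioned two-level head with one possible form of consistency penalty; it is not the authors' E2E-MDC implementation, and the class name, layer sizes, mapping matrix, and penalty form are all illustrative assumptions.

```python
# Illustrative sketch only: a generic soft-conditioned hierarchical head with a
# consistency penalty. NOT the authors' E2E-MDC code; all names and shapes are
# hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftConditionalHead(nn.Module):
    def __init__(self, feat_dim, n_l1, n_l2, phi21):
        super().__init__()
        self.l1_head = nn.Linear(feat_dim, n_l1)
        # Level 2 sees the fused features plus the soft Level-1 posterior, so
        # Level-1 uncertainty is propagated instead of a hard argmax decision.
        self.l2_head = nn.Linear(feat_dim + n_l1, n_l2)
        # phi21[j, i] = 1.0 iff Level-2 class j is a child of Level-1 class i.
        self.register_buffer("phi21", phi21.float())  # shape (n_l2, n_l1)

    def forward(self, feats):
        l1_logits = self.l1_head(feats)
        p1 = F.softmax(l1_logits, dim=-1)  # soft conditioning signal
        l2_logits = self.l2_head(torch.cat([feats, p1], dim=-1))
        return l1_logits, l2_logits

def hierarchy_consistency_loss(l1_logits, l2_logits, phi21):
    """One possible consistency term: penalize probability mass that Level 2
    assigns to children whose Level-1 parent receives less mass."""
    p1 = F.softmax(l1_logits, dim=-1)   # (B, n_l1)
    p2 = F.softmax(l2_logits, dim=-1)   # (B, n_l2)
    parent_mass = p2 @ phi21            # children aggregated to parents, (B, n_l1)
    return F.relu(parent_mass - p1).sum(dim=-1).mean()
```

Under this formulation the penalty is differentiable, so gradients flow through both heads during end-to-end training (e.g. loss = ce1 + ce2 + λ·consistency); a Level-3 head conditioned on the Level-2 posterior via a φ32 matrix would follow the same pattern.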

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper proposes E2E-MDC, an end-to-end multimodal framework for darknet traffic classification using a conditional hierarchical mechanism. The topic is relevant, and the integration of multiple feature extractors with conditional fusion is conceptually interesting. The study addresses a technically significant and operationally relevant problem, and the results suggest potential improvements over conventional baselines. However, the manuscript requires substantial revision before it can be considered for publication.

In the wider literature, multi-modal deep-learning and hierarchical approaches to traffic classification are not new. Earlier works, such as ODTC (2023), already integrate CNN and recurrent modules with multi-head attention for darknet traffic and demonstrate improved accuracy and reduced overhead. Recent preprint work on multimodal traffic classification explicitly fuses statistical features, time-series flows and payload representations using CNNs and LSTMs; this model learns both intra-modal and inter-modal dependencies and significantly outperforms single-modality baselines. Hierarchical classifiers distinguishing multiple darknet platforms and user behaviours have been reported since at least 2020. Consequently, the E2E-MDC framework contributes an integration of existing techniques rather than inaugurating a new paradigm.

The self-collected Tor dataset consists of only 6k samples across eight applications, which may not represent the diversity of darknet traffic. The paper does not report cross-dataset generalisation (training on one dataset and testing on another) or class-imbalance mitigation. The reported 12% improvement is measured on their own dataset and Darknet-2020, but the baselines and evaluation protocols are not fully described, making the gain hard to verify.

Critical choices such as the sliding-window size are not justified. Hyper-parameter settings, optimisation details, and training duration are absent. The authors state that the model can disable modules to meet resource constraints, but they do not demonstrate inference with reduced modules or on edge devices, leaving practical applicability unclear.
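
For reference, sliding-window segmentation with 25% overlap (the design choice in question) can be sketched as follows; the window length here is an illustrative assumption, not the manuscript's actual setting:

```python
# Minimal sketch of sliding-window segmentation with 25% overlap.
# The window length of 64 is an illustrative assumption.
def sliding_windows(packets, window=64, overlap=0.25):
    """Split a packet sequence into fixed-length, overlapping windows."""
    stride = max(1, int(window * (1 - overlap)))  # e.g. 48 for window=64
    out = []
    for start in range(0, len(packets), stride):
        chunk = packets[start:start + window]
        if len(chunk) < window:                   # zero-pad the tail fragment
            chunk = chunk + [0] * (window - len(chunk))
        out.append(chunk)
        if start + window >= len(packets):
            break
    return out

# Example: 100 packets, window 64, stride 48 -> windows starting at 0 and 48.
print(len(sliding_windows(list(range(100)))))  # 2
```

Because the stride is smaller than the window, adjacent windows share packets, which is the source of the input-correlation concern raised by Reviewer 2.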
 

The paper mentions a new Tor dataset but does not provide a link or plan for public release.  Without access to the dataset and implementation, the results are not reproducible.

The abstract states that the framework improves cascade accuracy by 12% over the state of the art. However, there is no comparison with the most recent multimodal models, and the improvement may stem from differences in datasets rather than algorithmic advances. The claim of “true end-to-end hierarchical learning” is overstated given prior work on soft conditional hierarchies.

The related-work section summarises classical machine-learning and early deep-learning approaches but omits recent multimodal and hierarchical methods, ignoring their strengths and shortcomings. A fair review should identify gaps in existing works to motivate the proposed approach.

Recommendations:

The authors should revise the manuscript to provide a balanced literature review, acknowledging existing multi-modal and hierarchical traffic classification methods and clearly positioning their contribution. They should justify design choices, release code and data, and perform cross-dataset evaluations to demonstrate generalisability. Clarifying and tempering novelty claims will strengthen the paper.

Comments on the Quality of English Language

The English is generally clear but needs moderate editing for precision and readability, particularly in the Methods and Results sections.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

Recommendation: Major revision 


If the authors address the points below (reproducibility artifacts, ethics/labeling transparency, stronger statistical analyses, robustness tests, and clearer methodological exposition), the manuscript would be publishable:

Reproducibility: Provide public code, trained model weights, and exact data-split scripts. The paper claims fixed seeds and environment details (PyTorch 1.12, RTX3090, AdamW, lr=5e-5, batch=12) but does not release artifacts to reproduce results. 
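
A minimal sketch of the kind of reproducibility artifact requested, pinning seeds alongside the training settings reported in the manuscript (AdamW, lr=5e-5, batch size 12); the seed value and the commented-out model/dataset objects are hypothetical placeholders, not released artifacts.

```python
# Sketch of a seed-pinning script for reproducibility; seed value is assumed.
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # trade speed for determinism
    torch.backends.cudnn.benchmark = False

set_seed(42)
# Reported settings from the manuscript (model/dataset are placeholders):
# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# loader = torch.utils.data.DataLoader(dataset, batch_size=12, shuffle=True)
```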

Dataset provenance & ethics: The self-collected Tor dataset (6,240 samples) must include collection methodology, labeling protocol, legal/ethical approval (IRB) and anonymisation steps. Current manuscript lists the dataset size and contents but omits ethics details. 

Data labelling & ground truth: Describe how Level-2/Level-3 labels were assigned (automated heuristics, manual labeling, inter-annotator agreement), and provide examples / annotation guidelines to assess label quality. 

Class imbalance & statistics: Although weighted sampling and label smoothing are used, report per-class confidence intervals / statistical significance for key metrics (cascade accuracy, per-class F1) and include p-values or bootstrap CIs. Current tables report point estimates only. 
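
A minimal sketch of the percentile-bootstrap confidence intervals requested here, applicable to cascade accuracy or per-class F1; the metric and the commented-out prediction arrays are placeholders.

```python
# Sketch of a percentile bootstrap CI for any metric(y_true, y_pred).
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Resample the test set with replacement and return the (lo, hi) CI."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample indices
        stats.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Example with accuracy as the metric (test arrays are placeholders):
acc = lambda t, p: (t == p).mean()
# lo, hi = bootstrap_ci(test_labels, model_preds, acc)
```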

Ablation & sensitivity analyses: Provide sensitivity to (a) loss weight choices (0.3/0.3/0.4 reported), (b) window size / overlap, and (c) hyperparameters (lr, batch). The paper reports ablation on modules but lacks hyperparameter sensitivity studies.

Dataset generalisability: Evaluate cross-dataset generalisation (train on public dataset --> test on self-collected Tor and vice versa) and/or include an external holdout to show robustness beyond the two datasets used. 

Baselines / fairness of comparison: Provide full baseline implementation details and hyperparameters (to support the claim of "fair training" for baselines) and include code for reproduced baselines. The manuscript states unified preprocessing but lacks baseline configs. 

Adversarial / real-world robustness: Test model under network adversities (deliberate packet loss, padding/obfuscation, traffic mixing, active adversarial perturbations). Augmentations are used in training, but no adversarial robustness experiments are shown. 

Privacy/security implications: Discuss potential misuse risks (deanonymisation) and safeguards; justify collection/usage of darknet traces and state compliance with applicable laws/regulations. 

Computational & deployment analysis: Report model parameter count, memory footprint, and latency/memory tradeoffs for edge deployment (inference time is given, but memory and practical deployment constraints are missing). 

Clarity and minor corrections: Improve clarity of several methodological descriptions (precise definition of mapping functions φ21/φ32 and loss weight rationale), fix small writing/figure/table inconsistencies, and ensure all equations are numbered and explained.

Request for additional experiments (short): add significance testing, per-class confusion matrices for all datasets, and at least one external validation (or public release of the self-collected benchmark) before acceptance.

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Basically, I believe the authors have carefully improved this paper. However, the sizes of the figures need to be revised: they are too small to read, excessive blank space should be deleted, and the text in the figures should be similar in size to the main text.

Author Response

Comment 1: The sizes of the figures need to be revised: they are too small to read, excessive blank space should be deleted, and the text in the figures should be similar in size to the main text.

Response:

We sincerely appreciate the reviewer's meticulous attention to the figure formatting issues. Your detailed feedback has greatly improved the visual quality and readability of our manuscript.

Following your suggestions, we have carefully examined and revised all figures throughout the manuscript:

  1. Font size adjustment: We have specifically increased the font sizes in Figures 1-5, where the text elements were previously too small. All labels, axis titles, legends, and annotations in these figures have been enlarged to ensure they are comparable to the main text size and clearly readable.
  2. Whitespace optimization: We have reduced the excessive vertical spacing (top and bottom margins) in Figures 1-5, making the figures more compact and visually balanced while maintaining clarity.
  3. Additional refinements: Beyond Figures 1-5, we have also reviewed and made appropriate adjustments to other figures in the manuscript to ensure consistency in presentation quality across all visual elements.

These comprehensive revisions have significantly enhanced the professional appearance and readability of the figures. We believe the revised figures now meet the journal's standards and provide readers with a much better viewing experience.

Thank you again for your valuable guidance in improving our manuscript.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed all the comments. 

Author Response

We sincerely thank you for your positive evaluation and recommendation for acceptance. We are grateful for your recognition of our work and the valuable feedback you provided during the previous review rounds, which significantly contributed to improving the quality and clarity of this manuscript. Your constructive suggestions have been instrumental in strengthening both the technical content and presentation of our research.

Reviewer 3 Report

Comments and Suggestions for Authors

Nice progress. The related work is deeper, the experimental setup is clearer, and the ablations help justify design choices. The conditional hierarchy and the explicit consistency loss are the genuinely interesting parts.


Major issues still open:

 • The abstract and contribution list still use the phrase “true end-to-end hierarchical learning,” which contradicts the moderated phrasing claimed in the response letter. Please align the abstract and contribution bullets with the softer wording already used in the response.
 • Cross-dataset generalisation is asserted as “validation on two datasets,” but a train-on-A, test-on-B experiment is not clearly reported. Consider adding that explicit transfer test or rewriting the claim.
 • The “first application of the [CLS] token to traffic classification” reads as an absolute claim. Either add a precise citation survey that supports this or soften the sentence.

Comments on the Quality of English Language

The English is generally clear but needs moderate editing for precision and readability, particularly in the Methods and Results sections.

 

Author Response

Comment 1: "The abstract and contribution list still use the formula 'true end-to-end hierarchical learning,' which contradicts the moderated phrasing claimed in the response letter. Please align the abstract and contribution bullets with the softer wording already used in the response."

Response 1: Thank you for identifying this important inconsistency. We apologize for not thoroughly aligning all instances of this phrasing in the previous revision. We have now carefully revised the contribution list to use more moderate and technically precise wording:

Location: Line 197 (First contribution bullet point)

  • Original: "Unlike traditional independent multi-task learning or hard cascade classification, this paper achieves true end-to-end hierarchical learning through soft conditioning design."
  • Revised: "Unlike traditional independent multi-task learning or hard cascade classification, this paper implements end-to-end hierarchical learning through a soft conditioning design that enables joint optimization across all classification levels."

Rationale: This revision removes the absolute qualifier "true" while providing a more precise technical description of our approach. The new wording accurately conveys that our method enables joint optimization across hierarchical levels through soft conditioning, which is the key technical contribution, without making an overstated claim.

We have also verified that no similar absolute phrasing remains in the abstract or other sections of the manuscript.

Comment 2: "Cross dataset generalisation is asserted as 'validation on two datasets,' but a train on A test on B experiment is not clearly reported. Consider adding that explicit transfer test or rewrite the claim."

Response 2: We greatly appreciate the reviewer pointing out this ambiguity in our experimental claims. You are absolutely correct that our experimental design does not include cross-dataset transfer experiments (i.e., training on dataset A and testing on dataset B). Instead, we conducted independent training and testing on each dataset separately to evaluate the method's effectiveness across different traffic scenarios and data distributions.

We have revised all relevant statements throughout the manuscript to accurately reflect our experimental methodology and avoid any misinterpretation regarding cross-dataset generalization:

Location 1: Line 178 (Introduction section)

  • Original: "requires classification methods to possess good generalization capability and robustness against evasion techniques"
  • Revised: "requires classification methods to possess good robustness and adaptability to handle diverse traffic patterns and evasion techniques"
  • Rationale: This revision focuses on the general requirements for darknet traffic classification methods without implicitly claiming cross-dataset generalization capability.

Location 2: Line 934 (Experimental Design section)

  • Original: "This study employs two representative darknet traffic datasets for experimental validation to comprehensively evaluate the effectiveness and generalization capability of the proposed method."
  • Revised: "This study employs two representative darknet traffic datasets for experimental validation to comprehensively evaluate the effectiveness and robustness of the proposed method across different traffic scenarios."
  • Rationale: This clarification accurately describes our experimental approach—using two independent datasets to assess the method's effectiveness in different traffic scenarios, rather than claiming cross-dataset generalization.

Location 3: Line 1348 (Conclusion - Limitations section)

  • Original: "its generalization capability to emerging darknet protocols requires further validation"
  • Revised: "its effectiveness on emerging darknet protocols and unseen traffic patterns requires further validation"
  • Rationale: Even in discussing limitations, we have aligned the terminology to be consistent with our actual experimental design while appropriately acknowledging that testing on emerging protocols remains future work.

Summary: These revisions ensure that our claims accurately reflect the experimental validation performed (independent training and testing on two datasets) without overstating cross-dataset generalization capability that was not explicitly tested.

 

Comment 3: "The 'first application of the [CLS] token to traffic classification' reads as an absolute claim. Either add a precise citation survey that supports this or soften the sentence."

Response 3:

We appreciate the reviewer's careful attention to the precision of our claims. You are correct that without comprehensive citation evidence, such an absolute statement is inappropriate. We have revised this contribution point to focus on the technical merit of our approach rather than making a priority claim:

Location: Line 199 (Second contribution bullet point)

  • Original: "Transformer introduces [CLS] token mechanism for the first time to provide global anchors for traffic classification."
  • Revised: "Transformer employs a [CLS] token mechanism to provide global sequence representation for traffic classification."

Rationale: This revision removes the absolute claim of "first application" and instead emphasizes the technical contribution—using the [CLS] token to capture global sequence-level representations in the context of hierarchical darknet traffic classification. This phrasing maintains the technical significance of our design choice without making an unsubstantiated priority claim.

We have also verified that no other instances of absolute "first" claims appear elsewhere in the manuscript regarding the [CLS] token or other technical components.

Reviewer 4 Report

Comments and Suggestions for Authors

Round 1 comments and recommendations were addressed. I recommend acceptance.

Author Response

We sincerely thank you for your positive evaluation and recommendation for acceptance. We are grateful for your recognition of our work and the valuable feedback you provided during the previous review rounds, which significantly contributed to improving the quality and clarity of this manuscript. Your constructive suggestions have been instrumental in strengthening both the technical content and presentation of our research.
