Trustworthy Deep Learning for Cybersecurity: A Structured Review Across Detection, Robustness, Privacy, Explainability, and Deployment

Ghayoumi, Mehdi; Ghazinour, Kambiz; Marrero, Anthony; Barmas, Dena; Cook, Cameron; May, Michael; Liu, Cory; Johnson, Behnaz; Fofana, Amadu

doi:10.3390/electronics15112421

by Mehdi Ghayoumi *, Kambiz Ghazinour, Anthony Marrero, Dena Barmas, Cameron Cook, Michael May, Cory Liu, Behnaz Johnson and Amadu Fofana

Department of Cybersecurity, State University of New York (SUNY) Canton, Canton, NY 13617, USA

* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2421; https://doi.org/10.3390/electronics15112421

Submission received: 13 May 2026 / Revised: 30 May 2026 / Accepted: 31 May 2026 / Published: 2 June 2026

(This article belongs to the Special Issue Novel Approaches for Deep Learning in Cybersecurity)

Abstract

Deep learning is increasingly used in cybersecurity to detect, classify, prioritize, and explain evidence from network traffic, logs, binaries, graphs, text, code, and multimodal telemetry. However, the literature remains fragmented across tasks, datasets, architectures, trustworthiness properties, and deployment settings, making it difficult to judge whether benchmark performance transfers to operational cyber defense workflows. This paper presents a structured narrative review with an evidence-oriented synthesis, not a Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-counted systematic review. The synthesis uses a de-duplicated cited-source bibliography of 115 references as an evidence-mapping corpus; this corpus is reported for transparency and is not presented as a PRISMA final-inclusion set. The evidence map is organized through a five-axis framework: security task, data modality, model family, trustworthiness property, and deployment environment. In response to methodological and scope concerns common in broad survey work, the revision narrows the claims to a transparent cited-source synthesis, defines explicit inclusion boundaries, adds a data-charting codebook, reports non-exclusive coded emphasis matrices, and introduces practical tables for dataset selection, split protocols, deployment-reporting targets, and large language model (LLM)-enabled security operations center (SOC) risk controls. Across application areas, the reviewed literature indicates that benchmark accuracy is necessary but insufficient. Deployment readiness also depends on adversarial robustness, privacy protection, explainability, uncertainty calibration, drift handling, reproducibility, resource-aware resilience, and computational feasibility. The review identifies persistent gaps in temporal validation, cross-dataset testing, analyst-centered explanation, secure learning pipelines, agentic-LLM safety, and edge-aware deployment. The resulting research agenda emphasizes accurate, resilient, privacy-aware, explainable, reproducible, and deployable cybersecurity artificial intelligence systems.

Keywords: deep learning; cybersecurity; intrusion detection; adversarial robustness; privacy-preserving learning; explainable artificial intelligence; federated learning; large language models; deployment readiness; structured narrative review

1. Introduction

1.1. Context and Background

Cybersecurity now operates in an environment defined by broader attack surfaces, heterogeneous data sources, cloud–edge–Internet of Things (IoT) infrastructure, and increasingly adaptive adversaries. Traditional signature-based and rule-based defenses remain useful for known attacks, but they are often inadequate for zero-day exploits, polymorphic malware, evasive traffic, insider threats, and large-scale telemetry generated by modern enterprise systems. Deep learning has therefore become central to cybersecurity because it can learn complex nonlinear patterns from raw or lightly processed security evidence and can support detection, classification, prioritization, and analyst assistance at a scale that is difficult for purely manual approaches [1,2,3,4,5].

1.2. Motivation and Research Gap

The field has progressed rapidly, but progress is uneven. Many studies report strong accuracy on controlled benchmarks, especially for intrusion detection and malware classification, while fewer studies test whether these gains remain reliable under temporal drift, cross-dataset transfer, adversarial manipulation, privacy constraints, or real-time deployment requirements [4,6,7]. Benchmark performance alone is therefore insufficient. A model can perform well in testing and still fail in practice when the deployment data differ from the training data, when attackers intentionally manipulate inputs, when sensitive telemetry leaks through model outputs or updates, or when inference cost is too high for operational use [8,9,10,11]. The central problem is not that benchmarks are useless; rather, better and more realistic benchmarks, stronger validation protocols, and clearer reporting are needed to judge deployment readiness.

1.3. Objective and Review Scope

The objective of this paper is to provide a structured narrative review and evidence-oriented map of deep learning (DL) for cybersecurity that connects applications, architectures, data types, trustworthiness concerns, and deployment settings in one unified synthesis. The review covers current and recent research on intrusion detection, malware detection, phishing detection, biometric authentication, cyber threat intelligence (CTI), and multimodal security analytics. It examines established architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory networks (LSTMs), gated recurrent units (GRUs), and autoencoders, together with newer directions including graph neural networks (GNNs), transformers, large language models (LLMs), and federated deep learning [3,5,9,11,12,13]. It also evaluates limitations related to adversarial robustness, privacy, explainability, uncertainty, datasets, reproducibility, and computational feasibility.

1.4. Five-Axis Framework

To organize this broad literature, the survey adopts a five-axis framework: security task, data modality, model family, trustworthiness property, and deployment environment. The individual axes are not presented as entirely new categories; prior surveys have addressed many of them separately or in partial combinations. The contribution here is their operational integration into one review lens for comparing whether a cybersecurity deep learning study aligns its task, evidence source, architecture, trust requirement, and deployment setting. Datasets, benchmarks, evaluation practices, and reproducibility are treated as cross-cutting methodological concerns that affect all five axes rather than as separate background issues. This framing helps distinguish studies that appear similar at the model level but differ substantially in task assumptions, input structure, threat model, data quality, latency requirements, or deployment context. It also makes clear why models should not be compared only by accuracy; they should be compared by how well the task, data, architecture, trust requirements, and operational environment align.

1.5. Paper Organization

The rest of the paper is organized as follows. Section 2 describes the structured narrative-review methodology, including search strategy, eligibility criteria, screening, citation tracking, data-charting, and synthesis. Section 3 presents the search results and evidence-oriented map. Section 4 presents the conceptual background and taxonomy. Section 5 reviews major application domains. Section 6 examines trustworthiness dimensions, including robustness, privacy, explainability, uncertainty, and lifecycle security. Section 7 evaluates datasets, benchmarks, evaluation practices, and reproducibility as methodological foundations of the five-axis framework. Section 8 outlines open challenges and future research directions. Section 9 states limitations, and Section 10 concludes the paper. Table 1 positions this survey relative to recent review articles and highlights the need for an integrated synthesis spanning applications, architectures, trustworthiness, deployment, datasets, and evaluation realism.

2. Review Methodology

2.1. Review Design and Rationale

This study is reported as a structured narrative review with evidence-mapping components for deep learning in cybersecurity. This design is appropriate because the literature is broad, heterogeneous, and rapidly evolving across security tasks, data types, model families, trustworthiness properties, and deployment settings. Whereas narrowly focused effectiveness reviews aim to estimate a single pooled effect, a structured evidence-oriented synthesis is better suited to clarifying concepts, comparing methodological patterns, identifying cross-cutting gaps, and organizing a fast-moving field [15,16,17,18]. This is particularly important here because deep learning-based cybersecurity research covers intrusion detection, malware analysis, phishing detection, biometric authentication, cyber threat intelligence, adversarial defense, privacy-preserving learning, explainable AI, and deployment settings such as cloud, edge, and IoT, all of which rely on different datasets, metrics, and evaluation assumptions [15,16,18]. The review methodology was informed by scoping review guidance, Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-style transparency principles, snowballing guidance for software engineering evidence synthesis, and updated Joanna Briggs Institute guidance. However, this manuscript does not present itself as a PRISMA-counted systematic scoping review because a complete dated screening export was not available for verification [16,17,18,19,20,21]. The manuscript therefore makes a narrower and more verifiable claim: it is a transparent structured narrative synthesis that uses explicit search-string families, eligibility boundaries, citation tracking, a charting codebook, non-exclusive evidence counts, and critical methodological appraisal to organize the cited literature. This wording is used consistently throughout the paper to avoid overstating the reproducibility of the original search-and-screening history.

2.2. Review Objectives and Research Questions

The main objective of this review is to synthesize how deep learning has been applied in cybersecurity while evaluating not only model families but also data modalities, robustness, privacy, explainability, evaluation realism, and deployment feasibility. Consistent with scoping review guidance, the review scope is framed using the Population–Concept–Context (PCC) structure: the population is cybersecurity studies, the concept is deep learning and related neural architectures, and the context includes enterprise, cloud, edge, IoT, industrial, and multimodal cyber defense environments [18,22]. The review addresses six questions: (1) Which cybersecurity tasks are most often studied with deep learning? (2) Which data modalities and model families are used? (3) How do studies assess performance, robustness, privacy, explainability, and operational feasibility? (4) Which datasets, benchmarks, and validation practices are most common? (5) What methodological weaknesses limit generalizability and real-world use? and (6) Which directions appear most promising for trustworthy and deployable deep learning in cybersecurity? These questions align with the role of structured evidence-mapping and scoping-oriented reviews in mapping evidence, clarifying concepts, and identifying knowledge gaps in complex and fast-growing fields [15,17,18].

2.3. Information Sources and Search Strategy

A broad search strategy was used because cybersecurity research spans computer science, engineering, information security, artificial intelligence, and applied data science. The review searched Scopus, Web of Science, IEEE Xplore, ACM Digital Library, ScienceDirect, and SpringerLink. These databases were selected to capture both cybersecurity venues and artificial intelligence venues. The search strategy combined three conceptual blocks: cybersecurity task terms, deep learning terms, and review or evaluation terms. Representative search-strings included: (“deep learning” OR “neural network” OR CNN OR RNN OR LSTM OR transformer OR “large language model” OR “graph neural network” OR “federated learning”) AND (cybersecurity OR “cyber security” OR “intrusion detection” OR malware OR phishing OR “cyber threat intelligence” OR authentication OR “security analytics”); (“deep learning” AND “intrusion detection system” AND (dataset OR benchmark OR evaluation OR reproducibility)); (“adversarial machine learning” AND cybersecurity AND (robustness OR evasion OR poisoning OR backdoor)); and (“explainable AI” OR XAI) AND cybersecurity. The syntax was adapted to each database field structure, but the conceptual blocks were kept consistent across sources. Searches were designed to cover the full available database period within each source, and the search strategy was documented for reproducibility across databases [17,18,21,22].
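As a rough illustration of how the conceptual blocks described above combine into a Boolean query, the sketch below joins the terms inside each block with OR and joins the blocks with AND. The function names `or_block` and `build_query` are hypothetical, the term lists are abbreviated from the representative strings, and database-specific field syntax is not modeled.

```python
def or_block(terms):
    """Join terms with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*blocks):
    """AND together several OR-blocks of search terms."""
    return " AND ".join(or_block(b) for b in blocks)


# Abbreviated term lists taken from the representative strings above.
dl_terms = ["deep learning", "neural network", "CNN", "transformer"]
task_terms = ["cybersecurity", "intrusion detection", "malware", "phishing"]

query = build_query(dl_terms, task_terms)
print(query)
```

Keeping the blocks as separate lists is one way to hold the conceptual structure constant while only the rendering step changes per database.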

2.4. Search Date and Search Update

The manuscript, bibliography, and supplementary reference-audit files for this submission package were finalized on 13 May 2026. A complete dated database export and full screening log were not available for independent verification during this revision. For that reason, the manuscript is explicitly framed as a structured narrative review with evidence-mapping components rather than as a PRISMA-counted systematic scoping review. The search strategy, eligibility criteria, charting fields, supplementary search-strings, and evidence-mapping framework are reported to support transparency, but the manuscript does not report unverified numerical PRISMA counts. Instead, the Supplementary Materials provide a reproducibility package for the revised manuscript: search string families, a cited-source audit list, a data-charting codebook, an exclusion-reason taxonomy, a non-exclusive coding matrix, and a reviewer-concern coverage checklist. These files support independent inspection of how the cited-source synthesis was organized without retrospectively claiming a verified PRISMA record flow. Backward and forward citation tracking was retained as part of the methodology because keyword searches alone can miss influential work that uses different terminology. Backward tracking reviewed the reference lists of key reviews and seminal papers to identify earlier foundational studies. Forward tracking examined later studies that cited those core papers to capture newer developments, including work on GNNs, LLMs, federated intrusion detection, adversarial robustness, explainable AI, and deployment-aware cybersecurity systems [21].

2.5. Eligibility Criteria

Eligibility criteria were defined before screening, with minor refinement during pilot screening, consistent with scoping review guidance [16,17,18]. Studies were included when they met all of the following conditions: they addressed a genuine cybersecurity problem, used deep learning or a modern neural architecture as a substantive part of the method, reported a methodological contribution or empirical evaluation relevant to the review questions, and appeared in a peer-reviewed journal, conference proceeding, or book chapter with sufficient technical detail. Studies were excluded when they focused on non-cybersecurity domains, used only traditional machine learning without a meaningful deep learning component, were editorials, abstracts, patents, theses, tutorials, posters, or opinion pieces without sufficient technical detail, were duplicate records, or did not provide enough information to identify the task, model, dataset, and evaluation design. The review was limited to English-language publications to support consistent screening, coding, and cross-study comparison [18,19,20].

Table 2 summarizes the operational boundaries applied during the revision. These boundaries also respond to the concern that a survey covering IDS, malware, phishing, biometrics, CTI, LLMs, federated learning, robustness, privacy, XAI, and deployment can become too broad unless each cited source is interpreted through the same review questions.
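The inclusion and exclusion conditions above can be read as a conjunction of checks on each candidate record. The following is a minimal sketch under that reading, not the screening tool used in the review. The `Record` fields, the `ACCEPTED_TYPES` labels, and the reason strings are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List

# Publication types the criteria accept; the labels are illustrative.
ACCEPTED_TYPES = {"journal", "conference", "book_chapter"}


@dataclass
class Record:
    is_cybersecurity: bool      # addresses a genuine cybersecurity problem
    uses_deep_learning: bool    # deep learning is a substantive part of the method
    has_evaluation: bool        # methodological contribution or empirical evaluation
    has_technical_detail: bool  # task, model, dataset, and evaluation are identifiable
    publication_type: str       # e.g. "journal", "thesis", "poster"
    language: str               # publication language code, e.g. "en"
    is_duplicate: bool = False


def exclusion_reasons(r: Record) -> List[str]:
    """Return every reason a record fails; an empty list means eligible."""
    reasons = []
    if not r.is_cybersecurity:
        reasons.append("non-cybersecurity domain")
    if not r.uses_deep_learning:
        reasons.append("no substantive deep learning component")
    if not r.has_evaluation:
        reasons.append("no relevant contribution or evaluation")
    if not r.has_technical_detail:
        reasons.append("insufficient technical detail")
    if r.publication_type not in ACCEPTED_TYPES:
        reasons.append("ineligible publication type")
    if r.language != "en":
        reasons.append("non-English publication")
    if r.is_duplicate:
        reasons.append("duplicate record")
    return reasons


paper = Record(True, True, True, True, "journal", "en")
thesis = Record(True, False, True, True, "thesis", "en")
print(exclusion_reasons(paper))
print(exclusion_reasons(thesis))
```

Returning all failing reasons, rather than stopping at the first, keeps the output compatible with an exclusion-reason taxonomy in which one record can fail on several grounds.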

2.6. Study Selection Procedure

Candidate records were managed through a reference-management and screening workflow. The selection process used staged relevance screening, first at title-and-abstract level and then at full-text level when eligibility was unclear. Records were excluded when they lacked a substantive deep learning method, did not address a cybersecurity task, provided insufficient technical detail, duplicated another publication, or did not contain usable information about task, model, dataset, and evaluation design. Backward and forward citation tracking was then applied to key reviews and seminal papers to identify additional eligible studies that may not have been retrieved by database searches alone [21]. Figure 1 summarizes the structured search, screening, and evidence-mapping workflow used in this review.

2.7. Data-Charting and Extraction

After source selection, a structured data-charting form was used to extract the information needed to answer the review questions. Scoping review guidance recommends capturing not only bibliographic details but also conceptual and contextual features that support evidence-mapping across heterogeneous studies [16,17,18,19]. For each source used in the evidence map, the review charted publication year, venue, application domain, cybersecurity task, deployment environment, data modality, deep learning model family, learning setting, dataset(s), evaluation metrics, validation design, external or cross-dataset testing, adversarial evaluation, privacy-preserving mechanism, explainability method, computational considerations, deployment evidence, and reproducibility indicators such as code or data availability. The form also captured factors often underreported but important for real-world cyber defense, including zero-day evaluation, dataset realism, possible data leakage, latency or resource use, whether privacy guarantees were formal or only conceptual, and whether explanations were assessed from a human analyst perspective. This broader charting approach supported the review’s goal of evaluating not only deep learning models but also their operational credibility and trustworthiness in cybersecurity settings [15,17,18]. The charting form was piloted during the initial calibration stage and revised before full extraction. Revisions clarified the coding of deployment environment, adversarial evaluation, privacy mechanism, explainability method, and reproducibility indicators. Variable definitions were documented to keep the coding process reproducible.

The charting form is shown in Table 3. It is included in the manuscript because the evidence map claim depends on readers being able to inspect which variables were extracted and how those variables connect to the five-axis framework.

2.8. Data Synthesis and Evidence-Mapping

Because the cited sources differ widely in tasks, models, datasets, and outcome measures, this review uses descriptive and thematic synthesis rather than formal meta-analysis. This is consistent with the purpose of structured evidence-mapping, which is to map the range, characteristics, and gaps of the literature rather than estimate a pooled effect [15,16,17,18,19]. The synthesis is organized through the five-axis framework introduced in Section 1 and formalized in Section 4: security task, data modality, model family, trustworthiness property, and deployment environment. Section 7 is explicitly connected to this framework because datasets, benchmarks, validation protocols, reproducibility, and computational reporting determine whether evidence on any of the five axes is credible. A model–task pairing cannot be judged fairly if the dataset is outdated, the split design leaks future information, the benchmark does not represent the deployment environment, or runtime cost is unreported. The synthesis therefore combines descriptive mapping with critical evaluation of methodological maturity.
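The point about split designs that leak future information can be made concrete with a small check. The sketch below uses synthetic timestamped events, not data from any reviewed study. It compares a chronological split with an interleaved split that stands in for random shuffling. The helper names `chronological_split` and `leaks_future` are hypothetical.

```python
def chronological_split(events, train_fraction=0.8):
    """Split (timestamp, label) events so all training events precede all test events."""
    ordered = sorted(events, key=lambda e: e[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def leaks_future(train, test):
    """True if any training event is later than the earliest test event."""
    return max(t for t, _ in train) > min(t for t, _ in test)


# Synthetic events: timestamps 0..99 with arbitrary binary labels.
events = [(t, t % 2) for t in range(100)]

# Chronological split: train on timestamps 0..79, test on 80..99.
train, test = chronological_split(events)
print(leaks_future(train, test))

# Interleaved split: every fifth event is held out, so training data
# surrounds the test events in time.
mixed_test = events[::5]
mixed_train = [e for i, e in enumerate(events) if i % 5 != 0]
print(leaks_future(mixed_train, mixed_test))
```

The first check reports no leakage and the second reports leakage, which is why the same model can look stronger under a shuffled split than under a temporal one.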

2.9. Methodological Appraisal Strategy

Formal risk-of-bias assessment was not used because the goal is to map evidence and methodological patterns rather than estimate causal or intervention effects [17,18,19]. However, because cybersecurity studies often rely on weak evaluation designs, benchmark dependence, and incomplete reporting, this review includes a structured methodological appraisal as part of data-charting and synthesis. Rather than excluding studies based on a single score, the review records quality-related features that affect the credibility and transferability of findings. These include dataset recency and realism, treatment of class imbalance, leakage-aware validation, cross-dataset testing, zero-day analysis, adversarial testing, privacy analysis, explainability evaluation, computational reporting, and code availability. This approach preserves the inclusive purpose of evidence-mapping while still supporting critical analysis of research quality and field maturity [15,17,18]. In this review, appraisal is not used to filter studies out, but to identify recurring weaknesses and distinguish controlled benchmark studies from work closer to real-world deployment.

Because the original screening export was unavailable, the revision adds a codebook-consistency audit rather than claiming formal inter-rater reliability for the original search. Ambiguous records were checked against the operational definitions in Table 2 and Table 3, and coding was resolved by consensus at the level of category labels rather than by a single numerical score. This audit does not replace an independently preserved dual-screening log, but it improves transparency by making the coding categories, inclusion boundaries, and evidence-count logic visible in the main manuscript and Supplementary Materials.

2.10. Reproducibility and Protocol Transparency

To improve reproducibility and reduce ambiguity, this review used a documented protocol structure covering the research questions, databases, representative search-strings, eligibility criteria, screening process, charting fields, synthesis strategy, and appraisal approach. Protocol transparency helps improve consistency, limit unplanned methodological drift, and make evidence synthesis more auditable [20,22]. The protocol was not registered before screening. This absence of preregistration is acknowledged as a limitation, but the review process was documented through the search strategy, eligibility criteria, screening workflow, charting form, and evidence-mapping materials. Overall, this design balances breadth and rigor. It is broad enough to capture the landscape of deep learning in cybersecurity, yet structured enough to support reproducible searching, structured charting, and critical synthesis [15,16,17,18,19,20,21,22]. Table 4 summarizes the review protocol, including the review design, information sources, eligibility criteria, screening process, and synthesis strategy used in this survey.

3. Search Results and Evidence-Oriented Map

3.1. Search and Selection Transparency

A verified exported search log, duplicate-removal report, title-and-abstract screening count table, and full-text exclusion log were not available for this revision. Therefore, this manuscript does not report PRISMA-ScR record counts or claim a count-verified systematic screening result. The supplementary reviewed-source reference list contains a de-duplicated set of 115 cited references used to support the synthesis. This number is reported only as the size of the transparent cited-source bibliography, not as a PRISMA final-inclusion count. The revised manuscript therefore avoids claiming a fully reproducible systematic review and instead presents a structured narrative synthesis with evidence-mapping components organized around the five review axes: security task, data modality, model family, trustworthiness property, and deployment environment.

3.2. Distribution of the Cited-Source Corpus

The cited-source corpus was charted using non-exclusive codes because many papers address more than one cybersecurity task, model family, or trustworthiness dimension. For example, a federated intrusion detection survey may be coded under intrusion detection, privacy-preserving learning, deployment environment, and trustworthiness. Table 5 therefore reports coded emphases rather than mutually exclusive PRISMA-style study counts. This approach gives readers a clearer view of the evidence base while avoiding unsupported claims about final-inclusion numbers. The source-level coding template, supplementary materials index, and coded emphasis summaries are provided in the Supplementary Materials so that readers can inspect the coding logic without treating the values as PRISMA final-inclusion counts or systematic inclusion statistics.
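A short sketch may help show why non-exclusive coded emphases cannot be read as study counts. The source identifiers and code assignments below are hypothetical and do not reproduce Table 5.

```python
from collections import Counter

# Hypothetical coded sources; each source may carry several codes.
coded_sources = {
    "source_A": {"intrusion detection", "privacy-preserving learning", "deployment environment"},
    "source_B": {"malware detection", "adversarial robustness"},
    "source_C": {"intrusion detection", "explainability"},
}

# Count how many sources carry each code.
emphasis_counts = Counter(
    code for codes in coded_sources.values() for code in codes
)

total_emphases = sum(emphasis_counts.values())
print(emphasis_counts["intrusion detection"])
print(total_emphases, len(coded_sources))
```

Here three sources yield seven coded emphases, so the column totals of such a table exceed the number of sources by construction.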

3.3. Five-Axis Evidence Map

Table 6 summarizes how the main application areas populate the five-axis framework. Compared with a purely qualitative map, the revised table links each domain to prominent evidence patterns, recurring evaluation weaknesses, and trustworthiness issues that directly affect deployment readiness.

3.4. Bibliographic Age and Coverage Profile

To make the evidence base more measurable without overstating the methodology, the cited-source corpus was also summarized at the bibliographic level. Table 7 reports the publication year profile of the 115 cited sources. This summary describes the cited-source bibliography and is not a substitute for a verified full-text screening log, but it helps readers judge the freshness of the reviewed evidence and the balance between foundational work and recent research.

The charting process also shows why some quantitative summaries requested in systematic evidence maps cannot be validly reported from the available records. For example, a defensible median dataset age for IDS studies would require study-level extraction of every dataset release date, dataset variant, preprocessing pipeline, and temporal split used by each empirical paper. Many cited surveys aggregate multiple datasets without consistent release-date metadata. The revised manuscript therefore reports coded evidence distributions and methodological patterns, but it avoids unsupported numerical claims that would imply a verified full-text extraction log.

To make the evidence map more inspectable, Table 8 and Table 9 report non-exclusive coded emphasis matrices from the cited-source coding. The entries should be read as overlapping coded emphases rather than mutually exclusive study counts or systematic inclusion statistics. Their purpose is to indicate which combinations appear more prominent in the synthesis and where the literature appears comparatively thin.

These matrices make the earlier qualitative synthesis more transparent. They suggest that IDS remains a prominent empirical cluster in the cited-source bibliography; transformer/LLM work is concentrated in CTI, software security, and analyst support; federated and multimodal work appears across tasks but is still less mature in reproducibility and deployment evidence; and robustness, privacy, calibration, explainability, and cost are still rarely assessed together within the same study.
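One way such an emphasis matrix could be assembled from non-exclusive codes is a simple cross-tabulation, sketched below. The sources, task labels, and trustworthiness labels are hypothetical and are not the values reported in Table 8 or Table 9.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sources, each with non-exclusive task and trust-property codes.
sources = [
    {"tasks": {"IDS"}, "trust": {"robustness", "privacy"}},
    {"tasks": {"IDS", "CTI"}, "trust": {"explainability"}},
    {"tasks": {"malware"}, "trust": {"robustness"}},
]

# Each (task, property) pair coded on a source adds one to that cell.
matrix = defaultdict(int)
for s in sources:
    for task, prop in product(s["tasks"], s["trust"]):
        matrix[(task, prop)] += 1

tasks = sorted({t for s in sources for t in s["tasks"]})
props = sorted({p for s in sources for p in s["trust"]})

print(props)
for task in tasks:
    print(task, [matrix.get((task, p), 0) for p in props])
```

Empty cells, such as CTI with privacy in this toy input, correspond to the combinations that the text describes as comparatively thin.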

4. Conceptual Background and Taxonomy of Deep Learning for Cybersecurity

4.1. Conceptual Background

Deep learning in cybersecurity is not a single problem class: it operates across heterogeneous data sources, attack objectives, and deployment environments. General deep learning practice emphasizes that model choice should be driven by data structure, loss design, validation strategy, and deployment constraints rather than by architecture popularity alone [23]. Security data may include network flows, packet sequences, log streams, application programming interface (API) call traces, binaries, interaction graphs, threat reports, or multimodal combinations. Deployment may occur in cloud systems, enterprise endpoints, or resource-constrained edge and Internet of Things (IoT) devices. As a result, effective models must address not only representation learning but also distribution shift, adversarial behavior, privacy, interpretability, and deployment cost [3,13,24,25].

To improve readability, several specialized cybersecurity terms are used in their standard sense throughout this review. An intrusion detection system (IDS) monitors network, host, or application activity to identify suspicious or malicious behavior. Cyber threat intelligence (CTI) refers to structured or unstructured information about threats, adversaries, vulnerabilities, indicators of compromise, malware, campaigns, and attack patterns. A zero-day attack exploits a vulnerability or behavior that defenders have not yet observed or labeled. Data drift means that the distribution of input data changes over time, while concept drift means that the relationship between data and labels changes as attackers, users, software, or infrastructure evolve.

Cybersecurity data are also often structured by time and relations rather than fixed independent features. Some tasks are sequential, such as command traces, event logs, and API-call streams. Others are relational, such as attack graphs, communication graphs, provenance graphs, and host–user–process relationships. Still others are multimodal, combining text, telemetry, and behavioral metadata. Because of this, deep learning architectures are not interchangeable. Recurrent networks and long short-term memory (LSTM) models are well suited to sequential dependence, graph neural networks (GNNs) support relational reasoning, transformers capture long-range dependencies through attention, and multimodal models support cross-source alignment and fusion [24,26,27,28].

Cybersecurity is further shaped by an open and evolving environment. Attack strategies, user behavior, software stacks, and network conditions change over time, so training and deployment distributions rarely remain stable. This creates concept drift, data drift, class imbalance, and limited availability of fresh labeled data. These challenges are especially important in anomaly detection, zero-day discovery, insider-threat monitoring, and continuous security analytics, where outdated models may silently degrade or generate unreliable alerts. Therefore, any rigorous survey of deep learning for cybersecurity must treat adaptation and realistic evaluation as core concerns [1,29,30,31,32,33].

Finally, cybersecurity is inherently adversarial. Unlike models in benign domains, defensive models may be probed, evaded, poisoned, reverse-engineered, or manipulated throughout the artificial intelligence (AI) lifecycle. This makes adversarial robustness, privacy, and explainability especially important. Recent NIST guidance on adversarial machine learning frames attacks in terms of attacker goals, capabilities, knowledge, and lifecycle stage, while explainable artificial intelligence (XAI) studies show that black-box predictions are often insufficient in high-stakes settings where analysts must inspect, trust, or challenge model outputs. In practice, these issues directly affect false-positive triage, incident response, analyst trust, and the safe operational use of AI-generated insights [10,34,35,36].
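The distinction between data drift and concept drift drawn in this subsection is often written in terms of probability distributions. The notation below is a common convention and an assumption here, not notation taken from the review: $x$ denotes inputs, $y$ denotes labels, and $P_t$ is the data-generating distribution at time $t$.

$$
\text{data drift: } P_{t}(x) \neq P_{t+\Delta}(x), \qquad
\text{concept drift: } P_{t}(y \mid x) \neq P_{t+\Delta}(y \mid x).
$$

Under this reading, a detector can face either form of drift alone or both at once, which is one reason temporal validation matters for the tasks listed above.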

4.2. A Unified Taxonomy for Deep Learning in Cybersecurity

To organize the literature in a technically meaningful and comparable way, this survey adopts a five-dimensional taxonomy. Let a cybersecurity deep learning study be represented as follows:

\[ S = \langle T, M, A, R, D \rangle, \tag{1} \]

where T denotes the security task, M the data modality, A the model family, R the trustworthiness properties and risks, and D the deployment environment. Equation (1) is used as indexing notation for charting studies rather than as a mathematical model. In the evidence map, each reviewed source was coded by the available values of T, M, A, R, and D; when a paper did not report one dimension clearly, that absence was treated as a reporting limitation rather than inferred from the model name. Thus, a study such as a transformer-based IDS is not coded only as an architecture paper; it is interpreted by its task, traffic modality, validation design, trustworthiness evidence, and deployment assumptions. This taxonomy is proposed in this review, but it is grounded in established ideas from multimodal learning, graph learning, federated learning, explainable AI, concept drift adaptation, and adversarial machine learning. Its purpose is to distinguish studies that may seem similar at a high level but differ substantially in data assumptions, architectural choices, risk exposure, and operational feasibility [24,27,29,34,37,38].
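As an illustration of how Equation (1) can serve as indexing notation, the following minimal Python sketch represents one charted study as a record whose unreported dimensions stay empty. The class name, field names, and example values are hypothetical and are not taken from the review's codebook.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass(frozen=True)
class StudyRecord:
    """One charted study, indexed as S = <T, M, A, R, D> (Equation (1))."""
    task: Optional[str]                          # T: security task
    modality: Optional[str]                      # M: data modality
    model_family: Optional[str]                  # A: model family
    trust_properties: Optional[Tuple[str, ...]]  # R: trustworthiness properties and risks
    deployment: Optional[str]                    # D: deployment environment

    def reporting_gaps(self) -> list:
        """Names of the dimensions the paper did not report clearly."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Hypothetical example: a transformer-based IDS paper with no stated deployment setting.
example = StudyRecord(
    task="intrusion detection",
    modality="sequential flow records",
    model_family="transformer",
    trust_properties=("temporal validation",),
    deployment=None,  # left empty rather than inferred from the model name
)
print(example.reporting_gaps())  # ['deployment']
```

Keeping a missing dimension as an explicit empty value, rather than guessing it, mirrors the coding rule that absence is a reporting limitation.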

4.2.1. Security Task Dimension

The first dimension is the security task. Deep learning has been applied to many cybersecurity objectives, but these can be grouped into several main families. The largest is intrusion and anomaly detection, where models identify malicious or abnormal behavior in network, system, or application data. A second is malware and binary analysis, focused on malware classification, malicious code detection, and representation learning from static or dynamic traces. A third includes phishing, fraud, spam, and social engineering detection, which often relies on textual, behavioral, or metadata-rich signals. A fourth is authentication and identity security, including biometric and behavioral verification. A fifth is cyber threat intelligence and analyst support, where models process reports, indicators, logs, and cross-source evidence for search, correlation, summarization, and prioritization. Recent LLM-focused surveys and cyber-trust models further show that cyber tasks increasingly extend beyond classification to reasoning, retrieval, explanation, summarization, policy-aware trust management, and analyst-facing assistance. This dimension matters because it shapes the meaning of all others. For example, a flow-based intrusion detector and a threat intelligence summarizer may both use transformers, yet they differ in labels, temporal structure, adversarial exposure, and explainability needs. Likewise, authentication systems emphasize calibration, spoof resistance, and privacy differently from malware triage or traffic classification. As a result, studies should not be compared only by which deep model performs best, but by which model–task pairing is appropriate under specific operational assumptions [12,13,39,40,41,42,43,44,45,46,47,48].

4.2.2. Data Modality Dimension

The second dimension is data modality. In cybersecurity, modality is not merely a formatting issue; it determines what structure a model can exploit and which architectures are suitable. Six major modality groups repeatedly appear in the literature: (1) tabular or vectorized telemetry, such as aggregated flow statistics or engineered endpoint features; (2) sequential streams, including packet sequences, keystroke or event streams, command histories, and API calls; (3) graph-structured data, such as communication graphs, provenance graphs, entity relations, and lateral-movement structures; (4) textual data, including logs, alerts, reports, vulnerability descriptions, and CTI documents; (5) binary or byte-level data, such as executables or memory-derived representations; and (6) multimodal data, where text, behavior, metadata, temporal traces, or sensor outputs are fused. Multimodality deserves particular attention. General multimodal learning surveys identify recurring challenges such as representation, alignment, reasoning, generation, transfer, and uncertainty across modalities. In cybersecurity, these issues arise when defenders combine threat reports with telemetry, logs with user behavior data, or endpoint events with graph-based evidence. Multimodal systems can improve contextual understanding, but they also increase engineering complexity, create synchronization problems, and introduce additional attack surfaces, including missing-modality conditions, modality-specific corruption, and unstable fusion. For this reason, multimodality should be treated as a distinct analytic category rather than simply as a larger feature set [13,24,27,49,50].

4.2.3. Model Family Dimension

The third dimension is the model family. Early deep learning in cybersecurity was dominated by convolutional networks, recurrent architectures, and autoencoder-based methods. Broad deep learning surveys note that convolutional models are effective for capturing local and hierarchical structure, especially when telemetry can be represented as spatial or pseudo-spatial patterns. Recurrent and gated models are better suited to sequential dependence and temporal context. LSTM, in particular, remains important for event-sequence modeling, traffic analysis, and behavioral traces because it addresses long-term dependency problems in recurrent learning. Autoencoders also remain central because many cybersecurity tasks still rely on anomaly scoring, dimensionality reduction, reconstruction error, and semi-supervised or weakly supervised detection. Recent work has expanded this landscape. Graph neural networks extend deep learning to non-Euclidean relational data through message passing and are increasingly used for intrusion detection, provenance analysis, and entity-relation reasoning. Transformers replace recurrence with attention and offer stronger parallelism and long-context modeling, making them useful for log streams, long textual evidence, and sequence-heavy workflows. LLMs further extend the field beyond narrow classification to threat-report interpretation, alert contextualization, semantic correlation, retrieval-augmented triage, and interactive analyst support. At the same time, recent surveys caution that LLMs are not automatically effective for all cyber detection tasks. Federated and distributed deep learning form another strategically important category. Federated learning is not a single model architecture, but a training paradigm in which clients share model updates instead of raw data. 
This training paradigm is attractive in cybersecurity because relevant data are often siloed across organizations, endpoints, institutions, or devices, and direct centralization may be impractical for legal, privacy, or bandwidth reasons. However, federated settings also introduce statistical heterogeneity, communication overhead, synchronization challenges, and new attacks on updates and aggregation. Federated learning should therefore be treated as a core design dimension rather than a minor deployment detail [26,27,28,37,51,52].

4.2.4. Trustworthiness Dimension

The fourth dimension is trustworthiness, used here as an umbrella term covering robustness, privacy, explainability, and adaptation to environmental change. This dimension is critical because cybersecurity systems operate in adversarial, high-consequence settings. A model may achieve strong benchmark results but still be unsuitable if it is easy to evade, leaks sensitive information, provides no actionable explanation, or fails under drift. Recent NIST guidance frames adversarial machine learning as a lifecycle problem involving attacker goals, capabilities, knowledge, and attack stages, offering a stronger basis for categorizing cyber AI risk than simple perturbation testing. Applied to the reviewed cybersecurity literature, this means that an IDS evasion study should identify the attacker capability to modify traffic features, a federated learning study should consider malicious client updates or poisoning, and an LLM-based SOC assistant should consider prompt injection, data exfiltration, and unsafe tool-use. The first component is adversarial robustness. In cybersecurity, defenders must address not only normal generalization error but also evasion, poisoning, backdoors, model extraction, and attacks on AI-enabled workflows. This is especially relevant for LLM-based systems, where prompt injection, unsafe tool-use, data leakage, and manipulated outputs broaden the attack surface. Robustness evaluation should therefore match the actual threat model of the application. The second component is privacy preservation. Cybersecurity systems often process sensitive enterprise logs, user identifiers, authentication traces, and incident records. Federated learning can reduce the need for raw-data sharing, but surveys emphasize that decentralization alone does not ensure privacy, since gradients and parameter updates may still leak information. 
Effective privacy protection therefore requires secure aggregation, differential privacy, access control, and governance rather than relying on the claim that data never leaves the device, and this broader governance perspective aligns with prior work on policy-aware enforcement and dynamic trust in security systems [45,46,53,54,55,56]. The third component is explainability and analyst interpretability. XAI surveys distinguish interpretability, transparency, and post hoc explanation, while also noting that many explanation methods remain algorithm-centric and weakly validated with human users. In cybersecurity, this matters because alerts often drive analyst investigation, incident response, or business decisions. Explanations should therefore be judged by whether they help defenders understand why an entity was flagged, which evidence mattered most, and how much uncertainty remains. The fourth component is adaptation under drift and non-stationarity. Cyber data evolve as software changes, infrastructure changes, attackers adapt, and normal user behavior shifts. Concept drift surveys show that adaptation is not optional in streaming environments. In cybersecurity, this means static train–test evaluation is insufficient. Reliable systems require drift-aware monitoring, recalibration, online or continual updating, and evaluation protocols that reflect temporal separation and operational change [29,34,35,36,53,54,55,57,58,59,60,61,62].
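The difference between a static train–test split and an evaluation protocol with temporal separation can be sketched in a few lines. This is a minimal example under stated assumptions: the record layout, field names, and cutoff date are hypothetical, and a real protocol would also need to handle label delay and repeated retraining windows.

```python
from datetime import datetime


def temporal_split(records, cutoff):
    """Return (train, test) so that every training record precedes the cutoff."""
    train = [r for r in records if r["time"] < cutoff]
    test = [r for r in records if r["time"] >= cutoff]
    return train, test


# Hypothetical alert records; the field names are illustrative only.
records = [
    {"time": datetime(2024, 1, 10), "label": 0},
    {"time": datetime(2024, 2, 3), "label": 1},
    {"time": datetime(2024, 3, 21), "label": 0},
    {"time": datetime(2024, 4, 15), "label": 1},
]
train, test = temporal_split(records, cutoff=datetime(2024, 3, 1))

# No test record is older than any training record, so future data cannot leak backward.
assert max(r["time"] for r in train) < min(r["time"] for r in test)
print(len(train), len(test))  # 2 2
```

A random shuffle of the same records would mix later traffic into training, which is one way temporal leakage inflates reported accuracy.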

4.2.5. Deployment Environment Dimension

The fifth dimension is the deployment environment. The same model may behave very differently depending on whether it is deployed in a centralized cloud service, across enterprise endpoints, at the network edge, within IoT devices, or across a cloud–edge–fog hierarchy. Surveys of cloud, edge, and fog security show that these environments differ in storage capacity, processing location, trust assumptions, privacy exposure, and attack surface. Edge and fog systems, in particular, introduce heterogeneity, intermittent connectivity, and resource asymmetry, while pushing decisions closer to latency-sensitive data sources. Resource-aware deployment is therefore a core part of the taxonomy. Surveys of lightweight deep learning show that memory limits, energy budgets, hardware specialization, compression, pruning, quantization, and accelerator availability can strongly affect practical feasibility. This is especially important in IoT and embedded cybersecurity, where detection may need to run on-device or near-device, and centralized inference may be too slow, too costly, or too privacy-invasive. Deployment feasibility should thus be evaluated alongside predictive performance, not treated as an afterthought. Recent resilience-oriented work for resource-constrained AI systems further shows that edge deployment should include fault tolerance, adaptation under disturbance, and recovery of model behavior, not only small model size or fast inference [63]. Table 10 summarizes the five-dimensional taxonomy adopted in this survey, organizing the literature by security task, data modality, model family, trustworthiness, and deployment environment [25,64,65,66,67,68,69].
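A rough feasibility check shows why quantization matters for on-device detection. The sketch below estimates weight storage only. The parameter count and memory budget are hypothetical, and the estimate ignores activations, runtime overhead, and any accuracy loss from lower precision.

```python
def model_size_mib(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in MiB (weights only, no activations or runtime)."""
    return num_params * bits_per_param / 8 / (1024 ** 2)


def fits_budget(num_params: int, bits_per_param: int, budget_mib: float) -> bool:
    """True when the weights alone fit within the given memory budget."""
    return model_size_mib(num_params, bits_per_param) <= budget_mib


params = 5_000_000   # hypothetical detector size
budget_mib = 8.0     # hypothetical memory budget on an edge device

for bits in (32, 8):
    size = model_size_mib(params, bits)
    print(f"{bits}-bit weights: {size:.2f} MiB, fits: {fits_budget(params, bits, budget_mib)}")
```

Under these assumed numbers the 32-bit model exceeds the budget while the 8-bit version fits, which is why deployment reporting should state precision and memory alongside accuracy.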

4.3. Interaction Among Taxonomy Dimensions

The main strength of the proposed taxonomy is that it makes the interaction among dimensions explicit. A cybersecurity study cannot be adequately described by a single label such as “transformer-based IDS” or “LLM for CTI.” A useful characterization must also specify the task, data modality, model family, trustworthiness criteria, and deployment environment. For example, a graph neural network for lateral-movement detection in enterprise provenance graphs differs fundamentally from an LLM-based CTI assistant in a cloud workflow, even though both are broadly framed as AI for cyber defense. Similarly, a lightweight federated model for IoT intrusion detection raises different concerns than a centralized transformer trained on static benchmark logs. This interaction-focused view also clarifies why comparisons in the literature are often misleading. Reported gains may reflect differences in modality design, dataset age, temporal leakage, deployment assumptions, or threat models rather than true architectural superiority. For this reason, a taxonomy that captures these interacting factors is more suitable for cybersecurity than a flat, architecture-centered summary. It also provides a consistent structure for the later sections of this review, including applications, trustworthiness, datasets, and open challenges [13,26,27,33,39].

4.4. Implications for the Remainder of the Survey

Based on this taxonomy, the rest of the paper examines deep learning for cybersecurity through a layered perspective. Sections are organized mainly by task, while also identifying the prominent modalities, common architectures, key trustworthiness concerns, and typical deployment settings. This structure keeps the survey technically consistent and avoids a common weakness of broad surveys: combining studies that address different problems under incompatible assumptions. Overall, the taxonomy defines deep learning for cybersecurity as the intersection of problem type, data structure, architectural bias, trustworthiness requirements, and operational setting. This view better reflects how cyber defense systems are designed and evaluated in practice and provides a stronger basis for identifying real research gaps, especially in multimodal fusion, drift-aware learning, privacy-preserving and robust collaboration, explainable analyst-facing systems, and deployable architectures for cloud–edge–IoT environments. Figure 2 summarizes the five-dimensional taxonomy used in this survey: security task, data modality, model family, trustworthiness dimension, and deployment environment [24,25,54,69,70].

As illustrated in Figure 2, deep learning for cybersecurity should be analyzed as the interaction of problem type, input structure, architectural bias, trustworthiness requirement, and operational context rather than as a single homogeneous modeling task.

5. Application Domains of Deep Learning in Cybersecurity

5.1. Intrusion Detection and Anomaly Detection

Intrusion detection remains one of the most frequently discussed applications of deep learning in cybersecurity because it lies at the intersection of network defense, anomaly detection, and operational monitoring. This literature covers intrusion detection systems (IDSs), including network intrusion detection systems (NIDSs), host-based intrusion detection systems (HIDSs), log anomaly detection, IoT intrusion detection, and industrial control system (ICS) monitoring. Deep models analyze network flows, packet sequences, system logs, and event streams to identify behavior that differs from expected activity or resembles known attacks. However, clean laboratory datasets can produce misleading results because they may contain simplified traffic, artificial attack distributions, limited background noise, or patterns that are easier to separate than real enterprise traffic. The key gap is therefore not only model selection, but the mismatch between benchmark data and operational data. Recent reviews show a shift from shallow models and hand-engineered flow features to architectures that capture spatial, temporal, and contextual structure, including CNNs, LSTMs, GRUs, autoencoders, transformers, and hybrid models. At the same time, they emphasize that IDS performance depends heavily on class imbalance, spatiotemporal feature quality, temporal drift, and dataset realism, not only on model choice [39,65,66,71].
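The effect of class imbalance on operational usefulness can be made concrete with Bayes' rule. The sketch below computes the fraction of alerts that are true attacks from a detector's true-positive rate, false-positive rate, and attack prevalence. All numeric values are hypothetical and chosen only to contrast a balanced benchmark with a low-prevalence operational setting.

```python
def alert_precision(tpr: float, fpr: float, prevalence: float) -> float:
    """Fraction of raised alerts that are true attacks, via Bayes' rule."""
    true_alerts = tpr * prevalence
    false_alerts = fpr * (1.0 - prevalence)
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical detector: 99% true-positive rate, 1% false-positive rate.
balanced = alert_precision(tpr=0.99, fpr=0.01, prevalence=0.5)       # benchmark-like mix
operational = alert_precision(tpr=0.99, fpr=0.01, prevalence=0.001)  # rare attacks

print(f"balanced benchmark precision: {balanced:.3f}")
print(f"operational precision:        {operational:.3f}")
```

With the same detector, precision falls from 0.99 on the balanced mix to roughly 0.09 when only one flow in a thousand is malicious, so most alerts an analyst sees would be false positives.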

5.2. Malware Detection and Classification

Malware detection is a major application area and a clear reason deep learning gained traction in cybersecurity. Unlike signature-based antivirus systems, deep models can learn hierarchical patterns from raw or partially processed artifacts, offering better potential to generalize to unseen, obfuscated, or zero-day malware. Malware data can include executable bytes, opcode sequences, API-call traces, control-flow graphs, images generated from binaries, and dynamic behavior traces. These representations matter because malware behavior is difficult to predict from surface features alone. Attackers can pack, obfuscate, reorder, or slightly modify malicious code so that the file changes while the malicious objective remains similar. Recent surveys commonly organize this literature into static, dynamic, and hybrid analysis and distinguish methods by representation, including sequence, image, graph, and raw-byte approaches [40,72,73,74]. Strong malware detection therefore requires not only high classification accuracy but also robustness to mutation, family imbalance, obfuscation, and adversarial evasion.

5.3. Phishing, Spam, and Social Engineering Detection

Phishing detection has become a major deep learning application because phishing campaigns are adaptive, multilingual, and increasingly multimodal. Unlike purely network-level threats, phishing combines URL patterns, webpage appearance, email text, sender behavior, domain signals, and social engineering cues. Attackers frequently change wording, tone, brand impersonation, templates, and delivery strategies, so models trained on old campaigns can degrade quickly. Deep learning supports semantic email analysis, URL classification, visual webpage analysis, and multimodal detection, but phishing systems must be updated and evaluated over time rather than treated as static classifiers. Recent review studies show that phishing detection is no longer limited to URL classification. Current research includes graph-based methods, natural language processing (NLP)-based email inspection, generative adversarial network (GAN)-assisted modeling, transformer-based email analysis, and real-time web-detection pipelines [41,75,76,77]. Key limitations remain dataset diversity, adversarial mimicry, multilingual variation, interpretability, and latency in enterprise filtering systems.

5.4. Biometric Authentication and Identity Security

Biometric authentication is a major cybersecurity domain because it sits at the intersection of identity, usability, privacy, and spoof resistance. Deep learning has advanced this area by enabling powerful representation learning for physiological traits such as face, fingerprint, iris, palmprint, and hand-vein, as well as behavioral traits such as voice, gait, signature, and device interaction. However, biometric authentication is not only a recognition problem. A system can have high matching accuracy and still be insecure if it can be bypassed with a photo, mask, replayed voice, synthetic media, or other fake input. Presentation attack detection (PAD) is therefore central to biometric security [78]. Privacy is also more serious than in password-based authentication because biometric traits cannot be easily revoked or changed once exposed. Deep learning-based biometrics should therefore be evaluated for recognition accuracy, spoof resistance, template protection, privacy-preserving storage, sensor variation, and deployment context together [42,43,44,78,79,80].

5.5. Cyber Threat Intelligence (CTI) and Multimodal Security Analytics

Cyber threat intelligence (CTI) is an increasingly important application area because deep learning supports not only classification, but also information extraction, semantic enrichment, correlation, summarization, and analyst assistance. In this paper, CTI specifically means cyber threat intelligence; it should not be confused with computer telephony integration, which uses the same abbreviation in other fields. CTI workflows combine threat reports, vulnerability descriptions, indicators of compromise, malware knowledge, victim or target information, attacker or campaign descriptions, logs, and external intelligence feeds. Deep learning can read unstructured reports and extract useful entities and relationships, such as malware names, affected organizations, vulnerabilities, attacker groups, infrastructure, tactics, and attack patterns. These outputs can support knowledge graph construction, threat hunting, campaign correlation, and evidence-based reporting. Furumoto et al. [81] survey more than 200 CTI studies and show that the field is highly heterogeneous in data sources, objectives, and dataset design. Their analysis highlights a recurring weakness: results often depend strongly on vendor composition and dataset balance, making direct comparison difficult [81]. Large language models have accelerated this area further. Chen et al. [13] show that LLMs are used not only for detection, but also for log interpretation, alert contextualization, semantic search, reasoning support, and domain adaptation. At the same time, they note that LLMs are not equally suitable for all cyber tasks, especially those requiring precise low-level detection, stable factual grounding, or strict latency [13]. This tempers broader claims that general-purpose LLMs can replace traditional cyber analytics pipelines [13]. A core CTI task is entity and relation extraction for knowledge graph construction. Ahmed et al. [82] address this in CyberEntRel by jointly extracting cyber entities and relations with a deep learning model rather than using separate pipeline stages. This direction is important because many CTI applications, including threat hunting, attribution support, campaign correlation, and automated reporting, depend on converting unstructured text into structured threat knowledge [82]. A closely related trend is multimodal security analytics, where textual, statistical, packet-level, and artifact-level data are fused. Although this area is less mature than IDS or malware detection, it is increasingly relevant for encrypted traffic analysis, alert fusion, and advanced defense pipelines. Lin et al. [49] propose PEAN, a multimodal framework for encrypted traffic classification, and Aceto et al. [50] propose DISTILLER, which applies multimodal multitask deep learning to the same problem. Both reflect the broader shift from single-view analytics to systems that learn from complementary modalities and their interactions [49,50].

5.6. Synthesis Across Application Domains

Across all five domains, a common pattern emerges. Deep learning is most mature where data are abundant and benchmarked, as in intrusion detection and malware classification. It becomes more fragile when labels are sparse, semantics are ambiguous, or human interpretation is central, as in CTI and phishing. It becomes more constrained when deployment requires privacy, anti-spoofing, and low-latency operation, as in biometrics and IoT/edge security. For that reason, application-by-application comparisons should not be reduced to raw accuracy tables. What matters is the interaction among task structure, modality, architecture, evaluation design, and deployment constraints. This application-aware view will guide the later sections of the survey on trustworthiness, datasets, and future research directions [13,39,40,41,42,49,50,65,66,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87]. Table 11 summarizes the major application domains reviewed in this section, together with their prominent modalities, common deep learning architectures, and recurring methodological limitations.

Figure 3 provides a visual summary of how the major application domains of deep learning in cybersecurity align with prominent data modalities and commonly used model families.

As shown in Figure 3, deep learning in cybersecurity is not a single uniform problem space, since each application domain depends on different input structures, architectural choices, and operational assumptions.

6. Trustworthiness Dimensions of Deep Learning for Cybersecurity

6.1. Why Trustworthiness Is a Core Requirement in Cybersecurity

In cybersecurity, trustworthiness is not optional. It is a core requirement because cyber defense systems operate in open, adversarial, and high-consequence environments, where attackers adapt, false positives are costly, and human oversight is often necessary. Deep learning-based cybersecurity systems should therefore be evaluated not only for accuracy, but also for robustness to manipulation, privacy protection, interpretability, confidence calibration, uncertainty reporting, and secure deployment. Figure 4 summarizes this trustworthiness stack as a layered operational requirement rather than as a single metric.

6.2. Adversarial Robustness and Security-Aware Evaluation

Adversarial robustness is a primary trustworthiness concern in deep learning for cybersecurity. Defender models may be attacked at inference time through adversarial examples, at training time through poisoning or backdoors, or through model extraction, privacy attacks, and manipulation of supporting workflows. Early work showed that deep neural networks are vulnerable to carefully crafted perturbations [34,88,89,90]. Later studies showed that robustness claims require strict evaluation. Carlini and Wagner [88] demonstrated that many apparent defenses fail under stronger attacks, while Madry et al. [89] framed robustness as a robust-optimization problem and established adversarial training as a key baseline for first-order robustness. In cybersecurity, this means a model should not be considered robust simply because it withstands a weak attack or a narrow benchmark. It must be evaluated under a clear threat model matched to deployment conditions [34,88,89]. Cybersecurity also broadens the meaning of adversarial robustness. Attackers may manipulate packet timing, protocol fields, malware packing, log content, feature distributions, query patterns, or human-facing prompts, rather than staying within the small perturbation budgets common in image classification. Robustness in this domain should therefore be understood more broadly than norm-bounded perturbation resistance. NIST’s 2025 taxonomy is especially useful because it expands the threat model to include attacker knowledge, capability, lifecycle timing, and intended consequence [34]. Likewise, Jedrzejewski et al. [91] show that industrial adversarial machine learning still lacks both rigor and practical relevance, suggesting that many published defenses are not yet deployment-ready. This is a serious warning for cybersecurity research, where false assurance can be costly [34,91].
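To illustrate why the threat model should state which inputs an attacker can actually change, the following sketch applies a single signed-gradient step to a linear detector while touching only attacker-controllable features. It is a toy example under stated assumptions: the weights, feature values, step size, and mutability mask are hypothetical, and real traffic or malware features carry semantic constraints that this sketch ignores.

```python
import math


def score(w, b, x):
    """Logistic 'malicious' probability of a linear detector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def constrained_evasion(w, x, eps, mutable):
    """One signed-gradient step that lowers the score, limited to mutable features.

    For a logistic model the gradient of the score with respect to x has the
    same sign as w, so stepping against sign(w) reduces the score.
    """
    return [xi - eps * sign(wi) if m else xi for wi, xi, m in zip(w, x, mutable)]


# Hypothetical detector and sample; feature 0 is assumed fixed by the protocol.
w, b = [2.0, -1.0, 1.5], -1.0
x = [1.0, 0.5, 1.0]
mutable = [False, True, True]

x_adv = constrained_evasion(w, x, eps=0.5, mutable=mutable)
print(f"original score: {score(w, b, x):.3f}, perturbed score: {score(w, b, x_adv):.3f}")
```

Reporting the mutability mask alongside the perturbation budget makes it clearer whether an evasion result reflects a capability the attacker would plausibly have.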

6.3. Poisoning, Backdoors, and Training Time Integrity

Training time attacks are especially critical in cybersecurity because defenders increasingly rely on large, continuously updated datasets collected from distributed and partly untrusted environments. Poisoning attacks corrupt training data or model updates to degrade overall performance or induce targeted malicious behavior. Backdoor attacks are especially dangerous because the model may appear normal on benign inputs while failing on attacker-triggered patterns [34,55,92,93]. In cyber defense, this could cause an IDS to ignore trigger-related traffic or a malware detector to misclassify a targeted family. Recent work shows that poisoning defense cannot be reduced to a single algorithmic fix. Bena et al. [92] propose a risk-based approach, underscoring that defense design should follow a risk-management perspective rather than an accuracy-only view. The challenge is even greater in federated learning, where servers often cannot inspect raw client data. Surveys in this area show that backdoor, Byzantine, and adversarial attacks remain major open problems in distributed training [54,55,93]. Nguyen et al. [93] specifically survey backdoor attacks in federated learning, while Zhao et al. [55] and Feng et al. [94] show that privacy, robustness, and system integrity are tightly linked in collaborative learning. Thus, trustworthiness in cybersecurity requires securing the training pipeline, not just the final model [54,55,92,93,94].
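One reason aggregation rules matter in federated training is that a plain average can be dominated by a single malicious update. The sketch below contrasts the mean with a coordinate-wise median, a commonly used robust aggregator that is introduced here only for illustration and is not taken from the cited surveys. The client updates are hypothetical, and a median does not by itself stop stealthy backdoors.

```python
from statistics import mean, median


def aggregate(updates, reducer):
    """Combine client updates coordinate by coordinate with the given reducer."""
    return [reducer(coord) for coord in zip(*updates)]


# Hypothetical two-parameter updates from three honest clients and one poisoned client.
honest = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.22]]
poisoned = honest + [[5.0, 5.0]]

mean_update = aggregate(poisoned, mean)
median_update = aggregate(poisoned, median)

print("mean:  ", [round(v, 4) for v in mean_update])
print("median:", [round(v, 4) for v in median_update])
```

In this toy case the mean is pulled far from the honest values while the median stays close to them, which shows why the aggregation rule belongs in the stated threat model.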

6.4. Privacy Preservation and Collaborative Learning

Privacy is the second major trustworthiness dimension. Cybersecurity data often contain highly sensitive information, including user identifiers, authentication traces, behavior profiles, endpoint telemetry, internal topology, and incident evidence. Deep models may unintentionally memorize such data, while attackers may recover sensitive attributes or infer whether a record was used in training. Surveys on privacy attacks and membership inference, together with the original membership inference attack formulation, show that such leakage is a systematic risk across both training and inference [58,95,96]. Privacy, therefore, is not external to cybersecurity AI but part of its own attack surface [58,95]. Differential privacy is one of the most rigorous protections for training time privacy because it bounds the effect of including or excluding any single record. Pan et al. [53] show that it can reduce privacy leakage, but only with a meaningful privacy–utility tradeoff. This is especially relevant to cybersecurity, where privacy and security are distinct yet closely linked [53]. More broadly, El Mestari et al. [97] argue that data protection should be considered across the full machine learning lifecycle, not only at the model-output stage. This lifecycle view fits cyber defense settings in which multiple organizations, sensors, and platforms contribute sensitive data to shared analytical pipelines [53,97]. Federated learning is often presented as privacy-preserving because it avoids centralizing raw data, a principle formalized in communication-efficient decentralized training [98], but recent surveys show that this claim is incomplete. Zhang et al. [54] review trustworthy federated learning across privacy, security, robustness, fairness, and explainability. Zhao et al. [55] show that model updates can still leak private information, so federated learning does not provide privacy automatically. Bunko et al. [94] further note that many FL-based IDS studies rely only on data locality and omit stronger protections such as secure aggregation, encryption-based methods, or lightweight privacy mechanisms. In cybersecurity, federated learning should therefore be treated as a privacy-enabling architecture, not a privacy guarantee [54,55,94].
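The idea of bounding any single contributor's influence can be sketched as norm clipping followed by Gaussian noise on the averaged update. This is a minimal illustration, not a complete differential-privacy mechanism: the clipping norm and noise scale are hypothetical, and no privacy accounting is performed, so the sketch makes no claim about a specific privacy budget.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale an update so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def noisy_average(updates, max_norm, noise_std, rng):
    """Average the clipped updates, then add Gaussian noise to each coordinate."""
    clipped = [clip_update(u, max_norm) for u in updates]
    n = len(clipped)
    return [sum(coord) / n + rng.gauss(0.0, noise_std) for coord in zip(*clipped)]


# Hypothetical client updates; the first has a large norm and will be clipped.
updates = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

print(clip_update(updates[0], max_norm=1.0))
print(noisy_average(updates, max_norm=1.0, noise_std=0.1, rng=rng))
```

Clipping limits how far one client can move the average, and the added noise trades utility for reduced leakage, which is the privacy–utility tradeoff discussed above in miniature.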

6.5. Explainability and Human Analyst Trust

Explainability is a core trustworthiness dimension because cybersecurity is a human-centered decision domain. Alerts often require analyst triage, interpretation, escalation, and response. A model that outputs only a score may perform well on benchmarks, but its operational value is limited if analysts cannot judge whether to trust, investigate, suppress, or act on it. General XAI surveys distinguish transparent models, post hoc local explanations, global explanations, and explanation-specific evaluation methods [35,36]. This distinction matters in cybersecurity because interpretability needs vary by task. A phishing detector may require token-level rationales, an IDS may need flow- or feature-level attribution, and a CTI assistant may need evidence-grounded explanations rather than saliency maps [35,36]. Cybersecurity-focused XAI reviews make this point explicit. Reynaud and Roxin [99] argue that explainability, robustness, and performance must be considered jointly for user acceptance. Sharma et al. [10] similarly emphasize that XAI in cybersecurity should support transparency, analyst trust, and actionable understanding across tasks such as malware detection, phishing analysis, and intrusion detection. Thus, the key question is not only whether a model can explain its output, but whether the explanation improves analyst decisions under realistic workload and uncertainty [10,99]. Evaluation rigor is also essential. Doshi-Velez and Kim [36] argue that interpretability should be studied with the same care as predictive performance, including human-grounded and application-grounded evaluation. This remains highly relevant in cybersecurity, where many studies present feature-importance plots without testing whether they help analysts triage alerts faster, identify false positives, or understand model failures. 
For this reason, this survey treats explainability as trustworthy only when explanations are useful, validated, and appropriate to the operational context, not simply because an explanation method is present [10,35,36,99].

6.6. Uncertainty Quantification and Confidence Calibration

A fourth trustworthiness dimension is the reliability of model confidence. Deep neural networks are often overconfident, and in cybersecurity, where class imbalance and open-set conditions are common, this can produce confident but incorrect predictions. Guo et al. [62] showed that modern neural networks are often poorly calibrated even when classification accuracy is high. In cybersecurity, this is especially important because confidence scores are used to prioritize alerts, trigger escalation, and support automated actions. Poor calibration can therefore increase alert fatigue, misdirect analyst attention, and create false confidence in wrong detections [62]. Recent cyber-focused work shows that uncertainty quantification can improve trustworthiness. Yang et al. [61] apply Bayesian deep learning to anomaly detection and show that uncertainty estimates help decision makers judge whether predictions should be trusted. This is particularly important because cybersecurity anomaly detection often operates in low-label, evolving, and partly unknown environments, where the boundary between benign novelty and malicious behavior is uncertain [61]. Therefore, future cyber defense systems should support calibrated confidence, uncertainty-aware ranking, abstention mechanisms, and human-review triggers rather than treating all alerts as equally reliable [61,62].
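The calibration concern can be made concrete with a small sketch. The code below computes an expected calibration error (ECE) with equal-width confidence bins, one common summary of the gap between stated confidence and observed accuracy, and then routes alerts with a simple abstention rule. The function names, the bin count, and both thresholds are illustrative assumptions rather than values taken from the cited studies.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def route_alert(confidence: float,
                escalate_at: float = 0.95,      # assumption: hypothetical threshold
                deprioritize_at: float = 0.20   # assumption: hypothetical threshold
                ) -> str:
    """Abstention rule: only confident scores bypass human review."""
    if confidence >= escalate_at:
        return "escalate"
    if confidence <= deprioritize_at:
        return "deprioritize"
    return "human_review"


# Four alerts scored at 0.9 confidence, of which only three were correct.
gap = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                 [True, True, True, False])
```

In this toy case the model claims 90% confidence but is right 75% of the time, so the calibration gap is about 0.15. A routing rule of this kind is only meaningful once the scores it consumes have been checked for calibration on held-out, deployment-like data.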

6.7. Secure Deployment, Governance, and Lifecycle Assurance

Trustworthiness depends not only on model design but also on how models are built, integrated, versioned, and maintained. In practice, cyber AI systems rely on data pipelines, training workflows, model registries, CI/CD processes, and runtime integrations. Each can introduce vulnerabilities even when the model itself is sound. Related work on policy-aware interface generation and dynamic trust enforcement also shows that secure deployment depends on how security policies are translated into operational controls and user-facing behavior, not only on the model itself [45,46,56]. The NIST AI RMF emphasizes that AI risk management must span design, deployment, and use [59]. Similarly, NIST SP 800-218A extends secure software development principles to generative AI and dual-use foundation models, highlighting risks such as untrusted training data, model-weight tampering, weak lineage tracking, prompt-based attacks, and insecure ML pipelines [60]. Although written broadly for AI systems, this guidance applies directly to cybersecurity models, especially those using external threat data, continual updates, or foundation-model components [59,60]. This lifecycle view is increasingly important as cybersecurity adopts foundation models, retrieval-augmented systems, and analyst-facing assistants. A trustworthy model should therefore be evaluated not only for robustness and privacy, but also for training-data provenance, artifact integrity, fine-tuning reproducibility, configuration control, secure tool integration, and post-deployment monitoring. In short, trustworthy cyber AI requires both technical robustness and operational assurance [34,59,60,91].

6.8. Synthesis

A central conclusion across these dimensions is that trustworthiness in deep learning for cybersecurity cannot be reduced to a single property, such as robustness or explainability. A robust model may still leak private data, a private model may still contain backdoors, an explainable model may still be poorly calibrated, and a well-calibrated model may still operate in an insecure or poorly governed pipeline. Therefore, future research should adopt integrated evaluation protocols that jointly assess robustness, privacy, explainability, calibration, and lifecycle security. This is especially important in cloud–edge–IoT deployments, federated IDS, LLM-based cyber assistants, and multimodal threat intelligence systems, where failures often arise from interactions across layers rather than from a single model weakness [10,34,35,36,54,55,59,60,93,94,97,99]. Table 12 summarizes the major trustworthiness dimensions discussed in this section, along with their risks, mitigation strategies, and remaining evaluation gaps.

As shown in Figure 4, trustworthy cyber AI requires the joint consideration of adversarial robustness, privacy preservation, explainability, uncertainty and calibration, and secure lifecycle governance in addition to predictive performance.

7. Datasets, Benchmarks, Evaluation Practices, and Reproducibility

7.1. Why This Section Is Methodologically Central

Section 7 connects the five-axis framework to the evidence base used to evaluate it. In deep learning-based cybersecurity, model performance cannot be interpreted apart from the dataset, preprocessing pipeline, split strategy, benchmark realism, and reporting protocol. A model may appear superior on one benchmark because the data are outdated, labels are cleanly separated, preprocessing leaks information, or the test distribution is too similar to the training distribution. This is why benchmark accuracy alone is not enough. Even 99% accuracy can be operationally insufficient if the remaining errors correspond to high-impact intrusions, rare malware families, privileged insider activity, or costly false alarms. Goldschmidt and Chudá show that intrusion detection results depend heavily on dataset choice and data quality, while Yang et al. identify datasets and evaluation metrics as major sources of variation across anomaly-based NIDS studies [83,100]. More broadly, Ceschin et al. argue that cybersecurity data introduce concept drift, delayed labels, adversarial evolution, and collection bias, making standard machine learning evaluation practices inadequate when applied directly to security problems [31]. For this reason, datasets and evaluation methodology are treated here as central to scientific credibility and deployment readiness rather than as secondary implementation details. To make this critique concrete, Table 13 identifies representative studies where the methodological lesson is more important than the model name alone. The purpose is not to rank individual papers, but to show how evaluation weaknesses appear in practice and how they map onto the five-axis framework.

7.2. Public Cybersecurity Datasets: Availability, Diversity, and Structural Limitations

The recent growth of public cybersecurity datasets has improved access for experimentation, benchmarking, and comparison. However, greater availability has not solved the underlying quality problem. Goldschmidt and Chudá review 89 public datasets for network intrusion detection across 13 properties and argue that researchers need a more critical view of data quality, dataset suitability, and best practices for dataset use and generation [83]. Pinto et al. support this view by showing that the tools, processes, and feature extraction steps used during dataset construction directly shape the final benchmark [103]. Together, these studies show that dataset abundance does not imply dataset adequacy. A recurring limitation is that many datasets remain laboratory-centered rather than deployment-centered. They often contain attack scenarios that are cleanly separable, temporally short, artificially balanced, or insufficiently representative of sector-specific threats. Tory and Hasan make this point directly, noting that ML and DL evaluation for IDS/IPS often emphasizes standard accuracy metrics while neglecting whether the datasets reflect relevant real-world threats [104]. Their MITRE ATT&CK-based framework is valuable because it shifts dataset selection from benchmark convenience to operational relevance. This matters in cybersecurity, where a detector trained on a popular dataset may still be poorly matched to healthcare, finance, industrial, or IoT environments [104]. Another structural limitation is the continued reliance on a small set of benchmark corpora. Yang et al.’s review of anomaly-based NIDS studies and Goldschmidt and Chudá’s popularity analysis suggest that comparisons are often driven by benchmark availability rather than by ecological validity [83,100]. This creates two risks. First, reported architectural gains may partly reflect adaptation to dataset-specific artifacts. 
Second, the field may overestimate practical progress when results transfer well within one benchmark family but not across different collection pipelines, traffic conditions, or attacker behaviors [83,100].

Table 14 translates this critique into a practical dataset-and-split protocol guide. The table is not a definitive catalogue of every cyber benchmark. It identifies dataset families that repeatedly appear in the reviewed literature and gives conservative split recommendations that make evaluation more deployment-relevant.

7.3. Dataset Construction Pipelines and the Importance of Data Provenance

A rigorous evaluation of deep learning for cybersecurity must consider how a dataset was constructed, not just its labels. Pinto et al. show that traffic-analysis tools, feature extraction software, capture procedures, and workflow design all influence a dataset’s suitability for intrusion detection [103]. Thus, the same benchmark name does not guarantee methodological equivalence when different preprocessing tools, feature pipelines, or export settings are used. In cybersecurity, such choices matter because even small pipeline differences can change feature semantics, class distributions, and temporal structure [103]. This point is reinforced by Pekar and Jozsa, who compare anomaly detection across datasets with different integrity levels, including the original CICIDS-2017, refined variants, and NFStream-generated versions [101]. Their study highlights that flow expiration and labeling are methodological choices, not neutral background details [101]. More broadly, cybersecurity surveys should not treat datasets as fixed objects. They should state whether a benchmark is based on raw packets, extracted flows, post-processed summaries, refined labels, or revised feature pipelines, since these factors directly affect fairness and interpretability in model comparison [101,103].

7.4. Synthetic Data and Dataset Augmentation

Because real cyber data are scarce, costly, sensitive, and often legally restricted, synthetic data generation has become an attractive option for training and benchmarking. However, synthetic data should not be accepted uncritically. Wolf et al. argue that synthetic network data must be evaluated using both data-driven and domain-driven metrics, including distributional properties, correlations, population characteristics, syntax validity, and suitability for NIDS use cases [105]. A dataset may appear statistically plausible while still failing to preserve security-relevant structure, attack semantics, or deployment value [105]. Generative adversarial modeling can support augmentation and stress testing, but it can also reproduce dataset artifacts or create unrealistic samples if validation is weak [106]. For this survey, synthetic data are evaluated along three axes: fidelity, utility, and risk. Fidelity measures how well the data preserve the statistical and structural properties of the target domain. Utility measures whether models trained or tested on the data support meaningful conclusions for real cyber tasks. Risk covers privacy leakage, memorization, and other undesirable generator behaviors. Wolf et al.’s framework is valuable because it shows that no single metric can fully capture synthetic data quality in cybersecurity [105]. Accordingly, this survey treats synthetic datasets as promising but methodologically demanding, especially when used to address scarcity in underrepresented sectors or attack classes [105].
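As a minimal illustration of the fidelity axis only, the sketch below pairs one domain-driven check (syntax validity of flow records) with one data-driven check (total variation distance between categorical distributions). The field names `dst_port` and `protocol`, the protocol whitelist, and the function names are hypothetical. This is not the metric suite proposed by Wolf et al., and utility and risk would need separate tests.

```python
from collections import Counter
from typing import Hashable, Iterable, Mapping

# Assumption: an illustrative, deliberately incomplete protocol whitelist.
KNOWN_PROTOCOLS = {"tcp", "udp", "icmp"}


def is_syntactically_valid(flow: Mapping[str, object]) -> bool:
    """Domain-driven check: port within range and protocol recognised."""
    port = flow.get("dst_port")
    protocol = flow.get("protocol")
    return (isinstance(port, int)
            and 0 <= port <= 65535
            and protocol in KNOWN_PROTOCOLS)


def syntax_validity_rate(flows: Iterable[Mapping[str, object]]) -> float:
    """Fraction of synthetic flow records that pass the syntax check."""
    flows = list(flows)
    if not flows:
        raise ValueError("at least one flow is required")
    return sum(is_syntactically_valid(f) for f in flows) / len(flows)


def total_variation_distance(real_values: Iterable[Hashable],
                             synthetic_values: Iterable[Hashable]) -> float:
    """Data-driven check: 0.0 means identical marginals, 1.0 means disjoint."""
    real, synth = Counter(real_values), Counter(synthetic_values)
    n_real, n_synth = sum(real.values()), sum(synth.values())
    if n_real == 0 or n_synth == 0:
        raise ValueError("both samples must be non-empty")
    support = set(real) | set(synth)
    return 0.5 * sum(abs(real[v] / n_real - synth[v] / n_synth)
                     for v in support)


validity = syntax_validity_rate([
    {"dst_port": 443, "protocol": "tcp"},
    {"dst_port": 70000, "protocol": "tcp"},   # invalid port
])
tvd = total_variation_distance(["tcp", "tcp", "udp", "icmp"],
                               ["tcp", "udp", "udp", "udp"])
```

A low distance on one marginal does not establish that attack semantics are preserved, which is why a single score of this kind should not be read as evidence of overall synthetic data quality.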

7.5. Evaluation Protocols: Metrics, Preprocessing, and Fair Comparison

A major challenge is that, even when researchers use the same benchmark datasets, they often apply different preprocessing pipelines. Manocchio et al. show that, in ML-based NIDS research, datasets and model classes are frequently reused, but preprocessing choices vary widely and are often weakly justified, limiting fair comparison [107]. Their empirical results show that preprocessing is not a minor implementation detail. It can significantly change reported performance and even produce double-digit differences for a shallow neural baseline [107]. For deep learning surveys, model comparisons are therefore meaningful only when preprocessing is clearly documented and consistently applied [107]. The literature also shows that evaluation metrics are often too narrow. Tory and Hasan note that IDS/IPS studies frequently emphasize accuracy without asking whether the benchmark reflects operational threat relevance [104]. Yang et al. similarly identify metric choice as a major source of variation in anomaly-based NIDS evaluation [100]. In cybersecurity, evaluation should go beyond aggregate accuracy or F1 and include measures that reflect class imbalance, false alarms, and task-specific operational priorities. For streaming and real-time systems, it should also consider runtime behavior, update cost, and time-sensitive performance degradation when relevant [30,100,104].
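The limits of aggregate accuracy under class imbalance can be shown with a short worked example. The sketch below derives several operational quantities from confusion counts. The daily event volume, attack prevalence, and detection rates are invented for illustration, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # attacks correctly alerted
    fp: int  # benign events alerted (false alarms)
    tn: int  # benign events correctly ignored
    fn: int  # attacks missed


def operational_report(c: ConfusionCounts) -> Dict[str, float]:
    """Report accuracy next to the quantities analysts actually experience."""
    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": ratio(c.tp + c.tn, total),
        "precision": ratio(c.tp, c.tp + c.fp),
        "recall": ratio(c.tp, c.tp + c.fn),
        "false_positive_rate": ratio(c.fp, c.fp + c.tn),
        "alerts_raised": float(c.tp + c.fp),
        "missed_attacks": float(c.fn),
    }


# Invented scenario: 1,000,000 events in a day, 0.1% of them malicious,
# 95% of attacks detected, 1% of benign events falsely alerted.
report = operational_report(
    ConfusionCounts(tp=950, fp=9_990, tn=989_010, fn=50)
)
```

In this scenario accuracy is close to 99%, yet fewer than one alert in ten is a real attack and analysts would face roughly 10,940 alerts for the day. This is the kind of gap that accuracy or F1 alone does not expose.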

7.6. Temporal Validity, Drift, and External Generalization

A major weakness in cybersecurity evaluation is the use of validation designs that ignore temporal change. Ceschin et al. identify concept drift, adversarial evolution, delayed labels, and data-collection effects as key reasons security evaluation differs from standard machine learning evaluation [31]. INSOMNIA makes this issue concrete by introducing a drift-aware NIDS framework and extending TESSERACT for time-aware intrusion detection evaluation [30]. The study argues that drift-aware evaluation is essential and incorporates temporal partitioning and update-latency analysis [30]. This has important implications for deep learning. Random train–test splits may be reasonable in closed-world pattern-recognition tasks, but they can be misleading in cybersecurity when future traffic differs from past traffic or attackers adapt over time. Evaluation should therefore favor time-aware splits, future-period holdouts, or at least explicit analysis of non-stationarity and update behavior in dynamic settings [30,31]. When possible, researchers should also test cross-dataset generalization or external validation rather than relying on a single benchmark. Pekar and Jozsa’s findings on datasets of varied integrity further show that conclusions from one dataset formulation may not transfer cleanly to another related one [30,31,101].
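A future-period holdout can be sketched in a few lines. The function below builds expanding-window folds in which every training record precedes every test record. It is a generic illustration under stated assumptions, not the protocol implemented in INSOMNIA or TESSERACT, and the record layout and function name are hypothetical.

```python
from datetime import datetime
from typing import Callable, List, Sequence, Tuple, TypeVar

Record = TypeVar("Record")


def future_period_folds(
    records: Sequence[Record],
    get_time: Callable[[Record], datetime],
    boundaries: Sequence[datetime],
) -> List[Tuple[List[Record], List[Record]]]:
    """For each consecutive boundary pair (start, end), train on everything
    observed before `start` and test on the period [start, end)."""
    ordered = sorted(records, key=get_time)
    folds = []
    for start, end in zip(boundaries, boundaries[1:]):
        train = [r for r in ordered if get_time(r) < start]
        test = [r for r in ordered if start <= get_time(r) < end]
        folds.append((train, test))
    return folds


# Hypothetical flow records captured on six consecutive days.
flows = [{"day": d, "ts": datetime(2024, 1, d)} for d in range(1, 7)]
folds = future_period_folds(
    flows,
    get_time=lambda r: r["ts"],
    boundaries=[datetime(2024, 1, 3), datetime(2024, 1, 5), datetime(2024, 1, 7)],
)
```

Here the first fold trains on days 1 and 2 and tests on days 3 and 4, and the second fold trains on days 1 to 4 and tests on days 5 and 6. Unlike a random split, no test record can precede a training record, so campaign-related or future-like samples cannot leak backwards into training.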

7.7. Reproducibility and Artifact Availability

Even strong datasets and evaluation protocols lose scientific value if other researchers cannot inspect, reproduce, or verify the underlying artifacts. Olszewski et al. provide strong recent evidence on this issue in security-focused machine learning research. Their CCS 2023 study examines nearly 750 papers, codebases, and datasets from Tier 1 security conferences over a ten-year period and concludes that substantial progress in computational reproducibility is still needed [108]. They also report no statistically significant increase in code availability after the introduction of artifact evaluation committees, although artifacts reviewed through such processes were more likely to function correctly [108]. For this survey, the methodological implication is direct. A cybersecurity deep learning paper should, whenever possible, report the dataset source, preprocessing scripts, split strategy, feature-construction details, hyperparameters, code, environment dependencies, and rerun instructions. Reproducibility is especially important in cybersecurity because results often depend on hidden choices in data cleaning, label filtering, categorical encoding, flow extraction, temporal slicing, and attack-group aggregation. When these steps are not transparent, published benchmarks become harder to interpret and compare fairly across studies [107,108].
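One lightweight way to make split definitions and run settings inspectable is to publish a small manifest alongside the code. The sketch below fingerprints the record identifiers of each split and rejects overlapping splits. The manifest fields and function names are assumptions chosen for illustration, not a reporting format prescribed by the cited studies.

```python
import hashlib
import json
import platform
from typing import Any, Dict, Iterable


def fingerprint_ids(ids: Iterable[str]) -> str:
    """Order-independent SHA-256 fingerprint of a set of record identifiers."""
    digest = hashlib.sha256()
    for item in sorted(ids):
        digest.update(item.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def build_manifest(dataset_source: str,
                   train_ids: Iterable[str],
                   test_ids: Iterable[str],
                   seed: int,
                   hyperparameters: Dict[str, Any]) -> Dict[str, Any]:
    """Collect what a reader needs to confirm they reran the same experiment."""
    train_ids, test_ids = list(train_ids), list(test_ids)
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        raise ValueError(f"{len(overlap)} record(s) appear in both splits")
    return {
        "dataset_source": dataset_source,
        "python_version": platform.python_version(),
        "seed": seed,
        "hyperparameters": hyperparameters,
        "train_split_sha256": fingerprint_ids(train_ids),
        "test_split_sha256": fingerprint_ids(test_ids),
    }


# Hypothetical identifiers and settings.
manifest = build_manifest(
    dataset_source="example-flows-v1",
    train_ids=["flow-001", "flow-002"],
    test_ids=["flow-003"],
    seed=42,
    hyperparameters={"learning_rate": 1e-3, "epochs": 20},
)
manifest_json = json.dumps(manifest, sort_keys=True, indent=2)
```

Publishing fingerprints rather than raw records also fits cases where the data themselves cannot be released, because a reader with authorized access can still verify that the same split was used.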

7.8. Synthesis and Recommended Evaluation Principles

Overall, the literature suggests a clear conclusion: deep learning in cybersecurity is rich in benchmarks but still weak in evaluation protocols. The field offers many datasets, results, and architectural advances, yet their credibility depends on data provenance, benchmark realism, preprocessing transparency, temporal validity, reproducibility, and deployment reporting. Evaluation should also report operational factors such as false-positive cost, alert volume, latency, throughput, memory use, update cost, and inference location when these factors affect real-time use. Cost becomes a practical limitation whenever models must operate on high-volume network traffic, streaming logs, endpoint telemetry, IoT devices, or security operations center workflows that require rapid triage. Based on the evidence reviewed in this section, strong future studies in deep learning-based cybersecurity should follow seven core principles:

1. Justify dataset choice by threat relevance, freshness, and deployment realism. The dataset should match the threat behavior, telemetry source, and operating conditions claimed in the study. Authors should explain whether the data are recent enough for the attack surface and realistic enough for the intended deployment setting.
2. Document the full data-construction and preprocessing pipeline. Studies should report how raw evidence was collected, filtered, de-duplicated, labeled, normalized, encoded, aggregated, and converted into model inputs. This makes hidden design choices visible and helps readers identify leakage, label noise, class imbalance, or preprocessing bias.
3. Avoid relying only on random splits when temporal change matters. Random splits can place near-duplicate, campaign-related, or future-like samples in both training and test sets. For streaming logs, malware families, phishing campaigns, and network traffic, time-aware partitions, future-period holdouts, update-latency analysis, or drift evaluation should be reported.
4. Report metrics that reflect operational costs, not only predictive performance. Accuracy, precision, recall, F1 score, and receiver operating characteristic area under the curve (ROC-AUC) should be complemented with false-positive burden, missed-attack cost, alert volume, latency, throughput, memory footprint, and analyst workload. These measures show whether a model can support security operations rather than only score well on a benchmark.
5. Test external or cross-dataset robustness where possible. A model evaluated only on one benchmark may learn dataset-specific artifacts instead of transferable threat patterns. External datasets, cross-organization validation, cross-family tests, or related benchmark variants can expose overfitting and improve claims about generalization.
6. Publish code, environment details, split definitions, and rerun instructions. Reproducibility requires the training and evaluation code, software versions, dependencies, hardware assumptions, random seeds, split files, hyperparameters, and commands needed to rerun the study. When raw data cannot be released, authors should provide access constraints, derived-feature descriptions, or synthetic examples that preserve the evaluation logic.
7. Evaluate synthetic data contributions for fidelity, utility, and risk before drawing security conclusions. Synthetic cyber data should be assessed for similarity to real threat behavior, usefulness for downstream tasks, and privacy or misuse risks. Studies should avoid claiming operational validity from synthetic data alone unless realistic and external checks support that claim.

Table 15 operationalizes these principles as a practical scoring checklist. The table is designed for future systematic updates of this review and for readers who want to evaluate whether an individual empirical paper provides only benchmark evidence or stronger deployment-relevant evidence.

These principles are increasingly important if reported gains are to translate into reliable cyber defense capability beyond controlled benchmarks [30,31,83,100,101,103,104,105,107,108]. Table 16 summarizes the main methodological weaknesses identified in the literature on datasets, benchmarking, preprocessing, temporal validation, and reproducibility, along with corresponding reporting recommendations. As illustrated in Figure 5, weak results often originate not from the model architecture alone, but from failures in data provenance, labeling quality, preprocessing transparency, split design, drift handling, validation breadth, and reproducibility.

8. Open Challenges and Future Research Directions

The future directions in this section are prioritized using three criteria: (1) how often the issue appeared in the cited-source corpus, (2) whether the issue directly affects deployment readiness, and (3) whether current studies provide weak evidence despite frequent claims. This prioritization keeps the roadmap tied to the evidence map rather than presenting a generic list of desirable research topics.

8.1. Deployment-Reporting Targets for Cloud, Enterprise, Edge, and IoT Settings

The reviewed literature repeatedly argues that deployment feasibility matters, but many papers still omit the specific quantities needed to judge feasibility. Table 17 therefore gives indicative reporting targets for cybersecurity deep learning studies. These are not universal standards, because acceptable budgets vary by organization, traffic volume, hardware, and alert criticality. They are practical baselines for what authors should report and justify when claiming deployment readiness.

8.2. From Static Benchmarks to Living, Sector-Relevant Cyber Datasets

A major research priority is moving from static, one-time benchmarks to living, continuously updated, sector-relevant cyber datasets. Recent surveys on drift-aware intrusion detection, emerging-technology security, IoT IDS, federated IDS, and encrypted traffic analysis all point to the same limitation: progress still depends too heavily on a small set of benchmark families, while real environments change over time and differ across sectors, protocols, and attacker behaviors [32,33,67,68,109]. Future work should therefore move beyond simple “train once, test once” studies and build benchmark ecosystems that support temporal refresh, domain adaptation, longitudinal evaluation, and organization-specific threat realism. These datasets should also document capture conditions, feature-generation pipelines, protocol context, and threat provenance so that results remain scientifically interpretable across deployments [32,33,67,68].

8.3. Drift-Aware, Continual, and Online Cyber Learning

A second challenge is that cyber defense models are still often treated as static classifiers, even though their environments are non-stationary. Shyaa et al. identify concept and feature drift as central issues for intrusion detection, and Neto et al. similarly argue that emerging-technology settings require DL-based IDS methods that adapt to changing traffic patterns, protocols, and attacks [32,33]. Future research should therefore emphasize continual learning, online updating, drift detection, replay-efficient adaptation, and temporally valid evaluation. Models should adapt without catastrophic forgetting, distinguish benign operational change from malicious novelty, and support controlled revision under limited labeling budgets. In practice, the field needs cyber models that remain effective after deployment, not only at publication time [32,33].
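A simple feature-drift monitor gives a sense of what drift detection involves in practice. The sketch below uses the population stability index (PSI), a generic distribution-shift heuristic that is swapped in here for illustration and is not taken from the cited surveys. The bin count, the smoothing constant, and the 0.25 alarm threshold (a commonly quoted rule of thumb) are assumptions.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], lo: float, hi: float,
                  n_bins: int, eps: float = 1e-6) -> List[float]:
    """Share of values per equal-width bin; out-of-range values are clamped
    into the first or last bin, and empty bins are floored at `eps`."""
    if not values:
        raise ValueError("at least one value is required")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(n_bins - 1, idx))] += 1
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(reference: Sequence[float],
                               current: Sequence[float],
                               n_bins: int = 10) -> float:
    """PSI = sum over bins of (cur - ref) * ln(cur / ref)."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference window has no spread")
    ref = bin_fractions(reference, lo, hi, n_bins)
    cur = bin_fractions(current, lo, hi, n_bins)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


DRIFT_ALARM = 0.25  # assumption: rule-of-thumb threshold, tune per deployment

# Hypothetical feature, e.g. mean bytes per flow in two time windows.
reference_window = [float(v) for v in range(100)]
shifted_window = [v + 50.0 for v in reference_window]

stable_psi = population_stability_index(reference_window, reference_window)
shifted_psi = population_stability_index(reference_window, shifted_window)
drift_detected = shifted_psi > DRIFT_ALARM
```

A statistic of this kind flags that a feature distribution has moved, but it cannot say whether the change is benign operational evolution or malicious novelty. That distinction still requires labels, context, or analyst review.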

8.4. Multimodal and Reasoning-Centric Cyber Defense

A third direction is the shift from single-modality analytics to multimodal, reasoning-centered cyber defense. The broader multimodal learning literature now offers mature frameworks for representation, fusion, alignment, and cross-modal reasoning, while cyber-specific work such as PACKETCLIP, a multimodal embedding framework that links network traffic with language representations for cybersecurity reasoning, shows how traffic signals and language semantics can be combined [70,102]. This matters because many cyber workflows are inherently multimodal: analysts interpret alerts alongside logs, packet statistics, CTI text, vulnerability descriptions, and notes. Future research should therefore explore robust cross-modal fusion, missing-modality handling, multimodal uncertainty estimation, and joint learning over structured, sequential, and textual evidence. The goal is not simply to add modalities, but to learn complementary, operationally useful representations that improve detection, explanation, and decision support under real-world constraints [70,102].

8.5. Privacy-Preserving Collaborative Defense Beyond Naive Federated Learning

Collaborative cyber defense is highly desirable because useful threat signals are distributed across organizations, devices, platforms, and infrastructures. However, the federated-IDS literature shows that this goal remains unresolved. Hernandez-Ramos et al. identify open issues in aggregation, heterogeneity, datasets, and deployment maturity, and IoT-focused surveys report similar problems under decentralized and resource-constrained conditions [68,109]. Future research should therefore go beyond simply replacing centralized training with federated training. Next-generation systems must better address non-IID data, secure aggregation, poisoning resistance, communication efficiency, personalization, cross-silo governance, and privacy leakage from updates or metadata. In short, collaborative cyber AI must be distributed, private, robust, and operationally manageable at the same time [68,109].
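The idea behind secure aggregation can be illustrated with pairwise additive masking: each pair of clients shares a random mask that one adds and the other subtracts, so the masks cancel in the server-side sum while individual updates stay hidden. The sketch below is a toy under strong assumptions. A single seeded generator stands in for pairwise key agreement, Python's `random` module is not cryptographically secure, client dropout is ignored, and updates are assumed to be quantized to non-negative integers whose sum stays below the modulus.

```python
import random
from typing import List

MODULUS = 2 ** 32  # assumption: quantized updates and their sum fit below this


def pairwise_masks(n_clients: int, dim: int, seed: int) -> List[List[int]]:
    """One mask vector per client; all masks sum to zero modulo MODULUS.
    Toy stand-in: real protocols derive each shared mask from a pairwise key."""
    rng = random.Random(seed)  # not a cryptographic generator
    masks = [[0] * dim for _ in range(n_clients)]
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            shared = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] = (masks[i][k] + shared[k]) % MODULUS
                masks[j][k] = (masks[j][k] - shared[k]) % MODULUS
    return masks


def mask_update(update: List[int], mask: List[int]) -> List[int]:
    """What a client sends: its quantized update plus its mask."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(masked_updates: List[List[int]]) -> List[int]:
    """Server-side sum; pairwise masks cancel, leaving the sum of updates."""
    return [sum(column) % MODULUS for column in zip(*masked_updates)]


# Three hypothetical clients with quantized model updates.
updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
masks = pairwise_masks(n_clients=3, dim=3, seed=7)
masked = [mask_update(u, m) for u, m in zip(updates, masks)]
total = aggregate(masked)
```

The server recovers only the element-wise sum of the three updates. Masking of this kind addresses leakage from individual updates but does nothing against poisoning, non-IID data, or metadata leakage, which is consistent with treating those as separate open problems.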

8.6. Trustworthy LLMs, Cyber Copilots, and Agentic Security Workflows

Large language models are expanding cybersecurity research from narrow classification to semantic assistance, code analysis, threat intelligence synthesis, incident reasoning, and interactive security workflows. Reviews by Zhang et al. and Karras et al. show that LLMs are already being studied for vulnerability detection, secure code generation, program repair, malware analysis, anomaly detection, CTI, and offensive-security assistance, while also highlighting major concerns such as hallucination, jailbreaks, domain grounding, privacy, and weak evaluation rigor [110,111]. Future work should therefore not treat LLMs as simple replacements for classical detectors. The more important agenda is the design of grounded, tool-aware, auditable, safety-constrained cyber copilots and agents that retrieve evidence, reason over heterogeneous artifacts, justify outputs, and fail safely under uncertainty. Domain adaptation, retrieval-augmented generation, cyber-reasoning benchmarks, and secure orchestration of LLM-enabled workflows will likely become central topics in the next research wave [110,111].

Table 18 translates this agenda into a concrete control-oriented view for LLM-enabled SOC workflows. It emphasizes that the relevant unit of analysis is often not the base model alone, but the full system: retrieval layer, prompt boundary, tool interface, evidence store, human approval path, and audit trail.

8.7. Deep Learning for Software Security and Vulnerability Discovery

Another important direction is software and source-code security, where deep learning is increasingly applied to vulnerability detection, code representation learning, and code-level reasoning. Liang et al. show that the literature has already diversified across token-based and graph-based representations and multiple deep learning paradigms, while also noting major limitations and open challenges [112]. For this survey, the key implication is that vulnerability analysis should move beyond isolated static-code classification toward richer software security pipelines that combine syntax, semantics, data flow, control flow, dependency context, patch information, and natural language artifacts such as CVE descriptions or commit messages. This area is especially well suited to hybrid models that integrate graph learning, code LLMs, and task-specific deep architectures, but future progress will depend on cleaner benchmarks, stronger grounding in software engineering practice, and lower false-positive rates in deployment-oriented settings [110,112].

8.8. Encrypted Traffic, Edge Deployment, and Resource-Constrained Cyber AI

Modern cybersecurity increasingly operates in settings where packet payloads are encrypted and inference must occur at the edge, in IoT networks, or on constrained infrastructure. Sharma and Lashkari highlight the challenge created by reduced payload visibility, while IoT and emerging-technology surveys show that many deployment targets face strict limits in latency, bandwidth, memory, and computation [33,67,68,69]. Future research should therefore focus on resource-efficient deep models, model compression, edge-aware inference, adaptive offloading, resilience under resource constraints, and privacy-sensitive encrypted traffic analytics. This includes methods that preserve useful cyber visibility without relying on deep packet inspection, as well as compact models that operate near the data source without major loss in detection quality. Work on resource-constrained AI resilience is especially relevant because it shifts the discussion from model compression alone to disturbance tolerance, adaptive recovery, and dependable operation under hardware and workload limits [63]. In practice, deployment-aware AI will be as important as architectural novelty for the next generation of cyber defense [33,63,67,68,69].

8.9. Human–AI Teaming and Analyst-Centered Evaluation

A final research priority is the shift from model-centered evaluation to analyst-centered evaluation. In real security operations, AI is valuable not only when it detects threats accurately, but when it reduces alert fatigue, speeds triage, improves decisions, and preserves human control in ambiguous situations. Baruwal Chhetri et al. explicitly argue for human–AI teaming in Security Operations Centres and propose a framework that combines automation, augmentation, and collaboration rather than treating them as separate modes [47,48,113,114]. This perspective is especially important in cybersecurity, cyber-trust, and responsible AI settings, where the most valuable use cases often involve mixed-initiative work: AI prioritizes, summarizes, correlates, or proposes hypotheses, while human analysts validate, redirect, contextualize, and decide. Future research should therefore include human-grounded metrics such as analyst workload, trust calibration, time-to-triage, escalation quality, and error recovery, rather than relying only on offline classification metrics [113].

8.10. Overall Research Agenda

Taken together, these challenges suggest that the next phase of deep learning for cybersecurity should be defined less by isolated benchmark gains and more by progress toward adaptive, multimodal, collaborative, trustworthy, deployment-aware, and human-centered cyber intelligence. Recent surveys already point in this direction: drift-aware IDS, federated intrusion detection, multimodal learning, LLM-driven cyber systems, encrypted traffic analytics, edge-aware deployment, and analyst support frameworks are emerging as core research frontiers rather than peripheral extensions [32,33,67,68,69,70,102,109,110,111,112,113]. A mature field will likely require integrated evaluation protocols that jointly assess detection quality, robustness, privacy, temporal validity, computational efficiency, and analyst utility. That integration is both the central challenge and the most promising path forward for operational deep learning in cybersecurity [32,33,67,68,69,70,102,109,110,111,112,113]. Figure 6 summarizes the major research directions identified in this survey and organizes them into a staged roadmap spanning near-term, mid-term, and long-term priorities. The roadmap provides a concise visual synthesis of the section and gives readers clear starting points for future research.

As shown in Figure 6, future progress in deep learning for cybersecurity will depend on moving from stronger methodological rigor and benchmark credibility toward adaptive, collaborative, multimodal, privacy-aware, deployment-ready, and analyst-supportive cyber intelligence systems.
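To make the idea of integrated evaluation more concrete, the following minimal sketch pairs two of the ingredients discussed in this section: a time-ordered train/test split, as one simple form of temporal validation, and expected calibration error (ECE), a common binned summary of how well confidence matches accuracy. The function names, the `(timestamp, item)` sample layout, and the default parameters are assumptions made for illustration; the sketch is not a protocol taken from the reviewed literature.

```python
from typing import List, Sequence, Tuple


def temporal_split(
    samples: Sequence[Tuple[float, object]], train_fraction: float = 0.7
) -> Tuple[List[Tuple[float, object]], List[Tuple[float, object]]]:
    """Order (timestamp, item) pairs by time, then train on the earliest
    fraction and test on the remainder, so no future data leaks into training."""
    ordered = sorted(samples, key=lambda s: s[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE with equal-width confidence bins: the sample-weighted average of
    |mean confidence - accuracy| over all non-empty bins."""
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A fuller protocol would add robustness, privacy, efficiency, and analyst-utility measurements to the same report. The point of the sketch is only that detection quality, temporal validity, and calibration can be assessed on the same held-out, later-in-time data.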

9. Limitations

This review has several limitations. First, it was limited to English-language peer-reviewed publications, which may exclude relevant technical reports, preprints, standards documents, and non-English studies. Second, because the review covers heterogeneous cybersecurity tasks, datasets, model families, and evaluation metrics, formal meta-analysis was not appropriate. Third, database indexing and citation tracking strategies may miss relevant studies that use different terminology. Fourth, the rapid development of large language models, cyber copilots, agentic security systems, and deployment-oriented cyber AI means that the evidence base may change quickly. Fifth, this version does not report PRISMA-ScR numerical counts because a complete dated search export and screening log were not available for verification. Finally, the review relies on the methodological details reported in primary studies; when studies omit preprocessing steps, split strategies, source code, or deployment constraints, their operational maturity may be difficult to assess.

10. Conclusions

Deep learning has become a major force in modern cybersecurity because it enables strong representation learning across heterogeneous data sources, including network flows, logs, binaries, graphs, text, and multimodal threat evidence [3,7,13,24]. As this survey has shown, its applications now span intrusion detection, malware analysis, phishing detection, biometric authentication, and cyber threat intelligence, while also extending to federated, graph-based, transformer-based, and LLM-driven cyber defense settings [9,11,12,13,39,40,41,42,49,50,65,66,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87]. However, the literature also shows that architectural progress alone does not ensure operational value. Reported improvements remain heavily influenced by benchmark choice, preprocessing, split strategy, temporal validity, and reproducibility, so many published gains should be interpreted cautiously unless supported by rigorous, deployment-relevant evaluation [30,31,83,100,101,103,104,105,107,108].

A central conclusion of this survey is that deep learning for cybersecurity should not be treated as only an accuracy-optimization problem. It should be understood as a trustworthy cyber intelligence problem in which performance must be balanced with robustness, privacy, explainability, uncertainty, and deployment feasibility [10,34,35,36,53,54,55,58,59,60,61,62,88,89,91,92,93,94,95,97,99,115]. This matters because cybersecurity systems operate in adversarial, evolving environments where attackers adapt, data drift occurs, false alarms are costly, and human analysts often remain involved. In such settings, a highly accurate but non-robust, opaque, privacy-leaking, or poorly calibrated model may have limited real-world value. The future of the field therefore depends less on isolated benchmark gains and more on integrated evaluation frameworks that jointly assess detection quality, resilience, privacy, transparency, and lifecycle security [10,34,35,36,54,55,59,60,61,62,93,94,97,99].

Another key conclusion is that the next stage of progress will likely come from integration across dimensions rather than isolated advances in a single model family. The most promising directions identified in this survey include drift-aware and continual cyber learning, multimodal fusion, collaborative and privacy-preserving defense, trustworthy LLM-based cyber assistants, edge-aware and encrypted traffic analytics, and human–AI teaming for security operations [32,33,67,68,69,70,102,109,110,111,112,113]. Together, these directions reflect a broader shift from static laboratory classification to adaptive, contextual, interactive, and deployment-aware cyber defense systems. Achieving that shift will require better datasets, stronger temporal and cross-domain evaluation, more transparent reporting, and closer alignment between model design and operational security practice [30,31,32,33,67,68,69,70,83,100,101,102,103,104,105,107,108,109,110,111,112,113].

In sum, deep learning has already reshaped cybersecurity research, but the field is still maturing from benchmark-driven experimentation to operationally credible intelligent defense. Especially useful future contributions are likely to be those that connect model innovation with trustworthy evaluation, realistic deployment assumptions, and measurable support for human decision-making. If these challenges are addressed systematically, deep learning can become a more reliable foundation for scalable, transparent, privacy-aware, and resilient cybersecurity systems.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics15112421/s1. The submission package includes README_supplementary_materials.md: supplementary materials overview and file descriptions; Table S0: supplementary materials index; Table S1: search-string families and database adaptations; Table S2: reviewed-source reference list generated from the de-duplicated cited-source bibliography; Table S3: data-charting codebook used to structure the evidence map; Tables S4 and S4B: exclusion-reason taxonomy and DOI/reference-audit notes; Tables S5, S5B, and S5C: non-exclusive coded emphasis summaries, task-by-model matrix, and trustworthiness/deployment matrix used in the revised synthesis; Tables S6 and S6A: deployment-reporting targets and bibliographic publication year profile; Tables S7 and S7B: reviewer-response coverage checklist and seven-principle scoring template. These supplementary files document the transparent review workflow and coded emphasis logic without claiming PRISMA-ScR numerical screening counts or systematic final-inclusion statistics.

Author Contributions

Conceptualization, M.G. and K.G.; methodology, M.G., K.G. and A.M.; literature-search strategy, M.G., D.B., C.C., M.M., C.L., B.J. and A.F.; evidence charting and synthesis support, M.G., A.M., D.B., C.C., M.M., C.L., B.J. and A.F.; formal synthesis, M.G. and K.G.; visualization, M.G.; writing—original draft preparation, M.G.; writing—review and editing, M.G., K.G., A.M., D.B., C.C., M.M., C.L., B.J. and A.F.; supervision, M.G. and K.G.; project administration, M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new primary empirical datasets were generated. The review materials supporting the synthesis are provided in the Supplementary Materials, including the supplementary materials index, search-string families, reviewed-source reference list, data-charting codebook, DOI/reference-audit notes, exclusion-reason taxonomy, non-exclusive coded emphasis summaries, task-by-model and trustworthiness/deployment matrices, deployment-reporting targets, bibliographic year profile, and reviewer-response coverage checklist. The numerical summaries in these files describe coded emphases within the cited-source bibliography and should not be interpreted as database search returns, screening exclusions, PRISMA counts, or systematic final-inclusion statistics. A complete original PRISMA-ScR screening export and original full-text exclusion log were not available for independent verification in this submission version, so the paper is explicitly framed as a structured narrative review with evidence-mapping components rather than a PRISMA-counted systematic review. Additional details are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial intelligence
API	Application programming interface
CNN	Convolutional neural network
CPU	Central processing unit
CTI	Cyber threat intelligence
CVE	Common Vulnerabilities and Exposures
CWE	Common Weakness Enumeration
DL	Deep learning
DOM	Document Object Model
GAN	Generative adversarial network
GNN	Graph neural network
GPU	Graphics processing unit
GRU	Gated recurrent unit
HIDS	Host-based intrusion detection system
HTML	Hypertext Markup Language
ICS	Industrial control system
IDS	Intrusion detection system
IOC	Indicator of compromise
IoT	Internet of Things
LLM	Large language model
ML	Machine learning
NIDS	Network intrusion detection system
NLP	Natural language processing
NVD	National Vulnerability Database
PAD	Presentation attack detection
PCC	Population–Concept–Context
PRISMA	Preferred Reporting Items for Systematic Reviews and Meta-Analyses
PRISMA-ScR	PRISMA extension for scoping reviews
RAM	Random-access memory
RNN	Recurrent neural network
ROC-AUC	Receiver operating characteristic area under the curve
SOC	Security operations center
XAI	Explainable artificial intelligence

References

1. Sommer, R.; Paxson, V. Outside the Closed World: On Using Machine Learning for Network Intrusion Detection. In Proceedings of the 2010 IEEE Symposium on Security and Privacy, Oakland, CA, USA, 16–19 May 2010; pp. 305–316.
2. Buczak, A.L.; Guven, E. A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection. IEEE Commun. Surv. Tutor. 2016, 18, 1153–1176.
3. Xin, Y.; Kong, L.; Liu, Z.; Chen, Y.; Li, Y.; Zhu, H.; Gao, M.; Hou, H.; Wang, C. Machine Learning and Deep Learning Methods for Cybersecurity. IEEE Access 2018, 6, 35365–35381.
4. Milenkoski, A.; Vieira, M.; Kounev, S.; Avritzer, A.; Payne, B.D. Evaluating Computer Intrusion Detection Systems: A Survey of Common Practices. ACM Comput. Surv. 2015, 48, 1–41.
5. Berman, D.S.; Buczak, A.L.; Chavis, J.S.; Corbett, C.L. A Survey of Deep Learning Methods for Cyber Security. Information 2019, 10, 122.
6. Ring, M.; Wunderlich, S.; Scheuring, D.; Landes, D.; Hotho, A. A Survey of Network-Based Intrusion Detection Data Sets. Comput. Secur. 2019, 86, 147–167.
7. Ferrag, M.A.; Maglaras, L.; Moschoyiannis, S.; Janicke, H. Deep Learning for Cyber Security Intrusion Detection: Approaches, Datasets, and Comparative Study. J. Inf. Secur. Appl. 2020, 50, 102419.
8. Macas, M.; Wu, C.; Fuertes, W. Adversarial Examples: A Survey of Attacks and Defenses in Deep Learning-Enabled Cybersecurity Systems. Expert Syst. Appl. 2024, 238, 122223.
9. Makris, I.; Karampasi, A.; Radoglou-Grammatikis, P.; Episkopos, N.; Iturbe, E.; Rios, E.; Piperigkos, N.; Lalos, A.; Xenakis, C.; Lagkas, T.; et al. A Comprehensive Survey of Federated Intrusion Detection Systems: Techniques, Challenges and Solutions. Comput. Sci. Rev. 2025, 56, 100717.
10. Sharma, A.; Rani, S.; Shabaz, M. A Comprehensive Review of Explainable AI in Cybersecurity: Decoding the Black Box. ICT Express 2025, 11, 1200–1219.
11. Kheddar, H. Transformers and Large Language Models for Efficient Intrusion Detection Systems: A Comprehensive Survey. Inf. Fusion 2025, 124, 103347.
12. Zhong, M.; Lin, M.; Zhang, C.; Xu, Z. A Survey on Graph Neural Networks for Intrusion Detection Systems: Methods, Trends and Challenges. Comput. Secur. 2024, 141, 103821.
13. Chen, Y.; Cui, M.; Wang, D.; Cao, Y.; Yang, P.; Jiang, B.; Lu, Z.; Liu, B. A Survey of Large Language Models for Cyber Threat Detection. Comput. Secur. 2024, 145, 104016.
14. Macas, M.; Wu, C.; Fuertes, W. A survey on deep learning for cybersecurity: Progress, challenges, and opportunities. Comput. Netw. 2022, 212, 109032.
15. Munn, Z.; Peters, M.D.; Stern, C.; Tufanaru, C.; McArthur, A.; Aromataris, E. Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Med. Res. Methodol. 2018, 18, 143.
16. Arksey, H.; O’Malley, L. Scoping studies: Towards a methodological framework. Int. J. Soc. Res. Methodol. 2005, 8, 19–32.
17. Levac, D.; Colquhoun, H.; O’Brien, K.K. Scoping studies: Advancing the methodology. Implement. Sci. 2010, 5, 69.
18. Peters, M.D.J.; Marnie, C.; Tricco, A.C.; Pollock, D.; Munn, Z.; Alexander, L.; McInerney, P.; Godfrey, C.M.; Khalil, H. Updated methodological guidance for the conduct of scoping reviews. JBI Evid. Synth. 2020, 18, 2119–2126.
19. Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.; Horsley, T.; Weeks, L.; et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann. Intern. Med. 2018, 169, 467–473.
20. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.
21. Wohlin, C. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE ’14), London, UK, 13–14 May 2014; pp. 1–10.
22. Peters, M.D.J.; Godfrey, C.; McInerney, P.; Khalil, H.; Larsen, P.; Marnie, C.; Pollock, D.; Tricco, A.C.; Munn, Z. Best practice guidance and reporting items for the development of scoping review protocols. JBI Evid. Synth. 2022, 20, 953–968.
23. Ghayoumi, M. Deep Learning in Practice, 1st ed.; CRC Press/Chapman and Hall: Boca Raton, FL, USA, 2022.
24. Baltrušaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443.
25. Ometov, A.; Molua, O.L.; Komarov, M.; Nurmi, J. A Survey of Security in Cloud, Edge, and Fog Computing. Sensors 2022, 22, 927.
26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
27. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph Neural Networks: A Review of Methods and Applications. AI Open 2021, 1, 57–81.
28. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
29. Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A Survey on Concept Drift Adaptation. ACM Comput. Surv. 2014, 46, 44.
30. Andresini, G.; Pendlebury, F.; Pierazzi, F.; Loglisci, C.; Appice, A.; Cavallaro, L. INSOMNIA: Towards Concept-Drift Robustness in Network Intrusion Detection. In Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security (AISec ’21), Virtual, 15 November 2021.
31. Ceschin, F.; Botacin, M.; Bifet, A.; Pfahringer, B.; Oliveira, L.S.; Gomes, H.M.; Grégio, A. Machine Learning (In)Security: A Stream of Problems. Digit. Threat. Res. Pract. 2024, 5, 1–32.
32. Shyaa, M.A.; Ibrahim, N.F.; Zainol, Z.; Abdullah, R.; Anbar, M.; Alzubaidi, L. Evolving cybersecurity frontiers: A comprehensive survey on concept drift and feature dynamics aware machine and deep learning in intrusion detection systems. Eng. Appl. Artif. Intell. 2024, 137, 109143.
33. Neto, E.C.P.; Iqbal, S.; Buffett, S.; Sultana, M.; Taylor, A. Deep learning for intrusion detection in emerging technologies: A comprehensive survey and new perspectives. Artif. Intell. Rev. 2025, 58, 340.
34. NIST AI 100-2e2025; Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations. National Institute of Standards and Technology: Gaithersburg, MD, USA, 2025.
35. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A Survey of Methods for Explaining Black Box Models. ACM Comput. Surv. 2019, 51, 1–42.
36. Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608.
37. Zhang, C.; Xie, Y.; Bai, H.; Yu, B.; Li, W.; Gao, Y. A Survey on Federated Learning. Knowl.-Based Syst. 2021, 216, 106775.
38. Tjoa, E.; Guan, C. A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4793–4813.
39. Zhang, Y.; Muniyandi, R.C.; Qamar, F. A Review of Deep Learning Applications in Intrusion Detection Systems: Overcoming Challenges in Spatiotemporal Feature Extraction and Data Imbalance. Appl. Sci. 2025, 15, 1552.
40. Wang, H.; Cui, B.; Yuan, Q.; Shi, R.; Huang, M. A Review of Deep Learning Based Malware Detection Techniques. Neurocomputing 2024, 598, 128010.
41. Kavya, S.; Sumathi, D. Staying Ahead of Phishers: A Review of Recent Advances and Emerging Methodologies in Phishing Detection. Artif. Intell. Rev. 2025, 58, 50.
42. Minaee, S.; Abdolrashidi, A.; Su, H.; Bennamoun, M.; Zhang, D. Biometrics Recognition Using Deep Learning: A Survey. Artif. Intell. Rev. 2023, 56, 8647–8695.
43. Ghayoumi, M. A review of multimodal biometric systems: Fusion methods and their applications. In Proceedings of the 2015 IEEE/ACIS 14th International Conference on Computer and Information Science (ICIS), Las Vegas, NV, USA, 28 June–1 July 2015; pp. 131–136.
44. Ghayoumi, M.; Ghazinour, K. An adaptive fuzzy multimodal biometric system for identification and verification. In Proceedings of the 2015 IEEE/ACIS 14th International Conference on Computer and Information Science (ICIS), Las Vegas, NV, USA, 28 June–1 July 2015; pp. 137–141.
45. Ghazinour, K.; Ghayoumi, M. An autonomous model to enforce security policies based on user’s behavior. In Proceedings of the 2015 IEEE/ACIS 14th International Conference on Computer and Information Science (ICIS), Las Vegas, NV, USA, 28 June–1 July 2015; pp. 95–99.
46. Ghazinour, K.; Ghayoumi, M. A Dynamic Trust Model Enforcing Security Policies. In Proceedings of the International Conference on Intelligent Information Processing, Security and Advanced Communication (IPAC ’15), Batna, Algeria, 23–25 November 2015; pp. 1–5.
47. Babaev, I.; Packer, T.; Ghayoumi, M.; Ghazinour, K. MAISON: A Model for Effective Hybrid Management of Cybersecurity and Cyber-Trust. Int. J. Inf. Technol. 2024, 1, 1–7.
48. Ghayoumi, M.; Ghazinour, K. Advancing MAISON: Integrating Deep Learning and Social Dynamics in Cyberbullying Detection and Prevention. In Proceedings of the 2024 7th International Conference on Information and Computer Technologies, Honolulu, HI, USA, 15–17 March 2024; pp. 80–86.
49. Lin, P.; Ye, K.; Hu, Y.; Lin, Y.; Xu, C.-Z. A Novel Multimodal Deep Learning Framework for Encrypted Traffic Classification. IEEE/ACM Trans. Netw. 2023, 31, 1369–1384.
50. Aceto, G.; Ciuonzo, D.; Montieri, A.; Pescapé, A. DISTILLER: Encrypted Traffic Classification via Multimodal Multitask Deep Learning. J. Netw. Comput. Appl. 2021, 183–184, 102985.
51. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
52. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444.
53. Pan, K.; Ong, Y.-S.; Gong, M.; Li, H.; Qin, A.K.; Gao, Y. Differential privacy in deep learning: A literature survey. Neurocomputing 2024, 589, 127663.
54. Zhang, Y.; Zeng, D.; Luo, J.; Fu, X.; Chen, G.; Xu, Z.; King, I. A Survey of Trustworthy Federated Learning: Issues, Solutions, and Challenges. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–47.
55. Zhao, J.; Bagchi, S.; Avestimehr, S.; Chan, K.; Chaterji, S.; Dimitriadis, D.; Li, J.; Li, N.; Nourian, A.; Roth, H. The Federation Strikes Back: A Survey of Federated Learning Privacy Attacks, Defenses, Applications, and Policy Landscape. ACM Comput. Surv. 2025, 57, 1–37.
56. Ghazinour, K.; Ghayoumi, M. Dynamic Modeling for Representing Access Control Policies Effect. arXiv 2015, arXiv:1505.08154.
57. Das, B.C.; Amini, M.H.; Wu, Y. Security and Privacy Challenges of Large Language Models: A Survey. ACM Comput. Surv. 2025, 57, 1–39.
58. Hu, H.; Salcic, Z.; Sun, L.; Dobbie, G.; Yu, P.S.; Zhang, X. Membership Inference Attacks on Machine Learning: A Survey. ACM Comput. Surv. 2022, 54, 1–37.
59. NIST AI 100-1; Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology: Gaithersburg, MD, USA, 2023.
60. NIST SP 800-218A; Secure Software Development Practices for Generative AI and Dual-Use Foundation Models: An SSDF Community Profile. National Institute of Standards and Technology: Gaithersburg, MD, USA, 2024.
61. Yang, T.; Qiao, Y.; Lee, B. Towards trustworthy cybersecurity operations using Bayesian Deep Learning to improve uncertainty quantification of anomaly detection. Comput. Secur. 2024, 144, 103909.
62. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 1321–1330.
63. Moskalenko, V.; Kharchenko, V.; Semenov, S. Model and Method for Providing Resilience to Resource-Constrained AI-System. Sensors 2024, 24, 5951.
64. Liu, H.-I.; Galindo, M.; Xie, H.; Wong, L.-K.; Shuai, H.-H.; Li, Y.-H.; Cheng, W.-H. Lightweight Deep Learning for Resource-Constrained Environments: A Survey. ACM Comput. Surv. 2024, 56, 267.
65. Aldhaheri, A.; Alwahedi, F.; Ferrag, M.A.; Battah, A. Deep Learning for Cyber Threat Detection in IoT Networks: A Review. Internet Things Cyber-Phys. Syst. 2024, 4, 110–128.
66. Aslam, M.M.; Tufail, A.; Irshad, M.N. Survey of Deep Learning Approaches for Securing Industrial Control Systems: A Comparative Analysis. Cyber Secur. Appl. 2025, 3, 100096.
67. Sharma, A.; Lashkari, A.H. A survey on encrypted network traffic: A comprehensive survey of identification/classification techniques, challenges, and future directions. Comput. Netw. 2025, 257, 110984.
68. Rahman, M.M.; Shakil, S.A.; Mustakim, M.R. A survey on intrusion detection system in IoT networks. Cyber Secur. Appl. 2025, 3, 100082.
69. Hoffpauir, K.; Simmons, J.; Schmidt, N.; Pittala, R.; Briggs, I.; Makani, S.; Jararweh, Y. A Survey on Edge Intelligence and Lightweight Machine Learning Support for Future Applications and Services. ACM J. Data Inf. Qual. 2023, 15, 20.
70. Yuan, Y.; Li, Z.; Zhao, B. A Survey of Multimodal Learning: Methods, Applications, and Future. ACM Comput. Surv. 2025, 57, 167.
71. Vinayakumar, R.; Alazab, M.; Soman, K.P.; Poornachandran, P.; Al-Nemrat, A.; Venkatraman, S. Deep Learning Approach for Intelligent Intrusion Detection System. IEEE Access 2019, 7, 41525–41550.
72. Gopinath, M.; Sethuraman, S.C. A Comprehensive Survey on Deep Learning Based Malware Detection Techniques. Comput. Sci. Rev. 2023, 47, 100529.
73. Deldar, F.; Abadi, M. Deep Learning for Zero-Day Malware Detection and Classification: A Survey. ACM Comput. Surv. 2023, 56, 1–37.
74. Alzubaidi, A. Detecting Android Malware Using Deep Learning Algorithms: A Survey. Comput. Electr. Eng. 2024, 119, 109544.
75. Asiri, S.; Xiao, Y.; Alzahrani, S.; Li, T. PhishingRTDS: A Real-Time Detection System for Phishing Attacks Using a Deep Learning Model. Comput. Secur. 2024, 141, 103843.
76. Ibrahim, M.; Elhafiz, R. Phishing Email Detection Using BERT and RoBERTa. Computation 2026, 14, 46.
77. Vennela, A.; Akarapu, R.B.; Rakshith, B.L.; Asirvatham, L.G.; Sunil, G. Intelligent Cybersecurity Systems for Phishing Attack Detection: An Overview. Comput. Electr. Eng. 2026, 130, 110829.
78. Shaheed, K.; Szczuko, P.; Kumar, M.; Qureshi, I.; Abbas, Q.; Ullah, I. Deep Learning Techniques for Biometric Security: A Systematic Review of Presentation Attack Detection Systems. Eng. Appl. Artif. Intell. 2024, 129, 107569.
79. Alrawili, R.; AlQahtani, A.A.S.; Khan, M.K. Comprehensive Survey: Biometric User Authentication Application, Evaluation, and Discussion. Comput. Electr. Eng. 2024, 119, 109485.
80. Zeng, L.; Shen, P.; Zhu, X.; Tian, X.; Chen, C. A Review of Privacy-Preserving Biometric Identification and Authentication Protocols. Comput. Secur. 2025, 150, 104309.
81. Furumoto, K.; Morikawa, T.; Kolehmainen, A.; Silverajan, B.; Takahashi, T.; Inoue, D. A Comprehensive Survey of Threat Intelligence Research: A Measurement-Based Study. ACM Comput. Surv. 2026, 58, 153.
82. Ahmed, K.; Khurshid, S.K.; Hina, S. CyberEntRel: Joint Extraction of Cyber Entities and Relations Using Deep Learning. Comput. Secur. 2024, 136, 103579.
83. Goldschmidt, P.; Chudá, D. Network Intrusion Datasets: A Survey, Limitations, and Recommendations. Comput. Secur. 2025, 156, 104510.
84. Du, M.; Li, F.; Zheng, G.; Srikumar, V. DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017; pp. 1285–1298.
85. Duan, Y.; Xue, K.; Sun, H.; Bao, H.; Wei, Y.; You, Z.; Zhang, Y.; Jiang, X.; Yang, S.; Chen, J.; et al. LogEDL: Log Anomaly Detection via Evidential Deep Learning. Appl. Sci. 2024, 14, 7055.
86. Bilot, T.; El Madhoun, N.; Al Agha, K.; Zouaoui, A. A Survey on Malware Detection with Graph Representation Learning. ACM Comput. Surv. 2024, 56, 1–36.
87. Saha, S.; Afroz, S.; Rahman, A.H. MAlign: Explainable Static Raw-Byte Based Malware Family Classification Using Sequence Alignment. Comput. Secur. 2024, 139, 103714.
88. Carlini, N.; Wagner, D. Towards Evaluating the Robustness of Neural Networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–26 May 2017; pp. 39–57.
89. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
90. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
91. Jedrzejewski, F.V.; Thode, L.; Fischbach, J.; Gorschek, T.; Mendez, D.; Lavesson, N. Adversarial Machine Learning in Industry: A Systematic Literature Review. Comput. Secur. 2024, 145, 103988.
92. Bena, N.; Anisetti, M.; Damiani, E.; Yeun, C.Y.; Ardagna, C.A. Protecting machine learning from poisoning attacks: A risk-based approach. Comput. Secur. 2025, 155, 104468.
93. Nguyen, T.D.; Nguyen, T.; Le Nguyen, P.; Pham, H.H.; Doan, K.D.; Wong, K.-S. Backdoor attacks and defenses in federated learning: Survey, challenges and future research directions. Eng. Appl. Artif. Intell. 2024, 127, 107166.
94. Bunko, T.; Johnstone, M.N.; Yang, W.; Scott, B.A. A survey of privacy-preserving federated learning for intrusion detection systems. Artif. Intell. Rev. 2026, 59, 125.
95. Rigaki, M.; Garcia, S. A Survey of Privacy Attacks in Machine Learning. ACM Comput. Surv. 2023, 56, 1–34.
96. Shokri, R.; Stronati, M.; Song, C.; Shmatikov, V. Membership Inference Attacks against Machine Learning Models. In Proceedings of the IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–26 May 2017; pp. 3–18.
97. El Mestari, S.Z.; Lenzini, G.; Demirci, H. Preserving data privacy in machine learning systems. Comput. Secur. 2024, 137, 103605.
98. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Aguera y Arcas, B. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the AISTATS, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
99. Reynaud, S.; Roxin, A. Review of eXplainable artificial intelligence for cybersecurity systems. Discov. Artif. Intell. 2025, 5, 78.
100. Yang, Z.; Liu, X.; Li, T.; Wu, D.; Wang, J.; Zhao, Y.; Han, H. A systematic literature review of methods and datasets for anomaly-based network intrusion detection. Comput. Secur. 2022, 116, 102675.
101. Pekar, A.; Jozsa, R. Evaluating ML-based anomaly detection across datasets of varied integrity: A case study. Comput. Netw. 2024, 251, 110617.
102. Masukawa, R.; Yun, S.; Jeong, S.; Huang, W.; Ni, Y.; Bryant, I.; Bastian, N.D.; Imani, M. PACKETCLIP: Multi-modal embedding of network traffic and language for cybersecurity reasoning. Front. Artif. Intell. 2025, 8, 1593944.
103. Pinto, D.; Amorim, I.; Maia, E.; Praça, I. A review on intrusion detection datasets: Tools, processes, and features. Comput. Netw. 2025, 262, 111177.
104. Tory, A.R.; Hasan, F.K. An evaluation framework for network IDS/IPS datasets: Leveraging MITRE ATT&CK and industry relevance metrics. Comput. Secur. 2026, 161, 104777.
105. Wolf, M.; Tritscher, J.; Landes, D.; Hotho, A.; Schloer, D. Benchmarking of synthetic network data: Reviewing challenges and approaches. Comput. Secur. 2024, 145, 103993.
106. Ghayoumi, M. Generative Adversarial Networks in Practice, 1st ed.; CRC Press/Chapman and Hall: Boca Raton, FL, USA, 2023.
107. Manocchio, L.D.; Layeghy, S.; Gallagher, M.; Portmann, M. An empirical evaluation of preprocessing methods for machine learning based network intrusion detection systems. Eng. Appl. Artif. Intell. 2025, 158, 111289.
108. Olszewski, D.; Lu, A.; Stillman, C.; Warren, K.; Kitroser, C.; Pascual, A.; Ukirde, D.; Butler, K.; Traynor, P. Get in Researchers; We’re Measuring Reproducibility: A Reproducibility Study of Machine Learning Papers in Tier 1 Security Conferences. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ’23), Copenhagen, Denmark, 26–30 November 2023.
109. Hernandez-Ramos, J.L.; Karopoulos, G.; Chatzoglou, E.; Kouliaridis, V.; Marmol, E.; Gonzalez-Vidal, A.; Kambourakis, G. Intrusion Detection Based on Federated Learning: A Systematic Review. ACM Comput. Surv. 2025, 57, 1–65.
110. Zhang, J.; Bu, H.; Wen, H.; Liu, Y.; Fei, H.; Xi, R.; Li, L.; Yang, Y.; Zhu, H.; Meng, D. When LLMs meet cybersecurity: A systematic literature review. Cybersecurity 2025, 8, 55.
111. Karras, A.; Theodorakopoulos, L.; Karras, C.; Theodoropoulou, A.; Kalliampakou, I.; Kalogeratos, G. LLMs for Cybersecurity in the Big Data Era: A Comprehensive Review of Applications, Challenges, and Future Directions. Information 2025, 16, 957.
112. Liang, C.; Wei, Q.; Du, J.; Wang, Y.; Jiang, Z. Survey of source code vulnerability analysis based on deep learning. Comput. Secur. 2025, 148, 104098.
113. Baruwal Chhetri, M.; Tariq, S.; Singh, R.; Jalalvand, F.; Paris, C.; Nepal, S. Towards Human-AI Teaming to Mitigate Alert Fatigue in Security Operations Centres. ACM Trans. Internet Technol. 2024, 24, 1–22.
114. Ghayoumi, M.; Ghazinour, K. Human Rights in the Shadow of AI: Confronting Bias and Accountability. In Proceedings of the 2025 IEEE 16th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), Yorktown Heights, NY, USA, 22–24 October 2025.
115. Biggio, B.; Roli, F. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognit. 2018, 84, 317–331.

Figure 1. Structured search, screening, citation tracking, and evidence-mapping workflow used to organize the review process. The figure summarizes the conceptual workflow without reporting unverified PRISMA numerical counts.

Figure 2. Unified taxonomy of deep learning for cybersecurity. This figure is a conceptual synthesis, not a quantitative inclusion-count diagram. IDS denotes intrusion detection system, and CTI denotes cyber threat intelligence.

Figure 3. Application landscape of deep learning in cybersecurity, showing how major security applications connect to prominent data modalities and commonly used model families. This figure is a conceptual synthesis, not a quantitative inclusion-count diagram. IDS denotes intrusion detection system, CTI denotes cyber threat intelligence, GNN denotes graph neural network, and LLM denotes large language model.

Figure 4. Trustworthiness stack for deep learning in cybersecurity. This figure is a conceptual synthesis, not a quantitative inclusion-count diagram. XAI denotes explainable artificial intelligence, and SUEPAP summarizes secure lifecycle governance, uncertainty and calibration, explainability, privacy preservation, adversarial robustness, and predictive performance.

Figure 5. End-to-end evaluation pipeline for deep learning-based cybersecurity studies. This figure is a methodological workflow, not a quantitative inclusion-count diagram.

Figure 6. Research roadmap for the next generation of deep learning in cybersecurity. The roadmap is a conceptual synthesis of research directions, not a quantitative inclusion-count diagram. It groups future work into near-term methodological priorities, mid-term adaptive and collaborative learning priorities, and long-term operational cyber intelligence priorities.

Table 1. Positioning of the present survey relative to recent review articles on deep learning for cybersecurity.

Survey	Main Scope	Main Strengths	Main Limitations Relative to This Paper
Berman et al. (2019) [5]	General survey of deep learning methods for cybersecurity	Foundational early overview of deep learning (DL) methods across cyber tasks	Published before the rapid rise of graph neural networks (GNNs), transformers, large language models (LLMs), trustworthy artificial intelligence (AI), and recent deployment-focused concerns
Macas et al. (2022) [14]	Broad survey of deep learning for cybersecurity	Covers progress, challenges, and opportunities across major DL-based cybersecurity applications	Does not provide a strong unified framework centered on trustworthiness, deployment feasibility, and recent LLM-centered cyber workflows
Zhong et al. (2024) [12]	GNNs for intrusion detection	Strong coverage of graph-based intrusion detection methods, trends, and challenges	Narrow task scope focused on intrusion detection systems (IDSs) and one model family
Chen et al. (2024) [13]	LLMs for cyber threat detection	Strong review of LLM-based cyber detection tasks and challenges	Focused on LLMs and threat detection rather than the broader DL cybersecurity landscape
Makris et al. (2025) [9]	Federated intrusion detection systems (IDSs)	Strong coverage of federated IDS techniques, challenges, and solutions	Focused on federated IDSs only; does not cover broader deep learning architectures or cross-domain cyber tasks
Sharma et al. (2025) [10]	Explainable AI in cybersecurity	Strong focus on explainable artificial intelligence (XAI) methods, transparency, and interpretability issues in cyber applications	Focused on one trustworthiness dimension rather than the full trustworthy-DL stack
Kheddar (2025) [11]	Transformers and large language models (LLMs) for intrusion detection systems (IDSs)	Strong synthesis of transformer-based and LLM-based IDS methods	IDS-centered and architecture-centered rather than system-level and trustworthiness-centered

Table 2. Operational inclusion boundaries used to keep the review scope explicit.

Boundary	Included	Excluded or Treated Only as Context
Publication window	Primary synthesis: peer-reviewed work from 2015–2026; earlier papers used only for foundational concepts.	Older studies used as current deployment evidence; undated or unverifiable technical claims.
Cybersecurity task	Intrusion detection systems (IDSs), network intrusion detection systems (NIDSs), host-based intrusion detection systems (HIDSs), malware, phishing, spam, biometrics, identity security, CTI, vulnerability analysis, security operations center (SOC) assistance, and cyber-trust/security analytics.	Non-cyber AI applications, general computer vision or natural language processing (NLP) without a cybersecurity task, and generic policy essays.
Deep learning criterion	Multilayer neural networks, representation learning, CNN/RNN/LSTM/GRU, autoencoders, GNNs, transformers, LLMs, multimodal neural systems, and federated deep learning.	Purely rule-based systems or traditional machine learning (ML) papers unless used as baselines, dataset sources, or methodological critiques.
Evidence role	Empirical studies, technical surveys, dataset/evaluation papers, adversarial/privacy/XAI studies, and deployment-aware cyber AI papers.	Editorials, abstracts, tutorials, posters, patents, theses, or papers without sufficient task/model/dataset/evaluation detail.
Synthesis rule	Studies are interpreted through task, modality, model family, trustworthiness property, and deployment environment.	Architecture-only comparison without attention to data, threat model, validation design, or operational setting.

Table 3. Data-charting form used for the cited-source evidence map.

Charting Field	Recorded Values or Examples	Purpose in Synthesis
Bibliographic profile	Year, venue type, review/empirical/methodological paper, DOI status.	Assesses freshness and traceability of the cited-source corpus.
Security task	IDS, malware, phishing, biometrics, CTI, vulnerability analysis, SOC support, cyber-trust.	Defines the first axis and prevents architecture-only comparison.
Data modality	Flow features, packets, logs, application programming interface (API) calls, binaries, graphs, text, images, multimodal evidence.	Links model design to evidence structure and preprocessing requirements.
Model family	CNN, RNN/LSTM/GRU, autoencoder, GNN, transformer, LLM, federated deep learning, hybrid.	Supports task–model and modality–model comparisons.
Trustworthiness evidence	Robustness, poisoning/backdoor resistance, privacy, XAI, calibration, drift, lifecycle security.	Identifies whether a study reports operational credibility beyond accuracy.
Evaluation protocol	Dataset, metric, preprocessing, random/time-aware split, external validation, zero-day or cross-family test.	Captures the methodological weaknesses most often noted by reviewers.
Deployment evidence	Cloud, enterprise, endpoint, edge, IoT, industrial control system (ICS), latency, memory, model size, energy, throughput, update cost.	Distinguishes benchmark evidence from deployable cyber AI evidence.
Reproducibility evidence	Code/data availability, environment details, hyperparameters, feature pipeline, artifact link.	Supports appraisal of whether results can be independently inspected or reproduced.

Table 4. Review protocol adopted in this survey.

Protocol Element	Specification Used in This Survey
Review type	Structured narrative review with evidence-mapping components
Methodological basis	Arksey and O’Malley [16], Levac et al. [17], PRISMA-style transparency guidance [19,20], snowballing guidance [21], and updated JBI guidance [18,22]
Review objective	To synthesize deep learning for cybersecurity through the joint lenses of application domain, model family, trustworthiness, datasets, evaluation practice, and deployment setting
Framing approach	Population–Concept–Context (PCC) framework [18,22]
Population	Cybersecurity studies addressing tasks such as intrusion detection, malware analysis, phishing detection, authentication, cyber threat intelligence (CTI), and related security analytics
Concept	Deep learning and modern neural architectures, including CNNs, RNNs, LSTMs, autoencoders, GNNs, transformers, LLMs, and federated deep learning
Context	Enterprise, cloud, edge, Internet of Things (IoT), industrial, and multimodal cyber defense environments
Databases searched	Scopus, Web of Science, IEEE Xplore, ACM Digital Library, ScienceDirect, and SpringerLink
Search coverage	Broad database coverage across the available record period, documented through reproducible search-string families
Additional search process	Backward and forward snowballing [21]
Inclusion criteria	Peer-reviewed studies with a real cybersecurity task, a substantive deep learning component, and sufficient methodological detail
Exclusion criteria	Non-cybersecurity studies, non-DL-only studies, editorials, short abstracts, tutorials, posters, duplicates, and records with insufficient technical detail
Language restriction	English
Screening stages	Title/abstract screening followed by full-text screening
Screening process	Staged relevance screening with documented eligibility criteria and author-team resolution of ambiguous cases
Data extracted	Year, venue, task, modality, model family, environment, dataset, evaluation metrics, drift treatment, adversarial testing, privacy mechanism, XAI mechanism, computational constraints, deployment evidence, and reproducibility indicators
Synthesis strategy	Descriptive thematic synthesis, non-exclusive coded emphasis summaries, and cross-cutting evidence-mapping
Meta-analysis	Not performed due to high heterogeneity of tasks, datasets, models, and metrics
Reproducibility measure	Explicit search-string reporting, documented inclusion rules, charting fields, DOI audit notes, and protocol transparency [22]

Table 5. Non-exclusive evidence-mapping counts from the de-duplicated cited-source bibliography. Values represent coded emphases in the 115-source bibliography, not systematic inclusion statistics or PRISMA counts; rows are not mutually exclusive and do not sum to 115.

Mapping Dimension	Coded Emphasis	Count	Interpretation for the Synthesis
Security task	Intrusion detection, anomaly detection, network datasets, drift-aware IDS	32	This is the most prominent coded task area, but the evidence is also strongly affected by dataset age, benchmark reuse, temporal leakage, and weak cross-dataset validation.
Security task	Malware, software vulnerability analysis, and code-security learning	10	Coverage is smaller than IDS coverage and often depends on family labels, obfuscation assumptions, static–dynamic trace design, and explainability of code or binary evidence.
Security task	Phishing, social engineering, spam, and online abuse/cyber-trust workflows	7	This area shows the value of text, metadata, visual, and social context, but it remains highly sensitive to distribution shift, adversarial mimicry, and multilingual variation.
Security task	Biometrics, identity security, access control, and dynamic trust	10	This cluster connects authentication, multimodal fusion, spoofing, privacy, and policy-aware security governance.
Security task	CTI, LLMs, SOC assistance, multimodal reasoning, and analyst support	15	Recent work is growing quickly, but evidence remains limited for grounded reasoning, prompt injection defense, tool-use safety, and human-centered validation.
Trustworthiness	Adversarial robustness, poisoning, backdoors, privacy, calibration, XAI, and AI risk guidance	26	Trustworthiness is widely recognized but still fragmented; most studies evaluate only one or two trust properties rather than the full lifecycle risk profile.
Deployment and evaluation	Datasets, benchmarks, reproducibility, edge/IoT/ICS deployment, and resource-aware operation	25	A recurring methodological message is that dataset provenance, temporal validation, computational cost, and deployment realism must be reported together.
Model family	CNNs, RNNs, LSTMs, GRUs, autoencoders, and hybrid deep models	31	Established deep models remain central in IDS, malware, biometrics, and phishing pipelines, especially when sequential or local structure is important.
Model family	GNNs, transformers, LLMs, federated learning, and multimodal models	28	Newer architectures expand the field toward relational reasoning, long-context modeling, collaborative defense, and analyst-facing cyber intelligence.
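
The non-exclusive coding behind Table 5 can be illustrated with a minimal sketch. It assumes each cited source carries a set of coded emphases; the source identifiers and tag names below are hypothetical placeholders, not entries from the actual bibliography. Because one source may carry several tags, the per-tag counts can exceed the number of sources, which is why the rows need not sum to 115.

```python
from collections import Counter

# Hypothetical coded sources: each source may carry several emphasis tags.
coded_sources = {
    "source_a": {"intrusion_detection", "trustworthiness"},
    "source_b": {"intrusion_detection", "deployment_and_evaluation"},
    "source_c": {"cti_llm_soc"},
}

# Count every tag once per source that carries it (non-exclusive coding).
emphasis_counts = Counter(
    tag for tags in coded_sources.values() for tag in tags
)

n_sources = len(coded_sources)
n_coded_emphases = sum(emphasis_counts.values())

print(emphasis_counts["intrusion_detection"])  # sources coded with this tag
print(n_sources, n_coded_emphases)  # the second value can exceed the first
```

Under this reading, each count answers "how many sources emphasize this theme" and not "how many sources belong only to this theme".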

Table 6. Five-axis evidence map of the cited literature across major cybersecurity application domains. This is a qualitative evidence map, not a systematic inclusion-count table.

Application Area	Prominent Task Evidence	Main Modalities	Common Model Families	Recurring Evaluation Weakness	Deployment-Critical Trust Issue
Intrusion and anomaly detection	Broad and prominent coded evidence cluster, especially NIDS and traffic analytics	Flow features, packets, logs, event streams	CNN, LSTM, GRU, autoencoder, transformer, GNN, hybrid models	Benchmark reuse, random splits, dated datasets, limited temporal/external validation	Drift, evasion, false-positive burden, resource cost
Malware and vulnerability analysis	Moderate evidence cluster with static, dynamic, and code-oriented studies	Bytes, opcodes, API calls, control-flow graphs, code tokens	CNN, RNN, transformer, GNN, raw-byte and hybrid models	Family imbalance, obfuscation sensitivity, inconsistent static–dynamic settings	Evasion, interpretability, reproducibility of feature pipelines
Phishing and social engineering	More focused but practically important evidence cluster	URLs, emails, text, metadata, screenshots, social context	BERT/RoBERTa, transformer, CNN, multimodal models	Fast distribution shift, adversarial mimicry, multilingual and platform variation	Grounded real-time detection and low-latency deployment
Biometrics and identity security	Specialized coded evidence cluster linking DL, fusion, and authentication	Face, fingerprint, iris, voice, gait, keystroke, behavioral biometrics	CNN, vision transformer, Siamese networks, fusion models, presentation attack detection (PAD) models	Sensor variation, spoofing conditions, limited cross-device validation	Presentation attacks, template privacy, calibration
CTI, LLMs, and analyst support	Rapidly emerging coded evidence cluster centered on language, retrieval, and reasoning	Threat reports, indicators, alerts, logs, knowledge graphs, multimodal evidence	Transformer, LLM, relation extraction, GNN, multimodal models	Weak grounding, hallucination risk, limited SOC evaluation, unclear tool-use safety	Prompt injection, evidence traceability, human–AI teaming

Table 7. Bibliographic publication year profile of the 115-source cited bibliography. Counts describe cited references used in the synthesis, not database search returns, screening exclusions, or PRISMA final-inclusion statistics.

Publication Period	Number of Cited Sources	Role in the Synthesis	Interpretation
Before 2015	7	Foundational ML, IDS, drift, and security evaluation work	Used mainly to define long-standing evaluation problems and foundational concepts.
2015–2019	26	Early deep learning cybersecurity, adversarial ML, calibration, XAI, and baseline neural architectures	Provides historical grounding but is not treated as sufficient evidence for current deployment claims.
2020–2022	14	Federated learning, multimodal learning, scoping guidance, datasets, and cloud–edge security	Bridges earlier DL methods with newer trustworthiness and deployment concerns.
2023–2024	38	Recent adversarial, privacy, XAI, dataset, LLM, multimodal, and reproducibility literature	Represents a large recent coded cluster and supports many trustworthiness and evaluation critiques.
2025–2026	30	Current surveys and emerging work on LLMs, IDS, phishing, datasets, edge deployment, and CTI	Indicates that the review is weighted toward recent developments while preserving foundational context.

Table 8. Non-exclusive task-by-model coding matrix for the cited-source bibliography. Values are overlapping coded emphases, not systematic inclusion statistics; a single source may contribute to multiple cells.

Security Task Cluster	CNN/Conv.	RNN/LSTM/GRU	AE/Anomaly	GNN	Transformer/LLM	FL/Multimodal
Intrusion and anomaly detection	12	14	13	7	9	8
Malware and vulnerability analysis	6	4	2	5	6	2
Phishing, spam, and social engineering	3	2	1	2	6	4
Biometrics and identity security	7	3	2	1	3	4
CTI, SOC support, and LLM reasoning	1	1	1	4	12	6
Cross-cutting trustworthy-AI/evaluation papers	4	3	3	3	5	5

Table 9. Non-exclusive trustworthiness and deployment coding matrix for the cited-source bibliography. Values are overlapping coded emphases, not systematic inclusion statistics; a single source may contribute to multiple cells.

Application Cluster	Robustness	Privacy	XAI	Drift/Calibration	Reproducibility	Deployment Cost
Intrusion and anomaly detection	11	8	5	10	8	9
Malware and vulnerability analysis	8	2	4	3	5	4
Phishing, spam, and social engineering	5	2	3	4	3	3
Biometrics and identity security	6	7	5	3	3	4
CTI, SOC support, and LLM reasoning	7	4	5	3	4	5
Cross-cutting methodology and policy	14	10	9	9	11	8

Table 10. Unified taxonomy of deep learning for cybersecurity adopted in this survey.

Taxonomy Dimension	Definition	Typical Categories/Examples	Why It Matters
Security task	The cybersecurity problem being addressed	Intrusion detection, anomaly detection, malware classification, phishing detection, biometric authentication, cyber threat intelligence, vulnerability analysis	Determines labels, decision goals, failure costs, and evaluation requirements
Data modality	The form and structure of the input data	Tabular flow features, packet sequences, event logs, API-call traces, binaries/raw bytes, graphs, natural language cyber threat intelligence (CTI), multimodal evidence	Determines what information is available and which architectures are appropriate
Model family	The deep learning architecture or training paradigm	CNNs, RNNs, LSTMs, GRUs, autoencoders, GNNs, transformers, LLMs, federated deep learning	Determines inductive bias, representation power, computational cost, and interpretability profile
Trustworthiness dimension	The properties required for reliable operational use	Adversarial robustness, poisoning resistance, privacy preservation, explainability, uncertainty calibration, drift adaptation, lifecycle security	Determines whether strong benchmark results are likely to translate into safe operational performance
Deployment environment	The operational context in which the model is used	Centralized cloud, enterprise network, edge, Internet of Things (IoT), industrial control systems, cloud–edge–IoT hybrid	Determines latency, memory, bandwidth, privacy exposure, and deployment feasibility

Table 11. Summary of major application domains of deep learning in cybersecurity.

Application Domain	Prominent Data Modalities	Frequently Used Model Families	Main Strengths of DL Use	Recurring Limitations	Representative References
Intrusion detection and anomaly detection	Flow features, packet traces, logs, temporal event streams	CNNs, LSTMs, GRUs, autoencoders, transformers, GNNs, hybrids	Strong pattern learning for high-volume telemetry; useful for anomaly and traffic classification	Heavy dependence on benchmark quality; class imbalance; drift; weak external validation	[39,65,66,71]
Malware detection and classification	Raw bytes, opcode sequences, API-call traces, graphs, images, hybrid behavioral traces	CNNs, RNNs, autoencoders, transformers, GNNs, raw-byte models	Reduced need for manual feature engineering; useful for static, dynamic, and hybrid analysis	Dataset bias; family imbalance; obfuscation sensitivity; limited interpretability in many pipelines	[40,86,87]
Phishing, spam, and social engineering detection	URLs, webpage content, emails, text, metadata, visual cues	CNNs, RNNs, transformers, BERT/RoBERTa, multimodal models	Strong semantic and structural feature learning; suitable for email and web-based detection	Distribution shift, adversarial mimicry, multilingual variation, real-time deployment constraints	[41]
Biometric authentication and identity security	Face, fingerprint, iris, hand-vein, voice, behavioral biometrics	CNNs, multimodal fusion networks, PAD-focused deep models	High-capacity representation learning and multimodal authentication support	Spoofing risk, template privacy, sensor variation, deployment trade-offs	[42]
Cyber threat intelligence and multimodal security analytics	Reports, logs, indicators of compromise, knowledge graph evidence, encrypted traffic, text-plus-telemetry	LLMs, transformers, relation extraction models, GNNs, multimodal deep models	Useful for semantic extraction, summarization, contextualization, analyst support, and evidence fusion	Weak benchmark standardization, grounding issues, reasoning reliability, high complexity	[13,49,50,81,82]

Table 12. Trustworthiness dimensions of deep learning for cybersecurity and their associated research challenges.

Trustworthiness Dimension	Main Risk in Cybersecurity	Common Research Responses	Main Unresolved Gap	Representative References
Adversarial robustness	Evasion, adversarial examples, model manipulation during inference	Adversarial training, robust optimization, attack-aware testing, threat model formalization	Many defenses are evaluated under narrow or weak threat models and may not reflect operational cyber attacks	[34]
Poisoning and backdoors	Corrupted training data, malicious updates, trigger-based hidden behaviors	Data sanitization, robust aggregation, risk-based defenses, anomaly screening of updates	Training pipeline integrity remains under-addressed, especially in decentralized and continuously updated settings	[34,54,55]
Privacy preservation	Membership inference, gradient leakage, sensitive telemetry exposure	Differential privacy, secure aggregation, privacy-aware training, encrypted collaboration	Strong privacy often comes with utility costs, and federated learning alone does not guarantee privacy	[58]
Explainability	Analyst distrust, opaque decisions, weak justification for alerts	Feature attribution, local explanations, global explanations, analyst-facing interpretability tools	Many explanation methods are not evaluated for real analyst usefulness or decision support quality	[10,35,36,99]
Uncertainty and calibration	Overconfident false alarms, unreliable ranking of alerts, unsafe automation	Bayesian deep learning, calibration methods, uncertainty-aware ranking, abstention	Confidence quality is still underreported in many cyber studies	[61,62]
Lifecycle and deployment security	Unsafe integration, untrusted model lineage, insecure updates, workflow manipulation	Governance frameworks, secure ML pipelines, model provenance controls, secure development practices	Need for end-to-end secure MLOps and AI governance in cyber defense environments	[34,59,60]
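
One way to report the confidence quality mentioned in the uncertainty and calibration row is expected calibration error (ECE). ECE is not named in Table 12; it is used here only as an assumed example metric, and the equal-width binning and the function name are illustrative choices. The sketch compares average confidence with observed accuracy inside each confidence bin.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between confidence and accuracy over equal-width bins.

    confidences: predicted probabilities of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the label.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical alerts: the detector reports 0.9 confidence but is right 3 of 4 times.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

A value near zero indicates that reported confidence tracks observed accuracy; larger values indicate over- or under-confidence that could distort alert ranking.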

Table 13. Representative critical examples connecting specific reviewed studies to evaluation and deployment lessons.

Example	Useful Contribution	Main Limitation Exposed by the Review	Five-Axis Lesson
DeepLog [84]	Shows the value of sequential deep learning for log anomaly detection and diagnosis.	Strong benchmark results do not by themselves prove resilience to modern deployment drift, changing software stacks, or external log environments.	Model family must be interpreted with task, modality, and deployment context.
INSOMNIA [30]	Makes temporal drift explicit in network intrusion detection and extends evaluation beyond static splits.	Highlights how many IDS studies overestimate performance when random splits ignore time and update latency.	Drift-aware validation is a trustworthiness requirement, not a post hoc analysis.
Pekar and Jozsa [101]	Compares anomaly detection across related datasets with different integrity and feature-generation conditions.	Shows that even closely related dataset variants can produce different conclusions because preprocessing and labeling are not neutral.	Dataset provenance is part of the evidence, not background metadata.
MAlign [87]	Provides an explainable static raw-byte malware-family classification approach.	Explainability remains architecture-specific and must still be judged by whether it helps analysts understand family-level evidence under obfuscation.	XAI must be evaluated for analyst utility, not only algorithmic visibility.
PhishingRTDS [75]	Demonstrates a real-time deep learning phishing detection system.	Real-time claims require careful reporting of latency, distribution shift, adversarial mimicry, and deployment environment.	Operational feasibility must be tested with task-specific constraints.
PACKETCLIP [102]	Links network traffic and language representations for cybersecurity reasoning.	Multimodal reasoning adds promise but also creates new failure points in grounding, missing modalities, and evidence traceability.	Multimodality should be evaluated as a system property.
Moskalenko et al. [63]	Provides a resilience-oriented method for resource-constrained AI systems under fault injections.	Cybersecurity deployment discussions often mention edge and IoT constraints without evaluating resilience under computational and disturbance limits.	Resource-aware resilience should be integrated into deployment-readiness evaluation.

Table 14. Common dataset families, recurring concerns, and recommended split protocols for deep learning-based cybersecurity studies.

Task	Common Dataset Families	Known Evaluation Risks	Recommended Protocol
IDS, NIDS, and HIDS	KDD99/NSL-KDD, UNSW-NB15, CICIDS2017/2018, Bot-IoT, ToN-IoT, Edge-IIoTset, log datasets such as HDFS/BGL.	Dataset age, artificial traffic, duplicate flows, temporal leakage, inconsistent feature extraction, label noise, and attack-class imbalance.	Prefer time-aware splits, future-period holdouts, cross-dataset testing, refined labels where available, and clear reporting of flow-generation tools and preprocessing.
Malware and vulnerability analysis	Drebin, EMBER, Malimg, AndroZoo, API-call traces, opcode corpora, control-flow/code-graph datasets, vulnerability and Common Vulnerabilities and Exposures (CVE)-linked code corpora.	Family imbalance, packer/obfuscation bias, vendor-label disagreement, static–dynamic mismatch, temporal malware evolution, and leakage from near-duplicate samples.	Use family-aware and time-aware splits, remove near duplicates, report packing/obfuscation assumptions, test on newer families, and include cross-family or zero-day-style evaluation.
Phishing, spam, and social engineering	PhishTank/OpenPhish-derived URLs, email corpora, webpage screenshots/Hypertext Markup Language (HTML)/Document Object Model (DOM) datasets, spam/social engineering collections.	Fast distribution shift, brand-template memorization, multilingual gaps, URL reuse, adversarial mimicry, and short-lived campaign artifacts.	Use temporal splits by campaign or collection date, evaluate on unseen brands/domains/languages, report collection windows, and include low-latency decision constraints.
Biometrics and identity security	Face, fingerprint, iris, voice, gait, keystroke, LivDet-style PAD datasets, replay/spoofing datasets, cross-sensor collections.	Sensor bias, subject overlap, presentation attacks, cross-device degradation, demographic imbalance, and template privacy leakage.	Use subject-disjoint and device-disjoint splits, PAD-specific evaluation, cross-sensor validation, calibration metrics, and privacy-preserving template handling.
CTI, LLM, and SOC assistance	MITRE ATT&CK-linked corpora, National Vulnerability Database (NVD)/CVE/Common Weakness Enumeration (CWE) records, threat reports, indicators of compromise (IOC) feeds, alert streams, logs, knowledge graphs, security Q&A/code datasets.	Weak grounding, outdated threat intelligence, benchmark contamination, hallucination, prompt injection, tool-misuse risk, and lack of analyst-centered validation.	Use retrieval-grounded tasks with citation traces, source-date controls, red-team prompt tests, held-out campaigns, human analyst evaluation, and tool permission auditing.
Federated, edge, and IoT cyber AI	IoT traffic datasets, cross-silo IDS datasets, device telemetry, encrypted traffic, simulated client partitions.	Non-IID clients, unrealistic client splits, communication overhead, poisoning, privacy leakage from updates, and missing device-cost reporting.	Report client partition design, heterogeneity statistics, poisoning/privacy tests, communication rounds, model size, latency, memory, and energy or power measurements.
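
Two of the split protocols recommended in Table 14 can be sketched in a few lines: a time-aware split with a future-period holdout, and a group-disjoint split that keeps whole families, subjects, or devices out of training. This is a minimal illustration under assumed record fields (`timestamp`, `family`, `label`) and hypothetical toy data; it is not the protocol of any specific reviewed study.

```python
from datetime import datetime


def time_aware_split(records, cutoff, time_key="timestamp"):
    """Train on records strictly before `cutoff`; hold out the later period."""
    train = [r for r in records if r[time_key] < cutoff]
    holdout = [r for r in records if r[time_key] >= cutoff]
    return train, holdout


def group_disjoint_split(records, holdout_groups, group_key="family"):
    """Keep entire groups (families, subjects, devices) out of training."""
    held_out = set(holdout_groups)
    train = [r for r in records if r[group_key] not in held_out]
    holdout = [r for r in records if r[group_key] in held_out]
    return train, holdout


# Hypothetical toy records; a real study would load flows, samples, or URLs.
records = [
    {"timestamp": datetime(2023, 1, 5), "family": "A", "label": 0},
    {"timestamp": datetime(2023, 6, 9), "family": "B", "label": 1},
    {"timestamp": datetime(2024, 2, 1), "family": "A", "label": 1},
    {"timestamp": datetime(2024, 7, 3), "family": "C", "label": 1},
]

train, future = time_aware_split(records, cutoff=datetime(2024, 1, 1))
seen, unseen = group_disjoint_split(records, holdout_groups={"C"})
```

Unlike a random split, the first function never lets a training record postdate a test record, and the second never lets a held-out family appear in training, which is the leakage pattern the table warns about.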

Table 15. Operational scoring checklist for the seven recommended evaluation principles.

Evaluation Principle	Score 0	Score 1	Score 2	Evidence Expected
Dataset relevance and freshness	Dataset named only	Dataset justified briefly	Dataset matched to threat model, age, and deployment context	Dataset source, collection period, threat class, and deployment assumption.
Pipeline transparency	Preprocessing unclear	Main steps described	Reproducible pipeline or scripts provided	Filtering, labeling, encoding, balancing, splitting, and leakage checks.
Temporal validity	Random split only	Time split or drift discussion	Future-period holdout, update-latency test, or drift-aware evaluation	Time-aware partitions, drift metrics, or retraining protocol.
Operational metrics	Accuracy-centered metrics only	Some precision/recall or latency reporting	Predictive and operational costs jointly reported	False positives, alert volume, latency, throughput, memory, energy, and analyst workload.
External robustness	Single benchmark only	Related variant or ablation	Cross-dataset, cross-family, or cross-organization validation	External test set, variant benchmark, or domain-transfer analysis.
Reproducibility	Insufficient artifact detail	Partial code/data or settings	Code, splits, seeds, dependencies, and rerun instructions	Repository, environment, hyperparameters, split files, and access notes.
Synthetic data validity	Synthetic data used without validation	Fidelity or utility checked	Fidelity, utility, privacy, and misuse risks evaluated	Real-data comparison, downstream utility, privacy assessment, and failure analysis.
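
The checklist in Table 15 can be applied mechanically once each principle has been scored. The sketch below assumes a study is represented as a mapping from principle name to a score of 0, 1, or 2; the identifier names, the summed total, and the list of zero-scored principles are illustrative assumptions, since the table defines the scoring levels but does not prescribe an aggregation rule.

```python
PRINCIPLES = (
    "dataset_relevance_and_freshness",
    "pipeline_transparency",
    "temporal_validity",
    "operational_metrics",
    "external_robustness",
    "reproducibility",
    "synthetic_data_validity",
)


def score_study(scores):
    """Validate per-principle scores and summarize them.

    The summed total is an assumed convenience, not part of the checklist.
    """
    missing = [p for p in PRINCIPLES if p not in scores]
    if missing:
        raise ValueError(f"missing principles: {missing}")
    for principle in PRINCIPLES:
        if scores[principle] not in (0, 1, 2):
            raise ValueError(f"{principle} must be scored 0, 1, or 2")

    return {
        "total": sum(scores[p] for p in PRINCIPLES),
        "maximum": 2 * len(PRINCIPLES),
        "scored_zero": [p for p in PRINCIPLES if scores[p] == 0],
    }


# Hypothetical study: random split only, but full artifacts released.
example = {p: 1 for p in PRINCIPLES}
example["temporal_validity"] = 0
example["reproducibility"] = 2
summary = score_study(example)
```

Listing the zero-scored principles alongside the total keeps a high aggregate from hiding a single critical weakness, such as a random-split-only evaluation.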

Table 16. Major dataset and evaluation issues in deep learning-based cybersecurity studies and recommended reporting practices.

Methodological Issue	Why It Matters	Typical Manifestation in the Literature	Recommended Reporting or Practice	Representative References
Limited dataset realism	Inflates apparent performance and weakens transfer to deployment	Use of narrow, outdated, or laboratory-centered benchmarks	Justify dataset choice by threat relevance, freshness, and domain realism	[83,100,104]
Weak dataset provenance documentation	Makes comparisons hard to interpret	Incomplete reporting of traffic capture, labeling process, feature extraction, or postprocessing	Document capture tools, preprocessing pipeline, feature-construction workflow, and label-generation logic	[101,103]
Preprocessing variability	Changes results even when the same dataset and model family are used	Different scaling, encoding, filtering, balancing, and split procedures across studies	Publish preprocessing scripts and explain every major preprocessing decision	[107]
Lack of temporal validity	Fails to reflect concept drift and attacker adaptation	Random train–test splits on evolving data	Prefer time-aware splits, future-period holdouts, and drift-aware evaluation	[30,31]
Weak external generalization	Overstates robustness by relying on one benchmark only	Single-dataset evaluation without cross-dataset testing	Use external validation or cross-dataset evaluation when possible	[83,100,101]
Narrow metric reporting	Hides operational weaknesses	Reporting accuracy only, or limited use of false-positive-sensitive metrics	Report task-appropriate metrics, including false alarms, imbalance-aware scores, and operational costs	[100,104]
Poor reproducibility	Prevents independent verification and slows cumulative progress	Missing code, hidden preprocessing choices, incomplete environment details	Release code, data access details, splits, hyperparameters, and rerun instructions	[108]
Unvalidated synthetic data use	May create misleading conclusions if synthetic data lack utility or realism	Use of generated traffic without fidelity or downstream utility analysis	Benchmark synthetic data for fidelity, utility, and risk before drawing security conclusions	[105]
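
The recommendation to prefer time-aware splits and future-period holdouts can be sketched as follows. This is a minimal illustration using only the Python standard library. The `Flow` record, the function names, and the ten-day toy data are assumptions made for this example and do not come from any benchmark discussed in the review.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Tuple


class Flow(NamedTuple):
    """Hypothetical labeled network-flow record."""
    ts: datetime
    features: Tuple[float, ...]
    label: int


def future_holdout_split(
    flows: List[Flow], cutoff: datetime
) -> Tuple[List[Flow], List[Flow]]:
    """Train on records strictly before the cutoff, test on the rest."""
    train = [f for f in flows if f.ts < cutoff]
    test = [f for f in flows if f.ts >= cutoff]
    return train, test


def has_temporal_leakage(train: List[Flow], test: List[Flow]) -> bool:
    """True if any training record is as late as, or later than, a test record."""
    if not train or not test:
        return False
    return max(f.ts for f in train) >= min(f.ts for f in test)


# Toy data: one flow per day for ten days.
start = datetime(2024, 1, 1)
flows = [Flow(start + timedelta(days=d), (float(d),), d % 2) for d in range(10)]

# Future-period holdout: the last three days are never seen in training.
train, test = future_holdout_split(flows, cutoff=start + timedelta(days=7))
print(len(train), len(test), has_temporal_leakage(train, test))

# An interleaved split (standing in for a random split) mixes past and future.
print(has_temporal_leakage(flows[::2], flows[1::2]))
```

The leakage check is deliberately strict: it flags any split in which the model could have trained on events that occur at or after the earliest test event, which is the situation a random train–test split on evolving data typically creates.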

Table 17. Practical deployment-reporting targets for cybersecurity deep learning studies.

Deployment Setting	Typical Use Case	Minimum Reporting Fields	Indicative Target Budget to Justify
Cloud or SOC analytics	Batch or near-real-time alert enrichment, CTI correlation, malware triage, LLM-assisted investigation.	p50/p95 latency, throughput, graphics processing unit (GPU)/central processing unit (CPU) type, memory, token/context cost for LLMs, batch size, alert volume, and human-review time.	Seconds-level or lower for interactive triage; minutes acceptable only for offline enrichment; memory and cost must scale with alert volume.
Enterprise endpoint or gateway	Endpoint telemetry, local malware scoring, flow classification, phishing filtering, log anomaly screening.	Model size, random-access memory (RAM), CPU utilization, p95 inference latency, update frequency, false-positive burden, and rollback mechanism.	p95 latency normally below 100–500 ms for inline decisions; model footprint generally small enough for routine endpoint deployment; update cost explicitly reported.
Edge, fog, and industrial systems	IoT gateway IDS, encrypted traffic analytics, ICS monitoring, local anomaly detection.	Device class, CPU/accelerator, RAM, storage, p95 latency, throughput, bandwidth, energy or power proxy, and resilience under load.	Model size commonly below 10–200 megabytes (MB) depending on gateway class; sub-second inference for monitoring; graceful degradation under overload.
Constrained IoT or embedded node	On-device anomaly screening, sensor integrity checks, lightweight authentication, local feature extraction.	Parameter count, quantization/pruning, peak RAM, flash/storage footprint, energy per inference or power draw, and failure behavior.	Kilobyte-to-few-megabyte footprint where possible; millisecond-to-low-second inference depending on duty cycle; explicit energy and recovery analysis.
Federated or collaborative deployment	Cross-silo IDS, privacy-preserving institution/device collaboration, distributed threat learning.	Number of clients, non-IID partition, rounds, communication volume, aggregation method, privacy mechanism, poisoning defense, and client dropout handling.	Communication overhead and client runtime must be reported per round; privacy and robustness tested under realistic malicious client and straggler assumptions.
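
As an illustration of the p50/p95 latency and throughput fields listed above, the following Python sketch times a prediction callable and summarizes the results with a nearest-rank percentile. The functions `percentile` and `measure` and the stand-in `toy_predict` model are assumptions made for this example. A real study would also fix the hardware, batch size, and warm-up procedure, and report them.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def percentile(samples: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def measure(predict: Callable, inputs: List) -> Dict[str, float]:
    """Time each single-item call and summarize latency and throughput."""
    latencies_ms = []
    start = time.perf_counter()
    for x in inputs:
        t0 = time.perf_counter()
        predict(x)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    elapsed = time.perf_counter() - start
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "throughput_per_s": len(inputs) / elapsed if elapsed > 0 else float("inf"),
    }


def toy_predict(x: int) -> int:
    """Stand-in for a model's inference call (hypothetical)."""
    return sum(i * i for i in range(1000 + x))


stats = measure(toy_predict, list(range(200)))
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Reporting p95 alongside p50 matters because inline decisions are limited by the slow tail rather than by the typical request. Mean latency alone would hide that tail.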

Table 18. LLM-enabled cybersecurity workflow risks and recommended evaluation controls.

LLM Cyber Workflow	Main Risk	Required Evidence	Recommended Control
Retrieval-augmented CTI synthesis	Unsupported claims, stale intelligence, missing provenance, and source contamination.	Source timestamps, retrieval logs, evidence citations, and held-out threat-campaign tests.	Retrieval provenance checks, freshness filters, citation-required outputs, and uncertainty labels.
Alert summarization and triage	Hallucinated severity, missed context, and false prioritization under alert overload.	Analyst-rated usefulness, time-to-triage, escalation accuracy, false-negative analysis, and abstention behavior.	Human-in-the-loop review, calibrated confidence, safe fallback, and workload-aware evaluation.
Tool-using SOC agents	Prompt injection, unsafe command execution, credential exposure, and unauthorized actions.	Red-team prompts, tool-call logs, permission boundaries, blocked-action counts, and recovery tests.	Least-privilege tools, allowlisted actions, sandbox execution, approval gates, and audit trails.
Secure code and vulnerability assistance	Incorrect patches, insecure recommendations, hallucinated CVEs, and benchmark memorization.	Vulnerability-grounded test suites, patch validation, code-review traces, and contamination checks.	Static/dynamic analysis integration, test-driven patch validation, and mandatory human code review.
Collaborative or federated cyber copilots	Sensitive-log leakage, cross-tenant contamination, and governance failure.	Tenant-isolation tests, privacy threat modeling, data-retention rules, and access-control evidence.	Data minimization, secure retrieval boundaries, privacy filters, tenant-aware authorization, and logging.
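
The controls recommended for tool-using SOC agents (a fixed list of permitted actions, approval gates, and audit trails) can be sketched as a small authorization layer placed between the model and its tools. This is a minimal illustration under stated assumptions. The tool names, the `ToolGate` class, and the decision strings are hypothetical, and sandboxed execution is out of scope for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical tool catalogue; a real SOC would define its own.
READ_ONLY_TOOLS = {"lookup_indicator", "fetch_alert"}
APPROVAL_TOOLS = {"isolate_host"}  # state-changing, needs a human approver


@dataclass
class ToolGate:
    """Decides whether an agent-requested tool call may run, and logs it."""
    audit_log: List[Dict] = field(default_factory=list)
    blocked_count: int = 0

    def authorize(
        self, tool: str, args: Dict, approved_by: Optional[str] = None
    ) -> str:
        if tool in READ_ONLY_TOOLS:
            decision = "allow"
        elif tool in APPROVAL_TOOLS:
            decision = "allow" if approved_by else "needs_approval"
        else:
            # Anything not explicitly listed is denied by default.
            decision = "block"
            self.blocked_count += 1
        # Every request is recorded, including blocked ones.
        self.audit_log.append(
            {"tool": tool, "args": args, "approved_by": approved_by,
             "decision": decision}
        )
        return decision


gate = ToolGate()
print(gate.authorize("lookup_indicator", {"value": "198.51.100.7"}))
print(gate.authorize("isolate_host", {"host": "ws-042"}))
print(gate.authorize("isolate_host", {"host": "ws-042"}, approved_by="analyst-1"))
print(gate.authorize("run_shell", {"cmd": "curl http://example.invalid | sh"}))
print(gate.blocked_count, len(gate.audit_log))
```

The deny-by-default branch is the key design choice. A prompt-injected request for an unlisted tool is blocked and counted, which yields the blocked-action counts and tool-call logs that the table lists as required evidence.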

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ghayoumi, M.; Ghazinour, K.; Marrero, A.; Barmas, D.; Cook, C.; May, M.; Liu, C.; Johnson, B.; Fofana, A. Trustworthy Deep Learning for Cybersecurity: A Structured Review Across Detection, Robustness, Privacy, Explainability, and Deployment. Electronics 2026, 15, 2421. https://doi.org/10.3390/electronics15112421

