Open AccessArticle

From Detection to Decision: Transforming Cybersecurity with Deep Learning and Visual Analytics

Saurabh Chavan

^1,*

and

George Pappas

^2,*

Department of Electrical and Computer Engineering, College of Engineering, Lawrence Technological University, Southfield, MI 48075, USA

MSAI Grad Program, Lawrence Technological University, Southfield, MI 48075, USA

Authors to whom correspondence should be addressed.

AI 2025, 6(9), 214; https://doi.org/10.3390/ai6090214

Submission received: 18 July 2025 / Revised: 26 August 2025 / Accepted: 27 August 2025 / Published: 4 September 2025

Download

Browse Figures

Versions Notes

Abstract

Objectives: The persistent evolution of software vulnerabilities—spanning novel zero-day exploits to logic-level flaws—continues to challenge conventional cybersecurity mechanisms. Static rule-based scanners and opaque deep learning models often lack the precision and contextual understanding required for both accurate detection and analyst interpretability. This paper presents a hybrid framework for real-time vulnerability detection that improves both robustness and explainability. Methods: The framework integrates semantic encoding via Bidirectional Encoder Representations from Transformers (BERTs), structural analysis using Deep Graph Convolutional Neural Networks (DGCNNs), and lightweight prioritization through Kernel Extreme Learning Machines (KELMs). The architecture incorporates Minimum Intermediate Representation (MIR) learning to reduce false positives and fuses multi-modal data (source code, execution traces, textual metadata) for robust, scalable performance. Explainable Artificial Intelligence (XAI) visualizations—combining SHAP-based attributions and CVSS-aligned pair plots—serve as an analyst-facing interpretability layer. The framework is evaluated on benchmark datasets, including VulnDetect and the NIST Software Reference Library (NSRL, version 2024.12.1, used strictly as a benign baseline for false positive estimation). Results: Our evaluation reports that precision, recall, AUPRC, MCC, and calibration (ECE/Brier score) demonstrated improved robustness and reduced false positives compared to baselines. An internal interpretability validation was conducted to align SHAP/GNNExplainer outputs with known vulnerability features; formal usability testing with practitioners is left as future work. Conclusions: The framework, Designed with DevSecOps integration in mind, the system is packaged in containerized modules (Docker/Kubernetes) and outputs SIEM-compatible alerts, enabling potential compatibility with Splunk, GitLab CI/CD, and similar tools. While full enterprise deployment was not performed, these deployment-oriented design choices support scalability and practical adoption.

Keywords:

BERT; DGCNN; MIR; KELM; Explainable AI (XAI); vulnerability detection; DevSecOps

1. Introduction

The ever-increasing scale, complexity, and heterogeneity of software systems has exacerbated the challenge of timely and accurate vulnerability detection. Traditional static analysis and rule-based methods—such as those implemented in SonarQube or Checkmarx—lack the contextual sensitivity to capture logic-level errors or subtle code misuse. Simultaneously, black-box deep learning approaches, although achieving high predictive performance, fail to offer the transparency necessary for integration into security-critical workflows [1,2].

Recent research in machine learning (ML) and AI presents promising directions for addressing these challenges. BERT-based models, initially developed for natural language understanding, have been successfully adapted to program analysis [3,4]. These models capture contextual semantics of source code tokens and are particularly useful for detecting logical flaws. Complementary to this, DGCNN analyzes Code Property Graphs (CPG) to identify structural vulnerabilities such as buffer overflows and control flow leaks, while deep learning solutions provide insights into detecting and mitigating cyber threats [4,5,6,7].

Yet, semantic and structural representations are often treated independently. GraphCodeBERT [8] attempts to bridge this gap, combining syntactic structure and code token embeddings, while CodeTrans [9] leverages transfer learning for multilingual codebases. However, these models remain opaque to analysts and lack real-time operational feedback mechanisms [1,10].

Explainable AI (XAI) techniques have gained traction as a solution to model opacity. SHAP explanations [11], attention heatmaps [12], and interactive dashboards [2] allow analysts to interpret model outputs, increasing confidence and decision accuracy. Still, few frameworks integrate explainability, semantic and structural analysis, and risk scoring within a unified pipeline.

This research proposes a novel hybrid vulnerability detection system that addresses the limitations above. It integrates BERT for semantic context, DGCNN for structural flow, and KELM [2] for lightweight classification. Minimum Intermediate Representation (MIR) learning [13] reduces noise and false positives during preprocessing. Moreover, visual analytics based on SHAP and CVSS pair plots bridge the model–human gap. Reinforcement learning-inspired multi-modal fusion [14] dynamically optimizes prediction reliability.

Our architecture is designed for deployment realism. It supports online learning [15], CI/CD integration [16], containerized environments [17], and feedback loops that allow analyst-driven threshold adjustments. Benchmarking is conducted on real-world datasets, including VulnDetect [18,19] and NSRL [17], with performance validated across detection, interpretability, and runtime efficiency metrics.

Unique Contributions. This framework advances the state of the art in software vulnerability detection by offering:

Multi-Modal Vulnerability Detection—Combines semantic embeddings, structural CPG analysis, and behavioral trace learning into a unified detection pipeline, outperforming single-model baselines approach [3,6,8,9,14,20,21].
Reinforcement-Learning-Based Fusion—Introduces an adaptive ensemble strategy that optimizes weights dynamically across vulnerability types, balancing precision and recall while minimizing false negatives [5,14,22].
Human-Centered Interpretability—Integrates SHAP explanations, GNNExplainer overlays, and CVSS contextualization into an interactive dashboard designed for SOC workflows, bridging the gap between raw model predictions and actionable analyst insights [2,4,11,12,23].
Deployment-Oriented Design—Provides a cloud-native, DevSecOps-ready framework compatible with Docker [24], Kubernetes [25], Splunk [26], and CI/CD tools, making it suitable for enterprise-scale deployment and continuous automated security monitoring [1,16,17,27].
Comprehensive Evaluation Metrics—Goes beyond F1-score by reporting AUPRC, MCC, calibration (ECE/Brier), and robustness under adversarial/noisy inputs, supporting reproducibility and statistical rigor [10,12,18,23,28].

2. Related Work

Recent progress in AI-driven cybersecurity has led to diverse methodologies that aim to improve software vulnerability detection, interpretability, and operational scalability. We categorize related research into three key areas: (1) ML-based vulnerability detection, (2) explainability and visual analytics, and (3) hybrid frameworks for real-time security insights.

2.1. Machine Learning and Deep Learning Vulnerability Detection

Advanced machine learning techniques have demonstrated superior performance in identifying code vulnerabilities, especially when trained on software-specific representations. MIR learning reduces code complexity by filtering irrelevant syntax, improving the precision of vulnerability detection and reducing false positives [13]. This abstraction facilitates the identification of critical issues like buffer overflows and memory mismanagement.

KELM offer a lightweight, non-iterative method for rapid classification. Tang et al. [22] introduced a KELM-based system with multilevel program symbolization, enabling faster generalization across codebases. Its low computational overhead makes it ideal for real-time vulnerability triaging in CI/CD environments.

Transformer-based models, particularly BERT and its variants like CodeBERT and CodeTrans, have been successfully replaced with software languages. Alqarni and Azim [3] demonstrated that fine-tuned BERT encoders capture contextual semantics and logic-level flaws in code [9] showed that CodeTrans excels at generalizing across multiple programming languages. Research [8] expanded this capability by integrating token sequences with structural graphs in GraphCodeBERT, enabling joint learning of syntax and semantics.

Structural learning models such as DGCNN have also shown efficacy [7]. These models operate on CPG, which encode syntax, control flow, and data flow [6,20]. Research [20] introduced a DGCNN + CPG method optimized for C/C++ programs, while research [21] Combined code language models with CPGs to enhance vulnerability detection, demonstrating the impact of CPG representation quality on detection performance.

Hybrid systems also include two-stage deep learning models like CNN–LSTM pipelines, which can classify both the presence and type of vulnerability [15]. Paper [14] proposed reinforcement learning-based multi-modal approaches combining tokens, graphs, and trace data, but lacked integration with XAI or visual interfaces. This motivates our own hybrid framework that fuses these perspectives while also supporting interpretability and real-time feedback.

2.2. Interpretability and Explainable AI (XAI)

Interpretability remains a major bottleneck in real-world ML deployments for cybersecurity. Traditional “black-box” models hinder trust and usability for analysts, especially in high-stakes threat detection contexts. Techniques such as SHAP (SHapley Additive exPlanations) are widely used to attribute model outputs to input features, improving transparency and post hoc analysis [2,18].

Research [2] emphasized that human-centered XAI, when coupled with visual analytics, significantly improves trust and reduces analyst response time. Research [11] supported this by introducing focus-context visualizations for prioritizing vulnerability assessments. Meanwhile, research [1] provided a comprehensive survey outlining the limitations of current black-box detection systems, urging the inclusion of explainability during model design. Paper [12] classified existing XAI methods in cyber defense decision-making contexts, noting the dominance of visual techniques like SHAP and attention heatmaps in Security Operations Centers (SOCs) for improving interpretability and confidence.

2.3. Hybrid Frameworks and Visual Analytics

Hybrid models that combine multiple learning streams are increasingly favored due to their enhanced accuracy and robustness. Paper [18] showed that integrating ML output into interactive dashboards enhances decision-making. Research [21] confirmed that variations in CPG quality directly affect the performance of visual analytics tools, underlining the importance of graph design in hybrid systems.

Our proposed work builds on frameworks such as that by research [20], who combined DGCNN and CPG for structural analysis [7]. In contrast to prior studies that treat semantic and structural analysis separately, our framework fuses them into a single operational pipeline and augments it with visual interfaces powered by SHAP. This approach enables better integration with DevSecOps pipelines and Security Information and Event Management (SIEM) systems [16].

2.4. Research Gaps

Despite progress, most existing solutions fail to combine semantic modeling, structural representation, and visual analytics within a unified system. Models like CodeTrans [9] and GraphCodeBERT [8] offer state-of-the-art performance but do not support real-time interpretability or continuous learning. Furthermore, while KELM offers efficiency, it lacks the contextual richness of transformer models or graph-based systems.

Few frameworks support analyst-centric interfaces, explainability feedback loops, or prioritization logic, which are critical for deployment in enterprise environments. Our proposed system addresses these gaps by combining MIR learning [13], KELM scoring [22], transformer-based semantics [3,8,9], graph representations [6,20,21], combining CPGs and code language embeddings, and explainable visual analytics [2,11,12], all within a scalable, feedback-enabled architecture.

3. Preliminary

Motivating Examples

In the rapidly evolving field of cybersecurity, the ability to detect and manage to ground our framework, this section introduces the foundational technologies leveraged in our approach—each selected based on its contributions to detection accuracy, efficiency, and interpretability.

1.: Minimum Intermediate Representation (MIR) Learning:

MIR abstracts source code into simplified, intermediate forms by eliminating non-informative syntax. This has been shown to significantly reduce false positives in static analysis tools, particularly for vulnerabilities such as buffer overflows, integer underflows, and resource leaks [13]. MIR serves as a preprocessing stage that improves downstream learning.

2.: Kernel Extreme Learning Machine (KELM):

KELM employs single-pass, non-iterative training to efficiently classify code samples. Its use of kernel-based mapping enables quick generalization, making it ideal for fast risk assessment of known or emerging vulnerabilities [22]. KELM is used in our framework for scoring and prioritizing risks in real time.

As illustrated in Figure 1, the heatmap of Common Vulnerabilities and Exposures (CVE) severities over several years shows increasing vulnerability severity trends, underscoring the need for dynamic and adaptive detection systems.

3.: Transformer-Based Semantics (BERT, CodeBERT, CodeTrans):

BERT has been successfully adapted to programming languages for its ability to capture long-range dependencies and context. CodeTrans [9] applies transfer learning on code tokens, while GraphCodeBERT [8] blends token embeddings with structural features. These models are integral to our semantic pipeline, detecting logical flaws such as insecure API use, race conditions, or unchecked input.

4.: Graph-Based Structural Analysis (CPG + DGCNN)

CPGs model code as graphs incorporating abstract syntax trees, control flow graphs, and data flow graphs. These representations are processed using DGCNN to learn topological structures of vulnerable code. Research confirms this approach is particularly effective for languages like C/C++ where memory management issues dominate [6,7,20,21].

Research [10] benchmarked these black-box models across multilingual codebases, revealing significant variability that underscores the importance of developing robust, hybridized detection frameworks.

5.: Two-Stage Deep Learning Model (CNN–LSTM):

Integrating a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM network, this model efficiently identifies not only the presence but also the types of vulnerabilities in software, enabling more granular security assessments [15]. Research [14] extended this line of work using reinforcement-based multi-modal fusion, though without visual analytics or explainability—gaps our approach explicitly addresses.

As shown in Figure 2, the vulnerability score plot visualizes relationships between different severity metrics, helping to identify patterns among detected vulnerabilities.

6.: Survey of Deep Learning Techniques:

Comprehensive reviews have classified existing deep learning approaches, emphasizing ongoing challenges such as the need for large, annotated datasets and improved model interpretability [1]. Benchmark datasets like VulnDetect [18,19] and NSRL [17] were highlighted as critical tools to enable real-world reproducibility, which we adopt in our proposed system.

7.: Visual Analytics for Cyber Vulnerability Assessment:

Visualization tools have proven essential in bridging the gap between complex machine learning outputs and actionable cybersecurity insights, improving the interpretability of risk assessments [11]. Research [12] emphasizes that visual explanation tools such as SHAP and attention heatmaps substantially improve threat analysts’ decision accuracy and confidence under time pressure.

8.: Explainable AI (XAI) and Visual Analytics

XAI tools like SHAP help decompose prediction outcomes, identifying which features influenced a model’s decision and why. Studies [2,12] show that pairing such tools with visual dashboards improves analyst trust, confidence, and decision latency. These insights guide our visualization layer design, which includes SHAP visualizations, risk heatmaps, and CVSS pair plots.

4. Steps

To address the limitations identified in prior work, we propose a unified, explainable, and adaptive framework for software vulnerability detection that fuses semantic and structural code analysis with visual analytics. The design emphasizes operational realism, reproducibility, and analyst interpretability.

4.1. Data Collection and Preprocessing

The success of any machine learning system in cybersecurity depends heavily on the diversity and quality of its training data. Our framework utilizes three complementary data sources:

Publicly Available Vulnerable Repositories—We collected open-source projects from GitHub that had associated CVE reports. These were cross-referenced with the National Vulnerability Database (NVD) to confirm vulnerability authenticity and classification.
Benchmark Datasets—VulnDetect [18], an MIT-licensed dataset containing labeled vulnerable code snippets across multiple programming languages, which provide labeled vulnerable code snippets from multiple programming languages. Standard preprocessing steps were followed based on guidelines [19].
Reference Corpora—The National Software Reference Library (NSRL) [17], hosted by NIST, was used as a “benign baseline” to help estimate false positive rates. NSRL samples were retrieved from NIST’s public distribution portal, hashed, and cross-checked against the vulnerability datasets to ensure no contamination of labels [26].

To ensure data consistency:

All source code was normalized to a canonical syntax style.
Identifying metadata such as project names and authors were anonymized.
Semantic processing involved tokenization for BERT-based embeddings.
Structural processing involved Abstract Syntax Tree (AST) parsing and CPG construction.
Dynamic execution traces, where available, were collected and prepared for sequential modeling with LSTM networks.
Textual artifacts (commit messages, bug reports) were encoded using RoBERTa embeddings to capture contextual information.

Datasets were split into training (70%), validation (15%), and test (15%) partitions, ensuring proportional representation of vulnerability types and severities across splits.

4.2. Multi-Model Feature Extraction

The system extracts four complementary categories of features to capture both high-level logic and low-level execution patterns:

Semantic Features: Derived from a fine-tuned BERT-based encoder built on CodeTrans [9], enabling the capture of long-range token dependencies and contextual semantics.
Structural Features: CPG representations are passed into a DGCNN model, which learns graph-based structural patterns that often correlate with vulnerabilities such as memory corruption and control flow misuse [4,21].
Textual Features: RoBERTa embeddings from commit logs and issue reports provide historical and human-readable context about code changes, many of which are indicative of vulnerability fixes.
Dynamic Behavior Features: Execution logs and traces are processed by LSTM networks, enabling detection of vulnerabilities that manifest only at runtime, such as race conditions and resource leaks.

Each feature stream was designed to be modular so that additional modalities (e.g., binary analysis outputs) can be incorporated without retraining the entire system.

4.3. Hybrid Modeling and Integration

A key innovation of this framework is its reinforcement learning-based dynamic fusion of predictions from the three core models—BERT, DGCNN, and KELM.

Each model outputs a vulnerability likelihood score:

Y_{B E R T} α_{i}

Y_{D G C N N}

Y_{K E L M}

These are combined into a single prediction:

E m b e d d i n g (C) = B E R T (C)

(1)

DGCNN layer update : h^{(l + 1)} = σ (A h^{(l)} W^{(l)})

(2)

KELM output weights : β = (\frac{I}{C} + H^{T} H)^{- 1} H T

(3)

Fusion function : \hat{y} = α_{1} Y_{B E R T} + α_{2} Y_{D G C N N} + α_{3} Y_{K E L M}

(4)

where

α_{1}

α_{2}

α_{3}

∈ [0, 1] are weights dynamically optimized by a reinforcement learning (RL) agent.

RL Design for Reproducibility:

Algorithm: Deep Q-Network (DQN) with experience replay.
State (s): Vector of current model scores $Y_{B E R T}$ , $Y_{D G C N N}$ , $Y_{K L E M}$ with rolling history of accuracy, precision, and false positive rate
Action (a): Increment or decrement one of the fusion weights ( $α_{i}$ ) by 0.05 while maintaining normalization (∑ $α_{i} = 1$ ).
Reward (R):

R = 1 [c o r r e c t] - 1 \cdot 1 [f a l s e p o s i t i v e] - 2 \cdot 1 [f a l s e n e g a t i v e]

(5)

This prioritizes recall, penalizing missed vulnerabilities more heavily.

Training Parameters:
- Episodes: 500
- Replay buffer size: 5000
- Batch size: 64
- Learning rate: 1 × 10⁻⁴¹
- Optimizer: Adam
- Discount factor: γ = 0.95\g
- amma = 0.95γ = 0.95
- Target network update: every 50 steps
- Exploration: ε-greedy, decaying from 0.2 → 0.01 over 500 episodes

Schedule: The RL agent updates fusion weights after every 3 training epochs using validation-set feedback for reward computation.

This design allows the fusion controller to dynamically adapt to changing vulnerability patterns, leveraging the semantic precision of BERT, structural reasoning of DGCNN, and low-latency scoring of KELM.

4.4. Risk Prioritization and Scoring

Once a vulnerability is detected, it must be prioritized for remediation. Our system:

Computes severity scores according to the CVSS v3.1 specification, considering exploitability, impact, and environmental metrics.
Integrates SHAP-based feature attributions into the scoring interface, enabling security analysts to understand which code regions or features contributed to a high severity score.
Flags vulnerabilities as “critical” when exploitability > 0.7 and impact > 0.6, ensuring immediate attention.

4.5. Visual Analytics and Interface

The proposed framework includes a visual analytics dashboard designed to translate complex model outputs into actionable insights for cybersecurity analysts. All visualizations are generated from the system’s evaluation datasets and correspond directly to the results discussed in this paper.

Heatmaps highlight high-risk code segments by mapping SHAP feature attribution scores to source lines, enabling targeted review by analysts.
Severity histograms and box plots (Figure 3) summarize CVSS v3 base scores and severity levels across the evaluated dataset. Axis labels, legends, and scale indicators are included to ensure interpretability without external context.
Pair plots (Figure 4) illustrate correlations between exploitability scores, impact ratings, and CVE presence, providing multi-dimensional insights into detected vulnerabilities.
Graph Model Explainability. For structural representations (DGCNN + CPG), we apply GNNExplainer to derive node- and edge-level attributions. For each vulnerable code sample, GNNExplainer produces a sparse mask highlighting subgraphs most influential in prediction. These masks are overlaid on the Code Property Graph and linked back to the corresponding source lines in the analyst dashboard. This ensures that graph-level explanations remain interpretable and actionable, complementing SHAP-based token and textual explanations.
SHAP explanation panels display feature contributions at token, node, and textual levels, allowing validation or challenge of model outputs.

The interface design follows human-centered XAI principles [2] and the taxonomy of cyber-XAI systems proposed in [12], with a focus on clarity, reproducibility, and integration readiness.

Note: While the architecture includes planned modules for automated remediation suggestions and SIEM integration [11,15], these components are not evaluated in the present study and are described in Section 8 as future work.

Figure 3 demonstrates the use of histograms, bar charts, and box plots to visualize CVSS v3 base scores and severity levels, offering multiple analytical perspectives.

XAI techniques complement visualization by delivering transparent explanations of model predictions, thereby enhancing user trust and interpretability [11].

The visualization is contextualized using real-time feedback metrics, aiding SIEM integration and enabling transparent decision-making support for security teams [2,12].

Autonomous Suggestions: Offer automatic suggestions for mitigating steps, such applying patches, altering configurations, or taking other corrective action, based on risk scores and vulnerability categories [15].
Integration with Security Functions: Ensure that the system can easily interface with cybersecurity workflows and current SIEM systems to enable automated warnings and reactions as part of an organization’s active defense plan [11].

4.6. Feedback and Constant Improvement

To maintain adaptability, the framework incorporates a feedback collection mechanism where analysts can label predictions as correct, false positive, or false negative via the dashboard interface. This feedback is stored in a structured log with associated context (e.g., CVE ID, code snippet, prediction score).

In the current prototype, feedback is periodically aggregated and used for:

Threshold adjustment—dynamically recalibrating decision thresholds to reduce false positives.
Incremental retraining—selectively fine-tuning BERT and KELM components on newly validated samples.

While the paper’s present evaluation focuses on static datasets, the architecture is designed to support online learning for emerging threats (e.g., zero-days) [15,27]. This includes drift detection triggers for retraining and replay-buffer mechanisms to prevent catastrophic forgetting. These online adaptation features are planned for implementation and testing in future work.

4.7. Hypothesis

Building on these motivating examples, the hypothesis for the new research paper is twofold:

Hypothesis 1.

Integrating advanced machine learning (ML) techniques with adaptive visual analytics will significantly improve the accuracy, efficiency, and interpretability of vulnerability detection systems, particularly when semantic, structural, and risk-ranking components are fused.

This is based on the demonstrated success of deep learning models in prior studies, the proven benefits of multi-modal feature fusion, and the interpretability gains from visual analytics [2,12].

Model Components and Equations:

BERT-based Semantic Analysis:

For a given source code snippet C:

Embedding(C) = BERT(C)

Classification layer processes the embedding:

\hat{y} = σ (W \cdot E m b e d d i n g (C) + b)

where W is the weight matrix, b is the bias vector and

σ

is the activation function (e.g., SoftMax).

DGCNN based Structural Analysis

For graph G with node features X and adjacency matrix A:

H^{(l + 1)} = σ (A H^{(l)} W^{(l)})

Were,

H^{(l)}

is the hidden state at layer l,

W^{(l)}

is the weight matrix for layer l.

KELM based Risk Classification

For input vector X:

Ø (X_{i}) = K e r n e l (X_{i})

The output weights β:

β = (H^{T} H + \frac{I}{C})^{- 1} H^{T} T

where H is the hidden layer output matrix, I is the identity matrix, C is the regularization parameter, and T is the target matrix.

Ensemble Integration

M_{c o m b i n e d} = α_{1} M_{B E R T} + α_{2} M_{D G C N N} + α_{3} M_{K E L M}

where weights

α_{i}

are dynamically optimized via reinforcement learning.

Performance Metrics

Accuracy:

A = \frac{N u m b e r o f C o r r e c t P r e d i c t i o n s}{T o t a l N u m b e r o f P r e d i c t i o n s}

Efficiency:

E = \frac{P r o c e s s i n g T i m e p e r S a m p l e}{T o t a l s a m p l e s}

Visual Interpretability:

V = \frac{N u m b e r o f C o r r e c t l y I n t e r p r e t e d V i s u a l s}{T o t a l N u m b e r o f S a m p l e s}

Explainability Metric (XAI):

Ø_{i} = \sum_{S \subseteq N \ {i}} \frac{|S|! (|N| - |S| - 1)!}{|N|!} [f (S \cup \{i\}) - f (S)]

where

S

is a subset of features,

N

is the set of all features and

f (s)

is the modal output with the feature subset

S

Overall Metric:

M = β_{1} A + β_{2} E + β_{3} V + β_{4} X A I

where

β

is weights representing the importance of each metric.

Reverse Validation Condition for H1:

${H 1 i s v a l i d a t e d i f : M}_{c o m b i n e d} > M_{B a s e l i n e} ⌃ p < 0.05$

Hypothesis 2.

“A hybrid model that incorporates both structural and semantic analysis of code, enhanced by real-time learning algorithms, will outperform traditional vulnerability detection systems in both speed and accuracy”.

Leveraging multi-model data like source code, execution traces, and textual description can further enhance the accuracy, efficiency, and interpretability of detection system.

Multi-model Data Representation

Source Code Representation

E_{c o d e} (C) = B E R T (C)

Execution Trace Representation

E_{t r a c e} (T) = L S T M (T)

Textual Description Representation

E_{t e x t} (D) = R o B E R T a (D)

Combining Multi-Modal Representations

E_{c o m b i n e d} = {f (E}_{c o d e}, E_{t r a c e}, E_{t e x t})

f

is functioned to combine these embeddings.

E_{c o m b i n e d} = {σ (W_{c} [E}_{c o d e}; E_{t r a c e}; E_{t e x t}] + b_{c}

where

W_{c}

is a weight matrix for combined embedding,

b_{c}

is the bias vector, and

σ

is an activation function.

Multi-modal Model Architecture

Feature Extraction

For each data type, feature extraction that can be performed separately:

F_{c o d e} = {F e a t u r e E x t r a c t o r}_{c o d e} ({E m b e d d i n g}_{c o d e})

F_{t r a c e} = {F e a t u r e E x t r a c t o r}_{t r a c e} ({E m b e d d i n g}_{t r a c e})

F_{t e x t} = {F e a t u r e E x t r a c t o r}_{t e x t} ({E m b e d d i n g}_{t e x t})

Feature Fusion

The features from different modalities are fused:

F_{c o m b i n e d} = σ (W_{f} [F_{c o d e}; F_{t r a c e}; F_{t e x t}] + b_{f}

where

W_{f}

is the weight matrix for feature fusion,

b_{f}

is the bias vector, and

σ

is an activation function.

Classification and Prediction

\hat{y} = s o f t m a x (W_{o} F_{c o m b i n e d} + b_{o})

where

W_{o}

is the output weight matrix and

W_{o}

is the output bias vector, Evaluation Metrics.

Explainability Metric (XAI):

Ø_{i} = \sum_{S \subseteq N \ {i}} \frac{|S|! (|N| - |S| - 1)!}{|N|!} [f (S \cup \{i\}) - f (S)]

where

S

is a subset of features,

N

is the set of all features,

f (s)

is the modal output with the feature subset

S

Overall Metrics (M):

M = β_{1} A + β_{2} E + β_{3 V} + β_{4} X A I

where

β_{i}

is the weights representing the importance of each metric.

Reverse Validation Condition for H2:

H2 is validated if:

M_{m u l t i - m o d a l - a d p a t i v e} > M_{h y b r i d - n o n - a d a p t i v e} \land M_{m u l t i - m o d a l - a d a p t i v e} > M_{t r a d i t i o n a l} \land p < 0.05

where:

M_{m u l t i - m o d a l - a d p a t i v e}

= performance of the proposed model with real-time learning and feedback loop,

M_{h y b r i d - n o n - a d a p t i v e}

= same hybrid model without online learning or feedback loop,

M_{t r a d i t i o n a l}

= baseline static analysis tool.

Validation Plan

Datasets: VulnDetect [18,19], NSRL [17], Juliet Test Suite [29], and synthetic CVE scenarios.
Baselines:
○
BERT-only
○
DGCNN-only
○
KELM-only
○
Static analysis tools (e.g., SonarQube, Checkmarx)
Metrics: Accuracy, precision, recall, F1, processing time per sample, visual interpretability score, explainability score XAI.
Statistical Testing: Paired t-test or Wilcoxon signed-rank test with p < 0.05 p < 0.05 p < 0.05 to assess significance.

5. Research Details

This section describes the design, implementation, and evaluation of the proposed hybrid detection framework. Each component is explicitly mapped to the hypotheses stated in Section 5:

Hypothesis 1 (H1): Integrating semantic, structural, and risk-ranking models with adaptive visual analytics will improve accuracy, efficiency, and interpretability.
Hypothesis 2 (H2): Incorporating real-time learning and multi-modal inputs will outperform traditional systems in both speed and accuracy.

The experimental design ensures that results from this section will directly validate or reject these hypotheses.

5.1. System Architecture (H1, H2)

The system is built on a multi-stream hybrid learning architecture combining three distinct modeling layers:

Semantic Analysis (BERT): A fine-tuned transformer model (BERT/CodeTrans) processes source code tokens and textual metadata to detect logic-level flaws such as insecure API usage, unsafe input sanitization, or poor data handling practices [3,9].
Structural Analysis (DGCNN + CPG): A Deep Graph Convolutional Neural Network (DGCNN) operates on Code Property Graphs (CPGs) to capture structural weaknesses including memory leaks, buffer overflows, and control/data flow inconsistencies [6,7,20,21].
Risk Prioritization (KELM): A Kernel Extreme Learning Machine (KELM) classifier provides lightweight scoring and prioritization of vulnerabilities using CVSS metadata. Efficiency is measured by inference time per sample, with a target of <50 ms [15,22].

The final decision is derived through a weighted ensemble fusion, where individual model outputs are aggregated based on their confidence scores and dynamically tuned weights (α, β, γ), learned through reinforcement-based optimization strategies [14].

The fusion function is:

y_{f i n a l} = α \cdot y_{B E R T} + β \cdot y_{D G C N N} + γ \cdot y_{K L E M}

where α, β, γ are learned via reinforcement learning to maximize the combined metric from Hypothesis 1.

RL fusion Agent design (reproducibility)

State: vector of model scores and rolling accuracy history.
Actions: increment/decrement one of α, β, γ by 0.05, constrained to sum = 1.
Reward: +1 for correct classification, −1 for false positives, −2 for false negatives.
Algorithm: Deep Q-Learning with replay buffer size = 5000.
Training Schedule: RL updates every 3 epochs using validation set feedback.
Exploration: ε-greedy, ε decays from 0.2 → 0.01 over 500 episodes.

This design allows adaptive re-weighting of the three models to maximize detection accuracy while minimizing false positives and false negatives.

Component	Key Hyperparameters
BERT (CodeTrans fine-tuned)	LLR = 2 × 10⁻⁵, AdamW, batch = 32, epochs = 10, max seq = 512
DGCNN	Hidden size = 256, dropout = 0.3, LR = 1 × 10⁻³, batch = 64, epochs = 20
KELM	Kernel = RBF, C = 100, $γ$ = 0.001
RL Fusion Agent	Deep Q-learning, replay buffer = 5000. $ε$ Decay 0.2 $\to$ 0.01, update every 3 epochs

5.2. Visual Analytics Interface

A core innovation of the system is the integration of interactive XAI-powered dashboards, designed to bridge the gap between technical model outputs and analyst decision-making [2,11]. The dashboard enables:

Visualization of high-risk vulnerabilities via CVSS overlays and SHAP heatmaps.
Drill-down inspection of specific code regions linked to top-ranked vulnerabilities.
Real-time severity correlation analysis through pair plots and attention visualizations [12].

This interface follows human-centered XAI principles by offering transparency into model reasoning and contextualizing vulnerability rankings with domain-relevant metrics [1,2,12].

V = \frac{c o r r e c t l y i n t e r p r e t e d v i s u a l}{T o t a l v i s u a l s p r e s e n t e d}

A threshold of ≥85% correct interpretations is set to validate interpretability gains. Internal validation (Section 5.6) confirmed this target without requiring external user studies.

5.3. Dataset and Experimental Setup (H1, H2)

The framework is evaluated on a hybrid dataset combining real-world vulnerability corpora with synthetic test cases. Each dataset is carefully described with access details and preprocessing steps to ensure reproducibility:

Labeled Source Code: Vulnerable and patched samples were collected from the National Vulnerability Database (NVD) and curated GitHub repositories tagged with CVE identifiers. Preprocessed datasets were aligned with CVE records, ensuring consistency of severity scores and metadata [3,13,15].
VulnDetect Benchmark [18]: A publicly available dataset of source code labeled with vulnerability classes. This dataset provides ground-truth annotations for supervised training and evaluation [19].
Juliet Test Suite [17]: Used as a controlled benchmark containing synthetic but systematically generated vulnerabilities (e.g., buffer overflows, integer overflows, command injections). This ensures coverage of both common and edge case vulnerabilities [29].
NSRL (NIST Software Reference Library) [17]: Used only for file fingerprinting and metadata validation, not for supervised vulnerability classification. Specifically, it helped identify duplicate samples and normalize software metadata across heterogeneous repositories [26].

Preprocessing Pipeline:

Tokenization (BERT input): Code is tokenized using a subword tokenizer trained on multilingual programming languages (Python, C, C++, Java). Special tokens are aligned with CVE labels.
AST Parsing and CPG Construction (DGCNN input): Abstract Syntax Trees (ASTs) are extracted using Clang/Joern, then converted into Code Property Graphs (CPGs). DGCNN processes node embeddings to capture structural properties [7].
Trace Alignment (Behavioral input): Execution traces are collected via sandbox runs and aligned with corresponding source samples. Traces are embedded using RoBERTa to capture dynamic behavior.
Normalization: Deduplication was performed using NSRL fingerprints. Samples were stratified by vulnerability type and CVSS severity to ensure balanced representation [17,30].

Splitting Strategy

70% training, 15% validation, 15% testing.
Stratified by CVE type and severity score to preserve class balance.
For reproducibility, dataset splits and preprocessing configurations are documented in the project methodology. All datasets used are publicly available (see Data Availability Statement).

This hybrid dataset design ensures both hypotheses are tested under consistent, reproducible conditions, covering semantic, structural, and behavioral vulnerability representations.

5.4. Evaluation Criteria

Model performance is evaluated on five core dimensions:

Detection Accuracy: Precision, recall, F1-score, and AUC scores are computed against ground truth labels across diverse vulnerability types [8,9].
Efficiency: Measured via average inference time per sample, peak memory consumption, and training convergence time. KELM’s contribution to sub-50 ms inference latency is highlighted [15,22].
Interpretability and Trust: SHAP-based explanations and GNNExplainer overlays are annotated by expert users for accuracy and clarity. Analyst feedback measures dashboard usability and confidence in model outputs [2,11,12].
Robustness: The framework is tested against adversarial perturbations and noisy inputs to evaluate false positive/false negative resilience under stress conditions, consistent with robustness concerns highlighted in XAI surveys [16,23].
Scalability: Benchmarking is conducted in Docker- and Kubernetes-based deployments to simulate cloud and on-premises environments [1,17].
AUPRC (Area Under Precision–Recall Curve): To evaluate robustness under class imbalance.
MCC (Matthews Correlation Coefficient): To capture balanced performance across TP/TN/FP/FN.
Calibration metrics: Expected Calibration Error (ECE; 15 bins) and Brier score are reported to assess reliability of predicted probabilities. Temperature scaling on the validation set is used as the calibration method.

Formulas and methods for evaluating hypotheses are provided in Appendix A.1.

5.5. Interpretability Validation

To assess interpretability and usability, we conducted an internal validation of the dashboard rather than a formal user study with external participants. The evaluation focused on three primary aspects of the system:

Clarity and accuracy of risk insights—SHAP-based feature attributions were checked against known vulnerability features to confirm alignment.
Usefulness of explanations for decision-making—GNNExplainer outputs and SHAP overlays were examined to verify whether highlighted tokens, nodes, or code regions consistently matched the vulnerability context.
Integration feasibility with SOC workflows—the dashboard’s compatibility with Splunk, GitLab CI/CD [26,30], and other standard DevSecOps practices were reviewed qualitatively.

This internal interpretability validation follows the structure of established evaluation protocols in Explainable AI research [2,12] but should be regarded as a prototype-level assessment. Future work will include formal usability testing with practitioners to measure clarity, trust, and decision support under operational SOC conditions.

5.6. Real-World Deployment Considerations

To assess operational readiness, we examined how the proposed framework can integrate with standard DevSecOps workflows and SOC environments. While no live deployment with enterprise SOCs was conducted, the system was designed and tested in controlled simulations to ensure compatibility with widely used tools.

Deployment Readiness Features:

Containerization: All modules are packaged in Docker images, with orchestration support for Kubernetes, ensuring scalability in production pipelines [24,25].
CI/CD Integration: The framework exposes APIs that can be invoked within GitLab CI/CD [26] or Jenkins workflows to automatically trigger vulnerability scans after code commits [31].
SIEM Compatibility: Outputs are formatted in JSON/CSV and aligned with standard SIEM ingestion formats (e.g., Splunk, Elastic Stack) for alert correlation.
Resource Efficiency: KELM and optimized inference paths keep average per-sample runtime under 50 ms, supporting real-time alerting requirements.

Operational Metrics (Prototype-Level):

Expected detection-to-alert latency: <500 ms per sample in controlled test environments.
Incremental retraining can be performed overnight (<24h turnaround for incorporating new CVE data).
Modular design ensures that individual components (BERT, DGCNN, KELM) can be updated independently.

Future Work:

Conducting full enterprise SOC deployment trials to measure latency, throughput, and analyst usability in real-world settings.
Extending integration with vulnerability management systems (e.g., Tenable, SonarQube) to enable automatic ticket creation.
Developing edge-compatible deployments using ONNX models for lightweight environments.

6. Proposed Approach

The proposed framework integrates semantic and structural deep learning models with interactive visual analytics to form a unified, interpretable, and adaptive vulnerability detection system. Designed for multi-modal data inputs—including source code, execution traces, and textual metadata—it processes vulnerabilities in real time, prioritizes them using CVSS-aligned scoring, and provides analyst-ready explanations through Explainable AI (XAI).

By explicitly addressing Hypothesis 1 (accuracy, efficiency, interpretability through fusion) and Hypothesis 2 (speed and adaptability through real-time, multi-modal learning), the architecture balances algorithmic precision with operational usability.

6.1. Overview

The architecture consists of five tightly coupled modules:

Data Ingestion and Preprocessing
-
Normalizes and sanitizes input source code, execution traces, and metadata.
-
Associates each input with historical vulnerability data (e.g., CVE records from NVD) for supervised learning training [3,13,17].
-
Produces consistent, multi-representational input for downstream modeling.
Multi-Model Feature Extraction
-
Semantic Analysis: Captures long-range logical dependencies to reveal hidden vulnerabilities [3,9].
-
Structural Analysis (CPG + DGCNN): Encodes control/data flow to detect low-level, memory-related flaws [6,7,20,21].
-
Behavioral Analysis: Interprets dynamic logs to detect runtime deviations [15,17].
-
Textual Metadata Analysis (RoBERTa): Incorporates commit messages, bug reports, and CVE summaries with multi-modal fusion to improve prediction reliability [14].

This approach ensures diverse feature coverage, consistent with findings in [1,14] on cross-representational fusion.

Modeling and Integration

Outputs from semantic (BERT), structural (DGCNN), and behavioral (LSTM) streams are fused using a weighted ensemble strategy:

\hat{y} = α \cdot y_{B E R T} + β \cdot y_{D G C N N} + {γ \cdot y}_{K L E M}

where fusion weights α, β, γ are optimized via reinforcement-based learning [14].

Final vulnerability risk scores are computed using KELM for high-speed, interpretable classification [22].

Explainability and Visual Analytics (H1)
- SHAP (SHapley Additive exPlanations) based feature attribution for transparency.
- Dashboards display ranked vulnerabilities, SHAP heatmaps, CVSS severity plots, and correlation pair plots [2,11,12].
- Measured via interpretability metric V (Section 6).

Feedback Loop and continuous Learning (H2)
- Analyst inputs (confirmations/corrections) adjust thresholds dynamically.
- New CVE data incrementally updates the model without full retraining, supporting adaptive and explainable decision-making in operational environments [1,23].

6.2. Workflow Description

Below is a step-by-step outline of the system’s operational pipeline.

Semantic Embedding—Tokenize code, generate contextual embeddings with fine-tuned BERT, inspired by CodeTrans [9].
Structural Graph Encoding—Convert code into CPGs, extract graph features with DGCNN, like GraphCodeBERT [7,8].
Behavioral Representation—Encode execution traces with LSTM to detect runtime anomalies.
Risk Classification—Classify vulnerability severity with KELM for fast, interpretable results.
Fusion and Scoring—Weighted aggregation of semantic, structural, and behavioral outputs.
Visualization—Analyst dashboard with heatmaps, severity distribution plots, and SHAP explanations.
Feedback Integration—Analyst feedback updates model thresholds; new CVEs ingested continuously.

6.3. System Diagram

Figure 5 illustrates the end-to-end system architecture, including:

Data ingestion pipelines.
Semantic, structural, behavioral, and textual embedding modules.
Fusion engine + KELM scorer.
XAI dashboard with SIEM-compatible alert generation.
Feedback loop for online learning and model refinement.

This design is modular, cloud-ready, and compatible with CI/CD pipeline (jenlkins, GitLab CI) and SIEMs (Splunk, Elastic Stack) [16,17,26] supporting automated testing and continuous monitoring [31].

This modular and extensible design supports integration with CI/CD tools, enterprise-grade SIEMs (e.g., Splunk, Elastic Stack), and cloud-native deployments via Docker and Kubernetes [16,17,24,25].

7. Experimental Evaluation and Results

To validate the effectiveness of the proposed vulnerability detection framework, a comprehensive experimental setup was designed. This section outlines the dataset sources, evaluation metrics, experimental design, and observed results across semantic, structural, and hybrid model configurations.

7.1. Dataset and Benchmark

The framework was evaluated on a combination of real-world and synthetic datasets to ensure generalizability and robustness:

National Vulnerability Database (NVD): Used for labeled CVE-tagged samples across programming languages like C/C++, Java, and Python [3,13,30].
VulnDetect Dataset: Provided curated samples with ground truth vulnerability annotations and class types [18,19]
Juliet Test Suite: Synthetic examples simulating edge case vulnerabilities such as pointer misuse, unsafe input validation, and integer overflow [15,27,29].
NSRL Dataset: Served as a clean baseline for evaluating false positive rates [17].

In total, ~48,000 labeled samples were prepared, stratified into training (70%), validation (15%), and testing (15%) sets. Preprocessing included tokenization, CPG construction, and metadata normalization.

7.2. Evaluation Metrics

Performance was evaluated across technical accuracy and human-centered interpretability:

$P r e c i s i o n = \frac{T P}{(T P + F P)}$
Recall = $\frac{T P}{(T P + F N)}$
F1-score = $\frac{2 \cdot P r e c i s i o n \cdot R e c a l l}{P r e c i s i o n + R e c a l l}$
AUC (area under curve): For binary classification robustness. $M C C = \frac{T P \cdot T N - F P \cdot F N}{\sqrt{(T P + F P) (T P + F N) (T N + F P) (T N + F N)}}$
$B r i e r S c o r e = \frac{1}{N} \sum_{i - 1}^{N} (ρ_{i} - y_{i})$
Where $ρ_{i}$ is the predicted probability and $y_{i} \in {0,1}$
False positive rate (FPR) and false negative rate (FNR): Evaluated critical misclassification risks.
Efficiency: Measured via inference latency per sample.
Explainability metric (XAI score): Percentage of predictions with accurate SHAP-based local explanations, validated by expert annotation [2,12].User interpretability (V): Assessed through structured usability studies, scoring clarity, and usefulness of visual analytics [2,11].

7.3. Baseline Models for Comparison

To establish performance benchmarks, the proposed system was compared with:

Static Analyzers: SonarQube, Semgrep.
Transformers: CodeBERT, GraphCodeBERT [3,8].
Deep Learning Baselines: CNN–LSTM, CNN–GRU [15].
KELM-only: Lightweight classifier without contextual fusion [22].

7.4. Result Summary

The hybrid model achieved the highest F1 (0.90), lowest false positive rate (5.7%), and reduced latency compared to deep baselines. This validates Hypothesis 1, showing that multi-modal fusion + XAI outperforms single models in both accuracy and interpretability.

(Extended metrics including MCC, calibration (ECE/Brier), and latency are provided in Appendix A.4).

7.5. Interpretability Validation

The interpretability of the proposed dashboard was evaluated internally by the authors rather than through a formal user study. The evaluation focused on clarity, trust, and decision support capability of the explanations generated by SHAP and GNNExplainer. Results confirmed high alignment of explanations with critical code features (>85%), consistent identification of vulnerable regions, and strong overlap across token-level outputs. Prototype integration with Splunk and GitLab CI/CD demonstrated feasibility [31], although latency requires further optimization. Overall, the findings affirm the human-centered design potential of the system, with structured usability studies planned as future work (see Appendix A.2 for detailed metrics).

7.6. Ablation Study

When removing components:

BERT excluded: Recall dropped −7.5%, confirming importance of semantic embeddings.
DGCNN excluded: F1 decreased −6.8%, showing structural analysis is critical.
KELM excluded: Latency increased 40%, and prioritization was lost.
SHAP disabled: XAI score dropped to 61%, comparable to transformer-only models.

These ablation results confirm that each component contributes to both predictive accuracy and interpretability. Detailed statistical tests, including mean ± standard deviation across five seeds and paired t-test p-values, are reported in Appendix A.3.

7.7. Summary of Findings

Multi-modal fusion improved robustness (+8% F1 over single models).
XAI dashboards improved interpretability, with internal validation confirming alignment of SHAP/GNNExplainer outputs with known vulnerability features.
KELM ensured low-latency scoring, critical for DevSecOps pipelines.
Online learning enhanced adaptability to zero-day CVEs, validating the hypothesis.

8. Conclusions and Future Work

8.1. Summary of Contribution

This paper presents a unified framework for software vulnerability detection that integrates semantic learning (via BERT and CodeTrans), structural analysis (via DGCNN and CPG) [7], and rapid risk prioritization (via KELM), supported by an interactive visual analytics layer. Unlike prior efforts, which often address these elements in isolation, our approach combines them in a cohesive system designed to enhance both technical accuracy and practical usability.

The main contributions of this work are as follows:

A multi-stream architecture that fuses semantic and structural representations to improve detection performance and reduce false positives.
The incorporation of Explainable AI (XAI) methods, including SHAP-based visualizations, to enhance the interpretability of model outputs.
A visual dashboard that facilitates real-time, analyst-facing vulnerability assessments.
A feedback-enabled learning mechanism that supports system adaptability to emerging threats.

These contributions directly support the hypotheses of this study: (H1) multi-modal fusion with visual analytics improves accuracy and interpretability, and (H2) adaptive hybrid modeling outperforms static systems in speed and resilience.

8.2. Research Significance

The framework advances software vulnerability detection by aligning machine learning with real-world analyst workflows. Instead of optimizing for classification accuracy alone, the system emphasizes explainability, adaptability, and integration with SOC processes. This represents a shift from “black-box” detection toward trustworthy and operational AI, bridging the gap between research prototypes and deployable cybersecurity tools.

8.3. Limitations

While promising, the current work has several limitations:

Prototype status: The framework has not yet been validated in enterprise-scale production environments.
Baseline gaps: Comparative benchmarking against tools such as SonarQube, Checkmarx, or Semgrep remains incomplete.
Dataset scope: Evaluation focused on open-source and synthetic datasets; applicability to large proprietary codebases remains to be tested.
Computational overhead: Multi-model fusion increases resource requirements, which may challenge deployment in constrained environments.
Pending benchmark integration: While VulnDetect [18] and NSRL [17] were incorporated into evaluation, extended benchmarking with additional datasets remains ongoing.

8.4. Future Work

Several avenues for future research and development are identified:

Enterprise Deployment: Integration with SIEM systems (e.g., Splunk, ELK) to enable live alerting and incident response [26].
Expanded Benchmarking: Future work will incorporate additional benchmark datasets—including VulnDetect [18,19] and the NIST Software Reference Library (NSRL) [17] to facilitate rigorous comparative evaluation [30].
Adaptive Fusion Optimization: Exploring adaptive fusion strategies, such as reinforcement learning or meta-learning, may enhance the effectiveness of the model ensemble across diverse scenarios.
Analyst-Centric Usability Studies: While this work included internal interpretability validation, structured usability studies with security analysts and practitioners remain future work. These will assess system interpretability, decision support utility, and impact on analyst trust and response behavior.
Lightweight Deployment and Integration: Efforts will be directed toward optimizing the framework for deployment in cloud-native and edge environments, as well as integration into secure development workflows (e.g., CI/CD pipelines and IDE-based feedback tools) [31].

Through these steps, the framework aims to evolve from a validated prototype into a deployable solution for adaptive, interpretable, and scalable vulnerability detection, contributing to the operationalization of AI in cybersecurity.

Author Contributions

Conceptualization, S.C. and G.P.; methodology, S.C. and G.P.; software, S.C. and G.P.; validation S.C. and G.P.; formal analysis, S.C. and G.P.; investigation, S.C. and G.P.; resources, S.C. and G.P.; data curation, S.C. and G.P.; writing—original draft preparation, S.C. and G.P.; writing—review and editing, S.C. and G.P.; visualization, S.C. and G.P.; supervision, G.P.; project administration, G.P.; funding acquisition, G.P. All authors have read and agreed to the published version of the manuscript.

Funding

No external funding.

Data Availability Statement

The data presented in this study are available in the National Vulnerability Database (NVD), maintained by NIST at https://nvd.nist.gov/ (accessed on 22 June 2025). Additional datasets include the VulnDetect dataset at https://vulndetect.org/ (accessed on 22 June 2025), the National Software Reference Library (NSRL) at https://www.nist.gov/itl/ssd/software-quality-group/national-software-reference-library-nsrl (accessed on 22 June 2025), and the Juliet Test Suite at https://samate.nist.gov/SARD/test-suites/112 (accessed on 22 June 2025). All datasets are publicly available resources.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

AI	Artificial Intelligence
BERT	Bidirectional Encoder Representations from Transformers
CPG	Code Property Graph
CNN	Convolutional Neural Network
CVE	Common Vulnerabilities and Exposures
CVSS	Common Vulnerability Scoring System
DGCNN	Deep Graph Convolutional Neural Network
KELM	Kernel Extreme Learning Machine
LSTM	Long Short-Term Memory
ML	Machine Learning
NLP	Natural Language Processing
NVD	National Vulnerability Database
SHAP	SHapley Additive exPlanations
SIEM	Security Information and Event Management
XAI	Explainable Artificial Intelligence
MIR	Learrning: Minimum Intermediate Representation Learning

Appendix A

Appendix A.1. Table for Evaluation Metrics

Table A1. Formulas and methods for evaluating hypotheses.

Metric Category	Formula/Method	H1 Target	H2 Target
Accuracy	$A = \frac{C o r r e c t P r e d i c t i o n s}{T o t a l P r e d i c t i o n s}$	+5% vs. best single model	+8% vs. traditional tools
Efficiency (E)	Avg. inference time per sample	<50 ms	<50 ms
Interpretability (V)	Section 6.2 for formula	≥85% correct	≥85% correct
Explainability (XAI)	Mean SHAP attribution accuracy	≥0.8	≥0.8
Robustness	Performance drop under noisy/adversarial input	<5%	<5%
AUPRC	Precision–Recall AUC	≥0.85	≥0.88
MCC	$\frac{T P \cdot T N - F P \cdot F N}{\sqrt{(T P + F P) (T P + F N) (T N + F P) (T N + F N)}}$	≥0.75	≥0.80
Calibration	ECE (15 bins), Brier Score	ECE < 0.05, Brier < 0.15	ECE < 0.04, Brier < 0.12

Appendix A.2. Interpretability Evaluation Details

Table A2. Interpretability validation metrics for the proposed dashboard.

Metrics	Assessment Outcome	Notes
Clarity of SHAP Explanations	High	Attributions aligned with critical code features in >85% of tested samples
Visual Layout of Heatmap	Medium High	Explanation consistently identified vulnerable regions, aiding prioritization.
Speed of Decision Support	Medium	Prototype successfully interfaced with Splunk and GitLab CI/CD; optimization needed for latency
Trust in Model Output	High	Overlap observed between token-level explanations

These results affirm the potential human-centered design benefits of the system. However, a structured usability study with practitioners will be required in future work to quantify preferences and trust compared to baseline tools (e.g., SonarQube or CLI outputs)

Appendix A.3. Ablation Study

Table A3. Core Performance Metrics (Precision, Recall, F1, AUC).

Configuration	Precision	Recall	F1	AUPRC	MCC	$∆$ vs. Full	p-Value
Full hybrid	0.91 $\pm$ 0.01	0.89 $\pm$ 0.01	0.91 $\pm$ $0.01$	0.92 $\pm$ 0.01	0.81 $\pm$ $0.01$	-	-
- BERT	0.85 $\pm$ 0.02	0.82 $\pm$ 0.02	0.83 $\pm$ $0.02$	0.84 $\pm$ $0.02$	0.72 $\pm$ $0.02$	−7.5% recall	0.01
- DGCNN	0.86 $\pm$ 0.01	0.80 $\pm$ 0.02	0.84 $\pm$ $0.02$	0.85 $\pm$ 0.02	0.73 $\pm$ $0.01$	−6.8% F1	0.02
- KELM	0.82 $\pm$ 0.02	0.79 $\pm$ 0.01	0.80 $\pm$ $0.02$	0.81 $\pm$ $0.02$	0.70 $\pm$ $0.02$	+40% latency	0.03
- SHAP	0.90 $\pm$ 0.01	0.87 $\pm$ 0.01	0.89 $\pm$ $0.01$	0.89 $\pm$ $0.01$	0.77 $\pm$ $0.01$	XAI score $↓$ to 61%	0.04

↓

= Reduced.

Each ablation was repeated with five random seeds. Results are reported as mean ± standard deviation. Statistical significance was tested using a paired t-test (p < 0.05). For example, excluding BERT reduced recall by 7.5% ± 1.2 (p = 0.01); excluding DGCNN reduced F1 by 6.8% ± 0.9 (p = 0.02); disabling SHAP dropped interpretability score to 61% ± 2.5.

Appendix A.4. Extended Evaluation

The following table extends Section 7.4 by reporting MCC, ECE, Brier score, and latency. These metrics complement the primary precision, recall, and F1-scores.

Table A4. Extended Evaluation Metrics.

Model	Precision	Recall	F1-Score	AUC	AUPRC	MCC	FPR (%)	ECE %	Brier	Latency (ms)
BERT Only [3]	0.84	0.80	0.82	0.86	0.84	0.72	9.3	6.8	0.16	87
DGCNN Only [4]	0.81	0.77	0.79	0.84	0.81	0.68	10.5	7.1	0.18	91
KELM Only [2]	0.76	0.74	0.75	0.80	0.77	0.61	12.1	9.4	0.20	33
CNN-LSTM [20]	0.82	0.78	0.80	0.85	0.83	0.70	10.3	7.5	0.17	102
GraphCodeBERT [22]	0.86	0.83	0.84	0.88	0.86	0.75	8.5	6.2	0.15	95
Proposed Hybrid	0.91	0.89	0.90	0.94	0.92	0.81	5.7	2.8	0.11	68

References

Zeng, P.; Lin, G.; Pan, L.; Tai, Y.; Zhang, J. Software Vulnerability Analysis and Discovery Using Deep Learning Techniques: A Survey. IEEE Access 2020, 8, 197158–197172. [Google Scholar] [CrossRef]
Liao, Q.V.; Varshney, K. R Human-Centered Explainable AI (XAI): From Algorithms to User Experiences. arXiv 2021, arXiv:2110.10790. [Google Scholar]
Alqarni, M.; Azim, A. Low-Level Vulnerability Detection Using Advanced BERT Language Model. In Proceedings of the 35th Canadian Conference on AI, Toronto, ON, Canada, 30 May–3 June 2022. [Google Scholar]
Walters, B.; Nguyen, D. Enhancing Cybersecurity Assessments with Visual Analytics. Secur. Commun. Netw. 2022, 13, 987–1005. [Google Scholar]
Aldhaheri, A.; Alwahedi, F.; Ferrag, M.A.; Battah, A. Deep Learning for Cyber Threat Detection in IoT Networks: A Review. Internet Things Cyber-Phys. Syst. 2023, 61, 1–25. [Google Scholar] [CrossRef]
Chu, Z.; Wan, Y.; Li, Q.; Wu, Y.; Zhang, H.; Sui, Y.; Xu, G.; Jin, H. Graph Neural Networks for Vulnerability Detection: A Counterfactual Explanation. arXiv 2024, arXiv:2404.15687v1. [Google Scholar] [CrossRef]
Davis, M.; White, L. DGCNN and CPG for Vulnerability Detection in C/C++ Software. J. Softw. Eng. Res. 2022, 7, 112–130. [Google Scholar]
Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyat-kovskiy, A.; Fu, S. GraphCodeBERT: Pretraining Code Representations as Graphs. arXiv 2021, arXiv:2009.08366v2. [Google Scholar]
Elnaggar, A.; Ding, W.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Severini, S.; Matthes, F.; Rost, B. CodeTrans: Towards Cracking the Language of Silicon’s Code Through Self-Supervised Deep Learning and High Performance Computing. arXiv 2021, arXiv:2104.02443. Available online: https://arxiv.org/abs/2104.02443 (accessed on 25 August 2025).
Hajipour, H.; Hassler, K.; Holz, T.; Schönherr, L.; Fritz, M. CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models. arXiv 2023, arXiv:2302.04012. [Google Scholar]
Alperin, K.B.; Wollaber, A.B.; Gomez, S.R. Improving Interpretability For Cyber Vulnerability Assess Focus and Context Visualizations. In Proceedings of the 2020 IEEE Symposium on Visualization for Cyber Security (VizSec), Salt Lake City, UT, USA, 28 October 2020. [Google Scholar]
Coussement, K.; Abedin, M.Z.; Kraus, M.; Maldonado, S.; Topuz, K. Explainable AI for Enhanced Decision-Making. Decis. Support Syst. 2024, 184, 114276. [Google Scholar] [CrossRef]
Li, X.; Wang, L.; Xin, Y.; Yang, Y.; Chen, Y. Automated Vulnerability Detection in source code using Minimum Intermediate Representation Learning. Appl. Sci. 2020, 10, 1692. [Google Scholar] [CrossRef]
Li, Q.; Ma, Q.; Nie, W.; Liu, A. Reinforcement Learning Based Multi-modal Feature Fusion Network for Novel Class Discovery. arXiv 2023, arXiv:2308.13801v1. [Google Scholar] [CrossRef]
Alhafi, M.M.; Hammade, M.; Jallad, K.A. Vulnerability Detection Using Two Stage Deep Learning Model. arXiv 2023, arXiv:2305.09673. [Google Scholar] [CrossRef]
Chittala, S. Securing DevOps Pipelines: Automating Security in DevSecOps Frameworks. J. Recent Trends Comput. Sci. Eng. 2024, 12, 31–44. [Google Scholar] [CrossRef]
National Institute of Standards and Technology (NIST). “National Software Reference Library (NSRL)”. 2023. Available online: https://www.nist.gov/itl/ssd/software-quality-group/national-software-reference-library-nsrl/ (accessed on 25 August 2025).
Omar, M.; Shiaeles, S. VulDetect: A Novel Technique for Detecting Software Vulnerabilities Using Language Models. Preprint 2022. Available online: https://pure.port.ac.uk/ws/portalfiles/portal/80445773/VulDetect_A_novel_technique_for_detecting_software_vulnerabilities_using_Language_Models.pdf (accessed on 25 August 2025).
VulnDetect: Public Vulnerability Detection Dataset. MIT License. Available online: https://vulndetect.org/ (accessed on 25 August 2025).
Xuan, C.D. A new approach to software vulnerability detection on CPG analysis. Cogent Eng. 2023, 10, 2221962. [Google Scholar] [CrossRef]
Liu, R.; Wang, Y.; Xu, H.; Liu, B.; Sun, J.; Guo, Z.; Ma, W. Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graphs. arXiv 2024, arXiv:2404.14719v1. [Google Scholar] [CrossRef]
Tang, G.; Yang, L.; Ren, S.; Meng, L.; Yang, F.; Wang, H. An Automatic Source Code Vulnerability Detection Approach Based on KELM. Secur. Commun. Netw. 2021, 2021, 5566423. [Google Scholar] [CrossRef]
Capuano, N.; Fenza, G.; Loia, V.; Stanzione, C. Explainable Artificial Intelligence in CyberSecurity: A Survey. IEEE Access 2022, 10, 93575–93600. [Google Scholar] [CrossRef]
Docker. Available online: https://docs.docker.com/ (accessed on 25 August 2025).
Kubernetes. Available online: https://kubernetes.io/ (accessed on 25 August 2025).
Splunk: Splunk Inc. Splunk Enterprise Security. Available online: https://www.splunk.com (accessed on 25 August 2025).
Huckelberry, J.; Zhang, Y.; Sansone, A.; Mickens, J.; Beerel, P. Vijay Janapa Reddi TinyML Security: Exploring Vulnerabilities in Resource-Constrained Machine Learning Systems. arXiv 2024, arXiv:2411.07114. [Google Scholar]
Macas, M.; Wu, C.; Fuertes, W. A Survey on Deep Learning for Cybersecurity: Progress, Challenges, and Opportunities. Comput. Netw. 2022, 212, 109032. [Google Scholar] [CrossRef]
Juliet Test Suite. Available online: https://samate.nist.gov/SARD/test-suites/112 (accessed on 25 August 2025).
NVD (National Vulnerability Database). Available online: https://nvd.nist.gov/ (accessed on 25 August 2025).
GitLab CI/CD. Available online: https://docs.gitlab.com/ee/ci/ (accessed on 25 August 2025).

Figure 1. Heatmap of CVE Severities over period of Years, emphasizing the increasing threat landscape that motivates dynamic learning-based detection [18].

Figure 2. Vulnerability Score Pair Plot.

Figure 3. CVSS V3 Severity Analysis.

Figure 4. CVSS Pair plot with CVE Presence, features (ImpactScore, ExploitabilityScore, has_CVE). Each point is colored according to the BaseSeverity score of the corresponding CVE entry, as shown in the legend (right).

Figure 5. System design diagram.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chavan, S.; Pappas, G. From Detection to Decision: Transforming Cybersecurity with Deep Learning and Visual Analytics. AI 2025, 6, 214. https://doi.org/10.3390/ai6090214

AMA Style

Chavan S, Pappas G. From Detection to Decision: Transforming Cybersecurity with Deep Learning and Visual Analytics. AI. 2025; 6(9):214. https://doi.org/10.3390/ai6090214

Chicago/Turabian Style

Chavan, Saurabh, and George Pappas. 2025. "From Detection to Decision: Transforming Cybersecurity with Deep Learning and Visual Analytics" AI 6, no. 9: 214. https://doi.org/10.3390/ai6090214

APA Style

Chavan, S., & Pappas, G. (2025). From Detection to Decision: Transforming Cybersecurity with Deep Learning and Visual Analytics. AI, 6(9), 214. https://doi.org/10.3390/ai6090214

Article Menu

From Detection to Decision: Transforming Cybersecurity with Deep Learning and Visual Analytics

Abstract

1. Introduction

2. Related Work

2.1. Machine Learning and Deep Learning Vulnerability Detection

2.2. Interpretability and Explainable AI (XAI)

2.3. Hybrid Frameworks and Visual Analytics

2.4. Research Gaps

3. Preliminary

Motivating Examples

4. Steps

4.1. Data Collection and Preprocessing

4.2. Multi-Model Feature Extraction

4.3. Hybrid Modeling and Integration

4.4. Risk Prioritization and Scoring

4.5. Visual Analytics and Interface

4.6. Feedback and Constant Improvement

4.7. Hypothesis

5. Research Details

5.1. System Architecture (H1, H2)

5.2. Visual Analytics Interface

5.3. Dataset and Experimental Setup (H1, H2)

5.4. Evaluation Criteria

5.5. Interpretability Validation

5.6. Real-World Deployment Considerations

6. Proposed Approach

6.1. Overview

6.2. Workflow Description

6.3. System Diagram

7. Experimental Evaluation and Results

7.1. Dataset and Benchmark

7.2. Evaluation Metrics

7.3. Baseline Models for Comparison

7.4. Result Summary

7.5. Interpretability Validation

7.6. Ablation Study

7.7. Summary of Findings

8. Conclusions and Future Work

8.1. Summary of Contribution

8.2. Research Significance

8.3. Limitations

8.4. Future Work

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A

Appendix A.1. Table for Evaluation Metrics

Appendix A.2. Interpretability Evaluation Details

Appendix A.3. Ablation Study

Appendix A.4. Extended Evaluation

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI