1. Introduction
Software-defined (SD) honeypots represent a transformative paradigm in active cyber defense, moving beyond static configurations to achieve greater flexibility and scalability. Early surveys by Bringer et al. [1] highlighted the necessity of evolving honeypot research to address dynamic threats. By integrating Software-Defined Networking (SDN) concepts, modern honeypots can leverage dynamic configuration and intelligent orchestration. This adaptability is crucial for implementing Moving Target Defense (MTD) strategies, as demonstrated by Duy et al. [2], which prevent attackers from easily mapping defensive postures. Furthermore, advancements in virtualization, such as the live migration techniques proposed by Hirata et al. [3], allow defenders to reshape network topologies in real time. This evolution addresses the diminishing returns of static defenses by enabling systems to adapt to evolving attack patterns [4] and mitigate insider reconnaissance threats through virtualized network deception [5]. Consequently, game-theoretic approaches are increasingly applied to optimize the allocation of these dynamic resources [6], ensuring robust defense even under uncertainty.
Recent advances in programmable data planes and hardware acceleration have further expanded the horizons of SD honeypots. Surveys by Han et al. [7] and Parizotto et al. [8] reveal that integrating DPUs with deception mechanisms can reduce reconfiguration latency by up to 90%, yet their application in honeypot architectures remains nascent. Concurrently, game-theoretic frameworks have evolved to address multi-attacker scenarios and dynamic graphs, while deep reinforcement learning has been successfully applied to adaptive MTD strategies. These developments underscore the necessity of a unified framework that synergizes hardware acceleration, strategic commitment, and adaptive learning: precisely the gap this work aims to fill.
Despite these advancements, traditional SD honeypots often face performance bottlenecks. The Data Processing Unit (DPU) has recently emerged as a critical architecture for accelerating data-centric workloads. Integrating high-performance network interfaces and programmable cores, DPUs are engineered to offload computationally intensive tasks—such as network stack processing, encryption, and storage management—from the host CPU. This offloading capability significantly enhances system throughput and reduces latency. While DPUs are established in network function virtualization and software-defined storage, their potential to optimize SD honeypot architectures remains underutilized.
The rapid evolution of DPU technology offers a pathway to resolve the latency and jitter issues inherent in traditional SD honeypots, particularly under unstable network conditions. By offloading traffic handling and protocol parsing to the DPU, defenders can achieve high-fidelity emulation without compromising response times. However, full-scale deployment remains challenging due to resource overheads. A hybrid deployment model—combining local high-interaction nodes with DPU-accelerated SD infrastructure—offers a viable solution. Furthermore, integrating the DPU’s programmable data plane with Deep Reinforcement Learning (DRL) enables hardware-assisted, adaptive decision-making.
Nevertheless, a critical optimization challenge persists: defenders must balance the cost of adjustment against defensive effectiveness when reconfiguring honeypots. Over-optimization can lead to diminishing returns, where high-cost adjustments yield negligible improvements. To address this, we introduce the Adaptive Adjustment Mechanism for Software-Defined Honeypots based on Stackelberg Game and Deep Reinforcement Learning (AASGRL). This mechanism maintains robust baseline defenses while progressively optimizing configurations through adaptive learning, ensuring a cost-aware active defense suitable for real-world deployment. The novelty of this work is threefold:
Hardware-Level Deception: Unlike software-only approaches, we offload stateful protocol emulation to the DPU data plane to eliminate the 10–100 ms jitter common in traditional migrations.
Hybrid Strategic-Adaptive Logic: We bridge the gap between static strategic commitment (Stackelberg) and ex-post adaptation (DRL), a coupling currently missing in MTD research.
Cost-Awareness via Mahalanobis Distance: We introduce a quantitative metric to prevent over-optimization, addressing the scalability issues in enterprise-scale environments.
2. Related Work
Research on software-defined honeypots has expanded significantly, focusing on dynamic traffic control, situational awareness, and strategy optimization.
2.1. Optimization of Honeypot Allocation and Deployment
With the maturation of SDN and Network Function Virtualization (NFV), research has shifted toward fine-grained traffic management and resource allocation. Anwar and Kamhoua [9] proposed frameworks for optimal honeypot allocation under uncertainty, demonstrating that strategic placement significantly increases the cost for attackers. This work was further extended by Sayed et al. [10] to address dynamic tactical networks, where topology changes necessitate rapid reconfiguration. To address resource constraints in such environments, Acosta et al. [11] developed lightweight, on-demand deployment strategies that minimize overhead while maintaining deception fidelity. Furthermore, recent studies have utilized hypergame theory [12] and game-theoretic analyses of attacker-defender interactions [13] to model and counter malicious reconnaissance and website defacement attempts effectively.
2.2. Intelligent and Adaptive Deception Strategies
Traditional static honeypots lack the responsiveness required to counter advanced persistent threats. Consequently, the focus has shifted to dynamic architectures that employ intelligent decision-making. Pawlick et al. [14] emphasized the importance of optimal timing in dynamic engagement, arguing that the duration and frequency of deception are critical to its success. Building on this, Huang and Zhu [15] introduced Reinforcement Learning (RL) over semi-Markov decision processes to enable honeypots to learn from attacker interactions autonomously. Similarly, Shrivastava et al. [16] applied modified Naïve-Bayes algorithms within a game-theoretic context to enhance the detection of Denial of Service (DoS) attacks.
Foundational work in this domain has established rigorous theoretical frameworks for cyber deception [17]. For instance, Ferguson-Walter et al. [18] explored adaptive deception through behavioral game theory, while Cranford et al. [19] incorporated cognitive theories to better understand attacker biases. To quantify the effectiveness of these strategies, Jajodia et al. [20] developed probabilistic logic models for deception. Additionally, maximizing network diversity has been identified as a key metric for resilience; Sarr et al. [21] and Zhang et al. [22] demonstrated that software and network diversity significantly hinder zero-day attacks. Specific applications, such as the IoTPOT project by Pa et al. [23], have further validated the utility of diverse, distributed honeypots in capturing specialized threat intelligence.
2.3. Hardware Acceleration in Security
In the realm of data center security, DPUs are increasingly leveraged to isolate control planes and accelerate security workloads. Firestone et al. [24] demonstrated that SmartNICs can achieve sub-microsecond packet processing latency, while Parizotto et al. [8] surveyed security function offloading, reporting 3–5× throughput improvements. Elizalde et al. [25] specifically evaluated SmartNIC-based IDS, showing 40% CPU offload with negligible latency overhead. Hu et al. [26] highlighted the DPU’s role in offloading complex network functions, while Zheng et al. [27] discussed the customization of DPUs for computing and network convergence. Despite these advances, three critical gaps persist: (1) no prior work has offloaded stateful deception logic to DPU data planes; (2) DPU-accelerated attack graph updates for real-time honeypot reconfiguration remain unexplored; (3) the cost–benefit trade-off of DPU-based adaptive strategies lacks quantitative analysis. This paper bridges these gaps by proposing a DPU-enhanced architecture managed by the AASGRL mechanism, where programmable pipelines execute real-time protocol emulation and Mahalanobis distance computation for cost-aware adjustments.
2.4. Research Gaps and Motivation
Despite existing advances, three fundamental limitations motivate our work:
Gap 1: DPU-Enabled Deception Logic. While DPUs excel at generic network functions, no prior architecture offloads honeypot protocol emulation and attack graph traversal to programmable pipelines, leaving 10–100 ms jitter vulnerabilities during migration.
Gap 2: Hybrid Strategic-Adaptive Frameworks. Pure Stackelberg models lack ex-post adaptation, while independent DRL approaches ignore strategic commitment. The coupling of these paradigms for cost-aware reconfiguration lacks a theoretical foundation and hardware support.
Gap 3: Scalability vs. Cost Awareness. State-of-the-art placement algorithms scale only to ~100 nodes, whereas enterprise AD environments exceed 10^5 nodes. Moreover, existing RL methods ignore adjustment cost penalties, leading to over-optimization and diminishing returns.
3. Design and Implementation
Figure 1 illustrates the architecture of AASGRL—an adaptive honeypot adjustment system integrating DPU acceleration, Stackelberg game theory, and deep reinforcement learning. Operating within cloud security provider environments, the system is designed to address dynamic network attack scenarios through interactions with cloud security tenants and potential attackers via ISP networks. The DPU-accelerated subsystem offloads underlying network processing tasks, enhancing the efficiency of honeypot environment deployment while improving real-time responsiveness to mitigate network fluctuation impacts.
After a period of attacker-defender interaction, the expected defense benefit of the optimal SD honeypot deployment strategy derived from the Stackelberg game-based mechanism is calculated and used as a decision threshold for evaluating optimality. The adaptive adjustment mechanism (AASGRL) integrates the system’s network topology, dynamic attack strategies, and current honeypot deployment status. It uses this Stackelberg-derived benefit expectation as a threshold in deep reinforcement learning to determine whether the current deployment approximates the optimal strategy, thereby computing a near-optimal SD honeypot deployment strategy for the current network.
Subsequently, the mechanism calculates both the adjustment cost required to transition from the current deployment to the near-optimal strategy and the corresponding defense benefit gain. Based on this analysis, it determines whether and how the defender should adjust the honeypot deployment.
3.1. System Model
3.1.1. Defender Model
The defender’s objective is to protect target nodes by strategically deploying decoy systems. To maintain the authenticity of the simulated network environment, honeypots are embedded at each node interface. Attackers must compromise a node’s direct parent node before accessing it. The defender determines honeypot placement by evaluating node criticality, available budget, and the attacker’s potential next target. The defender’s action space involves selecting device interfaces for honeypot deployment within budget constraints.
Let $B$ denote the maximum number of deployable honeypots under current budget limitations. Given the edge set $E$, the defender’s action space is defined as $A_D = \{d \in \{0,1\}^{|E|} \mid \mathbf{1}^{\top} d \le B\}$, where $d$ is a binary vector of length $|E|$. A value of 1 indicates honeypot deployment on the corresponding edge. The inner product condition $\mathbf{1}^{\top} d \le B$ ensures total deployments do not exceed the budget. Thus, any feasible action satisfies $d \in A_D$. To avoid trivial scenarios, we assume a finite budget that prevents full network coverage by decoys, necessitating strategic allocation to intercept adversaries and protect critical high-value nodes.
The defender’s utility function accounts for successful interceptions and wasted decoys. A honeypot is deemed successful if the attacker traverses its edge; otherwise, it represents a loss as the attacker bypasses detection. Successful node protections are weighted by node values, while losses are penalized by the value of compromised nodes. Since the defender’s rewards depend on the attacker’s actions, we next formalize the attacker model.
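To make the defender model concrete, the following Python sketch encodes a deployment as a binary vector over the edge set, checks the budget constraint, and scores an observed attack path. The helper names, the fixed waste cost, and the example values are illustrative assumptions rather than the exact utility function used in this work.

```python
# Illustrative sketch of the defender model: a deployment is a binary vector over the edge
# set, constrained by the honeypot budget B, and scored against an observed attack path.
# The node-value dictionary, the fixed waste cost, and the scoring rule are simplified
# assumptions consistent with the description above, not the paper's exact utility function.
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def is_feasible(deployment: Dict[Edge, int], budget: int) -> bool:
    """Binary deployment vector d with 1^T d <= B."""
    assert all(v in (0, 1) for v in deployment.values())
    return sum(deployment.values()) <= budget


def defender_utility(deployment: Dict[Edge, int], attack_path: List[Edge],
                     node_value: Dict[str, float], waste_cost: float = 0.1) -> float:
    """Credit interceptions with the protected node's value; penalize compromised nodes
    with their value and undisturbed (wasted) decoys with a small fixed cost."""
    utility = 0.0
    traversed = set()
    for edge in attack_path:
        traversed.add(edge)
        child = edge[1]                       # node reached by traversing this edge
        if deployment.get(edge, 0) == 1:
            utility += node_value[child]      # attacker trapped before compromising child
            break                             # assume the attack ends once trapped
        utility -= node_value[child]          # no honeypot on this edge: node compromised
    wasted = sum(v for e, v in deployment.items() if v == 1 and e not in traversed)
    return utility - waste_cost * wasted


# Example: two-hop attack path, budget of 1 honeypot (node weights as in Section 4)
edges = [("entry", "web"), ("web", "db")]
deployment = {("entry", "web"): 0, ("web", "db"): 1}
values = {"web": 0.7, "db": 1.5}
assert is_feasible(deployment, budget=1)
print(defender_utility(deployment, attack_path=edges, node_value=values))  # 0.8
```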
3.1.2. Attacker Model
The attacker aims to evade defensive strategies and compromise high-value target nodes. Starting from an entry node $v_0$, the attacker iteratively selects paths toward the target while avoiding detection. Let $P$ represent the set of all feasible paths between the entry and target nodes. The attacker’s action space comprises subsets of $P$, where each action corresponds to a stepwise progression along a path from $v_0$ to the target node, involving transitions between adjacent nodes.
The attacker prioritizes paths perceived as least defended. Each successful move grants rewards, enabling information gathering to bypass potential honeypots and compromise subsequent nodes.
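A simple way to picture the attacker’s path preference is as a weighted shortest-path search over the attacker’s perceived defense levels. The following sketch, using networkx with hypothetical edge weights, is a simplification of the attacker model rather than its exact formulation.

```python
# Sketch of the attacker model: choose the path from the entry node to the target that
# minimizes the perceived defense level, using networkx. Perceived weights are the
# attacker's beliefs (updated as information is gathered), not ground truth.
import networkx as nx

G = nx.DiGraph()
# edge attribute 'perceived_defense': attacker's current belief about honeypot presence
G.add_edge("entry", "web", perceived_defense=0.2)
G.add_edge("entry", "vpn", perceived_defense=0.6)
G.add_edge("web", "db", perceived_defense=0.4)
G.add_edge("vpn", "db", perceived_defense=0.1)

# least-defended path as perceived by the attacker
path = nx.shortest_path(G, source="entry", target="db", weight="perceived_defense")
print(path)  # ['entry', 'web', 'db'] given the weights above
```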
3.2. SD Honeypot Deployment Strategy Computation Based on Deep Reinforcement Learning Algorithms
The optimal honeypot deployment strategy is derived using deep reinforcement learning:
1. Model Initialization
State Space $S$: Define the system’s state space $S$, where each state $s \in S$ represents the current honeypot deployment in the network. The state vector includes honeypot deployment information for each node.
Action Space $A$: Define the action space $A$, where each action $a \in A$ corresponds to a honeypot migration operation. Actions include migrating a honeypot between nodes or deploying a new honeypot at a specific node.
Q-Network $Q(s, a; \theta)$: Initialize a deep neural network $Q(s, a; \theta)$ as the Q-value function, where $\theta$ denotes the network parameters. The network structure is a multilayer perceptron (MLP) with an input layer (dimension matching the state vector), hidden layers, and an output layer (dimension matching the action space size).
2. DQN Training
Reward Function $R(s, a)$: Design $R(s, a)$ to reflect defensive benefits, such as the number of attackers trapped by honeypots or the honeypot coverage rate.
Q-Value Update: Update Q-values using the Bellman equation. For each experience tuple $(s, a, r, s')$, the target Q-value is
$$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}),$$
where $\gamma$ is the discount factor and $\theta^{-}$ are the target network parameters.
$\epsilon$-Greedy Policy: Select actions using the $\epsilon$-greedy strategy, taking a random action with probability $\epsilon$ and the action $\arg\max_{a} Q(s, a; \theta)$ otherwise.
Experience Replay: Randomly sample mini-batches of experiences from the experience replay buffer for training. The buffer stores agent experiences $(s, a, r, s')$ for efficient sampling.
Network Update: Minimize the loss function $L(\theta) = \mathbb{E}\left[(y - Q(s, a; \theta))^{2}\right]$ using stochastic gradient descent (SGD) or other optimizers to update $\theta$.
Target Network Update: Periodically copy $\theta$ to the target network parameters $\theta^{-}$ to stabilize training.
State Update: Transition from the current state $s$ to the next state $s'$. Repeat until the stopping condition is met (i.e., the defense benefit of the deployment strategy reaches the near-optimal threshold, defined as the expected defense benefit of the Stackelberg game-based optimal strategy).
3. Computation of the Optimal Honeypot Deployment Strategy
Optimal Policy Extraction: After training, the optimal honeypot deployment strategy is extracted using the trained Q-network $Q(s, a; \theta)$. For each state $s$, the optimal action $a^{*}$ is selected by maximizing the Q-value:
$$a^{*} = \arg\max_{a} Q(s, a; \theta).$$
Optimal State Vector $s^{*}$: By executing the sequence of optimal actions derived from $Q(s, a; \theta)$, the optimal honeypot deployment strategy is obtained, corresponding to the state vector $s^{*}$.
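To summarize the procedure, the following Python/PyTorch sketch implements the training loop described above (MLP Q-network, $\epsilon$-greedy exploration, experience replay, Bellman targets, and periodic target-network synchronization). The environment object and its reset/step interface are assumed placeholders, and the hyperparameters mirror the experimental settings in Section 4; the sketch is illustrative rather than the exact implementation.

```python
# Minimal sketch of the DQN pipeline described above, assuming a PyTorch setup and a
# hypothetical environment whose reset()/step() return deployment state vectors.
# The env interface, state_dim, and n_actions are illustrative, not from the paper.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim


class QNetwork(nn.Module):
    """MLP Q-network: input = deployment state vector, output = one Q-value per action."""

    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_actions),
        )

    def forward(self, s):
        return self.net(s)


def train_dqn(env, state_dim, n_actions, episodes=1000, gamma=0.95,
              eps_start=0.9, eps_end=0.05, buffer_size=10_000,
              batch_size=64, target_sync=100):
    q_net = QNetwork(state_dim, n_actions)
    target_net = QNetwork(state_dim, n_actions)
    target_net.load_state_dict(q_net.state_dict())          # theta^- <- theta
    optimizer = optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=buffer_size)
    step = 0

    for ep in range(episodes):
        s = torch.tensor(env.reset(), dtype=torch.float32)
        eps = max(eps_end, eps_start - ep * (eps_start - eps_end) / episodes)
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = int(q_net(s).argmax().item())
            s_next, r, done = env.step(a)    # assumed (next_state, reward, done) interface
            s_next = torch.tensor(s_next, dtype=torch.float32)
            replay.append((s, a, r, s_next, done))
            s = s_next
            step += 1

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                bs = torch.stack([t[0] for t in batch])
                ba = torch.tensor([t[1] for t in batch])
                br = torch.tensor([t[2] for t in batch], dtype=torch.float32)
                bs2 = torch.stack([t[3] for t in batch])
                bdone = torch.tensor([t[4] for t in batch], dtype=torch.float32)
                # Bellman target: y = r + gamma * max_a' Q(s', a'; theta^-)
                with torch.no_grad():
                    y = br + gamma * (1 - bdone) * target_net(bs2).max(dim=1).values
                q = q_net(bs).gather(1, ba.unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(q, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if step % target_sync == 0:
                target_net.load_state_dict(q_net.state_dict())
    return q_net
```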
3.3. SD Honeypot Adjustment Cost Calculation
To provide the system with a quantitative indicator of the resources required to adjust honeypot deployments in dynamic attack environments, and to balance cost-effectiveness against unnecessary adjustments, the mechanism calculates a honeypot adjustment cost.
To calculate this cost, the mechanism collects honeypot deployment state vectors over a period, forming a dataset $X = \{s_1, s_2, \dots, s_N\}$. Each state vector $s_i$ in this dataset represents the honeypot deployment at the $i$-th timestep. The system then calculates the mean vector $\mu$ of these state vectors, which reflects the average of all state vectors, and computes the covariance matrix $\Sigma$ of dataset $X$, which captures the covariance relationships between the state vectors. Using these statistics, the system computes the Mahalanobis distance $D_M(s, s^{*})$ between the current state vector $s$ and the optimal state vector $s^{*}$, which quantifies the deviation between the two deployment strategies. Unlike the Euclidean distance, the Mahalanobis distance accounts for the correlation between different node states in the network topology, ensuring that the calculated adjustment cost reflects the logical complexity of reconfiguration rather than just the number of changed nodes. Finally, the honeypot adjustment cost is obtained by multiplying the Mahalanobis distance by a scaling factor $\alpha$, translating the deployment-strategy deviation into a concrete adjustment cost.
The specific calculation process for the honeypot adjustment cost is as follows:
1. Record State Vectors
Record state vectors during honeypot deployment over a period, forming the dataset $X = \{s_1, s_2, \dots, s_N\}$. Each state vector $s_i$ represents the honeypot deployment at the $i$-th timestep.
2. Compute Mean and Covariance Matrix
Mean Vector $\mu$: Calculate the mean vector $\mu$ of dataset $X$, representing the average of all state vectors:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} s_i,$$
where $N$ is the number of state vectors, and $s_i$ is the $i$-th state vector.
Covariance Matrix $\Sigma$: Compute the covariance matrix $\Sigma$ of $X$, reflecting the covariance between state vectors:
$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (s_i - \mu)(s_i - \mu)^{\top},$$
where $(s_i - \mu)$ is the deviation of the $i$-th state vector from the mean, and $(s_i - \mu)^{\top}$ is its transpose.
3. Compute Mahalanobis Distance
Mahalanobis Distance Formula: For the current state vector $s$ and the optimal state vector $s^{*}$, compute the Mahalanobis distance:
$$D_M(s, s^{*}) = \sqrt{(s - s^{*})^{\top} \Sigma^{-1} (s - s^{*})},$$
where $\Sigma^{-1}$ is the inverse of the covariance matrix.
Adjustment Cost Function:
$$C_{\mathrm{adj}} = \alpha \cdot D_M(s, s^{*}),$$
where $\alpha$ is the scaling factor that translates the deployment deviation into a concrete adjustment cost.
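A minimal NumPy sketch of this cost calculation is shown below. The variable names are illustrative, and the pseudo-inverse used to handle a possibly singular covariance matrix is an implementation choice not specified in the text.

```python
# NumPy sketch of the Section 3.3 adjustment cost: Mahalanobis distance between the
# current and optimal deployment vectors, scaled by a cost factor alpha.
import numpy as np


def adjustment_cost(history: np.ndarray, s_cur: np.ndarray, s_opt: np.ndarray,
                    alpha: float = 1.0) -> float:
    """history: (N, n_edges) matrix of recorded deployment state vectors."""
    mu = history.mean(axis=0)                         # mean vector of the dataset
    centered = history - mu
    sigma = centered.T @ centered / len(history)      # covariance matrix
    sigma_inv = np.linalg.pinv(sigma)                 # pseudo-inverse for robustness
    diff = s_cur - s_opt
    d_m = float(np.sqrt(diff @ sigma_inv @ diff))     # Mahalanobis distance D_M(s, s*)
    return alpha * d_m                                # C_adj = alpha * D_M


# Example: 200 recorded deployments over 10 candidate edges (synthetic data)
rng = np.random.default_rng(0)
hist = rng.integers(0, 2, size=(200, 10)).astype(float)
cost = adjustment_cost(hist, hist[-1], np.roll(hist[-1], 1), alpha=10.0)
print(f"adjustment cost: {cost:.2f}")
```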
3.4. System Benefit Gain Calculation and Adjustment Decision
1. Deviation Evaluation
The Mahalanobis distance $D_M(s, s^{*})$ between the current state $s$ and the optimal state $s^{*}$ quantifies the deviation between deployment strategies. If $D_M(s, s^{*})$ exceeds a predefined threshold, the system requires adjustment toward the optimal strategy.
2. Benefit Gain Calculation
The benefit gained from adjusting to the optimal strategy is
$$\Delta U = U(s^{*}) - U(s),$$
where $U(\cdot)$ denotes the expected defense benefit of a deployment strategy.
3. Cost–Benefit Decision Function
The adjustment decision is determined by comparing the benefit–cost ratio against two boundary parameters $\tau_1 < \tau_2$ (a minimal sketch of this decision and migration logic follows this list):
$$I = \begin{cases} 0, & \Delta U / C_{\mathrm{adj}} < \tau_1 \\ 1, & \tau_1 \le \Delta U / C_{\mathrm{adj}} < \tau_2 \\ 2, & \Delta U / C_{\mathrm{adj}} \ge \tau_2 \end{cases}$$
4. Decision Logic Based on I-Function Outcomes
When $I = 0$, the cost of adjusting the current honeypot deployment strategy to the optimal one is prohibitively high, or the expected benefit gain is negligible. The system maintains the current configuration to minimize operational costs.
When $I = 1$, minor adjustments to the honeypot deployment can improve the benefit–cost ratio. The system executes partial adjustments (e.g., migrating specific honeypots between nodes) to approximate the optimal strategy.
When $I = 2$, the performance gap between the current and optimal strategies is significant, and partial adjustments are insufficient. The system initiates full-scale adaptive adjustments, redeploying honeypots according to the optimal strategy to maximize defensive efficacy.
5. Cost–Benefit Considerations for Honeypot Quantity Changes
The framework accounts for scenarios where the optimal strategy may increase or decrease the number of honeypots. Adjustment costs include not only migration overhead but also resource allocation changes (e.g., provisioning new honeypots or decommissioning redundant ones). For instance, if the optimal strategy achieves superior defense with fewer honeypots, the reduced operational burden directly increases the benefit gain, incentivizing adjustment.
6. Node Adjustment Procedure
The node adjustment procedure initiates when system reconfiguration is required. First, the optimal strategy $s^{*}$ and the current deployment $s$ are compared to identify the source node set (nodes requiring honeypot removal) and the target node set (nodes requiring honeypot deployment). Subsequently, honeypots are incrementally migrated from the source set to the target set until full alignment with $s^{*}$ is achieved. The timing and conditions for migration are customizable based on network-specific parameters, including the node distance (the physical or logical distance between the source and target nodes), the node degree (the connectivity of the target node), and the communication latency between nodes. These user-configurable parameters ensure adaptability to diverse operational priorities while minimizing disruption during transitions.
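The following Python sketch illustrates the cost–benefit decision and the identification of migration source and target sets. The threshold defaults mirror the boundary parameters used in the experiments ($\tau_1 = 1.2$, $\tau_2 = 3.0$); the handling of a zero adjustment cost is our assumption, and the code is illustrative rather than the deployed mechanism.

```python
# Sketch of the I-function decision and source/target set identification. Deployment
# vectors are assumed to be binary per-edge indicators as in the defender model; the
# zero-cost branch is an assumption not stated in the text.
import numpy as np


def decide_adjustment(delta_u: float, c_adj: float,
                      tau1: float = 1.2, tau2: float = 3.0) -> int:
    """Return the I-function outcome: 0 = keep, 1 = partial adjustment, 2 = full redeploy."""
    if c_adj <= 0:                     # assumption: zero-cost adjustment is always worthwhile
        return 2 if delta_u > 0 else 0
    ratio = delta_u / c_adj
    if ratio < tau1:
        return 0
    if ratio < tau2:
        return 1
    return 2


def migration_sets(s_cur: np.ndarray, s_opt: np.ndarray):
    """Edges losing a honeypot (sources) and edges gaining one (targets)."""
    sources = np.flatnonzero((s_cur == 1) & (s_opt == 0))
    targets = np.flatnonzero((s_cur == 0) & (s_opt == 1))
    return sources.tolist(), targets.tolist()


# Example usage
s_cur = np.array([1, 0, 1, 0, 0, 1])
s_opt = np.array([0, 1, 1, 0, 1, 0])
action = decide_adjustment(delta_u=8.5, c_adj=3.1)
print(action, migration_sets(s_cur, s_opt))
```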
4. Results and Discussion
Our experimental simulation environment draws on established research designs. Focusing on honeypot capabilities for internal network protection and external attack capture, we employed Petri nets to model classical network topologies, incorporating service-providing nodes and strategically deployed honeypot nodes. The core acceleration module was implemented on an NVIDIA BlueField-2 DPU, leveraging the NVIDIA DOCA 1.4 SDK to offload and optimize network packet processing tasks. This hardware-software integration allows for efficient protocol parsing and real-time response under dynamic attack conditions. Network topology simulations were constructed using Mininet 2.3, modeling various connection patterns across networks containing 50 to 100 nodes.
In the Stackelberg game model, the defender’s initial budget was set to $B = 15$ honeypots, with the attacker’s reward weights for nodes of different value configured as 0.7 for low-value nodes and 1.5 for critical infrastructure nodes. The Deep Q-Network (DQN) adopted a three-layer architecture (128-64-32 neurons) with the ReLU activation function, trained using the Adam optimizer and equipped with an experience replay buffer of 10,000 samples. The exploration strategy employed an $\epsilon$-greedy algorithm with an initial $\epsilon = 0.9$, decaying to 0.05 after 1000 training episodes.
Each experiment simulated 500 rounds of interaction between defenders and attackers, with final statistical data based on averaged results from 20 independent trials to ensure experimental reliability. For adjustment decision thresholds, the Mahalanobis distance trigger threshold was empirically determined as 2.5 through preliminary testing, while the cost–benefit ratio boundary parameters were set to $\tau_1 = 1.2$ and $\tau_2 = 3.0$ to balance system responsiveness and stability in dynamic environments.
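For clarity, the experimental parameters above can be grouped into a single configuration object. The dictionary below is an illustrative summary (the key names are ours); all values are taken from the setup described in this section.

```python
# Experiment configuration as described in Section 4 (key names are illustrative).
EXPERIMENT_CONFIG = {
    "topology": {"simulator": "Mininet 2.3", "nodes": (50, 100)},
    "dpu": {"hardware": "NVIDIA BlueField-2", "sdk": "NVIDIA DOCA 1.4"},
    "stackelberg": {
        "defender_budget_honeypots": 15,
        "attacker_reward_weights": {"low_value_node": 0.7, "critical_node": 1.5},
    },
    "dqn": {
        "hidden_layers": (128, 64, 32),
        "activation": "ReLU",
        "optimizer": "Adam",
        "replay_buffer_size": 10_000,
        "epsilon_start": 0.9,
        "epsilon_end": 0.05,
        "epsilon_decay_episodes": 1000,
    },
    "evaluation": {"rounds_per_experiment": 500, "independent_trials": 20},
    "adjustment": {
        "mahalanobis_trigger_threshold": 2.5,
        "cost_benefit_bounds": {"tau1": 1.2, "tau2": 3.0},
    },
}
```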
The simulation replicates a dynamic network system under APT attacks with varying intensity levels, ranging from conservative to aggressive attacker profiles. Honeypots are interconnected with network components through diverse standards and interfaces, executing defense mechanisms tailored to specific APT attack phases. These include attacker entrapment, behavioral analysis, and threat intelligence generation from collected attack data. System interfaces utilize gRPC remote procedure calls for control systems and APIs for container communications.
Following environment initialization, honeypots are deployed according to the current network topology and stored in a honeypot list. This configuration enables dynamic adjustment of deployment strategies during ongoing attacker interactions. During attack–defense simulations, honeypot allocation follows algorithmically calculated deployment strategies. Attack operations are simulated by an advanced attacker model that records key metrics: successful attacker expulsions, attacker escapes, and node breaches in the network topology.
4.1. Simulation Results of Adaptive Honeypot Adjustment Strategy
Figure 2 illustrates the benefit comparison between the optimal honeypot deployment strategy derived from game theory and traditional honeypot deployment strategies across multiple interaction rounds. In the initial stages of the game, the SD honeypot deployment mechanism exhibits lower benefits due to insufficient interaction experience with attackers. However, as experience accumulates through ongoing interactions, its benefits stabilize around a value of 20, with an expected benefit calculated at approximately 20.3537. In contrast, the traditional honeypot deployment strategy initially yields positive returns because attackers lack historical knowledge of the system, and the strategy deploys numerous honeypots without cost considerations. Over time, however, the benefits of the traditional approach decline progressively as attackers gather sufficient information about honeypot placements, rendering the honeypots increasingly ineffective. Eventually, attackers can freely target critical nodes, reducing the system’s defensive utility to near zero.
Notably, the SD honeypot mechanism shows significant benefit fluctuations. The periodic oscillations observed in Figure 2 represent the ‘equilibrium hunting’ process. As attackers identify honeypot patterns, the defender’s benefit drops; however, the AASGRL mechanism triggers a reconfiguration (in 20% of rounds) to restore defensive entropy. Analysis reveals that advanced attackers actively collect system information during engagements. Once attackers identify specific nodes as honeypots, they adapt their strategies, temporarily diminishing the mechanism’s effectiveness in subsequent rounds and causing benefit drops. Nevertheless, the SD honeypot mechanism dynamically learns from these interactions, adjusting its deployment strategies in response to attacker behavior. This adaptability enables gradual benefit recovery in later rounds. Consequently, as attackers and defenders cyclically refine their tactics, system benefits exhibit periodic oscillations, reflecting the evolving nature of adversarial interactions.
Figure 3 documents critical operational metrics of the system, including the number of attackers, blocked attack attempts, successful defenses, and failed defenses. These metrics align with the system’s benefit trends illustrated in the benefit comparison graph, demonstrating consistent correlations between defensive performance and overall gains.
Figure 4 and Figure 5 present bar charts illustrating the variation in system adjustment costs and benefit gains across interaction rounds after integrating the honeypot adjustment mechanism. Statistical analysis reveals that adjustment costs are zero in approximately 80% of rounds, fluctuate between [80, 120] in 15% of cases, and range from [200, 350] in 5% of cases. Notably, after non-zero adjustment costs occur, the subsequent 50 rounds consistently show zero costs. Experimental conclusions indicate that in 80% of rounds, the system determines that the current honeypot deployment strategy either yields negligible benefit differences compared to the optimal strategy or incurs prohibitively high adjustment costs, leading to no deployment adjustments. In the remaining 20% of rounds, adjustments are executed, and the corresponding benefit gain chart confirms a measurable increase in defensive returns post-adjustment.
4.2. Analysis of Adaptive Honeypot Adjustment Simulation Results
Simulation experiments on the timing of SD honeypot adjustments and post-adjustment benefit gains under dynamic attacks demonstrate the following. Adjustment costs remain zero in 80% of rounds, fluctuate between [80, 120] in 15% of cases, and range from [200, 350] in 5% of cases. Consistent with prior observations, 80% of rounds involve no adjustments due to minimal benefit differences or excessive costs, while 20% of rounds trigger adjustments that align with observable benefit improvements in the statistical results.
In practical cloud data center scenarios, the AASGRL mechanism can be implemented as an automated security orchestration service. The AASGRL algorithm is deployed on a centralized security controller that communicates with distributed edge servers via gRPC to monitor network states. When the DRL-based agent detects that a deployment adjustment is necessary—based on the Mahalanobis distance cost–benefit evaluation—the controller automatically triggers the NVIDIA BlueField-2 DPU’s programmable pipelines to update honeypot configurations and ISP route simulations in real-time. This architecture allows for seamless, low-latency defense updates across large-scale infrastructures with minimal human intervention, effectively countering dynamic APT attacks while maintaining high operational efficiency and low adjustment costs.