1. Introduction
Software-defined (SD) honeypots represent a transformative paradigm in active cyber defense, moving beyond static configurations to achieve greater flexibility and scalability. Early surveys by Bringer et al. [1] highlighted the necessity of evolving honeypot research to address dynamic threats. By integrating Software-Defined Networking (SDN) concepts, modern honeypots can leverage dynamic configuration and intelligent orchestration. This adaptability is crucial for implementing Moving Target Defense (MTD) strategies, as demonstrated by Duy et al. [2], which prevent attackers from easily mapping defensive postures. Furthermore, advancements in virtualization, such as the live migration techniques proposed by Hirata et al. [3], allow defenders to reshape network topologies in real time. This evolution addresses the diminishing returns of static defenses by enabling systems to adapt to evolving attack patterns [4] and mitigate insider reconnaissance threats through virtualized network deception [5]. Consequently, game-theoretic approaches are increasingly applied to optimize the allocation of these dynamic resources [6], ensuring robust defense even under uncertainty.
Recent advances in programmable data planes and hardware acceleration have further expanded the horizons of SD honeypots. Surveys by Han et al. [7] and Parizotto et al. [8] reveal that integrating DPUs with deception mechanisms can reduce reconfiguration latency by up to 90%, yet their application in honeypot architectures remains nascent. Concurrently, game-theoretic frameworks have evolved to address multi-attacker scenarios and dynamic graphs, while deep reinforcement learning has been successfully applied to adaptive MTD strategies. These developments underscore the necessity of a unified framework that synergizes hardware acceleration, strategic commitment, and adaptive learning: precisely the gap this work aims to fill.
Despite these advancements, traditional SD honeypots often face performance bottlenecks. The Data Processing Unit (DPU) has recently emerged as a critical architecture for accelerating data-centric workloads. Integrating high-performance network interfaces and programmable cores, DPUs are engineered to offload computationally intensive tasks—such as network stack processing, encryption, and storage management—from the host CPU. This offloading capability significantly enhances system throughput and reduces latency. While DPUs are established in network function virtualization and software-defined storage, their potential to optimize SD honeypot architectures remains underutilized.
The rapid evolution of DPU technology offers a pathway to resolve the latency and jitter issues inherent in traditional SD honeypots, particularly under unstable network conditions. By offloading traffic handling and protocol parsing to the DPU, defenders can achieve high-fidelity emulation without compromising response times. However, full-scale deployment remains challenging due to resource overheads. A hybrid deployment model—combining local high-interaction nodes with DPU-accelerated SD infrastructure—offers a viable solution. Furthermore, integrating the DPU’s programmable data plane with Deep Reinforcement Learning (DRL) enables hardware-assisted, adaptive decision-making.
Nevertheless, a critical optimization challenge persists: defenders must balance the cost of adjustment against defensive effectiveness when reconfiguring honeypots. Over-optimization can lead to diminishing returns, where high-cost adjustments yield negligible improvements. To address this, we introduce the Adaptive Adjustment Mechanism for Software-Defined Honeypots based on Stackelberg Game and Deep Reinforcement Learning (AASGRL). This mechanism maintains robust baseline defenses while progressively optimizing configurations through adaptive learning, ensuring a cost-aware active defense suitable for real-world deployment. The novelty of this work is threefold:
Hardware-Level Deception: Unlike software-only approaches, we offload stateful protocol emulation to the DPU data plane to eliminate the 10–100 ms jitter common in traditional migrations.
Hybrid Strategic-Adaptive Logic: We bridge the gap between static strategic commitment (Stackelberg) and ex-post adaptation (DRL), a coupling currently missing in MTD research.
Cost-Awareness via Mahalanobis Distance: We introduce a quantitative metric to prevent over-optimization, addressing the scalability issues in enterprise-scale environments.
2. Related Work
Research on software-defined honeypots has expanded significantly, focusing on dynamic traffic control, situational awareness, and strategy optimization.
2.1. Optimization of Honeypot Allocation and Deployment
With the maturation of SDN and Network Function Virtualization (NFV), research has shifted toward fine-grained traffic management and resource allocation. Anwar and Kamhoua [9] proposed frameworks for optimal honeypot allocation under uncertainty, demonstrating that strategic placement significantly increases the cost for attackers. This work was further extended by Sayed et al. [10] to address dynamic tactical networks, where topology changes necessitate rapid reconfiguration. To address resource constraints in such environments, Acosta et al. [11] developed lightweight, on-demand deployment strategies that minimize overhead while maintaining deception fidelity. Furthermore, recent studies have utilized hypergame theory [12] and game-theoretic analyses of attacker-defender interactions [13] to model and counter malicious reconnaissance and website defacement attempts effectively.
2.2. Intelligent and Adaptive Deception Strategies
Traditional static honeypots lack the responsiveness required to counter advanced persistent threats. Consequently, the focus has shifted to dynamic architectures that employ intelligent decision-making. Pawlick et al. [14] emphasized the importance of optimal timing in dynamic engagement, arguing that the duration and frequency of deception are critical to its success. Building on this, Huang and Zhu [15] introduced Reinforcement Learning (RL) over semi-Markov decision processes to enable honeypots to learn from attacker interactions autonomously. Similarly, Shrivastava et al. [16] applied modified Naïve-Bayes algorithms within a game-theoretic context to enhance the detection of Denial of Service (DoS) attacks.
Foundational work in this domain has established rigorous theoretical frameworks for cyber deception [17]. For instance, Ferguson-Walter et al. [18] explored adaptive deception through behavioral game theory, while Cranford et al. [19] incorporated cognitive theories to better understand attacker biases. To quantify the effectiveness of these strategies, Jajodia et al. [20] developed probabilistic logic models for deception. Additionally, maximizing network diversity has been identified as a key metric for resilience; Sarr et al. [21] and Zhang et al. [22] demonstrated that software and network diversity significantly hinder zero-day attacks. Specific applications, such as the IoTPOT project by Pa et al. [23], have further validated the utility of diverse, distributed honeypots in capturing specialized threat intelligence.
2.3. Hardware Acceleration in Security
In the realm of data center security, DPUs are increasingly leveraged to isolate control planes and accelerate security workloads. Firestone et al. [24] demonstrated that SmartNICs can achieve sub-microsecond packet processing latency, while Parizotto et al. [8] surveyed security function offloading, reporting 3–5× throughput improvements. Elizalde et al. [25] specifically evaluated SmartNIC-based IDS, showing 40% CPU offload with negligible latency overhead. Hu et al. [26] highlighted the DPU’s role in offloading complex network functions, while Zheng et al. [27] discussed the customization of DPUs for computing and network convergence. Despite these advances, three critical gaps persist: (1) no prior work has offloaded stateful deception logic to DPU data planes; (2) DPU-accelerated attack graph updates for real-time honeypot reconfiguration remain unexplored; (3) the cost–benefit trade-off of DPU-based adaptive strategies lacks quantitative analysis. This paper bridges these gaps by proposing a DPU-enhanced architecture managed by the AASGRL mechanism, where programmable pipelines execute real-time protocol emulation and Mahalanobis distance computation for cost-aware adjustments.
2.4. Research Gaps and Motivation
Despite existing advances, three fundamental limitations motivate our work:
Gap 1: DPU-Enabled Deception Logic. While DPUs excel at generic network functions, no prior architecture offloads honeypot protocol emulation and attack graph traversal to programmable pipelines, leaving 10–100 ms jitter vulnerabilities during migration.
Gap 2: Hybrid Strategic-Adaptive Frameworks. Pure Stackelberg models lack ex-post adaptation, while independent DRL approaches ignore strategic commitment. The coupling of these paradigms for cost-aware reconfiguration lacks a theoretical foundation and hardware support.
Gap 3: Scalability vs. Cost Awareness. State-of-the-art placement algorithms scale only to ~100 nodes, whereas enterprise AD environments exceed 10^5 nodes. Moreover, existing RL methods ignore adjustment cost penalties, leading to over-optimization and diminishing returns.
3. Design and Implementation
Figure 1 illustrates the architecture of AASGRL—an adaptive honeypot adjustment system integrating DPU acceleration, Stackelberg game theory, and deep reinforcement learning. Operating within cloud security provider environments, the system is designed to address dynamic network attack scenarios through interactions with cloud security tenants and potential attackers via ISP networks. The DPU-accelerated subsystem offloads underlying network processing tasks, enhancing the efficiency of honeypot environment deployment while improving real-time responsiveness to mitigate network fluctuation impacts.
After a period of attacker-defender interaction, the expected defense benefit of the optimal SD honeypot deployment strategy derived from the Stackelberg game-based mechanism is calculated and used as a decision threshold for evaluating optimality. The adaptive adjustment mechanism (AASGRL) integrates the system’s network topology, dynamic attack strategies, and current honeypot deployment status. It uses this Stackelberg-derived benefit expectation as a threshold in deep reinforcement learning to determine whether the current deployment approximates the optimal strategy, thereby computing a near-optimal SD honeypot deployment strategy for the current network.
Subsequently, the mechanism calculates both the adjustment cost required to transition from the current deployment to the near-optimal strategy and the corresponding defense benefit gain. Based on this analysis, it determines whether and how the defender should adjust the honeypot deployment.
3.1. System Model
3.1.1. Defender Model
The defender’s objective is to protect target nodes by strategically deploying decoy systems. To maintain the authenticity of the simulated network environment, honeypots are embedded at each node interface. Attackers must compromise a node’s direct parent node before accessing it. The defender determines honeypot placement by evaluating node criticality, available budget, and the attacker’s potential next target. The defender’s action space involves selecting device interfaces for honeypot deployment within budget constraints.
Let $B$ denote the maximum number of deployable honeypots under current budget limitations. Given the edge set $E$, the defender’s action space is defined as $A_D = \{d \in \{0,1\}^{|E|} \mid \mathbf{1}^{\top} d \le B\}$, where $d$ is a binary vector of length $|E|$. A value of 1 indicates honeypot deployment on the corresponding edge. The inner product condition $\mathbf{1}^{\top} d \le B$ ensures total deployments do not exceed the budget. Thus, any feasible action satisfies $d \in A_D$. To avoid trivial scenarios, we assume a finite budget that prevents full network coverage by decoys, necessitating strategic allocation to intercept adversaries and protect critical high-value nodes.
The defender’s utility function accounts for successful interceptions and wasted decoys. A honeypot is deemed successful if the attacker traverses its edge; otherwise, it represents a loss as the attacker bypasses detection. Successful node protections are weighted by node values, while losses are penalized by the value of compromised nodes. Since the defender’s rewards depend on the attacker’s actions, we next formalize the attacker model.
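To make the defender model concrete, the following Python sketch encodes a deployment as a binary vector over the edge set, checks the budget constraint, and scores an observed attack path. The helper names, the fixed waste cost, and the example values are illustrative assumptions rather than the exact utility function used in this work.

```python
# Illustrative sketch of the defender model: a deployment is a binary vector over the edge
# set, constrained by the honeypot budget B, and scored against an observed attack path.
# The node-value dictionary, the fixed waste cost, and the scoring rule are simplified
# assumptions consistent with the description above, not the paper's exact utility function.
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def is_feasible(deployment: Dict[Edge, int], budget: int) -> bool:
    """Binary deployment vector d with 1^T d <= B."""
    assert all(v in (0, 1) for v in deployment.values())
    return sum(deployment.values()) <= budget


def defender_utility(deployment: Dict[Edge, int], attack_path: List[Edge],
                     node_value: Dict[str, float], waste_cost: float = 0.1) -> float:
    """Credit interceptions with the protected node's value; penalize compromised nodes
    with their value and undisturbed (wasted) decoys with a small fixed cost."""
    utility = 0.0
    traversed = set()
    for edge in attack_path:
        traversed.add(edge)
        child = edge[1]                       # node reached by traversing this edge
        if deployment.get(edge, 0) == 1:
            utility += node_value[child]      # attacker trapped before compromising child
            break                             # assume the attack ends once trapped
        utility -= node_value[child]          # no honeypot on this edge: node compromised
    wasted = sum(v for e, v in deployment.items() if v == 1 and e not in traversed)
    return utility - waste_cost * wasted


# Example: two-hop attack path, budget of 1 honeypot (node weights as in Section 4)
edges = [("entry", "web"), ("web", "db")]
deployment = {("entry", "web"): 0, ("web", "db"): 1}
values = {"web": 0.7, "db": 1.5}
assert is_feasible(deployment, budget=1)
print(defender_utility(deployment, attack_path=edges, node_value=values))  # 0.8
```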
3.1.2. Attacker Model
The attacker aims to evade defensive strategies and compromise high-value target nodes. Starting from an entry node $v_0$, the attacker iteratively selects paths toward the target while avoiding detection. Let $P$ represent the set of all feasible paths between the entry and target nodes. The attacker’s action space comprises subsets of $P$, where each action corresponds to a stepwise progression along a path from $v_0$ to the target node, involving transitions between adjacent nodes.
The attacker prioritizes paths perceived as least defended. Each successful move grants rewards, enabling information gathering to bypass potential honeypots and compromise subsequent nodes.
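A simple way to picture the attacker’s path preference is as a weighted shortest-path search over the attacker’s perceived defense levels. The following sketch, using networkx with hypothetical edge weights, is a simplification of the attacker model rather than its exact formulation.

```python
# Sketch of the attacker model: choose the path from the entry node to the target that
# minimizes the perceived defense level, using networkx. Perceived weights are the
# attacker's beliefs (updated as information is gathered), not ground truth.
import networkx as nx

G = nx.DiGraph()
# edge attribute 'perceived_defense': attacker's current belief about honeypot presence
G.add_edge("entry", "web", perceived_defense=0.2)
G.add_edge("entry", "vpn", perceived_defense=0.6)
G.add_edge("web", "db", perceived_defense=0.4)
G.add_edge("vpn", "db", perceived_defense=0.1)

# least-defended path as perceived by the attacker
path = nx.shortest_path(G, source="entry", target="db", weight="perceived_defense")
print(path)  # ['entry', 'web', 'db'] given the weights above
```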
3.2. SD Honeypot Deployment Strategy Computation Based on Deep Reinforcement Learning Algorithms
The optimal honeypot deployment strategy is derived using deep reinforcement learning:
1. Model Initialization
State Space $S$: Define the system’s state space $S$, where each state $s \in S$ represents the current honeypot deployment in the network. The state vector includes honeypot deployment information for each node.
Action Space $A$: Define the action space $A$, where each action $a \in A$ corresponds to a honeypot migration operation. Actions include migrating a honeypot between nodes or deploying a new honeypot at a specific node.
Q-Network $Q(s, a; \theta)$: Initialize a deep neural network $Q(s, a; \theta)$ as the Q-value function, where $\theta$ denotes the network parameters. The network structure is a multilayer perceptron (MLP) with an input layer (dimension matching the state vector), hidden layers, and an output layer (dimension matching the action space size).
2. DQN Training
Reward Function $R(s, a)$: Design $R(s, a)$ to reflect defensive benefits, such as the number of attackers trapped by honeypots or the honeypot coverage rate.
Q-Value Update: Update Q-values using the Bellman equation. For each experience tuple $(s, a, r, s')$, the target Q-value is
$$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}),$$
where $\gamma$ is the discount factor and $\theta^{-}$ are the target network parameters.
$\epsilon$-Greedy Policy: Select actions using the $\epsilon$-greedy strategy, taking a random action with probability $\epsilon$ and the action $\arg\max_{a} Q(s, a; \theta)$ otherwise.
Experience Replay: Randomly sample mini-batches of experiences from the experience replay buffer for training. The buffer stores agent experiences $(s, a, r, s')$ for efficient sampling.
Network Update: Minimize the loss function $L(\theta) = \mathbb{E}\left[(y - Q(s, a; \theta))^{2}\right]$ using stochastic gradient descent (SGD) or other optimizers to update $\theta$.
Target Network Update: Periodically copy $\theta$ to the target network parameters $\theta^{-}$ to stabilize training.
State Update: Transition from the current state $s$ to the next state $s'$. Repeat until the stopping condition is met (i.e., the defense benefit of the deployment strategy reaches the near-optimal threshold, defined as the expected defense benefit of the Stackelberg game-based optimal strategy).
3. Computation of the Optimal Honeypot Deployment Strategy
Optimal Policy Extraction: After training, the optimal honeypot deployment strategy is extracted using the trained Q-network $Q(s, a; \theta)$. For each state $s$, the optimal action $a^{*}$ is selected by maximizing the Q-value:
$$a^{*} = \arg\max_{a} Q(s, a; \theta).$$
Optimal State Vector $s^{*}$: By executing the sequence of optimal actions derived from $Q(s, a; \theta)$, the optimal honeypot deployment strategy is obtained, corresponding to the state vector $s^{*}$.
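To summarize the procedure, the following Python/PyTorch sketch implements the training loop described above (MLP Q-network, $\epsilon$-greedy exploration, experience replay, Bellman targets, and periodic target-network synchronization). The environment object and its reset/step interface are assumed placeholders, and the hyperparameters mirror the experimental settings in Section 4; the sketch is illustrative rather than the exact implementation.

```python
# Minimal sketch of the DQN pipeline described above, assuming a PyTorch setup and a
# hypothetical environment whose reset()/step() return deployment state vectors.
# The env interface, state_dim, and n_actions are illustrative, not from the paper.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim


class QNetwork(nn.Module):
    """MLP Q-network: input = deployment state vector, output = one Q-value per action."""

    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_actions),
        )

    def forward(self, s):
        return self.net(s)


def train_dqn(env, state_dim, n_actions, episodes=1000, gamma=0.95,
              eps_start=0.9, eps_end=0.05, buffer_size=10_000,
              batch_size=64, target_sync=100):
    q_net = QNetwork(state_dim, n_actions)
    target_net = QNetwork(state_dim, n_actions)
    target_net.load_state_dict(q_net.state_dict())          # theta^- <- theta
    optimizer = optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=buffer_size)
    step = 0

    for ep in range(episodes):
        s = torch.tensor(env.reset(), dtype=torch.float32)
        eps = max(eps_end, eps_start - ep * (eps_start - eps_end) / episodes)
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = int(q_net(s).argmax().item())
            s_next, r, done = env.step(a)    # assumed (next_state, reward, done) interface
            s_next = torch.tensor(s_next, dtype=torch.float32)
            replay.append((s, a, r, s_next, done))
            s = s_next
            step += 1

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                bs = torch.stack([t[0] for t in batch])
                ba = torch.tensor([t[1] for t in batch])
                br = torch.tensor([t[2] for t in batch], dtype=torch.float32)
                bs2 = torch.stack([t[3] for t in batch])
                bdone = torch.tensor([t[4] for t in batch], dtype=torch.float32)
                # Bellman target: y = r + gamma * max_a' Q(s', a'; theta^-)
                with torch.no_grad():
                    y = br + gamma * (1 - bdone) * target_net(bs2).max(dim=1).values
                q = q_net(bs).gather(1, ba.unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(q, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if step % target_sync == 0:
                target_net.load_state_dict(q_net.state_dict())
    return q_net
```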
3.3. SD Honeypot Adjustment Cost Calculation
To provide the system with a quantitative indicator of the resources required to adjust honeypot deployments in dynamic attack environments, and to balance cost-effectiveness against unnecessary adjustments, the mechanism calculates a honeypot adjustment cost.
To calculate this cost, the mechanism collects honeypot deployment state vectors over a period, forming a dataset $X = \{s_1, s_2, \dots, s_N\}$. Each state vector $s_i$ in this dataset represents the honeypot deployment at the $i$-th timestep. The system then calculates the mean vector $\mu$ of these state vectors, which reflects the average of all state vectors, and computes the covariance matrix $\Sigma$ of dataset $X$, which captures the covariance relationships between the state vectors. Using these statistics, the system computes the Mahalanobis distance $D_M(s, s^{*})$ between the current state vector $s$ and the optimal state vector $s^{*}$, which quantifies the deviation between the two deployment strategies. Unlike the Euclidean distance, the Mahalanobis distance accounts for the correlation between different node states in the network topology, ensuring that the calculated adjustment cost reflects the logical complexity of reconfiguration rather than just the number of changed nodes. Finally, the honeypot adjustment cost is obtained by multiplying the Mahalanobis distance by a scaling factor $\alpha$, translating the deployment-strategy deviation into a concrete adjustment cost.
The specific calculation process for the honeypot adjustment cost is as follows:
1. Record State Vectors
Record state vectors during honeypot deployment over a period, forming the dataset $X = \{s_1, s_2, \dots, s_N\}$. Each state vector $s_i$ represents the honeypot deployment at the $i$-th timestep.
2. Compute Mean and Covariance Matrix
Mean Vector $\mu$: Calculate the mean vector $\mu$ of dataset $X$, representing the average of all state vectors:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} s_i,$$
where $N$ is the number of state vectors, and $s_i$ is the $i$-th state vector.
Covariance Matrix $\Sigma$: Compute the covariance matrix $\Sigma$ of $X$, reflecting the covariance between state vectors:
$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (s_i - \mu)(s_i - \mu)^{\top},$$
where $(s_i - \mu)$ is the deviation of the $i$-th state vector from the mean, and $(s_i - \mu)^{\top}$ is its transpose.
3. Compute Mahalanobis Distance
Mahalanobis Distance Formula: For the current state vector $s$ and the optimal state vector $s^{*}$, compute the Mahalanobis distance:
$$D_M(s, s^{*}) = \sqrt{(s - s^{*})^{\top} \Sigma^{-1} (s - s^{*})},$$
where $\Sigma^{-1}$ is the inverse of the covariance matrix.
Adjustment Cost Function:
$$C_{\mathrm{adj}} = \alpha \cdot D_M(s, s^{*}),$$
where $\alpha$ is the scaling factor that translates the deployment deviation into a concrete adjustment cost.
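A minimal NumPy sketch of this cost calculation is shown below. The variable names are illustrative, and the pseudo-inverse used to handle a possibly singular covariance matrix is an implementation choice not specified in the text.

```python
# NumPy sketch of the Section 3.3 adjustment cost: Mahalanobis distance between the
# current and optimal deployment vectors, scaled by a cost factor alpha.
import numpy as np


def adjustment_cost(history: np.ndarray, s_cur: np.ndarray, s_opt: np.ndarray,
                    alpha: float = 1.0) -> float:
    """history: (N, n_edges) matrix of recorded deployment state vectors."""
    mu = history.mean(axis=0)                         # mean vector of the dataset
    centered = history - mu
    sigma = centered.T @ centered / len(history)      # covariance matrix
    sigma_inv = np.linalg.pinv(sigma)                 # pseudo-inverse for robustness
    diff = s_cur - s_opt
    d_m = float(np.sqrt(diff @ sigma_inv @ diff))     # Mahalanobis distance D_M(s, s*)
    return alpha * d_m                                # C_adj = alpha * D_M


# Example: 200 recorded deployments over 10 candidate edges (synthetic data)
rng = np.random.default_rng(0)
hist = rng.integers(0, 2, size=(200, 10)).astype(float)
cost = adjustment_cost(hist, hist[-1], np.roll(hist[-1], 1), alpha=10.0)
print(f"adjustment cost: {cost:.2f}")
```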
3.4. System Benefit Gain Calculation and Adjustment Decision
1. Deviation Evaluation
The Mahalanobis distance $D_M(s, s^{*})$ between the current state $s$ and the optimal state $s^{*}$ quantifies the deviation between deployment strategies. If $D_M(s, s^{*})$ exceeds a predefined threshold, the system requires adjustment toward the optimal strategy.
2. Benefit Gain Calculation
The benefit gained from adjusting to the optimal strategy is
$$\Delta U = U(s^{*}) - U(s),$$
where $U(\cdot)$ denotes the expected defense benefit of a deployment strategy.
3. Cost–Benefit Decision Function
The adjustment decision is determined by comparing the benefit–cost ratio against two boundary parameters $\tau_1 < \tau_2$ (a minimal sketch of this decision and migration logic follows this list):
$$I = \begin{cases} 0, & \Delta U / C_{\mathrm{adj}} < \tau_1 \\ 1, & \tau_1 \le \Delta U / C_{\mathrm{adj}} < \tau_2 \\ 2, & \Delta U / C_{\mathrm{adj}} \ge \tau_2 \end{cases}$$
4. Decision Logic Based on I-Function Outcomes
When $I = 0$, the cost of adjusting the current honeypot deployment strategy to the optimal one is prohibitively high, or the expected benefit gain is negligible. The system maintains the current configuration to minimize operational costs.
When $I = 1$, minor adjustments to the honeypot deployment can improve the benefit–cost ratio. The system executes partial adjustments (e.g., migrating specific honeypots between nodes) to approximate the optimal strategy.
When $I = 2$, the performance gap between the current and optimal strategies is significant, and partial adjustments are insufficient. The system initiates full-scale adaptive adjustments, redeploying honeypots according to the optimal strategy to maximize defensive efficacy.
5. Cost–Benefit Considerations for Honeypot Quantity Changes
The framework accounts for scenarios where the optimal strategy may increase or decrease the number of honeypots. Adjustment costs include not only migration overhead but also resource allocation changes (e.g., provisioning new honeypots or decommissioning redundant ones). For instance, if the optimal strategy achieves superior defense with fewer honeypots, the reduced operational burden directly increases the benefit gain, incentivizing adjustment.
6. Node Adjustment Procedure
The node adjustment procedure initiates when system reconfiguration is required. First, the optimal strategy $s^{*}$ and the current deployment $s$ are compared to identify the source node set (nodes requiring honeypot removal) and the target node set (nodes requiring honeypot deployment). Subsequently, honeypots are incrementally migrated from the source set to the target set until full alignment with $s^{*}$ is achieved. The timing and conditions for migration are customizable based on network-specific parameters, including the node distance (the physical or logical distance between the source and target nodes), the node degree (the connectivity of the target node), and the communication latency between nodes. These user-configurable parameters ensure adaptability to diverse operational priorities while minimizing disruption during transitions.
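The following Python sketch illustrates the cost–benefit decision and the identification of migration source and target sets. The threshold defaults mirror the boundary parameters used in the experiments ($\tau_1 = 1.2$, $\tau_2 = 3.0$); the handling of a zero adjustment cost is our assumption, and the code is illustrative rather than the deployed mechanism.

```python
# Sketch of the I-function decision and source/target set identification. Deployment
# vectors are assumed to be binary per-edge indicators as in the defender model; the
# zero-cost branch is an assumption not stated in the text.
import numpy as np


def decide_adjustment(delta_u: float, c_adj: float,
                      tau1: float = 1.2, tau2: float = 3.0) -> int:
    """Return the I-function outcome: 0 = keep, 1 = partial adjustment, 2 = full redeploy."""
    if c_adj <= 0:                     # assumption: zero-cost adjustment is always worthwhile
        return 2 if delta_u > 0 else 0
    ratio = delta_u / c_adj
    if ratio < tau1:
        return 0
    if ratio < tau2:
        return 1
    return 2


def migration_sets(s_cur: np.ndarray, s_opt: np.ndarray):
    """Edges losing a honeypot (sources) and edges gaining one (targets)."""
    sources = np.flatnonzero((s_cur == 1) & (s_opt == 0))
    targets = np.flatnonzero((s_cur == 0) & (s_opt == 1))
    return sources.tolist(), targets.tolist()


# Example usage
s_cur = np.array([1, 0, 1, 0, 0, 1])
s_opt = np.array([0, 1, 1, 0, 1, 0])
action = decide_adjustment(delta_u=8.5, c_adj=3.1)
print(action, migration_sets(s_cur, s_opt))
```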
4. Results and Discussion
Our experimental simulation environment draws on established research designs. Focusing on honeypot capabilities for internal network protection and external attack capture, we employed Petri nets to model classical network topologies, incorporating service-providing nodes and strategically deployed honeypot nodes. The core acceleration module was implemented on an NVIDIA BlueField-2 DPU, leveraging the NVIDIA DOCA 1.4 SDK to offload and optimize network packet processing tasks. This hardware-software integration allows for efficient protocol parsing and real-time response under dynamic attack conditions. Network topology simulations were constructed using Mininet 2.3, modeling various connection patterns across networks containing 50 to 100 nodes.
In the Stackelberg game model, the defender’s initial budget was set to $B = 15$ honeypots, with the attacker’s reward weights for nodes of different value configured as 0.7 for low-value nodes and 1.5 for critical infrastructure nodes. The Deep Q-Network (DQN) adopted a three-layer architecture (128-64-32 neurons) with the ReLU activation function, trained using the Adam optimizer and equipped with an experience replay buffer of 10,000 samples. The exploration strategy employed an $\epsilon$-greedy algorithm with an initial $\epsilon = 0.9$, decaying to 0.05 after 1000 training episodes.
Each experiment simulated 500 rounds of interaction between defenders and attackers, with final statistical data based on averaged results from 20 independent trials to ensure experimental reliability. For adjustment decision thresholds, the Mahalanobis distance trigger threshold was empirically determined as 2.5 through preliminary testing, while the cost–benefit ratio boundary parameters were set to $\tau_1 = 1.2$ and $\tau_2 = 3.0$ to balance system responsiveness and stability in dynamic environments.
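For clarity, the experimental parameters above can be grouped into a single configuration object. The dictionary below is an illustrative summary (the key names are ours); all values are taken from the setup described in this section.

```python
# Experiment configuration as described in Section 4 (key names are illustrative).
EXPERIMENT_CONFIG = {
    "topology": {"simulator": "Mininet 2.3", "nodes": (50, 100)},
    "dpu": {"hardware": "NVIDIA BlueField-2", "sdk": "NVIDIA DOCA 1.4"},
    "stackelberg": {
        "defender_budget_honeypots": 15,
        "attacker_reward_weights": {"low_value_node": 0.7, "critical_node": 1.5},
    },
    "dqn": {
        "hidden_layers": (128, 64, 32),
        "activation": "ReLU",
        "optimizer": "Adam",
        "replay_buffer_size": 10_000,
        "epsilon_start": 0.9,
        "epsilon_end": 0.05,
        "epsilon_decay_episodes": 1000,
    },
    "evaluation": {"rounds_per_experiment": 500, "independent_trials": 20},
    "adjustment": {
        "mahalanobis_trigger_threshold": 2.5,
        "cost_benefit_bounds": {"tau1": 1.2, "tau2": 3.0},
    },
}
```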
The simulation replicates a dynamic network system under APT attacks with varying intensity levels, ranging from conservative to aggressive attacker profiles. Honeypots are interconnected with network components through diverse standards and interfaces, executing defense mechanisms tailored to specific APT attack phases. These include attacker entrapment, behavioral analysis, and threat intelligence generation from collected attack data. System interfaces utilize gRPC remote procedure calls for control systems and APIs for container communications.
Following environment initialization, honeypots are deployed according to the current network topology and stored in a honeypot list. This configuration enables dynamic adjustment of deployment strategies during ongoing attacker interactions. During attack–defense simulations, honeypot allocation follows algorithmically calculated deployment strategies. Attack operations are simulated by an advanced attacker model that records key metrics: successful attacker expulsions, attacker escapes, and node breaches in the network topology.
4.1. Simulation Results of Adaptive Honeypot Adjustment Strategy
Figure 2 illustrates the benefit comparison between the optimal honeypot deployment strategy derived from game theory and traditional honeypot deployment strategies across multiple interaction rounds. In the initial stages of the game, the SD honeypot deployment mechanism exhibits lower benefits due to insufficient interaction experience with attackers. However, as experience accumulates through ongoing interactions, its benefits stabilize around a value of 20, with an expected benefit calculated at approximately 20.3537. In contrast, the traditional honeypot deployment strategy initially yields positive returns because attackers lack historical knowledge of the system, and the strategy deploys numerous honeypots without cost considerations. Over time, however, the benefits of the traditional approach decline progressively as attackers gather sufficient information about honeypot placements, rendering the honeypots increasingly ineffective. Eventually, attackers can freely target critical nodes, reducing the system’s defensive utility to near zero.
Notably, the SD honeypot mechanism shows significant benefit fluctuations. The periodic oscillations observed in Figure 2 represent the ‘equilibrium hunting’ process. As attackers identify honeypot patterns, the defender’s benefit drops; however, the AASGRL mechanism triggers a reconfiguration (in 20% of rounds) to restore defensive entropy. Analysis reveals that advanced attackers actively collect system information during engagements. Once attackers identify specific nodes as honeypots, they adapt their strategies, temporarily diminishing the mechanism’s effectiveness in subsequent rounds and causing benefit drops. Nevertheless, the SD honeypot mechanism dynamically learns from these interactions, adjusting its deployment strategies in response to attacker behavior. This adaptability enables gradual benefit recovery in later rounds. Consequently, as attackers and defenders cyclically refine their tactics, system benefits exhibit periodic oscillations, reflecting the evolving nature of adversarial interactions.
Figure 3 documents critical operational metrics of the system, including the number of attackers, blocked attack attempts, successful defenses, and failed defenses. These metrics align with the system’s benefit trends illustrated in the benefit comparison graph, demonstrating consistent correlations between defensive performance and overall gains.
Figure 4 and Figure 5 present bar charts illustrating the variation in system adjustment costs and benefit gains across interaction rounds after integrating the honeypot adjustment mechanism. Statistical analysis reveals that adjustment costs are zero in approximately 80% of rounds, fluctuate between [80, 120] in 15% of cases, and range from [200, 350] in 5% of cases. Notably, after non-zero adjustment costs occur, the subsequent 50 rounds consistently show zero costs. Experimental conclusions indicate that in 80% of rounds, the system determines that the current honeypot deployment strategy either yields negligible benefit differences compared to the optimal strategy or incurs prohibitively high adjustment costs, leading to no deployment adjustments. In the remaining 20% of rounds, adjustments are executed, and the corresponding benefit gain chart confirms a measurable increase in defensive returns post-adjustment.
4.2. Analysis of Adaptive Honeypot Adjustment Simulation Results
Simulation experiments on the timing of SD honeypot adjustments and post-adjustment benefit gains under dynamic attacks demonstrate the following. Adjustment costs remain zero in 80% of rounds, fluctuate between [80, 120] in 15% of cases, and range from [200, 350] in 5% of cases. Consistent with prior observations, 80% of rounds involve no adjustments due to minimal benefit differences or excessive costs, while 20% of rounds trigger adjustments that align with observable benefit improvements in the statistical results.
In practical cloud data center scenarios, the AASGRL mechanism can be implemented as an automated security orchestration service. The AASGRL algorithm is deployed on a centralized security controller that communicates with distributed edge servers via gRPC to monitor network states. When the DRL-based agent detects that a deployment adjustment is necessary—based on the Mahalanobis distance cost–benefit evaluation—the controller automatically triggers the NVIDIA BlueField-2 DPU’s programmable pipelines to update honeypot configurations and ISP route simulations in real-time. This architecture allows for seamless, low-latency defense updates across large-scale infrastructures with minimal human intervention, effectively countering dynamic APT attacks while maintaining high operational efficiency and low adjustment costs.