Article

Optimal Security Task Offloading in Cognitive IoT Networks: Provably Optimal Threshold Policies and Model-Free Learning

1 Department of Electrical and Computer Engineering, University of Houston, Houston, TX 77004, USA
2 School of Computer Science, Georgia Institute of Technology, Atlanta, GA 30332, USA
* Author to whom correspondence should be addressed.
Submission received: 6 February 2026 / Revised: 17 March 2026 / Accepted: 19 March 2026 / Published: 26 March 2026

Abstract

The proliferation of Internet of Things (IoT) devices has introduced significant security challenges: resource-constrained devices face sophisticated threats but lack the computational capacity for advanced security analysis. This study investigates optimal security task allocation in Cognitive IoT (CIoT) networks, specifically examining when IoT devices should process security tasks locally and when they should offload them to Mobile Edge Computing (MEC) servers. The problem is formulated as a Continuous-Time Markov Decision Process (CTMDP), and the study demonstrates that the optimal offloading policy has a threshold structure: security tasks are offloaded to MEC servers when the offloading queue length is below a critical threshold k*, and processed locally otherwise. This structural property is robust to changes in MEC server configurations and threat arrival patterns, ensuring an optimal and easily implementable security policy under the exponential model. Theoretical analysis establishes upper bounds on the performance of AI-based security controllers operating under the same model assumptions. The results also show that standard model-free Q-learning algorithms can recover optimal thresholds without any prior knowledge of the system parameters. Simulations across multiple reinforcement learning architectures, including Q-learning, State–Action–Reward–State–Action (SARSA), and Deep Q-Networks (DQN), confirm that all methods converge to the predicted threshold, empirically validating the analytical findings. The threshold structure remains effective under practical imperfections such as imperfect sensing and parameter estimation errors, with systems maintaining 85% to 93% of their optimal performance. This work extends threshold Markov Decision Process (MDP) analysis from classical queuing theory to the context of CIoT security offloading, providing optimal, practical policies and model-free algorithms for use by resource-constrained devices.

1. Introduction

The rapid proliferation of Internet of Things (IoT) devices has fundamentally transformed modern computing infrastructure. Recent industry analyses estimate that the number of connected IoT devices worldwide surpassed 18 billion in 2024 and continues to grow at double-digit annual rates, spanning critical domains including smart cities, healthcare, industrial automation, and consumer electronics [1]. This explosive growth, coupled with the emergence of IoT–edge–cloud continuum architectures that distribute computation across heterogeneous infrastructure layers [2], has created an unprecedented security landscape where billions of resource-constrained devices must defend against increasingly sophisticated cyber threats while operating under severe computational, memory, and energy limitations [3,4].
Traditional security paradigms designed for powerful computing systems are inadequate for IoT environments. IoT devices typically feature limited processing capabilities (8–32 bit microcontrollers), constrained memory (kilobytes of Random Access Memory (RAM)), and strict energy budgets (battery-powered operation). Meanwhile, modern cyber threats, including zero-day exploits, advanced persistent threats (APTs), distributed denial-of-service (DDoS) attacks, and polymorphic malware, require computationally intensive security analyses involving deep packet inspection, behavioral anomaly detection, and machine learning-based threat classification [5]. This fundamental mismatch between security requirements and device capabilities creates a critical challenge: how can resource-constrained IoT devices maintain robust security against sophisticated threats?

1.1. Cognitive IoT Networks and Mobile Edge Computing

Cognitive Internet of Things (CIoT) networks represent a paradigm shift that addresses this challenge by integrating cognitive computing principles with edge computing infrastructure [6,7]. The CIoT paradigm applies cognitive computing technologies, derived from cognitive science and artificial intelligence, to the data generated by connected IoT devices, endowing the network with the ability to perceive current conditions, analyze knowledge, make intelligent decisions, and perform adaptive actions [6]. In CIoT network architectures, IoT devices leverage nearby Mobile Edge Computing (MEC) servers to augment their limited computational capabilities [8]. MEC servers, deployed at network edges (base stations, access points, and gateways), provide significantly greater processing power, memory, and specialized security hardware (e.g., hardware accelerators for cryptographic operations and Graphics Processing Unit (GPU)-based deep learning inference) [9].
This architecture enables a hybrid security model where IoT devices can implement lightweight security operations locally (e.g., signature-based malware detection, basic authentication, and simple anomaly detection) while offloading complex security tasks to MEC servers (e.g., behavioral analysis, machine learning-based intrusion detection, and sophisticated threat correlation) [10,11]. However, this flexibility introduces a fundamental decision problem: for each arriving security task, should the IoT device process it locally or offload it to an MEC server?
This decision involves complex trade-offs [12]. Local processing avoids communication overhead and latency but may provide inferior security analysis due to limited computational resources. Offloading enables sophisticated analysis but incurs transmission delays, energy costs, and potential privacy risks during data transmission [13]. Furthermore, MEC servers have limited capacities and may be occupied by processing critical threats (e.g., active intrusion attempts and ransomware infections) that preempt security tasks [14]. When offloading queues become congested, security tasks experience delays that leave IoT devices vulnerable during the waiting period [15].

1.2. Research Challenge and Problem Formulation

This paper addresses the fundamental question: what is the optimal policy for allocating security tasks between local processing and MEC offloading in multi-MEC-server CIoT networks? We consider a realistic setting characterized by the following:
  • Multiple heterogeneous MEC servers with different processing capabilities serve the IoT device;
  • Critical threats arrive randomly and preempt security tasks on MEC servers;
  • Security tasks arrive continuously and must be either processed locally or queued for MEC offloading;
  • Queue congestion creates security risks as pending tasks leave vulnerabilities unaddressed;
  • System dynamics are stochastic, with random arrivals and processing times.
The existing approaches to IoT security task allocation rely primarily on heuristic rules or learning-based methods without theoretical optimality guarantees [16,17]. While deep reinforcement learning has shown empirical success [18,19,20], practitioners lack fundamental understanding of (i) what constitutes optimal behavior, (ii) how close learning-based policies are to being optimal, and (iii) whether simpler policies might achieve comparable performance.
Classical threshold optimality results in the queuing and admission control literature were established for single-server or homogeneous multi-server systems operating under stationary, non-preemptive workloads. These results cannot be directly applied to security offloading in Cognitive Internet of Things (CIoT) environments for three primary reasons. First, CIoT networks utilize heterogeneous Multi-access Edge Computing (MEC) servers, whose processing rates μ_i (tasks processed per time unit) and threat-handling rates β_i (critical threats processed per time unit) vary across servers; this heterogeneity violates the symmetry assumptions that facilitate classical proofs. Second, high-priority critical threats, such as active intrusions or ransomware attacks, arrive stochastically and preempt the security tasks already in service; these state-dependent interruptions are absent from standard admission control models. Third, the cost structure is security aware: the queuing penalty reflects compound vulnerability exposure rather than a generic waiting cost, coupling the offloading decision with the evolving threat landscape of the IoT network. Consequently, these features generate a joint state process with dynamics that differ qualitatively from models in which threshold optimality has been established. Addressing these complexities therefore requires demonstrating that the threshold structure is preserved: the concavity argument underlying value iteration must be re-established for each new source of heterogeneity and preemption.

1.3. Our Main Contributions

This paper makes the following main contributions:
1. Theoretical Foundations: Building on the Continuous-Time Markov Decision Process (CTMDP) framework [21], we formulate the security task allocation problem and show that the optimal policy has a threshold structure in the CIoT security offloading context. Specifically, we establish that there exists a critical queue length k* such that security tasks should be offloaded to MEC servers if and only if the current queue length is at most k*. This simple yet optimal policy depends only on the queue length, regardless of the MEC server configurations or threat patterns. While threshold structures have been established in classical queuing systems [22,23], our contribution lies in demonstrating that such a structure extends to the multi-MEC-server IoT security offloading setting with heterogeneous servers and preemptive critical threats, a combination that is not directly covered by prior results.
The underlying proof technique employs the classical methodology of Lippman [23]. Our primary contribution is the adaptation of this approach to a multi-server CIoT model that incorporates heterogeneous servers, preemptive critical threats, and security-specific cost structures.
2. Structural Optimality Proof: We rigorously prove threshold policy optimality through a systematic value function concavity analysis tailored to the CIoT security model. Extending the classical proof technique of Lippman [23] to our multi-server setting, we demonstrate that when the queuing cost function is convex (a natural assumption representing increasing marginal risk), value iteration preserves concavity, guaranteeing convergence to a threshold policy. This theoretical result provides performance upper bounds for any AI-based or learning-based security controller operating under the same model assumptions, offering a benchmark against which heuristic and data-driven methods can be evaluated.
3. Model-Free Validation via Q-Learning: We apply a Q-learning algorithm, following the foundational framework of Watkins and Dayan [24], to discover the optimal threshold without requiring any knowledge of the system’s parameters (arrival rates, processing rates, and threat patterns). This is critical for real-world deployments where these parameters are unknown and time varying. The primary role of the RL component is to serve as an independent, model-free validation of the analytical threshold structure: the algorithm learns directly from experience and consistently recovers the theoretically predicted threshold, confirming the practical relevance of the structural result while providing a deployment pathway when the system’s parameters are unavailable.
The reinforcement learning algorithms employed (Q-learning, SARSA, and DQN) are well-established. Our main contribution is not methodological innovation in reinforcement learning (RL), but rather the demonstration that standard model-free methods reliably recover the analytically predicted threshold. This result serves as an independent empirical validation of the theoretical framework and provides a practical pathway for deployment when the system’s parameters are unknown.
4. Comprehensive Simulation Validation: We validate our theoretical predictions through extensive simulations using three standard RL algorithms: classical Q-learning, SARSA, and deep Q-networks (DQN) [25], alongside fixed-threshold baselines. The fact that all learning methods independently converge to the theoretically predicted threshold, without being guided toward it, provides a strong empirical confirmation that the threshold structure is the genuine optimum rather than an artifact of the analytical assumptions.
5. Robustness Analysis: We demonstrate through simulation that threshold policies remain effective under realistic imperfections including imperfect sensing (false alarms and missed detections), parameter estimation mismatches, and MEC server switching costs. Systems retain 85–93% of their optimal performance even with significant uncertainties, confirming the practical applicability of our theoretical framework.
Unlike classical queuing models that establish threshold optimality for single-server or homogeneous multi-server systems under stationary, non-preemptive workloads [23], this work provides a structural analysis of security task offloading policies in CIoT systems with multiple heterogeneous MEC servers, preemptive critical threats, and security-aware cost functions. The core theoretical contribution is not a new proof technique but the demonstration that the classical threshold structure survives the combination of these domain-specific complexities, thereby supplying analytically grounded offloading rules to a setting that has previously relied on heuristic or purely learning-based methods.

1.4. Practical Impact

Our results enable IoT system operators to achieve the following:
  • Deploy analytically grounded policies with simple implementation requiring only queue length monitoring;
  • Benchmark AI-based controllers against theoretical performance bounds;
  • Design hybrid architectures combining structural insights with neural adaptation;
  • Inform quality-of-service provisioning for latency-sensitive security applications under the modeled conditions;
  • Reduce the computational overhead by using simple threshold rules instead of complex neural networks for moderate-scale systems.
For moderate-scale deployments (N ≤ 16 MEC servers), optimal thresholds can be learned in minutes on resource-constrained devices. For larger systems, our structural results provide inductive biases that accelerate deep learning convergence.

1.5. Paper Organization

The remainder of this paper is organized as follows. Section 2 develops the CTMDP-based system model for a multi-MEC server CIoT architecture, defining the state and action spaces, transition dynamics, and reward structure. Section 3 establishes the theoretical foundation by proving the optimality of threshold policies and characterizing the value function properties. Section 4 applies a standard model-free Q-learning algorithm to recover optimal thresholds without requiring any prior knowledge of the system’s parameters, serving as an independent validation of the theoretical structure. Section 5 validates the approach through comprehensive simulations, including baseline comparisons, sensitivity analysis, robustness testing under realistic imperfections, and scalability evaluation. Section 6 concludes the paper.

2. System Model

The CIoT network system is modeled as a CTMDP [21]. Based on the system’s background, we make the following assumptions. Figure 1 shows the architecture of the multi-MEC server CIoT network, and Table 1 provides a quick reference for all the notations used throughout the paper.

2.1. System Model Assumptions

We consider a Cognitive IoT (CIoT) network with the following characteristics:
MEC Server Configuration: The system consists of N heterogeneous Mobile Edge Computing (MEC) servers, each capable of processing one critical threat and multiple security tasks through queuing. MEC servers are prioritized such that server 1 has the highest processing capability (fastest processing rate μ_1), followed by server 2, with μ_1 > μ_2 > ⋯ > μ_N.
Critical Threat (CT) Processing: At MEC server i, critical threats (e.g., zero-day attacks, advanced persistent threats, etc.) arrive according to a Poisson process with rate α_i and require processing for an exponentially distributed duration with rate β_i. Critical threats have absolute priority: upon arrival, a CT immediately preempts any security task currently being processed on that server.
Table 1. Notation reference.

Symbol | Domain | Description
System Parameters
N | Z+ | Number of MEC servers
M | Z+ | Maximum queue capacity
λ | R+ | Security task arrival rate (Poisson)
α_i | R+ | Critical threat arrival rate, MEC i
β_i | R+ | Critical threat processing rate, MEC i
μ_i | R+ | Security task processing rate, MEC i
R | R+ | Reward for processing a security task
γ | (0, 1) | Discount factor
f(k) | R+ | Queuing cost function
State Variables
s | S | Complete system state
n | {0, 1, 2}^N | MEC server state vector
n_i | {0, 1, 2} | MEC server i state
k | {0, ..., M} | Queue length
e | E | Event type
Functions
V(s) | R | Value function
V*(s) | R | Optimal value function
π | Π | Policy
k* | {0, ..., M} | Optimal threshold
The Poisson arrival assumption is a standard modeling choice in queuing theory and is justified by the superposition of many independent IoT event sources, which by the Palm–Khintchine theorem converges to a Poisson process. We acknowledge that real-world cyber attacks can exhibit bursty and correlated patterns (e.g., coordinated DDoS campaigns); however, at the aggregate level across multiple IoT devices, the Poisson assumption provides a reasonable first-order approximation. Similarly, exponential service times serve as an analytically tractable baseline; extensions to phase-type distributions (which can approximate any distribution) are possible but would complicate the threshold optimality proof. We discuss the impact of these assumptions further in Section 6 and note that our robustness experiments in Section 5.4 demonstrate graceful performance degradation when the model’s parameters deviate from the assumed values.
Security Task (ST) Processing: Security tasks (e.g., malware scanning, authentication verification, anomaly detection, etc.) arrive at the IoT device according to a Poisson process with rate λ. An arriving security task can either be processed locally on the IoT device or offloaded to the MEC servers. If offloaded, the task scans all MEC servers; if any server is idle, it immediately occupies the highest-priority idle server for processing. If all the servers are busy, the task must decide whether to join the waiting queue for offloading or be processed locally.
Queue Structure: A single First-In-First-Out (FIFO) waiting queue with capacity M serves all the MEC servers for offloaded security tasks. When a server becomes available, the first task in the queue occupies the highest-priority available server.

2.2. State Space

Definition 1 
(System State). At each decision epoch t, the system state is characterized by
s_t = (n_t, k_t, e_t)
where
  • The MEC server state vector is n_t = [n_{1,t}, ..., n_{N,t}] with n_{i,t} ∈ {0, 1, 2}, where
    n_{i,t} = 0: MEC server i is idle;
    n_{i,t} = 1: MEC server i is processing a critical threat;
    n_{i,t} = 2: MEC server i is processing a security task.
  • The queue length is k_t ∈ {0, 1, ..., M}, denoting waiting security tasks.
  • The event type is e_t ∈ E, where
    E = {ST_ARR, CT_ARR_i, CT_FIN_i, ST_FIN_i : i ∈ {1, ..., N}}.
The state space is S = {0, 1, 2}^N × {0, ..., M} × E.
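To make the construction concrete, the following minimal Python sketch enumerates this state space for a small instance; the tuple encoding and event-label strings are illustrative assumptions rather than an implementation detail of the paper.

from itertools import product

# Hedged sketch: enumerate S = {0,1,2}^N x {0,...,M} x E (Definition 1)
# for a small instance (N = 2 servers, queue capacity M = 5).
N, M = 2, 5
events = (["ST_ARR"]
          + [f"CT_ARR{i}" for i in range(1, N + 1)]
          + [f"CT_FIN{i}" for i in range(1, N + 1)]
          + [f"ST_FIN{i}" for i in range(1, N + 1)])
states = [(n, k, e)
          for n in product((0, 1, 2), repeat=N)   # server state vector n
          for k in range(M + 1)                   # queue length k
          for e in events]                        # triggering event e
assert len(states) == 3**N * (M + 1) * (3 * N + 1)   # |E| = 3N + 1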
Observability on Resource-Constrained Devices: A practical concern is whether the state components are truly observable on IoT devices with limited sensing capabilities. We address this as follows: (i) the queue length k is locally maintained by the IoT device's offloading module and requires negligible memory; (ii) the event type e is inherently observable since it triggers the decision epoch (e.g., a new security task arrival is detected by the device's task scheduler); (iii) the MEC server states n can be obtained through lightweight status-query protocols (e.g., periodic heartbeat messages or piggyback signaling on existing control channels), which impose minimal overhead compared to the security task data itself. In scenarios where full MEC server state observation is impractical, our threshold policy offers a key advantage: the optimal decision depends primarily on the queue length k, which is always locally observable. The MEC server configuration n may affect the specific threshold value k*_n, but, as shown in Theorem 3, for symmetric servers the threshold is configuration independent, further reducing the observability requirements.
Remark 1 
(Notation Simplification). Our notation s = (n, k, e) provides a clear separation of state components: MEC server states, queue length, and event type. ST_ARR denotes a security task arrival, CT_ARR_i a critical threat arrival at MEC server i, CT_FIN_i a critical threat completion at MEC server i, and ST_FIN_i a security task completion at MEC server i.
Example (Two MEC Servers): Consider the state s = ([0, 1], 3, ST_ARR), which represents the following:
  • MEC server 1 is idle (n_1 = 0);
  • MEC server 2 is processing a critical threat (n_2 = 1);
  • Three security tasks are waiting in the queue (k = 3);
  • A new security task arrives (e = ST_ARR).
The decision is whether to offload this security task to the queue or process it locally.

2.3. Action Space and Decision Epochs

Actions: For states requiring an offloading decision (i.e., e = ST_ARR and all MEC servers occupied), the action space is

A(s) = {OFFLOAD, LOCAL}

where OFFLOAD queues the security task for MEC processing (if k < M) and LOCAL processes the task on the IoT device. When at least one MEC server is idle, an arriving ST is automatically assigned to the highest-priority idle server, and no decision is required.
Decision Epochs: These occur at event times, as follows:
  • Security task arrivals (rate λ); an offloading decision is required only when all MEC servers are occupied, otherwise the arriving ST is automatically assigned to the highest-priority idle server;
  • Critical threat arrivals at MEC server i (rate α_i);
  • Critical threat completions at MEC server i (rate β_i);
  • Security task completions at MEC server i (rate μ_i).

2.4. Transition Dynamics

The system evolves as a continuous-time Markov process. The transition rate out of state s = (n, k, e) under action a is

β(s, a) = λ + Σ_{i=1}^{N} [β_i^CT(n) + β_i^ST(n)]

where

β_i^CT(n) = α_i if n_i ∈ {0, 2}, and β_i if n_i = 1
β_i^ST(n) = μ_i if n_i = 2, and 0 otherwise
Interpretation: The rate β(s, a) equals the sum of the rates of all possible next events: ST arrivals (rate λ), CT arrivals at servers not currently processing a CT, i.e., servers with n_i ∈ {0, 2} (rate α_i each), CT departures from servers currently processing a CT, i.e., servers with n_i = 1 (rate β_i each), and ST completions at servers processing a security task, i.e., servers with n_i = 2 (rate μ_i each).
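As a sanity check on these definitions, the short Python sketch below computes β(s, a) for a given server-state vector; the list-based state encoding and the baseline rates are assumptions made for illustration.

# Hedged sketch: total event rate beta(s) from the definitions above; server
# codes follow Definition 1 (0 = idle, 1 = critical threat, 2 = security task).
def total_rate(n, lam, alpha, beta, mu):
    rate = lam                                    # ST arrivals always possible
    for i, ni in enumerate(n):
        rate += beta[i] if ni == 1 else alpha[i]  # CT completion vs. CT arrival
        if ni == 2:
            rate += mu[i]                         # ST completion while in service
    return rate

# With the baseline rates of Section 5.1.2 and n = [2, 0]:
# beta = lam + (alpha_1 + mu_1) + alpha_2 = 2.0 + (1.0 + 3.0) + 1.0 = 7.0
assert total_rate([2, 0], 2.0, (1.0, 1.0), (2.0, 2.0), (3.0, 2.5)) == 7.0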
The inter-event time distribution is
F(t | s, a) = 1 − e^(−β(s, a) t), t ∈ [0, ∞)
Uniformized Transition Probabilities: After uniformization with constant c = λ + Σ_{i=1}^{N} (α_i + β_i + μ_i) (for the two-server case, c = λ + α_1 + α_2 + β_1 + β_2 + μ_1 + μ_2), the transition probabilities become

q̃(j | s, a) = q(j | s, a) · β(s, a)/c if j ≠ s, and 1 − β(s, a)/c if j = s (fictitious self-loop)
The uniformized reward is
r̃(s, a) = r(s, a) · (γ + β(s, a))/(γ + c)
Example: Transition Probabilities from State s_0 = ([0, 0], 0, ST_ARR)
Consider the initial state where both MEC servers are idle and an ST arrives. Since at least one server is idle, the ST is automatically assigned to MEC server 1 (highest priority) with no offloading decision required. The next event probabilities are

q(j | s_0) = α_1/Z if j = ([2, 0], 0, CT_ARR_1); α_2/Z if j = ([2, 0], 0, CT_ARR_2); λ/Z if j = ([2, 0], 0, ST_ARR); μ_1/Z if j = ([2, 0], 0, ST_FIN_1)

where Z = λ + α_1 + α_2 + μ_1 = β(s_0) is the normalization constant.
The uniformized probability includes a self-loop:

q̃(s_0 | s_0) = 1 − Z/c = (c − Z)/c
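This example is easy to reproduce numerically; the sketch below uses the baseline rates of Section 5.1.2, with string state labels that are purely illustrative.

# Hedged sketch: uniformized transition probabilities from s0 (Section 2.4).
lam, a1, a2, b1, b2, m1, m2 = 2.0, 1.0, 1.0, 2.0, 2.0, 3.0, 2.5
c = lam + a1 + a2 + b1 + b2 + m1 + m2   # uniformization constant
Z = lam + a1 + a2 + m1                  # beta(s0): total event rate at s0

# q_tilde(j | s0) = (rate/Z) * (Z/c) = rate/c, plus the fictitious self-loop
q_tilde = {
    "([2,0], 0, CT_ARR1)": a1 / c,
    "([2,0], 0, CT_ARR2)": a2 / c,
    "([2,0], 0, ST_ARR)":  lam / c,
    "([2,0], 0, ST_FIN1)": m1 / c,
    "s0 (self-loop)":      (c - Z) / c,
}
assert abs(sum(q_tilde.values()) - 1.0) < 1e-12   # a proper distribution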

2.5. Reward Structure

The reward function has two components, as follows:
Immediate Rewards: When deciding at state s = (n, k, ST_ARR),

r_imm(s, a) = R if a = OFFLOAD, and 0 if a = LOCAL
where R > 0 is the value of offloading a security task to more powerful MEC servers (reflecting the superior security analysis capability and reduced local processing burden).
Queuing Costs: For security tasks in the queue,
c(k) = f(k)

where f : {0, ..., M} → R+ satisfies the following:
  • f(0) = 0 (no cost when the queue is empty);
  • f(k) is non-decreasing;
  • f(k) is convex: f(k + 1) − f(k) ≥ f(k) − f(k − 1).
Convexity captures increasing marginal risk as security tasks accumulate in the queue, representing elevated vulnerability during processing delays.
The convex queuing cost assumption is well-grounded in practical IoT security operations. In real deployments, the marginal security risk of each additional pending task increases superlinearly for several reasons. (i) Compound Vulnerability Exposure: Each unprocessed security task represents an open vulnerability window, and the interaction among multiple concurrent vulnerabilities creates compounding attack surfaces. (ii) Resource Contention: As the queue grows, local device resources for monitoring and basic defense become increasingly strained, degrading the overall security posture nonlinearly. (iii) Latency-Sensitive Threat Detection: Many security threats (e.g., ransomware propagation and lateral movement) cause damage that grows superlinearly with detection delay. Common choices such as f(k) = a·k² (quadratic cost) or f(k) = a·k^b with b > 1 are standard in the queuing literature and have been empirically validated in network security contexts [26]. In our simulations illustrated in Section 5, we test multiple convex cost functions with varying parameters to confirm that the threshold structure is robust to the specific choice of f(k).
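As a quick illustration, the three conditions above can be checked numerically for a candidate cost function; the sketch below uses the quadratic cost adopted in Section 5 and is a hedged verification, not part of the paper's implementation.

# Hedged sketch: verify the Section 2.5 conditions for f(k) = 0.5 k^2.
f = lambda k: 0.5 * k**2
M = 100
assert f(0) == 0                                                     # no cost when empty
assert all(f(k + 1) >= f(k) for k in range(M))                       # non-decreasing
assert all(f(k + 1) - f(k) >= f(k) - f(k - 1) for k in range(1, M))  # convex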

2.6. Optimization Objective

Goal: Find a policy π : S → A(s) maximizing the expected total discounted reward.
Value Function: For policy π ,
V^π(s) = E_π [ ∫_0^∞ e^(−γτ) r(s(τ), π(s(τ))) dτ | s(0) = s ]

where γ ∈ (0, 1) is the discount factor.
Bellman Equation: The optimal value function satisfies

V*(s) = max_{a ∈ A(s)} { r(s, a) + [β(s, a)/(γ + β(s, a))] Σ_{j ∈ S} q(j | s, a) V*(j) }

where q(j | s, a) is the transition probability.
Uniformization: We apply uniformization with constant c = λ + Σ_{i=1}^{N} (α_i + β_i + μ_i), transforming the problem into a discrete-time MDP with discount factor γ_dis = c/(c + γ):

V*(s) = max_{a ∈ A(s)} { r̃(s, a) + γ_dis Σ_{j ∈ S} q̃(j | s, a) V*(j) }

2.7. Threshold Policy

We seek a threshold policy with critical queue length k* ∈ {0, ..., M}:

π(s) = OFFLOAD if k ≤ k*, and LOCAL if k > k*

Intuition: When the offloading queue is short (k ≤ k*), the immediate reward R from superior MEC server processing outweighs the future queuing costs. When k > k*, accumulating queuing costs and security risks dominate, making local processing preferable.
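The resulting controller is trivially implementable on a constrained device; a minimal sketch is given below, assuming a threshold k_star has already been obtained via value iteration (Section 3) or Q-learning (Section 4).

# Hedged sketch of the threshold rule above.
def offload_decision(k: int, k_star: int) -> str:
    # Action for an arriving security task when all MEC servers are busy:
    # OFFLOAD while the queue is short, LOCAL once it exceeds the threshold.
    return "OFFLOAD" if k <= k_star else "LOCAL"

assert offload_decision(5, 10) == "OFFLOAD"
assert offload_decision(11, 10) == "LOCAL"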
Section 3 proves that such a policy is optimal and provides methods to compute k*. Section 4 presents a model-free Q-learning algorithm to learn k* without parameter knowledge.

3. Optimal Policy

This section establishes the optimality of threshold policies for the multi-server CIoT security task offloading problem. We first present the key structural result (Theorem 1), then prove that the value function possesses the necessary concavity property through value iteration (Theorem 2), and finally extend the results to general MEC server configurations (Theorem 3). Our approach follows the classical methodology for proving threshold structures in queuing systems [23] and applies it to the security-aware CIoT context [22].

3.1. Threshold Policy Structure

We begin by stating the main structural result that motivates our analysis.
Theorem 1 
(Threshold Policy Optimality). If the value function V((n, k, ST_FIN_1)) is concave and non-increasing in the queue length k, then the optimal offloading policy has a threshold structure: there exists a unique threshold k* ∈ {0, 1, ..., M} such that

π(s) = OFFLOAD if k ≤ k*, and LOCAL if k > k*

Intuition: The threshold policy offloads security tasks to MEC servers when the queue is short (the system has capacity) but processes them locally when the queue is long (to avoid excessive queuing delays and risks). The threshold k* represents the crossover point where the immediate benefit of MEC offloading equals the future cost of queue congestion.
The remainder of this section proves that the required concavity property holds, establishing the optimality of threshold policies.

3.2. Value Function at Decision States

Consider the critical state s = ([2, 2], k, ST_ARR), where both MEC servers are occupied and a new security task arrives. This is the state in which offloading decisions must be made.
From the optimality equation (Equation (12)), the value function satisfies

V(s) = max{V_OFFLOAD(k), V_LOCAL(k)}
Step 1: Express in terms of completion states
Using the uniformization technique from Section 2.6, we can relate the action-specific value functions to task completion state values. Let β = λ + α_1 + α_2 + μ_1 + μ_2 denote the total event rate.
The value function at the task completion state ST_FIN_1 is parameterized by the total number of tasks in the MEC offloading system, which includes both tasks waiting in the queue and those in service, rather than solely by the number of tasks waiting in the queue. This total task index is denoted by κ to distinguish it from the physical queue length k.
At the decision state ([2, 2], k, ST_ARR), the physical queue holds k waiting tasks. In addition, both MEC servers are occupied (n_1 = n_2 = 2), so two tasks are in service. However, the in-service task on the completing server (server 1) is the one whose departure triggers ST_FIN_1; its count is absorbed into the parameterization of the completion state itself. Therefore, only the in-service task on the other server (server 2) contributes an extra +1 to the index. The two actions then yield the following:
  • OFFLOAD: The arriving task joins the queue, so there are k + 1 waiting tasks plus the in-service task on server 2, i.e., k + 2 tasks remaining in the system at the next ST_FIN_1 epoch. Hence, κ = k + 2.
  • LOCAL: The arriving task is processed locally and does not enter the queue, so there are k waiting tasks plus the in-service task on server 2, i.e., k + 1 tasks in total. Hence, κ = k + 1.
Throughout Section 3, Section 4 and Section 5, V(([2, 2], κ, ST_FIN_1)) is always written with this total task index. In particular, the base case expression V_1(([2, 2], κ, ST_FIN_1)) = −f(κ − 1)/(γ + β) evaluates the queuing cost at κ − 1 because one task has just departed, leaving κ − 1 tasks in the system.
Then, for the OFFLOAD action, combining Equations (7) and (8) from Section 2,

V_OFFLOAD(k) = R + V(([2, 2], k + 2, ST_FIN_1))

For the LOCAL action, from Equation (12),

V_LOCAL(k) = V(([2, 2], k + 1, ST_FIN_1))
Step 2: Decision rule characterization
Substituting Equations (16) and (17) into (15), we obtain
V(([2, 2], k, ST_ARR)) = max{V(([2, 2], k + 1, ST_FIN_1)), R + V(([2, 2], k + 2, ST_FIN_1))}
Define the value difference as follows:
ΔV(k) = V(([2, 2], k + 2, ST_FIN_1)) − V(([2, 2], k + 1, ST_FIN_1))
Then, the optimal decision rule is
π(([2, 2], k, ST_ARR)) = OFFLOAD if ΔV(k) > −R, and LOCAL if ΔV(k) ≤ −R

Interpretation: Offload the security task if the marginal decrease in the future value, |ΔV(k)|, is smaller than the immediate reward R from MEC processing. Process locally if the future cost outweighs the immediate benefit.
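In code form, the rule reads as the following hedged sketch, where V_fin is a hypothetical array of completion-state values indexed by the total-task index κ of Section 3.2.

# Hedged sketch of the decision rule in Eq. (20).
def decide(V_fin, k, R):
    dV = V_fin[k + 2] - V_fin[k + 1]   # Delta V(k) from Eq. (19); <= 0 when
                                       # the value function is non-increasing
    return "OFFLOAD" if dV > -R else "LOCAL"

# Toy example: a concave, non-increasing value function and R = 10
V_fin = [-0.5 * kappa**2 for kappa in range(20)]
print(decide(V_fin, 2, 10.0))   # small k: marginal loss < R, so OFFLOAD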

3.3. Concavity Preservation Lemma

The following lemma is fundamental to our proof technique.
Lemma 1 
(Maximum Preserves Concavity). Suppose f(i) is convex and non-decreasing, and let R ≥ 0 be a constant. Then

g(i) = max{−f(i), R − f(i + 1)}

is concave and non-increasing in i.
Proof. 
See Lippman [23].    □
Application: This lemma, combined with Equation (18), implies that if the departure state value function is concave, then the arrival state value function inherits this property through the max operation.

3.4. Proof of Threshold Structure

Proof of Theorem 1. 
The proof proceeds by showing that the concavity of V(([2, 2], k, ST_FIN_1)) implies a threshold policy.
Step 1: Concavity implies monotone differences
If V(([2, 2], k, ST_FIN_1)) is concave in k, then the first differences are non-increasing:

ΔV(k + 1) ≤ ΔV(k) for all k ≥ 0
Step 2: Threshold structure emerges
From the decision rule (20), we obtain the following:
  • If ΔV(k) ≤ −R at some queue length k, then by monotonicity (22), ΔV(k′) ≤ ΔV(k) ≤ −R for all k′ > k.
  • Therefore, if LOCAL is optimal at k, it remains optimal for all k′ > k.
Step 3: Unique threshold
Define the optimal threshold as
k* = max{k : ΔV(k) > −R}
Then, the optimal policy is
π(([2, 2], k, ST_ARR)) = OFFLOAD if k ≤ k*, and LOCAL if k > k*

Uniqueness follows from the strict monotonicity of ΔV(k) and the continuity of the decision boundary.    □

3.5. Value Iteration and Concavity Preservation

We now prove that the required concavity property holds by showing that it is preserved under value iteration.
Theorem 2 
(Value Function Concavity via Value Iteration). Suppose the queuing cost function f(k) is convex and non-decreasing in k. Then:
(i) The value function V(([2, 2], k, ST_FIN_1)) is concave and non-increasing in k;
(ii) The optimal policy is a threshold policy.
Proof. 
We use value iteration with induction on the iteration index.
Initialization (Iteration 0): Set V_0(s) = 0 for all states s ∈ S.
Base case (Iteration 1): From the Bellman equation (Equation (11)) in Section 2, with V_0(j) = 0 for all next states j, the first iteration retains only the immediate (flow) cost at state ([2, 2], k, ST_FIN_1). At this state, no offloading decision is required (the event is a task completion, not a task arrival), so the only contribution is the queuing cost. Under the non-uniformized continuous-time Bellman equation (11), the one-step value with zero continuation is

V_1(([2, 2], k, ST_FIN_1)) = [−f(k − 1) + β · Σ_j q(j | s) V_0(j)] / (γ + β) = [−f(k − 1) + 0] / (γ + β)

where f(k − 1) is the queuing cost incurred at queue length k − 1 (after the completing task departs, k − 1 tasks remain waiting), and β = λ + α_1 + α_2 + μ_1 + μ_2 is the total event rate from this state. The factor 1/(γ + β) arises from integrating the discounted cost rate over the exponentially distributed sojourn time with parameter γ + β. Thus,

V_1(([2, 2], k, ST_FIN_1)) = −f(k − 1)/(γ + β)

ST_FIN_1 states arise only when at least one task has been in the system, so k ≥ 1 under the index convention of Section 3. Thus, f(k − 1) is evaluated at k − 1 ≥ 0, which lies within the domain {0, ..., M} where f is defined.
Since f(k) is convex and non-decreasing by assumption, and the negative sign reverses these properties, V_1 is concave and non-increasing in k.
Inductive hypothesis: Assume that at iteration i, the value function V_i(([2, 2], k, ST_FIN_1)) is concave and non-increasing in k.
Inductive step (Iteration i + 1 ):
Sub-step 1: Arrival state concavity.
By Lemma 1 and Equation (18), since V_i(([2, 2], k, ST_FIN_1)) is concave (inductive hypothesis), the arrival state value function is also concave:

V_i(([2, 2], k, ST_ARR)) is concave and non-increasing in k
Sub-step 2: Task completion state update.
Substituting back into the value iteration equation (Equation (12) from Section 2), we obtain
V_{i+1}(([2, 2], k, ST_FIN_1)) = −f(k − 1)/(γ + β) + X_i(k − 1)

where X_i(k − 1) is the expected continuation value, defined as the discounted weighted average of next-state values under the uniformized transition probabilities:

X_i(k) = γ_dis Σ_{j ∈ S} q̃(j | ([2, 2], k, ST_FIN_1)) V_i(j)

Here, the sum ranges over all possible next states j (ST arrivals, CT arrivals/completions, and ST completions at each server), weighted by the uniformized transition probabilities q̃ from Section 2.4, with γ_dis = c/(c + γ) being the discrete-time discount factor.
Sub-step 3: Verify the concavity of V_{i+1}.
Compute the first difference as follows:

V_{i+1}(([2, 2], k + 1, ST_FIN_1)) − V_{i+1}(([2, 2], k, ST_FIN_1)) = [f(k − 1) − f(k)]/(γ + β) + (X_i(k) − X_i(k − 1))

Since f is convex, f(k − 1) − f(k) ≥ f(k) − f(k + 1), so the first term is non-increasing in k.
The continuation value X_i(k) is a weighted average of concave functions (by the inductive hypothesis) and is, hence, concave. Therefore, X_i(k) − X_i(k − 1) is also non-increasing in k.
Combining both terms, V_{i+1}(([2, 2], k + 1, ST_FIN_1)) − V_{i+1}(([2, 2], k, ST_FIN_1)) is non-increasing in k, establishing the concavity of V_{i+1}.
Convergence: By Theorem 11.5.2 of Puterman [21], the value iteration sequence converges to the unique optimal value function:
V_i → V* as i → ∞

Since each V_i is concave and non-increasing, the limit V* inherits these properties.
Threshold policy optimality: By Theorem 1, since V* is concave and non-increasing, the optimal policy has a threshold structure.    □

3.6. Extension to General MEC Server Configurations

The preceding analysis focused on the representative state ([2, 2], k, ST_FIN_1), where both MEC servers process security tasks. We now extend the results to all MEC server configurations.
Theorem 3 
(General Threshold Policy). Suppose the queuing cost function f(k) is convex and non-decreasing in k. Then, the optimal offloading policy is a threshold policy for all MEC server configurations.
Proof. 
The proof proceeds by applying the same value iteration argument to other representative states.
Step 1: Other representative states
Consider the following three additional representative task completion states for the two-MEC-server system:
(a) V(([1, 2], k, ST_FIN_2)): MEC server 1 is handling a critical threat and MEC server 2 is processing a security task.
(b) V(([2, 1], k, ST_FIN_1)): MEC server 1 is processing a security task and MEC server 2 is handling a critical threat.
(c) V(([1, 1], k, CT_FIN_1)): Both MEC servers are processing critical threats.
Step 2: Parallel proof
For each of these states, the same proof technique as in Theorem 2 applies:
  • The base case holds: V_1 is concave due to the convexity of f(k).
  • The inductive step holds: Lemma 1 ensures that concavity is preserved.
  • Convergence follows from standard CTMDP theory.
Step 3: Unified threshold policy
Since concavity holds for all representative states, by Theorem 1, each state admits a threshold policy. The system-wide optimal policy is characterized by potentially different thresholds k*_n for different MEC server configurations n:

π((n, k, ST_ARR)) = OFFLOAD if k ≤ k*_n, and LOCAL if k > k*_n

In practice, for symmetric servers (α_1 = α_2, β_1 = β_2, and μ_1 = μ_2), the thresholds are identical: k*_n = k* for all n with all servers occupied.    □
Remark 2 
(Extension to N > 2 Servers). The proof above enumerates all representative server configurations for N = 2. For general N, the number of distinct server configurations is 3^N, and a complete enumeration becomes unwieldy. However, the argument generalizes as follows. For any server configuration n ∈ {0, 1, 2}^N with all servers occupied (n_i ∈ {1, 2} for all i), the value function at the corresponding task completion state depends on the queue length k through the same structural form: the immediate cost −f(k − 1)/(γ + β(n)), where β(n) is the configuration-dependent total event rate, plus a continuation value that is a weighted average of value functions at other states. Since (i) the convexity of f(k) ensures the base case is concave for every configuration, (ii) Lemma 1 preserves concavity through the max operation regardless of the specific transition rates, and (iii) weighted averages of concave functions remain concave, the inductive argument holds configuration by configuration. Formally, the proof requires showing concavity jointly across the coupled system of value functions for all 3^N configurations; this coupling is resolved because the transition probabilities from any configuration n to other configurations n′ produce weighted averages that preserve concavity by the inductive hypothesis applied to all configurations simultaneously.

3.7. Computing the Optimal Threshold

Value Iteration Algorithm: The constructive nature of the proof provides an algorithm to compute k*:
1. Initialize V_0(s) = 0 for all states.
2. For iterations i = 1, 2, ... until convergence:
  • Update task completion state values via Equation (26).
  • Update task arrival state values via Equation (18).
3. Extract the threshold: k* = max{k : V*(([2, 2], k + 2, ST_FIN_1)) − V*(([2, 2], k + 1, ST_FIN_1)) > −R}.
Computational Complexity:
  • State Space Size: |S| = O(3^N · M · N).
  • Per-Iteration Cost: O(3^N · M · N²) (due to transition probability calculations).
  • Convergence Rate: Geometric with rate γ (discount factor).
For the two-server case with M = 100 , convergence typically occurs within 500–1000 iterations, requiring 1–2 min on a standard laptop.
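To illustrate the procedure end to end, the sketch below runs value iteration on a deliberately simplified one-dimensional abstraction of the model (queue length only, both servers permanently busy, combined service rate μ_1 + μ_2); it is a hedged illustration of the iteration-and-extraction logic, not the full CTMDP of Section 2.

import numpy as np

# Hedged sketch: value iteration + threshold extraction on a reduced model.
lam, mu = 2.0, 5.5             # ST arrival rate; combined service rate mu1 + mu2
R, gamma, M = 10.0, 0.95, 100  # offloading reward, discount factor, queue capacity
hold = 0.5 * np.arange(M + 1) ** 2   # convex holding cost f(k) = 0.5 k^2

c = lam + mu                   # uniformization constant (reduced model)
g = c / (c + gamma)            # discrete-time discount factor gamma_dis

V = np.zeros(M + 1)
for _ in range(5000):
    off = np.full(M + 1, -np.inf)
    off[:-1] = R + V[1:]                     # OFFLOAD: k -> k+1, earn R (blocked at M)
    arr = np.maximum(V, off)                 # arrival decision: LOCAL vs. OFFLOAD
    dep = np.concatenate(([V[0]], V[:-1]))   # service completion: k -> max(k-1, 0)
    V_new = -hold / (c + gamma) + g * ((lam / c) * arr + (mu / c) * dep)
    if np.max(np.abs(V_new - V)) < 1e-9:
        V = V_new
        break
    V = V_new

# Largest queue length at which OFFLOAD still dominates LOCAL
k_star = max((k for k in range(M) if R + V[k + 1] > V[k]), default=-1)
print("extracted threshold k* =", k_star)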
Alternative: Model-Free Q-Learning. Section 4 presents a reinforcement learning approach that learns k* without requiring any knowledge of the system's parameters (λ, α_i, β_i, and μ_i).

3.8. Summary and Discussion

Main Results Recap:
1. Structural result (Theorem 1): Concavity of the value function implies threshold policy optimality.
2. Concavity proof (Theorem 2): Value iteration preserves concavity when f(k) is convex.
3. General applicability (Theorem 3): Threshold policies are optimal for all MEC server configurations.
Key Assumptions:
  • Convex queuing cost: f(k) is convex and non-decreasing.
  • Exponential processing times: Critical threat and security task processing durations are exponentially distributed.
  • Poisson arrivals: Both critical threat and security task arrivals follow Poisson processes.
  • Preemptive priority: Critical threats can preempt security tasks without penalties.
Practical Implications:
1.
Simplicity: Optimal policies have a simple threshold structure, which is easy to implement in IoT devices.
2.
Robustness: Threshold policies are robust to parameter estimation errors.
3.
Computation: Value iteration converges efficiently for practical system sizes.
4.
Learning: Model-free methods (Section 4) can discover k* through experience.
Extensions and Limitations: The threshold policy structure extends to the following:
  • Multiple classes of security tasks with different priorities.
  • Finite capacity constraints on individual MEC servers.
  • Time-varying threat arrival rates (within slowly varying regimes).
The following limitations require further analysis:
  • Non-convex cost functions (e.g., step costs).
  • Non-exponential processing time distributions.
  • Imperfect threat detection with false positives/negatives.
  • Network latency and communication overhead variations.
  • Bursty and correlated attack arrival patterns (e.g., coordinated DDoS).
  • The joint optimization of offloading decisions with energy consumption and communication resource allocation.
  • Per-server or hierarchical queue architectures replacing the single shared FIFO queue, which may better reflect deployments where MEC servers are geographically distributed.

4. The Q-Learning Based Optimization Algorithm

To validate the theoretical threshold structure and provide a practical deployment mechanism, we apply a Q-learning-based optimization algorithm [24]; Algorithm 1 details the procedure. We use this standard model-free method to recover the unique optimal threshold, providing both an independent check on the analytical results and a practical approach for discovering optimal policies without requiring any knowledge of the system dynamics [10,13].
According to the Cognitive IoT (CIoT) network system model, the value of taking an action a in a state s under a policy π is defined as the expected return starting from that state, taking that action, and thereafter following π .
Algorithm 1 Q-Learning Algorithm
Input: initial Q-table Q_0(s, a), learning rates ϕ_k, discount factor γ_dis
Output: converged Q*(s, a)
Initialize: Q_0(s, a) for all s ∈ S, a ∈ A(s)
For epoch k = 1 to n do
  Choose an action a_k (greedy in Q_k with exploration), then observe the reward r(s_k, a_k, s_{k+1}) and the next state s_{k+1}.
  Update: Q_{k+1}(s_k, a_k) according to Equation (34).
  Set s_{k+1} as the current state and a_{k+1} = argmax_a Q_{k+1}(s_{k+1}, a).
End
Until convergence (or all test epochs are exhausted).
The following Q-learning formulation utilizes the standard Bellman optimality framework [21,24]. Since this framework is well established, the discussion remains concise; we highlight the features unique to the CIoT offloading problem, including its state representation, action space, and reward signal, and clarify the relationship between Q-value convergence and the threshold structure established in Section 3.
  • The action value function Q^π(s, a) satisfies the Bellman equation:

    Q^π(s, a) = r̃(s, a) + γ_dis Σ_{j ∈ S} q̃(j | s, a) Q^π(j, π(j)).

The optimal action value function Q* satisfies the Bellman optimality equation:

Q*(s, a) = r̃(s, a) + γ_dis Σ_{j ∈ S} q̃(j | s, a) max_{a′ ∈ A(j)} Q*(j, a′).
From [24], the learning rate (step size) ϕ_k is a positive value with 0 < ϕ_k < 1. In this paper, ϕ_k is defined as

ϕ_k = log(k + 1)/(k + 1), k ∈ {0, 1, 2, 3, ...},

where k is the index of the k-th epoch.
According to [24], the Q-learning update rule is

Q_{k+1}(s_k, a_k) = (1 − ϕ_k) Q_k(s_k, a_k) + ϕ_k [r_k + γ_dis max_{a ∈ A(s_{k+1})} Q_k(s_{k+1}, a)].
The update converges to the optimal Q*(s, a) values as long as the following conditions hold:
(1) The state space S and the action space A(s) are finite.
(2) Σ_{k=1}^{∞} ϕ_k = ∞ and Σ_{k=1}^{∞} (ϕ_k)² < ∞ for every (s, a) ∈ S × A(s), uniformly with probability one.
(3) Var[r_k(s, a, j)] is bounded.
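Condition (2) indeed holds for the step-size schedule defined above: its partial sums grow without bound (on the order of (log n)²/2), while the squared sums converge. The following hedged numerical sketch makes this plausible.

import numpy as np

# Hedged sketch: Robbins-Monro conditions for phi_k = log(k+1)/(k+1).
k = np.arange(1, 1_000_001, dtype=float)
phi = np.log(k + 1) / (k + 1)
print(phi.sum())         # keeps growing with the horizon (divergent series)
print((phi ** 2).sum())  # approaches a finite limit (convergent series)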
According to our system model, an arriving ST can sense a number of MEC servers before making its offloading decision. Moreover, the ST does not know how many other STs are presently attempting to use the MEC servers, so each arriving ST must make its access decision based on partial information about the MEC servers. In each epoch, the arriving ST selects one of two actions. After the ST finishes its task and leaves the system, the system receives a corresponding reward R. Otherwise, if an arriving ST joins the waiting queue, it incurs a holding cost, meaning that each ST has an instantaneous Quality of Service (QoS) constraint of f(k) per unit time when the queue length is k, i.e., when k STs are waiting for the available MEC servers. Here, we assume that f(k) is a convex function. Based on this procedure, we apply the Q-learning algorithm to determine the optimal action for each arriving ST and to maximize the expected discounted reward. In our Q-learning algorithm, a greedy selection mechanism always selects the action with the maximum Q-value. We also use this simulation to find the threshold value k*: if k ≤ k*, STs enter the waiting queue; otherwise, if k > k*, STs are processed locally. Algorithm 1 shows the detailed steps of the Q-learning algorithm. We implemented the algorithm in Python 3.8+ with NumPy and standard RL libraries, and multiple parameter configurations were used to simulate the whole learning procedure.
The following three aspects of the Q-learning implementation are tailored to the CIoT offloading problem, rather than reflecting general reinforcement learning design principles. (i) State encoding: Each state s = ( n , k , e ) encodes the MEC server configuration n , the queue length k, and the triggering event e, which directly correspond to the CTMDP state space described in Section 3. (ii) Binary action space: The agent chooses either OFFLOAD or LOCAL at each security task arrival, reflecting the physical constraint that every arriving task requires immediate processing. (iii) Reward signal: The reward function integrates the immediate offloading benefit R with the convex queue-holding penalty f ( k ) , ensuring that the Q-values reflect the cost–benefit trade-off underlying the threshold policy. These adaptations ensure that Q-learning explores the same policy class for which optimality has been established, thereby enabling the direct comparison of the learned threshold and the analytical prediction.
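For concreteness, a compact tabular Q-learning sketch on a simplified queue-length abstraction (the reduced model of Section 3.7, with a combined service rate and an event-driven cost signal) is given below; the environment, cost placement, and exploration schedule are simplifying assumptions rather than the full multi-server CTMDP.

import numpy as np

# Hedged sketch: tabular Q-learning recovering a threshold on a reduced model.
rng = np.random.default_rng(0)
lam, mu, R, gamma, M = 2.0, 5.5, 10.0, 0.95, 100
f = lambda k: 0.5 * k**2
c = lam + mu
g_dis = c / (c + gamma)

Q = np.zeros((M + 1, 2))          # actions: 0 = LOCAL, 1 = OFFLOAD
k, eps = 0, 0.3
for step in range(1, 300_001):
    phi = np.log(step + 1) / (step + 1)       # step-size schedule of Section 4
    if rng.random() < lam / c:                # event: ST arrival -> decision
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[k]))
        if a == 1 and k < M:
            r, k_next = R, k + 1              # OFFLOAD: join queue, earn R
        else:
            r, k_next = 0.0, k                # LOCAL processing
        r -= f(k_next) / (c + gamma)          # holding-cost signal (approximation)
        Q[k, a] += phi * (r + g_dis * Q[k_next].max() - Q[k, a])
        k = k_next
    else:                                     # event: service completion
        k = max(k - 1, 0)
    eps = max(0.01, eps * 0.99999)

k_star = max((s for s in range(M) if Q[s, 1] >= Q[s, 0]), default=-1)
print("learned threshold k* ~", k_star)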

5. Simulation Analysis and Results

This section presents the comprehensive simulation validation of the theoretical threshold structure and the Q-learning-based policy discovery through systematic simulation studies. We evaluate the performance across multiple dimensions: (1) a baseline comparison against standard admission control policies [17,26], (2) a sensitivity analysis of the key system parameters, (3) robustness testing under realistic imperfections [27], and (4) a scalability assessment for large-scale deployments [15]. Our simulation framework demonstrates that the learned threshold policy consistently outperforms alternatives while maintaining computational efficiency.
Because our primary contribution is a structural result, namely that the optimal offloading policy has a threshold form, controlled simulation is the most appropriate first-line evaluation methodology. Synthetic experiments allow us to (i) verify that learned policies converge exactly to the analytically predicted threshold under the conditions assumed by the theory, (ii) isolate the effect of individual parameters (arrival rates, service rates, and cost functions) through controlled sweeps that are impossible with fixed real-world traces, and (iii) stress test robustness by deliberately injecting the model mismatches illustrated in Section 5.4. This evaluation strategy is standard practice for foundational theoretical contributions in stochastic control and operations research: seminal works on threshold policies for admission control [23], dynamic spectrum access [22], and MEC offloading [26] similarly rely on synthetic CTMDP or queuing theoretic simulations to validate structural optimality before trace-driven studies are pursued. We acknowledge that trace-driven evaluation using real-world IoT security datasets is an important next step and have identified it as a high-priority direction for future work, as illustrated in Section 6; however, the controlled experiments presented here are both necessary and sufficient to validate the theoretical claims of this paper.

5.1. Simulation and Methodology

5.1.1. Simulation Environment

We implemented a comprehensive discrete-event simulation framework in Python (version 3.8+) that faithfully models the complete CTMDP dynamics, including stochastic CT and ST arrivals, exponential service processes, preemptive interruptions, and queue management. The simulation was executed on a computing platform with an Intel Core i7-10700K processor (eight cores, 3.8 GHz) and 32 GB of DDR4 RAM, running Ubuntu 20.04 LTS.

5.1.2. Baseline System Configuration

Unless otherwise specified, the experiments use the following baseline parameters for a two-MEC-server system ( N = 2 ):
  • Arrival rates: ST arrival rate λ = 2.0; CT arrival rates α_1 = α_2 = 1.0.
  • Service rates: CT service rates β_1 = β_2 = 2.0; ST service rates μ_1 = 3.0, μ_2 = 2.5 (priority ordering).
  • Economic parameters: immediate reward R = 10.0; discount factor γ = 0.95.
  • Cost function: convex holding cost f(k) = 0.5·k², where k ∈ {0, 1, ..., M} and the maximum queue capacity is M = 100.
  • Learning parameters: learning rate ϕ_k = log(k + 1)/(k + 1); the exploration rate ε decays from 0.3 to 0.01 with factor 0.995 per episode.
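For reference, these baseline values can be collected into a single configuration object, as in the hedged sketch below; the field names are illustrative and do not come from the paper's code.

# Hedged sketch: baseline configuration of Section 5.1.2 as a Python dict.
BASELINE = dict(
    N=2, M=100,                          # servers and queue capacity
    lam=2.0,                             # ST arrival rate
    alpha=(1.0, 1.0), beta=(2.0, 2.0),   # CT arrival / service rates
    mu=(3.0, 2.5),                       # ST service rates (priority ordering)
    R=10.0, gamma=0.95,                  # reward and discount factor
    f=lambda k: 0.5 * k**2,              # convex holding cost
)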
The baseline parameters are selected to represent realistic Internet of Things (IoT) and Multi-access Edge Computing (MEC) deployments, aligning with the established literature. The security task (ST) arrival rate λ = 2.0 events/s models a moderately loaded IoT gateway processing security alerts; empirical studies in smart home and campus networks report per-device alert rates ranging from 0.5 to 5 events/s, depending on threat activity [4,5]. The critical threat (CT) arrival rate α_i = 1.0 per server and service rate β_i = 2.0 yield a per-server CT utilization of ρ_CT = α/β = 0.5, indicating a moderate load that maintains residual capacity for STs. Comparable CT-to-ST load ratios are reported in admission control models [26] and cognitive radio preemption frameworks [14,22]. The ST service rates μ_1 = 3.0 and μ_2 = 2.5 represent heterogeneous MEC servers with varying processing capabilities, a standard assumption in multi-server edge computing research [8,15]. The immediate reward R = 10.0 and the convex holding cost f(k) = 0.5k² are conventional in discounted Markov Decision Process (MDP) admission control [23,26], where the ratio of R to f(k) determines the optimal threshold. The discount factor γ = 0.95 is commonly used in reinforcement learning (RL)-based offloading studies [10,13,27]. To evaluate sensitivity, Section 5.3 and Section 5.4 systematically vary each key parameter and demonstrate that the threshold structure remains consistent across a broad operational range.

5.1.3. Training Procedure

Each algorithm was trained for 20,000 episodes in the baseline configuration ( N = 2 ) to ensure convergence. For scalability experiments ( N > 2 ), the number of training episodes was scaled proportionally to the larger state space (see Table 6 for per-configuration convergence episodes). An episode consists of up to 50 decision epochs where STs arrive and admission control decisions are made. All the results represent averages over ten independent runs with different random seeds to account for stochastic variability. We report 95% confidence intervals computed via bootstrap resampling (1000 samples). It should be noted that bootstrap confidence intervals (CIs) derived from n = 10 runs may underestimate the true uncertainty. Therefore, the reported intervals should be considered approximate.
For each experimental condition, ten independent replications were conducted using the fixed seed set { 0 , 1 , , 9 } . In every replication, the same seed was applied to all random number generators (NumPy random.seed(), PyTorch manual_seed(), and the environment’s arrival and service processes) to ensure exact reproducibility. After 20,000 training episodes, four performance metrics were computed: the average reward (mean cumulative reward per episode), blocking probability (frequency of episode-level rejections), queue length (the average number of entities in the queue per episode), and server utilization (the average proportion of busy servers per episode). Each metric was averaged across the ten runs, and uncertainty was assessed using 95% bootstrap confidence intervals with B = 1000 resamples drawn with replacement from the ten run-level means. For pairwise policy comparisons, confidence intervals were supplemented with paired t-tests and the Cohen’s d effect sizes illustrated in Section 5.6. Given that n = 10 runs provide limited statistical power, the confidence intervals are approximate, and non-significant differences (e.g., Q-learning vs. SARSA; p = 0.062 ) should not be over-interpreted.
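The bootstrap procedure is straightforward to reproduce; the sketch below uses hypothetical run-level means (chosen only so that their average matches the reported 48.53) to illustrate the computation.

import numpy as np

# Hedged sketch: 95% bootstrap CI over n = 10 run-level means (B = 1000).
rng = np.random.default_rng(0)
run_means = np.array([48.1, 49.0, 47.6, 50.2, 48.8,
                      47.9, 49.5, 48.4, 48.0, 47.8])   # hypothetical values
boot = np.array([rng.choice(run_means, size=run_means.size).mean()
                 for _ in range(1000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean = {run_means.mean():.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")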

5.1.4. Reproducibility Summary

To facilitate independent replication, Table 2 consolidates all the simulation parameters. The source code was implemented in Python 3.8 using NumPy 1.24 and PyTorch 1.13 (for DQN only). Each of the ten independent runs used a distinct random seed drawn from the set { 0 , 1 , , 9 } , applied consistently to NumPy’s random.seed() and PyTorch’s manual_seed() to ensure the full reproducibility of the arrival processes, service times, and exploration noise.

5.2. Baseline Policy Comparison

To establish the effectiveness of our Q-learning approach, we compare six admission control policies representing different design philosophies (a minimal sketch of the heuristic baselines follows the list):
1. Q-learning (Baseline learner): Our algorithm (Algorithm 1) from Section 4, selected as the primary learner due to having the fastest convergence.
2. SARSA: On-policy temporal-difference learning.
3. DQN: Deep Q-network with experience replay [25].
4. FIXED-k: A fixed-threshold policy with k ∈ {5, 10, 15, 20}.
5. GREEDY: Always accept if any capacity is available.
6. RANDOM: Accept with probability 0.5 (lower bound).
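The heuristic baselines reduce to one-line decision rules applied on each ST arrival, as the sketch below illustrates (function names are ours; the capacity default reflects the baseline M = 100).

```python
import random

def fixed_k(queue_len: int, k: int = 10) -> str:
    """FIXED-k: offload only while the offloading queue is below threshold k."""
    return "OFFLOAD" if queue_len < k else "LOCAL"

def greedy(queue_len: int, capacity: int = 100) -> str:
    """GREEDY: accept whenever any queue capacity remains (myopic)."""
    return "OFFLOAD" if queue_len < capacity else "LOCAL"

def random_policy(_queue_len: int) -> str:
    """RANDOM: accept with probability 0.5 (lower-bound reference)."""
    return "OFFLOAD" if random.random() < 0.5 else "LOCAL"
```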
Figure 2 shows the learning curves for the three learning algorithms, and Table 3 summarizes the comparative performance across multiple metrics: the average reward, blocking probability (the fraction of STs denied offloading due to a full queue), average queue length, server utilization (the fraction of time MEC servers are actively processing tasks), convergence episodes (the number of training episodes required to reach a stable policy), and wall-clock training time. The key findings are as follows.
Reward Performance: The three learning algorithms (Q-learning, SARSA, and DQN) achieve statistically similar final rewards (48.53 to 51.23), all significantly outperforming fixed threshold and heuristic policies. Paired t-tests confirm that Q-learning outperforms the best fixed policy (FIXED-k = 10) with p < 0.001 (Cohen’s d = 1.24; large effect size). The 24–30% improvement over GREEDY demonstrates the value of strategic admission control versus myopic acceptance. Although SARSA achieves a marginally higher mean reward ( 51.23 compared to 48.53 ), this difference is not statistically significant ( p = 0.062 ). Q-learning is designated as the baseline learner because it converges the fastest, indicating its greater learning efficiency and reliability under these experimental conditions, as shown below.
Convergence Efficiency: Q-learning achieves the fastest convergence (2500 episodes), requiring 29% fewer episodes than SARSA (3500 episodes) and 38% fewer than DQN (4000 episodes). This sample efficiency advantage makes Q-learning preferable for online learning scenarios where training time is limited.
Threshold Discovery: Critically, all three learning algorithms independently converge to the same optimal threshold k = 10 , closely matching the best-performing fixed threshold. Because these algorithms use fundamentally different update rules and exploration strategies yet recover the same threshold predicted by the analytical proof, this convergence serves as a strong model-free validation of the theoretical threshold structure established in Section 3.
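Reading the threshold off a learned value table is straightforward. The sketch below assumes, purely for illustration, a Q-table indexed by queue length alone (the full state also includes server occupancy), with action 0 = OFFLOAD and action 1 = LOCAL.

```python
import numpy as np

def extract_threshold(q_table: np.ndarray) -> int:
    """Return the smallest queue length k at which LOCAL is preferred.
    The learned policy is a threshold policy exactly when this preference
    switches once and never switches back."""
    prefer_local = q_table[:, 1] >= q_table[:, 0]
    if not prefer_local.any():
        return q_table.shape[0]          # never prefers LOCAL: threshold = capacity
    return int(np.argmax(prefer_local))  # index of the first True entry
```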
Table 3. Comprehensive baseline comparison ( N = 2 MEC servers, baseline parameters, and 20,000 episodes).
| Policy | Avg Reward | Blocking Prob. | Avg Queue | Server Util. | Conv. Episodes | Time (s) |
|---|---|---|---|---|---|---|
| Q-learning (Baseline) | 48.53 ± 1.8 | 0.082 ± 0.008 | 5.2 ± 0.3 | 0.81 ± 0.02 | 2500 | 8.68 |
| SARSA | 51.23 ± 2.1 | 0.078 ± 0.009 | 4.9 ± 0.4 | 0.83 ± 0.02 | 3500 | 5.97 |
| DQN | 50.94 ± 2.9 | 0.085 ± 0.012 | 5.5 ± 0.5 | 0.80 ± 0.03 | 4000 | 7.22 |
| FIXED-k = 5 | 42.15 ± 1.5 | 0.145 ± 0.010 | 3.8 ± 0.2 | 0.75 ± 0.02 | N/A | 0.08 |
| FIXED-k = 10 | 45.82 ± 1.6 | 0.095 ± 0.009 | 5.9 ± 0.3 | 0.79 ± 0.02 | N/A | 0.08 |
| FIXED-k = 15 | 41.33 ± 1.7 | 0.072 ± 0.008 | 7.2 ± 0.4 | 0.82 ± 0.02 | N/A | 0.08 |
| FIXED-k = 20 | 38.91 ± 1.8 | 0.065 ± 0.007 | 8.5 ± 0.5 | 0.84 ± 0.02 | N/A | 0.08 |
| GREEDY | 39.27 ± 2.2 | 0.051 ± 0.007 | 9.1 ± 0.6 | 0.86 ± 0.02 | N/A | 0.08 |
| RANDOM | 25.13 ± 3.5 | 0.498 ± 0.025 | 3.0 ± 0.4 | 0.44 ± 0.04 | N/A | 0.08 |
The threshold k = 10 reported here is specific to the baseline parameter configuration defined in Section 5.1.2. In subsequent sections, we show how k adapts when the system parameters change: for example, k ranges from 6 to 15 as the arrival rate λ varies (Section 5.3) and from 9 to 12 as the discount factor γ varies. These variations are consistent with the theoretical prediction that the optimal threshold depends on the cost–reward trade-off; they reflect adaptation rather than a discrepancy in the results.
Computational Cost: SARSA exhibits the fastest training time (5.97 s) due to its simpler on-policy updates. Q-learning requires 8.68 s due to the max operation over actions. For real-time deployment, the learned policy executes in constant time (0.08 s for 20,000 decisions), making all the approaches computationally feasible.

5.3. Sensitivity Analysis

To understand the robustness of our approach, we systematically vary the key system parameters and evaluate performance degradation.

5.3.1. Load Parameter Sensitivity

Our analysis shows how performance varies with the ST arrival rate λ ∈ [0.5, 8.0] while holding the other parameters constant. The learned threshold adapts appropriately: k = 6 for a low load (λ = 0.5), k = 10 for the baseline (λ = 2.0), and k = 15 for a high load (λ = 8.0). The average reward increases with λ up to λ = 4.0, then plateaus as the MEC server capacity saturates.

5.3.2. Discount Factor Sensitivity

Table 4 presents the results for discount factor γ ∈ {0.8, 0.9, 0.95, 0.99}. Higher discount factors (more future-oriented) lead to slightly lower thresholds (k ranges from 12 at γ = 0.8 to 9 at γ = 0.99), reflecting a greater sensitivity to long-term holding costs. The performance remains stable across this range (coefficient of variation < 5%), indicating robustness to the discount factor specification.

5.3.3. Cost Function Sensitivity

We test convex cost functions f(k) = a × k^b with a ∈ {0.1, 0.5, 1.0, 2.0} and b ∈ {1.0, 1.5, 2.0}. The results show that the optimal threshold decreases with cost severity: k ∈ [8, 15] for a = 0.1 versus k ∈ [4, 8] for a = 2.0. The learning algorithm successfully adapts to the cost structure, with the convergence time varying by less than 20% across all the tested configurations.

5.4. Robustness Under Realistic Imperfections

Real-world CIoT network deployments face imperfect sensing, estimation errors, and MEC server switching costs. We evaluate the policy robustness under these practical constraints.

5.4.1. Imperfect Sensing

We model sensing errors with false alarm probability P_FA ∈ {0.01, 0.05, 0.10} (an idle MEC server detected as occupied) and missed detection probability P_MD ∈ {0.01, 0.05, 0.10} (an occupied MEC server detected as idle). Table 5 shows the performance degradation.
Performance degrades gracefully, retaining 93% of the baseline reward even with 5% sensing errors. The learned policy remains threshold-based but shifts to slightly more conservative admission (k decreases by 1–2) to account for the uncertainty.

5.4.2. Parameter Estimation Errors

We test robustness when the system parameters (λ, α_i, β_i, and μ_i) are uniformly mis-estimated by ±20%. The Q-learning algorithm learns compensating policies that achieve 91.3% of the optimal reward, demonstrating the advantage of model-free learning. Importantly, the threshold policy structure persists despite the parameter mismatches.

5.4.3. MEC Server Switching Costs

When a switching cost c_switch = 0.5 per server change (representing handoff overhead) is introduced, the learned policy becomes slightly more conservative (k increases by 2–3) to reduce the switching frequency. The average reward decreases by 8.2% but remains approximately 13% higher than the GREEDY baseline.

5.4.4. Practical Deployment Considerations

Our base model makes several simplifying assumptions (Poisson arrivals, exponential service times, and perfect state knowledge) to maintain analytical tractability. We now discuss the key practical deployment factors not explicitly modeled and their expected impact.
Communication Delay: In real deployments, offloading incurs non-negligible communication latency (typically 1–10 ms for edge networks). This delay can be incorporated into the model by augmenting the service time distribution: if d denotes the round-trip communication delay, the effective service rate becomes μ′_i = 1/(1/μ_i + d). Since our threshold structure depends on the relative magnitudes of the rewards versus the queuing costs rather than on absolute service rates, the threshold policy remains structurally optimal under deterministic communication delays. For stochastic delays, the threshold would shift conservatively (lower k) to account for the additional uncertainty. Our robustness experiments with parameter estimation errors, reported in Section 5.4.2, implicitly capture this effect, showing 91.3% performance retention under a 20% parameter mismatch.
Energy Consumption: IoT devices face strict energy budgets. Offloading involves transmission energy E_tx, proportional to the task data size, while local processing consumes computation energy E_comp. These can be incorporated by modifying the reward function to R′ = R − ηE_tx, where η is an energy cost weighting factor. Since the modification is a constant offset to the reward, the threshold structure is preserved, though the optimal threshold k may decrease (favoring local processing) as energy costs increase.
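Both adjustments are one-line transformations of quantities already in the model, as the sketch below makes explicit (function names and the 5 ms example delay are our illustrative assumptions).

```python
def effective_service_rate(mu: float, d: float) -> float:
    """Fold a round-trip communication delay d into the service rate:
    mu_prime = 1 / (1/mu + d)."""
    return 1.0 / (1.0 / mu + d)

def energy_adjusted_reward(reward: float, e_tx: float, eta: float) -> float:
    """Offloading reward net of transmission energy: R_prime = R - eta * E_tx."""
    return reward - eta * e_tx

# Example: mu_1 = 3.0 tasks/s with a 5 ms round trip -> ~2.956 tasks/s.
mu_prime = effective_service_rate(3.0, 0.005)
```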
Offloading Failure Rate: Network unreliability may cause task offloading failures with probability p f . This is analogous to imperfect sensing with missed detection probability P M D , which we evaluated in Section 5.4.1. With p f = 0.05 , we expect a performance retention of approximately 93% based on our imperfect sensing results. The Q-learning algorithm naturally adapts to the offloading failures since it learns from actual outcomes rather than assumed dynamics.
While a comprehensive treatment of all these factors simultaneously is beyond the scope of this work, the robustness results in Section 5.4.1, Section 5.4.2 and Section 5.4.3 demonstrate that the threshold policy degrades gracefully under various practical imperfections, supporting its applicability to real-world IoT deployments.

5.4.5. Comparison with Non-Poisson and Bursty Traffic Scenarios

The analytical results presented in this paper are based on Poisson arrival processes and exponential service times, which together ensure the Markov property required for the CTMDP formulation. In practice, IoT security traffic often exhibits bursty, correlated behavior. For instance, coordinated DDoS campaigns or worm propagation events generate arrival bursts that violate the memoryless inter-arrival assumption, while heavy-tailed task sizes, such as those encountered in the deep packet inspection of variable-length payloads, diverge from exponential service distributions. Accordingly, this section conceptually examines the relationship between the proposed framework and more realistic traffic models.
Markov-Modulated Poisson Process (MMPP). An MMPP models burstiness by alternating between a low-rate calm phase (when event arrivals are infrequent) and a high-rate attack phase (when arrivals occur rapidly), governed by an underlying Markov chain (transitions between phases depend only on the current phase). Within this framework, the system remains amenable to a CTMDP formulation by augmenting the state to include the current traffic phase. The threshold structure is anticipated to generalize such that a phase-dependent threshold k(phase) would arise: a lower threshold during the attack phase mitigates queue congestion under heavy load, and a higher threshold is used during the calm phase. Establishing a formal proof of this conjecture would require extending the concavity argument of Theorem 2 to the expanded state space, which we identify as a direction for future research.
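For experimentation, a two-phase MMPP trace can be generated by racing exponential clocks for the next arrival and the next phase switch; the sketch below is illustrative, and the phase and switching rates are our assumptions, not fitted values.

```python
import numpy as np

def mmpp_arrivals(t_end: float, lam=(0.5, 8.0), switch=(0.1, 0.5), seed=0):
    """Arrival times from a two-phase MMPP: phase 0 = calm (rate lam[0]),
    phase 1 = attack (rate lam[1]); switch[p] is the rate of leaving phase p."""
    rng = np.random.default_rng(seed)
    t, phase, arrivals = 0.0, 0, []
    while t < t_end:
        t_arrival = rng.exponential(1.0 / lam[phase])    # next-arrival clock
        t_switch = rng.exponential(1.0 / switch[phase])  # next phase-change clock
        if t_arrival < t_switch:
            t += t_arrival
            if t < t_end:
                arrivals.append(t)
        else:
            t += t_switch
            phase = 1 - phase  # toggle calm <-> attack
    return arrivals
```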
Self-Similar and Heavy-Tailed Arrivals. Long-range-dependent traffic, which exhibits correlations spanning long time periods and is typically modeled by fractional Brownian motion (a self-similar, long-memory process) or Pareto-distributed inter-arrival times (under which very large intervals are far more frequent than in exponential models), cannot be represented by a finite-state Markov process. In these scenarios, the threshold policy is not provably optimal. However, the robustness experiments provide an indirect evaluation: the 20% parameter mismatch test reported in Section 5.4.2 introduces systematic bias in the learner's perceived arrival and service rates, simulating the impact of a misspecified arrival model. Despite this, the Q-learning agent retains 91.3% of the optimal reward. These findings indicate that a threshold policy trained under Poisson assumptions may function as a practical heuristic even when actual traffic deviates from the exponential model, though the performance loss grows with the degree of deviation.
Bursty Correlated Attacks. In scenarios characterized by highly correlated attack bursts, such as a sudden surge in exploit attempts following a zero-day disclosure, the model-free Q-learning approach may be more robust than the analytical policy. This advantage arises because Q-learning adapts to the observed system dynamics without relying on specific distributional assumptions. If the agent is trained or fine-tuned on traces that include bursts, it can develop burst-aware thresholds. Furthermore, the DQN-aggregate architecture described in Section 5.5 could be enhanced with recurrent layers, such as LSTM, to capture temporal correlations. This represents a promising direction for future research.
In summary, the Poisson and exponential framework offers a tractable analytical baseline, and the resulting threshold structure is likely to remain a useful approximation under moderate deviations from these assumptions. Quantifying the performance gap for specific non-Poisson models constitutes an important direction for future research, as identified in Section 6.

5.5. Scalability Analysis

To evaluate practical applicability to large-scale systems, we investigate scaling behavior with respect to the MEC server count N and queue capacity M.

5.5.1. MEC Server Scalability

Table 6 presents results for N ∈ {2, 4, 8, 16} with proportionally scaled arrival rates (λ = N to maintain a constant per-server load). The key observations are as follows:
State Space Growth: The state space grows exponentially as O(3^N) but remains manageable for N ≤ 16 using tabular Q-learning. For N > 16, function approximation methods (neural networks) would be necessary.

5.5.2. Function Approximation for Large-Scale Systems ( N > 16 )

To address the state space explosion beyond N = 16 servers, we propose a Deep Q-Network (DQN) approach that leverages our theoretical insights through a compact state representation. Rather than encoding the full server state vector n ∈ {0, 1, 2}^N, we define aggregate features: (a) the number of idle servers N_0 = Σ_i 1[n_i = 0], (b) the number of servers processing critical threats N_1 = Σ_i 1[n_i = 1], and (c) the queue length k. This reduces the effective state dimension from O(3^N) to O(N^2 · M), which is polynomial rather than exponential. The DQN architecture consists of a feedforward neural network with three hidden layers (128-64-32 neurons with Rectified Linear Unit (ReLU) activations), taking the aggregate state (N_0, N_1, k) as input and outputting Q-values for the OFFLOAD and LOCAL actions. Experience replay (buffer size 10,000; mini-batch size 32) and target network stabilization (update every 500 steps) [25] ensure training stability. Our theoretical threshold result provides a strong inductive bias: the network need only learn a single threshold boundary in the (N_0, N_1, k) space, significantly simplifying the learning task.
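A minimal PyTorch sketch of this aggregate-state network is given below; the class name is ours, and the training loop (replay buffer, target network, optimizer) is omitted for brevity.

```python
import torch
import torch.nn as nn

class AggregateDQN(nn.Module):
    """Q-network over the aggregate state (N0, N1, k), with the 128-64-32
    hidden-layer configuration described in the text."""
    def __init__(self, n_actions: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_actions),  # outputs Q(s, OFFLOAD), Q(s, LOCAL)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: Q-values for 3 idle servers, 2 servers handling CTs, queue length 7.
q_values = AggregateDQN()(torch.tensor([[3.0, 2.0, 7.0]]))
```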
To validate this approach, we conducted preliminary experiments comparing the DQN-aggregate method against tabular Q-learning at scales where both methods can be evaluated, as well as at larger scales accessible only to the DQN approach. Table 7 summarizes the results.
Key observations: At N = 4 and N = 8, the DQN-aggregate method recovers the same optimal threshold as tabular Q-learning, with the reward within 2% of the tabular optimum, confirming the validity of the aggregate state representation. At N = 16, the DQN-aggregate approach achieves a near-identical threshold (k = 84 vs. 85) while converging 62% faster and using 66% less computation time, as the polynomial state space enables more efficient exploration. At N = 32, where tabular methods are infeasible due to a state space of approximately 10^18, the DQN-aggregate method successfully converges to k = 163 ≈ 5.1N, consistent with the linear scaling heuristic k ≈ 5N observed in the tabular experiments. The per-server reward (~23.8) also remains consistent with the normalized performance at smaller scales. While these preliminary results are encouraging, a comprehensive investigation of the DQN-aggregate approach, including hyperparameter sensitivity, convergence guarantees, and performance under non-stationary environments, is an important direction for future work.
Convergence Scaling: Convergence episodes scale approximately as O(N^2), while wall-clock time scales approximately as O(N^3) due to the per-iteration cost growth from the expanding state space. The learning rate schedule φ_k = log(k+1)/(k+1) maintains stability across all tested scales.
Threshold Scaling: The optimal threshold scales approximately linearly with N: k ≈ 5N for the tested parameter regime. This suggests that a consistent “queue slots per MEC server” heuristic emerges from the optimization.
Normalized Performance: The reward per MEC server remains relatively constant (~24–25 per server) across different N, indicating that the threshold policy structure generalizes effectively to larger systems.

5.5.3. Queue Capacity Sensitivity

Testing M ∈ {10, 50, 100, 500} reveals that the learned threshold k saturates: k = 10 for all M ≥ 50. This indicates that the optimal policy naturally limits the queue length well below capacity, validating the cost function's role in preventing congestion.

5.6. Statistical Validation

5.6.1. Hypothesis Testing

We conducted paired t-tests comparing Q-learning against each baseline policy across ten independent runs. Q-learning significantly outperforms all baselines ( p < 0.001 ) except SARSA ( p = 0.062 ) and DQN ( p = 0.135 ), which are statistically indistinguishable at the 95% confidence level.

5.6.2. Effect Size Analysis

The Cohen’s d effect sizes comparing Q-learning to baselines are as follows: FIXED-k = 10 (d = 1.24; large), GREEDY (d = 2.17; very large), and RANDOM (d = 5.43; extremely large). These large effect sizes indicate practical significance beyond statistical significance.

5.6.3. Convergence Stability

The standard deviation of final Q-values across runs decreases with training: σ = 8.3 at episode 5000, σ = 2.1 at episode 10,000, and σ = 0.8 at episode 20,000. This confirms the stable convergence to consistent policies.

5.6.4. Sensitivity to Initial Conditions

To address whether the convergence of different RL algorithms to the same threshold depends on initialization, we conducted systematic experiments varying Q-table initialization strategies across ten independent runs each:
Initialization Methods Tested: (i) Zero initialization (Q_0(s, a) = 0 for all state–action pairs), (ii) optimistic initialization (Q_0(s, a) = R_max, encouraging exploration), (iii) pessimistic initialization (Q_0(s, a) = −R_max), and (iv) random initialization (Q_0(s, a) ∼ U[−R_max, R_max]).
Results: All four initialization strategies converge to the same optimal threshold k = 10 for Q-learning, SARSA, and DQN, though the convergence speed varies. Optimistic initialization requires approximately 30% fewer episodes to converge (1750 vs. 2500 for Q-learning), as it naturally encourages exploration of under-visited states. Pessimistic initialization is the slowest (3200 episodes) but the most conservative during learning. Random initialization shows the highest variance across runs during early training (the coefficient of variation is 18% at episode 1000) but converges to the same final threshold. These results confirm that the threshold k = 10 is the unique optimal solution, independent of initial conditions, as guaranteed by the contraction property of the Q-learning update rule under the Robbins–Monro conditions satisfied by our learning rate schedule φ_k = log(k+1)/(k+1).
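For reference, the learning-rate schedule and the four initialization strategies can be written compactly as follows (a sketch; the visit count k is assumed to start at 1 so that the first step size is nonzero, and the helper names are ours).

```python
import numpy as np

def phi(k: int) -> float:
    """Learning-rate schedule phi_k = log(k+1)/(k+1) for visit count k >= 1."""
    return np.log(k + 1) / (k + 1)

def init_q_table(n_states, n_actions, mode="zero", r_max=10.0, seed=0):
    """The four initialization strategies tested above (illustrative helper)."""
    rng = np.random.default_rng(seed)
    if mode == "zero":
        return np.zeros((n_states, n_actions))
    if mode == "optimistic":
        return np.full((n_states, n_actions), r_max)
    if mode == "pessimistic":
        return np.full((n_states, n_actions), -r_max)
    return rng.uniform(-r_max, r_max, (n_states, n_actions))  # "random" mode
```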

5.7. Discussion and Practical Implications

From our simulation results, the key insights are as follows.

5.7.1. Threshold Policy Validation

The consistent recovery of threshold policies across multiple learning algorithms, parameter regimes, and system scales provides a strong model-free validation of our theoretical analysis in Section 3. Rather than constituting algorithmic innovation in RL, the learning experiments confirm that the analytically derived threshold structure is the genuine optimum: standard RL methods re-discover it without being guided toward it.

5.7.2. Computational Feasibility

Training times (0.14–112 min for N = 2–16) demonstrate practical feasibility for offline policy optimization. Once learned, policies are executed in microseconds, enabling real-time admission control.

5.7.3. Adaptability

The model-free Q-learning approach successfully adapts to parameter mismatches, imperfect sensing, and switching costs, retaining 85–93% of its optimal performance under realistic imperfections. This robustness is crucial for real-world deployments where perfect system knowledge is unattainable.

5.7.4. Scalability Limits

Tabular Q-learning scales well to N ≤ 16 MEC servers. For larger systems (N > 16), our theoretical framework remains valid, but function approximation methods (e.g., deep Q-networks with appropriate state representations) would be required for computational tractability.

5.7.5. Design Guidelines

For practitioners, our results suggest the following: (1) use Q-learning for moderate systems (N ≤ 8) due to sample efficiency, (2) use SARSA when training time is critical, (3) consider DQN for very large systems (N > 16) with state abstraction, and (4) set the initial threshold estimate as k ≈ 5N for faster convergence.

5.7.6. Scalable Learning for Large MEC Systems

While tabular Q-learning is effective for N 16 , production CIoT deployments may involve tens to hundreds of MEC servers, necessitating scalable learning architectures. We outline three promising directions grounded in our theoretical results.
Function Approximation with Threshold Bias. Section 5.5 introduced a DQN-aggregate representation that reduces the state dimension from O(3^N) to O(N^2 · M). The analytically proven threshold structure provides a strong inductive bias: the decision boundary is a single hyperplane in the aggregate state space, so even simple function approximators (e.g., shallow networks or linear models over hand-crafted features) can represent the optimal policy. This contrasts with general MEC offloading problems, where the policy landscape may be highly nonlinear.
Actor–Critic Methods. For systems where even the aggregate state space is large, actor–critic algorithms (e.g., Advantage Actor–Critic (A2C); Proximal Policy Optimization (PPO)) offer stable policy gradient updates with a lower variance than pure policy gradient methods. The actor can be parameterized as a threshold policy (a single scalar k per server configuration), while the critic estimates the value function. This architecture naturally preserves the structural constraint from our theoretical analysis and reduces the policy search to a low-dimensional space regardless of the number of MEC servers.
Policy Gradient with Structural Constraints. An alternative is to directly optimize the threshold parameter(s) via policy gradient methods (e.g., REINFORCE with a baseline). Because our theory guarantees that the optimal policy lies within the class of threshold policies, the search space is O(M) per server configuration rather than the full action space, enabling efficient gradient estimation even at large scale. Combining this with the linear scaling heuristic k ≈ 5N as a warm-start initialization could further accelerate convergence, as sketched below.
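As a conceptual sketch of this direction, the snippet below applies a REINFORCE-style update to a single threshold parameter θ, smoothed with a sigmoid of temperature τ so that the policy gradient is well defined; the parameterization, step size, and function names are our illustrative choices, not an implementation evaluated in this paper.

```python
import numpy as np

def offload_prob(k: int, theta: float, tau: float = 1.0) -> float:
    """Smoothed threshold policy: P(OFFLOAD | queue k) = sigmoid((theta - k)/tau)."""
    return 1.0 / (1.0 + np.exp(-(theta - k) / tau))

def reinforce_step(theta, episode, lr=0.01, tau=1.0):
    """One REINFORCE update on the scalar threshold theta (no baseline term).
    `episode` is a list of (k, action, return_) tuples with action 1 = OFFLOAD.
    For this Bernoulli-sigmoid policy, d/dtheta log pi(a|k) = (a - p) / tau."""
    grad = sum((a - offload_prob(k, theta, tau)) / tau * g
               for k, a, g in episode)
    return theta + lr * grad

theta = 10.0  # warm start near k ~ 5N for N = 2
```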
These directions remain to be fully evaluated, as discussed in the future work in Section 6, but the structural guarantees established in this paper substantially constrain the policy class and thereby simplify the scalable learning problem.

6. Conclusions

This paper presents an analysis of optimal security task offloading for Cognitive IoT (CIoT) systems, where multiple security tasks (STs) and critical threats (CTs) are managed across multiple MEC servers. Using a Continuous-Time Markov Decision Process (CTMDP) framework, we showed that the optimal admission policy follows a threshold structure, adapting classical queuing optimization techniques [23] to the CIoT security domain with heterogeneous servers and preemptive threats.

6.1. Contributions

Theoretical Proof: We established the optimality of threshold policies for multi-server security task allocation under discounted reward criteria.
Empirical Validation via RL: We applied standard model-free Q-learning [24] to recover optimal thresholds without any knowledge of the system's parameters; the consistent convergence of Q-learning, SARSA, and DQN to the analytically predicted threshold serves as independent validation of the theoretical structure.
Scalability and Efficiency: The three RL algorithms (Q-learning, SARSA, and DQN) achieve statistically indistinguishable final rewards, with Q-learning exhibiting the fastest convergence (2500 episodes vs. 3500 for SARSA and 4000 for DQN). While the state space grows as O(3^N), the system remains tractable for deployments of up to 16 MEC servers.
DQN Comparison: While DQN achieves a final reward comparable to tabular Q-learning, it requires 60% more episodes to converge (4000 vs. 2500) in the tabular setting. However, the DQN-aggregate approach demonstrates clear advantages for larger systems (N > 16), where tabular methods become infeasible, as listed in Table 7.
Practical Constraints: We confirmed that the threshold structure remains effective under practical imperfections, including imperfect sensing, parameter estimation errors, and MEC server switching costs. With a switching cost of c_switch = 0.5, the optimal threshold increases by 2–3 while the average reward decreases by 8.2%, yet the policy still outperforms the GREEDY baseline by approximately 13%.
The numerical and simulation results for the dual-MEC-server case (N = 2) showed strong alignment. The optimal threshold of k = 10 was discovered consistently by all three learning algorithms (Q-learning, SARSA, and DQN), matching the best-performing fixed-threshold policy and confirming the theoretical predictions of Section 3. Our analysis reveals super-linear growth in convergence episodes (O(N^2)) and wall-clock time (O(N^3)), providing a clear roadmap for scaling to larger deployments. In summary, this work provides an analytical foundation and a validated practical framework for IoT security task management. By combining structural optimality results with model-free learning that independently confirms those results, the proposed threshold-based approach offers a principled solution for CIoT network security offloading.

6.2. Limitations

Evaluation Scope. The scalability analysis presented in Section 5.5 examines N = 2 to 16 servers using tabular methods and provides preliminary DQN-aggregate validation up to N = 32 , as shown in Table 7. However, the evaluation has not yet comprehensively assessed function approximation methods at larger scales ( N > 32 ), including full hyperparameter tuning and convergence analysis. The evaluation uses synthetic data generated from the Poisson/exponential model instead of real-world IoT security traces. Running trace-driven simulations on datasets from operational IoT networks, such as smart home gateways, industrial IoT sensor networks, or campus network intrusion detection systems, would strengthen the evidence of practical applicability. The set of baseline policies covers the main categories (learned, fixed-threshold, greedy, and random), but it does not compare with recently proposed deep RL-based offloading methods [27] or multi-agent approaches.
Modeling Assumptions. The theoretical framework relies on Poisson arrival processes and exponential service distributions. However, real-world cyber attacks often exhibit bursty and correlated patterns that are not captured by these assumptions. The optimization objective does not explicitly consider communication delay, energy consumption, or offloading failure rates. Additionally, the current model assumes a single shared FIFO queue across all MEC servers, which may not accurately represent geographically distributed deployments where per-server or hierarchical queue architectures are more appropriate.

6.3. Future Work

Several avenues warrant further investigation. First, extend the theoretical framework to include non-Poisson arrival processes, such as Markov-modulated Poisson processes, to capture bursty attack patterns. Consider also non-exponential service distributions, such as phase-type distributions, to enhance the model’s applicability. Second, directly include communication delay, energy consumption, and offloading failure rates in the optimization objective. This could produce policies better suited for deployment. Third, examine per-server or hierarchical queue architectures to better reflect the geographic distribution of distributed systems. Fourth, advance the scalable learning approaches described in Section 5. These include actor–critic, policy gradient, and structured function approximation methods that provide theoretical threshold guarantees. Such enhancements are critical for production environments with N > 16 MEC servers. Fifth, develop multi-agent extensions that enable multiple IoT devices to coordinate offloading decisions. This addresses a significant practical scenario not covered in the present study. Finally, validate the proposed approach using real-world IoT security traces from operational networks. Collaborate with IoT testbed operators and benchmark against a wider range of state-of-the-art methods to further demonstrate practical relevance.

Author Contributions

Conceptualization, N.W. and Y.R.; methodology, N.W. and Y.R.; validation, N.W. and Y.R.; formal analysis, N.W.; investigation, N.W.; writing—original draft preparation, N.W.; writing—review and editing, Y.R.; supervision, Y.R.; project administration, Y.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lueth, K.L. State of IoT 2024: Number of Connected IoT Devices Growing 13% to 18.8 Billion Globally. IoT Analytics Research. 28 October 2025. Available online: https://iot-analytics.com/number-connected-iot-devices/ (accessed on 15 January 2026).
  2. Gkonis, P.; Giannopoulos, A.; Trakadas, P.; Masip-Bruin, X.; D’Andria, F. A survey on IoT-edge-cloud continuum systems: Status, challenges, use cases, and open issues. Future Internet 2023, 15, 383. [Google Scholar] [CrossRef]
  3. Ni, J.; Zhang, K.; Lin, X.; Shen, X.S. Securing fog computing for internet of things applications: Challenges and solutions. IEEE Commun. Surv. Tutor. 2017, 20, 601–628. [Google Scholar] [CrossRef]
  4. Rajesh, R.; Hemalatha, S.; Nagarajan, S.M.; Devarajan, G.G.; Omar, M.; Bashir, A.K. Threat detection and mitigation for tactile internet driven consumer IoT-healthcare system. IEEE Trans. Consum. Electron. 2024, 70, 4249–4257. [Google Scholar]
  5. Xiao, Y.; Jia, Y.; Liu, C.; Cheng, X.; Yu, J.; Lv, W. Edge computing security: State of the art and challenges. Proc. IEEE 2019, 107, 1608–1631. [Google Scholar] [CrossRef]
  6. Wu, Q.; Ding, G.; Xu, Y.; Feng, S.; Du, Z.; Wang, J.; Long, K. Cognitive Internet of Things: A new paradigm beyond connection. IEEE Internet Things J. 2014, 1, 129–143. [Google Scholar] [CrossRef]
  7. Zhang, Y.; Ma, X.; Zhang, J.; Hossain, M.S.; Muhammad, G.; Amin, S.U. Edge Intelligence in the Cognitive Internet of Things: Improving Sensitivity and Interactivity. IEEE Netw. 2019, 33, 58–64. [Google Scholar] [CrossRef]
  8. Wang, D.; Bakar, K.B.A.; Isyaku, B.; Eisa, T.A.E.; Abdelmaboud, A. A comprehensive review on internet of things task offloading in multi-access edge computing. Heliyon 2024, 10, e29916. [Google Scholar] [CrossRef]
  9. Hosseinpour, M.; Yaghmaee, M.H. Quality of experience aware computation offloading in MEC-enabled blockchain-based IoT networks. IEEE Internet Things J. 2023, 11, 14483–14493. [Google Scholar] [CrossRef]
  10. Tang, M.; Wong, V.W.S. Deep reinforcement learning for task offloading in mobile edge computing systems. IEEE Trans. Mob. Comput. 2020, 21, 1985–1997. [Google Scholar] [CrossRef]
  11. Dong, S.; Zhou, H. Task offloading strategies for mobile edge computing: A survey. Comput. Netw. 2024, 254, 110791. [Google Scholar] [CrossRef]
  12. Wang, D.; Bakar, K.B.A.; Isyaku, B. Two-stage IoT computational task offloading decision-making in MEC with request holding and dynamic eviction. Comput. Mater. Contin. 2024, 80, 2065. [Google Scholar] [CrossRef]
  13. Hossain, M.A.; Liu, W.; Ansari, N. Computation-efficient offloading and power control for MEC in IoT networks by meta reinforcement learning. IEEE Internet Things J. 2024, 11, 16722–16730. [Google Scholar] [CrossRef]
  14. Zhao, Y.; Xiang, Z.; Lu, Q. Performance evaluation for secondary users in finite-source cognitive radio networks with dynamic preemption limit. AEU-Int. J. Electron. Commun. 2022, 149, 154183. [Google Scholar] [CrossRef]
  15. Chi, J.; Qiu, T.; Xiao, F.; Zhou, X. ATOM: Adaptive task offloading with two-stage hybrid matching in MEC-enabled industrial IoT. IEEE Trans. Mob. Comput. 2023, 23, 4861–4877. [Google Scholar] [CrossRef]
  16. Naparstek, O.; Cohen, K. Deep multi-user reinforcement learning for distributed dynamic spectrum access. IEEE Trans. Wirel. Commun. 2018, 18, 310–323. [Google Scholar] [CrossRef]
  17. Kaur, A.; Kumar, K. Intelligent spectrum management based on reinforcement learning schemes in cooperative cognitive radio networks. Phys. Commun. 2020, 43, 101226. [Google Scholar] [CrossRef]
  18. Zhao, D.; Qin, H.; Song, B.; Han, B.; Du, X.; Guizani, M. A graph convolutional network-based deep reinforcement learning approach for resource allocation in a cognitive radio network. Sensors 2020, 20, 5216. [Google Scholar] [CrossRef] [PubMed]
  19. Liu, S.; Pan, C.; Zhang, C.; Yang, F.; Song, J. Dynamic spectrum sharing based on deep reinforcement learning in mobile communication systems. Sensors 2023, 23, 2622. [Google Scholar] [CrossRef]
  20. Ukpong, U.C.; Idowu-Bismark, O.; Adetiba, E.; Kala, J.R.; Owolabi, E.; Oshin, O.; Abayomi, A.; Dare, O.E. Deep reinforcement learning agents for dynamic spectrum access in television whitespace cognitive radio networks. Sci. Afr. 2025, 27, e02523. [Google Scholar] [CrossRef]
  21. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  22. Zhao, Y.; Xiang, Z. A multichannel allocation strategy based on preemption threshold and preemption probability in cognitive radio networks. Mob. Inf. Syst. 2021, 2021, 6190872. [Google Scholar] [CrossRef]
  23. Lippman, S. Applying a new device in the optimization of exponential queuing systems. Oper. Res. 1975, 23, 687–710. [Google Scholar] [CrossRef]
  24. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  25. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  26. Bashar, S.; Ding, Z. Admission control and resource allocation in a heterogeneous OFDMA wireless network. IEEE Trans. Wirel. Commun. 2009, 8, 4200–4210. [Google Scholar] [CrossRef]
  27. Nieto, G.; de la Iglesia, I.; Lopez-Novoa, U.; Perfecto, C. Deep reinforcement learning techniques for dynamic task offloading in the 5G edge-cloud continuum. J. Cloud Comput. 2024, 13, 94. [Google Scholar] [CrossRef]
Figure 1. CIoT network architecture with MEC servers for security task processing.
Figure 2. Learning curves showing average reward per episode (500-episode moving average) for Q-learning, SARSA, and DQN.
Table 2. Consolidated reproducibility parameters.

| Category | Parameter | Value |
|---|---|---|
| Environment | MEC servers N | 2 (baseline); {2, 4, 8, 16} (tabular scalability); {4, 8, 16, 32} (DQN-agg.) |
| | Max queue capacity M | 100 (baseline); {10, 50, 100, 500} (sensitivity) |
| | ST arrival rate λ | 2.0 (baseline); [0.5, 8.0] (load sensitivity); λ = N (scalability, Table 6) |
| | CT arrival rates α_i | 1.0 per server |
| | CT service rates β_i | 2.0 per server |
| | ST service rates μ_1, μ_2 | 3.0, 2.5 |
| | Reward R | 10.0 |
| | Discount factor γ | 0.95 (baseline); sensitivity sweep {0.8, 0.9, 0.95, 0.99} |
| | Cost function f(k) | 0.5k^2 (baseline); a·k^b with a ∈ {0.1, 0.5, 1.0, 2.0}, b ∈ {1.0, 1.5, 2.0} (sensitivity) |
| Training | Episodes | 20,000 (baseline, N = 2); increased for larger N (see Table 6) |
| | Decision epochs per episode | 50 (max) |
| | Independent runs | 10 |
| | Random seeds | {0, 1, 2, …, 9} |
| | Bootstrap samples (CI) | 1000 |
| Exploration (all learning algorithms) | Exploration ε (initial) | 0.3 |
| | Exploration ε (final) | 0.01 |
| | ε-decay factor | 0.995 per episode |
| Q-learning/SARSA | Learning rate φ_k | log(k+1)/(k+1) |
| | Q-table initialization | Q_0(s, a) = 0 (default; see Section 5.6 for alternatives) |
| DQN-Aggregate (Section 5.5) | Hidden layers | 3 (128-64-32 neurons, ReLU) |
| | Experience replay buffer | 10,000 transitions |
| | Mini-batch size | 32 |
| | Target network update | Every 500 steps |
| Hardware | Processor | Intel Core i7-10700K (8 cores, 3.8 GHz) |
| | RAM | 32 GB DDR4 |
| | OS | Ubuntu 20.04 LTS |

The DQN architecture listed is for the DQN-Aggregate approach described in Section 5.5, which uses the aggregate state features (N_0, N_1, and k). The baseline DQN in Table 3 employs experience replay [25] with the same hidden-layer configuration but takes the full state vector (n, k) as the input.
Table 4. Sensitivity to discount factor γ.

| γ | Avg Reward | Learned k | Blocking Prob. |
|---|---|---|---|
| 0.80 | 45.32 ± 2.1 | 12 | 0.075 ± 0.009 |
| 0.90 | 47.15 ± 1.9 | 11 | 0.079 ± 0.008 |
| 0.95 | 48.53 ± 1.8 | 10 | 0.082 ± 0.008 |
| 0.99 | 49.21 ± 2.0 | 9 | 0.088 ± 0.010 |
Table 5. Robustness to imperfect sensing.

| P_FA | P_MD | Avg Reward | Performance Retained |
|---|---|---|---|
| 0.00 | 0.00 | 48.53 ± 1.8 | 100% (baseline) |
| 0.01 | 0.01 | 47.82 ± 1.9 | 98.5% |
| 0.05 | 0.05 | 45.15 ± 2.2 | 93.0% |
| 0.10 | 0.10 | 41.33 ± 2.5 | 85.2% |
Table 6. Scalability analysis: state space size and convergence.

| N | State Space | Conv. Episodes | Time (min) | Learned k |
|---|---|---|---|---|
| 2 | ~10^3 | 2500 | 0.14 | 10 |
| 4 | ~10^5 | 8500 | 2.45 | 21 |
| 8 | ~10^7 | 35,000 | 18.3 | 42 |
| 16 | ~10^10 | 145,000 | 112.5 | 85 |
Table 7. Preliminary DQN-aggregate validation: comparison with tabular Q-learning.

| N | Method | Learned k | Avg Reward | Conv. Episodes | Time |
|---|---|---|---|---|---|
| 4 | Tabular Q | 21 | 98.4 ± 2.1 | 8500 | 2.45 min |
| 4 | DQN-Agg | 21 | 96.8 ± 2.8 | 12,000 | 3.1 min |
| 8 | Tabular Q | 42 | 195.1 ± 3.5 | 35,000 | 18.3 min |
| 8 | DQN-Agg | 42 | 191.7 ± 4.2 | 28,000 | 12.8 min |
| 16 | Tabular Q | 85 | 389.2 ± 5.8 | 145,000 | 112.5 min |
| 16 | DQN-Agg | 84 | 382.5 ± 6.1 | 55,000 | 38.7 min |
| 32 | Tabular Q | Infeasible (state space ~10^18) | N/A | N/A | N/A |
| 32 | DQN-Agg | 163 | 762.3 ± 8.9 | 50,000 | 45.2 min |