DRL-TinyEdge: Energy- and Latency-Aware Deep Reinforcement Learning for Adaptive TinyML at the 6G Edge

Alaklabi, Saad; Alharbi, Saleh

doi:10.3390/fi18010031

Open AccessArticle

DRL-TinyEdge: Energy- and Latency-Aware Deep Reinforcement Learning for Adaptive TinyML at the 6G Edge

by

Saad Alaklabi

^1,*

and

Saleh Alharbi

²

¹

Information Systems Department, College of Computing and Information Technology, Shaqra University, Shaqra 51115, Saudi Arabia

²

Information Technology Department, College of Computing and Information Technology, Shaqra University, Shaqra 51115, Saudi Arabia

^*

Author to whom correspondence should be addressed.

Future Internet 2026, 18(1), 31; https://doi.org/10.3390/fi18010031

Submission received: 17 November 2025 / Revised: 16 December 2025 / Accepted: 19 December 2025 / Published: 4 January 2026

(This article belongs to the Special Issue Cloud/Edge Computing for Next-Generation Networks: Architecture and Applications)

Download

Browse Figures

Versions Notes

Abstract

Various TinyML models face a constantly challenging environment when running on emerging sixth-generation (6G) edge networks, with volatile wireless environments, limited computing power, and highly constrained energy use. This paper introduces DRL-TinyEdge, a latency- and energy-sensitive deep reinforcement learning (DRL) platform optimised for the 6G edge of adaptive TinyML. The suggested on-device DRL controller autonomously decides on the execution venue (local, partial, or cloud) and model configuration (depth, quantization, and frequency) in real time to trade off accuracy, latency, and power savings. To assure safety during adaptation to changing conditions, the multi-objective reward will be a combination of p95 latency, per-inference energy, preservation of accuracy and policy stability. The system is tested under two workloads representative of classical applications, including image classification (CIFAR-10) and sensor analytics in an industrial IoT system, on a low-power platform (ESP32, Jetson Nano) connected to a simulated 6G mmWave testbed. Findings indicate uniform improvements, with up to a 28 per cent decrease in p95 latency and a 43 per cent decrease in energy per inference, and with accuracy differences of less than 1 per cent compared to baseline models. DRL-TinyEdge offers better adaptability, stability, and scalability when using a CPU < 5 and a decision latency < 10 ms, compared to Static-Offload, Heuristic-QoS, or TinyNAS/QAT. Code, hyperparameter settings, and measurement programmes will also be published at the time of acceptance to enable reproducibility and open benchmarking.

Keywords:

edge intelligence; TinyML; deep reinforcement learning; 6G; adaptive offloading; Energy–Latency Optimization

1. Introduction

The integration of sixth-generation (6G) wireless networks and Tiny Machine Learning (TinyML) has emerged as a key enabler for intelligent, autonomous, and low-latency edge computing systems [1]. The 6G vision emphasises native network intelligence, ultra-reliable communication, and seamless integration of sensing, computation, and communication across heterogeneous edge infrastructures [2,3]. In such environments, edge devices are required to support real-time inference under strict constraints on computational power, memory capacity, and energy availability, while operating over highly dynamic wireless links [4,5]. Conventional centralised and static optimisation approaches are insufficient in these settings, as they fail to cope with fluctuating network conditions, varying computational loads, and diverse application requirements [6,7].

Recent progress in TinyML has enabled the deployment of neural networks on highly resource-constrained microcontroller-class devices, with memory footprints reaching only a few kilobytes [8,9]. To support such deployments, a wide range of model optimisation and compression techniques have been developed, including efficient architectural design, pruning, quantization, and hardware-aware optimisation [10]. However, most of these techniques assume relatively stable execution environments and do not explicitly consider the communication dynamics inherent to mobile and wireless edge systems [11]. Moreover, optimising TinyML applications in realistic edge environments often requires balancing multiple, conflicting objectives such as inference latency, energy consumption, computational efficiency, and communication reliability, which cannot be adequately addressed by single-objective or static optimisation schemes [12,13].

Deep Reinforcement Learning (DRL) has been widely recognised as an effective approach for solving complex, multi-objective optimisation problems in dynamic and uncertain environments [14,15]. By interacting with the environment and learning from feedback, DRL agents can adaptively optimise long-term performance without requiring explicit system models [16]. DRL-based control strategies have been successfully applied across diverse domains, including power electronics, industrial control, and cyber–physical systems, demonstrating superior adaptability compared to traditional control methods [17]. In the context of edge computing, DRL has primarily been employed for resource allocation, task scheduling, caching, and computation offloading decisions in heterogeneous networks [18,19].

Parallel to advances in DRL-driven edge optimisation, the TinyML research community has produced a growing body of foundational work focusing on system architectures, benchmarking, and deployment challenges for ultra-low-power machine learning [20,21]. Efficient neural network architectures, such as MobileNets, have been proposed to reduce computational complexity while maintaining acceptable accuracy for edge vision applications [22]. Similarly, model scaling strategies, exemplified by EfficientNet, aim to achieve optimal trade-offs between depth, width, and resolution for resource-efficient inference [23]. Once-for-All networks further extend this idea by enabling a single trained super-network to support multiple deployment configurations without retraining, thereby improving deployment flexibility [24].

Quantization techniques play a central role in TinyML deployment by reducing model size and computational cost through lower-precision arithmetic [25]. Recent advances in post-training and hardware-aware quantization methods have further improved the accuracy–efficiency trade-off, making quantized inference feasible even for highly constrained edge devices and emerging neural paradigms such as spiking neural networks [26]. Despite these advances, existing TinyML optimisation techniques remain largely static and disconnected from runtime network conditions and device states. As a result, they are unable to dynamically adapt model execution strategies in response to variations in latency, energy availability, or communication quality.

These limitations motivate the need for adaptive edge intelligence frameworks that can jointly optimise TinyML model configuration, execution venue, and communication behaviour at runtime. Addressing this gap, the proposed DRL-TinyEdge framework introduces an end-to-end deep reinforcement learning solution designed to dynamically optimise TinyML deployments on 6G edge devices under stringent energy and latency constraints.

Contributions. The main contributions of this work are summarised as follows:

Lightweight on-device DRL controller: jointly selects the execution venue and TinyML model configuration under dynamic energy and latency constraints, enabling fully autonomous adaptation at the edge.
Multi-objective reward with stability term (SwitchCost): balances latency, energy, and accuracy objectives while suppressing rapid policy oscillations to ensure stable decisions.
Comprehensive evaluation: validated across vision and audio workloads under emulated 6G link volatility using named baselines—Static-Offload, Heuristic-QoS, and TinyNAS/QAT—on MCU and SoC testbeds.
Performance outcomes: achieves Pareto-superior improvements with 21–28% reduction in p95 latency and 37–43% reduction in energy per inference at comparable accuracy levels ( $\leq 1 %$ difference).

Figure 1 illustrates the end to end workflow of the proposed adaptive TinyML edge intelligence framework. The system starts by collecting heterogeneous input data including real time network conditions, sensor data streams, computational constraints, and user application requests. These inputs are processed by a TinyML inference module that runs lightweight models while continuously monitoring device resources. A deep reinforcement learning optimization core then analyzes system states and updates policies to select the most suitable model configurations. Based on these decisions, dynamic model configuration is performed through compression, quantization, and real time hyperparameter tuning. Finally, the optimized TinyML models are deployed on 6G enabled edge nodes, ensuring efficient, adaptive, and resource aware edge intelligence.

Unlike TinyNAS, OFA, or QAT—which focus only on static architecture compression or quantisation—DRL-TinyEdge performs joint optimisation across model structure, execution venue, and communication protocol in real time. These existing approaches cannot respond to dynamic 6G variations in SNR, RTT, CPU temperature, or battery state. DRL-TinyEdge introduces a multi-objective reward, stability control (SwitchCost), and a lightweight on-device controller that runs within <7% CPU and <1 MB RAM, enabling continuous adaptation per decision cycle. This joint optimisation capability is absent in TinyNAS, OFA, or QAT, and forms the core novelty of our framework.

The rest of this paper will be structured as follows. Section 2 reviews related literature on TinyML optimisation and DRL implementation in edge computing. Section 3 contains the methodology description of the DRL-TinyEdge framework. Section 4 presents the mathematical formulation and algorithmic elements. Section 5 presents a detailed analysis of the experiments’ performance and results. Section 6 tells about the implications and limitations of the suggested approach. Lastly, Section 7 provides future research directions for the paper.

2. Related Work

Optimisation of TinyML and edge computing has remained an area of research interest over the last few years, and other strategies have been proposed to address the challenge of executing machine learning models on resource-constrained devices [20,21].

2.1. TinyML Optimisation Techniques

Neural Architecture Search (NAS) has been widely used to optimise TinyML models for deployment at the edge. MobileNets proposed depthwise separable convolutions to lower computational cost and improve accuracy in a model [22]. The compound scaling techniques proposed by EfficientNet collectively optimise the depth, width, and resolution of a network to achieve a higher efficiency–accuracy tradeoff [23]. Most recently, Once-for-All networks allow training a single super-network that can be deployed with various architectural changes without retraining [24].

Another important optimisation method developed for TinyML deployment is Quantisation-Aware Training (QAT). Post-training quantisation techniques reduce model size and computational requirements by coalescing floating-point weights into lower-precision representations [25]. Dynamic quantisation methods adjust the level of precision based on the input conditions and computational constraints [26]. Nevertheless, such fixed optimisation methods are not suitable for responding to changing deployment conditions [27].

Pruning, distillation, and low-rank approximation are widely used model compression methods in TinyML. Magnitude-based pruning eliminates weights with small absolute values to reduce model complexity [28]. Knowledge distillation has been shown to transfer knowledge from large teacher models to small student networks that can be deployed at the edge [29]. Matrix factorisation processes represent weight matrices as low-rank matrices to reduce computation [30].

Clarification on TinyML vs. Edge-AI Devices:

In this work, the term TinyML refers specifically to machine learning models optimised to run on microcontroller-class hardware, such as ESP32 devices with kilobytes to a few megabytes of memory. However, our experimental testbed also includes Jetson Nano and Raspberry Pi 4, which are not TinyML devices but rather Edge-AI platforms with significantly higher compute and memory capacity. These devices are included to evaluate the scalability of DRL-TinyEdge across heterogeneous edge environments. TinyML optimisations (NAS, pruning, QAT, distillation) remain applicable only to MCU-class deployments, while Jetson and Raspberry Pi serve as higher-tier baselines to demonstrate that the proposed DRL controller generalises across multiple hardware tiers rather than being restricted to microcontrollers.

2.2. Edge Computing Optimisation

Allocation of resources in an edge computing environment has been modelled as several optimisation problems. Optimisation of task placement and resource utilisation in multi-access edge computing has been performed using mixed-integer programming tools [31]. Game-theoretic models have also been proposed to address competition for resource allocation among multiple edge applications [32]. Auction systems have been developed to facilitate efficient trading of resources in edge computing marketplaces [33].

Edge environment Federated learning has become a well-known distributed optimisation paradigm. FedAvg proposed local model training and global aggregation to minimise communication overhead [34]. Adaptive federated optimisation techniques can adjust learning rates and aggregation strategies based on data heterogeneity and communication constraints [35]. Individualised federated learning enables customisation of global models to local edge cases [36].

2.3. Deep Reinforcement Learning in Edge Computing

Edge computing RL has been chiefly applied to resource management and task scheduling. The use of deep Q-Networks (DQN) to optimise decisions for offloading computations in mobile edge computing has been demonstrated [37]. Actor–Critic approaches are currently featured in the literature for jointly optimising task scheduling and resource allocation in heterogeneous edge environments [38]. The coordinated problem of distributed edge computing systems has been solved using multi-agent reinforcement learning strategies [39].

Optimising communication via DRL is also of interest for wireless edge networks. PPO has been used to optimise the beamforming and power allocation in massive MIMO systems [40], cognitive radio networks, Spectrum allocation, and interference management. Deep Deterministic Policy Gradient (DDPG) methods have been applied in cognitive radio networks [41]. The approaches, however, are generally aimed at optimising communication with no regard for computational optimisation [42].

There have been recent efforts to optimise standard computations and communication in edge operations. Several multi-objective optimisation models have been proposed to balance latency, power consumption, and service quality in edge computing systems [43]. Task offloading and radio resource allocation techniques for mobile edge computing have been optimised using reinforcement learning [44]. Nevertheless, although these methods can be used to fine-tune models, the models used in TinyML applications require tight integration.

Recent work on efficient on device deep learning has focused on reducing compute and memory cost while keeping accuracy stable. Krishnamoorthi proposed practical post training and quantization aware quantization strategies for deep convolutional networks, showing that low precision inference can significantly speed up deployment with limited accuracy loss, which makes it a core technique for TinyML and edge inference systems [45]. Alongside quantization, scalable backbone design has also advanced rapidly. EfficientNet introduced compound scaling that jointly balances network depth, width, and input resolution to achieve strong accuracy to efficiency tradeoffs across model sizes, making it a common baseline for resource constrained deployments [43]. Similarly, MobileNetV3 combined hardware aware NAS with lightweight blocks and activation functions to deliver strong mobile performance under strict latency budgets [44].

Neural architecture search and once trained supernet strategies further push efficiency by tailoring models to target hardware. ProxylessNAS performed direct NAS on the target task and hardware without relying on proxy objectives, enabling architectures optimized for real device latency and resource constraints [45]. BigNAS then scaled NAS using single stage large models to obtain a family of subnetworks that can be specialized for different efficiency targets, supporting flexible deployment across heterogeneous devices [46]. Once for All extended this idea by training a single supernet once and then specializing subnetworks through progressive shrinking to meet diverse constraints such as latency, memory, and energy, which is highly suitable for dynamic edge settings where requirements change over time [47]. At a broader level, the AutoML survey by He et al. summarizes the growing ecosystem of automation in model selection, hyperparameter tuning, and NAS, and highlights how these methods reduce human effort while improving performance across varying deployment conditions [48]. Together, these studies form the foundation for adaptive TinyML systems that combine quantization, efficient backbone scaling, and hardware aware architecture and configuration search to achieve real time, resource aware inference on edge devices.

Recent work has explored DRL-based offloading and TinyML techniques in emerging edge systems. Bebortta et al. [49] proposed TinyDeepUAV, a DRL-driven offloading framework for UAV scenarios; however, it focuses on task-level decisions and does not incorporate model structural adaptation or multi-objective runtime optimisation. Abdulazeez and Askar [50] presented reinforcement-learning-based offloading for fog environments. However, these methods operate at coarse granularity and do not dynamically adjust quantisation, model depth, or device-level inference policies. In the TinyML domain, Heydari and Mahmoud [51] surveyed on-device learning challenges and highlighted the need for runtime adaptation and context-aware model control—gaps directly addressed by DRL-TinyEdge. Similarly, Khouas et al. [52] reviewed edge-side model training pipelines, emphasising the importance of lightweight control mechanisms for constrained hardware. Collectively, these recent studies illustrate growing interest in adaptive edge intelligence. However, none combine TinyML adaptation, communication-aware offloading, and on-device DRL control into a unified, resource-efficient framework as introduced in DRL-TinyEdge.

Several recent studies further highlight the rapid evolution of DRL-based optimisation in edge and IIoT environments. Reinfog [53] introduces a DRL-enabled framework for adaptive resource management across edge–cloud infrastructures, while Ma et al. [54] develop a multi-agent DRL scheme for joint task offloading and resource allocation under dynamic industrial priorities. In the vehicular domain, Almuseelem [55] proposes a security- and energy-aware DRL offloading framework for edge–cloud vehicular networks. Complementing these advances, Tsoukas et al. [56] provide an updated 2024 review of TinyML technology, emphasising microcontroller-level constraints and optimisation challenges. Although these works address resource scheduling, offloading, and TinyML deployment, none of them jointly integrate on-device TinyML model adaptation, real-time offloading decisions, and stability-constrained DRL control within a unified 6G-edge framework as proposed in DRL-TinyEdge.

What makes it different: Compared with previous studies that optimise either TinyML compression, e.g., via QAT, OFA, or TinyNAS, or edge offloading off the shelf, DRL-TinyEdge jointly optimises the execution venue and model structure using an on-device DRL controller. The framework is specifically sensitive to safety constraints, stability penalties (SwitchCost), and p95 cognisance of latency, and the outputs are stated of 100 per cent CPU, memory, as well as decision-time overheads- reproducible, productive output of lifelike 6G edge scenarios.

3. Proposed Methodology

This section provides a comprehensive design of the DRL-TinyEdge framework, which uses deep reinforcement learning and TinyML optimisation to achieve adaptive performance tuning for 6G edge systems. Besides the fact that, as illustrated in Figure 2, the framework comprises four interrelated components: the Environment Monitor, the DRL Optimisation Engine, the Model Adaptation Controller and the Performance Evaluator.

3.1. System Architecture Overview

The DRL-TinyEdge models operate as a closed-loop optimisation framework, continuously measuring environmental conditions at the edges and configuring TinyML models to achieve peak performance. Environment Monitor provides on-demand real-time data on network status, computation load, energy usage, and application-specific information. The DRL Optimisation Engine processes this information by using the multi-objective Deep Q-Network to obtain optimistic configuration parameters.

The Model Adaptation Controller implements the optimisation decisions, dynamically reconfiguring the model architecture, modifying hyperparameters, and altering the communicated protocols. The Performance Evaluator quantifies the effect of the changes made and provides feedback to the DRL agent via a well-designed reward function. This constant loop of adaptation keeps the framework functioning optimally despite changes in environmental conditions.

Complexity Consideration:

Although DRL-TinyEdge integrates a controller, environment monitor, and runtime model reconfiguration, its operational footprint is intentionally kept lightweight to remain suitable for TinyML-class devices. The DRL agent is a compact DQN with fewer than 30k parameters, consuming under 1.2 MB RAM and <7% CPU load on ESP32 during inference. The environment monitor samples only normalised scalar metrics (RTT, SNR, CPU load, battery level), requiring a processing overhead of <0.3 ms per cycle. Model reconfiguration is performed using precompiled subnetworks rather than on-device architecture search, thereby avoiding runtime NAS computation. As a result, the total decision overhead remains <5 ms, substantially lower than typical TinyML inference latency, and therefore fits within the constraints of low-power edge hardware.

3.2. Environment State Representation

The state space of the DRL agent is designed to provide a comprehensive yet lightweight representation of the dynamic edge environment, capturing the communication, computation, energy, and application-level conditions that jointly influence TinyML deployment decisions. At each decision step

t

, the agent observes a state vector that aggregates normalised metrics reflecting the instantaneous operating context of the edge device and the network, as formalised in Algorithm 1.

Algorithm 1 DRL-TinyEdge Optimisation Process

Input:
Initial TinyML model $M_{0}$ , environment state $s_{0}$ , hyperparameters $θ_{0}$
Output:
Optimised model configuration $M^{*}$ , optimal policy $π^{*}$
- Initialise
  ○
  Deep Q-Network (DQN) with random weights $Q (s, a; ϕ 0)$
  ○
  Target network $Q_{t a r g e t} (s, a; ϕ_{0} -)$
  ○
  Replay buffer $D$ with capacity $N$
  ○
  Environment monitor and performance evaluator
- For each episode $e = 1$ to $E_{\{\max\}}$ :
- a. Reset environment and observe initial state $s_{0}$ .
- b. For each step $t = 1$ to $T_{\{\max\}}$ :
- i. Action Selection
- - If $random () < ε ₜ$ : select random action $a_{t}$ from action space $A$ .
- - Else: select greedy action $a_{t} = \arg \max_{a} Q (s t, a; ϕ_{t})$ .
- ii. Execute Action
- - Apply action $a_{t}$ on TinyML model.
- - Observe next state $s_{t + 1}$ and reward $r_{t}$ .
- iii. Experience Replay
- - Store transition $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ in replay buffer $D$ .
- - If buffer size $∣ D ∣$ > batch_size:
- • Sample random batch from $D$ .
- • Compute target values using $Q_{target}$ .
- • Update main network parameters $ϕ_{t}$ via gradient descent.
- iv. Target Network Update
- - Every CCC steps: $ϕ_{t}^{-} \leftarrow ϕ_{t}$ .
- v. Exploration Decay
- - Update $ϵ_{t} = \max (ϵ_{\min}, ϵ_{t} \cdot δ)$ .
- vi. State Update
- - Set $s_{t} \leftarrow s_{t + 1}$ .
- After each episode
  ○
  Evaluate model performance on the validation set.
  ○
  Log performance metrics and update learning curves.
- Return: Optimised model $M^{*}$ and learned policy $π^{*}$ .

The network state component characterises the time-varying properties of the 6G wireless link. It includes received signal strength, signal-to-noise ratio, packet loss rate, available bandwidth, and round-trip time. These variables collectively describe link quality, congestion, and reliability, which directly affect offloading feasibility, transmission latency, and communication energy cost. By explicitly modelling these parameters, the DRL agent can adapt inference placement and communication strategies in response to fluctuating channel conditions, interference levels, and mobility-induced variability.

The computational state captures the current processing capability and load of the edge device. This includes CPU utilisation, memory occupancy, task queue length, and device temperature. CPU and memory utilisation reflect the availability of local resources for inference execution, while queue length provides insight into scheduling delays under concurrent workloads. Device temperature is included to account for thermal throttling effects, which can degrade performance and increase energy consumption. Together, these indicators allow the agent to assess whether local execution, partial offloading, or cloud execution is most suitable under the current hardware conditions.

The energy state is critical for battery-powered TinyML platforms. This component includes instantaneous power consumption, remaining battery capacity, and estimated energy cost per inference under different execution options. These metrics enable the DRL agent to explicitly reason about energy sustainability and to avoid decisions that could rapidly deplete the battery or violate energy budgets. Incorporating energy awareness into the state representation allows the agent to balance short-term performance gains against long-term operational viability.

Finally, the application performance state summarises task-level feedback, such as recent inference latency, achieved throughput, and accuracy indicators. This information provides closed-loop feedback on how previous actions have affected application-level objectives. By observing these performance signals, the agent can learn to adjust model configurations and execution policies to maintain service-level requirements, such as latency bounds or accuracy preservation, under changing environmental conditions.

All state variables are normalised to fixed ranges to ensure numerical stability during training and to maintain hardware-agnostic behaviour across heterogeneous platforms. This compact yet expressive state representation enables the DRL controller to make informed, real-time decisions while remaining lightweight enough for deployment on microcontroller-class devices.

3.3. Action Space Design

The action space of the DRL agent is designed to enable joint and fine-grained control over the key decision variables that influence TinyML deployment performance in dynamic 6G edge environments. At each decision step, the agent selects an action vector that simultaneously determines the model configuration, execution behaviour, and communication strategy, allowing coordinated optimisation across computation and networking dimensions.

Model Architecture Actions

The architectural action set governs the structural configuration of the TinyML model. These actions include adjusting the number of active layers, modifying channel width multipliers, selecting activation functions, and choosing quantisation levels (e.g., INT8 or FP16). Each architectural action directly impacts computational complexity, memory footprint, and inference accuracy. By exposing these parameters to the DRL agent, the framework enables runtime selection of lightweight or high-fidelity model variants depending on current resource availability and application requirements. To ensure feasibility on constrained devices, architectural actions are restricted to a predefined set of pre-compiled subnetworks, thereby avoiding expensive on-device neural architecture search.

Hyperparameter Actions

The hyperparameter action set controls training- and execution-related parameters that influence convergence and runtime efficiency. These actions include adjustments to learning rate schedules, batch size, and optimizer selection. Although TinyML inference is predominantly executed using pre-trained models, hyperparameter actions are relevant during adaptation phases and incremental fine-tuning scenarios. Allowing the agent to modify these parameters enables it to respond to changing environmental conditions, such as workload intensity or computational load, and to stabilise performance under non-stationary operating conditions.

Communication Protocol Actions

The communication action set governs how inference tasks and intermediate data are transmitted across the edge–cloud continuum. These actions include selection of transmission power levels, modulation and coding schemes, error-correction strategies, and packet scheduling policies. By adjusting these parameters, the agent can trade off communication reliability, latency, and energy consumption in response to varying channel quality and network congestion. This design allows the framework to coordinate computation and communication decisions rather than optimising them independently.

All actions are discretized and bounded to ensure safe exploration and compatibility with resource-constrained edge devices. Action masking is applied to prevent infeasible or unsafe configurations, such as actions that would violate latency or energy constraints.

3.4. Multi-Objective Reward Function

The reward function is central to guiding the DRL agent toward balanced and stable optimisation under competing performance objectives. The proposed framework employs a weighted multi-objective reward formulation that integrates inference accuracy, latency, energy consumption, communication reliability, and decision stability into a single scalar reward signal.

At each time step, the reward is computed as a weighted sum of normalised performance metrics. Accuracy rewards encourage preservation of model fidelity under adaptation, while latency rewards penalise excessive end-to-end inference delay, with particular emphasis on tail latency. Energy-related rewards explicitly discourage high per-inference energy consumption, promoting sustainable operation for battery-powered devices. Communication reliability rewards account for packet delivery success and link stability, ensuring robust inference execution under fluctuating network conditions.

To adapt to varying operational priorities, the reward formulation incorporates adaptive weighting mechanisms that dynamically adjust the importance of individual objectives. For example, when battery capacity falls below a predefined threshold, the weight assigned to energy efficiency is increased, encouraging more conservative execution strategies. Conversely, for latency-critical applications, higher priority can be assigned to inference responsiveness and accuracy.

In addition to performance rewards, the reward function includes penalty terms that enforce system constraints and enhance policy stability. Constraint-violation penalties discourage actions that exceed predefined latency or energy budgets. A stability-oriented penalty discourages frequent switching between execution venues or model configurations, thereby preventing oscillatory behaviour that could degrade system performance and reliability. This stability component ensures smoother policy evolution and more predictable runtime behaviour.

Together, these reward components enable the DRL agent to learn policies that achieve favourable trade-offs among accuracy, latency, energy efficiency, and reliability, while maintaining safe and stable operation in highly dynamic 6G edge environments.

3.5. Method: Lightweight and Safe Controller Design

Controller Footprint: The DRL controller is implemented and runs locally on the MCU/SoC, incurring no expected overhead. Each decision takes less than 5 ms and executes every 20 frames when the workload is on the vision, and every 200 samples when the workload is on the sensor.

Table 1 shows the computational area of the DRL-TinyEdge controller for familiar MCU and SoC boards. The controller itself is lightweight, with CPU usage below 7 per cent and RAM usage below 1 MB on any device. The adaptations can be realised and implemented in approximately 5 ms per decision, demonstrating that they are feasible and do not affect TinyML inference workloads visibly.

Safe Exploration An

ϵ

-greedy exploration policy with action shielding ensures that latency and energy never exceed preset SLA caps (100 ms and 50 mJ, respectively). When the predicted cost exceeds the cap, the action is rejected, and the agent retries with a safe backup policy.

Stability Mechanisms Target-network updates occur every

C = 100

steps; replay buffer size =

5 \times 10^{3}

; random seed = 42 for reproducibility.

4. Mathematical Modelling

This section presents the mathematical formulation of the DRL-TinyEdge optimisation problem, the formal definition of the Markov Decision Process (MDP), the multi-objective optimisation model, and the algorithmic elements of the solution based on deep reinforcement learning.

4.1. Problem Formulation

The TinyML optimisation problem in 6G edge environments is formulated as a multi-objective Markov Decision Process defined by the tuple

⟨ S, A, P, R, γ ⟩

, where

S

represents the state space,

A

denotes the action space,

P

defines the state transition probabilities,

R

specifies the reward function, and

γ

is the discount factor.

The state space

S

is defined as:

s_{t} = [s_{t}^{n e t}, s_{t}^{c o m p}, s_{t}^{e n e r g y}, s_{t}^{a p p}] \in S

(1)

where

s_{t}^{n e t}

represents network conditions,

s_{t}^{c o m p}

denotes the status of computational resources,

s_{t}^{e n e r g y}

indicates energy metrics, and

s_{t}^{a p p}

captures application-specific performance indicators.

The network state component is characterised as:

s_{t}^{n e t} = [R S S I_{t}, S N R_{t}, P L R_{t}, B W_{t}, L A T_{t}]

(2)

where

R S S I_{t}

is the received signal strength indicator,

S N R_{t}

represents the signal-to-noise ratio,

P L R_{t}

denotes the packet loss rate,

B W_{t}

indicates available bandwidth, and

L A T_{t}

represents communication latency.

The computational condition is represented as:

s_{t}^{c o m p} = [C P U_{t}, M E M_{t}, T E M P_{t}, Q U E U E_{t}]

(3)

where

C P U_{t}

represents processor utilisation,

M E M_{t}

denotes memory consumption,

T E M P_{t}

indicates device temperature, and

Q U E U E_{t}

represents the processing queue length.

The general optimisation is defined as a Markov Decision Process (MDP) in which the agent maintains the system state and chooses actions that minimise the system’s impact on accuracy and stability.

State Definition:

s_{t} = {{SNR}_{t}, {RTT}_{t}, {CPUtemp}_{t}, {Battery}_{t}, {WorkloadType}_{t}}

(4)

The state represents the actual network conditions (SNR, RTT), the device’s thermal/energy state (CPU temperature, battery remaining), and the workload (i.e., vision or inference sensor).

Action Definition:

a_{t} = {Local, Partial, Cloud} \times {ModelDepth, QuantizationLevel, FPS}

(5)

It is determined collectively which action to use where (local or offloaded in part or to the cloud), as well as the configuration of the TinyML model, such as inference depth, accuracy, and frame rate.

Reward Function:

r_{t} = - (α \cdot {Latency}_{t} + β \cdot {Energy}_{t}) + γ \cdot {Accuracy}_{t} - δ \cdot {SwitchCost}_{t}

(6)

This reward motivates low-latency, energy-efficient, and correct operation and acts as a disincentive to frequent policy changes through the stability term (

{SwitchCost}_{t}

).

The coefficients of the multi-objective reward function of DRL-TinyEdge are summarised in Table 2. The

α

,

β

,

γ

, and

δ

parameters were manually adjusted to achieve a trade-off between latency, energy, accuracy, and policy stability. The weights achieve low p95 latency and energy efficiency, with a trade-off of error

\leq 1 %

of the best-performing baseline. Introduction of the stability term (

δ

) in effect smart-suppresses repeated switching of offloading decisions, and can provide reliable controller performance in the presence of dynamic conditions of 6G edges.

The coefficients were visually optimised in the sensitivity analysis to find the balance, and finally, convergence was empirically achieved among the heterogeneous 6G edge devices.

Device-Agnostic Reward Coefficient Explanation:

The reward coefficients shown in Table 2 were intentionally kept fixed across all hardware platforms (ESP32, Jetson Nano, and Raspberry Pi 4) to maintain a consistent optimisation signal during training and ensure comparability across experiments. Although each device class has different operational priorities—such as strict energy sensitivity on ESP32 versus higher performance capacity on Jetson Nano—our tuning experiments showed that hardware-specific coefficients produced only minor variations in final performance. Specifically, adjusting

α, β, γ,

and

δ

within the ranges shown in Table 2 resulted in ≤3.2% change in p95 latency, ≤2.5% change in energy, and no measurable change in accuracy stability, confirming that the global coefficient set

(α = 0.35, β = 0.40, γ = 0.20, δ = 0.10)

is robust across heterogeneous platforms.

However, we acknowledge that using a single coefficient set for all devices may not fully capture platform-specific operational priorities (e.g., extreme battery constraints on MCUs or throughput-oriented tasks on edge GPUs). A hardware-adaptive weighting scheme could further enhance optimisation efficiency, and we note this as a limitation and direction for future work.

4.2. Action Space Formulation

The action space

A

encompasses discrete and continuous optimisation variables across three categories:

a_{t} = [a_{t}^{a r c h}, a_{t}^{h y p e r}, a_{t}^{c o m m}] \in A

(7)

The actions of the model architecture are given as:

a_{t}^{a r c h} = [L_{t}, W_{t}, Q_{t}, A_{t}]

(8)

where

L_{t}

represents the number of layers,

W_{t}

denotes channel width multipliers,

Q_{t}

indicates quantisation levels, and

A_{t}

specifies activation function selection.

The activities of the hyperparameter involve:

a_{t}^{h y p e r} = [η_{t}, B_{t}, O_{t}, R_{t}]

(9)

where

η_{t}

is the learning rate,

B_{t}

is the batch size,

O_{t}

denotes the optimiser selection, and

R_{t}

denotes the regularisation parameters.

Actions of the communication protocol are developed as:

a_{t}^{c o m m} = [P_{t}, M_{t}, E_{t}, S_{t}]

(10)

where

P_{t}

represents transmission power,

M_{t}

denotes modulation scheme,

E_{t}

indicates error correction coding, and

S_{t}

specifies scheduling strategy.

4.3. Multi-Objective Reward Function

The reward function optimises several competing goals using a weighted average of normalised measurements of performance:

r_{t} = w_{1} \cdot r_{t}^{a c c} + w_{2} \cdot r_{t}^{l a t} + w_{3} \cdot r_{t}^{e n e r g y} + w_{4} \cdot r_{t}^{r e l} - λ \cdot p_{t}

(11)

where

w_{i}

are adaptive weights,

r_{t}^{a c c}

represents accuracy reward,

r_{t}^{l a t}

denotes latency reward,

r_{t}^{e n e r g y}

indicates energy efficiency reward,

r_{t}^{r e l}

specifies reliability reward, and

p_{t}

represents constraint violation penalties.

Although the reward function in (11) combines accuracy, latency, energy, and reliability with proper weights, the controller remains stable only when these coefficients are correctly balanced. In our experiments, the values

α = 0.35

(latency),

β = 0.40

(energy),

γ = 0.20

(accuracy), and

δ = 0.10

(SwitchCost penalty) produced the most consistent convergence behaviour across different network conditions. A small sensitivity analysis showed that adjustments of ±0.05 around these values did not significantly change the learned policy, indicating that the coefficient set is not overly fragile. However, heavily biased configurations—such as

β > 0.60

, which overemphasises energy, or

γ < 0.10

, which underweights accuracy—led to unstable adaptation patterns or minor accuracy degradation. Therefore, the recommended range above provides a practical guideline for weighting the reward components without causing instability or excessive sensitivity.

The accuracy reward element is determined to be:

r_{t}^{a c c} = \frac{{Accuracy}_{t} - {Accuracy}_{b a s e l i n e}}{{Accuracy}_{m a x} - {Accuracy}_{b a s e l i n e}}

(12)

The reward of latency trains faster times of inference:

r_{t}^{l a t} = \frac{{l a t e n c y}_{m a x} - l a t e n c y_{t}}{{l a t e n c y}_{m a x} - {l a t e n c y}_{m i n}}

(13)

The success rates of communication are taken into account in the reliability reward:

r_{t}^{r e l} = \frac{{Success_Rate}_{t} - {Success_Rate}_{m i n}}{{Success_Rate}_{m a x} - {Success_Rate}_{m i n}}

(14)

Reward Component Tradeoffs:

To analyse the influence of each reward component, we performed an ablation-style evaluation of the full reward formulation. Removing the energy term (w₃) favoured deeper or uncompressed model variants, increasing average power consumption by 24%, while accuracy improved by less than 0.3%, revealing inefficient behaviour with minimal accuracy benefit. Eliminating the accuracy term (w₁) produced the opposite effect: the controller selected aggressively compressed sub-networks, resulting in a measurable 2.8% drop in accuracy, despite a modest energy gain of 7–9%. When the SwitchCost penalty

(λ \cdot p ₜ)

was removed, the controller exhibited unstable oscillatory decisions—matching the empirical increase in switch rate from 0.8/min to 4.2/min, similar to Heuristic-QoS. This instability directly affected p95 latency, which decreased by 18–25% due to frequent changes in execution venue. Finally, using a latency-only objective reduced median latency but caused noticeable accuracy degradation (≈2.1%) and a 17% increase in energy consumption, confirming that single-objective optimisation performs sub-optimally under dynamic 6G conditions. These results validate the need for a balanced multi-objective reward that jointly accounts for latency, energy, accuracy, and stability.

4.4. Deep Q-Network Formulation

Deep Q-network is used in the DRL-TinyEdge model using the following value function approximation:

(s, a; ϕ) = ϕ^{T} \cdot f (s, a)

(15)

and

ϕ

is the parameter of the neural network, and

f (s, a)

is the feature representation of state–action pairs.

The training loss of the DQN has the following definition:

L (ϕ) = E [{(y_{t} - Q (s_{t}, a_{t}; ϕ))}^{2}]

(16)

where the target value

y_{t}

is computed using the target network:

y_{t} = r_{t} + γ \underset{a'}{m a x} Q (s_{t + 1}, a'; ϕ^{-})

(17)

Gradient updating Rule of the main network parameters is:

ϕ_{t + 1} = ϕ_{t} - α \nabla_{ϕ} L (ϕ_{t})

(18)

where

α

represents the learning rate.

4.5. Adaptive Weight Mechanism

The structure includes a dynamic weighting process that optimises the priorities of the objectives according to the environmental factors:

w_{i} (t) = \frac{e x p (β_{i} \cdot c_{i} (t))}{\sum_{j = 1}^{4} e x p (β_{j} \cdot c_{j} (t))}

(19)

Temperature parameters are

β_{i}

, and context-specific scaling factors are

c_{i} (t)

.

The environmental factors are calculated according to the environmental conditions:

c_{1} (t) = \frac{{Task_Priority}_{t}}{{Task_Priority}_{m a x}}

(20)

c_{2} (t) = \frac{{Deadline_Pressure}_{t}}{{Deadline_Pressure}_{m a x}}

(21)

c_{3} (t) = \frac{{Battery_Level}_{m a x} - {Battery_Level}_{t}}{{Battery_Level}_{m a x}}

(22)

c_{4} (t) = \frac{{Channel_Quality}_{m a x} - {Channel_Quality}_{t}}{{Channel_Quality}_{m a x}}

(23)

4.6. Constraint Handling

The framework uses both hard and soft constraints to ensure solution feasibility. The hard constraints are imposed by action masking:

A_{v a l i d} (s_{t}) = {a \in A : g_{i} (s_{t}, a) \leq 0, \forall i \in I}

(24)

The

g_{i}

denote constraint functions, and

I

denotes the constraint indices set.

Penalty terms are used in the reward to deal with soft constraints:

p_{t} = \sum_{j \in J} m a x {(0, h_{j} (s_{t}, a_{t}))}^{2}

(25)

where

h_{j}

represents soft constraint violation functions.

4.7. Convergence Analysis

The contraction mapping theory is applied to the convergence of the DRL-TinyEdge algorithm. With the assumption of bounded rewards availability upon a finite action space, the algorithm will reach the optimal policy with probability 1:

\underset{t \to \infty}{l i m} ∥ ϕ_{t} - ϕ^{*} ∥ = 0

(26)

And

ϕ^{*}

is the best network parameters.

The exploration–exploitation tradeoff is upheld by an e-greedy approach to exponentially dwindling:

ϵ_{t} = m a x (ϵ_{m i n}, ϵ_{0} \cdot δ^{t})

(27)

The probability of exploration is

ϵ_{0}

, which is the first rate of exploration, and

δ

is the decay factor of probability, and

ϵ_{m i n}

is the lowest probability of exploration.

5. Results and Evaluation

This section presents a structured evaluation of the proposed DRL-TinyEdge framework across multiple performance dimensions, including latency, energy efficiency, accuracy, policy stability, scalability, and operational cost. Each subsection focuses on a specific evaluation aspect, and interpretative summaries are provided to clearly highlight the key outcomes and comparative advantages over baseline methods.

5.1. Experimental Setup

The experimental design of this study is structured to directly align with the research objective of evaluating whether a lightweight, on-device deep reinforcement learning controller can jointly optimise latency, energy consumption, accuracy, and decision stability for TinyML deployments under dynamic 6G edge conditions. To this end, a heterogeneous hardware testbed, representative workloads, and controlled network emulation are employed to ensure that the proposed DRL-TinyEdge framework is assessed under realistic and reproducible conditions.

Three hardware platforms—ESP32, Jetson Nano, and Raspberry Pi 4—are deliberately selected to represent distinct tiers of edge computing capability. The ESP32 microcontroller (ESP microcontroller series KSA based.) captures extreme TinyML constraints in terms of memory and energy, the Raspberry Pi 4 represents mid-tier embedded edge devices, and the Jetson Nano reflects edge-AI platforms with higher computational capacity. This selection enables systematic evaluation of the framework’s adaptability and scalability across heterogeneous edge environments while maintaining consistency in the optimisation objectives.

The choice of workloads and datasets is guided by the need to evaluate both compute-intensive vision tasks and latency-sensitive sensor analytics, which are common in practical edge deployments. CIFAR-10 and ImageNet are used to benchmark vision-based inference accuracy and latency, while industrial IoT sensor data is employed to assess energy efficiency and real-time responsiveness under continuous data streams. Model architectures such as MobileNetV3-Tiny and EfficientNet-B0 are selected due to their widespread adoption in TinyML and edge-AI systems, ensuring fair and meaningful comparison with existing optimisation methods.

Recent studies have also investigated semantic-driven integration of sensing and communication for next-generation wireless systems. In particular, Peng et al. [57] proposed SIMAC, a semantic-driven integrated multimodal sensing and communication framework that jointly optimises multimodal sensing semantics and communication efficiency in 6G-oriented networks. While SIMAC focuses on semantic abstraction and network-level coordination of sensing and communication, it does not address runtime adaptation of TinyML model execution, architectural configuration, or energy–latency-aware offloading at the device level. These limitations motivate the need for adaptive edge intelligence frameworks that can dynamically optimise TinyML deployments under strict resource constraints.

The research design explicitly maps evaluation metrics to research objectives. End-to-end inference latency (including p95 latency) evaluates real-time responsiveness, per-inference energy consumption assesses battery sustainability, accuracy measures model fidelity under adaptation, and policy switch rate quantifies decision stability. Baseline methods—including Static-Offload, Heuristic-QoS, TinyNAS, QAT, and OFA—are selected to represent static model optimisation, heuristic offloading, and architecture-level adaptation approaches, thereby enabling direct assessment of the benefits of joint, runtime optimisation.

The testbed used a heterogeneous evaluation approach, with 6G-enabled devices such as the Raspberry Pi 4 module, the NVIDIA Jetson Nano board, and the ESP32 microcontroller utilised in the experimental analysis. The Software Defined Radio (SDR) equipment and the cell band would operate in the 28 GHz mmWave frequency band, enabling millimetre-wave beamforming. The network infrastructure was simulated on a Software Defined Radio (SDR) platform. The datasets used in this study are publicly available: CIFAR-10: Available at https://www.cs.toronto.edu/~kriz/cifar.html, ImageNet: Available at https://www.image-net.org/ (requires registration) accessed on 01-07-2025.

Network Simulation Parameters:

The 6G link emulator was configured with SNR ∈ [10, 30] dB, RTT ∈ [10, 200] ms, jitter = 20 ms, and packet loss ratio ∈ [0–3%], consistent with urban mmWave traces. The SDR operated at 28 GHz with hybrid beamforming enabled.

Statistical Significance:

All results are reported with 95% confidence intervals computed across five independent runs. A paired t-test comparing DRL-TinyEdge to the strongest baseline (Heuristic-QoS) showed statistically significant improvements in p95 latency (p < 0.01) and energy per inference (p < 0.05).

The test data will consist of CIFAR-10, ImageNet, and IoT sensor data of industrial monitoring projects on their own devices. Some of the tested model architectures include MobileNet V3, EfficientNet-B0, and edge-optimised convolutional neural networks. PyTorch 3.9and CUDA acceleration were used for training, and Stable Baselines 3 with custom environment wrappers were used to implement the DRL agent.

The comparison methods at the baseline are those based on the static optimisation mechanisms like TinyNAS [45], Quantisation-Aware Training (QAT) [58], Once-for-All (OFA) [46], EfficientNet [47], MobileNetV3 [48], ProxylessNAS [49], and Table 3 summarises the principal hardware, workload, network, and training settings. The results are presented as means with standard deviations from five repeated runs for all evaluation metrics.

5.2. Main Performance Comparison

Table 4 presents a detailed comparison between the DRL-TinyEdge and the state-of-the-art baseline approach across various performance measures. The measurement includes accuracy, p95 latency, energy per inference, and policy consistency for both sensor (CIFAR-10) and vision (CIFAR-10) workloads. All experiments were performed on identical hardware (ESP32, Jetson Nano, Raspberry Pi 4) and simulation of 6G network conditions with different SNR (15–25 dB) and RTT (10–200 ms).

Key Findings:

Table 4 demonstrates that DRL-TinyEdge achieves the most balanced performance across all evaluated metrics when compared with static and heuristic baseline methods. In terms of latency, DRL-TinyEdge reduces p95 inference delay to 72.3 ms, representing a 26.7% improvement over Heuristic-QoS (98.7 ms) and more than 50% reduction compared to Local-Only execution (147.2 ms). This confirms the effectiveness of runtime adaptive offloading under dynamic network conditions.

Energy efficiency is also significantly improved. DRL-TinyEdge achieves 21.1 mJ per inference, yielding a 28.7% reduction relative to Heuristic-QoS and up to 54% reduction compared to Static-Offload, while maintaining competitive accuracy (93.4%, within 1% of the highest-performing baseline). Unlike heuristic approaches, which exhibit high decision instability (4.2 switches/min for Heuristic-QoS), the proposed framework maintains a low switch rate of 0.8 switches/min, indicating stable and reliable decision-making.

Although DRL-TinyEdge incurs a modest memory overhead (1.2 MB) relative to some static methods, this cost enables high adaptability and improved throughput (14.2 fps), outperforming all baseline approaches. Overall, the results confirm that DRL-TinyEdge delivers Pareto-superior performance, achieving lower latency and energy consumption without sacrificing accuracy or stability.

Scenario-Based Interpretation:

The improvements shown in Table 4 have clear implications for real-world edge-AI deployments. The reduction in p95 latency from 147.2 ms (Local-Only) and 128.5 ms (Static-Offload) to 72.3 ms with DRL-TinyEdge represents a 21–51% improvement, enabling time-critical applications to respond substantially faster. In drone surveillance, this lower tail latency enables faster frame-to-action loops, improving target tracking stability during rapid motion. In IoT sensing, the energy drops from 37.4 mJ (Local-Only) and 45.8 mJ (Static-Offload) to 21.1 mJ with DRL-TinyEdge results in a 43–54% reduction, directly extending battery life for long-term, low-power sensors deployed in remote or maintenance-constrained environments. For vehicular edge computing, reducing p95 latency variability and lowering the switch rate from 4.2/min (Heuristic-QoS) to 0.8/min enhances system stability, ensuring consistent inference timing for safety-critical decisions such as collision alerts, lane-change detection, and obstacle prediction. These scenario-driven interpretations highlight how the quantitative gains from DRL-TinyEdge directly translate into improved service quality and end-user experience across diverse edge-AI settings.

Key Takeaways:

DRL-TinyEdge achieves 21–28% p95 latency reduction compared to the best-performing baseline (Heuristic-QoS) and maintains $\pm$ 1% accuracy relative to Static-Offload.
Compared to the Local-Only and Static-Offload baselines, energy consumption decreases by 37–43%, demonstrating excellent energy efficiency through adaptive offloading decisions.
Policy stability is ensured to be less than 1 switch per minute, which compares very well with reactive heuristic schemes that can have switching rates up to 3–5 times higher.
The framework demonstrated Pareto-superior performance in the latency–energy-accuracy trade-off space, confirming the usefulness of the multi-objective DRL optimisation method.

5.3. Latency Distribution Analysis

The cumulative distribution function (CDF) of the latency for end-to-end inference across all baseline approaches and DRL-TinyEdge in dynamic 6G network conditions is shown in Figure 3. The CDF analysis provides some understanding of tail latency behaviour, which is a key factor in real-time edge applications with service-level-agreement (SLA) requirements. This test is conducted over 10,000 inference requests over 2 h under different network conditions (SNR: 10–30 dB, RTT: 10–200 ms) to reflect realistic deployment conditions.

Key Takeaways:

DRL-TinyEdge has a p95 latency of 72.3ms, which is 26.7 ms better than Heuristic-QoS and 50.9 ms better than Local-Only execution.
The structure shows a smaller latency distribution, with a smaller tail than static varieties, and is characterised by higher predictability in dynamic applications with SLA constraints.
DRL-TinyEdge has a response time of less than 100 ms at the p99 latency, whereas Static-Offload exceeds 180 ms under network degradation, demonstrating the power of adaptable decision-making.
Its competitive middle value (p50) of 54.2 ms compares with the cloud-based Static-Offload (51.8 ms), with only 43 per cent less energy usage, indicating high-energy–latency trade-offs have been optimised.

Key Findings:

Figure 3 presents the cumulative distribution of end-to-end inference latency across all baseline methods and the proposed DRL-TinyEdge framework. The results clearly show that DRL-TinyEdge achieves superior tail-latency control, with a p95 latency of approximately 72 ms, corresponding to a 26.7% reduction compared to the strongest baseline (Heuristic-QoS). Notably, the DRL-TinyEdge CDF curve rises more steeply and saturates earlier than static baselines, indicating consistently lower latency across the majority of inference requests.

The interquartile range (IQR) bands demonstrate that DRL-TinyEdge exhibits lower variance across repeated runs, reflecting stable performance under dynamic network conditions. In contrast, Local-Only and Static-Offload approaches show broader IQRs and heavier right tails, with a substantial fraction of requests exceeding the 100 ms SLA threshold, particularly under fluctuating SNR and RTT conditions. The proposed framework maintains latency below the SLA threshold for the vast majority of inference instances, confirming its robustness for latency-sensitive edge applications.

5.4. Multi-Objective Pareto Frontier Analysis

Figure 4 depicts the Pareto frontier of trade-off between latency and energy within the space of results of the DRL-TinyEdge and all the baseline techniques. The value at each point is the average across multiple experimental runs of a method, and the error bars are the standard deviation. Pareto-optimal solutions lie on the lower-left boundary, indicating those solutions that cannot be optimised simultaneously in both the latency and energy dimensions without compromising at least one measure. The effectiveness of the multi-objective reward function in achieving balance in the optimisation is tested in this analysis.

Key Takeaways:

DRL-TinyEdge achieves Pareto superiority, with the optimal latency–energy trade-off among all the methods tested.
The framework shows both a reduction in latency (28 per cent right now) and power consumption (43 per cent at once), which are 28 per cent and 43 per cent lower than Local-Only, respectively, and are considerable in both respects.
Static-Offload and Local-Only are sub-optimal: Static-Offload consumes high energy (45.8 mJ), and Local-Only incurs high latency (147.2 ms).
Heuristic-QoS has moderate performance with policy instability (4.2 switches/min), whereas DRL-TinyEdge has policy stability (0.8 switches/min) at higher operating points.
This convex hull, defined as DRL-TinyEdge, OFA, and Heuristic-QoS, is centred on plotting the achievable Pareto frontier, where DRL-TinyEdge has a greater impact on adaptive decision-making tasks.

Key Findings:

Figure 4 illustrates the Pareto frontier of the latency–energy trade-off space across all evaluated methods. The proposed DRL-TinyEdge framework occupies a Pareto-optimal position, achieving an average latency of 72.3 ms with an energy expenditure of 21.1 mJ, which jointly surpasses all baseline approaches. No other method attains lower latency without incurring higher energy consumption, or lower energy consumption without increasing latency, confirming the Pareto dominance of DRL-TinyEdge.

Static optimisation methods are clearly located within the Pareto-dominated region, indicating sub-optimal trade-offs between latency and energy efficiency. While adaptive baseline methods approach the Pareto frontier, their higher variance, as reflected by the larger error bars, indicates reduced stability under dynamic operating conditions. In contrast, DRL-TinyEdge demonstrates both efficient trade-off optimisation and low performance variability, highlighting the effectiveness of the multi-objective and stability-aware reward formulation. These results confirm that joint runtime optimisation of model configuration and execution strategy enables superior performance compared to static or partially adaptive approaches, particularly in environments characterised by fluctuating network and resource conditions.

5.5. Controller Overhead Analysis

Table 5 presents the computational cost of the DRL-TinyEdge controller across various hardware platforms. The overhead analysis will play a crucial role in realising the real viability of on-device DRL optimisation, especially for resource-limited edge devices. Some of the measures include CPU usage, memory usage, decision duration, and network congestion under constant working conditions in real-world workload scenarios. Each measurement was averaged over 10,000 decision cycles to achieve statistical significance.

Key Findings:

Table 5 demonstrates that the DRL-TinyEdge controller introduces negligible computational and communication overhead across all evaluated edge platforms. On the most resource-constrained device, the ESP32 microcontroller, CPU utilisation remains below 4%, memory usage is limited to approximately 412 KB, and decision latency is 3.8 ms, confirming the feasibility of deploying the controller on MCU-class hardware. Similar trends are observed on the Jetson Nano and Raspberry Pi 4, where decision times remain consistently below 5 ms despite higher system complexity.

Across all devices, the throughput impact is below 0.3%, indicating that the controller does not interfere with normal inference execution. Network overhead is negligible (<0.1 kbps), as all optimisation decisions are performed locally without continuous communication with external controllers. Storage requirements remain modest, averaging 1.0 MB, which is acceptable for embedded edge platforms. Overall, these results confirm that DRL-TinyEdge achieves substantial performance improvements without imposing meaningful runtime overhead, validating its suitability for real-world deployment in resource-constrained edge environments.

Key Takeaways:

The DRI controller requires minimal computational resources, using <7% CPU and <1 MB of RAM across all test platforms (ESP32, Jetson Nano, Raspberry P_i 4).
Latency is <5 ms across all devices, the feasibility of real-time operation has been confirmed, and the effect on inference throughput is insignificant (decision overhead costs <0.3% of total inference time).
The controller has excellent scalability, with overhead that is almost constant across various model complexities and workload types (vision vs. sensor analytics).
There is no network overhead (<0.1 kbps) since the controller is completely off-the-shelf and does not require cloud-based decision-making, delivering low-latency, autonomous performance.
The memory footprint scales with the replay buffer size, yet it remains less than 5% of usable RAM, even on a very limited MCU (ESP32 with 512 KB of SRAM).

5.6. Cloud Infrastructure Cost Analysis

Table 6 provides a breakdown of the cost analysis for DRL-TinyEdge, cloud-centric, and hybrid approaches to offloading in real-world deployment conditions. The computational and data-transfer costs within the network, and the total cost of ownership (TCO) over a 30-day operational period for a fleet of 100 edge devices, are analysed. The cost is calculated using standard cloud pricing models (AWS EC2 t3.medium instances at 0.0416/hour, data transfer at 0.09/GB) and network traffic patterns for continuous operation.

Key Findings:

Table 6 presents a 30-day infrastructure cost comparison for a deployment of 100 edge devices. The results show that DRL-TinyEdge substantially reduces overall infrastructure cost compared to Static-Offload, achieving a total cost of $127, which corresponds to a 62.0% cost reduction relative to the static offloading baseline ($334). This reduction is primarily driven by adaptive inference placement, which limits unnecessary data transfer and cloud computation while avoiding excessive on-device storage requirements.

Compared to Heuristic-QoS, DRL-TinyEdge lowers total cost by approximately 38.3%, reflecting more efficient coordination between computation and communication decisions. While Local-Only and TinyNAS/QAT approaches incur lower cloud-related costs, they rely heavily on local storage and lack adaptability to dynamic network and workload conditions. As a result, these approaches do not provide the same balance between operational cost efficiency and runtime performance.

The per-device cost of DRL-TinyEdge is $1.27, which remains significantly lower than Static-Offload ($3.34 per device) while offering superior latency and energy performance, as shown in earlier evaluations. These results indicate that DRL-TinyEdge enables cost-efficient scalability, making it suitable for large-scale edge deployments where both performance and operational expenditure are critical considerations.

Key Takeaways:

DRL-TinyEdge cost reduction of 62percent of cloud expenditures in comparison to Static-Offload at the expense of monthly cloud billing in the US: Intelligent cost reduction in cloud offloading additionally optimises a 127offload monthly payment by 62percent compared to a 334offload monthly payment.
Adaptive local execution minimises device network data transfer costs by 71% ($38 vs. $132) when conditions are favourable, making it the key cost-reducing factor.
The framework is also more cost-efficient, at just $1.27 per device per month, and is therefore economically viable for large-scale edge deployments (1000+ devices).
DRL-TinyEdge costs a little extra (+47/month over Local-Only) but offers a 50.9% reduction in latency and a 43.6% reduction in energy consumption, resulting in a high ROI.
In a 12-month deployment, the savings of the DRL-TinyEdge system are estimated to be 2484 USD per 100 devices compared to Static-Offload, and the break-even point will be reached after the first month.

5.7. Baseline Methods

To draw an equitable and holistic comparison, DRL-TinyEdge was compared to a few representative base strategies that encompass both fixed and dynamic optimisation approaches deployed in TinyML and edge computing. These baselines include local-only inference, fixed offloading, heuristic decision policies and model-only adaptation techniques.

Local-Only (Static-Full): Inference runs entirely on the edge device, without offloading or reconfiguring the model. This is the threshold to operation, which is latency-sensitive yet resource-constrained.
Static-Offload: The workload is offloaded to the cloud or edge server as a whole, irrespective of network conditions. This is the maximum compute capability, but it performs poorly under high RTT or low connectivity.
Heuristic-QoS: This is an offloading policy that relies on rules: task transfers to the cloud depend on whether the RTT is greater than 100 ms or the SNR is less than 15 dB. It embodies the historical latency-based heuristics occurring in quality of service (QoS) edge systems.

A fixed-rule heuristic was intentionally selected because it mirrors the behaviour of widely deployed quality of service mechanisms in commercial IoT gateways and embedded edge systems, where latency or SNR thresholds are enforced through deterministic rule sets. Although adaptive heuristics exist, they differ significantly across implementations and lack a unified or reproducible standard, making direct comparison inconsistent. By using a fixed, well-defined heuristic, we ensure a fair, controlled, and repeatable baseline that reflects how traditional systems operate under constrained hardware and network conditions. This also highlights DRL-TinyEdge’s advantages in dynamic environments where static heuristics fail to respond to rapid fluctuations. Adaptive heuristic variants may be explored in future extensions, but were excluded here to avoid introducing unbounded design variability.

TinyNAS/QAT-Only: Static Neural architecture and quantisation-aware training techniques are used to both downsize models and minimise computational requirements, without dynamic offloading or DRL control. These lower limits represent strictly model adaptation.
OFA/US-Net (Optional): Once-for-All (OFA) networks and Unified-Search (US-Net) Platon models are platform-aware NAS models that may be refined based on device information, but not on communication or runtime context.
Fairness and Evaluation Protocol: This was performed by running all baselines on the same hardware (ESP32, Jetson Nano, Raspberry Pi 4) and using the same emulated 6G network traces and data (CIFAR-10, ImageNet, industrial sensors dataset). Every experiment was performed 5 times, and the mean results were reported as the average of these runs. All baseline models were optimised via grid search to achieve similar accuracy (within $\pm$ 1 of the highest-performing hyperparameter set). Each method used shared latency and energy constraints to provide equal performance opportunity.

5.8. Performance Metrics

The evaluation uses a range of performance measures to assess the framework’s effectiveness comprehensively. The metrics for image recognition tasks are top-1 and top-5 accuracy, and for regression tasks, mean squared error. Latency measurements include end-to-end inference time, which includes data preprocessing, model execution, and result post-processing. Calibrated power metres with millisecond sample rates are used to measure energy consumption.

The reliability of communication is tested based on the packet-delivery ratio and link stability across different SNR/RTT environments. Adaptability indicates the rate of convergence following environmental change, whereas scalability measures the decline in performance as the number of edge devices increases. Policy stability is measured by the switch rate (number of policy changes per minute).

All measurements are in standard SI units (ms, mJ, req/s, KB); differences in accuracy are within

\leq 1 %

of the optimum baseline setup.

5.9. Training Convergence Analysis

Figure 5 shows how the DRL-TinyEdge framework converged during training across various deployment scenarios. As evidenced in the loss curves, the first convergence is quick during the first 1000 episodes, and the optimisation policy is further refined. The framework’s target performance is stable, with low variance after about 2500 training instances.

Key Findings:

Figure 5 illustrates the training convergence behaviour of the DRL-TinyEdge framework across multiple deployment scenarios. The results show rapid initial convergence within the first 1000 training episodes, followed by stable and low-variance performance over extended training periods. This behaviour is consistent across stationary, mobile, and vehicular edge scenarios, indicating that the learned policy generalises well to diverse and dynamic operating conditions.

The convergence rate analysis demonstrates that training stability is maintained even under varying network quality and computational load, as evidenced by the smooth loss decay and absence of oscillatory behaviour. The multi-objective reward component analysis further reveals balanced contributions from latency, energy, accuracy, and reliability rewards throughout training, confirming that no single objective dominates the learning process. This balanced reward evolution is essential for achieving stable joint optimisation in multi-objective settings. Overall, these results confirm that the DRL-TinyEdge agent learns efficient and stable policies with limited training overhead, and that the inclusion of stability-aware reward components effectively prevents policy divergence and performance degradation in dynamic edge environments.

5.10. Accuracy Performance Evaluation

Accuracy comparisons between TinyEdge and baseline approaches in DRL, using as many datasets and architecture versions as possible, were extensive and included as many comparisons as feasible (Table 7). The proposed framework performs much better in terms of accuracy under limited computational constraints.

Key Findings:

Table 7 compares classification accuracy across vision and sensor-based workloads. The results show that DRL-TinyEdge achieves the highest overall average accuracy (86.2%) among all evaluated methods. On CIFAR-10, DRL-TinyEdge attains 94.2% accuracy, outperforming architecture search and static optimisation approaches such as TinyNAS (91.5%) and OFA (92.1%). Similar gains are observed on the more challenging ImageNet dataset, where DRL-TinyEdge achieves 71.8% accuracy, exceeding the strongest baseline (BigNAS at 69.5%).

For IoT sensor workloads, DRL-TinyEdge maintains 92.6% accuracy, indicating that runtime adaptation does not compromise model fidelity in latency- and energy-constrained sensing applications. Across all datasets, static and offline optimisation methods exhibit consistently lower performance, highlighting their limited ability to adapt to dynamic execution and communication conditions. These results confirm that DRL-TinyEdge preserves or improves inference accuracy while dynamically adapting model configurations and execution strategies. Importantly, the observed accuracy gains demonstrate that the latency and energy improvements reported in earlier sections are not achieved at the expense of predictive performance.

The high accuracy of DRL-TinyEdge is explained by the model’s ability to dynamically adapt its structure to the nature of the inputs and environmental factors. The framework can also adjust quantisation levels, architectural parameters, and processing strategies to achieve high accuracy in a specific inference case, unlike the fixed configurations used in static optimisation methods.

5.11. Latency Performance Analysis

Figure 6 shows the latency gains achieved by DRL-TinyEdge over the baseline approaches. The framework demonstrates significant latency savings across all simulated conditions, with the most tremendous gains observed in mobile deployment conditions, where network conditions change dynamically.

Table 8 gives precise latency values of various model architectures and deployment configurations. The DRL-TinyEdge adaptive optimisation features enable significant latency reduction through smart resource allocation and protocol selection.

Key Findings:

Table 8 presents the average end-to-end inference latency across different model architectures. The results show that DRL-TinyEdge consistently achieves the lowest inference latency for all evaluated models. In particular, DRL-TinyEdge reduces average latency to 13.3 ms, representing a 22.7% improvement over the strongest baseline (BigNAS at 15.7 ms) and more than 25% improvement compared to conventional MobileNetV3 execution (17.9 ms).

Across individual architectures, DRL-TinyEdge achieves 12.3 ms for MobileNetV3, 18.7 ms for EfficientNet, and 8.9 ms for the custom CNN, outperforming all static and offline optimisation methods. These gains demonstrate the effectiveness of runtime adaptation in selecting optimal execution configurations based on current resource and network conditions. The results further indicate that offline architecture optimisation techniques, such as NAS- and AutoML-based approaches, are unable to match the latency performance of DRL-TinyEdge due to their lack of runtime adaptability. Overall, the latency improvements confirm that joint optimisation of computation and communication is essential for achieving low-latency inference in dynamic edge environments.

5.12. Energy Efficiency Evaluation

The analysis of energy consumption indicates that the DRL-TinyEdge framework has achieved significant improvements across various hardware platforms and applications. Figure 7 shows the energy-efficiency improvement over almost all baseline techniques, with special attention to battery-powered edge devices.

Table 9 provides comprehensive energy consumption data for various operational scenarios. DRL-TinyEdge adaptive power management ensures significant energy savings while maintaining those performance requirements.

Key Findings:

Table 9 reports the per-inference energy consumption across heterogeneous edge platforms. The results show that DRL-TinyEdge consistently achieves the lowest energy consumption on all evaluated devices, with an average of 26.0 mJ per inference, representing a 17.9% reduction compared to the strongest baseline (BigNAS at 31.7 mJ) and up to 29% reduction compared to conventional MobileNetV3 execution (36.7 mJ).

On the Raspberry Pi platform, DRL-TinyEdge reduces energy consumption to 23.4 mJ, outperforming all static and offline optimisation methods. Similar trends are observed on the Jetson Nano and ESP32, where DRL-TinyEdge achieves 45.7 mJ and 8.9 mJ per inference, respectively. These results demonstrate that runtime adaptation effectively exploits available hardware and network conditions to minimise energy expenditure across diverse edge devices. The consistently lower energy consumption confirms that the latency improvements reported in earlier sections are not achieved through aggressive resource usage. Instead, DRL-TinyEdge enables energy-aware optimisation, making it particularly suitable for battery-powered and long-duration edge deployments.

5.13. Communication Reliability Assessment

The effect of this framework on communication reliability is assessed by comparing the packet transmission ratio and the connection stability measure across different network conditions. The better reliability performance shown in Figure 8 was achieved through the optimisation of adaptive communication protocols.

Key Findings:

Figure 8 presents a comprehensive evaluation of communication reliability under dynamic 6G network conditions. The results indicate that DRL-TinyEdge consistently achieves higher packet delivery ratios (PDR) across varying signal strength, interference levels, and network topologies compared to static and heuristic baseline methods. The 3D reliability surface illustrates that DRL-TinyEdge maintains robust performance even under reduced signal strength and increased transmission distance, highlighting its resilience to channel degradation.

The packet flow visualisation under interference conditions further demonstrates that adaptive communication control enables DRL-TinyEdge to avoid congested regions and maintain stable transmission paths. In contrast, static methods exhibit higher sensitivity to interference, leading to reduced delivery success. The packet delivery ratio distributions show that DRL-TinyEdge achieves consistently higher median PDR with lower variance, indicating reliable performance across repeated runs and network scenarios. Connection stability analysis confirms that DRL-TinyEdge maintains a higher number of stable connections over time compared to baseline approaches, particularly under mobility and fluctuating channel conditions. These results collectively demonstrate that jointly adapting communication parameters and execution strategies significantly enhances reliability in 6G edge environments.

5.14. Ablation and Sensitivity Analysis

An extensive ablation and sensitivity analysis was performed to measure the contribution of each key component of the DRL-TinyEdge framework. In each variant, they disable a critical mechanism known as stability regularisation, model adaptation, or multi-objective balancing, each of which plays an independent role in latency, energy, and policy stability.

The quantitative results in several representative test cases are reported in Table 10.

Key Findings:

Table 10 presents an ablation analysis evaluating the impact of removing or modifying key components of the DRL-TinyEdge framework. The results demonstrate that the full model achieves the best overall balance between latency, energy efficiency, and policy stability, confirming the necessity of the proposed joint optimisation design.

Removing the SwitchCost term slightly reduces p95 latency and energy consumption but causes a substantial increase in policy instability, with the switch rate rising from 0.8 to 4.2 switches per minute. This result highlights the critical role of the stability-aware penalty in preventing oscillatory behaviour. Optimising for latency alone leads to increased energy consumption (29.6 mJ) and higher switch rate, while optimising for energy alone significantly degrades latency (122 ms), demonstrating the limitations of single-objective optimisation.

Disabling offloading or model adaptation results in severe latency degradation (147 ms and 135 ms, respectively), confirming that both components are essential for achieving low-latency inference under dynamic conditions. Finally, replacing DQN with a PPO variant yields slightly worse latency and energy performance, indicating that the chosen DQN-based design offers a more favourable trade-off for this application. Overall, the ablation results validate that each component of DRL-TinyEdge contributes meaningfully to performance and stability, and that removing any key element leads to measurable degradation in one or more objectives.

Figure 9 plots the sensitivity of performance to the reward coefficients

(α,)

β, γ, and δ indicates that moderate trade-offs between latency and energy are obtained with moderate weighting of the two. To achieve this, Figure 10 shows the framework’s response time under different SNR/RTT and device temperature conditions, and the policy adapts stably even under throttling or degraded links.

Figure 9 demonstrates that DRL-TinyEdge is robust to a wide range of reward weights, showing a performance difference in only a small fraction (<5%) and thus demonstrating tuning stability.

As shown in Figure 10, the controller maintains uniform decision quality despite varying link quality and hardware limitations, demonstrating high resilience to environmental variability.

Generally, eliminating the SwitchCost term led to high policy oscillations (up to 4.2 switches/min), underscoring its importance for controller stability. Latency-only or energy-only configurations (single-objective) significantly worsened one metric but only slightly improved the other, validating the need to design a multi-objective measure, such as a reward. Sensitivity analysis also showed that DRL-TinyEdge remains stable when the reward coefficients and the environment are moderately modified.

5.15. Scalability Analysis

Scalability tests evaluate the performance of the frameworks as the edge count increases and the computational load is varied. The fact that Figure 11 shows the framework’s capacity to ensure performance efficiency across varying deployment scales is evidenced.

Key Findings:

Figure 11 evaluates the scalability of the DRL-TinyEdge framework under increasing numbers of edge devices and varying network architectures. The results show that DRL-TinyEdge maintains stable performance as system scale increases, with no abrupt degradation observed as the number of edge devices grows. The performance maps indicate that the framework effectively adapts to higher computational density, preserving latency and energy efficiency across diverse deployment sizes.

The network flow analysis further demonstrates that DRL-TinyEdge sustains balanced traffic distribution and avoids congestion hotspots, even under deeper network hierarchies. Performance distribution plots reveal that the majority of operating points remain within acceptable latency thresholds as device count increases, confirming that optimisation decisions scale effectively with system size. The scalability cluster analysis shows that DRL-TinyEdge consistently occupies favourable regions of the performance–scale space, whereas static and heuristic baselines exhibit increasing performance degradation at larger scales. These results confirm that the proposed framework is suitable for large-scale edge deployments, maintaining efficiency and reliability as system complexity grows.

5.16. Real-World Deployment Scenarios

The model was tested across three practical use cases: urban drone monitoring, industrial IoT monitoring, and vehicular edge computing. Figure 12 depicts the drone surveillance of an urban area, and Figure 13 depicts the industrial IoT monitoring system.

Table 11 summarises performance gains across various real-world deployment conditions, demonstrating the practical efficacy of the DRL-TinyEdge framework.

Key Findings:

Table 11 summarises the performance of DRL-TinyEdge across representative real-world deployment scenarios, including urban drone surveillance, industrial IoT monitoring, and vehicular edge computing. The results demonstrate consistent and substantial improvements across all key performance dimensions, confirming the practical effectiveness of the proposed framework beyond controlled experimental settings.

On average, DRL-TinyEdge achieves an accuracy improvement of 4.6%, indicating that runtime adaptation enhances predictive performance under realistic operating conditions. Latency is reduced by an average of 22.8%, with the largest reduction observed in vehicular edge computing (25.6%), highlighting the framework’s ability to meet stringent real-time requirements in highly dynamic environments. Energy savings average 38.5%, confirming that adaptive optimisation significantly extends operational lifetime for energy-constrained edge devices.

Reliability gains are also evident across all scenarios, with an average improvement of 7.3% in communication and execution stability. These results collectively demonstrate that DRL-TinyEdge delivers robust, scalable, and energy-efficient performance in diverse real-world applications, reinforcing its suitability for deployment in practical 6G edge systems.

5.17. Computational Overhead Analysis

Table 12 illustrates the computational overhead of the DRL-TinyEdge framework, where the optimisation process’s efficiency is relatively low compared to the performance improvement achieved.

The memory values in Table 12 (Total Framework = 155.4 MB, DRL Agent = 127.4 MB) reflect the server-side training environment, including the full DRL training pipeline, replay buffer, logging utilities, and model storage used during offline learning.

In contrast, the on-device runtime controller deployed on the ESP32 is a heavily compressed inference-only module that contains only the state encoder and policy forward-pass logic. This optimised deployment package requires only 412 KB of RAM and remains below 1 MB across all MCU-class devices.

Therefore, Table 12 reflects training-time resource usage on a workstation, whereas the <1 MB footprint reflects the actual runtime consumption on embedded edge hardware.

Key Findings:

Table 12 provides a detailed breakdown of the computational overhead introduced by individual components of the DRL-TinyEdge framework. The results indicate that the overall framework incurs a moderate and manageable overhead, with a total CPU utilisation of 13.8% and memory usage of 155.4 MB, which is acceptable for edge platforms equipped with GPU or SoC-class resources during training and control execution.

The DRL agent accounts for the largest share of resource usage, consuming 8.9% CPU and 127.4 MB memory, reflecting the complexity of policy evaluation and learning. In contrast, auxiliary components—including the Environment Monitor, Model Controller, and Performance Evaluator—introduce minimal overhead, each consuming less than 2.5% CPU and under 16 MB memory. This distribution confirms that the framework is modular, with most computational cost concentrated in the learning component.

Network overhead remains negligible at 0.8 kbps, as optimisation decisions are performed locally without continuous data exchange. Storage requirements are also modest, totaling 49.9 MB, which includes model parameters, controller logic, and logging utilities. Overall, these results demonstrate that DRL-TinyEdge delivers significant performance benefits while maintaining practical computational overhead, supporting its feasibility for real-world edge deployments.

6. Discussion

The results of the experiment show that the DRL-TinyEdge framework can significantly improve performance across a range of optimisation criteria and can be conveniently used in a real-world application. The overall assessment highlights key lessons about the proposed strategy’s performance and weaknesses.

The high level of accuracy was achieved across all datasets and model structures evaluated, thanks to the dynamic nature of the framework, which adjusts model configurations based on the characteristics of the input and the environment. In contrast to more traditional optimisation techniques (such as static optimisation), which compute quantisation levels, architecture parameters, and processing strategies fixed during the design phase, DRL-TinyEdge dynamically optimises these parameters to improve accuracy for specific inference tasks. This adaptability is especially useful in systems with heterogeneous edges, where the nature of the input data and computation evolves.

The framework achieves an average 22.8 per cent reduction in latency across all deployment conditions through smart resource sharing and optimised communication protocols. The DRL agent is trained to balance computation complexity and communication overheads, and to make intelligent choices between local processing and remote offloading based on prevailing network conditions and device capabilities. This dynamic decision-making process also enables the framework to achieve low latency consistently across different environmental conditions, whereas static approaches would cripple its performance.

On average, the adaptive power management capabilities incorporated into the framework achieve 38.5% energy savings, demonstrating the effectiveness of the implemented mechanisms. The multi-objective reward-based performance effectively matches energy consumption with performance requirements, enabling efficient operation across varying battery levels and thermal conditions. The adaptive weighting mechanism is appropriate, especially when the energy constraint is sufficiently tight, in which case optimisation priorities are automatically adjusted to increase the device’s operational lifetime.

Average communication reliability improvements of 7.3 per cent indicate significant improvements, but, of course, only in circumstances where the network is strenuous. The framework’s flexibility to adapt transmission parameters, error correction schemes, and protocol choice based on observed channel conditions makes proper operation feasible even under the most dynamic conditions of 6G environments, with changing interference patterns and signal quality.

The ablation study shows that each component of the framework is essential to achieving optimal performance. The adaptive weights removal feature results in a significant performance decline across all metrics, underscoring the need for dynamic priority control based on environmental conditions. On the same note, computation and communication optimisation are mutually dependent, as removing communication optimisation significantly affects reliability, though it also influences latency and power usage.

A detailed comparison of DRL-TinyEdge with 15 state-of-the-art methods has been provided in Table 13, covering various evaluation criteria. The findings show that the suggested framework is a superior solution to existing methods across all performance measures and offers greater flexibility and greater ease of implementation. The high adaptability score shows that the framework can automatically adapt to environmental dynamics without relying on specific individual actions. In contrast, the easy deployment index indicates that it can be configured with minimal effort, making it easy to implement.

Scalability analysis indicates that the framework maintains optimal performance with a large population of edge devices, making it suitable for deployment in large financial organisations. As indicated by the analysis of computational overhead, the optimisation benefits greatly exceed the incremental computational needs, and the overall framework overhead constitutes less than 14 per cent of the available computational resources of standard edge devices.

Although these results are promising, several limitations warrant consideration. The time required for the DRL agent to converge to the optimal policies can be prohibitive when it is necessary to deploy in a situation. Future research should explore transfer learning techniques to accelerate adaptation to new environments using existing pre-trained models. Also, the framework’s reliance on environmental-state information might limit its use in situations where that information is unreliable or unavailable.

The communication overhead analysis shows that the optimisation process requires very little additional network traffic, suggesting the framework could be used in a bandwidth-constrained environment. Nonetheless, trained DRL models can consume significant storage when used on extremely resource-restricted devices, and it might be necessary to investigate model compression methods tailored to their optimisation framework.

The practical viability of the DRL-TinyEdge framework across a variety of application areas is supported by its real-world implementation outcomes. The uninterrupted improvement in performance across urban drone surveillance, industrial IoT surveillance, and vehicular edge computing applications confirms the framework’s flexibility and robustness. These findings are reasonable indicators of the business feasibility of the suggested strategy for solving practical edge computing optimisation problems.

Generalisation and Extensibility:

Although the experimental evaluation focuses on ESP32 microcontrollers, a 28 GHz mmWave link emulator, and a CNN-based TinyML backbone, the proposed DRL-TinyEdge framework generalises well to broader hardware and networking ecosystems. The DRL controller is lightweight (≈1 MB of RAM, <7% CPU load) and can be deployed on other resource-constrained platforms such as ARM Cortex-M4/M7 MCUs, RP2040 boards, RISC-V-based microcontrollers, and Jetson-class edge accelerators without requiring structural modifications. Similarly, the adaptation engine is compatible with diverse TinyML architectures, including transformer-style models such as MobileViT and TinyViT, as well as lightweight attention-based encoders, since the controller operates on normalised state variables rather than architecture-specific signatures. Beyond mmWave 6G, the offloading module can be reconfigured to operate over WiFi-6/6E, industrial Ethernet, LoRaWAN, or sub-GHz ISM-band IoT networks by modifying the network-state inputs (e.g., RTT, bandwidth, packet-loss rate) and the reward scaling. Therefore, the framework is not tied to a specific hardware–network stack and can be extended to a variety of industrial, vehicular, and large-scale IoT deployments with minimal adaptation.

Large-Scale and Multi-User Deployment Considerations:

While the current experiments focus on single-device evaluation and controlled mmWave simulation, DRL-TinyEdge can extend to multi-user or large-scale 6G edge deployments. The state representation and reward formulation depend solely on per-device metrics, enabling independent DRL agents across multiple devices without centralised coordination. In multi-user 6G cells, interference, congestion, and beamforming conflicts primarily affect RTT, SNR, and packet loss—parameters already included in the state space. Preliminary scalability tests using trace-driven simulations with 20–50 concurrent devices showed that the controller remained stable and converged despite fluctuating network load. Future large-scale validation can incorporate massive MIMO scheduling, multi-agent DRL, or distributed offloading policies, but the core DRL-TinyEdge structure is directly compatible with multi-user edge environments.

7. Conclusions

This work presented DRL-TinyEdge, a deep reinforcement learning–driven framework for adaptive TinyML optimisation in dynamic 6G edge environments. The proposed framework jointly optimises model configuration, execution venue, and communication parameters through a lightweight on-device DRL controller, addressing the limitations of static TinyML optimisation and heuristic offloading strategies under volatile network and resource conditions.

The experimental results provide clear and consistent evidence that DRL-TinyEdge achieves superior performance across all evaluated dimensions. Quantitative evaluation demonstrated average improvements of 4.6% in inference accuracy, 22.8% reduction in end-to-end latency, 38.5% reduction in per-inference energy consumption, and 7.3% improvement in communication reliability compared to state-of-the-art baselines. These gains were consistently observed across heterogeneous hardware platforms (ESP32, Jetson Nano, Raspberry Pi 4), multiple workloads (vision and sensor analytics), and dynamically emulated 6G network conditions.

Detailed latency distribution analysis confirmed that DRL-TinyEdge provides strong tail-latency control, achieving a p95 latency of 72.3 ms, which is 26.7% lower than the strongest heuristic baseline, while maintaining compliance with a 100 ms SLA threshold under fluctuating SNR and RTT conditions. Pareto frontier analysis further demonstrated that DRL-TinyEdge occupies a Pareto-optimal position in the latency–energy trade-off space, achieving the lowest joint latency and energy consumption among all evaluated methods without sacrificing accuracy or policy stability.

The framework also demonstrated practical deployability. Controller overhead measurements showed that the on-device DRL controller operates with <7% CPU usage, <1 MB RAM, and <5 ms decision latency, confirming feasibility on microcontroller-class hardware. Infrastructure cost analysis revealed that DRL-TinyEdge reduces cloud operational costs by 62% compared to static offloading, highlighting its economic viability for large-scale deployments. Real-world deployment scenarios, including urban drone surveillance, industrial IoT monitoring, and vehicular edge computing, further validated the framework’s effectiveness, with consistent improvements in accuracy, latency, energy efficiency, and reliability across all scenarios.

Limitations: Despite these strengths, several limitations should be acknowledged. The DRL training process incurs non-trivial computational cost during offline learning and requires access to accurate telemetry signals, which may not always be available in poorly instrumented environments. Additionally, while evaluation covered representative MCU and SoC platforms, broader validation across a wider range of ultra-low-power microcontrollers and sensing hardware would further strengthen generalisability. Finally, although runtime deployment is lightweight, the full training pipeline remains unsuitable for execution entirely on severely constrained devices.

Future Work: Future research will focus on federated and on-device policy learning to reduce training overhead and improve scalability across heterogeneous edge networks. Hardware–software co-design with neuromorphic and FPGA-based accelerators will be explored to further reduce latency and power consumption. Additional directions include incorporating formal safety guarantees, explainable decision mechanisms, and robustness to partially observable network states, enabling DRL-TinyEdge to operate securely and transparently in mission-critical 6G edge applications.

Reproducibility Checklist

Repository Structure:

main.py—launch script (python main.py –config = configs/edge.yaml)
configs/—YAML files with hyperparameters and network traces
models/—TinyML architectures (MobileNetV3-Tiny, SensorNet)

Environment: Python 3.10, PyTorch 2.3, Stable-Baselines3 v2.0, CUDA 11.8. Random seed = 42; replay buffer = 5000; learning rate =

1 \times 10^{- 4}

.

Power Measurement: Energy readings collected via INA219 sensor at 1 kHz sampling rate; calibrated against Keysight N6705B supply.

Author Contributions

Conceptualization: S.A. (Saad Alaklabi) and S.A. (Saleh Alharbi); Methodology: S.A. (Saad Alaklabi) and S.A. (Saleh Alharbi); Software: S.A. (Saad Alaklabi) and S.A. (Saleh Alharbi); Formal analysis: S.A. (Saad Alaklabi) and S.A. (Saleh Alharbi); Resources: S.A. (Saad Alaklabi) and S.A. (Saleh Alharbi); Writing—review and editing: S.A. (Saad Alaklabi) and S.A. (Saleh Alharbi). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting this study’s findings are available at: https://github.com/SaadAlaklabi/DRL-TINY-.git (accessed on 15 December 2025).

Acknowledgments

The authors would like to thank the Deanship of Scientific Research at Shaqra University for supporting this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

Symbol	Description
$s_{t}$	System state vector at time $t$
$a_{t}$	Action vector at time $t$
$r_{t}$	Reward signal at time $t$
$Q (s, a; ϕ)$	Q-value function with parameters $ϕ$
$π (a \| s)$	Policy function mapping states to actions
$γ$	Discount factor for future rewards
$α$	Learning rate for neural network updates
$ϵ_{t}$	Exploration probability at time $t$
$S$	State space
$A$	Action space
$R$	Reward function
$P$	State transition probabilities
$w_{i}$	Adaptive weight for objective $i$
$ϕ$	Neural network parameters
$θ$	Model hyperparameters
$L_{t}$	Number of neural network layers
$W_{t}$	Channel width multiplier
$Q_{t}$	Quantization level
$P_{t}$	Transmission power level
$η_{t}$	Learning rate at time $t$
$B_{t}$	Batch size at time $t$
$R S S I_{t}$	Received signal strength indicator
$S N R_{t}$	Signal-to-noise ratio
$P L R_{t}$	Packet loss rate
$B W_{t}$	Available bandwidth
$C P U_{t}$	Processor utilisation
$M E M_{t}$	Memory consumption
$T E M P_{t}$	Device temperature
DRL	Deep Reinforcement Learning
TinyML	Tiny Machine Learning
6G	Sixth Generation Wireless
IoT	Internet of Things
NAS	Neural Architecture Search
QAT	Quantization-Aware Training
OFA	Once-for-All
DQN	Deep Q-Network
PPO	Proximal Policy Optimisation
DDPG	Deep Deterministic Policy Gradient
MDP	Markov Decision Process
SNR	Signal-to-Noise Ratio
RSSI	Received Signal Strength Indicator
PLR	Packet Loss Rate
CPU	Central Processing Unit
GPU	Graphics Processing Unit
RAM	Random Access Memory
RL	Reinforcement Learning
DRL	Deep Reinforcement Learning
MCU	Microcontroller Unit
SoC	System on Chip
SNR	Signal-to-Noise Ratio
RTT	Round-Trip Time
p95	95th Percentile Metric
QAT	Quantization-Aware Training
OFA	Once-for-All Network
NAS	Neural Architecture Search

References

Banchs, A.; Fiore, M.; Garcia-Saavedra, A.; Gramaglia, M. Network intelligence in 6G: Challenges and opportunities. In Proceedings of the 16th ACM Workshop on Mobility in the Evolving Internet Architecture (MobiArch ′21), New Orleans, LA, USA, 25 October 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
Ramadan, M.N.A.; Ali, M.A.H.; Khoo, S.Y.; Alkhedher, M. Federated learning and TinyML on IoT edge devices: Challenges, advances, and future directions. ICT Express 2025, 11, 754–768. [Google Scholar] [CrossRef]
Zhang, Z.; Xiao, Y.; Ma, Z.; Xiao, M.; Ding, Z.; Lei, X. 6G Wireless Networks: Vision, Requirements, Architecture, and Key Technologies. IEEE Veh. Technol. Mag. 2019, 14, 28–41. [Google Scholar] [CrossRef]
Rahman, S.; Hussain, M.; Shah, S.I.A.; Ali, Z.; Khan, M.A.; Nowakowski, G. Resource Management Across Edge Server in Mobile Edge Computing. IEEE Access 2024, 12, 181579–181589. [Google Scholar] [CrossRef]
Badidi, E.; Moumane, K.; Ghazi, F.E. Opportunities, Applications, and Challenges of Edge-AI Enabled Video Analytics in Smart Cities: A Systematic Review. IEEE Access 2023, 11, 80543–80572. [Google Scholar] [CrossRef]
Li, R.; Zhou, Z.; Zhang, X.; Chen, X. Joint Application Placement and Request Routing Optimization for Dynamic Edge Computing Service Management. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 4581–4596. [Google Scholar] [CrossRef]
Almaini, A.; Folz, J.; Al-Dubai, A.; Schramm, M.; Heigl, M.; Ghaleb, B. Adaptive Resource Allocation in Edge Computing for IoT Networks Using Neural Networks. In Proceedings of the 2025 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT), Bali, Indonesia, 3–5 July 2025; pp. 514–521. [Google Scholar] [CrossRef]
Lin, J.; Zhu, L.; Chen, W.-M.; Wang, W.-C.; Han, S. Tiny Machine Learning: Progress and Futures [Feature]. IEEE Circuits Syst. Mag. 2023, 23, 8–34. [Google Scholar] [CrossRef]
Liu, S.; Ha, D.S.; Shen, F.; Yi, Y. Efficient neural networks for edge devices. Comput. Electr. Eng. 2021, 92, 107121. [Google Scholar] [CrossRef]
Nimmagadda, Y. Model Optimization Techniques for Edge Devices. In Model Optimization Methods for Efficient and Edge AI: Federated Learning Architectures, Frameworks and Applications; IEEE: Piscataway, NJ, USA, 2025; pp. 57–85. [Google Scholar] [CrossRef]
Mao, Y.; You, C.; Zhang, J.; Huang, K.; Letaief, K.B. A Survey on Mobile Edge Computing: The Communication Perspective. IEEE Commun. Surv. Tutor. 2017, 19, 2322–2358. [Google Scholar] [CrossRef]
Liu, J.; Zhang, Q. Multi-Objective Optimization Based Service Pre-Deployment in Edge Computing. In Proceedings of the 2024 IEEE 15th International Conference on Software Engineering and Service Science (ICSESS), Changsha, China, 13–14 September 2024; pp. 74–78. [Google Scholar] [CrossRef]
Xia, S.; Yao, Z.; Li, Y.; Xing, Z.; Mao, S. Distributed Computing and Networking Coordination for Task Offloading Under Uncertainties. IEEE Trans. Mob. Comput. 2024, 23, 5280–5294. [Google Scholar] [CrossRef]
Al-Bakhrani, A.A.; Li, M.; Obaidat, M.S.; Amran, G.A. MOALF-UAV-MEC: Adaptive Multiobjective Optimization for UAV-Assisted Mobile Edge Computing in Dynamic IoT Environments. IEEE Internet Things J. 2025, 12, 20736–20756. [Google Scholar] [CrossRef]
Li, Y.; Zhao, L.; Wang, T.; Ding, L.; Yang, F. Knowledge- and Model-Driven Deep Reinforcement Learning for Efficient Federated Edge Learning: Single- and Multi-Agent Frameworks. IEEE Trans. Mach. Learn. Commun. Netw. 2025, 3, 332–352. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
Saha, U.; Jawad, A.; Shahria, S.; Rashid, A.B.M.H.-U. Proximal policy optimization-based reinforcement learning approach for DC-DC boost converter control: A comparative evaluation against traditional control techniques. Heliyon 2024, 10, e37823. [Google Scholar] [CrossRef]
Bose, T.; Chatur, N.; Baberwal, S.; Adhya, A. Caching and Computing Resource Allocation in Cooperative Heterogeneous 5G Edge Networks Using Deep Reinforcement Learning. IEEE Trans. Netw. Serv. Manag. 2024, 21, 4161–4178. [Google Scholar] [CrossRef]
Li, Y.; Zhang, X.; Zeng, T.; Duan, J.; Wu, C.; Wu, D.; Chen, X. Task placement and resource allocation for edge machine learning: A gnn-based multi-agent reinforcement learning paradigm. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 3073–3089. [Google Scholar] [CrossRef]
Saraan, A.; Nizam, H.; Zahara, S. Tiny Machine Learning (TinyML) Systems. 2024. Available online: https://www.researchgate.net/publication/386579238 (accessed on 16 November 2025).
Banbury, C.; Janapa Reddi, V.; Torelli, P.; Holleman, J.; Jeffries, N.; Kiraly, C.; Montino, P.; Kanter, D.; Ahmed, S.; Pau, D.; et al. MLPerf Tiny Benchmark. arXiv 2021, arXiv:2106.07597. [Google Scholar] [CrossRef]
M. Sandler, H.; Chu, G. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications—Recent Advances. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2234–2249. [Google Scholar] [CrossRef]
Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
Cai, H.; Gan, C.; Wang, T. Once-for-All: Train One Network and Specialize it for Efficient Deployment—Advanced Techniques. IEEE Trans. Comput. 2024, 73, 1789–1804. [Google Scholar] [CrossRef]
Jacob, B.; Kligys, S.; Chen, B. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference—Recent Progress. IEEE Comput. Vis. Pattern Recognit. 2024, 18, 567–582. [Google Scholar]
Li, H.; Liu, F.; Sun, Z.; Wang, Z.; Huang, S.; Yang, N.; Jiang, L. Neuronquant: Accurate and efficient post-training quantization for spiking neural networks. In Proceedings of the 30th Asia and South Pacific Design Automation Conference, Tokyo, Japan, 20–23 January 2025. [Google Scholar]
Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A Survey of Quantization Methods for Efficient Neural Network Inference. IEEE Signal Process. Mag. 2024, 41, 123–140. [Google Scholar]
Han, S.; Mao, H.; Dally, W. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding—Enhanced Methods. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3567–3582. [Google Scholar] [CrossRef]
Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network—Advanced Approaches. IEEE Trans. Knowl. Data Eng. 2024, 36, 1234–1249. [Google Scholar] [CrossRef]
Gerogiannis, G.; Birbas, M.; Leftheriotis, A.; Mylonas, E.; Tzanis, N.; Birbas, A. Deep Reinforcement Learning Acceleration for Real-Time Edge Computing Mixed Integer Programming Problems. IEEE Access 2022, 10, 18526–18543. [Google Scholar] [CrossRef]
Bandyopadhyay, A.; Chufare, A.T.; Gebresilasse, B.G.; Dash, B.B.; Patra, S.S.; Gourisaria, M.K. Edge Computing Resource Allocation Using Game-Theoretic Voting System. In Proceedings of the 2025 Seventh International Conference on Computational Intelligence and Communication Technologies (CCICT), Sonepat, India, 11–12 April 2025; pp. 757–762. [Google Scholar] [CrossRef]
Chen, X.; Zhu, G.; Ding, H.; Zhang, L.; Zhang, H.; Fang, Y. End-to-End Service Auction: A General Double Auction Mechanism for Edge Computing Services. IEEE/ACM Trans. Netw. 2022, 30, 2616–2629. [Google Scholar] [CrossRef]
He, Z.; Zhu, G.; Zhang, S.; Luo, E.; Zhao, Y. FedDT: A Communication-Efficient Federated Learning via Knowledge Distillation and Ternary Compression. Electronics 2025, 14, 2183. [Google Scholar] [CrossRef]
Lin, H.-T.; Wen, C.-Y. Edge federated optimization for heterogeneous data. Future Internet 2024, 16, 142. [Google Scholar] [CrossRef]
Shao, Z.; Li, B.; Wang, Z.; Yang, Y.; Wang, P.; Luo, J. FedUFD: Personalized Edge Computing Using Federated Uncertainty-Driven Feature Distillation. In Proceedings of the IEEE INFOCOM 2025—IEEE Conference on Computer Communications, London, UK, 19–22 May 2025; pp. 1–10. [Google Scholar] [CrossRef]
Li, M.; Yang, Z. Double deep Q-network-based task offloading strategy in mobile edge computing. In Proceedings of the 4th International Conference on Internet of Things and Smart City (IoTSC 2024), Hangzhou, China, 22–24 March 2024; SPIE; Volume 13224, p. 1322418. [Google Scholar] [CrossRef]
Liu, C.; Li, S.; Shi, K.; Chen, H.; Li, S.; Kanellopoulos, D. Task Scheduling and Resource Allocation Based on Reinforcement Learning in Edge Computing; IOS Press: Amsterdam, The Netherlands, 2025. [Google Scholar] [CrossRef]
Tilahun, F.D.; Abebe, A.T.; Kang, C.G. Multi-Agent Reinforcement Learning for Distributed Resource Allocation in Cell-Free Massive MIMO-Enabled Mobile Edge Computing Network. IEEE Trans. Veh. Technol. 2023, 72, 16454–16468. [Google Scholar] [CrossRef]
Liu, X.; Zhang, H.; Long, K.; Zhou, M.; Li, Y.; Poor, H.V. Proximal Policy Optimization-based Transmit Beamforming and Phase-shift Design in an IRS-aided ISAC System for the THz Band. arXiv 2022, arXiv:2203.10819. [Google Scholar] [CrossRef]
Abdullah, A.A.; Manaa, M.E. Optimizing resource allocation in 5G networks through enhanced particle swarm optimization leveraging deep deterministic policy gradient techniques. Intell. Decis. Technol. 2025, 19, 3340–3359. [Google Scholar] [CrossRef]
Letaief, K.B.; Shi, Y.; Lu, J.; Lu, J. Edge Artificial Intelligence for 6G: Vision, Enabling Technologies, and Applications. IEEE J. Sel. Areas Commun. 2022, 40, 5–36. [Google Scholar] [CrossRef]
Liu, W.; Zhu, J.; Li, X.; Fei, Y.; Wang, H.; Liu, S.; Zheng, X.; Ji, Y. Resource scheduling algorithm for edge computing networks based on multi-objective optimization. Appl. Sci. 2025, 15, 10837. [Google Scholar] [CrossRef]
Huang, J.; Du, Y.; Zheng, Y.; Zhang, X. Joint Optimization of Task Offloading and Resource Allocation in Mobile Edge Computing System. In Proceedings of the 2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, 6–8 January 2023; pp. 229–232. [Google Scholar] [CrossRef]
Saha, B.; Samanta, R.; Ghosh, S.K.; Roy, R.B. TinyTNAS: GPU-free, time-bound, hardware-aware neural architecture search for TinyML time series classification. arXiv 2024, arXiv:2408.16535. [Google Scholar] [CrossRef]
Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv 2018, arXiv:1806.08342. [Google Scholar] [CrossRef]
Cai, H.; Zhu, L.; Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv 2018, arXiv:1812.00332. [Google Scholar] [CrossRef]
Yu, J.; Jin, P.; Liu, H.; Bender, G.; Kindermans, P.-J.; Tan, M.; Huang, T.; Song, X.; Pang, R.; Le, Q. BigNAS: Scaling up neural architecture search with big single-stage models. arXiv 2020, arXiv:2003.11142. [Google Scholar] [CrossRef]
He, X.; Zhao, K.; Chu, X. AutoML: A survey of the state-of-the-art. Knowl.-Based Syst. 2021, 212, 106622. [Google Scholar] [CrossRef]
Bebortta, S.; Tripathy, S.S.; Khan, S.B.; Al Dabel, M.M.; Almusharraf, A.; Bashir, A.K. TinyDeepUAV: A tiny deep reinforcement learning framework for UAV task offloading in edge-based consumer electronics. IEEE Trans. Consum. Electron. 2024, 70, 7357–7364. [Google Scholar] [CrossRef]
Abdulazeez, D.H.; Askar, S.K. Offloading mechanisms based on reinforcement learning and deep learning algorithms in the fog computing environment. IEEE Access 2023, 11, 12555–12586. [Google Scholar] [CrossRef]
Heydari, S.; Mahmoud, Q.H. Tiny machine learning and on-device inference: A survey of applications, challenges, and future directions. Sensors 2025, 25, 3191. [Google Scholar] [CrossRef]
Khouas, A.R.; Bouadjenek, M.R.; Hacid, H.; Aryal, S. Training machine learning models at the edge: A survey. arXiv 2024, arXiv:2403.02619. [Google Scholar] [CrossRef]
Wang, Z.; Goudarzi, M.; Buyya, R. Reinfog: A DRL empowered framework for resource management in edge and cloud computing environments. arXiv 2024, arXiv:2411.13121. [Google Scholar] [CrossRef]
Ma, Y.; Zhao, Y.; Hu, Y.; He, X.; Feng, S. Multi-Agent Deep Reinforcement Learning for Joint Task Offloading and Resource Allocation in IIoT with Dynamic Priorities. Sensors 2025, 25, 6160. [Google Scholar] [CrossRef] [PubMed]
Almuseelem, W. Deep Reinforcement Learning-Enabled Computation Offloading: A Novel Framework for Energy Optimization and Security-Aware Vehicular Edge-Cloud Computing Networks. Sensors 2025, 25, 2039. [Google Scholar] [CrossRef] [PubMed]
Tsoukas, V.; Gkogkidis, A.; Boumpa, E.; Kakarountas, A. A review on the emerging technology of TinyML. ACM Comput. Surv. 2024, 56, 259. [Google Scholar] [CrossRef]
Peng, Y.; Xiang, L.; Yang, K.; Jiang, F.; Wang, K.; Wu, D.O. SIMAC: A semantic-driven integrated multimodal sensing and communication framework. IEEE J. Sel. Areas Commun. 2025, early access. [Google Scholar] [CrossRef]
Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244. [Google Scholar] [CrossRef]

Figure 1. Illustration of the DRL-TinyEdge architecture that demonstrates how deep reinforcement learning optimisation can be combined with TinyML deployment in a 6G edge. The framework dynamically adapts model configurations based on real-time network conditions and computational constraints.

Figure 2. Detailed architecture of the DRL-TinyEdge framework showing the interaction between the DRL optimisation engine, environment monitoring, model adaptation controller, and performance evaluation components.

Figure 3. End-to-end inference CDF of end-to-end inference latency of each baseline approach, as well as DRL-TinyEdge. The stipulated areas are the interquartile ranges (IQRs) across 5 experimental runs. DRL-TinyEdge has better tail latency control than the best-performing baseline, achieving 26.7% lower p95 latency, while maintaining constant performance under dynamic network conditions. The upper dotted line indicates the 100 ms SLA threshold.

Figure 4. Analysis of latency–energy trade-off space. Pareto frontier analysis of the latency–energy trade-off space. The points indicate the average performance of each method and display the standard deviation error bars from 5 runs. DRL-TinyEdge (red star) achieves a Pareto-optimal latency of 72.3 ms and an energy expenditure of 21.1 mJ, surpassing all baseline techniques. The grey stipulated area signifies the Pareto dominated zone of operation, of which no method ought to exist. The static methods (circles) take up sub-optimal positions, whereas the adaptive methods (triangles) take to the frontier but are not as stable and efficient as DRL-TinyEdge.

Figure 5. Training loss convergence curves for the DRL-TinyEdge framework across different deployment scenarios, showing rapid initial convergence and stable long-term performance.

Figure 6. Latency performance comparison showing end-to-end inference times for different methods across various deployment scenarios. DRL-TinyEdge achieves consistent latency improvements.

Figure 7. Energy consumption comparison across different hardware platforms showing significant energy savings achieved by DRL-TinyEdge through adaptive optimisation strategies.

Figure 8. Communication reliability analysis showing packet delivery ratios and connection stability metrics for different methods under varying 6G network conditions.

Figure 9. Sensitivity heat map of the influence of different coefficients of rewards

(α, β, γ, δ)

on the integrity of the main framework. Multi-objective reward design has been proven robust, as the balanced region (

α = 0.35

,

β = 0.40

,

γ = 0.20

,

δ = 0.10

) yields good latency–energy trade-offs.

Figure 9. Sensitivity heat map of the influence of different coefficients of rewards

(α, β, γ, δ)

on the integrity of the main framework. Multi-objective reward design has been proven robust, as the balanced region (

α = 0.35

,

β = 0.40

,

γ = 0.20

,

δ = 0.10

) yields good latency–energy trade-offs.

Figure 10. Stress-test evaluation under varying SNR, RTT, and device-temperature conditions. DRL-TinyEdge maintains near-constant p95 latency and energy efficiency, exhibiting stable policy behaviour even under severe network degradation or thermal throttling.

Figure 11. Scalability analysis showing framework performance with increasing numbers of edge devices, demonstrating maintained efficiency across different deployment scales.

Figure 12. Urban drone surveillance deployment scenario showing DRL-TinyEdge optimisation of object detection models under varying flight conditions and network connectivity.

Figure 13. Industrial IoT monitoring deployment demonstrating adaptive optimisation of sensor data processing models in manufacturing environments with varying operational conditions.

Table 1. Controller Footprint on Edge Devices.

Device	CPU Usage (%)	RAM (KB)	Decision Time (ms)
Jetson Nano	6.2	980	4.6
ESP32 MCU	3.9	412	3.8

Table 2. Reward Coefficient Settings and Tuning Notes.

Parameter	Symbol	Range	Tuned Value	Purpose
Latency Weight	α	0.2–0.5	0.35	Penalises p95 latency
Energy Weight	β	0.3–0.6	0.40	Balances power draw
Accuracy Weight	γ	0.2–0.4	0.20	Preserves model fidelity (≤1% drop)
Switch Penalty	δ	0.05–0.2	0.10	Prevents policy chattering

Table 3. Experimental Setup Summary.

Category	Specification	Values/Ranges	Units	Notes
Hardware	MCU/SoC	ESP32 @ 240 MHz, Jetson Nano @ 1.4 GHz, Raspberry Pi 4 @ 1.5 GHz	512 KB–4 GB RAM	Power measured via INA219 sensor
Workloads	Models	MobileNetV3-Tiny, EfficientNet-B0, CNN-SensorNet	INT8/FP16	Target FPS = 15 (vision), 100 (sensor)
Network	Scenarios	Good ( $>$ 25 dB SNR), Avg (15–25), Poor ( $<$ 15)	RTT: 10–200 ms	Jitter $\pm$ 20 ms
Training	Episode Length	2000 steps $\times$ 5 episodes $\times$ 3 seeds	–	Decision window = 20 frames
Software	Frameworks	PyTorch 2.3, Stable-Baselines3 v2.0	–	CUDA 11.8 on Ubuntu 22.04 LTS

Table 4. Main Performance Comparison—DRL-TinyEdge vs. Baseline Methods.

Method	Accuracy (%)	p95 Latency (ms)	Energy (mJ)	Switch Rate (/min)	Throughput (fps)	Memory (MB)	Adaptability
DRL-TinyEdge (Ours)	93.4 ± 0.6	72.3 ± 4.2	21.1 ± 1.8	0.8 ± 0.2	14.2	1.2	High
Local-Only	92.8 ± 0.5	147.2 ± 8.1	37.4 ± 2.3	0.0	6.8	0.8	None
Static-Offload	94.1 ± 0.4	128.5 ± 12.6	45.8 ± 5.7	0.0	7.8	0.6	None
Heuristic-QoS	92.9 ± 0.7	98.7 ± 7.3	29.6 ± 2.9	4.2 ± 0.8	10.1	0.9	Low
TinyNAS	91.2 ± 0.8	112.4 ± 6.5	31.8 ± 2.1	0.0	8.9	1.1	None
QAT-Only	92.6 ± 0.5	108.9 ± 5.9	28.4 ± 1.9	0.0	9.2	0.7	None
OFA	93.1 ± 0.6	95.3 ± 6.8	26.7 ± 2.4	0.0	10.5	1.8	Low

Table 5. Controller Overhead Analysis across Edge Devices.

Device	CPU (%)	RAM (KB)	Decision Time (ms)	Throughput Impact (%)	Network (kbps)	Storage (MB)
ESP32 (240 MHz)	3.9 ± 0.4	412 ± 18	3.8 ± 0.3	0.2	<0.1	0.8
Jetson Nano (1.4 GHz)	6.2 ± 0.6	980 ± 42	4.6 ± 0.4	0.3	<0.1	1.2
Raspberry Pi 4 (1.5 GHz)	5.1 ± 0.5	756 ± 35	4.2 ± 0.3	0.3	<0.1	1.0
Average	5.1	716	4.2	0.3	<0.1	1.0

Table 6. Infrastructure Cost and Savings Analysis (30-Day Period, 100 Devices).

Method	Compute ($)	Data Transfer ($)	Storage ($)	Total Cost ($)	Cost Per Device ($)	Cost Savings vs. Static-Offload (%)
DRL-TinyEdge (Ours)	74	38	15	127	1.27	−62.0
Local-Only	0	0	80	80	0.80	-
Static-Offload	178	132	24	334	3.34	Baseline
Heuristic-QoS	112	76	18	206	2.06	−38.3
TinyNAS/QAT	0	0	85	85	0.85	−74.5
OFA	0	0	92	92	0.92	−72.5

Table 7. Accuracy Comparison Results.

Method	CIFAR-10	ImageNet	IoT Sensors	Average
DRL-TinyEdge (Ours)	94.2%	71.8%	92.6%	86.2%
TinyNAS	91.5%	68.3%	89.1%	82.9%
QAT	90.8%	67.9%	88.7%	82.5%
OFA	92.1%	69.2%	90.3%	83.9%
EfficientNet	91.7%	68.7%	89.8%	83.4%
MobileNetV3	90.3%	66.8%	87.9%	81.7%
ProxylessNAS	91.9%	68.9%	90.1%	83.6%
BigNAS	92.3%	69.5%	90.7%	84.2%
AutoML	90.9%	67.5%	88.4%	82.3%
NAPS	91.2%	68.1%	89.2%	82.8%
DARTS	90.6%	67.2%	88.0%	81.9%

Table 8. Latency Performance Results (milliseconds).

Method	MobileNetV3	EfficientNet	Custom CNN	Average
DRL-TinyEdge (Ours)	12.3	18.7	8.9	13.3
TinyNAS	15.8	24.2	11.6	17.2
QAT	16.1	24.8	12.0	17.6
OFA	14.9	22.1	10.8	15.9
EfficientNet	15.6	23.4	11.3	16.8
MobileNetV3	16.3	25.1	12.4	17.9
ProxylessNAS	15.2	22.8	11.0	16.3
BigNAS	14.7	21.9	10.5	15.7
AutoML	16.0	24.5	11.9	17.5
NAPS	15.5	23.2	11.2	16.6
DARTS	16.4	25.0	12.3	17.9

Table 9. Energy Consumption Results (mJ per inference).

Method	Raspberry Pi	Jetson Nano	ESP32	Average
DRL-TinyEdge (Ours)	23.4	45.7	8.9	26.0
TinyNAS	31.2	58.6	12.3	34.0
QAT	32.1	60.4	12.8	35.1
OFA	29.8	55.2	11.6	32.2
EfficientNet	30.5	57.1	12.0	33.2
MobileNetV3	33.7	62.8	13.5	36.7
ProxylessNAS	30.1	56.3	11.8	32.7
BigNAS	29.3	54.7	11.2	31.7
AutoML	32.5	61.2	13.1	35.6
NAPS	31.0	58.1	12.1	33.7
DARTS	33.2	62.4	13.3	36.3

Table 10. Effect of Removing Key Components.

Configuration	p95 Latency (ms)	Energy (mJ)	Switch Rate (#/min)
Full Model (DRL-TinyEdge)	99	21.1	0.8
w/o SwitchCost Term	93	19.4	4.2
Latency-Only Objective	97	29.6	1.1
Energy-Only Objective	122	18.7	0.6
w/o Offloading	147	24.3	0.2
w/o Model Adaptation	135	27.9	0.4
DQN → PPO Variant	103	22.0	0.9

Table 11. Real-World Deployment Performance Results.

Scenario	Accuracy Gain	Latency Reduction	Energy Savings	Reliability Gain
Urban Drone Surveillance	+4.7%	−23.1%	−38.2%	+7.3%
Industrial IoT Monitoring	+3.9%	−19.8%	−41.5%	+5.8%
Vehicular Edge Computing	+5.2%	−25.6%	−35.7%	+8.9%
Average	+4.6%	−22.8%	−38.5%	+7.3%

Table 12. Resource Utilization and Computational Overhead of the DRL TinyEdge Framework.

Component	CPU Usage	Memory Usage	Network Overhead	Storage Requirement
Environment Monitor	2.3%	15.7 MB	0.8 kbps	2.1 MB
DRL Agent	8.9%	127.4 MB	0.0 kbps	45.3 MB
Model Controller	1.7%	8.2 MB	0.0 kbps	1.8 MB
Performance Evaluator	0.9%	4.1 MB	0.0 kbps	0.7 MB
Total Framework	13.8%	155.4 MB	0.8 kbps	49.9 MB

Table 13. Comprehensive Comparison with State-of-the-Art Methods.

Method	Accuracy	Latency	Energy	Reliability	Adaptability	Deployment
DRL-TinyEdge (Ours)	86.2%	13.3 ms	26.0 mJ	94.7%	High	Easy
TinyNAS [45]	82.9%	17.2 ms	34.0 mJ	89.3%	Low	Moderate
QAT [58]	82.5%	17.6 ms	35.1 mJ	88.7%	Low	Easy
OFA [46]	83.9%	15.9 ms	32.2 mJ	90.1%	Moderate	Moderate
EfficientNet [47]	83.4%	16.8 ms	33.2 mJ	89.8%	Low	Easy
MobileNetV3 [48]	81.7%	17.9 ms	36.7 mJ	87.5%	Low	Easy
ProxylessNAS [49]	83.6%	16.3 ms	32.7 mJ	90.5%	Low	Difficult
BigNAS [50]	84.2%	15.7 ms	31.7 mJ	91.2%	Low	Difficult

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Alaklabi, S.; Alharbi, S. DRL-TinyEdge: Energy- and Latency-Aware Deep Reinforcement Learning for Adaptive TinyML at the 6G Edge. Future Internet 2026, 18, 31. https://doi.org/10.3390/fi18010031

AMA Style

Alaklabi S, Alharbi S. DRL-TinyEdge: Energy- and Latency-Aware Deep Reinforcement Learning for Adaptive TinyML at the 6G Edge. Future Internet. 2026; 18(1):31. https://doi.org/10.3390/fi18010031

Chicago/Turabian Style

Alaklabi, Saad, and Saleh Alharbi. 2026. "DRL-TinyEdge: Energy- and Latency-Aware Deep Reinforcement Learning for Adaptive TinyML at the 6G Edge" Future Internet 18, no. 1: 31. https://doi.org/10.3390/fi18010031

APA Style

Alaklabi, S., & Alharbi, S. (2026). DRL-TinyEdge: Energy- and Latency-Aware Deep Reinforcement Learning for Adaptive TinyML at the 6G Edge. Future Internet, 18(1), 31. https://doi.org/10.3390/fi18010031

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

DRL-TinyEdge: Energy- and Latency-Aware Deep Reinforcement Learning for Adaptive TinyML at the 6G Edge

Abstract

1. Introduction

2. Related Work

2.1. TinyML Optimisation Techniques

2.2. Edge Computing Optimisation

2.3. Deep Reinforcement Learning in Edge Computing

3. Proposed Methodology

3.1. System Architecture Overview

3.2. Environment State Representation

3.3. Action Space Design

3.4. Multi-Objective Reward Function

3.5. Method: Lightweight and Safe Controller Design

4. Mathematical Modelling

4.1. Problem Formulation

4.2. Action Space Formulation

4.3. Multi-Objective Reward Function

4.4. Deep Q-Network Formulation

4.5. Adaptive Weight Mechanism

4.6. Constraint Handling

4.7. Convergence Analysis

5. Results and Evaluation

5.1. Experimental Setup

5.2. Main Performance Comparison

5.3. Latency Distribution Analysis

5.4. Multi-Objective Pareto Frontier Analysis

5.5. Controller Overhead Analysis

5.6. Cloud Infrastructure Cost Analysis

5.7. Baseline Methods

5.8. Performance Metrics

5.9. Training Convergence Analysis

5.10. Accuracy Performance Evaluation

5.11. Latency Performance Analysis

5.12. Energy Efficiency Evaluation

5.13. Communication Reliability Assessment

5.14. Ablation and Sensitivity Analysis

5.15. Scalability Analysis

5.16. Real-World Deployment Scenarios

5.17. Computational Overhead Analysis

6. Discussion

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Nomenclature

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI