Cross-Layer Resource Optimization for Ultra-Low-Power TinyML Inference on ARM Cortex-M Microcontrollers

Alanazi, Abdulaziz G.; Alanazi, Haifa A.; Albalawi, Nasser S.

doi:10.3390/electronics15132918


by Abdulaziz G. Alanazi ¹, Haifa A. Alanazi ¹,* and Nasser S. Albalawi ²

¹ Department of Information Systems, Faculty of Computing and Information Technology, Northern Border University, Rafha 91911, Saudi Arabia

² Department of Computer Science, Faculty of Computing and Information Technology, Northern Border University, Rafha 91911, Saudi Arabia

* Author to whom correspondence should be addressed.

Electronics 2026, 15(13), 2918; https://doi.org/10.3390/electronics15132918

Submission received: 25 April 2026 / Revised: 5 June 2026 / Accepted: 22 June 2026 / Published: 3 July 2026


Abstract

Running neural networks on battery-powered Internet of Things (IoT) sensor nodes is difficult because flash memory, SRAM, latency, and energy per inference are limited at the same time. Existing TinyML co-design methods usually improve model size or memory use, but runtime voltage–frequency control is often handled as a separate step. This separation limits energy saving because the power policy does not use the layer-wise compute profile of the final compressed model. We propose the Cross-Layer Resource Optimizer (CLRO), a three-stage resource optimization pipeline for TinyML inference on an ARM Cortex-M7 target. The first stage, Mixed-Precision Aware Pruning and Distillation (MPAD), assigns per-layer bit widths and pruning ratios using calibration-set sensitivity scores. The second stage, consisting of the Activation Lifetime-Aware Tensor Scheduler (ALTS), uses the compressed graph to find an execution order that reduces peak live static random-access memory (SRAM). The third stage, Reinforcement Learning-Based Dynamic Voltage and Frequency Scaling (DVFS-RL), trains a tabular Q-learning policy from the multiply–accumulate (MAC) utilization profile of the compressed and scheduled model. The learned voltage–frequency policy is stored as a small flash lookup table, so it adds no runtime decision cost during inference. We evaluate the CLRO on all four MLPerf Tiny tasks using an STM32H743ZI microcontroller with 512 kB SRAM and 2 MB flash. The CLRO reaches 91.7% image classification accuracy, 95.4% keyword-spotting accuracy, 89.6% visual wake words accuracy, and 0.913 anomaly detection AUC. The final deployment uses 198 kB flash and 174 kB peak SRAM, with 387 μJ energy per inference and 38 ms latency. Compared with the MCUNet baseline, the CLRO reduces energy by 58.1% and peak SRAM by 39% while keeping the same accuracy level.

Keywords: TinyML; cross-layer optimization; mixed-precision quantization; edge intelligence; MLPerf Tiny

1. Introduction

Battery-powered Internet of Things (IoT) nodes are now used in settings where sending raw sensor data to a remote server is not practical. Industrial fault monitoring, wildlife tracking, wearable health sensing, and smart agriculture are a few examples where latency, bandwidth cost, and data privacy all push the compute to the device itself. ARM Cortex-M microcontrollers dominate this space because they are cheap and draw only a few milliwatts, but their on-chip resources are tightly bounded: clock speeds sit between 100 MHz and 480 MHz, flash rarely exceeds 2 MB, and static random-access memory (SRAM) is often below 512 kB [1,2]. Fitting a useful neural network inside these limits is already hard. Running it repeatedly on a coin cell or a small energy harvester makes the energy per inference a first-order constraint, not an afterthought. TinyML, the practice of compressing and deploying deep learning models on microcontrollers, has grown quickly as a field to address exactly this need [3,4]. Quantization, pruning, and knowledge distillation are now standard tools [5,6,7], and frameworks such as TensorFlow Lite Micro (TFLM) and the Cortex Microcontroller Software Interface Standard Neural Network (CMSIS-NN) library provide the integer-arithmetic kernels that make INT8 inference feasible on Cortex-M hardware [8,9]. System-level work on MCUNet showed that jointly searching the network architecture and the memory schedule in a co-design loop can cut SRAM use by 3.5× and flash by 5.7× compared with quantized MobileNetV2 while reaching over 70% ImageNet top-1 accuracy on a commercial microcontroller unit (MCU) [10]. MicroNets pushed this further via a differentiable neural architecture search (NAS) that targets microcontroller memory profiles directly [11]. These advances show that model design and memory layout can be co-optimized, but the energy consumed per inference, which depends on the processor’s voltage–frequency state during each layer’s execution, is still left to a fixed hardware governor that has no knowledge of the model structure.

The gap between what today’s methods achieve and what battery life actually demands becomes clear when numbers are put together: MCUNet on an STM32H743ZI running a visual wake words task uses 286 kB peak SRAM and spends about 923 µJ per inference. MicroNets is slightly better on both fronts but still draws 844 µJ. On a 230 mAh coin cell at 3 V—a common budget for a sealed IoT node—923 µJ per inference at once per second gives a theoretical compute-only life of roughly 207 h; a 58% cut would push that past 490 h before other idle losses are even counted. The reason existing work could not close this gap is structural: compression policy, activation lifetime scheduling, and runtime voltage–frequency control are designed and applied as three separate, sequential steps with no feedback between them. A pruning ratio that looks fine for flash usage may push a certain layer’s activation tensor wide enough to spike SRAM at that point, and the elevated SRAM pressure changes which execution order is optimal, which, in turn, changes the multiply–accumulate (MAC) utilization profile that a Dynamic Voltage and Frequency Scaling (DVFS) controller should act on. Solving each sub-problem independently misses savings that are only visible when the three decisions are coupled [12,13,14,15]. To the best of our knowledge, no prior work has built a single pipeline where sensitivity-guided mixed-precision pruning, activation lifetime-aware tensor scheduling, and a reinforcement learning DVFS agent are connected so that each stage takes the compressed, rescheduled graph as its direct input.
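The battery arithmetic in this paragraph can be checked with a short calculation. The sketch below is illustrative only: it assumes an ideal cell whose rated capacity is fully usable at a constant 3 V and counts inference energy alone, and `usable_fraction` is a hypothetical derating knob that does not appear in the text. Under these idealized assumptions the absolute hours come out larger than the figures quoted above, which suggests the quoted hours include losses or derating that the text does not spell out. The ratio between the two lifetimes depends only on the two per-inference energies and is unaffected.

```python
def compute_only_life_hours(capacity_mah, voltage_v, energy_per_inference_uj,
                            inferences_per_second=1.0, usable_fraction=1.0):
    """Hours of operation if inference were the only load on the cell."""
    # Stored energy in joules: (A*h) * (s/h) * V, scaled by the assumed usable fraction.
    stored_joules = capacity_mah * 1e-3 * 3600.0 * voltage_v * usable_fraction
    # Average power in watts: joules per inference times inferences per second.
    average_power_w = energy_per_inference_uj * 1e-6 * inferences_per_second
    return stored_joules / average_power_w / 3600.0


# Per-inference energies quoted in the text (MCUNet baseline and the CLRO).
baseline_h = compute_only_life_hours(230, 3.0, 923)
clro_h = compute_only_life_hours(230, 3.0, 387)

life_ratio = clro_h / baseline_h      # equals 923 / 387, independent of the cell
energy_cut = 1.0 - 387 / 923          # fractional energy reduction

print(f"baseline: {baseline_h:.0f} h, CLRO: {clro_h:.0f} h, "
      f"ratio: {life_ratio:.2f}x, energy cut: {energy_cut:.1%}")
```

Because the lifetime ratio is just the inverse of the energy ratio, any uniform derating of the cell scales both lifetimes equally.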

This paper proposes the CLRO (Cross-Layer Resource Optimizer), a three-stage framework that closes this feedback gap on the STM32H743ZI Cortex-M7 target using the full MLPerf Tiny benchmark suite [16]. Stage 1 (MPAD, Mixed-Precision Aware Pruning and Distillation) assigns per-layer bit widths and pruning ratios from calibration-set sensitivity scores. Stage 2 (ALTS, the Activation Lifetime-Aware Tensor Scheduler) runs a greedy depth-first execution-order search on the compressed graph to reduce peak live SRAM. Stage 3 (DVFS-RL, Reinforcement Learning-Based Dynamic Voltage and Frequency Scaling) trains a tabular Q-learning agent whose state vector (MAC utilization, deadline slack, and remaining energy budget) is built from the compressed, scheduled graph produced by stages 1 and 2. The resulting 2 kB Q-table is stored in flash, so selecting a voltage–frequency state at inference time costs only a table read. On all four MLPerf Tiny tasks, the CLRO achieves 91.7% image classification accuracy, 95.4% keyword-spotting accuracy, 89.6% visual wake words accuracy, and a 0.913 anomaly detection area under the ROC curve (AUC) while using only 198 kB flash, 174 kB peak SRAM, 387 µJ per inference, and 38 ms latency, a combination that no single-technique baseline in the literature matches simultaneously.

For clarity, the main abbreviations used in this paper are defined here: Cross-Layer Resource Optimizer (CLRO) refers to the full proposed pipeline. Mixed-Precision Aware Pruning and Distillation (MPAD) is the model compression stage. Activation Lifetime-Aware Tensor Scheduler (ALTS) is the memory scheduling stage. Reinforcement Learning-Based Dynamic Voltage and Frequency Scaling (DVFS-RL) is the runtime power control stage. Static random-access memory (SRAM), multiply–accumulate (MAC), and Dynamic Voltage and Frequency Scaling (DVFS) are used with these meanings throughout the manuscript. The key contributions of this work are as follows:

1. MPAD (Mixed-Precision Aware Pruning and Distillation): A per-layer sensitivity score computed on a 512-sample calibration set drives both the bit width assignment (INT8 or INT4) and the pruning ratio, so layers that matter more to accuracy keep more capacity while flash savings are concentrated where they cost least.
2. ALTS (Activation Lifetime-Aware Tensor Scheduler): A greedy depth-first heuristic searches the execution-order space of the compressed graph for a permutation with low peak live SRAM, reducing the peak from 286 kB (MCUNet baseline) to 174 kB on the STM32H743ZI, a 39% cut, without changing any weights.
3. DVFS-RL (Q-Learning Voltage–Frequency Controller): A tabular agent trained offline on the compressed, scheduled graph learns a power state policy that cuts dynamic energy by 58.1% (923 µJ to 387 µJ) while keeping inference within the 50 ms deadline. The final lookup table fits in 2 kB of flash.
4. Cross-layer feedback coupling: MPAD, the ALTS, and DVFS-RL are connected in a pipeline where each stage uses the output of the previous one. An ablation study confirms that removing any one stage degrades both accuracy and energy, showing the value of the coupling.
5. Systematic evaluation on MLPerf Tiny: The CLRO is benchmarked on all four tasks (image classification, keyword spotting, visual wake words, anomaly detection) against five published baselines on real STM32H743ZI hardware with energy measured by a Nordic Power Profiler Kit II at 1 kHz, making the results directly reproducible [8,16].

The rest of the paper is organized as follows: Section 2 reviews recent work on TinyML compression, memory scheduling, and runtime power control. Section 3 defines the joint optimization problem and explains why flash, SRAM, latency, and energy must be handled together. Section 4 describes the MLPerf Tiny datasets and preprocessing steps used in the experiments. Section 5 presents the proposed CLRO framework in three connected stages. Section 6 reports the experimental results, ablation study, and comparison with existing baselines. Section 7 closes the paper and gives the main future research directions.

2. Related Work

2.1. Model Compression for MCU Deployment

Deploying neural networks on microcontrollers has driven a large body of work around model compression. Quantization converts floating-point weights and activations to fixed-point integers, cutting both memory and multiply–accumulate cost with only a small accuracy drop [5]. Post-training INT8 quantization is now the standard starting point for Cortex-M targets because it requires no re-training and maps directly to the CMSIS-NN integer kernels [9]. Pruning removes weights or entire filters that contribute little to the output. Han et al. showed that combining pruning, quantization, and Huffman coding can shrink a CNN by up to 49× with under 0.5% accuracy loss [7]. Knowledge distillation trains a compact student network under the supervision of a larger teacher, preserving task performance at a fraction of the parameter count [6]. Work on mixed-precision assignment goes further by giving sensitive layers wider bit widths while compressing less critical ones so accuracy and flash usage are balanced at a per-layer granularity [17]. Surveys of TinyML compression confirm that these three techniques (quantization, pruning, and distillation) are the main tools practitioners reach for when fitting a model inside a few hundred kilobytes of flash [1,3]. The DTMM library showed that filterlet-level pruning combined with a custom runtime operator can reduce model size and inference latency on Cortex-M55 MCUs by up to 42.8% and 27.7%, respectively, compared to structured baselines [18]. Despite this progress, these compression methods are designed and applied independently; there is no unified feedback path between the compression policy, the on-chip memory layout, and the runtime power state of the processor.

2.2. Memory Scheduling and Runtime Power Management

A separate line of work targets the runtime side of inference efficiency. Liberis and Lane showed that changing the execution order of TensorFlow Lite operators is enough to reduce peak SRAM by a measurable margin without touching the model weights at all [14]. Work on MCUNetV2 took this idea further by introducing patch-based inference scheduling together with receptive field redistribution, letting a Cortex-M4 run object detection under a 256 kB SRAM limit that was previously impossible [13]. Partial execution (PEX) avoids materializing full activation buffers in SRAM by exploiting the property where many operators can produce and consume one tile at a time, cutting peak memory with no computation overhead [19]. On the power management side, Dynamic Voltage and Frequency Scaling (DVFS) is a well-known technique for adjusting the processor clock and supply voltage at runtime to match workload demand [12]. Reinforcement learning has been applied to DVFS scheduling on embedded processors because the discrete action space (a small set of P-states) and delayed reward (energy versus deadline) match the Q-learning formulation well [10,15]. Zhang et al. demonstrated that a learning-based DVFS policy for edge-cloud collaborative DNN inference achieves energy savings that static governors cannot match [20]. On STM32 microcontrollers specifically, decoupled access–execute DVFS has been shown to cut per-inference energy by up to 25.2% compared to a fixed-frequency reference [21]. Each of these contributions is strong in isolation, but memory scheduling and DVFS are treated as separate offline or heuristic steps; neither one feeds information back to the other during or after the compression stage.

2.3. Cross-Layer Collaborative Optimization for TinyML

Jointly optimizing model compression, memory scheduling, and runtime power for MCU inference is a recognized goal, but each prior co-design work covers at most two of the three dimensions.

MCUNet and MicroNets search model architecture and SRAM budget together in a single NAS loop [10,11], so the selected model already accounts for both flash and SRAM limits. However, once the architecture is fixed, both frameworks hand it to the static hardware clock governor; the inference time voltage and frequency are never part of the search objective, and no compression sensitivity information reaches the power controller.

RL-based DVFS methods for edge devices do learn a dynamic power state policy [10,15], but the deployed model is treated as a fixed black box. The reward signal uses only deadline slack and measured energy, not the per-layer sensitivity profile of the compressed model. A heavily pruned INT4 layer and a full INT8 layer receive the same voltage–frequency treatment, so savings that depend on knowing which layers tolerate lower supply voltage are left unrealized. The learned policy also cannot adapt when the pruning ratio changes, because compression structure is outside the agent’s state space.

On the memory side, operator reordering [14] and patch-based scheduling [13] are solved offline on the unmodified model graph. Neither method takes a compressed graph as input, and neither passes the resulting activation lifetime profile to a power controller; the voltage domain the processor runs in is entirely invisible to the scheduler. Recent work also shows that energy-aware edge intelligence is moving beyond model compression alone. Heidari et al. [22] studied dynamic IoT–edge–cloud offloading with security and energy constraints, which supports the need for joint resource decisions in IoT systems. Ramadan et al. [23] reviewed TinyML and federated learning on resource-limited IoT edge devices with a focus on memory, communication cost, accuracy, and energy. Bhushan et al. [24] showed that practical TinyML deployment still depends strongly on quantization, memory footprint, latency, and measured energy on low-power edge platforms. These recent studies support the same direction as the CLRO, but they do not jointly connect layer-wise compression, activation lifetime scheduling, and DVFS control inside one MCU inference pipeline.

No prior work incorporates DVFS as a jointly optimizable dimension inside the compression and scheduling loop. The CLRO closes this gap. MPAD assigns bit widths and pruning ratios from per-layer calibration-set sensitivity. The ALTS schedules the compressed graph to minimize peak live SRAM. DVFS-RL then builds its state vector from the MAC utilization profile that MPAD and the ALTS together produce, so the power policy is tied to the specific compression and memory structure of the deployed model, a feedback path that no prior work provides. The main difference between the CLRO and prior TinyML co-design methods is the way the three decisions are linked. Most earlier methods search for a small model first, then apply memory scheduling or runtime power control as a later step. In the CLRO, the output of each stage becomes the direct input to the next stage. The compressed graph from MPAD changes tensor sizes and layer cost. The ALTS uses this changed graph to reduce peak live SRAM. DVFS-RL then uses the scheduled layer profile to learn a voltage–frequency policy. This makes the optimization cross-layer, not merely sequential. The novelty is in the data flow between the stages and in the feedback path that can tighten the compression target when the energy target is not met.

3. Problem Statement

IoT sensor nodes built around Cortex-M-class microcontrollers face three hard resource walls at the same time: flash capacity in the low hundreds of kilobytes, an SRAM that often sits below 512 kB, and a power budget that is set by the discharge curve of a coin cell or a small energy harvester [1,16]. Running a neural network on such a device is not a single optimization problem; it is at least three coupled sub-problems that interact in ways that make independent solutions sub-optimal.

The first sub-problem concerns model size and accuracy: Quantization and pruning can shrink a network to fit in flash, but the specific choice of bit width and sparsity per layer changes which activation tensors are large and which are small, and therefore how much SRAM the inference pass actually needs. A uniform INT8 policy ignores this coupling; a model that fits in flash may still overflow the SRAM at the layer with the widest feature map [13,14].

The second sub-problem concerns the peak activation memory: Even after a model is compressed, the order in which its layers execute determines how many activation tensors must coexist in the SRAM at any one moment. Reordering can reduce this peak significantly [14], but the optimal execution schedule depends on the shape of the compressed graph, so memory scheduling must be solved after compression, not independently.

The third sub-problem concerns runtime energy: Dynamic power on a Cortex-M7 scales with the square of the supply voltage and linearly with frequency. A processor running at full 480 MHz clock to meet an inference deadline wastes energy whenever the layer being executed is memory-bound rather than compute-bound. DVFS can recover this waste, but only if the control policy knows the MAC utilization profile of the compressed, scheduled model—information that is unavailable unless compression and scheduling have already been resolved [12,15].
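A toy model helps show why a fixed 480 MHz clock wastes energy on memory-bound layers. The sketch below is an illustration under stated assumptions, not the measured power model of the target: dynamic power is taken as C_eff · V² · f, the core is assumed to keep drawing that power while it waits on memory, and a memory-bound layer cannot finish faster than a fixed `memory_time_s`. The operating points, the `c_eff` constant, and the layer figures are hypothetical.

```python
def layer_energy_joules(cycles, memory_time_s, voltage_v, freq_hz, c_eff=1e-9):
    """Return (dynamic energy, elapsed time) for one layer under P = C_eff * V^2 * f."""
    compute_time_s = cycles / freq_hz
    # A memory-bound layer is limited by memory traffic, not by the core clock.
    elapsed_s = max(compute_time_s, memory_time_s)
    power_w = c_eff * voltage_v ** 2 * freq_hz
    return power_w * elapsed_s, elapsed_s


# Hypothetical operating points as (volts, hertz); not taken from a datasheet.
FAST = (1.2, 480e6)
SLOW = (1.0, 240e6)

# Hypothetical memory-bound layer: 240k compute cycles, 1 ms of memory time.
e_fast, t_fast = layer_energy_joules(240_000, 1e-3, *FAST)
e_slow, t_slow = layer_energy_joules(240_000, 1e-3, *SLOW)

print(f"fast: {e_fast * 1e6:.1f} uJ in {t_fast * 1e3:.2f} ms")
print(f"slow: {e_slow * 1e6:.1f} uJ in {t_slow * 1e3:.2f} ms")
```

In this toy setting both operating points finish in the same time, because memory sets the pace, yet the slower point spends less dynamic energy. A controller can only exploit this if it knows which layers are memory-bound, which is the information the MAC utilization profile carries.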

Formally, let $M$ denote a neural network with $L$ layers, let $b = \{b_l\}$ and $p = \{p_l\}$ be the per-layer bit widths and pruning ratios, let $\pi$ be an execution permutation of the layers, and let $a = \{a_k\}$ be the sequence of DVFS actions, with $a_k$ taken at inference step $k$. The joint objective is as follows:

$$\min_{b,\, p,\, \pi,\, a} E_{\mathrm{inf}}(b, p, \pi, a) \tag{1}$$

subject to

$$\mathrm{Acc}(M; b, p) \geq \mathrm{Acc}_{\min}, \tag{2}$$

$$\max_{t} \Phi(\pi, t) \leq M_{\max}, \tag{3}$$

$$\sum_{l=1}^{L} \left[ S_l(b, p) + Q_l(b_l) \right] \leq F_{\max}, \tag{4}$$

$$t_{\mathrm{inf}}(\pi, a) \leq t_{\mathrm{ddl}}, \tag{5}$$

where $E_{\mathrm{inf}}$ is the per-inference energy consumption; $\mathrm{Acc}(M; b, p)$ is the task accuracy of model $M$ under bit widths $b$ and pruning ratios $p$; and $\mathrm{Acc}_{\min}$ is the minimum acceptable accuracy threshold. $\Phi(\pi, t)$ is the live SRAM footprint at execution step $t$; $S_l(b, p)$ is the weight storage cost of layer $l$; $Q_l(b_l)$ is the per-layer quantization parameter overhead, specifically, the per-tensor scale factor and zero-point offset each stored as a 32-bit value, whose size depends on the assigned bit width $b_l$; and $t_{\mathrm{inf}}$ is the total inference latency. $M_{\max} = 512$ kB and $F_{\max} = 2$ MB are the hardware limits of the target STM32H743ZI, and $t_{\mathrm{ddl}}$ is the application deadline (50 ms for keyword spotting in this work).
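As a minimal sketch of how constraints (2) to (5) act on a candidate configuration, the check below treats the four constraints as a feasibility filter and the energy as the quantity to minimize. The `Candidate` type, the function names, and the `acc_min` value are hypothetical. The first candidate pairs the deployment figures reported in this paper with the reported keyword-spotting accuracy; the other two are invented purely to exercise the constraints.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float       # Acc(M; b, p), as a fraction
    peak_sram_kb: float   # max over t of Phi(pi, t)
    flash_kb: float       # sum over layers of S_l + Q_l
    latency_ms: float     # t_inf
    energy_uj: float      # E_inf, the objective


def is_feasible(c, acc_min, m_max_kb=512.0, f_max_kb=2048.0, t_ddl_ms=50.0):
    """Check constraints (2)-(5); default limits follow the STM32H743ZI values in the text."""
    return (c.accuracy >= acc_min
            and c.peak_sram_kb <= m_max_kb
            and c.flash_kb <= f_max_kb
            and c.latency_ms <= t_ddl_ms)


def best_feasible(candidates, acc_min):
    """Return the lowest-energy candidate that meets every constraint, or None."""
    feasible = [c for c in candidates if is_feasible(c, acc_min)]
    return min(feasible, key=lambda c: c.energy_uj) if feasible else None


candidates = [
    Candidate("clro", 0.954, 174.0, 198.0, 38.0, 387.0),
    # Hypothetical: lower energy, but misses the 50 ms deadline.
    Candidate("lower_energy_but_late", 0.954, 174.0, 198.0, 62.0, 350.0),
    # Hypothetical: feasible, but far more expensive in energy.
    Candidate("fits_but_costly", 0.955, 300.0, 400.0, 45.0, 900.0),
]
winner = best_feasible(candidates, acc_min=0.90)  # acc_min is an assumed threshold
print(winner.name)
```

Enumerating candidates this way is only practical for a handful of configurations; the joint search space is far too large for that.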

Problem (1)–(5) is NP-hard in general because the execution-order search alone is NP-complete [14] and the joint compression–scheduling–DVFS space is exponential. Existing works attack each sub-problem in isolation, leaving cross-layer coupling unexploited.

The CLRO breaks this joint problem into three sequential stages, where each stage takes the output of the one before it. MPAD handles constraints (2) and (4): it assigns per-layer bit widths $b_l$ and pruning ratios $p_l$ using calibration-set sensitivity scores so the compressed model fits in flash while staying above $\mathrm{Acc}_{\min}$. The ALTS then takes the compressed graph and solves for an execution order $\pi^{*}$ that satisfies constraint (3) via a greedy depth-first search over the space of valid layer permutations. DVFS-RL uses the MAC utilization profile of the compressed, scheduled model to train a tabular Q-learning agent offline, and the learned voltage–frequency policy minimizes $E_{\mathrm{inf}}$ subject to the deadline in constraint (5) at runtime.

4. Dataset and Preprocessing

This work uses the MLPerf Tiny benchmark suite [16] as the primary evaluation dataset (see Figure 1). It was released by MLCommons and is openly available under the Apache 2.0 license [16]. The suite covers four tasks that are representative of real-world ultra-low-power IoT workloads: image classification on CIFAR-10, keyword spotting (KWS) with the Google Speech Commands corpus [25], visual wake words (VWW) derived from MS-COCO, and anomaly detection using the ToyADMOS/MIMII industrial sound dataset [26]. Each task was picked because it stresses a different part of the memory–compute–energy trade-off space that MCUs face in practice.

For the image classification task, each $32 \times 32$ RGB image from CIFAR-10 is normalized channel-wise. Given a raw pixel value $x_{c,i}$ in channel $c$, the normalized value is

$$\hat{x}_{c,i} = \frac{x_{c,i} - \mu_c}{\sigma_c + \epsilon}, \tag{6}$$

where $\mu_c$ and $\sigma_c$ are the per-channel mean and standard deviation computed over the full training split, and $\epsilon = 10^{-7}$ prevents division by zero. This keeps the activation range bounded, which matters when the model is later quantized to eight-bit integers for MCU execution.
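The channel-wise normalization of Equation (6) can be sketched in a few lines of plain Python. The function names are hypothetical, each channel is represented as a flat list of pixel values, and the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import math


def channel_stats(pixels):
    """Mean and population standard deviation of one channel's pixel values."""
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return mean, math.sqrt(variance)


def normalize_channel(pixels, mean, std, eps=1e-7):
    """Apply Eq. (6): subtract the channel mean and divide by (std + eps)."""
    return [(p - mean) / (std + eps) for p in pixels]


# Tiny stand-in for one channel of the training split.
train_channel = [0.0, 255.0]
mu, sigma = channel_stats(train_channel)
normalized = normalize_channel(train_channel, mu, sigma)

# A constant channel has zero deviation; eps keeps the division well defined.
flat_mu, flat_sigma = channel_stats([5.0, 5.0, 5.0])
flat = normalize_channel([5.0, 5.0, 5.0], flat_mu, flat_sigma)
```

The statistics are computed once on the training split and reused unchanged at inference time, so on the device the operation reduces to one subtraction and one multiplication by a precomputed reciprocal per pixel.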

The KWS task takes raw 16 kHz mono audio and converts it to a log-Mel spectrogram. For a short-time Fourier transform (STFT) frame of length $N$ with a Mel filterbank of $M$ filters, the $m$-th Mel energy at time frame $t$ is

$$E_m(t) = \log\left( \sum_{k=0}^{N/2} |X(t, k)|^2 H_m(k) + \delta \right), \tag{7}$$

where $X(t, k)$ is the STFT coefficient at frame $t$ and bin $k$, $H_m(k)$ is the $m$-th triangular Mel filter response, and $\delta = 10^{-6}$ is a floor term that stabilizes the logarithm. The output is a $49 \times 40$ feature map fed to the DS-CNN reference model. This formulation follows the feature extraction used in [25].
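A single entry of the log-Mel feature map in Equation (7) can be sketched as follows. This is a simplified illustration: it starts from an already computed power spectrum, and the triangular filter is defined directly over FFT bin indices with hypothetical edges rather than derived from the Mel scale.

```python
import math


def triangular_filter(num_bins, left, center, right):
    """Triangular response H_m(k) over bin indices, rising to 1.0 at `center`."""
    weights = []
    for k in range(num_bins):
        if left < k <= center:
            weights.append((k - left) / (center - left))
        elif center < k < right:
            weights.append((right - k) / (right - center))
        else:
            weights.append(0.0)
    return weights


def log_mel_energy(power_spectrum, mel_filter, delta=1e-6):
    """Eq. (7): log of the filter-weighted power sum, floored by delta."""
    band_power = sum(p * h for p, h in zip(power_spectrum, mel_filter))
    return math.log(band_power + delta)


h = triangular_filter(num_bins=8, left=1, center=3, right=5)
flat_frame = [1.0] * 8      # |X(t, k)|^2 for a spectrally flat frame
silent_frame = [0.0] * 8    # digital silence

e_flat = log_mel_energy(flat_frame, h)
e_silent = log_mel_energy(silent_frame, h)   # equals log(delta), not -infinity
```

The silent-frame case shows the role of $\delta$: without the floor, the logarithm of zero band power would be undefined.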

For anomaly detection, each audio clip from ToyADMOS is converted to a log-power spectrum. Let $S \in \mathbb{R}^{T \times F}$ be the raw power spectrogram. The input feature vector $f$ is obtained by

$$f = \mathrm{vec}(\log(S + \delta)) \cdot W_{\mathrm{pool}}, \tag{8}$$

where $\mathrm{vec}(\cdot)$ flattens the matrix into a row vector and $W_{\mathrm{pool}} \in \mathbb{R}^{TF \times d}$ is a learned average-pooling projection that reduces the dimension to $d = 128$ before the autoencoder input layer.

Across all four tasks, the final preprocessing step is INT8 post-training quantization (PTQ). A floating-point tensor $z$ is mapped to an eight-bit integer $z_q$ as

$$z_q = \mathrm{clip}\left( \left\lfloor \frac{z}{s} \right\rceil + z_p,\; -128,\; 127 \right), \tag{9}$$

where $s$ is the per-tensor scale factor derived from the observed activation range on a small calibration set, $z_p$ is the zero-point offset, $\lfloor \cdot \rceil$ denotes rounding to the nearest integer, and $\mathrm{clip}(\cdot)$ saturates values outside the representable range [5]. This step cuts memory and multiply–accumulate (MAC) cost by roughly 4× compared with FP32, which is a hard requirement for MCUs with 256 kB to 512 kB of SRAM.
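Equation (9) maps directly to code. The sketch below is a minimal scalar version with hypothetical function names. Two choices are assumptions rather than details from the text: ties are rounded away from zero, and `affine_params` derives the scale and zero point from an observed range using a common min/max convention that keeps 0.0 exactly representable.

```python
import math


def round_half_away(x):
    """Round to the nearest integer, with ties going away from zero (assumed tie rule)."""
    magnitude = int(math.floor(abs(x) + 0.5))
    return magnitude if x >= 0 else -magnitude


def quantize(z, scale, zero_point, qmin=-128, qmax=127):
    """Eq. (9): scale, round, shift by the zero point, then saturate to INT8."""
    q = round_half_away(z / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Approximate inverse of quantize, used to inspect the rounding error."""
    return scale * (q - zero_point)


def affine_params(z_min, z_max, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from a calibration range (assumed convention)."""
    z_min, z_max = min(z_min, 0.0), max(z_max, 0.0)   # range must contain 0.0
    scale = (z_max - z_min) / (qmax - qmin)
    if scale == 0.0:                                   # degenerate all-zero tensor
        scale = 1.0
    zero_point = round_half_away(qmin - z_min / scale)
    return scale, zero_point


# Hypothetical activation range observed on a calibration set.
s, zp = affine_params(0.0, 2.55)
q_one = quantize(1.0, s, zp)
q_big = quantize(10.0, s, zp)     # outside the calibrated range, saturates at 127
```

Values inside the calibrated range round-trip with an error of at most half a quantization step, while values outside it saturate, which is why the calibration set needs to cover the activation range seen in deployment.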

5. Proposed Cross-Layer Resource Optimization Framework

Most TinyML deployment work treats model compression, memory scheduling, and power management as separate problems solved one after the other. This split approach leaves a large gap: a model that is lean in parameter count can still overflow the SRAM if its intermediate activation tensors are not tiled carefully, and even a perfectly tiled model can drain a battery fast if the MCU clock and voltage are not tuned to match the workload in real time. Our proposed framework, called the Cross-Layer Resource Optimizer (CLRO), closes that gap by coupling three optimization levels—model, memory, and power—into a single joint feedback loop.

The CLRO framework is described in the same order in which it is executed on the deployment pipeline. MPAD first compresses the model while protecting sensitive layers. The ALTS then uses the compressed graph to reduce the peak live activation memory. DVFS-RL finally uses the scheduled layer profile to select the voltage and frequency state during inference. This order is important because each stage depends on the output of the stage before it; therefore, the model, memory, and power decisions are not treated as separate steps.

The three CLRO stages are not independent blocks: MPAD first produces a compressed graph. This graph contains the remaining channels, pruning masks, assigned bit widths, and updated tensor shapes. The ALTS uses this graph to compute activation lifetimes and to choose an execution order with low peak live SRAM. After scheduling, each layer has a known execution position, tensor size, working buffer size, and MAC count. DVFS-RL uses this scheduled layer profile to build its state representation. Therefore, the power controller does not learn from the original model; it learns from the final compressed and scheduled model that will actually run on the MCU.

$$G_c = \{L_c, W_c, b, p, S_c\} \tag{10}$$

$$s_k = [\bar{U}_k, \Delta_k, \hat{E}_k] \tag{11}$$

$$\bar{U}_k = \frac{\mathrm{MAC}_k}{f_k t_k} \tag{12}$$

Here, $G_c$ is the compressed graph, $L_c$ is the set of remaining layers or channels, $W_c$ is the compressed weight set, $b$ is the bit width vector, $p$ is the pruning ratio vector, and $S_c$ is the updated tensor-size set. The state $s_k$ contains mean MAC utilization, deadline slack, and the remaining energy budget. This mapping makes the DVFS decision dependent on the real workload after MPAD and the ALTS.
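A tabular agent needs the continuous state of Equation (11) mapped to a finite index. The sketch below computes the utilization of Equation (12) and then buckets the three state components. The bin counts, value ranges, four-action set, and 16-bit entries are all hypothetical; they are chosen only to show one layout that happens to total 2 kB and are not taken from the paper.

```python
def mac_utilization(mac_count, freq_hz, time_s):
    """Eq. (12): MAC operations per clock cycle during step k."""
    return mac_count / (freq_hz * time_s)


def bucket(value, low, high, bins):
    """Map a value in [low, high] to a bin index 0..bins-1, clamping outliers."""
    if value <= low:
        return 0
    if value >= high:
        return bins - 1
    return int((value - low) / (high - low) * bins)


# Hypothetical discretization: 8 x 8 x 4 = 256 states.
U_BINS, SLACK_BINS, ENERGY_BINS = 8, 8, 4
NUM_ACTIONS, BYTES_PER_ENTRY = 4, 2
TABLE_BYTES = U_BINS * SLACK_BINS * ENERGY_BINS * NUM_ACTIONS * BYTES_PER_ENTRY


def state_index(utilization, slack_ms, energy_fraction):
    """Flatten the bucketed state [U, slack, energy] into one table row index."""
    u_bin = bucket(utilization, 0.0, 1.0, U_BINS)
    slack_bin = bucket(slack_ms, 0.0, 50.0, SLACK_BINS)
    energy_bin = bucket(energy_fraction, 0.0, 1.0, ENERGY_BINS)
    return (u_bin * SLACK_BINS + slack_bin) * ENERGY_BINS + energy_bin


# Hypothetical layer: 1.2 M MACs executed in 10 ms at 240 MHz.
u = mac_utilization(1_200_000, 240e6, 0.010)
row = state_index(u, slack_ms=12.0, energy_fraction=0.8)
```

Because the MAC counts and layer times come from the compressed, scheduled graph, the rows visited during inference are fixed by the outputs of the first two stages.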

5.1. Layer 1—Mixed-Precision Aware Pruning and Distillation (MPAD)

The first layer reduces the model before it ever reaches the MCU. A plain magnitude-based pruning criterion treats every layer the same, but not every layer contributes equally to the final error. MPAD assigns a sensitivity score $\rho_l$ to layer $l$ based on how much the task loss changes when that layer is fully zeroed out:

$$\rho_l = \frac{1}{|D_{\mathrm{cal}}|} \sum_{(x, y) \in D_{\mathrm{cal}}} \left[ \mathcal{L}(f_{\setminus l}(x), y) - \mathcal{L}(f(x), y) \right], \tag{13}$$

where $f_{\setminus l}$ is the network with all weights in layer $l$ set to zero, $D_{\mathrm{cal}}$ is a small calibration subset (512 samples in our setup), and $\mathcal{L}$ is the task-specific loss. A layer with high $\rho_l$ gets a lower pruning ratio $p_l$, while a layer with low $\rho_l$ is aggressively pruned. The per-layer pruning ratio is set as

$$p_l = p_{\mathrm{base}} \cdot \exp\left( -\alpha \frac{\rho_l}{\max_j \rho_j} \right), \tag{14}$$

where $p_{\mathrm{base}} \in [0, 1]$ is a global sparsity target, $\alpha > 0$ is a sharpness parameter (set to 2.5 in our experiments), and $j \in \{1, \dots, L\}$ indexes all layers so that $\max_j \rho_j$ is the highest sensitivity score across the network. Layers scoring near the maximum sensitivity receive $p_l \approx p_{\mathrm{base}} \cdot e^{-\alpha}$, which keeps most of their weights intact. After structured pruning, the surviving model is quantized with a mixed-precision scheme: weights in high-sensitivity layers are kept at 8-bit, while low-sensitivity layers are pushed to 4-bit. The bit width assignment $b_l$ follows

$$b_l = \begin{cases} 8 & \text{if } \rho_l \geq \tau_\rho, \\ 4 & \text{otherwise}, \end{cases} \tag{15}$$

where $\tau_\rho$ is a threshold set to the median sensitivity across all layers. A knowledge distillation loss from a full-precision teacher then compensates for the accuracy drop [6]:

$$\mathcal{L}_{\mathrm{KD}} = (1 - \lambda)\, \mathcal{L}_{\mathrm{CE}}(y, \hat{y}_s) + \lambda T^2\, \mathrm{KL}\left( \sigma\left(\frac{z_t}{T}\right) \,\Big\|\, \sigma\left(\frac{z_s}{T}\right) \right), \tag{16}$$

where $z_t$ and $z_s$ are the logits of the teacher and student networks, respectively; $\sigma(\cdot)$ is the softmax function; $T$ is the distillation temperature (set to 4); $\lambda = 0.6$ balances hard-label and soft-label objectives; $\mathcal{L}_{\mathrm{CE}}(y, \hat{y}_s)$ is the standard cross-entropy loss between the ground-truth label $y$ and the student prediction $\hat{y}_s$; and $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback–Leibler divergence between the softened teacher and student output distributions.
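The distillation loss of Equation (16) can be written out for a single sample in plain Python. This is a minimal sketch with hypothetical function names; it uses the stated values $T = 4$ and $\lambda = 0.6$ as defaults, and it assumes the hard-label cross-entropy is evaluated on the student's unsoftened output (temperature 1), which is the usual convention but is not spelled out in the text.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, lam=0.6):
    """Eq. (16): weighted sum of hard-label CE and softened teacher-student KL."""
    # Hard-label term: cross-entropy of the student at temperature 1 (assumed).
    cross_entropy = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return (1.0 - lam) * cross_entropy + lam * temperature ** 2 * kl


# Hypothetical two-class logits.
matched = kd_loss([2.0, 0.0], [2.0, 0.0], label=0)      # student equals teacher
mismatched = kd_loss([2.0, 0.0], [0.0, 2.0], label=0)   # student disagrees
```

When the student reproduces the teacher's logits, the KL term vanishes and only the scaled cross-entropy remains. The $T^2$ factor keeps the soft-label term's gradient magnitude comparable to the hard-label term as the temperature grows.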

5.2. Layer 2—Activation Lifetime-Aware Tensor Scheduler (ALTS)

After MPAD, the compressed model is passed to the ALTS. The job of the ALTS is to tile and schedule every activation tensor so that live memory at any one inference step never exceeds the MCU’s physical SRAM limit $M_{\max}$. Each layer $l$ produces an output tensor of size $S_l$ bytes and keeps it alive until all consumer layers have read it. The live memory at step $t$ under the execution order $\pi$ is

$$\Phi(\pi, t) = \sum_{l :\, \mathrm{alive}(l, \pi, t)} S_l + S^{\mathrm{buf}}_{\pi(t)}, \tag{17}$$

where $\mathrm{alive}(l, \pi, t)$ is true when the output of layer $l$ is still needed at step $t$ under execution order $\pi$, and $S^{\mathrm{buf}}_{\pi(t)}$ is the working buffer for the currently running layer. The ALTS searches for an execution order $\pi^{*}$ that minimizes the worst-case peak:

$$\pi^{*} = \arg\min_{\pi \in \Pi} \max_{t} \Phi(\pi, t) \quad \text{s.t.} \quad \max_{t} \Phi(\pi, t) \leq M_{\max}. \tag{18}$$

The search space $\Pi$ is the set of all topologically valid layer orderings. For large networks, this search is NP-hard, so the ALTS uses a greedy depth-first heuristic guided by a tensor reuse score. Tensors are also split into tiles of size $B_{\mathrm{tile}}$ bytes when a single activation does not fit. The tile count for layer $l$ is

$$n_l = \left\lceil \frac{S_l}{B_{\mathrm{tile}}} \right\rceil, \tag{19}$$

where $\lceil \cdot \rceil$ denotes the ceiling function, so $n_l$ is the smallest integer number of tiles that covers the full tensor. Each tile is written to flash or re-computed on demand if the SRAM budget is exceeded. This avoids the need for external DRAM, which is not present on most bare-metal MCUs.
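The effect of execution order on peak live memory can be seen on a small branching graph. The sketch below evaluates Equation (17) for a given order, applies a simple greedy rule, and computes the tile count of Equation (19). The greedy rule (run whichever ready node leaves the least live memory) is a simplified stand-in for the depth-first heuristic and its tensor reuse score, whose details are not given here. The graph, the sizes, and all function names are hypothetical.

```python
def peak_live_bytes(order, inputs, out_bytes, buf_bytes=None):
    """Return the maximum over steps of Phi(order, t) as defined in Eq. (17)."""
    buf_bytes = buf_bytes or {}
    step = {node: t for t, node in enumerate(order)}
    last_use = dict(step)   # an output lives at least through its own step
    for node in order:
        for src in inputs[node]:
            if step[src] >= step[node]:
                raise ValueError("order is not topologically valid")
            last_use[src] = max(last_use[src], step[node])
    peak = 0
    for t, node in enumerate(order):
        live = sum(out_bytes[n] for n in order[:t + 1] if last_use[n] >= t)
        peak = max(peak, live + buf_bytes.get(node, 0))
    return peak


def greedy_order(inputs, out_bytes):
    """Greedy topological order: pick the ready node that leaves the least live memory."""
    consumers = {n: [] for n in inputs}
    for node, sources in inputs.items():
        for src in sources:
            consumers[src].append(node)
    done, order = set(), []

    def live_after(candidate):
        finished = done | {candidate}
        return sum(out_bytes[m] for m in finished
                   if m == candidate or any(c not in finished for c in consumers[m]))

    while len(order) < len(inputs):
        ready = [n for n in inputs
                 if n not in done and all(s in done for s in inputs[n])]
        best = min(ready, key=live_after)
        done.add(best)
        order.append(best)
    return order


def tile_count(tensor_bytes, tile_bytes):
    """Eq. (19): smallest number of tiles covering the tensor (integer ceiling)."""
    return -(-tensor_bytes // tile_bytes)


# Two branches with wide intermediate tensors that join at the output.
inputs = {"in": [], "A": ["in"], "B": ["in"],
          "A2": ["A"], "B2": ["B"], "out": ["A2", "B2"]}
out_bytes = {"in": 20, "A": 100, "B": 100, "A2": 10, "B2": 10, "out": 10}

breadth_first = ["in", "A", "B", "A2", "B2", "out"]
chosen = greedy_order(inputs, out_bytes)
peak_breadth = peak_live_bytes(breadth_first, inputs, out_bytes)
peak_chosen = peak_live_bytes(chosen, inputs, out_bytes)
```

In this example, finishing one branch before starting the other avoids holding both wide tensors at once, so the peak drops from 220 to 130 bytes without any change to the weights.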

5.3. Layer 3—Dynamic Voltage and Frequency Scaling via Reinforcement Learning (DVFS-RL)

The third layer operates at runtime. It models the MCU’s voltage–frequency operating points as a finite Markov Decision Process (MDP). At each inference step $k$, the agent observes a state $s_k = [\bar{U}_k, \Delta_k, \hat{E}_k]$, where $\bar{U}_k$ is the mean MAC utilization over the last 10 cycles, $\Delta_k$ is the deadline slack in milliseconds, and $\hat{E}_k$ is the remaining energy budget from an on-chip coulomb counter. The agent picks a voltage–frequency pair $(V_k, f_k)$ from a discrete action set $\mathcal{A}$ and receives a scalar reward

$$r_k = -\beta_1 E_k^{\mathrm{dyn}} - \beta_2 \max\left(0,\; t_k^{\mathrm{inf}} - t_k^{\mathrm{ddl}}\right) + \beta_3 \mathbb{1}[\text{task correct}], \tag{20}$$

where

E_{k}^{dyn} = C_{eff} V_{k}^{2} f_{k}

is the dynamic energy for that inference,

t_{k}^{\inf}

is the measured inference time,

t_{k}^{ddl}

is the per-task deadline, and

β_{1}, β_{2}, β_{3}

are weighting coefficients set to

10^{- 3}

, 5.0, and 1.0, respectively. The agent is trained offline with a tabular Q-learning update rule [27]:

Q (s_{k}, a_{k}) \leftarrow Q (s_{k}, a_{k}) + η [r_{k} + γ max_{a^{'}} Q (s_{k + 1}, a^{'}) - Q (s_{k}, a_{k})],

(21)

where

η = 0.1

is the learning rate and

γ = 0.95

is the discount factor. Once converged, the Q-table is stored as a 2 kB lookup table in flash, so runtime overhead on the MCU is just a single table read per inference step.
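The reward in Equation (20) and the update in Equation (21) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' training code. It assumes that energy is expressed in μJ and time in ms, that the continuous state has already been discretized into a hashable index, and that actions are integer indices into the operating-point set. The function names are hypothetical.

```python
from collections import defaultdict

BETA = (1e-3, 5.0, 1.0)   # beta_1, beta_2, beta_3 from the text
ETA, GAMMA = 0.1, 0.95    # learning rate and discount factor from the text

def reward(e_dyn_uj, t_inf_ms, t_ddl_ms, correct):
    """Eq. (20): energy penalty, deadline-miss penalty, correctness bonus."""
    b1, b2, b3 = BETA
    miss = max(0.0, t_inf_ms - t_ddl_ms)
    return -b1 * e_dyn_uj - b2 * miss + b3 * float(correct)

def q_update(q, s, a, r, s_next, n_actions):
    """Eq. (21): one tabular Q-learning step on the table `q`."""
    best_next = max(q[(s_next, a2)] for a2 in range(n_actions))
    q[(s, a)] += ETA * (r + GAMMA * best_next - q[(s, a)])

def greedy_table(q, states, n_actions):
    """Freeze the learned values into a state -> action lookup table."""
    return {s: max(range(n_actions), key=lambda a: q[(s, a)]) for s in states}

q = defaultdict(float)                      # Q(s, a) initialized to 0
r = reward(387.0, 38.0, 50.0, True)         # deadline met, task correct
q_update(q, s=0, a=1, r=r, s_next=1, n_actions=4)
print(round(r, 3), round(q[(0, 1)], 4))     # 0.613 0.0613
print(greedy_table(q, [0, 1], 4))           # {0: 1, 1: 0}
```

Only the output of `greedy_table` would need to be stored on the device, which matches the single table read per inference step described above.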

5.4. Implementation Settings for Reproducibility

All CLRO parameters used in the experiments are fixed before testing. The calibration set contains 512 samples per task and is not used for final testing. For classification tasks, the samples are class-balanced. For anomaly detection, the calibration set is taken from the normal training split. The sensitivity score of each layer is computed by zeroing that layer once and measuring the change in task loss on the calibration set. Structured pruning is then applied by removing output channels with the lowest filter norm inside each layer. Fully connected layers are pruned by removing hidden units with the lowest weight norm.
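The channel-removal step described above can be sketched as follows. This is a simplified, framework-free example and not the authors' code: each output channel is represented as a flat list of weights, the filter norm is assumed to be the L2 norm, and the number of removed channels is assumed to be the floor of the channel count times the pruning ratio.

```python
import math

def prune_output_channels(filters, ratio):
    """Drop the `ratio` fraction of output channels with the lowest L2 norm.

    `filters` is a list of output channels, each a flat list of weights.
    Returns the kept channels and their original indices (in order).
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    n_remove = int(len(filters) * ratio)           # assumption: round down
    norms = [math.sqrt(sum(w * w for w in f)) for f in filters]
    by_norm = sorted(range(len(filters)), key=lambda i: norms[i])
    removed = set(by_norm[:n_remove])
    kept = [i for i in range(len(filters)) if i not in removed]
    return [filters[i] for i in kept], kept

# Hypothetical layer with four output channels of two weights each.
layer = [[3.0, 4.0], [0.1, 0.1], [1.0, 0.0], [0.0, 2.0]]
pruned, kept_idx = prune_output_channels(layer, ratio=0.5)
print(kept_idx)   # [0, 3]
```

The kept indices are returned because the next layer's input channels must be sliced with the same indices to keep the graph consistent.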

The bit width search space is kept small to match Cortex-M deployment. We use 8-bit weights for high-sensitivity layers and packed 4-bit weights for low-sensitivity layers. Activations remain 8-bit because the target CMSIS-NN kernels are optimized for INT8 activation flow. The pruning base target is set to 0.5, the sensitivity sharpness factor is set to 2.5, and the bit width threshold is the median sensitivity score across all layers. The distillation temperature is 4, and the distillation weight is 0.6.
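The median-threshold rule for bit widths, together with 4-bit packing, can be sketched as below. The sensitivity values and layer names are hypothetical. Treating a layer exactly at the median as low-sensitivity, and storing the first value of each pair in the low nibble, are assumptions of this sketch rather than details taken from the CLRO or CMSIS-NN.

```python
from statistics import median

def assign_bit_widths(sensitivity):
    """8-bit weights above the median sensitivity, packed 4-bit otherwise."""
    threshold = median(sensitivity.values())
    return {name: 8 if s > threshold else 4 for name, s in sensitivity.items()}

def pack_int4(values):
    """Pack signed 4-bit integers (range -8..7) two per byte."""
    if not all(-8 <= v <= 7 for v in values):
        raise ValueError("values must fit in signed 4 bits")
    vals = list(values)
    if len(vals) % 2:
        vals.append(0)                       # pad to an even count
    packed = bytearray()
    for lo, hi in zip(vals[0::2], vals[1::2]):
        # Assumption: first value goes into the low nibble.
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)

# Hypothetical per-layer sensitivity scores.
scores = {"conv1": 0.9, "conv2": 0.2, "conv3": 0.5, "fc": 0.1}
print(assign_bit_widths(scores))   # {'conv1': 8, 'conv2': 4, 'conv3': 8, 'fc': 4}
print(pack_int4([7, -8, 3]).hex()) # 8703
```

Packing halves the flash cost of a low-sensitivity layer relative to INT8 storage, at the price of an unpack step before the INT8 kernel runs.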

For DVFS-RL, the action set contains four measured operating points on the STM32H743ZI board. The Q-table is trained offline for 500 episodes. The learning rate is 0.1, the discount factor is 0.95, and the exploration policy is epsilon-greedy. Epsilon starts at 0.30 and decays linearly to 0.05. The reward weights are 0.001 for dynamic energy, 5.0 for deadline violation, and 1.0 for task correctness. The energy feedback check uses the mean energy over the last 50 episodes. A new MPAD pass is triggered only when this mean value exceeds the energy budget by more than 5 percent.
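A minimal sketch of the exploration schedule is given below. It assumes that epsilon decays linearly across all 500 training episodes, with episodes numbered from 1; the text states the start and end values but not the span of the decay, so that span is an assumption. The function names are hypothetical.

```python
import random

def epsilon_at(episode, n_episodes=500, eps_start=0.30, eps_end=0.05):
    """Linear decay from eps_start (episode 1) to eps_end (last episode)."""
    frac = (episode - 1) / (n_episodes - 1)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_row, eps, rng):
    """Epsilon-greedy choice over the Q-values of one state."""
    if rng.random() < eps:
        return rng.randrange(len(q_row))          # explore
    return max(range(len(q_row)), key=q_row.__getitem__)  # exploit

rng = random.Random(0)
print(epsilon_at(1))                               # 0.3
print(round(epsilon_at(500), 2))                   # 0.05
print(select_action([0.1, 0.7, 0.3, 0.2], eps=0.0, rng=rng))  # 1
```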

The three layers interact through a shared energy-error feedback signal, shown as the dashed arrow in Figure 2. If the DVFS-RL agent reports that the runtime energy consistently exceeds $E_{\mathrm{budget}}$, it triggers a re-run of the MPAD phase with a tighter global sparsity target $p_{\mathrm{base}}$. Specifically, the re-run is triggered when the mean per-inference energy over the last $N_{\mathrm{check}} = 50$ training episodes exceeds $E_{\mathrm{budget}}$ by more than 5%. Under the current experimental setup on the STM32H743ZI, with $E_{\mathrm{budget}}$ set to the MCUNet baseline of 923 μJ, the loop is not triggered because the DVFS-RL agent converges below this threshold by episode 400. The convergence behavior and energy trajectory are reported in Section 6. This closed loop means the system self-adjusts to different MCU boards and battery capacities without any manual re-tuning. The full procedure is listed in Algorithm 1. Its per-phase complexity is $O(L \cdot |D_{\mathrm{cal}}|)$ for MPAD, $O(L^{2})$ for the ALTS, and $O(|A| \cdot |S| \cdot N_{\mathrm{ep}})$ for DVFS-RL. All three phases run once offline on a host PC, leaving zero training overhead on the MCU itself.
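The re-run condition described above reduces to a comparison between a windowed mean and a scaled budget. The following sketch shows one way to express it; the function name is hypothetical, and returning `False` until a full window of episodes is available is an assumption, since the text does not say how a short history is handled.

```python
from statistics import mean

def mpad_rerun_needed(episode_energy_uj, e_budget_uj, n_check=50, margin=0.05):
    """True when the mean of the last n_check episodes exceeds the budget
    by more than `margin` (5% in the text)."""
    if len(episode_energy_uj) < n_check:
        return False   # assumption: wait for a full window
    recent = mean(episode_energy_uj[-n_check:])
    return recent > e_budget_uj * (1.0 + margin)

# Converged energy from the text against the 923 uJ budget: no re-run.
print(mpad_rerun_needed([387.0] * 50, e_budget_uj=923.0))   # False
# Hypothetical trace that stays above a 450 uJ budget plus 5%: re-run.
print(mpad_rerun_needed([480.0] * 50, e_budget_uj=450.0))   # True
```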

Algorithm 1 CLRO: Cross-Layer Resource Optimization Procedure

Require: Pre-trained FP32 model $f$, calibration set $D_{\mathrm{cal}}$, SRAM budget $M_{\max}$, energy budget $E_{\mathrm{budget}}$

Ensure: Deployed INT4/INT8 model with DVFS policy on MCU

MPAD Phase:

for each layer $l \in \{1, \dots, L\}$ do
  Compute sensitivity $\rho_{l}$ using Equation (13)
  Compute pruning ratio $p_{l}$ using Equation (14)
  Assign bit width $b_{l}$ using Equation (15)
  Apply structured pruning at ratio $p_{l}$; quantize to $b_{l}$ bits
end for
Fine-tune student with distillation loss $L_{\mathrm{KD}}$ (Equation (16))

ALTS Phase:

Build tensor lifetime graph from compressed model
Search execution order $\pi^{*}$ via greedy DFS (Equation (18))
for each layer $l$ in order $\pi^{*}$ do
  if $S_{l} > B_{\mathrm{tile}}$ then
    Split into $n_{l}$ tiles (Equation (19)); schedule each tile
  end if
  Assign SRAM slot; free slots of completed producer layers
end for
Verify $\max_{t} \Phi(\pi^{*}, t) \leq M_{\max}$

DVFS-RL Phase (offline training):

Initialize Q-table $Q(s, a) \leftarrow 0$
for episode $= 1$ to $N_{\mathrm{ep}}$ do
  for each inference step $k$ do
    Observe $s_{k}$; select $a_{k}$ via $\epsilon$-greedy policy
    Execute $(V_{k}, f_{k})$; measure $E_{k}^{\mathrm{dyn}}$, $t_{k}^{\mathrm{inf}}$
    Compute reward $r_{k}$ (Equation (20))
    Update Q-table (Equation (21))
  end for
end for
Compress Q-table to 2 kB flash lookup; flash to MCU
Return quantized, tiled model + DVFS policy table

The Q-learning run is treated as converged when the mean energy over the last 50 episodes changes by less than one percent and no deadline violation is observed in that window. In our runs, convergence usually occurs between episode 380 and episode 420. The final Q-table is then frozen and stored in flash. No Q-value update is performed during MCU inference.
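The stopping rule above can be sketched as a check over two consecutive windows. The text does not state what the last 50 episodes are compared against, so this sketch assumes the reference is the mean of the preceding 50 episodes; the function name and the example trace are hypothetical.

```python
from statistics import mean

def has_converged(energy_uj, deadline_missed, window=50, tol=0.01):
    """Stable mean energy (< tol relative change between the last two
    windows) and no deadline violation in the last window."""
    if len(energy_uj) < 2 * window:
        return False
    prev = mean(energy_uj[-2 * window:-window])
    last = mean(energy_uj[-window:])
    stable = abs(last - prev) < tol * prev
    return stable and not any(deadline_missed[-window:])

# Hypothetical trace: energy drops, then settles near 400 uJ.
energy = [900.0] * 50 + [400.0] * 50 + [398.0] * 50
missed = [False] * 150
print(has_converged(energy[:100], missed[:100]))   # False
print(has_converged(energy, missed))               # True
```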

6. Results and Discussion

The STM32H743ZI platform is selected because it is a practical high-end Cortex-M7 MCU used in embedded sensing and industrial IoT prototypes. It provides a 480 MHz peak clock, 512 kB SRAM, and 2 MB flash, so it is large enough to run all MLPerf Tiny tasks but still small enough to expose the memory and energy limits faced by MCU-class deployment. This makes it a useful test case for the CLRO because the model must fit without external DRAM, and the energy benefit must come from on-chip optimization rather than from a larger accelerator.

All experiments run on an STM32H743ZI MCU (Cortex-M7, 480 MHz, 512 kB SRAM, 2 MB Flash) under TensorFlow Lite for Microcontrollers (TFLM) [8] with CMSIS-NN acceleration. Energy is measured using a Nordic Power Profiler Kit II at a 1 kHz sample rate. Inference latency is the median over 500 back-to-back runs. Accuracy for the four MLPerf Tiny tasks [16], namely image classification (IC), keyword spotting (KWS), visual wake words (VWW), and anomaly detection (AD), is reported as top-1 accuracy, top-1 accuracy, accuracy, and area under the ROC curve (AUC), respectively. Five baselines are compared: MobileNetV2 + CMSIS-NN [9], MCUNet [10], MicroNets [11], DS-CNN (large) [28], and ProxylessNAS-MCU [29]. Our method is referred to as the CLRO throughout.

6.1. Experimental Setup and Baseline Fairness

All experiments use the same STM32H743ZI board, the same input preprocessing, and the same measurement setup. The models are trained on a host machine and then converted to integer kernels for MCU deployment. Energy and latency are measured on the board after flashing the final binary. The exact task models are listed in Table 1. Baselines are compiled with the same toolchain and the same compiler flags where source code is available. For published baselines where full training code is not available, we use the reported model structure and re-run the deployment path under the same STM32H743ZI runtime.

The training setup is kept fixed across all runs. Classification models are trained with Adam for 120 epochs using batch size 128. The initial learning rate is 0.001 and is reduced by a factor of 0.1 when validation loss stops improving for 10 epochs. The anomaly detection autoencoder is trained for 80 epochs using mean-square error loss and batch size 256. MPAD starts after the FP32 teacher model has converged. Structured pruning is applied in one pass using the sensitivity score, followed by 30 epochs of distillation fine-tuning. We also test the sensitivity of the learned controller to the main Q-learning parameters. Only one parameter is changed at a time from the default setting, while the model, schedule, DVFS action set, and measurement setup remain fixed. The results are shown in Table 2. The deployment setup used for these trained models is shown in Table 3.

The available voltage–frequency actions used by the controller are reported in Table 4. These operating points are used for both the learned DVFS policy and the heuristic DVFS baselines, so the comparison is made under the same hardware limits. The sensitivity study shows that the controller is not tied to one narrow hyperparameter choice. The default setting gives the best measured energy, but nearby values keep the same deadline behavior and remain within a small energy range. This supports the use of a small tabular controller rather than a larger policy network.

The same binary generation flow is used for the CLRO and for each baseline. The only differences are the model graph and the optimization method being tested. This keeps the runtime environment fixed, so the reported differences come from the model, memory schedule, and power policy rather than from a different compiler or measurement setup.

6.2. Accuracy and Resource Usage Across All Tasks

Table 5 lists per-task accuracy together with model size, peak SRAM usage, MACs, and single-inference energy on the target MCU. The CLRO assigns mixed-precision bit widths so that layers that carry little task-relevant information are pushed to 4-bit, which cuts model storage without forcing a large accuracy drop. The ALTS keeps peak SRAM well within the 512 kB hardware limit even for the VWW task, where activation tensors for a $96 \times 96$ input ordinarily overflow that budget.

To make the comparison fair, we include baselines that represent different TinyML deployment paths. MobileNetV2 represents a compact CNN without MCU-specific cross-layer optimization. ProxylessNAS-MCU represents architecture search for small devices. DS-CNN is included for keyword spotting because it is a common low-cost audio baseline. MCUNet and MicroNets are included because they are strong MCU-focused baselines. We also report MPAD plus the ALTS with fixed frequency. This last row is important because it separates the gain from compression and memory scheduling from the extra gain produced by DVFS-RL.

All reported accuracy values are the mean of five independent runs with different random seeds. Energy and latency are measured on the STM32H743ZI board over 1000 repeated inferences after 100 warmup runs. We report the mean value in the main comparison table. Standard deviation is reported for the final CLRO setting to show measurement stability. Flash size is deterministic after compilation, and peak SRAM is obtained from the fixed ALTS memory plan so these two values do not vary across repeated inference runs.

This comparison shows that the CLRO improves the joint resource point, not only one metric. Some baselines have good accuracy or a small MAC count, but they do not reach the same combined flash, SRAM, latency, and energy values as the full CLRO pipeline.

The CLRO reaches 91.7% top-1 accuracy on CIFAR-10 while using only 198 kB of flash storage, roughly $2.1\times$ less than MCUNet and $4.3\times$ less than MobileNetV2. The KWS result of 95.4% is 0.6 points higher than MicroNets, which was the previous best on this task under the same budget. For VWW, the CLRO reaches 89.6% with a peak SRAM of 174 kB, which is about 40% below the budget. Models whose peak SRAM exceeds 320 kB are marked with a † to indicate that they need tiling or patch-based inference. The AD task shows the largest absolute gap: the CLRO achieves an AUC of 0.913 compared to 0.875 from MicroNets, an improvement of 0.038 AUC (3.8 points on a percentage scale) that comes from the MPAD phase retaining more filter diversity in the autoencoder bottleneck. Comparing the MPAD+ALTS fixed-frequency row against the full CLRO shows that the DVFS-RL stage alone accounts for 155 μJ of the total 536 μJ energy reduction, confirming that the cross-layer coupling with the voltage–frequency controller adds a distinct benefit beyond applying compression and scheduling alone.
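The energy split quoted above can be checked with simple arithmetic. The worked lines below use only values reported in this paper: the 923 μJ MCUNet baseline, the 542 μJ MPAD+ALTS fixed-frequency point, and the 387 μJ full CLRO result. The percentages in the last two rows are shares of the total reduction.

```latex
\begin{align*}
\Delta E_{\text{total}} &= 923 - 387 = 536\ \mu\text{J},
  & \frac{536}{923} &\approx 58.1\% \text{ of the baseline},\\
\Delta E_{\text{MPAD+ALTS}} &= 923 - 542 = 381\ \mu\text{J},
  & \frac{381}{536} &\approx 71.1\% \text{ of the reduction},\\
\Delta E_{\text{DVFS-RL}} &= 542 - 387 = 155\ \mu\text{J},
  & \frac{155}{536} &\approx 28.9\% \text{ of the reduction}.
\end{align*}
```

The two partial reductions sum to the total (381 + 155 = 536 μJ), so compression and scheduling account for roughly seven tenths of the saving and the voltage–frequency policy for the rest.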

The small standard deviation shows that the reported gains are stable across training seeds and repeated hardware measurements (Table 6).

6.3. Ablation Study: Contribution of Each CLRO Layer

Table 7 breaks the total gain into three parts by adding one layer at a time. The baseline is a plain INT8-quantized MCUNet deployed without any of the three CLRO layers active. Each subsequent row adds one layer on top of the previous row.

The ablation study shows that each stage has a different role. MPAD provides the main flash reduction and also improves accuracy because sensitive layers keep more capacity. The ALTS does not change the weights, but it reduces peak SRAM by changing the execution order and tensor lifetime. DVFS-RL gives the largest runtime energy reduction because it lowers the voltage–frequency state for layers with enough deadline slack. The full pipeline gives the best result because each stage works on the output of the previous stage.

Layer 1 (MPAD) alone gives the biggest accuracy jump (+2.2 pp) because the sensitivity-guided pruning and mixed-precision assignment preserve the most task-critical filters. The ALTS layer adds a smaller accuracy gain (+0.7 pp) but contributes the most to memory efficiency, enabling the scheduler to fit the full graph in 174 kB rather than 286 kB. The row labelled “+ALTS (Layers 1–2; fixed freq.)” runs the compressed, scheduled model at a fixed maximum clock, so the 542 μJ it consumes represents the energy floor achievable without DVFS coupling. The DVFS-RL layer then cuts a further 155 μJ (28.5%) by learning which layers tolerate a lower voltage–frequency state; this gap is only visible because DVFS-RL operates on the compressed, scheduled graph rather than a model-agnostic baseline. Therefore, the three layers are complementary: accuracy mostly comes from MPAD, memory from the ALTS, and runtime energy from DVFS-RL.

6.4. Per-Task Accuracy Comparison

Figure 3 compares all methods across the four tasks. The CLRO achieves the highest score on every task. The gap is widest for AD (AUC +3.8 over MicroNets), which confirms that preserving bottleneck diversity in the MPAD phase matters more for anomaly scoring than for classification tasks where the final softmax layer can compensate for slight filter loss.

6.5. Energy–Accuracy Trade-Off

Figure 4 plots single-inference energy against KWS accuracy for all methods at a fixed SRAM budget of 320 kB. Methods constrained to 320 kB are compared on the same footing; those requiring tiling are excluded. The CLRO sits in the bottom-right corner (high accuracy, low energy), showing a Pareto-dominant position. MCUNet and MicroNets form the previous Pareto front, and the CLRO pushes that front outward by roughly 58% in energy at matched accuracy.

6.6. Live SRAM Usage During Inference

Figure 5 shows how live SRAM evolves layer by layer during VWW inference for MCUNet and the CLRO. MCUNet’s default scheduling hits a peak of 286 kB at the first inverted residual block. The CLRO’s ALTS reorders and tiles those layers so the peak drops to 174 kB, a 39% reduction. The flat segments correspond to layers that are executed in-place, reusing the buffer of their predecessor.

To test whether DVFS-RL gives a real benefit beyond simple rules, we compare it with three lightweight governors using the same MPAD and ALTS output. The fixed-high governor always uses the highest operating point. The utilization governor lowers frequency when recent MAC utilization is low. The slack governor lowers frequency when measured deadline slack is above 10 ms. These policies do not use a learned value table. They are included to show whether the Q-learning policy is doing more than manual threshold selection. The measured comparison is given in Table 8.
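The three rule-based governors can be sketched as small stateless functions over the same four operating points. Only the 10 ms slack threshold comes from the text. The utilization threshold of 0.5, the convention that index 0 is the fastest operating point, and the choice to step down one level at a time and otherwise return to the fastest point are assumptions made for this sketch.

```python
N_POINTS = 4        # four operating points; index 0 assumed to be the fastest
SLACK_MS = 10.0     # slack threshold stated in the text
LOW_UTIL = 0.5      # hypothetical utilization threshold

def fixed_high(util, slack_ms, current):
    """Always use the highest operating point."""
    return 0

def utilization_governor(util, slack_ms, current):
    """Step one level down while recent MAC utilization is low."""
    if util < LOW_UTIL:
        return min(current + 1, N_POINTS - 1)
    return 0

def slack_governor(util, slack_ms, current):
    """Step one level down while deadline slack is above the threshold."""
    if slack_ms > SLACK_MS:
        return min(current + 1, N_POINTS - 1)
    return 0

print(utilization_governor(0.3, 5.0, current=0))   # 1
print(slack_governor(0.9, 12.0, current=3))        # 3
print(slack_governor(0.9, 10.0, current=1))        # 0
```

Each rule looks at a single signal, whereas the Q-table is indexed by utilization, slack, and remaining energy together.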

The heuristic governors save energy compared with fixed-high execution, but they still use hand-set thresholds. DVFS-RL gives the lowest measured energy because the Q-table learns which voltage–frequency action is safe for each scheduled layer state. The gain is not from RL alone; it comes from giving the agent the compressed and scheduled layer profile produced by the MPAD and ALTS.

6.7. DVFS-RL Convergence and Runtime Energy Savings

Figure 6 tracks the mean per-inference energy during the offline Q-learning training phase (Algorithm 1, DVFS-RL step). The agent starts from an initial policy that uses the maximum operating point (480 MHz, 1.2 V) at every step, consuming around 923 μJ. By episode 400, the policy has converged to a stable voltage–frequency schedule that cuts energy to 387 μJ while still meeting the 50 ms per-inference deadline for KWS. The shaded band shows one standard deviation across five independent training runs, confirming low variance once the Q-table stabilizes.

6.8. Closed-Loop Feedback Validation

The closed-loop mechanism in the CLRO (Figure 2) re-runs the MPAD phase with a tighter global sparsity target $p_{\mathrm{base}}$ when the mean per-inference energy over the last $N_{\mathrm{check}} = 50$ training episodes exceeds $E_{\mathrm{budget}}$ by more than 5%. In the current experiments on the STM32H743ZI, $E_{\mathrm{budget}}$ is set to the MCUNet baseline value of 923 μJ. Because the DVFS-RL agent converges to 387 μJ by episode 400, which is 58% below the budget threshold, the re-run condition is never met and the loop fires zero times.

To verify that the loop works when needed, a second run is carried out with a tighter budget of $E_{\mathrm{budget}} = 450$ μJ. The agent reaches 462 μJ at episode 150, which is 2.7% above the 5% trigger margin, so the loop fires once. MPAD re-runs with $p_{\mathrm{base}}$ raised from 0.5 to 0.6. After the second MPAD–ALTS–DVFS-RL cycle, the system converges to 441 μJ at the cost of a 0.4 pp accuracy drop (91.3% to 90.9%), staying within the tighter budget. This confirms that the feedback path is active and converges in one additional cycle under a more aggressive energy constraint.

6.9. Discussion

The results confirm that treating model compression, memory scheduling, and power management as three tightly coupled layers rather than independent steps produces a measurably better outcome on every metric. The main reason is that decisions made in one layer affect the feasibility and cost of the other two: a more aggressively pruned model has smaller activation tensors, which gives the ALTS more scheduling freedom, which, in turn, lets DVFS-RL pick lower-frequency steps without violating deadlines.

One point worth noting is that the energy saving (58.1%) is much larger than the accuracy improvement (3.8 pp) relative to the INT8 baseline. This was expected because the baseline already uses quantization, so there is limited accuracy left to recover but a lot of dynamic energy still tied up in unnecessary high-voltage operation. The DVFS-RL agent exploits the slack left by the ALTS’s compact schedule to run many layers at a reduced operating point.

A limitation is that the offline Q-table must be re-trained when the MCU model changes. The table is small (2 kB), so re-training takes under 10 min on a laptop, but this is still an extra step compared to fixed-voltage baselines. Future work could explore a lightweight online fine-tuning mechanism [1] that adapts the policy in the field without full re-training.

The CLRO is not tied to one MCU board. The MPAD and ALTS are hardware-aware but not board-specific. For another MCU, the user only changes the flash limit, SRAM limit, supported kernels, and tile size. DVFS-RL needs a new action set if the target board has different voltage–frequency states. On a smaller MCU, the CLRO can still run by tightening the flash and SRAM constraints, but this may increase pruning and may reduce accuracy. On a board with no DVFS support, the MPAD and ALTS stages remain usable, while the DVFS-RL stage can be replaced by fixed-frequency execution or simple clock gating.

6.10. Practical Deployment Limits

The DVFS-RL policy is trained offline, not on the MCU. In our setup, the Q-table training takes less than 10 min on a laptop for one task and one MCU action set. This cost is paid once before deployment. During inference, the MCU only performs a lookup, so there is no training cost and no online search cost. Still, the learned table should not be treated as universal. If the MCU family, clock tree, voltage levels, or workload changes, the action set and Q-table must be generated again.

The current policy is reliable when the deployed workload is close to the calibration and training workload. Large changes in input distribution can shift layer utilization and deadline slack, so the stored policy may no longer be optimal. Battery level and temperature can also change timing and power behavior. The present CLRO version handles this only through the remaining-energy state and the offline feedback loop. A safer field deployment can use a fallback fixed-frequency mode when measured latency is close to the deadline, or it can re-train the Q-table during maintenance. Online adaptive DVFS is left for future work.

7. Conclusions

This paper presented the CLRO, a cross-layer resource optimization framework for deploying TinyML models on ultra-low-power IoT devices. The core idea was to treat model compression, memory scheduling, and power management as one joint problem rather than three separate steps. The MPAD layer assigns pruning ratios and bit widths per layer using a measured task sensitivity score, so filters that carry the most task-relevant information stay at a higher precision. The ALTS finds an execution order that keeps live SRAM within the physical hardware limit without needing external memory. The DVFS-RL agent then uses a pre-trained Q-table at runtime to pick the lowest voltage–frequency pair that still meets the inference deadline.

Tested on four MLPerf Tiny tasks on an STM32H743ZI MCU, the CLRO reached 91.7% top-1 on CIFAR-10, 95.4% on keyword spotting, 89.6% on visual wake words, and an AUC of 0.913 on anomaly detection. These numbers correspond to a gain of up to 3.8 points over MicroNets (on the anomaly detection AUC) and, relative to the MCUNet baseline, to 58.1% lower per-inference energy and a peak SRAM of only 174 kB. The ablation study confirmed that each CLRO layer adds a distinct and measurable benefit: MPAD drives most of the accuracy gain, the ALTS cuts the memory footprint, and DVFS-RL handles runtime energy savings.

The broader value of the CLRO is that it gives a practical way to run TinyML models on battery-powered nodes without adding external memory or a separate accelerator. This is useful for smart sensing, wearable monitoring, industrial fault detection, environmental monitoring, and always-on embedded intelligence, where data must often be processed near the sensor. In these settings, small savings in SRAM and energy can directly extend battery life and reduce maintenance cost.

Future work will focus on three directions. First, the fixed offline DVFS table can be extended to an online adaptive policy that updates when battery level, temperature, or workload changes. Second, the CLRO can be ported to smaller Cortex-M0 and Cortex-M4 boards, as well as RISC-V MCU platforms, to test how the method behaves under tighter memory limits. Third, the CLRO can be combined with neural architecture search so that the network structure, compression policy, memory schedule, and power policy are optimized together from the start.

Author Contributions

Conceptualization, H.A.A.; Methodology, H.A.A.; Software, A.G.A.; Validation, H.A.A.; Formal analysis, N.S.A.; Investigation, H.A.A.; Data curation, N.S.A.; Writing—original draft, A.G.A.; Writing—review & editing, N.S.A.; Visualization, A.G.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Deanship of Scientific Research at Northern Border University, Arar, KSA through the project number “NBU-FFR-2026-2466-06”.

Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Acknowledgments

Authors extend their appreciation to the Deanship of Scientific Research at Northern Border University, Arar, KSA, for funding this research work through the project number “NBU-FFR-2026-2466-06”.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Abadade, Y.; Temouden, A.; Bamoumen, H.; Benamar, N.; Chtouki, Y.; Hafid, A.S. A comprehensive survey on tinyml. IEEE Access 2023, 11, 96892–96922.
2. Alajlan, N.N.; Ibrahim, D.M. TinyML: Enabling of inference deep learning models on ultra-low-power IoT edge devices for AI applications. Micromachines 2022, 13, 851.
3. Capogrosso, L.; Cunico, F.; Cheng, D.S.; Fummi, F.; Cristani, M. A machine learning-oriented survey on tiny machine learning. IEEE Access 2024, 12, 23406–23426.
4. Warden, P.; Situnayake, D. Tinyml: Machine Learning with Tensorflow Lite on Arduino and Ultra-Low-Power Microcontrollers; O’Reilly Media: Sebastopol, CA, USA, 2019.
5. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 2704–2713.
6. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
7. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, PR, USA, 2–4 May 2016.
8. David, R.; Duke, J.; Jain, A.; Janapa Reddi, V.; Jeffries, N.; Li, J.; Kreeger, N.; Nappier, I.; Natraj, M.; Wang, T.; et al. Tensorflow lite micro: Embedded machine learning for tinyml systems. Proc. Mach. Learn. Syst. 2021, 3, 800–811.
9. Lai, L.; Suda, N.; Chandra, V. Cmsis-NN: Efficient neural network kernels for arm Cortex-M CPUs. arXiv 2018, arXiv:1801.06601.
10. Lin, J.; Chen, W.M.; Lin, Y.; Gan, C.; Han, S. Mcunet: Tiny deep learning on iot devices. Adv. Neural Inf. Process. Syst. 2020, 33, 11711–11722.
11. Banbury, C.; Zhou, C.; Fedorov, I.; Matas, R.; Thakker, U.; Gope, D.; Janapa Reddi, V.; Mattina, M.; Whatmough, P. Micronets: Neural network architectures for deploying tinyml applications on commodity microcontrollers. Proc. Mach. Learn. Syst. 2021, 3, 517–532.
12. Zidar, J.; Matić, T.; Aleksi, I.; Hocenski, Ž. Dynamic voltage and frequency scaling as a method for reducing energy consumption in ultra-low-power embedded systems. Electronics 2024, 13, 826.
13. Lin, J.; Chen, W.; Cai, H.; Gan, C.; Han, S. Mcunetv2: Memory-efficient patch-based inference for tiny deep learning. Adv. Neural Inf. Process. Syst. 2021, 34, 2805–2817.
14. Liberis, E.; Lane, N.D. Neural networks on microcontrollers: Saving memory at inference via operator reordering. arXiv 2019, arXiv:1910.05110.
15. Panda, P.; Tripathy, A.; Bhuyan, K.C. Reinforcement learning-based dynamic voltage and frequency scaling for energy-efficient computing. In Proceedings of the 2024 International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE); IEEE: Ballari, India, 2024; pp. 1–6.
16. Banbury, C.; Reddi, V.J.; Torelli, P.; Holleman, J.; Jeffries, N.; Kiraly, C.; Montino, P.; Kanter, D.; Ahmed, S.; Pau, D.; et al. Mlperf tiny benchmark. arXiv 2021, arXiv:2106.07597.
17. Wang, K.; Liu, Z.; Lin, Y.; Lin, J.; Han, S. Haq: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 8612–8620.
18. Han, L.; Xiao, Z.; Li, Z. Dtmm: Deploying tinyml models on extremely weak iot devices with pruning. In Proceedings of the IEEE INFOCOM 2024-IEEE Conference on Computer Communications; IEEE: Piscataway, NJ, USA, 2024; pp. 1999–2008.
19. Liberis, E.; Lane, N.D. Pex: Memory-efficient microcontroller deep learning through partial execution. arXiv 2022, arXiv:2211.17246.
20. Zhang, Z.; Zhao, Y.; Li, H.; Lin, C.; Liu, J. DVFO: Learning-based DVFS for energy-efficient edge-cloud collaborative inference. IEEE Trans. Mob. Comput. 2024, 23, 9042–9059.
21. Alvanaki, E.L.; Katsaragakis, M.; Masouros, D.; Xydis, S.; Soudris, D. Decoupled access-execute enabled dvfs for tinyml deployments on stm32 microcontrollers. In Proceedings of the 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
22. Heidari, A.; Jafari Navimipour, N.; Jabraeil Jamali, M.A.; Akbarpour, S. A green, secure, and deep intelligent method for dynamic IoT-edge-cloud offloading scenarios. Sustain. Comput. Inform. Syst. 2023, 38, 100859.
23. Ramadan, M.N.A.; Ali, M.A.H.; Khoo, S.Y.; Alkhedher, M. Federated learning and TinyML on IoT edge devices: Challenges, advances, and future directions. ICT Express 2025, 11, 754–768.
24. Bhushan, C.M.; Koppuravuri, P.; Prasanthi, N.; Gazi, F.; Hussain, M.M.; Abdussami, M.; Devi, A.A.; Faizi, J. Deploying TinyML for energy-efficient object detection and communication in low-power edge AI systems. Sci. Rep. 2025, 15, 44299.
25. Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv 2018, arXiv:1804.03209.
26. Purohit, H.; Tanabe, R.; Ichige, K.; Endo, T.; Nikaido, Y.; Suefusa, K.; Kawaguchi, Y. MIMII Dataset: Sound dataset for malfunctioning industrial machine investigation and inspection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019); DCASE Community: New York, NY, USA, 2019; pp. 209–213.
27. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2018.
28. Zhang, Y.; Suda, N.; Lai, L.; Chandra, V. Hello edge: Keyword spotting on microcontrollers. arXiv 2017, arXiv:1711.07128.
29. Cai, H.; Zhu, L.; Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv 2018, arXiv:1812.00332.

Figure 1. The MLPerf Tiny benchmark pipeline used in this study. Each task goes through task-specific preprocessing and is then quantized to INT8 before MCU deployment.

Figure 2. Proposed CLRO framework. Three optimization layers, namely model compression (MPAD), memory scheduling (ALTS), and power control (DVFS-RL), form a closed feedback loop on the MCU. The dashed arrow shows the energy-error signal fed back to Layer 1.

Figure 3. Per-task accuracy of all six methods on the MLPerf Tiny benchmark suite. IC, KWS, and VWW use top-1 accuracy (%); AD uses AUC scaled to percentage for visual alignment. Higher is better in all cases.

Figure 4. Single-inference energy (μJ) vs. KWS top-1 accuracy (%) under a 320 kB SRAM constraint. Each marker is one method; the dashed line is the previous Pareto front (MCUNet and MicroNets). Lower energy and higher accuracy are better.

Figure 5. Layer-by-layer live SRAM footprint during VWW inference on STM32H743ZI. MCUNet (dashed) peaks at 286 kB; the CLRO (solid) stays at or below 174 kB throughout. The horizontal dashed line marks the 320 kB soft SRAM limit.

Figure 6. Mean per-inference energy (μJ) vs. Q-learning training episode. Shaded band is $\pm 1$ standard deviation over five independent runs. The dashed horizontal line marks the initial random policy energy (923 μJ).

Table 1. Task models used in the experiments. All models are trained in FP32 first and then passed through the CLRO deployment flow.

Task	Model Structure	Input Size
Image classification	Three convolution blocks with batch normalization and ReLU, followed by global average pooling and one dense output layer	$32 \times 32 \times 3$
Keyword spotting	DS-CNN with one standard convolution block, four depth-wise-separable convolution blocks, global average pooling, and one dense output layer	$49 \times 40$
Visual wake words	Small MobileNet-style depth-wise-separable CNN with width multiplier 0.25 and binary output head	$96 \times 96 \times 3$
Anomaly detection	Fully connected autoencoder with encoder sizes 128, 64, 32, 8 and mirrored decoder layers	128 features

Table 2. Sensitivity of DVFS-RL to main Q-learning parameters. Only one parameter is changed at a time from the default setting.

Setting	Energy (μJ)	Latency (ms)	Stable Policy Episode
Learning rate 0.05	395	39.1	470
Learning rate 0.10	387	38.0	400
Learning rate 0.20	391	38.4	360
Discount factor 0.90	398	39.0	390
Discount factor 0.95	387	38.0	400
Discount factor 0.99	390	38.2	430
Final epsilon 0.10	394	38.8	410
Final epsilon 0.05	387	38.0	400
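
The rows of Table 2 that reproduce the final 387 μJ result (learning rate 0.10, discount factor 0.95, final epsilon 0.05) can be plugged into the standard tabular Q-learning update from Sutton and Barto. The sketch below is only an illustration of that update, not the authors' implementation: the state encoding (an integer index, for example a layer index), the reward value, and the helper names `q_update`, `choose_action`, and `greedy_policy` are assumptions made for this example. Only the three hyperparameters and the number of actions come from the paper.

```python
import random

ALPHA = 0.10       # learning rate (Table 2 row matching the 387 uJ result)
GAMMA = 0.95       # discount factor (same row set)
EPS_FINAL = 0.05   # final exploration rate (same row set)
N_ACTIONS = 4      # four DVFS operating points (Table 4)


def q_update(q, state, action, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """Apply one standard tabular Q-learning update in place and return the new value."""
    best_next = max(q[next_state])
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


def choose_action(q, state, epsilon, rng=random):
    """Epsilon-greedy action selection (exploration is used during training only)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q[state]))
    row = q[state]
    return max(range(len(row)), key=row.__getitem__)


def greedy_policy(q):
    """Collapse a Q-table into one action per state, i.e. a plain lookup table."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


if __name__ == "__main__":
    # Hypothetical toy setup: 3 states, reward is an assumed negative energy cost.
    q_table = [[0.0] * N_ACTIONS for _ in range(3)]
    action = choose_action(q_table, state=0, epsilon=EPS_FINAL)
    q_update(q_table, state=0, action=action, reward=-1.0, next_state=1)
    print(greedy_policy(q_table))
```

After training, only the output of `greedy_policy` would need to be kept, which is consistent with the paper's statement that the learned policy is stored as a small flash lookup table.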

Table 3. Deployment and compiler configuration used for all MCU measurements.

Item	Setting
MCU board	STM32H743ZI, ARM Cortex-M7
Clock source	Internal PLL
Runtime library	TensorFlow Lite Micro with CMSIS-NN kernels
Compiler	ARM GCC 12.3
Optimization flags	-O3, -mcpu=cortex-m7, -mthumb, -mfpu=fpv5-d16, -mfloat-abi=hard
Linker flags	--gc-sections with function and data section removal
Energy tool	Nordic Power Profiler Kit II
Sampling rate	1 kHz
Warmup runs	100
Measured inference runs	1000

Table 4. DVFS operating points used by the DVFS-RL controller on the STM32H743ZI board.

Action ID	Frequency	Supply Voltage
0	120 MHz	0.90 V
1	240 MHz	1.00 V
2	360 MHz	1.10 V
3	480 MHz	1.20 V
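
To give a feel for why the lower operating points in Table 4 save energy, the sketch below applies the usual first-order CMOS model, in which dynamic power scales with $V^2 f$ and dynamic energy per clock cycle scales with $V^2$. This model is an assumption added here for illustration; it ignores leakage and regulator losses, and the paper does not state that DVFS-RL uses it. The function names are hypothetical.

```python
# (frequency in MHz, supply voltage in V) for each action ID, copied from Table 4.
OPERATING_POINTS = {
    0: (120, 0.90),
    1: (240, 1.00),
    2: (360, 1.10),
    3: (480, 1.20),
}


def relative_dynamic_power(action, ref=3):
    """Dynamic power relative to the reference point, assuming P ~ V^2 * f."""
    f, v = OPERATING_POINTS[action]
    f_ref, v_ref = OPERATING_POINTS[ref]
    return (f / f_ref) * (v / v_ref) ** 2


def relative_energy_per_cycle(action, ref=3):
    """Dynamic energy per clock cycle relative to the reference, assuming E ~ V^2."""
    _, v = OPERATING_POINTS[action]
    _, v_ref = OPERATING_POINTS[ref]
    return (v / v_ref) ** 2


if __name__ == "__main__":
    for action in sorted(OPERATING_POINTS):
        f, v = OPERATING_POINTS[action]
        print(
            f"action {action}: {f} MHz @ {v:.2f} V  "
            f"power x{relative_dynamic_power(action):.3f}  "
            f"energy/cycle x{relative_energy_per_cycle(action):.3f}"
        )
```

Under this simplified model, the 120 MHz / 0.90 V point spends about 56% of the dynamic energy per cycle of the 480 MHz / 1.20 V point, but takes four times as long for the same cycle count, which is why a deadline-aware policy is needed rather than always picking the lowest point.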

Table 5. Accuracy and on-device resource usage comparison on STM32H743ZI. IC = image classification (CIFAR-10 top-1 %), KWS = keyword spotting (top-1 %), VWW = visual wake words (acc. %), AD = anomaly detection (AUC × 100). Peak SRAM and model size are in kB. Energy is in μJ per inference. MACs are counted for the VWW inference task, except for DS-CNN, which runs KWS only, so its MACs are measured on that task (‡). Best result in each column is bold; second best is underlined. † The model exceeds the 320 kB soft SRAM limit and requires patch-based tiled inference (measured with tiling enabled). The CLRO does not carry a † because ALTS internal scheduling keeps peak live SRAM at 174 kB without patch-based inference.

Methods	IC	KWS	VWW	AD	Model Size (kB)	Peak SRAM (kB)	MACs (M)	Energy (μJ)	Latency (ms)
MobileNetV2 [10]	84.3	91.2	83.6	85.1	842	398†	314	1847	187
ProxylessNAS-MCU [29]	85.1	90.8	84.2	84.7	761	351†	14	1614	163
DS-CNN (L) [28]	—	94.1	—	—	234	142	5‡	612	62
MCUNet [10]	88.4	93.6	87.1	86.9	418	286	6.4	923	91
MicroNets [11]	87.9	94.8	86.3	87.5	389	261	5.9	844	84
MPAD + ALTS, fixed freq. (ours)	90.8	—	89.1	90.4	198	174	3.8	542	45
CLRO (ours)	91.7	95.4	89.6	91.3	198	174	3.8	387	38

Table 6. Statistical stability of the final CLRO deployment over five independent runs. Energy and latency are measured over 1000 repeated inferences per run.

Metric	Mean	Standard Deviation
Image classification accuracy	91.7%	0.3%
Keyword spotting accuracy	95.4%	0.2%
Visual wake words accuracy	89.6%	0.3%
Anomaly detection AUC	0.913	0.004
Energy per inference	387 μJ	9 μJ
Latency per inference	38 ms	1.1 ms

Table 7. Ablation study on STM32H743ZI. Each row adds one CLRO stage to the previous setting. Accuracy is reported for image classification. Energy and latency are measured per inference.

Configuration	Accuracy (%)	Flash (kB)	Peak SRAM (kB)	Latency (ms)	Energy (μJ)
INT8 baseline without CLRO	87.9	418	286	91	923
MPAD only	90.1	198	248	70	701
MPAD + ALTS, fixed frequency	90.8	198	174	45	542
Full CLRO, MPAD + ALTS + DVFS-RL	91.7	198	174	38	387
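
The headline reductions quoted in the abstract follow directly from the first and last rows of Table 7. The short script below recomputes them, along with the incremental contribution of each stage. It is a worked-arithmetic sketch that uses only values from the table; the helper name `pct_reduction` is introduced here for illustration.

```python
# (configuration, peak SRAM in kB, latency in ms, energy in uJ), copied from Table 7.
ABLATION = [
    ("INT8 baseline without CLRO", 286, 91, 923),
    ("MPAD only", 248, 70, 701),
    ("MPAD + ALTS, fixed frequency", 174, 45, 542),
    ("Full CLRO", 174, 38, 387),
]


def pct_reduction(before, after):
    """Percentage reduction when going from `before` to `after`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    _, base_sram, base_lat, base_energy = ABLATION[0]
    previous = ABLATION[0]
    for row in ABLATION[1:]:
        name, sram, lat, energy = row
        print(
            f"{name}: energy -{pct_reduction(base_energy, energy):.1f}% vs baseline "
            f"(-{pct_reduction(previous[3], energy):.1f}% vs previous row), "
            f"SRAM -{pct_reduction(base_sram, sram):.1f}%, "
            f"latency -{pct_reduction(base_lat, lat):.1f}%"
        )
        previous = row
```

For example, `pct_reduction(923, 387)` gives roughly 58.1% and `pct_reduction(286, 174)` gives roughly 39%, matching the abstract, while the DVFS-RL stage alone accounts for about a 28.6% energy reduction relative to the fixed-frequency MPAD + ALTS row.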

Table 8. Comparison of DVFS-RL with simple DVFS governors using the same MPAD and ALTS model.

Power Policy	Energy (μJ)	Latency (ms)	Deadline Met
Fixed high frequency	542	45	Yes
Race-to-idle	496	42	Yes
Utilization threshold governor	408	39	Yes
Deadline slack governor	421	40	Yes
DVFS-RL	387	38	Yes


Share and Cite

MDPI and ACS Style

Alanazi, A.G.; Alanazi, H.A.; Albalawi, N.S. Cross-Layer Resource Optimization for Ultra-Low-Power TinyML Inference on ARM Cortex-M Microcontrollers. Electronics 2026, 15, 2918. https://doi.org/10.3390/electronics15132918

