1. Introduction
Battery-powered Internet of Things (IoT) nodes are now used in settings where sending raw sensor data to a remote server is not practical. Industrial fault monitoring, wildlife tracking, wearable health sensing, and smart agriculture are a few examples where latency, bandwidth cost, and data privacy all push the compute to the device itself. ARM Cortex-M microcontrollers dominate this space because they are cheap and draw only a few milliwatts, but their on-chip resources are tightly bounded: clock speeds sit between 100 MHz and 480 MHz, flash rarely exceeds 2 MB, and static random-access memory (SRAM) is often below 512 kB [1, 2]. Fitting a useful neural network inside these limits is already hard. Running it repeatedly on a coin cell or a small energy harvester makes the energy per inference a first-order constraint, not an afterthought. TinyML, the practice of compressing and deploying deep learning models on microcontrollers, has grown quickly as a field to address exactly this need [3, 4]. Quantization, pruning, and knowledge distillation are now standard tools [5, 6, 7], and frameworks such as TensorFlow Lite Micro (TFLM) and the Cortex Microcontroller Software Interface Standard Neural Network (CMSIS-NN) library provide the integer-arithmetic kernels that make INT8 inference feasible on Cortex-M hardware [8, 9]. System-level work on MCUNet showed that jointly searching the network architecture and the memory schedule in a co-design loop can cut both SRAM and flash use compared with quantized MobileNetV2 while reaching over 70% ImageNet top-1 accuracy on a commercial microcontroller unit (MCU) [10]. MicroNets pushed this further via a differentiable neural architecture search (NAS) that targets microcontroller memory profiles directly [11]. These advances show that model design and memory layout can be co-optimized, but the energy consumed per inference, which depends on the processor’s voltage–frequency state during each layer’s execution, is still left to a fixed hardware governor that has no knowledge of the model structure.
The gap between what today’s methods achieve and what battery life actually demands becomes clear when the numbers are put together. MCUNet on an STM32H743ZI running a visual wake words task uses 286 kB peak SRAM and spends about 923 µJ per inference. MicroNets is slightly better on both fronts but still draws 844 µJ. A 230 mAh coin cell at 3 V, a common budget for a sealed IoT node, stores about 0.69 Wh, or roughly 2484 J. At one inference per second, 923 µJ per inference therefore gives a theoretical compute-only life of roughly 748 h; a 58% cut would extend that to roughly 1780 h before idle and other losses are even counted. The reason existing work could not close this gap is structural: compression policy, activation lifetime scheduling, and runtime voltage–frequency control are designed and applied as three separate, sequential steps with no feedback between them. A pruning ratio that looks fine for flash usage may push a certain layer’s activation tensor wide enough to spike SRAM at that point, and the elevated SRAM pressure changes which execution order is optimal, which, in turn, changes the multiply–accumulate (MAC) utilization profile that a Dynamic Voltage and Frequency Scaling (DVFS) controller should act on. Solving each sub-problem independently misses savings that are only visible when the three decisions are coupled [12, 13, 14, 15]. To the best of our knowledge, no prior work has built a single pipeline where sensitivity-guided mixed-precision pruning, activation lifetime-aware tensor scheduling, and a reinforcement learning DVFS agent are connected so that each stage takes the output of the previous stage as its direct input.
This paper proposes the Cross-Layer Resource Optimizer (CLRO), a three-stage framework that closes this feedback gap on the STM32H743ZI Cortex-M7 target using the full MLPerf Tiny benchmark suite [16]. Stage 1 (MPAD, Mixed-Precision Aware Pruning and Distillation) assigns per-layer bit widths and pruning ratios from calibration-set sensitivity scores. Stage 2 (ALTS, the Activation Lifetime-Aware Tensor Scheduler) runs a greedy depth-first execution-order search on the compressed graph to minimize peak live SRAM. Stage 3 (DVFS-RL, Reinforcement Learning-Based Dynamic Voltage and Frequency Scaling) trains a tabular Q-learning agent whose state vector (MAC utilization, deadline slack, and remaining energy budget) is built from the compressed, scheduled graph produced by stages 1 and 2. The resulting 2 kB Q-table is stored in flash and adds only a table read at inference time. On all four MLPerf Tiny tasks, the CLRO achieves 91.7% image classification accuracy, 95.4% keyword-spotting accuracy, 89.6% visual wake words accuracy, and a 0.913 anomaly detection area under the ROC curve (AUC) while using only 198 kB flash, 174 kB peak SRAM, 387 µJ per inference, and 38 ms latency, a combination that no single-technique baseline in the literature matches simultaneously.
For clarity, the main abbreviations used in this paper are defined here: Cross-Layer Resource Optimizer (CLRO) refers to the full proposed pipeline. Mixed-Precision Aware Pruning and Distillation (MPAD) is the model compression stage. Activation Lifetime-Aware Tensor Scheduler (ALTS) is the memory scheduling stage. Reinforcement Learning-Based Dynamic Voltage and Frequency Scaling (DVFS-RL) is the runtime power control stage. Static random-access memory (SRAM), multiply–accumulate (MAC), and Dynamic Voltage and Frequency Scaling (DVFS) are used with these meanings throughout the manuscript. The key contributions of this work are as follows:
1. MPAD (Mixed-Precision Aware Pruning and Distillation): A per-layer sensitivity score computed on a 512-sample calibration set drives both the bit width assignment (INT8 or INT4) and the pruning ratio, so layers that matter more to accuracy keep more capacity while flash savings are concentrated where they cost least.
2. ALTS (Activation Lifetime-Aware Tensor Scheduler): A greedy depth-first heuristic searches the execution-order space of the compressed graph for a permutation with low peak live SRAM, reducing the peak from 286 kB (MCUNet baseline) to 174 kB on the STM32H743ZI, a 39% cut, without changing any weights.
3. DVFS-RL (Q-Learning Voltage–Frequency Controller): A tabular agent trained offline on the compressed, scheduled graph learns a power state policy that cuts dynamic energy by 58.1% (923 µJ to 387 µJ) while keeping inference within the 50 ms deadline. The final lookup table fits in 2 kB of flash.
4. Cross-layer feedback coupling: MPAD, the ALTS, and DVFS-RL are connected in a pipeline where each stage uses the output of the previous one. An ablation study confirms that removing any one stage degrades both accuracy and energy, showing the value of the coupling.
5. Systematic evaluation on MLPerf Tiny: The CLRO is benchmarked on all four tasks (image classification, keyword spotting, visual wake words, anomaly detection) against five published baselines on real STM32H743ZI hardware with energy measured by a Nordic Power Profiler Kit II at 1 kHz, making the results directly reproducible [8, 16].
The rest of the paper is organized as follows:
Section 2 reviews recent work on TinyML compression, memory scheduling, and runtime power control.
Section 3 defines the joint optimization problem and explains why flash, SRAM, latency, and energy must be handled together.
Section 4 describes the MLPerf Tiny datasets and preprocessing steps used in the experiments.
Section 5 presents the proposed CLRO framework in three connected stages.
Section 6 reports the experimental results, ablation study, and comparison with existing baselines.
Section 7 closes the paper and gives the main future research directions.
3. Problem Statement
IoT sensor nodes built around Cortex-M-class microcontrollers face three hard resource walls at the same time: flash capacity in the low hundreds of kilobytes, an SRAM that often sits below 512 kB, and a power budget that is set by the discharge curve of a coin cell or a small energy harvester [1, 16]. Running a neural network on such a device is not a single optimization problem; it is at least three coupled sub-problems that interact in ways that make independent solutions sub-optimal.
The first sub-problem concerns model size and accuracy. Quantization and pruning can shrink a network to fit in flash, but the specific choice of bit width and sparsity per layer changes which activation tensors are large and which are small, and therefore how much SRAM the inference pass actually needs. A uniform INT8 policy ignores this coupling; a model that fits in flash may still overflow the SRAM at the layer with the widest feature map [13, 14].
The second sub-problem concerns the peak activation memory. Even after a model is compressed, the order in which its layers execute determines how many activation tensors must coexist in the SRAM at any one moment. Reordering can reduce this peak significantly [14], but the optimal execution schedule depends on the shape of the compressed graph, so memory scheduling must be solved after compression, not independently.
The third sub-problem concerns runtime energy. Dynamic power on a Cortex-M7 scales with the square of the supply voltage and linearly with frequency. A processor running at the full 480 MHz clock to meet an inference deadline wastes energy whenever the layer being executed is memory-bound rather than compute-bound. DVFS can recover this waste, but only if the control policy knows the MAC utilization profile of the compressed, scheduled model, information that is unavailable unless compression and scheduling have already been resolved [12, 15].
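The voltage and frequency dependence described above is the standard CMOS relation, dynamic power ≈ C_eff · V² · f. The following minimal Python sketch illustrates why a lower operating point saves energy on a memory-bound layer. The effective capacitance, the two operating points, and the 2 ms layer duration are hypothetical values chosen for illustration, not measurements from this work, and the sketch assumes that the duration of a memory-bound layer does not change with the core clock.

```python
def dynamic_power_mw(c_eff_nf, volts, freq_mhz):
    """Dynamic power in mW from P = C_eff * V^2 * f.

    nF * V^2 * MHz = 1e-9 * 1e6 W = 1e-3 W, so the product is already in mW.
    """
    return c_eff_nf * volts ** 2 * freq_mhz


def layer_energy_uj(c_eff_nf, volts, freq_mhz, duration_ms):
    """Energy in microjoules (mW * ms = uJ) for one layer."""
    return dynamic_power_mw(c_eff_nf, volts, freq_mhz) * duration_ms


# Hypothetical values: 0.1 nF effective capacitance, 2 ms memory-bound layer.
C_EFF_NF = 0.1
DURATION_MS = 2.0

fast = layer_energy_uj(C_EFF_NF, 1.2, 480.0, DURATION_MS)  # full clock
slow = layer_energy_uj(C_EFF_NF, 1.0, 240.0, DURATION_MS)  # scaled-down point

print(f"full clock : {fast:.2f} uJ")
print(f"scaled down: {slow:.2f} uJ")
print(f"ratio      : {slow / fast:.3f}")
```

Under these assumptions the scaled-down point uses about a third of the energy, because both the V² term and the f term shrink while the layer takes the same time. For a compute-bound layer the duration would grow as the clock drops, so only the V² term would help.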
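Formally, let $f$ denote a neural network with $L$ layers, let $\mathbf{b} = (b_1, \dots, b_L)$ and $\mathbf{p} = (p_1, \dots, p_L)$ be the per-layer bit widths and pruning ratios, let $\pi$ be an execution permutation of the layers, and let $\mathbf{a} = (a_1, a_2, \dots)$ be a sequence of DVFS actions, where $a_k$ is taken at inference step $k$. The joint objective is as follows:

$$\min_{\mathbf{b},\,\mathbf{p},\,\pi,\,\mathbf{a}} \; E(f, \mathbf{b}, \mathbf{p}, \pi, \mathbf{a}) \tag{1}$$

subject to

$$A(f, \mathbf{b}, \mathbf{p}) \ge A_{\min}, \tag{2}$$

$$\max_{t} M_{\text{live}}(t; \pi) \le M_{\text{SRAM}}, \tag{3}$$

$$\sum_{l=1}^{L} \left[ W_l(b_l, p_l) + Q_l(b_l) \right] \le M_{\text{flash}}, \tag{4}$$

$$T(f, \mathbf{b}, \mathbf{p}, \pi, \mathbf{a}) \le T_{\text{dl}}, \tag{5}$$

where $E$ is the per-inference energy consumption; $A(f, \mathbf{b}, \mathbf{p})$ is the task accuracy of model $f$ under bit widths $\mathbf{b}$ and pruning ratios $\mathbf{p}$; and $A_{\min}$ is the minimum acceptable accuracy threshold. $M_{\text{live}}(t; \pi)$ is the live SRAM footprint at execution step $t$; $W_l$ is the weight storage cost of layer $l$; $Q_l$ is the per-layer quantization parameter overhead, specifically, the per-tensor scale factor and zero-point offset each stored as a 32-bit value, whose size depends on the assigned bit width $b_l$; and $T$ is the total inference latency. $M_{\text{SRAM}} = 512$ kB and $M_{\text{flash}} = 2$ MB are the hardware limits of the target STM32H743ZI, and $T_{\text{dl}}$ is the application deadline (50 ms for keyword spotting in this work).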
Problem (1)–(5) is NP-hard in general because the execution-order search alone is NP-complete [14] and the joint compression–scheduling–DVFS space is exponential. Existing works attack each sub-problem in isolation, leaving cross-layer coupling unexploited.
The CLRO breaks this joint problem into three sequential stages, where each stage takes the output of the one before it. MPAD handles constraints (2) and (4): it assigns per-layer bit widths and pruning ratios using calibration-set sensitivity scores so the compressed model fits in flash while staying above the minimum accuracy threshold $A_{\min}$. The ALTS then takes the compressed graph and searches for an execution order that satisfies constraint (3) via a greedy depth-first search over the valid layer permutations. DVFS-RL uses the MAC utilization profile of the compressed, scheduled model to train a tabular Q-learning agent offline, and the learned voltage–frequency policy minimizes the per-inference energy $E$ subject to the deadline in constraint (5) at runtime.
4. Dataset and Preprocessing
This work uses the MLPerf Tiny benchmark suite [16] as the primary evaluation dataset (see Figure 1). It was released by MLCommons and is openly available under the Apache 2.0 license [16]. The suite covers four tasks that are representative of real-world ultra-low-power IoT workloads: image classification on CIFAR-10, keyword spotting (KWS) with the Google Speech Commands corpus [25], visual wake words (VWW) derived from MS-COCO, and anomaly detection using the ToyADMOS/MIMII industrial sound dataset [26]. Each task was picked because it stresses a different part of the memory–compute–energy trade-off space that MCUs face in practice.
For the image classification task, each $32 \times 32$ RGB image from CIFAR-10 is normalized channel-wise. Given a raw pixel value $x_c$ in channel $c$, the normalized value is

$$\hat{x}_c = \frac{x_c - \mu_c}{\sigma_c + \epsilon}, \tag{6}$$

where $\mu_c$ and $\sigma_c$ are the per-channel mean and standard deviation computed over the full training split, and $\epsilon$ prevents division by zero. This keeps the activation range bounded, which matters when the model is later quantized to eight-bit integers for MCU execution.
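The channel-wise normalization can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the statistics are the population mean and standard deviation per channel and that the small constant is added to the standard deviation in the denominator; the three-pixel "training split" and the value of `eps` are hypothetical.

```python
from statistics import fmean, pstdev


def channel_stats(pixels):
    """Per-channel mean and population std over a list of (R, G, B) pixels."""
    channels = list(zip(*pixels))
    return [fmean(c) for c in channels], [pstdev(c) for c in channels]


def normalize_pixel(pixel, mean, std, eps=1e-7):
    """Apply (x - mean) / (std + eps) to each channel of one pixel."""
    return tuple((x - m) / (s + eps) for x, m, s in zip(pixel, mean, std))


# Toy "training split": the blue channel is constant, so its std is zero
# and only eps keeps the division well defined.
train_pixels = [(0, 10, 20), (100, 30, 20), (200, 50, 20)]
mean, std = channel_stats(train_pixels)
normalized = [normalize_pixel(p, mean, std) for p in train_pixels]

for raw, out in zip(train_pixels, normalized):
    print(raw, "->", tuple(round(v, 4) for v in out))
```

The constant blue channel shows the role of the small constant: without it, the zero standard deviation would raise a division error.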
The KWS task takes raw 16 kHz mono audio and converts it to a log-Mel spectrogram. For a short-time Fourier transform (STFT) frame of length $N$ with a Mel filterbank of $M$ filters, the $m$-th Mel energy at time frame $t$ is

$$Y_m(t) = \log\!\left( \sum_{k=0}^{N/2} H_m(k)\, \left| X(t, k) \right|^2 + \delta \right), \tag{7}$$

where $X(t, k)$ is the STFT coefficient at frame $t$ and bin $k$, $H_m(k)$ is the $m$-th triangular Mel filter response, and $\delta$ is a floor term that stabilizes the logarithm. The output is a two-dimensional feature map (time frames by Mel filters) fed to the DS-CNN reference model. This formulation follows the feature extraction used in [25].
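As a small illustration of the log-Mel step, the sketch below computes the Mel energies of one STFT frame from its power bins and a filterbank matrix. The four-bin frame, the two-filter bank, and the floor value are hypothetical toy inputs; a real front end would first compute the STFT and build triangular filters on the Mel scale.

```python
import math


def log_mel_frame(power_bins, filterbank, delta=1e-6):
    """Log-Mel energies for one STFT frame.

    power_bins: |X(t, k)|^2 for each frequency bin k
    filterbank: one row of filter weights H_m(k) per Mel filter m
    delta:      floor that keeps log() finite on silent frames
    """
    return [
        math.log(sum(h * p for h, p in zip(row, power_bins)) + delta)
        for row in filterbank
    ]


# Toy frame with four frequency bins and two overlapping filters.
power = [1.0, 4.0, 9.0, 0.0]
bank = [
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.5],
]

features = log_mel_frame(power, bank)
silent = log_mel_frame([0.0, 0.0, 0.0, 0.0], bank)

print([round(v, 4) for v in features])
print([round(v, 4) for v in silent])
```

The silent frame shows why the floor term matters: every filter output is zero, and the logarithm stays finite only because of `delta`.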
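For anomaly detection, each audio clip from ToyADMOS is converted to a log-power spectrum. Let $\mathbf{P}$ be the raw power spectrogram. The input feature vector $\mathbf{x}$ is obtained by

$$\mathbf{x} = \mathrm{vec}\!\left( \log \mathbf{P} \right) \mathbf{W}_{\text{pool}}, \tag{8}$$

where $\mathrm{vec}(\cdot)$ flattens the matrix into a row vector and $\mathbf{W}_{\text{pool}}$ is a learned average-pooling projection that reduces the dimension to $d$ before the autoencoder input layer.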
Across all four tasks, the final preprocessing step is INT8 post-training quantization (PTQ). A floating-point tensor $z$ is mapped to an eight-bit integer $z_q$ as

$$z_q = \mathrm{clip}\!\left( \mathrm{round}\!\left( \frac{z}{s} \right) + z_0,\; -128,\; 127 \right), \tag{9}$$

where $s$ is the per-tensor scale factor derived from the observed activation range on a small calibration set, $z_0$ is the zero-point offset, $\mathrm{round}(\cdot)$ denotes rounding to the nearest integer, and $\mathrm{clip}(\cdot)$ saturates values outside the representable range [5]. This step cuts memory and multiply–accumulate (MAC) cost by roughly $4\times$ compared with FP32, which is a hard requirement for MCUs with 256 kB to 512 kB of SRAM.
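The affine INT8 mapping can be illustrated with a short, self-contained sketch. The calibration rule used here (an asymmetric range that always includes zero, mapped onto [-128, 127]) is a common convention and an assumption on our part, not a detail stated in the text; the sample values are hypothetical. Note that Python's built-in `round` rounds exact halves to the nearest even integer.

```python
def calibrate(values):
    """Per-tensor scale and zero point from an observed value range.

    Assumes the range is non-degenerate (max > min after including zero).
    """
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0
    zero_point = round(-128 - lo / scale)
    return scale, zero_point


def quantize_int8(z, scale, zero_point):
    """Round to the nearest step, shift by the zero point, saturate to INT8."""
    q = round(z / scale) + zero_point
    return max(-128, min(127, q))


def dequantize_int8(q, scale, zero_point):
    """Map an INT8 code back to its approximate real value."""
    return scale * (q - zero_point)


calibration_values = [-1.0, 0.0, 0.5, 3.0]  # hypothetical activations
scale, zero_point = calibrate(calibration_values)

for z in [-1.0, 0.0, 0.5, 3.0, 10.0]:
    q = quantize_int8(z, scale, zero_point)
    print(f"{z:6.2f} -> {q:5d} -> {dequantize_int8(q, scale, zero_point):7.4f}")
```

The last input (10.0) lies outside the calibrated range and saturates at 127, which is the clipping behaviour described above.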
5. Proposed Cross-Layer Resource Optimization Framework
Most TinyML deployment work treats model compression, memory scheduling, and power management as separate problems solved one after the other. This split approach leaves a large gap: a model that is lean in parameter count can still overflow the SRAM if its intermediate activation tensors are not tiled carefully, and even a perfectly tiled model can drain a battery fast if the MCU clock and voltage are not tuned to match the workload in real time. Our proposed framework, called the Cross-Layer Resource Optimizer (CLRO), closes that gap by coupling three optimization levels—model, memory, and power—into a single joint feedback loop.
The CLRO framework is described in the same order in which it is executed on the deployment pipeline. MPAD first compresses the model while protecting sensitive layers. The ALTS then uses the compressed graph to reduce the peak live activation memory. DVFS-RL finally uses the scheduled layer profile to select the voltage and frequency state during inference. This order is important because each stage depends on the output of the stage before it; therefore, the model, memory, and power decisions are not treated as separate steps.
The three CLRO stages are not independent blocks: MPAD first produces a compressed graph. This graph contains the remaining channels, pruning masks, assigned bit widths, and updated tensor shapes. The ALTS uses this graph to compute activation lifetimes and to choose an execution order with low peak live SRAM. After scheduling, each layer has a known execution position, tensor size, working buffer size, and MAC count. DVFS-RL uses this scheduled layer profile to build its state representation. Therefore, the power controller does not learn from the original model; it learns from the final compressed and scheduled model that will actually run on the MCU.
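In the notation used below, the compressed graph is written as $G_c = (\mathcal{V}_c, \mathcal{W}_c, \mathbf{b}, \mathbf{p}, \mathcal{Z}_c)$, where $\mathcal{V}_c$ is the set of remaining layers or channels, $\mathcal{W}_c$ is the compressed weight set, $\mathbf{b}$ is the bit width vector, $\mathbf{p}$ is the pruning ratio vector, and $\mathcal{Z}_c$ is the updated tensor-size set. The DVFS-RL state, built from $G_c$ and the execution order chosen by the ALTS, contains mean MAC utilization, deadline slack, and the remaining energy budget. This mapping makes the DVFS decision dependent on the real workload after MPAD and the ALTS.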
5.1. Layer 1—Mixed-Precision Aware Pruning and Distillation (MPAD)
The first CLRO layer reduces the model before it ever reaches the MCU. A plain magnitude-based pruning criterion treats every layer the same, but not every layer contributes equally to the final error. MPAD assigns a sensitivity score $S_l$ to layer $l$ based on how much the task loss changes when that layer is fully zeroed out:

$$S_l = \frac{1}{|\mathcal{D}_c|} \sum_{(x, y) \in \mathcal{D}_c} \left| \mathcal{L}\big(f_{\setminus l}(x), y\big) - \mathcal{L}\big(f(x), y\big) \right|, \tag{13}$$

where $f_{\setminus l}$ is the network with all weights in layer $l$ set to zero, $\mathcal{D}_c$ is a small calibration subset (512 samples in our setup), and $\mathcal{L}$ is the task-specific loss. A layer with high $S_l$ gets a lower pruning ratio $p_l$, while a layer with low $S_l$ is aggressively pruned. The per-layer pruning ratio is set as

$$p_l = p_0 \left( 1 - \frac{S_l}{\max_j S_j} \right)^{\beta}, \tag{14}$$

where $p_0$ is a global sparsity target, $\beta$ is a sharpness parameter (set to 2.5 in our experiments), and $j$ indexes all layers so that $\max_j S_j$ is the highest sensitivity score across the network. Layers scoring near the maximum sensitivity receive $p_l \approx 0$, which keeps most of their weights intact. After structured pruning, the surviving model is quantized with a mixed-precision scheme: weights in high-sensitivity layers are kept at 8-bit, while low-sensitivity layers are pushed to 4-bit. The bit width assignment $b_l$ follows

$$b_l = \begin{cases} 8, & S_l \ge \tau, \\ 4, & S_l < \tau, \end{cases} \tag{15}$$

where $\tau$ is a threshold set to the median sensitivity across all layers. A knowledge distillation loss from a full-precision teacher then compensates for the accuracy drop [6]:

$$\mathcal{L}_{\text{KD}} = (1 - \alpha)\, \mathcal{L}_{\text{CE}}(y, \hat{y}) + \alpha\, T^2\, \mathrm{KL}\big( \sigma(z_t / T) \,\|\, \sigma(z_s / T) \big), \tag{16}$$

where $z_t$ and $z_s$ are the logits of the teacher and student networks, respectively; $\sigma(\cdot)$ is the softmax function; $T$ is the distillation temperature (set to 4); $\alpha$ balances hard-label and soft-label objectives; $\mathcal{L}_{\text{CE}}$ is the standard cross-entropy loss between the ground-truth label $y$ and the student prediction $\hat{y}$; and $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback–Leibler divergence between the softened teacher and student output distributions.
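To make the mapping from sensitivity scores to compression settings concrete, the sketch below computes pruning ratios and bit widths for four hypothetical layers. It assumes a power-law pruning rule, p_l = p_0 · (1 − S_l / max_j S_j)^β, which is one form consistent with the description above (the most sensitive layer is not pruned, the least sensitive layers approach the base target); the exact functional form and the sample scores are assumptions for illustration. The bit width rule keeps 8 bits at or above the median score and 4 bits below it.

```python
from statistics import median


def pruning_ratios(scores, p0=0.5, beta=2.5):
    """Assumed rule: p_l = p0 * (1 - S_l / max S) ** beta.

    Requires at least one strictly positive score.
    """
    s_max = max(scores)
    return [p0 * (1.0 - s / s_max) ** beta for s in scores]


def bit_widths(scores):
    """8-bit at or above the median sensitivity, 4-bit below it."""
    tau = median(scores)
    return [8 if s >= tau else 4 for s in scores]


# Hypothetical per-layer sensitivity scores (loss change when zeroed).
scores = [0.9, 0.1, 0.45, 0.3]

ratios = pruning_ratios(scores)
bits = bit_widths(scores)

for s, p, b in zip(scores, ratios, bits):
    print(f"S={s:.2f}  prune={p:.3f}  bits={b}")
```

With the base target of 0.5 and sharpness 2.5, the most sensitive layer is left unpruned and stays at 8 bits, while the least sensitive layer is pruned by roughly 37% and drops to 4 bits.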
5.2. Layer 2—Activation Lifetime-Aware Tensor Scheduler (ALTS)
After MPAD, the compressed model is passed to the ALTS. The job of the ALTS is to tile and schedule every activation tensor so that live memory at any one inference step never exceeds the MCU’s physical SRAM limit $M_{\text{SRAM}}$. Each layer $l$ produces an output tensor of size $m_l$ bytes and keeps it alive until all consumer layers have read it. The live memory at step $t$ during the execution order $\pi$ is

$$M_{\text{live}}(t; \pi) = \sum_{l=1}^{L} \mathbb{1}_l(t; \pi)\, m_l + B_{\pi(t)}, \tag{17}$$

where $\mathbb{1}_l(t; \pi)$ is true (equal to 1) when the output of layer $l$ is still needed at step $t$ under execution order $\pi$, and $B_{\pi(t)}$ is the working buffer for the currently running layer. The ALTS searches for an execution order $\pi^{*}$ that minimizes the worst-case peak:

$$\pi^{*} = \arg\min_{\pi \in \Pi} \; \max_{t} M_{\text{live}}(t; \pi). \tag{18}$$

The search space $\Pi$ is the set of all topologically valid layer orderings. Finding the optimum is NP-hard in general, so the ALTS uses a greedy depth-first heuristic guided by a tensor reuse score. Tensors are also split into tiles of size $m_{\text{tile}}$ bytes when a single activation does not fit. The tile count for layer $l$ is

$$n_l = \left\lceil \frac{m_l}{m_{\text{tile}}} \right\rceil, \tag{19}$$

where $\lceil \cdot \rceil$ denotes the ceiling function, so $n_l$ is the smallest integer number of tiles that covers the full tensor. Each tile is written to flash or re-computed on demand if the SRAM budget is exceeded. This avoids the need for external DRAM, which is not present on most bare-metal MCUs.
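The effect of execution order on peak live memory can be seen on a toy branching graph. The sketch below evaluates the peak for a given order and then picks the best valid order by exhaustive search. The brute-force search is a stand-in used only because the toy graph is tiny; it is not the greedy depth-first heuristic of the ALTS. The graph, tensor sizes, and working buffer sizes are hypothetical.

```python
from itertools import permutations


def is_valid(order, consumers):
    """True when every producer runs before all of its consumers."""
    pos = {layer: i for i, layer in enumerate(order)}
    return all(pos[p] < pos[c] for p in consumers for c in consumers[p])


def peak_live_memory(order, out_kb, consumers, work_kb):
    """Peak over all steps of (live output tensors + working buffer)."""
    pos = {layer: i for i, layer in enumerate(order)}
    peak = 0
    for step, running in enumerate(order):
        live = work_kb[running]
        for layer in order[: step + 1]:
            # An output stays live until its last consumer has run; a layer
            # without consumers is live only while it is being produced.
            last_use = max((pos[c] for c in consumers[layer]), default=pos[layer])
            if last_use >= step:
                live += out_kb[layer]
        peak = max(peak, live)
    return peak


def tile_count(tensor_kb, tile_kb):
    """Smallest number of tiles that covers the tensor (integer ceiling)."""
    return -(-tensor_kb // tile_kb)


# Toy graph: "in" feeds two branches ("a" -> "a2" and "b") that merge in "out".
consumers = {"in": ["a", "b"], "a": ["a2"], "a2": ["out"], "b": ["out"], "out": []}
out_kb = {"in": 40, "a": 100, "a2": 10, "b": 100, "out": 10}
work_kb = {name: 8 for name in out_kb}

naive = ("in", "a", "b", "a2", "out")
valid = [o for o in permutations(out_kb) if is_valid(o, consumers)]
best = min(valid, key=lambda o: peak_live_memory(o, out_kb, consumers, work_kb))

print("naive:", naive, peak_live_memory(naive, out_kb, consumers, work_kb), "kB")
print("best :", best, peak_live_memory(best, out_kb, consumers, work_kb), "kB")
print("tiles for a 100 kB tensor with 32 kB tiles:", tile_count(100, 32))
```

Finishing the first branch before starting the second lets the large intermediate tensor be released early, so the two 100 kB outputs never coexist with the input tensor. No weights change between the two orders; only tensor lifetimes do.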
5.3. Layer 3—Dynamic Voltage and Frequency Scaling via Reinforcement Learning (DVFS-RL)
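The third CLRO layer operates at runtime. It models the choice among the MCU’s voltage–frequency operating points as a finite Markov Decision Process (MDP). At each inference step $k$, the agent observes a state $s_k = (u_k, \Delta_k, e_k)$, where $u_k$ is the mean MAC utilization over the last 10 cycles, $\Delta_k$ is the deadline slack in milliseconds, and $e_k$ is the remaining energy budget from an on-chip coulomb counter. The agent picks a voltage–frequency pair $a_k = (V_k, f_k)$ from a discrete action set $\mathcal{A}$ and receives a scalar reward

$$r_k = -\lambda_1 E_{\text{dyn}} - \lambda_2\, \mathbb{1}\!\left[ T_{\text{inf}} > T_{\text{dl}} \right] + \lambda_3\, \mathbb{1}\!\left[ \hat{y} = y \right], \tag{20}$$

where $E_{\text{dyn}}$ is the dynamic energy for that inference, $T_{\text{inf}}$ is the measured inference time, $T_{\text{dl}}$ is the per-task deadline, $\mathbb{1}[\cdot]$ is the indicator function, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting coefficients set to 0.001, 5.0, and 1.0, respectively. The agent is trained offline with a tabular Q-learning update rule [27]:

$$Q(s_k, a_k) \leftarrow Q(s_k, a_k) + \eta \left[ r_k + \gamma \max_{a' \in \mathcal{A}} Q(s_{k+1}, a') - Q(s_k, a_k) \right], \tag{21}$$

where $\eta$ is the learning rate and $\gamma$ is the discount factor. Once converged, the Q-table is stored as a 2 kB lookup table in flash, so runtime overhead on the MCU is just a single table read per inference step.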
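A minimal sketch of the tabular update may help to fix ideas. It uses a dictionary as the Q-table, four action indices standing in for the operating points, and a reward that combines an energy penalty, a deadline-violation penalty, and a correctness bonus. Treating the deadline and correctness terms as 0/1 indicators is an assumption, as are the discretized state tuples and the example measurements; the weights, learning rate, and discount factor follow the settings reported in this paper.

```python
import random

ACTIONS = [0, 1, 2, 3]  # indices of four (hypothetical) voltage-frequency points


def reward(energy_uj, latency_ms, deadline_ms, correct, weights=(0.001, 5.0, 1.0)):
    """Energy penalty, deadline-violation penalty, and correctness bonus."""
    w_energy, w_deadline, w_correct = weights
    return (
        -w_energy * energy_uj
        - w_deadline * (1.0 if latency_ms > deadline_ms else 0.0)
        + w_correct * (1.0 if correct else 0.0)
    )


def select_action(q, state, epsilon, rng):
    """Epsilon-greedy choice over the discrete action set."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))


def q_update(q, state, action, r, next_state, lr=0.1, gamma=0.95):
    """One tabular Q-learning step; unseen entries default to zero."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + lr * (r + gamma * best_next - old)


rng = random.Random(0)
q_table = {}
state = (2, 1, 3)       # hypothetical bins: utilization, slack, energy budget
next_state = (1, 1, 3)

r_ok = reward(387.0, 38.0, 50.0, correct=True)    # meets the 50 ms deadline
r_late = reward(923.0, 60.0, 50.0, correct=True)  # misses it
q_update(q_table, state, 2, r_ok, next_state)

print(round(r_ok, 3), round(r_late, 3), q_table)
print("greedy action:", select_action(q_table, state, 0.0, rng))
```

At inference time only the greedy lookup is needed, which is why the frozen table can be read from flash without any on-device learning.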
5.4. Implementation Settings for Reproducibility
All CLRO parameters used in the experiments are fixed before testing. The calibration set contains 512 samples per task and is not used for final testing. For classification tasks, the samples are class-balanced. For anomaly detection, the calibration set is taken from the normal training split. The sensitivity score of each layer is computed by zeroing that layer once and measuring the change in task loss on the calibration set. Structured pruning is then applied by removing output channels with the lowest filter norm inside each layer. Fully connected layers are pruned by removing hidden units with the lowest weight norm.
The bit width search space is kept small to match Cortex-M deployment. We use 8-bit weights for high-sensitivity layers and packed 4-bit weights for low-sensitivity layers. Activations remain 8-bit because the target CMSIS-NN kernels are optimized for INT8 activation flow. The pruning base target is set to 0.5, the sensitivity sharpness factor is set to 2.5, and the bit width threshold is the median sensitivity score across all layers. The distillation temperature is 4, and the distillation weight is 0.6.
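With the temperature and weight given above, the distillation objective can be sketched in plain Python. The sketch assumes the common form in which the weight 0.6 multiplies the softened Kullback–Leibler term, the remaining 0.4 multiplies the hard-label cross-entropy, and the soft term is scaled by the squared temperature; these placement choices follow standard practice for knowledge distillation and are assumptions rather than details confirmed by the text. The logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.6):
    """(1 - alpha) * CE(hard label) + alpha * T^2 * KL(teacher || student)."""
    ce = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kl


teacher = [2.0, 0.0, 0.0]          # hypothetical teacher logits
student_match = [2.0, 0.0, 0.0]    # student agrees with the teacher
student_off = [0.0, 2.0, 0.0]      # student prefers the wrong class

loss_match = distillation_loss(student_match, teacher, label=0)
loss_off = distillation_loss(student_off, teacher, label=0)
print(round(loss_match, 4), round(loss_off, 4))
```

When the student reproduces the teacher logits, the soft term vanishes and only the weighted cross-entropy remains; a student that prefers the wrong class is penalized by both terms.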
For DVFS-RL, the action set contains four measured operating points on the STM32H743ZI board. The Q-table is trained offline for 500 episodes. The learning rate is 0.1, the discount factor is 0.95, and the exploration policy is epsilon-greedy. Epsilon starts at 0.30 and decays linearly to 0.05. The reward weights are 0.001 for dynamic energy, 5.0 for deadline violation, and 1.0 for task correctness. The energy feedback check uses the mean energy over the last 50 episodes. A new MPAD pass is triggered only when this mean value exceeds the energy budget by more than 5 percent.
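The exploration schedule and the feedback check can be written down directly from these settings. The sketch below assumes that epsilon decays linearly across all 500 episodes (the text does not say over how many episodes the decay is spread) and that the check compares the mean of the last 50 episode energies against 105% of the budget; the function names and the sample energy traces are hypothetical.

```python
def epsilon_at(episode, total_episodes=500, start=0.30, end=0.05):
    """Linear decay from start (episode 0) to end (last episode)."""
    frac = min(episode, total_episodes - 1) / (total_episodes - 1)
    return start + (end - start) * frac


def needs_mpad_rerun(episode_energies_uj, budget_uj, window=50, tolerance=0.05):
    """True when the recent mean energy exceeds the budget by more than 5%."""
    recent = episode_energies_uj[-window:]
    return sum(recent) / len(recent) > budget_uj * (1.0 + tolerance)


print(epsilon_at(0), round(epsilon_at(250), 4), round(epsilon_at(499), 4))

budget = 923.0
print(needs_mpad_rerun([400.0] * 50, budget))   # well below the budget
print(needs_mpad_rerun([960.0] * 50, budget))   # above budget, within 5%
print(needs_mpad_rerun([1000.0] * 50, budget))  # more than 5% above
```

Only the last case would trigger another MPAD pass with a tighter sparsity target.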
The three layers interact through a shared energy-error feedback signal, shown as the dashed arrow in Figure 2. If the DVFS-RL agent reports that the runtime energy consistently exceeds the energy budget $E_{\text{budget}}$, it triggers a re-run of the MPAD phase with a tighter global sparsity target $p_0$. Specifically, the re-run is triggered when the mean per-inference energy over the last $N_w = 50$ training episodes exceeds $E_{\text{budget}}$ by more than 5%. Under the current experimental setup on the STM32H743ZI, with $E_{\text{budget}}$ set to the MCUNet baseline of 923 µJ, the loop is not triggered because the DVFS-RL agent converges below this threshold by episode 400. The convergence behavior and energy trajectory are reported in Section 6. This closed loop means the system self-adjusts to different MCU boards and battery capacities without any manual re-tuning. The full procedure is listed in Algorithm 1. Its per-phase complexity is
for MPAD,
for the ALTS, and
for DVFS-RL—all run once offline on a host PC, leaving zero training overhead on the MCU itself.
| Algorithm 1 CLRO: Cross-Layer Resource Optimization Procedure |
| Require: Pre-trained FP32 model $f$, calibration set $\mathcal{D}_c$, SRAM budget $M_{\text{SRAM}}$, energy budget $E_{\text{budget}}$ |
| Ensure: Deployed INT4/INT8 model with DVFS policy on MCU |
| MPAD Phase |
1: for each layer $l = 1, \dots, L$ do
2:     Compute sensitivity $S_l$ using Equation (13)
3:     Compute pruning ratio $p_l$ using Equation (14)
4:     Assign bit width $b_l$ using Equation (15)
5:     Apply structured pruning at ratio $p_l$; quantize to $b_l$ bits
6: end for
7: Fine-tune student with distillation loss (Equation (16))
| ALTS Phase |
8: Build tensor lifetime graph from compressed model
9: Search execution order $\pi^{*}$ via greedy DFS (Equation (18))
10: for each layer $l$ in order $\pi^{*}$ do
11:     if the output tensor of layer $l$ does not fit in the available SRAM then
12:         Split into $n_l$ tiles (Equation (19)); schedule each tile
13:     end if
14:     Assign SRAM slot; free slots of completed producer layers
15: end for
16: Verify $\max_t M_{\text{live}}(t; \pi^{*}) \le M_{\text{SRAM}}$
| DVFS-RL Phase (offline training) |
17: Initialize Q-table $Q$
18: for episode $= 1$ to $N_{\text{ep}}$ do
19:     for each inference step $k$ do
20:         Observe $s_k$; select $a_k$ via $\epsilon$-greedy policy
21:         Execute $a_k$; measure $E_{\text{dyn}}$, $T_{\text{inf}}$
22:         Compute reward $r_k$ (Equation (20))
23:         Update Q-table (Equation (21))
24:     end for
25: end for
26: Compress Q-table to 2 kB flash lookup; flash to MCU
27: Return quantized, tiled model + DVFS policy table
The Q-learning run is treated as converged when the mean energy over the last 50 episodes changes by less than one percent and no deadline violation is observed in that window. In our runs, convergence usually occurs between episode 380 and episode 420. The final Q-table is then frozen and stored in flash. No Q-value update is performed during MCU inference.
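The stopping rule can be expressed as a small check over the training log. The sketch assumes that "changes by less than one percent" means the mean energy of the most recent 50-episode window differs by less than 1% from the mean of the 50 episodes before it; that reading, the function name, and the sample traces are assumptions for illustration.

```python
from statistics import fmean


def has_converged(energies_uj, deadline_missed, window=50, rel_tol=0.01):
    """Stable mean energy across two adjacent windows and no missed deadline."""
    if len(energies_uj) < 2 * window:
        return False
    previous = fmean(energies_uj[-2 * window:-window])
    current = fmean(energies_uj[-window:])
    stable = abs(current - previous) / previous < rel_tol
    return stable and not any(deadline_missed[-window:])


flat = [390.0] * 100
still_dropping = [500.0] * 50 + [390.0] * 50
no_misses = [False] * 100
one_miss = [False] * 99 + [True]

print(has_converged(flat, no_misses))            # stable, no violations
print(has_converged(still_dropping, no_misses))  # energy still changing
print(has_converged(flat, one_miss))             # a deadline was missed
```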
6. Results and Discussion
The STM32H743ZI platform is selected because it is a practical high-end Cortex-M7 MCU used in embedded sensing and industrial IoT prototypes. It provides a 480 MHz peak clock, 512 kB SRAM, and 2 MB flash, so it is large enough to run all MLPerf Tiny tasks but still small enough to expose the memory and energy limits faced by MCU-class deployment. This makes it a useful test case for the CLRO because the model must fit without external DRAM, and the energy benefit must come from on-chip optimization rather than from a larger accelerator.
All experiments run on an STM32H743ZI MCU (Cortex-M7, 480 MHz, 512 kB SRAM, 2 MB flash) under TensorFlow Lite for Microcontrollers (TFLM) [8] with CMSIS-NN acceleration. Energy is measured using a Nordic Power Profiler Kit II at a 1 kHz sample rate. Inference latency is the median over 500 back-to-back runs. Accuracy for the four MLPerf Tiny tasks [16], namely image classification (IC), keyword spotting (KWS), visual wake words (VWW), and anomaly detection (AD), is reported as top-1 accuracy, top-1 accuracy, accuracy, and area under the ROC curve (AUC), respectively. Five baselines are compared: MobileNetV2 + CMSIS-NN [10], MCUNet [10], MicroNets [11], DS-CNN (large) [28], and ProxylessNAS-MCU [29]. Our method is referred to as the CLRO throughout.
6.1. Experimental Setup and Baseline Fairness
All experiments use the same STM32H743ZI board, the same input preprocessing, and the same measurement setup. The models are trained on a host machine and then converted to integer kernels for MCU deployment. Energy and latency are measured on the board after flashing the final binary. The exact task models are listed in Table 1. Baselines are compiled with the same toolchain and the same compiler flags where source code is available. For published baselines where full training code is not available, we use the reported model structure and re-run the deployment path under the same STM32H743ZI runtime.
The training setup is kept fixed across all runs. Classification models are trained with Adam for 120 epochs using batch size 128. The initial learning rate is 0.001 and is reduced by a factor of 0.1 when validation loss stops improving for 10 epochs. The anomaly detection autoencoder is trained for 80 epochs using mean-square error loss and batch size 256. MPAD starts after the FP32 teacher model has converged. Structured pruning is applied in one pass using the sensitivity score, followed by 30 epochs of distillation fine-tuning. We also test the sensitivity of the learned controller to the main Q-learning parameters. Only one parameter is changed at a time from the default setting, while the model, schedule, DVFS action set, and measurement setup remain fixed. The results are shown in Table 2. The deployment setup used for these trained models is shown in Table 3.
The available voltage–frequency actions used by the controller are reported in Table 4. These operating points are used for both the learned DVFS policy and the heuristic DVFS baselines, so the comparison is made under the same hardware limits. The sensitivity study shows that the controller is not tied to one narrow hyperparameter choice. The default setting gives the best measured energy, but nearby values keep the same deadline behavior and remain within a small energy range. This supports the use of a small tabular controller rather than a larger policy network.
The same binary generation flow is used for the CLRO and for each baseline. The only differences are the model graph and the optimization method being tested. This keeps the runtime environment fixed, so the reported differences come from the model, memory schedule, and power policy rather than from a different compiler or measurement setup.
6.2. Accuracy and Resource Usage Across All Tasks
Table 5 lists per-task accuracy together with model size, peak SRAM usage, MACs, and single-inference energy on the target MCU. The CLRO assigns mixed-precision bit widths so layers that carry little task-relevant information are pushed to 4-bit, which cuts model storage without forcing a large accuracy drop. The ALTS keeps peak SRAM well within the 512 kB hardware limit even for the VWW task, where activation tensors at the task’s input resolution ordinarily overflow that budget.
To make the comparison fair, we include baselines that represent different TinyML deployment paths. MobileNetV2 represents a compact CNN without MCU-specific cross-layer optimization. ProxylessNAS-MCU represents architecture search for small devices. DS-CNN is included for keyword spotting because it is a common low-cost audio baseline. MCUNet and MicroNets are included because they are strong MCU-focused baselines. We also report MPAD plus the ALTS with fixed frequency. This last row is important because it separates the gain from compression and memory scheduling from the extra gain produced by DVFS-RL.
All reported accuracy values are the mean of five independent runs with different random seeds. Energy and latency are measured on the STM32H743ZI board over 1000 repeated inferences after 100 warmup runs. We report the mean value in the main comparison table. Standard deviation is reported for the final CLRO setting to show measurement stability. Flash size is deterministic after compilation, and peak SRAM is obtained from the fixed ALTS memory plan so these two values do not vary across repeated inference runs.
This comparison shows that the CLRO improves the joint resource point, not only one metric. Some baselines have good accuracy or a small MAC count, but they do not reach the same combined flash, SRAM, latency, and energy values as the full CLRO pipeline.
The CLRO reaches 91.7% top-1 on CIFAR-10 while using only 198 kB of flash storage, less than both MCUNet and MobileNetV2. The KWS result of 95.4% is 0.6 points higher than MicroNets, which was the previous best on this task under the same budget. For VWW, the CLRO reaches 89.6% with a peak SRAM of 174 kB, which is about 40% below the budget. All other models that exceed 320 kB are shown with a † to indicate they need tiling or patch-based inference. The AD task shows the largest absolute gap: the CLRO achieves an AUC of 0.913 compared to 0.875 from MicroNets, a 0.038 (3.8-point) improvement that comes from the MPAD phase retaining more filter diversity in the autoencoder bottleneck. Comparing the MPAD+ALTS fixed-frequency row against the full CLRO shows that the DVFS-RL stage alone accounts for 155 µJ of the total 536 µJ energy reduction, confirming that the cross-layer coupling with the voltage–frequency controller adds a distinct benefit beyond applying compression and scheduling alone.
The small standard deviation shows that the reported gains are stable across training seeds and repeated hardware measurements (Table 6).
6.3. Ablation Study: Contribution of Each CLRO Layer
Table 7 breaks the total gain into three parts by enabling one CLRO layer at a time. The baseline is a plain INT8-quantized MCUNet deployed without any of the three CLRO layers active. Each subsequent row adds one layer on top of the previous row.
The ablation study shows that each stage has a different role. MPAD provides the main flash reduction and also improves accuracy because sensitive layers keep more capacity. The ALTS does not change the weights, but it reduces peak SRAM by changing the execution order and tensor lifetime. DVFS-RL gives the largest runtime energy reduction because it lowers the voltage–frequency state for layers with enough deadline slack. The full pipeline gives the best result because each stage works on the output of the previous stage.
Layer 1 (MPAD) alone gives the biggest accuracy jump (+2.2 pp) because the sensitivity-guided pruning and mixed-precision assignment preserve the most task-critical filters. The ALTS layer adds a smaller accuracy gain (+0.7 pp) but contributes the most to memory efficiency, enabling the scheduler to fit the full graph in 174 kB rather than 286 kB. The row labelled “+ALTS (Layers 1–2; fixed freq.)” runs the compressed, scheduled model at a fixed maximum clock, so the 542 μJ it consumes represents the energy floor achievable without DVFS coupling. The DVFS-RL layer then cuts a further 155 μJ (28.5%) by learning which layers tolerate a lower voltage–frequency state; this gap is only visible because DVFS-RL operates on the compressed, scheduled graph rather than a model-agnostic baseline. Therefore, the three layers are complementary: accuracy mostly comes from MPAD, memory from the ALTS, and runtime energy from DVFS-RL.
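The energy decomposition quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the three energy values reported in the text (923 μJ for the INT8 MCUNet baseline, 542 μJ for MPAD+ALTS at fixed frequency, and 387 μJ for the full CLRO); the function and key names are illustrative and are not part of the CLRO implementation.

```python
# Energy values in microjoules, as quoted in the text.
E_BASELINE_UJ = 923.0    # INT8 MCUNet, no CLRO layers active
E_FIXED_FREQ_UJ = 542.0  # MPAD + ALTS at a fixed maximum clock
E_FULL_UJ = 387.0        # full CLRO, including DVFS-RL


def energy_breakdown(baseline_uj, fixed_freq_uj, full_uj):
    """Split the total energy saving into a compression/scheduling part and a DVFS part."""
    total = baseline_uj - full_uj
    return {
        "total_saving_uj": total,
        "mpad_alts_saving_uj": baseline_uj - fixed_freq_uj,
        "dvfs_saving_uj": fixed_freq_uj - full_uj,
        "total_saving_pct": 100.0 * total / baseline_uj,
    }


breakdown = energy_breakdown(E_BASELINE_UJ, E_FIXED_FREQ_UJ, E_FULL_UJ)
for key, value in breakdown.items():
    print(f"{key}: {value:.1f}")
```

The two partial savings sum to the total by construction, which is why the fixed-frequency row is a useful intermediate point for attributing the DVFS-RL contribution.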
6.4. Per-Task Accuracy Comparison
Figure 3 compares all methods across the four tasks. The CLRO achieves the highest score on every task. The gap is widest for AD (AUC +3.8 over MicroNets), which confirms that preserving bottleneck diversity in the MPAD phase matters more for anomaly scoring than for classification tasks where the final softmax layer can compensate for slight filter loss.
6.5. Energy–Accuracy Trade-Off
Figure 4 plots single-inference energy against KWS accuracy for all methods at a fixed SRAM budget of 320 kB. Methods constrained to 320 kB are compared on the same footing; those requiring tiling are excluded. The CLRO sits in the bottom-right corner (high accuracy, low energy), showing a Pareto-dominant position. MCUNet and MicroNets form the previous Pareto front, and the CLRO pushes that front outward by roughly 58% in energy at matched accuracy.
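The notion of Pareto dominance used here can be made concrete with a short sketch. A method dominates another if it is at least as accurate and uses no more energy, and is strictly better on at least one of the two. The method labels and the (accuracy, energy) pairs below are illustrative placeholders, not measured results from this study.

```python
def dominates(a, b):
    """True if point a = (accuracy, energy) Pareto-dominates point b."""
    acc_a, energy_a = a
    acc_b, energy_b = b
    no_worse = acc_a >= acc_b and energy_a <= energy_b
    strictly_better = acc_a > acc_b or energy_a < energy_b
    return no_worse and strictly_better


def pareto_front(points):
    """Return the subset of points that no other point dominates."""
    return {
        name: p
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }


# Placeholder (accuracy %, energy uJ) pairs for four hypothetical methods.
points = {
    "A": (94.0, 900.0),
    "B": (94.8, 950.0),
    "C": (95.4, 400.0),
    "D": (93.0, 850.0),
}
front = pareto_front(points)
print(sorted(front))
```

In this toy set, method C is both more accurate and cheaper than every other method, so it is the only point left on the front; removing C restores a front formed by the remaining three methods.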
6.6. Live SRAM Usage During Inference
Figure 5 shows how live SRAM evolves layer by layer during VWW inference for MCUNet and the CLRO. MCUNet’s default scheduling hits a peak of 286 kB at the first inverted residual block. The CLRO’s ALTS reorders and tiles those layers so the peak drops to 174 kB, a 39% reduction. The flat segments correspond to layers that are executed in-place, reusing the buffer of their predecessor.
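To illustrate why execution order alone can change peak SRAM, the following sketch computes the peak live-tensor footprint of a small two-branch graph under two different orders. It is a simplified model under stated assumptions: each tensor is live from the step that produces it until its last consumer has run, and in-place execution and tiling are not modeled. The graph, tensor names, and sizes are hypothetical and do not correspond to the VWW network.

```python
def peak_live_bytes(order, ops, sizes):
    """Peak sum of live tensor sizes when ops run in the given order.

    ops maps an op name to (list of input tensors, output tensor).
    sizes maps a tensor name to its size in bytes.
    """
    # Index of the last op that reads each tensor.
    last_use = {}
    for step, name in enumerate(order):
        for tensor in ops[name][0]:
            last_use[tensor] = step
    # Graph inputs are tensors that are read but never produced.
    produced = {ops[name][1] for name in order}
    live = {t for t in last_use if t not in produced}
    peak = sum(sizes[t] for t in live)
    for step, name in enumerate(order):
        live.add(ops[name][1])  # the output is allocated while inputs are still live
        peak = max(peak, sum(sizes[t] for t in live))
        # Free every tensor whose last consumer has now run; graph outputs stay live.
        live = {t for t in live if t not in last_use or last_use[t] > step}
    return peak


# Toy graph: two branches from x that are merged at the end.
ops = {
    "op1": (["x"], "t1"),
    "op2": (["t1"], "t2"),
    "op3": (["x"], "t3"),
    "op4": (["t3"], "t4"),
    "op5": (["t2", "t4"], "y"),
}
sizes = {"x": 20, "t1": 100, "t2": 10, "t3": 100, "t4": 10, "y": 10}

breadth_first = ["op1", "op3", "op2", "op4", "op5"]
branch_first = ["op1", "op2", "op3", "op4", "op5"]
print(peak_live_bytes(breadth_first, ops, sizes))
print(peak_live_bytes(branch_first, ops, sizes))
```

Finishing one branch before starting the other avoids holding both large intermediates at once, which is the kind of lifetime effect a scheduler such as the ALTS can exploit without changing any weights.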
To test whether DVFS-RL gives a real benefit beyond simple rules, we compare it with three lightweight governors using the same MPAD and ALTS output. The fixed-high governor always uses the highest operating point. The utilization governor lowers frequency when recent MAC utilization is low. The slack governor lowers frequency when measured deadline slack is above 10 ms. These policies do not use a learned value table. They are included to show whether the Q-learning policy is doing more than manual threshold selection. The measured comparison is given in Table 8.
The heuristic governors save energy compared with fixed-high execution, but they still use hand-set thresholds. DVFS-RL gives the lowest measured energy because the Q-table learns which voltage–frequency action is safe for each scheduled layer state. The gain is not from RL alone; it comes from giving the agent the compressed and scheduled layer profile produced by the MPAD and ALTS.
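The three heuristic governors can be sketched as simple threshold rules. Only the 10 ms slack threshold and the 480 MHz maximum appear in the text; the lower operating points, the utilization threshold of 0.5, and the rule of stepping back up when the condition is not met are assumptions made for illustration, as are all function names.

```python
# Hypothetical frequency ladder in MHz; only the 480 MHz maximum comes from the text.
OPERATING_POINTS_MHZ = [120, 240, 480]
TOP = len(OPERATING_POINTS_MHZ) - 1


def fixed_high_governor(level):
    """Always select the highest operating point."""
    return TOP


def utilization_governor(level, mac_utilization, low_threshold=0.5):
    """Step down when recent MAC utilization is low (threshold is assumed)."""
    if mac_utilization < low_threshold:
        return max(level - 1, 0)
    return min(level + 1, TOP)  # assumed: step back up otherwise


def slack_governor(level, slack_ms, threshold_ms=10.0):
    """Step down when measured deadline slack exceeds 10 ms."""
    if slack_ms > threshold_ms:
        return max(level - 1, 0)
    return min(level + 1, TOP)  # assumed: step back up otherwise


level = TOP
for slack_ms in (18.0, 14.0, 6.0):
    level = slack_governor(level, slack_ms)
    print(slack_ms, OPERATING_POINTS_MHZ[level])
```

Each rule reacts to a single scalar and applies the same threshold to every layer, which is the main difference from a per-state learned table.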
6.7. DVFS-RL Convergence and Runtime Energy Savings
Figure 6 tracks the mean per-inference energy during the offline Q-learning training phase (Algorithm 1, DVFS-RL step). The agent starts from an initial policy that uses the maximum operating point (480 MHz, 1.2 V) at every step, consuming around 923 μJ. By episode 400, the policy has converged to a stable voltage–frequency schedule that cuts energy to 387 μJ while still meeting the 50 ms per-inference deadline for KWS. The shaded band shows one standard deviation across five independent training runs, confirming low variance once the Q-table stabilizes.
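For readers unfamiliar with tabular Q-learning, the following is a minimal sketch of an offline training loop for a per-layer frequency policy. It is a toy under stated assumptions and not the DVFS-RL implementation of Algorithm 1: the state is (layer index, elapsed-time bucket), the reward is the negative layer energy with a fixed penalty for a deadline miss, and the operating points, layer costs, deadline, and hyperparameters are all hypothetical.

```python
import random

# Hypothetical operating points as (frequency in MHz, relative energy per Mcycle).
# Only the 480 MHz maximum comes from the text; the lower points are placeholders.
ACTIONS = [(120, 0.4), (240, 0.6), (480, 1.0)]
LAYER_MCYCLES = [2.0, 4.0, 3.0]  # toy per-layer work, in millions of cycles
DEADLINE_MS = 40.0               # toy deadline, not the 50 ms KWS deadline
MISS_PENALTY = 100.0             # assumed penalty for missing the deadline
BUCKET_MS = 5.0                  # elapsed time is discretized into 5 ms buckets


def state_of(layer, elapsed_ms):
    """Discretize (layer index, elapsed time) into a Q-table key."""
    return (layer, int(elapsed_ms // BUCKET_MS))


def step(layer, elapsed_ms, action):
    """Run one layer at the chosen point; return (layer, elapsed_ms, reward, done)."""
    freq_mhz, energy_per_mcycle = ACTIONS[action]
    elapsed_ms += LAYER_MCYCLES[layer] / freq_mhz * 1000.0
    reward = -LAYER_MCYCLES[layer] * energy_per_mcycle
    if elapsed_ms > DEADLINE_MS:
        return layer + 1, elapsed_ms, reward - MISS_PENALTY, True
    layer += 1
    return layer, elapsed_ms, reward, layer == len(LAYER_MCYCLES)


def q_update(q, s, a, reward, s_next, done, alpha=0.5, gamma=1.0):
    """Standard tabular Q-learning update."""
    best_next = 0.0 if done else max(q.get((s_next, b), 0.0) for b in range(len(ACTIONS)))
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (reward + gamma * best_next - old)


def greedy_action(q, s):
    return max(range(len(ACTIONS)), key=lambda a: q.get((s, a), 0.0))


def train(episodes=400, epsilon=0.2, seed=0):
    """Epsilon-greedy training over repeated toy inferences."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        layer, elapsed, done = 0, 0.0, False
        while not done:
            s = state_of(layer, elapsed)
            explore = rng.random() < epsilon
            a = rng.randrange(len(ACTIONS)) if explore else greedy_action(q, s)
            layer, elapsed, reward, done = step(layer, elapsed, a)
            q_update(q, s, a, reward, state_of(layer, elapsed), done)
    return q


def rollout(q):
    """Follow the greedy policy once; return (relative energy, elapsed_ms)."""
    layer, elapsed, done, energy = 0, 0.0, False, 0.0
    while not done:
        a = greedy_action(q, state_of(layer, elapsed))
        energy += LAYER_MCYCLES[layer] * ACTIONS[a][1]
        layer, elapsed, _, done = step(layer, elapsed, a)
    return energy, elapsed


q_table = train()
print(rollout(q_table))
```

The learned table is only as good as the toy cost model it was trained on; on real hardware the per-layer latency and energy would come from measurement rather than from the closed-form expressions used here.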
6.8. Closed-Loop Feedback Validation
The closed-loop mechanism in the CLRO (Figure 2) re-runs the MPAD phase with a tighter global sparsity target when the mean per-inference energy over the most recent training episodes exceeds the energy budget by more than 5%. In the current experiments on the STM32H743ZI, the energy budget is set to the MCUNet baseline value of 923 μJ. Because the DVFS-RL agent converges to 387 μJ by episode 400, which is 58% below the budget threshold, the re-run condition is never met and the loop fires zero times.
To verify that the loop works when needed, a second run is carried out with a tighter energy budget. The agent reaches 462 μJ at episode 150, which is 2.7% above the 5% trigger margin, so the loop fires once. MPAD re-runs with the global sparsity target raised from 0.5 to 0.6. After the second MPAD–ALTS–DVFS-RL cycle, the system converges to 441 μJ at the cost of a 0.4 pp accuracy drop (91.3% to 90.9%), staying within the tighter budget. This confirms that the feedback path is active and converges in one additional cycle under a more aggressive energy constraint.
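The trigger rule of the feedback loop can be written as a short check. The sketch below follows the description in this section (mean energy over a recent window compared against the budget plus a 5% margin, and a sparsity target raised in steps of 0.1, as in the move from 0.5 to 0.6). The window contents, the upper sparsity cap of 0.9, and the function names are assumptions for illustration.

```python
def should_rerun_mpad(recent_energies_uj, budget_uj, margin=0.05):
    """True if the mean recent energy exceeds the budget by more than the margin."""
    mean_energy = sum(recent_energies_uj) / len(recent_energies_uj)
    return mean_energy > budget_uj * (1.0 + margin)


def next_sparsity_target(current, step=0.1, max_sparsity=0.9):
    """Tighten the global sparsity target by one step, up to an assumed cap."""
    return min(round(current + step, 2), max_sparsity)


# First experiment in the text: converged energy of 387 uJ against a 923 uJ budget.
window = [387.0] * 10  # assumed window of converged episodes
sparsity = 0.5
if should_rerun_mpad(window, budget_uj=923.0):
    sparsity = next_sparsity_target(sparsity)
print(sparsity)
```

With the 923 μJ budget the condition is false and the sparsity target stays at 0.5, matching the observation that the loop fires zero times in the main experiments.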
6.9. Discussion
The results confirm that treating model compression, memory scheduling, and power management as three tightly coupled layers rather than independent steps produces a measurably better outcome on every metric. The main reason is that decisions made in one layer affect the feasibility and cost of the other two: a more aggressively pruned model has smaller activation tensors, which gives the ALTS more scheduling freedom, which, in turn, lets DVFS-RL pick lower-frequency steps without violating deadlines.
One point worth noting is that the energy saving (58.1%) is much larger than the accuracy improvement (3.8 pp) relative to the INT8 baseline. This was expected because the baseline already uses quantization, so there is limited accuracy left to recover but a lot of dynamic energy still tied up in unnecessary high-voltage operation. The DVFS-RL agent exploits the slack left by the ALTS’s compact schedule to run many layers at a reduced operating point.
A limitation is that the offline Q-table must be re-trained when the MCU model changes. The table is small (2 kB), so re-training takes under 10 min on a laptop, but this is still an extra step compared to fixed-voltage baselines. Future work could explore a lightweight online fine-tuning mechanism [1] that adapts the policy in the field without full re-training.
The CLRO is not tied to one MCU board. The MPAD and ALTS are hardware-aware but not board-specific. For another MCU, the user only changes the flash limit, SRAM limit, supported kernels, and tile size. DVFS-RL needs a new action set if the target board has different voltage–frequency states. On a smaller MCU, the CLRO can still run by tightening the flash and SRAM constraints, but this may increase pruning and may reduce accuracy. On a board with no DVFS support, the MPAD and ALTS stages remain usable, while the DVFS-RL stage can be replaced by fixed-frequency execution or simple clock gating.
6.10. Practical Deployment Limits
The DVFS-RL policy is trained offline, not on the MCU. In our setup, the Q-table training takes less than 10 min on a laptop for one task and one MCU action set. This cost is paid once before deployment. During inference, the MCU only performs a lookup, so there is no training cost and no online search cost. Still, the learned table should not be treated as universal. If the MCU family, clock tree, voltage levels, or workload changes, the action set and Q-table must be generated again.
The current policy is reliable when the deployed workload is close to the calibration and training workload. Large changes in input distribution can shift layer utilization and deadline slack, so the stored policy may no longer be optimal. Battery level and temperature can also change timing and power behavior. The present CLRO version handles this only through the remaining-energy state and the offline feedback loop. A safer field deployment can use a fallback fixed-frequency mode when measured latency is close to the deadline, or it can re-train the Q-table during maintenance. Online adaptive DVFS is left for future work.
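The runtime behavior described above, a table lookup with a fixed-frequency fallback, can be sketched as follows. This is a minimal illustration and not the deployed firmware: the state encoding, the guard band of 5 ms, and the choice to fall back to the highest operating point for unseen states are assumptions, and a production version on the MCU would be written in C.

```python
def select_operating_point(q_table, state, n_actions, measured_latency_ms,
                           deadline_ms, guard_ms=5.0):
    """Pick an action index from a pre-trained Q-table, with a safe fallback.

    q_table maps a state key to a list of n_actions Q-values.
    The highest index is assumed to be the highest operating point.
    """
    top = n_actions - 1
    # Fallback 1: measured latency is too close to the deadline (guard band assumed).
    if deadline_ms - measured_latency_ms < guard_ms:
        return top
    # Fallback 2: the state was never seen during offline training.
    row = q_table.get(state)
    if row is None:
        return top
    # Normal path: greedy lookup, no learning on the device.
    return max(range(n_actions), key=row.__getitem__)


# Hypothetical table with a single trained state.
q_table = {("layer3", "slack_high"): [0.2, 0.9, 0.1]}
print(select_operating_point(q_table, ("layer3", "slack_high"), 3, 30.0, 50.0))
print(select_operating_point(q_table, ("layer3", "slack_high"), 3, 48.0, 50.0))
```

Both fallbacks trade energy for safety: they give up the learned saving for that inference but keep the deadline guarantee of fixed-high execution.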
7. Conclusions
This paper presented the CLRO, a cross-layer resource optimization framework for deploying TinyML models on ultra-low-power IoT devices. The core idea was to treat model compression, memory scheduling, and power management as one joint problem rather than three separate steps. The MPAD layer assigns pruning ratios and bit widths per layer using a measured task sensitivity score, so filters that carry the most task-relevant information stay at a higher precision. The ALTS finds an execution order that keeps live SRAM within the physical hardware limit without needing external memory. The DVFS-RL agent then uses a pre-trained Q-table at runtime to pick the lowest voltage–frequency pair that still meets the inference deadline.
Tested on four MLPerf Tiny tasks on an STM32H743ZI MCU, the CLRO reached 91.7% top-1 on CIFAR-10, 95.4% on keyword spotting, 89.6% on visual wake words, and an AUC of 0.913 on anomaly detection. Against the strongest baseline (MicroNets), these numbers show up to 3.8 percentage points of accuracy gain with 58.1% lower per-inference energy and a peak SRAM of only 174 kB. The ablation study confirmed that each CLRO layer adds a distinct and measurable benefit: MPAD drives most of the accuracy gain, the ALTS cuts the memory footprint, and DVFS-RL handles runtime energy savings.
The broader value of the CLRO is that it gives a practical way to run TinyML models on battery-powered nodes without adding external memory or a separate accelerator. This is useful for smart sensing, wearable monitoring, industrial fault detection, environmental monitoring, and always-on embedded intelligence, where data must often be processed near the sensor. In these settings, small savings in SRAM and energy can directly extend battery life and reduce maintenance cost.
Future work will focus on three directions. First, the fixed offline DVFS table can be extended to an online adaptive policy that updates when battery level, temperature, or workload changes. Second, the CLRO can be ported to smaller Cortex-M0 and Cortex-M4 boards, as well as RISC-V MCU platforms, to test how the method behaves under tighter memory limits. Third, the CLRO can be combined with neural architecture search so that the network structure, compression policy, memory schedule, and power policy are optimized together from the start.